# RAG From Scratch: Overview

These notebooks walk through the process of building local RAG app(s) from scratch.

In [1]:
import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")

### Environment Setup

In [2]:
# Installing Required Packages
! pip install langchain_community tiktoken langchain ollama faiss-cpu sentence-transformers



Install Ollama on Linux

In [3]:
! curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


**Run Ollama Server in the Background**

The `nohup` (**no hangup**) command keeps the server process from stopping when the notebook cell finishes or when the session that launched it hangs up. The trailing `&` runs it in the background, and `> ollama.log 2>&1` redirects both standard output and standard error to a log file.

In [4]:
!nohup ollama serve > ollama.log 2>&1 &
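
Because the server starts in the background, the next cell may run before it is ready to accept requests. A minimal sketch of a readiness check follows, assuming the default address `127.0.0.1:11434` reported by the installer output above. The helper names `server_is_up` and `wait_for_server` are hypothetical and not part of Ollama.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP GET to base_url succeeds with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def wait_for_server(base_url="http://127.0.0.1:11434", retries=10, delay=1.0):
    """Poll the server until it responds or the retries run out."""
    for _ in range(retries):
        if server_is_up(base_url):
            return True
        time.sleep(delay)
    return False
```

If `wait_for_server()` returns `False`, `ollama.log` is the first place to look for the cause.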

### Part 1: Running Local Models with Ollama and FAISS

In [5]:
# Pull LLaMA 3.1 model using Ollama (local language model)
! ollama pull llama3.1

pulling manifest
pulling 8eeb52dfb3bb... 100% 4.7 GB
pulling 948af2743fc7... 100% 1.5 KB
pulling 0ba8f0e314b4... 100% 12 KB
pulling 56bb8bd477a5... 100% 96 B
pulling 1a4c3c319823... 100% 485 B
verifying sha256 digest
writing manifest
success


#### 1. Import necessary libraries

This section imports key libraries necessary for building a **local Retrieval-Augmented Generation (RAG) pipeline**:

1. **Ollama**: Used to load and interact with local instances of the **LLaMA** model, allowing for local language model inference.
2. **FAISS**: A highly efficient vector store for performing fast similarity searches and document retrieval in-memory.
3. **HuggingFaceEmbeddings**: Provides local text embeddings using models from Hugging Face, enabling semantic search and document vectorization.


In [6]:
import ollama
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

#### 2. Load the local LLM

This line initializes the **LLaMA 3.1** model using **Ollama**:

- **llm = Ollama(model="llama3.1")**: Creates a client for the **LLaMA 3.1** model served locally by **Ollama**. Text generation runs on the local machine, with no calls to external services or APIs.


In [7]:
llm = Ollama(model="llama3.1")

#### 3. Use local embeddings for retrieval

This line initializes the **Hugging Face embeddings** model:

- **embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")**: Loads the `all-mpnet-base-v2` model from Hugging Face's **sentence-transformers** library to generate text embeddings locally. These embeddings represent the semantic meaning of text and are essential for tasks like similarity search and document retrieval.

In [8]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


#### 4. Load your documents

This line defines a list of sample text documents for embedding and retrieval:

- **texts = ["Sample text 1", "Sample text 2", "Sample text 3"]**: This creates a list of text strings that will be embedded and stored in the vector store. These sample texts are used to simulate documents or pieces of information that the system will later search through based on similarity to a query.


In [9]:
texts = ["Sample text 1", "Sample text 2", "Sample text 3"]

#### 5. Create the FAISS vectorstore locally

This line creates a **FAISS vector store** from the provided texts:

- **faiss_index = FAISS.from_texts(texts, embeddings)**: Converts the `texts` into vector embeddings using the specified embedding model (`embeddings`) and stores them in a **FAISS** vector index. FAISS is used for efficient nearest-neighbor search and similarity retrieval based on the vector representations of the texts.

In [10]:
faiss_index = FAISS.from_texts(texts, embeddings)
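
To make the nearest-neighbor idea concrete, here is a minimal sketch of what a flat (brute-force) index does: store every vector, then rank the stored vectors by distance to the query. It assumes squared Euclidean (L2) distance and uses hand-written 2-D vectors in place of real embeddings. `ToyFlatIndex` is a hypothetical illustration, not the FAISS API.

```python
class ToyFlatIndex:
    """Brute-force nearest-neighbor index over (text, vector) pairs."""

    def __init__(self):
        self._texts = []
        self._vectors = []

    def add(self, text, vector):
        """Store one text together with its vector."""
        self._texts.append(text)
        self._vectors.append(list(vector))

    @staticmethod
    def _squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def search(self, query_vector, k=4):
        """Return the k stored texts closest to query_vector, nearest first."""
        scored = [
            (self._squared_l2(query_vector, vec), text)
            for text, vec in zip(self._texts, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[0])
        return [text for _, text in scored[:k]]


index = ToyFlatIndex()
index.add("cats", [1.0, 0.0])
index.add("dogs", [0.9, 0.2])
index.add("stock markets", [0.0, 1.0])

# The query vector sits close to "cats" and "dogs", far from "stock markets".
nearest = index.search([1.0, 0.1], k=2)
print(nearest)
```

FAISS performs the same kind of ranking with optimized native code, and also offers approximate index types for collections that are too large to scan exhaustively.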

#### 6. Create the retrieval-based chain

This function defines the **local Retrieval-Augmented Generation (RAG) pipeline**:

- **`def local_rag(question, faiss_index, llm):`**: This function takes in a `question`, the FAISS vector store (`faiss_index`), and the local LLaMA model (`llm`) to generate a response based on the most relevant documents.

1. **Document Retrieval**:
    - **`docs = faiss_index.similarity_search(question)`**: The function performs a similarity search in the FAISS index using the provided question. It retrieves documents that are semantically closest to the query.
   
2. **Prompt Construction**:
    - **`prompt = f"Context: {docs}\n\nQuestion: {question}\n\nAnswer:"`**: Constructs a prompt that includes the retrieved documents as context, along with the original question. This prompt is then passed to the LLaMA model for generation.

3. **Response Generation**:
    - **`response = llm.invoke(prompt)`**: The prompt is sent to the local LLaMA model through `invoke`, which returns the generated answer as a string.

4. **Return Response**:
    - **`return response`**: The generated answer is returned as the output of the function.

In [11]:
def local_rag(question, faiss_index, llm):
    # Retrieve relevant documents
    docs = faiss_index.similarity_search(question)

    # Create the prompt with retrieved documents
    prompt = f"Context: {docs}\n\nQuestion: {question}\n\nAnswer:"

    # Generate an answer with the local model
    response = llm.invoke(prompt)

    return response
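
Interpolating `docs` directly places the Python representation of each retrieved object in the prompt. A possible refinement, sketched here under the assumption that each retrieved item exposes its text as a `page_content` attribute (as the documents used later in this notebook do), is to join only the text. `FakeDoc` and `build_prompt` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only the text field is modeled."""
    page_content: str


def build_prompt(question, docs):
    """Build a RAG prompt from the question and the text of each document."""
    context = "\n\n".join(doc.page_content for doc in docs)
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"


example_docs = [FakeDoc("Sample text 1"), FakeDoc("Sample text 2")]
example_prompt = build_prompt("What is RAG?", example_docs)
print(example_prompt)
```

Keeping prompt construction in its own function also makes it easy to inspect exactly what the model receives.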

#### Run the RAG Pipeline:

**1. `question = "What is Retrieval-Augmented Generation (RAG)?"`**:
Defines the input question to be answered by the system.

In [12]:
question = "What is Retrieval-Augmented Generation (RAG)?"

**2. `response = local_rag(question, faiss_index, llm)`**: The function retrieves relevant documents based on the question, constructs a prompt, and generates a response from the local LLaMA model.

In [13]:
response = local_rag(question, faiss_index, llm)


Output the response

In [14]:
print(response)

Retrieval-Augmented Generation (RAG) is a type of model that combines the strengths of two techniques:

1. **Retrieval**: This involves searching through a large database or corpus to find relevant information related to a given query or prompt.
2. **Generation**: This involves using a language model to generate new text based on the input prompt or query.

In RAG, the retrieval step is used to gather a set of relevant documents or passages from a large corpus, and then these retrieved passages are used as input to a generation model. The generation model then uses this input to produce a coherent and informative response to the original query or prompt.

RAG models have been particularly useful in applications such as:

* Text summarization: RAG can be used to retrieve relevant passages from a large corpus and then summarize them into a concise and accurate summary.
* Conversational AI: RAG can be used to generate responses to user queries by retrieving relevant information from a lar

### Part 2: Efficient Document Indexing with FAISS

- **Hugging Face Embeddings**: Loads the `sentence-transformers/all-mpnet-base-v2` model to embed both a `question` and a `document`. This allows for vectorizing text for similarity search.
    ```python
    embd = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    query_result = embd.embed_query(question)
    document_result = embd.embed_query(document)
    ```

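- **Cosine Similarity**: Measures how closely the embedded vectors of the `question` and `document` point in the same direction. The score ranges from `-1` to `1`; a value of `1` means the vectors have identical direction, regardless of their lengths.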
    ```python
    similarity = cosine_similarity(query_result, document_result)
    ```


In [32]:
# Load the Hugging Face Embeddings for local use
from langchain_community.embeddings import HuggingFaceEmbeddings

# Example documents
question = "What kinds of pets do I like?"
document = "My favorite pet is a cat."

# Embed the query and document using Hugging Face embeddings
embd = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
query_result = embd.embed_query(question)
document_result = embd.embed_query(document)

# Check the dimensionality of the query embedding
len(query_result)

768

In [33]:
# Implement cosine similarity for comparing query and document embeddings
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_result, document_result)
print("Cosine Similarity:", similarity)


Cosine Similarity: 0.5595268748748163
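
To calibrate a score such as 0.56, it helps to look at the boundary cases. The sketch below applies a dependency-free version of the same formula to hand-picked 2-D vectors (chosen for illustration, not produced by the embedding model).

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel, different lengths
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # no shared direction
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # pointing opposite ways

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0, and -1. The first case shows that the score depends only on direction: `[1, 2]` and `[2, 4]` differ in length but still score 1.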


### Part 3: Document Splitting and Vectorization

- **Document Loader and Splitting**: Loads blog content from a URL and splits it into smaller chunks using the `RecursiveCharacterTextSplitter` to prepare for vectorization. This ensures efficient retrieval of smaller document chunks based on the query.
    ```python
    loader = WebBaseLoader(...)
    splits = text_splitter.split_documents(blog_docs)
    ```

- **FAISS Vector Store**: Embeds the split documents using Hugging Face embeddings and stores them in a FAISS vector store. This vector store enables fast retrieval of the most relevant documents based on similarity to the query.
    ```python
    vectorstore = FAISS.from_texts(texts=texts, embedding=embedding_model)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
    ```

In [34]:
# Load blog posts and use RecursiveCharacterTextSplitter for splitting documents
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load blog from web
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))
    )
)
blog_docs = loader.load()

In [35]:
# Split the blog documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50
)
splits = text_splitter.split_documents(blog_docs)
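
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch: it treats list items as tokens and slides a window over them, whereas `RecursiveCharacterTextSplitter` counts tiktoken tokens and prefers to break on separators such as paragraph and line boundaries. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# 700 placeholder tokens, split with the same settings as the cell above.
words = [f"w{i}" for i in range(700)]
chunks = split_with_overlap(words, chunk_size=300, chunk_overlap=50)

print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

With 700 tokens the windows start at positions 0, 250, and 500, so the chunk lengths are 300, 300, and 200, and the last 50 tokens of one chunk reappear as the first 50 of the next. The overlap keeps a sentence that straddles a boundary fully present in at least one chunk.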

In [36]:
# Use FAISS for local vector storage and retrieval
texts = [doc.page_content for doc in splits]
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [37]:
# Create FAISS vectorstore
vectorstore = FAISS.from_texts(texts=texts, embedding=embedding_model)

In [38]:
# Convert the vectorstore into a retriever with search options
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

In [39]:
# Test document retrieval
docs = retriever.invoke("What is Task Decomposition?")
print("Number of relevant documents:", len(docs))

Number of relevant documents: 1


### Part 4: Generating Responses with Local LLaMA

- **LLaMA Model**: Loads the local LLaMA model using **Ollama** and defines a **prompt template** to generate answers based on the retrieved documents.
    ```python
    llm = Ollama(model="llama3.1")
    prompt = ChatPromptTemplate.from_template(template)
    ```

- **RAG Chain**: Combines document retrieval and generation by passing the retrieved documents as context to the LLaMA model to generate a detailed response.
    ```python
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    response = rag_chain.invoke("What is Task Decomposition?")
    ```
    
This flow demonstrates how the system embeds documents, retrieves relevant information, and generates responses using a fully local pipeline with FAISS, Hugging Face embeddings, and the LLaMA model.


In [42]:
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate

# Initialize the local LLaMA model using Ollama
llm = Ollama(model="llama3.1")

# Define the prompt template for generating responses
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

# Example usage with retrieved context and a question
context = "This is the context text for the question."
question = "What is the main topic of the context?"

response = llm.invoke(prompt.format(context=context, question=question))
print("Generated Response:", response)


Generated Response: There is no context provided. The statement says "This is the context text for the question", but there is no actual text or information to draw from. If you provide a context, I'll be happy to help!


In [43]:
# Create a more advanced RAG chain using retrieved documents and the local LLaMA model
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke("What is Task Decomposition?")
print("RAG Chain Response:", response)


RAG Chain Response: Task decomposition refers to the process of breaking down a complex user request into multiple manageable tasks, each with its own attributes such as type, ID, dependencies, and arguments. This process is done by a Large Language Model (LLM) in the first stage of the system described in the context, known as Task planning.
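
The `|` composition above can be read as a left-to-right function pipeline: the input question fans out into a `context`/`question` mapping, the mapping fills the prompt template, the prompt goes to the model, and the model output is reduced to a string. A plain-Python sketch of that data flow follows. The stub retriever and stub model are hypothetical stand-ins so the example runs without Ollama or FAISS, and this is not how LangChain implements runnables internally.

```python
TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def stub_retriever(question):
    """Pretend retrieval: always returns one fixed passage."""
    return "Task decomposition breaks a complex request into smaller tasks."


def stub_llm(prompt_text):
    """Pretend model: echoes the context line (second line) of the prompt."""
    context_line = prompt_text.splitlines()[1]
    return {"text": f"Based on the context: {context_line}"}


def run_chain(question):
    # Step 1: fan the question out into the template's two variables.
    inputs = {"context": stub_retriever(question), "question": question}
    # Step 2: fill the prompt template.
    prompt_text = TEMPLATE.format(**inputs)
    # Step 3: call the model.
    raw = stub_llm(prompt_text)
    # Step 4: parse the model output down to a plain string.
    return raw["text"]


answer = run_chain("What is Task Decomposition?")
print(answer)
```

Each step consumes exactly what the previous one produced, which is what allows the stages to be swapped independently, for example a different retriever or a different local model.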
