# Retrieval Augmented Generation with Unstructured Data using GPUs

## Setup

From the command line, create a new Conda environment from the `environment.yml` file:
```bash
module load miniforge
conda env create -f environment.yml
conda activate rag_ollama
```

Alternatively, install the necessary packages manually:

```bash
module load miniforge
conda create -n rag_ollama jupyterlab langchain-ollama langchain-chroma langchain-community
conda activate rag_ollama
pip install "unstructured[pdf]"
```

Create a Jupyter kernel for your environment:
```bash
python -m ipykernel install --user --name rag_ollama
```

Connect this notebook to the Jupyter kernel you just created. You may need to disconnect from and reconnect to your Jupyter session.

Run the setup script to start Ollama and download the embedding and language models:
```bash
sh start_ollama.sh
```

Import packages:

In [None]:
import json
import logging
import os

from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.agents.middleware import dynamic_prompt, ModelRequest
from langchain.agents import create_agent
from IPython.display import Markdown, display

Load the configuration file and set the data paths:

In [None]:
with open("config.json", "r") as f:
    config = json.load(f)

EMBEDDING_MODEL_NAME = config["embedding_model"]
LLM_NAME = config["llm"]

DOCS_PATH = os.path.join(os.getcwd(), "docs")
VECTOR_STORE_PATH = os.path.join(os.getcwd(), "vector_store")
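The contents of `config.json` are not shown in this notebook, so a missing key would only surface as a `KeyError` on lookup. As a minimal sketch, the loading step could be wrapped in a helper that checks for the two keys used above. The helper name `load_config` and the `REQUIRED_KEYS` tuple are assumptions made for illustration, not part of the original setup.

```python
import json

# Keys that the cell above reads from config.json
REQUIRED_KEYS = ("embedding_model", "llm")


def load_config(path="config.json"):
    """Load the JSON config and fail early if a required key is missing."""
    with open(path, "r") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{path} is missing required keys: {missing}")
    return config
```

Calling `load_config()` in place of the bare `json.load` call would report all missing keys at once rather than failing on the first lookup.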

Initialize embedding model:

In [None]:
embeddings = OllamaEmbeddings(
    model=EMBEDDING_MODEL_NAME
)

## Processing PDFs

We have a set of PDFs that we would like to feed into our RAG pipeline, but we cannot do this directly. PDFs are optimized for humans to read, and machines have a harder time parsing them. So, we must first process our documents so that they can be efficiently searched by a computer. We will do this in two steps:
1. Extract the raw text from the PDFs
2. Convert the text into vectors using an embedding model

### Extracting Text from PDFs

The `unstructured` software has a PDF loading tool that extracts text from PDFs and ignores images. This software uses the `pdfminer.six` Python package under the hood, which is very popular for reading PDFs using Python.

*Note: `unstructured` has loaders for other file formats as well, such as Markdown or Word documents.*

In [None]:
# Mute pdfminer warnings globally
logging.getLogger("pdfminer").setLevel(logging.ERROR)
logging.getLogger("pdfminer.pdffont").setLevel(logging.ERROR)

def load_documents(docs_path):
    """
    Recursively load all PDF documents found under docs_path.
    Files in other formats are skipped.
    """
    documents = []
    for file_name in os.listdir(docs_path):
        file_path = os.path.join(docs_path, file_name)
        if os.path.isdir(file_path):
            # Descend into subdirectories
            documents.extend(load_documents(file_path))
        elif file_name.lower().endswith(".pdf"):
            # Match the extension case-insensitively (.pdf, .PDF, ...)
            loader = UnstructuredPDFLoader(file_path, languages=["eng"])
            docs = loader.load()
            # Store the file name (not the full path) as the source
            for doc in docs:
                doc.metadata["source"] = file_name
            documents.extend(docs)
    return documents

documents = load_documents(DOCS_PATH)

This creates a list of documents:

In [None]:
documents[0]

Each document object contains the raw text (`page_content`) as well as metadata, such as the name of the source file.
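The recursive directory walk in `load_documents` can also be expressed with the standard library's `pathlib`. The following is a minimal sketch of the file discovery step only, without the PDF loading. The helper name `find_pdfs` is an assumption introduced here for illustration.

```python
from pathlib import Path


def find_pdfs(docs_path):
    """Return all PDF files under docs_path, including subdirectories."""
    root = Path(docs_path)
    # rglob("*") walks the directory tree recursively; the suffix is
    # compared in lowercase so that "REPORT.PDF" is also picked up.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to a loader one at a time, which keeps file discovery separate from text extraction.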

### Creating a Vector Store

Now that we have extracted the text from the PDFs, we must further process our data so that it can be efficiently searched by our pipeline. We will do this by saving our documents into a **vector store**.

#### Chunking

The vectors will be created using an embedding model, but before we do this, we must **chunk** our documents. Embedding models have a context limit, and some of our documents are too long to be embedded in a single pass. For example, the embedding model that we're using, `bge-m3`, has a context limit of roughly 8,000 tokens (per its [datasheet](https://ollama.com/library/bge-m3)).
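To get a feel for whether a piece of text fits within that limit, a rough estimate can be made from its character count. The sketch below assumes the common rule of thumb of about 3 to 4 characters per token; the real count depends on the model's tokenizer, so treat the result as an approximation. The function names and constants are illustrative assumptions.

```python
import math

# Assumption: roughly 3-4 characters per token. Using the lower bound (3)
# gives a conservative (higher) token estimate.
CHARS_PER_TOKEN = 3
CONTEXT_LIMIT_TOKENS = 8000  # approximate limit cited in the text above


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count of text from its character length."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text, limit=CONTEXT_LIMIT_TOKENS):
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= limit


sample = "x" * 10000  # a 10,000-character string
print(estimate_tokens(sample))  # 3334
print(fits_in_context(sample))  # True
```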

How you chunk your documents is important because each chunk should represent a coherent idea that reflects the intended meaning from the original document. If your chunks are too large, you risk feeding your pipeline unnecessary or only tangentially relevant information. If your chunks are too small, then you may lose essential context that helps the retriever and model understand what a chunk is actually about. Consider this example:

> Red squirrels have a varied and adaptable diet that changes with the seasons. They primarily eat seeds from conifer cones, such as pine, spruce, and fir, carefully stripping the cones to reach the nutritious seeds inside. In addition to seeds, they consume nuts, berries, fruits, buds, and fungi, especially mushrooms. Red squirrels are also known to occasionally eat insects, bird eggs, or nestlings when plant food is scarce.

> Red squirrels typically live in forests dominated by coniferous or mixed trees, which provide both food and shelter. They build nests, called dreys, high in the trees using twigs, leaves, moss, and bark for insulation. Some individuals also use hollow trees or abandoned woodpecker holes for nesting. Their habitat usually includes well-defined territories that they actively defend from other squirrels.

If we combine both paragraphs into a single chunk, then a query about the diet of red squirrels will retrieve information about their habitat and nesting behavior as well. While this information is related, it is not directly relevant to the question being asked. As a result, the retrieved context may fill up the model’s context window more quickly and crowd out other, more relevant chunks from different documents.

On the other hand, if we split the document too aggressively (e.g., by making each sentence into its own chunk), then the sentences' original context is lost. Important information that is implicit in the surrounding sentences may no longer be available to the retriever. For example, if a user asks, "What do red squirrels eat?", the retriever may fail to identify the following sentence as relevant:

> They primarily eat seeds from conifer cones, such as pine, spruce, and fir, carefully stripping the cones to reach the nutritious seeds inside.

On its own, this sentence does not explicitly mention red squirrels. Without the surrounding context, the retriever (and the model) has no clear signal that the sentence is describing the diet of red squirrels rather than some other animal.

In this example, the best method would be to treat each paragraph as its own chunk, as each paragraph has a distinct topic.
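For a text this simple, paragraph-level chunking can be sketched by splitting on blank lines. The example below uses an abbreviated version of the two paragraphs above; `split_into_paragraphs` is a hypothetical helper written for illustration.

```python
# Abbreviated version of the two example paragraphs, separated by a blank line
squirrel_text = (
    "Red squirrels have a varied and adaptable diet that changes with the seasons. "
    "They primarily eat seeds from conifer cones.\n\n"
    "Red squirrels typically live in forests dominated by coniferous or mixed trees. "
    "They build nests, called dreys, high in the trees."
)


def split_into_paragraphs(text):
    """Split text on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


paragraph_chunks = split_into_paragraphs(squirrel_text)
for i, chunk in enumerate(paragraph_chunks):
    print(f"Chunk {i}: {chunk[:50]}...")
```

Each chunk keeps the sentence that names its subject ("Red squirrels ..."), so the follow-up sentences retain their context.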

Of course, we cannot manually chunk every document. Instead, chunking tools let us specify a chunk size and a chunk overlap. The overlap repeats some text between neighboring chunks, which helps avoid context loss at chunk boundaries.

In [None]:
chunk_size = 10000  # Measured in characters, not tokens (1 token is roughly 3-4 characters)
chunk_overlap = int(0.2 * chunk_size)  # Up to 20% of a chunk may repeat text from the previous chunk
separators = ["\n\n", "\n", ". ", " ", ""]  # Tried in order: paragraphs, lines, sentences, words, characters

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
    separators=separators,
)

split_docs = text_splitter.split_documents(documents)
print(f"Loaded {len(documents)} docs -> split into {len(split_docs)}")

After splitting, we have many smaller documents, each representing one chunk of a source file:

In [None]:
for doc in split_docs:
    print(doc.metadata["source"])

Because we used a chunk overlap, consecutive chunks from the same source share some text:

In [None]:
print("Document 0:\n")
print(split_docs[0].page_content[-chunk_overlap:])
print("~~~~~~~~~~~~~~~~~~~~~~~~~")
print("Document 1:\n")
print(split_docs[1].page_content[:chunk_overlap])
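To make the idea of overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. This is a simplification: `RecursiveCharacterTextSplitter` also tries to split on the given separators, so its chunk boundaries will generally differ. The function `chunk_with_overlap` is a hypothetical helper for illustration only.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


toy_chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(toy_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The last two characters of each chunk ("ij", "qr") reappear at the start of the next one, which is the same pattern shown in the cell above.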

#### Converting Chunks into Vectors

The final step of data preparation is to convert our documents into "vectors," which are numerical representations that capture the meaning of words, sentences, and passages. This allows the pipeline to efficiently run a similarity search with queries to extract contextually relevant documents.
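As a rough illustration of how a similarity search ranks documents, the sketch below compares a query vector against a few document vectors using cosine similarity. The three-dimensional vectors are made up for this example; real embeddings have hundreds or thousands of dimensions, and the distance metric a given vector store uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings", invented for illustration
toy_vectors = {
    "squirrel diet": [0.9, 0.1, 0.0],
    "squirrel habitat": [0.6, 0.7, 0.1],
    "TCP congestion control": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.0]  # pretend embedding of "What do red squirrels eat?"

# Rank document labels from most to least similar to the query
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(toy_query, toy_vectors[name]),
    reverse=True,
)
print(ranked)  # ['squirrel diet', 'squirrel habitat', 'TCP congestion control']
```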

*How do we choose an embedding model?*

The embedded documents get saved to a Chroma database ("vector store"):

In [None]:
# Specifying persist_directory saves the vector store to disk,
# so we don't have to recreate it every time
Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory=VECTOR_STORE_PATH,
)

Once the vector store has been saved to disk, you can load it with the following code:

In [None]:
vector_store = Chroma(persist_directory=VECTOR_STORE_PATH,
                      embedding_function=embeddings)

Now that we have a vector store, we can evaluate its ability to retrieve relevant information using similarity searches. For example:

In [None]:
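def get_relevant_docs(query, vector_store):
    """Print the source and score of the 5 chunks closest to the query."""
    results = vector_store.similarity_search_with_score(
        query, k=5
    )
    # The score is a distance: lower values mean a closer match
    for res, score in results:
        print(res.metadata["source"], f"({round(score, 2)})")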

query = "How does BBR work?"
get_relevant_docs(query, vector_store)

Using a basic query, our retriever seems to work well. Let's try something more complex:

In [None]:
query = "Contrast the strategies used by CUBIC, Hybla, and BBR to handle connections with long Round Trip Times (RTT)."
get_relevant_docs(query, vector_store)

In [None]:
query = "How do 'loss-based' protocols and 'delay-based' protocols struggle specifically in the context of Low-Earth-Orbit (LEO) satellite networks?"
get_relevant_docs(query, vector_store)

## Running RAG

### Run an LLM locally

In [None]:
llm = OllamaLLM(model=LLM_NAME, temperature=0.5)
display(Markdown(llm.invoke("Where is MIT located?")))

In [None]:
display(Markdown(llm.invoke("How do I use Pandas to read a Parquet file in Python?")))

___

### Set up the RAG pipeline

In [None]:
@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Inject context into state messages."""
    last_query = request.state["messages"][-1].text
    retrieved_docs = vector_store.similarity_search(last_query, k=3)

    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    # Print the sources of the retrieved documents:
    print("\nRetrieved documents:")
    for doc in retrieved_docs:
        print(doc.metadata["source"])
    print()

    system_message = (
        "You are a helpful assistant. Answer only using the information from the following documents."
        f"\n\n{docs_content}"
    )

    return system_message


agent = create_agent(llm, tools=[], middleware=[prompt_with_context])

def pose(query):
    for step in agent.stream(
        {"messages": [{"role": "user", "content": query}]},
        stream_mode="values",
    ):
        display(Markdown(step["messages"][-1].text))
        

Testing questions:

In [None]:
query = "How do the design philosophies of BBR, CUBIC, and NewReno differ in their interpretation of network 'signals'?"
pose(query)

In [None]:
query = "Why does TCP Vegas perform poorly in LEO satellite networks compared to BBR?"
pose(query)

In [None]:
query = "Compare how TCP CUBIC and TCP Hybla address the problem of 'RTT Unfairness.'"
pose(query)

In [None]:
query = "What is the significance of the 'recover' variable and 'Partial ACKs' in the NewReno algorithm?"
pose(query)

In [None]:
query = "Explain the sequential relationship between BBR’s 'Startup' and 'Drain' states."
pose(query)

___

How do we know whether these answers are coming from the documents or from the model's pre-trained knowledge? This takes a bit of prompt engineering. Let's try editing the system message so that the model says when the documents do not contain the answer:

In [None]:
@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Inject context into state messages."""
    last_query = request.state["messages"][-1].text
    retrieved_docs = vector_store.similarity_search(last_query, k=3)

    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    # Print the sources of the retrieved documents:
    print("\nRetrieved documents:")
    for doc in retrieved_docs:
        print(doc.metadata["source"])
    print()

    system_message = (
        # Edit here:
        "You are a helpful assistant. Answer only using the information from the following documents. If the documents do not answer the question, please say so."
        f"\n\n{docs_content}"
    )

    return system_message


agent = create_agent(llm, tools=[], middleware=[prompt_with_context])

def pose(query):
    for step in agent.stream(
        {"messages": [{"role": "user", "content": query}]},
        stream_mode="values",
    ):
        display(Markdown(step["messages"][-1].text))

In [None]:
query = "Who was the fifth president of the United States?"
pose(query)
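To inspect exactly what the model receives, the system message assembly can be reproduced without the vector store or the LLM. The sketch below mirrors the string-building logic of `prompt_with_context`; the `FakeDoc` class, the `build_system_message` helper, and the placeholder chunk text are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


INSTRUCTIONS = (
    "You are a helpful assistant. Answer only using the information from "
    "the following documents. If the documents do not answer the question, "
    "please say so."
)


def build_system_message(retrieved_docs, instructions=INSTRUCTIONS):
    """Join the chunk texts and append them to the instructions."""
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
    return f"{instructions}\n\n{docs_content}"


# Placeholder text standing in for real retrieved chunks
example_docs = [FakeDoc("First retrieved chunk."), FakeDoc("Second retrieved chunk.")]
print(build_system_message(example_docs))
```

Printing the assembled message for an off-topic query makes it easier to judge whether the model had any relevant context to draw on, or whether it should have declined to answer.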