**TODO:**
- Clean up and add more instructions
- Update to use satellite documents for class
- Look into further ways we can play with metadata
- Add structured data portion

# Retrieval Augmented Generation with Unstructured Data using GPUs

## Setup

From the command line, create a new Conda environment from the `environment.yml` file:
```bash
module load miniforge
conda env create -f environment.yml
conda activate rag_ollama
```

Alternatively, install the necessary packages manually:

```bash
module load miniforge
conda create -n rag_ollama jupyterlab langchain-ollama langchain-chroma langchain-community
conda activate rag_ollama
pip install "unstructured[pdf]"
```

Create a Jupyter kernel for your environment:
```bash
python -m ipykernel install --user --name rag_ollama
```

Connect this notebook to the Jupyter kernel you just created. You may need to disconnect from and reconnect to your Jupyter session.

Run the setup script to start Ollama and download the embedding and language models:
```bash
sh start_ollama.sh
```

Import packages:

In [1]:
import os
import json
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.agents.middleware import dynamic_prompt, ModelRequest
from langchain.agents import create_agent
import logging
from IPython.display import Markdown, display



Load the model names from `config.json` and set the document and vector store paths:

In [2]:
with open("config.json", "r") as f:
    config = json.load(f)

EMBEDDING_MODEL_NAME = config["embedding_model"]
LLM_NAME = config["llm"]

DOCS_PATH = os.path.join(os.getcwd(), "docs")
VECTOR_STORE_PATH = os.path.join(os.getcwd(), "vector_store")
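
The cell above expects `config.json` to contain two keys, `embedding_model` and `llm`. As a minimal sketch, the file contents could be checked before use as shown below. The helper `parse_config` and the LLM name `example-llm` are hypothetical placeholders, not part of this notebook; `mxbai-embed-large` is the embedding model named later in the text.

```python
import json

REQUIRED_KEYS = ("embedding_model", "llm")

def parse_config(text):
    """Parse config JSON and fail early if a required key is missing."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return config

# Hypothetical file contents; "example-llm" is a placeholder model name.
example_text = '{"embedding_model": "mxbai-embed-large", "llm": "example-llm"}'
example_config = parse_config(example_text)
print(example_config["embedding_model"])
```

Failing early on a missing key gives a clearer error than a `KeyError` raised later in the middle of the pipeline.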

Initialize components:

In [3]:
embeddings = OllamaEmbeddings(
    model=EMBEDDING_MODEL_NAME
)

## Processing PDFs

We have a set of PDFs that we would like to input into our RAG pipeline. We cannot do this directly, however. While PDFs are optimized for humans to read and comprehend, machines have a harder time. So, we must first process our documents so that they can be efficiently searched by a computer. We will do this in two steps:
1. Extract the raw text from the PDFs
2. Convert the text into vectors using an embedding model

### Extracting Text from PDFs

The `unstructured` library provides a PDF loader that extracts text from PDFs and ignores images. It uses the `pdfminer.six` Python package under the hood, which is a popular tool for reading PDFs in Python.

*Note: `unstructured` has loaders for other file formats as well, such as Markdown or Word documents.*

In [4]:
# Mute pdfminer warnings globally
logging.getLogger("pdfminer").setLevel(logging.ERROR)
logging.getLogger("pdfminer.pdffont").setLevel(logging.ERROR)

def load_documents(docs_path):
    """
    Load documents from the specified directory recursively. Documents must be
    in .pdf format.
    """

    # Load the documents recursively:
    documents = []
    for file_name in os.listdir(docs_path):
        file_path = os.path.join(docs_path, file_name)
        if file_name.endswith('.pdf'):
            loader = UnstructuredPDFLoader(file_path, languages=["eng"])
            doc = loader.load()
            doc[0].metadata["source"] = file_name
            documents.extend(doc)
        elif os.path.isdir(file_path):
            documents.extend(load_documents(file_path))
    return documents

documents = load_documents(DOCS_PATH)

This creates a list of documents:

In [5]:
documents[0]

Document(metadata={'source': 'NewReno.pdf'}, page_content='Network Working Group S. Floyd Request for Comments: 3782 ICSI Obsoletes: 2582 T. Henderson Category: Standards Track Boeing A. Gurtov TeliaSonera April 2004\n\nThe NewReno Modification to TCP’s Fast Recovery Algorithm\n\nStatus of this Memo\n\nThis document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.\n\nCopyright Notice\n\nCopyright (C) The Internet Society (2004). All Rights Reserved.\n\nAbstract\n\nThe purpose of this document is to advance NewReno TCP’s Fast Retransmit and Fast Recovery algorithms in RFC 2582 from Experimental to Standards Track status.\n\nThe main change in this document relative to RFC 2582 is to specify the Careful variant of NewRe

Each document object contains metadata, such as the source file name, as well as the raw text.

### Creating a Vector Store

Now that we have extracted the text from the PDFs, we must further process our data so that it can be efficiently searched by our pipeline. We will do this by saving our documents into a **vector store**.

#### Chunking

The vectors will be created using an embedding model, but before we do this, we must **chunk** our documents. This is necessary because embedding models have a context limit, and some of our documents are too long to be embedded in one piece. For example, the embedding model that we're using, `mxbai-embed-large`, has a context limit of 512 tokens (per its [datasheet](https://ollama.com/library/mxbai-embed-large)).
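
A quick way to sanity-check a chunk size against this limit is to estimate tokens from characters. The sketch below assumes roughly 3 characters per token, a heuristic consistent with the estimate that 512 tokens is about 1,500-1,600 characters. Real tokenizers vary with the text, so treat the result as an approximation; the helper names are illustrative.

```python
CONTEXT_LIMIT_TOKENS = 512   # limit cited for mxbai-embed-large
CHARS_PER_TOKEN = 3          # assumed heuristic, not a tokenizer property

def estimate_tokens(text):
    """Approximate the token count of text from its character length."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division

def fits_in_context(text, limit=CONTEXT_LIMIT_TOKENS):
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= limit

short_chunk = "x" * 1200    # stand-in for a 1,200-character chunk
long_chunk = "x" * 10000    # stand-in for a 10,000-character chunk
print(fits_in_context(short_chunk))  # True: about 400 estimated tokens
print(fits_in_context(long_chunk))   # False: about 3,334 estimated tokens
```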

How you chunk your documents is important, because each chunk should represent a coherent idea that reflects the intended meaning from the original document. If your chunks are too large, you risk feeding your pipeline unnecessary or only tangentially relevant information. If your chunks are too small, then you may lose essential context that helps the retriever and model understand what a chunk is actually about. Consider this example:

> Red squirrels have a varied and adaptable diet that changes with the seasons. They primarily eat seeds from conifer cones, such as pine, spruce, and fir, carefully stripping the cones to reach the nutritious seeds inside. In addition to seeds, they consume nuts, berries, fruits, buds, and fungi, especially mushrooms. Red squirrels are also known to occasionally eat insects, bird eggs, or nestlings when plant food is scarce.

> Red squirrels typically live in forests dominated by coniferous or mixed trees, which provide both food and shelter. They build nests, called dreys, high in the trees using twigs, leaves, moss, and bark for insulation. Some individuals also use hollow trees or abandoned woodpecker holes for nesting. Their habitat usually includes well-defined territories that they actively defend from other squirrels.

If we combine both paragraphs into a single chunk, then a query about the diet of red squirrels will retrieve information about their habitat and nesting behavior as well. While this information is related, it is not directly relevant to the question being asked. As a result, the retrieved context may fill up the model’s context window more quickly and crowd out other, more relevant chunks from different documents.

On the other hand, if we split the document too aggressively (e.g., by making each sentence into its own chunk), then the sentences' original context is lost. Important information that is implicit in the surrounding sentences may no longer be available to the retriever. For example, if a user asks, "What do red squirrels eat?", the retriever may fail to identify the following sentence as relevant:

> They primarily eat seeds from conifer cones, such as pine, spruce, and fir, carefully stripping the cones to reach the nutritious seeds inside.

On its own, this sentence does not explicitly mention red squirrels. Without the surrounding context, the retriever (and the model) has no clear signal that the sentence is describing the diet of red squirrels rather than some other animal.

In this example, the best method would be to treat each paragraph as its own chunk, as each paragraph has a distinct topic.
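
The trade-off can be seen in a small sketch that uses only the standard library. It splits a shortened version of the squirrel text once by paragraph and once by sentence, then checks which chunks still name their subject. The sentence splitter is a naive regular expression chosen for illustration, and `mentions_subject` is a hypothetical helper.

```python
import re

text = (
    "Red squirrels have a varied and adaptable diet that changes with the seasons. "
    "They primarily eat seeds from conifer cones, such as pine, spruce, and fir.\n\n"
    "Red squirrels typically live in forests dominated by coniferous or mixed trees. "
    "They build nests, called dreys, high in the trees."
)

# Paragraph-level chunks: split on blank lines.
paragraph_chunks = [p.strip() for p in text.split("\n\n") if p.strip()]

# Sentence-level chunks: naive split after sentence-ending punctuation.
sentence_chunks = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def mentions_subject(chunk, subject="red squirrels"):
    """Return True if the chunk explicitly names the subject."""
    return subject in chunk.lower()

paragraph_flags = [mentions_subject(c) for c in paragraph_chunks]
sentence_flags = [mentions_subject(c) for c in sentence_chunks]
print(len(paragraph_chunks), paragraph_flags)  # 2 [True, True]
print(len(sentence_chunks), sentence_flags)    # 4 [True, False, True, False]
```

Every paragraph-level chunk names red squirrels, while half of the sentence-level chunks refer to them only as "they".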

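Of course, we cannot manually chunk every document. Instead, chunking tools let us specify a chunk size along with a chunk overlap, which repeats some text between neighboring chunks to help avoid context loss. Here, `chunk_size` is set to 10,000 characters with a 20% overlap. This is well above the roughly 1,500-1,600 characters that fit within the 512-token limit of `mxbai-embed-large`; a value of about 1,200 characters, roughly the length of a short paragraph, would keep each chunk within that limit.

In [6]:
chunk_size = 10000  # measured in characters
chunk_overlap = int(.2 * chunk_size)  # 20% overlap between neighboring chunks
# separators=["\n\n", "\n", ". ", " ", ""]
separators=["\n\n", "\n"]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, # in characters; mxbai-embed-large is limited to 512 tokens (~1500-1600 characters)
    chunk_overlap=chunk_overlap,
    length_function=len,
    separators=separators,
)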
split_docs = text_splitter.split_documents(documents)
print(f"Loaded {len(documents)} docs -> split into {len(split_docs)}")

Loaded 6 docs -> split into 39
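
To see how a chunk size and a chunk overlap interact, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the given separators); `chunk_with_overlap` is a hypothetical helper used only to make the overlap visible.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk begins with the last two characters of the previous one, so text near a boundary appears in both chunks.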


After splitting, each source is represented by several documents, one per chunk:

In [7]:
for doc in split_docs:
    print(doc.metadata["source"])

NewReno.pdf
NewReno.pdf
NewReno.pdf
NewReno.pdf
NewReno.pdf
NewReno.pdf
CUBIC.pdf
CUBIC.pdf
CUBIC.pdf
CUBIC.pdf
CUBIC.pdf
CUBIC.pdf
CUBIC.pdf
Hybla.pdf
Hybla.pdf
Hybla.pdf
Hybla.pdf
Hybla.pdf
Hybla.pdf
Hybla.pdf
Hybla.pdf
comparative_study.pdf
comparative_study.pdf
comparative_study.pdf
comparative_study.pdf
comparative_study.pdf
CUBIC-checkpoint.pdf
CUBIC-checkpoint.pdf
CUBIC-checkpoint.pdf
CUBIC-checkpoint.pdf
CUBIC-checkpoint.pdf
CUBIC-checkpoint.pdf
CUBIC-checkpoint.pdf
BBR.pdf
BBR.pdf
BBR.pdf
BBR.pdf
BBR.pdf
BBR.pdf
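
The listing above includes `CUBIC-checkpoint.pdf`. The `-checkpoint` suffix suggests that it is an autosaved copy from Jupyter's hidden `.ipynb_checkpoints` directory; this is an assumption based on the file name, not something the output confirms. If so, the same content is indexed twice. A minimal sketch of a file search that skips hidden directories, using a hypothetical `find_pdfs` helper and only the standard library:

```python
import os
import tempfile

def find_pdfs(root):
    """Return sorted PDF paths under root, skipping hidden directories."""
    pdf_paths = []
    for dir_path, dir_names, file_names in os.walk(root):
        # Prune hidden directories (e.g. .ipynb_checkpoints) in place so
        # os.walk does not descend into them.
        dir_names[:] = [d for d in dir_names if not d.startswith(".")]
        for file_name in file_names:
            if file_name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(dir_path, file_name))
    return sorted(pdf_paths)

# Demonstration on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, ".ipynb_checkpoints"))
    for rel in ("CUBIC.pdf", os.path.join(".ipynb_checkpoints", "CUBIC-checkpoint.pdf")):
        open(os.path.join(root, rel), "w").close()
    found = [os.path.basename(p) for p in find_pdfs(root)]

print(found)  # ['CUBIC.pdf']
```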


#### Converting Chunks into Vectors

The final step of data preparation is to convert our documents into "vectors," which are numerical representations that capture the meaning of words, sentences, and passages. This allows the pipeline to efficiently run a similarity search with queries to extract contextually relevant documents.
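
As a rough illustration of similarity search, the sketch below ranks a few made-up three-dimensional vectors against a query vector using cosine similarity. The numbers are invented for the example, real embedding models produce vectors with many more dimensions, and the vector store may use a different distance metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings"; the values carry no real meaning.
toy_vectors = {
    "squirrel diet": [0.9, 0.1, 0.0],
    "squirrel habitat": [0.6, 0.7, 0.1],
    "TCP congestion control": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for an embedded question about diet

# Rank the stored vectors from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
print(ranked)  # ['squirrel diet', 'squirrel habitat', 'TCP congestion control']
```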

*How do we choose an embedding model?*

The embedded documents get saved to a Chroma database ("vector store"):

In [8]:
Chroma.from_documents(documents=split_docs,
                      embedding=embeddings,
                      persist_directory=VECTOR_STORE_PATH) # Specifying persist_directory saves the vector store as a file so we don't have to recreate it

<langchain_chroma.vectorstores.Chroma at 0x14f99ad1f7a0>

Once the vector store has been saved to disk, you can load it again with the following code:

In [9]:
vector_store = Chroma(persist_directory=VECTOR_STORE_PATH,
                      embedding_function=embeddings)

Now that we have a vector store, we can evaluate its ability to retrieve relevant information using similarity searches. The score printed next to each source is a distance, so lower values indicate a closer match. For example:

In [10]:
def get_relevant_docs(query, vector_store):
    results = vector_store.similarity_search_with_score(
        query, k=5
    )
    for res, score in results:
        # print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
        print(res.metadata["source"], f"({round(score, 2)})")

query = "How is BBR related to bottlenecks?"
get_relevant_docs(query, vector_store)

BBR.pdf (0.75)
BBR.pdf (0.75)
BBR.pdf (0.8)
BBR.pdf (0.8)
BBR.pdf (0.83)


Our retriever seems to be working well: for a question about BBR, all five of the closest chunks come from `BBR.pdf`.

## Running RAG

First, try the LLM on its own, without any retrieved context:

In [11]:
llm = OllamaLLM(model=LLM_NAME, temperature=0.5)
display(Markdown(llm.invoke("Where is MIT located?")))

The Massachusetts Institute of Technology (MIT) is located in Cambridge, Massachusetts, United States. Specifically:

* Address: 77 Massachusetts Avenue, Cambridge, MA 02139
* Located about 0.5 miles from downtown Boston and the Charles River.

MIT's campus spans across several buildings and facilities in the Kendall Square area of Cambridge, with some satellite locations nearby.

Set up the RAG pipeline:

In [12]:
import time

@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Inject context into state messages."""
    last_query = request.state["messages"][-1].text
    retrieved_docs = vector_store.similarity_search(last_query, k=3)

    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
    # print(docs_content) ##

    # Print documents used:
    print("\nRetrieved documents:") ##
    for doc in retrieved_docs: ##
        print(doc.metadata["source"]) ##
    print()

    system_message = (
        "You are a helpful assistant. Answer only using the information from the following documents."
        f"\n\n{docs_content}"
    )

    return system_message


agent = create_agent(llm, tools=[], middleware=[prompt_with_context])

def pose(query):
    for step in agent.stream(
        {"messages": [{"role": "user", "content": query}]},
        stream_mode="values",
    ):
        display(Markdown(step["messages"][-1].text))
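
The prompt-building step can be examined in isolation, without a running model or vector store. The sketch below mirrors how the system message above is assembled; `build_system_message` is a hypothetical helper, and the two example chunks are invented stand-ins for retrieved text.

```python
def build_system_message(chunks):
    """Combine retrieved chunk texts into a single system prompt."""
    docs_content = "\n\n".join(chunks)
    return (
        "You are a helpful assistant. Answer only using the information "
        "from the following documents."
        f"\n\n{docs_content}"
    )

# Hypothetical retrieved chunks standing in for vector store results.
example_chunks = [
    "BBR estimates bottleneck bandwidth and round-trip time.",
    "CUBIC grows its window using a cubic function.",
]
message = build_system_message(example_chunks)
print(message)
```

Printing the assembled message is a simple way to check exactly what context the model receives for a given query.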
        

Example questions I got from NotebookLM:

In [14]:
query = "How do the design philosophies of BBR, CUBIC, and NewReno differ in their interpretation of network 'signals'?"
pose(query)

How do the design philosophies of BBR, CUBIC, and NewReno differ in their interpretation of network 'signals'?


Retrieved documents:
BBR.pdf
BBR.pdf
BBR.pdf



The design philosophies of BBR (Bottleneck Bandwidth and Round-trip time), CUBIC, and NewReno differ in their interpretation of network "signals" as follows:

1.  **NewReno**: NewReno uses packet loss as a primary signal to indicate congestion. It assumes that packet loss is caused by the network being congested, and it reacts to this loss by reducing its transmission rate.

2.  **CUBIC**: CUBIC also relies heavily on packet loss but introduces some improvements over NewReno. It tries to estimate the bottleneck bandwidth (BtlBw) more accurately using a combination of algorithms that measure the network's capacity and the round-trip time (RTT). However, it still primarily uses packet loss as an indicator of congestion.

3.  **BBR**: BBR takes a fundamentally different approach by not relying on packet loss to signal congestion. Instead, it focuses on measuring the available bandwidth and RTT directly through probing mechanisms like ProbeRTT and ProbeBW. These probes allow BBR to estimate the bottleneck bandwidth more accurately and react accordingly without waiting for packet loss.

In summary, while NewReno and CUBIC both use packet loss as a primary signal for congestion, BBR focuses on measuring network capacity and RTT directly through probing mechanisms, making it less dependent on packet loss as an indicator of congestion.

In [15]:
query = "Why does TCP Vegas perform poorly in LEO satellite networks compared to BBR?"
pose(query)

Why does TCP Vegas perform poorly in LEO satellite networks compared to BBR?


Retrieved documents:
comparative_study.pdf
comparative_study.pdf
comparative_study.pdf



TCP Vegas performs poorly in LEO satellite networks compared to BBR because of its sensitivity to network delay. In a LEO satellite network, the satellites and ground devices are constantly moving, causing path changes and latency variations. TCP Vegas detects congestion by measuring increases in round-trip time (RTT), which can be caused by these path changes rather than actual congestion. As a result, it may react too quickly to perceived congestion, reducing its sending rate and leading to low throughput.

In contrast, BBR is more resilient to network delay variations because it frequently measures the bottleneck bandwidth and minimal RTT to obtain the bandwidth-delay product (BDP) of the path. This allows it to regulate its sending rate in a way that takes into account the dynamic nature of the LEO satellite network, resulting in better performance.

Additionally, the nearest-satellite strategy used in the simulations can also contribute to TCP Vegas's poor performance. When using this strategy, the ground device may connect to a satellite that is not on the shortest path to the destination, leading to higher latency and more frequent path changes, which can trigger TCP Vegas's congestion detection mechanism unnecessarily.

Overall, the combination of TCP Vegas's sensitivity to network delay and the nearest-satellite strategy used in the simulations contribute to its poor performance in LEO satellite networks compared to BBR.

In [16]:
query = "Compare how TCP CUBIC and TCP Hybla address the problem of 'RTT Unfairness.'"
pose(query)

Compare how TCP CUBIC and TCP Hybla address the problem of 'RTT Unfairness.'


Retrieved documents:
CUBIC.pdf
CUBIC-checkpoint.pdf
CUBIC-checkpoint.pdf



TCP CUBIC and TCP Hybla are both high-speed TCP variants designed to improve network utilization and fairness in environments with varying Round-Trip Times (RTTs). While they share some similarities, they approach the problem of RTT unfairness differently.

**TCP CUBIC:**

TCP CUBIC addresses RTT unfairness by employing a cubic growth function for congestion avoidance. This means that the sender's window size grows cubically with time, rather than linearly or exponentially as in traditional TCP. The cubic growth function helps to reduce the impact of RTT variations on fairness and ensures that senders with longer RTTs do not starve senders with shorter RTTs.

**TCP Hybla:**

TCP Hybla, on the other hand, uses a modified additive increase/multiplicative decrease (AIMD) algorithm to address RTT unfairness. The key innovation in TCP Hybla is its use of a dynamic increase phase, where the sender's window size increases multiplicatively with time, and a multiplicative decrease phase, where the window size decreases multiplicatively when congestion is detected. This approach helps to reduce the impact of RTT variations on fairness by allowing senders with longer RTTs to catch up with senders having shorter RTTs.

**Comparison:**

While both TCP CUBIC and TCP Hybla address RTT unfairness, they differ in their approaches:

*   TCP CUBIC uses a cubic growth function for congestion avoidance, which helps to reduce the impact of RTT variations on fairness.
*   TCP Hybla employs a modified AIMD algorithm with dynamic increase and multiplicative decrease phases, allowing senders with longer RTTs to catch up with those having shorter RTTs.

In summary, both TCP CUBIC and TCP Hybla are designed to improve network utilization and fairness in environments with varying RTTs. However, they approach the problem of RTT unfairness differently, with TCP CUBIC using a cubic growth function and TCP Hybla employing a modified AIMD algorithm.

In [17]:
query = "What is the significance of the 'recover' variable and 'Partial ACKs' in the NewReno algorithm?"
pose(query)

What is the significance of the 'recover' variable and 'Partial ACKs' in the NewReno algorithm?


Retrieved documents:
NewReno.pdf
NewReno.pdf
NewReno.pdf



The `recover` variable is a state variable that keeps track of the highest sequence number acknowledged by the receiver. It plays a crucial role in the NewReno algorithm.

When the sender receives duplicate acknowledgments (ACKs) indicating that a segment has been lost, it enters the Fast Recovery procedure. In this procedure, the sender uses the `recover` variable to determine whether the duplicate ACKs are due to a lost segment or unnecessary retransmissions.

If the cumulative acknowledgment field in the latest non-duplicate ACK does not cover more than the value stored in the `recover` variable, it indicates that the duplicate ACKs are likely due to unnecessary retransmissions. In this case, the sender proceeds with Step 1B of the NewReno algorithm, which involves halving the congestion window and continuing to send segments.

However, if the cumulative acknowledgment field does cover more than the value stored in the `recover` variable, it indicates that a segment has been lost. In this case, the sender proceeds with Step 1A of the NewReno algorithm, which involves sending a new segment and adjusting the congestion window accordingly.

The `recover` variable is essential for distinguishing between duplicate ACKs caused by a lost segment and those caused by unnecessary retransmissions. Its value helps the sender to correctly implement the Fast Recovery procedure and avoid unnecessary retransmissions.

Regarding "Partial ACKs", in the context of NewReno, a Partial ACK is an acknowledgment that indicates some segments have been received out of order or missing. When the receiver sends a partial ACK, it includes the sequence number of the first segment that was not received correctly. The sender uses this information to determine which segments are missing and needs to be retransmitted.

In NewReno, Partial ACKs play a crucial role in accelerating loss recovery by allowing the sender to quickly identify and retransmit missing segments. When the sender receives a partial ACK, it can immediately send new segments, reducing the time spent in the Fast Recovery procedure.

The use of partial ACKs in NewReno is based on the idea that the receiver should acknowledge out-of-order data segments immediately, as specified in RFC 2581. This allows the sender to respond quickly to packet losses and reduce the time spent in the recovery process.

In [18]:
query = "Explain the sequential relationship between BBR’s 'Startup' and 'Drain' states."
pose(query)

Explain the sequential relationship between BBR’s 'Startup' and 'Drain' states.


Retrieved documents:
BBR.pdf
BBR.pdf
BBR.pdf



BBR's 'Startup' and 'Drain' states are sequentially related in the following way:

1. When a new connection is established, the BBR flow enters the 'Startup' state.
2. In this state, the flow grows its inflight (the amount of data being transmitted) aggressively to quickly fill the available bandwidth.
3. As the inflight increases, the round-trip time (RTT) also increases due to the growing queue at the bottleneck link.
4. When the RTT reaches a certain threshold (clamped by cwnd_gain), the flow transitions from 'Startup' to 'Drain'.
5. In the 'Drain' state, the flow uses an inverse gain to reduce its inflight and get rid of any excess packets in the queue.
6. Once the inflight drops to a BDP (Bandwidth-Delay Product), the flow returns to the 'ProbeBW' state.

In summary, the 'Startup' state is used for rapidly filling the available bandwidth, while the 'Drain' state is used for reducing the inflight and getting rid of excess packets in the queue. The transition from 'Startup' to 'Drain' occurs when the RTT reaches a certain threshold, indicating that the flow has reached its maximum allowed inflight.