# About

This Python notebook puts the concepts behind a Retrieval-Augmented Generation (RAG) system into practice with working code. The emphasis is on writing out each step explicitly rather than relying on packages or libraries that abstract away the inner workings, so that we can better understand the plumbing of such a system. The overall process looks like this:

- Read your local data files
- Chunk the data for efficiency
- Embed the chunks as vectors, so that similarity search can find the text most related to a query
- Store the vectors
- Retrieve relevant chunks of data for a user query
- Pass the query and relevant chunks to an LLM for response generation


## Read in the data

In this simple example we'll be reading a single PDF.

In [20]:
# Imported to suppress a warning during the embedding step below; not really part of the tutorial.
from tqdm.auto import tqdm

In [34]:
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

In [3]:
from pypdf import PdfReader

# Using a raw string, absolute path
FILE_PATH = r"/Users/michaeldownes/Library/Mobile Documents/iCloud~md~obsidian/Documents/Cloud Vault/Attachments/EXAMPLE-PDF.pdf"

reader = PdfReader(FILE_PATH)
number_of_pages = len(reader.pages)

# Concatenate the extracted text of every page into one string
pdf_text = ""
for page in reader.pages:
    pdf_text += page.extract_text()

print(pdf_text[:200])

Drylab Newsfor in vestors & friends · Ma y 2017
Welcome to our first newsletter of 2017! It's
been a while since the last one, and a lot has
happened. W e promise to k eep them coming
every two months


## Chunk time (Simple)

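We'll use a simple, roughly fixed-length chunking technique here: chunks of at most 500 characters, with a 50-character overlap between neighbouring chunks so that text near a boundary keeps some of its context. LangChain's `RecursiveCharacterTextSplitter` tries to break on paragraph, line, and word boundaries before falling back to a hard cut, so the chunks are at most 500 characters long rather than exactly 500.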

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
text_chunks = text_splitter.split_text(pdf_text)
print(f"Total chunks: {len(text_chunks)}")

Total chunks: 14


In [21]:
text_chunks[:2]

["Drylab Newsfor in vestors & friends · Ma y 2017\nWelcome to our first newsletter of 2017! It's\nbeen a while since the last one, and a lot has\nhappened. W e promise to k eep them coming\nevery two months hereafter , and permit\nourselv es to mak e this one r ather long. The\nbig news is the beginnings of our launch in\nthe American mark et, but there are also\ninteresting updates on sales, de velopment,\nmentors and ( of course ) the in vestment\nround that closed in January .",
 'round that closed in January .\nNew c apital: The in vestment round was\nsuccessful. W e raised 2.13 MNOK to matchthe 2.05 MNOK loan from Inno vation\nNorwa y. Including the de velopment\nagreement with Filmlance International, the\ntotal new capital is 5 MNOK, partly tied to\nthe successful completion of milestones. All\nformalities associated with this process are\nnow finalized.\nNew o wners: We would especially lik e to\nwarmly welcome our new owners to the']
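To see where the repeated text at the start of the second chunk comes from, here is a minimal sketch of plain fixed-window chunking with overlap. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers natural separators); `chunk_fixed` and the sample string are made up for illustration.

```python
def chunk_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample input, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_fixed(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each chunk starts with the last three characters of the previous one, which is the same effect as the repeated `round that closed in January .` in the real output above.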

## Embed the chunks

Let's take our chunks and turn them into embeddings. For this we'll import some more necessary packages and select an embedding model.

In [6]:
import torch
from sentence_transformers import SentenceTransformer

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# We'll use this model for the embedding
model_name = "BAAI/bge-small-en-v1.5"

# Get the model by name, and provide it a device
embedding_model = SentenceTransformer(model_name, device=device)

In [8]:
embeddings = embedding_model.encode(text_chunks, show_progress_bar=True)


In [15]:
embeddings_size = embeddings[0].shape[0]
print(embeddings_size)

384
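Each chunk is now a list of 384 numbers. What makes that useful is that we can measure how closely two vectors point in the same direction. The sketch below shows cosine similarity (the distance we pick for the collection later) on hand-made 3-dimensional vectors. The function name and the toy vectors are assumptions for illustration, not output from the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: toy_close points the same way as toy_query, toy_far is orthogonal to it
toy_query = [1.0, 0.0, 1.0]
toy_close = [2.0, 0.0, 2.0]
toy_far = [0.0, 3.0, 0.0]

sim_close = cosine_similarity(toy_query, toy_close)
sim_far = cosine_similarity(toy_query, toy_far)
print(sim_close, sim_far)
```

Vector length does not matter, only direction: `toy_close` is twice as long as `toy_query` but still scores (approximately) 1, while the orthogonal `toy_far` scores 0.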


## Store Embeddings in a Vector Database

We'll be using Qdrant for our vector database. See some documentation below:

### Python

```
pip install qdrant-client
```

The Python client offers a convenient way to start with Qdrant locally:

```python
from qdrant_client import QdrantClient
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance, for testing, CI/CD
# OR
client = QdrantClient(path="path/to/db")  # Persists changes to disk, fast prototyping
```

### Client-Server

To experience the full power of Qdrant locally, run the container with this command:

```bash
docker run -p 6333:6333 qdrant/qdrant
```

Now you can connect to this with any client, including Python:

```python
qdrant = QdrantClient("http://localhost:6333") # Connect to existing Qdrant instance
```

### Start the Docker container in a new terminal session and leave it running

In [10]:
# !pip install qdrant-client

# Import client library
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance


client = QdrantClient("http://localhost:6333")


In [16]:
# The vector size could also be read from embedding_model.get_sentence_embedding_dimension()
collection_name = "qa_index"

# Drop any collection left over from a previous run so the notebook can be re-run
client.delete_collection(collection_name)

# Create our collection and ensure that the vector params size is equal to the size of the embeddings' shape.
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=embeddings_size, distance=Distance.COSINE),
)

True

## Create payloads and ids (metadata)

For each of our vectors/embeddings, we should store a) an id that identifies it and b) a payload with metadata that is meaningful to us humans: the source file and the actual string content of the chunk.

In [17]:
ids = []
payload = []

# Use each chunk's position as its id; `idx` avoids shadowing the built-in `id`
for idx, text in enumerate(text_chunks):
    ids.append(idx)
    payload.append({"source": FILE_PATH, "content": text})

payload[0]

{'source': '/Users/michaeldownes/Library/Mobile Documents/iCloud~md~obsidian/Documents/Cloud Vault/Attachments/EXAMPLE-PDF.pdf',
 'content': "Drylab Newsfor in vestors & friends · Ma y 2017\nWelcome to our first newsletter of 2017! It's\nbeen a while since the last one, and a lot has\nhappened. W e promise to k eep them coming\nevery two months hereafter , and permit\nourselv es to mak e this one r ather long. The\nbig news is the beginnings of our launch in\nthe American mark et, but there are also\ninteresting updates on sales, de velopment,\nmentors and ( of course ) the in vestment\nround that closed in January ."}
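Counting from zero works for a single file, but the same ids would come up again if we indexed a second file into the same collection. Qdrant point ids must be unsigned integers or UUIDs, so one possible alternative is a deterministic UUID derived from the source and the chunk position. This is a sketch of that idea; `make_point_id` and the file names are hypothetical, and the notebook itself keeps the integer ids.

```python
import uuid


def make_point_id(source: str, chunk_index: int) -> str:
    """Derive a stable UUID string from a source name and a chunk position."""
    # uuid5 is deterministic: the same namespace and name always give the same UUID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#chunk-{chunk_index}"))


# Hypothetical file names, used only to show the behaviour
first = make_point_id("EXAMPLE-PDF.pdf", 0)
again = make_point_id("EXAMPLE-PDF.pdf", 0)
other = make_point_id("ANOTHER.pdf", 0)

print(first == again)  # same file and chunk give the same id
print(first == other)  # a different file gives a different id
```

Because the id depends only on the source and the chunk index, re-running the notebook would target the same points instead of creating new ones.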

In [18]:
client.upload_collection(
    collection_name=collection_name,
    vectors=embeddings,
    payload=payload,
    ids=ids,
    batch_size=256,  # How many vectors will be uploaded in a single request?
)

In [19]:
client.count(collection_name)

CountResult(count=14)

The count above should match the number of chunks we produced earlier (14).

## Recapping

1. We read one PDF file and extracted its text using `PdfReader` from **pypdf**
2. We split the PDF text into overlapping chunks of at most 500 characters
3. We embedded the chunks using a model from **sentence_transformers**
4. We stored those embeddings and their metadata in the **Qdrant** vector database

### What's left?

- Retrieval (querying our local data and getting relevant results)
- Response Generation (passing our query and relevant info to an LLM)

## Retrieval component

It's time to search our data! Let's define a function that takes a **query** and the **number of chunks** to return from our data.

In [25]:
def search(text: str, top_k: int):
    query_embedding = embedding_model.encode(text).tolist() # use the same embedding model to embed our query

    search_result = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        query_filter=None,
        limit=top_k
    )
    return search_result


In [26]:
question = "What is our customer return rate?"
results = search(question, top_k=5)

[ScoredPoint(id=4, version=0, score=0.59253424, payload={'source': '/Users/michaeldownes/Library/Mobile Documents/iCloud~md~obsidian/Documents/Cloud Vault/Attachments/EXAMPLE-PDF.pdf', 'content': "Canada. Lumiere Numeriques ha ve started\nusing us in F rance. W e also ha ve new\ncustomers in Norwa y, and high-profile users\nsuch as Gareth Un win, producer of Oscar-\nwinning The King's Speech . Re venue for the\nfirst four months is 200 kNOK, compared to\n339 kNOK for all of 2016. W e are working\non a partnership to safeguard sales in\nNorwa y while beginning to focus more on\nthe US.\nNew team members: We've extended our\norganization with two permanent de velopers"}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id=1, version=0, score=0.559265, payload={'source': '/Users/michaeldownes/Library/Mobile Documents/iCloud~md~obsidian/Documents/Cloud Vault/Attachments/EXAMPLE-PDF.pdf', 'content': 'round that closed in January .\nNew c apital: The in vestment round was\nsucces
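As a mental model for what the search returns, the sketch below scores a query vector against every stored vector and keeps the `top_k` best, highest score first. Qdrant uses an index rather than a full scan like this, so treat it only as an illustration of the ranking. `top_k_search`, `normalize`, and the toy store are assumptions, not part of the Qdrant client.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_search(query_vec, stored, top_k):
    """stored holds (id, vector, payload) tuples; returns (score, id, payload), best first."""
    q = normalize(query_vec)
    scored = []
    for point_id, vec, payload in stored:
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, point_id, payload))
    # nlargest returns the top_k items sorted from highest to lowest score
    return heapq.nlargest(top_k, scored, key=lambda item: item[0])


# Toy 2-dimensional "embeddings" with made-up payloads
toy_store = [
    (0, [1.0, 0.0], {"content": "sales update"}),
    (1, [0.0, 1.0], {"content": "team news"}),
    (2, [1.0, 1.0], {"content": "investment round"}),
]

hits = top_k_search([1.0, 0.1], toy_store, top_k=2)
for score, point_id, payload in hits:
    print(point_id, round(score, 3), payload["content"])
```

The query points almost along the first axis, so point 0 ranks first, point 2 second, and point 1 is dropped by the `top_k=2` limit.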

# Last bit - Response Generation!

Let's ask OpenAI to answer our query by passing it some instructions, our query, and our reference chunks.

In [28]:
system_prompt = """You are an assistant for question-answering tasks. Answer the question according only to the given context.
If question cannot be answered using the context, simply say I don't know. Do not make stuff up.

Context: {context}
"""

user_prompt = """
Question: {question}

Answer:"""

# The results from the last step include relevance scores, versions, payloads, and more.
# We only send the payload content (the human-readable chunk) to the LLM
references = [obj.payload["content"] for obj in results]

# Separate the chunks with blank lines so the LLM can tell where one ends and the next begins
context = "\n\n".join(references)
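One possible variation, sketched below, is to number each chunk in the context so that an answer can point back to `[1]`, `[2]`, and so on. The helper `build_numbered_messages` is hypothetical, its system text is a shortened stand-in for the prompt above, and the two reference strings are placeholders rather than real chunks.

```python
def build_numbered_messages(question: str, references: list[str]) -> list[dict]:
    """Build chat messages whose context labels each reference chunk with [n]."""
    numbered = [f"[{i}] {ref}" for i, ref in enumerate(references, start=1)]
    context = "\n\n".join(numbered)

    # Shortened stand-in for the notebook's system prompt
    system = (
        "Answer the question according only to the given context.\n\n"
        f"Context: {context}\n"
    )
    user = f"\nQuestion: {question}\n\nAnswer:"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Placeholder references, not real chunks from the PDF
demo_messages = build_numbered_messages(
    "What is our customer return rate?",
    ["first chunk", "second chunk"],
)
print(demo_messages[0]["content"])
```

The returned list has the same shape as the `messages` argument used in the completion call, so it could be passed in its place.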

Showtime!

*A quick note about the **litellm** package*: `litellm` is useful here because its `completion` function standardizes the interface and parameters for calling an LLM across the major LLM providers. [Go to litellm.ai](https://www.litellm.ai/)

Each LLM provider has its own syntax for calling its API. `litellm` abstracts that away, so we can switch from OpenAI's `"gpt-3.5-turbo"` to Anthropic's Claude 3.5 Sonnet by changing the `model` argument and supplying that provider's API key, leaving the rest of the call as is.

Ok let's do it!

In [37]:
from litellm import completion

response = completion(
    api_key=OPENAI_API_KEY,
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt.format(context=context)},
        {"role": "user", "content": user_prompt.format(question=question)},
    ],
)

In [38]:
print(response.choices[0].message.content)

Our customer return rate is now 80%.


### Give me references with answers, please

In [39]:
print(f"ANSWER: {response.choices[0].message.content}\n\n")
print("REFERENCES:\n")
for index, ref in enumerate(references):
    print(f"Reference: [{index + 1}]: {ref}\n")

ANSWER: Our customer return rate is now 80%.


REFERENCES:

Reference: [1]: Canada. Lumiere Numeriques ha ve started
using us in F rance. W e also ha ve new
customers in Norwa y, and high-profile users
such as Gareth Un win, producer of Oscar-
winning The King's Speech . Re venue for the
first four months is 200 kNOK, compared to
339 kNOK for all of 2016. W e are working
on a partnership to safeguard sales in
Norwa y while beginning to focus more on
the US.
New team members: We've extended our
organization with two permanent de velopers

Reference: [2]: round that closed in January .
New c apital: The in vestment round was
successful. W e raised 2.13 MNOK to matchthe 2.05 MNOK loan from Inno vation
Norwa y. Including the de velopment
agreement with Filmlance International, the
total new capital is 5 MNOK, partly tied to
the successful completion of milestones. All
formalities associated with this process are
now finalized.
New o wners: We would especially lik e to
warmly welcome our ne

# Final word

Now that we've run through the basic pipeline for a single file, there are a few ways we can make things more complicated but more useful:

1. Use multiple PDF files
2. Use multiple files of different types (e.g. Markdown, TXT, and PDF)
3. *Use even more kinds of data (e.g. text, image, audio)*

Additionally, there are a few things we can do to level up our developer skills:

1. Use a local model running with `Ollama`
2. Use `langchain` to abstract away the plumbing of our RAG system

## Next steps

Let's see if we can repeat this notebook for multiple PDFs.
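As a starting point, here is a rough sketch of how the reading step might grow to cover a whole folder. It picks a reader function by file extension and records each file's path, so every chunk can later carry its own `source` in the payload. `load_documents`, `READERS`, and the demo file names are assumptions for illustration; the PDF branch reuses `PdfReader` from earlier and is not exercised by the demo.

```python
import tempfile
from pathlib import Path


def read_text_file(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def read_pdf_file(path: Path) -> str:
    from pypdf import PdfReader  # imported lazily so text-only runs do not need pypdf

    reader = PdfReader(str(path))
    return "".join(page.extract_text() or "" for page in reader.pages)


# Map each supported extension to the function that reads it
READERS = {".txt": read_text_file, ".md": read_text_file, ".pdf": read_pdf_file}


def load_documents(folder: str) -> list[dict]:
    """Read every supported file in a folder, remembering where each text came from."""
    documents = []
    for path in sorted(Path(folder).iterdir()):
        reader = READERS.get(path.suffix.lower())
        if reader is None or not path.is_file():
            continue  # skip sub-folders and unsupported file types
        documents.append({"source": str(path), "text": reader(path)})
    return documents


# Demo on a throwaway folder with made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "readme.md").write_text("# title", encoding="utf-8")
    Path(tmp, "image.bin").write_bytes(b"\x00")
    demo_docs = load_documents(tmp)

print([Path(d["source"]).name for d in demo_docs])
```

From there, each document's `text` could go through the same chunk, embed, and upload steps as before, with `source` copied into every chunk's payload.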