<a target="_blank" href="https://colab.research.google.com/github/mrdbourke/simple-local-rag/blob/main/00-simple-local-rag.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Create and run a local RAG pipeline from scratch

The goal of this notebook is to build a RAG (Retrieval Augmented Generation) pipeline from scratch and have it run on a local GPU.

Specifically, we'd like to be able to open a PDF file, ask questions (queries) of it and have them answered by a Large Language Model (LLM).

There are frameworks that replicate this kind of workflow, including [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/). However, the goal of building from scratch is to be able to inspect and customize all the parts.

## What is RAG?

RAG stands for Retrieval Augmented Generation.

It was introduced in the paper [*Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*](https://arxiv.org/abs/2005.11401).

Each step can be roughly broken down to:

* **Retrieval** - Seeking relevant information from a source given a query. For example, getting relevant passages of Wikipedia text from a database given a question.
* **Augmented** - Using the relevant retrieved information to modify an input to a generative model (e.g. an LLM).
* **Generation** - Generating an output given an input. For example, in the case of an LLM, generating a passage of text given an input prompt.
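
To make the three steps concrete, here is a minimal toy sketch in plain Python. It is not how this notebook implements RAG: retrieval here is naive word overlap rather than embeddings, and `retrieve`, `augment` and `fake_llm` are hypothetical names, with `fake_llm` standing in for a real generative model.

```python
# Toy illustration of Retrieval -> Augmentation -> Generation.
# Assumptions: word-overlap retrieval and a stub "LLM"; both are placeholders.

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages that share the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Insert the retrieved passages into the prompt."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{context_block}\n\nQuery: {query}\nAnswer:"

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a generative model."""
    return f"(model output for a prompt of {len(prompt)} characters)"

passages = [
    "Genes are the functional units of chromosomes.",
    "Mitosis is the division of the nucleus.",
]
query = "What are genes?"

context = retrieve(query, passages)   # Retrieval
prompt = augment(query, context)      # Augmentation
answer = fake_llm(prompt)             # Generation
print(prompt)
print(answer)
```

The rest of the notebook replaces each placeholder with a real component: embeddings and vector search for retrieval, a prompt template for augmentation, and an LLM for generation.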

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

Two primary improvements can be seen as:
1. **Preventing hallucinations** - LLMs are incredible but they are prone to potential hallucination, as in, generating something that *looks* correct but isn't. RAG pipelines can help LLMs generate more factual outputs by providing them with factual (retrieved) inputs. And even if the generated answer from a RAG pipeline doesn't seem correct, because of retrieval, you also have access to the sources where it came from.
2. **Work with custom data** - Many base LLMs are trained with internet-scale text data. This means they have a great ability to model language, however, they often lack specific knowledge. RAG systems can provide LLMs with domain-specific data such as medical information or company documentation and thus customize their outputs to suit specific use cases.

## What we're going to build

We're going to build a RAG pipeline which enables us to chat with a PDF document, specifically an open-source [nutrition textbook](https://pressbooks.oer.hawaii.edu/humannutrition2/), ~1200 pages long.

You could call our project NutriChat!

We'll write the code to:
1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF textbook ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.

In [1]:
# Importing required libraries
import os
import pickle
from tqdm.auto import tqdm
import random
from time import perf_counter as timer

import fitz
import pandas as pd
import numpy as np
from spacy.lang.en import English
import re

from sentence_transformers import SentenceTransformer, util
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [2]:
def text_preprocessing(text: str) -> str:
    """Performs basic text cleaning/preprocessing (newlines to spaces, trim whitespace)."""
    cleaned_text = text.replace("\n", " ").strip() 
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number,
        character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(pdf_path)  # open a pdf document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text() 
        text = text_preprocessing(text)
        pages_and_texts.append({"page_number": page_number, 
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # Assuming 1 token = ~4 chars
                                "text": text})
    return pages_and_texts

In [3]:
# Read and preprocess the PDF document
pdf_path = r"data/ConceptsofBiology-WEB.pdf"

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)

# To keep indexing fast, let's focus on a single unit of the book: "UNIT 2 CELL DIVISION AND GENETICS"
pages_and_texts = pages_and_texts[146:209]
pages_and_texts[:2] 

0it [00:00, ?it/s]

[{'page_number': 146,
  'page_char_count': 2396,
  'page_word_count': 386,
  'page_sentence_count_raw': 17,
  'page_token_count': 599.0,
  'text': 'INTRODUCTION CHAPTER 6 Reproduction at the Cellular Level 6.1 The Genome 6.2 The Cell Cycle 6.3 Cancer and the Cell Cycle 6.4 Prokaryotic Cell Division The individual sexually reproducing organism—including humans—begins life as a fertilized egg, or zygote. Trillions of cell divisions subsequently occur in a controlled manner to produce a complex, multicellular human. In other words, that original single cell was the ancestor of every other cell in the body. Once a human individual is fully grown, cell reproduction is still necessary to repair or regenerate tissues. For example, new blood and skin cells are constantly being produced. All multicellular organisms use cell division for growth, and in most cases, the maintenance and repair of cells and tissues. Single-celled organisms use cell division as their method of reproduction. 6.1 The G

Now let's get a random sample of the pages.

### EDA on text

Get the max, min and average number of characters, words, sentences and tokens per page.

In [4]:
df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 146 | 2396 | 386 | 17 | 599.0 | INTRODUCTION CHAPTER 6 Reproduction at the Cel... |
| 1 | 147 | 3000 | 484 | 25 | 750.0 | molecule in the form of a loop or circle. The ... |
| 2 | 148 | 2341 | 369 | 17 | 585.25 | 6.2 The Cell Cycle LEARNING OBJECTIVES By the ... |
| 3 | 149 | 2057 | 319 | 18 | 514.25 | S Phase Throughout interphase, nuclear DNA rem... |
| 4 | 150 | 2436 | 362 | 31 | 609.0 | VISUAL CONNECTION FIGURE 6.4 Animal cell mitos... |


In [5]:
# Get stats
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 |
| mean | 177.0 | 2456.79 | 386.05 | 20.68 | 614.2 |
| std | 18.33 | 1126.44 | 176.54 | 12.53 | 281.61 |
| min | 146.0 | 67.0 | 11.0 | 1.0 | 16.75 |
| 25% | 161.5 | 1727.0 | 283.0 | 13.0 | 431.75 |
| 50% | 177.0 | 2436.0 | 371.0 | 20.0 | 609.0 |
| 75% | 192.5 | 3060.5 | 496.5 | 28.0 | 765.12 |
| max | 208.0 | 4680.0 | 751.0 | 61.0 | 1170.0 |


Our average token count per page is about 614.
For this use case, it would be ideal to choose an embedding model whose input capacity is close to or greater than that.
However, I have chosen [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), which has a default input limit of 384 tokens (512 at most) and an embedding size of 768. I chose this model for its fast performance and good benchmark scores, and because it runs on my local machine (Core i7 processor, 16 GB RAM and an RTX 3050 GPU with 6 GB VRAM). Since whole pages exceed that input limit, we split them into smaller chunks in the following sections.

To select a suitable model, we can check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
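
As a quick sanity check on the numbers above, the sketch below estimates a token count from a character count and compares it against a model limit. It assumes the rough rule of about 4 characters per token used in this notebook (a heuristic, not a real tokenizer) and the 384-token default quoted for `all-mpnet-base-v2`; `estimate_tokens` and `fits_in_model` are hypothetical helper names.

```python
# Rough token estimate: assumes ~4 characters per token (heuristic only).
CHARS_PER_TOKEN = 4
MODEL_MAX_TOKENS = 384  # assumed default input limit for the embedding model

def estimate_tokens(text: str) -> float:
    """Approximate the number of tokens in a string from its length."""
    return len(text) / CHARS_PER_TOKEN

def fits_in_model(text: str, max_tokens: int = MODEL_MAX_TOKENS) -> bool:
    """Check whether the estimated token count is within the model limit."""
    return estimate_tokens(text) <= max_tokens

page_text = "x" * 2396    # same character count as page 146 above
short_chunk = "x" * 1000  # a hypothetical smaller chunk

print(estimate_tokens(page_text))  # 599.0
print(fits_in_model(page_text))    # False
print(fits_in_model(short_chunk))  # True
```

For exact counts you would run the model's own tokenizer instead of this estimate.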

### Further text processing (splitting pages into sentences)

Why split into sentences?

* Easier to handle than larger pages of text (especially if pages are densely filled with text).
* Lets us be specific about which group of sentences was used to produce an answer within a RAG pipeline.

Let's use spaCy to break our text into sentences since it's likely a bit more robust than just using `text.split(". ")`. 
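
To see why the naive approach is fragile, here is a small illustration (the example sentence is made up). Splitting on `". "` breaks inside abbreviations and ignores sentences that end with `?` or `!`.

```python
# Illustration of where text.split(". ") goes wrong (example text is made up).
text = "Mendel studied peas, e.g. seed shape. Was it random? No."

pieces = text.split(". ")
for piece in pieces:
    print(repr(piece))

# The split cuts "e.g. seed shape" in two and keeps
# "Was it random? No." together as a single piece.
```

The number of pieces can still end up close to the true sentence count, because the two kinds of mistakes partly cancel out, but the boundaries themselves are wrong.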

In [6]:
nlp = English()

# Add the sentencizer component to the pipeline
nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/63 [00:00<?, ?it/s]

In [7]:
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': 163,
  'page_char_count': 67,
  'page_word_count': 11,
  'page_sentence_count_raw': 1,
  'page_token_count': 16.75,
  'text': '150 6 • Critical Thinking Questions Access for free at openstax.org',
  'sentences': ['150 6 • Critical Thinking Questions Access for free at openstax.org'],
  'page_sentence_count_spacy': 1}]

In [8]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 |
| mean | 177.0 | 2456.79 | 386.05 | 20.68 | 614.2 | 19.13 |
| std | 18.33 | 1126.44 | 176.54 | 12.53 | 281.61 | 9.47 |
| min | 146.0 | 67.0 | 11.0 | 1.0 | 16.75 | 1.0 |
| 25% | 161.5 | 1727.0 | 283.0 | 13.0 | 431.75 | 13.0 |
| 50% | 177.0 | 2436.0 | 371.0 | 20.0 | 609.0 | 19.0 |
| 75% | 192.5 | 3060.5 | 496.5 | 28.0 | 765.12 | 25.0 |
| max | 208.0 | 4680.0 | 751.0 | 61.0 | 1170.0 | 41.0 |


For our set of text, it looks like our raw sentence count (e.g. splitting on `". "`) is quite close to what spaCy came up with.

### Chunking our sentences together

Let's break down our list of sentences/text into smaller chunks.

For now, we're going to break each page's sentences into groups of 10.
On average, each of our pages has about 20 sentences
and about 614 tokens,
so a group of 10 sentences will be roughly 310 tokens long.

On average, this leaves room for the text to be embedded by our `all-mpnet-base-v2` model (it has a max capacity of 512 tokens and a default capacity of 384 tokens). Unusually long chunks will still be truncated by the model.
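
The arithmetic behind that estimate can be written out as a short calculation. The figures are the rounded page averages from the tables above, and the 384-token limit is the default capacity quoted for the model.

```python
# Back-of-the-envelope chunk size estimate using the page averages above.
avg_tokens_per_page = 614.2
avg_sentences_per_page = 20      # rounded from the tables above
sentences_per_chunk = 10
model_default_max_tokens = 384   # default capacity quoted for the model

tokens_per_sentence = avg_tokens_per_page / avg_sentences_per_page
tokens_per_chunk = tokens_per_sentence * sentences_per_chunk

print(f"~{tokens_per_sentence:.1f} tokens per sentence")          # ~30.7
print(f"~{tokens_per_chunk:.0f} tokens per 10-sentence chunk")    # ~307
print(tokens_per_chunk <= model_default_max_tokens)               # True
```

This is only an average. Pages with long sentences can produce chunks well above the estimate.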

In [9]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of the desired size
def split_list(input_list: list, slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"], slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/63 [00:00<?, ?it/s]
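
As a quick check of the slicing logic, the standalone sketch below repeats the `split_list` definition from the cell above and applies it to 17 placeholder sentences.

```python
# Standalone check of the list-slicing approach used by split_list.
def split_list(input_list: list, slice_size: int) -> list[list[str]]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

sentences = [f"Sentence {i}." for i in range(17)]  # placeholder sentences
chunks = split_list(sentences, slice_size=10)

print(len(chunks))                     # 2
print([len(chunk) for chunk in chunks])  # [10, 7]
```

The last chunk simply holds whatever is left over, so chunk sizes are not balanced.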

In [10]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=10 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 197,
  'page_char_count': 2163,
  'page_word_count': 330,
  'page_sentence_count_raw': 16,
  'page_token_count': 540.75,
  'text': 'does appear to be intermediate between the two parents. For example, in the snapdragon, Antirrhinum majus (Figure 8.12), a cross between a homozygous parent with white flowers (CWCW) and a homozygous parent with red flowers (CRCR) will produce offspring with pink flowers (CRCW). (Note that different genotypic abbreviations are used for Mendelian extensions to distinguish these patterns from simple dominance and recessiveness.) This pattern of inheritance is described as incomplete dominance, meaning that one of the alleles appears in the phenotype in the heterozygote, but not to the exclusion of the other, which can also be seen. The allele for red flowers is incompletely dominant over the allele for white flowers. However, the results of a heterozygote self-cross can still be predicted, just as with Mendelian dominant and recessive crosse

In [11]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 |
| mean | 177.0 | 2456.79 | 386.05 | 20.68 | 614.2 | 19.13 | 2.41 |
| std | 18.33 | 1126.44 | 176.54 | 12.53 | 281.61 | 9.47 | 0.94 |
| min | 146.0 | 67.0 | 11.0 | 1.0 | 16.75 | 1.0 | 1.0 |
| 25% | 161.5 | 1727.0 | 283.0 | 13.0 | 431.75 | 13.0 | 2.0 |
| 50% | 177.0 | 2436.0 | 371.0 | 20.0 | 609.0 | 19.0 | 2.0 |
| 75% | 192.5 | 3060.5 | 496.5 | 28.0 | 765.12 | 25.0 | 3.0 |
| max | 208.0 | 4680.0 | 751.0 | 61.0 | 1170.0 | 41.0 | 5.0 |


### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.
So to keep things clean, let's create a new list of dictionaries, each containing a single chunk of sentences along with relevant information such as the page number, as well as statistics about each chunk.

In [12]:
# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

  0%|          | 0/63 [00:00<?, ?it/s]

152
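
The join-and-repair step in the cell above is easy to try in isolation. This sketch uses two made-up sentences to show what the regular expression fixes, and one case it leaves untouched (a full stop followed by a digit, as in page footers).

```python
import re

# Joining sentences with "" glues them together: "...divide.Daughter..."
sentences = ["Cells divide.", "Daughter cells separate."]  # made-up sentences
joined = "".join(sentences)

# Insert a space after a full stop that is directly followed by a capital letter
pattern = r'\.([A-Z])'
fixed = re.sub(pattern, r'. \1', joined)
print(fixed)  # Cells divide. Daughter cells separate.

# A full stop followed by a digit is not matched, so it stays as-is
footer_case = re.sub(pattern, r'. \1', "is random.182 8")
print(footer_case)  # is random.182 8
```

This explains why some chunks still contain text such as `(Figure 8.11).182 8` after cleaning.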

In [13]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 195,
  'sentence_chunk': 'Thus, there are four equally likely gametes that can be formed when the RrYy heterozygote is self-crossed, as follows: RY, rY, Ry, and ry. Arranging these gametes along the top and left of a 4 × 4 Punnett square (Figure 8.10) gives us 16 equally likely genotypic combinations. From these genotypes, we find a phenotypic ratio of 9 round–yellow:3 round–green:3 wrinkled–yellow:1 wrinkled–green (Figure 8.10). These are the offspring ratios we would expect, assuming we performed the crosses with a large enough sample size. The physical basis for the law of independent assortment also lies in meiosis I, in which the different homologous pairs line up in random orientations. Each gamete can contain any combination of paternal and maternal chromosomes (and therefore the genes on them) because the orientation of tetrads on the metaphase plane is random (Figure 8.11).182 8 • Patterns of Inheritance Access for free at openstax.org',
  'chunk_char_count': 

Excellent!

We've now broken our selected unit of the textbook into chunks of 10 sentences or fewer, each tagged with the page number it came from.

This means we could reference a chunk of text and know its source.

Let's get some stats about our chunks.

In [14]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 152.0 | 152.0 | 152.0 | 152.0 |
| mean | 176.3 | 1016.96 | 159.28 | 254.24 |
| std | 18.35 | 640.29 | 97.15 | 160.07 |
| min | 146.0 | 17.0 | 4.0 | 4.25 |
| 25% | 161.0 | 652.25 | 103.0 | 163.06 |
| 50% | 174.5 | 1039.0 | 164.0 | 259.75 |
| 75% | 191.25 | 1296.5 | 204.0 | 324.12 |
| max | 208.0 | 4315.0 | 664.0 | 1078.75 |


Some of our chunks have quite a low token count.

How about we check for samples with less than 30 tokens (about the length of a sentence) and see if they are worth keeping?

In [15]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 28.75 | Text: In this case, the C gene is epistatic to the A gene.190 8 • Patterns of Inheritance Access for free at openstax.org
Chunk token count: 24.0 | Text: However, the heterozygote phenotype occasionally 8.3 • Extensions of the Laws of Inheritance 183
Chunk token count: 25.25 | Text: When the new cell walls are in place, the daughter cells separate.6.4 • Prokaryotic Cell Division 143
Chunk token count: 16.75 | Text: 170 7 • Critical Thinking Questions Access for free at openstax.org
Chunk token count: 4.25 | Text: 7.2 • Meiosis 155


Looks like many of these are headers and footers of different pages.

They don't seem to offer too much information.

Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.
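
The same filter can be expressed without pandas. This is a minimal sketch on a hand-made list of two chunk dictionaries; the token counts are taken from examples shown earlier, and the second chunk's text is shortened for illustration.

```python
# Keep only chunks whose estimated token count is above the threshold.
min_token_length = 30

example_chunks = [
    {"page_number": 154, "sentence_chunk": "7.2 • Meiosis 155", "chunk_token_count": 4.25},
    {"page_number": 146, "sentence_chunk": "INTRODUCTION CHAPTER 6 ...", "chunk_token_count": 357.5},
]

kept_chunks = [
    chunk for chunk in example_chunks
    if chunk["chunk_token_count"] > min_token_length
]

print(len(kept_chunks))                   # 1
print(kept_chunks[0]["sentence_chunk"])   # INTRODUCTION CHAPTER 6 ...
```

Note that a chunk with exactly 30 tokens is dropped, since the comparison is strictly greater than.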

In [16]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 146,
  'sentence_chunk': 'INTRODUCTION CHAPTER 6 Reproduction at the Cellular Level 6.1 The Genome 6.2 The Cell Cycle 6.3 Cancer and the Cell Cycle 6.4 Prokaryotic Cell Division The individual sexually reproducing organism—including humans—begins life as a fertilized egg, or zygote. Trillions of cell divisions subsequently occur in a controlled manner to produce a complex, multicellular human. In other words, that original single cell was the ancestor of every other cell in the body. Once a human individual is fully grown, cell reproduction is still necessary to repair or regenerate tissues. For example, new blood and skin cells are constantly being produced. All multicellular organisms use cell division for growth, and in most cases, the maintenance and repair of cells and tissues. Single-celled organisms use cell division as their method of reproduction.6.1 The Genome LEARNING OBJECTIVES By the end of this section, you will be able to: • Describe the prokaryotic and 

Smaller chunks filtered!

Time to embed our chunks of text!

### Embedding our text chunks

As mentioned earlier, we'll use the `all-mpnet-base-v2` model.

In [17]:
# Loading embedding model
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu") # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

In [18]:
# Checking for GPU, if not available run through CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [19]:
%%time

# Send the model to the GPU
embedding_model.to(device) # I'm using a NVIDIA RTX 3060

# Create embeddings one by one on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/144 [00:00<?, ?it/s]

CPU times: total: 16.6 s
Wall time: 4.4 s


### Save embeddings to a pkl file

In [20]:
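embeddings_df_save_path = r"embeddings/text_chunks_and_embeddings.pkl"
os.makedirs(os.path.dirname(embeddings_df_save_path), exist_ok=True)  # create the output folder if it doesn't exist
with open(embeddings_df_save_path, 'wb') as handle:
    pickle.dump(pages_and_chunks_over_min_token_len, handle)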

### *First part completed*: Chunk embeddings have been created and saved in a pickle file.
Note: There is no need to run the above part every time, since the embeddings have been saved.
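
For reference, the save-then-load round trip looks like this in isolation. It is a minimal sketch that writes to a temporary folder, with a made-up record and a hypothetical file name, so it does not touch the real embeddings file.

```python
import os
import pickle
import tempfile

# Made-up record in the same shape as our chunk dictionaries
records = [{"page_number": 146,
            "sentence_chunk": "example text",
            "embedding": [0.1, 0.2, 0.3]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical path inside a temporary folder
    save_path = os.path.join(tmp_dir, "embeddings", "chunks.pkl")
    os.makedirs(os.path.dirname(save_path), exist_ok=True)

    # Save
    with open(save_path, "wb") as handle:
        pickle.dump(records, handle)

    # Load
    with open(save_path, "rb") as handle:
        loaded_records = pickle.load(handle)

print(loaded_records == records)  # True
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.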

## 2. RAG - Search and Answer

### Similarity search

Let's load the embeddings we created earlier and prepare them for use by turning them into a tensor.

In [21]:
# Import saved file and view
with open(embeddings_df_save_path, 'rb') as handle:
    pages_and_chunks = pickle.load(handle)

pages_and_chunks

[{'page_number': 146,
  'sentence_chunk': 'INTRODUCTION CHAPTER 6 Reproduction at the Cellular Level 6.1 The Genome 6.2 The Cell Cycle 6.3 Cancer and the Cell Cycle 6.4 Prokaryotic Cell Division The individual sexually reproducing organism—including humans—begins life as a fertilized egg, or zygote. Trillions of cell divisions subsequently occur in a controlled manner to produce a complex, multicellular human. In other words, that original single cell was the ancestor of every other cell in the body. Once a human individual is fully grown, cell reproduction is still necessary to repair or regenerate tissues. For example, new blood and skin cells are constantly being produced. All multicellular organisms use cell division for growth, and in most cases, the maintenance and repair of cells and tissues. Single-celled organisms use cell division as their method of reproduction.6.1 The Genome LEARNING OBJECTIVES By the end of this section, you will be able to: • Describe the prokaryotic and 

In [22]:
type(pages_and_chunks[0]["embedding"])

numpy.ndarray

In [23]:
# Convert the list of dictionaries to a DataFrame
text_chunks_and_embedding_df = pd.DataFrame(pages_and_chunks)
text_chunks_and_embedding_df.head()

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | 146 | INTRODUCTION CHAPTER 6 Reproduction at the Cel... | 1430 | 230 | 357.5 | [0.030578418, -0.06025812, 0.008500892, -0.066... |
| 1 | 146 | Organisms as diverse as protists, plants, and ... | 964 | 155 | 241.0 | [-0.012573496, -0.05184381, -0.015112887, -0.0... |
| 2 | 147 | molecule in the form of a loop or circle. The ... | 954 | 155 | 238.5 | [0.0048194006, -0.12724732, -0.009562335, -0.0... |
| 3 | 147 | These chromosomes are viewed within the nucleu... | 1426 | 220 | 356.5 | [-0.028140888, -0.09671277, 0.0034940273, -0.0... |
| 4 | 147 | It is possible to have two copies of the same ... | 617 | 108 | 154.25 | [-0.0026481792, -0.04604913, -0.006321701, 0.0... |


In [24]:
# Convert embeddings to a torch tensor and send to device (note: NumPy defaults to float64 while torch defaults to float32, so we set the dtype explicitly)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([144, 768])

In [25]:
embeddings[0]

tensor([ 3.0578e-02, -6.0258e-02,  8.5009e-03, -6.6905e-02, -9.7951e-03,
         4.3505e-02,  2.7605e-02, -2.0866e-02,  2.7174e-02, -2.0384e-02,
         3.4852e-02, -3.4849e-02,  1.1848e-02, -2.5423e-02, -1.2114e-02,
         3.1856e-02,  3.2217e-03, -3.5544e-02, -6.5791e-02, -1.1571e-02,
        -2.1049e-02,  1.6679e-02, -5.1109e-03, -4.2164e-02,  7.0373e-03,
        -1.8025e-02,  1.6101e-02,  2.1680e-02,  4.3700e-03, -9.7880e-03,
         6.9192e-03, -3.8853e-02,  7.2886e-03, -1.0609e-02,  2.3766e-06,
        -6.0086e-02, -1.0214e-02, -1.7423e-02, -4.7442e-02, -1.6844e-02,
         2.8313e-02,  3.6638e-02, -1.6446e-02, -6.9521e-03,  3.8913e-02,
         7.4818e-02,  6.6938e-03,  1.8996e-02,  3.0340e-02, -1.1950e-02,
        -8.0057e-03, -4.3651e-02,  3.4996e-02,  1.9025e-02,  3.2726e-02,
         7.3935e-02,  4.0560e-03,  3.4439e-02,  3.0490e-02, -2.1255e-02,
         5.2967e-02,  3.5121e-02,  2.8035e-02,  5.5708e-02,  2.2341e-02,
         1.8569e-02,  1.1419e-01, -6.6987e-02,  2.3

Time to perform a semantic search.

Let's take a question from the textbook: "Distinguish between chromosomes, genes, and traits".

Well, we can do so with the following steps:
1. Define a query string.
2. Turn the query string into an embedding with the same model we used to embed our text chunks.
3. Perform a dot product or cosine similarity between the text embeddings and the query embedding to get similarity scores.
4. Sort the results from step 3 in descending order (a higher score means more similarity in the eyes of the model) and use these values to inspect the texts. 
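
Steps 3 and 4 can be sketched in plain Python on tiny made-up vectors. This is an illustration only (real embeddings here have 768 dimensions and the notebook uses torch for speed); `dot` and `cosine` are hypothetical helper names.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up 2-dimensional "embeddings", all of unit length
chunk_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_embedding = [0.8, 0.6]

# Step 3: score every chunk against the query
scores = [dot(query_embedding, emb) for emb in chunk_embeddings]

# Step 4: sort chunk indices by score, highest first, and keep the top 2
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]

print(top_k)  # [1, 0]
```

When all vectors have unit length, as in this example, the dot product and cosine similarity give the same scores, so the cheaper dot product is enough.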

In [26]:
# 1. Define the query
query = "Distinguish between chromosomes, genes, and traits"
print(f"Query: {query}")

# 2. Embed the query using same embedding model 
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 7)
top_results_dot_product = torch.topk(dot_scores, k=7)
top_results_dot_product

Query: Distinguish between chromosomes, genes, and traits
Time taken to get scores on 144 embeddings: 0.00031 seconds.


torch.return_types.topk(
values=tensor([0.7626, 0.6738, 0.5922, 0.5922, 0.5714, 0.5607, 0.5491],
       device='cuda:0'),
indices=tensor([  3,  32, 100,  99,  37, 134, 139], device='cuda:0'))

`torch.topk` returns a tuple of values (scores) and indices for those scores.

The indices tell us which rows of the `embeddings` tensor scored highest against the query embedding (higher is better).

We can use those indices to map back to our text chunks.

First, we'll define a small helper function to print out wrapped text (so it doesn't print a whole text chunk as a single line).

In [27]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

Now we can loop through the `top_results_dot_product` tuple, match up the scores and indices, and then use those indices to index into our `pages_and_chunks` variable to get the relevant text chunks.

In [28]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'Distinguish between chromosomes, genes, and traits'

Results:
Score: 0.7626
Text:
These chromosomes are viewed within the nucleus (top), removed from a cell in
mitosis (right), and arranged according to length (left) in an arrangement
called a karyotype. In this image, the chromosomes were exposed to fluorescent
stains to distinguish them. (credit: “718 Bot”/Wikimedia Commons, National Human
Genome Research) The matched pairs of chromosomes in a diploid organism are
called homologous chromosomes. Homologous chromosomes are the same length and
have specific nucleotide segments called genes in exactly the same location, or
locus. Genes, the functional units of chromosomes, determine specific
characteristics by coding for specific proteins. Traits are the different forms
of a characteristic. For example, the shape of earlobes is a characteristic with
traits of free or attached. Each copy of the homologous pair of chromosomes
originates from a different parent; therefore, the copie

**The first two or three results look good. We get a relevant answer to our query even though it's quite vague.**

This is the **retrieval** part of Retrieval Augmented Generation (RAG).

Workflow of retrieval: `ingest documents -> split into chunks -> embed chunks -> make a query -> embed the query -> compare query embedding to chunk embeddings`

## Functionizing our semantic search pipeline

Let's put all of the steps from above for semantic search into a function or two so we can repeat the workflow.

In [29]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                   convert_to_tensor=True) 

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

Excellent! Now let's test our functions out.

In [30]:
query = "Distinguish between chromosomes, genes, and traits"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

[INFO] Time taken to get scores on 144 embeddings: 0.00008 seconds.


(tensor([0.7626, 0.6738, 0.5922, 0.5922, 0.5714], device='cuda:0'),
 tensor([  3,  32, 100,  99,  37], device='cuda:0'))

In [31]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 144 embeddings: 0.00008 seconds.
Query: Distinguish between chromosomes, genes, and traits

Results:
Score: 0.7626
These chromosomes are viewed within the nucleus (top), removed from a cell in
mitosis (right), and arranged according to length (left) in an arrangement
called a karyotype. In this image, the chromosomes were exposed to fluorescent
stains to distinguish them. (credit: “718 Bot”/Wikimedia Commons, National Human
Genome Research) The matched pairs of chromosomes in a diploid organism are
called homologous chromosomes. Homologous chromosomes are the same length and
have specific nucleotide segments called genes in exactly the same location, or
locus. Genes, the functional units of chromosomes, determine specific
characteristics by coding for specific proteins. Traits are the different forms
of a characteristic. For example, the shape of earlobes is a characteristic with
traits of free or attached. Each copy of the homologous pair of chromoso

## Augmenting our prompt with context items

What we'd like to do with augmentation is take the results from our search for relevant resources and put them into the prompt that we pass to our LLM.

In essence, we start with a base prompt and update it with context text.

Let's write a function called `query_formatter` that takes in a query and our list of context items (in our case it'll be select indices from our list of dictionaries inside `pages_and_chunks`) and then formats the query with text from the context items.
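
Before the full function, here is a minimal sketch of the two steps it relies on: joining chunk texts into a bulleted block and filling a template with `str.format`. The context items and the shortened template below are hypothetical stand-ins, not the real data or prompt.

```python
# Hypothetical context items shaped like the entries in pages_and_chunks
toy_context_items = [
    {"page_number": 1, "sentence_chunk": "Genes are segments of DNA."},
    {"page_number": 2, "sentence_chunk": "Traits are forms of a characteristic."},
]

# One "- " bullet per chunk, separated by newlines
toy_context = "- " + "\n- ".join(item["sentence_chunk"] for item in toy_context_items)

# A shortened stand-in for the base prompt
toy_template = "Context:\n{context}\nUser query: {query}\nAnswer:"
toy_prompt = toy_template.format(context=toy_context, query="What is a gene?")
print(toy_prompt)
```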

In [32]:
def query_formatter(query: str,
                    context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into a single bulleted block, one "- " bullet per chunk
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: Where do plants get each of the raw materials required for photosynthesis?
Answer: Plants require the following raw material for photosynthesis:\n1. CO2 is obtained from the atmosphere through stomata\n2. Water is absorbed by plant roots from the soil.\n3. Sunlight is an essential raw material for photosynthesis\n4. Nutrients are obtained by soil by plant roots
\nExample 2:
Query: Why are some substances biodegradable and some non-biodegradable?
Answer: The reason why some substances are biodegradable and some are non-biodegradable is because the microorganisms, like bacteria, and decomposers, like saprophytes, have a specific role to play. They can break down only natural products like paper, wood, etc., but they cannot break down human-made products like plastics. Based on this, some substances are biodegradable and some are non-biodegradable.
\nExample 3:
Query: Why is DNA copying an essential part of the process of reproduction?
Answer: DNA copying is an essential part of the process of reproduction because it carries the genetic information from the parents to offspring. A copy of DNA is produced through some chemical reactions resulting in two copies of DNA. Along with the additional cellular structure, DNA copying also takes place, which is then followed by cell division into two cells.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query
    modified_query = base_prompt.format(context=context, query=query)

    return modified_query

In [33]:
query = "Distinguish between chromosomes, genes, and traits"

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
modified_query = query_formatter(query=query,
                                 context_items=context_items)
print(modified_query)

[INFO] Time taken to get scores on 144 embeddings: 0.00007 seconds.
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: Where do plants get each of the raw materials required for photosynthesis?
Answer: Plants require the following raw material for photosynthesis:
1. CO2 is obtained from the atmosphere through stomata
2. Water is absorbed by plant roots from the soil.
3. Sunlight is an essential raw material for photosynthesis
4. Nutrients are obtained by soil by plant roots

Example 2:
Query: Why are some substances biodegradable and some non-biodegradable?
Answer: The reason why some substances are biodegradable and some are non-biodegradable is because the microorganisms, like ba

## Text Generation part of RAG

Now that we have tested our prompt formatting function, the next step is to feed the prompt to an LLM to generate the response text.

- We are using `google/gemma-2b-it`, which suits the hardware I am using for this pipeline.
- Let's log in to Hugging Face to access the model.

Note: To access this model, you need a Hugging Face access token and approval to use the model.
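
One way to avoid pasting the token into the notebook is to read it from an environment variable. This is only a sketch: the variable name `HF_TOKEN` and the helpers `get_hf_token` and `login_from_env` are assumptions made for illustration, while `huggingface_hub.login(token=...)` is the same call used in the cell below.

```python
import os


def get_hf_token(env=os.environ, var_name: str = "HF_TOKEN") -> str:
    """Read the access token from an environment variable (name assumed here)."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your Hugging Face access token."
        )
    return token


def login_from_env() -> None:
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported lazily so get_hf_token can be used without huggingface_hub installed
    from huggingface_hub import login
    login(token=get_hf_token())
```

Calling `login_from_env()` once at the start of a session would then replace the commented-out `login(...)` line.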

In [34]:
# Logging into huggingface hub
from huggingface_hub import login
# login(token = 'Your hugging face access token')

In [35]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [36]:
# Define the model ID
model_id = "google/gemma-2b-it"

# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16, # load the weights in half precision (float16)
                                                 low_cpu_mem_usage=False, # don't use the reduced-memory loading path
                                                 ).to(device)

# Build a test chat message and apply the chat template
messages = [{"role": "user", "content": "How are you?"}]

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)
print(tokenizer.decode(tokenized_chat[0]))

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<bos><start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
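
The decoded prompt above shows the layout the chat template wraps around a single user message. As a hand-written illustration of that layout only (the function name `format_gemma_user_turn` is hypothetical, and the tokenizer's own template remains the source of truth and may differ in details such as whitespace handling or multi-turn chats):

```python
def format_gemma_user_turn(content: str) -> str:
    """Hand-written copy of the single-turn layout shown in the output above."""
    return (
        "<bos><start_of_turn>user\n"
        f"{content}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


print(format_gemma_user_turn("How are you?"))
```

In practice `tokenizer.apply_chat_template` should still be used, since it stays in sync with whichever model is loaded.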



In [37]:
# Getting response from our LLM
outputs = llm_model.generate(tokenized_chat, max_new_tokens=256) 
print(tokenizer.decode(outputs[0]))

  attn_output = torch.nn.functional.scaled_dot_product_attention(
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


<bos><start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
I am doing well, thank you for asking! I am functioning properly and ready to assist you with any questions or tasks you may have. Is there anything I can help you with today?<eos>


### This shows our model is running fine. Now let's apply the same chat formatting to our `modified_query` and generate a response.

In [38]:
input_text = modified_query
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [{"role": "user",
                      "content": input_text}]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: Where do plants get each of the raw materials required for photosynthesis?
Answer: Plants require the following raw material for photosynthesis:
1. CO2 is obtained from the atmosphere through stomata
2. Water is absorbed by plant roots from the soil.
3. Sunlight is an essential raw material for photosynthesis
4. Nutrients are obtained by soil by plant roots

Example 2:
Query: Why are some substances biodegradable and some non-biodegradable?
Answer: The reason why some substances are biodegradable and some are non-biodegradable is because the microorganisms, like bacteria, and decomposers, like saprophytes, have a specif

In [39]:
%%time

# Tokenize the input text (turn it into numbers) and send it to the target device
input_ids = tokenizer(prompt, return_tensors="pt").to(device)
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256) # define the maximum number of new tokens to create
print(f"Model output (tokens):\n{outputs[0]}\n")

Model input (tokenized):
{'input_ids': tensor([[   2,    2,  106,  ...,  106, 2516,  108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')}

Model output (tokens):
tensor([     2,      2,    106,  ...,   3910, 235265,      1], device='cuda:0')

CPU times: total: 2.75 s
Wall time: 2.97 s


In [40]:
# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<bos><bos><start_of_turn>user
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: Where do plants get each of the raw materials required for photosynthesis?
Answer: Plants require the following raw material for photosynthesis:
1. CO2 is obtained from the atmosphere through stomata
2. Water is absorbed by plant roots from the soil.
3. Sunlight is an essential raw material for photosynthesis
4. Nutrients are obtained by soil by plant roots

Example 2:
Query: Why are some substances biodegradable and some non-biodegradable?
Answer: The reason why some substances are biodegradable and some are non-biodegradable is because the microorganisms, like bacteria, and de

### **That looks like a pretty good answer.**

But notice how the output contains the prompt text as well?

How about we do a little formatting to remove the prompt from the output text?

> **Note:** `"<bos>"` and `"<eos>"` are special tokens that denote "beginning of sequence" and "end of sequence" respectively.

In [41]:
print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<bos>', '').replace('<eos>', '')}")

Input text: Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: Where do plants get each of the raw materials required for photosynthesis?
Answer: Plants require the following raw material for photosynthesis:
1. CO2 is obtained from the atmosphere through stomata
2. Water is absorbed by plant roots from the soil.
3. Sunlight is an essential raw material for photosynthesis
4. Nutrients are obtained by soil by plant roots

Example 2:
Query: Why are some substances biodegradable and some non-biodegradable?
Answer: The reason why some substances are biodegradable and some are non-biodegradable is because the microorganisms, like bacteria, and decomposers, like saprophytes, have a specif
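
String replacement works here, but it depends on the decoded text matching the prompt exactly. An alternative sketch is to drop the prompt tokens before decoding, since the generated sequence starts with the input tokens. The helper `strip_prompt_tokens` and the toy token IDs are made up for illustration; the commented lines show roughly how it would look with the notebook's objects.

```python
def strip_prompt_tokens(output_ids, prompt_length: int):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Toy token IDs standing in for real tokenizer output
toy_prompt_ids = [2, 106, 1645, 108]
toy_output_ids = toy_prompt_ids + [235285, 1144, 1]

new_token_ids = strip_prompt_tokens(toy_output_ids, len(toy_prompt_ids))
print(new_token_ids)

# With the notebook's objects this would look roughly like:
# new_tokens = outputs[0][input_ids["input_ids"].shape[1]:]
# answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Passing `skip_special_tokens=True` to `tokenizer.decode` also removes the need to strip `<bos>` and `<eos>` by hand.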

### This shows our RAG steps are complete. Now let's write a final prompt formatter function and a wrapper that takes a query and runs the whole pipeline in one call.

In [42]:
# function to put our query and context into a prompt
def prompt_formatter(query: str, context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into a single bulleted block, one "- " bullet per chunk
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: Where do plants get each of the raw materials required for photosynthesis?
Answer: Plants require the following raw material for photosynthesis:\n1. CO2 is obtained from the atmosphere through stomata\n2. Water is absorbed by plant roots from the soil.\n3. Sunlight is an essential raw material for photosynthesis\n4. Nutrients are obtained by soil by plant roots
\nExample 2:
Query: Why are some substances biodegradable and some non-biodegradable?
Answer: The reason why some substances are biodegradable and some are non-biodegradable is because the microorganisms, like bacteria, and decomposers, like saprophytes, have a specific role to play. They can break down only natural products like paper, wood, etc., but they cannot break down human-made products like plastics. Based on this, some substances are biodegradable and some are non-biodegradable.
\nExample 3:
Query: Why is DNA copying an essential part of the process of reproduction?
Answer: DNA copying is an essential part of the process of reproduction because it carries the genetic information from the parents to offspring. A copy of DNA is produced through some chemical reactions resulting in two copies of DNA. Along with the additional cellular structure, DNA copying also takes place, which is then followed by cell division into two cells.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query
    prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [{"role": "user",
                          "content": prompt}]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                           tokenize=False, # keep as raw text (not tokenized)
                                           add_generation_prompt=True)
    return prompt

In [43]:
# Function that takes a query, retrieves its context and returns the LLM's response to the formatted prompt.
def ask(query, 
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True, 
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """
    
    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  print_time=False)
    
    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU 
        
    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)
    
    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)
    
    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("Sure, here's the answer to the user's query:\n\n", "")

    # Only return the answer without the context items
    if return_answer_only:
        return output_text
    
    return output_text, context_items
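
`ask` passes `temperature` together with `do_sample=True`. As a rough illustration of what temperature does during sampling, here is a generic temperature-scaled softmax in plain Python. It is a sketch of the general idea with made-up logits, not the internals of `generate`, and the function name `softmax_with_temperature` is hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it out, which is why lower values tend to give more repeatable answers.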

Now our final function is ready. Let's test it.

In [44]:
query = "Explain the differences between meiosis and mitosis"

print(f"Query: {query}")

# Answer query with context (returning only the answer text)
answer = ask(query=query,
             temperature=0.7,
             max_new_tokens=512,
             return_answer_only=True)

print("Answer:\n")
print_wrapped(answer)
# print(f"Context items:")
# context_items

Query: Explain the differences between meiosis and mitosis
Answer:

Meiosis is a two-round reproductive process that results in the production of
four haploid daughter cells from a single diploid cell. In contrast, mitosis is
a single-round reproductive process that results in the production of two
daughter cells from a single diploid cell.


**Now we can see that we're getting a suitable response. That's great!**

# Local RAG workflow complete!

- We now have a way to Retrieve, Augment and Generate answers based on a source.

- For now, we can verify our answers manually by reading them and checking them against the textbook.
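
To make that manual check easier, the retrieved chunks can be listed with their page numbers and scores. The sketch below assumes items shaped like those returned by `ask(..., return_answer_only=False)`, with the keys `page_number`, `score` and `sentence_chunk`; the helper name `format_sources` and the toy items are made up for illustration.

```python
import textwrap


def format_sources(context_items: list[dict], width: int = 80) -> str:
    """Build a readable summary of retrieved chunks for manual checking."""
    lines = []
    for rank, item in enumerate(context_items, start=1):
        score = float(item["score"])  # works for Python floats and 0-dim tensors
        lines.append(f"[{rank}] Page {item['page_number']} | score {score:.4f}")
        lines.append(textwrap.fill(item["sentence_chunk"], width=width))
        lines.append("")
    return "\n".join(lines)


# Made-up items for illustration
toy_items = [
    {"page_number": 12, "score": 0.7626,
     "sentence_chunk": "Genes are the functional units of chromosomes."},
    {"page_number": 40, "score": 0.6738,
     "sentence_chunk": "Traits are the different forms of a characteristic."},
]
print(format_sources(toy_items))
```

With the page numbers in hand, each answer can be traced back to the relevant page of the textbook.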

**`Note`** A Python file is also provided for running the Streamlit app. Running `app.py` launches Streamlit, which uses the module `bio_rag.py`. This module only handles user queries; it does not generate and save embedding chunks for the whole document.