# **Local Retrieval Augmented Generation (RAG) from Scratch**

## **Importing Libraries**



In [1]:
# (Run in Colab to simulate a venv, libraries like Spacy cause a problem due to version dependencies)
!pip install -r "/content/Requirements.txt" -qq                  # -qq = quiet mode in pip

ERROR: Invalid requirement: '#': Expected package name at the start of dependency specifier
    #
    ^


In [54]:
import os
import fitz
from tqdm.auto import tqdm
import re
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np
import random
import textwrap

## **Reading the PDF**



To read the PDFs that we import, libraries like **PyMuPDF** and **PyPDF** can be used.

In [3]:
pdf_path = 'human-nutrition-text.pdf'

The text formatter **formats the text** read from the PDF by removing '\n' symbols. Additional changes can be made when required.


In [4]:
def Text_Format(text : str) -> str:
    cleaned_text = text.replace('\n', '').strip()                   # Strip removes any leading or trailing whitespaces
    return cleaned_text

The Read_PDF function reads the input PDF and stores it as a list of dictionaries.\
\
Alternative libraries like **NLTK** and **Spacy** can be used to sentencize the given text. These are far more accurate than just using split by fullstops

In [5]:
def Read_PDF(pdf_path : str) -> list[dict]:
    pdf = fitz.open(pdf_path)
    pages_list = []                                                 # The list that stores the page dicts

    for page_number, page in (enumerate(pdf)):
        text = page.get_text()
        text = Text_Format(text)

        pages_list.append({
            "page_number" : page_number - 41,                       # In our PDF, the page numbers start off from page 42, i.e, pg 42 -> pg 1
            "page_char_count" : len(text),
            "page_word_count" : len(text.split(" ")),
            "page_sentence_count" : len(text.split(". ")),
            "page_token_count" : len(text) / 4,                     # According to the GPT OpenAI paper, on average, 1 token ~ 4 char
            "text" : text,
            "sentences" : text.split(". ")
        })

    return pages_list


page_list = Read_PDF(pdf_path = pdf_path);

## **Text Splitting (Chunking)**



Here, we group sentences together into groups of 10. This can be done using the LangChain Library if required. \
\
We do this because :
- Easier to manage similar sized chunks of text.
- Don't overload the embedding models capacity for tokens (e.g. if an embedding model has a capacity of 384 tokens, there could be information loss if you try to embed a sequence of 400+ tokens).
- Our LLM context window (the amount of tokens an LLM can take in) may be limited and requires compute power so we want to make sure we're using it as well as possible.

In [6]:
chunk_size = 10                                                       # Each chunk has 10 sentences, so any page with > 10 sentences gets split : [17] -> [[10], [7]]

In [7]:
def split_list(input_list: list, size: int) -> list[list[str]]:

    return [input_list[i:i + size] for i in range(0, len(input_list), size)]

In [8]:
for item in page_list:
    item["sentence_chunks"] = split_list(input_list = item["sentences"], size = chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

## **Seperating Chunks to be Individual Pages**

In [9]:
page_list[900]["sentence_chunks"]
# So there are 3 chunks of text, each chunk has several sentences which are stored as list[str]
# So sentence_chunks = list[list[str]], where 1 individual chunk = list[str]

[['should considering their length, and slightly more than 20 percent of children ages two to five are overweight or have obesity.4 Some minority group children, such as Filipinos, Native Hawaiians, and Other Pacific Islanders, in Hawai‘i have higher rates of overweight and obesity',
  'In 2012, 12.8% of Hawai‘i WIC (low-income) participants ages two to four years were overweight and 10.2% had obesity.567 One study that investigated 2000-2010 data for children ages two to eight years in 51 communities in 11 United States Affiliated Pacific (USAP) jurisdictions found that 14.4% of the study population was overweight and 14% had obesity.8 4',
  'Institute of Medicine',
  '(2011)',
  'Early childhood obesity prevention policies',
  'The National Academies Press',
  '5',
  'Oshiro C., Novotny R., Grove J., Hurwitz E',
  '(2015)',
  'Race/ethnic differences in birth size, infant growth, and body mass index at age five years in children in Hawaii'],
 ['Childhood Obesity, 11(6),683-690',
  'h

In [11]:
seperated_chunk_list = []

for dic in page_list:
    for chunk in dic["sentence_chunks"]:

        new_dict = {}
        new_dict["page_number"] = dic["page_number"]                        # As page number remains the same

        # Making a Paragraph out of the list of strings in 1 chunk
        joined_sentence = "".join(chunk)                                    # Join together all the sentences in this 1 chunk (Because chunk = list[str])
        joined_sentence = joined_sentence.replace("  ", " ").strip()

        joined_sentence = re.sub(r'\.([A-Z])', r'. \1', joined_sentence)    # After joining, the Sentences have no spacing between them : S1.S2 , so we introduce spaces between .Capital : .A -> . A

        new_dict["paragraph"] = joined_sentence                             # new_dict["paragraph"] --> str

        # Other keys are similar to the original page dictionary
        new_dict["char_count"] = len(joined_sentence)
        new_dict["word_count"] = len(joined_sentence.split(" "))
        new_dict["sentence_count"] = len(joined_sentence.split(". "))
        new_dict["token_count"] = len(joined_sentence) / 4

        seperated_chunk_list.append(new_dict)

## **Removing Unecessary Chunks**

The chunks with a **low number of tokens** rarely have valuable information, so we can **remove those chunks** and save ourselves some processing power.

In [13]:
df = pd.DataFrame(seperated_chunk_list)
min_tokens = 30

In [14]:
pruned_seperated_chunk_list = df[df["token_count"] > min_tokens].to_dict(orient="records")
df_pruned = pd.DataFrame(pruned_seperated_chunk_list)
df_pruned.describe().round(2)

Unnamed: 0,page_number,char_count,word_count,sentence_count,token_count
count,1671.0,1671.0,1671.0,1671.0,1671.0
mean,587.46,787.86,116.37,1.05,196.96
std,349.87,417.85,65.96,0.25,104.46
min,-39.0,121.0,3.0,1.0,30.25
25%,283.5,400.5,57.5,1.0,100.12
50%,594.0,801.0,118.0,1.0,200.25
75%,894.0,1130.0,170.0,1.0,282.5
max,1166.0,1863.0,299.0,4.0,465.75


## **Adding Chunk Embeddings**

To add the embeddings, we make use of the **all-mpnet-base-v2 model** from the sentence transformers library. \
\
This library leads to the generation of embeddings of **shape (768, ) with 384 tokens taken in at once**.

In [15]:
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu");        # Change to GPU to make it faster

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [32]:
%%time
embedding_model.to("cpu")

for item in tqdm(pruned_seperated_chunk_list):
    item["embedding"] = embedding_model.encode(item["paragraph"])

  0%|          | 0/1671 [00:00<?, ?it/s]

CPU times: total: 57min 4s
Wall time: 11min


Alternatively, we can try Batch Operations, this helps reduce the time further while using a gpu.

In [31]:
# embedding_model.to("cpu")

# text_chunks = [item["paragraph"] for item in pruned_seperated_chunk_list]
# text_chunk_embeddings = embedding_model.encode(text_chunks, batch_size=32, convert_to_tensor=True)
# text_chunk_embeddings

## **Saving the Embeddings**

We save the Dictionary of Chunks along with the Embedding of each chunk, so that we dont need to keep re-calculating the embeddings on every instance. 

In [33]:
text_chunks_and_embeddings_df = pd.DataFrame(pruned_seperated_chunk_list)
embeddings_df_save_path = "chunk_embeddings.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

#### **==== NOTE ====**
If the CSV file is available, we can start running the notebook from this point. Just re reun the embedding model definition. 
#### **==============**

In [None]:
emb = pd.read_csv("chunk_embeddings.csv")
emb.head()

emb["embedding"] = emb["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))      # Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
embeddings = torch.tensor(np.array(emb["embedding"].tolist()), dtype=torch.float32).to("cpu")   # (Note: NumPy arrays are float64, torch tensors are float32 by default)


Unnamed: 0,page_number,paragraph,char_count,word_count,sentence_count,token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,1,77.0,[ 6.74242899e-02 9.02280360e-02 -5.09548606e-...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,1,52.5,[ 5.52156381e-02 5.92138283e-02 -1.66167859e-...
2,-37,Contents Preface University of Hawai‘i at Māno...,762,114,1,190.5,[ 2.68206690e-02 3.37356739e-02 -2.30485611e-...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,937,142,1,234.25,[ 6.77905306e-02 4.26554494e-02 -7.37832859e-...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,1,249.5,[ 3.30264382e-02 -8.49774200e-03 9.57152806e-...


## **Retrieval**

Once we have our embeddings, we make a system to **retrieve relevant paragraphs based on quries**.\
\
But before that, even the queries need to be made into embeddings so that they can be compared against the embeddings that we already have. 

In [47]:
query = "macronutrients functions"
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

Getting the **Cosine Similarity** and storing the **top k** results.

In [48]:
dot_scores = util.dot_score(a = query_embedding, b = embeddings)[0]
top_results_dot_product = torch.topk(dot_scores, k = 5)
top_results_dot_product

torch.return_types.topk(
values=tensor([0.6785, 0.6673, 0.6564, 0.6522, 0.6444]),
indices=tensor([42, 52, 41, 46, 51]))

In [55]:
def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pruned_seperated_chunk_list[idx]["paragraph"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pruned_seperated_chunk_list[idx]['page_number']}")
    print("\n")

Score: 0.6785
Text:
Macronutrients Nutrients that are needed in large amounts are called
macronutrientsThere are three classes of macronutrients: carbohydrates, lipids,
and proteinsThese can be metabolically processed into cellular energyThe energy
from macronutrients comes from their chemical bondsThis chemical energy is
converted into cellular energy that is then utilized to perform work, allowing
our bodies to conduct their basic functionsA unit of measurement of food energy
is the calorieOn nutrition food labels the amount given for “calories” is
actually equivalent to each calorie multiplied by one thousandA kilocalorie (one
thousand calories, denoted with a small “c”) is synonymous with the “Calorie”
(with a capital “C”) on nutrition food labelsWater is also a macronutrient in
the sense that you require a large amount of it, but unlike the other
macronutrients, it does not yield caloriesCarbohydrates Carbohydrates are
molecules composed of carbon, hydrogen, and oxygen
Page number

Apart from this top-k ranking, we can also construct a re-ranking model that looks at the top-k and re-ranks them, further improving the accuracy.