## What is RAG?

RAG stands for Retrieval Augmented Generation.

It was introduced in the paper [*Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*](https://arxiv.org/abs/2005.11401).

Each step can be roughly broken down to:

* **Retrieval** - Seeking relevant information from a source given a query. For example, getting relevant passages of Wikipedia text from a database given a question.
* **Augmented** - Using the relevant retrieved information to modify an input to a generative model (e.g. an LLM).
* **Generation** - Generating an output given an input. For example, in the case of an LLM, generating a passage of text given an input prompt.

In [3]:
import os
import requests

# Get PDF document path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as file:
            file.write(response.content) 
        print(f"[INFO] The file has been download and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {reponse.status_code}")

else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


In [4]:
!pip install PyMuPDF
!pip install tqdm
!pip install spacy tqdm



In [5]:
import fitz #for opening document
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF file, reads its text content page by page, and collects statistics."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_number": page_number - 41, # adjusted page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(", ")),
                                "page_token_count": len(text) / 4, #1 token has approx 4 characters
                                "text": text})
    return pages_and_texts
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

  from .autonotebook import tqdm as notebook_tqdm
1208it [00:01, 1134.46it/s]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [6]:
import random

random.sample(pages_and_texts, k=2)

[{'page_number': 322,
  'page_char_count': 1382,
  'page_word_count': 237,
  'page_sentence_count_raw': 6,
  'page_token_count': 345.5,
  'text': 'Chylomicron s Contain  Triglycerides  Cholesterol  Molecules  and other  Lipids by  OpenStax  College\xa0/ CC  BY 3.0  Just as lipids require special handling in the digestive tract to move  within a water-based environment, they require similar handling  to travel in the bloodstream. Inside the intestinal cells, the  monoglycerides and fatty acids reassemble themselves into  triglycerides. Triglycerides, cholesterol, and phospholipids form  lipoproteins when joined with a protein carrier. Lipoproteins have  an inner core that is primarily made up of triglycerides and  cholesterol esters (a cholesterol ester is a cholesterol linked to a  fatty acid). The outer envelope is made of phospholipids  interspersed with proteins and cholesterol. Together they form a  chylomicron, which is a large lipoprotein that now enters the  lymphatic system and

### Get some stats on the text

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.




In [7]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,11,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,2,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,1,199.25,Contents Preface University of Hawai‘i at Mā...


In [8]:
df.tail()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
1203,1162,1676,252,1,419.0,39. Exercise 10.2 & 11.3 reused “Egg Oval Food...
1204,1163,1617,254,6,404.25,Images / Pixabay License; “Pumpkin Cartoon Ora...
1205,1164,1715,261,8,428.75,Flashcard Images Note: Most images in the fla...
1206,1165,1733,268,4,433.25,ShareAlike 11. Organs reused “Pancreas Organ ...
1207,1166,257,44,1,64.25,23. Vitamin D reused “The Functions of Vitamin...


In [9]:
df.shape

(1208, 6)

In [10]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,11.05,287.0
std,348.86,560.38,95.76,8.95,140.1
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.0,134.0,5.0,190.5
50%,562.5,1231.5,214.5,10.0,307.88
75%,864.25,1603.5,271.0,15.0,400.88
max,1166.0,2308.0,429.0,106.0,577.0


### Further text processing (splitting pages into sentences)
We will to follow the workflow of:

`Ingest text -> split it into groups/chunks -> embed the groups/chunks -> use the embeddings`

Why split into sentences?

* Easier to handle than larger pages of text (especially if pages are densely filled with text).
* Can get specific and find out which group of sentences were used to help within a RAG pipeline.


We will use spaCy to break our text into sentences since it's likely a bit more robust than just using `text.split(". ")`. 

In [11]:
from spacy.lang.en import English

nlp = English()

nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    item["page_sentence_count_spacy"] = len(item["sentences"])

100%|██████████| 1208/1208 [00:01<00:00, 634.11it/s]


In [12]:
random.sample(pages_and_texts, k=1)

[{'page_number': 62,
  'page_char_count': 1693,
  'page_word_count': 290,
  'page_sentence_count_raw': 20,
  'page_token_count': 423.25,
  'text': 'Basic Biology, Anatomy, and  Physiology  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  The Basic Structural and Functional Unit of Life:  The Cell  What distinguishes a living\xa0organism from an inanimate object? A  living organism conducts self-sustaining biological processes. A cell  is the smallest and most basic form of life.  The cell theory incorporates three principles:  Cells are the most basic building units of life.\xa0All living things  are composed of cells. New cells are made from preexisting cells,  which divide in two. Who you are has been determined because  of two cells that came together inside your mother’s womb. The  two cells containing all of your genetic information (DNA) united to  begin making new life. Cells divided and differentiated into other  cells with s

In [13]:
df = pd.DataFrame(pages_and_texts)
df

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text,sentences,page_sentence_count_spacy
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition,[Human Nutrition: 2020 Edition],1
1,-40,0,1,1,0.00,,[],0
2,-39,320,54,11,80.00,Human Nutrition: 2020 Edition UNIVERSITY OF ...,[Human Nutrition: 2020 Edition UNIVERSITY OF...,1
3,-38,212,32,2,53.00,Human Nutrition: 2020 Edition by University of...,[Human Nutrition: 2020 Edition by University o...,1
4,-37,797,145,1,199.25,Contents Preface University of Hawai‘i at Mā...,[Contents Preface University of Hawai‘i at M...,2
...,...,...,...,...,...,...,...,...
1203,1162,1676,252,1,419.00,39. Exercise 10.2 & 11.3 reused “Egg Oval Food...,"[39., Exercise 10.2 & 11.3 reused “Egg Oval Fo...",18
1204,1163,1617,254,6,404.25,Images / Pixabay License; “Pumpkin Cartoon Ora...,[Images / Pixabay License; “Pumpkin Cartoon Or...,10
1205,1164,1715,261,8,428.75,Flashcard Images Note: Most images in the fla...,[Flashcard Images Note: Most images in the fl...,13
1206,1165,1733,268,4,433.25,ShareAlike 11. Organs reused “Pancreas Organ ...,"[ShareAlike 11., Organs reused “Pancreas Orga...",13


In [14]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,11.05,287.0,10.32
std,348.86,560.38,95.76,8.95,140.1,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.0,134.0,5.0,190.5,5.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0
75%,864.25,1603.5,271.0,15.0,400.88,15.0
max,1166.0,2308.0,429.0,106.0,577.0,28.0


### Chunking our sentences together
Why do we do this?

1. Easier to manage similar sized chunks of text.
2. Don't overload the embedding models capacity for tokens (e.g. if an embedding model has a capacity of 384 tokens, there could be information loss if you try to embed a sequence of 400+ tokens).
3. Our LLM context window (the amount of tokens an LLM can take in) may be limited and requires compute power so we want to make sure we're using it as well as possible.

In [15]:
chunk_size = 10
def split_list(input_list: list[str], 
               slice_size: int=chunk_size) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

100%|██████████| 1208/1208 [00:00<?, ?it/s]


In [16]:
random.sample(pages_and_texts,k=1)

[{'page_number': 466,
  'page_char_count': 2011,
  'page_word_count': 337,
  'page_sentence_count_raw': 15,
  'page_token_count': 502.75,
  'text': 'molecules of carbon dioxide. The energy obtained from the  breaking of chemical bonds in the citric acid cycle is transformed  into two more ATP molecules (or equivalents thereof) and high  energy electrons that are carried by the molecules, nicotinamide  adenine dinucleotide (NADH) and flavin adenine dinucleotide  (FADH2). NADH and FADH2 carry the electrons to the inner  membrane in the mitochondria where the third stage of energy  release takes place, in what is called the electron transport chain. In  this metabolic pathway a sequential transfer of electrons between  multiple proteins occurs and ATP is synthesized. The entire process  of nutrient catabolism is chemically similar to burning, as carbon  and hydrogen atoms are\xa0 combusted (oxidized) producing carbon  dioxide, water, and heat. However, the stepwise chemical reactions  in 

In [17]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,11.05,287.0,10.32,1.53
std,348.86,560.38,95.76,8.95,140.1,6.3,0.64
min,-41.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,260.75,762.0,134.0,5.0,190.5,5.0,1.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0,1.0
75%,864.25,1603.5,271.0,15.0,400.88,15.0,2.0
max,1166.0,2308.0,429.0,106.0,577.0,28.0,3.0


### Splitting each chunk into its own item


In [18]:
import re

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  "," ").strip()
        joined_sentence_chunk = re.sub(r'\.(A-Z)', r'. \1', joined_sentence_chunk) # convert ".A"to ". A"(only for capital letter)
        chunk_dict["sentence_chunk"] = joined_sentence_chunk
        
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4
        
        pages_and_chunks.append(chunk_dict)
    
len(pages_and_chunks)       

100%|██████████| 1208/1208 [00:00<00:00, 24251.26it/s]


1843

In [19]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 801,
  'sentence_chunk': 'Updated April 2019.Accessed April 25, 2020. Physical Activity during Pregnancy For most pregnant women, physical activity is a must and is recommended in the 2015-2020 Dietary Guidelines for Americans and the 2018 Physical Activity Guidelines for Americans 10.\xa0Regular exercise of moderate intensity, about thirty minutes per day most 10.\xa0-U.S. Department of Health and Human Services. (2018) Physical Activity Guidelines for Americans, 2nd edition. U.S. Department of Health and Human Services Pregnancy | 801',
  'chunk_char_count': 510,
  'chunk_word_count': 73,
  'chunk_token_count': 127.5}]

In [20]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,731.11,109.0,182.78
std,347.79,445.65,69.34,111.41
min,-41.0,12.0,3.0,3.0
25%,280.5,313.5,43.0,78.38
50%,586.0,745.0,111.0,186.25
75%,890.0,1112.0,168.0,278.0
max,1166.0,1824.0,290.0,456.0


In [21]:
 df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,-41,Human Nutrition: 2020 Edition,29,4,7.25
1,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0
2,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5
3,-37,Contents Preface University of Hawai‘i at Māno...,765,113,191.25
4,-36,Lifestyles and Nutrition University of Hawai‘i...,940,141,235.0


In [22]:
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count : {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count : 24.25 | Text: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=225 330 | Digestion and Absorption of Lipids
Chunk token count : 10.5 | Text: Accessed November 22, 2017. 676 | Selenium
Chunk token count : 11.0 | Text: Accessed October 5, 2017. Introduction | 433
Chunk token count : 12.25 | Text: PART VIII CHAPTER 8.ENERGY Chapter 8.Energy | 451
Chunk token count : 29.5 | Text: 2011.  https://www.ers.usda.gov/publications/pub- details/?pubid=44909.Accessed April 15, 2018. 1138 | Food Insecurity


In [23]:
#filtering rows with token under 30
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [24]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 730,
  'sentence_chunk': 'The survey also notes that more Americans 1.\xa0Food Labeling.US Food and Drug Administration. https://www.fda.gov/Food/GuidanceRegulation/ GuidanceDocumentsRegulatoryInformation/ LabelingNutrition/ucm385663.htm#highlights.Updated November 11, 2017.Accessed November 22, 2017. 2.\xa0Consumer Research on Labeling, Nutrition, Diet and Health.US Food and Drug Administration. 730 | Discovering Nutrition Facts',
  'chunk_char_count': 401,
  'chunk_word_count': 39,
  'chunk_token_count': 100.25}]

### Embedding our text chunks

Embeddings of text will mean that similar meaning texts have similar numerical representation.


Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).

We'll use our computers to find patterns in the embeddings and then we can use their text mappings to further our understanding.

We'll use the [`sentence-transformers`](https://www.sbert.net/docs/installation.html) library which contains many pre-trained embedding models.

Specifically, we'll get the `all-mpnet-base-v2` model (you can see the model's intended use on the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

In [25]:
!pip install sentence-transformers # for embedding models

ERROR: Invalid requirement: '#': Expected package name at the start of dependency specifier
    #
    ^


In [26]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cuda")


for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Loading weights: 100%|██████████| 199/199 [00:00<00:00, 786.57it/s, Materializing param=pooler.dense.weight]                        
[1mMPNetModel LOAD REPORT[0m from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m


AssertionError: Torch not compiled with CUDA enabled

In [None]:
pages_and_chunks_over_min_token_len[0]["embedding"].shape

(768,)

Our embedding has a shape of `(768,)` meaning it's a vector of 768 numbers which represent our text in high-dimensional space.

In [None]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[9]

'Defining Protein University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program 363 The Role of Proteins in Foods: Cooking and Denaturation University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program 374 Protein Digestion and Absorption University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program 378 Protein’s Functions in the Body University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program 383 Diseases Involving Proteins University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program 395 Proteins in a Nutshell University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program 405 Proteins, Diet, and Personal Choices University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program 409'

In [None]:
len(text_chunks)

1680

In [None]:
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=16, # Embed all texts in batches
                                               convert_to_tensor=True)
text_chunk_embeddings[0]

Batches:   0%|          | 0/105 [00:00<?, ?it/s]

tensor([ 6.7424e-02,  9.0228e-02, -5.0955e-03, -3.1755e-02,  7.3908e-02,
         3.5198e-02, -1.9799e-02,  4.6769e-02,  5.3573e-02,  5.0123e-03,
         3.3393e-02, -1.6221e-03,  1.7608e-02,  3.6265e-02, -3.1669e-04,
        -1.0712e-02,  1.5426e-02,  2.6218e-02,  2.7765e-03,  3.6494e-02,
        -4.4411e-02,  1.8936e-02,  4.9012e-02,  1.6402e-02, -4.8578e-02,
         3.1829e-03,  2.7299e-02, -2.0476e-03, -1.2283e-02, -7.2805e-02,
         1.2045e-02,  1.0730e-02,  2.1000e-03, -8.1777e-02,  2.6783e-06,
        -1.8143e-02, -1.2080e-02,  2.4717e-02, -6.2747e-02,  7.3544e-02,
         2.2162e-02, -3.2877e-02, -1.8010e-02,  2.2295e-02,  5.6137e-02,
         1.7951e-03,  5.2593e-02, -3.3174e-03, -8.3388e-03, -1.0628e-02,
         2.3192e-03, -2.2393e-02, -1.5301e-02, -9.9306e-03,  4.6532e-02,
         3.5747e-02, -2.5476e-02,  2.6369e-02,  3.7491e-03, -3.8268e-02,
         2.5833e-02,  4.1287e-02,  2.5818e-02,  3.3297e-02, -2.5178e-02,
         4.5152e-02,  4.4903e-04, -9.9662e-02,  4.9

In [None]:
#Saving embedding to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(save_path, index=False)

In [None]:
# Import saved file and view
text_chunks_and_embeddings_df_load = pd.read_csv(save_path)
text_chunks_and_embeddings_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[ 6.74242675e-02 9.02281404e-02 -5.09548886e-...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[ 5.52156419e-02 5.92139773e-02 -1.66167244e-...
2,-37,Contents Preface University of Hawai‘i at Māno...,765,113,191.25,[ 2.79801842e-02 3.39813754e-02 -2.06426680e-...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,940,141,235.0,[ 6.82566911e-02 3.81275006e-02 -8.46854132e-...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,[ 3.30264494e-02 -8.49763490e-03 9.57159605e-...


# RAG - Search and Answer

### Similarity search
Similarity search or semantic search or vector search is the idea of searching on *semantic*.

With keyword search, you are trying to match the string "apple" with the string "apple".

Whereas with similarity/semantic search, you may want to search "macronutrients functions".
And get back results that don't necessarily contain the words "macronutrients functions" but get back pieces of text that match that meaning.


In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

text_chunks_and_embedding_df = pd.read_csv(save_path)
#convert embedding to array (it got converted to string when it saved)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

#converting embedding into torch tensor
embeddings = torch.tensor(np.stack(text_chunks_and_embedding_df["embedding"].tolist(), axis=0), dtype=torch.float32).to(device)
# Convert texts and embedding df to list of dicts
pages_and_chunks = text = text_chunks_and_embedding_df.to_dict(orient="records")

text_chunks_and_embeddings_df

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.00,"[0.06742427, 0.09022814, -0.005095489, -0.0317..."
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.50,"[0.05521564, 0.059213977, -0.016616724, -0.020..."
2,-37,Contents Preface University of Hawai‘i at Māno...,765,113,191.25,"[0.027980184, 0.033981375, -0.020642668, 0.001..."
3,-36,Lifestyles and Nutrition University of Hawai‘i...,940,141,235.00,"[0.06825669, 0.0381275, -0.008468541, -0.01813..."
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.50,"[0.03302645, -0.008497635, 0.009571596, -0.004..."
...,...,...,...,...,...,...
1675,1164,Flashcard Images Note: Most images in the flas...,1298,169,324.50,"[0.018562254, -0.016427767, -0.012704563, -0.0..."
1676,1164,Hazard Analysis Critical Control Points reused...,373,49,93.25,"[0.03347206, -0.057044085, 0.015148939, -0.010..."
1677,1165,ShareAlike 11.Organs reused “Pancreas Organ An...,1277,164,319.25,"[0.07705155, 0.009785576, -0.012181741, 0.0010..."
1678,1165,Sucrose reused “Figure 03 02 05” by OpenStax B...,408,57,102.00,"[0.10304516, -0.016470186, 0.008268461, 0.0377..."


In [None]:
embeddings.shape

torch.Size([1680, 768])

Retrival is done by following steps:
1. Define a query string.
2. Turn the query string in an embedding with same model we used to embed our text chunks.
3. Perform a [dot product](https://pytorch.org/docs/stable/generated/torch.dot.html) or [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function between the text embeddings and the query embedding to get similarity scores.
4. Sort the results from step 3 in descending order (a higher score means more similarity in the eyes of the model) and use these values to inspect the texts. 

In [None]:
from sentence_transformers import util

query = "macronutrients functions"
print(f"Query : {query}")

query_embedding = embedding_model.encode(query, convert_to_tensor=True).to("cuda")

dot_scores = util.dot_score(query_embedding, embeddings)[0]

top_results = torch.topk(dot_scores, k=5)
top_results

Query : macronutrients functions


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

torch.return_types.topk(
values=tensor([0.6926, 0.6738, 0.6646, 0.6536, 0.6473], device='cuda:0'),
indices=tensor([42, 47, 41, 51, 46], device='cuda:0'))

In [None]:
for score, idx in zip(top_results[0], top_results[1]):
    print(f"Score: {score:.4f}")
    print("Text")
    print(pages_and_chunks[idx]["sentence_chunk"])
    print("\n\n")


Score: 0.6926
Text
Macronutrients Nutrients that are needed in large amounts are called macronutrients.There are three classes of macronutrients: carbohydrates, lipids, and proteins.These can be metabolically processed into cellular energy.The energy from macronutrients comes from their chemical bonds.This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions.A unit of measurement of food energy is the calorie.On nutrition food labels the amount given for “calories” is actually equivalent to each calorie multiplied by one thousand.A kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with the “Calorie” (with a capital “C”) on nutrition food labels.Water is also a macronutrient in the sense that you require a large amount of it, but unlike the other macronutrients, it does not yield calories. Carbohydrates Carbohydrates are molecules composed of carbon, hydrogen, and oxygen.



In [None]:
def retrieve_relevant_resources(query: str, n_resources_to_return: int=5):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).to("cuda")

    dot_scores = util.dot_score(query_embedding, embeddings)[0]

    scores, indices = torch.topk(dot_scores, k=n_resources_to_return)
    
    return scores, indices



In [None]:
retrieve_relevant_resources(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

(tensor([0.6926, 0.6738, 0.6646, 0.6536, 0.6473], device='cuda:0'),
 tensor([42, 47, 41, 51, 46], device='cuda:0'))

In [None]:
def print_top_results_and_scores(query: str, n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.
    """
    scores, indices = retrieve_relevant_resources(query, n_resources_to_return=n_resources_to_return)
    for score, idx in zip(scores, indices):
        print(f"Score: {score:.4f}")
        print("Text")
        print(pages_and_chunks[idx]["sentence_chunk"])
        print("\n\n")

In [None]:
print_top_results_and_scores(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Score: 0.6926
Text
Macronutrients Nutrients that are needed in large amounts are called macronutrients.There are three classes of macronutrients: carbohydrates, lipids, and proteins.These can be metabolically processed into cellular energy.The energy from macronutrients comes from their chemical bonds.This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions.A unit of measurement of food energy is the calorie.On nutrition food labels the amount given for “calories” is actually equivalent to each calorie multiplied by one thousand.A kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with the “Calorie” (with a capital “C”) on nutrition food labels.Water is also a macronutrient in the sense that you require a large amount of it, but unlike the other macronutrients, it does not yield calories. Carbohydrates Carbohydrates are molecules composed of carbon, hydrogen, and oxygen.



# Installing Gemma-2b
We will be using Gemma_instruct_2b for this.

In [None]:
!pip install -q -U keras-nlp
!pip install -q -U keras>=3

import os

os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]="1.00"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.[0m[31m
[0m

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.2.1 which is incompatible.[0m[31m
[0m

In [None]:
import keras
import keras_nlp
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en") 

2024-04-12 13:38:59.847988: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-12 13:38:59.848085: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-12 13:38:59.964682: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Attaching 'config.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_instruct_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/

In [None]:
input_text = "What are macronutrients, and what role do they play in human body?"

outputs = gemma_lm.generate(input_text, max_length=256)
print(outputs)

What are macronutrients, and what role do they play in human body?

Sure, here's a detailed explanation of macronutrients and their role in the human body:

**Macronutrients**

Macronutrients are nutrients that the body needs in large amounts to maintain good health. They are essential for various bodily functions, including building and repairing tissues, producing energy, and regulating metabolism.

There are three main macronutrients:

* **Carbohydrates:** Provide energy for the body's cells and tissues.
* **Proteins:** Build and repair tissues, produce enzymes, and help regulate metabolism.
* **Fats:** Insulate the body, help absorb vitamins, and provide energy.

**Role of Macronutrients in the Human Body**

* **Energy production:** Carbohydrates, proteins, and fats provide the body with energy.
* **Building and repairing tissues:** Proteins are essential for building and repairing tissues, such as muscles, bones, and cartilage.
* **Metabolism:** Macronutrients help regulate metabo

In [None]:
# Nutrition-style questions 
query_list = [
    "What are the macronutrients, and what roles do they play in the human body?",
    "How do vitamins and minerals differ in their roles and importance for health?",
    "Describe the process of digestion and absorption of nutrients in the human body.",
    "What role does fibre play in digestion? Name five fibre containing foods.",
    "How does saliva help with digestion?",
    "water soluble vitamins"
]


query_list

['What are the macronutrients, and what roles do they play in the human body?',
 'How do vitamins and minerals differ in their roles and importance for health?',
 'Describe the process of digestion and absorption of nutrients in the human body.',
 'What role does fibre play in digestion? Name five fibre containing foods.',
 'How does saliva help with digestion?',
 'water soluble vitamins']

### Augmenting our prompt with context items

We'd like to do with augmentation is take the results from our search for relevant resources and put them into the prompt that we pass to our LLM.


#### We want our prompt like this

Based on the following contexts:
- sdjfhasdfjh
- dfhsdlfj
- sdfsdakfjaslkjf
- iwqewiurbndf
- dsfsd;fadskjfh

Please answer the following query: What are the macronutrients and what do they do?
Answer:

In [None]:
def prompt_formatter(query: str, 
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one dotted paragraph
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    base_prompt = """Based on the following context items, please answer the query.
        Give yourself room to think by extracting relevant passages from the context before answering the query.
        Don't return the thinking, only return the answer.
        Make sure your answers are as explanatory as possible.
        Use the following examples as reference for the ideal answer style.
        \nExample 1:
        Query: What are the fat-soluble vitamins?
        Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
        \nExample 2:
        Query: What are the causes of type 2 diabetes?
        Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
        \nExample 3:
        Query: What is the importance of hydration for physical performance?
        Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
        \nNow use the following context items to answer the user query:
        {context}
        \nRelevant passages: <extract relevant passages from the context here>
        User query: {query}
        Answer:"""

    # Update base prompt with context items and query   
    prompt = base_prompt.format(context=context, query=query)

    return prompt

In [None]:
query = random.choice(query_list)
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query)
    
# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

Query: Describe the process of digestion and absorption of nutrients in the human body.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Based on the following context items, please answer the query.
        Give yourself room to think by extracting relevant passages from the context before answering the query.
        Don't return the thinking, only return the answer.
        Make sure your answers are as explanatory as possible.
        Use the following examples as reference for the ideal answer style.
        
Example 1:
        Query: What are the fat-soluble vitamins?
        Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
        
Example 2:
        Query: What are the causes of type 2 d

In [None]:
outputs = gemma_lm.generate(prompt,max_length=2048) 

print(f"Query: {query}")
print(f"RAG answer:\n{outputs.replace(prompt, '')}")

Query: Describe the process of digestion and absorption of nutrients in the human body.
RAG answer:
 The process of digestion and absorption of nutrients in the human body involves the breakdown of food molecules into smaller components that can be absorbed and taken into the body. The digestive system consists of several hollow tube-shaped organs including the mouth, pharynx, esophagus, stomach, small intestine, large intestine (colon), rectum, and anus. The process begins with the mouth, where food is chewed and mixed with saliva to break it down into smaller pieces. The food then passes down the esophagus to the stomach, where it is further broken down by enzymes. The food then passes through the small intestine, where it is further broken down into even smaller molecules. The nutrients from the food are then absorbed into the bloodstream through the walls of the small intestine. The waste products from digestion are then expelled from the body through the rectum and anus.


In [None]:
def ask(query, return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """
    
    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query)
                                        
    
    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU 
        
    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)
    
    # Generate an output 
    outputs = gemma_lm.generate(prompt, max_length=2048)

    # Remove prompt in output
    output_text = outputs.replace(prompt, "")

    # Only return the answer without the context items
    if return_answer_only:
        return output_text
    
    return output_text, context_items

In [None]:
query = random.choice(query_list)
print(f"Query: {query}")

return_answer_only = True
if return_answer_only:
    # Answer query with context and return context 
    answer = ask(query=query, return_answer_only=True)
    print(f"Answer:\n")
    print(answer)
else: 
    answer, context_items = ask(query=query, return_answer_only=False)
    print(f"Answer:\n")
    print(answer)
    print(f"\n\nContext items:")
    print(context_items)

Query: What role does fibre play in digestion? Name five fibre containing foods.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Answer:

 Fibre plays a crucial role in digestion by providing structure and support for the digestive tract, facilitating the breakdown of food into smaller molecules that can be absorbed by the body, and promoting the production of digestive enzymes. Five fibre-containing foods are: whole grains, fruits, vegetables, legumes, and nuts.
