## 1. Document/Text Processing and Embedding Creation

Ingredients:

- A PDF document of your choice. (This could be any kind of document; here we focus on PDFs.)
- An embedding model of your choice.

Steps:

1. Import the PDF document.
2. Process the text for embedding (e.g., split it into chunks of sentences).
3. Embed the text chunks with the embedding model.
4. Save the embeddings to a file for later use (they will stay on disk for years, or until you lose your hard drive).


### Import PDF file

In [1]:
import os
import requests

# Get PDF document path
pdf_path="Human-Nutrition-2020-Edition-1598491699.pdf"

# Download the file if it is not available
if not os.path.exists(pdf_path):
    print(f"[INFO] File doesn't exist, downloading...")

    #Enter the URL of the pdf
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"
    
    # the local filename to save the downloaded file
    filename=pdf_path
    # Send the GET request to the URL
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # open the file and save it
        with open(filename,"wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status Code: {response.status_code}")
else:
    print(f"File {pdf_path} exists")
    

File Human-Nutrition-2020-Edition-1598491699.pdf exists


Let's open the PDF.

In [2]:
import fitz # Requires PyMuPDF: pip install PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text:str)->str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n"," ").strip()
    return cleaned_text

def open_and_read_pdf(pdf_path:str)->list[dict]:
    doc=fitz.open(pdf_path)
    pages_and_texts=[]
    for page_number,page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_numbers":page_number-41, # Offset so numbering matches the page numbers printed in the book
                                "page_char_count":len(text),
                                "page_word_count":len(text.split(" ")),
                                "page_sentence_count_raw":len(text.split(".")),
                                "page_token_count":len(text)/4, # Rough estimate: 1 token ~= 4 characters
                                "text":text})
    return pages_and_texts


pages_and_texts=open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]
    

0it [00:00, ?it/s]

[{'page_numbers': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_numbers': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [3]:
import random
random.sample(pages_and_texts,k=3)

[{'page_numbers': 130,
  'page_char_count': 1404,
  'page_word_count': 225,
  'page_sentence_count_raw': 18,
  'page_token_count': 351.0,
  'text': 'longer than three months significantly reduces the incidence and  severity of diarrhea and respiratory illnesses.1  Zinc supplementation also has been found to be therapeutically  beneficial for the treatment of leprosy, tuberculosis, pneumonia,  and the common cold. Equally important to remember is that  multiple studies show that it is best to obtain your minerals and  vitamins from eating a variety of healthy foods.  Just as undernutrition compromises immune system health, so  does overnutrition. People who are obese are at increased risk for  developing immune system disorders such as asthma, rheumatoid  arthritis, and some cancers. Both the quality and quantity of fat  affect immune system function. High intakes of saturated and trans  fats negatively affect the immune system, whereas increasing your  intake of omega-3 fatty acids, fo

In [4]:
import pandas as pd

df=pd.DataFrame(pages_and_texts)
df.head()

| | page_numbers | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 3 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 147 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [5]:
df.describe().round(2)

| | page_numbers | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 |


### Further text processing (splitting pages into sentences)

There are two ways to do this:
1. Split on ".".
2. Use an NLP library such as spaCy or NLTK.
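
To see why splitting on "." alone is fragile, here is a minimal sketch in plain Python. The function name `naive_sentence_split` and the sample sentence are made up for illustration and are not taken from the PDF. Abbreviations and decimal numbers both create false sentence breaks, and an empty string still produces one element.

```python
def naive_sentence_split(text: str) -> list[str]:
    """Split on '.' and drop empty pieces (a deliberately naive baseline)."""
    return [piece.strip() for piece in text.split(".") if piece.strip()]

# Hypothetical sample text, not taken from the PDF
sample = "Dr. Smith ate 2.5 g of zinc. He felt fine."
pieces = naive_sentence_split(sample)

# Two real sentences, but the abbreviation and the decimal add extra breaks
print(len(pieces), pieces)

# A raw split of an empty page still returns one (empty) element
print(len("".split(".")))
```

This is likely why `page_sentence_count_raw` reports 1 for the empty page shown earlier, and why the raw counts tend to overestimate the number of sentences.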

In [6]:
from spacy.lang.en import English
nlp=English()

# Add a Sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create document instance
doc=nlp("This is a Sentence. This is another Sentence. I like elephants.")
assert len(list(doc.sents)) == 3

list(doc.sents)

[This is a Sentence., This is another Sentence., I like elephants.]

In [7]:
pages_and_texts[600]

{'page_numbers': 559,
 'page_char_count': 863,
 'page_word_count': 138,
 'page_sentence_count_raw': 16,
 'page_token_count': 215.75,
 'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Korsakoff syndrome can cause similar symptoms as beriberi such  as confusion, loss of coordination, vision changes, hallucinations,  and may progress to coma and death. This condition is specific  to alcoholics as diets high in alcohol can cause thiamin deficiency.  Other individuals at risk include individuals who also consume diets  typically low in micronutrients such as those with eating disorders,  elderly, and individuals who have gone through gastric bypass  surgery.5  Figure 9.10 The Role of Thiamin  Figure 9.11 Beriberi, Thiamin Deficiency  5. Fact Sheets for Health Professionals: Thiamin. National  Institute of Health, Office of Dietary Supplements.   https://ods.od.nih.gov/factsheets/Thiamin- HealthProfessional/. Updated Feburary 11, 2016.  Accessed October 22, 2017.  Water-Soluble Vitamins  

In [8]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    # Make sure all sentences are strings (the default type is a spaCy datatype)
    item["sentences"]=[str(sentence) for sentence in item["sentences"]]
    # Count the sentences
    item["page_sentence_count_spacy"]=len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [9]:
random.sample(pages_and_texts,k=1)

[{'page_numbers': 658,
  'page_char_count': 939,
  'page_word_count': 172,
  'page_sentence_count_raw': 8,
  'page_token_count': 234.75,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0    Iron Toxicity  The body excretes little iron and therefore the potential for  accumulation in tissues and organs is considerable. Iron  accumulation in certain tissues and organs can cause a host of  health problems in children and adults including extreme fatigue,  arthritis, joint pain, and severe liver and heart toxicity. In children,  death has occurred from ingesting as little as 200 mg of iron and  therefore it is critical to keep iron supplements out of children’s  reach. The IOM has set tolerable upper intake levels of iron (Table  11.2 “Dietary Reference Intakes for Iron”). Mostly a hereditary  disease, hemochromatosis is the result of a genetic mutation that  leads to abnormal iron metabolism and an accumulation of iron in  certain tissues such as the liver, pancreas, and heart. The sig

In [10]:
df=pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_numbers | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 |


### Chunking our Sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.
There is no 100% correct way to do it.
We will keep it simple and split into groups of 10 sentences. (However, you could also try 5, 7, 8, or other numbers.)

There are frameworks that do this, such as LangChain.
Why we do this:
1. So our texts are easier to filter (smaller groups of text can be easier to inspect than long passages).
2. So our text chunks can fit into our embedding model's context window (e.g., 384 tokens as a limit).
3. So the contexts passed to an LLM can be more specific and focused.
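
As a rough illustration of the second point, the sketch below estimates a chunk's token count with the same "1 token is about 4 characters" rule of thumb used in this notebook and checks it against a 384-token limit. The helper names `estimate_tokens` and `fits_context_window` are hypothetical, and a real tokenizer may give a noticeably different count, so treat this as a sanity check rather than an exact measurement.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return len(text) / chars_per_token

def fits_context_window(text: str, max_tokens: int = 384) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= max_tokens

# Placeholder strings standing in for real chunks
short_chunk = "a" * 1200  # estimated at 300 tokens
long_chunk = "a" * 2000   # estimated at 500 tokens

print(fits_context_window(short_chunk))
print(fits_context_window(long_chunk))
```

If many chunks fail this check, a smaller sentence group size (say 5 instead of 10) may be worth trying.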

In [11]:
# Define split size to turn groups of sentences into chunks.
num_sentence_chunk_size=10

# Create a function to split a list of texts into sublists of at most slice_size items.
# E.g., [20] -> [10,10], [25]->[10,10,5]

def split_list(input_list:list[str],
               slice_size:int=num_sentence_chunk_size)-> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0,len(input_list),slice_size)]

## test function
test_list=list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [12]:
df.columns

Index(['page_numbers', 'page_char_count', 'page_word_count',
       'page_sentence_count_raw', 'page_token_count', 'text', 'sentences',
       'page_sentence_count_spacy'],
      dtype='object')

In [13]:
## Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"]=split_list(input_list=item["sentences"],slice_size=num_sentence_chunk_size)
    item["num_chunks"]=len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [14]:
random.sample(pages_and_texts,k=1)

[{'page_numbers': 448,
  'page_char_count': 1704,
  'page_word_count': 274,
  'page_sentence_count_raw': 18,
  'page_token_count': 426.0,
  'text': 'Health Benefits of Moderate  Alcohol Intake  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  In contrast to excessive alcohol intake, moderate alcohol intake has  been shown to provide health benefits. The data is most convincing  for preventing heart disease in middle-aged and older people. A  review of twenty-nine studies concluded that moderate alcohol  intake reduces the risk of coronary heart disease by about 30  percent in comparison to those who do not consume alcohol.1  Several other studies demonstrate that moderate alcohol  consumption reduces the incidences of stroke and heart attack, and  also death caused by cardiovascular and heart disease. The drop in  risk for these adverse events ranges between percent. Moreover,  there is some scientific evidence that moderate alcohol 

In [15]:
df=pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_numbers | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 | 1.53 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 | 3.0 |


In [16]:
df.columns

Index(['page_numbers', 'page_char_count', 'page_word_count',
       'page_sentence_count_raw', 'page_token_count', 'text', 'sentences',
       'page_sentence_count_spacy', 'sentence_chunks', 'num_chunks'],
      dtype='object')

### Splitting each chunk into its own item

We would like to embed each chunk of sentences into its own numerical representation.
That gives us a good level of granularity: we can dive specifically into the text sample that was used in our model.

In [17]:
import re

# Split each chunk into its own item
pages_and_chunks=[]
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict={}
        chunk_dict["page_number"]=item["page_numbers"]

        # Join the sentences together into paragraph-like structure, aka join the list of sentences into one paragraph
        joined_sentence_chunk="".join(sentence_chunk).replace("  "," ").strip()
        joined_sentence_chunk=re.sub(r'\.([A-z])',r'. \1',joined_sentence_chunk) # ".A" => ". A" (note: [A-z] also matches lowercase letters and a few symbols; use [A-Z] for capitals only)
        chunk_dict["sentence_chunk"]=joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"]=len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"]=len(list(joined_sentence_chunk.split(" ")))
        chunk_dict["chunk_token_count"]=len(joined_sentence_chunk)/4 # 1 token= ~ 4 chars
        pages_and_chunks.append(chunk_dict)
len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843

In [18]:
pages_and_texts[615]

{'page_numbers': 574,
 'page_char_count': 1083,
 'page_word_count': 228,
 'page_sentence_count_raw': 35,
 'page_token_count': 270.75,
 'text': 'Dietary Sources  Biotin can be found in foods such as eggs, fish, meat, seeds, nuts  and certain vegetables. For the pantothenic acid content of various  foods, see Table 9.22 Biotin Content of Various Foods”.  Table 9.22 Biotin Content of Various Foods  Food  Serving Biotin  (mcg)  Percent Daily  Value*  Eggs  1 large  10  33.3  Salmon, canned  3 oz.  5  16.6  Pork chop  3 oz.  3.8  12.6  Sunflower seeds  ¼ c.  2.6  8.6  Sweet potato  ½ c.  2.4  8  Almonds  ¼ c.  1.5  5  Tuna, canned  3 oz.  0.6  2  Broccoli  ½ c.  0.4  1.3  Banana  ½ c.  0.2  0.6  * Current AI used to determine  Percent Daily Value  Fact Sheet for Health Professionals: Biotin. National Institute of  Health, Office of Dietary Supplements. https://ods.od.nih.gov/ factsheets/Biotin-HealthProfessional/. Updated October 3, 2017.  Accessed November 10, 2017.  Vitamin B6 (Pyridoxine

In [19]:
random.sample(pages_and_chunks,k=1)

[{'page_number': 980,
  'sentence_chunk': 'We cannot overstate the importance of eating a healthy, well-balanced diet designed to provide all of the necessary nutrients. Food contains many more beneficial substances, such as phytochemicals and fiber, that promote good 3. Watson S. How to Evaluate Vitamins and Supplements. WebMD. com. http://www. webmd. com/vitamins-and- supplements/lifestyle-guide -11/how-to-evaluate- vitamins-supplements. Accessed March 11, 2018. 980 | Food Supplements and Food Replacements',
  'chunk_char_count': 470,
  'chunk_word_count': 61,
  'chunk_token_count': 117.5}]

In [20]:
df=pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 735.35 | 114.0 | 183.84 |
| std | 347.79 | 447.16 | 70.77 | 111.79 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 317.5 | 46.5 | 79.38 |
| 50% | 586.0 | 747.0 | 116.0 | 186.75 |
| 75% | 890.0 | 1119.0 | 174.0 | 279.75 |
| max | 1166.0 | 1830.0 | 297.0 | 457.5 |


### Filter chunks of text for short chunks

Very short chunks (for example, page headers and footers) may not contain much useful information.

In [21]:
# Show random chunks of 30 tokens or fewer
min_token_length = 30

for row in df[df["chunk_token_count"]<=min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')
    

Chunk token count: 26.25 | Text: Updated November 6, 2015. Accessed April 15, 2018. 1122 | Undernutrition, Overnutrition, and Malnutrition
Chunk token count: 15.75 | Text: PART IV CHAPTER 4. CARBOHYDRATES Chapter 4. Carbohydrates | 227
Chunk token count: 15.75 | Text: PART XVII CHAPTER 17. FOOD SAFETY Chapter 17. Food Safety | 985
Chunk token count: 3.75 | Text: 806 | Pregnancy
Chunk token count: 11.75 | Text: Accessed March 17, 2018. Sports Nutrition | 961


In [22]:
# Filter the DataFrame to keep only rows with more than 30 tokens.
pages_and_chunks_over_min_token_len=df[df["chunk_token_count"]>min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [23]:
random.sample(pages_and_chunks_over_min_token_len,k=1)

[{'page_number': 126,
  'sentence_chunk': 'After the osteoid tissue is built up, the bone tissue begins to mineralize. The last step of bone remodeling continues for months, and for a much longer time afterward the mineralized bone is continuously packed in a more dense fashion. Thus, we can say that bone is a living tissue that continually adapts itself to mechanical stress through the process of remodeling. For bone tissue to remodel certain nutrients such as calcium, phosphorus, magnesium, fluoride, vitamin D, and vitamin K are required. Bone Mineral Density Is an Indicator of Bone Health Bone mineral density (BMD) is a measurement of the amount of calcified tissue in grams per centimeter squared of bone tissue. BMD can be thought of as the total amount of bone mass in a 126 | The Skeletal System',
  'chunk_char_count': 767,
  'chunk_word_count': 131,
  'chunk_token_count': 191.75}]

### Embedding our text chunks

Embeddings are a broad but powerful concept.
While humans understand text, machines understand numbers.
What we would like to do:
- Turn our text chunks into numbers, specifically embeddings: a useful numerical representation.

The best part about embeddings is that they are a *learned* representation.

In [24]:
from sentence_transformers import SentenceTransformer
embedding_model=SentenceTransformer(model_name_or_path="all-mpnet-base-v2",device="cuda")

# Create a list of sentences
sentences = ["The Sentence Transformer library provides an easy way to create embeddings",
             "Sentences can be embedded one by one or in a list",
             "I like horses!"]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences,embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")    



Sentence: The Sentence Transformer library provides an easy way to create embeddings
Embedding: [-3.17512639e-02  3.37267816e-02 -2.52437778e-02  5.22287712e-02
 -2.35248711e-02 -6.19115727e-03  1.35026146e-02 -6.25501126e-02
  7.50833051e-03 -2.29684636e-02  2.98146978e-02  4.57555167e-02
 -3.26700248e-02  1.39847435e-02  4.18013781e-02 -5.92969619e-02
  4.26309630e-02  5.04660420e-03 -2.44552288e-02  3.98594374e-03
  3.55897695e-02  2.78742686e-02  1.84098613e-02  3.67699936e-02
 -2.29960624e-02 -3.01796924e-02  5.99479070e-04 -3.64503972e-02
  5.69104664e-02 -7.49943545e-03 -3.70004401e-02 -3.04359244e-03
  4.64355014e-02  2.36148317e-03  9.06849948e-07  7.00033177e-03
 -3.92289571e-02 -5.95697341e-03  1.38653098e-02  1.87107606e-03
  5.34202345e-02 -6.18613735e-02  2.19613519e-02  4.86050807e-02
 -4.25697863e-02 -1.69858839e-02  5.04178517e-02  1.54733760e-02
  8.12859386e-02  5.07106148e-02 -2.27496978e-02 -4.35720831e-02
 -2.18389416e-03 -2.14091502e-02 -2.01758258e-02  3.0683271

In [25]:
embeddings[0].shape

(768,)

In [26]:
embeddings= embedding_model.encode("My favourite animal is the cow!")
embeddings

array([-1.45473508e-02,  7.66727105e-02, -2.85872407e-02, -3.31283286e-02,
        3.65210511e-02,  4.78570461e-02, -7.08107576e-02,  1.62834208e-02,
        1.93444081e-02, -2.80481651e-02, -2.91747134e-02,  5.11309877e-02,
       -3.28720286e-02, -8.98752827e-03, -1.03672603e-02, -3.15488242e-02,
        4.22784053e-02, -9.13283601e-03, -1.94017272e-02,  4.35689166e-02,
       -2.31997557e-02,  4.29883152e-02, -1.72393434e-02, -2.01372597e-02,
       -3.13573815e-02,  8.08166619e-03, -2.06725094e-02, -2.27869488e-02,
        2.44812574e-02,  1.71968378e-02, -6.26673028e-02, -7.54797012e-02,
        3.57422046e-02, -5.46571193e-03,  1.24730320e-06, -7.63200829e-03,
       -3.53222117e-02,  1.91326886e-02,  3.99045721e-02,  2.11729109e-03,
        1.64565817e-02,  9.84050520e-03, -1.80700645e-02,  9.33833607e-03,
        3.23482864e-02,  5.84785938e-02,  4.23187427e-02,  1.62091199e-02,
       -9.14911404e-02,  1.82305183e-02, -5.25729358e-03, -7.81017635e-03,
       -3.47644128e-02, -

In [None]:
# %%time
# # Embedding on our chunks of data
# embedding_model.to("cpu")
# # Embed each chunk one by one
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item["embedding"]=embedding_model.encode(item["sentence_chunk"])
    

In [31]:
%%time
# Embedding on our chunks of data
embedding_model.to("cuda")
# Embed each chunk one by one
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"]=embedding_model.encode(item["sentence_chunk"])
    

  0%|          | 0/1681 [00:00<?, ?it/s]

CPU times: user 32.2 s, sys: 2.06 s, total: 34.2 s
Wall time: 31.6 s


In [32]:
%%time
text_chunks=[item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[419]

CPU times: user 270 μs, sys: 23 μs, total: 293 μs
Wall time: 303 μs


'often. • Calm your “sweet tooth” by eating fruits, such as berries or an apple. • Replace sugary soft drinks with seltzer water, tea, or a small amount of 100 percent fruit juice added to water or soda water. The Food Industry: Functional Attributes of Carbohydrates and the Use of Sugar Substitutes In the food industry, both fast-releasing and slow-releasing carbohydrates are utilized to give foods a wide spectrum of functional attributes, including increased sweetness, viscosity, bulk, coating ability, solubility, consistency, texture, body, and browning capacity. The differences in chemical structure between the different carbohydrates confer their varied functional uses in foods. Starches, gums, and pectins are used as thickening agents in making jam, cakes, cookies, noodles, canned products, imitation cheeses, and a variety of other foods. Molecular gastronomists use slow- releasing carbohydrates, such as alginate, to give shape and texture to their fascinating food creations. Add

In [33]:
len(text_chunks)

1681

In [34]:
%%time
# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=16, # Experiment
                                               convert_to_tensor=True)

CPU times: user 24.9 s, sys: 630 ms, total: 25.5 s
Wall time: 22.1 s


In [35]:
text_chunk_embeddings

tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')

In [36]:
pages_and_chunks_over_min_token_len[419]

{'page_number': 277,
 'sentence_chunk': 'often. • Calm your “sweet tooth” by eating fruits, such as berries or an apple. • Replace sugary soft drinks with seltzer water, tea, or a small amount of 100 percent fruit juice added to water or soda water. The Food Industry: Functional Attributes of Carbohydrates and the Use of Sugar Substitutes In the food industry, both fast-releasing and slow-releasing carbohydrates are utilized to give foods a wide spectrum of functional attributes, including increased sweetness, viscosity, bulk, coating ability, solubility, consistency, texture, body, and browning capacity. The differences in chemical structure between the different carbohydrates confer their varied functional uses in foods. Starches, gums, and pectins are used as thickening agents in making jam, cakes, cookies, noodles, canned products, imitation cheeses, and a variety of other foods. Molecular gastronomists use slow- releasing carbohydrates, such as alginate, to give shape and texture 

In [37]:
### Save Embeddings to file
text_chunks_and_embeddings_df=pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path="text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path,index=False)

In [39]:
# Import saved file and view
text_chunks_and_embedding_df_load=pd.read_csv("text_chunks_and_embeddings_df.csv")
text_chunks_and_embedding_df_load.head()

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.0 | [ 6.74242675e-02 9.02281329e-02 -5.09549491e-... |
| 1 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.5 | [ 5.52156270e-02 5.92139587e-02 -1.66167449e-... |
| 2 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 116 | 191.5 | [ 2.79801972e-02 3.39813679e-02 -2.06426457e-... |
| 3 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 144 | 235.25 | [ 6.82566985e-02 3.81275043e-02 -8.46854504e-... |
| 4 | -35 | The Cardiovascular System University of Hawai‘... | 998 | 152 | 249.5 | [ 3.30264494e-02 -8.49768892e-03 9.57159698e-... |
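
Note that the `embedding` column comes back from the CSV as text (for example `[ 6.74242675e-02 9.02281329e-02 ...`), not as an array of numbers. The following is a minimal pure-Python sketch of turning such a string back into floats. The helper name `parse_embedding` and the shortened three-value string are assumptions for illustration only.

```python
def parse_embedding(text: str) -> list[float]:
    """Parse a whitespace-separated embedding string such as '[ 0.1 -0.2 3e-02]'."""
    # str.split() with no arguments splits on any whitespace, including newlines,
    # so long arrays that were wrapped over several lines are handled too.
    return [float(value) for value in text.strip("[]").split()]

# Shortened, made-up example in the same format as the CSV column
saved = "[ 6.74e-02  9.02e-02 -5.10e-03]"
vector = parse_embedding(saved)
print(len(vector), vector)
```

For larger datasets, a binary format that preserves arrays may be a better fit than CSV, since it avoids this string round trip.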


## 2. RAG - Search and Answer

#### Similarity search

Similarity search (also called semantic search or vector search) is the idea of searching on vibe.

If this sounds like woo, it's not.

Perhaps searching by meaning is a better description.

With keyword search, you are trying to match the string "apple" with the string "apple".

With similarity/semantic search, you might search for "macronutrients functions" and get back pieces of text that match that meaning, even if they don't contain the words "macronutrients functions".
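
To make the limitation of keyword search concrete, here is a small sketch. The `keyword_match` helper and the example passage are hypothetical and not taken from the textbook. The passage is arguably about what macronutrients do, yet an exact string match on the query finds nothing.

```python
def keyword_match(query: str, passage: str) -> bool:
    """Return True only if the exact query string appears in the passage."""
    return query.lower() in passage.lower()

# Made-up passage for illustration
passage = ("Carbohydrates, proteins, and lipids provide energy "
           "and building materials for the body.")

# The query's meaning is related, but its exact words never appear
print(keyword_match("macronutrients functions", passage))

# A literal keyword that does appear is found
print(keyword_match("proteins", passage))
```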

In [64]:
import random
import torch
import numpy as np
import pandas as pd

device="cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embeddings_df =  pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when saved in csv)

text_chunks_and_embeddings_df["embedding"]=text_chunks_and_embeddings_df["embedding"].apply(lambda x:np.fromstring(x.strip("[]"),sep=" "))

# Convert embeddings into torch.tensor.
embeddings=torch.tensor(np.stack(text_chunks_and_embeddings_df["embedding"].tolist(),axis=0),dtype=torch.float32).to(device)

# Convert texts and embeddings df to list of dicts
pages_and_chunks=text_chunks_and_embeddings_df.to_dict(orient="records")

text_chunks_and_embeddings_df


| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.00 | [0.0674242675, 0.0902281329, -0.00509549491, -... |
| 1 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.50 | [0.055215627, 0.0592139587, -0.0166167449, -0.... |
| 2 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 116 | 191.50 | [0.0279801972, 0.0339813679, -0.0206426457, 0.... |
| 3 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 144 | 235.25 | [0.0682566985, 0.0381275043, -0.00846854504, -... |
| 4 | -35 | The Cardiovascular System University of Hawai‘... | 998 | 152 | 249.50 | [0.0330264494, -0.00849768892, 0.00957159698, ... |
| ... | ... | ... | ... | ... | ... | ... |
| 1676 | 1164 | Flashcard Images Note: Most images in the flas... | 1304 | 186 | 326.00 | [0.0185622536, -0.0164277833, -0.0127045522, -... |
| 1677 | 1164 | Hazard Analysis Critical Control Points reused... | 374 | 51 | 93.50 | [0.0334720351, -0.0570440702, 0.015148947, -0.... |
| 1678 | 1165 | ShareAlike 11. Organs reused “Pancreas Organ A... | 1285 | 175 | 321.25 | [0.0770515576, 0.00978558231, -0.0121817458, 0... |
| 1679 | 1165 | Sucrose reused “Figure 03 02 05” by OpenStax B... | 410 | 63 | 102.50 | [0.103045136, -0.0164702125, 0.0082684597, 0.0... |


In [47]:
text_chunks_and_embeddings_df["embedding"]

0       [0.0674242675, 0.0902281329, -0.00509549491, -...
1       [0.055215627, 0.0592139587, -0.0166167449, -0....
2       [0.0279801972, 0.0339813679, -0.0206426457, 0....
3       [0.0682566985, 0.0381275043, -0.00846854504, -...
4       [0.0330264494, -0.00849768892, 0.00957159698, ...
                              ...                        
1676    [0.0185622536, -0.0164277833, -0.0127045522, -...
1677    [0.0334720351, -0.0570440702, 0.015148947, -0....
1678    [0.0770515576, 0.00978558231, -0.0121817458, 0...
1679    [0.103045136, -0.0164702125, 0.0082684597, 0.0...
1680    [0.0863773674, -0.0125358971, -0.0112746442, 0...
Name: embedding, Length: 1681, dtype: object

In [49]:
len(embeddings)

1681

In [52]:
embeddings.shape

torch.Size([1681, 768])

In [53]:
embeddings

tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       dtype=torch.float64)

#### Embedding model

In [56]:
## Create model
from sentence_transformers import util,SentenceTransformer
embedding_model=SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                    device=device)

Embedding model ready!

Let's create a small semantic search pipeline.

In essence, we want to search for a query (e.g., "macronutrient functions") and get back relevant passages from our textbook.

We can do so with the following steps:
1. Define a query string.
2. Turn the query string into an embedding.
3. Compute the dot product or cosine similarity between the text embeddings and the query embedding.
4. Sort the results from step 3 in descending order.

Note: To use the dot product for comparison, ensure the vectors have the same shape and the tensors/vectors have the same data type (e.g., both are torch.float32).
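
The choice between dot product and cosine similarity in step 3 matters only when vectors are not normalized. The sketch below uses plain Python and tiny made-up 2-D vectors (real embeddings here have 768 dimensions) to show that the raw dot product grows with vector length, while cosine similarity does not, and that the two agree once vectors are scaled to unit length. All function names are illustrative.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]

# Made-up vectors for illustration
query = [3.0, 4.0]
passage = [4.0, 3.0]
long_passage = [40.0, 30.0]  # same direction as passage, 10x the magnitude

# Raw dot products differ because of magnitude alone
print(dot(query, passage), dot(query, long_passage))

# Cosine similarity ignores magnitude, so both passages score the same
print(cosine_similarity(query, passage), cosine_similarity(query, long_passage))

# After normalization, the dot product matches the cosine similarity
print(dot(normalize(query), normalize(passage)))
```

This is why the dot product is a reasonable shortcut when the embedding model's outputs are already normalized, and cosine similarity is the safer choice otherwise.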

In [75]:
# 1. Define the query.
query = "macronutrients functions"
#query = "Breastfeeding infant timeline"
print(f"Query:{query}")
# 2. Embed the query.
query_embedding=embedding_model.encode(query, convert_to_tensor=True).to(device)

# 3. Get similarity scores with the dot product (use cosine similarity if outputs aren't normalized)
from time import perf_counter as timer

start_time=timer()
dot_scores=util.dot_score(a=query_embedding,b=embeddings)[0]
end_time=timer()

print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings:{end_time-start_time:.5f} seconds.")

# 4. Get top 5 results.
top_results_dot_product=torch.topk(dot_scores,k=5)
top_results_dot_product

Query:macronutrients functions
[INFO] Time taken to get scores on 1681 embeddings:0.00028 seconds.


torch.return_types.topk(
values=tensor([0.6926, 0.6738, 0.6646, 0.6536, 0.6473], device='cuda:0'),
indices=tensor([42, 47, 41, 51, 46], device='cuda:0'))

In [74]:
pages_and_chunks[1151]

{'page_number': 816,
 'sentence_chunk': 'milk is the best source to fulfill nutritional requirements. An exclusively breastfed infant does not even need extra water, including in hot climates. A newborn infant (birth to 28 days) requires feedings eight to twelve times a day or more. Between 1 and 3 months of age, the breastfed infant becomes more efficient, and the number of feedings per day often become fewer even though the amount of milk consumed stays the same. After about six months, infants can gradually begin to consume solid foods to help meet nutrient needs. Foods that are added in addition to breastmilk are called complementary foods. Complementary foods should be nutrient dense to provide optimal nutrition. Complementary foods include baby meats, vegetables, fruits, infant cereal, and dairy products such as yogurt, but not infant formula. Infant formula is a substitute, not a complement to breastmilk. In addition to complementary foods, the World Health Organization recommen

### Semantic search/vector search extensions

We've covered an example of using embedding vector search to find relevant results based on a query.

However, you could also add to this pipeline with traditional keyword search.

Many modern search systems use keyword and vector search in tandem.

Our dataset is small and allows for an exhaustive search (comparing the query to every possible result), but if you start to work with large-scale datasets with hundreds of thousands, millions, or even billions of vectors, you'll want to implement an index.

You can think of an index as sorting your embeddings before you search through them.

So it narrows down the search space.

For example, it would be inefficient to search every word in the dictionary to find the word "duck". Instead, you'd go straight to the letter D, perhaps even straight to the back half of the letter D, and find words close to "duck" before finding it.

That's how an index can help search through many examples without compromising too much on speed or quality (for more on this, check out nearest neighbour search).

One of the most popular indexing libraries is Faiss.

Faiss is open-source and was originally created by Facebook to deal with internet-scale vectors. It implements many algorithms, such as HNSW (Hierarchical Navigable Small World).
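
The dictionary analogy above can be sketched in a few lines of plain Python. This is a toy illustration only: it groups words by first letter, which is not how Faiss or HNSW work internally (vector indexes group items by proximity in embedding space). The function names and word list are made up.

```python
from collections import defaultdict

def build_index(words):
    """Group words by first letter so a lookup only scans one bucket."""
    index = defaultdict(list)
    for word in words:
        index[word[0]].append(word)
    return index

def search(index, query):
    """Scan only the bucket for the query's first letter.

    Returns whether the query was found and how many candidates were scanned.
    """
    bucket = index.get(query[0], [])
    return query in bucket, len(bucket)

# Made-up vocabulary for illustration
words = ["apple", "duck", "dog", "date", "zebra", "cat"]
index = build_index(words)

found, candidates_scanned = search(index, "duck")
print(found, candidates_scanned, "of", len(words))
```

The trade-off is the same in spirit as with approximate nearest neighbour search: some work is spent up front building the structure so that each query touches only a small part of the data.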

## Checking local GPU memory availability

In [78]:
# Get total GPU memory (total_memory is the device's capacity, not the memory currently free)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 8 GB
