Steps:

1. Import the PDF document.
2. Process the text for embedding (split it into chunks of sentences).
3. Embed the text chunks with an embedding model.
4. Save the embeddings to a file for later use.

In [3]:
import os 
import requests

# get pdf doc 
pdf_path = "human-nutrition-text.pdf"

if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # URL of the PDF (no leading space, which would make the URL invalid)
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    response = requests.get(url)

    if response.status_code == 200:
        with open(pdf_path,"wb") as file:
            file.write(response.content)
        print(f"[INFO] File downloaded and saved as {pdf_path}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")
else:
    print(f"file {pdf_path} exists")

file human-nutrition-text.pdf exists


In [4]:
import fitz  # PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    cleaned_text = text.replace('\n', " ").strip()
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_text = []
    for page_number, page in tqdm(enumerate(doc)):  
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_text.append({
            "page_number": page_number-41,  # Offset so numbers match the book's printed page numbers
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text)/4,  # Rough estimate: 1 token is about 4 characters
            "text": text
        })

    return pages_and_text



In [6]:
pages_and_texts  = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [7]:
import random 
random.sample(pages_and_texts,k=3)

[{'page_number': 457,
  'page_char_count': 1597,
  'page_word_count': 271,
  'page_sentence_count_raw': 15,
  'page_token_count': 399.25,
  'text': 'details on food composition data, go to the USDA Food Composition  Databases page.  An Organism Requires Energy and Nutrient  Input  Energy is required in order to build molecules into larger  macromolecules (like proteins), and to turn macromolecules into  organelles and cells, which then turn into tissues, organs, and organ  systems, and finally into an organism. Proper nutrition provides  the necessary nutrients to make the energy that supports life’s  processes. Your body builds new macromolecules from the  nutrients in food.  Nutrient and Energy Flow  Energy is stored in a nutrient’s chemical bonds. Energy comes from  sunlight, which plants capture and, via photosynthesis, use it to  transform carbon dioxide in the air into the molecule glucose. When  the glucose bonds are broken, energy is released. Bacteria, plants,  and animals (in

In [8]:
import pandas as pd 
df=pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...


Token count
1. Embedding models cannot handle an unlimited number of tokens.
2. LLMs cannot handle an unlimited number of tokens either.

For example, some embedding models may have been trained to embed sequences of 384 tokens into a numerical space.
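The token estimate used for `page_token_count` can be turned into a quick check against a model's limit. This is a minimal sketch, assuming the rough rule of 4 characters per token and the 384-token example limit; the helper names `estimate_token_count` and `fits_in_context` are hypothetical, and a real tokenizer will give different counts.

```python
# Rough rule of thumb (an assumption, not a real tokenizer): 1 token is about 4 characters.
CHARS_PER_TOKEN = 4
# Example limit from the note above; check the documentation of the model you use.
MAX_TOKENS = 384


def estimate_token_count(text: str) -> float:
    """Estimate the number of tokens in text from its character count."""
    return len(text) / CHARS_PER_TOKEN


def fits_in_context(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the estimated token count is within max_tokens."""
    return estimate_token_count(text) <= max_tokens


sample = "Human Nutrition: 2020 Edition"
print(estimate_token_count(sample))  # 7.25
print(fits_in_context(sample))       # True
```

Pages whose estimate exceeds the limit are the ones that need to be split into smaller chunks before embedding.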

## Splitting pages into sentences

1. We can do this by splitting on ". ".
2. We can use NLP libraries such as spaCy and NLTK.
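To see why the first option is only a rough approximation, here is a small sketch with a made-up example sentence (not taken from the PDF). Splitting on ". " breaks at abbreviations and drops the trailing periods, which is one reason to consider a dedicated sentence splitter.

```python
# Hypothetical example text with an abbreviation and a decimal number.
text = "Dr. Smith recommends 2.5 servings. Eat more fiber."

# Naive approach: split wherever a period is followed by a space.
naive_sentences = text.split(". ")

print(naive_sentences)
# ['Dr', 'Smith recommends 2.5 servings', 'Eat more fiber.']
print(len(naive_sentences))  # 3 pieces, although the text has only 2 sentences
```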

In [22]:
from spacy.lang.en import English 
nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")
# Create a document instance
doc = nlp("This is a sentence, This is another sentence. I like elephant")
# assert len(list(doc.sents)) == 3

# doc.sents is a generator, so printing it shows the generator object, not the sentences
print(doc.sents)

<generator object at 0x178ea2e60>


In [28]:
from spacy.lang.en import English 
nlp = English()

# sentencizer pipeline 
nlp.add_pipe("sentencizer")
# document instance 
doc = nlp("This is a sentence. This is another sentence. I like elephants.")
assert len(list(doc.sents)) == 3 

# print(doc.sents)
for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.
I like elephants.


In [30]:
pages_and_texts[1200]

{'page_number': 1159,
 'page_char_count': 1575,
 'page_word_count': 285,
 'page_sentence_count_raw': 14,
 'page_token_count': 393.75,
 'text': '32. Figure 15.2 reused “Two Women Riding Bikes” by David Marcu/  Unsplash License  33. Figure 15. reused “Man wearing blue shirt standing on white  surfboard” by Alex Blajan / Unsplash License  34. Figure 16.3 Anaerobic versus Aerobic Metabolism by Allison  Calabrese / CC BY 4.0  35. Figure 16.4 The Effect of Exercise Duration on Energy Systems”  by Allison Calabrese / CC BY 4.0  36. Figure 16.5 “Fuel Sources for Anaerobic and Aerobic  Metabolism reused “Liver” by Maritacovarrubias / Public  Domain; “Bread” by Jack7 / Public Domain; “Muscle types” by  Bruce Balus / CC BY-SA 4.0; “Tango style chicken leg” by  Rugby471 / Public Domain; “Male body silhouette” by mlampret  / Public Domain  37. Figure 16.6 The Effect of Exercise Intensity on Fuel Sources  reused “Happy reading guy” from Max Pixel / CC0; “Surfers  surfing waters” by hhach / CC0; “Foo

In [31]:
for item in tqdm(pages_and_texts):
    item['sentences'] = list(nlp(item['text']).sents)
    
    # convert each spaCy Span to a plain string
    item['sentences'] = [str(sentence) for sentence in item['sentences']]

    # count sentences 
    item['page_sentence_count_spacy'] = len(item['sentences'])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [33]:
random.sample(pages_and_texts,k=1)

[{'page_number': 59,
  'page_char_count': 629,
  'page_word_count': 109,
  'page_sentence_count_raw': 4,
  'page_token_count': 157.25,
  'text': 'Digestive  system  without  labels by  Mariana  Ruiz / Public  Domain  Knowing how to maintain the balance of friendly bacteria in your  intestines through proper diet can promote overall health. Recent  scientific studies have shown that probiotic supplements positively  affect intestinal microbial flora, which in turn positively affect  immune system function. As good nutrition is known to influence  immunity, there is great interest in using probiotic foods and other  immune-system-friendly foods as a way to prevent illness. In this  chapter we will explore not only immune system function, but also  Introduction  |  59',
  'sentences': ['Digestive  system  without  labels by  Mariana  Ruiz / Public  Domain  Knowing how to maintain the balance of friendly bacteria in your  intestines through proper diet can promote overall health.',
   'Rec

In [34]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.59,198.89,9.97,287.15,10.32
std,348.86,560.44,95.75,6.19,140.11,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.75,134.0,4.0,190.69,5.0
50%,562.5,1232.5,215.0,10.0,308.12,10.0
75%,864.25,1605.25,271.25,14.0,401.31,15.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0


#### Chunking sentences into groups of 10 (an arbitrary number, based on page_sentence_count_spacy)

Purpose of chunking:
1. Text is easier to filter (useful when debugging).
2. Text chunks can fit into our embedding model's context window (384 tokens as the limit).
3. The LLM can be more focused and specific.

e.g. a page with 20 sentences is split into two chunks of 10 ->
[20] -> [10, 10]

In [36]:
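num_sentence_chunk_size = 10


def split_list(input_list: list[str],
               slice_size: int = num_sentence_chunk_size) -> list[list[str]]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    # Fixed: the slice must use slice_size (split_size was undefined)
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]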

In [None]:
list(range(25))
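
A quick way to check the chunking logic is to run it on this 25-item list. The sketch below redefines a standalone version of the splitting function so it runs on its own; it assumes a chunk size of 10, as in the notes above.

```python
def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


test_list = list(range(25))
chunks = split_list(test_list)

# 25 items with a chunk size of 10 give two full chunks and one remainder chunk.
print([len(chunk) for chunk in chunks])  # [10, 10, 5]
print(chunks[-1])                        # [20, 21, 22, 23, 24]
```

The last chunk is simply shorter when the list length is not a multiple of the chunk size, so no sentences are dropped.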