# 1. Data Preprocessing

### Import the documents, videos & audio files


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os

def list_files_in_folder(folder_path):
    """
    Returns a list of full file paths in the given folder.

    Args:
    - folder_path (str): The path to the folder.

    Returns:
    - List of full file paths.
    """
    try:
        return [
            os.path.join(folder_path, file)
            for file in os.listdir(folder_path)
            if os.path.isfile(os.path.join(folder_path, file))
        ]
    except FileNotFoundError:
        print(f"The folder '{folder_path}' does not exist.")
        return []


In [4]:
audio_folder_path = './rag_docs/audio_files'
# audio_folder_path = '/content/drive/MyDrive/RAG_PIPELINE/context_docs/audio_files'

audio_files = list_files_in_folder(audio_folder_path)

pdf_files = [
    {"file_path": "/content/drive/MyDrive/RAG_PIPELINE/context_docs/pdf_files/Werum MES Optimization Pharma .pdf", "page_number_offset": 0},
    {"file_path": "/content/drive/MyDrive/RAG_PIPELINE/context_docs/pdf_files/ZVEI_MES_Brochure_EN.pdf", "page_number_offset": 0},
    {"file_path": "/content/drive/MyDrive/RAG_PIPELINE/context_docs/pdf_files/Manufacturing Execution Systems Integration and Intelligence.pdf", "page_number_offset": 0},
    {"file_path": "/content/drive/MyDrive/RAG_PIPELINE/context_docs/pdf_files/human-nutrition-text.pdf", "page_number_offset": 0}
]
audio_files

['./rag_docs/audio_files/What is MES (Manufacturing Execution System)_.mp3',
 './rag_docs/audio_files/test.mp3',
 './rag_docs/audio_files/What is MES_ Manufacturing Execution Systems.mp3',
 './rag_docs/audio_files/Top 10 Manufacturing Execution Systems [Best Manufacturing Software].mp3']

## Handling audio files

In [10]:
# print("[INFO] Running in Google Colab, installing requirements.")
# !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
# !pip install PyMuPDF # for reading PDFs with Python
# !pip install tqdm # for progress bars
# !pip install pandas # for DataFrames and summary statistics
# !pip install spacy
# !pip install sentence-transformers # for embedding models
# !pip install accelerate # for quantization model loading
# !pip install bitsandbytes # for quantizing models (less storage space)
# !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

In [7]:
%%capture
!pip install git+https://github.com/neuml/txtai#egg=txtai[api,pipeline]


In [12]:
%%capture

from txtai.pipeline import Transcription

# Create transcription model
transcribe = Transcription("openai/whisper-base")

In [13]:
from IPython.display import Audio, display

def transcribe_audio_files(files, display_files=False) -> list[dict]:
    transcriptions = []
    for file in files:
        # Call the transcription function (replace `transcribe` with your actual function)
        text = transcribe([file])[0]  # Assuming `transcribe` returns a list of texts

        # Display the audio file and transcription
        if display_files:
            display(Audio(file))
            print(text)

        # Append transcription to the result list
        transcriptions.append({
            "audio_source_file": file,
            "text": text,
            "char_count": len(text),
            "word_count": len(text.split(" ")),
            "sentence_count_raw": len(text.split(". ")),
            "token_count": len(text) / 4,
        })

    return transcriptions

# Example usage
audio_to_text = transcribe_audio_files(audio_files)
audio_to_text



[{'audio_source_file': './rag_docs/audio_files/What is MES (Manufacturing Execution System)_.mp3',
  'text': "What is MES? MES stands for Manufacturing Execution System, meaning a control system for monitoring and managing work and process on the factory floor. But that's an oversimplification of what a successful MES software implementation can do for manufacturers. MES provides detailed resource scheduling and status, production, dispatch, and sequencing. Traceability, genealogy, inventory, quality assurance, maintenance, management, document control, performance, analysis, and more. MES is crucial for manufacturers because it exists in a space between business-oriented applications like ERP and SCADA HMI systems designed to directly control plant-floor operations. While an ERP can help allocate resources, it lacks the level of detail that MES provides. MES allows for real time, minute to minute, or quicker resource scheduling, as well as handling execution and dispatch. MES connects
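
The statistics stored with each transcription are rough heuristics rather than exact counts. As a minimal sketch, the same four values can be computed by one helper; `text_stats` is a hypothetical name that does not appear in the notebook, and the sample string is made up.

```python
def text_stats(text: str) -> dict:
    """Rough size statistics for a piece of text (hypothetical helper)."""
    return {
        "char_count": len(text),
        "word_count": len(text.split(" ")),           # space-delimited words
        "sentence_count_raw": len(text.split(". ")),  # naive: splits on ". "
        "token_count": len(text) / 4,                 # heuristic: ~4 characters per token
    }

sample_stats = text_stats("Hello world. This is a test.")
print(sample_stats)
```

A helper like this could serve both the audio transcriptions and the PDF pages, which currently repeat the same expressions under different key names.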

### Helper functions to cleanup text

In [None]:
def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

### Processing PDF files

In [None]:

import fitz  # PyMuPDF
from tqdm.auto import tqdm

def open_and_read_pdf(pdf_path: str, page_number_offset: int) -> list[dict]:
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number - page_number_offset,  # adjust page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

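def read_pdfs(pdf_files: list[dict]) -> list[dict]:
    """Reads every PDF in pdf_files and returns all pages as one flat list."""
    res = []
    for item in pdf_files:
        texts = open_and_read_pdf(item["file_path"], item["page_number_offset"])
        res.extend(texts)  # add this file's pages to the combined list
    return res


# Only the Human Nutrition textbook (last entry in pdf_files) is processed below
pdf_path = pdf_files[-1]["file_path"]
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path, page_number_offset=41)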


In [None]:
import random
random.sample(pages_and_texts, 2)

[{'page_number': 1139,
  'page_char_count': 1386,
  'page_word_count': 221,
  'page_sentence_count_raw': 19,
  'page_token_count': 346.5,
  'text': 'in the United States.11 The program provides Electronic Benefit  Transfers (EBT) which work similarly to a debit card. Clients receive  a card with a certain allocation of money for each month that can be  used only for food. In 2010, the average benefit was about $134 per  person, per month and total federal expenditures for the program  were $68.2 billion.12  The Special, Supplemental Program for Women,  Infants, and Children  The Special, Supplemental Program for Women, Infants and  Children  (WIC)  provides  food  packages  to  pregnant  and  breastfeeding women, as well as to infants and children up to age  five, to promote adequate intake for healthy growth and  development. Most state WIC programs provide vouchers that  participants use to acquire supplemental packages at authorized  stores. In 2010, WIC served approximately 9.2 mil

In [None]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...


In [None]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0
std,348.86,560.38,95.76,6.19,140.1
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.0,134.0,4.0,190.5
50%,562.5,1231.5,214.5,10.0,307.88
75%,864.25,1603.5,271.0,14.0,400.88
max,1166.0,2308.0,429.0,32.0,577.0


### Processing text into sentences using spaCy

In [None]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")

def nlp_sentence_splitter(text: str) -> list[str]:
  sentences = list(nlp(text).sents)
  sentences = [str(sentence).strip() for sentence in sentences]
  return sentences



In [None]:
for item in tqdm(pages_and_texts):
  item["sentences"] = nlp_sentence_splitter(item["text"])
  item["page_sentence_count_nlp"] = len(item["sentences"])

random.sample(pages_and_texts, 2)


[{'page_number': 682,
  'page_char_count': 235,
  'page_word_count': 50,
  'page_sentence_count_raw': 2,
  'page_token_count': 58.75,
  'text': 'Image by  Chris55 / CC  BY 4.0\xa0\xa0 A large  goiter by Dr.  J.S.Bhandari,  India / CC  BY-SA 3.0  Figure 11.6 Iodine Deficiency: Goiter  Dietary Reference Intakes for Iodine  Table 11.8 Dietary Reference Intakes for Iodine  682  |  Iodine',
  'sentences': ['Image by  Chris55 / CC  BY 4.0\xa0\xa0 A large  goiter by Dr.  J.S.Bhandari,  India / CC  BY-SA 3.0  Figure 11.6 Iodine Deficiency: Goiter  Dietary Reference Intakes for Iodine  Table 11.8 Dietary Reference Intakes for Iodine  682  |  Iodine'],
  'page_sentence_count_nlp': 1},
 {'page_number': 780,
  'page_char_count': 1155,
  'page_word_count': 199,
  'page_sentence_count_raw': 10,
  'page_token_count': 288.75,
  'text': 'Learning Objectives  By the end of this chapter you will be able to:  •  Describe the physiological basis for nutrient  requirements from pregnancy through the toddler
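
In the page 682 sample above, the raw count is 2 while the spaCy sentencizer reports 1. The raw count comes from `text.split(". ")`, which is only an approximation. The sketch below uses made-up strings and plain Python (no spaCy) to show two ways the naive split can go wrong.

```python
# Two sentences, but the abbreviation "Dr." also contains ". "
text = "Dr. Smith reviewed the MES rollout. It went live on schedule."
naive_parts = text.split(". ")
print(naive_parts)

# Two sentences with no space after the first full stop: the boundary is missed
glued = "It works.Next step is testing."
glued_parts = glued.split(". ")
print(len(naive_parts), len(glued_parts))
```

The first string is over-counted and the second is under-counted, which is why `page_sentence_count_raw` is kept only as a rough reference next to `page_sentence_count_nlp`.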

In [None]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_nlp
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0,10.32
std,348.86,560.38,95.76,6.19,140.1,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.0,134.0,4.0,190.5,5.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0
75%,864.25,1603.5,271.0,14.0,400.88,15.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0


In [None]:
for item in tqdm(audio_to_text):
  item["sentences"] = nlp_sentence_splitter(item["text"])
  item["page_sentence_count_nlp"] = len(item["sentences"])

random.sample(audio_to_text, 2)

In [None]:
df = pd.DataFrame(audio_to_text)
df.describe().round(2)

### Chunking sentences together

In [None]:
num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of a desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (the last one may be shorter).

    For example, a list of 17 sentences would be split into two lists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

for item in tqdm(audio_to_text):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])


In [None]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_nlp,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0,10.32,1.53
std,348.86,560.38,95.76,6.19,140.1,6.3,0.64
min,-41.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,260.75,762.0,134.0,4.0,190.5,5.0,1.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0,1.0
75%,864.25,1603.5,271.0,14.0,400.88,15.0,2.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0,3.0


In [None]:
df = pd.DataFrame(audio_to_text)
df.describe().round(2)
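
To make the chunking behaviour concrete, here is a small self-contained check of the slicing logic used by `split_list`. The function body is copied from the cell above; the 17 placeholder sentences are made up for illustration.

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

sentences = [f"Sentence {n}." for n in range(17)]  # 17 placeholder sentences
chunks = split_list(sentences, slice_size=10)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)

# A page with no sentences produces no chunks at all
empty_chunks = split_list([], slice_size=10)
print(empty_chunks)
```

The empty case matches the minimum of 0 in the `num_chunks` column of the summary table: pages without text contribute no chunks.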

### Splitting each chunk into its own item

In [None]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)


1843
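
The join-and-repair step above can be traced on a tiny example. This sketch uses made-up sentences; it shows that the regex only restores the space when the next sentence starts with a capital letter, and that joining with a space (an alternative, not what the notebook does) avoids the problem.

```python
import re

sentence_chunk = [
    "Lead toxicity is preventable.",
    "It involves  identifying hazards.",  # note the double space
    "remove them early.",                 # starts with a lowercase letter
]

# Same steps as the notebook: join without a separator, collapse double spaces, repair ".A"
joined = "".join(sentence_chunk).replace("  ", " ").strip()
joined = re.sub(r'\.([A-Z])', r'. \1', joined)
print(joined)

# Alternative (assumption, not the notebook's approach): join with a single space
spaced = " ".join(sentence_chunk).replace("  ", " ").strip()
print(spaced)
```

In the first result the boundary before "remove" stays glued together, because the pattern only matches a full stop followed by an uppercase letter.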

In [None]:
random.sample(pages_and_chunks, 1)

[{'page_number': 882,
  'sentence_chunk': 'Treatment for lead poisoning includes removing the child from the source of contamination and extracting lead from the body. Extraction may involve chelation therapy, which binds with lead so it can be excreted in urine. Another treatment protocol, EDTA therapy, involves administering a drug called ethylenediaminetetraacetic acid to remove lead from the bloodstream of patients with levels greater than 45 mcg/dL.9 Fortunately, lead toxicity is highly preventable. It involves identifying potential hazards, such as lead paint and pipes, and removing them before children are exposed to them. Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities. These activities are available in the web-based textbook and not available in the downloadable versions (EPUB, Digital PDF, Print_PDF, or Open Document). Learning activities may be used across various

In [None]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,733.59,111.95,183.4
std,347.79,447.58,71.32,111.89
min,-41.0,12.0,3.0,3.0
25%,280.5,314.0,44.0,78.5
50%,586.0,745.0,113.0,186.25
75%,890.0,1118.0,172.0,279.5
max,1166.0,1831.0,297.0,457.75


### Filtering out chunks with 30 tokens or fewer

In [None]:
min_token_length = 30
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]
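
The same filter can be written without pandas. This is a minimal sketch on made-up records, meant to show the boundary behaviour: the comparison is strict, so a chunk with exactly 30 estimated tokens is dropped as well.

```python
min_token_length = 30

# Made-up chunk records for illustration
chunks = [
    {"sentence_chunk": "682 | Iodine", "chunk_token_count": 3.0},
    {"sentence_chunk": "x" * 120, "chunk_token_count": 30.0},
    {"sentence_chunk": "y" * 308, "chunk_token_count": 77.0},
]

# Keep only chunks strictly above the threshold
kept = [chunk for chunk in chunks if chunk["chunk_token_count"] > min_token_length]
print(len(kept))
```

Very short chunks are typically headers, footers, or page labels, which add little useful context for retrieval.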

# 2. Embedding Generation

In [None]:
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the embedding model once so repeated calls don't reload it
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device)

def generate_embeddings(text_list):
  """
  Generates embeddings for a string or a list of strings.

  Args:
    text_list: A string or a list of strings.

  Returns:
    A torch.Tensor of embeddings (one row per string; a 1-D tensor for a single string).
  """
  embeddings = embedding_model.encode(text_list, convert_to_tensor=True)
  return embeddings

In [None]:
%%time

text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks_embeddings = generate_embeddings(text_chunks)

CPU times: user 17 s, sys: 142 ms, total: 17.1 s
Wall time: 28.5 s


In [None]:
text_chunks_embeddings

tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')

### Saving embeddings to vector database

In [None]:
!pip install pinecone
import os
from pinecone import Pinecone, ServerlessSpec

# Read the API key from the environment instead of hard-coding it in the notebook
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "rag"

pc.create_index(
    name=index_name,
    dimension=768, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)



In [None]:
index = pc.Index(index_name)
def save_to_pinecone(embeddings, text_list):
    upsert_data = [
        (str(i), embeddings[i].tolist()) for i in range(len(text_list))
    ]
    index.upsert(vectors=upsert_data)

    print(f"Successfully stored {len(text_list)} embeddings in Pinecone.")

save_to_pinecone(text_chunks_embeddings[:10], text_chunks[:10])

Successfully stored 10 embeddings in Pinecone.
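
Only the first 10 embeddings are stored above, and the chunk text itself is not saved. For the full set, requests are usually sent in batches, and the text can travel along as metadata. The sketch below only builds the batches in plain Python; `batched_records` and the batch size are hypothetical choices, and the vectors are tiny stand-ins for the real 768-dimensional embeddings.

```python
from typing import Iterator

def batched_records(embeddings, text_list, batch_size=100) -> Iterator[list[tuple]]:
    """Yield lists of (id, values, metadata) tuples, at most batch_size per list."""
    batch = []
    for i, (vector, text) in enumerate(zip(embeddings, text_list)):
        # For torch tensors, vector.tolist() would be used instead of list(vector)
        batch.append((str(i), list(vector), {"text": text}))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover records that did not fill a whole batch
        yield batch

# Stand-in data: 5 tiny vectors instead of real embeddings
fake_embeddings = [[0.1 * i, 0.2, 0.3] for i in range(5)]
fake_texts = [f"chunk {i}" for i in range(5)]

batches = list(batched_records(fake_embeddings, fake_texts, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

With a real index, each batch would then be passed to `index.upsert(vectors=batch)`.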


In [None]:
def query_pinecone_db(query_text, k=5):
  """Embeds query_text and returns the ids and scores of the top-k matches in Pinecone."""
  query_embedding = generate_embeddings([query_text])[0].tolist()
  res = index.query(vector=query_embedding, top_k=k)
  indices = [int(item["id"]) for item in res["matches"]]
  scores = [float(item["score"]) for item in res["matches"]]
  return indices, scores


In [None]:
query_text = "nutrients are good"
ids, scores = query_pinecone_db(query_text)
ids, scores


([1, 0, 3, 7, 5],
 [0.464331776, 0.450847924, 0.444324851, 0.420161039, 0.393756837])

### Similarity search

In [None]:
from sentence_transformers import util
import torch
from time import perf_counter as timer

def get_top_k_scores(query_embedding, embeddings, k=5):
    """
    Computes the similarity scores between a query embedding and a set of embeddings,
    and returns the top-k indices and scores.

    Args:
    - query_embedding: Tensor representing the query embedding.
    - embeddings: Tensor representing the embeddings to compare against.
    - k (int): Number of top results to return (default is 5).

    Returns:
    - dict: Contains two keys:
        - "top_indices": List of top-k indices.
        - "top_scores": List of top-k scores.
    """
    start_time = timer()
    dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
    end_time = timer()

    print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time:.5f} seconds.")

    top_results_dot_product = torch.topk(dot_scores, k=k)
    top_indices = top_results_dot_product.indices.tolist()
    top_scores = top_results_dot_product.values.tolist()

    return {"top_indices": top_indices, "top_scores": top_scores}
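
What `util.dot_score` followed by `torch.topk` computes can be illustrated in plain Python. This is a sketch on made-up 2-dimensional vectors, not the library implementation; `top_k_dot` is a hypothetical name.

```python
def top_k_dot(query, vectors, k=5):
    """Return (indices, scores) of the k vectors with the largest dot product with query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return ranked, [scores[i] for i in ranked]

query = [1.0, 0.0]
vectors = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]  # all unit length

top_indices, top_scores = top_k_dot(query, vectors, k=2)
print(top_indices, top_scores)
```

For unit-length vectors like these, the dot product equals the cosine similarity, so both metrics produce the same ranking.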

In [None]:
import textwrap

def display_top_results(query, top_results, text_chunks, wrap_length=80):
    """
    Displays the top results with their scores and wrapped text for better readability.

    Args:
    - query (str): The query text.
    - top_results (dict): Output of `get_top_k_scores`, with keys:
        - "top_indices": List of indices of the top results.
        - "top_scores": List of the corresponding scores.
    - text_chunks (list): List of text chunks corresponding to embeddings.
    - wrap_length (int): Maximum line length for wrapping the text (default is 80).
    """
    def print_wrapped(text, wrap_length):
        wrapped_text = textwrap.fill(text, wrap_length)
        print(wrapped_text)

    print(f"Query: '{query}'\n")
    print("Results:")
    idxs = top_results["top_indices"]
    scores = top_results["top_scores"]

    for idx, score in zip(idxs, scores):
        print(f"Score: {score:.4f}")
        print(f"Page number: {pages_and_chunks_over_min_token_len[idx]['page_number']}")
        print("Text:")
        print_wrapped(text_chunks[idx], wrap_length)
        print("\n")


In [None]:
query = "macronutrients functions"
query_embedding = generate_embeddings(query)
result = get_top_k_scores(query_embedding, text_chunks_embeddings)

display_top_results(query, result, text_chunks)

Time taken to get scores on 1679 embeddings: 0.00013 seconds.
Query: 'macronutrients functions'

Results:
Score: 0.6926
Page number: 5
Text:
Macronutrients Nutrients that are needed in large amounts are called
macronutrients. There are three classes of macronutrients: carbohydrates,
lipids, and proteins. These can be metabolically processed into cellular energy.
The energy from macronutrients comes from their chemical bonds. This chemical
energy is converted into cellular energy that is then utilized to perform work,
allowing our bodies to conduct their basic functions. A unit of measurement of
food energy is the calorie. On nutrition food labels the amount given for
“calories” is actually equivalent to each calorie multiplied by one thousand. A
kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with
the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a
macronutrient in the sense that you require a large amount of it, but unlike the
other

# 3. RAG workflow

In [None]:
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 15 GB


In [None]:
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
    # Still set defaults so the variables printed below are always defined
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

GPU memory: 15 | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.
use_quantization_config set to: False
model_id set to: google/gemma-2b-it
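
The thresholds above follow from a simple rule of thumb: the memory needed for the weights is roughly the number of parameters times the bytes per parameter. The sketch below applies that rule. The parameter counts are approximate assumptions, not values taken from this notebook, and activations and the generation cache need additional memory on top.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights in GiB (weights only)."""
    return n_params * bits_per_param / 8 / 2**30

# Approximate parameter counts (assumed for illustration)
approx_params = {"gemma-2b": 2.5e9, "gemma-7b": 8.5e9}

estimates = {
    (name, bits): estimate_weight_memory_gb(n_params, bits)
    for name, n_params in approx_params.items()
    for bits in (16, 4)
}
for (name, bits), gb in estimates.items():
    print(f"{name} at {bits}-bit: ~{gb:.1f} GB")
```

Under these assumptions, a 7B model in float16 would not leave much headroom on a 15 GB GPU, which is consistent with the 2B model being selected here.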


In [None]:
# Step 1: Install Required Libraries
!pip install transformers accelerate huggingface-hub
# Step 2: Log in to Hugging Face (read the token from the environment instead of hard-coding it)
import os
from huggingface_hub import login
login(token=os.environ["HF_TOKEN"])




In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# 1. Create quantization config for smaller model loading (optional)
# Requires !pip install bitsandbytes accelerate, see: https://github.com/TimDettmers/bitsandbytes, https://huggingface.co/docs/accelerate/
# For models that require 4-bit quantization (use this if you have low GPU memory available)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

# Bonus: Setup Flash Attention 2 for faster inference, default to "sdpa" or "scaled dot product attention" if it's not available
# Flash Attention 2 requires NVIDIA GPU compute capability of 8.0 or above, see: https://developer.nvidia.com/cuda-gpus
# Requires !pip install flash-attn, see: https://github.com/Dao-AILab/flash-attention
if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
  attn_implementation = "flash_attention_2"
else:
  attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")

# 2. Pick a model we'd like to use (this will depend on how much GPU memory you have available)
# model_id was set in the GPU memory check above; override it here if needed
# model_id = "google/gemma-7b-it"
print(f"[INFO] Using model_id: {model_id}")

# 3. Instantiate tokenizer (tokenizer turns text into numbers ready for the model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False, # use full memory
                                                 attn_implementation=attn_implementation) # which attention version to use

if not use_quantization_config: # quantization takes care of device setting automatically, so if it's not used, send model to GPU
    llm_model.to("cuda")

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: google/gemma-2b-it


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


In [None]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((2048,), eps=1e-

In [None]:
pages_and_chunks = pages_and_chunks_over_min_token_len

In [None]:
input_text = "What are the macronutrients, and what roles do they play in the human body?"
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
What are the macronutrients, and what roles do they play in the human body?

Prompt (formatted):
<bos><start_of_turn>user
What are the macronutrients, and what roles do they play in the human body?<end_of_turn>
<start_of_turn>model
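
Judging by the formatted prompt printed above, a single user turn is wrapped in turn markers and followed by an open model turn. The sketch below rebuilds that string by hand to make the structure explicit. `format_gemma_user_turn` is a hypothetical helper covering only this single-turn case; in practice `tokenizer.apply_chat_template` remains the safer choice.

```python
def format_gemma_user_turn(user_message: str) -> str:
    """Rebuild the single-turn prompt layout shown in the printed output above."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

manual_prompt = format_gemma_user_turn("What are the macronutrients?")
print(manual_prompt)
```

The trailing `<start_of_turn>model` line is what `add_generation_prompt=True` contributes: it cues the model to begin its answer.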



In [None]:
%%time

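# Tokenize the prompt (turn it into numbers) and send it to the GPU
# Note: the chat template already added <bos>, and the tokenizer prepends another one by default
# (hence the two leading 2s in input_ids); pass add_special_tokens=False to avoid the duplicate
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")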
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256) # define the maximum number of new tokens to create
print(f"Model output (tokens):\n{outputs[0]}\n")

Model input (tokenized):
{'input_ids': tensor([[     2,      2,    106,   1645,    108,   1841,    708,    573, 186809,
         184592, 235269,    578,   1212,  16065,    749,    984,   1554,    575,
            573,   3515,   2971, 235336,    107,    108,    106,   2516,    108]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]], device='cuda:0')}



In [None]:
# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<bos><bos><start_of_turn>user
What are the macronutrients, and what roles do they play in the human body?<end_of_turn>
<start_of_turn>model
Sure, here's a breakdown of the macronutrients and their roles in the human body:

**Macronutrients:**

* **Carbohydrates:**
    * Provide energy for the body's cells and tissues.
    * Carbohydrates are the primary source of energy for most cells.
    * Complex carbohydrates are those that take longer to digest, such as whole grains, fruits, and vegetables.
    * Simple carbohydrates are those that are quickly digested, such as sugar, starch, and lactose.

* **Proteins:**
    * Build and repair tissues, enzymes, and hormones.
    * Proteins are essential for immune function, hormone production, and tissue repair.
    * There are different types of proteins, each with specific functions.

* **Fats:**
    * Provide energy, insulation, and help absorb vitamins.
    * Healthy fats include olive oil, avocado, nuts, and seeds.
  

In [None]:
print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<bos>', '').replace('<eos>', '')}")

Input text: What are the macronutrients, and what roles do they play in the human body?

Output text:
Sure, here's a breakdown of the macronutrients and their roles in the human body:

**Macronutrients:**

* **Carbohydrates:**
    * Provide energy for the body's cells and tissues.
    * Carbohydrates are the primary source of energy for most cells.
    * Complex carbohydrates are those that take longer to digest, such as whole grains, fruits, and vegetables.
    * Simple carbohydrates are those that are quickly digested, such as sugar, starch, and lactose.

* **Proteins:**
    * Build and repair tissues, enzymes, and hormones.
    * Proteins are essential for immune function, hormone production, and tissue repair.
    * There are different types of proteins, each with specific functions.

* **Fats:**
    * Provide energy, insulation, and help absorb vitamins.
    * Healthy fats include olive oil, avocado, nuts, and seeds.
    * Trans fats can raise cholesterol levels and increase the r
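
The clean-up in the cell above relies on chained `str.replace` calls. A minimal sketch of the same idea as a reusable helper is shown below. The helper name `strip_prompt_and_special_tokens` and the toy strings are hypothetical, and the sketch assumes the special tokens appear as the literal strings `<bos>` and `<eos>`, as in the decoded output above. Hugging Face tokenizers also accept `skip_special_tokens=True` in `decode`, which may be a simpler option when the markers are not needed.

```python
def strip_prompt_and_special_tokens(decoded: str, prompt: str,
                                    special_tokens: tuple[str, ...] = ("<bos>", "<eos>")) -> str:
    """Remove the echoed prompt and any special-token markers from decoded text."""
    # Strip the markers from both strings first, so a duplicated <bos> in the
    # decoded text cannot stop the prompt from matching as a prefix.
    for token in special_tokens:
        decoded = decoded.replace(token, "")
        prompt = prompt.replace(token, "")
    # Drop the prompt only when it really is a prefix of the decoded text
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.strip()


# Toy strings (hypothetical) that mimic the shape of the decoded output above
example_prompt = "<bos><start_of_turn>user\nWhat is fibre?<end_of_turn>\n<start_of_turn>model\n"
example_decoded = "<bos>" + example_prompt + "Fibre is indigestible plant material.<eos>"
print(strip_prompt_and_special_tokens(example_decoded, example_prompt))
```

Matching the prompt as a prefix, rather than replacing it anywhere in the text, avoids deleting a passage of the answer that happens to repeat part of the prompt.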

In [None]:
# Nutrition-style questions generated with GPT-4
gpt4_questions = [
    "What are the macronutrients, and what roles do they play in the human body?",
    "How do vitamins and minerals differ in their roles and importance for health?",
    "Describe the process of digestion and absorption of nutrients in the human body.",
    "What role does fibre play in digestion? Name five fibre containing foods.",
    "Explain the concept of energy balance and its importance in weight management."
]

# Manually created question list
manual_questions = [
    "How often should infants be breastfed?",
    "What are symptoms of pellagra?",
    "How does saliva help with digestion?",
    "What is the RDI for protein per day?",
    "water soluble vitamins"
]

query_list = gpt4_questions + manual_questions

In [None]:
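def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with the embedding model and returns the top-k scores and indices from embeddings.

    Note: n_resources_to_return and print_time are currently not forwarded to
    get_top_k_scores, so that function's own defaults apply.
    """

    # Embed the query
    query_embedding = generate_embeddings(query)
    # Score the query against every chunk embedding and keep the best matches
    res = get_top_k_scores(query_embedding=query_embedding, embeddings=embeddings)

    return res["top_scores"], res["top_indices"]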

In [None]:
import random
# query = random.choice(query_list)
query = "macronutrients functions"

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query, embeddings=text_chunks_embeddings)
# scores, indices
for idx in indices:
    print(text_chunks[idx])

Query: macronutrients functions
Time taken to get scores on 1679 embeddings: 0.00013 seconds.
Macronutrients Nutrients that are needed in large amounts are called macronutrients. There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions. A unit of measurement of food energy is the calorie. On nutrition food labels the amount given for “calories” is actually equivalent to each calorie multiplied by one thousand. A kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a macronutrient in the sense that you require a large amount of it, but unlike the other macronutrients, it does not yield calories. Ca
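
The retrieval step above (embed the query, score it against every chunk embedding, keep the best matches) can be sketched without any dependencies. The notebook's own `get_top_k_scores` is not shown in this section, so the dot product as the similarity measure, the function names, and the toy vectors below are all assumptions for illustration.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(query_vec: list[float],
                         chunk_vecs: list[list[float]],
                         k: int = 5) -> tuple[list[float], list[int]]:
    """Return (scores, indices) of the k chunks with the highest dot product."""
    scored = [(dot(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    # nlargest returns the k best pairs, sorted from highest to lowest score
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [score for score, _ in best], [idx for _, idx in best]


# Toy 2-dimensional "embeddings" (hypothetical values)
toy_chunks = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [0.8, 0.6]
toy_scores, toy_indices = top_k_by_dot_product(toy_query, toy_chunks, k=2)
print(toy_indices)  # index 1 scores 0.96, index 0 scores 0.8, index 2 scores 0.6
```

With normalised embeddings the dot product equals cosine similarity, so the ranking would be the same under either measure.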

In [None]:
def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one bulleted list, one "- " item per line
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt

In [None]:
query = random.choice(query_list)
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=text_chunks_embeddings)

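# Create a list of context items
context_items = [pages_and_chunks_over_min_token_len[i] for i in indices]

# Format prompt with context items (the generation cell below reads `prompt`,
# so it must be rebuilt for every new query)
prompt = prompt_formatter(query=query,
                          context_items=context_items)
# print(prompt)

# Display the retrieved context items
context_items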

Query: How often should infants be breastfed?
Time taken to get scores on 1679 embeddings: 0.00007 seconds.


[{'page_number': 816,
  'sentence_chunk': 'milk is the best source to fulfill nutritional requirements. An exclusively breastfed infant does not even need extra water, including in hot climates. A newborn infant (birth to 28 days) requires feedings eight to twelve times a day or more. Between 1 and 3 months of age, the breastfed infant becomes more efficient, and the number of feedings per day often become fewer even though the amount of milk consumed stays the same. After about six months, infants can gradually begin to consume solid foods to help meet nutrient needs. Foods that are added in addition to breastmilk are called complementary foods. Complementary foods should be nutrient dense to provide optimal nutrition. Complementary foods include baby meats, vegetables, fruits, infant cereal, and dairy products such as yogurt, but not infant formula. Infant formula is a substitute, not a complement to breastmilk. In addition to complementary foods, the World Health Organization recomm
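
Once the context items are collected, they are folded into the prompt as a bulleted list. The sketch below isolates that step. The template is a deliberately short, hypothetical stand-in for the notebook's `base_prompt`, and the items are toy data in the same `{"page_number", "sentence_chunk"}` shape as the output above.

```python
def build_context_block(context_items: list[dict]) -> str:
    """Render each item's "sentence_chunk" as one "- " bullet per line."""
    return "- " + "\n- ".join(item["sentence_chunk"] for item in context_items)


# Hypothetical template, much shorter than the notebook's base_prompt
TOY_TEMPLATE = "Context:\n{context}\n\nUser query: {query}\nAnswer:"

toy_items = [
    {"page_number": 1, "sentence_chunk": "Water does not yield calories."},
    {"page_number": 2, "sentence_chunk": "Lipids {fats} provide energy."},
]

# str.format only interprets braces in the template itself, so braces inside
# a retrieved chunk (as in the second item) are inserted unchanged.
toy_prompt = TOY_TEMPLATE.format(context=build_context_block(toy_items),
                                 query="Which macronutrient has no calories?")
print(toy_prompt)
```

Keeping the bullet rendering in its own function makes it easy to inspect exactly which passages reach the model before the chat template is applied.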

In [None]:
%%time

input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate an output of tokens
outputs = llm_model.generate(**input_ids,
                             temperature=0.7, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
                             do_sample=True, # whether or not to use sampling, see https://huyenchip.com/2024/01/16/sampling.html for more
                             max_new_tokens=256) # how many new tokens to generate from prompt

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

print(f"Query: {query}")
print(f"RAG answer:\n{output_text.replace(prompt, '')}")

Query: How often should infants be breastfed?
RAG answer:
<bos>Relevant passages from the context are:

> "Dietary fiber is categorized as either water-soluble or insoluble. Some examples of soluble fibers are inulin, pectin, and guar gum and they are found in peas, beans, oats, barley, and rye. Cellulose and lignin are insoluble fibers and a few dietary sources of them are whole-grain foods, flax, cauliflower, and avocados. Cellulose is the most abundant fiber in plants, making up the cell walls and providing structure. Soluble fibers are more easily accessible to bacterial enzymes in the large intestine so they can be broken down to a greater extent than insoluble fibers, but even some breakdown of cellulose and other insoluble fibers occurs."

> "Fiber promotes the growth and development of colonic cells, inhibits colonic inflammation, and stimulates the immune system."<eos>
CPU times: user 5.3 s, sys: 13 ms, total: 5.31 s
Wall time: 5.34 s
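
The `temperature` comment in the cell above can be made concrete with a small worked example. This is a conceptual sketch of temperature scaling (divide the logits by the temperature, then apply softmax) on made-up logits. It does not reproduce the internals of `llm_model.generate`, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for three candidate tokens
toy_logits = [2.0, 1.0, 0.0]
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature almost all probability mass sits on the highest-scoring token, so sampling behaves nearly deterministically. At a higher temperature the distribution flattens and less likely tokens are sampled more often.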


In [None]:
def ask(query,
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True,
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """

    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query, text_chunks_embeddings)

    # Create a list of context items
    context_items = [pages_and_chunks_over_min_token_len[i] for i in indices]

    # Add score to context item
    # for i, item in enumerate(context_items):
    #     item["score"] = scores[i].cpu() # return score back to CPU

    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)

    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)

    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = (output_text.replace(prompt, "")
                                  .replace("<bos>", "")
                                  .replace("<eos>", "")
                                  .replace("Sure, here is the answer to the user query:\n\n", ""))

    # Only return the answer without the context items
    if return_answer_only:
        return output_text

    return output_text, context_items

In [None]:
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)
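
`textwrap.fill` treats its input as a single paragraph, so line breaks in a model answer (for example, one bullet per line) are collapsed into running text. A possible alternative, sketched below under the hypothetical name `wrap_preserving_lines`, wraps each original line separately so the existing breaks survive.

```python
import textwrap


def wrap_preserving_lines(text: str, wrap_length: int = 80) -> str:
    """Wrap each original line separately so existing line breaks survive."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill returns "" for a blank line, which keeps paragraph gaps
        wrapped_lines.append(textwrap.fill(line, wrap_length))
    return "\n".join(wrapped_lines)


# Toy answer (hypothetical) with one bullet per line
toy_answer = "* Carbohydrates provide energy.\n* Proteins build and repair tissues."
print(wrap_preserving_lines(toy_answer, wrap_length=40))
```

Both toy lines are shorter than the wrap length, so they are printed unchanged, one bullet per line.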

In [None]:
query = "macronutrients functions"
print(f"Query: {query}")


# Answer query with context and return context
answer, context_items = ask(query=query,
                            temperature=0.7,
                            max_new_tokens=512,
                            return_answer_only=False)

print(f"Answer:\n")
print_wrapped(answer)
print(f"Context items:")
context_items

Query: macronutrients functions
Time taken to get scores on 1679 embeddings: 0.00007 seconds.
Answer:

Sure, here are the relevant passages from the context:  * Nutrients are
substances required by the body to perform its basic functions. * Nutrients have
one or more of three basic functions: they provide energy, contribute to body
structure, and/or regulate chemical processes in the body. * Carbohydrates
provide energy, proteins provide structure to bones, muscles and skin, and play
a role in conducting most of the chemical reactions that take place in the body.
* Lipids provide energy, support cell growth and repair, and help to create
hormones and cell membranes. * Proteins are macromolecules composed of chains of
subunits called amino acids. * Vitamins are nutrients required by the body in
lesser amounts, but are still essential for carrying out bodily functions.
Context items:


[{'page_number': 5,
  'sentence_chunk': 'Macronutrients Nutrients that are needed in large amounts are called macronutrients. There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions. A unit of measurement of food energy is the calorie. On nutrition food labels the amount given for “calories” is actually equivalent to each calorie multiplied by one thousand. A kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a macronutrient in the sense that you require a large amount of it, but unlike the other macronutrients, it does not yield calories. Carbohydrates Carbohydrates are molecules composed of c