<a href="https://colab.research.google.com/github/kdhenderson/msds_colab_notebooks/blob/main/RAG_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation
## Part 1



# Step 0: Install and import useful packages

In [None]:
# PyMuPDF -> parse PDFs; transformers -> Hugging Face models; faiss-cpu (Facebook AI library) -> vector similarity search
%pip install PyMuPDF transformers faiss-cpu

Collecting PyMuPDF
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
Installing collected packages: PyMuPDF, faiss-cpu
Successfully installed PyMuPDF-1.25.5 faiss-cpu-1.11.0


In [None]:
# nltk -> Natural Language Toolkit (used here for sentence tokenization)
%pip install nltk



In [None]:
import os
import fitz  # PyMuPDF
from transformers import AutoTokenizer, AutoModel
import torch
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import faiss
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Step 1: Read PDF Files

In [None]:
# Mount Google Drive to the notebook
from google.colab import drive
drive.mount('/content/drive')

# Example folder path in Google Drive
folder_path = '/content/drive/My Drive/PDFs/'  # Adjust this to your folder path
#file_path = '/content/drive/MyDrive/documents/my_pdf_file.pdf'

def read_pdfs(folder_path):
    pdf_texts = []
    for file_name in os.listdir(folder_path):  # can put many pdfs in here (will slow it down)
        if file_name.endswith('.pdf'):
            file_path = os.path.join(folder_path, file_name)
            try:
                doc = fitz.open(file_path)  # fitz function (digest pdfs)
                text = ""
                for page in doc:
                    text += page.get_text()
                pdf_texts.append((file_name, text))
            except Exception as e:
                print(f"Error reading {file_name}: {e}")
    return pdf_texts

# Run the function
pdf_contents = read_pdfs(folder_path)

# Display the results
for file_name, text in pdf_contents:
    print(f"Contents of {file_name}:\n{text[:1000]}...")  # Display the first 1,000 characters as a preview

Mounted at /content/drive
Contents of ds_6371_syllabus Ver 7.pdf:
Course Syllabus: DS 6371 Statistical Foundations for Data Science 
 
Course Designers:  
Dr. Bivin Sadler and Dr. Monnie McGee 
Course Text: 
Ramsey, F. L., and D. W. Schafer. The Statistical Sleuth: A Course in 
Methods of Data Analysis, 3rd ed. Boston, MA: Brooks/Cole, 2013, 
with associated website www.statisticalsleuth.com. 
 
Other Materials: 
ChatGPT account … it can be paid or free.   
Prerequisites: 
 
 
A previous introductory statistics course and Bridge to Statistics 
 
Midterm Date 
Saturday, March 2nd 2024 from 11am to 2pm CST on Zoom 
Final Exam Date 
Saturday, April 20th 2024 from 11am to 2pm CST on Zoom 
The text is available as an electronic version from CengageBrain.com and is much less expensive 
this way! 
All elements of the syllabus are subject to change by the instructor. 
Before taking this class, you should know 
• 
Statistical methods from an introductory statistics course: appropriate use of th

In [None]:
pdf_texts = read_pdfs(folder_path)

pdf_texts

[('ds_6371_syllabus Ver 7.pdf',
  'Course Syllabus: DS 6371 Statistical Foundations for Data Science \n \nCourse Designers:  \nDr. Bivin Sadler and Dr. Monnie McGee \nCourse Text: \nRamsey, F. L., and D. W. Schafer. The Statistical Sleuth: A Course in \nMethods of Data Analysis, 3rd ed. Boston, MA: Brooks/Cole, 2013, \nwith associated website www.statisticalsleuth.com. \n \nOther Materials: \nChatGPT account … it can be paid or free.   \nPrerequisites: \n \n \nA previous introductory statistics course and Bridge to Statistics \n \nMidterm Date \nSaturday, March 2nd 2024 from 11am to 2pm CST on Zoom \nFinal Exam Date \nSaturday, April 20th 2024 from 11am to 2pm CST on Zoom \nThe text is available as an electronic version from CengageBrain.com and is much less expensive \nthis way! \nAll elements of the syllabus are subject to change by the instructor. \nBefore taking this class, you should know \n• \nStatistical methods from an introductory statistics course: appropriate use of the mean

# Step 2: Chunk Text

In [None]:
# Step 2: Chunk Text
# chunk_size is a tunable hyperparameter: the target number of words per chunk.
# Sentences are kept whole, so a single sentence longer than chunk_size yields a larger chunk.
def chunk_text(text, chunk_size=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        words = sentence.split()
        # Close the current chunk if this sentence would push it past chunk_size
        # (the current_chunk check avoids appending an empty chunk)
        if current_chunk and current_length + len(words) > chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.extend(words)
        current_length += len(words)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    # Print out each chunk
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i}: {chunk}\n")

    return chunks

In [None]:
nltk.download('punkt_tab')

all_chunks = []
chunk_mapping = []

for pdf_name, text in pdf_texts:
    chunks = chunk_text(text)
    all_chunks.extend(chunks)
    chunk_mapping.append((pdf_name, chunks))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Chunk 0: Course Syllabus: DS 6371 Statistical Foundations for Data Science Course Designers: Dr. Bivin Sadler and Dr. Monnie McGee Course Text: Ramsey, F. L., and D. W. Schafer. The Statistical Sleuth: A Course in Methods of Data Analysis, 3rd ed. Boston, MA: Brooks/Cole, 2013, with associated website www.statisticalsleuth.com. Other Materials: ChatGPT account … it can be paid or free.

Chunk 1: Prerequisites: A previous introductory statistics course and Bridge to Statistics Midterm Date Saturday, March 2nd 2024 from 11am to 2pm CST on Zoom Final Exam Date Saturday, April 20th 2024 from 11am to 2pm CST on Zoom The text is available as an electronic version from CengageBrain.com and is much less expensive this way! All elements of the syllabus are subject to change by the instructor.

Chunk 2: Before taking this class, you should know • Statistical methods from an introductory statistics course: appropriate use of the mean and median, interpretation of box plots and histograms, use of 
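
The chunking logic can be tried on a toy string without NLTK. The sketch below is only an illustration: the function name `chunk_by_sentences` is hypothetical, and a naive regex split on sentence-ending punctuation stands in for `sent_tokenize` (an assumption, not what the notebook uses). It shows that sentences are never split, so a chunk can end up longer than `chunk_size`.

```python
import re

def chunk_by_sentences(text, chunk_size=100):
    """Greedy sentence packing, mirroring chunk_text with a regex splitter."""
    # Assumption: splitting after ., ! or ? approximates sent_tokenize for simple text.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, length = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        # Close the current chunk only if it already holds at least one sentence.
        if current and length + len(words) > chunk_size:
            chunks.append(' '.join(current))
            current, length = [], 0
        current.extend(words)
        length += len(words)
    if current:
        chunks.append(' '.join(current))
    return chunks

sample = "One two three. Four five six seven. Eight nine ten eleven twelve thirteen."
demo_chunks = chunk_by_sentences(sample, chunk_size=5)
for i, chunk in enumerate(demo_chunks):
    # Print the chunk index, its word count, and its text.
    print(i, len(chunk.split()), chunk)
```

With `chunk_size=5`, the last sentence has six words and still forms a single chunk, which is why `chunk_size` is better read as a target than as a hard limit.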

# Step 3: Create Embeddings / Vectorization

In [None]:
# Step 3: Create Embeddings
def create_embeddings(text_chunks, tokenizer, model):
    embeddings = []
    for chunk in text_chunks:
        inputs = tokenizer(chunk, return_tensors='pt', truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into a single vector per chunk
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())

    # Print out each embedding
    for i, embed in enumerate(embeddings):
        print(f"Embedding {i}: {embed}\n")

    return np.array(embeddings)

In [None]:
# Sentence-transformer model that maps text to a 384-dimensional vector:
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Create embeddings
embeddings = create_embeddings(all_chunks, tokenizer, model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding 0: [-6.53053522e-02 -1.06348768e-01  3.95441279e-02  5.84123619e-02
 -4.81642634e-02 -1.77587971e-01 -2.52298228e-02 -2.91868020e-03
 -1.19266555e-01  1.26687825e-01 -8.96836147e-02  8.97001661e-03
  9.55372453e-02 -9.61673558e-02 -6.88278452e-02 -5.96843325e-02
  2.21790355e-02 -2.26778165e-02  1.49146467e-01 -1.17126025e-01
  5.27740754e-02  1.12173641e-02  1.10034473e-01 -1.22876652e-01
  7.36729503e-02 -4.22556214e-02 -1.24696165e-03 -1.40419193e-02
 -4.55504023e-02  2.89534666e-02 -1.74629927e-01  1.21788584e-01
  1.21086866e-01  7.69288689e-02  2.10101232e-02 -1.25272691e-01
  1.39385685e-01  1.00311562e-01  9.00881141e-02  2.67574012e-01
 -2.67032146e-01  7.03313248e-03  2.07027588e-02  2.69298386e-02
 -4.67460528e-02 -1.88797992e-02 -1.33954778e-01 -1.68592572e-01
 -1.69797257e-01 -1.53928879e-03 -1.76100150e-01  2.71150172e-02
 -9.32352245e-02  7.93322176e-03 -3.66215818e-02  3.63429151e-02
  1.56535149e-01 -1.89531185e-02 -5.93098663e-02 -1.10795341e-01
  1.25465114
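
`create_embeddings` averages `last_hidden_state` over every token position. Because each chunk is encoded on its own, no padding tokens are present and the plain mean is sufficient. If chunks were ever encoded in padded batches, the padding positions would need to be excluded. The following is a minimal pure-Python sketch of that masked average; the function name `masked_mean_pool` and the toy two-dimensional vectors are assumptions for illustration, not model output.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for j, value in enumerate(vec):
                total[j] += value
            count += 1
    # Guard against an all-zero mask to avoid dividing by zero.
    return [t / max(count, 1) for t in total]

# Two real tokens followed by one padding position (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
plain = [sum(col) / len(tokens) for col in zip(*tokens)]
print("masked mean:", pooled)
print("plain mean: ", plain)
```

The plain mean is pulled toward the padding vector, while the masked mean reflects only the real tokens.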

# Step 4: Index Vectors / Embeddings

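Indexing embeddings gives a single structure to search for relevant text chunks. Note that the `IndexFlatL2` index used below performs an exact, exhaustive search: it compares the query embedding against every stored embedding using L2 (Euclidean) distance. That is fast enough for a handful of PDFs. For large collections, FAISS also offers approximate indexes (for example, IVF or HNSW variants) that avoid comparing the query against every vector.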
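
For intuition, a flat L2 index such as the `faiss.IndexFlatL2` used in this notebook returns the same neighbors as an exhaustive scan over all stored vectors. The sketch below writes that scan in plain Python; the function name `l2_search` and the toy vectors are hypothetical, and it is not how FAISS is implemented internally.

```python
def l2_search(vectors, query, k=3):
    """Return (squared distances, indices) of the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(toy_vectors, [0.9, 0.1], k=2)
print(distances, indices)
```

The cost of this scan grows linearly with the number of chunks, which is the motivation for approximate indexes once a collection becomes large.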

In [None]:
# Step 4: Index Embeddings
def index_embeddings(embeddings):
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index
# IndexFlatL2 ranks chunks by L2 (Euclidean) distance between embeddings, not by cosine similarity

In [None]:
# Index embeddings
index = index_embeddings(embeddings)
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x78e0f1beb6f0> >
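
L2 distance and cosine similarity can rank the same chunks differently when embeddings have different lengths. For unit-length vectors the two agree, since the squared L2 distance equals 2 minus twice the cosine similarity. The sketch below checks this on toy data in plain Python; all names and values are hypothetical. In FAISS, one way to get cosine ranking is to normalize the vectors (for example with `faiss.normalize_L2`) and use an inner-product index such as `faiss.IndexFlatIP`; the notebook does not do this.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [1.0, 0.2]
docs = [[10.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
ids = range(len(docs))

# Rank document ids from best to worst match under each measure.
by_cosine = sorted(ids, key=lambda i: -cosine(query, docs[i]))
by_l2_raw = sorted(ids, key=lambda i: squared_l2(query, docs[i]))
by_l2_unit = sorted(ids, key=lambda i: squared_l2(normalize(query), normalize(docs[i])))

print("cosine:       ", by_cosine)
print("L2 (raw):     ", by_l2_raw)
print("L2 (unit len):", by_l2_unit)
```

The long vector `[10.0, 1.0]` points in nearly the same direction as the query, so cosine ranks it first, while raw L2 distance ranks it last.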

# Step 5: Retrieve and return relevant chunks.
### Note that there is no LLM to provide a refined answer here... we will add this later.

In [None]:
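# Step 5: Answer Questions
# Note: `pdf_texts` must be a list of (pdf_name, chunks) pairs such as chunk_mapping,
# not the raw (pdf_name, text) pairs returned by read_pdfs. `embeddings` is currently unused.
def answer_question(question, pdf_texts, index, embeddings, tokenizer, model, top_k=3):
    # Create embedding for the question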
    inputs = tokenizer(question, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        question_embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

    # Search for the nearest text chunks
    _, indices = index.search(np.array([question_embedding]), k=top_k)
    indices = indices[0]

    # Collect top-k chunks
    retrieved_chunks = []
    sources = []
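    for idx in indices:
        # idx is a position in the flat all_chunks list; walk through the PDFs,
        # subtracting each PDF's chunk count, to find the PDF and the local offset
        chunk_offset = idx
        pdf_idx = 0

        while chunk_offset >= len(pdf_texts[pdf_idx][1]):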
            chunk_offset -= len(pdf_texts[pdf_idx][1])
            pdf_idx += 1

        pdf_name, chunks = pdf_texts[pdf_idx]
        retrieved_chunks.append(chunks[chunk_offset])
        sources.append(f"{pdf_name}, Chunk {chunk_offset}")



    combined_text = ' '.join(retrieved_chunks)
    return f"Answer: {combined_text}\nSources: {sources}"

In [None]:
# Answer question
question = "What percent of the overall grade is the homework grade worth in DS 6371?"
answer = answer_question(question, chunk_mapping, index, embeddings, tokenizer, model, top_k=3)
print(answer)

Answer: Table 1: Cumulative Percentage Required to Reach Each Letter Grade Cumulative Percentage Earned Grade [100 – 93] A (93 – 90] A- (90 – 88] B+ (88 – 83] B (83 – 80] B- (80 – 78] C+ (78 – 73] C (73 – 70] C- (70 – 60] D < 60 F The cumulative percentage for the course is determined by the course assignment components with their corresponding percentages defined in Table 2. Table 2: Grade Components and Weightings of the Cumulative Percentage Percentage of Cumulative Percentage Component Must complete 100% on time to pass the course. Questions regarding the grading of any assignments should be directed to the course instructor as soon as possible and in accordance with any regrading policy instituted by the instructor. The final grade for the course will be calculated on the bases of the earned cumulative percentage and the grade received for each of the components of the cumulative percentage. This course is not graded on a curve. The required cumulative percentage needed to earn ea
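
The index returns positions in the flat `all_chunks` list, so each position has to be mapped back to a PDF and a local chunk offset. `answer_question` does this with a `while` loop. The sketch below shows an alternative using cumulative chunk counts and `bisect`; the function name `locate_chunk` and the toy mapping are hypothetical. The range check also guards against the `-1` labels FAISS returns when `top_k` exceeds the number of indexed vectors.

```python
from bisect import bisect_right
from itertools import accumulate

def locate_chunk(global_idx, chunk_mapping):
    """Map a flat chunk index to (pdf_name, local_offset, chunk_text)."""
    counts = [len(chunks) for _, chunks in chunk_mapping]
    ends = list(accumulate(counts))  # cumulative chunk counts per PDF
    total = ends[-1] if ends else 0
    if not 0 <= global_idx < total:
        raise IndexError(f"chunk index {global_idx} out of range")
    # The first PDF whose cumulative count exceeds global_idx owns the chunk.
    pdf_idx = bisect_right(ends, global_idx)
    start = ends[pdf_idx] - counts[pdf_idx]
    name, chunks = chunk_mapping[pdf_idx]
    offset = global_idx - start
    return name, offset, chunks[offset]

# Toy mapping in the same shape as chunk_mapping, including a PDF with no chunks.
toy_mapping = [("a.pdf", ["a0", "a1"]), ("b.pdf", []), ("c.pdf", ["c0", "c1", "c2"])]
located = [locate_chunk(i, toy_mapping) for i in range(5)]
for row in located:
    print(row)
```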

# All Together

In [None]:
# Step 1: Read PDF Files
def read_pdfs(folder_path):
    pdf_texts = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.pdf'):
            file_path = os.path.join(folder_path, file_name)
            try:
                doc = fitz.open(file_path)
                text = ""
                for page in doc:
                    text += page.get_text()
                pdf_texts.append((file_name, text))
            except Exception as e:
                print(f"Error reading {file_name}: {e}")
    return pdf_texts

# Step 2: Chunk Text
def chunk_text(text, chunk_size=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        words = sentence.split()
        # The current_chunk check avoids appending an empty chunk
        if current_chunk and current_length + len(words) > chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.extend(words)
        current_length += len(words)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks


# Step 3: Create Embeddings
def create_embeddings(text_chunks, tokenizer, model):
    embeddings = []
    for chunk in text_chunks:
        inputs = tokenizer(chunk, return_tensors='pt', truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
    return np.array(embeddings)

# Step 4: Index Embeddings
def index_embeddings(embeddings):
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index

# Step 5: Answer Questions
def answer_question(question, pdf_texts, index, embeddings, tokenizer, model, top_k=3):
    # Create embedding for the question
    inputs = tokenizer(question, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        question_embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

    # Search for the nearest text chunks
    _, indices = index.search(np.array([question_embedding]), k=top_k)
    indices = indices[0]

    # Collect top-k chunks
    retrieved_chunks = []
    sources = []
    for idx in indices:
        chunk_offset = idx
        pdf_idx = 0

        while chunk_offset >= len(pdf_texts[pdf_idx][1]):
            chunk_offset -= len(pdf_texts[pdf_idx][1])
            pdf_idx += 1

        pdf_name, chunks = pdf_texts[pdf_idx]
        retrieved_chunks.append(chunks[chunk_offset])
        sources.append(f"{pdf_name}, Chunk {chunk_offset}")



    combined_text = ' '.join(retrieved_chunks)
    return f"Answer: {combined_text}\nSources: {sources}"


# Main function to tie everything together
def main(folder_path, question, model_name):
    # Load the tokenizer and model for the given Hugging Face model name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Read and chunk PDFs
    pdf_texts = read_pdfs(folder_path)
    all_chunks = []
    chunk_mapping = []

    for pdf_name, text in pdf_texts:
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        chunk_mapping.append((pdf_name, chunks))

    # Create and index embeddings
    embeddings = create_embeddings(all_chunks, tokenizer, model)
    index = index_embeddings(embeddings)

    # Answer question
    answer = answer_question(question, chunk_mapping, index, embeddings, tokenizer, model)
    print(answer)


# Comparing Different Models

In [None]:
#question = 'What does the "Check drainage" code mean on the washer?'
#question = 'What is Campus Caring Connections?'
question = "What percent of the overall grade is the homework grade worth in DS 6371?"
#question = "What determines the largest percent of the grade?"
#question = "What is the FLS assignment?"

__DistilBERT Variants:__
  - __distilbert-base-uncased:__ A distilled version of the original BERT model, which is optimized for speed and reduced size, while retaining much of the performance of the larger BERT models.
  - __distilroberta-base:__ A distilled version of the RoBERTa model, offering similar benefits in terms of size and speed.

In [None]:
main(folder_path, question, 'distilbert-base-uncased')


Answer: • The FLS is critically important to learning in this class and must be completed thoroughly and in a timely manner in order to receive feedback before live session. Discussion Boards: Students are not required to post in discussion boards, unless specified by the professor. Midterm Exam (25 percent, Points/Scale 0-100): There will be a midterm exam in week 8 of the course. It will cover concept and hand-calculation questions, as well as a data analysis question. Please clear your schedule now! We will have a review for the exam during live session 8. Course Grading Policy This course consists of a number of assignments and projects that are to be completed throughout the term. It is expected that all students will put forth the effort required to earn an 'A' letter grade for this course. Assignment grades will be determined using evaluation rubrics. You are responsible for reviewing the rubrics and raising questions or concerns related to the assignments, their rubrics, and th

In [None]:
main(folder_path, question, 'distilroberta-base')


Answer: If you are at all concerned about being prepared for this course … take the Bridge Course … it will help tremendously. Complete the answers to these questions with at least one slide per question. The idea is that you will present some or all of these in a breakout session during the live session. Make sure to add at least four takeaways on one slide and any questions you have on the last slide. 4. Submit the slide deck to the For Live Session Assignment: Unit X assignment on 2DS by 1pm, Central Time, the day of your live session. This is the absolute latest that they can be turned in without a penalty. That assessment may be somewhat true; however, we will pay more attention to sample size calculation and experimental design than a first course typically does. Furthermore, we will concentrate on understanding WHY a particular technique is appropriate and HOW to interpret the results. There is a lot in this course for everyone.
Sources: ['ds_6371_syllabus Ver 7.pdf, Chunk 3', '

__BERT Variants:__
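  - __bert-large-uncased:__ A larger version of BERT with more parameters. More capacity can yield richer token representations, but general-purpose BERT models are not trained to produce sentence embeddings, so mean-pooled outputs may still rank chunks poorly.
  - __roberta-large:__ A larger RoBERTa (Robustly Optimized BERT Pretraining Approach) model with more parameters and improved training techniques.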

In [None]:
main(folder_path, question, 'bert-large-uncased')


Answer: Questions regarding the grading of any assignments should be directed to the course instructor as soon as possible and in accordance with any regrading policy instituted by the instructor. The final grade for the course will be calculated on the bases of the earned cumulative percentage and the grade received for each of the components of the cumulative percentage. This course is not graded on a curve. The required cumulative percentage needed to earn each letter grade is given in Table 1. Most professors encourage collaborative work except when explicitly prohibited (usually on quizzes and exams). Collaboration means helping one other, not copying answers from one another. Students who turn in exactly the same answers to the same homework will share the grade assigned (i.e., if two students have the same answers, and the grade on the assignment is a 90, then each student will receive a 45). Some instructors may impose stricter penalties. The expectation is that each student sp

__Sentence Transformers:__

  - __all-MiniLM-L6-v2:__ A lightweight model optimized for generating sentence embeddings efficiently.
  - __all-mpnet-base-v2:__ A variant of MPNet optimized for generating high-quality sentence embeddings.

In [None]:
main(folder_path, question, 'sentence-transformers/all-MiniLM-L6-v2')

In [None]:
main(folder_path, question, 'sentence-transformers/all-mpnet-base-v2')