# Advanced RAG Exercise

This notebook is designed as an exercise to build a complete Retrieval-Augmented Generation (RAG) system. In this exercise, you will integrate three main components into a single pipeline:

1. **Retrieval Module** – Retrieve relevant documents based on a query.
2. **Transformation Module** – Transform the retrieved queries.
3. **Generation Module and Evaluation** – Use the transformed data to generate responses and evaluate the overall system performance.

In [123]:
import tqdm
import glob
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings  # For generating embeddings for text chunks
import faiss
import pickle
from dotenv import load_dotenv
import os
from groq import Groq
from sentence_transformers import SentenceTransformer
import random
from sentence_transformers import CrossEncoder
import numpy as np


## 1. Building the RAG Pipeline

Load the data and store it in a string.

In [124]:
#Der Code liest alle PDFs aus dem data-Ordner ein, extrahiert den Text seitenweise und fügt alles zu einem großen Textstring zusammen.
### load the pdf from the path
glob_path = "data/*.pdf"
text = ""
for pdf_path in tqdm.tqdm(glob.glob(glob_path)):
    with open(pdf_path, "rb") as file:
        reader = PdfReader(file)
         # Extract text from all pages in the PDF
        text += " ".join(page.extract_text() for page in reader.pages if page.extract_text())

text[:50]
### Split the data into chunks.
# Text-Splitter konfigurieren
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Max. Länge pro Chunk
    chunk_overlap=200       # Überlappung zwischen Chunks
)

# Den extrahierten PDF-Text aufteilen
chunks = splitter.split_text(text)

# Vorschau
print(f"Anzahl Chunks: {len(chunks)}")
print(f"Beispiel-Chunk:\n{chunks[0][:300]}...")

100%|██████████| 1/1 [00:01<00:00,  1.28s/it]

Anzahl Chunks: 61
Beispiel-Chunk:
High and normal protein diets improve body composition and 
glucose control in adults with type 2 diabetes: A randomized 
trial
Julianne G. Clina1, R. Drew Sayer1,3, Zhaoxing Pan2, Caroline W. Cohen3, Michael T. 
McDermott4, Victoria A. Catenacci4, Holly R. Wyatt1,5, James O. Hill1
1Department of Nu...





In [125]:
print(f"Total chunks: {len(chunks)}")
print("Preview of the first chunk:", chunks[0][:200])

Total chunks: 61
Preview of the first chunk: High and normal protein diets improve body composition and 
glucose control in adults with type 2 diabetes: A randomized 
trial
Julianne G. Clina1, R. Drew Sayer1,3, Zhaoxing Pan2, Caroline W. Cohen3,


## Choose an embedding model
Use the SentenceTransfomer wrapper as we have done so far.
Models are found here: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
or on HuggingFace.

Embed the chunks.

In [126]:
# Der Code lädt ein vortrainiertes SentenceTransformer-Modell und wandelt alle Text-Chunks in numerische Vektor-Embeddings um, die später für semantische Suche verwendet werden.
# Modell auswählen – du kannst auch ein anderes von sbert.net nehmen
model_name = "paraphrase-multilingual-MiniLM-L12-v2"
model = SentenceTransformer(model_name)

# Chunks einbetten
chunk_embeddings = model.encode(chunks, convert_to_numpy=True, show_progress_bar=True)

# Form prüfen
print(f"Embedding-Shape: {chunk_embeddings.shape}")


Batches: 100%|██████████| 2/2 [00:02<00:00,  1.44s/it]

Embedding-Shape: (61, 384)





## 3. Build Index and save index

In [127]:
# Beispiel: Wenn chunk_embeddings die Form (100, 384) hat, bedeutet das, dass es 100 
# Embeddings gibt, und jedes Embedding hat 384 Dimensionen. In diesem Fall würde d = 384 sein.
d = chunk_embeddings.shape[1]
print(d)

384


In [128]:

# FAISS-Index initialisieren
index = faiss.IndexFlatL2(d)

# Embeddings hinzufügen
index.add(chunk_embeddings)
print("Number of embeddings in FAISS index:", index.ntotal)

Number of embeddings in FAISS index: 61


## Load Key for language Models

In [129]:
load_dotenv()
# Access the API key using the variable name defined in the .env file
google_api_key = os.getenv("GOOGLE_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
groq_api_key = os.getenv("GROQ_API_KEY")

## 4. Build a retriever function

arguments: query, k, index, chunks, embedding model

return: retrieved texts, distances

In [130]:
def retrieve(query, k, index, chunks, embedding_model):
    # 1. Embed die Abfrage
    print(type(query))
    query_embedding = embedding_model.encode([query])

    # 2. Suche im FAISS-Index
    distances, indices = index.search(query_embedding, k)

    # 3. Hole die zugehörigen Text-Chunks
    retrieved_texts = [chunks[i] for i in indices[0]]

    return retrieved_texts, distances[0]

results, scores = retrieve("What is deep learning?", 3, index, chunks, model)
print(results[0])



<class 'str'>
platform partners. Journal of biomedical informatics, 2019. 95: p. 103208.
34. Harris PA, et al. , Research electronic data capture (REDCap)—a metadata-driven methodology 
and workflow process for providing translational research informatics support. Journal of 
biomedical informatics, 2009. 42(2): p. 377–381. [PubMed: 18929686] 
35. Wing RR, et al. , Benefits of Modest Weight Loss in Improving Cardiovascular Risk Factors in 
Overweight and Obese Individuals With Type 2 Diabetes. Diabetes Care, 2011. 34(7): p. 1481–
1486. [PubMed: 21593294] 
36. Lau DCW and Teoh H, Benefits of Modest Weight Loss on the Management of Type 2 Diabetes 
Mellitus. Canadian Journal of Diabetes, 2013. 37(2): p. 128–134. [PubMed: 24070804] 
37. Huang S, et al. , Association of magnitude of weight loss and weight variability with mortality and 
major cardiovascular events among individuals with type 2 diabetes mellitus: a systematic review


## 5. Build an answer function
Build an answer function that takes a query, k, an index and the chunks.

return: answer

In [131]:
from openai import OpenAI
from dotenv import load_dotenv
import os

def answer_query(query, k, index, chunks, embedding_model):
    load_dotenv()
    client = OpenAI(api_key=openai_api_key)

    retrieved_texts, _ = retrieve(query, k, index, chunks, embedding_model)
    context = "\n\n".join(retrieved_texts)

    prompt = f"""
Answer the following question based only on the context below.

Context:
{context}

Question:
{query}

Answer:
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content


** retrieve()	Suchen	Hol die relevantesten Textstücke zu einer Frage.

**answer_query()	Antworten	Bau ein Prompt aus den gefundenen Textstücken und frag GPT.

#### Test your RAG

In [132]:
query = "What is the most important factor in diagnosing asthma?"
answer = answer_query(query, 5, index, chunks, model)
print("LLM Answer:", answer)


<class 'str'>
LLM Answer: The most important factor in diagnosing asthma is typically a combination of symptoms, medical history, and lung function tests.


## 6. Create a Rewriter

Take a query and an api key for the model and rewrite the query. 

Rewriting a query: A Language Model is prompted to rewrite a query to better suit a task.

Other Transfomrations are implemented in a similar fashion, this is just an example!

In [133]:
from openai import OpenAI
import os
from dotenv import load_dotenv

def rewrite_query(original_query):
    load_dotenv()
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    prompt = f"""
Rewrite the following question to be more specific and better suited for information retrieval:

Original:
{original_query}

Improved:
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content.strip()


In [134]:
query = "How does asthma work?"
improved_query = rewrite_query(query)
print("Rewritten Query:", improved_query)


Rewritten Query: What physiological processes characterize asthma and contribute to its symptoms?


## 7. Implement the rewriter into your answer function

1	✏️ Die ursprüngliche Frage wird umformuliert (präziser, besser für die Suche gemacht).

2	🔍 Mit der verbesserten Frage wird nach den relevantesten Chunks gesucht und dann eine Antwort erstellt.

In [135]:
from groq import Groq
from dotenv import load_dotenv
import os

def rewrite_query_groq(query, groq_api_key):
    client = Groq(api_key=groq_api_key)

    prompt = f"""
Rewrite the following query to be more precise and optimized for retrieval:

Original:
{query}

Improved:
"""

    response = client.chat.completions.create(
        model="llama3-8b-8192",  # oder ein anderes aktives Modell
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content.strip()


def answer_query_with_rewriting(query, k, index, chunks, groq_api_key):
    # ✏️ Rewrite the input query first
    improved_query = rewrite_query_groq(query, groq_api_key)
    print(type(improved_query))

    print("🔁 Rewritten Query:", improved_query)

    # 🔍 Retrieve relevant chunks
    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    retrieved_texts, _ = retrieve(improved_query, k, index, chunks, embedding_model)
    context = "\n\n".join(retrieved_texts)

    # 🧠 Ask the Groq LLM
    prompt = f"""
Answer the following question based only on the context below.

Context:
{context}

Question:
{improved_query}

Answer:
"""

    client = Groq(api_key=groq_api_key)
    response = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content.strip()


#### Test it

In [136]:
query = "What is the most important factor in diagnosing asthma?"
answer = answer_query_with_rewriting(query, 5, index, chunks, groq_api_key)
print("LLM Answer:", answer)

<class 'str'>
🔁 Rewritten Query: Here's an improved query:

**What are the top 2-3 clinically validated factors that are consistently identified as crucial in diagnosing asthma, as supported by peer-reviewed research?**

Improvements:

1. **Precision**: The original query is too broad and open-ended, making it difficult to provide a clear answer. The improved query narrows down the scope to focus on the top factors, which helps to provide more precise and actionable information.
2. **Optimization for retrieval**: The improved query is structured to retrieve specific, clinically validated information from peer-reviewed research, which is more likely to provide accurate and reliable answers.
3. **Relevance**: By specifying "clinically validated factors" and "peer-reviewed research", the improved query ensures that the retrieved results are relevant to the diagnosis of asthma and grounded in scientific evidence.
4. **Avoiding ambiguity**: The improved query clarifies that the factors shou

## 8 .Evaluation

Select random chunks from all your chunks, and generate a question to each of these chunks

In [137]:
# Wählt zufällig Text-Chunks aus und lässt GPT für jeden Chunk eine passende Frage generieren, 
# mit automatischen Retry-Versuchen bei Timeout-Fehlern.

import time
import httpx  # Ensure you're catching the correct timeout exception
from openai import OpenAI
def generate_questions_for_random_chunks(chunks, num_chunks=20, max_retries=3):
    """
    Randomly selects a specified number of text chunks from the provided list,
    then generates a question for each selected chunk using the Groq LLM.

    Parameters:
    - chunks (list): List of text chunks.
    - groq_api_key (str): Your Groq API key.
    - num_chunks (int): Number of chunks to select randomly (default is 20).

    Returns:
    - questions (list of tuples): Each tuple contains (chunk, generated_question).
    """
    # Randomly select the desired number of chunks.
    selected_chunks = random.sample(chunks, num_chunks)
    
    # Initialize the Groq client once
    client = OpenAI(api_key=openai_api_key)
    
    questions = []
    for chunk in tqdm.tqdm(selected_chunks):
        # Build a prompt that asks the LLM to generate a question based on the chunk.
        prompt = (
            "Based on the following text, generate an insightful question that covers its key content:\n\n"
            "Text:\n" + chunk + "\n\n"
            "Question:"
        )
        
        messages = [
            {"role": "system", "content": prompt}
        ]
        
        generated_question = None
        attempt = 0
        
        # Try calling the API with simple retry logic.
        while attempt < max_retries:
            try:
                llm_response = client.chat.completions.create(
                     model="gpt-4o-mini",
                    messages=messages
                )
                generated_question = llm_response.choices[0].message.content.strip()
                break  # Exit the loop if successful.
            except httpx.ReadTimeout:
                attempt += 1
                print(f"Timeout occurred for chunk. Retrying attempt {attempt}/{max_retries}...")
                time.sleep(2)  # Wait a bit before retrying.
        
        # If all attempts fail, use an error message as the generated question.
        if generated_question is None:
            generated_question = "Error: Failed to generate question after several retries."
        
        questions.append((chunk, generated_question))
    
    return questions

#### Test it

In [138]:
questions = generate_questions_for_random_chunks(chunks, num_chunks=5, max_retries=2)
for idx, (chunk, question) in enumerate(questions, start=1):
    print(f"Chunk {idx}:\n{chunk[:100]}...\nGenerated Question: {question}\n")

100%|██████████| 5/5 [00:04<00:00,  1.15it/s]

Chunk 1:
Study [ 25] to detect a 2.75 kg difference in weight loss between the HP and NP. Weight 
loss achiev...
Generated Question: What statistical methods were employed in the study to assess differences in weight loss between the HP and NP diet groups, and what factors were considered in the randomization process?

Chunk 2:
weight improvement, satisfaction and energy. Obesity science & practice, 2017. 3(3): p. 298–310. 
[P...
Generated Question: What recent research findings indicate about the impact of red meat consumption on weight management, vascular health, and risk factors for conditions like type 2 diabetes?

Chunk 3:
preferential loss of fat mass compared to fat free mass which was also not supported. A 
recent revi...
Generated Question: What implications do the findings on body composition changes during weight loss in individuals with Type 2 Diabetes have for dietary recommendations, particularly regarding the effects of high protein versus high carbohydrate diets?

Chun




## 9.Test the questions with your built retriever

In [141]:
def answer_generated_questions(question_tuples, k, index, texts, groq_api_key):
    """
    For each (chunk, generated_question) tuple in the provided list, use the prebuilt
    retrieval function to generate an answer for the generated question. The function
    returns a list of dictionaries containing the original chunk, the generated question,
    and the answer.
    
    Parameters:
    - question_tuples (list of tuples): Each tuple is (chunk, generated_question)
    - k (int): Number of retrieved documents to use for answering.
    - index: The FAISS index.
    - texts (list): The tokenized text chunks mapping.
    - groq_api_key (str): Your Groq API key.
    
    Returns:
    - results (list of dict): Each dict contains 'chunk', 'question', and 'answer'.
    """
    results = []
    for chunk, question in question_tuples:
        # Use your retrieval-based answer function. Here we assume the function signature is:
        # answer_query(query, k, index, texts, groq_api_key)
        embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        answer = answer_query(question, k, index, texts, embedding_model=embedding_model) #query, k, index,texts
        results.append({
            "chunk": chunk,
            "question": question,
            "answer": answer
        })
    return results

#### Check the results

In [142]:
results = answer_generated_questions(questions, 5, index, chunks, groq_api_key)

for item in results:
    print("Chunk Preview:", item['chunk'][:100])
    print("Generated Question:", item['question'])
    print("Answer:", item['answer'])
    print("-----------------------------")

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
Chunk Preview: Study [ 25] to detect a 2.75 kg difference in weight loss between the HP and NP. Weight 
loss achiev
Generated Question: What statistical methods were employed in the study to assess differences in weight loss between the HP and NP diet groups, and what factors were considered in the randomization process?
Answer: Linear mixed models (LMM) with unstructured covariance were used to test the effect of diet group, time, and their interaction between HP and NP. 
Randomization was performed by the statistician and was stratified by age, sex, BMI, and years since diagnosis of T2D.
-----------------------------
Chunk Preview: weight improvement, satisfaction and energy. Obesity science & practice, 2017. 3(3): p. 298–310. 
[P
Generated Question: What recent research findings indicate about the impact of red meat consumption on weight management, vascular health, and risk factors for conditions like type 2 diab

## Evaluate the answers

In [143]:
import pandas as pd
def evaluate_answers_binary(results, groq_api_key, max_retries=3):
    """
    Evaluates each answer in the results list using an LLM.
    For each result (a dictionary containing 'chunk', 'question', and 'answer'),
    it sends an evaluation prompt to the Groq LLM which outputs 1 if the answer is on point,
    and 0 if it is missing the point.
    
    Parameters:
    - results (list of dict): Each dict must contain keys 'chunk', 'question', and 'answer'.
    - groq_api_key (str): Your Groq API key.
    - max_retries (int): Maximum number of retries if the API call times out.
    
    Returns:
    - df (pandas.DataFrame): A dataframe containing the original chunk, question, answer, and evaluation score.
    """
    evaluations = []
    client = OpenAI(api_key=openai_api_key)
    
    for item in tqdm.tqdm(results, desc="Evaluating Answers"):
        # Build the evaluation prompt.
        prompt = (
            "Evaluate the following answer to the given question. "
            "If the answer is accurate and complete, reply with 1. "
            "If the answer is inaccurate, incomplete, or otherwise not acceptable, reply with 0. "
            "Do not include any extra text.\n\n"
            "Question: " + item['question'] + "\n\n"
            "Answer: " + item['answer'] + "\n\n"
            "Context (original chunk): " + item['chunk'] + "\n\n"
            "Evaluation (1 for good, 0 for bad):"
        )
        
        messages = [{"role": "system", "content": prompt}]
        
        generated_eval = None
        attempt = 0
        
        # Retry logic in case of timeouts or errors.
        while attempt < max_retries:
            try:
                llm_response = client.chat.completions.create(
                    messages=messages,
                    model="4o-mini"
                )
                generated_eval = llm_response.choices[0].message.content.strip()
                break  # Exit the retry loop if successful.
            except httpx.ReadTimeout:
                attempt += 1
                print(f"Timeout occurred during evaluation. Retrying attempt {attempt}/{max_retries}...")
                time.sleep(2)
            except Exception as e:
                attempt += 1
                print(f"Error during evaluation: {e}. Retrying attempt {attempt}/{max_retries}...")
                time.sleep(2)
        
        # If no valid evaluation was produced, default to 0.
        if generated_eval is None:
            generated_eval = "0"
        
        # Convert the response to an integer (1 or 0).
        try:
            score = int(generated_eval)
            if score not in [0, 1]:
                score = 0
        except:
            score = 0
        
        evaluations.append(score)
    
    # Add the evaluation score to each result.
    for i, item in enumerate(results):
        item['evaluation'] = evaluations[i]
    
    # Create a dataframe for manual review.
    df = pd.DataFrame(results)
    return df

### Display them

In [144]:
df_evaluations = evaluate_answers_binary(results, openai_api_key)
display(df_evaluations)

Evaluating Answers:   0%|          | 0/5 [00:00<?, ?it/s]

Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 1/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 2/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 3/3...


Evaluating Answers:  20%|██        | 1/5 [00:06<00:25,  6.49s/it]

Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 1/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 2/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 3/3...


Evaluating Answers:  40%|████      | 2/5 [00:12<00:19,  6.41s/it]

Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 1/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 2/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 3/3...


Evaluating Answers:  60%|██████    | 3/5 [00:19<00:12,  6.38s/it]

Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 1/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 2/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 3/3...


Evaluating Answers:  80%|████████  | 4/5 [00:25<00:06,  6.38s/it]

Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 1/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 2/3...
Error during evaluation: Error code: 404 - {'error': {'message': 'The model `4o-mini` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}. Retrying attempt 3/3...


Evaluating Answers: 100%|██████████| 5/5 [00:31<00:00,  6.38s/it]


Unnamed: 0,chunk,question,answer,evaluation
0,Study [ 25] to detect a 2.75 kg difference in ...,What statistical methods were employed in the ...,Linear mixed models (LMM) with unstructured co...,0
1,"weight improvement, satisfaction and energy. O...",What recent research findings indicate about t...,Recent research findings suggest that red meat...,0
2,preferential loss of fat mass compared to fat ...,What implications do the findings on body comp...,"The findings suggest that a high protein diet,...",0
3,High and normal protein diets improve body com...,"What impact do high protein diets, particularl...","High protein diets, including those that inclu...",0
4,"Participants received copies of the SOS book, ...",What were the key components of the SOS interv...,The key components of the SOS intervention pro...,0
