# Retrieval Augmented Generation modification and Evaluation

note: httpx verion 0.27.0  is necessary to use the httpx.AsyncClient with groq. langchain issue that needs fixing.

In [1]:
# Standard library imports
import os
import json
import random
from typing import Dict
from pathlib import Path  # For working with file paths

# Utility libraries
from dotenv import load_dotenv  # For loading environment variables from a .env file
import glob  # For matching file paths using patterns
import tqdm  # For displaying progress bars in loops
import pandas as pd  # For handling tabular data
from datasets import Dataset  # For managing datasets (Hugging Face)

# PDF handling
from PyPDF2 import PdfReader  # For extracting text from PDF files

# LangChain core functionality
from langchain.text_splitter import RecursiveCharacterTextSplitter  # For splitting text into manageable chunks
from langchain.prompts import PromptTemplate, ChatPromptTemplate  # For defining and managing prompt templates
from langchain.vectorstores import Chroma  # For creating vector stores for retrieval
from langchain.embeddings import HuggingFaceEmbeddings  # For generating embeddings for text chunks

# LangChain advanced components
from langchain.chains.combine_documents import create_stuff_documents_chain  # For combining retrieved documents
from langchain_core.output_parsers import StrOutputParser  # For parsing string outputs from models

# Third-party AI model interfaces
from langchain_openai import ChatOpenAI  # For using OpenAI models with LangChain
from langchain_groq import ChatGroq  # For using Groq models with LangChain
import openai  # For using OpenAI's API

# For displaying notebook progress bars
import tqdm
import tqdm.notebook as notebook_tqdm

  from .autonotebook import tqdm as notebook_tqdm


### Bugfix for Chroma with SQLite

This cell addresses a known issue with Chroma's dependency on `sqlite3`, which can conflict with certain Python environments. 

#### What this does:
1. **Replaces `sqlite3` with `pysqlite3`:**
   - Ensures compatibility by importing `pysqlite3` as a substitute for `sqlite3`.
   - Updates the `sys.modules` mapping to ensure all imports of `sqlite3` use `pysqlite3`.

In [2]:
%pip install pysqlite3
BASE_DIR = Path.cwd()
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Loading Environment Variables

This cell loads sensitive environment variables such as API keys from a `.env` file. Using environment variables helps keep credentials secure and out of the source code.

#### What this does:
1. **`load_dotenv()`:**
   - Loads environment variables from a `.env` file in the current working directory.
   - A `.env` file typically contains key-value pairs (e.g., `GROQ_API_KEY=your_groq_api_key`).

2. **Retrieve API Keys:**
   - `os.getenv("GROQ_API_KEY")`: Retrieves the GROQ API key.
   - `os.getenv("OPENAI_API_KEY")`: Retrieves the OpenAI API key.


#### Notes:
- Ensure you have a `.env` file in the root of your project directory with the required keys, for example:
  ```plaintext
  GROQ_API_KEY=your_groq_api_key
  OPENAI_API_KEY=your_openai_api_key

In [3]:
load_dotenv()
groq_key = os.getenv("GROQ_API_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")

### Task: Load and Extract Text from PDFs

In this task, you will load multiple PDF files from a specified directory, read their content, and extract text. This text will later be used for processing and retrieval.

#### Instructions:
1. **Define the File Path:**

2. **Iterate Over PDFs:**
   - Use the `glob` library to find all files matching the `*.pdf` pattern in the specified directory.
   - For each file, open it in binary mode (`"rb"`) using a `with` statement.

3. **Extract Text:**
   - Use the `PdfReader` library to read the PDF content.
   - Iterate through the pages and extract text from each page.

4. **Combine Text:**
   - Concatenate the text from all pages into a single string (`text`).

5. **Preview the Output:**
   - Print the first 50 characters of the extracted text to verify that the content is loaded correctly.

#### What to Do:
- Run the cell and inspect the first 50 characters of the extracted text to confirm it works as expected.
- If necessary, adjust the `glob_path` to point to the correct directory.

In [4]:
### load the pdf from the path
glob_path = "data/*.pdf"
text = ""
for pdf_path in tqdm.tqdm(glob.glob(glob_path)):
    with open(pdf_path, "rb") as file:
        reader = PdfReader(file)
         # Extract text from all pages in the PDF
        text += " ".join(page.extract_text() for page in reader.pages if page.extract_text())

text[:50]

  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:04<00:00,  2.47s/it]


'Hyper tension in adul ts: \ndiagnosis and manag eme'

### Task: Split Extracted Text into Manageable Chunks

#### Instructions:
1. **Create a Text Splitter:**
   - Use the `RecursiveCharacterTextSplitter` to split the text.
   - Specify two key parameters:
     - **`chunk_size` (2000):** The maximum number of characters in each chunk.
     - **`chunk_overlap` (200):** The number of overlapping characters between consecutive chunks to maintain context continuity
2. **Inspect the Chunks:**
   - After splitting, verify the output by inspecting the `chunks` variable. Each chunk should be approximately 2000 characters long, with overlaps of 200 characters.

In [5]:
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
# Split the extracted text into manageable chunks
chunks = splitter.split_text(text)


In [6]:
print(len(chunks))
print(chunks[0])

130
Hyper tension in adul ts: 
diagnosis and manag emen t 
NICE guideline 
Published: 28 August 2019 
Last updat ed: 21 No vember 2023 
www .nice.or g.uk/guidance/ng136 
© NICE 202 4. All right s reserved. Subject t o Notice of right s (https://www .nice.or g.uk/t erms-and-
conditions#notice-of -right s). Your r esponsi bility 
The r ecommendations in t his guideline r epresent t he view of NICE, arriv ed at aft er car eful 
consideration of t he evidence a vailable. When e xercising t heir judgement, pr ofessionals 
and practitioners ar e expect ed to tak e this guideline fully int o account, alongside t he 
individual needs, pr eferences and v alues of t heir patient s or t he people using t heir ser vice. 
It is not mandat ory to apply t he recommendations, and t he guideline does not o verride t he 
responsibility t o mak e decisions appr opriat e to the cir cumstances of t he individual, in 
consultation wit h them and t heir f amilies and car ers or guar dian. 
All pr oblems (adv

### Task: Create Embeddings for Text Chunks

In this step, you will initialize the embedding model that will convert the text chunks into numerical representations (embeddings). These embeddings are essential for enabling similarity-based retrieval in the RAG system.

#### Instructions:
1. **Select an Embedding Model:**
   - Use the `HuggingFaceEmbeddings` class to specify the embedding model.
   - The model name provided here is `"sentence-transformers/all-mpnet-base-v2"`, a widely used embedding model for generating high-quality text representations.

2. **Initialize the Model:**
   - Pass the model name as an argument to `HuggingFaceEmbeddings` and assign the resulting object to the variable `embeddings`.


#### What to Do:
- Use the `HuggingFaceEmbeddings` class to load the specified embedding model.
- Assign the loaded model to the `embeddings` variable.

#### Documentation
https://python.langchain.com/api_reference/huggingface/embeddings/langchain_huggingface.embeddings.huggingface.HuggingFaceEmbeddings.html#huggingfaceembeddings

In [7]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


### Task: Create a Vector Store for Text Retrieval


#### Instructions:
1. **Generate the Vector Store:**
   - Use the `Chroma.from_texts` method to create a vector store.
   - Pass the `chunks` (text chunks) and the `embeddings` object (created in the previous step) as arguments.


3. **Inspect the Output:**
   - Optionally, inspect the `vector_store` to confirm that it is ready for retrieval tasks.

In [9]:

vector_store = Chroma.from_texts(chunks, embeddings)

RuntimeError: [91mYour system has an unsupported version of sqlite3. Chroma                     requires sqlite3 >= 3.35.0.[0m
[94mPlease visit                     https://docs.trychroma.com/troubleshooting#sqlite to learn how                     to upgrade.[0m

### Task: Create a Retriever for the Vector Store

In this step, you will create a retriever to query the vector store and fetch the most relevant text chunks for a given input query. The retriever uses the vector embeddings to perform similarity-based searches.

#### Instructions:
1. **Create the Retriever:**
   - Use the `as_retriever` method on the `vector_store` to create a retriever.
   - Set the `search_type` parameter to `"mmr"` (Maximal Marginal Relevance) to ensure diverse and relevant retrieval.
   - Pass additional search settings using `search_kwargs`, such as:
     - **`k`:** The number of chunks to retrieve (e.g., `k=3`).

2. **Assign the Retriever:**
   - Store the retriever in the variable `retriever` for later use in querying the vector store.

3. **Verify the Retriever:**
   - Ensure the retriever is correctly initialized and ready to handle queries.


https://python.langchain.com/docs/integrations/vectorstores/chroma/#query-by-turning-into-retriever

In [None]:
retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 3})

In [None]:
docs = retriever.invoke("How do I diagnose Asthma?")
docs

[Document(metadata={}, page_content='medical r ecords, alongside t he coded diagnostic entr y. [NICE 2017 , amended \nBTS/NICE/SIGN 202 4] Asthma: diagnosis, monit oring and chr onic ast hma management (BTS, NICE, SIGN)\n(NG2 45)\n© NICE 202 4. All right s reserved. Subject t o Notice of right s (https://www .nice.or g.uk/t erms-and-\nconditions#notice-of -right s).Page 9 of\n64 Physical examina tion \n1.1.4 Examine people wit h suspect ed ast hma t o identify e xpirat ory polyphonic wheez e \nand signs of ot her causes of r espirat ory sympt oms but be awar e that e ven if \nexamination r esult s are normal, t he person ma y still ha ve ast hma. [NICE 2017] \nInitial tr eatmen t and obje ctive tests f or acu te sym ptoms a t \npresen tation \n1.1.5 Treat people immediat ely if t hey are acut ely unw ell or highly sympt omatic at \npresentation, and per form objectiv e tests that ma y help suppor t a diagnosis of \nasthma (f or example, eosinophil count , fractional e xhaled nitric o x

### Task: Build a RAG Model Function

In this task, you will combine all the steps from the previous tasks to create a reusable function for building a Retrieval-Augmented Generation (RAG) model. The function will process raw text documents, generate embeddings, and store them in a vector store for efficient retrieval.

#### What You Need to Do:

1. **Define the Function:**
   - Create a function named `build_rag_model` with the following parameters:
     - **`texts` (List[str]):** A list of raw documents or text strings to process.
     - **`embedding_model` (str):** The name of the Hugging Face embedding model to use.
     - **`chunk_size` (int):** The maximum size of each text chunk.
     - **`chunk_overlap` (int):** The overlap size between consecutive chunks.

2. **Implement the Steps:**
   - **Step 1:** **Split Text into Chunks**
     - Use `RecursiveCharacterTextSplitter` to split the provided `texts` into chunks.
     - Ensure the function handles all documents in the list and combines the resulting chunks.
     - Print the number of generated chunks for debugging purposes.

   - **Step 2:** **Generate Embeddings**
     - Initialize a `HuggingFaceEmbeddings` object using the provided `embedding_model`.
     - Use this object to generate embeddings for the text chunks.

   - **Step 3:** **Create a Vector Store**
     - Use `Chroma.from_texts` to create a vector store from the chunks and their embeddings.
     - Print the number of chunks stored in the vector store for confirmation.

3. **Return the Vector Store:**
   - The function should return the vector store so it can be used for retrieval tasks.


#### Example Usage:
- Call the function like this:
  ```python
  retriever = build_rag_model(
      texts=["Asthma is a chronic condition.", "Hypertension is persistently high blood pressure."],
      embedding_model="sentence-transformers/all-mpnet-base-v2",
      chunk_size=200,
      chunk_overlap=50
  )


In [None]:
def build_rag_model(texts, embedding_model, chunk_value):
    """
    Build a RAG model (vector store) with the specified embedding model, chunk size, and overlap.

    Args:
        texts (List[str]): List of raw documents or text to process.
        embedding_model (str): Name of the Hugging Face embedding model.
        chunk_value (List): List containing chunk length and overlap

    Returns:
        VectorStore: A Chroma vector store with embeddings for retrieval.
    """
    print(f"Building RAG model with embedding model: {embedding_model}, chunk size: {chunk_value[0]}, overlap: {chunk_value[1]}")
    
    # Step 1: Split texts into chunks
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_value[0], chunk_overlap=chunk_value[1])
    chunks = splitter.split_text(text)
    
    print(f"Generated {len(chunks)} chunks from {len(texts)} documents.")

    # Step 2: Generate embeddings
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
    
    # Step 3: Create vector store
    vector_store = Chroma.from_texts(chunks, embeddings)
    
    print(f"Vector store created with {len(chunks)} chunks.")
    
    return vector_store

### Task: Define a Generative Model for Question Answering

In this task, you will define and initialize a generative model that can answer user questions based on a given context. This involves creating a prompt template, setting up an output parser, and initializing the language model for generation.

#### What You Need to Do:

1. **Define the Prompt Template:**
   - Use the `PromptTemplate` class to define a template that specifies how user questions and the associated context are structured.
   - Your template should:
     - Include placeholders for the context (`{context}`) and question (`{question}`).
     - Provide clear instructions for the model to generate answers based only on the context.

2. **Initialize the Prompt Template:**
   - Set the `template` argument to the system template:
     ```python
     system_template = """
     Answer the users question based on the below context:
     <context> {context} </context>
     Here is the question: <question> {question} </question>
     """
     ```
   - Specify the `input_variables` as `["context", "question"]` to define the placeholders.

3. **Set Up the Output Parser:**
   - Use the `StrOutputParser` to parse the string output from the model.

4. **Initialize the Generative Model:**
   - Use the `ChatGroq` class to set up a generative model with the following parameters:
     - **`model`:** Specify the model name (e.g., `"llama-3.2-3b-preview"`).
     - **`temperature`:** Set to `0` for deterministic outputs.
     - **`max_tokens`:** Set to `None` to allow the model to decide the output length.
     - **`timeout`:** Set to handle timeouts during generation.
     - **`max_retries`:** Define the number of retries in case of failure.


Groq documentation: https://python.langchain.com/v0.1/docs/integrations/chat/groq/

In [None]:
# Define the template for answering user questions based on a provided context
system_template = """
Answer the users question based on the below context:
<context> {context} </context>
Here is the question: <question> {question} </question>
"""
# Create a prompt template for the question-answering system
question_answering_prompt = PromptTemplate(template=system_template, input_variables=["context", "question"])
output_parser = StrOutputParser()

# Initialize the generative model for question answering
model = ChatGroq(model="llama-3.2-3b-preview", temperature=0, max_tokens=None, timeout=None, max_retries=2,)



### Task: Build the RAG Chain for Question Answering

In this task, you will create a **RAG (Retrieval-Augmented Generation) Chain** that connects the components you’ve defined so far: the prompt template, the generative model, and the output parser. This chain orchestrates the process of answering user questions by sequentially formatting inputs, generating answers, and parsing outputs.

#### What You Need to Do:

1. **Chain the Components:**
   - Use the pipe operator (`|`) to sequentially combine the components:
     - **`question_answering_prompt`:** Formats the user question and context into the structured template.
     - **`model`:** The generative model processes the formatted input and generates a response.
     - **`output_parser`:** Parses the raw response from the model into a structured and usable format.

2. **Assign the Chain:**
   - Store the combined components into the variable `rag_chain`.

In [None]:

rag_chain = question_answering_prompt | model | output_parser

#### Test your chain

In [None]:
query = "How do I diagnose Asthma?"
print(rag_chain.invoke({"context": docs, "question": query}))

Based on the provided context, diagnosing asthma involves a combination of physical examination, objective tests, and medical history. Here's a step-by-step guide:

1. **Physical examination**: Examine people with suspected asthma to identify expiratory polyphonic wheeze and signs of other respiratory symptoms. Be aware that even if examination results are normal, the person may still have asthma.

2. **Initial treatment and objective tests**: Treat people immediately if they are acutely unwell or highly symptomatic at presentation. Perform objective tests that may help support a diagnosis of asthma, such as:
   - Eosinophil count (the number of eosinophils in a blood sample)
   - Fractional exhaled nitric oxide (FeNO) test
   - Spirometry or peak expiratory flow (PEF) before and after bronchodilator

3. **Bronchial challenge test**: A test to measure airway responsiveness (bronchial responsiveness). It involves giving small increments of a bronchoconstrictor (most commonly methacholin

### Task: Define a Function to Answer Questions Using RAG

In this task, you will create a function that leverages the RAG (Retrieval-Augmented Generation) system to answer user questions. The function will retrieve relevant documents from the knowledge index and use the RAG chain to generate a response.

#### What You Need to Do:

1. **Define the Function:**
   - Name the function `answer_with_rag`.
   - Specify the following arguments:
     - **`question` (str):** The user's query.
     - **`rag_chain`:** The RAG chain you built earlier for formatting, generating, and parsing responses.
     - **`retriever` (VectorStore):** The vector store containing document embeddings for retrieval.

2. **Implement the Steps:**
   - **Step 1:** Retrieve Relevant Documents
     - Use the `retriever` to retrieve documents related to the query.

   - **Step 2:** Prepare the Input for the RAG Chain
     - Create a dictionary named `rag_input` with the following keys:
       - **`context`:** A list of retrieved document texts.
       - **`question`:** The user query.

   - **Step 3:** Generate an Answer
     - Pass the `rag_input` to the `rag_chain` using the `invoke` method.
     - Store the generated response in the variable `answer`.

3. **Return the Results:**
   - The function should return a tuple containing:
     - **`answer` (str):** The generated response to the question.
     - **`relevant_docs` (List[str]):** The list of retrieved document texts used for answering the query.

4. **Test the Function:**
   - Test the function with sample questions and ensure it retrieves relevant documents and generates accurate answers.

#### Example Usage:
- Call the function like this:
  ```python
  answer, relevant_docs = answer_with_rag(
      question="What are the symptoms of asthma?",
      rag_chain=rag_chain,  # Your defined RAG chain
      knowledge_index=knowledge_index  # Your vector store retriever
  )


In [None]:
def answer_with_rag(
    question,
    rag_chain,  # Replace LLM with the rag_chain directly
    retriever):
    """
    Answer a question using RAG with the given knowledge index and rag_chain.

    Args:
        question (str): The user's question.
        rag_chain: The RAG chain that takes context and question as input and generates the answer.
        knowledge_index (VectorStore): The vector store used for retrieving relevant documents.
        num_retrieved_docs (int): Number of documents to retrieve initially.
        num_docs_final (int): Number of documents to include in the final context.

    Returns:
        Tuple[str, List[str]]: A tuple containing the generated answer and the relevant documents used.
    """
    # Retrieve relevant documents
    relevant_docs = retriever.invoke(question)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # Keep only the text of the documents

    # Limit to the top N final documents
    relevant_docs = relevant_docs

    # Pass the documents and the question to the RAG chain
    rag_input = {
        "context": relevant_docs,  # Pass the documents as a list
        "question": question,      # Pass the user query
    }

    # Use the RAG chain to generate an answer
    answer = rag_chain.invoke(rag_input)

    return answer, relevant_docs

In [None]:
answer_with_rag("what is asthma?", rag_chain, retriever)

('Based on the provided context, asthma is a chronic respiratory condition characterized by inflammation of the airways, which can lead to symptoms such as wheezing, coughing, and shortness of breath. It can be managed and treated through various means, including medication, lifestyle changes, and monitoring of airway inflammation.',
 ['BTS ISBN: 9 78-1-917 619-01-1 \nNICE ISBN: 9 78-1-47 31-6612- 7 \nSIGN ISBN: 9 78-1-909103-92-4 Asthma: diagnosis, monit oring and chr onic ast hma management (BTS, NICE, SIGN)\n(NG2 45)\n© NICE 202 4. All right s reserved. Subject t o Notice of right s (https://www .nice.or g.uk/t erms-and-\nconditions#notice-of -right s).Page 64 of\n64',
  "requir ement. \nThe committ ee concluded t hat F eNO monit oring was cost -effectiv e in adult s but ma y not \nbe in childr en. It was not possible on t he curr ent e vidence t o say what t he optimum \nfrequency of monit oring should be, but t he committ ee agr eed t hat an appr opriat e \noppor tunity w ould be 

# Evaluation

## Eval Set Generation

### Documentation: Function to Call the OpenAI API

This function interacts with the OpenAI API to generate responses based on a given prompt. It provides a simple wrapper for querying the API and returning the generated output.


In [None]:
# Function to call the OpenAI API
def call_llm(prompt: str):
    """
    Calls the OpenAI API to generate a response for a given prompt.

    Args:
        prompt (str): The input prompt for the LLM.
        model (str): The OpenAI model to use (default is "gpt-4").

    Returns:
        str: The generated response from the LLM.
    """
    client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": prompt}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content


### Definition: Prompt for QA Generation

This prompt template defines the instructions for generating factoid-style question-answer (QA) pairs based on a given context. It is specifically crafted to create search-engine-style questions and concise, factual answers.

In [None]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

### Generate Question-Answer (QA) Pairs

1. **Set the Number of QA Pairs to Generate:**
   - **`N_GENERATIONS`:** Specifies the maximum number of QA pairs to generate. Here, it is set to `30`.

2. **Sample Chunks:**
   - Randomly selects `N_GENERATIONS` chunks from the `chunks` using `random.sample`.

3. **Loop Over Chunks:**
   - For each sampled chunk:
     - **Step 1:** Format the prompt:
       - Replaces the `{context}` placeholder in `QA_generation_prompt` with the text of the current chunk.
     - **Step 2:** Call the LLM:
       - Sends the formatted prompt to the `call_llm` function to generate a question and its corresponding answer.
     - **Step 3:** Extract Question and Answer:
       - Parses the output to extract the `Factoid question` and `Answer` fields.
     - **Step 4:** Validate and Append:
       - Ensures the answer is less than 300 characters long.
       - Appends the valid `context`, `question`, and `answer` to the `outputs` list.

4. **Handle Errors:**
   - If an error occurs during QA generation (e.g., malformed output), it skips the current chunk and logs the error.

5. **Display the Results:**
   - After processing all chunks, prints the generated QA pairs for inspection.


---

#### Why This is Important:
- This step generates a dataset of factoid-style QA pairs, which is essential for:
  - Evaluating the RAG system's performance.
  - Testing how well the QA pipeline retrieves relevant context and generates accurate answers.

---

#### Example Output:
- Each generated entry in `outputs` will look like this:
  ```python
  {
      "context": "Asthma is a chronic condition that affects the airways.",
      "question": "What is asthma?",
      "answer": "A chronic condition that affects the airways."
  }

In [None]:
N_GENERATIONS = 30

print(f"Generating {N_GENERATIONS} QA couples...")

# Generate QA pairs
outputs = []
for sampled_context in tqdm.tqdm(random.sample(chunks, min(N_GENERATIONS, len(chunks)))):
    # Generate QA couple
    try:
        formatted_prompt = QA_generation_prompt.format(context=sampled_context)
        output_QA_couple = call_llm(formatted_prompt)
        # Extract question and answer from the output
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0].strip()
        answer = output_QA_couple.split("Answer: ")[-1].strip()
        # Validate and append to outputs
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context,
                "question": question,
                "answer": answer,
                
            }
        )
    except Exception as e:
        print(f"Skipped a context due to error: {e}")
        continue

# Print generated outputs
for output in outputs:
    print(output)

Generating 30 QA couples...


100%|██████████| 30/30 [00:27<00:00,  1.08it/s]

{'context': 'Hyper tension in adul ts: \ndiagnosis and manag emen t \nNICE guideline \nPublished: 28 August 2019 \nLast updat ed: 21 No vember 2023 \nwww .nice.or g.uk/guidance/ng136 \n© NICE 202 4. All right s reserved. Subject t o Notice of right s (https://www .nice.or g.uk/t erms-and-\nconditions#notice-of -right s). Your r esponsi bility \nThe r ecommendations in t his guideline r epresent t he view of NICE, arriv ed at aft er car eful \nconsideration of t he evidence a vailable. When e xercising t heir judgement, pr ofessionals \nand practitioners ar e expect ed to tak e this guideline fully int o account, alongside t he \nindividual needs, pr eferences and v alues of t heir patient s or t he people using t heir ser vice. \nIt is not mandat ory to apply t he recommendations, and t he guideline does not o verride t he \nresponsibility t o mak e decisions appr opriat e to the cir cumstances of t he individual, in \nconsultation wit h them and t heir f amilies and car ers or guar di




In [None]:
display(pd.DataFrame(outputs).head(1))

Unnamed: 0,context,question,answer
0,Hyper tension in adul ts: \ndiagnosis and mana...,When was the NICE guideline on hypertension in...,21 November 2023


### Question Filtering with Critiques

These prompts are designed to evaluate the quality of the generated factoid questions based on specific criteria: **groundedness**, **relevance**, and **stand-alone clarity**. Each prompt asks the LLM to provide a score and a rationale for the rating.

---

#### **1. Groundedness Critique Prompt**

##### Purpose:
- To evaluate how well the question can be answered using the provided context.
- Ensures the question is clearly and unambiguously grounded in the given text.

##### Details:
- The rating scale is from **1 to 5**:
  - **1:** The question cannot be answered at all using the context.
  - **5:** The question is clearly and unambiguously answerable with the context.

#### **2. Relevance Critique Prompt**

##### Purpose:
- To assess how useful the question is for developers, particularly in machine learning or NLP applications.
- Ensures the question is aligned with the needs of the target audience (e.g., developers building with Hugging Face).

##### Details:
- The rating scale is from **1 to 5**:
  - **1:** The question is not useful at all.
  - **5:** The question is highly useful and relevant to the audience.

---

#### **3. Stand-Alone Critique Prompt**

##### Purpose:
- To determine if the question can be understood without additional context.
- Ensures the question is self-contained and meaningful to someone with domain knowledge or access to related documentation.

##### Details:
- The rating scale is from **1 to 5**:
  - **1:** The question depends on additional information (e.g., "in the context" or "in the document").
  - **5:** The question is fully understandable and stand-alone.


In [None]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

### Critique QA Pairs Using LLM Prompts

In this task, you will evaluate each generated QA pair using the previously defined critique prompts for **groundedness**, **relevance**, and **stand-alone clarity**. The goal is to score and document the quality of each question based on the provided context and criteria.

---

#### What This Code Does:

1. **Iterate Over QA Outputs:**
   - Loops through the `outputs` list, which contains the generated QA pairs (`context`, `question`, `answer`).

2. **Generate Evaluations:**
   - For each QA pair:
     - **Groundedness:** Uses the `question_groundedness_critique_prompt` to evaluate if the question is answerable based on the given context.
     - **Relevance:** Uses the `question_relevance_critique_prompt` to evaluate if the question is useful for the intended audience.
     - **Stand-alone Clarity:** Uses the `question_standalone_critique_prompt` to evaluate if the question is understandable without additional context.

3. **Call the LLM for Each Criterion:**
   - Sends the formatted prompt for each criterion to the LLM using `call_llm`.
   - Stores the response in the `evaluations` dictionary under the respective criterion.

4. **Parse the Results:**
   - Extracts the **`Total rating`** (score) and **`Evaluation`** (text rationale) from the LLM's response.
   - Updates the `output` dictionary with the scores and evaluations for each criterion.

5. **Handle Errors Gracefully:**
   - If any part of the process fails (e.g., LLM output is malformed), the loop skips the current QA pair and continues with the next one.

6. **Update Outputs:**
   - Adds the critique scores and rationale to each QA pair in the `outputs` list.

---

#### Example Output:

Each `output` in the `outputs` list will be updated with fields like these:

```python
{
    "context": "Asthma is a chronic condition that affects the airways.",
    "question": "What is asthma?",
    "answer": "A chronic condition that affects the airways.",
    "groundedness_score": 5,
    "groundedness_eval": "The question is fully answerable based on the provided context.",
    "relevance_score": 4,
    "relevance_eval": "This question is relevant to an audience seeking general knowledge about asthma.",
    "standalone_score": 5,
    "standalone_eval": "The question is clear and understandable without additional context."
}

In [None]:
print("Generating critique for each QA couple...")
for output in tqdm.tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        ),
        "relevance": call_llm(
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
    try:
        for criterion, evaluation in evaluations.items():
            score, eval = (
                int(evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": eval,
                }
            )
    except Exception as e:
        continue



Generating critique for each QA couple...


100%|██████████| 30/30 [02:24<00:00,  4.83s/it]


### Explanation: Filtering and Preparing the Evaluation Dataset

In this step, we transform the evaluated QA pairs into a structured dataset, filter them based on their scores, and prepare the final dataset for further evaluation or model training.

---

#### Step-by-Step Breakdown:


2. **Convert QA Pairs to a DataFrame:**
   - `generated_questions = pd.DataFrame.from_dict(outputs)`:
     - Converts the `outputs` list (which now includes QA pairs and their scores) into a pandas DataFrame for easier manipulation and analysis.

3. **Display the Evaluation Dataset (Before Filtering):**
   - Prints a subset of columns:
     - **`question`:** The generated question.
     - **`answer`:** The corresponding answer.
     - **`groundedness_score`, `relevance_score`, `standalone_score`:** Scores assigned during the critique step.
   - This provides an overview of the dataset before applying any filtering criteria.

4. **Filter the QA Pairs:**
   - Keeps only QA pairs that meet the following conditions:
     - **`groundedness_score` >= 4:** The question is well-anchored in the provided context.
     - **`standalone_score` >= 4:** The question is clear and understandable without additional context.
   - **Note:** The `relevance_score` is not used for filtering here, but it remains part of the dataset for reference.

5. **Display the Filtered Dataset:**
   - Prints the filtered DataFrame to show the high-quality QA pairs that passed the criteria.

6. **Convert to a Hugging Face Dataset:**
   - `eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)`:
     - Converts the filtered pandas DataFrame into a Hugging Face `Dataset` object, which is commonly used for training and evaluation in NLP tasks.
     - The `split="train"` argument designates this as a training split.
     - `preserve_index=False` ensures the index from the pandas DataFrame is not carried over to the `Dataset`.

---

#### Purpose of This Step:

1. **Dataset Refinement:**
   - Filters out low-quality QA pairs to ensure only well-scored questions and answers are included in the final dataset.
   - Focuses on groundedness and stand-alone clarity to improve the overall utility and reliability of the dataset.

2. **Final Dataset Preparation:**
   - Converts the data into a format suitable for further evaluation or training machine learning models, such as Hugging Face models.

3. **Quality Assurance:**
   - Provides a visual overview of the dataset before and after filtering, allowing for manual inspection of the data quality.

In [None]:

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,When was the NICE guideline on hypertension in adults last updated?,21 November 2023,5,1,4
1,What did the committee emphasize when discussing asthma diagnosis?,The importance of taking a good clinical history.,5,1,1
2,What is the ISBN number for the book titled NICE?,9 78-1-47 31-6612- 7,5,1,2
3,What is the preferred treatment option for children aged 5 to 11 according to the committee's recommendation?,Regular paediatric low-dose ICS.,5,1,1
4,What is the most cost-effective diagnostic strategy for asthma according to the health economic model?,A gradual rule-in approach.,5,1,1
5,When should repeat blood pressure measurements be taken according to the committee's recommendations?,Within 7 days in people with no target organ damage.,3,1,1
6,How often should people with asthma be reviewed in primary care?,At least annually and after any exacerbation.,5,1,5
7,What was the previous recommendation for step 1 dual therapy for people of Black African or African–Caribbean origin with type 2 diabetes?,The previous recommendation was to offer step 1 dual therapy with an ACE inhibitor and either a diuretic (D-type drug) or a calcium channel blocker (CCB; C-type drug).,5,1,1
8,What evidence review discusses the diagnostic test accuracy of spirometry in people suspected of asthma?,Evidence review A.,5,1,1
9,What is uncontrolled asthma?,"Uncontrolled asthma is when asthma affects a person's lifestyle or restricts their normal activities due to symptoms such as coughing, wheezing, shortness of breath, and chest tightness.",5,1,5


Final evaluation dataset:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,When was the NICE guideline on hypertension in adults last updated?,21 November 2023,5,1,4
6,How often should people with asthma be reviewed in primary care?,At least annually and after any exacerbation.,5,1,5
9,What is uncontrolled asthma?,"Uncontrolled asthma is when asthma affects a person's lifestyle or restricts their normal activities due to symptoms such as coughing, wheezing, shortness of breath, and chest tightness.",5,1,5
10,What type of asthma medicines should be continued as normal during pregnancy?,"Short-acting and long-acting beta 2 agonists, inhaled corticosteroids, and oral theophyllines.",5,1,5
15,What should be checked if asthma control is inadequate at moderate-dose maintenance regimen?,FeNO and the eosinophil level should be checked.,5,1,4
16,What is the initial treatment recommended for children aged 5 to 11 with newly diagnosed asthma?,"A twice-daily paediatric low-dose inhaled corticosteroid (ICS), with a short-acting beta 2 agonist (SABA) as needed.",5,1,5
17,What should people with kidney disease avoid when reducing sodium intake?,Salt substitutes containing potassium chloride.,5,1,4
18,What is the clinic blood pressure range for Stage 2 hypertension?,160/100 mmHg or higher but less than 180/120 mmHg.,5,1,5
22,What type of testing is performed in some adults with suspected asthma?,Spirometry and reversibility testing.,5,1,4
23,What is measured by a bronchial challenge test?,Airway responsiveness (bronchial responsiveness) is measured by a bronchial challenge test.,5,1,5


### Explanation: Running RAG Tests

This function evaluates the performance of the RAG (Retrieval-Augmented Generation) system by comparing the system's generated answers to the true answers in a test dataset. The results are saved to a file for further analysis.

---

#### What This Function Does:

1. **Prepare the Output File:**
   - Attempts to load existing test results from `output_file`:
     - If the file exists, appends new results to the previous ones.
     - If the file does not exist, initializes an empty `outputs` list.

2. **Iterate Over the Evaluation Dataset:**
   - Loops through the `eval_dataset`, which contains the test questions, true answers, and source documents.

3. **Skip Already Evaluated Questions:**
   - Checks if a question has already been tested (i.e., exists in the loaded `outputs`).
   - Skips the question if it has already been evaluated.

4. **Run the RAG System:**
   - Calls the `answer_with_rag` function to:
     - Retrieve relevant documents using the `knowledge_index`.
     - Generate an answer using the LLM (Language Model).

5. **Print Results (Optional):**
   - If `verbose=True`, prints the following details for manual inspection:
     - The input question.
     - The generated answer.
     - The true answer from the dataset.

6. **Save Results:**
   - Constructs a dictionary containing:
     - The question and its true answer.
     - The source document.
     - The generated answer.
     - The retrieved documents used to generate the answer.
     - Test settings, if provided.
   - Appends the result to the `outputs` list and saves it to the `output_file` in JSON format.

---


#### Example Usage:

```python
run_rag_tests(
    eval_dataset=eval_dataset,  # Test dataset
    llm=rag_chain,  # RAG chain (includes retrieval and generation)
    knowledge_index=knowledge_index,  # Vector store retriever
    output_file="rag_test_results.json",  # File to save the results
    verbose=True,  # Print results for inspection
    test_settings={
        "embedding_model": "sentence-transformers/all-mpnet-base-v2",
        "chunk_size": 200,
        "overlap": 50,
    }
)


In [None]:


def run_rag_tests(
    eval_dataset,
    llm,
    knowledge_index,
    output_file: str,
    verbose = True,
    test_settings = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except:
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question, llm, knowledge_index)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": [doc for doc in relevant_docs],
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)

### Explanation: Setting Up the Evaluation Prompt

This step defines the evaluation prompt that will be used to assess the quality of responses generated by the RAG system. The prompt follows a structured format to ensure consistent and objective evaluation based on a predefined scoring rubric.

---

#### Purpose of the Evaluation Prompt:

1. **Define the Evaluation Task:**
   - The LLM is tasked with comparing a generated response (`response`) to a reference answer (`reference_answer`) and scoring its quality based on specific criteria.

2. **Provide a Score Rubric:**
   - A detailed rubric is included to guide the LLM in assigning scores. The rubric ensures that scoring is based strictly on correctness, accuracy, and factual alignment with the reference answer.

3. **Standardize Output:**
   - The LLM is instructed to:
     - Write a detailed feedback summary addressing the evaluation criteria.
     - Assign a numerical score between 1 and 5, strictly adhering to the rubric.
     - Format the output using the required structure, including `[RESULT]`.

---

#### Structure of the Prompt:

1. **Task Description:**
   - Specifies the evaluation task and output format.
   - Emphasizes that the feedback must focus on the score rubric and avoid general evaluations.

2. **Instruction to Evaluate:**
   - The instruction or context that prompted the response.

3. **Response to Evaluate:**
   - The generated response being evaluated.

4. **Reference Answer:**
   - The ideal answer that would receive a perfect score of 5.

5. **Score Rubrics:**
   - Provides explicit criteria for scoring:
     - **Score 1:** Completely incorrect and inaccurate.
     - **Score 5:** Completely correct, accurate, and factual.

6. **Feedback Section:**
   - Guides the LLM to write structured feedback followed by the score.

In [None]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)

### Explanation: Evaluating Generated Answers

This function evaluates the quality of answers generated by the RAG system using a predefined evaluation prompt and a scoring language model. The evaluation process is iterative and updates the results file in place for checkpointing and saving progress.

---

#### What This Function Does:

1. **Initialize the Evaluation Environment:**
   - **`answer_path`:** Path to the JSON file containing the generated answers.
   - **`eval_chat_model`:** The language model used for evaluation (e.g., GPT-4).
   - **`evaluator_name`:** A string identifier for the evaluator (e.g., "GPT4").
   - **`evaluation_prompt_template`:** The prompt template that defines how the evaluation task is framed.

2. **Load Existing Results:**
   - If the `answer_path` file exists, loads the previously saved results into `answers`.
   - This ensures that previously evaluated answers are not re-evaluated, saving time and resources.

3. **Iterate Over Generated Answers:**
   - For each entry in `answers`:
     - **Check for Prior Evaluation:** If the answer has already been evaluated by the specified evaluator (`eval_score_{evaluator_name}`), skip it.
     - **Prepare the Evaluation Prompt:**
       - Uses `evaluation_prompt_template` to format the instruction, response, and reference answer into the structured prompt.
     - **Evaluate the Response:**
       - Sends the prompt to the `eval_chat_model` (e.g., GPT-4) and receives the evaluation result.
     - **Parse the Result:**
       - Extracts `feedback` and `score` from the model's output, splitting on `[RESULT]` to ensure the expected format is followed.

4. **Update the Results:**
   - Adds the following fields to the current experiment:
     - **`eval_score_{evaluator_name}`:** The numeric score assigned by the evaluator.
     - **`eval_feedback_{evaluator_name}`:** The detailed feedback provided by the evaluator.
   - Saves the updated `answers` list back to the `answer_path` file after each iteration for checkpointing.

---



#### Example Usage:

```python
evaluate_answers(
    answer_path="rag_test_results.json",  # File containing generated answers
    eval_chat_model=ChatOpenAI(model="gpt-4-1106-preview", temperature=0),  # Evaluation model
    evaluator_name="GPT4",  # Identifier for the evaluator
    evaluation_prompt_template=evaluation_prompt_template,  # Evaluation prompt template
)


In [None]:


eval_chat_model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
evaluator_name = "GPT4"


def evaluate_answers(
    answer_path: str,
    eval_chat_model,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        answers = json.load(open(answer_path, "r"))

    for experiment in tqdm.tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)
        feedback, score = [item.strip() for item in eval_result.content.split("[RESULT]")]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)

### Explanation: Running the Complete RAG Evaluation Pipeline

This script integrates all the steps covered so far to run a full evaluation of the RAG system across different configurations. It includes embedding creation, chunking, retrieval, generation, and evaluation in a loop to test multiple setups.

---

#### Step-by-Step Breakdown:

1. **Create Output Directory:**
   - Ensures that a directory named `./output` exists to store the results.

2. **Define Configurations:**
   - **`embedding_models`:** List of embedding models to test (e.g., `"sentence-transformers/all-mpnet-base-v2"`).
   - **`chunk_sizes`:** List of `(chunk_size, overlap)` tuples to test different chunking strategies.
     - Example:
       - `[2000, 100]`: Chunks of 2000 characters with 100-character overlap.
       - `[5000, 500]`: Larger chunks of 5000 characters with 500-character overlap.

3. **Iterate Over Configurations:**
   - Loops through all combinations of `embedding_models` and `chunk_sizes`.
   - Constructs a unique `settings_name` for each combination to name the output files clearly.

4. **Build the Knowledge Base:**
   - Calls the `build_rag_model` function with:
     - `texts`: The input data (e.g., pre-split chunks of the documents).
     - `embedding_model`: The current embedding model.
     - `chunk_size` and `chunk_overlap`: Parameters for splitting the text.
   - Converts the resulting vector store into a retriever (`knowledge_index`) for querying.

5. **Run the RAG System:**
   - Iterates through the `eval_dataset` (assumed to contain questions and true answers).
   - Calls the `answer_with_rag` function to:
     - Retrieve relevant documents using the `knowledge_index`.
     - Generate answers using the RAG chain (`rag_chain`).
   - Appends the results to a list, including:
     - The question, true answer, generated answer, and retrieved documents.

6. **Save Results:**
   - Saves the answers to a JSON file named based on the configuration (`output_file_name`).

7. **Evaluate the Answers:**
   - Calls `evaluate_answers` to:
     - Critique and score the generated answers using the LLM evaluator (`eval_chat_model`).
     - Update the saved results with scores and feedback for each answer.

---

#### What This Script Accomplishes:

1. **End-to-End Workflow:**
   - Automates the entire RAG pipeline, from embedding creation to evaluation.

2. **Flexible Testing:**
   - Tests multiple configurations for embeddings and chunking, enabling comparative analysis.

3. **Results Storage:**
   - Saves intermediate and final results to disk for reproducibility and further analysis.

4. **Scoring and Feedback:**
   - Generates actionable feedback and numerical scores for the generated answers.

---

#### Example Workflow:

**Configuration 1:**
- **Embedding Model:** `"sentence-transformers/all-mpnet-base-v2"`
- **Chunk Size:** `2000`
- **Overlap:** `100`

**Sample Output File:**
- `./output/rag_chunk:2000_embeddings:sentence-transformers~all-mpnet-base-v2.json`

**Generated Results:**
```json
[
    {
        "question": "What are the symptoms of asthma?",
        "generated_answer": "Asthma symptoms include shortness of breath and chest tightness.",
        "true_answer": "Shortness of breath, wheezing, and chest tightness.",
        "retrieved_docs": ["Document 1 text", "Document 2 text"],
        "eval_score_GPT4": "4",
        "eval_feedback_GPT4": "The response is mostly correct, but it omits 'wheezing' from the symptoms listed in the reference answer."
    }
]


In [None]:
if not os.path.exists("./output"):
    os.mkdir("./output")

# Configurations
embedding_models = ["sentence-transformers/all-mpnet-base-v2"]  # Add more models as needed
chunk_sizes = [[2000,100], [5000,500]]  # Add more chunk sizes as needed

# Iterate through configurations
for chunk_size in chunk_sizes:
    for embedding_model in embedding_models:
        settings_name = f"chunk:{chunk_size}_embeddings:{embedding_model.replace('/', '~')}"
        output_file_name = f"./output/rag_{settings_name}.json"

        print(f"Running evaluation for {settings_name}:")

        print("Loading knowledge base embeddings...")
        # Use rag_builder to create the vector store
        vector_store = build_rag_model(
            texts=chunks,  # Assuming `chunks` contains pre-split text data
            embedding_model=embedding_model,
            chunk_value=chunk_size
        )
        retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 2})

        print("Running RAG...")
        answers = []
        for sample in tqdm.tqdm(eval_dataset):  # Assume eval_dataset is iterable
            question = sample["question"]
            true_answer = sample["answer"]

            # Call the RAG function to get the generated answer
            generated_answer, relevant_docs = answer_with_rag(
                question=question,
                rag_chain=rag_chain,  # Replace with your RAG chain
                retriever=retriever,
            )

            answers.append({
                "question": question,
                "generated_answer": generated_answer,
                "true_answer": true_answer,
                "relevant_docs": relevant_docs,
            })

        # Save results to file
        with open(output_file_name, "w") as f:
            json.dump(answers, f)

        print("Running evaluation...")
        evaluate_answers(
            output_file_name,
            eval_chat_model,
            evaluator_name,
            evaluation_prompt_template,
        )

Running evaluation for chunk:[2000, 100]_embeddings:sentence-transformers~all-mpnet-base-v2:
Loading knowledge base embeddings...
Building RAG model with embedding model: sentence-transformers/all-mpnet-base-v2, chunk size: 2000, overlap: 100
Generated 123 chunks from 130 documents.
Vector store created with 123 chunks.
Running RAG...


100%|██████████| 14/14 [01:41<00:00,  7.21s/it]


Running evaluation...


100%|██████████| 14/14 [00:49<00:00,  3.56s/it]


Running evaluation for chunk:[5000, 500]_embeddings:sentence-transformers~all-mpnet-base-v2:
Loading knowledge base embeddings...
Building RAG model with embedding model: sentence-transformers/all-mpnet-base-v2, chunk size: 5000, overlap: 500
Generated 52 chunks from 130 documents.
Vector store created with 52 chunks.
Running RAG...


100%|██████████| 14/14 [02:11<00:00,  9.38s/it]


Running evaluation...


100%|██████████| 14/14 [00:48<00:00,  3.43s/it]


### Explanation: Aggregating and Normalizing Evaluation Results

This code collects evaluation results from multiple JSON files, combines them into a single dataset, and normalizes the evaluation scores for further analysis.

---

#### Step-by-Step Breakdown:

1. **Initialize an Empty List:**
   - `outputs = []`: Prepares a list to store the results from all JSON files.

2. **Load JSON Files:**
   - `glob.glob("./output/*.json")`: Finds all JSON files in the `./output` directory.
   - For each file:
     - Loads the JSON content into a pandas DataFrame using `pd.DataFrame`.
     - Adds a new column, `settings`, to store the filename, indicating the configuration used for generating the results.
     - Appends the DataFrame to the `outputs` list.

3. **Combine All Results:**
   - `pd.concat(outputs)`: Concatenates all DataFrames in the `outputs` list into a single DataFrame named `result`.

4. **Normalize Evaluation Scores:**
   - **Convert to Integer:**
     - `result["eval_score_GPT4"].apply(lambda x: int(x) if isinstance(x, str) else 1)`:
       - Ensures all scores are integers, with a fallback value of `1` for non-numeric entries.
   - **Normalize to Range [0, 1]:**
     - `result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4`:
       - Transforms the scores from the range `[1, 5]` to `[0, 1]`:
         - Subtracts `1` to shift the range to `[0, 4]`.
         - Divides by `4` to scale the range to `[0, 1]`.

---

#### Purpose of This Step:

1. **Aggregate Results:**
   - Combines evaluation results from multiple configurations into a single dataset, making it easier to compare and analyze performance.

2. **Normalize Scores:**
   - Converts the raw scores into a standardized format (`[0, 1]`) for consistent interpretation and comparison across configurations.

3. **Preserve Configuration Context:**
   - Adds the `settings` column to retain information about which configuration each set of results corresponds to.

---

In [None]:

outputs = []
for file in glob.glob("./output/*.json"):
    output = pd.DataFrame(json.load(open(file, "r")))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)
result["eval_score_GPT4"] = result["eval_score_GPT4"].apply(lambda x: int(x) if isinstance(x, str) else 1)
result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4

### Explanation: Calculating and Sorting Average Scores by Configuration

This code calculates the average evaluation scores for each configuration and sorts them in ascending order to identify the best and worst-performing setups.

---

#### Purpose of This Step:

1. **Performance Comparison:**
   - Calculates the overall effectiveness of each configuration by averaging the normalized evaluation scores across all questions.
   - Highlights configurations that consistently produce better results.

2. **Identify Trends:**
   - Sorting the scores helps visualize how different configurations affect the system's performance.
   - Useful for pinpointing the impact of factors like chunk size, overlap, or embedding model.


3. **Insights into Configurations:**
   - Identifies which configurations yield higher-quality answers, guiding optimization efforts.
   - Helps determine the best chunk size, overlap, or embedding model for the


In [None]:
average_scores = result.groupby("settings")["eval_score_GPT4"].mean()
average_scores.sort_values()