# Build and Evaluate RAG with Gemini using LLM as a JUDGE

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/irum-zahra-awan/geneai/blob/main/podcast_agent.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/irum-zahra-awan/geneai/blob/main/podcast_agent.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>    

| Author |
| --- |
| [Irum Zahra](https://github.com/irum-zahra-awan/) |

# **Part 1: Downloading and Processing Papers with Python**

To use the information in our three source documents (two web articles and one CDC report in PDF form), we first need to get their content into our Python environment. We will write a script to:


* **Fetch the content:** We'll use the `requests` library to download both the HTML of the web pages and the PDF file.

* **Parse the text:** Raw HTML contains a lot of markup we don't need. We'll use `BeautifulSoup` to parse the HTML and keep only the article text. For the PDF, `pypdf` extracts the text page by page.

* **Chunk the text:** Large language models have a limited context window (the amount of text they can consider at once), and a retriever needs smaller units to select from than a whole research paper. To handle this, we break the text into smaller, overlapping chunks. The overlap keeps text that straddles a chunk boundary intact in at least one chunk, so its meaning isn't lost.
We use `RecursiveCharacterTextSplitter` from LangChain, a standard tool for this task.

This entire process prepares the raw data for the next crucial step: creating vector embeddings.
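
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, which this sketch does not do. The function name `chunk_with_overlap` and the sample sizes are hypothetical, chosen only for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores separators,
    unlike LangChain's RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, overlap=3)
for piece in pieces:
    print(piece)
```

Each chunk starts with the last three characters of the previous one, which is the same idea as `chunk_overlap=200` in the notebook, just at a smaller scale.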



In [1]:
%pip install requests beautifulsoup4 pypdf langchain -q -U
%pip install google-cloud-aiplatform langchain-google-vertexai faiss-cpu langchain-community -q -U

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.3, but you have requests 2.32.4 which is incompatible.

In [2]:
import requests
from bs4 import BeautifulSoup
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pypdf import PdfReader


import sys
from google.colab import auth
import vertexai

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import FAISS

In [3]:
# Define URLs and a directory to save the text
urls = [
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11680066/",
    "https://www.downtoearth.org.in/health/world-polio-day-2024-conflict-delays-and-vaccine-shortages-derail-global-eradication-efforts",
]

# The CDC paper is a PDF, so we handle it separately
pdf_url = "https://www.cdc.gov/mmwr/volumes/73/wr/pdfs/mm7341a1-H.pdf"
pdf_filename = "cdc_polio_report_2024.pdf"

# Create a directory to store the processed text
os.makedirs("polio_papers", exist_ok=True)

In [4]:
# Function to download and parse HTML content

def scrape_and_save_text(url, index):
    """Scrapes text from a URL and saves it to a file."""
    try:
        response = requests.get(url, timeout=30)  # Don't hang forever on a stalled connection
        response.raise_for_status()  # Raise an exception for bad status codes
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the main content of the article (selectors might need adjustment)
        # For ncbi.nlm.nih.gov
        if "ncbi.nlm.nih.gov" in url:
            content = soup.find('div', id='__article')
        # For downtoearth.org.in
        elif "downtoearth.org.in" in url:
            content = soup.find('div', class_='news-detail-content')
        else:
            content = soup.body # Fallback to the whole body

        if content:
            text = content.get_text(separator='\n', strip=True)
            filename = f"polio_papers/paper_{index}.txt"
            with open(filename, "w", encoding="utf-8") as f:
                f.write(text)
            print(f"Successfully scraped and saved paper {index}")
            return text
        else:
            print(f"Could not find main content for paper {index}")
            return ""
    except requests.exceptions.RequestException as e:
        print(f"Error downloading paper {index}: {e}")
        return ""

# Function to download and extract text from a PDF
def download_and_extract_pdf_text(url, filename):
    """Downloads a PDF and extracts its text content."""
    try:
        response = requests.get(url, timeout=30)  # Don't hang forever on a stalled connection
        response.raise_for_status()
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Successfully downloaded {filename}")

        # Extract text from the downloaded PDF
        reader = PdfReader(filename)
        text = ""
        for page in reader.pages:
            text += page.extract_text() or ""

        # Save the extracted text
        text_filename = "polio_papers/paper_2.txt"
        with open(text_filename, "w", encoding="utf-8") as f:
            f.write(text)
        print(f"Successfully extracted and saved text from {filename}")
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")
        return ""
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return ""


# Execute the functions and process all papers
all_texts = []
# Scrape HTML papers
for i, url in enumerate(urls):
    all_texts.append(scrape_and_save_text(url, i))

# Download and process the PDF paper
all_texts.append(download_and_extract_pdf_text(pdf_url, pdf_filename))

# Combine all text into a single document for chunking
full_text = "\n\n--- NEW PAPER ---\n\n".join(filter(None, all_texts))


Error downloading paper 0: 403 Client Error: Forbidden for url: https://pmc.ncbi.nlm.nih.gov/articles/PMC11680066/
Could not find main content for paper 1
Successfully downloaded cdc_polio_report_2024.pdf
Successfully extracted and saved text from cdc_polio_report_2024.pdf
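
In the output above, both HTML downloads failed (a 403 for the PMC article and a selector miss for the Down To Earth article), so only the CDC report reaches the chunking step. A 403 from a page that opens in a browser is often caused by the server rejecting the default client identification that `requests` sends; that is an assumption about the cause here, not something the output confirms. Under that assumption, a minimal sketch of a fetch helper with explicit headers and a timeout could look like this. The names `BROWSER_LIKE_HEADERS` and `fetch_html`, and the header values, are hypothetical.

```python
from typing import Callable

# Assumed header values; the exact string a given site accepts may differ.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; polio-rag-notebook/0.1)",
    "Accept": "text/html,application/xhtml+xml",
}


def fetch_html(url: str, get: Callable, timeout: float = 30.0) -> str:
    """Fetch a page with explicit headers and a timeout.

    `get` is any callable with the same keyword interface as
    `requests.get` (url, headers=..., timeout=...).
    """
    response = get(url, headers=BROWSER_LIKE_HEADERS, timeout=timeout)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return response.text
```

In the notebook this would be called as `fetch_html(url, get=requests.get)`. Passing `get` in as a parameter keeps the sketch testable without network access. The second failure is a separate issue: the `news-detail-content` selector apparently no longer matches the page, so it would need to be updated after inspecting the page's current HTML.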


# **Part 2: Storing Information in a Vector Database**

A vector database allows us to perform semantic search. Instead of just searching for keywords, we can search for concepts and meanings. Here's how it works:

* **Embedding Model:** We use a text embedding model from Vertex AI (`text-embedding-005` in the code below) to convert our text chunks into numerical representations called vectors or embeddings. Each vector is a list of numbers that captures the semantic meaning of the text. Chunks with similar meanings will have vectors that are "close" to each other in mathematical space.

* **Vector Store:** We need a place to store these vectors and a way to search through them efficiently. FAISS (Facebook AI Similarity Search) is a lightweight, efficient library for this purpose. It suits a Colab environment because it runs in memory and doesn't require a separate database server.

* **Storing:** The script will take each text chunk, pass it to the Vertex AI embedding model to get a vector, and then store that vector (along with the original text chunk) in our FAISS index.


This setup is the core of our RAG system. It allows us to take a user's question, find the most relevant chunks of information from our research papers, and use them to generate a factual, context-aware answer.
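
The idea of vectors being "close" can be illustrated with a small sketch that ranks stored texts by cosine similarity to a query vector. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), and the helper names `cosine_similarity` and `top_k` are hypothetical. The FAISS index may use a different metric (for example, Euclidean distance) and much faster index structures; cosine similarity is used here only because it is easy to read.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy "embeddings"; not produced by any real model.
toy_store = [
    ("vaccine supply shortages", [0.9, 0.1, 0.0]),
    ("outbreak response delays", [0.7, 0.6, 0.1]),
    ("raccoon rabies surveillance", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of a question about vaccines

for text, score in top_k(query, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```

The unrelated rabies text points in a different direction from the query, so it scores lowest and is left out of the top two.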

In [5]:
# Chunk the text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,  # The size of each chunk in characters
    chunk_overlap=200 # Number of characters to overlap between chunks
)
chunks = text_splitter.split_text(full_text)

print(f"\nTotal number of text chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])


Total number of text chunks: 29
Sample chunk:
Morbidity and Mortality Weekly Report
U.S. Centers for Disease Control and Prevention
Weekly / Vol. 73 / No. 41 O ctober 17, 2024
INSIDE
917 T obacco Product Use Among Middle and High 
School Students — National Youth Tobacco Survey, 
United States, 2024
925 C overage with Selected Vaccines and Exemption 
Rates Among Children in Kindergarten — United 
States, 2023–24 School Year
933 Not es from the Field: Enhanced Surveillance for 
Raccoon Rabies Virus Variant and Vaccination of 
Wildlife for Management — Omaha, Nebraska, 
October 2023–July 2024 
936 QuickStats
Continuing Education examination available at  
https://www.cdc.gov/mmwr/mmwr_continuingEducation.html
Update on Vaccine-Derived Poliovirus Outbreaks —  
Worldwide, January 2023–June 2024
Apophia Namageyo-Funa, PhD1; Sharon A. Greene, PhD1; Elizabeth Henderson2; Mohamed A. T raoré3; Shahzad Shaukat, PhD3; John Paul Bigouette, PhD1; 
Jaume Jorba, PhD2; Eric Wiesen, DrPH1; Omotayo Bol

In [6]:
# Authenticate user
auth.authenticate_user()

In [None]:
# Authenticate and initialize Vertex AI

# Define your Google Cloud project
PROJECT_ID = "your-project-id"  # @param {type:"string"}
LOCATION = "us-central1" # @param {type:"string"}

# Initialize Vertex AI
vertexai.init(project=PROJECT_ID, location=LOCATION)

In [8]:
# Set up the embedding model and FAISS vector store
# Initialize the embedding model
embeddings = VertexAIEmbeddings(model_name="text-embedding-005")

# Create the vector store from our text chunks
# This will take a moment as it processes each chunk and gets its embedding
print("Creating vector store... This may take a few minutes.")
vector_store = FAISS.from_texts(chunks, embeddings)
print("Vector store created successfully!")



Creating vector store... This may take a few minutes.
Vector store created successfully!


In [9]:
# Test the vector store with a sample query
# Let's see what information it retrieves for a sample question
sample_query = "What are the main challenges in polio eradication?"
retrieved_docs = vector_store.similarity_search(sample_query, k=2) # Get the top 2 most relevant chunks

print(f"\n--- Sample Retrieval for query: '{sample_query}' ---")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Relevant Chunk {i+1} ---")
    print(doc.page_content)
    print("--------------------")


--- Sample Retrieval for query: 'What are the main challenges in polio eradication?' ---

--- Relevant Chunk 1 ---
To achieve the Global Polio Eradication Initiative’s goal of interrupt -
ing cVDPV transmission by 2026, outbreak responses must be 
timely and overcome barriers to reaching children who are missed 
by routine and supplementary immunization activities.
Limitations
The findings in this report are subject to at least two limita-
tions. First, existing gaps in polio surveillance systems might 
lead to the underestimation of cases and transmission levels 
and inaccuracies in the geographic spread of cVDPVs. Second, 
delays in the transportation of polio samples and testing by 
reference laboratories might result in underreporting of cases, 
outbreaks, and emergences during January–June 2024.
Implications for Public Health Practice
GPEI currently aims to eradicate polio by 2026; the key 
challenges are ending transmission in security-compromised 
areas and hard-to-reach commun

# **Part 4: Generating Questions for the LLM**
Based on the content of the three papers, here are five insightful questions that require the LLM to synthesize information from multiple sources.

* Question 1: According to the provided documents, what are the primary reasons for the persistence of circulating vaccine-derived poliovirus (cVDPV) outbreaks in 2023-2024, and which type is most prevalent?

* Question 2: The papers mention both Wild Poliovirus (WPV1) and vaccine-derived polioviruses (cVDPV). Compare the geographical regions primarily affected by each in 2024.

* Question 3: What specific challenges, such as conflict and vaccine supply, have hindered polio vaccination campaigns in 2024, and what are the recommended strategies to overcome them?

* Question 4: How has the COVID-19 pandemic impacted global polio vaccination coverage, and what are the long-term risks associated with these disruptions?

* Question 5: Explain the role of the novel oral polio vaccine type 2 (nOPV2). What are its benefits, and what issues have been reported regarding its supply and deployment?

# **Part 5: Answering Queries with a RAG System**

This is where everything comes together. We will now build the complete RAG pipeline to answer our generated questions.

* **Retriever:** The vector store we built (FAISS) acts as our retriever. When a user asks a question, the retriever's job is to quickly find and "retrieve" the most relevant text chunks from our source documents.

* **Prompt Template:** We don't just send the user's question to the LLM. We create a structured prompt. This prompt instructs the LLM on how to behave (e.g., "be a helpful assistant"), provides the retrieved text chunks as context, and then presents the user's question. This guides the model to base its answer only on the information we've provided.

* **LLM:** We use a Gemini model from Vertex AI (`gemini-2.0-flash-001`) to generate the answer. It receives the formatted prompt (with context) and produces a coherent, readable answer.

* **Chain:** We use LangChain to tie these components together into a `RetrievalQA` chain. This chain automates the entire process: a question goes in, and a context-aware answer comes out, along with the source chunks it was based on.

Compared with asking the LLM a question directly, this RAG approach grounds the model's response in our specific source material. That reduces the risk of hallucinations (made-up information) and keeps the answers tied to our documents, although it cannot guarantee correctness, which is why we evaluate the answers in Part 6.
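
The retrieve-then-prompt flow can be sketched without any external services. This is not LangChain's implementation; it only mirrors what a "stuff"-style chain does conceptually: fetch the top chunks, join them into one context string, and fill the prompt template. The toy corpus, the word-overlap retriever `toy_retrieve`, and the helper `build_stuffed_prompt` are hypothetical stand-ins for the FAISS retriever and the chain.

```python
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

# Toy corpus standing in for the chunks held in the vector store.
TOY_CORPUS = [
    "delayed implementation of outbreak response campaigns",
    "barriers to reaching children",
    "raccoon rabies surveillance",
]


def toy_retrieve(question: str, k: int) -> list[str]:
    """Stand-in retriever: rank chunks by number of shared lowercase words."""
    q_words = set(question.lower().split())
    ranked = sorted(
        TOY_CORPUS,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_stuffed_prompt(question: str, retrieve, k: int = 2) -> str:
    """Retrieve k chunks and 'stuff' them all into a single prompt string."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt_text = build_stuffed_prompt(
    "which barriers delayed outbreak response", toy_retrieve, k=2
)
print(prompt_text)
```

The resulting string is what would be sent to the LLM. Only the two chunks that share words with the question end up in the context; the unrelated chunk is never shown to the model.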



In [14]:
# Set up the LLM and the QA Chain
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize the Gemini LLM
llm = VertexAI(model_name="gemini-2.0-flash-001", temperature=0.1)

In [15]:
# Create a prompt template
prompt_template = """
You are a helpful assistant specialized in summarizing information from medical research papers.
Use the following pieces of context to answer the question at the end.
If you don't know the answer from the context, just say that you don't know, don't try to make up an answer.
Be concise and provide the answer based only on the provided text.

Context:
{context}

Question: {question}

Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [16]:
# Create the RetrievalQA Chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

In [17]:
# Ask one of our generated questions
question_to_ask = "According to the provided documents, what are the primary reasons for the persistence of circulating vaccine-derived poliovirus (cVDPV) outbreaks in 2023-2024, and which type is most prevalent?"

print(f"Asking question: {question_to_ask}")
result = qa_chain.invoke({"query": question_to_ask})  # invoke() replaces the deprecated direct call

# Print the results
print("\n--- Generated Answer ---")
print(result["result"])

print("\n--- Source Documents Used ---")
for doc in result["source_documents"]:
    print(f"\n- Source: {doc.page_content[:200]}...") # Print snippet of the source

Asking question: According to the provided documents, what are the primary reasons for the persistence of circulating vaccine-derived poliovirus (cVDPV) outbreaks in 2023-2024, and which type is most prevalent?

--- Generated Answer ---
The primary reasons for the persistence of cVDPV outbreaks are delayed implementation of outbreak response campaigns, low-quality campaigns, and barriers to reaching children. cVDPV type 2 is the most prevalent.


--- Source Documents Used ---

- Source: MADAGASCAR
INDONESIA
UGANDA
BU
Country or area with at 
least one detection of 
cVDPV through 
environmental surveillance
No environmental detections
Not applicable
cVDPV type 1
cVDPV type 2
Acute /f_...

- Source: by the end of 2026. Continued circulation of cVDPVs high-
lights the need for 1) increased urgency to implement prompt, 
high-quality SIAs upon detection of new cVDPV outbreaks and 
2) enhanced effort...

- Source: same dates, symbols will overlap; thus, not all isolates are visible. Data as 

# **Part 6: Using an LLM as a Judge**

How do we know if our RAG system is providing good answers? We can evaluate them manually, but this is time-consuming. A scalable alternative is to use another **LLM as an impartial "judge."**

* **The Judge's Task:** We give the judge LLM a very specific set of instructions. Its job is not to answer the original question, but to evaluate the answer generated by our RAG system.

* **Evaluation Criteria:** We define clear criteria for the judge. In this script, we ask it to assess two key aspects:

    * **Faithfulness:** Is the generated answer fully supported by the provided source documents? It should not contain information that isn't in the context.

    * **Relevance:** Does the answer directly address the user's question?

* **Structured Output:** We instruct the judge to provide its reasoning and a final verdict ("SUPPORTED" or "NOT SUPPORTED") in a structured format. This makes the evaluation easy to interpret.

Using an LLM judge automates the evaluation, so we can assess many responses quickly. The judge is itself a model and can make mistakes, so it is worth spot-checking its verdicts, for example with a deliberately unsupported answer as we do at the end of this part.
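
Because the verdict arrives inside free text, it helps to extract it programmatically. The following is a minimal sketch, assuming the judge ends its response with the `Final Verdict:` line that the judge prompt in this part requests. The helper name `parse_verdict` and the regular expression are hypothetical; real judge output may deviate from the requested format or be cut off, which is why the function can return `None`.

```python
import re
from typing import Optional

# "NOT SUPPORTED" is listed first so the longer alternative is tried before
# plain "SUPPORTED". The character class skips spaces, "*" (markdown bold)
# and "[" between the label and the verdict.
VERDICT_PATTERN = re.compile(
    r"Final Verdict:[\s*\[]*(NOT SUPPORTED|SUPPORTED)", re.IGNORECASE
)


def parse_verdict(evaluation: str) -> Optional[bool]:
    """Return True for SUPPORTED, False for NOT SUPPORTED, None if absent."""
    matches = VERDICT_PATTERN.findall(evaluation)
    if not matches:
        return None  # output was truncated or did not follow the format
    return matches[-1].upper() == "SUPPORTED"  # the last verdict wins


print(parse_verdict("Reasoning: ...\nFinal Verdict: SUPPORTED"))
print(parse_verdict("Reasoning: ...\nFinal Verdict: [NOT SUPPORTED]"))
print(parse_verdict("1. Analyze the first part of the answer"))
```

Returning `None` instead of guessing makes missing verdicts visible, so they can be counted or re-run rather than silently treated as a pass or a fail.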



In [25]:
# Set up the Judge LLM and Prompt Template
judge_llm = VertexAI(model_name="gemini-2.5-pro", temperature=0.0)

In [27]:
# Sanity check: confirm the judge model responds before using it for evaluation
prompt = "How many planets exist in the solar system?"
judge_llm.invoke(prompt)

'There are **eight** planets in our solar system.\n\nIn order from the Sun, they are:\n1.  **Mercury**\n2.  **Venus**\n3.  **Earth**\n4.  **Mars**\n5.  **Jupiter**\n6.  **Saturn**\n7.  **Uranus**\n8.  **Neptune**\n\n### What Happened to Pluto?\n\nYou might remember learning that there were nine planets. For a long time, Pluto was considered the ninth planet. However, in 2006, the International Astronomical Union (IAU) established a new definition for a planet.\n\nAccording to the IAU, a celestial body must meet three criteria to be classified as a planet:\n1.  It must orbit the Sun.\n2.  It must have enough mass to be pulled into a nearly round shape by its own gravity (hydrostatic equilibrium).\n3.  It must have "cleared the neighborhood" around its orbit, meaning it is the dominant gravitational body in its orbital path.\n\nPluto meets the first two criteria, but it fails the third. Its orbit is located in the Kuiper Belt, a region full of other icy objects, and it has not cleared th

In [28]:
judge_prompt_template = """
You are an expert evaluator. Your task is to determine if a generated answer is faithful to the provided source documents.

You will be given:
1. The original question.
2. The source documents (context) used to generate the answer.
3. The generated answer.

Your evaluation criteria are:
- **Faithfulness**: The answer must be fully supported by the information in the source documents. It should not add any new information or contradict the source.
- **Relevance**: The answer must be relevant to the original question.

Provide a step-by-step reasoning for your decision and then conclude with a final verdict in the format: "Final Verdict: [SUPPORTED or NOT SUPPORTED]".

---
Original Question:
{question}

---
Source Documents:
{context}

---
Generated Answer:
{answer}
---

Reasoning:
"""

In [29]:
# Prepare the inputs for the judge
# We will use the results from the previous step
question = result["query"]
context_docs = "\n\n".join([doc.page_content for doc in result["source_documents"]])
generated_answer = result["result"]

# Format the input for the judge
judge_input = judge_prompt_template.format(
    question=question,
    context=context_docs,
    answer=generated_answer
)

In [30]:
# Get the evaluation from the judge
print("\n--- ⚖️ Submitting to LLM Judge for Evaluation ⚖️ ---")
evaluation = judge_llm.invoke(judge_input)

# Print the judge's verdict
print("\n--- Judge's Evaluation ---")
print(evaluation)


--- ⚖️ Submitting to LLM Judge for Evaluation ⚖️ ---

--- Judge's Evaluation ---
**Reasoning:**

1.  **Analyze the first part of the answer:** "The primary reasons for the persistence of cVDPV outbreaks are delayed implementation of outbreak response campaigns, low-quality campaigns, and barriers to reaching children."
    *   The source document states: "Delayed implementation of outbreak response campaigns and low-quality campaigns have resulted in further international spread." This directly supports the first two reasons.
    *   The source document also states: "Stopping all cVDPV transmission requires effectively increasing population immunity by overcoming barriers to reaching children." This directly supports the third reason.
    *   Another sentence reinforces this: "Continued circulation of cVDPVs highlights the need for... enhanced efforts... to vaccinate children in security-compromised areas and in hard-to-reach communities."

2.  **Analyze the second part of the answer:

In [31]:
# We use the same setup as before, but with our new unsupported answer.

# --- Inputs for the judge ---
# The question and context remain the same from our previous RAG query.
question = result["query"]
context_docs = "\n\n".join([doc.page_content for doc in result["source_documents"]])

# Here is our manually crafted unsupported answer
unsupported_answer = "The primary reasons for the persistence of cVDPV outbreaks are low immunization coverage in certain areas and interruptions to vaccination campaigns. The most prevalent type is cVDPV2. A new, more resilient strain, cVDPV3, also emerged in late 2024 in South America, causing significant concern."

# --- Format the prompt for the judge ---
judge_input = judge_prompt_template.format(
    question=question,
    context=context_docs,
    answer=unsupported_answer
)

# --- Get the evaluation ---
print("\n--- ⚖️ Submitting MODIFIED Answer to LLM Judge ⚖️ ---")
evaluation = judge_llm.invoke(judge_input)

# --- Print the judge's verdict ---
print("\n--- Judge's Evaluation of Unsupported Answer ---")
print(evaluation)


--- ⚖️ Submitting MODIFIED Answer to LLM Judge ⚖️ ---

--- Judge's Evaluation of Unsupported Answer ---
**Step-by-step reasoning:**

1.  **Analyze the Generated Answer's Claims:**
    *   Claim 1: The primary reasons for the persistence of cVDPV outbreaks are low immunization coverage in certain areas and interruptions to vaccination campaigns.
    *   Claim 2: The most prevalent type is cVDPV2.
    *   Claim 3: A new, more resilient strain, cVDPV3, also emerged in late 2024 in South America, causing significant concern.

2.  **Verify Each Claim Against the Source Documents:**
    *   **Claim 1:** The source documents state that cVDPVs "can emerge and cause paralysis in areas with low population poliovirus immunity." They also highlight the need to overcome "barriers to reaching children" and point to "Delayed implementation of outbreak response campaigns and low-quality campaigns" as reasons for spread. This claim is **supported**.
    *   **Claim 2:** The source documents state that