## Search-gather-response: experimenting with agentic RAG for public health research
This notebook implements an agentic RAG system for public health research.  
Mayer Antoine

In this project, we reproduce an agentic AI system for retrieval-augmented generation (RAG), as described in the paper Lála, Jakub, et al. "PaperQA: Retrieval-Augmented Generative Agent for Scientific Research." arXiv abs/2312.07559 (2023), using OpenAI's Agents SDK. The authors developed PaperQA, an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers. This follows a **search-gather-response** framework.



Our implementation uses a ChromaDB vector store and LangChain. While we don't follow the paper's implementation exactly (the authors developed and used an in-house AI framework), we demonstrate its core principles. This notebook shows indexing and simple queries using the agent implemented in `rag_agent.py`.
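
The search-gather-response framework described above can be illustrated with a toy sketch. This is not the code in `rag_agent.py`: the names `AgentState`, `search`, `gather`, and `respond` are hypothetical, word overlap stands in for both vector search and the LLM relevance score, and the steps run in a fixed order, whereas an agent would choose which step to call next.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Working memory shared by the three steps."""
    question: str
    passages: list = field(default_factory=list)
    evidence: list = field(default_factory=list)


# Tiny in-memory corpus standing in for the indexed articles.
CORPUS = [
    "School-based physical activity programs lower diabetes risk in youth.",
    "Community health workers support diabetes prevention in rural areas.",
    "Influenza vaccination coverage among adults varies by state.",
]


def tokens(text):
    """Lowercase a string and split it into a set of punctuation-free words."""
    return {word.strip(".,?!").lower() for word in text.split()}


def overlap(question, passage):
    """Toy relevance score: number of words shared by question and passage."""
    return len(tokens(question) & tokens(passage))


def search(state, k=2):
    """Search: keep the k passages that best match the question."""
    ranked = sorted(CORPUS, key=lambda p: overlap(state.question, p), reverse=True)
    state.passages = ranked[:k]


def gather(state, cutoff=3):
    """Gather: keep only passages whose score reaches the cutoff."""
    scored = [(overlap(state.question, p), p) for p in state.passages]
    state.evidence = [(s, p) for s, p in scored if s >= cutoff]


def respond(state):
    """Response: build an answer from the evidence, or admit there is none."""
    if not state.evidence:
        return "I cannot answer from the available sources."
    return " ".join(p for _, p in state.evidence)


def answer(question):
    """Run the three steps once, in order, and return the answer."""
    state = AgentState(question)
    search(state)
    gather(state)
    return respond(state)


print(answer("diabetes prevention in rural areas"))
```

Separating the gather step from the search step is what lets the system discard passages that were retrieved but turn out to be irrelevant before any answer is written.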

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
from sodapy import Socrata
import pathlib
from pathlib import Path
from typing import List, Dict, Any, Optional, Union, Tuple
from langchain_huggingface import HuggingFaceEmbeddings as lgHuggingFaceEmbeddings
from rag_agent import AgentConfig, AgenticRAG
from loader import download_file, get_data_directory, extract_zip_files, load_html_files
from vectorstore import VectorStorePaper

### Download the data
Download the CDC public health corpus of scientific articles from the journal Preventing Chronic Disease (PCD), 2004-2023.

In [3]:

_URL_PCD = "https://data.cdc.gov/api/views/ut5n-bmc3/files/c0594869-ba74-4c26-bf54-b2dab3dff971?download=true&filename=pcd_2004-2023.zip"
HTML_ZIP_DIRECTORY="./cdc-corpus-data/zip"

if not Path(HTML_ZIP_DIRECTORY).exists():
    print("No data.. downloading")
    download_file(url=_URL_PCD,file_name="pcd.zip")

No data.. downloading


20979it [00:23, 910.59it/s]
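
The progress counter in the output above suggests that the archive is written in pieces rather than loaded into memory at once. The internals of `download_file` are not shown in this notebook, so the following is only a sketch of chunked downloading with the standard library. The helper name `stream_to_file` and its parameters are hypothetical.

```python
import urllib.request
from pathlib import Path


def stream_to_file(url, destination, chunk_size=8192):
    """Copy the resource at `url` to `destination` in fixed-size chunks.

    Returns the number of bytes written. Hypothetical helper, not the
    notebook's `download_file`.
    """
    destination = Path(destination)
    # Create the target folder if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the stream is exhausted
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Reading in chunks keeps memory use bounded by `chunk_size`, whatever the size of the archive.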


### Extract and Load HTML Files
Extract the ZIP files and load the 2,914 HTML articles for processing.

In [4]:
data_dir = get_data_directory()

extract_zip_files()
target_dir = data_dir / "html-outputs/pcd"

# If the HTML files were downloaded and extracted, load them
if target_dir.exists():
    data_html = load_html_files()

Loaded 2914 HTML articles

### Chunking and Indexing HTML Files


### Create the Vector store with Chroma
Initialize the ChromaDB vector store. An existing index is reused when one is found on disk; otherwise a new one is created, as in the run below.

In [5]:
# Configuration: Set to False to reuse existing index, True to recreate
RECREATE_INDEX = False  # Change to True to force index recreation

In [6]:
# Set up ChromaDB client
CHROMA_PERSIST_DIRECTORY="./cdc-corpus-data/chroma_db"
vector_store = VectorStorePaper(html_articles=data_html,
                 persist_directory=CHROMA_PERSIST_DIRECTORY,
                 recreate_index=RECREATE_INDEX)

# Display index status
if vector_store.index_exists:
    doc_count = vector_store.get_document_count()
    print(f"Index contains {doc_count} document chunks")
else:
    print("No existing index found")

Creating new index at ./cdc-corpus-data/chroma_db
No existing index found
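
The interaction between `RECREATE_INDEX` and an index already on disk can be sketched as a single decision. How `VectorStorePaper` makes this decision is not shown here, so the function `should_rebuild` and the "non-empty directory means an index exists" test are assumptions made for illustration.

```python
from pathlib import Path


def should_rebuild(persist_directory, recreate_index):
    """Return True when documents need to be chunked and indexed again.

    Assumption: an index "exists" if the persist directory is present
    and contains at least one file or folder.
    """
    directory = Path(persist_directory)
    index_exists = directory.is_dir() and any(directory.iterdir())
    # Rebuild when forced, or when there is nothing to reuse.
    return recreate_index or not index_exists
```

Under this rule a first run always builds the index, even with `recreate_index=False`, which matches the "No existing index found" message above.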


### Chunking the HTML Files
Process HTML documents into smaller chunks for vector embedding (skipped if using existing index).

In [7]:
%%time
# Only chunk documents if we need to recreate the index
if vector_store.should_process_documents():
    print("Chunking documents...")
    documents = vector_store.chunking()
    print(f"Created {len(documents)} document chunks")
else:
    print("Skipping document chunking (using existing index)")
    documents = []

Chunking documents...
Created 94310 document chunks
CPU times: user 1min 59s, sys: 1.87 s, total: 2min 1s
Wall time: 2min 4s
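
The run above produced 94,310 chunks from 2,914 articles, or roughly 32 chunks per article. The splitting strategy inside `vector_store.chunking()` is not shown in this notebook. A common approach, sketched here as an assumption, is a sliding character window with overlap, so that a sentence cut at one boundary still appears whole in a neighbouring chunk. The function name and default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into windows of `chunk_size` characters.

    Consecutive windows share `overlap` characters. Hypothetical helper,
    not the notebook's chunking code.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

Larger overlaps reduce the chance of splitting a key statement across chunks, at the cost of more chunks to embed and store.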


### Indexing HTML Files in the Vector Database
Create and store vector embeddings for document chunks in ChromaDB (skipped if using existing index).

In [8]:
%%time
# Only index documents if we need to recreate the index
if vector_store.should_process_documents():
    print(f"Indexing {len(documents)} documents (this may take several minutes)...")
    vector_store.index_document(documents)  # Embeds and stores every chunk
    print("Indexing completed!")
else:
    print("Skipping document indexing (using existing index)")
    print("Ready to perform searches!")

Indexing 94310 documents (this may take several minutes)...


Creating embeddings: 100%|██████████| 94310/94310 [12:55<00:00, 121.68doc/s]  

Indexing completed!
CPU times: user 3min 38s, sys: 53.8 s, total: 4min 32s
Wall time: 12min 55s
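
Indexing turns each chunk into a vector so that a query can later be matched by similarity. The toy sketch below shows that idea without ChromaDB or a neural model: a word-count vector stands in for the embedding and cosine similarity ranks the chunks. The class `ToyVectorIndex` and the function `embed` are hypothetical and are not part of `VectorStorePaper`.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a word-count vector (a real system uses a neural model)."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """In-memory stand-in for a vector store."""

    def __init__(self):
        self._items = []  # list of (chunk, vector) pairs

    def add(self, chunks):
        """Embed each chunk once and keep it alongside its vector."""
        self._items.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query, k=3):
        """Return the k (score, chunk) pairs most similar to the query."""
        q = embed(query)
        scored = [(cosine(q, vec), chunk) for chunk, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = ToyVectorIndex()
index.add([
    "diabetes prevention programs",
    "influenza vaccination rates",
    "rural diabetes care",
])
best_score, best_chunk = index.search("diabetes prevention", k=1)[0]
print(best_chunk)
```

Embedding happens once at indexing time, which is why the indexing cell is slow while later searches only embed the query.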





### Creating and Configuring the Agent
Configure the AgenticRAG agent with search parameters and relevance settings.

In [22]:

# Configure the agent's search and relevance parameters
config = AgentConfig(
    collection_filter='pcd',
    relevance_cutoff=8,
    search_k=10,
    max_evidence_pieces=5,
    max_search_attempts=3
)

# Initialize agentic RAG
agentic_rag = AgenticRAG(vector_store=vector_store, config=config)


async def ask(config: AgentConfig, agent: AgenticRAG, question: str):
    # Ask the question through the agent passed in, not the global instance
    print(f"Question: {question}\n")
    answer = await agent.ask_question(question, max_turns=10)

    return answer
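
The roles of `relevance_cutoff` and `max_evidence_pieces` can be sketched as a filter over scored passages. This is an interpretation of the parameter names, not the logic in `rag_agent.py`. The `Evidence` class, the `select_evidence` function, the integer score where higher means more relevant, and the inclusive cutoff are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str   # identifier of the paper the passage came from
    summary: str  # summary of the passage with respect to the question
    score: int    # assumed relevance score, higher is more relevant


def select_evidence(candidates, relevance_cutoff=8, max_evidence_pieces=5):
    """Keep passages at or above the cutoff, best first, up to the cap."""
    kept = [e for e in candidates if e.score >= relevance_cutoff]
    kept.sort(key=lambda e: e.score, reverse=True)
    return kept[:max_evidence_pieces]


candidates = [
    Evidence("paper-a", "school programs", 9),
    Evidence("paper-b", "unrelated topic", 7),
    Evidence("paper-c", "community support", 8),
    Evidence("paper-d", "lifestyle change", 10),
]
selected = select_evidence(candidates, relevance_cutoff=8, max_evidence_pieces=2)
print([e.source for e in selected])
# prints ['paper-d', 'paper-a']
```

A high cutoff favours precision. If too few passages survive it, a retry budget such as `max_search_attempts` gives the agent room to search again with a reformulated query.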


### Asking questions
Demonstrate the agent answering a complex question about diabetes prevention in rural adolescents.

In [23]:
question = "What are the most common methods used in diabetes prevention to support adolescents in rural areas in the US?"
answer = await ask(config=config,agent = agentic_rag, question=question)

Question: What are the most common methods used in diabetes prevention to support adolescents in rural areas in the US?

🟢 [Search] Starting paper search for question:common methods used in diabetes prevention for adolescents in rural areas in the US
🟢 [Search] Paper search returned 10 passages from papers
🟢 [Status] Paper Count=10 | Relevant Papers=0 Current Evidence=0
🟢 [Gather] Gathering evidence for question: common methods used in diabetes prevention for adolescents in rural areas in the US
🟢 [Gather] Finished gathering evidence for question: common methods used in diabetes prevention for adolescents in rural areas in the US
{'Paper': 10, 'Relevant': 5, 'Evidence': 5}


### Display the Final Answer
Show the final comprehensive answer generated by the agentic RAG system.

In [24]:
print(f"\nFinal Answer: {answer}")


Final Answer: Common methods used in diabetes prevention for adolescents in rural areas of the US include:

1. **Lifestyle Change Programs**: The CDC's National Diabetes Prevention Program promotes structured interventions focusing on dietary modifications and increased physical activity, achieving a significant reduction in diabetes incidence (30% to 60%) among at-risk adolescents.

2. **Community Engagement**: Programs like the Together On Diabetes trial emphasize creating local support systems, integrating family and community resources to enhance participation and sustainability.

3. **School-Based Initiatives**: Tailored physical activity programs in schools, especially for underserved populations, improve physical fitness and potentially lower diabetes risk. For instance, interventions for American Indian youth have shown positive impacts on glucose levels and fitness, despite mixed results on BMI.

4. **Pharmacy Collaborations**: Collaborating with pharmacies to extend access t

In [20]:
import os
from dotenv import load_dotenv
load_dotenv(override=True)
print(os.getenv('DEFAULT_EMBEDDING_MODEL'))

all-MiniLM-L6-v2
