# Automated Literature Review Generation using Agentic RAG

This notebook demonstrates an end-to-end pipeline for automatically generating "Related Work" sections for scientific papers using:
- **Hybrid Retrieval**: Combining semantic search (vector embeddings) and keyword search (BM25)
- **Agentic Relevance Scoring**: Using LLM agents to evaluate paper relevance with structured reasoning
- **Automated Synthesis**: Generating coherent literature reviews with proper citations

## Pipeline Overview
1. Load biomedical abstracts corpus
2. Create vector store with hybrid retrieval capabilities
3. Define research query/abstract
4. Retrieve candidate papers using hybrid search
5. Score papers using debate-style relevance agent
6. Select top-k most relevant papers
7. Generate cohesive "Related Work" section

---
## 1. Setup & Configuration

Import dependencies and configure pipeline parameters. Adjust these settings to customize the pipeline behavior.

In [1]:
# Auto-reload modules for development
%load_ext autoreload
%autoreload 2

In [2]:
# Standard library imports
import os
import asyncio
from typing import List, Dict, Any
from pprint import pprint

# Data manipulation
import pandas as pd
import numpy as np

# OpenAI and agents
import openai
from agents import Agent, Runner
from pydantic import BaseModel, Field
from typing import Annotated

# Environment and display
from dotenv import load_dotenv
from IPython.display import Markdown, display, HTML

# Local modules
from vectorstore import VectorStoreAbstract

In [3]:
# ============================================================================
# CONFIGURATION PARAMETERS
# ============================================================================

# Vector Store Configuration
CHROMA_PERSIST_DIRECTORY = "./corpus-data/chroma_db"
RECREATE_INDEX = False  # Set to True to rebuild the index from scratch

# Retrieval Configuration
HYBRID_SEARCH_K = 50  # Number of papers to retrieve using hybrid search

# Relevance Scoring Configuration
NUM_ABSTRACTS_TO_SCORE = 3  # Set to None to score all retrieved abstracts, or set a number for testing (e.g., 5, 10, 20)
RELEVANCE_MODEL = "gpt-4o-mini"  # Model for relevance scoring agent

# Top-K Selection Configuration
TOP_K_PAPERS = 3  # Number of top-ranked papers to include in related work

# Related Work Generation Configuration
GENERATION_MODEL = "gpt-4o-mini"  # Model for generating related work section

print("Configuration loaded successfully!")
print(f"  - Retrieval: Top {HYBRID_SEARCH_K} papers using hybrid search")
print(f"  - Scoring: {'All' if NUM_ABSTRACTS_TO_SCORE is None else NUM_ABSTRACTS_TO_SCORE} abstracts will be scored")
print(f"  - Selection: Top {TOP_K_PAPERS} papers for related work")
print(f"  - Models: {RELEVANCE_MODEL} (scoring), {GENERATION_MODEL} (generation)")

Configuration loaded successfully!
  - Retrieval: Top 50 papers using hybrid search
  - Scoring: 3 abstracts will be scored
  - Selection: Top 3 papers for related work
  - Models: gpt-4o-mini (scoring), gpt-4o-mini (generation)


In [4]:
# Load environment variables and initialize OpenAI client
load_dotenv(override=True)
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
openai_client = openai.OpenAI()

print("✓ Environment loaded")
print("✓ OpenAI client initialized")

✓ Environment loaded
✓ OpenAI client initialized


---
## 2. Data Loading & Preparation

Load the biomedical abstracts corpus and prepare it for indexing. Each abstract contains:
- **id**: Unique identifier
- **title**: Paper title
- **abstract**: Paper abstract
- **title_abstract**: Concatenated title and abstract for retrieval

In [5]:
# Load abstracts from CSV and shuffle
all_abstracts = pd.read_csv('./abstracts_rag.csv').sample(frac=1, random_state=42)

print(f"Loaded {len(all_abstracts)} abstracts from corpus")
print(f"\nDataset columns: {list(all_abstracts.columns)}")
print(f"\nFirst few abstracts:")
all_abstracts.head(3)

Loaded 78 abstracts from corpus

Dataset columns: ['id', 'title', 'abstract']

First few abstracts:


Unnamed: 0,id,title,abstract
33,34,Reshaping Biomedical Scientific Literature in ...,Biomedical Question Answering (BQA) poses spec...
0,1,PaperQA: Retrieval-Augmented Generative Agent ...,Large Language Models (LLMs) generalize well a...
34,35,"Attention is all you need, A",The dominant sequence transduction models are ...


In [6]:
# Concatenate title and abstract for better retrieval
all_abstracts['title_abstract'] = all_abstracts['title'] + all_abstracts['abstract']

# Convert to list of dictionaries for vector store
samples_abstracts = [
    v for k, v in all_abstracts[['title_abstract', 'id']].reset_index(drop=True).T.to_dict().items()
]

print(f"✓ Prepared {len(samples_abstracts)} abstracts for indexing")
print(f"\nSample abstract structure:")
print(f"  - ID: {samples_abstracts[0]['id']}")
print(f"  - Text length: {len(samples_abstracts[0]['title_abstract'])} characters")

✓ Prepared 78 abstracts for indexing

Sample abstract structure:
  - ID: 34
  - Text length: 1325 characters


---
## 3. Vector Store Initialization

Initialize ChromaDB vector store with hybrid retrieval capabilities:
- **Semantic Search**: Uses HuggingFace embeddings (all-MiniLM-L6-v2)
- **Keyword Search**: Uses BM25 algorithm
- **Chunking**: Splits abstracts into 150-character chunks with 20-character overlap

In [7]:
# Initialize vector store
vector_store = VectorStoreAbstract(
    abstracts=samples_abstracts,
    persist_directory=CHROMA_PERSIST_DIRECTORY,
    recreate_index=RECREATE_INDEX
)

# Display index status
if vector_store.index_exists:
    doc_count = vector_store.get_document_count()
    print(f"✓ Using existing index at {CHROMA_PERSIST_DIRECTORY}")
    print(f"  Index contains {doc_count} document chunks")
else:
    print(f"✓ Created new index at {CHROMA_PERSIST_DIRECTORY}")

Creating new index at ./corpus-data/chroma_db
✓ Created new index at ./corpus-data/chroma_db


In [8]:
%%time
# Chunk documents if needed (only when creating new index or recreating)
if vector_store.should_process_documents():
    print("Chunking documents...")
    documents = vector_store.chunking()
    print(f"✓ Created {len(documents)} document chunks")
else:
    print("✓ Skipping document chunking (using existing index)")
    documents = []

Chunking documents...


Chunking documents: 100%|██████████| 78/78 [00:00<00:00, 6881.40article/s]

✓ Created 1138 document chunks
CPU times: user 9.93 ms, sys: 15.6 ms, total: 25.5 ms
Wall time: 25.2 ms





In [9]:
%%time
# Index documents if needed
if vector_store.should_process_documents():
    print(f"Indexing {len(documents)} documents (this may take several minutes)...")
    vector_store.index_document(documents)
    print("✓ Indexing completed!")
    print(f"  Total chunks indexed: {vector_store.get_document_count()}")
else:
    print("✓ Skipping document indexing (using existing index)")
    print(f"  Ready to perform searches!")
    print(f"  Index contains {vector_store.get_document_count()} chunks")

Indexing 1138 documents (this may take several minutes)...


Creating embeddings: 100%|██████████| 1138/1138 [00:02<00:00, 413.27doc/s]

✓ Indexing completed!
  Total chunks indexed: 0
CPU times: user 1.5 s, sys: 546 ms, total: 2.04 s
Wall time: 2.77 s





---
## 4. Research Query Definition

Define the research query or abstract for which we want to generate a literature review. This will be used to:
1. Retrieve relevant papers from the corpus
2. Score the relevance of each retrieved paper
3. Generate the final "Related Work" section

In [10]:
# Define the research query/abstract
query = """
Retrieval-augmented generation (RAG) systems are emerging as effective tools for biomedical literature. 
However, their performance in this domain is not yet generalizable. 
We propose a new strategy for high-performing RAG applied to biomedical question answering. 
This approach would allow the wider public and public health professionals to access evidence from scientific literature in easy-to-understand language.
""".strip()

# Display the query
display(Markdown("### Research Query/Abstract"))
display(Markdown(f"_{query}_"))
print(f"\nQuery length: {len(query)} characters")

### Research Query/Abstract

_Retrieval-augmented generation (RAG) systems are emerging as effective tools for biomedical literature. 
However, their performance in this domain is not yet generalizable. 
We propose a new strategy for high-performing RAG applied to biomedical question answering. 
This approach would allow the wider public and public health professionals to access evidence from scientific literature in easy-to-understand language._


Query length: 419 characters


---
## 5. Hybrid Retrieval

Perform hybrid search combining:
- **Semantic similarity**: Vector search using embeddings
- **Keyword matching**: BM25 ranking

The ensemble retriever combines both methods with equal weights (0.5, 0.5) to balance semantic understanding and keyword relevance.

In [11]:

# Perform hybrid search
rs = vector_store.hybrid_search(query, k=HYBRID_SEARCH_K)
#rs = vector_store.semantic_search(query, k=HYBRID_SEARCH_K)

print(rs)

# Extract unique document IDs from results
retrieved_docs = {item.metadata['id'] for item in rs}

# Filter abstracts DataFrame to get full information for retrieved papers
retrieved_abstracts = all_abstracts[all_abstracts['id'].isin(retrieved_docs)].copy()

print(f"✓ Retrieved {len(retrieved_abstracts)} unique papers (from {HYBRID_SEARCH_K} chunks)")
print(f"\nTop 5 retrieved papers:")
display(retrieved_abstracts[['id', 'title']].head())

[Document(id='8f729245-0bcb-434b-92d9-12b140858776', metadata={'id': 23}, page_content='. Here we explored the use of a retrieval-augmented generation (RAG) model which we tested on literature specific to a biomedical research area'), Document(id='39229b77-e288-4148-b603-203e56953712', metadata={'id': 33}, page_content='. The findings underscore the potential of RAG-enhanced language models to bridge the gap between complex biomedical literature and accessible public'), Document(id='675b284d-8d0b-405c-b775-bf634b3ce1be', metadata={'id': 33}, page_content='Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG)This work presents a Biomedical Literature Question Answering (Q&A) system'), Document(metadata={'id': 33}, page_content='. Addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research'), Document(id='8f6fdb73-dc9f-4e12-867a-3f3ac97232da', metadata={'id': 32}, page_content='RAG-BioQA Retrieval-Augmented G

Unnamed: 0,id,title
33,34,Reshaping Biomedical Scientific Literature in ...
0,1,PaperQA: Retrieval-Augmented Generative Agent ...
49,50,Accessing Biomedical Literature in the Current...
22,23,Improving accuracy of gpt-3/4 results on biome...
18,19,Biobert: a pre-trained biomedical language rep...


In [12]:
# Display retrieval statistics
print(f"Retrieval Statistics:")
print(f"  - Total papers in corpus: {len(all_abstracts)}")
print(f"  - Papers retrieved: {len(retrieved_abstracts)}")
print(f"  - Retrieval rate: {len(retrieved_abstracts) / len(all_abstracts) * 100:.1f}%")
print(f"\nSample retrieved abstracts:")
display(retrieved_abstracts[['id', 'title', 'abstract']].head(3))

Retrieval Statistics:
  - Total papers in corpus: 78
  - Papers retrieved: 23
  - Retrieval rate: 29.5%

Sample retrieved abstracts:


Unnamed: 0,id,title,abstract
33,34,Reshaping Biomedical Scientific Literature in ...,Biomedical Question Answering (BQA) poses spec...
0,1,PaperQA: Retrieval-Augmented Generative Agent ...,Large Language Models (LLMs) generalize well a...
49,50,Accessing Biomedical Literature in the Current...,Biomedical and life sciences literature is uni...


---
## 6. Relevance Agent Setup

Configure the relevance scoring agent that evaluates each paper using a debate-style approach:
1. Generate arguments **for** including the paper
2. Generate arguments **against** including the paper
3. Extract supporting quotes from the abstract
4. Assign a relevance probability score (1-100)

This structured reasoning helps ensure high-quality relevance judgments.

In [13]:
# Define the structured output model for relevance scoring
class AbstractRelevance(BaseModel):
    """Structured relevance assessment for a candidate paper."""
    id: int
    arguments_for: str
    arguments_for_quotes: list[str]
    arguments_against: str
    arguments_against_quotes: list[str]
    probability_score: Annotated[
        float, 
        Field(ge=1.0, le=100.0, description="A relevance score between 1 and 100.")
    ]

print("✓ AbstractRelevance model defined")

✓ AbstractRelevance model defined


In [14]:
def create_relevance_agent():
    """Create an agent that scores paper relevance using debate-style reasoning."""
    
    INSTRUCTIONS_DEBATE_RANKING = """ 
    You are a helpful research assistant who is helping with literature review of a research idea. 
    You will be given a query or research idea and a candidate reference abstract.
    Your task is to score reference abstract based on their relevance to the query. Please make sure you read and understand these instructions carefully. 
    Please keep this document open while reviewing, and refer to it as needed.

    ## Instruction: 
    Use the following steps to rank the reference papers:

    1. Generate arguments for including this reference abstract in the literature review.

    2. Generate arguments against including this reference abstract in the literature review.

    3. Extract relevant sentences from the candidate paper abstract to support each argument.

    4. Then, provide a score between 1 and 100 (up to two decimal places) that is proportional to the probability 
    of a paper with the given query including the candidate reference paper in its literature review. 

    Important:
    - Put the extracted sentences in quotes
    - You can use the information in other candidate papers when generating the arguments for a candidate paper
    - Generate arguments and probability for each paper separately
    - Do not generate anything else apart from the probability and the arguments
    - Follow this process even if a candidate paper happens to be identical or near-perfect match to the query abstract

    Your Response: """

    relevance_agent = Agent(
        name="RelevanceAgent",
        instructions=INSTRUCTIONS_DEBATE_RANKING,
        model=RELEVANCE_MODEL,
        output_type=AbstractRelevance
    )
    
    return relevance_agent

print("✓ Relevance agent factory created")

✓ Relevance agent factory created


In [15]:
async def relevance_summary(id: int, query: str, reference_paper: str) -> AbstractRelevance:
    """Score a single paper's relevance to the query using the relevance agent.
    
    Args:
        id: Paper ID
        query: Research query/abstract
        reference_paper: Candidate paper's title and abstract
    
    Returns:
        AbstractRelevance object with scoring and reasoning
    """
    relevance_agent = create_relevance_agent()
    
    user_instructions = f"""
For this query abstract with id={id}

Given the query abstract: {query}

Given the candidate reference paper abstract: {reference_paper}

Your Reference Abstract Relevance:
"""
    
    result = await Runner.run(relevance_agent, input=user_instructions)
    return result.final_output

print("✓ Relevance scoring function defined")

✓ Relevance scoring function defined


---
## 7. Parallel Relevance Scoring

Score all retrieved papers in parallel using async execution for efficiency. Each paper is evaluated independently by the relevance agent.

**Note**: Adjust `NUM_ABSTRACTS_TO_SCORE` in the configuration section to limit the number of papers scored (useful for testing).

In [16]:
async def gather_abstract_relevance(retrieved_abstracts: pd.DataFrame, num_to_score: int = None) -> List[AbstractRelevance]:
    """Score multiple abstracts in parallel.
    
    Args:
        retrieved_abstracts: DataFrame of retrieved papers
        num_to_score: Number of abstracts to score (None = all)
    
    Returns:
        List of AbstractRelevance objects
    """
    # Select subset if specified
    if num_to_score is not None:
        abstracts_to_score = retrieved_abstracts.head(num_to_score)
        print(f"Scoring {num_to_score} abstracts (configured limit)")
    else:
        abstracts_to_score = retrieved_abstracts
        print(f"Scoring all {len(abstracts_to_score)} retrieved abstracts")
    
    # Create async tasks for parallel execution
    tasks = [
        asyncio.create_task(
            relevance_summary(
                id=item['id'],
                query=query,
                reference_paper=item['title_abstract']
            )
        )
        for index, item in abstracts_to_score[['id', 'title_abstract']].iterrows()
    ]
    
    print(f"Executing {len(tasks)} relevance scoring tasks in parallel...")
    results = await asyncio.gather(*tasks)
    
    return results

print("✓ Parallel scoring function defined")

✓ Parallel scoring function defined


In [17]:
%%time
# Execute relevance scoring (handles both Jupyter notebook and async contexts)
try:
    # Try to get existing event loop (in Jupyter)
    loop = asyncio.get_event_loop()
    if loop.is_running():
        # If loop is already running (Jupyter), use nest_asyncio or create task
        import nest_asyncio
        nest_asyncio.apply()
        results = loop.run_until_complete(
            gather_abstract_relevance(retrieved_abstracts, NUM_ABSTRACTS_TO_SCORE)
        )
    else:
        results = loop.run_until_complete(
            gather_abstract_relevance(retrieved_abstracts, NUM_ABSTRACTS_TO_SCORE)
        )
except RuntimeError:
    # If no event loop exists, create one
    results = asyncio.run(
        gather_abstract_relevance(retrieved_abstracts, NUM_ABSTRACTS_TO_SCORE)
    )

print(f"\n✓ Completed scoring {len(results)} abstracts")

Scoring 3 abstracts (configured limit)
Executing 3 relevance scoring tasks in parallel...

✓ Completed scoring 3 abstracts
CPU times: user 156 ms, sys: 38.4 ms, total: 194 ms
Wall time: 6.99 s


In [18]:
# Display scoring statistics
scores = [abs.probability_score for abs in results]

print("Relevance Scoring Statistics:")
print(f"  - Papers scored: {len(scores)}")
print(f"  - Mean score: {np.mean(scores):.2f}")
print(f"  - Std dev: {np.std(scores):.2f}")
print(f"  - Min score: {np.min(scores):.2f}")
print(f"  - Max score: {np.max(scores):.2f}")
print(f"  - Median score: {np.median(scores):.2f}")

# Show sample of results
print(f"\nSample relevance assessments:")
for i, result in enumerate(results[:3]):
    print(f"\n  Paper ID {result.id} (Score: {result.probability_score:.2f}):")
    print(f"    For: {result.arguments_for[:100]}...")
    print(f"    Against: {result.arguments_against[:100]}...")

Relevance Scoring Statistics:
  - Papers scored: 3
  - Mean score: 78.33
  - Std dev: 9.43
  - Min score: 65.00
  - Max score: 85.00
  - Median score: 85.00

Sample relevance assessments:

  Paper ID 34 (Score: 85.00):
    For: The candidate paper explores the application of Retrieval-Augmented Generation (RAG) pipelines speci...
    Against: While the candidate paper offers insights into RAG in biomedical contexts, it does not propose a new...

  Paper ID 1 (Score: 85.00):
    For: The candidate paper directly addresses Retrieval-Augmented Generation (RAG) systems applied to scien...
    Against: While the candidate paper discusses RAG in relation to general scientific literature, it does not sp...

  Paper ID 50 (Score: 65.00):
    For: The candidate reference abstract discusses the importance of accessing biomedical literature and hig...
    Against: The reference abstract does not specifically address retrieval-augmented generation (RAG) or its app...


---
## 8. Top-K Selection

Select the top-k most relevant papers based on their probability scores. These papers will be used to generate the "Related Work" section.

In [19]:
def get_top_k_abstracts(results: List[AbstractRelevance], k: int = 10) -> List[tuple]:
    """Select top-k papers by relevance score.
    
    Args:
        results: List of AbstractRelevance objects
        k: Number of top papers to select
    
    Returns:
        List of (id, score) tuples sorted by score descending
    """
    scores = [(abs.id, abs.probability_score) for abs in results]
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    return sorted_scores[:k]

# Get top-k papers
top_k_scores = get_top_k_abstracts(results, k=TOP_K_PAPERS)

print(f"✓ Selected top {TOP_K_PAPERS} papers by relevance score")

✓ Selected top 3 papers by relevance score


In [20]:
# Extract top-k paper IDs and get full information
top_k_id = [id for id, score in top_k_scores]
top_k_abstracts = retrieved_abstracts[retrieved_abstracts['id'].isin(top_k_id)].copy()

# Add scores to DataFrame for display
score_dict = {id: score for id, score in top_k_scores}
top_k_abstracts['relevance_score'] = top_k_abstracts['id'].map(score_dict)
top_k_abstracts = top_k_abstracts.sort_values('relevance_score', ascending=False)

print(f"Top {TOP_K_PAPERS} Papers Selected for Related Work:")
print("=" * 80)
for idx, row in top_k_abstracts.iterrows():
    print(f"\n[{row['id']}] Score: {row['relevance_score']:.2f}")
    print(f"Title: {row['title']}")
    print(f"Abstract: {row['abstract'][:200]}...")

# Display as DataFrame
print("\n" + "=" * 80)
display(top_k_abstracts[['id', 'relevance_score', 'title']])

Top 3 Papers Selected for Related Work:

[34] Score: 85.00
Title: Reshaping Biomedical Scientific Literature in a RAG Pipeline for Question Answering
Abstract: Biomedical Question Answering (BQA) poses specific challenges due to the specialized vocabulary and complex semantic structures of biomedical literature. Large Language Models (LLMs) have shown great ...

[1] Score: 85.00
Title: PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
Abstract: Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and uninterpretability, making it difficult to assess their accuracy without ground-truth. Retrieval-...

[50] Score: 65.00
Title: Accessing Biomedical Literature in the Current Information Landscape
Abstract: Biomedical and life sciences literature is unique because of its exponentially increasing volume and interdisciplinary nature. Biomedical literature access is essential for several types of users incl...



Unnamed: 0,id,relevance_score,title
33,34,85.0,Reshaping Biomedical Scientific Literature in ...
0,1,85.0,PaperQA: Retrieval-Augmented Generative Agent ...
49,50,65.0,Accessing Biomedical Literature in the Current...


---
## 9. Related Work Generation

Generate a cohesive "Related Work" section using the top-k papers. The generation agent:
- Creates a coherent narrative connecting the papers
- Performs critical analysis comparing strengths and weaknesses
- Motivates the proposed approach in context of prior work
- Cites papers using [id] format
- Avoids copying abstracts verbatim

In [21]:
# Define instructions for related work generation
INSTRUCTIONS_RELATED_WORK = """ 
You are an expert research assistant who is helping with literature review for a research idea or abstract. 
You will be provided with an abstract or research idea and a list of reference abstracts. 
Your task is to write the related work section of the document using only the provided reference abstracts. 
Please write the related work section creating a cohesive storyline by doing a critical analysis of prior work 
in the reference abstracts comparing the strengths and weaknesses while also motivating the proposed approach. 
You should cite the reference abstracts as [id] whenever you are referring it in the related work. 
Do not write it as Reference #. Do not cite abstract or research Idea. 
Do not include any extra notes or newline characters at the end. 
Do not copy the abstracts of reference papers directly but compare and contrast to the main work concisely. 
Do not provide the output in bullet points or markdown. 
Do not provide references at the end. 
Please cite all the provided reference papers if needed.
"""

print("✓ Generation instructions defined")

✓ Generation instructions defined


In [22]:
# Build input for related work generation
input_related_work = f"Given the Research Idea or abstract: {query}"
input_related_work += "\n\n## Given references abstracts list below:"

for index, item in top_k_abstracts[['id', 'title_abstract']].iterrows():
    input_related_work += f"\n\n[{item['id']}]: {item['title_abstract']}"

input_related_work += "\n\nWrite the related work section summarizing in a cohesive story prior works relevant to the research idea."
input_related_work += "\n\n## Related Work:"

print(f"✓ Built generation input ({len(input_related_work)} characters)")

✓ Built generation input (4758 characters)


In [23]:
%%time
# Generate related work section
response = openai_client.responses.create(
    model=GENERATION_MODEL,
    instructions=INSTRUCTIONS_RELATED_WORK,
    input=input_related_work
)

generated_related_work = response.output_text

print("✓ Related work section generated")
print(f"  Length: {len(generated_related_work)} characters")
print(f"  Words: ~{len(generated_related_work.split())} words")

✓ Related work section generated
  Length: 2767 characters
  Words: ~375 words
CPU times: user 7.35 ms, sys: 2.7 ms, total: 10 ms
Wall time: 8.43 s


---
## 10. Results & Evaluation

Display the final generated "Related Work" section with formatting and metadata.

In [24]:
# Display the generated related work section
display(Markdown("## Generated Related Work Section"))
display(Markdown("---"))
display(Markdown(generated_related_work))
display(Markdown("---"))

## Generated Related Work Section

---

Retrieval-augmented generation (RAG) systems have gained traction in addressing the complexities associated with biomedical literature, yet their efficacy in this domain remains inconsistent. Prior work highlights the necessity of combining retrieval methods with large language models (LLMs) to improve question-answering capabilities in biomedical contexts. For example, one study delves into the challenges presented by the unique vocabulary and semantic intricacies of biomedical literature, revealing that RAG frameworks can significantly enhance the performance of LLMs by structuring the context more effectively. This approach not only boosts the quality of generated responses but also emphasizes the importance of precision over recall in generating accurate answers [34].

Conversely, another investigation, PaperQA, showcases a RAG agent specifically designed for scientific research. This framework exemplifies how integrating retrieval capabilities with LLMs can mitigate issues related to hallucinations and enhance the interpretability of generated content. By leveraging full-text scientific articles and employing a more complex benchmarking system, PaperQA surpasses existing models in performance, aligning closer to human-like research methodologies [1]. However, while these advancements are notable, they underscore the ongoing challenges surrounding grounding in factual information and the need for robust evaluation metrics.

In the broader context of biomedical literature access, significant efforts have been made to develop search tools that cater to various user needs, including researchers and clinicians. Notable systems and tools, such as PubMed and Google Scholar, facilitate access to an expanding volume of literature but often fall short in terms of precise query formulation and result interpretation [50]. This highlights a gap in methodologies that leverage RAG approaches to streamline not just the retrieval but also the synthesis of knowledge, thereby making biomedical literature more accessible to a wider audience.

The interactions between retrieval methodologies and LLMs place emphasis on the need for innovative strategies that can bridge existing knowledge gaps and improve communication of complex information in a more digestible format. As the landscape of biomedical literature continues to evolve, the proposed advancements in RAG systems for question answering present a pivotal opportunity to both enhance research accessibility and improve the user experience in navigating scientific information. This context sets the stage for the proposed strategy to optimize RAG methodologies for biomedical applications, ensuring that the dissemination of knowledge aligns with public health needs and understanding.

---

In [25]:
# Extract citations used in the generated text
import re

citations = re.findall(r'\[(\d+)\]', generated_related_work)
unique_citations = sorted(set(int(c) for c in citations))

print(f"Citations Used in Generated Text:")
print(f"  - Total citations: {len(citations)}")
print(f"  - Unique papers cited: {len(unique_citations)}")
print(f"  - Papers provided: {len(top_k_abstracts)}")
print(f"  - Citation IDs: {unique_citations}")

# Show which papers were cited
print(f"\nCited Papers:")
for paper_id in unique_citations:
    paper = top_k_abstracts[top_k_abstracts['id'] == paper_id]
    if not paper.empty:
        print(f"  [{paper_id}] {paper.iloc[0]['title']}")

Citations Used in Generated Text:
  - Total citations: 3
  - Unique papers cited: 3
  - Papers provided: 3
  - Citation IDs: [1, 34, 50]

Cited Papers:
  [1] PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
  [34] Reshaping Biomedical Scientific Literature in a RAG Pipeline for Question Answering
  [50] Accessing Biomedical Literature in the Current Information Landscape


In [26]:
# Pipeline execution summary
display(Markdown("## Pipeline Execution Summary"))

summary = f"""
**Configuration:**
- Corpus size: {len(all_abstracts)} papers
- Hybrid retrieval: Top {HYBRID_SEARCH_K} papers
- Papers retrieved: {len(retrieved_abstracts)} unique papers
- Papers scored: {len(results)} papers
- Top-K selection: {TOP_K_PAPERS} papers
- Papers cited in output: {len(unique_citations)} papers

**Models Used:**
- Relevance scoring: {RELEVANCE_MODEL}
- Related work generation: {GENERATION_MODEL}

**Output:**
- Related work length: {len(generated_related_work)} characters (~{len(generated_related_work.split())} words)
- Citations included: {len(citations)} total, {len(unique_citations)} unique
"""

display(Markdown(summary))

## Pipeline Execution Summary


**Configuration:**
- Corpus size: 78 papers
- Hybrid retrieval: Top 50 papers
- Papers retrieved: 23 unique papers
- Papers scored: 3 papers
- Top-K selection: 3 papers
- Papers cited in output: 3 papers

**Models Used:**
- Relevance scoring: gpt-4o-mini
- Related work generation: gpt-4o-mini

**Output:**
- Related work length: 2767 characters (~375 words)
- Citations included: 3 total, 3 unique


In [27]:
# Optional: Save the generated related work to a file
SAVE_OUTPUT = True  # Set to True to save

if SAVE_OUTPUT:
    output_file = "generated_related_work.txt"
    with open(output_file, 'w') as f:
        f.write("RESEARCH QUERY:\n")
        f.write(query)
        f.write("\n\n" + "="*80 + "\n\n")
        f.write("RELATED WORK:\n")
        f.write(generated_related_work)
        f.write("\n\n" + "="*80 + "\n\n")
        f.write("REFERENCES:\n")
        for paper_id in unique_citations:
            paper = top_k_abstracts[top_k_abstracts['id'] == paper_id]
            if not paper.empty:
                f.write(f"[{paper_id}] {paper.iloc[0]['title']}\n")
    
    print(f"✓ Output saved to {output_file}")
else:
    print("Output not saved (set SAVE_OUTPUT=True to save)")

✓ Output saved to generated_related_work.txt


---
## Conclusion

This notebook demonstrated an end-to-end agentic RAG pipeline for automated literature review generation. Key features:

1. **Hybrid Retrieval**: Combines semantic and keyword search for comprehensive coverage
2. **Agentic Scoring**: Uses structured reasoning (debate-style) for reliable relevance assessment
3. **Parallel Processing**: Efficiently scores multiple papers concurrently
4. **Coherent Synthesis**: Generates well-structured literature reviews with proper citations

### Next Steps

- Experiment with different retrieval parameters (`HYBRID_SEARCH_K`)
- Adjust the number of papers to score (`NUM_ABSTRACTS_TO_SCORE`)
- Try different top-k values (`TOP_K_PAPERS`)
- Evaluate different LLM models for scoring and generation
- Expand the corpus with more biomedical abstracts
- Add evaluation metrics for generated related work quality