<!-- # CS224V Fall 2025 HW1: Autonomous Research Agents

## From Literature Search to Investigative Analysis

> **Reference:** This notebook implements the coding exercises described in the [**CS224V HW1 Handout**](handout/cs224v_hw1_handout.tex). Please refer to the handout for detailed background, theory, and implementation context.

### Assignment Overview
This assignment involves building a **Deep Research Lite (DRLite)** system that progresses through two main phases:
- **Foundation Building Blocks:** RAG Systems and Autonomous Literature Search
- **Investigative Research:** Database Exploration and Automated Report Synthesis

--- -->

# Rare & Orphan Diseases - RAG & Article Generation

### Environment configuration

In [5]:
# run this cell if you haven't installed the requirements
! pip install -r requirements.txt
! playwright install



In [37]:
# set the disease of interest


TOPIC = "Informative and Accurate Entry of Duchenne Muscular Dystrophy, describing the rare disease, its symptoms and suspected causes, and potential treatments or current efforts to find treatments."

In [39]:
# Configuration: be sure the following environment variables are set (e.g. in a .env file)
# 1.) SERPER_API_KEY
# 2.) LITELLM_API_KEY
# 3.) LITELLM_API_BASE
# :D

import json
import os
from typing import List, Tuple

import dspy
import httpx
from dotenv import load_dotenv
from tqdm import tqdm

from src.dataclass import (
    LiteratureSearchAgentRequest,
    LiteratureSearchAgentResponse,
    RagRequest,
    RagResponse,
    RetrievedDocument,
)
from src.encoder import Encoder
from src.literature_search import LiteratureSearchAgent
from src.lm import init_lm, LanguageModelProviderConfig, LanguageModelProvider, LiteLLMServerConfig
from src.retriever_agent.serper_rm import SerperRM
from src.rag import RagAgent

load_dotenv()

True

In [40]:
## Sanity check :D
load_dotenv()

test_lm_config = LanguageModelProviderConfig(
      provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
      model_name="gpt-4.1-mini",
      temperature=0.0,
      max_tokens=10,
      litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
  )
test_lm = init_lm(test_lm_config)
print(test_lm("say 'Hello!' as is")[0]) # Expect to see "Hello!" or something similar

Hello!


In [41]:
encoder = Encoder(model_name="text-embedding-3-small", **{"api_key": os.getenv("LITELLM_API_KEY"), "api_base": os.getenv("LITELLM_API_BASE")})
embedding = await encoder.aencode("hello")
assert len(embedding[0]) == 1536
print(f"✅ encoder is working")

✅ encoder is working


In [17]:
# retriever check



load_dotenv()

serper_retriever = SerperRM(api_key=os.getenv("SERPER_API_KEY"), encoder=encoder)
# aretrieve returns a list of documents (it is measured with len() and indexed below)
retrieved_documents: List[RetrievedDocument] = await serper_retriever.aretrieve("rare diseases and orphan disease information")
assert len(retrieved_documents) > 0
print("✅ retriever is working")
print("example output")
print(json.dumps(retrieved_documents[0].to_dict(), indent=2))

✅ retriever is working
example output
{
  "url": "https://www.orpha.net/",
  "excerpts": [
    "Orphanet is a unique resource, gathering and improving knowledge on rare diseases so as to improve the diagnosis, care and treatment of patients with rare diseases. Orphanet aims to provide high-quality information on rare diseases, and ensure equal access to knowledge for all stakeholders. Orphanet also maintains the Orphanet rare disease nomenclature (ORPHAcode), essential in improving the visibility of rare diseases in health and research information systems. \nOrphanet was established in France by the INSERM (French National Institute for Health and Medical Research) in 1997. This initiative became a European endeavour from 2000, supported by grants from the European Commission: Orphanet has gradually grown to a Consortium of 40 countries, within Europe and across the globe. \nOrphaNews is a freely available, twice-monthly electronic newsletter for the rare disease community, presenting 




### Objective
Evaluate the complete Retrieval-Augmented Generation pipeline to understand information flow through the following stages:

```
Query → Internet Search → Content Extraction → Document Chunking → Semantic Reranking → Answer Generation
```
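
The chunking and reranking stages can be illustrated with a small, dependency-free sketch. It is not the implementation in `src/`: the function names (`chunk_text`, `embed`, `cosine`, `rerank`) are hypothetical, and a bag-of-words vector stands in for the real embedding model purely so the example runs offline.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of `chunk_size` words."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank(query: str, chunks: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Score every chunk against the query and keep the top_k highest-scoring ones."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


document = (
    "Duchenne muscular dystrophy is caused by mutations in the DMD gene. "
    "Orphanet maintains a nomenclature of rare diseases. "
    "Corticosteroids are commonly used to slow muscle weakness."
)
chunks = chunk_text(document, chunk_size=10, overlap=2)
top = rerank("mutations that cause DMD", chunks, top_k=1)
print(top[0][1])
```

In the real pipeline the reranker would compare dense embeddings from the encoder rather than word counts; only the overall shape (chunk, score against the query, keep the best) carries over.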

---

In [18]:
# configure the language model for RAG agent
rag_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini",
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
rag_lm = init_lm(rag_lm_config)

# initialize the RAG agent
rag = RagAgent(retriever=serper_retriever, rag_lm=rag_lm)

# forward the request to the RAG agent
rag_response: RagResponse = await rag.aforward(RagRequest(question="Provide an informative entry about Duchenne Muscular Dystrophy.", max_retriever_calls=1))

# make output directory if it doesn't exist
os.makedirs("output", exist_ok=True)

# save the response to a file
with open("output/action_item_3_rag_response.json", "w") as f:
    json.dump(rag_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_3_rag_response.json")


✅ Result saved to output/action_item_3_rag_response.json


### Autonomous Literature Search Evaluation


---

In [None]:
literature_search_planning_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: planning involves intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

answer_synthesis_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-5-mini", # NOTE: synthesis does not require high intelligence, but requires minimal hallucination. GPT-5-mini is a good balance.
    temperature=1.0,
    max_tokens=20000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

# initialize the literature search agent
literature_search_agent = LiteratureSearchAgent(rag_agent=rag, literature_search_lm=literature_search_planning_lm, answer_synthesis_lm=answer_synthesis_lm)

# run the literature search agent
literature_search_response: LiteratureSearchAgentResponse = await literature_search_agent.aforward(LiteratureSearchAgentRequest(topic=TOPIC))

# save the response to a file
with open("output/action_item_4_literature_search_response.json", "w") as f:
    json.dump(literature_search_response.to_dict(), f, indent=2)

print("✅ Result saved to output/action_item_4_literature_search_response.json")

Starting literature search for topic: Informative and Accurate Entry of Duchenne Muscular Dystrophy, describing the rare disease, its symptoms and suspected causes, and potential treatments or current efforts to find treatements.
Completeness check start.
Completeness check: False, reasoning: Major pillars of the topic—definition, symptoms, and causes—have not been addressed yet, signifying substantial gaps that must be filled before the survey can be considered comprehensive.
Generated 3 next questions for exploration
Executing 3
RAG call start. Question: What is Duchenne Muscular Dystrophy, and how is it classified as a rare disease?. Question context: Foundational description and classification are core to the survey, allowing readers to understand what the disease is and why it is considered rare; this element is missing and essential for contextualizing subsequent details.
RAG call start. Question: What are the primary symptoms of Duchenne Muscular Dystrophy and how do they typica

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x119e09ad0>


## Action Item 5: Primary Source Database Exploration

> **Reference:** See **Section 4.2 (ACLED Database & Primary Source Analysis)** in the handout for comprehensive coverage of investigative research methodologies and database exploration strategies.

### Objective
Transition from literature synthesis to investigative analysis through systematic exploration of primary source data. This exercise demonstrates capabilities beyond web-accessible information by leveraging structured conflict databases.

**Data Source:** Armed Conflict Location & Event Data Project (ACLED)
- **Coverage:** Global conflict events with structured metadata
- **Temporal Scope:** Real-time updates with historical records
- **Analytical Advantages:** Systematic pattern detection, geographic clustering, temporal trend analysis

### Step 1: Database Exploration Agent Initialization

---

In [15]:
# This DSPy signature generates seed questions from the previous literature search write-up to kick-start the database exploration.
class ResearchQuestionGenerator(dspy.Signature):
    """You are conducting research to extract previously unknown insights by exploring and observing information in a database. Generate research questions that an investigator will be interested in. The questions will be used to generate search queries in the database to help answer them. The questions should be self-contained, meaning they must include any specific years, months, locations, or other details instead of references that require the reader to know additional context. All questions must be related to the research goal and topic. Investigate any correlations as you see fit. The questions should be completely independent of each other - if you believe some questions need to be answered first before others can be meaningful, only generate those foundational questions and not others that depend on the answers to the first questions. You do not need to generate the maximum number of questions if you believe fewer questions would be better for the research goals."""
    
    topic: str = dspy.InputField(desc="The research goal or topic being investigated")
    db_description: str = dspy.InputField(desc="Description of the available database including its structure, contents, and capabilities")
    max_questions: int = dspy.InputField(desc="Maximum number of questions to generate - you may generate fewer if appropriate")
    previous_insights: str = dspy.InputField(desc="Previous insights from the database")
    
    questions: List[str] = dspy.OutputField(desc="List of independent, self-contained research questions that will extract previously unknown insights related to the topic from the database")

generate_exploration_questions = dspy.Predict(ResearchQuestionGenerator)

In [16]:
# research question generation
database_exploration_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # database exploration planning requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
database_exploration_lm = init_lm(database_exploration_lm_config)

# load the literature search response
assert os.path.exists("output/action_item_4_literature_search_response.json"), "Please run the literature search first"
with open("output/action_item_4_literature_search_response.json") as f:
    literature_search_response = LiteratureSearchAgentResponse.from_dict(json.load(f))

with dspy.context(lm=database_exploration_lm):
    seed_questions = (await generate_exploration_questions.aforward(
        topic=TOPIC, 
        db_description=ACLED_DB_DESCRIPTION, 
        max_questions=4, 
        previous_insights=literature_search_response.writeup
    )).questions

seed_questions_formatted = "\n\t- ".join(seed_questions)
print(f"✅ Seed questions generated. Seed questions:\n\t- {seed_questions_formatted}")

✅ Seed questions generated. Seed questions:
	- How have the frequency, intensity, and geographic distribution of different types of military operations (e.g., offensive assaults, defensive operations, drone strikes, cross-border raids) by both Russian and Ukrainian forces changed from 2022 to early 2025, and what patterns emerge when comparing phases of the conflict?
	- What correlations exist between periods of significant Western/NATO military aid deliveries to Ukraine and subsequent shifts in Ukrainian military operational tactics, such as increased use of precision strikes, adoption of defense-in-depth, or expanded use of unmanned systems?
	- Is there measurable evidence that the introduction and increased usage of AI-enabled or autonomous systems (such as drones or unmanned surface vessels) by either Russia or Ukraine resulted in statistically significant changes in battlefield outcomes, such as territory gained/lost, casualty rates, or disruption of enemy operations between 2023 

## DSPy Implementation for Automated Thesis Generation

> **Reference:** See **Section 4.3 (DSPy Framework)** in the handout.

---

In [25]:
class ThesisGenerator(dspy.Signature):
    """
    You are GENERATING exactly FIVE (5) INVESTIGATIVE thesis statements that synthesize
    THE ORIGINAL RESEARCH TOPIC, STRUCTURED DATABASE INSIGHTS, and LITERATURE REVIEW HIGHLIGHTS AND GAPS.

    The requirements for each thesis are as follows:
    - Specific: names the actors and actions, and avoids vagueness.
    - Evidence-based: CITES at least one concrete data point OR pattern from the database inputs, OR a precise claim from the literature.
    - Analytically defensible.
    - EXPLICITLY addresses societal impact.
    - CONCISE and MEMORABLE.
    """
    original_topic: str = dspy.InputField(
        desc="The initial research topic that was explored"
    ) 
    
    # db_figures: str = dspy.InputField(
    #     desc="Responses from database exploration"
   
    # )

    
    # ===============================================
    # additional input fields here
    # Hint1: how to pass insights from database exploration agent to the thesis generator?
    # Hint2: will directly concatenate question and answers from database exploration agent work? Do we need any formatting?
    # ===============================================
    
    # Should only have one output field as defined below. Do not change the name of the output field.
    proposed_theses: List[str] = dspy.OutputField(
        desc="""A Python list of exactly five distinct, concise theses. Each is specific,
        names actors, and cites a concrete pattern or figure where critical (specific, evidence-based, and analytically defensible)."""
    )

thesis_generator = dspy.Predict(ThesisGenerator)
thesis_generator_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: thesis generation requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))
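
One possible answer to the hints in the signature above is to format the database question-answer pairs into a single labelled string before passing them to the thesis generator, rather than concatenating them raw. The sketch below is an assumption-laden illustration: `format_db_insights` is a hypothetical helper, and it assumes each precomputed response is a dict with `"question"` and `"answer"` keys, which should be checked against the actual JSON file. The resulting string could then feed a new input field on the signature (for example a hypothetical `db_insights: str = dspy.InputField(...)`).

```python
from typing import Dict, List


def format_db_insights(responses: List[Dict[str, str]], max_chars: int = 2000) -> str:
    """Render question/answer pairs as numbered, clearly delimited blocks."""
    blocks = []
    for i, response in enumerate(responses, 1):
        # Assumed keys; adjust if the precomputed file uses different names.
        question = response.get("question", "").strip()
        answer = response.get("answer", "").strip()
        if not question or not answer:
            continue  # skip incomplete pairs instead of sending empty fields to the LM
        if len(answer) > max_chars:
            answer = answer[:max_chars].rstrip() + " [truncated]"
        blocks.append(f"### Insight {i}\nQ: {question}\nA: {answer}")
    return "\n\n".join(blocks)


example = format_db_insights([
    {"question": "Which variant classes are reported?", "answer": "Deletions and duplications [1]."},
    {"question": "Unanswered question", "answer": ""},
])
print(example)
```

Keeping each pair under its own heading makes it easier for the model to attribute a data point to the question that produced it, and the length cap keeps one long answer from crowding out the rest.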

In [26]:
database_exploration_rag_responses = []
with open("data/action_item_5_database_exploration_precomputed.json", "r") as f:
    database_exploration_precomputed = json.load(f)
    database_exploration_rag_responses = database_exploration_precomputed.get("rag_service_responses", [])


with dspy.context(lm=thesis_generator_lm):
    generated_theses = (await thesis_generator.aforward(
        original_topic=TOPIC,
        # ===============================================
        # other input fields here
        # ===============================================
        # db_figures = database_exploration_rag_responses,
        
    )).proposed_theses

formatted_generated_theses = "\n\t- ".join(generated_theses)
print(f"generated_theses:\n\t- {formatted_generated_theses}")

with open("output/action_item_6_generated_theses.json", "w") as f:
    json.dump({
        "generated_theses": generated_theses
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_6_generated_theses.json")

print(f"\n\nTake a look at the output format and skim through the content. Are you satisfied with the results? Will your audience be interested in your proposed theses? When in doubt, it's always a good idea to revise the prompt and try again.")
if len(generated_theses) != 5:
    raise ValueError(f"❌ Please generate exactly 5 proposed theses but found {len(generated_theses)}.")


generated_theses:
	- Genetic mutations in the DMD gene are confirmed as the singular cause of Duchenne Muscular Dystrophy, as evidenced by over 95% of diagnosed cases showing deletions or duplications in this locus, underscoring the necessity for molecular diagnostic protocols in clinical settings (Emery, 2023).
	-Progressive muscle wasting and cardiac involvement, observed consistently in longitudinal studies of DMD patients (Mendell et al., 2016), directly correlate with reduced life expectancy—highlighting an urgent societal need for expanded cardiopulmonary care in pediatric neuromuscular clinics.
	-Despite promising results in early-phase exon-skipping trials (e.g., eteplirsen leading to increased dystrophin production in 30% of treated subjects), significant barriers remain in translating genetic therapies for DMD into widely accessible treatments, with cost and equitable access representing the main public health challenge (FDA, 2021).
	-Steroid treatments such as prednisone and


**Extra human-in-the-loop step**
Manually review all generated theses and hand-pick the best two.

In [29]:
# Review the generated theses and hand-pick the best 2.
selected_theses = [
    "Genetic mutations in the DMD gene are confirmed as the singular cause of Duchenne Muscular Dystrophy, as evidenced by over 95% of diagnosed cases showing deletions or duplications in this locus, underscoring the necessity for molecular diagnostic protocols in clinical settings (Emery, 2023).",
    "Progressive muscle wasting and cardiac involvement, observed consistently in longitudinal studies of DMD patients (Mendell et al., 2016), directly correlate with reduced life expectancy—highlighting an urgent societal need for expanded cardiopulmonary care in pediatric neuromuscular clinics."
]

assert len(selected_theses) == 2, f"❌ Please select exactly 2 theses but found {len(selected_theses)}."
with open("output/action_item_6_selected_theses.json", "w") as f:
    json.dump({
        "selected_theses": selected_theses
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_6_selected_theses.json")

✅ Result saved to output/action_item_6_selected_theses.json


## Action Item 7: Automated Investigative Report Synthesis

> **Reference:** See **Section 4.4 (Synthesis & Report Composition)** in the handout for comprehensive methodology on evidence synthesis and investigative narrative construction.

### Objective
Synthesize your research findings into a comprehensive investigative report using automated DSPy-based processing pipelines that demonstrate advanced research capabilities beyond traditional literature synthesis.

### Generated Report Structure:
1. **Executive Summary:** A concise overview of your key findings and their significance
2. **Sections:** Use `#` for primary sections; Use `##` for subsections that organize detailed analysis; etc.
3. **Inline Citations:** Use numbered references in square brackets (e.g., `[1]`, `[2]`, `[3]`)
4. **Bibliography:** Conclude with a reference section where each line follows the format: `[index]. URL or source description`
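
The citation and bibliography conventions above lend themselves to a quick automated check. The following is a minimal sketch, assuming the report body and the bibliography are available as separate strings; `check_citations` is a hypothetical helper, not part of the assignment code.

```python
import re
from typing import Set, Tuple

CITATION = re.compile(r"\[(\d+)\]")
# One bibliography entry per line, in the form "[index]. URL or source description".
BIB_LINE = re.compile(r"^\[(\d+)\]\.\s+(.+)$", re.MULTILINE)


def check_citations(body: str, bibliography: str) -> Tuple[Set[int], Set[int]]:
    """Return (cited but not listed, listed but never cited)."""
    cited = {int(n) for n in CITATION.findall(body)}
    listed = {int(n) for n, _ in BIB_LINE.findall(bibliography)}
    return cited - listed, listed - cited


body = "DMD is X-linked [1]. Corticosteroids slow functional decline [2][3]."
bibliography = "[1]. https://example.org/a\n[2]. https://example.org/b\n[4]. https://example.org/d"
missing, unused = check_citations(body, bibliography)
print("missing:", sorted(missing), "unused:", sorted(unused))
```

A non-empty `missing` set indicates a citation the reader cannot resolve; a non-empty `unused` set usually means a source was dropped during synthesis.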

### Implementation Process:

**Step 1: Thesis-Specific Literature Search**

Using your two selected theses from Action Item 6, conduct targeted literature searches for each thesis. Implement a `LiteratureSearchAgentRequest` with the `guideline` parameter to focus searches on supporting evidence for each specific thesis. Set `with_synthesis=False` to collect raw `rag_responses` for later processing.

**Step 2: Key Insight Identification**

Implement a `KeyInsightIdentifier` DSPy signature to automatically extract the most important insight from each RAG response. This reduces noise and focuses on essential information for report composition. The key insight should be a concise, one-sentence summary capturing the most relevant information for each question-answer pair.
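
Because the key insight is produced by a language model, a lightweight post-check can catch obvious violations before the insights are used downstream. The sketch below is a naive illustration: `validate_key_insight` is a hypothetical helper, the sentence check is a rough regex heuristic (it will misfire on abbreviations), and the fallback string mirrors the one used in the `KeyInsightIdentifier` prompt in this notebook.

```python
import re
from typing import List

FALLBACK = "Insufficient evidence in retrieved sources to answer the question."


def validate_key_insight(answer: str, insight: str) -> List[str]:
    """Return a list of problems found in a key insight (an empty list means it passed)."""
    problems: List[str] = []
    insight = insight.strip()
    if insight == FALLBACK:
        return problems  # the explicit fallback is always acceptable
    # Every [n] citation in the insight should already appear in the answer.
    answer_cites = set(re.findall(r"\[\d+\]", answer))
    insight_cites = set(re.findall(r"\[\d+\]", insight))
    invented = insight_cites - answer_cites
    if invented:
        problems.append(f"citations not in answer: {sorted(invented)}")
    # Naive heuristic: terminal punctuation followed by a capital letter suggests a second sentence.
    if re.search(r"[.!?]\s+[A-Z]", insight):
        problems.append("looks like more than one sentence")
    return problems


answer = "Deletions account for most cases [1]. Duplications are less common [2]."
good = validate_key_insight(answer, "Deletions account for most DMD cases [1].")
bad = validate_key_insight(answer, "Deletions dominate [3]. Duplications are rare [2].")
print(good, bad)
```

Insights that fail the check could be regenerated or flagged for manual review rather than silently passed on to the report outline.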

**Step 3: Report Structure and Guideline Generation**

Create a `FinalWritingGuidelineProposal` DSPy signature that uses the collected key insights to:
- Generate a unified report thesis that synthesizes your investigative findings
- Propose a structured writing guideline in bullet-point format outlining key sections and content organization
- Ensure logical flow from background through specific discoveries to implications

**Step 4: Automated Report Synthesis**

Implement a `FinalReportSynthesizer` DSPy signature that combines the report thesis, writing guidelines, and all collected evidence into a coherent investigative report. The synthesizer should:
- Merge relevant information into logically coherent narrative sections
- Preserve all original citations exactly as provided
- Eliminate redundancy while maintaining completeness
- Create smooth transitions between thematic sections
- Constrain content to provided information without external speculation

### Quality Standards:

Your final report should demonstrate investigative depth by presenting original analytical insights not readily available in existing literature, comprehensive coverage through automated integration of evidence from multiple search iterations, logical organization with clear progression from initial questions to final conclusions, and proper attribution through systematic citation preservation.

The completed report should read as a coherent investigative piece rather than a collection of separate research summaries, showcasing the power of automated synthesis in building compelling investigative narratives from distributed evidence sources.

---

In [30]:
## After selecting 2 theses, run another round of literature search to gather evidence for each of them.
rag_responses = []

for idx, thesis in enumerate(selected_theses, 1):
    # Build a literature search request for this thesis. Unlike the earlier topic-only request,
    # this one also sets `guideline` to focus the search on evidence for the thesis, and it
    # disables synthesis (`with_synthesis=False`) because only the raw RAG responses are needed.
    literature_search_response: LiteratureSearchAgentResponse = await literature_search_agent.aforward(LiteratureSearchAgentRequest(topic=thesis, guideline=(
            "Objective: retrieve high-quality, citable biomedical/clinical sources that DIRECTLY SUPPORT, CLARIFY, "
            "OR CHALLENGE the thesis about this rare/orphan disease.\n"
            f"Target thesis: {thesis}\n\n"
            "Priority sources (in order):\n"
            "- Authoritative rare disease resources (NIH/NCATS GARD, Orphanet, GeneReviews)\n"
            "- Peer-reviewed review articles and consensus/guideline statements on this disease or its gene/pathway\n"
            "- PubMed-indexed clinical cohort/longitudinal studies (epidemiology, natural history, genotype–phenotype)\n"
            "- Regulatory or label information for disease-specific therapies (FDA, EMA), when relevant\n"
            "- ClinicalTrials.gov / trial registry entries for current investigational approaches\n\n"
            "Extraction focus:\n"
            "- Disease definition and classification (is it clearly the same condition as in the thesis?)\n"
            "- Genetics/etiology: gene, variant classes, inheritance, percentage of cases explained\n"
            "- Phenotype: core signs/symptoms, age of onset, organ systems involved, severity range\n"
            "- Diagnosis: recommended tests, molecular confirmation rates, differential diagnoses\n"
            "- Management/therapy: standard-of-care, disease-modifying options, trialed therapies, reported outcomes\n"
            "- Prognosis/natural history: survival, loss of function milestones, complications\n"
            "- Explicit statements of LIMITED or INSUFFICIENT evidence (these are important for our ‘What Is Not Known’ section)\n\n"
            "Style/constraints:\n"
            "- Prefer up-to-date (last 5–10 years) reviews/guidelines when available.\n"
            "- Exclude non-reviewed blogs, unsourced social posts, general news outlets.\n"
            "- If only case reports exist, return them and clearly mark as low-N evidence.\n"
            "- Return raw RAG items only; DO NOT synthesize.\n"
        ), with_synthesis=False))
    
    
    with open(f"output/action_item_7_literature_search_response_{idx}.json", "w") as f:
        json.dump(literature_search_response.to_dict(), f, indent=2)
    print(f"✅ Result saved to output/action_item_7_literature_search_response_{idx}.json")

    rag_responses.extend(literature_search_response.rag_responses)  # collect the raw RAG responses for the later steps


Starting literature search for topic: Genetic mutations in the DMD gene are confirmed as the singular cause of Duchenne Muscular Dystrophy, as evidenced by over 95% of diagnosed cases showing deletions or duplications in this locus, underscoring the necessity for molecular diagnostic protocols in clinical settings (Emery, 2023).
Completeness check start.
Completeness check: False, reasoning: No exploration tasks have been completed, so all critical components—especially the genetic etiology, possible exceptions, and evidence for diagnostic protocols—must be addressed with direct, recent, and authoritative sources to comprehensively survey the thesis.
Generated 3 next questions for exploration
Executing 3
RAG call start. Question: What do recent authoritative resources (GeneReviews, Orphanet, NIH GARD) state about the genetic basis, gene involved, and variant types responsible for Duchenne Muscular Dystrophy? Please extract statements on etiology, percentage breakdown of mutation types,

Completed iteration 6, remaining budget: 9
Completeness check start.
Completeness check: False, reasoning: While the genetics/etiology, inheritance, and necessity of molecular diagnostics are well-supported by citable sources, key dimensions including authoritative phenotype characterization and clinical prognosis/natural history remain unaddressed, making the survey incomplete per guideline.
Generated 2 next questions for exploration
Executing 2
RAG call start. Question: Extract high-quality, authoritative sources that detail the core clinical phenotype, onset patterns, organ systems affected, and severity spectrum of Duchenne Muscular Dystrophy, specifically focusing on consistency with DMD gene mutation etiology.. Question context: There is currently a gap in sourced, succinct statements and data from biomedical/clinical references clarifying whether the clinical features and progression of DMD consistently match the genetic diagnosis, which is a guideline extraction focus (phenotyp

Completed iteration 10, remaining budget: 5
Completeness check start.
Completeness check: True, reasoning: All mandatory guideline extraction areas (etiology, exception frequency, phenotype/genotype correlation, molecular diagnostics, prognosis, and natural history) have been comprehensively and authoritatively addressed with high-quality, citable sources; there are no major gaps left relevant to the thesis or guideline requirements.
Literature search deemed complete by completeness checker
Survey completed with 5 responses
✅ Result saved to output/action_item_7_literature_search_response_1.json
Starting literature search for topic: Progressive muscle wasting and cardiac involvement, observed consistently in longitudinal studies of DMD patients (Mendell et al., 2016), directly correlate with reduced life expectancy—highlighting an urgent societal need for expanded cardiopulmonary care in pediatric neuromuscular clinics.
Completeness check start.
Completeness check: False, reasoning: No

Completed iteration 6, remaining budget: 9
Completeness check start.
Completeness check: False, reasoning: While there is thorough coverage of disease definition, natural history, genotype-phenotype relationships, and clinical guideline recommendations, regulatory information on disease-modifying therapies and documentation of ongoing/registry clinical trials—both core extraction targets per the guideline—remain unexplored and are needed to complete the survey.
Generated 2 next questions for exploration
Executing 2
RAG call start. Question: What regulatory or label information is available from the FDA, EMA, or other agencies regarding disease-specific therapies for Duchenne muscular dystrophy, particularly documenting survival, cardiac outcomes, or label restrictions tied to cardiac involvement?. Question context: Directly advances the survey by retrieving regulatory-label-approved endpoints and post-marketing data from authoritative sources, which may document evidence on therapies’ 

Completed iteration 10, remaining budget: 5
Completeness check start.
Completeness check: True, reasoning: All major and mandatory scope areas from the guideline—including disease definition, genetic/phenotypic correlations, cardiopulmonary burden, clinical care recommendations, regulatory/label status, ongoing investigational trials, and evidence gaps—are addressed by high-quality, citable sources; significant knowledge gaps do not remain, and the survey achieves comprehensive, guideline-concordant coverage.
Literature search deemed complete by completeness checker
Survey completed with 5 responses
✅ Result saved to output/action_item_7_literature_search_response_2.json


In [33]:
# Identify key insight for each rag response

# The goal of key insight identification is to extract the most important insight from each RAG response. Think of it as a one-sentence summary of the part of the response that matters most to the question. This reduces noise in the RAG responses and keeps the focus on the most important information when determining the final report outline.

class KeyInsightIdentifier(dspy.Signature):
    """
    You are GENERATING exactly ONE (1) KEY INSIGHT from the RAG response.
    It must be ONE sentence only, no more, no less.

    REQUIREMENTS:
    - Directly answers the question, not vague.
    - Names actors and actions, incl. time/place if available.
    - If there is a number/trend in the answer include it.
    - Keep inline [n] citations exactly as shown, do NOT change them.
    - No hedging like "maybe" or "could". No meta talk.
    - If no good fact is in the answer, return: "Insufficient evidence in retrieved sources to answer the question."
    """
    question: str = dspy.InputField(
        desc="The question that was asked"
    )
    question_context: str = dspy.InputField(
        desc="The context of the question"
    )
    answer: str = dspy.InputField(
        desc="The answer to the question aggregating information from external sources"
    )
    key_insight: str = dspy.OutputField(
        desc="One sentence key insight."
    )

key_insight_identifier = dspy.Predict(KeyInsightIdentifier)
key_insight_identifier_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini", # NOTE: should we use a more powerful model? You can play around with it.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))


for rag_response in tqdm(rag_responses):
    with dspy.context(lm=key_insight_identifier_lm):
        rag_response.key_insight = (await key_insight_identifier.aforward(
            question=rag_response.question,
            question_context=rag_response.question_context,
            answer=rag_response.answer
        )).key_insight

with open("output/action_item_7_rag_responses_with_key_insight.json", "w") as f:
    json.dump([rag_response.to_dict() for rag_response in rag_responses], f, indent=2)

for rag_response in rag_responses:
    print(rag_response.key_insight + "\n")
print("✅ Result saved to output/action_item_7_rag_responses_with_key_insight.json")

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:16<00:00,  1.37s/it]

Reputable datasets and reports from the Ukrainian Air Force, CSIS, ACLED, ISW, and UN document that from 2022 to 2024, Russia conducted extensive drone strike campaigns across Ukraine, including in Donetsk and Kyiv, with confirmed launches exceeding 14,700 attack drones and over 19,000 missiles, reaching peak monthly drone launches of over 6,400 in July 2025 and causing significant civilian casualties [1][2][3][4][5][6][8].

No primary sources or Tier-1 newswires from 2022 to 2024 provide evidence of Russian Air Force drone operations in transnational contexts such as Ar Raqqa, Rural Damascus, or Lattakia, including details on dates, strike counts, or campaign intent.

Between 2022 and 2024, reputable organizations like the ISW, IISS, CSIS, CNAS, the Belfer Center, and West Point highlight that Russian Air Force drone campaign data suffer from significant methodological challenges including reliance on incomplete open-source intelligence, varied drone types complicating counts, electro
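
The loop in the cell above awaits each `aforward` call before starting the next one. One possible way to overlap those calls is bounded concurrency with `asyncio.gather`. The sketch below is not part of the assignment code: `gather_with_limit` and `fake_identify` are hypothetical names, and `fake_identify` only stands in for the language-model call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_with_limit(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Run `worker` over `items` concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> str:
        # The semaphore caps how many workers are in flight at once.
        async with semaphore:
            return await worker(item)

    # asyncio.gather returns results in the order the awaitables were passed in,
    # so each result still lines up with its input.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_identify(answer: str) -> str:
    """Hypothetical stand-in for the LM call: returns the first sentence."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return answer.split(". ")[0].rstrip(".") + "."
```

In the notebook, `worker` could be a small coroutine that wraps `key_insight_identifier.aforward` inside `dspy.context(...)`, and the call would be `await gather_with_limit(...)` rather than `asyncio.run(...)`. A suitable `max_concurrency` depends on the rate limits of the LiteLLM server, which are not stated here.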




In [31]:
# Final report title and guideline proposal
# We will generate a TITLE + bullet GUIDELINE for the final rare-disease article.
# The guideline must mirror our target article structure.

class FinalWritingGuidelineProposal(dspy.Signature):
    """
    You are GENERATING a patient-education rare/orphan disease article TITLE
    and a bullet-format WRITING GUIDELINE using:
    - the selected theses (core claims we want to support),
    - and key_insights extracted from RAG.

    REQUIREMENTS FOR TITLE:
    - 8–14 words.
    - Headline/title case.
    - No colon, no question mark.
    - Must contain the disease name if it is obvious from theses/insights.
    - Tone: educational, not sensational (“Understanding…”, “Overview of…”, “Key Features of…” are okay).

    REQUIREMENTS FOR GUIDELINE:
    - Output MUST be bullet points only (no prose paragraph).
    - Each bullet should map to a section we will later fill, in this order:
        1) Brief definition/what the disease is
        2) Genetic/etiologic basis (tie to theses if about mutations)
        3) Clinical picture (signs/symptoms, progression)
        4) Diagnosis/confirmation (esp. molecular diagnostics if mentioned in theses)
        5) Management/treatment landscape (summarize, do NOT prescribe; mention steroids/exon skipping only if present in theses/insights)
        6) Prognosis/natural history / impact on quality of life
        7) Current research/therapies in development
        8) Gaps/uncertainties (“What is not known”)
    - Where a thesis or key insight contains a citation marker like [1], preserve it on the bullet that uses that info.
    - If multiple theses overlap (e.g. DMD genetics + therapy access), merge them into a single coherent flow.
    - Explicitly add a safety/education bullet: “This report is for education only and does not replace clinical advice.”

    CONTENT RULE:
    - Use ONLY information that can be traced to selected_theses or key_insights; do NOT invent new clinical facts.
    - If a thesis mentions barriers (cost/access), add a bullet under “Current research / health systems considerations”.
    """
    # Inputs
    selected_theses: List[str] = dspy.InputField(
        desc="2–3 core theses about the disease (genetics, progression, therapy)."
    )
    key_insights: List[str] = dspy.InputField(
        desc="Key insights from RAG for this disease; may include diagnostics, therapy notes, or unmet needs."
    )

    # Outputs
    report_thesis: str = dspy.OutputField(
        desc="One headline-style title (8–14 words), disease-focused, no colon."
    )
    writing_guideline: str = dspy.OutputField(
        desc=(
            "Bullet-only guideline, ordered as: definition, genetics/etiology, clinical features, diagnosis, "
            "management/treatment, prognosis/impact, research/current trials, gaps/unknowns, safety note. "
            "Preserve any inline [n] citations from inputs."
        )
    )


# model init stays the same
final_writing_guideline_proposal = dspy.Predict(FinalWritingGuidelineProposal)
final_writing_guideline_proposal_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1",
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(
        api_key=os.getenv("LITELLM_API_KEY"),
        api_base=os.getenv("LITELLM_API_BASE")
    )
))

with dspy.context(lm=final_writing_guideline_proposal_lm):
    final_writing_guideline_proposal_response = await final_writing_guideline_proposal.aforward(
        selected_theses=selected_theses,
        key_insights=[rr.key_insight for rr in rag_responses],
    )

final_writing_thesis = final_writing_guideline_proposal_response.report_thesis
final_writing_guideline = final_writing_guideline_proposal_response.writing_guideline

with open("output/action_item_7_final_writing_guideline_proposal.json", "w") as f:
    json.dump({
        "report_thesis": final_writing_thesis,
        "writing_guideline": final_writing_guideline
    }, f, indent=2)

print(f"report title: {final_writing_thesis}\n")
print(f"writing guideline:\n{final_writing_guideline}")
print("✅ Result saved to output/action_item_7_final_writing_guideline_proposal.json")


report title: Understanding Duchenne Muscular Dystrophy: Genetics, Diagnosis, and Clinical Progression

writing guideline:
- Duchenne Muscular Dystrophy (DMD) is a rare genetic condition characterized by progressive muscle wasting and weakness.
- DMD is caused by mutations (deletions or duplications) in the DMD gene; over 95% of diagnosed cases show changes in this locus (Emery, 2023).
- The clinical course involves worsening muscle function and frequent cardiac involvement; progression often leads to reduced life expectancy (Mendell et al., 2016).
- Diagnosis relies on molecular protocols to confirm DMD gene mutations; genetic testing is essential in clinical settings.
- Current management prioritizes multidisciplinary care, with a special emphasis on cardiopulmonary support in pediatric neuromuscular clinics.
- Prognosis is shaped by the rate of muscle and cardiac deterioration, affecting mobility and overall quality of life.
- Research is ongoing to address gaps in cardiopulmonary c
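
The signature asks for an 8–14 word title with no colon or question mark, yet the title printed above contains a colon. A small post-check can surface that kind of drift before the title is reused downstream. The following is a minimal sketch; `check_title` is a hypothetical helper, and counting words by whitespace is an assumption about how "words" should be measured.

```python
from typing import List


def check_title(title: str, min_words: int = 8, max_words: int = 14) -> List[str]:
    """Return a list of violated title rules (empty if the title passes)."""
    problems: List[str] = []
    n_words = len(title.split())  # assumption: words are whitespace-separated tokens
    if not (min_words <= n_words <= max_words):
        problems.append(f"word count {n_words} outside {min_words}-{max_words}")
    if ":" in title:
        problems.append("contains a colon")
    if "?" in title:
        problems.append("contains a question mark")
    return problems


title = "Understanding Duchenne Muscular Dystrophy: Genetics, Diagnosis, and Clinical Progression"
print(check_title(title))
```

If the returned list is non-empty, one option is to rerun the proposal step; another is to tighten the wording of the title rules in the signature docstring.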

In [32]:
def _normalize_rag_response_citation_indices(rag_responses: List[RagResponse]) -> Tuple[List[str], List[RetrievedDocument]]:
        """
        Normalize citation indices across multiple RAG (retrieval-augmented generation) responses.

        Each `RagResponse` contains:
        - `answer`: a string with inline citations like [1], [2], ...
        - `cited_documents`: the list of documents those citations refer to

        Problem:
        Citation indices restart at [1] for every response, but when combining answers,
        we want all citations to point to a single global list of retrieved documents.

        What this function does:
        1. Iterates over all RAG responses in order.
        2. Shifts the local citation indices in each answer so that they correctly map
            into the combined list of all retrieved documents.
            - For example, if the first response cited 3 docs ([1], [2], [3]),
            then the second response’s citations start at [4], not [1].
        3. Prefixes each updated answer with its corresponding sub-question for clarity.
        4. Returns:
            - A list of normalized answers (with corrected citation indices).
            - The flattened list of all retrieved documents in the proper order.

        Example:
            Input (two RAG responses):
                R1: "Paris is in France [1].", docs=[docA]
                R2: "Berlin is in Germany [1].", docs=[docB]

            Output:
                answers = [
                "Sub-question: ...\nAnswer: Paris is in France [1].",
                "Sub-question: ...\nAnswer: Berlin is in Germany [2]."
                ]
                documents = [docA, docB]
        """
        all_documents: List[RetrievedDocument] = []
        all_updated_answers: List[str] = []
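        for rag_response in rag_responses:
            # Documents already collected determine how far this response's indices shift.
            citation_offset = len(all_documents)
            updated_answer = rag_response.answer
            # Pass 1: rewrite each local index to a temporary marker so that a freshly
            # shifted index can never be mistaken for another local index.
            for i in range(len(rag_response.cited_documents)):
                updated_answer = updated_answer.replace(f"[{i+1}]", f"[tmp_{citation_offset+i+1}]")
            # Pass 2: turn the temporary markers into the final global indices.
            for i in range(len(rag_response.cited_documents)):
                updated_answer = updated_answer.replace(f"[tmp_{citation_offset+i+1}]", f"[{citation_offset+i+1}]")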

            all_updated_answers.append(
                f"Sub-question: {rag_response.question}\nAnswer: {updated_answer}\n")
            all_documents.extend(rag_response.cited_documents)
        return all_updated_answers, all_documents
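
The index shift described in the docstring can also be done in a single pass with a regular expression, which avoids the temporary markers. This is an alternative sketch, not the helper used by the notebook: `shift_citations` is a hypothetical name, and it assumes citations always appear as plain `[n]`.

```python
import re


def shift_citations(answer: str, num_docs: int, offset: int) -> str:
    """Shift every local citation [1..num_docs] in `answer` by `offset`."""

    def _shift(match: re.Match) -> str:
        local_index = int(match.group(1))
        if 1 <= local_index <= num_docs:
            return f"[{local_index + offset}]"
        # Leave indices that do not map to a cited document untouched.
        return match.group(0)

    # Each match is replaced exactly once, so shifted indices are never re-shifted.
    return re.sub(r"\[(\d+)\]", _shift, answer)


# Mirrors the docstring example: the second response cites one document,
# and one document has already been collected.
print(shift_citations("Berlin is in Germany [1].", num_docs=1, offset=1))
```

Because `re.sub` scans the string once, there is no need for a second pass.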

In [34]:
# Final report synthesis
# We will synthesize the final rare-disease article using the generated title, guideline, and RAG responses.

class FinalReportSynthesizer(dspy.Signature):
    """
    You are composing a patient-education RARE/ORPHAN DISEASE ARTICLE using:
    - a report title,
    - a section-by-section writing guideline,
    - and normalized RAG-sourced information with inline numeric citations.

    CORE GOAL:
    - Produce a markdown article that follows the guideline's section order.
    - Make it clear, accurate, and explicit about uncertainties.
    - Do NOT give medical advice; describe current knowledge only.

    STRUCTURE RULES:
    - Start with the provided title as an H1 (# ...).
    - Immediately follow with the education/safety note:
      "This article is for education only and does not replace advice from your clinician."
    - Then create sections in the exact order implied by `writing_guideline`.
    - Use markdown headings (#, ##, ###) to reflect hierarchy.
    - If the guideline mentions sections like "What Is Not Known", include them.

    CONTENT INTEGRATION RULES:
    - Merge all relevant RAG answers into the correct sections; do not leave content floating.
    - If multiple RAG items repeat the same fact, keep the clearest one.
    - If RAG indicates limited evidence, state that explicitly in the "What Is Not Known"/limitations section.
    - Do NOT introduce information that is not present in `gathered_information` or obviously implied by the selected theses/guideline.

    WRITING STYLE:
    - Patient-friendly, ~9th–10th grade, but still precise.
    - Short paragraphs; use bulleted lists for symptoms, tests, or resources if the guideline suggests it.
    - Neutral, non-alarmist tone.
    - Do NOT prescribe or recommend doses.

    CITATION POLICY (STRICT)

        1. Allowed format
           - Only numeric inline citations are allowed: [1], [2], [3], ...
           - Do NOT output URLs, domains, or alphanumeric IDs inside brackets (e.g. no [pmc.ncbi.nlm.nih.gov], no [FDA-2024], no [Orphanet]).
           - If the source text you were given contains non-numeric citations, DROP them unless a numeric ID for the same fact is also present.
        
        2. Source of truth
           - Assume that `gathered_information` has already had its citation numbers normalized by the pipeline.
           - You MUST reuse these numeric IDs exactly as they appear in `gathered_information`.
           - You MUST NOT invent new numbers or renumber globally.
        
        3. Where to place citations
           - Attach the citation immediately after the fact/number/name it supports.
           - Example: “DMD is caused by mutations in the DMD gene on Xp21.2.[1]”
           - Not allowed: putting all citations at the end of the paragraph.
           - Not allowed: “... Xp21.2 [1].” (space before bracket) → prefer “Xp21.2.[1]”
        
        4. How many citations per fact
           - Default to 1 citation per fact.
           - You may use 2 citations only when:
             a) the fact combines two distinct claims that both had citations in `gathered_information`, or
             b) you are showing that multiple guideline documents agree.
           - Maximum allowed at once: 3.
           - If the source text had a long chain like [1][2][13][30][31], shorten it to the 1–2 MOST RELEVANT numbers for that sentence.
        
        5. Conflicting or duplicate citations
           - If two sentences in `gathered_information` cite the same number for the same claim, you may keep only the first instance in your synthesized paragraph.
           - Do not show the same citation number 3 times in the same paragraph unless the paragraph truly has 3 different claims from that source.
        
        6. Non-numeric or malformed citations
           - If you encounter a citation like “[pmc.ncbi.nlm.nih.gov]” or “[FDA page]” in the input:
             - Try to find the nearest numeric citation for the same idea in the same input block and use that instead.
             - If there is no numeric match, omit that citation completely and keep the sentence only if it is well supported elsewhere in the same paragraph.
             - Do NOT output the malformed citation.
        
        7. Ascending order within a sentence
           - If a sentence must keep more than one citation, order them ascending: “[1][5]” not “[5][1]”.
           - Do not renumber across the entire document—just sort the ones you are actually using in that spot.
        
        8. Missing support
           - If you cannot find any numeric citation in `gathered_information` that supports a sentence you are writing, REMOVE that sentence instead of inventing a citation.
        
        9. Section-level claims
           - For generic, well-supported section intros that summarize multiple cited sentences in `gathered_information`, use the earliest applicable citation number for that section.
           - Example: if all the DMD natural history claims in the inputs use [23], [24], [25], open the prognosis section with “[23]” or “[23][24]”, not with a new number.
        
        10. No cross-section recycling
           - Do not copy a citation number from a therapy paragraph into a genetics paragraph if the original source in `gathered_information` did not talk about genetics.
           - Keep citations tied to the topic they originally supported.


    CONTENT CONSTRAINTS:
    - Constrain content to provided information.
    - Do not speculate or add external knowledge.
    - If something is missing (e.g., no management info), add a one-line note: "Limited management information was available in the retrieved sources." and keep it citation-free.
    """

    report_thesis: str = dspy.InputField(
        desc="The article/report title that should appear as H1."
    )
    writing_guideline: str = dspy.InputField(
        desc="Bullet-format guideline listing the sections and key points, in order."
    )
    gathered_information: str = dspy.InputField(
        desc="All RAG answers with normalized inline citations, newline-separated."
    )
    report_style: str = dspy.InputField(
        desc="Optional style hint, e.g. 'patient-friendly, concise, no medical advice'.",
        default="patient-friendly, concise, no medical advice"
    )

    final_report: str = dspy.OutputField(
        desc="The final rare-disease article in markdown, with H1 title, safety note, and ordered sections."
    )


final_report_synthesizer = dspy.Predict(FinalReportSynthesizer)
final_report_synthesizer_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-5-mini",
    temperature=1.0,
    max_tokens=20000,
    litellm_server_config=LiteLLMServerConfig(
        api_key=os.getenv("LITELLM_API_KEY"),
        api_base=os.getenv("LITELLM_API_BASE")
    )
))

# Normalize citation indices across all RAG responses with the helper defined above
all_updated_answers, all_documents = _normalize_rag_response_citation_indices(rag_responses)
gathered_information = "\n".join(all_updated_answers)

with dspy.context(lm=final_report_synthesizer_lm):
    final_report = (await final_report_synthesizer.aforward(
        report_thesis=final_writing_thesis,
        writing_guideline=final_writing_guideline,
        gathered_information=gathered_information,
        report_style="patient-friendly, preserve section order, no medical advice"
    )).final_report

with open("output/action_item_7_final_report_raw.md", "w") as f:
    f.write(final_report)
print("✅ Result saved to output/action_item_7_final_report_raw.md")


✅ Result saved to output/action_item_7_final_report_raw.md


In [35]:
# TODO: manually add bibliography to the final report. Review the output. Adjust the prompt and rerun the report synthesis if necessary. Make sure it has title, executive summary, sections with desired inline citations, and bibliography.

bibliography = "# Bibliography \n\n"
for index, doc in enumerate(all_documents, 1):
    bibliography += f"{index}. **{doc.title}**. Available at: {doc.url}\n"

final_report_with_bibliography = final_report + "\n\n" + bibliography

with open("output/action_item_7_final_report.md", "w") as f:
    f.write(final_report_with_bibliography)
print("✅ Result saved to output/action_item_7_final_report.md")

✅ Result saved to output/action_item_7_final_report.md


**Review the generated report. Reflect on its weaknesses and explain in detail how you would improve it.**

In [47]:
weaknesses_and_improvements = """
Weaknesses:
- Citations were unreliable: the generated report sometimes included URLs despite being explicitly instructed not to, or did not
follow the citation format correctly.
- Subsections aren't really used in the report. We only have top-level sections (such as Methods) with no further
breakdown within them, which is not in line with what we see in modern literature.
- Our report doesn't seem to use explicit statistics from our database crawls. Instead it gestures at "increases" without giving
concrete numbers. This indicates we might not be properly leveraging our database insights.


Improvements:
- Write more explicit instructions so that the system creates citations that better reflect our standards.
- Define how the report should be broken down further so that we get the desired subsections.
- Have the report use specific statistics from our database crawls, perhaps via dedicated input fields for this data.

"""

with open("output/action_item_7_weaknesses_and_improvements.md", "w") as f:
    f.write(weaknesses_and_improvements)
print("✅ Result saved to output/action_item_7_weaknesses_and_improvements.md")


✅ Result saved to output/action_item_7_weaknesses_and_improvements.md
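
The first weakness (URLs or malformed markers appearing where numeric citations were required) can be detected mechanically before any manual review. Below is a minimal sketch under stated assumptions: `audit_citations` is a hypothetical helper, and it treats every `[...]` span in the report as a citation, so ordinary markdown link text would be flagged as well.

```python
import re
from typing import Dict, List


def audit_citations(report: str, num_documents: int) -> Dict[str, List[str]]:
    """Flag bracketed markers that are non-numeric or outside 1..num_documents."""
    # Capture the content of every [...] span that contains no nested brackets.
    bracketed = re.findall(r"\[([^\[\]]+)\]", report)
    non_numeric = [b for b in bracketed if not b.isdecimal()]
    out_of_range = [
        b for b in bracketed
        if b.isdecimal() and not (1 <= int(b) <= num_documents)
    ]
    return {"non_numeric": non_numeric, "out_of_range": out_of_range}


sample = "DMD is X-linked.[1] See [pmc.ncbi.nlm.nih.gov] and [9]."
print(audit_citations(sample, num_documents=3))
```

In the notebook this could be called as `audit_citations(final_report, len(all_documents))`. Non-empty lists suggest tightening the prompt and rerunning the synthesis.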


## Create submission

In [51]:
! python create_submission.py

CS224V HW1 Submission Creator
🔄 Converting notebook to PDF: notebook.ipynb -> notebook.pdf
❌ Error converting notebook to PDF: Command '['jupyter', 'nbconvert', '--to', 'pdf', '--output', 'notebook.pdf', 'notebook.ipynb']' returned non-zero exit status 1.
stderr: [NbConvertApp] Converting notebook notebook.ipynb to pdf
[NbConvertApp] ERROR | Error while converting 'notebook.ipynb'
Traceback (most recent call last):
  File "/Users/jrizo/miniconda3/envs/cs224v_hw1/lib/python3.11/site-packages/nbconvert/nbconvertapp.py", line 487, in export_single_notebook
    output, resources = self.exporter.from_filename(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrizo/miniconda3/envs/cs224v_hw1/lib/python3.11/site-packages/nbconvert/exporters/templateexporter.py", line 390, in from_filename
    return super().from_filename(filename, resources, **kw)  # type:ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrizo/miniconda3/envs/cs

## Appendix

### Database Exploration Agent Implementation

> **Note:** This implementation is provided for reference purposes. Utilize the pre-computed results in Action Item 5 for efficient completion of the assignment.

---

In [None]:
# API endpoint URL
api_url = "https://cs224v-database-agent.genie.stanford.edu/database-exploration"

# Prepare the request payload
payload = {
    "topic": TOPIC,
    "seed_questions": seed_questions,
    "lm_config": database_exploration_lm_config.to_dict()
}

# NOTE: uncomment the code below to make the request

# print("Making request to database exploration endpoint... Might take up to 30 minutes")
# async with httpx.AsyncClient(timeout=6000.0) as client:
#     r = await client.post(api_url, json=payload)
#     r.raise_for_status()
#     database_exploration_response = r.json()

# with open("output/action_item_5_database_exploration.json", "w") as f:
#     json.dump(database_exploration_response, f, indent=2)

# print(f"✅ Result saved to output/action_item_5_database_exploration.json")
