<!-- # CS224V Fall 2025 HW1: Autonomous Research Agents

## From Literature Search to Investigative Analysis

> **Reference:** This notebook implements the coding exercises described in the [**CS224V HW1 Handout**](handout/cs224v_hw1_handout.tex). Please refer to the handout for detailed background, theory, and implementation context.

### Assignment Overview
This assignment involves building a **Deep Research Lite (DRLite)** system that progresses through two main phases:
- **Foundation Building Blocks:** RAG Systems and Autonomous Literature Search
- **Investigative Research:** Database Exploration and Automated Report Synthesis

--- -->

# CS224V Fall2025 Homework1 Code

### Environment configuration

In [1]:
# run this cell if you haven't installed the requirements
! pip install -r requirements.txt
! playwright install



In [2]:
TOPIC = "Evolving Military Strategies in the Russia-Ukraine War and Future Implications"

ACLED_DB_DESCRIPTION = """ACLED is a global conflict and event data repository. It captures and analyzes conflicts worldwide, from local conflicts in Africa to international armed conflicts. It includes events like civil wars, military operations, and terrorist attacks. ACLED's data helps monitor conflicts, understand conflict dynamics, and inform policy decisions."""

In [3]:
import json
import os
from typing import List, Tuple

import dspy
import httpx
from dotenv import load_dotenv
from tqdm import tqdm

from src.dataclass import RetrievedDocument, LiteratureSearchAgentResponse, LiteratureSearchAgentRequest
from src.encoder import Encoder
from src.literature_search import LiteratureSearchAgent
from src.lm import init_lm, LanguageModelProviderConfig, LanguageModelProvider, LiteLLMServerConfig
from src.retriever_agent.serper_rm import SerperRM
from src.rag import RagAgent
from src.dataclass import RagResponse, RagRequest

load_dotenv()

True

## Action Item 1: LLM API Configuration

> **Reference:** See **Section 2 (Background) → Action Item 1** in the handout for complete setup instructions and theoretical background.

### Setup Requirements

| Step | Task | Implementation Details |
|------|------|------------------------|
| 1 | **Obtain API Key** | Access the provided portal at [cs224v-litellm-portal.genie.stanford.edu](http://cs224v-litellm-portal.genie.stanford.edu) |
| 2 | **Configure Secrets** | Add `LITELLM_API_KEY` and `LITELLM_API_BASE` to `.env` |
| 3 | **Verify Configuration** | Execute the validation code block below |
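Before running the validation cell, it can help to confirm that both secrets are actually visible to the process. The following is a minimal sketch, not part of the assignment code: the helper name `missing_env_vars` is hypothetical, and it assumes `load_dotenv()` has already been called so that `.env` values are present in `os.environ`.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the setup table above.
REQUIRED_KEYS = ("LITELLM_API_KEY", "LITELLM_API_BASE")


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or blank in the given environment mapping."""
    return [name for name in names if not (env.get(name) or "").strip()]


missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing in .env: {', '.join(missing)}")
else:
    print("All required secrets are set")
```

Passing the environment as a parameter keeps the check easy to try against a plain dictionary without touching real credentials.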

---

In [4]:
## Action item 1. 
load_dotenv()

test_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini",
    temperature=0.0,
    max_tokens=10,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
test_lm = init_lm(test_lm_config)
print(test_lm("say 'Hello!' as is")[0]) # Expect to see "Hello!" or something similar

Hello!


In [5]:
encoder = Encoder(
    model_name="text-embedding-3-small",
    api_key=os.getenv("LITELLM_API_KEY"),
    api_base=os.getenv("LITELLM_API_BASE"),
)
embedding = await encoder.aencode("hello")
# the notebook expects 1536-dimensional vectors from text-embedding-3-small
assert len(embedding[0]) == 1536
print("✅ encoder is working")

✅ encoder is working


## Action Item 2: Web Retrieval Service Configuration

> **Reference:** See **Section 3.1 (RAG Systems) → Action Item 2** in the handout for detailed background on web retrieval architectures and alternative search providers.

### Serper.dev Setup Process

The Retrieval-Augmented Generation pipeline requires a web search backend for information retrieval. This implementation uses Serper.dev as the search provider.

**Configuration Steps:**
1. **API Key Acquisition:** Register at [serper.dev](https://serper.dev) to obtain 2,500 free search credits
2. **Secret Configuration:** Add `SERPER_API_KEY` to the `.env` file under the repo root
3. **System Validation:** Execute the test code block below

### Expected Output
Successful configuration should produce:
- Confirmation message: "✅ retriever is working"
- Example document structure demonstrating the retrieved data format

### Technical Context
This setup establishes the foundation for the RAG pipeline's retrieval component. The retrieved document structure includes URL sources, content snippets, and metadata that will be processed through subsequent pipeline stages including content extraction, chunking, and semantic reranking.
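To make the chunking and reranking stages concrete, here is a minimal sketch under stated assumptions. It is not the implementation in `src/`: the function names are hypothetical, and a bag-of-words counter stands in for the real embedding model (`text-embedding-3-small` in this notebook) so that the example runs without any API calls.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank(query: str, chunks: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Score every chunk against the query and keep the top_k most similar."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


page = (
    "The condenser coil releases heat outdoors. "
    "The evaporator coil absorbs heat from indoor air. "
    "Filters should be replaced every few months."
)
for score, chunk in rerank("How does the condenser release heat?", chunk_text(page, 8, 2), top_k=2):
    print(f"{score:.2f}  {chunk}")
```

Overlapping windows reduce the chance that a sentence relevant to the query is split across two chunks. With a real embedding model, only `toy_embed` would need to change.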

---

In [6]:
load_dotenv()

serper_retriever = SerperRM(api_key=os.getenv("SERPER_API_KEY"), encoder=encoder)
# aretrieve returns a sequence of documents (it is indexed and measured with len below)
retrieved_documents: List[RetrievedDocument] = await serper_retriever.aretrieve("stanford new AI research")
assert len(retrieved_documents) > 0
print("✅ retriever is working")
print("example output")
print(json.dumps(retrieved_documents[0].to_dict(), indent=2))

✅ retriever is working
example output
{
  "url": "https://ai.stanford.edu/",
  "excerpts": [
    "The Stanford Artificial Intelligence Laboratory (SAIL) has been a center of excellence for Artificial Intelligence research, teaching, theory, and practice since its founding in 1963.\n### Carlos Guestrin named as new Director of the Stanford AI Lab!\nWe thank Christopher Manning for being Director of the Stanford AI Lab during a period of enormous growth for AI and SAIL from 2018\u20132025 and today welcome Carlos Guestrin, Fortinet Founders Professor of Computer Science, as the new Director of SAIL.\n### Congratulations to Prof. Fei-Fei Li for being one of the seven engineers who have made seminal contributions to the development of Modern Machine Learning awarded the 2025 Queen Elizabeth Prize for Engineering.\n### Congratulations, Emma for being elected as AAAI Fellow!\n### Congratulations to Chelsea Finn, Dorsa Sadigh, and Sanmi Koyejo for all winning a Presidential Early Career Award

## Action Item 3: End-to-End RAG System Evaluation

> **Reference:** See **Section 3.1 (RAG Systems) → Action Item 3** in the handout for comprehensive RAG pipeline theory and component analysis.

### Objective
Evaluate the complete Retrieval-Augmented Generation pipeline to understand information flow through the following stages:

```
Query → Internet Search → Content Extraction → Document Chunking → Semantic Reranking → Answer Generation
```

### Step 1: Basic Pipeline Test
Execute the code block below to test the RAG system with a basic query. The system will:
- Process the input question through the complete pipeline
- Save results to `output/action_item_3_rag_response.json`
- Demonstrate the integration of retrieval and generation components

### Analysis Requirements
Review the output format to understand:
- Retrieved document structure and source attribution
- Semantic reranking results and relevance scoring
- Generated answer quality and citation methodology

---

In [7]:
# configure the language model for RAG agent
rag_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini",
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
rag_lm = init_lm(rag_lm_config)

# initialize the RAG agent
rag = RagAgent(retriever=serper_retriever, rag_lm=rag_lm)

# forward the request to the RAG agent
rag_response: RagResponse = await rag.aforward(RagRequest(question="What is the latest news on AI research at Stanford?", max_retriever_calls=1))

# make output directory if it doesn't exist
os.makedirs("output", exist_ok=True)

# save the response to a file
with open("output/action_item_3_rag_response.json", "w") as f:
    json.dump(rag_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_3_rag_response.json")


✅ Result saved to output/action_item_3_rag_response.json


### Step 2: Advanced Query Testing

> **Reference:** Section 3.1 emphasizes the importance of evaluating RAG systems across diverse query types to assess robustness and performance characteristics.

#### Test Case A: Recency-Focused Query (`rag_question_recency`)
**Objective:** Evaluate the system's capability to retrieve and synthesize recent information.

**Requirements:**
- Formulate a factual question requiring current information for `rag_question_recency` (e.g., "What are the latest products from Apple in 2025?")
- **Note:** Do not use the provided example; create an original query
- Focus on testing temporal information retrieval accuracy

#### Evaluation Criteria

For this test question, document and analyze:
- **Quality of retrieved sources** (`rag_question_recency_quality_comment`, `rag_question_depth_quality_comment`): List all unique URLs from the RAG results, one URL per line, each followed by [yes] or [no, {reason}] to indicate whether the URL is of high quality and should be included and cited when generating the final answer.
- **Relevance of retrieved information** (`rag_question_recency_relevance_comment`, `rag_question_depth_relevance_comment`): Do all retrieved documents contain information that directly answers the question? Look at the uncited documents and comment on whether you think they are irrelevant to the question and whether that is why the LLM did not cite them.
- **Comprehensiveness of generated answer** (`rag_question_recency_comprehensiveness_comment`, `rag_question_depth_comprehensiveness_comment`): Does the final answer synthesize information well and provide a complete response? Does the LLM make full use of all retrieved information? Which details are left out of the final answer?
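The source-quality review above starts from a list of unique URLs, which can be generated from a saved RAG response. The sketch below is a hypothetical helper, not part of the assignment code. It assumes the response dictionary has `cited_documents` and `uncited_documents` lists whose entries carry a `url` field, which are the keys this notebook reads from the saved JSON in a later cell.

```python
from typing import Dict, List


def unique_urls(rag_result: Dict) -> List[str]:
    """Collect unique URLs from cited and uncited documents, in first-seen order."""
    seen: Dict[str, None] = {}
    for key in ("cited_documents", "uncited_documents"):
        for doc in rag_result.get(key, []):
            seen.setdefault(doc["url"], None)
    return list(seen)


def review_template(rag_result: Dict) -> str:
    """Emit one line per URL with a placeholder verdict to fill in by hand."""
    return "\n".join(f"{url} [yes | no, reason]" for url in unique_urls(rag_result))


# Small hand-written example in the assumed response shape.
example = {
    "cited_documents": [
        {"url": "https://en.wikipedia.org/wiki/2025_in_film"},
        {"url": "https://www.the-numbers.com/market/2025/top-grossing-movies"},
        {"url": "https://en.wikipedia.org/wiki/2025_in_film"},
    ],
    "uncited_documents": [{"url": "https://www.imdb.com/list/ls597681409/"}],
}
print(review_template(example))
```

The verdicts themselves still require reading each source; the helper only removes duplicates and fixes the line format.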

---

In [8]:
rag_question_recency = "What are the top grossing movies of 2025 thus far?"

rag_response: RagResponse = await rag.aforward(RagRequest(question=rag_question_recency, max_retriever_calls=1))

# save the response to a file
with open("output/action_item_3_rag_response_recency.json", "w") as f:
    json.dump(rag_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_3_rag_response_recency.json")

✅ Result saved to output/action_item_3_rag_response_recency.json


In [9]:
rag_question_recency_quality_comment = """
https://www.the-numbers.com/market/2025/top-grossing-movies [yes]
https://en.wikipedia.org/wiki/2025_in_film [yes] 
https://www.reddit.com/r/boxoffice/comments/1lomp1u/with_half_of_2025_gone_heres_the_updated_list_of/ [no, user-generated discussion that might lack reliability and verifiability for factual box office data]
https://www.reddit.com/r/boxoffice/comments/1mhv0sj/2025_hollywood_global_box_office_ytd/ [no, same as previous, user-generated forum]
https://www.imdb.com/list/ls597681409/ [yes]
https://manofmany.com/entertainment/movies-tv/highest-grossing-movies-of-2025 [yes]
https://www.facebook.com/groups/1247563199057863/posts/2245404382607068/ [no, same as reddit posts, user generated!]

"""
rag_question_recency_relevance_comment = """
The cited sources are The Numbers, Wikipedia, and two reddit.com posts. On inspection, all of them contain information that speaks directly to the question.
Of the two Reddit posts, one lists the highest-grossing movies of the 2020s rather than only 2025 films, and it presents this
in an image with no further citations, so I would deem that citation irrelevant.
The other Reddit post stays within 2025 films but also reports its figures through an image (this time with accompanying text) without listing any citations.
The Wikipedia and The Numbers citations are relevant to the query: they report box office data and are reputable sources
(debatable for Wikipedia, but since the article itself contains citations and links to related Wikipedia articles, this seems acceptable for our purposes).

On the flip side, the facebook.com and IMDb pages are not cited.
The Facebook post contained only a graphic with movies and dollar figures, and the only excerpt the scraper captured was the post's comments, so there was no concrete data. In my judgment, RAG was right not to use it.
Leaving out the IMDb pages seems questionable, since they do contain data relevant to the question. The verbose synopsis attached to each movie on these pages may be
why RAG decided against using them (a case of TMI, perhaps).

"""
rag_question_recency_comprehensiveness_comment = """
 I think, overall, yes! The data aligns with what I can find online through fact-checking and cross-referencing. The response makes adequate use of
 the retrieved information, as shown by its accuracy in answering the question. I don't see any major details that were omitted, even though the response
 had to sift through lengthy synopses and irrelevant details.
"""


with open("output/action_item_3_rag_response_recency_comment.json", "w") as f:
    json.dump({
        "quality_comment": rag_question_recency_quality_comment,
        "relevance_comment": rag_question_recency_relevance_comment,
        "comprehensiveness_comment": rag_question_recency_comprehensiveness_comment
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_3_rag_response_recency_comment.json")

✅ Result saved to output/action_item_3_rag_response_recency_comment.json


#### Test Case B: Technical Depth Query (`rag_question_depth`)
**Objective:** Assess the system's performance on specialized technical topics requiring domain expertise.

**Requirements:**
- Design a question addressing a niche technical domain for `rag_question_depth` (e.g., "How do transformer attention mechanisms handle positional encoding?")
- **Note:** Do not use the provided example; formulate an original technical query
- Evaluate the system's ability to retrieve and synthesize specialized knowledge

**Apply the same evaluation framework as detailed above for comprehensive analysis.**

---

In [10]:
rag_question_depth = "How does the condenser work in an AC unit, and how does the transfer of heat differ between a window unit and a standing unit?"

rag_response: RagResponse = await rag.aforward(RagRequest(question=rag_question_depth, max_retriever_calls=1))

# save the response to a file
with open("output/action_item_3_rag_response_depth.json", "w") as f:
    json.dump(rag_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_3_rag_response_depth.json")

✅ Result saved to output/action_item_3_rag_response_depth.json


In [11]:
import json

# Load the saved RAG response and list the source URLs: cited first, then uncited.
with open("output/action_item_3_rag_response_depth.json", "r") as f:
    data = json.load(f)

cited = [source["url"] for source in data["cited_documents"]]
uncited = [source["url"] for source in data["uncited_documents"]]

for url in cited + uncited:
    print(url)


https://www.quora.com/How-does-an-air-conditioner-transfer-heat-from-inside-to-outside
https://todayshomeowner.com/hvac/guides/how-window-air-conditioners-work/
https://www.thespruce.com/how-types-of-air-conditioning-systems-work-1824734
https://www.therma.com/how-does-an-air-conditioner-work/
https://www.reddit.com/r/answers/comments/14ve9sz/if_air_conditioners_work_by_moving_hot_air/
https://home.howstuffworks.com/ac3.htm


In [12]:
rag_question_depth_quality_comment = """
https://www.quora.com/How-does-an-air-conditioner-transfer-heat-from-inside-to-outside [yes]
https://todayshomeowner.com/hvac/guides/how-window-air-conditioners-work/ [yes]
https://home.howstuffworks.com/ac3.htm [yes]
https://www.reddit.com/r/answers/comments/14ve9sz/if_air_conditioners_work_by_moving_hot_air/ [no, user generated, so we can't verify the validity of claims, since reddit allows anyone to respond to a post] 
https://en.wikipedia.org/wiki/Air_conditioning [yes]
https://www.carrier.com/residential/en/us/products/air-conditioners/how-do-air-conditioners-work/ [yes]
"""
rag_question_depth_relevance_comment = """
 All of the sources, both cited and uncited, contain relevant information that could be used to answer the question. Given this, it is hard to say why the model did not cite the only uncited article, from carrier.com, since that article contains niche information on the inner workings of an AC unit. One possible reason is that the article doesn't specifically discuss window units versus portable units, which the second part of my question asked about. Instead, it deals mainly with central cooling.
"""
rag_question_depth_comprehensiveness_comment = """
 I think the RAG agent did an excellent job on the comprehensiveness of its answer. The technical details track well, and the answer goes into enough depth to address my question completely, with apt details.
"""


with open("output/action_item_3_rag_response_depth_comment.json", "w") as f:
    json.dump({
        "quality_comment": rag_question_depth_quality_comment,
        "relevance_comment": rag_question_depth_relevance_comment,
        "comprehensiveness_comment": rag_question_depth_comprehensiveness_comment
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_3_rag_response_depth_comment.json")

✅ Result saved to output/action_item_3_rag_response_depth_comment.json


## Action Item 4: Autonomous Literature Search Evaluation

> **Reference:** See **Section 3.2 (Autonomous Literature Search)** in the handout for comprehensive coverage of deep research paradigms, trajectory tracking, and autonomous exploration strategies.

### Objective
Evaluate the autonomous literature search system using a complex, multi-dimensional research topic that requires systematic exploration and evidence synthesis.

**Research Topic:** *"Evolving Military Strategies in the Russia-Ukraine War and Future Implications"*

### Step 1: System Execution
Execute the literature search agent using the code block below. The system will:
- Autonomously decompose the research topic into focused sub-queries
- Maintain an exploration trajectory across multiple search iterations
- Synthesize findings into a comprehensive research summary

**Output Location:** `output/action_item_4_literature_search_response.json`
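The control flow described above can be sketched as a loop that checks completeness, proposes new questions, and spends a fixed retrieval budget. This is a simplified illustration with hypothetical names and callables, not the code in `src/literature_search.py`. It assumes three ways to stop: the completeness check passes, no new questions are proposed, or the budget runs out.

```python
from typing import Callable, List, Tuple


def run_literature_search(
    topic: str,
    check_complete: Callable[[str, List[str]], bool],
    propose_questions: Callable[[str, List[str]], List[str]],
    answer_question: Callable[[str], str],
    budget: int,
) -> Tuple[List[str], str]:
    """Explore a topic until one of three stop conditions fires.

    Returns the collected findings and the reason the loop stopped.
    """
    findings: List[str] = []
    while True:
        if check_complete(topic, findings):
            return findings, "complete"
        if budget <= 0:
            return findings, "budget_exhausted"
        # Never schedule more questions than the remaining budget allows.
        questions = propose_questions(topic, findings)[:budget]
        if not questions:
            return findings, "no_new_questions"
        for question in questions:
            findings.append(answer_question(question))
            budget -= 1  # one retrieval call spent per question (assumption)


# Stub components standing in for the LLM-backed steps.
findings, reason = run_literature_search(
    topic="example topic",
    check_complete=lambda topic, found: len(found) >= 3,
    propose_questions=lambda topic, found: [f"question {len(found) + i}" for i in range(2)],
    answer_question=lambda q: f"answer to {q}",
    budget=5,
)
print(reason, len(findings))
```

In the real agent the three callables would be LLM or RAG calls. Keeping them as parameters makes the stopping logic easy to inspect in isolation.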

---

In [13]:
literature_search_planning_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: planning involves intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

answer_synthesis_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-5-mini", # NOTE: synthesis does not require high intelligence, but requires minimal hallucination. GPT-5-mini is a good balance.
    temperature=1.0,
    max_tokens=20000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

# initialize the literature search agent
literature_search_agent = LiteratureSearchAgent(rag_agent=rag, literature_search_lm=literature_search_planning_lm, answer_synthesis_lm=answer_synthesis_lm)

# run the literature search agent
literature_search_response: LiteratureSearchAgentResponse = await literature_search_agent.aforward(LiteratureSearchAgentRequest(topic=TOPIC))

# save the response to a file
with open("output/action_item_4_literature_search_response.json", "w") as f:
    json.dump(literature_search_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_4_literature_search_response.json")

Starting literature search for topic: Evolving Military Strategies in the Russia-Ukraine War and Future Implications
Completeness check start.
Completeness check: False, reasoning: No exploration of the required subject matter has begun; major subtopics concerning strategy shifts by both parties, adaptations, and implications remain unexplored, necessitating foundational questions for a comprehensive survey.
Generated 3 next questions for exploration
Executing 3
RAG call start. Question: What have been the major shifts in military strategy employed by Russia throughout the conflict, from the 2022 invasion to the present?. Question context: Exploring Russia's evolving military strategies is essential to understand the main dimensions of the conflict; this question targets the foundational aspect of one side's strategic approach, which has not yet been addressed.
RAG call start. Question: How has Ukraine adapted its military strategies in response to Russian actions and changing battlefi

Completed iteration 6, remaining budget: 9
Completeness check start.
Completeness check: False, reasoning: Surveyed material thoroughly covers Ukraine and Russia’s evolving strategies and future implications but omits an independent assessment of external actor influence and a focused analysis of cutting-edge technology integration, both of which are major and indispensable facets for a comprehensive survey on the topic.
Generated 2 next questions for exploration
Executing 2
RAG call start. Question: How have third-party interventions, such as the involvement of NATO, the United States, and other external actors, influenced the evolving military strategies in the Russia-Ukraine war?. Question context: While the survey has covered the strategies of Russia and Ukraine and assessed future implications, it has not yet analyzed the direct impact of key external actors—whose involvement fundamentally shapes both sides' strategic adaptations and future scenarios. This question addresses a cor

Completed iteration 10, remaining budget: 5
Completeness check start.
Completeness check: True, reasoning: All major dimensions—Russian and Ukrainian evolving strategies, technological transformation, third-party interventions, and future military/geopolitical implications—have been thoroughly surveyed with sufficient depth and balance; there are no significant knowledge gaps remaining in the relevant scope.
Literature search deemed complete by completeness checker
Survey completed with 5 responses
Starting final synthesis
Final synthesis complete
✅ Result saved to output/action_item_4_literature_search_response.json


### Step 2: Systematic Literature Search Analysis

> **Analysis Target:** `output/action_item_4_literature_search_response.json`

#### Analysis Questions

1. **Information Quality:** What information did you find most valuable? Which insights provided new understanding of the topic?
2. **Coverage Gaps:** What important aspects or perspectives appear missing from the investigation?
3. **Source Diversity:** How well did the system identify and incorporate diverse viewpoints and source types?
4. **Research Depth:** Did the investigation progress beyond surface-level information to uncover nuanced insights?
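For the source diversity question, a quick tally of domains can support the qualitative judgment. The helper below is a hypothetical sketch that works on a plain list of URLs. How those URLs are extracted from the literature search response is left open, since the response schema is not shown here.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse


def domain_counts(urls: Iterable[str]) -> Counter:
    """Count how many URLs come from each host, ignoring a leading 'www.'."""
    counts: Counter = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


# Example URLs taken from the RAG outputs earlier in this notebook.
urls = [
    "https://www.reddit.com/r/boxoffice/comments/1lomp1u/with_half_of_2025_gone_heres_the_updated_list_of/",
    "https://www.reddit.com/r/boxoffice/comments/1mhv0sj/2025_hollywood_global_box_office_ytd/",
    "https://en.wikipedia.org/wiki/2025_in_film",
    "https://www.imdb.com/list/ls597681409/",
]
for host, n in domain_counts(urls).most_common():
    print(f"{n}  {host}")
```

A domain tally shows concentration, not viewpoint, so it complements a manual read of who publishes each source rather than replacing it.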

#### Algorithm Question
5. **Read the code and comment on the stop criteria of this pipeline.** (hint: the code entry point is `cs224v_hw1/src/literature_search.py`, line 311)

---

In [17]:
literature_search_information_quality_comment = """
As someone who is generally aware of the larger events of the Russian invasion of Ukraine but lacks the finer details, what I found most
valuable were the more in-depth pieces of information about how the landscape of the conflict shifted over time. Specifically, the report highlighted how
the protraction of the war was characterized by an energetic initial invasion that tapered into a much slower "war of attrition" waged by Russia.
The description sprinkled in valuable details throughout, such as which fronts Russia initially targeted in its attempt to take Kyiv and which offensive
moves it used (aerial attacks on military and civilian infrastructure). The report explains well how the trajectory shifted notably
once Ukrainian resistance was deployed, including how Russia faced asymmetrical costs via high casualty rates and, in turn, higher mobilization rates, as well as
the bolstering of Ukrainian defense via Western-provided arms.
"""

literature_search_coverage_gaps_comment = """
What feels missing is that the future implications aren't explored from either the
humanitarian or the economic perspective. Implications for rebuilding, social costs,
etc. could have been explored, especially given the huge blows to Ukrainian infrastructure.
Diplomatic paths toward a future resolution could also have been touched on.
"""

literature_search_source_diversity_comment = """
The report seems to have a solid mix of reputable sources, such as the BBC, Wikipedia, RAND (which, according
to Google, is a US policy think tank), CSIS (an American think tank), and CNA. On diversity, these sources
clearly show a Western lean in perspective. Still, I'd deem the sources to be a
reliable group.
"""

literature_search_research_depth_comment = """
I'd say the investigation went beyond surface-level facts: it laid out how both Ukrainian
and Russian strategies shifted over time and attributed those shifts to reasons rooted
in reporting. The report also does a decent job of pointing out how emerging
wartime practices, such as Ukraine's mass deployment of drones as a form of defense, might influence future
warfare.
"""

literature_search_stop_criteria_comment = """
I think there are 3 different stop criteria. The first is that the is_complete flag is true, which is
defined as:

desc="Return True if the survey comprehensively covers the main topic with sufficient depth in all
required scope detailed in the guideline. Err on the side of thoroughness."

OR no further investigative questions are generated
OR
there are no more retriever calls available, as budgeted via max_retriever_calls.
"""

with open("output/action_item_4_literature_search_response_comment.json", "w") as f:
    json.dump({
        "information_quality_comment": literature_search_information_quality_comment,
        "coverage_gaps_comment": literature_search_coverage_gaps_comment,
        "source_diversity_comment": literature_search_source_diversity_comment,
        "research_depth_comment": literature_search_research_depth_comment,
        "stop_criteria_comment": literature_search_stop_criteria_comment
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_4_literature_search_response_comment.json")

✅ Result saved to output/action_item_4_literature_search_response_comment.json


## Action Item 5: Primary Source Database Exploration

> **Reference:** See **Section 4.2 (ACLED Database & Primary Source Analysis)** in the handout for comprehensive coverage of investigative research methodologies and database exploration strategies.

### Objective
Transition from literature synthesis to investigative analysis through systematic exploration of primary source data. This exercise demonstrates capabilities beyond web-accessible information by leveraging structured conflict databases.

**Data Source:** Armed Conflict Location & Event Data Project (ACLED)
- **Coverage:** Global conflict events with structured metadata
- **Temporal Scope:** Real-time updates with historical records
- **Analytical Advantages:** Systematic pattern detection, geographic clustering, temporal trend analysis

### Step 1: Database Exploration Agent Initialization

---

In [15]:
# This dspy signature is used to generate seed questions using writeup from previous literature search to kick start the database exploration. 
class ResearchQuestionGenerator(dspy.Signature):
    """You are conducting research to extract previously unknown insights by exploring and observing information in a database. Generate research questions that an investigator will be interested in. The questions will be used to generate search queries in the database to help answer them. The questions should be self-contained, meaning they must include any specific years, months, locations, or other details instead of references that require the reader to know additional context. All questions must be related to the research goal and topic. Investigate any correlations as you see fit. The questions should be completely independent of each other - if you believe some questions need to be answered first before others can be meaningful, only generate those foundational questions and not others that depend on the answers to the first questions. You do not need to generate the maximum number of questions if you believe fewer questions would be better for the research goals."""
    
    topic: str = dspy.InputField(desc="The research goal or topic being investigated")
    db_description: str = dspy.InputField(desc="Description of the available database including its structure, contents, and capabilities")
    max_questions: int = dspy.InputField(desc="Maximum number of questions to generate - you may generate fewer if appropriate")
    previous_insights: str = dspy.InputField(desc="Previous insights from the database")
    
    questions: List[str] = dspy.OutputField(desc="List of independent, self-contained research questions that will extract previously unknown insights related to the topic from the database")

generate_exploration_questions = dspy.Predict(ResearchQuestionGenerator)

In [16]:
# research question generation
database_exploration_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # database exploration planning requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
database_exploration_lm = init_lm(database_exploration_lm_config)

# load the literature search response
assert os.path.exists("output/action_item_4_literature_search_response.json"), "Please run the literature search first"
# use a context manager so the file handle is closed after loading
with open("output/action_item_4_literature_search_response.json") as f:
    literature_search_response = LiteratureSearchAgentResponse.from_dict(json.load(f))

with dspy.context(lm=database_exploration_lm):
    seed_questions = (await generate_exploration_questions.aforward(
        topic=TOPIC, 
        db_description=ACLED_DB_DESCRIPTION, 
        max_questions=4, 
        previous_insights=literature_search_response.writeup
    )).questions

seed_questions_formatted = "\n\t- ".join(seed_questions)
print(f"✅ Seed questions generated. Seed questions:\n\t- {seed_questions_formatted}")

✅ Seed questions generated. Seed questions:
	- How have the frequency, intensity, and geographic distribution of different types of military operations (e.g., offensive assaults, defensive operations, drone strikes, cross-border raids) by both Russian and Ukrainian forces changed from 2022 to early 2025, and what patterns emerge when comparing phases of the conflict?
	- What correlations exist between periods of significant Western/NATO military aid deliveries to Ukraine and subsequent shifts in Ukrainian military operational tactics, such as increased use of precision strikes, adoption of defense-in-depth, or expanded use of unmanned systems?
	- Is there measurable evidence that the introduction and increased usage of AI-enabled or autonomous systems (such as drones or unmanned surface vessels) by either Russia or Ukraine resulted in statistically significant changes in battlefield outcomes, such as territory gained/lost, casualty rates, or disruption of enemy operations between 2023 

### Step 2: Primary Source Analysis Framework

> **Reference:** See **Action Item 5 Step 2** in the handout for detailed investigative analysis methodology.


Database exploration may take 30-60 minutes with the given topic and the seed questions generated above. We have therefore made a precomputed result from the database exploration agent available at `data/action_item_5_database_exploration_precomputed.json`.
> **Implementation Reference:** Complete code provided in Appendix section

#### Key Analysis Dimensions:
1. **Database Structure and Operations:** Document which specific ACLED data tables, fields, and query patterns the agent utilized. Understanding the data architecture helps assess the comprehensiveness and reliability of findings.

2. **Quantitative Evidence Gaps:** Compare database findings with your literature search results from Action Item 4. Identify novel insights unavailable through web-based sources and assess how structured data analysis reveals patterns obscured in traditional reporting.

### Step 3: Investigative Hypothesis Formation

Based on your comparative analysis, develop at least **2 investigative thesis statements** that synthesize insights from both literature search and database exploration. Each thesis should be specific, highlight the tension or contradiction, name the actor and the action if present, use concrete numbers or data if central to the story, signal the impact on readers or society, and stay concise, punchy, and memorable while remaining grounded in the facts.

---

In [37]:
action_item_5_database_operation_comment = """

The agent queried the stanford_api_data table, which I assume is the ACLED table.
The primary filters selected Air/drone strike events between 2022 and 2024 and named specific
Russian and Ukrainian actors, narrowing the search to the invasion. The queries also returned
details such as each event's date, location, actors involved, fatalities, and notes, along with the precision of the time and
location data. As noted under the "key_point" attribute, the agent's queries found
a total of 7,432 air/drone strike events from 2022 to 2024 involving Russian and Ukrainian military forces.


"""
    
action_item_5_database_evidence_comment = """
Compared to Action Item 4's literature review, the database adds quantitative depth to our research.
We now have actual numbers showing that, for instance, strike frequency roughly doubled
from 2022 to 2024. Another interesting insight from this system: "Most attacks result in low but consistent fatality numbers, punctuated by rare large-scale events,"
and we also learn that the initial attacks in 2022 were the most frequent of the queried years. While these general trends were mentioned
briefly in our previous analysis, the numbers provide strong data-backed evidence that paints a more complete picture.

A more compelling piece of investigation comes from the following synthesis of the data:

"Initially, strikes were concentrated in key Ukrainian administrative regions—Donetsk, Kharkiv, Zaporizhia, and Kherson—with strong 
targeting of front-line contact zones and logistics hubs [1]. However, by 2024, strikes became more distributed and transnational:
Ukrainian forces initiated attacks within Russian territory, particularly in Kursk (with 1,411 strikes),
signifying new cross-border operational capability and intent"

This insight is novel in comparison to our previous investigation: we get concrete analysis of how strike locations shifted, and
the system is then able to conclude that the conflict shifted from Ukrainian regions to areas within Russia. Very neat!
"""

action_item_5_proposed_theses = """
1. In 2024, Ukraine's use of over 1,400 UAV strikes inside Russia shows a reversal of territorial
dominance, redefining the war as one that the aggressor of the invasion can no longer contain within Ukraine.
2. Ukraine’s leap from defending its skies in 2022 to launching over 3,700 drone strikes by 2024 illustrates
how low-cost UAVs can level the playing field against an aggressive and militarily superior enemy.
"""

with open("output/action_item_5_comments.json", "w") as f:
    json.dump({
        "action_item_5_database_operation_comment": action_item_5_database_operation_comment,
        "action_item_5_database_evidence_comment": action_item_5_database_evidence_comment,
        "action_item_5_proposed_theses": action_item_5_proposed_theses
    }, f, indent=2)

print(f"✅ Result saved to output/action_item_5_comments.json")


✅ Result saved to output/action_item_5_comments.json


## Action Item 6: DSPy Implementation for Automated Thesis Generation

> **Reference:** See **Section 4.3 (DSPy Framework)** in the handout.

### Objective
Implement an LLM-based thesis generation mechanism using the DSPy framework to automate the investigative hypothesis development process performed manually in Action Item 5.

### Implementation Requirements

**DSPy Signature Specification:**
```python
class ThesisGenerator(dspy.Signature):
    # Implementation required
```

**Technical Specifications:**
- **Input Fields:** Research topic and database exploration insights
- **Output Specification:** Exactly 5 thesis statements
- **Constraint Requirements:** Each thesis must be specific, evidence-based, and analytically defensible
---
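
For the **Input Fields** specification above, one option is to render each database exploration question-answer pair as a labeled block and pass the result to the signature as a single string. The sketch below is illustrative only: it assumes each entry is a dict with `question` and `answer` keys, which is an assumption about the precomputed JSON rather than something confirmed here, and `format_db_insights` is a hypothetical helper name.

```python
from typing import Dict, List


def format_db_insights(responses: List[Dict[str, str]]) -> str:
    """Render question/answer pairs as numbered, labeled blocks.

    Assumes each dict has "question" and "answer" keys (hypothetical schema);
    missing keys fall back to a placeholder instead of raising.
    """
    blocks = []
    for idx, response in enumerate(responses, 1):
        question = str(response.get("question", "(no question)")).strip()
        answer = str(response.get("answer", "(no answer)")).strip()
        blocks.append(f"### Insight {idx}\nQuestion: {question}\nAnswer: {answer}")
    # A blank line between blocks keeps each pair visually separate in the prompt.
    return "\n\n".join(blocks)


# Toy placeholder data, not real database output.
example = [
    {"question": "Placeholder question A?", "answer": "Placeholder answer A [1]."},
    {"question": "Placeholder question B?", "answer": "Placeholder answer B [2]."},
]
print(format_db_insights(example))
```

Labeling each pair keeps questions attached to their answers, which a plain concatenation would not guarantee.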

In [26]:
class ThesisGenerator(dspy.Signature):
    """
    You are GENERATING exactly FIVE (5) INVESTIGATIVE thesis statements that synthesize
    THE ORIGINAL RESEARCH TOPIC, STRUCTURED DATABASE INSIGHTS, and LITERATURE REVIEW HIGHLIGHTS AND GAPS.

    The requirements for each thesis are as follows:
    - SPECIFIC: names the actors and actions; avoids vagueness.
    - EVIDENCE-BASED: CITES at least one concrete data point OR pattern from the database inputs OR a precise claim from the literature.
    - ANALYTICALLY DEFENSIBLE.
    - Addresses societal impact EXPLICITLY.
    - CONCISE and MEMORABLE.
    """
    original_topic: str = dspy.InputField(
        desc="The initial research topic that was explored"
    ) 
    
    db_figures: str = dspy.InputField(
        desc="Responses from database exploration"
    )

    
    # ===============================================
    # additional input fields here
    # Hint1: how to pass insights from database exploration agent to the thesis generator?
    # Hint2: will directly concatenate question and answers from database exploration agent work? Do we need any formatting?
    # ===============================================
    
    # Should only have one output field as defined below. Do not change the name of the output field.
    proposed_theses: List[str] = dspy.OutputField(
        desc="""A Python list of exactly five distinct, concise theses. Each names specific
        actors and cites a concrete pattern or figure when critical (specific, evidence-based, and analytically defensible)."""
    )

thesis_generator = dspy.Predict(ThesisGenerator)
thesis_generator_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: thesis generation requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

In [49]:
database_exploration_rag_responses = []
with open("data/action_item_5_database_exploration_precomputed.json", "r") as f:
    database_exploration_precomputed = json.load(f)
    database_exploration_rag_responses = database_exploration_precomputed.get("rag_service_responses", [])


with dspy.context(lm=thesis_generator_lm):
    generated_theses = (await thesis_generator.aforward(
        original_topic=TOPIC,
        # ===============================================
        # other input fields here
        # ===============================================
        db_figures=database_exploration_rag_responses,
    )).proposed_theses

formatted_generated_theses = "\n\t- ".join(generated_theses)  # same "- " bullet prefix as the print below
print(f"generated_theses:\n\t- {formatted_generated_theses}")

with open("output/action_item_6_generated_theses.json", "w") as f:
    json.dump({
        "generated_theses": generated_theses
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_6_generated_theses.json")

print("\n\nTake a look at the output format and skim through the content. Are you satisfied with the results? Will your audience be interested in your proposed theses? When in doubt, it's always a good idea to revise the prompt and try again.")
if len(generated_theses) != 5:
    raise ValueError(f"❌ Please generate exactly 5 proposed theses but found {len(generated_theses)}.")


generated_theses:
	- Ukrainian military forces, by escalating high-precision cross-border drone and missile strikes in Belgorod and Kursk in 2024 (with 'Air/drone strike' and 'Shelling/artillery/missile attack' accounting for 43.9% and 26.2% of events with geo_precision=1), have evolved toward a doctrine of persistent, surgical disruption that seeks to erode Russian border stability while minimizing direct fatalities.
	-Russian Air Force operations from 2022 to 2024 demonstrate a doctrinal broadening, conducting 39 distinct regional drone strike campaigns—both within Ukraine (e.g., Donetsk, Kyiv) and transnationally (e.g., Ar Raqqa, Rural Damascus, Lattakia)—thereby signaling intent to establish multi-theater unmanned warfare as a new pillar of power projection.
	-The overwhelming targeting of major Ukrainian urban centers—such as Kharkiv (928 events), Kyiv City (491), and Odesa (391) between 2022 and 2025, with 975 of 2,359 attacks explicitly impacting civilians—reveals a deliberate R


**Extra human-in-the-loop step**

Manually review all generated theses and hand-pick the best two.

In [30]:
# Review the generated theses and hand-pick the best 2.
selected_theses = ["Russian Air Force operations from 2022 to 2024 demonstrate a doctrinal broadening, conducting 39 distinct regional drone strike campaigns—both within Ukraine (e.g., Donetsk, Kyiv) and transnationally (e.g., Ar Raqqa, Rural Damascus, Lattakia)—thereby signaling intent to establish multi-theater unmanned warfare as a new pillar of power projection.",
    "Despite the proliferation of remote strike events (over 15,000 drone and artillery-related incidents from 2022–2024), the median fatality per attack remains at zero, evidencing a shift from mass-casualty operations to continuous attrition warfare designed for disruption, resource depletion, and prolonged psychological pressure on civilian societies."] # TODO: add the selected theses

assert len(selected_theses) == 2, f"❌ Please select exactly 2 theses but found {len(selected_theses)}."
with open("output/action_item_6_selected_theses.json", "w") as f:
    json.dump({
        "selected_theses": selected_theses
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_6_selected_theses.json")

✅ Result saved to output/action_item_6_selected_theses.json


## Action Item 7: Automated Investigative Report Synthesis

> **Reference:** See **Section 4.4 (Synthesis & Report Composition)** in the handout for comprehensive methodology on evidence synthesis and investigative narrative construction.

### Objective
Synthesize your research findings into a comprehensive investigative report using automated DSPy-based processing pipelines that demonstrate advanced research capabilities beyond traditional literature synthesis.

### Generated Report Structure:
1. **Executive Summary:** A concise overview of your key findings and their significance
2. **Sections:** Use `#` for primary sections; Use `##` for subsections that organize detailed analysis; etc.
3. **Inline Citations:** Use numbered references in square brackets (e.g., `[1]`, `[2]`, `[3]`)
4. **Bibliography:** Conclude with a reference section where each line follows the format: `[index]. URL or source description`
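
The citation and bibliography conventions above can be checked mechanically. The following is a minimal sketch, assuming the report body and the bibliography are available as separate strings; `check_report_citations` is a hypothetical helper, not part of the provided code.

```python
import re
from typing import Set, Tuple

CITATION_PATTERN = re.compile(r"\[(\d+)\]")
# Matches bibliography lines of the form "[index]. URL or source description".
BIBLIOGRAPHY_LINE_PATTERN = re.compile(r"^\[(\d+)\]\.\s+\S")


def check_report_citations(body: str, bibliography: str) -> Tuple[Set[int], Set[int]]:
    """Return (cited indices with no bibliography entry, bibliography indices never cited)."""
    cited = {int(index) for index in CITATION_PATTERN.findall(body)}
    listed: Set[int] = set()
    for line in bibliography.splitlines():
        match = BIBLIOGRAPHY_LINE_PATTERN.match(line.strip())
        if match:
            listed.add(int(match.group(1)))
    return cited - listed, listed - cited


# Toy placeholder text, not real report content.
body = "First claim [1][3]. Second claim [2]."
bibliography = "[1]. https://example.com/a\n[2]. Example source description"
missing, unused = check_report_citations(body, bibliography)
print(f"missing from bibliography: {sorted(missing)}, never cited: {sorted(unused)}")
```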

### Implementation Process:

**Step 1: Thesis-Specific Literature Search**

Using your two selected theses from Action Item 6, conduct targeted literature searches for each thesis. Implement a `LiteratureSearchAgentRequest` with the `guideline` parameter to focus searches on supporting evidence for each specific thesis. Set `with_synthesis=False` to collect raw `rag_responses` for later processing.

**Step 2: Key Insight Identification**

Implement a `KeyInsightIdentifier` DSPy signature to automatically extract the most important insight from each RAG response. This reduces noise and focuses on essential information for report composition. The key insight should be a concise, one-sentence summary capturing the most relevant information for each question-answer pair.
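
Because an LLM may not always respect the one-sentence constraint, a rough post-check can flag insights that need a retry. This is a heuristic sketch only: `count_sentences` is a hypothetical helper, and the regular expression will miscount abbreviations such as "U.S. Army".

```python
import re

# A sentence boundary here is ., ! or ? (optionally followed by closing
# brackets), then whitespace, then an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"[.!?][)\]]*\s+(?=[A-Z])")


def count_sentences(text: str) -> int:
    """Roughly count the sentences in `text` (0 for empty input)."""
    stripped = text.strip()
    if not stripped:
        return 0
    return len(SENTENCE_BOUNDARY.findall(stripped)) + 1


# Toy placeholder insights, not real model output.
print(count_sentences("A single placeholder insight with citations [1][2]."))
print(count_sentences("One thing happened [1]. Another followed."))
```

An insight with a count other than one could be regenerated or reviewed by hand.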

**Step 3: Report Structure and Guideline Generation**

Create a `FinalWritingGuidelineProposal` DSPy signature that uses the collected key insights to:
- Generate a unified report thesis that synthesizes your investigative findings
- Propose a structured writing guideline in bullet-point format outlining key sections and content organization
- Ensure logical flow from background through specific discoveries to implications

**Step 4: Automated Report Synthesis**

Implement a `FinalReportSynthesizer` DSPy signature that combines the report thesis, writing guidelines, and all collected evidence into a coherent investigative report. The synthesizer should:
- Merge relevant information into logically coherent narrative sections
- Preserve all original citations exactly as provided
- Eliminate redundancy while maintaining completeness
- Create smooth transitions between thematic sections
- Constrain content to provided information without external speculation
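
The citation-preservation requirement above lends itself to a quick sanity check after synthesis. The sketch below assumes citation indices have already been normalized to one global numbering; `compare_citations` is a hypothetical helper name.

```python
import re
from typing import Iterable, Set, Tuple

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def extract_citations(text: str) -> Set[int]:
    """Collect every inline citation index of the form [n]."""
    return {int(index) for index in CITATION_PATTERN.findall(text)}


def compare_citations(evidence: Iterable[str], report: str) -> Tuple[Set[int], Set[int]]:
    """Return (indices in the evidence but absent from the report,
    indices in the report that appear in no evidence text)."""
    available: Set[int] = set()
    for answer in evidence:
        available |= extract_citations(answer)
    used = extract_citations(report)
    return available - used, used - available


# Toy placeholder text, not real evidence or report content.
evidence = ["Finding A [1][2].", "Finding B [3]."]
report = "Finding A holds [1]. Finding B holds [3][4]."
dropped, invented = compare_citations(evidence, report)
print(f"dropped: {sorted(dropped)}, invented: {sorted(invented)}")
```

Dropped indices are not necessarily errors, since removing redundancy can legitimately remove a citation, so they are best treated as a review list. An invented index points at a source that was never retrieved and deserves closer attention.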

### Quality Standards:

Your final report should demonstrate investigative depth by presenting original analytical insights not readily available in existing literature, comprehensive coverage through automated integration of evidence from multiple search iterations, logical organization with clear progression from initial questions to final conclusions, and proper attribution through systematic citation preservation.

The completed report should read as a coherent investigative piece rather than a collection of separate research summaries, showcasing the power of automated synthesis in building compelling investigative narratives from distributed evidence sources.

---

In [32]:
# After selecting your theses, run another round of literature search to gather evidence for each one.
rag_responses = []

for idx, thesis in enumerate(selected_theses, 1):
    # ===============================================
    # TODO: generate literature search request. Take a look at definition of LiteratureSearchAgentRequest
    # Hint: previously we only use the field `topic` and leave other fields as default. Now we need to make use of the field `guideline`. Take a look at the source code to understand how the guideline is used.
    # Hint 2: we disable the synthesis step by setting `with_synthesis=False` as we only need the rag responses.
    # ===============================================
    # Build the per-thesis guideline first so the request itself stays readable.
    guideline = (
        "Objective: retrieve high-quality, citable sources that directly SUPPORT or CHALLENGE the thesis.\n"
        f"Target thesis: {thesis}\n\n"
        "Priority sources:\n"
        "- Primary datasets and reputable orgs (e.g., ACLED, SIPRI, IISS, RAND, CSIS, CNA)\n"
        "- Tier-1 newswires/outlets (Reuters, AP, BBC, FT, Economist)\n\n"
        "Extraction focus:\n"
        "- Actors, dates (YYYY-MM / YYYY-MM-DD), locations (admin1/admin2), quantitative counts/trends,\n"
        "- Methodology caveats (coding, precision, coverage), exact quoted passages when numbers are cited.\n\n"
        "Exclusions: opinion-only blogs, unsourced social posts.\n"
        "Return raw RAG items only; DO NOT synthesize.\n"
    )
    literature_search_response: LiteratureSearchAgentResponse = await literature_search_agent.aforward(
        LiteratureSearchAgentRequest(topic=thesis, guideline=guideline, with_synthesis=False)
    )
    
    
    with open(f"output/action_item_7_literature_search_response_{idx}.json", "w") as f:
        json.dump(literature_search_response.to_dict(), f, indent=2)
    print(f"✅ Result saved to output/action_item_7_literature_search_response_{idx}.json")

    rag_responses.extend(literature_search_response.rag_responses)


Starting literature search for topic: Russian Air Force operations from 2022 to 2024 demonstrate a doctrinal broadening, conducting 39 distinct regional drone strike campaigns—both within Ukraine (e.g., Donetsk, Kyiv) and transnationally (e.g., Ar Raqqa, Rural Damascus, Lattakia)—thereby signaling intent to establish multi-theater unmanned warfare as a new pillar of power projection.
Completeness check start.
Completeness check: False, reasoning: No evidence or sources have been retrieved; major knowledge gaps exist for Ukraine, transnational operations, and methodology caveats, each explicitly required by the guideline.
Generated 3 next questions for exploration
Executing 3
RAG call start. Question: What citable primary dataset or reputable organizational report documents Russian Air Force drone strike campaigns conducted in Ukraine (including locations such as Donetsk and Kyiv) from 2022 to 2024 with quantitative counts and dates?. Question context: This targets retrieval of high-qua

Completed iteration 6, remaining budget: 9
Completeness check start.
Completeness check: False, reasoning: While Ukrainian theater coverage and methodological caveats have strong documentation, there remains a critical lack of citable direct evidence and credible analysis confirming Russian Air Force drone strike campaigns in transnational theaters (e.g., Syria) and explicit Russian doctrinal intent—both of which are mandatory to fully support or challenge the thesis per the guideline.
Generated 2 next questions for exploration
Executing 2
RAG call start. Question: What additional Tier-1 newswire or primary dataset sources (beyond those reviewed) directly document Russian Air Force drone strike operations in transnational theaters such as Ar Raqqa, Rural Damascus, Lattakia, or other comparable regions between 2022 and 2024, including actor, date, strike count, and campaign details?. Question context: This targets the explicit gap in direct, citable evidence for Russian Air Force drone 

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x11f2fa290>


Completed iteration 10, remaining budget: 5
Completeness check start.
Completeness check: False, reasoning: Substantial evidence fully verifies Ukrainian campaigns and doctrinal intent, with methodology caveats addressed, but the survey still lacks any direct, citable sources documenting Russian Air Force drone strike operations in transnational theaters (Ar Raqqa, Rural Damascus, Lattakia, etc.), which is a mandatory thesis requirement not yet covered despite attempted retrieval.
Generated 1 next questions for exploration
Executing 1
RAG call start. Question: What additional primary datasets or reputable organizational reports (e.g., SIPRI, Jane’s, or regional monitoring alliances) contain any documented evidence—dates, actors, strike counts—of Russian Air Force drone operations in transnational theaters such as Ar Raqqa, Rural Damascus, and Lattakia between 2022 and 2024?. Question context: This directly targets the persistent, critical gap in transnational campaign evidence; confirm

Completed iteration 12, remaining budget: 3
Completeness check start.
Completeness check: False, reasoning: There remains a major unfilled evidentiary gap regarding direct, citable documentation of Russian Air Force drone strike campaigns in transnational theaters (such as Ar Raqqa, Rural Damascus, Lattakia) from 2022 to 2024, which is essential for complete coverage of the thesis and has not been addressed by previously summarized sources.
Generated 1 next questions for exploration
Executing 1
RAG call start. Question: Do any multilingual regional security reports or Syrian local media archives (with translation) document Russian Air Force drone strikes in Ar Raqqa, Rural Damascus, Lattakia, or other comparable transnational theaters from 2022 to 2024, including actors, dates, and strike counts?. Question context: This targets a critical knowledge gap in citable evidence for Russian Air Force transnational drone operations—particularly in the Syrian theaters—by querying less commonly 

Completed iteration 14, remaining budget: 1
Completeness check start.
Completeness check: True, reasoning: All core thesis components have been strategically addressed via high-quality, citable sources—including quantitative and methodological coverage of Russian Air Force drone campaigns in Ukraine, extensive extraction and caveats on data reliability, direct querying and documentation of doctrinal and intent-related evidence, and targeted attempts (including multilingual and local archives) to resolve the transnational campaign gap, ultimately finding no documented evidence for Russian offensive drone strikes outside Ukraine within the specified timeframe; thus, all priority guideline requirements have been fulfilled and no critical knowledge gaps remain.
Literature search deemed complete by completeness checker
Survey completed with 7 responses
✅ Result saved to output/action_item_7_literature_search_response_1.json
Starting literature search for topic: Despite the proliferation of 

Completed iteration 4, remaining budget: 11
Completeness check start.
Completeness check: False, reasoning: Key mandatory requirements remain unaddressed, particularly around methodology caveats/limitations from main datasets and direct empirical evidence for the thesis’s median fatality claim; these present significant knowledge gaps that must be filled for a rigorous, citable survey.
Generated 2 next questions for exploration
Executing 2
RAG call start. Question: What methodology caveats or limitations are disclosed in the cited datasets (e.g., ACLED, Explosive Weapons Monitor, Drone Wars UK) concerning incident completeness, fatality coding, and geographic/event precision for 2022–2024 drone/artillery strike reporting?. Question context: This question targets an unaddressed mandatory requirement from the guideline: extracting explicit methodology caveats that may affect data accuracy or coverage, which are critical for assessing the reliability of quantitative counts and fatality ra

Completed iteration 8, remaining budget: 7
Completeness check start.
Completeness check: False, reasoning: A critical knowledge gap remains regarding direct, credible sources that empirically challenge or contradict the thesis by presenting median/aggregate fatality statistics above zero for remote strike events during 2022–2024, which is essential for rigorous completeness.
Generated 1 next questions for exploration
Executing 1
RAG call start. Question: Are there any reputable outlier studies or incident datasets (not already covered) that challenge the thesis, by documenting significant numbers or proportions of mass-casualty remote strikes (drone/artillery) during 2022–2024 with median or aggregate event fatalities above zero?. Question context: This directly targets a remaining gap: locating any empirical counter-evidence or alternative incident-level data from credible sources (beyond ACLED/SIPRI/Drone Wars UK/AOAV), essential for testing the thesis's generalizability and robustne

Completed iteration 10, remaining budget: 5
Completeness check start.
Completeness check: True, reasoning: All mandatory guideline requirements—including authoritative supporting and dissenting sources, quantitative event-level data, methodology caveats, strategic analyses by tier-1 media, and outlier conflicting studies—have now been addressed with direct citable evidence or explicit statements of absence, ensuring exhaustive empirical coverage and closure of major knowledge gaps on the survey topic.
Literature search deemed complete by completeness checker
Survey completed with 5 responses
✅ Result saved to output/action_item_7_literature_search_response_2.json


In [33]:
# Identify key insight for each rag response

# The goal of key insight identification is to extract the most important insight from each RAG response.
# Think of it as a one-sentence summary of the part of the response most relevant to the question.
# This reduces noise in the RAG responses so we can focus on the most important information
# when determining the final report outline.

class KeyInsightIdentifier(dspy.Signature):
    """
    You are GENERATING exactly ONE (1) KEY INSIGHT from the RAG response.
    It must be ONE sentence only, no more, no less.

    REQUIREMENTS:
    - Directly answers the question, not vague.
    - Names actors and actions, incl. time/place if available.
    - If there is a number/trend in the answer include it.
    - Keep inline [n] citations exactly as shown, do NOT change them.
    - No hedging like "maybe" or "could". No meta talk.
    - If no good fact is in the answer, return: "Insufficient evidence in retrieved sources to answer the question."
    """
    question: str = dspy.InputField(
        desc="The question that was asked"
    )
    question_context: str = dspy.InputField(
        desc="The context of the question"
    )
    answer: str = dspy.InputField(
        desc="The answer to the question aggregating information from external sources"
    )
    key_insight: str = dspy.OutputField(
        desc="One sentence key insight."
    )

key_insight_identifier = dspy.Predict(KeyInsightIdentifier)
key_insight_identifier_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini", # NOTE: should we use a more powerful model? You can play around with it.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))


for rag_response in tqdm(rag_responses):
    with dspy.context(lm=key_insight_identifier_lm):
        rag_response.key_insight = (await key_insight_identifier.aforward(
            question=rag_response.question,
            question_context=rag_response.question_context,
            answer=rag_response.answer
        )).key_insight

with open("output/action_item_7_rag_responses_with_key_insight.json", "w") as f:
    json.dump([rag_response.to_dict() for rag_response in rag_responses], f, indent=2)

for rag_response in rag_responses:
    print(rag_response.key_insight + "\n")
print(f"✅ Result saved to output/action_item_7_rag_responses_with_key_insight.json")

100%|██████████| 12/12 [00:16<00:00,  1.37s/it]

Reputable datasets and reports from the Ukrainian Air Force, CSIS, ACLED, ISW, and UN document that from 2022 to 2024, Russia conducted extensive drone strike campaigns across Ukraine, including in Donetsk and Kyiv, with confirmed launches exceeding 14,700 attack drones and over 19,000 missiles, reaching peak monthly drone launches of over 6,400 in July 2025 and causing significant civilian casualties [1][2][3][4][5][6][8].

No primary sources or Tier-1 newswires from 2022 to 2024 provide evidence of Russian Air Force drone operations in transnational contexts such as Ar Raqqa, Rural Damascus, or Lattakia, including details on dates, strike counts, or campaign intent.

Between 2022 and 2024, reputable organizations like the ISW, IISS, CSIS, CNAS, the Belfer Center, and West Point highlight that Russian Air Force drone campaign data suffer from significant methodological challenges including reliance on incomplete open-source intelligence, varied drone types complicating counts, electro




In [35]:
# Final report title and guideline proposal

# We will generate a title and a guideline for the final report. The guideline should be in bullet-point format, outlining the key points that should be included in the report.

class FinalWritingGuidelineProposal(dspy.Signature):
    """
    You are GENERATING a report TITLE + bullet GUIDELINE using our theses + key insights.

    REQS:
    - Title: 8–14 words, headline style, punchy, no colon.
    - Use the selected theses + key insights; no outside facts.
    - Guideline: bullets only; clear sections; keep inline [n] cites if present.
    - Focus on flow: Background → Data/Methods → Findings → Limits → Implications → Next steps.
    - Be concise + memorable; avoid fluff.
    """
    # Input fields: the selected theses plus the key insight extracted from each RAG response.
    selected_theses: List[str] = dspy.InputField(desc="2–3 theses being defended.")
    key_insights: List[str] = dspy.InputField(desc="One-sentence key insights.")
    
    report_thesis: str = dspy.OutputField(
        desc="Exactly one headline-style line (8-14 words)."
    )
    writing_guideline: str = dspy.OutputField(
        desc="The proposed writing guideline for the final report in bullet-point format, outlining the key points that the report should include."
    )

final_writing_guideline_proposal = dspy.Predict(FinalWritingGuidelineProposal)
final_writing_guideline_proposal_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: report title proposal requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

with dspy.context(lm=final_writing_guideline_proposal_lm):
    final_writing_guideline_proposal_response = (await final_writing_guideline_proposal.aforward(
        selected_theses=selected_theses,
        key_insights=[rr.key_insight for rr in rag_responses],
    ))

final_writing_thesis = final_writing_guideline_proposal_response.report_thesis
final_writing_guideline = final_writing_guideline_proposal_response.writing_guideline

with open("output/action_item_7_final_writing_guideline_proposal.json", "w") as f:
    json.dump({
        "report_thesis": final_writing_thesis,
        "writing_guideline": final_writing_guideline
    }, f, indent=2)

print(f"report title: {final_writing_thesis}\n\n")
print(f"writing guideline: {final_writing_guideline}")
print(f"✅ Result saved to output/action_item_7_final_writing_guideline_proposal.json")


report title: Russian Air Force Drone Campaigns 2022–2024: Expanding Reach or Illusory Overstatement?


writing guideline: - Background
  - Briefly summarize the claimed broadening of Russian Air Force unmanned operations from 2022–2024, highlighting reference to campaigns in Ukraine and purported transnational theaters.
  - Note the emergence of attrition- and disruption-focused warfare and the establishment of Russia’s Unmanned Systems Forces (VBS) as key doctrinal developments [1][2].
- Data and Methods
  - Identify principal datasets and sources used (Ukrainian Air Force, CSIS, ACLED, ISW, UN, Saratoga Foundation, local Syrian media, SIPRI, Jane’s, etc.).
  - Outline methods/coding practices underlying casualty, strike, and theater reporting; note ways drone/missile incidents are counted and geographic assignments are made [1][2][3][4][5][6][8].
- Key Findings
  - Confirm robust, multi-source evidence of large-scale Russian Air Force remote strike campaigns within Ukraine, with hig

In [36]:
def _normalize_rag_response_citation_indices(rag_responses: List[RagResponse]) -> Tuple[List[str], List[RetrievedDocument]]:
        """
        Normalize citation indices across multiple RAG (retrieval-augmented generation) responses.

        Each `RagResponse` contains:
        - `answer`: a string with inline citations like [1], [2], ...
        - `cited_documents`: the list of documents those citations refer to

        Problem:
        Citation indices restart at [1] for every response, but when combining answers,
        we want all citations to point to a single global list of retrieved documents.

        What this function does:
        1. Iterates over all RAG responses in order.
        2. Shifts the local citation indices in each answer so that they correctly map
            into the combined list of all retrieved documents.
            - For example, if the first response cited 3 docs ([1], [2], [3]),
            then the second response’s citations start at [4], not [1].
        3. Prefixes each updated answer with its corresponding sub-question for clarity.
        4. Returns:
            - A list of normalized answers (with corrected citation indices).
            - The flattened list of all retrieved documents in the proper order.

        Example:
            Input (two RAG responses):
                R1: "Paris is in France [1].", docs=[docA]
                R2: "Berlin is in Germany [1].", docs=[docB]

            Output:
                answers = [
                "Sub-question: ...\nAnswer: Paris is in France [1].",
                "Sub-question: ...\nAnswer: Berlin is in Germany [2]."
                ]
                documents = [docA, docB]
        """
        all_documents: List[RetrievedDocument] = []
        all_updated_answers: List[str] = []
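        for rag_response in rag_responses:
            # The documents collected so far determine where this response's citations start.
            citation_offset = len(all_documents)
            updated_answer = rag_response.answer
            # Pass 1: rewrite each local index to a temporary placeholder so that an already
            # shifted index (e.g. [1] -> [2]) is not shifted again when [2] is processed.
            for i in range(len(rag_response.cited_documents)):
                updated_answer = updated_answer.replace(f"[{i+1}]", f"[tmp_{citation_offset+i+1}]")
            # Pass 2: turn the placeholders into the final global indices.
            for i in range(len(rag_response.cited_documents)):
                updated_answer = updated_answer.replace(f"[tmp_{citation_offset+i+1}]", f"[{citation_offset+i+1}]")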

            all_updated_answers.append(
                f"Sub-question: {rag_response.question}\nAnswer: {updated_answer}\n")
            all_documents.extend(rag_response.cited_documents)
        return all_updated_answers, all_documents
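
The offsetting described in the docstring can be checked on toy data. The sketch below is an illustration only: `ToyDoc`, `ToyRagResponse`, `shift_citations`, and `normalize_toy` are hypothetical stand-ins, not the real `src.dataclass` types, and it uses a single regex pass instead of the two-step placeholder replacement above.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyDoc:  # hypothetical stand-in for RetrievedDocument
    title: str


@dataclass
class ToyRagResponse:  # hypothetical stand-in for RagResponse
    question: str
    answer: str
    cited_documents: List[ToyDoc] = field(default_factory=list)


def shift_citations(answer: str, offset: int, n_docs: int) -> str:
    """Shift each local citation [k] (1 <= k <= n_docs) to [k + offset]."""
    def _shift(match: re.Match) -> str:
        k = int(match.group(1))
        # Leave bracketed numbers outside the local range untouched.
        return f"[{k + offset}]" if 1 <= k <= n_docs else match.group(0)

    return re.sub(r"\[(\d+)\]", _shift, answer)


def normalize_toy(responses: List[ToyRagResponse]) -> Tuple[List[str], List[ToyDoc]]:
    answers: List[str] = []
    documents: List[ToyDoc] = []
    for response in responses:
        offset = len(documents)  # number of documents already in the global list
        shifted = shift_citations(response.answer, offset, len(response.cited_documents))
        answers.append(f"Sub-question: {response.question}\nAnswer: {shifted}\n")
        documents.extend(response.cited_documents)
    return answers, documents


toy_responses = [
    ToyRagResponse("Where is Paris?", "Paris is in France [1][2].", [ToyDoc("docA"), ToyDoc("docB")]),
    ToyRagResponse("Where is Berlin?", "Berlin is in Germany [1].", [ToyDoc("docC")]),
]
toy_answers, toy_documents = normalize_toy(toy_responses)
print(toy_answers[1])
```

Because each match is rewritten exactly once, the regex variant needs no temporary placeholder. Both approaches assume citations appear as separate `[n]` tokens rather than grouped forms such as `[1, 2]`.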

In [44]:
# Final report synthesis

# We will synthesize the final report using generated thesis, guideline, and rag responses.

class FinalReportSynthesizer(dspy.Signature):
    """
    You are an investigative journalist composing a report from a given thesis, a writing guideline, and useful information gathered in a previous literature search.

    CONTENT INTEGRATION RULES:
    - Merge all relevant sub-question answers into a logically coherent narrative
    - Create clear thematic sections with smooth transitions between topics. Use #, ##, ###, etc. for section and sub-section titles.
    - Eliminate redundancy while preserving all unique factual content
    - Exclude sub-questions/answers that don't contribute meaningfully to the survey topic
    - Maintain completeness - no loss of relevant information from source material
    - No title, conclusion, summary, or reference at the end of the answer.
    
    WRITING STYLE:
    - Please write in PARAGRAPH format ONLY.

    CITATION PRESERVATION:
    - Preserve ALL original citations exactly as provided - no format modifications
    - Attach [n] immediately after the fact/number it supports, not at end of sentence.
    - Do not invent citations; do not merge or collapse them.
    - Please provide numerical citations that are in ASCENDING order.
    - DO NOT use URLs in the text; use only [n] citations.
    - Don't overcite: if there are multiple sources to attribute, only use ones that haven't been
    mentioned in the past paragraph or two.
    

    CONTENT CONSTRAINTS:
     -  Constrain the content to the provided information, and do not add any external knowledge. Also, do not speculate.
        
        """
    report_thesis: str = dspy.InputField(
        desc="The proposed thesis for the investigative journalism report"
    )
    writing_guideline: str = dspy.InputField(
        desc="The proposed writing guideline for the final report in bullet point format"
    )
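    gathered_information: str = dspy.InputField(
        desc="""Complete set of sub-question answers with their inline citations from previous research steps. 
        Format typically includes:
        - Sub-question: [question text]
        - Answer: [detailed response with inline citations [1], [2], etc.]
        - (Repeated for multiple sub-questions)"""
    )
    # Declared so that the `report_style` argument passed at call time is part of the signature.
    report_style: str = dspy.InputField(
        desc="The desired style and level of detail for the final report"
    )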

    final_report: str = dspy.OutputField(
        desc="The final investigative report in markdown format"
    )

final_report_synthesizer = dspy.Predict(FinalReportSynthesizer)
final_report_synthesizer_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-5-mini", # NOTE: final report synthesis consolidates information from a very long context and requires strong reasoning ability. A smaller model such as gpt-5-mini (used here) or gpt-4.1 can be substituted if a larger model takes too long.
    temperature=1.0,
    max_tokens=20000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

# TODO: complete the input fields. Read the function doc to understand what the function does.
all_updated_answers, all_documents = _normalize_rag_response_citation_indices(rag_responses)
gathered_information = "\n".join(all_updated_answers)

with dspy.context(lm=final_report_synthesizer_lm):
    final_report = (await final_report_synthesizer.aforward(
        report_thesis=final_writing_thesis,
        writing_guideline=final_writing_guideline,
        gathered_information=gathered_information, 
        report_style="Comprehensive, highly accurate, and exhaustive; include every relevant detail and ensure no important information is omitted."
    )).final_report

with open("output/action_item_7_final_report_raw.md", "w") as f:
    f.write(final_report)
print(f"✅ Result saved to output/action_item_7_final_report_raw.md")

✅ Result saved to output/action_item_7_final_report_raw.md


In [45]:
# TODO: manually add bibliography to the final report. Review the output. Adjust the prompt and rerun the report synthesis if necessary. Make sure it has title, executive summary, sections with desired inline citations, and bibliography.

bibliography = "# Bibliography \n\n"
for index, doc in enumerate(all_documents, 1):
    bibliography += f"{index}. **{doc.title}**. Available at: {doc.url}\n"

final_report_with_bibliography = final_report + "\n\n" + bibliography

with open("output/action_item_7_final_report.md", "w") as f:
    f.write(final_report_with_bibliography)
print(f"✅ Result saved to output/action_item_7_final_report.md")

✅ Result saved to output/action_item_7_final_report.md
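
Part of reviewing the output can be automated. The sketch below is a minimal check under stated assumptions: `check_citations` is a hypothetical helper, not part of the assignment code, and it only recognizes citations written as separate `[n]` tokens. It flags citation indices with no bibliography entry, bibliography entries that are never cited, and raw URLs in the body.

```python
import re
from typing import Dict, List


def check_citations(report: str, n_documents: int) -> Dict[str, List]:
    """Report citation problems in a markdown report body."""
    cited = sorted({int(k) for k in re.findall(r"\[(\d+)\]", report)})
    return {
        # Indices that do not map to any bibliography entry.
        "out_of_range": [k for k in cited if not 1 <= k <= n_documents],
        # Bibliography entries that are never cited in the body.
        "uncited": [k for k in range(1, n_documents + 1) if k not in cited],
        # Raw URLs, which the synthesis prompt asks the model to avoid.
        "raw_urls": re.findall(r"https?://[^\s)\]]+", report),
    }


sample_report = (
    "Strikes increased in 2023 [1], see https://example.com/report for details. "
    "A later claim cites a missing source [4]."
)
issues = check_citations(sample_report, n_documents=3)
print(issues)
```

In this notebook the call would be `check_citations(final_report, len(all_documents))`, run on the report body rather than on `final_report_with_bibliography`, since the bibliography itself contains URLs.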


**Review the generated report. Reflect on its weaknesses and explain in detail how you would improve it.**

In [47]:
weaknesses_and_improvements = """
Weaknesses:
- Citations were unreliable: the generated report sometimes included URLs despite being explicitly instructed not to, or did not
follow the citation format correctly.
- The report does not use nested sections. It only has top-level sections (such as Methods) with no subsections
within them, which is out of line with what we see in modern literature.
- The report does not use explicit statistics from our database crawls. It gestures at "increases" but gives no
concrete numbers, which indicates we might not be properly leveraging our database insights.


Improvements:
- Make the citation instructions more explicit so that the system produces citations that better reflect our standards.
- Define how the report should be divided further so that it has the desired subsections.
- Have the report use specific statistics from our database crawls, perhaps through better use of input fields for this data.

"""

with open("output/action_item_7_weaknesses_and_improvements.md", "w") as f:
    f.write(weaknesses_and_improvements)
print(f"✅ Result saved to output/action_item_7_weaknesses_and_improvements.md")


✅ Result saved to output/action_item_7_weaknesses_and_improvements.md
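
The missing-subsection weakness noted above can be measured rather than judged by eye. The following is a rough sketch, assuming the report uses `#`-style markdown headings; `heading_outline` is a hypothetical helper, not part of the assignment code. It extracts each heading with its level so that a report with only top-level sections is easy to spot.

```python
import re
from collections import Counter
from typing import List, Tuple


def heading_outline(markdown_text: str) -> List[Tuple[int, str]]:
    """Return (level, title) for each #-style heading, skipping fenced code blocks."""
    outline: List[Tuple[int, str]] = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence delimiter
            continue
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if match and not in_fence:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


sample = "# Background\nText.\n# Key Findings\n## Strikes in Ukraine\nMore text.\n"
outline = heading_outline(sample)
levels = Counter(level for level, _ in outline)
print(outline, dict(levels))
```

A `levels` counter with entries only at level 1 would confirm that the report has no subsections, which could then prompt a rerun with stricter structural instructions.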


## Create submission

In [51]:
! python create_submission.py

CS224V HW1 Submission Creator
🔄 Converting notebook to PDF: notebook.ipynb -> notebook.pdf
❌ Error converting notebook to PDF: Command '['jupyter', 'nbconvert', '--to', 'pdf', '--output', 'notebook.pdf', 'notebook.ipynb']' returned non-zero exit status 1.
stderr: [NbConvertApp] Converting notebook notebook.ipynb to pdf
[NbConvertApp] ERROR | Error while converting 'notebook.ipynb'
Traceback (most recent call last):
  File "/Users/jrizo/miniconda3/envs/cs224v_hw1/lib/python3.11/site-packages/nbconvert/nbconvertapp.py", line 487, in export_single_notebook
    output, resources = self.exporter.from_filename(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrizo/miniconda3/envs/cs224v_hw1/lib/python3.11/site-packages/nbconvert/exporters/templateexporter.py", line 390, in from_filename
    return super().from_filename(filename, resources, **kw)  # type:ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrizo/miniconda3/envs/cs

## Appendix

### Database Exploration Agent Implementation

> **Note:** This implementation is provided for reference purposes. Utilize the pre-computed results in Action Item 5 for efficient completion of the assignment.

---

In [None]:
# API endpoint URL
api_url = "https://cs224v-database-agent.genie.stanford.edu/database-exploration"

# Prepare the request payload
payload = {
    "topic": TOPIC,
    "seed_questions": seed_questions,
    "lm_config": database_exploration_lm_config.to_dict()
}

# NOTE: uncomment the code below to make the request

# print("Making request to database exploration endpoint... Might take up to 30 minutes")
# async with httpx.AsyncClient(timeout=6000.0) as client:
#     r = await client.post(api_url, json=payload)
#     r.raise_for_status()
#     database_exploration_response = r.json()

# with open("output/action_item_5_database_exploration.json", "w") as f:
#     json.dump(database_exploration_response, f, indent=2)

# print(f"✅ Result saved to output/action_item_5_database_exploration.json")
