## Step 2: Scouting Research Candidates

Decide whether or not to pursue each research item.

In [1]:
import sys
sys.path.append('../')

import pandas as pd
from datetime import datetime

import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

from components.agents.scout_agent import scout_batch

from data.content_saver import ContentSaver
content_saver = ContentSaver(base_path="../data")

2025-12-26 23:32:27,038 - data.content_saver - INFO - Loaded content index with 150 entries


Load Research Items

In [2]:
df = pd.read_csv("../data/research_items.csv")

provider_counts = df['provider'].value_counts()
print("Absolute counts:")
print(provider_counts)
print("\nNormalized (proportions):")
print(df['provider'].value_counts(normalize=True))

Absolute counts:
provider
arxiv        481
openai       359
anthropic    321
exa          195
Name: count, dtype: int64

Normalized (proportions):
provider
arxiv        0.354720
openai       0.264749
anthropic    0.236726
exa          0.143805
Name: proportion, dtype: float64


Select items that have not yet been looked at

In [3]:
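# Keep only rows without a scout decision; on a first run the column does not exist yet,
# so every row is pending.
pending = df[df["scout_decision"].isna()] if "scout_decision" in df.columns else df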

In [4]:
# Collect scout items
items_to_scout = pending.to_dict("records")
print(f"Number of items to scout: {len(items_to_scout)}")

Number of items to scout: 1356
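The pending-item selection above can be wrapped in a small helper that behaves the same whether or not the CSV already has a `scout_decision` column. This is a minimal sketch; `select_pending` is a hypothetical name and is not part of the project.

```python
import pandas as pd


def select_pending(df: pd.DataFrame, column: str = "scout_decision") -> pd.DataFrame:
    """Return the rows that have no scout decision yet (hypothetical helper)."""
    if column not in df.columns:
        # First run: nothing has been scouted, so every row is pending.
        return df.copy()
    # Later runs: keep only rows where the decision is still missing.
    return df[df[column].isna()].copy()


example = pd.DataFrame(
    {"title": ["a", "b", "c"], "scout_decision": ["pursue", None, None]}
)
pending_example = select_pending(example)
print(len(pending_example))
```

Returning a copy in both branches keeps later writes to the main DataFrame from being confused with writes to the pending view.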


Run Scouting

In [5]:
decisions = await scout_batch(items_to_scout[:10])

Scout agent triage:   0%|          | 0/10 [00:00<?, ?it/s]

2025-12-26 23:32:46,070 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Scout agent triage:  10%|█         | 1/10 [00:06<00:57,  6.40s/it]2025-12-26 23:32:47,728 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Scout agent triage:  20%|██        | 2/10 [00:07<00:28,  3.58s/it]2025-12-26 23:32:49,349 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Scout agent triage:  30%|███       | 3/10 [00:09<00:18,  2.68s/it]2025-12-26 23:32:49,553 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Scout agent triage:  40%|████      | 4/10 [00:09<00:10,  1.70s/it]2025-12-26 23:32:50,461 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Scout agent triage:  50%|█████     | 5/10 [00:10<00:07,  1.42s/it]2025-12-26 23:32:50,741 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTT
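`scout_batch` is imported from the project and its internals are not shown in this notebook. As a rough illustration only, a batch triage call might bound concurrency with a semaphore and return decisions in input order. Everything below is an assumption: `ScoutDecision`, `scout_one`, and `scout_batch_sketch` are hypothetical names, and `scout_one` is a stub that makes no model call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ScoutDecision:
    # Field names mirror the attributes read later in this notebook.
    pursue: bool
    confidence: float
    reasoning: str


async def scout_one(item: dict) -> ScoutDecision:
    """Hypothetical stand-in for a model call; flags placeholder titles."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    is_placeholder = item.get("title", "").startswith("dummy")
    return ScoutDecision(
        pursue=not is_placeholder,
        confidence=0.9,
        reasoning="placeholder title" if is_placeholder else "looks substantive",
    )


async def scout_batch_sketch(items: list[dict], max_concurrency: int = 5) -> list[ScoutDecision]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item: dict) -> ScoutDecision:
        # At most max_concurrency items are in flight at once.
        async with semaphore:
            return await scout_one(item)

    # gather preserves input order, so result i corresponds to items[i].
    return await asyncio.gather(*(bounded(item) for item in items))


sample_items = [{"title": "dummy1"}, {"title": "Introducing GPT-5.2"}]
sample_decisions = asyncio.run(scout_batch_sketch(sample_items))
```

Inside a notebook, where an event loop is already running, the last line would be `await scout_batch_sketch(sample_items)` instead of `asyncio.run(...)`.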

Update DataFrame

In [6]:
# Write each decision back to its source row.
# decisions[i] corresponds to the i-th pending row, matching items_to_scout[:10].
for idx, decision in zip(pending.index[:10], decisions):
    df.loc[idx, "scout_decision"] = "pursue" if decision.pursue else "discard"
    df.loc[idx, "scout_confidence"] = decision.confidence
    df.loc[idx, "scout_reasoning"] = decision.reasoning
    df.loc[idx, "scouted_at"] = datetime.now().isoformat()
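The write-back above relies on the decisions list lining up one-to-one with the scouted rows. A minimal sketch of the same update with an explicit length check, assuming a hypothetical `Decision` stand-in with the three attributes used in this notebook (`apply_decisions` is also a hypothetical name):

```python
from dataclasses import dataclass
from datetime import datetime

import pandas as pd


@dataclass
class Decision:
    # Hypothetical stand-in for the objects returned by scout_batch.
    pursue: bool
    confidence: float
    reasoning: str


def apply_decisions(df: pd.DataFrame, index, decisions) -> pd.DataFrame:
    """Write decisions onto df in place, one per index label, in order."""
    if len(index) != len(decisions):
        # A mismatch would silently shift decisions onto the wrong rows.
        raise ValueError(f"{len(index)} rows but {len(decisions)} decisions")
    for idx, decision in zip(index, decisions):
        df.loc[idx, "scout_decision"] = "pursue" if decision.pursue else "discard"
        df.loc[idx, "scout_confidence"] = decision.confidence
        df.loc[idx, "scout_reasoning"] = decision.reasoning
        df.loc[idx, "scouted_at"] = datetime.now().isoformat()
    return df


demo = pd.DataFrame({"title": ["dummy1", "Introducing GPT-5.2", "unscouted"]})
apply_decisions(
    demo,
    demo.index[:2],
    [Decision(False, 0.95, "placeholder"), Decision(True, 0.93, "first-party release")],
)
print(demo["scout_decision"].value_counts())
```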

In [8]:
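# Count on df rather than pending: on later runs pending is a filtered copy
# and would not reflect the updates written to df.
df['scout_decision'].value_counts()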

scout_decision
pursue     8
discard    2
Name: count, dtype: int64

In [7]:
pending

Unnamed: 0,focus_area,provider,url,title,source,published,relevance,date_added,scout_decision,scout_confidence,scout_reasoning,scouted_at
0,dummy1,openai,https://dummy1.com,dummy1,OpenAI Blog,2025-12-01,dummy example 1,2025-12-20,discard,0.95,The title and summary are clearly placeholders...,2025-12-26T23:32:59.839722
1,dummy2,anthropic,https://dummy2.com,dummy2,arXiv,2025-12-15,dummy example 2,2025-12-21,discard,0.93,The title and summary are placeholders (“dummy...,2025-12-26T23:32:59.840667
2,reasoning_agent,openai,https://openai.com/index/gpt-5-2-codex,Introducing GPT-5.2-Codex,OpenAI blog,2025-12-18,Official release of an agentic coding model em...,2025-12-24,pursue,0.93,"This is a very recent, first-party OpenAI rele...",2025-12-26T23:32:59.841221
3,reasoning_agent,openai,https://openai.com/index/introducing-gpt-5-2/,Introducing GPT-5.2,OpenAI blog,2025-12-11,Details GPT-5.2 “Thinking/Pro” modes and API r...,2025-12-24,pursue,0.93,This is an official OpenAI release (very recen...,2025-12-26T23:32:59.841649
4,reasoning_agent,openai,https://blog.google/products/gemini/gemini-3/,Introducing Gemini 3: our most intelligent mod...,Google Blog (Gemini/DeepMind),2025-11-18,Announces Gemini 3 with “thinking”/Deep Think ...,2025-12-24,pursue,0.86,"This is a major, very recent (Nov 18, 2025) fl...",2025-12-26T23:32:59.842074
...,...,...,...,...,...,...,...,...,...,...,...,...
1351,arxiv,arxiv,https://arxiv.org/abs/2512.20715v1,SoK: Speedy Secure Finality,arXiv,2025-12-23,Summary: While Ethereum has successfully achie...,2025-12-25,,,,
1352,arxiv,arxiv,https://arxiv.org/abs/2512.20712v1,Real-World Adversarial Attacks on RF-Based Dro...,arXiv,2025-12-23,Summary: Radio frequency (RF) based systems ar...,2025-12-25,,,,
1353,arxiv,arxiv,https://arxiv.org/abs/2512.20610v2,FedPOD: the deployable units of training for f...,arXiv,2025-12-23,"Summary: This paper proposes FedPOD, which ran...",2025-12-25,,,,
1354,arxiv,arxiv,https://arxiv.org/abs/2512.20605v2,Emergent temporal abstractions in autoregressi...,arXiv,2025-12-23,Summary: Large-scale autoregressive models pre...,2025-12-25,,,,


Save back to CSV

In [None]:
#df.to_csv("../data/research_items.csv", index=False)

To test:
1. After the new columns have been added, can a research run still save to the file after dedup?
2. Work on the scouting prompt.
3. Re-read the file after some scouting columns are already filled.
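The first and third checks above can be exercised on a toy DataFrame before touching the real file. This is a sketch under stated assumptions: it deduplicates on `url` with `keep="first"`, which may differ from the project's actual dedup logic, and it uses an in-memory buffer with made-up example URLs in place of `research_items.csv`.

```python
import io

import pandas as pd

# A previously scouted file: one row decided, one still pending.
scouted = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b"],
        "title": ["A", "B"],
        "scout_decision": ["pursue", None],
    }
)

# Round-trip through CSV, as a save followed by a re-read would.
buffer = io.StringIO()
scouted.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# A new research run finds one known URL and one new URL, without scout columns.
new_items = pd.DataFrame(
    {"url": ["https://example.com/b", "https://example.com/c"], "title": ["B", "C"]}
)

# keep="first" retains the existing row, so earlier scout results survive dedup.
merged = (
    pd.concat([reloaded, new_items], ignore_index=True)
    .drop_duplicates(subset="url", keep="first")
    .reset_index(drop=True)
)

still_pending = merged[merged["scout_decision"].isna()]
print(merged)
```

If the project's dedup keeps the last occurrence instead, the scouted row for a repeated URL would be replaced by the unscouted one, which is the failure the first check is meant to catch.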