SCOUT

SCOUT is a tool I built to automate part of my literature review workflow as an academic economist. It retrieves recent papers from arXiv and NBER, ranks them against my research profile using semantic embeddings, summarizes the top hits with an LLM, and packages everything into an HTML digest.

This is a personal tool, not a production system. It works well for my workflow but hasn't been hardened for general use.

What it does

  1. Retrieve — Pulls recent papers from arXiv (API) and NBER (web scraping) based on configurable keywords and lookback windows.
  2. Rank — Computes semantic similarity between retrieved papers and a research profile (stated interests + embeddings of your own uploaded papers). A configurable paper_weight blends the two signals; see the sketch after this list.
  3. Summarize — Sends the top-ranked papers to an LLM (OpenAI, Claude, or Gemini) for structured summaries tailored to the user's interests.
  4. Digest — Generates a styled HTML digest with relevance scores, explanations, and links. Optionally sends it by email.
  5. Feedback loop — Users rate recommendations; a Thompson Sampling bandit adjusts the paper_weight parameter over time to improve relevance.
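
The blend in step 2 is, in spirit, a weighted average of two cosine similarities. Here is a minimal sketch of the idea; the function names and the exact blending formula are illustrative, not necessarily what relevance_ranking.py does:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def blended_score(candidate: np.ndarray,
                  interests: np.ndarray,
                  own_papers: list[np.ndarray],
                  paper_weight: float) -> float:
    """Blend interest similarity with similarity to uploaded papers.

    paper_weight = 0 ranks on stated interests alone;
    paper_weight = 1 ranks on the uploaded-paper signal alone.
    """
    interest_sim = cosine(candidate, interests)
    paper_sim = max(cosine(candidate, p) for p in own_papers) if own_papers else 0.0
    return (1 - paper_weight) * interest_sim + paper_weight * paper_sim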

Architecture

┌──────────────┐     ┌──────────────┐     ┌───────────────┐
│  Retrieval   │────▸│   Ranking    │────▸│ Summarization │
│ (arXiv/NBER) │     │ (embeddings) │     │  (LLM call)   │
└──────────────┘     └──────┬───────┘     └───────┬───────┘
                            │                     │
                     ┌──────▼─────────────────────▼──────┐
                     │        HTML Digest Builder        │
                     └──────────────┬────────────────────┘
                                    │
                     ┌──────────────▼────────────────────┐
                     │  Feedback → Parameter Optimizer   │
                     │    (Thompson Sampling bandit)     │
                     └───────────────────────────────────┘

Modules:

  • paper_retrieval.py — arXiv API + NBER scraper
  • relevance_ranking.py — OpenAI embeddings, cosine similarity, weighted scoring
  • paper_summarization.py — Multi-provider LLM summarization
  • llm_providers.py — Unified interface for OpenAI / Claude / Gemini
  • paper_processor.py — PDF text extraction, embedding generation for uploaded papers
  • parameter_optimizer.py — Thompson Sampling (Beta-bandit) for tuning ranking weights from user feedback
  • feedback_store.py — JSON-based feedback persistence
  • pdf_downloader.py — Downloads PDFs for top-ranked papers

Quickstart

# Clone and install
git clone https://github.com/matiasbayas/SCOUT.git
cd SCOUT
pip install -r requirements.txt

# Set up API keys (at minimum, OpenAI for embeddings)
export OPENAI_API_KEY="your-key"
# Optional: export ANTHROPIC_API_KEY="your-key" or GEMINI_API_KEY="your-key"

# Create a config file and edit research interests
python -m scout_agent.scout_agent --create-config
# Edit config.json: set your research_interests and preferred summarization provider

# Run
python -m scout_agent.scout_agent

The output is an HTML digest in the digests/ directory.

Configuration

Copy config.json.example to config.json and edit. Key settings:

Section      Key                                   What it controls
-----------  ------------------------------------  ---------------------------------------------
top-level    research_interests                    Topics for ranking
retriever    source, lookback_days, max_results    Where and how far back to search
ranker       paper_weight, similarity_threshold    Balance between interests and uploaded papers
summarizer   provider, temperature                 Which LLM to use
papers       enabled, use_for_ranking              Whether uploaded papers influence ranking
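
For illustration, a config shaped like the table above could be written as below; the values are placeholders, and config.json.example is the authoritative schema:

import json

# Placeholder values; copy config.json.example for the real schema.
config = {
    "research_interests": ["monetary policy", "heterogeneous-agent models"],
    "retriever": {"source": "arxiv", "lookback_days": 7, "max_results": 100},
    "ranker": {"paper_weight": 0.5, "similarity_threshold": 0.3},
    "summarizer": {"provider": "openai", "temperature": 0.3},
    "papers": {"enabled": True, "use_for_ranking": True},
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)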

API keys can be set via environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY), in config.json, or via CLI flags.
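
A plausible resolution order checks a CLI flag first, then the environment, then config.json. The sketch below assumes that ordering and a hypothetical "openai_api_key" config field; the precedence SCOUT actually implements is exercised in its tests:

import os

def resolve_api_key(cli_flag: str | None, config: dict) -> str | None:
    """First match wins: CLI flag, then environment, then config file."""
    # "openai_api_key" is a hypothetical config field name, for illustration only.
    return cli_flag or os.environ.get("OPENAI_API_KEY") or config.get("openai_api_key")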

Uploading your own papers

SCOUT can embed your papers and use them to personalize ranking:

# Single paper
python -m scout_agent.scout_agent --upload-paper paper.pdf --paper-title "My Paper"

# Directory of papers
python -m scout_agent.scout_agent --upload-dir ./my-papers/

# Upload and immediately run
python -m scout_agent.scout_agent --upload-paper paper.pdf --run-after-upload
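
Conceptually, the upload step extracts the PDF's text and embeds it. A rough sketch using pypdf and the OpenAI embeddings endpoint (paper_processor.py may differ in model choice and chunking):

from openai import OpenAI
from pypdf import PdfReader

def embed_pdf(path: str) -> list[float]:
    """Extract a PDF's text and return one embedding vector for it."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # Truncate crudely to respect the embedding model's input limit.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text[:30000])
    return resp.data[0].embedding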

Feedback and parameter tuning

After reviewing a digest, rate papers to improve future recommendations:

python -m scout_agent.scout_agent --feedback http://arxiv.org/abs/2401.00001v1 highly_relevant

Ratings: highly_relevant, somewhat_relevant, not_relevant.

To run a tuning session (presents papers and collects feedback interactively):

python -m scout_agent.scout_agent --tune --tune-iters 5

Under the hood, this uses Thompson Sampling on a Beta posterior to adjust the paper_weight parameter — a simple Bayesian bandit, not a reward model or policy optimization. It's a lightweight way to personalize the interest-vs-paper balance over time.
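
The mechanics are easy to sketch. Treat a handful of candidate paper_weight values as bandit arms, each with a Beta posterior over the probability that a digest built with that weight earns a positive rating. The discretization and update rule below are illustrative, not necessarily what parameter_optimizer.py does:

import random

# Candidate paper_weight values, each treated as a bandit arm with a Beta prior.
arms = {w: {"alpha": 1.0, "beta": 1.0} for w in (0.0, 0.25, 0.5, 0.75, 1.0)}

def choose_weight() -> float:
    """Thompson Sampling: draw from each arm's Beta posterior, pick the argmax."""
    return max(arms, key=lambda w: random.betavariate(arms[w]["alpha"], arms[w]["beta"]))

def record_feedback(weight: float, relevant: bool) -> None:
    """Update the sampled arm's posterior from a user rating."""
    key = "alpha" if relevant else "beta"
    arms[weight][key] += 1.0

# One tuning iteration: sample a weight, build a digest with it, record the rating.
w = choose_weight()
record_feedback(w, relevant=True)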

Limitations and known issues

  • Ranking depends on OpenAI embeddings — the ranking module requires an OpenAI API key even when using Claude or Gemini for summarization.
  • NBER retrieval uses Selenium — requires Chrome/ChromeDriver. arXiv-only mode works without it.
  • No CI or comprehensive test suite — tests cover core logic (feedback store, API key precedence, provider construction) but not end-to-end workflows.
  • Single-parameter tuning — the feedback loop only adjusts one weight. A richer approach would tune retrieval keywords, similarity thresholds, or summarization prompts.

License

MIT
