A unified research-paper pipeline: ingest PDFs → discuss papers with an AI agent swarm → generate runnable code repos.
```
PDF
 └─[ingest]→ cleaned JSON
       ├─[discuss]→ expert Q&A (agent swarm)
       └─[codegen]→ generated code repository
```
```
generalresearch/
├── run.py                  # Master CLI (ingest / discuss / codegen)
├── Makefile                # Shorthand commands
├── requirements.txt        # Combined Python dependencies
│
├── papers/                 # Input PDFs (4 pre-loaded test papers)
│   ├── bert.pdf
│   ├── attention_is_all_you_need.pdf
│   ├── vision_transformer.pdf
│   └── og_attention.pdf
│
├── data/
│   ├── raw_json/           # Raw S2ORC JSON output from Grobid
│   └── cleaned_json/       # Cleaned JSONs consumed by discuss + codegen
│       └── BERT_cleaned.json  # Pre-processed BERT (ready to use immediately)
│
├── outputs/                # Generated code repos land here
│
├── ingestion/              # PDF → S2ORC JSON (s2orc-doc2json)
│   ├── doc2json/           # Core conversion library
│   ├── setup.py            # Install with: pip install -e ingestion/
│   └── scripts/
│       ├── setup_grobid.sh # Download Grobid (one-time)
│       └── ingest_pdf.sh   # Internal script called by run.py ingest
│
├── paper2code/             # JSON → code generation pipeline
│   ├── codes/              # Pipeline stages (0_pdf_process through 4_debugging)
│   └── prompts/            # LLM prompt templates
│
└── agentswarm/             # Expert agent Q&A over papers
    ├── cli.py, orchestrator.py, expert.py
    ├── retriever.py, blackboard.py
    ├── llm.py, paper_loader.py
    └── __init__.py
```
```bash
pip install -r requirements.txt
pip install -e ingestion/   # installs the doc2json package
```

```bash
export OPENROUTER_API_KEY="sk-or-..."   # required for discuss + codegen
# OR for OpenAI directly:
export OPENAI_API_KEY="sk-..."
```

```bash
python run.py discuss "What is masked language modeling?"
```

Or with Make:

```bash
make discuss QUESTION="What is masked language modeling?"
```

Ingestion requires Docker. Pull the image once:
```bash
make setup-grobid
# or: sudo docker pull grobid/grobid:0.9.0-crf
```

Start Grobid in a separate terminal (keep it running while ingesting):

```bash
make start-grobid
# or: sudo docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.9.0-crf
```

Then ingest any PDF:
```bash
python run.py ingest papers/bert.pdf
# or
make ingest PAPER=bert
```

Outputs:

- `data/raw_json/bert.json`
- `data/cleaned_json/bert_cleaned.json`
```bash
python run.py codegen data/cleaned_json/bert_cleaned.json
# or
make codegen JSON=data/cleaned_json/bert_cleaned.json
```

With a local vLLM backend instead of OpenRouter:

```bash
python run.py codegen data/cleaned_json/bert_cleaned.json --local
make codegen-local JSON=data/cleaned_json/bert_cleaned.json
```

The generated repo lands in `outputs/<paper_name>_repo/`.
Converts a PDF to a cleaned JSON file ready for discuss or codegen.
| Arg | Description |
|---|---|
| `pdf` | Path to input PDF |
| `-o` / `--output` | Output dir (default: `data/cleaned_json/`) |
Runs a moderated panel of LLM paper-expert agents that answer questions grounded in evidence from the papers.
| Arg | Description |
|---|---|
| `question` | The question to ask |
| `--papers` | One or more `*_cleaned.json` files (default: BERT) |
| `--max-agents` | Max experts per question (default: 5) |
| `--top-k` | Evidence chunks per expert (default: 4) |
| `--critique-rounds` | Rounds of cross-critique (default: 1) |
| `--model` | OpenRouter model override |
Runs the full paper→code generation pipeline: planning → analysis → coding.
| Arg | Description |
|---|---|
| `cleaned_json` | Path to `*_cleaned.json` |
| `--name` | Output name (default: JSON stem) |
| `--model` | Model ID override |
| `--local` | Use local vLLM backend (DeepSeek-Coder) |
```bash
# 1. One-time: pull Grobid Docker image
make setup-grobid

# 2. Start Grobid (in a separate terminal, keep it running)
make start-grobid

# 3. Ingest all test papers
make ingest-all

# 4. Ask the agent swarm a question across all ingested papers
python run.py discuss "How do transformers handle positional encoding?" \
    --papers data/cleaned_json/*.json \
    --critique-rounds 2

# 5. Generate code for the attention paper
make codegen JSON=data/cleaned_json/attention_is_all_you_need_cleaned.json
```

Ingestion sends the PDF to a Grobid server running in Docker (`grobid/grobid:0.9.0-crf`, port 8070), which parses it into TEI-XML. The `doc2json` library then converts the TEI-XML into a structured S2ORC JSON with body text, sections, citations, figures, and equations. `0_pdf_process.py` strips noisy metadata (cite spans, eq spans, etc.) to produce a cleaned JSON for downstream use.
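The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the actual `0_pdf_process.py` logic: the span field names (`cite_spans`, `eq_spans`, `ref_spans`) follow the S2ORC schema, but `strip_spans`, `clean_paper`, and the exact output shape are assumptions.

```python
import copy

# Span metadata the cleaner drops from each body-text block (S2ORC field names;
# which ones the real script removes is assumed here).
NOISY_KEYS = {"cite_spans", "eq_spans", "ref_spans"}

def strip_spans(paragraph: dict) -> dict:
    """Return a copy of an S2ORC body_text block without span metadata."""
    return {k: copy.deepcopy(v) for k, v in paragraph.items() if k not in NOISY_KEYS}

def clean_paper(s2orc: dict) -> dict:
    """Produce a *_cleaned.json-style dict from raw S2ORC output (illustrative)."""
    body = s2orc.get("pdf_parse", {}).get("body_text", [])
    return {
        "title": s2orc.get("title", ""),
        "body_text": [strip_spans(p) for p in body],
    }
```

The point is only that downstream stages see plain `section`/`text` blocks with the inline markers already removed.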
Each `*_cleaned.json` becomes one `PaperExpertAgent`. When you ask a question:
- Agents score relevance to the question (BM25 keyword matching).
- Selected agents retrieve top-k evidence chunks and compose answers grounded only in their paper.
- Agents critique each other's claims.
- An orchestrator synthesizes a final answer with consensus points, disagreements, and citations.
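The routing-and-retrieval loop above can be sketched like this. It is illustrative only: the real scoring is BM25 and the actual agent/orchestrator interfaces live in `agentswarm/`; here a plain keyword-overlap count stands in for BM25, and `Expert`, `select_experts`, and their signatures are invented for the sketch.

```python
from dataclasses import dataclass

def score(question: str, text: str) -> int:
    """Toy relevance score: shared lowercase tokens (a BM25 stand-in)."""
    q = set(question.lower().split())
    return sum(1 for w in text.lower().split() if w in q)

@dataclass
class Expert:
    name: str
    chunks: list  # paragraphs from one cleaned paper

    def relevance(self, question: str) -> int:
        return max((score(question, c) for c in self.chunks), default=0)

    def answer_evidence(self, question: str, top_k: int = 4) -> list:
        """Top-k chunks this expert would ground its answer in."""
        return sorted(self.chunks, key=lambda c: score(question, c), reverse=True)[:top_k]

def select_experts(experts: list, question: str, max_agents: int = 5) -> list:
    """Pick up to max_agents experts whose paper looks relevant at all."""
    ranked = sorted(experts, key=lambda e: e.relevance(question), reverse=True)
    return [e for e in ranked[:max_agents] if e.relevance(question) > 0]
```

Critique rounds and final synthesis then operate over the selected experts' grounded answers, which is where the orchestrator's consensus/disagreement summary comes from.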
A four-stage LLM pipeline:

- Planning — LLM reads the paper and generates an overview, software design, and task list.
- Config extraction — hyperparameters are pulled into `config.yaml`.
- Analysis — each file in the task list gets detailed logic specs.
- Coding — LLM generates each file with full context from prior files and specs.
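In skeleton form, the staged pipeline threads every earlier artifact into the next stage's prompt. This is a sketch under assumptions: `run_pipeline`, `call_llm`, and the stage names are illustrative, not the actual paper2code interfaces in `paper2code/codes/`.

```python
STAGES = ["planning", "config_extraction", "analysis", "coding"]

def run_pipeline(paper_text: str, call_llm) -> dict:
    """Run the four stages sequentially, feeding all prior stage outputs
    into each new prompt (the cumulative-context idea described above)."""
    artifacts = {}
    for stage in STAGES:
        context = "\n".join(f"[{k}]\n{v}" for k, v in artifacts.items())
        prompt = (
            f"Stage: {stage}\n"
            f"Paper:\n{paper_text}\n"
            f"Prior artifacts:\n{context}"
        )
        artifacts[stage] = call_llm(prompt)
    return artifacts
```

Because each stage sees everything produced before it, the coding stage can stay consistent with the plan, the extracted config, and the per-file specs.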
| Variable | Used by | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | discuss, codegen | OpenRouter API key |
| `OPENAI_API_KEY` | codegen (OpenAI path) | OpenAI API key |
| `GROBID_DIR` | ingest | Grobid install path (default: `~/grobid-0.7.3`) |
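How these variables might be resolved into a backend choice can be sketched as follows; the actual selection logic lives in the repo's own code (`run.py`, `agentswarm/llm.py`), so both the function and the precedence shown are assumptions.

```python
import os

def pick_backend(env=None) -> str:
    """Illustrative precedence: prefer OpenRouter, fall back to OpenAI."""
    env = os.environ if env is None else env
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError("Set OPENROUTER_API_KEY or OPENAI_API_KEY")
```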
LAHacks 2026, Qianheng (Hendry) Xu, Yuvraj Chaudhary, Eric Zheng