redink

Adversarial pre-submission paper red-teamer and dataset opportunity scout — one chat, two LangGraph flows. It finds citation hallucinations, statistical weaknesses, novelty gaps, and writing problems in a paper before a real reviewer does; and it scans dataset sources to score and catalog opportunities into a portable knowledge bundle.

Two flows, one REPL

uv run redink

	Review papers	Research datasets
do	`/review <path\|arxiv-url>`	`/scan <query>`
read	`/report` · `/rerun <dim>`	`/rank` · `/gaps` · `/spikes` · `/wiki <slug>`
graph	`redink`	`drl`

Configure everything from inside the chat with /config (or redink setup for a full wizard). After a review, type freely to ask questions about the findings.

See examples/ for a sample review and a sample OKF concept.

Paper review

Architecture

paper.md / GitHub URL / arXiv URL
   │
   ▼
 [fetch_paper]      arXiv via ar5iv (tables preserved as pipe rows; abstract-only
                    fetches are flagged so reviewers don't fault missing sections)
   │
   ▼
 [classify]         area · type · dimensions · citations · 5–8 technical claims
   │
   ▼  fan-out via Send — 3 personas × N dimensions
 [reviewer] × N     skeptic · practitioner · academic, different priors each
 [figure_reviewer]  Gemini Vision on ar5iv figures (cherry-picking, truncated axes)
   │
   ▼
 [debate]           dedup, then every CRITICAL faces a defender (argues the
                    author's side from the text) + a judge → sustained /
                    downgraded / dismissed. Kills criticals nobody can defend
                    against — and, more importantly, that nobody can uphold.
   │
   ▼
 [contradiction_map] · [blind_spot]
   │
   ▼
 [judge_panel]      3 lenses — rigor · contribution · era-appropriate standards —
                    calibrated against reference papers → PASS / REVISE / FAIL
   │
   ▼
 [synthesize]       verdict + findings, plus a self-contained interactive HTML report

Dimensions: citations · methodology · novelty · writing · statistics · reproducibility · ethics · figures. Each runs three independent personas in parallel; findings are semantically de-duplicated (two-pass: per-dimension then global) and weighted by cross-persona agreement.

Measured against real reviews

Most "AI reviewer" prompts collapse to reject everything — every real paper has flaws, so a judge grading against an implicit ideal fails them all. Almost no one checks. redink is measured against 300 ICLR papers with their real reviews and decisions (from ASAP-Review):

Findings recall ≈ 0.73 — it surfaces ~73% of the weaknesses human reviewers actually raised, at a ~0.2 noise rate.
The verdict discriminates. Anchoring the judge panel to reference papers cut the false-fail rate on accepted papers from 81% → 14%, and made the verdict track the human decision: rejected papers now FAIL 4× more often than accepted ones (54% vs 14%), where the baseline failed both alike.

The harness that produced — and proves — this lives in eval/:

Tool	Does
`collect_asap.py`	build the labeled set (papers + real reviews + decisions)
`overlap_metric.py`	findings recall / noise vs the human weaknesses
`rejudge.py` · `confirm_calibration.py`	cheap judge A/B over cached findings

The pipeline output is cached, so iterating on the verdict costs cents, not a full re-run — calibration changes are measured, never eyeballed.

Dataset research loop (`drl`)

A second graph that scans dataset sources, scores opportunity, and writes an Open Knowledge Format (OKF) bundle — a portable directory of markdown concepts you can git clone or open in Obsidian.

 [scan] × sources     fan-out: HuggingFace · Kaggle · OpenML
    │
    ▼
 [merge]              dedupe across sources
    │
    ▼
 [prescore]           rule-based quality gate (source-aware)
    │
    ▼
 [score] × datasets   LLM opportunity score 0–3, one per dataset
    │
    ▼
 [catalog]            write OKF concepts + rebuild index / log
    │
    ▼
 [digest]             run summary concept

Analysis reads the bundle's frontmatter at query time (no DB), exactly as the OKF spec intends:

Command	What it does
`/scan <query> [--sources hf,kaggle,openml] [--limit N]`	scan → score → write OKF concepts
`/rank [N]`	top datasets by opportunity PageRank over the tag-similarity graph
`/gaps [N]`	least-covered task categories
`/spikes [N]`	recently-active datasets (velocity proxy)
`/wiki <slug>`	print an OKF concept

Papers With Code was dropped — its API now redirects to Hugging Face. OpenML replaced it. Kaggle's list endpoint works anonymously (KAGGLE_* only raises limits).

Also usable one-shot / in cron: drl scan "...", drl rank, drl gaps, drl setup.

Setup

Requires Python ≥ 3.11 and uv.

git clone https://github.com/p7dotorg/redink
cd redink
uv sync

uv run redink setup      # interactive wizard: keys + models → .env
# or: cp .env.example .env  and set OPENROUTER_API_KEY

Usage

uv run redink                                   # interactive chat (both flows)
uv run redink my-paper.md                       # one-shot review (CI / pipe)
uv run redink https://arxiv.org/abs/1706.03762  # arXiv
cat paper.md | uv run redink -                  # stdin

One-shot prints the report to stdout, saves <paper>.review.md, and writes an interactive <paper>.annotated.html.

Models & config

Every model and key is configurable via /config papers|datasets, redink setup, or .env. All calls route through OpenRouter and are capped (max_tokens) to avoid runaway credit reservations.

Role	Env var	Default
Classify	`CLASSIFY_MODEL`	`openai/gpt-4o-mini`
Reviewer / defender	`REVIEWER_MODEL`	`deepseek/deepseek-v4-flash`
Tool calls (citations/novelty)	`TOOL_MODEL`	`openai/gpt-4o-mini`
Figure review	`FIGURE_MODEL`	`google/gemini-2.5-flash`
Structured output / dedup	`STRUCTURED_MODEL`	`openai/gpt-4o-mini`
Synthesis prose	`SYNTHESIZE_MODEL`	`deepseek/deepseek-v4-flash`
Judge panel + rebuttal	`JUDGE_MODEL`	`openai/gpt-4o`
Dataset scorer	`DRL_SCORE_MODEL`	`openai/gpt-4o-mini`

Estimated cost per review: ~$0.10 — mostly the gpt-4o judge panel. Set JUDGE_MODEL=gpt-4o-mini for ~$0.03.

How it works — details

Citation verification. Only the skeptic persona makes web requests (all three in parallel would exhaust the Semantic Scholar rate limit). Tools: search_papers (Semantic Scholar, cross-disciplinary), get_paper (arXiv abstract), verify_doi (Crossref). A finding's evidence quote is checked against the paper text — an unverifiable quote drops the finding to minor, killing hallucinations.

Novelty search. The classify node extracts 5–8 subject+verb+object claims that become specific arXiv queries. Results published after the paper under review are filtered out in code — no more 2024 papers cited as prior work for a 2017 paper.

Fetch & truncation. arXiv is fetched via ar5iv with <table> preserved as pipe rows and <math> as LaTeX. Reviewers get up to 60k chars with an explicit excerpt notice, so "missing" sections in the omitted tail are never reported as flaws. Abstract-only renders (ar5iv failures) are detected and flagged.

Data & privacy

redink runs locally, but reviewing a paper sends parts of it to third parties. If your paper is unpublished or confidential, know what leaves your machine:

Goes out	To	What
LLM calls	OpenRouter → the chosen provider (OpenAI, DeepSeek, Google, …)	paper excerpts, findings, prompts
Citation / novelty tools	Semantic Scholar · arXiv · Crossref	search queries derived from your claims and references
Figures	ar5iv (fetch) + the vision model via OpenRouter	figure images + captions
Dataset scans (`drl`)	HuggingFace · Kaggle · OpenML	your search query only — no paper text
Tracing (only if `LANGSMITH_TRACING=true`)	LangSmith	full run traces, including paper text

redink itself stores nothing beyond local files — the *.review.md / *.annotated.html report and the OKF bundle/. It does not phone home. For sensitive work, point the models at a self-hosted / private OpenRouter setup and keep LANGSMITH_TRACING=false (the default).

See SECURITY.md to report a vulnerability.

Part of p7dotorg. · redink.sh

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
assets		assets
cli		cli
core		core
eval		eval
examples		examples
paper		paper
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md
SECURITY.md		SECURITY.md
langgraph.json		langgraph.json
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

redink

Two flows, one REPL

Paper review

Architecture

Measured against real reviews

Dataset research loop (`drl`)

Setup

Usage

Models & config

How it works — details

Data & privacy

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

redink

Two flows, one REPL

Paper review

Architecture

Measured against real reviews

Dataset research loop (drl)

Setup

Usage

Models & config

How it works — details

Data & privacy

About

Topics

Resources

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Dataset research loop (`drl`)

Packages