Context
After #176 the fetch chain is technically capable of pulling almost any public URL (DOI shortcut → httpx → curl_cffi → Byparr in #178). Nothing about the techniques is illegal, but the project is one step away from looking like an adversarial scraper to the sites it touches. This issue bundles four cheap mitigations that dramatically reduce legal/operational risk and are good engineering on their own merits.
None of these are blockers for #176 — they're follow-up work, sized to land together as a single "be a good citizen" PR.
1. robots.txt check (soft default, per-host opt-out)
Why: Ignoring robots.txt is the #1 thing that gets a scraper blocked, C&D'd, or reported. It's not legally binding in any jurisdiction, but honoring it is the strongest single signal of good-faith intent if a dispute ever arises.
What:
New RobotsCache (in-memory + Redis-backed, ~1h TTL per host) that fetches and parses robots.txt once per host.
FetchProviderRegistry.fetch() consults the cache before invoking any provider. On disallow, return a synthetic robots_disallowed FetchAttempt (same shape as the existing url_safety rejection) so the audit UI shows the reason cleanly.
Per-host opt-out via a new fetch_robots_ignore_hosts: list[str] setting in kt-config, for the rare case where a site's robots.txt is overly broad and we have a separate agreement.
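A minimal sketch of the cache and the allow/deny decision, assuming httpx and only the in-memory layer; the class name mirrors the issue, but every signature here is illustrative, and the Redis backing, cross-worker TTL sharing, and the fetch_robots_ignore_hosts bypass are left out.

```python
# Illustrative only: in-memory TTL cache keyed by host.
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

import httpx


class RobotsCache:
    def __init__(self, ttl_seconds: float = 3600.0, user_agent: str = "KnowledgeTree/0.x") -> None:
        self._ttl = ttl_seconds
        self._ua = user_agent
        self._cache: dict[str, tuple[float, RobotFileParser | None]] = {}

    async def is_allowed(self, url: str) -> bool:
        host = urlsplit(url).netloc
        entry = self._cache.get(host)
        if entry is None or time.monotonic() - entry[0] > self._ttl:
            entry = (time.monotonic(), await self._fetch_parser(url))
            self._cache[host] = entry
        parser = entry[1]
        # Missing or unreadable robots.txt is treated as "allowed".
        return True if parser is None else parser.can_fetch(self._ua, url)

    async def _fetch_parser(self, url: str) -> RobotFileParser | None:
        parts = urlsplit(url)
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
        try:
            async with httpx.AsyncClient(timeout=10.0) as client:
                resp = await client.get(robots_url)
        except httpx.HTTPError:
            return None
        if resp.status_code != 200:
            return None
        parser = RobotFileParser()
        parser.parse(resp.text.splitlines())
        return parser
```

FetchProviderRegistry.fetch() would await is_allowed(url) before running the chain and emit the synthetic robots_disallowed attempt when it returns False.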
Acceptance:
robots.txt fetched at most once per host per TTL window
Disallowed URLs short-circuit the chain with an audit entry
fetch_robots_ignore_hosts config bypasses the check
Tests cover: allowed, disallowed, missing robots.txt (treated as allowed), malformed robots.txt (treated as allowed + warning logged), per-host opt-out
2. Contact-email User-Agent (deferred until we have an official email)
Why: Crossref's "polite pool" pattern is the gold standard. A UA like KnowledgeTree/0.x (+https://knowledge-tree.example; mailto:abuse@...) tells site operators who to contact instead of just blocking, and is the cheapest way to look legitimate.
What:
The fetch_user_agent setting in kt-config already exists. Update the default once we have a real public URL and an abuse@ mailbox.
New fetch_contact_email setting; when set, automatically appended to fetch_user_agent in the format (+<url>; mailto:<email>).
Falls back gracefully when unset (today's behavior — generic browser-ish UA).
DOI provider already uses crossref_email / unpaywall_email for its API calls; consolidate so a single fetch_contact_email feeds all three (the UA plus both API identities) by default, with the per-API overrides still respected.
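A small sketch of the UA composition this describes; build_user_agent and the contact_url parameter are hypothetical names rather than existing kt-config fields, and the example values are placeholders.

```python
def build_user_agent(base_ua: str, contact_email: str | None, contact_url: str | None = None) -> str:
    """Append a Crossref-style contact hint to the base UA when an email is configured."""
    if not contact_email:
        return base_ua  # unset -> no behavior change vs today
    contact = f"+{contact_url}; mailto:{contact_email}" if contact_url else f"mailto:{contact_email}"
    return f"{base_ua} ({contact})"


# build_user_agent("KnowledgeTree/0.x", "abuse@knowledge-tree.example", "https://knowledge-tree.example")
# -> "KnowledgeTree/0.x (+https://knowledge-tree.example; mailto:abuse@knowledge-tree.example)"
```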
Acceptance:
fetch_contact_email setting added; appended to UA when set
DOI provider falls back to fetch_contact_email when crossref_email / unpaywall_email are unset
No-op when unset (no behavior change vs today)
Deferred sub-task: flip the default UA once we have an official public URL + abuse mailbox (separate small PR)
3. Per-host rate limiting
Why: The current chain has max_concurrent for total in-flight requests but no per-host throttle. Hammering a single domain is the fastest way to get blocked, blacklisted, or flagged as a DoS — and the synthesis pipelines absolutely will hit the same host repeatedly when expanding a node from a single source.
What:
Token-bucket rate limiter keyed by host, default ~1 req/sec per host (configurable via fetch_per_host_rps in kt-config).
Honors Crawl-delay from robots.txt when present (overrides the default if stricter).
Lives in the registry, applied uniformly across all providers — so a host that needs Byparr still gets throttled.
Backed by Redis when available (so multiple workers share the budget); in-memory fallback for tests / single-process dev.
Waits rather than fails when the bucket is empty — short bursts are fine.
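A minimal in-memory sketch of that token bucket, assuming the Redis-backed variant would expose the same acquire() interface; the class and parameter names are illustrative.

```python
import asyncio
import time
from collections import defaultdict


class PerHostRateLimiter:
    def __init__(self, default_rps: float = 1.0, burst: float = 1.0) -> None:
        self._default_rps = default_rps
        self._burst = burst
        self._state: dict[str, tuple[float, float]] = {}  # host -> (tokens, last refill time)
        self._locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def acquire(self, host: str, crawl_delay: float | None = None) -> None:
        """Block until a request to `host` is allowed; waits instead of failing."""
        # Honor robots.txt Crawl-delay when it is stricter than the default rate.
        rps = min(self._default_rps, 1.0 / crawl_delay) if crawl_delay else self._default_rps
        async with self._locks[host]:
            tokens, last = self._state.get(host, (self._burst, time.monotonic()))
            now = time.monotonic()
            tokens = min(self._burst, tokens + (now - last) * rps)
            if tokens < 1.0:
                # Sleep just long enough to accumulate one token.
                await asyncio.sleep((1.0 - tokens) / rps)
                tokens = 1.0
            self._state[host] = (tokens - 1.0, time.monotonic())
```

The registry would await the limiter immediately before dispatching to whichever provider handles the URL, so Byparr-routed hosts are throttled the same way as plain httpx fetches.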
Acceptance:
Default 1 req/sec/host, configurable
Crawl-delay honored when stricter than default
Redis-backed when available, in-memory otherwise (mirrors host_pref.py)
4. Site-operator opt-out mechanism
Why: Even with robots.txt honored, site operators occasionally need to escalate ("please stop fetching us regardless of what our robots.txt says"). Having a documented contact path and a fast operator-side block is what separates a mature scraper from a nuisance — and what regulators look for if anything ever escalates.
What:
Documented opt-out path: README section + /.well-known/-style page on the eventual public site explaining how to request removal (email + the form of evidence we need).
Operator-side blocklist: new fetch_blocked_hosts: list[str] setting in kt-config. URLs matching a blocked host are rejected at the registry entry point with a synthetic host_blocked audit entry, before url_safety and before robots.txt.
Cleanup of stored content: when a host is added to the blocklist, a one-shot job (or a just recipe) removes existing RawSource rows for that host. Lower priority but worth doing — opt-out without cleanup is a half-measure.
Audit-trail UX: the source detail page should explain the rejection reason clearly ("This host has been blocked at operator request") rather than a generic failure.
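A rough sketch of the entry-point check; a plain dict stands in for the project's FetchAttempt shape, and the helper name and fields are hypothetical.

```python
from urllib.parse import urlsplit


def blocked_host_attempt(url: str, blocked_hosts: set[str]) -> dict | None:
    """Return a synthetic host_blocked rejection if the URL's host is blocked, else None."""
    host = urlsplit(url).netloc.lower()
    # Block the host itself and any of its subdomains.
    if any(host == blocked or host.endswith("." + blocked) for blocked in blocked_hosts):
        return {
            "provider": "registry",
            "status": "host_blocked",  # surfaced in the audit UI
            "detail": "This host has been blocked at operator request.",
            "url": url,
        }
    return None
```

The registry would run this before url_safety and before the robots.txt check, so a blocked host never receives outbound traffic at all.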
Acceptance:
fetch_blocked_hosts setting added; rejection at registry entry point
Synthetic host_blocked audit entry surfaced in the UI
Documented opt-out path in README (email-based until we have a public site)
just fetch-purge-host <host> recipe to remove existing RawSource for a blocked host
fetch_blocked_hostssetting added; rejection at registry entry pointhost_blockedaudit entry surfaced in the UIjust fetch-purge-host <host>recipe to remove existingRawSourcefor a blocked hostOrdering / dependencies
These can land in any order but naturally pair:
fetch_blocked_hosts (item 4 above, minus the purge recipe) — smallest, highest leverage. Lets us respond to a complaint same-day.
robots.txt check (item 1) + per-host rate limiting (item 3) — both touch the registry entry point and the host-keyed cache machinery (host_pref.py is the natural template).
Non-goals