
Fetch chain: good-citizen mitigations (robots.txt, rate limiting, contact UA, opt-out) #179

@charlie83Gs

Description


Context

After #176 the fetch chain is technically capable of pulling almost any public URL (DOI shortcut → httpx → curl_cffi → Byparr in #178). Nothing about the techniques is illegal, but the project is one step away from looking like an adversarial scraper to the sites it touches. This issue bundles four cheap mitigations that dramatically reduce legal/operational risk and are good engineering on their own merits.

None of these are blockers for #176 — they're follow-up work, sized to land together as a single "be a good citizen" PR.

1. robots.txt check (soft default, per-host opt-out)

Why: Ignoring robots.txt is the #1 thing that gets a scraper blocked, C&D'd, or reported. It generally isn't legally binding, but honoring it is the strongest single signal of good-faith intent if a dispute ever arises.

What:

  • New RobotsCache (in-memory + Redis-backed, ~1h TTL per host) that fetches and parses robots.txt once per host.
  • FetchProviderRegistry.fetch() consults the cache before invoking any provider. On disallow, return a synthetic robots_disallowed FetchAttempt (same shape as the existing url_safety rejection) so the audit UI shows the reason cleanly.
  • Per-host opt-out via a new fetch_robots_ignore_hosts: list[str] setting in kt-config, for the rare case where a site's robots.txt is overly broad and we have a separate agreement.
  • Crawl-delay directive respected and fed into the rate limiter (item 3 below).
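As a minimal sketch of the check above, assuming in-memory caching only (the Redis layer, TTL plumbing, and the synthetic `robots_disallowed` FetchAttempt are elided; the class and setting names follow this issue, everything else is illustrative), the stdlib `urllib.robotparser` already covers parsing, missing-file, and Crawl-delay semantics:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class RobotsCache:
    """Per-host robots.txt cache (in-memory sketch; the real version adds Redis)."""

    def __init__(self, fetch_text, ttl=3600.0, ignore_hosts=frozenset()):
        self._fetch_text = fetch_text            # host -> robots.txt body, or None if missing
        self._ttl = ttl
        self._ignore_hosts = set(ignore_hosts)   # fetch_robots_ignore_hosts
        self._cache = {}                         # host -> (fetched_at, parser)

    def _parser(self, host):
        hit = self._cache.get(host)
        if hit and time.monotonic() - hit[0] < self._ttl:
            return hit[1]
        rp = urllib.robotparser.RobotFileParser()
        rp.modified()                            # mark as read, else can_fetch() always returns False
        body = self._fetch_text(host)
        if body is None:
            rp.allow_all = True                  # missing robots.txt -> everything allowed
        else:
            rp.parse(body.splitlines())          # malformed lines are skipped -> effectively allowed
        self._cache[host] = (time.monotonic(), rp)
        return rp

    def allowed(self, url, user_agent):
        host = urlparse(url).hostname
        if host in self._ignore_hosts:           # per-host opt-out
            return True
        return self._parser(host).can_fetch(user_agent, url)

    def crawl_delay(self, host, user_agent):
        delay = self._parser(host).crawl_delay(user_agent)
        return float(delay) if delay is not None else None
```

`fetch_text` is a stand-in hook for the actual HTTP fetch, which also keeps the class trivially testable without network access.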

Acceptance:

  • robots.txt fetched at most once per host per TTL window
  • Disallowed URLs short-circuit the chain with an audit entry
  • fetch_robots_ignore_hosts config bypasses the check
  • Tests cover: allowed, disallowed, missing robots.txt (treated as allowed), malformed robots.txt (treated as allowed + warning logged), per-host opt-out

2. Contact-email User-Agent (deferred until we have an official email)

Why: Crossref's "polite pool" pattern is the gold standard. A UA like KnowledgeTree/0.x (+https://knowledge-tree.example; mailto:abuse@...) tells site operators who to contact instead of just blocking, and is the cheapest way to look legitimate.

What:

  • The fetch_user_agent setting in kt-config already exists. Update the default once we have a real public URL and an abuse@ mailbox.
  • New fetch_contact_email setting; when set, automatically appended to fetch_user_agent in the format (+<url>; mailto:<email>).
  • Falls back gracefully when unset (today's behavior — generic browser-ish UA).
  • DOI provider already uses crossref_email / unpaywall_email for its API calls; consolidate so a single fetch_contact_email feeds all three by default, with the per-API overrides still respected.
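A sketch of the two rules above (the setting names `fetch_contact_email`, `crossref_email`, and `unpaywall_email` are from this issue; the function names and exact formatting are illustrative):

```python
def build_user_agent(base_ua, public_url=None, contact_email=None):
    """Crossref-style polite UA: KnowledgeTree/0.x (+<url>; mailto:<email>)."""
    parts = []
    if public_url:
        parts.append(f"+{public_url}")
    if contact_email:                  # fetch_contact_email, when set
        parts.append(f"mailto:{contact_email}")
    # Falls back to the plain base UA when nothing is configured (today's behavior)
    return f"{base_ua} ({'; '.join(parts)})" if parts else base_ua

def contact_for_api(api_override, fetch_contact_email):
    """Consolidation rule: per-API override (crossref_email / unpaywall_email) wins."""
    return api_override or fetch_contact_email
```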

Acceptance:

  • fetch_contact_email setting added; appended to UA when set
  • DOI provider falls back to fetch_contact_email when crossref_email / unpaywall_email are unset
  • No-op when unset (no behavior change vs today)
  • Deferred sub-task: flip the default UA once we have an official public URL + abuse mailbox (separate small PR)

3. Per-host rate limiting

Why: The current chain has max_concurrent for total in-flight requests but no per-host throttle. Hammering a single domain is the fastest way to get blocked, blacklisted, or flagged as a DoS — and the synthesis pipelines absolutely will hit the same host repeatedly when expanding a node from a single source.

What:

  • Token-bucket rate limiter keyed by host, default ~1 req/sec per host (configurable via fetch_per_host_rps in kt-config).
  • Honors Crawl-delay from robots.txt when present (overrides the default if stricter).
  • Lives in the registry, applied uniformly across all providers — so a host that needs Byparr still gets throttled.
  • Backed by Redis when available (so multiple workers share the budget); in-memory fallback for tests / single-process dev.
  • Waits rather than fails when the bucket is empty — short bursts are fine.
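A minimal in-process token-bucket sketch of the fallback path described above (`fetch_per_host_rps` is the issue's setting; the class and method names are illustrative, and the Redis-backed variant would expose the same interface):

```python
import time
from collections import defaultdict

class HostRateLimiter:
    """Per-host token bucket: wait() blocks instead of failing when the bucket is empty."""

    def __init__(self, rps=1.0, burst=1.0):
        self.rps, self.burst = rps, burst
        # host -> (tokens, last_refill); each host starts with a full burst
        self._state = defaultdict(lambda: (burst, time.monotonic()))
        self._crawl_delay = {}

    def set_crawl_delay(self, host, delay):
        # Honor robots.txt Crawl-delay only when stricter than the default rate
        if delay > 1.0 / self.rps:
            self._crawl_delay[host] = delay

    def wait(self, host):
        rate = 1.0 / self._crawl_delay[host] if host in self._crawl_delay else self.rps
        tokens, last = self._state[host]
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * rate)
        if tokens < 1.0:
            time.sleep((1.0 - tokens) / rate)   # block until one token has regenerated
            tokens = 1.0
        self._state[host] = (tokens - 1.0, time.monotonic())
```

Because the bucket is keyed by host, throttling one domain never delays requests to another, which mirrors the multi-host-independence acceptance criterion below.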

Acceptance:

  • Default 1 req/sec/host, configurable
  • Crawl-delay honored when stricter than default
  • Redis-backed when available, in-memory otherwise (mirrors host_pref.py)
  • Tests cover: single-host throttling, multi-host independence, Crawl-delay override, Redis vs in-memory parity

4. Site-operator opt-out mechanism

Why: Even with robots.txt honored, site operators occasionally need to escalate ("please stop fetching us regardless of what our robots.txt says"). Having a documented contact path and a fast operator-side block is what separates a mature scraper from a nuisance — and what regulators look for if anything ever escalates.

What:

  • Documented opt-out path: README section + /.well-known/-style page on the eventual public site explaining how to request removal (email + the form of evidence we need).
  • Operator-side blocklist: new fetch_blocked_hosts: list[str] setting in kt-config. URLs matching a blocked host are rejected at the registry entry point with a synthetic host_blocked audit entry, before url_safety and before robots.txt.
  • Cleanup of stored content: when a host is added to the blocklist, a one-shot job (or a just recipe) removes existing RawSource rows for that host. Lower priority but worth doing — opt-out without cleanup is a half-measure.
  • Audit-trail UX: the source detail page should explain the rejection reason clearly ("This host has been blocked at operator request") rather than a generic failure.
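A sketch of the registry entry-point check (`fetch_blocked_hosts` and the `host_blocked` status are from this issue; the `FetchAttempt` shape here is a stand-in for the real dataclass, and matching subdomains of a blocked host is an assumption):

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class FetchAttempt:          # stand-in for the real audit record
    url: str
    status: str
    detail: str = ""

def check_blocklist(url, blocked_hosts):
    """Return a synthetic host_blocked attempt, or None if the URL may proceed."""
    host = (urlparse(url).hostname or "").lower()
    for blocked in blocked_hosts:
        b = blocked.lower()
        if host == b or host.endswith("." + b):   # also block subdomains (assumption)
            return FetchAttempt(url, "host_blocked",
                                "This host has been blocked at operator request")
    return None
```

Running this before url_safety and robots.txt keeps the operator block absolute: no network traffic of any kind reaches a blocked host, not even the robots.txt fetch.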

Acceptance:

  • fetch_blocked_hosts setting added; rejection at registry entry point
  • Synthetic host_blocked audit entry surfaced in the UI
  • Documented opt-out path in README (email-based until we have a public site)
  • just fetch-purge-host <host> recipe to remove existing RawSource for a blocked host
  • Tests cover blocklist short-circuit, audit trail surfacing, purge recipe

Ordering / dependencies

These can land in any order but naturally pair:

  1. First: fetch_blocked_hosts (item 4, minus the purge recipe) — smallest, highest leverage. Lets us respond to a complaint same-day.
  2. Second: robots.txt (item 1) + per-host rate limiting (item 3) — both touch the registry entry point and the host-keyed cache machinery (host_pref.py is the natural template).
  3. Third: UA consolidation (item 2) — trivial code change, mostly waiting on the operational decision about the official email/URL.
  4. Fourth: purge recipe + documented opt-out page — once there's a public site to host the docs.

Non-goals

  • Not building a full crawl scheduler. The fetch chain is on-demand from user-submitted URLs and synthesis pipelines, not a continuous crawler.
  • Not implementing DMCA takedown tooling — that's a SaaS-launch concern, separate from local/dev mode.
  • Not changing any provider implementations from the fetch-chain work in #176.
