Context
After #176 the fetch chain is technically capable of pulling almost any public URL (DOI shortcut → httpx → curl_cffi → Byparr in #178). Nothing about the techniques is illegal, but the project is one step away from looking like an adversarial scraper to the sites it touches. This issue bundles four cheap mitigations that dramatically reduce legal/operational risk and are good engineering on their own merits.
None of these are blockers for #176 — they're follow-up work, sized to land together as a single "be a good citizen" PR.
1. robots.txt check (soft default, per-host opt-out)
Why: Ignoring robots.txt is the #1 thing that gets a scraper blocked, C&D'd, or reported. It's not legally binding in any jurisdiction, but honoring it is the strongest single signal of good-faith intent if a dispute ever arises.
What:
New RobotsCache (in-memory + Redis-backed, ~1h TTL per host) that fetches and parses robots.txt once per host.
FetchProviderRegistry.fetch() consults the cache before invoking any provider. On disallow, return a synthetic robots_disallowed FetchAttempt (same shape as the existing url_safety rejection) so the audit UI shows the reason cleanly.
Per-host opt-out via a new fetch_robots_ignore_hosts: list[str] setting in kt-config, for the rare case where a site's robots.txt is overly broad and we have a separate agreement.
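A minimal sketch of the cache and the allow/deny decision, assuming httpx and only the in-memory layer; the class name mirrors the issue, but every signature here is illustrative, and the Redis backing, cross-worker TTL sharing, and the fetch_robots_ignore_hosts bypass are left out.

```python
# Illustrative only: in-memory TTL cache keyed by host.
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

import httpx


class RobotsCache:
    def __init__(self, ttl_seconds: float = 3600.0, user_agent: str = "KnowledgeTree/0.x") -> None:
        self._ttl = ttl_seconds
        self._ua = user_agent
        self._cache: dict[str, tuple[float, RobotFileParser | None]] = {}

    async def is_allowed(self, url: str) -> bool:
        host = urlsplit(url).netloc
        entry = self._cache.get(host)
        if entry is None or time.monotonic() - entry[0] > self._ttl:
            entry = (time.monotonic(), await self._fetch_parser(url))
            self._cache[host] = entry
        parser = entry[1]
        # Missing or unreadable robots.txt is treated as "allowed".
        return True if parser is None else parser.can_fetch(self._ua, url)

    async def _fetch_parser(self, url: str) -> RobotFileParser | None:
        parts = urlsplit(url)
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
        try:
            async with httpx.AsyncClient(timeout=10.0) as client:
                resp = await client.get(robots_url)
        except httpx.HTTPError:
            return None
        if resp.status_code != 200:
            return None
        parser = RobotFileParser()
        parser.parse(resp.text.splitlines())
        return parser
```

FetchProviderRegistry.fetch() would await is_allowed(url) before running the chain and emit the synthetic robots_disallowed attempt when it returns False.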
Acceptance:
robots.txt fetched at most once per host per TTL window
Disallowed URLs short-circuit the chain with an audit entry
fetch_robots_ignore_hosts config bypasses the check
Tests cover: allowed, disallowed, missing robots.txt (treated as allowed), malformed robots.txt (treated as allowed + warning logged), per-host opt-out
2. Contact-email User-Agent (deferred until we have an official email)
Why: Crossref's "polite pool" pattern is the gold standard. A UA like KnowledgeTree/0.x (+https://knowledge-tree.example; mailto:abuse@...) tells site operators who to contact instead of just blocking, and is the cheapest way to look legitimate.
What:
The fetch_user_agent setting in kt-config already exists. Update the default once we have a real public URL and an abuse@ mailbox.
New fetch_contact_email setting; when set, automatically appended to fetch_user_agent in the format (+<url>; mailto:<email>).
Falls back gracefully when unset (today's behavior — generic browser-ish UA).
DOI provider already uses crossref_email / unpaywall_email for its API calls; consolidate so a single fetch_contact_email feeds all three (the UA plus both API identities) by default, with the per-API overrides still respected.
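A small sketch of the UA composition this describes; build_user_agent and the contact_url parameter are hypothetical names rather than existing kt-config fields, and the example values are placeholders.

```python
def build_user_agent(base_ua: str, contact_email: str | None, contact_url: str | None = None) -> str:
    """Append a Crossref-style contact hint to the base UA when an email is configured."""
    if not contact_email:
        return base_ua  # unset -> no behavior change vs today
    contact = f"+{contact_url}; mailto:{contact_email}" if contact_url else f"mailto:{contact_email}"
    return f"{base_ua} ({contact})"


# build_user_agent("KnowledgeTree/0.x", "abuse@knowledge-tree.example", "https://knowledge-tree.example")
# -> "KnowledgeTree/0.x (+https://knowledge-tree.example; mailto:abuse@knowledge-tree.example)"
```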
Acceptance:
fetch_contact_email setting added; appended to UA when set
DOI provider falls back to fetch_contact_email when crossref_email / unpaywall_email are unset
No-op when unset (no behavior change vs today)
Deferred sub-task: flip the default UA once we have an official public URL + abuse mailbox (separate small PR)
3. Per-host rate limiting
Why: The current chain has max_concurrent for total in-flight requests but no per-host throttle. Hammering a single domain is the fastest way to get blocked, blacklisted, or flagged as a DoS — and the synthesis pipelines absolutely will hit the same host repeatedly when expanding a node from a single source.
What:
Token-bucket rate limiter keyed by host, default ~1 req/sec per host (configurable via fetch_per_host_rps in kt-config).
Honors Crawl-delay from robots.txt when present (overrides the default if stricter).
Lives in the registry, applied uniformly across all providers — so a host that needs Byparr still gets throttled.
Backed by Redis when available (so multiple workers share the budget); in-memory fallback for tests / single-process dev.
Waits rather than fails when the bucket is empty — short bursts are fine.
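A minimal in-memory sketch of that token bucket, assuming the Redis-backed variant would expose the same acquire() interface; the class and parameter names are illustrative.

```python
import asyncio
import time
from collections import defaultdict


class PerHostRateLimiter:
    def __init__(self, default_rps: float = 1.0, burst: float = 1.0) -> None:
        self._default_rps = default_rps
        self._burst = burst
        self._state: dict[str, tuple[float, float]] = {}  # host -> (tokens, last refill time)
        self._locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def acquire(self, host: str, crawl_delay: float | None = None) -> None:
        """Block until a request to `host` is allowed; waits instead of failing."""
        # Honor robots.txt Crawl-delay when it is stricter than the default rate.
        rps = min(self._default_rps, 1.0 / crawl_delay) if crawl_delay else self._default_rps
        async with self._locks[host]:
            tokens, last = self._state.get(host, (self._burst, time.monotonic()))
            now = time.monotonic()
            tokens = min(self._burst, tokens + (now - last) * rps)
            if tokens < 1.0:
                # Sleep just long enough to accumulate one token.
                await asyncio.sleep((1.0 - tokens) / rps)
                tokens = 1.0
            self._state[host] = (tokens - 1.0, time.monotonic())
```

The registry would await the limiter immediately before dispatching to whichever provider handles the URL, so Byparr-routed hosts are throttled the same way as plain httpx fetches.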
Acceptance:
Default 1 req/sec/host, configurable
Crawl-delay honored when stricter than default
Redis-backed when available, in-memory otherwise (mirrors host_pref.py)
4. Site-operator opt-out mechanism
Why: Even with robots.txt honored, site operators occasionally need to escalate ("please stop fetching us regardless of what our robots.txt says"). Having a documented contact path and a fast operator-side block is what separates a mature scraper from a nuisance — and what regulators look for if anything ever escalates.
What:
Documented opt-out path: README section + /.well-known/-style page on the eventual public site explaining how to request removal (email + the form of evidence we need).
Operator-side blocklist: new fetch_blocked_hosts: list[str] setting in kt-config. URLs matching a blocked host are rejected at the registry entry point with a synthetic host_blocked audit entry, before url_safety and before robots.txt.
Cleanup of stored content: when a host is added to the blocklist, a one-shot job (or a just recipe) removes existing RawSource rows for that host. Lower priority but worth doing — opt-out without cleanup is a half-measure.
Audit-trail UX: the source detail page should explain the rejection reason clearly ("This host has been blocked at operator request") rather than a generic failure.
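A rough sketch of the entry-point check; a plain dict stands in for the project's FetchAttempt shape, and the helper name and fields are hypothetical.

```python
from urllib.parse import urlsplit


def blocked_host_attempt(url: str, blocked_hosts: set[str]) -> dict | None:
    """Return a synthetic host_blocked rejection if the URL's host is blocked, else None."""
    host = urlsplit(url).netloc.lower()
    # Block the host itself and any of its subdomains.
    if any(host == blocked or host.endswith("." + blocked) for blocked in blocked_hosts):
        return {
            "provider": "registry",
            "status": "host_blocked",  # surfaced in the audit UI
            "detail": "This host has been blocked at operator request.",
            "url": url,
        }
    return None
```

The registry would run this before url_safety and before the robots.txt check, so a blocked host never receives outbound traffic at all.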
Acceptance:
fetch_blocked_hosts setting added; rejection at registry entry point
Synthetic host_blocked audit entry surfaced in the UI
Documented opt-out path in README (email-based until we have a public site)
just fetch-purge-host <host> recipe to remove existing RawSource for a blocked host
fetch_blocked_hostssetting added; rejection at registry entry pointhost_blockedaudit entry surfaced in the UIjust fetch-purge-host <host>recipe to remove existingRawSourcefor a blocked hostOrdering / dependencies
These can land in any order but naturally pair:
fetch_blocked_hosts (item 4 above, minus the purge recipe) — smallest, highest leverage. Lets us respond to a complaint same-day.
robots.txt check (item 1) + per-host rate limiting (item 3) — both touch the registry entry point and the host-keyed cache machinery (host_pref.py is the natural template).
Non-goals