feat(kt-db): graph public cache toggles + raw_source canonical_url/doi#191

Merged
charlie83Gs merged 2 commits into main from
feat/raw-source-canonical-schema
Apr 8, 2026

@charlie83Gs
Contributor

Summary

PR3 of the multigraph public-cache series. Lays down the schema state the PublicGraphBridge (PR4) needs to look up sources across graphs by stable identity.

  • graphs.contribute_to_public and graphs.use_public_cache — per-graph toggles, default true. The default graph ignores them in code since it has no upstream. Both columns live in the public schema only (control plane).
  • raw_sources.canonical_url / doi (graph-db) and matching write_raw_sources.canonical_url / doi (write-db). Both non-unique indexed (ix_*_canonical_url, ix_*_doi).
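
As a rough illustration of how the two toggles could be resolved in code, here is a minimal sketch. `GraphSettings`, `is_default`, and `effective_toggles` are hypothetical names, not part of this PR, and treating "ignored" as "both off" for the default graph is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GraphSettings:
    """Hypothetical in-memory view of one row in the graphs table."""

    slug: str
    is_default: bool  # assumed flag; the PR does not say how the default graph is identified
    contribute_to_public: bool = True  # column default stated in this PR
    use_public_cache: bool = True  # column default stated in this PR


def effective_toggles(graph: GraphSettings) -> Tuple[bool, bool]:
    """Return (contribute_to_public, use_public_cache) as callers should honor them."""
    if graph.is_default:
        # The default graph has no upstream, so the stored values are ignored.
        # Assumption: "ignored" is modeled here as both behaviours being off.
        return (False, False)
    return (graph.contribute_to_public, graph.use_public_cache)
```

Keeping the default-graph check in one function means the stored column values never need special-casing at the call sites.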

⚠️ The bridge queries write-db, never graph-db. The write-db indexes are the load-bearing ones; the graph-db copies exist for sync parity and future analytics. Workers never touch graph-db.
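
To make the "load-bearing index" point concrete, the sketch below shows the kind of equality lookup those write-db indexes would serve. It is not the PR4 bridge: `find_cached_source` is a hypothetical name, the DOI-before-URL priority is an assumption, the concrete index names are only one possible expansion of the `ix_*` pattern, and an in-memory SQLite table stands in for the real write-db.

```python
import sqlite3
from typing import Optional


def find_cached_source(
    conn: sqlite3.Connection,
    *,
    doi: Optional[str],
    canonical_url: Optional[str],
) -> Optional[int]:
    """Return the id of a matching write_raw_sources row, or None.

    Assumption: a DOI match is preferred over a canonical URL match.
    """
    if doi:
        row = conn.execute(
            "SELECT id FROM write_raw_sources WHERE doi = ? ORDER BY id LIMIT 1",
            (doi,),
        ).fetchone()
        if row is not None:
            return row[0]
    if canonical_url:
        row = conn.execute(
            "SELECT id FROM write_raw_sources WHERE canonical_url = ? ORDER BY id LIMIT 1",
            (canonical_url,),
        ).fetchone()
        if row is not None:
            return row[0]
    return None


# Stand-in schema: only the columns relevant to the lookup (the uri column is assumed).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE write_raw_sources (
        id INTEGER PRIMARY KEY,
        uri TEXT NOT NULL,
        canonical_url TEXT,
        doi TEXT
    );
    -- Non-unique indexes, so several rows may share one canonical_url or doi.
    CREATE INDEX ix_write_raw_sources_canonical_url ON write_raw_sources (canonical_url);
    CREATE INDEX ix_write_raw_sources_doi ON write_raw_sources (doi);
    """
)
```

Because the indexes are non-unique, the sketch picks the lowest id deterministically; the real bridge may choose differently.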

Backfill

Both migrations include a best-effort backfill that walks existing rows in batches of 1000 and computes a minimal canonical_url + doi using a small helper inlined in the migration file. The helper:

  • lowercases scheme + host
  • drops fragment
  • regex-extracts a DOI substring (10.NNNN/...)

Tracker stripping and duplicate-slash collapsing are intentionally not done in the backfill — they're cosmetic on legacy rows and a weaker backfill can only cost missed cache hits, never wrong matches. New writes use the full kt_providers.fetch.canonical helper from PR2.
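
The three helper steps listed above can be sketched roughly as follows. This is not the code inlined in the migration: `simple_canonicalize` and the DOI pattern are assumptions for illustration, and the suffix character class (stopping at whitespace, `?`, and `#`) is a choice made here, not something the PR states.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix that
# stops at whitespace or at the start of a query string or fragment.
_DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def simple_canonicalize(uri: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (canonical_url, doi); either element may be None."""
    if uri is None or not uri.strip():
        # Never emit empty strings for blank input.
        return None, None
    raw = uri.strip()

    try:
        parts = urlsplit(raw)
    except ValueError:
        # Malformed input (for example a broken IPv6 literal): skip the URL part.
        canonical = None
    else:
        # Lowercase scheme and host (this lowercases the whole netloc) and
        # drop the fragment. Path and query are left untouched.
        canonical = urlunsplit(
            (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
        ) or None

    match = _DOI_RE.search(raw)
    # Trailing punctuation is common in scraped text, so trim it from the DOI.
    doi = match.group(0).rstrip(".),;]") if match else None
    return canonical, doi
```

Path case, query parameters, and duplicate slashes are deliberately left alone here, which mirrors the "weaker backfill" trade-off: two URLs that differ only in tracker parameters stay distinct and simply miss the cache.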

The helper is inlined (rather than imported from kt-providers) so kt-db does not pick up a workspace dependency on a sibling lib.
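
The batched walk described under Backfill might look roughly like the sketch below. It is written against an in-memory SQLite stand-in rather than the real migration context; the `uri` column, the integer `id` used for keyset pagination, and the `backfill` function itself are assumptions for illustration.

```python
import sqlite3
from typing import Callable, Optional, Tuple

BATCH_SIZE = 1000  # batch size stated in this PR

# Any callable that maps a raw URI to (canonical_url, doi).
Canonicalizer = Callable[[Optional[str]], Tuple[Optional[str], Optional[str]]]


def backfill(
    conn: sqlite3.Connection,
    canonicalize: Canonicalizer,
    batch_size: int = BATCH_SIZE,
) -> int:
    """Walk raw_sources in id order and fill canonical_url / doi.

    Returns the number of rows updated. Rows for which the helper yields
    nothing are left as NULL (best effort).
    """
    last_id = 0
    updated = 0
    while True:
        # Keyset pagination: each batch resumes after the last id seen,
        # so the cost per batch does not grow with the offset.
        rows = conn.execute(
            "SELECT id, uri FROM raw_sources WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break

        params = []
        for row_id, uri in rows:
            canonical_url, doi = canonicalize(uri)
            if canonical_url is not None or doi is not None:
                params.append((canonical_url, doi, row_id))

        conn.executemany(
            "UPDATE raw_sources SET canonical_url = ?, doi = ? WHERE id = ?",
            params,
        )
        updated += len(params)
        last_id = rows[-1][0]
    return updated
```

Passing the canonicalizer in as a callable keeps the loop independent of any sibling library, which is the same concern that motivated inlining the helper.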

Migration generation

Both files generated via the alembic CLI per CLAUDE.md (never hand-author IDs):

cd libs/kt-db && uv run alembic revision -m "add graph public cache toggles and raw_source canonical_url doi"
cd libs/kt-db && uv run alembic -c alembic_write.ini revision -m "add write_raw_source canonical_url doi"

Single-head verified on both configs:

$ uv run alembic heads
38859c06de60 (head)
$ uv run alembic -c alembic_write.ini heads
777cdf5ff9e5 (head)

Test plan

  • uv run --project libs/kt-db pytest libs/kt-db/tests/ -x -q — 205 passed
  • alembic upgrade head (graph-db) — applied cleanly
  • alembic -c alembic_write.ini upgrade head (write-db) — applied cleanly
  • alembic downgrade -1 && upgrade head round-trip on both configs — clean
  • CI all green

PR sequence

  1. PR1 — Provider classification: feat(kt-providers): classify search/fetch providers as public or private #186
  2. PR2 — Canonicalization helpers: feat(kt-providers): canonical URL + DOI helpers for multigraph cache lookup #187
  3. PR3 — Schema migrations: this PR
  4. PR4 — PublicGraphBridge + WorkerState wiring
  5. PR5 — Ingest workflow integration
  6. PR6 — API surface + Frontend
  7. PR7 — Robustness/sweeper

🤖 Generated with Claude Code

…l/doi

PR3 of the multigraph public-cache series. Adds the schema state needed
for the PublicGraphBridge to look up sources across graphs by stable
identity:

- ``graphs.contribute_to_public`` and ``graphs.use_public_cache`` —
  per-graph toggles, default ON for non-default graphs (the default graph
  ignores them in code since it has no upstream).
- ``raw_sources.canonical_url`` / ``doi`` (graph-db) and matching columns
  on ``write_raw_sources`` (write-db). Both non-unique indexed.

CRITICAL: the bridge queries write-db, never graph-db. The write-db
indexes are the load-bearing ones; the graph-db copies exist for sync
parity and future analytics.

Both migrations include a best-effort backfill that walks existing rows
in batches of 1000 and computes a minimal canonical_url + DOI inline
(without depending on kt-providers, to avoid pulling a workspace dep into
kt-db). New writes use the full canonical helper from PR2, so a weaker
backfill only costs missed cache hits on legacy rows — never wrong
matches, which is the failure mode we actually care about.

Migrations generated via the alembic CLI; verified single head on both
configs and round-tripped upgrade/downgrade against a fresh DB.
- ruff format on both PR3 migration files (Backend Lint was red)
- broaden DOI tail-strip to ".),;]" — scraped URIs in legacy rows often
  end in trailing punctuation that .rstrip(".)") missed
- guard `_simple_canonicalize` against empty/whitespace URIs so we never
  write empty strings into canonical_url (return None instead)
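
A small illustration of why the tail-strip set was widened, using a made-up scraped value; `strip_doi_tail` is a hypothetical name, not the function in the migration.

```python
def strip_doi_tail(doi: str) -> str:
    """Trim trailing punctuation that scraped text often leaves on a DOI."""
    return doi.rstrip(".),;]")


scraped = "10.1234/abc.def);"  # illustrative value, not taken from real data

# rstrip stops at the first trailing character outside its set, so the
# narrower ".)" set is blocked by the final ";" and removes nothing.
narrow = scraped.rstrip(".)")
wide = strip_doi_tail(scraped)
```

With the narrower set the value comes back unchanged, so the stored DOI would never equal the clean form written by new ingests.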
@charlie83Gs charlie83Gs merged commit b9572a7 into main Apr 8, 2026
18 checks passed
@charlie83Gs charlie83Gs deleted the feat/raw-source-canonical-schema branch April 8, 2026 22:16
@github-actions
github-actions Bot commented Apr 8, 2026


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
