feat(kt-db): graph public cache toggles + raw_source canonical_url/doi #191
Merged
charlie83Gs merged 2 commits into main on Apr 8, 2026
…l/doi

PR3 of the multigraph public-cache series. Adds the schema state needed for the PublicGraphBridge to look up sources across graphs by stable identity:

- ``graphs.contribute_to_public`` and ``graphs.use_public_cache``: per-graph toggles, default ON for non-default graphs (the default graph ignores them in code since it has no upstream).
- ``raw_sources.canonical_url`` / ``doi`` (graph-db) and matching columns on ``write_raw_sources`` (write-db). Both non-unique indexed.

CRITICAL: the bridge queries write-db, never graph-db. The write-db indexes are the load-bearing ones; the graph-db copies exist for sync parity and future analytics.

Both migrations include a best-effort backfill that walks existing rows in batches of 1000 and computes a minimal canonical_url + DOI inline (without depending on kt-providers, to avoid pulling a workspace dep into kt-db). New writes use the full canonical helper from PR2, so a weaker backfill only costs missed cache hits on legacy rows, never wrong matches, which is the failure mode we actually care about.

Migrations generated via the alembic CLI; verified single head on both configs and round-tripped upgrade/downgrade against a fresh DB.
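The batched backfill described above can be sketched roughly as follows. This is a minimal illustration only, not the migration's actual code: it uses the standard-library `sqlite3` module instead of Alembic against the real write-db, and the `id` and `uri` column names, the function name `backfill_canonical_url`, and the `canonicalize` callable are all assumptions made for the example.

```python
import sqlite3

BATCH_SIZE = 1000  # batch size stated in the commit message


def backfill_canonical_url(conn, canonicalize, batch_size=BATCH_SIZE):
    """Fill canonical_url where it is NULL, walking rows by primary key in batches.

    `canonicalize` is a hypothetical callable returning a string or None.
    Column names `id` and `uri` are assumptions for this sketch.
    """
    last_id = 0
    updated = 0
    while True:
        # Keyset pagination: rows the helper cannot canonicalize stay NULL,
        # so paging by id (not by "still NULL") guarantees forward progress.
        rows = conn.execute(
            "SELECT id, uri FROM write_raw_sources "
            "WHERE id > ? AND canonical_url IS NULL ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        for row_id, uri in rows:
            value = canonicalize(uri)
            if value is not None:
                conn.execute(
                    "UPDATE write_raw_sources SET canonical_url = ? WHERE id = ?",
                    (value, row_id),
                )
                updated += 1
        conn.commit()  # one commit per batch keeps each transaction small
        last_id = rows[-1][0]
    return updated
```

Leaving a row as NULL when the helper returns None matches the stated trade-off: an unfilled legacy row can only cost a missed cache hit, while a wrongly filled one could produce a wrong match.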
- ruff format on both PR3 migration files (Backend Lint was red)
- broaden DOI tail-strip to ".),;]" — scraped URIs in legacy rows often
end in trailing punctuation that .rstrip(".)") missed
- guard `_simple_canonicalize` against empty/whitespace URIs so we never
write empty strings into canonical_url (return None instead)
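The two fixes in this commit can be illustrated with a small sketch. Only the tail-strip character set `".),;]"` and the return-None-on-empty behaviour come from the commit message; the DOI regular expression, the lower-casing, the fragment dropping, and the function names `extract_doi` and `simple_canonicalize` are assumptions for illustration and may differ from the helper inlined in the migration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")
_DOI_TAIL = ".),;]"  # trailing punctuation named in the commit message


def extract_doi(uri):
    """Return a DOI found in `uri` with trailing punctuation stripped, or None."""
    if not uri:
        return None
    match = _DOI_RE.search(uri)
    if match is None:
        return None
    # Lower-casing is an assumption of this sketch (DOIs are case-insensitive).
    doi = match.group(0).rstrip(_DOI_TAIL).lower()
    return doi or None


def simple_canonicalize(uri):
    """Return a minimal canonical URL, or None for empty/whitespace input."""
    if uri is None or not uri.strip():
        return None  # never write an empty string into canonical_url
    parts = urlsplit(uri.strip())
    if not parts.scheme or not parts.netloc:
        return None
    # Assumed normalization: lower-case scheme and host, drop the fragment,
    # leave path and query untouched.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
```

Stripping with the wider character set means a scraped value such as `10.1234/abc.def).` and its clean form resolve to the same DOI, so both rows land on the same index key.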
Summary
PR3 of the multigraph public-cache series. Lays down the schema state the PublicGraphBridge (PR4) needs to look up sources across graphs by stable identity.

- `graphs.contribute_to_public` and `graphs.use_public_cache`: per-graph toggles, default `true`. The default graph ignores them in code since it has no upstream. Lives in the public schema only (control plane).
- `raw_sources.canonical_url` / `doi` (graph-db) and matching `write_raw_sources.canonical_url` / `doi` (write-db). Both non-unique indexed (`ix_*_canonical_url`, `ix_*_doi`).

Backfill
Both migrations include a best-effort backfill that walks existing rows in batches of 1000 and computes a minimal `canonical_url` + `doi` using a small helper inlined in the migration file. The helper:

- … 10.NNNN/...)

Tracker stripping and duplicate-slash collapsing are intentionally not done in the backfill: they're cosmetic on legacy rows, and a weaker backfill can only cost missed cache hits, never wrong matches. New writes use the full `kt_providers.fetch.canonical` helper from PR2.

The helper is inlined (rather than imported from `kt-providers`) so `kt-db` does not pick up a workspace dependency on a sibling lib.

Migration generation
Both files generated via the alembic CLI per CLAUDE.md (never hand-author IDs):
Single-head verified on both configs:
Test plan
- `uv run --project libs/kt-db pytest libs/kt-db/tests/ -x -q`: 205 passed
- `alembic upgrade head` (graph-db): applied cleanly
- `alembic -c alembic_write.ini upgrade head` (write-db): applied cleanly
- `alembic downgrade -1 && upgrade head` round-trip on both configs: clean

PR sequence
- PR1: Provider classification ✅ feat(kt-providers): classify search/fetch providers as public or private #186
- PR2: Canonicalization helpers ✅ feat(kt-providers): canonical URL + DOI helpers for multigraph cache lookup #187
- `PublicGraphBridge` + WorkerState wiring

🤖 Generated with Claude Code