feat(kt-graph): PublicGraphBridge + WorkerGraphEngine wrapping #192
Merged
charlie83Gs merged 2 commits into main on Apr 8, 2026
PR4 of the multigraph public-cache series. Lays down the standalone
``PublicGraphBridge`` module and wires it through ``WorkerGraphEngine``
+ ``WorkerState`` so workflows have a single, terse plumbing point in
PR5.
## What's new
- ``libs/kt-graph/src/kt_graph/public_bridge.py`` — standalone bridge
with three entry points:
* ``lookup_cached_source(canonical_url, doi)`` — read-only query
against the *default* graph's write-db. Returns a detached
``CachedSourceImport`` snapshot (raw source + facts + fact_sources
+ concept/entity nodes + their embeddings) so it can cross worker
boundaries safely.
* ``import_cached_source(snapshot, target_write_session, target_qdrant_prefix)``
— writes the snapshot into the target graph: upserts the raw source
by ``content_hash``, dedups facts against the target Qdrant fact
collection (0.92 atomic / 0.85 compound), mirrors fact_source rows,
and concept-similarity-matches each cached node against the target
node collection (threshold from
``Settings.public_bridge_concept_match_threshold``, default 0.93).
Misses create new ``write_nodes`` rows with deterministic keys.
Compensating Qdrant deletes on partial failure.
* ``contribute_source_and_facts(raw_source_id, source_write_session,
source_qdrant_prefix)`` — pushes a freshly-decomposed source + its
facts upstream to the default graph. Sources and facts only — node
structure is NOT contributed (the public graph runs its own pipeline
on the accumulated fact pool). Best-effort, all errors swallowed.
The bridge is **write-db only on both sides** — workers never touch
graph-db. ``CachedSourceImport`` is plain data (no live ORM rows) so
the lookup session can close before the import session opens.
- ``libs/kt-graph/src/kt_graph/worker_engine.py`` — three pass-through
methods on ``WorkerGraphEngine`` (``lookup_public_cache``,
``import_from_public``, ``contribute_to_public``). Each no-ops
gracefully when ``_public_bridge is None`` — that is the universal
"skip" signal so workflow code stays free of self-reference guards.
The engine fills in its own session + Qdrant prefix on every call.
- ``libs/kt-hatchet/src/kt_hatchet/lifespan.py`` — ``WorkerState`` now
carries ``default_graph_id`` (resolved once at startup from
``graphs.is_default = TRUE``) and exposes a ``make_worker_engine()``
factory that constructs a ``PublicGraphBridge`` only when the current
``graph_id`` differs from ``default_graph_id``. Every worker
(ingest/bottomup/nodes/synthesis) inherits cross-graph capability for
free through this single factory — no per-workflow wiring.
- ``libs/kt-config/src/kt_config/settings.py`` — two new fields under a
new ``public_bridge`` YAML section:
* ``public_bridge_concept_match_threshold`` (default 0.93): set high on
purpose, because a false match collapses distinct concepts, which is
far worse than the cost of an occasional duplicate.
* ``public_cache_refresh_after_days`` (default 365) — cache hits
older than this still serve immediately but flag the snapshot as
stale. The async refresh workflow itself lands in PR7.
- ``libs/kt-hatchet/pyproject.toml`` — adds ``kt-graph`` as a workspace
dep. No cycle (kt-graph never imports kt-hatchet).
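
The "bridge is ``None`` means skip" convention and the single factory described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in ``worker_engine.py`` or ``lifespan.py``: the stub bridge, the dict-shaped snapshot, the synchronous signatures, and the constructor arguments are all hypothetical. Only the names ``WorkerGraphEngine``, ``PublicGraphBridge``, ``WorkerState``, ``_public_bridge``, ``lookup_public_cache``, ``lookup_cached_source``, ``make_worker_engine``, ``graph_id`` and ``default_graph_id`` come from the description.

```python
from dataclasses import dataclass
from typing import Optional


class PublicGraphBridge:
    """Hypothetical stand-in for the real bridge: knows exactly one cached URL."""

    def lookup_cached_source(
        self, canonical_url: Optional[str], doi: Optional[str]
    ) -> Optional[dict]:
        if canonical_url == "https://example.org/paper":
            # The real bridge returns a detached CachedSourceImport snapshot;
            # a plain dict is used here only to keep the sketch short.
            return {"canonical_url": canonical_url, "facts": []}
        return None


@dataclass
class WorkerGraphEngine:
    # None is the "skip" signal: no cross-graph work for this engine.
    _public_bridge: Optional[PublicGraphBridge] = None

    def lookup_public_cache(
        self, canonical_url: Optional[str] = None, doi: Optional[str] = None
    ) -> Optional[dict]:
        if self._public_bridge is None:
            return None  # looks like an ordinary cache miss to the caller
        return self._public_bridge.lookup_cached_source(canonical_url, doi)


@dataclass
class WorkerState:
    graph_id: str
    default_graph_id: str

    def make_worker_engine(self) -> WorkerGraphEngine:
        # Only a non-default graph gets a bridge, so the default graph
        # never looks itself up or contributes to itself.
        is_default = self.graph_id == self.default_graph_id
        bridge = None if is_default else PublicGraphBridge()
        return WorkerGraphEngine(_public_bridge=bridge)
```

Because the guard lives in the engine, a caller can invoke ``engine.lookup_public_cache(url)`` unconditionally and treat ``None`` as a miss, whichever graph it is running in.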
## Test plan
- ``libs/kt-graph/tests/test_public_bridge.py`` — 13 unit tests covering:
* Engine pass-throughs no-op when bridge is None (cache miss, import
no-op, contribute no-op).
* Engine delegates to the bridge with the correct session + prefix
when one is wired.
* ``lookup_cached_source`` short-circuits cleanly: missing keys, no
Qdrant, resolver failure → all return ``None``.
* Happy-path lookup with mocked sessions + Qdrant → returns a fully
populated ``CachedSourceImport`` with embeddings on facts and nodes.
* Staleness threshold: zero disables, recent is fresh, ancient is
stale, ``None`` ``retrieved_at`` is fresh (defensive).
Full integration with a real schema + Qdrant ships in PR5 once the
workflow is wired up.
- [x] kt-config: 35 passed
- [x] kt-db: 205 passed
- [x] kt-hatchet: 40 passed
- [x] kt-graph: 86 passed (13 new in test_public_bridge.py)
- [ ] CI all green
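
The staleness cases listed in the test plan (zero disables, recent is fresh, ancient is stale, ``None`` is fresh) could be expressed as one small predicate. The sketch below is an assumption about the shape of that check, not the bridge's actual helper: the function name ``is_stale``, its parameters, the fixed dates, and the choice to treat negative thresholds like zero are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    retrieved_at: Optional[datetime],
    refresh_after_days: int,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when a cache hit should be flagged for refresh."""
    if refresh_after_days <= 0:
        return False  # zero disables the check (negative treated the same here)
    if retrieved_at is None:
        return False  # defensive: an unknown retrieval time counts as fresh
    current = now if now is not None else datetime.now(timezone.utc)
    return current - retrieved_at > timedelta(days=refresh_after_days)


# Illustrative dates only; both datetimes are timezone-aware.
fixed_now = datetime(2026, 4, 8, tzinfo=timezone.utc)
ancient = datetime(2023, 1, 1, tzinfo=timezone.utc)
recent = datetime(2026, 3, 1, tzinfo=timezone.utc)
```

A stale result does not block the hit: per the settings description, the snapshot is still served immediately and only flagged.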
PR4 review (#192) flagged five substantive correctness gaps. All addressed; full test suite still green (kt-graph 95 / kt-hatchet 40).

## Correctness fixes

1. **`_match_or_create_node` no longer over-reports `created` and no longer trusts the remote `node_uuid`** (review #1).
   - The local uuid is now derived from `key_to_uuid(make_node_key(...))` so it never collides with the existing unique index on `write_nodes.node_uuid` when the same concept already exists locally under a different historical id.
   - The insert uses `RETURNING node_uuid` to distinguish a real insert from an ON CONFLICT no-op. On no-op the bridge re-SELECTs the existing local row's `node_uuid` and reports `created=False` so `ImportResult.nodes_matched` increments correctly.
2. **`_load_linked_nodes` array-overlap query now has integration coverage** (review #2). New test file `tests/integration/test_public_bridge_db.py` spins up a real write-db schema and exercises the full SQL surface that the unit tests can't:
   - The `write_nodes.fact_ids && ARRAY[...]` overlap operator, including the concept/entity type filter (perspective rows must NOT match) and an empty-input short-circuit.
   - `_load_linked_facts` / `_load_linked_fact_sources` provenance joins.
   - `_upsert_raw_source` returning the correct local id on both insert and ON CONFLICT branches.
   - `_match_or_create_node` reuse-vs-create branches without Qdrant.
   - `_upsert_fact_source` idempotency under re-imports.
3. **`_upsert_raw_source` now returns the local id, not the remote one** (review #3). Both call sites updated:
   - `import_cached_source` records `result.raw_source_id = local_raw_id` so PR5's workflow code attaches downstream rows to the right id.
   - `contribute_source_and_facts` discards the return; fact_source rows there are keyed on `content_hash`, not the source id.
4. **`make_worker_engine` now refuses to wire a bridge for a non-default graph that's missing its Qdrant collection prefix** (review #4). Empty prefix would silently dedup against the default graph's collection, which is exactly the cross-contamination this whole subsystem exists to prevent. Fail loud at construction rather than discover it in production.
5. **`_upsert_fact_source` now uses a deterministic UUID5** keyed on `(local_fact_id, raw_source_content_hash)` instead of a fresh `uuid.uuid4()` (review #8). Re-imports of the same source into the same target graph become true no-ops without needing a schema-level unique constraint. PR5's workflow should still avoid re-imports; this is defence in depth.

## Smaller things

- **lifespan.py**: extracted `_resolve_default_graph_id()` helper; the duplicated try/except block in `worker_lifespan()` and `build_worker_state()` collapses to one call (review #5).
- **test_public_bridge.py**: comment on the staleness assertion now reflects the actual fixture date (`2023-01-01`, not `2026-01-01`) (review #6).
- **CLAUDE.md**: noted the one-way `kt-hatchet → kt-graph` workspace dep added in PR4 so future contributors don't reverse it (review #9).

## Test plan

- [x] kt-graph: 95 passed (13 unit + 9 new integration on `test_public_bridge_db.py`, plus the existing 73)
- [x] kt-hatchet: 40 passed
- [ ] CI all green
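
To illustrate why the UUID5 change in fix 5 makes re-imports idempotent, here is a minimal sketch of a deterministic id keyed on `(local_fact_id, raw_source_content_hash)`. The namespace value, the helper name `fact_source_id`, and the `"fact:hash"` string layout are assumptions made for the example; the PR text does not show how the bridge actually builds the UUID5 name.

```python
import uuid

# Hypothetical namespace; the namespace used by the real bridge is not shown in the PR.
FACT_SOURCE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example:public-bridge/fact-source")


def fact_source_id(local_fact_id: uuid.UUID, raw_source_content_hash: str) -> uuid.UUID:
    """Derive the same row id every time for the same (fact, source) pair."""
    name = f"{local_fact_id}:{raw_source_content_hash}"
    return uuid.uuid5(FACT_SOURCE_NAMESPACE, name)


fact = uuid.UUID("12345678-1234-5678-1234-567812345678")
first_import = fact_source_id(fact, "sha256:abc")
second_import = fact_source_id(fact, "sha256:abc")   # re-import of the same source
other_source = fact_source_id(fact, "sha256:def")    # same fact, different source
```

With `uuid.uuid4()` the two imports would produce two different primary keys and therefore two rows. With the derived id, the second import targets the row that already exists, so a primary-key conflict clause can turn it into a no-op.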
## Summary

PR4 of the multigraph public-cache series. Lays down the standalone `PublicGraphBridge` module and wires it through `WorkerGraphEngine` + `WorkerState` so PR5's workflow integration is a one-line plumbing change per call site.

## What's new

### `libs/kt-graph/src/kt_graph/public_bridge.py` (new)

Standalone bridge with three entry points:

- `lookup_cached_source(canonical_url, doi)`: read-only query against the default graph's write-db. Returns a detached `CachedSourceImport` snapshot (raw source + facts + fact_sources + concept/entity nodes + their embeddings) so it can cross worker boundaries safely.
- `import_cached_source(snapshot, target_write_session, target_qdrant_prefix)`: writes the snapshot into the target graph. It upserts the raw source by `content_hash`, dedups facts against the target Qdrant fact collection (0.92 atomic / 0.85 compound), mirrors fact_source rows, and concept-similarity-matches each cached node against the target node collection (threshold from `Settings.public_bridge_concept_match_threshold`, default 0.93). Misses create new `write_nodes` rows with deterministic keys. Compensating Qdrant deletes on partial failure.
- `contribute_source_and_facts(raw_source_id, source_write_session, source_qdrant_prefix)`: pushes a freshly-decomposed source + its facts upstream to the default graph. Sources and facts only; node structure is NOT contributed (the public graph runs its own pipeline on the accumulated fact pool). Best-effort, all errors swallowed.

### `libs/kt-graph/src/kt_graph/worker_engine.py`

Three pass-through methods on `WorkerGraphEngine` (`lookup_public_cache`, `import_from_public`, `contribute_to_public`). Each no-ops gracefully when `_public_bridge is None`; that is the universal "skip" signal so workflow code stays free of self-reference guards. The engine fills in its own session + Qdrant prefix on every call.

### `libs/kt-hatchet/src/kt_hatchet/lifespan.py`

`WorkerState` now carries `default_graph_id` (resolved once at startup from `graphs.is_default = TRUE`) and exposes a `make_worker_engine()` factory that constructs a `PublicGraphBridge` only when the current `graph_id` differs from `default_graph_id`. Every worker (ingest/bottomup/nodes/synthesis) inherits cross-graph capability for free through this single factory, with no per-workflow wiring.

### `libs/kt-config/src/kt_config/settings.py`

Two new fields under a new `public_bridge` YAML section:

- `public_bridge_concept_match_threshold` (default 0.93): high on purpose; false matches collapse distinct concepts, which is far worse than the cost of an occasional duplicate that the local dedup pipeline merges later.
- `public_cache_refresh_after_days` (default 365): cache hits older than this still serve immediately but flag the snapshot as stale. The async refresh workflow itself lands in PR7.

### `libs/kt-hatchet/pyproject.toml`

Adds `kt-graph` as a workspace dep. No cycle (kt-graph never imports kt-hatchet; verified with grep).

## Test plan

- `libs/kt-graph/tests/test_public_bridge.py`: 13 unit tests (mocked sessions + Qdrant) covering:
  - Engine pass-throughs no-op when bridge is None (cache miss, import no-op, contribute no-op).
  - Engine delegates to the bridge with the correct session + prefix when one is wired.
  - `lookup_cached_source` short-circuits cleanly: missing keys, no Qdrant, resolver failure → all return `None`.
  - Happy-path lookup returns a fully populated `CachedSourceImport` with embeddings on facts and nodes.
  - Staleness threshold: zero disables, recent is fresh, ancient is stale, `None` `retrieved_at` is fresh (defensive).
- Full end-to-end integration with a real schema + Qdrant lands in PR5 when the bridge is wired into `ingest.py`.
- `uv run --project libs/kt-config pytest libs/kt-config/tests/ -q`: 35 passed
- `uv run --project libs/kt-db pytest libs/kt-db/tests/ -q`: 205 passed
- `uv run --project libs/kt-hatchet pytest libs/kt-hatchet/tests/ -q`: 40 passed
- `uv run --project libs/kt-graph pytest libs/kt-graph/tests/ -q`: 86 passed (13 new)

## PR sequence

- PR1: Provider classification ✅ feat(kt-providers): classify search/fetch providers as public or private #186
- PR2: Canonicalization helpers ✅ feat(kt-providers): canonical URL + DOI helpers for multigraph cache lookup #187
- PR3: Schema migrations ✅ feat(kt-db): graph public cache toggles + raw_source canonical_url/doi #191

🤖 Generated with Claude Code