FD-337: windowed re-extraction + cross-session parent back-fill#124
Conversation
Long/multi-session chat extraction recovered only ~half the family tree (single 200+ statement LLM pass drops people and parent links). Two mechanisms, both measured: - llmutil.gemini_structured: accept an optional `model` override. Fixes a pre-existing FD-333 regression (c8d5c00) that passed model= to a function without the param, crashing EVERY extraction that reached Pass 3 (SARF review) with TypeError. - pdp.extract_full: window long conversations (>WINDOW_SIZE statements) on a clone of the diagram, committing each window so later windows see earlier results as committed context, then re-stage the combined new items as negative-id deltas. Short spans keep the existing single-shot path byte-for-byte. Contract preserved: never commits into the caller's diagram. Window size via FD_WINDOW_SIZE (default 80). - schema.commit_pdp_items: deterministic cross-session parent back-fill — set parents on an already-committed child when a later session emits a birth event naming its couple (the delta-only repair could never reach a committed child). - connectivity_check --accumulate / run_extract_full_f1: null each discussion's persisted cursor for from-empty measurement (a stale "accepted through order N" cursor made the from-empty accumulate extract an empty tail). run_extract_full_f1 also now prints PairBonds + ParentChild precision. Measured (diagram 1924, accumulate 55,58,60): single-shot median ~35% LCC → windowed+backfill median ~84%, recovering ~31 of 32 people. No F1 regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces windowed extraction in extract_full to handle large discussions by chunking statements into windows, along with a cross-session parent backfill mechanism to link committed children to their parents when birth or adoption events are processed later. It also updates evaluation scripts to reset extraction cursors and track precision and recall metrics. The reviewer feedback suggests separating the ID remapping dictionary by entity type in _restage_new_items to avoid ID overlaps, adding an early return in extract_full when there are no statements to extract to save LLM tokens, and adding defensive checks for the presence of the "id" key in dictionary items to prevent potential KeyErrors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| def _restage_new_items(working: DiagramData, original: DiagramData) -> PDP: | ||
| """Re-stage items committed across windows that are absent from the original | ||
| diagram as a fresh negative-id PDP. References into original committed items | ||
| stay positive; references among the new items are remapped to negatives.""" | ||
| orig_people = {p["id"] for p in original.people} | ||
| orig_events = {e["id"] for e in original.events} | ||
| orig_bonds = {pb["id"] for pb in original.pair_bonds} | ||
|
|
||
| new_people = [p for p in working.people if p["id"] not in orig_people] | ||
| new_events = [e for e in working.events if e["id"] not in orig_events] | ||
| new_bonds = [pb for pb in working.pair_bonds if pb["id"] not in orig_bonds] | ||
|
|
||
| next_id = -1 | ||
| id_map: dict[int, int] = {} | ||
| for item in [*new_people, *new_bonds, *new_events]: | ||
| id_map[item["id"]] = next_id | ||
| next_id -= 1 | ||
|
|
||
| def remap(old): | ||
| return id_map.get(old, old) if old is not None else None | ||
|
|
||
| people, bonds, events = [], [], [] | ||
| for p in new_people: | ||
| d = dict(p) | ||
| d["id"] = id_map[p["id"]] | ||
| d["parents"] = remap(p.get("parents")) | ||
| people.append(from_dict(Person, d)) | ||
| for pb in new_bonds: | ||
| d = dict(pb) | ||
| d["id"] = id_map[pb["id"]] | ||
| d["person_a"] = remap(pb.get("person_a")) | ||
| d["person_b"] = remap(pb.get("person_b")) | ||
| bonds.append(from_dict(PairBond, d)) | ||
| for e in new_events: | ||
| d = dict(e) | ||
| d["id"] = id_map[e["id"]] | ||
| for k in ("person", "spouse", "child"): | ||
| d[k] = remap(d.get(k)) | ||
| d["relationshipTargets"] = [remap(t) for t in d.get("relationshipTargets") or []] | ||
| d["relationshipTriangles"] = [ | ||
| remap(t) for t in d.get("relationshipTriangles") or [] | ||
| ] | ||
| for k in ("dateTime", "endDateTime"): | ||
| v = d.get(k) | ||
| if v is not None and hasattr(v, "toString"): | ||
| d[k] = v.toString("yyyy-MM-dd") | ||
| events.append(from_dict(Event, d)) | ||
|
|
||
| return PDP(people=people, events=events, pair_bonds=bonds) |
There was a problem hiding this comment.
When remapping entity IDs, ensure the remapping dictionary corresponds to the correct entity type. Using a single id_map for all entity types (People, PairBonds, Events) can lead to incorrect remapping if there are any ID overlaps or if the ID generation strategy changes in the future. Additionally, add defensive checks to ensure the "id" key exists in the dictionary items before accessing it to prevent potential KeyErrors.
def _restage_new_items(working: DiagramData, original: DiagramData) -> PDP:
"""Re-stage items committed across windows that are absent from the original
diagram as a fresh negative-id PDP. References into original committed items
stay positive; references among the new items are remapped to negatives."""
orig_people = {p["id"] for p in original.people if "id" in p}
orig_events = {e["id"] for e in original.events if "id" in e}
orig_bonds = {pb["id"] for pb in original.pair_bonds if "id" in pb}
new_people = [p for p in working.people if "id" in p and p["id"] not in orig_people]
new_events = [e for e in working.events if "id" in e and e["id"] not in orig_events]
new_bonds = [pb for pb in working.pair_bonds if "id" in pb and pb["id"] not in orig_bonds]
person_id_map: dict[int, int] = {}
bond_id_map: dict[int, int] = {}
event_id_map: dict[int, int] = {}
next_id = -1
for p in new_people:
person_id_map[p["id"]] = next_id
next_id -= 1
for pb in new_bonds:
bond_id_map[pb["id"]] = next_id
next_id -= 1
for e in new_events:
event_id_map[e["id"]] = next_id
next_id -= 1
people, bonds, events = [], [], []
for p in new_people:
d = dict(p)
d["id"] = person_id_map[p["id"]]
d["parents"] = bond_id_map.get(p.get("parents"), p.get("parents")) if p.get("parents") is not None else None
people.append(from_dict(Person, d))
for pb in new_bonds:
d = dict(pb)
d["id"] = bond_id_map[pb["id"]]
d["person_a"] = person_id_map.get(pb.get("person_a"), pb.get("person_a")) if pb.get("person_a") is not None else None
d["person_b"] = person_id_map.get(pb.get("person_b"), pb.get("person_b")) if pb.get("person_b") is not None else None
bonds.append(from_dict(PairBond, d))
for e in new_events:
d = dict(e)
d["id"] = event_id_map[e["id"]]
for k in ("person", "spouse", "child"):
val = d.get(k)
if val is not None:
d[k] = person_id_map.get(val, val)
d["relationshipTargets"] = [person_id_map.get(t, t) for t in d.get("relationshipTargets") or []]
d["relationshipTriangles"] = [
person_id_map.get(t, t) for t in d.get("relationshipTriangles") or []
]
for k in ("dateTime", "endDateTime"):
v = d.get(k)
if v is not None and hasattr(v, "toString"):
d[k] = v.toString("yyyy-MM-dd")
events.append(from_dict(Event, d))
return PDP(people=people, events=events, pair_bonds=bonds)References
- When remapping entity IDs, ensure the remapping dictionary corresponds to the correct entity type. For instance, do not remap a PairBond ID using a Person ID map.
There was a problem hiding this comment.
Applied the per-type maps in 2d900b2 — matches the existing reassign_delta_ids convention in this file. For the record it wasn't a correctness bug: commit_pdp_items assigns globally-unique ids from a single counter, so the combined map had no overlaps. Declining the defensive id guards — committed items always carry an id, and per our fail-early policy a missing one should raise, not be silently skipped.
| to_extract = [s for s in ordered if (s.order or 0) > cursor] | ||
| prior_committed = [s for s in ordered if (s.order or 0) <= cursor] | ||
|
|
||
| if len(to_extract) <= WINDOW_SIZE: |
There was a problem hiding this comment.
If to_extract is empty (meaning all statements in the discussion have already been extracted and committed), we can return early with empty PDP and PDPDeltas objects. This avoids unnecessary LLM calls, saving API tokens and execution time.
if not to_extract:
diagram_data.pdp = PDP()
return PDP(), PDPDeltas()
if len(to_extract) <= WINDOW_SIZE:There was a problem hiding this comment.
Applied in 2d900b2. It also avoids the degenerate empty-tail extraction (the model emitting spurious items when nothing is past the cursor), so good catch.
| def _backfill_committed_parents(self) -> None: | ||
| """Cross-session parent back-fill: when a birth/adopted event names a | ||
| couple as the parents of an already-committed child whose `parents` is | ||
| still null, set it. The committed-data analogue of | ||
| pdp.infer_parents_from_birth_events, which only reaches delta people — | ||
| so a child committed in an earlier session is otherwise unreachable when | ||
| a later session states its parentage.""" | ||
| bond_by_dyad: dict[tuple[int, int], int] = {} | ||
| for pb in self.pair_bonds: | ||
| a, b = pb.get("person_a"), pb.get("person_b") | ||
| if a is not None and b is not None: | ||
| bond_by_dyad[tuple(sorted((a, b)))] = pb["id"] | ||
| people_by_id = {p["id"]: p for p in self.people} | ||
| for e in self.events: | ||
| kind = e.get("kind") | ||
| kind = kind.value if isinstance(kind, EventKind) else kind | ||
| if kind not in (EventKind.Birth.value, EventKind.Adopted.value): | ||
| continue | ||
| person, spouse, child = e.get("person"), e.get("spouse"), e.get("child") | ||
| if person is None or spouse is None or child is None: | ||
| continue | ||
| bond_id = bond_by_dyad.get(tuple(sorted((person, spouse)))) | ||
| if bond_id is None: | ||
| continue | ||
| ch = people_by_id.get(child) | ||
| if ch is None or ch.get("parents") is not None: | ||
| continue | ||
| ch["parents"] = bond_id | ||
| _log.info( | ||
| f"backfill_committed_parents: committed Person {child} " | ||
| f"parents={bond_id} from birth event {e.get('id')}" | ||
| ) | ||
|
|
There was a problem hiding this comment.
Add defensive checks to ensure the "id" key exists in the dictionary items before accessing it. This prevents potential KeyErrors if any person, event, or pair bond in the diagram data is malformed or missing an ID.
def _backfill_committed_parents(self) -> None:
"""Cross-session parent back-fill: when a birth/adopted event names a
couple as the parents of an already-committed child whose `parents` is
still null, set it. The committed-data analogue of
pdp.infer_parents_from_birth_events, which only reaches delta people —
so a child committed in an earlier session is otherwise unreachable when
a later session states its parentage."""
bond_by_dyad: dict[tuple[int, int], int] = {}
for pb in self.pair_bonds:
a, b = pb.get("person_a"), pb.get("person_b")
if a is not None and b is not None and "id" in pb:
bond_by_dyad[tuple(sorted((a, b)))] = pb["id"]
people_by_id = {p["id"]: p for p in self.people if "id" in p}
for e in self.events:
kind = e.get("kind")
kind = kind.value if isinstance(kind, EventKind) else kind
if kind not in (EventKind.Birth.value, EventKind.Adopted.value):
continue
person, spouse, child = e.get("person"), e.get("spouse"), e.get("child")
if person is None or spouse is None or child is None:
continue
bond_id = bond_by_dyad.get(tuple(sorted((person, spouse))))
if bond_id is None:
continue
ch = people_by_id.get(child)
if ch is None or ch.get("parents") is not None:
continue
ch["parents"] = bond_id
_log.info(
f"backfill_committed_parents: committed Person {child} "
f"parents={bond_id} from birth event {e.get('id')}"
)There was a problem hiding this comment.
Declining: committed people/events/pair_bonds always have an id assigned by commit_pdp_items. A committed item missing one is an invariant violation elsewhere; per our fail-early / no-bandaid policy we let it raise rather than silently skip.
…y return on empty tail - _restage_new_items: separate person/bond/event id maps (matches reassign_delta_ids convention; removes coupling to the global-unique-id invariant) - extract_full: early return when nothing is past the cursor, avoiding a wasted LLM call and the degenerate empty-tail extraction Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ride - test_commit: committed-child parent back-fill + no-overwrite guard - test_extract_full: windowing runs per-window and preserves the caller's committed diagram; _restage_new_items keeps committed refs positive and remaps new refs negative; gemini_structured accepts model= (FD-333 crash guard) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FD-337: Re-extraction for long/multi-session conversations
Corrected diagnosis (the ticket's "stuck at 25-30%" was a measurement bug)
The from-empty
connectivity_check --accumulatehonored each discussion's persisted re-extraction cursor. Diagram 1924's discussions carry a stale "accepted through order 200" cursor, so the from-empty accumulate extracted an empty tail and the model produced a degenerate fragment → the ~25-30% figure. It was never the extraction that was stuck; it was the measurement.The real defect is still real: the actual stored diagram 1924 is 32 people in 14 components — 56% connected. Production's incremental extraction (with the Pass-3 crash below live) left it fragmented.
What this PR does
llmutil.gemini_structured) — accepts optionalmodel=. A pre-existing regression (FD-333, c8d5c00) crashed every extraction reaching Pass 3 (SARF review) withTypeError.pdp.extract_full) — conversations >WINDOW_SIZE(default 80, envFD_WINDOW_SIZE) are extracted in windows on a clone, each window committed so later windows see it as context, then re-staged as negative-id deltas. Short spans keep the single-shot path byte-for-byte. Contract preserved — never commits into the caller's diagram (verified by test).schema.commit_pdp_items) — setsparentson an already-committed child when a later session emits a birth event naming its couple.connectivity_check/run_extract_full_f1null the persisted cursor for from-empty measurement; F1 runner now prints PairBonds + ParentChild precision.Honest attribution (accumulate 55,58,60, real implementation)
Windowing earns its place on completeness: single-shot recovers only ~22 of 32 people; windowing recovers ~31. LCC% is a ratio that hides this — the denominator grows with completeness.
Acceptance status — borderline, stated plainly
15-run windowed distribution:
53, 69, 75, 77, 77, 78, 80, 80, 81, 83, 83, 84, 84, 84, 90.So the AC's ≥80% is met on the median only; the mean is just under and the floor is not uniformly ≥80%.
Floor limitation (why it isn't higher) — and the ceiling
The sub-80 runs drop a real family member's bridge — e.g. the proband's sibling's spouse (Vance+Meredith) extracted as a free-floating couple, or Jim+Sheila's family correctly internally-linked but not joined to the proband. These are missing relationship links the model omits run-to-run. The obvious levers are ruled out by the ticket: proband-linking prompt directives (tried, caused dangling-reference crashes), name-matching dedup (forbidden), extra structural-completion passes (tried, regressed F1).
The bridges exist — they're variance-distributed. A 6-run consensus experiment (merge the union of repeated windowed extractions) reaches 94.9% connectivity (plain LCC) vs a single-run median of ~84% — each run misses a different subset of bridges, and the union recovers nearly all of them. So the floor is achievable via multi-sample consensus at ~K× extraction cost — viable as an async "deep re-extraction," not real-time. That's a product decision (is K× latency acceptable for a high-fidelity re-extraction mode?), so it's not in this PR. This confirms the floor is a model-variance limitation, not a code gap.
No F1 regression
Aggregate ~0.65, ParentChild F1 ~0.80 (precision reported), People ~0.93. On baseline.
Companion: fdserver#24.
🤖 Generated with Claude Code