Skip to content

Add SelfProteome — nearest-self lookup architecture (part A of #124)#135

Merged
iskandr merged 3 commits into
masterfrom
add-self-proteome-nearest
Apr 16, 2026
Merged

Add SelfProteome — nearest-self lookup architecture (part A of #124)#135
iskandr merged 3 commits into
masterfrom
add-self-proteome-nearest

Conversation

@iskandr
Copy link
Copy Markdown
Contributor

@iskandr iskandr commented Apr 15, 2026

Summary

Part A of #124 (cross-reactivity analysis via nearest-self lookups). Ships the `SelfProteome` class + its integration with `TopiaryPredictor`, so downstream code can start consuming `self_nearest_*` columns in real predictor runs.

Scope is deliberately minimal — one nearest-by-sequence axis, substitutions only, `scope="all"` and `scope="non_cta"` — to get the architecture in and reviewed. The richer pieces queued for follow-up PRs are listed below.

What ships

  • `topiary/self_proteome.py` — `SelfProteome` class with three constructors (`from_peptides`, `from_fasta`, `from_ensembl`).
  • SIMD-vectorized Hamming-distance nearest search against int8-encoded reference arrays per peptide length. Chunked for memory bound.
  • `scope="all"` and `scope="non_cta"` (defaults for human via pirlygenes; non-human must supply `cta_source` explicitly; unsupported combos raise with an actionable message). Callable `scope=` accepted as the universal escape hatch.
  • Composite `reference_version` stamped on every output row; custom filters hash into the string for reproducibility.
  • `TopiaryPredictor(self_proteome=…)` kwarg attaches `self_nearest_*` columns to prediction output before filter/sort evaluate, so DSL expressions can reference them.
  • New `docs/self_proteome.md` page + nav entry.

What's deferred (tracked on #124)

  • `scope="protected_tissues"` (HPA / GTEx-backed tissue filtering).
  • 1aa insertion / deletion candidates.
  • The second + third "nearest" axes: `self_nearest_by_binding` and `self_strongest_nearby`.
  • `self_nearest_candidates` structured column.
  • Seed-and-extend algorithm — benchmark decides the default.
  • Bundled prediction artifacts for the self-proteome × common HLA pairs.

Test plan

  • 21 new tests in `tests/test_self_proteome.py` (construction, FASTA parsing, nearest lookup, scope resolution error paths, TopiaryPredictor integration).
  • `./test.sh` — 1132 passed, 3 skipped (up from 1111).
  • `./lint.sh` — clean.
  • `mkdocs build --strict` — clean.
  • CI green.

Part A of #124.

@coveralls
Copy link
Copy Markdown

coveralls commented Apr 15, 2026

Coverage Status

coverage: 88.201% (-0.001%) from 88.202% — add-self-proteome-nearest into master

iskandr added a commit that referenced this pull request Apr 15, 2026
- Document tie-breaking semantics in SelfProteome.nearest docstring:
  when multiple reference peptides share the minimum Hamming distance,
  the first in internal array order wins.  Deterministic but
  implementation-dependent; the full tied set becomes available via
  the self_nearest_candidates column in part B.
- Add TestEnsemblHappyPath with a mocked pyensembl genome covering
  (a) CTA filtering actually drops proteins from the excluded genes
  and (b) from_ensembl's nearest lookup returns real gene/transcript
  provenance end-to-end (not just error paths).
- Add TestTieBreaking pinning the documented tie-breaking rule so a
  future refactor can't silently change which reference wins when
  multiple candidates are equidistant.

3 new tests (24 total for self_proteome, up from 21; full suite 1135
pass from 1132).  Lint clean.
@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

Vaxrank-consumer review

Following up from my comment on #124 now that this lands a concrete architecture. The shape is good for vaxrank — some specific preferences below on candidate surfacing and non-human handling.

What works for vaxrank out of the box

  • from_peptides(dict, peptide_lengths=...) means vaxrank can feed its own user-supplied pyensembl proteome in without pulling a second bundled artifact. Good — this was my main Computed self_nearest_* columns against a curated non-CTA proteome #124 ask. (Vaxrank's principle is "user picks the reference"; anything that makes the bundled human Ensembl default the only path would be a non-starter.)
  • self_nearest_* columns attached before filter_by/sort_by. Enables Column("self_nearest_edit_distance") >= 3 as a direct filter expression. Vaxrank's current occurs_in_reference boolean maps cleanly to self_nearest_edit_distance == 0.
  • reference_version stamping. Matches the predictor_version pattern from CachedPredictor — nice consistency for reproducibility artifacts.

Which peptides to surface when there are ties — the main ask

For cross-reactivity risk assessment, the single-closest peptide (current self_nearest_peptide scalar) is insufficient. Three concrete scenarios:

  1. Distance-1 has many hits. A mutant 9-mer with five distance-1 matches across testis, pancreas, and heart has a very different risk profile than one with a single distance-1 match in testis. Surfacing only the first-in-array match hides this.
  2. Near-misses matter. Distance-1 vs distance-2 is often biologically the boundary between "probably cross-reactive" and "probably tolerated." A caller tuning a filter needs to see the distance-2 cloud, not just the minimum.
  3. Tie-breaking by binding, not insertion order. Current deterministic-but-array-order tie-break makes self_nearest_peptide depend on construction order of the reference. A user who rebuilds the proteome from a different FASTA order gets different "nearest" answers — silently. That's a reproducibility footgun.

My preferences, in priority order:

A. Cheap scalar companions to self_nearest_peptide — land in this PR.

  • self_nearest_n_at_min_distance (int): how many reference peptides tied at self_nearest_edit_distance. Tells the caller at a glance whether self_nearest_peptide represents one of many or a unique match. Trivial to compute — already have the diffs == best_dist mask.
  • self_nearest_is_unique (bool): convenience derivative of the above (== 1).

B. Structured candidate column — follow-up PR (already queued per the scope note).

  • self_nearest_candidates: list of dicts (or a nested DataFrame) with up to max_candidates_per_query entries, sorted by (distance asc, then a caller-controlled tiebreaker). Including threshold_distance to cap the list prevents explosion at large radii.
  • Default tiebreaker: binding strength at the query's allele if available on the row; fall back to gene_id lexicographic for determinism. Reference-array insertion order should only be the final fallback.
  • Exposing why a specific candidate was chosen (the tiebreaker that won) in the structured record is valuable for audit — e.g., {"tiebreaker": "min_affinity"}.

C. Binding-aware axis is the right eventual shape. Your three-axis plan (sequence-nearest, binding-similar, strongest-binder) captures the right decomposition. For vaxrank, self_strongest_nearby within a distance radius is probably the most load-bearing — a self peptide that is a strong MHC binder at the same allele within Hamming distance 2 is the high-risk signal, more so than the single nearest-by-sequence. Worth prioritizing self_strongest_nearby over self_nearest_by_binding for a follow-up PR.

Non-human species

Current "explicit cta_source or raise" default for non-human is the right call — silent unfiltered results would be a misleading cross-reactivity signal. Some small asks:

  1. Species-defaults registry as a public hook. _SPECIES_DEFAULTS is private. Expose SelfProteome.register_species_defaults(species, cta_source=...) so downstream projects (vaxrank, mhcnuggets, lab-specific forks) can install their own mouse/rat/macaque defaults without monkey-patching. The _resolve_cta_gene_ids logic already supports set / callable cta_source; this is just the "how do I wire a default for my species" story.
  2. Error message should point users somewhere. The current ValueError says "no default registered." For a user running species="mouse" that's a dead end. A sentence like "For mouse, consider passing an explicit set from [TCGA CT gene list / Almeida et al. 2009 / ...] or use scope='all'" gives the user a next step. Even "there's no curated list we recommend; use scope='all'" is informative.
  3. Pin species naming to pyensembl's conventions. species="human" vs species="homo_sapiens" vs pyensembl's species lookup — worth a docstring line making the accepted forms explicit. Otherwise subtle typo bugs ("humans", "h_sapiens") will surface as confusing pyensembl errors downstream.
  4. scope="all" should log once at construction, not silently, when the species has no registered CTA default and no cta_source= was passed. Not raise — warn. Gives the user a visible "you're running unfiltered; if this species has published CTAs, consider supplying them" nudge.

Sequencing for vaxrank adoption

Still behind #123 (predict_wt) and #126 (evaluate_scores). When vaxrank adopts this:

  • Swap vaxrank/reference_proteome.py's ReferenceProteome.contains() for a SelfProteome.from_peptides({...}) construction, feed vaxrank's user-supplied pyensembl proteome in.
  • Drop the occurs_in_reference boolean in favor of self_nearest_edit_distance == 0 for the existing "peptide already in reference" filter.
  • Expose the fuller self_nearest_* set in vaxrank's report output as a new risk signal.

None of that is urgent — but the architecture you're shipping here is compatible with it, and items A.1/A.2 above would pay off for vaxrank even before the three-axis story lands.

@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

Correction on candidate surfacing (supersedes §2 of my earlier comment)

Re-reading my own take, I anchored on "pick a better single scalar" when the biologically principled primitive is the full candidate set. Reframing:

Primary primitive — the rich candidate set

What vaxrank (and honestly any cross-reactivity consumer) wants first is all nearby self peptides with enough biological context to reason about them. Concretely, a method like:

SelfProteome.all_near(
    peptides,
    *,
    max_distance=2,        # Hamming threshold; callers pick the biology
    flanking_width=10,     # residues on each side of the match
    max_per_query=None,    # optional cap to bound memory on pathological cases
) -> pd.DataFrame          # one row per (query_peptide, self_peptide) pair

Returning one row per (query, match) pair with columns:

Column Why it matters
peptide, self_peptide, self_peptide_length, edit_distance The match itself
self_gene_id, self_transcript_id, self_reference_offset Provenance — lets callers join to expression, tissue, isoform data
self_flanking_upstream, self_flanking_downstream (length flanking_width) Proteasomal cleavage depends on flanks. A distance-1 match whose flanks are totally different from the query's source context is a much weaker cross-reactivity signal than one where both the peptide and its surrounding residues conserve. Without flanks the caller can't distinguish these.
self_reference_version Reproducibility stamp (already in the PR)

Flanking context is the piece I think is most important to land in this primitive — it can't be reconstructed by the caller from gene_id + offset alone without re-reading the proteome, defeating the purpose of the index. The proteome already has the protein sequences in hand; surfacing [offset - flanking_width : offset] and [offset + L : offset + L + flanking_width] is nearly free.

Derived accessors — different "nearest" semantics as filters on the candidate set

Once the candidate set exists, different "nearest" definitions are just different group-by reductions over it:

candidates = ref.all_near(peptides, max_distance=2, flanking_width=10)

# Distance-based (what current self_nearest_peptide returns)
nearest_by_distance = candidates.sort_values(
    ["peptide", "edit_distance"]
).groupby("peptide").first()

# Binding-based (the self_strongest_nearby axis) — needs binding predictions
# on self_peptide at the query's allele; join to them, then reduce
nearest_by_binding = candidates.merge(self_preds, ...).sort_values(
    ["peptide", "affinity"]
).groupby("peptide").first()

# Flank-conservation (requires caller-side query flanks too)
nearest_by_flanks = score_flank_identity(candidates, query_context) \
    .sort_values(["peptide", "flank_identity"], ascending=[True, False]) \
    .groupby("peptide").first()

If topiary wants these prepackaged, they become thin helpers on SelfProteome or TopiaryPredictor — but the key is they're derivations over the same primitive, not independently-computed axes. That keeps the expensive proteome scan to one pass.

What that means for this PR

  • The current nearest() method stays as a convenience (equivalent to all_near(max_distance=max_mutant_length).groupby("peptide").first() with a documented tie-break). Predictor integration attaches scalar self_nearest_* columns as today — fine for DSL use.
  • The new primitive all_near() ships alongside, and the TopiaryPredictor integration grows an optional self_proteome_mode="all" (or similar) that attaches the candidate set as a separate result artifact — conceptually TopiaryResult.self_candidates or a list-column rather than stuffing N rows into the flat predictions df.
  • Tie-breaking in the scalar path gets less load-bearing once callers have the candidate set: they can apply their own tiebreaker from the full set.

Flanking context specifically — biological rationale

Quick sketch of why I'd push for flanks even in the scalar/default path:

  1. Proteasomal cleavage depends on residues flanking the MHC-bound peptide. A "nearest self" match at distance 1 that sits in a protein where the immediate flanks would never be cleaved to produce that peptide is essentially not a presentable self peptide — low cross-reactivity risk.
  2. TCR cross-reactivity is driven by the MHC-facing residues of the peptide itself, but the TCR's clonal history is shaped by which self peptides the thymus presented — which is determined by flanking-dependent processing. Flanks inform whether the self match would ever have been seen by central tolerance.
  3. Vaxrank's current ReferenceProteome.contains(peptide) check is flank-blind and we've long suspected it's too coarse. The replacement should do better, and flanks are the obvious upgrade.

The default flanking_width=10 is a common choice in cleavage-prediction tools (NetChop, NetMHCpan-4.1 context features). Configurable, but having a sensible default surfaces the signal for callers who don't know to ask for it.

Non-human asks still stand

Species-defaults registry hook, better error message pointing users somewhere, pyensembl species-name pinning, and a warn-once on silent scope="all" — all independent of the candidate-surfacing redesign. Keep those.

@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

Code-review pass (distinct from the two design comments above)

Line-level issues from reading the diff. A few are blockers IMO; others are nits.

Blocker-grade

1. Memory scaling of the distance tensor — likely unusable on a full human proteome today.

topiary/self_proteome.py around line 240:

diffs = (
    q_chunk[:, None, :] != ref_arr[None, :, :]
).sum(axis=2)

allocates a (chunk_size, M, L) boolean tensor. For human with default peptide_lengths=(8,9,10,11), M for 9-mers after dedup on the non-CTA proteome is ~8–10M. At chunk_size=1000, L=9 that's ~8–10 GB of bools allocated per iteration. Even chunk_size=100 is ~1 GB. This will OOM on typical workstations for the headline use case this PR documents in the README.

Two ways forward:

  • Band the reference too. Chunk both queries and reference: for q_chunk in queries; for r_chunk in reference: — keeps the working tensor at chunk_size × ref_chunk_size × L regardless of total M. Keeps the running argmin across reference chunks.
  • Exploit the "first-mismatch stops counting" structure with an early-exit loop per query when best_dist == 0. For a self proteome search where most queries either hit distance 0 or bail out small, this is a large win.

Either is a real change — noting so you can decide whether to block this PR on it or ship with a documented max-queries / subset-proteome constraint + tracking issue. At a minimum, chunk_size=1000 is not a safe default and should be lowered (or auto-sized from M) before shipping.

2. repr(callable) in the reference-version hash is unstable across runs.

_apply_fasta_scope (line ~410) and _resolve_ensembl_scope (line ~440):

label = "callable-" + hashlib.sha256(repr(scope).encode()).hexdigest()[:12]

For a lambda or local def, repr(scope) includes the memory address: <function <lambda> at 0x10abc1230>. Two runs of identical user code get different labels, so reference_version silently differs between runs — breaking the reproducibility invariant the stamp is meant to guarantee. Options:

  • Hash inspect.getsource(scope) when available; fall back to a warning + random label with a "this run's label won't match any other" note.
  • Hash scope.__code__.co_code + co_consts — stable per-source-identical-lambda across interpreter runs (though not across Python versions).
  • Require the caller to pass a stable scope_label= alongside a callable, and raise if missing.

I'd pick the third (explicit > implicit) given reproducibility is the whole point of the string.

Worth fixing before merge

3. Tie-breaking is deterministic-by-dict-insertion-order, which drifts when the reference source iterator changes.

_build_index accumulates peptides into pep_dict = {L: {}} in the order records yields them, then _resolve_length's argmin returns the first minimum. So self_nearest_peptide in a tie is "whichever reference peptide _iter_ensembl_proteins / _parse_fasta happened to see first." Rebuild the proteome from a differently-ordered FASTA → different nearest returned, even with identical reference_version.

Fix suggestion: once you have the best_dist mask, pick the tie-winner by a content-deterministic rule (e.g., lexicographic min of reference peptides in the tied set). ~3 extra lines:

tied_mask = diffs[np.arange(len(q_chunk))[:, None], np.arange(diffs.shape[1])] == best_dist[:, None]
# among tied, pick the row whose peptide sorts first
for k in range(len(q_chunk)):
    tied_peps = [ref_peps[j] for j in np.where(tied_mask[k])[0]]
    winner = min(tied_peps)
    ...

(That's O(ties) per query, negligible compared to the full NN scan.) Happy to see a different stable rule — but order-of-iteration is the wrong one.

4. _iter_ensembl_proteins silently drops proteins whose gene_id_of_protein_id or protein_sequence fails.

try:
    gene_id = genome.gene_id_of_protein_id(protein_id)
except ValueError:
    continue

For a user assembling their first self-proteome, "I loaded Ensembl release X and got Y peptides" is opaque when N proteins were silently dropped. Log a count or accumulate to self._build_warnings, surface on SelfProteome as a property, or at minimum logger.info("skipped %d proteins with no gene mapping", n_skipped) at the end of the iterator.

Nits

5. Provenance collects every occurrence but nearest() only surfaces prov[0].

_build_index builds provenance[pep] = [(gene_id, transcript_id, offset), ...] with one entry per occurrence (paralogs, duplicated peptides). In _resolve_length:

prov = self._provenance.get(ref_pep)
if prov:
    gene_id, transcript_id, offset = prov[0]

Drops N-1 entries on the floor. This is the same "the full candidate set is already in hand" observation from my design comment above — the fix there is to surface all_near() as a first-class primitive; meanwhile, the dropped data is at least wasting memory.

6. from_peptides takes peptides_by_source: Dict[str, str] but the parameter name reads like "peptides" (short sequences) when the values are full protein sequences.

Minor, but protein_sequences_by_source or proteins_by_source reads more accurately and avoids confusion with query peptides. Docstring already says "amino_acid_sequence" which is clearer.

7. nearest([])pd.DataFrame([]) returns a frame with zero columns, which will break any downstream .merge(nearest, on="peptide") or Column("self_nearest_*") filter that assumes the columns exist. Either construct an empty frame with the canonical column schema, or short-circuit _maybe_attach_self_nearest to not call nearest() on empty input.

Overall

The architecture is right; the hamming-distance routine needs either a two-axis chunking or a rewrite before this is usable on the documented headline workflow (human non-CTA). The reproducibility-stamp bug (#2) also seems worth fixing before ship since it undermines the feature's own claimed guarantee. Everything else is polish.

iskandr added 2 commits April 15, 2026 21:53
SelfProteome holds a species-tagged, scope-filtered reference protein
corpus indexed by peptide length and answers per-query nearest-
neighbor lookups.  Plugs into TopiaryPredictor via a new
self_proteome= kwarg; result DataFrame gains self_nearest_peptide /
_peptide_length / _edit_distance / _gene_id / _transcript_id /
_reference_offset / _reference_version columns.  Columns are joined
before filter_by / sort_by so they can participate in DSL expressions.

This PR ships the architecture with one axis (sequence-nearest) and
substitutions only.  Follow-ups queued on #124:
  - scope="protected_tissues" + HPA/GTEx tissue filter
  - 1aa insertion / deletion candidates
  - self_nearest_by_binding / self_strongest_nearby second + third axes
  - self_nearest_candidates structured column
  - Seed-and-extend algorithm once the benchmark decides

## Surface

- SelfProteome.from_peptides(dict, peptide_lengths=...) — test helper
  for in-memory reference sets.
- SelfProteome.from_fasta(path, scope=...) — FASTA loader, scope
  limited to "all" or a callable (no gene metadata in FASTA).
- SelfProteome.from_ensembl(species, release, scope=..., cta_source=...)
  — pyensembl-backed loader.  Defaults to scope="non_cta" for human,
  stripping CTA genes via the existing topiary.sources pirlygenes
  integration.  Non-human species must pass cta_source explicitly
  when using "non_cta"; a clear ValueError fires otherwise.
- reference_version property composes an "ensembl-{species}-{release}+
  scope-{scope}+..." string that gets stamped on every output row.
  Custom filters (sets or callables) hash into the string so
  reproducibility holds even without a stable label.
- nearest(peptides) returns a DataFrame with one row per query,
  preserving input order.  Queries whose length isn't represented in
  the reference get None rows rather than raising.

## Algorithm

SIMD-vectorized Hamming distance against int8-encoded (M, L) arrays
per peptide length, chunked for memory bound.  Substitutions only
today; indels queued alongside the seed-and-extend alternative so the
benchmark on #124 can decide the default.

## Tests

21 new tests in tests/test_self_proteome.py covering construction
(from_peptides, from_fasta, short-sequence edge case), nearest
lookup (exact match, 1 sub, ordering, length mismatch, provenance,
reference-version stamping, mixed lengths), scope resolution
(non-human non_cta error paths, pirlygenes species guard, unknown
scope rejection, custom-set CTA), and TopiaryPredictor integration.
Full suite 1132 passed (up from 1111), lint clean, mkdocs --strict
clean.

## Docs

- docs/self_proteome.md (new page) covers the full surface, scope
  matrix, non-human guidance, reference-version semantics, and
  algorithm caveats.  Linked in mkdocs.yml nav between Cached
  Predictions and Ranking DSL.
- Document tie-breaking semantics in SelfProteome.nearest docstring:
  when multiple reference peptides share the minimum Hamming distance,
  the first in internal array order wins.  Deterministic but
  implementation-dependent; the full tied set becomes available via
  the self_nearest_candidates column in part B.
- Add TestEnsemblHappyPath with a mocked pyensembl genome covering
  (a) CTA filtering actually drops proteins from the excluded genes
  and (b) from_ensembl's nearest lookup returns real gene/transcript
  provenance end-to-end (not just error paths).
- Add TestTieBreaking pinning the documented tie-breaking rule so a
  future refactor can't silently change which reference wins when
  multiple candidates are equidistant.

3 new tests (24 total for self_proteome, up from 21; full suite 1135
pass from 1132).  Lint clean.
@iskandr iskandr force-pushed the add-self-proteome-nearest branch from 4c2668f to c63fe06 Compare April 16, 2026 01:54
@iskandr iskandr merged commit 2d14f1e into master Apr 16, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants