Add SelfProteome: nearest-self lookup architecture (part A of #124) #135
- Document tie-breaking semantics in SelfProteome.nearest docstring: when multiple reference peptides share the minimum Hamming distance, the first in internal array order wins. Deterministic but implementation-dependent; the full tied set becomes available via the self_nearest_candidates column in part B.
- Add TestEnsemblHappyPath with a mocked pyensembl genome covering (a) CTA filtering actually drops proteins from the excluded genes and (b) from_ensembl's nearest lookup returns real gene/transcript provenance end-to-end (not just error paths).
- Add TestTieBreaking pinning the documented tie-breaking rule so a future refactor can't silently change which reference wins when multiple candidates are equidistant.

3 new tests (24 total for self_proteome, up from 21; full suite 1135 pass from 1132). Lint clean.
Vaxrank-consumer review

Following up from my comment on #124 now that this lands a concrete architecture. The shape is good for vaxrank; some specific preferences below on candidate surfacing and non-human handling.

What works for vaxrank out of the box

Which peptides to surface when there are ties: the main ask

For cross-reactivity risk assessment, the single-closest peptide (current

My preferences, in priority order:

A. Cheap scalar companions to

B. Structured candidate column: follow-up PR (already queued per the scope note).

C. Binding-aware axis is the right eventual shape. Your three-axis plan (sequence-nearest, binding-similar, strongest-binder) captures the right decomposition. For vaxrank,

Non-human species

Current "explicit

Sequencing for vaxrank adoption

Still behind #123 (predict_wt) and #126 (evaluate_scores). When vaxrank adopts this:

None of that is urgent, but the architecture you're shipping here is compatible with it, and items A.1/A.2 above would pay off for vaxrank even before the three-axis story lands.
Correction on candidate surfacing (supersedes §2 of my earlier comment)

Re-reading my own take, I anchored on "pick a better single scalar" when the biologically principled primitive is the full candidate set. Reframing:

Primary primitive: the rich candidate set

What vaxrank (and honestly any cross-reactivity consumer) wants first is all nearby self peptides with enough biological context to reason about them. Concretely, a method like:

    SelfProteome.all_near(
        peptides,
        *,
        max_distance=2,       # Hamming threshold; callers pick the biology
        flanking_width=10,    # residues on each side of the match
        max_per_query=None,   # optional cap to bound memory on pathological cases
    ) -> pd.DataFrame         # one row per (query_peptide, self_peptide) pair

Returning one row per (query, match) pair with columns:
Flanking context is the piece I think is most important to land in this primitive; it can't be reconstructed by the caller from

Derived accessors: different "nearest" semantics as filters on the candidate set

Once the candidate set exists, different "nearest" definitions are just different group-by reductions over it:

    candidates = ref.all_near(peptides, max_distance=2, flanking_width=10)

    # Distance-based (what current self_nearest_peptide returns)
    nearest_by_distance = candidates.sort_values(
        ["peptide", "edit_distance"]
    ).groupby("peptide").first()

    # Binding-based (the self_strongest_nearby axis): needs binding predictions
    # on self_peptide at the query's allele; join to them, then reduce
    nearest_by_binding = candidates.merge(self_preds, ...).sort_values(
        ["peptide", "affinity"]
    ).groupby("peptide").first()

    # Flank-conservation (requires caller-side query flanks too)
    nearest_by_flanks = (
        score_flank_identity(candidates, query_context)
        .sort_values(["peptide", "flank_identity"], ascending=[True, False])
        .groupby("peptide")
        .first()
    )

If topiary wants these prepackaged, they become thin helpers on

What that means for this PR
Flanking context specifically: biological rationale

Quick sketch of why I'd push for flanks even in the scalar/default path:

The default

Non-human asks still stand

Species-defaults registry hook, better error message pointing users somewhere, pyensembl species-name pinning, and a warn-once on silent
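
To make the candidate-set idea in this comment concrete, here is a minimal brute-force sketch of what an `all_near`-style primitive and one reduction over it might look like. Everything in it is illustrative: `all_near_toy`, the `proteins` dict, and the column names (`self_peptide`, `protein_id`, `offset`, `n_flank`, `c_flank`) are assumptions for the example, not the API this PR ships or the column set proposed above.

```python
import pandas as pd

COLUMNS = ["peptide", "self_peptide", "edit_distance",
           "protein_id", "offset", "n_flank", "c_flank"]


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def all_near_toy(peptides, proteins, max_distance=2, flanking_width=10):
    """Brute-force stand-in for the proposed all_near (hypothetical helper).

    `proteins` maps a protein ID to its sequence. Returns one row per
    (query peptide, self-peptide occurrence) pair within `max_distance`.
    """
    rows = []
    for pep in peptides:
        k = len(pep)
        for protein_id, seq in proteins.items():
            for offset in range(len(seq) - k + 1):
                self_pep = seq[offset:offset + k]
                dist = hamming(pep, self_pep)
                if dist <= max_distance:
                    rows.append({
                        "peptide": pep,
                        "self_peptide": self_pep,
                        "edit_distance": dist,
                        "protein_id": protein_id,
                        "offset": offset,
                        # flanks are clipped at the protein ends
                        "n_flank": seq[max(0, offset - flanking_width):offset],
                        "c_flank": seq[offset + k:offset + k + flanking_width],
                    })
    return pd.DataFrame(rows, columns=COLUMNS)


# Toy reference: two made-up protein sequences
proteins = {"P1": "MKTAYIAKQRQISFVK", "P2": "GGSIINFEKLGG"}
candidates = all_near_toy(["SIINFEKL", "AYIAKQRA"], proteins,
                          max_distance=2, flanking_width=3)

# Distance-based reduction: sort, then keep the first whole row per query.
# The extra "self_peptide" sort key makes ties resolve the same way every run.
nearest_by_distance = (
    candidates.sort_values(["peptide", "edit_distance", "self_peptide"])
    .drop_duplicates("peptide")
    .set_index("peptide")
)

# A cheap scalar companion: how many candidates each query has
n_candidates = candidates.groupby("peptide").size()
```

The sketch uses `drop_duplicates` rather than `groupby(...).first()` because `first()` takes the first non-null value per column, so it can stitch values from different rows together when a column contains nulls.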
Code-review pass (distinct from the two design comments above)

Line-level issues from reading the diff. A few are blockers IMO; others are nits.

Blocker-grade

1. Memory scaling of the distance tensor: likely unusable on a full human proteome today.

    diffs = (
        q_chunk[:, None, :] != ref_arr[None, :, :]
    ).sum(axis=2)

allocates a

Two ways forward:

Either is a real change; noting so you can decide whether to block this PR on it or ship with a documented max-queries / subset-proteome constraint + tracking issue. At a minimum,

2.

    label = "callable-" + hashlib.sha256(repr(scope).encode()).hexdigest()[:12]

For a lambda or local def,

I'd pick the third (explicit > implicit) given reproducibility is the whole point of the string.

Worth fixing before merge

3. Tie-breaking is deterministic-by-dict-insertion-order, which drifts when the reference source iterator changes.
Fix suggestion: once you have the

    # diffs has shape (len(q_chunk), n_ref), so a broadcast comparison is enough
    tied_mask = diffs == best_dist[:, None]
    # among tied, pick the row whose peptide sorts first
    for k in range(len(q_chunk)):
        tied_peps = [ref_peps[j] for j in np.where(tied_mask[k])[0]]
        winner = min(tied_peps)
        ...

(That's O(ties) per query, negligible compared to the full NN scan.) Happy to see a different stable rule, but order-of-iteration is the wrong one.

4.

    try:
        gene_id = genome.gene_id_of_protein_id(protein_id)
    except ValueError:
        continue

For a user assembling their first self-proteome, "I loaded Ensembl release X and got Y peptides" is opaque when N proteins were silently dropped. Log a count or accumulate to

Nits

5. Provenance collects every occurrence but
    prov = self._provenance.get(ref_pep)
    if prov:
        gene_id, transcript_id, offset = prov[0]

Drops N-1 entries on the floor. This is the same "the full candidate set is already in hand" observation from my design comment above; the fix there is to surface

6. Minor, but

7.

Overall

The architecture is right; the Hamming-distance routine needs either a two-axis chunking or a rewrite before this is usable on the documented headline workflow (human non-CTA). The reproducibility-stamp bug (item 2) also seems worth fixing before ship since it undermines the feature's own claimed guarantee. Everything else is polish.
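
For reference, a rough sketch of how two-axis chunking and a stable tie-break could fit together. The broadcast comparison in item 1 materializes a boolean array with one element per (query, reference, position) triple, so chunking only the query axis still leaves a block that grows with the full reference size. The function names, chunk sizes, and the choice to sort the reference up front are assumptions made for this example, not the PR's implementation.

```python
import numpy as np


def encode(peptides):
    """Encode non-empty, equal-length peptides as an (N, L) int8 array."""
    return np.array([list(p.encode("ascii")) for p in peptides], dtype=np.int8)


def nearest_chunked(queries, ref_peps, q_chunk=256, r_chunk=4096):
    """Nearest reference peptide per query by Hamming distance.

    Chunks both the query and the reference axis, so the temporary boolean
    block never holds more than q_chunk * r_chunk * L elements. Ties go to
    the lexicographically smallest reference peptide.
    """
    # Sorting once means "lowest index" and "sorts first" are the same thing,
    # independent of the order in which the reference was loaded.
    ref_sorted = sorted(set(ref_peps))
    ref_arr = encode(ref_sorted)
    q_arr = encode(queries)
    n_q, length = q_arr.shape

    best_dist = np.full(n_q, length + 1, dtype=np.int64)
    best_idx = np.full(n_q, -1, dtype=np.int64)

    for qs in range(0, n_q, q_chunk):
        q = q_arr[qs:qs + q_chunk]
        # Basic slices are views, so writes below update best_dist / best_idx
        dist_view = best_dist[qs:qs + len(q)]
        idx_view = best_idx[qs:qs + len(q)]
        for rs in range(0, len(ref_arr), r_chunk):
            r = ref_arr[rs:rs + r_chunk]
            diffs = (q[:, None, :] != r[None, :, :]).sum(axis=2)
            local_idx = diffs.argmin(axis=1)  # first minimum inside this block
            local_dist = diffs[np.arange(len(q)), local_idx]
            # Strict "<" keeps the earlier (smaller) peptide on cross-block ties
            better = local_dist < dist_view
            dist_view[better] = local_dist[better]
            idx_view[better] = rs + local_idx[better]

    return [(ref_sorted[i], int(d)) for i, d in zip(best_idx, best_dist)]


# Tiny chunks force the cross-block path; the winner does not depend on the
# order in which the reference peptides were supplied.
result = nearest_chunked(["AAAA", "CCCA"], ["AAAC", "AAAB", "CCCC"],
                         q_chunk=1, r_chunk=1)
```

Sorting the reference is one possible stable rule; it trades a one-time sort for not needing a per-query pass over the tied set.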
SelfProteome holds a species-tagged, scope-filtered reference protein corpus indexed by peptide length and answers per-query nearest-neighbor lookups. Plugs into TopiaryPredictor via a new self_proteome= kwarg; result DataFrame gains self_nearest_peptide / _peptide_length / _edit_distance / _gene_id / _transcript_id / _reference_offset / _reference_version columns. Columns are joined before filter_by / sort_by so they can participate in DSL expressions.

This PR ships the architecture with one axis (sequence-nearest) and substitutions only. Follow-ups queued on #124:

- scope="protected_tissues" + HPA/GTEx tissue filter
- 1aa insertion / deletion candidates
- self_nearest_by_binding / self_strongest_nearby second + third axes
- self_nearest_candidates structured column
- Seed-and-extend algorithm once the benchmark decides

## Surface

- SelfProteome.from_peptides(dict, peptide_lengths=...): test helper for in-memory reference sets.
- SelfProteome.from_fasta(path, scope=...): FASTA loader, scope limited to "all" or a callable (no gene metadata in FASTA).
- SelfProteome.from_ensembl(species, release, scope=..., cta_source=...): pyensembl-backed loader. Defaults to scope="non_cta" for human, stripping CTA genes via the existing topiary.sources pirlygenes integration. Non-human species must pass cta_source explicitly when using "non_cta"; a clear ValueError fires otherwise.
- reference_version property composes an "ensembl-{species}-{release}+scope-{scope}+..." string that gets stamped on every output row. Custom filters (sets or callables) hash into the string so reproducibility holds even without a stable label.
- nearest(peptides) returns a DataFrame with one row per query, preserving input order. Queries whose length isn't represented in the reference get None rows rather than raising.

## Algorithm

SIMD-vectorized Hamming distance against int8-encoded (M, L) arrays per peptide length, chunked for memory bound. Substitutions only today; indels queued alongside the seed-and-extend alternative so the benchmark on #124 can decide the default.

## Tests

21 new tests in tests/test_self_proteome.py covering construction (from_peptides, from_fasta, short-sequence edge case), nearest lookup (exact match, 1 sub, ordering, length mismatch, provenance, reference-version stamping, mixed lengths), scope resolution (non-human non_cta error paths, pirlygenes species guard, unknown scope rejection, custom-set CTA), and TopiaryPredictor integration. Full suite 1132 passed (up from 1111), lint clean, mkdocs --strict clean.

## Docs

- docs/self_proteome.md (new page) covers the full surface, scope matrix, non-human guidance, reference-version semantics, and algorithm caveats. Linked in mkdocs.yml nav between Cached Predictions and Ranking DSL.
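
A minimal sketch of how a reproducible version stamp could be composed, for readers who want to see the hashing idea from the Surface section spelled out. It only builds the documented `ensembl-{species}-{release}+scope-{scope}` prefix and leaves the elided remainder alone. The helper names, the `label=` parameter, and the release number are hypothetical; requiring an explicit label for callables is one possible design (hashing `repr()` of a function is not stable in CPython, because the repr embeds a memory address), not necessarily what the PR does.

```python
import hashlib


def filter_label(scope, label=None):
    """Return a run-to-run stable label for a scope filter (hypothetical helper)."""
    if isinstance(scope, str):
        # Named scopes such as "all" or "non_cta" label themselves
        return scope
    if isinstance(scope, (set, frozenset)):
        # Sort before hashing so set iteration order cannot change the digest
        joined = "\n".join(sorted(scope))
        return "set-" + hashlib.sha256(joined.encode()).hexdigest()[:12]
    if callable(scope):
        # Assumption: ask the caller for a name instead of hashing repr(scope)
        if label is None:
            raise ValueError("callable scopes need an explicit label=...")
        return "callable-" + label
    raise TypeError(f"unsupported scope type: {type(scope).__name__}")


def reference_version(species, release, scope, label=None):
    """Compose the documented prefix of the version string (sketch only)."""
    return f"ensembl-{species}-{release}+scope-{filter_label(scope, label)}"


# Release 111 is a placeholder value for the example
named = reference_version("human", 111, "non_cta")
custom_a = reference_version("human", 111, {"GENE_B", "GENE_A"})
custom_b = reference_version("human", 111, {"GENE_A", "GENE_B"})
labelled = reference_version("mouse", 111, lambda gene: True, label="my-filter")
```

With this shape, two runs that pass the same gene set get the same stamp regardless of insertion order, and a callable never contributes an address-dependent hash.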
Summary
Part A of #124 (cross-reactivity analysis via nearest-self lookups). Ships the `SelfProteome` class + its integration with `TopiaryPredictor`, so downstream code can start consuming `self_nearest_*` columns in real predictor runs.
Scope is deliberately minimal — one nearest-by-sequence axis, substitutions only, `scope="all"` and `scope="non_cta"` — to get the architecture in and reviewed. The richer pieces queued for follow-up PRs are listed below.
What ships
What's deferred (tracked on #124)
Test plan
Part A of #124.