SelfProteome: include= rename, protected_tissues, BLOSUM62, indels (part B of #124) #138
Merged
Filters the reference proteome to genes expressed in named tissues —
the cross-reactivity-relevant subset of self.
## Multi-species design
Three paths, ordered by specificity:
1. **Explicit gene set** (any species): pass
tissue_gene_ids={gene_ids} with pre-computed gene IDs (e.g. from
Tabula Muris for mouse, or custom RNA-seq for dog). Bypasses
pirlygenes entirely. tissues= and min_tissue_ntpm= ignored.
2. **Human default**: queries pirlygenes via
topiary.sources.tissue_expressed_gene_ids with the requested
tissue list (defaults to curated vital-organ set: heart_muscle,
lung, liver, kidney, cerebral_cortex) and min_ntpm threshold.
3. **Non-human without tissue_gene_ids**: raises ValueError with an
actionable message explaining that pirlygenes tissue data is
human-only and directing users to supply tissue_gene_ids=.
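The three paths above can be sketched as a small dispatch function. This is a minimal illustration, not the code in this PR: `resolve_protected_gene_ids`, the `species` string, the `min_tissue_ntpm` default of 1.0, and the injected `query_expression` callable (standing in for `topiary.sources.tissue_expressed_gene_ids`, whose real signature is not shown here) are all assumptions.

```python
from typing import Callable, Iterable, Optional, Set

# Curated vital-organ tissue names listed in the commit message.
DEFAULT_TISSUES = ("heart_muscle", "lung", "liver", "kidney", "cerebral_cortex")


def resolve_protected_gene_ids(
    species: str,
    tissue_gene_ids: Optional[Iterable[str]] = None,
    tissues: Optional[Iterable[str]] = None,
    min_tissue_ntpm: float = 1.0,  # hypothetical default threshold
    query_expression: Optional[Callable[..., Set[str]]] = None,
) -> Set[str]:
    """Hypothetical helper showing the order in which the paths are tried."""
    # Path 1: an explicit gene set wins for any species;
    # tissues and min_tissue_ntpm are ignored.
    if tissue_gene_ids is not None:
        return set(tissue_gene_ids)

    # Path 3: no explicit gene set and not human, so fail with guidance.
    if species != "human":
        raise ValueError(
            f"Tissue expression data is human-only; for species {species!r} "
            "pass tissue_gene_ids= with pre-computed gene IDs."
        )

    # Path 2: human default, delegating to an expression lookup.
    if query_expression is None:
        raise RuntimeError("no expression source supplied (sketch only)")
    tissue_list = list(tissues) if tissues is not None else list(DEFAULT_TISSUES)
    return set(query_expression(tissue_list, min_tissue_ntpm))
```

Checking the explicit gene set first keeps the non-human case usable without ever touching the human-only data source.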
## Version string
- Explicit gene set: "protected_tissues+gene_ids-sha256:{hash}"
- Human default: "protected_tissues+tissues-{list}+min_ntpm-{value}"
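One way such version strings could be built is sketched below. The function name, the sorting of IDs before hashing, the newline and comma separators, and the use of the full hex digest are assumptions made for illustration; the commit message only fixes the overall shape of the two strings.

```python
import hashlib


def protected_tissues_version(tissue_gene_ids=None, tissues=(), min_ntpm=1.0):
    """Hypothetical builder for the two version-string shapes."""
    if tissue_gene_ids is not None:
        # Sort and de-duplicate so the hash does not depend on input order
        # (an assumption; the PR does not say how the hash is computed).
        canonical = "\n".join(sorted(set(tissue_gene_ids)))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return f"protected_tissues+gene_ids-sha256:{digest}"
    tissue_list = ",".join(sorted(tissues))
    return f"protected_tissues+tissues-{tissue_list}+min_ntpm-{min_ntpm}"
```

Hashing a canonical ordering means two callers who supply the same gene set in a different order get the same version string.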
## Tests
4 new tests:
- Non-human without tissue_gene_ids raises.
- Non-human with explicit gene_ids works (mocked pyensembl).
- Human with default tissues works (mocked pirlygenes via monkeypatch).
- Human with custom tissues list works.
Old test_protected_tissues_not_yet_implemented updated to test the
non-human error path instead (the NotImplementedError is gone).
Full suite 1170 pass (up from 1166), lint clean.
Renames the public API parameter from `scope=` (abstract jargon) to
`include=` (reads naturally at the call site):
SelfProteome.from_ensembl(include="non_cta")
SelfProteome.from_ensembl(include="protected_tissues")
SelfProteome.from_ensembl(include="all")
SelfProteome.from_ensembl(include=lambda gene_id: ...)
Adds include="protected_tissues" — filters the reference proteome to
genes expressed in named tissues. Multi-species safe:
- Human default: pirlygenes/HPA expression data, curated vital-organ
tissue list (heart, lung, liver, kidney, cerebral cortex).
- Any species with explicit data: tissue_gene_ids={gene_ids} bypasses
pirlygenes entirely.
- Non-human without tissue_gene_ids: raises ValueError with actionable
message directing users to supply their own data.
4 new tests + all existing tests updated for the parameter rename.
29 self_proteome tests pass, lint clean.
Default metric is now "blosum62": conservative substitutions (I↔L, BLOSUM62 score +2) produce lower distances than non-conservative ones (I↔W, score -3).
- BLOSUM62 matrix loaded lazily from Biopython at first use, not hardcoded.
- metric="hamming" (count mismatches) kept as opt-in.
- New output column: self_nearest_blosum_distance (only present when metric="blosum62").
- self_nearest_edit_distance (Hamming) is always computed regardless of metric, since it's useful for threshold-based filtering.
5 new tests:
- Exact match gives zero distance.
- Conservative substitution ranks below radical substitution.
- Tie-breaking prefers conservative over radical when both are Hamming-1.
- Hamming opt-in works.
- Invalid metric raises an error.
34 self_proteome tests pass, lint clean.
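The ordering described for the "blosum62" metric can be illustrated with a toy distance. The formula used here (sum over positions of the query residue's self-score minus the substitution score) is an assumption chosen because it gives zero for exact matches; the PR does not state the exact formula. Only the handful of BLOSUM62 entries needed for the example are included, rather than the full matrix.

```python
# Subset of BLOSUM62 scores, enough for the example peptides below.
SCORES = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("I", "L"): 2, ("I", "W"): -3,
}


def score(a, b):
    """Symmetric lookup into the partial score table."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def blosum_distance(query, ref):
    """Hypothetical distance: how far each position falls short of a self-match."""
    if len(query) != len(ref):
        raise ValueError("same-length peptides only")
    return sum(score(q, q) - score(q, r) for q, r in zip(query, ref))


def hamming(query, ref):
    """Number of mismatched positions."""
    return sum(q != r for q, r in zip(query, ref))


# Both references are Hamming-1 from the query, but the conservative
# I->L substitution scores closer than the radical I->W one.
query = "AIA"
nearest = min(["AWA", "ALA"], key=lambda ref: blosum_distance(query, ref))
```

Under this sketch `nearest` is the I→L neighbor, which mirrors the tie-breaking behavior covered by the new tests.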
When include_indels=True (default), after the same-length SIMD scan
finds a best substitution match, the method also checks whether any
1-deletion (L-1) or 1-insertion (L+1) neighbor of the query exists
in the reference. An indel match at edit_distance=1 beats a
same-length match at edit_distance≥2.
Cost: ~L + L×20 ≈ 200 hash-set lookups per query (for 9-mers).
The indel check only fires when the same-length best has
edit_distance ≥ 2 — exact matches and 1-substitution matches short-
circuit. Reference peptide sets at each length are built lazily on
first use and cached.
New output column: self_nearest_edit_type ("deletion" / "insertion")
on indel-matched rows. Same-length matches don't set this column.
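A minimal sketch of the indel check follows, assuming the reference is available as a dict of peptide sets keyed by length. The function names, the `ref_by_length` structure, and the choice to try deletions before insertions are assumptions for illustration; the actual method and its caching are not shown in this PR text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def deletion_neighbors(peptide):
    """All peptides of length L-1 obtained by dropping one residue."""
    return {peptide[:i] + peptide[i + 1:] for i in range(len(peptide))}


def insertion_neighbors(peptide):
    """All peptides of length L+1 obtained by inserting one residue.

    Generates at most (L+1) x 20 candidates before de-duplication.
    """
    return {
        peptide[:i] + aa + peptide[i:]
        for i in range(len(peptide) + 1)
        for aa in AMINO_ACIDS
    }


def find_indel_match(query, ref_by_length, same_length_best_distance):
    """Return (reference_peptide, edit_type) for the first indel hit, or None."""
    # Exact and 1-substitution matches short-circuit: an indel at
    # edit distance 1 cannot beat them.
    if same_length_best_distance < 2:
        return None
    shorter = ref_by_length.get(len(query) - 1, set())
    for candidate in sorted(deletion_neighbors(query)):
        if candidate in shorter:
            return candidate, "deletion"
    longer = ref_by_length.get(len(query) + 1, set())
    for candidate in sorted(insertion_neighbors(query)):
        if candidate in longer:
            return candidate, "insertion"
    return None
```

Because each candidate is a single set-membership test, the cost per query stays at a couple of hundred lookups for 9-mers, in line with the estimate above.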
4 new tests: deletion beats distant substitution, insertion match
found at L+1, exact match beats indel, include_indels=False disables.
38 self_proteome tests pass (up from 34). Full suite 1179 pass
(up from 1170). Lint clean.
…d test
- reference_version property: +scope- → +include- to match the renamed parameter.
- Module docstring rewritten: "Scopes" → "Include modes", added distance metrics and indels sections, updated deferred-features list to reflect what's now shipped vs queued.
- nearest() docstring: added explicit note that indel matching returns the first hit found (not the best), and that the upcoming self_nearest_candidates structured column will expose the full set.
- New test: test_indel_beats_distant_blosum_match verifies that a 1-deletion at edit_distance=1 wins over a ≥2-sub same-length match even when metric="blosum62", and that blosum_distance is None for the indel row (can't score across different lengths positionally).
- Fixed version-string assertion in test_reference_version_embeds_* and test_from_ensembl_indexes_proteins_and_filters_cta.
39 self_proteome tests (up from 38), 1180 total, lint clean.
iskandr added a commit that referenced this pull request Apr 16, 2026
This was referenced Apr 24, 2026
iskandr added a commit that referenced this pull request Apr 24, 2026
topiary.self_proteome imported BLOSUM62 via Bio.Align.substitution_matrices.load (added in v5.9.0), but biopython was not declared as a dependency and adding it just for a 400-integer constant is overkill. Inline the standard 20x20 matrix as a numpy constant; values verified to match biopython's BLOSUM62 exactly. Fixes the 17 ModuleNotFoundError: No module named 'Bio' failures on the #138 self_proteome tests that block CI on all open PRs.
Summary
Adds `scope="protected_tissues"` to `SelfProteome.from_ensembl` — filters the reference proteome to genes expressed in named tissues. Multi-species-safe: pirlygenes/HPA for human defaults, explicit `tissue_gene_ids=` for any species.
Multi-species paths
Test plan
Part B of #124.