
SelfProteome: include= rename, protected_tissues, BLOSUM62, indels (part B of #124) #138

Merged: iskandr merged 5 commits into master from add-protected-tissues-scope on Apr 16, 2026


@iskandr iskandr commented Apr 16, 2026

Summary

Adds `scope="protected_tissues"` to `SelfProteome.from_ensembl` — filters the reference proteome to genes expressed in named tissues. Multi-species-safe: pirlygenes/HPA for human defaults, explicit `tissue_gene_ids=` for any species.

Multi-species paths

| Species | How | Example |
|---|---|---|
| Human (default) | pirlygenes tissue expression data, curated vital-organ tissue list | `SelfProteome.from_ensembl(scope="protected_tissues")` |
| Human (custom) | pirlygenes with user-specified tissues | `..., tissues=["lung", "brain"], min_tissue_ntpm=5.0)` |
| Mouse / dog / any | User-supplied gene set | `..., tissue_gene_ids={"ENSMUSG...", ...})` |
| Non-human without data | Raises with actionable message | `..., species="mouse", scope="protected_tissues")` → `ValueError` |

Test plan

- 4 new tests (non-human error, non-human with gene_ids, human default, human custom)
- `./test.sh`: 1170 passed (up from 1166)
- `./lint.sh`: clean
- CI green

Part B of #124.

Filters the reference proteome to genes expressed in named tissues —
the cross-reactivity-relevant subset of self.

## Multi-species design

Three paths, ordered by specificity:

1. **Explicit gene set** (any species): pass
   tissue_gene_ids={gene_ids} with pre-computed gene IDs (e.g. from
   Tabula Muris for mouse, or custom RNA-seq for dog).  This bypasses
   pirlygenes entirely; tissues= and min_tissue_ntpm= are ignored.

2. **Human default**: queries pirlygenes via
   topiary.sources.tissue_expressed_gene_ids with the requested
   tissue list (defaults to curated vital-organ set: heart_muscle,
   lung, liver, kidney, cerebral_cortex) and min_ntpm threshold.

3. **Non-human without tissue_gene_ids**: raises ValueError with an
   actionable message explaining that pirlygenes tissue data is
   human-only and directing users to supply tissue_gene_ids=.
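
The three paths can be sketched as a small resolver. This is only an illustration of the dispatch order described above, not the code in topiary: the function name `resolve_protected_gene_ids`, the `human_lookup` callback (standing in for `topiary.sources.tissue_expressed_gene_ids`), and the `min_tissue_ntpm` default of 1.0 are all assumptions.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

# Curated default tissue list, as quoted in the design notes above.
DEFAULT_TISSUES = ("heart_muscle", "lung", "liver", "kidney", "cerebral_cortex")


def resolve_protected_gene_ids(
    species: str = "human",
    tissue_gene_ids: Optional[Iterable[str]] = None,
    tissues: Optional[Iterable[str]] = None,
    min_tissue_ntpm: float = 1.0,  # hypothetical default, not stated in the PR
    human_lookup: Optional[Callable[[Tuple[str, ...], float], Set[str]]] = None,
) -> Set[str]:
    """Hypothetical helper mirroring the three paths in the design notes."""
    # Path 1: an explicit gene set wins for any species;
    # tissues= and min_tissue_ntpm= are ignored.
    if tissue_gene_ids is not None:
        return set(tissue_gene_ids)

    # Path 3: no explicit gene set and no human expression data to fall back on.
    if species != "human":
        raise ValueError(
            f"Tissue expression data is human-only; for species={species!r} "
            "pass tissue_gene_ids= with your own gene set."
        )

    # Path 2: human default, delegated to an expression lookup callback.
    if human_lookup is None:
        raise RuntimeError("no human expression lookup supplied")
    tissue_list = tuple(tissues) if tissues is not None else DEFAULT_TISSUES
    return set(human_lookup(tissue_list, min_tissue_ntpm))
```

Checking the explicit gene set first means a caller with their own data never depends on the human-only lookup being available.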

## Version string

- Explicit gene set: "protected_tissues+gene_ids-sha256:{hash}"
- Human default: "protected_tissues+tissues-{list}+min_ntpm-{value}"
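
One way such version strings could be built is sketched below. The commit message only gives the two templates; the sorting of gene IDs before hashing, the newline separator, the use of the full hex digest, and the comma-joined tissue list are assumptions made for this example, as are the function names.

```python
import hashlib


def gene_ids_version(gene_ids):
    """Hypothetical: hash a gene set so the version is order-independent."""
    # Sorting first (an assumption) makes the digest independent of set order.
    payload = "\n".join(sorted(gene_ids)).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"protected_tissues+gene_ids-sha256:{digest}"


def tissues_version(tissues, min_ntpm):
    """Hypothetical: encode the tissue list and threshold in the version."""
    return f"protected_tissues+tissues-{','.join(tissues)}+min_ntpm-{min_ntpm}"
```

Embedding the inputs in the version string lets a cached result be tied to the exact gene set or tissue/threshold combination that produced it.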

## Tests

4 new tests:
- Non-human without tissue_gene_ids raises.
- Non-human with explicit gene_ids works (mocked pyensembl).
- Human with default tissues works (mocked pirlygenes via monkeypatch).
- Human with custom tissues list works.

Old test_protected_tissues_not_yet_implemented updated to test the
non-human error path instead (the NotImplementedError is gone).

Full suite 1170 pass (up from 1166), lint clean.

coveralls commented Apr 16, 2026

Coverage Status

coverage: 88.402% (+0.2%) from 88.201% — add-protected-tissues-scope into master

iskandr added 3 commits April 15, 2026 23:04
Renames the public API parameter from `scope=` (abstract jargon) to
`include=` (reads naturally at the call site):

  SelfProteome.from_ensembl(include="non_cta")
  SelfProteome.from_ensembl(include="protected_tissues")
  SelfProteome.from_ensembl(include="all")
  SelfProteome.from_ensembl(include=lambda gene_id: ...)

Adds include="protected_tissues" — filters the reference proteome to
genes expressed in named tissues.  Multi-species safe:

- Human default: pirlygenes/HPA expression data, curated vital-organ
  tissue list (heart, lung, liver, kidney, cerebral cortex).
- Any species with explicit data: tissue_gene_ids={gene_ids} bypasses
  pirlygenes entirely.
- Non-human without tissue_gene_ids: raises ValueError with actionable
  message directing users to supply their own data.
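
A rough sketch of how a string-or-callable `include=` argument might be turned into a per-gene predicate follows. It is illustrative only: `make_gene_filter`, `cta_gene_ids`, and `protected_gene_ids` are hypothetical names, and it assumes that "non_cta" means excluding a set of cancer-testis antigen genes and that a callable receives a gene ID and returns a bool.

```python
def make_gene_filter(include, cta_gene_ids=frozenset(), protected_gene_ids=frozenset()):
    """Hypothetical: map an include= value to a gene_id -> bool predicate."""
    if callable(include):
        # A user-supplied predicate is used as-is.
        return include
    if include == "all":
        return lambda gene_id: True
    if include == "non_cta":
        # Assumption: drop genes listed as cancer-testis antigens.
        return lambda gene_id: gene_id not in cta_gene_ids
    if include == "protected_tissues":
        # Keep only genes expressed in the protected tissues.
        return lambda gene_id: gene_id in protected_gene_ids
    raise ValueError(f"unknown include mode: {include!r}")


# Usage with made-up gene IDs:
keep = make_gene_filter("non_cta", cta_gene_ids={"GENE_CTA"})
kept = [g for g in ["GENE_A", "GENE_CTA"] if keep(g)]
```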

4 new tests + all existing tests updated for the parameter rename.
29 self_proteome tests pass, lint clean.

Default metric is now "blosum62": conservative substitutions (I↔L,
BLOSUM62 score +2) produce lower distances than non-conservative
ones (I↔W, score -3).  The BLOSUM62 matrix is loaded lazily from
Biopython at first use, not hardcoded.

metric="hamming" (count mismatches) kept as opt-in.

New output column: self_nearest_blosum_distance (only present when
metric="blosum62").  self_nearest_edit_distance (Hamming) is always
computed regardless of metric since it's useful for threshold-based
filtering.
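
The ordering described above can be illustrated with a toy distance. The commit message does not state the exact formula, so this sketch assumes one common choice: at each position, the query residue's self-score minus its score against the reference residue. Only the I, L and W entries of the standard BLOSUM62 matrix are included, and the function names are hypothetical.

```python
# Subset of the standard BLOSUM62 matrix covering I, L and W only.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("I", "L"): 2, ("I", "W"): -3, ("L", "W"): -2,
}


def score(a, b):
    """Symmetric lookup into the subset matrix."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def hamming(query, ref):
    """Count mismatched positions between two same-length peptides."""
    if len(query) != len(ref):
        raise ValueError("peptides must have equal length")
    return sum(x != y for x, y in zip(query, ref))


def blosum_distance(query, ref):
    """Assumed definition: sum over positions of self-score minus pair score."""
    if len(query) != len(ref):
        raise ValueError("peptides must have equal length")
    return sum(score(x, x) - score(x, y) for x, y in zip(query, ref))


def nearest(query, refs):
    """Pick the reference with the smallest BLOSUM distance (ties by name)."""
    return min(refs, key=lambda r: (blosum_distance(query, r), r))
```

With this definition, `III` vs `ILI` and `III` vs `IWI` are both Hamming-1, but the I→L substitution has the smaller BLOSUM distance, so `nearest` prefers the conservative match.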

5 new tests covering: exact match gives zero distance, conservative
vs radical substitution ordering, tie-breaking preference for
conservative over radical when both are Hamming-1, hamming opt-in,
and invalid metric error.

34 self_proteome tests pass, lint clean.

When include_indels=True (default), after the same-length SIMD scan
finds a best substitution match, the method also checks whether any
1-deletion (L-1) or 1-insertion (L+1) neighbor of the query exists
in the reference.  An indel match at edit_distance=1 beats a
same-length match at edit_distance≥2.

Cost: ~L + L×20 ≈ 200 hash-set lookups per query (for 9-mers).
The indel check only fires when the same-length best has
edit_distance ≥ 2 — exact matches and 1-substitution matches short-
circuit.  Reference peptide sets at each length are built lazily on
first use and cached.
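
A minimal sketch of the indel check, under the assumption that the reference peptides are held as one set per length: generate every 1-deletion and 1-insertion neighbor of the query and probe the matching set. The function names and the `refs_by_length` dict are hypothetical, and candidates are sorted here only to make the first hit deterministic.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def deletion_neighbors(peptide):
    """All distinct peptides of length L-1 made by dropping one residue."""
    return {peptide[:i] + peptide[i + 1:] for i in range(len(peptide))}


def insertion_neighbors(peptide):
    """All distinct peptides of length L+1 made by inserting one residue."""
    return {
        peptide[:i] + aa + peptide[i:]
        for i in range(len(peptide) + 1)
        for aa in AMINO_ACIDS
    }


def find_indel_match(query, refs_by_length, same_length_best_distance):
    """Return (reference_peptide, edit_type) for the first indel hit, else None."""
    # Exact and 1-substitution matches short-circuit: an indel cannot beat them.
    if same_length_best_distance < 2:
        return None
    shorter = refs_by_length.get(len(query) - 1, set())
    for candidate in sorted(deletion_neighbors(query)):
        if candidate in shorter:
            return candidate, "deletion"
    longer = refs_by_length.get(len(query) + 1, set())
    for candidate in sorted(insertion_neighbors(query)):
        if candidate in longer:
            return candidate, "insertion"
    return None


# Usage with a toy reference set:
refs = {8: {"SIINFEKL"}}
hit = find_indel_match("SIINFEKLL", refs, same_length_best_distance=3)
```

Because the neighbors are deduplicated into sets, the number of lookups per query is somewhat lower than the raw L + (L+1)×20 candidate count.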

New output column: self_nearest_edit_type ("deletion" / "insertion")
on indel-matched rows.  Same-length matches don't set this column.

4 new tests: deletion beats distant substitution, insertion match
found at L+1, exact match beats indel, include_indels=False disables.

38 self_proteome tests pass (up from 34).  Full suite 1179 pass
(up from 1170).  Lint clean.
@iskandr iskandr changed the title SelfProteome scope="protected_tissues" — multi-species (part B of #124) SelfProteome: include= rename, protected_tissues, BLOSUM62, indels (part B of #124) Apr 16, 2026
…d test

- reference_version property: +scope- → +include- to match the
  renamed parameter.
- Module docstring rewritten: "Scopes" → "Include modes", added
  distance metrics and indels sections, updated deferred-features
  list to reflect what's now shipped vs queued.
- nearest() docstring: added explicit note that indel matching
  returns the first hit found (not the best), and that the upcoming
  self_nearest_candidates structured column will expose the full set.
- New test: test_indel_beats_distant_blosum_match verifies that a
  1-deletion at edit_distance=1 wins over a ≥2-sub same-length match
  even when metric="blosum62", and that blosum_distance is None for
  the indel row (can't score across different lengths positionally).
- Fixed version-string assertion in test_reference_version_embeds_*
  and test_from_ensembl_indexes_proteins_and_filters_cta.

39 self_proteome tests (up from 38), 1180 total, lint clean.
@iskandr iskandr merged commit 6881138 into master Apr 16, 2026
8 checks passed
iskandr added a commit that referenced this pull request Apr 24, 2026
topiary.self_proteome imported BLOSUM62 via
Bio.Align.substitution_matrices.load (added in v5.9.0), but biopython
was not declared as a dependency and adding it just for a 400-integer
constant is overkill. Inline the standard 20x20 matrix as a numpy
constant; values verified to match biopython's BLOSUM62 exactly.

Fixes the 17 ModuleNotFoundError: No module named 'Bio' failures on
the #138 self_proteome tests that block CI on all open PRs.