DSL extensions: logistic, Column, WT, peptide properties #98

Merged
iskandr merged 4 commits into master from dsl-extensions
Apr 9, 2026

@iskandr iskandr commented Apr 9, 2026

Intent

Extend the ranking/filtering DSL to support the signals needed for Vaxrank-style composite scoring and general immunogenicity/manufacturability analysis. After this PR, users can express:

score = (
    0.4 * Affinity["netmhcpan"].logistic(350, 150)
    + 0.3 * Presentation["mhcflurry"].score
    - 0.1 * Column("cysteine_count")
    - 0.1 * abs(Column("charge"))
    + 0.1 * (Affinity.score - WT(Affinity).score)  # differential binding
)

Changes

1. .logistic(midpoint, width) on Expr

Logistic sigmoid transform: 1 / (1 + exp((x - midpoint) / width)). The transform decreases in x, so a lower IC50 yields a higher score. Vaxrank uses midpoint=350, width=150 for IC50→score conversion. Follows the same pattern as the existing .norm().
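
As a standalone illustration of the formula above, here is a minimal NumPy sketch. The function name `logistic_score` is hypothetical and is not the `Expr.logistic` implementation; it only evaluates the stated expression with the Vaxrank defaults.

```python
import numpy as np


def logistic_score(x, midpoint=350.0, width=150.0):
    """Evaluate 1 / (1 + exp((x - midpoint) / width)) elementwise.

    Hypothetical helper for illustration; not part of topiary.
    """
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp((x - midpoint) / width))


# IC50 values in nM: a strong binder, the midpoint, and a non-binder.
ic50 = np.array([50.0, 350.0, 5000.0])
scores = logistic_score(ic50)
```

At x = midpoint the score is exactly 0.5, and scores fall toward 0 as IC50 grows.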

2. Column("name") — arbitrary DataFrame column access

New Expr subclass that reads any column from the group DataFrame. This is the key unlock: external signals (read counts, expression, peptide properties, custom annotations) become first-class ranking signals without special-casing each one.

Column("hydrophobicity") >= -0.5
Column("n_alt_reads").sqrt()

CLI: column(name) in filter/ranking strings. If the column is missing, the error message lists the available columns.
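
To make the lookup and error behavior concrete, here is a rough sketch of how a column-reading expression could behave. `ColumnSketch` is a hypothetical stand-in, not the `Column` class from this PR; the choice of `KeyError` for a missing column and the use of `difflib` for the suggestion are assumptions.

```python
import difflib

import pandas as pd


class ColumnSketch:
    """Hypothetical stand-in for Column: reads one named DataFrame column."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, df):
        if self.name not in df.columns:
            available = sorted(map(str, df.columns))
            message = f"Column {self.name!r} not found. Available columns: {available}"
            # Assumption: suggest the closest existing name, if there is one.
            close = difflib.get_close_matches(self.name, available, n=1)
            if close:
                message += f". Did you mean {close[0]!r}?"
            raise KeyError(message)
        values = df[self.name]
        if not pd.api.types.is_numeric_dtype(values):
            raise TypeError(f"Column {self.name!r} is not numeric (dtype {values.dtype})")
        return values


df = pd.DataFrame(
    {
        "peptide": ["SIINFEKL", "ACDCACDC"],
        "charge": [0.0, -2.0],
        "cysteine_count": [0, 4],
    }
)

# Penalty terms in the spirit of the composite score in the Intent section.
penalty = (
    0.1 * ColumnSketch("cysteine_count").evaluate(df)
    + 0.1 * ColumnSketch("charge").evaluate(df).abs()
)
```

Calling `ColumnSketch("cystein_count").evaluate(df)` raises an error that names the available columns and suggests `cysteine_count`.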

3. WT(accessor) — wildtype comparison wrapper

Wraps a KindAccessor to read wildtype prediction columns (wt_value, wt_score, wt_percentile_rank) instead of the mutant columns. Avoids duplicating every field on KindAccessor.

WT(Affinity).value                        # WT IC50
WT(Affinity["netmhcpan"]).score           # qualified WT
Affinity.score - WT(Affinity).score       # differential

Works with method qualification — same prediction_method_name filtering as the mutant side. When WT columns don't exist (non-variant inputs), evaluates to NaN.
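
The NaN fallback can be sketched in plain pandas. `wt_field` is a hypothetical helper, not the `WT` wrapper itself, and it skips method qualification (the `prediction_method_name` filtering); it only shows the column lookup and the behavior when the `wt_*` columns are absent.

```python
import numpy as np
import pandas as pd


def wt_field(df, field):
    """Return the wt_<field> column, or an all-NaN Series if it is absent.

    Hypothetical helper for illustration; not part of topiary.
    """
    column = f"wt_{field}"
    if column in df.columns:
        return df[column].astype(float)
    return pd.Series(np.nan, index=df.index, dtype=float)


# Variant input: both mutant and wildtype scores are present.
variant_df = pd.DataFrame({"score": [0.9, 0.4], "wt_score": [0.2, 0.5]})
# Non-variant input: no wt_* columns at all.
plain_df = pd.DataFrame({"score": [0.9, 0.4]})

differential = variant_df["score"] - wt_field(variant_df, "score")
missing = plain_df["score"] - wt_field(plain_df, "score")
```

Because NaN propagates through arithmetic, the differential term is NaN for every row of a non-variant input rather than raising an error.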

4. topiary.properties — peptide property columns

New module computing amino acid properties directly on the peptide column using vectorized pandas string operations (fast enough for proteome-scale):

from topiary.properties import add_peptide_properties

df = add_peptide_properties(df, groups=["manufacturability"])

Named groups:

  • "core": charge, hydrophobicity, aromaticity, molecular_weight
  • "manufacturability": core + cysteine_count, instability_index, max_7mer_hydrophobicity, cterm_7mer_hydrophobicity, difficult_nterm, difficult_cterm, asp_pro_bonds
  • "immunogenicity": core + tcr_charge, tcr_aromaticity, tcr_hydrophobicity

Computation uses pandas str.count(), str[n], apply() with lookup dicts — no external dependencies.

Properties accessible in DSL via Column("charge"), Column("cysteine_count"), etc.
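
For a sense of what the vectorized string operations look like, here is a small sketch using `str.count()` and `str.len()`. The function `add_simple_properties`, its `prefix` parameter, and the simplified property definitions (integer net charge from K/R minus D/E, aromaticity as the F/W/Y fraction) are assumptions for illustration; the actual definitions in `topiary.properties` may differ.

```python
import pandas as pd


def add_simple_properties(df, peptide_col="peptide", prefix=""):
    """Add a few sequence-derived columns using vectorized string methods.

    Hypothetical helper for illustration; not the topiary implementation.
    """
    out = df.copy()
    peptides = out[peptide_col]
    # str.count takes a regular expression, so character classes work.
    out[prefix + "cysteine_count"] = peptides.str.count("C")
    # Simplified net charge: basic residues minus acidic residues.
    out[prefix + "charge"] = peptides.str.count("[KR]") - peptides.str.count("[DE]")
    # Asp-Pro dipeptides.
    out[prefix + "asp_pro_bonds"] = peptides.str.count("DP")
    # Fraction of aromatic residues.
    out[prefix + "aromaticity"] = peptides.str.count("[FWY]") / peptides.str.len()
    return out


peptides = pd.DataFrame({"peptide": ["SIINFEKL", "ACDPKWCR"]})
props = add_simple_properties(peptides)
```

The resulting columns are ordinary numeric DataFrame columns, so they can be referenced from ranking expressions the same way as any other column.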

Not in this PR

  • add_wildtype_predictions() function (the step that actually runs WT predictions and populates the wt_* columns) — separate PR, depends on predictor refactoring
  • Evaluator performance refactor (vectorized filter/sort) — separate PR

Placeholder commit — implementation follows.

coveralls commented Apr 9, 2026

Coverage Status

coverage: 85.885% (+0.4%) from 85.441% — dsl-extensions into master

iskandr added 3 commits April 9, 2026 07:08
Expr.logistic(midpoint, width) — logistic sigmoid transform for
Vaxrank-compatible IC50 scoring.

Column("name") — reference any DataFrame column in expressions.
Enables peptide properties, read counts, and custom annotations as
first-class ranking signals. Errors with "Did you mean" on typos.
CLI: column(name) <= threshold.

WT(accessor) — wildtype comparison wrapper. Reads wt_value, wt_score,
wt_percentile_rank columns alongside mutant values. Works with method
qualification: WT(Affinity["netmhcpan"]).score. Returns NaN when WT
columns don't exist (non-variant inputs). WT fields are for ranking
expressions only, not filters.

topiary.properties module — compute amino acid properties on the peptide
column using vectorized pandas operations. Named groups:
- "core": charge, hydrophobicity, aromaticity, molecular_weight
- "manufacturability": core + cysteine_count, instability_index,
  max_7mer_hydrophobicity, difficult_nterm/cterm, asp_pro_bonds
- "immunogenicity": core + tcr_charge/aromaticity/hydrophobicity
Includes dipeptide-based instability index (Guruprasad et al. 1990).
Supports prefix parameter for WT peptide properties.

52 new tests covering all features, edge cases, and error paths.
…rrors

- Fix WT docstring showing filter usage that actually raises TypeError
- Validate column() names in CLI parser: reject empty names and nested parens
- Column.evaluate() gives clear TypeError for non-numeric values
- 3 new tests for the above edge cases
@iskandr iskandr merged commit 125c4b2 into master Apr 9, 2026
8 checks passed