DSL extensions: logistic, Column, WT, peptide properties #98
Merged
Placeholder commit — implementation follows.
Expr.logistic(midpoint, width) — logistic sigmoid transform for
Vaxrank-compatible IC50 scoring.
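
A minimal sketch of the transform, using the formula 1 / (1 + exp((x - midpoint) / width)) and the Vaxrank values midpoint=350, width=150 stated in the PR description. The standalone function name `logistic_score` is hypothetical; in the PR the transform is a method on Expr, and its actual implementation may differ.

```python
import math


def logistic_score(x, midpoint, width):
    """Decreasing logistic: 1 / (1 + exp((x - midpoint) / width)).

    Hypothetical stand-in for Expr.logistic; assumes width > 0.
    """
    z = (x - midpoint) / width
    if z > 700.0:
        # math.exp would overflow a float slightly above this; the score is ~0 there.
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


# IC50 values in nM: a strong binder, the midpoint, and a weak binder.
ic50_nm = [50.0, 350.0, 5000.0]
scores = [logistic_score(v, midpoint=350.0, width=150.0) for v in ic50_nm]
```

Lower IC50 means stronger binding, so the score decreases as IC50 grows and equals 0.5 exactly at the midpoint.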
Column("name") — reference any DataFrame column in expressions.
Enables peptide properties, read counts, and custom annotations as
first-class ranking signals. Errors with "Did you mean" on typos.
CLI: column(name) <= threshold.
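
The lookup behavior described above might look roughly like the sketch below. It is not the PR's code: `ColumnSketch` is a hypothetical class, a plain dict of lists stands in for the group DataFrame, and using `difflib.get_close_matches` for the "Did you mean" hint is an assumption about how such a hint could be produced.

```python
import difflib
from numbers import Real


class ColumnSketch:
    """Hypothetical stand-in for Column("name"); not the topiary class."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, table):
        # `table` maps column name -> list of values (stand-in for a DataFrame).
        if self.name not in table:
            msg = f"Column {self.name!r} not found; available: {sorted(table)}"
            close = difflib.get_close_matches(self.name, list(table), n=1)
            if close:
                msg += f". Did you mean {close[0]!r}?"
            raise KeyError(msg)
        values = table[self.name]
        for v in values:
            if not isinstance(v, Real):
                raise TypeError(
                    f"Column {self.name!r} holds non-numeric value {v!r}"
                )
        return [float(v) for v in values]


table = {"charge": [1, -2], "peptide": ["AAA", "CCC"]}
charges = ColumnSketch("charge").evaluate(table)
```

Evaluating `ColumnSketch("chrage")` against the same table raises a KeyError whose message suggests `'charge'`, and evaluating `ColumnSketch("peptide")` raises a TypeError because the values are strings.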
WT(accessor) — wildtype comparison wrapper. Reads wt_value, wt_score,
wt_percentile_rank columns alongside mutant values. Works with method
qualification: WT(Affinity["netmhcpan"]).score. Returns NaN when WT
columns don't exist (non-variant inputs). WT fields are for ranking
expressions only, not filters.
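
As an illustration of the NaN fallback and method qualification described above, here is a small sketch. It assumes rows are plain dicts rather than a DataFrame, and `wt_field` is a hypothetical helper, not part of the PR; only the wt_value, wt_score, wt_percentile_rank column names and the prediction_method_name field come from the PR text.

```python
import math

WT_FIELDS = ("value", "score", "percentile_rank")


def wt_field(rows, field, method=None):
    """Read the wt_<field> value for each row; NaN where the column is absent.

    Hypothetical helper. `method`, if given, keeps only rows whose
    prediction_method_name matches (an assumed reading of "method qualification").
    """
    if field not in WT_FIELDS:
        raise ValueError(f"unknown WT field {field!r}; expected one of {WT_FIELDS}")
    out = []
    for row in rows:
        if method is not None and row.get("prediction_method_name") != method:
            continue
        out.append(row.get(f"wt_{field}", math.nan))
    return out


variant_rows = [
    {"prediction_method_name": "netmhcpan", "score": 0.9, "wt_score": 0.2},
    {"prediction_method_name": "other", "score": 0.5, "wt_score": 0.4},
]
non_variant_rows = [{"prediction_method_name": "netmhcpan", "score": 0.7}]

wt_scores = wt_field(variant_rows, "score", method="netmhcpan")
missing = wt_field(non_variant_rows, "score")
```

Because missing WT columns yield NaN rather than an error, a ranking expression such as mutant score minus WT score stays well defined on non-variant inputs and simply propagates NaN.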
topiary.properties module — compute amino acid properties on the peptide
column using vectorized pandas operations. Named groups:
- "core": charge, hydrophobicity, aromaticity, molecular_weight
- "manufacturability": core + cysteine_count, instability_index,
max_7mer_hydrophobicity, difficult_nterm/cterm, asp_pro_bonds
- "immunogenicity": core + tcr_charge/aromaticity/hydrophobicity
Includes dipeptide-based instability index (Guruprasad et al. 1990).
Supports prefix parameter for WT peptide properties.
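
A rough sketch of how a few of these properties could be computed with vectorized pandas string methods. The function name `add_basic_properties`, its signature, and the simplified net-charge definition (count of K and R minus count of D and E) are assumptions for illustration; the module's actual definitions may differ.

```python
import pandas as pd


def add_basic_properties(df, peptide_col="peptide", prefix=""):
    """Return a copy of df with a few peptide property columns added.

    Hypothetical helper, not the topiary.properties API.
    """
    seqs = df[peptide_col].astype(str)
    out = df.copy()
    # Series.str.count takes a regex and counts non-overlapping matches per row.
    out[f"{prefix}cysteine_count"] = seqs.str.count("C")
    out[f"{prefix}asp_pro_bonds"] = seqs.str.count("DP")
    # Assumed simplification: net charge = (K + R) - (D + E).
    out[f"{prefix}charge"] = seqs.str.count("[KR]") - seqs.str.count("[DE]")
    return out


df = pd.DataFrame({"peptide": ["ACDPK", "CCRRE"]})
props = add_basic_properties(df)
wt_props = add_basic_properties(df, prefix="wt_")
```

The `prefix` argument mirrors the prefix parameter mentioned above: passing `prefix="wt_"` writes columns such as `wt_charge` so mutant and wildtype properties can sit side by side in one DataFrame.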
52 new tests covering all features, edge cases, and error paths.
…rrors
- Fix WT docstring showing filter usage that actually raises TypeError
- Validate column() names in CLI parser: reject empty names and nested parens
- Column.evaluate() gives clear TypeError for non-numeric values
- 3 new tests for the above edge cases
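
The column() name validation could be sketched as follows. This is an assumed shape for the check, not the PR's parser: `parse_column_ref` is a hypothetical function and the error messages are illustrative.

```python
def parse_column_ref(token):
    """Extract the name from a 'column(name)' token, rejecting bad names.

    Hypothetical helper; the real CLI parser may be structured differently.
    """
    token = token.strip()
    if not (token.startswith("column(") and token.endswith(")")):
        raise ValueError(f"not a column() reference: {token!r}")
    name = token[len("column("):-1].strip()
    if not name:
        raise ValueError("column() requires a non-empty name")
    if "(" in name or ")" in name:
        raise ValueError(f"nested parentheses are not allowed in column name: {name!r}")
    return name


name = parse_column_ref("column(cysteine_count)")
```

Rejecting empty and parenthesized names at parse time gives the user an error at the point of the typo instead of a confusing lookup failure later.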
Intent
Extend the ranking/filtering DSL to support the signals needed for Vaxrank-style composite scoring and general immunogenicity/manufacturability analysis. After this PR, users can express:
Changes
1. .logistic(midpoint, width) on Expr
Logistic sigmoid transform: 1 / (1 + exp((x - midpoint) / width)). Vaxrank uses midpoint=350, width=150 for IC50→score conversion. Same pattern as existing .norm().
2. Column("name"): arbitrary DataFrame column access
New Expr subclass that reads any column from the group DataFrame. This is the key unlock: external signals (read counts, expression, peptide properties, custom annotations) become first-class ranking signals without special-casing each one.
CLI: column(name) in filter/ranking strings. Errors with available columns when column is missing.
3. WT(accessor): wildtype comparison wrapper
Wraps a KindAccessor to read wildtype prediction columns (wt_value, wt_score, wt_percentile_rank) instead of the mutant columns. Avoids duplicating every field on KindAccessor.
Works with method qualification: same prediction_method_name filtering as the mutant side. When WT columns don't exist (non-variant inputs), evaluates to NaN.
4. topiary.properties: peptide property columns
New module computing amino acid properties directly on the peptide column using vectorized pandas string operations (fast enough for proteome-scale):
Named groups:
- "core": charge, hydrophobicity, aromaticity, molecular_weight
- "manufacturability": core + cysteine_count, instability_index, max_7mer_hydrophobicity, cterm_7mer_hydrophobicity, difficult_nterm, difficult_cterm, asp_pro_bonds
- "immunogenicity": core + tcr_charge, tcr_aromaticity, tcr_hydrophobicity
Computation uses pandas str.count(), str[n], apply() with lookup dicts; no external dependencies.
Properties accessible in DSL via Column("charge"), Column("cysteine_count"), etc.
Not in this PR
- add_wildtype_predictions() function (the step that actually runs WT predictions and populates the wt_* columns): separate PR, depends on predictor refactoring