feat(custom): orfnormalise + orfmerge modules + orftable_fasta_gtf_buildorfcatalogue subworkflow #11740
Open
pinin4fjords wants to merge 7 commits into
…e): scaffold ORF catalogue chain
Add a three-component upstream chain for building cross-caller Ribo-seq
ORF catalogues:
- custom/orfnormalise: dispatches on meta.caller in {ribocode, ribotish,
ribotricer, rpbp, price}, emits unified BED12 + sidecar TSV with the
harmonised orf_class vocabulary (canonical_cds / uORF / dORF / novel_u
/ smORF / other).
- custom/orfmerge: class-aware clustering across callers and samples,
records called_by_<caller> / score_<caller> provenance with
direction-aware best-score aggregation.
- orftable_fasta_gtf_buildorfcatalogue: composes normaliser x N callers
+ merger + bedtools/getfasta + seqkit/translate, emits BED12,
catalogue TSV, orf_to_gene TSV, AA FASTA and an MQC sidecar.
Templates use the same python/pandas/pyyaml Wave container as
custom/bed12codonpositions. Stub tests scaffolded; real-fixture tests
follow once the paired test-datasets PR lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_ID fallback
- orfnormalise: 5 per-caller tests + 1 stub (all green). Fall back to
parsing the ribotricer ORF_ID when the optional 'coordinate' column is
absent (detect-orfs output does not carry it). Tolerate empty GTF
inputs by requiring a real file before trying to read.
- orfmerge: chain via setup{} from orfnormalise outputs (ribotish +
ribocode on chr20); disambiguate per-caller prefix via a
bundled setup_prefix.config so the merger doesn't see name collisions.
- subworkflow: bundle nextflow.config that scopes the same caller-aware
prefix to ORFTABLE_FASTA_GTF_BUILDORFCATALOGUE:CUSTOM_ORFNORMALISE.
Test sets up GUNZIP for the chr20 FASTA since bedtools/getfasta needs
uncompressed input.
Fixtures live in nf-core/test-datasets#2070; will switch the test URLs
from the pinin4fjords fork back to params.modules_testdata_base_path
once that PR merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d5 stubs
PRICE encodes strand as a single char appended to the chrom rather than
a separate colon-bracketed field (`19+:start-end`), so the previous regex
never matched and the parser returned an empty BED12 on real data. Switch
to a non-greedy `chrom` capture so the trailing strand char is correctly
extracted.
Drop the per-module stub tests: real-fixture tests already cover every
caller + the merge + the full subworkflow chain, and the stub snapshots
were tripping the test_snap_md5sum lint rule (empty-file md5).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
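A minimal sketch of the location parsing this commit describes, assuming a single-block location of the form `<chrom><strand>:<start>-<end>` as in the `19+:start-end` example. The pattern and the `parse_price_location` helper are hypothetical illustrations, not the module's actual code.

```python
import re

# Hypothetical pattern: non-greedy chrom, then one strand char right before ':'.
PRICE_LOCATION = re.compile(
    r"^(?P<chrom>.+?)(?P<strand>[+-]):(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_price_location(location):
    """Return (chrom, strand, start, end), or None if the string does not match."""
    match = PRICE_LOCATION.match(location)
    if match is None:
        return None
    return (
        match.group("chrom"),
        match.group("strand"),
        int(match.group("start")),
        int(match.group("end")),
    )
```

Because the `chrom` group is non-greedy, the strand character directly before the colon is captured separately instead of being swallowed into the chromosome name.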
… parsers + empty-cohort handling
Inherited parser bugs (carried over from the local riboseq port):
- RiboCode: aa_length always 0 because the parser looked up an
AA_length column that does not exist in predicted_orfs.txt. Derive
from ORF_length (nt) instead.
- Rp-Bp: bf_mean read from column 15 (p_translated_var) instead of
column 18 (bayes_factor_mean), and orf_type read from column 14
(a metric, not a category). Rp-Bp's predicted-orfs BED has no orf_type
column; default to canonical_cds for the post-selectfinalpredictionset
curated set.
- Ribo-TISH: combined p-value never read because the parser looked for
Pvalcombined/Pvalue/Pvalcom; actual columns are FisherPvalue /
RiboPvalue / TISPvalue, with "None" string sentinels. Walk the
preference list and skip None strings.
Subworkflow gaps:
- ch_orf_tables empty case crashed CUSTOM_ORFMERGE (arity '1..*'
violated by .collect() emitting an empty list). Filter out the empty
case so output channels are simply empty.
- BEDTOOLS_GETFASTA and SEQKIT_TRANSLATE got generic
${meta.id}.{fa,fasta} filenames; pin to .catalogue.nt.fa and
.catalogue.aa.fasta via bundled ext.prefix. Add ext.args = '-split -s
-nameOnly' to getfasta (splice-aware extraction of BED12 blocks) and
'--trim' to seqkit translate (drop trailing stop codons).
- meta.yml: cite all five caller tools with verified DOIs (RiboCode
Xiao 2018, Ribo-TISH Zhang 2017, Ribotricer Choudhary 2020, Rp-Bp
Malone 2017, PRICE Erhard 2018); document that ch_orf_tables empty
short-circuits; clarify GTF is required for ribocode/ribotish and
optional for the rest; note that cohort-level output meta is
hardcoded to [id:'cohort'].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
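The Ribo-TISH fix in this commit (walk a preference list, skip "None" sentinels) could be sketched as follows. The column names come from the commit message; the preference order and the `first_usable_pvalue` helper are assumptions made for illustration.

```python
# Assumed preference order; only the column names come from the commit message.
RIBOTISH_PVALUE_FIELDS = ("FisherPvalue", "RiboPvalue", "TISPvalue")


def first_usable_pvalue(row, fields=RIBOTISH_PVALUE_FIELDS):
    """Return (column, value) for the first column holding a real number."""
    for field in fields:
        raw = row.get(field)
        if raw is None or raw.strip() in ("", "None"):
            continue  # missing column or the literal "None" sentinel
        try:
            return field, float(raw)
        except ValueError:
            continue  # unparseable text is treated like a sentinel
    return None, None
```

Returning the column name alongside the value makes it possible to report which source column was actually used.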
…es, tests
Surfaced + addressed via independent third-party assessment of branch
6597190 (parsers vs. canonical upstream output formats; merger clustering
logic; test depth).
Parsers (orfnormalise):
- Ribotricer: detect-orfs output does NOT have the `coordinate` column
(that lives in the prepare-orfs index file). The previous fallback
treated the ORF_ID span as a single block, which for any multi-exon
host transcript emits a BED12 record that spans introns - biologically
nonsense and contaminates the catalogue AA FASTA produced via
bedtools/getfasta + seqkit/translate. New _ribotricer_blocks_from_id
takes the transcript map and intersects the ORF span with the host
transcript's exon structure (same pattern ribotish's fallback already
uses), recovering proper multi-exon blocks.
- PRICE: orf_id_raw.split("__", 1)[0] used double-underscore; actual ID
format is `<tid>_<type>_<index>` (single underscores). Was dead code
for fixtures with Gene populated, but still wrong; switch to single
underscore so the transcript_id lookup against the GTF works.
Merger (orfmerge):
- cluster_by_transcript collapsed every uORF / dORF / other ORF sharing
a transcript_id into one cluster. A transcript can host multiple
distinct uORFs (biologically common) so the catalogue under-reported
them. Split into cluster_by_transcript (canonical_cds only) +
cluster_by_transcript_position (uORF/dORF/other; keyed on outer span
as well so distinct positional ORFs stay separate).
- Document order-dependence of cluster_by_reciprocal_overlap at the
threshold boundary.
- Drop dead `_bed_key` helper.
Tests:
- Snapshot-only assertions caught changes but not correctness. Add
content-level checks per caller: aa_length > 0 for all rows, score
column populated, orf_class distribution not collapsed to a single
bucket, multi-block BED12 records for ribotricer and price. These
would have caught every parser bug in the assessment.
- Merger test asserts called_by_*/score_* columns are coherent (catches
the case where a parser bug populates the indicator but the score
column is empty).
Documentation (meta.yml):
- Cite all five caller tools with verified DOIs (already in previous
commit) - add note that Rp-Bp's predicted-orfs BED has no orf_type
column so this module defaults Rp-Bp calls to canonical_cds.
- Note PRICE iORF/intronic/orphan -> 'other' collapse (lossy).
- Note ribotricer multi-exon block recovery via GTF + that GTF is
strongly recommended for ribotricer (not just ribocode/ribotish).
- Subworkflow: AA fasta now lands as `${cohort}.catalogue.aa.fasta`
(from the previous commit).
Paired test-datasets PR nf-core/test-datasets#2070 also gets a small
fixture update: the ribotish/predict fixture's source run lacked a TIS
BAM so every p-value column came out as the literal string "None",
which silently masked the broken pvalue parser. Substitute the real
RiboPvalue value into the FisherPvalue column via awk so the parser
path is exercised; documented in the fixture README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
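The class-aware clustering described in the merger notes of this commit could be sketched roughly as below. This is an illustration under stated assumptions, not the merger's code: the function names are hypothetical, ORFs are plain dicts, and the overlap step assumes all spans sit on the same chromosome and strand and are half-open `(start, end)` tuples.

```python
from collections import defaultdict

POSITIONAL_CLASSES = {"uORF", "dORF", "other"}


def keyed_clusters(orfs):
    """Group canonical_cds and positional classes by an exact key."""
    clusters = defaultdict(list)
    for orf in orfs:
        if orf["orf_class"] == "canonical_cds":
            key = (orf["transcript_id"], orf["strand"])
        elif orf["orf_class"] in POSITIONAL_CLASSES:
            key = (orf["transcript_id"], orf["strand"], orf["start"], orf["end"])
        else:
            continue  # novel_u / smORF go through overlap clustering instead
        clusters[key].append(orf)
    return dict(clusters)


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for half-open (start, end) spans."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return min(shared / (a[1] - a[0]), shared / (b[1] - b[0]))


def greedy_overlap_clusters(spans, threshold=0.8):
    """Assign each span to the first cluster whose seed it overlaps enough."""
    clusters = []  # each cluster is a list; element 0 is the seed
    for span in sorted(spans):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], span) >= threshold:
                cluster.append(span)
                break
        else:
            clusters.append([span])
    return clusters
```

Because each span is compared only against the seed of each existing cluster, the result depends on processing order when overlaps sit near the threshold, which is the order-dependence the commit documents.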
…provenance header
The source columns the parser reads for each derived TSV field (`score`,
`orf_type`, `aa_length`, length-for-aa-derivation) were previously
hardcoded in each parser. Some callers expose multiple meaningful
choices - Ribo-TISH has TISPvalue / RiboPvalue / FisherPvalue (plus
Q-values), Rp-Bp has bayes_factor_mean / chi_square_p /
p_translated_mean - and there was no way to override the default chain.
This change:
- Plumbs `task.ext.args` through the module to the template.
- Adds `--score-field`, `--orf-type-field`, `--length-field`,
`--aa-length-field` CLI options. When set, an override replaces the
per-caller default chain (so the user gets exactly what they asked for).
- Refactors the rpbp parser from positional column indices to
csv.DictReader + named RPBP_COLUMNS so users can address columns by
name (e.g. `--score-field bayes_factor_var`).
- Centralises field-name preference chains into a DEFAULT_FIELDS table
next to the parsers.
- Writes a `# parser_columns: caller=<X> score=<col> orf_type=<col> ...`
provenance header at the top of the normalised TSV, so consumers (and
reviewers) can verify which source column was actually read for each
derived field. pandas skips the line when given `comment='#'`;
csv.DictReader does not, so readers that use it must filter
`#`-prefixed lines first.
Also updates custom/orfmerge's load_normalised to skip `#`-prefixed
comment lines so the new provenance header doesn't leak into the cohort
catalogue.
Test changes:
- Existing per-caller tests gain a provenance-line assertion (verifies
`# parser_columns:` is present and reports the expected default column
for at least the ribocode case).
- New `homo_sapiens [chr20] - ribotish - score-field override` test runs
with `ext.args = '--score-field RiboPvalue'` and asserts the provenance
line reports `score=RiboPvalue` rather than the default `FisherPvalue`.
- Tests now filter `#`-prefixed lines before parsing the TSV header.
Verified end-to-end on chr19/chr20 fixtures: default + override paths
emit the right provenance line, and all parser-specific assertions
(non-zero aa_length, populated score column, non-collapsed class
distribution, multi-exon BED12 blocks) continue to pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
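A consumer of the normalised TSV might read the provenance header along these lines. This is a sketch only: it assumes the header payload is space-separated `key=value` pairs as in the example above, `read_normalised_tsv` is a hypothetical helper, and `#`-prefixed lines are filtered explicitly before the rows reach `csv.DictReader`.

```python
import csv
import io


def read_normalised_tsv(text):
    """Split a normalised TSV into (provenance dict, list of row dicts)."""
    provenance = {}
    data_lines = []
    for line in text.splitlines():
        if line.startswith("# parser_columns:"):
            # Header payload is assumed to be space-separated key=value pairs.
            payload = line.split(":", 1)[1]
            provenance = dict(
                item.split("=", 1) for item in payload.split() if "=" in item
            )
        elif not line.startswith("#"):
            data_lines.append(line)
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    return provenance, list(reader)
```

With pandas, passing `comment='#'` and `sep='\t'` to `read_csv` gives the same row filtering without the manual loop, though the provenance line would still need to be parsed separately.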
… subworkflow channel now carries it as val too
nf-core/modules has a hardcoded `permitted_meta_keys = {"id",
"single_end"}` allow-list - any other meta.<key> reference in main.nf
fails the `main_nf_meta_key` lint check. The previous design embedded
the per-record caller id in meta.caller, which violated that policy
and tripped CI lint.
Lift the caller out of meta and into a proper val input - the pattern
used by raxmlng (val(model)), last/mafconvert (val(format)),
clair3 (val(platform)), and similar nf-core modules:
- Module input: `tuple val(meta), path(orfs_table), val(caller)` plus
the unchanged `tuple val(meta2), path(gtf)`. The `caller` value goes
into a per-caller enum in meta.yml (one of ribocode / ribotish /
ribotricer / rpbp / price).
- Module main.nf no longer references meta.caller; the script binds
`caller` directly from the input tuple.
- Subworkflow `ch_orf_tables` API also moves caller out of meta:
`[ val(meta), path(orf_table), val(caller) ]`. Subworkflow appends
caller to meta.id locally for the normalise call so the merger's
`beds/*` staging gets unique per-caller filenames; the merger then
rebuilds meta as `[id: 'cohort']` so no caller leakage downstream.
- Subworkflow's bundled `nextflow.config` drops the
`withName: CUSTOM_ORFNORMALISE` prefix override (the meta.id append
in main.nf does the work).
Pre-commit fixes:
- ruff E741: rename `l` to `line` in write_outputs (loop variable).
- ruff UP015: drop the redundant `"r"` mode arg from open() in open_text.
- ruff format: minor reformatting of long literal lists.
Tests:
- All 6 orfnormalise tests + 1 merger test + 1 subworkflow test pass
module input as the new 3-tuple (meta carries only `id`; caller is
the trailing val). Merger setup config still constructs
`meta + [caller: ...]` for prefix disambiguation - that meta key is
consumed by the *test* setup_prefix.config (not by the module),
which is fine since lint only scans modules' main.nf.
Local `nf-core modules lint custom/orfnormalise` now passes cleanly
(only the cosmetic Wave container-version warnings remain, shared
with the reference custom/bed12codonpositions module).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords added a commit to pinin4fjords/riboseq that referenced this pull request on May 22, 2026
…rces
Pre-merge upstream PRs leave `custom/bed12codonpositions`,
`custom/orfnormalise`, `custom/orfmerge` and the
`orftable_fasta_gtf_buildorfcatalogue` subworkflow without a modules.json
entry, which lets `nf-core lint` abort with an interactive "Was the
module installed from a different branch" prompt that fails under CI's
no-TTY shell. Register them all under the existing
https://github.com/pinin4fjords/nf-core-modules.git entry (same pattern
as dotseq/dotseq). Once nf-core/modules#11740 and the
custom-bed12codonpositions PR merge to master, swap the source and
branches/SHAs via `nf-core modules update`.
[skip ci]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Ribo-seq experiments measure which open reading frames (ORFs) on the transcriptome are actively translated by ribosomes. Five widely-used callers - RiboCode, Ribo-TISH, Ribotricer, Rp-Bp, PRICE - each emit ORF predictions in a different table format with different score semantics and classification vocabularies. There's no upstream way to combine them into one cohort catalogue.
This PR adds three components that fill that gap:
- `custom/orfnormalise` - parses one caller's output into a unified BED12 + sidecar TSV. The caller (one of ribocode, ribotish, ribotricer, rpbp, price) is supplied as a `val` input.
- `custom/orfmerge` - class-aware clustering of normalised calls across callers and samples into one cohort catalogue.
- `orftable_fasta_gtf_buildorfcatalogue` subworkflow - composes the two with `bedtools/getfasta` + `seqkit/translate` to produce a catalogue AA FASTA.
Bundled because the merger's schema is defined by the normaliser, and the subworkflow only makes sense once both modules exist.
Harmonised schema
`orf_class` is the union vocabulary across the five callers:
- `canonical_cds`
- `uORF` / `dORF`
- `novel_u`
- `smORF`: `aa_length <= 100`, regardless of location
- `other`
Score direction is per-caller (p-values lower-better, Bayes factors / phase scores higher-better); the merger keeps the direction-appropriate best per cluster.
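As an illustration of direction-aware aggregation, the per-cluster reduction might look like the sketch below. The per-caller direction table is an assumption derived from the general rule stated above, and `best_scores` is a hypothetical helper rather than the merger's actual function.

```python
# Assumed direction per caller; the description only gives the general rule
# (p-values lower-better, Bayes factors / phase scores higher-better).
LOWER_IS_BETTER = {
    "ribocode": True,
    "ribotish": True,
    "price": True,
    "rpbp": False,
    "ribotricer": False,
}


def best_scores(calls):
    """Reduce (caller, score) pairs for one cluster to the best score per caller."""
    best = {}
    for caller, score in calls:
        if score is None:
            continue  # a call without a score cannot improve the best value
        pick = min if LOWER_IS_BETTER[caller] else max
        best[caller] = score if caller not in best else pick(best[caller], score)
    return best
```

Keeping one best value per caller, rather than one per cluster, avoids comparing p-values against Bayes factors, which are not on a common scale.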
Merger clustering is class-aware:
- `canonical_cds`: by `(transcript_id, strand)` (one CDS per transcript).
- `uORF` / `dORF` / `other`: by `(transcript_id, strand, start, end)` (a transcript can host multiple distinct positional ORFs).
- `novel_u` / `smORF`: greedy reciprocal-overlap (default 0.8).
Configurable source columns + provenance
Each caller's output exposes multiple candidate columns for `score` (Ribo-TISH alone has FisherPvalue / RiboPvalue / TISPvalue plus Q-value variants), `orf_type`, and length. Defaults are in `meta.yml`; override per call via `ext.args`, for example `ext.args = '--score-field RiboPvalue'`.
Every normalised TSV starts with a `# parser_columns: ...` line recording exactly which source column was read for each derived field, so downstream consumers (and reviewers) can verify provenance. pandas skips `#`-prefixed lines when given `comment='#'`; Python's `csv.DictReader` does not, so filter those lines out before parsing.
Example output
Cohort catalogue TSV: the `called_by_*` / `score_*` columns track which callers contributed.
Subworkflow also emits multi-block BED12 (intron-skipping), `orf_to_gene.tsv` (many-to-many ORF → host lookup), catalogue AA FASTA, and a MultiQC custom-content per-class-count sidecar.
Test data
Paired with nf-core/test-datasets#2070 (five real-tool outputs sliced to <13 KB each from existing module-test outputs; full provenance in the test-data README). Tests reference the fork branch inline; will swap to `params.modules_testdata_base_path` once #2070 merges.
Test plan
- `custom/orfnormalise` with content-level assertions (non-zero `aa_length`, populated `score`, non-collapsed `orf_class` distribution, multi-block BED12 for multi-exon callers, provenance header present).
- `custom/orfmerge` chains two normaliser invocations and asserts coherent `called_by_*` / `score_*` columns.
- `--score-field` override exercised; provenance header confirms it.
- `nf-core modules lint` + `nf-core subworkflows lint` clean (only the two cosmetic Wave container-version warnings shared with custom/bed12codonpositions remain).