feat(custom): orfnormalise + orfmerge modules + orftable_fasta_gtf_buildorfcatalogue subworkflow by pinin4fjords · Pull Request #11740 · nf-core/modules

pinin4fjords · 2026-05-21T16:42:54Z

What

Ribo-seq experiments measure which open reading frames (ORFs) on the transcriptome are actively translated by ribosomes. Five widely used callers - RiboCode, Ribo-TISH, Ribotricer, Rp-Bp, PRICE - each emit ORF predictions in a different table format, with different score semantics and classification vocabularies. There is currently no upstream way to combine them into one cohort catalogue.

This PR adds three components that fill that gap:

- custom/orfnormalise - parses one caller's output into a unified BED12 + sidecar TSV. The caller (one of ribocode, ribotish, ribotricer, rpbp, price) is supplied as a val input.
- custom/orfmerge - class-aware clustering of normalised calls across callers and samples into one cohort catalogue.
- orftable_fasta_gtf_buildorfcatalogue subworkflow - composes the two with bedtools/getfasta + seqkit/translate to produce a catalogue AA FASTA.

Bundled because the merger's schema is defined by the normaliser, and the subworkflow only makes sense once both modules exist.

Harmonised schema

orf_class is the union vocabulary across the five callers:

| Class | Meaning |
| --- | --- |
| `canonical_cds` | Maps to an annotated CDS (or truncated / extended variant) |
| `uORF` / `dORF` | ORF in the 5'UTR / 3'UTR of an annotated transcript |
| `novel_u` | Novel / intergenic ORF, no annotated host CDS |
| `smORF` | `aa_length <= 100`, regardless of location |
| `other` | Internal / overlap / frame variants |

Score direction is per-caller (p-values lower-better, Bayes factors / phase scores higher-better); the merger keeps the direction-appropriate best per cluster.
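
The "direction-appropriate best" rule can be illustrated with a minimal sketch. This is not the module's code: `LOWER_IS_BETTER` and `best_score` are hypothetical names, and the per-caller directions are assumptions based on the score types named above (PRICE in particular is assumed to report a p-value).

```python
from typing import Iterable, Optional

# Assumed score direction per caller (hypothetical table, for illustration only).
LOWER_IS_BETTER = {
    "ribocode": True,     # p-value
    "ribotish": True,     # p-value
    "price": True,        # p-value (assumed)
    "rpbp": False,        # Bayes factor
    "ribotricer": False,  # phase score
}


def best_score(caller: str, scores: Iterable[Optional[float]]) -> Optional[float]:
    """Return the best score one caller assigned within one cluster.

    Missing scores (None) are ignored; an empty cluster yields None.
    """
    present = [s for s in scores if s is not None]
    if not present:
        return None
    return min(present) if LOWER_IS_BETTER[caller] else max(present)
```

Keeping the direction in a single lookup table means adding a caller only requires one new entry, not a new branch in the aggregation logic.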

Merger clustering is class-aware:

- canonical_cds: by (transcript_id, strand) (one CDS per transcript).
- uORF / dORF / other: by (transcript_id, strand, start, end) (a transcript can host multiple distinct positional ORFs).
- novel_u / smORF: greedy reciprocal-overlap (default 0.8).
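
The three clustering rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merger's implementation: the `Orf` record and `cluster` function are hypothetical, novel_u and smORF calls are pooled into one greedy pass, and each candidate is compared against the first member of an existing cluster only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    transcript_id: str
    chrom: str
    strand: str
    start: int  # 0-based, half-open outer span
    end: int
    orf_class: str


def reciprocal_overlap(a: Orf, b: Orf) -> float:
    """Smaller of the two overlap fractions; 0.0 if the spans do not overlap."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    return min(inter / (a.end - a.start), inter / (b.end - b.start))


def cluster(orfs, min_overlap=0.8):
    keyed = {}   # exact-key clusters (canonical_cds, uORF, dORF, other)
    greedy = []  # overlap-based clusters (novel_u, smORF)
    for orf in orfs:
        if orf.orf_class == "canonical_cds":
            key = ("cds", orf.transcript_id, orf.strand)
            keyed.setdefault(key, []).append(orf)
        elif orf.orf_class in {"uORF", "dORF", "other"}:
            key = ("pos", orf.transcript_id, orf.strand, orf.start, orf.end)
            keyed.setdefault(key, []).append(orf)
        else:
            # Greedy: join the first cluster whose seed overlaps enough.
            for members in greedy:
                if reciprocal_overlap(members[0], orf) >= min_overlap:
                    members.append(orf)
                    break
            else:
                greedy.append([orf])
    return list(keyed.values()) + greedy
```

Because the greedy pass assigns each ORF to the first matching cluster, results near the threshold can depend on input order.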

Configurable source columns + provenance

Each caller's output exposes multiple candidate columns for score (Ribo-TISH alone has FisherPvalue / RiboPvalue / TISPvalue plus Q-value variants), orf_type, and length. Defaults are in meta.yml; override per-call via ext.args:

process {
    withName: 'CUSTOM_ORFNORMALISE_RIBOTISH' {
        ext.args = '--score-field FisherQvalue'  // FDR-adjusted, not raw p-value
    }
}

Every normalised TSV starts with a # parser_columns: ... line recording exactly which source column was read for each derived field, so downstream consumers (and reviewers) can verify provenance. pandas skips the line when reading with comment='#'; readers without comment handling need to drop #-prefixed lines first, as custom/orfmerge and the module tests do.
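
A consumer could separate the provenance header from the data rows along these lines. This is a minimal standard-library sketch: `read_normalised` is a hypothetical helper, and the header layout (`key=value` pairs separated by spaces) is assumed from the description in this PR.

```python
import csv
import io


def read_normalised(text: str):
    """Split a normalised TSV into (provenance dict, list of row dicts)."""
    provenance = {}
    data_lines = []
    for line in io.StringIO(text):
        if line.startswith("# parser_columns:"):
            # Everything after the first colon is a run of key=value pairs.
            fields = line.split(":", 1)[1].split()
            provenance = dict(f.split("=", 1) for f in fields if "=" in f)
        elif not line.startswith("#"):
            data_lines.append(line)
    # The csv module has no comment handling, so comments are filtered above.
    rows = list(csv.DictReader(data_lines, delimiter="\t"))
    return provenance, rows
```

With pandas, passing `comment="#"` to `read_csv` skips the header line, but the provenance fields then have to be read separately.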

Example output

Cohort catalogue TSV (called_by_* / score_* track which callers contributed):

orf_id        chrom  start     end       strand  gene_id          orf_class      aa_length  called_by_ribotish  called_by_ribocode  score_ribotish  score_ribocode
orf_00000001  20     16731702  16741005  +       ENSG00000125870  canonical_cds  224        0                   1                                   0.0101442
orf_00000002  20     17570208  17607146  +       ENSG00000125868  canonical_cds  164        0                   1                                   0.00112689

Subworkflow also emits multi-block BED12 (intron-skipping), orf_to_gene.tsv (many-to-many ORF → host lookup), catalogue AA FASTA, and a MultiQC custom-content per-class-count sidecar.

Test data

Paired with nf-core/test-datasets#2070 (five real-tool outputs sliced to <13 KB each from existing module-test outputs; full provenance in the test-data README). Tests reference the fork branch inline; will swap to params.modules_testdata_base_path once #2070 merges.

Test plan

- All 5 callers' fixtures round-trip through custom/orfnormalise with content-level assertions (non-zero aa_length, populated score, non-collapsed orf_class distribution, multi-block BED12 for multi-exon callers, provenance header present).
- custom/orfmerge chains two normaliser invocations and asserts coherent called_by_* / score_* columns.
- Subworkflow chains GUNZIP + normaliser × 2 + merger + getfasta + translate end-to-end.
- --score-field override exercised; provenance header confirms it.
- nf-core modules lint + nf-core subworkflows lint clean (only the two cosmetic Wave container-version warnings shared with custom/bed12codonpositions remain).
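
The called_by_* / score_* coherence assertion from the test plan could be expressed as a small check like the one below. This is a sketch only: `incoherent_rows` is a hypothetical helper operating on rows already parsed into dicts of strings, and it treats both directions of mismatch (indicator set without a score, score present without the indicator) as incoherent, which is an assumption about what the real test asserts.

```python
def incoherent_rows(rows, callers):
    """Return (orf_id, caller) pairs where indicator and score disagree."""
    bad = []
    for row in rows:
        for caller in callers:
            called = row[f"called_by_{caller}"] == "1"
            has_score = row[f"score_{caller}"].strip() != ""
            if called != has_score:
                bad.append((row["orf_id"], caller))
    return bad
```

An empty result means every caller that claims an ORF also contributed a score for it, and vice versa.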

…e): scaffold ORF catalogue chain
Add a three-component upstream chain for building cross-caller Ribo-seq ORF catalogues:
- custom/orfnormalise: dispatches on meta.caller in {ribocode, ribotish, ribotricer, rpbp, price}, emits unified BED12 + sidecar TSV with the harmonised orf_class vocabulary (canonical_cds / uORF / dORF / novel_u / smORF / other).
- custom/orfmerge: class-aware clustering across callers and samples, records called_by_<caller> / score_<caller> provenance with direction-aware best-score aggregation.
- orftable_fasta_gtf_buildorfcatalogue: composes normaliser x N callers + merger + bedtools/getfasta + seqkit/translate, emits BED12, catalogue TSV, orf_to_gene TSV, AA FASTA and an MQC sidecar.
Templates use the same python/pandas/pyyaml Wave container as custom/bed12codonpositions. Stub tests scaffolded; real-fixture tests follow once the paired test-datasets PR lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_ID fallback
- orfnormalise: 5 per-caller tests + 1 stub (all green). Fall back to parsing the ribotricer ORF_ID when the optional 'coordinate' column is absent (detect-orfs output does not carry it). Tolerate empty GTF inputs by requiring a real file before trying to read.
- orfmerge: chain via setup{} from orfnormalise outputs (ribotish + ribocode on chr20); disambiguate per-caller prefix via a bundled setup_prefix.config so the merger doesn't see name collisions.
- subworkflow: bundle nextflow.config that scopes the same caller-aware prefix to ORFTABLE_FASTA_GTF_BUILDORFCATALOGUE:CUSTOM_ORFNORMALISE. Test sets up GUNZIP for the chr20 FASTA since bedtools/getfasta needs uncompressed input.
Fixtures live in nf-core/test-datasets#2070; will switch the test URLs from the pinin4fjords fork back to params.modules_testdata_base_path once that PR merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d5 stubs
PRICE encodes strand as a single char appended to the chrom rather than a separate colon-bracketed field (`19+:start-end`), so the previous regex never matched and the parser returned an empty BED12 on real data. Switch to a non-greedy `chrom` capture so the trailing strand char is correctly extracted.
Drop the per-module stub tests: real-fixture tests already cover every caller + the merge + the full subworkflow chain, and the stub snapshots were tripping the test_snap_md5sum lint rule (empty-file md5).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
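
For illustration, a location string of the `19+:start-end` shape described in this commit could be parsed as below. The pattern and the `parse_price_location` helper are hypothetical, not the module's actual regex, and the sketch handles a single contiguous span only.

```python
import re

# Non-greedy chrom capture, as the commit describes; the +/- character
# immediately before the colon is captured separately as the strand.
PRICE_LOCATION = re.compile(
    r"^(?P<chrom>.+?)(?P<strand>[+-]):(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_price_location(location: str):
    """Return (chrom, strand, start, end), or None if the string does not match."""
    match = PRICE_LOCATION.match(location)
    if match is None:
        return None
    return (
        match["chrom"],
        match["strand"],
        int(match["start"]),
        int(match["end"]),
    )
```

Returning None on a mismatch lets the caller decide whether to skip the row or fail loudly, so a format change does not silently produce an empty BED12.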

… parsers + empty-cohort handling
Inherited parser bugs (carried over from the local riboseq port):
- RiboCode: aa_length always 0 because the parser looked up an AA_length column that does not exist in predicted_orfs.txt. Derive from ORF_length (nt) instead.
- Rp-Bp: bf_mean read from column 15 (p_translated_var) instead of column 18 (bayes_factor_mean), and orf_type read from column 14 (a metric, not a category). RPBP's predicted-orfs BED has no orf_type column; default to canonical_cds for the post-selectfinalpredictionset curated set.
- Ribo-TISH: combined p-value never read because the parser looked for Pvalcombined/Pvalue/Pvalcom; actual columns are FisherPvalue / RiboPvalue / TISPvalue, with "None" string sentinels. Walk the preference list and skip None strings.
Subworkflow gaps:
- ch_orf_tables empty case crashed CUSTOM_ORFMERGE (arity '1..*' violated by .collect() emitting an empty list). Filter out the empty case so output channels are simply empty.
- BEDTOOLS_GETFASTA and SEQKIT_TRANSLATE got generic ${meta.id}.{fa,fasta} filenames; pin to .catalogue.nt.fa and .catalogue.aa.fasta via bundled ext.prefix. Add ext.args = '-split -s -nameOnly' to getfasta (splice-aware extraction of BED12 blocks) and '--trim' to seqkit translate (drop trailing stop codons).
- meta.yml: cite all five caller tools with verified DOIs (RiboCode Xiao 2018, Ribo-TISH Zhang 2017, Ribotricer Choudhary 2020, Rp-Bp Malone 2017, PRICE Erhard 2018); document that ch_orf_tables empty short-circuits; clarify GTF is required for ribocode/ribotish and optional for the rest; note that cohort-level output meta is hardcoded to [id:'cohort'].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
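
The "walk the preference list and skip None strings" fix for Ribo-TISH might look roughly like this. It is a sketch, not the parser itself: `pick_score` is a hypothetical helper, and the order after FisherPvalue in the preference tuple is an assumption.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Assumed preference order; FisherPvalue is the documented default.
RIBOTISH_SCORE_PREFERENCE = ("FisherPvalue", "RiboPvalue", "TISPvalue")


def pick_score(
    row: Mapping[str, str],
    preference: Sequence[str] = RIBOTISH_SCORE_PREFERENCE,
) -> Optional[Tuple[str, float]]:
    """Return (column, value) for the first usable score column, else None."""
    for column in preference:
        raw = row.get(column)
        # Ribo-TISH writes the literal string "None" when a value is unavailable.
        if raw is None or raw.strip() in {"", "None"}:
            continue
        return column, float(raw)
    return None
```

Returning the column name alongside the value makes it straightforward to record which source column was actually used.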

…es, tests
Surfaced + addressed via independent third-party assessment of branch 6597190 (parsers vs. canonical upstream output formats; merger clustering logic; test depth).
Parsers (orfnormalise):
- Ribotricer: detect-orfs output does NOT have the `coordinate` column (that lives in the prepare-orfs index file). The previous fallback treated the ORF_ID span as a single block, which for any multi-exon host transcript emits a BED12 record that spans introns - biologically nonsense and contaminates the catalogue AA FASTA produced via bedtools/getfasta + seqkit/translate. New _ribotricer_blocks_from_id takes the transcript map and intersects the ORF span with the host transcript's exon structure (same pattern ribotish's fallback already uses), recovering proper multi-exon blocks.
- PRICE: orf_id_raw.split("__", 1)[0] used double-underscore; actual ID format is `<tid>_<type>_<index>` (single underscores). Was dead code for fixtures with Gene populated, but still wrong; switch to single underscore so the transcript_id lookup against the GTF works.
Merger (orfmerge):
- cluster_by_transcript collapsed every uORF / dORF / other ORF sharing a transcript_id into one cluster. A transcript can host multiple distinct uORFs (biologically common) so the catalogue under-reported them. Split into cluster_by_transcript (canonical_cds only) + cluster_by_transcript_position (uORF/dORF/other; keyed on outer span as well so distinct positional ORFs stay separate).
- Document order-dependence of cluster_by_reciprocal_overlap at the threshold boundary.
- Drop dead `_bed_key` helper.
Tests:
- Snapshot-only assertions caught changes but not correctness. Add content-level checks per caller: aa_length > 0 for all rows, score column populated, orf_class distribution not collapsed to a single bucket, multi-block BED12 records for ribotricer and price. These would have caught every parser bug in the assessment.
- Merger test asserts called_by_*/score_* columns are coherent (catches the case where a parser bug populates the indicator but the score column is empty).
Documentation (meta.yml):
- Cite all five caller tools with verified DOIs (already in previous commit) - add note that Rp-Bp's predicted-orfs BED has no orf_type column so this module defaults Rp-Bp calls to canonical_cds.
- Note PRICE iORF/intronic/orphan -> 'other' collapse (lossy).
- Note ribotricer multi-exon block recovery via GTF + that GTF is strongly recommended for ribotricer (not just ribocode/ribotish).
- Subworkflow: AA fasta now lands as `${cohort}.catalogue.aa.fasta` (from the previous commit).
Paired test-datasets PR nf-core/test-datasets#2070 also gets a small fixture update: the ribotish/predict fixture's source run lacked a TIS BAM so every p-value column came out as the literal string "None", which silently masked the broken pvalue parser. Substitute the real RiboPvalue value into the FisherPvalue column via awk so the parser path is exercised; documented in the fixture README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
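
The idea behind the multi-exon block recovery (intersecting an ORF's outer span with the host transcript's exons) can be sketched as follows. The function names are hypothetical and unrelated to `_ribotricer_blocks_from_id`; coordinates are assumed to be 0-based and half-open, as in BED.

```python
def blocks_from_span(orf_start, orf_end, exons):
    """Clip each exon to the ORF span and keep the non-empty pieces."""
    blocks = []
    for exon_start, exon_end in sorted(exons):
        start = max(orf_start, exon_start)
        end = min(orf_end, exon_end)
        if start < end:
            blocks.append((start, end))
    return blocks


def bed12_block_fields(blocks):
    """Return (blockCount, blockSizes, blockStarts) for a sorted block list."""
    chrom_start = blocks[0][0]
    sizes = ",".join(str(end - start) for start, end in blocks)
    # BED12 block starts are relative to the record's chromStart.
    starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    return len(blocks), sizes, starts
```

For an ORF spanning three exons, this yields three blocks and excludes the intronic sequence that a single-block record would otherwise pass to bedtools/getfasta.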

…provenance header
The source columns the parser reads for each derived TSV field (`score`, `orf_type`, `aa_length`, length-for-aa-derivation) were previously hardcoded in each parser. Some callers expose multiple meaningful choices - Ribo-TISH has TISPvalue / RiboPvalue / FisherPvalue (plus Q-values), Rp-Bp has bayes_factor_mean / chi_square_p / p_translated_mean - and there was no way to override the default chain.
This change:
- Plumbs `task.ext.args` through the module to the template.
- Adds `--score-field`, `--orf-type-field`, `--length-field`, `--aa-length-field` CLI options. When set, an override replaces the per-caller default chain (so the user gets exactly what they asked for).
- Refactors the rpbp parser from positional column indices to csv.DictReader + named RPBP_COLUMNS so users can address columns by name (e.g. `--score-field bayes_factor_var`).
- Centralises field-name preference chains into a DEFAULT_FIELDS table next to the parsers.
- Writes a `# parser_columns: caller=<X> score=<col> orf_type=<col> ...` provenance header at the top of the normalised TSV, so consumers (and reviewers) can verify which source column was actually read for each derived field. Standard csv.DictReader and pandas `comment='#'` skip the line automatically.
Also updates custom/orfmerge's load_normalised to skip `#`-prefixed comment lines so the new provenance header doesn't leak into the cohort catalogue.
Test changes:
- Existing per-caller tests gain a provenance-line assertion (verifies `# parser_columns:` is present and reports the expected default column for at least the ribocode case).
- New `homo_sapiens [chr20] - ribotish - score-field override` test runs with `ext.args = '--score-field RiboPvalue'` and asserts the provenance line reports `score=RiboPvalue` rather than the default `FisherPvalue`.
- Tests now filter `#`-prefixed lines before parsing the TSV header.
Verified end-to-end on chr19/chr20 fixtures: default + override paths emit the right provenance line, and all parser-specific assertions (non-zero aa_length, populated score column, non-collapsed class distribution, multi-exon BED12 blocks) continue to pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
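
The override behaviour described in this commit (an explicit `--score-field` replaces the per-caller default chain, and the choice is recorded in the provenance header) might be wired up along these lines. This is a sketch under assumptions: the `DEFAULT_FIELDS` excerpt, `resolve_fields` and `provenance_line` are illustrative stand-ins, and only the score option is shown.

```python
import argparse
import shlex

# Hypothetical excerpt of a per-caller preference table.
DEFAULT_FIELDS = {
    "ribotish": {"score": ["FisherPvalue", "RiboPvalue", "TISPvalue"]},
    "rpbp": {"score": ["bayes_factor_mean"]},
}


def resolve_fields(caller: str, ext_args: str = "") -> dict:
    """Return the column chains to try, applying any CLI override."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--score-field")
    args, _unknown = parser.parse_known_args(shlex.split(ext_args))
    chains = {field: list(cols) for field, cols in DEFAULT_FIELDS[caller].items()}
    if args.score_field:
        # An override replaces the whole default chain; it is not prepended.
        chains["score"] = [args.score_field]
    return chains


def provenance_line(caller: str, used: dict) -> str:
    """Format the header recording which source column fed each derived field."""
    pairs = " ".join(f"{field}={column}" for field, column in used.items())
    return f"# parser_columns: caller={caller} {pairs}"
```

Replacing the chain, not prepending to it, means a misspelled column name fails visibly and does not silently fall back to a default.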

… subworkflow channel now carries it as val too
nf-core/modules has a hardcoded `permitted_meta_keys = {"id", "single_end"}` allow-list - any other meta.<key> reference in main.nf fails the `main_nf_meta_key` lint check. The previous design embedded the per-record caller id in meta.caller, which violated that policy and tripped CI lint.
Lift the caller out of meta and into a proper val input - the pattern used by raxmlng (val(model)), last/mafconvert (val(format)), clair3 (val(platform)), and similar nf-core modules:
- Module input: `tuple val(meta), path(orfs_table), val(caller)` plus the unchanged `tuple val(meta2), path(gtf)`. The `caller` value goes into a per-caller enum in meta.yml (one of ribocode / ribotish / ribotricer / rpbp / price).
- Module main.nf no longer references meta.caller; the script binds `caller` directly from the input tuple.
- Subworkflow `ch_orf_tables` API also moves caller out of meta: `[ val(meta), path(orf_table), val(caller) ]`. Subworkflow appends caller to meta.id locally for the normalise call so the merger's `beds/*` staging gets unique per-caller filenames; the merger then rebuilds meta as `[id: 'cohort']` so no caller leakage downstream.
- Subworkflow's bundled `nextflow.config` drops the `withName: CUSTOM_ORFNORMALISE` prefix override (the meta.id append in main.nf does the work).
Pre-commit fixes:
- ruff E741: rename `l` to `line` in write_outputs (loop variable).
- ruff UP015: drop the redundant `"r"` mode arg from open() in open_text.
- ruff format: minor reformatting of long literal lists.
Tests:
- All 6 orfnormalise tests + 1 merger test + 1 subworkflow test pass module input as the new 3-tuple (meta carries only `id`; caller is the trailing val). Merger setup config still constructs `meta + [caller: ...]` for prefix disambiguation - that meta key is consumed by the *test* setup_prefix.config (not by the module), which is fine since lint only scans modules' main.nf.
Local `nf-core modules lint custom/orfnormalise` now passes cleanly (only the cosmetic Wave container-version warnings remain, shared with the reference custom/bed12codonpositions module).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rces
Pre-merge upstream PRs leave `custom/bed12codonpositions`, `custom/orfnormalise`, `custom/orfmerge` and the `orftable_fasta_gtf_buildorfcatalogue` subworkflow without a modules.json entry, which lets `nf-core lint` abort with an interactive "Was the module installed from a different branch" prompt that fails under CI's no-TTY shell.
Register them all under the existing https://github.com/pinin4fjords/nf-core-modules.git entry (same pattern as dotseq/dotseq). Once nf-core/modules#11740 and the custom-bed12codonpositions PR merge to master, swap the source and branches/SHAs via `nf-core modules update`.
[skip ci]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pinin4fjords and others added 6 commits May 21, 2026 17:16

pinin4fjords marked this pull request as ready for review May 21, 2026 18:08

pinin4fjords mentioned this pull request May 22, 2026

feat: cross-sample ORF catalogue nf-core/riboseq#187

Draft

3 tasks

pinin4fjords wants to merge 7 commits into nf-core:master from pinin4fjords:custom-orf-catalogue
