feat(custom): orfnormalise + orfmerge modules + orftable_fasta_gtf_buildorfcatalogue subworkflow #11740

Open
pinin4fjords wants to merge 7 commits into nf-core:master from pinin4fjords:custom-orf-catalogue

pinin4fjords (Member) commented May 21, 2026

What

Ribo-seq experiments measure which open reading frames (ORFs) on the transcriptome are actively translated by ribosomes. Five widely used callers (RiboCode, Ribo-TISH, Ribotricer, Rp-Bp, PRICE) each emit ORF predictions in a different table format, with different score semantics and classification vocabularies. There is currently no upstream way to combine them into one cohort catalogue.

This PR adds three components that fill that gap:

  • custom/orfnormalise - parses one caller's output into a unified BED12 + sidecar TSV. The caller (one of ribocode, ribotish, ribotricer, rpbp, price) is supplied as a val input.
  • custom/orfmerge - class-aware clustering of normalised calls across callers and samples into one cohort catalogue.
  • orftable_fasta_gtf_buildorfcatalogue subworkflow - composes the two with bedtools/getfasta + seqkit/translate to produce a catalogue AA FASTA.

Bundled because the merger's schema is defined by the normaliser, and the subworkflow only makes sense once both modules exist.

Harmonised schema

orf_class is the union vocabulary across the five callers:

Class          Meaning
canonical_cds  Maps to an annotated CDS (or truncated / extended variant)
uORF / dORF    ORF in the 5'UTR / 3'UTR of an annotated transcript
novel_u        Novel / intergenic ORF, no annotated host CDS
smORF          aa_length <= 100, regardless of location
other          Internal / overlap / frame variants

Score direction is per-caller (p-values lower-better, Bayes factors / phase scores higher-better); the merger keeps the direction-appropriate best per cluster.
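
A minimal sketch of direction-aware selection, not the module's actual code. The LOWER_IS_BETTER table and the best_score helper are hypothetical, and the direction assigned to each caller is an assumption that depends on which score column is configured.

```python
# Hypothetical direction table: True means a lower score is better (p-values).
# The per-caller directions below are assumptions for illustration only.
LOWER_IS_BETTER = {
    "ribocode": True,     # assumed p-value
    "ribotish": True,     # assumed p-value
    "price": True,        # assumed p-value
    "ribotricer": False,  # assumed phase score
    "rpbp": False,        # assumed Bayes factor
}


def best_score(caller, scores):
    """Return the direction-appropriate best score, ignoring missing values."""
    present = [s for s in scores if s is not None]
    if not present:
        return None
    return min(present) if LOWER_IS_BETTER[caller] else max(present)
```

For example, best_score("ribotish", [0.05, 0.001]) keeps the smaller p-value, while best_score("rpbp", [3.0, 12.5]) keeps the larger Bayes factor.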

Merger clustering is class-aware:

  • canonical_cds: by (transcript_id, strand) (one CDS per transcript).
  • uORF / dORF / other: by (transcript_id, strand, start, end) (a transcript can host multiple distinct positional ORFs).
  • novel_u / smORF: greedy reciprocal-overlap (default 0.8).
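
The greedy reciprocal-overlap step for novel_u / smORF can be sketched as follows. This is an illustration under simplifying assumptions, not the merger's implementation: it compares outer spans only (no BED12 blocks), uses 0-based half-open coordinates, and the function names are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between half-open spans a and b."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))


def cluster_greedy(orfs, threshold=0.8):
    """Group (chrom, strand, start, end) tuples; each cluster is seeded by its first member."""
    clusters = []  # list of lists; clusters[i][0] is the seed
    for orf in sorted(orfs):
        chrom, strand, start, end = orf
        for members in clusters:
            seed = members[0]
            same_locus = seed[:2] == (chrom, strand)
            if same_locus and reciprocal_overlap(seed[2:], (start, end)) >= threshold:
                members.append(orf)
                break
        else:
            # No existing seed overlaps enough: this ORF starts a new cluster.
            clusters.append([orf])
    return clusters
```

Because every candidate is compared against the cluster seed only, the grouping can change with input order when overlaps sit close to the threshold.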

Configurable source columns + provenance

Each caller's output exposes multiple candidate columns for score (Ribo-TISH alone has FisherPvalue / RiboPvalue / TISPvalue plus Q-value variants), orf_type, and length. Defaults are in meta.yml; override per-call via ext.args:

withName: 'CUSTOM_ORFNORMALISE_RIBOTISH' {
    ext.args = '--score-field FisherQvalue'  // FDR-adjusted, not raw p-value
}

Every normalised TSV starts with a # parser_columns: ... line recording exactly which source column was read for each derived field, so downstream consumers (and reviewers) can verify provenance. pandas skips the line when called with comment='#'; readers without comment handling, such as csv.DictReader, need #-prefixed lines filtered out first.
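
A sketch of how a consumer might read the provenance line and the table together. It assumes the header holds whitespace-separated key=value pairs, as in `# parser_columns: caller=<X> score=<col> ...`; read_normalised is a hypothetical helper and the column names in the usage note are illustrative.

```python
import csv
import io


def read_normalised(text):
    """Split a normalised TSV into (provenance dict, list of row dicts)."""
    provenance = {}
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            if body.startswith("parser_columns:"):
                pairs = body[len("parser_columns:"):].split()
                provenance = dict(p.split("=", 1) for p in pairs if "=" in p)
            continue  # comment lines never reach the TSV parser
        data_lines.append(line)
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    return provenance, list(reader)
```

Calling it on a file that begins with `# parser_columns: caller=ribotish score=FisherQvalue` would return a provenance dict with score mapped to FisherQvalue, alongside the data rows.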

Example output

Cohort catalogue TSV (called_by_* / score_* track which callers contributed):

orf_id        chrom  start     end       strand  gene_id          orf_class      aa_length  called_by_ribotish  called_by_ribocode  score_ribotish  score_ribocode
orf_00000001  20     16731702  16741005  +       ENSG00000125870  canonical_cds  224        0                   1                                   0.0101442
orf_00000002  20     17570208  17607146  +       ENSG00000125868  canonical_cds  164        0                   1                                   0.00112689

Subworkflow also emits multi-block BED12 (intron-skipping), orf_to_gene.tsv (many-to-many ORF → host lookup), catalogue AA FASTA, and a MultiQC custom-content per-class-count sidecar.

Test data

Paired with nf-core/test-datasets#2070 (five real-tool outputs sliced to <13 KB each from existing module-test outputs; full provenance in the test-data README). Tests reference the fork branch inline; will swap to params.modules_testdata_base_path once #2070 merges.

Test plan

  • All 5 callers' fixtures round-trip through custom/orfnormalise with content-level assertions (non-zero aa_length, populated score, non-collapsed orf_class distribution, multi-block BED12 for multi-exon callers, provenance header present).
  • custom/orfmerge chains two normaliser invocations and asserts coherent called_by_* / score_* columns.
  • Subworkflow chains GUNZIP + normaliser × 2 + merger + getfasta + translate end-to-end.
  • --score-field override exercised; provenance header confirms it.
  • nf-core modules lint + nf-core subworkflows lint clean (only the two cosmetic Wave container-version warnings shared with custom/bed12codonpositions remain).

pinin4fjords and others added 6 commits May 21, 2026 17:16
…e): scaffold ORF catalogue chain

Add a three-component upstream chain for building cross-caller Ribo-seq
ORF catalogues:

- custom/orfnormalise: dispatches on meta.caller in {ribocode, ribotish,
  ribotricer, rpbp, price}, emits unified BED12 + sidecar TSV with the
  harmonised orf_class vocabulary (canonical_cds / uORF / dORF / novel_u
  / smORF / other).
- custom/orfmerge: class-aware clustering across callers and samples,
  records called_by_<caller> / score_<caller> provenance with
  direction-aware best-score aggregation.
- orftable_fasta_gtf_buildorfcatalogue: composes normaliser x N callers
  + merger + bedtools/getfasta + seqkit/translate, emits BED12,
  catalogue TSV, orf_to_gene TSV, AA FASTA and an MQC sidecar.

Templates use the same python/pandas/pyyaml Wave container as
custom/bed12codonpositions. Stub tests scaffolded; real-fixture tests
follow once the paired test-datasets PR lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_ID fallback

- orfnormalise: 5 per-caller tests + 1 stub (all green). Fall back to
  parsing the ribotricer ORF_ID when the optional 'coordinate' column is
  absent (detect-orfs output does not carry it). Tolerate empty GTF
  inputs by requiring a real file before trying to read.
- orfmerge: chain via setup{} from orfnormalise outputs (ribotish +
  ribocode on chr20); disambiguate per-caller prefix via a
  bundled setup_prefix.config so the merger doesn't see name collisions.
- subworkflow: bundle nextflow.config that scopes the same caller-aware
  prefix to ORFTABLE_FASTA_GTF_BUILDORFCATALOGUE:CUSTOM_ORFNORMALISE.
  Test sets up GUNZIP for the chr20 FASTA since bedtools/getfasta needs
  uncompressed input.

Fixtures live in nf-core/test-datasets#2070; will switch the test URLs
from the pinin4fjords fork back to params.modules_testdata_base_path
once that PR merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d5 stubs

PRICE encodes strand as a single char appended to the chrom rather than
a separate colon-bracketed field (`19+:start-end`), so the previous
regex never matched and the parser returned an empty BED12 on real
data. Switch to a non-greedy `chrom` capture so the trailing strand
char is correctly extracted.
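
A sketch of the kind of pattern this describes, not the regex actually committed. It assumes a single-span location of the form `19+:start-end`; multi-part locations are not handled, and PRICE_LOCATION and parse_price_location are hypothetical names.

```python
import re

# Non-greedy chrom capture so the final +/- before ':' is read as the strand.
PRICE_LOCATION = re.compile(r"(?P<chrom>.+?)(?P<strand>[+-]):(?P<start>\d+)-(?P<end>\d+)")


def parse_price_location(location):
    """Return (chrom, strand, start, end) or None if the string does not match."""
    match = PRICE_LOCATION.fullmatch(location)
    if match is None:
        return None
    return (
        match.group("chrom"),
        match.group("strand"),
        int(match.group("start")),
        int(match.group("end")),
    )
```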

Drop the per-module stub tests: real-fixture tests already cover every
caller + the merge + the full subworkflow chain, and the stub snapshots
were tripping the test_snap_md5sum lint rule (empty-file md5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… parsers + empty-cohort handling

Inherited parser bugs (carried over from the local riboseq port):

- RiboCode: aa_length always 0 because the parser looked up an
  AA_length column that does not exist in predicted_orfs.txt. Derive
  from ORF_length (nt) instead.
- Rp-Bp: bf_mean read from column 15 (p_translated_var) instead of
  column 18 (bayes_factor_mean), and orf_type read from column 14
  (a metric, not a category). Rp-Bp's predicted-orfs BED has no orf_type
  column; default to canonical_cds for the post-selectfinalpredictionset
  curated set.
- Ribo-TISH: combined p-value never read because the parser looked for
  Pvalcombined/Pvalue/Pvalcom; actual columns are FisherPvalue /
  RiboPvalue / TISPvalue, with "None" string sentinels. Walk the
  preference list and skip None strings.
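
The Ribo-TISH fix can be sketched as a walk over a preference list. This is an illustration, not the parser's code: first_usable is a hypothetical helper, and the order of the default preference tuple is an assumption.

```python
def first_usable(row, preference=("FisherPvalue", "RiboPvalue", "TISPvalue")):
    """Return (column, value) for the first column that holds a real number."""
    for column in preference:
        raw = row.get(column)
        # Ribo-TISH writes the literal string "None" when a p-value is unavailable.
        if raw is None or raw.strip() in ("", "None"):
            continue
        try:
            return column, float(raw)
        except ValueError:
            continue
    return None, None
```

Returning the column name alongside the value is what makes it possible to report which source column was actually read.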

Subworkflow gaps:

- ch_orf_tables empty case crashed CUSTOM_ORFMERGE (arity '1..*'
  violated by .collect() emitting an empty list). Filter out the empty
  case so output channels are simply empty.
- BEDTOOLS_GETFASTA and SEQKIT_TRANSLATE got generic
  ${meta.id}.{fa,fasta} filenames; pin to .catalogue.nt.fa and
  .catalogue.aa.fasta via bundled ext.prefix. Add ext.args = '-split -s
  -nameOnly' to getfasta (splice-aware extraction of BED12 blocks) and
  '--trim' to seqkit translate (drop trailing stop codons).
- meta.yml: cite all five caller tools with verified DOIs (RiboCode
  Xiao 2018, Ribo-TISH Zhang 2017, Ribotricer Choudhary 2020, Rp-Bp
  Malone 2017, PRICE Erhard 2018); document that ch_orf_tables empty
  short-circuits; clarify GTF is required for ribocode/ribotish and
  optional for the rest; note that cohort-level output meta is
  hardcoded to [id:'cohort'].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es, tests

Surfaced + addressed via independent third-party assessment of branch
6597190 (parsers vs. canonical upstream output formats; merger
clustering logic; test depth).

Parsers (orfnormalise):

- Ribotricer: detect-orfs output does NOT have the `coordinate` column
  (that lives in the prepare-orfs index file). The previous fallback
  treated the ORF_ID span as a single block, which for any multi-exon
  host transcript emits a BED12 record that spans introns. That is
  biologically meaningless and contaminates the catalogue AA FASTA produced via
  bedtools/getfasta + seqkit/translate. New _ribotricer_blocks_from_id
  takes the transcript map and intersects the ORF span with the host
  transcript's exon structure (same pattern ribotish's fallback already
  uses), recovering proper multi-exon blocks.
- PRICE: orf_id_raw.split("__", 1)[0] used double-underscore; actual ID
  format is `<tid>_<type>_<index>` (single underscores). Was dead code
  for fixtures with Gene populated, but still wrong; switch to single
  underscore so the transcript_id lookup against the GTF works.
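
The exon-intersection idea behind the ribotricer block recovery can be sketched as below. This is not the body of _ribotricer_blocks_from_id: blocks_from_exons is a hypothetical helper, and it assumes 0-based half-open genomic coordinates for both the ORF span and the exons.

```python
def blocks_from_exons(orf_start, orf_end, exons):
    """Clip an ORF span to a transcript's exons and return BED12 block fields.

    Returns (chromStart, chromEnd, blockCount, blockSizes, blockStarts),
    or None when the span touches no exon.
    """
    blocks = []
    for exon_start, exon_end in sorted(exons):
        start, end = max(orf_start, exon_start), min(orf_end, exon_end)
        if start < end:  # keep only exons that the ORF span overlaps
            blocks.append((start, end))
    if not blocks:
        return None
    chrom_start, chrom_end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - chrom_start) for s, _ in blocks)
    return chrom_start, chrom_end, len(blocks), sizes, starts
```

With exons at 100-200, 300-400 and 500-600, an ORF span of 150-550 yields three blocks rather than one intron-spanning record.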

Merger (orfmerge):

- cluster_by_transcript collapsed every uORF / dORF / other ORF sharing
  a transcript_id into one cluster. A transcript can host multiple
  distinct uORFs (biologically common) so the catalogue under-reported
  them. Split into cluster_by_transcript (canonical_cds only) +
  cluster_by_transcript_position (uORF/dORF/other; keyed on outer span
  as well so distinct positional ORFs stay separate).
- Document order-dependence of cluster_by_reciprocal_overlap at the
  threshold boundary.
- Drop dead `_bed_key` helper.

Tests:

- Snapshot-only assertions caught changes but not correctness. Add
  content-level checks per caller: aa_length > 0 for all rows, score
  column populated, orf_class distribution not collapsed to a single
  bucket, multi-block BED12 records for ribotricer and price. These
  would have caught every parser bug in the assessment.
- Merger test asserts called_by_*/score_* columns are coherent
  (catches the case where a parser bug populates the indicator but
  the score column is empty).

Documentation (meta.yml):

- Cite all five caller tools with verified DOIs (already in previous
  commit) - add note that Rp-Bp's predicted-orfs BED has no orf_type
  column so this module defaults Rp-Bp calls to canonical_cds.
- Note PRICE iORF/intronic/orphan -> 'other' collapse (lossy).
- Note ribotricer multi-exon block recovery via GTF + that GTF is
  strongly recommended for ribotricer (not just ribocode/ribotish).
- Subworkflow: AA fasta now lands as `${cohort}.catalogue.aa.fasta`
  (from the previous commit).

Paired test-datasets PR nf-core/test-datasets#2070 also gets a small
fixture update: the ribotish/predict fixture's source run lacked a TIS
BAM so every p-value column came out as the literal string "None",
which silently masked the broken pvalue parser. Substitute the real
RiboPvalue value into the FisherPvalue column via awk so the parser
path is exercised; documented in the fixture README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…provenance header

The source columns the parser reads for each derived TSV field
(`score`, `orf_type`, `aa_length`, length-for-aa-derivation) were
previously hardcoded in each parser. Some callers expose multiple
meaningful choices - Ribo-TISH has TISPvalue / RiboPvalue / FisherPvalue
(plus Q-values), Rp-Bp has bayes_factor_mean / chi_square_p /
p_translated_mean - and there was no way to override the default chain.

This change:

- Plumbs `task.ext.args` through the module to the template.
- Adds `--score-field`, `--orf-type-field`, `--length-field`,
  `--aa-length-field` CLI options. When set, an override replaces the
  per-caller default chain (so the user gets exactly what they asked
  for).
- Refactors the rpbp parser from positional column indices to
  csv.DictReader + named RPBP_COLUMNS so users can address columns by
  name (e.g. `--score-field bayes_factor_var`).
- Centralises field-name preference chains into a DEFAULT_FIELDS
  table next to the parsers.
- Writes a `# parser_columns: caller=<X> score=<col> orf_type=<col>
  ...` provenance header at the top of the normalised TSV, so
  consumers (and reviewers) can verify which source column was
  actually read for each derived field. pandas skips the line with
  `comment='#'`; csv.DictReader consumers must filter `#`-prefixed
  lines first.

Also updates custom/orfmerge's load_normalised to skip `#`-prefixed
comment lines so the new provenance header doesn't leak into the
cohort catalogue.

Test changes:

- Existing per-caller tests gain a provenance-line assertion (verifies
  `# parser_columns:` is present and reports the expected default
  column for at least the ribocode case).
- New `homo_sapiens [chr20] - ribotish - score-field override` test
  runs with `ext.args = '--score-field RiboPvalue'` and asserts the
  provenance line reports `score=RiboPvalue` rather than the default
  `FisherPvalue`.
- Tests now filter `#`-prefixed lines before parsing the TSV header.

Verified end-to-end on chr19/chr20 fixtures: default + override paths
emit the right provenance line, and all parser-specific assertions
(non-zero aa_length, populated score column, non-collapsed class
distribution, multi-exon BED12 blocks) continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords marked this pull request as ready for review May 21, 2026 18:08
… subworkflow channel now carries it as val too

nf-core/modules has a hardcoded `permitted_meta_keys = {"id",
"single_end"}` allow-list - any other meta.<key> reference in main.nf
fails the `main_nf_meta_key` lint check. The previous design embedded
the per-record caller id in meta.caller, which violated that policy
and tripped CI lint.

Lift the caller out of meta and into a proper val input - the pattern
used by raxmlng (val(model)), last/mafconvert (val(format)),
clair3 (val(platform)), and similar nf-core modules:

- Module input: `tuple val(meta), path(orfs_table), val(caller)` plus
  the unchanged `tuple val(meta2), path(gtf)`. The `caller` value goes
  into a per-caller enum in meta.yml (one of ribocode / ribotish /
  ribotricer / rpbp / price).
- Module main.nf no longer references meta.caller; the script binds
  `caller` directly from the input tuple.
- Subworkflow `ch_orf_tables` API also moves caller out of meta:
  `[ val(meta), path(orf_table), val(caller) ]`. Subworkflow appends
  caller to meta.id locally for the normalise call so the merger's
  `beds/*` staging gets unique per-caller filenames; the merger then
  rebuilds meta as `[id: 'cohort']` so no caller leakage downstream.
- Subworkflow's bundled `nextflow.config` drops the
  `withName: CUSTOM_ORFNORMALISE` prefix override (the meta.id append
  in main.nf does the work).

Pre-commit fixes:

- ruff E741: rename `l` to `line` in write_outputs (loop variable).
- ruff UP015: drop the redundant `"r"` mode arg from open() in open_text.
- ruff format: minor reformatting of long literal lists.

Tests:

- All 6 orfnormalise tests + 1 merger test + 1 subworkflow test pass
  module input as the new 3-tuple (meta carries only `id`; caller is
  the trailing val). Merger setup config still constructs
  `meta + [caller: ...]` for prefix disambiguation - that meta key is
  consumed by the *test* setup_prefix.config (not by the module),
  which is fine since lint only scans modules' main.nf.

Local `nf-core modules lint custom/orfnormalise` now passes cleanly
(only the cosmetic Wave container-version warnings remain, shared
with the reference custom/bed12codonpositions module).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords added a commit to pinin4fjords/riboseq that referenced this pull request May 22, 2026
…rces

Pre-merge upstream PRs leave `custom/bed12codonpositions`,
`custom/orfnormalise`, `custom/orfmerge` and the
`orftable_fasta_gtf_buildorfcatalogue` subworkflow without a
modules.json entry, which causes `nf-core lint` to abort with an interactive
"Was the module installed from a different branch" prompt that fails
under CI's no-TTY shell.

Register them all under the existing
https://github.com/pinin4fjords/nf-core-modules.git entry (same pattern
as dotseq/dotseq). Once nf-core/modules#11740 and the
custom-bed12codonpositions PR merge to master, swap the source and
branches/SHAs via `nf-core modules update`.

[skip ci]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>