Skip to content

feat(subworkflows): gtf_hybridmerge_gffcompare#11729

Merged
pinin4fjords merged 21 commits into
nf-core:masterfrom
pinin4fjords:feat/subworkflow-gtf-buildhybridannotation
May 21, 2026
Merged

feat(subworkflows): gtf_hybridmerge_gffcompare#11729
pinin4fjords merged 21 commits into
nf-core:masterfrom
pinin4fjords:feat/subworkflow-gtf-buildhybridannotation

Conversation

@pinin4fjords
Copy link
Copy Markdown
Member

@pinin4fjords pinin4fjords commented May 21, 2026

Summary

Adds a new subworkflow gtf_hybridmerge_gffcompare that builds a hybrid GTF from a novel-transcript GTF and a reference annotation:

  1. gffcompare classifies every novel transcript against the reference (assigns class codes).
  2. gawk keeps only transcripts whose class code is in a user-supplied set (e.g. "u" for novel intergenic, "u,j,i" for a wider net).
  3. bedtools intersect -v (optional) subtracts survivors that overlap a user-supplied blacklist BED, e.g. rRNA or repeat regions. Skipped by passing channel.empty().
  4. gawk concatenates the survivors with an annotation backbone GTF, stripping comment lines and synthesising any gene row whose gene_id is referenced by a novel transcript but absent from the backbone. Synth gene rows span the union of their child transcripts' coordinates so the gene span is scientifically correct, not just the first transcript's footprint.

The output is a self-consistent hybrid GTF suitable as a --sjdbGTFfile for a second STAR pass, as input to gffread for transcriptome FASTA generation, or as the annotation any other downstream tool consumes.

The subworkflow takes a separate ch_reference_gtf (used by gffcompare for classification) and ch_backbone_gtf (the annotation skeleton the survivors are grafted onto). For most callers these are the same file - pass the same channel twice. Callers that want classification done against a fuller annotation but the novel transcripts grafted onto a reduced backbone (e.g. canonical-only) pass two different channels.

The emitted hybrid_gtf meta is the backbone's meta, not the novel input's. The hybrid GTF is logically the backbone annotation with novel transcripts grafted on, so its identity (and any downstream .join() that wants to align it with star-index, transcriptome-fasta or other backbone-keyed channels) should follow the backbone.

Name

gtf_hybridmerge_gffcompare follows the spec's "name by purpose" branch (the subworkflow has conditional logic - the optional blacklist intersect) plus the dominant tool. The compound verb "hybridmerge" matches the subsampledepth/markduplicates shape used by existing subworkflows; "build" and "annotation" were dropped from an earlier draft as redundant (every subworkflow produces output; a GTF is an annotation). Happy to revisit if the reviewer prefers a different form.

Components

  • gffcompare (existing)
  • gawk (existing, aliased as GAWK_FILTER and GAWK_CONCAT)
  • bedtools/intersect (existing)

No new modules are introduced. The two gawk programs (class-code filter and backbone-plus-novel concat) are buffer-then-emit so the order of transcript / exon rows in their inputs doesn't matter; the awk source lives in awk/{filter,concat}.awk next to the subworkflow main.nf and is passed to GAWK via the program_file input (${moduleDir}/awk/..., so the subworkflow stays location-independent). Consumers includeConfig the bundled nextflow.config or pass it with -c for the prefix / suffix / closure-form args.

Caller-policy bits

The bundled config sets ext.args = '-v' on BEDTOOLS_INTERSECT so the channel name ch_blacklist_bed actually subtracts. Consumers that want strand-aware subtraction (-v -s) or positive overlap can override in their own modules.config.

The class-codes value flows through the subworkflow's val_class_codes take onto meta.class_codes and is read at task launch via a closure-form ext.args on GAWK_FILTER. The key is dropped from meta before the hybrid_gtf is emitted (carried by the backbone-meta swap at the concat step) so the value cannot leak downstream and break consumer joins.

Scope

Input format: standard GTF2 with gene_id / transcript_id attributes and gene / transcript feature types. GFF3 dialects and prokaryotic GTFs without transcript rows are out of scope. Surveyed nf-core/rnaseq's issue queue around attribute / feature-type edge cases: the only recurring concern is prokaryotic annotations, which don't apply to hybrid-transcriptome building. Parameterising attribute / feature-type names would add surface area for no real-world benefit in the consumers that need this subworkflow.

Resource sizing: GAWK steps run under process_single and hold the full input in awk arrays. Fine for annotation-scale GTFs; consumers feeding >~500 MB GTFs should override memory / cpus directly via withName.

Test plan

  • nf-test test subworkflows/nf-core/gtf_hybridmerge_gffcompare/tests/main.nf.test --profile docker passes locally (2 tests covering both the no-blacklist passthrough branch and the with-blacklist intersect branch).
  • nf-core subworkflows lint gtf_hybridmerge_gffcompare clean (28 passed, 0 warned, 0 failed).
  • pre-commit run clean.
  • CI green.

Tests use existing nf-core/test-datasets:modules fixtures: chr21_gencode.gtf as reference + backbone, genome.gtf (chr22 subset) as the novel input, genome.blacklist_intervals.bed (chr22) as the blacklist. The cross-contig setup gives every chr22 novel transcript a class_code "u" against the chr21 reference, exercising the class-code filter, the blacklist subtraction, and the synth-gene-row code path in one chain. No new test fixtures required.

Beyond the snapshot, the tests assert:

  • every non-comment line in the hybrid GTF carries a well-formed gene_id "..." attribute (catches gawk/Groovy string-escaping regressions),
  • the hybrid_gtf meta equals the backbone meta exactly (catches both meta-source mistakes and key-leakage),
  • at least one synthesised gene row appears on the novel contig (confirms the synth code path actually fired),
  • every synthesised gene row's span equals the min(start) / max(end) of its child transcripts (catches a regression to first-transcript coords).

🤖 Generated with Claude Code

pinin4fjords and others added 8 commits May 21, 2026 10:35
New subworkflow that classifies a novel-transcript GTF against a reference
annotation with gffcompare, filters the annotated GTF by class code with
gawk, optionally subtracts intervals overlapping a user-supplied blacklist
BED via bedtools intersect, and merges the survivors into a backbone
annotation. Missing gene rows for surviving novel transcripts are
synthesised from transcript coordinates.

Caller-policy bits (the specific class codes to keep, the bedtools args
for the blacklist intersect) flow through take/emit channels or are set
by the consumer pipeline's modules.config; the bundled nextflow.config
ships only the two static awk programs and the prefix/suffix settings
the chain needs to stop input/output names colliding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… for existing test-datasets fixtures

Switches both tests to real fixtures from nf-core/test-datasets:modules so
the chain is exercised end-to-end with proper annotation data and the
test-datasets browser stays the canonical source for the inputs:

  - reference + backbone = data/genomics/homo_sapiens/genome/chr21/sequence/chr21_gencode.gtf
  - novel               = data/genomics/homo_sapiens/genome/genome.gtf  (chr22 subset)
  - blacklist           = data/genomics/homo_sapiens/genome/genome.blacklist_intervals.bed

Cross-contig setup means every chr22 novel transcript is classified as
class_code 'u' by gffcompare, then class-code filtered, then either
passed through to the concat or dropped by the chr22 blacklist intersect,
and finally grafted onto the chr21 backbone with synthesised gene rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d step comments

The concat awk's synthesised gene rows were printing literally
`gene_id  gid ;` instead of `gene_id "ENSG..."`. Groovy single-quoted
strings interpret `\"` as `"`, which broke the awk string-quote
boundaries; the runtime saw `"gene_id "" gid "";"` and parsed `gid` as
literal text rather than the captured variable. Doubled the backslash
(`\\"`) so Groovy emits the literal `\"` awk needs.

Also added concise step comments to the subworkflow main.nf walking a
reader through the chain (gffcompare classification, class-code filter,
optional blacklist branch, synthetic gene rows).

Verified end-to-end against the test fixtures: chr21 reference (272
genes / 367 transcripts), chr22 novel input (12 transcripts), all
classified `class_code "u"`, blacklist intersect drops to expected
counts, synth rows now carry proper `gene_id "ENSG..."` attribute, chr21
backbone preserved byte-for-byte by the concat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…te format

Adds content assertions next to the existing snapshot so a regression in
the concat awk's gene_id attribute formatting fails the test instead of
just changing the snapshot hash. Specifically:

- every non-comment line in the hybrid GTF must contain a well-formed
  `gene_id "..."` attribute (catches the broken-quote class of bug that
  produced `gene_id  gid ;` before).
- the no-blacklist test additionally asserts at least one synthesised
  gene row exists on chr22 (the novel contig), confirming the concat awk
  actually grafted novel genes onto the chr21 backbone rather than only
  passing the backbone through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s from emitted meta

The `class_codes` key was folded onto meta so GAWK_FILTER's closure-form
`ext.args` could read it at task launch, but the key then leaked all the
way through to the hybrid_gtf output meta. A consumer pipeline joining
the hybrid_gtf channel against another channel keyed on the same id
(e.g. `[id:'sample']`) would have failed the join because the meta maps
no longer matched.

Strip the key immediately after GAWK_FILTER consumes it. Add a test
assertion that the emitted hybrid_gtf meta does not contain
`class_codes`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…id_gtf

The hybrid GTF is the backbone annotation with novel transcripts merged
in, so its identity should follow the backbone, not the novel input. The
emit now carries `ch_backbone_gtf`'s meta (e.g. nf-core/riboseq's
`[id: 'hybrid_reference']`) rather than the novel input's
(`[id: 'stringtie_merge']`), which is what riboseq's local CONCAT_GTF
already does before its rewrite to this subworkflow.

Swapping the meta at the concat-input stage also naturally drops the
transient `class_codes` key folded onto the novel meta for GAWK_FILTER,
so the explicit strip step is no longer needed.

Test assertion changed from "no class_codes in meta" to the stricter
"meta equals backbone meta", which catches both the leakage class of bug
and the wrong-meta-source class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y calls

Nextflow 26.04's strict parser rejects the legacy `Channel.X` form;
factory methods on the channel namespace must be lowercase. Swap
`Channel.of`, `Channel.value`, and `Channel.empty` to `channel.of`,
`channel.value`, `channel.empty` in main.nf, meta.yml, and the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords pinin4fjords marked this pull request as ready for review May 21, 2026 10:16
pinin4fjords and others added 2 commits May 21, 2026 11:32
…ridmerge_gffcompare

Tightens the directory / workflow / channel-selector name to better
match the established subworkflow naming precedent. The previous form
was 22 characters of purpose ("buildhybridannotation") which read long
next to peers like `bam_subsampledepth_samtools` (14 chars,
"subsampledepth") or `bam_markduplicates_picard` ("markduplicates").

"build" is redundant (every subworkflow produces something) and
"annotation" duplicates the gtf_ file-type prefix; "hybridmerge"
captures the action (merge novel into a hybrid annotation) in the same
compound-verb shape as the precedents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…om:pinin4fjords/nf-core-modules into feat/subworkflow-gtf-buildhybridannotation
@pinin4fjords pinin4fjords changed the title feat(subworkflows): gtf_buildhybridannotation_gffcompare feat(subworkflows): gtf_hybridmerge_gffcompare May 21, 2026
pinin4fjords and others added 4 commits May 21, 2026 11:35
One short line per step; drop the multi-line restatements of what the
code does. The non-obvious bits (why we fold class_codes onto meta, why
backbone goes first into the concat awk, that bedtools args are caller
policy) survive in one sentence each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit was a comments-only change marked [skip ci]; this empty
commit re-triggers CI so the PR shows a green status on the final state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit accidentally carried a skip-CI marker in its body, so
GitHub Actions matched it and bypassed the workflows. This empty commit
re-runs CI so the PR shows a green status on the final state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent filter awk

Addresses self-review notes M1, M2, M3, L1:

M1 - the bundled nextflow.config now sets `ext.args = '-v'` on
BEDTOOLS_INTERSECT so the channel name `ch_blacklist_bed` actually
subtracts intervals (matches what "blacklist" means everywhere else).
Previously the test was running positive overlap by default, which was
the opposite of the documented semantics. Consumers can still override
in their own modules.config for strand-aware (-v -s) or positive
intersect.

M2 - GAWK_FILTER's awk is now buffer-then-emit instead of single-pass.
The previous form silently dropped exons if any exon row preceded its
parent transcript (gffcompare's annotated.gtf always emits transcripts
first in practice, but the assumption was a footgun). The new awk
records keep_tx[] in a single pass over the input then prints survivors
from the buffer in original input order during END.

M3 - gffcompare_stats added to both snapshot blocks; it was in the
emit and documented in meta.yml but absent from the snapshot, so
regressions in the classification step were not caught.

L1 - meta.yml `ch_novel_gtf` description now notes the
single-novel-vs-single-backbone design constraint and points callers
with multiple novel GTFs at a pre-merge.

The hybrid_gtf md5 changes for the with-blacklist test (now reflects
real subtraction instead of overlap-only) - that snapshot rebake is
correct, not a regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords added a commit to pinin4fjords/riboseq that referenced this pull request May 21, 2026
Swap modules/local/gedi/{indexgenome,price} for the nf-core/modules
versions (git_sha 51af5cb84874647aa4733742b86142025705b042). The
upstream modules differ only by convention: configurable ext.prefix
on the indexgenome output directory and an additive versions_gedi
emit alongside the versions topic. Setting ext.prefix='price_index'
on GEDI_INDEXGENOME preserves the existing published path and the
${index}/reference.oml contract consumed by GEDI_PRICE, so behaviour
is unchanged.

ribocode/* and ribotish/* were already moved to modules/nf-core/
via earlier `nf-core modules install` work, so this commit only
handles gedi.

Follow-up housekeeping once these PRs merge:
- rpbp/* via nf-core/modules#11695
- concat_gtf and filter_gtf_class_code via nf-core/modules#11729

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords and others added 4 commits May 21, 2026 12:16
…ild transcripts

Addresses L2 properly rather than leaving it as flagged. The previous
synth logic emitted a gene row using the first novel transcript's
coordinates per gene_id, which under-reported the gene's footprint when
the gene had multiple novel transcripts. STAR / Salmon / featureCounts /
gffread don't read gene-row coords, but a strict reader (e.g. a genome
browser, a custom span-based filter) would have seen the wrong span.

GAWK_CONCAT is now a buffer-then-emit awk: pass over the input tracks
`gene_seen[]` for backbone gene rows and accumulates `min_start[gid]`,
`max_end[gid]` for every novel transcript; END walks the buffered lines
in original order and, immediately before the first transcript of a
novel gene_id, emits the synth gene row with the union span. Same
buffer-then-emit pattern we used for GAWK_FILTER in the previous commit.

Added a defensive nf-test assertion that walks the chr22 portion of the
hybrid output and asserts every gene row's span equals the
min(start)/max(end) of its child transcripts, so a future regression to
first-transcript coords would fail loudly instead of silently changing
the snapshot.

Verified against the chr21+chr22 fixture: 7 chr22 gene_ids (3 with
multiple novel transcripts) all have correct union spans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esource profile

The two awk programs are unavoidably terse one-liners (have to fit in a
single `ext.args2` string). Adds prose block comments above each
`withName` walking a reader through BEGIN / per-line / END behaviour, so
nobody has to reverse-engineer the program to understand what it does.

meta.yml now documents:

- input format scope (GTF2 with `gene_id` / `transcript_id` and standard
  feature types - GFF3 and prokaryotic GTFs are out of scope). Surveyed
  the nf-core/rnaseq issue queue around GTF-attribute and feature-type
  edge cases; the recurring concern is prokaryotic annotations, which
  don't apply to hybrid-transcriptome building. Standard GTF attribute
  names are reliable across Ensembl / GENCODE / RefSeq / StringTie so
  parameterising them would add surface area for no real-world benefit.

- resource sizing: the buffer-then-emit awks hold the full input in
  memory, fine under `process_single` for annotation-scale GTFs. Note
  for consumers to bump the label if they're feeding >~500 MB GTFs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
withLabel resolution runs before withName, so setting `label` inside a
withName selector does not re-trigger label-based resource resolution.
The correct way to bump resources for the bundled GAWK steps is to set
the resource attributes (memory, cpus, time) directly inside withName.
Updated the prose in nextflow.config header and meta.yml accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread subworkflows/nf-core/gtf_hybridmerge_gffcompare/nextflow.config Outdated
Comment thread subworkflows/nf-core/gtf_hybridmerge_gffcompare/nextflow.config
pinin4fjords and others added 3 commits May 21, 2026 15:22
…lves Jonas review)

Both gawk programs now live as multi-line, comment-annotated `.awk`
files under `awk/` next to the subworkflow main.nf and are passed to
GAWK via the `program_file` input. Pattern follows the existing
`subworkflows/nf-core/utils_references/schema_references.json` precedent
for bundled-resource path resolution
(`${projectDir}/subworkflows/nf-core/<name>/...`), which travels with
the subworkflow when `nf-core subworkflows install` copies it into a
consumer pipeline.

The bundled `nextflow.config` drops both `ext.args2` walls of text plus
the long prose annotations that were trying to make them readable;
explanations now live as native awk comments inside the awk programs
where they belong. Output md5s unchanged - byte-identical hybrid GTF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords pinin4fjords requested a review from jonasscheid May 21, 2026 14:41
Copy link
Copy Markdown
Contributor

@jonasscheid jonasscheid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good idea with outsourcing the awk programms

@pinin4fjords pinin4fjords added this pull request to the merge queue May 21, 2026
Merged via the queue into nf-core:master with commit bf60969 May 21, 2026
23 checks passed
@pinin4fjords pinin4fjords deleted the feat/subworkflow-gtf-buildhybridannotation branch May 21, 2026 14:54
pinin4fjords added a commit to pinin4fjords/riboseq that referenced this pull request May 21, 2026
nf-core/modules#11729 has merged. Pin the subworkflow at the merge
commit (sha bf60969) on master rather than carrying it as a vendored
fork-branch copy. No file changes - the subworkflow contents already
match master from the most recent sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants