feat(subworkflows): gtf_hybridmerge_gffcompare by pinin4fjords · Pull Request #11729 · nf-core/modules

pinin4fjords · 2026-05-21T09:35:49Z

Summary

Adds a new subworkflow gtf_hybridmerge_gffcompare that builds a hybrid GTF from a novel-transcript GTF and a reference annotation:

gffcompare classifies every novel transcript against the reference (assigns class codes).
gawk keeps only transcripts whose class code is in a user-supplied set (e.g. "u" for novel intergenic, "u,j,i" for a wider net).
bedtools intersect -v (optional) subtracts survivors that overlap a user-supplied blacklist BED, e.g. rRNA or repeat regions. Skipped by passing channel.empty().
gawk concatenates the survivors with an annotation backbone GTF, stripping comment lines and synthesising any gene row whose gene_id is referenced by a novel transcript but absent from the backbone. Synth gene rows span the union of their child transcripts' coordinates so the gene span is scientifically correct, not just the first transcript's footprint.

The output is a self-consistent hybrid GTF suitable as a --sjdbGTFfile for a second STAR pass, as input to gffread for transcriptome FASTA generation, or as the annotation any other downstream tool consumes.

The subworkflow takes a separate ch_reference_gtf (used by gffcompare for classification) and ch_backbone_gtf (the annotation skeleton the survivors are grafted onto). For most callers these are the same file - pass the same channel twice. Callers that want classification done against a fuller annotation but the novel transcripts grafted onto a reduced backbone (e.g. canonical-only) pass two different channels.

The emitted hybrid_gtf meta is the backbone's meta, not the novel input's. The hybrid GTF is logically the backbone annotation with novel transcripts grafted on, so its identity (and any downstream .join() that wants to align it with star-index, transcriptome-fasta or other backbone-keyed channels) should follow the backbone.

Name

gtf_hybridmerge_gffcompare follows the spec's "name by purpose" branch (the subworkflow has conditional logic - the optional blacklist intersect) plus the dominant tool. The compound verb "hybridmerge" matches the subsampledepth/markduplicates shape used by existing subworkflows; "build" and "annotation" were dropped from an earlier draft as redundant (every subworkflow produces output; a GTF is an annotation). Happy to revisit if the reviewer prefers a different form.

Components

gffcompare (existing)
gawk (existing, aliased as GAWK_FILTER and GAWK_CONCAT)
bedtools/intersect (existing)

No new modules are introduced. The two gawk programs (class-code filter and backbone-plus-novel concat) are buffer-then-emit so the order of transcript / exon rows in their inputs doesn't matter; the awk source lives in awk/{filter,concat}.awk next to the subworkflow main.nf and is passed to GAWK via the program_file input (${moduleDir}/awk/..., so the subworkflow stays location-independent). Consumers includeConfig the bundled nextflow.config or pass it with -c for the prefix / suffix / closure-form args.

Caller-policy bits

The bundled config sets ext.args = '-v' on BEDTOOLS_INTERSECT so the channel name ch_blacklist_bed actually subtracts. Consumers that want strand-aware subtraction (-v -s) or positive overlap can override in their own modules.config.

The class-codes value flows through the subworkflow's val_class_codes take onto meta.class_codes and is read at task launch via a closure-form ext.args on GAWK_FILTER. The key is dropped from meta before the hybrid_gtf is emitted (carried by the backbone-meta swap at the concat step) so the value cannot leak downstream and break consumer joins.

Scope

Input format: standard GTF2 with gene_id / transcript_id attributes and gene / transcript feature types. GFF3 dialects and prokaryotic GTFs without transcript rows are out of scope. Surveyed nf-core/rnaseq's issue queue around attribute / feature-type edge cases: the only recurring concern is prokaryotic annotations, which don't apply to hybrid-transcriptome building. Parameterising attribute / feature-type names would add surface area for no real-world benefit in the consumers that need this subworkflow.

Resource sizing: GAWK steps run under process_single and hold the full input in awk arrays. Fine for annotation-scale GTFs; consumers feeding >~500 MB GTFs should override memory / cpus directly via withName.

Test plan

nf-test test subworkflows/nf-core/gtf_hybridmerge_gffcompare/tests/main.nf.test --profile docker passes locally (2 tests covering both the no-blacklist passthrough branch and the with-blacklist intersect branch).
nf-core subworkflows lint gtf_hybridmerge_gffcompare clean (28 passed, 0 warned, 0 failed).
pre-commit run clean.
CI green.

Tests use existing nf-core/test-datasets:modules fixtures: chr21_gencode.gtf as reference + backbone, genome.gtf (chr22 subset) as the novel input, genome.blacklist_intervals.bed (chr22) as the blacklist. The cross-contig setup gives every chr22 novel transcript a class_code "u" against the chr21 reference, exercising the class-code filter, the blacklist subtraction, and the synth-gene-row code path in one chain. No new test fixtures required.

Beyond the snapshot, the tests assert:

every non-comment line in the hybrid GTF carries a well-formed gene_id "..." attribute (catches gawk/Groovy string-escaping regressions),
the hybrid_gtf meta equals the backbone meta exactly (catches both meta-source mistakes and key-leakage),
at least one synthesised gene row appears on the novel contig (confirms the synth code path actually fired),
every synthesised gene row's span equals the min(start) / max(end) of its child transcripts (catches a regression to first-transcript coords).

🤖 Generated with Claude Code

New subworkflow that classifies a novel-transcript GTF against a reference annotation with gffcompare, filters the annotated GTF by class code with gawk, optionally subtracts intervals overlapping a user-supplied blacklist BED via bedtools intersect, and merges the survivors into a backbone annotation. Missing gene rows for surviving novel transcripts are synthesised from transcript coordinates. Caller-policy bits (the specific class codes to keep, the bedtools args for the blacklist intersect) flow through take/emit channels or are set by the consumer pipeline's modules.config; the bundled nextflow.config ships only the two static awk programs and the prefix/suffix settings the chain needs to stop input/output names colliding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… for existing test-datasets fixtures Switches both tests to real fixtures from nf-core/test-datasets:modules so the chain is exercised end-to-end with proper annotation data and the test-datasets browser stays the canonical source for the inputs: - reference + backbone = data/genomics/homo_sapiens/genome/chr21/sequence/chr21_gencode.gtf - novel = data/genomics/homo_sapiens/genome/genome.gtf (chr22 subset) - blacklist = data/genomics/homo_sapiens/genome/genome.blacklist_intervals.bed Cross-contig setup means every chr22 novel transcript is classified as class_code 'u' by gffcompare, then class-code filtered, then either passed through to the concat or dropped by the chr22 blacklist intersect, and finally grafted onto the chr21 backbone with synthesised gene rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d step comments The concat awk's synthesised gene rows were printing literally `gene_id gid ;` instead of `gene_id "ENSG..."`. Groovy single-quoted strings interpret `\"` as `"`, which broke the awk string-quote boundaries; the runtime saw `"gene_id "" gid "";"` and parsed `gid` as literal text rather than the captured variable. Doubled the backslash (`\\"`) so Groovy emits the literal `\"` awk needs. Also added concise step comments to the subworkflow main.nf walking a reader through the chain (gffcompare classification, class-code filter, optional blacklist branch, synthetic gene rows). Verified end-to-end against the test fixtures: chr21 reference (272 genes / 367 transcripts), chr22 novel input (12 transcripts), all classified `class_code "u"`, blacklist intersect drops to expected counts, synth rows now carry proper `gene_id "ENSG..."` attribute, chr21 backbone preserved byte-for-byte by the concat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…te format Adds content assertions next to the existing snapshot so a regression in the concat awk's gene_id attribute formatting fails the test instead of just changing the snapshot hash. Specifically: - every non-comment line in the hybrid GTF must contain a well-formed `gene_id "..."` attribute (catches the broken-quote class of bug that produced `gene_id gid ;` before). - the no-blacklist test additionally asserts at least one synthesised gene row exists on chr22 (the novel contig), confirming the concat awk actually grafted novel genes onto the chr21 backbone rather than only passing the backbone through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s from emitted meta The `class_codes` key was folded onto meta so GAWK_FILTER's closure-form `ext.args` could read it at task launch, but the key then leaked all the way through to the hybrid_gtf output meta. A consumer pipeline joining the hybrid_gtf channel against another channel keyed on the same id (e.g. `[id:'sample']`) would have failed the join because the meta maps no longer matched. Strip the key immediately after GAWK_FILTER consumes it. Add a test assertion that the emitted hybrid_gtf meta does not contain `class_codes`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…id_gtf The hybrid GTF is the backbone annotation with novel transcripts merged in, so its identity should follow the backbone, not the novel input. The emit now carries `ch_backbone_gtf`'s meta (e.g. nf-core/riboseq's `[id: 'hybrid_reference']`) rather than the novel input's (`[id: 'stringtie_merge']`), which is what riboseq's local CONCAT_GTF already does before its rewrite to this subworkflow. Swapping the meta at the concat-input stage also naturally drops the transient `class_codes` key folded onto the novel meta for GAWK_FILTER, so the explicit strip step is no longer needed. Test assertion changed from "no class_codes in meta" to the stricter "meta equals backbone meta", which catches both the leakage class of bug and the wrong-meta-source class of bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y calls Nextflow 26.04's strict parser rejects the legacy `Channel.X` form; factory methods on the channel namespace must be lowercase. Swap `Channel.of`, `Channel.value`, and `Channel.empty` to `channel.of`, `channel.value`, `channel.empty` in main.nf, meta.yml, and the test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ridmerge_gffcompare Tightens the directory / workflow / channel-selector name to better match the established subworkflow naming precedent. The previous form was 22 characters of purpose ("buildhybridannotation") which read long next to peers like `bam_subsampledepth_samtools` (14 chars, "subsampledepth") or `bam_markduplicates_picard` ("markduplicates"). "build" is redundant (every subworkflow produces something) and "annotation" duplicates the gtf_ file-type prefix; "hybridmerge" captures the action (merge novel into a hybrid annotation) in the same compound-verb shape as the precedents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…om:pinin4fjords/nf-core-modules into feat/subworkflow-gtf-buildhybridannotation

One short line per step; drop the multi-line restatements of what the code does. The non-obvious bits (why we fold class_codes onto meta, why backbone goes first into the concat awk, that bedtools args are caller policy) survive in one sentence each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous commit was a comments-only change marked [skip ci]; this empty commit re-triggers CI so the PR shows a green status on the final state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous commit accidentally carried a skip-CI marker in its body, so GitHub Actions matched it and bypassed the workflows. This empty commit re-runs CI so the PR shows a green status on the final state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ent filter awk Addresses self-review notes M1, M2, M3, L1: M1 - the bundled nextflow.config now sets `ext.args = '-v'` on BEDTOOLS_INTERSECT so the channel name `ch_blacklist_bed` actually subtracts intervals (matches what "blacklist" means everywhere else). Previously the test was running positive overlap by default, which was the opposite of the documented semantics. Consumers can still override in their own modules.config for strand-aware (-v -s) or positive intersect. M2 - GAWK_FILTER's awk is now buffer-then-emit instead of single-pass. The previous form silently dropped exons if any exon row preceded its parent transcript (gffcompare's annotated.gtf always emits transcripts first in practice, but the assumption was a footgun). The new awk records keep_tx[] in a single pass over the input then prints survivors from the buffer in original input order during END. M3 - gffcompare_stats added to both snapshot blocks; it was in the emit and documented in meta.yml but absent from the snapshot, so regressions in the classification step were not caught. L1 - meta.yml `ch_novel_gtf` description now notes the single-novel-vs-single-backbone design constraint and points callers with multiple novel GTFs at a pre-merge. The hybrid_gtf md5 changes for the with-blacklist test (now reflects real subtraction instead of overlap-only) - that snapshot rebake is correct, not a regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Swap modules/local/gedi/{indexgenome,price} for the nf-core/modules versions (git_sha 51af5cb84874647aa4733742b86142025705b042). The upstream modules differ only by convention: configurable ext.prefix on the indexgenome output directory and an additive versions_gedi emit alongside the versions topic. Setting ext.prefix='price_index' on GEDI_INDEXGENOME preserves the existing published path and the ${index}/reference.oml contract consumed by GEDI_PRICE, so behaviour is unchanged. ribocode/* and ribotish/* were already moved to modules/nf-core/ via earlier `nf-core modules install` work, so this commit only handles gedi. Follow-up housekeeping once these PRs merge: - rpbp/* via nf-core/modules#11695 - concat_gtf and filter_gtf_class_code via nf-core/modules#11729 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ild transcripts Addresses L2 properly rather than leaving it as flagged. The previous synth logic emitted a gene row using the first novel transcript's coordinates per gene_id, which under-reported the gene's footprint when the gene had multiple novel transcripts. STAR / Salmon / featureCounts / gffread don't read gene-row coords, but a strict reader (e.g. a genome browser, a custom span-based filter) would have seen the wrong span. GAWK_CONCAT is now a buffer-then-emit awk: pass over the input tracks `gene_seen[]` for backbone gene rows and accumulates `min_start[gid]`, `max_end[gid]` for every novel transcript; END walks the buffered lines in original order and, immediately before the first transcript of a novel gene_id, emits the synth gene row with the union span. Same buffer-then-emit pattern we used for GAWK_FILTER in the previous commit. Added a defensive nf-test assertion that walks the chr22 portion of the hybrid output and asserts every gene row's span equals the min(start)/max(end) of its child transcripts, so a future regression to first-transcript coords would fail loudly instead of silently changing the snapshot. Verified against the chr21+chr22 fixture: 7 chr22 gene_ids (3 with multiple novel transcripts) all have correct union spans. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…esource profile The two awk programs are unavoidably terse one-liners (have to fit in a single `ext.args2` string). Adds prose block comments above each `withName` walking a reader through BEGIN / per-line / END behaviour, so nobody has to reverse-engineer the program to understand what it does. meta.yml now documents: - input format scope (GTF2 with `gene_id` / `transcript_id` and standard feature types - GFF3 and prokaryotic GTFs are out of scope). Surveyed the nf-core/rnaseq issue queue around GTF-attribute and feature-type edge cases; the recurring concern is prokaryotic annotations, which don't apply to hybrid-transcriptome building. Standard GTF attribute names are reliable across Ensembl / GENCODE / RefSeq / StringTie so parameterising them would add surface area for no real-world benefit. - resource sizing: the buffer-then-emit awks hold the full input in memory, fine under `process_single` for annotation-scale GTFs. Note for consumers to bump the label if they're feeding >~500 MB GTFs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

withLabel resolution runs before withName, so setting `label` inside a withName selector does not re-trigger label-based resource resolution. The correct way to bump resources for the bundled GAWK steps is to set the resource attributes (memory, cpus, time) directly inside withName. Updated the prose in nextflow.config header and meta.yml accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lves Jonas review) Both gawk programs now live as multi-line, comment-annotated `.awk` files under `awk/` next to the subworkflow main.nf and are passed to GAWK via the `program_file` input. Pattern follows the existing `subworkflows/nf-core/utils_references/schema_references.json` precedent for bundled-resource path resolution (`${projectDir}/subworkflows/nf-core/<name>/...`), which travels with the subworkflow when `nf-core subworkflows install` copies it into a consumer pipeline. The bundled `nextflow.config` drops both `ext.args2` walls of text plus the long prose annotations that were trying to make them readable; explanations now live as native awk comments inside the awk programs where they belong. Output md5s unchanged - byte-identical hybrid GTF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aths

jonasscheid

LGTM, good idea with outsourcing the awk programms

nf-core/modules#11729 has merged. Pin the subworkflow at the merge commit (sha bf60969) on master rather than carrying it as a vendored fork-branch copy. No file changes - the subworkflow contents already match master from the most recent sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pinin4fjords and others added 8 commits May 21, 2026 10:35

Merge branch 'master' into feat/subworkflow-gtf-buildhybridannotation

689a1c1

pinin4fjords marked this pull request as ready for review May 21, 2026 10:16

pinin4fjords and others added 2 commits May 21, 2026 11:32

Merge branch 'feat/subworkflow-gtf-buildhybridannotation' of github.c…

d74723c

…om:pinin4fjords/nf-core-modules into feat/subworkflow-gtf-buildhybridannotation

pinin4fjords changed the title ~~feat(subworkflows): gtf_buildhybridannotation_gffcompare~~ feat(subworkflows): gtf_hybridmerge_gffcompare May 21, 2026

pinin4fjords and others added 4 commits May 21, 2026 11:35

ci: trigger CI on current head

20e619e

Previous commit was a comments-only change marked [skip ci]; this empty commit re-triggers CI so the PR shows a green status on the final state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pinin4fjords and others added 4 commits May 21, 2026 12:16

docs(gtf_hybridmerge_gffcompare): tighten resource-override note

624e07f

jonasscheid requested changes May 21, 2026

View reviewed changes

Comment thread subworkflows/nf-core/gtf_hybridmerge_gffcompare/nextflow.config Outdated

Comment thread subworkflows/nf-core/gtf_hybridmerge_gffcompare/nextflow.config

pinin4fjords and others added 3 commits May 21, 2026 15:22

refactor(gtf_hybridmerge_gffcompare): use moduleDir for bundled awk p…

20c2016

…aths

Merge branch 'master' into feat/subworkflow-gtf-buildhybridannotation

3636320

pinin4fjords requested a review from jonasscheid May 21, 2026 14:41

jonasscheid approved these changes May 21, 2026

View reviewed changes

pinin4fjords added this pull request to the merge queue May 21, 2026

Merged via the queue into nf-core:master with commit bf60969 May 21, 2026
23 checks passed

pinin4fjords deleted the feat/subworkflow-gtf-buildhybridannotation branch May 21, 2026 14:54

pinin4fjords mentioned this pull request May 22, 2026

feat: StringTie novel transcript discovery + hybrid GTF nf-core/riboseq#182

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(subworkflows): gtf_hybridmerge_gffcompare#11729

feat(subworkflows): gtf_hybridmerge_gffcompare#11729
pinin4fjords merged 21 commits into
nf-core:masterfrom
pinin4fjords:feat/subworkflow-gtf-buildhybridannotation

pinin4fjords commented May 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

jonasscheid left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pinin4fjords commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Name

Components

Caller-policy bits

Scope

Test plan

Uh oh!

Uh oh!

Uh oh!

jonasscheid left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pinin4fjords commented May 21, 2026 •

edited

Loading