feat(subworkflows): gtf_hybridmerge_gffcompare#11729
Merged
pinin4fjords merged 21 commits intoMay 21, 2026
Merged
Conversation
New subworkflow that classifies a novel-transcript GTF against a reference annotation with gffcompare, filters the annotated GTF by class code with gawk, optionally subtracts intervals overlapping a user-supplied blacklist BED via bedtools intersect, and merges the survivors into a backbone annotation. Missing gene rows for surviving novel transcripts are synthesised from transcript coordinates. Caller-policy bits (the specific class codes to keep, the bedtools args for the blacklist intersect) flow through take/emit channels or are set by the consumer pipeline's modules.config; the bundled nextflow.config ships only the two static awk programs and the prefix/suffix settings the chain needs to stop input/output names colliding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… for existing test-datasets fixtures Switches both tests to real fixtures from nf-core/test-datasets:modules so the chain is exercised end-to-end with proper annotation data and the test-datasets browser stays the canonical source for the inputs: - reference + backbone = data/genomics/homo_sapiens/genome/chr21/sequence/chr21_gencode.gtf - novel = data/genomics/homo_sapiens/genome/genome.gtf (chr22 subset) - blacklist = data/genomics/homo_sapiens/genome/genome.blacklist_intervals.bed Cross-contig setup means every chr22 novel transcript is classified as class_code 'u' by gffcompare, then class-code filtered, then either passed through to the concat or dropped by the chr22 blacklist intersect, and finally grafted onto the chr21 backbone with synthesised gene rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d step comments The concat awk's synthesised gene rows were printing literally `gene_id gid ;` instead of `gene_id "ENSG..."`. Groovy single-quoted strings interpret `\"` as `"`, which broke the awk string-quote boundaries; the runtime saw `"gene_id "" gid "";"` and parsed `gid` as literal text rather than the captured variable. Doubled the backslash (`\\"`) so Groovy emits the literal `\"` awk needs. Also added concise step comments to the subworkflow main.nf walking a reader through the chain (gffcompare classification, class-code filter, optional blacklist branch, synthetic gene rows). Verified end-to-end against the test fixtures: chr21 reference (272 genes / 367 transcripts), chr22 novel input (12 transcripts), all classified `class_code "u"`, blacklist intersect drops to expected counts, synth rows now carry proper `gene_id "ENSG..."` attribute, chr21 backbone preserved byte-for-byte by the concat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…te format Adds content assertions next to the existing snapshot so a regression in the concat awk's gene_id attribute formatting fails the test instead of just changing the snapshot hash. Specifically: - every non-comment line in the hybrid GTF must contain a well-formed `gene_id "..."` attribute (catches the broken-quote class of bug that produced `gene_id gid ;` before). - the no-blacklist test additionally asserts at least one synthesised gene row exists on chr22 (the novel contig), confirming the concat awk actually grafted novel genes onto the chr21 backbone rather than only passing the backbone through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s from emitted meta The `class_codes` key was folded onto meta so GAWK_FILTER's closure-form `ext.args` could read it at task launch, but the key then leaked all the way through to the hybrid_gtf output meta. A consumer pipeline joining the hybrid_gtf channel against another channel keyed on the same id (e.g. `[id:'sample']`) would have failed the join because the meta maps no longer matched. Strip the key immediately after GAWK_FILTER consumes it. Add a test assertion that the emitted hybrid_gtf meta does not contain `class_codes`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…id_gtf The hybrid GTF is the backbone annotation with novel transcripts merged in, so its identity should follow the backbone, not the novel input. The emit now carries `ch_backbone_gtf`'s meta (e.g. nf-core/riboseq's `[id: 'hybrid_reference']`) rather than the novel input's (`[id: 'stringtie_merge']`), which is what riboseq's local CONCAT_GTF already does before its rewrite to this subworkflow. Swapping the meta at the concat-input stage also naturally drops the transient `class_codes` key folded onto the novel meta for GAWK_FILTER, so the explicit strip step is no longer needed. Test assertion changed from "no class_codes in meta" to the stricter "meta equals backbone meta", which catches both the leakage class of bug and the wrong-meta-source class of bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y calls Nextflow 26.04's strict parser rejects the legacy `Channel.X` form; factory methods on the channel namespace must be lowercase. Swap `Channel.of`, `Channel.value`, and `Channel.empty` to `channel.of`, `channel.value`, `channel.empty` in main.nf, meta.yml, and the test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ridmerge_gffcompare
Tightens the directory / workflow / channel-selector name to better
match the established subworkflow naming precedent. The previous form
was 22 characters of purpose ("buildhybridannotation") which read long
next to peers like `bam_subsampledepth_samtools` (14 chars,
"subsampledepth") or `bam_markduplicates_picard` ("markduplicates").
"build" is redundant (every subworkflow produces something) and
"annotation" duplicates the gtf_ file-type prefix; "hybridmerge"
captures the action (merge novel into a hybrid annotation) in the same
compound-verb shape as the precedents.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…om:pinin4fjords/nf-core-modules into feat/subworkflow-gtf-buildhybridannotation
One short line per step; drop the multi-line restatements of what the code does. The non-obvious bits (why we fold class_codes onto meta, why backbone goes first into the concat awk, that bedtools args are caller policy) survive in one sentence each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit was a comments-only change marked [skip ci]; this empty commit re-triggers CI so the PR shows a green status on the final state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit accidentally carried a skip-CI marker in its body, so GitHub Actions matched it and bypassed the workflows. This empty commit re-runs CI so the PR shows a green status on the final state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent filter awk Addresses self-review notes M1, M2, M3, L1: M1 - the bundled nextflow.config now sets `ext.args = '-v'` on BEDTOOLS_INTERSECT so the channel name `ch_blacklist_bed` actually subtracts intervals (matches what "blacklist" means everywhere else). Previously the test was running positive overlap by default, which was the opposite of the documented semantics. Consumers can still override in their own modules.config for strand-aware (-v -s) or positive intersect. M2 - GAWK_FILTER's awk is now buffer-then-emit instead of single-pass. The previous form silently dropped exons if any exon row preceded its parent transcript (gffcompare's annotated.gtf always emits transcripts first in practice, but the assumption was a footgun). The new awk records keep_tx[] in a single pass over the input then prints survivors from the buffer in original input order during END. M3 - gffcompare_stats added to both snapshot blocks; it was in the emit and documented in meta.yml but absent from the snapshot, so regressions in the classification step were not caught. L1 - meta.yml `ch_novel_gtf` description now notes the single-novel-vs-single-backbone design constraint and points callers with multiple novel GTFs at a pre-merge. The hybrid_gtf md5 changes for the with-blacklist test (now reflects real subtraction instead of overlap-only) - that snapshot rebake is correct, not a regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords
added a commit
to pinin4fjords/riboseq
that referenced
this pull request
May 21, 2026
Swap modules/local/gedi/{indexgenome,price} for the nf-core/modules
versions (git_sha 51af5cb84874647aa4733742b86142025705b042). The
upstream modules differ only by convention: configurable ext.prefix
on the indexgenome output directory and an additive versions_gedi
emit alongside the versions topic. Setting ext.prefix='price_index'
on GEDI_INDEXGENOME preserves the existing published path and the
${index}/reference.oml contract consumed by GEDI_PRICE, so behaviour
is unchanged.
ribocode/* and ribotish/* were already moved to modules/nf-core/
via earlier `nf-core modules install` work, so this commit only
handles gedi.
Follow-up housekeeping once these PRs merge:
- rpbp/* via nf-core/modules#11695
- concat_gtf and filter_gtf_class_code via nf-core/modules#11729
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ild transcripts Addresses L2 properly rather than leaving it as flagged. The previous synth logic emitted a gene row using the first novel transcript's coordinates per gene_id, which under-reported the gene's footprint when the gene had multiple novel transcripts. STAR / Salmon / featureCounts / gffread don't read gene-row coords, but a strict reader (e.g. a genome browser, a custom span-based filter) would have seen the wrong span. GAWK_CONCAT is now a buffer-then-emit awk: pass over the input tracks `gene_seen[]` for backbone gene rows and accumulates `min_start[gid]`, `max_end[gid]` for every novel transcript; END walks the buffered lines in original order and, immediately before the first transcript of a novel gene_id, emits the synth gene row with the union span. Same buffer-then-emit pattern we used for GAWK_FILTER in the previous commit. Added a defensive nf-test assertion that walks the chr22 portion of the hybrid output and asserts every gene row's span equals the min(start)/max(end) of its child transcripts, so a future regression to first-transcript coords would fail loudly instead of silently changing the snapshot. Verified against the chr21+chr22 fixture: 7 chr22 gene_ids (3 with multiple novel transcripts) all have correct union spans. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esource profile The two awk programs are unavoidably terse one-liners (have to fit in a single `ext.args2` string). Adds prose block comments above each `withName` walking a reader through BEGIN / per-line / END behaviour, so nobody has to reverse-engineer the program to understand what it does. meta.yml now documents: - input format scope (GTF2 with `gene_id` / `transcript_id` and standard feature types - GFF3 and prokaryotic GTFs are out of scope). Surveyed the nf-core/rnaseq issue queue around GTF-attribute and feature-type edge cases; the recurring concern is prokaryotic annotations, which don't apply to hybrid-transcriptome building. Standard GTF attribute names are reliable across Ensembl / GENCODE / RefSeq / StringTie so parameterising them would add surface area for no real-world benefit. - resource sizing: the buffer-then-emit awks hold the full input in memory, fine under `process_single` for annotation-scale GTFs. Note for consumers to bump the label if they're feeding >~500 MB GTFs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
withLabel resolution runs before withName, so setting `label` inside a withName selector does not re-trigger label-based resource resolution. The correct way to bump resources for the bundled GAWK steps is to set the resource attributes (memory, cpus, time) directly inside withName. Updated the prose in nextflow.config header and meta.yml accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jonasscheid
requested changes
May 21, 2026
…lves Jonas review)
Both gawk programs now live as multi-line, comment-annotated `.awk`
files under `awk/` next to the subworkflow main.nf and are passed to
GAWK via the `program_file` input. Pattern follows the existing
`subworkflows/nf-core/utils_references/schema_references.json` precedent
for bundled-resource path resolution
(`${projectDir}/subworkflows/nf-core/<name>/...`), which travels with
the subworkflow when `nf-core subworkflows install` copies it into a
consumer pipeline.
The bundled `nextflow.config` drops both `ext.args2` walls of text plus
the long prose annotations that were trying to make them readable;
explanations now live as native awk comments inside the awk programs
where they belong. Output md5s unchanged - byte-identical hybrid GTF.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jonasscheid
approved these changes
May 21, 2026
Contributor
jonasscheid
left a comment
There was a problem hiding this comment.
LGTM, good idea with outsourcing the awk programms
pinin4fjords
added a commit
to pinin4fjords/riboseq
that referenced
this pull request
May 21, 2026
nf-core/modules#11729 has merged. Pin the subworkflow at the merge commit (sha bf60969) on master rather than carrying it as a vendored fork-branch copy. No file changes - the subworkflow contents already match master from the most recent sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a new subworkflow
gtf_hybridmerge_gffcomparethat builds a hybrid GTF from a novel-transcript GTF and a reference annotation:gffcompareclassifies every novel transcript against the reference (assigns class codes).gawkkeeps only transcripts whose class code is in a user-supplied set (e.g."u"for novel intergenic,"u,j,i"for a wider net).bedtools intersect -v(optional) subtracts survivors that overlap a user-supplied blacklist BED, e.g. rRNA or repeat regions. Skipped by passingchannel.empty().gawkconcatenates the survivors with an annotation backbone GTF, stripping comment lines and synthesising anygenerow whosegene_idis referenced by a novel transcript but absent from the backbone. Synth gene rows span the union of their child transcripts' coordinates so the gene span is scientifically correct, not just the first transcript's footprint.The output is a self-consistent hybrid GTF suitable as a
--sjdbGTFfilefor a second STAR pass, as input to gffread for transcriptome FASTA generation, or as the annotation any other downstream tool consumes.The subworkflow takes a separate
ch_reference_gtf(used by gffcompare for classification) andch_backbone_gtf(the annotation skeleton the survivors are grafted onto). For most callers these are the same file - pass the same channel twice. Callers that want classification done against a fuller annotation but the novel transcripts grafted onto a reduced backbone (e.g. canonical-only) pass two different channels.The emitted
hybrid_gtfmeta is the backbone's meta, not the novel input's. The hybrid GTF is logically the backbone annotation with novel transcripts grafted on, so its identity (and any downstream.join()that wants to align it with star-index, transcriptome-fasta or other backbone-keyed channels) should follow the backbone.Name
gtf_hybridmerge_gffcomparefollows the spec's "name by purpose" branch (the subworkflow has conditional logic - the optional blacklist intersect) plus the dominant tool. The compound verb "hybridmerge" matches thesubsampledepth/markduplicatesshape used by existing subworkflows; "build" and "annotation" were dropped from an earlier draft as redundant (every subworkflow produces output; a GTF is an annotation). Happy to revisit if the reviewer prefers a different form.Components
gffcompare(existing)gawk(existing, aliased asGAWK_FILTERandGAWK_CONCAT)bedtools/intersect(existing)No new modules are introduced. The two
gawkprograms (class-code filter and backbone-plus-novel concat) are buffer-then-emit so the order of transcript / exon rows in their inputs doesn't matter; the awk source lives inawk/{filter,concat}.awknext to the subworkflowmain.nfand is passed to GAWK via theprogram_fileinput (${moduleDir}/awk/..., so the subworkflow stays location-independent). ConsumersincludeConfigthe bundlednextflow.configor pass it with-cfor the prefix / suffix / closure-form args.Caller-policy bits
The bundled config sets
ext.args = '-v'onBEDTOOLS_INTERSECTso the channel namech_blacklist_bedactually subtracts. Consumers that want strand-aware subtraction (-v -s) or positive overlap can override in their ownmodules.config.The class-codes value flows through the subworkflow's
val_class_codestake ontometa.class_codesand is read at task launch via a closure-formext.argsonGAWK_FILTER. The key is dropped from meta before the hybrid_gtf is emitted (carried by the backbone-meta swap at the concat step) so the value cannot leak downstream and break consumer joins.Scope
Input format: standard GTF2 with
gene_id/transcript_idattributes andgene/transcriptfeature types. GFF3 dialects and prokaryotic GTFs without transcript rows are out of scope. Surveyed nf-core/rnaseq's issue queue around attribute / feature-type edge cases: the only recurring concern is prokaryotic annotations, which don't apply to hybrid-transcriptome building. Parameterising attribute / feature-type names would add surface area for no real-world benefit in the consumers that need this subworkflow.Resource sizing: GAWK steps run under
process_singleand hold the full input in awk arrays. Fine for annotation-scale GTFs; consumers feeding >~500 MB GTFs should overridememory/cpusdirectly viawithName.Test plan
nf-test test subworkflows/nf-core/gtf_hybridmerge_gffcompare/tests/main.nf.test --profile dockerpasses locally (2 tests covering both the no-blacklist passthrough branch and the with-blacklist intersect branch).nf-core subworkflows lint gtf_hybridmerge_gffcompareclean (28 passed, 0 warned, 0 failed).pre-commit runclean.Tests use existing
nf-core/test-datasets:modulesfixtures:chr21_gencode.gtfas reference + backbone,genome.gtf(chr22 subset) as the novel input,genome.blacklist_intervals.bed(chr22) as the blacklist. The cross-contig setup gives every chr22 novel transcript aclass_code "u"against the chr21 reference, exercising the class-code filter, the blacklist subtraction, and the synth-gene-row code path in one chain. No new test fixtures required.Beyond the snapshot, the tests assert:
gene_id "..."attribute (catches gawk/Groovy string-escaping regressions),generow appears on the novel contig (confirms the synth code path actually fired),🤖 Generated with Claude Code