feat(ribocode): pre-build pyfasta indexes + prefix-scoped outputs #11685

Merged: pinin4fjords merged 3 commits into master from ribocode-fusion-safe-outputs on May 19, 2026
Conversation

@pinin4fjords (Member) commented May 18, 2026

Two related changes carried in nf-core/riboseq#174.

ribocode/prepare

After prepare_transcripts runs, pre-build the pyfasta .gdx/.flat sidecars for annotation/transcripts_sequence.fa by instantiating RiboCode.prepare_transcripts.GenomeSeq directly.

Why: RiboCode's downstream detectORF.py opens transcripts_sequence.fa via pyfasta, which lazily writes .gdx/.flat next to the FASTA on first read. Under Nextflow's default symlink staging those writes go through the symlink and land in the upstream RIBOCODE_PREPARE task's work dir; parallel consumers then race on the same sidecar paths, with last-writer-wins behaviour on shared/network storage. Building the sidecars in the producing task makes them part of the published annotation directory and removes the lazy-write path entirely - same pattern as samtools/faidx shipping .fai alongside its FASTA.
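The write-through behaviour described above can be reproduced without Nextflow or pyfasta. The following is a minimal sketch using only the standard library: the directory names `producer_work` and `consumer_work` and the helper `simulate_lazy_sidecar_write` are hypothetical, and a plain file write stands in for pyfasta's lazy index creation. It assumes the annotation directory is staged as a directory symlink on a POSIX filesystem.

```python
import os
import tempfile
from pathlib import Path


def simulate_lazy_sidecar_write(root: Path) -> dict:
    """Write a sidecar next to a FASTA reached through a directory symlink.

    Hypothetical stand-in for a consumer task: no pyfasta or Nextflow involved.
    """
    # Producer task: owns the real annotation directory and FASTA.
    producer = root / "producer_work" / "annotation"
    producer.mkdir(parents=True)
    (producer / "transcripts_sequence.fa").write_text(">tx1\nACGT\n")

    # Consumer task: sees the annotation directory only as a symlink.
    consumer = root / "consumer_work"
    consumer.mkdir()
    staged = consumer / "annotation"
    os.symlink(producer, staged, target_is_directory=True)

    # Stand-in for the lazy index write "next to" the staged FASTA.
    sidecar = Path(str(staged / "transcripts_sequence.fa") + ".gdx")
    sidecar.write_text("placeholder index\n")

    return {
        # True: the write followed the symlink into the producer's directory.
        "landed_in_producer": (producer / "transcripts_sequence.fa.gdx").exists(),
        # Everything the producer directory now contains.
        "producer_files": sorted(p.name for p in producer.iterdir()),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_lazy_sidecar_write(Path(tmp)))
```

With several consumers staged against the same producer directory, each would attempt this same write to the same path, which is the race the pre-build avoids.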

ribocode/ribocode

Switch orf_txt and orf_txt_collapsed from *.txt / *_collapsed.txt to ${prefix}.txt / ${prefix}_collapsed.txt. The old *.txt glob also matched the collapsed file, so the orf_txt emit contained both files. The prefix binding is promoted out of def in both script: and stub: so it resolves at the output-glob stage; the Nextflow 26 strict parser rejects re-declaring the same local across the two blocks. The existing stub assertion is adjusted from process.out.orf_txt[0][1][0] to process.out.orf_txt[0][1], since the emit now holds a single file.
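The glob overlap is easy to see in isolation. This is a small sketch using Python's `fnmatch`, which behaves like Nextflow's output globs for these simple patterns but is not the same matcher. The helper `match_outputs` and the file name `test.txt` are illustrative assumptions (the snapshot only names `test_collapsed.txt`).

```python
from fnmatch import fnmatch


def match_outputs(patterns: dict, files: list) -> dict:
    """Return, per emit name, the sorted files its pattern matches."""
    return {
        emit: sorted(f for f in files if fnmatch(f, pattern))
        for emit, pattern in patterns.items()
    }


# Illustrative task outputs for a prefix of "test".
prefix = "test"
files = [f"{prefix}.txt", f"{prefix}_collapsed.txt"]

# Old patterns: "*.txt" also catches the collapsed file.
old = match_outputs(
    {"orf_txt": "*.txt", "orf_txt_collapsed": "*_collapsed.txt"}, files
)

# New patterns: each emit is scoped to exactly one file name.
new = match_outputs(
    {"orf_txt": f"{prefix}.txt", "orf_txt_collapsed": f"{prefix}_collapsed.txt"},
    files,
)

print(old["orf_txt"])  # two files in one emit
print(new["orf_txt"])  # one file
```

With the old patterns `orf_txt` carries a two-element list, which is why the stub test had to index `[0][1][0]`; with the prefix-scoped patterns the emit is a single path and `[0][1]` is enough.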

Snapshot deltas

  • ribocode/prepare: gains two new file md5s (.flat, .gdx); existing md5s unchanged.
  • ribocode/ribocode: drops the duplicate test_collapsed.txt entry the old *.txt glob double-counted.

Source: nf-core/riboseq#174. Supersedes the ribocode half of #11684.

Two related changes carried in nf-core/riboseq#174 and split out of the
bundled PR #11684.

ribocode/prepare: pre-build the pyfasta `.gdx`/`.flat` indexes for
`annotation/transcripts_sequence.fa` immediately after `prepare_transcripts`,
using the same `key_fn` RiboCode applies internally (split on first space,
otherwise split on `|`). Stub touches the two new sidecars.
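For readers unfamiliar with pyfasta key functions, here is one possible reading of the rule described above, written as a standalone sketch. The function name `header_key` is hypothetical, and RiboCode's actual `get_chrom` may differ in detail; this only illustrates "split on the first space, otherwise split on `|`".

```python
def header_key(header: str) -> str:
    """Derive a record key from a FASTA header line (without the leading '>').

    Assumed rule, taken from the description above rather than from
    RiboCode's source: if the header contains a space, keep the text before
    the first space; otherwise keep the text before the first '|'.
    """
    if " " in header:
        return header.split(" ", 1)[0]
    return header.split("|", 1)[0]


# Illustrative headers, not taken from a real annotation.
print(header_key("tx1 gene=ABC"))
print(header_key("tx1|gene1|ABC"))
print(header_key("tx1"))
```

The key function matters because it determines the record names stored in the `.gdx` index, so a pre-built index is only reusable if it was built with the same key function the downstream reader uses.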

Why: downstream RiboCode steps open the FASTA with pyfasta, which lazily
writes `.gdx`/`.flat` next to the input on first read. Under Fusion staging
those writes land back at the upstream task's S3 prefix and silently
corrupt the staged copy on retries. Building the indexes inside the
producing task fixes it.

ribocode/ribocode: switch the `orf_txt` and `orf_txt_collapsed` output
globs from `*.txt` / `*_collapsed.txt` to `${prefix}.txt` /
`${prefix}_collapsed.txt` so multi-instance publication is unambiguous
(`*.txt` previously matched both files into the same emit). The `prefix`
binding is promoted out of `def` in both `script:` and `stub:` so it
resolves at the output-glob stage; the Nextflow 26 strict parser rejects
re-declaring the same local with `def` across both blocks. The existing
stub assertion at `process.out.orf_txt[0][1][0]` is corrected to the new
single-file shape (`process.out.orf_txt[0][1]`).

Source: nf-core/riboseq#174

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords and others added 2 commits May 18, 2026 17:04
The lazy pyfasta sidecar write isn't Fusion-specific - it's a Nextflow
symlink-staging concern that affects any backend (writes leak back to
the producer task's work dir via the staged-input symlink).

Rewording the inline comment to match. No code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…build

Replace the inline 8-line python heredoc (which replicated RiboCode's
`get_chrom` key_fn verbatim) with a single `python -c` line that imports
and instantiates `RiboCode.prepare_transcripts.GenomeSeq` directly. The
class constructor itself runs `Fasta(filename, key_fn=get_chrom)` with
the same key function, so we drop the replication while producing
byte-identical .gdx/.flat sidecars (md5-verified on the realistic FASTA
format prepare_transcripts emits).

No snapshot change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords pinin4fjords marked this pull request as ready for review May 18, 2026 16:25
-o annotation \\
$args
# Pre-build the pyfasta .gdx/.flat sidecars by instantiating RiboCode's own GenomeSeq -
Contributor

Could this be an additional flag in prepare_transcripts in the future? It may be worth filing an issue on their GitHub repo.

Member Author

Thanks! It's kind of something that comes out of the workflow use case specifically, but I can at least flag it with them.

@pinin4fjords
Member Author

@jonasscheid good shout - opened xryanglab/RiboCode#70 upstream proposing the eager-build patch. If they take it we can drop the inline pre-build entirely. I'll add a link in the module's inline comment so future maintainers can find it.

Merged via the queue into master with commit f3e22f6 May 19, 2026
33 checks passed
@pinin4fjords pinin4fjords deleted the ribocode-fusion-safe-outputs branch May 19, 2026 10:13
@pinin4fjords
Member Author

Update: opened the upstream PR too - xryanglab/RiboCode#71. If they accept it, the next ribocode container build picks up the eager index and we can drop the inline pre-build from this module entirely. For now this PR remains the workaround.

@jonasscheid
Contributor

Great stuff 👍🏼


2 participants