fix: 10x Atera (Xenium Gen2) compatibility — gene_panel.json gating + streaming PARQUET_TO_CSV#170
Open
an-altosian wants to merge 5 commits into
Empirical evidence from 10x Atera (Xenium Gen2) Cell Pellet preview
data: all non-XR pipeline modes (qc, preview, segfree-baysor,
segfree-ficture) failed at workflow init in ~90s with errorMessage
'<bundle>/gene_panel.json' even though gene_panel.json is only read by
XENIUMRANGER_RELABEL_RESEGMENT.
Root cause: the .map { ... file(<bundle>/gene_panel.json,
checkIfExists: true) } closure in the else-branch at
workflows/spatialaxe.nf:343-352 is evaluated eagerly by Nextflow as
soon as ch_input emits, regardless of whether ch_gene_panel is
consumed downstream. For any bundle that doesn't ship gene_panel.json
(10x Atera, i.e. Xenium Gen2 preview data, is one such case), workflow
init fails for every mode, including qc/preview/segfree, which never
need a gene panel.
Fix: wrap the entire ch_gene_panel construction in if (do_relabel)
{ ... }. When false, ch_gene_panel keeps its channel.empty()
initialisation from line 112. No behaviour change for
image/coordinate modes (do_relabel can still be true via --gene_panel
or --relabel_genes).
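The gating idea can be illustrated outside Nextflow with a toy Python model. This is not the pipeline's code and does not reproduce channel semantics: `resolve_gene_panel` and `build_gene_panels` are hypothetical names, and the list return value is only a stand-in for `ch_gene_panel`. It shows only that the existence check stops firing once it sits behind the flag.

```python
from pathlib import Path
from typing import Iterable, List


def resolve_gene_panel(bundle: Path) -> Path:
    """Stand-in for file(..., checkIfExists: true): fail if the file is missing."""
    panel = bundle / "gene_panel.json"
    if not panel.is_file():
        raise FileNotFoundError(str(panel))
    return panel


def build_gene_panels(bundles: Iterable[Path], do_relabel: bool) -> List[Path]:
    """Resolve gene panels only when the relabel step will consume them."""
    if not do_relabel:
        # Stand-in for leaving ch_gene_panel as channel.empty().
        return []
    return [resolve_gene_panel(bundle) for bundle in bundles]
```

With `do_relabel=False`, a bundle without `gene_panel.json` no longer raises; with `do_relabel=True`, the missing file is still reported, which matches the "no behaviour change" claim for runs that do relabel.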
PARQUET_TO_CSV is a pure format transformation: every input row maps to one output row. The pandas-based implementation eagerly loaded the entire parquet into a DataFrame before writing CSV. For the 10x Atera WTA panel transcripts.parquet (236M rows x 13 cols, 2.9 GB compressed, ~30+ GB uncompressed), this OOMed twice at process_low defaults (12 GB -> 24 GB) before succeeding on attempt 3 (36 GB, 34 GB peak rss) in a Tower run on Atera Cell Pellet.

Replace with pyarrow iter_batches() + pa_csv.CSVWriter. Memory is bounded by --batch-size (default 200000 rows ~= 130 MB) instead of the full row count. Same I/O, same CPU, ~100x less peak memory.

Verified locally on a 1000-row sample of real Atera transcripts.parquet: output is well-formed RFC-4180 CSV with same column order and same data. Cosmetic differences from pandas (string quoting, '23' vs '23.0', 'true' vs 'True') are universally tolerated by downstream consumers (Baysor Julia CSV.jl, Ficture pd.read_csv) and the only affected bool column (is_gene) isn't read by either consumer.
Preventive guardrail against the regression we hit in PR nf-core#154 (May 2026). The python3 ${moduleDir}/templates/<script>.py invocation breaks on Seqera Platform Tower running AWS Batch: ${moduleDir} interpolates to a head-node path that does not exist on worker containers, so every task fails at start with 'No such file or directory'.

This pre-commit hook makes CI fail before such code can be merged. A contributor who needs to keep templates inside the module directory should use the canonical 'template' directive or declare the template as a path input so that Nextflow stages it to the worker.

Empirical evidence: Atera compatibility evaluation 2026-05-28. See PR nf-core#154 inline comments for the full analysis.
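The kind of check such a hook could run can be sketched in a few lines of Python. This is an assumption about the approach, not the hook shipped in this PR: the regex, `find_offending_lines`, and `main` are hypothetical, and the pattern only covers the `python`/`python3 ${moduleDir}/templates/<script>.py` form described above.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Hypothetical pattern: a python/python3 call on a script under
# ${moduleDir}/templates/, e.g. `python3 ${moduleDir}/templates/foo.py`.
MODULEDIR_CALL = re.compile(r"python3?\s+\$\{moduleDir\}/templates/\S+\.py")


def find_offending_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that match the pattern."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if MODULEDIR_CALL.search(line)
    ]


def main(paths: List[str]) -> int:
    """Print every match and return 1 if any file contains one, else 0."""
    status = 0
    for path in paths:
        for number, line in find_offending_lines(Path(path).read_text()):
            print(f"{path}:{number}: {line}")
            status = 1
    return status
```

A pre-commit entry would pass the staged `.nf` files to `main` and use its return value as the exit status, so a non-zero result blocks the commit and fails CI.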
Both failed checks (nf-core lint, docker | 25.04.0 | 3/5) failed at 'Install Nextflow' step with HTTP 520 from Cloudflare while fetching nf-core/setup-nextflow@v2 release assets. No code change needed — retrigger CI.
Summary
Two independent, narrowly-scoped fixes surfaced by an empirical compatibility evaluation of `nf-core/spatialaxe` against 10x Atera (Xenium Gen2 / "Xenium v2") preview data: three WTA-panel samples (Cell Pellet, Breast Cancer, Cervical Cancer; 18k-target panel, 236M transcripts in the smallest sample). Full evaluation report and per-tool upgrade analysis live in companion issues; see Related below.

Fix 1 - `workflows/spatialaxe.nf`: guard `gene_panel.json` read on `do_relabel`

The `.map { ... file(<bundle>/gene_panel.json, checkIfExists: true) }` closure in the else-branch at `workflows/spatialaxe.nf:343-352` is evaluated eagerly by Nextflow as soon as `ch_input` emits, regardless of whether `ch_gene_panel` is downstream-consumed. `ch_gene_panel` is only consumed by `XENIUMRANGER_RELABEL_RESEGMENT` inside `if (do_relabel)`, so for any bundle that doesn't ship `gene_panel.json` and a run that doesn't set `--gene_panel` / `--relabel_genes`, workflow init still fails, even for `qc` / `preview` / `segfree` modes that never invoke the XR relabel step.

Empirical evidence: all four non-XR mode smoke runs on Atera Cell Pellet failed at workflow init in ~90s with `errorMessage = '<bundle>/gene_panel.json'`. After this fix, QC mode succeeds end-to-end on the same bundle.

Fix: wrap the entire `ch_gene_panel` construction in `if (do_relabel) { ... }`. When false, `ch_gene_panel` keeps its `channel.empty()` initialisation from line 112. No behaviour change for `image` / `coordinate` modes (`do_relabel` can still be true via `--gene_panel` or `--relabel_genes`).

Fix 2 - `bin/utility_parquet_to_csv.py`: stream-convert instead of eager-load

PARQUET_TO_CSV is a pure format transformation (every input row maps to one output row). The pandas-based implementation eagerly loaded the entire parquet into a DataFrame before writing CSV:
```python
df = pd.read_parquet(transcripts, engine="pyarrow") # full table in memory
df.to_csv(...)
```
For Atera's `transcripts.parquet` (236M rows × 13 cols, 2.9 GB compressed, ~30+ GB uncompressed), this OOMed twice at `process_low` defaults (12 GB → 24 GB) before succeeding on attempt 3 (36 GB; peak rss 34 GB) in a Tower run.

Replace with pyarrow `iter_batches()` + `pa_csv.CSVWriter`. Memory bounded by `--batch-size` (default 200,000 rows ≈ 130 MB) instead of the full row count. Same I/O, same CPU, ~100× less peak memory.

Verified locally on a 1000-row sample of real Atera `transcripts.parquet`: output is well-formed RFC-4180 CSV with same column order and same data. Cosmetic differences from pandas (string quoting, `23` vs `23.0`, `true` vs `True`) are universally tolerated by downstream consumers (Baysor Julia CSV.jl, Ficture `pd.read_csv`); the only affected bool column (`is_gene`) isn't read by either consumer.

Test plan

- `make check` (ruff + pre-commit) passes locally

Out of scope

- `bin/baysor_preprocess_transcripts.py` (`pd.read_parquet` + filter in memory). It OOMed at 168 GB in our Tower tests. Worth a follow-up perf PR.
- PR #154 (`refactor(modules): convert all 19 module-level bin scripts to Nextflow templates`) merged in May, but PR #162 ("Renaming pipeline into spatialaxe", a rename via template-sync) overwrote those changes. If the templates conversion is reattempted, the `python3 ${moduleDir}/templates/<script>.py` shell-call invocation breaks on Tower/AWS Batch because `${moduleDir}` interpolates to a head-node path the workers don't have. I'll leave a note on #154 with the empirical evidence so it doesn't get re-introduced silently.

Related

- `docs/2026-05-28_REVIEW_atera-on-spatialaxe-compatibility.md` (in companion Altos-internal branch)