fix: 10x Atera (Xenium Gen2) compatibility — gene_panel.json gating + streaming PARQUET_TO_CSV by an-altosian · Pull Request #170 · nf-core/spatialaxe

an-altosian · 2026-05-28T20:42:23Z

Summary

Two independent, narrowly-scoped fixes surfaced by an empirical compatibility evaluation of nf-core/spatialaxe against 10x Atera (Xenium Gen2 / "Xenium v2") preview data — three WTA-panel samples (Cell Pellet, Breast Cancer, Cervical Cancer; 18k-target panel, 236M transcripts in the smallest sample). Full evaluation report and per-tool upgrade analysis live in companion issues — see Related below.

Fix 1 — `workflows/spatialaxe.nf`: guard `gene_panel.json` read on `do_relabel`

The .map { ... file(<bundle>/gene_panel.json, checkIfExists: true) } closure in the else-branch at workflows/spatialaxe.nf:343-352 is evaluated eagerly by Nextflow as soon as ch_input emits, regardless of whether ch_gene_panel is downstream-consumed. ch_gene_panel is only consumed by XENIUMRANGER_RELABEL_RESEGMENT inside if (do_relabel), so for any bundle that doesn't ship gene_panel.json and a run that doesn't set --gene_panel / --relabel_genes, workflow init still fails — even for qc / preview / segfree modes that never invoke the XR relabel step.

Empirical evidence: all four non-XR mode smoke runs on Atera Cell Pellet failed at workflow init in ~90s with errorMessage = '<bundle>/gene_panel.json'. After this fix, QC mode succeeds end-to-end on the same bundle.

Fix: wrap the entire ch_gene_panel construction in if (do_relabel) { ... }. When false, ch_gene_panel keeps its channel.empty() initialisation from line 112. No behaviour change for image/coordinate modes (do_relabel can still be true via --gene_panel or --relabel_genes).
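
The gating idea can be modelled outside Nextflow. What follows is only a minimal Python analogue, not the change made in workflows/spatialaxe.nf: `resolve_gene_panel` and its arguments are hypothetical names, the existence check stands in for `file(..., checkIfExists: true)`, and returning `None` stands in for leaving ch_gene_panel as channel.empty().

```python
from pathlib import Path
from typing import Optional


def resolve_gene_panel(bundle: Path, do_relabel: bool) -> Optional[Path]:
    """Return the bundle's gene_panel.json only when relabelling is requested.

    Hypothetical helper for illustration; it mirrors the shape of the guard,
    not the pipeline's actual Nextflow code.
    """
    if not do_relabel:
        # Analogue of keeping ch_gene_panel empty: the file is never touched,
        # so a bundle without gene_panel.json cannot fail here.
        return None

    panel = bundle / "gene_panel.json"
    if not panel.exists():
        # Analogue of checkIfExists: true raising at workflow init.
        raise FileNotFoundError(str(panel))
    return panel
```

With the guard in place, the missing file matters only on the code path that actually consumes it, which is the behaviour the fix aims for in qc / preview / segfree modes.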

Fix 2 — `bin/utility_parquet_to_csv.py`: stream-convert instead of eager-load

PARQUET_TO_CSV is a pure format transformation (every input row maps to one output row). The pandas-based implementation eagerly loaded the entire parquet into a DataFrame before writing CSV:

```python
df = pd.read_parquet(transcripts, engine="pyarrow") # full table in memory
df.to_csv(...)
```

For Atera's transcripts.parquet (236M rows × 13 cols, 2.9 GB compressed, ~30+ GB uncompressed), this OOMed twice at process_low defaults (12 GB → 24 GB) before succeeding on attempt 3 (36 GB; peak rss 34 GB) in a Tower run.

Replace with pyarrow iter_batches() + pa_csv.CSVWriter. Memory bounded by --batch-size (default 200,000 rows ≈ 130 MB) instead of the full row count. Same I/O, same CPU, ~100× less peak memory.

Verified locally on a 1000-row sample of real Atera transcripts.parquet: output is well-formed RFC-4180 CSV with same column order and same data. Cosmetic differences from pandas (string quoting, 23 vs 23.0, true vs True) are universally tolerated by downstream consumers (Baysor Julia CSV.jl, Ficture pd.read_csv); the only affected bool column (is_gene) isn't read by either consumer.
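
One way such an equivalence check could be written, using only the standard library, is sketched below. This is an assumption about how to compare the two outputs, not the procedure used for this PR: `normalise` and `rows_equivalent` are hypothetical helpers that treat quoting, `23` vs `23.0`, and `true` vs `True` as equal while still flagging real data differences.

```python
import csv
from typing import Iterable, List


def normalise(cell: str) -> str:
    """Map cosmetically different spellings of one value to a canonical form."""
    lowered = cell.strip().lower()
    if lowered in ("true", "false"):
        return lowered  # True / true compare equal
    try:
        return str(int(cell))  # plain integers stay exact
    except ValueError:
        pass
    try:
        number = float(cell)
    except ValueError:
        return cell  # not numeric: compare as-is
    # 23.0 collapses to 23; other floats use their shortest repr.
    return str(int(number)) if number.is_integer() else repr(number)


def rows_equivalent(a: Iterable[List[str]], b: Iterable[List[str]]) -> bool:
    """Compare two parsed CSVs row by row after normalising every cell."""
    rows_a, rows_b = list(a), list(b)
    if len(rows_a) != len(rows_b):
        return False
    return all(
        [normalise(x) for x in row_a] == [normalise(y) for y in row_b]
        for row_a, row_b in zip(rows_a, rows_b)
    )
```

Passing two `csv.reader` objects (one per output file) is enough, since the reader already strips quoting. The helper loads both files fully, which is fine for a 1000-row sample but not for the full table.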

Test plan

make check (ruff + pre-commit) passes locally
Fix 1 validated on Atera Cell Pellet via Tower: workflow init now passes (QC mode SUCCEEDED in 137s; v3/v4 reruns confirm)
Streaming PARQUET_TO_CSV verified locally on a 1000-row sample of real Atera transcripts.parquet
CI: GitHub Actions on this PR — pending reviewer trigger
Optional: re-run an existing Xenium v1 (XOA 4.x) nf-test fixture to confirm no regression in current panels

Out of scope

XeniumRanger Atera compatibility: this PR covers non-XR modes only. 10x is expected to publish XR Atera support separately.
Other Atera-scale memory tuning: the same eager-load anti-pattern lives in bin/baysor_preprocess_transcripts.py (pd.read_parquet + filter in memory). It OOMed at 168 GB in our Tower tests. Worth a follow-up perf PR.
Tool upgrades (Baysor C++, Punkst, Cellpose v4, etc.): tracked separately as upgrade-discussion issues #164 ([upgrade] Baysor: migrate Julia v0.7.1 → C++ cpp-0.8.2 (eliminates PARQUET_TO_CSV for Baysor paths)) through #169 ([upgrade] Proseg: bump v3.1.0 → v3.1.1 (patch release)).
Templates-directive considerations: PR #154 (refactor(modules): convert all 19 module-level bin scripts to Nextflow templates) merged in May, but PR #162 (Renaming pipeline into spatialaxe, a rename via template-sync) overwrote those changes. If the templates conversion is reattempted, the python3 ${moduleDir}/templates/<script>.py shell-call invocation breaks on Tower/AWS Batch because ${moduleDir} interpolates to a head-node path the workers don't have. I'll leave a note on #154 with the empirical evidence so it doesn't get re-introduced silently.

Related

Atera compatibility evaluation report: docs/2026-05-28_REVIEW_atera-on-spatialaxe-compatibility.md (in companion Altos-internal branch)
Tool upgrade discussion issues:
#164 (Baysor C++): [upgrade] Baysor: migrate Julia v0.7.1 → C++ cpp-0.8.2 (eliminates PARQUET_TO_CSV for Baysor paths)
#165 (Punkst): [upgrade] Ficture: migrate seqscope/ficture (Python) → Yichen-Si/punkst (C++)
#166 (Cellpose): [upgrade] Cellpose: identify current Wave-container version pin vs upstream v4.1.1
#167 (StarDist): [upgrade] StarDist: identify current Wave-container version pin vs upstream 0.9.2
#168 (Segger): [investigate] Segger: clarify version 1.0.14 source - dongzehe/segger container vs upstream dpeerlab/segger (canonical) / EliHei2/segger_dev (legacy)
#169 (Proseg): [upgrade] Proseg: bump v3.1.0 → v3.1.1 (patch release)

Empirical evidence from 10x Atera (Xenium Gen2) Cell Pellet preview data: all non-XR pipeline modes (qc, preview, segfree-baysor, segfree-ficture) failed at workflow init in ~90s with errorMessage '<bundle>/gene_panel.json' even though gene_panel.json is only read by XENIUMRANGER_RELABEL_RESEGMENT.

Root cause: the .map { ... file(<bundle>/gene_panel.json, checkIfExists: true) } closure in the else-branch at workflows/spatialaxe.nf:343-352 is evaluated eagerly by Nextflow as soon as ch_input emits, regardless of whether ch_gene_panel is downstream-consumed. For any bundle that doesn't ship gene_panel.json (10x Atera Xenium Gen2 preview data is one such case), workflow init fails for every mode, including qc/preview/segfree which never need a gene panel.

Fix: wrap the entire ch_gene_panel construction in if (do_relabel) { ... }. When false, ch_gene_panel keeps its channel.empty() initialisation from line 112. No behaviour change for image/coordinate modes (do_relabel can still be true via --gene_panel or --relabel_genes).

PARQUET_TO_CSV is a pure format transformation: every input row maps to one output row. The pandas-based implementation eagerly loaded the entire parquet into a DataFrame before writing CSV. For the 10x Atera WTA panel transcripts.parquet (236M rows x 13 cols, 2.9 GB compressed, ~30+ GB uncompressed), this OOMed twice at process_low defaults (12 GB -> 24 GB) before succeeding on attempt 3 (36 GB, 34 GB peak rss) in a Tower run on Atera Cell Pellet. Replace with pyarrow iter_batches() + pa_csv.CSVWriter. Memory is bounded by --batch-size (default 200000 rows ~= 130 MB) instead of the full row count. Same I/O, same CPU, ~100x less peak memory. Verified locally on a 1000-row sample of real Atera transcripts.parquet: output is well-formed RFC-4180 CSV with same column order and same data. Cosmetic differences from pandas (string quoting, '23' vs '23.0', 'true' vs 'True') are universally tolerated by downstream consumers (Baysor Julia CSV.jl, Ficture pd.read_csv) and the only affected bool column (is_gene) isn't read by either consumer.

Preventive guardrail against the regression we hit in PR nf-core#154 (May 2026). The python3 ${moduleDir}/templates/<script>.py invocation breaks on Seqera Platform Tower running AWS Batch — ${moduleDir} interpolates to a head-node path that does not exist on worker containers, so every task fails at start with 'No such file or directory'. This pre-commit hook makes CI fail before such code can be merged. If a contributor needs to keep templates inside the module directory, use the canonical 'template' directive or declare the template as a path input so Nextflow stages it to the worker. Empirical evidence: Atera compatibility evaluation 2026-05-28. See PR nf-core#154 inline comments for the full analysis.
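
A check of this kind could look roughly like the sketch below. It is a hypothetical illustration, not the hook added in this PR: the regular expression, the function names, and the exit-code convention are all assumptions about how such a guardrail might be built with the standard library.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Flags a python/python3 call that runs a script directly out of
# ${moduleDir}/templates/, with or without a backslash before the dollar sign.
PATTERN = re.compile(r"python3?\s+\\?\$\{moduleDir\}/templates/")


def find_offending_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every match in the text."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if PATTERN.search(line)
    ]


def main(paths: Iterable[str]) -> int:
    """Print each offending line; return 1 if any were found, else 0."""
    failures = 0
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for number, line in find_offending_lines(text):
            print(f"{path}:{number}: {line}")
            failures += 1
    return 1 if failures else 0
```

pre-commit passes the staged file names as arguments, so a thin wrapper calling `sys.exit(main(sys.argv[1:]))` would be enough to make a non-zero return fail the commit.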


an-altosian added 2 commits May 28, 2026 20:41

an-altosian mentioned this pull request May 28, 2026

refactor(modules): convert all 19 module-level bin scripts to Nextflow templates #154

Merged

14 tasks

an-altosian added 3 commits May 28, 2026 20:48

ci: retrigger after upstream Cloudflare 520 flakes

b46e722

Both failed checks (nf-core lint, docker | 25.04.0 | 3/5) failed at 'Install Nextflow' step with HTTP 520 from Cloudflare while fetching nf-core/setup-nextflow@v2 release assets. No code change needed — retrigger CI.

ci: retrigger nf-core#2 - Cloudflare 520 on nf-core/setup-nextflow persists

6f7b016

an-altosian wants to merge 5 commits into nf-core:dev from an-altosian:feature/atera-upstream-fixes
