fix: 10x Atera (Xenium Gen2) compatibility — gene_panel.json gating + streaming PARQUET_TO_CSV#170

Open
an-altosian wants to merge 5 commits into nf-core:dev from an-altosian:feature/atera-upstream-fixes

@an-altosian
Collaborator

Summary

Two independent, narrowly-scoped fixes surfaced by an empirical compatibility evaluation of nf-core/spatialaxe against 10x Atera (Xenium Gen2 / "Xenium v2") preview data — three WTA-panel samples (Cell Pellet, Breast Cancer, Cervical Cancer; 18k-target panel, 236M transcripts in the smallest sample). Full evaluation report and per-tool upgrade analysis live in companion issues — see Related below.

Fix 1 — workflows/spatialaxe.nf: guard gene_panel.json read on do_relabel

The .map { ... file(<bundle>/gene_panel.json, checkIfExists: true) } closure in the else-branch at workflows/spatialaxe.nf:343-352 is evaluated eagerly by Nextflow as soon as ch_input emits, regardless of whether ch_gene_panel is downstream-consumed. ch_gene_panel is only consumed by XENIUMRANGER_RELABEL_RESEGMENT inside if (do_relabel). So when a bundle doesn't ship gene_panel.json and the run sets neither --gene_panel nor --relabel_genes, workflow init still fails, even for the qc / preview / segfree modes that never invoke the XR relabel step.

Empirical evidence: all four non-XR mode smoke runs on Atera Cell Pellet failed at workflow init in ~90s with errorMessage = '<bundle>/gene_panel.json'. After this fix, QC mode succeeds end-to-end on the same bundle.

Fix: wrap the entire ch_gene_panel construction in if (do_relabel) { ... }. When do_relabel is false, ch_gene_panel keeps its channel.empty() initialisation from line 112. No behaviour change for image/coordinate modes (do_relabel can still be true via --gene_panel or --relabel_genes).
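
The guard can be pictured with a small Python analogy. This is not the Nextflow code from workflows/spatialaxe.nf; the function name resolve_gene_panel is hypothetical and only mirrors the idea that the file lookup should happen solely when do_relabel is true:

```python
from pathlib import Path


def resolve_gene_panel(bundle_dir, do_relabel):
    """Return the bundle's gene_panel.json path, or None when relabelling is off.

    Hypothetical helper: a Python stand-in for the guarded channel construction.
    """
    if not do_relabel:
        # Guarded branch: nothing is looked up, so a missing file cannot fail the run.
        return None
    panel = Path(bundle_dir) / "gene_panel.json"
    if not panel.is_file():
        # Plays the role of checkIfExists: true in the Nextflow closure.
        raise FileNotFoundError(str(panel))
    return panel
```

Without the early return, the lookup would run for every mode, which is the analogue of the eager closure failing qc / preview / segfree runs.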

Fix 2 — bin/utility_parquet_to_csv.py: stream-convert instead of eager-load

PARQUET_TO_CSV is a pure format transformation (every input row maps to one output row). The pandas-based implementation eagerly loaded the entire parquet into a DataFrame before writing CSV:

```python
df = pd.read_parquet(transcripts, engine="pyarrow") # full table in memory
df.to_csv(...)
```

For Atera's transcripts.parquet (236M rows × 13 cols, 2.9 GB compressed, ~30+ GB uncompressed), this OOMed twice at process_low defaults (12 GB → 24 GB) before succeeding on attempt 3 (36 GB; peak rss 34 GB) in a Tower run.

Fix: replace this with pyarrow iter_batches() + pa_csv.CSVWriter. Memory is bounded by --batch-size (default 200,000 rows ≈ 130 MB) rather than by the total row count. Same I/O, same CPU, ~100× less peak memory.

Verified locally on a 1000-row sample of real Atera transcripts.parquet: output is well-formed RFC-4180 CSV with the same column order and the same data. Cosmetic differences from pandas (string quoting, 23 vs 23.0, true vs True) are tolerated by both downstream consumers (Baysor Julia CSV.jl, Ficture pd.read_csv); the only affected bool column (is_gene) isn't read by either consumer.
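
One way to sanity-check the pd.read_csv side of that claim is to parse two tiny CSV snippets that differ only in those cosmetic details. The rows below are made up for illustration (only is_gene comes from this PR; the other column names and values are assumptions), and which writer produces which numeric form is also assumed:

```python
import io

import pandas as pd

# Unquoted strings, 23.0-style numbers, capitalised booleans (illustrative data).
pandas_style = "feature_name,qv,is_gene\nGENE1,23.0,True\nGENE2,40.0,False\n"
# Quoted strings, 23-style numbers, lowercase booleans (illustrative data).
arrow_style = '"feature_name","qv","is_gene"\n"GENE1",23,true\n"GENE2",40,false\n'

a = pd.read_csv(io.StringIO(pandas_style))
b = pd.read_csv(io.StringIO(arrow_style))

# check_dtype=False: 23 parses as an integer and 23.0 as a float, but the values agree.
pd.testing.assert_frame_equal(a, b, check_dtype=False)
```

This covers only the pandas reader; it says nothing about how CSV.jl handles the same differences.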

Test plan

  • make check (ruff + pre-commit) passes locally
  • Fix 1 validated on Atera Cell Pellet via Tower: workflow init now passes (QC mode SUCCEEDED in 137s; v3/v4 reruns confirm)
  • Streaming PARQUET_TO_CSV verified locally on a 1000-row sample of real Atera transcripts.parquet
  • CI: GitHub Actions on this PR — pending reviewer trigger
  • Optional: re-run an existing Xenium v1 (XOA 4.x) nf-test fixture to confirm no regression in current panels

Out of scope

Related

Empirical evidence from 10x Atera (Xenium Gen2) Cell Pellet preview
data: all non-XR pipeline modes (qc, preview, segfree-baysor,
segfree-ficture) failed at workflow init in ~90s with errorMessage
'<bundle>/gene_panel.json' even though gene_panel.json is only read by
XENIUMRANGER_RELABEL_RESEGMENT.

Root cause: the .map { ... file(<bundle>/gene_panel.json,
checkIfExists: true) } closure in the else-branch at
workflows/spatialaxe.nf:343-352 is evaluated eagerly by Nextflow as
soon as ch_input emits, regardless of whether ch_gene_panel is
downstream-consumed. For any bundle that doesn't ship gene_panel.json
(10x Atera Xenium Gen2 preview data is one such case), workflow init
fails for every mode, including qc/preview/segfree which never need
a gene panel.

Fix: wrap the entire ch_gene_panel construction in if (do_relabel)
{ ... }. When false, ch_gene_panel keeps its channel.empty()
initialisation from line 112. No behaviour change for
image/coordinate modes (do_relabel can still be true via --gene_panel
or --relabel_genes).

PARQUET_TO_CSV is a pure format transformation: every input row maps
to one output row. The pandas-based implementation eagerly loaded the
entire parquet into a DataFrame before writing CSV. For the 10x Atera
WTA panel transcripts.parquet (236M rows x 13 cols, 2.9 GB
compressed, ~30+ GB uncompressed), this OOMed twice at process_low
defaults (12 GB -> 24 GB) before succeeding on attempt 3 (36 GB,
34 GB peak rss) in a Tower run on Atera Cell Pellet.

Replace with pyarrow iter_batches() + pa_csv.CSVWriter. Memory is
bounded by --batch-size (default 200000 rows ~= 130 MB) instead of
the full row count. Same I/O, same CPU, ~100x less peak memory.

Verified locally on a 1000-row sample of real Atera
transcripts.parquet: output is well-formed RFC-4180 CSV with same
column order and same data. Cosmetic differences from pandas (string
quoting, '23' vs '23.0', 'true' vs 'True') are universally tolerated
by downstream consumers (Baysor Julia CSV.jl, Ficture pd.read_csv)
and the only affected bool column (is_gene) isn't read by either
consumer.

Preventive guardrail against the regression we hit in PR nf-core#154 (May 2026).
The python3 ${moduleDir}/templates/<script>.py invocation breaks on
Seqera Platform Tower running AWS Batch — ${moduleDir} interpolates to
a head-node path that does not exist on worker containers, so every
task fails at start with 'No such file or directory'.

This pre-commit hook makes CI fail before such code can be merged. If a
contributor needs to keep templates inside the module directory, use
the canonical 'template' directive or declare the template as a path
input so Nextflow stages it to the worker.
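
A rough sketch of what such a check could look like, assuming a plain regex scan over the files handed to the hook. The pattern, function names, and message text are hypothetical and are not the hook added in this PR:

```python
import re
from pathlib import Path

# Hypothetical pattern: python/python3 followed by a script path under ${moduleDir}.
MODULEDIR_CALL = re.compile(r"\bpython3?\s+[\"']?\$\{moduleDir\}/")


def find_violations(text):
    """Return 1-based line numbers that invoke a script via ${moduleDir}."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if MODULEDIR_CALL.search(line)
    ]


def main(paths):
    """Print every offending line and return a non-zero status if any was found."""
    failed = False
    for path in paths:
        for lineno in find_violations(Path(path).read_text(encoding="utf-8")):
            print(f"{path}:{lineno}: script invoked via ${{moduleDir}}; "
                  "use the template directive or a path input instead")
            failed = True
    return 1 if failed else 0
```

A hook entry point would pass the staged file names to main and exit with its return value, so a match blocks the commit.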

Empirical evidence: Atera compatibility evaluation 2026-05-28. See
PR nf-core#154 inline comments for the full analysis.

Both failed checks (nf-core lint, docker | 25.04.0 | 3/5) failed at
'Install Nextflow' step with HTTP 520 from Cloudflare while fetching
nf-core/setup-nextflow@v2 release assets. No code change needed —
retrigger CI.