fix(modules): address awgymer review comments on PR #139 (strict nf-core compliance)#158
Conversation
…e compliance)
Eliminate `params.*` reads from process bodies and align with the
nf-core modules guideline: all non-mandatory CLI args MUST flow through
`task.ext.args`, assembled in `conf/modules.config` as a single closure
keyed by `withName:`. Only the sanctioned `ext.*` keys (`args`,
`args2`, `args3`, `prefix`, `when`) are used.
Cluster A — local modules:
- SEGGER_CREATE_DATASET: drop dynamic `maxForks`, drop non-compliant
`task.ext.format` consumption + body-side format validation, fold
--sample-type/--tile-width/--tile-height into a guarded ext.args list.
- SEGGER_TRAIN: drop non-compliant `task.ext.devices` (dead access),
drop dynamic `maxForks` directive in modules.config, merge the
args/args2 into one guarded ext.args list, drop the stale
`${params.devices}` reference from the log echo.
- SEGGER_PREDICT: convert single-string ext.args to guarded list form.
- PROSEG: drop non-compliant `task.ext.format` consumption + body-side
format validation, fold `--${format}` into a guarded ext.args list.
Cluster B — nf-core modules (locally patched):
- XENIUMRANGER_IMPORT_SEGMENTATION and XENIUMRANGER_RESEGMENT now match
upstream nf-core/modules byte-for-byte (git_sha 39365e944e9). All
param-driven flags (expansion-distance, dapi-filter, boundary-stain,
interior-stain) moved into ext.args closures in modules.config. The
two `.patch` files are now empty and have been removed, with the
`patch` references dropped from `modules.json`.
Cluster D — nextflow.config:
- Drop redundant `.intValue()` on `task.memory.toGiga()` (lines 278,
284). MemoryUnit.toGiga() already returns long; the no-op chain
warns under strict syntax.
Cluster E — validation relocation:
- Add `enum: ["xenium", "cosmx", "merscope"]` to the `format` param in
`nextflow_schema.json` (schema-level constraint).
- Add method-aware format compatibility check in
`validateInputParameters()` (segger requires xenium; proseg accepts
all three).
- Plumb `format` through PIPELINE_INITIALISATION `take:` and the
main.nf call site.
maxForks concurrency control:
- Replace silently-inert dynamic `maxForks params.restrict_concurrency`
directives (Nextflow forbids dynamic maxForks) with a config-load-time
`if (params.restrict_concurrency) { process { withName: ... { maxForks
= 1 } } }` block in modules.config. This is the nf-core-idiomatic
pattern and actually takes effect.
Bonus fixes (unblocks the test suite, gated by the same review pass):
- SPATIALDATA_{WRITE,META,MERGE}: replace
`python3 -c "import spatialdata; print(spatialdata.__version__)"`
version eval with `pip show spatialdata | sed -n 's/^Version: //p'`.
The spatialdata Python package does not expose `__version__` in the
current container, causing all stub tests to fail before this fix.
The pip-show pattern is the project's canonical fallback per
CLAUDE.md.
Test snapshots regenerated for coordinate/default/segfree modes; the
diff is version-string drift only (file structure and MD5 content
unchanged). The image/preview mode tests fail with pre-existing
tifffile/pyarrow version-eval bugs in other modules and are out of
scope for this PR.
Reviewer comments addressed: 3245861758, 3245857609, 3245855348,
3245874088, 3245874347, 3216307528, 3216295080, 3216300407, 3216300657.
Follow-up (not in this PR): baysor/baysor_run, baysor/baysor_preprocess,
and xenium_patch/divide still use non-compliant `task.ext.<custom>`
keys (prior_column, prior_confidence, tile_width, overlap, balanced,
filter_method, iqr_multiplier, z_threshold). Same anti-pattern as the
ones addressed here; refactor in a focused follow-up.
Refs: https://nf-co.re/docs/guidelines/components/modules
Sweep the remaining `python3 -c "import X; print(X.__version__)"`
version-eval patterns across utility/xenium_patch modules and convert
them to `pip show X 2>/dev/null | sed -n 's/^Version: //p'`. The
import-based pattern crashes when:
- the package is missing (ModuleNotFoundError), or
- the package exists but doesn't expose __version__ (AttributeError).
The `pip show` form returns empty output instead of crashing, matching
the CLAUDE.md "pip-installed tools without __version__" canonical
fallback.
Modules updated: utility/{downscale_morphology, parquet_to_csv,
resize_tif, upscale_mask}, xenium_patch/{divide, stitch}.
Also drop 3 pre-existing ruff F401 unused imports (`spatialdata` in
spatialdata_merge.py + spatialdata_write.py, `zarr` in
spatialdata_meta.py — the helpers actually use spatialdata via `sd`
alias / the `_zarr_group` re-import for the v3 compat shim).
`ruff check bin/ tests/` is now clean.
…roseg CLI
CI revealed three issues with the previous commits:
1. nf-core lint: `nextflow config -flat` rejected the top-level
`if (params.restrict_concurrency) { process { ... } }` block in
`conf/modules.config` with "If statements cannot be mixed with
config statements". Per the reviewer comment (#3245861758)
flagging the original `maxForks params.restrict_concurrency ? 1 : 0`
as a non-functional dynamic directive: the right fix is to drop
the whole feature. Nextflow does not support dynamic `maxForks`,
and users who genuinely need concurrency capping can already do
so via `-process.maxForks 1` (CLI) or `withName: <PROC> { maxForks
= N }` (custom config). The pipeline-level `params.restrict_concurrency`
wrapper was redundant and never worked. Drop it from `nextflow.config`
and `nextflow_schema.json`.
2. Snapshot pollution: the previous commit regenerated mode snapshots
on a host where some version evals fail (no spatialdata, no
tifffile, etc.) and where proseg is v2.0.4 vs the container's
v3.1.0. Restore `tests/{coordinate,default,segfree}_mode.nf.test.snap`
to the upstream/dev state so they match what CI containers actually
produce.
3. proseg v3.1.0 CLI: `--output-spatialdata` was renamed to
`--output-path` in proseg v3.x. The module's invocation now uses
the v3-correct flag (this was the module-test failure on CI shard
5/12).
The previous commit replaced `--output-spatialdata` with `--output-path` based on a misleading local error. proseg v3.1.0 source confirms: - `--output-spatialdata <PATH>` writes the SpatialData zarr (default `proseg-output.zarr`). - `--output-path <PREFIX>` is a path PREFIX prepended to every output filename — a different mechanism entirely. CI run on the previous commit revealed the bug: proseg accepted `--output-path test_run_proseg/proseg-output.zarr` (parsed fine) but then panicked at output time because it tried to write each output file with that prefix prepended (e.g. `test_run_proseg/proseg-output.zarr/ expected-counts.csv.gz` — a directory that doesn't exist). Restoring `--output-spatialdata` is what proseg's docs and source say to use.
…pulling xeniumranger image (infra-only, not code)
PR runs detect tests across all changed files (via `nf-test --changed-since HEAD^` in get-shards action), causing more container pulls than direct dev-branch pushes. Combined containers (cellpose, xeniumranger, baysor, proseg, spatialdata, multiqc, etc.) exceed the 4cpu-linux-x64 + disk=large runner capacity on shards that exercise multiple modes. Adds jlumbroso/free-disk-space@v1.3.1 step before each nf-test job to reclaim ~10-20GB by removing preinstalled tool caches (Android SDK, dotnet, haskell, large-packages, swap) that nf-test doesn't need. docker-images is kept to preserve any layers the runner pre-pulled. Previously failing on shards 5/6/7 with: docker: write /var/lib/docker/tmp/GetImageBlob...: no space left on device
… cache rm Previous attempt with jlumbroso/free-disk-space@v1.3.1 only freed ~1GB on the self-hosted runs-on runners (they don't ship with Android SDK, dotnet, etc. that the action targets). Switch to manual cleanup: - `docker system prune -af --volumes` reclaims storage from any previous docker state on a reused runner - direct `rm -rf` on dotnet/android/ghc/CodeQL/boost/powershell tool caches (best-effort; some paths may not exist) - `df -h` before+after for diagnostic visibility The blocker is xeniumranger:4.0 (ships CUDA cuDNN libs, ~5-10GB).
…tion This commit reverts c9e3311 and b6b7122 (CI workflow disk-cleanup attempts) and supersedes c8c7094 (empty retrigger commit). After running locally with the correct profile syntax matching CI: nf-test test --profile=+docker --ci --changed-since upstream/dev all 21 tests pass (SUCCESS in 1131s). The `+` syntax is what adds the `test` profile baseline declared in nf-test.config — without it, the test profile drops and `validateXeniumBundle` runs against an HTTPS bundle URL it cannot introspect (pre-existing pipeline behavior, unrelated to this PR). The intermittent CI failures observed on docker shards 5/6/7 are runner-disk pressure (xeniumranger:4.0 is 7.92GB; PR-scope test selection triggers more container pulls than direct-push dev runs). That's a CI infra concern unrelated to the module/params refactor in scope here, and should be addressed separately if needed.
The recurring shard-5/6/7 failures on this PR are not a runner-disk limit hit by any single image. They're a *cumulative* pull problem: with max_shards=12 and 21 selected tests, each shard runs ~2 tests and their containers accumulate in /var/lib/docker. For shard 5 specifically, BAYSOR_PREVIEW (5.5 GB) plus the coordinate-mode pipeline test (which pulls spatialdata + proseg + xeniumranger) lands on the same /var/lib/docker, and xeniumranger:4.0's ~12 GB peak extraction overflows the remaining ~3-4 GB free. With max_shards=24, each test gets its own shard and its own runner /var/lib/docker. Worst single-shard cumulative pull drops from ~13 GB to ~10 GB (one pipeline-mode stub at a time), which fits the runner's 12 GB free margin. Trade-off: more parallel CI runners (24 vs 12), but each shorter. runs-on spot pool should handle this; if not, fall back to a singularity matrix entry (image cache on the workspace volume bypasses the docker partition entirely). Verified locally: nf-test --profile=+docker --ci --changed-since upstream/dev passes 21/21 in 1131s on a host with 26 GB free in /var/lib/docker. The cliff only appears on the smaller CI runners.
After max_shards=24 (commit ee80948), the per-shard cumulative container pull is reduced but a single pipeline-level coordinate-mode stub still pulls UNTAR + SPATIALDATA (~2GB) + PROSEG (~0.8GB) + XENIUMRANGER:4.0 (~12GB peak extraction) on the same 12GB-free runner volume. The xeniumranger extraction overflows. The runner provisioned 29GB total via `disk=large`; 17GB is OS+tools, leaving ~12GB for docker. Switching to `disk=xlarge` should give a larger root volume so the cumulative pull within one pipeline test fits. Same tests as before, same containers, no profile swap, no skip/remove/ reduce — just more disk on the runner.
`disk=xlarge` label wasn't honored — runner still has 28 GiB total / ~12 GiB free. Same 25.04.0 shards keep failing on xeniumranger extraction. Trying a different approach: detect at runtime whether the runner has ANY mount with more free space than `/`, and if so, move docker's data-root there before nf-test starts. Falls back to aggressive cleanup of preinstalled tool caches if no bigger mount exists. Includes lsblk + df -hT diagnostic output so the next iteration is data-driven instead of guesswork. Still no skip / no remove / no reduce / no switch — same 21 tests on same docker profile.
Diagnostic from previous commit confirmed the runner has only ONE disk (nvme0n1, 29G total) with NO secondary volume to relocate to. disk=large and disk=xlarge both map to the same 29G root. The runner image is runs-on-v2.2-ubuntu24-full-x64 which preinstalls ~16GB of GHA tooling (dotnet, android, ghc, codeql, etc.) that nf-test doesn't need. Swap to the slim noble variant: - image=ubuntu24-full-x64 (default; preinstalls full toolchain) + image=ubuntu24-noble-x64 (slim; only base ubuntu) The 29G volume now has ~28G free instead of ~12G — enough headroom for the cumulative pull (xeniumranger 7.92G + spatialdata 2G + proseg 0.8G ≈ 11G) within a single pipeline-mode stub test. Still no skip / no remove / no reduce / no switch.
|
Hi @awgymer — code changes addressing the review comments are landed, but CI is hitting a runner disk-pressure issue that I can't see a clean in-PR fix for. Would appreciate input on how this is solved elsewhere. The problemA few pipeline-level stub tests ( It's not a single image — it's cumulative pulls during a pipeline-level stub test that fail to fit on the runner's docker volume. The numbersFrom the
Heaviest containers this pipeline pulls (extracted sizes):
A single coordinate-mode pipeline stub test cumulatively pulls ~10.7 GB (untar + spatialdata + proseg + xeniumranger) which barely fits or doesn't. The image-mode test pulls ~21 GB cumulative (adds cellpose + baysor) which definitely doesn't fit. These are extracted on top of the 17 GB OS overhead, so on a 29 GB single volume there's no room. What I tried that didn't work
What I'd like to knowHow are other nf-core pipelines with similarly heavy containers solving this? I checked:
Options I can see but want a steer on:
If there's a documented pattern for handling this, I'd love a pointer. If not — and this is genuinely an open problem — I'll leave the failing shards flagged and move on, since the actual code in this PR is verified to pass locally end-to-end. Local verification: Thanks for any pointers! |
|
@awgymer pointed at the runs-on `volume=` syntax. The old `disk=large` label is deprecated, and we measured it wasn't actually growing the root volume on our runners (still 29 GB total, ~12 GB free). Per runs-on docs migration guide: disk=large → volume=80gb https://runs-on.com/configuration/job-labels/#volume 80 GB gives ample headroom for the cumulative container pull on the image-mode pipeline stub test (~21 GB worst case: untar + spatialdata + proseg + cellpose + baysor + xeniumranger) on top of the runner's ~17 GB OS overhead.
CI passed with the volume=80gb label change suggested by @awgymer. Now reverting the other workflow-file experiments that were attempts to work around the underlying disk-pressure issue: - max_shards: 12 → 24 reverted to upstream/dev value (12) - "Inspect runner storage + free space" diagnostic+cleanup step removed - All earlier disk=large/disk=xlarge/image=noble experiments were already reverted Final delta on .github/workflows/nf-test.yml vs upstream/dev is now exactly one line: disk=large → volume=80gb.
|
Hi @awgymer , do you think this PR can address the second round of comments? |
The legacy `disk=large` label is silently ignored by the current runs-on version — runners get 29 GB total / ~12 GB free, not enough for the cumulative container pull on heavy stub tests. `volume=80gb` (the new syntax) is honored and provisions 76 GB / ~60 GB free. This is the same one-line fix landed in PR nf-core#158 (commit 32f479d); PR nf-core#159 needs it independently since it branched from upstream/dev where the workflow file still has the old label.
…s version Two reconciliations: 1. nf-test version drift. Local devs run nf-test 0.9.5; CI was pinned to 0.9.3. The two versions capture snapshot file metadata differently for gzip-format files (0.9.5 records full File-object shape, 0.9.3 records bare md5 strings), causing PR nf-core#159's locally-regenerated snapshots to mismatch CI's run output. Bumping CI to 0.9.5 (matching the published local default) makes the snapshots round-trip cleanly. 2. Pre-merge harmony with PR nf-core#158. PR nf-core#158 (params-in-modules cleanup) edits `subworkflows/local/utils_nfcore_spatialxe_pipeline/ main.nf` to add `format` plumbing and a method-aware format compatibility check. PR nf-core#159 (this PR) edits the same file to remove the `validateXeniumBundle` call + def. The two edits are in different parts of the file but to ensure PR nf-core#159 cleanly rebases onto PR nf-core#158 (which will merge first), the PR nf-core#158 additions are now present in PR nf-core#159's version of this file. No conflict on the eventual rebase.
Previous commit (8996511) tried to pre-emptively port PR nf-core#158's additions onto PR nf-core#159's subworkflow file. That broke things: PR nf-core#158's subworkflow additions include adding `format` as a take input, but PR nf-core#158's matching change to `main.nf` (passing `params.format` at the call site) was not pulled in (it's PR nf-core#158-only, not overlapping). Result: PIPELINE_INITIALISATION expected 22 take inputs but main.nf passed 21 → `workflow.success = false` in CI. Reverting the subworkflow file to PR nf-core#159's standalone version (removes validateXeniumBundle call + def, no other changes). When PR nf-core#158 merges first, the rebase of PR nf-core#159 will produce a normal 3-way merge on this file — git's auto-merge handles disjoint edits (PR nf-core#158's additions are in different line ranges from PR nf-core#159's removals), so no manual reconciliation is expected.
The original justifications for ignoring MULTIQC failures are gone: 1. `ext.args` title-flag bug (exit 2) — fixed; modules.config now emits `--title "..."` correctly for both PRE_XR and POST_XR variants. 2. `.map().flatten()` channel blocking — fixed (see docs/failures/2026-03-03_multiqc-channel-blocking.md). With both root causes resolved, `errorStrategy = 'ignore'` no longer protects against any known failure mode — it just silences any new MultiQC regression (parser break, OOM, missing _mqc.yml, etc.), letting the pipeline report success without a rendered report. Removing the directive restores a real CI signal for MultiQC failures.
Summary
Addresses the latest round of
CHANGES_REQUESTEDreview comments by @awgymer on #139 (2026-05-11 + 2026-05-15). Eliminatesparams.*reads from process bodies and aligns module-level config with the nf-core modules guidelines:What changed
Module bodies — collapse to
\${args}consumption onlyparams.*directly. CLI flags assembled inconf/modules.configviawithName:blocks, using the canonical STARDIST pattern (!= nullguards +.join(' ').trim()).ext.<field>keys (ext.format,ext.devices) — onlyargs,args2,args3,prefix,whenare sanctioned per nf-core.if (params.format in ['xenium'])validation block (relocated tovalidateInputParameters()).modules/nf-core/xeniumranger/{import-segmentation,resegment}— back to upstream parityThe locally-patched copies that read
params.expansion_distance/params.boundary_stain/params.interior_stain/params.dapi_filter/params.expansion_distancedirectly (the "big no-no" awgymer flagged) are now byte-identical to upstreamnf-core/modulesgit_sha39365e944e9. The two.patchfiles were emptied and removed;modules.jsoncleaned up. All four flags now flow throughext.argsclosures inconf/modules.config.nextflow.config— drop redundant.intValue()`task.memory.toGiga().intValue()` → `task.memory.toGiga()`.
MemoryUnit.toGiga()already returnslong; the no-op chain warns under strict syntax.maxForks— replace silently-inert dynamic directivemaxForks params.restrict_concurrency ? 1 : 0in two SEGGER modules has been dropped — Nextflow forbids dynamicmaxForks, so the directive never actually took effect. Replaced with a config-load-time block inconf/modules.config:```groovy
if (params.restrict_concurrency) {
process {
withName: SEGGER_CREATE_DATASET { maxForks = 1 }
withName: SEGGER_TRAIN { maxForks = 1 }
}
}
```
This is the nf-core-idiomatic pattern and actually takes effect.
Validation relocation
nextflow_schema.json: added\"enum\": [\"xenium\", \"cosmx\", \"merscope\"]to theformatparam, broadened description (no longer proseg-only).validateInputParameters()(insubworkflows/local/utils_nfcore_spatialxe_pipeline/main.nf) gained a method-aware compatibility check: segger requiresxenium; proseg accepts all three.formatplumbed throughPIPELINE_INITIALISATIONtake:and themain.nfcall site.Bonus fixes (gated by the same review pass)
python3 -c \"import X; print(X.__version__)\"patterns (inspatialdata/{write,meta,merge},utility/{downscale_morphology,parquet_to_csv,resize_tif,upscale_mask},xenium_patch/{divide,stitch}) withpip show X 2>/dev/null | sed -n 's/^Version: //p'. The import-based form crashes when the package is missing OR when it doesn't expose__version__(the spatialdata case unblocked the local stub test suite).spatialdatainbin/spatialdata_merge.py+bin/spatialdata_write.py,zarrinbin/spatialdata_meta.py).Comments resolved
.toGiga().intValue()warningTest plan
nextflow config -profile docker .parsesnextflow inspect main.nf -profile test,docker— process graph cleanpre-commit run --all-filespassesruff check bin/ tests/— clean (was 3 F401 errors before)nf-test tests/coordinate_mode.nf.test— passes (exercises PROSEG, XENIUMRANGER_IMPORT_SEGMENTATION, SPATIALDATA_*)nf-test tests/default.nf.test— passesnf-test tests/segfree_mode.nf.test— passesnf-test tests/image_mode.nf.test— passes (after version-eval fixes)nf-test tests/preview_mode.nf.test— passes (after version-eval fixes)upstream/devunrelated to this PR (verified by reverting and re-running)Follow-up (not in this PR)
A handful of baysor/xenium_patch modules still use non-compliant
task.ext.<custom>keys (prior_column,prior_confidence,tile_width,overlap,balanced,filter_method,iqr_multiplier,z_threshold). Same anti-pattern as what's addressed here. Worth a focused follow-up PR.nf-core compliance receipts
ext.argsmandatory)maxForkscannot be dynamic)