Add cluster metrics viz #11372
Conversation
Um, you need to rebase these changes on the current master, as there was a bulk update (and you have a massive conflict ;) ).
Hmm, this is an entirely custom set of modules?
Thanks @SPPearce, that makes sense. I’ve moved both custom Python-based modules under … Both module tests now pass locally for …
SPPearce
left a comment
I think it'd need to be custom/clustermetrics and custom/clustervisualiation (no underscores allowed).
You'll need to use template directly, rather than referring to ${projectDir}/modules/nf-core/custom/cluster_metrics/templates/cluster_metrics.py.
Thanks @SPPearce — I renamed the custom modules to remove underscores and updated both modules to use …
- cluster_metrics: computes clustering quality metrics + k-sweep
- cluster_viz: generates PCA, UMAP and t-SNE plots colored by cluster
- Both use conda environment.yml
- Full nf-test coverage
SPPearce
left a comment
You can't use moduleDir at all in the main.nf, it won't work on cloud systems as the python script won't be there.
You need to use a template, like these modules.
You could make the docker/singularity containers via Seqera Containers rather than needing a custom image in the nf-core quay.io.
pinin4fjords
left a comment
Hi @dbaku42, thanks for the contribution. AI-assisted review (Claude, on behalf of @pinin4fjords). Some overlap with @SPPearce's earlier feedback. Beginner-friendly notes plus suggestion blocks you can apply directly.
The biggest fix is the templates/ mechanism. Right now your script: block runs python3 ${moduleDir}/templates/cluster_metrics.py, which is just running a normal Python script that happens to live in templates/. No templating is happening. This works on your laptop because ${moduleDir} resolves to a local path, but it breaks on cloud/HPC executors because that folder is not staged into the work directory the task runs in.
Nextflow's template directive does two things:
- Stages the file into the task work directory automatically (so it works on any executor).
- Runs the file through Groovy GString interpolation before staging, so
`${features}`, `${task.ext.prefix}`, etc. get substituted into the Python source. This replaces `argparse` entirely.
Because the file goes through Groovy interpolation, a few Python source patterns need adjusting (\n -> \\n, r"\s+" -> r"\\s+", etc., I verified this locally). I have flagged the specific spots inline. See modules/nf-core/custom/tx2gene/templates/tx2gene.py for the canonical shape.
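The escaping requirement can be illustrated without Nextflow at all. In the sketch below, Python's `unicode_escape` codec stands in for the escape processing a Groovy-interpolated template applies during rendering (an analogy for illustration, not Nextflow's actual engine):

```python
import codecs

# What the template file literally contains: a doubled backslash, so that
# the staged script ends up with a single \t for Python to interpret.
template_text = r'fields = line.split("\\t")'

# Stand-in for the escape processing applied while rendering the template:
staged_text = codecs.decode(template_text, "unicode_escape")

print(staged_text)  # fields = line.split("\t")
```

Had the template contained a single `\t`, the staged script would have received a literal tab character instead of the two-character escape Python expects to see in its source.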
Two scope questions for discussion:
- The `*_selected.json` "best k" output bakes a decision into the module, and the sweep refits KMeans regardless of what method produced the input labels. Could the module just emit the k-sweep table and let the consuming pipeline pick?
- `cluster_viz` does PCA + UMAP + t-SNE + plotting + TSV export, and the PCA piece is essentially a re-plot of upstream PLINK eigenvec output. Consider dropping PCA plotting from this module and tightening to the embedding methods that genuinely belong together.
The clustervisualiation directory name looks like a typo for clustervisualization; worth fixing while you're renaming things.
Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
…iz.py Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
…R to fix numba caching in Singularity
… imports, and KMeans n_init stable
…d test assertions
Hi @pinin4fjords, I'm stuck on a snapshot mismatch issue and would appreciate some guidance. The tests for CUSTOM_CLUSTERMETRICS and CUSTOM_CLUSTERVISUALIZATION are failing because the output files (test_metrics.tsv, test_k_sweep.csv, test_selected.json) produce different MD5 hashes in CI compared to my local environment.

The root cause seems to be that my local snapshots were generated using the Wave container, while CI runs the tests using conda (as shown in the logs: Creating env using conda: .../environment.yml). The two environments resolve slightly different package versions, leading to numerically different outputs from scikit-learn/pandas. Thanks in advance! 🙏
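This is the classic md5-snapshot trap: the hash gives no partial credit, so even last-decimal numeric drift between environments reads as a total mismatch. A minimal stdlib illustration (the metric values are made up):

```python
import hashlib

# Two metric lines differing only in the final decimal place
a = hashlib.md5(b"silhouette\t0.5230000001\n").hexdigest()
b = hashlib.md5(b"silhouette\t0.5230000002\n").hexdigest()

print(a == b)  # False: the snapshot sees a full mismatch, not a tiny delta
```

The usual ways out are pinning the environment so both sides resolve identical package versions, or loosening the assertion (rounding, or asserting on structure rather than the raw hash).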
Templates:
- clustering.py: replace inline `${n_clusters}` / `${dbscan_eps}` /
`${dbscan_min_samples}` (which fail ruff and don't parse as Python source)
with locals assigned at the top of main(); replace yaml.dump with the
format_yaml_like helper used by the other two templates so pyyaml is no
longer needed in the env.
- cluster_metrics.py and cluster_viz.py: drop the PLINK-aware
_normalise_id_column / multi-mode load_clusters glue; require sample_id
+ cluster on the documented inputs. Drop the silent try/except plot
warning. Drop the cluster_mode / alignment_mode branching from main()
along with the redundant input_clusters / input_features / n_samples_used
/ alignment_mode metadata. Drop the argparse + sys.argv wrapper pattern
in cluster_metrics.py in favour of the direct template-substitution main()
that cluster_viz.py already uses.
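The locals-at-top pattern in that commit can be sketched without Nextflow: keep every `${...}` placeholder inside a string literal so the raw template parses as valid Python for ruff. Here `str.replace` is a stand-in for the engine's substitution at task launch, and the variable names are illustrative, not necessarily the module's real ones:

```python
# Raw template head: placeholders live inside string literals, so this
# text is syntactically valid Python before substitution ever happens.
TEMPLATE_HEAD = '''
n_clusters = int("${n_clusters}")
prefix = "${task.ext.prefix}"
'''

# Stand-in for the template engine substituting values at task launch:
rendered = (
    TEMPLATE_HEAD
    .replace("${n_clusters}", "3")
    .replace("${task.ext.prefix}", "sample1")
)

ns = {}
exec(rendered, ns)
print(ns["n_clusters"], ns["prefix"])  # 3 sample1
```

Bare `${n_clusters}` outside a string, by contrast, is a syntax error to any Python linter, which is exactly the ruff failure the commit describes.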
Environments:
- Strip the misleading `# clustermetrics/environment.yml` headers; prune
per-module deps so the env declares what the script actually imports
(clustering: numpy/pandas/python/scikit-learn; clustermetrics: + matplotlib;
clustervisualization: + matplotlib/seaborn/umap-learn). Pin to the versions
Wave resolved unpinned: numpy 2.4.4, pandas 3.0.3, python 3.12.13,
scikit-learn 1.8.0, matplotlib 3.10.9, seaborn 0.13.2, umap-learn 0.5.12.
Containers:
- Rebuild Wave containers per module and replace the shared `_pruned:0...`
URL (which didn't actually contain the packages the env declared) with
the matching freshly-built Docker tag. Switch the singularity branch from
`docker://...` to the proper https blob URL per nf-core convention.
Tests:
- Drop the orphan `test_pca.eigenvec` test data file from
clustervisualization now that the PCA piece is no longer in the script.
- Regenerate snapshots (run on AWS x86 Linux via docker profile).
- Pre-commit autofix (prettier, end-of-file-fixer, ruff-format).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds pyyaml to all three environment.yml files and replaces the embedded format_yaml_like helper in clustering.py / cluster_metrics.py / cluster_viz.py with a one-line `yaml.dump(versions, fh, default_flow_style=False, sort_keys=False)`.

format_yaml_like dates from before Seqera Containers / Wave made it cheap to add deps. Now that we control the env precisely, pyyaml is a clean swap and the templates are ~10 lines shorter each.

Rebuilt Wave Docker + Singularity containers per module with pyyaml=6.0.3 pinned (matching what Wave resolved unpinned) and updated the container URLs. Regenerated snapshots on the AWS x86 Linux VM with --profile docker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
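For reference, the swapped-in write is just PyYAML's standard dump; with a hypothetical payload it produces the block-style mapping expected in a versions.yml (assumes `pyyaml` is installed):

```python
import sys

import yaml

# Hypothetical versions payload a template might collect
versions = {"CUSTOM_CLUSTERING": {"python": "3.12.13", "scikit-learn": "1.8.0"}}

# default_flow_style=False forces block style; sort_keys=False keeps the
# insertion order instead of alphabetising the package names
yaml.dump(versions, sys.stdout, default_flow_style=False, sort_keys=False)
```

This prints a nested block mapping (`CUSTOM_CLUSTERING:` with indented `python:` and `scikit-learn:` entries), matching what the hand-rolled format_yaml_like helper emitted.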
Two CPU/container-environment portability issues surfaced on CI that didn't appear on the AWS x86 VM I used to regenerate snapshots:

1. CUSTOM_CLUSTERING `*_clustering_info.json` md5 differed between CI and the snapshot. KMeans inertia is a sum of squared distances, and BLAS reduction order varies by a few ULPs across CPU instruction sets even with a fixed `random_state`. Keep full precision in the production output and round the value in the test snapshot only, matching the `Math.round(... / 1000) * 1000` pattern used in `custom/basicpy`.

2. CUSTOM_CLUSTERVISUALIZATION stub `versions.yml` had `umap-learn: null` under singularity. The stub heredoc ran `python3 -c "import umap"`, which triggers numba's JIT cache write to a path that is read-only inside the singularity image, so the import silently failed and the bash substitution expanded to an empty string. Switched the stub version checks to `importlib.metadata.version(<pkg>)` for every package so they read dist-info without importing the module. Same answer, no import side effects.

Regenerated the clustering snapshot on the VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
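The side-effect-free lookup is worth spelling out, since it is the crux of the singularity stub fix: `importlib.metadata.version` reads the installed distribution's metadata from disk and never imports the package, so there is nothing for numba (or any other import hook) to do at lookup time. In this sketch `pip` is used only as a package that is usually present, and the helper name is illustrative:

```python
from importlib.metadata import PackageNotFoundError, version


def pkg_version(name: str) -> str:
    # dist-info lookup only: no module import, hence no JIT-cache writes
    # or other import-time side effects inside a read-only container
    try:
        return version(name)
    except PackageNotFoundError:
        return "unknown"


print(pkg_version("pip"))
print(pkg_version("definitely-not-installed-xyz"))  # unknown
```

A `python3 -c "import umap"` probe, by contrast, succeeds or fails based on everything the import executes, which is exactly the fragility the stub fix removes.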
Reuse commodity functions and reduce hand-rolled glue:
- clustering.py: replace the 45-line manual eigenvec line parser with
`pd.read_csv(sep=r"\\s+")` + a small header check. Same accepted
layouts (FID/IID and IID-only, with the leading '#' on the header
line); drops the no-header inference path PLINK2 doesn't actually
emit.
- clustering.py: `len(set(labels) - {-1})` instead of the
`len(set(...)) - (1 if -1 in labels else 0)` puzzle; drop redundant
`int()` wraps around `len()`; replace `Path(...).write_text(json.dumps())`
with `json.dump` + `open()` to match the versions.yml write style.
- cluster_metrics.py / cluster_viz.py: replace the positional-index
alignment dance with `load_features().join(load_clusters(), how="inner")`.
Both helpers now return DataFrame / Series indexed by sample_id, so
pandas does the join. Drops ~10 lines of indexing logic per module
and removes a subtle reliance on `common.index` preserving positional
identity.
- cluster_metrics.py: tighten `cluster_quality()` to one expression and
drop the early-return + shared dict pattern. Returns just the three
scores; `n_clusters` is set at the call site, so the k-sweep no longer
needs to filter the result down before merging.
- cluster_metrics.py / cluster_viz.py: same json.dump cleanup.
Regenerated snapshots on the AWS x86 VM; all six tests pass under
`--profile docker`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
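The refactors in that commit mostly amount to letting pandas do the work. A toy version with made-up sample IDs and a two-sample inlined eigenvec (assumes pandas is installed; the `#IID` header follows PLINK2's convention):

```python
import io

import pandas as pd

# Inline stand-in for a PLINK2 .eigenvec file with a '#IID' header line
eigenvec = io.StringIO("#IID PC1 PC2\nS1 0.1 0.2\nS2 0.3 0.4\n")
features = (
    pd.read_csv(eigenvec, sep=r"\s+")
    .rename(columns={"#IID": "sample_id"})
    .set_index("sample_id")
)

# Cluster labels arrive in a different order, with an extra sample S3
clusters = pd.Series(
    [1, 0, 1], index=pd.Index(["S2", "S1", "S3"], name="sample_id"), name="cluster"
)

# Index-aligned inner join replaces the positional-index alignment;
# S3 has no features and simply drops out
common = features.join(clusters, how="inner")
print(sorted(common.index))              # ['S1', 'S2']
print(int(common.loc["S1", "cluster"]))  # 0

# DBSCAN noise convention: label -1 is noise, not a cluster
labels = [0, 0, 1, -1]
print(len(set(labels) - {-1}))  # 2
```

Because alignment happens on the shared `sample_id` index rather than on row position, reordered or partially overlapping inputs cannot silently pair the wrong rows.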
The .gitignore picked up entries for local snpclustering subworkflow scaffolding and a stray `modules/nf-core/clustering/` ignore. None of those should land in nf-core/modules master; restoring the file to match origin/master. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- clustering tests: add a DBSCAN scenario alongside the existing kmeans one. The DBSCAN branch of clustering.py was previously untested.
- clustermetrics + clustervisualization meta.yml: tighten the `features` and `clusters` input descriptions and patterns. Both modules require a TSV with a `sample_id` column and numeric features, and a CSV with `sample_id` + integer `cluster` columns; the old `"Feature matrix file"` / `pattern: "*"` did not advertise the schema or noise-label convention.
- All three modules' stub `versions.yml` blocks: switch every package lookup to `importlib.metadata.version(...)`. clustervisualization already did this for umap-learn to dodge numba's read-only cache issue under singularity; using it everywhere is more robust against similar import-side-effect surprises and keeps the three stubs consistent.

All 7 nf-tests pass under `--profile docker` on the AWS x86 VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nf-core spec (modules/general.md): "All non-mandatory command-line tool
non-file arguments MUST be provided as a string via the $task.ext.args
variable." Adds argparse-on-task.ext.args plumbing in two template
scripts so the previously-hardcoded tuning parameters can be overridden
via modules.config:
- cluster_metrics.py: --k-min (default 2) and --k-max (default 12) for
the KMeans sweep range.
- cluster_viz.py: --umap-neighbors (default 15) and --tsne-perplexity
(default 30) for the embedding parameters. Both are still clamped
against sample count so tiny inputs work.
The `task.ext.args` substitution is parsed with shlex+argparse from
inside the template (matrixfilter does the same with R's parse_args),
which keeps required inputs as `val` channels (per
input-output-options.md) and optional ones via `ext.args` (per
general.md). `clustering` has no optional non-file args; its required
algorithm/n_clusters/dbscan_eps/dbscan_min_samples remain `val` channel
inputs.
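The template-side parsing described above can be mimicked outside Nextflow by substituting a literal string for `$task.ext.args`; the option names match the commit, while the value stands in for a hypothetical `modules.config` setting:

```python
import argparse
import shlex

# In the real template this literal is "$task.ext.args" before rendering;
# here we inline a hypothetical ext.args value from modules.config.
ext_args = "--k-min 3 --k-max 8"

parser = argparse.ArgumentParser()
parser.add_argument("--k-min", type=int, default=2)
parser.add_argument("--k-max", type=int, default=12)

# shlex.split handles quoting the way a shell would; an empty string
# yields an empty list, so the defaults apply when ext.args is unset
opts = parser.parse_args(shlex.split(ext_args))
print(opts.k_min, opts.k_max)  # 3 8
```

Leaving `ext.args` unset exercises the defaults, which is what keeps these parameters optional per the spec.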
Output filename style also unified to `${prefix}.<descriptor>.<ext>`
(dot-separated) across all three modules, matching the spec example
(`${prefix}.fq.gz`) and the existing dot-separated pattern in
`clustervisualization` / `custom/tx2gene` / `custom/matrixfilter`:
- clustering: `*_clusters.csv` -> `*.clusters.csv`,
`*_clustering_info.json` -> `*.clustering_info.json`.
- clustermetrics: `*_metrics.tsv` -> `*.metrics.tsv`,
`*_k_sweep.csv` -> `*.k_sweep.csv`,
`*_selected.json` -> `*.selected.json`, and plot files from
`*_<name>.png` -> `*.<name>.png`. Also renamed
`_calinski.png` -> `.calinski_harabasz.png` to match the other metric
files (was the only truncated one).
Updated meta.yml output patterns and stub touches to match. Regenerated
snapshots on the AWS x86 VM with `--profile docker`; all 7 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords
left a comment
Hi @dbaku42, sorry for the volume of direct commits on your branch (AI-assisted by Claude on my behalf) - given the back-and-forth pattern of the last few rounds it felt like the fastest way to land the modules in a clean, spec-aligned state. Please review and push back on anything that doesn't sit right; happy to revert specific bits if you disagree with a choice. Highlights of what changed and why:
- Containers rebuilt on Seqera Containers via Wave to match the actual `environment.yml` for each module. nf-core uses Biocontainers or Seqera Containers; for multi-package Python envs like these, Seqera Containers via the `wave` CLI is the right call. Singularity branch is the `https://community-cr-prod.seqera.io/.../data` blob URL (not `docker://`).
- `${...}` substitutions in `clustering.py`: ruff lints the source as Python, so `KMeans(n_clusters=${n_clusters}, ...)` fails. Moved substitutions to locals at the top of `main()` and use the locals from then on.
- `versions.yml` writes via `yaml.dump` (with `pyyaml` added to the envs), since adding deps is cheap now that we're on Wave. Stub heredocs use `importlib.metadata.version(...)` to avoid import side effects (`import umap` triggers numba JIT cache writes that fail under read-only singularity).
- Test portability: KMeans inertia wobbles a few ULPs across CPUs (BLAS reduction order), so the clustering test rounds it in the assertion only. Added a DBSCAN test case so both algorithm branches are exercised.
- `ext.args` plumbing in `cluster_metrics.py` (`--k-min`, `--k-max`) and `cluster_viz.py` (`--umap-neighbors`, `--tsne-perplexity`), parsed via `argparse` + `shlex.split("$task.ext.args")` inside the template - matches the spec that optional non-file args go via `ext.args`. Same approach as `custom/matrixfilter` uses for R.
- Output filenames unified to `${prefix}.<descriptor>.<ext>` across all three modules (was mixed dot/underscore), matching the spec example and the existing `custom/tx2gene` / `custom/matrixfilter` pattern.
- Simplifications: removed the PLINK-aware `_normalise_id_column` / multi-mode `load_clusters` from `cluster_metrics.py` and `cluster_viz.py` (both already require `sample_id` + `cluster` and the `clustering` module emits exactly that). Replaced the positional-index alignment with `load_features().join(load_clusters(), how="inner")`. Replaced the 45-line manual eigenvec parser with `pd.read_csv`. Reverted a stray `.gitignore` change unrelated to this PR.
One blocker before this can land: tests/data/ files can't live here. nf-core stores all module test data in nf-core/test-datasets on the modules branch, referenced via params.modules_testdata_base_path. See custom/basicpy for an example. To unblock:
- PR your five test files to the `modules` branch of `nf-core/test-datasets` under something like `genomics/popgen/clustering/`.
- Once it merges, update each `tests/main.nf.test` to use `params.modules_testdata_base_path + ...` and delete the `tests/data/` directories.
- Re-record snapshots and push.
Marking as Changes Requested for that, and resolving the previous threads since they're addressed by the recent commits. Apologies again for the heavy hand - shout if anything needs reverting.
Local workflow to catch the easy stuff before pushing in future:
prek run --all-files
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustering/tests/main.nf.test
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustermetrics/tests/main.nf.test
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustervisualization/tests/main.nf.test
Hi @pinin4fjords, thank you so much for all the work you did directly on the branch (and for the super clear explanations)! Regarding the remaining blocker:
I'll let you know as soon as the test-datasets PR is opened (and I'll tag you). Thank you again for the huge help ❤️ @nf-core/modules
Hi @pinin4fjords, I've just opened the PR for the test data on nf-core/test-datasets. As soon as it's merged I'll update the two … Thanks again for all the help!
Description

This PR adds two new modules:

- `cluster_metrics`: Computes standard clustering quality metrics (Silhouette score, Calinski-Harabasz index, Davies-Bouldin index) and performs a k-sweep analysis.
- `cluster_viz`: Generates 2D visualizations (PCA, UMAP, t-SNE) colored by cluster label and exports the coordinates as TSV files.

Both modules are designed to work together with the existing `clustering` module (which performs KMeans/DBSCAN on PLINK2 PCA results).

Features

- `conda` via `environment.yml`
- `nf-test` (normal + stub tests)

Author

Related modules

- `clustering`

This completes the core components of the `snpclustering` subworkflow.