A Snakemake workflow for reducing high-dimensional feature tables into module-level summaries with statistical testing and optional biological annotation.
The workflow is built using Snakemake and consists of the following steps:
- Input combination (optional) — Merges multiple feature tables via row-bind (same features, different samples) or column-bind (same samples, different features) before downstream processing.
- Feature normality testing — Tests each feature for normality (Shapiro-Wilk) and records results as an annotation source for downstream summary selection.
- Feature preprocessing and imputation — Filters features by missingness and unique-value thresholds, optionally imputes missing values, optionally benchmarks imputation methods, and optionally applies distribution-shift filtering.
- Module reduction — Reduces preprocessed features into hierarchical-clustering modules using a configured discovery subset, with optional repeated subsampling when imputation is applied.
- Module annotation (optional) — Annotates gene-like modules with GO:BP pathway labels via gprofiler2, or names modules from top-loading members using an LLM.
- Module set enrichment (optional) — Tests module members for enrichment in configured GMT gene sets.
- Univariate testing (modules) — Fits per-module models (e.g. lmer, rank-transform) and stores effect sizes, p-values, and model-assumption results.
- Univariate testing (original features) — Applies the same tests to original/member features, dropping NA samples per feature and requiring a minimum group size.
- Result summarisation — Converts method-specific result matrices into wide summary CSVs, selecting parametric or non-parametric fields based on model assumptions met.
- Visualisation — Plots module-level summaries and original feature distributions with group labels and optional module annotations.
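The missingness and unique-value filter in the preprocessing step can be sketched as follows. This is a minimal illustration, not the workflow's own code: the function name `filter_features`, the threshold names `max_missing_frac` and `min_unique`, and their default values are assumptions made for this example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def filter_features(table, max_missing_frac=0.2, min_unique=3):
    """Keep features that pass both the missingness and unique-value thresholds.

    `table` maps a feature name to its list of per-sample values.
    Threshold names and defaults are hypothetical, not the workflow's config keys.
    """
    kept = {}
    for name, values in table.items():
        observed = [v for v in values if not is_missing(v)]
        missing_frac = 1 - len(observed) / len(values) if values else 1.0
        if missing_frac <= max_missing_frac and len(set(observed)) >= min_unique:
            kept[name] = values
    return kept


features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [1.0, None, None, 4.0, 5.0],  # 40% missing, above the threshold
    "f3": [0.0, 0.0, 0.0, 1.0, 1.0],    # only two distinct values
}
print(sorted(filter_features(features)))  # ['f1']
```

Filtering before imputation keeps near-constant or mostly empty features from influencing the imputed values and the later clustering.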
The workflow supports three input modes: single (one feature table), batch (multiple dataset directories discovered automatically), and combine (tables merged before processing).
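For combine mode, the difference between the two merge strategies (row-bind and column-bind) can be illustrated with a small sketch. The table representation (a dict keyed by `sample_key`) and the strict error handling on mismatches are assumptions for this example; the workflow's actual merge behaviour may differ.

```python
def row_bind(tables):
    """Stack samples from tables that share the same feature columns."""
    combined = {}
    expected = None
    for table in tables:
        for sample_key, row in table.items():
            if expected is None:
                expected = set(row)
            if set(row) != expected:
                raise ValueError(f"feature mismatch for sample {sample_key!r}")
            if sample_key in combined:
                raise ValueError(f"duplicate sample {sample_key!r}")
            combined[sample_key] = dict(row)
    return combined


def column_bind(tables):
    """Join feature columns from tables that share the same samples.

    Expects at least one table.
    """
    keys = set(tables[0])
    combined = {key: {} for key in tables[0]}
    for table in tables:
        if set(table) != keys:
            raise ValueError("sample mismatch between tables")
        for sample_key, row in table.items():
            overlap = combined[sample_key].keys() & row.keys()
            if overlap:
                raise ValueError(f"duplicate features: {sorted(overlap)}")
            combined[sample_key].update(row)
    return combined


a = {"s1": {"x": 1.0, "y": 2.0}}
b = {"s2": {"x": 3.0, "y": 4.0}}  # same features, different sample
c = {"s1": {"z": 9.0}}            # same sample, different feature

print(row_bind([a, b]))     # two samples, features x and y
print(column_bind([a, c]))  # one sample, features x, y and z
```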
Detailed information about input data and workflow configuration can be found in the config/README.md.
| Input | Description |
|---|---|
| Feature table | CSV with sample_key as the first column; all subsequent numeric columns are treated as features |
| Sample metadata | CSV with sample_key column plus grouping, individual ID, batch, and covariate columns |
| Dataset directories (batch mode) | Subdirectories under input_root, each containing bulk_x_features.csv and bulk_metadata.csv |
| GMT file (optional) | Gene set file for module set enrichment |
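As a rough illustration of batch mode, dataset discovery could look like the sketch below: it lists the subdirectories of `input_root` that contain both required files. The function name `discover_datasets` is hypothetical, and the rule of skipping incomplete directories is an assumption; the workflow may instead raise an error for them.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("bulk_x_features.csv", "bulk_metadata.csv")


def discover_datasets(input_root):
    """Return sorted names of immediate subdirectories holding both required files."""
    root = Path(input_root)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir()
        and all((child / name).is_file() for name in REQUIRED_FILES)
    )


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "dataset_a"
    complete.mkdir()
    for name in REQUIRED_FILES:
        (complete / name).write_text("sample_key\n")

    incomplete = Path(tmp) / "dataset_b"
    incomplete.mkdir()
    (incomplete / "bulk_x_features.csv").write_text("sample_key\n")  # metadata missing

    print(discover_datasets(tmp))  # ['dataset_a']
```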
All outputs are written to <results_dir>/ as configured in config/config.yaml.
| Directory | Key output files |
|---|---|
| preprocess_impute/{dataset_id}/ | bulk_x_features.csv, benchmark.rds, preprocess_summary.csv |
| evaluate_preprocessing/{error_metric}/ | error_distribution.pdf, error_threshold*.pdf, feature_retention_summary*.csv/pdf |
| modules_hc/{dataset_id}/{subset_name}/ | bulk_x_features.csv, module_eigengenes_hc.rds, module_members/ |
| meaning_module_genes/{dataset_id}/{subset_name}/ | bulk_x_features.csv, module_annotations.csv, modules/ |
| meaning_module_llm/{dataset_id}/{subset_name}/ | bulk_x_features.csv, module_annotations.csv, modules/ |
| univariate_test/{dataset_id}/{method}/ | result_matrices.rds, failed_features.csv |
| summarise_tests/{dataset_id}/ | summary_df.csv |
| summarise_tests/ | summary_df.csv (aggregate across all datasets) |
| univariate_test_members/{dataset_id}/{method}/ | result_matrices.rds, failed_features.csv |
| summarise_tests_members/{dataset_id}/ | summary_df.csv |
| summarise_tests_members/ | summary_df.csv (aggregate across all datasets) |
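The summary files are built by choosing parametric or non-parametric fields depending on whether model assumptions were met. A minimal sketch of that selection is shown here. All field names (`assumptions_met`, `parametric_effect`, `nonparametric_p`, and so on) are hypothetical placeholders, not the actual columns of summary_df.csv or the contents of result_matrices.rds.

```python
def select_summary(result):
    """Pick the parametric or non-parametric fields for one tested feature or module.

    `result` is a dict with hypothetical keys; real column names may differ.
    """
    prefix = "parametric" if result.get("assumptions_met") else "nonparametric"
    return {
        "feature": result["feature"],
        "test_used": prefix,
        "effect_size": result[f"{prefix}_effect"],
        "p_value": result[f"{prefix}_p"],
    }


example = {
    "feature": "module_1",
    "assumptions_met": False,
    "parametric_effect": 0.52,
    "parametric_p": 0.010,
    "nonparametric_effect": 0.47,
    "nonparametric_p": 0.030,
}
print(select_summary(example))
```

Recording which test was used alongside each row makes it possible to audit the selection later.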
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, please cite the repository URL or its DOI and the tools listed in the References section.
Change to the workflow directory and adjust options in config/config.yaml.

    cd path/to/feature-flow

Perform a dry run to check the workflow before execution:

    snakemake --dry-run

Run with test files using conda:

    snakemake --cores 2 --sdm conda --directory .test

Run with apptainer / singularity:

    snakemake --cores 2 --sdm conda apptainer --directory .test

Run on an HPC cluster via SLURM (recommended for production). First activate the environment:

    bash envs/feature-module-reduction-env.sh
    load_mamba
    mamba activate feature-module-reduction-env
    pip install snakemake-executor-plugin-slurm

Then submit using the provided template script:

    bash slurm_submit_template.sh

For runs that use LLM module naming, source OpenAI credentials before submitting:

    source ~/.config/openai/env.sh

The profiles/ directory can contain any number of workflow-specific profiles that users can choose from.
The profiles README.md provides more details.
- Liezel Tamon
- University of Oxford
- ORCID profile
Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. https://doi.org/10.12688/f1000research.29032.2
Bashford-Rogers Lab. vdjremix. https://github.com/Bashford-Rogers-lab/vdjremix