The Database Creation Workflow (DBCW) is a reproducible Snakemake-based pipeline for building the reference databases required for taxonomic and alignment-based metagenomic analyses. The workflow was developed as a modular subworkflow of the MMCAW pipeline but can also be executed independently.
DBCW automates the retrieval, construction, and organisation of databases required for downstream workflows, ensuring consistent and reproducible database generation across analyses.
DBCW supports the following components:
- Retrieval of reference sequences and taxonomy data
- Construction of Kraken2 taxonomic classification databases
- Construction of CAT contig annotation databases
- Construction of BLAST nucleotide (NT) databases
- Preparation of the human reference genome (GRCh38.p14)
- Version tracking and checksum recording for reproducibility
- Standardised output structure for downstream workflow compatibility
All steps are controlled via the configuration file and can be enabled or disabled as required.
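As a rough illustration of how per-step toggles might translate into build targets, the sketch below maps a configuration dictionary to a list of output paths, much as the input of a Snakemake `rule all` is often assembled. The key names (`build_kraken2`, `build_cat`, `build_blast_nt`, `prepare_human_reference`), the `output_dir` key, and the marker paths are hypothetical and are not taken from the DBCW configuration file; see config/README.md for the real options.

```python
from pathlib import Path

# Hypothetical toggle names mapped to hypothetical marker files; the real
# DBCW configuration keys and output paths may differ.
STEP_TARGETS = {
    "build_kraken2": "kraken2/.done",
    "build_cat": "cat/.done",
    "build_blast_nt": "blast_nt/.done",
    "prepare_human_reference": "human_reference/.done",
}


def select_targets(config: dict) -> list[str]:
    """Return the marker file of every step that is enabled in the config."""
    output_dir = Path(config.get("output_dir", "resources"))
    return [
        (output_dir / marker).as_posix()
        for step, marker in STEP_TARGETS.items()
        if config.get(step, False)  # steps missing from the config stay disabled
    ]


if __name__ == "__main__":
    example_config = {
        "output_dir": "resources/databases",
        "build_kraken2": True,
        "build_cat": False,
        "build_blast_nt": True,
    }
    for target in select_targets(example_config):
        print(target)
```

Treating a missing key as disabled is one possible design choice: a step then runs only when it is switched on explicitly, which avoids accidentally starting a large download.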
Software requirements:
- Snakemake (version aligned with parent workflow)
- Conda (Python 3.10.x)
Software dependencies are managed via Conda environments defined in: workflow/envs/
Clone the repository:
git clone https://github.com/merfre/Database_Creation_Workflow
cd Database_Creation_Workflow
Create and activate the environment:
conda env create -f workflow/envs/<environment_file>.yaml
conda activate db_creation
Perform a dry run to check the workflow:
snakemake -n
Run the workflow using multiple cores:
snakemake --cores 10
Execute the workflow locally:
snakemake --printshellcmds --use-conda --cores 10
After successful execution, you can create a self-contained interactive HTML report with all results:
snakemake --report dbcw_final_report.html
MMCAW was developed and tested on the University of Hull’s Viper HPC. If using a cluster, configure and run with an appropriate Snakemake profile (e.g. SLURM or PBS):
snakemake --profile <your-cluster-profile>
DBCW can be triggered automatically within MMCAW by enabling database creation in the MMCAW configuration file. Alternatively, databases can be pre-built using DBCW and referenced directly in downstream workflows.
DBCW requires:
- Access to reference sequence repositories (e.g. NCBI)
- Configuration file specifying database parameters and output locations
DBCW generates:
- Kraken2 database
- CAT database (reference + taxonomy)
- BLAST nucleotide (NT) database
- Human reference genome (GRCh38.p14)
- Logs, version information, and checksums for reproducibility
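The source does not state how DBCW computes its checksums, so the following is only a minimal sketch of one common approach: hashing every file in a database directory with SHA-256 and writing a manifest. The function names and the manifest layout are assumptions made for illustration; the `<checksum>  <path>` line format was chosen because `sha256sum -c` can read it.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large database files are never read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(database_dir: Path, manifest_path: Path) -> dict[str, str]:
    """Write one '<checksum>  <relative path>' line per file under database_dir."""
    checksums = {
        file.relative_to(database_dir).as_posix(): sha256_of_file(file)
        for file in sorted(database_dir.rglob("*"))
        # Skip a manifest left over from an earlier run inside the same directory.
        if file.is_file() and file.resolve() != manifest_path.resolve()
    }
    with manifest_path.open("w") as out:
        for relative_path, checksum in checksums.items():
            out.write(f"{checksum}  {relative_path}\n")
    return checksums
```

Comparing a freshly written manifest with one stored alongside an earlier build is a simple way to confirm that two database builds are byte-identical.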
.
├── workflow/
│   ├── Snakefile
│   ├── rules/
│   ├── envs/
│   └── config/
├── resources/
└── config/
    └── config.yaml
Detailed descriptions of resources and configuration are provided in:
- resources/README.md
- config/README.md
- Workflow implemented in Snakemake v7.22.0
- All software dependencies are managed via Conda
- Snakemake’s built-in benchmarking is enabled by default to record:
  - Rule-level runtime
  - CPU and memory usage
  - Resource performance across datasets of varying size and complexity
This supports systematic evaluation of workflow efficiency and scalability.
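To show how such benchmark records could be compared across rules, here is a small sketch that summarises a directory of benchmark files. It assumes the tab-separated layout written by Snakemake's `benchmark` directive, with a header row that includes `s` (wall-clock seconds) and `max_rss` (peak resident memory, reported by Snakemake in MB). The directory layout, function names, and summary keys are hypothetical.

```python
from __future__ import annotations

import csv
from pathlib import Path


def _to_float(value: str | None) -> float | None:
    """Convert a benchmark field to float; return None for placeholders such as '-'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarise_benchmarks(benchmark_dir: Path) -> dict[str, dict[str, float | None]]:
    """Return mean runtime and peak memory for every *.tsv file in benchmark_dir."""
    summary = {}
    for tsv in sorted(benchmark_dir.glob("*.tsv")):
        with tsv.open(newline="") as handle:
            rows = list(csv.DictReader(handle, delimiter="\t"))
        runtimes = [v for row in rows if (v := _to_float(row.get("s"))) is not None]
        peaks = [v for row in rows if (v := _to_float(row.get("max_rss"))) is not None]
        summary[tsv.stem] = {
            "mean_runtime_s": sum(runtimes) / len(runtimes) if runtimes else None,
            "peak_max_rss": max(peaks) if peaks else None,
        }
    return summary
```

Calling `summarise_benchmarks(Path("benchmarks"))` would return one entry per benchmark file, keyed by file name without the extension; the `benchmarks` directory name is an assumption and should be replaced with wherever the workflow writes its benchmark output.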
Where feasible, raw sequencing data and associated bioinformatic workflows have been archived:
- Zenodo: doi:10.5281/zenodo.17753185
If you use MMCAW in your work, please cite:
Merideth Naomi Freiheit (2025). Development of Reproducible Metagenomic Approaches for Skin and Wound Microbiome Analysis. University of Hull.