DBCW — Database Creation Workflow

Overview

The Database Creation Workflow (DBCW) is a reproducible Snakemake-based pipeline for building the reference databases required for taxonomic and alignment-based metagenomic analyses. The workflow was developed as a modular subworkflow of the MMCAW pipeline but can also be executed independently.

DBCW automates the retrieval, construction, and organisation of databases required for downstream workflows, ensuring consistent and reproducible database generation across analyses.

Scope of the workflow

DBCW supports the following components:

Retrieval of reference sequences and taxonomy data
Construction of Kraken2 taxonomic classification databases
Construction of CAT contig annotation databases
Construction of BLAST nucleotide (NT) databases
Preparation of the human reference genome (GRCh38.p14)
Version tracking and checksum recording for reproducibility
Standardised output structure for downstream workflow compatibility

All steps are controlled via the configuration file and can be enabled or disabled as required.
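As a minimal sketch of how such switches could drive the workflow, the snippet below maps boolean configuration flags to the target files a run would request. The flag names and output paths are hypothetical and chosen only for illustration; the actual keys are documented in config/README.md.

```python
# Hypothetical mapping of config flags to target files.
# Neither the flag names nor the paths are DBCW's actual ones.
STEP_TARGETS = {
    "build_kraken2": "results/kraken2/hash.k2d",
    "build_cat": "results/cat/.done",
    "build_blast_nt": "results/blast_nt/.done",
    "prepare_human_reference": "results/human/GRCh38.p14.fna.gz",
}


def requested_targets(config, step_targets=STEP_TARGETS):
    """Return the targets of every step whose flag is enabled in `config`.

    Steps missing from the config are treated as disabled.
    """
    return [
        target
        for step, target in step_targets.items()
        if config.get(step, False)
    ]


if __name__ == "__main__":
    example_config = {"build_kraken2": True, "build_cat": False}
    for target in requested_targets(example_config):
        print(target)
```

In a Snakefile, a list like this would typically be passed as the input of the top-level rule, so that disabled steps are never scheduled.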

Requirements

Software

Snakemake (version aligned with parent workflow)
Conda (Python 3.10.x)

Environment management

Software dependencies are managed via Conda environments defined in: workflow/envs/

Installation

Clone the repository:

git clone https://github.com/merfre/Database_Creation_Workflow
cd Database_Creation_Workflow

Create and activate the environment:

conda env create -f workflow/envs/<environment_file>.yaml
conda activate db_creation

Usage

Basic run (local)

Perform a dry run to check the workflow without executing any rules:

snakemake -n

Run the workflow using multiple cores:

snakemake --cores 10

Execute the workflow locally, creating the Conda environments for each rule and printing every shell command as it runs:

snakemake --printshellcmds --use-conda --cores 10

After successful execution, you can create a self-contained interactive HTML report with all results:

snakemake --report dbcw_final_report.html

Running on an HPC / cluster

MMCAW was developed and tested on the University of Hull’s Viper HPC. If you are using a cluster, configure a Snakemake profile appropriate to its scheduler (e.g. SLURM or PBS) and run:

snakemake --profile <your-cluster-profile>

Integration with MMCAW

DBCW can be triggered automatically within MMCAW by enabling database creation in the MMCAW configuration file. Alternatively, databases can be pre-built using DBCW and referenced directly in downstream workflows.

Inputs (brief)

DBCW requires:

Access to reference sequence repositories (e.g. NCBI)
Configuration file specifying database parameters and output locations

Outputs (brief)

DBCW generates:

Kraken2 database
CAT database (reference + taxonomy)
BLAST nucleotide (NT) database
Human reference genome (GRCh38.p14)
Logs, version information, and checksums for reproducibility
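The checksum step can be illustrated with a short sketch. It is not DBCW's own implementation: the function names, the manifest file name, and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large database files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_manifest(database_dir, manifest_path):
    """Write one '<sha256>  <relative path>' line per file under database_dir.

    The manifest itself is skipped if it lives inside database_dir.
    Returns the manifest lines in sorted path order.
    """
    database_dir = Path(database_dir)
    manifest_path = Path(manifest_path)
    lines = []
    for file_path in sorted(p for p in database_dir.rglob("*") if p.is_file()):
        if file_path.resolve() == manifest_path.resolve():
            continue
        relative = file_path.relative_to(database_dir).as_posix()
        lines.append(f"{sha256_of_file(file_path)}  {relative}")
    manifest_path.write_text("\n".join(lines) + ("\n" if lines else ""))
    return lines
```

The two-space line format matches what `sha256sum -c` expects, so a manifest written this way can be verified later from within the database directory.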

Repository structure

.
├── workflow/
│   ├── Snakefile
│   ├── rules/
│   ├── envs/
│   └── config/
├── resources/
└── config/
    └── config.yaml

Detailed descriptions of resources and configuration are provided in:

resources/README.md
config/README.md

Reproducibility & benchmarking

Workflow implemented in Snakemake v7.22.0
All software dependencies are managed via Conda
Snakemake’s built-in benchmarking is enabled by default to record:
- Rule-level runtime
- CPU and memory usage
- Resource performance across datasets of varying size and complexity

This supports systematic evaluation of workflow efficiency and scalability.
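As a hedged sketch of how these records might be evaluated, the function below summarises the contents of a Snakemake benchmark file. It assumes the tab-separated layout written by Snakemake's `benchmark` directive, with a header row containing at least the columns `s` (wall-clock seconds) and `max_rss` (peak resident memory in MB), and it assumes every value in those columns is numeric. The function name and the returned keys are illustrative.

```python
import csv
import io


def summarise_benchmark(tsv_text):
    """Summarise a benchmark TSV given as text: run count, mean runtime, peak RSS."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        raise ValueError("benchmark file contains no measurements")
    seconds = [float(row["s"]) for row in rows]
    max_rss = [float(row["max_rss"]) for row in rows]
    return {
        "runs": len(rows),
        "mean_seconds": sum(seconds) / len(seconds),
        "peak_rss_mb": max(max_rss),
    }


if __name__ == "__main__":
    # Made-up values for illustration, not measurements from DBCW.
    sample = "s\th:m:s\tmax_rss\n120.5\t0:02:00\t2048.0\n119.5\t0:01:59\t2100.25\n"
    print(summarise_benchmark(sample))
```

Applying the same summary to the benchmark files of each rule gives a simple table for comparing runtime and memory across database builds.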

Data availability

Where feasible, raw sequencing data and associated bioinformatic workflows have been archived:

Zenodo DOI: 10.5281/zenodo.17753185

Citation / Thesis

If you use MMCAW in your work, please cite:

Merideth Naomi Freiheit (2025). Development of Reproducible Metagenomic Approaches for Skin and Wound Microbiome Analysis. University of Hull.
