merfre/Database_Creation_Workflow

DBCW — Database Creation Workflow

Overview

The Database Creation Workflow (DBCW) is a reproducible Snakemake-based pipeline for building the reference databases required for taxonomic and alignment-based metagenomic analyses. The workflow was developed as a modular subworkflow of the MMCAW pipeline but can also be executed independently.

DBCW automates the retrieval, construction, and organisation of databases required for downstream workflows, ensuring consistent and reproducible database generation across analyses.

Scope of the workflow

DBCW supports the following components:

  • Retrieval of reference sequences and taxonomy data
  • Construction of Kraken2 taxonomic classification databases
  • Construction of CAT contig annotation databases
  • Construction of BLAST nucleotide (NT) databases
  • Preparation of the human reference genome (GRCh38.p14)
  • Version tracking and checksum recording for reproducibility
  • Standardised output structure for downstream workflow compatibility

All steps are controlled via the configuration file and can be enabled or disabled as required.
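As a rough illustration of how configuration switches can drive which databases get built, the sketch below maps boolean flags to build targets. The key names and output paths are hypothetical and chosen only for this example; the actual options are documented in config/README.md and may be organised differently.

```python
from typing import Dict, List

# Hypothetical configuration switches; the real keys in config/config.yaml may differ.
EXAMPLE_CONFIG: Dict[str, bool] = {
    "build_kraken2": True,
    "build_cat": False,
    "build_blast_nt": True,
    "prepare_human_reference": True,
}

# Hypothetical mapping from each switch to the output its step would produce.
STEP_TARGETS: Dict[str, str] = {
    "build_kraken2": "resources/kraken2_db",
    "build_cat": "resources/cat_db",
    "build_blast_nt": "resources/blast_nt",
    "prepare_human_reference": "resources/GRCh38.p14",
}


def enabled_targets(config: Dict[str, bool]) -> List[str]:
    """Return the targets of every step that is switched on in the config."""
    # Steps missing from the config are treated as disabled.
    return [target for step, target in STEP_TARGETS.items() if config.get(step, False)]


if __name__ == "__main__":
    for target in enabled_targets(EXAMPLE_CONFIG):
        print(target)
```

In a Snakefile, a list like this would typically feed the input of the default target rule, so that disabled steps are never scheduled.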

Requirements

Software

  • Snakemake v7.22.0 (kept aligned with the parent MMCAW workflow)
  • Conda (Python 3.10.x)

Environment management

Software dependencies are managed via Conda environments defined in: workflow/envs/

Installation

Clone the repository:

git clone https://github.com/merfre/Database_Creation_Workflow
cd Database_Creation_Workflow

Create and activate the environment:

conda env create -f workflow/envs/<environment_file>.yaml
conda activate db_creation

Usage

Basic run (local)

Perform a dry run to check the workflow:

snakemake -n

Run the workflow on multiple cores (here 10; adjust --cores to your machine):

snakemake --cores 10

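Run the workflow locally with the per-rule Conda environments (--use-conda), printing each shell command as it is executed (--printshellcmds):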

snakemake --printshellcmds --use-conda --cores 10

After successful execution, you can create a self-contained interactive HTML report with all results:

snakemake --report dbcw_final_report.html

Running on an HPC / cluster

The parent workflow, MMCAW, was developed and tested on the University of Hull’s Viper HPC. On a cluster, run DBCW with a Snakemake profile suited to your scheduler (e.g. SLURM or PBS):

snakemake --profile <your-cluster-profile>

Integration with MMCAW

DBCW can be triggered automatically within MMCAW by enabling database creation in the MMCAW configuration file. Alternatively, databases can be pre-built using DBCW and referenced directly in downstream workflows.

Inputs (brief)

DBCW requires:

  • Access to reference sequence repositories (e.g. NCBI)
  • Configuration file specifying database parameters and output locations

Outputs (brief)

DBCW generates:

  • Kraken2 database
  • CAT database (reference + taxonomy)
  • BLAST nucleotide (NT) database
  • Human reference genome (GRCh38.p14)
  • Logs, version information, and checksums for reproducibility
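Checksum recording of this kind can be sketched as follows. This is a minimal example under stated assumptions, not the workflow's own implementation: the function names, the manifest file name, and the choice of SHA-256 are all illustrative.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large database files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(db_dir: Path, manifest: Path) -> Dict[str, str]:
    """Record a SHA-256 checksum for every file under db_dir.

    The manifest itself is skipped, so it can live inside db_dir.
    Lines follow the "<digest>  <relative path>" layout used by sha256sum.
    """
    checksums = {
        p.relative_to(db_dir).as_posix(): sha256_of_file(p)
        for p in sorted(db_dir.rglob("*"))
        if p.is_file() and p.resolve() != manifest.resolve()
    }
    with manifest.open("w") as out:
        for name, digest in checksums.items():
            out.write(f"{digest}  {name}\n")
    return checksums
```

Because paths are written relative to the database directory, a manifest produced this way can be checked later from inside that directory, for example with `sha256sum -c`.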

Repository structure

.
├── workflow/
│   ├── Snakefile
│   ├── rules/
│   ├── envs/
│   └── config/
├── resources/
└── config/
    └── config.yaml

Detailed descriptions of resources and configuration are provided in:

  • resources/README.md
  • config/README.md

Reproducibility & benchmarking

  • Workflow implemented in Snakemake v7.22.0
  • All software dependencies are managed via Conda
  • Snakemake’s built-in benchmarking is enabled by default to record:
    • Rule-level runtime
    • CPU and memory usage
    • Resource performance across datasets of varying size and complexity

This supports systematic evaluation of workflow efficiency and scalability.
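One way to use these records is to summarise the benchmark files after a run. The sketch below assumes the tab-separated layout Snakemake writes for its benchmark directive, with a wall-clock column `s` and a peak-memory column `max_rss` (in MB). The directory passed in and the function names are hypothetical, since this README does not state where DBCW stores its benchmark files.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def _to_float(value: Optional[str]) -> Optional[float]:
    """Convert a benchmark field to float; return None for missing or non-numeric values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarise_benchmarks(benchmark_dir: Path) -> List[Dict[str, object]]:
    """Summarise every *.tsv benchmark file found under benchmark_dir."""
    summary = []
    for tsv in sorted(benchmark_dir.rglob("*.tsv")):
        with tsv.open(newline="") as handle:
            records = list(csv.DictReader(handle, delimiter="\t"))
        # Each record is one benchmark run; unparsable fields are ignored.
        seconds = [v for v in (_to_float(r.get("s")) for r in records) if v is not None]
        rss = [v for v in (_to_float(r.get("max_rss")) for r in records) if v is not None]
        summary.append({
            "benchmark": tsv.relative_to(benchmark_dir).as_posix(),
            "runs": len(records),
            "mean_seconds": sum(seconds) / len(seconds) if seconds else None,
            "peak_rss_mb": max(rss) if rss else None,
        })
    return summary
```

Calling `summarise_benchmarks(Path("benchmarks"))` (the directory name is a placeholder) returns one entry per benchmark file, which can then be compared across runs with different database sizes.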

Data availability

Where feasible, raw sequencing data and associated bioinformatic workflows have been archived:

  • Zenodo: doi: 10.5281/zenodo.17753185

Citation / Thesis

If you use DBCW or the parent MMCAW pipeline in your work, please cite:

Merideth Naomi Freiheit (2025). Development of Reproducible Metagenomic Approaches for Skin and Wound Microbiome Analysis. University of Hull.
