Ley Lab MetaGenome Profiler DataBase generator


Struo: a pipeline for building custom databases for common metagenome profilers

"Struo" --> from the Latin: “I build” or “I gather”


Cuesta-Zuluaga, Jacobo de la, Ruth E. Ley, and Nicholas D. Youngblut. 2019. “Struo: A Pipeline for Building Custom Databases for Common Metagenome Profilers.” bioRxiv.

Pre-built custom databases

Custom GTDB databases are available from the Struo data FTP server.

GTDB releases available:

  • Release 86

GTDB releases in progress:

  • Release 89



conda env setup

  • r-argparse
  • r-curl
  • r-data.table
  • r-dplyr
  • ncbi-genome-download
  • snakemake
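The packages above can be captured in a conda environment file. A minimal sketch (the environment name "struo", the file name, and the channel choices are assumptions, not part of the repo):

```yaml
# environment.yaml (hypothetical file name)
# Create and activate with:
#   conda env create -f environment.yaml
#   conda activate struo
name: struo
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-argparse
  - r-curl
  - r-data.table
  - r-dplyr
  - ncbi-genome-download
  - snakemake
```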

UniRef diamond database(s)

You will need a UniRef DIAMOND database (e.g., UniRef90) for the humann2 database construction. See the "Download a translated search database" section of the humann2 docs, which uses the humann2_databases utility to fetch a pre-built DIAMOND database.

Getting reference genomes for the custom databases

Downloading genomes

  • If using GTDB genomes, run GTDB_metadata_filter.R to select genomes
  • If downloading genomes from GenBank/RefSeq, you can use genome_download.R

Input data (samples.txt file)

  • The pipeline requires a tab-delimited table that includes the following columns (column names specified in the config.yaml file):
    • Sample ID
      • This will usually just be the species/strain names
    • Path to the genome assembly fasta file
      • NOTE: these must be gzipped
    • taxonomy ID
      • This should be the NCBI taxonomy ID at the species/strain level
        • Needed for Kraken
    • taxonomy
      • This should at least include g__<genus>;s__<species>
      • The taxonomy can include higher levels, as long as levels 6 & 7 are genus and species
      • Any taxonomy lacking genus and/or species levels will be labeled:
        • g__unclassified (if no genus)
        • s__unclassified (if no species)
      • This is needed for humann2
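Putting the columns together, a minimal samples.txt could look like the following. This is a sketch: the column names shown here and all values (species, paths, taxids) are made-up placeholders — the actual column names must match those set in your config.yaml.

```shell
# Write a minimal, tab-delimited samples table.
# All names, paths, and taxids below are hypothetical placeholders.
{
  printf 'ncbi_organism_name\tfasta_file_path\tncbi_species_taxid\ttaxonomy\n'
  printf 'Escherichia_coli_K12\t/path/to/E_coli_K12.fna.gz\t511145\tg__Escherichia;s__Escherichia coli\n'
  printf 'Unknown_isolate_1\t/path/to/isolate1.fna.gz\t77133\tg__unclassified;s__unclassified\n'
} > samples.txt
```

Note the second genome: lacking genus/species information, its taxonomy is labeled g__unclassified;s__unclassified as described above.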

Running the pipeline

Edit the config.yaml

  • Specify the input/output paths
  • Modify parameters as needed

Running locally

snakemake --use-conda

Running on a cluster

If using SGE, you can use the provided submission script. For other cluster architectures, you can create a similar bash script; see the Snakemake documentation on cluster execution for help.
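As an illustration, a thin SGE wrapper could look like the following. The script name, job count, and qsub flags are assumptions — adapt them to your queue setup; the --cluster option and the {threads} placeholder are standard Snakemake features.

```shell
# Write a hypothetical SGE submission wrapper for the pipeline.
cat > run_sge.sh <<'EOF'
#!/bin/bash
# Each snakemake rule instance is submitted as its own qsub job.
snakemake --use-conda --jobs 20 \
    --cluster "qsub -cwd -V -pe parallel {threads}"
EOF
chmod +x run_sge.sh
```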

General info on using snakemake

Snakemake makes it easy to re-run the pipeline on only the genomes that have not yet been processed. Just add more genomes to the input table and re-run the pipeline (test first with --dryrun). Snakemake will process only the new genomes and then re-create the combined database files (this last step must be done each time). Do not modify the files in the nuc_filtered and prot_filtered directories; otherwise, snakemake may re-run all genomes through the computationally expensive gene annotation step.
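For example, adding one genome is just appending a row to the table (the row below is a made-up placeholder) and previewing what would re-run:

```shell
# Append one new genome row to the existing tab-delimited table.
# Name, path, taxid, and taxonomy are hypothetical placeholders.
printf 'New_strain_X\t/path/to/new_strain_X.fna.gz\t1234\tg__unclassified;s__unclassified\n' >> samples.txt
# Then, inside the pipeline conda env, preview before the real run:
#   snakemake --use-conda --dryrun
```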

Using the resulting databases

Set the database paths in humann2, kraken2, etc. to the new, custom database files.

  • humann2
    • nucleotide
      • all_genes_annot.fna.gz
    • amino acid
      • all_genes.dmnd
  • kraken2
    • database*mers.kraken
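As a sketch, pointing the profilers at the custom files might look like the following. The output directory layout shown is an assumption (match it to your configured output paths); the flags themselves (--nucleotide-database, --protein-database, --db) are standard humann2/kraken2 options.

```shell
# Hypothetical layout of the Struo output databases.
mkdir -p custom_db/humann2 custom_db/kraken2
# humann2 takes the directories holding the custom nucleotide/protein DBs:
#   humann2 --input sample.fq.gz --output humann2_out \
#       --nucleotide-database custom_db/humann2 \
#       --protein-database custom_db/humann2
# kraken2 takes the directory containing the database files:
#   kraken2 --db custom_db/kraken2 --threads 8 sample.fq.gz > sample.kraken
```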

Adding more samples (genomes) to an existing custom DB

Add new genomes to the input table and delete the following files (if they exist):

  • humann2 database
    • all_genes_annot.dmnd
  • kraken database
    • hash.k2d
    • taxo.k2d
  • bracken database
    • database100mers.kraken
    • database150mers.kraken

Adding existing gene sequences to humann2 databases

If you have gene sequences already formatted for creating a humann2 custom DB, and you'd like to include them with the gene sequences generated from the input genomes, then just provide the file paths to the nuc/prot fasta files (humann2_nuc_seqs and humann2_prot_seqs in the config.yaml file).
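In config.yaml, that would look something like the following (the paths are hypothetical placeholders; the key names are as described above):

```yaml
# Pre-formatted gene sequences to merge into the humann2 custom DB
humann2_nuc_seqs: /path/to/my_genes.fna.gz
humann2_prot_seqs: /path/to/my_genes.faa.gz
```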

All genes (from genomes & user-provided) will be clustered together with vsearch. See the config.yaml for the default clustering parameters.
