TAMA

  • Taxonomy Analysis pipeline for metagenome using Meta-Analysis

System requirements (tested versions)

  • Programs

      - perl (v5.22.1)
      - python (2.7.12)
      - java (1.8.0)
      - git (2.7.4)
      - gcc, g++ (5.4.0)
      - make (GNU Make 4.1)
      - zip (3.0)
      - unzip (6.00)
      - curl (7.47.0)
      - Sort::Key::Natural (perl library)
    
  • Required system resources

      To run a species-rank analysis with the included example dataset,
      you need the resources listed below.
      - Disk: (approximately) 300GB 
              * CLARK: 88GB, Kraken: 188GB, Centrifuge: 9.6GB
      - Memory: (approximately) 185GB 
              * This large memory is required for running the taxonomy analysis tools,
                especially CLARK and Kraken
    

Download and installation

  • Downloading TAMA

      git clone https://github.com/jkimlab/TAMA.git
      cd TAMA
    
  • Installing TAMA

    [[ RECOMMENDED ]]

      * Installing by docker (https://www.docker.com/)
    
      [Build docker image]
      docker build -t [image_name] .
      
      [Run a container]
      docker run -it [image_name] /bin/bash
      cd TAMA
    

    [ Additional ]

      * Installing by source code
        (We recommend installing it by docker.)
    
      [ Check the required perl libraries ]
      ./setup.pl --check
      
      [ Install TAMA package ]
      ./setup.pl --install
    
      [ Uninstall TAMA package ]
      ./setup.pl --uninstall
    
  • Running TAMA with an example dataset and an example reference database

      * To test whether the installation was successful,
        set up the prepared examples and run the example command file.
      * [ PLEASE CHECK! ] This requires more than 30GB of memory.
      * The command file generates an example reference database and
        runs TAMA with an example dataset.
        
      ./setup.pl --example
      source ./src/env.sh 
      bash Example_run.cmd.sh
    
  • Setting the integrated TAMA databases

      [ Downloading and setting the integrated TAMA database ]
      ./setup.pl --db (or ./setup.pl --db --species)
      
      * CLARK uses a separate database for each taxonomic rank,
        so six databases (one per rank) are provided. (default: species)
      * To analyze other taxonomic ranks, download the additional databases
        by adding the corresponding taxonomic rank options.
      * Example:
        To analyze at the species, genus, and phylum ranks,
        run setup.pl with the options below to set up each database.
      
      ./setup.pl --db --species --genus --phylum
    
  • Before downloading the databases, please check the required disk space. (Downloading and setting up each database requires about twice its listed size in free storage.)

    Tool     Species   Genus   Family   Order   Class   Phylum
    CLARK    88 GB     90 GB   89 GB    90 GB   88 GB   91 GB

    Tool         DB
    Kraken       188 GB
    Centrifuge   9.6 GB
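
The disk space check can be sketched as a small calculation. This is only an illustration: the sizes are copied from the table above, the function names are hypothetical, and the factor of 2 is our reading of "about twice as much storage space", not a value taken from the TAMA code.

```python
# Database sizes in GB, copied from the table above.
CLARK_DB_GB = {
    "species": 88, "genus": 90, "family": 89,
    "order": 90, "class": 88, "phylum": 91,
}
KRAKEN_DB_GB = 188
CENTRIFUGE_DB_GB = 9.6


def final_db_size_gb(ranks=("species",)):
    """Total size of the installed databases for the chosen CLARK ranks."""
    unknown = [r for r in ranks if r not in CLARK_DB_GB]
    if unknown:
        raise ValueError("unknown rank(s): %s" % ", ".join(unknown))
    # set() guards against counting the same rank twice.
    clark = sum(CLARK_DB_GB[r] for r in set(ranks))
    return clark + KRAKEN_DB_GB + CENTRIFUGE_DB_GB


def download_space_gb(ranks=("species",), factor=2.0):
    """Free space to reserve during download and setup (assumed factor of 2)."""
    return factor * final_db_size_gb(ranks)
```

For the default species rank this gives 285.6 GB installed, in line with the "approximately 300GB" figure in the system requirements.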

Running TAMA

  • Running TAMA with example datasets and the integrated TAMA database

      bash Run_TAMA.cmd.sh
      
      * Before running this command, you have to set the integrated TAMA database.
      * How to set the integrated TAMA database is described above.
      * [ PLEASE CHECK! ] This requires more than 185GB of memory.
        (Loading database for CLARK: approximately 155GB, Kraken: approximately 184GB) 
    
  • Options of running TAMA

      ./TAMA.pl
      
      Usage: ./TAMA.pl [option] --param param.txt 
      Options:
      	-p		The number of threads  (default: 1)
      	-o		Path of output directory  (default: Current directory)
      	-t		Save temporary files (default: False)
      			If you want to save, type 'True'
      	-h|help		Print help page
      Input:
      	--param		(Required) Path of parameter file
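
If you start TAMA from a script, the options above can be assembled into an argument list. The sketch below is a hypothetical helper (build_tama_command is not part of TAMA); it uses only the flags shown in the usage text and assumes TAMA.pl is in the current directory.

```python
def build_tama_command(param_file, threads=1, out_dir=None, keep_temp=False,
                       tama_script="./TAMA.pl"):
    """Build the TAMA.pl argument list from the documented options."""
    cmd = [tama_script, "-p", str(threads)]
    if out_dir is not None:
        cmd += ["-o", out_dir]      # default is the current directory
    if keep_temp:
        cmd += ["-t", "True"]       # the usage text asks for the literal 'True'
    cmd += ["--param", param_file]  # required
    return cmd

# Possible usage (not executed here):
#   import subprocess
#   subprocess.check_call(build_tama_command("param.txt", threads=8, out_dir="out"))
```

Passing the command as a list avoids shell quoting problems when paths contain spaces.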
    
  • To run TAMA, you need to prepare a parameter file

      * You can see detailed options in the "params" file. 
    
      ### Example of parameter file 
      # Each line has a 'PARAMETERNAME' and a 'VALUE', joined by '='
      # PARAMETERNAME starts with the '$' symbol
      # Please do not change the 'PARAMETERNAME'
      # You can set your input files or options by changing the 'VALUE' after the '=' symbol
    
      [Project]
      # PROJECTNAME parameter is the name of your project, which covers all of the input datasets that follow. 
      # Multiple datasets could be included in the project and analyzed in a single run. 
      # A directory with this name will be created in the output directory.
      $PROJECTNAME=TEST
    
      [Basic_options]
      # TOOL parameter takes a list of taxonomy analysis tools. 
      # You can use one or more of CLARK, centrifuge, and kraken for your analysis.
      # (default: CLARK,centrifuge,kraken)
      $TOOL=CLARK,centrifuge,kraken
      # RANK parameter is the target taxonomic rank of the analysis.
      # You can choose one taxonomic rank from species, genus, family, order, class, and phylum.
      # (default: species) 
      $RANK=species
      # META-THRESHOLD parameter is the filtering threshold of meta-analysis for abundance estimation. 
      # You can use a positive number between 0 and 1. 
      # A larger threshold gives stricter filtering in the meta-analysis. 
      # (default: 0.34)
      $META-THRESHOLD=
      # WEIGHT-CLARK parameter is the weight (or confidence level) of CLARK results for meta-analysis. 
      # You can use a positive number between 0 and 1. 
      # (default: 0.9374)
      $WEIGHT-CLARK=
      # WEIGHT-centrifuge parameter is the weight (or confidence level) of centrifuge results for meta-analysis.
      # You can use a positive number between 0 and 1.
      # (default: 0.9600)
      $WEIGHT-centrifuge=
      # WEIGHT-kraken parameter is the weight (or confidence level) of kraken results for meta-analysis.
      # You can use a positive number between 0 and 1. 
      # (default: 0.9362)
      $WEIGHT-kraken=
    
      [Database]
      # DBNAME parameter is the name of the directory that contains the CLARK, kraken, and centrifuge directories. 
      # This directory has to be in the 'DB' directory of the TAMA package. 
      # Each directory (CLARK, kraken, and centrifuge) should contain its database files. 
      # (default: tama)
      $DBNAME=tama
    
      [Input]
      # Separate the datasets with the '>' symbol. 
      # The name of each dataset follows the '>'. 
      # A dataset can have multiple input sequence files. 
      # All sequence files are treated as input to the current dataset until the next '>' symbol. 
      # A directory with the dataset name will be created in the project directory 
      # (==> output_dir/projectname/samplename)
      >sample1
      # PAIRED1 and PAIRED2 parameters are the forward and 
      # reverse reads of paired-end sequencing data. 
      # If you have multiple sequence files, 
      # list them separated by ',' or write them on multiple lines.
      $PAIRED1=examples/sample1.1.fq
      $PAIRED2=examples/sample1.2.fq
      $PAIRED1=
      $PAIRED2=
      # SINGLE parameter is the single-end read file. 
      # If you have multiple sequence files, 
      # list them separated by ',' or write them on multiple lines.
      $SINGLE=
    
      >sample2
      $PAIRED1=examples/sample2.1.fq
      $PAIRED2=examples/sample2.2.fq
      $SINGLE=
    
      [Preprocessing]
      # TRIMMOMATIC-RUN parameter sets whether to run the read trimming process. 
      # Write 'true' to run read trimming, 
      # or 'false' to skip it. 
      # (default: true)
      $TRIMMOMATIC-RUN=true
      # TRIMMOMATIC-OPTION parameter is the running options of the Trimmomatic program. 
      # (default: AVGQUAL:2 LEADING:3 TRAILING:3)
      $TRIMMOMATIC-OPTION=
      # BAYESHAMMER-RUN parameter sets whether to run the read error correction process. 
      # Write 'true' to run read error correction, 
      # or 'false' to skip it. 
      # (default: true)
      $BAYESHAMMER-RUN=true
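
To check a parameter file before a long run, the format described in the comments above can be read with a few lines of code. This is a minimal sketch and not TAMA's own parser: parse_params is a hypothetical name, and the handling of empty values and repeated keys is our assumption based on the example.

```python
def parse_params(text):
    """Parse TAMA-style parameter text into (options, samples)."""
    options = {}    # NAME -> value for '$NAME=VALUE' lines outside any dataset
    samples = {}    # dataset name -> {NAME: [file, ...]}
    current = None  # dataset currently being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if line.startswith("["):
            current = None                # a new section ends the current dataset
            continue
        if line.startswith(">"):
            current = line[1:].strip()    # dataset name follows '>'
            samples[current] = {}
            continue
        if line.startswith("$") and "=" in line:
            name, value = line[1:].split("=", 1)
            name, value = name.strip(), value.strip()
            if current is None:
                options[name] = value
            else:
                # Files may be comma-separated or spread over repeated lines.
                files = [v.strip() for v in value.split(",") if v.strip()]
                samples[current].setdefault(name, []).extend(files)
    return options, samples
```

An empty option value (for example `$META-THRESHOLD=`) is returned as an empty string, so the caller can decide to fall back to the documented default.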
    

TAMA output

  • Read classification profile

      - File name: read_classi.out
      
      - File format: there are three columns without header line
          Column 1: Input read sequence ID
          Column 2: Assigned taxon ID (or IDs)
          Column 3: Meta-score
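
The three-column layout can be read as in the sketch below. Several points are assumptions, because the text above does not specify them: the columns are taken to be separated by whitespace, column 2 is kept as an opaque string (the separator between multiple IDs is not documented here), and a read is kept when its meta-score is greater than or equal to the cutoff. The function names are hypothetical.

```python
from collections import Counter


def read_classification(lines):
    """Yield (read_id, taxon_field, meta_score) from read_classi.out lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        read_id, taxon_field, score = fields
        try:
            yield read_id, taxon_field, float(score)
        except ValueError:
            continue  # skip lines whose third column is not a number


def count_by_taxon(lines, min_score=0.0):
    """Count reads per value of column 2, keeping reads with score >= min_score."""
    counts = Counter()
    for _read_id, taxon_field, score in read_classification(lines):
        if score >= min_score:  # assumed comparison, see the note above
            counts[taxon_field] += 1
    return counts
```

Since `lines` can be any iterable of strings, an open file handle can be passed directly.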
    
  • Relative species abundance profile

      - File name: abundance_profile.*.out
      ('*' is the input meta-score threshold (value of META-THRESHOLD parameter))
      * Before abundance estimation, a filtered read classification profile
        is generated, and the abundance profile is computed from it.
      * The filtered read classification profile: read_classi.*.out
        ('*' is the input meta-score threshold (value of META-THRESHOLD parameter))
      
      - File format: there are seven columns with header line
          
      1. The first line gives the name of each column
          (1) Scientific name: the scientific name of this taxon
          (2) Taxon ID: taxon id of this genome for the input taxonomic rank
          (3) Genome size: estimated genome size of this taxon 
              (if the rank of input taxonomy is not a 'species', '-' is recorded)
          (4) Ratio: the proportion of reads assigned to this taxon 
              ([the number of reads with this taxon ID]/[total number of reads])
          (5) # of Read count: the number of reads assigned to this taxon
          (6) # of Total read: the total number of reads in this sample
          (7) Abundance: relative species abundance of this taxon
          
      2. The second line reports the 'unclassified reads'
      3. From the third line on, the file lists the predicted species (in the reference database) 
         from the 'classified reads'
      
      - Example: 
              Scientific name	Taxonomy ID	Genome size	Ratio	# of Read count	# of Total read	Abundance
              NA	NA	NA	8.4622056739089e-05	4	47269	-
              Buchnera aphidicola	9	1722047	0.931223423385305	44018	47269	0.776485940413685
              Lysobacter enzymogenes	69	451961	0.0354354862594935	1675	47269	0.11258023255361
              Cellulomonas gilvus	11	430464	0.033256468298462	1572	47269	0.110933827032704
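
The example rows can be checked with a short script. The Ratio column follows the formula given above (read count divided by total reads). For the Abundance column, the example numbers are consistent with dividing each read count by the genome size and rescaling the results to sum to 1. That relationship is inferred from this example only and is not stated in the text above, so treat it as an assumption. The function names are hypothetical.

```python
# The example profile from above, tab-separated.
EXAMPLE = """\
Scientific name\tTaxonomy ID\tGenome size\tRatio\t# of Read count\t# of Total read\tAbundance
NA\tNA\tNA\t8.4622056739089e-05\t4\t47269\t-
Buchnera aphidicola\t9\t1722047\t0.931223423385305\t44018\t47269\t0.776485940413685
Lysobacter enzymogenes\t69\t451961\t0.0354354862594935\t1675\t47269\t0.11258023255361
Cellulomonas gilvus\t11\t430464\t0.033256468298462\t1572\t47269\t0.110933827032704
"""


def parse_profile(text):
    """Parse an abundance profile into a list of dicts (header line skipped)."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        name, taxid, gsize, ratio, count, total, abundance = line.split("\t")
        rows.append({
            "name": name,
            "taxid": taxid,
            "genome_size": None if gsize in ("NA", "-") else int(gsize),
            "ratio": float(ratio),
            "reads": int(count),
            "total": int(total),
            "abundance": None if abundance == "-" else float(abundance),
        })
    return rows


def recompute_abundance(rows):
    """Reads per base, rescaled to sum to 1 (relationship inferred from the example)."""
    density = {r["taxid"]: r["reads"] / float(r["genome_size"])
               for r in rows if r["genome_size"]}
    total = sum(density.values())
    return {taxid: d / total for taxid, d in density.items()}
```

Rows without a genome size (the unclassified line, or any rank other than species) are left out of the recomputation.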
    

Creating custom reference database

  • Creating a custom database with example reference set

      source ./src/env.sh
      ./src/Create_customDB.pl -ref ./examples/ref_list.example -names ./examples/names.dmp.gz -nodes ./examples/nodes.dmp.gz -o example_db
      
      
      * Please add the path of jellyfish to your PATH, or run the command below before running Create_customDB.pl
      
      source ./src/env.sh
      
      * If you skip this step, creating the Kraken database will fail.
    
  • Options for creating a custom database

      ./src/Create_customDB.pl -h
      
      Usage: ./Create_customDB.pl -ref <reference list> -names <names.dmp> -nodes <nodes.dmp> -o <output directory>
      -ref       <reference list>
        This file must list the path of each reference genome sequence file and its taxon ID in two columns separated by '\t'.
      -names     <names.dmp>
        The NCBI names.dmp file, which has the scientific names.
      -nodes     <nodes.dmp>
        The NCBI nodes.dmp file, which has the taxonomy tree.
      -o | -out  <output directory name>  A directory with this name will be created in '[path of TAMA]/DB'.
      -p | -cpu  <num of threads>
      -rank      <target taxonomic rank>
        Select one from species(default), genus, family, order, class, and phylum.
      -h | -help Print this page.
    
  • Preparing input for generating custom database

      * To create an integrated TAMA database, you have to prepare three input files.
      1. The list of reference files and their taxon IDs (user made)
      - Each line of this file has to contain a reference genome sequence file path and its taxon ID
        in two columns delimited by '\t'
      - Example:
        GCF_000007365.fna.gz	438753
        GCF_000007725.fna.gz	198804
        GCF_000009605.fna.gz	224915
      - Also, you can see an example in the examples directory of TAMA (ref_list.example)
        
      2. Taxonomy names file (names.dmp file from NCBI taxonomy)
      3. Taxonomy nodes file (nodes.dmp file from NCBI taxonomy)
      - You can download names.dmp and nodes.dmp from NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy)
      - Also, you can use the names.dmp and the nodes.dmp files in the examples directory of TAMA
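
The reference list is simple enough to generate from a script. The helper below is a hypothetical sketch (format_ref_list is not part of TAMA). It writes one "path, tab, taxon ID" line per genome, as described in item 1 above, and assumes taxon IDs are integers.

```python
def format_ref_list(entries):
    """Return reference-list text: one 'path<TAB>taxon ID' line per genome."""
    lines = []
    for path, taxon_id in entries:
        if "\t" in path:
            raise ValueError("path must not contain a tab: %r" % path)
        # int() rejects taxon IDs that are not whole numbers.
        lines.append("%s\t%d" % (path, int(taxon_id)))
    return "\n".join(lines) + "\n"

# Possible usage (not executed here):
#   with open("ref_list.txt", "w") as handle:
#       handle.write(format_ref_list([("GCF_000007365.fna.gz", 438753)]))
```

The resulting file can then be passed to Create_customDB.pl with the -ref option.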
    

Third party tools

How to cite

Sim M, Lee J, Lee D, Kwon D, Kim J. TAMA: improved metagenomic sequence classification through meta-analysis. BMC Bioinformatics. 2020 May 12;21(1):185. doi: 10.1186/s12859-020-3533-7.

Contact

E-mail: bioinfolabkr@gmail.com
