The all subcommand

Guanliang MENG edited this page Jun 22, 2023 · 1 revision

You can provide this subcommand with single-end or paired-end fastq data (raw or clean), and MitoZ will try to produce annotated mitogenomes directly.

$ mitoz all -h
usage: mitoz all [-h] [--outprefix <str>] [--thread_number <int>] [--workdir <directory>]
                 [--clade {Chordata,Arthropoda,Echinodermata,Annelida-segmented-worms,Bryozoa,Mollusca,Nematoda,Nemertea-ribbon-worms,Porifera-sponges}]
                 [--genetic_code <INT>] [--species_name <STR>] [--template_sbt <file>] --fq1 <file>
                 [--fq2 <file>] [--phred64] [--insert_size <INT>] [--fastq_read_length <INT>]
                 [--data_size_for_mt_assembly <float1>,<float2>] [--skip_filter] [--filter_other_para <str>]
                 [--assembler {mitoassemble,spades,megahit}] [--tmp_dir <STR>] [--kmers <INT> [<INT> ...]]
                 [--kmers_megahit <INT> [<INT> ...]] [--kmers_spades <INT> [<INT> ...]] [--memory <INT>]
                 [--resume_assembly] [--profiles_dir <STR>] [--slow_search] [--filter_by_taxa] --requiring_taxa
                 <STR> [--requiring_relax {0,1,2,3,4,5,6}] [--min_abundance <float>]

Run all steps for mitochondrial genome anlysis from input fastq files.

optional arguments:
  -h, --help            show this help message and exit

Common arguments:
  --outprefix <str>     output prefix [out]
  --thread_number <int>
                        thread number [8]
  --workdir <directory>
                        working directory [./]
  --clade {Chordata,Arthropoda,Echinodermata,Annelida-segmented-worms,Bryozoa,Mollusca,Nematoda,Nemertea-ribbon-worms,Porifera-sponges}
                        which clade does your species belong to? [Arthropoda]
  --genetic_code <INT>  which genetic code table to use? 'auto' means determined by '--clade' option. [auto]
  --species_name <STR>  species name to use in output genbank file ['Test sp.']
  --template_sbt <file>
                        The sqn template to generate the resulting genbank file. Go to
                        https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/#Template to generate your own template
                        file if you like.
                        ['/home/gmeng/.conda/envs/mybase/envs/mitozEnv.test3.6/lib/python3.8/site-
                        packages/mitoz/annotate/script/template.sbt']

Input fastq information:
  --fq1 <file>          Fastq1 file [required]
  --fq2 <file>          Fastq2 file [optional]
  --phred64             Are the fastq phred64 encoded? [False]
  --insert_size <INT>   insert size of input fastq files [250]
  --fastq_read_length <INT>
                        read length of fastq reads, used by the filter subcommand and mitoAssemble. [150]
  --data_size_for_mt_assembly <float1>,<float2>
                        Data size (Gbp) used for mitochondrial genome assembly, usually between 2~8 Gbp is
                        enough. The float1 means the size (Gbp) of raw data to be subsampled, while the float2
                        means the size of clean data must be >= float2 Gbp, otherwise MitoZ will STOP running!
                        When only float1 is set, float2 is assumed to be 0. (1) Set float1 to be 0 if you want
                        to use ALL raw data; (2) Set 0,0 if you want to use ALL raw data and do NOT interrupt
                        MitoZ even if you got very little clean data. If you got missing mitochondrial genes,
                        try (1) differnt kmers; (2)different assembler; (3) increase <float1>,<float2> [2,0]
  --skip_filter         Skip the rawdata filtering step, assuming input fastq are clean data. To subsample such
                        clean data, set <float2> of the --data_size_for_mt_assembly option to be larger than 0
                        (using all input clean data by default). [False]
  --filter_other_para <str>
                        other parameter for filtering. []

Assembly arguments:
  --assembler {mitoassemble,spades,megahit}
                        Assembler to be used. [megahit]
  --tmp_dir <STR>       Set temp directory for megahit if necessary (See
                        https://github.com/linzhi2013/MitoZ/issues/176)
  --kmers <INT> [<INT> ...]
                        kmer size(s) to be used. Multiple kmers can be used, separated by space [71]
  --kmers_megahit <INT> [<INT> ...]
                        kmer size(s) to be used. Multiple kmers can be used, separated by space. Only for
                        megahit [43 71 99]
  --kmers_spades <INT> [<INT> ...]
                        kmer size(s) to be used. Multiple kmers can be used, separated by space. Only for spades
                        ['auto']
  --memory <INT>        memory size limit for spades/megahit, no enough memory will make the two programs halt
                        or exit [50]
  --resume_assembly     to resume previous assembly running [False]

Search mitochondrial sequences arguments:
  --profiles_dir <STR>  Directory cotaining 'CDS_HMM/', 'MT_database/' and 'rRNA_CM/'.
                        [/home/gmeng/.conda/envs/mybase/envs/mitozEnv.test3.6/lib/python3.8/site-
                        packages/mitoz/profiles]
  --slow_search         By default, we firstly use tiara to perform quick sequence classification (100 times
                        faster than usual!), however, it is valid only when your mitochondrial sequences are >=
                        3000 bp. If you have missing genes, set '--slow_search' to use the tradicitiona search
                        mode. [False]
  --filter_by_taxa      filter out non-requiring_taxa sequences by mito-PCGs annotation to do taxa
                        assignment.[True]
  --requiring_taxa <STR>
                        filtering out non-requiring taxa sequences which may be contamination [required]
  --requiring_relax {0,1,2,3,4,5,6}
                        The relaxing threshold for filtering non-target-requiring_taxa. The larger digital means
                        more relaxing. [0]
  --min_abundance <float>
                        the minimum abundance of sequence required. Set this to any value <= 0 if you do NOT
                        want to filter sequences by abundance [10]

MitoZ supports three de novo assemblers: MitoAssemble, Megahit, and SPAdes. If one assembler fails to deliver a good mitogenome, you are encouraged to try another. If your server has limited memory, use Megahit or SPAdes and set the --memory option. For example, --memory 50 limits the assembler to a maximum of 50 GB of RAM.

To choose an assembler, use the --assembler option.

Warning:

  • --assembler spades only accepts paired-end data, which means that you need to provide both --fq1 and --fq2!

  • Use --data_size_for_mt_assembly 0 if you want to use ALL your fastq data for mitogenome assembly, whichever assembler you use.
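
The paired-end requirement of --assembler spades can be checked before a long run is started. The following is only a minimal sketch: the function name check_assembler_inputs is hypothetical and not part of MitoZ, and it assumes that an empty second argument means no --fq2 file is available.

```shell
# Hypothetical pre-flight check (not part of MitoZ).
# $1 = assembler name, $2 = path to the second fastq file (empty for single-end data).
check_assembler_inputs() {
    assembler=$1
    fq2=$2
    if [ "$assembler" = "spades" ] && [ -z "$fq2" ]; then
        echo "spades needs paired-end data: provide both --fq1 and --fq2" >&2
        return 1
    fi
    return 0
}
```

A possible use is to call check_assembler_inputs "$assembler" "$fq2" first and to start mitoz all only if it succeeds.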

Examples

First, create a directory for the analysis of your sample (ideally, each sample has its own directory):

mkdir -p /home/gmeng/work/sampleID   # change this path to your own working path
cd /home/gmeng/work/sampleID
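
If you have several samples, the same one-directory-per-sample layout can be created in one go. This is a minimal sketch: the helper name make_sample_dirs and the sample IDs in the example call are hypothetical and not part of MitoZ.

```shell
# Hypothetical helper (not part of MitoZ): create one working directory per sample.
# $1 = base directory, remaining arguments = sample IDs.
make_sample_dirs() {
    base=$1
    shift
    for sample in "$@"; do
        mkdir -p "$base/$sample" || return 1
    done
}

# Example call; replace the base path and the sample IDs with your own:
# make_sample_dirs /home/gmeng/work sampleA sampleB
```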

1. Paired-end (PE) fastq data

  • PE data works with all three assemblers (--assembler megahit, --assembler spades and --assembler mitoassemble).

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
fq2=/path/to/read.2.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--fq2 $fq2 \
--assembler megahit

Or use --skip_filter if you want to skip the raw data filtering step (assuming your data is already clean):

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
fq2=/path/to/read.2.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--fq2 $fq2 \
--assembler megahit \
--skip_filter

Or, if you want to limit the memory the software uses (--memory works with --assembler megahit and --assembler spades only):

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
fq2=/path/to/read.2.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--fq2 $fq2 \
--assembler megahit \
--memory 80

By default, MitoZ subsamples only 2 Gbp of raw data for mitogenome assembly (the default of --data_size_for_mt_assembly is 2,0, as shown in the help message above). To force MitoZ to use all your input fastq data for assembly, use the --data_size_for_mt_assembly 0 option:

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
fq2=/path/to/read.2.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--fq2 $fq2 \
--assembler megahit \
--data_size_for_mt_assembly 0

2. Single-end (SE) fastq data

  • SE data does not work with the --assembler spades option.

Use mitoassemble for assembly:

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--assembler mitoassemble \
--fastq_read_length 151 \
--kmers 91 71 51

Or use megahit for assembly:

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--assembler megahit

More about mitoassemble

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
fq2=/path/to/read.2.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--fq2 $fq2 \
--assembler mitoassemble \
--fastq_read_length 151 \
--kmers 91 71 51

Please change --fastq_read_length 151, --clade Chordata, --requiring_taxa Chordata and --genetic_code 2 to match your fastq files and samples. You can also try other kmer sizes (odd numbers), say 65, or use more kmers, say --kmers 91 65 71 51.

The command above runs a separate mitochondrial genome assembly for each of the kmers 91, 71, and 51 with the mitoAssemble assembler, using 8 threads by default.
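
Because each kmer assembly takes a long time, it can help to check the kmer list before the job is started. The sketch below only tests that every value is a positive odd integer, as recommended above. The function name validate_kmers is hypothetical and not part of MitoZ, and MitoZ may apply further limits that this check does not know about.

```shell
# Hypothetical pre-flight check (not part of MitoZ):
# succeed only if every argument is a positive odd integer without leading zeros.
validate_kmers() {
    for k in "$@"; do
        case $k in
            ''|0*|*[!0-9]*)
                echo "not a positive integer: $k" >&2
                return 1
                ;;
        esac
        if [ $((k % 2)) -eq 0 ]; then
            echo "kmer size must be odd: $k" >&2
            return 1
        fi
    done
}
```

For example, validate_kmers 91 71 51 succeeds, while validate_kmers 91 70 fails and names the even value on standard error.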

If you do not want to filter your input fastq files, add the --skip_filter option:

source activate mitozEnv

fq1=/path/to/read.1.fq.gz
fq2=/path/to/read.2.fq.gz
out=YourSampleID

mitoz all \
--outprefix $out \
--clade Chordata \
--requiring_taxa Chordata \
--genetic_code 2 \
--fq1 $fq1 \
--fq2 $fq2 \
--assembler mitoassemble \
--fastq_read_length 151 \
--kmers 91 71 51 \
--skip_filter

Please keep in mind that each kmer assembly is quite time-consuming. If an earlier kmer assembly has already produced a circular mitochondrial genome, you do not have to run the remaining kmer assemblies: you can kill the job at that point and annotate the mitochondrial genome directly.

To check progress, look in the /home/gmeng/work/sampleID/mt_assembly/ directory for the sampleID.mitoAssemble.K*.result directories.

Of course, you can also let the command run until it finishes, since some kmer assemblies could give better results.
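
To see which kmer assemblies have produced a result directory so far, a short loop over the directory pattern mentioned above is enough. This is only a sketch: the function name list_kmer_results is hypothetical, and it makes no assumption about the files inside each result directory.

```shell
# Hypothetical helper (not part of MitoZ): print the mitoAssemble result
# directories found under <workdir>/mt_assembly/.
# $1 = working directory of the sample (defaults to the current directory).
list_kmer_results() {
    workdir=${1:-.}
    found=0
    for d in "$workdir"/mt_assembly/*.mitoAssemble.K*.result; do
        # An unmatched glob stays as a literal string, so test for a real directory.
        [ -d "$d" ] || continue
        echo "$d"
        found=1
    done
    # Return non-zero when no result directory exists yet.
    [ "$found" -eq 1 ]
}
```

Running list_kmer_results /home/gmeng/work/sampleID prints one line per finished or running kmer directory and returns a non-zero status if none exists yet.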
