The findmitoscaf subcommand

Guanliang MENG edited this page Jun 22, 2023 · 1 revision

You can use this subcommand to search a FASTA file (generated by MitoZ or by any other assembler) for mitochondrial sequences.

$ mitoz findmitoscaf -h
usage: mitoz findmitoscaf [-h] --fastafile <file> [--fq1 <file>] [--fq2 <file>] --outprefix <STR>
                          [--workdir <STR>] [--thread_number <INT>] [--profiles_dir <STR>] [--slow_search]
                          [--filter_by_taxa] --requiring_taxa <STR> [--requiring_relax {0,1,2,3,4,5,6}]
                          [--min_abundance <float>] [--abundance_pattern <STR>] [--skip_read_mapping]
                          [--genetic_code <INT>]
                          [--clade {Chordata,Arthropoda,Echinodermata,Annelida-segmented-worms,Bryozoa,Mollusca,Nematoda,Nemertea-ribbon-worms,Porifera-sponges}]

Search for mitochondrial sequences from input fasta file.

optional arguments:
  -h, --help            show this help message and exit
  --fastafile <file>    Input fasta file. Gzip supported. [required]
  --fq1 <file>          Input fastq 1 file. use this option if the headers of your '--fastafile' does NOT have
                        abundance information BUT you WANT to filter sequence by their sequencing abundances
                        [optional]
  --fq2 <file>          Input fastq 2 file. use this option if the headers of your '--fastafile' does NOT have
                        abundance information BUT you WANT to filter sequence by their sequencing abundances
                        [optional]
  --outprefix <STR>     output prefix
  --workdir <STR>       workdir [./]
  --thread_number <INT>
                        thread number [8]
  --profiles_dir <STR>  Directory cotaining 'CDS_HMM/', 'MT_database/' and 'rRNA_CM/'.
                        [/home/gmeng/.conda/envs/mybase/envs/mitozEnv.test3.6/lib/python3.8/site-
                        packages/mitoz/profiles]
  --slow_search         By default, we firstly use tiara to perform quick sequence classification (100 times
                        faster than usual!), however, it is valid only when your mitochondrial sequences are >=
                        3000 bp. If you have missing genes, set '--slow_search' to use the tradicitiona search
                        mode. [False]
  --filter_by_taxa      filter out non-requiring_taxa sequences by mito-PCGs annotation to do taxa
                        assignment.[True]
  --requiring_taxa <STR>
                        filtering out non-requiring taxa sequences which may be contamination [required]
  --requiring_relax {0,1,2,3,4,5,6}
                        The relaxing threshold for filtering non-target-requiring_taxa. The larger digital means
                        more relaxing. [0]
  --min_abundance <float>
                        the minimum abundance of sequence required. Set this to any value <= 0 if you do NOT
                        want to filter sequences by abundance [10]
  --abundance_pattern <STR>
                        the regular expression pattern to capture the abundance information in the header of
                        sequence ['abun\=([0-9]+\.*[0-9]*)']
  --skip_read_mapping   Skip read-mapping step, assuming we can extract the abundance from seqid line. [False]
  --genetic_code <INT>  which genetic code table to use? 'auto' means determined by '--clade' option. [auto]
  --clade {Chordata,Arthropoda,Echinodermata,Annelida-segmented-worms,Bryozoa,Mollusca,Nematoda,Nemertea-ribbon-worms,Porifera-sponges}
                        which clade does your species belong to? [Arthropoda]
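
The usage line above marks --fastafile, --outprefix and --requiring_taxa as required. As a rough sketch only, a call could be assembled from Python as below. The helper name build_findmitoscaf_command, the file name assembly.fa and the prefix sample1 are hypothetical placeholders, not part of MitoZ; only flags listed in the help text are used.

```python
import shlex


def build_findmitoscaf_command(fastafile, outprefix, requiring_taxa,
                               clade=None, thread_number=None):
    """Assemble an argument list for 'mitoz findmitoscaf' (hypothetical helper)."""
    # The three options that the help text marks as required.
    cmd = [
        "mitoz", "findmitoscaf",
        "--fastafile", fastafile,
        "--outprefix", outprefix,
        "--requiring_taxa", requiring_taxa,
    ]
    # Optional flags are appended only when a value is given,
    # so MitoZ falls back to its own defaults otherwise.
    if clade is not None:
        cmd += ["--clade", clade]
    if thread_number is not None:
        cmd += ["--thread_number", str(thread_number)]
    return cmd


# Placeholder inputs: replace with your own assembly and taxon.
cmd = build_findmitoscaf_command(
    "assembly.fa", "sample1", "Arthropoda",
    clade="Arthropoda", thread_number=8,
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once MitoZ is installed; printing it first makes it easy to check the flags before running.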

About the input fasta file:

  • The sequence header lines of your input FASTA file should carry abundance information, for example, >Contig1 abun=38.2. If your headers carry abundance information in a form that does not match the default regular expression abun\=([0-9]+\.*[0-9]*), change the value of --abundance_pattern so that MitoZ can extract the abundance from the sequence header line.

    For example, if your sequence header lines look like >Contig1 coverage:38.2, you can set --abundance_pattern 'coverage\:([0-9]+\.*[0-9]*)'.

  • If your sequence header lines do not have abundance information but you still want to filter the sequences by abundance, set the --fq1 and --fq2 options. Filtering by abundance is optional.

  • If you do not want to filter by abundance, set --min_abundance 0, and you do not need to set the --fq1 and --fq2 options.
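
Before running MitoZ, it can help to check that a pattern actually captures the abundance from your headers. The sketch below assumes the pattern is applied as a Python regular expression whose first capture group holds the number, and that a sequence is kept when its abundance is at least --min_abundance; the function names extract_abundance and passes_abundance_filter are hypothetical and are not MitoZ's own code.

```python
import re

# Default value of --abundance_pattern, copied from the help text.
DEFAULT_PATTERN = r'abun\=([0-9]+\.*[0-9]*)'


def extract_abundance(header, pattern=DEFAULT_PATTERN):
    """Return the abundance captured by the first group, or None if no match."""
    match = re.search(pattern, header)
    if match is None:
        return None
    return float(match.group(1))


def passes_abundance_filter(header, min_abundance=10.0, pattern=DEFAULT_PATTERN):
    """Assumed filter logic: a threshold <= 0 disables filtering entirely."""
    if min_abundance <= 0:
        return True
    abundance = extract_abundance(header, pattern)
    return abundance is not None and abundance >= min_abundance


# Header in the default style.
print(extract_abundance(">Contig1 abun=38.2"))

# Header in a different style needs a custom pattern.
coverage_pattern = r'coverage\:([0-9]+\.*[0-9]*)'
print(extract_abundance(">Contig1 coverage:38.2"))                    # no match
print(extract_abundance(">Contig1 coverage:38.2", coverage_pattern))  # matches
```

If extract_abundance returns None for your headers, either adjust the pattern or fall back to one of the other two options described above.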
