Skip to content

Tutorial

Guanliang MENG edited this page Sep 3, 2023 · 74 revisions

Here we have a specimen called DM01, with its fastq files DM01_1.fastq.gz and DM01_2.fastq.gz (which are called raw data):

$ mkdir -p /share/pool/guanliang/test/DM01
$ cd /share/pool/guanliang/test/DM01
$ ls -hrtl
-rw-rw-r-- 1 gmeng 5.6G Jun 29 00:50 DM01_2.fastq.gz
-rw-rw-r-- 1 gmeng 4.9G Jun 29 00:50 DM01_1.fastq.gz
-rw-rw-r-- 1 gmeng  416 Jun 29 01:03 all.sh

The raw data

  • Can be paired-end (PE) or single-end (SE)

    • SE data does not work with the --assembler spades option.
    • If you are using PE data but your read 1 and read 2 files are not properly paired, the assemblers and BWA will complain because you are going to use the wrong pair-end information for scaffolding, etc. To fix this problem, you can: (1) use some tools (see here) to fix your read 1 and read 2 files before using MitoZ. (2) treat your data as SE data by using only the -fq1 but not the -fq2 option in MitoZ. See this for more information.
  • Generated typically by Illumina or BGISEQ

  • Typically from whole-genome shotgun (WGS) sequencing of the total DNA of a single species, but I suppose long-range PCR + NGS data is also great.

  • Target-enrichment data or transcriptome data might also work, but the success rate might not be as high as DNA data.

  • Fastq data of pooled multiple species might also work since MitoZ will output any sequences with >= 5 protein-coding genes (PCGs)

  • You do NOT need to trim and filter the raw data by yourself, just provide MitoZ with them directly (you can do it of course, but it may be just wasting your time).

    • In MitoZ ≥3.5, we use --data_size_for_mt_assembly <float1>,<float2> format. The float1 means the size (Gbp) of raw data to be subsampled, while the float2 means the size of clean data should be >= float2 Gbp, otherwise MitoZ will STOP running. When only float1 is set, float2 is assumed to be 0. (1) Set float1 to be 0 if you want to use ALL raw data; (2) Set 0,0 if you want to use ALL raw data and do NOT interrupt MitoZ even if you got very little clean data.

    • If you want to skip the filtering step, set --skip_filter --data_size_for_mt_assembly <float1>,<float2> at the same time, MitoZ will subsample <float2> Gbp of the input clean fastq data (<float1> is useless here).

Have a look at the command we are going to use:

$ cat all.sh

source activate /share/pool/guanliang/soft/mitoz3.6

mitoz all  \
--outprefix DM01 \
--thread_number 16 \
--clade Chordata \
--genetic_code 2 \
--species_name "Homo sapiens" \
--fq1 DM01_1.fastq.gz \
--fq2 DM01_2.fastq.gz \
--fastq_read_length 151 \
--data_size_for_mt_assembly 3,0 \
--assembler megahit \
--kmers_megahit 59 79 99 \
--memory 50 \
--requiring_taxa Chordata

# WARNING: it is better NOT to use the default kmers of `--kmers_megahit`!
# Instead, try larger kmers FIRST!
# If they do not work,
# then try smaller Kmers, e.g. `--kmers_megahit 43 71 99`

# Small kmers can lead to wrong results, e.g. quite a long mitogenome but low sequencing coverage in some regions.
  • Change the values of --clade and --genetic_code according to your sample.

  • You can set some values for --species_name, or just do not set this option. This option is only used when generating a GenBank file.

  • If your DM01_1.fastq.gz and DM01_2.fastq.gz are not in the current directory (i.e. /share/pool/guanliang/test/DM01), you need to specify the absolute paths or relative paths for them, e.g. --fq1 /path/to/DM01_1.fastq.gz --fq2 /path/to/DM01_2.fastq.gz.

  • Set the read length of your fastq data with --fastq_read_length, here my read length is 151.

  • Tell MitoZ to use only 3G bp of raw fastq data for mitogenome assembly: --data_size_for_mt_assembly 3,0, or --data_size_for_mt_assembly 3. The default value is 2,0.

  • Here we use Megahit for assembly: --assembler megahit.

  • We set kmers for Megahit: --kmers_megahit 59 79 99 119 141, which can avoid possible problems here: https://github.com/linzhi2013/MitoZ/wiki/Known-issues#8-megahit-gets-very-long-sequences and https://github.com/linzhi2013/MitoZ/issues/172#issuecomment-1608575634 . The default value is --kmers_megahit 21 29 39 59 79 99 119 141, which can cause wrong mitogenomes sometimes, therefore, I would recommend trying larger kmers first, and if they do not work, you can try smaller kmers, say --kmers_megahit 39 59 79 99 119 141. More kmers will take longer running time. I will update the default Kmer values in the next release. Tips: the kmer values should be smaller than your maximum read length.

  • If you are going to use Spades for the assembly, make sure that your fq1 and fq2 files are properly paired. Otherwise, the Spades will not work. See here for more details and solutions.

  • And we ask Megahit to use a maximum of 50 GB RAM: --memory 50. However, if your server does not have enough RAM, you might get some errors like https://github.com/linzhi2013/MitoZ/issues/174#issuecomment-1544089254.

  • Finally, set a required taxonomic name (e.g. order, class, family, genus) for filtering out non-target-taxon mitochondrial sequences: --requiring_taxa Chordata. Here, my specimen belongs to Chordata.

Because I used the SGE cluster to run the job, I submitted the job with:

$ qsub -cwd -l vf=50g -q small.q,medium.q,large.q -pe smp 16 -N DM01 all.sh

The above command will print out a JOBID, with which you can check how much RAM MitoZ is using from time to time (only if you want):

$ qstat -j JOBID

However, if you are going to run MitoZ locally, do:

$ nohup sh all.sh >m.log 2>m.err &

You can then use the top or htop (https://anaconda.org/conda-forge/htop) commands to check how much RAM MitoZ is using from time to time.

And wait for the running to finish.

Why do we check how much RAM MitoZ is using? Just to make sure if your servers have enough resources for it, and to know how many specimens you can run at the same time (especially if you run MitoZ locally). See also https://github.com/linzhi2013/MitoZ/wiki/Installation#4-about-the-thread-number-and-data-size.

When the run finished, I got:

$ ls -lhrt
total 11G
-rw-rw-r-- 1 gmeng 5.6G Jun 29 00:50 DM01_2.fastq.gz
-rw-rw-r-- 1 gmeng 4.9G Jun 29 00:50 DM01_1.fastq.gz
-rw-rw-r-- 1 gmeng  416 Jun 29 01:03 all.sh
drwxrwxr-x 2 gmeng  282 Jun 29 02:33 clean_data
drwxrwxr-x 4 gmeng   60 Jun 29 05:49 mt_assembly
drwxrwxr-x 4 gmeng  119 Jun 29 05:54 mt_annotation
-rw-r--r-- 1 gmeng 6.5K Jun 29 06:02 all.sh.o563317
-rw-rw-r-- 1 gmeng  45K Jun 29 06:02 mitoz.log
drwxrwxr-x 4 gmeng   91 Jun 29 06:02 DM01.result
-rw-r--r-- 1 gmeng 3.3M Jun 29 06:02 all.sh.e563317

The final resulting files are in DM01.result. See the 5. The final resulting files section and sections after it below for more details.

I will explain these files step by step.

1. The raw data filtering step

The clean_data directory:

$ ls -lhrt clean_data/
total 20G
-rw-rw-r-- 1 gmeng gmeng 137K Jun 29 02:28 DM01.QC.json
-rw-rw-r-- 1 gmeng gmeng 483K Jun 29 02:28 DM01.QC.html
-rw-rw-r-- 1 gmeng gmeng 4.7G Jun 29 02:28 DM01.clean_R1.fq.gz
-rw-rw-r-- 1 gmeng gmeng 5.3G Jun 29 02:28 DM01.clean_R2.fq.gz
-rw-rw-r-- 1 gmeng gmeng    0 Jun 29 02:28 FastpTask.o.log
-rw-rw-r-- 1 gmeng gmeng    0 Jun 29 02:28 FastpTask.e.log
-rw-rw-r-- 1 gmeng gmeng    0 Jun 29 02:28 FastpTask.done
-rw-rw-r-- 1 gmeng gmeng 4.5G Jun 29 02:33 sampling_3.0GB.clean_R1.fq.gz
-rw-rw-r-- 1 gmeng gmeng 5.2G Jun 29 02:37 sampling_3.0GB.clean_R2.fq.gz

In the all.sh script, we provided MitoZ with the raw data files DM01_1.fastq.gz and DM01_2.fastq.gz, and MitoZ used the fastp program to clean them, the clean data files are DM01.clean_R1.fq.gz and DM01.clean_R2.fq.gz.

You can have a look at the filter report files DM01.QC.json and DM01.QC.html if you like.

In the all.sh script, we set --data_size_for_mt_assembly 3, so MitoZ extracted 3G bp data for us automatically from the clean data files, they are sampling_3.0GB.clean_R1.fq.gz and sampling_3.0GB.clean_R2.fq.gz, which were used for subsequent mitogenome assembly.

2. The mitogenome assembly step

The mt_assembly directory:

$ ls -lhrt mt_assembly
total 8.0K
drwxrwxr-x 3 gmeng gmeng 4.0K Jun 29 05:50 megahit
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 05:50 DM01.megahit.result

In the all.sh script, we asked MitoZ to use Megahit to perform assembly (--assembler megahit), so MitoZ created a subdirectory called megahit here.

Temporary files for getting the mitochondrial genomes are in the mt_assembly/megahit/ directory:

$ ls -lhrt mt_assembly/megahit/
total 74M
drwxrwxr-x 3 gmeng gmeng  195 Jun 29 05:37 megahit_out
-rw-rw-r-- 1 gmeng gmeng 102K Jun 29 05:40 DM01.megahit.hmmout
-rw-rw-r-- 1 gmeng gmeng 6.6K Jun 29 05:40 DM01.megahit.hmmtblout
-rw-rw-r-- 1 gmeng gmeng 5.5K Jun 29 05:40 DM01.megahit.hmmtblout.besthit
-rw-rw-r-- 1 gmeng gmeng 1.8K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim
-rw-rw-r-- 1 gmeng gmeng 272K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa
-rw-rw-r-- 1 gmeng gmeng  20K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.ndb
-rw-rw-r-- 1 gmeng gmeng   80 Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.nto
-rw-rw-r-- 1 gmeng gmeng  16K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.ntf
-rw-rw-r-- 1 gmeng gmeng  236 Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.not
-rw-rw-r-- 1 gmeng gmeng  15K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds
-rw-rw-r-- 1 gmeng gmeng 6.7K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.pep
-rw-rw-r-- 1 gmeng gmeng 2.4K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds.taxa
-rw-rw-r-- 1 gmeng gmeng 1.5K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds.taxa.sorted.uniq
-rw-rw-r-- 1 gmeng gmeng  14K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds.filtered
-rw-rw-r-- 1 gmeng gmeng  23K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa
-rw-rw-r-- 1 gmeng gmeng  884 Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa
-rw-rw-r-- 1 gmeng gmeng  73M Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.bwa.bam
-rw-rw-r-- 1 gmeng gmeng 490K Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.depth
-rw-rw-r-- 1 gmeng gmeng  23K Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.abun.fa
-rw-rw-r-- 1 gmeng gmeng   97 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance
-rw-rw-r-- 1 gmeng gmeng  787 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.high_abundance_10.0X
-rw-rw-r-- 1 gmeng gmeng  548 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta
-rw-rw-r-- 1 gmeng gmeng  407 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like
-rw-rw-r-- 1 gmeng gmeng  463 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted
-rw-rw-r-- 1 gmeng gmeng  219 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.dotted
-rw-rw-r-- 1 gmeng gmeng  388 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked
-rw-rw-r-- 1 gmeng gmeng  166 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.stat
-rw-rw-r-- 1 gmeng gmeng  17K Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
-rw-rw-r-- 1 gmeng gmeng 5.9K Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa
-rw-rw-r-- 1 gmeng gmeng   75 Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked
-rw-rw-r-- 1 gmeng gmeng  17K Jun 29 05:50 DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng gmeng  191 Jun 29 05:50 DM01.megahit.overlap_information
-rw-rw-r-- 1 gmeng gmeng  17K Jun 29 05:50 DM01.megahit.start2end_for-circular-mt-only

In this directory, somes of the most useful files could be mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim and mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim.fa, which are the initial resulting files output by HMMER, and can be used for checking cross-contamination in your big projects and search for other mitogenomes if your dataset contains mitogenomes of multiple species. See https://github.com/linzhi2013/MitoZ/wiki/Some-important-intermediate-files.

Other subsequent sequence files: DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa, DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.abun.fa, DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa and DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa.

Meanwhile, MitoZ collected the import resulting files to the DM01.megahit.result directory. Have a look at it:

$ ls -lhrt mt_assembly/DM01.megahit.result/
total 64K
-rw-rw-r-- 1 gmeng gmeng   97 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance
-rw-rw-r-- 1 gmeng gmeng  548 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta
-rw-rw-r-- 1 gmeng gmeng   75 Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked
-rw-rw-r-- 1 gmeng gmeng 5.9K Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa
-rw-rw-r-- 1 gmeng gmeng  17K Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
-rw-rw-r-- 1 gmeng gmeng  17K Jun 29 05:50 DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng gmeng  191 Jun 29 05:50 DM01.megahit.overlap_information

MitoZ uses the HMMER program to fish out the candidate mitochondrial sequences, and then further filter them by the taxonomic assignment, sequences belonging to non-target-taxon are discarded (--requiring_taxa Chordata). Next, MitoZ separates the remained sequences by the abundance information, low-coverage sequences are output to DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance and DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta, in case the users want to search for some missing genes from them.

$ cat mt_assembly/DM01.megahit.result/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance
k141_626356	COX3	658	784	506	378	506	-	abun=1.0000
k141_626356	ND3	1	307	308	1	506	-	abun=1.0000

For the high-coverage sequences, MitoZ further picks the sequences based on the completeness of protein-coding genes (PCGs). So we got the files DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa vs. DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked and DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa.

$ cat mt_assembly/DM01.megahit.result/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked
>k141_402622 abun=10.9365 length=5830 score=3.835
ND5	3319	3805	20	588	+	4

Although this sequence (k141_402622) has high coverage of 10.9365, but there better sequences:

$ cat  mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked
>k141_250555 abun=1110.0000 length=16594 score=19535.356
ND3	1	309	1	308	-	4
COX3	379	1163	1	784	-	4
ATP6	1163	1846	1	683	-	4
ATP8	1837	2004	1	160	-	4
COX2	2086	2769	1	683	-	4
COX1	2927	4482	1	1542	-	4
ND2	4885	5928	1	1034	-	4
ND1	6139	7113	11	971	-	4
CYTB	10887	12027	1	1138	-	4
ND6	12099	12620	1	509	+	4
ND5	12615	14454	12	1828	-	4
ND4	14674	16054	1	1370	-	4
ND4L	16048	16344	2	297	-	4

The digit 4 of the last column indicates this gene is complete compared to a defined length for the gene (For me, the definition file is /share/pool/guanliang/soft/mitoz3.4/lib/python3.8/site-packages/mitoz/profiles/CDS_HMM/Chordata_CDS_length_list)

Finally, MitoZ checks if the picked mitogenome (DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa) is complete:

$ grep '>' mt_assembly/DM01.megahit.result/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
>k141_250555 abun=1110.0000 length=16594 score=19535.356

If there are overlapping between the 5'-end and 3'-end of the sequence, MitoZ will remove the overlapping region, generating two final resulting files for the assembly step:

$ ls -lhrt  mt_assembly/DM01.megahit.result/DM01.megahit.overlap_information mt_assembly/DM01.megahit.result/DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng 17K Jun 29 05:50 mt_assembly/DM01.megahit.result/DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng 191 Jun 29 05:50 mt_assembly/DM01.megahit.result/DM01.megahit.overlap_information

The DM01.megahit.mitogenome.fa file is the mitochondrial genome file, which will be used for subsequent analysis.

$ grep '>' mt_assembly/DM01.megahit.result/DM01.megahit.mitogenome.fa
>k141_250555 topology=circular

The DM01.megahit.overlap_information shows the overlapping information:

$ cat mt_assembly/DM01.megahit.result/DM01.megahit.overlap_information
>k141_250555 overlap between 5' and 3' are 142bp
AAGCCCGAGGGTTAGAAGTACTAGAATAGCTGATGCTCAAAAGAATGTTGTTACAGGAGAAGGGAGTTGGTCTCCTCAAGGGAGAGGAAGTAATAAGGCAATTTCTAGGTCAAACAAGAGAAATAGGATCGCAACTAGGAA

Because the overlapping region is quite long (142bp), and the sequence content is not simple repeats, we have high confidence in saying the mitogenome is complete. See also https://github.com/linzhi2013/MitoZ/wiki/Known-issues#1-circularity-problem-and-its-solution and https://github.com/linzhi2013/MitoZ/wiki/Known-issues#2-simple-repeats-found-in-the-overlapping-region.

Besides, you can have a look at the sequencing depth along with the sequences:

$ head mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.depth
k141_402622	1	38
k141_402622	2	38
k141_402622	3	38
k141_402622	4	39
k141_402622	5	39
k141_402622	6	40
k141_402622	7	40
k141_402622	8	40
k141_402622	9	40
k141_402622	10	40

The header's meaning: seqid position_of_the_sequence coverage.

If there are some very high or very low coverage, which may indicate there are Ns or repeats and need further check. The completeness will be further verified in the final visualization step, from the SVG/PNG files you can see the coverage distribution clearly (see later section).

3. The annotation step

The mt_annotation directory:

$ ls -hlrt mt_annotation/
total 8.0K
drwxrwxr-x 3 gmeng gmeng 4.0K Jun 29 05:54 tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 06:02 DM01.DM01.megahit.mitogenome.fa.result

Let's have a look at the mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa first:

$ ls -lhrt mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
total 404K
-rw-rw-r-- 1 gmeng  17K Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
-rw-rw-r-- 1 gmeng   20 Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.not
-rw-rw-r-- 1 gmeng  20K Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.ndb
-rw-rw-r-- 1 gmeng    8 Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.nto
-rw-rw-r-- 1 gmeng  16K Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.ntf
-rw-rw-r-- 1 gmeng  13K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.cds.position.cds
-rw-rw-r-- 1 gmeng 5.4K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.pep
-rw-rw-r-- 1 gmeng  503 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position
-rw-rw-r-- 1 gmeng 1.5K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.cds.position.cds.taxa
-rw-rw-r-- 1 gmeng   29 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
-rw-rw-r-- 1 gmeng  503 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position.sorted
-rw-rw-r-- 1 gmeng  388 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position.sorted.revised
-rw-rw-r-- 1 gmeng  378 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position.sorted.revised.filtered
-rw-rw-r-- 1 gmeng 1.5K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.ft
-rw-rw-r-- 1 gmeng 1.4K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.trna
-rw-rw-r-- 1 gmeng 1.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.trna.ft
-rw-rw-r-- 1 gmeng 1.4K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.tbl
-rw-rw-r-- 1 gmeng  11K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.out
-rw-rw-r-- 1 gmeng   95 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.ft
-rw-rw-r-- 1 gmeng 1.5K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.tbl
-rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.out
-rw-rw-r-- 1 gmeng   95 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.ft
-rw-rw-r-- 1 gmeng   77 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.control_region.ft
-rw-rw-r-- 1 gmeng 3.2K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
-rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.fsa
-rw-rw-r-- 1 gmeng  79K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.sqn
-rw-rw-r-- 1 gmeng  36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
-rw-rw-r-- 1 gmeng  494 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.val
-rw-rw-r-- 1 gmeng   79 Jun 29 05:54 errorsummary.val
-rw-rw-r-- 1 gmeng 4.9K Jun 29 05:54 summary.txt
-rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
-rw-rw-r-- 1 gmeng  12K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
-rw-rw-r-- 1 gmeng 2.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
-rw-rw-r-- 1 gmeng 2.7K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
-rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
-rw-rw-r-- 1 gmeng 4.3K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
drwxrwxr-x 2 gmeng 4.0K Jun 29 06:02 mt_visualization

The input mitogenome file is renamed as mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.

The initial PCG annotation result file:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.cds.position.cds`
>gi_NC_004387_ATP8_Homo_sapiens_55_aa_Ref1:55aa	[mRNA]	locus=k141_250555:1840:2004:-
ATGCCCCAGTTAAATCCCGCCCCCTGATTTGCTATTATAATCTTTTCATG
ACTAATCTTTTTAACAGTTATTCCACCTAAAGTCTTAGCTCACCACTTTC
CTAATGAACCTACCCCCCAAAACGTAAAAAAATCAAAATCAGAAACCTGA
ACCTGACCATGACTG
>gi_NC_004387_ND2_Homo_sapiens_350_aa_Ref1:347aa	[mRNA]	locus=k141_250555:4891:5928:-
ATGAATCCTTACGTCCTCTCAGCCTTACTAATAGGCCTAGGCCTTGGAAC
CACAGTTACATTTGCTAGCTCCCACTGGTTATTAGCCTGGATAGGCCTTG
AAATAAACACACTGGCCATCCTCCCTCTCATAGCACAACACCACCACCCC
CGAGCCATTGAAGCTACAACCAAATACTTCTTAATTCAAGCATCCGCTGC
CGCTACCATCTTGTTTGCCAGCTCAACAAATGCCTGACTGTCCGGACAAT
GAGAAATTCAACAAATTACTCACCCCCTTCCCATTGTTATAATCACCATT
...

The CDS feature table:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.ft
>Feature k141_250555
309	>1	gene
			gene	ND3
309	>1	CDS
			product	NADH dehydrogenase subunit 3
			transl_table	2
1163	379	gene
			gene	COX3
1163	379	CDS
			product	cytochrome c oxidase subunit III
			transl_table	2
			note	TAA stop codon is completed by the addition of 3' Aresidues to the mRNA
...

The tRNA feature table:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.trna.ft
>Feature k141_250555
379	310	gene
			gene	trnG(ucc)
379	310	tRNA
			product	tRNA-Gly
2078	2006	gene
			gene	trnK(uuu)
...

The s-rRNA feature table:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.ft
>Feature k141_250555
9891	8945	gene
			gene	s-rRNA
9891	8945	rRNA
			product	12S ribosomal RNA

The l-rRNA feature table:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.ft
>Feature k141_250555
8871	7188	gene
			gene	l-rRNA
8871	7188	rRNA
			product	16S ribosomal RNA

The control region feature table:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.control_region.ft
>Feature k141_250555
9961	10745	misc_feature
			note	putative control region

The combined feature table:

$ cat  mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
>Feature k141_250555
309	>1	gene
			gene	ND3
309	>1	CDS
			product	NADH dehydrogenase subunit 3
			transl_table	2

which will be copied to mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl.

Some information about the mitogenome:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.val
ERROR: valid [SEQ_FEAT.NoStop] Missing stop codon FEATURE: CDS: cytochrome c oxidase subunit III [lcl|k141_250555:c1163-379] [lcl|k141_250555: raw, dna len= 16453] -> [lcl|k141_250555_2]
ERROR: valid [SEQ_FEAT.NoStop] Missing stop codon FEATURE: CDS: cytochrome b [lcl|k141_250555:c12027-10887] [lcl|k141_250555: raw, dna len= 16453] -> [lcl|k141_250555_9]
WARNING: valid [SEQ_INST.CompleteCircleProblem] Circular topology without complete flag set BIOSEQ: lcl|k141_250555: raw, dna len= 16453

and

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/errorsummary.val
     2 ERROR:   SEQ_FEAT.NoStop
     1 WARNING: SEQ_INST.CompleteCircleProblem

The most important files generated at the annotation step are copied to the directory:

$ ls -lhrt mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/
total 5.7M
-rw-rw-r-- 1 gmeng gmeng   29 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
-rw-rw-r-- 1 gmeng gmeng 3.2K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
-rw-rw-r-- 1 gmeng gmeng   79 Jun 29 05:54 errorsummary.val
-rw-rw-r-- 1 gmeng gmeng  36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
-rw-rw-r-- 1 gmeng gmeng  12K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
-rw-rw-r-- 1 gmeng gmeng 4.3K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
-rw-rw-r-- 1 gmeng gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
-rw-rw-r-- 1 gmeng gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
-rw-rw-r-- 1 gmeng gmeng 2.7K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
-rw-rw-r-- 1 gmeng gmeng 2.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
-rw-rw-r-- 1 gmeng gmeng 4.9K Jun 29 05:54 summary.txt

The GenBank file generated at this step will be used for visualization.

$ head mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
LOCUS       k141_250555            16453 bp    DNA     circular     29-JUN-2022
DEFINITION  topology=circular.
ACCESSION
VERSION
KEYWORDS    .

4. The visualization step

In the sub-sub-sub directory called mt_visualization:

$ ls -lhrt mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/
total 27M
-rw-rw-r-- 1 gmeng  36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.sortedByLength
-rw-rw-r-- 1 gmeng 1.2K Jun 29 05:54 visual_config.txt
-rw-rw-r-- 1 gmeng 1.4K Jun 29 05:54 circos.map.sh
-rw-rw-r-- 1 gmeng   34 Jun 29 05:54 circos.strand.text.txt
-rw-rw-r-- 1 gmeng  17K Jun 29 05:54 circos.mito.fa
-rw-rw-r-- 1 gmeng   36 Jun 29 05:54 circos.karyotype.txt
-rw-rw-r-- 1 gmeng  828 Jun 29 05:54 circos.gene.text.txt
-rw-rw-r-- 1 gmeng 5.4K Jun 29 05:54 circos.features.txt
-rw-rw-r-- 1 gmeng  13M Jun 29 06:02 circos.bam
-rw-rw-r-- 1 gmeng 8.3M Jun 29 06:02 circos.sorted.bam
-rw-rw-r-- 1 gmeng 231K Jun 29 06:02 circos.dep
-rw-rw-r-- 1 gmeng 316K Jun 29 06:02 circos.depth.txt
-rw-rw-r-- 1 gmeng 3.7K Jun 29 06:02 circos.conf
-rw-rw-r-- 1 gmeng 342K Jun 29 06:02 circos.png
-rw-rw-r-- 1 gmeng 4.9M Jun 29 06:02 circos.svg

A fasta mitogenome file (called circos.mito.fa) was extracted from mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf from last step.

The subsampled 3G bp clean data were then mapped against this mitogenome: circos.map.sh.

Finally, MitoZ creates a circos.conf file the Circos program to run with:

$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.conf

<<include etc/colors_fonts_patterns.conf>>

#-----------------image------------------
<image>
###<<include etc/image.conf>>
dir   = /share/pool/guanliang/test/DM01/mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization
file  = circos.png
png   = yes
svg   = yes
...

The most important result files in this step are:

$ ls -lhrt mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/{circos.depth.txt,circos.png,circos.svg}
-rw-rw-r-- 1 gmeng gmeng 316K Jun 29 06:02 mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.depth.txt
-rw-rw-r-- 1 gmeng gmeng 342K Jun 29 06:02 mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png
-rw-rw-r-- 1 gmeng gmeng 4.9M Jun 29 06:02 mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg

They have been copied to the mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/ directory. Therefore, now we have annotation and visualization result files together:

$ ls -rlht mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/
total 5.7M
-rw-rw-r-- 1 gmeng   29 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
-rw-rw-r-- 1 gmeng 3.2K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
-rw-rw-r-- 1 gmeng   79 Jun 29 05:54 errorsummary.val
-rw-rw-r-- 1 gmeng  36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
-rw-rw-r-- 1 gmeng  12K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
-rw-rw-r-- 1 gmeng 4.3K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
-rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
-rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
-rw-rw-r-- 1 gmeng 2.7K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
-rw-rw-r-- 1 gmeng 2.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
-rw-rw-r-- 1 gmeng 4.9K Jun 29 05:54 summary.txt
-rw-rw-r-- 1 gmeng 342K Jun 29 06:02 circos.png
-rw-rw-r-- 1 gmeng 4.9M Jun 29 06:02 circos.svg
-rw-rw-r-- 1 gmeng 316K Jun 29 06:02 circos.depth.txt

5. The final resulting files

Users normally look at these files only.

$ ls -lhrt DM01.result/
total 8.0K
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 06:02 DM01.megahit.result
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 06:02 DM01.DM01.megahit.mitogenome.fa.result

The DM01.result/DM01.megahit.result directory is copied directly from the 2. The mitogenome assembly step. At least you should check the DM01.result/DM01.megahit.result/DM01.megahit.overlap_information file and see if the overlapping is good enough (Please refer to 2. The mitogenome assembly step)

The DM01.result/DM01.DM01.megahit.mitogenome.fa.result directory is copied directly from the 3. The annotation step and 4. The visualization step.

The summary file:

$ cat summary.txt
#Seq_id        Length(bp)     Circularity    Closely_related_species
k141_250555    16453          yes            Homo sapiens


#Seq_id        Start  End    Length(bp) Direction  Type   Gene_name  Gene_prodcut                      Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_250555    <1     309    309        -          CDS    ND3        NADH dehydrogenase subunit 3      1
k141_250555    310    379    70         -          tRNA   trnG(ucc)  tRNA-Gly                          1
k141_250555    379    1163   785        -          CDS    COX3       cytochrome c oxidase subunit III  1
k141_250555    1163   1846   684        -          CDS    ATP6       ATP synthase F0 subunit 6         1
k141_250555    1837   2004   168        -          CDS    ATP8       ATP synthase F0 subunit 8         1
k141_250555    2006   2078   73         -          tRNA   trnK(uuu)  tRNA-Lys                          1
k141_250555    2065   2769   705        -          CDS    COX2       cytochrome c oxidase subunit II   1
k141_250555    2774   2843   70         -          tRNA   trnD(guc)  tRNA-Asp                          1
k141_250555    2853   2924   72         +          tRNA   trnS(uga)  tRNA-Ser                          2
k141_250555    2926   4482   1557       -          CDS    COX1       cytochrome c oxidase subunit I    1
k141_250555    4484   4552   69         +          tRNA   trnY(gua)  tRNA-Tyr                          1
k141_250555    4553   4619   67         +          tRNA   trnC(gca)  tRNA-Cys                          1
k141_250555    4655   4727   73         +          tRNA   trnN(guu)  tRNA-Asn                          1
k141_250555    4729   4797   69         +          tRNA   trnA(ugc)  tRNA-Ala                          1
k141_250555    4799   4870   72         -          tRNA   trnW(uca)  tRNA-Trp                          1
k141_250555    4879   5928   1050       -          CDS    ND2        NADH dehydrogenase subunit 2      1
k141_250555    5929   5998   70         -          tRNA   trnM(cau)  tRNA-Met                          1
k141_250555    5998   6068   71         +          tRNA   trnQ(uug)  tRNA-Gln                          1
k141_250555    6068   6135   68         -          tRNA   trnI(gau)  tRNA-Ile                          1
k141_250555    6139   7113   975        -          CDS    ND1        NADH dehydrogenase subunit 1      1
k141_250555    7114   7187   74         -          tRNA   trnL(uaa)  tRNA-Leu                          2
k141_250555    7188   8871   1684       -          rRNA   l-rRNA     16S ribosomal RNA                 1
k141_250555    8873   8944   72         -          tRNA   trnV(uac)  tRNA-Val                          1
k141_250555    8945   9891   947        -          rRNA   s-rRNA     12S ribosomal RNA                 1
k141_250555    9892   9960   69         -          tRNA   trnF(gaa)  tRNA-Phe                          1
k141_250555    10746  10815  70         +          tRNA   trnP(ugg)  tRNA-Pro                          1
k141_250555    10815  10886  72         -          tRNA   trnT(ugu)  tRNA-Thr                          1
k141_250555    10887  12027  1141       -          CDS    CYTB       cytochrome b                      1
k141_250555    12031  12098  68         +          tRNA   trnE(uuc)  tRNA-Glu                          1
k141_250555    12099  12620  522        +          CDS    ND6        NADH dehydrogenase subunit 6      1
k141_250555    12616  14454  1839       -          CDS    ND5        NADH dehydrogenase subunit 5      1
k141_250555    14455  14527  73         -          tRNA   trnL(uag)  tRNA-Leu                          2
k141_250555    14534  14603  70         -          tRNA   trnS(gcu)  tRNA-Ser                          2
k141_250555    14604  14673  70         -          tRNA   trnH(gug)  tRNA-His                          1
k141_250555    14669  16054  1386       -          CDS    ND4        NADH dehydrogenase subunit 4      1
k141_250555    16048  16344  297        -          CDS    ND4L       NADH dehydrogenase subunit 4L     1
k141_250555    16345  16413  69         -          tRNA   trnR(ucg)  tRNA-Arg                          1


Protein coding genes totally found: 13
tRNA genes totally found:           22
rRNA genes totally found:            2
--------------------------------------
Genes totally found:                37

The Genbank file:

$ head DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
LOCUS       k141_250555            16453 bp    DNA     circular     29-JUN-2022
DEFINITION  topology=circular.
ACCESSION
VERSION
KEYWORDS    .
SOURCE      mitochondrion Homo sapiens
  ORGANISM  Homo sapiens
            Unclassified.
REFERENCE   1  (bases 1 to 16453)
  AUTHORS   MENG,G.

The mitogenome in fasta format: DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta.

The most closely related species in MitoZ's PCG annotation database:

$ cat DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
k141_250555	Homo sapiens

The combined feature tables (including PCGs, tRNAs, rRNA genes and control region if there is):

$ head DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
>Feature k141_250555
309	>1	gene
			gene	ND3
309	>1	CDS
			product	NADH dehydrogenase subunit 3
			transl_table	2
1163	379	gene
			gene	COX3
1163	379	CDS
			product	cytochrome c oxidase subunit III

The summary file generated by tbl2asn:

$ cat errorsummary.val
     2 ERROR:   SEQ_FEAT.NoStop
     1 WARNING: SEQ_INST.CompleteCircleProblem

You should check if there are some warnings or errors in this file.

For example, internal stop codons for PCGs. If there are internal stop codons, check the DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta and DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta files.

Internal stop codons can be caused by:
(1) Ns in the PCGs;
(2) assembly error;
(3) wrong genetic code table for translation (--genetic_code 2);
(4) nuclear mitochondrial DNA segments (NUMTs). Have a look at this paper: Paul Hebert, Dan Bock, Sean Prosser. Interrogating 1000 Insect Genomes for NUMTs: A Risk Assessment for Species Scans. Authorea. July 27, 2022. https://doi.org/10.22541/au.165893766.64488370/v1.

Check the sequencing depth around the problematic regions: circos.depth.txt and circos.png/circos.svg files.

The sequence coverage distribution file:

$ head circos.depth.txt
mt1 1 1 1084
mt1 2 2 1092
mt1 3 3 1102
mt1 4 4 1113
mt1 5 5 1129
mt1 6 6 1139

The visualized mitogenome is in circos.svg and circos.png: circos

See Meanings of the different tracks on the Circos plot

Warning: You should check the coverage distribution of the mitogenome, extreme high or low (e.g 0X) regions could be unreliable, especially when using the megahit or spades assemblers! Larger kmers may fix the wrong assembly problem. See https://github.com/linzhi2013/MitoZ/issues/172#issuecomment-1600990445 for more details.

The CDS sequences in fasta format for the PCGs:

$ grep '>'  DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
>k141_250555;ND3;len=309;[<0:309](-)
>k141_250555;COX3;len=785;[378:1163](-)
>k141_250555;ATP6;len=684;[1162:1846](-)
>k141_250555;ATP8;len=168;[1836:2004](-)
>k141_250555;COX2;len=705;[2064:2769](-)
>k141_250555;COX1;len=1557;[2925:4482](-)
>k141_250555;ND2;len=1050;[4878:5928](-)
>k141_250555;ND1;len=975;[6138:7113](-)
>k141_250555;CYTB;len=1141;[10886:12027](-)
>k141_250555;ND6;len=522;[12098:12620](+)
>k141_250555;ND5;len=1839;[12615:14454](-)
>k141_250555;ND4;len=1386;[14668:16054](-)
>k141_250555;ND4L;len=297;[16047:16344](-)

and the corresponding protein sequences of CDS in fasta format:

$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
>k141_250555;ND3;len=103;[<0:309](-)
>k141_250555;COX3;len=261;[378:1163](-)
>k141_250555;ATP6;len=227;[1162:1846](-)
>k141_250555;ATP8;len=55;[1836:2004](-)
>k141_250555;COX2;len=234;[2064:2769](-)
>k141_250555;COX1;len=518;[2925:4482](-)
>k141_250555;ND2;len=349;[4878:5928](-)
>k141_250555;ND1;len=324;[6138:7113](-)
>k141_250555;CYTB;len=380;[10886:12027](-)
>k141_250555;ND6;len=173;[12098:12620](+)
>k141_250555;ND5;len=612;[12615:14454](-)
>k141_250555;ND4;len=461;[14668:16054](-)
>k141_250555;ND4L;len=98;[16047:16344](-)

All the gene (PCGs, tRNA, rRNA) sequences:

$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
>k141_250555;ND3;len=309;[<0:309](-)
>k141_250555;trnG(ucc);len=70;[309:379](-)
>k141_250555;COX3;len=785;[378:1163](-)
>k141_250555;ATP6;len=684;[1162:1846](-)
>k141_250555;ATP8;len=168;[1836:2004](-)
>k141_250555;trnK(uuu);len=73;[2005:2078](-)
>k141_250555;COX2;len=705;[2064:2769](-)
>k141_250555;trnD(guc);len=70;[2773:2843](-)
>k141_250555;trnS(uga);len=72;[2852:2924](+)
>k141_250555;COX1;len=1557;[2925:4482](-)
>k141_250555;trnY(gua);len=69;[4483:4552](+)

The rRNA sequences:

$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
>k141_250555;l-rRNA;len=1684;[7187:8871](-)
>k141_250555;s-rRNA;len=947;[8944:9891](-)

The tRNA sequences:

$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
>k141_250555;trnG(ucc);len=70;[309:379](-)
>k141_250555;trnK(uuu);len=73;[2005:2078](-)
>k141_250555;trnD(guc);len=70;[2773:2843](-)
>k141_250555;trnS(uga);len=72;[2852:2924](+)
>k141_250555;trnY(gua);len=69;[4483:4552](+)
>k141_250555;trnC(gca);len=67;[4552:4619](+)
>k141_250555;trnN(guu);len=73;[4654:4727](+)
...

6. Broken genes

You MUST check if a gene is complete in length, especially when your mitogenome is complete (circular). This is because the assembler just randomly chooses a breakpoint, and this could lead to broken genes (i.e., a gene is broken into two parts, one part is at the 5'-end of the fasta file, the second part is at the 3'-end of the fasta file.). Depending on the relative length of the two parts, MitoZ may fail to annotate either (or both?) of them!

6.1 Case 1

The quickest way is to check the summary.txt file and see if there are any > or < symbols in the Start and End columns.

Have a look at the summary.txt file:

$ cat summary.txt
#Seq_id        Length(bp)     Circularity    Closely_related_species
k141_250555    16453          yes            Homo sapiens


#Seq_id        Start  End    Length(bp) Direction  Type   Gene_name  Gene_prodcut                      Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_250555    <1     309    309        -          CDS    ND3        NADH dehydrogenase subunit 3      1
k141_250555    310    379    70         -          tRNA   trnG(ucc)  tRNA-Gly                          1
k141_250555    379    1163   785        -          CDS    COX3       cytochrome c oxidase subunit III  1

The ND3 gene is not complete, but the mitogenome is complete ('Circularity yes'), this happens since the assembler just "randomly" determines the breakpoint.

In this case, we need to change the breakpoint and re-annotate the mitogenome. See also https://github.com/linzhi2013/MitoZ/wiki/Known-issues#3-breakpoint-and-incomplete-genes.

Here I will make the breakpoint as the beginning of the ND1 gene because between 6135 and 6139, there are no genes:

k141_250555    6068   6135   68         -          tRNA   trnI(gau)  tRNA-Ile                          1
k141_250555    6139   7113   975        -          CDS    ND1        NADH dehydrogenase subunit 1      1
$ mdkir -p /share/pool/guanliang/test/DM01/moved_breakpoint
$ cd /share/pool/guanliang/test/DM01/moved_breakpoint

$ mitoz-tools gbfiletool sort -h # >=MitoZ v3.5
$ mitoz-tools gbfiletool sort -i ../DM01.result/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf \
-o DM01.moved-breakpoint.gbf \
-g ND1


# the 'genbank_file_tool.py' script can be found in
# https://github.com/linzhi2013/MitoZ/tree/master/version_2.4-alpha/useful_scripts
# I will add it into the mitoz-tools command later

$ wget https://raw.githubusercontent.com/linzhi2013/MitoZ/master/version_2.4-alpha/useful_scripts/genbank_file_tool.py
$ python3 genbank_file_tool.py sort \
-i ../DM01.result/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf \
-o DM01.moved-breakpoint.gbf \
-g ND1
  • Tips: You can also do this by hand with the vim command or any text editor by working on the DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta file directly.
$ ls -lhrt
total 36K
-rw-rw-r-- 1 gmeng gmeng 35K Jun 29 13:59 DM01.moved-breakpoint.gbf

And then extract the re-arranged mitogenome to a fasta file:

$ mitoz-tools gbseqextractor -f DM01.moved-breakpoint.gbf -prefix DM01.moved-breakpoint -types wholeseq

$ ll
total 56K
-rw-rw-r-- 1 gmeng 35K Jun 29 13:59 DM01.moved-breakpoint.gbf
-rw-rw-r-- 1 gmeng 17K Jun 29 14:03 DM01.moved-breakpoint.fasta

Because we know the mitogenome is circular, we need to add this information (topology=circular) to the header of the DM01.moved-breakpoint.fasta file (you can do this via vim command or any text editor):

>k141_250555 topology=circular

Finally, we re-annotate the new mitogenome DM01.moved-breakpoint.fasta.

mitoz annotate \
--thread_number 16  \
--outprefix DM01 \
--fastafiles DM01.moved-breakpoint.fasta \
--fq1 ../clean_data/sampling_5.0GB.clean_R1.fq.gz \
--fq2 ../clean_data/sampling_5.0GB.clean_R2.fq.gz \
--species_name "Homo sapiens" \
--genetic_code 2 \
--clade Chordata

The final result will be in DM01.result under the moved_breakpoint directory.

6.2 Case 2

The summary.txt file does not have any > or < symbols in the Start and End columns, like this:

#Seq_id        Length(bp)     Circularity    Closely_related_species
k59_737328     16329          yes            Human       


#Seq_id        Start  End    Length(bp) Direction  Type   Gene_name  Gene_prodcut                      Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k59_737328     12     216    205        -          CDS    ATP8       ATP synthase F0 subunit 8         1              
k59_737328     216    280    65         -          tRNA   trnK(uuu)  
...
...
k59_737328     14919  15704  786        -          CDS    COX3       cytochrome c oxidase subunit III  1              
k59_737328     15703  16330  628        -          CDS    ATP6       ATP synthase F0 subunit 6         1              

But if you do a multiple-sequence alignment (MSA) (see Grouping the same genes of multiple samples together), you can see that the ATP6 gene of this sample is too short (on the 5'-end or 3'-end of the MSA), for example: image

Here, the ATP6 could not be correctly annotated due to the wrong breakpoint of the mitogenome. You should use a new breakpoint and then reannotate the mitogenome with MitoZ.

7. What now?

8. Grouping the same genes of multiple samples together

Often, we have several samples, and we want to group the same gene sequences of different samples into the same file before going to subsequent analysis (e.g. multiple sequence alignments and phylogenetic tree construction).

To this end, please refer to mitoz-tools group_seq_by_gene command.

9. The mitoz-tools utility

Please check the mitoz-tools command and here for more information.

Clone this wiki locally