
The mitoz-tools group_seq_by_gene command

Guanliang MENG edited this page Jun 22, 2023 · 1 revision

This command groups the gene sequences of different samples into separate files, one file per gene.

$ mitoz-tools  group_seq_by_gene -h
usage: mitoz-tools group_seq_by_gene [-h] [-r <file>] [-d <str>] [-p <str>] [-clean_header]

To group the gene sequences of different samples into different files by genes.

Please cite:
Guanliang Meng, Yiyuan Li, Chentao Yang, Shanlin Liu,
MitoZ: a toolkit for animal mitochondrial genome assembly, annotation
and visualization, Nucleic Acids Research, https://doi.org/10.1093/nar/gkz173

optional arguments:
  -h, --help     show this help message and exit
  -r <file>      the gene file list. Per-line format: Abbreviation geneFilePath. The abbreviation will be added
                 to the seqid to indicate different samples.
  -d <str>       the delimiter between the abbreviation and the seqid [;]
  -p <str>       the prefix of all result files [MitoZ]
  -clean_header  Only shows the 'Abbreviation' in the sequence header [False]

Usage

Prepare a file (e.g. called gene_f_list) whose content looks like this:

DM01 DM01/DM01.result/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
DM02 DM02/DM02.result/DM02.DM02.megahit.mitogenome.fa.result/DM02_DM02.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta

The per-line format is:

sampleID /path/to/the/fasta_file
  • The sampleID (the first column) is added to the beginning of each sequence title (FASTA header) in the resulting files.
  • The second column is the path to a FASTA file, which can be any of the following MitoZ result files:
    -rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
    -rw-rw-r-- 1 gmeng  12K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
    -rw-rw-r-- 1 gmeng 2.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
    -rw-rw-r-- 1 gmeng 2.7K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
    -rw-rw-r-- 1 gmeng  17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
    -rw-rw-r-- 1 gmeng 4.3K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
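
With many samples, writing the list by hand gets tedious. The following is a minimal sketch for generating it, assuming each sample has its own directory named after the sample ID (as DM01/ and DM02/ are above) and that the wanted file ends in .gbf.<suffix>. The function name make_gene_list is hypothetical and is not part of mitoz-tools.

```shell
# Hypothetical helper (not part of mitoz-tools): print "sampleID path" lines,
# one per sample directory found in the current directory.
make_gene_list() {
    suffix="${1:-gene.fasta}"   # e.g. gene.fasta, cds.fasta, trna.fasta, rrna.fasta
    for dir in */; do
        sample="${dir%/}"
        [ -d "$sample" ] || continue
        # Assumption: the directory name is the sample ID, and the first
        # matching file (in sorted order) is the one to use.
        path=$(find "$sample" -type f -name "*.gbf.$suffix" | sort | head -n 1)
        if [ -n "$path" ]; then
            printf '%s %s\n' "$sample" "$path"
        fi
    done
}
```

Running make_gene_list gene.fasta > gene_f_list from the directory that holds the sample directories would write the list. Sample directories without a matching file are skipped, so it is worth checking the line count of the result against the number of samples.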
    

Then execute:

$ mitoz-tools  group_seq_by_gene -r gene_f_list -d '_' -p MitoZ

This produces one file per gene:

$ ls -lh
-rw-rw-r-- 1 gmeng gmeng   42 Jul  8 17:24 gene_f_list
-rw-rw-r-- 1 gmeng gmeng 1.5K Jul  8 17:36 MitoZ.gene-ATP6.fa
-rw-rw-r-- 1 gmeng gmeng  406 Jul  8 17:36 MitoZ.gene-ATP8.fa
-rw-rw-r-- 1 gmeng gmeng 3.2K Jul  8 17:36 MitoZ.gene-COX1.fa
-rw-rw-r-- 1 gmeng gmeng 1.5K Jul  8 17:36 MitoZ.gene-COX2.fa
-rw-rw-r-- 1 gmeng gmeng 1.6K Jul  8 17:36 MitoZ.gene-COX3.fa
-rw-rw-r-- 1 gmeng gmeng 2.4K Jul  8 17:36 MitoZ.gene-CYTB.fa
-rw-rw-r-- 1 gmeng gmeng 3.4K Jul  8 17:36 MitoZ.gene-l-rRNA.fa
-rw-rw-r-- 1 gmeng gmeng 2.0K Jul  8 17:36 MitoZ.gene-ND1.fa
-rw-rw-r-- 1 gmeng gmeng 2.2K Jul  8 17:36 MitoZ.gene-ND2.fa
-rw-rw-r-- 1 gmeng gmeng  680 Jul  8 17:36 MitoZ.gene-ND3.fa
-rw-rw-r-- 1 gmeng gmeng 2.8K Jul  8 17:36 MitoZ.gene-ND4.fa
-rw-rw-r-- 1 gmeng gmeng  668 Jul  8 17:36 MitoZ.gene-ND4L.fa
-rw-rw-r-- 1 gmeng gmeng 3.7K Jul  8 17:36 MitoZ.gene-ND5.fa
-rw-rw-r-- 1 gmeng gmeng 1.1K Jul  8 17:36 MitoZ.gene-ND6.fa
-rw-rw-r-- 1 gmeng gmeng 2.0K Jul  8 17:36 MitoZ.gene-s-rRNA.fa
-rw-rw-r-- 1 gmeng gmeng  216 Jul  8 17:36 MitoZ.gene-trnA(ugc).fa
-rw-rw-r-- 1 gmeng gmeng  212 Jul  8 17:36 MitoZ.gene-trnC(gca).fa
-rw-rw-r-- 1 gmeng gmeng  218 Jul  8 17:36 MitoZ.gene-trnD(guc).fa
-rw-rw-r-- 1 gmeng gmeng  218 Jul  8 17:36 MitoZ.gene-trnE(uuc).fa
-rw-rw-r-- 1 gmeng gmeng  216 Jul  8 17:36 MitoZ.gene-trnF(gaa).fa
-rw-rw-r-- 1 gmeng gmeng  214 Jul  8 17:36 MitoZ.gene-trnG(ucc).fa
-rw-rw-r-- 1 gmeng gmeng  222 Jul  8 17:36 MitoZ.gene-trnH(gug).fa
-rw-rw-r-- 1 gmeng gmeng  214 Jul  8 17:36 MitoZ.gene-trnI(gau).fa
-rw-rw-r-- 1 gmeng gmeng  224 Jul  8 17:36 MitoZ.gene-trnK(uuu).fa
-rw-rw-r-- 1 gmeng gmeng  226 Jul  8 17:36 MitoZ.gene-trnL(uaa).fa
-rw-rw-r-- 1 gmeng gmeng  228 Jul  8 17:36 MitoZ.gene-trnL(uag).fa
-rw-rw-r-- 1 gmeng gmeng  218 Jul  8 17:36 MitoZ.gene-trnM(cau).fa
-rw-rw-r-- 1 gmeng gmeng  224 Jul  8 17:36 MitoZ.gene-trnN(guu).fa
-rw-rw-r-- 1 gmeng gmeng  222 Jul  8 17:36 MitoZ.gene-trnP(ugg).fa
-rw-rw-r-- 1 gmeng gmeng  220 Jul  8 17:36 MitoZ.gene-trnQ(uug).fa
-rw-rw-r-- 1 gmeng gmeng  220 Jul  8 17:36 MitoZ.gene-trnR(ucg).fa
-rw-rw-r-- 1 gmeng gmeng  222 Jul  8 17:36 MitoZ.gene-trnS(gcu).fa
-rw-rw-r-- 1 gmeng gmeng  222 Jul  8 17:36 MitoZ.gene-trnS(uga).fa
-rw-rw-r-- 1 gmeng gmeng  226 Jul  8 17:36 MitoZ.gene-trnT(ugu).fa
-rw-rw-r-- 1 gmeng gmeng  222 Jul  8 17:36 MitoZ.gene-trnV(uac).fa
-rw-rw-r-- 1 gmeng gmeng  222 Jul  8 17:36 MitoZ.gene-trnW(uca).fa
-rw-rw-r-- 1 gmeng gmeng  216 Jul  8 17:36 MitoZ.gene-trnY(gua).fa
$ grep '>' MitoZ.gene-COX1.fa
>DM01_COX1;len=1557;[2925:4482](-)
>DM02_COX1;len=1557;[2925:4482](-)
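
As a quick sanity check, you can count the sequences in every result file. This is a minimal sketch, not part of mitoz-tools; the function name count_seqs_per_gene is hypothetical. A count lower than the number of samples may mean that a gene was not annotated in some sample.

```shell
# Hypothetical helper: print "<number of sequences><TAB><file>" for every
# result file that starts with the given prefix (the value passed to -p).
count_seqs_per_gene() {
    prefix="${1:-MitoZ}"
    for f in "$prefix".gene-*.fa; do
        [ -e "$f" ] || continue
        # Quote "$f": names such as MitoZ.gene-trnA(ugc).fa contain parentheses.
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$n" "$f"
    done
}
```

Call it as count_seqs_per_gene MitoZ in the directory that holds the result files.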

You can set -p to any other string, for example your project ID.

You can also change the delimiter between the sample ID and the original sequence title with -d. For example, to separate DM01 from COX1 with a space instead of an underscore:

$ mitoz-tools  group_seq_by_gene -r gene_f_list -d ' ' -p MitoZ

$ grep '>' MitoZ.gene-COX1.fa
>DM01 COX1;len=1557;[2925:4482](-)
>DM02 COX1;len=1557;[2925:4482](-)

If you want a clean sequence header that shows only the sample ID, add -clean_header:

$ mitoz-tools  group_seq_by_gene -r gene_f_list -p MitoZ -clean_header

$ grep '>' MitoZ.gene-COX1.fa
>DM01
>DM02

Now you can use the MitoZ.gene-*.fa files for subsequent analysis, e.g. to perform multiple sequence alignment with the MAFFT program.
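
The alignment step could be sketched as a loop over the result files. This assumes MAFFT is installed and available as mafft on your PATH; the function name align_gene_files and the aligned output directory are hypothetical choices, and --auto simply lets MAFFT pick a strategy.

```shell
# Hypothetical helper: align every <prefix>.gene-*.fa file with MAFFT.
# Assumes <prefix> is a plain file-name prefix (no directory part).
align_gene_files() {
    prefix="${1:-MitoZ}"
    outdir="${2:-aligned}"
    mkdir -p "$outdir"
    for f in "$prefix".gene-*.fa; do
        [ -e "$f" ] || continue
        base="${f%.fa}"
        # Alignments go to a separate directory so that a second run does
        # not pick them up as input. Quoting handles names with parentheses.
        mafft --auto "$f" > "$outdir/$base.aln.fa"
    done
}
```

Calling align_gene_files MitoZ aligned would write one alignment per gene, for example aligned/MitoZ.gene-COX1.aln.fa.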
