How to add reference sequences to reduce missing genes #79

Closed
Tianyou96 opened this issue Sep 8, 2020 · 14 comments

Comments

@Tianyou96

Tianyou96 commented Sep 8, 2020

Following your suggestions in #61 (comment), I manually expanded release_MitoZ_v2.4-alpha/bin/profiles/MT_database/Arthropoda_CDS_protein.fa.
[image]
As shown in the figure, apart from the ID, the new entries are written in the existing format.

The following error occurred:

can not find taxid for
can not find taxid for ['dispersus'], maybe it's a misspelling.
KeyError: 'Aleurodicus'
Error occured when running command:
/usr/lib/anaconda3/envs/mitozEnv/bin/python3 /apps/MitoZ/version_2.4-alpha/release_MitoZ_v2.4-alpha/bin/annotate/cds_ft_v2.py XJ3-1_L2_142142.cds.position.sorted.revised.filtered 5 XJ3-1_L2_142142_mitoscaf.fa.cds.ft

Where else do I need to make changes?
I would be very grateful if you could give me more tips to help me complete the annotation of some species.
I added a .txt suffix to Arthropoda_CDS_protein.fa so that I could upload it.
log.txt
Arthropoda_CDS_protein.fa.txt

@linzhi2013
Owner

Dear zskysafe,

Sorry for the late reply; there was a problem with my network.

I see you raised a new issue, #78. I think you were adding new amino acid sequences to release_MitoZ_v2.4-alpha/bin/profiles/MT_database/Arthropoda_CDS_protein.fa; this should work.

I will run a test with your https://github.com/linzhi2013/MitoZ/files/5188671/Arthropoda_CDS_protein.fa.txt and check the code, then get back to you as soon as possible.

Cheers

@linzhi2013
Owner

linzhi2013 commented Sep 9, 2020

Hi zskysafe,

That was due to an inconsistency in the format of NCBI accession numbers. Mitogenomes in the RefSeq database have accession numbers like NC_001620, while non-RefSeq mitogenomes have accession numbers like KT225300, and we use the _ to split the header string.

For consistency, the accession numbers of non-RefSeq mitogenomes in the release_MitoZ_v2.4-alpha/bin/profiles/MT_database/Arthropoda_CDS_protein.fa file MUST also start with >gi_NC_, so that the headers look like this:

>gi_NC_KP861632_ND6_Chrysomya_pacifica_174_aa
>gi_NC_KX090381_ND6_Microthoracius_praelongiceps_157_aa
>gi_NC_EU583500_ND6_Euphausia_superba_173_aa

That is to say, your >gi_KR_063274_ATP6_Aleurodicus_dispersus_216_aa should be reformatted as >gi_NC_KR063274_ATP6_Aleurodicus_dispersus_216_aa.

Best
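
The renaming described in the comment above can be scripted when many sequences are added. The following is only a minimal sketch and not part of MitoZ: `normalize_header`, `normalize_fasta` and the regular expression are hypothetical names, and the sketch assumes that every header follows the `>gi_<accession>_<gene>_<Genus>_<species>_<length>_aa` layout shown in this thread, with a non-RefSeq accession written either as `KR_063274` or as `KR063274`.

```python
import re

# Hypothetical helper, not part of MitoZ. Matches headers whose accession is
# written as e.g. "KR_063274" or "KR063274" directly after the "gi_" tag.
_HEADER = re.compile(r"^>gi_([A-Z]{1,2})_?(\d+)_(.+)$")


def normalize_header(line):
    """Rewrite a non-RefSeq header so that it starts with >gi_NC_<accession>_."""
    line = line.rstrip("\n")
    match = _HEADER.match(line)
    # RefSeq headers (gi_NC_001620_...), headers that are already fixed and
    # plain sequence lines are returned unchanged.
    if match is None or match.group(1) == "NC":
        return line
    prefix, digits, rest = match.groups()
    return f">gi_NC_{prefix}{digits}_{rest}"


def normalize_fasta(lines):
    """Yield every line of a FASTA file, fixing header lines on the way."""
    for line in lines:
        yield normalize_header(line)


if __name__ == "__main__":
    # Prints >gi_NC_KR063274_ATP6_Aleurodicus_dispersus_216_aa
    print(normalize_header(">gi_KR_063274_ATP6_Aleurodicus_dispersus_216_aa"))
```

It is probably safest to write the result to a new file and compare it with the original before replacing Arthropoda_CDS_protein.fa.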

@Tianyou96
Copy link
Author

Tianyou96 commented Sep 9, 2020

I couldn't wait to test whether it works, but another error stopped me.
The error is shown below:
can not find taxid for Nematoda
can not find taxid for ['Nematoda'], maybe it's a misspelling.
Please use other taxanomy name.
Using Arthropoda reports the same error.
I tried unloading and reloading mitozEnv and MitoZ, but the problem persists: the taxid lookup terminates the program at the very start of the run.
I haven't changed that file since I reloaded MitoZ.

python3 $DIR_mitoz/MitoZ.py all2 --genetic_code 5 --clade Arthropoda --outprefix $name \
--thread_number 8 \
--fq_size 5 \
--fastq1 $fq1 \
--fastq2 $fq2 \
--fastq_read_length 150 \
--insert_size 250 \
--run_mode 2 \
--filter_taxa_method 1 \
--requiring_taxa 'Arthropoda' >> mitoz.log 2>&1

@linzhi2013
Owner

It just means that "Nematoda" is not in the NCBI taxonomy database.

What is the full species name? When I searched the NCBI Taxonomy online database, I could only find https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=6231&lvl=3&lin=f&keep=1&srchmode=1&unlock, which is a phylum and not a rank belonging to Arthropoda.

I'm not sure what you have done to the source code. With your present command, it should only search for 'Arthropoda'. I have no idea where 'Nematoda' came from.

@Tianyou96
Author

Tianyou96 commented Sep 9, 2020

What I said was a bit confusing.
Actually, I tried both Nematoda and Arthropoda. Both give the same error at the beginning of the run:
can not find taxid for Arthropoda
can not find taxid for ['Arthropoda'], maybe it's a misspelling.
Please use other taxanomy name.
This problem existed before I installed it again.
After I reinstalled MitoZ, I didn't make any changes to the source code.
The problem was still there, so I tried changing the input:
python3 $DIR_mitoz/MitoZ.py all2 --genetic_code 5 --outprefix $name \
--thread_number 8 \
--fq_size 5 \
--fastq1 $fq1 \
--fastq2 $fq2 \
--fastq_read_length 150 \
--insert_size 250 \
--run_mode 2 \
>> mitoz.log 2>&1

This error occurs whether I specify Arthropoda or not.
Then I tried running several copies of MitoZ in different paths, none of which had been changed since they were unpacked.
The problem still occurs.
I'm confused.
I didn't make any changes to mitozEnv.

[image]
I reloaded the NCBI taxonomy database.

@linzhi2013
Owner

I see...

It seems that your NCBI taxonomy database is broken. Maybe you don't have enough space in your HOME directory for it? If that's the case, please check https://github.com/linzhi2013/taxonomy_ranks/blob/master/README.md.

The NCBI taxonomy database is updated regularly and keeps growing, so it may now need more than 600 MB.
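
One quick way to test the disk-space hypothesis is to compare the free space in HOME with the size of the local database file. The sketch below assumes the database sits at ete3's default location, `~/.etetoolkit/taxa.sqlite`; `taxonomy_db_report` is a hypothetical helper name, and the 600 MB figure comes from the comment above rather than from a measured requirement.

```python
import os
import shutil


def taxonomy_db_report(home=None):
    """Report free space under `home` and the size of taxa.sqlite, if present."""
    home = home or os.path.expanduser("~")
    # Assumed location: ete3's default path for its NCBI taxonomy database.
    db_path = os.path.join(home, ".etetoolkit", "taxa.sqlite")
    free_mb = shutil.disk_usage(home).free / 1024 ** 2
    db_mb = None
    if os.path.isfile(db_path):
        db_mb = os.path.getsize(db_path) / 1024 ** 2
    return {"db_path": db_path, "db_size_mb": db_mb, "free_mb": free_mb}


if __name__ == "__main__":
    report = taxonomy_db_report()
    print(report)
    if report["db_size_mb"] is None:
        print("No taxa.sqlite found; the database may never have been built.")
    elif report["free_mb"] < 600:
        print("Less than 600 MB free in HOME; a rebuild might not fit.")
```

A missing or unusually small `taxa.sqlite` would be consistent with an interrupted download or build, though it does not prove it.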

@Prunoideae

Prunoideae commented Sep 9, 2020

Hello zskysafe,

This problem is caused by a recent change in NCBI's taxonomy database, which broke an assertion in ete3 and makes it fail when parsing the data.

For a more official report, please refer to etetoolkit/ete#469.

The fix for this issue is said to be released when ete4 is published, in mid to late 2020. If you need to fix this problem right now, Prunoideae/MitoFlex#2 may also help you.

@linzhi2013
Owner

Thanks @Prunoideae for pointing out the problem.

This reminds me that I have already downloaded an older NCBI taxonomy database, and it worked for another user. You can follow the instructions here (#72 (comment)) to re-prepare your NCBI taxonomy database.

Cheers
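
After re-preparing the database, it can be worth checking that a well-known name resolves before starting a long run. The sketch below queries the SQLite file directly with the standard library. It assumes the file follows the layout ete3 is understood to use, namely a `species` table with `taxid` and `spname` columns; that schema is an assumption and may differ between ete3 versions, and `lookup_taxid` is a hypothetical helper.

```python
import os
import sqlite3


def lookup_taxid(db_path, name):
    """Return the taxid stored for `name`, or None if the name is absent."""
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        # Assumed schema: table `species` with columns `taxid` and `spname`.
        row = conn.execute(
            "SELECT taxid FROM species WHERE spname = ?", (name,)
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None


if __name__ == "__main__":
    db = os.path.join(os.path.expanduser("~"), ".etetoolkit", "taxa.sqlite")
    print(lookup_taxid(db, "Arthropoda"))
```

A `None` result for a name such as Arthropoda would suggest that the local database is incomplete rather than that the name is misspelled.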

@Tianyou96
Author

@linzhi2013 @Prunoideae
Thank you very much. With your help, the problem has been solved. I downloaded an older NCBI taxonomy database provided by linzhi2013.

@Tianyou96
Author

Tianyou96 commented Sep 10, 2020

Part of the pipeline does not run normally:

run the genewise shell file
running genewise
convert result to gff3 format
cat: './work71.hmmtblout.besthit.sim.fa.genewise//.genewise': No such file or directory
Sorry, the annotation finished with no result!
...
Error occured when running command:
/usr/lib/anaconda3/envs/mitozEnv/bin/python3 /apps/Mitoz/version_2.4-alpha/release_MitoZ_v2.4-alpha/bin/findmitoscaf/filter_taxonomy_by_CDS_annotation.py -fa work71.hmmtblout.besthit.sim.fa -MTsoft /apps/Mitoz/version_2.4-alpha/release_MitoZ_v2.4-alpha/bin/annotate/MT_annotation_BGI_V1.32/MT_annotation_BGI.pl -db /apps/Mitoz/version_2.4-alpha/release_MitoZ_v2.4-alpha/bin/profiles/MT_database/Animal_CDS_protein.fa -thread 8 -genetic_code 5 -requiring_taxa 'Arthropoda' -relax 0 -WISECONFIGDIR /apps/Mitoz/version_2.4-alpha/release_MitoZ_v2.4-alpha/bin/annotate/wisecfg -outf work71.hmmtblout.besthit.sim.filtered.fa

@linzhi2013
Owner

linzhi2013 commented Sep 10, 2020

Hi zskysafe,

Please send me the release_MitoZ_v2.4-alpha/bin/profiles/MT_database/Animal_CDS_protein.fa file and your mitogenome sequences. Please send them to linzhi2012<mitoz>@<mitoz>gmail<mitoz>com

@Tianyou96
Author

Tianyou96 commented Sep 10, 2020

I have not changed Animal_CDS_protein.fa.
My mitogenome sequences? I chose the all2 mode, and I haven't got an assembly result file yet.
[image]
Animal_CDS_protein.zip

I quality-trimmed the clean data, so my reads have different lengths. This should not affect the assembly, right?
[image]
python3 $DIR_mitoz/MitoZ.py all2 --genetic_code 5 --clade Arthropoda --outprefix $name \
--thread_number 8 \
--fq_size 5 \
--fastq1 $fq1 \
--fastq2 $fq2 \
--fastq_read_length 150 \
--insert_size 250 \
--run_mode 2 \
--filter_taxa_method 1 \
--requiring_taxa 'Arthropoda' >> mitoz.log 2>&1

@linzhi2013
Owner

linzhi2013 commented Sep 10, 2020

I'd like to know what the work71.hmmtblout.besthit.sim.fa file contains. If it is empty, then no mitochondrial sequence was found in work71.ScafSeq. How many Gbp of data did you use for the mitogenome assembly?

The read length does not have much effect in your case.
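
Whether the candidate file is empty can be checked without opening it by hand. This is a minimal sketch using only the standard library; `count_fasta_records` is a hypothetical helper, and the file name in the usage example is the one from the log in this thread.

```python
def count_fasta_records(path):
    """Return (number of sequences, total residues) for a FASTA file."""
    n_records = 0
    n_residues = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_records += 1  # every header line starts a new sequence
            else:
                n_residues += len(line)
    return n_records, n_residues


if __name__ == "__main__":
    records, residues = count_fasta_records("work71.hmmtblout.besthit.sim.fa")
    print(f"{records} sequences, {residues} bp in total")
```

A count of zero sequences would match the case described above, in which no mitochondrial candidate was found among the assembled scaffolds.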

@linzhi2013
Owner

Please raise a new issue.

I'm closing this issue since its subject, "How to add reference sequences to reduce missing genes", has been resolved.
