performance: unconsensus annotation generated #49

Hocnonsense · 2023-09-21T12:14:00Z

Recently I've been trying to annotate sequences with mantis (default database). However, two sequences (seq1 and seq2) were annotated abnormally.

When I searched for them in NCBI, they were annotated as "beta-propeller fold lactonase family protein" and "YncE family protein". However, with mantis, I got:

seq1	NOGG_merged;kofam_merged	397945.Aave_1437;K20932	2	5	|	cog:COG3391	description:hydrazine synthase subunit	eggnog:1P862	eggnog:2VKXA	eggnog:4ACF0	eggnog:COG3391	enzyme_ec:1.7.2.7	kegg_ko:K20932	pfam:PF05694
seq2	NOGG_merged;Pfam-A;kofam_merged	399795.CtesDRAFT_PD1384;Lactonase;K20932	3	5	|	cog:COG3391	description:Lactonase, 7-bladed beta-propeller	description:hydrazine synthase subunit	eggnog:1P862	eggnog:2VKXA	eggnog:4ACF0	eggnog:COG3391	enzyme_ec:1.7.2.7	kegg_ko:K20932	pfam:Lactonase	pfam:PF02239	pfam:PF10282

In brief, while nogg potentially returned correct results, the annotation from kofam was totally wrong!
I also looked for detailed results in integrated_annotation.tsv:

Query	Ref_file	Ref_hit	Ref_hit_accession	evalue	bitscore	Direction	Query_length	Query_hit_start	Query_hit_end	Ref_hit_start	Ref_hit_end	Ref_length
seq1	kofam_merged	K07404	-	3.0007185000000003e-09	42.7	+	362	129	271	331	476	530
seq1	kofam_merged	K20932	-	1.0771810000000001e-05	30.8	+	362	264	359	45	127	377
seq2	kofam_merged	K07404	-	6.155319999999999e-14	58.2	+	389	115	288	281	460	530
seq2	kofam_merged	K20932	-	0.0002769894	26.1	+	389	285	387	44	127	377

As shown, the e-values of these hits were pretty low, and the query hit lengths are quite short. Is it possible to filter these low-quality annotations out of the consensus results?
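
To put a number on "quite short", here is a minimal sketch that computes the fraction of each query covered by its kofam hit, using the coordinates from the table above. The `query_coverage` helper is hypothetical (it is not part of mantis), and it assumes the start/end columns are 1-based and inclusive.

```python
# Hypothetical helper, not part of mantis: fraction of the query covered by a hit.
# Assumes Query_hit_start/Query_hit_end are 1-based, inclusive coordinates.
def query_coverage(query_length: int, hit_start: int, hit_end: int) -> float:
    return (hit_end - hit_start + 1) / query_length

# (query, ref_hit, query_length, query_hit_start, query_hit_end) from the rows above
hits = [
    ("seq1", "K07404", 362, 129, 271),
    ("seq1", "K20932", 362, 264, 359),
    ("seq2", "K07404", 389, 115, 288),
    ("seq2", "K20932", 389, 285, 387),
]

coverages = {(q, ref): query_coverage(n, s, e) for q, ref, n, s, e in hits}
for (q, ref), cov in coverages.items():
    print(f"{q}\t{ref}\t{cov:.2f}")
```

Under that assumption, both K20932 hits cover roughly a quarter of the query, and both K07404 hits cover less than half of it.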

The sequences are listed here:

>seq1
VAGERAFARFGLSRLRDALVGIRATISFIKRSFVGSFLSSRFLFFFWAFVALSSSAFASQTTPLFVLNSLDANVSVINPVTWEEIRRIPTGKEPHHLYLTPDERSIIIANALSDTLTFIDPRTAEVQRTVRDIVDPYHLRFTPDMKWLVTAANRLNHVDFYKWDGKELTLAKRVATSRTPSHIWIDSKSTTAYVSMQDSDELVAMDIATQSIRWRTKTGAMPADIYGSPDDKRLFVGLTGSDSVEVFDVSGAQPSSIKRIKTGSGAHAFRAAGDKRHLYVSNRVANTISKLDMQRQEVVDTYPVPGGPDCMDVSADGRFIFVSSRWAKKMSVVDTVEKKVIKQVPVGKSPHGVWTLQHAPR*
>seq2
MDSEKVVVVGRLSIGRTGAAVALVVGVGAAAWLGASALPGFKSAKAAAPAVAATPVVAPAAAVATGAPVSPAAVKPAQAARAVQGPTPIFVLNSLDASISVIDPQTWKEQSRIPTGKEPHHLYLTPDEKSLVVANALGDSLTLVDPRTGAVQRVIRDIVDPYHLRFSPDMKWFVTAANRLNHIDFYRWDAATQTPTLVKRVSTGKTPSHLFIDAQSKTLYSTMQDSDALVAIDIATQTIKARVPTGPMPADVYGSPDGKKLFVGLTGGDGVEVFDITGPEPRSLGQIKTAAGAHAFRAAGDDRHLYVSNRVANTISKIDMVSSQVIANYPAPGGPDCMDVSADGRYIFVASRWARKLSVIDTVEKKVVNQVNVGKSPHGVWTLSHAAR*

PedroMTQ · 2023-09-21T16:30:35Z

Hello @Hocnonsense

I have a few suggestions for you:
You could potentially reduce the weight of kofam in the config file.
By default each reference is given a weight of 0.7; you can lower kofam's with kofam_weight=0.X.

Additionally, you can change the hit processing algorithm (e.g., bpo) and set a threshold for the e-value if the default value is not strict enough. In your case the e-values are already quite low, which means these are not actually low-quality annotations.
Please refer to this extract from HMMER's documentation which explains this quite succinctly:
"The E-value is the expected number of false positives (non-homologous sequences) that scored this well or better. The E-value is a measure of statistical significance. The lower the E-value, the more significant the hit. I typically consider sequences with E-values < 10^-3 or so to be significant hits."

You can further restrict the e-value, which will then likely increase precision but lower recall, so please keep that in mind.
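
As a rough illustration of that trade-off, the sketch below post-filters the four kofam hits reported above at several e-value cut-offs. This is a plain post-processing step on the table values, not mantis's own filtering; the `filter_hits` name and the cut-offs are made up for the example.

```python
# Hypothetical post-filter on e-values copied from the kofam rows above.
# This is not how mantis filters hits internally; it only shows the trade-off.
hits = [
    ("seq1", "K07404", 3.0007185e-09),
    ("seq1", "K20932", 1.077181e-05),
    ("seq2", "K07404", 6.15532e-14),
    ("seq2", "K20932", 2.769894e-04),
]

def filter_hits(hits, max_evalue):
    """Keep only hits whose e-value is at or below max_evalue."""
    return [(q, ref) for q, ref, evalue in hits if evalue <= max_evalue]

for cutoff in (1e-3, 1e-6, 1e-10):
    kept = filter_hits(hits, cutoff)
    print(f"cutoff={cutoff:g}: {len(kept)} hit(s) kept -> {kept}")
```

A cut-off of 1e-6 would already drop both K20932 hits, while a much stricter one also starts discarding the K07404 hits.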

Keep in mind I also take into account other metrics when evaluating matches:
https://github.com/PedroMTQ/mantis/wiki/Additional-information#what-is-the-e-value-threshold

You can check these options in the wiki https://github.com/PedroMTQ/mantis/wiki/Functionalities#annotate-one-sample

Finally, if you prefer to only annotate with the NCBI reference, you can disable the other references.

Regards,
Pedro

Hocnonsense · 2023-09-22T02:11:03Z

Hello Pedro:
Thank you very much for your quick reply and kind advice!

However, a much better NOGG annotation seems to have been generated (taking seq1 as an example, it is annotated as 397945.Aave_1437, hitting [63:361] of 362 with e-value 3.06e-125; the full table is shown below), and I still have doubts because the annotations seem to differ:

397945.Aave_1437 points to activity of Methanethiol oxidase (H2O + methanethiol + O2 = formaldehyde + H+ + H2O2 + hydrogen sulfide, Automatic Annotation, EC:1.8.3.4)
K20932 points to hydrazine synthase subunit [EC:1.7.2.7], while another, slightly better kofam hit (K07404, 6-phosphogluconolactonase [EC:3.1.1.31]) was ignored in consensus_annotation.tsv.

I've read https://github.com/PedroMTQ/mantis/wiki/Additional-information#inter-reference-hit-processing but cannot clearly understand how it works. What are "IDs and free-text functional descriptions" for mantis, and what is the similarity score between them? Can I reproduce the results using mantis/consensus.py with the results from integrated_annotation as input?

Query	Ref_file	Ref_hit	Ref_hit_accession	evalue	bitscore	Direction	Query_length	Query_hit_start	Query_hit_end	Ref_hit_start	Ref_hit_end	Ref_length
seq1	kofam_merged	K07404	-	3.00E-09	42.7	+	362	129	271	331	476	530
seq1	kofam_merged	K20932	-	1.08E-05	30.8	+	362	264	359	45	127	377
seq1	NCBIG_merged	PQQ_ABC_repeats	TIGR03866.1	1.92E-18	72.8	+	362	181	354	9	170	310
seq1	NOGG_merged	397945.Aave_1437		3.06E-125	366	+	362	63	361	8	306	306
seq1	Pfam-A	Cytochrom_D1	PF02239.19	2.31E-06	32.8	+	362	229	355	9	117	368
seq2	kofam_merged	K07404	-	6.16E-14	58.2	+	389	115	288	281	460	530
seq2	kofam_merged	K20932	-	0.000276989	26.1	+	389	285	387	44	127	377
seq2	NCBIG_merged	PQQ_ABC_repeats	TIGR03866.1	1.77E-15	63.1	+	389	212	386	8	170	310
seq2	NOGG_merged	399795.CtesDRAFT_PD1384		1.56E-143	417	+	389	1	388	1	393	393
seq2	Pfam-A	Lactonase	PF10282.12	0.000338543	26.3	+	389	160	242	249	324	344

I sincerely appreciate your help!

PedroMTQ · 2023-09-22T08:14:46Z

Hello @Hocnonsense ,

Regarding the similarity of annotations (i.e., the IDs and free-text descriptions), keep the following points in mind:

an ID refers to an identifier, e.g., K07404.
a free text description refers to a textual description e.g., description:Lactonase

When determining the consensus we proceed in 2 ways, depending on whether we are handling IDs, or free text:

Let's imagine we have a match to the NOG IDX, which is mapped to ID1, ID2 and ID3; another match to the Kofam IDY, which is mapped to ID4 and ID1; and finally a match to the Pfam IDZ, which is mapped to ID5.
During consensus, we try to determine the most likely annotation based on the agreement between IDs. In this case, since NOG IDX and Kofam IDY share ID1, we assume that this sequence is more likely to be NOG IDX + Kofam IDY than Pfam IDZ, since we have 2 "independent" sources (debatable, since some sources integrate data from other sources) pointing to the same annotation. (This is an oversimplification, since there are other internal calculations at play.)
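
A minimal sketch of the ID-based idea, using the made-up hits from the example above. It is a deliberate simplification and not the actual Mantis implementation: hits that share at least one mapped ID are grouped together, and the group backed by the most hits wins. The function name and data layout are assumptions for illustration, and reference weights and the other internal calculations are left out.

```python
from itertools import combinations

# Toy hits from the example: reference hit -> set of IDs it is mapped to.
hits = {
    "NOG IDX": {"ID1", "ID2", "ID3"},
    "Kofam IDY": {"ID4", "ID1"},
    "Pfam IDZ": {"ID5"},
}

def group_by_shared_ids(hits):
    """Group hits that share at least one ID (directly or transitively)."""
    groups = [{name} for name in hits]
    merged = True
    while merged:
        merged = False
        # combinations() works on a snapshot, so editing `groups` before break is safe
        for a, b in combinations(groups, 2):
            ids_a = set().union(*(hits[n] for n in a))
            ids_b = set().union(*(hits[n] for n in b))
            if ids_a & ids_b:
                groups.remove(a)
                groups.remove(b)
                groups.append(a | b)
                merged = True
                break
    return groups

groups = group_by_shared_ids(hits)
consensus = max(groups, key=len)  # the group supported by the most hits
print(sorted(consensus))
```

Here NOG IDX and Kofam IDY end up in one group through ID1, and Pfam IDZ is left on its own.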

For free text descriptions the idea is similar, but in this case instead of having matches between IDs (either it's a match - 1 or it's not - 0) we instead measure string similarity (from 0 to 1, 1 being very similar). This string similarity is calculated with another package that I've developed: https://github.com/PedroMTQ/UniFunc.
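
To make the 0-to-1 scale concrete, here is a stand-in sketch that scores two descriptions by token overlap (Jaccard similarity). UniFunc uses its own, more elaborate method; the `token_jaccard` function below is only an illustrative substitute, and the descriptions are taken from the outputs posted in this thread.

```python
import re

def token_jaccard(desc_a: str, desc_b: str) -> float:
    """Jaccard similarity of lowercase word tokens: 0 = no overlap, 1 = same token set."""
    tokens_a = set(re.findall(r"[a-z0-9]+", desc_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", desc_b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

pairs = [
    ("Lactonase, 7-bladed beta-propeller", "6-phosphogluconolactonase"),
    ("Lactonase, 7-bladed beta-propeller",
     "PQQ-dependent catabolism-associated beta-propeller protein"),
    ("hydrazine synthase subunit", "hydrazine synthase subunit"),
]
for a, b in pairs:
    print(f"{token_jaccard(a, b):.2f}\t{a!r} vs {b!r}")
```

Note that this naive measure scores "Lactonase" against "6-phosphogluconolactonase" as 0 because the tokens differ, which shows the limits of simple token matching.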

In general, the main idea of Mantis is to leverage "independent" annotation sources to determine a consensus, which we assume is more likely to be true than if we used a single source.

Hope this clears things up.

If it's still unclear I'd recommend you post the Mantis output files (you can trim it down to seq1 and seq2) here and I can try to dig through them to explain what's going on.

Regards,
Pedro

Hocnonsense · 2023-09-22T09:36:43Z

Thanks for your advice! Here are the three files I got in the mantis output folder:

integrated_annotation.tsv

Query	Ref_file	Ref_hit	Ref_hit_accession	evalue	bitscore	Direction	Query_length	Query_hit_start	Query_hit_end	Ref_hit_start	Ref_hit_end	Ref_length	|	Links
seq1	Pfam-A	Cytochrom_D1	PF02239.19	2.308245e-06	32.8	+	362	229	355	9	117	368	|	description:Cytochrome D1 heme domain	pfam:Cytochrom_D1	pfam:PF02239
seq1	kofam_merged	K07404	-	3.0007185000000003e-09	42.7	+	362	129	271	331	476	530	|	cog:COG2706	description:6-phosphogluconolactonase	enzyme_ec:3.1.1.31	go:0017057	kegg_ko:K07404
seq1	kofam_merged	K20932	-	1.0771810000000001e-05	30.8	+	362	264	359	45	127	377	|	cog:COG3391	description:hydrazine synthase subunit	enzyme_ec:1.7.2.7	kegg_ko:K20932
seq1	NCBIG_merged	PQQ_ABC_repeats	TIGR03866.1	1.9235375e-18	72.8	+	362	181	354	9	170	310	|	description:PQQ-dependent catabolism-associated beta-propeller protein	tigrfam:TIGR03866
seq1	NOGG_merged	397945.Aave_1437		3.056207886e-125	366.0	+	362	63	361	8	306	306	|	cog:COG3391	eggnog:1P862	eggnog:2VKXA	eggnog:4ACF0	eggnog:COG3391	pfam:PF05694
seq2	NCBIG_merged	PQQ_ABC_repeats	TIGR03866.1	4.3087240000000004e-15	61.8	+	358	255	347	219	309	310	|	description:PQQ-dependent catabolism-associated beta-propeller protein	tigrfam:TIGR03866
seq2	NOGG_merged	402626.Rpic_2496		3.6437579292e-229	631.0	+	358	1	357	1	357	357	|	cog:COG3391	eggnog:1P862	eggnog:2WEY0	eggnog:COG3391	pfam:PF10282
seq2	Pfam-A	Cytochrom_D1	PF02239.19	2.3851864999999997e-07	36.0	+	358	146	302	57	214	368	|	description:Cytochrome D1 heme domain	pfam:Cytochrom_D1	pfam:PF02239
seq2	kofam_merged	K07404	-	7.1555595e-17	67.9	+	358	142	344	283	458	530	|	cog:COG2706	description:6-phosphogluconolactonase	enzyme_ec:3.1.1.31	go:0017057	kegg_ko:K07404
seq2	kofam_merged	K20932	-	1.0002395e-11	50.6	+	358	19	124	51	139	377	|	cog:COG3391	description:hydrazine synthase subunit	enzyme_ec:1.7.2.7	kegg_ko:K20932

output_annotation.tsv

Query	Ref_file	Ref_hit	Ref_hit_accession	evalue	bitscore	Direction	Query_length	Query_hit_start	Query_hit_end	Ref_hit_start	Ref_hit_end	Ref_length
seq1	Pfam-A	Cytochrom_D1	PF02239.19	2.308245e-06	32.8	+	362	229	355	9	117	368
seq1	kofam_merged	K07404	-	3.0007185000000003e-09	42.7	+	362	129	271	331	476	530
seq1	kofam_merged	K20932	-	1.0771810000000001e-05	30.8	+	362	264	359	45	127	377
seq1	NCBIG_merged	PQQ_ABC_repeats	TIGR03866.1	1.9235375e-18	72.8	+	362	181	354	9	170	310
seq1	NOGG_merged	397945.Aave_1437		3.056207886e-125	366.0	+	362	63	361	8	306	306
seq2	NCBIG_merged	PQQ_ABC_repeats	TIGR03866.1	4.3087240000000004e-15	61.8	+	358	255	347	219	309	310
seq2	NOGG_merged	402626.Rpic_2496		3.6437579292e-229	631.0	+	358	1	357	1	357	357
seq2	Pfam-A	Cytochrom_D1	PF02239.19	2.3851864999999997e-07	36.0	+	358	146	302	57	214	368
seq2	kofam_merged	K07404	-	7.1555595e-17	67.9	+	358	142	344	283	458	530
seq2	kofam_merged	K20932	-	1.0002395e-11	50.6	+	358	19	124	51	139	377

consensus_annotation.tsv

Query	Ref_Files	Ref_Hits	Consensus_hits	Total_hits	|	Links
seq1	NOGG_merged;kofam_merged	397945.Aave_1437;K20932	2	5	|	cog:COG3391	description:hydrazine synthase subunit	eggnog:1P862	eggnog:2VKXA	eggnog:4ACF0	eggnog:COG3391	enzyme_ec:1.7.2.7	kegg_ko:K20932	pfam:PF05694
seq2	NOGG_merged;Pfam-A;kofam_merged	399795.CtesDRAFT_PD1384;Lactonase;K20932	3	5	|	cog:COG3391	description:Lactonase, 7-bladed beta-propeller	description:hydrazine synthase subunit	eggnog:1P862	eggnog:2VKXA	eggnog:4ACF0	eggnog:COG3391	enzyme_ec:1.7.2.7	kegg_ko:K20932	pfam:Lactonase	pfam:PF02239	pfam:PF10282

The database config file was used.

Thanks! I think I've found the connection between the NOGG and KEGG annotations: they are all annotated as COG3391! integrated_annotation.tsv is just the bridge between them!
I further searched for the description of 399795.CtesDRAFT_PD1384 in UniProt and of COG3391 in eggNOG, and found inconsistencies between the different databases. I even found an EC annotation of 3.1.1.31 (which points to K07404) in a linked database...
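
That bridge can be checked directly from the Links column. The sketch below is a hypothetical helper (not part of mantis) that takes the seq1 rows of integrated_annotation.tsv shown above, with the link lists copied by hand, and reports which links are shared between each pair of hits.

```python
from itertools import combinations

# Links for the seq1 hits, copied from integrated_annotation.tsv above
# (free-text descriptions left out, since only IDs are compared here).
seq1_links = {
    "Cytochrom_D1": {"pfam:Cytochrom_D1", "pfam:PF02239"},
    "K07404": {"cog:COG2706", "enzyme_ec:3.1.1.31", "go:0017057", "kegg_ko:K07404"},
    "K20932": {"cog:COG3391", "enzyme_ec:1.7.2.7", "kegg_ko:K20932"},
    "PQQ_ABC_repeats": {"tigrfam:TIGR03866"},
    "397945.Aave_1437": {"cog:COG3391", "eggnog:1P862", "eggnog:2VKXA",
                         "eggnog:4ACF0", "eggnog:COG3391", "pfam:PF05694"},
}

def shared_links(links_by_hit):
    """Return {(hit_a, hit_b): shared link set} for every pair sharing a link."""
    shared = {}
    for a, b in combinations(links_by_hit, 2):
        common = links_by_hit[a] & links_by_hit[b]
        if common:
            shared[(a, b)] = common
    return shared

bridges = shared_links(seq1_links)
for pair, common in bridges.items():
    print(pair, sorted(common))
```

For seq1, the only shared link is cog:COG3391, between K20932 and 397945.Aave_1437, which matches the two hits reported in consensus_annotation.tsv.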

I just hope there is some way to avoid these over-annotated results.

PedroMTQ · 2023-09-25T08:11:30Z

Hey @Hocnonsense ,

I'm glad you found the reason why. Indeed we sometimes have issues with the reference databases, but unfortunately this is not really something I can address.
If you do have concerns regarding a specific database, feel free to disable it.

Regards,
Pedro

Hocnonsense closed this as completed Sep 25, 2023
