performance: unconsensus annotation generated #49
Comments
Hello @Hocnonsense,

I have a few suggestions for you:

Additionally, you can change the hit processing algorithm (e.g.,

You can further restrict the e-value, which will likely increase precision but lower recall, so please keep that in mind.

I also take other metrics into account when evaluating matches:

You can check these options in the wiki: https://github.com/PedroMTQ/mantis/wiki/Functionalities#annotate-one-sample

Finally, if you prefer to annotate only with the NCBI reference, you can disable the other references.

Regards,
Hello Pedro,

However, a much better NOGG annotation seems to be generated (taking seq1 as an example: annotated as 397945.Aave_1437, hit [61:361] of 362 with e-value 3.06e-125; the full table is shown below), and I still have doubts because the annotations seem to differ:
I've read https://github.com/PedroMTQ/mantis/wiki/Additional-information#inter-reference-hit-processing but cannot clearly understand how the processing works. What are "IDs and free-text functional descriptions" for Mantis, and what is the similarity score between them? Can I reproduce the results using mantis/consensus.py with the results from integrated_annotation as input?

| Query | Ref_file | Ref_hit | Ref_hit_accession | evalue | bitscore | Direction | Query_length | Query_hit_start | Query_hit_end | Ref_hit_start | Ref_hit_end | Ref_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq1 | kofam_merged | K07404 | - | 3.00E-09 | 42.7 | + | 362 | 129 | 271 | 331 | 476 | 530 |
| seq1 | kofam_merged | K20932 | - | 1.08E-05 | 30.8 | + | 362 | 264 | 359 | 45 | 127 | 377 |
| seq1 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 1.92E-18 | 72.8 | + | 362 | 181 | 354 | 9 | 170 | 310 |
| seq1 | NOGG_merged | 397945.Aave_1437 | | 3.06E-125 | 366 | + | 362 | 63 | 361 | 8 | 306 | 306 |
| seq1 | Pfam-A | Cytochrom_D1 | PF02239.19 | 2.31E-06 | 32.8 | + | 362 | 229 | 355 | 9 | 117 | 368 |
| seq2 | kofam_merged | K07404 | - | 6.16E-14 | 58.2 | + | 389 | 115 | 288 | 281 | 460 | 530 |
| seq2 | kofam_merged | K20932 | - | 0.000276989 | 26.1 | + | 389 | 285 | 387 | 44 | 127 | 377 |
| seq2 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 1.77E-15 | 63.1 | + | 389 | 212 | 386 | 8 | 170 | 310 |
| seq2 | NOGG_merged | 399795.CtesDRAFT_PD1384 | | 1.56E-143 | 417 | + | 389 | 1 | 388 | 1 | 393 | 393 |
| seq2 | Pfam-A | Lactonase | PF10282.12 | 0.000338543 | 26.3 | + | 389 | 160 | 242 | 249 | 324 | 344 |

Sincerely appreciate your help!
Hello @Hocnonsense,

Regarding the similarity of annotations (i.e., the IDs and free-text descriptions), keep the following points in mind:

When determining the consensus we proceed in two ways, depending on whether we are handling IDs or free text:

Let's imagine we have a match to the NOG IDX, which is mapped to ID1, ID2 and ID3; another match to the Kofam IDY, which is mapped to ID4 and ID1; and finally a match to the Pfam IDZ, which is mapped to ID5.

For free-text descriptions the idea is similar, but instead of matches between IDs (either it's a match, 1, or it's not, 0) we measure string similarity (from 0 to 1, 1 being very similar). This string similarity is calculated with another package that I've developed: https://github.com/PedroMTQ/UniFunc.

In general, the main idea of Mantis is to leverage "independent" annotation sources to determine a consensus, which we assume is more likely to be true than if we used a single source.

Hope this clears things up. If it's still unclear, I'd recommend you post the Mantis output files (you can trim them down to seq1 and seq2) here and I can try to dig through them to explain what's going on.

Regards,
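To make the IDX/IDY/IDZ example concrete, here is a minimal sketch of the two scoring modes described above. It is not Mantis's actual implementation: the function names are hypothetical, the ID check simply tests whether two hits share any mapped ID, and the text score uses Python's `difflib.SequenceMatcher` as a stand-in for UniFunc, whose algorithm is different.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical hits from the example above: reference hit -> mapped IDs.
hit_ids = {
    "IDX": {"ID1", "ID2", "ID3"},  # NOG match
    "IDY": {"ID4", "ID1"},         # Kofam match
    "IDZ": {"ID5"},                # Pfam match
}

def id_score(ids_a, ids_b):
    """Binary ID agreement: 1 if the two hits share any ID, else 0."""
    return 1 if ids_a & ids_b else 0

def text_score(desc_a, desc_b):
    """Stand-in string similarity in [0, 1]; this is NOT the UniFunc algorithm."""
    return SequenceMatcher(None, desc_a.lower(), desc_b.lower()).ratio()

def agreeing_pairs(hits):
    """Return every pair of hits that shares at least one mapped ID."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(hits), 2)
        if id_score(hits[a], hits[b]) == 1
    )

pairs = agreeing_pairs(hit_ids)
print(pairs)
```

Under these assumptions, IDX and IDY agree because both map to ID1, while IDZ shares nothing with either and stays outside the consensus.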
Thanks for your advice! Here are the three files I got in the Mantis output folder:

integrated_annotation.tsv

| Query | Ref_file | Ref_hit | Ref_hit_accession | evalue | bitscore | Direction | Query_length | Query_hit_start | Query_hit_end | Ref_hit_start | Ref_hit_end | Ref_length | Links |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq1 | Pfam-A | Cytochrom_D1 | PF02239.19 | 2.308245e-06 | 32.8 | + | 362 | 229 | 355 | 9 | 117 | 368 | description:Cytochrome D1 heme domain pfam:Cytochrom_D1 pfam:PF02239 |
| seq1 | kofam_merged | K07404 | - | 3.0007185000000003e-09 | 42.7 | + | 362 | 129 | 271 | 331 | 476 | 530 | cog:COG2706 description:6-phosphogluconolactonase enzyme_ec:3.1.1.31 go:0017057 kegg_ko:K07404 |
| seq1 | kofam_merged | K20932 | - | 1.0771810000000001e-05 | 30.8 | + | 362 | 264 | 359 | 45 | 127 | 377 | cog:COG3391 description:hydrazine synthase subunit enzyme_ec:1.7.2.7 kegg_ko:K20932 |
| seq1 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 1.9235375e-18 | 72.8 | + | 362 | 181 | 354 | 9 | 170 | 310 | description:PQQ-dependent catabolism-associated beta-propeller protein tigrfam:TIGR03866 |
| seq1 | NOGG_merged | 397945.Aave_1437 | | 3.056207886e-125 | 366.0 | + | 362 | 63 | 361 | 8 | 306 | 306 | cog:COG3391 eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 eggnog:COG3391 pfam:PF05694 |
| seq2 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 4.3087240000000004e-15 | 61.8 | + | 358 | 255 | 347 | 219 | 309 | 310 | description:PQQ-dependent catabolism-associated beta-propeller protein tigrfam:TIGR03866 |
| seq2 | NOGG_merged | 402626.Rpic_2496 | | 3.6437579292e-229 | 631.0 | + | 358 | 1 | 357 | 1 | 357 | 357 | cog:COG3391 eggnog:1P862 eggnog:2WEY0 eggnog:COG3391 pfam:PF10282 |
| seq2 | Pfam-A | Cytochrom_D1 | PF02239.19 | 2.3851864999999997e-07 | 36.0 | + | 358 | 146 | 302 | 57 | 214 | 368 | description:Cytochrome D1 heme domain pfam:Cytochrom_D1 pfam:PF02239 |
| seq2 | kofam_merged | K07404 | - | 7.1555595e-17 | 67.9 | + | 358 | 142 | 344 | 283 | 458 | 530 | cog:COG2706 description:6-phosphogluconolactonase enzyme_ec:3.1.1.31 go:0017057 kegg_ko:K07404 |
| seq2 | kofam_merged | K20932 | - | 1.0002395e-11 | 50.6 | + | 358 | 19 | 124 | 51 | 139 | 377 | cog:COG3391 description:hydrazine synthase subunit enzyme_ec:1.7.2.7 kegg_ko:K20932 |

output_annotation.tsv

| Query | Ref_file | Ref_hit | Ref_hit_accession | evalue | bitscore | Direction | Query_length | Query_hit_start | Query_hit_end | Ref_hit_start | Ref_hit_end | Ref_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq1 | Pfam-A | Cytochrom_D1 | PF02239.19 | 2.308245e-06 | 32.8 | + | 362 | 229 | 355 | 9 | 117 | 368 |
| seq1 | kofam_merged | K07404 | - | 3.0007185000000003e-09 | 42.7 | + | 362 | 129 | 271 | 331 | 476 | 530 |
| seq1 | kofam_merged | K20932 | - | 1.0771810000000001e-05 | 30.8 | + | 362 | 264 | 359 | 45 | 127 | 377 |
| seq1 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 1.9235375e-18 | 72.8 | + | 362 | 181 | 354 | 9 | 170 | 310 |
| seq1 | NOGG_merged | 397945.Aave_1437 | | 3.056207886e-125 | 366.0 | + | 362 | 63 | 361 | 8 | 306 | 306 |
| seq2 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 4.3087240000000004e-15 | 61.8 | + | 358 | 255 | 347 | 219 | 309 | 310 |
| seq2 | NOGG_merged | 402626.Rpic_2496 | | 3.6437579292e-229 | 631.0 | + | 358 | 1 | 357 | 1 | 357 | 357 |
| seq2 | Pfam-A | Cytochrom_D1 | PF02239.19 | 2.3851864999999997e-07 | 36.0 | + | 358 | 146 | 302 | 57 | 214 | 368 |
| seq2 | kofam_merged | K07404 | - | 7.1555595e-17 | 67.9 | + | 358 | 142 | 344 | 283 | 458 | 530 |
| seq2 | kofam_merged | K20932 | - | 1.0002395e-11 | 50.6 | + | 358 | 19 | 124 | 51 | 139 | 377 |

consensus_annotation.tsv

| Query | Ref_Files | Ref_Hits | Consensus_hits | Total_hits | Links |
|---|---|---|---|---|---|
| seq1 | NOGG_merged;kofam_merged | 397945.Aave_1437;K20932 | 2 | 5 | cog:COG3391 description:hydrazine synthase subunit eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 eggnog:COG3391 enzyme_ec:1.7.2.7 kegg_ko:K20932 pfam:PF05694 |
| seq2 | NOGG_merged;Pfam-A;kofam_merged | 399795.CtesDRAFT_PD1384;Lactonase;K20932 | 3 | 5 | cog:COG3391 description:Lactonase, 7-bladed beta-propeller description:hydrazine synthase subunit eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 eggnog:COG3391 enzyme_ec:1.7.2.7 kegg_ko:K20932 pfam:Lactonase pfam:PF02239 pfam:PF10282 |

The database config file was used.

Thanks! I think I've found the connection between the NOGG and KEGG annotations: they are all annotated as COG3391! I just hope there is some way to avoid these over-annotated results.
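The COG3391 link can be checked mechanically. The sketch below is a post-hoc inspection, not part of Mantis: it takes the seq1 `Links` values from integrated_annotation.tsv above (free-text `description:` entries are left out because they contain spaces) and lists every identifier that appears under more than one hit. The function name `shared_identifiers` is hypothetical.

```python
from collections import defaultdict

# seq1 identifiers copied from the Links column above (descriptions omitted).
seq1_links = {
    "Cytochrom_D1": "pfam:Cytochrom_D1 pfam:PF02239",
    "K07404": "cog:COG2706 enzyme_ec:3.1.1.31 go:0017057 kegg_ko:K07404",
    "K20932": "cog:COG3391 enzyme_ec:1.7.2.7 kegg_ko:K20932",
    "PQQ_ABC_repeats": "tigrfam:TIGR03866",
    "397945.Aave_1437": (
        "cog:COG3391 eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 "
        "eggnog:COG3391 pfam:PF05694"
    ),
}

def shared_identifiers(links_by_hit):
    """Map each identifier to the hits carrying it; keep those seen in 2+ hits."""
    index = defaultdict(set)
    for hit, links in links_by_hit.items():
        for identifier in links.split():
            index[identifier].add(hit)
    return {
        identifier: sorted(hits)
        for identifier, hits in index.items()
        if len(hits) > 1
    }

shared = shared_identifiers(seq1_links)
print(shared)
```

For seq1, the only identifier shared between hits is `cog:COG3391`, carried by both the NOGG hit and K20932, which matches the two hits reported in consensus_annotation.tsv.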
Hey @Hocnonsense,

I'm glad you found the reason why. Indeed, we sometimes have issues with the reference databases, but unfortunately this is not really something I can address.

Regards,
Recently I've been trying to annotate sequences with Mantis (default database). However, two sequences (seq1 and seq2) were annotated abnormally.
When I search for them in NCBI, they are annotated as "beta-propeller fold lactonase family protein" and "YncE family protein". However, with Mantis, I got:
In brief, while NOGG potentially returned correct results, the annotation from kofam was totally wrong!
I also looked at the detailed results in integrated_annotation.tsv:
It can be seen that the e-values of these hits are fairly weak and the query hit lengths are quite short. Is it possible to filter these low-quality annotations out of the consensus results?
The sequences are listed here
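One way to explore the filtering question outside Mantis is to post-filter the hits by e-value and query coverage before looking at the consensus. The sketch below is only an illustration under assumptions: the thresholds (`max_evalue=1e-10`, `min_coverage=0.5`) are arbitrary, coverage is computed as an inclusive span over `Query_length`, and the helper names are hypothetical rather than Mantis options. The seq1 values are taken from the integrated_annotation.tsv rows in this thread.

```python
# seq1 hits from integrated_annotation.tsv:
# (Ref_hit, evalue, Query_length, Query_hit_start, Query_hit_end)
SEQ1_HITS = [
    ("Cytochrom_D1", 2.308245e-06, 362, 229, 355),
    ("K07404", 3.0007185000000003e-09, 362, 129, 271),
    ("K20932", 1.0771810000000001e-05, 362, 264, 359),
    ("PQQ_ABC_repeats", 1.9235375e-18, 362, 181, 354),
    ("397945.Aave_1437", 3.056207886e-125, 362, 63, 361),
]

def query_coverage(query_length, start, end):
    """Fraction of the query covered by the hit (inclusive coordinates assumed)."""
    return (end - start + 1) / query_length

def filter_hits(hits, max_evalue=1e-10, min_coverage=0.5):
    """Keep hits at or below the e-value cutoff that also cover enough of the query."""
    kept = []
    for ref_hit, evalue, query_length, start, end in hits:
        coverage = query_coverage(query_length, start, end)
        if evalue <= max_evalue and coverage >= min_coverage:
            kept.append(ref_hit)
    return kept

kept = filter_hits(SEQ1_HITS)
print(kept)
```

With these illustrative cutoffs only the NOGG hit survives for seq1; the kofam and Pfam hits fail the e-value cutoff, and the NCBIG hit covers just under half of the query. Stricter cutoffs trade recall for precision, as noted in the thread.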