Adding transcript evidence to prediction after training #360

olekto · 2019-12-23T13:24:02Z

Are you using the latest release?
Yep.

Describe the bug
Not a bug, just a clarification of how the code works.

Hi,
am I interpreting the code correctly (specifically

Line 329 in 93288e8

    if os.path.isfile(os.path.join(traindir, 'funannotate_train.transcripts.gff3')):

) in that if funannotate_train.transcripts.gff3 exists, anything specified in transcript_evidence or transcript_alignments is ignored?

The reason I am asking is that I have some manually found genes (or fragments of genes) which I would like to add in some way to make sure that they are called.

If adding them as transcript evidence is not working as it is now, can I add them to other_gff even though the intron-exon borders are not exact? What does EVM do if it is just a fragment not supported by any gene prediction?

Thank you.

Ole

nextgenusfs · 2019-12-23T19:00:29Z

Yes. The assumption is that if you ran funannotate train using RNA-seq data, then the pipeline re-uses those data and passes the results to EVM, so it saves some time by not re-running those alignments. But perhaps this should be a bit more flexible, where unique transcripts found in --transcript_evidence are added if they do not already exist in --transcript_alignments.

EVM only uses transcript alignments to help choose the best model from the gene predictions, so it will not generate gene models from any evidence (protein or transcript).

So if you want some particular gene models to persist, you can pass them via the --other_gff option and give them a significant weight, i.e. --other_gff yourManualGenes.gff:10. That will add those gene models with a weight of 10, which should be enough to alter the EVM scoring to keep those models. But you want to make sure that they are real gene models, i.e. that they make a valid protein sequence (no internal stops etc.). If a model is not a valid ORF, it won't make it out of EVM/filtering.
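As a rough illustration of the "valid protein sequence" check mentioned above, here is a minimal Python sketch that flags a CDS with internal stop codons or a length that is not a multiple of three. This is not funannotate's or EVM's actual filtering code; the function names `translate` and `orf_problems` are hypothetical, and the standard genetic code (NCBI table 1) is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS in frame 0; unknown codons become 'X', trailing bases are ignored."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


def orf_problems(cds):
    """Return a list of reasons why this CDS would not give a clean protein (hypothetical helper)."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    protein = translate(cds)
    # A stop codon is only acceptable as the very last codon.
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    return problems
```

Running something like this over the spliced CDS of each manual model before passing the file to --other_gff would catch the most obvious problems early; an empty list means no problem was detected by these two checks only.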

olekto · 2019-12-23T19:36:58Z

So more flexibility could be nice, but I understand that there are a lot of choices one has to make along the path to developing this.

I think --other_gff is a good solution so far, at least for me. It is good to have, as a sanity check, that the gene models are proper ORFs.

Thank you.

Ole

nextgenusfs · 2019-12-23T19:38:57Z

I'm looking into how long it will take to parse a normal-ish alignment file and cross-reference it with transcripts added via --transcript_evidence. If I can do this somewhat quickly, then I will add that as the default routine if both --transcript_alignments and --transcript_evidence are present.

nextgenusfs · 2019-12-24T02:17:10Z

The latest few commits in master should address this issue. The evidence is now checked against the alignments, and those sequences that do not have alignments are re-aligned with minimap2. Additionally, Augustus hints were not being generated for the alignments in --transcript_alignments, so that is now added as well.
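The cross-referencing step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from those commits: it assumes the alignments are in GFF3 with a `Target=` attribute naming the aligned transcript, that the evidence is a FASTA file whose header IDs match those target names, and the function names are hypothetical.

```python
def fasta_ids(fasta_text):
    """Collect sequence IDs (first word after '>') from FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def aligned_targets(gff3_text):
    """Collect transcript names from the Target attribute of GFF3 alignment lines."""
    targets = set()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        for attr in cols[8].split(";"):
            key, _, value = attr.partition("=")
            if key.strip() == "Target" and value.strip():
                # Target value looks like "name start end [strand]"
                targets.add(value.split()[0])
    return targets


def unaligned_evidence(fasta_text, gff3_text):
    """IDs present in the evidence FASTA but absent from the alignments (assumed matching names)."""
    return sorted(fasta_ids(fasta_text) - aligned_targets(gff3_text))
```

Only the IDs returned by `unaligned_evidence` would then need to be aligned (for example with minimap2), which keeps the extra work proportional to the number of newly added transcripts rather than the whole evidence set.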

olekto · 2019-12-24T07:54:57Z

Cool, thank you.

nextgenusfs closed this as completed Dec 29, 2019
