Adding transcript evidence to prediction after training #360

Closed
olekto opened this issue Dec 23, 2019 · 5 comments

Comments

@olekto

olekto commented Dec 23, 2019

Are you using the latest release?
Yep.

Describe the bug
Not a bug, just a clarification of how the code works.

Hi,
am I interpreting the code correctly, specifically the line

if os.path.isfile(os.path.join(traindir, 'funannotate_train.transcripts.gff3')):

in that if funannotate_train.transcripts.gff3 exists, anything specified in transcript_evidence or transcript_alignments is ignored?

The reason I am asking is that I have some manually found genes (or fragments of genes) which I would like to add in some way to make sure that they are called.

If adding them as transcript evidence is not working as it is now, can I add them to other_gff even though the intron-exon borders are not exact? What does EVM do if it is just a fragment not supported by any gene prediction?

Thank you.

Ole

@nextgenusfs
Owner

Yes. The assumption is that if you ran funannotate train using RNA-seq data, the pipeline re-uses those data and passes the results to EVM, so it saves some time by not re-running those alignments. But perhaps this should be a bit more flexible: unique transcripts found in --transcript_evidence could be added if they do not already exist in --transcript_alignments.
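
A minimal sketch of that shortcut, for illustration only. It is not the funannotate source: the function name `pick_transcript_alignments` and its arguments are hypothetical, and only the file name and the `os.path.isfile` check come from the question above.

```python
import os


def pick_transcript_alignments(traindir, user_alignments=None):
    """Return the transcript-alignment GFF3 that would be handed to EVM.

    Hypothetical helper: if `funannotate train` already left its alignments
    in `traindir`, reuse that file and ignore whatever the user supplied.
    """
    trained = os.path.join(traindir, 'funannotate_train.transcripts.gff3')
    if os.path.isfile(trained):
        # Training output exists, so user-supplied alignments are skipped.
        return trained
    # No training output: fall back to the user-supplied file (may be None).
    return user_alignments
```

With this logic, a user-supplied file is only ever used when the training directory does not contain funannotate_train.transcripts.gff3.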

EVM only uses transcript alignments to help choose the best model from the gene predictions, so it will not generate gene models from any evidence (protein or transcript).

So if you want particular gene models to persist, you can pass them via the --other_gff option and give them a significant weight, i.e. --other_gff yourManualGenes.gff:10. That adds those gene models with a weight of 10, which should be enough to alter the EVM scoring so that those models are kept. But make sure that they are real gene models, i.e. that they make a valid protein sequence (no internal stops, etc.). If a model is not a valid ORF, it won't make it out of EVM/filtering.
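
One way to pre-screen manual models before passing them to --other_gff is to translate each coding sequence and look for internal stops. The sketch below is only an approximation under stated assumptions: it uses the standard genetic code, treats a sequence as acceptable when its length is a multiple of three and no stop codon occurs before the last codon, and the function name `looks_like_valid_cds` is made up for this example. The exact criteria applied by EVM and the funannotate filtering are not spelled out in this thread.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def looks_like_valid_cds(cds):
    """Rough check (an assumption, not funannotate's own filter):
    length is a multiple of 3 and there is no internal stop codon."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        return False
    # Unknown or ambiguous codons are translated to 'X' rather than rejected.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    # A single trailing stop is fine; any earlier stop is an internal stop.
    body = protein[:-1] if protein.endswith("*") else protein
    return "*" not in body
```

For example, `looks_like_valid_cds("ATGAAATAA")` is true, while `looks_like_valid_cds("ATGTAAAAATAA")` is false because of the stop codon in the second position.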

@olekto
Author

olekto commented Dec 23, 2019

More flexibility could be nice, but I understand that there are a lot of choices to make along the path of developing this.

I think --other_gff is a good solution for now, at least for me. It is good to have the sanity check that the gene models are proper ORFs.

Thank you.

Ole

@nextgenusfs
Owner

nextgenusfs commented Dec 23, 2019

I'm looking into how long it will take to parse a normal-ish alignment file and cross-reference it with the transcripts added via --transcript_evidence. If I can do this somewhat quickly, I will add it as the default routine when both --transcript_alignments and --transcript_evidence are present.

@nextgenusfs
Owner

The latest few commits in master should address this issue. The evidence is now checked against the alignments, and those sequences that do not have alignments are re-aligned with minimap2. Additionally, Augustus hints were not being generated for the alignments in --transcript_alignments, so that is now added as well.
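
The cross-referencing step described here could look roughly like the following. This is a sketch and not the code from those commits: the function names are hypothetical, and it assumes that the alignment GFF3 names each aligned transcript in a `Target=` attribute (the GFF3 convention `Target=<id> <start> <end>`) and that the evidence is a FASTA file whose IDs are the first word of each header.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def aligned_ids(gff3_lines):
    """Collect transcript IDs named in Target= attributes of GFF3 lines."""
    ids = set()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 feature line
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "Target" and value:
                ids.add(value.split()[0])
    return ids


def unaligned_transcripts(fasta_lines, gff3_lines):
    """IDs present in the evidence FASTA but absent from the alignments."""
    return sorted(fasta_ids(fasta_lines) - aligned_ids(gff3_lines))
```

Both functions accept any iterable of lines, so open file handles can be passed directly. The IDs returned by `unaligned_transcripts` are the sequences that would still need to be aligned.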

@olekto
Author

olekto commented Dec 24, 2019

Cool, thank you.
