Making transcript identity assignments with adaptive exon searching #22

Closed
dewyman opened this issue Apr 16, 2018 · 2 comments

dewyman commented Apr 16, 2018

This is a fairly difficult problem. The first version I implemented was greedy: for every query exon in the SAM transcript, it chose the most closely matched annotated exon. Here is an example that illustrates why this doesn't work.

Transcript c33255/f1p0/3238: The exon coordinates are as follows:
827670-827775
829003-829104
847654-847806
849484-849602
851927-852110
852671-852766
853391-853424
853474-855121
856449-857242
When we try to match the very first exon, the greedy choice is 827669-827775, which occurs only in annotated transcript ENST00000609139.5. The UCSC genome browser screenshot below shows that there were other options. For instance, the first four exons of ENST00000608189.4 (the second transcript) match the query.
[Image: c33255-f1p0-3238_linc00115]
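
To make the failure mode concrete, here is a minimal sketch of greedy per-exon matching on toy data. The coordinates, the transcript names `tx_A` and `tx_B`, the helper `closest_exon`, and the distance metric (summed boundary differences) are all invented for illustration; this is not the original implementation.

```python
def closest_exon(query_exon, annotated_exons):
    """Return the annotated exon whose boundaries are closest to the query.

    Exons are (start, end) tuples. Distance is the sum of the absolute
    start and end differences (an assumed metric).
    """
    q_start, q_end = query_exon
    return min(
        annotated_exons,
        key=lambda e: abs(e[0] - q_start) + abs(e[1] - q_end),
    )


# Toy annotation (hypothetical): exon -> transcripts that contain it.
exon_to_transcripts = {
    (99, 200): {"tx_A"},   # near-perfect match, but tx_A has no second exon
    (90, 200): {"tx_B"},   # slightly worse match for the first query exon
    (300, 400): {"tx_B"},
}

query = [(100, 200), (300, 400)]

# Greedy: pick the single best exon for each query exon independently.
greedy_choices = [closest_exon(q, exon_to_transcripts) for q in query]
greedy_transcripts = [exon_to_transcripts[e] for e in greedy_choices]

# No transcript is consistent with every greedy choice, even though
# tx_B overlaps both query exons.
consistent = set.intersection(*greedy_transcripts)
print(greedy_choices, consistent)
```

Because each exon is decided in isolation, the first choice commits to `tx_A` and the second to `tx_B`, so the per-exon results cannot be reconciled into one transcript.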

dewyman commented Apr 17, 2018

Here is another special case that I found: c10882/f1p1/3230. This case suggests that we may need to impose a similarity requirement on exon matches: if the difference at either the 5' or the 3' end exceeds 10 base pairs, the exons are not a match.
[Image: c10882-f1p1-3230_tnfrsf18]
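
A minimal sketch of such a similarity check might look like the following. The function name `is_exon_match` and the constant `MAX_BOUNDARY_DIFF` are hypothetical; only the 10 bp threshold comes from the comment above. Both ends are checked the same way, so the strand does not matter here.

```python
MAX_BOUNDARY_DIFF = 10  # base pairs; threshold proposed in the comment above


def is_exon_match(query_exon, annotated_exon, max_diff=MAX_BOUNDARY_DIFF):
    """Return True if both exon boundaries differ by at most max_diff bp.

    Exons are (start, end) tuples on the same chromosome and strand.
    """
    q_start, q_end = query_exon
    a_start, a_end = annotated_exon
    return (
        abs(q_start - a_start) <= max_diff
        and abs(q_end - a_end) <= max_diff
    )


# First query exon of c33255/f1p0/3238 against its greedy match:
# the start differs by 1 bp and the end is identical, so this passes.
close_enough = is_exon_match((827670, 827775), (827669, 827775))

# A start that is 11 bp away exceeds the threshold (made-up coordinates).
too_far = is_exon_match((100, 200), (89, 200))
```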

dewyman commented Apr 17, 2018

Solution: Loose exon assignments and transcript pool approach

For each exon in the query transcript:
- Fetch the annotated exons that overlap the query exon
- match_pool := transcripts that contain any of these exons
- transcript_pool := match_pool for the first exon; otherwise intersection(match_pool, previous transcript_pool)

By the end, the only transcripts left in the transcript pool are those that overlap every exon in the query transcript. A final cross-check of the total exon counts in the query and the annotation ensures that we don't report a transcript that has more exons than the query.
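
The pool approach can be sketched roughly as below. This is an illustration under assumptions, not the code in the repository: the annotation is modeled as a plain dict from transcript ID to exon list, overlap lookup is a linear scan (a real implementation would more likely use an interval index), and the names `overlaps`, `find_candidate_transcripts`, and the toy transcripts are hypothetical.

```python
def overlaps(a, b):
    """Return True if closed intervals a and b share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_candidate_transcripts(query_exons, annotation):
    """Return transcripts that overlap every query exon.

    annotation: dict mapping transcript ID -> list of (start, end) exons.
    """
    transcript_pool = None
    for q in query_exons:
        # Transcripts with at least one exon overlapping this query exon.
        match_pool = {
            tx for tx, exons in annotation.items()
            if any(overlaps(q, e) for e in exons)
        }
        # The first exon seeds the pool; later exons can only narrow it.
        if transcript_pool is None:
            transcript_pool = match_pool
        else:
            transcript_pool &= match_pool
        if not transcript_pool:
            return set()
    if transcript_pool is None:  # empty query
        return set()
    # Final cross-check: drop transcripts with more exons than the query.
    return {
        tx for tx in transcript_pool
        if len(annotation[tx]) <= len(query_exons)
    }


# Toy data (hypothetical coordinates and transcript names).
annotation = {
    "tx_A": [(99, 200)],
    "tx_B": [(90, 200), (300, 400)],
    "tx_C": [(90, 200), (300, 400), (500, 600)],
}
candidates = find_candidate_transcripts([(100, 200), (300, 400)], annotation)
print(candidates)
```

Here `tx_A` drops out at the second exon, and `tx_C` is removed by the exon-count check because it has three exons against the query's two.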

dewyman closed this as completed Apr 17, 2018