This is a fairly difficult problem. The first version I implemented was greedy: for every query exon in the SAM transcript, it chose the most closely matched annotated exon. Here is an example that illustrates why this doesn't work.
Transcript c33255/f1p0/3238: The exon coordinates are as follows:
827670-827775
829003-829104
847654-847806
849484-849602
851927-852110
852671-852766
853391-853424
853474-855121
856449-857242
When we try to match the very first exon, the greedy match is 827669-827775, which is found only in annotated transcript ENST00000609139.5. Visually, we can see in the UCSC genome browser shot below that there were other options. For instance, the first four exons of ENST00000608189.4 (second transcript) match the query.
Here is another special case that I found: c10882/f1p1/3230. This situation suggests that we may need to impose a similarity requirement on the exon matches: if the 3' or the 5' difference exceeds 10 base pairs, then it is not a match.
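A minimal sketch of that similarity requirement, assuming exons are represented as `(start, end)` tuples. The function name `exons_match` and the `max_diff` parameter are hypothetical and used only for illustration; they are not taken from the actual code.

```python
from typing import Tuple

# Hypothetical representation: an exon is a (start, end) coordinate pair.
Exon = Tuple[int, int]


def exons_match(query: Exon, annotated: Exon, max_diff: int = 10) -> bool:
    """Return True if both exon boundaries differ by at most max_diff bp.

    The 10 bp default follows the threshold proposed above; a difference
    of exactly 10 still counts as a match, anything larger does not.
    """
    start_diff = abs(query[0] - annotated[0])
    end_diff = abs(query[1] - annotated[1])
    return start_diff <= max_diff and end_diff <= max_diff


# First query exon of c33255/f1p0/3238 against its closest annotated exon.
print(exons_match((827670, 827775), (827669, 827775)))
```

Checking both ends independently means an exon that agrees at one boundary but is far off at the other is rejected, which seems to be the intent of the proposed rule.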
Solution: Loose exon assignments and transcript pool approach
For each exon in query transcript:
- Fetch annotated exons that overlap with query exon
- match_pool := transcripts that contain these exons
- transcript_pool := intersection(match_pool, previous transcript_pool)
By the end, the only transcripts left in the transcript pool are transcripts that contain overlap for every exon in the query transcript. A final cross-check of the total exons in the query and the annotation ensures that we don't report a transcript that has more exons than the query.
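The pool approach above might be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the annotation is modeled as a plain dict from transcript ID to a list of `(start, end)` exons, the names `overlaps` and `candidate_transcripts` are hypothetical, and the toy coordinates are made up.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical representation: an exon is a (start, end) coordinate pair.
Exon = Tuple[int, int]


def overlaps(a: Exon, b: Exon) -> bool:
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def candidate_transcripts(
    query_exons: List[Exon], annotation: Dict[str, List[Exon]]
) -> Set[str]:
    """Intersect, exon by exon, the transcripts that overlap each query exon."""
    pool: Optional[Set[str]] = None
    for q in query_exons:
        # Transcripts containing at least one exon that overlaps this query exon.
        match_pool = {
            tid
            for tid, exons in annotation.items()
            if any(overlaps(q, e) for e in exons)
        }
        # The first exon seeds the pool; later exons can only shrink it.
        pool = match_pool if pool is None else pool & match_pool
        if not pool:
            break
    pool = pool or set()
    # Final cross-check: drop transcripts with more exons than the query.
    return {tid for tid in pool if len(annotation[tid]) <= len(query_exons)}


# Toy data (made up for illustration).
annotation = {
    "tx_A": [(100, 200), (300, 400)],
    "tx_B": [(100, 200), (300, 400), (500, 600)],
    "tx_C": [(100, 200)],
}
query = [(105, 200), (300, 395)]
print(candidate_transcripts(query, annotation))
```

In the toy data, `tx_C` drops out at the second query exon because it has nothing overlapping it, and `tx_B` survives the intersection but is removed by the exon-count check. A linear scan over every transcript is used here only for brevity; a real annotation would probably need an interval index to fetch overlapping exons.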
![c33255-f1p0-3238_linc00115](https://user-images.githubusercontent.com/8904542/38840576-4074bfc0-4195-11e8-9862-759f6e27d67f.png)