Comparing TALON prototype results to Tofu and MatchAnnot #15
Why does transcript c1417/f31p12/2759 get a score of 5 in MatchAnnot but not appear among the transcripts with internal exon matches in TALON? [FIXED]
TALON results (which don't look right to me):
[FIXED] I noticed that the problem is another GTF reading bug. For transcripts on the minus strand, the exons are listed (and numbered) in their 5' to 3' orientation rather than being listed with respect to the forward strand. This is creating problems in my exon string process.
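One way to make the exon ordering independent of strand is to sort each transcript's exons by genomic start after reading the GTF. The following is only a minimal sketch of that idea, not TALON's actual reader: the function name `exons_by_transcript` and the simplified attribute parsing are assumptions made for illustration.

```python
from collections import defaultdict

def exons_by_transcript(gtf_lines):
    """Group exon records by transcript_id, ordered along the forward strand.

    Hypothetical helper for illustration; not taken from the TALON code base.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are of interest here
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        attrs = fields[8]
        if 'transcript_id "' not in attrs:
            continue  # exon without a transcript_id cannot be grouped
        tid = attrs.split('transcript_id "')[1].split('"')[0]
        exons[tid].append((start, end, strand))
    # Sorting by start coordinate removes any dependence on how the file
    # lists (and numbers) minus-strand exons.
    return {tid: sorted(records) for tid, records in exons.items()}
```

With this ordering, plus-strand and minus-strand transcripts both present their exons in ascending genomic coordinates, so any downstream exon string is built the same way for each.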
Does TALON recover all score 5 events from Tofu-STAR-MatchAnnot? [ANSWERED] Almost
The result was 13 lines (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e), which means that TALON did not quite cover all of the cases that MatchAnnot did. Here are the IDs:
Comparing TALON to direct MatchAnnot (no Tofu-STAR): Do they give the same results for score 5 transcripts?
The result is 15 lines that are in MatchAnnot but not TALON, and 4 lines that are in TALON but not MatchAnnot. I should look into these cases further. Let's start by investigating the very first one, c11484/f2p30/3345. The MatchAnnot entry for it is:
When I print debugging messages for this transcript in TALON, I find out that it was assigned to a different gene, RP13-279N23.2. This strikes me as odd and probably wrong. When I dig deeper, I find that c11484/f2p30/3345 overlapped the following genes, with the amount of overlap listed in parentheses:
This suggests to me that in the rare event of a transcript having the same amount of overlap with 2 genes, it is not sufficient to simply return one of them as I am doing. I checked the first three cases on the list, and they all come from the same tied-gene situation. Clearly this is a problem. I changed the gene fetching function to return more than one gene in case of ties, and repeated the comparison. This time, I got the following 6 matches in MatchAnnot but not in TALON (a subset of the last list):
MatchAnnot entry:
Notice that this is a microRNA gene. How long is the transcript? Did we perhaps filter it out with our 200 bp requirement? Checking the length, this does NOT appear to be the case.
Looking at this area in the UCSC genome browser, I noticed something really interesting that points to a problem with MatchAnnot. MIR4425 is the only gene in the neighborhood here, and it intersects with c14773/f1p0/3486. Since there is only one "exon", MatchAnnot classifies this match as a score 5 because "only the 3' and 5' ends are different", which is a bad move. The correct move would be to create a novel gene.

I found a second problem with MatchAnnot when I examined example c23761/f1p0/2920. MatchAnnot assigns the transcript to RP11-61J19.5, but the gene and the transcript are on opposite strands. So I'm comfortable with the fact that TALON does not find a gene match for c23761/f1p0/2920.
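The two behaviors discussed in this comment (returning every gene tied for the largest overlap, and not matching a gene on the opposite strand) can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the actual TALON gene fetching function: `overlap`, `best_gene_matches`, and the tuple layout `(chrom, start, end, strand)` are all hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two closed intervals (0 if they are disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)

def best_gene_matches(query, genes):
    """Return all gene names tied for the largest same-strand overlap.

    query: (chrom, start, end, strand)
    genes: dict mapping gene name -> (chrom, start, end, strand)
    Both structures are assumptions made for this example.
    """
    q_chrom, q_start, q_end, q_strand = query
    scores = {}
    for name, (chrom, start, end, strand) in genes.items():
        if chrom != q_chrom or strand != q_strand:
            continue  # opposite-strand genes are never candidates
        bases = overlap(q_start, q_end, start, end)
        if bases > 0:
            scores[name] = bases
    if not scores:
        return []  # no gene match; the caller may treat this as a novel gene
    top = max(scores.values())
    # Keep every gene that reaches the maximum instead of an arbitrary one
    return sorted(name for name, bases in scores.items() if bases == top)
```

Returning a list leaves the choice among tied genes to a later step, such as checking which gene's transcripts actually share exons with the query.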
Update: Comparing TALON and MatchAnnot after exon-based redesign of TALON (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931)

TALON results:
Total transcripts processed: 1881

Transcripts that matched in MatchAnnot but not in TALON:
There are only three discrepancies.

c14773/f1p0/3486
MatchAnnot assigns this single-exon minus strand transcript to MIR4425-001, and TALON finds no transcript match. The reason for this is that MIR4425 is on the plus strand. This discrepancy is acceptable.

c23761/f1p0/2920
MatchAnnot assigns this single-exon plus strand transcript to RP11-61J19.5-001, while TALON reports no match. Again, the reason is a strand discrepancy, so this is ok.

c33225/f1p0/3407
This is a strand discrepancy that was documented in issue #19. This is ok.

Transcripts that matched in TALON but not in MatchAnnot:
There were zero cases like this.
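For reference, a two-way comparison like the one above boils down to a pair of set differences over transcript IDs. The snippet below is a small sketch of that idea; the function name `compare_id_sets` and the placeholder IDs are made up for illustration, and it is not necessarily how the comparison in this thread was run.

```python
def compare_id_sets(matchannot_ids, talon_ids):
    """Return (IDs only in MatchAnnot, IDs only in TALON), each sorted."""
    matchannot, talon = set(matchannot_ids), set(talon_ids)
    return sorted(matchannot - talon), sorted(talon - matchannot)

# Toy input with placeholder IDs, not real results
only_matchannot, only_talon = compare_id_sets(
    ["tx_a", "tx_b", "tx_c"],
    ["tx_b", "tx_c"],
)
```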
TALON and MatchAnnot find "score 5" matches for virtually the same set of transcripts. But if we look at the transcript assignments themselves, are they the same? (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931) [ANSWERED] Yes, except for multi-matching TALON cases (tiebreaker needs to be implemented)

Note: I have seen cases so far where TALON returns more than one transcript match since I haven't implemented a tiebreaker yet. I won't worry too much about this for now.
Assignments in MatchAnnot that were not in TALON:
c10364/f4p17/3199 TOR1AIP1-015: TALON assigned TOR1AIP1-015,TOR1AIP1-004

All of these cases (except for the three enumerated previously) were situations where TALON matched the query to more than one transcript, including the MatchAnnot answer within this set. So that is fine for now. Once I write a tiebreaker for TALON, these discrepancies should disappear.

Assignments in TALON that were not in MatchAnnot:
When I check this, I find only multi-match cases from the previous section. That's a pass.
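One possible tiebreaker for these multi-match cases is to keep the candidate whose ends differ least from the query, i.e. the one that minimizes the total 5' and 3' end difference. The sketch below assumes that rule and assumes candidates are given as simple `(start, end)` coordinates; `break_tie` and the example names are hypothetical and not taken from TALON.

```python
def break_tie(query_start, query_end, candidates):
    """Pick the candidate transcript with the smallest total end difference.

    candidates: dict mapping transcript name -> (start, end).
    Remaining ties are resolved alphabetically so the result is deterministic.
    """
    def total_end_diff(name):
        start, end = candidates[name]
        # Summing both ends makes the measure independent of strand
        return abs(start - query_start) + abs(end - query_end)

    # min() returns the first minimum, so sorting first fixes the tie order
    return min(sorted(candidates), key=total_end_diff)

# Toy example with made-up coordinates
choice = break_tie(110, 905, {"TX-015": (100, 900), "TX-004": (50, 1000)})
```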
Comparing TALON and MatchAnnot again after tiebreaker implementation for full matches (as of commit dewyman@728805b3c5e04650030b6eeae608e3c441dfea8)
Lines that are in MatchAnnot file but not TALON
c17998/f1p0/2984 comes up twice here. TALON assigns it to HIST2H3A-001. HIST2H3A-001 and HIST2H2AA4-001 both seem like reasonable choices, and the former minimizes the total 3' and 5' difference, so I'm not going to worry too much about this.

Lines that are in TALON but not MatchAnnot
Only one:
Tofu + STAR + MatchAnnot commands:
TALON
Tofu + STAR + MatchAnnot Results:
TALON results (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e):