Comparing TALON prototype results to Tofu and MatchAnnot #15

dewyman · 2018-04-11T22:52:46Z

Tofu + STAR + MatchAnnot commands:

# Run Tofu
source activate myenv
collapse_isoforms_by_sam.py --fq --input GM12878_chr1_canonicalOnly.fq  -s sorted.GM12878_chr1_canonicalOnly.sam --dun-merge-5-shorter -o Tofu_STAR_MatchAnnot/Tofu/tofu
source deactivate

# Remap with STAR using bash script. Here is the essential command though
STARlong --runThreadN 4 --genomeDir /bio/dwyman/pacbio_f2016/data/STAR_hg38_ENCODE \
--readFilesIn /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/Tofu/tofu.collapsed.rep.fq \
--sjdbGTFfile /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
--outFileNamePrefix /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/ \
--outSAMattributes NH HI NM MD jM jI --readNameSeparator space --outFilterMultimapScoreRange 1 --outFilterMismatchNmax 2000 --scoreGapNoncan -20 --scoreGapGCAG -4 --scoreGapATAC -8 --scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --alignEndsType Local --seedSearchStartLmax 50 --seedPerReadNmax 100000 --seedPerWindowNmax 1000 --alignTranscriptsPerReadNmax 100000 --alignTranscriptsPerWindowNmax 10000 --outSAMtype SAM

samtools sort Aligned.out.sam -o sorted.Aligned.out -O sam


# Run MatchAnnot
# Crashed when I didn't have --format alt. Must be some types in the v24 annotation
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/sorted.Aligned.out > \
/bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt

TALON

python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf

Tofu + STAR + MatchAnnot Results:

Total transcripts at MatchAnnot stage: 1526
Number of transcripts assigned a score of 5: 684
- This means that the intron boundaries matched internally, but the leading or trailing 5'/3' ends may have varied
Number of transcripts with perfect 3' and 5' match: 0
Number of transcripts with perfect 3' match but not 5': 8
Number of transcripts with perfect 5' match but not 3': 5

grep "result" match_annot_results.txt | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | wc -l
ot_results.txt | grep "sc: 5" | awk
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($11 ==0 ) print $0}' | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 ==0 ) print $0}' | wc -l

TALON results (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e):

Total transcripts processed: 1881
Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 954

The text was updated successfully, but these errors were encountered:

dewyman · 2018-04-12T05:19:43Z

Why does transcript c1417/f31p12/2759 get a score of 5 in MatchAnnot, but is not among the transcripts with internal exon matches in TALON? [FIXED]

TALON results (which don't look right to me):

Total transcripts processed: 1881
Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 496

Make a SAM file containing only this transcript

grep "c1417/f31p12/2759" test_files/sorted.GM12878_chr1_canonicalOnly.sam > test_files/c1417-f31p12-2759.sam

Run TALON

python talon.py --f test_files/c1417-f31p12-2759.sam -g test_files/gencode.v24.annotation.chr1.gtf

[FIXED] I noticed that the problem is another GTF reading bug. For transcripts on the minus strand, the exons are listed (and numbered) in their 5' to 3' orientation rather than being listed with respect to the forward strand. This is creating problems in my exon string process.

dewyman · 2018-04-12T06:16:27Z

Does TALON recover all score 5 events from Tofu-STAR-MatchAnnot? [ANSWERED] Almost

Collect all transcript IDs that scored a MatchAnnot 5 in one file:

MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
grep "result" test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file

Collect all transcripts with internal exon matches (MatchAnnot score 5 equivalent) from TALON:

TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file

Get number of lines in MatchAnnot file that are not in TALON file:

MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
TALON_file=test_files/permissive_3_5_transcripts.txt
comm -23 $MatchAnnot_file $TALON_file | wc -l

The result was 13 lines (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e), which means that TALON did not quite cover all of the cases that MatchAnnot did. Here are the IDs:
c11484/f2p30/3345
c13754/f2p1/1380
c14744/f1p2/3017
c14773/f1p0/3486
c18234/f1p0/3056
c19214/f1p1/3120
c22984/f1p5/3378
c23761/f1p0/2920
c26076/f2p16/2864
c26372/f1p0/3232
c29587/f4p2/3479
c31433/f1p3/3125
c33225/f1p0/3407

dewyman · 2018-04-12T19:25:06Z

Comparing TALON to direct MatchAnnot (no Tofu-STAR): Do they give the same results for score 5 transcripts?

Run TALON and save internal match transcripts to file (same as previous)

TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file

Run MatchAnnot directly on the same data that TALON ran on

MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MatchAnnot_file=test_files/MatchAnnot_direct/score_5_transcripts.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file

Get number of lines in MatchAnnot file that are not in TALON file

comm -23 $MatchAnnot_file $TALON_file | wc -l

The result is 15 lines that are in MatchAnnot but not TALON, and 4 lines that are in TALON but not MatchAnnot. I should look into these cases further.
c11484/f2p30/3345
c13754/f2p1/1380
c14744/f1p2/3017
c14773/f1p0/3486
c18234/f1p0/3056
c19214/f1p1/3120
c2229/f33p31/3156
c22984/f1p5/3378
c23761/f1p0/2920
c26076/f2p16/2864
c26372/f1p0/3232
c29587/f4p2/3479
c31433/f1p3/3125
c33225/f1p0/3407
c34432/f1p6/3188

Let's start by investigating the very first one, c11484/f2p30/3345. The MatchAnnot entry for it is:

result:   c11484/f2p30/3345    ALDH4A1    ALDH4A1-001    ex: 15  sc: 5  5-3:    -2    10

When I print debugging messages for this transcript in TALON, I find out that it was assigned to a different gene, RP13-279N23.2. This strikes me as odd and probably wrong. When I dig deeper, I find that c11484/f2p30/3345 overlapped the following genes, with the amount listed in parantheses.
gene_matches:

MIR1290 (78)
RP13-279N23.2 (31338)
ALDH4A1 (31338)
MIR4695 (74)
We can see that there was a tie between ALDH4A1 and RP13-279N23.2. Here are the gene annotations from GENCODE v24:

chr1	HAVANA	gene	18871430	18902781	.	-	.gene_id "ENSG00000159423.16"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ALDH4A1"; level 1; tag "ncRNA_host"; havana_gene "OTTHUMG00000002443.6";

chr1	HAVANA	gene	18849273	18921121	.	-	.gene_id "ENSG00000255275.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP13-279N23.2"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000167131.3";

This suggests to me that in the rare event of a transcript having the same amount of overlap with 2 genes, it is not sufficient to simply return one of them as I am doing. I checked on the first three cases, on the list and they all come from the same tied gene situation. Clearly this is a problem. I changed the gene fetching function to return more than one gene in case of ties, and repeated the comparison. This time, I got the following 6 matches in MatchAnnot but not in TALON (a subset of the last list):
c14773/f1p0/3486: MIR4425
c18234/f1p0/3056: NAP1L4P1
c22984/f1p5/3378: MIR34A
c23761/f1p0/2920: RP11-61J19.5
c26372/f1p0/3232: ZNF670
c33225/f1p0/3407: PHTF1
To investigate, I start again with the first transcript, c14773/f1p0/3486.

MatchAnnot entry:

result:   c14773/f1p0/3486    MIR4425    MIR4425-201    ex:  1  sc: 5 rev  5-3:     5 -3402

Notice that this is a microRNA gene. How long is the transcript? Did we perhaps filter it out with our 200 bp requirement? Checking the length, this does NOT appear to be the case.

grep "c14773/f1p0/3486" test_files/sorted.GM12878_chr1_canonicalOnly.sam | awk '{print length($10)}'
3486

Looking at this area in the UCSC genome browser, I noticed something really interesting that points to a problem with MatchAnnot. MIR4425 is the only gene in the neighborhood here, and it intersects with c14773/f1p0/3486. Since there is only one "exon", MatchAnnot classifies this match as a score 5 because "only the 3' and 5' ends are different", which is a bad move. The correct move would be to create a novel gene.

I found a second problem with MatchAnnot when I examined example c23761/f1p0/2920. MatchAnnot assign the transcript to RP11-61J19.5, but the gene and the transcript are on opposite strands. So I'm comfortable with the fact that TALON does not find a gene match for c23761/f1p0/2920.

dewyman · 2018-04-17T18:48:48Z

Update: Comparing TALON and MatchAnnot after exon-based redesign of TALON (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931)

TALON results:

Total transcripts processed: 1881
Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 964

Transcripts that matched in MatchAnnot but not in TALON:

There are only three discrepancies.

c14773/f1p0/3486

MatchAnnot assigns this single-exon minus strand transcript to MIR4425-001, and TALON finds no transcript match. The reason for this is that MIR4425 is on the plus strand. This discrepancy is acceptable.

c23761/f1p0/2920

MatchAnnot assigns this single-exon plus strand transcript to RP11-61J19.5-001, while TALON reports no match. Again, the reason is a strand discrepancy, so this is ok.

c33225/f1p0/3407

This is a strand discrepancy that was documented in issue #19. This is ok.

Transcripts that matched in TALON but not in MatchAnnot:

There were zero cases like this.

dewyman · 2018-04-17T19:34:10Z

TALON and MatchAnnot find "score 5" matches for virtually the same set of transcripts. But if we look at the transcript assignments themselves, are they the same? (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931) [ANSWERED] Yes, except for multi-matching TALON cases (tiebreaker needs to be implemented)

Note: I have seen cases so far where TALON returns more than one transcript match since I haven't implemented a tiebreaker yet. I won't worry too much about this for now.

Run TALON and save internal match transcripts to file (same as previous)

TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort -n -k1,1 > $TALON_file

Run MatchAnnot directly on the same data that TALON ran on

MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file

Get number of lines in MatchAnnot file that are not in TALON file

comm -23 $MA_transcript_file $TALON_file | wc -l

Assignments in MatchAnnot that were not in TALON:

c10364/f4p17/3199 TOR1AIP1-015: TALON assigned TOR1AIP1-015,TOR1AIP1-004
c12687/f3p2/3317 RBM15-004: TALON assigned RBM15-001,RBM15-201,RBM15-004
c13055/f1p27/3190 LRRC41-201: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
~~c14773/f1p0/3486 MIR4425-201~~ This was expected
c15401/f82p47/2854 LRRC41-006: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
c15529/f16p41/2957 LRRC41-201: multiple assignment again
c17240/f4p6/2673 TCEB3-003: multiple assignment
c17998/f1p0/2984 HIST2H2AA4-001: multiple assignment
c17998/f1p0/2984 HIST2H3C-001: multiple assignment
c2355/f16p17/3301 TOR1AIP1-015: multiple assignment
~~c23761/f1p0/2920 RP11-61J19.5-001~~ This was expected
c24723/f2p5/4046 ATPAF1-001: multiple assignment
c26869/f1p8/3622 RSBN1-004: multiple assignment
c30828/f2p12/2747 LRRC41-006: multiple assignment
~~c33225/f1p0/3407 PHTF1-004~~ This was expected
c36587/f1p0/3228 JMJD4-006: multiple assignment
c8235/f3p8/2930 RSBN1-004: multiple assignment

All of these cases (except for the three enumerated previously) were situations where TALON matched the query to more than one transcript, including the MatchAnnot answer within this set. So that is fine for now. Once I write a tiebreaker for TALON, these discrepancies should disappear.

Assignments in TALON that were not in MatchAnnot:

When I check this, I find only multi-match cases from the previous section. That's a pass

dewyman · 2018-04-18T18:30:00Z

Comparing TALON and MatchAnnot again after tiebreaker implementation for full matches (as of commit dewyman@728805b3c5e04650030b6eeae608e3c441dfea8)

Run TALON and save internal match transcripts to file (same as previous)

TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf --d test --o test
grep -v "dataset" test | grep "known" | awk '{print $2,$10}' | sort -n -k1,1  > $TALON_file

Run MatchAnnot directly on the same data that TALON ran on

MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file

Get number of lines in MatchAnnot file that are not in TALON file

comm -23 $MA_transcript_file $TALON_file | wc -l

Lines that are in MatchAnnot file but not TALON

~~c14773/f1p0/3486 MIR4425-201~~
c17998/f1p0/2984 HIST2H2AA4-001
c17998/f1p0/2984 HIST2H3C-001
~~c23761/f1p0/2920 RP11-61J19.5-001~~
~~c33225/f1p0/3407 PHTF1-004~~

c17998/f1p0/2984 comes up twice here. TALON assigns it to HIST2H3A-001. HIST2H3A-001 and HIST2H2AA4-001 both seem like reasonable choices, and the former minimizes the total 3' and 5' difference, so I'm not going to worry too much about this.

Lines that are in TALON but not MatchAnnot

Only one:
c17998/f1p0/2984 HIST2H3A-001

dewyman self-assigned this Apr 11, 2018

dewyman closed this as completed Nov 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparing TALON prototype results to Tofu and MatchAnnot #15

Comparing TALON prototype results to Tofu and MatchAnnot #15

dewyman commented Apr 11, 2018 •

edited

dewyman commented Apr 12, 2018 •

edited

dewyman commented Apr 12, 2018 •

edited

dewyman commented Apr 12, 2018 •

edited

dewyman commented Apr 17, 2018 •

edited

dewyman commented Apr 17, 2018 •

edited

dewyman commented Apr 18, 2018 •

edited

Comparing TALON prototype results to Tofu and MatchAnnot #15

Comparing TALON prototype results to Tofu and MatchAnnot #15

Comments

dewyman commented Apr 11, 2018 • edited

dewyman commented Apr 12, 2018 • edited

Why does transcript c1417/f31p12/2759 get a score of 5 in MatchAnnot, but is not among the transcripts with internal exon matches in TALON? [FIXED]

dewyman commented Apr 12, 2018 • edited

Does TALON recover all score 5 events from Tofu-STAR-MatchAnnot? [ANSWERED] Almost

dewyman commented Apr 12, 2018 • edited

Comparing TALON to direct MatchAnnot (no Tofu-STAR): Do they give the same results for score 5 transcripts?

dewyman commented Apr 17, 2018 • edited

Update: Comparing TALON and MatchAnnot after exon-based redesign of TALON (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931)

TALON results:

Transcripts that matched in MatchAnnot but not in TALON:

c14773/f1p0/3486

c23761/f1p0/2920

c33225/f1p0/3407

Transcripts that matched in TALON but not in MatchAnnot:

dewyman commented Apr 17, 2018 • edited

Assignments in MatchAnnot that were not in TALON:

Assignments in TALON that were not in MatchAnnot:

dewyman commented Apr 18, 2018 • edited

Comparing TALON and MatchAnnot again after tiebreaker implementation for full matches (as of commit dewyman@728805b3c5e04650030b6eeae608e3c441dfea8)

Lines that are in MatchAnnot file but not TALON

Lines that are in TALON but not MatchAnnot

dewyman commented Apr 11, 2018 •

edited

dewyman commented Apr 12, 2018 •

edited

dewyman commented Apr 12, 2018 •

edited

dewyman commented Apr 12, 2018 •

edited

dewyman commented Apr 17, 2018 •

edited

dewyman commented Apr 17, 2018 •

edited

dewyman commented Apr 18, 2018 •

edited