Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparing TALON prototype results to Tofu and MatchAnnot #15

Closed
dewyman opened this issue Apr 11, 2018 · 6 comments
Closed

Comparing TALON prototype results to Tofu and MatchAnnot #15

dewyman opened this issue Apr 11, 2018 · 6 comments
Assignees

Comments

@dewyman
Copy link
Member

dewyman commented Apr 11, 2018

Tofu + STAR + MatchAnnot commands:

# Run Tofu
source activate myenv
collapse_isoforms_by_sam.py --fq --input GM12878_chr1_canonicalOnly.fq  -s sorted.GM12878_chr1_canonicalOnly.sam --dun-merge-5-shorter -o Tofu_STAR_MatchAnnot/Tofu/tofu
source deactivate

# Remap with STAR using bash script. Here is the essential command though
STARlong --runThreadN 4 --genomeDir /bio/dwyman/pacbio_f2016/data/STAR_hg38_ENCODE \
--readFilesIn /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/Tofu/tofu.collapsed.rep.fq \
--sjdbGTFfile /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
--outFileNamePrefix /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/ \
--outSAMattributes NH HI NM MD jM jI --readNameSeparator space --outFilterMultimapScoreRange 1 --outFilterMismatchNmax 2000 --scoreGapNoncan -20 --scoreGapGCAG -4 --scoreGapATAC -8 --scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --alignEndsType Local --seedSearchStartLmax 50 --seedPerReadNmax 100000 --seedPerWindowNmax 1000 --alignTranscriptsPerReadNmax 100000 --alignTranscriptsPerWindowNmax 10000 --outSAMtype SAM

samtools sort Aligned.out.sam -o sorted.Aligned.out -O sam


# Run MatchAnnot
# Crashed when I didn't have --format alt. Must be some types in the v24 annotation
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/sorted.Aligned.out > \
/bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt

TALON

python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf

Tofu + STAR + MatchAnnot Results:

  • Total transcripts at MatchAnnot stage: 1526
  • Number of transcripts assigned a score of 5: 684
    • This means that the intron boundaries matched internally, but the leading or trailing 5'/3' ends may have varied
  • Number of transcripts with perfect 3' and 5' match: 0
  • Number of transcripts with perfect 3' match but not 5': 8
  • Number of transcripts with perfect 5' match but not 3': 5
grep "result" match_annot_results.txt | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | wc -l
ot_results.txt | grep "sc: 5" | awk
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($11 ==0 ) print $0}' | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 ==0 ) print $0}' | wc -l

TALON results (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e):

  • Total transcripts processed: 1881
  • Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 954
@dewyman dewyman self-assigned this Apr 11, 2018
@dewyman
Copy link
Member Author

dewyman commented Apr 12, 2018

Why does transcript c1417/f31p12/2759 get a score of 5 in MatchAnnot, but is not among the transcripts with internal exon matches in TALON? [FIXED]

TALON results (which don't look right to me):

  • Total transcripts processed: 1881
  • Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 496
  1. Make a SAM file containing only this transcript
grep "c1417/f31p12/2759" test_files/sorted.GM12878_chr1_canonicalOnly.sam > test_files/c1417-f31p12-2759.sam
  1. Run TALON
python talon.py --f test_files/c1417-f31p12-2759.sam -g test_files/gencode.v24.annotation.chr1.gtf

[FIXED] I noticed that the problem is another GTF reading bug. For transcripts on the minus strand, the exons are listed (and numbered) in their 5' to 3' orientation rather than being listed with respect to the forward strand. This is creating problems in my exon string process.

@dewyman
Copy link
Member Author

dewyman commented Apr 12, 2018

Does TALON recover all score 5 events from Tofu-STAR-MatchAnnot? [ANSWERED] Almost

  1. Collect all transcript IDs that scored a MatchAnnot 5 in one file:
MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
grep "result" test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file
  1. Collect all transcripts with internal exon matches (MatchAnnot score 5 equivalent) from TALON:
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file
  1. Get number of lines in MatchAnnot file that are not in TALON file:
MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
TALON_file=test_files/permissive_3_5_transcripts.txt
comm -23 $MatchAnnot_file $TALON_file | wc -l

The result was 13 lines (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e), which means that TALON did not quite cover all of the cases that MatchAnnot did. Here are the IDs:
c11484/f2p30/3345
c13754/f2p1/1380
c14744/f1p2/3017
c14773/f1p0/3486
c18234/f1p0/3056
c19214/f1p1/3120
c22984/f1p5/3378
c23761/f1p0/2920
c26076/f2p16/2864
c26372/f1p0/3232
c29587/f4p2/3479
c31433/f1p3/3125
c33225/f1p0/3407

@dewyman
Copy link
Member Author

dewyman commented Apr 12, 2018

Comparing TALON to direct MatchAnnot (no Tofu-STAR): Do they give the same results for score 5 transcripts?

  1. Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file
  1. Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MatchAnnot_file=test_files/MatchAnnot_direct/score_5_transcripts.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file
  1. Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MatchAnnot_file $TALON_file | wc -l

The result is 15 lines that are in MatchAnnot but not TALON, and 4 lines that are in TALON but not MatchAnnot. I should look into these cases further.
c11484/f2p30/3345
c13754/f2p1/1380
c14744/f1p2/3017
c14773/f1p0/3486
c18234/f1p0/3056
c19214/f1p1/3120
c2229/f33p31/3156
c22984/f1p5/3378
c23761/f1p0/2920
c26076/f2p16/2864
c26372/f1p0/3232
c29587/f4p2/3479
c31433/f1p3/3125
c33225/f1p0/3407
c34432/f1p6/3188

Let's start by investigating the very first one, c11484/f2p30/3345. The MatchAnnot entry for it is:

result:   c11484/f2p30/3345    ALDH4A1    ALDH4A1-001    ex: 15  sc: 5  5-3:    -2    10

When I print debugging messages for this transcript in TALON, I find out that it was assigned to a different gene, RP13-279N23.2. This strikes me as odd and probably wrong. When I dig deeper, I find that c11484/f2p30/3345 overlapped the following genes, with the amount listed in parantheses.
gene_matches:

  • MIR1290 (78)
  • RP13-279N23.2 (31338)
  • ALDH4A1 (31338)
  • MIR4695 (74)
    We can see that there was a tie between ALDH4A1 and RP13-279N23.2. Here are the gene annotations from GENCODE v24:
chr1	HAVANA	gene	18871430	18902781	.	-	.gene_id "ENSG00000159423.16"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ALDH4A1"; level 1; tag "ncRNA_host"; havana_gene "OTTHUMG00000002443.6";

chr1	HAVANA	gene	18849273	18921121	.	-	.gene_id "ENSG00000255275.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP13-279N23.2"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000167131.3";

This suggests to me that in the rare event of a transcript having the same amount of overlap with 2 genes, it is not sufficient to simply return one of them as I am doing. I checked on the first three cases, on the list and they all come from the same tied gene situation. Clearly this is a problem. I changed the gene fetching function to return more than one gene in case of ties, and repeated the comparison. This time, I got the following 6 matches in MatchAnnot but not in TALON (a subset of the last list):
c14773/f1p0/3486: MIR4425
c18234/f1p0/3056: NAP1L4P1
c22984/f1p5/3378: MIR34A
c23761/f1p0/2920: RP11-61J19.5
c26372/f1p0/3232: ZNF670
c33225/f1p0/3407: PHTF1
To investigate, I start again with the first transcript, c14773/f1p0/3486.

MatchAnnot entry:

result:   c14773/f1p0/3486    MIR4425    MIR4425-201    ex:  1  sc: 5 rev  5-3:     5 -3402

Notice that this is a microRNA gene. How long is the transcript? Did we perhaps filter it out with our 200 bp requirement? Checking the length, this does NOT appear to be the case.

grep "c14773/f1p0/3486" test_files/sorted.GM12878_chr1_canonicalOnly.sam | awk '{print length($10)}'
3486

Looking at this area in the UCSC genome browser, I noticed something really interesting that points to a problem with MatchAnnot. MIR4425 is the only gene in the neighborhood here, and it intersects with c14773/f1p0/3486. Since there is only one "exon", MatchAnnot classifies this match as a score 5 because "only the 3' and 5' ends are different", which is a bad move. The correct move would be to create a novel gene.

I found a second problem with MatchAnnot when I examined example c23761/f1p0/2920. MatchAnnot assign the transcript to RP11-61J19.5, but the gene and the transcript are on opposite strands. So I'm comfortable with the fact that TALON does not find a gene match for c23761/f1p0/2920.

@dewyman
Copy link
Member Author

dewyman commented Apr 17, 2018

Update: Comparing TALON and MatchAnnot after exon-based redesign of TALON (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931)

TALON results:

Total transcripts processed: 1881
Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 964

Transcripts that matched in MatchAnnot but not in TALON:

There are only three discrepancies.

c14773/f1p0/3486

MatchAnnot assigns this single-exon minus strand transcript to MIR4425-001, and TALON finds no transcript match. The reason for this is that MIR4425 is on the plus strand. This discrepancy is acceptable.

c23761/f1p0/2920

MatchAnnot assigns this single-exon plus strand transcript to RP11-61J19.5-001, while TALON reports no match. Again, the reason is a strand discrepancy, so this is ok.

c33225/f1p0/3407

This is a strand discrepancy that was documented in issue #19. This is ok.

Transcripts that matched in TALON but not in MatchAnnot:

There were zero cases like this.

@dewyman
Copy link
Member Author

dewyman commented Apr 17, 2018

TALON and MatchAnnot find "score 5" matches for virtually the same set of transcripts. But if we look at the transcript assignments themselves, are they the same? (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931) [ANSWERED] Yes, except for multi-matching TALON cases (tiebreaker needs to be implemented)

Note: I have seen cases so far where TALON returns more than one transcript match since I haven't implemented a tiebreaker yet. I won't worry too much about this for now.

  1. Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort -n -k1,1 > $TALON_file
  1. Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file
  1. Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MA_transcript_file $TALON_file | wc -l

Assignments in MatchAnnot that were not in TALON:

c10364/f4p17/3199 TOR1AIP1-015: TALON assigned TOR1AIP1-015,TOR1AIP1-004
c12687/f3p2/3317 RBM15-004: TALON assigned RBM15-001,RBM15-201,RBM15-004
c13055/f1p27/3190 LRRC41-201: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
c14773/f1p0/3486 MIR4425-201 This was expected
c15401/f82p47/2854 LRRC41-006: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
c15529/f16p41/2957 LRRC41-201: multiple assignment again
c17240/f4p6/2673 TCEB3-003: multiple assignment
c17998/f1p0/2984 HIST2H2AA4-001: multiple assignment
c17998/f1p0/2984 HIST2H3C-001: multiple assignment
c2355/f16p17/3301 TOR1AIP1-015: multiple assignment
c23761/f1p0/2920 RP11-61J19.5-001 This was expected
c24723/f2p5/4046 ATPAF1-001: multiple assignment
c26869/f1p8/3622 RSBN1-004: multiple assignment
c30828/f2p12/2747 LRRC41-006: multiple assignment
c33225/f1p0/3407 PHTF1-004 This was expected
c36587/f1p0/3228 JMJD4-006: multiple assignment
c8235/f3p8/2930 RSBN1-004: multiple assignment

All of these cases (except for the three enumerated previously) were situations where TALON matched the query to more than one transcript, including the MatchAnnot answer within this set. So that is fine for now. Once I write a tiebreaker for TALON, these discrepancies should disappear.

Assignments in TALON that were not in MatchAnnot:

When I check this, I find only multi-match cases from the previous section. That's a pass

@dewyman
Copy link
Member Author

dewyman commented Apr 18, 2018

Comparing TALON and MatchAnnot again after tiebreaker implementation for full matches (as of commit dewyman@728805b3c5e04650030b6eeae608e3c441dfea8)

  1. Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf --d test --o test
grep -v "dataset" test | grep "known" | awk '{print $2,$10}' | sort -n -k1,1  > $TALON_file
  1. Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file
  1. Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MA_transcript_file $TALON_file | wc -l

Lines that are in MatchAnnot file but not TALON

c14773/f1p0/3486 MIR4425-201
c17998/f1p0/2984 HIST2H2AA4-001
c17998/f1p0/2984 HIST2H3C-001
c23761/f1p0/2920 RP11-61J19.5-001
c33225/f1p0/3407 PHTF1-004

c17998/f1p0/2984 comes up twice here. TALON assigns it to HIST2H3A-001. HIST2H3A-001 and HIST2H2AA4-001 both seem like reasonable choices, and the former minimizes the total 3' and 5' difference, so I'm not going to worry too much about this.

Lines that are in TALON but not MatchAnnot

Only one:
c17998/f1p0/2984 HIST2H3A-001

@dewyman dewyman closed this as completed Nov 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant