Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CIGAR operations out of range after splice junction correction #16

Open
msto opened this issue Sep 27, 2019 · 5 comments
Open

CIGAR operations out of range after splice junction correction #16

msto opened this issue Sep 27, 2019 · 5 comments

Comments

@msto
Copy link

msto commented Sep 27, 2019

Hi,

Since the update in #14 to update transcripts after correcting splice junctions, I get the following error. I think there may be a bug in the rescueNoncanonicalJunction function, which results in a CIGAR string incompatible with the sequence (i.e. more match/insert/sub operations than there are characters in the sequence), but I haven't been able to trace the exact source of the error. Could you please look into it if you have the chance?

Thanks!

  File "bin/TranscriptClean_fork/TranscriptClean.py", line 1042, in <module>
    main()
  File "bin/TranscriptClean_fork/TranscriptClean.py", line 171, in main
    writeTranscriptOutput(noncanTranscripts, sjDict, oSam, oFa, transcriptLog, genome)
  File "bin/TranscriptClean_fork/TranscriptClean.py", line 184, in writeTranscriptOutput
    outSam.write(Transcript2.printableSAM(currTranscript, genome, spliceAnnot) + "\n")
  File "/scratch/mrstone3/pacbio_iPSC/bin/TranscriptClean_fork/transcript2.py", line 294, in printableSAM
    self.NM, self.MD = self.getNMandMDFlags(genome)
  File "/scratch/mrstone3/pacbio_iPSC/bin/TranscriptClean_fork/transcript2.py", line 360, in getNMandMDFlags
    currBase = self.SEQ[seqPos]
IndexError: string index out of range```
@dewyman
Copy link
Member

dewyman commented Sep 27, 2019

Hi Matt,
Have you tried running the TranscriptClean 2.0 version on this data instead of the patch update? I am hopeful that my test cases covered this issue in the new version of the code, although if this error happens there too I would certainly want to know so that I can address it. Another benefit of the most recent version is that you can run it on multiple cores for a better runtime.
-Dana

@dewyman
Copy link
Member

dewyman commented Oct 1, 2019

Update: I was able to find an example in some data I had of a noisy read that triggered a possibly related problem during non canonical splice junction correction. I fixed the problem and added more tests. The patch will be included in v2.0.2. Hope this helps!

@msto
Copy link
Author

msto commented Nov 4, 2019

Hi Dana,

Thanks! The patch fixed that issue. Sorry to bother you again, but I found another bug that I was hoping you could take a look at. I'm not sure if it's related to the problem you found, so I thought I'd post it here.

I'm finding that after correcting noncanonical splice junctions, some transcripts have all of their canonical splice junctions shifted by 1 bp. I've attached an example of an affected transcript here. The issue appears when correcting with respect to either the gencode reference junctions or a set of junctions from Illumina sequencing of the same sample. Please let me know if you can't reproduce and I can share the gencode file I'm using.

Thanks,

Matt

transcriptclean_example.tar.gz

@dewyman
Copy link
Member

dewyman commented Nov 5, 2019

Hi Matt,
Thank you for bringing this to my attention. I will take a look at the files you sent and see if I can reproduce the issue.
Best,
Dana

@dewyman
Copy link
Member

dewyman commented Nov 14, 2019

Hi Matt,
So I tried to reproduce the issue with the latest TranscriptClean version (including reducing it down to a small test case of one junction), and I have not been able to get the same result. Here is the link to a UCSC genome browser session I am using to visualize the reads and get to the bottom of what is going on. The uncorrected and corrected files shown are the ones you sent me, and the one marked 'dana_run_corrected' is the corrected file I obtained when I ran my local version of TC on your file. Using the genome browser, I noticed in your corrected file that the sequence looks messed up right after the leftmost splice junction ends, which made me think that maybe the sequence was being handled incorrectly in some way that led to everything after the junction being shifted over. But as I say, I haven't been able to reproduce the problem.
Let me know what you think- it would be great to get to the bottom of this!
Best,
Dana

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants