-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate MiCall on SARS-COV-2 Illumina data #42
Comments
It is not in the raw data: Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGAAGGCTA
Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGGCTA
GGCGGCTATGCTCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTTTCGCTAGCATTTCAGTAGTGCCACCAGCCTTTTTAGTAGGTATAACCACAGCAGTTAAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTTCGAACATATCGT
CATTAATTTGCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCCCCTCTACAGTGTAACCAAATTTAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTT Meaning that MiCall is making it up. I'm going to have to trace back through the intermediate outputs to figure out where this is coming from. |
It does appear once in the second read file:
Is MiCall stitching this one insertion into the data? |
Options:
|
I got 100% map rate with a single run of bowtie2, local alignment, so I don't think MiCall's iterative approach to mapping is necessary. Will adapt functions from |
@donkirkby can you please help me reproduce this issue with cfe-lab/MiCall? |
New pipeline is picking the deletion up:
Closing issue and tracking any further issues in new repo. |
I'm getting our references set up today, @ArtPoon, and I'll let you know what I see when I run this sample. |
Thanks @donkirkby ! |
OK, I got some results out of MiCall, and I'm starting to look through them. First item: your mystery insertion. I see it, but as a minority variant. It's also in a slightly different position from yours. The MAX consensus shows a deletion at that position. Here's a section of the consensus sequences at different cutoffs from 20290 to 20310:
The lower case codes represent a mixture of nucleotides and deletions. My guess is that MiCall-lite still reports any nucleotide reads over deletions, even if the deletions are over 75% of the reads as in this case. |
Thanks for checking this out @donkirkby - the new repo is set private for now. I'll publish it soon, I just need time to fix the |
I apologize @donkirkby, I see that we're the only contributors to that code - but why no inline comments? Throw me a bone! 😄 |
The programmer that has annoyed me the most throughout my career has always been myself from six months earlier. Would you like to go through the code together tomorrow morning over Google hangouts? I'm free at 10am PDT. |
Fair enough - I've written some real pigs as a perennial amateur. |
You might not need such drastic changes, @ArtPoon. I think the reason you're not getting a clean deletion is that you're not counting deletions when building the consensus:
Here's the equivalent code in our version:
|
Where did you get the GISAID consensus sequence? I'd like to compare my results. |
Thanks @donkirkby - I figured I had some legacy code in my branch that had been changed since the fork, but I simply didn't have the bandwidth to search for it this week. I'm pretty much finished the other pipeline anyhow, and I really don't think iterative remapping is necessary, so I'm probably going to ahead with validating my pared down version of MiCall. I registered for GISAID and then searched for a sequence with a matching descriptor. |
I registered for GISAID and found Accession EPI_ISL_408670, but it took me a while to figure out that the descriptions in the SRA abstract for SRR11140746 (SARS-CoV-2/2019-nCoV/USA-WI-1/2020) loosely match the virus name in GISAID for EPI_ISL_408670 (hCoV-19/USA/WI1/2020). Thanks for the clues! After all that, I can say that MiCall generates a matching consensus to EPI_ISL_408670, for positions from 114 to 29835. The 113 bases before that and the 44 bases after had fewer than 100 reads, and weren't reported. |
Sorry @donkirkby - I didn't mean to obscure the source. I should have provided the EPI identifier. |
Since the primary objective is to obtain a consensus sequence rather than to sample at depth (i.e., to detect minor variants), shouldn't MiCall use a lower reporting threshold? |
You did provide the EPI identifier, @ArtPoon. That's the clue I was thanking you for! I was just struggling to figure out how you knew they were the same sample. I'll discuss the reporting threshold with the team. Thanks for the suggestion! |
Published new repo at http://github.com/PoonLab/sam2conseq |
The text was updated successfully, but these errors were encountered: