Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Medaka variant calling, annotation, classification: zigosity determination; supporting reads; #244

Closed
MJMCarvalho opened this issue Jan 14, 2021 · 10 comments
Labels

Comments

@MJMCarvalho
Copy link

I am calling variants on C. albicans strains genomes using the following workflow:

Read correction with Canu (v2.0)
Minimap2 for mapping reads against the reference (v2.17)
Variant calling (medaka_variant -f <REFERENCE.fasta> -i <reads.bam>), variant annotation (medaka tools annotate variants.vcf reference.fasta alignments.bam variants.annot.vcf), variant classification with Medaka (medaka tools classify_variants variants.annot.vcf) Medaka v1.2.1
Variant annotation and functional effect prediction with SnpEff (v4.3)

I came across the following situation:

Ca22chr3A_C_albicans_SC5314  3038      .               G             A             26.3085                PASS                AR=0,1;DP=189;DPS=121,68;DPSP=199;SC=14307,8962,13935,8754;SR=116,67,8,7;pos1=3038;pos2=3038;q1=26.819;q2=25.798;type=snp;ANN=A|upstream_gene_variant|MODIFIER|C3_00020W_A|C3_00020W_A|transcript|C3_00020W_A-T|protein_coding||c.-961G>A|||||961|,A|upstream_gene_variant|MODIFIER|C3_00040W_A|C3_00040W_A|transcript|C3_00040W_A-T|protein_coding||c.-2933G>A|||||2933|,A|downstream_gene_variant|MODIFIER|C3_00010C_A|C3_00010C_A|transcript|C3_00010C_A-T|protein_coding||c.*790C>T|||||790|,A|downstream_gene_variant|MODIFIER|C3_00030C_A|C3_00030C_A|transcript|C3_00030C_A-T|protein_coding||c.*1865C>T|||||1865|,A|intergenic_region|MODIFIER|CHR_START-C3_00010C_A|CHR_START-C3_00010C_A|intergenic_region|CHR_START-C3_00010C_A|||n.3038G>A||||||              GT:GQ   1|1:26
 
1. DPSP = Depth of reads spanning pos +-25; I now understand that these reads may not necessarily cover the variant position.
In this case, we have 199 reads which span position 3038 with 25bp upstream and downstream the position. I understand this is useful to phase the haplotypes; I don’t understand how we can assume these reads support the alteration.
 
 
2. We have 183 reads supporting the reference (SR fwd+rev; 116+67) and 15 reads supporting the alteration (SR fwd + rev; 8+7); in total, 198. I suppose these are only reads spanning position +-25bp at position (DPSP); one read was ambiguous (AR). However the GT of the alteration is 1|1, which means it’s homozygous; I don’t understand how the program classifies this variant as homozygous.
Additionally, looking at this position in IGV (print screen attached), it showed total count 290: 183 reads for the alteration (A) and 107 reads for the reference (G). This is not in agreement with the vcf (I understand this is the quality filtering in medaka, but I’d like to understand more about the quality filtering done throughout medaka) and does not support an homozygous alteration.
Would you, please, let me know how medaka calculates the zygosity of the alteration? Is it possible that medaka has switched the number of reads supporting the reference and the alteration?

Screenshot 2020-12-21 at 17 42 41

Thank you.
Kind regards.
Maria

@mwykes
Copy link
Contributor

mwykes commented Jan 15, 2021

Hi Maria,
Thanks for your questions. One thing before that though - you mention that you are using canu for read-correction; are you using corrected reads for medaka variant calling? If so, I would not recommend doing this; Medaka is trained on uncorrected reads and giving medaka corrected reads may result in lower accuracy variant calls unless you retrain a medaka model on corrected reads. Furthermore, our medaka variant calling models are only trained on depth up to 60X, so may not work as well at the very high depth you are using.

Regarding your question 1) about the annotation tool, there is a known bug in the annotation tool affecting the DP field, but not the DPS field. As you say, the DPS field is a count of reads spanning the variant pos -25 -> pos + 25. Hence they do overlap the variant and have sufficient anchoring sequence either side of the variant to allow reliable realignment to the reference and alternative sequence(s). Indeed one cannot assume that all of these reads support the alternative allele - this is what the SR field is for, which bring us onto question 2).

With regards to how medaka calculates zygosity, medaka factors the diploid variant calling problem into two independent sets of consensus and variant calls on each haplotype (see the docs for more information). Whatshap is used to partition reads by haplotype. Reads which can't be haplotyped (e.g. because they don't overlap with a heterozygous SNP) will be used in forming the consensus for both haplotypes. Variant calls on each haplotype are finally combined into a diploid VCF. If a variant is found on both haplotypes it is homozygous, if it is found on just one it will be heterozygous. It is possible that your variant call is homozygous because the reads supporting either A/G are spread across both hapolotypes, and medaka predicts A on both. This is consistent with the medaka quality scores for the consensus call on each haplotype (q1 and q2) are both around 26. What were the genotypes of the neighboring variants (e.g. C->T and C->A)? It is also worth mentioning that medaka's neural network model can take into account several other factors such as local sequence context and strand bias when predicting the consensus sequence. This means it can make predictions which differ to the result of a naive base-counting consensus. Whilst this does make it's predictions harder to interpret, it also the reason why medaka provides higher accuracy overall.

It's worth mentioning that medaka variant calls can be further refined by regenotyping them with DeepVariant. We have not tested this on non-human datasets, but it might be worth trying. A tutorial on how to do this can be found here.

Regarding the read counts, if you are able to share a bam containing the reads around that variant and your reference and vcf I can look into this and see if there is a bug in the code.

@mwykes
Copy link
Contributor

mwykes commented Jan 15, 2021

In case you are able to share these files and would prefer to do so privately, I've just emailed you a secure file-sharing link.

@MJMCarvalho
Copy link
Author

MJMCarvalho commented Jan 16, 2021 via email

@mwykes
Copy link
Contributor

mwykes commented Jan 19, 2021

Thanks Maria.

I ran the annotation tool on the latest pre-release of medaka in which a bug affecting the DP field has been fixed. This shows that the DP is 199 at this variant.

Filtering your bam to retain only primary alignments (which are the only alignments medaka considers) and checking the depth with bedtools confirms that the total depth of primary alignments is indeed 199 at this position.

bedtools coverage -b aln.sorted.bam -a test_variant.vcf 
Ca22chr3A_C_albicans_SC5314     3038    .       G       A       26.3085 PASS    type=snp        GT:GQ   1|1:26  313
samtools view aln.sorted.bam -F 2308 -b > aln.sorted.filt.bam
bedtools coverage -b aln.sorted.filt.bam -a test_variant.vcf                                                 
Ca22chr3A_C_albicans_SC5314     3038    .       G       A       26.3085 PASS    type=snp        GT:GQ   1|1:26  199

You can also get IGV to do this filtering via the view->preferences->alignments menu by ticking the "Filter secondary alignments" and "Filter supplementary alignments" boxes.

This shows the total depth as 189, with 129 A's, 60 G's and 10 deletions (the 10 deletions accounting for the difference between DP=199 and the depth reported by IGV).
image

Hence the variant is supported by a majority of reads, and a majority of reads on each strand.

I also took a look at the phasing around the variant. The bam you sent did not contain HP tags, so I reran medaka_variant
medaka_variant -i aln.sorted.filt.bam -f C_albicans_SC5314_A22_current_chromosomes.fasta -r Ca22chr3A_C_albicans_SC5314
and loaded the bam round_0_hap_mixed_phased.bam into IGV>

The final variant call was the same as your vcf, (despite the fact that I used the filtered bam, illustrating that medaka is doing this filtering internally).
a22chr3A_C_albicans_SC5314 3038 . G A 26.308 PASS pos1=3038;q1=26.818;pos2=3038;q2=25.798 GT:GQ 1/1:26

Colouring and sorting alignments by their HP tag (haplotype assignment) shows that the reads on both haplotypes support the variant, explaining why it was called as a homozygous variant.
image

Hopefully that all makes sense. Please let me know if not.

Mike

@mwykes
Copy link
Contributor

mwykes commented Jan 19, 2021

Hi Maria,

In my last comment, I forgot to address your question about the SR field, (Is it possible that medaka has switched the number of reads supporting the reference and the alteration?). Yes, indeed, something does seem to be wrong here, I will look into this further and update the issue.

Mike

@MJMCarvalho
Copy link
Author

MJMCarvalho commented Jan 20, 2021 via email

@mwykes
Copy link
Contributor

mwykes commented Jan 22, 2021

Hi Maria,

I have checked the SR field. It seems the problem is related to the fact that there seem to be multiple ways in which reads are aligning around that SNP - with one set of reads having multiple ~50 bp insertions within 200 bp of the SNP and another set having a single larger 462 bp insertion at ~3130. The 47 bp insertion within the 25 bp padding of the SNP seems to trip up the annotation tool and results in a different alignment of the trimmed read sequences around the SNP, and hence affects the SR counts.

It it possible that this could be an artifact of the read correction - I would recommend repeating the analysis on the uncorrected reads. It may also be preferable not to error-correct reads when dealing with a diploid genome; unless you are using ploidy-aware methods, you may run the risk of correcting "errors" in reads which are actually heterozygous variants.

The impact of the 47-base insertion can be avoided by setting --pad 3 , which yields SR values which favour the alternative allele. There are however still inconsistent with the IGV base counts.

Ca22chr3A_C_albicans_SC5314 3038 . G A 26.3085 PASS AR=3,5;DP=199;DPS=124,75;DPSP=197;SC=2081,1265,2624,1510;SR=29,18,92,50;type=snp GT:GQ 1|1:26

Having seen the limited use of these SR annotations, we have already made the read-realignment part of the annotation optional in the latest pre-release of medaka and are considering removing it from the next release, such that only depth-related annotations are added.

Let me know if you have any other questions about this.

Mike

@MJMCarvalho
Copy link
Author

MJMCarvalho commented Jan 28, 2021 via email

@MJMCarvalho
Copy link
Author

Dear Mike

I have Candida albicans isolates ONT and Illumina sequenced. I would like to call variants on these genomes, hence I used medaka. I would like to improve the previous analysis I have done and be able to interpret medaka outputs to filter out high quality called SNPs.
I am wondering if you have seen my latest question and if you could help me.

Thanks
Maria

@mwykes
Copy link
Contributor

mwykes commented Mar 23, 2021

Hi Maria,

Apologies for the slow reply.

I don't think I can offer any advice other than that I have already given, namely how we would recommend you use medaka, and that DeepVariant genotyping of medaka or PEPPER variants can improve accuracy.

It's worth noting that Medaka is supported as a "Research Release" only. Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software is minimal and is only provided directly by the developers.

Kind regards,
Mike,

@mwykes mwykes closed this as completed Mar 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants