joerivandervelde/vkgl-protein-folding

Protein folding on VKGL data

[Figure: Protein structure example of the CFTR gene]

Coverage of human protein structures has recently increased greatly thanks to deep learning [1], while sophisticated algorithms to predict protein stability have become accessible to mainstream users on commodity hardware [2]. Protein stability may be decreased by coding DNA variation, which can even lead to protein misfolding, an important mechanism of pathogenicity [3]. The potential for genome diagnostics to recognize and report such variation has been shown [4], but the ΔΔG thresholds used for interpretation in this context differ greatly between proteins [4].

Here, we perform protein folding on DNA variation shared by Dutch genome diagnostic laboratories in the VKGL Data Sharing working group. Essentially, we calculate the difference in Gibbs free energy change (ΔΔG) between wild-type protein sequences and variant sequences. A higher ΔΔG indicates that more energy is required for folding, making it less favourable and more prone to pathogenic misfolding. The amino acid changes caused by the DNA variation are based on GRCh37 and are introduced into the AlphaFold2 human proteome structures to enable automated processing.
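The ΔΔG calculation described above comes down to a single subtraction. The following is a minimal sketch of the sign convention only; the function name and the energy values are hypothetical, and the stability energies themselves are assumed to come from a separate tool such as FoldX.

```python
def delta_delta_g(dg_wild_type: float, dg_variant: float) -> float:
    """Return ddG = dG(variant) - dG(wild type), in the units of the inputs."""
    return dg_variant - dg_wild_type


# Hypothetical folding energies, for illustration only (not repository data).
ddg = delta_delta_g(dg_wild_type=-12.0, dg_variant=-9.5)
print(f"ddG = {ddg:+.2f}")
```

With this convention, a positive value means the variant is predicted to be less stable than the wild type.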

We selected genes for which many Variants of Unknown Significance (VUS) have been reported, making them candidates for re-interpretation, or that are otherwise of high clinical interest. In addition, we required a substantial number of known benign and pathogenic variants per gene to increase the chances of success. Lastly, the protein products of the selected genes had to consist of single-fragment monomers.

We calculate the ΔΔG for benign and pathogenic variants and use Youden's J statistic to estimate an optimal threshold between these two groups. For example, the threshold for the cystic fibrosis transmembrane conductance regulator (CFTR) protein is placed at 1.39, shown below. According to these data, the chance that a new variant above this threshold is correctly labeled as 'pathogenic' is 90% (positive predictive value, PPV), and 68% of all pathogenic variants can be found this way (sensitivity).
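As an illustration of threshold estimation with Youden's J statistic (J = sensitivity + specificity - 1), the sketch below tries every observed ΔΔG value as a candidate threshold and keeps the one with the highest J. It is not the repository's implementation: the function names, the choice to call a variant 'pathogenic' when its ΔΔG is at or above the threshold, and the ΔΔG values are all assumptions made for this example.

```python
def youden_threshold(benign, pathogenic):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    Assumption: a variant is called 'pathogenic' when its ddG >= threshold.
    """
    best_threshold, best_j = None, -1.0
    for candidate in sorted(set(benign) | set(pathogenic)):
        sensitivity = sum(x >= candidate for x in pathogenic) / len(pathogenic)
        specificity = sum(x < candidate for x in benign) / len(benign)
        j = sensitivity + specificity - 1
        if j > best_j:  # keep the first (lowest) threshold on ties
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


def ppv_and_sensitivity(benign, pathogenic, threshold):
    """Return (PPV, sensitivity) for calls made at the given threshold."""
    tp = sum(x >= threshold for x in pathogenic)
    fp = sum(x >= threshold for x in benign)
    fn = len(pathogenic) - tp
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return ppv, tp / (tp + fn)


# Hypothetical ddG values, for illustration only (not repository data).
benign = [-0.4, 0.1, 0.3, 0.8, 1.6]
pathogenic = [1.2, 1.5, 2.2, 3.0, 4.1]

threshold, j = youden_threshold(benign, pathogenic)
ppv, sensitivity = ppv_and_sensitivity(benign, pathogenic, threshold)
print(f"threshold={threshold}, J={j:.2f}, PPV={ppv:.2f}, sensitivity={sensitivity:.2f}")
```

Scanning only observed values is enough here because sensitivity and specificity change only at those points.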

[Figure: CFTR folding on VKGL variants]

In total, we folded 5869 mutant proteins for DNA variation in 55 genes. These variants were classified as 1621 likely benign or benign (LB/B), 1678 likely pathogenic or pathogenic (LP/P), 2482 VUS and 88 conflicting, i.e. multiple non-identical classifications of the same protein change. The mean ΔΔG threshold across these genes was 1.48 (95% CI: 0.97-2), which is comparable to the previously found thresholds of 1.58 and 1.50 [5].
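The text does not state how the confidence interval around the mean threshold was computed. One common choice, shown here purely as an assumption, is a normal approximation: mean ± 1.96 × standard error. The threshold values below are hypothetical.

```python
import math
import statistics


def mean_with_ci95(values):
    """Return (mean, lower, upper) using a normal approximation (z = 1.96)."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    half_width = 1.96 * standard_error
    return mean, mean - half_width, mean + half_width


# Hypothetical per-gene thresholds, for illustration only (not repository data).
thresholds = [0.5, 1.0, 1.5, 2.0, 2.5]
mean, lower, upper = mean_with_ci95(thresholds)
print(f"mean={mean:.2f} (95% CI: {lower:.2f}-{upper:.2f})")
```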

We defined a gene as having a trustworthy threshold if enough samples were used for estimation (LB/B n >= 10 and LP/P n >= 10) and the PPV was >= 90%. This resulted in 10 genes: CFTR, MLH1, ATP7B, LDLR, SCN1A, FGFR2, F8, SLC12A3, MSH2, and NPC1. We investigated the potential for re-classification of the 456 VUS present in these 10 genes. Applying the respective gene thresholds resulted in 166 VUS that might be considered candidates for re-classification as likely pathogenic:

Gene Candidates for re-classification
ATP7B 17
CFTR 22
F8 14
FGFR2 6
LDLR 17
MLH1 26
MSH2 23
NPC1 4
SCN1A 21
SLC12A3 16
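The selection criteria above (LB/B n >= 10, LP/P n >= 10, PPV >= 90%) and the per-gene thresholding of VUS can be sketched as follows. The class, the function names, the gene labels and all values are hypothetical, and treating a ΔΔG equal to the threshold as a candidate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GeneThreshold:
    gene: str
    threshold: float   # ddG threshold estimated for this gene
    n_benign: int      # number of LB/B variants used for estimation
    n_pathogenic: int  # number of LP/P variants used for estimation
    ppv: float         # positive predictive value at the threshold


def is_trustworthy(g, min_n=10, min_ppv=0.90):
    """Apply the sample-size and PPV criteria described in the text."""
    return g.n_benign >= min_n and g.n_pathogenic >= min_n and g.ppv >= min_ppv


def reclassification_candidates(genes, vus_ddg):
    """Count VUS per trusted gene with ddG at or above that gene's threshold."""
    counts = {}
    for g in genes:
        if is_trustworthy(g):
            counts[g.gene] = sum(d >= g.threshold for d in vus_ddg.get(g.gene, []))
    return counts


# Hypothetical genes and VUS ddG values, for illustration only.
genes = [
    GeneThreshold("GENE_A", 1.4, 25, 30, 0.92),  # passes all criteria
    GeneThreshold("GENE_B", 2.0, 8, 40, 0.95),   # too few benign variants
    GeneThreshold("GENE_C", 1.0, 20, 20, 0.80),  # PPV too low
]
vus_ddg = {"GENE_A": [0.2, 1.4, 3.1], "GENE_B": [2.5], "GENE_C": [1.2]}

print(reclassification_candidates(genes, vus_ddg))
```

Only GENE_A passes the filter in this example, so the VUS of the other two genes are never counted, however high their ΔΔG.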

Interestingly, two genes were found to have a ΔΔG threshold outside the 95% CI of the mean threshold (> 2 or < 0.97). SLC12A3 has a ΔΔG threshold of 0.41, based on folded structures for 11 benign and 47 pathogenic variants, with a PPV of 93%, an NPV of 47%, a sensitivity of 81% and a specificity of 73%. MSH2 has a ΔΔG threshold of 3.66, based on folded structures for 46 benign and 18 pathogenic variants, with a PPV of 100%, an NPV of 87%, a sensitivity of 61% and a specificity of 100%.
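To make the relationship between these four metrics concrete, the sketch below computes them from a confusion matrix. The counts are not taken from the repository: they were inferred here from the reported group sizes and rounded percentages, so treat them as an assumption.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return PPV, NPV, sensitivity and specificity as fractions."""
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts inferred from the reported percentages (assumption, not source data).
# SLC12A3: 47 pathogenic (38 above threshold), 11 benign (8 below threshold).
slc12a3 = diagnostic_metrics(tp=38, fp=3, tn=8, fn=9)
# MSH2: 18 pathogenic (11 above threshold), 46 benign (all below threshold).
msh2 = diagnostic_metrics(tp=11, fp=0, tn=46, fn=7)

for name, metrics in (("SLC12A3", slc12a3), ("MSH2", msh2)):
    summary = ", ".join(f"{k}={v:.0%}" for k, v in metrics.items())
    print(f"{name}: {summary}")
```

After rounding to whole percentages, these inferred counts reproduce the values reported above for both genes.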

While these results seem to confirm the potential of protein folding for genome diagnostics, the sample size used is relatively small, and AlphaFold2 structures might neither accurately represent actual biological protein structures nor take into account the highly complex context in which these proteins perform their function.

Acknowledgements

  • Rene Mulder
  • Jan D.H. Jongbloed
  • Kristin M. Abbott
  • Birgit Sikkema-Raddatz
  • Helga Westers

Data used

VKGL public consensus release April 2023

  • Fokkema, IFAC, van der Velde, KJ, Slofstra, MK, et al. Dutch genome diagnostic laboratories accelerated and improved variant interpretation and increased accuracy by sharing data. Human Mutation. 2019; 40: 2230–2238. https://doi.org/10.1002/humu.23896
  • Direct download link

AlphaFold2 human proteome v4

Software used

FoldX 5.0

R 4.2.3

  • R version 4.2.3 (2023-03-15) -- "Shortstop Beagle" Copyright (C) 2023 The R Foundation for Statistical Computing Platform: aarch64-apple-darwin20 (64-bit)
  • Download via R-project

Java 18.0.1

  • Oracle OpenJDK version 18.0.1
  • Download
