In the following, we show how to prepare datasets of human SAVs, with their features and true labels, for training the Rhapsody classifier.

## List of human missense variants 

We start by importing lists of human SAVs, with their pathogenicity assessments, compiled from the following publicly available datasets:
* **Integrated Dataset**, obtained by combining:
    * **HumVar, ExoVar, predictSNP, VariBench, SwissVar**, 5 datasets of labelled human missense variants already used in our [previous publication](https://www.pnas.org/content/115/16/4164)
    * **Humsavar**, a database of "human polymorphisms and disease mutations" available on [UniProt](https://www.uniprot.org/docs/humsavar)
    * **ClinVar**, a "public archive of reports of the relationships among human variations and phenotypes, with supporting evidence" [(FTP site)](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)
* **ClinVar** only 

We also consider ClinVar separately because it provides a *review score*, a rating out of 4 stars, which lets us test performance at different levels of "confidence in the accuracy of variation calls and assertions of clinical significance".

When combining datasets into the Integrated Dataset, SAVs with discordant interpretations are assigned `true_label = -1`.
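The rule for discordant labels can be illustrated with a minimal sketch. This is not the code used to build the Integrated Dataset: the function `merge_labels`, the toy labels, and the accession `FAKE01` are hypothetical, and we assume that any disagreement between sources (including an "unknown" `-1` in one of them) yields `-1`.

```python
from collections import defaultdict

def merge_labels(datasets):
    """Combine per-dataset labels; SAVs with discordant labels get -1.

    `datasets` maps a dataset name to a dict {SAV_coords: label}.
    Returns {SAV_coords: (true_label, comma-separated dataset names)}.
    """
    seen = defaultdict(dict)
    for name, labels in datasets.items():
        for sav, label in labels.items():
            seen[sav][name] = label

    merged = {}
    for sav, by_dataset in seen.items():
        distinct = set(by_dataset.values())
        # a single distinct value means all sources agree
        label = distinct.pop() if len(distinct) == 1 else -1
        merged[sav] = (label, ','.join(sorted(by_dataset)))
    return merged

# toy input: the second SAV and its labels are made up for illustration
toy = {
    'humsavar': {'A0AV02 181 R C': 0, 'FAKE01 10 A T': 1},
    'swissvar': {'A0AV02 181 R C': 0, 'FAKE01 10 A T': 0},
}
merged = merge_labels(toy)
print(merged['A0AV02 181 R C'])  # concordant: label is kept
print(merged['FAKE01 10 A T'])   # discordant: label becomes -1
```

The second element of each tuple mirrors the `datasets` field of the Integrated Dataset, which records the sources in which a SAV appears.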

In [None]:
import os
import tarfile

# extract the data archive, unless it has already been unpacked
if not os.path.isdir('data'):
    with tarfile.open('data.tar.gz', 'r:gz') as tar:
        tar.extractall()

In [1]:
import numpy as np

# import numpy structured arrays
ID = np.load('data/Integrated_Dataset-SAVs.npy')
CV = np.load('data/ClinVar_Dataset-SAVs.npy')

In [2]:
print('Integrated Dataset size:', len(ID))
print('ClinVar size:           ', len(CV))

Integrated Dataset size: 94505
ClinVar size:            20814


In [3]:
print(ID.dtype.names)
print(ID[0])

('SAV_coords', 'true_label', 'datasets')
('A0AV02 181 R C', 0, 'humsavar,swissvar')
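For downstream filtering, it can be handy to split a `SAV_coords` string into its components. The sketch below assumes the four space-separated fields are the UniProt accession, the residue position, the wild-type amino acid and the mutated amino acid; `parse_sav` is a hypothetical helper, not part of Rhapsody.

```python
def parse_sav(sav_coords):
    """Split 'ACC POS WT MUT' into (accession, position, wt_aa, mut_aa)."""
    acc, pos, wt_aa, mut_aa = sav_coords.split()
    return acc, int(pos), wt_aa, mut_aa

parsed = parse_sav('A0AV02 181 R C')
print(parsed)  # ('A0AV02', 181, 'R', 'C')
```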


In [4]:
print(CV.dtype.names)
print(CV[0])

('SAV_coords', 'true_label', 'review_star')
('A0PJY2 278 H Y', 1, 0)


In [5]:
# true labels: 0 (neutral), 1 (deleterious), -1 (unknown or discordant interpretations)
print( set(ID['true_label']) )
print( set(CV['true_label']) )

{0, 1, -1}
{0, 1, -1}


In [6]:
# ClinVar review stars (see https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status for meaning)
print( set(CV['review_star']) )

{0, 1, 2, 3, 4}
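As a sketch of how the review stars might be used, the snippet below selects SAVs with a minimum number of stars and a definite label. It runs on a toy structured array rather than on the real file: the dtype, the `FAKE` entries and the threshold `min_stars` are assumptions made for illustration.

```python
import numpy as np

# toy array mimicking the fields of the ClinVar dataset (dtype is assumed)
dtype = [('SAV_coords', 'U50'), ('true_label', 'i2'), ('review_star', 'i2')]
toy_CV = np.array([
    ('A0PJY2 278 H Y', 1, 0),
    ('FAKE01 10 A T', 0, 2),
    ('FAKE02 20 R H', -1, 3),
    ('FAKE03 30 G D', 1, 4),
], dtype=dtype)

min_stars = 2  # hypothetical confidence threshold
# keep SAVs with enough review stars and a definite (0 or 1) label
mask = (toy_CV['review_star'] >= min_stars) & (toy_CV['true_label'] != -1)
subset = toy_CV[mask]
print(subset['SAV_coords'])
```

The same boolean mask could be applied to `CV` itself, since it exposes the same field names.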


## Computing Rhapsody features
Precomputed features can be found in `data/` (computing them from scratch for the complete dataset would take days).

In the following, we show how to compute Rhapsody features for a small set of SAVs.

In [7]:
test_SAVs = ['O00294 496 A T', 'O00238 31 R H']

In [8]:
import os, sys

# insert the path to the Rhapsody folder here
sys.path.insert(0, '/home/lponzoni/Scratch/028-RHAPSODY-git/rhapsody')
import rhapsody

In [9]:
# set folder where pickles will be stored
if not os.path.isdir('pickles'):
    os.mkdir('pickles')
rhapsody.pathRhapsodyFolder('pickles/')

@> Local Rhapsody folder is set: '/home/lponzoni/Scratch/028-RHAPSODY-git/rhapsody-tutorials/00-Training_Dataset/pickles'


In [10]:
# insert the path to the folder with precomputed EVmutation mutation effects here
rhapsody.pathEVmutationFolder('/home/lponzoni/Data/025-EVmutation/mutation_effects')

@> Local EVmutation folder is set: '/home/lponzoni/Data/025-EVmutation/mutation_effects'


In [11]:
if not os.path.isdir('results'):
    os.mkdir('results')
os.chdir('results')

In [12]:
# initialize a rhapsody object
rh = rhapsody.Rhapsody()

In [13]:
# import SAVs by querying PolyPhen-2 (or by importing precomputed PolyPhen-2 output file, if found)
if os.path.isfile('pph2-full.txt'):
    rh.importPolyPhen2output('pph2-full.txt')
else:
    rh.queryPolyPhen2(test_SAVs)

@> PolyPhen-2's output parsed.


In [14]:
# we would like to compute all features
rh.setFeatSet('all')

In [15]:
# true labels must be imported prior to exporting training data
true_labels = {
    'O00294 496 A T': 1,
    'O00238 31 R H': 0
}
rh.setTrueLabels(true_labels)

In [16]:
training_dataset = rh.exportTrainingData()
training_dataset

@> Sequence-conservation features have been retrieved from PolyPhen-2's output.
@> Mapping SAVs to PDB structures...
@> [1/2] Mapping SAV 'O00238 31 R H' to PDB...
@> PDB file is found in the local folder (/home/lponzoni/.../3mdy.pdb.gz).
@> 858 atoms and 1 coordinate set(s) were parsed in 0.05s.
@> [2/2] Mapping SAV 'O00294 496 A T' to PDB...
@> Pickle 'UniprotMap-O00238.pkl' saved.
@> PDB file is found in the local folder (/home/lponzoni/.../2fim.pdb.gz).
@> 456 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> PDB file is found in the local folder (/home/lponzoni/.../3c5n.pdb.gz).
@> 454 atoms and 1 coordinate set(s) were parsed in 0.04s.
@> Chain A in 2FIM was aligned in 0.2s.
@> Pickle 'UniprotMap-O00294.pkl' saved.
@> SAVs have been mapped to PDB in 3.4s.
@> Computing structural and dynamical features from PDB structures...
@> [2/2] Analizing mutation site 2FIM:A 443...
@> PDB file is found in the local folder (/home/lponzoni/.../2fim.pdb.gz).
@> 3841 atoms and 1 coordinate 

array([('O00294 496 A T', '2FIM A 443 A', 224, 1, 0.41869268, 0.2725254, 0.30990002, 0.00208266, 0.00289542, 0.0444308, 0.00296154, 0.00206255, 0.04027145, 0., 0.754,  1., -3.1479, -1.0065, 0.0662341, 0.3341, 0.09581726, 0.08719353, 0.08953744, 0.00398543, 0.00332497, 0.01278353, 0.00573205, 0.00480731, 0.01438932, 79., 78., 2.1440704, 0.813278  , 12.638108, 14.42894, 14.559032, -2.376),
       ('O00238 31 R H', 'Unable to map SAV to PDB',   0, 0,        nan,       nan,        nan,        nan,        nan,       nan,        nan,        nan,        nan, 0., 1.634, nan, -2.4718, -2.461 , 0.0102818, 0.3508,        nan,        nan,        nan,        nan,        nan,        nan,        nan,        nan,        nan, nan, nan, 1.8702421, 0.23376623,       nan,      nan,       nan, -2.769)],
      dtype=[('SAV_coords', '<U50'), ('Uniprot2PDB', '<U100'), ('PDB_length', '<i2'), ('true_label', '<i2'), ('ANM_MSF-chain', '<f4'), ('ANM_MSF-reduced', '<f4'), ('ANM_MSF-sliced', '<f4'), ('ANM_effectiven

In [17]:
np.save('precomputed_features', training_dataset)
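Note that `np.save` appends the `.npy` extension when the file name lacks one, so the array above ends up in `precomputed_features.npy`. The sketch below shows the round trip on a toy array written to a temporary folder, then flags SAVs whose structural features are missing. The toy dtype keeps only three of the fields shown above, and we assume that a `nan` in `ANM_MSF-chain` identifies SAVs that could not be mapped to a PDB structure.

```python
import os
import tempfile
import numpy as np

# toy array with a subset of the exported fields (values taken from the output above)
dtype = [('SAV_coords', 'U50'), ('true_label', 'i2'), ('ANM_MSF-chain', 'f4')]
toy = np.array([
    ('O00294 496 A T', 1, 0.41869268),
    ('O00238 31 R H', 0, np.nan),
], dtype=dtype)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'precomputed_features')
    np.save(path, toy)                  # writes 'precomputed_features.npy'
    reloaded = np.load(path + '.npy')   # the extension is needed when loading

# True where the structural feature is available
mapped = ~np.isnan(reloaded['ANM_MSF-chain'])
print(reloaded['SAV_coords'][mapped])
```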