We show here how to train the Rhapsody random forest classifier in its final form.

Based on the analyses illustrated in the `RF_optimization` notebook, we will make the following choices:
* the training dataset will only contain SAVs with at least 1 review star in the ClinVar database (when available) and with the same clinical interpretation in all 7 datasets
* only PDB structures with at least 150 residues will be considered
* the random forest classifier hyperparameters are set based on the optimization procedure (max. number of features = 2, number of trees in the forest = 1500)
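As a rough illustration of the last choice, the two hyperparameters could be passed to a random forest as follows. This is only a minimal sketch: it assumes a scikit-learn `RandomForestClassifier` as the backend (an assumption not stated in this notebook) and uses synthetic data in place of the real feature matrix. The names `X`, `y` and `clf` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature matrix: 200 samples, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Toy binary labels that depend on the first two features.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hyperparameters from the list above:
# 1500 trees, 2 candidate features examined at each split.
clf = RandomForestClassifier(
    n_estimators=1500,
    max_features=2,
    random_state=0,
)
clf.fit(X, y)

print(len(clf.estimators_))
```

With a small `max_features`, individual trees are less correlated with each other, which is usually compensated by growing a large number of trees.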

In [1]:
import sys
import os
import pickle
import numpy as np
from prody import LOGGER

In [3]:
# please make sure to extract the data folder beforehand
ID = np.load('../00-Training_Dataset/data/precomputed_features-ID.npy')
len(ID)

91697

In [4]:
# let's discard SAVs with unknown significance (true_label == -1), ...
ID = ID[ID['true_label'] != -1]
len(ID)

87726

In [5]:
# ... SAVs whose associated PDB structure is smaller than 150 residues ...
ID = ID[ID['PDB_length'] >= 150]
len(ID)

23085

In [6]:
# ... and SAVs with 0 review stars according to ClinVar
ClinVar_SAVs = np.load('../00-Training_Dataset/data/ClinVar_Dataset-SAVs.npy')
# set of SAV coordinates to exclude (fast membership test)
excluded_SAVs = set(ClinVar_SAVs[ClinVar_SAVs['review_star'] < 1]['SAV_coords'])
ID = ID[[SAV not in excluded_SAVs for SAV in ID['SAV_coords']]]
len(ID)

20361
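The three filtering steps above can also be expressed as a single boolean mask. The sketch below illustrates this on a toy structured array. The field names mirror those used in the cells above, but the records, the `toy` array and the `zero_star` set are made up for illustration and do not come from the real dataset.

```python
import numpy as np

# Toy structured array with the same field names as the real one.
dtype = [('SAV_coords', 'U20'), ('true_label', 'i4'), ('PDB_length', 'i4')]
toy = np.array([
    ('P01112 12 G V', 1, 166),   # kept
    ('P01112 61 Q L', -1, 166),  # dropped: unknown significance
    ('Q00001 5 A T', 0, 90),     # dropped: structure shorter than 150 residues
    ('Q00002 7 R H', 1, 300),    # dropped: listed as 0 review stars
], dtype=dtype)

# Hypothetical set of SAVs with 0 review stars.
zero_star = {'Q00002 7 R H'}

# Combine the three criteria into one mask.
mask = (
    (toy['true_label'] != -1)
    & (toy['PDB_length'] >= 150)
    & ~np.isin(toy['SAV_coords'], list(zero_star))
)
filtered = toy[mask]
print(len(filtered))
```

Applying the filters one at a time, as done in the cells above, has the advantage of showing how many SAVs each criterion removes.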

In [2]:
# Insert the path to your local Rhapsody folder here
sys.path.insert(0, '/home/lponzoni/Scratch/028-RHAPSODY-git/rhapsody')

from rhapsody import *