# Training of Random Forest classifiers

In the following, we train different versions of the Rhapsody classifier and compare their accuracy. Each version is trained on the Integrated Dataset of human missense variants and evaluated through 10-fold cross-validation. 

More specifically, we consider:
* different subsets of features,
* different subsets of the training dataset,
* different classifier hyperparameters.

In [1]:
import sys, os
import pickle
import numpy as np
from glob import glob

In [2]:
# Insert the path to your local Rhapsody folder here
sys.path.insert(0, '/home/lponzoni/Scratch/028-RHAPSODY-git/rhapsody') 

In [3]:
from rhapsody import *

## Importing the training dataset

The Integrated Dataset used for training is made available as a NumPy structured array containing all precomputed features, as well as true labels and other info (e.g. PDB lengths).
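As a quick refresher on how a NumPy structured array behaves, here is a minimal sketch on a toy array. The field names mirror a few of those in the Integrated Dataset, but the identifiers and values are invented for illustration and are not taken from the real data.

```python
import numpy as np

# Toy structured array: field names mimic the dataset, values are made up
dtype = [('SAV_coords', 'U20'), ('PDB_length', 'f4'),
         ('true_label', 'i2'), ('wt_PSIC', 'f4')]
toy = np.array([('P00001 10 A G', np.nan, 0, -0.5),
                ('P00002 25 L P', 120.0, 1, 1.2),
                ('P00003 77 R W', 480.0, 1, 0.3)], dtype=dtype)

# a single field name returns a regular 1-D array
labels = toy['true_label']

# a list of field names returns a structured array with only those fields
subset = toy[['true_label', 'wt_PSIC']]

# boolean masks select rows, e.g. entries mapped to a PDB structure
mapped = toy[~np.isnan(toy['PDB_length'])]

# masks can be chained, e.g. to require a minimum structure size
large = mapped[mapped['PDB_length'] >= 150]

print(labels, subset.dtype.names, len(mapped), len(large))
```

The same three operations (field access, multi-field selection, row masking) are the ones used on `ID` in the cells that follow.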

In [4]:
# use a context manager so that the file handle is closed after loading
with open('Integrated_Dataset.pkl', 'rb') as f:
    ID = pickle.load(f)

In [5]:
# array structure
ID.dtype.names

('SAV_coords',
 'Uniprot2PDB',
 'PDB_length',
 'true_label',
 'wt_PSIC',
 'Delta_PSIC',
 'SASA',
 'Delta_SASA',
 'BLOSUM',
 'GNM_MSF-chain',
 'GNM_MSF-reduced',
 'GNM_effectiveness-chain',
 'GNM_effectiveness-reduced',
 'GNM_sensitivity-chain',
 'GNM_sensitivity-reduced',
 'ANM_MSF-chain',
 'ANM_MSF-reduced',
 'ANM_effectiveness-chain',
 'ANM_effectiveness-reduced',
 'ANM_sensitivity-chain',
 'ANM_sensitivity-reduced',
 'stiffness-chain',
 'stiffness-reduced',
 'MBS-chain',
 'MBS-reduced',
 'entropy',
 'ranked_MI',
 'EVmut-DeltaE_epist',
 'EVmut-DeltaE_indep',
 'EVmut-wt_aa_cons',
 'EVmut-mut_aa_freq')

In [6]:
# each entry can be accessed by indexing
ID[0]

('Q96JB6 405 D A', 'Unable to map SAV to PDB', nan, 0, -0.924, 2.234, nan, nan, -2., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 1.469, 0.9847, -6.859, -2.213, 0.5493, 0.07133)

In [7]:
# size of the training dataset
len(ID)

92108

In [8]:
# number of entries with an associated PDB structure
len(ID[~np.isnan(ID['PDB_length'])])

28010

## Cross-validation with various classification schemes

We assess, through 10-fold cross-validation, the effect of considering different combinations of features and training datasets.
In particular, we compare performance when using:
* different subsets of the training dataset, obtained by selecting only those cases where the PDB structure contains at least *n* residues
* 4 different feature sets, including one reproducing version 1 of the method (RAPSODY)
* **GNM** vs **ANM** features
* ENM features computed with and without the inclusion of *environmental* effects (**reduced** vs **chain** model), i.e. the presence of other chains in the PDB structure
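
To get a sense of how many classifiers this comparison involves, the sketch below enumerates the scheme labels using the same naming convention as the training cell (`{min_num_res}-{ENM}-{model}-{version}`, plus a single `{min_num_res}-v1` per size threshold). It only builds strings and does not train anything. The iteration order is not meant to match the training loop.

```python
from itertools import product

thresholds = [0, 100, 150, 200, 300, 400, 500, 600]
ENMs = ['GNM', 'ANM']
models = ['chain', 'reduced']
versions = ['v2', 'v2_noPfam', 'v2_EVmut']

schemes = []
for n in thresholds:
    # 2 ENMs x 2 models x 3 feature sets for the v2 variants
    for ENM, model, version in product(ENMs, models, versions):
        schemes.append(f'{n}-{ENM}-{model}-{version}')
    # the v1 (RAPSODY) feature set is trained once per threshold
    schemes.append(f'{n}-v1')

print(len(schemes))  # 104
```

With 8 size thresholds and 13 schemes per threshold, the full grid amounts to 104 cross-validated classifiers, which explains the long running time of the training cell.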


In [17]:
# exist_ok avoids an error if the folder is left over from a previous run
os.makedirs('RF_training', exist_ok=True)

In [21]:
# this cell requires ~3.5 hours to complete

from prody import LOGGER
LOGGER.start('RF_training.log')

CV_summaries = {}

if os.path.isfile('RF_training/CV-summaries.pkl'):
    print('A pickle containing precomputed training results has been found.')
    print('Please delete it if you wish to run the training again.')
else:
    for min_num_res in [0, 100, 150, 200, 300, 400, 500, 600]:
        # compute subset of the training dataset
        ID_subset = ID[ ~np.isnan(ID['PDB_length']) ]
        ID_subset = ID_subset[ ID_subset['PDB_length'] >= min_num_res ]
        
        # loop over different classification schemes
        for ENM in ['GNM', 'ANM']:
            for model in ['chain', 'reduced']:
                for version in ['v2', 'v2_noPfam', 'v2_EVmut', 'v1']:

                    # select feature set (+ true label)
                    if version == 'v2':
                        # full classifier
                        featset = ['true_label', 
                                   'wt_PSIC', 'Delta_PSIC', 'SASA', 
                                   f'{ENM}_MSF-{model}',
                                   f'{ENM}_effectiveness-{model}',
                                   f'{ENM}_sensitivity-{model}',
                                   f'stiffness-{model}',
                                   'entropy', 'ranked_MI', 'BLOSUM']
                    elif version == 'v2_noPfam':
                        # reduced classifier
                        featset = ['true_label',
                                   'wt_PSIC', 'Delta_PSIC', 'SASA', 
                                   f'{ENM}_MSF-{model}',
                                   f'{ENM}_effectiveness-{model}',
                                   f'{ENM}_sensitivity-{model}',
                                   f'stiffness-{model}',
                                   'BLOSUM']
                    elif version == 'v2_EVmut':
                        # full classifier + EVmutation epistatic score
                        featset = ['true_label',
                                   'wt_PSIC', 'Delta_PSIC', 'SASA', 
                                   f'{ENM}_MSF-{model}',
                                   f'{ENM}_effectiveness-{model}',
                                   f'{ENM}_sensitivity-{model}',
                                   f'stiffness-{model}',
                                   'entropy', 'ranked_MI', 'BLOSUM',
                                   'EVmut-DeltaE_epist']
                    elif version == 'v1' and ENM == 'GNM' and model == 'chain':
                        # classifier as in version 1 of Rhapsody (RAPSODY)
                        # NB: RAPSODY used a combination of GNM/ANM features, which
                        # we reproduce here for the sake of comparison
                        featset = ['true_label', 
                                   'wt_PSIC', 'Delta_PSIC', 'SASA', 
                                   'GNM_MSF-chain', 
                                   'ANM_effectiveness-chain', 
                                   'ANM_sensitivity-chain',
                                   'stiffness-chain']
                    else:
                        continue

                    if version == 'v1':
                        scheme = f'{min_num_res}-v1'
                    else:
                        scheme = f'{min_num_res}-{ENM}-{model}-{version}'
                    print(f'CLASSIFICATION SCHEME: {scheme}')                        

                    # create folder
                    folder = f'RF_training/clsf_scheme-{scheme}'
                    # (exist_ok allows resuming after an interrupted run)
                    os.makedirs(folder, exist_ok=True)
                    
                    # train the classifier
                    clsf = trainRFclassifier(ID_subset[featset])

                    # store summary from cross-validation into a dictionary
                    CV_summaries[scheme] = clsf['CV summary']

                    # move figures into folder
                    for file in glob('*png'):
                        os.rename(file, os.path.join(folder, file))

                    # we'll only keep classifiers trained with the 150 min_num_res requirement
                    clsf_file = 'trained_classifier.pkl'
                    if min_num_res == 150:
                        os.rename(clsf_file, os.path.join(folder, clsf_file))
                    else:
                        os.remove(clsf_file)
                    
    # store all cross-validation results into a pickle
    with open('RF_training/CV-summaries.pkl', 'wb') as f:
        pickle.dump(CV_summaries, f)

LOGGER.close('RF_training.log')

@> Logging into file: RF_training.log
@> Logging started at 2019-02-05 13:11:30.529657


CLASSIFICATION SCHEME: 0-GNM-chain-v2


@> 5955 out of 28010 cases ignored with missing features.
@> CV iteration # 1:    ROC-AUC = 0.840   OOB score = 0.817
@> CV iteration # 2:    ROC-AUC = 0.852   OOB score = 0.817
@> CV iteration # 3:    ROC-AUC = 0.839   OOB score = 0.817
@> CV iteration # 4:    ROC-AUC = 0.827   OOB score = 0.818
@> CV iteration # 5:    ROC-AUC = 0.841   OOB score = 0.817
@> CV iteration # 6:    ROC-AUC = 0.846   OOB score = 0.820
@> CV iteration # 7:    ROC-AUC = 0.842   OOB score = 0.819
@> CV iteration # 8:    ROC-AUC = 0.845   OOB score = 0.817
@> CV iteration # 9:    ROC-AUC = 0.859   OOB score = 0.818
@> CV iteration #10:    ROC-AUC = 0.852   OOB score = 0.819
@> ------------------------------------------------------------
@> Cross-validation summary:
@> training dataset size:   22055
@> fraction of positives:   0.731
@> mean ROC-AUC:            0.844
@> mean OOB score:          0.818
@> optimal cutoff*:         0.720 +/- 0.023
@> (* argmax of Youden's index)
@> feature importances:
@>           

CLASSIFICATION SCHEME: 0-GNM-chain-v2_noPfam


@> 315 out of 28010 cases ignored with missing features.


KeyboardInterrupt: 
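
The cross-validation log above reports the optimal cutoff as the argmax of Youden's index, J = TPR - FPR. How Rhapsody computes it internally is not shown in this notebook, so the following is only a generic sketch of the definition, applied to invented labels and scores. The helper name `youden_cutoff` is hypothetical.

```python
import numpy as np

def youden_cutoff(y_true, y_score):
    """Return (cutoff, J) maximizing Youden's index J = TPR - FPR."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    best_J, best_cutoff = -1.0, None
    # candidate cutoffs: every distinct predicted score, in increasing order
    for c in np.unique(y_score):
        pred = y_score >= c
        tpr = np.sum(pred & y_true) / np.sum(y_true)
        fpr = np.sum(pred & ~y_true) / np.sum(~y_true)
        J = tpr - fpr
        # strict inequality: ties are resolved in favor of the lowest cutoff
        if J > best_J:
            best_J, best_cutoff = J, c
    return best_cutoff, best_J

# invented toy data: 1 = deleterious, scores = predicted probabilities
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

cutoff, J = youden_cutoff(y_true, y_score)
print(cutoff, J)
```

On this toy data, the cutoffs 0.4 and 0.7 both reach J = 0.75, and the sketch returns the lower one. A real implementation may break ties differently.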

In [None]:
# recover precomputed cross-validation results
# (the file name must match the one written by the training cell)
with open('RF_training/CV-summaries.pkl', 'rb') as f:
    CV_summaries = pickle.load(f)

## Figures

In [None]:
# exist_ok avoids an error if the folder already exists
os.makedirs('figures', exist_ok=True)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 3, figsize=(12,7))
fig.subplots_adjust(wspace=0.2)

ax[0,0].set_title('RHAPSODY')
ax[0,1].set_title('RHAPSODY w/o Pfam features')
ax[0,2].set_title('RHAPSODY w/ EVmutation score')
ax[0,0].set_ylabel('AUROC')
ax[1,0].set_ylabel('OOB score')
for j in range(3):
    ax[0,j].set_ylim([.80, .88])
    ax[1,j].set_ylim([.80, .88])
    ax[1,j].set_xlabel('minimum number of residues')

x = [0, 100, 150, 200, 300, 400, 500, 600]
for ENM in ['GNM', 'ANM']:
    for model in ['chain', 'reduced']:
        for i, version in enumerate(['v2', 'v2_noPfam', 'v2_EVmut']):
            scheme = f'{ENM}-{model}-{version}'
            AUC = [CV_summaries[f'{n}-{scheme}']['mean ROC-AUC'] for n in x]
            OOB = [CV_summaries[f'{n}-{scheme}']['mean OOB score'] for n in x]
            ax[0,i].plot(x, AUC, 'o-', label=scheme)
            ax[1,i].plot(x, OOB, 'o-', label=scheme)

AUC = [CV_summaries[f'{n}-v1']['mean ROC-AUC'] for n in x]
OOB = [CV_summaries[f'{n}-v1']['mean OOB score'] for n in x]
ax[0,0].plot(x, AUC, 'v--', label='v1')
ax[1,0].plot(x, OOB, 'v--', label='v1')

ax[0,0].legend(fontsize=8)

plt.tight_layout()
fig.savefig('figures/performances_comparison.png', dpi=300)

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(12,7))
fig.subplots_adjust(wspace=0.2)

featsets = {}
featsets['v2'] = ['wt_PSIC', 'Delta_PSIC', 'SASA', 'MSF', 
                  'effectiveness', 'sensitivity', 'stiffness', 
                  'entropy', 'ranked_MI', 'BLOSUM']
featsets['v2_noPfam'] = ['wt_PSIC', 'Delta_PSIC', 'SASA', 'MSF', 
                         'effectiveness', 'sensitivity', 'stiffness', 'BLOSUM']
featsets['v2_EVmut'] = featsets['v2'] + ['EVmut-DeltaE',]
SEQ_feats = ['wt_PSIC', 'Delta_PSIC', 'BLOSUM', 'entropy', 'ranked_MI', 'EVmut-DeltaE']

ax[0,0].set_title('RHAPSODY')
ax[0,1].set_title('RHAPSODY w/o Pfam features')
ax[0,2].set_title('RHAPSODY w/ EVmutation score')
ax[0,0].set_ylabel('sequence-based features')
ax[1,0].set_ylabel('structure-based features')
for j in range(3):
    ax[0,j].set_ylim([0, .3])
    ax[1,j].set_ylim([0, .13])
    ax[1,j].set_xlabel('minimum number of residues')

x = [0, 100, 150, 200, 300, 400, 500, 600]

for i, (version, featset) in enumerate(featsets.items()):
    for j,f in enumerate(featset):
        ss = [CV_summaries[f'{n}-ANM-chain-{version}']['feat. importance'][j] for n in x]
        if f in ['BLOSUM', 'SASA']:
            m = 'kv-'
        else:
            m = 'o-'
        if f in SEQ_feats:
            ax[0,i].plot(x, ss, m, label=f)
        else:
            ax[1,i].plot(x, ss, m, label=f)
            
for a in ax[0]:
    a.legend(loc='upper right', ncol=2)
for a in ax[1]:
    a.legend(loc='lower right', ncol=2)

plt.tight_layout()
fig.savefig('figures/feat_imp_comparison.png', dpi=300)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7.5,3))
fig.subplots_adjust(wspace=0.2)

x = [0, 100, 150, 200, 300, 400, 500, 600]

y = np.array([CV_summaries[f'{n}-ANM-chain-v2']['dataset size'] for n in x])
ax1.plot(x, y, 'o-')
ax1.set_xlabel('minimum number of residues')
ax1.set_ylabel('number of SAVs')

y2 = np.array([CV_summaries[f'{n}-ANM-chain-v2']['dataset bias'] for n in x])
ax2.plot(x, y2, 'o-')
ax2.set_xlabel('minimum number of residues')
ax2.set_ylabel('fraction of del. SAVs')
ax2.set_ylim((0.7, 0.8))

plt.tight_layout()
fig.savefig('figures/stats.png', dpi=300)

## Summary

In [None]:
for version in ['v2', 'v2_noPfam', 'v2_EVmut', 'v1']:
    for min_num_res in [0, 100, 150, 200, 300, 400, 500, 600]:
        for ENM in ['GNM', 'ANM']:
            for model in ['chain', 'reduced']:

                if version == 'v1' and ENM == 'GNM' and model == 'chain':
                    scheme = f'{min_num_res}-v1'
                elif version != 'v1':
                    scheme = f'{min_num_res}-{ENM}-{model}-{version}'
                else:
                    continue

                s = CV_summaries[scheme]
               
                print(f'{version:<9} {min_num_res:3} {ENM} {model:7}:  ',
                      'size = {:5}  '.format(s['dataset size']),
                      'AUROC = {:5.3f}  '.format(s['mean ROC-AUC']),
                      'OOB = {:5.3f}'.format(s['mean OOB score']) )
