# Gradient boosting trees trained on EVmutation scores

## Setup

* Data: first-principle evolutionary statistical energy scores for human proteins, as returned by the 'epistatic' model of the EVmutation method.
* Feature vector: each variant is represented by 5 features; after one-hot encoding we have 20 + 20 + 4 + 6 + 4 + 3 + 1 = **58 features**:
  + **WT Amino acid** (20 levels)
  + **MUT Amino Acid** (20 levels)
  + **trinucleotide context** (96 levels) (*separate one-hot encoding for positions -1 [4 features], 0 [6 features], +1 [4 features]*)
  + phyloP **conservation score** [separately for positions -1, 0, 1] [*3 features*]
  + **Q(SASA)** of the mutated position
* Objective: a classification task to predict whether the EVmutation score is < -5 (damaging) or not (neutral). See the NN notebook.
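
The one-hot layout above implies the following column ranges in the feature matrix, assuming the columns are stored in the order listed (this matches the slices used in the training cells of this notebook). The names `FEATURE_COLUMNS` and `select_columns` are illustrative and not part of the original notebook.

```python
import numpy as np

# Column ranges implied by the feature list, assuming the stored order is
# WT AA, MUT AA, trinucleotide context (-1, 0, +1), conservation, Q(SASA).
FEATURE_COLUMNS = {
    "WT_AA": range(0, 20),
    "MUT_AA": range(20, 40),
    "MutSig": range(40, 54),   # 4 + 6 + 4 one-hot columns
    "Cons": range(54, 57),     # phyloP at positions -1, 0, +1
    "QSASA": range(57, 58),
}

def select_columns(X, groups):
    """Return the columns of X belonging to the named feature groups."""
    idx = [i for g in groups for i in FEATURE_COLUMNS[g]]
    return X[:, idx]

# Toy matrix with 58 columns standing in for the real feature matrix.
X_demo = np.arange(3 * 58).reshape(3, 58)
n_total = sum(len(r) for r in FEATURE_COLUMNS.values())
aa_mutsig = select_columns(X_demo, ["WT_AA", "MUT_AA", "MutSig"])
print(n_total, aa_mutsig.shape)
```

Selecting by group name rather than by hard-coded slice makes it easier to see which features each model variant uses.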

In [1]:
import os
import glob
import re
import random
import pickle
import numpy as np
import pandas as pd
import sklearn
import imblearn
import time
import shap

In [2]:
# load the prepared dataset
with open('EVmutation_human_protein_predictions_collapsed_annotated.pickle', 'rb') as r:
    X, Y = pickle.load(r)

## Idea

Use -5 as the cutoff: variants scoring below it are labelled 'damaging' (label '1'), and 'neutral' (label '0') otherwise. The cutoff was chosen by inspecting [Figure 3](https://www.nature.com/articles/nbt.3769/figures/3) of the original EVmutation paper; it is a sensible choice that separates common gnomAD variants from pathogenic ClinVar variants. The actual median of the EVmutation scores is -4.722, not far from -5.
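
As a small illustration of the labelling rule, the sketch below applies the -5 cutoff to a handful of made-up scores (these values are not from the dataset) and reports the fraction labelled damaging.

```python
import numpy as np

CUTOFF = -5.0  # threshold taken from the text above

# Made-up EVmutation-like scores, for illustration only.
scores = np.array([-9.1, -5.0, -4.7, -0.3, -6.2, -5.01])

# Strictly below the cutoff -> damaging (1); otherwise neutral (0).
labels = (scores < CUTOFF).astype(int)
frac_damaging = labels.mean()
print(labels, frac_damaging)
```

Note that a score exactly equal to -5 is labelled neutral, because the comparison is strict.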

For training, follow the same procedure as for the DMSexp data, but set aside only 1% of the data as the test set: with this many variants, a larger test set would 'waste' observations.
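
A minimal sketch of a 1% holdout on toy data follows. It uses NumPy's `default_rng` rather than the `random` module used in the notebook, so it is an alternative under that assumption and does not reproduce the notebook's exact row selection. The function name `holdout_indices` is hypothetical.

```python
import numpy as np

def holdout_indices(n_samples, proportion=0.01, seed=1234):
    """Return disjoint (train_idx, test_idx) arrays covering all rows."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(np.floor(proportion * n_samples))
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])

# Toy data: 1000 rows, 58 features, column-vector labels as in the notebook.
X_demo = np.zeros((1000, 58))
Y_demo = np.zeros((1000, 1), dtype=int)
train_idx, test_idx = holdout_indices(X_demo.shape[0])
X_tr, Y_tr = X_demo[train_idx], Y_demo[train_idx]
X_te, Y_te = X_demo[test_idx], Y_demo[test_idx]
print(X_tr.shape, X_te.shape)
```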

In [3]:
# label '1' if Y < -5 and '0' otherwise
Y = np.array(Y < -5, dtype=int)

In [4]:
def sample_fixed_set(X_set, Y_set, seed = 1, proportion = 0.2):
    """
    Randomly sample a fixed proportion of the rows of a dataset.
    Input:
      - X_set: np.array of shape (n_samples, n_features). The feature matrix.
      - Y_set: np.array of shape (n_samples, 1). The labels (not used for sampling).
      - seed: seed for the random number generator.
      - proportion: fraction of rows to sample.
    Return:
      a list of floor(proportion * n_samples) row IDs, sampled without replacement.
    """
    random.seed(seed) # set seed
    n_samples = X_set.shape[0]
    id_vec = range(n_samples)
    m_s = int(np.floor( proportion * n_samples ))  # number of rows to keep
    id_keep = random.sample(id_vec, m_s)  # row IDs of the obs to keep
    return id_keep


In [5]:
# set aside 1% as test set.
test_set_id = sample_fixed_set(X, Y, proportion = 0.01, seed = 1234)
train_set_id = list(set(range(X.shape[0])) - set(test_set_id))

X_test = X[test_set_id, :]
Y_test = Y[test_set_id, :]
X_train = X[train_set_id, :]
Y_train = Y[train_set_id, :]

# save the whole train set
with open('human_protein_predictions/EVmutation_train_set.pickle', 'wb') as f:
    pickle.dump((X_train, Y_train), f)
    
# save the test set and forget about it for now
with open('human_protein_predictions/EVmutation_test_set.pickle', 'wb') as f:
    pickle.dump((X_test, Y_test), f)

## Architecture

Use `GradientBoostingClassifier` from scikit-learn, with the same set of hyperparameters as in the DMSexp prediction task.
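
With `n_iter_no_change = 3`, scikit-learn holds out a validation fraction (10% by default) and stops adding trees once the validation score has not improved for three consecutive iterations, so fewer than `n_estimators` trees may be fitted. The sketch below shows this on synthetic data from `make_classification` (not the EVmutation data), with a smaller `n_estimators` so that it runs quickly; the fitted `n_estimators_` attribute reports how many trees were kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=0)
Y_col = y_demo.reshape(-1, 1)  # column vector, like the labels in this notebook

demo_model = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=6, n_iter_no_change=3,
    n_estimators=50, subsample=0.7, random_state=0,
)
# ravel() gives the 1-D label array that scikit-learn expects.
demo_model.fit(X_demo, Y_col.ravel())

n_fitted = demo_model.n_estimators_   # trees kept after early stopping
print(n_fitted, "of", demo_model.n_estimators, "trees fitted")
```

Passing `Y.ravel()` instead of a column vector also avoids the "column-vector y was passed" warning that scikit-learn emits for labels of shape `(n_samples, 1)`.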

In [6]:
with open('data/EVmutation_train_set.pickle', 'rb') as r:
    X, Y = pickle.load(r)

In [7]:
from sklearn.ensemble import GradientBoostingClassifier

In [8]:
# now train the actual model
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )

model = model.fit(X, Y)

  return f(*args, **kwargs)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3498           0.0286          371.09m
         2           1.3263           0.0235          370.60m
         3           1.3058           0.0204          370.33m
         4           1.2880           0.0177          366.39m
         5           1.2738           0.0144          364.26m
         6           1.2605           0.0132          364.60m
         7           1.2501           0.0105          366.33m
         8           1.2389           0.0112          363.20m
         9           1.2294           0.0094          360.57m
        10           1.2215           0.0081          358.36m
        20           1.1673           0.0045          325.65m
        30           1.1389           0.0023          302.96m
        40           1.1193           0.0017          276.98m
        50           1.1060           0.0013          256.24m
        60           1.0955           0.0008          241.61m
       

In [9]:
# save the model
with open('setup/AA_Cons_MutSig_QSASA/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA + MutSig

In [10]:
# use only the first 54 columns (ie WT [20] + MUT [20] amino acids + MutSig position -1 [4 bases], 0[6 combinations of SBS], 1 [4 bases])
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )

model = model.fit(X[:, 0:54], Y)

  return f(*args, **kwargs)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3660           0.0126          165.49m
         2           1.3557           0.0101          165.36m
         3           1.3468           0.0090          165.03m
         4           1.3384           0.0084          164.66m
         5           1.3320           0.0064          164.15m
         6           1.3262           0.0059          163.39m
         7           1.3203           0.0060          162.94m
         8           1.3121           0.0080          162.14m
         9           1.3042           0.0079          161.39m
        10           1.2998           0.0044          161.22m
        20           1.2649           0.0025          158.17m
        30           1.2407           0.0013          155.06m
        40           1.2271           0.0008          151.77m
        50           1.2172           0.0010          148.50m
        60           1.2100           0.0006          145.24m
       

In [11]:
# save the model
with open('setup/AA_MutSig/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA + Cons + MutSig

In [12]:
# use only the first 57 columns (ie WT [20] + MUT [20] amino acids + MutSig position -1 [4 bases], 0[6 combinations of SBS], 1 [4 bases] + conservation [3])
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )

model = model.fit(X[:, 0:57], Y)

  return f(*args, **kwargs)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3605           0.0181          196.51m
         2           1.3456           0.0149          197.05m
         3           1.3332           0.0125          196.59m
         4           1.3226           0.0106          196.15m
         5           1.3126           0.0101          195.59m
         6           1.3043           0.0082          195.20m
         7           1.2958           0.0082          194.64m
         8           1.2896           0.0064          194.09m
         9           1.2817           0.0078          193.76m
        10           1.2760           0.0059          193.43m
        20           1.2335           0.0031          189.84m
        30           1.2066           0.0016          185.30m
        40           1.1899           0.0015          181.10m
        50           1.1746           0.0016          177.00m
        60           1.1666           0.0010          172.93m
       

In [13]:
# save the model
with open('setup/AA_Cons_MutSig/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA + MutSig + QSASA

In [14]:
# use only the first 54 columns + last column (QSASA)
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )

model = model.fit(X[:, [i for i in range(54)] + [57]], Y)

  return f(*args, **kwargs)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3524           0.0261          192.16m
         2           1.3305           0.0222          191.92m
         3           1.3128           0.0175          191.90m
         4           1.2967           0.0160          191.77m
         5           1.2830           0.0138          191.63m
         6           1.2708           0.0120          191.32m
         7           1.2593           0.0115          191.07m
         8           1.2500           0.0092          190.77m
         9           1.2426           0.0075          190.35m
        10           1.2352           0.0074          190.03m
        20           1.1898           0.0037          186.38m
        30           1.1625           0.0022          181.97m
        40           1.1462           0.0019          177.93m
        50           1.1334           0.0014          173.78m
        60           1.1248           0.0010          169.74m
       

In [15]:
# save the model
with open('setup/AA_MutSig_QSASA/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA only

In [16]:
# use only the first 40 columns (ie WT [20] + MUT [20] amino acids)
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )

model = model.fit(X[:, 0:40], Y)

  return f(*args, **kwargs)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3663           0.0123          116.08m
         2           1.3563           0.0099          116.40m
         3           1.3480           0.0082          116.24m
         4           1.3415           0.0067          116.01m
         5           1.3355           0.0060          115.71m
         6           1.3304           0.0051          115.47m
         7           1.3259           0.0047          115.33m
         8           1.3188           0.0069          114.66m
         9           1.3142           0.0046          114.53m
        10           1.3081           0.0059          114.01m
        20           1.2748           0.0031          112.02m
        30           1.2548           0.0011          109.76m
        40           1.2384           0.0011          107.47m
        50           1.2255           0.0009          105.08m
        60           1.2184           0.0006          102.82m
       

In [17]:
# save the model
with open('setup/AA/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### MutSig only

In [18]:
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )

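# NOTE: given the column layout in Setup, MutSig occupies columns 40:54 (4 + 6 + 4 = 14 columns);
# the slice 41:54 selects 13 of them. The evaluation cells use the same slice.
model = model.fit(X[:, 41:54], Y)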

  return f(*args, **kwargs)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3715           0.0070           54.49m
         2           1.3658           0.0057           54.68m
         3           1.3612           0.0046           54.49m
         4           1.3575           0.0037           54.36m
         5           1.3543           0.0030           54.22m
         6           1.3519           0.0025           54.15m
         7           1.3498           0.0021           54.07m
         8           1.3480           0.0017           54.00m
         9           1.3467           0.0014           53.95m
        10           1.3455           0.0012           53.84m
        20           1.3395           0.0002           52.83m
        30           1.3376           0.0002           51.75m
        40           1.3366           0.0000           50.65m


In [19]:
# save the model
with open('setup/MutSig/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

## Train predictors without core variants

Simulate the effect of not having mutational dark matter in the training data by keeping only surface variants (Q(SASA) >= 0.15).
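
The filtering step can be sketched on toy data as follows, assuming (as in the cell that follows) that Q(SASA) is the last feature column and that 0.15 is the core/surface threshold. The array values are made up.

```python
import numpy as np

QSASA_CUTOFF = 0.15  # variants below this are treated as core

# Toy feature matrix: two dummy columns plus Q(SASA) as the last column.
X_demo = np.array([
    [1.0, 0.0, 0.02],   # core
    [0.0, 1.0, 0.15],   # surface (boundary value is kept)
    [1.0, 0.0, 0.60],   # surface
    [0.0, 1.0, 0.10],   # core
])
Y_demo = np.array([[1], [0], [0], [1]])

surface = X_demo[:, -1] >= QSASA_CUTOFF   # boolean mask over rows
X_surface = X_demo[surface, :]
Y_surface = Y_demo[surface]
print(surface, X_surface.shape, Y_surface.ravel())
```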

In [20]:
surface = X[:, -1] >= 0.15
X = X[surface, :]
Y = Y[surface]

In [21]:
# now train the actual model
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X, Y)
with open('setup/AA_Cons_MutSig_QSASA/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA + MutSig
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:54], Y)
with open('setup/AA_MutSig/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA + Cons + MutSig
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:57], Y)
with open('setup/AA_Cons_MutSig/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA MutSig QSASA
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, [i for i in range(54)] + [57]], Y)
with open('setup/AA_MutSig_QSASA/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA only
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:40], Y)
with open('setup/AA/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# MutSig only
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 41:54], Y)
with open('setup/MutSig/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2223           0.0177           65.07m
         2           1.2091           0.0137           63.86m
         3           1.1970           0.0112           63.40m
         4           1.1881           0.0092           64.64m
         5           1.1796           0.0086           63.86m
         6           1.1720           0.0076           63.53m
         7           1.1656           0.0066           63.24m
         8           1.1587           0.0064           63.11m
         9           1.1535           0.0057           62.81m
        10           1.1490           0.0047           62.53m
        20           1.1120           0.0031           62.50m
        30           1.0921           0.0019           60.36m
        40           1.0765           0.0009           58.80m
        50           1.0662           0.0005           57.07m
        60           1.0602           0.0004           55.59m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2272           0.0133           44.85m
         2           1.2170           0.0103           44.66m
         3           1.2085           0.0081           44.54m
         4           1.2020           0.0069           45.23m
         5           1.1960           0.0057           44.84m
         6           1.1897           0.0063           44.47m
         7           1.1851           0.0039           44.33m
         8           1.1821           0.0035           44.19m
         9           1.1786           0.0040           44.06m
        10           1.1744           0.0040           43.82m
        20           1.1459           0.0015           42.58m
        30           1.1281           0.0012           42.21m
        40           1.1169           0.0005           42.08m
        50           1.1094           0.0009           41.35m
        60           1.1047           0.0002           40.32m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2229           0.0175           57.49m
         2           1.2091           0.0137           56.91m
         3           1.1977           0.0112           56.57m
         4           1.1885           0.0091           56.53m
         5           1.1809           0.0083           56.57m
         6           1.1727           0.0076           56.29m
         7           1.1675           0.0061           56.14m
         8           1.1617           0.0057           56.20m
         9           1.1556           0.0057           56.17m
        10           1.1509           0.0041           56.11m
        20           1.1168           0.0020           54.72m
        30           1.0948           0.0018           54.38m
        40           1.0813           0.0009           53.54m
        50           1.0726           0.0010           52.33m
        60           1.0665           0.0008           51.06m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2257           0.0137           71.64m
         2           1.2158           0.0105           74.36m
         3           1.2075           0.0083           71.40m
         4           1.2001           0.0069           72.36m
         5           1.1946           0.0058           73.32m
         6           1.1889           0.0059           74.66m
         7           1.1838           0.0053           75.95m
         8           1.1785           0.0047           76.86m
         9           1.1741           0.0049           78.52m
        10           1.1704           0.0036           78.89m
        20           1.1417           0.0015           79.44m
        30           1.1228           0.0020           78.68m
        40           1.1113           0.0006           77.29m
        50           1.1021           0.0008           75.52m
        60           1.0962           0.0006           73.75m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2278           0.0128           43.01m
         2           1.2174           0.0099           42.90m
         3           1.2100           0.0079           42.40m
         4           1.2034           0.0063           42.39m
         5           1.1986           0.0051           42.26m
         6           1.1943           0.0043           42.28m
         7           1.1902           0.0039           42.11m
         8           1.1865           0.0035           42.01m
         9           1.1824           0.0040           42.21m
        10           1.1786           0.0042           42.96m
        20           1.1520           0.0017           41.41m
        30           1.1339           0.0010           40.16m
        40           1.1228           0.0012           39.34m
        50           1.1156           0.0003           38.87m
        60           1.1078           0.0004           38.19m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2343           0.0064           27.97m
         2           1.2291           0.0051           27.90m
         3           1.2242           0.0044           27.93m
         4           1.2212           0.0034           27.78m
         5           1.2182           0.0029           27.56m
         6           1.2160           0.0026           27.48m
         7           1.2140           0.0019           27.31m
         8           1.2118           0.0016           27.07m
         9           1.2102           0.0018           26.83m
        10           1.2085           0.0014           26.84m
        20           1.2021           0.0004           25.93m
        30           1.2004           0.0001           25.54m
        40           1.1994           0.0000           24.45m


# Evaluation

## Holdout set from same proteins/experiments encountered in training

These are the 1% of variants set aside as `(X_test, Y_test)`: they come from the same proteins/experiments used in training but were withheld from model construction and training.

In [22]:
def getROCcurve(model, X, Y, dataset_name, test_set, model_name, model_type):
    """
    given the feature matrix X, run it through model and generate a pd.DataFrame of statistics needed for
    plotting roc curve.
    args:
      model: sklearn model
      X: feature matrix
      Y: true labels
      dataset_name: name of training dataset (EVmutation / DMSexp)
      test_set: name of testing dataset ('same protein'/'BRCA1'/etc)
      model_name: name of model (e.g. 'AA_Cons_MutSig_QSASA')
      model_type: type of model (all variants / no-core)
    """
    roc = pd.DataFrame(list(sklearn.metrics.roc_curve( Y, \
                                                       model.predict_proba( X )[:, 1] ))).T
    roc.columns = ['fpr', 'tpr', 'threshold']
    roc['dataset_name'] = dataset_name
    roc['test_set'] = test_set
    roc['model_name'] = model_name
    roc['model_type'] = model_type
    return roc    

In [23]:
with open('setup/AA_Cons_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('data/EVmutation_test_set.pickle', 'rb') as r:
    X_test, Y_test = pickle.load(r)
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation', 'same proteins', 'AA + Cons + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57], ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation', 'same proteins', 'AA + Cons + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation', 'same proteins', \
                  'AA + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation', 'same proteins', 'AA + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation', 'same proteins', 'AA only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation', 'same proteins', 'MutSig only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')


AA_Cons_MutSig_QSASA :: Accuracy =  0.7342945239694761 , F1-score =  0.6962805442337605 , ROC-AUC =  0.8077505351027433
AA_Cons_MutSig       :: Accuracy =  0.7085766454791311 , F1-score =  0.664525084275233 , ROC-AUC =  0.773645187748739
AA_MutSig_QSASA      :: Accuracy =  0.7198077618727714 , F1-score =  0.6790069858309982 , ROC-AUC =  0.7890640049417041
AA_MutSig            :: Accuracy =  0.6861316383304911 , F1-score =  0.6329149626286843 , ROC-AUC =  0.740971876582381
AA only              :: Accuracy =  0.6858215768349611 , F1-score =  0.6333353436664455 , ROC-AUC =  0.7406807873737661
MutSig only          :: Accuracy =  0.5926997743441338 , F1-score =  0.47339702901939823 , ROC-AUC =  0.6139261713442673


The full model (AA + Cons + MutSig + Q(SASA)) reaches about 73% accuracy (ROC-AUC 0.81) on the holdout set.
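
The per-model evaluation above repeats the same three metrics for each column subset. A more compact sketch, run here on synthetic data with freshly trained toy models rather than the pickled ones, loops over a mapping from model name to column indices. The mapping `SUBSETS` and the function `evaluate` are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(model, X, y):
    """Return accuracy, F1 and ROC-AUC of a fitted binary classifier."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]
    return {
        "accuracy": accuracy_score(y, pred),
        "f1": f1_score(y, pred, average="binary"),
        "roc_auc": roc_auc_score(y, prob),
    }

# Synthetic data with 58 columns standing in for the real features.
X_demo, y_demo = make_classification(n_samples=400, n_features=58,
                                     n_informative=10, random_state=0)
X_tr, y_tr, X_te, y_te = X_demo[:300], y_demo[:300], X_demo[300:], y_demo[300:]

# Column subsets mirroring the slices used in this notebook.
SUBSETS = {
    "AA_Cons_MutSig_QSASA": list(range(58)),
    "AA_MutSig_QSASA": list(range(54)) + [57],
    "AA only": list(range(40)),
}

results = {}
for name, cols in SUBSETS.items():
    toy = GradientBoostingClassifier(n_estimators=20, random_state=0)
    toy.fit(X_tr[:, cols], y_tr)
    results[name] = evaluate(toy, X_te[:, cols], y_te)
    print(name, results[name])
```

Keeping the column indices in one place means the same subset is guaranteed to be used for both training and prediction.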

'no-core' predictors, evaluated on the same holdout set (core variants included):

In [24]:
with open('setup/AA_Cons_MutSig_QSASA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('data/EVmutation_test_set.pickle', 'rb') as r:
    X_test, Y_test = pickle.load(r)
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation', 'same proteins', 'AA + Cons + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57], ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation', 'same proteins', 'AA + Cons + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation', 'same proteins', \
                  'AA + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation', 'same proteins', 'AA + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation', 'same proteins', 'AA only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation', 'same proteins', 'MutSig only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')


AA_Cons_MutSig_QSASA :: Accuracy =  0.6966565035398687 , F1-score =  0.6012770004075533 , ROC-AUC =  0.777753619768631
AA_Cons_MutSig       :: Accuracy =  0.6701634713106989 , F1-score =  0.5210365701135624 , ROC-AUC =  0.7629302109341842
AA_MutSig_QSASA      :: Accuracy =  0.6714037172928187 , F1-score =  0.5528154156313002 , ROC-AUC =  0.7499191377998582
AA_MutSig            :: Accuracy =  0.6305100511601468 , F1-score =  0.4197056595606536 , ROC-AUC =  0.7291261380853359
AA only              :: Accuracy =  0.6306478562692712 , F1-score =  0.4203924960804455 , ROC-AUC =  0.7292447464645916
MutSig only          :: Accuracy =  0.5517027543796186 , F1-score =  0.06394993346041794 , ROC-AUC =  0.6041657160748721
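
Because each `to_csv` call above uses `mode = 'a'` together with `header=True`, the header row is written again on every append. One way to avoid that, sketched here with a hypothetical helper `append_csv` and a temporary file, is to write the header only when the file does not yet exist or is empty.

```python
import os
import tempfile
import pandas as pd

def append_csv(df, path):
    """Append df to path, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, header=write_header, index=False, mode="a")

# Toy ROC-like tables standing in for the output of getROCcurve.
roc_a = pd.DataFrame({"fpr": [0.0, 1.0], "tpr": [0.0, 1.0], "model_name": "A"})
roc_b = pd.DataFrame({"fpr": [0.0, 1.0], "tpr": [0.0, 1.0], "model_name": "B"})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "roc_curves_demo.csv")
    append_csv(roc_a, out_path)
    append_csv(roc_b, out_path)
    combined = pd.read_csv(out_path)

print(combined)
```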


## Train same model but without BRCA1 and PTEN variants

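These models are trained with BRCA1 and PTEN variants excluded, so that predictions on those two proteins can be used for validation.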

In [25]:
with open('human_protein_predictions/EVmutation_human_protein_predictions_collapsed_annotated_excludedBRCA1-PTEN.pickle', 'rb') as r:
    X, Y = pickle.load(r)

Y = np.array(Y < -5, dtype=int)

In [26]:
# AA_Cons_MutSig_QSASA
print('AA_Cons_MutSig_QSASA')
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X, Y)
# save the model
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'wb' ) as o:
    pickle.dump( model, o )

AA_Cons_MutSig_QSASA


  return f(*args, **kwargs)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3498           0.0286          223.71m
         2           1.3258           0.0241          222.96m
         3           1.3052           0.0203          221.81m
         4           1.2881           0.0169          221.08m
         5           1.2737           0.0145          220.77m
         6           1.2612           0.0127          220.16m
         7           1.2502           0.0109          219.71m
         8           1.2392           0.0111          219.06m
         9           1.2300           0.0092          218.63m
        10           1.2223           0.0077          218.20m
        20           1.1698           0.0033          213.56m
        30           1.1386           0.0029          208.61m
        40           1.1195           0.0011          203.84m
        50           1.1065           0.0009          199.27m
        60           1.0967           0.0017          194.57m
       

In [27]:
#_______________________________________    
# AA + Cons + MutSig
print('AA_Cons_MutSig')
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:57], Y)
with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'wb' ) as o:
    pickle.dump( model, o )

AA_Cons_MutSig


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3604           0.0180          284.88m
         2           1.3457           0.0149          285.71m
         3           1.3330           0.0126          283.22m
         4           1.3227           0.0103          281.35m
         5           1.3134           0.0093          280.92m
         6           1.3044           0.0089          283.77m
         7           1.2976           0.0068          284.67m
         8           1.2888           0.0088          286.65m
         9           1.2827           0.0061          284.38m
        10           1.2767           0.0060          283.94m
        20           1.2333           0.0033          271.05m
        30           1.2077           0.0017          255.86m
        40           1.1899           0.0015          237.20m
        50           1.1766           0.0011          221.37m
        60           1.1675           0.0004          212.38m
       

In [28]:
#_______________________________________    
# AA MutSig QSASA
print('AA_MutSig_QSASA')
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, [i for i in range(54)] + [57]], Y)
with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'wb' ) as o:
    pickle.dump( model, o )

AA_MutSig_QSASA


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3524           0.0261          288.82m
         2           1.3308           0.0214          280.75m
         3           1.3134           0.0176          276.54m
         4           1.2965           0.0170          274.92m
         5           1.2824           0.0142          273.63m
         6           1.2702           0.0119          272.50m
         7           1.2592           0.0111          272.07m
         8           1.2499           0.0091          273.01m
         9           1.2411           0.0089          278.00m
        10           1.2343           0.0070          282.38m
        20           1.1876           0.0030          309.06m
        30           1.1616           0.0021          292.14m
        40           1.1435           0.0033          266.07m
        50           1.1321           0.0013          245.58m
        60           1.1226           0.0009          222.50m
       

In [29]:
# AA_MutSig
print('AA_MutSig')
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:54], Y)
# save the model
with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'wb' ) as o:
    pickle.dump( model, o )

AA_MutSig




      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3659           0.0126          168.34m
         2           1.3557           0.0102          168.50m
         3           1.3467           0.0090          168.09m
         4           1.3382           0.0084          167.76m
         5           1.3320           0.0064          167.75m
         6           1.3261           0.0058          167.11m
         7           1.3178           0.0082          166.25m
         8           1.3123           0.0056          165.86m
         9           1.3055           0.0068          165.22m
        10           1.2985           0.0070          164.79m
        20           1.2620           0.0021          161.92m
        30           1.2402           0.0015          158.81m
        40           1.2253           0.0013          155.38m
        50           1.2140           0.0008          151.96m
        60           1.2082           0.0010          148.64m
       

In [30]:
# AA only
print('AA only')
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:40], Y)
# save the model
with open('setup/AA/GBM-excludeBRCA1-PTEN.pickle', 'wb' ) as o:
    pickle.dump( model, o )

AA only




      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3662           0.0122          122.85m
         2           1.3564           0.0099          123.12m
         3           1.3482           0.0082          122.35m
         4           1.3414           0.0067          122.12m
         5           1.3355           0.0060          122.56m
         6           1.3297           0.0058          122.39m
         7           1.3247           0.0051          123.24m
         8           1.3178           0.0068          122.23m
         9           1.3136           0.0044          121.86m
        10           1.3076           0.0059          121.02m
        20           1.2734           0.0024          117.67m
        30           1.2512           0.0016          115.12m
        40           1.2343           0.0011          112.63m
        50           1.2253           0.0004          110.33m
        60           1.2173           0.0007          107.77m
       

In [31]:
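# MutSig only
# NOTE: given the feature layout in Setup (20 + 20 + 4 + 6 + 4 + 3 + 1), the
# trinucleotide-context columns are 40-53, so the slice 41:54 below omits column 40;
# the evaluation cells use the same slice.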
print('MutSig only')
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 41:54], Y)
# save the model
with open('setup/MutSig/GBM-excludeBRCA1-PTEN.pickle', 'wb' ) as o:
    pickle.dump( model, o )

MutSig only




      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3715           0.0070           62.96m
         2           1.3659           0.0057           61.97m
         3           1.3614           0.0045           60.68m
         4           1.3576           0.0037           59.53m
         5           1.3543           0.0031           59.00m
         6           1.3520           0.0026           58.67m
         7           1.3498           0.0021           58.47m
         8           1.3478           0.0017           58.18m
         9           1.3466           0.0014           57.93m
        10           1.3455           0.0011           57.72m
        20           1.3395           0.0004           56.00m
        30           1.3371           0.0001           54.91m
        40           1.3366           0.0000           53.79m
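
The printed log for this model ends at iteration 40, far short of `n_estimators = 500`, which is consistent with `n_iter_no_change = 3` stopping training early (the output itself does not say so). The sketch below illustrates a generic patience rule of that kind on a made-up sequence of validation losses. It is a simplification for illustration, not scikit-learn's exact implementation, and `stop_iteration`, `patience` and `tol` are names chosen here.

```python
def stop_iteration(val_losses, patience=3, tol=1e-4):
    """Return the 1-based iteration at which training would stop.

    Training stops once `patience` consecutive iterations fail to improve
    the best validation loss seen so far by more than `tol`. If that never
    happens, the number of available iterations is returned.
    """
    best = float("inf")
    stale = 0  # consecutive iterations without sufficient improvement
    for i, loss in enumerate(val_losses, start=1):
        if loss < best - tol:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return len(val_losses)

# Made-up validation losses: steady gains followed by a plateau.
losses = [1.370, 1.360, 1.352, 1.347, 1.345, 1.345, 1.345, 1.345, 1.344]
print(stop_iteration(losses))
```

In scikit-learn, setting `n_iter_no_change` also holds out `validation_fraction` of the training data (0.1 by default) to compute the loss used for this check, so slightly less data is used for fitting.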


## 'no-core' predictors

In [32]:
with open('human_protein_predictions/EVmutation_human_protein_predictions_collapsed_annotated_excludedBRCA1-PTEN.pickle', 'rb') as r:
    X, Y = pickle.load(r)
# flatten the labels to 1-D to avoid scikit-learn's column-vector warning
Y = np.array(Y < -5, dtype=int).ravel()

# 'no-core': keep only surface positions, i.e. drop variants with Q(SASA) < 0.15
# (Q(SASA) is the last column of X)
surface = X[:, -1] >= 0.15
X = X[surface, :]
Y = Y[surface]

# now train the actual model
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X, Y)
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA + MutSig
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:54], Y)
with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA + Cons + MutSig
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:57], Y)
with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA MutSig QSASA
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, [i for i in range(54)] + [57]], Y)
with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# AA only
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 0:40], Y)
with open('setup/AA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

#_______________________________________    
# MutSig only
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_iter_no_change = 3,\
                                    n_estimators = 500, subsample = 0.7, verbose = 1 )
model = model.fit(X[:, 41:54], Y)
with open('setup/MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2227           0.0177           90.87m
         2           1.2088           0.0138           88.78m
         3           1.1973           0.0112           87.88m
         4           1.1885           0.0092           87.98m
         5           1.1795           0.0080           87.89m
         6           1.1724           0.0078           87.77m
         7           1.1652           0.0068           87.65m
         8           1.1591           0.0064           87.19m
         9           1.1534           0.0054           86.96m
        10           1.1490           0.0051           86.60m
        20           1.1129           0.0037           84.54m
        30           1.0936           0.0020           82.66m
        40           1.0793           0.0010           81.06m
        50           1.0694           0.0010           79.65m
        60           1.0612           0.0007           79.21m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2269           0.0134           78.76m
         2           1.2164           0.0104           78.85m
         3           1.2087           0.0081           77.83m
         4           1.2016           0.0069           77.60m
         5           1.1961           0.0056           77.41m
         6           1.1915           0.0046           77.40m
         7           1.1877           0.0037           76.91m
         8           1.1826           0.0053           76.67m
         9           1.1768           0.0054           76.58m
        10           1.1728           0.0040           76.53m
        20           1.1458           0.0016           74.80m
        30           1.1273           0.0014           72.73m
        40           1.1178           0.0010           71.67m
        50           1.1107           0.0008           70.20m
        60           1.1053           0.0005           68.38m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2227           0.0175           76.87m
         2           1.2091           0.0137           77.55m
         3           1.1984           0.0110           77.16m
         4           1.1897           0.0092           77.40m
         5           1.1816           0.0077           77.28m
         6           1.1732           0.0077           77.68m
         7           1.1671           0.0058           77.99m
         8           1.1614           0.0061           78.20m
         9           1.1560           0.0047           78.16m
        10           1.1514           0.0044           77.86m
        20           1.1157           0.0029           76.96m
        30           1.0961           0.0023           74.88m
        40           1.0827           0.0011           73.27m
        50           1.0730           0.0009           71.45m
        60           1.0678           0.0005           69.65m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2268           0.0137           72.19m
         2           1.2159           0.0106           73.82m
         3           1.2088           0.0083           73.98m
         4           1.2008           0.0068           72.88m
         5           1.1949           0.0059           72.41m
         6           1.1896           0.0062           73.05m
         7           1.1834           0.0051           73.43m
         8           1.1780           0.0057           73.43m
         9           1.1734           0.0048           73.20m
        10           1.1688           0.0038           73.31m
        20           1.1393           0.0015           74.49m
        30           1.1217           0.0015           75.88m
        40           1.1097           0.0009           75.49m
        50           1.1033           0.0008           74.80m
        60           1.0963           0.0008           73.49m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2268           0.0127           44.32m
         2           1.2184           0.0098           44.20m
         3           1.2101           0.0079           43.78m
         4           1.2037           0.0063           43.74m
         5           1.1988           0.0051           43.40m
         6           1.1935           0.0049           43.24m
         7           1.1895           0.0041           43.29m
         8           1.1858           0.0036           43.29m
         9           1.1826           0.0030           43.32m
        10           1.1796           0.0030           43.44m
        20           1.1510           0.0022           41.79m
        30           1.1346           0.0017           40.79m
        40           1.1225           0.0005           39.78m
        50           1.1165           0.0005           39.25m
        60           1.1100           0.0005           38.81m
       

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.2337           0.0064           28.49m
         2           1.2292           0.0051           28.78m
         3           1.2246           0.0045           28.22m
         4           1.2213           0.0034           28.00m
         5           1.2170           0.0031           27.69m
         6           1.2153           0.0023           27.78m
         7           1.2129           0.0022           27.58m
         8           1.2118           0.0019           27.74m
         9           1.2097           0.0015           27.71m
        10           1.2083           0.0013           27.87m
        20           1.2026           0.0003           27.26m
        30           1.2001           0.0001           26.42m
        40           1.1988           0.0000           26.09m
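
The twelve saved models follow a regular naming scheme: one directory per feature subset under `setup/`, with a `_no-core` suffix for the surface-only variants. The following is a small sketch of that scheme, which could stand in for the hard-coded paths in the benchmark cells; `SUBSET_DIRS` and `model_path` are hypothetical names introduced here.

```python
# Directory used for each feature subset, as in the training cells above.
SUBSET_DIRS = {
    "AA_Cons_MutSig_QSASA": "AA_Cons_MutSig_QSASA",
    "AA_Cons_MutSig": "AA_Cons_MutSig",
    "AA_MutSig_QSASA": "AA_MutSig_QSASA",
    "AA_MutSig": "AA_MutSig",
    "AA only": "AA",
    "MutSig only": "MutSig",
}

def model_path(subset, no_core=False):
    """Path of the pickled model for a feature subset."""
    suffix = "_no-core" if no_core else ""
    return f"setup/{SUBSET_DIRS[subset]}/GBM-excludeBRCA1-PTEN{suffix}.pickle"

# List all twelve paths: six subsets, with and without the 'no-core' filter.
for no_core in (False, True):
    for subset in SUBSET_DIRS:
        print(model_path(subset, no_core))
```

Looping over such a table keeps the print label, the pickle path and the `no_core` flag in one place, which makes copy-paste slips between the repeated blocks less likely.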


## Independent benchmark

1. Loss-of-function and functional variants, as annotated by the BRCA1 DMS experiment ([Findlay et al., Nature 2018](https://www.nature.com/articles/s41586-018-0461-z))

In [33]:
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('human_protein_predictions/BRCA_DMS_annotated.pickle', 'rb') as r:
    X_test, Y_test = pickle.load(r)
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA + Cons + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA + Cons + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', \
                  'BRCA1 DMS', 'AA + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'MutSig only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

#________________________________
# no-core predictors
print("\n'no-core' predictors:\n")
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA + Cons + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA + Cons + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'AA only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'BRCA1 DMS', 'MutSig only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')



AA_Cons_MutSig_QSASA :: Accuracy =  0.8045197740112995 , F1-score =  0.6418219461697723 , ROC-AUC =  0.8495370962379478
AA_Cons_MutSig       :: Accuracy =  0.7627118644067796 , F1-score =  0.5248868778280542 , ROC-AUC =  0.7758049666248945
AA_MutSig_QSASA      :: Accuracy =  0.7593220338983051 , F1-score =  0.6216696269982239 , ROC-AUC =  0.8477340732973582
AA_MutSig            :: Accuracy =  0.7412429378531074 , F1-score =  0.5638095238095238 , ROC-AUC =  0.7605783739546303
AA only              :: Accuracy =  0.7423728813559322 , F1-score =  0.569811320754717 , ROC-AUC =  0.768004654612414
MutSig only          :: Accuracy =  0.63954802259887 , F1-score =  0.2989010989010989 , ROC-AUC =  0.5552479476228228

'no-core' predictors:

AA_Cons_MutSig_QSASA :: Accuracy =  0.7446327683615819 , F1-score =  0.3891891891891892 , ROC-AUC =  0.7721541648551189
AA_Cons_MutSig       :: Accuracy =  0.7310734463276836 , F1-score =  0.22727272727272724 , ROC-AUC =  0.7076481419912535
AA_MutSig_QSASA    
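
Each benchmark block appends its ROC curve to `roc_curves_all.csv` with `header=True, mode = 'a'`. With those arguments pandas writes the column names on every call, so the file is likely to end up with header rows interleaved with the data. One way around this is to write the header only when the file is new or empty. The sketch below shows the idea with the standard-library `csv` module; `append_rows` and the column names are made up for illustration, since the columns returned by `getROCcurve` are not assumed here.

```python
import csv
import os
import tempfile

def append_rows(path, fieldnames, rows):
    """Append dict rows to a CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

# Demonstration in a temporary directory with hypothetical columns.
fields = ["fpr", "tpr", "predictor"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roc_curves_demo.csv")
    append_rows(path, fields, [{"fpr": 0.0, "tpr": 0.0, "predictor": "AA only"}])
    append_rows(path, fields, [{"fpr": 1.0, "tpr": 1.0, "predictor": "AA only"}])
    with open(path, newline="") as handle:
        lines = handle.read().splitlines()
print(lines)
```

With pandas the same effect can be obtained by passing `header=not os.path.exists(path)` to `to_csv` while keeping `mode='a'`.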

2. PTEN variant-abundance DMS experiment

In [34]:
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('human_protein_predictions/PTEN_stability_DMS_annotated.pickle', 'rb') as r:
    X_test, Y_test = pickle.load(r)
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + Cons + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + Cons + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'MutSig only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

#________________________________
# no-core predictors
print("\n\n'no-core' predictors:\n")
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + Cons + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + Cons + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'AA only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN abundance DMS', 'MutSig only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')



AA_Cons_MutSig_QSASA :: Accuracy =  0.6832579185520362 , F1-score =  0.732824427480916 , ROC-AUC =  0.7511494252873564
AA_Cons_MutSig       :: Accuracy =  0.6153846153846154 , F1-score =  0.6718146718146719 , ROC-AUC =  0.7202791461412151
AA_MutSig_QSASA      :: Accuracy =  0.665158371040724 , F1-score =  0.6782608695652174 , ROC-AUC =  0.7644909688013136
AA_MutSig            :: Accuracy =  0.6832579185520362 , F1-score =  0.669811320754717 , ROC-AUC =  0.744376026272578
AA only              :: Accuracy =  0.665158371040724 , F1-score =  0.6574074074074073 , ROC-AUC =  0.7422413793103448
MutSig only          :: Accuracy =  0.502262443438914 , F1-score =  0.375 , ROC-AUC =  0.5406403940886699


'no-core' predictors:

AA_Cons_MutSig_QSASA :: Accuracy =  0.6380090497737556 , F1-score =  0.6638655462184874 , ROC-AUC =  0.723152709359606
AA_Cons_MutSig       :: Accuracy =  0.6108597285067874 , F1-score =  0.5981308411214952 , ROC-AUC =  0.7063218390804598
AA_MutSig_QSASA      :: Accuracy = 

3. PTEN activity DMS experiment using a phosphatase assay

In [35]:
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('human_protein_predictions/PTEN_activity_DMS_annotated.pickle', 'rb') as r:
    X_test, Y_test = pickle.load(r)
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + Cons + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + Cons + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-excludeBRCA1-PTEN.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'MutSig only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

#________________________________
# no-core predictors
print("\n\n'no-core' predictors:\n")
with open('setup/AA_Cons_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + Cons + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + Cons + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'AA only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-excludeBRCA1-PTEN_no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'EVmutation (exclude BRCA1 & PTEN)', 'PTEN activity DMS', 'MutSig only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')



AA_Cons_MutSig_QSASA :: Accuracy =  0.6162790697674418 , F1-score =  0.5864661654135338 , ROC-AUC =  0.8200530212084833
AA_Cons_MutSig       :: Accuracy =  0.6162790697674418 , F1-score =  0.5965770171149144 , ROC-AUC =  0.798031712685074
AA_MutSig_QSASA      :: Accuracy =  0.7162790697674418 , F1-score =  0.6411764705882352 , ROC-AUC =  0.8150510204081634
AA_MutSig            :: Accuracy =  0.7255813953488373 , F1-score =  0.6168831168831169 , ROC-AUC =  0.7867146858743497
AA only              :: Accuracy =  0.7116279069767442 , F1-score =  0.6050955414012739 , ROC-AUC =  0.7881902761104441
MutSig only          :: Accuracy =  0.6255813953488372 , F1-score =  0.3534136546184739 , ROC-AUC =  0.5903861544617849


'no-core' predictors:

AA_Cons_MutSig_QSASA :: Accuracy =  0.672093023255814 , F1-score =  0.6178861788617886 , ROC-AUC =  0.7706957783113246
AA_Cons_MutSig       :: Accuracy =  0.7209302325581395 , F1-score =  0.6226415094339622 , ROC-AUC =  0.7820003001200481
AA_MutSig_QSASA  
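
The evaluation cells above repeat one pattern twelve times: load a pickled model, select the columns of `X_test` that belong to its feature set, then score. As a minimal sketch (not the code that produced the outputs above), the pattern could be driven from a single table. The names `FEATURE_SETS`, `model_path` and `evaluation_plan` are hypothetical; the column indices and file paths are copied from the cells above. The sketch only builds the plan, it does not load any model.

```python
# Column indices per feature set, copied from the slices used in the cells above.
FEATURE_SETS = {
    'AA_Cons_MutSig_QSASA': list(range(58)),
    'AA_Cons_MutSig': list(range(57)),
    'AA_MutSig_QSASA': list(range(54)) + [57],
    'AA_MutSig': list(range(54)),
    'AA': list(range(40)),
    # Mirrors X_test[:, 41:54] as written above (13 columns, starting at 41).
    'MutSig': list(range(41, 54)),
}


def model_path(feature_set, no_core=False):
    """Build the pickle path used above for a given feature set."""
    suffix = '_no-core' if no_core else ''
    return 'setup/{}/GBM-excludeBRCA1-PTEN{}.pickle'.format(feature_set, suffix)


def evaluation_plan():
    """List (model path, column indices, variant-set label) for every evaluation."""
    plan = []
    for no_core in (False, True):
        label = 'no-core' if no_core else 'all variants'
        for name, cols in FEATURE_SETS.items():
            plan.append((model_path(name, no_core), cols, label))
    return plan


plan = evaluation_plan()
for path, cols, label in plan:
    print(label, path, len(cols))
```

Note that the `MutSig only` evaluations slice `41:54`, whereas the SHAP cell further down selects `range(40, 54)` for the MutSig model; the table keeps the evaluation slice unchanged so that the difference stays visible.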

# Feature importance

Use the [SHAP (SHapley Additive exPlanations)](https://github.com/slundberg/shap) library.

In [36]:
# function to perform SHAP analysis and output the explanatory matrix

def doSHAP(model_file, data_file, out_file, column_selection, feature_names):
    """
    Run a dataset through a fitted model and write the SHAP explanatory matrix to CSV.
    Args:
      model_file: filepath to the pickled sklearn model
      data_file: filepath to the pickled dataset, stored as an (X, Y) tuple
      out_file: filepath of the output CSV to which the SHAP matrix is written
      column_selection: list of column indices of X to keep; must match the features the model was trained on
      feature_names: list of names for all columns of the full X; the selected names become the output column names
    """
    with open(model_file, 'rb' ) as r:
        model = pickle.load( r )
    with open(data_file, 'rb') as r:
        X, Y = pickle.load(r)
    X = X[:, column_selection]    
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    out = pd.DataFrame(shap_values.values, columns = [feature_names[i] for i in column_selection])
    out.to_csv(out_file, index=False)
    return None

In [37]:
# feature names
feature_cols = []
AAs = list('ACDEFGHIKLMNPQRSTVWY')
feature_cols +=  ['WT_AA_' + aa for aa in AAs]
feature_cols +=  ['MUT_AA_' + aa for aa in AAs]
bs = list('AGCT')
feature_cols +=  ['MutSig_-1_' + b for b in bs]
feature_cols +=  ['MutSig_+1_' + b for b in bs]
subs = ['C>A', 'C>G', 'C>T', 'T>A', 'T>C', 'T>G']
feature_cols +=  ['MutSig_' + m for m in subs] 
feature_cols += ['PhyloP_0', 'PhyloP_-1', 'PhyloP_1', 'Q(SASA)']
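
Since `doSHAP` looks names up by column index, it is worth checking that the name list lines up with the 58-column layout. The sketch below rebuilds the list as in the cell above and checks it against block boundaries. The boundaries in `BLOCKS` are an assumption inferred from the slices used elsewhere in this notebook (`0:40`, `0:54`, `0:57`, and column 57), and `names_in_block` is a hypothetical helper.

```python
# Rebuild the feature names exactly as in the cell above.
AAs = list('ACDEFGHIKLMNPQRSTVWY')
bs = list('AGCT')
subs = ['C>A', 'C>G', 'C>T', 'T>A', 'T>C', 'T>G']
feature_cols = []
feature_cols += ['WT_AA_' + aa for aa in AAs]
feature_cols += ['MUT_AA_' + aa for aa in AAs]
feature_cols += ['MutSig_-1_' + b for b in bs]
feature_cols += ['MutSig_+1_' + b for b in bs]
feature_cols += ['MutSig_' + m for m in subs]
feature_cols += ['PhyloP_0', 'PhyloP_-1', 'PhyloP_1', 'Q(SASA)']

# Assumed block boundaries (start inclusive, stop exclusive) and name prefixes.
BLOCKS = {
    'AA': (0, 40, ('WT_AA_', 'MUT_AA_')),
    'MutSig': (40, 54, ('MutSig_',)),
    'PhyloP': (54, 57, ('PhyloP_',)),
    'Q(SASA)': (57, 58, ('Q(SASA)',)),
}


def names_in_block(block):
    """Return the feature names that fall inside one block of columns."""
    start, stop, _ = BLOCKS[block]
    return feature_cols[start:stop]


# One name per column, no duplicates, and every name sits in the expected block.
assert len(feature_cols) == 58
assert len(set(feature_cols)) == 58
for block, (start, stop, prefixes) in BLOCKS.items():
    assert all(name.startswith(prefixes) for name in names_in_block(block))
    print(block, stop - start, names_in_block(block)[0])
```

This only verifies that names and block sizes are consistent with each other. Whether the order within a block (for example the three PhyloP columns) matches the actual column order of `X` has to be checked against the code that built the dataset.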


In [38]:
# explain the model's predictions using SHAP
doSHAP(model_file = 'setup/AA_Cons_MutSig_QSASA/GBM.pickle', data_file = 'data/EVmutation_test_set.pickle', \
       out_file = "GBM-AA_Cons_MutSig_QSASA_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(58)])
doSHAP(model_file = 'setup/AA_MutSig_QSASA/GBM.pickle', data_file = 'data/EVmutation_test_set.pickle', \
       out_file = "GBM-AA_MutSig_QSASA_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(54)] + [57])
doSHAP(model_file = 'setup/AA_Cons_MutSig/GBM.pickle', data_file = 'data/EVmutation_test_set.pickle', \
       out_file = "GBM-AA_Cons_MutSig_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(57)])
doSHAP(model_file = 'setup/AA_MutSig/GBM.pickle', data_file = 'data/EVmutation_test_set.pickle', \
       out_file = "GBM-AA_MutSig_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(54)])
doSHAP(model_file = 'setup/AA/GBM.pickle', data_file = 'data/EVmutation_test_set.pickle', \
       out_file = "GBM-AA_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(40)])
doSHAP(model_file = 'setup/MutSig/GBM.pickle', data_file = 'data/EVmutation_test_set.pickle', \
       out_file = "GBM-MutSig_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(40, 54)])
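
Each CSV written above holds one row per variant and one column per feature. A common first summary is the mean absolute SHAP value per feature, which ranks features by their average contribution to the model output. The following is a minimal sketch using only the standard library; `mean_abs_shap` is a hypothetical helper, and the toy matrix stands in for a real output file, which would be opened with `open(out_file, newline='')` instead.

```python
import csv
import io


def mean_abs_shap(csv_handle):
    """Return {feature: mean |SHAP|}, sorted from most to least important."""
    reader = csv.DictReader(csv_handle)
    totals = {name: 0.0 for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name, value in row.items():
            totals[name] += abs(float(value))
    if n_rows == 0:
        raise ValueError('SHAP matrix has no rows')
    means = {name: total / n_rows for name, total in totals.items()}
    return dict(sorted(means.items(), key=lambda kv: kv[1], reverse=True))


# Toy SHAP matrix with two features and two variants (made-up values).
toy = io.StringIO('WT_AA_A,Q(SASA)\n0.1,-0.4\n-0.3,0.2\n')
ranking = mean_abs_shap(toy)
for feature, importance in ranking.items():
    print(feature, round(importance, 3))
```

For one-hot encoded features, the per-column values can additionally be summed within a block (for example all 20 `WT_AA_*` columns) to obtain one importance per original feature.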