# Gradient boosting trees trained on experimental DMS assays

## Setup

* Data: DMS assay data from 9 datasets, covering all possible AA substitutions that can be generated by a single nucleotide change (see `DMSexp_prepare_data.ipynb` for details on what has been done).
* Feature vector: each variant is represented by 5 features; after one-hot encoding we have 20 + 20 + 4 + 6 + 4 + 3 + 1 = **58 features**:
  + **WT Amino acid** (20 levels)
  + **MUT Amino Acid** (20 levels)
  + **trinucleotide context** (96 levels) (*separate one-hot encoding for positions -1 [4 features], 0 [6 features], +1 [4 features]*)
  + phyloP **conservation score** [separately for positions -1, 0, 1] [*3 features*]
  + **Q(SASA)** of the mutated position
* Objective: Predict whether a given variant is damaging (label '1') or neutral (label '0'). These labels were assigned in `DMSexp_prepare_data.ipynb`.
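The column layout implied by the feature list can be written down explicitly. The sketch below assumes the columns of `X` are stored in the order listed (WT amino acid, MUT amino acid, the three MutSig blocks, PhyloP, then Q(SASA)); the names `FEATURE_BLOCKS`, `block_ranges` and `columns_for` are illustrative and do not appear in the original notebook.

```python
# Hypothetical helper: map each feature block to its column range in X,
# assuming the blocks are stored in the order given in the Setup list.
FEATURE_BLOCKS = [
    ("WT_AA", 20),      # one-hot wild-type amino acid
    ("MUT_AA", 20),     # one-hot mutant amino acid
    ("MutSig_-1", 4),   # base at position -1
    ("MutSig_0", 6),    # substitution class at position 0
    ("MutSig_+1", 4),   # base at position +1
    ("PhyloP", 3),      # conservation score at positions -1, 0, +1
    ("QSASA", 1),       # Q(SASA) of the mutated position
]

def block_ranges(blocks):
    """Return {block name: range of column indices}."""
    ranges, start = {}, 0
    for name, width in blocks:
        ranges[name] = range(start, start + width)
        start += width
    return ranges

RANGES = block_ranges(FEATURE_BLOCKS)

def columns_for(*names):
    """Concatenate the column indices of the requested blocks."""
    cols = []
    for name in names:
        cols.extend(RANGES[name])
    return cols

n_features = sum(width for _, width in FEATURE_BLOCKS)
print(n_features)                       # total number of columns
print(columns_for("PhyloP", "QSASA"))   # conservation + Q(SASA) columns
```

Under this assumed ordering, the amino-acid blocks occupy columns 0 to 39, the MutSig blocks columns 40 to 53, PhyloP columns 54 to 56 and Q(SASA) column 57.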

In [1]:
import os
import glob
import re
import random
import pickle
import numpy as np
import pandas as pd
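import sklearn
import sklearn.metrics  # used below; `import sklearn` alone is not guaranteed to load this submodule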
import imblearn
import time
import shap

In [2]:
with open('data/MAVEdb/human_proteins_retained_ready.pickle', 'rb') as r:
    X, Y, scoreset_id = pickle.load(r)

## Idea

### Dealing with Imbalanced labels

An issue is the unequal number of variants labelled '1' and '0'. We therefore need to take steps to ensure that training/testing considers equal numbers of 1's and 0's, which requires resampling (e.g. oversampling the smaller subgroup).

Because each dataset has a different number of 1's and 0's, we need a specific way to address this other than simply assigning different weights to 1 and 0 in the training step (ideally this weight would need to be adjusted for each separate DMS dataset).

We therefore use the following scheme to establish the training set. For each DMS dataset:

1. Let $m$ be the total number of variants included. First set aside $0.2 \times m$ variants as the test set. Let $m' = 0.8 \times m$, i.e. the remaining variants.


2. For the remaining variants, let $(m'_0, m'_1)$ be the number of variants in this group labelled 0 and 1 respectively. Then sample, without replacement, $m'_s$ variants separately from the 0 and the 1 subsets, where $m'_s = \min(m'_0, m'_1)$. 

Note that for the smaller subgroup this means effectively no sampling is done (all of its variants are kept). The sampled subsets from the 0's and the 1's together form the training set.


3. Repeat step (2) 100 times to derive 100 training sets and train 100 models. The final prediction will be derived by majority voting over the 100 models (i.e. averaging their predicted labels).
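Steps (2) and (3) of this scheme can be sketched as follows for a single DMS dataset. This is a minimal illustration using only the standard library; the function name `balanced_subsample` and the toy labels are assumptions, not code from the notebook.

```python
import random

def balanced_subsample(labels, rng):
    """Step 2: draw min(m'_0, m'_1) row indices from each class, without replacement."""
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    m_s = min(len(idx0), len(idx1))
    return sorted(rng.sample(idx0, m_s) + rng.sample(idx1, m_s))

# Toy labels for one DMS dataset: 7 neutral (0) and 3 damaging (1) variants.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

rng = random.Random(1)
# Step 3: repeat the draw to obtain several balanced training sets
# (100 in the text; 5 here to keep the example short).
training_sets = [balanced_subsample(labels, rng) for _ in range(5)]

for rows in training_sets:
    n_ones = sum(labels[i] for i in rows)
    print(rows, "-> zeros:", len(rows) - n_ones, "ones:", n_ones)
```

Every draw keeps all rows of the smaller class and a different random subset of the larger class, which is what makes the resulting models differ from one another.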


In [3]:
# just to see how imbalanced the labels are:
# proportion of labels == 1 (damaging)
np.mean(Y)

0.33009040880503143
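The overall proportion hides the per-dataset variation mentioned above. A minimal sketch of how the damaging fraction could be tabulated per dataset is given below; the function name and the toy dataset identifiers are made up for illustration, and in the notebook the inputs would be the flattened `Y` and `scoreset_id` arrays.

```python
from collections import defaultdict

def damaging_fraction_by_dataset(labels, dataset_ids):
    """Fraction of variants labelled 1 within each DMS dataset."""
    counts = defaultdict(lambda: [0, 0])   # dataset id -> [n_total, n_damaging]
    for y, d in zip(labels, dataset_ids):
        counts[d][0] += 1
        counts[d][1] += y
    return {d: n_dam / n_tot for d, (n_tot, n_dam) in counts.items()}

# Toy input: two hypothetical datasets with different class balance.
labels      = [1, 0, 0, 0, 1, 1, 1, 0]
dataset_ids = ["setA", "setA", "setA", "setA", "setB", "setB", "setB", "setB"]

fractions = damaging_fraction_by_dataset(labels, dataset_ids)
print(fractions)
```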

In [4]:
def sample_fixed_set(X_set, Y_set, exp_id_set, seed = 1, proportion = 0.2):
    """
    Randomly sample a fixed proportion of the observations separately from each DMS dataset,
    so that every dataset contributes the same fraction of its variants.
    Input:
      - X_set: np.array of shape (n_samples, n_features) The feature matrix.
      - Y_set: np.array of shape (n_samples, 1). The labels.
      - exp_id_set: np.array of shape (n_samples, 1). The experiment ID for each observation.
    Return:
      a list of row IDs which satisfy the condition (i.e. proportion * m variants are retained
      and m is the number of variants in each DMS dataset)
    """
    exp_ids = sorted(set(exp_id_set.flatten()))  # sorted: set order may vary between sessions, which would change the seeded sample
    random.seed(seed) # set seed
    n_samples = X_set.shape[0]
    id_vec = range(n_samples)
    id_keep = []  # row IDs of the obs to keep
    for exp_id in exp_ids:
        subset = [i for i in id_vec if exp_id_set[i] == exp_id]
        m_s = int(np.floor(proportion * len(subset))) # number to be sampled from each subgroup
        id_keep += random.sample(subset, m_s)
    return id_keep
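For larger inputs, the same per-dataset sampling can be done in a single pass by grouping row indices first. The sketch below is an assumed alternative, not the notebook's implementation; `sample_per_group` and the toy group labels are hypothetical.

```python
import random
from collections import defaultdict

def sample_per_group(group_ids, proportion, seed):
    """Pick floor(proportion * group size) row indices from every group."""
    rows_by_group = defaultdict(list)
    for row, gid in enumerate(group_ids):   # single pass over the rows
        rows_by_group[gid].append(row)
    rng = random.Random(seed)
    keep = []
    for gid in sorted(rows_by_group):       # sorted for a reproducible order
        rows = rows_by_group[gid]
        keep += rng.sample(rows, int(proportion * len(rows)))
    return keep

# Toy input: 10 rows from group "a" followed by 5 rows from group "b".
group_ids = ["a"] * 10 + ["b"] * 5
test_rows = sample_per_group(group_ids, proportion=0.2, seed=1234)
held_out = set(test_rows)                   # set gives O(1) membership tests
train_rows = [i for i in range(len(group_ids)) if i not in held_out]
print(len(test_rows), len(train_rows))
```

Each group contributes 20% of its own rows to the test split, so small datasets are not swamped by large ones.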

In [5]:
# set aside 20% as test set.
test_set_id = sample_fixed_set(X, Y, scoreset_id, proportion = 0.2, seed = 1234)
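test_set_lookup = set(test_set_id)  # set membership is O(1); avoids scanning the list for every row
train_set_id = [i for i in range(X.shape[0]) if i not in test_set_lookup]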

X_test = X[test_set_id, :]
Y_test = Y[test_set_id, :]
scoreset_id_test = scoreset_id[test_set_id, :]
X_train = X[train_set_id, :]
Y_train = Y[train_set_id, :]
scoreset_id_train = scoreset_id[train_set_id, :]

# save the whole train set
with open('data/train_set.pickle', 'wb') as f:
    pickle.dump((X_train, Y_train, scoreset_id_train), f)
    
# save the test set and forget about it for now
with open('data/test_set.pickle', 'wb') as f:
    pickle.dump((X_test, Y_test, scoreset_id_test), f)

## Architecture

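We use `GradientBoostingClassifier` from scikit-learn. Class balance in the training set is enforced by oversampling with SMOTE (from `imblearn`), and a standard hyperparameter grid search was carried out.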

In [6]:
# force balanced dataset using SMOTE
from imblearn.over_sampling import SMOTE
X_train_resampled, Y_train_resampled = SMOTE().fit_resample(X_train, Y_train)

with open('data/train_set_resampled.pickle', 'wb') as w:
    pickle.dump((X_train_resampled, Y_train_resampled), w)

In [7]:
# ref https://machinelearningmastery.com/gradient-boosting-machine-ensemble-in-python/
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# define the model with default hyperparameters
model = GradientBoostingClassifier()
# define the grid of values to search
grid = dict()
grid['n_estimators'] = [50, 100, 500]
grid['learning_rate'] = [0.01, 0.1, 1.0]
grid['subsample'] = [0.5, 0.7, 1.0]
grid['max_depth'] = [3, 4, 5, 6]

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(X_train_resampled, Y_train_resampled)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.877941 using {'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 500, 'subsample': 0.7}
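For a sense of the cost of this search, the number of candidate settings and model fits follows directly from the grid and the cross-validation scheme. The sketch below only counts combinations with the standard library; it does not call scikit-learn.

```python
from itertools import product

# Same grid and CV settings as in the cell above.
grid = {
    "n_estimators": [50, 100, 500],
    "learning_rate": [0.01, 0.1, 1.0],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [3, 4, 5, 6],
}
n_splits, n_repeats = 10, 3

candidates = list(product(*grid.values()))       # every hyperparameter combination
n_candidates = len(candidates)
n_fits = n_candidates * n_splits * n_repeats     # one fit per candidate per fold
print(n_candidates, "candidate settings,", n_fits, "model fits")
```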


In [8]:
with open('Hyperparams-GBM.pickle', 'wb') as w:
    pickle.dump( (grid_result.best_score_, grid_result.best_params_), w )

In [9]:
with open('data/train_set_resampled.pickle', 'rb') as r:
    X_train_resampled, Y_train_resampled = pickle.load( r )

In [10]:
# now train the actual model
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )

model = model.fit(X_train_resampled, Y_train_resampled)

In [11]:
%%bash

mkdir -p setup/AA_Cons_MutSig_QSASA
mkdir -p setup/AA_Cons_MutSig
mkdir -p setup/AA_MutSig_QSASA
mkdir -p setup/AA_MutSig
mkdir -p setup/AA
mkdir -p setup/MutSig

In [12]:
# save the model
with open('setup/AA_Cons_MutSig_QSASA/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA + MutSig

In [13]:
# use only the first 54 columns (ie WT [20] + MUT [20] amino acids + MutSig position -1 [4 bases], 0[6 combinations of SBS], 1 [4 bases])
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )

model = model.fit(X_train_resampled[:, 0:54], Y_train_resampled)

In [14]:
# save the model
with open('setup/AA_MutSig/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA + Cons + MutSig

In [15]:
# use only the first 57 columns (ie WT [20] + MUT [20] amino acids + MutSig position -1 [4 bases], 0[6 combinations of SBS], 1 [4 bases] + PhyloP conservation [3])
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )

model = model.fit(X_train_resampled[:, 0:57], Y_train_resampled)

In [16]:
# save the model
with open('setup/AA_Cons_MutSig/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA + MutSig + QSASA

In [17]:
# use only the first 54 columns + last column (QSASA)
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )

model = model.fit(X_train_resampled[:, [i for i in range(54)] + [57]], Y_train_resampled)

In [18]:
# save the model
with open('setup/AA_MutSig_QSASA/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### AA only

In [19]:
# use only the first 40 columns (ie WT [20] + MUT [20] amino acids)
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )

model = model.fit(X_train_resampled[:, 0:40], Y_train_resampled)

In [20]:
# save the model
with open('setup/AA/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

### MutSig only

In [21]:
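# NOTE: the 14 MutSig columns are 40-53; the slice 41:54 below keeps 13 of them (column 40 is left out),
# and the same slice is used wherever this model is evaluated.
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )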

model = model.fit(X_train_resampled[:, 41:54], Y_train_resampled)

In [22]:
# save the model
with open('setup/MutSig/GBM.pickle', 'wb' ) as o:
    pickle.dump( model, o )

## Predictors without using core variants

Just to see the effect of removing the variants most likely to be in the mutational dark matter.

In [23]:
surface = X_train_resampled[:, -1] >= 0.15

In [24]:
# remove obs where Q(SASA) < 0.15 and retrain the model
X_train_resampled = X_train_resampled[surface, :]
Y_train_resampled = Y_train_resampled[surface]

In [25]:
from sklearn.ensemble import GradientBoostingClassifier

# AA Cons MutSig QSASA
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )
model = model.fit(X_train_resampled, Y_train_resampled)
with open('setup/AA_Cons_MutSig_QSASA/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )
    
# AA_MutSig
# use only the first 54 columns (ie WT [20] + MUT [20] amino acids + MutSig position -1 [4 bases], 0[6 combinations of SBS], 1 [4 bases])
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )
model = model.fit(X_train_resampled[:, 0:54], Y_train_resampled)
with open('setup/AA_MutSig/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )
    
# AA Cons MutSig
# use only the first 57 columns (ie WT [20] + MUT [20] amino acids + MutSig position -1 [4 bases], 0[6 combinations of SBS], 1 [4 bases] + PhyloP conservation [3])
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )
model = model.fit(X_train_resampled[:, 0:57], Y_train_resampled)
with open('setup/AA_Cons_MutSig/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )
    
# AA MutSig QSASA    
# use only the first 54 columns + last column (QSASA)
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )
model = model.fit(X_train_resampled[:, [i for i in range(54)] + [57]], Y_train_resampled)
with open('setup/AA_MutSig_QSASA/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )
    
# AA only
# use only the first 40 columns (ie WT [20] + MUT [20] amino acids)
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )
model = model.fit(X_train_resampled[:, 0:40], Y_train_resampled)
with open('setup/AA/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )
    
# MutSig Only
model = GradientBoostingClassifier( learning_rate = 0.1, max_depth = 6, n_estimators = 500, subsample = 0.7 )
model = model.fit(X_train_resampled[:, 41:54], Y_train_resampled)
with open('setup/MutSig/GBM-no-core.pickle', 'wb' ) as o:
    pickle.dump( model, o )

# Evaluation

## Holdout set from same proteins/experiments encountered in training

These were the 20% variants removed as `(X_test, Y_test)`, i.e. coming from the same proteins/experiments used in training but withheld from model construction and training.

In [26]:
def getROCcurve(model, X, Y, dataset_name, test_set, model_name, model_type):
    """
    given the feature matrix X, run it through model and generate a pd.DataFrame of statistics needed for
    plotting roc curve.
    args:
      model: sklearn model
      X: feature matrix
      Y: true labels
      dataset_name: name of training dataset (EVmutation / DMSexp)
      test_set: name of testing dataset ('same protein'/'BRCA1'/etc)
      model_name: name of model (e.g. 'AA_Cons_MutSig_QSASA')
      model_type: type of model (classical / no-core)
    """
    roc = pd.DataFrame(list(sklearn.metrics.roc_curve( Y, \
                                                       model.predict_proba( X )[:, 1] ))).T
    roc.columns = ['fpr', 'tpr', 'threshold']
    roc['dataset_name'] = dataset_name
    roc['test_set'] = test_set
    roc['model_name'] = model_name
    roc['model_type'] = model_type
    return roc    
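As a reminder of what the ROC-AUC reported below measures, it equals the probability that a randomly chosen damaging variant receives a higher predicted probability than a randomly chosen neutral one. The sketch below computes it by brute force on toy values; the function `roc_auc` is an illustrative stand-in, not the scikit-learn implementation.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))   # 3 of the 4 positive/negative pairs are ordered correctly
```

Because it depends only on the ranking of the scores, this quantity is unaffected by the 0.5 decision threshold that accuracy and F1-score rely on.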

In [27]:
with open('setup/AA_Cons_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('data/test_set.pickle', 'rb') as r:
    X_test, Y_test, scoreset_test = pickle.load(r)
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'DMSexp', 'same proteins', 'AA + Cons + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57], ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'DMSexp', 'same proteins', 'AA + Cons + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'DMSexp', 'same proteins', 'AA + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'DMSexp', 'same proteins', 'AA + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'DMSexp', 'same proteins', 'AA only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'DMSexp', 'same proteins', 'MutSig only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')


AA_Cons_MutSig_QSASA :: Accuracy =  0.825178264076715 , F1-score =  0.7138832997987928 , ROC-AUC =  0.8683141553091445
AA_Cons_MutSig       :: Accuracy =  0.7917383820998278 , F1-score =  0.6492753623188406 , ROC-AUC =  0.8265829261562114
AA_MutSig_QSASA      :: Accuracy =  0.7831325301204819 , F1-score =  0.6903089887640449 , ROC-AUC =  0.8370406801153747
AA_MutSig            :: Accuracy =  0.6006884681583476 , F1-score =  0.5045759609517999 , ROC-AUC =  0.6452008491039183
AA only              :: Accuracy =  0.5960167199409885 , F1-score =  0.500456065673457 , ROC-AUC =  0.6366123502241573
MutSig only          :: Accuracy =  0.578067371526924 , F1-score =  0.4417696811971373 , ROC-AUC =  0.5875954612252765


The full model (AA + Cons + MutSig + Q(SASA)) reaches 82.5% accuracy on the holdout set.
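Accuracy is consistently higher than the F1-score in the results above, which is expected when the damaging class is the minority. The toy example below (made-up labels, hypothetical helper `accuracy_and_f1`) shows how the two metrics are derived from the same confusion counts.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and binary F1 (positive class = 1) from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# 3 damaging and 7 neutral variants; one false negative and one false positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)
```

The many correctly predicted neutral variants raise accuracy but do not enter the F1-score, which only considers the damaging class.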

Using the 'no-core' predictors:

In [28]:
with open('setup/AA_Cons_MutSig_QSASA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'DMSexp', 'same proteins', 'AA + Cons + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57], ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'DMSexp', 'same proteins', 'AA + Cons + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'DMSexp', 'same proteins', 'AA + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'DMSexp', 'same proteins', 'AA + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'DMSexp', 'same proteins', 'AA only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'DMSexp', 'same proteins', 'MutSig only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')


AA_Cons_MutSig_QSASA :: Accuracy =  0.7115810179493484 , F1-score =  0.5666789804211304 , ROC-AUC =  0.7487479440384232
AA_Cons_MutSig       :: Accuracy =  0.7241209736906811 , F1-score =  0.4595375722543353 , ROC-AUC =  0.6810337133896484
AA_MutSig_QSASA      :: Accuracy =  0.6528153430046717 , F1-score =  0.5866510538641687 , ROC-AUC =  0.7168099086945
AA_MutSig            :: Accuracy =  0.6188836980575363 , F1-score =  0.3541666666666667 , ROC-AUC =  0.5762793227755645
AA only              :: Accuracy =  0.6284730759773789 , F1-score =  0.36592530423835506 , ROC-AUC =  0.5825257775422197
MutSig only          :: Accuracy =  0.6437177280550774 , F1-score =  0.28160634605850277 , ROC-AUC =  0.5636135317860969


## Independent benchmark

1. Loss-of-function and Functional variants as annotated by the BRCA1 DMS experiment ([Findlay et al. Nature 2018](https://www.nature.com/articles/s41586-018-0461-z))

In [29]:
with open('setup/AA_Cons_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('../EVmutation/human_protein_predictions/BRCA_DMS_annotated.pickle', 'rb') as r:
    X_test, Y_test = pickle.load(r)
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'DMSexp', 'BRCA1 DMS', 'AA + Cons + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57], ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'DMSexp', 'BRCA1 DMS', 'AA + Cons + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'DMSexp', 'BRCA1 DMS', \
                  'AA + MutSig + Q(SASA)', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'DMSexp', 'BRCA1 DMS', 'AA + MutSig', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'DMSexp', 'BRCA1 DMS', 'AA only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'DMSexp', 'BRCA1 DMS', 'MutSig only', 'all variants')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

# no-core
print('\nno-core predictors:\n')
with open('setup/AA_Cons_MutSig_QSASA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig_QSASA :: Accuracy = ', model.score( X_test, Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test )[:, 1]))
roc = getROCcurve( model, X_test, Y_test, 'DMSexp', 'BRCA1 DMS', 'AA + Cons + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_Cons_MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_Cons_MutSig       :: Accuracy = ', model.score( X_test[:, 0:57], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:57], ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:57] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:57], Y_test, 'DMSexp', 'BRCA1 DMS', 'AA + Cons + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig_QSASA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig_QSASA      :: Accuracy = ', model.score( X_test[:, [i for i in range(54)] + [57]], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, [i for i in range(54)] + [57]] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]))
roc = getROCcurve( model, X_test[:, [i for i in range(54)] + [57]], Y_test, 'DMSexp', 'BRCA1 DMS', 'AA + MutSig + Q(SASA)', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA_MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA_MutSig            :: Accuracy = ', model.score( X_test[:, 0:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:54], Y_test, 'DMSexp', 'BRCA1 DMS', 'AA + MutSig', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/AA/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('AA only              :: Accuracy = ', model.score( X_test[:, 0:40], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 0:40] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 0:40] )[:, 1]))
roc = getROCcurve( model, X_test[:, 0:40], Y_test, 'DMSexp', 'BRCA1 DMS', 'AA only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')

with open('setup/MutSig/GBM-no-core.pickle', 'rb' ) as r:
    model = pickle.load( r )
print('MutSig only          :: Accuracy = ', model.score( X_test[:, 41:54], Y_test ), \
      ", F1-score = ", sklearn.metrics.f1_score(Y_test, model.predict( X_test[:, 41:54] ), average='binary'), \
      ", ROC-AUC = ", sklearn.metrics.roc_auc_score(Y_test, model.predict_proba( X_test[:, 41:54] )[:, 1]))
roc = getROCcurve( model, X_test[:, 41:54], Y_test, 'DMSexp', 'BRCA1 DMS', 'MutSig only', 'no-core')
roc.to_csv('roc_curves_all.csv', header=True, index=False, mode = 'a')


AA_Cons_MutSig_QSASA :: Accuracy =  0.7186440677966102 , F1-score =  0.3821339950372209 , ROC-AUC =  0.6689726605457661
AA_Cons_MutSig       :: Accuracy =  0.6915254237288135 , F1-score =  0.20869565217391303 , ROC-AUC =  0.6069729674432879
AA_MutSig_QSASA      :: Accuracy =  0.6259887005649718 , F1-score =  0.4582651391162029 , ROC-AUC =  0.651754430832971
AA_MutSig            :: Accuracy =  0.6971751412429379 , F1-score =  0.5281690140845071 , ROC-AUC =  0.7274654100918135
AA only              :: Accuracy =  0.7310734463276836 , F1-score =  0.6020066889632106 , ROC-AUC =  0.7678575995498835
MutSig only          :: Accuracy =  0.6474576271186441 , F1-score =  0.42857142857142855 , ROC-AUC =  0.6238523311424261

no-core predictors:

AA_Cons_MutSig_QSASA :: Accuracy =  0.6937853107344633 , F1-score =  0.31043256997455476 , ROC-AUC =  0.6902700698191863
AA_Cons_MutSig       :: Accuracy =  0.672316384

2. Envision mutations

Here we only use the predicted probabilities, as Envision does not provide labels but rather a score. We will investigate the correlation between Envision scores and our predicted damaging probabilities.

In [30]:
with open('setup/AA_Cons_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
with open('../Envision/Envision_ready_unique.pickle', 'rb') as r:
    X_test, Y_test = pickle.load(r)



In [31]:
envision = pd.read_csv('../Envision/Envision_ready.tsv', sep = "\t")
envision['AA_Cons_MutSig_QSASA_p'] = model.predict_proba( X_test )[:, 1]

with open('setup/AA_Cons_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
envision['AA_Cons_MutSig_p'] = model.predict_proba( X_test[:, 0:57] )[:, 1]

with open('setup/AA_MutSig_QSASA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
envision['AA_MutSig_QSASA_p'] = model.predict_proba( X_test[:, [i for i in range(54)] + [57]] )[:, 1]

with open('setup/AA_MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
envision['AA_MutSig_p'] = model.predict_proba( X_test[:, 0:54] )[:, 1]

with open('setup/AA/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
envision['AA_p'] = model.predict_proba( X_test[:, 0:40] )[:, 1]

with open('setup/MutSig/GBM.pickle', 'rb' ) as r:
    model = pickle.load( r )
envision['MutSig_p'] = model.predict_proba( X_test[:, 41:54] )[:, 1]

envision



Unnamed: 0.1,Unnamed: 0,X1,id2,AA1,AA2,position,Uniprot,WT_Mut,Variant,AA1_polarity,...,PhyloP_0,PhyloP_1,protein,Q(SASA),AA_Cons_MutSig_QSASA_p,AA_Cons_MutSig_p,AA_MutSig_QSASA_p,AA_MutSig_p,AA_p,MutSig_p
0,8323,19082162,P04035_P439A,P,A,439,P04035,PA,P439A,Special,...,4.0,4.0,P04035,1.0323,0.618426,0.195662,0.606245,0.541139,0.606757,0.408901
1,8329,19082168,P04035_P439H,P,H,439,P04035,PH,P439H,Special,...,4.0,-0.5,P04035,1.0323,0.098062,0.110471,0.592379,0.700575,0.650552,0.678710
2,8332,19082171,P04035_P439L,P,L,439,P04035,PL,P439L,Special,...,4.0,-0.5,P04035,1.0323,0.226643,0.150819,0.728743,0.789619,0.671871,0.712546
3,8336,19082175,P04035_P439R,P,R,439,P04035,PR,P439R,Special,...,4.0,-0.5,P04035,1.0323,0.145746,0.185445,0.771501,0.782489,0.686362,0.685450
4,8337,19082176,P04035_P439S,P,S,439,P04035,PS,P439S,Special,...,4.0,4.0,P04035,1.0323,0.387892,0.295273,0.441203,0.385010,0.555850,0.438892
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8291,4610,3660589,Q9H3S4_S243N,S,N,243,Q9H3S4,SN,S243N,Polar,...,0.5,4.0,Q9H3S4,1.2412,0.052334,0.085340,0.080482,0.420180,0.452736,0.297807
8292,4613,3660592,Q9H3S4_S243R,S,R,243,Q9H3S4,SR,S243R,Polar,...,4.0,5.0,Q9H3S4,1.2412,0.349461,0.246390,0.278748,0.827931,0.586168,0.474273
8293,4613,3660592,Q9H3S4_S243R,S,R,243,Q9H3S4,SR,S243R,Polar,...,2.0,0.5,Q9H3S4,1.2412,0.248722,0.188052,0.158618,0.114509,0.586168,0.427768
8294,4613,3660592,Q9H3S4_S243R,S,R,243,Q9H3S4,SR,S243R,Polar,...,4.0,5.0,Q9H3S4,1.2412,0.428038,0.196191,0.307977,0.767870,0.586168,0.434599


In [32]:
envision.to_csv('../Envision/Envision_DMSexppredicted.tsv', sep = "\t")
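One way the planned comparison could be carried out is with a rank (Spearman) correlation, which does not assume the Envision score and the predicted probability are on the same scale. The sketch below is a standard-library illustration on made-up numbers; the variable names are hypothetical and the column holding the Envision score in the saved table is not assumed here.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average position of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy numbers constructed so that the two rankings are exactly reversed.
envision_score = [0.95, 0.80, 0.60, 0.30]
predicted_p = [0.10, 0.25, 0.40, 0.90]
rho = spearman(envision_score, predicted_p)
print(rho)
```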

# Feature importance

Use the [SHAP (SHapley Additive exPlanations)](https://github.com/slundberg/shap) library.

In [33]:
# function to perform SHAP analysis and output the explanatory matrix

def doSHAP(model_file, data_file, out_file, column_selection, feature_names):
    """
    automatic routine to pass a dataset through a model and output the SHAP explanatory matrix
    arg:
      model_file: filepath to sklearn model pickle
      data_file: filepath to dataset (X, Y) pickle
      out_file: filepath to output CSV where the SHAP matrix will be written out
      column_selection: list of column indices to be included in the feature matrix X for interpretation
      feature_names: list of feature names. Use as column names of the output matrix
    """
    with open(model_file, 'rb' ) as r:
        model = pickle.load( r )
    with open(data_file, 'rb') as r:
        X, Y, scoreset = pickle.load(r)
    X = X[:, column_selection]    
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    out = pd.DataFrame(shap_values.values, columns = [feature_names[i] for i in column_selection])
    out.to_csv(out_file, index=False)
    return None

In [34]:
# feature names; their order must match the column order of X (built in DMSexp_prepare_data.ipynb)
feature_cols = []
AAs = list('ACDEFGHIKLMNPQRSTVWY')
feature_cols +=  ['WT_AA_' + aa for aa in AAs]
feature_cols +=  ['MUT_AA_' + aa for aa in AAs]
bs = list('AGCT')
feature_cols +=  ['MutSig_-1_' + b for b in bs]
feature_cols +=  ['MutSig_+1_' + b for b in bs]
subs = ['C>A', 'C>G', 'C>T', 'T>A', 'T>C', 'T>G']
feature_cols +=  ['MutSig_' + m for m in subs] 
feature_cols += ['PhyloP_0', 'PhyloP_-1', 'PhyloP_1', 'Q(SASA)']


In [35]:
# explain the model's predictions using SHAP
doSHAP(model_file = 'setup/AA_Cons_MutSig_QSASA/GBM.pickle', data_file = 'data/test_set.pickle', \
       out_file = "GBM-AA_Cons_MutSig_QSASA_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(58)])
doSHAP(model_file = 'setup/AA_MutSig_QSASA/GBM.pickle', data_file = 'data/test_set.pickle', \
       out_file = "GBM-AA_MutSig_QSASA_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(54)] + [57])
doSHAP(model_file = 'setup/AA_Cons_MutSig/GBM.pickle', data_file = 'data/test_set.pickle', \
       out_file = "GBM-AA_Cons_MutSig_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(57)])
doSHAP(model_file = 'setup/AA_MutSig/GBM.pickle', data_file = 'data/test_set.pickle', \
       out_file = "GBM-AA_MutSig_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(54)])
doSHAP(model_file = 'setup/AA/GBM.pickle', data_file = 'data/test_set.pickle', \
       out_file = "GBM-AA_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(40)])
doSHAP(model_file = 'setup/MutSig/GBM.pickle', data_file = 'data/test_set.pickle', \
       out_file = "GBM-MutSig_test-set_shap-values.csv", feature_names = feature_cols, \
       column_selection = [i for i in range(41, 54)])  # 41:54 matches the slice the MutSig model was trained on
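The SHAP matrices written above contain one value per variant and per one-hot column. A common way to summarise them, sketched below under that assumption, is the mean absolute SHAP value per column, summed over the columns belonging to the same original feature. The helper names and the toy matrix are illustrative; summing over one-hot columns is a heuristic rather than something prescribed by the SHAP library.

```python
def group_of(name):
    """Collapse one-hot columns such as 'WT_AA_A' into their block ('WT_AA')."""
    for prefix in ("WT_AA", "MUT_AA", "MutSig", "PhyloP"):
        if name.startswith(prefix):
            return prefix
    return name

def mean_abs_by_group(shap_rows, feature_names, group_of):
    """Global importance: mean |SHAP| per column, summed within each feature group."""
    n = len(shap_rows)
    totals = {}
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_rows) / n
        group = group_of(name)
        totals[group] = totals.get(group, 0.0) + mean_abs
    return totals

# Toy SHAP matrix: 2 variants x 4 columns (values are made up).
feature_names = ["WT_AA_A", "WT_AA_C", "MUT_AA_A", "Q(SASA)"]
shap_rows = [
    [0.2, -0.1, 0.0, 0.5],
    [-0.4, 0.1, 0.2, -0.3],
]

importance = mean_abs_by_group(shap_rows, feature_names, group_of)
print(importance)
```

In practice the rows would be read from one of the CSV files written by `doSHAP`, whose header supplies the feature names.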