# W207 Summer 2017 Final Project

## Personalized Medicine: Redefining Cancer Treatment



#### Matt Shaffer https://github.com/planetceres 
#### Kaggle Competition: https://www.kaggle.com/c/msk-redefining-cancer-treatment

According to [discussion boards](https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35810#202604) on Kaggle, the classes we are trying to predict appear to be as follows:

1. Likely Loss-of-function
2. Likely Gain-of-function
3. Neutral
4. Loss-of-function
5. Likely Neutral
6. Inconclusive
7. Gain-of-function
8. Likely Switch-of-function
9. Switch-of-function
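
For reference while reading predictions, the list above can be kept as a small lookup. This is only a sketch: the labels come from the discussion thread rather than from the competition organizers, and the names `CLASS_LABELS` and `class_label` are hypothetical helpers that do not appear elsewhere in this notebook.

```python
# Class id -> label, as reported on the Kaggle discussion board (unofficial).
CLASS_LABELS = {
    1: "Likely Loss-of-function",
    2: "Likely Gain-of-function",
    3: "Neutral",
    4: "Loss-of-function",
    5: "Likely Neutral",
    6: "Inconclusive",
    7: "Gain-of-function",
    8: "Likely Switch-of-function",
    9: "Switch-of-function",
}


def class_label(class_id):
    """Return the human-readable label for a class id (hypothetical helper)."""
    return CLASS_LABELS[int(class_id)]


print(class_label(7))
```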


#### Dependencies

In [762]:
import os
import time
import glob
import re
import pandas as pd
import numpy as np
import scipy.sparse as sps
import Bio

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import explained_variance_score
from sklearn.pipeline import make_pipeline

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from keras.layers import Dropout
from keras.models import model_from_json
from keras.callbacks import ModelCheckpoint, EarlyStopping

%matplotlib inline
import matplotlib.pyplot as plt

import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
from itertools import islice

In [2]:
model_version = '001'

In [3]:
model_name = 'model_' + model_version

In [4]:
data_directory = '/Users/Reynard/dropbox/Data/kaggle/Personalized Medicine'
model_directory = data_directory + '/saved_models'

In [5]:
model_path = os.path.join(model_directory, model_name)

In [6]:
# Create model directory if it does not exist
if not os.path.isdir(model_directory):
    print("creating directory for saved models")
    os.mkdir(model_directory)

In [7]:
# Load model to resume training or perform inference
def load_model_from_json(model_path):
    with open(model_path + '.json') as f:
        model = model_from_json(f.read())
    model.load_weights(model_path + '.h5')
    #model.compile(optimizer=rmsprop, loss='mse')
    return model

In [8]:
from keras.models import load_model
# Load model to resume training or perform inference
def load_recent_model(model_path):
    # Locate the most recent checkpoint in the folder to resume training from
    model_recent = max(glob.iglob(model_path + '*.hdf5'), key=os.path.getctime)
    print("Using model at checkpoint: {}".format(model_recent))
    #model = model_from_json(open(model_path + '.json').read())
    model = load_model(model_recent)
    #model.compile(optimizer=rmsprop, loss='mse')
    return model

In [9]:
# Save model
def save_model_to_json(m, model_path):    
    json_string = m.model.to_json()
    with open(model_path + '.json', 'w') as f:
        f.write(json_string)
    m.model.save_weights(model_path + '.h5', overwrite=True)

In [10]:
def print_op_str(data_type):
    p = "Done processing " + data_type + " data in {:.2f} seconds"
    return p

In [11]:
def print_blank(n):
    print(" "*n, end="\r")

### Data Overview

In [12]:
train_variants = pd.read_csv(data_directory + "/input/training_variants")
test_variants = pd.read_csv(data_directory + "/input/test_variants")
train_text = pd.read_csv(data_directory + "/input/training_text", sep=r"\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
test_text = pd.read_csv(data_directory + "/input/test_text", sep=r"\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])

In [31]:
# create dataset with all variants
all_variants = pd.concat([train_variants, test_variants], ignore_index=True)

The test set has no labels and is used only for score submission. This is a challenge: the labeled training sample is small, so it will be hard to learn the population properties needed for inference.

In [13]:
# Test set has no labels and is used only for score submission
print(list(train_variants.columns))
print(list(test_variants.columns))

['ID', 'Gene', 'Variation', 'Class']
['ID', 'Gene', 'Variation']


In addition to the gene variant data, we also have a text corpus for each example that provides the clinical evidence that human experts used to classify the genetic mutations. This is essentially an unstructured feature set, and our first task will be to map this noisy data to a set of features that can more easily be used for prediction. 

In [14]:
print(list(train_text.columns))
print(list(test_text.columns))

['ID', 'Text']
['ID', 'Text']


In [15]:
# Merge the text with the variant data, and separate the target values (`Class`) from the features
train = pd.merge(train_variants, train_text, how='left', on='ID')
y_train = train['Class'].values
X_train = train.drop('Class', axis=1)

In [16]:
# Do the same thing with the test data, but note that there are no classes to separate as targets
X_test = pd.merge(test_variants, test_text, how='left', on='ID')
test_index = X_test['ID'].values

In [17]:
# Create mini data sets for model building
train_mini = train.sample(frac=0.05)
y_train_mini = train_mini['Class'].values
X_train_mini = train_mini.drop('Class', axis=1)
X_test_mini = X_test.sample(frac=0.05)
test_index_mini = X_test_mini['ID'].values

# Create mini dev set for model building, drawn from rows not already in train_mini
dev_mini = train.drop(train_mini.index).sample(frac=0.05)
y_dev_mini = dev_mini['Class'].values
X_dev_mini = dev_mini.drop('Class', axis=1)

In [18]:
X_train_mini.shape

(166, 4)

### Transform Variation Sequences

In [102]:
# Get variants with `null` in the text
# these are likely to be generated samples since they only appear in the test set
def matches_null(data):
    idx = []
    for index, v in data.iterrows():
        if 'null' in v['Variation']:
            idx.append(index)
    return idx, data.drop(idx)

In [88]:
# Get variants where gene text matches variant text
def matches_gene_variant(data):
    idx = []
    for index, v in data.iterrows():
        if (v['Gene'] in v['Variation']) and ('Fusion' not in v['Variation']):
            idx.append(index)
    return idx, data.drop(idx)

In [73]:
# Get variants with `Fusion` in the text
def matches_fusion(data):
    idx = []
    for index, v in data.iterrows():
        if ('Fusion' in v['Variation']) and ('Fusions' not in v['Variation']):
            idx.append(index)
    return idx, data.drop(idx)

In [93]:
# Get variants with `Exon` in the text
def matches_exon(data):
    idx = []
    for index, v in data.iterrows():
        if 'Exon' in v['Variation']:
            idx.append(index)
    return idx, data.drop(idx)

In [80]:
# Get variants that match a unique variation type
def matches_type(data):
    # Get all unique variations with no digits in 'Variation'
    text_type = []
    for v in data['Variation'].unique():
        # Keep only variation strings that contain no digits
        if not any(c.isdigit() for c in v):
            text_type.append(v)
    type_tokens = list(set(text_type))
    
    idx = []
    for index, v in data.iterrows():
        if any(x in v['Variation'] for x in type_tokens):
            idx.append(index)
    return idx, data.drop(idx)

In [81]:
# Get variants with `_` in the text
def matches_underscore(data):
    idx = []
    for index, v in data.iterrows():
        if '_' in v['Variation']:
            idx.append(index)
    return idx, data.drop(idx)

In [85]:
# Get variants with `*` in the text
def matches_asterisk(data):
    idx = []
    for index, v in data.iterrows():
        if '*' in v['Variation']:
            idx.append(index)
    return idx, data.drop(idx)

In [90]:
# Get variants whose text contains an action keyword
def matches_actions(data):
    # Get all variations with an action
    action_match = ['del', 'delins', 'dup', 'ins', 'splice', 'trunc', 'fs']
    
    idx = []
    for index, v in data.iterrows():
        if any(x in v['Variation'] for x in action_match):
            idx.append(index)
    return idx, data.drop(idx)

In [136]:
# Get variants that match regex that checks for ending in a digit, 
# instead of amino acid (i.e. letter)
def matches_end_on_position(data):
    idx = []
    for index, v in data.iterrows():
        m = re.search(r'\d+$', v['Variation'])
        if m is not None:
            idx.append(index)
    return idx, data.drop(idx)

In [174]:
# Get variants that match regex that checks for starting in a series of digits, 
# instead of amino acid (i.e. letter)
def matches_start_on_position(data):
    idx = []
    for index, v in data.iterrows():
        m = re.search(r'^[0-9]+', v['Variation'])
        if m is not None:
            idx.append(index)
    return idx, data.drop(idx)

In [137]:
# Get variants that match regex that checks for ending in a capital letter, indicating amino acid
def matches_end_on_amino(data):
    idx = []
    for index, v in data.iterrows():
        m = re.search(r'[A-Z]+$', v['Variation'])
        if m is not None:
            idx.append(index)
    return idx, data.drop(idx)

In [199]:
# Get variants that match regex that checks for starting in a capital letter, indicating amino acid
def matches_start_on_amino(data):
    idx = []
    for index, v in data.iterrows():
        m = re.search(r'^[A-Z]+', v['Variation'])
        if m is not None:
            idx.append(index)
    return idx, data.drop(idx)

In [188]:
# Get variants that are left over after initial grouping
def matches_none(data):
    return list(data.index.values)

In [391]:
def variant_groups(data):
    # First get indices of variants with `null` in the text
    # these are likely to be generated samples since they only appear in the test set
    idx0, datax = matches_null(data)
    # Get indices of variants where gene text matches variant text
    idx1, datax = matches_gene_variant(datax)
    # Get indices of variants with `Fusion` in the text
    idx2, datax = matches_fusion(datax)
    # Get indices of variants with `Exon` in the text
    idx3, datax = matches_exon(datax)
    # Get indices of variants that match a unique variation type
    idx4, datax = matches_type(datax)
    # Get indices of variants with `_` in the text
    idx5, datax = matches_underscore(datax)
    # Get indices of variants whose text contains an action keyword
    idx6, datax = matches_actions(datax)
    # Get indices of variants with `*` in the text
    idx7, datax = matches_asterisk(datax)
    # Get variants that match regex that checks for ending in a digit (position) instead of an amino acid
    idx8, datax = matches_end_on_position(datax)
    # Get variants that match regex that checks for starting in a series of digits instead of an amino acid
    idx9, datax = matches_start_on_position(datax)
    # Get variants that match regex that checks for ending in a capital letter, indicating amino acid
    idx10, datax = matches_end_on_amino(datax)
    # Get variants that match regex that checks for starting in a capital letter, indicating amino acid
    idx11, datax = matches_start_on_amino(datax)
    # Get indices of all other variants
    idx12 = matches_none(datax)
    
    
    groups = {
        'has_null': idx0,
        'gv': idx1,
        'fusion': idx2,
        'exon': idx3,
        'type': idx4,
        'underscore': idx5,
        'actions': idx6,
        'asterisk': idx7,
        'end_digit': idx8,
        'start_digit': idx9,
        'end_amino': idx10,
        'start_amino': idx11,
        'outliers': idx12
    }
    
    return groups
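
The cascade in `variant_groups` gives each row to the first matcher that claims it, so the order of the checks matters. As a rough sketch of the same ordering applied to a single string (not the notebook's implementation), the function below mirrors the tests in sequence. `variant_group` and its `type_tokens` argument are hypothetical, and the `BCR-ABL1 Fusion` and `E746_A750del` inputs are illustrative strings rather than rows taken from the output in this notebook.

```python
import re

ACTIONS = ('del', 'delins', 'dup', 'ins', 'splice', 'trunc', 'fs')


def variant_group(gene, variation, type_tokens=()):
    """Return the name of the first group whose test matches `variation`."""
    if 'null' in variation:
        return 'has_null'
    if gene in variation and 'Fusion' not in variation:
        return 'gv'
    if 'Fusion' in variation and 'Fusions' not in variation:
        return 'fusion'
    if 'Exon' in variation:
        return 'exon'
    # type_tokens stands in for the digit-free variation strings
    # that matches_type collects from the data
    if any(t in variation for t in type_tokens):
        return 'type'
    if '_' in variation:
        return 'underscore'
    if any(a in variation for a in ACTIONS):
        return 'actions'
    if '*' in variation:
        return 'asterisk'
    if re.search(r'\d+$', variation):
        return 'end_digit'
    if re.search(r'^[0-9]+', variation):
        return 'start_digit'
    if re.search(r'[A-Z]+$', variation):
        return 'end_amino'
    if re.search(r'^[A-Z]+', variation):
        return 'start_amino'
    return 'outliers'


print(variant_group('CBL', 'W802*'))
print(variant_group('CBL', 'Q249E'))
```

Because `asterisk` is tested before the regex groups, a stop-gain such as `W802*` never reaches `end_amino`, while a plain substitution such as `Q249E` does.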

In [311]:
# Disassemble group1
def deconstruct_null(data):
    idx = []
    elements = []
    for index, v in data.iterrows():
        if v['Variation'].startswith('null'):
            #m = re.split(r'(\d+|\D+)', v['Variation']) # also extract integers?
            m = re.split(r'\d+', v['Variation'])
            if len(m) == 2:
                elements.append(m)
                idx.append(index)
    colnames = ['amino_state0', 'amino_state1']
            
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)

In [312]:
# Disassemble group2
def deconstruct_gv_match(data):
    idx = []
    elements = []
    for index, v in data.iterrows():
        elements.append([v['Variation'], 1])
        idx.append(index)
    colnames = ['protein_token', 'protein_token_bool']
        
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)

In [313]:
# Disassemble group3
def deconstruct_fusion(data):
    idx = []
    elements = []
    for index, v in data.iterrows():
        if v['Variation'].endswith('Fusion'):
            m = re.split(r'\W+', v['Variation'])
            elements.append([m[0], m[1], 1])
            idx.append(index)
    colnames = ['fusion0', 'fusion1', 'fusion_bool']
            
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)

In [314]:
# Disassemble group4
def deconstruct_exon(data):
    idx = []
    elements = []
    for index, v in data.iterrows():
        if v['Variation'].startswith('Exon'):
            m = re.split(r'\W+', v['Variation'])
            if len(m) == 4:
                m[2] = m[2] + m[3]
                m = m[0:3]
            elements.append([m[1], m[2], 1])
            idx.append(index)
    colnames = ['exon_n', 'exon_action', 'exon_bool']
            
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)

In [315]:
# Disassemble group5
def deconstruct_type(data):
    idx = []
    elements = []
    for index, v in data.iterrows():
        elements.append([v['Variation'], 1])
        idx.append(index)
    colnames = ['type_token', 'type_token_bool']
        
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)

In [389]:
# Disassemble group6
def deconstruct_underscore(data):
    idx = []
    elements = []
    action_match = ['del', 'delins', 'dup', 'ins', 'splice', 'trunc', 'fs']

    for index, v in data.iterrows():
        m = re.split(r'([a-z]+)', v['Variation'])
        # Reset per row so an index left over from an earlier row is never reused
        action_idx = None
        for action in m:
            if action in action_match:
                action_idx = m.index(action)
        if (len(m) == 3) and (action_idx == 1):
            genes = m[0].split('_')
            if len(genes) == 2:
                elements.append([genes[0], genes[1], m[1], m[2], 1])
                idx.append(index)
    colnames = ['genes_state0', 'genes_state1', 'genes_action', 'genes_action_key', 'genes_action_bool']
            
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)

In [408]:
# Disassemble group7
def deconstruct_actions(data):
    idx = []
    elements = []
    action_match = ['del', 'delins', 'dup', 'ins', 'splice', 'trunc', 'fs']

    for index, v in data.iterrows():
        m = re.split(r'([a-z]+)', v['Variation'])
        # Reset per row so an index left over from an earlier row is never reused
        action_idx = None
        for action in m:
            if action in action_match:
                action_idx = m.index(action)
        if (len(m) == 3) and (action_idx == 1):
            amino = re.split(r'\d+', m[0])
            if len(amino) == 2:
                elements.append([amino[0], amino[1], m[1], m[2], 1])
                idx.append(index)
    colnames = ['amino_state0', 'amino_state1', 'amino_action', 'amino_action_key', 'amino_action_bool']
            
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)

In [461]:
# Disassemble group8
def deconstruct_standard(data):
    idx = []
    elements = []

    for index, v in data.iterrows():
        # For sequences ending in '*'
        if v['Variation'].endswith('*'):
            amino = re.split(r'\d+', v['Variation'])
    
            if len(amino) == 2:
                elements.append([amino[0], amino[1], 1])
                idx.append(index)
        
        # For normal Amino-Position-Amino sequences
        if re.search(r'[A-Z]+$', v['Variation']) is not None:
            amino = re.split(r'\d+', v['Variation'])
            
            if len(amino) == 2:
                elements.append([amino[0], amino[1], 1])
                idx.append(index)
        
        # For sequences with no ending amino
        if re.search(r'[0-9]+$', v['Variation']) is not None:
            amino = re.split(r'\d+', v['Variation'])
            if len(amino) == 2:
                elements.append([amino[0], amino[1], 1])
                idx.append(index)
                
    colnames = ['amino_state0', 'amino_state1', 'amino_standard_bool']
            
    return idx, data.drop(idx), pd.DataFrame(elements, columns=colnames, index=idx)
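
`deconstruct_standard` splits on the digits and discards them, and the commented-out line in `deconstruct_null` hints that extracting the integer was considered. As a sketch of what keeping the position could look like, the hypothetical helper below uses one regular expression with named groups. `SUBSTITUTION` and `parse_substitution` are assumptions for illustration and are not used elsewhere in the notebook.

```python
import re

# Reference amino acid(s), numeric position, then an optional
# substituted amino acid or '*' (stop codon).
SUBSTITUTION = re.compile(r'^(?P<ref>[A-Z]+)(?P<pos>\d+)(?P<alt>[A-Z]+|\*)?$')


def parse_substitution(variation):
    """Return (ref, position, alt) or None if the string does not fit."""
    m = SUBSTITUTION.match(variation)
    if m is None:
        return None
    # An absent alt becomes '' to mirror what re.split yields above
    return m.group('ref'), int(m.group('pos')), m.group('alt') or ''


print(parse_substitution('Q249E'))
print(parse_substitution('W802*'))
print(parse_substitution('Deletion'))
```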

In [601]:
def variant_subgroups(data, groups):
    # Transform each group into subgroups of elements
    idx0, datax, df0 = deconstruct_null(data.loc[groups['has_null']])
    idx1, datax, df1 = deconstruct_gv_match(data.loc[groups['gv']])
    idx2, datax, df2 = deconstruct_fusion(data.loc[groups['fusion']])
    idx3, datax, df3 = deconstruct_exon(data.loc[groups['exon']])
    idx4, datax, df4 = deconstruct_type(data.loc[groups['type']])
    idx5, datax, df5 = deconstruct_underscore(data.loc[groups['underscore']])
    idx6, datax, df6 = deconstruct_actions(data.loc[groups['actions']])
    idx7, datax, df7 = deconstruct_standard(data.loc[groups['asterisk']])
    idx8, datax, df8 = deconstruct_standard(data.loc[groups['end_digit']])
    idx9, datax, df9 = deconstruct_standard(data.loc[groups['start_digit']])
    idx10, datax, df10 = deconstruct_standard(data.loc[groups['end_amino']])
    idx11, datax, df11 = deconstruct_standard(data.loc[groups['start_amino']])
    idx12, _, df12 = deconstruct_type(data.loc[groups['outliers']])
    
    subgroups = {
        'has_null': idx0,
        'gv': idx1,
        'fusion': idx2,
        'exon': idx3,
        'type': idx4,
        'underscore': idx5,
        'actions': idx6,
        'asterisk': idx7,
        'end_digit': idx8,
        'start_digit': idx9,
        'end_amino': idx10,
        'start_amino': idx11,
        'outliers': idx12
    }
    
    data = data.join(df1, how='outer', lsuffix='_1L', rsuffix='_1R')
    data = data.join(df2, how='outer', lsuffix='_2L', rsuffix='_2R')
    data = data.join(df3, how='outer', lsuffix='_3L', rsuffix='_3R')
    data = data.join(df4, how='outer', lsuffix='_4L', rsuffix='_4R')
    data = data.join(df5, how='outer', lsuffix='_5L', rsuffix='_5R')
    data = data.join(df6, how='outer', lsuffix='_6L', rsuffix='_6R')
    data = data.join(df7, how='outer', lsuffix='_7L', rsuffix='_7R')
    data = data.join(df8, how='outer', lsuffix='_8L', rsuffix='_8R')
    data = data.join(df9, how='outer', lsuffix='_9L', rsuffix='_9R')
    data = data.join(df10, how='outer', lsuffix='_10L', rsuffix='_10R')
    data = data.join(df11, how='outer', lsuffix='_11L', rsuffix='_11R')
    data = data.join(df12, how='outer', lsuffix='_12L', rsuffix='_12R')
    
    amino_col = ['amino_state0_7L',  
                 'amino_state0_7R', 
                 'amino_state0_9L',  
                 'amino_state0_9R',  
                 'amino_state0_11L',  
                 'amino_state0_11R']
    
    data['amino_state0'] = np.nan
    for col in amino_col:
        data['amino_state0'].fillna(data[col], inplace=True)
        
    amino1_col = ['amino_state1_7L', 
             'amino_state1_7R',
             'amino_state1_9L', 
             'amino_state1_9R', 
             'amino_state1_11L', 
             'amino_state1_11R']
    
    data['amino_state1'] = np.nan
    for col in amino1_col:
        data['amino_state1'].fillna(data[col], inplace=True)
        
    type_col = ['type_token_12L', 'type_token_12R']
    data['type_token'] = np.nan
    for col in type_col:
        data['type_token'].fillna(data[col], inplace=True)
        
    type_bool_col = ['type_token_bool_12L', 'type_token_bool_12R']
    data['type_token_bool'] = np.nan
    for col in type_bool_col:
        data['type_token_bool'].fillna(data[col], inplace=True)
        
    amino_standard_bool_col = ['amino_standard_bool_8L', 'amino_standard_bool_8R',
                'amino_standard_bool_10L', 'amino_standard_bool_10R']
    data['amino_standard_bool_token'] = np.nan
    for col in amino_standard_bool_col:
        data['amino_standard_bool_token'].fillna(data[col], inplace=True)
    
    drop_cols = amino_col + amino1_col + type_col + type_bool_col + amino_standard_bool_col
    for col in drop_cols:
        data.drop([col], axis=1, inplace=True)
    
    return subgroups, data


In [602]:
%%time
groups = variant_groups(all_variants)
subgroups, all_transformed = variant_subgroups(all_variants, groups)

CPU times: user 9.36 s, sys: 108 ms, total: 9.47 s
Wall time: 9.59 s


In [609]:
#all_transformed[all_transformed.columns[19:30]][0:15]
all_transformed[['Gene', 'Variation', 'amino_state0', 'amino_state1', 'amino_standard_bool_token', 'type_token', 'type_token_bool']][0:20]

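| | Gene | Variation | amino_state0 | amino_state1 | amino_standard_bool_token | type_token | type_token_bool |
|---|---|---|---|---|---|---|---|
| 0 | FAM58A | Truncating Mutations | | | | Truncating Mutations | 1.0 |
| 1 | CBL | W802* | W | * | 1.0 | | |
| 2 | CBL | Q249E | Q | E | 1.0 | | |
| 3 | CBL | N454D | N | D | 1.0 | | |
| 4 | CBL | L399V | L | V | 1.0 | | |
| 5 | CBL | V391I | V | I | 1.0 | | |
| 6 | CBL | V430M | V | M | 1.0 | | |
| 7 | CBL | Deletion | | | | Deletion | 1.0 |
| 8 | CBL | Y371H | Y | H | 1.0 | | |
| 9 | CBL | C384R | C | R | 1.0 | | |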


### Amino Table of Properties

In [610]:
amino_table = pd.read_csv(data_directory + '/supplementary/amino_properties.csv')

| | Amino acid | Abbreviations | Letter | Hydropathy (3 classes) | Volume (5 classes) | Chemical (7 classes) | Physicochemical (11 classes) | Charge (3 classes) | Polarity (2 classes) | Hydrogen donor or acceptor atom (4 classes) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alanine | Ala | A | hydrophobic (1) | very small (1) | aliphatic (1) | aliphatic (1) | uncharged(3) | nonpolar (2) | none (4) |
| 1 | Arginine | Arg | R | hydrophilic (3) | large (4) | basic (5) | basic (5) | positive charged (1) | polar (1) | donor (1) |
| 2 | Asparagine | Asn | N | hydrophilic (3) | small (2) | amide (7) | amide (2) | uncharged (3) | polar (1) | donor and acceptor (3) |
| 3 | Asparagine or aspartic acid | Asx | B | | | | | | | |
| 4 | Aspartic acid | Asp | D | hydrophilic (3) | small (2) | acidic (6) | acidic (6) | negative charged (2) | polar (1) | acceptor (2) |
| 5 | Cysteine | Cys | C | hydrophobic (1) | small (2) | sulfur (3) | sulfur (3) | uncharged (3) | nonpolar (2) | none (4) |
| 6 | Glutamine | Gln | Q | hydrophilic (3) | medium (3) | amide (7) | amide (2) | uncharged (3) | polar (1) | donor and acceptor (3) |
| 7 | Glutamic acid | Glu | E | hydrophilic (3) | medium (3) | acidic (6) | acidic (6) | negative charged (2) | polar (1) | acceptor (2) |
| 8 | Glycine | Gly | G | neutral (2) | very small (1) | aliphatic (1) | G (11) | uncharged (3) | nonpolar (2) | none (4) |
| 9 | Histidine | His | H | neutral (2) | medium (3) | basic (5) | basic (5) | positive charged (1) | polar (1) | donor and acceptor (3) |


In [611]:
# Create Lookup Table
amino_name_lookup = amino_table[amino_table.columns[:3]]
amino_name_lookup

| | Amino acid | Abbreviations | Letter |
|---|---|---|---|
| 0 | Alanine | Ala | A |
| 1 | Arginine | Arg | R |
| 2 | Asparagine | Asn | N |
| 3 | Asparagine or aspartic acid | Asx | B |
| 4 | Aspartic acid | Asp | D |
| 5 | Cysteine | Cys | C |
| 6 | Glutamine | Gln | Q |
| 7 | Glutamic acid | Glu | E |
| 8 | Glycine | Gly | G |
| 9 | Histidine | His | H |


In [639]:
# Create table of transformed variants along with their properties
variant_properties = all_transformed.merge(amino_table[amino_table.columns[2:]], how='left',
                 left_on='amino_state0', right_on='Letter', suffixes=('', '_'))
variant_properties = variant_properties.merge(amino_table[amino_table.columns[2:]], how='left',
                 left_on='amino_state1', right_on='Letter', suffixes=('_state0', '_state1'))
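
The two merges place the properties of the original amino acid (`_state0`) next to those of the substituted one (`_state1`). One feature this makes possible, although the notebook does not compute it, is whether a property changes across the substitution. The sketch below illustrates the idea with a hand-copied subset of the table above. `AMINO_PROPERTIES` and `property_changes` are hypothetical names, and only the hydropathy and charge columns are included.

```python
# Subset of the property table shown above: letter -> (hydropathy, charge).
AMINO_PROPERTIES = {
    'A': ('hydrophobic', 'uncharged'),
    'R': ('hydrophilic', 'positive charged'),
    'N': ('hydrophilic', 'uncharged'),
    'D': ('hydrophilic', 'negative charged'),
    'C': ('hydrophobic', 'uncharged'),
    'Q': ('hydrophilic', 'uncharged'),
    'E': ('hydrophilic', 'negative charged'),
    'G': ('neutral', 'uncharged'),
    'H': ('neutral', 'positive charged'),
}


def property_changes(state0, state1):
    """Compare original and substituted amino acids.

    Returns None when either letter is missing from the lookup,
    which is the analogue of the NaN rows a left merge produces.
    """
    p0 = AMINO_PROPERTIES.get(state0)
    p1 = AMINO_PROPERTIES.get(state1)
    if p0 is None or p1 is None:
        return None
    return {
        'hydropathy_changed': p0[0] != p1[0],
        'charge_changed': p0[1] != p1[1],
    }


print(property_changes('Q', 'E'))  # the Q249E variant from the table above
```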

In [724]:
variant_properties_data = pd.concat([variant_properties[variant_properties.columns[:1]],
                                pd.get_dummies(variant_properties[variant_properties.columns[1:]])], axis=1)

In [779]:
# separate table into training-test
train_variant_properties = variant_properties_data[:train_variants.shape[0]].drop('Class', axis=1).fillna(0)
test_variant_properties = variant_properties_data[train_variants.shape[0]:].reset_index(drop=True).drop('Class', axis=1).fillna(0)

In [780]:
train_variant_properties = np.array(train_variant_properties)
test_variant_properties = np.array(test_variant_properties)

In [782]:
test_variant_properties.shape

(5668, 10817)

In [648]:
pd.DataFrame(list(variant_properties.columns))

| | 0 |
|---|---|
| 0 | Class |
| 1 | Gene |
| 2 | ID |
| 3 | Variation |
| 4 | protein_token |
| 5 | protein_token_bool |
| 6 | fusion0 |
| 7 | fusion1 |
| 8 | fusion_bool |
| 9 | exon_n |


### PCA

In [783]:
def pca_metrics(s, n):
    explained_variance = s.explained_variance_ratio_.sum()
    #     print_op = "Explained variance of SVD with {} features: {}%"
    #     print(print_op.format(n, int(explained_variance * 100)))
    return explained_variance

In [792]:
def pca_fn(data_train, data_test=None, target=None, n_features=20, test_transform=False, plotting=False):
    t = time.time()
    print("Beginning pca on training data", end="\r")
    #svd = TruncatedSVD(n_features, algorithm='arpack')
    pca = PCA(n_features)
    pca_train = pca.fit_transform(data_train)
    print(print_op_str("train").format(time.time()-t), end="\r")
    
    if test_transform == True:
        print("Beginning pca on test data", end="\r")
        pca_test = pca.transform(data_test)
        print(print_op_str("test").format(time.time()-t), end="\r")
        print_blank(len(print_op_str("test")))
    else:
        pca_test = sps.csr_matrix((1,1))
        print_blank(len(print_op_str("train")))
        
    total_explained_variance = pca_metrics(pca, n_features)
        
    if plotting == True:
        data2D_pca = pca_train
        plt.scatter(data2D_pca[:,0], data2D_pca[:,1], c=target)
        plt.show()  
        
    return pca, pca_train, pca_test, total_explained_variance
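
`pca_metrics` sums `explained_variance_ratio_` over the retained components. To make that quantity concrete, here is a minimal NumPy sketch that computes the same ratios from the singular values of the centered data. The function name `explained_variance_ratio` and the synthetic matrix are assumptions for illustration only, and no result from the notebook's data is implied.

```python
import numpy as np


def explained_variance_ratio(X, n_components):
    """Share of total variance captured by each of the top components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # PCA operates on centered data
    # Singular values of the centered matrix, largest first
    s = np.linalg.svd(Xc, full_matrices=False, compute_uv=False)
    var = s ** 2  # proportional to the variance along each component
    return var[:n_components] / var.sum()


rng = np.random.RandomState(0)
# Synthetic data whose columns have very different scales
X = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
ratios = explained_variance_ratio(X, 2)
print(ratios, ratios.sum())
```

The summed ratio approaches 1 as `n_components` approaches the rank of the data, which is why it is a convenient check on how much information a 20-component projection keeps.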

In [793]:
features_n_pca = 20

In [1]:
pca_mini, pca_train_mini, pca_test_mini, tev_pca_mini = pca_fn(train_variant_properties, 
                                                 test_variant_properties, 
                                                 y_train, 
                                                 features_n_pca, 
                                                 test_transform=False, 
                                                 plotting=True)

In [796]:
input_shape = train_variant_properties.shape[1]
output_shape = len(train['Class'].unique())
batch_n = 32
EPOCHS_N = 10 #100
model_save_interval = 100

In [797]:
def model_hypothesis():
    model = Sequential()
    model.add(Dense(512, input_dim=input_shape, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(256, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(128, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(128, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(256, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(512, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(output_shape, kernel_initializer='normal', activation="softmax"))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [730]:
def model_resume():
    model = load_recent_model(model_path)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [731]:
# Callback for saving model and weights at n intervals in training
# https://keras.io/callbacks/#modelcheckpoint
weight_save_callback = ModelCheckpoint(model_path + '.{epoch:02d}-{loss:.2f}.hdf5', 
                                       monitor='loss', 
                                       verbose=0, 
                                       save_best_only=True, 
                                       mode='auto',
                                       period=model_save_interval # Interval (number of epochs) between checkpoints
                                      )
early_stopping = EarlyStopping(monitor='loss', min_delta=0, patience=50, verbose=1, mode='auto')
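
Both callbacks monitor the training `loss` rather than a validation loss. The stopping rule configured here (`patience=50`, `min_delta=0`) can be read as: stop once the monitored value has failed to improve for `patience` consecutive epochs. The function below is a simplified rendering of that rule in plain Python, not Keras's own code, and `epochs_until_stop` is a hypothetical name.

```python
def epochs_until_stop(losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the loss drops below the
    best value seen so far by more than `min_delta`.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


history = [1.0, 0.8, 0.81, 0.82, 0.79]
print(epochs_until_stop(history, patience=2))
print(epochs_until_stop(history, patience=3))
```

With `patience=50` and a run capped at `EPOCHS_N*10` epochs, the rule can only fire if the training loss stalls for half of the run.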

In [732]:
onehot = LabelEncoder()
onehot.fit(y_train)
y_enc = onehot.transform(y_train)

In [733]:
y_ind = np_utils.to_categorical(y_enc)
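
The two cells above first map class labels to consecutive integers and then expand them into indicator rows. A NumPy-only sketch of the same two steps is shown below. `one_hot` is a hypothetical helper, and the labels are made up for illustration.

```python
import numpy as np


def one_hot(labels):
    """Return (sorted unique classes, indicator matrix)."""
    # return_inverse gives, for each label, its index in `classes`
    classes, encoded = np.unique(labels, return_inverse=True)
    # Row i of the identity matrix is the indicator vector for class i
    return classes, np.eye(len(classes))[encoded]


classes, Y = one_hot([7, 1, 4, 1])
print(classes)
print(Y)
```

One consequence worth keeping in mind: the number of columns equals the number of distinct labels actually present, so a small sample that misses a class produces a narrower target matrix than the full training set.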

In [None]:
# Try to restore previous checkpoints to continue training
if os.path.isfile(model_path + '.h5') and os.path.isfile(model_path + '.json'):
    estimator = KerasClassifier(build_fn=model_resume, epochs=EPOCHS_N, batch_size=batch_n)
    model = model_resume()
else:
    estimator = KerasClassifier(build_fn=model_hypothesis, epochs=EPOCHS_N, batch_size=batch_n)
    model = model_hypothesis()

In [798]:
# Or create a new one
estimator = KerasClassifier(build_fn=model_hypothesis, epochs=EPOCHS_N, batch_size=batch_n)

In [799]:
%%time
start_time = time.time()
#estimator.fit(svd_train, y_ind, validation_split=0.05)
estimator.fit(np.array(train_variant_properties), y_ind, batch_size=batch_n, epochs=EPOCHS_N*10, callbacks=[weight_save_callback, early_stopping])
end_time = time.time()
print("Elapsed time: {:.2f} sec".format(end_time-start_time))
try: 
    save_model_to_json(estimator, model_path)
    print("Saved model and weights to disk")
except Exception as e:
    print(e)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100

KeyboardInterrupt: 

In [None]:
y_pred = estimator.predict_proba(test_variant_properties)

In [None]:
submission = pd.DataFrame(y_pred)
submission['id'] = test_index
submission.columns = ['class1', 'class2', 'class3', 'class4', 'class5', 'class6', 'class7', 'class8', 'class9', 'id']
submission.to_csv(data_directory + "/output/submission_" + str(int(time.time())) + ".csv",index=False)

### TFIDF

In [800]:
corpus_train = X_train['Text']
corpus_test = X_test['Text']

In [801]:
def sparse_metrics(transformer_type, sparse_mtx):
    print("Vocabulary length: {}".format(len(transformer_type.vocabulary_)))
    print('sparse matrix shape: {}'.format(sparse_mtx.shape))
    print('nonzero count: {}'.format(sparse_mtx.nnz))
    print('density (% nonzero): {:.2f}'.format((100.0 * sparse_mtx.nnz / (sparse_mtx.shape[0] * sparse_mtx.shape[1]))))

In [802]:
def tfidf_func(X, Y, ngrams=1, plotting=False):
    t = time.time()
    tfidf = TfidfVectorizer(stop_words = 'english', ngram_range=(1, ngrams), sublinear_tf=True, use_idf=True)
    tfidf_train = tfidf.fit_transform(X)
    print(print_op_str("train").format(time.time()-t), end="\r")
    tfidf_test = tfidf.transform(Y)
    print(print_op_str("test").format(time.time()-t), end="\r")
    print_blank(len(print_op_str("test")))
    
    print_op = "Shape: {}\nNon-zero mean: {}\nNon-zero median: {}"
    mean_nnz = int(round(np.mean(tfidf_train.getnnz(1))))
    median_nnz = int(round(np.median(tfidf_train.getnnz(1))))
    print(print_op.format(tfidf_train.shape, mean_nnz,median_nnz))
    print("\nDone in {:.2f} seconds".format(time.time()-t))
    
    if plotting == True:     
        fig, ax = plt.subplots(figsize=(12, 6))
        plt.plot(np.sort(tfidf_train.getnnz(1))[::-1])

        # Log transformed
        plt.figure(figsize=(12,6))
        plt.plot(np.sort(np.log(tfidf_train.getnnz(1)))[::-1])
        
    return tfidf, tfidf_train, tfidf_test
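
To make the `sublinear_tf=True, use_idf=True` settings concrete, the sketch below computes the weights by hand for a toy corpus: term frequency is damped to `1 + ln(count)`, the inverse document frequency uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit L2 norm. It assumes plain whitespace tokenization and no stop-word removal, so it is a simplification of what `TfidfVectorizer` does. `tfidf_rows` and the example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear term frequency damps repeated terms
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            rows.append({})
            continue
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf_rows(["kinase mutation kinase", "mutation benign"])
print(rows[0])
```

In the first document `kinase` outweighs `mutation` because it is both repeated and absent from the other document.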

In [None]:
%%time
tfidf = TfidfVectorizer(stop_words = 'english', ngram_range=(1, 2), sublinear_tf=True, use_idf=True)
tfidf_train = tfidf.fit_transform(corpus_train)
tfidf_test = tfidf.transform(corpus_test)

In [803]:
sw = stopwords
stopwrds = sw.words('english')

In [804]:
def text_preprocessor(s, action=(1,)):
    
    # lowercase (this is already defaulted in Count Vectorizer)
    s = s.lower()
    
    # Regex functions
    def regexer(s, keep=None):
        # Default to no transformation so `regex` is always defined
        regex = None
        
        # remove non-letters at the start and end that might be citations or other misleading modifiers
        # (`tt_corpus` is not defined in this notebook; it is assumed to hold terms that must be left untouched)
        if keep == "no_citations":
            if s.lower() not in tt_corpus:
                regex = r"^[^a-zA-Z]*|[^a-zA-Z]*$"
                
        # remove all numbers in words that might be citations or other misleading modifiers
        if keep == "no_numbers":
            if s.lower() not in tt_corpus:
                regex = r"[^a-zA-Z ]"
            
        # Apply Regex transformation only when one was selected
        if regex is not None:
            s = re.sub(regex, "", s)
        return s
    
    # Remove stop words
    def stopwords(s):
        
        # Use nltk function for tokenization
        tokens = word_tokenize(s) 
        
        # Get stopwords list from nltk.corpus and remove words in the list
        s_bin = [w for w in tokens if not w in stopwrds]
        s = ' '.join(s_bin)
        return s
    
    # Lemmatize or stem function
    def lemmatize(s, func="stem"):
        # Split string into individual tokens
        tokens = word_tokenize(s) 
        s_bin = []
        
        # Stem tokens
        if func == "stem":
            fn = PorterStemmer().stem
        # Lemmatization
        if func == "lemma":
            fn = WordNetLemmatizer().lemmatize
        
        for t in tokens:
            s_bin.append(fn(t))
        s = ' '.join(s_bin)
        return s
    
    # Select transformation based on `action` parameter
    if 0 in action:
        s = regexer(s, keep="no_citations")
    if 1 in action:
        s = regexer(s, keep="no_numbers")
    if 2 in action:
        s = stopwords(s)
    if 3 in action:
        s = lemmatize(s, "lemma")
    if 4 in action:
        s = lemmatize(s, "stem")

    return s
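
As a quick illustration of the two cleaning modes, the sketch below applies patterns of the kind used in `regexer` to a few made-up strings. The names `edge_pattern` and `all_pattern` and the sample strings are hypothetical.

```python
import re

# Strip non-letters at the start and end only ("no_citations" style)
edge_pattern = r"^[^a-zA-Z]*|[^a-zA-Z]*$"
# Strip everything except letters and spaces ("no_numbers" style)
all_pattern = r"[^a-zA-Z ]"

samples = ["mutation12", "[3]kinase", "p53 pathway (2010)"]
for s in samples:
    edge_clean = re.sub(edge_pattern, "", s)
    all_clean = re.sub(all_pattern, "", s)
    print(repr(s), "->", repr(edge_clean), "|", repr(all_clean))
```

The edge-only pattern keeps digits inside a token such as `p53`, while the stricter pattern removes them, which is presumably why known gene and variation names are checked against `tt_corpus` first.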

In [None]:
%%time
cv = CountVectorizer(preprocessor=lambda x: text_preprocessor(x, action=[0, 2, 3]), 
                     lowercase=True, 
                     stop_words='english', 
                     min_df=1, 
                     max_df=.1, 
                     ngram_range=(1,2))
cv

In [None]:
%%time
train_cv = cv.fit_transform(corpus_train)
test_cv = cv.transform(corpus_test)

In [805]:
def tfidf_fn(X, Y, ngrams=1, mn=1, mx=0.1, test_transform=False, plotting=False):
    t = time.time()
    tfidf = TfidfVectorizer(stop_words = 'english', 
                            ngram_range=(1, ngrams), 
                            sublinear_tf=True, 
                            use_idf=True, 
                            preprocessor=lambda x: text_preprocessor(x, action=[0, 2, 3]), 
                            lowercase=True, 
                            min_df=mn, 
                            max_df=mx)
    print("Beginning tf-idf on training data", end="\r")
    tfidf_train = tfidf.fit_transform(X)
    print(print_op_str("train").format(time.time()-t), end="\r")
    
    if test_transform == True:
        print("Beginning tf-idf on test data", end="\r")
        tfidf_test = tfidf.transform(Y)
        print(print_op_str("test").format(time.time()-t), end="\r")
        print_blank(len(print_op_str("test")))
    else:
        tfidf_test = sps.csr_matrix((1,1))
        print_blank(len(print_op_str("train")))
    
    print_op = "Shape: {}\nNon-zero mean: {}\nNon-zero median: {}"
    mean_nnz = int(round(np.mean(tfidf_train.getnnz(1))))
    median_nnz = int(round(np.median(tfidf_train.getnnz(1))))
    print(print_op.format(tfidf_train.shape, mean_nnz,median_nnz))
    print("\nDone in {:.2f} seconds".format(time.time()-t))
    
    if plotting == True:     
        fig, ax = plt.subplots(figsize=(12, 6))
        plt.plot(np.sort(tfidf_train.getnnz(1))[::-1])

        # Log transformed
        plt.figure(figsize=(12,6))
        plt.plot(np.sort(np.log(tfidf_train.getnnz(1)))[::-1])
        
    return tfidf, tfidf_train, tfidf_test

In [806]:
def svd_metrics(s, n):
    explained_variance = s.explained_variance_ratio_.sum()
    #     print_op = "Explained variance of SVD with {} features: {}%"
    #     print(print_op.format(n, int(explained_variance * 100)))
    return explained_variance

In [807]:
def svd_fn(data_train, data_test=None, target=None, n_features=20, test_transform=False, plotting=False):
    t = time.time()
    print("Beginning svd on training data", end="\r")
    svd = TruncatedSVD(n_features, algorithm='arpack')
    svd_train = svd.fit_transform(data_train)
    print(print_op_str("train").format(time.time()-t), end="\r")
    
    if test_transform == True:
        print("Beginning svd on test data", end="\r")
        svd_test = svd.transform(data_test)
        print(print_op_str("test").format(time.time()-t), end="\r")
        print_blank(len(print_op_str("test")))
    else:
        svd_test = sps.csr_matrix((1,1))
        print_blank(len(print_op_str("train")))
        
    total_explained_variance = svd_metrics(svd, n_features)
        
    if plotting == True:
        data2D_svd = svd_train
        plt.scatter(data2D_svd[:,0], data2D_svd[:,1], c=target)
        plt.show()  
        
    return svd, svd_train, svd_test, total_explained_variance
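
The following sketch shows the same `TruncatedSVD` call pattern on a small random sparse matrix, so the shapes and the explained-variance figure can be inspected quickly. The matrix size, density and component count are arbitrary assumptions, not values from the project.

```python
import numpy as np
import scipy.sparse as sps
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-in for a tf-idf matrix (50 documents x 30 terms)
rng = np.random.RandomState(0)
toy = sps.random(50, 30, density=0.2, format='csr', random_state=rng)

# ARPACK needs n_components to be strictly smaller than the number of features
svd_toy = TruncatedSVD(n_components=5, algorithm='arpack')
reduced = svd_toy.fit_transform(toy)

explained = svd_toy.explained_variance_ratio_.sum()
print(reduced.shape)
print("explained variance: {:.1f}%".format(100 * explained))
```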

In [808]:
features_n_svd = 200

In [810]:
corpus_obj = [
    X_train['Gene'],
    X_train['Variation'],
    X_test['Gene'],
    X_test['Variation']
]

tt_corpus = []
for df in corpus_obj:
    tt_corpus.extend(df.values.tolist())
tt_corpus = pd.Series(tt_corpus)
tt_corpus = [s.lower() for s in list(tt_corpus.unique())]
len(tt_corpus)

10116

In [3]:
def save_sparse_csr(filename,array):
    np.savez(filename,data = array.data ,indices=array.indices,
             indptr =array.indptr, shape=array.shape )

In [815]:
def load_sparse_csr(filename):
    loader = np.load(filename + '.npz')
    return sps.csr_matrix((  loader['data'], loader['indices'], loader['indptr']),
                         shape = loader['shape'])
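
A small round trip can confirm that saving and loading preserves a CSR matrix. This is a sketch under assumptions: `save_csr` and `load_csr` are hypothetical stand-ins that mirror the two helpers above, and the temporary path is only for illustration.

```python
import os
import tempfile
import numpy as np
import scipy.sparse as sps

def save_csr(path, mtx):
    # np.savez appends ".npz" to the path when it is missing
    np.savez(path, data=mtx.data, indices=mtx.indices,
             indptr=mtx.indptr, shape=mtx.shape)

def load_csr(path):
    with np.load(path + '.npz') as loader:
        return sps.csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                              shape=tuple(loader['shape']))

original = sps.csr_matrix(np.array([[0.0, 1.0, 0.0],
                                    [2.0, 0.0, 3.0]]))
path = os.path.join(tempfile.mkdtemp(), 'toy_matrix')
save_csr(path, original)
restored = load_csr(path)
print(np.array_equal(original.toarray(), restored.toarray()))
```

Recent SciPy versions also provide `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which may be a simpler alternative if they are available in your environment.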

### Restore TF-IDF or run TF-IDF

Tokenize data (this will take a long time)

In [812]:
# %%time
# tfidf, tfidf_train, tfidf_test = tfidf_fn(corpus_train,
#                                 corpus_test,
#                                 ngrams=2, 
#                                 mn=1, 
#                                 mx=0.1, 
#                                 test_transform=True, 
#                                 plotting=False)

Make sure to save afterward

In [4]:
# save_sparse_csr(data_directory + '/data/train_bigram_vocabulary', tfidf_train)

In [None]:
# save_sparse_csr(data_directory + '/data/test_bigram_vocabulary', tfidf_test)

Or, load the previously saved TF-IDF matrices:

In [816]:
%%time 
tfidf = TfidfVectorizer()  # unfitted placeholder; only the transformed matrices are restored below
tfidf_train = load_sparse_csr(data_directory + '/data/train_bigram_vocabulary')
tfidf_test = load_sparse_csr(data_directory + '/data/test_bigram_vocabulary')

CPU times: user 927 ms, sys: 414 ms, total: 1.34 s
Wall time: 1.61 s


### Restore SVD features or run SVD

Run SVD on data (also takes a long time)

In [None]:
# %%time
# svd = TruncatedSVD(features_n_svd, algorithm='arpack')
# svd_train = svd.fit_transform(tfidf_train)
# svd_test = svd.transform(tfidf_test)

And save:

In [None]:
# np.save(data_directory + '/data/train_svd_200', svd_train)
# np.save(data_directory + '/data/test_svd_200', svd_test)

Or, load previously saved data:

In [818]:
svd_train = np.load(data_directory + '/data/train_svd_200.npy')
svd_test = np.load(data_directory + '/data/test_svd_200.npy')

Concatenate the SVD text features with the amino acid property features created for each variant.

In [825]:
hybrid_train = np.concatenate((svd_train, train_variant_properties), axis=1)
hybrid_test = np.concatenate((svd_test, test_variant_properties), axis=1)
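
Column-wise concatenation requires both blocks to have the same number of rows. A minimal sketch with stand-in arrays is shown below; the row count and the 6 property columns are hypothetical, since the real width of `train_variant_properties` is set elsewhere in the notebook.

```python
import numpy as np

toy_svd = np.random.RandomState(0).rand(4, 200)  # stand-in for svd_train
toy_props = np.zeros((4, 6))                     # stand-in for train_variant_properties

# Both blocks must describe the same samples in the same order
assert toy_svd.shape[0] == toy_props.shape[0]

toy_hybrid = np.concatenate((toy_svd, toy_props), axis=1)
print(toy_hybrid.shape)
```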

**Hyperparameters**

In [828]:
input_shape = hybrid_train.shape[1]
output_shape = len(train['Class'].unique())
batch_n = 32
EPOCHS_N = 100
model_save_interval = 100

### Model Architecture

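This uses the same model that worked well with text features alone: the hidden layers narrow from `512` => `256` => `128` => `64` units, then widen again through `128` => `256` => `512`, with dropout of 0.5 after each hidden layer and a softmax output over the classes.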

In [829]:
def model_hypothesis():
    model = Sequential()
    model.add(Dense(512, input_dim=input_shape, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(256, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(128, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(128, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(256, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(512, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(output_shape, kernel_initializer='normal', activation="softmax"))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
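
To get a feel for the size of this network, the sketch below counts the trainable parameters of the dense stack (weights plus biases; dropout adds none). The input width of 206 is a hypothetical value, since the real `input_shape` depends on the variant property features; the 9 output classes follow the class list at the top of the notebook.

```python
def dense_param_count(input_dim, layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    prev = input_dim
    for units in layer_sizes:
        total += prev * units + units  # weight matrix plus bias vector
        prev = units
    return total

hidden = [512, 256, 128, 64, 128, 256, 512]
n_classes = 9
toy_input_dim = 206  # hypothetical: 200 SVD features plus 6 property columns

n_params = dense_param_count(toy_input_dim, hidden + [n_classes])
print(n_params)
```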

In [830]:
def model_resume():
    model = load_recent_model(model_path)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Callbacks for checkpointing and early stopping, followed by one-hot encoding of the labels:

In [831]:
# Callback for saving model and weights at n intervals in training
# https://keras.io/callbacks/#modelcheckpoint
weight_save_callback = ModelCheckpoint(model_path + '.{epoch:02d}-{loss:.2f}.hdf5', 
                                       monitor='loss', 
                                       verbose=0, 
                                       save_best_only=True, 
                                       mode='auto',
                                       period=model_save_interval # Interval (number of epochs) between checkpoints
                                      )

In [832]:
early_stopping = EarlyStopping(monitor='loss', min_delta=0, patience=50, verbose=1, mode='auto')

In [833]:
onehot = LabelEncoder()
onehot.fit(y_train)
y_enc = onehot.transform(y_train)

In [834]:
y_ind = np_utils.to_categorical(y_enc)
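
The two steps above first map class labels to consecutive integers and then expand them into indicator rows. A small sketch with made-up labels is given below; it uses `np.eye` as a stand-in for `np_utils.to_categorical` so that it runs without Keras.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

toy_labels = np.array([7, 1, 4, 7, 2])           # made-up class labels

enc = LabelEncoder()
toy_enc = enc.fit_transform(toy_labels)          # integers 0..n_classes-1, in sorted label order
toy_onehot = np.eye(len(enc.classes_))[toy_enc]  # one indicator row per sample

# Map the indicator rows back to the original labels
recovered = enc.inverse_transform(toy_onehot.argmax(axis=1))
print(toy_enc, toy_onehot.shape, recovered)
```

Keeping the fitted encoder around makes it possible to translate predicted column indices back to the original class labels.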

### Restore Model or Begin Training New Model

To restore a previously saved model:

In [837]:
# Try to restore previous checkpoints to continue training
if os.path.isfile(model_path + '.h5') and os.path.isfile(model_path + '.json'):
    estimator = KerasClassifier(build_fn=model_resume, epochs=EPOCHS_N, batch_size=batch_n)
    model = model_resume()
else:
    estimator = KerasClassifier(build_fn=model_hypothesis, epochs=EPOCHS_N, batch_size=batch_n)
    model = model_hypothesis()

Using model at checkpoint: /Users/Reynard/dropbox/Data/kaggle/Personalized Medicine/saved_models/model_001.199-1.51.hdf5


Or create a new one:

In [835]:
estimator = KerasClassifier(build_fn=model_hypothesis, epochs=EPOCHS_N, batch_size=batch_n)

### Training

In [838]:
%%time
start_time = time.time()
#estimator.fit(svd_train, y_ind, validation_split=0.05)
estimator.fit(hybrid_train, y_ind, batch_size=batch_n, epochs=EPOCHS_N*10, callbacks=[weight_save_callback, early_stopping])
end_time = time.time()
print("Elapsed time: {:.2f} sec".format(end_time-start_time))
try: 
    save_model_to_json(estimator, model_path)
    print("Saved model and weights to disk")
except Exception as e:
    print(e)

Using model at checkpoint: /Users/Reynard/dropbox/Data/kaggle/Personalized Medicine/saved_models/model_001.199-1.51.hdf5
Epoch 1/1000
Epoch 2/1000
...
Epoch 63/1000

In [70]:
# Display model architecture
from IPython.display import SVG, display
from keras.utils import plot_model
from keras.utils.vis_utils import model_to_dot

# Wrap in display() so the SVG renders even though it is not the last expression in the cell
display(SVG(model_to_dot(model_hypothesis()).create(prog='dot', format='svg')))

# Save to .png
plot_model(model_hypothesis(), to_file='model_hypothesis1.png')

In [71]:
y_pred = estimator.predict_proba(hybrid_test)



In [73]:
submission = pd.DataFrame(y_pred)
submission['id'] = test_index
submission.columns = ['class1', 'class2', 'class3', 'class4', 'class5', 'class6', 'class7', 'class8', 'class9', 'id']
submission.to_csv(data_directory + "/output/submission_" + str(int(time.time())) + ".csv",index=False)
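
Before uploading, it may be worth checking that the submission frame has one probability column per class and that each row sums to one. The sketch below does this on randomly generated stand-in probabilities; `toy_pred` and `toy_ids` are hypothetical and do not come from the model.

```python
import numpy as np
import pandas as pd

# Stand-in for predict_proba output: 4 samples x 9 classes, rows sum to 1
rng = np.random.RandomState(0)
toy_pred = rng.dirichlet(np.ones(9), size=4)
toy_ids = [0, 1, 2, 3]

toy_sub = pd.DataFrame(toy_pred, columns=['class{}'.format(i) for i in range(1, 10)])
toy_sub['id'] = toy_ids

class_cols = [c for c in toy_sub.columns if c.startswith('class')]
rows_sum_to_one = np.allclose(toy_sub[class_cols].sum(axis=1), 1.0)
has_missing = toy_sub[class_cols].isnull().values.any()
print(toy_sub.shape, rows_sum_to_one, has_missing)
```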