# Problem Statement

**Problem Statement:** Classify the given genetic variations/mutations based on evidence from text-based clinical literature. 

**Objective:** Predict the probability of each data-point belonging to each of the nine classes.

Details:

* Class probabilities are needed.
* Penalize the errors in class probabilites => Metric is Log-loss.
* No Latency constraints.

https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/

#Mount Google Drive

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#Import Dependencies

In [0]:
import pandas as pd
import numpy as np

import nltk
nltk.download('stopwords')
nltk.download('punkt')


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer

from bs4 import BeautifulSoup
import re

from sklearn.model_selection import train_test_split
from sklearn.metrics.classification import accuracy_score, log_loss
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
#from sklearn.ensemble import StackingClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn import model_selection

# import ploty graph objects as "go"
import plotly.graph_objs as go
import plotly.express as px

#Ignore warning
import warnings
warnings.filterwarnings("ignore")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
from tqdm.notebook import trange, tqdm
tqdm.pandas(desc="Progress")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Loading Data

**TrainingData** : a comma separated file containing the description of the genetic mutations used for training. Fields are ID (the id of the row used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the aminoacid change for this mutations), Class (1-9 the class this genetic mutation has been classified on)

**TrainingText** : a double pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)

Ref: https://github.com/gauravtheP/Personalized-Cancer-Diagnosis/blob/master/Personalized-Cancer-Diagnosis.ipynb

In [0]:
#Data Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data

training_text = pd.read_csv('/content/drive/My Drive/Case Studies/Personalized Cancer Diagnosis/training_text', sep='\|\|', skiprows =1, names=['ID','Text'])
training_variants = pd.read_csv('/content/drive/My Drive/Case Studies/Personalized Cancer Diagnosis/training_variants')

print (training_text.shape, training_variants.shape)


(3321, 2) (3321, 4)


In [0]:
training_text.head()

Unnamed: 0,ID,Text
0,0,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,Abstract Background Non-small cell lung canc...
2,2,Abstract Background Non-small cell lung canc...
3,3,Recent evidence has demonstrated that acquired...
4,4,Oncogenic mutations in the monomeric Casitas B...


In [0]:
# Dataframe info
training_text.info()

#https://www.w3schools.com/python/ref_string_format.asp
print ('training_text dataframe has {number} null value in {col} column'.format(number = 3321-3316, col = 'Text'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3321 entries, 0 to 3320
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      3321 non-null   int64 
 1   Text    3316 non-null   object
dtypes: int64(1), object(1)
memory usage: 52.0+ KB
training_text dataframe has 5 null value in Text column


In [0]:
training_variants.head()

Unnamed: 0,ID,Gene,Variation,Class
0,0,FAM58A,Truncating Mutations,1
1,1,CBL,W802*,2
2,2,CBL,Q249E,2
3,3,CBL,N454D,3
4,4,CBL,L399V,4


In [0]:
# Dataframe info
training_variants.info()
print ('training_variants dataframe has {number} null value'.format(number = 0))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3321 entries, 0 to 3320
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         3321 non-null   int64 
 1   Gene       3321 non-null   object
 2   Variation  3321 non-null   object
 3   Class      3321 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 103.9+ KB
training_variants dataframe has 0 null value


In [0]:
# merging both the dataframe on ID
#https://stackoverflow.com/questions/44064299/how-can-i-concatenate-pandas-dataframes-by-column-and-index

df = pd.merge(training_text, training_variants, on=['ID'])
df.head()

Unnamed: 0,ID,Text,Gene,Variation,Class
0,0,Cyclin-dependent kinases (CDKs) regulate a var...,FAM58A,Truncating Mutations,1
1,1,Abstract Background Non-small cell lung canc...,CBL,W802*,2
2,2,Abstract Background Non-small cell lung canc...,CBL,Q249E,2
3,3,Recent evidence has demonstrated that acquired...,CBL,N454D,3
4,4,Oncogenic mutations in the monomeric Casitas B...,CBL,L399V,4


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3321 entries, 0 to 3320
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         3321 non-null   int64 
 1   Text       3316 non-null   object
 2   Gene       3321 non-null   object
 3   Variation  3321 non-null   object
 4   Class      3321 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 155.7+ KB


In [0]:
#Droping blank 
df.dropna(inplace=True)

In [0]:
df.head()

Unnamed: 0,ID,Text,Gene,Variation,Class
0,0,Cyclin-dependent kinases (CDKs) regulate a var...,FAM58A,Truncating Mutations,1
1,1,Abstract Background Non-small cell lung canc...,CBL,W802*,2
2,2,Abstract Background Non-small cell lung canc...,CBL,Q249E,2
3,3,Recent evidence has demonstrated that acquired...,CBL,N454D,3
4,4,Oncogenic mutations in the monomeric Casitas B...,CBL,L399V,4


# Exploratory Analysis

In [0]:
#Ref: https://plotly.com/python/

# Number of Gene frequency
df['Gene'].value_counts().values

# import graph objects as "go"
import plotly.graph_objs as go

x = df['Gene'].value_counts().index
y = df['Gene'].value_counts().values

data = {
  'x': x,
  'y': y,
  'name': 'Gene Frequency',
  'type': 'bar'
};

layout = {
  'xaxis': {'title': 'Gene'},
  'barmode': 'relative',
  'title': 'Gene Frequency'
};
fig = go.Figure(data = data, layout = layout)
fig.show()

In [0]:
#Data has a long tail, so we are visulaizing top 100 points

df.Variation.value_counts()
# import graph objects as "go"
import plotly.graph_objs as go

x = df.Variation.value_counts().index [:100]
y = df.Variation.value_counts().values [:100]

data = {
  'x': x,
  'y': y,
  'name': 'Variation Frequency',
  'type': 'bar'
};

layout = {
  'xaxis': {'title': 'Class'},
  'barmode': 'relative',
  'title': 'Variation Frequency'
};
fig = go.Figure(data = data, layout = layout)
fig.show()

In [0]:
# import graph objects as "go"
import plotly.graph_objs as go

x = df.Class.value_counts().index
y = df.Class.value_counts().values

data = {
  'x': x,
  'y': y,
  'name': 'Class Frequency',
  'type': 'bar'
};

layout = {
  'xaxis': {'title': 'Class'},
  'barmode': 'relative',
  'title': 'Class Frequency'
};
fig = go.Figure(data = data, layout = layout)
fig.show()

In [0]:
#Density headmap between Gene and Variation

import plotly.express as px

fig = px.density_heatmap(df, x="Class", y="Gene")
fig.show()

# Pre-processing Text Data

In [0]:
#Preprocessing
def pre_processing(x):
    
    #characters converting into lower
    x = str(x).lower()
    #replacing values

    #Expanding contraction
    x = x.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
                           .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
                           .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
                           .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
                           .replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
                           .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
                           .replace("€", " euro ").replace("'ll", " will")
    x = re.sub(r"([0-9]+)000000", r"\1m", x)
    x = re.sub(r"([0-9]+)000", r"\1k", x)
    
    #doing stemming
    porter = PorterStemmer()
    pattern = re.compile('\W')
    #defining type
    if type(x) == type(''):
        x = re.sub(pattern, ' ', x)
    
    #getting text from html
    if type(x) == type(''):
        x = porter.stem(x)
        example1 = BeautifulSoup(x)
        x = example1.get_text()
               
    
    return x


#list of stopword
stop_words = set(stopwords.words('english'))

#removing stopword
def cleaning_stopword(string):
  #tokenizing the sting
  word_tokens = word_tokenize(string)

  #blank list to append
  clean_string = [] 
  
  for x in word_tokens: 
      if x not in stop_words: 
          clean_string.append(x)
  return clean_string



#Removing Punctuation
def remove_punctuation(string): 
    # punctuation marks 
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
  
    # traverse the given string and if any punctuation 
    # marks occur replace it with null 
    for x in string.lower(): 
        if x in punctuations: 
            string = string.replace(x, "") 
  
    # Print string without punctuation 
    return string

#Removing html tag
def remove_html_tags(string):   
    clean = re.compile('<.*?>')
    return re.sub(clean, '', string)

In [0]:
df['clean_text'] = df.Text.progress_apply(lambda x: pre_processing(x))

HBox(children=(FloatProgress(value=0.0, description='Progress', max=3316.0, style=ProgressStyle(description_wi…




In [0]:
df['clean_text'] = df.clean_text.progress_apply(lambda x: remove_punctuation(x))

HBox(children=(FloatProgress(value=0.0, description='Progress', max=3316.0, style=ProgressStyle(description_wi…




In [0]:
df['clean_text'] = df.clean_text.progress_apply(lambda x: remove_html_tags(x))

HBox(children=(FloatProgress(value=0.0, description='Progress', max=3316.0, style=ProgressStyle(description_wi…




In [0]:
df['clean_text'] = df.clean_text.progress_apply(lambda x: cleaning_stopword(x))

HBox(children=(FloatProgress(value=0.0, description='Progress', max=3316.0, style=ProgressStyle(description_wi…




In [0]:
df.head()

Unnamed: 0,ID,Text,Gene,Variation,Class,clean_text
0,0,Cyclin-dependent kinases (CDKs) regulate a var...,FAM58A,Truncating Mutations,1,"[cyclin, dependent, kinases, cdks, regulate, v..."
1,1,Abstract Background Non-small cell lung canc...,CBL,W802*,2,"[abstract, background, non, small, cell, lung,..."
2,2,Abstract Background Non-small cell lung canc...,CBL,Q249E,2,"[abstract, background, non, small, cell, lung,..."
3,3,Recent evidence has demonstrated that acquired...,CBL,N454D,3,"[recent, evidence, demonstrated, acquired, uni..."
4,4,Oncogenic mutations in the monomeric Casitas B...,CBL,L399V,4,"[oncogenic, mutations, monomeric, casitas, b, ..."


In [0]:
df.drop(['ID','Class'], axis=1)

Unnamed: 0,Text,Gene,Variation,clean_text
0,Cyclin-dependent kinases (CDKs) regulate a var...,FAM58A,Truncating Mutations,"[cyclin, dependent, kinases, cdks, regulate, v..."
1,Abstract Background Non-small cell lung canc...,CBL,W802*,"[abstract, background, non, small, cell, lung,..."
2,Abstract Background Non-small cell lung canc...,CBL,Q249E,"[abstract, background, non, small, cell, lung,..."
3,Recent evidence has demonstrated that acquired...,CBL,N454D,"[recent, evidence, demonstrated, acquired, uni..."
4,Oncogenic mutations in the monomeric Casitas B...,CBL,L399V,"[oncogenic, mutations, monomeric, casitas, b, ..."
...,...,...,...,...
3316,Introduction Myelodysplastic syndromes (MDS) ...,RUNX1,D171N,"[introduction, myelodysplastic, syndromes, mds..."
3317,Introduction Myelodysplastic syndromes (MDS) ...,RUNX1,A122*,"[introduction, myelodysplastic, syndromes, mds..."
3318,The Runt-related transcription factor 1 gene (...,RUNX1,Fusions,"[runt, related, transcription, factor, 1, gene..."
3319,The RUNX1/AML1 gene is the most frequent targe...,RUNX1,R80C,"[runx1, aml1, gene, frequent, target, chromoso..."


# Splitting Data Into Text and Train

In [0]:
X = df.drop(['ID','Class'], axis=1)
y =df['Class']

from sklearn.model_selection import train_test_split

#Splitting the data into train vs test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

#Further splitting train data into train vs cross validate
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2)

print (X_train.shape, y_train.shape,X_cv.shape, y_cv.shape, X_test.shape, y_test.shape)

(2121, 4) (2121,) (531, 4) (531,) (664, 4) (664,)


In [0]:
#Distribution of y_i's in Train, Test and Cross Validation datasets
x = y_train.value_counts().index
y = y_train.value_counts().values

data = {
  'x': x,
  'y': y,
  'name': 'Class Frequency',
  'type': 'bar'
};

layout = {
  'xaxis': {'title': 'Class'},
  'barmode': 'relative',
  'title': 'Y Train Class Frequency'
};
fig = go.Figure(data = data, layout = layout)
fig.show()

x = y_cv.value_counts().index
y = y_cv.value_counts().values

data = {
  'x': x,
  'y': y,
  'name': 'Class Frequency',
  'type': 'bar'
};

layout = {
  'xaxis': {'title': 'Class'},
  'barmode': 'relative',
  'title': 'Y CV Class Frequency'
};
fig = go.Figure(data = data, layout = layout)
fig.show()

x = y_test.value_counts().index
y = y_test.value_counts().values

data = {
  'x': x,
  'y': y,
  'name': 'Class Frequency',
  'type': 'bar'
};

layout = {
  'xaxis': {'title': 'Class'},
  'barmode': 'relative',
  'title': 'Y Test Class Frequency'
};
fig = go.Figure(data = data, layout = layout)
fig.show()

# Creating Random Model

In [0]:
import plotly.figure_factory as ff

def print_confusion_matrix(Y_Test,Predicted):
  confusionMatx = np.round_(confusion_matrix(Y_Test, Predicted),decimals=3 )
  precision = np.round_(confusionMatx/confusionMatx.sum(axis = 0),decimals=3 )
  recall = np.round_((confusionMatx.T/confusionMatx.sum(axis = 1)).T,decimals=3)

  #Confusion Matrix -------------------------------------
  z = confusionMatx
  # change each element of z to type string for annotations
  z_text = [[str(y) for y in x] for x in z]

  #https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Heatmap.html
  
  x_labels = [i for i in range(1, 10)]
  y_labels = [i for i in range(1, 10)]
  fig = ff.create_annotated_heatmap(z, x=x_labels, y=y_labels, annotation_text=z_text)


  # add title
  fig.update_layout(height=600, width=800, 
                    title_text='<b>Confusion matrix</b>',
                    xaxis = dict(title='Predicted value'),
                    yaxis = dict(title='Real value')
                  )
  fig.show()

  #Precision Matrix -------------------------------------
  z = precision
  # change each element of z to type string for annotations
  z_text = [[str(y) for y in x] for x in z]

  #https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Heatmap.html
  
  x_labels = [i for i in range(1, 10)]
  y_labels = [i for i in range(1, 10)]
  fig = ff.create_annotated_heatmap(z, x=x_labels, y=y_labels, annotation_text=z_text)


  # add title
  fig.update_layout(height=600, width=800, 
                    title_text='<b>Precision matrix: Columm Sum=1</b>',
                    xaxis = dict(title='Predicted value'),
                    yaxis = dict(title='Real value')
                  )
  fig.show()

  #Recall Matrix -------------------------------------
  z = recall
  # change each element of z to type string for annotations
  z_text = [[str(y) for y in x] for x in z]

  #https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Heatmap.html
  
  x_labels = [i for i in range(1, 10)]
  y_labels = [i for i in range(1, 10)]
  fig = ff.create_annotated_heatmap(z, x=x_labels, y=y_labels, annotation_text=z_text)


  # add title
  fig.update_layout(height=600, width=800, 
                    title_text='<b>Recall matrix: Row sum=1</b>',
                    xaxis = dict(title='Predicted value'),
                    yaxis = dict(title='Real value')
                  )
  
  fig.show()

In [0]:
#Random Model using Test Data

test_data_length = X_test.shape[0]
test_predicted_probs = np.zeros((test_data_length,9))

for i in range(test_data_length):
  #it will return an array of random numbers between 1 and 0 of size 1*9
  rand_probs_test = np.random.rand(1,9)
  #it will generate random probabilities of each point in test data such that theor sum = 1
  test_predicted_probs[i] = (rand_probs_test/sum(sum(rand_probs_test)))[0]

print("Log loss on Test Data using Random Model "+str(log_loss(y_test, test_predicted_probs)))

#Random Model Y Predticted Value
text_predicted_y =np.argmax(test_predicted_probs, axis=1)

Log loss on Test Data using Random Model 2.4833425662069772


In [0]:
print_confusion_matrix(y_test,text_predicted_y+1)

In [0]:
#Random Model using CV Data

cv_data_length = X_cv.shape[0]
cv_predicted_probs = np.zeros((cv_data_length,9))

for i in range(cv_data_length):
  #it will return an array of random numbers between 1 and 0 of size 1*9
  rand_probs_cv = np.random.rand(1,9)
  #it will generate random probabilities of each point in test data such that theor sum = 1
  cv_predicted_probs[i] = (rand_probs_cv/sum(sum(rand_probs_cv)))[0]

print("Log loss on Test Data using Random Model "+str(log_loss(y_cv, cv_predicted_probs)))

#Random Model Y Predticted Value
cv_predicted_probs =np.argmax(cv_predicted_probs, axis=1)

Log loss on Test Data using Random Model 2.4471390229897763


In [0]:
print_confusion_matrix(y_cv,cv_predicted_probs+1)

# Feature Extraction

## One Hot Encoding

In [0]:
# creating one hot encoder object with categorical feature 0 

#Using Scikit Learn Encoder
encoder = OneHotEncoder(sparse=False,handle_unknown ='ignore')

# Reshape your data either using
# array.reshape(-1, 1) if your data has a single feature
# array.reshape(1, -1) if it contains a single sample.

encoder.fit(X_train['Gene'].values.reshape(-1,1))

X_train_Gene_one = encoder.transform(X_train['Gene'].values.reshape(-1,1))
X_cv_Gene_one = encoder.transform(X_cv['Gene'].values.reshape(-1,1))
X_test_Gene_one = encoder.transform(X_test['Gene'].values.reshape(-1,1))

print('After One Hot Encoding of Gene')
print(X_train_Gene_one.shape)
print(X_cv_Gene_one.shape)
print(X_test_Gene_one.shape)

After One Hot Encoding of Gene
(2121, 233)
(531, 233)
(664, 233)


In [0]:
# creating one hot encoder object with categorical feature 0 

#Using Scikit Learn Encoder
encoder = OneHotEncoder(sparse=False,handle_unknown ='ignore')

# Reshape your data either using
# array.reshape(-1, 1) if your data has a single feature
# array.reshape(1, -1) if it contains a single sample.

encoder.fit(X_train['Variation'].values.reshape(-1,1))

X_train_Variation_one = encoder.transform(X_train['Variation'].values.reshape(-1,1))
X_cv_Variation_one = encoder.transform(X_cv['Variation'].values.reshape(-1,1))
X_test_Variation_one = encoder.transform(X_test['Variation'].values.reshape(-1,1))

print('After One Hot Encoding of Variation')
print(X_train_Variation_one.shape)
print(X_cv_Variation_one.shape)
print(X_test_Variation_one.shape)

After One Hot Encoding of Variation
(2121, 1930)
(531, 1930)
(664, 1930)


## Response coding

In [0]:
#Ref: https://medium.com/@thewingedwolf.winterfell/response-coding-for-categorical-data-7bb8916c6dc1

In [0]:
print(X_train.shape, y_train.shape,X_cv.shape, y_cv.shape, X_test.shape, y_test.shape)

(2121, 4) (2121,) (531, 4) (531,) (664, 4) (664,)


In [0]:
def response_coding(feature,training_df,training_y,alpha):
  all_outcome = sorted(training_y.iloc[:].unique())
  #feature frquncy calculation
  feature_value_counts = dict(training_df[feature].value_counts())

  # empty dictionary to store
  response_coded_dict = dict()
  for feature_key in feature_value_counts.keys():
    #print(feature_key)
    filt = training_df[feature] == feature_key
    filter_y = training_y[filt.values]
    filter_y_value_counts = dict(filter_y.value_counts())
    #data wihtout laplace smoothing
    #filter_y_value_counts = dict(filter_y.value_counts(normalize=True))
    #storing probablity value after laplace smoothing
    prob_value = []
    #Calculating probablity for each key 
    for k in all_outcome:
      #print(k)

      try:
      #Sum of all dictionary value
      #Ref: https://stackoverflow.com/questions/4880960/how-to-sum-all-the-values-in-a-dictionary
        prob_value.append((filter_y_value_counts[k] + alpha*10)/ (sum(filter_y_value_counts.values()) + 90*alpha))
      except:
      #If Feature does not occurs in a particular outcomes
        prob_value.append((0+ alpha*10)/ (sum(filter_y_value_counts.values()) + 90*alpha))
      #print(prob_value)
    
    response_coded_dict[feature_key]=prob_value

  return response_coded_dict

In [0]:
def get_response_value (feature,training_df,training_y,test_df,alpha):
  response_coded = dict(response_coding(feature,training_df,training_y,alpha))
  feature_key = test_df[feature]
  retrive_value=[]
  for key in feature_key:
    try:
      retrive_value.append(response_coded[str(key)])
    except:
      retrive_value.append([1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9])
  return retrive_value

In [0]:
X_train_Gene_response_code = pd.DataFrame(get_response_value('Gene',X_train,y_train,X_train,1),columns=['Gene_X1','Gene_X2','Gene_X3','Gene_X4','Gene_X5','Gene_X6','Gene_X7','Gene_X8','Gene_X9']) 
X_cv_Gene_response_code = pd.DataFrame(get_response_value('Gene',X_train,y_train,X_cv,1),columns=['Gene_X1','Gene_X2','Gene_X3','Gene_X4','Gene_X5','Gene_X6','Gene_X7','Gene_X8','Gene_X9'])
X_test_Gene_response_code = pd.DataFrame(get_response_value('Gene',X_train,y_train,X_test,1),columns=['Gene_X1','Gene_X2','Gene_X3','Gene_X4','Gene_X5','Gene_X6','Gene_X7','Gene_X8','Gene_X9'])

X_train_Variation_response_code = pd.DataFrame(get_response_value('Variation',X_train,y_train,X_train,1),columns=['Variation_X1','Variation_X2','Variation_X3','Variation_X4','Variation_X5','Variation_X6','Variation_X7','Variation_X8','Variation_X9'])
X_cv_Variation_response_code = pd.DataFrame(get_response_value('Variation',X_train,y_train,X_cv,1),columns=['Variation_X1','Variation_X2','Variation_X3','Variation_X4','Variation_X5','Variation_X6','Variation_X7','Variation_X8','Variation_X9'])
X_test_Variation_response_code = pd.DataFrame(get_response_value('Variation',X_train,y_train,X_test,1),columns=['Variation_X1','Variation_X2','Variation_X3','Variation_X4','Variation_X5','Variation_X6','Variation_X7','Variation_X8','Variation_X9'])

In [0]:
print('After get_response_value of Gene')
print(X_train_Gene_response_code.shape)
print(X_cv_Gene_response_code.shape)
print(X_test_Gene_response_code.shape)

print('After get_response_value of Variation')
print(X_train_Variation_response_code.shape)
print(X_cv_Variation_response_code.shape)
print(X_test_Variation_response_code.shape)

After get_response_value of Gene
(2121, 9)
(531, 9)
(664, 9)
After get_response_value of Variation
(2121, 9)
(531, 9)
(664, 9)


## TF-IDF for Text

In [0]:
#TFIDF vectorizer of Train Data
from sklearn.feature_extraction.text import TfidfVectorizer

#TF-IDF with min 5 occurance and maximum 1000 features
vectorizer = TfidfVectorizer(lowercase = False, min_df=5, max_features=1000)
corpus = X_train.Text

X_train_tf_idf = vectorizer.fit_transform(corpus).todense()
X_cv_tf_idf = vectorizer.transform(X_cv['Text'].values).todense()
X_test_tf_idf = vectorizer.transform(X_test['Text'].values).todense()



## Hstack Variables 

### tf-idf + one_hot Gene + one_hot Variant

In [0]:
# tf-idf + one_hot Gene + one_hot Variant
print(X_train_tf_idf.shape, X_train_Gene_one.shape, X_train_Variation_one.shape)
print(X_cv_tf_idf.shape, X_cv_Gene_one.shape, X_cv_Variation_one.shape )
print(X_test_tf_idf.shape, X_test_Gene_one.shape, X_test_Gene_response_code.shape)

(2121, 1000) (2121, 233) (2121, 1930)
(531, 1000) (531, 233) (531, 1930)
(664, 1000) (664, 233) (664, 9)


In [0]:
X_train_one_hot = np.hstack((X_train_tf_idf,X_train_Gene_one,X_train_Variation_one))
X_cv_one_hot = np.hstack((X_cv_tf_idf,X_cv_Gene_one,X_cv_Variation_one))
X_test_one_hot = np.hstack((X_test_tf_idf,X_test_Gene_one,X_test_Gene_response_code))

print(X_train_one_hot.shape,X_cv_one_hot.shape, X_test_one_hot.shape)

(2121, 3163) (531, 3163) (664, 1242)


### tf-idf + Response Coding Gene + Response Coding Variant

In [0]:
# tf-idf + Response Coding Gene + Response Coding Variant
print(X_train_tf_idf.shape, X_train_Gene_response_code.shape, X_train_Variation_response_code.shape)
print(X_cv_tf_idf.shape, X_cv_Gene_response_code.shape, X_cv_Variation_response_code.shape )
print(X_test_tf_idf.shape, X_test_Gene_response_code.shape, X_test_Variation_response_code.shape)

(2121, 1000) (2121, 9) (2121, 9)
(531, 1000) (531, 9) (531, 9)
(664, 1000) (664, 9) (664, 9)


In [0]:
X_train_resp = np.hstack((X_train_tf_idf, X_train_Gene_response_code, X_train_Variation_response_code))
X_cv_resp = np.hstack((X_cv_tf_idf, X_cv_Gene_response_code, X_cv_Variation_response_code))
X_test_resp = np.hstack((X_test_tf_idf, X_test_Gene_response_code, X_test_Variation_response_code))

print(X_train_resp.shape,X_cv_resp.shape, X_test_resp.shape)

(2121, 1018) (531, 1018) (664, 1018)


In [0]:
#Value from nested dictionary
#Ref: https://www.programiz.com/python-programming/nested-dictionary

In [0]:
#Ref: https://github.com/tulasiram58827/Cancer-Diagnosis
#Ref: https://github.com/saicharanarishanapally/Personalized-Cancer-Diagnosis/blob/master/PersonalizedCancerDiagnosis.ipynb



# Machine Learning Algorithm

Base Line Model

* Naive Bayes
* K Nearest Nabour
* Logistic Regression
* Support Vector Machine
* Random Forest

Stacking Classifier
* Maximum Voting classifier 

## Naive Bayes

### Multinomial Naive Bayes
* Suited for classification of data with discrete features ( count data )
* Very useful in text processing
* Each text will be converted to vector of word count
* Cannot deal with negative numbers



In [0]:
print(X_train_resp.shape, y_train.shape, X_cv_resp.shape, y_cv.shape, X_test_resp.shape, y_test.shape)

(2121, 1018) (2121,) (531, 1018) (531,) (664, 1018) (664,)


In [0]:
#Ref: https://medium.com/@awantikdas/a-comprehensive-naive-bayes-tutorial-using-scikit-learn-f6b71ae84431

'''
#There are two ways to use CalibratedClassifierCV

#Method 1, train classifier within CCCV
model = CalibratedClassifierCV(my_clf)
model.fit(X_train_val, y_train_val)

#Method 2, train classifier and then use CCCV on DISJOINT set (if we have a seprate data set for validation)
my_clf.fit(X_train, y_train)
model = CalibratedClassifierCV(my_clf, cv='prefit')
model.fit(X_val, y_val)
'''

alpha=[10 ** x for x in range(-5, 1)]
#ref: https://www.kaggle.com/marcospinaci/0-335-log-loss-in-a-dozen-lines

cv_log_error_array = []

for i in alpha:
  #creating classifier for Multinomial NB
  my_classifier = MultinomialNB(alpha=i)
  my_classifier.fit(X_train_resp, y_train)

  model = CalibratedClassifierCV(my_classifier, cv='prefit')
  model.fit(X_cv_resp, y_cv)

  nb_prob = model.predict_proba(X_test_resp)
  cv_log_error_array.append(log_loss(y_test, prob))
  print("Logloss for alpha: ",i," :",log_loss(y_test, nb_prob), "Accuracy: ", model.score(X_test_resp, y_test))

Logloss for alpha:  1e-05  : 1.2218987880910597 Accuracy:  0.5331325301204819
Logloss for alpha:  0.0001  : 1.2220217058218397 Accuracy:  0.5331325301204819
Logloss for alpha:  0.001  : 1.2221797931763458 Accuracy:  0.5331325301204819
Logloss for alpha:  0.01  : 1.2224691626952278 Accuracy:  0.5331325301204819
Logloss for alpha:  0.1  : 1.2259418873929027 Accuracy:  0.5346385542168675
Logloss for alpha:  1  : 1.2460385442892938 Accuracy:  0.5496987951807228


In [0]:
fig = go.Figure(data=go.Scatter(x=alpha, y=cv_log_error_array))

# add title and labels
fig.update_layout(title_text='<b>Cross validation log loss</b>',
                    xaxis = dict(title='Alpha'),
                    yaxis = dict(title='Log_loss')
                  )
fig.show()

In [0]:
#numpy.argmin(array, axis = None, out = None) : Returns indices of the min element of the array in a particular axis.
best_alpha = alpha[np.argmin(cv_log_error_array)]

#Refitting the model with best parameter
my_classifier = MultinomialNB(alpha=best_alpha)
my_classifier.fit(X_train_resp, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_cv_resp, y_cv)
prob_nb = model.predict_proba(X_test_resp)
cv_log_error_array.append(log_loss(y_test, prob_nb, labels=my_classifier.classes_, eps=1e-15))
print("Logloss for alpha",round(best_alpha,2),":",round(log_loss(y_test, prob_nb, labels=my_classifier.classes_, eps=1e-15),2),"Accuracy:",round(model.score(X_test_resp, y_test),2))

Logloss for alpha 0.0 : 1.22 Accuracy: 0.53


In [0]:
print_confusion_matrix(y_test,model.predict(X_test_resp))

##K Nearest Neighbors

In [0]:
# find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

# class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)[source]
# default parameter
# KNeighborsClassifier(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2, 
# metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs)

#create new a knn model
from sklearn.model_selection import GridSearchCV

neigh = KNeighborsClassifier()

param_grid = {'n_neighbors': np.arange(1, 10)}#use gridsearch to test all values for n_neighbors
knn_gscv = GridSearchCV(neigh, param_grid,scoring='neg_log_loss')#fit model to data
knn_gscv.fit(X_train_resp,y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'n_neighbors': array([1, 2, 3, 4, 5, 6, 7, 8, 9])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_log_loss', verbose=0)

In [0]:
log_error_array = []
'''
#There are two ways to use CalibratedClassifierCV

#Method 1, train classifier within CCCV
model = CalibratedClassifierCV(my_clf)
model.fit(X_train_val, y_train_val)

#Method 2, train classifier and then use CCCV on DISJOINT set (if we have a seprate data set for validation)
my_clf.fit(X_train, y_train)
model = CalibratedClassifierCV(my_clf, cv='prefit')
model.fit(X_val, y_val)
'''
alpha = np.arange(1, 25)
for i in alpha:
  #hyper parameter tunning
  neigh = KNeighborsClassifier(n_neighbors= i)
  neigh.fit(X_train_resp,y_train)
  
  #fitting into claibrated classifier
  knn_model = CalibratedClassifierCV(neigh, cv='prefit')
  knn_model.fit(X_cv_resp, y_cv)
  
  #Predict probablity
  prob_cv = knn_model.predict_proba(X_cv_resp)
  
  #Calculating Logloss
  log_error_array.append(log_loss(y_cv, prob_cv))

In [0]:
fig = go.Figure(data=go.Scatter(x=np.arange(1, 25), y=log_error_array))

# add title and labels
fig.update_layout(title_text='<b>Cross validation log loss</b>',
                    xaxis = dict(title='Alpha'),
                    yaxis = dict(title='Log_loss')
                  )
fig.show()

In [0]:
#fitting the model using best parameter

# numpy.argmin(array, axis = None, out = None) : Returns indices of the min element of the array in a particular axis.
#Best parameter as per log loss
best_alpha = alpha[np.argmin(log_error_array)]

#Creating KNN classifier
neigh = KNeighborsClassifier(n_neighbors= best_alpha)
neigh.fit(X_train_resp,y_train)
  
#fitting into claibrated classifier
knn_model = CalibratedClassifierCV(neigh, cv='prefit')
knn_model.fit(X_cv_resp, y_cv)

CalibratedClassifierCV(base_estimator=KNeighborsClassifier(algorithm='auto',
                                                           leaf_size=30,
                                                           metric='minkowski',
                                                           metric_params=None,
                                                           n_jobs=None,
                                                           n_neighbors=7, p=2,
                                                           weights='uniform'),
                       cv='prefit', method='sigmoid')

In [0]:
prob_knn = knn_model.predict_proba(X_test_resp)
print("Logloss for n_neighbors:",round(log_loss(y_test, prob_knn),2), "Accuracy:", round(knn_model.score(X_test_resp,y_test),3))

Logloss for n_neighbors: 1.14 Accuracy: 0.61


In [0]:
print_confusion_matrix(y_test,knn_model.predict(X_test_resp))

##Logistic Regression

In [0]:
# find more about LR: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

#Default Parameter  class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, 
#C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', 
#max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)


'''In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, 
and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ 
option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)'''

'''
#There are two ways to use CalibratedClassifierCV

#Method 1, train classifier within CCCV
model = CalibratedClassifierCV(my_clf)
model.fit(X_train_val, y_train_val)

#Method 2, train classifier and then use CCCV on DISJOINT set (if we have a seprate data set for validation)
my_clf.fit(X_train, y_train)
model = CalibratedClassifierCV(my_clf, cv='prefit')
model.fit(X_val, y_val)
'''

log_error_array=[]

alpha=[10 ** x for x in range(-5, 5)]

for i in alpha:

    lr = LogisticRegression(random_state=0, C=i,class_weight='balanced')
    lr.fit(X_train_resp,y_train)

    lr_model=CalibratedClassifierCV(base_estimator=lr,method='sigmoid', cv='prefit')
    lr_model.fit(X_cv_resp, y_cv)
    
    #Predict probablity
    prob_cv = lr_model.predict_proba(X_cv_resp)
  
    #Calculating Logloss
    log_error_array.append(log_loss(y_cv, prob_cv))

In [0]:
fig = go.Figure(data=go.Scatter(x=alpha, y=log_error_array))

# add title and labels
fig.update_layout(title_text='<b>Cross validation log loss</b>',
                    xaxis = dict(title='Alpha'),
                    yaxis = dict(title='Log_loss')
                  )
fig.show()

In [0]:
#Best parameter as per log loss
best_alpha = alpha[np.argmin(log_error_array)]

lr = LogisticRegression(random_state=0, C=i,class_weight='balanced')
lr.fit(X_train_resp,y_train)

lr_model=CalibratedClassifierCV(base_estimator=lr,method='sigmoid', cv='prefit')
lr_model.fit(X_cv_resp, y_cv)

prob_lr = lr_model.predict_proba(X_test_resp)
print("Logloss for n_neighbors:",round(log_loss(y_test, prob_lr),2), "Accuracy:", round(lr_model.score(X_test_resp,y_test),3))

Logloss for n_neighbors: 1.27 Accuracy: 0.577


In [0]:
print_confusion_matrix(y_test,lr_model.predict(X_test_resp))

##Support Vector Machine

In [0]:
#details: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

'''
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', 
coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, 
class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', 
break_ties=False, random_state=None)[source]¶
'''
alpha = ['linear','poly','rbf','sigmoid']
log_error_array = []

for i in alpha:
  svm = SVC(kernel=i, decision_function_shape='ovo', class_weight='balanced')
  svm.fit(X_train_resp,y_train)
  
  svm_model=CalibratedClassifierCV(base_estimator=svm,method='sigmoid', cv='prefit')
  svm_model.fit(X_cv_resp, y_cv)
    
  #Predict probablity
  prob_cv = svm_model.predict_proba(X_cv_resp)
  
  #Calculating Logloss
  log_error_array.append(log_loss(y_cv, prob_cv))

In [0]:
fig = go.Figure(data=go.Scatter(x=alpha, y=log_error_array))

# add title and labels
fig.update_layout(title_text='<b>Cross validation log loss</b>',
                    xaxis = dict(title='Alpha'),
                    yaxis = dict(title='Log_loss')
                  )
fig.show()

In [0]:
#Best parameter as per log loss
best_alpha = alpha[np.argmin(log_error_array)]

svm = SVC(kernel=best_alpha, decision_function_shape='ovo', class_weight='balanced')
svm.fit(X_train_resp,y_train)
  
svm_model=CalibratedClassifierCV(base_estimator=svm,method='sigmoid', cv='prefit')
svm_model.fit(X_cv_resp, y_cv)

prob_svm = svm_model.predict_proba(X_test_resp)
print("Logloss for n_neighbors:",round(log_loss(y_test, prob_svm),2), "Accuracy:", round(svm_model.score(X_test_resp,y_test),3))

Logloss for n_neighbors: 1.58 Accuracy: 0.399


In [0]:
#details: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

'''
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', 
coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, 
class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', 
break_ties=False, random_state=None)[source]¶
'''
alpha = [10 ** x for x in range(-5, 5)]
log_error_array = []

#C is Regularization parameter

for i in alpha:
  svm = SVC(C=i, kernel='sigmoid', decision_function_shape='ovo', class_weight='balanced')
  svm.fit(X_train_resp,y_train)
  
  svm_model=CalibratedClassifierCV(base_estimator=svm,method='sigmoid', cv='prefit')
  svm_model.fit(X_cv_resp, y_cv)
    
  #Predict probablity
  prob_cv = svm_model.predict_proba(X_cv_resp)
  
  #Calculating Logloss
  log_error_array.append(log_loss(y_cv, prob_cv))

In [0]:
fig = go.Figure(data=go.Scatter(x=alpha, y=log_error_array))

# add title and labels
fig.update_layout(title_text='<b>Cross validation log loss</b>',
                    xaxis = dict(title='Alpha'),
                    yaxis = dict(title='Log_loss')
                  )
fig.show()

In [0]:
#Best parameter as per log loss
best_alpha = alpha[np.argmin(log_error_array)]

svm = SVC(C=best_alpha, kernel='sigmoid', decision_function_shape='ovo', class_weight='balanced')
svm.fit(X_train_resp,y_train)
  
svm_model=CalibratedClassifierCV(base_estimator=svm,method='sigmoid', cv='prefit')
svm_model.fit(X_cv_resp, y_cv)

prob_svm = svm_model.predict_proba(X_test_resp)
print("Logloss for n_neighbors:",round(log_loss(y_test, prob_svm),2), "Accuracy:", round(svm_model.score(X_test_resp,y_test),3))

Logloss for n_neighbors: 1.58 Accuracy: 0.399


In [0]:
print_confusion_matrix(y_test,svm_model.predict(X_test_resp))

##Random Forest

In [0]:
#For more Details: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
'''
Default:
 class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', 
 max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, 
 max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, 
 bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, 
 class_weight=None, ccp_alpha=0.0, max_samples=None)
'''
alpha = [100,200,500,1000,2000]
max_depth = [x for x in range(3, 10)]
log_error_array = []
alphda_count=[]
max_depth_count=[]

for i in tqdm(alpha):
  for j in tqdm(max_depth):
    rf = RandomForestClassifier(n_estimators=i, max_depth=j, random_state=42)
    rf.fit(X_train_resp, y_train)
    
    rf_model=CalibratedClassifierCV(base_estimator=rf,method='sigmoid', cv='prefit')
    rf_model.fit(X_cv_resp, y_cv)

    #Predict probablity
    prob_cv = rf_model.predict_proba(X_cv_resp)
  
    #Calculating Logloss
    log_error_array.append(log_loss(y_cv, prob_cv))
    alphda_count.append(i)
    max_depth_count.append(j)
    print('alpha:',i,'max_depth:',j,log_loss(y_cv, prob_cv))

error_log = pd.DataFrame()
error_log['n_estimators'] = alphda_count
error_log['max_depth'] = max_depth_count
error_log['log_loss'] = log_error_array

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))

alpha: 100 max_depth: 3 1.0622941450517331
alpha: 100 max_depth: 4 1.0153034748384677
alpha: 100 max_depth: 5 0.9627517263437535
alpha: 100 max_depth: 6 0.9293857002021614
alpha: 100 max_depth: 7 0.9305813829921726
alpha: 100 max_depth: 8 0.9014457221699463
alpha: 100 max_depth: 9 0.9123228160740493



HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))

alpha: 200 max_depth: 3 1.0653759383349184
alpha: 200 max_depth: 4 0.9974675418763453
alpha: 200 max_depth: 5 0.9525121937486393
alpha: 200 max_depth: 6 0.9225695848674615
alpha: 200 max_depth: 7 0.9162161277467956
alpha: 200 max_depth: 8 0.9004693814712527
alpha: 200 max_depth: 9 0.8932670658830076



HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))

alpha: 500 max_depth: 3 1.0576701737978078
alpha: 500 max_depth: 4 0.983055429088527
alpha: 500 max_depth: 5 0.9474973333374749
alpha: 500 max_depth: 6 0.921318641193289
alpha: 500 max_depth: 7 0.9074040992469785
alpha: 500 max_depth: 8 0.893007821863386
alpha: 500 max_depth: 9 0.8905671034781547



HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))

alpha: 1000 max_depth: 3 1.0567499821191109
alpha: 1000 max_depth: 4 0.9812982790524966
alpha: 1000 max_depth: 5 0.9492731415320041
alpha: 1000 max_depth: 6 0.9255190208336755
alpha: 1000 max_depth: 7 0.9070416406485892
alpha: 1000 max_depth: 8 0.8945881211790524
alpha: 1000 max_depth: 9 0.8943545473711503



HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))

alpha: 2000 max_depth: 3 1.0617665339890883
alpha: 2000 max_depth: 4 0.9897208327720557
alpha: 2000 max_depth: 5 0.954218713934868
alpha: 2000 max_depth: 6 0.92356496178057
alpha: 2000 max_depth: 7 0.9080316828485635
alpha: 2000 max_depth: 8 0.8991783825465772
alpha: 2000 max_depth: 9 0.8938754629778859




In [0]:
print('Best Param')
print(error_log.iloc[np.argmin(log_error_array)])

Best Param
n_estimators    500.000000
max_depth         9.000000
log_loss          0.890567
Name: 20, dtype: float64


In [0]:
i = error_log.n_estimators.iloc[np.argmin(log_error_array)]
j = error_log.max_depth.iloc[np.argmin(log_error_array)]

rf = RandomForestClassifier(n_estimators=i, max_depth=j, random_state=42)
rf.fit(X_train_resp, y_train)
    
rf_model=CalibratedClassifierCV(base_estimator=rf,method='sigmoid', cv='prefit')
rf_model.fit(X_cv_resp, y_cv)

#Predict probablity
prob_rf = svm_model.predict_proba(X_test_resp)
print("Logloss for n_neighbors:",round(log_loss(y_test, prob_rf),2), "Accuracy:", round(rf_model.score(X_test_resp,y_test),3))

Logloss for n_neighbors: 1.58 Accuracy: 0.697


##Stacking Model

In [0]:
#http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier

from sklearn import model_selection

NB_clf = MultinomialNB(alpha=0.01)
KN_clf = KNeighborsClassifier(n_neighbors=20)
lr_clf = LogisticRegression(C=10,class_weight='balanced')
SVC_clf = SVC(C=0.1, kernel='sigmoid', decision_function_shape='ovo', class_weight='balanced')
rf_clf = RandomForestClassifier(n_estimators=1000, max_depth=9, random_state=42)

sc_clf = StackingCVClassifier(classifiers=[NB_clf, KN_clf, lr_clf, SVC_clf, rf_clf], meta_classifier=lr_clf, use_probas=True,)

for clf, label in zip([NB_clf, KN_clf, lr_clf, SVC_clf,rf_clf,sc_clf], 
                      ['Naive Bayes', 
                       'K Nearest Nabours', 
                       'Logistic Regrression',
                       'Support Vector Machine',
                       'Random Forrest',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X_train_resp, y_train, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

Accuracy: 0.56 (+/- 0.01) [Naive Bayes]
Accuracy: 0.55 (+/- 0.01) [K Nearest Nabours]
Accuracy: 0.63 (+/- 0.02) [Logistic Regrression]
Accuracy: 0.43 (+/- 0.01) [Support Vector Machine]
Accuracy: 0.87 (+/- 0.01) [Random Forrest]
Accuracy: nan (+/- nan) [StackingClassifier]


In [0]:
#http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier

from sklearn import model_selection

NB_clf = MultinomialNB(alpha=0.01)
KN_clf = KNeighborsClassifier(n_neighbors=20)
lr_clf = LogisticRegression(C=10,class_weight='balanced')
SVC_clf = SVC(C=0.1, kernel='sigmoid', decision_function_shape='ovo', class_weight='balanced')
rf_clf = RandomForestClassifier(n_estimators=1000, max_depth=9, random_state=42)

sc_clf = StackingCVClassifier(classifiers=[NB_clf, KN_clf,SVC_clf, rf_clf], meta_classifier=lr_clf, use_probas=True,)

for clf, label in zip([NB_clf, KN_clf, SVC_clf,rf_clf,sc_clf], 
                      ['Naive Bayes', 
                       'K Nearest Nabours', 
                       'Support Vector Machine',
                       'Random Forrest',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X_train_resp, y_train, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

Accuracy: 0.56 (+/- 0.01) [Naive Bayes]
Accuracy: 0.55 (+/- 0.01) [K Nearest Nabours]
Accuracy: 0.43 (+/- 0.01) [Support Vector Machine]
Accuracy: 0.87 (+/- 0.01) [Random Forrest]
Accuracy: nan (+/- nan) [StackingClassifier]


In [0]:
#http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier

from sklearn import model_selection

NB_clf = MultinomialNB(alpha=0.01)
KN_clf = KNeighborsClassifier(n_neighbors=20)
lr_clf = LogisticRegression(C=10,class_weight='balanced')
SVC_clf = SVC(C=0.1, kernel='sigmoid', decision_function_shape='ovo', class_weight='balanced',probability=True)
rf_clf = RandomForestClassifier(n_estimators=1000, max_depth=9, random_state=42)

sc_clf = StackingCVClassifier(classifiers=[NB_clf, KN_clf, lr_clf, SVC_clf, rf_clf], meta_classifier=lr_clf, use_probas=True,)

for clf, label in zip([NB_clf, KN_clf, lr_clf, SVC_clf,rf_clf,sc_clf], 
                      ['Naive Bayes', 
                       'K Nearest Nabours', 
                       'Logistic Regrression',
                       'Support Vector Machine',
                       'Random Forrest',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X_train_one_hot, y_train, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

Accuracy: 0.54 (+/- 0.01) [Naive Bayes]
Accuracy: 0.54 (+/- 0.01) [K Nearest Nabours]
Accuracy: 0.62 (+/- 0.02) [Logistic Regrression]
Accuracy: 0.43 (+/- 0.05) [Support Vector Machine]
Accuracy: 0.63 (+/- 0.02) [Random Forrest]
Accuracy: nan (+/- nan) [StackingClassifier]


###Stacking Model (manually)

In [0]:
NB_clf = MultinomialNB(alpha=0.01)
KN_clf = KNeighborsClassifier(n_neighbors=20)
lr_clf = LogisticRegression(C=10,class_weight='balanced')
SVC_clf = SVC(C=0.1, kernel='sigmoid', decision_function_shape='ovo', class_weight='balanced',probability=True)
rf_clf = RandomForestClassifier(n_estimators=1000, max_depth=9, random_state=42,)

NB_clf.fit(X_train_resp, y_train)
KN_clf.fit(X_train_resp, y_train)
lr_clf.fit(X_train_resp, y_train)
SVC_clf.fit(X_train_resp, y_train)
rf_clf.fit(X_train_resp, y_train)

prob_nb = NB_clf.predict_proba(X_train_resp)
prob_knn = KN_clf.predict_proba(X_train_resp)
prob_lr = lr_clf.predict_proba(X_train_resp)
prob_svc = SVC_clf.predict_proba(X_train_resp)
prob_rf = rf_clf.predict_proba(X_train_resp)

In [0]:
input = np.hstack((prob_nb,prob_knn,prob_lr,prob_svc,prob_rf))
st_clf = LogisticRegression(C=10,class_weight='balanced')
st_clf.fit(input, y_train)

LogisticRegression(C=10, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
prob_nb = NB_clf.predict_proba(X_test_resp)
prob_knn = KN_clf.predict_proba(X_test_resp)
prob_lr = lr_clf.predict_proba(X_test_resp)
prob_svc = SVC_clf.predict_proba(X_test_resp)
prob_rf = rf_clf.predict_proba(X_test_resp)

test = np.hstack((prob_nb,prob_knn,prob_lr,prob_svc,prob_rf))

#Predict probablity
prob = st_clf.predict_proba(test)

print("Logloss for Stacking:",round(log_loss(y_test, prob),2), "Accuracy:", round(st_clf.score(test,y_test),3))

Logloss for Stacking: 1.04 Accuracy: 0.661
