# Personalized Cancer Diagnosis

## 1. Business Problem

### 1.1 Problem Description

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

Data: Memorial Sloan Kettering Cancer Center (MSKCC)

Download training_variants.zip and training_text.zip from Kaggle.

**Context:**
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

**Problem Statement:**
Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

### 1.2 Source/Useful Links

Some articles and reference blogs about the problem statement:
1. https://www.forbes.com/sites/matthewherper/2017/06/03/a-new-cancer-drug-helped-almost-everyone-who-took-it-almost-heres-what-it-teaches-us/#2a44ee2f6b25
2. https://www.youtube.com/watch?v=UwbuW7oK8rk
3. https://www.youtube.com/watch?v=qxXRKVompI8

### 1.3 Real-World/Business objectives and constraints

- No latency requirement.
- Interpretability is important.
- Errors can be very costly.
- Probability of a data-point belonging to each class is needed.

## 2. Machine Learning Problem Formulation

### 2.1 Data Overview

- Source:  https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
- We have two data files: one contains information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
- Both data files have a common column called ID.
- Data file information:
    - training_variants (ID, Gene, Variation, Class)
    - training_text (ID, Text)

### 2.2 Mapping the real-world problem to an ML Problem

### 2.2.1 Type of Machine Learning Problem

There are 9 different classes a genetic mutation can be classified into => Multiclass Classification Problem.

### 2.2.2 Performance Metric
Metrics:
- Multiclass log-loss
- Confusion Matrix
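
As a rough illustration of the multiclass log-loss metric listed above, the sketch below computes it in plain Python. The function name `multiclass_log_loss` and the toy labels and probabilities are assumptions made for illustration; they are not part of the original notebook.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true: list of class labels numbered 1..K (as in this dataset).
    y_prob: list of rows, each row holding K probabilities that sum to 1.
    """
    total = 0.0
    for label, row in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(row[label - 1], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Toy example with 3 classes (hypothetical values).
y_true = [1, 3]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.1, 0.8]]
loss = multiclass_log_loss(y_true, y_prob)
print(round(loss, 4))

# A uniform guess over nine classes gives a loss of ln(9).
uniform_loss = multiclass_log_loss([5], [[1 / 9] * 9])
print(round(uniform_loss, 4))
```

Because the loss grows without bound as the probability of the true class approaches zero, confident wrong predictions are penalized heavily, which fits the constraint that errors are costly.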

### 2.2.3 Machine Learning Objectives and Constraints
**Objective:** Predict the probability of each data-point belonging to each of the nine classes.

**Constraints:**
- Interpretability
- Class probabilities are needed.
- Penalize the errors in class probabilities => Metric is log-loss.
- No latency constraints.

### 2.3 Train, CV and Test Datasets
Randomly split the dataset into three parts: train, cross validation, and test, with 64%, 16%, and 20% of the data respectively.
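
The 64/16/20 proportions follow from holding out 20% for test and then 20% of the remaining 80% for cross validation. The sketch below works through that arithmetic for the 3,321 rows reported for the variants file in this notebook. The helper name `split_sizes` is hypothetical, and rounding the held-out part up is an assumption about the splitter's behaviour rather than something the notebook states.

```python
import math

def split_sizes(n_samples, test_frac=0.2, cv_frac=0.2):
    """Sizes produced by two successive hold-out splits.

    Assumption: the held-out part of each split is rounded up.
    """
    n_test = math.ceil(test_frac * n_samples)
    n_rest = n_samples - n_test
    n_cv = math.ceil(cv_frac * n_rest)
    n_train = n_rest - n_cv
    return n_train, n_cv, n_test

n = 3321  # number of rows reported for training_variants
n_train, n_cv, n_test = split_sizes(n)
for name, size in [("train", n_train), ("cv", n_cv), ("test", n_test)]:
    print(name, size, f"{100 * size / n:.1f}%")
```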

## 3. Exploratory Data Analysis

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from imblearn.over_sampling import SMOTE
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from collections import Counter, defaultdict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
 
import math
from sklearn.metrics import normalized_mutual_info_score
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings("ignore")

from mlxtend.classifier import StackingClassifier
from sklearn.linear_model import LogisticRegression

from mlxtend.plotting import plot_decision_regions

### 3.1 Reading the Data

In [2]:
data= pd.read_csv("training_variants")
print('Number of data points:',data.shape[0])
print('Number of features:',data.shape[1])
print('Features:',data.columns.values)
data.head()

Number of data points: 3321
Number of features: 4
Features: ['ID' 'Gene' 'Variation' 'Class']


|   | ID | Gene | Variation | Class |
|---|----|------|-----------|-------|
| 0 | 0 | FAM58A | Truncating Mutations | 1 |
| 1 | 1 | CBL | W802* | 2 |
| 2 | 2 | CBL | Q249E | 2 |
| 3 | 3 | CBL | N454D | 3 |
| 4 | 4 | CBL | L399V | 4 |


#### 3.1.1 Reading text data:

In [3]:
# the file uses "||" as its separator; a raw string keeps the regex escapes intact
data_text= pd.read_csv("training_text",sep= r"\|\|",engine="python",names= ["ID","Text"],skiprows= 1)
print("Number of data points:",data_text.shape[0])
print('Number of features:',data_text.shape[1])
print('Features:',data_text.columns.values)
data_text.head()

Number of data points: 3321
Number of features: 2
Features: ['ID' 'Text']


|   | ID | Text |
|---|----|------|
| 0 | 0 | Cyclin-dependent kinases (CDKs) regulate a var... |
| 1 | 1 | Abstract Background Non-small cell lung canc... |
| 2 | 2 | Abstract Background Non-small cell lung canc... |
| 3 | 3 | Recent evidence has demonstrated that acquired... |
| 4 | 4 | Oncogenic mutations in the monomeric Casitas B... |


#### 3.1.2 Preprocessing the text:


In [4]:
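# load the English stop words from NLTK
stop_words= set(stopwords.words('english'))

def nlp_preprocessing(total_text,index,column):
    if type(total_text) is not int:
        string= ""
        # replace every special character with a space
        total_text= re.sub(r'[^a-zA-Z0-9\n]',' ', total_text)
        # collapse runs of whitespace into a single space
        total_text= re.sub(r'\s+',' ',total_text)
        # convert all characters to lower case
        total_text= total_text.lower()

        for word in total_text.split():
            # keep only the words that are not stop words
            if word not in stop_words:
                string+=word+" "

        # write the cleaned text back once, after the whole document is processed
        data_text.loc[index,column]= string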
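
To make the cleaning steps easier to check in isolation, here is a minimal, self-contained sketch of the same idea that returns the cleaned string instead of writing into `data_text`. The name `clean_text`, the tiny `STOP_WORDS` set (a stand-in for NLTK's English list), and the sample sentence are all made up for illustration.

```python
import re

# Small illustrative stop-word set; the notebook itself uses NLTK's English list.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "to"}

def clean_text(text, stop_words=STOP_WORDS):
    """Return lower-cased text with special characters and stop words removed."""
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)  # special characters -> space
    text = re.sub(r"\s+", " ", text).lower()     # collapse whitespace, lower-case
    return " ".join(w for w in text.split() if w not in stop_words)

# Made-up sentence, used only to exercise the function.
sample = "Mutations in the CBL gene (W802*) were observed in a subset of tumours."
cleaned = clean_text(sample)
print(cleaned)
```

A function of this shape could also be applied column-wise, for example with `data_text['Text'].apply(clean_text)`, which avoids the explicit row loop; whether that is preferable here is a design choice rather than something the notebook prescribes.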
        

In [None]:
# time.clock() was removed in Python 3.8; time.perf_counter() is its replacement
start_time= time.perf_counter()
for index,row in data_text.iterrows():
    if type(row["Text"]) is str:
        nlp_preprocessing(row['Text'],index, 'Text')
    else:
        print('There is no description for',index)
print('time taken for processing the text:',time.perf_counter()-start_time,' seconds')

In [None]:
result= pd.merge(data,data_text,on="ID",how='left')
result.head()

In [None]:
result[result.isnull().any(axis= 1)]

In [None]:
result.loc[result['Text'].isnull(),'Text']= result['Gene']+ ' '+ result['Variation']

In [None]:
result[result['ID']== 1109]

### Train, Test and Cross Validation Split

#### Splitting the dataset into Train, Cross Validation and Test (64:16:20)

In [None]:
y_true= result['Class'].values
# replace whitespace with underscores so each gene/variation is a single token
result.Gene= result.Gene.str.replace(r'\s+','_',regex= True)
result.Variation= result.Variation.str.replace(r'\s+','_',regex= True)

# split into train and test, keeping the class distribution of y_true (stratify)
x_train, x_test, y_train, y_test= train_test_split(result, y_true, stratify= y_true, test_size= 0.2)
# split the remaining 80% into train and cross validation, again stratified
train_data, cv_data, y_train, y_cv= train_test_split(x_train, y_train, stratify= y_train, test_size= 0.2)


In [None]:
print("Number of data points in train data:", train_data.shape[0])
print("Number of data points in test data:", x_test.shape[0])
print("Number of data points in cross validation data:", cv_data.shape[0])

#### Distribution of y_i in train, test and cross validation dataset:

In [None]:
# count the data points per class and order the counts by class label
train_dist= train_data['Class'].value_counts().sort_index()
test_dist= x_test['Class'].value_counts().sort_index()
cv_dist= cv_data['Class'].value_counts().sort_index()

my_colors= 'rgbkymc'
train_dist.plot(kind= 'bar')
plt.xlabel('Class')
plt.ylabel('Data points per class:')
plt.title("Distribution of yi in train data")
plt.grid()
plt.show()

# indices of the classes in decreasing order of frequency
sorted_yi= np.argsort(-train_dist.values)
for i in sorted_yi:
    print('Number of data points in class',i+1,':',train_dist.values[i],'(',np.round((train_dist.values[i]/train_data.shape[0]*100),3),'%)')
print('-'*80)

my_colors= 'rgbkymc'
test_dist.plot(kind= 'bar')
plt.xlabel('Class')
plt.ylabel('Data points per class:')
plt.title("Distribution of yi in test data")
plt.grid()
plt.show()

# indices of the classes in decreasing order of frequency
sorted_yi= np.argsort(-test_dist.values)
for i in sorted_yi:
    print('Number of data points in class',i+1,':',test_dist.values[i],'(',np.round((test_dist.values[i]/x_test.shape[0]*100),3),'%)')
print('-'*80)
my_colors= 'rgbkymc'
cv_dist.plot(kind= 'bar')
plt.xlabel('Class')
plt.ylabel('Data points per class:')
plt.title("Distribution of yi in cross validation data")
plt.grid()
plt.show()

# indices of the classes in decreasing order of frequency in the cross validation data
sorted_yi= np.argsort(-cv_dist.values)
for i in sorted_yi:
    print('Number of data points in class',i+1,':',cv_dist.values[i],'(',np.round((cv_dist.values[i]/cv_data.shape[0]*100),3),'%)')
    

## Prediction using Random Model

In a random model, we generate the nine class probabilities randomly such that they sum to 1.
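
One way to sketch such a random model in plain Python is to draw nine positive random numbers per data point and divide each by their sum. The function names, the seed, and the example labels below are hypothetical and serve only to illustrate the idea; the log-loss obtained this way can act as a rough baseline that a useful model should beat.

```python
import math
import random

def random_class_probs(n_classes=9, rng=random):
    """Draw n_classes positive random numbers and normalise them to sum to 1."""
    # The tiny offset guards against an exact zero, which would break log().
    raw = [rng.random() + 1e-12 for _ in range(n_classes)]
    total = sum(raw)
    return [r / total for r in raw]

def log_loss_of_random_model(y_true, n_classes=9, seed=0):
    """Mean negative log-probability of the true class under random predictions."""
    rng = random.Random(seed)
    losses = []
    for label in y_true:  # labels are numbered 1..n_classes
        probs = random_class_probs(n_classes, rng)
        losses.append(-math.log(probs[label - 1]))
    return sum(losses) / len(losses)

# Hypothetical labels, only to exercise the functions.
labels = [1, 2, 2, 3, 4, 7, 9, 5, 6, 8]
probs = random_class_probs()
print(sum(probs))
print(log_loss_of_random_model(labels))
```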