**Problem statement of the project**

A lot has been said during the past several years about how precision medicine and, more concretely, how genetic testing is going to disrupt the way diseases are treated.

But this is only partially happening, due to the huge amount of manual work still required. Once sequenced, a cancer tumor can have thousands of genetic mutations. The challenge is distinguishing the mutations that contribute to tumor growth (drivers) from neutral mutations (passengers).

Currently, this interpretation of genetic mutations is done manually. It is a very time-consuming task in which a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature.

We need to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

This problem was posed as a competition on Kaggle, launched by Memorial Sloan Kettering Cancer Center (MSKCC).

The dataset is in the training folder and includes training_variants.zip and training_text.zip.



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [3]:
#!pip install <package_name>

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from imblearn.over_sampling import SMOTE
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import StratifiedKFold 
from collections import Counter, defaultdict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import math
from sklearn.metrics import normalized_mutual_info_score
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings("ignore")

from mlxtend.classifier import StackingClassifier
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression



In [5]:
# There are two data files, both inside the training folder.
# Load training_variants. It is a comma-separated file.
data_variants = pd.read_csv('training/training_variants')
# Load the training_text dataset. Its fields are separated by "||", so the
# separator is passed as a raw-string regex (this requires the python engine).
data_text = pd.read_csv("training/training_text", sep=r"\|\|", engine="python", names=["ID", "TEXT"], skiprows=1)

In [6]:
data_variants.head(3)

Unnamed: 0,ID,Gene,Variation,Class
0,0,FAM58A,Truncating Mutations,1
1,1,CBL,W802*,2
2,2,CBL,Q249E,2


**There are 4 fields above:**


1.   ID: row id used to link the mutation to the clinical evidence
2.   Gene: the gene where this genetic mutation is located
3.   Variation: the amino acid change for this mutation
4.   Class: the class (1-9) this genetic mutation has been assigned to





In [7]:
data_variants.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3321 entries, 0 to 3320
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         3321 non-null   int64 
 1   Gene       3321 non-null   object
 2   Variation  3321 non-null   object
 3   Class      3321 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 103.9+ KB


In [8]:
data_variants.describe()

Unnamed: 0,ID,Class
count,3321.0,3321.0
mean,1660.0,4.365854
std,958.834449,2.309781
min,0.0,1.0
25%,830.0,2.0
50%,1660.0,4.0
75%,2490.0,7.0
max,3320.0,9.0


In [10]:
# Check the dimensions of the data
data_variants.shape

(3321, 4)

In [11]:
# Check the columns of the above dataset
data_variants.columns

Index(['ID', 'Gene', 'Variation', 'Class'], dtype='object')

In [12]:
# Now let's explore data_text
data_text.head(3)

Unnamed: 0,ID,TEXT
0,0,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,Abstract Background Non-small cell lung canc...
2,2,Abstract Background Non-small cell lung canc...


In [13]:
# The above dataset has 2 columns: ID and TEXT. The ID column is common to both datasets. Let's keep exploring it.
data_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3321 entries, 0 to 3320
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      3321 non-null   int64 
 1   TEXT    3316 non-null   object
dtypes: int64(1), object(1)
memory usage: 52.0+ KB
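
The `info()` output above reports 3316 non-null `TEXT` values out of 3321 rows, so five rows have no clinical text. A minimal sketch of how such rows could be located is shown here on a small made-up frame; `toy_text` and its contents are assumptions for illustration, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for data_text (the real frame has 3321 rows).
toy_text = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "TEXT": ["Cyclin-dependent kinases ...", None, "Abstract Background ...", None],
})

# Number of missing values in each column.
missing_counts = toy_text.isnull().sum()

# IDs of the rows whose TEXT is missing.
missing_ids = toy_text.loc[toy_text["TEXT"].isnull(), "ID"].tolist()

print(missing_counts)
print(missing_ids)
```

The same two expressions should work unchanged on `data_text`; what to do with the affected rows (drop them or fill in a placeholder) is a separate decision.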


In [14]:
data_text.describe()

Unnamed: 0,ID
count,3321.0
mean,1660.0
std,958.834449
min,0.0
25%,830.0
50%,1660.0
75%,2490.0
max,3320.0


In [15]:
data_text.columns

Index(['ID', 'TEXT'], dtype='object')

In [16]:
data_text.shape

(3321, 2)

Summary:

In short, our datasets look like this:


*  data_variants (ID, Gene, Variation, Class)
*  data_text (ID, TEXT)
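
Since both frames share the `ID` column, they can be joined into a single table. The following is only a sketch on toy frames that mirror the two schemas; the names `toy_variants`, `toy_text` and `merged`, and the choice of a left join, are assumptions rather than something the notebook has done so far.

```python
import pandas as pd

# Toy frames with the same columns as data_variants and data_text.
toy_variants = pd.DataFrame({
    "ID": [0, 1, 2],
    "Gene": ["FAM58A", "CBL", "CBL"],
    "Variation": ["Truncating Mutations", "W802*", "Q249E"],
    "Class": [1, 2, 2],
})
toy_text = pd.DataFrame({
    "ID": [0, 1, 2],
    "TEXT": ["text for row 0", "text for row 1", "text for row 2"],
})

# A left join keeps every variant row, even if its text were missing.
merged = pd.merge(toy_variants, toy_text, on="ID", how="left")

print(merged.columns.tolist())
print(merged.shape)
```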

Now that we understand the dataset, let's look at the same problem from a machine learning point of view.

We want to predict the class of each genetic mutation. The question now is what kind of data is present in the Class column.



In [17]:
data_variants.Class.unique()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

This is discrete data, so it is a classification problem. Since there are more than two possible discrete outputs, we can call it a multi-class classification problem.


**Important note**: this is a medical problem, so correct results are very important. Errors can be very costly here, so we need the model to output a probability for each class rather than only a single predicted label. We are less concerned about the time the ML algorithm takes, as long as it is reasonable.

We also want our model to be highly interpretable, because a medical practitioner needs to be able to explain why the ML algorithm predicts a given class.

We will evaluate our model using a confusion matrix and multi-class log loss.
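
To make the two metrics concrete, here is a small sketch on made-up predictions. It uses 3 classes instead of 9, and the arrays `y_true` and `y_prob` are invented for illustration. Multi-class log loss is the mean negative log of the probability assigned to the true class, which is why per-class probabilities are needed rather than hard labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, log_loss

# Toy setup: 3 classes (the real problem has 9) and 4 samples.
labels = [1, 2, 3]
y_true = np.array([1, 2, 3, 2])

# One row per sample, one column per class; each row sums to 1.
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],  # confident but wrong: the true class is 2
])

# Log loss by hand: pick the probability given to the true class of each row.
true_col = y_true - 1  # class k lives in column k - 1
manual_loss = -np.mean(np.log(y_prob[np.arange(len(y_true)), true_col]))

# The same quantity from scikit-learn.
sk_loss = log_loss(y_true, y_prob, labels=labels)

# Confusion matrix on the most probable class (rows: true, columns: predicted).
y_pred = np.array(labels)[y_prob.argmax(axis=1)]
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(manual_loss, sk_loss)
print(cm)
```

The last sample contributes the most to the loss because the model gave the true class only 0.3 probability. The confusion matrix records the same mistake as an off-diagonal entry, but without any notion of how confident the model was.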

So now we understand the problem statement. 