# jovian-ml-presentation


**Once sequenced, a cancer tumor can have thousands of genetic mutations. The challenge is distinguishing the
mutations that contribute to tumor growth (drivers) from the neutral ones (passengers). Currently, this
interpretation is done manually: a clinical pathologist reviews and classifies every single genetic mutation
based on evidence from text-based clinical literature, which is very time-consuming. For this competition,
MSKCC (Memorial Sloan Kettering Cancer Center) is making available an expert-annotated knowledge base in which
world-class researchers and oncologists have manually annotated thousands of mutations. The goal is for data
scientists to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically
classifies genetic variations.**

In [2]:
# https://github.com/osabnis/PredictingGeneticvariations/blob/master/Genetic_Variations.py
# Dataset information:
#     Four Files:
#         Training Variants - Contains a description of the mutations.
#                             Fields - ID, Gene, Variation and Class
#                             Delimiter - ,
#         Training Text - Contains the clinical evidence (text) used to classify the gene.
#                         Fields - ID,Text
#                         Delimiter - ||
#         Test Variants - Contains a description of the mutations to be classified (no Class field).
#                         Fields - ID, Gene and Variation
#                         Delimiter - ,
#         Test Text - Contains the clinical evidence used to classify the gene.
#                     Fields - ID, Text
#                     Delimiter - ||
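
The `||`-delimited text files described above cannot be read with a single-character separator. Below is a minimal sketch of one way to parse that layout with pandas. It uses a made-up in-memory sample rather than the competition files: the sample rows, the `sample_text` and `sample_df` names, and the assumption that the header line is comma-separated are all illustrative.

```python
import io

import pandas as pd

# Made-up sample mimicking the layout: a header line, then ID||Text rows.
sample_text = (
    "ID,Text\n"
    "0||Cyclin-dependent kinases (CDKs) regulate cell cycle, among others\n"
    "1||Abstract Background Non-small cell lung cancer\n"
)

# A multi-character separator is interpreted as a regular expression,
# which requires the Python parsing engine.
# The header line is assumed to use a comma, so skip it and name the columns by hand.
sample_df = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\|\|",
    engine="python",
    skiprows=1,
    names=["ID", "Text"],
)

print(sample_df.shape)
```

Because the separator is `||`, commas inside the clinical text stay part of the `Text` field.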

In [3]:
ls

 Volume in drive C is Windows
 Volume Serial Number is A67F-E478

 Directory of C:\Users\anubr\jupyter_notebooks\jovian-ml-project

12/07/2021  10:28 PM    <DIR>          .
12/03/2021  10:04 PM    <DIR>          ..
12/04/2021  11:34 AM             2,048 .gitignore
12/07/2021  10:24 PM    <DIR>          .ipynb_checkpoints
12/03/2021  11:30 PM                23 .jovianrc
11/28/2021  09:26 AM             6,456 code_02_XX Reading Data.ipynb
11/28/2021  09:37 AM             8,006 code_03_XX Text Cleansing and Extraction.ipynb
11/28/2021  09:50 AM             7,356 code_04_XX Advanced Text Processing.ipynb
12/07/2021  10:28 PM            27,371 jovian-ml-presentation.ipynb
12/05/2021  11:35 AM         4,422,201 jovian-ml-project.ipynb
12/04/2021  11:32 AM           265,891 pattern_matching_algorithms.pdf
12/03/2021  10:04 PM                84 README.md
12/01/2021  10:03 PM       103,787,620 test_text.zip
12/01/2021  10:02 PM            48,614 test_variants.zip
12/01/2021  10:03 PM        63,

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import xml.etree.ElementTree as ET
import time
import re
from tqdm import tqdm
%matplotlib inline

In [5]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import warnings
warnings.simplefilter(action='ignore')

In [6]:
# Raw string for the regex separator avoids an invalid-escape warning for "\|".
train_text_df = pd.read_csv('training_text.zip', sep=r"\|\|", encoding="utf-8", engine="python", skiprows=1, names=["ID", "Text"], compression='zip')

In [7]:
train_vari_df = pd.read_csv('training_variants.zip',compression='zip')

In [8]:
train_text_df.head(n=3)

Unnamed: 0,ID,Text
0,0,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,Abstract Background Non-small cell lung canc...
2,2,Abstract Background Non-small cell lung canc...


In [9]:
train_vari_df.head(n=3)

Unnamed: 0,ID,Gene,Variation,Class
0,0,FAM58A,Truncating Mutations,1
1,1,CBL,W802*,2
2,2,CBL,Q249E,2


In [10]:
merged_df=train_vari_df.merge(train_text_df,on= 'ID')

In [11]:
merged_df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text
0,0,FAM58A,Truncating Mutations,1,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,CBL,W802*,2,Abstract Background Non-small cell lung canc...
2,2,CBL,Q249E,2,Abstract Background Non-small cell lung canc...
3,3,CBL,N454D,3,Recent evidence has demonstrated that acquired...
4,4,CBL,L399V,4,Oncogenic mutations in the monomeric Casitas B...


In [12]:
!pip install nltk





In [13]:
import nltk
nltk.download('words')
# Alias the corpus reader so the name `words` is not rebound over the import.
from nltk.corpus import words as words_corpus
words = words_corpus.words()

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\anubr\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [14]:
nltk_words=set(words)

In [15]:
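def reduce_words(data):
    """Drop tokens found in the NLTK word list, keeping domain-specific terms.

    Matching is exact and case-sensitive, so a token like "processes." is kept.
    """
    tokens = data.split()
    kept = [token for token in tokens if token not in nltk_words]
    return " ".join(kept)

# Register progress_map/progress_apply on pandas objects.
tqdm.pandas()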
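
The filtering step above can be tried without downloading the NLTK corpus. The sketch below is a hedged illustration, not the notebook's exact function: `toy_vocabulary` is a hypothetical stand-in for the NLTK word list, and the vocabulary is passed as a parameter instead of being read from a global.

```python
# Hypothetical stand-in for the NLTK word list used in the notebook.
toy_vocabulary = {"the", "mutation", "in", "was", "found", "and", "is"}


def reduce_words(text, vocabulary):
    """Keep only whitespace-separated tokens that are absent from `vocabulary`."""
    return " ".join(token for token in text.split() if token not in vocabulary)


sample = "The BRAF V600E mutation was found in NSCLC"
reduced = reduce_words(sample, toy_vocabulary)
print(reduced)
```

Note that "The" survives because the lookup is case-sensitive. Lowercasing each token before the membership test would be one way to change that, at the cost of also matching gene symbols that happen to spell English words.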

In [16]:
merged_df['cleaned_text'] = merged_df['Text'].progress_map(reduce_words,na_action = 'ignore')

100%|█████████▉| 3316/3321 [00:13<00:00, 250.49it/s]


In [17]:
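# str(x) turns a missing value (NaN) into the string 'nan', so rows without text get a length of 3.
merged_df['Text_len'] = merged_df['Text'].map(lambda x: len(str(x)))
merged_df['cleaned_text_len'] = merged_df['cleaned_text'].map(lambda x: len(str(x)))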

In [18]:
merged_df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text,cleaned_text,Text_len,cleaned_text_len
0,0,FAM58A,Truncating Mutations,1,Cyclin-dependent kinases (CDKs) regulate a var...,Cyclin-dependent kinases (CDKs) processes. CDK...,39672,20462
1,1,CBL,W802*,2,Abstract Background Non-small cell lung canc...,Abstract Background Non-small (NSCLC) disorder...,36691,18897
2,2,CBL,Q249E,2,Abstract Background Non-small cell lung canc...,Abstract Background Non-small (NSCLC) disorder...,36691,18897
3,3,CBL,N454D,3,Recent evidence has demonstrated that acquired...,Recent has demonstrated disomy (aUPD) mutation...,36238,17618
4,4,CBL,L399V,4,Oncogenic mutations in the monomeric Casitas B...,Oncogenic mutations Casitas B-lineage (Cbl) tu...,41308,19425


In [19]:
merged_df[merged_df['cleaned_text'].map(lambda x:not isinstance(x,str))]

Unnamed: 0,ID,Gene,Variation,Class,Text,cleaned_text,Text_len,cleaned_text_len
1109,1109,FANCA,S1088F,1,,,3,3
1277,1277,ARID5B,Truncating Mutations,1,,,3,3
1407,1407,FGFR3,K508M,6,,,3,3
1639,1639,FLT1,Amplification,6,,,3,3
2755,2755,BRAF,G596C,7,,,3,3


In [20]:
idx=merged_df[merged_df['cleaned_text'].map(lambda x:not isinstance(x,str))].index

In [21]:
merged_df.loc[idx,'cleaned_text']='NA'

In [22]:
merged_df[merged_df['cleaned_text'].map(lambda x:not isinstance(x,str))]

Unnamed: 0,ID,Gene,Variation,Class,Text,cleaned_text,Text_len,cleaned_text_len
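
The five rows found above have no text at all, and their reported length of 3 comes from `len(str(NaN))`. A minimal sketch on a made-up frame (`toy_df` and its contents are illustrative, not taken from the dataset) shows the effect. It also shows `isna()` as a shorter way to select the same rows, assuming the column holds only strings and NaN.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing text, standing in for merged_df.
toy_df = pd.DataFrame(
    {"ID": [0, 1, 2], "Text": ["some evidence", np.nan, "more evidence"]}
)

# str(NaN) is 'nan', so the length column reports 3 for the missing row.
toy_df["Text_len"] = toy_df["Text"].map(lambda x: len(str(x)))

# For a column of strings and NaN, isna() flags the rows that are not strings.
missing_mask = toy_df["Text"].isna()

# Replace the missing entries with a placeholder string, as the notebook does.
toy_df.loc[missing_mask, "Text"] = "NA"

print(toy_df)
```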


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
tfidf = TfidfVectorizer(
    min_df=1, max_features=1600, strip_accents='unicode',lowercase =True,
    analyzer='word', token_pattern=r'\w+', ngram_range=(1, 3), use_idf=True, 
    smooth_idf=True, sublinear_tf=True, stop_words = 'english')
X_train_tfidf = tfidf.fit_transform(merged_df['cleaned_text']).toarray()

In [25]:
X_train_tfidf.shape

(3321, 1600)
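
The shape above is (documents, features): `max_features=1600` caps the vocabulary at the 1600 most frequent n-grams. A small sketch on a made-up corpus shows the same capping behaviour. The documents and the limit of 5 are illustrative assumptions, and only a subset of the notebook's parameters is used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up three-document corpus standing in for the cleaned clinical text.
docs = [
    "braf v600e kinase",
    "egfr l858r kinase domain",
    "braf g596c kinase",
]

# Unigrams to trigrams, keeping only the 5 most frequent n-grams.
vectorizer = TfidfVectorizer(
    token_pattern=r"\w+", ngram_range=(1, 3), max_features=5, sublinear_tf=True
)

# fit_transform returns a sparse matrix; .toarray() is only needed for dense consumers.
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Keeping the result sparse avoids materialising every zero entry, which may matter more as the vocabulary limit grows.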

In [27]:
X_train_tfidf[1,:]

array([0.0554793 , 0.        , 0.        , ..., 0.07052438, 0.04119452,
       0.04490162])

In [28]:
from numpy import asarray
from numpy import savetxt

In [29]:
X_train_array=asarray(X_train_tfidf)

In [30]:
savetxt('X_train_tfidf.csv',X_train_array,delimiter=",")
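
To confirm that a matrix saved this way can be loaded back unchanged, a round trip through `np.loadtxt` is a quick check. The sketch below uses a small made-up array and a temporary directory, so it does not touch `X_train_tfidf.csv`. The `features` values and file name are illustrative.

```python
import os
import tempfile

import numpy as np

# Small made-up matrix standing in for the TF-IDF features.
features = np.array([[0.0554793, 0.0, 0.07052438], [0.0, 0.5, 0.25]])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "features.csv")
    # Same call pattern as the notebook: comma-delimited text output.
    np.savetxt(csv_path, features, delimiter=",")
    # Read the file back with the matching delimiter.
    restored = np.loadtxt(csv_path, delimiter=",")

round_trip_ok = np.allclose(features, restored)
print(restored.shape, round_trip_ok)
```

For a mostly-zero matrix of this size, a text CSV can be bulky. NumPy's binary `np.save` format is a common alternative when the file is only read back from Python.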