## Using TFIDF values to predict the class of genes

### Table of Contents
1. [Importing and formatting the data](#1)
2. [Creating predictions](#1.1) 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
plt.style.use('ggplot')

### Importing and formatting the data <a id=1></a>

We first need to load and preprocess the data. The plan is to evaluate the models with 5-fold cross-validation.
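As a rough illustration of what 5-fold cross-validation does, the sketch below splits sample indices into five folds using only the standard library. The helper name `k_fold_indices` and the fixed seed are assumptions made for this example and are not part of the notebook; in practice `cross_val_score` from scikit-learn takes care of the splitting.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test_idx = folds[j]
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_idx, test_idx

# 3321 is the number of documents in the training text file.
splits = list(k_fold_indices(3321, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each sample lands in exactly one test fold, so every document is used for evaluation once and for training four times.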

In [2]:
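# The text file uses '||' as its separator; a regex separator requires the Python parsing engine
df_text = pd.read_csv('../data/training_text.csv',delimiter=r'\|\|',index_col=0,encoding='utf-8',engine='python')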
df_text.columns=['Text']
df_variants = pd.read_csv('../data/training_variants.csv',index_col=0)
print(df_text.shape)
df_text.head(5)


(3321, 1)


| | Text |
|---|---|
| 0 | Cyclin-dependent kinases (CDKs) regulate a var... |
| 1 | Abstract Background Non-small cell lung canc... |
| 2 | Abstract Background Non-small cell lung canc... |
| 3 | Recent evidence has demonstrated that acquired... |
| 4 | Oncogenic mutations in the monomeric Casitas B... |


In [3]:
df_variants.head()

| ID | Gene | Variation | Class |
|---|---|---|---|
| 0 | FAM58A | Truncating Mutations | 1 |
| 1 | CBL | W802* | 2 |
| 2 | CBL | Q249E | 2 |
| 3 | CBL | N454D | 3 |
| 4 | CBL | L399V | 4 |


We now need to clean up the text. It contains several artifacts that we want to remove, such as stand-alone numbers, citation markers, and figure references. This cleanup is handled by the `process_text` function.

In [None]:
df_text.Text[0]

In [11]:
import re

def process_text(text):
    # remove stand-alone numbers, leaving one space so the neighbouring words do not merge
    text = re.sub(r'\s+([0-9,.%]+)\s+',' ',text)
    text = re.sub(r'\(ref[^)]+\)','',text) # remove references, e.g. "(ref. 1)"
    text = re.sub(r'(?:[fF]ig)(?:s)?(?:\.|ure)(?:\s*[a-zA-Z0-9]+)','',text) # remove figure callouts, e.g. "Fig. 2A"
    text = re.sub(r'[(\[][0-9, ]*[)\]]','',text) # remove numeric citations, e.g. "[1]" or "(1, 2)"
    text = re.sub(r'\([a-zA-Z]\)','',text) # remove panel labels, e.g. "(A)"
    return text
process_text(df_text.Text[0])

"Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells. The precise mechanisms by which CDK10 modulates ETS2 activity, and more generally the functions of CDK10, remain elusive. Here we demonstrate that CDK10 is a cyclin-dependent kinase by identifying cyclin M as an activating cyclin. Cyclin M, an orphan cyclin, is the product of FAM58A, whose mutations cause STAR syndrome, a human developmental anomaly whose features include toe syndactyly, telecanthus, and anogenital and renal malformations. We show that STAR syndrome-associated cyclin M mutants are unable to interact with CDK10. Cyclin M silencing phenocopies CDK1

In [12]:
df_text.Text[1]

" Abstract Background  Non-small cell lung cancer (NSCLC) is a heterogeneous group of disorders with a number of genetic and proteomic alterations. c-CBL is an E3 ubiquitin ligase and adaptor molecule important in normal homeostasis and cancer. We determined the genetic variations of c-CBL, relationship to receptor tyrosine kinases (EGFR and MET), and functionality in NSCLC.  Methods and Findings  Using archival formalin-fixed paraffin embedded (FFPE) extracted genomic DNA, we show that c-CBL mutations occur in somatic fashion for lung cancers. c-CBL mutations were not mutually exclusive of MET or EGFR mutations; however they were independent of p53 and KRAS mutations. In normal/tumor pairwise analysis, there was significant loss of heterozygosity (LOH) for the c-CBL locus (22%, n\u200a=\u200a8/37) and none of these samples revealed any mutation in the remaining copy of c-CBL. The c-CBL LOH also positively correlated with EGFR and MET mutations observed in the same samples. Using selec

In [13]:
process_text(df_text.Text[1])

" Abstract Background  Non-small cell lung cancer (NSCLC) is a heterogeneous group of disorders with a number of genetic and proteomic alterations. c-CBL is an E3 ubiquitin ligase and adaptor molecule important in normal homeostasis and cancer. We determined the genetic variations of c-CBL, relationship to receptor tyrosine kinases (EGFR and MET), and functionality in NSCLC.  Methods and Findings  Using archival formalin-fixed paraffin embedded (FFPE) extracted genomic DNA, we show that c-CBL mutations occur in somatic fashion for lung cancers. c-CBL mutations were not mutually exclusive of MET or EGFR mutations; however they were independent of p53 and KRAS mutations. In normal/tumor pairwise analysis, there was significant loss of heterozygosity (LOH) for the c-CBL locus (22%, n\u200a=\u200a8/37) and none of these samples revealed any mutation in the remaining copy of c-CBL. The c-CBL LOH also positively correlated with EGFR and MET mutations observed in the same samples. Using selec

Now that we have a processor that can clean up the text, let us apply it to every document at once.

In [14]:
df_processed = df_text.Text.map(process_text)

In [15]:
df_processed.head()

0    Cyclin-dependent kinases (CDKs) regulate a var...
1     Abstract Background  Non-small cell lung canc...
2     Abstract Background  Non-small cell lung canc...
3    Recent evidence has demonstrated that acquired...
4    Oncogenic mutations in the monomeric Casitas B...
Name: Text, dtype: object
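
The truncated excerpts above look the same before and after cleaning, so it is hard to see what the cleanup does. The sketch below runs a few simplified patterns of the same kind on a made-up sentence. The function name `strip_citations`, the sample sentence, and the exact patterns are assumptions written for this illustration; they are not the patterns used in `process_text`.

```python
import re

def strip_citations(text):
    """Remove a few citation-like fragments (simplified, illustrative patterns)."""
    text = re.sub(r'\(ref[^)]+\)', '', text)      # parenthesised references such as "(ref. 12)"
    text = re.sub(r'\[[0-9, ]+\]', '', text)      # numeric citations such as "[1, 2]"
    text = re.sub(r'\([a-zA-Z]\)', '', text)      # single-letter panel labels such as "(A)"
    return re.sub(r'\s{2,}', ' ', text).strip()   # collapse the whitespace left behind

sample = "CBL binds EGFR [1, 2] as reported (ref. 12) and shown in panel (A) of the figure."
cleaned = strip_citations(sample)
print(cleaned)  # CBL binds EGFR as reported and shown in panel of the figure.
```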

### TFIDF

Now that we have cleaned up the text data, we can compute TF-IDF features from it. I will first tokenize with a Porter stemmer, which maps inflected forms such as "running" and "runs" to the same stem "run" (irregular forms such as "ran" are left unchanged). This allows us to ignore inflection and concentrate on the meaning of the words. After we vectorize the documents this way, we can fit a simple logistic regression on the vectors.
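
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for three toy documents of stemmed tokens. It assumes the smoothed variant `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which to my knowledge matches the default behaviour of `TfidfVectorizer`. The function name `tfidf` and the toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: w / norm for t, w in vec.items()})
    return weighted

docs = [
    ['kinas', 'activ', 'mutat'],
    ['kinas', 'cyclin', 'cyclin'],
    ['kinas', 'mutat', 'resist'],
]
weights = tfidf(docs)
for doc_weights in weights:
    print({t: round(w, 3) for t, w in sorted(doc_weights.items())})
```

The stem `kinas` occurs in every document, so it gets the lowest weight in each one, while stems that are specific to a single document are weighted up.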

In [16]:
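import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import nltk
from nltk.stem.porter import PorterStemmer

# Build the stop word set and the stemmer once; rebuilding them for every word is very slow.
# Requires the NLTK stop word corpus: nltk.download('stopwords')
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))
ps = PorterStemmer()

def port_tokenizer(text):
    # lowercase, split on anything that is not a letter, digit, or hyphen, drop stop words, then stem
    words = re.split(r'[^a-z0-9-]+',text.lower())
    return [ps.stem(word) for word in words if word and word not in STOP_WORDS]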

To make classification easier, we will first only try to classify texts as '7' and 'not-7'.    

In [17]:
y = df_variants.Class.map(lambda x: x==7)

In [18]:
port_tokenizer(df_processed[0])

['cyclin-depend',
 'kinas',
 'cdk',
 'regul',
 'varieti',
 'fundament',
 'cellular',
 'process',
 'cdk10',
 'stand',
 'one',
 'last',
 'orphan',
 'cdk',
 'activ',
 'cyclin',
 'identifi',
 'kinas',
 'activ',
 'reveal',
 'previou',
 'work',
 'shown',
 'cdk10',
 'silenc',
 'increas',
 'ets2',
 'v-et',
 'erythroblastosi',
 'viru',
 'e26',
 'oncogen',
 'homolog',
 '2',
 '-driven',
 'activ',
 'mapk',
 'pathway',
 'confer',
 'tamoxifen',
 'resist',
 'breast',
 'cancer',
 'cell',
 'precis',
 'mechan',
 'cdk10',
 'modul',
 'ets2',
 'activ',
 'gener',
 'function',
 'cdk10',
 'remain',
 'elus',
 'demonstr',
 'cdk10',
 'cyclin-depend',
 'kinas',
 'identifi',
 'cyclin',
 'activ',
 'cyclin',
 'cyclin',
 'orphan',
 'cyclin',
 'product',
 'fam58a',
 'whose',
 'mutat',
 'caus',
 'star',
 'syndrom',
 'human',
 'development',
 'anomali',
 'whose',
 'featur',
 'includ',
 'toe',
 'syndactyli',
 'telecanthu',
 'anogenit',
 'renal',
 'malform',
 'show',
 'star',
 'syndrome-associ',
 'cyclin',
 'mutant',
 'un

This is taking forever. Let's try the hashing vectorizer instead and feed its output into logistic regression.

In [53]:
from sklearn.feature_extraction.text import HashingVectorizer
# Despite its name, tfidf_vect is a HashingVectorizer: it is stateless and applies no IDF weighting
tfidf_vect = HashingVectorizer(tokenizer=port_tokenizer)
# pipe = Pipeline([('tfidf',tfidf_vect),('log_reg',LogisticRegression())])
# cross_val_score(pipe,df_processed.values[:10],y[:10],n_jobs=-1,cv=5)
a = tfidf_vect.fit_transform(df_processed.values[:300])
lr = LogisticRegression()
lr.fit(a[:200],y[:200]) # train on the first 200 documents
lr.score(a[200:300],y[200:300]) # accuracy on the next 100 documents



0.68999999999999995
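
The hashing vectorizer is cheaper than a vocabulary-based one because it never stores a vocabulary: each token is hashed straight to a column index. The sketch below shows the idea with the standard library only. It uses MD5 from `hashlib` and 16 columns purely as stand-ins; scikit-learn uses a different hash function and a much larger default number of columns, and the function name `hash_vectorize` is made up for this example.

```python
import hashlib
from collections import Counter

def hash_vectorize(tokens, n_features=16):
    """Map tokens to a sparse {column: count} vector without a vocabulary."""
    counts = Counter()
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        column = int(digest, 16) % n_features  # stand-in for the library's hash function
        counts[column] += 1
    return dict(counts)

vec_a = hash_vectorize(['cdk10', 'kinas', 'cyclin', 'kinas'])
vec_b = hash_vectorize(['kinas', 'cdk10', 'kinas', 'cyclin'])
print(vec_a)
```

Because the mapping depends only on the token itself, no fitting step is needed and new documents can be vectorized one batch at a time. The price is that distinct tokens can collide in the same column and that columns cannot be mapped back to words.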

In [57]:
from sklearn.linear_model import SGDClassifier

In [77]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
n_batches=100
batch_size=100

s = SGDClassifier()
X_train,X_test,y_train,y_test = train_test_split(df_processed,df_variants.Class)
max_batches = X_train.shape[0]//batch_size # number of full batches in one pass over the training set
tfidf_vect = HashingVectorizer(tokenizer=port_tokenizer)
# vectorize a fixed held-out subset once so it can be reused for every evaluation
x_ts,y_ts = tfidf_vect.transform(X_test[:500]),y_test[:500]
classes = np.unique(df_variants.Class) # partial_fit needs the full set of labels up front
for i in range(n_batches):
    if i%max_batches==0:
        # reshuffle at the start of each pass over the training set
        X_train,y_train = shuffle(X_train,y_train)
    start = batch_size*(i%max_batches)
    x = tfidf_vect.transform(X_train[start:start+batch_size])
    y = y_train[start:start+batch_size]
    s.partial_fit(x,y,classes=classes)
    if i%5==0:
        print('iteration: {:d}\n\ttraining accuracy: {:.3f}\n\ttest accuracy: {:.3f}'
              .format(i,s.score(x,y),s.score(x_ts,y_ts)))
s.score(x_ts,y_ts) # x_ts is already vectorized, so it is scored directly
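
The loop above walks through the training set in consecutive slices and reshuffles whenever it wraps around. The sketch below isolates that index arithmetic with toy sizes so it can be checked by eye; the helper name `batch_slices` is hypothetical and the sizes are chosen only for illustration.

```python
def batch_slices(n_samples, batch_size, n_batches):
    """Yield (iteration, start, stop, new_epoch) for cyclic mini-batches."""
    # Only full batches are used, so a trailing remainder is skipped on each pass.
    max_batches = n_samples // batch_size
    for i in range(n_batches):
        position = i % max_batches
        start = batch_size * position
        yield i, start, start + batch_size, position == 0

schedule = list(batch_slices(n_samples=250, batch_size=100, n_batches=5))
for i, start, stop, new_epoch in schedule:
    note = ' (reshuffle here)' if new_epoch else ''
    print('batch {}: rows {}..{}{}'.format(i, start, stop - 1, note))
```

With 250 rows and batches of 100, only two full batches fit into one pass, so the last 50 rows of each shuffled ordering are skipped. The same happens in the notebook with whatever remainder is left after dividing the training set into batches of 100.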

In [112]:
from sklearn.utils import shuffle
from sklearn.metrics import log_loss
n_batches=50
batch_size=100
# logistic loss makes predict_proba available
# (newer scikit-learn releases spell this option loss='log_loss')
s = SGDClassifier(loss='log')
for i in range(0,n_batches):
    if i%max_batches==0:
        X_train,y_train = shuffle(X_train,y_train)
    start = batch_size*(i%max_batches)
    x = tfidf_vect.transform(X_train[start:start+batch_size])
    y = y_train[start:start+batch_size]
    s.partial_fit(x,y,classes = np.unique(df_variants.Class))
    if i%5==0:
        # class probabilities for the current batch and for the held-out subset
        y_pred = s.predict_proba(x)
        y_ts_pred = s.predict_proba(x_ts)
        print('iteration: {:d}\n\ttraining loss: {:.3f}\n\ttest loss: {:.3f}'
              .format(i,log_loss(y,y_pred,labels=range(1,10)),
                      log_loss(y_ts,y_ts_pred,labels=range(1,10))))

iteration: 0
	training loss: 0.876
	test loss: 2.629
iteration: 5
	training loss: 0.437
	test loss: 1.822
iteration: 10
	training loss: 0.563
	test loss: 1.544
iteration: 15
	training loss: 0.515
	test loss: 1.314
iteration: 20
	training loss: 0.655
	test loss: 1.222
iteration: 25
	training loss: 0.591
	test loss: 1.128
iteration: 30
	training loss: 0.572
	test loss: 1.175
iteration: 35
	training loss: 0.616
	test loss: 1.139
iteration: 40
	training loss: 0.634
	test loss: 1.082
iteration: 45
	training loss: 0.743
	test loss: 1.080
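
For reference, the multiclass log loss reported here is the mean negative log-probability that the model assigns to the true class. The sketch below computes it by hand and evaluates a uniform guess over the nine classes as a baseline. The function name `multiclass_log_loss` and the toy probabilities are made up for this example.

```python
import math

def multiclass_log_loss(y_true, probabilities, labels):
    """Mean negative log-probability assigned to each sample's true class."""
    column = {label: j for j, label in enumerate(labels)}
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for label, row in zip(y_true, probabilities):
        p = min(max(row[column[label]], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)

labels = list(range(1, 10))
y_toy = [1, 4, 7, 7]

# Baseline: every class gets probability 1/9 for every sample.
uniform = [[1 / 9] * 9 for _ in y_toy]
baseline = multiclass_log_loss(y_toy, uniform, labels)
print(round(baseline, 3))  # ln(9), about 2.197

# A model that puts 0.84 on the true class and 0.02 on each of the others.
confident = []
for label in y_toy:
    row = [0.02] * 9
    row[label - 1] = 0.84
    confident.append(row)
better = multiclass_log_loss(y_toy, confident, labels)
```

A test loss of about 1.08, as in the output above, is therefore well below the uniform baseline of about 2.197.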


In [113]:
# continue training the same classifier on further mini-batches
for i in range(0,n_batches):
    if i%max_batches==0:
        X_train,y_train = shuffle(X_train,y_train)
    start = batch_size*(i%max_batches)
    x = tfidf_vect.transform(X_train[start:start+batch_size])
    y = y_train[start:start+batch_size]
    s.partial_fit(x,y,classes = np.unique(df_variants.Class))
    if i%5==0:
        y_pred = s.predict_proba(x)
        y_ts_pred = s.predict_proba(x_ts)
        print('iteration: {:d}\n\ttraining loss: {:.3f}\n\ttest loss: {:.3f}'
              .format(i,log_loss(y,y_pred,labels=range(1,10)),
                      log_loss(y_ts,y_ts_pred,labels=range(1,10))))

iteration: 0
	training loss: 0.575
	test loss: 1.082
iteration: 5
	training loss: 0.680
	test loss: 1.051
iteration: 10
	training loss: 0.684
	test loss: 1.056
iteration: 15
	training loss: 0.628
	test loss: 1.071
iteration: 20
	training loss: 0.715
	test loss: 1.038
iteration: 25
	training loss: 0.637
	test loss: 1.059
iteration: 30
	training loss: 0.761
	test loss: 1.068



KeyboardInterrupt



In [3]:
import pandas as pd
df = pd.read_csv('../data/training_variants.csv')

In [None]:
df.to_pickle()