# TF-IDF classification for the Kaggle cancer dataset

Using the preprocessing functions in src/text_procesing to prepare the data, I fit several simple classifiers that pair a text vectorizer (hashing or TF-IDF) with logistic regression. These models rely on no feature engineering, so they should be taken as a baseline for future algorithms.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
xs = pd.read_csv('../data/training_variants.csv')
x = pd.read_pickle('../data/local_pk.pk')
x_new = pd.read_pickle('../data/new_tokens.pk')
y = pd.read_csv('../data/training_variants.csv',index_col=0).Class

In [3]:
x_new.head()

| | text | processed |
|---|---|---|
| 0 | Cyclin-dependent kinases (CDKs) regulate a var... | [cyclin-depend, kinas, cdk, regul, varieti, fu... |
| 1 | Abstract Background Non-small cell lung canc... | [abstract, background, non-smal, cell, lung, c... |
| 2 | Abstract Background Non-small cell lung canc... | [abstract, background, non-smal, cell, lung, c... |
| 3 | Recent evidence has demonstrated that acquired... | [recent, evid, demonstr, acquir, uniparent, di... |
| 4 | Oncogenic mutations in the monomeric Casitas B... | [oncogen, mutat, monomer, casita, b-lineag, ly... |


### Hashing Vectorizer (Similar to bag of words)

This model gives a slightly better (lower) log loss than the TF-IDF vectorizer: about 1.50 at the best setting, versus about 1.59. I do not understand exactly why the simpler hashed bag of words does better than the more sophisticated TF-IDF, but the results speak for themselves. I evaluate several regularization strengths for the logistic regression (note that C is the inverse regularization strength, so smaller values of C give more strongly regularized models).
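To make the role of C concrete, here is a minimal sketch in pure Python (no scikit-learn) of a one-feature logistic regression fitted by gradient descent. It assumes an objective of the form 0.5·w² + C·(sum of log losses), which is the shape scikit-learn documents for its L2 penalty; the toy data and the `fit_weight` helper are made up for illustration and are not part of this notebook's pipeline.

```python
import math

# Made-up 1-D toy data: the label is mostly 1 when x > 0, with two noisy points
XS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -0.3, 0.3]
YS = [0, 0, 0, 1, 1, 1, 1, 0]

def fit_weight(c, steps=5000, lr=0.01):
    """Gradient descent on 0.5*w^2 + c * sum(log loss), without an intercept."""
    w = 0.0
    for _ in range(steps):
        grad = w  # gradient of the penalty term 0.5*w^2
        for x, y in zip(XS, YS):
            p = 1.0 / (1.0 + math.exp(-w * x))  # predicted P(y = 1 | x)
            grad += c * (p - y) * x  # gradient of the data term
        w -= lr * grad
    return w

# Same grid as the notebook: C = 2^-2 ... 2^2
weights = {k: fit_weight(2.0 ** k) for k in range(-2, 3)}
for k, w in weights.items():
    print('C = 2^{:d}: w = {:.3f}'.format(k, w))
```

On this toy problem the fitted weight grows as C grows, because a larger C puts more emphasis on the data term relative to the penalty. Smaller C shrinks the weight toward zero.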

In [6]:
from sklearn.feature_extraction.text import HashingVectorizer

def return_same(x):
    # Identity analyzer: hand each document to the vectorizer as-is, bypassing sklearn's tokenizer
    return x

# Sweep C over 2^-2 ... 2^2 on the original text column
for i in [2**x for x in range(-2,3)]:
    pipe = Pipeline([('encoder',HashingVectorizer(analyzer=return_same)),('lr',LogisticRegression(C=i))])
    # 'neg_log_loss' is the negated log loss, so flip the sign to report log loss (lower is better)
    cvs = -cross_val_score(pipe,x.text,y,scoring='neg_log_loss',n_jobs=-1,cv=5).mean()
    print('regularization 2^{:d}: {:.3f}'.format(int(np.log2(i)),cvs))
print('\n\n')
# Same sweep on the newly processed tokens
for i in [2**x for x in range(-2,3)]:
    pipe = Pipeline([('encoder',HashingVectorizer(analyzer=return_same)),('lr',LogisticRegression(C=i))])
    cvs = -cross_val_score(pipe,x_new.processed,y,scoring='neg_log_loss',n_jobs=-1,cv=5).mean()
    print('regularization 2^{:d}: {:.3f}'.format(int(np.log2(i)),cvs))


regularization 2^-2: 1.531
regularization 2^-1: 1.502
regularization 2^0: 1.501
regularization 2^1: 1.535
regularization 2^2: 1.606



regularization 2^-2: 1.527
regularization 2^-1: 1.498
regularization 2^0: 1.498
regularization 2^1: 1.531
regularization 2^2: 1.603
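For intuition about what the hashing encoder produces, the following is a rough standard-library sketch of the hashing trick: each token is hashed to one of a fixed number of buckets, occurrences are counted per bucket, and the resulting vector is L2-normalized. It is only an approximation of `HashingVectorizer`, which by default uses a signed MurmurHash3 over 2**20 features. The `hashed_features` helper, the md5 hash, and the 16-bucket size are assumptions chosen for illustration.

```python
import hashlib
import math
from collections import Counter

def hashed_features(tokens, n_buckets=16):
    """Map a token list to {bucket index: L2-normalized count}."""
    counts = Counter()
    for tok in tokens:
        # md5 is used only because it is deterministic and in the standard library
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        bucket = int.from_bytes(digest[:4], 'little') % n_buckets
        counts[bucket] += 1
    if not counts:
        return {}
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {b: v / norm for b, v in sorted(counts.items())}

# A few stemmed tokens in the style of the `processed` column
doc = ['cyclin-depend', 'kinas', 'cdk', 'regul', 'kinas']
vec = hashed_features(doc)
print(vec)
```

No vocabulary is stored and no corpus-level statistic such as document frequency enters the weights. Distinct tokens can also collide in the same bucket, especially with so few buckets.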


### Using TF-IDF and logistic regression

TF-IDF does not perform as well as the hashed bag-of-words model above: its best log loss is about 1.59, compared with about 1.50.
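As a reminder of what the TF-IDF weighting does, here is a small standard-library sketch. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. The `tfidf` helper and the three toy documents are hypothetical and only illustrate the arithmetic.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    rows = []
    for doc in docs:
        weights = {tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
        # Fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

docs = [['kinas', 'regul', 'kinas'], ['kinas', 'mutat'], ['lung', 'cancer', 'mutat']]
rows = tfidf(docs)
for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

Tokens that occur in fewer documents receive a larger weight, so in the third toy document `lung` outweighs `mutat`, which also appears in the second.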

In [8]:
# Sweep C over 2^-2 ... 2^2 on the original text column (return_same is defined in the hashing cell)
for i in [2**x for x in range(-2,3)]:
    pipe = Pipeline([('encoder',TfidfVectorizer(analyzer=return_same)),('lr',LogisticRegression(C=i))])
    # 'neg_log_loss' is the negated log loss, so flip the sign to report log loss (lower is better)
    cvs = -cross_val_score(pipe,x.text,y,scoring='neg_log_loss',n_jobs=-1,cv=5).mean()
    print('regularization 2^{:d}: {:.3f}'.format(int(np.log2(i)),cvs))

# Same sweep on the newly processed tokens
for i in [2**x for x in range(-2,3)]:
    pipe = Pipeline([('encoder',TfidfVectorizer(analyzer=return_same)),('lr',LogisticRegression(C=i))])
    cvs = -cross_val_score(pipe,x_new.processed,y,scoring='neg_log_loss',n_jobs=-1,cv=5).mean()
    print('regularization 2^{:d}: {:.3f}'.format(int(np.log2(i)),cvs))

regularization 2^-2: 1.618
regularization 2^-1: 1.597
regularization 2^0: 1.605
regularization 2^1: 1.650
regularization 2^2: 1.734
regularization 2^-2: 1.615
regularization 2^-1: 1.594
regularization 2^0: 1.603
regularization 2^1: 1.648
regularization 2^2: 1.733
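To help read these numbers, here is a minimal sketch of the multiclass log loss itself, written with the standard library only. scikit-learn exposes the metric as `neg_log_loss` because its scorers are maximized, which is why the cells above negate the result. The `log_loss` helper here is a simplified stand-in, and the class count of 9 is an assumption about this dataset that should be adjusted if it is wrong.

```python
import math

def log_loss(y_true, probs):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-15  # clip so a predicted probability of 0 does not give log(0)
    total = 0.0
    for label, row in zip(y_true, probs):
        total -= math.log(max(row[label], eps))
    return total / len(y_true)

n_classes = 9  # assumption: number of distinct values in the Class column
# A model that knows nothing predicts the same probability for every class
uniform = [[1.0 / n_classes] * n_classes for _ in range(4)]
baseline = log_loss([0, 3, 5, 8], uniform)
print('uniform baseline: {:.3f}'.format(baseline))
```

A uniform predictor over K classes always scores ln K, so under the 9-class assumption any useful model should land below ln 9 (about 2.197).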


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.svm import SVC

In [21]:
# Hold-out split (not used by the cross-validation below, which scores on the full data)
x_train,x_test,y_train,y_test = train_test_split(x_new.processed,y)
pipe = Pipeline([('encoder',HashingVectorizer(analyzer=return_same)),
                 ('svc',SVC(kernel='linear',probability=True))])
cvs = -cross_val_score(pipe,x_new.processed,y,scoring='neg_log_loss',n_jobs=-1,cv=3).mean()
# C is not varied here, so label the score by model instead of the stale loop variable i
print('linear SVC: {:.3f}'.format(cvs))

linear SVC: 1.754


In [None]:
pipe = Pipeline([('encoder',HashingVectorizer(analyzer=return_same)),
                 ('svc',SVC(kernel='linear',probability=True,class_weight='balanced'))])
cvs = -cross_val_score(pipe,x_new.processed,y,scoring='neg_log_loss',n_jobs=-1,cv=5).mean()
# C is not varied here, so label the score by model instead of the stale loop variable i
print('linear SVC (balanced class weights): {:.3f}'.format(cvs))
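For reference, `class_weight='balanced'` reweights each class inversely to its frequency, using n_samples / (n_classes * count) as documented by scikit-learn. The sketch below reproduces that arithmetic with the standard library; the `balanced_class_weights` helper and the toy label list are hypothetical and do not reflect the real class distribution of this dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Made-up imbalanced labels: six of class 1, three of class 2, one of class 3
toy_labels = [1] * 6 + [2] * 3 + [3] * 1
weights = balanced_class_weights(toy_labels)
for c, w in sorted(weights.items()):
    print('class {:d}: weight {:.3f}'.format(c, w))
```

Rare classes receive larger weights, so their misclassifications cost more during fitting, and the weighted sample count still sums to the original number of samples.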