In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn import svm

# Justification for Model

I decided to use a tfidf vectorizer to transform the text into vectors.  This is generally better than a simple count vectorizer because it accounts for the fact that words that are rare in general but frequent in a particular document are more predictive than common words that show up all the time.
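
To make the difference concrete, here is a minimal sketch on a made-up three-document corpus (the documents are my own illustration, not taken from the review data). It assumes scikit-learn's default tokenization and smoothing, and compares the weight given to a word that appears in every document with one that appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up for illustration): "the" appears in every document,
# "fantastic" in only one.
docs = [
    "the movie was fantastic",
    "the movie was long",
    "the book was long",
]

counts = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

vocab = tfidf.vocabulary_   # maps each word to its column index
idf = tfidf.idf_            # one inverse-document-frequency weight per column

# Raw counts treat both words the same within the first document ...
first_counts = counts.transform(docs[:1]).toarray()[0]
print(first_counts[counts.vocabulary_["the"]],
      first_counts[counts.vocabulary_["fantastic"]])

# ... while tfidf gives the rare word a larger weight than the ubiquitous one.
first_tfidf = tfidf.transform(docs[:1]).toarray()[0]
print(first_tfidf[vocab["the"]], first_tfidf[vocab["fantastic"]])
```

In the count vector both words contribute equally to the first document, while in the tfidf vector the rare word carries the larger weight.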

I decided to use a linear support vector classifier because the number of features (the length of the tfidf vectors) is relatively large compared to our number of samples.  In this regime the data may be linearly separable and overfitting is definitely a problem that we will run into, so it is useful to have a regularized maximum margin classifier that still converges even on linearly separable data.

In order to deal with overfitting I made both the support vector classifier's regularization and the length of the tfidf vectors hyperparameters of the model.  I used a scikit-learn pipeline to put these together so that cross-validation tunes both hyperparameters jointly instead of separately, since they both control overfitting and influence each other.

In order to validate the model I used the f1 score instead of simply using the accuracy because the dataset is imbalanced: the `value_counts` output shows 10 times as many samples with label 0 as with label 1.  As a result, a model that predicts the majority label every time is already 91% accurate, which makes accuracy a poor metric here.  The f1 score takes into account both the precision and the recall on label 1, so a model that always predicts the majority label gets an f1 score of 0 instead of a good score.
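
As a quick check of this reasoning, the sketch below builds a label vector with the same class counts as the `value_counts` output (30000 vs 3000) and scores a baseline that always predicts the more common label (0 in that output). The variable names are my own, and `zero_division` assumes scikit-learn 0.22 or newer.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Labels with the same 10:1 imbalance as the value_counts output.
y_true = np.array([0] * 30000 + [1] * 3000)

# Baseline that ignores the input and always predicts the majority label.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when label 1 is never predicted.
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)
print(baseline_accuracy, baseline_f1)
```

The baseline reaches about 91% accuracy while its f1 score is 0, which is why accuracy alone is misleading here.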

# Model Results

Surprisingly, the best result lies in a region of the hyperparameter space where there is clear overfitting.  The model performs almost perfectly on the training data (misclassifying only 3 out of 26400), with both f1 and accuracy scores above .999.  It still performs well on the test set that it has not been trained on, with f1 and accuracy scores of .789 and .965, but the gap between the train and test f1 scores is a clear sign of overfitting.  Nevertheless, this parameter configuration gave the best cross-validated f1 score in the grid search.  This suggests that the model does not have enough data to train on and that it could probably perform significantly better given more, so collecting more data would probably be one of the most important things that could be done to improve this model.
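
One way to inspect this kind of gap per hyperparameter setting is to load `cv_results_` into a DataFrame and compare the mean train and validation scores. The sketch below does this on a tiny made-up corpus (the texts, labels, and the name `toy_search` are assumptions for illustration only), so the numbers it prints say nothing about the real review data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the reviews.
texts = ["great album loved it", "wonderful fun gift",
         "excellent voice great songs", "amazing story loved it",
         "terrible waste of money", "boring and awful",
         "awful sound terrible quality", "bad plot boring characters"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

toy_search = GridSearchCV(
    Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))]),
    {"vect__max_features": (5, 20), "clf__C": (0.1, 10.0)},
    scoring="f1", cv=3, return_train_score=True,
)
toy_search.fit(texts, labels)

# One row per hyperparameter combination; a large gap between the train
# and validation scores is the overfitting signal discussed above.
results = pd.DataFrame(toy_search.cv_results_)
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results[["param_clf__C", "param_vect__max_features",
               "mean_train_score", "mean_test_score", "gap"]])
```

Because the toy corpus repeats the same sentences, its validation scores are optimistic; the point is only the shape of the table.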

# Other things to try

There are a lot of other things that I would have liked to try, but I only had a few hours to devote to this today, so I will instead write a little bit about what I would have tried with more time, and why.

Looking at the confusion matrix we can see that the model is doing a much better job on the majority class (label 0) than on the minority class (label 1).  One interesting way to deal with this would be to use a model that outputs a probability, such as logistic regression, and to make the cutoff probability a hyperparameter, to see if there is some optimal cutoff other than 50%.  What counts as an optimal cutoff could also depend on business considerations: in some applications a false positive is worse than a false negative, or vice versa.
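
A minimal sketch of the cutoff idea is shown below. I did not run this on the review data; it uses synthetic data from `make_classification` with a similar class imbalance, and the threshold grid and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the review data: roughly 10:1 class imbalance.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.91, 0.09], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]   # estimated probability of label 1

# Score every candidate cutoff on held-out data and keep the best one.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, (proba >= t).astype(int), zero_division=0)
          for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(best_threshold, max(scores))
```

In practice the cutoff should be chosen on validation folds rather than on the final test set, and the metric being maximized could be replaced by a cost that reflects the business trade-off between false positives and false negatives.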

Another thing that I would have really loved to try with more time is some neural network models, and in particular I would have liked to make use of word2vec.  There is a lot of information that my tfidf/svm method is not making use of.  First, tfidf does not take into account the order of the words or their context in the reviews.  It would be possible to use n-grams instead of only single words to create our tfidf vectors, but we are already overfitting, so that probably wouldn't be that useful.  Second, tfidf does not take into account the relationship of different words to each other, whereas a model making use of something like word2vec would be able to use the fact that words such as fantastic and excellent are synonyms and treat them as such when making predictions.  The only issue with neural networks is that 33000 samples may not be enough to train one, but I think it is certainly worth a try.
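
To illustrate the word-order point, here is a small sketch with two made-up reviews that contain the same words in a different order. It uses the `ngram_range` parameter of `TfidfVectorizer`; the example sentences are my own and are not from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, opposite meaning.
docs = ["not good, just bad", "not bad, just good"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# With single words the two reviews get identical vectors ...
u = unigrams.transform(docs).toarray()
print(np.allclose(u[0], u[1]))

# ... while bigrams such as "not good" and "not bad" tell them apart,
# at the cost of a larger vocabulary.
b = bigrams.transform(docs).toarray()
print(np.allclose(b[0], b[1]))
print(len(unigrams.vocabulary_), len(bigrams.vocabulary_))
```

The larger bigram vocabulary is exactly the concern raised above: more features on the same amount of data makes overfitting more likely.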

In [2]:
df = pd.read_csv('data/data.csv')

In [3]:
df.head()

Unnamed: 0,content,sentiment
0,"Great fun!, Got these last Christmas as a gag ...",1
1,"Inspiring, I hope a lot of people hear this cd...",1
2,"Great CD, My lovely Pat has one of the GREAT v...",1
3,"First album I've bought since Napster, We've c...",1
4,"Amazing!, I used to find myself starting Chron...",1


In [4]:
# data set is unbalanced
df['sentiment'].value_counts()

0    30000
1     3000
Name: sentiment, dtype: int64

In [5]:
# use random_state to make results reproducible
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['sentiment'], test_size=0.2, random_state=0)

In [6]:
parameters = {
    'vect__max_features': (500, 1000, 2000, 5000, 10000, 20000, 50000),
    'clf__C': (0.1, 0.5, 1.0, 5.0, 10.0)
}

# use pipeline to be able to tune both svm regularization and TfidfVectorizer max_features at the same time
# both are parameters that will be useful to control overfitting
pipeline = Pipeline([
    ('vect', TfidfVectorizer(max_df=0.5,
                             min_df=2, stop_words='english',
                             use_idf=True)),
    ('clf', svm.LinearSVC())
])

In [7]:
# do cross validation grid search automatically using scikit learn
grid_search = GridSearchCV(pipeline, parameters, scoring='f1', return_train_score=True)

In [8]:
%%time
grid_search.fit(X_train, y_train)

CPU times: user 3min 47s, sys: 2.45 s, total: 3min 50s
Wall time: 3min 50s


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=2,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__max_features': (500, 1000, 2000, 5000, 10000, 20000, 50000), 'clf__C': (0.1, 0.5, 1.0, 5.0, 10.0)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)

In [9]:
grid_search.best_params_

{'clf__C': 5.0, 'vect__max_features': 20000}

In [10]:
grid_search.cv_results_

{'mean_fit_time': array([ 0.96250828,  0.92858839,  0.91574804,  0.92127554,  0.93918562,
         0.92526627,  0.92912046,  0.91089026,  0.90428019,  0.91202235,
         0.91720581,  0.93000563,  0.93461585,  0.93688019,  0.91953365,
         0.92298277,  0.91576266,  0.93210371,  0.95045733,  0.9573427 ,
         0.93409133,  1.00227102,  0.97752778,  0.97100242,  0.96816897,
         0.97880975,  0.99074705,  0.99329201,  1.08759022,  1.04261335,
         1.03319224,  1.01983897,  1.01538698,  1.02401408,  1.02455012]),
 'mean_score_time': array([ 0.41898068,  0.39545043,  0.39674807,  0.40232205,  0.41273069,
         0.4092234 ,  0.41359234,  0.39749368,  0.39924002,  0.39909093,
         0.40417886,  0.41439033,  0.41269207,  0.40839044,  0.39023868,
         0.39494626,  0.39901622,  0.41023517,  0.40772899,  0.4165717 ,
         0.41165082,  0.39084832,  0.39181153,  0.404459  ,  0.40322971,
         0.40716259,  0.42241565,  0.41252438,  0.39203715,  0.3953704 ,
         0.40

In [11]:
y_pred_train = grid_search.predict(X_train)
y_pred_test = grid_search.predict(X_test)

In [12]:
# scikit-learn metrics expect (y_true, y_pred); binary f1 is unchanged by the swap
f1_score(y_train, y_pred_train)

0.99937823834196893

In [13]:
f1_score(y_test, y_pred_test)

0.78899082568807344

In [14]:
accuracy_score(y_train, y_pred_train)

0.99988636363636363

In [15]:
accuracy_score(y_test, y_pred_test)

0.9651515151515152

In [16]:
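# note: arguments are passed as (y_pred, y_true), so rows are predicted labels and columns are true labels
confusion_matrix(y_pred_train, y_train)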

array([[23986,     1],
       [    2,  2411]])

In [17]:
# can see that label 0 is being predicted much better than label 1:
# reading down the columns (true labels), 72 of 6012 true 0s are misclassified versus 158 of 588 true 1s
confusion_matrix(y_pred_test, y_test)

array([[5940,  158],
       [  72,  430]])