# IMDB sentiment analysis

(Based on Coursera MIPT & Yandex Machine Learning course)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

The dataset contains 25,000 IMDB movie reviews selected for sentiment analysis. The sentiment label is binary: reviews with an IMDB rating < 5 have a sentiment score of 0, and reviews with a rating >= 7 have a sentiment score of 1.
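
As a minimal sketch of the labeling rule above: the function name `rating_to_sentiment` is hypothetical, and returning `None` for the in-between ratings 5 and 6 is an assumption, since the description does not say how they are handled.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to the binary sentiment label described above."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    # Ratings 5 and 6 are not covered by the rule; None is an assumption here.
    return None


labels = [rating_to_sentiment(r) for r in (3, 4, 7, 8, 9)]
```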

Kaggle competition: https://www.kaggle.com/c/word2vec-nlp-tutorial/data

Let's load the sample:

In [3]:
imdb = pd.read_csv('labeledTrainData.tsv', delimiter='\t')
imdb.shape

(25000, 3)

In [4]:
imdb.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


Classes are balanced:

In [5]:
imdb.sentiment.value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

Splitting the training and test sets:

In [6]:
# train_test_split lives in sklearn.model_selection since scikit-learn 0.18
from sklearn.model_selection import train_test_split
texts_train, texts_test, y_train, y_test = train_test_split(imdb.review.values, imdb.sentiment.values)
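
With no extra arguments, `train_test_split` holds out 25% of the rows, which gives 18,750 training and 6,250 test reviews here, and the split differs between runs unless a seed is fixed. A rough NumPy sketch of such a shuffled split follows; `shuffled_split` and the seed value are illustrative and not part of scikit-learn.

```python
import numpy as np


def shuffled_split(n_samples, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) for a seeded random split."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)           # shuffle the row indices
    n_test = int(np.ceil(test_fraction * n_samples))
    return order[n_test:], order[:n_test]


train_idx, test_idx = shuffled_split(25000)
```

In scikit-learn itself, passing `random_state=...` to `train_test_split` makes the split reproducible.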

Vectorizing reviews:

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, use_idf=True)
X_train = vect.fit_transform(texts_train)
X_test = vect.transform(texts_test)
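
To make the weighting concrete, here is a pure-Python sketch of tf-idf with sublinear term frequency. It assumes the formulas documented for scikit-learn's defaults (tf replaced by `1 + log(tf)`, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 row normalization). The whitespace tokenizer and the name `tfidf_sublinear` are simplifications made for this example, so the numbers will not match `TfidfVectorizer` on real text.

```python
import math
from collections import Counter


def tfidf_sublinear(docs):
    """Tf-idf with sublinear tf, smoothed idf and L2 row normalization."""
    # Naive whitespace tokenizer (an assumption, not sklearn's tokenizer)
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in df}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf_sublinear(["good good good movie", "bad movie"])
```

The repeated word "good" gets a dampened weight (logarithmic rather than linear in its count), while "movie", which occurs in both documents, is down-weighted by the idf term.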

## Logistic regression

Let's fit a logistic regression and evaluate accuracy and ROC AUC:

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

clf = LogisticRegression()
clf.fit(X_train, y_train)

print(metrics.accuracy_score(y_test, clf.predict(X_test)))
print(metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

0.88768
0.956957106423
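
The first number is accuracy, computed from hard 0/1 predictions; the second is ROC AUC, computed from predicted probabilities. ROC AUC equals the probability that a randomly chosen positive review gets a higher score than a randomly chosen negative one, with ties counted as one half. A small pure-Python sketch of that definition follows; `pairwise_auc` is an illustrative helper, not a scikit-learn function, and its quadratic cost makes it suitable for toy inputs only.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# 3 of the 4 (positive, negative) pairs are ordered correctly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```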


But we've got too many features:

In [9]:
X_train.shape

(18750, 66597)

Let's perform feature selection using L1 (Lasso) regularization:

In [10]:
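# The L1 penalty needs a solver that supports it, e.g. liblinear
clf = LogisticRegression(C=0.15, penalty='l1', solver='liblinear')
clf.fit(X_train, y_train)

# Number of features with a non-negligible coefficient
print(np.sum(np.abs(clf.coef_) > 1e-4))
print(metrics.accuracy_score(y_test, clf.predict(X_test)))
print(metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))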

104
0.80944
0.891544332825
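
The L1 penalty drives most coefficients to exactly zero, which is why only 104 of the 66,597 features survive here. One way to see the mechanism is the soft-thresholding operator, the proximal step for an L1 penalty. This is an illustration of the idea and not a description of how the liblinear solver works internally; the name `soft_threshold` and the numbers are made up for the example.

```python
def soft_threshold(weights, lam):
    """Shrink every weight toward zero by lam; weights within [-lam, lam] become 0."""
    out = []
    for w in weights:
        if w > lam:
            out.append(w - lam)
        elif w < -lam:
            out.append(w + lam)
        else:
            out.append(0.0)
    return out


shrunk = soft_threshold([0.05, -0.3, 1.2, -0.08], lam=0.1)
n_selected = sum(1 for w in shrunk if w != 0.0)
```

Small weights are zeroed out entirely while large ones are only shrunk, so the surviving nonzero coefficients act as a feature selection.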


Another approach to feature selection is randomized logistic regression (subsampling + L1):

In [11]:
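# RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and
# removed in 0.21, so this cell needs an older release
from sklearn.linear_model import RandomizedLogisticRegression

rlg = RandomizedLogisticRegression(C=0.15)
rlg.fit(X_train, y_train)  # careful: emits many warnings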

RandomizedLogisticRegression(C=0.15, fit_intercept=True,
               memory=Memory(cachedir=None), n_jobs=1, n_resampling=200,
               normalize=True, pre_dispatch='3*n_jobs', random_state=None,
               sample_fraction=0.75, scaling=0.5, selection_threshold=0.25,
               tol=0.001, verbose=False)
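
Randomized logistic regression is a form of stability selection: an L1-penalized model is refit many times on random subsamples with randomly rescaled features, and each feature is scored by how often it receives a nonzero coefficient. Below is a rough NumPy sketch of the idea on synthetic data. It is not the library's implementation: the proximal-gradient solver, the penalty parameter `lam`, the function names, and the toy data are all assumptions made for illustration. Only `n_resampling`, `sample_fraction` and `scaling` mirror parameter names shown in the output above.

```python
import numpy as np


def l1_logreg(X, y, lam, n_iter=300):
    """L1-penalized logistic regression (no intercept) via proximal gradient."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size 1/L, where L bounds the curvature of the mean log loss
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))
        w = w - step * X.T.dot(p - y) / n
        # Soft-thresholding: the proximal step for the L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
    return w


def stability_scores(X, y, lam, n_resampling=50, sample_fraction=0.75,
                     scaling=0.5, seed=0):
    """Fraction of resampling rounds in which each feature is selected."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_resampling):
        rows = rng.choice(n, int(sample_fraction * n), replace=False)
        # Randomly shrink about half of the features by the factor (1 - scaling)
        feature_weights = 1.0 - scaling * rng.randint(0, 2, size=d)
        w = l1_logreg(X[rows] * feature_weights, y[rows], lam)
        counts += (w != 0)
    return counts / n_resampling


# Toy data: only feature 0 carries signal, the other nine are noise
rng = np.random.RandomState(42)
X = rng.randn(300, 10)
y = (X[:, 0] + 0.3 * rng.randn(300) > 0).astype(float)

scores = stability_scores(X, y, lam=0.1)
selected = scores > 0.25   # same role as selection_threshold above
```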

Let's see how many features were selected:

In [12]:
np.sum(rlg.scores_ > 0)

142

Fitting logistic regression on the 142 preselected features:

In [13]:
X_train_lasso = X_train[:, rlg.scores_ > 0]
X_test_lasso = X_test[:, rlg.scores_ > 0]

In [14]:
clf = LogisticRegression(C=1)
clf.fit(X_train_lasso, y_train)
print(metrics.accuracy_score(y_test, clf.predict(X_test_lasso)))
print(metrics.roc_auc_score(y_test, clf.predict_proba(X_test_lasso)[:, 1]))

0.83456
0.912297535073


## Principal component analysis approach

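Let's create 100 synthetic features with truncated SVD, a PCA-like decomposition that works directly on sparse matrices (unlike PCA, it does not center the data):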

In [15]:
from sklearn.decomposition import TruncatedSVD

tsvd = TruncatedSVD(n_components=100)
X_train_pca = tsvd.fit_transform(X_train)
X_test_pca = tsvd.transform(X_test)
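
Truncated SVD keeps the top `k` right singular vectors of the uncentered matrix and projects every row onto them. A small dense NumPy sketch of that projection on random toy data follows. `TruncatedSVD` uses a different, approximate solver by default, so this illustrates the result and not the library's algorithm; the variable names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
X_fit = rng.rand(6, 4)   # toy "training" matrix
X_new = rng.rand(3, 4)   # toy "test" matrix
k = 2                    # number of components to keep

# Full SVD of the training matrix, without centering
U, s, Vt = np.linalg.svd(X_fit, full_matrices=False)
components = Vt[:k]                       # shape (k, n_features)

# Projection onto the top-k right singular vectors
X_fit_reduced = X_fit.dot(components.T)   # equals U[:, :k] * s[:k]
X_new_reduced = X_new.dot(components.T)   # new data reuses the same components
```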

Fitting logistic regression on them:

In [16]:
clf = LogisticRegression()
clf.fit(X_train_pca, y_train)
print(metrics.accuracy_score(y_test, clf.predict(X_test_pca)))
print(metrics.roc_auc_score(y_test, clf.predict_proba(X_test_pca)[:, 1]))

0.86016
0.934814478903


With only 100 features we get close to the result on the full set of about 66k features (accuracy 0.860 vs. 0.888, ROC AUC 0.935 vs. 0.957).

What about Random Forest?

In [17]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train_pca, y_train)
print(metrics.accuracy_score(y_test, clf.predict(X_test_pca)))
print(metrics.roc_auc_score(y_test, clf.predict_proba(X_test_pca)[:, 1]))

0.8264
0.905591691673


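The SVD components are linear combinations of the original features, which suits linear models; this is probably why logistic regression scores better than the more complex non-linear random forest here (ROC AUC 0.935 vs. 0.906).

Let's create a submission: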

In [33]:
vc = TfidfVectorizer(sublinear_tf=True, use_idf=True)
train_all = vc.fit_transform(imdb.review.values)
print(train_all.shape)

clf = LogisticRegression(C=0.15, penalty='l1', solver='liblinear')
clf.fit(train_all, imdb.sentiment.values)

(25000, 74849)


LogisticRegression(C=0.15, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [37]:
test_data = pd.read_csv("testData.tsv", header=0, delimiter="\t")
test_all = vc.transform(test_data.review)

In [38]:
result = clf.predict(test_all)

# Write the test results 
output = pd.DataFrame(data = {"id" : test_data["id"], "sentiment" : result})
output.to_csv("Tf-idf_LR.csv", index=False, quoting=3)

This submission scores 0.82804 on the Kaggle leaderboard.

Let's fit the tf-idf transformer on all available text (labeled, unlabeled, and test reviews):

In [41]:
unlabeled_data = pd.read_csv("unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

In [49]:
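vc = TfidfVectorizer(sublinear_tf=True, use_idf=True)

# Note: each fit() call refits from scratch, so only the last call
# (on the test reviews) determines the vocabulary and idf weights
vc.fit(imdb.review.values)
vc.fit(unlabeled_data.review.str.strip('"'))
vc.fit(test_data.review)

train_all = vc.transform(imdb.review.values)

clf = LogisticRegression(C=0.15, penalty='l1', solver='liblinear')
clf.fit(train_all, imdb.sentiment.values)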

LogisticRegression(C=0.15, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [50]:
test_all = vc.transform(test_data.review)
result = clf.predict(test_all)

# Write the test results 
output = pd.DataFrame(data = {"id" : test_data["id"], "sentiment" : result})
output.to_csv("Tf-idf_LR2.csv", index=False, quoting=3)

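This scores 0.82796, so we get no gain from feeding the transformer additional data. A likely reason is that `fit` does not accumulate across calls: each call refits the vectorizer from scratch, so after three successive `fit` calls it is fitted on the test reviews only, not on the union of the three corpora.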
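
A toy sketch, with made-up documents standing in for the three review sets, contrasting successive `fit` calls with a single `fit` on the concatenated texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the labeled, unlabeled and test reviews
labeled = ["good movie", "bad movie"]
unlabeled = ["great plot", "awful acting"]
test = ["boring film"]

# Successive fit() calls: each one replaces the previous vocabulary
vc_seq = TfidfVectorizer()
vc_seq.fit(labeled)
vc_seq.fit(unlabeled)
vc_seq.fit(test)
vocab_last_only = sorted(vc_seq.vocabulary_)

# A single fit() on the concatenation sees every document
vc_all = TfidfVectorizer()
vc_all.fit(labeled + unlabeled + test)
vocab_union = sorted(vc_all.vocabulary_)
```

With the pandas objects used in this notebook, the concatenation could be done with `pd.concat` before calling `fit` once. That variant has not been run here, so its leaderboard score is unknown.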