# Sentiment analysis of positive and negative IMDB reviews

Predict positive and negative sentiment on the IMDB dataset [1] using `Calf` from the `calfcv` package. This example shows that Calf can fit and predict efficiently from a sparse feature matrix produced by `TfidfVectorizer`. Notably, Calf handled IMDB sentiment classification on a 50,000 x 101,895 feature matrix with multiprocessing and stable memory use, reaching an AUC of 0.93 on the positive and negative classes. We start with a small example.

In [40]:
# Author: Rolf Carlson, Carlson Research LLC, <hrolfrc@gmail.com>
# License: BSD 3 clause

### Get the data

In [41]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from collections import Counter
import numpy as np
from calfcv import Calf

Read the sentiment data

We downloaded the sentiment dataset from https://ai.stanford.edu/~amaas/data/sentiment/ and installed it in the /srv/imdb/ directory.  You will need to download the dataset and make it available to run this notebook.  Change the path in the following cell to point to your copy of the data.
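Before loading, it can help to confirm that the path points at a complete copy of the data. The following is a minimal sketch, not part of the original notebook: `missing_imdb_dirs` is a hypothetical helper, and the folder names it checks assume the usual layout of the extracted archive (`train/neg`, `train/pos`, `test/neg`, `test/pos`).

```python
from pathlib import Path


def missing_imdb_dirs(root):
    """Return the expected review folders that are absent under `root`.

    Hypothetical helper; the folder names are an assumption about
    how the archive was extracted.
    """
    expected = ["train/neg", "train/pos", "test/neg", "test/pos"]
    return [rel for rel in expected if not (Path(root) / rel).is_dir()]


# Replace the path with the location of your own copy of the data.
missing = missing_imdb_dirs("/srv/imdb")
if missing:
    print("Missing folders:", missing)
else:
    print("Dataset layout looks complete.")
```

An empty list means every expected folder was found.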


In [42]:
im_train = load_files('/srv/imdb/train/', shuffle=False)
im_test = load_files('/srv/imdb/test/', shuffle=False)
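`load_files` takes its class labels from the subfolder names, sorted alphabetically. The sketch below builds a throwaway directory tree to show the mapping; the three folder names mirror what we assume the training directory contains, and the review texts are made up.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as root:
    # Miniature stand-in for the train folder: one file per subfolder.
    for folder, text in [("pos", "loved it"), ("unsup", "no label"), ("neg", "hated it")]:
        (Path(root) / folder).mkdir()
        (Path(root) / folder / "0.txt").write_text(text)
    # Files are read into memory here, before the directory is removed.
    toy = load_files(root, shuffle=False)

label_names = list(toy.target_names)  # folder names in sorted order
labels = toy.target.tolist()          # integer label of each document
print(label_names)
print(labels)
```

The integer in `target` is the position of the folder name in `target_names`, so printing `im_train.target_names` is a quick way to see what each class number means.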

Create the train and test datasets

In [43]:
corpus = im_train.data + im_test.data
y_all = list(im_train.target) + list(im_test.target)

In [44]:
# count the classes
Counter(y_all)

Counter({0: 25000, 1: 25000, 2: 50000})

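Class 2 is not a sentiment class. `load_files` numbers the subfolders alphabetically, so 0 is `neg`, 1 is `pos`, and 2 is the unlabeled `unsup` folder of the training set. We want only positive and negative sentiment.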

In [45]:
index = [i for i in range(len(y_all)) if y_all[i] in [0, 1]]
y = np.array(y_all)[index]
X_uv = np.array(corpus)[index]
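The same selection can be written with a boolean mask. This is a small sketch on made-up arrays, offered as an equivalent alternative to the index list, not as a change to the notebook.

```python
import numpy as np

# Toy stand-ins for y_all and corpus; the values are invented.
y_all = np.array([0, 2, 1, 2, 0, 1])
corpus = np.array(["bad", "unlabeled a", "good", "unlabeled b", "awful", "great"])

# True wherever the label is one of the two sentiment classes.
keep = np.isin(y_all, [0, 1])
y = y_all[keep]
X_uv = corpus[keep]

print(y)
print(X_uv)
```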

There are 25,000 records in each of the positive and negative classes.

In [46]:
Counter(y)

Counter({0: 25000, 1: 25000})

In [47]:
X = TfidfVectorizer().fit_transform(X_uv)
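`TfidfVectorizer` returns a SciPy sparse matrix with one row per document and one column per vocabulary term, which is what keeps a 50,000-row matrix manageable in memory. A minimal sketch on three invented documents:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents, for illustration only.
docs = ["a great movie", "a terrible movie", "great acting, terrible script"]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(docs)

print(sparse.issparse(X_toy), X_toy.shape)
print(sorted(vectorizer.vocabulary_))
# Only the non-zero entries are stored.
print("stored values:", X_toy.nnz, "of", X_toy.shape[0] * X_toy.shape[1])
```

With the default tokenizer, single-character tokens such as "a" are dropped, so the vocabulary holds five terms here.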

## Predict sentiment on a small dataset
For this small example, we select 400 of the 50,000 labeled movie reviews: 200 for training and 200 for testing. Even with the limited number of samples, the bag-of-words model expands the number of feature columns to 9576. Calf learns the training sets. As expected, Calf needs more examples to show skill predicting unseen reviews.

In [48]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    train_size=200,
    test_size=200,
    stratify=y, 
    random_state=42
)

In [49]:
Counter(y_train)

Counter({0: 100, 1: 100})
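The exact 100/100 balance comes from `stratify=y`. The sketch below uses synthetic labels (an assumption for illustration) to compare a stratified split with a plain random one.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced labels; the features are placeholders.
y_toy = np.array([0] * 500 + [1] * 500)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# train_test_split returns X_train, X_test, y_train, y_test.
_, _, y_strat, _ = train_test_split(
    X_toy, y_toy, train_size=200, test_size=200, stratify=y_toy, random_state=42
)
_, _, y_plain, _ = train_test_split(
    X_toy, y_toy, train_size=200, test_size=200, random_state=42
)

print("stratified:", Counter(y_strat.tolist()))
print("plain:     ", Counter(y_plain.tolist()))
```

The stratified split preserves the class proportions of `y_toy`; the plain split is only balanced on average.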

Fit and predict on the training subset. Because the model is scored on the same reviews it learned, the AUC should be near 1.

With only 200 samples, Calf falls short of that on the training data, even after having learned the limited corpus. More training data is needed.

In [50]:
clf = Calf(order_col=True).fit(X_train, y_train)
y_pred = clf.predict(X_train)
roc_auc_score(y_train, y_pred)

0.87

Calf shows similar skill when it is fit and scored on the test subset.

In [51]:
clf = Calf(order_col=True).fit(X_test, y_test)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)

0.9

### Predict unseen sentiment

Calf does not have enough training data to show skill at predicting the unseen test set.

In [52]:
clf = Calf(order_col=True).fit(X_train, y_train)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)

0.6799999999999999

Predicting the probability of the sentiment class also requires additional training.

In [39]:
y_pred = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred)

0.7424
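The class-label and probability scores differ because `roc_auc_score` ranks whatever it is given. Hard 0/1 labels collapse the ROC curve to a single operating point, and for a binary problem the resulting AUC equals the balanced accuracy. Continuous scores keep the full ranking. The sketch below uses invented scores, not Calf output.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Invented ground truth and scores, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])
labels = (scores >= 0.5).astype(int)  # hard predictions at a 0.5 threshold

auc_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_labels = roc_auc_score(y_true, labels)  # single threshold only
bal_acc = balanced_accuracy_score(y_true, labels)

print("AUC from scores:  ", auc_scores)
print("AUC from labels:  ", auc_labels)
print("balanced accuracy:", bal_acc)
```

Here the AUC from scores is 8/9, while the AUC from labels and the balanced accuracy are both 2/3.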

## Predict positive and negative sentiment

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    train_size=.8,
    test_size=.2,
    stratify=y, 
    random_state=42
)

In [None]:
Counter(y_train)

In [None]:
Counter(y_test)

Fit Calf

In [None]:
clf_calf = Calf(order_col=True, verbose=True).fit(X_train, y_train)

In [None]:
len(clf_calf.feature_index_)

Predict on unseen data

In [None]:
y_pred = clf_calf.predict(X_test)
print('Calf auc on unseen, test data, class', roc_auc_score(y_test, y_pred))

y_pred = clf_calf.predict_proba(X_test)[:, 1]
print('Calf auc on unseen, test data, prob ', roc_auc_score(y_test, y_pred))

# fit LogisticRegression
clf_lr = LogisticRegression(C=4.95, n_jobs=5, max_iter=100000).fit(X_train, y_train)
# test on unseen data
y_pred = clf_lr.predict(X_test)
print('LogisticRegression auc on unseen, test data, class ', roc_auc_score(y_test, y_pred))

y_pred = clf_lr.predict_proba(X_test)[:, 1]
print('LogisticRegression auc on unseen, test data, proba ', roc_auc_score(y_test, y_pred))
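One design choice worth noting: in this notebook the vectorizer is fit on all reviews before the split, so the vocabulary and IDF weights also see the test documents. A hedged alternative is to put the vectorizer and the classifier in one scikit-learn pipeline, so that both are fit on the training text only. The sketch below uses a handful of invented reviews and the `LogisticRegression` baseline. It does not use Calf, because we have not verified how Calf behaves inside a pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented reviews, for illustration only.
train_docs = [
    "a wonderful, moving film",
    "great acting and a great story",
    "terrible plot and awful acting",
    "boring, awful, a waste of time",
]
train_labels = [1, 1, 0, 0]
test_docs = ["awful and boring", "wonderful and great"]

# The vectorizer is fit inside the pipeline, on training text only.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(C=4.95, max_iter=1000))
pipe.fit(train_docs, train_labels)

# Probability of the positive class for each unseen review.
proba = pipe.predict_proba(test_docs)[:, 1]
print(proba)
```

Words that appear only in the test reviews are ignored at prediction time, which is the behavior a deployed model would face.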

### References


[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. [Learning Word Vectors for Sentiment Analysis](https://aclanthology.org/P11-1015). In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.