# Sentiment analysis with sparse features

Predict sentiment on the IMDB dataset using `Calf`. This example shows that Calf can efficiently fit and predict using a sparse feature matrix produced by `TfidfVectorizer`. For this small example, we select 400 movie reviews out of 25000, 200 for training and 200 for testing. Even with this limited number of samples, the bag-of-words model expands the number of feature columns to 9576. Calf learns the sets it is trained on. As expected, Calf needs more examples to show skill predicting unseen reviews.

Notably, calfcv successfully handled IMDB sentiment classification using a 50,000 x 101,895 feature matrix with multiprocessing and stable memory use. It achieved an AUC of 0.93 for predicting the positive and negative sentiment classes.


Author: Rolf Carlson, Carlson Research LLC, <hrolfrc@gmail.com>

License: 3-clause BSD

### Get the data

In [1]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from calfcv import Calf

Read the sentiment data

In [2]:
im_train = load_files('../../data/imdb/train/', shuffle=False)
im_test = load_files('../../data/imdb/test/', shuffle=False)

Create the train and test datasets

In [3]:
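# Vectorize train and test reviews together so both share one vocabulary
corpus = im_train.data + im_test.data
y = list(im_train.target) + list(im_test.target)
y_train, y_test = list(im_train.target), list(im_test.target)
X = TfidfVectorizer().fit_transform(corpus)

# Rows keep corpus order: training reviews first, then testing reviews
n_train = len(im_train.data)
X_train = X[:n_train, :]
X_test = X[n_train:, :]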

print(len(y_train), X_train.shape)
assert len(y_train) == X_train.shape[0], "y_train wrong shape"
print(len(y_test), X_test.shape)
assert len(y_test) == X_test.shape[0], "y_test wrong shape"

200 (200, 9576)
200 (200, 9576)
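
The cell above fits `TfidfVectorizer` on the combined corpus, so the vocabulary and IDF weights also reflect the testing reviews. A minimal sketch of a common alternative, fitting the vectorizer on the training reviews only and reusing it for the testing reviews, is shown below. The toy documents and the names `train_docs`, `test_docs`, `X_train_alt`, and `X_test_alt` are hypothetical stand-ins, not part of this notebook, and the feature count on the real data would differ from 9576.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for im_train.data and im_test.data
train_docs = ["a great film", "a dull plot", "great acting"]
test_docs = ["dull acting", "a surprising film"]

# Learn the vocabulary and IDF weights from the training documents only
vectorizer = TfidfVectorizer()
X_train_alt = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary; words unseen in training are ignored
X_test_alt = vectorizer.transform(test_docs)

# Both matrices are SciPy sparse matrices with the same number of columns
density = X_train_alt.nnz / (X_train_alt.shape[0] * X_train_alt.shape[1])
print(X_train_alt.shape, X_test_alt.shape, issparse(X_train_alt), density)
```

With this variant the testing set contributes nothing to the features, at the cost of dropping words that appear only in testing reviews.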


### Predict seen sentiment

Predict the corpus

Because Calf is fit on the same reviews it then predicts, this AUC should be near 1.

In [4]:
clf = Calf(order_col=True).fit(X, y)

In [5]:
y_pred = clf.predict(X)

In [6]:
roc_auc_score(y, y_pred)

0.945

With only 200 samples, Calf shows weaker skill at reproducing the training labels when it is fit on the training set alone than when it is fit on the full corpus. There needs to be more training data.

In [7]:
clf = Calf(order_col=True).fit(X_train, y_train)
y_pred = clf.predict(X_train)
roc_auc_score(y_train, y_pred)

0.765
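
The cells in this notebook repeat one pattern: fit on one split, predict another, and score with `roc_auc_score`. A small helper can make the fit and evaluation splits explicit. This is only a sketch: `fit_and_score` is a hypothetical name, and `LogisticRegression` on toy data stands in for `Calf` so that the block runs without calfcv installed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def fit_and_score(make_estimator, X_fit, y_fit, X_eval, y_eval):
    """Fit a fresh estimator on one split and return the AUC on another."""
    clf = make_estimator().fit(X_fit, y_fit)
    return roc_auc_score(y_eval, clf.predict(X_eval))


# Toy, linearly separable data (hypothetical, not the IMDB features)
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]

# "Seen" evaluation: fit and score on the same rows
seen_auc = fit_and_score(LogisticRegression, X_toy, y_toy, X_toy, y_toy)
print(seen_auc)
```

Assuming `Calf` follows the same `fit`/`predict` interface used in the cells here, a call such as `fit_and_score(lambda: Calf(order_col=True), X_train, y_train, X_test, y_test)` would correspond to the unseen-data evaluation.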

Calf demonstrates skill at learning the testing data.

In [8]:
clf = Calf(order_col=True).fit(X_test, y_test)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)

0.9299999999999999

The model fit on the testing data shows less skill on the training data, which it has not seen.

In [9]:
y_pred = clf.predict(X_train)
roc_auc_score(y_train, y_pred)

0.705

Refitting on the testing data reproduces the AUC from cell 8.

In [10]:
clf = Calf(order_col=True).fit(X_test, y_test)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)

0.9299999999999999

### Predict unseen sentiment

Calf does not have enough training data to show skill predicting the unseen testing set.

In [11]:
clf = Calf(order_col=True).fit(X_train, y_train)
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)

0.6900000000000001

Predicting the probability of the sentiment class also requires additional training data.

In [12]:
y_pred = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred)

0.7846
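
The AUC from `predict_proba` is higher than the AUC from `predict` on the same testing set. One plausible reason, assuming `predict` returns hard class labels, is that thresholding discards the ranking information that AUC measures. The sketch below illustrates this with a hand-written pairwise AUC on made-up scores; `pairwise_auc` and the toy values are illustrative only and are not Calf output.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and probabilities for four reviews
y_true = [0, 0, 1, 1]
proba = [0.1, 0.3, 0.4, 0.9]

# Hard labels from a 0.5 threshold: the third review becomes a tie with the negatives
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = pairwise_auc(y_true, proba)
auc_labels = pairwise_auc(y_true, labels)
print(auc_proba, auc_labels)  # 1.0 0.75
```

The probabilities rank every positive above every negative, while the thresholded labels tie one positive with both negatives, so the label-based AUC is lower.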