## Working with text data

In this notebook, we see how to perform machine learning on text data. Specifically, we look at a large number of reviews from IMDB and try to predict whether each review is positive or negative. This task is known as "sentiment analysis".

To run this notebook, you need to download the data (the files "reviews.txt" and "labels.txt", which you can find on ItsLearning) and put them in the same folder as this notebook (or alter the paths in the `read_csv` calls below).

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

We load the reviews and labels as pandas DataFrames, encode the labels as 1 (positive) and 0 (negative), and print the beginning of the first few reviews:

In [3]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


We can access a single review by its column and row label. Because the row labels start at 0, `reviews[0][5]` returns the sixth review:

In [4]:
reviews[0][5]

'this film lacked something i couldn  t put my finger on at first charisma on the part of the leading actress . this inevitably translated to lack of chemistry when she shared the screen with her leading man . even the romantic scenes came across as being merely the actors at play . it could very well have been the director who miscalculated what he needed from the actors . i just don  t know .  br    br   but could it have been the screenplay  just exactly who was the chef in love with  he seemed more enamored of his culinary skills and restaurant  and ultimately of himself and his youthful exploits  than of anybody or anything else . he never convinced me he was in love with the princess .  br    br   i was disappointed in this movie . but  don  t forget it was nominated for an oscar  so judge for yourself .  '

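We use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a bag-of-words representation of the reviews, in which each review is described by how often each word occurs in it. We keep only the 10,000 most frequent words as features (`max_features=10000`).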

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
vect = CountVectorizer(max_features=10000).fit(reviews[0])
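
To make the bag-of-words idea concrete, here is a minimal sketch on a made-up three-document corpus. The names `toy_corpus` and `toy_vect` are illustrative only and are not part of the notebook's data. Each row of the resulting matrix corresponds to one document, and each column counts how often one vocabulary word occurs in it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the representation.
toy_corpus = ["good movie", "bad movie", "good good plot"]

# Learn the vocabulary; every distinct word gets a column index.
toy_vect = CountVectorizer().fit(toy_corpus)

# List the words in column order (index 0 first).
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

# Dense count matrix: one row per document, one column per word.
counts = toy_vect.transform(toy_corpus).toarray()

print(vocab)
print(counts)
```

The real matrix built from the reviews has the same structure, but with up to 10,000 columns, which is why `transform` returns it as a sparse matrix rather than a dense array.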

We transform the reviews into a bag-of-words matrix and split the data into a training set and a test set:

In [6]:
X = vect.transform(reviews[0])
Y = (labels=='positive').astype(np.int_)

X_train, X_test, y_train, y_test = train_test_split(X, Y)
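
As a small illustration of what `train_test_split` does, the sketch below splits a made-up array of 8 samples. The arrays `X_toy` and `y_toy` and the choice `random_state=0` are assumptions for this example only. With the default settings, 25% of the samples are held out for testing, and fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 2 features each.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Default test_size is 0.25; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed gives the same split.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, random_state=0)
print(np.array_equal(X_te, X_te2))
```

The split in the cell above does not set `random_state`, so the exact accuracies reported later can differ from run to run.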

We train a decision tree on the training data:

In [7]:
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [8]:
print("Accuracy on training data = {}".format(clf.score(X_train,y_train)))
print("Accuracy on test data = {}".format(clf.score(X_test,y_test)))

Accuracy on training data = 1.0
Accuracy on test data = 0.71104


We can also write and classify our own reviews. Note that reviews A and B below have the same bag-of-words representation, because they contain exactly the same words and differ only in word order:

In [9]:
bad_review = vect.transform(["This is the worst movie of all time!"])
reviewA = vect.transform(["This is not a good movie, it is actually really bad!"])
reviewB = vect.transform(["This is not a bad movie, it is actually really good!"])
print(reviewA.nonzero())
print(reviewB.nonzero())

(array([0, 0, 0, 0, 0, 0, 0, 0, 0]), array([  90,  665, 3847, 4715, 4728, 5863, 6088, 7162, 8990]))
(array([0, 0, 0, 0, 0, 0, 0, 0, 0]), array([  90,  665, 3847, 4715, 4728, 5863, 6088, 7162, 8990]))
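
The following sketch checks the same thing on a vectorizer fitted only to the two sentences, and then shows one common way to recover some word-order information: counting word pairs (bigrams) as well as single words via `ngram_range=(1, 2)`. This setting is not used anywhere else in this notebook; it is included only to illustrate what the plain bag-of-words representation discards.

```python
from sklearn.feature_extraction.text import CountVectorizer

text_a = "This is not a good movie, it is actually really bad!"
text_b = "This is not a bad movie, it is actually really good!"

# Single words only: both sentences contain the same words equally often.
unigram = CountVectorizer().fit([text_a, text_b])
a1, b1 = unigram.transform([text_a, text_b]).toarray()
same_unigrams = bool((a1 == b1).all())

# Single words plus bigrams: pairs such as "not good" and "not bad" differ.
bigram = CountVectorizer(ngram_range=(1, 2)).fit([text_a, text_b])
a2, b2 = bigram.transform([text_a, text_b]).toarray()
same_bigrams = bool((a2 == b2).all())

print(same_unigrams, same_bigrams)
```

With single-word counts the two vectors are identical, so no classifier trained on them can tell the two reviews apart. Adding bigrams makes the vectors differ, at the cost of a larger vocabulary.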


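Let's see which class the tree predicts for each review. Because reviews A and B have identical feature vectors, they necessarily receive the same prediction: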

In [10]:
print("Predicted class for bad review: {}".format(clf.predict(bad_review)[0]))
print("Predicted class for review A: {}".format(clf.predict(reviewA)[0]))
print("Predicted class for review B: {}".format(clf.predict(reviewB)[0]))

Predicted class for bad review: 0
Predicted class for review A: 0
Predicted class for review B: 0
