This notebook trains a binary classifier on a dataset of movie reviews, each labelled as expressing either *positive* or *negative* sentiment towards the movie.

First we will install *scikit-learn* (imported as `sklearn`), which we will use for the machine learning.

In [3]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.



Next we will install the *datasets* library, which we will use to download the data. We will use the IMDB sentiment analysis dataset available from the [huggingface datasets library](https://huggingface.co/datasets/imdb) and described in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

In [4]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.



Now let's load the IMDB training set. We will print out the last instance.

In [5]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")['train']
print(imdb_dataset[-1])


{'text': 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.', 'label': 1}


Let's convert the training data into the format expected by scikit-learn: a list of input documents (the review texts) and a list of the associated output labels.

In [6]:
train_data = []
train_data_labels = []
for item in imdb_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1


We'll use the CountVectorizer class to extract the words in each review as the features the algorithm will learn from. Only the 500 most frequent words in the training data are used in this version, so each document is represented as a 500-dimensional vector of word counts.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=500,lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()
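To make the representation concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind a count vectorizer. It is not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy documents are assumptions made for illustration, and the tokenisation rule (lowercased runs of two or more word characters) is only an approximation of the default behaviour.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs, max_features):
    # Count every token across the corpus and keep the most frequent ones.
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc))
    top_words = [word for word, _ in totals.most_common(max_features)]
    return sorted(top_words)  # alphabetical column order

def transform(docs, vocab):
    # Build one row of word counts per document, one column per vocabulary word.
    column = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in tokenize(doc):
            if token in column:
                row[column[token]] += 1
        rows.append(row)
    return rows

# Hypothetical toy corpus, not taken from the IMDB data.
toy_docs = ["A great great movie", "A bad movie", "Great acting, bad plot"]
vocab = fit_vocabulary(toy_docs, max_features=3)
vectors = transform(toy_docs, vocab)
print(vocab)
print(vectors)
```

Words outside the retained vocabulary (here "acting" and "plot") are simply dropped, which is what limiting the vocabulary to the most frequent words implies for rarer terms in the reviews.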

As a sanity check, let's confirm that we have a 2-d array where each row is one of the 25,000 instances and each column is one of the 500 words. We also print out the words that will be used for classification.

In [8]:
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 500)
['10' 'able' 'about' 'absolutely' 'act' 'acting' 'action' 'actor' 'actors'
 'actually' 'after' 'again' 'against' 'all' 'almost' 'along' 'already'
 'also' 'although' 'always' 'am' 'amazing' 'american' 'an' 'and' 'another'
 'any' 'anyone' 'anything' 'are' 'around' 'art' 'as' 'at' 'audience'
 'away' 'awful' 'back' 'bad' 'based' 'be' 'beautiful' 'because' 'become'
 'becomes' 'been' 'before' 'beginning' 'behind' 'being' 'believe' 'best'
 'better' 'between' 'big' 'bit' 'black' 'book' 'boring' 'both' 'boy' 'br'
 'budget' 'but' 'by' 'called' 'came' 'camera' 'can' 'car' 'care' 'case'
 'cast' 'certainly' 'character' 'characters' 'child' 'children' 'cinema'
 'classic' 'close' 'come' 'comedy' 'comes' 'completely' 'could' 'couldn'
 'couple' 'course' 'dark' 'day' 'days' 'dead' 'death' 'definitely'
 'despite' 'dialogue' 'did' 'didn' 'different' 'direction' 'director' 'do'
 'does' 'doesn' 'doing' 'don' 'done' 'down' 'drama' 'during' 'dvd' 'each'
 'early' 'effects' 'either' 'else' 'end' 'e

Split the data into a training set and a validation (dev) set. We'll use 90% of the data for training and hold out the remaining 10% as the validation set, which we'll use to test our model.

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.9,random_state=123)
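As a rough illustration of what a seeded 90/10 split does, the sketch below shuffles the row indices with a fixed seed and cuts them at the 90% mark. It is a simplified stand-in, not scikit-learn's implementation, and `split_train_val` and the toy data are hypothetical names introduced here. With 25,000 reviews, a 90/10 split leaves 22,500 rows for training and 2,500 for validation.

```python
import random

def split_train_val(X, y, train_size=0.9, seed=123):
    # Shuffle the row indices reproducibly, then cut at the train_size mark.
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_train = int(len(X) * train_size)
    train_idx, val_idx = indices[:n_train], indices[n_train:]
    # Index features and labels with the same indices so pairs stay aligned.
    return (
        [X[i] for i in train_idx],
        [X[i] for i in val_idx],
        [y[i] for i in train_idx],
        [y[i] for i in val_idx],
    )

# Hypothetical toy data: 20 one-feature rows whose label is the parity of the feature.
toy_X = [[i] for i in range(20)]
toy_y = [i % 2 for i in range(20)]
toy_X_train, toy_X_val, toy_y_train, toy_y_val = split_train_val(toy_X, toy_y)
print(len(toy_X_train), len(toy_X_val))
```

Fixing the seed means the same rows land in the validation set on every run, so results from different runs are measured on the same held-out reviews.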

We will use a decision tree classifier to do the classification. First, create the model.

In [10]:
from sklearn import tree
model = tree.DecisionTreeClassifier()

Train the model.

In [11]:
model = model.fit(X=X_train,y=y_train)

Test the model on the validation set.

In [12]:
y_pred = model.predict(X_val)

Now let's calculate the accuracy of the model's predictions on the validation set.

In [13]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.6756


Calculate and print the confusion matrix for the model.

In [17]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_val, y_pred)
print(cm)

[[851 390]
 [421 838]]
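For readers who want to see where the four numbers come from, here is a small pure-Python sketch that tallies a binary confusion matrix in the same layout (rows are true labels, columns are predicted labels) and then recovers the accuracy reported earlier from the matrix printed above. The function name `confusion_counts` and the toy labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    # Tally each (true label, predicted label) pair for labels 0 and 1.
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    # Rows are true labels, columns are predicted labels.
    return [[tn, fp], [fn, tp]]

# Hypothetical toy labels, not taken from the IMDB data.
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 0, 1, 0]
toy_cm = confusion_counts(toy_true, toy_pred)
print(toy_cm)

# Accuracy is the diagonal (correct predictions) divided by the total.
notebook_cm = [[851, 390], [421, 838]]
correct = notebook_cm[0][0] + notebook_cm[1][1]
total = sum(sum(row) for row in notebook_cm)
accuracy_from_cm = correct / total
print(total, accuracy_from_cm)
```

The total of 2,500 matches the size of the 10% validation split, and the diagonal share matches the accuracy of 0.6756 computed with `accuracy_score`.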


Extract and assign variable names to the true positive, false positive, true negative and false negative values.

Use the relevant formulae to compute the true positive and true negative rates.

These metrics provide a more detailed breakdown of the model's accuracy in predicting positive and negative cases individually.
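For reference, the two rates are defined as follows. Here the label 1 (positive review) is taken as the positive class, and the counts are read from the confusion matrix printed above:

$$
\mathrm{TPR} = \frac{TP}{TP + FN} = \frac{838}{838 + 421} \approx 0.666,
\qquad
\mathrm{TNR} = \frac{TN}{TN + FP} = \frac{851}{851 + 390} \approx 0.686
$$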

In [18]:
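# For binary labels the matrix is [[tn, fp], [fn, tp]], so ravel() yields the counts in this order.
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)  # true positive rate: share of positive reviews predicted positive
print(f'True Positive Rate: {tpr:.6f}')
tnr = tn / (tn + fp)  # true negative rate: share of negative reviews predicted negative
print(f'True Negative Rate: {tnr:.6f}')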

True Positive Rate: 0.665608
True Negative Rate: 0.685737


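We can see that the decision tree classifier is slightly more reliable at predicting negative cases (a true negative rate of about 0.69) than positive cases (a true positive rate of about 0.67), but the accuracy is low overall.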

Other models, such as logistic regression and naive Bayes, typically achieve higher accuracy on word-count features like these and are likely to be more suitable for this type of data.
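As a starting point for such a comparison, the sketch below shows how alternative classifiers could be swapped in through the same `fit` and `predict` interface. It assumes scikit-learn's `LogisticRegression` and `MultinomialNB` as the alternatives, and it runs on a tiny hypothetical corpus rather than the IMDB data, so its predictions say nothing about real accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews (1 = positive, 0 = negative), for illustration only.
toy_texts = [
    "a wonderful, moving film",
    "great acting and a great story",
    "a dull and boring plot",
    "terrible acting, awful script",
]
toy_labels = [1, 1, 0, 0]

# Word-count features, built the same way as in the notebook.
toy_features = CountVectorizer().fit_transform(toy_texts)

# Candidate models; both expose the fit/predict interface used by the decision tree.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": MultinomialNB(),
}

toy_predictions = {}
for name, classifier in candidates.items():
    classifier.fit(toy_features, toy_labels)
    toy_predictions[name] = classifier.predict(toy_features)
    print(name, toy_predictions[name])
```

In the notebook itself, the same loop could be run with `X_train`, `y_train` and `X_val`, scoring each model with `accuracy_score` on the validation set so that the comparison with the decision tree is like for like.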