#Prepare the dataset

Upload files to Google Colaboratory

Please download the imdb_labeled.txt file to your machine, then use the button below to upload it to Colab.


In [49]:
from google.colab import files
uploaded = files.upload()

Saving imdb_labeled.txt to imdb_labeled (2).txt


The dataset is a TSV file in which each line has the format instance**\t**label.

Instance: free-text movie comments
Label: 0 or 1. 0 stands for negative and 1 stands for positive

In [70]:
# we could also use pandas to load the file. Please complete the following code blocks

import pandas as pd
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', None)
# now use pd.read_csv() to read the dataset and save to a variable called df
# you could look at the documentation to figure out what parameters are needed
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

#finish your codes here
df = pd.read_csv('imdb_labeled.txt', sep='\t', header=None, names=['Instance', 'Label'])


In [71]:
#let's see what df looks like
df.head(10)



Unnamed: 0,Instance,Label
0,Instance,Label
1,"A very, very, very slow-moving, aimless movie about a distressed, drifting young man.",0
2,"Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.",0
3,"Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.",0
4,Very little music or anything to speak of.,0
5,The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.,1
6,"The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.",0
7,Wasted two hours.,0
8,"Saw the movie today and thought it was a good effort, good messages for kids.",1
9,A bit predictable.,0
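
Row 0 of the output above repeats the column names, which suggests that the file has its own header line and that `header=None` read it as a data row. The following is a minimal sketch of the difference, assuming a tiny in-memory stand-in for the file (the sample rows are hypothetical); it is not a required change to the cell above.

```python
import io
import pandas as pd

# A tiny stand-in for the TSV file (hypothetical content), including a header line.
sample_tsv = "Instance\tLabel\nWasted two hours.\t0\nA good effort.\t1\n"

# header=None treats the header line as a data row, so Label holds strings.
df_as_data = pd.read_csv(io.StringIO(sample_tsv), sep='\t', header=None,
                         names=['Instance', 'Label'])

# header=0 consumes the first line as the header, so Label is parsed as integers.
df_skipped = pd.read_csv(io.StringIO(sample_tsv), sep='\t', header=0,
                         names=['Instance', 'Label'])

print(len(df_as_data), df_as_data['Label'].tolist())
print(len(df_skipped), df_skipped['Label'].tolist())
```

If the notebook were switched to `header=0`, later comparisons would need the integers 0 and 1 instead of the strings '0' and '1', and the row counts recorded in the outputs would change.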


#Pipeline step 0: Train, valid, and test set split

In [72]:
#first, let's calculate the size of each set
train_size = int(len(df) * 0.7)
valid_size = int(len(df) * 0.1)
test_size = len(df) - train_size - valid_size

print('train size:', train_size)
print('valid size:', valid_size)
print('test size:', test_size)

train size: 2100
valid size: 300
test size: 601


In [73]:
from sklearn.model_selection import train_test_split

#let's first split the dataset into train+valid and test
#pass instances and labels from pandas to the function
#make sure to set random_state so the split is reproducible

instance_trainvalid, instance_test, label_trainvalid, label_test = \
    train_test_split(df['Instance'], df['Label'], test_size=test_size, random_state=0)



In [74]:
# now split train+valid into train and valid separately

instance_train, instance_valid, label_train, label_valid = \
    train_test_split(instance_trainvalid, label_trainvalid, test_size=valid_size, random_state=0)

In [75]:
print('train set shape', instance_train.shape)
print('valid set shape', instance_valid.shape)
print('test set shape',  instance_test.shape)


train set shape (2100,)
valid set shape (300,)
test set shape (601,)


Note that we should validate each step so that errors do not propagate, e.g., print a few samples and check that each instance has the same label as in the original file.

In [76]:
train_set = pd.concat([instance_train, label_train], axis=1)

train_set.head()

Unnamed: 0,Instance,Label
1339,"It was an inexpensive piece, but I would still have expected better quality.",0
1057,VERY DISAPPOINTED.,0
1568,"Great it was new packaged nice works good, no problems and it came in less time then I expected!!!!",1
2435,So absolutley fantastic.,1
2352,He was terrible!,0
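
The manual check described above can also be automated. The following is a minimal sketch on a hypothetical six-row DataFrame; the toy data and the shuffle-based split are assumptions made only to keep the example self-contained. It relies on the fact that the split keeps the original row index, so each row can be looked up in the original DataFrame.

```python
import pandas as pd

# Hypothetical miniature version of df, only to keep the sketch self-contained.
df = pd.DataFrame({
    'Instance': ['good film', 'bad plot', 'great phone',
                 'poor sound', 'nice place', 'awful food'],
    'Label': ['1', '0', '1', '0', '1', '0'],
})

# Stand-in for the split: shuffle, then take 4 rows for training and 2 for testing.
shuffled = df.sample(frac=1, random_state=0)
train_set = shuffled.iloc[:4]
test_set = shuffled.iloc[4:]

# Check 1: every training row still carries the text and label it has in df.
original_rows = df.loc[train_set.index]
labels_match = (original_rows['Label'] == train_set['Label']).all()
instances_match = (original_rows['Instance'] == train_set['Instance']).all()

# Check 2: no row appears in both sets.
overlap = set(train_set.index) & set(test_set.index)

print(labels_match, instances_match, len(overlap))
```

The same two checks could be applied to the train, valid, and test sets produced by `train_test_split`.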


#Pipeline step 1: Text processing and representation (and examine the training set)

In [77]:
from nltk import word_tokenize

#this is the function that we wrote before to track word frequency
def create_word_freq(words):
    #step 1: create an empty dictionary
    word_freq = {}

    #step 2: iterate over each word in words and keep track of its frequency
    for word in words:
        if word not in word_freq:
            word_freq[word] = 1 #first time we see the word: add it to the dictionary
        else:
            word_freq[word] += 1 #seen before: increase its frequency

    #step 3: return the dictionary
    return word_freq

In [82]:
### Johnny testing
word_list = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
result = create_word_freq(word_list)
print(result)

{'apple': 3, 'banana': 2, 'orange': 1}


In [78]:
#same function we did before: print top k words
from operator import itemgetter
def print_top_freq_words(word_freq_dict, k):
    freqs = sorted(word_freq_dict.items(), key=itemgetter(1), reverse=True)
    print('top', k, 'terms')
    for freq in freqs[:k]:
        print(freq)
    print('total number of unique tokens', len(word_freq_dict))

In [86]:
print_top_freq_words(result, 50)

top 50 terms
('apple', 3)
('banana', 2)
('orange', 1)
total number of unique tokens 3
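
For comparison, the standard library offers the same counting and ranking through `collections.Counter`. This is an alternative sketch, not the approach used in the rest of the notebook:

```python
from collections import Counter

word_list = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']

# Counter builds the same word -> frequency mapping as create_word_freq
word_freq = Counter(word_list)

# most_common(k) returns the k most frequent (word, count) pairs, highest first
top_two = word_freq.most_common(2)
print(top_two)
print('total number of unique tokens', len(word_freq))
```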


In [91]:
#same function we did before: let's start with a basic text processing function
#which removes punctuation and stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab') # new update 11/30/2025 -JU
from nltk.corpus import stopwords
from string import punctuation
stopword_list = set(stopwords.words('english'))

def preprocess_text(document):
    #step 1: tokenize the document
    words_nltk = word_tokenize(document)

    #let's create a list to track the final tokens
    words_final = []

    #step 2: lowercase each token and keep it only if it is
    #neither a punctuation mark nor a stopword
    for word in words_nltk:
        word = word.lower()
        if word not in punctuation and word not in stopword_list:
            words_final.append(word)

    #step 3: return the final list of tokens
    return words_final

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
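
One detail of `preprocess_text` is worth knowing: `punctuation` is a single string, so `word not in punctuation` is a substring test. A multi-character token such as '...' is not a substring of `string.punctuation` and therefore passes the filter. The sketch below compares that test with a character-by-character check; the helper `is_punctuation` and the token list are hypothetical and only illustrate the difference.

```python
from string import punctuation

def is_punctuation(token):
    """Return True when every character of the token is a punctuation mark."""
    return len(token) > 0 and all(ch in punctuation for ch in token)

# Hypothetical tokens for illustration
tokens = ['great', '...', '!', "n't", '!!!!']

# Substring test, as used in preprocess_text
kept_by_substring_test = [t for t in tokens if t not in punctuation]

# Character-by-character test: multi-character punctuation is removed as well
kept_by_char_test = [t for t in tokens if not is_punctuation(t)]

print(kept_by_substring_test)
print(kept_by_char_test)
```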


In [109]:
### johnny testing
blah = preprocess_text('It was an inexpensive piece, but I would still have expected better quality')
thing = create_word_freq(blah)
print_top_freq_words(thing, 50)

top 50 terms
('inexpensive', 1)
('piece', 1)
('would', 1)
('still', 1)
('expected', 1)
('better', 1)
('quality', 1)
total number of unique tokens 7


In [101]:
train_set

Unnamed: 0,Instance,Label
1339,"It was an inexpensive piece, but I would still have expected better quality.",0
1057,VERY DISAPPOINTED.,0
1568,"Great it was new packaged nice works good, no problems and it came in less time then I expected!!!!",1
2435,So absolutley fantastic.,1
2352,He was terrible!,0
...,...,...
312,"One more thing: I can tolerate political incorrectness very well, I'm all for artistic freedom and suspension of disbelief, but the Slavic female character was just too much.",0
1755,The camera on the phone may be used as a dustpan when indoors... I'd rather be using a disposable then this.,0
1379,"Clear Skype Calls, Long Battery Life, Long Range.",1
328,Which has more depth and character than the man underneath it.,0


In [135]:
#with all the functions above, here is the function to wrap up
#it goes through each document, preprocesses it, and tracks the preprocessed words
#finally, it prints the top k words
def examine_collection(documents, top_freq_k):
  words_in_collection = []
  for document in documents:
    # print(document) ## johnny testing
    words_in_documents = preprocess_text(document)
    words_in_collection.extend(words_in_documents)
  word_freq = create_word_freq(words_in_collection)
  print_top_freq_words(word_freq, top_freq_k)


In [137]:
#let's see top 50 words for positive records in the training set

#train_set[train_set['Label']=='1']: get all the positive records in the training set
#train_set[train_set['Label']=='1']['Instance']: get the Instance column of the positive records

#the labels were loaded as text (dtype object), not integers, so we compare with '1' rather than 1
#(this is likely because the file's header line was read as a data row; see row 0 of df.head(10))
examine_collection(train_set[train_set['Label']=='1']['Instance'], 50)

top 50 terms
('great', 129)
('good', 115)
("'s", 106)
('film', 71)
('movie', 61)
('phone', 60)
('one', 56)
("n't", 47)
('best', 47)
('well', 44)
('like', 43)
('place', 42)
('service', 39)
('excellent', 39)
('food', 38)
('nice', 37)
('love', 37)
('really', 35)
('works', 34)
('time', 34)
('also', 31)
('even', 24)
('product', 24)
('price', 23)
('quality', 23)
('recommend', 23)
('ever', 23)
('amazing', 22)
('...', 22)
('back', 22)
('better', 22)
('pretty', 21)
('headset', 21)
('could', 21)
('made', 20)
('sound', 20)
('would', 20)
('everything', 19)
('go', 19)
("'m", 18)
('think', 18)
('say', 17)
('every', 17)
("'ve", 17)
('wonderful', 17)
('use', 17)
('friendly', 17)
('definitely', 16)
('battery', 16)
('case', 15)
total number of unique tokens 2565


In [138]:
#similarly, let's see top 50 words for negative records in the training set

#same as before: the label 0 is stored as text, not as an integer
examine_collection(train_set[train_set['Label']=='0']['Instance'], 50)

top 50 terms
("n't", 153)
('bad', 70)
("'s", 68)
('movie', 58)
('phone', 54)
('one', 53)
('would', 50)
('food', 47)
('like', 45)
('service', 44)
('place', 41)
('time', 40)
('back', 37)
('film', 35)
('good', 35)
('could', 33)
('even', 31)
('...', 30)
('ever', 29)
('really', 29)
('worst', 27)
('go', 26)
("'m", 25)
('never', 25)
('work', 25)
('also', 24)
('way', 23)
('much', 23)
('money', 23)
('get', 22)
('plot', 22)
('waste', 22)
('quality', 21)
('disappointed', 21)
('terrible', 21)
('minutes', 21)
("'ve", 20)
('poor', 20)
('thing', 19)
('going', 18)
('make', 18)
('got', 17)
('first', 17)
('think', 17)
('battery', 17)
('acting', 17)
('product', 17)
('use', 17)
('better', 16)
('ear', 16)
total number of unique tokens 2520


Observations

Comparing the top k words in positive vs. negative records, we find that words like great, excellent, nice, and amazing commonly appear in the positive instances.

In contrast, words like bad, worst, and waste commonly appear in the negative instances.

We also find that the word 'good' commonly appears in both positive (115 times) and negative (35 times) cases. Why does 'good' appear in negative cases? Let's see a few samples:


1.  Crust is **not good**.
2.  REAL sushi lovers, let's be honest - Yama is **not that good**.
3.  Unfortunately, it was **not good**.



In summary, these observations give some directions for the later steps:



*   If you want to build a rule-based model, you could build the rules using the top sentiment words above and take negation words into account. For example: if 'great' in the sentence AND 'not' not in the sentence: label=1
*   If you want to build a machine learning model, you could also include phrases (consecutive words) in the text representation. Recall that bag-of-words representations only use single words by default. If we also keep track of every 2 and every 3 consecutive words, phrases such as 'not good' and 'not that good' can be captured, respectively.
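
The rule-based direction can be sketched in a few lines. The word lists, the whitespace tokenization, and the function name `rule_based_label` below are all assumptions for illustration; a real rule set would be built from the training data and checked on the validation set.

```python
# Hypothetical word lists, loosely based on the top positive terms above
POSITIVE_WORDS = {'great', 'good', 'excellent', 'nice', 'amazing'}
NEGATION_WORDS = {'not', "n't", 'never'}

def rule_based_label(sentence):
    """Return 1 if a positive word appears without any negation word, else 0."""
    # Split "isn't" into "is n't" so the negation becomes its own token
    tokens = sentence.lower().replace("n't", " n't").split()
    # Drop punctuation attached to the ends of tokens
    tokens = [t.strip('.,!?') for t in tokens]
    has_positive = any(t in POSITIVE_WORDS for t in tokens)
    has_negation = any(t in NEGATION_WORDS for t in tokens)
    if has_positive and not has_negation:
        return 1
    return 0

print(rule_based_label('The food was great!'))
print(rule_based_label('Crust is not good.'))
```

A rule this simple labels every sentence without a listed positive word as negative, so it serves only as a starting point.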




#Pipeline step 2: establish the pipeline and build a basic model to get the baseline performance

In [140]:
#this step aims to get a sense of how basic text mining models behave in the dataset
#often we start with default settings

from sklearn.feature_extraction.text import TfidfVectorizer

#preprocess all the documents by calling the preprocess_text function that we wrote before
def preprocess_documents(instances):
  documents_final = []
  for instance in instances:
    instance_tokens = preprocess_text(instance)
    documents_final.append(' '.join(instance_tokens))
  return documents_final

#preprocess training set
instance_train_preprocessed = preprocess_documents(instance_train)

#represent training set
#write your codes here
#create a new TfidfVectorizer
vectorizer = TfidfVectorizer()
#fit and transform the preprocessed training set
instance_train_representation = vectorizer.fit_transform(instance_train_preprocessed)

#train a few baseline models and evaluate the model performance on the validation set
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

baselines = [KNeighborsClassifier(), MultinomialNB(), DecisionTreeClassifier(random_state=0)]

instances_valid_representation = vectorizer.transform(preprocess_documents(instance_valid))

for baseline in baselines:
  print('model', baseline)
  baseline.fit(instance_train_representation, label_train)
  #use the trained baseline model to predict the validation set (instances_valid_representation)
  predictions_valid = baseline.predict(instances_valid_representation)

  print(classification_report(label_valid, predictions_valid, digits=5))


model KNeighborsClassifier()
              precision    recall  f1-score   support

           0    0.71429   0.85714   0.77922       140
           1    0.84848   0.70000   0.76712       160

    accuracy                        0.77333       300
   macro avg    0.78139   0.77857   0.77317       300
weighted avg    0.78586   0.77333   0.77277       300

model MultinomialNB()
              precision    recall  f1-score   support

           0    0.86567   0.82857   0.84672       140
           1    0.85542   0.88750   0.87117       160

    accuracy                        0.86000       300
   macro avg    0.86055   0.85804   0.85894       300
weighted avg    0.86020   0.86000   0.85976       300

model DecisionTreeClassifier(random_state=0)
              precision    recall  f1-score   support

           0    0.69799   0.74286   0.71972       140
           1    0.76159   0.71875   0.73955       160

    accuracy                        0.73000       300
   macro avg    0.72979   0.7308

Naive Bayes gives the highest performance on the validation set (accuracy 0.86, versus 0.773 for k-nearest neighbors and 0.73 for the decision tree). This suggests continuing to optimize the pipeline with Naive Bayes.

# Pipeline step 3: examine the instances, error analysis, model selection, tuning, and repeat step 1-3

In [143]:
#now let's also capture 2 consecutive words (bigrams) and see how performance changes

vectorizer = TfidfVectorizer(ngram_range=(1,2))
instance_train_representation = vectorizer.fit_transform(instance_train_preprocessed)

baselines = [MultinomialNB()]

instances_valid_representation = vectorizer.transform(preprocess_documents(instance_valid))

for baseline in baselines:
  print('model', baseline)
  baseline.fit(instance_train_representation, label_train)
  predictions_valid = baseline.predict(instances_valid_representation)
  #write your codes here
  #now print the classification report on the performance of the validation set; use the digits of 5
  print(classification_report(label_valid, predictions_valid, digits=5))

model MultinomialNB()
              precision    recall  f1-score   support

           0    0.86567   0.82857   0.84672       140
           1    0.85542   0.88750   0.87117       160

    accuracy                        0.86000       300
   macro avg    0.86055   0.85804   0.85894       300
weighted avg    0.86020   0.86000   0.85976       300



In [144]:
#how many consecutive words should we choose?
#we can try different values and evaluate on the validation set for selection
#note that we often try up to 5. (very few phrases have a length over 5)
for ngram in range(1, 6):
  print('ngram value', ngram)
  vectorizer = TfidfVectorizer(ngram_range=(1,ngram))
  instance_train_representation = vectorizer.fit_transform(instance_train_preprocessed)

  baselines = [MultinomialNB()]

  instances_valid_representation = vectorizer.transform(preprocess_documents(instance_valid))

  for baseline in baselines:
    print('model', baseline)
    baseline.fit(instance_train_representation, label_train)
    predictions_valid = baseline.predict(instances_valid_representation)

    print(classification_report(label_valid, predictions_valid, digits=5))

ngram value 1
model MultinomialNB()
              precision    recall  f1-score   support

           0    0.86567   0.82857   0.84672       140
           1    0.85542   0.88750   0.87117       160

    accuracy                        0.86000       300
   macro avg    0.86055   0.85804   0.85894       300
weighted avg    0.86020   0.86000   0.85976       300

ngram value 2
model MultinomialNB()
              precision    recall  f1-score   support

           0    0.86567   0.82857   0.84672       140
           1    0.85542   0.88750   0.87117       160

    accuracy                        0.86000       300
   macro avg    0.86055   0.85804   0.85894       300
weighted avg    0.86020   0.86000   0.85976       300

ngram value 3
model MultinomialNB()
              precision    recall  f1-score   support

           0    0.85294   0.82857   0.84058       140
           1    0.85366   0.87500   0.86420       160

    accuracy                        0.85333       300
   macro avg    0.85

In the output shown, ngram = 2 ties with ngram = 1 for the highest validation accuracy (0.86), while ngram = 3 is slightly lower (0.853). In general, you could use k-fold cross-validation, which computes the average performance across folds for each value of ngram. This provides a more reliable assessment because the evaluation is repeated on multiple splits.
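
A minimal sketch of the k-fold idea, using `cross_val_score` from scikit-learn. The toy texts and labels are hypothetical; in this notebook they would be `instance_train_preprocessed` and `label_train`, and a larger `cv` such as 5 or 10 would be more typical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, only to keep the sketch self-contained
texts = ['great phone', 'excellent food', 'nice place', 'amazing film',
         'love it', 'works great', 'bad plot', 'worst service',
         'waste money', 'terrible acting', 'poor quality', 'bad food']
labels = ['1'] * 6 + ['0'] * 6

mean_accuracy = {}
for ngram in range(1, 4):
    # The vectorizer sits inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, ngram)), MultinomialNB())
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring='accuracy')
    mean_accuracy[ngram] = scores.mean()

for ngram, accuracy in mean_accuracy.items():
    print('ngram value', ngram, 'mean accuracy', round(accuracy, 5))
```

Putting the vectorizer inside the pipeline keeps the held-out fold from influencing the vocabulary or the IDF weights.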

In [145]:
#we can also print error cases

vectorizer = TfidfVectorizer(ngram_range=(1,2))
instance_train_representation = vectorizer.fit_transform(instance_train_preprocessed)

model = MultinomialNB()

instances_valid_representation = vectorizer.transform(preprocess_documents(instance_valid))

model.fit(instance_train_representation, label_train)
predictions_valid = model.predict(instances_valid_representation)

predictions_valid = pd.Series(predictions_valid, name='Pred', index=instance_valid.index)

valid_set_pred = pd.concat([instance_valid, label_valid, predictions_valid], axis=1)

for _, row in valid_set_pred.iterrows():
  if row.Label != row.Pred:
    print('instance', row.Instance, 'True label:', row.Label, 'Predicted', row.Pred)



instance REALLY UGLY. True label: 0 Predicted 1
instance Still, it was the SETS that got a big 10 on my oy-vey scale.   True label: 1 Predicted 0
instance Because both ears are occupied, background is not distracting at all. True label: 1 Predicted 0
instance It always cuts out and makes a beep beep beep sound then says signal failed. True label: 0 Predicted 1
instance The delivery was on time. True label: 1 Predicted 0
instance The food is about on par with Denny's, which is to say, not good at all. True label: 0 Predicted 1
instance Eew... This location needs a complete overhaul. True label: 0 Predicted 1
instance But I thought his acting was skilled.   True label: 1 Predicted 0
instance There is a totally unnecessary train/roller coaster scene.   True label: 0 Predicted 1
instance My phone doesn't slide around my car now and the grip prevents my phone from slipping out of my hand. True label: 1 Predicted 0
instance I will be back many times soon. True label: 1 Predicted 0
instance T

Other approaches to consider include: reviewing the most common errors and brainstorming accordingly, adding word-level features specifically for negation, switching to a deep-learning-based model that captures semantics, etc.

Also think about doing k-fold cross-validation, adding regularization, etc.
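
As one possible form of a word-level negation feature, tokens that follow a negation word can be given a prefix so that 'good' and negated 'good' become different features. This is a common heuristic rather than something this notebook implements; the function name `mark_negation`, the word lists, and the 'NOT_' prefix are all assumptions for illustration.

```python
# Hypothetical word lists for illustration
NEGATION_WORDS = {'not', 'no', 'never', "n't"}
CLAUSE_END = {'.', ',', ';', '!', '?'}

def mark_negation(tokens):
    """Prefix tokens that follow a negation word with 'NOT_' until the clause ends."""
    marked = []
    negated = False
    for token in tokens:
        if token in CLAUSE_END:
            negated = False          # punctuation closes the negated span
            marked.append(token)
        elif token in NEGATION_WORDS:
            negated = True           # start marking from the next token
            marked.append(token)
        elif negated:
            marked.append('NOT_' + token)
        else:
            marked.append(token)
    return marked

tokens = ['the', 'food', 'was', 'not', 'good', 'at', 'all', ',', 'but', 'cheap']
print(mark_negation(tokens))
```

For this to help, the marking would need to run before punctuation and stopwords are removed, since it depends on both.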

#Pipeline step 4: evaluate on the testing set

In [146]:
instances_test_representation = vectorizer.transform(preprocess_documents(instance_test))

predictions_test = model.predict(instances_test_representation)

print(classification_report(label_test, predictions_test, digits=5))

              precision    recall  f1-score   support

           0    0.79470   0.78947   0.79208       304
           1    0.78595   0.79125   0.78859       297

    accuracy                        0.79035       601
   macro avg    0.79033   0.79036   0.79033       601
weighted avg    0.79038   0.79035   0.79036       601



For a baseline model like Naive Bayes, an F1-score of about 0.79 is reasonable. Accuracy drops from 0.86 on the validation set to about 0.79 on the test set, a gap of roughly 7 percentage points, which suggests the model generalizes reasonably well, though not perfectly.

#Additional references

A comprehensive list of supervised learning models in sklearn: http://scikit-learn.org/stable/supervised_learning.html

Sklearn pipeline http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html: an automatic way to perform steps 1-3

Sklearn parameter tuning http://scikit-learn.org/stable/modules/grid_search.html: find the best set of parameters automatically
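
The last two references can be combined: a `Pipeline` chains the vectorizer and the model, and `GridSearchCV` tries each parameter combination with cross-validation. The sketch below uses hypothetical toy data and an arbitrary small grid; the step names 'tfidf' and 'nb' are labels chosen here, not names used elsewhere in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only to keep the sketch self-contained
texts = ['great phone', 'excellent food', 'nice place', 'amazing film',
         'love it', 'works great', 'bad plot', 'worst service',
         'waste money', 'terrible acting', 'poor quality', 'bad food']
labels = ['1'] * 6 + ['0'] * 6

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'nb__alpha': [0.1, 1.0],  # smoothing strength of Naive Bayes
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
predictions = search.predict(['great food', 'bad service'])
print(predictions)
```

After fitting, `search` behaves like the best pipeline refit on all the data it was given, so it can be applied directly to raw text.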