# Assignment 4 - Active Learning on the IMDB Dataset

In this assignment, you'll perform active learning on the IMDB dataset: you play the role of the expert/teacher, and a multinomial naive Bayes model plays the role of the learner/student.

In [1]:
import codecs
from IPython import display  # public module providing display() and HTML
import numpy as np
import pickle
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [2]:
# You probably do not need to modify this.
def doc2html(doc_raw, w, vocabulary, tokenizer, size_ranges):
    html_rep = ""
    tokens = doc_raw.split(" ") 
    seen_tokens = set()
    for token in tokens:
        vocab_tokens = tokenizer.findall(token.lower())
        if len(vocab_tokens) > 0:
            vocab_token = vocab_tokens[0]
            if vocab_token in vocabulary:
                vocab_index = vocabulary[vocab_token]
                
                ws = "(%0.2f)" %w[vocab_index]
                #print(ws)

                if vocab_index not in seen_tokens:

                    if w[vocab_index] > 0: # positive word
                        s = np.sum(w[vocab_index] > size_ranges)
                        html_rep = html_rep + "<font size = " + str(s) + ", color=blue> " + token + ws + " </font>"

                    elif w[vocab_index] < 0: # negative word
                        s = np.sum(np.abs(w[vocab_index]) > size_ranges)
                        html_rep = html_rep + "<font size = " + str(s) + ", color=red> " + token + ws + " </font>"

                    else: # neutral word
                        html_rep = html_rep + "<font size = 1, color=black> " + token + " </font>"

                    seen_tokens.add(vocab_index)

                else: # if this is a token we have seen before
                    html_rep = html_rep + "<font size = 1, color=black> " + token + " </font>"
            else: # this token does not exist in the vocabulary
                html_rep = html_rep + "<font size = 1, color=gray> " + token + " </font>"
        else:
            html_rep = html_rep + "<font size = 1, color=gray> " + token + " </font>"
    return html_rep

In [3]:
# TODO 1: Modify this variable.
# I assume you have already cloned the class repository on your computer; point to the file in that repository.
PATH_TO_IMDB_FOLDER = "../../cs578/notebooks/data"
# Alternatively, you can try using urllib.request (urllib2 in Python 2) with the following url: https://github.com/CS578-S19/CS578/blob/master/notebooks/data/imdb.pickle.z


with open(PATH_TO_IMDB_FOLDER+"/imdb.pickle.z", 'rb') as f:
    compressed_data = f.read()

uncompressed_data = codecs.decode(compressed_data, 'zlib_codec')
imdb_data = pickle.loads(uncompressed_data)
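
The loading cell above reads a zlib-compressed pickle. As a minimal sketch of that round trip on made-up data (the `toy_data` dictionary is hypothetical and only mimics two keys used later in this notebook), the following compresses a pickled object with `zlib` and restores it with the same `codecs.decode(..., 'zlib_codec')` call:

```python
import codecs
import pickle
import zlib

# Hypothetical stand-in for the real dataset dictionary.
toy_data = {"train_corpus": ["good film", "bad film"], "y_test": [1, 0]}

# Write side: pickle the object, then compress the bytes with zlib.
compressed = zlib.compress(pickle.dumps(toy_data))

# Read side: same two steps as the cell above, in reverse order.
restored = pickle.loads(codecs.decode(compressed, "zlib_codec"))
print(restored == toy_data)
```

Only unpickle files from a source you trust, since `pickle.loads` can execute arbitrary code.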

In [4]:
tp = r"(?u)\b[\w\'/]+\b"
vect = CountVectorizer(token_pattern=tp, lowercase=True, ngram_range=(1, 1), min_df=100, max_df=0.7, binary=True)
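
To get a feel for what this vectorizer does, here is a small sketch on a made-up two-document corpus. The names `toy_corpus`, `toy_vect`, and `toy_X` are hypothetical, and `min_df`/`max_df` are left at their defaults because the toy corpus is far too small for the thresholds used above. It shows that the token pattern keeps apostrophes and slashes inside tokens and that `binary=True` records presence rather than counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

tp = r"(?u)\b[\w\'/]+\b"

# Hypothetical toy corpus; the real notebook uses the IMDB reviews.
toy_corpus = ["I don't like it, it's bad bad bad", "A 10/10 movie, I like it"]

toy_vect = CountVectorizer(token_pattern=tp, lowercase=True, binary=True)
toy_X = toy_vect.fit_transform(toy_corpus)

# Tokens such as "don't" and "10/10" survive as single vocabulary entries.
print(sorted(toy_vect.vocabulary_))

# "bad" occurs three times in the first document but is stored as 1.
print(toy_X.toarray())
```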

In [5]:
X_pool = vect.fit_transform(imdb_data['train_corpus'])
X_test = vect.transform(imdb_data['test_corpus'])
pool_corpus = imdb_data['train_corpus']

In [6]:
# For evaluation purposes
y_test = imdb_data['y_test']

In [7]:
tokenizer = re.compile(tp)
size_ranges = np.linspace(0, 1.5, 8)
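
`doc2html` turns a word's weight into a font size by counting how many of these thresholds the weight's magnitude exceeds. A minimal sketch of that mapping, using a hypothetical helper `font_size` that is not part of the assignment code:

```python
import numpy as np

size_ranges = np.linspace(0, 1.5, 8)  # 8 evenly spaced thresholds from 0 to 1.5

def font_size(weight, size_ranges):
    """Hypothetical helper: number of thresholds that |weight| exceeds."""
    return int(np.sum(np.abs(weight) > size_ranges))

# Larger magnitudes exceed more thresholds and so get larger fonts.
for weight in (0.1, -0.5, 2.0):
    print(weight, font_size(weight, size_ranges))
```

With eight thresholds, the font size ranges from 0 (weight exactly zero) to 8 (magnitude above 1.5).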

In [8]:
# TODO 2: Modify this; set it to the last two digits of your A#
last_two_digits = 72

In [9]:
rand = np.random.RandomState(last_two_digits)
candidates = list(rand.permutation(X_pool.shape[0]))

In [10]:
# TODO 3: Bootstrap your learning algorithm by labeling at least 5 documents. Go through the objects in the order of candidates.
# Modify the train_indices, y_train, and candidates variables accordingly.
# Print each document. Here are two examples. Please run the code and your examples will be different.

In [11]:
# Train is purposefully empty; you'll create it
train_indices = []
y_train = []

In [12]:
c = candidates[0]
print(c)
print(pool_corpus[c])

644
I saw this last week after picking up the DVD cheap. I had wanted to see it for ages, finding the plot outline very intriguing. So my disappointment was great, to say the least. I thought the lead actor was very flat. This kind of part required a performance like Johny Depp's in The Ninth Gate (of which this is almost a complete rip-off), but I guess TV budgets don't always stretch to this kind of acting ability.<br /><br />I also the thought the direction was confused and dull, serving only to remind me that Carpenter hasn't done a decent movie since In the Mouth of Madness. As for the story - well, I was disappointed there as well! There was no way it could meet my expectation I guess, but I thought the payoff and explanation was poor, and the way he finally got the film anti-climactic to say the least.<br /><br />This was written by one of the main contributors to AICN, and you can tell he does love his cinema, but I would have liked a better result from such a good initial prem

In [13]:
# Here, I decided that it is a negative document.
y_train.append(0)
train_indices.append(c)
candidates = candidates[1:]

In [14]:
# Here is another example.
c = candidates[0]
print(c)
print(pool_corpus[c])

5297
I saw this movie with my rock climbing instructor, and we found the entire thing so ridiculous as to be beyond pity. (For one, if Stallone is out free-climbing by himself, there's no need to carry any gear, but I guess those dangling carabiners look sorta "mountain climby," so let's throw them in). For those lobotomized folks who think that Colorado looks anything like the Dolomites in Italy (where the movie was filmed), well the Hollywood moguls have got a lot more ridiculous & foul-smelling stuff for you to swallow.


In [15]:
# Here, I decided that it is a negative document.
y_train.append(0)
train_indices.append(c)
candidates = candidates[1:]

In [16]:
# Continue labeling (at least three more).

In [17]:
# Here is another example.
c = candidates[0]
print(c)
print(pool_corpus[c])

10793
I can't believe how awful this movie turned out to be. I feel magnanimous even referring to it as a "movie". The acting was flat, the editing was terrible and the plot leaves many major questions unanswered. The premise was OK, if unoriginal: a small group of aliens is living in the US and trying to slowly take over humanity. But it goes rapidly downhill from there. How could they convince a "human" to accept an alien as his wife in order to make they alien-human hybrid they require? They show a larval alien but never show what it does. They have a plastic surgeon that can produce perfect looking skin on an industrial scale. They throw in the obligatory huge alien monster with teeth. The ending was almost too painful to watch. I suppose that I'm mostly disappointed that Bruce Boxlietner would have anything to do with this. How could he say to the huge alien monster with teeth, "Get away from him you son of a b*tch" with a straight face? It's a long fall from his Babylon 5 days. A

In [18]:
# Here, I decided that it is a negative document.
y_train.append(0)
train_indices.append(c)
candidates = candidates[1:]

In [19]:
# Here is another example.
c = candidates[0]
print(c)
print(pool_corpus[c])

18736
It's a little disconcerting to have a character named Gig Young in a movie...played by Gig Young. But this film is where Gig got his name and also a nice career boost after playing small parts under another name.<br /><br />I'm going to go against the majority of the other comments and state that I really enjoyed this film, mainly because of the vibrant performance of Barbara Stanwyck as Fiona. She was funny, angry, vulnerable, caring, and feisty as the oldest of three daughters whose mother died on the Lusitania, and whose father was later killed during Woar War I. <br /><br />As the "man" of the house, Fiona has stood steadfast for years against settling her father's will which would therefore allow a Donald Trump type named Charles Barclay to get the family home. But Fiona's keeping a secret as to why she hates Barclay so much. Geraldine Fitzgerald is the middle, flirty sister, who is married to an Englishman but craves her youngest sister's boyfriend (Gig Young).<br /><br />I

In [20]:
# Here, I decided  that it is a positive document.
y_train.append(1)
train_indices.append(c)
candidates = candidates[1:]

In [21]:
# Here is another example.
c = candidates[0]
print(c)
print(pool_corpus[c])

12411
Strummer's hippie past was a revelation, but overall this felt like crashing a wake. Campfire stories work best around the intimacy of a campfire. There were just too many semi-boring old friends anecdotes and too much filler stock footage. I love The Clash and Joe for not reuniting and selling their songs until now (FU Mick Jones), but this doc left me wanting..to relate more. Using campfire storytellers without proper explanation of who is telling the anecdote alienates the viewer to some extent. They should have been interviewed on their own. Even using Strummer's 'radio DJ voice' did little to glue the film together. And can someone explain all the flags flying behind the campfire scenes? After the awesome "Filth And The Fury" I hoped Temple could deliver. A Joe Strummer doc deserves better.


In [22]:
# Here, I decided  that it is a negative document.
y_train.append(0)
train_indices.append(c)
candidates = candidates[1:]

In [23]:
# TODO 4: Run this.
mnb = MultinomialNB()
mnb.fit(X_pool[train_indices], y_train)
w = mnb.feature_log_prob_[1] - mnb.feature_log_prob_[0]
mnb.score(X_test, y_test)

0.50012
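
The weight vector `w` is the per-word difference between the log-probabilities of the positive and negative classes, so positive values favor class 1 and negative values favor class 0. A small sketch on a made-up 3-word vocabulary (the `toy_` names are hypothetical, and the default smoothing `alpha=1.0` is assumed) makes the arithmetic explicit:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns are three hypothetical words: shared, only-positive, only-negative.
toy_X = np.array([[1, 1, 0],   # positive document
                  [1, 0, 1]])  # negative document
toy_y = [1, 0]

toy_mnb = MultinomialNB()  # default alpha=1.0 (Laplace smoothing)
toy_mnb.fit(toy_X, toy_y)

# Same formula as in the cell above.
toy_w = toy_mnb.feature_log_prob_[1] - toy_mnb.feature_log_prob_[0]

# Smoothed probabilities are (count + 1) / (class total + 3), so the
# weights are log(1), log(2), and -log(2) respectively.
print(toy_w)
```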

In [24]:
# TODO 5: Perform uncertainty sampling for 25 iterations, one document at a time.
# Each time, display the document using display.HTML(doc2html(pool_corpus[c], w, vect.vocabulary_, tokenizer, size_ranges))
# and then label it. Modify the train_indices, y_train, and candidates variables accordingly.
# After each labeling, copy and paste the code of TODO 4 and run it.
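
One common form of uncertainty sampling picks the candidate whose most probable class has the lowest probability (least confidence). A minimal sketch on made-up numbers, where `toy_candidates` and `toy_probs` are hypothetical stand-ins for `candidates` and the output of `predict_proba`:

```python
import numpy as np

toy_candidates = [10, 20, 30]  # hypothetical pool indices

# One row per candidate: [P(negative), P(positive)].
toy_probs = np.array([[0.90, 0.10],
                      [0.55, 0.45],
                      [0.20, 0.80]])

# Confidence is the probability of the predicted class.
confidence = np.max(toy_probs, axis=1)

# The least confident candidate is the one closest to a 50/50 split.
picked = toy_candidates[int(np.argmin(confidence))]
print(picked)
```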

In [25]:
budget = 25
for b in range(budget):
    decision = '?'
    val = '?'
    probs = mnb.predict_proba(X_pool[candidates])
    c = candidates[np.argmin(np.max(probs, axis=1))]
    print("Document",str(b+1)+':')
    display.display(display.HTML(doc2html(pool_corpus[c], w, vect.vocabulary_, tokenizer, size_ranges)))
    while True:
        decision = input("Is this document positive (1) or negative (0)?")
        if decision == '1':
            val = 'positive'
            break
        elif decision == '0':
            val = 'negative'
            break
        else:
            print("Invalid Input. Please enter 1 for positive or 0 for negative")
    print("\nThis document is classified as", val)
    train_indices.append(c)
    candidates.remove(c)
    y_train.append(int(decision))
    mnb = MultinomialNB()
    mnb.fit(X_pool[train_indices], y_train)
    w = mnb.feature_log_prob_[1] - mnb.feature_log_prob_[0]
    print('Acccuracy: ', mnb.score(X_test, y_test))
    print('\n\n')

Document 1:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.50128



Document 2:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.5058



Document 3:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.50176



Document 4:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.5008



Document 5:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.50368



Document 6:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.51428



Document 7:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.53364



Document 8:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.56324



Document 9:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.54556



Document 10:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.53068



Document 11:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.54524



Document 12:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.51656



Document 13:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.52344



Document 14:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.51656



Document 15:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.54084



Document 16:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.5916



Document 17:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.59824



Document 18:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.59132



Document 19:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.62064



Document 20:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.60912



Document 21:


Is this document positive (1) or negative (0)?0

This document is classified as negative
Acccuracy:  0.6078



Document 22:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.61456



Document 23:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.61484



Document 24:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.61148



Document 25:


Is this document positive (1) or negative (0)?1

This document is classified as positive
Acccuracy:  0.58844





In [26]:
# TODO 6: Discuss your findings and your experience.

#### It was interesting to see that, initially, the model picked up genuinely positive and negative words but still predicted the labels incorrectly, because sarcasm and negations such as "not" changed the overall sentiment of a document.
#### As the process continued, the accuracy increased, and the most negative and positive words became helper words such as better, least, and instead.
#### The most interesting part was how seemingly random words such as could, her, she, played, ... were weighted as either positive or negative as more labeled documents were added (the accuracy also jumped up and down when this occurred). It was interesting to see these words receive such high weights.
#### It was sometimes difficult for me to classify documents that gave both positives and negatives of the movie. It would be helpful to see the numeric rating of each review. Even then, classifying a document as positive or negative is tricky, since ratings can fall in the middle. A yes/no question about whether the reviewer would recommend the movie to others would be helpful in this case.
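
One plausible explanation for neutral words receiving large weights is the tiny training set: a word that happens to appear only in documents of one class gets the same kind of weight as a genuinely sentiment-bearing word. The sketch below illustrates this on a made-up 3-word vocabulary. The words, the `toy_` names, and the data are hypothetical and are not taken from the actual run above.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

toy_vocab = ["great", "awful", "she"]  # hypothetical vocabulary

# "she" co-occurs with "great" purely by chance in this tiny sample.
toy_X = np.array([[1, 0, 1],   # positive review
                  [1, 0, 1],   # positive review
                  [0, 1, 0]])  # negative review
toy_y = [1, 1, 0]

toy_mnb = MultinomialNB().fit(toy_X, toy_y)
toy_w = toy_mnb.feature_log_prob_[1] - toy_mnb.feature_log_prob_[0]

# "she" ends up exactly as positive as "great".
for word, weight in zip(toy_vocab, toy_w):
    print(word, round(float(weight), 3))
```

As more labeled documents arrive, such accidental co-occurrences tend to even out, which is consistent with the weights and the accuracy shifting between iterations.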