# Week 4, Lesson 3, Activity 5: End-to-end sentiment analysis

&copy;2021, Ekaterina Kochmar \
(edited: Nadejda Roubtsova, February 2022)

Your task in this activity is to:

- Implement a sentiment analysis algorithm and train it on the set of reviews provided with the notebook.

## Step 1: Data loading

We will be using the popular `polarity dataset 2.0` collected by [Bo Pang and colleagues from Cornell University](http://www.cs.cornell.edu/people/pabo/movie-review-data/). Let's first load the data.

In [1]:
import os, codecs

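def read_in(folder):
    """Read every non-hidden file in `folder` into a dict of file ID -> text."""
    a_dict = {}
    for a_file in sorted(os.listdir(folder)):
        if not a_file.startswith("."):  # skip hidden files such as .DS_Store
            with codecs.open(os.path.join(folder, a_file), encoding='ISO-8859-1', errors='ignore') as f:
                file_id = a_file.split(".")[0].strip()  # file name up to the first dot
                a_dict[file_id] = f.read()
            # no explicit f.close() is needed: the `with` block closes the file
    return a_dict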

When you download the dataset, it comes as two subfolders, `pos/` for all positive reviews and `neg/` for all negative ones, inside a folder called `review_polarity/txt_sentoken/`. If you don't change the folder names, you can simply read in the contents of all positive and negative reviews and store them in separate Python dictionaries that map review IDs (taken from the file names) to the review texts, using the function `read_in` from above.
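
To make the returned structure concrete, here is a minimal, self-contained sketch that builds the same kind of dictionary from a throwaway folder. The helper name `read_reviews`, the file names, and the review snippets are made up for illustration; only the ID-from-file-name logic mirrors `read_in`.

```python
import tempfile
from pathlib import Path

def read_reviews(folder):
    """Map each file's ID (the file name up to the first dot) to its text."""
    reviews = {}
    for path in sorted(Path(folder).iterdir()):
        if path.name.startswith("."):  # skip hidden files such as .DS_Store
            continue
        file_id = path.name.split(".")[0].strip()
        reviews[file_id] = path.read_text(encoding="ISO-8859-1", errors="ignore")
    return reviews

# Hypothetical toy folder standing in for review_polarity/txt_sentoken/pos/
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cv000_001.txt").write_text("a great film .", encoding="ISO-8859-1")
    Path(tmp, "cv001_002.txt").write_text("a moving story .", encoding="ISO-8859-1")
    Path(tmp, ".hidden").write_text("ignored", encoding="ISO-8859-1")
    toy_dict = read_reviews(tmp)

print(toy_dict)
```

The hidden file is skipped, and each remaining file contributes one `ID -> text` entry.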

Let's also print out the number of reviews in positive and negative dictionaries, as well as the very first positive and very first negative reviews in the dictionaries.

In [2]:
folder = "review_polarity/txt_sentoken"
pos_dict = read_in(f"{folder}/pos/")
print(f"Number of positive sentiment reviews: {len(pos_dict)}") # check that this is 1000
print(pos_dict.get(next(iter(pos_dict))))

neg_dict = read_in(f"{folder}/neg/")
print(f"Number of negative sentiment reviews: {len(neg_dict)}") # check that this is 1000
print(neg_dict.get(next(iter(neg_dict))))

Number of positive sentiment reviews: 1000
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hugh

## Step 2: Preprocess texts with spaCy

Import `spacy`; since processing with `spacy` might take time, let's run it once and store the results in dedicated data structures:

In [3]:
import spacy
nlp = spacy.load("en_core_web_md")

def spacy_preprocess_reviews(source):
    source_docs = {}
    index = 0
    for review_id in source.keys():
        #to speed processing up, you can disable "ner" – Named Entity Recognition module of spaCy
        source_docs[review_id] = nlp(source.get(review_id).replace("\n", ""), disable=["ner"])
        if index>0 and (index%200)==0:
            print(str(index) + " reviews processed")
        index += 1
    print("Dataset processed")
    return source_docs

pos_docs = spacy_preprocess_reviews(pos_dict)
neg_docs = spacy_preprocess_reviews(neg_dict)



200 reviews processed
400 reviews processed
600 reviews processed
800 reviews processed
Dataset processed
200 reviews processed
400 reviews processed
600 reviews processed
800 reviews processed
Dataset processed


## Step 3: Apply a machine learning classifier to the data

First, let's filter out punctuation marks and stopwords (you can experiment with other filters, or with only one of the two lists, if you'd like) and prepare the data for the machine learning pipeline:

In [4]:
import random
import string
from spacy.lang.en.stop_words import STOP_WORDS as stopwords_list # stopwords list
punctuation_list = list(string.punctuation) # one entry per ASCII punctuation mark

def text_filter(a_dict, label, exclude_lists):
    data = []
    for rev_id in a_dict.keys():
        tokens = []
        for token in a_dict.get(rev_id):
            if token.text not in exclude_lists:
                tokens.append(token.lemma_)
        data.append((' '.join(tokens), label))
    return data

def prepare_data(pos_docs, neg_docs, exclude_lists):
    data = text_filter(pos_docs, 1, exclude_lists)
    data += text_filter(neg_docs, -1, exclude_lists)
    random.seed(42)
    random.shuffle(data)
    texts = []
    labels = []
    for item in data:
        texts.append(item[0])
        labels.append(item[1])
    return texts, labels

# filter out both stopwords and punctuation marks;
# to filter out punctuation marks only, pass just punctuation_list as the last argument

texts, labels = prepare_data(pos_docs, neg_docs, list(stopwords_list) + punctuation_list)

print(f"Total number of reviews = {len(texts)} and labels = {len(labels)}") # there should be 2000 texts and 2000 labels
print(texts[0])

Total number of reviews = 2000 and labels = 2000
central focus michael winterbottom welcome sarajevo sarajevo city siege different effect character unfortunate stick prove backdrop stunningly realize story refreshingly stray mythic portent platoon racial tumultuosness risible walk dead tinge schmaltziness schindler list lead stephen dillane reporter emira nusevic orphan plight identify extremely believable moment involve ring false question go right question go wrong film fail provide political overview war progress dillane character report american plane depart sarajevo depart assortment high profile support actor range woody harrelson yankee reporter liquor cigarrette marisa tomei huggable child aid somesuch incapable rise sketchiness character albeit strive interrupted use authentic war footage somewhat hamper rest film make fictional character powerless comparison winterbottom eschews mawkishness flashy frantic editing imaginative use music plus toy emotion sentimental blandness wa
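
The filtering idea itself does not depend on spaCy. The sketch below applies it to hand-written `(text, lemma)` pairs that stand in for spaCy tokens; the pairs and the tiny exclusion set are invented for illustration. It uses a `set` for the exclusions, since membership tests on a set are typically faster than on a long list.

```python
import string

# Hypothetical (surface form, lemma) pairs standing in for spaCy tokens
toy_tokens = [("the", "the"), ("films", "film"), ("were", "be"),
              ("stunning", "stunning"), (",", ","), ("really", "really"), ("!", "!")]

# Toy exclusion set: a few stopwords plus all ASCII punctuation marks
exclude = {"the", "were", "really"} | set(string.punctuation)

# Keep the lemma of every token whose surface form is not excluded
kept_lemmas = [lemma for text, lemma in toy_tokens if text not in exclude]
print(" ".join(kept_lemmas))  # film stunning
```

As in `text_filter`, the check is made on the token text, while the lemma is what ends up in the output string.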

Let's use $80\%$ of this randomly shuffled set for training and the rest for testing:

In [5]:
def split(texts, labels, proportion):
    train_data = []
    train_targets = []
    test_data = []
    test_targets = []
    for i in range(0, len(texts)):
        if i < proportion*len(texts):
            train_data.append(texts[i])
            train_targets.append(labels[i])
        else:
            test_data.append(texts[i])
            test_targets.append(labels[i])
    return train_data, train_targets, test_data, test_targets

train_data, train_targets, test_data, test_targets = split(texts, labels, 0.8)
        
print(len(train_data)) # is this 1600?
print(len(train_targets)) # is this 1600?      
print(len(test_data)) # is this 400?       
print(len(test_targets)) # is this 400? 
print(train_targets[:10]) # print out the targets for the first 10 training reviews 
print(test_targets[:10]) # print out the targets for the first 10 test reviews 

1600
1600
400
400
[1, -1, 1, 1, -1, -1, -1, -1, 1, -1]
[-1, 1, 1, -1, -1, 1, -1, 1, 1, 1]
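
Because the data is already shuffled, the same split can also be written with list slicing. This is only an alternative sketch: the name `split_by_slicing` and the toy data are made up, and `math.ceil` is used so that the cut point matches the `i < proportion*len(texts)` test in `split`.

```python
import math

def split_by_slicing(texts, labels, proportion):
    """Return train/test parts by cutting both lists at a single index."""
    cut = math.ceil(proportion * len(texts))
    return texts[:cut], labels[:cut], texts[cut:], labels[cut:]

toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1, -1] * 5
tr_x, tr_y, te_x, te_y = split_by_slicing(toy_texts, toy_labels, 0.8)
print(len(tr_x), len(te_x))  # 8 2
```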


Now, let's estimate the distribution of words across texts using `sklearn`'s `CountVectorizer`:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_data)
# Check the dimensionality 
print(train_counts.shape)

(1600, 28621)


The shape `(1600, 28621)` shows that our training set contains over $28,000$ distinct words (the exact number may change depending on your split). This is our training set vocabulary, and it will be applied to both the training and the test reviews. Note that this vocabulary is learned on the training data only, so words that occur only in test reviews are ignored. Let's look 'under the hood' and print out the counts for some words in the first $11$ reviews (indices $0$ to $10$) from the training set:

In [7]:
print(train_counts[:11])

  (0, 4177)	1
  (0, 9571)	1
  (0, 16162)	1
  (0, 28100)	2
  (0, 27830)	1
  (0, 21925)	3
  (0, 4620)	1
  (0, 22919)	1
  (0, 6849)	1
  (0, 7866)	1
  (0, 4270)	5
  (0, 26625)	1
  (0, 24134)	1
  (0, 19959)	1
  (0, 2021)	2
  (0, 24384)	1
  (0, 20510)	1
  (0, 24226)	1
  (0, 20710)	1
  (0, 24280)	1
  (0, 16972)	1
  (0, 19414)	1
  (0, 19165)	1
  (0, 20262)	1
  (0, 26130)	1
  :	:
  (10, 25603)	1
  (10, 16800)	1
  (10, 8197)	1
  (10, 27703)	1
  (10, 8243)	1
  (10, 6623)	1
  (10, 14560)	1
  (10, 21431)	1
  (10, 11929)	1
  (10, 18603)	1
  (10, 9470)	1
  (10, 17787)	1
  (10, 23853)	1
  (10, 23963)	1
  (10, 24334)	1
  (10, 8294)	1
  (10, 11178)	1
  (10, 8066)	1
  (10, 9289)	1
  (10, 14387)	1
  (10, 17478)	1
  (10, 26713)	1
  (10, 20813)	1
  (10, 1865)	1
  (10, 386)	1


What do results like `(0, 4270)	5` and `(0, 24134)	1` mean? \
The first review (index 0, a positive review since it has label `1` in `train_targets`) contains $5$ occurrences of the word with index $4270$ and $1$ occurrence of the word with index $24134$ in the vocabulary. To find out which word an index corresponds to, look it up in the array returned by `get_feature_names_out()`, for example for indices $5285$ and $28620$:

In [8]:
count_vect.get_feature_names_out()[5285]

'conrad'

In [9]:
count_vect.get_feature_names_out()[28620]

'zzzzzzz'

In this run, index $5285$ corresponds to the word *conrad* and index $28620$, the last index in the vocabulary, to the word *zzzzzzz*. \
(Please note that you will get different words if you experimented with alternative preprocessing.) \
Here is how you can check the whole list of words (features) mapped to indices:

In [10]:
count_vect.vocabulary_

{'central': 4177,
 'focus': 9571,
 'michael': 16162,
 'winterbottom': 28100,
 'welcome': 27830,
 'sarajevo': 21925,
 'city': 4620,
 'siege': 22919,
 'different': 6849,
 'effect': 7866,
 'character': 4270,
 'unfortunate': 26625,
 'stick': 24134,
 'prove': 19959,
 'backdrop': 2021,
 'stunningly': 24384,
 'realize': 20510,
 'story': 24226,
 'refreshingly': 20710,
 'stray': 24280,
 'mythic': 16972,
 'portent': 19414,
 'platoon': 19165,
 'racial': 20262,
 'tumultuosness': 26130,
 'risible': 21349,
 'walk': 27599,
 'dead': 6259,
 'tinge': 25585,
 'schmaltziness': 22092,
 'schindler': 22075,
 'list': 14839,
 'lead': 14469,
 'stephen': 24102,
 'dillane': 6887,
 'reporter': 20963,
 'emira': 8099,
 'nusevic': 17566,
 'orphan': 17973,
 'plight': 19216,
 'identify': 12294,
 'extremely': 8819,
 'believable': 2507,
 'moment': 16535,
 'involve': 13185,
 'ring': 21323,
 'false': 8934,
 'question': 20192,
 'go': 10531,
 'right': 21301,
 'wrong': 28301,
 'film': 9276,
 'fail': 8896,
 'provide': 19962,
 

Alternatively, to print the vocabulary of features in alphabetical order, run:

In [11]:
count_vect.get_feature_names_out()

array(['00', '000', '0009f', ..., 'zwigoff', 'zycie', 'zzzzzzz'],
      dtype=object)
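
To see what the vocabulary and the counts represent conceptually, here is a plain-Python sketch of the bag-of-words idea on two toy texts. It is not how `CountVectorizer` is implemented: it splits on whitespace only, whereas `CountVectorizer` uses its own token pattern, and the toy texts and the helper `to_counts` are invented for illustration.

```python
from collections import Counter

toy_train = ["good film good cast", "dull plot dull film"]

# Vocabulary: every distinct training word, mapped to a column index in alphabetical order
distinct_words = sorted({word for text in toy_train for word in text.split()})
vocab = {word: idx for idx, word in enumerate(distinct_words)}

def to_counts(text, vocab):
    """Return {column index: count} for the words of `text` that are in `vocab`."""
    counts = Counter(word for word in text.split() if word in vocab)
    return {vocab[word]: count for word, count in counts.items()}

print(vocab)                                 # {'cast': 0, 'dull': 1, 'film': 2, 'good': 3, 'plot': 4}
print(to_counts(toy_train[0], vocab))        # {3: 2, 2: 1, 0: 1}
print(to_counts("good unseen word", vocab))  # {3: 1}
```

The last line shows the effect of learning the vocabulary on the training data only: words that were never seen in training simply get no column.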

Now let's convert word occurrences into binary values: use $1$ if the word occurs in a review, and $0$ otherwise:

In [12]:
from sklearn.preprocessing import Binarizer

transformer = Binarizer()
train_bin = transformer.fit_transform(train_counts)
print(train_bin.shape)
print(train_bin[0])

(1600, 28621)
  (0, 4177)	1
  (0, 9571)	1
  (0, 16162)	1
  (0, 28100)	1
  (0, 27830)	1
  (0, 21925)	1
  (0, 4620)	1
  (0, 22919)	1
  (0, 6849)	1
  (0, 7866)	1
  (0, 4270)	1
  (0, 26625)	1
  (0, 24134)	1
  (0, 19959)	1
  (0, 2021)	1
  (0, 24384)	1
  (0, 20510)	1
  (0, 24226)	1
  (0, 20710)	1
  (0, 24280)	1
  (0, 16972)	1
  (0, 19414)	1
  (0, 19165)	1
  (0, 20262)	1
  (0, 26130)	1
  :	:
  (0, 25115)	1
  (0, 1377)	1
  (0, 8810)	1
  (0, 12588)	1
  (0, 26072)	1
  (0, 9106)	1
  (0, 28214)	1
  (0, 17802)	1
  (0, 16767)	1
  (0, 8089)	1
  (0, 6408)	1
  (0, 8867)	1
  (0, 12670)	1
  (0, 4585)	1
  (0, 24689)	1
  (0, 28239)	1
  (0, 20586)	1
  (0, 188)	1
  (0, 8705)	1
  (0, 18724)	1
  (0, 12722)	1
  (0, 1643)	1
  (0, 13098)	1
  (0, 6461)	1
  (0, 5048)	1
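
The effect of binarization can be sketched without `sklearn`. The toy rows below are invented and use plain dictionaries instead of a sparse matrix; the rule itself (every count above the default threshold of $0$ becomes $1$) follows what `Binarizer()` does with its default settings.

```python
# Toy count rows: each dict maps a column index to how often that word occurs
toy_counts = [{3: 2, 2: 1, 0: 1}, {1: 2, 4: 1, 2: 1}]

# Binarize: any positive count becomes 1; absent words stay implicit zeros
toy_binary = [{col: 1 for col, count in row.items() if count > 0} for row in toy_counts]
print(toy_binary)  # [{3: 1, 2: 1, 0: 1}, {1: 1, 4: 1, 2: 1}]
```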


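Finally, let's train the classifier and run it on the designated test set. Note that the code below fits the Multinomial Naive Bayes classifier on the raw counts (`train_counts`), not on the binarized `train_bin`, and vectorizes the test reviews with the vocabulary learned on the training set: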

In [13]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(train_counts, train_targets)
test_counts = count_vect.transform(test_data)
predicted = clf.predict(test_counts)

for text, label in list(zip(test_data, predicted))[:10]:
    if label==1:
        print('%r => %s' % (text[:100], "pos"))
    else:
        print('%r => %s' % (text[:100], "neg"))

'lengthy lousy word describe boring drama english patient great acting music cinematography nice dull' => pos
'capsule short punchy action sequel dinosaur film steven spielberg joe johnston direct straightforwar' => neg
'lets look history shark film unforgettable jaw exciting jaw 2 flaky jaw 3d late 90 film genre recall' => neg
'girl spend day closed building inventory strange box gets deliver start sound familiar mistake girl ' => neg
'sided doom gloom documentary possible annihilation human race foretold bible orson welles narrate ap' => neg
'robert redford river run film watch masterpiece -- well film recent year 1994 second favorite film t' => pos
'susan granger review america sweetheart columbia sony waste talented cast billy crystal co writer pe' => neg
'  fugitive probably great thriller take realistic believable character tell exciting story totally b' => pos
'look year ago coen brother comedic gem big lebowski change actor away bowling alley add record store' => pos
'plot 10 w
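
For intuition about what the classifier learns, here is a small from-scratch sketch of Multinomial Naive Bayes with add-one smoothing (the same value as the default `alpha=1.0` of `MultinomialNB`). It is a teaching sketch under simplifying assumptions, not `sklearn`'s implementation: the function names and the four toy reviews are made up, and tokens are split on whitespace.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({word for text in texts for word in text.split()})
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)       # class -> number of documents
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, text):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[w] for w in text.split() if w in log_lik)
    return max(model, key=score)

toy_texts = ["great moving film", "great cast", "dull boring film", "boring plot"]
toy_labels = [1, 1, -1, -1]
model = train_nb(toy_texts, toy_labels)
print(predict_nb(model, "great film"))   # 1
print(predict_nb(model, "boring film"))  # -1
```

The word *film* occurs equally often in both classes, so the decision is driven by *great* and *boring*.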

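Alternatively, this is how you can chain the same steps using `sklearn`'s `Pipeline`. Note that this version differs from the code above in two ways: it binarizes the counts, and it restricts the vocabulary to words that occur in at least $10$ training reviews (`min_df=10`) and in at most half of them (`max_df=0.5`):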

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

text_clf = Pipeline([('vect', CountVectorizer(min_df=10, max_df=0.5)), 
                     ('binarizer', Binarizer()), # include this for detecting presence-absence of features
                     ('clf', MultinomialNB())
                    ])

text_clf.fit(train_data, train_targets) 
print(text_clf)
predicted = text_clf.predict(test_data)

Pipeline(steps=[('vect', CountVectorizer(max_df=0.5, min_df=10)),
                ('binarizer', Binarizer()), ('clf', MultinomialNB())])
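
The `min_df` and `max_df` settings prune the vocabulary by document frequency, that is, by the number of reviews a word occurs in. The sketch below illustrates the rule on toy data; the function `prune_vocabulary` and the texts are hypothetical, and the thresholds are scaled down (`min_df=2` instead of `10`) so that the effect is visible on four texts.

```python
from collections import Counter

def prune_vocabulary(texts, min_df, max_df):
    """Keep words whose document frequency is >= min_df (a count)
    and <= max_df (a fraction of all documents)."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.split()))  # count each word once per document
    limit = max_df * len(texts)
    return sorted(w for w, df in doc_freq.items() if min_df <= df <= limit)

toy_texts = ["film good cast", "film dull plot", "film good plot", "film odd"]
# "film" is in all 4 texts (more than 50%); "cast", "dull" and "odd" are in only 1 (fewer than 2)
print(prune_vocabulary(toy_texts, min_df=2, max_df=0.5))  # ['good', 'plot']
```

Very frequent words such as *film* carry little information about polarity in a movie review corpus, and very rare words give the classifier too few examples to learn from, which is one common motivation for pruning both ends.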


Evaluate the predictions of the pipeline on the test set:

In [15]:
from sklearn import metrics

print("\nConfusion matrix:")
print(metrics.confusion_matrix(test_targets, predicted))
print(metrics.classification_report(test_targets, predicted))


Confusion matrix:
[[165  37]
 [ 44 154]]
              precision    recall  f1-score   support

          -1       0.79      0.82      0.80       202
           1       0.81      0.78      0.79       198

    accuracy                           0.80       400
   macro avg       0.80      0.80      0.80       400
weighted avg       0.80      0.80      0.80       400
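
The figures in the classification report can be recomputed by hand from the confusion matrix, which may help when reading it. The sketch below assumes the usual `sklearn` layout (rows are true labels, columns are predicted labels, both in the order `[-1, 1]`) and uses the matrix printed above; the variable names are chosen for illustration.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, both in the order [-1, 1]
matrix = [[165, 37],
          [44, 154]]
class_labels = [-1, 1]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(class_labels))) / total

report = {}
for i, label in enumerate(class_labels):
    tp = matrix[i][i]                                   # correctly predicted as `label`
    predicted_as_label = sum(row[i] for row in matrix)  # column sum
    truly_label = sum(matrix[i])                        # row sum (the "support")
    precision = tp / predicted_as_label
    recall = tp / truly_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": truly_label}
    print(f"{label:>2}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={truly_label}")

print(f"accuracy={accuracy:.2f}")
```

For example, of the $202$ truly negative test reviews, $165$ are predicted as negative, which gives a recall of $165/202 \approx 0.82$ for the class `-1`.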

