# Notebook 3 - Supervised Learning

CSI4106 Artificial Intelligence  
Fall 2019  
Prepared by Caroline Barrière and Julian Templeton

***INTRODUCTION***:

The supervised classification task tackled in this notebook is **polarity detection**, which is one possible activity within the quite popular trend of *Opinion Mining* in AI.  Many companies want to know whether there are positive or negative reviews about them.  Reviews can be on hotels, restaurants, movies, customer service of any kind, etc.

This notebook will allow you to better understand an ***experimental set-up*** for supervised machine learning, including the notions of training set, test set, evaluation, and bias.  The notebook also introduces the notion of comparative evaluation.  To say whether a method is good or not, we often compare it to a *baseline* approach.  

This notebook makes use of a really nice and popular machine learning package, called **scikit-learn** (http://scikit-learn.org/stable/).  It contains many pre-coded machine learning algorithms which you can call.  To use this package, you must install it. You will also need to install **Pandas**, which is a great tool for manipulating data to use in Machine Learning algorithms.  At the command prompt, type ***pip install scikit-learn*** and ***pip install pandas*** to install the packages.  

In this notebook we will use the Naive Bayes implementation and the Support Vector Machine (SVM) implementation for polarity detection of a large movie review dataset, but we will explore other ML algorithms included in scikit-learn in future notebooks.  

You will need to download the movie review dataset from the following shared Google Drive:
https://drive.google.com/file/d/1w1TsJB-gmIkZ28d1j7sf1sqcPmHXw352/view

This is a dataset of reviews from Rotten Tomatoes along with the Freshness of the review (Fresh or Rotten). We will be using this dataset throughout the notebook so be sure to place it in the same directory as this notebook. It contains 480000 reviews with half of them being rotten and the other half being fresh. We will only use a subset of these due to the large computation time of the Baseline and SVM learners.


***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook), and submit it.  

*The notebook will be marked on 20.  
Each **(TO DO)** has a number of points associated with it.*
***

**1. Polarity detection**  

In polarity detection, we use two classes: positive and negative.  This is different from sentiment analysis, for example, in which the classes might be (sad, happy, anxious, angry, etc).  It's also more restricted than *rating*, in which we would like to assign a value (0..5) to evaluate a particular service.  So, the polarity detection task aims to assign either *positive* or *negative* to a statement.

**2. Application domain:  Movie reviews**  

Polarity detection could be used on reviews of anything.  In this notebook, we wish to apply polarity detection within the domain of movies.  Movie reviewers give a review accompanied by a score for the movies that they review. Rotten Tomatoes is a website that collects movie reviews and the accompanying ratings, where a rating can be classified as "Rotten" for a low review score or "Fresh" for a higher review score. We will perform polarity detection on the dataset *rt_reviews.csv* that you downloaded earlier.

The first thing to do is to set up the training and testing sets for our models. We will build these sets by importing the data from the dataset using pandas, then passing that dataframe to scikit-learn's train_test_split function, which separates the data into a training set and a test set. These will be used later on by the models that will be created/used.

We **SHOULD NOT** use this test set to build our model later on. The test set (unseen data) is to test the model after we train it with the training set.

In [1]:
# Import the libraries that we will use to help create the train and test sets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Import the dataset, need to use the ISO-8859-1 encoding due to some invalid UTF-8 characters
df = pd.read_csv("rt_reviews.csv", encoding="ISO-8859-1")

The first step after loading the data is to take a quick look at it. Pandas offers the two useful functions df.head() and df.tail() which allow you to visualize the top and the bottom of your data frame.

In [2]:
df.head(5) # Show the first five reviews of the dataset to understand the dataframe's structure

Unnamed: 0,Freshness,Review
0,fresh,"Manakamana doesn't answer any questions, yet ..."
1,fresh,Wilfully offensive and powered by a chest-thu...
2,rotten,It would be difficult to imagine material mor...
3,rotten,Despite the gusto its star brings to the role...
4,rotten,If there was a good idea at the core of this ...


In [3]:
# Randomly select 10000 fresh examples from the dataframe
dfFresh = df[df["Freshness"] == "fresh"].sample(n=10000, random_state=5)
# Randomly select 10000 rotten examples from the dataframe
dfRotten = df[df["Freshness"] == "rotten"].sample(n=10000, random_state=3)
# Combine the results to make a small random subset of reviews to use
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
dfPartial = pd.concat([dfFresh, dfRotten])

In [4]:
# Split the data such that 90% is used for training and 10% is used for testing (separating the review
# from the freshness scores that we will use as the labels)
# Recall that we do not use this test set when building the model, only the training set
# We use the parameter stratify to split the training and testing data equally to create
# a balanced dataset
train_reviews, test_reviews, train_tags, test_tags = train_test_split(dfPartial["Review"],
                                                                      dfPartial["Freshness"],
                                                                      test_size=0.1, 
                                                                      random_state=10,
                                                                      stratify=dfPartial["Freshness"])
train_tags = train_tags.to_numpy()
train_reviews = train_reviews.to_numpy()
# Testing set (what we will use to test the trained model)
test_tags = test_tags.to_numpy()
test_reviews = test_reviews.to_numpy()
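
To see what the `stratify` parameter does, here is a minimal sketch on a hypothetical toy set of labels (not the Rotten Tomatoes data). The names `toy_reviews` and `toy_tags` are made up for illustration only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 "fresh" and 20 "rotten" placeholder reviews
toy_reviews = ["review %d" % i for i in range(40)]
toy_tags = ["fresh"] * 20 + ["rotten"] * 20

# Same call shape as in the notebook: 10% test set, stratified on the labels
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_reviews, toy_tags, test_size=0.1, random_state=10, stratify=toy_tags)

# With stratify, both splits keep the 50/50 class balance of the full set
print(Counter(toy_train_y))
print(Counter(toy_test_y))
```

Without `stratify`, the class proportions in a small test set can drift away from those of the full dataset purely by chance.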

**3. Bias:  Available resources**  

For polarity detection, some researchers have established lists of positive and negative words.  The ones used in this notebook have been downloaded from [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) (a website on Opinion Mining by renowned researcher Bing Liu) and stored locally.  The files *positive-words.txt* and *negative-words.txt* are in the Jupyter Notebook module in Brightspace.  Make sure you place these files in the same directory as your notebook.

As discussed in class, using any external resource introduces somewhat of a *bias* into the study of a problem.  In this particular case, though, the lists themselves have been compiled from data by other researchers.

In [5]:
# Read the positive words
# The ISO-8859-1 encoding is used to avoid decoding problems with this file

with open("positive-words.txt", encoding = "ISO-8859-1") as f:
    posWords = f.readlines()
# Keep only the lines that start with a letter (this skips blank lines, for example)
# and strip the trailing newline from each word
posWords = [p.rstrip("\n") for p in posWords if p[0].isalpha()] 

# print the first 50 words
print(posWords[:50])

['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accolades', 'accommodative', 'accomodative', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'accurate', 'accurately', 'achievable', 'achievement', 'achievements', 'achievible', 'acumen', 'adaptable', 'adaptive', 'adequate', 'adjustable', 'admirable', 'admirably', 'admiration', 'admire', 'admirer', 'admiring', 'admiringly', 'adorable', 'adore', 'adored', 'adorer', 'adoring', 'adoringly', 'adroit', 'adroitly', 'adulate', 'adulation', 'adulatory', 'advanced', 'advantage', 'advantageous']


In [6]:
# Read the negative words
# The ISO-8859-1 encoding is used to avoid decoding problems with this file

with open("negative-words.txt", encoding = "ISO-8859-1") as f:
    negWords = f.readlines()
# Keep only the lines that start with a letter (this skips blank lines, for example)
# and strip the trailing newline from each word
negWords = [p.rstrip("\n") for p in negWords if p[0].isalpha()] 

print(negWords[:50])

['abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted', 'aborts', 'abrade', 'abrasive', 'abrupt', 'abruptly', 'abscond', 'absence', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'absurdly', 'absurdness', 'abuse', 'abused', 'abuses', 'abusive', 'abysmal', 'abysmally', 'abyss', 'accidental', 'accost', 'accursed', 'accusation', 'accusations', 'accuse', 'accuses', 'accusing', 'accusingly', 'acerbate', 'acerbic', 'acerbically', 'ache', 'ached', 'aches', 'achey', 'aching', 'acrid', 'acridly', 'acridness', 'acrimonious', 'acrimoniously']


**4. Baseline approach**  

Before we evaluate the performances of a supervised learning approach, we can start by establishing a very simple baseline approach.  It's always good to start simple.  A baseline allows us to measure whether the additional complexity of the various models we develop is worth it or not.

The *baseline algorithm* we will use simply counts the number of positive and negative words in the review and outputs the category corresponding to the maximum.  This approach DOES NOT LEARN anything.  It just uses a particular *reasoning* (strategy at test time).  You might be surprised to find out how many *AI start-ups* within the area of Opinion Mining, do use this kind of simple approach.  

In [7]:
# First let's define methods to count positive and negative words

def countPos(text):
    count = 0
    for t in text.split():
        if t in posWords:
            count += 1
    return count

def countNeg(text):
    count = 0
    for t in text.split():
        if t in negWords:
            count += 1
    return count

In [8]:
# Simple counting algorithm as baseline approach to polarity detection
def baselinePolarity(review):
    numPos = countPos(review)
    numNeg = countNeg(review)
    if numPos > numNeg:
        return "fresh"   
    else:
        return "rotten"   
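
As a small illustration of how this counting rule behaves, the sketch below applies the same decision rule to hypothetical mini-lexicons (`toy_pos` and `toy_neg` are made up and much smaller than the real word lists). It uses sets instead of lists, since membership tests on a set are faster, and it highlights two properties of the rule: ties are labelled "rotten", and punctuation attached to a word prevents a match.

```python
# Hypothetical mini-lexicons standing in for posWords / negWords
toy_pos = {"lovely", "entertaining", "great"}
toy_neg = {"boring", "gunk", "awful"}

def toy_polarity(review):
    tokens = review.split()
    num_pos = sum(1 for t in tokens if t in toy_pos)
    num_neg = sum(1 for t in tokens if t in toy_neg)
    # Same decision rule as baselinePolarity: ties fall through to "rotten"
    return "fresh" if num_pos > num_neg else "rotten"

# Two positive words, no negative word
print(toy_polarity("a lovely and entertaining film"))
# No opinion words at all: 0 vs 0 is a tie, so the result is "rotten"
print(toy_polarity("a film with no opinion words"))
# "lovely," keeps its comma after split(), so only "boring" is counted
print(toy_polarity("lovely, but mostly boring"))
```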

In [9]:
# Test the baseline method
print("Testing baselinePolarity with the review:", train_reviews[0])
print("baselinePolarity result:", baselinePolarity(train_reviews[0]))
print("Actual result:", train_tags[0])
print(" ")
print("Testing baselinePolarity with the review:", train_reviews[1])
print("baselinePolarity result:", baselinePolarity(train_reviews[1]))
print("Actual result:", train_tags[1])

Testing baselinePolarity with the review:  You Again poses an interesting question -- what if our long-ago bullies were just as psychically scarred by the tormenting as their tormented victims were? -- but that curveball is buried under a lot of gunk.
baselinePolarity result: rotten
Actual result: rotten
 
Testing baselinePolarity with the review:  Joe Swanberg's starriest picture is a lovely slice of everything and nothing disguised as a murder mystery.
baselinePolarity result: rotten
Actual result: fresh


**5. Evaluation of the Baseline Approach**  
We saw in class that there could be multiple ways of evaluating an algorithm.  In the case of classification, a common evaluation method is simply to calculate the *number of wrong choices*.

To test our *baseline algorithm* we use the test set, defined earlier and calculate the number of wrong assignments.

In [10]:
# Function takes a one dimensional array of reviews and a one dimensional array of
# tags as input and prints the number of incorrect assignments when running the baseline approach
# on the reviews.
# Let's establish the polarity for each review
def incorrectReviews(reviews, tags):
    nbWrong = 0
    count = 0
    for i in range(len(reviews)):
        polarity = baselinePolarity(reviews[i])
        if (count < 10):
            print(reviews[i] + " -- Prediction: " + polarity + ". Actually: " + tags[i] + " \n")
            count += 1
        if (polarity != tags[i]):
            nbWrong += 1

    print('There are %s wrong predictions out of %s total predictions' %(nbWrong, len(tags)))    

In [11]:
# This may take a minute to run
incorrectReviews(test_reviews, test_tags)

 A loving homage to spy movies that stands as a film in its own right, Kingsman draws its influences from all the right places and, most importantly, is entertaining. -- Prediction: fresh. Actually: fresh 

 ...has little new going for it beyond a boatload of energy and enthusiasm. And pure formula. -- Prediction: fresh. Actually: rotten 

 Len Wiseman's remake improves on it [the original] many ways, though in other ways, it's slightly inferior. If only we could have combined the best parts of both! -- Prediction: fresh. Actually: fresh 

 The film asks down-and-dirty questions about what really resides beneath thousands of years of human progress, a savage and haunting antidote to the high-minded idealism of movies like Christopher Nolan's Interstellar and Ridley Scott's The Martian. -- Prediction: rotten. Actually: fresh 

 At least it's not "The Emoji Movie."  -- Prediction: rotten. Actually: rotten 

 Really, it's the same circumstance captured in Bye Bye Birdie, but Lee and Scham

**(TO DO) Q1 - 1 mark**  
Look at the ten outputs above which provide predictions from the Baseline approach for specified reviews along with their actual review class.
From the output, give the prior probabilities (no code needed) for each class based on the output given by the Baseline approach and based on the actual review class.

***Answer here***  
For the Baseline predictions:
P(fresh) = 4/10
P(rotten) = 6/10

For the actual outputs:
P(fresh) = 4/10
P(rotten) = 6/10

**6. Supervised learning method**  

We will now train a supervised learning model for polarity detection.

***6.1 Training data***  

In supervised learning, we need training data.  This training data must be *different* from, but *representative* of, the eventual test data. At the beginning of the notebook we defined the training data and the test data to be a subset of the entire dataset (20000 total rows from the 480000). We did this due to the large computation time of the Baseline Approach and the SVM approach that we will use later in this notebook. In practice we would want to use the entire dataset and ensure that we have trained our models with a large enough training set, so that the model has seen examples similar to most of the unseen data it will be asked to predict.

Usually a training set should be as large and varied as possible.  Training sets are very valuable, but they are costly to obtain, as they require tagging (human annotation) to generate them. Once again, the training set is used to train the model and the testing set is used to test how well the trained model performs on unseen examples.

In [12]:
# Looking at the shapes of the train and test datasets that we will be using
print(train_reviews.shape)
print(test_reviews.shape)

(18000,)
(2000,)


***6.2 Pre-processing of input data*** 

This Machine Learning package, *scikit-learn*, is somewhat particular in the way the data must be formatted to be used by the training algorithms.  So, we must perform some preprocessing on the sentences above.  Luckily *scikit-learn* provides some pre-defined functions for doing text pre-processing.  

We easily transform each sentence into a list of indexes into a dictionary.  The dictionary is built from the words in the sentences.  The keys of the dictionary are the words, and the value is an index.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# The CountVectorizer builds a dictionary of all words (count_vect.vocabulary_), 
# and generates a matrix (train_counts), to represent each sentence
# as a set of indices into the dictionary. The words in the dictionary are the words found in train_reviews.

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_reviews)

To understand what the code above does, first let's print the vocabulary gathered from the sentences in train_reviews.  

In [14]:
# print the vocabulary (dictionary of words)
print(count_vect.vocabulary_)



For example, you can interpret the output above as: 

'again':690  to mean that the word 'again' has been assigned index 690  
'poses':18445 to mean that the word 'poses' has been assigned index 18445

Then, let's print the *train_counts*.  

In [15]:
# print the content of the training examples in terms of frequency of words (each word represented by its index)
print(train_counts)

  (0, 27267)	1
  (0, 690)	1
  (0, 18445)	1
  (0, 1048)	1
  (0, 12692)	1
  (0, 19260)	1
  (0, 26723)	1
  (0, 11991)	1
  (0, 17077)	1
  (0, 14372)	1
  (0, 720)	1
  (0, 3278)	1
  (0, 26696)	2
  (0, 13315)	1
  (0, 1518)	2
  (0, 19048)	1
  (0, 21103)	1
  (0, 3398)	1
  (0, 24372)	1
  (0, 24783)	1
  (0, 24382)	1
  (0, 24782)	1
  (0, 26198)	1
  (0, 3373)	1
  (0, 24368)	1
  :	:
  (17998, 21268)	1
  (17998, 5471)	1
  (17998, 25073)	1
  (17998, 20199)	1
  (17998, 1943)	1
  (17998, 23998)	1
  (17998, 11619)	1
  (17998, 17779)	1
  (17998, 21747)	1
  (17999, 24372)	2
  (17999, 24368)	1
  (17999, 12919)	1
  (17999, 1077)	1
  (17999, 9133)	1
  (17999, 9837)	1
  (17999, 20327)	1
  (17999, 12205)	1
  (17999, 22687)	1
  (17999, 12952)	1
  (17999, 13425)	1
  (17999, 8731)	1
  (17999, 5135)	1
  (17999, 11785)	1
  (17999, 24169)	1
  (17999, 21091)	1


You can interpret each line above as:  

(0, 10829) 1  -- sentence 0 (in train_reviews) has 1 instance(s) of word 10829 (index of the word in count_vect.vocabulary_, that is the word 'gunk')  
(17999, 24372) 2  -- sentence 17999 (in train_reviews) has 2 instance(s) of word 24372 (index of the word in count_vect.vocabulary_, that is the word 'the')  

So train_counts contains, for each sentence, the bag of words (BOW) associated with that sentence, in the form of a list of indexes (each index corresponding to a word) with their counts.
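
To make the vocabulary and the count matrix easier to read, here is a minimal sketch on two hypothetical sentences (`toy_sentences` is made up for illustration). Note that with its default settings, `CountVectorizer` lowercases the text and ignores single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical sentences, small enough to inspect by hand
toy_sentences = ["a good film", "a bad bad film"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_sentences)

# Vocabulary: each word is mapped to a column index (assigned alphabetically)
print(sorted(toy_vect.vocabulary_.items()))

# Dense view of the sparse matrix: one row per sentence, one column per word
# Row 1 has a 2 in the 'bad' column because 'bad' appears twice
print(toy_counts.toarray())
```

The sparse printout used in the notebook lists only the non-zero cells of this matrix, as (row, column) pairs followed by the count.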

***6.3 Naive Bayes learning***

With the data preprocessed, we are ready to test the Naive Bayes algorithm provided by scikit-learn.  That algorithm requires the training data to be represented in terms of *train counts*, which is why we did the pre-processing above.

It's as easy as calling *fit*, as you see below, to train the model.  But you know what's underneath!  It estimates prior probabilities for the classes (fresh, rotten) and conditional probabilities (likelihoods) of words (features) given each class, e.g. P(awful|fresh) or P(awful|rotten).  All these probabilities are combined using Bayes' theorem.  
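
To illustrate what happens underneath, here is a hand-written sketch of a multinomial Naive Bayes classifier on a hypothetical four-review training set. It is not scikit-learn's implementation: the function names `train_nb` and `predict_nb` are made up, and add-one (Laplace) smoothing is an assumption of this sketch (it matches the default `alpha=1.0` of `MultinomialNB`).

```python
import math
from collections import Counter

# Hypothetical toy training set (not taken from rt_reviews.csv)
toy_train = [("great fun film", "fresh"), ("great acting", "fresh"),
             ("awful boring film", "rotten"), ("awful plot", "rotten")]

def train_nb(examples, alpha=1.0):
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    # Prior P(class): fraction of training examples in each class
    priors = {c: n / len(examples) for c, n in class_counts.items()}
    # Likelihood P(word|class) with add-alpha smoothing
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {w: (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
                          for w in vocab}
    return priors, likelihoods, vocab

def predict_nb(text, priors, likelihoods, vocab):
    scores = {}
    for c in priors:
        # Sum of logs instead of a product, to avoid numerical underflow
        score = math.log(priors[c])
        for w in text.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

priors, likelihoods, vocab = train_nb(toy_train)
print(priors)
print(likelihoods["rotten"]["awful"], likelihoods["fresh"]["awful"])
print(predict_nb("awful film", priors, likelihoods, vocab))
```

Here "awful" occurs twice among the five rotten tokens and the vocabulary has seven words, so P(awful|rotten) = (2+1)/(5+7) = 0.25, while P(awful|fresh) = (0+1)/(5+7).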

**(TO DO) Q2 - 2 marks**  
Before training the model, what are the prior probabilities of the fresh and rotten classes using the training set above?

In [16]:
# Find the prior probabilities for the fresh and rotten classes in the train set (train_tags) and the test set (test_tags)
# that we will be using.
# You must calculate it from the train and test sets, then print the calculated result
# Print the prior probabilities as: <TRAIN_OR_TEST>: P(class) = value
# Count the fresh and rotten examples in each set
negTrainingCount = 0
posTrainingCount = 0
negTestingCount = 0
posTestingCount = 0
# Use the actual set sizes rather than hard-coded values
trainingSetSize = len(train_tags)
testingSetSize = len(test_tags)
for tag in train_tags:
    if tag == "fresh":
        posTrainingCount += 1
    elif tag == "rotten":
        negTrainingCount += 1

for tag in test_tags:
    if tag == "fresh":
        posTestingCount += 1
    elif tag == "rotten":
        negTestingCount += 1

print("TRAIN: P(fresh) = " + str(posTrainingCount / trainingSetSize))
print("TRAIN: P(rotten) = " + str(negTrainingCount / trainingSetSize))
print("TEST: P(fresh) = " + str(posTestingCount / testingSetSize))
print("TEST: P(rotten) = " + str(negTestingCount / testingSetSize))

TRAIN: P(fresh) = 0.5
TRAIN: P(rotten) = 0.5
TEST: P(fresh) = 0.5
TEST: P(rotten) = 0.5
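
As an aside, the same kind of prior can be computed more compactly with NumPy. The sketch below uses a hypothetical, deliberately unbalanced label array (`toy_tags` is made up) so that the two priors differ.

```python
import numpy as np

# Hypothetical labels: 3 fresh and 1 rotten
toy_tags = np.array(["fresh", "rotten", "fresh", "fresh"])

# np.unique returns the distinct classes and how often each one occurs
classes, counts = np.unique(toy_tags, return_counts=True)

# Prior of a class = number of examples in the class / total number of examples
toy_priors = {str(c): int(n) / toy_tags.size for c, n in zip(classes, counts)}
print(toy_priors)
```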


In [17]:
# Test of a naive bayes algorithm, the "fit" is the training
from sklearn.naive_bayes import MultinomialNB

# Training the model
clf = MultinomialNB().fit(train_counts, train_tags)   

***6.4 Evaluation of Naive Bayes***

Let's first look at how the model performs on the training set, on which it learned.  To apply the model for classification (prediction), we use the *predict* method below.

In [18]:
# Testing on training set
predicted = clf.predict(train_counts)
# Print the first ten predictions
for doc, category in zip(train_reviews[:10], predicted[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (doc, category))
correct = 0
for tag, pred in zip(train_tags, predicted):   # zip allows to go through two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total training examples out of %s examples" %(correct, train_tags.size))

' You Again poses an interesting question -- what if our long-ago bullies were just as psychically scarred by the tormenting as their tormented victims were? -- but that curveball is buried under a lot of gunk.' => rotten

" Joe Swanberg's starriest picture is a lovely slice of everything and nothing disguised as a murder mystery." => fresh

' There is very little here to disabuse the growing belief that what the young Steven Patrick Morrissey most needs is a slap.' => rotten

' A sensitive portrait, but often a wretched one, of young people at crossroads, set on a Canadian First Nations reservation but with resonance far beyond.' => fresh

" Nothing really happens besides self-introspection and escape, which can be interesting, but here isn't." => rotten

' The supernatural elements brush up against some heavy topics, some actual real-life horrors, but like any encounter with a ghost, Angelica is likely to simply leave you cold.' => rotten

' An exciting film that will appeal to all f

Unsurprisingly, on the training set we get most of the examples correct.  But we should test on a real **test set**, namely test_reviews and test_tags.

**(TO DO) Q3 - 2 marks**  
Test the trained model on the test set.  Write the code below to do so.  Before testing, each test set must be transformed through the preprocessing steps, so their format is compatible with the learner.
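
The sketch below, on hypothetical toy sentences, shows why the test data should go through `transform` on the already fitted vectorizer rather than through a new `fit_transform`: the test matrix must have the same columns as the training matrix, and words that were not seen during training are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on (hypothetical) training sentences only
demo_vect = CountVectorizer()
demo_train_counts = demo_vect.fit_transform(["good film", "bad film"])

# Re-use the fitted vocabulary on a test sentence containing an unseen word
demo_test_counts = demo_vect.transform(["good unseen film"])

# Both matrices have 3 columns (bad, film, good); 'unseen' has no column
print(demo_train_counts.shape)
print(demo_test_counts.shape)
print(demo_test_counts.toarray())
```

Calling `fit_transform` on the test sentences instead would build a different vocabulary, so the column indices would no longer match what the trained model expects.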

In [19]:
# Pre-process test set test_reviews
# Note, we use transform and NOT fit_transform since we do not want to re-fit the vectorizer
# that we used to train the model
test_reviews_counts = count_vect.transform(test_reviews)
# Predict the results
prediction = clf.predict(test_reviews_counts)
# Print the first ten predictions
for docTest, categoryTest in zip(test_reviews[:10], prediction[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (docTest, categoryTest))
# Print the total correctly classified instances out of the total instances
correct = 0
for tag, pred in zip(test_tags, prediction):   # zip lets us iterate over two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total test examples out of %s examples" %(correct, test_tags.size))

' A loving homage to spy movies that stands as a film in its own right, Kingsman draws its influences from all the right places and, most importantly, is entertaining.' => fresh

' ...has little new going for it beyond a boatload of energy and enthusiasm. And pure formula.' => fresh

" Len Wiseman's remake improves on it [the original] many ways, though in other ways, it's slightly inferior. If only we could have combined the best parts of both!" => fresh

" The film asks down-and-dirty questions about what really resides beneath thousands of years of human progress, a savage and haunting antidote to the high-minded idealism of movies like Christopher Nolan's Interstellar and Ridley Scott's The Martian." => fresh

' At least it\'s not "The Emoji Movie." ' => rotten

" Really, it's the same circumstance captured in Bye Bye Birdie, but Lee and Schamus lack a sense of humor." => fresh

" The film is fascinating to watch, but I can hardly say what it's about, other than people killing each

***6.5 Support Vector Machine (SVM) learning***

Now that we have tested the Naive Bayes classifier, we are ready to test the SVM algorithm provided by scikit-learn.  This notebook will not explain all of the parameters and inner workings of an SVM classifier. However, below are links to the official documentation of scikit-learn's implementations, along with a good article that explains SVMs.

https://scikit-learn.org/stable/modules/svm.html
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72
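
For a feel of the API before training on the real data, here is a minimal sketch that chains a `CountVectorizer` and a linear-kernel `svm.SVC` on four hypothetical reviews. The data and variable names are made up, and no parameter tuning is attempted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn import svm

# Hypothetical toy training set: each review has one opinion word
svm_toy_reviews = ["great film", "wonderful film", "awful film", "boring film"]
svm_toy_tags = ["fresh", "fresh", "rotten", "rotten"]

# The pipeline vectorizes the text and then feeds the counts to the SVM
svm_toy_model = make_pipeline(CountVectorizer(),
                              svm.SVC(kernel="linear", random_state=0))
svm_toy_model.fit(svm_toy_reviews, svm_toy_tags)

# Predict on new sentences built from words seen during training
svm_toy_predictions = svm_toy_model.predict(["a great great film",
                                             "awful boring film"])
print(svm_toy_predictions)
```

The word "film" appears in every training review, so it carries no information about the class; the decision rests on the opinion words.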

In [20]:
# Test of a SVM, the "fit" is the training
from sklearn import svm

# Training the model with an SVM using a linear kernel (less computationally intensive)
# For the purpose of this Notebook we will stick to this simple model and stop after 3500
# iterations to save time (would take much longer otherwise, feel free to duplicate the notebook
# and test different parameters for yourself to see how much better it does!)
# This will take several minutes to run on less powerful machines, so be patient!
clf_svm = svm.SVC(kernel="linear", random_state=0, max_iter=3500).fit(train_counts, train_tags)   



***6.6 Evaluation of SVM***

Let's first look at how the model performs on the training set, on which it learned.  To apply the model for classification (prediction), we use the *predict* method below.

In [21]:
# May take a few minutes to run on weaker machines
# Testing on training set
predicted_svm = clf_svm.predict(train_counts)
# Print the first ten predictions
for doc, category in zip(train_reviews[:10], predicted_svm[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (doc, category))
correct = 0
for tag, pred in zip(train_tags, predicted_svm):   # zip allows to go through two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total training examples out of %s examples" %(correct, train_tags.size))

' You Again poses an interesting question -- what if our long-ago bullies were just as psychically scarred by the tormenting as their tormented victims were? -- but that curveball is buried under a lot of gunk.' => rotten

" Joe Swanberg's starriest picture is a lovely slice of everything and nothing disguised as a murder mystery." => fresh

' There is very little here to disabuse the growing belief that what the young Steven Patrick Morrissey most needs is a slap.' => rotten

' A sensitive portrait, but often a wretched one, of young people at crossroads, set on a Canadian First Nations reservation but with resonance far beyond.' => rotten

" Nothing really happens besides self-introspection and escape, which can be interesting, but here isn't." => rotten

' The supernatural elements brush up against some heavy topics, some actual real-life horrors, but like any encounter with a ghost, Angelica is likely to simply leave you cold.' => rotten

' An exciting film that will appeal to all 

We can see that this model does not perform especially well on the training data. This is due to the parameter choice.  But we should test on a real **test set**, namely test_reviews and test_tags.

**(TO DO) Q4 - 2 marks**  
Test the trained SVM model on the test set.  Write the code below to do so.  Before testing, each test set must be transformed through the preprocessing steps, so their format is compatible with the learner (can just repeat what you did above for this).

In [22]:
# Pre-process test set test_reviews
# Note, we use transform and NOT fit_transform since we do not want to re-fit the vectorizer
# that we used to train the model
test_reviews_counts = count_vect.transform(test_reviews)
# Predict the results with the SVM
test_predicted_svm = clf_svm.predict(test_reviews_counts)
# Print the first ten predictions
for doc, category in zip(test_reviews[:10], test_predicted_svm[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (doc, category))
# Print the total correctly classified instances out of the total instances
correct = 0
for tag, pred in zip(test_tags, test_predicted_svm):   # zip lets us iterate over two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total test examples out of %s examples" %(correct, test_tags.size))

' A loving homage to spy movies that stands as a film in its own right, Kingsman draws its influences from all the right places and, most importantly, is entertaining.' => fresh

' ...has little new going for it beyond a boatload of energy and enthusiasm. And pure formula.' => fresh

" Len Wiseman's remake improves on it [the original] many ways, though in other ways, it's slightly inferior. If only we could have combined the best parts of both!" => fresh

" The film asks down-and-dirty questions about what really resides beneath thousands of years of human progress, a savage and haunting antidote to the high-minded idealism of movies like Christopher Nolan's Interstellar and Ridley Scott's The Martian." => fresh

' At least it\'s not "The Emoji Movie." ' => rotten

" Really, it's the same circumstance captured in Bye Bye Birdie, but Lee and Schamus lack a sense of humor." => rotten

" The film is fascinating to watch, but I can hardly say what it's about, other than people killing eac

***6.7 More Evaluation!***


**(TO DO) Q5 - 2 marks**   
A common **Evaluation Measure** in Machine Learning is **Recall**. Recall is the number of correct predictions for a class of interest (called the True Positives) divided by the total number of instances that are actually labelled as that class of interest (True Positives + False Negatives).   For example, if the test set contains 5 fresh examples and the algorithm only found 2, then the recall for the class fresh is 2/5.  Write a small method below that will calculate a class' recall.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the recall of that class (e.g. 50%).
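
As a numeric check of the definition, the sketch below computes recall as a fraction on the three-tag example from the list above. `recall_value` is a hypothetical helper name, and returning 0.0 when the class never occurs in the actual tags is a convention chosen for this sketch.

```python
def recall_value(actual_tags, predictions, class_of_interest):
    true_positives = 0    # actual class of interest, predicted as such
    actual_in_class = 0   # true positives + false negatives
    for actual, predicted in zip(actual_tags, predictions):
        if actual == class_of_interest:
            actual_in_class += 1
            if predicted == class_of_interest:
                true_positives += 1
    if actual_in_class == 0:
        return 0.0  # convention for this sketch: the class never occurs
    return true_positives / actual_in_class

# Two actual 'fresh' tags, only the first one is predicted 'fresh'
toy_recall = recall_value(("fresh", "rotten", "fresh"),
                          ("fresh", "fresh", "rotten"), "fresh")
print(toy_recall)  # 0.5
```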

In [114]:
# CANNOT USE ANY FUNCTIONS FROM LIBRARIES TO DIRECTLY GET THE RECALL
def recall(actualTags, predictions, classOfInterest):
    # Count how many tags actually belong to the class of interest, and how many
    # of those were predicted correctly, then report the recall for the class.
    correctPredictions = 0
    totalClassTags = 0
    classOfInterest = str(classOfInterest)

    # Total number of instances actually labelled with the class of interest
    for tag in actualTags:
        if tag == classOfInterest:
            totalClassTags += 1

    # Correct predictions for the class of interest (True Positives)
    for actual, predicted in zip(actualTags, predictions):
        if actual == predicted and actual == classOfInterest:
            correctPredictions += 1

    return "The recall for the class " + classOfInterest + " is: " + str(correctPredictions) + "/" + str(totalClassTags)

**(TO DO) Q6 - 2 marks**   
Use the recall method to calculate the recall on the test set (both classes) for the Naive Bayes and SVM learners.  Print those recalls.   
Hint: You can test if recall() works correctly by testing with the provided example above

In [119]:
# Recall
#recallTest = recall(("fresh","fresh", "fresh", "rotten", "rotten", "fresh"), ("fresh", "rotten", "fresh", "fresh"), "fresh")
#print(recallTest);

naiveFreshBayesRecall = recall(test_tags, prediction, "fresh")
print("Naive Bayes Recall: " + naiveFreshBayesRecall)

naiveRottenBayesRecall = recall(test_tags, prediction, "rotten")
print("Naive Bayes Recall: " + naiveRottenBayesRecall)

svmFreshLearnersRecall = recall(test_tags, test_predicted_svm, "fresh")
print("SVM Learner Recall: " + svmFreshLearnersRecall)

svmRottenLearnersRecall = recall(test_tags, test_predicted_svm, "rotten")
print("SVM Learner Recall: " + svmRottenLearnersRecall)

Naive Bayes Recall: The recall for the class fresh is: 747/1000
Naive Bayes Recall: The recall for the class rotten is: 778/1000
SVM Learner Recall: The recall for the class fresh is: 609/1000
SVM Learner Recall: The recall for the class rotten is: 676/1000


**(TO DO) Q7 - 2 marks**   
Another common **Evaluation Measure** in Machine Learning is called **Precision**. Precision is the number of correct predictions for a class of interest (True Positives) divided by the total number of times that class of interest was predicted (True Positives + False Positives). For example, if the test set (ground truth) contains 3 fresh examples and 1 rotten example, and the algorithm correctly labelled two of the fresh examples as fresh, incorrectly labelled the third as rotten, and incorrectly labelled the rotten example as fresh, then the Precision for the class fresh is 2/3.  Write a small method below that will calculate a class' precision.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g. (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the precision of that class (e.g. 50%).

In [124]:
# CANNOT USE ANY FUNCTIONS FROM LIBRARIES TO DIRECTLY GET THE PRECISION
def precision(actualTags, predictions, classOfInterest):
    # Precision = true positives / all instances predicted as the class of interest.
    if len(actualTags) != len(predictions):
        raise ValueError("actualTags and predictions must have the same length")
    classOfInterest = str(classOfInterest)
    correctPredictions = 0     # true positives
    totalClassPredictions = 0  # true positives + false positives

    # Walk the actual tags and the predictions in parallel.
    for actual, predicted in zip(actualTags, predictions):
        if predicted == classOfInterest:
            totalClassPredictions += 1
            if actual == classOfInterest:
                correctPredictions += 1

    return "The precision for the class " + classOfInterest + " is: " + str(correctPredictions) + "/" + str(totalClassPredictions)
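
Precision and recall share the same numerator (the true positives) and differ only in the denominator. The sketch below makes that explicit by counting true positives, false positives, and false negatives in one pass over the worked example from Q7. The helper name `confusion_counts` is hypothetical and is not part of the assignment.

```python
def confusion_counts(actual_tags, predictions, class_of_interest):
    """Return (true positives, false positives, false negatives) for one class."""
    tp = fp = fn = 0
    for actual, predicted in zip(actual_tags, predictions):
        if predicted == class_of_interest and actual == class_of_interest:
            tp += 1
        elif predicted == class_of_interest:
            fp += 1  # predicted as the class, but actually something else
        elif actual == class_of_interest:
            fn += 1  # actually the class, but predicted as something else
    return tp, fp, fn


# Worked example from Q7: 3 fresh and 1 rotten in the ground truth.
actual = ("fresh", "fresh", "fresh", "rotten")
predicted = ("fresh", "fresh", "rotten", "fresh")
tp, fp, fn = confusion_counts(actual, predicted, "fresh")
print(tp, fp, fn)  # 2 1 1
precision_fresh = tp / (tp + fp)  # 2/3: divides by the times "fresh" was predicted
recall_fresh = tp / (tp + fn)     # 2/3: divides by the times "fresh" was the true tag
```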

**(TO DO) Q8 - 2 marks**   
Use the precision method to calculate the precision on the test set (both classes) for the Naive Bayes and SVM learners.  Print those precision values.

In [127]:
# Precision
# Sanity check with the example above: expect 2/3
#precisionTest = precision(("fresh", "fresh", "fresh", "rotten"), ("fresh", "fresh", "rotten", "fresh"), "fresh")
#print(precisionTest)
naiveFreshBayesPrecision = precision(test_tags, prediction, "fresh")
print("Naive Bayes Precision: " + naiveFreshBayesPrecision)

naiveRottenBayesPrecision = precision(test_tags, prediction, "rotten")
print("Naive Bayes Precision: " + naiveRottenBayesPrecision)

svmFreshLearnersPrecision = precision(test_tags, test_predicted_svm, "fresh")
print("SVM Learner Precision: " + svmFreshLearnersPrecision)

svmRottenLearnersPrecision = precision(test_tags, test_predicted_svm, "rotten")
print("SVM Learner Precision: " + svmRottenLearnersPrecision)

Naive Bayes Precision: The precision for the class fresh is: 747/969
Naive Bayes Precision: The precision for the class rotten is: 778/1031
SVM Learner Precision: The precision for the class fresh is: 609/933
SVM Learner Precision: The precision for the class rotten is: 676/1067


#### 7. Discussion

**(TO DO) Q9 - 5 marks**  
1. Are the Naive Bayes and SVM approaches performing better than the baseline approach? If so, by how much?  

2. How do the precision and recall values for the "rotten" and "fresh" classes from the Naive Bayes approach compare to those from the SVM approach?  

3. If we used the training data on the Baseline approach, how would you theorize those results would compare to those from the test data (better, worse, maybe both)? Explain why the comparison of the train and test data predictions from the Baseline model may or may not (depending on your previous answer) resemble the comparison of the train and test predictions from the Naive Bayes and SVM learners.  

4. Present and discuss the overall results below (including the precision and recall comparisons).  

5. Give two suggestions (each) to help the Naive Bayes approach and the SVM approach within the context of our experiment of polarity detection for movie reviews.

**Answer Q9 here**
1. The Naive Bayes and SVM approaches both perform better than the baseline approach on the test set. The baseline approach got 1240 correct guesses out of 2000 (62%), the Naive Bayes approach got 1525 out of 2000 (76.25%), and the SVM approach got 1285 out of 2000 (64.25%). Naive Bayes therefore has the highest percentage of correct guesses, 14.25 percentage points above the baseline. It is followed by the SVM approach, which beats the baseline by only a small margin of 2.25 percentage points. The baseline approach is in last place.
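
The margins over the baseline can be reproduced from the raw counts of correct guesses quoted above. This is only an arithmetic sketch; the dictionary names are illustrative and do not come from the notebook's earlier cells.

```python
TEST_SIZE = 2000
# Correct guesses on the test set, as quoted in the answer.
correct = {"Baseline": 1240, "Naive Bayes": 1525, "SVM": 1285}

accuracy = {name: 100 * hits / TEST_SIZE for name, hits in correct.items()}
for name, value in accuracy.items():
    # Difference to the baseline, in percentage points.
    margin = value - accuracy["Baseline"]
    print(f"{name}: {value:.2f}% ({margin:+.2f} points vs. baseline)")
```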

-----------------------------------------------------------------------------------
2. To compare the precision and recall values for both classes of interest from the Naive Bayes approach and the SVM approach, the following percentages have been calculated:

Precision:
    Naive Bayes:
        Fresh: 747/969 = 77.09%
        Rotten: 778/1031 = 75.46%
    SVM:
        Fresh: 609/933 = 65.27%
        Rotten: 676/1067 = 63.36%

Recall:
    Naive Bayes:
        Fresh: 747/1000 = 74.70%
        Rotten: 778/1000 = 77.80%
    SVM:
        Fresh: 609/1000 = 60.90%
        Rotten: 676/1000 = 67.60%
        
As seen from above, the Naive Bayes approach performed better than the SVM approach in both precision and recall, for both classes of interest.
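
The percentages can be recomputed from the fractions printed by the recall and precision cells earlier in the notebook. Recall divides the true positives by the 1000 actual instances of each class, while precision divides them by the number of times the class was predicted. The `counts` table below is simply copied from those printed outputs, and its layout is an illustrative choice.

```python
# (true positives, times predicted, times actually present) per learner and class.
counts = {
    ("Naive Bayes", "fresh"): (747, 969, 1000),
    ("Naive Bayes", "rotten"): (778, 1031, 1000),
    ("SVM", "fresh"): (609, 933, 1000),
    ("SVM", "rotten"): (676, 1067, 1000),
}

table = {}
for (learner, label), (tp, predicted, actual) in counts.items():
    precision_pct = round(100 * tp / predicted, 2)
    recall_pct = round(100 * tp / actual, 2)
    table[(learner, label)] = (precision_pct, recall_pct)
    print(f"{learner:<12}{label:<8}precision {precision_pct:.2f}%  recall {recall_pct:.2f}%")
```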

-----------------------------------------------------------------------------------
3. If the baseline approach were run on the training data, it is theorized that its results would be better than its results on the test data. The Naive Bayes and SVM approaches scored 90.28% and 72.29%, respectively, on the training data, so both fared better there than on the test data: the Naive Bayes approach was 14.03 percentage points more accurate and the SVM approach was 8.04 percentage points more accurate. The baseline approach's accuracy is expected to increase as well, but by a smaller increment than that of the Naive Bayes and SVM approaches.
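
The gap between training and test accuracy can be worked out from the counts reported in this notebook (16250/18000 and 13012/18000 on the training set, 1525/2000 and 1285/2000 on the test set). The sketch below is plain arithmetic; the `results` and `gaps` names are illustrative.

```python
results = {
    # learner: ((correct, total) on the training set, (correct, total) on the test set)
    "Naive Bayes": ((16250, 18000), (1525, 2000)),
    "SVM": ((13012, 18000), (1285, 2000)),
}

gaps = {}
for learner, ((train_ok, train_n), (test_ok, test_n)) in results.items():
    train_acc = 100 * train_ok / train_n
    test_acc = 100 * test_ok / test_n
    gaps[learner] = train_acc - test_acc  # in percentage points
    print(f"{learner}: train {train_acc:.2f}%, test {test_acc:.2f}%, gap {gaps[learner]:.2f} points")
```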

-----------------------------------------------------------------------------------
4. The following is a presentation of the data obtained from this notebook:

Training Set:
    Baseline: N/A
    Naive Bayes: 16250/18000 = 90.28%
    SVM: 13012/18000 = 72.29%
    
Testing Set: 
    Baseline: 1240/2000 = 62%
    Naive Bayes: 1525/2000 = 76.25%
    SVM: 1285/2000 = 64.25%

Precision:
    Naive Bayes:
        Fresh: 747/969 = 77.09%
        Rotten: 778/1031 = 75.46%
    SVM:
        Fresh: 609/933 = 65.27%
        Rotten: 676/1067 = 63.36%

Recall:
    Naive Bayes:
        Fresh: 747/1000 = 74.70%
        Rotten: 778/1000 = 77.80%
    SVM:
        Fresh: 609/1000 = 60.90%
        Rotten: 676/1000 = 67.60%
        
As seen from above, the Naive Bayes approach performed better than the SVM approach on every measure. It also performed better than the baseline, which is a good indication that the extra complexity is paying off. Unfortunately, the SVM approach fared worse than the baseline approach on the fresh precision and fresh recall results, and where it did do better than the baseline, it was only by a small increment. This shows that the SVM approach should be improved before it is adopted, so that the added complexity is worth it. Even though the Naive Bayes approach fared the best, further improvements remain worthwhile for the same reason.


-----------------------------------------------------------------------------------
5.

Suggestions to improve the Naive Bayes approach:
1. Increase the size of the data set if possible, as the Naive Bayes approach has high accuracy and speed on large datasets.
2. Multinomial Naive Bayes classification could be used to identify the different genres of movies, and so whether certain words are positive or negative in context (e.g. "gory" is a negative word for a romance movie but a positive word for a horror movie).

Suggestions to improve the SVM approach:
1. Increase the maximum number of iterations.
2. SVM training is slow on large datasets, so k-fold cross-validation is recommended, with the SVM evaluated on each held-out fold in turn.
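
The k-fold cross-validation idea mentioned above can be sketched without any library: split the example indices into k folds and let each fold serve once as the held-out set while the learner is trained on the rest. The helper `k_fold_indices` is hypothetical and written only for illustration; it assumes the examples have already been shuffled.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, held_out_indices) pairs for k roughly equal folds."""
    # The first (n_examples % k) folds get one extra example each.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, held_out
        start += size


# With 10 examples and 5 folds, every example is held out exactly once.
for train, held_out in k_fold_indices(10, 5):
    print(len(train), held_out)
```

Averaging the accuracy over the k held-out folds gives a steadier estimate than a single train/test split, at the cost of training the learner k times.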

In [None]:
**Optional - No marks** 
For your own interest, create a local copy of the notebook and redo the questions using the entire dataset with the same train, test split. Also try some different kernels for the SVM. How do these tests compare with the ones that you have done on the partial dataset in the

#### Signature

I, Kenny Nguyen, declare that the answers provided in this notebook are my own.