# Sentiment Analysis with Naïve Bayes
Concepts:
-  Run a simple implementation of a Naïve Bayes spam filter by hand
-  Run NLTK's implementation of a Naïve Bayes classifier to sort movie reviews into positive and negative
-  Carry out feature engineering to improve the performance of a binary classifier

## 1. Naïve Bayes spam filtering by hand

### 1.1. Set up the spreadsheet
This section will be done with the following [spreadsheet](https://docs.google.com/spreadsheets/d/1Ipd_qC_2R7FZEmkuPFZ_u3Jxn_MmFMGMzuz1zLQTKBQ/edit?usp=sharing).

The Naïve Bayes approach to spam filtering is based on how often certain words occur in spam emails and in non-spam emails. For example, the word "Viagra" often occurs in spam emails and rarely occurs in non-spam emails. For these two reasons, the occurrence of this word is a reliable indicator of spam. We can estimate the reliability of any word-based feature in the following way:
- We count the number of spam emails in our training set that contain the word, and divide it by the number of spam emails in our training set overall. The higher this number, the stronger the evidence this word provides in the "spam" direction.
- We also count the number of non-spam emails in our training set that contain this word, and divide this by the number of non-spam emails in our training set overall. The higher this number, the stronger the evidence this word provides in the "non-spam" direction.
- We then set up a tug-of-war between the two numbers by dividing the first number by the second. The result is called the odds ratio. It will be close to 1 if the word occurs about equally often in spam and non-spam emails. The way we set it up in this exercise, the odds ratio will be above 1 (tending towards infinity) if the word occurs predominantly in spam emails, and below 1 (tending towards zero) if the word occurs predominantly in non-spam emails. We could also have set it up the other way around, so that odds ratios above 1 indicate non-spam ("ham") and odds ratios below 1 indicate spam, and it would have worked out just fine -- the choice is arbitrary here. As long as the choice is made in a consistent way throughout an application, it will not matter.
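
The three steps above can be sketched in a few lines of Python. The function name `odds_ratio` and all counts are made up for illustration; they are not taken from the spreadsheet.

```python
def odds_ratio(spam_with_word, spam_total, ham_with_word, ham_total):
    """Ratio of a word's relative frequency in spam to that in non-spam."""
    spam_rate = spam_with_word / spam_total   # step 1: evidence for "spam"
    ham_rate = ham_with_word / ham_total      # step 2: evidence for "non-spam"
    return spam_rate / ham_rate               # step 3: the tug-of-war

# Hypothetical training set: 40 spam emails and 60 non-spam emails.
print(odds_ratio(20, 40, 3, 60))   # word found mostly in spam: ratio above 1
print(odds_ratio(4, 40, 6, 60))    # word equally common in both: ratio close to 1
print(odds_ratio(2, 40, 30, 60))   # word found mostly in non-spam: ratio below 1
```

Note that this unsmoothed version raises a `ZeroDivisionError` whenever `ham_with_word` is 0.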

One problem with this and similar approaches arises from the fact that certain words will not occur in any non-spam emails in the training set. For example, "Viagra" occurs so rarely in non-spam emails that in many cases, a training set will not contain it at all. This is the case in our example. As a result, computing the odds ratio as described above results in division by zero.

After studying this problem carefully for a few thousand years 😊, a task force of machine learning experts has come up with a clever solution: simply add one to every count. This is called "add-one smoothing". This way, every number is guaranteed to be positive and there is no risk of running into division by zero. Mathematically, this corresponds to taking the entire list of known words and pretending that it occurs twice in the training set, once as a spam email and once as a non-spam email. Since we add it on both sides of the divide, we don't change anything about how the individual words are categorized: every feature that used to pull us in a certain direction will still do so.
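
As a minimal sketch of add-one smoothing, assume that each word's probability is its smoothed count divided by the total of all smoothed counts. The word counts and the helper names `smooth` and `relative_frequencies` are hypothetical.

```python
# Hypothetical raw word counts (not the spreadsheet's values).
spam_counts = {"viagra": 5, "cash": 3, "horse": 0}
ham_counts = {"viagra": 0, "cash": 1, "horse": 7}

def smooth(counts):
    """Add one to every count so that no count is zero."""
    return {word: n + 1 for word, n in counts.items()}

def relative_frequencies(counts):
    """Divide each count by the total of all counts."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

spam_probs = relative_frequencies(smooth(spam_counts))
ham_probs = relative_frequencies(smooth(ham_counts))

for word in spam_counts:
    ratio = spam_probs[word] / ham_probs[word]   # no division by zero any more
    print(word, round(ratio, 3))
```

In this toy example, "viagra" still pulls towards spam and "horse" still pulls towards ham after smoothing; only the zero counts have disappeared.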

In the spreadsheet, columns B and C give the counts of particular words found in spam and ham emails for a hypothetical user named Sandy, who chats about horses. The next two columns, D and E, are intended to hold smoothed counts, i.e. ones with no zeros. However, the counts initially are the same as those in columns B and C (i.e., they are initially unsmoothed). Consequently, there are divide-by-zero errors in columns H and J.

### 1.2. Complete the spreadsheet
The second step is to smooth the counts in columns D and E. We’re going to use the simple add-one smoothing method. Select cell D5, which should have the formula `=B5`, and replace it with the formula `=B5+1`, thereby adding one to the count. 

Next, add one to the count of all the words. To do so, select cell D5 again, then press command-C on Macs or Ctrl-C on Windows (or select Edit|Copy), then select the cells for all the words in columns D and E (i.e., cells `D5:E15`) and press command-V (or Ctrl-V, or select Edit|Paste).

Once the counts are smoothed, the probabilities in columns F and G should all become non-zero. These probabilities are simply the relative frequencies of the counts in columns D and E, i.e. the count of each word divided by the total.

Column H has the ratio of the probabilities in columns F and G, which gives the odds ratio of a message being spam rather than ham just based on each word.

Column I has the counts of each word for Message 1; for example, the word *cash* appears once in Message 1 (note that we don’t have data for all the words in all the messages — we’ll assume that the other words are inconsequential). 

Column J carries over the relevant ratios: for words with non-zero counts, it has the product of the counts in column I times the ratios in column H. For words with zero counts (words that don't occur in the message), it has 1.

Finally, cell J16 has the product of the ratios in Column J, which gives the overall odds ratio for the message being spam, while cell J17 has the inverse of J16, which is the odds ratio for the message being ham. This is how a simple spam filter works.
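
The roles of columns H, I and J and of cells J16 and J17 can be sketched as follows. The sketch follows the description of column J literally (count times ratio for words that occur, 1 for words that do not); the ratios, the counts and the function name `message_odds` are hypothetical and do not come from the spreadsheet.

```python
from math import prod

# Hypothetical per-word odds ratios (the role of column H).
word_ratios = {"cash": 4.0, "viagra": 6.0, "horse": 0.125, "saddle": 0.25}

# Hypothetical word counts for one message (the role of column I).
message_counts = {"cash": 1, "viagra": 1, "horse": 0, "saddle": 0}

def message_odds(word_ratios, message_counts):
    """Multiply the contributions of all words (column J, then cell J16)."""
    contributions = []
    for word, ratio in word_ratios.items():
        count = message_counts.get(word, 0)
        # Words that do not occur in the message contribute a neutral factor of 1.
        contributions.append(count * ratio if count > 0 else 1.0)
    return prod(contributions)

spam_odds = message_odds(word_ratios, message_counts)   # like cell J16
ham_odds = 1 / spam_odds                                # like cell J17
print("spam" if spam_odds > 1 else "ham", spam_odds, ham_odds)
```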

### 1.3. Interpret the spreadsheet
Message 1 is classified as spam since its overall odds ratio (J16 = 225.368) is far above 1. Correspondingly, its inverse overall odds ratio (J17 = 0.004) is close to 0, which means its words occur predominantly in spam emails.

Message 2 is classified as non-spam since its overall odds ratio (L16 = 0.000 after rounding) is far below 1. Its inverse overall odds ratio (L17 = 2871.649) is correspondingly large, which means its words occur predominantly in non-spam emails.

Message 3 is classified as non-spam since its overall odds ratio (N16 = 0.04) is below 1. Its inverse overall odds ratio (N17 = 24.896) is correspondingly well above 1, which means its words occur predominantly in non-spam emails. This is as expected: the words that appear in Message 3 have low ratios, which pulls the overall odds ratio toward 0.


## 2. Analyze movie reviews with a Naïve Bayes classifier
In this section, a Naïve Bayes classifier will be used to analyze movie reviews. This will require three assumptions:

- First, we're assuming that every review can be classified as either positive or negative (no neutral reviews).
- Second, documents can be represented as a bag of words, with sequential order being unimportant.
- Third, the probabilities of two words $x$ and $y$ appearing in a document are independent (this is the *naïve* part of Naïve Bayes). This is saying that the probability of encountering the word "butter" is unchanged by seeing the word "peanut" in the same sentence, even though we generally have the intuition that $P(\text{butter}|\text{peanut})$ is higher than, for example $P(\text{butter}|\text{supernova})$.

We know that assumptions 2 and 3 are wrong, but they make things much simpler.
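
To make the bag-of-words assumption concrete, here is a small sketch in which two sentences with opposite meanings end up with the same representation. The helper `bag_of_words` is our own and uses naive whitespace splitting rather than NLTK's tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)
print(a == b)   # word order is lost, so the two bags compare equal
```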

### 2.1. Training Data

The first thing needed for a classifier is a set of labeled reviews to use to train the model. NLTK has a corpus of 2000 labeled movie reviews that should work well for this. Run the cell below to import and transform the reviews into a convenient form:

In [5]:
# we're taking the corpus of movie reviews and transforming them into a useful form

import nltk
import random
import re
from nltk.corpus import movie_reviews
from pathlib import Path
from os.path import exists

common_nltk_data_location = str(Path.home())+'/groupshare/nltk_data'
if exists(common_nltk_data_location):
    nltk.data.path=[common_nltk_data_location]
else:
    nltk.download('punkt') 
    nltk.download('movie_reviews')  
movies = [(list(movie_reviews.words(fileid)),category) for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)]
random.seed(8823)
random.shuffle(movies)

In [8]:
# each item in the list is a tuple (review, label), where review is a list of words and label is the string 'pos' or 'neg'

print("Review:")
review = ' '.join(movies[0][0])
review = review.replace(". ", ".\n\n") # adds newlines after sentences
review = review.replace("( ", "(").replace(" )", ")") # remove whitespace after opening paren and before closing paren
review = review.replace(" - ", "-").replace(" / ", "/") # remove surrounding whitespace around - and /
review = re.sub(r'\s([?.!",\';:‘](?:\s|$))', r'\1', review) # removes whitespaces before punctuation
print(review)
print("-------------------------")
print("label:", movies[0][1]) # print the above review's label (pos or neg)

Review:
delicatessen (directors: marc caro/jean-pierre jeunet; screenwriters: gilles adrien/marc caro; cinematographer: darius khondji; editor: herve schneid; cast: dominique pinon (louison), marie-laure dougnac (julie clapet), jean-claude dreyfus (clapet-the butcher), karin viard (mademoiselle plusse), ticky holgado (marcel tapioca), anne-marie pisani (madame tapioca), jacques mathou (roger), rufus (robert kube), howard vernon (frog man), edith ker (granny), boban janevski (young rascal), mikael todde (young rascal), chick ortega (postman), silvie laguna (aurore interligator), howard vernon (frog man); runtime: 96; miramax/constellation/ugc/hatchette premiere; 1991-france) reviewed by dennis schwartz a black comedy set in the near future in a boarding house run by a depraved butcher.

the comedy is played more in comic strip style for entertaining value than for deeper satire, as it features mostly zany sophomoric sight gags and relies heavily on special effects.

the world has fallen

### 2.2. Picking initial features

In [9]:
# find the 100 most common words in the reviews

frequencies = nltk.FreqDist(w.lower() for w in movie_reviews.words())
most_common = frequencies.most_common(100) 
print(most_common)

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595), (')', 11781), ('(', 11664), ('as', 11378), ('with', 10792), ('for', 9961), ('his', 9587), ('this', 9578), ('film', 9517), ('i', 8889), ('he', 8864), ('but', 8634), ('on', 7385), ('are', 6949), ('t', 6410), ('by', 6261), ('be', 6174), ('one', 5852), ('movie', 5771), ('an', 5744), ('who', 5692), ('not', 5577), ('you', 5316), ('from', 4999), ('at', 4986), ('was', 4940), ('have', 4901), ('they', 4825), ('has', 4719), ('her', 4522), ('all', 4373), ('?', 3771), ('there', 3770), ('like', 3690), ('so', 3683), ('out', 3637), ('about', 3523), ('up', 3405), ('more', 3347), ('what', 3322), ('when', 3258), ('which', 3161), ('or', 3148), ('she', 3141), ('their', 3122), (':', 3042), ('some', 2985), ('just', 2905), ('can', 2882), ('if', 2799), ('we', 2775), ('him', 2633), ('into', 2

### 2.3. Feature Extraction

In [10]:
# pick some random set of features/words to start with
features = ["the","only","old","almost","good","bad"]

# check if review contains selected features
def extract_features(review,features):
    doc_features = {}
    for word in features:
        doc_features[word] = (word in review)
    return doc_features

We don't actually care about the movie reviews themselves but only about the features they contain, together with the information about whether the reviews are positive or negative. With the feature extractor function just defined, we can turn all of the movie reviews into feature vectors. The feature vector of a given review consists of our word list together with information about whether the review contains each of these words.

In [12]:
# outputs a list of features together with their values (True means that the word occurs in the review, False means that it doesn't) together with a label (pos means that the review is positive, neg means it is negative)
movie_features = [(extract_features(review,features),category) for (review,category) in movies]
print(len(movie_features))
print(movie_features[0])

2000
({'the': True, 'only': True, 'old': False, 'almost': True, 'good': True, 'bad': False}, 'neg')


### 2.4. Training the Model
NLTK's `NaiveBayesClassifier` can now be trained on the feature vectors we have just extracted. To do this, we arbitrarily split our list of movie review feature vectors into training data (let's say reviews 100 and up) and test data (reviews 0-99). We hand the training data to the classifier's `train()` method.

In [13]:
# split into training and testing sets:
movies_training, movies_test = movie_features[100:], movie_features[:100]
our_first_classifier = nltk.NaiveBayesClassifier.train(movies_training)

Now we can see how well our classifier has learned to distinguish between positive and negative reviews. To put the numbers you are about to see in perspective, let's first consider the baseline of this task, that is, how well we would expect to do if we classified reviews 0-99 as positive or negative completely blindly, without looking at their contents. We would not want to pay any money for a classifier that doesn't do much better than this baseline. A quick check tells us that there are almost equally many positive and negative movie reviews in our test set: 

In [14]:
count_pos = 0
count_neg = 0
for i in range(0, 100):
    label = movies[i][1]
    if label == "pos":
        count_pos += 1
    elif label == "neg":
        count_neg += 1
print("Number of positive reviews in our test set:", count_pos)
print("Number of negative reviews in our test set:", count_neg)

Number of positive reviews in our test set: 49
Number of negative reviews in our test set: 51


So, we can expect that if we just flipped a coin, we would be right about half the time. In other words, the baseline for this task is about 50%. (If we want to be pedantic, we can consider 51% to be the baseline, since an extremely pessimistic classifier would reach 51% accuracy by classifying every review in the test set as negative without bothering to read it.)
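
The "pedantic" baseline can be computed directly from the label counts. The sketch below rebuilds a label list with the 49/51 split reported above; the function name `majority_baseline` is our own.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Label distribution of the test set as reported above: 49 positive, 51 negative.
test_labels = ["pos"] * 49 + ["neg"] * 51
label, accuracy = majority_baseline(test_labels)
print(label, accuracy)
```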

We will first try out how well the classifier fares on the reviews that it has just used to learn that difference, that is, the training set (`movies_training`). Run the cell below to tell our classifier to classify the training set and check how well it did. The result is a fraction between 0 and 1, so the code multiplies it by 100 to get a percentage:

In [15]:
print("Accuracy of the first classifier on reviews 100-2000 from the NLTK corpus (previously used as training data):")
print(str(round(100 * nltk.classify.accuracy(our_first_classifier,movies_training),2)) + "%")

Accuracy of the first classifier on reviews 100-2000 from the NLTK corpus (previously used as training data):
62.37%


Running a classifier on its own training set is not meaningful except to tell us that it is not completely blind. It would not be good practice to use the resulting accuracy as a way to evaluate the classifier's performance, because it doesn't tell us whether the classifier is able to generalize beyond what it has already seen. The more meaningful result will come from using data on which the classifier has not been trained.  The cell below tries out our classifier on a test set which contains entirely new data (the reviews 0-99 that we have previously set aside for this purpose). We will use this as a baseline to compare with a new classifier with new features.

In [17]:
baseline = nltk.classify.accuracy(our_first_classifier,movies_test)
print("Accuracy of the first classifier on reviews 0-99 from the NLTK corpus (baseline):")
print(str(round(100 * baseline ,2)) + "%")

Accuracy of the first classifier on reviews 0-99 from the NLTK corpus (baseline):
71.0%


### 2.5. Applying the Model to New Reviews

Now let's see how well the trained classifier does with a set of entirely new movie reviews. Let's run the classifier on the reviews below, and then calculate its precision (positive predictive value) and recall (true positive rate or sensitivity). 

In [18]:
# import new movie reviews (from Rotten Tomatoes, various movies)
test1 = "After the pleasant surprise that was the first film of the new Planet of the Apes series, the expectations for the sequel, or middle part of the trilogy, were somewhat bigger. Thankfully, everyone involved was fully aware of that and delivered another smart blockbuster with a lot of vital commentary on the futility of war and violent conflicts. The film doesn't want you to pick a side too easily as hostility between the last remaining humans on Earth and the intelligent apes arise. There are decent and bad characters on both sides. This makes for an interesting ride, as the conflict spins more and more into chaos and there is little anyone can do against it, after a point of no return. Once again, the CGI is incredible, thanks to great motion capture acting and the accompanying special effects. Thankfully, the human actors are en par, especially Gary Oldman only takes two short scenes to make a strong point for being one of the best of his generation. The gloomy atmosphere, the great cinematography, it all adds up to an intelligent and pretty damn entertaining continuation of the story. If there is one complaint it would have to be that the ending is merely a cliffhanger for what's next in part three. But at least we all have something to look forward to." # positive

test2 = "I will start off this review with a caveat that I am not the biggest fan of Michael Bay films, its not that I don't like any of his films but i am just not the biggest fan. This is a Michael Bay film from beginning to end. For people that like Michael Bay's style and the other films in this franchise will certainly love this film. I actually did enjoy the first film in this franchise but every installment after has been worse. The first criticism is that Michael Bay's directing style and cliches were so heavy handed in this film that it became its own character and became a spoof of itself, it took me out of the film; from slow motion sequences, to low camera angles, and one liners that did not quite hit hard enough. The product placement in this film was also very in your face and often took me out of the film. The dialogue was cringe worthy and I often felt like Peter Cullen (voice of Optimus Prime) did not want to say half the lines. The plot was extremely convoluted partly due to this movie being just way too long (nearly 3 hours). There is one positive for this film though and it barely counts. Even if your not a fan of Michael Bay, you can never argue with the amazing visuals and intense action sequences that he brings to the screen though after a while of things just blowing up I began to get bored. Overall this film is is a steaming pile of crap and for people who are not me you need to be big fan of Michael Bay and other films in this franchise to truly enjoy it. Though even if you are a fan of Michael Bay it is going to be hard to enjoy this film as it is one of the worst movies I have ever seen." # negative

test3 = "Both leads are playing their stereotypical roles, but they feel very comfortable in it. Really the best part of this film is watching these two actors go head to head in some really good scenes. Aside from those few scenes though, most of the film is so schmaltzy and predictable that it doesn't make sense for the film to be as long as it is. The direction is so flawed in its portrayal of several characters and too misguided in others that its hard to take many performances seriously. There's an entire sub plot with Farmiga that could have been completely removed, and there is a brother with a disability character who borders on offensive for much of the film. Billy Bob Thornton is so underutilized in this film that I don't know why he signed on, and the same is true for Vincent D'Onofrio. I don't know why the film is as long as it is, and it feels so self-indulgent for the director most of the time. If it weren't for Robert Downey Jr. and Duvall's performances, this film wouldn't have almost anything going for it. I was surprised that the courtroom scenes were as lackluster as they were, I was really expecting to enjoy those. They just fell flat most of the time. It's overall inoffensive, but it is nothing spectacular." # negative

test4 = "David Ayer, fresh off of a weird mixture of directing \"Sabotage\" and \"End of Watch\" (which he also wrote and produced), \"Fury\" could have gone either way, but I must say, this film is extremely impressive. Every crew member aboard that tank gives it their all in their performances and that is definitely the dividing line between whether or not this film would be good or bad. Brad Pitt, Logan Lerman, Shia Labeouf, John Bernthal, and Michael Pena are all believable in their roles. Normally I wouldn't waste my time listing every cast member, but there is not one bad performance here and everyone deserves recognition for their work. Yes, a few of them are a little underdeveloped, but you understand theirt motivations and pride for their country the whole way through. I wa immersed in these immaculatly shot war sequences that will have your heart pumping. It has been a while that I was so immersed in a film like I was with \"Fury\" and that is saying something. The brutally honest emotions given by all the characters throughout this film are terrific and you will not even think this film is 135 Minutes long, because the experience is immersive. \"Fury\" is the best war film I have seen in a very long time. It has a few nitpicking scenes, but other than that it blew me away. \"Fury\" hit's it's target." # positive

test5 = "This new film fuses together everything good about the original films, as well as the recent Marvel films, and does so with gusto. There's just so much to love about this film, from the reassembled cast, to the asides for fans of the comics, to the awe inspiring action and it all works well together. This film comes on the heels of the rights transferring from Fox to Marvel, and it shows in the production value, which obviously has help from Marvel Studios, to set up for their newly announced 2016 film for the X-Men canon. It's just brilliantly constructed, bringing all your favorite characters together, while also showing new information and new characters for us to love. Most of what we see comes directly from the comics, and that's something to rejoice over, but it's also pure, perfect, psychological action thriller. This is the new breed of X-Men, and they're far more intelligent and calculated than ever before." # positive

test6 = "Derivative, needlessly shaky, poorly acted and devoid of excitement, Earth to Echo is an adventure film which lacks a relatively vital component; any sense of adventure. Apart from Astro's performance as Tuck, and a viscerally compelling sequence involving Reese C. Hartwig's character Munch, Earth to Echo is a film which shall displease fans of the alien-discovery category of film, as well as kids desiring a film full of varied and interesting action. Relatively impressive visual effects save the film from an entirely poor rating, though this is still surely one to miss." # negative

test7= "Before viewing this film, I lowered my expectations, knowing that the film was probably going to be all dick and fart jokes. Not only was that exactly what this film is, but it is also savagely racist, and Seth McFarlane's presence is very off-putting, because he is a much better voice actor. He thought he could put together a sloppy old-fashion western comedy with modern-day lingo thrown in, and normally a movie that does that is hit or miss, but this just misses almost every single time. \"A Million Ways To Die In The West\" is easily the worst comedy that has come from 2014 so far. While not being a fan of Family Guy should not affect my viewings on this film, it feels like the same stupid humour that is present there, just a lot more gross-out stuff. Don't get me wrong, I laughed at \"Ted\" as much as the next guy, but this just feels like the decline of McFarlane's career. With poor writing and sloppy directing, there is not much to like here and it will hardly gain a single laugh." # negative

test8 = "This movie is definitely for a more mature audience, but I give this movie a round of applause. It provides a comedic effect to serious situations of life and it also shares its awkward moments that seem to be very natural in life and that is why I give this film a high rating because it mirrors life as we know it. This indie movie was great for Jenny Slate to star in...this is good for her resume and is just good for her in general. I hope she gets many more films to come later on in her future career. She seems as if she can further develop into a multi-dimensional actress." # positive

test9 = "Romantic, inspiring, and strongly performed, Belle is a period piece that transcends its trappings, and becomes a film that has a lot to say about life and the way we see ourselves. Powerfully led by actress Gugu Mbatha-Raw, the entire cast of Belle finds the humanity in their characters, and every character feels like a real person. Watching Dido struggle with her self-worth and the problem of racism in the world is so captivating and enthralling that you don't want to look away. I'm not even a big fan of period-piece romances, but this film had my heart crying out for Dido to find true love. It's an incredibly sweet and earnest film, and it deserves every sweet moment it has." # positive

test10 = "Let's not waste too much time assessing the insipidness that contains TASM2. With very few redeeming qualities, the follow up to the Andrew Garfield starring reboot is even worse than its predecessor. What were studio heads thinking (were they even thinking). When movies are made to treat the audience as lab rats, testing to see when enough is enough, it can only spell eventual doom. This series not only condescend to its intended audience but down right insults the average viewer with continuously pretentious tongue and cheek self-aggrandizing winking. Its saving grace is its top-notch production values, sadly used to promote a frivolous film." # negative

new_test_set = [test1,test2,test3,test4,test5,test6,test7,test8,test9,test10]

According to the original source, the ten reviews above can be categorized as follows:

1 : pos
2 : neg
3 : neg
4 : pos
5 : pos
6 : neg
7 : neg
8 : pos
9 : pos
10 : neg

Without telling the classifier about these labels, store them (as "true_labels") to assess the classifier's performance later on.

In [21]:
true_labels=['pos','neg','neg','pos','pos','neg','neg','pos','pos','neg']

# run the first classifier to classify the reviews on its own
i = 1
for review in new_test_set:
    print(i,':',our_first_classifier.classify(extract_features(nltk.word_tokenize(review),features)))
    i+=1

1 : neg
2 : pos
3 : pos
4 : neg
5 : pos
6 : pos
7 : pos
8 : neg
9 : pos
10 : pos


In [22]:
# compute the first classifier's accuracy on the 10 reviews given above

i=0
correctly_classified_reviews=0
incorrectly_classified_reviews=0

# compute numbers of correctly and incorrectly classified reviews
for review in new_test_set:
    what_classifier_thinks = our_first_classifier.classify(extract_features(nltk.word_tokenize(review),features))
    the_truth = true_labels[i]
    if what_classifier_thinks == the_truth:
        correctly_classified_reviews += 1 
    else:
        incorrectly_classified_reviews +=1
    i+=1 # increase counter by 1

# in Python 3 the / operator always performs true division, so the integer counts need no conversion to float

accuracy = correctly_classified_reviews / (correctly_classified_reviews + incorrectly_classified_reviews)
print("Accuracy of the first classifier on the ten reviews given above:", 100 * accuracy, "%")

Accuracy of the first classifier on the ten reviews given above: 20.0 %


As observed, the accuracy on the small test set (the Rotten Tomatoes reviews) is not comparable to the accuracy on the larger test set (the NLTK reviews); it is in fact much lower. The small test set has fewer words that the classifier can work with, and a sample of only ten reviews might not be representative in the first place.

### 2.6. Feature Engineering
In this section, revise the feature set and train a new classifier to improve the results.

In [23]:
# show most informative features
our_first_classifier.show_most_informative_features(12)

Most Informative Features
                     bad = True              neg : pos    =      1.9 : 1.0
                     bad = False             pos : neg    =      1.5 : 1.0
                    only = False             pos : neg    =      1.2 : 1.0
                    only = True              neg : pos    =      1.1 : 1.0
                  almost = True              pos : neg    =      1.1 : 1.0
                     old = True              pos : neg    =      1.1 : 1.0
                  almost = False             neg : pos    =      1.0 : 1.0
                    good = False             neg : pos    =      1.0 : 1.0
                     old = False             neg : pos    =      1.0 : 1.0
                    good = True              pos : neg    =      1.0 : 1.0
                     the = True              pos : neg    =      1.0 : 1.0


The features with the most extreme odds ratio are listed first. For example, something like "good = True, pos : neg = 5.0 : 1.0" means that based on the presence of the word "good" alone, the classifier places the odds of the review containing it being positive at 5 to 1. The classifier uses both the presence and the absence of a given word as a feature, so each word occurs twice in the list.
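
The following sketch shows how a ratio of this kind arises from raw counts, and why the presence and the absence of the same word yield two separate entries. All counts and the function name `feature_ratio` are hypothetical. NLTK's own figures come from smoothed probability estimates, so they will not match such raw ratios exactly.

```python
def feature_ratio(pos_matching, pos_total, neg_matching, neg_total):
    """Return the favoured label order and how many times likelier it is."""
    p_pos = pos_matching / pos_total   # share of positive reviews with this feature value
    p_neg = neg_matching / neg_total   # share of negative reviews with this feature value
    if p_pos >= p_neg:
        return "pos : neg", round(p_pos / p_neg, 1)
    return "neg : pos", round(p_neg / p_pos, 1)

# Hypothetical counts: "bad" occurs in 150 of 1000 positive
# and in 300 of 1000 negative training reviews.
print("bad = True ", feature_ratio(150, 1000, 300, 1000))
# The absence of "bad" is a feature value too: 850 positive vs. 700 negative reviews.
print("bad = False", feature_ratio(850, 1000, 700, 1000))
```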

In [24]:
# update feature set
student_defined_features = ["good", "bad", "fantastic", "pleassant", "weird", "fan", "poor", "like", "love", "enjoy", "worse", "incredible", "displease", "top-notch"]

print(student_defined_features)

# extract features
movie_features2 = [(extract_features(review,student_defined_features),category) for (review,category) in movies]
# split into training and testing sets
movies_training2, movies_test2 = movie_features2[100:], movie_features2[:100]
our_second_classifier = nltk.NaiveBayesClassifier.train(movies_training2)
# see how well our second classifier fares on the test set
accuracy_of_student_classifier = nltk.classify.accuracy(our_second_classifier,movies_test2)
improvement = accuracy_of_student_classifier - baseline
print("Accuracy of the student-defined classifier:", 100*accuracy_of_student_classifier, "%")
print("Baseline (accuracy of the classifier we defined above):", 100*baseline, "%")
print("Improvement compared with baseline:", round(100*improvement, 5), "%")
student_defined_features

['good', 'bad', 'fantastic', 'pleassant', 'weird', 'fan', 'poor', 'like', 'love', 'enjoy', 'worse', 'incredible', 'displease', 'top-notch']
Accuracy of the student-defined classifier: 72.0 %
Baseline (accuracy of the classifier we defined above): 71.0 %
Improvement compared with baseline: 1.0 %



In [25]:
# show most informative features of second classifier
our_second_classifier.show_most_informative_features(len(student_defined_features)*2)

Most Informative Features
               fantastic = True              pos : neg    =      4.4 : 1.0
                   worse = True              neg : pos    =      2.2 : 1.0
                    poor = True              neg : pos    =      2.1 : 1.0
                     bad = True              neg : pos    =      1.9 : 1.0
              incredible = True              pos : neg    =      1.8 : 1.0
                     bad = False             pos : neg    =      1.5 : 1.0
                   enjoy = True              pos : neg    =      1.4 : 1.0
                    like = False             pos : neg    =      1.2 : 1.0
                   weird = True              neg : pos    =      1.2 : 1.0
                    love = True              pos : neg    =      1.2 : 1.0
                     fan = True              neg : pos    =      1.1 : 1.0
                    love = False             neg : pos    =      1.1 : 1.0
                   worse = False             pos : neg    =      1.1 : 1.0

In [26]:
# print the amount of improvement you made over the original classifier
print(round(improvement, 5))

0.01


I chose these words because they have either strongly positive or strongly negative connotations. Even outside the context of the reviews, I thought that these words were far from neutral. I think the reason I only got an improvement of 1 percentage point over the baseline accuracy is that some of these features are not very common, which would affect the odds ratio.
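One way to check the hypothesis that rare features limit the improvement is to measure how many reviews contain each feature word at all. The sketch below is illustrative only: the helper `feature_coverage` and the toy reviews are assumptions and not part of the notebook. On the real data, one would presumably pass the tokenized reviews, e.g. `[review for (review, category) in movies]`.

```python
from collections import Counter


def feature_coverage(tokenized_reviews, features):
    """Return, for each feature word, the fraction of reviews containing it."""
    if not tokenized_reviews:
        # No reviews: report zero coverage rather than dividing by zero.
        return {word: 0.0 for word in features}
    counts = Counter()
    for tokens in tokenized_reviews:
        present = set(tokens)  # count each word at most once per review
        for word in features:
            if word in present:
                counts[word] += 1
    total = len(tokenized_reviews)
    return {word: counts[word] / total for word in features}


# Toy data (hypothetical), standing in for the tokenized movie reviews.
toy_reviews = [
    ["a", "fantastic", "film"],
    ["bad", "plot", "bad", "acting"],
    ["not", "bad", "but", "not", "good"],
    ["pleasant", "enough"],
]
toy_features = ["good", "bad", "fantastic", "pleassant"]

coverage = feature_coverage(toy_reviews, toy_features)
for word, fraction in sorted(coverage.items(), key=lambda item: item[1]):
    print(word, fraction)
```

A feature with zero coverage cannot contribute any evidence. In the toy data this is the case for the misspelled `pleassant`, which does not match the correctly spelled token `pleasant`.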

### 2.7. Assessing Performance

Imagine a user whose job is to promote a movie and who is using our classifier to automatically identify positive reviews in order to use them in advertising materials. For this user, negative reviews are irrelevant because the user is not interested in any negative reviews the system might be identifying. (In other words, positive reviews are like ham and negative reviews are like spam. Or to use another analogy, positive reviews are like relevant search engine results and negative reviews are like irrelevant search engine results. We are using this conceit so that we can apply the notions of precision and recall to this situation.) Let's assess the classifier's performance in terms of precision and recall.
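For reference, with positive reviews treated as the relevant class, the standard definitions can be written as follows. The abbreviations TP, FP, FN, and TN (true positives, false positives, false negatives, true negatives) are shorthand introduced here. They correspond to the counts `pos_classified_as_pos`, `neg_classified_as_pos`, `pos_classified_as_neg`, and `neg_classified_as_neg` used in the cells that follow.

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}
$$

Precision answers "of the reviews the classifier labeled positive, how many really are positive?", while recall answers "of the reviews that really are positive, how many did the classifier find?".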

In [27]:
# classify the reviews given above using the classifier that has been trained on the features defined above
for i, review in enumerate(new_test_set, start=1):
    print(i,'.',our_second_classifier.classify(extract_features(nltk.word_tokenize(review),student_defined_features)))

1 . pos
2 . neg
3 . pos
4 . neg
5 . pos
6 . neg
7 . neg
8 . pos
9 . pos
10 . neg


In [30]:
# compute the precision, recall, and overall accuracy (defined as the proportion of reviews that were correctly classified) of the classifier as a number between zero and one

counter=0

pos_classified_as_pos=0.0 # true positive -- a result that is correctly classified as positive
pos_classified_as_neg=0.0 # false negative -- a positive result that is incorrectly classified as negative
neg_classified_as_pos=0.0 # false positive -- a negative result that is incorrectly classified as positive
neg_classified_as_neg=0.0 # true negative -- a result that is correctly classified as negative

# compute numbers of true and false positives and negatives
for review in new_test_set:
    what_classifier_thinks = our_second_classifier.classify(extract_features(nltk.word_tokenize(review),student_defined_features))
    the_truth = true_labels[counter]
    if the_truth == 'pos' and what_classifier_thinks == 'pos':
        pos_classified_as_pos+=1 # true positive
    elif the_truth == 'pos' and what_classifier_thinks == 'neg':
        pos_classified_as_neg+=1 # false negative
    elif the_truth == 'neg' and what_classifier_thinks == 'pos':
        neg_classified_as_pos+=1 # false positive
    elif the_truth == 'neg' and what_classifier_thinks == 'neg':
        neg_classified_as_neg+=1 # true negative
    counter += 1 # increase counter by 1

total = counter

In [31]:
correctly_classified_reviews = pos_classified_as_pos+neg_classified_as_neg
print("Number of correctly classified reviews:", correctly_classified_reviews)

Number of correctly classified reviews: 8.0


In [32]:
incorrectly_classified_reviews = pos_classified_as_neg+neg_classified_as_pos
print("Number of incorrectly classified reviews:", incorrectly_classified_reviews)

Number of incorrectly classified reviews: 2.0


In [33]:
def print_results():
    print("Number of correctly classified reviews overall:", int(correctly_classified_reviews))
    print("Number of incorrectly classified reviews overall:", int(incorrectly_classified_reviews))
    print("Number of positive reviews correctly classified as positive:", int(pos_classified_as_pos))
    print("Number of positive reviews incorrectly classified as negative:", int(pos_classified_as_neg))
    print("Number of negative reviews incorrectly classified as positive:", int(neg_classified_as_pos))
    print("Number of negative reviews correctly classified as negative:", int(neg_classified_as_neg))
print_results()

Number of correctly classified reviews overall: 8
Number of incorrectly classified reviews overall: 2
Number of positive reviews correctly classified as positive: 4
Number of positive reviews incorrectly classified as negative: 1
Number of negative reviews incorrectly classified as positive: 1
Number of negative reviews correctly classified as negative: 4


In [34]:
precision = pos_classified_as_pos/(pos_classified_as_pos+neg_classified_as_pos)
print("Precision as a number between zero and one:", precision)
print("Precision in percent:", precision*100)

Precision as a number between zero and one: 0.8
Precision in percent: 80.0


In [35]:
recall = pos_classified_as_pos/(pos_classified_as_pos+pos_classified_as_neg)
print("Recall as a number between zero and one:", recall)
print("Recall in percent:", recall*100)

Recall as a number between zero and one: 0.8
Recall in percent: 80.0


In [36]:
accuracy = correctly_classified_reviews/(correctly_classified_reviews+incorrectly_classified_reviews)
print("Accuracy as a number between zero and one:", accuracy)
print("Accuracy in percent:", accuracy*100)

Accuracy as a number between zero and one: 0.8
Accuracy in percent: 80.0
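
The three calculations above all draw on the same four counts, so they can be wrapped in a single helper. The following is a minimal sketch, not part of the original notebook: the function name `classification_metrics` and the choice to return `None` when a denominator is zero (for example, when the classifier never predicts `pos`) are assumptions made for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (precision, recall, accuracy) from the four confusion counts.

    tp: positives classified as positive    fn: positives classified as negative
    fp: negatives classified as positive    tn: negatives classified as negative
    A ratio with a zero denominator is returned as None (sketch convention).
    """
    def safe_ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    accuracy = safe_ratio(tp + tn, tp + fn + fp + tn)
    return precision, recall, accuracy


# Counts taken from the printed results above: 4 TP, 1 FN, 1 FP, 4 TN.
metrics = classification_metrics(tp=4, fn=1, fp=1, tn=4)
print(metrics)  # (0.8, 0.8, 0.8)
```

With the counts from this test set, all three metrics coincide at 0.8. That is a consequence of the one false positive and one false negative balancing out here, not a general property.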
