# Final Project - Kaggle Movie Reviews
Text Classification and Sentiment Analysis by Kelly Hwang

My final project will use the movie review data and run 3 iterations of the NB classifcation model using 3 different sets of feature experimentation: basic features, no stop word or non alphanumerica features, and finally a sentiment lexicon feature.

The aim is to view model improvement (or deprovement) with each iteration of features and to demonstrate its effects on model performance.

For the advanced experiment I used 3 models from the SciKit Learn Classifiers: Naive Bayes, Logistic Regression, and Random Forest.

### Text Classification and Tokenization
Using the python template classifyKaggle as a guide.


The sentiment labels are as follows:
- 0: negative
- 1: somewhat negative
- 2: neutral
- 3: somewhat positive
- 4: positive

In [1]:
#import modules
import os
import sys
import random
import nltk
from nltk.corpus import stopwords
import re
from nltk import FreqDist

In [2]:
f = open('train.tsv', 'r')

In [3]:
# loop over lines in the file and use the first limit of them
phrasedata = []
for line in f:
    # ignore the first line starting with Phrase and read all lines
    if (not line.startswith('Phrase')):
      # remove final end of line character
      line = line.strip()
      # each line has 4 items separated by tabs
      # ignore the phrase and sentence ids, and keep the phrase and sentiment
      phrasedata.append(line.split('\t')[2:4])

In [4]:
#randomize phrases
random.shuffle(phrasedata)

In [5]:
#phrase limit
limit = 10000

In [6]:
# pick a random sample of length limit because of phrase overlapping sequences
phraselist = phrasedata[:limit]
print('Read', len(phrasedata), 'phrases, using', len(phraselist), 'random phrases')

Read 156060 phrases, using 10000 random phrases


In [7]:
for phrase in phraselist[:10]:
    print (phrase)

['your seat , tense with suspense', '4']
['a bunch of hot-button items', '2']
['his chilling , unnerving film', '2']
['gives us', '2']
['inside a high-tech space station', '2']
['usurp', '2']
['on the genre', '2']
['Onion', '2']
['succumbs to joyless special-effects excess', '1']
['mild disturbance or detached pleasure', '1']


In [127]:
# create list of phrase documents as (list of words, label)
phrasedocs = []
# add all the phrases in lowercase
for phrase in phraselist:
    tokens = nltk.word_tokenize(phrase[0].lower())
    phrasedocs.append((tokens, int(phrase[1])))

In [128]:
# print a few
for phrase in phrasedocs[:10]:
    print (phrase)

(['your', 'seat', ',', 'tense', 'with', 'suspense'], 4)
(['a', 'bunch', 'of', 'hot-button', 'items'], 2)
(['his', 'chilling', ',', 'unnerving', 'film'], 2)
(['gives', 'us'], 2)
(['inside', 'a', 'high-tech', 'space', 'station'], 2)
(['usurp'], 2)
(['on', 'the', 'genre'], 2)
(['onion'], 2)
(['succumbs', 'to', 'joyless', 'special-effects', 'excess'], 1)
(['mild', 'disturbance', 'or', 'detached', 'pleasure'], 1)


### Building Word Features - Basic

In [138]:
#tokenize all words to begin building features
#get all words and prepare frequency distribution
all_words_token = [word for (sent, cat) in phrasedocs for word in sent]

#ignore this step as I moved the lower function in the beginning instead of
#while building features
#all_words_list = [w.lower() for w in all_words_token]

all_words = nltk.FreqDist(all_words_list)

In [131]:
#2000 most frequently used words in corpus
word_items = all_words.most_common(2000)

#preview top 20
word_items[:20]

[('the', 3342),
 (',', 2650),
 ('a', 2332),
 ('and', 2092),
 ('of', 2038),
 ('to', 1399),
 ('.', 1154),
 ("'s", 1099),
 ('in', 902),
 ('is', 864),
 ('that', 824),
 ('it', 736),
 ('as', 556),
 ('for', 506),
 ('with', 482),
 ('its', 472),
 ('film', 438),
 ('an', 414),
 ('this', 393),
 ('movie', 372)]

In [132]:
#build features on 2000 most frequently used words
word_features = [word for (word, count) in word_items]
print(word_features[:50])

['the', ',', 'a', 'and', 'of', 'to', '.', "'s", 'in', 'is', 'that', 'it', 'as', 'for', 'with', 'its', 'film', 'an', 'this', 'movie', 'but', 'be', 'you', 'on', "n't", 'more', 'his', 'by', 'about', 'one', 'all', 'not', 'from', 'at', '--', 'have', '``', 'than', 'or', 'has', 'like', 'are', 'so', "'", '-rrb-', 'most', '-lrb-', 'who', '...', 'out']


In [133]:
#define the features for unigram baseline
#each feature is contains(keyword) and is T or F depending on
#whether the keyword is in the phrase

def phrase_features(phrase, word_features):
    phrase_words = set(phrase)
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in phrase_words)
    return features

In [134]:
featuresets = [(phrase_features(d, word_features), c) for (d, c) in phrasedocs]

In [135]:
#preview feature sets
featuresets[0]

({'V_the': False,
  'V_,': True,
  'V_a': False,
  'V_and': False,
  'V_of': False,
  'V_to': False,
  'V_.': False,
  "V_'s": False,
  'V_in': False,
  'V_is': False,
  'V_that': False,
  'V_it': False,
  'V_as': False,
  'V_for': False,
  'V_with': True,
  'V_its': False,
  'V_film': False,
  'V_an': False,
  'V_this': False,
  'V_movie': False,
  'V_but': False,
  'V_be': False,
  'V_you': False,
  'V_on': False,
  "V_n't": False,
  'V_more': False,
  'V_his': False,
  'V_by': False,
  'V_about': False,
  'V_one': False,
  'V_all': False,
  'V_not': False,
  'V_from': False,
  'V_at': False,
  'V_--': False,
  'V_have': False,
  'V_``': False,
  'V_than': False,
  'V_or': False,
  'V_has': False,
  'V_like': False,
  'V_are': False,
  'V_so': False,
  "V_'": False,
  'V_-rrb-': False,
  'V_most': False,
  'V_-lrb-': False,
  'V_who': False,
  'V_...': False,
  'V_out': False,
  'V_story': False,
  'V_if': False,
  'V_into': False,
  'V_up': False,
  'V_can': False,
  'V_good': False

### Naive Bayes Classification Model - The Base

In [136]:
#train using naive bayes classifier
#training set is 90% of the 10k phrases
train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [137]:
#accuracy of the classifier
nltk.classify.accuracy(classifier, test_set)

0.523

For record keeping purposes the original accuracy from this randomized set is 0.523.

In [139]:
#which features are most informative
classifier.show_most_informative_features(30)

Most Informative Features
           V_beautifully = True                4 : 2      =     58.6 : 1.0
                 V_sweet = True                4 : 2      =     53.0 : 1.0
                 V_worse = True                0 : 2      =     48.2 : 1.0
                 V_waste = True                0 : 2      =     48.2 : 1.0
                  V_ugly = True                0 : 2      =     48.2 : 1.0
             V_hilarious = True                4 : 2      =     47.4 : 1.0
               V_delight = True                4 : 2      =     41.9 : 1.0
                 V_fails = True                0 : 2      =     40.7 : 1.0
            V_engrossing = True                4 : 2      =     36.3 : 1.0
              V_powerful = True                4 : 2      =     35.2 : 1.0
              V_visually = True                0 : 2      =     33.3 : 1.0
                V_absurd = True                0 : 2      =     33.3 : 1.0
                 V_empty = True                0 : 2      =     33.3 : 1.0

### Removing StopWords - Experiment

In [141]:
# get a list of stopwords from nltk
nltkstopwords = nltk.corpus.stopwords.words('english')
print(len(nltkstopwords))
print(nltkstopwords)

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In [142]:
# function that takes a word and returns true if it consists only
#   of non-alphabetic characters  (assumes import re)
def alpha_filter(w):
  # pattern to match word of non-alphabetical characters
  pattern = re.compile('^[^a-z]+$')
  if (pattern.match(w)):
    return True
  else:
    return False

In [143]:
#apply function to all words list
alphatextwords = [w for w in all_words_list if not alpha_filter(w)]
print(alphatextwords[:100])
print(len(alphatextwords))

['your', 'seat', 'tense', 'with', 'suspense', 'a', 'bunch', 'of', 'hot-button', 'items', 'his', 'chilling', 'unnerving', 'film', 'gives', 'us', 'inside', 'a', 'high-tech', 'space', 'station', 'usurp', 'on', 'the', 'genre', 'onion', 'succumbs', 'to', 'joyless', 'special-effects', 'excess', 'mild', 'disturbance', 'or', 'detached', 'pleasure', 'reminds', 'us', 'that', 'beneath', 'the', 'hype', 'the', 'celebrity', 'the', 'high', 'life', 'the', 'conspiracies', 'and', 'the', 'mystery', 'there', 'were', 'once', 'a', 'couple', 'of', 'bright', 'young', 'men', 'promising', 'talented', 'charismatic', 'and', 'tragically', 'doomed', '-lrb-', 'hell', 'is', '-rrb-', 'looking', 'down', 'at', 'your', 'watch', 'and', 'realizing', 'serving', 'sara', 'is', "n't", 'even', 'halfway', 'through', 'to', 'break', 'free', 'of', 'her', 'old', 'life', 'i', 'like', 'about', 'men', 'with', 'brooms', 'and', 'what']
67652


In [144]:
#more stopwords as used in Lab
morestopwords = ['could','would','might','must','need','sha','wo','y',"'s","'d","'ll","'t","'m","'re","'ve", "n't"]

In [145]:
#preview new stopword compilation
stopwords = nltkstopwords + morestopwords
print(len(stopwords))
print(stopwords)

195
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In [146]:
#filter shortwords list to not include any stopwords that we defined above
stoppedtextwords = [w for w in alphatextwords if not w in stopwords]
print(len(stoppedtextwords))

39123


In [147]:
#frequency distribution for top 2000
#after cleaning stopwords and non-alphanumeric
text_dist = FreqDist(stoppedtextwords)
text_items = text_dist.most_common(2000)

#preview top 25
text_items[:25]

[('film', 438),
 ('movie', 372),
 ('one', 242),
 ('like', 187),
 ('-rrb-', 164),
 ('-lrb-', 153),
 ('story', 141),
 ('good', 129),
 ('characters', 123),
 ('even', 112),
 ('funny', 108),
 ('time', 108),
 ('comedy', 106),
 ('way', 101),
 ('much', 97),
 ('life', 96),
 ('love', 93),
 ('movies', 90),
 ('enough', 87),
 ('new', 85),
 ('little', 85),
 ('work', 84),
 ('us', 83),
 ('make', 81),
 ('plot', 77)]

### Building Features Without Stopwords

In [148]:
#build features on 2000 most frequently used words
nostop_features = [word for (word, count) in text_items]
print(nostop_features[:50])

['film', 'movie', 'one', 'like', '-rrb-', '-lrb-', 'story', 'good', 'characters', 'even', 'funny', 'time', 'comedy', 'way', 'much', 'life', 'love', 'movies', 'enough', 'new', 'little', 'work', 'us', 'make', 'plot', 'something', 'many', 'director', 'better', 'two', 'never', 'bad', 'may', 'makes', 'people', 'world', 'best', 'made', 'action', 'see', 'films', 'hollywood', 'character', 'look', 'well', 'man', 'ever', 'really', 'without', 'humor']


In [151]:
#create features with phrase_features function
nostop_featuresets = [(phrase_features(d, nostop_features), c) for (d, c) in phrasedocs]

In [152]:
#preview feature sets
nostop_featuresets[0]

({'V_film': False,
  'V_movie': False,
  'V_one': False,
  'V_like': False,
  'V_-rrb-': False,
  'V_-lrb-': False,
  'V_story': False,
  'V_good': False,
  'V_characters': False,
  'V_even': False,
  'V_funny': False,
  'V_time': False,
  'V_comedy': False,
  'V_way': False,
  'V_much': False,
  'V_life': False,
  'V_love': False,
  'V_movies': False,
  'V_enough': False,
  'V_new': False,
  'V_little': False,
  'V_work': False,
  'V_us': False,
  'V_make': False,
  'V_plot': False,
  'V_something': False,
  'V_many': False,
  'V_director': False,
  'V_better': False,
  'V_two': False,
  'V_never': False,
  'V_bad': False,
  'V_may': False,
  'V_makes': False,
  'V_people': False,
  'V_world': False,
  'V_best': False,
  'V_made': False,
  'V_action': False,
  'V_see': False,
  'V_films': False,
  'V_hollywood': False,
  'V_character': False,
  'V_look': False,
  'V_well': False,
  'V_man': False,
  'V_ever': False,
  'V_really': False,
  'V_without': False,
  'V_humor': False,
  'V_a

### Retrain Naive Bayes Classifier Using No Stopwords

In [153]:
#train using naive bayes classifier
#training set is 90% of the 10k phrases
train_set, test_set = nostop_featuresets[1000:], nostop_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [154]:
#accuracy of the classifier
nltk.classify.accuracy(classifier, test_set)

0.541

Original accuracy was 0.523. Without stopwords accuracy increased to 0.541.

In [155]:
#which features are most informative
classifier.show_most_informative_features(30)

Most Informative Features
           V_beautifully = True                4 : 2      =     58.6 : 1.0
                 V_sweet = True                4 : 2      =     53.0 : 1.0
                 V_worse = True                0 : 2      =     48.2 : 1.0
                 V_waste = True                0 : 2      =     48.2 : 1.0
                  V_ugly = True                0 : 2      =     48.2 : 1.0
             V_hilarious = True                4 : 2      =     47.4 : 1.0
               V_delight = True                4 : 2      =     41.9 : 1.0
                 V_fails = True                0 : 2      =     40.7 : 1.0
            V_engrossing = True                4 : 2      =     36.3 : 1.0
              V_powerful = True                4 : 2      =     35.2 : 1.0
              V_visually = True                0 : 2      =     33.3 : 1.0
                V_absurd = True                0 : 2      =     33.3 : 1.0
                 V_empty = True                0 : 2      =     33.3 : 1.0

sweet, beautiful, and worse are still the top 3 features.

### More Features from Sentiment Lexicon - Experiment

Since removing stopwords was a success in making my model more accurate, I wanted to build additional features on top of my no_stop features using the Sentiment Lexicon.

In [19]:
import sentiment_read_subjectivity as srs
import sentiment_read_LIWC_pos_neg_words as liwc

In [20]:
lexpath = 'SentimentLexicons/subjclueslen1-HLTEMNLP05.tff'

In [21]:
#create positive, neutral, negative word lists
(positivelist, neutrallist, negativelist) = srs.read_subjectivity_three_types(lexpath)

In [22]:
#see first 5 words of each list
print('positive:', positivelist[0:5])
print('neutral:', neutrallist[0:5])
print('negative:', negativelist[0:5])

positive: ['abidance', 'abidance', 'abide', 'abilities', 'ability']
neutral: ['absolute', 'absolutely', 'absorbed', 'accentuate', 'activist']
negative: ['abandoned', 'abandonment', 'abandon', 'abase', 'abasement']


In [23]:
#establish dictionary
SL = srs.readSubjectivity(lexpath)

In [24]:
#test dictionary
print(SL['blasphemous'])
print(SL['absolute'])

['strongsubj', 'anypos', True, 'negative']
['strongsubj', 'adj', False, 'neutral']


In [25]:
#define SL features that will have word counts of subjectivity words
# negative feature will have number of weakly negative words +
#    2 * number of strongly negative words
# positive feature has similar definition
#    not counting neutral words
def SL_features(phrase, word_features, SL):
    phrase_words = set(phrase)
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in phrase_words)
    # count variables for the 4 classes of subjectivity
    weakPos = 0
    strongPos = 0
    weakNeg = 0
    strongNeg = 0
    for word in phrase_words:
        if word in SL:
            strength, posTag, isStemmed, polarity = SL[word]
            if strength == 'weaksubj' and polarity == 'positive':
                weakPos += 1
            if strength == 'strongsubj' and polarity == 'positive':
                strongPos += 1
            if strength == 'weaksubj' and polarity == 'negative':
                weakNeg += 1
            if strength == 'strongsubj' and polarity == 'negative':
                strongNeg += 1
            features['positivecount'] = weakPos + (2 * strongPos)
            features['negativecount'] = weakNeg + (2 * strongNeg)      
    return features

In [156]:
#create feature sets as before, but using this feature extraction function
SL_featuresets = [(SL_features(d, nostop_features, SL), c) for (d, c) in phrasedocs]

In [157]:
#show just the two sentiment lexicon features in phrase 0
print(SL_featuresets[0][0]['positivecount'])
print(SL_featuresets[0][0]['negativecount'])

0
1


In [158]:
#this gives the label of phrase 0
SL_featuresets[0][1]

4

In [159]:
#number of features for phrase 0
len(SL_featuresets[0][0].keys())

2002

In [160]:
#preview new feature sets
SL_featuresets[0]

({'V_film': False,
  'V_movie': False,
  'V_one': False,
  'V_like': False,
  'V_-rrb-': False,
  'V_-lrb-': False,
  'V_story': False,
  'V_good': False,
  'V_characters': False,
  'V_even': False,
  'V_funny': False,
  'V_time': False,
  'V_comedy': False,
  'V_way': False,
  'V_much': False,
  'V_life': False,
  'V_love': False,
  'V_movies': False,
  'V_enough': False,
  'V_new': False,
  'V_little': False,
  'V_work': False,
  'V_us': False,
  'V_make': False,
  'V_plot': False,
  'V_something': False,
  'V_many': False,
  'V_director': False,
  'V_better': False,
  'V_two': False,
  'V_never': False,
  'V_bad': False,
  'V_may': False,
  'V_makes': False,
  'V_people': False,
  'V_world': False,
  'V_best': False,
  'V_made': False,
  'V_action': False,
  'V_see': False,
  'V_films': False,
  'V_hollywood': False,
  'V_character': False,
  'V_look': False,
  'V_well': False,
  'V_man': False,
  'V_ever': False,
  'V_really': False,
  'V_without': False,
  'V_humor': False,
  'V_a

### Retrain Naive Bayes Classifier Using New SL Features

In [161]:
#retrain the classification model using new features
#same 90% split
train_set, test_set = SL_featuresets[1000:], SL_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [162]:
#new accuracy
nltk.classify.accuracy(classifier, test_set)

0.551

The original accuracy was 0.523.

Removing stopwords increased this to 0.541.

Layering SL on top improved this further to 0.551!

### Evaluation Using Cross Validation

Using my NB model using SL features as the final model for my project, I will compare only this final iterance against the original model with basic features.

In [33]:
## cross-validation ##
#this function takes the number of folds, the feature sets
#it iterates over the folds, using different sections for training and testing in turn
#it prints the accuracy for each fold and the average accuracy at the end
def cross_validation_accuracy(num_folds, featuresets):
    subset_size = int(len(featuresets)/num_folds)
    print('Each fold size:', subset_size)
    accuracy_list = []
    # iterate over the folds
    for i in range(num_folds):
        test_this_round = featuresets[(i*subset_size):][:subset_size]
        train_this_round = featuresets[:(i*subset_size)] + featuresets[((i+1)*subset_size):]
        # train using train_this_round
        classifier = nltk.NaiveBayesClassifier.train(train_this_round)
        # evaluate against test_this_round and save accuracy
        accuracy_this_round = nltk.classify.accuracy(classifier, test_this_round)
        print (i, accuracy_this_round)
        accuracy_list.append(accuracy_this_round)
    # find mean accuracy over all rounds
    print ('mean accuracy', sum(accuracy_list) / num_folds)

In [163]:
#perform the cross-validation on the featuresets with word features
#and generate accuracy

k = 5
cross_validation_accuracy(k, featuresets)

Each fold size: 2000
0 0.517
1 0.542
2 0.52
3 0.5455
4 0.526
mean accuracy 0.5301


Cross-validation shows mean accuracy as 0.5301.

In [164]:
#cross-validation on SL_featuresets
#using same number of folds k = 5
cross_validation_accuracy(k, SL_featuresets)

Each fold size: 2000
0 0.5385
1 0.5585
2 0.545
3 0.5645
4 0.544
mean accuracy 0.5501


Cross-validation shows mean accuracy as 0.5501 which is an improvement over the regular featureset. The lift is similar to the regular accuracy evaluation method. In the 4th instance the accuracy hit its highest mark at 0.5645.

### Final Evaluation Using Precision, Recall, and F1 

In [167]:
#create goldlist and predicted list
goldlist = []
predictedlist = []
for (features, label) in test_set:
    	goldlist.append(label)
    	predictedlist.append(classifier.classify(features))

In [168]:
#look at the first 30 examples
print(goldlist[:30])
print(predictedlist[:30])

[4, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 0, 4, 3, 2, 2, 3, 2, 1, 2, 2, 4, 4, 2, 2, 2, 2, 3, 2, 4]
[1, 2, 3, 2, 3, 2, 2, 2, 2, 3, 3, 0, 2, 3, 2, 2, 3, 1, 1, 2, 2, 4, 4, 2, 3, 2, 2, 3, 2, 3]


In [169]:
#Function to compute precision, recall and F1 for each label
#and for any number of labels
#Input: list of gold labels, list of predicted labels (in same order)
#Output:  prints precision, recall and F1 for each label
def eval_measures(gold, predicted):
    # get a list of labels
    labels = list(set(gold))
    # these lists have values for each label 
    recall_list = []
    precision_list = []
    F1_list = []
    for lab in labels:
        # for each label, compare gold and predicted lists and compute values
        TP = FP = FN = TN = 0
        for i, val in enumerate(gold):
            if val == lab and predicted[i] == lab:  TP += 1
            if val == lab and predicted[i] != lab:  FN += 1
            if val != lab and predicted[i] == lab:  FP += 1
            if val != lab and predicted[i] != lab:  TN += 1
        # use these to compute recall, precision, F1
        recall = TP / (TP + FP)
        precision = TP / (TP + FN)
        recall_list.append(recall)
        precision_list.append(precision)
        F1_list.append( 2 * (recall * precision) / (recall + precision))

    # the evaluation measures in a table with one row per label
    print('\tPrecision\tRecall\t\tF1')
    # print measures for each label
    for i, lab in enumerate(labels):
        print(lab, '\t', "{:10.3f}".format(precision_list[i]), \
          "{:10.3f}".format(recall_list[i]), "{:10.3f}".format(F1_list[i]))

In [171]:
#call the function with our data
eval_measures(goldlist, predictedlist)

	Precision	Recall		F1
0 	      0.212      0.212      0.212
1 	      0.335      0.400      0.365
2 	      0.726      0.683      0.704
3 	      0.471      0.434      0.452
4 	      0.197      0.293      0.235


Label 2 is the best performer, with the highest Precision 0.726, Recall 0.683, and F1 0.704. The rest of the labels only had precision/recall values of less than 0.5 which shows Label 2 predictions were better by a very large margin.

Intuitively this makes complete sense, as label 2 equates to neutral sentiment. I would hypothesize that neutral reviews make up the majority of all review labels, and with more neutral labels to feed into the model, it is better at predicting those types of sentiments.

### SciKit Learn Additional Models

Advanced experiment attempt.

Using 3 different models from SciKit-Learning: NB, Logistic Regression, and Random Forest.

The test data set will be the one fully boosted without stopwords and sentiment lexicon applied.

In [210]:
from nltk.classify.scikitlearn import SklearnClassifier

In [213]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [217]:
classifierNB = SklearnClassifier(MultinomialNB())
classifierNB.train(train_set)
print("SciKitLearn Classifer: Multinomial NB")
print("Accuracy:", nltk.classify.accuracy(classifierNB, test_set))

SciKitLearn Classifer: Multinomial NB
Accuracy:  0.547


In [219]:
classifierLR = SklearnClassifier(LogisticRegression())
classifierLR.train(train_set)
print("SciKitLearn Classifer: Logistic Regression")
print("Accuracy:", nltk.classify.accuracy(classifierLR, test_set))

SciKitLearn Classifer: Logistic Regression
Accuracy: 0.555


In [220]:
classifierRDF = SklearnClassifier(RandomForestClassifier())
classifierRDF.train(train_set)
print("SciKitLearn Classifer: Random Forest")
print("Accuracy:", nltk.classify.accuracy(classifierRDF, test_set))

SciKitLearn Classifer: Random Forest
Accuracy: 0.534


Among these 3 models, Logistic Regression outputted the highest accuracy of 0.555.

### End - Thank you!