# Sentiment analysis with NLTK Naive Bayes

## Naive Bayes with NLTK

A Naive Bayes classifier estimates the probability that an input text belongs to each of a set of classes, e.g. predicting whether a review is positive or negative.

It is ‘naive’ because it assumes that the words in a text are independent of one another given the class (even though, in natural human language, word order and co-occurrence convey contextual information). Despite this simplifying assumption, Naive Bayes often achieves good accuracy, even with only a small training set.

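- Zhang, H. (2004). The optimality of naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004). https://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf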
- Baines, O., Naive Bayes: Machine Learning and Text Classification Application of Bayes’ Theorem. https://journals.le.ac.uk/ojs1/index.php/lumj/article/download/3484/3110
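
For reference, the independence assumption described above leads to the following decision rule. The symbols are notation introduced here for illustration, not taken from the notebook: $c$ is a class (positive or negative) and $f_1, \dots, f_n$ are the features extracted from a review.

```latex
P(c \mid f_1, \dots, f_n) \;\propto\; P(c) \prod_{i=1}^{n} P(f_i \mid c)
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```

The product over individual features is what the ‘naive’ assumption buys: each $P(f_i \mid c)$ can be estimated separately from simple counts in the training data.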

### Imports

In [1]:
import nltk
from nltk.metrics.scores import precision, recall, f_measure
import pandas as pd
import collections

import sys
sys.path.append("..") # Adds higher directory to python modules path.
from NLPmoviereviews.data import load_data_sent
from NLPmoviereviews.utilities import preprocessing

### 1. Load data

In [3]:
# load data
X_train, y_train, X_test, y_test = load_data_sent(percentage_of_sentences=10);

### 2. Prepare text

In [264]:
# remove custom stop-words
def rm_custom_stops(sentence):
    '''
    Custom stop word remover
    Parameters:
        sentence (str): a string of words
    Returns:
        list_of_words (list): cleaned sentence as a list of words
    '''
    words = sentence.split()
    stop_words = {'movie', 'film', 'br', 'x96'}
    
    return [w for w in words if w not in stop_words]

In [265]:
# perform preprocessing (cleaning) & transform to dataframe
def process_df(X, y):
    '''
    Transform texts and labels into dataframe of 
    cleaned texts (as list of words) and human readable target labels
    
    Parameters:
        X (list): list of strings (reviews)
        y (list): list of target labels (0/1)
    Returns:
        df (dataframe): dataframe of processed reviews (as list of words)
                        and corresponding sentiment label (positive/negative)
    '''
    # create dataframe from data
    d = {'text': X, 'sentiment': y}
    df = pd.DataFrame(d)
    
    # make sentiment human-readable
    df['sentiment'] = df.sentiment.map(lambda x: 'positive' if x==1 else 'negative')

    # clean and split text into list of words
    df['text'] = df.text.apply(preprocessing)
    df['text'] = df.text.apply(rm_custom_stops)

    # return the processed dataframe (featuresets are generated later, in step 4)
    return df

In [266]:
# process data
train_df = process_df(X_train, y_train)
test_df = process_df(X_test, y_test)

In [267]:
# inspect dataframe
train_df.head()

| | text | sentiment |
|---|---|---|
| 0 | [absolutely, terrible, dont, lure, christopher...] | negative |
| 1 | [know, fall, asleep, usually, due, combination...] | negative |
| 2 | [mann, photograph, alberta, rocky, mountain, s...] | negative |
| 3 | [kind, snowy, sunday, afternoon, rest, world, ...] | positive |
| 4 | [others, mention, woman, go, nude, mostly, abs...] | positive |


### 3. Create list of most common words

In [268]:
# get frequency distribution of words in corpus & select 2000 most common words
def most_common(df, n=2000):
    '''
    Get n most common words from data frame of text reviews
    
    Parameters:
        df (dataframe): dataframe with column of processed text reviews
        n (int): number of most common words to get
    Returns:
        most_common_words (list): list of n most common words
    '''
    # create list of all words in the train data
    complete_corpus = df.text.sum()
    
    # Construct a frequency dict of all words in the overall corpus 
    all_words = nltk.FreqDist(w.lower() for w in complete_corpus)

    # select the n most frequent words (as (word, count) pairs)
    most_common_words = all_words.most_common(n)
    
    return [item[0] for item in most_common_words]
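
As a minimal sketch of what `nltk.FreqDist` and `most_common` do in the function above, the following runs the same two steps on a tiny made-up corpus. The name `toy_reviews` and its contents are hypothetical and exist only for illustration.

```python
import nltk

# Hypothetical toy corpus: each review is already a list of cleaned words
toy_reviews = [
    ["good", "plot", "good", "cast"],
    ["bad", "plot"],
    ["good", "ending"],
]

# Flatten the per-review word lists into one list of tokens
all_tokens = [word for review in toy_reviews for word in review]

# FreqDist counts occurrences; most_common(n) returns (word, count) pairs
freq = nltk.FreqDist(all_tokens)
top_two = [word for word, _count in freq.most_common(2)]
print(top_two)
```

Only the words are kept; the counts are discarded once the ranking has been made.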

In [269]:
# get 2000 most common words
most_common_2000 = most_common(train_df)

# inspect first 10 most common words
most_common_2000[0:10]

['one',
 'make',
 'like',
 'see',
 'get',
 'time',
 'good',
 'go',
 'watch',
 'character']

### 4. Create nltk featuresets from train/test

For the NLTK Naive Bayes classifier, each review must be turned into a featureset: a dict that records, for each of the most common words, whether the review contains it. These word-presence flags constitute the review's features.
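
To make the idea concrete, here is a minimal sketch on made-up data. The names `vocabulary` and `review_tokens` are hypothetical stand-ins for the list of most common words and a single processed review.

```python
# Hypothetical vocabulary and review, chosen only for illustration
vocabulary = ["good", "bad", "plot"]
review_tokens = ["good", "plot", "twist"]

# One boolean feature per vocabulary word; words outside the vocabulary
# ("twist" here) do not produce a feature at all
present = set(review_tokens)
featureset = {f"contains({word})": (word in present) for word in vocabulary}
print(featureset)
```

Every review therefore yields a dict with exactly the same keys, which is what lets the classifier compare reviews feature by feature.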

In [271]:
# for a given text, create a featureset (dict of features - {'word': True/False})
def review_features(review, most_common_words):
    '''
    Feature extractor that checks whether each of the most
    common words is present in a given review
    
    Parameters:
        review (list): text reviews as list of words
        most_common_words (list): list of n most common words
    Returns:
        features (dict): dict of most common words & corresponding True/False
    '''
    review_words = set(review)
    features = {}
    for word in most_common_words:
        features['contains(%s)' % word] = (word in review_words)
    return features

In [272]:
# create featureset for each text in a given dataframe
def make_set(df, most_common_words):
    '''
    Generates nltk featuresets for each movie review in dataframe.
    Feature sets are composed of a dict describing whether each of the most 
    common words is present in the text review or not

    Parameters:
        df (dataframe): processed dataframe of text reviews
        most_common_words (list): list of most common words
    Returns:
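        feature_set (list): list of (features, label) tuples, where features is a dict
                            of most common words & corresponding True/False
    '''
    # zip pairs each review with its label without relying on the dataframe index
    return [(review_features(text, most_common_words), label)
            for text, label in zip(df.text, df.sentiment)]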

In [273]:
# make data into featuresets (for nltk naive bayes classifier)
train_set = make_set(train_df, most_common_2000)
test_set = make_set(test_df, most_common_2000)

In [274]:
# inspect first train featureset
first_label = train_set[0][1]
first_featureset_first10 = list(train_set[0][0].items())[:10]
first_featureset_first10, first_label

([('contains(one)', False),
  ('contains(make)', True),
  ('contains(like)', True),
  ('contains(see)', False),
  ('contains(get)', False),
  ('contains(time)', False),
  ('contains(good)', True),
  ('contains(go)', False),
  ('contains(watch)', False),
  ('contains(character)', False)],
 'negative')

### 5. Train & evaluate model (naive bayes classifier)

In [275]:
# Train a naive bayes classifier with train set by nltk
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [276]:
# Get the accuracy of the naive bayes classifier with test set
accuracy = nltk.classify.accuracy(classifier, test_set)
accuracy

0.8332

In [220]:
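# build reference and observed index sets (for each label)
# NOTE: this loop runs over train_set, so the scores below measure fit to the
# training data; iterate over test_set instead for held-out precision/recall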
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
 
for i, (feats, label) in enumerate(train_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)


In [204]:
# print precision, recall, and f-measure
print('pos precision:', precision(refsets['positive'], testsets['positive']))
print('pos recall:', recall(refsets['positive'], testsets['positive']))
print('pos F-measure:', f_measure(refsets['positive'], testsets['positive']))
print('neg precision:', precision(refsets['negative'], testsets['negative']))
print('neg recall:', recall(refsets['negative'], testsets['negative']))
print('neg F-measure:', f_measure(refsets['negative'], testsets['negative']))


pos precision: 0.8739232576350823
pos recall: 0.9036437246963562
pos F-measure: 0.8885350318471338
neg precision: 0.902698282910875
neg recall: 0.8727272727272727
neg F-measure: 0.8874598070739551
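
The loop that produced these figures iterates over `train_set`, so they are not directly comparable with the test-set accuracy reported earlier. The sketch below shows the same bookkeeping applied to held-out data. It uses a tiny hypothetical dataset (`toy_train`, `toy_test`) so that it runs on its own; in the notebook the loop would run over `test_set` with the trained `classifier`.

```python
import collections

import nltk
from nltk.metrics.scores import precision, recall

# Tiny hypothetical featuresets, only to make the sketch runnable
toy_train = [
    ({"contains(good)": True, "contains(bad)": False}, "positive"),
    ({"contains(good)": True, "contains(bad)": False}, "positive"),
    ({"contains(good)": False, "contains(bad)": True}, "negative"),
    ({"contains(good)": False, "contains(bad)": True}, "negative"),
]
toy_test = [
    ({"contains(good)": True, "contains(bad)": False}, "positive"),
    ({"contains(good)": False, "contains(bad)": True}, "negative"),
]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Index sets per label: gold labels (reference) vs. predictions (observed)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(toy_test):  # held-out data, not the train set
    refsets[label].add(i)
    testsets[toy_classifier.classify(feats)].add(i)

for label in sorted(refsets):
    print(label,
          "precision:", precision(refsets[label], testsets[label]),
          "recall:", recall(refsets[label], testsets[label]))
```

Note that `precision` and `recall` from `nltk.metrics.scores` return `None` when the relevant set is empty, for example when the classifier never predicts a given label.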


In [176]:
# show top n most informative features
classifier.show_most_informative_features(20)

Most Informative Features
     contains(underrate) = True           positi : negati =     16.7 : 1.0
    contains(ridiculous) = True           negati : positi =     16.4 : 1.0
       contains(unfunny) = True           negati : positi =     13.3 : 1.0
        contains(unfold) = True           positi : negati =      9.6 : 1.0
   contains(wonderfully) = True           positi : negati =      8.6 : 1.0
         contains(appal) = True           negati : positi =      8.4 : 1.0
          contains(lame) = True           negati : positi =      8.0 : 1.0
          contains(dumb) = True           negati : positi =      7.9 : 1.0
         contains(awful) = True           negati : positi =      7.8 : 1.0
         contains(waste) = True           negati : positi =      7.3 : 1.0
     contains(laughable) = True           negati : positi =      7.3 : 1.0
         contains(worst) = True           negati : positi =      7.2 : 1.0
       contains(rubbish) = True           negati : positi =      7.1 : 1.0

We can see that positive reviews are more likely to contain words such as "underrate", "unfold", "wonderfully", or "subtle", while negative reviews are more likely to contain words such as "ridiculous", "unfunny", "waste", or "asleep". The listing above shows only 13 of the 20 requested features, so not every word mentioned here appears in it.

### 6. Make prediction

In [224]:
# predict on new review (from mubi.com)
new_review = (
    "Surprisingly effective and moving, The Balcony Movie takes the Front Up "
    "concept of talking to strangers, but here attaches it to a fixed perspective "
    "in order to create a strong sense of the stream of life passing us by. "
    "It's possible to not only witness the subtle changing of seasons "
    "but also the gradual opening of trust and confidence in Lozinski's "
    "repeating characters. A Pandemic movie, pre-pandemic. 3.5 stars"
)

In [225]:
# perform preprocessing (cleaning & featureset transformation)
processed_review = rm_custom_stops(preprocessing(new_review))
processed_review = review_features(processed_review, most_common_2000)

In [226]:
# predict label
classifier.classify(processed_review)

'positive'
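
`classify` returns only the winning label. If the class probabilities themselves are of interest, NLTK's `prob_classify` returns a probability distribution over the labels. The following is a minimal sketch on a hypothetical toy classifier (`toy_train` and `new_features` are made up for illustration); in the notebook the call would be `classifier.prob_classify(processed_review)`.

```python
import nltk

# Tiny hypothetical training data, only to make the sketch runnable
toy_train = [
    ({"contains(good)": True, "contains(bad)": False}, "positive"),
    ({"contains(good)": True, "contains(bad)": False}, "positive"),
    ({"contains(good)": False, "contains(bad)": True}, "negative"),
    ({"contains(good)": False, "contains(bad)": True}, "negative"),
]
toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Featureset of a hypothetical new review
new_features = {"contains(good)": True, "contains(bad)": False}

# prob_classify returns a distribution over labels rather than a single label
dist = toy_classifier.prob_classify(new_features)
for label in sorted(dist.samples()):
    print(label, round(dist.prob(label), 3))
print("predicted:", dist.max())
```

Because of the independence assumption, these probabilities tend to be pushed towards 0 or 1, so they are better read as a ranking signal than as calibrated confidence.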

In [263]:
# to get individual probability for each label and word, taken from:
# https://stackoverflow.com/questions/20773200/python-nltk-naive-bayes-probabilities

# show P(feature = value | label) for the 10 most informative features
# (note: _feature_probdist is a private NLTK attribute and may change between versions)
for label in classifier.labels():
    indv_probs = []
    for (word, fval) in classifier.most_informative_features(10):
        _prob = "{0:.2f}%".format(100*classifier._feature_probdist[label, word].prob(fval))
        indv_probs.append(f"{word}: {_prob}")
    print(pd.DataFrame({label: indv_probs}))

                       negative
0    contains(underrate): 0.12%
1   contains(ridiculous): 5.96%
2      contains(unfunny): 1.62%
3       contains(unfold): 0.20%
4  contains(wonderfully): 0.28%
5        contains(appal): 1.70%
6         contains(lame): 5.49%
7         contains(dumb): 3.52%
8       contains(awful): 10.39%
9       contains(waste): 12.68%
                       positive
0    contains(underrate): 1.98%
1   contains(ridiculous): 0.36%
2      contains(unfunny): 0.12%
3       contains(unfold): 1.90%
4  contains(wonderfully): 2.39%
5        contains(appal): 0.20%
6         contains(lame): 0.69%
7         contains(dumb): 0.44%
8        contains(awful): 1.33%
9        contains(waste): 1.74%
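
These per-label probabilities are what the ratios in the "Most Informative Features" listing are built from: each ratio is the larger of the two conditional probabilities divided by the smaller. As a rough check, the sketch below recomputes a few ratios from the rounded percentages printed above. The dictionary names are introduced here for illustration, and small differences from the listing (e.g. 16.5 versus 16.7 for "underrate") are expected because the inputs are rounded to two decimals.

```python
# Per-label probabilities copied from the rounded output above (in percent)
p_negative = {"underrate": 0.12, "ridiculous": 5.96, "awful": 10.39, "waste": 12.68}
p_positive = {"underrate": 1.98, "ridiculous": 0.36, "awful": 1.33, "waste": 1.74}

ratios = {}
for word in p_negative:
    larger = max(p_negative[word], p_positive[word])
    smaller = min(p_negative[word], p_positive[word])
    # Ratio of the larger to the smaller conditional probability
    ratios[word] = larger / smaller
    print(f"{word}: {ratios[word]:.1f} : 1.0")
```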
