# Sample Naïve Bayes text classifier for sentiment analysis

An example of a simple text classifier built with the NLTK natural language processing library. It uses the Naïve Bayes classifier that ships with NLTK and pickles the trained classifier for future use.

Info on NLTK: http://www.nltk.org/book/

The approach we take here is quite simple. 
1. We have a labeled dataset of movie reviews (positive and negative). 
2. We combine all the reviews and tokenize them into words. 
3. Then we pick the most frequently used words. Note that we do not stem or lemmatize, remove stopwords, or take grammar into consideration. Instead, we rely on the Naïve Bayes algorithm to determine which of the most common words are found mostly in positive or mostly in negative reviews. Words that appear about equally often in both kinds of review (like "the", "to", punctuation, etc.) then carry little weight in the decision.
4. Train Naïve Bayes to learn which of the commonly used words are more common in positive or in negative reviews. We use a training set (a subset of the reviews) for this.
5. Use the trained classifier to classify the rest of the documents. We use a testing set of reviews for this.
6. Evaluate the performance of the classifier.

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews

#### Create corpus 
This will be the list of all reviews with their category (positive or negative).

In [2]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
'''
## Same as:
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))
'''
[d[1] for d in documents[:10]]

['neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg']

In [3]:
## No need to re-shuffle the documents, we will re-shuffle the feature sets further on.
#random.shuffle(documents)
#[d[1] for d in documents[:10]]    

#### Select most common words in all reviews (positive and negative)

In [4]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

Convert all_words from a list to a frequency distribution.

In [5]:
all_words = nltk.FreqDist(all_words)
all_words.most_common(15)

[(',', 77717),
 ('the', 76529),
 ('.', 65876),
 ('a', 38106),
 ('and', 35576),
 ('of', 34123),
 ('to', 31937),
 ("'", 30585),
 ('is', 25195),
 ('in', 21822),
 ('s', 18513),
 ('"', 17612),
 ('it', 16107),
 ('that', 15924),
 ('-', 15595)]
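In NLTK 3, `FreqDist` is built on top of `collections.Counter`, so the counting step can be tried out without downloading the corpus. The following is a minimal sketch on a made-up token list; `tokens` is hypothetical data, not part of the notebook.

```python
from collections import Counter

# Toy token list standing in for movie_reviews.words() (hypothetical data).
tokens = ["The", "film", "is", "good", ",", "the", "plot", "is", "thin", "."]

# Lowercase first so that "The" and "the" are counted as the same word.
freq = Counter(t.lower() for t in tokens)

# most_common(n) returns (word, count) pairs, highest count first.
top_two = freq.most_common(2)
print(top_two)

# Looking up a word that never occurred returns 0 instead of raising KeyError.
print(freq["smart"])
```

The same `most_common()` and item-lookup calls are the ones used on `all_words` in this notebook.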

In [6]:
## See what we've got
all_words.elements()  ## Iterator that repeats each word as many times as it occurs
all_words.keys()  ## This list is not sorted
all_words["smart"]

190

Select only the most commonly used words (the top 3000).

In [7]:
word_features = [wordFreq[0] for wordFreq in all_words.most_common(3000)]
#word_features = list(all_words.keys())[:3000]  # takes random words, not the most common

print(word_features[:30])

[',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by']


For each unique word in each review, record whether it is in the set of commonly used words.


In [10]:
#%%timeit -r1 -n1

def find_features(document):
    words = set(document)  ## pick only unique words in the review
    features = {}
    for w in words:
        features[w] = (w in word_features)
    return features

#find_features(movie_reviews.words('pos/cv980_10953.txt'))
#find_features(movie_reviews.words('neg/cv974_24303.txt'))

feature_sets = [(find_features(review_words),category) for review_words, category in documents]


feature_sets is a list of (__dict__, __str__) tuples, where
- __dict__ maps each unique word in a review to True/False, depending on whether the word is in the commonly used set
- __str__ is the pos/neg category of the review
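Because `word_features` is a list, every `w in word_features` test scans up to 3000 entries. A possible variation, sketched below, takes the vocabulary as a `frozenset` so that each lookup is a hash lookup. The `vocabulary` parameter and the toy data are assumptions made for illustration; the resulting dictionary has the same shape as the one described above.

```python
def find_features(document, vocabulary):
    """Map each unique word in `document` to True if it is in `vocabulary`."""
    return {w: (w in vocabulary) for w in set(document)}


# Hypothetical stand-ins for word_features and for one tokenized review.
vocabulary = frozenset(["the", "film", "good", "bad"])
review = ["the", "film", "was", "good", "good"]

features = find_features(review, vocabulary)
print(features)
```

With the real data this would be called as `find_features(review_words, frozenset(word_features))`.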

In [11]:
feature_sets[0]

({'a': True,
  'that': True,
  'not': True,
  '10': True,
  'sagemiller': False,
  'girlfriend': True,
  'coming': True,
  'music': True,
  'touches': True,
  'stick': True,
  'looooot': False,
  'taken': True,
  'unravel': False,
  '&': True,
  'and': True,
  'want': True,
  'sure': True,
  'starts': True,
  'down': True,
  'entire': True,
  'drive': True,
  'good': True,
  'salvation': False,
  'need': True,
  've': True,
  'somewhere': True,
  'attempt': True,
  'kudos': False,
  'or': True,
  'critique': True,
  'is': True,
  'what': True,
  '8': True,
  'production': True,
  'craziness': False,
  'plain': True,
  'doesn': True,
  ')': True,
  'in': True,
  'us': True,
  'concept': True,
  'character': True,
  'crow': True,
  'echoes': False,
  'unraveling': False,
  'see': True,
  'more': True,
  'apparitions': False,
  'her': True,
  'sense': True,
  'couples': False,
  'holds': True,
  'always': True,
  'know': True,
  'drink': False,
  'new': True,
  'mean': True,
  '20': True,

In [12]:
type(feature_sets)

list

## Using Naïve Bayes to classify the text.

NB is a popular baseline method for text categorization. It is not limited to two categories, although in this example we only have two (pos/neg). The features used here are True/False flags per word rather than word frequencies.

In a nutshell: __posterior = prior * likelihood / evidence__
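To make the formula concrete, here is a small worked example for a single feature. All counts are invented, and the add-one smoothing is an assumption of this sketch; NLTK's own probability estimator may smooth differently.

```python
# Hypothetical training data: 100 positive and 100 negative reviews.
n_pos, n_neg = 100, 100
# Hypothetical number of reviews in each class that contain "outstanding".
with_word_pos, with_word_neg = 20, 2

# Prior: how common each class is before looking at any words.
prior_pos = n_pos / (n_pos + n_neg)
prior_neg = n_neg / (n_pos + n_neg)

# Likelihood: P(word present | class), with add-one smoothing (an assumption).
like_pos = (with_word_pos + 1) / (n_pos + 2)
like_neg = (with_word_neg + 1) / (n_neg + 2)

# Evidence: overall probability of seeing the word, summed over both classes.
evidence = prior_pos * like_pos + prior_neg * like_neg

posterior_pos = prior_pos * like_pos / evidence
posterior_neg = prior_neg * like_neg / evidence
print("P(pos | word) = {:.3f}".format(posterior_pos))
print("P(neg | word) = {:.3f}".format(posterior_neg))
```

With more than one feature, the naïve independence assumption means the per-feature likelihoods are simply multiplied together before normalizing by the evidence.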

Advantages: 
* needs a small set of training data to estimate parameters for classification
* scalable (linear)
* with proper pre-processing can be comparable with more complicated methods

Disadvantages:
* strong (naïve) independence assumptions between features
* often outperformed by other approaches (boosted trees, random forests)

#### Re-shuffle feature sets and select training and test sets

In [190]:
[d[1] for d in feature_sets[:10]]

['neg', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'neg']

In [191]:
random.shuffle(feature_sets)

[d[1] for d in feature_sets[:10]]

['pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg']

In [192]:
training_set = feature_sets[:1900]
testing_set = feature_sets[1900:]
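Since `random.shuffle` gives a different order on every run, the split above changes each time the notebook is executed. If a repeatable split is wanted, one option is a seeded `random.Random` instance, sketched here on toy data. The seed value, the `labeled` list and the one-in-twenty test fraction are assumptions for illustration.

```python
import random

# Hypothetical labeled items standing in for feature_sets.
labeled = [({"id": i}, "pos" if i % 2 else "neg") for i in range(20)]

rng = random.Random(42)   # fixed seed (arbitrary choice) for a repeatable order
shuffled = labeled[:]     # copy, so the original order is left untouched
rng.shuffle(shuffled)

# Hold out one item in twenty for testing, and train on the rest.
n_test = max(1, len(shuffled) // 20)
train, test = shuffled[:-n_test], shuffled[-n_test:]

print(len(train), len(test))
```

Slicing one shuffled list at a single index guarantees that no item ends up in both sets.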

In [193]:
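%%timeit -r1 -n1
## Note: names assigned inside a %%timeit cell are not kept in the notebook namespace.
## Run this cell once without %%timeit so that `classifier` is available in later cells.
classifier = nltk.NaiveBayesClassifier.train(training_set)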
classifier_accuracy = nltk.classify.accuracy(classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))

Classifier accuracy: 0.71
1.86 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [197]:
classifier.show_most_informative_features(15)

Most Informative Features
                  regard = False             pos : neg    =     11.9 : 1.0
                    slip = False             pos : neg    =     11.9 : 1.0
                  avoids = False             pos : neg    =     11.9 : 1.0
             fascination = False             pos : neg    =     11.2 : 1.0
                    3000 = False             neg : pos    =     10.8 : 1.0
             outstanding = True              pos : neg    =     10.6 : 1.0
                  hatred = False             pos : neg    =     10.5 : 1.0
                seamless = False             pos : neg    =     10.5 : 1.0
              astounding = False             pos : neg    =     10.5 : 1.0
                   sucks = False             neg : pos    =     10.4 : 1.0
                  hudson = False             neg : pos    =     10.2 : 1.0
               insulting = False             neg : pos    =     10.0 : 1.0
               ludicrous = False             neg : pos    =     10.0 : 1.0
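Each row of this table can be read as a likelihood ratio: how much more probable a feature value is under one label than under the other. The sketch below computes such a ratio from raw counts. The function name, the counts and the smoothing constant `alpha` are all hypothetical, so the numbers are not expected to match the output above.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, alpha=0.5):
    """Return P(feature value | label a) / P(feature value | label b).

    `alpha` is a smoothing constant that keeps the ratio finite when a
    count is zero; 0.5 is an assumption made for this sketch.
    """
    p_a = (count_a + alpha) / (total_a + 2 * alpha)
    p_b = (count_b + alpha) / (total_b + 2 * alpha)
    return p_a / p_b


# Hypothetical counts: the feature value is seen in 53 of 950 positive
# training reviews and in 5 of 950 negative ones.
ratio = likelihood_ratio(53, 950, 5, 950)
print("pos : neg = {:.1f} : 1.0".format(ratio))
```

A large ratio means the feature value is strong evidence for one label, which is why such features are listed first.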

## Save a trained classifier for future use using Pickle

In [202]:
### Skip this step if you are running the notebook for a second (or later) time and want to compare the new classifier with the pickled one
import pickle

with open("nb_txt_classifier.pickle", 'wb') as pickle_file:
    pickle.dump(classifier, pickle_file)


In [204]:
## See if the pickle file was saved to disk:
%ll

total 32616
-rw-r--r--  1 korolo  10513     1069 10 Jul 16:51 LICENSE
-rw-r--r--  1 korolo  10513      108 10 Jul 16:51 README.md
-rw-r--r--  1 korolo  10513  7868888 13 Jul 15:48 nb_txt_classifier.pickle
-rw-r--r--  1 korolo  10513   914288 13 Jul 10:32 nltk.ipynb
-rw-r--r--  1 korolo  10513    25219 13 Jul 15:48 txt_classifier.ipynb
-rw-r--r--  1 korolo  10513  7868888 13 Jul 15:47 txt_classifier.pickle


### Re-using a pickled classifier

Note that in this particular example, the newly reshuffled testing set will likely contain reviews that were part of the pickled classifier's training set. So the pickled classifier will almost always appear to perform better here than the fresh classifier above.
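One way to avoid this overlap would be to pickle the held-out split together with the classifier, so that a later run evaluates on data the saved model never saw. The following is a minimal sketch with stand-in objects; the dictionary keys, the file name and the toy `model` are assumptions, and any picklable classifier could take its place.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for a trained classifier and its held-out test set.
model = {"kind": "toy-model"}
held_out = [({"good": True}, "pos"), ({"bad": True}, "neg")]

# Write to a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "bundle.pickle")

# Store the model together with the data it was NOT trained on.
with open(path, "wb") as f:
    pickle.dump({"classifier": model, "testing_set": held_out}, f)

# Later, or in another session: load both back in one step.
with open(path, "rb") as f:
    bundle = pickle.load(f)

restored_model = bundle["classifier"]
restored_test = bundle["testing_set"]
print(restored_model, len(restored_test))
```

Evaluating the restored classifier on `restored_test` rather than on a freshly shuffled split keeps the accuracy figure comparable between runs.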

In [199]:
with open("nb_txt_classifier.pickle", 'rb') as pickled_classifier:
    classifier_p = pickle.load(pickled_classifier)

In [205]:
%%timeit -r1 -n1
classifier_accuracy = nltk.classify.accuracy(classifier_p, testing_set)
print("Pickled Classifier accuracy: {}".format(classifier_accuracy))

Pickled Classifier accuracy: 0.95
166 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
