<li>Now that we're comfortable with NLTK, let's try to tackle text classification. The goal with text classification can be pretty broad. Maybe we're trying to classify text as about politics or the military. Maybe we're trying to classify it by the gender of the author who wrote it. A fairly popular text classification task is to identify a body of text as either spam or not spam, for things like email filters. In our case, we're going to try to create a sentiment analysis algorithm</li>

In [19]:
import nltk
import random
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB,BernoulliNB,GaussianNB
from nltk.classify.scikitlearn import SklearnClassifier

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# #similar approach
# documents = []
# for category in movie_reviews.categories():
#     for fileid in movie_reviews.fileid(category):
#         documents.append(list(movie_reviews.words(fileid) , category))

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

#Most popular words
all_words = nltk.FreqDist(all_words)

#print(all_words["stupid"])

#We have frequncy distribution of all words from most-least common words

#since we have lots of words we train the model using some of them
word_features = list(all_words.keys())[:3000]
#in all_words we have key=word and value=#
#we train against top 3000 words

In [2]:
#finfing features within the documents we are using
def find_features(document):
    #we convert list to set to get only a unique iteration of any word
    words = set(document)
    #empty dictionary
    features = {}
    #going through top 3000 top words
    for w in word_features:
        #the key is boolean value for the words
        #true if one of this top 3000 is in document 
        #false otherwise
        features[w] = (w in words)

    return features

In [3]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [4]:
# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll test against.
testing_set = featuresets[1900:]

In [5]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [6]:
classifier.show_most_informative_features(15)

Most Informative Features
                   sucks = True              neg : pos    =      9.3 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
                 frances = True              pos : neg    =      8.4 : 1.0
           unimaginative = True              neg : pos    =      8.2 : 1.0
                  regard = True              pos : neg    =      7.1 : 1.0
                 idiotic = True              neg : pos    =      6.9 : 1.0
             silverstone = True              neg : pos    =      6.9 : 1.0
               atrocious = True              neg : pos    =      6.9 : 1.0
              schumacher = True              neg : pos    =      6.9 : 1.0
                  turkey = True              neg : pos    =      6.7 : 1.0
                obstacle = True              pos : neg    =      6.4 : 1.0
                 singers = True              pos : neg    =      6.4 : 1.0
                  shoddy = True              neg : pos    =      6.3 : 1.0

In [10]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MultinomialNB accuracy percent:",nltk.classify.accuracy(MNB_classifier, testing_set))

MultinomialNB accuracy percent: 0.87


In [11]:
BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BernoulliNB accuracy percent:",nltk.classify.accuracy(BNB_classifier, testing_set))

BernoulliNB accuracy percent: 0.84


In [20]:
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)


Original Naive Bayes Algo accuracy percent: 84.0
Most Informative Features
                   sucks = True              neg : pos    =      9.3 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
                 frances = True              pos : neg    =      8.4 : 1.0
           unimaginative = True              neg : pos    =      8.2 : 1.0
                  regard = True              pos : neg    =      7.1 : 1.0
                 idiotic = True              neg : pos    =      6.9 : 1.0
             silverstone = True              neg : pos    =      6.9 : 1.0
               atrocious = True              neg : pos    =      6.9 : 1.0
              schumacher = True              neg : pos    =      6.9 : 1.0
                  turkey = True              neg : pos    =      6.7 : 1.0
                obstacle = True              pos : neg    =      6.4 : 1.0
                 singers = True              pos : neg    =      6.4 : 1.0
                  shoddy 