Consider the following phrases:

"Titanic is a great movie."


"Titanic is not a great movie."


"Titanic is a movie."


The phrases correspond to short movie reviews, and each one conveys a different sentiment.


For example, the first phrase denotes positive sentiment about the film Titanic while the second one treats the movie as not so great (negative sentiment). 


Take a look at the third one more closely. 


No word in that phrase tells you anything about the sentiment it conveys.


Hence, that is an example of neutral sentiment.


Now, from a strict machine learning point of view, this task is nothing but a supervised learning task. 


You will supply a bunch of phrases (with the labels of their respective sentiments) to the machine learning model, and you will test the model on unlabeled phrases.


Take a look at this review:

<img src='../img/sen1.JPG'>


The next step which seems natural is to create a representation similar to the following:


<img src='../img/sen2.JPG'>


The above representation is nothing but a Bag-of-words representation. 


This is probably one of the most fundamental concepts in NLP and is the first step in any text classification problem.


A bag-of-words representation of a document contains not just selected words but all the unique words in the document, together with their frequencies of occurrence.


The words in the bag form a mathematical set, so each unique word appears only once, paired with its count.
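As a rough illustration, the idea can be sketched with Python's standard `collections.Counter`. The helper name `bag_of_words` and the very naive tokenization (lowercasing, dropping periods, splitting on whitespace) are assumptions made for this example only.

```python
from collections import Counter

def bag_of_words(text):
    """Return a mapping from each unique lowercased word to its count."""
    # Naive tokenization (an assumption): drop periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    return Counter(tokens)

bag = bag_of_words("Titanic is a great movie.")
print(sorted(bag))    # the unique words in the review
print(bag["great"])   # how often "great" occurs in the review
```

Each unique word is stored once as a key, and repeated words only increase the count stored next to it.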


The words in the bag-of-words now form the feature set of your document.


So, suppose you have a collection of many movie reviews (documents), and you have created a bag-of-words representation for each one of them and preserved their labels (i.e., sentiments: +ve or -ve in this case).


Your training set should look like:

<img src='../img/sen3.JPG'>


The collection of documents behind this representation is also known as a corpus.



# Naive Bayes classification for sentiment analysis


Naive Bayes classification is nothing but applying Bayes' rule to form classification probabilities.

Let's first build up the general terms of a Naive Bayes classifier in the context of sentiment classification.

### Bayes rule:

<img src='../img/sen4.JPG'>

In this case, the class comprises two sentiments. 
* Positive 
* Negative

Let's study each term of the above image in detail in this context.

The LHS term P(c|d) is read as the probability of class c given a document d. This term is also known as the Posterior.
P(d|c) is read the same way: the probability of document d given class c.


The term shown as Prior, P(c), is your original belief, i.e., the probability of a document being positive or negative (in terms of sentiment) before you look at its contents.


The term Likelihood is the probability of a document d given a class c.


Now think of the term Posterior as your updated belief, obtained by multiplying Prior and Likelihood.


But what is Normalization Constant P(d)? 


The product of Prior and Likelihood is divided by this term so that the outcome is a valid probability distribution, i.e., the posteriors over all classes sum to 1.
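To make the roles of the four terms concrete, here is a small numeric sketch of Bayes' rule for the two sentiment classes. The prior and likelihood values are invented purely for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule, P(c|d) = P(d|c) * P(c) / P(d), for every class c."""
    # Numerator of Bayes' rule for each class: Likelihood * Prior.
    unnormalized = {c: likelihoods[c] * priors[c] for c in priors}
    # P(d), the normalization constant, is the sum of the numerators.
    evidence = sum(unnormalized.values())
    return {c: value / evidence for c, value in unnormalized.items()}

# Made-up numbers: both classes equally likely a priori, and the document
# is four times more probable under the positive class.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {"pos": 0.08, "neg": 0.02}

result = posterior(priors, likelihoods)
print(result)  # roughly 0.8 for "pos" and 0.2 for "neg"
```

Without the division by P(d), the two numerators (0.04 and 0.01) would not sum to 1, which is why the normalization step is needed.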


# Importing data

In [1]:
# Load and prepare the dataset
import nltk
from nltk.corpus import movie_reviews
import random

nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\VIPUL.GAUR\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [2]:

documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

# Creating a Feature Extractor

To limit the number of features that the classifier needs to process, you start by constructing a list of the 2000 most frequent words in the overall corpus.



In [3]:
# Define the feature extractor

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
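# most_common() returns (word, count) pairs sorted by decreasing count,
# so this keeps the 2000 most frequent words
word_features = [word for word, _ in all_words.most_common(2000)]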

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
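As a quick check of what the extractor returns, here is a self-contained sketch on a toy vocabulary. Unlike the cell above, this variant takes the vocabulary as a second argument so that it runs without the NLTK corpus; `toy_vocabulary` and `toy_review` are made up for illustration.

```python
def document_features(document, word_features):
    """Map a tokenized document to boolean 'contains(word)' features."""
    document_words = set(document)  # a set makes each membership test cheap
    return {
        "contains({})".format(word): (word in document_words)
        for word in word_features
    }

toy_vocabulary = ["great", "boring", "movie"]         # hypothetical vocabulary
toy_review = "titanic is a great movie".split()       # hypothetical review

features = document_features(toy_review, toy_vocabulary)
print(features)
# {'contains(great)': True, 'contains(boring)': False, 'contains(movie)': True}
```

Note that the feature dictionary always has one entry per vocabulary word, regardless of how long the review is.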


# Classification Task


Now that we've defined the feature extractor, we can use it to train a Naive Bayes classifier to predict the sentiments of new movie reviews.


To check your classifier's performance, you will compute its accuracy on the test set. 


NLTK provides show_most_informative_features() to see which features the classifier found to be most informative.



In [4]:
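# Train Naive Bayes classifier
# Turn every (words, label) pair into a (feature dict, label) pair
featuresets = [(document_features(d), c) for (d, c) in documents]
# Hold out the first 100 shuffled reviews for testing and train on the rest
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)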

In [5]:
# Test the classifier
print(nltk.classify.accuracy(classifier, test_set))


0.78


# 78% accuracy without any hyperparameter tuning seems to be great
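To make the training and accuracy steps less of a black box, the sketch below implements a tiny Naive Bayes classifier from scratch on made-up data. It is not NLTK's implementation: it works on word counts with add-one smoothing (an assumption chosen for simplicity), and `train_naive_bayes`, `classify`, and `accuracy` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (words, label) pairs."""
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, words):
    """Return the class with the highest log(Prior) + sum of log(Likelihood)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log P(c)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Add-one smoothed estimate of P(word | c)
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

def accuracy(model, labeled_docs):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(classify(model, words) == label for words, label in labeled_docs)
    return correct / len(labeled_docs)

# Made-up training and test reviews
train_docs = [
    ("a great movie".split(), "pos"),
    ("great story".split(), "pos"),
    ("a boring movie".split(), "neg"),
    ("boring unimaginative story".split(), "neg"),
]
test_docs = [
    ("great movie".split(), "pos"),
    ("boring story".split(), "neg"),
]

model = train_naive_bayes(train_docs)
print(accuracy(model, test_docs))
```

The normalization constant P(d) is left out here because it is the same for every class and therefore does not change which class scores highest.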

In [6]:
# Show the most important features as interpreted by Naive Bayes
classifier.show_most_informative_features(5)

Most Informative Features
 contains(unimaginative) = True              neg : pos    =      8.3 : 1.0
        contains(suvari) = True              neg : pos    =      7.0 : 1.0
          contains(mena) = True              neg : pos    =      7.0 : 1.0
       contains(martian) = True              neg : pos    =      7.0 : 1.0
    contains(schumacher) = True              neg : pos    =      7.0 : 1.0
