Sentiment analysis is the process of determining the sentiment expressed in a piece of text. For example, it can be used to determine whether a movie review is positive or negative, and more categories can be added as needed. This technique gives a sense of how people feel about a product, brand, or topic. It is frequently used to analyze marketing campaigns, opinion polls, social media presence, product reviews on e-commerce sites, and so on.

A Naive Bayes classifier will be used to build this sentiment analyzer. The first step is to extract all the unique words from the text. The NLTK classifier needs this data arranged in the form of a dictionary so that it can ingest it. Once the text data is divided into training and testing datasets, the Naive Bayes classifier is trained to classify the reviews as positive or negative. Afterward, the most informative words for indicating positive and negative reviews can be calculated and displayed. This information is interesting because it shows which words are being used to denote various reactions.
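
To make the idea concrete before turning to NLTK, here is a minimal sketch of a Naive Bayes classifier over word-presence features, written in plain Python. The training data is made up for illustration, the `posterior` helper is hypothetical, and add-one smoothing is an assumption chosen for simplicity; it is not necessarily the estimator NLTK uses internally.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-made training set (hypothetical data, not from the corpus)
train = [
    ({'great': True, 'fun': True}, 'Positive'),
    ({'great': True, 'moving': True}, 'Positive'),
    ({'boring': True, 'weak': True}, 'Negative'),
    ({'boring': True, 'fun': True}, 'Negative'),
]

# Count documents per label and, per label, documents containing each word
label_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for features, label in train:
    for word in features:
        word_counts[label][word] += 1

def posterior(words):
    """Return P(label | words) using add-one smoothing on word presence."""
    log_scores = {}
    for label, n_docs in label_counts.items():
        # Start from the log prior P(label)
        score = math.log(n_docs / len(train))
        for word in words:
            # Smoothed P(word present | label) over the outcomes present/absent
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        log_scores[label] = score
    # Normalize so the probabilities sum to 1
    total = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / total for label, s in log_scores.items()}

probs = posterior(['great', 'fun'])
print(max(probs, key=probs.get), probs)
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word probabilities are simply multiplied (added in log space).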

In [1]:
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

In [2]:
# Define a function to construct a dictionary object based on the input words and return it
def extract_features(words):
    return {word: True for word in words}
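
As a quick illustration of the dictionary format, the following sketch applies the same logic to a short sentence. The sample sentence is made up for illustration. Each unique word becomes a key with the value True, so repeated words collapse into a single entry.

```python
def extract_features(words):
    # Same logic as the function above: map every word to True
    return dict([(word, True) for word in words])

# Hypothetical sample input; 'a' and 'good' each appear twice
sample_words = 'a good movie with a good cast'.split()
features = extract_features(sample_words)
print(features)
# {'a': True, 'good': True, 'movie': True, 'with': True, 'cast': True}
```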

In [4]:
nltk.download('movie_reviews')
# Load the reviews from corpus
fileids_pos = movie_reviews.fileids('pos')
fileids_neg = movie_reviews.fileids('neg')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


In [5]:
# Extract the features from the reviews
features_pos = [(extract_features(movie_reviews.words(
        fileids=[f])), 'Positive') for f in fileids_pos]
features_neg = [(extract_features(movie_reviews.words(
        fileids=[f])), 'Negative') for f in fileids_neg]

The next step is to define the split between training and testing. In this project we will allocate 80% for training and 20% for testing.

In [6]:
# Define the train and test split (80% and 20%)
threshold = 0.8
num_pos = int(threshold * len(features_pos))
num_neg = int(threshold * len(features_neg))

In [7]:
# Separate the feature vectors for training and testing
features_train = features_pos[:num_pos] + features_neg[:num_neg]
features_test = features_pos[num_pos:] + features_neg[num_neg:]

In [8]:
# Print the number of datapoints used
print('\nNumber of training datapoints:', len(features_train)) # len of the training dataset
print('Number of test datapoints:', len(features_test))


Number of training datapoints: 1600
Number of test datapoints: 400
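
The split above takes the first 80% of each class in file order. If the order of the files could matter, one possible variation is to shuffle each class with a fixed seed before slicing. The sketch below is an assumption for illustration, not part of the original workflow: the helper name `split_shuffled`, the seed value, and the placeholder data are all hypothetical.

```python
import random

def split_shuffled(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items, then split it into train and test parts."""
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle reproducible
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for one class's feature list (1000 placeholder items)
dummy_items = list(range(1000))
train_part, test_part = split_shuffled(dummy_items)
print(len(train_part), len(test_part))  # 800 200
```

Applying such a helper to each class separately would keep the 50/50 class balance in both parts, just as the slicing above does.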


Next, we need to train a NaiveBayesClassifier using the training data and compute its accuracy using the built-in accuracy function available in NLTK.

In [9]:
classifier = NaiveBayesClassifier.train(features_train)
print('\n Accuracy of the classifier:', nltk_accuracy(classifier, features_test))


 Accuracy of the classifier: 0.735
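
Accuracy here is simply the fraction of test examples whose predicted label matches the true label, so 0.735 on 400 test reviews corresponds to 294 correct predictions. The sketch below computes the same quantity by hand. The names `manual_accuracy` and `toy_predict` and the toy data are hypothetical stand-ins so that the example runs without the corpus.

```python
def manual_accuracy(predict, labeled_examples):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

# Hypothetical stand-in classifier: 'Positive' whenever 'great' is present
def toy_predict(features):
    return 'Positive' if 'great' in features else 'Negative'

# Made-up labeled examples in the same (features, label) format
toy_test = [
    ({'great': True}, 'Positive'),
    ({'weak': True}, 'Negative'),
    ({'great': True, 'weak': True}, 'Negative'),
    ({'fine': True}, 'Positive'),
]
print(manual_accuracy(toy_predict, toy_test))  # 0.5
```

With the trained classifier, passing `classifier.classify` as the `predict` argument together with `features_test` should give the same kind of result.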


In [10]:
# Print the top N most informative words
N = 15
print('\nTop ' + str(N) + ' most informative words:')
for i, (word, _) in enumerate(classifier.most_informative_features(N), start=1):
    print(str(i) + '. ' + word)


Top 15 most informative words:
1. outstanding
2. insulting
3. vulnerable
4. ludicrous
5. uninvolving
6. astounding
7. avoids
8. fascination
9. affecting
10. animators
11. anna
12. darker
13. seagal
14. symbol
15. idiotic
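
NLTK ranks a feature by how unevenly it is distributed across the labels: the highest value of P(feature | label) divided by the lowest. The sketch below reproduces that ratio on made-up numbers. The probabilities and the `informativeness` helper are hypothetical and are not taken from the trained classifier.

```python
# Hypothetical P(word present | label) values, for illustration only
word_probs = {
    'outstanding': {'Positive': 0.12, 'Negative': 0.01},
    'movie':       {'Positive': 0.90, 'Negative': 0.88},
    'idiotic':     {'Positive': 0.01, 'Negative': 0.08},
}

def informativeness(probs_by_label):
    """Ratio of the largest to the smallest conditional probability."""
    return max(probs_by_label.values()) / min(probs_by_label.values())

# Sort words from most to least informative
ranked = sorted(word_probs,
                key=lambda w: informativeness(word_probs[w]),
                reverse=True)
print(ranked)  # ['outstanding', 'idiotic', 'movie']
```

A common word such as 'movie' appears about equally often in both classes, so its ratio stays close to 1 and it ranks last. To see the ratios for the real classifier, `classifier.show_most_informative_features(N)` prints them alongside each word.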


In [11]:
# Define the sample sentences to be used for testing
input_reviews = [
        'The costumes in this movie were great', 
        'I think the story was terrible and the characters were very weak',
        'People say that the director of the movie is amazing', 
        'This is such an idiotic movie. I will not recommend it to anyone.' 
    ]


In [12]:
# Iterate through the sample data and print each review
print("\n Movie review predictions: ")
for review in input_reviews:
    print("\nReview: ", review)


 Movie review predictions: 

Review:  The costumes in this movie were great

Review:  I think the story was terrible and the characters were very weak

Review:  People say that the director of the movie is amazing

Review:  This is such an idiotic movie. I will not recommend it to anyone.


In [13]:
# Classify the last sample review explicitly instead of relying on the leftover loop variable
review = input_reviews[-1]
# Compute the probability for each class
probabilities = classifier.prob_classify(extract_features(review.split()))
# Pick the class with the maximum probability
predicted_sentiment = probabilities.max()
# Print the predicted output class (positive or negative sentiment)
print("Predicted sentiment: ", predicted_sentiment)
print('Probabilities: ', round(probabilities.prob(predicted_sentiment), 2))

Predicted sentiment:  Negative
Probabilities:  0.87


In [15]:
for review in input_reviews:
    print("\nReview:", review)

    # Compute the probabilities
    probabilities = classifier.prob_classify(extract_features(review.split()))

    # Pick the maximum value
    predicted_sentiment = probabilities.max()

    # Print outputs
    print("Predicted sentiment:", predicted_sentiment)
    print("Probability:", round(probabilities.prob(predicted_sentiment), 2))




Review: The costumes in this movie were great
Predicted sentiment: Positive
Probability: 0.59

Review: I think the story was terrible and the characters were very weak
Predicted sentiment: Negative
Probability: 0.8

Review: People say that the director of the movie is amazing
Predicted sentiment: Positive
Probability: 0.6

Review: This is such an idiotic movie. I will not recommend it to anyone.
Predicted sentiment: Negative
Probability: 0.87
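
One detail worth noting is that `review.split()` keeps capitalization and attached punctuation, so a token such as 'movie.' is treated as a different word from 'movie'. The sketch below shows one possible way to normalize input, assuming the classifier's vocabulary is lowercase with punctuation separated from words. The helper name `simple_tokenize` is hypothetical and is not part of NLTK.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

sample = 'This is such an idiotic movie. I will not recommend it to anyone.'
# Compare the sixth token from a plain split with the normalized version
print(sample.split()[5], '->', simple_tokenize(sample)[5])  # movie. -> movie
```

Passing `simple_tokenize(review)` instead of `review.split()` to `extract_features` may change the reported probabilities, since more tokens would match words seen during training.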


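As the results above show, the predictions match what we would intuitively expect for each review, although the two positive predictions carry lower confidence (0.59 and 0.6) than the two negative ones (0.8 and 0.87).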