##### Sentiment Analysis
Sentiment analysis is the process of determining the sentiment expressed in a given piece of text. For example, it can be used to determine whether a movie review is positive or negative. It is frequently used to analyze marketing campaigns, opinion polls, social media presence, product reviews on e-commerce sites, and so on.

We will use a Naive Bayes classifier for this task. We first need to extract all the unique words from the text. The NLTK classifier needs this data to be arranged in the form of a dictionary so that it can ingest it. Once we divide the text data into training and testing datasets, we will train the Naive Bayes classifier to classify the reviews as positive or negative. We will also print the most informative words for telling positive and negative reviews apart. This information is interesting because it tells us which words are being used to denote various reactions.
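
To make the classification step concrete, here is a minimal sketch of the idea behind a Naive Bayes classifier over word-presence features. It is a simplified stand-in, not NLTK's implementation: the function names `train_naive_bayes` and `classify`, the add-one smoothing, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_features, alpha=1.0):
    """Count, per label, how many documents contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for features, label in labeled_features:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
    return label_counts, word_counts, alpha

def classify(model, features):
    """Return a dict mapping each label to its normalized probability."""
    label_counts, word_counts, alpha = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        # Start from the log prior P(label)
        score = math.log(n_docs / total_docs)
        for word in features:
            # Smoothed estimate of P(word is present | label)
            p = (word_counts[label][word] + alpha) / (n_docs + 2 * alpha)
            score += math.log(p)
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1
    max_score = max(log_scores.values())
    exp_scores = {l: math.exp(s - max_score) for l, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {l: v / total for l, v in exp_scores.items()}

# Hypothetical toy data in the same (features, label) shape used in this section
train = [
    ({'great': True, 'fun': True}, 'Positive'),
    ({'great': True, 'moving': True}, 'Positive'),
    ({'dull': True, 'terrible': True}, 'Negative'),
    ({'terrible': True, 'plot': True}, 'Negative'),
]
model = train_naive_bayes(train)
probs = classify(model, {'terrible': True, 'plot': True})
print(max(probs, key=probs.get), probs)
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log probabilities can simply be added together.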

In [None]:
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

In [3]:
# Extract features from the input list of words:
# each unique word becomes a key mapped to True
def extract_features(words):
    return {word: True for word in words}
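
As a quick check of what this helper produces, the snippet below runs it on a short, made-up list of tokens (the sample words are hypothetical and not taken from the corpus). Repeated words collapse into a single key, so the dictionary records only whether a word is present, not how often it occurs.

```python
# Same helper as in the cell above, repeated so this snippet runs on its own
def extract_features(words):
    return {word: True for word in words}

# Hypothetical sample tokens, not taken from the movie_reviews corpus
sample_words = ['a', 'great', 'movie', 'with', 'a', 'great', 'cast']
features = extract_features(sample_words)

print(features)
print(len(sample_words), 'tokens ->', len(features), 'unique features')
```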

In [24]:
# Define the main function and load the labeled movie_reviews
if __name__ == "__main__":
    # Load the reviews from the corpus
    fileids_pos = movie_reviews.fileids('pos')
    fileids_neg = movie_reviews.fileids('neg')
    
    # Extract the features from the reviews
    features_pos = [(extract_features(movie_reviews.words(fileids = [f])), 'Positive') for f in fileids_pos]
    features_neg = [(extract_features(movie_reviews.words(fileids = [f])), 'Negative') for f in fileids_neg]
    
    # Define the train and test split (80% and 20%)
    threshold = 0.8
    num_pos = int(threshold * len(features_pos))
    num_neg = int(threshold * len(features_neg))
    
    # Create training and testing datasets
    features_train = features_pos[:num_pos] + features_neg[:num_neg]
    features_test = features_pos[num_pos:] + features_neg[num_neg:]
    
    # Print the number of datapoints used
    print('\nNumber of training datapoints: ', len(features_train))
    print('\nNumber of test datapoints: ', len(features_test))
    
    # Train a Naive Bayes Classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print('\nAccuracy of the classifier: ', nltk_accuracy(classifier, features_test))
    
    # Print the top N most informative words
    N = 15
    print('\nTop ' + str(N) + ' most informative words:')
    for i, item in enumerate(classifier.most_informative_features(N)):
        print(str(i + 1) + '. ' + item[0])
    
    # Test input movie reviews
    input_reviews = [
        'I think the story was terrible and the characters were very weak',
        'People say that the director of the movie is amazing',
        'This is such an idiotic movie. I will not recommend it to anyone.'
    ]
    
    # Iterate through the test input movie reviews
    print('\nMovie review predictions:')
    for review in input_reviews:
        print('\nReview:', review)
        
        # Compute the probabilities
        probabilities = classifier.prob_classify(extract_features(review.split())) # To break the reviews into words
        
        # Pick the maximum value
        predicted_sentiment = probabilities.max()
        
        # Print outputs
        print("Predicted sentiment", predicted_sentiment)
        print("Probability: ", round(probabilities.prob(predicted_sentiment), 2))


Number of training datapoints:  1600

Number of test datapoints:  400

Accuracy of the classifier:  0.735

Top 15 most informative words:
1. outstanding
2. insulting
3. vulnerable
4. ludicrous
5. uninvolving
6. astounding
7. avoids
8. fascination
9. animators
10. darker
11. anna
12. symbol
13. seagal
14. affecting
15. idiotic

Movie review predictions:

Review: I think the story was terrible and the characters were very weak
Predicted sentiment Negative
Probability:  0.8

Review: People say that the director of the movie is amazing
Predicted sentiment Positive
Probability:  0.6

Review: This is such an idiotic movie. I will not recommend it to anyone.
Predicted sentiment Negative
Probability:  0.87
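
One thing worth noting about the prediction step: `review.split()` keeps capitalization and leaves punctuation attached to words (for example `'movie.'`), so some tokens may not match any word the classifier saw during training. The sketch below shows one possible alternative, assuming the corpus tokens are lowercase with punctuation split off as separate tokens (this can be checked by printing `movie_reviews.words()[:20]`). The `tokenize` helper is a hypothetical name introduced here, not part of NLTK.

```python
import re

def tokenize(review):
    # Lowercase the text, then pull out runs of word characters
    # and individual punctuation marks as separate tokens
    return re.findall(r"\w+|[^\w\s]", review.lower())

review = 'This is such an idiotic movie. I will not recommend it to anyone.'

print(review.split())    # whitespace split: case and punctuation preserved
print(tokenize(review))  # lowercased, punctuation separated
```

Passing `extract_features(tokenize(review))` to `prob_classify` instead of `extract_features(review.split())` would likely change the reported probabilities, so the figures above apply only to the whitespace-split version.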
