<h1 style="font-family: Trebuchet MS; padding: 20px; font-size: 25px; text-align: center; line-height: 0.75;background-color:#00A693"><b>Movie Reviews Sentiment Analysis
</b><br></h1>

In [72]:
# Import necessary dependencies and packages
# (the unused spaCy and duplicate punctuation imports have been dropped)

import random
import string
import warnings

import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


In [73]:
# Combine NLTK's English stopwords with ASCII punctuation into one set
# (a set gives constant-time membership checks in the filtering loop below)
stopwords = set(nltk.corpus.stopwords.words('english')) | set(string.punctuation)


In [74]:
# Count every token in the corpus that is neither a stopword nor punctuation
all_words = nltk.FreqDist(
    w for w in movie_reviews.words() if w.lower() not in stopwords
)

# Display the frequency distribution
all_words


FreqDist({'film': 9517, 'one': 5852, 'movie': 5771, 'like': 3690, 'even': 2565, 'good': 2411, 'time': 2411, 'story': 2169, 'would': 2109, 'much': 2049, ...})

In [75]:
movie_reviews.words() 

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

In [76]:
print(all_words.most_common(5))
# Print the five most common words and their frequencies from the movie_reviews dataset


[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565)]
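
The filter-then-count step above can be sketched on a toy token list using only the standard library. Here `collections.Counter` stands in for `nltk.FreqDist` (which subclasses it), and the tokens and stopword set are made-up illustrations rather than corpus data.

```python
from collections import Counter
import string

# Hypothetical toy data: a tiny token stream and stopword list
tokens = ["the", "film", "is", "good", ",", "a", "good", "film", "!", "film"]
toy_stopwords = {"the", "is", "a"} | set(string.punctuation)

# Keep only tokens that are neither stopwords nor punctuation, then count them
counts = Counter(t for t in tokens if t.lower() not in toy_stopwords)

print(counts.most_common(2))  # [('film', 3), ('good', 2)]
```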


In [77]:
# Use the 2,000 most frequent words as the feature vocabulary.
# Building the list in a single expression avoids appending duplicates
# if this cell is run more than once.
word_features = [word for word, _ in all_words.most_common(2000)]

# Display the first 20 feature words
word_features[:20]


['film',
 'one',
 'movie',
 'like',
 'even',
 'good',
 'time',
 'story',
 'would',
 'much',
 'character',
 'also',
 'get',
 'two',
 'well',
 'characters',
 'first',
 '--',
 'see',
 'way']
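
The token `'--'` survives the stopword filter because `string.punctuation` only contains single characters. A possible way to catch such tokens, sketched here with a hypothetical helper `is_punctuation` that is not part of the notebook, is to test whether every character of a token is punctuation:

```python
import string

punct = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and consists only of ASCII punctuation."""
    return len(token) > 0 and all(ch in punct for ch in token)

print("--" in punct)                # False: the set only holds single characters
print(is_punctuation("--"))         # True
print(is_punctuation("well-made"))  # False: contains letters
```

Whether to drop these tokens is a judgment call; the results reported in this notebook were produced with `'--'` left in the vocabulary.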

In [78]:
documents = []  # Initialize an empty list to store tuples

# Iterate through categories and fileids in the movie_reviews dataset
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        # Create a tuple containing the list of words in the review and its category
        review_tuple = (list(movie_reviews.words(fileid)), category)
        # Append the tuple to the documents list
        documents.append(review_tuple)

# Shuffle the documents list to randomize the order
random.shuffle(documents)
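
Because the shuffle is unseeded, the train/test split, and therefore every accuracy reported later, changes from run to run. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable (the helper name `seeded_shuffle`, the seed value 42, and the toy documents are illustrative only):

```python
import random

# Hypothetical stand-in for the (words, category) pairs built above
toy_documents = [(["great", "film"], "pos"), (["dull", "plot"], "neg"),
                 (["fine", "cast"], "pos"), (["weak", "script"], "neg")]

def seeded_shuffle(items, seed=42):
    """Return a shuffled copy of items; the same seed always gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private generator, global state untouched
    return shuffled

first = seeded_shuffle(toy_documents)
second = seeded_shuffle(toy_documents)
print(first == second)  # True
```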


In [79]:
print(documents[:2]) #Checking sample

[(['synopsis', ':', 'the', 'president', 'of', 'a', 'company', 'wants', 'to', 'test', 'his', 'self', '-', 'appointed', 'successor', ',', 'who', "'", 's', 'psychotic', ',', 'and', 'thinks', 'it', "'", 's', 'a', 'great', 'idea', 'to', 'spend', 'a', 'week', 'with', 'him', 'and', 'their', 'wives', 'in', 'an', 'isolated', 'cabin', 'a', 'hundred', 'miles', 'from', 'civilization', 'with', 'no', 'dependable', 'transportation', 'or', 'means', 'of', 'communication', 'after', 'a', 'heavy', 'snowfall', '.', 'comments', ':', 'tracks', 'of', 'a', 'killer', 'had', 'a', 'couple', 'of', 'strikes', 'against', 'it', 'before', 'i', 'even', 'began', 'watching', 'it', '.', 'strike', 'one', 'was', 'the', 'fact', 'that', 'someone', 'had', 'scrawled', 'the', 'word', 'garbage', 'on', 'the', 'videotape', "'", 's', 'sticker', 'in', 'black', 'marker', '(', 'not', 'typically', 'a', 'good', 'sign', 'when', 'you', 'rent', 'a', 'film', ')', '.', 'strike', 'two', 'came', 'while', 'the', 'previews', 'played', '.', 'did',

In [80]:
movie_reviews.categories()
# Return the categories present in the movie_reviews dataset

['neg', 'pos']

In [81]:
def document_features(document):
    # Convert the document into a set of words
    document_words = set(document)
    
    # Initialize an empty dictionary for features
    features = {}
    
    # Iterate through the word features and check if each word is present in the document
    for word in word_features:
        # Use the format 'contains({word})' as a feature key, with a boolean value indicating presence
        features['contains({})'.format(word)] = (word in document_words)
    
    # Return the feature dictionary
    return features
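
To make the feature format concrete, here is the same idea applied to a hypothetical three-word vocabulary instead of the 2,000 entries in `word_features`. The function name `toy_document_features` and the sample review are made up for illustration.

```python
# Hypothetical three-word vocabulary standing in for the 2,000 word_features
toy_word_features = ["good", "bad", "plot"]

def toy_document_features(document, vocabulary):
    """Map a tokenized review to presence/absence flags for each vocabulary word."""
    document_words = set(document)  # set lookups are fast and ignore repeats
    return {f"contains({word})": (word in document_words) for word in vocabulary}

review = ["the", "plot", "is", "good", "and", "the", "cast", "is", "good"]
features = toy_document_features(review, toy_word_features)
print(features)
# {'contains(good)': True, 'contains(bad)': False, 'contains(plot)': True}
```

Note that the features record only whether a word occurs, not how often: "good" appears twice in the sample review but still maps to a single `True`.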


In [82]:
# Build (feature_dict, category) pairs for every review.
# A list comprehension avoids duplicating entries if this cell is re-run.
featuresets = [(document_features(d), c) for (d, c) in documents]


In [83]:
featuresets[0:1] #Checking sample


[({'contains(film)': True,
   'contains(one)': True,
   'contains(movie)': True,
   'contains(like)': True,
   'contains(even)': True,
   'contains(good)': True,
   'contains(time)': True,
   'contains(story)': False,
   'contains(would)': False,
   'contains(much)': False,
   'contains(character)': True,
   'contains(also)': False,
   'contains(get)': True,
   'contains(two)': True,
   'contains(well)': True,
   'contains(characters)': True,
   'contains(first)': True,
   'contains(--)': False,
   'contains(see)': False,
   'contains(way)': False,
   'contains(make)': True,
   'contains(life)': False,
   'contains(really)': True,
   'contains(films)': True,
   'contains(plot)': True,
   'contains(little)': False,
   'contains(people)': True,
   'contains(could)': False,
   'contains(scene)': True,
   'contains(man)': False,
   'contains(bad)': True,
   'contains(never)': False,
   'contains(best)': False,
   'contains(new)': False,
   'contains(scenes)': True,
   'contains(many)': Fal

In [84]:
# Hold out the first 100 shuffled reviews for testing; train on the rest
train_set, test_set = featuresets[100:], featuresets[:100]

# DictVectorizer converts the feature dictionaries into a sparse matrix
vectorizer = DictVectorizer()

# Learn the feature-to-column mapping from the training set and transform it
X_train = vectorizer.fit_transform([features for features, label in train_set])
y_train = [label for features, label in train_set]

# Apply the same mapping to the test set (transform only, no refitting)
X_test = vectorizer.transform([features for features, label in test_set])
y_test = [label for features, label in test_set]
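
As a conceptual aid, the fit/transform split above can be imitated in plain Python. This is only a sketch of the idea, not scikit-learn's implementation (which returns a SciPy sparse matrix); the helper names `fit_columns` and `to_rows` and the toy dictionaries are hypothetical.

```python
def fit_columns(feature_dicts):
    """Assign one column index to each feature name, in sorted order."""
    names = sorted({name for d in feature_dicts for name in d})
    return {name: idx for idx, name in enumerate(names)}

def to_rows(feature_dicts, columns):
    """Turn each dict into a row of floats; features not seen during fitting are ignored."""
    rows = []
    for d in feature_dicts:
        row = [0.0] * len(columns)
        for name, value in d.items():
            if name in columns:
                row[columns[name]] = float(value)  # True -> 1.0, False -> 0.0
        rows.append(row)
    return rows

train = [{"contains(good)": True, "contains(bad)": False},
         {"contains(good)": False, "contains(bad)": True}]
test = [{"contains(good)": True, "contains(dull)": True}]

columns = fit_columns(train)     # mapping learned from training data only
print(to_rows(train, columns))   # [[0.0, 1.0], [1.0, 0.0]]
print(to_rows(test, columns))    # [[0.0, 1.0]]
```

Learning the column mapping from the training data alone keeps the test rows aligned with the columns the classifiers were trained on.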


In [85]:
print(X_train[:5]) #sample checking

  (0, 0)	1.0
  (0, 1)	0.0
  (0, 2)	0.0
  (0, 3)	0.0
  (0, 4)	0.0
  (0, 5)	0.0
  (0, 6)	0.0
  (0, 7)	0.0
  (0, 8)	0.0
  (0, 9)	0.0
  (0, 10)	0.0
  (0, 11)	0.0
  (0, 12)	1.0
  (0, 13)	0.0
  (0, 14)	0.0
  (0, 15)	0.0
  (0, 16)	0.0
  (0, 17)	0.0
  (0, 18)	0.0
  (0, 19)	0.0
  (0, 20)	0.0
  (0, 21)	0.0
  (0, 22)	0.0
  (0, 23)	0.0
  (0, 24)	0.0
  :	:
  (4, 1975)	1.0
  (4, 1976)	1.0
  (4, 1977)	0.0
  (4, 1978)	1.0
  (4, 1979)	0.0
  (4, 1980)	0.0
  (4, 1981)	0.0
  (4, 1982)	0.0
  (4, 1983)	1.0
  (4, 1984)	0.0
  (4, 1985)	0.0
  (4, 1986)	1.0
  (4, 1987)	0.0
  (4, 1988)	0.0
  (4, 1989)	0.0
  (4, 1990)	0.0
  (4, 1991)	0.0
  (4, 1992)	0.0
  (4, 1993)	1.0
  (4, 1994)	0.0
  (4, 1995)	1.0
  (4, 1996)	0.0
  (4, 1997)	0.0
  (4, 1998)	0.0
  (4, 1999)	0.0


In [86]:
print(y_train[:5])

['neg', 'pos', 'pos', 'neg', 'pos']


In [87]:
# Each entry pairs a display name with a classifier instance.
# The NLTK classifier is trained here; the scikit-learn ones are fitted in the loop.
classifiers = [
    ('Multinomial Naive Bayes', MultinomialNB()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('Support Vector Machine', SVC()),
    ('Naive Bayes (NLTK)', NaiveBayesClassifier.train(train_set))
]

for classifier_name, classifier in classifiers:
    if isinstance(classifier, NaiveBayesClassifier):
        # NLTK Naive Bayes is already trained and classifies feature dictionaries directly
        predictions = [classifier.classify(features) for features, _ in test_set]
    else:
        # scikit-learn classifiers are fitted on the vectorized training data
        classifier.fit(X_train, y_train)
        predictions = classifier.predict(X_test)

    # Compute and print the accuracy on the held-out test set
    accuracy = accuracy_score(y_test, predictions)
    print(f'{classifier_name} Accuracy: {accuracy}')


Multinomial Naive Bayes Accuracy: 0.89
Decision Tree Accuracy: 0.58
Random Forest Accuracy: 0.78
Support Vector Machine Accuracy: 0.88
Naive Bayes (NLTK) Accuracy: 0.83


In [88]:
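# The NLTK Naive Bayes classifier is the last entry in `classifiers`;
# fetch it explicitly instead of relying on the leftover loop variable
nltk_nb = classifiers[-1][1]

# Display the 20 most informative features for the NLTK Naive Bayes classifier
nltk_nb.show_most_informative_features(20)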


Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.4 : 1.0
        contains(seagal) = True              neg : pos    =      8.3 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.6 : 1.0
    contains(ridiculous) = True              neg : pos    =      5.7 : 1.0
         contains(flynt) = True              pos : neg    =      5.6 : 1.0
          contains(lame) = True              neg : pos    =      5.4 : 1.0
         contains(damon) = True              pos : neg    =      5.4 : 1.0
        contains(wasted) = True              neg : pos    =      5.3 : 1.0
         contains(waste) = True              neg : pos    =      5.2 : 1.0
         contains(awful) = True              neg : pos    =      5.1 : 1.0
        contains(poorly) = True              neg : pos    =      4.8 : 1.0
           contains(era) = True              pos : neg    =      4.7 : 1.0
         contains(worst) = True              neg : pos    =      4.4 : 1.0

## Conclusion

In this analysis, I experimented with several classifiers to predict the sentiment of movie reviews. Accuracy on the 100-review test set was as follows:

- **Multinomial Naive Bayes:** 89%
- **Decision Tree:** 58%
- **Random Forest:** 78%
- **Support Vector Machine:** 88%
- **Naive Bayes (NLTK):** 83%


The most informative features of the NLTK Naive Bayes classifier show which words are most strongly associated with positive or negative sentiment. Some notable observations from the output above:

- **Outstanding:** The presence of the word "outstanding" leans strongly towards positive sentiment, with a ratio of 10.4 : 1 (pos : neg).

- **Seagal:** On the negative side, "seagal" is the strongest indicator, with a ratio of 8.3 : 1 (neg : pos), suggesting that Steven Seagal movies were generally disliked.

- **Wonderfully:** The word "wonderfully" is likewise indicative of positive sentiment, with a ratio of 6.6 : 1 (pos : neg).

- **Ridiculous:** The word "ridiculous" points towards negative sentiment, with a ratio of 5.7 : 1 (neg : pos).
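
The ratios in these observations compare how likely a word is to appear in reviews of one class versus the other. The sketch below shows one way such a ratio could be computed from raw counts, assuming expected-likelihood (add-0.5) smoothing, which I believe is NLTK's default estimator for this classifier. The function names and the counts are hypothetical and are not taken from the corpus.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Lidstone-smoothed estimate: (count + gamma) / (total + bins * gamma)."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(pos_with_word, pos_total, neg_with_word, neg_total):
    """Return (favoured_label, ratio) for the feature value contains(word) = True."""
    p_pos = smoothed_prob(pos_with_word, pos_total)
    p_neg = smoothed_prob(neg_with_word, neg_total)
    if p_pos >= p_neg:
        return "pos", p_pos / p_neg
    return "neg", p_neg / p_pos

# Hypothetical counts: a word seen in 40 of 950 positive and 4 of 950 negative reviews
label, ratio = likelihood_ratio(40, 950, 4, 950)
print(label, round(ratio, 1))  # pos 9.0
```

A ratio of 9.0 here means the word is about nine times more likely to occur in a positive review than in a negative one under these made-up counts.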

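Multinomial Naive Bayes (89%) and the Support Vector Machine (88%) achieved the highest accuracy on this dataset. The Decision Tree (58%) did only somewhat better than chance on this two-class problem, which suggests that a single tree generalizes poorly from the 2,000 binary word features to unseen reviews. Because the documents are shuffled without a fixed seed and the test set holds only 100 reviews, these figures vary between runs, and a difference of a point or two should not be read as meaningful.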
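
To get a feel for how much an accuracy measured on 100 reviews can move by chance alone, a rough 95% interval can be computed with the normal approximation to the binomial. This is a back-of-the-envelope sketch; the helper name `accuracy_interval` is illustrative.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n test items."""
    stderr = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * stderr, accuracy + z * stderr

for name, acc in [("Multinomial Naive Bayes", 0.89),
                  ("Support Vector Machine", 0.88),
                  ("Decision Tree", 0.58)]:
    low, high = accuracy_interval(acc, n=100)
    print(f"{name}: {acc:.2f} (roughly {low:.2f} to {high:.2f})")
```

The intervals for the two best models overlap almost entirely, so this test set is too small to separate them; a larger held-out set or cross-validation would give a firmer comparison.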


# <center><div style="font-family: Trebuchet MS; background-color: #00A693; padding: 12px; line-height: 1;">Any sort of feedback is appreciated!</div></center><center><div style="font-family: Trebuchet MS; background-color: #00A693;padding: 12px; line-height: 1;">Thank You!</div></center>