## Install required libraries

In [1]:
!pip install nltk
!pip install pandas
!pip install scikit-learn



## Import libraries and the movie_reviews dataset

In [2]:
import nltk
from nltk.corpus import movie_reviews
import pandas as pd
import sklearn as sk
import random

In [5]:
# Download the movie_reviews corpus (skipped if it is already up to date)
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Phi\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

## A list of the 30 features that the classifier finds to be most informative

In [6]:
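# Build the feature vocabulary from all movie reviews.
# Note: in NLTK 3, FreqDist.keys() is in first-seen order, not frequency order,
# so this takes the first 1000 distinct words; all_words.most_common(1000)
# would give the 1000 most frequent words instead.

words = movie_reviews.words()
all_words = nltk.FreqDist(w.lower() for w in words) # {word: frequency}
word_features = list(all_words.keys())[:1000] # a larger vocabulary slows down training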

# example of results
word_features[:15]

['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive']
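The sample output above starts with the opening words of the first review rather than with very common words such as "the". In NLTK 3, `FreqDist` subclasses `collections.Counter`, whose `keys()` follow first-seen order. A minimal sketch of the difference, using the standard-library `Counter` and a made-up token list (the data and variable names are illustrative, not from the notebook):

```python
from collections import Counter

# Toy token stream standing in for movie_reviews.words() (illustrative only).
tokens = ["plot", ":", "two", "teen", "couples", "the", "plot", "the", "the", ":"]

counts = Counter(t.lower() for t in tokens)

# keys() preserves first-seen order, so slicing it does NOT rank by frequency.
first_seen = list(counts.keys())[:3]

# most_common(n) returns (word, count) pairs sorted by descending count.
most_frequent = [word for word, _ in counts.most_common(3)]

print(first_seen)     # ['plot', ':', 'two']
print(most_frequent)  # ['the', 'plot', ':']
```

On the real corpus the equivalent call would be `all_words.most_common(1000)`. Switching to it would change the vocabulary, and with it the results shown in the rest of the notebook.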

In [7]:
# Build a list of (word list, category) pairs, one per review

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [8]:
# Build a (features, category) pair for each review: one boolean feature
# per word in word_features, True if the review contains that word

def doc_features(document):
    doc_words = set(document) # set membership tests are much faster than list lookups
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in doc_words)
    return features

featuresets = [(doc_features(d), c) for (d,c) in documents]
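To make the feature representation concrete, here is a small sketch of the same extractor applied to a toy review. It differs from the cell above in taking the vocabulary as a parameter instead of reading the global `word_features`; the vocabulary and the review are made up for illustration.

```python
def doc_features(document, word_features):
    """Map a tokenized document to {'contains(word)': bool} features."""
    doc_words = set(document)  # one set build, then constant-time lookups
    return {f"contains({word})": (word in doc_words) for word in word_features}

# Hypothetical vocabulary and review, for illustration only.
vocabulary = ["plot", "awful", "kudos"]
review = ["the", "plot", "was", "awful"]

features = doc_features(review, vocabulary)
print(features)
# {'contains(plot)': True, 'contains(awful)': True, 'contains(kudos)': False}
```

Every review gets the same set of keys, so the classifier sees absent words (`False`) as evidence too, not only the words that are present.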

In [9]:
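# Hold out the first 100 reviews as test data and train on the rest.
# Note: documents is ordered by category, so without shuffling first,
# these 100 test reviews are all negative.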
train_set = featuresets[100:]
test_set = featuresets[:100]
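Because `movie_reviews.categories()` lists `neg` before `pos`, the first 100 feature sets all come from negative reviews. A common remedy is to shuffle before slicing. The sketch below shows the idea on toy data; the stand-in `featuresets`, the split size of 20, and reusing the seed 4321 are assumptions for illustration, not the notebook's actual procedure.

```python
import random

# Toy stand-in for featuresets: 100 'neg' items followed by 100 'pos' items,
# mirroring a corpus that is read category by category (illustrative data).
featuresets = [({"id": i}, "neg") for i in range(100)] + \
              [({"id": i}, "pos") for i in range(100, 200)]

# Unshuffled: the first 20 items all carry the same label.
unshuffled_test_labels = {label for _, label in featuresets[:20]}

# Shuffle a copy with a seeded generator so the split is reproducible.
shuffled = featuresets[:]
random.Random(4321).shuffle(shuffled)
test_set, train_set = shuffled[:20], shuffled[20:]
shuffled_test_labels = {label for _, label in test_set}

print(unshuffled_test_labels)        # only one label
print(sorted(shuffled_test_labels))  # labels found in the shuffled test split
```

In the notebook, the same effect could be had by shuffling `documents` before building `featuresets`. That would change the training data and therefore the feature ranking reported further down.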

In [10]:
# Train using Naive Bayes classifier
random.seed(4321) # has no effect on training, which involves no randomness
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [11]:
# 30 most important features
classifier.show_most_informative_features(30)

Most Informative Features
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
         contains(kudos) = True              pos : neg    =      5.9 : 1.0
        contains(wasted) = True              neg : pos    =      5.6 : 1.0
         contains(awful) = True              neg : pos    =      5.4 : 1.0
        contains(poorly) = True              neg : pos    =      4.9 : 1.0
       contains(bronson) = True              neg : pos    =      4.8 : 1.0
         contains(bland) = True              neg : pos    =      4.2 : 1.0
      contains(thrilled) = True              pos : neg    =      4.1 : 1.0
       contains(runtime) = True              neg : pos    =      4.1 : 1.0
     contains(underwood) = True              neg : pos    =      4.1 : 1.0
     contains(stretched) = True              neg : pos    =      4.0 : 1.0
     contains(anastasia) = True              pos : neg    =      3.9 : 1.0
          contains(dull) = True              neg : pos    =      3.8 : 1.0
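Each row of the table above is a likelihood ratio: the probability of seeing the feature in one class divided by its probability in the other. The sketch below reproduces that arithmetic with add-0.5 smoothing over two feature values (True/False), which, as far as I know, matches the `ELEProbDist` estimator that `NaiveBayesClassifier.train` uses by default. The counts are hypothetical and were not taken from the corpus.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma (Lidstone) estimate of P(feature value | label)."""
    return (count + gamma) / (total + bins * gamma)

# Hypothetical counts: how many training reviews of each class contain a word.
neg_with_word, neg_total = 13, 950
pos_with_word, pos_total = 2, 950

p_neg = smoothed_prob(neg_with_word, neg_total)
p_pos = smoothed_prob(pos_with_word, pos_total)

ratio = p_neg / p_pos
print(f"neg : pos = {ratio:.1f} : 1.0")  # neg : pos = 5.4 : 1.0
```

With counts this small, one or two extra reviews would shift the ratio considerably, which is worth keeping in mind when reading the ranking.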

## Can you explain why these particular features are informative? Do you find any of them surprising?
#### Informative Features
- Negative Sentiment:
> - "turkey," "wasted," "awful," "poorly": These directly express a negative opinion of the movie's quality ("turkey" is slang for a flop).
> - "dull," "jumbled": These name specific faults, such as a lack of excitement or a confusing plot.
> - "sexist," "mess": These criticize the movie's content or its chaotic storytelling.
> - "runtime," "bronson," "underwood": "runtime" suggests complaints about the movie's length. "bronson" and "underwood" look like names of people; they probably rank highly because the few reviews that mention them happen to be negative, not because the names themselves carry sentiment.
- Positive Sentiment:
> - "kudos," "thrilled," "memorable": These directly express appreciation for the movie.
> - "robots," "anastasia": These might point to specific genres or titles (e.g., sci-fi, animation) that reviewers in this corpus tended to like.
> - "considers," "remembers," "ponders": These suggest the review discusses characters or themes in depth, which may go along with a positive assessment.
> - "stable," "reformed": These could describe character arcs or thematic elements that inspire positive feelings.
#### Surprising Features:
> - "bland," "stretched": These are negative, but they are subjective and depend on individual preferences for style or pacing.
> - "brooke," "stan": These look like names tied to particular actors or characters, so they likely reflect a handful of reviews and not general sentiment.
> - "implied," "ponders": These express sentiment only indirectly, so their role would need further analysis.
Overall, the informative features are words that convey strong opinions or evoke specific emotions about the movie experience. The surprising ones are subjective terms or niche references such as names, and they need more context or further investigation to interpret.
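One way to follow up on the surprising features is to count how many reviews of each class actually contain a given word. A feature backed by only a handful of reviews says more about the sample than about sentiment. The sketch below uses a made-up four-review corpus and a hypothetical helper `doc_frequency`; in the notebook, the real `documents` list would take the place of the toy data.

```python
from collections import Counter

# Toy labelled corpus (illustrative); in the notebook this would be `documents`.
documents = [
    (["bronson", "stars", "in", "a", "dull", "film"], "neg"),
    (["a", "dull", "mess"], "neg"),
    (["kudos", "to", "the", "cast"], "pos"),
    (["memorable", "and", "never", "dull"], "pos"),
]

def doc_frequency(word, documents):
    """Count, per label, how many documents contain `word` at least once."""
    counts = Counter()
    for tokens, label in documents:
        if word in set(tokens):
            counts[label] += 1
    return dict(counts)

for word in ["bronson", "dull", "kudos"]:
    print(word, doc_frequency(word, documents))
# bronson {'neg': 1}
# dull {'neg': 2, 'pos': 1}
# kudos {'pos': 1}
```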