# Movie Reviews Sentiment Analysis with Scikit-Learn

In this tutorial, we will build a text classifier that labels movie reviews as either positive or negative. We will:

1. Use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a> to perform both tokenization and word occurrence counting on the text documents.
2. Use a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB">multinomial Naive Bayes</a> classifier to train the model.
3. Work with the movie review corpus, which can be downloaded <a href="http://www.nltk.org/nltk_data/">here</a>.

## Load movie_reviews corpus data through sklearn

In [None]:
import sklearn
from sklearn.datasets import load_files

In [10]:
# You might need to modify the path according to where you store the data.
moviedir = r'D:\Lab\movie_reviews'

# Load all files. Without an `encoding` argument, each document is returned as raw bytes.
movie = load_files(moviedir, shuffle=True)

In [12]:
len(movie.data)

2000

In [13]:
# Target names ("classes") are automatically generated from subfolder names
movie.target_names

['neg', 'pos']

In [14]:
# First 500 characters of the first file.
movie.data[0][:500]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so cal"

In [19]:
# The first file is a negative review: its label 0 is the index of 'neg' in target_names
movie.target[0]

0

In [22]:
# Number of negative reviews (label 0)
sum(movie.target == 0)

1000
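
The folder-to-label behaviour of `load_files` can be checked on a tiny throwaway corpus. The sketch below is an illustration only: the temporary directory, the file names, and the two sample sentences are made up. It also passes `encoding="utf-8"` so that documents come back as `str` rather than the `bytes` shown above.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a hypothetical two-class corpus: one subfolder per class.
with tempfile.TemporaryDirectory() as root:
    for label, text in [("neg", "dull and slow"), ("pos", "a joy to watch")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "review_0.txt").write_text(text, encoding="utf-8")

    # shuffle=False keeps the files in sorted folder order.
    toy = load_files(root, encoding="utf-8", shuffle=False)

toy_names = toy.target_names      # class names taken from the subfolder names
toy_labels = toy.target.tolist()  # integer label of each document
toy_texts = toy.data              # decoded text, because encoding was given

print(toy_names, toy_labels, toy_texts)
```

For the real corpus, the same `encoding` argument could be added to the `load_files(moviedir, ...)` call if decoded strings are preferred; `CountVectorizer` accepts either form.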

## A detour: try out CountVectorizer

In [23]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
vectorizer = CountVectorizer()

In [25]:
corpus = ['the cat sat',
          'the cat sat in the hat',
          'the cat with the hat']

In [26]:
# It will perform both tokenization and occurrence counting on the corpus.
X = vectorizer.fit_transform(corpus)

In [27]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['cat', 'hat', 'in', 'sat', 'the', 'with']

In [29]:
# Each row is one document; each column holds the count of the corresponding feature name in that document.
X.toarray()

array([[1, 0, 0, 1, 1, 0],
       [1, 1, 1, 1, 2, 0],
       [1, 1, 0, 0, 2, 1]], dtype=int64)
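
To see what the count matrix above represents, the same numbers can be rebuilt by hand. This is a minimal sketch and not scikit-learn's implementation: it assumes the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), and the helper names `tokenize` and `count_vector` are made up for illustration.

```python
import re
from collections import Counter

corpus = ['the cat sat',
          'the cat sat in the hat',
          'the cat with the hat']

# Assumption: mimic the default tokenization (tokens of 2+ word characters).
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def count_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

matrix = [count_vector(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```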

### We completed the feature extraction process for our toy document set

- What if we get new documents?

In [35]:
# A new document
newdocs = ["the dog with the ball"]

# This time, no fitting needed: transform the new doc into count-vectorized form
# Unseen words ('dog', 'ball') are ignored
newdocs_counts = vectorizer.transform(newdocs)
newdocs_counts.toarray()

array([[0, 0, 0, 0, 2, 1]], dtype=int64)
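
The reason `'dog'` and `'ball'` vanish is that the fitted vectorizer only knows the terms stored in its `vocabulary_` attribute. The following sketch refits the toy corpus so it can run on its own and lists which tokens of the new document are unknown; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat',
          'the cat sat in the hat',
          'the cat with the hat']

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each learned term to its column index.
vocab = vectorizer.vocabulary_

# build_analyzer() returns the preprocessing + tokenization step used by transform().
analyzer = vectorizer.build_analyzer()
tokens = analyzer("the dog with the ball")

# Tokens missing from the vocabulary have no column, so transform() drops them.
unseen = [tok for tok in tokens if tok not in vocab]
print(unseen)
```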

## Back to real data: movie reviews

In [36]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(
    movie.data, movie.target, test_size=0.20, random_state=12)
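
The split above samples documents at random, so the share of positive and negative reviews in each part can drift slightly from 50/50. One optional variation, which this tutorial does not use, is the `stratify` argument of `train_test_split`. The sketch below shows its effect on a made-up list of ten labelled placeholders.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten placeholder documents, five per class.
toy_docs = [f"review {i}" for i in range(10)]
toy_labels = [0] * 5 + [1] * 5

# stratify=toy_labels asks for the same class ratio in both parts.
train_docs, test_docs, train_y, test_y = train_test_split(
    toy_docs, toy_labels, test_size=0.20, random_state=12, stratify=toy_labels)

print(sorted(train_y), sorted(test_y))
```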

In [37]:
# initialize CountVectorizer
# min_df=2 ignores terms that appear in fewer than 2 documents;
# max_features=3000 keeps only the 3000 most frequent remaining terms.
movie_vectorizer = CountVectorizer(min_df=2, max_features=3000)

# fit and transform using training text
docs_train_counts = movie_vectorizer.fit_transform(docs_train)

In [38]:
# Large matrix: 1,600 documents by 3,000 terms, stored in sparse form.
docs_train_counts.shape

(1600, 3000)

### The feature extraction functions and training data are ready.

### Next up: test data

- You have to prepare the test data using the same feature extraction scheme.

In [39]:
# Using the fitted vectorizer, transform the test data
docs_test_counts = movie_vectorizer.transform(docs_test)
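
One way to guarantee that training and test text always go through the same fitted vectorizer is to bundle the steps in a scikit-learn `Pipeline`. The tutorial keeps the steps separate, so treat the following as an alternative sketch; the four training sentences and their labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
train_texts = ["great fun", "wonderful great", "boring awful", "awful dull"]
train_labels = [1, 1, 0, 0]

# fit() fits the vectorizer and the classifier on the training text;
# predict() reuses the fitted vectorizer on any new text automatically.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predicted = model.predict(["great wonderful", "dull boring"]).tolist()
print(predicted)
```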

## Training and testing a Naive Bayes classifier

In [40]:
# Now ready to build a classifier.
# We will use multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

In [41]:
# Train a multinomial Naive Bayes classifier. Again, we call it "fitting"
clf = MultinomialNB()
clf.fit(docs_train_counts, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
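
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. As a rough illustration of what multinomial Naive Bayes computes, here is a hand-written sketch in plain Python. It is not scikit-learn's code; the training sentences, the whitespace tokenization, and the helper names `smoothed_prob`, `log_score`, and `predict` are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy data: 1 = positive, 0 = negative.
train = [("great fun", 1), ("wonderful great", 1),
         ("boring awful", 0), ("awful dull", 0)]
alpha = 1.0  # additive smoothing, matching the default shown above

vocab = sorted({w for text, _ in train for w in text.split()})
classes = sorted({label for _, label in train})

word_counts = {c: Counter() for c in classes}  # per-class word totals
doc_counts = Counter()                         # per-class document totals
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def smoothed_prob(word, c):
    """P(word | class) with additive smoothing."""
    total = sum(word_counts[c].values()) + alpha * len(vocab)
    return (word_counts[c][word] + alpha) / total

def log_score(text, c):
    """Log prior plus summed log likelihoods (unnormalized log posterior)."""
    score = math.log(doc_counts[c] / len(train))
    for w in text.split():
        if w in vocab:  # words outside the vocabulary are ignored
            score += math.log(smoothed_prob(w, c))
    return score

def predict(text):
    return max(classes, key=lambda c: log_score(text, c))

print(predict("great wonderful"), predict("dull boring"))
```

Smoothing is what keeps a word that never occurred in one class (such as `great` in the negative class here) from forcing that class's probability to zero.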

In [42]:
# Predict the test set labels and compute the accuracy
from sklearn.metrics import accuracy_score
y_pred = clf.predict(docs_test_counts)
accuracy_score(y_test, y_pred)

0.7775
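
Accuracy alone does not show which class the errors fall on. A confusion matrix breaks the result down by true and predicted label. The sketch below uses five made-up labels rather than the movie data, so the numbers say nothing about the classifier above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 0 = neg, 1 = pos.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(acc)
print(cm)
```

On the real split, the same call would be `confusion_matrix(y_test, y_pred)`.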

## Trying the classifier on fake movie reviews

In [43]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 
               'Absolute joy ride', 
               'Steven Seagal was terrible', 
               'Two thumbs up', 
               'I fell asleep halfway through', 
               'Steven Seagal was amazing. His performance was Oscar-worthy.']

reviews_new_counts = movie_vectorizer.transform(reviews_new)         # turn text into count vector

In [44]:
# have classifier make a prediction
pred = clf.predict(reviews_new_counts)

In [45]:
# print out results
for review, category in zip(reviews_new, pred):
    print(f'{review} => {movie.target_names[category]}')

This movie was excellent => pos
Absolute joy ride => pos
Steven Seagal was terrible => neg
Two thumbs up => neg
I fell asleep halfway through => neg
Steven Seagal was amazing. His performance was Oscar-worthy. => neg
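
Short inputs give the model very little evidence, because only tokens present in the fitted vocabulary contribute to the prediction. The sketch below does not reproduce the results above; it uses a made-up four-sentence training set to show the extreme case in which no token is known, so the classifier falls back on the class priors. In that tied case `predict` returns the first class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
train_texts = ["great fun", "wonderful great", "boring awful", "awful dull"]
train_labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

review = "Two thumbs up"

# Count how many tokens of the review are in the fitted vocabulary.
vectorizer = model.named_steps["countvectorizer"]
known_tokens = int(vectorizer.transform([review]).sum())

# With no known tokens, the class probabilities equal the class priors.
proba = model.predict_proba([review])[0].tolist()
label = int(model.predict([review])[0])

print(known_tokens, proba, label)
```

For the movie classifier, calling `clf.predict_proba(reviews_new_counts)` would show how confident each of the predictions above actually is.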
