# Movie Reviews Sentiment Analysis with Scikit-Learn

#### In this lab tutorial, we will build a sentiment classifier that labels movie reviews as either positive or negative.

## Load movie_reviews corpus data through sklearn

In [None]:
import sklearn
from sklearn.datasets import load_files

In [None]:
# The variable moviedir is the location of the movie_reviews folder.
# You might need to modify the path according to where you store the data.
moviedir = 'E:/NaiveBayes/movie_reviews'

# Loading all files.
# Setting shuffle=True rearranges the order of the files randomly.
movie = load_files(moviedir, shuffle=True)
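
To see how `load_files` turns a folder layout into labeled data, here is a minimal sketch on a throwaway corpus. The folder names, file names, and review texts are made up for illustration, and `encoding="utf-8"` is an optional argument that the cell above does not use.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: one subfolder per class, one text file per document.
toy_reviews = {"neg": ["dull and slow"], "pos": ["a real delight", "great fun"]}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for label, texts in toy_reviews.items():
        (root / label).mkdir()
        for i, text in enumerate(texts):
            (root / label / f"review_{i}.txt").write_text(text, encoding="utf-8")

    # encoding="utf-8" decodes each file to str; without it, data holds raw bytes.
    # File contents are read here, so they stay available after the folder is deleted.
    toy = load_files(str(root), encoding="utf-8", shuffle=True)

print(toy.target_names)        # class names taken from the subfolder names
print(len(toy.data))           # number of documents loaded
print(int(sum(toy.target == 1)))  # number of documents in the second class
```

Each entry of `toy.target` is the index of that document's class in `toy.target_names`, which is the same mapping the real corpus uses.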

In [None]:
# Total number of movie reviews
len(movie.data)

In [None]:
# Target names ("classes") are automatically generated from subfolder names.
movie.target_names

In [None]:
# First 500 characters of the first file (shown as bytes, since load_files was called without an encoding)
movie.data[0][:500]

In [None]:
# Print out the filename of the first file.
movie.filenames[0]

In [None]:
# The first file is a negative review, so its target is 0, the index of 'neg' in target_names.
movie.target[0]

In [None]:
# Number of negative reviews
sum(movie.target==0)

## A detour: try out CountVectorizer

In [None]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

In [None]:
# corpus contains 3 documents
corpus = ['the cat sat',
          'the cat sat in the hat',
          'the cat with the hat']

In [None]:
# fit_transform tokenizes the corpus and counts word occurrences, turning each document into a vector.
X = vectorizer.fit_transform(corpus)

In [None]:
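# Returns an array of feature names; in our case, these are the unique words that appear in the corpus.
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0).
vectorizer.get_feature_names_out()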

In [None]:
# Produces a 3x6 matrix: 3 is the number of documents and 6 is the number of features.
# Each row vector corresponds to one vectorized document.
# Each value in a row is the number of occurrences of the feature at the corresponding position in the feature names list.
X.toarray()

### We have completed the feature extraction process for our toy corpus

What if we get new documents?

In [None]:
# A new document
newdocs = ["the dog with the ball"]

# This time, no fitting needed: directly apply transform method on the new document to convert it into count-vectorized form
# Unseen words ('dog', 'ball') are ignored
newdocs_counts = vectorizer.transform(newdocs)
newdocs_counts.toarray()
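
One way to check which words of a new document survive the transform is to look them up in the fitted vocabulary. This is a rough sketch that refits a vectorizer on the toy corpus so it is self-contained; splitting on whitespace is a simplification of the vectorizer's own tokenizer that happens to be adequate for this sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat',
          'the cat sat in the hat',
          'the cat with the hat']
toy_vectorizer = CountVectorizer().fit(corpus)

new_doc = "the dog with the ball"
tokens = new_doc.split()

# vocabulary_ maps each learned word to its column index.
known = [w for w in tokens if w in toy_vectorizer.vocabulary_]
unknown = [w for w in tokens if w not in toy_vectorizer.vocabulary_]

# Only the known tokens contribute to the count vector.
total_count = int(toy_vectorizer.transform([new_doc]).sum())

print(known)
print(unknown)
print(total_count)
```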

## Back to real data: movie reviews

In [None]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(
    movie.data, movie.target, test_size=0.20, random_state=12)
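
As a quick illustration of what `test_size` and `random_state` do, the sketch below splits ten placeholder documents. The document strings and labels are invented for this example.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder documents with alternating labels.
toy_docs = [f"document {i}" for i in range(10)]
toy_labels = [0, 1] * 5

toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_docs, toy_labels, test_size=0.20, random_state=12)

# test_size=0.20 holds out 20% of the data: 2 of the 10 documents.
print(len(toy_train), len(toy_test))

# Reusing the same random_state reproduces exactly the same split.
repeat = train_test_split(toy_docs, toy_labels, test_size=0.20, random_state=12)
print(repeat[1] == toy_test)
```

Applied to a corpus of 2000 reviews, the same 80/20 split leaves 1600 documents for training.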

### Write code to implement the following:
- initialize CountVectorizer so that it ignores words with a document frequency below 2 (words that appear in fewer than 2 documents) and uses the top 3000 words only.
- fit and transform using the training data
- using the fitted vectorizer, transform the test data
- check out <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer documentation</a> for any help.

In [None]:
### START CODE HERE ###
# initialize CountVectorizer
movie_vectorizer = 

# fit and transform using training data
docs_train_counts = 
### END CODE HERE ###

In [None]:
# Test your code
docs_train_counts.shape

#### Expected output
```
(1600, 3000)
```

In [None]:
### START CODE HERE ###
# Using the fitted vectorizer, transform the test data
docs_test_counts = 
### END CODE HERE ###

## Training and testing a Naive Bayes classifier

In [None]:
# Now we are ready to build a classifier.
# We will use Multinomial Naive Bayes as our model.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

### Write code to implement the following:
- initialize and train a Multinomial Naive Bayes classifier
- predict the labels of the test data, then use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html">accuracy_score</a> to find the accuracy
- check out <a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB">Multinomial Naive Bayes documentation</a> for any help.
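
The general fit, predict, and score pattern can be sketched on a handful of invented sentences before applying it to the movie data. Everything below (the toy documents, labels, and `toy_` variable names) is made up for illustration, and the resulting accuracy says nothing about the real corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = positive, 0 = negative.
toy_train_docs = ["great fun great cast", "boring plot boring cast",
                  "great plot", "boring and dull"]
toy_train_labels = [1, 0, 1, 0]
toy_test_docs = ["great cast", "dull plot boring"]
toy_test_labels = [1, 0]

# Vectorize: fit on the training text only, then reuse the vocabulary for test text.
toy_vectorizer = CountVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train_docs)
toy_X_test = toy_vectorizer.transform(toy_test_docs)

# Train on word counts, predict labels for unseen text, and score the predictions.
toy_clf = MultinomialNB()
toy_clf.fit(toy_X_train, toy_train_labels)
toy_pred = toy_clf.predict(toy_X_test)
toy_accuracy = accuracy_score(toy_test_labels, toy_pred)

print(toy_pred)
print(toy_accuracy)
```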

In [None]:
clf = MultinomialNB()
### START CODE HERE ###

### END CODE HERE ###

#### Expected accuracy
```
0.7775
```

## Trying the classifier on fake movie reviews

In [None]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 
               'Absolute joy ride', 
               'Steven Seagal was terrible', 
               'Two thumbs up', 
               'I fell asleep halfway through', 
               'Steven Seagal was amazing. His performance was Oscar-worthy.']

reviews_new_counts = movie_vectorizer.transform(reviews_new)  # turn text into count vectors

In [None]:
# have classifier make a prediction
pred = clf.predict(reviews_new_counts)

In [None]:
# print out results
for review, category in zip(reviews_new, pred):
    print(f'{review} => {movie.target_names[category]}')