In [1]:
data_dir = "../../ml_datasets/movies_reviews"

In [2]:
import numpy as np
import matplotlib.pyplot as plt

In [3]:
import os
print(os.listdir(data_dir))

['.DS_Store', 'neg', 'new_reviews', 'pos']


First, I read all the reviews into two lists of strings: one for the positive reviews folder and one for the negative reviews folder.

In [4]:
# Read each positive review into a single string with newlines removed.
pos_dir = os.path.join(data_dir, "pos")
pos_reviews = []
for filename in os.listdir(pos_dir):
    with open(os.path.join(pos_dir, filename), 'r') as file:
        data = file.read().replace('\n', '')
        pos_reviews.append(data)

# Do the same for the negative reviews.
neg_dir = os.path.join(data_dir, "neg")
neg_reviews = []
for filename in os.listdir(neg_dir):
    with open(os.path.join(neg_dir, filename), 'r') as file:
        data = file.read().replace('\n', '')
        neg_reviews.append(data)
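The two loops above differ only in the folder name, so they could be folded into one helper. The following is a minimal sketch, not the code used to produce the results in this notebook: `load_reviews` is a hypothetical name, skipping hidden files (such as `.DS_Store`) is an assumption about what the folders might contain, and replacing newlines with a space (rather than removing them) is a deliberate change so that words on adjacent lines do not fuse.

```python
import os
import tempfile

def load_reviews(folder):
    """Return the text of every non-hidden file in `folder`, sorted by filename."""
    reviews = []
    for filename in sorted(os.listdir(folder)):
        if filename.startswith("."):
            continue  # assumption: hidden files such as .DS_Store are not reviews
        with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
            # Replace newlines with a space so words on adjacent lines stay separate.
            reviews.append(f.read().replace("\n", " "))
    return reviews

# Small self-contained demo on a temporary folder (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "great\nfilm"), ("b.txt", "dull plot"), (".DS_Store", "junk")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo = load_reviews(tmp)

print(demo)
```

With such a helper, the two lists could be built with one call per folder, for example `load_reviews(os.path.join(data_dir, "pos"))`.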

Here, I place the first 500 positive and the first 500 negative reviews (exactly half of each) into the training set. The remaining reviews go into the test set.

In [5]:
X_train_reviews = pos_reviews[:500] + neg_reviews[:500]
X_test_reviews = pos_reviews[500:] + neg_reviews[500:]

In this step, I transform the training set. First, I apply the count vectorizer, which counts the occurrences of each word in each review. Then I pass the result to a second transformer, term frequency-inverse document frequency (TF-IDF), which gives less weight to words that appear in most reviews, such as "the."

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_tf = count_vect.fit_transform(X_train_reviews)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_tf)
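To see the down-weighting on a small scale, here is a sketch on three made-up sentences (the `toy_*` names and the sentences are illustrative, not part of the dataset). It fits the same two transformers and looks up the inverse document frequency of each word; with scikit-learn's default smoothing, a word that appears in every document gets the lowest possible IDF.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three made-up documents: "the" appears in all of them, "dull" in only one.
toy_docs = [
    "the movie was great",
    "the movie was dull",
    "the acting was great",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

toy_tfidf = TfidfTransformer()
toy_tfidf.fit(toy_counts)

# Map each vocabulary term to its learned IDF weight.
idf = {term: toy_tfidf.idf_[index] for term, index in toy_vect.vocabulary_.items()}

for term in ("the", "great", "dull"):
    print(term, round(idf[term], 3))
```

The rarer the word across documents, the larger its IDF, so "dull" ends up with more weight than "great", and "great" with more than "the".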

In [7]:
Y_train = ['pos']*500 + ['neg']*500
len(Y_train)

1000

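In this cell, I build the model. I use Multinomial Naive Bayes because it works well for text classification and the data is balanced (if it weren't, I could use Complement Naive Bayes). Note that the accuracy reported next is measured on the training data itself, so it is an optimistic estimate of how the model will perform on unseen reviews.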

In [8]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, Y_train)
predicted = clf.predict(X_train_tfidf)

In [9]:
from sklearn.metrics import accuracy_score
print("Accuracy", accuracy_score(Y_train, predicted))

Accuracy 0.982


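Now I use the model on the test data, applying the same transformations before prediction. The vectorizer and TF-IDF transformer are reused as fitted on the training set (`transform`, not `fit_transform`), so the test set does not influence the vocabulary or the weights.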

In [10]:
X_test_tf = count_vect.transform(X_test_reviews)
X_test_tfidf = tfidf_transformer.transform(X_test_tf)
Y_test = ['pos']*500 + ['neg']*500
predicted = clf.predict(X_test_tfidf)
print("Accuracy", accuracy_score(Y_test, predicted))

Accuracy 0.819
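Because the same three steps (count, TF-IDF, classify) have to be applied in the same order to every new batch of text, they could also be bundled into a scikit-learn `Pipeline`. This is a sketch of that alternative on made-up sentences, not the code that produced the accuracies above; the step names and the `toy_*` variables are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One object that chains vectorizing, TF-IDF weighting, and classification.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])

# Made-up training sentences, two per class.
toy_train = [
    "wonderful brilliant film",
    "brilliant acting wonderful story",
    "terrible boring film",
    "boring story terrible acting",
]
toy_labels = ["pos", "pos", "neg", "neg"]

# fit() fits each transformer on the training text and then fits the classifier.
text_clf.fit(toy_train, toy_labels)

# predict() only calls transform() on the fitted steps before classifying.
toy_pred = text_clf.predict(["wonderful brilliant", "terrible boring"])
print(list(toy_pred))
```

With the real data, the equivalent calls would be `text_clf.fit(X_train_reviews, Y_train)` followed by `text_clf.predict(X_test_reviews)`.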


I repeat the same procedure on some new data. I found reviews online for various movies, and I report their actual numeric ratings in the last cell.

In [11]:
new_reviews = []
for filename in os.listdir(data_dir +"/new_reviews/"):
    with open(os.path.join(data_dir +"/new_reviews/", filename), 'r') as file:
        data = file.read().replace('\n', '')
        new_reviews.append(data)
len(new_reviews)

5

In [12]:
X_new_tf = count_vect.transform(new_reviews)
X_new_tfidf = tfidf_transformer.transform(X_new_tf)
predicted = clf.predict(X_new_tfidf)
predicted

array(['neg', 'pos', 'neg', 'pos', 'neg'], dtype='<U3')

### Ratings for the new reviews:
1. 1.5/4
2. 8/10
3. 2.5/5
4. 5/5
5. 2.5/4

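Based on these ratings, the classifier predicted four of the five reviews correctly. The exception is review 5 (2.5/4, or 62.5%), which was labeled negative but would count as positive if anything above half the scale is considered positive. Review 3 (2.5/5) sits exactly at the midpoint, so under that rule it is not positive.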
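The comparison can be checked with a few lines of arithmetic. This sketch assumes the ratings are listed in the same order as the predictions and uses the rule stated above (strictly more than half the scale counts as positive); `label_from_rating` is a hypothetical helper name.

```python
# (score, scale) pairs copied from the ratings list, in the same order.
ratings = [(1.5, 4), (8, 10), (2.5, 5), (5, 5), (2.5, 4)]

# Predictions copied from the classifier output above.
predicted_labels = ["neg", "pos", "neg", "pos", "neg"]

def label_from_rating(score, scale):
    """Treat a rating strictly above half of its scale as positive."""
    return "pos" if score / scale > 0.5 else "neg"

# Collect the 1-based positions where the rating-based label disagrees with the prediction.
mismatches = [
    position
    for position, ((score, scale), pred) in enumerate(zip(ratings, predicted_labels), start=1)
    if label_from_rating(score, scale) != pred
]
print(mismatches)
```

Under this rule, only the fifth review disagrees with the classifier.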