# Movie Reviews - Classification - AITAMALIK Amine

With all our datasets created, we now apply our models. The goal is to compare Naive Bayes with a linear SVM and Logistic Regression in terms of training and testing accuracy. We also study the most impactful words, i.e. those whose weights contribute most to determining the class of a review. Finally, we look at causes of our models' errors, such as sarcastic reviews.

### Imports

In [1]:
import numpy as np
import pvml
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import svm

# PART 1 - Normal Data

## Data

In [2]:
train_data = np.loadtxt("train.txt.gz")
X_train = train_data[:, :-1]
Y_train = train_data[:, -1]

print("Train Data Loaded")

Train Data Loaded


In [3]:
test_data = np.loadtxt("test.txt.gz")
X_test = test_data[:, :-1]
Y_test = test_data[:, -1]

print("Test Data Loaded")

Test Data Loaded


## I. Naive Bayes

### a) Training

In [4]:
def train_nb(X, Y):
    m = X.shape[0]
    n = X.shape[1]
    
    # Positive Reviews
    pos_counter = X[Y == 1, :].sum(0)
    pi_pos = (1 + pos_counter) / (n + pos_counter.sum())
    prior_pos = Y.sum() / m
    
    # Negative Reviews
    neg_counter = X[Y == 0, :].sum(0)
    pi_neg = (1 + neg_counter) / (n + neg_counter.sum())
    prior_neg = 1 - prior_pos
    
    w = np.log(pi_pos) - np.log(pi_neg)
    b = np.log(prior_pos) - np.log(prior_neg)
    
    return w, b

In [5]:
nb_w, nb_b = train_nb(X_train, Y_train)
print("Classifier trained")

Classifier trained


### b) Accuracies

In [3]:
def inference_nb(X, w, b):
    scores = X @ w + b
    labels = (scores > 0).astype(int)
    
    return labels
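
As a sanity check, the two functions above can be exercised on a tiny hand-made bag-of-words matrix. The sketch below repeats `train_nb` and `inference_nb` in compact form so that it runs on its own; the three-word vocabulary, the counts, and the `*_toy` names are made up for illustration and are not part of the dataset.

```python
import numpy as np

def train_nb(X, Y):
    m, n = X.shape
    pos_counter = X[Y == 1, :].sum(0)
    neg_counter = X[Y == 0, :].sum(0)
    # Laplace smoothing: add 1 to every count so that no probability is zero
    pi_pos = (1 + pos_counter) / (n + pos_counter.sum())
    pi_neg = (1 + neg_counter) / (n + neg_counter.sum())
    prior_pos = Y.sum() / m
    w = np.log(pi_pos) - np.log(pi_neg)
    b = np.log(prior_pos) - np.log(1 - prior_pos)
    return w, b

def inference_nb(X, w, b):
    # A positive score means the positive class is more likely
    return (X @ w + b > 0).astype(int)

# Hypothetical vocabulary: ["great", "bad", "movie"]
X_toy = np.array([[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]], dtype=float)
Y_toy = np.array([1, 1, 0, 0])

w_toy, b_toy = train_nb(X_toy, Y_toy)

# Two unseen toy reviews: "great movie" and "bad movie"
new_reviews = np.array([[1, 0, 1], [0, 1, 1]], dtype=float)
labels_toy = inference_nb(new_reviews, w_toy, b_toy)
print(w_toy, b_toy, labels_toy)
```

In this toy setting "great" receives a positive weight, "bad" a negative one, and "movie", which is equally frequent in both classes, receives a weight of zero, so it has no influence on the decision.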

#### Training Accuracy

In [20]:
nb_train_predictions = inference_nb(X_train, nb_w, nb_b)
nb_train_accuracy = (nb_train_predictions == Y_train).mean()

print(f"Naive Bayes Training Accuracy: {nb_train_accuracy * 100}%")

Naive Bayes Training Accuracy: 80.032%


#### Testing Accuracy

In [21]:
nb_test_predictions = inference_nb(X_test, nb_w, nb_b)
nb_test_accuracy = (nb_test_predictions == Y_test).mean()

print(f"Naive Bayes Testing Accuracy: {nb_test_accuracy * 100}%")

Naive Bayes Testing Accuracy: 79.75999999999999%


## II. Analysis

### a) Impactful Words

In [114]:
# Read the vocabulary; the context manager closes the file automatically
with open("vocabulary.txt", encoding="utf-8") as f:
    vocabulary = f.read().split()

#### Most Impactful Words

In [118]:
print("Words Most Likely in Negative Reviews:")
print("")

indices = np.argsort(nb_w)

for i in indices[:10]:
    print(f" {vocabulary[i]}: {round(nb_w[i], 4)}")

Words Most Likely in Negative Reviews:

 waste: -2.6284
 worst: -2.2884
 awful: -2.0895
 poorly: -2.0768
 lame: -1.9208
 horrible: -1.7964
 crap: -1.7379
 terrible: -1.686
 stupid: -1.6793
 bad,: -1.677


In [117]:
print("Words Most Likely in Positive Reviews:")
print("")

for i in indices[-10:]:
    print(f" {vocabulary[i]}: {round(nb_w[i], 4)}")

Words Most Likely in Positive Reviews:

 perfectly: 1.2845
 perfect: 1.2881
 loved: 1.3188
 brilliant: 1.3455
 powerful: 1.3764
 amazing: 1.4578
 superb: 1.5322
 excellent: 1.5671
 wonderful: 1.5835
 fantastic: 1.5851
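
Because `train_nb` defines each weight as `log(pi_pos) - log(pi_neg)`, exponentiating a weight gives the ratio between the smoothed word probabilities of the two classes. The sketch below converts a few of the printed weights into such ratios; the helper name `likelihood_ratio` is hypothetical and only used for illustration.

```python
import math

# Weights copied from the printed lists above
printed_weights = {
    "waste": -2.6284,
    "worst": -2.2884,
    "fantastic": 1.5851,
    "wonderful": 1.5835,
}

def likelihood_ratio(weight):
    """Ratio of the word's estimated probability in its favoured class to the other class."""
    return math.exp(abs(weight))

for word, weight in printed_weights.items():
    favoured = "positive" if weight > 0 else "negative"
    ratio = likelihood_ratio(weight)
    print(f"{word}: about {ratio:.1f}x more likely in {favoured} reviews")
```

Read this way, the negative list is more extreme than the positive one: the strongest negative word has a larger ratio than the strongest positive word.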


## III. SUPPORT VECTOR MACHINES

In [104]:
linear_svm = svm.LinearSVC(max_iter=10000, dual=False).fit(X_train, Y_train)

#### Training Accuracy

In [105]:
linear_svm_train_predictions = linear_svm.predict(X_train)
linear_svm_train_accuracy = (linear_svm_train_predictions == Y_train).mean()

print(f"Linear SVM Train Accuracy: {linear_svm_train_accuracy * 100}%")

Linear SVM Train Accuracy: 88.536%


#### Testing Accuracy

In [106]:
linear_svm_test_predictions = linear_svm.predict(X_test)
linear_svm_test_accuracy = (linear_svm_test_predictions == Y_test).mean()

print(f"Linear SVM Test Accuracy: {linear_svm_test_accuracy * 100}%")

Linear SVM Test Accuracy: 84.544%


## IV. LOGISTIC REGRESSION

In [25]:
logreg = LogisticRegression(
    random_state=0,
    solver='liblinear'
).fit(X_train,Y_train)

#### Training Accuracy

In [26]:
logreg_train_predictions = logreg.predict(X_train)
logreg_train_accuracy = (logreg_train_predictions == Y_train).mean()

print(f"Logistic Regression Train Accuracy: {logreg_train_accuracy * 100}%")

Logistic Regression Train Accuracy: 88.75999999999999%


#### Testing Accuracy

In [27]:
logreg_test_predictions = logreg.predict(X_test)
logreg_test_accuracy = (logreg_test_predictions == Y_test).mean()

print(f"Logistic Regression Test Accuracy: {logreg_test_accuracy * 100}%")

Logistic Regression Test Accuracy: 84.608%


# PART 2 - Bigger Vocabulary

## Data

In [4]:
train_data_big_voc = np.loadtxt("big_voc_train.txt.gz")
X_train_big_voc = train_data_big_voc[:, :-1]
Y_train_big_voc = train_data_big_voc[:, -1]

print("Train Data Loaded")

Train Data Loaded


In [5]:
test_data_big_voc = np.loadtxt("big_voc_test.txt.gz")
X_test_big_voc = test_data_big_voc[:, :-1]
Y_test_big_voc = test_data_big_voc[:, -1]

print("Test Data Loaded")

Test Data Loaded


## I. Naive Bayes

In [6]:
nb_w_big_voc, nb_b_big_voc = train_nb(X_train_big_voc, Y_train_big_voc)

print("Classifier trained")

Classifier trained


#### Training Accuracy

In [7]:
nb_train_predictions_big_voc = inference_nb(X_train_big_voc, nb_w_big_voc, nb_b_big_voc)
nb_train_accuracy_big_voc = (nb_train_predictions_big_voc == Y_train_big_voc).mean()

print(f"Naive Bayes Training Accuracy: {nb_train_accuracy_big_voc * 100}%")

Naive Bayes Training Accuracy: 81.76%


#### Testing Accuracy

In [8]:
nb_test_predictions_big_voc = inference_nb(X_test_big_voc, nb_w_big_voc, nb_b_big_voc)
nb_test_accuracy_big_voc = (nb_test_predictions_big_voc == Y_test_big_voc).mean()

print(f"Naive Bayes Testing Accuracy: {nb_test_accuracy_big_voc * 100}%")

Naive Bayes Testing Accuracy: 80.56%


## II. SUPPORT VECTOR MACHINES

In [101]:
linear_svm_big_voc = svm.LinearSVC(
    max_iter=10000,
    dual=False
).fit(X_train_big_voc, Y_train_big_voc)

#### Training Accuracy

In [102]:
linear_svm_train_predictions_big_voc = linear_svm_big_voc.predict(X_train_big_voc)
linear_svm_train_accuracy_big_voc = (linear_svm_train_predictions_big_voc == Y_train_big_voc).mean()

print(f"Linear SVM Train Accuracy: {linear_svm_train_accuracy_big_voc * 100}%")

Linear SVM Train Accuracy: 90.58800000000001%


#### Testing Accuracy

In [103]:
linear_svm_test_predictions_big_voc = linear_svm_big_voc.predict(X_test_big_voc)
linear_svm_test_accuracy_big_voc = (linear_svm_test_predictions_big_voc == Y_test_big_voc).mean()

print(f"Linear SVM Test Accuracy: {linear_svm_test_accuracy_big_voc * 100}%")

Linear SVM Test Accuracy: 86.36%


## III. LOGISTIC REGRESSION

In [59]:
logreg_big_voc = LogisticRegression(
    random_state=0,
    solver='liblinear'
).fit(X_train_big_voc, Y_train_big_voc)

#### Training Accuracy

In [60]:
logreg_train_predictions_big_voc = logreg_big_voc.predict(X_train_big_voc)
logreg_train_accuracy_big_voc = (logreg_train_predictions_big_voc == Y_train_big_voc).mean()

print(f"Logistic Regression Train Accuracy: {logreg_train_accuracy_big_voc * 100}%")

Logistic Regression Train Accuracy: 90.768%


#### Testing Accuracy

In [61]:
logreg_test_predictions_big_voc = logreg_big_voc.predict(X_test_big_voc)
logreg_test_accuracy_big_voc = (logreg_test_predictions_big_voc == Y_test_big_voc).mean()

print(f"Logistic Regression Test Accuracy: {logreg_test_accuracy_big_voc * 100}%")

Logistic Regression Test Accuracy: 86.37599999999999%


# PART 3 - STEMMED DATA WITHOUT STOPWORDS 

## Data

In [31]:
train_data_stem_sw = np.loadtxt("train_stem_sw.txt.gz")
X_train_stem_sw = train_data_stem_sw[:, :-1]
Y_train_stem_sw = train_data_stem_sw[:, -1]

print("Train Data Loaded")

Train Data Loaded


In [32]:
test_data_stem_sw = np.loadtxt("test_stem_sw.txt.gz")
X_test_stem_sw = test_data_stem_sw[:, :-1]
Y_test_stem_sw = test_data_stem_sw[:, -1]

print("Test Data Loaded")

Test Data Loaded


## I. Naive Bayes

In [33]:
nb_w_stem_sw, nb_b_stem_sw = train_nb(X_train_stem_sw, Y_train_stem_sw)

print("Classifier trained")

Classifier trained


#### Training Accuracy

In [34]:
nb_train_predictions_stem_sw = inference_nb(X_train_stem_sw, nb_w_stem_sw, nb_b_stem_sw)
nb_train_accuracy_stem_sw = (nb_train_predictions_stem_sw == Y_train_stem_sw).mean()

print(f"Naive Bayes Training Accuracy: {nb_train_accuracy_stem_sw * 100}%")

Naive Bayes Training Accuracy: 79.104%


#### Testing Accuracy

In [35]:
nb_test_predictions_stem_sw = inference_nb(X_test_stem_sw, nb_w_stem_sw, nb_b_stem_sw)
nb_test_accuracy_stem_sw = (nb_test_predictions_stem_sw == Y_test_stem_sw).mean()

print(f"Naive Bayes Testing Accuracy: {nb_test_accuracy_stem_sw * 100}%")

Naive Bayes Testing Accuracy: 77.656%


## II. SUPPORT VECTOR MACHINES

In [96]:
linear_svm_stem_sw = svm.LinearSVC(
    max_iter=10000,
    dual=False
).fit(X_train_stem_sw, Y_train_stem_sw)

#### Training Accuracy

In [99]:
linear_svm_train_predictions_stem_sw = linear_svm_stem_sw.predict(X_train_stem_sw)
linear_svm_train_accuracy_stem_sw = (linear_svm_train_predictions_stem_sw == Y_train_stem_sw).mean()

print(f"Linear SVM Train Accuracy: {linear_svm_train_accuracy_stem_sw * 100}%")

Linear SVM Train Accuracy: 82.50800000000001%


#### Testing Accuracy

In [100]:
linear_svm_test_predictions_stem_sw = linear_svm_stem_sw.predict(X_test_stem_sw)
linear_svm_test_accuracy_stem_sw = (linear_svm_test_predictions_stem_sw == Y_test_stem_sw).mean()

print(f"Linear SVM Test Accuracy: {linear_svm_test_accuracy_stem_sw * 100}%")

Linear SVM Test Accuracy: 80.264%


## III. LOGISTIC REGRESSION

In [109]:
logreg_stem_sw = LogisticRegression(
    random_state=0,
    solver='liblinear'
).fit(X_train_stem_sw, Y_train_stem_sw)

#### Training Accuracy

In [110]:
logreg_train_predictions_stem_sw = logreg_stem_sw.predict(X_train_stem_sw)
logreg_train_accuracy_stem_sw = (logreg_train_predictions_stem_sw == Y_train_stem_sw).mean()

print(f"Logistic Regression Train Accuracy: {logreg_train_accuracy_stem_sw * 100}%")

Logistic Regression Train Accuracy: 82.628%


#### Testing Accuracy

In [111]:
logreg_test_predictions_stem_sw = logreg_stem_sw.predict(X_test_stem_sw)
logreg_test_accuracy_stem_sw = (logreg_test_predictions_stem_sw == Y_test_stem_sw).mean()

print(f"Logistic Regression Test Accuracy: {logreg_test_accuracy_stem_sw * 100}%")

Logistic Regression Test Accuracy: 80.432%


# PART 4 - ACCURACY COMPARISON

### Training Accuracies

In [112]:
# Values rounded to two decimals from the accuracies printed above.
# The SVM used in this notebook is sklearn's LinearSVC.
d = {
    'Normal Data':["80.03%", "88.54%", "88.76%"],
    'Data with Bigger Vocabulary':["81.76%", "90.59%", "90.77%"],
    'Stemmed Data without Stopwords':["79.10%", "82.51%", "82.63%"],
    'Classifiers': ["Naive Bayes", "Linear SVM", "Logistic Regression"]
}

print("Training Accuracies")

df = pd.DataFrame(d)
df.set_index('Classifiers')

Training Accuracies


| Classifiers | Normal Data | Data with Bigger Vocabulary | Stemmed Data without Stopwords |
|---|---|---|---|
| Naive Bayes | 80.03% | 81.76% | 79.10% |
| Linear SVM | 88.54% | 90.59% | 82.51% |
| Logistic Regression | 88.76% | 90.77% | 82.63% |


### Testing Accuracies

In [113]:
# Values rounded to two decimals from the accuracies printed above.
# The SVM used in this notebook is sklearn's LinearSVC.
d = {
    'Normal Data':["79.76%", "84.54%", "84.61%"],
    'Data with Bigger Vocabulary':["80.56%", "86.36%", "86.38%"],
    'Stemmed Data without Stopwords':["77.66%", "80.26%", "80.43%"],
    'Classifiers': ["Naive Bayes", "Linear SVM", "Logistic Regression"]
}

print("Testing Accuracies")

df = pd.DataFrame(d)
df.set_index('Classifiers')

Testing Accuracies


| Classifiers | Normal Data | Data with Bigger Vocabulary | Stemmed Data without Stopwords |
|---|---|---|---|
| Naive Bayes | 79.76% | 80.56% | 77.66% |
| Linear SVM | 84.54% | 86.36% | 80.26% |
| Logistic Regression | 84.61% | 86.38% | 80.43% |
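
One way to read the two tables together is to look at the gap between training and testing accuracy for each classifier. The sketch below is plain Python and uses the accuracies printed by the cells earlier in the notebook (to three decimals); the dictionary layout and the shortened dataset names are choices made here for illustration.

```python
# Accuracies in percent, as printed by the cells above
classifiers = ["Naive Bayes", "Linear SVM", "Logistic Regression"]

train_acc = {
    "Normal Data": [80.032, 88.536, 88.760],
    "Bigger Vocabulary": [81.760, 90.588, 90.768],
    "Stemmed, no Stopwords": [79.104, 82.508, 82.628],
}
test_acc = {
    "Normal Data": [79.760, 84.544, 84.608],
    "Bigger Vocabulary": [80.560, 86.360, 86.376],
    "Stemmed, no Stopwords": [77.656, 80.264, 80.432],
}

# Gap = training accuracy minus testing accuracy, in percentage points
gaps = {
    dataset: {
        clf: round(tr - te, 3)
        for clf, tr, te in zip(classifiers, train_acc[dataset], test_acc[dataset])
    }
    for dataset in train_acc
}

for dataset, per_clf in gaps.items():
    for clf, gap in per_clf.items():
        print(f"{dataset:<22} {clf:<20} gap: {gap:.3f} points")
```

With these numbers, the Naive Bayes gap stays below 1.5 points on all three datasets, while the linear SVM and Logistic Regression gaps range from about 2.2 to 4.4 points.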


# PART 5 - EXAMPLE OF A SARCASTIC REVIEW

## a) Initialization 

#### Vocabulary with Indices

In [70]:
def load_vocabulary_with_indices(filename):
    with open(filename, encoding="utf-8") as f:
        words = f.read().split()

    # Map each word to its position in the vocabulary file
    return {word: index for index, word in enumerate(words)}

#### Split Document into BoW Representation

In [71]:
PUNCT = "!#$%&()''*+-/.:;?@[]{}|^_`~<>=\"" # all punctuation we discard
TABLE = str.maketrans(PUNCT, " " * len(PUNCT)) # replace punctuation by space

def txt_as_bow(filename, voc):
    f = open(filename, encoding="utf-8") # specify encoding to avoid unreadable documents
    text = f.read()
    f.close()
    
    text = text.lower() # all words to lowercase
    text = text.translate(TABLE)
    words = text.split() # separate the document into list of words
    
    # Bag of Words
    bow = np.zeros(len(voc))
    for word in words:
        if word in voc:
            index = voc[word]
            bow[index] += 1
    
    return bow

#### Display Document

In [72]:
def display_text_document(index, pos=False):
    # Pick the folder of positive or negative test reviews
    if pos:
        paths = glob.glob("aclImdb/test/pos/*.txt")
    else:
        paths = glob.glob("aclImdb/test/neg/*.txt")

    # `index` is a position in the glob result, not a file name
    with open(paths[index], encoding="utf-8") as f: # specify encoding to avoid unreadable documents
        text = f.read()

    text = text.replace("<br /><br />", "") # strip the HTML line breaks

    return text

## b) Application

In [74]:
vocabulary_with_indices = load_vocabulary_with_indices("vocabulary.txt")
documents_37 = []
labels_37 = []

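path_neg_37 = glob.glob("aclImdb/test/neg/*.txt")[37]     # 38th negative review (index 37) of the test set, in glob order
bow_37 = txt_as_bow(path_neg_37, vocabulary_with_indices) # BoW representation of the review

documents_37.append(bow_37)
# NOTE: the label is hard-coded to 1 here, whereas train_nb treats Y == 1 as positive
# and this file comes from the "neg" folder, whose label under that convention is 0.
labels_37.append(1)
labels_37 = np.array(labels_37)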

data_37 = np.concatenate([
    documents_37,
    labels_37[:, None]
],1)

X_37 = data_37[:, :-1]
Y_37 = data_37[:, -1]

prediction_37 = inference_nb(X_37, nb_w, nb_b)

print(path_neg_37)
print("")

print(f"Prediction: {prediction_37} , Real Class: {Y_37}")
print("")

print(display_text_document(37))

aclImdb/test/neg\10066_1.txt

Prediction: [0] , Real Class: [1.]

Looking for a REAL super bad movie? If you wanna have great fun, don't hesitate and check this one!Ferrigno is incredibly bad but is also the best of this mediocrity.
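
To see which words drive the Naive Bayes score of a single review, the bag-of-words vector can be multiplied element-wise by the weight vector. The sketch below is a minimal illustration: the five-word vocabulary and the weights are invented for the example (they are not the trained `nb_w`), and the helper `top_contributions` is hypothetical. The counts loosely mirror the review above, where "bad" appears twice next to "great", "fun" and "best".

```python
import numpy as np

def top_contributions(bow, w, vocabulary, k=3):
    """Return the k words with the largest absolute contribution to the score."""
    contributions = bow * w                        # per-word share of X @ w
    order = np.argsort(-np.abs(contributions))[:k]
    return [(vocabulary[i], float(contributions[i])) for i in order if bow[i] > 0]

# Hypothetical vocabulary and weights, for illustration only
toy_vocabulary = ["bad", "great", "fun", "best", "mediocrity"]
toy_w = np.array([-1.2, 0.9, 0.6, 0.8, -1.0])

# Word counts loosely based on the review above
toy_bow = np.array([2.0, 1.0, 1.0, 1.0, 1.0])

score = float(toy_bow @ toy_w)  # bias omitted in this sketch
top = top_contributions(toy_bow, toy_w, toy_vocabulary)

print(f"score: {score:.2f}")
for word, contribution in top:
    print(f" {word}: {contribution:+.2f}")
```

Applied to the real `bow_37`, `nb_w` and `vocabulary`, the same idea would list the words that pull this review towards one class or the other, which is useful when a review mixes praise words with a negative verdict.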
