<a href="https://colab.research.google.com/github/hayden-huynh/Rotten-Tomatoes-Review-Classifier/blob/master/RT_Reviews_NaiveBayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment Process

1. Download the Rotten Tomatoes Reviews [dataset](https://www.kaggle.com/datasets/ulrikthygepedersen/rotten-tomatoes-reviews)
2. Text data pre-processing:
   - Lower-casing
   - Punctuation removal
   - Tokenization
3. Split the original dataset into smaller *train* (70%), *dev* (10%), and *test* (20%) datasets
4. Train the classifier using the *train* dataset
   - Calculate and store the P(fresh) and P(rotten) priors
   - Calculate and store the likelihoods of words
5. Improve the classifier using the *dev* dataset
   - Smoothing
   - Float probability vs. log probability
   - Stemming and lemmatization (?)
6. Test the accuracy of the classifier using the *test* dataset
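
For orientation, the decision rule that steps 4 and 5 build toward can be written out as below. This is a sketch in notation introduced here, not taken from the notebook: $n_c(w)$ is the number of class-$c$ training reviews containing word $w$, $N_c$ is the number of class-$c$ training reviews, $V_c$ is the set of distinct words seen in class $c$, and $\alpha$ is the smoothing constant. The smoothed likelihood follows the form used by `calc_word_likelihood` in the training code.

```latex
\hat{c} = \arg\max_{c \in \{\text{fresh},\,\text{rotten}\}}
  \left[ \log_{10} P(c) + \sum_{w \in \text{review}} \log_{10} P(w \mid c) \right],
\qquad
P(w \mid c) = \frac{n_c(w) + \alpha}{N_c + \alpha\,|V_c|}
```

Because the logarithm is monotonic, maximizing the sum of logs picks the same class as maximizing the product of the raw probabilities.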

# Download Dataset from Kaggle

In [1]:
# Download the Rotten Tomatoes Reviews dataset from Kaggle
# Reference 1 (Ref 1): https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

# Ref 1 starts =====
! pip install kaggle
! mkdir -p ~/.kaggle
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download ulrikthygepedersen/rotten-tomatoes-reviews
! unzip rotten-tomatoes-reviews.zip
# ===== Ref 1 ends

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Downloading rotten-tomatoes-reviews.zip to /content
 64% 17.0M/26.5M [00:00<00:00, 57.8MB/s]
100% 26.5M/26.5M [00:00<00:00, 74.9MB/s]
Archive:  rotten-tomatoes-reviews.zip
  inflating: rt_reviews.csv          


# Text Pre-processing

In [2]:
# Reference 2 (Ref 2): https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
import pandas as pd
import numpy as np
import string
import re

reviews = pd.read_csv("/content/rt_reviews.csv", encoding="latin-1")

print("----- Samples -----")
print(reviews.head(5))

print("\n----- Summary -----")
print(reviews.describe())

----- Samples -----
  Freshness                                             Review
0     fresh   Manakamana doesn't answer any questions, yet ...
1     fresh   Wilfully offensive and powered by a chest-thu...
2    rotten   It would be difficult to imagine material mor...
3    rotten   Despite the gusto its star brings to the role...
4    rotten   If there was a good idea at the core of this ...

----- Summary -----
       Freshness                    Review
count     480000                    480000
unique         2                    339697
top        fresh   Parental Content Review
freq      240000                       166


In [3]:
# Lower-case all words
reviews["Review_lower"] = reviews["Review"].str.lower()

# Remove punctuation
def remove_punctuations(text):
  return "".join(char for char in text if char not in string.punctuation)

reviews["Review_clean"] = reviews["Review_lower"].apply(remove_punctuations)

# Tokenize and remove duplicate words (each review becomes a sorted list of unique tokens)
def tokenize(text):
  tokens = re.split(r"\W+", text)  # raw string avoids an invalid-escape warning
  tokens = list(filter(None, tokens))
  return sorted(set(tokens))

reviews["Review_tokens"] = reviews["Review_clean"].apply(tokenize)

reviews.sample(5)

| | Freshness | Review | Review_lower | Review_clean | Review_tokens |
|---|---|---|---|---|---|
| 157184 | rotten | Although it is an awesome technical feat, the... | although it is an awesome technical feat, the... | although it is an awesome technical feat the ... | [a, although, an, awesome, be, before, clever,... |
| 174720 | rotten | [W]hile the film offers silly fun for a while... | [w]hile the film offers silly fun for a while... | while the film offers silly fun for a while i... | [a, and, become, film, for, fun, its, offers, ... |
| 287481 | fresh | Birth of a Nation wants to reinforce while th... | birth of a nation wants to reinforce while th... | birth of a nation wants to reinforce while th... | [a, and, birth, black, bones, disenfranchised,... |
| 637 | rotten | 21 and Over is pretty much for people with an... | 21 and over is pretty much for people with an... | 21 and over is pretty much for people with an... | [21, an, and, for, iq, is, much, of, over, peo... |
| 429392 | fresh | Fast Five boasts incredible action scenes tha... | fast five boasts incredible action scenes tha... | fast five boasts incredible action scenes tha... | [a, action, all, and, are, as, boasts, breeze,... |
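
To see what the three pre-processing steps do to a single review, the following minimal sketch chains them on one made-up sentence. The helper name `preprocess` and the sample text are assumptions for illustration only; the notebook itself applies the steps column by column.

```python
import re
import string

def preprocess(text):
    """Lower-case, strip ASCII punctuation, and return sorted unique tokens."""
    lowered = text.lower()
    # Delete every character listed in string.punctuation
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on runs of non-word characters and drop empty strings
    tokens = [t for t in re.split(r"\W+", no_punct) if t]
    return sorted(set(tokens))

print(preprocess("Fast Five boasts incredible action, action, ACTION!"))
```

Note that de-duplication discards how often a word appears in a review, so the later counts record only whether a review contains a word.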


# Split the Original Dataset

In [133]:
# Reference 3 (Ref 3): https://stackoverflow.com/questions/43777243/how-to-split-a-dataframe-in-pandas-in-predefined-percentages 

# Ref 3 starts =====
def split_by_fractions(df, fracs, random_state=0):
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        fractions_sum = sum(fracs[i:])
        frac = fracs[i]/fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain=remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]
# Ref 3 ends =====

random_state = 1
train, dev, test = split_by_fractions(reviews, [0.7, 0.1, 0.2], random_state)
print(train.shape, dev.shape, test.shape)

(336000, 5) (48000, 5) (96000, 5)
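
As a cross-check on the split sizes, here is a pandas-free sketch of the same idea: shuffle the row indices once, then cut them into consecutive chunks. The function name `split_indices` and its `seed` parameter are hypothetical, and this is not the Ref 3 implementation used above.

```python
import random

def split_indices(n, fracs, seed=1):
    """Shuffle range(n) and cut it into consecutive chunks sized by fracs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for i, frac in enumerate(fracs):
        # The last part takes whatever remains, so no row is dropped
        end = n if i == len(fracs) - 1 else start + round(frac * n)
        parts.append(indices[start:end])
        start = end
    return parts

train_idx, dev_idx, test_idx = split_indices(480000, [0.7, 0.1, 0.2])
print(len(train_idx), len(dev_idx), len(test_idx))
```

For 480,000 rows this yields chunks of 336,000, 48,000, and 96,000 indices, matching the shapes printed above, and the three chunks never share a row.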


# Training

In [134]:
from decimal import Decimal

train_fresh = train.loc[train["Freshness"] == "fresh"]
train_rotten = train.loc[train["Freshness"] == "rotten"]

# P(fresh) and P(rotten) priors
p_fresh = Decimal(len(train_fresh) / len(train))
p_rotten = Decimal(len(train_rotten) / len(train))

print(f'P(fresh) = {p_fresh}')
print(f'P(rotten) = {p_rotten}')

P(fresh) = 0.50085119047619042209618100969237275421619415283203125
P(rotten) = 0.499148809523809522392667759049800224602222442626953125
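
The long digit strings above appear because the division is done in floating point before the result is handed to `Decimal`, which then reproduces the binary float exactly. A small sketch with made-up counts (1 and 10 are assumptions chosen only to make the effect visible) shows the difference between the two orders of operation:

```python
from decimal import Decimal

n_fresh, n_total = 1, 10  # hypothetical counts, not taken from the dataset

via_float = Decimal(n_fresh / n_total)             # float division first, then exact conversion
via_decimal = Decimal(n_fresh) / Decimal(n_total)  # division carried out in Decimal arithmetic

print(via_float)
print(via_decimal)
```

The first value carries the float's binary rounding error into the `Decimal`, while the second is exactly 0.1. For this classifier the difference is far too small to change a prediction, so either form works.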


In [135]:
# Count word occurrences per class. Tokens are de-duplicated within each review,
# so each count is the number of reviews of that class containing the word.
occ_fresh = {}
occ_rotten = {}

for words in train_fresh["Review_tokens"]:
  for w in words:
    occ_fresh[w] = occ_fresh.get(w, 0) + 1

for words in train_rotten["Review_tokens"]:
  for w in words:
    occ_rotten[w] = occ_rotten.get(w, 0) + 1

In [166]:
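# Calculate word probabilities given fresh or rotten
probs_fresh = {}
probs_rotten = {}

# Smoothed likelihood:
# (count + alpha) / (number of reviews in the class + alpha * number of distinct words in the class)
def calc_word_likelihood(count, alpha, h):
  if h == "fresh":
    return Decimal((count + alpha) / (len(train_fresh) + alpha * len(occ_fresh)))
  elif h == "rotten":
    return Decimal((count + alpha) / (len(train_rotten) + alpha * len(occ_rotten)))
  raise ValueError(f"Unknown class label: {h}")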

def calc_prob(alpha=0):
  for word, count in occ_fresh.items():
    probs_fresh[word] = calc_word_likelihood(count, alpha, "fresh")
  
  for word, count in occ_rotten.items():
    probs_rotten[word] = calc_word_likelihood(count, alpha, "rotten")

alpha = 10

calc_prob(alpha)
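
To make the smoothed likelihood concrete, the sketch below applies the same formula as `calc_word_likelihood` to four made-up reviews. The toy data and the helper names `train_counts` and `likelihood` are assumptions for illustration and do not appear in the notebook.

```python
from collections import Counter
from decimal import Decimal

# Hypothetical toy data: each review is already a de-duplicated token list
toy_train = [
    ("fresh", ["a", "fun", "great", "ride"]),
    ("fresh", ["a", "film", "great"]),
    ("rotten", ["a", "dull", "film"]),
    ("rotten", ["a", "dull", "mess"]),
]

def train_counts(rows):
    """Return per-class review totals and per-class word counts."""
    n_reviews = Counter()
    word_counts = {}
    for label, tokens in rows:
        n_reviews[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
    return n_reviews, word_counts

def likelihood(word, label, n_reviews, word_counts, alpha):
    """Smoothed estimate: (count + alpha) / (N_class + alpha * |V_class|)."""
    counts = word_counts[label]
    numerator = Decimal(counts[word] + alpha)  # a missing word counts as 0
    denominator = Decimal(n_reviews[label] + alpha * len(counts))
    return numerator / denominator

n_reviews, word_counts = train_counts(toy_train)
for word in ["great", "dull", "unseen"]:
    p_f = likelihood(word, "fresh", n_reviews, word_counts, alpha=1)
    p_r = likelihood(word, "rotten", n_reviews, word_counts, alpha=1)
    print(word, p_f, p_r)
```

With `alpha=1`, "great" scores 3/7 for fresh against 1/6 for rotten, and a word never seen in training still receives a small non-zero likelihood in both classes, which is the purpose of smoothing.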

# Experimenting with *dev* dataset

In [167]:
import csv
import os

# Function to classify a review
def classify(review_words, alpha):
  for w in review_words:
    if w not in probs_fresh.keys():
      probs_fresh[w] = calc_word_likelihood(0, alpha, "fresh")
    if w not in probs_rotten.keys():
      probs_rotten[w] = calc_word_likelihood(0, alpha, "rotten")
  
  chance_fresh = p_fresh
  chance_rotten = p_rotten
  for w in review_words:
    chance_fresh = chance_fresh * probs_fresh[w]
    chance_rotten = chance_rotten * probs_rotten[w]
  
  if chance_fresh > chance_rotten:
    return "fresh"
  else:
    return "rotten"


# Function to classify a review in log space: sums log10 probabilities instead of
# multiplying raw probabilities, to avoid numeric underflow
def classify_log(review_words, alpha):
  for w in review_words:
    if w not in probs_fresh.keys():
      probs_fresh[w] = calc_word_likelihood(0, alpha, "fresh")
    if w not in probs_rotten.keys():
      probs_rotten[w] = calc_word_likelihood(0, alpha, "rotten")

  chance_fresh = p_fresh.log10()
  chance_rotten = p_rotten.log10()
  for w in review_words:
    chance_fresh = chance_fresh + probs_fresh[w].log10()
    chance_rotten = chance_rotten + probs_rotten[w].log10()

  if chance_fresh > chance_rotten:
    return "fresh"
  else:
    return "rotten"


# Function to measure classification accuracy over an entire dataset
def test_accuracy(dataset, alpha, use_log=False, csv_writer=None):
  correct = 0
  for index, row in dataset.loc[:,["Freshness", "Review_tokens"]].iterrows():
    if use_log:
      result = classify_log(row["Review_tokens"], alpha)
    else:
      result = classify(row["Review_tokens"], alpha)
    if row["Freshness"] == result:
      correct += 1
  accuracy = round(correct / len(dataset) * 100, 4)
  # Log the result only when a CSV writer is supplied
  if csv_writer is not None:
    csv_writer.writerow([alpha, accuracy])
  print(f"Successfully classified {correct}/{len(dataset)} ({accuracy}%) correctly")



# Experiment with Smoothing

dev_smoothing_path = "dev_smoothing_log.csv"
with open(dev_smoothing_path, "a", newline="") as dev_smoothing:
  dev_smoothing_writer = csv.writer(dev_smoothing)
  # Write the header only when the log file is new (empty)
  if os.path.getsize(dev_smoothing_path) == 0:
    dev_smoothing_writer.writerow(["alpha", "accuracy"])

  test_accuracy(dev, alpha, use_log=True, csv_writer=dev_smoothing_writer)

Successfully classified 38044/48000 (79.2583%) correctly
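
The "Float Probability vs Log Probability" item in the assignment process can be illustrated with plain Python floats. The sketch below uses 400 made-up likelihoods of 0.001 (an assumption, not data from the notebook): the running product collapses to zero, while the sum of logs stays usable. The notebook's `Decimal` values have a far wider exponent range than floats, so this shows the general motivation rather than a failure observed in the cells above.

```python
import math

probs = [1e-3] * 400  # hypothetical per-word likelihoods

# Multiplying many small floats eventually underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing logarithms keeps the score in a comfortable numeric range
log_sum = sum(math.log10(p) for p in probs)

print(product)
print(log_sum)
```

Once both class scores have underflowed to zero they can no longer be compared, whereas the log-space scores remain distinct and rank the classes in the same order.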


# Final Accuracy with *test* dataset

In [168]:
# test_accuracy(test, alpha)