<a href="https://colab.research.google.com/github/hayden-huynh/Rotten-Tomatoes-Review-Classifier/blob/master/RT_Reviews_NaiveBayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment Process

1. Download the Rotten Tomatoes Reviews [dataset](https://www.kaggle.com/datasets/ulrikthygepedersen/rotten-tomatoes-reviews)
2. Text data pre-processing:
  - Lower-casing
  - Punctuation Removal
  - Tokenization
3. Split the original dataset into smaller *train* (70%), *dev* (10%), and *test* (20%) datasets
4. Train the classifier using the *train* dataset
  - Calculate and store the P(fresh) and P(rotten) priors
  - Calculate and store the likelihoods of words
5. Improve the classifier using the *dev* dataset
  - Smoothing
  - Normal Probability vs Log Probability
  - Stemming and Lemmatization (?)
6. Test the accuracy of the classifier using the *test* dataset

# Download Dataset from Kaggle

In [None]:
# Download the Rotten Tomatoes Reviews dataset from Kaggle
# Reference 1 (Ref 1): https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

# Ref 1 starts =====
! pip install kaggle
! mkdir -p ~/.kaggle
# Assumes Google Drive is already mounted at /content/drive and contains kaggle.json
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download ulrikthygepedersen/rotten-tomatoes-reviews
! unzip rotten-tomatoes-reviews.zip
# ===== Ref 1 ends

# Text Pre-processing

In [2]:
# Reference 2 (Ref 2): https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
import pandas as pd
import numpy as np
import string
import re

reviews = pd.read_csv("/content/rt_reviews.csv", encoding="latin-1")

print("----- Samples -----")
print(reviews.head(5))

print("\n----- Summary -----")
print(reviews.describe())

----- Samples -----
  Freshness                                             Review
0     fresh   Manakamana doesn't answer any questions, yet ...
1     fresh   Wilfully offensive and powered by a chest-thu...
2    rotten   It would be difficult to imagine material mor...
3    rotten   Despite the gusto its star brings to the role...
4    rotten   If there was a good idea at the core of this ...

----- Summary -----
       Freshness                    Review
count     480000                    480000
unique         2                    339697
top        fresh   Parental Content Review
freq      240000                       166


In [3]:
# Lower-case all words
reviews["Review_lower"] = reviews["Review"].str.lower()

# Remove punctuation (ASCII punctuation only, as defined by string.punctuation)
def remove_punctuations(text):
  punc_free = "".join([char for char in text if char not in string.punctuation])
  return punc_free

reviews["Review_clean"] = reviews["Review_lower"].apply(remove_punctuations)

# Tokenize and remove duplicate words, so each word counts at most once per review
def tokenize(text):
  tokens = re.split(r"\W+", text)  # raw string avoids an invalid-escape warning
  tokens = list(filter(None, tokens))  # drop empty strings left by the split
  return sorted(set(tokens))

reviews["Review_tokens"] = reviews["Review_clean"].apply(tokenize)

reviews.sample(5)

Unnamed: 0,Freshness,Review,Review_lower,Review_clean,Review_tokens
215572,rotten,"Feels like a half-hearted jumble, full of laz...","feels like a half-hearted jumble, full of laz...",feels like a halfhearted jumble full of lazy ...,"[a, feels, full, halfhearted, jumble, lazy, li..."
299383,rotten,"If it was free, perhaps fans wouldn't judge i...","if it was free, perhaps fans wouldn't judge i...",if it was free perhaps fans wouldnt judge it ...,"[able, also, be, change, channel, fans, free, ..."
324962,fresh,The kind of film that works on you while you'...,the kind of film that works on you while you'...,the kind of film that works on you while your...,"[a, afterwards, but, dessert, ever, feeling, f..."
44361,rotten,The jokes are tasteless (far too many of them...,the jokes are tasteless (far too many of them...,the jokes are tasteless far too many of them ...,"[about, and, are, around, as, being, by, colle..."
360292,fresh,There's little doubt that Ocean's Eight comes...,there's little doubt that ocean's eight comes...,theres little doubt that oceans eight comes o...,"[a, and, as, comes, doubt, eight, endeavor, er..."
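
To make the three pre-processing steps concrete, here is a minimal sketch that runs them on a single made-up sentence. The helper name `preprocess` and the sample sentence are illustrative assumptions, not part of the notebook; the steps mirror the cell above (lower-casing, punctuation removal, tokenization with duplicates dropped).

```python
import re
import string

def preprocess(text):
    """Lower-case, strip ASCII punctuation, and return sorted unique tokens."""
    lowered = text.lower()
    # str.translate removes every character listed in string.punctuation
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on runs of non-word characters and drop empty strings
    return sorted(set(filter(None, re.split(r"\W+", cleaned))))

# Hypothetical sample sentence, not taken from the dataset
tokens = preprocess("A half-hearted jumble, full of LAZY jokes; a jumble!")
print(tokens)
```

Note that removing punctuation before tokenizing fuses hyphenated words ("half-hearted" becomes "halfhearted") and contractions ("wouldn't" becomes "wouldnt"), which matches the sample rows above.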


# Split the Original Dataset

In [4]:
# Reference 3 (Ref 3): https://stackoverflow.com/questions/43777243/how-to-split-a-dataframe-in-pandas-in-predefined-percentages 

# Ref 3 starts =====
def split_by_fractions(df, fracs, random_state=1):
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        fractions_sum = sum(fracs[i:])
        frac = fracs[i]/fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain=remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]
# ===== Ref 3 ends

train, dev, test = split_by_fractions(reviews, [0.7, 0.1, 0.2])
print(train.shape, dev.shape, test.shape)

(336000, 5) (48000, 5) (96000, 5)
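
As a sanity check on any 70/10/20 split, it may help to confirm that the parts have the expected sizes and share no rows. The sketch below is an alternative, simplified splitter (shuffle once, then slice), not the Ref 3 function; `split_shuffled` and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd

def split_shuffled(df, fracs, random_state=1):
    """Shuffle once, then cut consecutive slices sized by the fractions."""
    shuffled = df.sample(frac=1.0, random_state=random_state)
    parts, start, cumulative = [], 0, 0.0
    for i, frac in enumerate(fracs):
        cumulative += frac
        # The last slice takes everything left so rounding never drops a row
        end = len(df) if i == len(fracs) - 1 else round(cumulative * len(df))
        parts.append(shuffled.iloc[start:end])
        start = end
    return parts

# Toy data standing in for the reviews DataFrame
toy = pd.DataFrame({"Review": [f"review {i}" for i in range(100)]})
toy_train, toy_dev, toy_test = split_shuffled(toy, [0.7, 0.1, 0.2])

sizes = (len(toy_train), len(toy_dev), len(toy_test))
# If the combined index has no duplicates, the three parts are disjoint
all_idx = toy_train.index.append([toy_dev.index, toy_test.index])
print(sizes, all_idx.is_unique)
```

The same two checks (sizes and a unique combined index) can be applied to the `train`, `dev`, and `test` frames produced above.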


# Training

In [15]:
import math

train_fresh = train.loc[train["Freshness"] == "fresh"]
train_rotten = train.loc[train["Freshness"] == "rotten"]

# P(fresh) and P(rotten) priors
p_fresh = len(train_fresh) / len(train)
p_rotten = len(train_rotten) / len(train)

In [None]:
# Count word occurrences. Tokens are de-duplicated per review, so each count
# is the number of reviews of that class that contain the word.
occ_fresh = {}
occ_rotten = {}

for words in train_fresh["Review_tokens"]:
  for w in words:
    occ_fresh[w] = occ_fresh.get(w, 0) + 1

for words in train_rotten["Review_tokens"]:
  for w in words:
    occ_rotten[w] = occ_rotten.get(w, 0) + 1

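# Calculate word probabilities given fresh or rotten
prob_fresh = {}
prob_rotten = {}

def calc_prob(alpha=0):
  # Fills prob_fresh and prob_rotten in place with P(word | class).
  # alpha is the additive smoothing constant; alpha=0 means no smoothing.
  # Only words seen in a class get an entry here, so a word unseen in that
  # class still needs separate handling at classification time.
  denom_fresh = len(train_fresh) + alpha * len(occ_fresh)
  denom_rotten = len(train_rotten) + alpha * len(occ_rotten)

  for word, count in occ_fresh.items():
    prob_fresh[word] = (count + alpha) / denom_fresh

  for word, count in occ_rotten.items():
    prob_rotten[word] = (count + alpha) / denom_rotten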
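
Written out, the estimate computed by `calc_prob` is the first expression below, where $n_{w,c}$ is the number of class-$c$ training reviews containing word $w$, $N_c$ is the number of class-$c$ training reviews, $V_c$ is the set of words seen in class $c$, and $\alpha$ is the smoothing constant. The second expression is the standard Naive Bayes decision rule in log form for a review $d$; the notebook does not show its classification step, so treat that part as an assumption about how these probabilities are meant to be used.

```latex
P(w \mid c) = \frac{n_{w,c} + \alpha}{N_c + \alpha \, |V_c|},
\qquad
\hat{c} = \arg\max_{c \in \{\text{fresh},\, \text{rotten}\}}
\left[ \log P(c) + \sum_{w \in d} \log P(w \mid c) \right]
```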

# Experimenting with *dev* dataset
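
One way the *dev* experiments could be set up is sketched below: a scoring function that sums log probabilities, with a fallback for words that have no entry in a class table. This is a sketch under assumptions, not the notebook's method. The names `log_score`, `classify`, and `unseen_prob`, and the toy probability tables, are all hypothetical; in the notebook the tables would be `prob_fresh` and `prob_rotten`, and the priors `p_fresh` and `p_rotten`.

```python
import math

def log_score(tokens, prior, word_probs, unseen_prob):
    """Return log(prior) plus the sum of log P(word | class) over the tokens."""
    score = math.log(prior)
    for word in tokens:
        # unseen_prob is a hypothetical fallback for words missing from the table
        score += math.log(word_probs.get(word, unseen_prob))
    return score

def classify(tokens, priors, tables, unseen_prob=1e-6):
    """Pick the class with the highest log score; ties go to the first class."""
    return max(
        priors,
        key=lambda c: log_score(tokens, priors[c], tables[c], unseen_prob),
    )

# Toy values for illustration only, not estimated from the dataset
toy_priors = {"fresh": 0.5, "rotten": 0.5}
toy_tables = {
    "fresh": {"great": 0.30, "film": 0.50, "boring": 0.02},
    "rotten": {"great": 0.05, "film": 0.50, "boring": 0.25},
}
pred_a = classify(["film", "great"], toy_priors, toy_tables)
pred_b = classify(["boring", "film"], toy_priors, toy_tables)
print(pred_a, pred_b)

# Why logs: a raw product of many small probabilities underflows to 0.0,
# while the equivalent sum of logs stays a finite negative number
raw_product = math.prod([0.01] * 400)
log_sum = sum(math.log(0.01) for _ in range(400))
print(raw_product, log_sum)
```

Comparing *dev* accuracy across several values of `alpha` and `unseen_prob`, and between raw products and log sums, would cover the smoothing and log-probability items from the assignment process.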

# Final Accuracy with *test* dataset
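
Accuracy here means the fraction of reviews whose predicted label matches the true label. A minimal sketch follows, assuming predictions and labels are available as equal-length lists; the `accuracy` helper and the toy labels are hypothetical and do not represent results from this dataset.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy labels for illustration only
toy_pred = ["fresh", "rotten", "fresh", "rotten"]
toy_true = ["fresh", "rotten", "rotten", "rotten"]
toy_acc = accuracy(toy_pred, toy_true)
print(toy_acc)  # 0.75: three of the four toy predictions match
```

In the notebook, the true labels would come from `test["Freshness"]` and the predictions from applying the classifier to `test["Review_tokens"]`.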