<a href="https://colab.research.google.com/github/hayden-huynh/Rotton-Tomatoes-Review-Classifier/blob/master/RT_Reviews_NaiveBayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment Process

1. Download the Rotten Tomatoes Reviews [dataset](https://www.kaggle.com/datasets/ulrikthygepedersen/rotten-tomatoes-reviews)
2. Split the original dataset into smaller *train* (70%), *dev* (10%), and *test* (20%) datasets
3. Train the classifier using the *train* dataset
   - Calculate the P(fresh) and P(rotten) priors
   - Text pre-processing:
     - Lower-case
     - Remove punctuation
     - Tokenize and remove duplicate words
4. Improve the classifier using the *dev* dataset (?)
5. Test the accuracy of the classifier using the *test* dataset
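
For orientation, the steps above feed the usual Naive Bayes decision rule. The formula below is the textbook form, not something taken from this notebook: $c$ ranges over the two classes, $w_1, \dots, w_n$ are the distinct tokens of a review, and how $P(w_i \mid c)$ is estimated (for example with add-one smoothing) is left open here.

$$
\hat{c} = \arg\max_{c \in \{\text{fresh},\, \text{rotten}\}} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big]
$$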

# Download Dataset from Kaggle

In [None]:
# Download the Rotten Tomatoes Reviews dataset from Kaggle
# Reference 1 (Ref 1): https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

# Ref 1 starts =====
! pip install kaggle
! mkdir -p ~/.kaggle
# Assumes Google Drive is already mounted at /content/drive and contains kaggle.json
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download ulrikthygepedersen/rotten-tomatoes-reviews
! unzip rotten-tomatoes-reviews.zip
# ===== Ref 1 ends
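
The `mkdir`, `cp`, and `chmod` steps can also be done from Python, which makes the cell safe to re-run and fails with a clear message when the token is missing. This is only a sketch: `install_kaggle_token` is a hypothetical helper, not part of the Kaggle package, and the default destination assumes the Kaggle CLI reads its token from `~/.kaggle/kaggle.json`, as the shell commands above imply.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(src, dest_dir=None):
    """Copy a Kaggle API token into dest_dir and restrict it to the owner."""
    src = Path(src)
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / ".kaggle"
    if not src.is_file():
        raise FileNotFoundError(f"Kaggle token not found: {src}")
    dest_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)                   # like `cp`
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like `chmod 600`
    return dest
```

In Colab this would be called as `install_kaggle_token("/content/drive/MyDrive/kaggle.json")` once Drive is mounted.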

# Split the Original Dataset

In [10]:
# Reference 2 (Ref 2): https://stackoverflow.com/questions/43777243/how-to-split-a-dataframe-in-pandas-in-predefined-percentages 

import pandas as pd
import numpy as np

reviews = pd.read_csv("/content/rt_reviews.csv", encoding="latin-1")

print("----- Samples -----")
print(reviews.head(5))

print("\n----- Summary -----")
print(reviews.describe())
print(reviews.shape)

----- Samples -----
  Freshness                                             Review
0     fresh   Manakamana doesn't answer any questions, yet ...
1     fresh   Wilfully offensive and powered by a chest-thu...
2    rotten   It would be difficult to imagine material mor...
3    rotten   Despite the gusto its star brings to the role...
4    rotten   If there was a good idea at the core of this ...

----- Summary -----
       Freshness                    Review
count     480000                    480000
unique         2                    339697
top        fresh   Parental Content Review
freq      240000                       166
(480000, 2)


In [42]:
# Ref 2 starts =====
def split_by_fractions(df, fracs, random_state=1):
    """Split df into disjoint random subsets whose sizes follow fracs."""
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        # Re-express the requested fraction relative to the rows still left
        fractions_sum = sum(fracs[i:])
        frac = fracs[i] / fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain = remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]
# ===== Ref 2 ends

train, dev, test = split_by_fractions(reviews, [0.7, 0.1, 0.2])
print(train.shape, dev.shape, test.shape)

(336000, 2) (48000, 2) (96000, 2)
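
The division `fracs[i] / sum(fracs[i:])` is what makes the sequential sampling come out to the requested proportions: each fraction is re-expressed relative to the rows that are still left. A minimal pure-Python sketch of that arithmetic follows. `split_sizes` is a hypothetical helper, not part of the notebook, and it assumes that rounding `remaining * relative` mirrors how many rows `sample` draws.

```python
def split_sizes(n_rows, fracs):
    """Return the number of rows each split receives under sequential sampling."""
    remaining = n_rows
    sizes = []
    for i, frac in enumerate(fracs):
        # Fraction of the rows still available, not of the full dataset.
        relative = frac / sum(fracs[i:])
        take = round(remaining * relative)
        sizes.append(take)
        remaining -= take
    return sizes


print(split_sizes(480_000, [0.7, 0.1, 0.2]))  # [336000, 48000, 96000]
```

The relative fractions are 0.7 of everything, then one third of the remaining 30%, then all of what is left, which matches the shapes printed above.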


# Training

In [47]:
# Reference 3 (Ref 3): https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
import string
import re

# Text Pre-processing
# Lower-case all words
train["Review_lower"] = train["Review"].str.lower()
train.sample(5)

| | Freshness | Review | Review_lower | Review_clean |
|---|---|---|---|---|
| 316289 | fresh | We Own the Night plays like gangbusters. | we own the night plays like gangbusters. | we own the night plays like gangbusters |
| 251651 | rotten | It's Alvin and the Chipmunks with only one ch... | it's alvin and the chipmunks with only one ch... | its alvin and the chipmunks with only one chi... |
| 125509 | fresh | The fuzzy thinking allows for gorgeous outdoo... | the fuzzy thinking allows for gorgeous outdoo... | the fuzzy thinking allows for gorgeous outdoo... |
| 69413 | rotten | Partridge is a smidgen less abhorrent here th... | partridge is a smidgen less abhorrent here th... | partridge is a smidgen less abhorrent here th... |
| 373479 | rotten | Boy Erased has perhaps squandered a plum oppo... | boy erased has perhaps squandered a plum oppo... | boy erased has perhaps squandered a plum oppo... |


In [46]:
# Remove punctuation
def remove_punctuations(text):
    """Return text with every ASCII punctuation character removed."""
    return "".join(char for char in text if char not in string.punctuation)

train["Review_clean"] = train["Review_lower"].apply(remove_punctuations)
train.sample(5)

| | Freshness | Review | Review_lower | Review_clean |
|---|---|---|---|---|
| 258838 | rotten | It's easy to get upset with the writer and di... | it's easy to get upset with the writer and di... | its easy to get upset with the writer and dir... |
| 415049 | fresh | It's a very conventional, Bible-bashing appro... | it's a very conventional, bible-bashing appro... | its a very conventional biblebashing approach... |
| 422708 | fresh | Cellular cleverly fleshes out the premise of ... | cellular cleverly fleshes out the premise of ... | cellular cleverly fleshes out the premise of ... |
| 410768 | fresh | "Little Men" is able to achieve a balance bet... | "little men" is able to achieve a balance bet... | little men is able to achieve a balance betwe... |
| 171532 | fresh | You forget you're watching a movie and instea... | you forget you're watching a movie and instea... | you forget youre watching a movie and instead... |
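
The samples show a side effect of deleting punctuation outright: hyphenated words fuse, so `bible-bashing` becomes `biblebashing`. If that is unwanted, one option is to replace punctuation with a space instead. The sketch below compares both behaviours using `str.translate`. `strip_punctuation` and `space_punctuation` are illustrative names, not functions from this notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE = str.maketrans("", "", string.punctuation)
# Translation table that maps every ASCII punctuation character to a space.
_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(text):
    """Delete punctuation, as remove_punctuations does, in a single pass."""
    return text.translate(_DELETE)


def space_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_TO_SPACE).split())


sample = "it's a very conventional, bible-bashing approach"
print(strip_punctuation(sample))  # its a very conventional biblebashing approach
print(space_punctuation(sample))  # it s a very conventional bible bashing approach
```

The second variant keeps `bible` and `bashing` as separate words but also splits contractions (`it s`), so which one works better for the classifier is something to check on the *dev* set.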


In [76]:
# Tokenizing and Removing Duplicate Words
def tokenize(text):
    # Raw string avoids the invalid-escape warning that "\W+" triggers
    tokens = re.split(r"\W+", text)
    tokens = list(filter(None, tokens))  # drop empty strings left at the edges
    return sorted(set(tokens))

train["Review_tokens"] = train["Review_clean"].apply(tokenize)
train.sample(5)

| | Freshness | Review | Review_lower | Review_clean | Review_tokens |
|---|---|---|---|---|---|
| 158945 | rotten | Great start, but the movie keeps fighting its... | great start, but the movie keeps fighting its... | great start but the movie keeps fighting its ... | [but, fighting, great, internal, its, keeps, l... |
| 439197 | fresh | My suspicion is that Apatow made a conscious ... | my suspicion is that apatow made a conscious ... | my suspicion is that apatow made a conscious ... | [a, apatow, at, battlefield, chaos, conscious,... |
| 308532 | rotten | Even the biggest fans of this type of film mu... | even the biggest fans of this type of film mu... | even the biggest fans of this type of film mu... | [acknowledge, an, biggest, embarrassingly, eve... |
| 91244 | rotten | It begins as a clever pseudo-mumblecore provo... | it begins as a clever pseudo-mumblecore provo... | it begins as a clever pseudomumblecore provoc... | [a, as, begins, bruce, clever, indefensible, i... |
| 304638 | rotten | On the whole it's desperately poor fare, whic... | on the whole it's desperately poor fare, whic... | on the whole its desperately poor fare which ... | [audience, counted, desperately, fare, halflau... |
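
Because `tokenize` returns a sorted set, each review contributes a given word at most once. Any counts built from `Review_tokens` therefore record how many reviews contain a word, not how often the word occurs. A small sketch of the difference, using a standalone copy of `tokenize` and a made-up example sentence:

```python
import re
from collections import Counter


def tokenize(text):
    # Raw string so that \W reaches the regex engine unchanged.
    tokens = [t for t in re.split(r"\W+", text) if t]
    return sorted(set(tokens))


text = "a good film is a good film"
print(tokenize(text))                 # ['a', 'film', 'good', 'is']
print(Counter(text.split())["good"])  # 2
```

Whether presence or frequency is the better feature here is an open design choice. Presence-based counts correspond to a Bernoulli-style model, and raw frequencies to a multinomial one.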


In [None]:
train_fresh = train.loc[train["Freshness"] == "fresh"]
train_rotten = train.loc[train["Freshness"] == "rotten"]

# P(fresh) and P(rotten) priors
p_fresh = len(train_fresh) / len(train)
p_rotten = len(train_rotten) / len(train)
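
Priors are usually combined with many small word probabilities, so it is common to keep them as logarithms to avoid numerical underflow. A minimal sketch of that idea follows, assuming the labels are available as a plain list. `class_log_priors` is a hypothetical helper. In the notebook the same proportions could be read from `train["Freshness"].value_counts(normalize=True)`.

```python
import math
from collections import Counter


def class_log_priors(labels):
    """Map each class label to log P(class), estimated by relative frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: math.log(n / total) for label, n in counts.items()}


labels = ["fresh", "rotten", "fresh", "rotten"]
log_priors = class_log_priors(labels)
print(math.exp(log_priors["fresh"]))  # close to 0.5
```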