## Predicting Score Based on Review Content

Can we take a review and, based on the words in the review, predict what the score will be? Let's find out!

For this we will use naive Bayes document classification: we collect the words in a review, find the frequency of each word within that review, and then compare the review to all of the words the model has already seen and processed, in order to see which category the review most likely falls into. We are not, in this instance, trying to predict the exact score. Instead, we create buckets of scores, based on the distribution of all 18,389 scores, and classify each review into one of 6 categories.

Bayes' theorem in this case can be written as "the probability of a review category, given a word in the review, is equal to the probability of the word given the category times the probability of the category, divided by the probability of the word":

 $$ P(\text{Review Category | Word}) = \dfrac{P(\text{Word | Review Category})P(\text{Review Category})}{P(\text{Word})}$$  

Using our collections of words within reviews, we can then define $P(\text{Word | Review Category})$ as "the frequency of the word within the review divided by the frequency of the word across all reviews in that category."

BUT! This could lead to errors if a word does not show up in our training data, because then the frequency of the word across all reviews would be zero and we'd be dividing by zero. To avoid that, we'll apply Laplacian smoothing - adding one to all word frequencies, and then adding the number of unique words across all reviews (the vocabulary size) to the denominator:

 $$P(\text{Word | Review Category}) = \dfrac{\text{Word Frequency in Review + 1}}{\text{Word Frequency Across All Reviews in that Category + Number of Words Across All Reviews}}$$  
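For comparison, the textbook multinomial naive Bayes version of this smoothing puts the word's count within the category in the numerator, and the total word count of the category plus the vocabulary size in the denominator. The sketch below applies that textbook form to made-up counts. The names `toy_counts`, `toy_V`, and `smoothed_prob` are hypothetical, and this is not the exact formula used in the rest of this notebook.

```python
# Toy word counts per category (made-up numbers, for illustration only)
toy_counts = {
    "good": {"great": 3, "album": 2},
    "bad": {"boring": 4, "album": 1},
}

# Vocabulary: every unique word seen in any category
toy_vocabulary = {word for bag in toy_counts.values() for word in bag}
toy_V = len(toy_vocabulary)  # unique words: great, album, boring


def smoothed_prob(word, category):
    """Laplace-smoothed estimate of P(word | category), textbook form."""
    bag = toy_counts[category]
    total_words = sum(bag.values())
    return (bag.get(word, 0) + 1) / (total_words + toy_V)


# "boring" never appears in the "good" category, but its probability
# is still greater than zero thanks to the +1 in the numerator
print(smoothed_prob("boring", "good"))  # (0 + 1) / (5 + 3)
print(smoothed_prob("great", "good"))   # (3 + 1) / (5 + 3)
```

With this form, the smoothed probabilities of all vocabulary words within one category add up to 1, which is a quick sanity check.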

This is where the "naive" part of naive Bayes comes in: we assume the words in a review are independent of one another given the category, so the probabilities for each word can simply be multiplied together. The denominator, P(Word), does not depend on the category, so it is the same for every category we score. Thus, since we are only comparing the relative probabilities of the review categories, we can ignore the denominator - P(Word) - and use the numerator alone to decide in which category the review likely belongs.
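A quick way to convince yourself that the denominator can be dropped is to compare the winning category with and without it. This is a minimal sketch with made-up numbers; `numerators` is a hypothetical array, not something computed from the reviews.

```python
import numpy as np

# Made-up numerators P(Word | Category) * P(Category) for three categories
numerators = np.array([0.002, 0.006, 0.004])

# P(Word) is the same number no matter which category we are scoring
p_word = numerators.sum()

posteriors = numerators / p_word

# Dividing by a shared positive constant rescales every score equally,
# so the winning category is the same with or without the denominator
print(np.argmax(numerators), np.argmax(posteriors))
```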

In [1]:
%load_ext autoreload
%autoreload 2

#### First, imports and data preparation:

In [2]:
# Imports

import sqlite3
import pandas as pd
import numpy as np
from scipy import stats 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Importing the class I wrote for Document Classification, which includes
# functions for much of the work here for reproducibility
from DocumentClassification import *

In [3]:
# Creating connection to SQLite database

conn = sqlite3.connect("database.sqlite")
c = conn.cursor()

In [4]:
# Querying for the score and the content of each review

c.execute("""SELECT reviewid, score, content
            FROM reviews
            JOIN content
            USING(reviewid)
            ;""")

# Creating a pandas dataframe from our SQLite query
df = pd.DataFrame(c.fetchall())
df.columns = [x[0] for x in c.description]
df.head()

Unnamed: 0,reviewid,score,content
0,22703,9.3,"“Trip-hop” eventually became a ’90s punchline,..."
1,22721,7.9,"Eight years, five albums, and two EPs in, the ..."
2,22659,7.3,Minneapolis’ Uranium Club seem to revel in bei...
3,22661,9.0,Kleenex began with a crash. It transpired one ...
4,22725,8.1,It is impossible to consider a given release b...


In [5]:
# Checking for duplicates
duplicates = df[df.duplicated()]
print(len(duplicates))
duplicates

12


Unnamed: 0,reviewid,score,content
12117,9417,7.0,\r\n A song-for-song reggae cover of Radioh...
12119,9505,8.2,\nOn the one hand it is a largely superfluous ...
12121,9499,6.2,"When we last left our heroes, the Blood Brothe..."
12123,9460,7.8,Strange things are a foot in the bowels of hel...
12124,9417,7.0,\r\n A song-for-song reggae cover of Radioh...
12125,9417,7.0,\r\n A song-for-song reggae cover of Radioh...
12126,9505,8.2,\nOn the one hand it is a largely superfluous ...
12127,9505,8.2,\nOn the one hand it is a largely superfluous ...
12128,9499,6.2,"When we last left our heroes, the Blood Brothe..."
12129,9499,6.2,"When we last left our heroes, the Blood Brothe..."


In [6]:
# The flagged rows are extra copies of just 4 reviews (3 extra copies each,
# for whatever reason), so we can simply drop the duplicate copies

df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18389 entries, 0 to 18400
Data columns (total 3 columns):
reviewid    18389 non-null int64
score       18389 non-null float64
content     18389 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 574.7+ KB


In [7]:
# Initial look at our descriptive statistics, including quartiles
df["score"].describe()

count    18389.000000
mean         7.005715
std          1.293758
min          0.000000
25%          6.400000
50%          7.200000
75%          7.800000
max         10.000000
Name: score, dtype: float64

In [8]:
# Defining the thresholds for our categories
# Here, we're breaking these down into 6 categories, that shake out to:
# Very Worst (bottom 5%), Bad (next 20%), Okay (next 25%, up to halfway), 
# Pretty Good (next 25%, after halfway), Good (next 20%), Very Best (top 5%)

category_threshold = []

for pcts in [.05, .25, .5, .75, .95]:
    # Selecting the score column by name rather than by position
    category_threshold.append(df["score"].quantile(q=pcts))
    
category_threshold

[4.5, 6.4, 7.2, 7.8, 8.6]

In [9]:
# Defining our review categories

def review_type(score):
    if score <= category_threshold[0]: # From 0 to 4.5, bottom 5%
        review_type = 0
    elif score <= category_threshold[1]: # From 4.5 to 6.4, next 20%
        review_type = 1
    elif score <= category_threshold[2]: # From 6.4 to 7.2, next 25%
        review_type = 2
    elif score <= category_threshold[3]: # From 7.2 to 7.8, next 25%
        review_type = 3
    elif score <= category_threshold[4]: # From 7.8 to 8.6, next 20%
        review_type = 4
    else: # Above 8.6, top 5%
        review_type = 5
    return review_type
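The same bucketing can also be sketched with `pd.cut`, which assigns every score to a bin in one vectorized call. This is an alternative sketch rather than the approach used here: `thresholds` is copied from the `category_threshold` output above, and `example_scores` is a made-up series.

```python
import numpy as np
import pandas as pd

# Thresholds copied from the category_threshold output above
thresholds = [4.5, 6.4, 7.2, 7.8, 8.6]
bins = [-np.inf] + thresholds + [np.inf]

# Made-up scores, chosen to sit on and around the bin edges
example_scores = pd.Series([0.0, 4.5, 4.6, 7.2, 7.3, 8.6, 8.7, 10.0])

# right=True (the default) makes each bin include its upper edge,
# which matches the "<=" comparisons in review_type
example_types = pd.cut(example_scores, bins=bins, labels=False)
print(example_types.tolist())
```

Scores that sit exactly on a threshold (4.5, 7.2, 8.6) land in the lower category, just as they do with the if/elif chain.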

In [10]:
# Creating a new column to apply our review categories to each review
df["review_type"] = df["score"].map(review_type)
df.head()

Unnamed: 0,reviewid,score,content,review_type
0,22703,9.3,"“Trip-hop” eventually became a ’90s punchline,...",5
1,22721,7.9,"Eight years, five albums, and two EPs in, the ...",4
2,22659,7.3,Minneapolis’ Uranium Club seem to revel in bei...,3
3,22661,9.0,Kleenex began with a crash. It transpired one ...,5
4,22725,8.1,It is impossible to consider a given release b...,4


In [11]:
# Looking at the breakdown of the number of reviews per category
df["review_type"].value_counts()

3    4649
2    4623
1    3724
4    3545
0     943
5     905
Name: review_type, dtype: int64

In [12]:
# Creating a train-test split of our reviews, so we train our model only
# on training data and save some data to test our model

X = df["content"]
y = df["review_type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_df = pd.concat([X_train, y_train], axis=1) 
test_df = pd.concat([X_test, y_test], axis=1)

#### Creating each piece of our Bayes theorem equation:

Reminder: 

 $$ P(\text{Review Category | Word}) = \dfrac{P(\text{Word | Review Category})P(\text{Review Category})}{P(\text{Word})}$$  
 
Where P(Word) can be ignored (it does not depend on the category, so it is the same for every category we compare), and where:

 $$P(\text{Word | Review Category}) = \dfrac{\text{Word Frequency in Review + 1}}{\text{Word Frequency Across All Reviews in that Category + Number of Words Across All Reviews}}$$  

In [13]:
# Calculating P(Review Category): the probability that a review falls in each 
# category. These make intuitive sense, since we created these categories 
# based on percentage breakdowns in our data

# So will be roughly 0:.05, 1:.20, 2:.25, 3:.25, 4:.20, 5:.05
# Variation from that is explained by our use of 'less than/equal to' when 
# defining our review categories

p_categories = dict(df["review_type"].value_counts(normalize = True))
p_categories

{3: 0.25281418239164716,
 2: 0.2514002936538148,
 1: 0.20251237152645604,
 4: 0.19277829136984068,
 0: 0.05128065691445973,
 5: 0.04921420414378161}

#### Creating the pieces for P (Word | Review Category)

In [14]:
# Calculating the Word Frequency Across All Reviews in that Category:
# Creating a dictionary with review categories as keys, where the values are 
# the frequency of each word within reviews of that category

word_freq = {}

# Using train_df to make sure we're using only training data
categories = train_df["review_type"].unique()

# Putting our cats in the bag (sorry?)
for cat in categories:
    temp_df = train_df[train_df["review_type"]==cat]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['content'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    word_freq[cat] = bag
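The same per-category bag of words can be sketched with `collections.Counter`, which handles the "start at zero, then add one" bookkeeping for us. The data frame `toy_train` below is a made-up stand-in for `train_df`, and `toy_word_freq` is a hypothetical name.

```python
from collections import Counter

import pandas as pd

# Tiny stand-in for train_df (made-up reviews, for illustration only)
toy_train = pd.DataFrame({
    "content": ["a great great album", "a boring album", "great songs"],
    "review_type": [5, 0, 5],
})

toy_word_freq = {}
for cat, group in toy_train.groupby("review_type"):
    bag = Counter()
    for doc in group["content"]:
        # Counter.update adds one to the count of every token it is given
        bag.update(doc.split())
    toy_word_freq[cat] = bag

print(toy_word_freq[5]["great"])  # count of "great" in category 5
print(toy_word_freq[0]["great"])  # words never seen count as 0 in a Counter
```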

In [15]:
word_freq

{3: {'The': 10954,
  'artist': 401,
  'Jessica': 19,
  'Ingram’s': 1,
  'series': 378,
  'of': 65334,
  'photographs': 12,
  '“Road': 1,
  'Through': 113,
  'Midnight:': 1,
  'A': 1295,
  'Civil': 16,
  'Rights': 10,
  'Memorial”': 1,
  'seems,': 24,
  'at': 6819,
  'a': 59285,
  'glance,': 14,
  'to': 42494,
  'simply': 554,
  'portray': 10,
  'beautiful': 233,
  'or': 6602,
  'quotidian': 9,
  'parts': 312,
  'the': 102486,
  'Southern': 129,
  'landscape.': 27,
  'But': 3915,
  'postcard-ready': 2,
  'images': 117,
  'are': 8019,
  'backlit': 3,
  'by': 7024,
  'an': 9699,
  'appalling': 1,
  'fact:': 2,
  'They': 762,
  'were': 2066,
  'all': 4688,
  'sites': 12,
  'racist': 6,
  'murders.': 2,
  'On': 1757,
  'his': 11976,
  'new': 2391,
  'Biosphere': 3,
  'record,': 576,
  'Departed': 11,
  'Glories,': 1,
  'influential': 48,
  'Norwegian': 53,
  'ambient': 419,
  'musician': 153,
  'Geir': 5,
  'Jenssen': 18,
  'does': 1048,
  'something': 1821,
  'strikingly': 46,
  'similar,'

In [16]:
# Calculating the Number of Words Across All Reviews for Laplacian smoothing,
# by counting the total number of unique words in all reviews

# Making vocabulary a set so it only keeps unique elements/words
vocabulary = set()

# Again, using train_df so we only use training data
for text in train_df["content"]:
    for word in text.split():
        vocabulary.add(word)

# Arriving at our number of unique words, 496946
V = len(vocabulary)
V

496946

In [17]:
# Creating a function to count the words within a review
# Purpose: to calculate Word Frequency in Review
def count_words(review_content):
    count = {}
    
    # Adding 1 to the count for the word
    # If the word is in the 'count' dictionary, grabs that value
    # If the word is not yet in the 'count' dictionary, sets value to 0
    # Then, adds 1 to the value
    for word in review_content.split():
        count[word] = count.get(word, 0) + 1
        
    return count

#### Putting all of the pieces together:

We have created all of the pieces except the probability of the word:

 $$ P(\text{Review Category | Word}) = \dfrac{\dfrac{\text{count_words function + 1}}{\text{word_freq + V}} * \text{p_categories}}{P(\text{Word})}$$  

since:

 $$P(\text{Word | Review Category}) = \dfrac{\text{count_words function + 1}}{\text{word_freq + V}}$$  

In [18]:
# Defining our P(Review Category | Word) function!
# This function guesses the review category based on the review content

'''
Inputs:
review_content: content of the review - df["content"]
p_categories: probability that a review falls into each category
word_freq: nested dictionary for words based on category
V: count of all unique words across all reviews (used for smoothing)
return_posteriors: boolean, saying whether or not you want to print the values
     in the posteriors list (per category)
'''

# Avoiding Underflow: underflow is when python rounds to zero because of
# numerical approximation limitations. We avoid this by using np.log of
# our probabilities, rather than the probabilities themselves
# Because algebra: we add the log where we would multiply the raw probability

def classify_doc(review_content, p_categories, word_freq, V, return_posteriors=False):
    
    # Using count_words function to count the words within the provided review
    count = count_words(review_content)
    
    categories = []
    posteriors = []
    
    # This part of our function is putting together the pieces to calculate
    # P(Word | Review Category):
    
    # The keys of our word_freq dictionary are the categories of reviews, 0-5
    # In other words, we do this one category at a time
    for category in word_freq.keys():
        
        # Finding the default/original probability of that category
        # Here you can see us using the log, rather than the raw probability
        p = np.log(p_categories[category])
        
        # The keys of our count dictionary are the words within the review
        for word in count.keys():
            
            # Numerator for P(Word|Review Category): Word Frequency in Review + 1
            num = count[word] + 1
            
            # Denominator for P(Word|Review Category): Word Frequency Across 
            # All Reviews in that Category + Number of Words Across All Reviews
            # If the word is not yet in the 'word_freq' dictionary, value = 0
            denom = word_freq[category].get(word, 0) + V
            
            # Adding the probability of that word being in the review based on 
            # category to the probability of the category, in order to arrive
            # at a new and hopefully better probability that the review
            # is in the current category
            
            # Again, using log instead of raw probability, hence why we are
            # using addition instead of multiplication
            p += np.log(num/denom)
        
        # Adding the current category to the 'categories' list
        categories.append(category)
        
        # Adding the updated probability for the current category to the 
        # 'posteriors' list
        posteriors.append(p)
    
    # Prints our posteriors, which are what our model will compare, if 
    # return_posteriors = True
    if return_posteriors:
        print(posteriors)
    
    # Returning the category with the highest probability
    return categories[np.argmax(posteriors)]

In [19]:
# Example, running the function on the first row of our train_df
classify_doc(train_df.iloc[0]["content"], p_categories, word_freq, V, return_posteriors = True)

[-5416.667861680065, -5416.603767827312, -5417.235699745461, -5416.551581093784, -5417.0017485522485, -5416.803889282037]


1
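The posteriors printed above are large negative numbers because they are sums of logs. A minimal sketch of why that is necessary, using made-up probabilities (`probs` is hypothetical, not taken from the reviews):

```python
import numpy as np

# 100 small made-up word probabilities
probs = np.full(100, 1e-5)

# Multiplying them directly underflows: the true value, 1e-500, is far
# smaller than the smallest positive float64, so the result is 0.0
raw_product = np.prod(probs)

# Summing logs keeps the same information in a representable range
log_sum = np.sum(np.log(probs))

print(raw_product)
print(log_sum)
```

Since the log is a monotonic function, the category with the largest log score is also the category with the largest raw probability, so `np.argmax` picks the same winner.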

In [20]:
# Running our classifier on each row of the training data set
# (the "training" itself already happened when we built word_freq above)
y_hat_train = X_train.map(lambda x: classify_doc(x, p_categories, word_freq, V))

# Checking the accuracy of our model, based on the training data alone
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)

False    0.799217
True     0.200783
dtype: float64

In [21]:
y_hat_test = X_test.map(lambda x: classify_doc(x, p_categories, word_freq, V))

residuals = y_test == y_hat_test
residuals.value_counts(normalize=True)

False    0.793823
True     0.206177
dtype: float64

In [22]:
type(y_train)

pandas.core.series.Series

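Our initial results are not great! Our model predicts the correct category only about 20% of the time, which is worse than the roughly 25% we would get by always guessing the largest category (category 3). Oddly enough, it performs slightly better on the test data (20.6%) than on the training data (20.1%).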
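A single accuracy number hides where the predictions go. One way to look closer, sketched here on made-up labels, is to compare accuracy against a majority-class baseline and to cross-tabulate true and predicted categories. The series `y_true` and `y_pred` are hypothetical stand-ins for `y_test` and `y_hat_test`.

```python
import pandas as pd

# Made-up true labels and predictions, for illustration only
y_true = pd.Series([0, 1, 1, 2, 2, 2, 3, 3], name="actual")
y_pred = pd.Series([1, 1, 1, 1, 1, 1, 1, 1], name="predicted")

accuracy = (y_true == y_pred).mean()

# Accuracy of always guessing the most common true label
majority_baseline = y_true.value_counts(normalize=True).max()

# Rows are true categories, columns are predicted categories
confusion = pd.crosstab(y_true, y_pred)

print(accuracy, majority_baseline)
print(confusion)
```

In this toy example the cross-tabulation has a single column, which is what a model that always predicts the same category looks like.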

In [23]:
initial_df = df.copy()
initial = DocumentClassification(initial_df, "content", "review_type")

In [24]:
initial.data.head()

Unnamed: 0,reviewid,score,content,review_type
0,22703,9.3,"“Trip-hop” eventually became a ’90s punchline,...",5
1,22721,7.9,"Eight years, five albums, and two EPs in, the ...",4
2,22659,7.3,Minneapolis’ Uranium Club seem to revel in bei...,3
3,22661,9.0,Kleenex began with a crash. It transpired one ...,5
4,22725,8.1,It is impossible to consider a given release b...,4


In [25]:
predicted_target = []
for row in range(len(initial.data)):
    predicted_target.append(initial.classify_doc(initial.data.iloc[row][initial.word_loc]))

In [26]:
initial.data["predicted_review_type"] = predicted_target

In [27]:
initial.data.head()

Unnamed: 0,reviewid,score,content,review_type,predicted_review_type
0,22703,9.3,"“Trip-hop” eventually became a ’90s punchline,...",5,1
1,22721,7.9,"Eight years, five albums, and two EPs in, the ...",4,1
2,22659,7.3,Minneapolis’ Uranium Club seem to revel in bei...,3,1
3,22661,9.0,Kleenex began with a crash. It transpired one ...,5,1
4,22725,8.1,It is impossible to consider a given release b...,4,1


In [28]:
residuals = initial.data[initial.category_loc] == initial.data["predicted_review_type"]
residuals.value_counts(normalize=True)

False    0.797705
True     0.202295
dtype: float64

## Refining our model

One problem with our model is that we did very little pre-processing of the text, so the same word can show up as several different tokens. Let's do a bit of pre-processing and see if it improves our model.

All of our functions above have been written into a class, DocumentClassification, in order to rerun all of this and test how each refinement helps to improve our model!

#### Setting All Letters to Lowercase:

An initial problem is that our word frequency counter is not able to recognize that "The" and "the" are the same word. The first thing we'll do to refine our model is to make sure all letters are in the same case:

In [29]:
first_refined_df = df.copy()
first_refined_df["content"] = first_refined_df["content"].str.lower()
first_refined_df.head()

Unnamed: 0,reviewid,score,content,review_type
0,22703,9.3,"“trip-hop” eventually became a ’90s punchline,...",5
1,22721,7.9,"eight years, five albums, and two eps in, the ...",4
2,22659,7.3,minneapolis’ uranium club seem to revel in bei...,3
3,22661,9.0,kleenex began with a crash. it transpired one ...,5
4,22725,8.1,it is impossible to consider a given release b...,4


In [30]:
first_refinement = DocumentClassification(first_refined_df, 
                                          "content", 
                                          "review_type")

first_predicted_target = []
for row in range(len(first_refinement.data)):
    first_predicted_target.append(first_refinement.classify_doc(
        first_refinement.data.iloc[row][first_refinement.word_loc]))

In [31]:
first_refinement.data["predicted_review_type"] = first_predicted_target
first_residuals = first_refinement.data[first_refinement.category_loc] == first_refinement.data["predicted_review_type"]
first_residuals.value_counts(normalize=True)

False    0.798086
True     0.201914
dtype: float64

#### Removing Unicode Leftovers:

A problem with the way the content of each review was gathered is that it left behind a lot of stray whitespace characters, including non-breaking spaces ("\xa0") and newlines followed by indentation ("\n    "). Let's just replace them with regular spaces:

In [32]:
second_refined_df = first_refinement.data.copy()
second_refined_df.drop("predicted_review_type", axis=1, inplace=True)

In [33]:
second_refined_df["content"] = second_refined_df["content"].apply(lambda x: x.replace(u'\xa0', u' '))
second_refined_df["content"] = second_refined_df["content"].apply(lambda x: x.replace(u'\n    ', u' '))
second_refined_df.head()

Unnamed: 0,reviewid,score,content,review_type
0,22703,9.3,"“trip-hop” eventually became a ’90s punchline,...",5
1,22721,7.9,"eight years, five albums, and two eps in, the ...",4
2,22659,7.3,minneapolis’ uranium club seem to revel in bei...,3
3,22661,9.0,kleenex began with a crash. it transpired one ...,5
4,22725,8.1,it is impossible to consider a given release b...,4


In [34]:
second_refinement = DocumentClassification(second_refined_df, 
                                          "content", 
                                          "review_type")

second_predicted_target = []
for row in range(len(second_refinement.data)):
    second_predicted_target.append(second_refinement.classify_doc(
        second_refinement.data.iloc[row][second_refinement.word_loc]))

In [35]:
second_refinement.data["predicted_review_type"] = second_predicted_target
second_residuals = second_refinement.data[second_refinement.category_loc] == second_refinement.data["predicted_review_type"]
second_residuals.value_counts(normalize=True)

False    0.798086
True     0.201914
dtype: float64
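The accuracy here is identical to the lowercase step. One plausible explanation, assuming DocumentClassification tokenizes with `str.split()` the way the functions earlier in this notebook do, is that `split()` already treats these characters as whitespace, so replacing them does not change the tokens. A small sketch with a made-up string:

```python
# Made-up text containing a non-breaking space and an indented newline
raw = "strange\xa0things are\n    afoot"
cleaned = raw.replace("\xa0", " ").replace("\n    ", " ")

# str.split() with no arguments splits on any run of whitespace,
# and Python treats both "\xa0" and "\n" as whitespace
print(raw.split())
print(cleaned.split())
print(raw.split() == cleaned.split())
```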

In [36]:
second_refinement.data.head()

Unnamed: 0,reviewid,score,content,review_type,predicted_review_type
0,22703,9.3,"“trip-hop” eventually became a ’90s punchline,...",5,1
1,22721,7.9,"eight years, five albums, and two eps in, the ...",4,1
2,22659,7.3,minneapolis’ uranium club seem to revel in bei...,3,1
3,22661,9.0,kleenex began with a crash. it transpired one ...,5,1
4,22725,8.1,it is impossible to consider a given release b...,4,1


#### Removing Punctuation:

Another problem is punctuation. Our word frequency counter doesn't realize that the word at the end of a sentence is the same as any other instance of that word ("it." isn't the same in our model as "it"). The problem is it recognizes the punctuation as part of the word - but it isn't always part of the word! There are many ways to tackle this problem, and they all have their pros/cons, but in this instance we decided to simply remove all punctuation from our text:

In [37]:
third_refined_df = second_refinement.data.copy()
third_refined_df.drop("predicted_review_type", axis=1, inplace=True)

In [38]:
punctuation = ["’", "‘", "”", "“", '!', '"', '#', '$', '%', '&', "'", '(', 
               ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', 
               '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', "—"]

def remove_punctuation(text):
    for symbol in punctuation:
        text = text.replace(symbol, '')
    return text

third_refined_df["content"] = third_refined_df["content"].apply(remove_punctuation)
third_refined_df.head()

Unnamed: 0,reviewid,score,content,review_type
0,22703,9.3,triphop eventually became a 90s punchline a mu...,5
1,22721,7.9,eight years five albums and two eps in the new...,4
2,22659,7.3,minneapolis uranium club seem to revel in bein...,3
3,22661,9.0,kleenex began with a crash it transpired one n...,5
4,22725,8.1,it is impossible to consider a given release b...,4
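Note that deleting punctuation outright glues some words together ("trip-hop" becomes "triphop" in the output above). One alternative, sketched below and not what this notebook does, is to replace each symbol with a space instead. The helper `punctuation_to_spaces` and the name `extra_symbols` are hypothetical.

```python
import string

# ASCII punctuation plus the curly quotes and long dash from the list above
extra_symbols = "’‘”“—"
to_space = str.maketrans({ch: " " for ch in string.punctuation + extra_symbols})


def punctuation_to_spaces(text):
    """Replace each punctuation mark with a space, then tidy whitespace."""
    return " ".join(text.translate(to_space).split())


print(punctuation_to_spaces("“trip-hop” eventually became a punchline."))
```

The trade-off is that contractions get split as well ("don't" becomes "don t"), so neither option is free.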


In [39]:
third_refinement = DocumentClassification(third_refined_df, 
                                          "content", 
                                          "review_type")

third_predicted_target = []
for row in range(len(third_refinement.data)):
    third_predicted_target.append(third_refinement.classify_doc(
        third_refinement.data.iloc[row][third_refinement.word_loc]))

In [40]:
third_refinement.data["predicted_review_type"] = third_predicted_target
third_residuals = third_refinement.data[third_refinement.category_loc] == third_refinement.data["predicted_review_type"]
third_residuals.value_counts(normalize=True)

False    0.948719
True     0.051281
dtype: float64

In [41]:
third_refinement.data.head()

Unnamed: 0,reviewid,score,content,review_type,predicted_review_type
0,22703,9.3,triphop eventually became a 90s punchline a mu...,5,0
1,22721,7.9,eight years five albums and two eps in the new...,4,0
2,22659,7.3,minneapolis uranium club seem to revel in bein...,3,0
3,22661,9.0,kleenex began with a crash it transpired one n...,5,0
4,22725,8.1,it is impossible to consider a given release b...,4,0


#### Removing Stop Words:

One way to pre-process and make our model better is to remove noise. Stop words are very common words (such as "the" or "of") that search engines often ignore, since they carry little information for predictive or processing purposes. The list of stop words we're using is from the Natural Language Toolkit (NLTK) for Python.

For ease of use, the removal of stop words is written into the DocumentClassification class, since it allows us to remove stop words as lists of words are generated.
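The class itself is not shown here, but the idea of stop word filtering can be sketched in a few lines. The set `toy_stop_words` is a tiny hand-written stand-in for the much longer NLTK list, and `remove_stop_words` is a hypothetical helper, not the method inside DocumentClassification.

```python
# A tiny hand-written stop list, standing in for the much longer NLTK list
toy_stop_words = {"the", "a", "of", "and", "to", "is", "it"}


def remove_stop_words(text):
    """Drop every token that appears in the stop list."""
    return [word for word in text.split() if word not in toy_stop_words]


print(remove_stop_words("it is the sound of a band learning to listen"))
```

Using a set rather than a list for the stop words keeps each membership check fast, which matters when filtering millions of tokens.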

In [42]:
fourth_refined_df = third_refinement.data.copy()
fourth_refined_df.drop("predicted_review_type", axis=1, inplace=True)

In [43]:
fourth_refinement = DocumentClassification(fourth_refined_df, 
                                          "content", 
                                          "review_type",
                                          stopwords=True)

fourth_predicted_target = []
for row in range(len(fourth_refinement.data)):
    fourth_predicted_target.append(fourth_refinement.classify_doc(
        fourth_refinement.data.iloc[row][fourth_refinement.word_loc]))

In [44]:
fourth_refinement.data["predicted_review_type"] = fourth_predicted_target
fourth_residuals = fourth_refinement.data[fourth_refinement.category_loc] == fourth_refinement.data["predicted_review_type"]
fourth_residuals.value_counts(normalize=True)

False    0.780738
True     0.219262
dtype: float64

#### Account for Class Imbalance:

To account for class imbalance, we subset the dataset so each category is of equal size. The smallest category is the very best reviews (category 5), with 905 reviews, so we sample 905 reviews from each of the other categories.

In [45]:
balanced_refined_df = fourth_refinement.data.copy()
balanced_refined_df.drop("predicted_review_type", axis=1, inplace=True)
balanced_refined_df.head()

Unnamed: 0,reviewid,score,content,review_type
0,22703,9.3,triphop eventually became a 90s punchline a mu...,5
1,22721,7.9,eight years five albums and two eps in the new...,4
2,22659,7.3,minneapolis uranium club seem to revel in bein...,3
3,22661,9.0,kleenex began with a crash it transpired one n...,5
4,22725,8.1,it is impossible to consider a given release b...,4


In [46]:
# Creating a dataframe for the minority, reviews in category 5
very_best_minority = balanced_refined_df[balanced_refined_df["review_type"] == 5]

# Creating dataframes for the other categories, using samples of the same
# length as our minority df
good_sample = balanced_refined_df[balanced_refined_df["review_type"] == 4].sample(
    n=len(very_best_minority))
pretty_good_sample = balanced_refined_df[balanced_refined_df["review_type"] == 3].sample(
    n=len(very_best_minority))
okay_sample = balanced_refined_df[balanced_refined_df["review_type"] == 2].sample(
    n=len(very_best_minority))
bad_sample = balanced_refined_df[balanced_refined_df["review_type"] == 1].sample(
    n=len(very_best_minority))
very_worst_sample = balanced_refined_df[balanced_refined_df["review_type"] == 0].sample(
    n=len(very_best_minority))

sample_dfs = [very_best_minority, good_sample,
              pretty_good_sample, okay_sample, bad_sample, very_worst_sample]

# Concatenating the above dataframes into one balanced dataframe
df_balanced = pd.concat(sample_dfs)

df_balanced["review_type"].value_counts()

3    905
2    905
5    905
1    905
4    905
0    905
Name: review_type, dtype: int64
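The same downsampling can be sketched more compactly with a grouped sample, and fixing `random_state` makes the draw repeatable between runs. This assumes pandas 1.1 or newer, where `groupby(...).sample` is available. The frame `toy_df` is made-up data.

```python
import pandas as pd

# Made-up imbalanced data, for illustration only
toy_df = pd.DataFrame({
    "content": [f"review {i}" for i in range(9)],
    "review_type": [0, 0, 1, 1, 1, 1, 2, 2, 2],
})

# Size of the smallest category
n_min = toy_df["review_type"].value_counts().min()

# Draw n_min rows from every category; random_state makes it repeatable
toy_balanced = toy_df.groupby("review_type").sample(n=n_min, random_state=42)

print(toy_balanced["review_type"].value_counts().to_dict())
```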

In [47]:
balanced_refinement = DocumentClassification(df_balanced,
                                             "content",
                                             "review_type",
                                             stopwords=True)

balanced_predicted_target = []
for row in range(len(balanced_refinement.data)):
    balanced_predicted_target.append(balanced_refinement.classify_doc(
        balanced_refinement.data.iloc[row][balanced_refinement.word_loc]))

In [48]:
balanced_refinement.data["predicted_review_type"] = balanced_predicted_target
balanced_residuals = balanced_refinement.data[balanced_refinement.category_loc] == balanced_refinement.data["predicted_review_type"]
balanced_residuals.value_counts(normalize=True)

False    0.911786
True     0.088214
dtype: float64

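Alas. As we iteratively moved through our pre-processing options, we ended up with a model that predicts less accurately than the one run on the raw data we started with: about 9% on the balanced data versus about 20% initially. Lowercasing and whitespace cleanup changed almost nothing, and removing stop words helped (about 22%), so the two steps that made our model predict less accurately were the removal of punctuation (about 5%) and the balancing of categories. The earlier head() outputs hint at what is going on: every displayed review gets the same predicted category, and the 5.1% accuracy after removing punctuation matches the share of reviews in category 0.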

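For fun, let's see if a random guess is a better predictor than our original model:

In [49]:
# Note: np.random.randint excludes the upper bound, so this draws integers
# from 0 through 4 and never guesses category 5; randint(0, 6, ...) would
# cover all six categories
random_numbers = np.random.randint(0, 5, len(df))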

In [50]:
len(random_numbers)

18389

In [51]:
random_residuals = df["review_type"] == random_numbers
random_residuals.value_counts(normalize=True)

False    0.801838
True     0.198162
Name: review_type, dtype: float64
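The result above can be checked with a little arithmetic: a uniform random guess is right with probability equal to the combined share of the labels it can draw, divided by the number of those labels. In the sketch below, `rounded_shares` holds the category shares rounded from the p_categories output earlier, and `expected_uniform_accuracy` is a hypothetical helper.

```python
# Category shares, rounded from the p_categories output earlier
rounded_shares = {0: 0.0513, 1: 0.2025, 2: 0.2514, 3: 0.2528, 4: 0.1928, 5: 0.0492}


def expected_uniform_accuracy(labels_drawn):
    """Expected accuracy when guessing uniformly among labels_drawn."""
    labels = list(labels_drawn)
    return sum(rounded_shares[label] for label in labels) / len(labels)


# np.random.randint(0, 5, ...) draws from 0-4: the upper bound is exclusive
print(expected_uniform_accuracy(range(0, 5)))

# np.random.randint(0, 6, ...) would draw from all six categories
print(expected_uniform_accuracy(range(0, 6)))

# Always guessing the largest category
print(max(rounded_shares.values()))
```

The first figure, roughly 19%, is close to the 19.8% observed above; a guess over all six categories would be expected to land near 1 in 6.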

In [53]:
balanced_refinement.data.head()

Unnamed: 0,reviewid,score,content,review_type,predicted_review_type
0,22703,9.3,triphop eventually became a 90s punchline a mu...,5,3
3,22661,9.0,kleenex began with a crash it transpired one n...,5,3
28,22707,9.0,all is not well with ray charles catalog nowad...,5,3
46,22643,9.3,put on nearly any of the 36 discs in bob dylan...,5,2
50,22663,8.8,todays underground may be the answer to tomorr...,5,3
