Book Review: Book Categorization by Review

In [1]:
"""
README

NLP: Book Categorization - Exploratory Data Analysis and Cleaning
By Kevin Lin

This notebook prepares the data for our group's Book Review Categorization project.

We use two files from Kaggle’s “Amazon Books Reviews” dataset, which contains over 3 million reviews:
	•	books_data.csv: Book metadata, including title, author, description, and categories (used as labels)
	•	Books_rating.csv: Customer reviews, including review/text, rating, and other metadata

The notebook:
	•	Merges and cleans both datasets
	•	Filters for valid entries across 16 target categories
	•	Generates stratified samples for training and testing (900 training and 100 test reviews per category):
		◦	14400_strat_samp_training.csv
		◦	1600_strat_samp_test.csv


OPTION 1: If "books_data.csv" and "Books_rating.csv" are already in the same folder
as this notebook, comment out the upload lines below.

OPTION 2: If you're using Colab and the files are not in the same directory,
run the upload lines below to upload them manually.

"""

# Run this if using Colab
from google.colab import files
uploaded = files.upload()  # Upload "books_data.csv" and "Books_rating.csv"

Saving books_data.csv to books_data.csv


In [None]:
uploaded1 = files.upload()

In [2]:
import pandas as pd
import numpy as np

In [None]:
# NOTE: This reads over 3 million rows, so it may take about 2 minutes locally
df = pd.read_csv("books_data.csv")
df1 = pd.read_csv("Books_rating.csv")

# Preview both dataframes
print(df.head(10))
print()
print(df1.head(10))

In [None]:
# Exploratory Data Analysis

# Count how many rows fall in each category
category_counts = df["categories"].value_counts()
print("--------------All Category Count--------------")
print(category_counts)

"""
We wanted to retain as many relevant categories as possible. We settled on
16 because together they cover over 50% of all datapoints, and the least
frequent of them, "Music", is still clearly distinct from the others.
"""

category_number = 16 # Number of Categories we want to RETAIN
print(f"\n------------ Top {category_number} Category Count------------")
print(category_counts.head(category_number))

# Percentage Share of DF
print(f"\n\nThe top {category_number} categories account for: {sum(category_counts.head(category_number))/sum(category_counts):.2%} of all datapoints")
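The coverage figure above can also be viewed as a cumulative curve, which shows how many top categories are needed to reach a given share. The sketch below is only an illustration on made-up labels: `toy_categories`, the 0.75 threshold, and `n_needed` are assumptions and do not come from the dataset.

```python
import pandas as pd

# Toy category labels standing in for df["categories"]; the values are made up.
toy_categories = pd.Series(
    ["Fiction"] * 5 + ["History"] * 3 + ["Music"] * 1 + ["Poetry"] * 1
)

counts = toy_categories.value_counts()      # sorted from most to least frequent
coverage = counts.cumsum() / counts.sum()   # cumulative share of all rows

# Smallest number of top categories whose combined share reaches the threshold
threshold = 0.75  # hypothetical target share
n_needed = int((coverage >= threshold).values.argmax()) + 1

print(coverage)
print(f"Categories needed to cover {threshold:.0%}: {n_needed}")
```

Running the same calculation on `category_counts` would show where the 50% mark falls relative to the 16 categories chosen here.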

In [None]:
title_category_dict = {}
target_categories = [
    'Fiction',
    'Religion',
    'History',
    'Juvenile Fiction',
    'Biography & Autobiography',
    'Business & Economics',
    'Computers',
    'Social Science',
    'Juvenile Nonfiction',
    'Science',
    'Education',
    'Cooking',
    'Sports & Recreation',
    'Family & Relationships',
    'Literary Criticism',
    'Music'
]

# Clean each category string and keep only titles in the target categories
for row_num, row in df.iterrows():
    # Remove the surrounding brackets and any stray quotation marks
    category = str(row["categories"]).strip("[]").replace("'", "").replace('"', "")
    if category in target_categories:
        title_category_dict[row["Title"]] = category

print(dict(list(title_category_dict.items())[:10]))
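Because `iterrows` visits one row at a time, the same cleanup can probably be done faster with pandas string methods. The following is a minimal sketch on toy data, assuming the category strings look like `"['Fiction']"`; `toy_meta`, `toy_targets`, and `toy_title_to_category` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy metadata; the category strings are assumed to look like "['Fiction']".
toy_meta = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "categories": ["['Fiction']", "['Music']", "['Poetry']", None],
})
toy_targets = ["Fiction", "Music"]  # hypothetical subset of target_categories

# Same cleanup as the loop: strip brackets, then remove both kinds of quotes.
cleaned = (
    toy_meta["categories"]
    .fillna("")            # missing categories become empty strings
    .astype(str)
    .str.strip("[]")
    .str.replace("'", "", regex=False)
    .str.replace('"', "", regex=False)
)

# Keep only rows whose cleaned category is one of the targets
keep = cleaned.isin(toy_targets)
toy_title_to_category = dict(zip(toy_meta.loc[keep, "Title"], cleaned[keep]))
print(toy_title_to_category)
```

As with the loop, a title that appears more than once keeps the category from its last occurrence.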

In [None]:
# Map each review's title to its category, since titles and categories live in different dataframes

df1["categories"] = df1["Title"].map(title_category_dict)
df1.dropna(subset=["categories"], inplace=True)

print(df1.head(10))
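The mapping step relies on `Series.map` returning NaN for any title that is missing from the dictionary, so `dropna` then removes every review without a target category. A small sketch on toy data shows that behavior; `toy_reviews` and `toy_lookup` are made-up stand-ins for `df1` and `title_category_dict`.

```python
import pandas as pd

# Toy reviews; "Book Z" has no entry in the lookup below.
toy_reviews = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book Z"],
    "review/text": ["Great plot.", "Lovely melodies.", "No metadata for this one."],
})
toy_lookup = {"Book A": "Fiction", "Book B": "Music"}  # stand-in for title_category_dict

# Titles missing from the dict map to NaN, which dropna then removes.
toy_reviews["categories"] = toy_reviews["Title"].map(toy_lookup)
labeled = toy_reviews.dropna(subset=["categories"])
print(labeled)
```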

In [None]:
# Print counts of each category
print("Counts of each category:")
print(df1["categories"].value_counts())


In [None]:
print(sum(df1["categories"].value_counts())) # Number of reviews remaining from the original 3 million+

In [None]:
# Create an alias for df1 (no copy is made)
imported_df1 = df1 # If the previous steps were not run, you can load the data with pd.read_csv here instead

In [None]:
target_cat_count = {
    'Fiction':0,
    'Religion':0,
    'History':0,
    'Juvenile Fiction':0,
    'Biography & Autobiography':0,
    'Business & Economics':0,
    'Computers':0,
    'Social Science':0,
    'Juvenile Nonfiction':0,
    'Science':0,
    'Education':0,
    'Cooking':0,
    'Sports & Recreation':0,
    'Family & Relationships':0,
    'Literary Criticism':0,
    'Music':0
}

print(f"Number of Target Categories: {len(target_cat_count.keys())}")

In [None]:
# Note: the other columns are kept intentionally so we can easily incorporate additional data later if necessary

imported_df1.head(1)

In [None]:
training_df = []
test_df = []

for key in target_cat_count:
    # Rows for this category, in their original file order
    category_rows = imported_df1[imported_df1["categories"] == key]

    # The first 900 rows of the category go to training
    training_df.append(category_rows[0:900])

    # The next 100 rows (positions 900-999) go to testing
    test_df.append(category_rows[900:1000])

training_df = pd.concat(training_df, axis=0)
test_df = pd.concat(test_df, axis=0)
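The cell above takes the first 1,000 rows of each category in file order. If a random draw per category is preferred, one possible alternative is `DataFrameGroupBy.sample` (available in pandas 1.1 and later). This is a sketch on toy data, not the method used in this notebook; `toy`, `n_train`, `n_test`, and the seed are assumptions.

```python
import pandas as pd

# Toy labeled reviews: 10 rows for each of two made-up categories.
toy = pd.DataFrame({
    "categories": ["Fiction"] * 10 + ["Music"] * 10,
    "review/text": [f"review {i}" for i in range(20)],
})

n_train, n_test = 6, 2  # scaled-down stand-ins for 900 and 100

# Draw n_train + n_test random rows per category (seeded for repeatability)
sampled = toy.groupby("categories").sample(n=n_train + n_test, random_state=42)

# Within each category, the first n_train sampled rows form the training set
train = sampled.groupby("categories").head(n_train)

# The remaining sampled rows form the test set
test = sampled.drop(train.index)

print(train["categories"].value_counts())
print(test["categories"].value_counts())
```

Note that `sample(n=...)` raises an error if any category has fewer rows than requested, so every category would need at least 1,000 reviews for the full-size split.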

In [None]:
# NOTE: Shuffling is required here because the rows are still ordered by category from the previous step

# Shuffle rows

training_df = training_df.sample(frac=1).reset_index(drop=True)
print(training_df.shape)

test_df = test_df.sample(frac=1).reset_index(drop=True)
print(test_df.shape)

# Write to CSV
training_df.to_csv("14400_strat_samp_training.csv", index=False)
test_df.to_csv("1600_strat_samp_test.csv", index=False)
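Before relying on the exported files, it may be worth confirming that each split really holds the same number of rows for every category. The helper below is a hypothetical check, shown on toy frames rather than the real `training_df` and `test_df`.

```python
import pandas as pd

# Toy splits standing in for training_df and test_df
toy_train = pd.DataFrame({"categories": ["Fiction"] * 3 + ["Music"] * 3})
toy_test = pd.DataFrame({"categories": ["Fiction", "Music"]})


def is_balanced(frame, per_category):
    """Return True if every category appears exactly per_category times."""
    counts = frame["categories"].value_counts()
    return bool((counts == per_category).all())


print(is_balanced(toy_train, 3))
print(is_balanced(toy_test, 1))
```

On the real data, the equivalent calls would be `is_balanced(training_df, 900)` and `is_balanced(test_df, 100)`.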