# RACE CLASSIFICATION PAPER

Task: Build a DistilBERT classifier that takes biography text as input and predicts the person's race.

Steps:
- `Local Steps`
    - **PREPROCESSING**
        1. Preprocess only the `mini_bio` column. (Note: in my thesis paper, I combined all the text on the IMDb biography page, including trivia, family, etc., if available. [Here](https://www.imdb.com/name/nm0000329/bio/) is an example of the text I was previously training the model on.)
    
-  `Colab Steps`
    - TRAINING: filename
        1. Perform 5-fold cross-validation on the train and validation datasets
        2. Fine-tune `distilbert-base-uncased` on four race categories and save the models
        3. Use Colab because its GPU trains much faster than my local CPU, and save the models in a stable location
    - TESTING: filename
        1. Run model on the unseen test set (save predictions) and evaluate results
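
The 5-fold cross-validation step listed above can be sketched as follows. This is a minimal, standard-library illustration of stratified fold assignment, not the code used in the training notebook: the function name `stratified_folds`, the seed, and the placeholder labels `"A"` to `"D"` (standing in for the four race categories) are all assumptions. In practice, scikit-learn's `StratifiedKFold` would likely do the same job.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=0):
    """Return (train_idx, val_idx) pairs with each label spread evenly across folds."""
    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    fold_of = {}
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of this label.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_splits

    folds = []
    for fold in range(n_splits):
        val_idx = sorted(i for i, f in fold_of.items() if f == fold)
        train_idx = sorted(i for i, f in fold_of.items() if f != fold)
        folds.append((train_idx, val_idx))
    return folds

# Placeholder labels (hypothetical): 10 examples for each of four categories.
labels = ["A", "B", "C", "D"] * 10
folds = stratified_folds(labels)
for fold, (train_idx, val_idx) in enumerate(folds):
    print(fold, len(train_idx), len(val_idx))
```

Each fold serves once as the validation split while the remaining four folds form the training split, so every example is validated on exactly once.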

## Step 1: Preprocess biography text

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../script")
import text_preprocessing

root_dir = "/Users/jennywang/jw10/Summer2023/RaceClassification/data"
df = pd.read_csv(f"{root_dir}/data/final_sample_metadata.csv") # before adding tokens
df = df.fillna("")  # replace missing values with empty strings

# create preprocessed columns: a token list and its space-joined string
df["tokens"] = df["mini_bio"].apply(lambda text: text_preprocessing.preprocess(text, lemmatization=True))
df["bio_preprocessed"] = df["tokens"].apply(" ".join)
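
The `text_preprocessing` module is local to this repository and is not shown here. As a rough idea of what this kind of step does, the sketch below lowercases the text, keeps alphabetic tokens only, and drops stopwords. The function name `preprocess_sketch`, the tiny stopword list, and the sample sentence are all hypothetical, and the sketch omits lemmatization, so it is not a substitute for `text_preprocessing.preprocess`.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "as", "has", "in", "is", "of", "on", "the", "to", "was"}

def preprocess_sketch(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Made-up sample sentence, not taken from the dataset.
tokens = preprocess_sketch("Jane Doe was born in 1970 in Springfield.")
bio_preprocessed = " ".join(tokens)
print(bio_preprocessed)  # jane doe born springfield
```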


In [2]:
# clean Mini Bio text
def clean_mini_bio(text):
    return text.split("- IMDb")[0].strip()

df["bio"] = df["mini_bio"].apply(clean_mini_bio)
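
To make the behavior of `clean_mini_bio` concrete, here is a small self-contained check. The sample string is made up, and the idea that scraped text carries a trailing "- IMDb ..." suffix is an assumption inferred from the split marker in the function.

```python
def clean_mini_bio(text):
    # Keep only the text before the first "- IMDb" marker and trim whitespace.
    return text.split("- IMDb")[0].strip()

# Hypothetical sample with a trailing suffix after the biography text.
sample = "Jane Doe is a filmmaker. - IMDb Mini Biography By: anonymous"
print(clean_mini_bio(sample))  # Jane Doe is a filmmaker.

# Text without the marker is returned unchanged (apart from trimming),
# and an empty string stays empty.
print(clean_mini_bio("  No marker here.  "))  # No marker here.
print(repr(clean_mini_bio("")))  # ''
```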

In [4]:
# only keep relevant columns
df = df[["name", "href", "race", "role", "image", "bio", "bio_preprocessed"]]
# df.to_csv(f"{root_dir}/final_sample_preprocessed_new.csv", index=False)
df.to_csv(f"{root_dir}/data/cleaned_final_sample_metadata.csv", index=False)
df.head()

| | name | href | race | role | image | bio | bio_preprocessed |
|---|---|---|---|---|---|---|---|
| 0 | Ang Lee | /name/nm0000487 | Asian | Filmmaker | https://m.media-amazon.com/images/M/MV5BODA2MT... | Born in 1954 in Pingtung, Taiwan, Ang Lee has ... | bear pingtung taiwan ang lee today great conte... |
| 1 | James Wan | /name/nm1490123 | Asian | Filmmaker | https://m.media-amazon.com/images/M/MV5BMTY5Nz... | James Wan (born 26 February 1977) is an Austra... | james wan bear february australian film produc... |
| 2 | Jon M. Chu | /name/nm0160840 | Asian | Filmmaker | https://m.media-amazon.com/images/M/MV5BNDM0Nj... | Jon is an alumni of the USC School of Cinema-T... | jon alumnus usc school cinema television win p... |
| 3 | Taika Waititi | /name/nm0169806 | Asian | Filmmaker | https://m.media-amazon.com/images/M/MV5BMzk4MD... | Taika Waititi, also known as Taika Cohen, hail... | taika waititi know taika cohen hail raukokore ... |
| 4 | Karyn Kusama | /name/nm0476201 | Asian | Filmmaker | https://m.media-amazon.com/images/M/MV5BMTUzMT... | Karyn Kusama was born on March 21, 1968 in Bro... | karyn kusama bear march brooklyn new york usa ... |


# End

We've reached the end of the preprocessing step. Next, we use the biography text saved in `cleaned_final_sample_metadata.csv` to build a biography-based race classifier. See the README for the next notebook.