# ICWSM: RACE CLASSIFICATION PAPER (CoLab Steps)

Task: Build a Distilbert classifier that takes in biography text and makes predictions about the person's race.

Notes:
1. *Version 1* fine-tunes DistilBERT on **preprocessed** biography text. We do not conduct hyperparameter optimization and only use the accuracy metric.
2. *Version 2* fine-tunes DistilBERT on **preprocessed** biography text. We conduct hyperparameter optimization across epochs {2,3,4,5}, learning rate {1e-5,1e-4,1e-3}, weight decay {None,1e-4}, batch size {8,16}.
3. *Version 3* fine-tunes DistilBERT on **non-preprocessed** AKA **original** biography text. We conduct hyperparameter optimization across epochs {2,3,4}, learning rate {5e-5,2e-5,1e-5}, weight decay {None,1e-4}, batch size {8,16}.
_____
4. *Final Version* fine-tunes DistilBERT on non-preprocessed biography text. It uses the hyperparameters with the best average F1 score from the nested cross validation as denoted [here](https://docs.google.com/spreadsheets/d/1YTTPvM-19XDaCplRDJ9Fk3zMA2KLY2eRYhgHUPkha6Q/edit#gid=507982372), and it generates model prediction probabilities for each fold.

### Colab Steps
- **TRAINING: filename**
  1. Perform 5-fold cross-validation on train, validation datasets
  2. Finetune `distilbert-base-uncased` using four race categories, and save models
  3. We needed to use Colab so that the GPU could run faster than my local CPU... Save model in a stable location
- TESTING: filename
  1. Run model on the unseen test set (save predictions) and evaluate results

# Version 1: K-Fold Cross-Validation (No Hyperparameter Optimization, Accuracy Metric)

In [None]:
!pip install transformers
from sklearn.model_selection import StratifiedKFold
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import tensorflow as tf
from google.colab import drive

drive.mount('/content/drive')
# in colab, create a directory under runtime
root_dir = "/content/drive/MyDrive/Undergrad/Summer 2023/Race Classification/biography"

df = pd.read_csv(f"{root_dir}/cleaned_final_sample_metadata.csv", header=0) # if needed to read in
df["bio_preprocessed"] = df["bio_preprocessed"].replace(np.nan, "", regex=True)
df['true_race_cat'] = df['race'].astype('category').cat.codes # 0: asian, 1: black, 2: hispanic, 3: white

X = df["bio_preprocessed"].to_numpy() # Features (text not-tokenized yet)
y = df["true_race_cat"].to_numpy() # Labels

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m14.2 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m26.2 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m15.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m30.2 MB/s[0m eta [36m0:00:0

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

In [None]:
# k fold cross-validation loop
for i, (train_index, val_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={val_index}")

    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # save test indices for testing
    test_set = pd.DataFrame({'val_index': val_index, 'x': X_val, 'y': y_val})
    test_set.to_csv(f"{root_dir}/preprocessed/test-set-{i}.csv", index=None)

    train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, return_tensors='tf')
    val_encodings = tokenizer(X_val.tolist(), truncation=True, padding=True, return_tensors='tf')

    # Convert the tokenized encodings to numpy arrays
    X_train = np.asarray(train_encodings['input_ids'])
    X_val = np.asarray(val_encodings['input_ids'])
    y_train = np.asarray(y_train)
    y_val = np.asarray(y_val)

    # OPTIMIZER HYPERPARAMETERS SET: cross entropy loss (default), Adam optimizer, learning rate = 0.00002, epochs = 3, batch size = 16
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, metrics=['accuracy'])

    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=8)

    # Evaluate the model on validation data for this fold
    val_results = model.evaluate(X_val, y_val)
    print(val_results)

    # Save model
    model.save_pretrained(f"{root_dir}/distilbert-{i}")
    tokenizer.save_pretrained(f"{root_dir}/distilbert-{i}")

Fold 0:
  Train: index=[   0    1    2 ... 5197 5198 5200]
  Test:  index=[   4   12   13 ... 5179 5180 5199]
Epoch 1/3
Epoch 2/3
Epoch 3/3
[0.3889923691749573, 0.8780019283294678]
Fold 1:
  Train: index=[   0    1    2 ... 5197 5198 5199]
  Test:  index=[  20   22   27 ... 5192 5196 5200]
Epoch 1/3
Epoch 2/3
Epoch 3/3
[0.1383700966835022, 0.9567307829856873]
Fold 2:
  Train: index=[   1    2    3 ... 5197 5199 5200]
  Test:  index=[   0    7    9 ... 5186 5187 5198]
Epoch 1/3
Epoch 2/3
Epoch 3/3
[0.05838023126125336, 0.9846153855323792]
Fold 3:
  Train: index=[   0    3    4 ... 5198 5199 5200]
  Test:  index=[   1    2    6 ... 5189 5193 5194]
Epoch 1/3
Epoch 2/3
Epoch 3/3
[0.028116613626480103, 0.9903846383094788]
Fold 4:
  Train: index=[   0    1    2 ... 5198 5199 5200]
  Test:  index=[   3    5    8 ... 5191 5195 5197]
Epoch 1/3
Epoch 2/3
Epoch 3/3
[0.08298975974321365, 0.9769230484962463]


## What's happening above

---

In the above cell, I define my DistilBERT base model and tokenizer, and I split the data into 5 separate folds. Next, I run a for-loop that takes 4 folds as training data and 1 fold as validation data. Within each loop, I evaluate the model fitted on the training data on the validation set. Then I save the model in Google Drive.


## Validation Set Metrics Obtained:

Fold 0:
Validation Metrics - loss, accuracy
[0.3889923691749573, 0.8780019283294678]

Fold 1:
Validation Metrics - loss, accuracy
[0.1383700966835022, 0.9567307829856873]

Fold 2:
Validation Metrics - loss, accuracy
[0.05838023126125336, 0.9846153855323792]

Fold 3:
Validation Metrics - loss, accuracy
[0.028116613626480103, 0.9903846383094788]

Fold 4:
Validation Metrics - loss, accuracy
[0.08298975974321365, 0.9769230484962463]

## Next Step: Testing
Although we evaluated the accuracy metric in each step, we want to see how the model performed in each category. This step is done in `Bio_Race_Classification-Testing.ipynb`

# Version 2: Nested Cross-Validation (preprocessed data)

We conduct nested cross validation based on the procedure described [here](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/). This allows us to tune the hyperparameters. In the previous model training, we assume hyperparameters.

In [None]:
!pip install transformers
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
from google.colab import drive

drive.mount('/content/drive')
# in colab, create a directory under runtime
root_dir = "/content/drive/MyDrive/Undergrad/Summer 2023/Race Classification/biography"

df = pd.read_csv(f"{root_dir}/cleaned_final_sample_metadata.csv", header=0) # if needed to read in
df["bio_preprocessed"] = df["bio_preprocessed"].replace(np.nan, "", regex=True)
df['true_race_cat'] = df['race'].astype('category').cat.codes # 0: asian, 1: black, 2: hispanic, 3: white

# X = df.drop(columns=["true_race_cat"]) # Features (text not-tokenized yet)
X = df["bio_preprocessed"].to_numpy() # Features (text not-tokenized yet)
y = df["true_race_cat"].to_numpy() # Labels

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [None]:
from sklearn.metrics import classification_report, f1_score
import tensorflow as tf

def f1_metric(y_true, y_pred):
    # Define a wrapper function that computes F1 score using NumPy
    def f1_numpy(y_true, y_pred):
        return f1_score(y_true, np.argmax(y_pred, axis=-1), average='weighted')

    # Use tf.py_function to apply the NumPy-based F1 computation
    f1 = tf.py_function(func=f1_numpy, inp=[y_true, y_pred], Tout=tf.float32)
    return f1

In [None]:
# k fold cross-validation loop
for i, (train_index, val_index) in enumerate(skf.split(X, y)):
    if i == 0:
      continue

    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={val_index}")

    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # save test indices for testing
    test_set = pd.DataFrame({'val_index': val_index, 'x': X_val, 'y': y_val})
    test_set.to_csv(f"{root_dir}/preprocessed/test-set-{i}.csv", index=None)

    train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, return_tensors='tf')
    val_encodings = tokenizer(X_val.tolist(), truncation=True, padding=True, return_tensors='tf')

    # Convert the tokenized encodings to numpy arrays
    X_train = np.asarray(train_encodings['input_ids'])
    X_val = np.asarray(val_encodings['input_ids'])
    y_train = np.asarray(y_train)
    y_val = np.asarray(y_val)

    # GRID SEARCH OF HYPERPARAMETERS
    grid = {
        "learning_rate": [1e-5, 1e-4, 0.001],
        "weight_decay": [None, 1e-4],
        "epochs": [2,3,4,5],
        "batch_size": [8, 16]
      }

    best_f1 = 0
    best_param = [-1, -1, -1, -1]
    for i in grid["learning_rate"]:
      for j in grid["weight_decay"]:
        for k in grid["epochs"]:
          for l in grid["batch_size"]:
            hyperparam_txt = f"optimizer Adam, learning rate {i}, weight decay {j}, epochs {k}, batch_size {l}"
            print(hyperparam_txt)

            # Compile with hyperparameters
            optimizer = tf.keras.optimizers.Adam(learning_rate=i, weight_decay=j)
            model.compile(optimizer=optimizer, metrics=['accuracy', f1_metric])
            model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=k, batch_size=l)

            # Evaluate the model on validation data for this fold
            val_results = model.evaluate(X_val, y_val)
            print(val_results)
            if val_results[2] > best_f1:
              best_param = [i, j, k, l]

            # Write line to file
            with open(f"/{root_dir}/nested_cross_val/fold1_results.txt", "a") as out_file:
                out_file.write(hyperparam_txt)
                out_file.write(f"{val_results}")
                out_file.write(f"\n")

    print(best_f1)
    print("best param", best_param)
    # # Save model
    # model.save_pretrained(f"{root_dir}/nested_cross_val/distilbert-{i}")
    # tokenizer.save_pretrained(f"{root_dir}/nested_cross_val/distilbert-{i}")

# Version 3: Nested Cross-Validation (no preprocessing)

We conduct nested cross validation based on the procedure described [here](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/). This allows us to tune the hyperparameters. We fine tune the model without preprocessing the data.

In [None]:
!pip install transformers
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
from google.colab import drive

drive.mount('/content/drive')
# in colab, create a directory under runtime
# root_dir = "/content/drive/MyDrive/Undergrad/Summer 2023/Race Classification/biography" # jw10
root_dir = "/content/drive/MyDrive/Race Classification/biography" # jw2019 and itsjust

df = pd.read_csv(f"{root_dir}/cleaned_final_sample_metadata.csv", header=0) # if needed to read in
df["bio"] = df["bio"].replace(np.nan, "", regex=True)
df['true_race_cat'] = df['race'].astype('category').cat.codes # 0: asian, 1: black, 2: hispanic, 3: white

# X = df.drop(columns=["true_race_cat"]) # Features (text not-tokenized yet)
X = df["bio"].to_numpy() # Features (text not-tokenized yet)
y = df["true_race_cat"].to_numpy() # Labels

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m22.4 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m32.2 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m62.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m70.3 MB/s[0m eta [36m0:00:0

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

In [None]:
from sklearn.metrics import classification_report, f1_score
import tensorflow as tf

def f1_metric(y_true, y_pred):
    # Define a wrapper function that computes F1 score using NumPy
    def f1_numpy(y_true, y_pred):
        return f1_score(y_true, np.argmax(y_pred, axis=-1), average='weighted')

    # Use tf.py_function to apply the NumPy-based F1 computation
    f1 = tf.py_function(func=f1_numpy, inp=[y_true, y_pred], Tout=tf.float32)
    return f1

In [None]:
# k fold cross-validation loop
for fold, (train_index, val_index) in enumerate(skf.split(X, y)):
    if fold != 3:
      continue
    print(f"Fold {fold}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={val_index}")

    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # save test indices for testing
    test_set = pd.DataFrame({'val_index': val_index, 'x': X_val, 'y': y_val})
    test_set.to_csv(f"{root_dir}/non-preprocessed/test-set-{fold}.csv", index=None)

    train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, return_tensors='tf')
    val_encodings = tokenizer(X_val.tolist(), truncation=True, padding=True, return_tensors='tf')

    # Convert the tokenized encodings to numpy arrays
    X_train = np.asarray(train_encodings['input_ids'])
    X_val = np.asarray(val_encodings['input_ids'])
    y_train = np.asarray(y_train)
    y_val = np.asarray(y_val)

    # GRID SEARCH OF HYPERPARAMETERS
    grid = {
        "learning_rate": [2e-5, 5e-5],
        "weight_decay": [None, 1e-4],
        "epochs": [2,3,4],
        "batch_size": [8, 16]
      }

    best_f1 = 0
    best_param = [-1, -1, -1, -1]
    for i in grid["learning_rate"]:
      for j in grid["weight_decay"]:
        for k in grid["epochs"]:
          for l in grid["batch_size"]:
            if not (i == 5e-5 and j == 1e-4 and k in {3,4}):
              continue
            hyperparam_txt = f"optimizer Adam, learning rate {i}, weight decay {j}, epochs {k}, batch_size {l}"
            print(hyperparam_txt)

            # Compile with hyperparameters
            optimizer = tf.keras.optimizers.Adam(learning_rate=i, weight_decay=j)
            model.compile(optimizer=optimizer, metrics=['accuracy', f1_metric])
            model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=k, batch_size=l)

            # Evaluate the model on validation data for this fold
            val_results = model.evaluate(X_val, y_val)
            print(val_results)
            if val_results[2] > best_f1:
              best_param = [i, j, k, l]

            # Write line to file
            with open(f"/{root_dir}/nested_cross_val/noprep-results.csv", "a") as out_file:
                results_txt = f"{val_results}"
                out_file.write(f"{fold},{i},{j},{k},{l},{results_txt[1:len(results_txt)-1]}")
                out_file.write(f"\n")
    print(best_f1)
    print("best param", best_param)
    # # Save model
    # model.save_pretrained(f"{root_dir}/nested_cross_val/distilbert-{i}")
    # tokenizer.save_pretrained(f"{root_dir}/nested_cross_val/distilbert-{i}")

Fold 3:
  Train: index=[   0    3    4 ... 5198 5199 5200]
  Test:  index=[   1    2    6 ... 5189 5193 5194]
optimizer Adam, learning rate 5e-05, weight decay 0.0001, epochs 3, batch_size 8
Epoch 1/3
Epoch 2/3
Epoch 3/3
[0.4448140859603882, 0.8586538434028625, 0.9191582202911377]
optimizer Adam, learning rate 5e-05, weight decay 0.0001, epochs 3, batch_size 16
Epoch 1/3
Epoch 2/3
Epoch 3/3
[0.5790050029754639, 0.8711538314819336, 0.927556037902832]
optimizer Adam, learning rate 5e-05, weight decay 0.0001, epochs 4, batch_size 8
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
[0.5733615159988403, 0.8663461804389954, 0.9243249893188477]
optimizer Adam, learning rate 5e-05, weight decay 0.0001, epochs 4, batch_size 16
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
[0.8757795691490173, 0.8403846025466919, 0.9066438674926758]
0
best param [5e-05, 0.0001, 4, 16]


# Final Version: Training Model on Chosen Hyperparameters


hyperparam = {
    "lr": 5e-5,
    "weight_decay": 1e-4,
    "epochs": 2,
    "batch_size": 16
}

 ## Model Training

In [2]:
!pip install transformers
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
from google.colab import drive

drive.mount('/content/drive')
# in colab, create a directory under runtime
# root_dir = "/content/drive/MyDrive/Undergrad/Summer 2023/Race Classification/biography" # jw10
root_dir = "/content/drive/MyDrive/Race Classification/biography" # jw2019 and itsjust

df = pd.read_csv(f"{root_dir}/cleaned_final_sample_metadata.csv", header=0) # if needed to read in
df["bio"] = df["bio"].replace(np.nan, "", regex=True)
df['true_race_cat'] = df['race'].astype('category').cat.codes # 0: asian, 1: black, 2: hispanic, 3: white

# X = df.drop(columns=["true_race_cat"]) # Features (text not-tokenized yet)
X = df["bio"].to_numpy() # Features (text not-tokenized yet)
y = df["true_race_cat"].to_numpy() # Labels

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m43.1 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m27.6 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m69.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m69.1 MB/s[0m eta [36m0:00:0

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

In [3]:
from sklearn.metrics import classification_report, f1_score
import tensorflow as tf

def f1_metric(y_true, y_pred):
    # Define a wrapper function that computes F1 score using NumPy
    def f1_numpy(y_true, y_pred):
        return f1_score(y_true, np.argmax(y_pred, axis=-1), average='weighted')

    # Use tf.py_function to apply the NumPy-based F1 computation
    f1 = tf.py_function(func=f1_numpy, inp=[y_true, y_pred], Tout=tf.float32)
    return f1

In [4]:
# k fold cross-validation loop
for i, (train_index, val_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={val_index}")

    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # save test indices for testing
    test_set = pd.DataFrame({'val_index': val_index, 'x': X_val, 'y': y_val})
    test_set.to_csv(f"{root_dir}/BioRaceBERT/BioRaceBERT-test-{i}.csv", index=None)

    train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, return_tensors='tf')
    val_encodings = tokenizer(X_val.tolist(), truncation=True, padding=True, return_tensors='tf')

    # Convert the tokenized encodings to numpy arrays
    X_train = np.asarray(train_encodings['input_ids'])
    X_val = np.asarray(val_encodings['input_ids'])
    y_train = np.asarray(y_train)
    y_val = np.asarray(y_val)

    hyperparam = {
        "lr": 5e-5,
        "weight_decay": 1e-4,
        "epochs": 2,
        "batch_size": 16
    }

    optimizer = tf.keras.optimizers.Adam(learning_rate=hyperparam["lr"], weight_decay=hyperparam["weight_decay"])
    model.compile(optimizer=optimizer, metrics=[f1_metric])

    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=hyperparam["epochs"], batch_size=hyperparam["batch_size"])

    # Evaluate the model on validation data for this fold
    val_results = model.evaluate(X_val, y_val)
    print(val_results)

    # Save model
    model.save_pretrained(f"{root_dir}/BioRaceBERT/BioRaceBERT-{i}")
    tokenizer.save_pretrained(f"{root_dir}/BioRaceBERT/BioRaceBERT-{i}")

Fold 0:
  Train: index=[   0    1    2 ... 5197 5198 5200]
  Test:  index=[   4   12   13 ... 5179 5180 5199]
Epoch 1/2
Epoch 2/2
[0.349795401096344, 0.9356135129928589]
Fold 1:
  Train: index=[   0    1    2 ... 5197 5198 5199]
  Test:  index=[  20   22   27 ... 5192 5196 5200]
Epoch 1/2
Epoch 2/2
[0.1637207716703415, 0.9734432101249695]
Fold 2:
  Train: index=[   1    2    3 ... 5197 5199 5200]
  Test:  index=[   0    7    9 ... 5186 5187 5198]
Epoch 1/2
Epoch 2/2
[0.08237297087907791, 0.9882842302322388]
Fold 3:
  Train: index=[   0    3    4 ... 5198 5199 5200]
  Test:  index=[   1    2    6 ... 5189 5193 5194]
Epoch 1/2
Epoch 2/2
[0.04097912088036537, 0.9946513175964355]
Fold 4:
  Train: index=[   0    1    2 ... 5198 5199 5200]
  Test:  index=[   3    5    8 ... 5191 5195 5197]
Epoch 1/2
Epoch 2/2
[0.01946006901562214, 0.9970985054969788]


# End

Please see README for next notebook.