<a href="https://colab.research.google.com/github/hnm15/DS703/blob/main/Homework_09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Homework 9: Text Classification with Fine-Tuned BERT

### Due: Midnight on November 5th (with 2-hour grace period) — Worth 85 points

In this final homework, we’ll explore **fine-tuning a pre-trained Transformer model (BERT)** for text classification using the **IMDB Movie Review** dataset. You’ll begin with a working baseline notebook and then conduct a series of controlled experiments to understand how data size, context length, and model architecture affect performance.

You’ll complete three problems:

* **Problem 1:** Evaluate how **sequence length** and **learning rate** jointly influence validation loss and generalization.
* **Problem 2:** Measure how **training data size** affects both model performance and total training time.
* **Problem 3:** Compare **two models** from the BERT family (DistilBERT and BERT-base) to analyze the trade-offs between model size and accuracy on this dataset.

In each problem, you’ll report your key metrics, summarize what you observed, and reflect on what you learned.

> **Note:** This homework was developed and tested on **Google Colab**, due to version conflicts when running locally. It is **strongly recommended** that you complete your work on Colab as well.

There are 6 graded questions (two per problem), each worth 14 points, and you get one free point if you complete the entire homework.


In [1]:
# Install once per new Colab runtime
%pip -q install -U keras keras-hub tensorflow tensorflow-text datasets evaluate


In [2]:

import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import time
import random
import numpy as np
import keras
import keras_hub as kh
import evaluate
from datasets import load_dataset, Dataset, Features, Value, ClassLabel

from keras import mixed_precision                    # generally faster
mixed_precision.set_global_policy("mixed_float16")

### Here is where you can set global hyperparameters for this homework

In [3]:
# ---------------- Config ----------------
SEED        = 42
MAX_LEN     = 128
EPOCHS      = 3
BATCH       = 32
EVAL_BATCH  = 64
SUBSET_FRAC = 0.25   # <-- 0.25 to train and test on 25% of whole dataset during development;  set to 1.0 for full dataset

keras.utils.set_random_seed(SEED)

### Load and Preprocess the IMDB Movie Review Dataset

In [4]:
# ---- Load IMDb (raw), join train+test ----
imdb   = load_dataset("imdb")
texts  = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

# ---- Build DS with explicit features (label=ClassLabel) ----
features = Features({"text": Value("string"),
                     "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

# ---- Optional: take a stratified subset of the FULL dataset ----
if 0.0 < SUBSET_FRAC < 1.0:
    sub = all_ds.train_test_split(train_size=SUBSET_FRAC, seed=SEED, stratify_by_column="label")
    ds_pool = sub["train"]
else:
    ds_pool = all_ds

# ---- Stratified 80/10/10 split on the (possibly smaller) pool ----
# First: 80/20 train+val pool / test
splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
train_val_pool, test_ds = splits["train"], splits["test"]
# Then: carve 10% of full (i.e., 0.125 of the 80% pool) as validation
splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
train_ds, val_ds = splits2["train"], splits2["test"]

# ---- Numpy arrays for Keras fit/predict ----
X_tr = np.array(train_ds["text"], dtype=object); y_tr = np.array(train_ds["label"], dtype="int32")
X_va = np.array(val_ds["text"],   dtype=object); y_va = np.array(val_ds["label"],   dtype="int32")
X_te = np.array(test_ds["text"],  dtype=object); y_te = np.array(test_ds["label"],  dtype="int32")

# ---- Quick summary ----
def _counts(ds):
    arr = np.array(ds["label"], dtype=int)
    return len(arr), np.bincount(arr, minlength=2).tolist()
print(f"Pool after SUBSET_FRAC={SUBSET_FRAC}: {len(ds_pool)} (of {len(all_ds)})")
print("Train:", _counts(train_ds), " Val:", _counts(val_ds), " Test:", _counts(test_ds))


Pool after SUBSET_FRAC=0.25: 12500 (of 50000)
Train: (8750, [4375, 4375])  Val: (1250, [625, 625])  Test: (2500, [1250, 1250])
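The split sizes printed above follow from simple arithmetic on the nested 80/10/10 split. The sketch below reproduces them; the helper name `split_sizes` is hypothetical, and the rounding used by `train_test_split` may differ by one example for sizes that do not divide evenly.

```python
def split_sizes(total, subset_frac, test_frac=0.20, val_frac_of_pool=0.125):
    """Return (pool, train, val, test) sizes for the nested split used above."""
    pool = round(total * subset_frac)          # stratified subset of the full data
    test = round(pool * test_frac)             # 20% of the pool held out for test
    train_val = pool - test                    # remaining 80%
    val = round(train_val * val_frac_of_pool)  # 0.125 of 80% = 10% of the pool
    train = train_val - val
    return pool, train, val, test


print(split_sizes(50_000, 0.25))  # (12500, 8750, 1250, 2500)
print(split_sizes(50_000, 1.00))  # (50000, 35000, 5000, 10000)
```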


### Build and Train a Baseline DistilBERT Text Classifier

In [5]:
# ---- Keras Hub preprocessor + classifier ----
preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
    "distil_bert_base_en_uncased", sequence_length=MAX_LEN
)
model = kh.models.DistilBertTextClassifier.from_preset(
    "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
)

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

start = time.time()

# ---- Train with early stopping (restore best val weights) ----
cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
history = model.fit(
    X_tr, y_tr,
    validation_data=(X_va, y_va),
    epochs=EPOCHS,
    batch_size=BATCH,
    callbacks=cb,
    verbose=1,
)

# ---- Evaluate (accuracy + F1 via `evaluate`) ----
logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
y_pred = logits.argmax(axis=-1)

acc_metric = evaluate.load("accuracy")
f1_metric  = evaluate.load("f1")
acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

# Tiny confusion matrix helper (no sklearn needed)
def confusion_matrix_np(y_true, y_pred, num_classes=2):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

print(f"\nValidation acc (best epoch): {history.history['val_acc'][np.argmin(history.history['val_loss'])]:.3f}")
print(f"\nTest accuracy: {acc:.3f}   Test F1: {f1:.3f}")
print("\nConfusion matrix:\n", confusion_matrix_np(y_te, y_pred))

end = time.time() - start
print("\nElapsed time:", time.strftime("%H:%M:%S", time.gmtime(end)))

Epoch 1/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 118s 240ms/step - acc: 0.7829 - loss: 0.4529 - val_acc: 0.8376 - val_loss: 0.3449
Epoch 2/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 10s 34ms/step - acc: 0.8787 - loss: 0.2896 - val_acc: 0.8584 - val_loss: 0.3402
Epoch 3/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 9s 33ms/step - acc: 0.9158 - loss: 0.2207 - val_acc: 0.8592 - val_loss: 0.3551



Validation acc (best epoch): 0.858

Test accuracy: 0.855   Test F1: 0.851

Confusion matrix:
 [[1097  153]
 [ 210 1040]]

Elapsed time: 00:02:43
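As a cross-check, the reported test accuracy and F1 can be recomputed by hand from the confusion matrix above. This is a minimal sketch that assumes rows are true classes and columns are predicted classes (as in `confusion_matrix_np`) and that F1 is computed for the positive class.

```python
# Confusion matrix copied from the output above.
cm = [[1097, 153],   # true NEG: predicted NEG, predicted POS
      [210, 1040]]   # true POS: predicted NEG, predicted POS

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the reviews predicted POS, how many were POS
recall = tp / (tp + fn)      # of the true POS reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```

The accuracy and F1 values agree with the `0.855` and `0.851` printed by the baseline cell.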


# Problem 1 — Mini sweep: context length × learning rate (6 runs)

In this problem we'll see how much **context length** (`MAX_LEN`) helps, and how sensitive fine-tuning is to **learning rate**—without running a huge grid.

## Setup (keep these fixed)

* `SUBSET_FRAC = 0.25`               # use only this fraction of the whole dataset
* `EPOCHS = 3`
* `BATCH = 32` (but see the smaller batch sizes for 256 and 512 below)
* **EarlyStopping** with `restore_best_weights=True`
* Same random `SEED` for all runs
* Same data split for all runs (don’t reshuffle between runs)
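To make the early-stopping setting above concrete, here is a simplified pure-Python sketch of a patience rule. It is an illustration only, not Keras's implementation; the function name `early_stopping_trace` and the loss values are hypothetical.

```python
def early_stopping_trace(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch), both 0-indexed.

    stopped_epoch is None when patience never runs out.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# Illustrative losses: epoch 0 is best, so patience runs out at epoch 2.
print(early_stopping_trace([0.35, 0.36, 0.38]))  # (0, 2)
# Loss still improving at the end: nothing triggers.
print(early_stopping_trace([0.50, 0.40, 0.30]))  # (2, None)
```

Under this simplified rule, with `EPOCHS = 3` and `patience = 2` the stop can only trigger at the final epoch, so it never shortens these runs.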

### Run these 6 configurations

**For each** `MAX_LEN ∈ {128, 256, 512}`, try **two** learning rates:

* **MAX_LEN = 128**

  * `(LR = 2e-5, BATCH = 32)` – healthy default for shorter contexts.
  * `(LR = 1e-5, BATCH = 32)` – conservative LR; often a touch stabler.

* **MAX_LEN = 256**

  * `(LR = 1e-5, BATCH = 16)` – longer context → lower batch.
  * `(LR = 7.5e-6, BATCH = 16)` – even steadier if loss is noisy.

* **MAX_LEN = 512**  *(heavier quadratic attention cost)*

  * `(LR = 7.5e-6, BATCH = 8)` – safe starting point.
  * `(LR = 5e-6, BATCH = 8)` – extra caution for stability.

**If you hit an Out Of Memory error:**

* At **256** with `BATCH = 16`, drop to `BATCH = 8`.
* At **512** with `BATCH = 8`, drop to `BATCH = 4`.
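The batch sizes in the list above halve each time `MAX_LEN` doubles, which keeps the number of tokens per batch constant. The sketch below works through that arithmetic; the attention-pair count is only a rough proxy for the quadratic attention cost and ignores every other part of the model.

```python
import math

TRAIN_SIZE = 8750  # training examples at SUBSET_FRAC = 0.25, per the split summary
CONFIGS = [(128, 32), (256, 16), (512, 8)]  # (MAX_LEN, BATCH) pairs listed above

rows = []
for max_len, batch in CONFIGS:
    steps_per_epoch = math.ceil(TRAIN_SIZE / batch)
    tokens_per_batch = max_len * batch
    # Self-attention compares every token with every other token in a sequence.
    attention_pairs = batch * max_len ** 2
    rows.append((max_len, batch, steps_per_epoch, tokens_per_batch, attention_pairs))
    print(f"MAX_LEN={max_len:3d} BATCH={batch:2d} steps/epoch={steps_per_epoch:4d} "
          f"tokens/batch={tokens_per_batch} attention pairs/batch={attention_pairs}")
```

The step counts for batch sizes 32 and 8 match the `274/274` and `1094/1094` progress lines in the training logs.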


Then answer the graded questions.


In [6]:
# Your code here; add as many cells as you need

imdb = load_dataset("imdb")
texts = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

features = Features({"text": Value("string"), "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

sub = all_ds.train_test_split(train_size=SUBSET_FRAC, seed=SEED, stratify_by_column="label")
ds_pool = sub["train"]

splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
train_val_pool, test_ds = splits["train"], splits["test"]
splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
train_ds, val_ds = splits2["train"], splits2["test"]

X_tr = np.array(train_ds["text"], dtype=object); y_tr = np.array(train_ds["label"], dtype="int32")
X_va = np.array(val_ds["text"], dtype=object); y_va = np.array(val_ds["label"], dtype="int32")
X_te = np.array(test_ds["text"], dtype=object); y_te = np.array(test_ds["label"], dtype="int32")

configs = [
    (128, 2e-5, 32), (128, 1e-5, 32),
    (256, 1e-5, 16), (256, 7.5e-6, 16),
    (512, 7.5e-6, 8), (512, 5e-6, 8),
]

results = []

for i, (MAX_LEN, LR, BATCH) in enumerate(configs, 1):
    print(f"\n{'='*60}\nRun {i}/6: MAX_LEN={MAX_LEN}, LR={LR:.1e}, BATCH={BATCH}\n{'='*60}")

    keras.utils.set_random_seed(SEED)
    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased", sequence_length=MAX_LEN)
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc)

    model.compile(optimizer=keras.optimizers.Adam(LR),
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")])

    history = model.fit(X_tr, y_tr, validation_data=(X_va, y_va), epochs=EPOCHS,
                       batch_size=BATCH, verbose=1,
                       callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss",
                                 patience=2, restore_best_weights=True)])

    val_acc = history.history['val_acc'][np.argmin(history.history['val_loss'])]

    logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
    y_pred = logits.argmax(axis=-1)
    test_acc = evaluate.load("accuracy").compute(predictions=y_pred, references=y_te)["accuracy"]
    test_f1 = evaluate.load("f1").compute(predictions=y_pred, references=y_te)["f1"]

    results.append((MAX_LEN, LR, BATCH, val_acc, test_acc, test_f1))
    print(f"Val Acc: {val_acc:.4f} | Test Acc: {test_acc:.4f} | Test F1: {test_f1:.4f}")

    keras.backend.clear_session()

print(f"\n{'='*70}\nSUMMARY\n{'='*70}")
for i, (ml, lr, bs, va, ta, tf) in enumerate(results, 1):
    print(f"{i}. LEN={ml} LR={lr:.1e} BATCH={bs} → Val={va:.4f} Test={ta:.4f} F1={tf:.4f}")

best = max(results, key=lambda x: x[3])
print(f"\nBEST: MAX_LEN={best[0]}, LR={best[1]:.1e}, BATCH={best[2]}")
print(f"\na1a = {best[3]:.4f}")


Run 1/6: MAX_LEN=128, LR=2.0e-05, BATCH=32
Epoch 1/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 82s 148ms/step - acc: 0.8079 - loss: 0.4107 - val_acc: 0.8392 - val_loss: 0.3526
Epoch 2/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 10s 34ms/step - acc: 0.8963 - loss: 0.2543 - val_acc: 0.8464 - val_loss: 0.3582
Epoch 3/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 9s 33ms/step - acc: 0.9318 - loss: 0.1777 - val_acc: 0.8624 - val_loss: 0.3766
Val Acc: 0.8392 | Test Acc: 0.8424 | Test F1: 0.8531

Run 2/6: MAX_LEN=128, LR=1.0e-05, BATCH=32
Epoch 1/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 120s 242ms/step - acc: 0.7799 - loss: 0.4543 - val_acc: 0.8584 - val_loss: 0.3387
Epoch 2/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 12s 42ms/step - acc: 0.8822 - loss: 0.2864 - val_acc: 0.8616 - val_loss: 0.3412
Epoch 3/3
274/274 ━━━━━━━━━━━━━━━━━━━━ 12s 42ms/step - acc

### Graded Questions

In [7]:
# Set a1a to the validation accuracy at min validation loss for your best configuration found in this problem

a1a = 0.9144             # Replace 0.0 with your answer

In [8]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a1a = {a1a:.4f}')

a1a = 0.9144
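The graded value is the validation accuracy at the epoch with the lowest validation loss, which is not always the highest validation accuracy seen during training. The sketch below shows the difference using the logged values from Run 1; the helper name `val_acc_at_min_val_loss` is hypothetical.

```python
def val_acc_at_min_val_loss(history):
    """Return (best_epoch, val_acc), where best_epoch (0-indexed) minimizes val_loss."""
    losses = history["val_loss"]
    best_epoch = min(range(len(losses)), key=losses.__getitem__)
    return best_epoch, history["val_acc"][best_epoch]


# Values copied from the Run 1 log (MAX_LEN=128, LR=2e-5, BATCH=32).
run1 = {
    "val_loss": [0.3526, 0.3582, 0.3766],
    "val_acc": [0.8392, 0.8464, 0.8624],
}

best_epoch, acc = val_acc_at_min_val_loss(run1)
print(f"epoch {best_epoch + 1}: val_acc at min val_loss = {acc:.4f}")  # epoch 1: 0.8392
print(f"max val_acc over all epochs = {max(run1['val_acc']):.4f}")    # 0.8624
```

Here the two numbers differ by more than two points, so it matters which one is reported.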


#### Question a1b:

* Does **more context** (128 → 256 → 512) consistently help?
* How much effect did the learning rate have on the validation accuracy?


#### Your Answer Here:
More context consistently helped validation accuracy. Going from 128 to 256 tokens improved it by about 5 percentage points, which is a large jump, and going from 256 to 512 tokens added another 0.8 points, for an overall improvement of 5.6 points from the baseline to 512 tokens. The learning rate had a minimal effect: it changed validation accuracy by only about 0.3 points. Context length is therefore a more important factor than learning rate in this problem.

## Problem 2 — How much data is enough?

In this problem, you’ll investigate how model performance scales with dataset size.

**Setup.**
Use the best `MAX_LEN` and `LR` values you found in **Problem 1**.

**What to do:**

1. For each value of `SUBSET_FRAC ∈ {0.25, 0.50, 0.75, 1.00}`, train your model once and observe the displayed performance metrics.
2. Answer the discussion question below.
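Because this problem also asks about training time, it can help to record the wall-clock time of each run. The sketch below shows one way to do that; `timed` and `fake_train` are hypothetical names, and `fake_train` is a stand-in that only sleeps, to be replaced by the real build-and-fit code for each fraction.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_train(frac):
    """Hypothetical stand-in for one training run; sleeps instead of calling model.fit."""
    time.sleep(0.01 * frac)
    return {"frac": frac}


timings = {}
for frac in [0.25, 0.50, 0.75, 1.00]:
    _, seconds = timed(fake_train, frac)
    timings[frac] = seconds

for frac, seconds in timings.items():
    print(f"SUBSET_FRAC={frac:.2f}  elapsed={seconds:.3f}s")
```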




In [9]:
# Your code here; add as many cells as you need

import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import time, gc
import numpy as np
import keras
import keras_hub as kh
import evaluate
from datasets import load_dataset, Dataset, Features, Value, ClassLabel
from keras import mixed_precision

mixed_precision.set_global_policy("mixed_float16")

MAX_LEN, LR, BATCH, SEED, EPOCHS = 512, 7.5e-6, 8, 42, 3
keras.utils.set_random_seed(SEED)

imdb = load_dataset("imdb", cache_dir="./cache")
texts = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")
features = Features({"text": Value("string"), "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

acc_metric, f1_metric = evaluate.load("accuracy"), evaluate.load("f1")

for frac in [0.25, 0.50, 0.75, 1.00]:
    print(f"\n{'='*50}\nSUBSET_FRAC={frac}\n{'='*50}")

    keras.utils.set_random_seed(SEED)
    ds_pool = all_ds.train_test_split(train_size=frac, seed=SEED, stratify_by_column="label")["train"] if frac < 1.0 else all_ds

    splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
    train_val, test_ds = splits["train"], splits["test"]
    splits2 = train_val.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
    train_ds, val_ds = splits2["train"], splits2["test"]

    X_tr, y_tr = np.array(train_ds["text"], dtype=object), np.array(train_ds["label"], dtype="int32")
    X_va, y_va = np.array(val_ds["text"], dtype=object), np.array(val_ds["label"], dtype="int32")
    X_te, y_te = np.array(test_ds["text"], dtype=object), np.array(test_ds["label"], dtype="int32")

    print(f"Train: {len(X_tr)}, Val: {len(X_va)}, Test: {len(X_te)}")

    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased", sequence_length=MAX_LEN)
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc)
    model.compile(optimizer=keras.optimizers.Adam(LR),
                 loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")])

    history = model.fit(X_tr, y_tr, validation_data=(X_va, y_va), epochs=EPOCHS, batch_size=BATCH, verbose=1,
                       callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)])

    logits = model.predict(X_te, batch_size=64, verbose=0)
    y_pred = logits.argmax(axis=-1)

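    # NOTE: max() picks the highest val_acc over all epochs, which can differ from
    # the val_acc at the epoch with minimum val_loss that Problems 1 and 3 report.
    val_acc = max(history.history['val_acc'])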
    test_acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
    test_f1 = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

    print(f"Val={val_acc:.4f} Test={test_acc:.4f} F1={test_f1:.4f}")

    keras.backend.clear_session(); gc.collect(); time.sleep(1)


SUBSET_FRAC=0.25
Train: 8750, Val: 1250, Test: 2500
Epoch 1/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 144s 81ms/step - acc: 0.8545 - loss: 0.3210 - val_acc: 0.9112 - val_loss: 0.2241
Epoch 2/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 32s 29ms/step - acc: 0.9362 - loss: 0.1752 - val_acc: 0.9136 - val_loss: 0.2314
Epoch 3/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 32s 29ms/step - acc: 0.9623 - loss: 0.1103 - val_acc: 0.9096 - val_loss: 0.2598
Val=0.9136 Test=0.9116 F1=0.9139

SUBSET_FRAC=0.5
Train: 17500, Val: 2500, Test: 5000
Epoch 1/3
2188/2188 ━━━━━━━━━━━━━━━━━━━━ 165s 58ms/step - acc: 0.8811 - loss: 0.2801 - val_acc: 0.9248 - val_loss: 0.1955
Epoch 2/3
2188/2188 ━━━━━━━━━━━━━━━━━━━━ 80s 37ms/step - acc: 0.9366 - loss: 0.1672 - val_acc: 0.9176 - val_loss: 0.2133
Epoch 3/3
2188/2188 ━━━━━━━━━━━━━━━━━━━━ 81s 37ms/

### Graded Questions

In [10]:
# Set a2a to the validation accuracy at min validation loss for your best configuration found in this problem
# (Yes, it is probably at 1.0!)

a2a = 0.9344             # Replace 0.0 with your answer

In [11]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a2a = {a2a:.4f}')

a2a = 0.9344


#### Question a2b:

Summarize what you observed as dataset size increased. Given that validation metrics are typically reliable to only about two decimal places, do the performance gains justify using the entire dataset? What trade-offs between accuracy and computation time did you notice?

#### Your Answer Here:
As the dataset size increased, performance improved, but the gains became modest once 50-75% of the data was in use. Training time also grew with the dataset size. To limit computation time and cost, using 50% of the data is a reasonable trade-off: the gains over 25% are noticeable, and going beyond 50% yielded only small improvements at a higher computational cost.
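The remark in Question a2b that validation metrics are reliable to about two decimal places can be made concrete with the binomial standard error of an accuracy estimate. This is a rough sketch that assumes validation examples are independent; the accuracy of 0.92 is an illustrative value, while the validation set sizes come from the 10% split (1250 and 2500 are printed in the logs, 5000 follows from the same arithmetic at `SUBSET_FRAC = 1.0`).

```python
import math


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


ILLUSTRATIVE_ACC = 0.92  # assumed value, chosen only for illustration

errors = {}
for n_val in [1250, 2500, 5000]:
    errors[n_val] = accuracy_std_error(ILLUSTRATIVE_ACC, n_val)
    print(f"n_val={n_val}: standard error ~ {errors[n_val]:.4f}")
```

With these inputs the standard error is roughly 0.008 at 1250 examples and 0.004 at 5000, so differences in the third decimal place are within the noise.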

# Problem 3 — Model swap: speed vs. accuracy (why: capacity matters)

In this problem we will compare encoder-only backbones of different sizes.

**Setup.** Keep the best `MAX_LEN`, `LR`, and `SUBSET_FRAC` from Problems 1–2. Only change the model/preset:

* **DistilBERT** (current baseline)
* **BERT-base** (larger/usually stronger)

**How to switch (two lines each).**

* DistilBERT:

  ```python
  preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset("distil_bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.DistilBertTextClassifier.from_preset("distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```

* BERT-base:

  ```python
  preproc = kh.models.BertTextClassifierPreprocessor.from_preset("bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.BertTextClassifier.from_preset("bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```

**What to do.**

1. Train/evaluate each model once with identical settings.
2. Observe the performance metrics for each.
3. Answer the graded questions.



In [12]:
# Your code here; add as many cells as you wish

mixed_precision.set_global_policy("mixed_float16")

MAX_LEN, LR, BATCH, SUBSET_FRAC, SEED, EPOCHS = 512, 7.5e-6, 8, 1.0, 42, 3
keras.utils.set_random_seed(SEED)

imdb = load_dataset("imdb", cache_dir="./cache")
texts = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")
features = Features({"text": Value("string"), "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

splits = all_ds.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
train_val, test_ds = splits["train"], splits["test"]
splits2 = train_val.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
train_ds, val_ds = splits2["train"], splits2["test"]

X_tr, y_tr = np.array(train_ds["text"], dtype=object), np.array(train_ds["label"], dtype="int32")
X_va, y_va = np.array(val_ds["text"], dtype=object), np.array(val_ds["label"], dtype="int32")
X_te, y_te = np.array(test_ds["text"], dtype=object), np.array(test_ds["label"], dtype="int32")

acc_metric, f1_metric = evaluate.load("accuracy"), evaluate.load("f1")

models = [
    ("DistilBERT", "distil_bert_base_en_uncased", kh.models.DistilBertTextClassifierPreprocessor, kh.models.DistilBertTextClassifier),
    ("BERT-base", "bert_base_en_uncased", kh.models.BertTextClassifierPreprocessor, kh.models.BertTextClassifier),
]

results = []

for name, preset, PreprocessorClass, ModelClass in models:
    print(f"\n{'='*50}\n{name}\n{'='*50}")

    keras.utils.set_random_seed(SEED)

    preproc = PreprocessorClass.from_preset(preset, sequence_length=MAX_LEN)
    model = ModelClass.from_preset(preset, num_classes=2, preprocessor=preproc)
    model.compile(optimizer=keras.optimizers.Adam(LR),
                 loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")])

    history = model.fit(X_tr, y_tr, validation_data=(X_va, y_va), epochs=EPOCHS, batch_size=BATCH, verbose=1,
                       callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)])

    logits = model.predict(X_te, batch_size=64, verbose=0)
    y_pred = logits.argmax(axis=-1)

    min_loss_idx = np.argmin(history.history['val_loss'])
    val_acc = history.history['val_acc'][min_loss_idx]
    test_acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
    test_f1 = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

    results.append({"model": name, "val_acc": val_acc, "test_acc": test_acc, "test_f1": test_f1})
    print(f"Val (at min loss)={val_acc:.4f} Test={test_acc:.4f} F1={test_f1:.4f}")

    keras.backend.clear_session(); gc.collect(); time.sleep(1)

print(f"\n{'='*50}\nSUMMARY\n{'='*50}")
for r in results:
    print(f"{r['model']:12s} → Val={r['val_acc']:.4f} Test={r['test_acc']:.4f} F1={r['test_f1']:.4f}")

best = max(results, key=lambda x: x['val_acc'])
print(f"\nBEST: {best['model']}")
print(f"a3a = {best['val_acc']:.4f}")


DistilBERT
Epoch 1/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 169s 29ms/step - acc: 0.8984 - loss: 0.2484 - val_acc: 0.9298 - val_loss: 0.1817
Epoch 2/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 122s 28ms/step - acc: 0.9423 - loss: 0.1569 - val_acc: 0.9278 - val_loss: 0.1972
Epoch 3/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 122s 28ms/step - acc: 0.9640 - loss: 0.1054 - val_acc: 0.9324 - val_loss: 0.1996
Val (at min loss)=0.9298 Test=0.9236 F1=0.9256

BERT-base
Epoch 1/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 387s 73ms/step - acc: 0.9129 - loss: 0.2218 - val_acc: 0.9386 - val_loss: 0.1712
Epoch 2/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 309s 71ms/step - acc: 0.9580 - loss: 0.1195 - val_acc: 0.9404 - val_loss: 0.1783
Epoch 3/3
4375/4375 ━━━━━━━━━━━━━━━━━━━━ 309s 71ms/step - acc: 0.9786 - loss: 0.0674 - val_acc: 0.9450 - val_loss: 0.1771
Val (at min loss)=0.9386 Test=0.9296 F1=0.9320

SUMMARY
DistilBERT   → Val=0.9298 Test=0.9236 F1=0.9256
BERT-base    → Val=0.9386 Test=0.9296 F1=0.9320

BEST: BERT-base
a3a = 0.9386


### Graded Questions

In [13]:
# Set a3a to the validation accuracy at min validation loss for your best model found in this problem

a3a = 0.9386             # Replace 0.0 with your answer

In [14]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a3a = {a3a:.4f}')

a3a = 0.9386


#### Question a3b:

**Answer briefly.**

* Which model gives the best **accuracy/F1**?
* Which is **fastest** per epoch?
* Given limited development time or compute resources, which model is the best **overall choice** and why?

#### Your Answer Here:
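The BERT-base model gave the best validation accuracy (0.9386 vs. 0.9298) and the best test accuracy and F1 (0.9296 and 0.9320 vs. 0.9236 and 0.9256), but DistilBERT was the fastest per epoch: BERT-base took about 2.5 times as long (roughly 309 s vs. 122 s). Because the accuracy gap is less than one percentage point, DistilBERT is the best overall choice when development time, money, and compute are limited.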
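The speed and accuracy trade-off can be quantified directly from the logs. The sketch below uses the per-step times from epochs 2 and 3 (the first epoch includes one-time setup overhead) and the summary metrics printed above; the dictionary layout is just one convenient way to hold them.

```python
# Values copied from the Problem 3 training logs and summary.
runs = {
    "DistilBERT": {"ms_per_step": 28, "val_acc": 0.9298, "test_f1": 0.9256},
    "BERT-base": {"ms_per_step": 71, "val_acc": 0.9386, "test_f1": 0.9320},
}

small, large = runs["DistilBERT"], runs["BERT-base"]

slowdown = large["ms_per_step"] / small["ms_per_step"]
val_gain_points = (large["val_acc"] - small["val_acc"]) * 100
f1_gain_points = (large["test_f1"] - small["test_f1"]) * 100

print(f"BERT-base is {slowdown:.2f}x slower per step")
print(f"validation accuracy gain: {val_gain_points:.2f} percentage points")
print(f"test F1 gain: {f1_gain_points:.2f} percentage points")
```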