# Part 1: Theory of Fine-Tuning

## Concept Check (Multiple Choice Questions)
What is the main benefit of fine-tuning an LLM?

A) It improves the model’s speed.

**B) It customizes the model for specific tasks or domains.**

C) It eliminates the need for high-quality datasets.

D) It prevents overfitting entirely.


Which of the following describes "catastrophic forgetting"?

A) When the model forgets its pre-training data during inference.

**B) When the model loses its generalization ability after excessive fine-tuning on a specific task.**

C) When the model produces irrelevant outputs due to overfitting.

D) When the model fails to save fine-tuned weights.

## Application Task

1. Write a 150–200 word explanation of transfer learning using a real-world analogy. Use examples from any domain (e.g., healthcare, legal, e-commerce).

  A: Transfer learning is the process of adapting a pre-trained model to a different but related task. Fine-tuning an LLM is one form of transfer learning. Consider an AI chatbot that answers customer questions about a product using its reviews and description. The starting point is a pre-trained LLM, which already has general language ability. It is then fine-tuned on examples in which the input is the customer reviews and product description together with a question, and the output is an answer to that question. For an air fryer, a user might ask, "Do customers say this is reliable long term?" and the chatbot might answer, "Yes, customers frequently report the air fryer still working after well over a year."
2. Provide an example dataset structure for a fine-tuning task of your choice. Label and clean your dataset to match the requirements for the task.

  See the IMDb dataset preprocessing in Part 2 below.
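
As a minimal sketch of the structure this task needs, assume IMDb-style records with a `text` field and a binary `label` (0 = negative, 1 = positive). The sample rows and the `clean_records` helper below are hypothetical and only illustrate labeling and cleaning; they are not taken from the actual dataset.

```python
import html
import re

# Hypothetical raw rows; the real IMDb dataset exposes the same two fields.
raw_records = [
    {"text": "Great film.<br /><br />Loved the cast!", "label": "positive"},
    {"text": "  Dull &amp; far too long.  ", "label": "negative"},
    {"text": "Great film.<br /><br />Loved the cast!", "label": "positive"},  # duplicate
    {"text": "", "label": "negative"},  # empty review
]

LABEL_TO_ID = {"negative": 0, "positive": 1}


def clean_records(records):
    """Strip HTML line breaks, normalize whitespace, map labels to ints,
    and drop empty or duplicate reviews."""
    seen = set()
    cleaned = []
    for row in records:
        text = re.sub(r"<br\s*/?>", " ", row["text"])  # IMDb reviews contain <br /> tags
        text = html.unescape(text)                      # &amp; -> &
        text = re.sub(r"\s+", " ", text).strip()        # collapse runs of whitespace
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append({"text": text, "label": LABEL_TO_ID[row["label"]]})
    return cleaned


dataset_rows = clean_records(raw_records)
print(dataset_rows)
```

Each cleaned row then has exactly the two columns the classifier needs: a string to tokenize and an integer class ID.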

# Part 2: Practical Fine-Tuning Session

## 1. Environment Setup

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Install the datasets library before importing it
!pip install datasets

import torch
from datasets import load_dataset
from sklearn.metrics import classification_report
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments



In [4]:
# Check for GPU availability
print(torch.cuda.is_available()) # Should return True

True


## 2. Load and Preprocess Data

In [5]:
model_name = "distilbert-base-uncased"
model_checkpoint_dir = "/content/drive/MyDrive/model_checkpoint"

# if training from start:
# model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# tokenizer = AutoTokenizer.from_pretrained(model_name)

# if opening saved model:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_dir)

In [6]:
dataset = load_dataset("imdb")

def preprocess_function(examples):
  # Truncate long reviews to the model's maximum length and pad shorter ones
  return tokenizer(examples['text'], truncation=True, padding=True)

tokenized_dataset = dataset.map(preprocess_function, batched=True)
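
To make the two tokenizer flags concrete, here is a toy illustration of truncation and padding on lists of token IDs. It is a simplified sketch, not the tokenizer's actual implementation: `pad_and_truncate`, the pad ID of 0, and the tiny `max_length` are assumptions chosen for illustration.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all of them to the longest
    remaining sequence. Real tokenizers also preserve special tokens when
    truncating; this sketch does not."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

With `batched=True`, `dataset.map` hands the tokenizer one chunk of examples at a time, so `padding=True` pads each chunk to its own longest sequence.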




## 3. Fine-Tune the Model

Define Training Arguments

In [7]:
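training_args = TrainingArguments(
  output_dir="./results",  # checkpoints are written here
  save_steps=500,
  save_total_limit=3,  # keep only the three most recent checkpoints
  evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
  learning_rate=2e-5,
  per_device_train_batch_size=16,
  num_train_epochs=3,
  weight_decay=0.01,
)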
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_dataset["train"],
  eval_dataset=tokenized_dataset["test"],
)
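
A quick back-of-the-envelope check of how these arguments translate into optimizer steps. It assumes a single device and no gradient accumulation (neither is stated above). The 25,000-example count is the size of the IMDb train split.

```python
import math

train_examples = 25_000
batch_size = 16   # per_device_train_batch_size
epochs = 3        # num_train_epochs
save_steps = 500

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Number of checkpoints written on the save_steps schedule
step_checkpoints = total_steps // save_steps

print(steps_per_epoch, total_steps, step_checkpoints)  # 1563 4689 9
```

Because `save_total_limit=3`, only the three most recent of those checkpoints remain on disk at any time.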



Train the Model

In [8]:
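# if training from start:
# trainer.train()
# if resuming an interrupted run (needs checkpoint-* folders in ./results;
# in a new runtime, copy them back from Drive first):
trainer.train(resume_from_checkpoint=True)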

ValueError: No valid checkpoint found in output directory (./results)
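
The error above means `./results` held no `checkpoint-<step>` folders, which is what happens in a fresh Colab runtime unless the folder is copied back from Drive. One way to guard against this is to look for the newest checkpoint before deciding whether to resume. The helper below is a hypothetical sketch (`find_last_checkpoint` is not part of the notebook) that assumes the Trainer's `checkpoint-<step>` folder naming. `transformers.trainer_utils.get_last_checkpoint` offers similar behavior.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder,
    or None if the directory is missing or holds no checkpoints."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Demo on a temporary folder standing in for ./results
with tempfile.TemporaryDirectory() as tmp:
    empty_result = find_last_checkpoint(tmp)           # no checkpoints yet
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    latest_name = os.path.basename(find_last_checkpoint(tmp))

print(empty_result, latest_name)
```

In the notebook this could be used as `last = find_last_checkpoint("./results")` followed by `trainer.train(resume_from_checkpoint=last)`, which starts training from the beginning when `last` is `None`.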

Save the Fine-Tuned Model

In [9]:
model.save_pretrained(model_checkpoint_dir)
tokenizer.save_pretrained(model_checkpoint_dir)
!cp -r ./results /content/drive/MyDrive/results

## 4. Save and Evaluate

In [10]:
results = trainer.evaluate()
print(results)


{'eval_loss': 0.2828763723373413, 'eval_model_preparation_time': 0.0015, 'eval_runtime': 355.1949, 'eval_samples_per_second': 70.384, 'eval_steps_per_second': 8.798}


Detailed metrics with sklearn

In [11]:
predictions = trainer.predict(tokenized_dataset["test"])
y_pred = predictions.predictions.argmax(axis=1)
y_true = tokenized_dataset["test"]["label"]
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.93      0.93     12500
           1       0.93      0.94      0.93     12500

    accuracy                           0.93     25000
   macro avg       0.93      0.93      0.93     25000
weighted avg       0.93      0.93      0.93     25000
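
For reference, the per-class numbers in the report follow from simple counts of true positives, false positives, and false negatives. The sketch below recomputes precision, recall, and F1 for one class in plain Python. The toy label lists are made up for illustration and are not the notebook's predictions.

```python
def precision_recall_f1(true_labels, pred_labels, positive):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred, positive=1)
print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

The `macro avg` row in the report is the unweighted mean of these per-class values, and `weighted avg` weights each class by its support.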



## Reflection
Summarize the key challenges you faced during the fine-tuning process and how you addressed them.
Provide suggestions for improving the model’s performance if the accuracy was below 90%.

Challenges I faced:
- I had not used Colab or trained on a GPU for a long time, so I had to relearn the logistics.
- Training took a long time, and I lost my results several times when the runtime disconnected. I learned the hard way to save the model and results to Google Drive.

The model performed well, with accuracy above 90% (0.93 on the test set). To improve it further, I would increase the batch size and the number of epochs while watching the evaluation loss for signs of overfitting.