
# Artificial Intelligence Final Project: Fine-tuning BERT

Copyright (C) Computer Science & Engineering, Soongsil University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them (November 2025.)

BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking model in the NLP domain. This tutorial provides a step-by-step guide to fine-tuning BERT for text classification with Hugging Face's transformers library; the code below uses `bert-base-uncased`.

Reference paper: BERT (Devlin et al., 2018).<br>
https://arxiv.org/abs/1810.04805

The code below is based on the following article.<br>
https://medium.com/@khang.pham.exxact/text-classification-with-bert-7afaacc5e49b


### Fine-tune the model
1. Design your model's prediction head.
2. Fine-tune the model by changing the hyperparameters.
3. Your score is based on your (hidden) test accuracy for text classification (ranking-based).

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that we can grade both your code and results.  


Now proceed to the code.


## Install libraries

In [1]:
import os

In [2]:
!python3 -m pip install pandas
!python3 -m pip install transformers



### Import libraries

In [3]:
import os
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
import pandas as pd
from tqdm import tqdm
# AdamW is imported from torch.optim below; the transformers implementation is deprecated
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from torch.optim import AdamW
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Specify your GPU number if necessary

In [5]:
%env CUDA_VISIBLE_DEVICES = 0

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Using device: {device}")

env: CUDA_VISIBLE_DEVICES=0
Using device: cuda


## Preparing dataset

link : https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

1. Download the dataset from the link above.
2. Move the downloaded zip file into the "data" directory and unzip it.
3. Run the following cell.

In [6]:
import pandas as pd
import os

data_file_path = '/content/drive/MyDrive/finalProjectAI/final_project(2025)/part1/data/IMDB Dataset Train.csv'

if os.path.exists(data_file_path):
    print(f"File download succeeded: {data_file_path}")
    # Test-load the data
    df = pd.read_csv(data_file_path)
    print(f"Data samples: {len(df)} loaded")
else:
    print("File download failed")

def load_imdb_data(data_file_path):
    if os.path.exists(data_file_path):
        df = pd.read_csv(data_file_path)
        texts = df['review'].tolist()
        # Map the sentiment strings to integer labels: positive -> 1, anything else -> 0
        labels = [1 if sentiment == "positive" else 0 for sentiment in df['sentiment'].tolist()]
        return texts, labels
    else:
        raise FileNotFoundError(f"The file '{data_file_path}' does not exist.")

texts, labels = load_imdb_data(data_file_path)
print(f"Loaded {len(texts)} samples")

File download succeeded: /content/drive/MyDrive/finalProjectAI/final_project(2025)/part1/data/IMDB Dataset Train.csv
Data samples: 50000 loaded
Loaded 50000 samples


## Dataset class

In [7]:
class CustomTextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_seq_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer(
            text,
            max_length=self.max_seq_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
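
To make the `padding='max_length'` and `truncation=True` arguments concrete, here is a simplified, tokenizer-free sketch of what happens to a single sequence. The helper `pad_and_truncate` and the sample token ids are hypothetical and only for illustration; the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary. The real tokenizer handles many more cases than this.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_and_truncate(token_ids, max_seq_length):
    """Return (input_ids, attention_mask), both of length max_seq_length."""
    # Reserve two positions for [CLS] and [SEP], then cut whatever does not fit.
    body = token_ids[: max_seq_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_seq_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded; a long one is truncated (token ids are made up).
short_ids, short_mask = pad_and_truncate([2023, 3185], max_seq_length=6)
long_ids, long_mask = pad_and_truncate(list(range(1000, 1010)), max_seq_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item therefore has the same length, which is what lets the `DataLoader` stack items into a batch; the attention mask tells BERT to ignore the padded positions.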

## Classifier head for BERT (design your model's prediction head)

In [8]:
class CustomBERTClassifier(nn.Module):
    def __init__(self, bert_model_name, num_classes):
        super(CustomBERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        ######################## TO-DO ########################
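        # Read the hidden size from the config (768 for BERT base) instead of hard-coding it
        hidden_size = self.bert.config.hidden_size
        self.dropout1 = nn.Dropout(0.3)
        self.fc1 = nn.Linear(hidden_size, 256)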
        self.relu = nn.ReLU()
        self.dropout2 = nn.Dropout(0.2)
        self.fc2 = nn.Linear(256, num_classes)
        ######################## TO-DO ########################

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        ######################## TO-DO ########################
        x = self.dropout1(pooled_output)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout2(x)
        logits = self.fc2(x)
        ######################## TO-DO ########################
        return logits
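
The head returns raw logits rather than probabilities. That is what `nn.CrossEntropyLoss` (used in the training cell) expects, since it applies log-softmax internally. The sketch below works through the arithmetic in plain Python for one example; the logit values are hypothetical and the helper names `softmax` and `cross_entropy` are illustrative, not PyTorch functions.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


logits = [0.5, 2.0]  # hypothetical scores for (negative, positive)
probs = softmax(logits)
prediction = max(range(len(logits)), key=lambda i: logits[i])  # argmax over classes
loss_if_positive = cross_entropy(logits, label=1)
loss_if_negative = cross_entropy(logits, label=0)
print(probs, prediction, loss_if_positive, loss_if_negative)
```

The loss is small when the true class has the largest logit and grows when the model is confidently wrong, which is why no softmax layer is needed at the end of the head.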

## Training and evaluation methods

In [9]:
def train_model(model, data_loader, optimizer, scheduler, device):
    model.train()
    total_loss = 0
    for batch in tqdm(data_loader, desc="Train"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        ######################## TO-DO ########################
        loss = nn.CrossEntropyLoss()(outputs, labels)
        ######################## TO-DO ########################
        total_loss += loss.item()
        loss.backward()
        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

    avg_loss = total_loss / len(data_loader)
    print(f"Average Training Loss: {avg_loss:.4f}")

def evaluate_model(model, data_loader, device):
    model.eval()
    predictions = []
    actual_labels = []
    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Validation"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            actual_labels.extend(labels.cpu().tolist())
    return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)
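
The gradient clipping step in `train_model` rescales all gradients by one common factor whenever their joint L2 norm exceeds `max_norm`. A minimal plain-Python sketch of that idea follows, with a flat list of floats standing in for the model's gradients. The helper `clip_by_global_norm` is hypothetical, and the small `eps` term is an assumption made here to avoid division by zero, modelled on how PyTorch is understood to do it.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Norm 5.0 exceeds max_norm=1.0, so the gradients shrink but keep their direction.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Norm 0.5 is already below max_norm, so nothing changes.
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(untouched)
```

Because every gradient is multiplied by the same factor, clipping limits the size of an update without changing its direction.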

## Hyper-parameter settings

In [10]:
# Set up parameters
# Hint: generally, 5 ~ 10 epochs will be enough.
bert_model_name = 'bert-base-uncased'
num_classes = 2
######################## TO-DO ########################
max_seq_length = 256  # IMDB reviews can be long, 256 is a good balance
batch_size = 16  # Adjust based on GPU memory (use 8 if OOM)
num_epochs = 4  # BERT typically needs 2-4 epochs for fine-tuning
learning_rate = 2e-5  # Standard BERT fine-tuning learning rate
######################## TO-DO ########################

print("Hyperparameters:")
print(f"  max_seq_length: {max_seq_length}")
print(f"  batch_size: {batch_size}")
print(f"  num_epochs: {num_epochs}")
print(f"  learning_rate: {learning_rate}")

Hyperparameters:
  max_seq_length: 256
  batch_size: 16
  num_epochs: 4
  learning_rate: 2e-05


## Get data utils

In [11]:
######################## DO NOT CHANGE ########################
train_texts, val_texts, train_labels, val_labels = \
train_test_split(texts, labels, test_size=0.2, random_state=42)
######################## DO NOT CHANGE ########################

print(f"Training samples: {len(train_texts)}")
print(f"Validation samples: {len(val_texts)}")

tokenizer = BertTokenizer.from_pretrained(bert_model_name)
train_dataset = CustomTextClassificationDataset(train_texts, train_labels, tokenizer, max_seq_length)
val_dataset = CustomTextClassificationDataset(val_texts, val_labels, tokenizer, max_seq_length)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)

Training samples: 40000
Validation samples: 10000



## Define model, optimizer and scheduler

In [12]:
model = CustomBERTClassifier(bert_model_name, num_classes).to(device)
######################## TO-DO ########################
# Use AdamW optimizer with weight decay for regularization
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
######################## TO-DO ########################
total_steps = len(train_dataloader) * num_epochs
# Warmup helps stabilize training in early steps
warmup_steps = int(0.1 * total_steps)  # 10% warmup
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)

print(f"Total training steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")


Total training steps: 10000
Warmup steps: 1000
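
The step counts printed above follow from the split size and the batch size of 16 shown in the printed hyperparameters. The sketch below reproduces that arithmetic and shows the shape of a linear warmup-then-decay schedule. The function `linear_warmup_decay` is a hypothetical stand-in written for illustration; it is meant to mirror the behavior of `get_linear_schedule_with_warmup`, not to replace it.

```python
import math


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


train_samples, batch_size, num_epochs = 40000, 16, 4  # values from the run above
steps_per_epoch = math.ceil(train_samples / batch_size)  # number of batches per epoch
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.1 * total_steps)
print(steps_per_epoch, total_steps, warmup_steps)

base_lr = 2e-5
for step in (0, 500, 1000, 5500, 10000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

The learning rate peaks at `learning_rate` once warmup ends (step 1000 here) and reaches zero at the final step, so changing `batch_size` or `num_epochs` also changes how quickly the rate rises and falls.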


## Train model and save best model

In [13]:
eval_acc = 0
# The checkpoint path does not change between epochs, so define it once
model_path = '/content/drive/MyDrive/finalProjectAI/final_project(2025)/part1/data/finetuned_bert.pth'
for epoch in range(num_epochs):
    print(f"\n{'='*50}")
    print(f"Epoch {epoch + 1}/{num_epochs}")
    print(f"{'='*50}")
    train_model(model, train_dataloader, optimizer, scheduler, device)
    accuracy, report = evaluate_model(model, val_dataloader, device)
    print(f"Validation Accuracy: {accuracy:.4f}")
    print(report)

    if eval_acc < accuracy:
        torch.save(model.state_dict(), model_path)
        print('Saved Trained Model.')
        eval_acc = accuracy

print(f"\n{'='*50}")
print(f"Best Validation Accuracy: {eval_acc:.4f}")
print(f"{'='*50}")


Epoch 1/4


Train: 100%|██████████| 2500/2500 [27:58<00:00,  1.49it/s]


Average Training Loss: 0.3157


Validation: 100%|██████████| 625/625 [02:49<00:00,  3.68it/s]


Validation Accuracy: 0.9144
              precision    recall  f1-score   support

           0       0.90      0.94      0.92      4961
           1       0.93      0.89      0.91      5039

    accuracy                           0.91     10000
   macro avg       0.92      0.91      0.91     10000
weighted avg       0.92      0.91      0.91     10000

Saved Trained Model.

Epoch 2/4


Train: 100%|██████████| 2500/2500 [27:57<00:00,  1.49it/s]


Average Training Loss: 0.1725


Validation: 100%|██████████| 625/625 [02:47<00:00,  3.72it/s]


Validation Accuracy: 0.9274
              precision    recall  f1-score   support

           0       0.92      0.93      0.93      4961
           1       0.93      0.92      0.93      5039

    accuracy                           0.93     10000
   macro avg       0.93      0.93      0.93     10000
weighted avg       0.93      0.93      0.93     10000

Saved Trained Model.

Epoch 3/4


Train: 100%|██████████| 2500/2500 [27:57<00:00,  1.49it/s]


Average Training Loss: 0.0975


Validation: 100%|██████████| 625/625 [02:49<00:00,  3.68it/s]


Validation Accuracy: 0.9270
              precision    recall  f1-score   support

           0       0.94      0.91      0.93      4961
           1       0.92      0.94      0.93      5039

    accuracy                           0.93     10000
   macro avg       0.93      0.93      0.93     10000
weighted avg       0.93      0.93      0.93     10000


Epoch 4/4


Train: 100%|██████████| 2500/2500 [27:56<00:00,  1.49it/s]


Average Training Loss: 0.0509


Validation: 100%|██████████| 625/625 [02:50<00:00,  3.66it/s]


Validation Accuracy: 0.9278
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      4961
           1       0.92      0.94      0.93      5039

    accuracy                           0.93     10000
   macro avg       0.93      0.93      0.93     10000
weighted avg       0.93      0.93      0.93     10000

Saved Trained Model.

Best Validation Accuracy: 0.9278
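
The log shows "Saved Trained Model." after epochs 1, 2 and 4 but not after epoch 3. That follows from the strict `eval_acc < accuracy` comparison in the training loop, which the sketch below replays on the validation accuracies recorded above. The helper `epochs_that_save` is hypothetical and exists only for this illustration.

```python
def epochs_that_save(accuracies):
    """Return the 1-based epochs at which a strictly better accuracy triggers a save."""
    best = 0.0
    saved = []
    for epoch, acc in enumerate(accuracies, start=1):
        if best < acc:  # same strict comparison as the training loop
            saved.append(epoch)
            best = acc
    return saved, best


val_accuracies = [0.9144, 0.9274, 0.9270, 0.9278]  # per-epoch values from the log above
saved_epochs, best_acc = epochs_that_save(val_accuracies)
print(saved_epochs, best_acc)
```

Epoch 3 scored 0.9270, slightly below the 0.9274 already on disk, so the checkpoint was left alone. Note also that the training loss keeps falling (0.3157 to 0.0509) while validation accuracy stays near 0.927 after epoch 2, which may indicate that later epochs mostly fit the training set more closely.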


# Task
Analyze the performance of the fine-tuned BERT model by displaying the classification report, generating and visualizing a confusion matrix, and then providing a comprehensive analysis of the model's strengths and weaknesses based on these metrics.

## Display Classification Report

### Subtask:
Display the detailed classification report (precision, recall, f1-score) for the model's performance on the validation set.


**Reasoning**:
The classification report is available in the `report` variable, which holds the report from the most recently evaluated epoch (epoch 4, which is also the best epoch in this run). Printing this variable fulfills the subtask.



In [14]:
print(report)

              precision    recall  f1-score   support

           0       0.93      0.92      0.93      4961
           1       0.92      0.94      0.93      5039

    accuracy                           0.93     10000
   macro avg       0.93      0.93      0.93     10000
weighted avg       0.93      0.93      0.93     10000
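
The task above also asks for a confusion matrix. `evaluate_model` returns only the accuracy and the report, so building one requires the per-sample predictions and labels as well. The sketch below shows, in plain Python, how a 2x2 confusion matrix and the per-class precision, recall and F1 in the report relate to each other. The labels are hypothetical toy data, not the model's predictions, and the helper names are illustrative; in the notebook itself, `sklearn.metrics.confusion_matrix` would be the usual choice.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def precision_recall_f1(matrix, positive=1):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix."""
    negative = 1 - positive
    tp = matrix[positive][positive]
    fp = matrix[negative][positive]
    fn = matrix[positive][negative]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels, NOT the model's actual predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
cm = confusion_matrix_2x2(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(cm, positive=1)
print(cm)
print(precision, recall, f1)
```

Reading the matrix row by row shows which class is misclassified more often, which the single accuracy number hides.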

