# Using GPT-2 for Text Classification

For demonstration purposes and because of resource limitations, we use GPT-2 as our model. The same methods should carry over to larger models, which will likely yield better performance.

We build on the notebook from Sebastian Raschka's LLMs course [3]: [GitHub - rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb)


## Environment Setup

First, we need to install and import the required libraries. Ensure that all necessary packages are installed; otherwise, please use `pip install` to install them.

In [11]:
#!$ pip install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org <package_name>
#!pip install torch --user
#!pip uninstall matplotlib -y
#!pip uninstall pillow -y 
#!pip uninstall numpy -y 
#!pip uninstall datasets -y -v 
#!pip install matplotlib
#!pip install pillow
#!pip install numpy
#!pip install datasets --user

In [1]:
%load_ext autoreload
%autoreload 2
#import os
#os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
#!pip install datasets --user
#!pip install evaluate --user
#!pip3 install Cython --user

In [2]:
# %%

from importlib.metadata import PackageNotFoundError, version
import sys

pkgs = [
    "matplotlib",
    "numpy",
    "torch",
    "transformers",
    "datasets",
    "pandas",
    "evaluate",
]

for p in pkgs:
    try:
        print(f"{p} version: {version(p)}")
    except PackageNotFoundError:
        # Stop early so later cells do not fail with a less obvious import error
        print(f"{p} is not installed. Please use `pip install {p}` to install.")
        sys.exit(1)
data_path = "content"


matplotlib version: 3.10.0
numpy version: 2.0.2
torch version: 2.5.1
transformers version: 4.48.0
datasets version: 3.2.0
pandas version: 2.2.3
evaluate version: 0.4.3


In [3]:
import torch

In [4]:
# Set model name
from transformers import GPT2Tokenizer
model_name = "gpt2"

# Load GPT-2 Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)



## Data Preparation and Loading

We will use the SMS Spam Collection dataset, which contains SMS messages labeled as "spam" or "ham". We will download and preprocess the data, then split it into training, validation, and test sets.

If the download fails, please download the archive manually.
Download link: https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip

The code below creates balanced classes of ham and spam. **Reflect on the advantages and disadvantages of this step.** 


In [6]:
from src.fine_helper import load_complete_dataframe
is_balanced = True
train_df, validation_df, test_df = load_complete_dataframe(data_path,is_balanced=is_balanced)

Original dataset label counts:
Label
1        37569
0        29780
label        1
Name: count, dtype: int64
(67350, 2)

Balanced dataset label distribution:
Label
1    29780
0    29780
Name: count, dtype: int64

Training set size: 67350, Validation set size: 873, Test set size: 1822


In [11]:
# %%

import urllib.request 
import zipfile 
import os 
from pathlib import Path 
import pandas as pd 

url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip" 
zip_path = "sms_spam_collection.zip" 
extracted_path = "sms_spam_collection" 
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv" 

def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path): 
    if data_file_path.exists(): 
        print(f"{data_file_path} already exists. Skipping download and extraction.") 
        return 

    with urllib.request.urlopen(url) as response: 
        with open(zip_path, "wb") as out_file: 
            out_file.write(response.read()) 

    with zipfile.ZipFile(zip_path, "r") as zip_ref: 
        zip_ref.extractall(extracted_path) 

    original_file_path = Path(extracted_path) / "SMSSpamCollection" 
    os.rename(original_file_path, data_file_path) 
    print(f"File downloaded and saved as {data_file_path}") 

#download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path) 

df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"]) 
print(df.head()) 
print("Original dataset label counts:") 
print(df["Label"].value_counts()) 

def create_balanced_dataset(df): 
    num_spam = df[df["Label"] == "spam"].shape[0] 
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123) 
    return pd.concat([ham_subset, df[df["Label"] == "spam"]], ignore_index=True) 

balanced_df = create_balanced_dataset(df) 
balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1}) 
print("\nBalanced dataset label distribution:") 
print(balanced_df["Label"].value_counts()) 

def random_split(df, train_frac, validation_frac): 
    df = df.sample(frac=1, random_state=123).reset_index(drop=True) 
    train_end = int(len(df) * train_frac) 
    validation_end = train_end + int(len(df) * validation_frac) 
    train_df = df[:train_end] 
    validation_df = df[train_end:validation_end] 
    test_df = df[validation_end:] 
    return train_df, validation_df, test_df 

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1) 
train_df.to_csv("train.csv", index=None) 
validation_df.to_csv("validation.csv", index=None) 
test_df.to_csv("test.csv", index=None) 

print(f"\nTraining set size: {len(train_df)}, Validation set size: {len(validation_df)}, Test set size: {len(test_df)}") 

  Label                                               Text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
Original dataset label counts:
Label
ham     4825
spam     747
Name: count, dtype: int64

Balanced dataset label distribution:
Label
0    747
1    747
Name: count, dtype: int64

Training set size: 1045, Validation set size: 149, Test set size: 300
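
The sizes printed above follow directly from the balancing and splitting logic. The arithmetic sketch below uses only the counts shown in the output (4825 ham, 747 spam) and the 0.7 and 0.1 fractions passed to `random_split`; the variable names are ours and chosen for illustration.

```python
# Counts taken from the "Original dataset label counts" output
num_ham, num_spam = 4825, 747

# Undersampling keeps every spam message and an equal number of ham messages
balanced_total = 2 * num_spam
discarded_ham = num_ham - num_spam

# random_split truncates with int(), so the test set absorbs the remainder
train_size = int(balanced_total * 0.7)
validation_size = int(balanced_total * 0.1)
test_size = balanced_total - train_size - validation_size

print(balanced_total, discarded_ham)
print(train_size, validation_size, test_size)
```

Balancing therefore discards 4078 of the 4825 ham messages, which is worth keeping in mind when weighing the advantages and disadvantages of this step.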


## Creating Data Loaders

Next, we encode the SMS texts and pad them to a common length so that every batch has a consistent input shape. We use `<|PAD|>` as the padding token and build the matching `attention_mask`.
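
To make the padding step concrete, here is a minimal pure-Python sketch of right-padding a token sequence and building its attention mask. The helper name `pad_and_mask` and the token IDs are made up for illustration; 50257 is the ID that `<|PAD|>` receives in this notebook.

```python
def pad_and_mask(token_ids, max_length, pad_token_id):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])   # copy so the caller's list is not modified
    attention_mask = [1] * len(ids)      # 1 marks a real token
    pad_len = max_length - len(ids)
    ids += [pad_token_id] * pad_len      # adds nothing when pad_len is 0
    attention_mask += [0] * pad_len      # 0 marks padding
    return ids, attention_mask


ids, mask = pad_and_mask([10, 20, 30], max_length=5, pad_token_id=50257)
print(ids)   # [10, 20, 30, 50257, 50257]
print(mask)  # [1, 1, 1, 0, 0]
```

Copying the list before padding avoids changing the stored encoding as a side effect, which matters if the same encoded text is padded more than once.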


### New Code

In [5]:
from src.dataset_loader import get_enc_dataset
train_dataset, val_dataset, test_dataset, train_loader, val_loader, test_loader, pad_token_id = get_enc_dataset(data_path, tokenizer, is_log=True)

Added new pad_token '<|PAD|>' with ID: 50257
Index(['Text', 'Label'], dtype='object')
Max Length is:  65
Index(['Text', 'Label'], dtype='object')
Max Length is:  65
Index(['Text', 'Label'], dtype='object')
Max Length is:  65
Max Length:  65
Number of training batches: 7147, Number of validation batches: 298


### Old Code 

In [9]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer
import pandas as pd
import os 
# Set model name
model_name = "gpt2"

# Load GPT-2 Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Add a separate pad_token
if tokenizer.pad_token is None:
    # Use '<|PAD|>' as the padding token
    tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
    print("Added new pad_token '<|PAD|>' with ID:", tokenizer.pad_token_id)
# Read the ID outside the branch so it is defined even if a pad token already exists
pad_token_id = tokenizer.pad_token_id

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=None):
        self.data = pd.read_csv(csv_file,sep="\t")
        
        # Verify necessary columns in the CSV file
        required_columns = ["sentence", "label"]
        if not all(col in self.data.columns for col in required_columns):
            raise ValueError(f"CSV file must contain the following columns: {required_columns}")
        
        # Ensure labels are of integer type
        self.data["label"] = self.data["label"].astype(int)
        
        self.texts = self.data["sentence"].tolist()
        self.labels = self.data["label"].tolist()
        
        # Set pad_token_id, if not specified, use tokenizer's pad_token_id
        self.pad_token_id = pad_token_id if pad_token_id is not None else tokenizer.pad_token_id
        
        # Encode texts
        self.encoded_texts = []
        for text in self.texts:
            try:
                encoded = tokenizer.encode(text, add_special_tokens=True)
                self.encoded_texts.append(encoded)
            except Exception as e:
                raise ValueError(f"Error encoding text: {text[:50]}...") from e
        
        # Dynamically calculate max_length, or use specified max_length
        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            # Truncate sequences longer than max_length
            self.encoded_texts = [
                encoded_text[:self.max_length] for encoded_text in self.encoded_texts
            ]
        
        # Pad all sequences and generate attention_mask
        self.padded_texts = []
        self.attention_masks = []
        for enc in self.encoded_texts:
            enc = enc[:self.max_length]
            attention_mask = [1] * len(enc)
            
            pad_len = self.max_length - len(enc)
            if pad_len > 0:
                enc += [self.pad_token_id] * pad_len
                attention_mask += [0] * pad_len
            
            self.padded_texts.append(enc)
            self.attention_masks.append(attention_mask)
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        input_ids = torch.tensor(self.padded_texts[idx], dtype=torch.long)
        attention_mask = torch.tensor(self.attention_masks[idx], dtype=torch.long)
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        text = self.texts[idx]
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": label,
            "text": text
        }
    
    def _longest_encoded_length(self):
        return max(len(encoded_text) for encoded_text in self.encoded_texts)

# Create datasets
train_dataset = SpamDataset(os.path.join(data_path,"train.tsv"), tokenizer, pad_token_id=pad_token_id)
val_dataset = SpamDataset(os.path.join(data_path,"dev.tsv"), tokenizer, max_length=train_dataset.max_length, pad_token_id=pad_token_id)
#test_dataset = SpamDataset(os.path.join(data_path,"test.tsv"), tokenizer, max_length=train_dataset.max_length, pad_token_id=pad_token_id)

# Set DataLoader parameters
batch_size = 8
num_workers = 0

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, drop_last=False)
#test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, drop_last=False)

# test_loader is commented out above, so only report the loaders that exist
print(f"Number of training batches: {len(train_loader)}, Number of validation batches: {len(val_loader)}")

Added new pad_token '<|PAD|>' with ID: 50257


## Defining Computational Device
Before loading and training the model, we need to define the computational device (CPU or GPU). If a GPU is available, it will be preferred to accelerate training.

In [6]:
if torch.cuda.is_available(): 
    device = torch.device("cuda") 
elif torch.backends.mps.is_available():
    device = torch.device("mps") 
else: 
    device = torch.device("cpu") 
print(f"Using device: {device}") 
#device = torch.device("cpu") 

Using device: cuda


## Understanding Model Structure
Load the pre-trained GPT-2 model and make sure you understand its structure. **What do the abbreviations represent? Which of the layer types have we seen in the lecture?**

In [7]:
from transformers import GPT2LMHeadModel, GPT2ForSequenceClassification


### Model structure 

In [8]:
from transformers import GPT2LMHeadModel, GPT2ForSequenceClassification
pretrained_gpt2_lm = GPT2LMHeadModel.from_pretrained("gpt2") 

#print(pretrained_gpt2_lm)


Now compare the structure to the GPT-2 model for classification. **What is the difference?**  

In [9]:
classification_model = GPT2ForSequenceClassification.from_pretrained( 
    model_name, 
    num_labels=2, 
    pad_token_id=tokenizer.pad_token_id 
).to(device) 

#print(classification_model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Initial Model Testing

At this point, the classification head is newly initialized and untrained, so its predictions carry no useful signal yet.

In [10]:
#pad_token_id

In [11]:
# %%

def classify_text(text, model, tokenizer, device, max_length, pad_token_id=50256): 
    model.eval() 
    enc = tokenizer.encode(text, add_special_tokens=True, truncation=True, max_length=max_length) 
    att_mask = [1]*len(enc) 
    pad_len = max_length - len(enc) 
    if pad_len > 0: 
        enc += [pad_token_id]*pad_len 
        att_mask += [0]*pad_len 
     
    input_ids = torch.tensor([enc], dtype=torch.long).to(device) 
    attention_mask = torch.tensor([att_mask], dtype=torch.long).to(device) 

    with torch.no_grad(): 
        print(input_ids.shape)
        
        outputs = model(input_ids)#, attention_mask=attention_mask) 
        logits = outputs.logits 
        predicted = torch.argmax(logits, dim=-1).item() 
    return predicted

sample_text_spam = "Fine"
sample_text_ham = "Bad"

print("(Before fine-tuning) Initial prediction of the classification head:")
# The embedding matrix has not been resized yet, so pad with pad_token_id - 1 (50256),
# which is still a valid GPT-2 token ID.
pred_positive = classify_text(sample_text_spam, classification_model, tokenizer, device,
                              max_length=65, pad_token_id=pad_token_id - 1)
print(f"Positive sample => Prediction: {pred_positive}")
pred_negative = classify_text(sample_text_ham, classification_model, tokenizer, device,
                              max_length=65)
print(f"Negative sample => Prediction: {pred_negative}")

(Before fine-tuning) Initial prediction of the classification head:
torch.Size([1, 65])
Positive sample => Prediction: 0
torch.Size([1, 65])
Negative sample => Prediction: 0
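
`GPT2ForSequenceClassification` classifies a sequence from the hidden state of its last non-padding token, which it locates using the `pad_token_id` in the model configuration. The pure-Python sketch below illustrates that lookup for right-padded inputs. `last_non_pad_index` is a hypothetical helper written for illustration, not the library's implementation, and it assumes each sequence starts with at least one real token.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Return the position of the last real token in a right-padded sequence."""
    for i, token_id in enumerate(input_ids):
        if token_id == pad_token_id:
            # First pad found: the previous position holds the last real token
            return i - 1
    # No padding found: fall back to the final position
    return len(input_ids) - 1


# Padding ID matches the configured one: the last real token is selected
print(last_non_pad_index([34, 12, 50257, 50257], pad_token_id=50257))  # 1

# Inputs padded with a different ID: no padding is recognized
print(last_non_pad_index([34, 12, 50256, 50256], pad_token_id=50257))  # 3
```

The second call shows why the ID used to pad the inputs should match the `pad_token_id` given to the model: otherwise the prediction is read from a padding position.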



## Tuning only the classification layer

As a first step, we freeze the parameters of the GPT-2 base model and tune only the classification layer.

In [12]:

# We need to add the padding token to the embeddings
classification_model.resize_token_embeddings(len(tokenizer)) 
classification_model.to(device) 

i =0
k=0
for param in classification_model.base_model.parameters(): 
    param.requires_grad = False 
    i+=param.numel()

print(f"Number of base parameters: {i}")


for param in classification_model.score.parameters(): 
    param.requires_grad = True 
    k+=param.numel()

print(f"\nTraining only the classification head, trainable parameters: {k}") 


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Number of base parameters: 124440576

Training only the classification head, trainable parameters: 1536
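
Both numbers can be checked by hand. The sketch below assumes the standard dimensions of the 124M GPT-2 checkpoint (hidden size 768, 12 blocks, 1024 positions, a 50257-token vocabulary) plus the one added `<|PAD|>` token, and a bias-free `score` layer, consistent with the warning that only `score.weight` was newly initialized. The variable names are ours.

```python
d = 768                      # hidden size
vocab = 50257 + 1            # original vocabulary plus the added '<|PAD|>' token
positions = 1024             # learned position embeddings
num_blocks = 12

layer_norm = 2 * d                                   # weight + bias
attention = (d * 3 * d + 3 * d) + (d * d + d)        # c_attn + c_proj
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)          # c_fc + c_proj
block = 2 * layer_norm + attention + mlp

base_params = vocab * d + positions * d + num_blocks * block + layer_norm
head_params = d * 2          # score.weight for two labels, no bias

print(base_params)
print(head_params)
```

Under these assumptions the head accounts for roughly 0.001% of the parameters, which is why training it alone is cheap.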


In [19]:
import time 
from src.ult import evaluate_accuracy
from src.train import train_head_only


start_time_head = time.time() 
classification_model = train_head_only(classification_model, train_loader, val_loader, device, epochs=8, lr=3e-5) 
end_time_head = time.time() 

train_accuracy_head = evaluate_accuracy(classification_model, train_loader, device) 
val_accuracy_head = evaluate_accuracy(classification_model, val_loader, device) 
#test_accuracy_head = evaluate_accuracy(classification_model, test_loader, device) 
finetune_head_time = (end_time_head - start_time_head) / 60 

print(f"\n=== Fine-tuning only the classification head completed in {finetune_head_time:.2f} minutes ===") 
print(f"Training accuracy: {train_accuracy_head*100:.2f}%") 
print(f"Validation accuracy: {val_accuracy_head*100:.2f}%") 
#print(f"Test accuracy: {test_accuracy_head*100:.2f}%") 

Epoch 1/8, step 500/7147, loss = 0.7802
Epoch 1/8, step 1000/7147, loss = 0.6003
Epoch 1/8, step 1500/7147, loss = 0.6505
Epoch 1/8, step 2000/7147, loss = 0.7296
Epoch 1/8, step 2500/7147, loss = 0.5789
Epoch 1/8, step 3000/7147, loss = 0.5842
Epoch 1/8, step 3500/7147, loss = 0.6220
Epoch 1/8, step 4000/7147, loss = 0.5829
Epoch 1/8, step 4500/7147, loss = 0.5478
Epoch 1/8, step 5000/7147, loss = 0.6448
Epoch 1/8, step 5500/7147, loss = 0.4478
Epoch 1/8, step 6000/7147, loss = 0.4949
Epoch 1/8, step 6500/7147, loss = 0.5545
Epoch 1/8, step 7000/7147, loss = 0.4498
Epoch 1/8, step 7140/7147, loss = 0.5660
Epoch 1/8, Average training loss: 0.6997
Validation accuracy: 73.90%

Epoch 2/8, step 500/7147, loss = 0.5973
Epoch 2/8, step 1000/7147, loss = 0.6920
Epoch 2/8, step 1500/7147, loss = 0.5373
Epoch 2/8, step 2000/7147, loss = 0.6156
Epoch 2/8, step 2500/7147, loss = 0.5983
Epoch 2/8, step 3000/7147, loss = 0.7882
Epoch 2/8, step 3500/7147, loss = 0.7298
Epoch 2/8, step 4000/7147, los

You may find that fine-tuning just the linear classification layer already gives a significant improvement. Are you satisfied with the validation accuracy? Try different hyperparameters, for example a higher learning rate, and see whether you can improve it.
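
The implementation of `evaluate_accuracy` is not shown in this notebook. As a rough sketch of what such a helper presumably computes, accuracy is the fraction of examples whose highest-scoring class matches the label. The function name `accuracy_from_logits` and the numbers below are made up for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == label)
    return correct / len(labels)


logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.2], [-0.5, 0.7]]
labels = [0, 1, 1, 1]
print(accuracy_from_logits(logits, labels))  # 0.75
```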

Next, we will try unfreezing the last Transformer block before the linear layer; Sebastian Raschka found that this can significantly improve the model's performance on specific downstream tasks.

## Unfreeze Partial Model Layers and Further Fine-Tuning

We unfreeze the last Transformer block (`transformer.h.11`) and the final LayerNorm of GPT-2, and train them along with the classification head (`score`) to further enhance performance.

In [13]:
# Unfreeze the last transformer block and LayerNorm in base_model
for param in classification_model.base_model.h[-1].parameters(): 
    param.requires_grad = True 
    k+=param.numel()
for param in classification_model.base_model.ln_f.parameters(): 
    param.requires_grad = True
    k += param.numel()

print(f"Total trainable parameters: {k}") 
print("\nTrainable parts after unfreezing the last Transformer block and LayerNorm:") 
trainable_params_count = 0 
for name, param in classification_model.named_parameters(): 
    if param.requires_grad: 
        print(f"  {name} => shape={param.size()}") 


Total trainable parameters: 7090944

Trainable parts after unfreezing the last Transformer block and LayerNorm:
  transformer.h.11.ln_1.weight => shape=torch.Size([768])
  transformer.h.11.ln_1.bias => shape=torch.Size([768])
  transformer.h.11.attn.c_attn.weight => shape=torch.Size([768, 2304])
  transformer.h.11.attn.c_attn.bias => shape=torch.Size([2304])
  transformer.h.11.attn.c_proj.weight => shape=torch.Size([768, 768])
  transformer.h.11.attn.c_proj.bias => shape=torch.Size([768])
  transformer.h.11.ln_2.weight => shape=torch.Size([768])
  transformer.h.11.ln_2.bias => shape=torch.Size([768])
  transformer.h.11.mlp.c_fc.weight => shape=torch.Size([768, 3072])
  transformer.h.11.mlp.c_fc.bias => shape=torch.Size([3072])
  transformer.h.11.mlp.c_proj.weight => shape=torch.Size([3072, 768])
  transformer.h.11.mlp.c_proj.bias => shape=torch.Size([768])
  transformer.ln_f.weight => shape=torch.Size([768])
  transformer.ln_f.bias => shape=torch.Size([768])
  score.weight => shape=tor
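
The total of 7090944 can be reproduced from the shapes in the listing. The sketch below sums the entries that are fully visible (the last block and `ln_f`) and then adds the 1536 head parameters reported earlier, since the `score.weight` line is cut off in the output.

```python
from math import prod

# Shapes copied from the listing for transformer.h.11 and transformer.ln_f
listed_shapes = [
    (768,), (768,),            # h.11.ln_1 weight, bias
    (768, 2304), (2304,),      # h.11.attn.c_attn weight, bias
    (768, 768), (768,),        # h.11.attn.c_proj weight, bias
    (768,), (768,),            # h.11.ln_2 weight, bias
    (768, 3072), (3072,),      # h.11.mlp.c_fc weight, bias
    (3072, 768), (768,),       # h.11.mlp.c_proj weight, bias
    (768,), (768,),            # ln_f weight, bias
]

unfrozen_params = sum(prod(shape) for shape in listed_shapes)
head_params = 1536             # from the head-only count printed earlier
total_trainable = unfrozen_params + head_params

print(unfrozen_params)
print(total_trainable)
```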

Now, start training this model.

In [14]:
# Import evaluate_accuracy here as well so this cell runs on its own
from src.ult import evaluate_accuracy, initialize_classifier_head
from src.train import train_partial_unfreeze
import time

initialize_classifier_head(classification_model)
print("Classification head reinitialized to eliminate sequential advantage impact.\n")

start_time_further = time.time() 
classification_model = train_partial_unfreeze(classification_model, train_loader, val_loader, device, epochs=8, lr=3e-5) 
end_time_further = time.time() 
finetune_further_time = (end_time_further - start_time_further) / 60 

train_accuracy_partial = evaluate_accuracy(classification_model, train_loader, device) 
val_accuracy_partial = evaluate_accuracy(classification_model, val_loader, device) 
test_accuracy_partial = evaluate_accuracy(classification_model, test_loader, device) 

print(f"\n=== Fine-tuning after unfreezing partial Transformer completed in {finetune_further_time:.2f} minutes ===") 
print(f"Training accuracy: {train_accuracy_partial*100:.2f}%") 
print(f"Validation accuracy: {val_accuracy_partial*100:.2f}%") 
print(f"Test accuracy: {test_accuracy_partial*100:.2f}%") 

Classification head initialized.
Classification head reinitialized to eliminate sequential advantage impact.

Epoch 1/8, step 500/7147, loss = 0.8376
Epoch 1/8, step 1000/7147, loss = 1.0200
Epoch 1/8, step 1500/7147, loss = 1.5388
Epoch 1/8, step 2000/7147, loss = 1.1472
Epoch 1/8, step 2500/7147, loss = 1.0757
Epoch 1/8, step 3000/7147, loss = 0.5841
Epoch 1/8, step 3500/7147, loss = 0.7648
Epoch 1/8, step 4000/7147, loss = 0.9431
Epoch 1/8, step 4500/7147, loss = 0.5554
Epoch 1/8, step 5000/7147, loss = 0.6419
Epoch 1/8, step 5500/7147, loss = 0.6214
Epoch 1/8, step 6000/7147, loss = 1.0334
Epoch 1/8, step 6500/7147, loss = 0.6036
Epoch 1/8, step 7000/7147, loss = 1.0877
Epoch 1/8, step 7140/7147, loss = 0.4104
Epoch 1/8, Average training loss: 1.0636
Validation accuracy: 68.95%

Epoch 2/8, step 500/7147, loss = 0.5880
Epoch 2/8, step 1000/7147, loss = 0.7671
Epoch 2/8, step 1500/7147, loss = 0.9469
Epoch 2/8, step 2000/7147, loss = 0.4601
Epoch 2/8, step 2500/7147, loss = 0.7905
Ep


You may notice that the model's accuracy on the test set has improved.

We now have a GPT-2 model in which the last Transformer block, the final LayerNorm, and the classification head have been fine-tuned.