# Using GPT-2 for Text Classification

For demonstration purposes and due to resource limitations, we have chosen GPT-2 as our model. You should be able to apply the same methods to larger models which will likely yield better performance.

We build on the notebook from Sebastian Raschka's LLMs course [3]: [GitHub - rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb)


## Environment Setup

First, we install and import the required libraries. If a package is missing, install it with `pip install`.

In [1]:
#!$ pip install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org <package_name>
#!pip install torch --user
#!pip uninstall matplotlib -y
#!pip uninstall pillow -y 
#!pip uninstall numpy -y 
#!pip uninstall datasets -y -v 
#!pip install matplotlib
#!pip install pillow
#!pip install numpy
#!pip install datasets --user

In [2]:
%load_ext autoreload
%autoreload 2
#import os
#os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
#!pip install datasets --user
#!pip install evaluate --user
#!pip3 install Cython --user

In [3]:
# %%

from importlib.metadata import version 
import sys 
import os 
path_to_save_folder = "model"
path_to_lora = os.path.join(path_to_save_folder,"lora")
path_to_partial = os.path.join(path_to_save_folder,"partial")
pkgs = [ 
    "matplotlib", 
    "numpy", 
    "torch", 
    "transformers",   
    "datasets",        
    "pandas",          
    "evaluate",        
] 

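# Report the installed version of each package and stop if one is missing
for p in pkgs: 
    try: 
        print(f"{p} version: {version(p)}") 
    except Exception:  # importlib.metadata raises PackageNotFoundError for missing packages
        print(f"{p} is not installed. Please use `pip install {p}` to install.") 
        sys.exit(1) 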
data_path = "content"


matplotlib version: 3.10.0
numpy version: 2.0.2
torch version: 2.5.1
transformers version: 4.48.0
datasets version: 3.2.0
pandas version: 2.2.3
evaluate version: 0.4.3


In [4]:
import torch

In [5]:
# Set model name
from transformers import GPT2Tokenizer
model_name = "gpt2"

# Load GPT-2 Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)



## Data Preparation and Loading

We will use the SMS Spam Collection dataset, which contains SMS messages labeled as "spam" or "ham". We will download and preprocess the data, then split it into training, validation, and test sets.

If the download fails, please download the archive manually.
Download link: https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip

The code below creates balanced classes of ham and spam. **Reflect on the advantages and disadvantages of this step.** 
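
As a rough illustration of what balancing by undersampling can look like, the sketch below uses only the standard library. The helper name `undersample` and the toy data are assumptions made for this example; the notebook's own `load_complete_dataframe` may balance the classes differently.

```python
import random
from collections import Counter

def undersample(rows, seed=123):
    """Randomly drop rows from larger classes until all classes have equal size.

    `rows` is a list of (text, label) pairs. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    # Every class is reduced to the size of the smallest class
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Toy data: 8 "ham" (label 0) and 3 "spam" (label 1) messages
toy_rows = [(f"ham message {i}", 0) for i in range(8)] + [(f"spam message {i}", 1) for i in range(3)]
balanced_rows = undersample(toy_rows)
label_counts = Counter(label for _, label in balanced_rows)
print(label_counts)
```

In this toy run, five of the eight majority-class messages are discarded, which is the price paid for equal class sizes.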


In [6]:
from src.fine_helper import load_complete_dataframe
is_balanced = True
#
#Just needs to be done ONCE ! 
#
train_df, validation_df, test_df = load_complete_dataframe(data_path,is_balanced=is_balanced)

Original dataset label counts:
Label
1        37569
0        29780
label        1
Name: count, dtype: int64
(67350, 2)

Balanced dataset label distribution:
Label
1    29780
0    29780
Name: count, dtype: int64

Training set size: 67350, Validation set size: 873, Test set size: 1822


## Creating Data Loaders

Next, we encode and pad the SMS texts so that every input in a batch has the same length. We use `<|PAD|>` as the padding token and build the matching `attention_mask`.
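
The padding step can be sketched in a few lines of plain Python. The function name `pad_and_mask` and the example token IDs are assumptions for illustration; the actual logic lives in `get_enc_dataset` and may differ in detail.

```python
def pad_and_mask(token_ids, max_length, pad_token_id):
    """Truncate or right-pad `token_ids` to `max_length` and build the matching mask."""
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)       # 1 = real token
    pad_len = max_length - len(ids)       # never negative after truncation
    ids += [pad_token_id] * pad_len
    attention_mask += [0] * pad_len       # 0 = padding, to be ignored by attention
    return ids, attention_mask

# Arbitrary example IDs; 50257 is the ID reported for <|PAD|> in this notebook
ids, mask = pad_and_mask([15496, 995, 0], max_length=6, pad_token_id=50257)
print(ids)   # [15496, 995, 0, 50257, 50257, 50257]
print(mask)  # [1, 1, 1, 0, 0, 0]
```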


### New Code

In [6]:
from src.dataset_loader import get_enc_dataset
train_dataset, val_dataset,test_dataset,train_loader,val_loader,test_loader, pad_token_id = get_enc_dataset(data_path,tokenizer,is_log = True)

Added new pad_token '<|PAD|>' with ID: 50257
Index(['Text', 'Label'], dtype='object')
Max Length is:  65
Index(['Text', 'Label'], dtype='object')
Max Length is:  65
Index(['Text', 'Label'], dtype='object')
Max Length is:  65
Max Length:  65
Number of training batches: 7147, Number of validation batches: 298


## Defining Computational Device
Before loading and training the model, we need to define the computational device (CPU or GPU). If a GPU is available, it will be preferred to accelerate training.

In [7]:
if torch.cuda.is_available(): 
    device = torch.device("cuda") 
elif torch.backends.mps.is_available():
    device = torch.device("mps") 
else: 
    device = torch.device("cpu") 
print(f"Using device: {device}") 
#device = torch.device("cpu") 

Using device: cuda


## Understanding Model Structure
Load the pre-trained GPT-2 model and make sure you understand its structure. **What do the abbreviations represent? Which of the layer types have we seen in the lecture?**

In [8]:
from transformers import GPT2LMHeadModel, GPT2ForSequenceClassification


### Model structure 

In [9]:
from transformers import GPT2LMHeadModel, GPT2ForSequenceClassification
pretrained_gpt2_lm = GPT2LMHeadModel.from_pretrained("gpt2") 

#print(pretrained_gpt2_lm)


Now compare the structure to the GPT-2 model for classification. **What is the difference?**  

In [10]:
classification_model = GPT2ForSequenceClassification.from_pretrained( 
    model_name, 
    num_labels=2, 
    pad_token_id=tokenizer.pad_token_id 
).to(device) 

#print(classification_model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
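
According to the Hugging Face documentation, `GPT2ForSequenceClassification` classifies from the last token of each sequence and, when a `pad_token_id` is configured, uses the last token that is not padding. The sketch below shows that selection for a single right-padded sequence in plain Python. The helper name is hypothetical and this is not the library's implementation.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Return the index of the last token that is not padding (right padding assumed)."""
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_token_id:
            return position
    raise ValueError("sequence contains only padding tokens")

PAD = 50257  # ID reported for <|PAD|> in this notebook
row = [40, 588, 428, PAD, PAD]  # arbitrary token IDs followed by padding
print(last_non_pad_index(row, PAD))  # 2
```

This is why `pad_token_id` is passed to `from_pretrained` when the classification model is created.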


## Initial Model Testing

At this point, the classification head is untrained, so its predictions are essentially arbitrary.

In [11]:
#pad_token_id

In [12]:
# %%

def classify_text(text, model, tokenizer, device, max_length, pad_token_id=50256): 
    """Return the predicted class index for a single text."""
    model.eval() 
    # Tokenize and truncate to at most max_length tokens
    enc = tokenizer.encode(text, add_special_tokens=True, truncation=True, max_length=max_length) 
    att_mask = [1]*len(enc) 
    # Right-pad to max_length; padded positions get attention mask 0
    pad_len = max_length - len(enc) 
    if pad_len > 0: 
        enc += [pad_token_id]*pad_len 
        att_mask += [0]*pad_len 
     
    input_ids = torch.tensor([enc], dtype=torch.long).to(device) 
    attention_mask = torch.tensor([att_mask], dtype=torch.long).to(device) 

    with torch.no_grad(): 
        print(input_ids.shape)
        # Pass the mask so that padded positions are ignored by attention
        outputs = model(input_ids, attention_mask=attention_mask) 
        logits = outputs.logits 
        predicted = torch.argmax(logits, dim=-1).item() 
    return predicted

sample_text_spam = "Fine" 
sample_text_ham = "Bad" 

print("(Before fine-tuning) Initial prediction of the classification head:") 
# The embedding matrix has not been resized yet, so pad with pad_token_id - 1,
# the last ID that the original vocabulary covers
pred_positive = classify_text(sample_text_spam, classification_model, tokenizer, device,
                              max_length=65, pad_token_id=pad_token_id-1)
print(f"Positive sample => Prediction: {pred_positive}") 
pred_negative = classify_text(sample_text_ham, classification_model, tokenizer, device,
                              max_length=65)
print(f"Negative sample => Prediction: {pred_negative}") 

(Before fine-tuning) Initial prediction of the classification head:
torch.Size([1, 65])
Positive sample => Prediction: 0
torch.Size([1, 65])
Negative sample => Prediction: 0



## Tuning only the classification layer

As a first step, we will freeze the parameters of the GPT-2 base model and tune only the classification layer.

In [13]:

# We need to add the padding token to the embeddings
classification_model.resize_token_embeddings(len(tokenizer)) 
classification_model.to(device) 

i =0
k=0
for param in classification_model.base_model.parameters(): 
    param.requires_grad = False 
    i+=param.numel()

print(f"Number of base parameters: {i}")


for param in classification_model.score.parameters(): 
    param.requires_grad = True 
    k+=param.numel()

print(f"\nTraining only the classification head, trainable parameters: {k}") 


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Number of base parameters: 124440576

Training only the classification head, trainable parameters: 1536
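
Both numbers above can be reproduced by hand. The sketch below assumes the standard GPT-2 small dimensions (768-dimensional embeddings, 12 blocks, 1024 positions, 50257 vocabulary entries) plus the one added padding token; the helper `linear_params` is introduced here only for the arithmetic.

```python
# GPT-2 small dimensions (assumed from the standard "gpt2" configuration)
d_model = 768
n_layers = 12
n_positions = 1024
vocab_size = 50257 + 1  # original vocabulary plus the added <|PAD|> token
num_labels = 2

def linear_params(n_in, n_out, bias=True):
    """Number of parameters in a linear layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * d_model  # weight and bias
attention = linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)
mlp = linear_params(d_model, 4 * d_model) + linear_params(4 * d_model, d_model)
block = 2 * layer_norm + attention + mlp

# Token embeddings + position embeddings + all blocks + final LayerNorm
base_params = vocab_size * d_model + n_positions * d_model + n_layers * block + layer_norm
# The classification head is a single linear layer without bias
head_params = linear_params(d_model, num_labels, bias=False)

print(base_params)  # 124440576
print(head_params)  # 1536
```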


In [13]:
import time 
from ult import evaluate_accuracy
from src.train import train_head_only
is_dry=False
#Epochs 10
start_time_head = time.time() 
classification_model, all_avg_loss,all_acc_train,all_acc_val = train_head_only(classification_model, train_loader, val_loader, device, epochs=11, lr=3e-5,is_dry=is_dry) 
end_time_head = time.time() 

train_accuracy_head = evaluate_accuracy(classification_model, train_loader, device,is_dry=is_dry) 
val_accuracy_head = evaluate_accuracy(classification_model, val_loader, device,is_dry=is_dry) 
test_accuracy_head = evaluate_accuracy(classification_model, test_loader, device,is_dry=is_dry) 
finetune_head_time = (end_time_head - start_time_head) / 60 

print(f"\n=== Fine-tuning only the classification head completed in {finetune_head_time:.2f} minutes ===") 
print(f"Training accuracy: {train_accuracy_head*100:.2f}%") 
print(f"Validation accuracy: {val_accuracy_head*100:.2f}%") 
#print(f"Test accuracy: {test_accuracy_head*100:.2f}%") 

Epoch 1/11, step 500/7147, loss = 0.3942
Epoch 1/11, step 1000/7147, loss = 0.7415
Epoch 1/11, step 1500/7147, loss = 0.7541
Epoch 1/11, step 2000/7147, loss = 0.6668
Epoch 1/11, step 2500/7147, loss = 0.6145
Epoch 1/11, step 3000/7147, loss = 0.6348
Epoch 1/11, step 3500/7147, loss = 0.5880
Epoch 1/11, step 4000/7147, loss = 0.6075
Epoch 1/11, step 4500/7147, loss = 0.4708
Epoch 1/11, step 5000/7147, loss = 0.5165
Epoch 1/11, step 5500/7147, loss = 0.6139
Epoch 1/11, step 6000/7147, loss = 0.6627
Epoch 1/11, step 6500/7147, loss = 0.6722
Epoch 1/11, step 7000/7147, loss = 0.5596
Epoch 1/11, step 7140/7147, loss = 0.7131
Epoch 1/11, Average training loss: 0.6824
Validation accuracy: 74.53% train_acc=61.19%

Epoch 2/11, step 500/7147, loss = 0.6114
Epoch 2/11, step 1000/7147, loss = 0.6706
Epoch 2/11, step 1500/7147, loss = 0.6326
Epoch 2/11, step 2000/7147, loss = 0.6903
Epoch 2/11, step 2500/7147, loss = 0.6296
Epoch 2/11, step 3000/7147, loss = 0.6398
Epoch 2/11, step 3500/7147, loss

In [15]:
# Save model 
from src.eval_helper import save_everything
from src.eval_helper import *
#path_to_partial TODO
train_run_label = "headonly_ep11_saveUpdate_try2"
#save_everything(path_to_partial, train_run_label, elapsed, train_losses, val_accs)
#classification_model, all_avg_loss,all_acc_train,all_acc_val

"""
save_everything(path_to_head_only, train_run_label, finetune_further_time,
                 all_avg_loss,all_acc_val,all_acc_val,
                train_accuracy_partial,val_accuracy_partial,test_accuracy_partial,classification_model)
"""


save_everything(path_to_save_folder=path_to_head_only,
                 train_run_label=train_run_label,
                 elapsed=finetune_head_time,
                 train_losses=all_avg_loss,
                 train_acc=all_acc_train,
                 val_accs=all_acc_val,
                 train_acc_complete=train_accuracy_head,
                 val_acc_complete=val_accuracy_head,
                 test_acc_complete=test_accuracy_head,
                 model=classification_model)

Everything saved at
 01:30:00


You may find that fine-tuning just a linear layer can achieve a significant improvement. Are you satisfied with the test accuracy results? Try different hyperparameters, such as increasing the learning rate, and try to get better accuracy. 

Next, we will try unfreezing the Transformer block before the linear layer; Sebastian Raschka found that this can significantly improve the model's performance on specific downstream tasks.

## Unfreeze Partial Model Layers and Further Fine-Tuning

We unfreeze the last Transformer block (`transformer.h.11`) and the final LayerNorm of GPT-2, and train them along with the classification head (`score`) to further enhance performance.

In [14]:
# Unfreeze the last transformer block and LayerNorm in base_model
for param in classification_model.base_model.h[-1].parameters(): 
    param.requires_grad = True 
    k+=param.numel()
for param in classification_model.base_model.ln_f.parameters(): 
    param.requires_grad = True
    k += param.numel()

print(f"Total trainable parameters: {k}") 
print("\nTrainable parts after unfreezing the last Transformer block and LayerNorm:") 
trainable_params_count = 0 
for name, param in classification_model.named_parameters(): 
    if param.requires_grad: 
        print(f"  {name} => shape={param.size()}") 


Total trainable parameters: 7090944

Trainable parts after unfreezing the last Transformer block and LayerNorm:
  transformer.h.11.ln_1.weight => shape=torch.Size([768])
  transformer.h.11.ln_1.bias => shape=torch.Size([768])
  transformer.h.11.attn.c_attn.weight => shape=torch.Size([768, 2304])
  transformer.h.11.attn.c_attn.bias => shape=torch.Size([2304])
  transformer.h.11.attn.c_proj.weight => shape=torch.Size([768, 768])
  transformer.h.11.attn.c_proj.bias => shape=torch.Size([768])
  transformer.h.11.ln_2.weight => shape=torch.Size([768])
  transformer.h.11.ln_2.bias => shape=torch.Size([768])
  transformer.h.11.mlp.c_fc.weight => shape=torch.Size([768, 3072])
  transformer.h.11.mlp.c_fc.bias => shape=torch.Size([3072])
  transformer.h.11.mlp.c_proj.weight => shape=torch.Size([3072, 768])
  transformer.h.11.mlp.c_proj.bias => shape=torch.Size([768])
  transformer.ln_f.weight => shape=torch.Size([768])
  transformer.ln_f.bias => shape=torch.Size([768])
  score.weight => shape=tor
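
The reported total of 7090944 trainable parameters can be checked against the shapes listed above. The sketch below copies those shapes for the last block and the final LayerNorm and adds the 1536 head parameters reported earlier; the variable and function names are chosen for this example only.

```python
import math

# Shapes copied from the listing above (transformer.h.11)
last_block_shapes = [
    (768,), (768,),          # ln_1 weight, bias
    (768, 2304), (2304,),    # attn.c_attn weight, bias
    (768, 768), (768,),      # attn.c_proj weight, bias
    (768,), (768,),          # ln_2 weight, bias
    (768, 3072), (3072,),    # mlp.c_fc weight, bias
    (3072, 768), (768,),     # mlp.c_proj weight, bias
]
final_norm_shapes = [(768,), (768,)]  # transformer.ln_f weight, bias
head_params = 1536                    # classification head, as reported earlier

def count(shapes):
    """Total number of elements across all tensors with the given shapes."""
    return sum(math.prod(shape) for shape in shapes)

total_trainable = head_params + count(last_block_shapes) + count(final_norm_shapes)
print(total_trainable)  # 7090944
```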

Now, start training this model.

In [15]:
from src.ult import initialize_classifier_head
from src.train import  train_partial_unfreeze
from ult import evaluate_accuracy
import time 
is_dry=False

initialize_classifier_head(classification_model)
print("Classification head reinitialized to eliminate sequential advantage impact.\n")

start_time_further = time.time() 
classification_model, all_avg_loss,all_acc_train,all_acc_val = train_partial_unfreeze(classification_model, train_loader, val_loader, device, epochs=8, lr=3e-5,is_dry=is_dry) 
end_time_further = time.time() 
finetune_further_time = (end_time_further - start_time_further) / 60 

train_accuracy_partial = evaluate_accuracy(classification_model, train_loader, device,is_dry=is_dry) 
val_accuracy_partial = evaluate_accuracy(classification_model, val_loader, device,is_dry=is_dry) 
test_accuracy_partial = evaluate_accuracy(classification_model, test_loader, device,is_dry=is_dry) 

print(f"\n=== Fine-tuning after unfreezing partial Transformer completed in {finetune_further_time:.2f} minutes ===") 
print(f"Training accuracy: {train_accuracy_partial*100:.2f}%") 
print(f"Validation accuracy: {val_accuracy_partial*100:.2f}%") 
print(f"Test accuracy: {test_accuracy_partial*100:.2f}%") 

Classification head initialized.
Classification head reinitialized to eliminate sequential advantage impact.

Epoch 1/8, step 500/7147, loss = 1.35494
Epoch 1/8, step 1000/7147, loss = 1.2663
Epoch 1/8, step 1500/7147, loss = 0.4811
Epoch 1/8, step 2000/7147, loss = 0.5282
Epoch 1/8, step 2500/7147, loss = 0.7143
Epoch 1/8, step 3000/7147, loss = 0.5997
Epoch 1/8, step 3500/7147, loss = 1.0294
Epoch 1/8, step 4000/7147, loss = 0.7298
Epoch 1/8, step 4500/7147, loss = 0.8081
Epoch 1/8, step 5000/7147, loss = 0.4371
Epoch 1/8, step 5500/7147, loss = 0.5999
Epoch 1/8, step 6000/7147, loss = 0.6458
Epoch 1/8, step 6500/7147, loss = 0.5266
Epoch 1/8, step 7000/7147, loss = 0.5490
Epoch 1/8, step 7140/7147, loss = 0.7666
Epoch 1/8, Average training loss: 1.1999
Validation accuracy: 67.56% train_acc=55.05%

Epoch 2/8, step 500/7147, loss = 0.6014
Epoch 2/8, step 1000/7147, loss = 0.5937
Epoch 2/8, step 1500/7147, loss = 0.6144
Epoch 2/8, step 2000/7147, loss = 0.6461
Epoch 2/8, step 2500/7147

You may notice that the model's accuracy on the test set has improved.

We now have a GPT-2 model in which the last Transformer block, the final LayerNorm, and the classification head have been fine-tuned.

#### Saving training results

In [16]:
from src.eval_helper import save_everything
from src.eval_helper import *
#path_to_partial TODO
train_run_label = "partial_ep8_try2"
#save_everything(path_to_partial, train_run_label, elapsed, train_losses, val_accs)
#classification_model, all_avg_loss,all_acc_train,all_acc_val
"""
save_everything(path_to_partial, train_run_label, 
                finetune_further_time, all_avg_loss, all_acc_val,
                train_accuracy_partial,
                val_accuracy_partial,test_accuracy_partial,classification_model)
"""
print("Save")

save_everything(path_to_save_folder=path_to_partial,
                 train_run_label=train_run_label,
                 elapsed=finetune_further_time,
                 train_losses=all_avg_loss,
                 train_acc=all_acc_train,
                 val_accs=all_acc_val,
                 train_acc_complete=train_accuracy_partial,
                 val_acc_complete=val_accuracy_partial,
                 test_acc_complete=test_accuracy_partial,
                 model=classification_model)

Save
Everything saved at
 02:17:14


In [16]:
print("Hello World")

Hello World
