# Fine-Tuning GPT-2 for Arithmetic Expression Prediction

## Overview:
- **Objective**: Fine-tune a GPT-2 model to predict results of arithmetic expressions.
- **Steps**:
  1. **Setup**: Imported libraries and prepared the tokenizer, model, and dataset.
  2. **Data Handling**: Tokenized expressions and split the dataset for training, validation, and testing.
  3. **Training**: Configured and fine-tuned GPT-2 using `Trainer`.
  4. **Loss Tracking**: Implemented a custom callback to log training losses.
  5. **Evaluation**: Tested the model on unseen data and displayed predictions.
  6. **Visualization**: Plotted training loss trends.

## Import Libraries and Set Device

- **Libraries Imported**:
  - `torch` for PyTorch framework.
  - `transformers` for using GPT-2 tokenizer and model.
  - `torch.utils.data` for dataset handling.
  - `os` for system path and file management.
  
- **GPU Check**:
  - Determines if CUDA-compatible GPU is available.


In [1]:
# pip install transformers torch torchvision

In [2]:
# Import libraries
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from torch.utils.data import Dataset
import os

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


## Dataset Preparation and Tokenization

### Key Components:
1. **Dataset Class**:
   - `ArithmeticDataset`:
     - Handles tokenization of input samples.
     - Implements `__len__` and `__getitem__` for compatibility with PyTorch's `Dataset`.
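     - Sets `labels` equal to `input_ids`: GPT-2 is trained with a causal language-modeling objective, and the model shifts the labels internally so that each position predicts the next token.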

2. **Tokenizer Initialization**:
   - The GPT-2 tokenizer has no padding token by default, so `eos_token` is reused as the padding token.

3. **Dataset Loading**:
   - Reads a text file containing arithmetic samples.

4. **Sample Preparation**:
   - Joins each pair of consecutive lines (a question followed by its answer) into a single sample.

5. **Dataset Splitting**:
   - Splits samples in two stages using `train_test_split`: 20% is held out for testing, then 10% of the remainder for validation, giving roughly 72% training, 8% validation, and 20% test.

6. **Dataset Creation**:
   - Converts the split samples into PyTorch datasets using the `ArithmeticDataset` class.

### Outputs:
- Number of samples in each split is displayed.
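
One consequence of using `input_ids` as `labels` with padded batches is worth illustrating: padded positions also contribute to the loss unless their labels are set to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the masking on plain Python lists. The helper `mask_padding_labels` and the token ids other than `50256` (the EOS id) are hypothetical, and this is not what the notebook's `ArithmeticDataset` does.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask):
    """Return labels that copy input_ids but ignore padded positions.

    Hypothetical helper for illustration: positions where attention_mask
    is 0 (padding) get IGNORE_INDEX so they would not count toward the loss.
    """
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Made-up token ids; 50256 is the GPT-2 EOS id reused here as padding.
example_ids = [2061, 318, 362, 50256, 50256]
example_mask = [1, 1, 1, 0, 0]
example_labels = mask_padding_labels(example_ids, example_mask)
print(example_labels)  # [2061, 318, 362, -100, -100]
```

Inside `__getitem__`, the tensor equivalent would be to clone `input_ids` and assign `-100` wherever `attention_mask` is 0.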


In [3]:
# pip install scikit-learn

In [4]:
from sklearn.model_selection import train_test_split

# Load and preprocess the dataset
class ArithmeticDataset(Dataset):
    def __init__(self, samples, tokenizer, max_length=512):
        self.samples = samples

        # Tokenize the dataset
        self.tokenized_data = tokenizer(
            self.samples,
            truncation=True,
            padding=True,
            max_length=max_length,
            return_tensors="pt"
        )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return {
            "input_ids": self.tokenized_data["input_ids"][idx],
            "attention_mask": self.tokenized_data["attention_mask"][idx],
            "labels": self.tokenized_data["input_ids"][idx],  # GPT-2 uses input_ids as labels
        }

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set padding token for GPT-2 compatibility

# Load the full dataset
dataset_path = "/home/lxp334/LLM_Final_Report/arithmetic__mixed.txt"
with open(dataset_path, "r") as file:
    lines = file.readlines()

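# Prepare samples: join each question line with the answer line that follows it.
# Stopping at len(lines) - 1 avoids an IndexError if the file has an unpaired last line.
samples = [f"{lines[i].strip()} {lines[i + 1].strip()}" for i in range(0, len(lines) - 1, 2)]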

# Split into training, validation, and test sets
train_samples, test_samples = train_test_split(samples, test_size=0.2, random_state=42)
train_samples, val_samples = train_test_split(train_samples, test_size=0.1, random_state=42)

print(f"Training samples: {len(train_samples)}")
print(f"Validation samples: {len(val_samples)}")
print(f"Test samples: {len(test_samples)}")

# Create datasets
train_dataset = ArithmeticDataset(train_samples, tokenizer)
val_dataset = ArithmeticDataset(val_samples, tokenizer)
test_dataset = ArithmeticDataset(test_samples, tokenizer)


Training samples: 479998
Validation samples: 53334
Test samples: 133334
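
The counts printed above can be reproduced by hand, which also makes the effective split ratios explicit. The sketch below assumes scikit-learn rounds a fractional held-out size up to the next integer; `two_stage_split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def two_stage_split_sizes(n_samples, test_fraction=0.2, val_fraction=0.1):
    """Sizes produced by holding out a test set, then a validation set.

    Assumption: each held-out size is the ceiling of fraction * n,
    and the remainder goes to the other side of the split.
    """
    n_test = math.ceil(test_fraction * n_samples)
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_fraction * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


total = 479998 + 53334 + 133334  # total number of samples reported above
n_train, n_val, n_test = two_stage_split_sizes(total)
print(n_train, n_val, n_test)  # 479998 53334 133334

# Effective share of the full dataset held by each split
for name, size in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {size / total:.1%}")
```

Because the validation fraction is applied to the 80% that remains after the test split, the validation set is about 8% of the full data rather than 10%.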


## Custom Callback: Loss Logger

### Purpose:
- Tracks and logs the training loss during model training.

### Implementation:
1. **`LossLoggerCallback` Class**:
   - Inherits from `TrainerCallback` in the `transformers` library.
   - Maintains a list (`self.losses`) to store logged loss values.

2. **`on_log` Method**:
   - Triggered during logging events in training.
   - Appends the current loss value (if available) from the `logs` dictionary to `self.losses`.

### Usage:
- This callback can be added to the `Trainer` to monitor the loss at every logging step during training.
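
To make the filtering in `on_log` concrete, here is a dependency-free sketch. `LossRecorder` is a hypothetical stand-in that mirrors only the `logs` handling of the callback, and the log dictionaries are made-up examples of what a training log, an evaluation log, and an end-of-training summary might look like.

```python
class LossRecorder:
    """Hypothetical stand-in for the callback's log filtering (no transformers needed)."""

    def __init__(self):
        self.losses = []

    def on_log(self, logs=None):
        # Keep only entries that carry a training "loss" key;
        # the `logs and ...` guard also tolerates logs being None.
        if logs and "loss" in logs:
            self.losses.append(logs["loss"])


recorder = LossRecorder()
events = [
    {"loss": 0.91, "learning_rate": 5e-5, "epoch": 0.01},  # training log
    {"eval_loss": 0.92, "epoch": 1.0},                     # evaluation log
    None,                                                  # defensive case
    {"train_runtime": 1.0, "train_loss": 0.83},            # end-of-training summary
]
for event in events:
    recorder.on_log(event)

print(recorder.losses)  # [0.91]
```

Only the first event is recorded, so the stored list reflects training loss alone and is not mixed with validation loss.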


In [5]:
from transformers import TrainerCallback

# Custom callback to track and log training loss
class LossLoggerCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

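    def on_log(self, args, state, control, logs=None, **kwargs):
        # Only training logs carry "loss" (evaluation logs use "eval_loss");
        # the guard also covers the case where logs is None
        if logs and "loss" in logs:
            self.losses.append(logs["loss"])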


## Model Training Setup

- **Model**: Loaded GPT-2 and resized token embeddings to match tokenizer size (a no-op here, since reusing `eos_token` as padding adds no new tokens).
- **Training Arguments**:
  - Checkpoints: Saved after each epoch in `./gpt2-arithmetic`.
  - Epochs: 3, with a batch size of 8.
  - Optimizer: Learning rate of `5e-5` with weight decay (`0.01`) and warm-up steps (`50`).
  - Logging: Every 10 steps; logs saved in `./logs`.
  - Mixed Precision: Enabled if GPU is available.
  - Checkpoint Limit: Maximum of 2, with best model auto-loaded at end.

- **Trainer**: Initialized with model, datasets, tokenizer, and `LossLoggerCallback` for tracking loss.
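
The arguments above imply a concrete training length, which helps put the warm-up and logging settings in context. The following is a rough calculation that assumes a single device and no gradient accumulation; `training_schedule` is a hypothetical helper and the sample count is the training-set size printed earlier.

```python
import math


def training_schedule(num_samples, batch_size, epochs, logging_steps,
                      grad_accum=1, num_devices=1):
    """Estimate optimizer steps and logged loss points.

    Assumptions: the last partial batch is kept (hence the ceiling),
    and one loss value is logged every `logging_steps` optimizer steps.
    """
    batches_per_epoch = math.ceil(num_samples / (batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = updates_per_epoch * epochs
    logged_points = total_steps // logging_steps
    return updates_per_epoch, total_steps, logged_points


steps_per_epoch, total_steps, logged_points = training_schedule(
    num_samples=479998, batch_size=8, epochs=3, logging_steps=10
)
print(steps_per_epoch, total_steps, logged_points)  # 60000 180000 18000
```

Under these assumptions the 50 warm-up steps cover a very small fraction of the 180,000 total steps, and the loss callback collects about 18,000 values.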


In [6]:
# pip install "accelerate>=0.26.0"

In [7]:
# Load the GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # Adjust the embedding size to match the tokenizer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-arithmetic",      # Directory to save model checkpoints
    overwrite_output_dir=True,          # Overwrite existing output directory
    num_train_epochs=3,                 # Number of training epochs
    per_device_train_batch_size=8,      # Batch size per GPU/CPU
    evaluation_strategy="epoch",        # Evaluate the model at the end of each epoch
    save_strategy="epoch",              # Save the model at the end of each epoch
    learning_rate=5e-5,                 # Learning rate
    weight_decay=0.01,                  # Weight decay for regularization
    warmup_steps=50,                    # Warm-up steps for learning rate scheduling
    logging_dir="./logs",               # Directory for training logs
    logging_steps=10,                   # Log every 10 steps
    save_total_limit=2,                 # Limit the number of saved checkpoints
    load_best_model_at_end=True,        # Load the best model at the end of training
    fp16=torch.cuda.is_available(),     # Use mixed-precision training if GPU is available
)

# Initialize the LossLoggerCallback
loss_logger = LossLoggerCallback()

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[loss_logger]  # Attach the loss logger callback here
)


Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Model Training and Loss Logging

- **Training**: Model is trained using the configured `Trainer`.
- **Loss Logging**:
  - Saves tracked training losses from `LossLoggerCallback` to `training_losses.txt` for future reference.


In [8]:
# Train the model
trainer.train()

# Save the training losses for future reference
with open("training_losses.txt", "w") as f:
    for loss in loss_logger.losses:
        f.write(f"{loss}\n")


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.8557        | 0.916585        |
| 2     | 0.8351        | 0.883198        |
| 3     | 0.7855        | 0.868684        |


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


## Visualizing Training Loss

- **Plot Configuration**:
  - Loss plotted against training steps.
  - Styled with Seaborn's "whitegrid" and pastel palette.
  - Titles and axis labels added with enhanced styling.

- **Output**:
  - The loss plot is saved as `FineTuning_GPT2_Loss.png` with high resolution (300 DPI).
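
With a loss value logged every 10 steps, the raw curve can be dense and noisy, so a smoothed series is often easier to read. Below is a minimal sketch of a trailing moving average; `moving_average` and the window size are hypothetical choices, not something the notebook uses.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


demo = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
print(demo)  # [1.0, 1.5, 2.5, 3.5]
```

The smoothed list has the same length as the input, so it could be plotted alongside or in place of `loss_logger.losses` with the same x-axis values.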


In [9]:
# pip install --upgrade matplotlib


In [10]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set modern style for the plot
sns.set_theme(style="whitegrid", palette="pastel")

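# Plot the training loss; one value is recorded every `logging_steps` training steps,
# so scale the x-axis to match the "Training Steps" label
steps = [i * training_args.logging_steps for i in range(1, len(loss_logger.losses) + 1)]
plt.figure(figsize=(12, 6))
plt.plot(
    steps,
    loss_logger.losses,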
    marker='o',
    linestyle='-',
    linewidth=2,
    markersize=6,
    label="Training Loss",
)

# Add titles and labels
plt.title("Training Loss Over Steps", fontsize=18, weight='bold')
plt.xlabel("Training Steps", fontsize=14, labelpad=10)
plt.ylabel("Loss", fontsize=14, labelpad=10)
plt.legend(fontsize=12, loc='upper right')

# Add grid and style
plt.grid(True, linestyle='--', alpha=0.6)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Save the plot
plt.savefig("FineTuning_GPT2_Loss.png", dpi=300, bbox_inches='tight')


In [11]:
# Save the model as a .pth file
torch.save(model.state_dict(), "fine_tuned_gpt2.pth")
print("Model saved successfully as fine_tuned_gpt2.pth")


Model saved successfully as fine_tuned_gpt2.pth


## Model Testing and Predictions

- **Function**: `test_model_and_display_predictions`
  - Evaluates the model on the test dataset and compares predictions with ground truth.

- **Key Steps**:
  1. **Evaluation Mode**: Model set to `eval` to disable dropout; `torch.no_grad()` turns off gradient tracking during inference.
  2. **Input Decoding**: Converts tokenized inputs back to text for readability.
  3. **Prediction**: Generates output using greedy decoding.
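  4. **Ground Truth Comparison**: Takes the last whitespace-separated token of the decoded sample as the correct result. The same full sample, answer included, is also the prompt passed to `generate`, so the answer is visible to the model and this check does not measure its ability to compute the result.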

- **Output**:
  - Displays a table of input expressions, ground truth values, and model predictions for a sample of test data.


In [12]:
def test_model_and_display_predictions(test_dataset, model, tokenizer, num_examples=10):
    model.eval()  # Set model to evaluation mode
    predictions = []
    ground_truths = []
    inputs = []

    with torch.no_grad():
        for idx in range(min(num_examples, len(test_dataset))):
            sample = test_dataset[idx]

            # Decode input sequence
            input_text = tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
            inputs.append(input_text)

            # Extract ground truth
            ground_truth = input_text.split()[-1]  # Assuming the last token is the result
            ground_truths.append(ground_truth)

            # Generate prediction
            generated = model.generate(
                input_ids=sample["input_ids"].unsqueeze(0).to(device),
                max_length=50,
                eos_token_id=tokenizer.eos_token_id,
                num_beams=1  # Greedy decoding
            )
            predicted_text = tokenizer.decode(generated[0], skip_special_tokens=True)
            predicted_result = predicted_text.split()[-1]  # Extract predicted result
            predictions.append(predicted_result)

    # Display results
    print("Testing Results:")
    print(f"{'Input Expression':<50} {'Ground Truth':<15} {'Prediction':<15}")
    print("-" * 80)
    for inp, gt, pred in zip(inputs, ground_truths, predictions):
        print(f"{inp:<50} {gt:<15} {pred:<15}")
        
# Test the fine-tuned model and display predictions
test_model_and_display_predictions(test_dataset, model, tokenizer, num_examples=10)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Testing Results:
Input Expression                                   Ground Truth    Prediction     
--------------------------------------------------------------------------------
What is the value of (-9)/12 + 0 - 38/(-8)? 4      4               4              
What is 500/(-40) - (-27)/6? -8                    -8              -8             
What is the value of 1/(-5) + (-152)/190? -1       -1              -1             
Evaluate (4/2)/(-3*(-42)/(-18)). -2/7              -2/7            -2/7           
Calculate (-12)/40*(102/18 - 5). -1/5              -1/5            -1/5           
What is 6/(-120)*6 - (-2)/5? 1/10                  1/10            1/10           
Evaluate 8*1/((-9)/((-18)/8)). 2                   2               2              
What is the value of (-12)/15 + (-357)/(-315)? 1/3 1/3             1/3            
Calculate (288/320)/(6/(-10)) - 1. -5/2            -5/2            -5/2           
What is the value of 8/3*(18/(-12) - 0)? -4        -4              -4   
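
Since each test sample ends with its answer, an evaluation that measures arithmetic ability would need to prompt with the question only. The sketch below shows that bookkeeping under the assumption that answers contain no spaces (true of the samples shown above). The names `split_prompt_answer`, `predict_answer`, `exact_match_accuracy`, and the `generate_fn` callable are hypothetical; `generate_fn` stands in for tokenizing the prompt, calling `model.generate`, and decoding the result.

```python
def split_prompt_answer(sample):
    """Split 'question answer' at the last whitespace (assumes space-free answers)."""
    prompt, answer = sample.rsplit(maxsplit=1)
    return prompt, answer


def predict_answer(prompt, generate_fn):
    """Return the first whitespace-separated token generated after the prompt."""
    completion = generate_fn(prompt)
    # Decoder-only models typically return prompt + continuation; strip the prompt
    if completion.startswith(prompt):
        completion = completion[len(prompt):]
    tokens = completion.split()
    return tokens[0] if tokens else ""


def exact_match_accuracy(samples, generate_fn):
    """Fraction of samples whose predicted answer equals the held-out answer."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        prompt, answer = split_prompt_answer(sample)
        if predict_answer(prompt, generate_fn) == answer:
            correct += 1
    return correct / len(samples)


# Stand-in generator that always answers "4"; a real one would wrap the model.
def fake_generate(prompt):
    return prompt + " 4"


demo_samples = [
    "What is the value of (-9)/12 + 0 - 38/(-8)? 4",
    "What is 500/(-40) - (-27)/6? -8",
]
accuracy = exact_match_accuracy(demo_samples, fake_generate)
print(accuracy)  # 0.5
```

With a real model, `generate_fn` would receive only the question, so a correct prediction could not be copied from the prompt.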