# Tutorial 2: GPT-Adder - Learning Arithmetic with Complete Input

Welcome to the GPT-Adder tutorial! In this version, we train a transformer model to perform addition where:
- **Input (X)**: Complete question like "2+3="
- **Output (Y)**: Single predicted answer like "5"

This differs from the usual autoregressive setup in NLP, where a language model predicts text one character at a time. Here the model reads the whole question and makes one prediction, so we treat the task as sequence-to-single-token prediction.

**Goal:** Train a transformer to map complete addition questions to single numeric answers.

- Input X is the full question "a+b="
- Output Y is a single token representing the answer
- We'll use a classification approach where each possible answer is a class
- Model architecture includes a classification head

## 1. Setup and Imports

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Config, GPT2Model
import random
import numpy as np

## 2. Configuration

In [2]:
# Hyperparameters
batch_size = 32        # Number of independent sequences processed in parallel per training step
max_iters = 2000       # Number of training iterations
eval_interval = 250    # Evaluate every eval_interval iterations
learning_rate = 1e-3   # Learning rate for the optimizer
device = 'cpu'         # The device to run the model on
eval_iters = 100       # Maximum number of batches used per evaluation
n_embd = 128           # The number of embedding dimensions
n_head = 4             # The number of attention heads
n_layer = 4            # The number of layers
dropout = 0.1          # Dropout rate (regularization to reduce overfitting)

# Parameters for data generation
ndigit = 2  # Up to 2-digit numbers (0-99)

# Calculate maximum possible answer for classification
max_answer = (10**ndigit - 1) + (10**ndigit - 1)  # e.g., 99+99=198 for ndigit=2
num_answer_classes = max_answer + 1  # 0 to max_answer inclusive

print(f"Maximum possible answer: {max_answer}")
print(f"Number of answer classes: {num_answer_classes}")

# For reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

Maximum possible answer: 198
Number of answer classes: 199



#### Explanation: Maximum Possible Answer and Classification

In this tutorial, we are framing the addition problem "a+b=" as a **classification task**. This means the model's goal is not to generate the sequence of digits for the answer, but to predict which *single class* the answer belongs to.

Think of it like image classification where a model predicts if an image is a "cat," "dog," or "bird." Here, our "classes" are all the possible numerical answers the addition problems can produce.

1.  **Defining the Classes:**
    Since our input numbers `a` and `b` are limited by `ndigit` (e.g., for `ndigit=2`, numbers range from 0 to 99), there's a maximum possible sum.
    - If `ndigit=2`, the largest sum is 99 + 99 = 198.
    - The smallest sum is 0 + 0 = 0.
    So, all possible answers lie in the range [0, 198].

2.  **`num_answer_classes`:**
    Each unique integer answer in this range becomes a distinct "class" for our model.
    - `max_answer = (10**ndigit - 1) + (10**ndigit - 1)` calculates this maximum sum.
    - `num_answer_classes = max_answer + 1` determines the total number of unique classes (from 0 up to `max_answer`, inclusive). For `ndigit=2`, this is 198 + 1 = 199 classes.

3.  **Why Classification?**
    By treating this as a classification problem:
    - The model's output layer (the "classification head") will have `num_answer_classes` neurons.
    - Each neuron corresponds to one possible sum (e.g., neuron 0 for answer "0", neuron 1 for answer "1", ..., neuron 198 for answer "198").
    - The model will output a probability distribution over these classes, and the class with the highest probability is chosen as the predicted answer.
    - We use `CrossEntropyLoss`, which is standard for classification tasks.

This approach simplifies the problem compared to generating an answer character by character, especially since the output (the sum) is a single entity. The model just needs to learn to map the input question sequence to the correct answer "bucket" or class.
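
As a rough illustration of the classification framing, the sketch below uses stand-in logits rather than real model output. All-zero logits are an assumption chosen for illustration: they make every class equally likely, so the resulting cross-entropy is ln(199) ≈ 5.29. That is a handy baseline when reading the loss at step 0 of the training log, which sits close to this value.

```python
import math

import torch
from torch.nn import functional as F

num_answer_classes = 199  # sums 0..198 for ndigit=2

# Stand-in logits for a batch of 4 questions: all zeros means a uniform guess.
logits = torch.zeros(4, num_answer_classes)
labels = torch.tensor([5, 95, 107, 198])  # true sums, used directly as class indices

probs = F.softmax(logits, dim=-1)  # each row sums to 1
uniform_loss = F.cross_entropy(logits, labels).item()

print(f"Probability per class: {probs[0, 0].item():.5f}")
print(f"Cross-entropy of a uniform guess: {uniform_loss:.4f}")
print(f"ln(num_answer_classes): {math.log(num_answer_classes):.4f}")

# Make class 95 the clear winner for the second example and read off the prediction.
logits[1, 95] = 10.0
predicted = torch.argmax(logits, dim=-1)
print(f"Predicted class for example 1: {predicted[1].item()}")
```

A loss well below this baseline indicates the model has learned something beyond guessing uniformly.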

## 3. Data Preparation

### 3.1 Vocabulary and Tokenization

In [3]:
input_chars = '0123456789+= '  # Added space at the end for padding
input_vocab_size = len(input_chars)
print(f"Input vocabulary: '{input_chars}'")
print(f"Input vocabulary size: {input_vocab_size}")

# Create mappings for input
input_stoi = {ch: i for i, ch in enumerate(input_chars)}  # character -> index
input_itos = {i: ch for i, ch in enumerate(input_chars)}  # index -> character

def encode_input(s):
    """Encode a string into a list of vocabulary indices."""
    return [input_stoi[c] for c in s]

def decode_input(l):
    """Decode a list of vocabulary indices back into a string."""
    return ''.join(input_itos[i] for i in l)

# Test encoding/decoding
test_question = "2+3="
encoded_test = encode_input(test_question)
decoded_test = decode_input(encoded_test)
print(f"Original question: '{test_question}'")
print(f"Encoded: {encoded_test}")
print(f"Decoded: '{decoded_test}'")

# Test with padding
test_padded = "2+3= "  # With space padding
encoded_padded = encode_input(test_padded)
decoded_padded = decode_input(encoded_padded)
print(f"Padded question: '{test_padded}'")
print(f"Encoded padded: {encoded_padded}")
print(f"Decoded padded: '{decoded_padded}'")

Input vocabulary: '0123456789+= '
Input vocabulary size: 13
Original question: '2+3='
Encoded: [2, 10, 3, 11]
Decoded: '2+3='
Padded question: '2+3= '
Encoded padded: [2, 10, 3, 11, 12]
Decoded padded: '2+3= '


### 3.2 Data Generation

In [4]:
def generate_addition_data(num_digits):
    """Generate a single addition problem and answer."""
    a = random.randint(0, 10**num_digits - 1)
    b = random.randint(0, 10**num_digits - 1)
    c = a + b
    question = f"{a}+{b}="
    answer = c  # Single integer answer
    return question, answer

# Calculate maximum question length for padding
# e.g. ndigit=1 -> "9+9=" (4 characters); ndigit=2 -> "99+99=" (6 characters)
max_question_length = ndigit + 1 + ndigit + 1  # a + "+" + b + "="
print(f"Maximum question length: {max_question_length}")

# Test data generation
print("Sample problems:")
for _ in range(5):
    q, a = generate_addition_data(ndigit)
    print(f"Question: '{q}' -> Answer: {a}")

Maximum question length: 6
Sample problems:
Question: '81+14=' -> Answer: 95
Question: '3+94=' -> Answer: 97
Question: '35+31=' -> Answer: 66
Question: '28+17=' -> Answer: 45
Question: '94+13=' -> Answer: 107


#### Explanation: Padding Input Sequences

Transformer models, like the GPT-2 architecture we're using as a base, can process sequences of any length up to a configured maximum, but all sequences within one batch must share the same length. Our input questions (e.g., "1+2=", "10+5=", "99+99=") vary in length.

**Why Padding?**
1.  **Batch Processing:** To train neural networks efficiently, we feed data in batches. All sequences within a single batch must have the same length so they can be processed in parallel by the GPU or CPU.
2.  **Maximum Sequence Length:** The model's learned position embeddings cover only a predefined number of positions (`n_positions` in `GPT2Config`, which we set to `max_question_length`), so no input may be longer than that.

**How Padding Works:**
1.  **Determine `max_question_length`:** We first calculate the maximum possible length an input question can have. For `ndigit=2`, the longest question is "99+99=" (6 characters). This becomes our `max_question_length`.
2.  **Add a Padding Token:** We add a special padding character to our input vocabulary (in this case, a space ' ').
3.  **Pad Shorter Sequences:** Any question shorter than `max_question_length` is padded with this special character (usually at the end) until it reaches the `max_question_length`.
    - "2+3=" (length 4) with `max_question_length=6` becomes "2+3=  " (length 6).
    - The call `question.ljust(max_question_len)` in `ModifiedAdditionDataset` handles this; `str.ljust` pads with spaces by default.

**Attention Mechanism and Padding:**
Padded positions can be hidden from a transformer's attention mechanism with an attention mask, so that padding does not influence the result. Hugging Face's `GPT2Model` accepts such a mask through the `attention_mask` argument of its forward call; it does not infer one from the token IDs. This notebook does not pass a mask, so the model still "sees" the padding tokens, and their embeddings are learned like those of any other token. Because GPT-2 attention is causal and the padding sits on the right, the real question tokens never attend to the padding. Only the padded positions themselves, including the final position that is used for classification, attend over both the question and the padding. The key is that the *structure* of the input is now uniform across the batch.

This padding ensures that all input tensors passed to the model have a consistent shape, which is essential for the underlying computations and batching.
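
The padding steps above can be sketched in plain Python. The helper names `pad_question` and `padding_mask` are hypothetical and do not appear in the notebook. The mask follows the common convention of 1 for real tokens and 0 for padding, which is the kind of tensor one could pass as `attention_mask` if masking were added.

```python
PAD = " "  # padding character, matching the space in the input vocabulary

def pad_question(question, max_len):
    """Right-pad a question with PAD up to max_len characters."""
    if len(question) > max_len:
        raise ValueError(f"question {question!r} is longer than {max_len}")
    return question.ljust(max_len, PAD)

def padding_mask(padded):
    """Return 1 for real characters and 0 for padding."""
    return [0 if ch == PAD else 1 for ch in padded]

max_question_length = 6  # "99+99=" for ndigit=2

for q in ["2+3=", "10+5=", "99+99="]:
    padded = pad_question(q, max_question_length)
    mask = padding_mask(padded)
    last_real = sum(mask) - 1  # index of the "=" token
    print(repr(padded), mask, last_real)
```

The `last_real` index shows where the "=" sign ends up for each question, which matters when deciding which position to read the answer from.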

### 3.3 Dataset Class

In [5]:
class ModifiedAdditionDataset(Dataset):
    """Dataset where X is complete question and Y is single answer."""
    
    def __init__(self, num_digits, num_samples, max_question_len):
        self.num_digits = num_digits
        self.num_samples = num_samples
        self.max_question_len = max_question_len
        
        self.questions = []
        self.answers = []
        
        for _ in range(num_samples):
            question, answer = generate_addition_data(num_digits)
            
            # Pad the question to a consistent length so that every question in a
            # batch has the same shape and the batch can be processed in parallel.
            padded_question = question.ljust(max_question_len)
            encoded_question = encode_input(padded_question)
            
            self.questions.append(torch.tensor(encoded_question, dtype=torch.long)) #convert the encoded question into a tensor
            self.answers.append(answer)  # Keep as integer for classification
    
    def __len__(self):
        return self.num_samples
    
    def __getitem__(self, idx):
        return self.questions[idx], self.answers[idx]

# Create datasets
train_dataset_size = 5000  # The size of the training dataset
val_dataset_size = 500    # The size of the validation dataset


train_dataset = ModifiedAdditionDataset(ndigit, train_dataset_size, max_question_length)
val_dataset = ModifiedAdditionDataset(ndigit, val_dataset_size, max_question_length)

# Test the dataset
sample_question, sample_answer = train_dataset[0]
print(f"Sample question tensor: {sample_question}")
print(f"Sample question decoded: '{decode_input(sample_question.tolist())}'")
print(f"Sample answer: {sample_answer}")
print(f"Question tensor shape: {sample_question.shape}")

Sample question tensor: tensor([ 8,  6, 10,  9,  4, 11])
Sample question decoded: '86+94='
Sample answer: 180
Question tensor shape: torch.Size([6])
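
`ModifiedAdditionDataset` draws every sample independently, and for `ndigit=2` there are only 100 × 100 = 10,000 distinct questions, so some validation questions may also appear in the training set. If a strictly held-out validation set is wanted, one option is to enumerate all pairs once and split them. The sketch below is an assumption about how that could look, not what the notebook does; `make_disjoint_splits` is a hypothetical helper.

```python
import random

def make_disjoint_splits(num_digits, val_fraction=0.1, seed=42):
    """Enumerate every (a, b) pair once and split into disjoint train/val lists."""
    limit = 10 ** num_digits
    pairs = [(a, b) for a in range(limit) for b in range(limit)]
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    rng.shuffle(pairs)
    n_val = int(len(pairs) * val_fraction)
    return pairs[n_val:], pairs[:n_val]

train_pairs, val_pairs = make_disjoint_splits(num_digits=2)
print(len(train_pairs), len(val_pairs))
print(set(train_pairs).isdisjoint(val_pairs))

# Turn one held-out pair into a question/answer example.
a, b = val_pairs[0]
print(f"{a}+{b}=", a + b)
```

With a split like this, validation accuracy measures generalization to unseen questions rather than partly measuring recall of training examples.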


### 3.4 Data Loaders

In [6]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, drop_last=True)

# Test the dataloader
questions_batch, answers_batch = next(iter(train_dataloader))
print(f"Questions batch shape: {questions_batch.shape}")
print(f"Answers batch shape: {answers_batch.shape}")
print(f"First question in batch: '{decode_input(questions_batch[0].tolist())}'")
print(f"First answer in batch: {answers_batch[0].item()}")

Questions batch shape: torch.Size([32, 6])
Answers batch shape: torch.Size([32])
First question in batch: '32+27='
First answer in batch: 59


## 4. Model Definition

We'll use GPT2Model (without the language modeling head) and add our own classification head.

In [7]:
class AdditionClassifier(nn.Module):
    """Transformer model for addition classification."""
    
    def __init__(self, input_vocab_size, num_classes, max_seq_len, n_embd, n_layer, n_head, dropout):
        super().__init__()
        
        # GPT2 configuration for training the model from scratch
        config = GPT2Config(
            vocab_size=input_vocab_size,
            n_positions=max_seq_len,
            n_embd=n_embd,
            n_layer=n_layer,
            n_head=n_head,
            resid_pdrop=dropout,
            embd_pdrop=dropout,
            attn_pdrop=dropout,
            bos_token_id=None,
            eos_token_id=None
        )
        
        # Use GPT2Model (without LM head)
        self.transformer = GPT2Model(config)
        
        # Classification head: a linear layer that maps the last hidden state to
        # the number of classes (i.e. the possible answers from 0 to max_answer)
        self.classifier = nn.Linear(n_embd, num_classes)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input_ids, labels=None):
        # Get transformer outputs
        transformer_outputs = self.transformer(input_ids)
        hidden_states = transformer_outputs.last_hidden_state  # [batch_size, seq_len, n_embd]
        
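        # Use the final position's representation for classification.
        # For a full-length question this is the "=" token; for shorter questions
        # it is a padding token, which can still attend to the whole question.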
        last_hidden = hidden_states[:, -1, :]  # [batch_size, n_embd]
        
        # Apply classification head
        logits = self.classifier(self.dropout(last_hidden))  # [batch_size, num_classes]
        
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)
        
        return {'loss': loss, 'logits': logits}

# Create model
model = AdditionClassifier(
    input_vocab_size=input_vocab_size,
    num_classes=num_answer_classes,
    max_seq_len=max_question_length,
    n_embd=n_embd,
    n_layer=n_layer,
    n_head=n_head,
    dropout=dropout
)

model.to(device)
print(f"{sum(p.numel() for p in model.parameters())/1e6:.2f} M parameters")

0.82 M parameters


#### Explanation: Custom Classification Head

The standard GPT-2 model from the `transformers` library, when used as `GPT2LMHeadModel`, is designed for **language modeling**. This means its primary goal is to predict the next token in a sequence, autoregressively. It has a "language modeling head" which is essentially a linear layer that maps the transformer's output hidden states to logits over the entire vocabulary (to predict the next word/character).

**Our Task is Different:**
In this tutorial, we are not performing traditional language modeling. Our task is **many-to-one classification**:
-   **Input (Many):** A sequence of characters representing an addition problem (e.g., "23+45=").
-   **Output (One):** A single class label representing the numerical answer (e.g., class 68).

**Why `GPT2Model` + Custom Head?**

1.  **Leveraging Transformer Power:** We still want to use the powerful sequence processing capabilities of the transformer architecture (self-attention, positional encodings, etc.) to understand the input question "23+45=". `GPT2Model` provides the core transformer blocks (embedding layer, multiple transformer layers) without the final language modeling layer.

2.  **Tailoring to Classification:**
    -   The output of `GPT2Model` is a sequence of hidden states, one for each input token. For our classification task, we are particularly interested in the information aggregated by the transformer over the entire sequence. A common strategy is to use the hidden state of the *last* token (or a special `[CLS]` token if one were used, or an aggregation like pooling). In our case, we use the hidden state at the final input position, which holds the '=' sign for a full-length question and a padding token when the question is shorter than `max_question_length`.
    -   This chosen hidden state (a vector of size `n_embd`) is then fed into our custom **classification head**.

3.  **The `self.classifier`:**
    Our classification head is a simple `nn.Linear` layer: `self.classifier = nn.Linear(n_embd, num_classes)`.
    -   It takes the `n_embd`-dimensional feature vector from the transformer.
    -   It projects this vector into a `num_classes`-dimensional space. Each dimension in this output corresponds to one of the possible numerical answers (from 0 to `max_answer`).
    -   The output of this linear layer are the **logits** for our classification task. Applying a softmax function to these logits gives the probabilities for each possible answer class.

In summary, we use `GPT2Model` as a powerful feature extractor for our input sequence and then add a simple linear layer (`self.classifier`) on top to perform the final classification into one of the `num_answer_classes`. This adapts the general-purpose transformer architecture to our specific arithmetic task.
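
One possible variation, shown here only as a sketch, is to read the hidden state at the "=" sign instead of at the final position, so that every question is classified from the same token regardless of padding. The example uses random numbers as a stand-in for `last_hidden_state` and assumes right padding with the space character (index 12 in the vocabulary above).

```python
import torch

pad_id = 12  # index of the space character in input_chars = '0123456789+= '

# Two encoded, right-padded questions: "2+3=  " and "99+99=".
input_ids = torch.tensor([
    [2, 10, 3, 11, 12, 12],
    [9, 9, 10, 9, 9, 11],
])

# Stand-in for transformer_outputs.last_hidden_state: [batch, seq_len, n_embd].
torch.manual_seed(0)
hidden_states = torch.randn(2, 6, 8)

# Index of the last non-padding token in each row (the "=" sign).
lengths = (input_ids != pad_id).sum(dim=1)
last_real_idx = lengths - 1

# Advanced indexing picks one position per row: result is [batch, n_embd].
batch_idx = torch.arange(input_ids.size(0))
equals_hidden = hidden_states[batch_idx, last_real_idx]

# What the notebook's forward() uses: always the final position.
final_hidden = hidden_states[:, -1, :]

print(last_real_idx.tolist())
print(equals_hidden.shape)
```

For full-length questions the two choices coincide. For shorter ones they differ, and which works better here would need to be checked empirically.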

## 5. Training

In [8]:
# Optimizer. AdamW is a variant of Adam with decoupled weight decay, widely used for training transformers.
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

@torch.no_grad()  # Disable gradient tracking during evaluation
def estimate_loss():
    out = {}
    model.eval()
    
    for split_name, dataloader in [('train', train_dataloader), ('val', val_dataloader)]:
        losses = []
        accuracies = []
        
        data_iter = iter(dataloader)
        for k in range(min(eval_iters, len(dataloader))):
            try:
                questions, answers = next(data_iter)
                questions, answers = questions.to(device), answers.to(device)
                
                outputs = model(questions, labels=answers)
                loss = outputs['loss']
                logits = outputs['logits']
                
                losses.append(loss.item())
                
                # Calculate accuracy
                predictions = torch.argmax(logits, dim=-1)
                accuracy = (predictions == answers).float().mean().item()
                accuracies.append(accuracy)
                
            except StopIteration:
                break
        
        if losses:
            out[split_name + '_loss'] = np.mean(losses)
            out[split_name + '_acc'] = np.mean(accuracies)
        else:
            out[split_name + '_loss'] = float('nan')
            out[split_name + '_acc'] = float('nan')
    
    model.train()
    return out

In [9]:
# Training loop
print(f"Training on {device}...")
train_iter = iter(train_dataloader)

for iter_num in range(max_iters):
    # Evaluate
    if iter_num % eval_interval == 0 or iter_num == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses.get('train_loss', float('nan')):.4f}, train acc {losses.get('train_acc', float('nan')):.4f}, val loss {losses.get('val_loss', float('nan')):.4f}, val acc {losses.get('val_acc', float('nan')):.4f}")

    # Get batch
    try:
        questions, answers = next(train_iter)
    except StopIteration:
        train_iter = iter(train_dataloader)
        questions, answers = next(train_iter)
    
    questions, answers = questions.to(device), answers.to(device)

    # Forward pass
    outputs = model(questions, labels=answers)
    loss = outputs['loss']

    # Backward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print("Training finished!")

Training on cpu...
step 0: train loss 5.4547, train acc 0.0047, val loss 5.4531, val acc 0.0063
step 250: train loss 3.8255, train acc 0.0591, val loss 3.8705, val acc 0.0479
step 500: train loss 3.4196, train acc 0.0878, val loss 3.4543, val acc 0.0708
step 750: train loss 3.1685, train acc 0.0884, val loss 3.2182, val acc 0.0667
step 1000: train loss 2.9683, train acc 0.1131, val loss 3.0286, val acc 0.0938
step 1250: train loss 2.8491, train acc 0.1194, val loss 2.9258, val acc 0.1125
step 1500: train loss 2.8503, train acc 0.1178, val loss 2.8858, val acc 0.0771
step 1750: train loss 2.7698, train acc 0.1244, val loss 2.8880, val acc 0.0917
step 1999: train loss 2.5739, train acc 0.1750, val loss 2.7087, val acc 0.1417
Training finished!


## 6. Testing and Evaluation

In [10]:
def test_model_addition(num_tests=20, num_digits_test=ndigit):
    print(f"--- Testing model on {num_tests} examples (up to {num_digits_test}-digit numbers) ---")
    model.eval()
    correct_predictions = 0

    for i in range(num_tests):
        # Generate test problem
        a = random.randint(0, 10**num_digits_test - 1)
        b = random.randint(0, 10**num_digits_test - 1)
        correct_answer = a + b
        question = f"{a}+{b}="
        
        # Pad and encode question
        padded_question = question.ljust(max_question_length)
        encoded_question = torch.tensor(encode_input(padded_question), dtype=torch.long, device=device).unsqueeze(0)
        
        # Get model prediction
        with torch.no_grad():
            outputs = model(encoded_question)
            logits = outputs['logits']
            predicted_answer = torch.argmax(logits, dim=-1).item()
        
        is_correct = (predicted_answer == correct_answer)
        if is_correct:
            correct_predictions += 1
            status = "CORRECT"
        else:
            status = "INCORRECT"
        
        print(f"Problem {i+1:2d}: {question}{correct_answer} -> Model predicted: {predicted_answer} -> {status}")
    
    accuracy = (correct_predictions / num_tests) * 100
    print(f"Accuracy: {accuracy:.2f}% ({correct_predictions}/{num_tests} correct)")
    
    model.train()

# Run test
test_model_addition(num_tests=20, num_digits_test=ndigit)

--- Testing model on 20 examples (up to 2-digit numbers) ---
Problem  1: 76+29=105 -> Model predicted: 103 -> INCORRECT
Problem  2: 44+38=82 -> Model predicted: 76 -> INCORRECT
Problem  3: 36+85=121 -> Model predicted: 120 -> INCORRECT
Problem  4: 91+73=164 -> Model predicted: 165 -> INCORRECT
Problem  5: 86+17=103 -> Model predicted: 103 -> CORRECT
Problem  6: 91+48=139 -> Model predicted: 135 -> INCORRECT
Problem  7: 66+94=160 -> Model predicted: 161 -> INCORRECT
Problem  8: 31+13=44 -> Model predicted: 42 -> INCORRECT
Problem  9: 41+66=107 -> Model predicted: 105 -> INCORRECT
Problem 10: 74+24=98 -> Model predicted: 97 -> INCORRECT
Problem 11: 99+46=145 -> Model predicted: 144 -> INCORRECT
Problem 12: 58+12=70 -> Model predicted: 67 -> INCORRECT
Problem 13: 18+91=109 -> Model predicted: 109 -> CORRECT
Problem 14: 54+52=106 -> Model predicted: 105 -> INCORRECT
Problem 15: 10+11=21 -> Model predicted: 26 -> INCORRECT
Problem 16: 7+54=61 -> Model predicted: 67 -> INCORRECT
Problem 17: 
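
Exact-match accuracy hides how close the wrong answers are. As a small illustration, the sketch below recomputes exact match alongside mean absolute error for the 16 complete lines of the output above. The pairs are copied from that output and the metric names are my own. Because classification treats 103 and 105 as unrelated classes, a distance-based metric like this is only a diagnostic, not something the training loss optimizes.

```python
# (true_sum, predicted) pairs copied from the 16 complete lines of the test output.
results = [
    (105, 103), (82, 76), (121, 120), (164, 165),
    (103, 103), (139, 135), (160, 161), (44, 42),
    (107, 105), (98, 97), (145, 144), (70, 67),
    (109, 109), (106, 105), (21, 26), (61, 67),
]

n = len(results)
exact = sum(t == p for t, p in results) / n
mae = sum(abs(t - p) for t, p in results) / n
within_2 = sum(abs(t - p) <= 2 for t, p in results) / n

print(f"Exact match: {exact:.3f}")
print(f"Mean absolute error: {mae:.2f}")
print(f"Within 2 of the true sum: {within_2:.3f}")
```

On these 16 examples only 2 predictions are exactly right, but 11 fall within 2 of the true sum, which suggests the model has picked up the approximate magnitude of the answer before learning exact addition.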

### Interactive Testing

In [11]:
def ask_adder(problem_input):
    """Ask the model to solve an addition problem."""
    model.eval()
    
    # Ensure input ends with '='
    if not problem_input.endswith('='):
        question = problem_input + '='
    else:
        question = problem_input
    
    # Pad and encode
    padded_question = question.ljust(max_question_length)
    encoded_question = torch.tensor(encode_input(padded_question), dtype=torch.long, device=device).unsqueeze(0)
    
    # Get prediction
    with torch.no_grad():
        outputs = model(encoded_question)
        logits = outputs['logits']
        predicted_answer = torch.argmax(logits, dim=-1).item()
        
        # Also get top-3 predictions with probabilities
        probs = F.softmax(logits, dim=-1)
        top_probs, top_indices = torch.topk(probs, k=3, dim=-1)
        
    model.train()
    
    print(f"Question: {question}")
    print(f"Predicted answer: {predicted_answer}")
    print("Top 3 predictions:")
    for i in range(3):
        ans = top_indices[0][i].item()
        prob = top_probs[0][i].item()
        print(f"  {ans}: {prob:.3f}")
    
    return predicted_answer

# Test examples
print("=== Interactive Testing ===")
ask_adder('2+3')
print()
ask_adder('5+4')
print()
ask_adder('9+9')

=== Interactive Testing ===
Question: 2+3=
Predicted answer: 6
Top 3 predictions:
  6: 0.092
  14: 0.091
  18: 0.086

Question: 5+4=
Predicted answer: 9
Top 3 predictions:
  9: 0.130
  14: 0.103
  6: 0.078

Question: 9+9=
Predicted answer: 14
Top 3 predictions:
  14: 0.110
  9: 0.096
  11: 0.085


14