# Chapter 7: Fine-Tuning to Follow Instructions

## 7.2 Preparing a Dataset for Instruction Fine-Tuning

We need to download and format the instruction dataset for instruction fine-tuning a pretrained LLM.

In [1]:
import json
import os
import requests

def download_and_load_file(filepath, url):
    if not os.path.exists(filepath):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        text_data = response.text

        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(text_data)

    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)

    return data

In [2]:
filepath = 'instruction-data.json'
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"

data = download_and_load_file(filepath, url)
print("Number of entries:", len(data))

Number of entries: 1100


In [3]:
print("Example entry:\n", data[0])

Example entry:
 {'instruction': 'Evaluate the following phrase by transforming it into the spelling given.', 'input': 'freind --> friend', 'output': 'The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".'}


In [4]:
print("Another example entry:\n", data[999])

Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


Instruction fine-tuning involves training a model on a dataset where the input-output pairs are explicitly provided.

Different example formats, referred to as *prompt styles*, are used in the training of LLMs. Two common ones are Alpaca and Phi-3:
- The *Alpaca* style uses a structured format with defined sections for instruction, input, and response:
    ```
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    Identify the correct spelling of the following word.

    ### Input:
    Ocassion

    ### Response:
    The correct spelling is 'Occasion'.
    ```
- The *Phi-3* style uses a conversational format with designated `<|user|>` and `<|assistant|>` tokens:
    ```
    <|user|>
    Identify the correct spelling of the following word: 'Ocassion'

    <|assistant|>
    The correct spelling is 'Occasion'.
    ```
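
For comparison with the Alpaca format used in the rest of the chapter, a Phi-3-style formatter could be sketched as follows. The helper names `format_input_phi3` and `format_response_phi3` are hypothetical and not part of the chapter's code, and placing the optional input on its own line below the instruction is an assumption made for illustration.

```python
def format_input_phi3(entry):
    # Hypothetical helper (not part of the chapter's code): builds the
    # user turn of a Phi-3-style prompt from one instruction entry.
    user_text = entry["instruction"]
    if entry["input"]:
        # Assumption: the optional input goes on its own line
        user_text += f"\n{entry['input']}"
    return f"<|user|>\n{user_text}"


def format_response_phi3(entry):
    # Hypothetical helper: builds the assistant turn
    return f"\n\n<|assistant|>\n{entry['output']}"


example = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion'.",
}

phi3_prompt = format_input_phi3(example) + format_response_phi3(example)
print(phi3_prompt)
```

Compared with the Alpaca style, this format needs no preamble sentence and no section headers, so each example uses fewer tokens.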

In this chapter, we will use the Alpaca-style format for instruction fine-tuning.

In [5]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry['input'] else ""

    return instruction_text + input_text

In [6]:
model_input = format_input(data[0])
desired_output = f"\n\n### Response:\n{data[0]['output']}"

print(model_input + desired_output)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Evaluate the following phrase by transforming it into the spelling given.

### Input:
freind --> friend

### Response:
The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".


In [7]:
# Entry without an input field
model_input = format_input(data[999])
desired_output = f"\n\n### Response:\n{data[999]['output']}"

print(model_input + desired_output)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


In [8]:
# Split the dataset
train_portion = int(len(data) * 0.85) # 85% for training
test_portion = int(len(data) * 0.10)  # 10% for testing
val_portion = len(data) - train_portion - test_portion  # 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

print("Training set size:", len(train_data))
print("Testing set size:", len(test_data))
print("Validation set size:", len(val_data))

Training set size: 935
Testing set size: 110
Validation set size: 55
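
The split arithmetic can be wrapped in a small helper to check that the three parts are contiguous and cover the whole dataset. This is a minimal sketch: `split_sequential` is a hypothetical name, and `toy_data` is a stand-in list rather than the real instruction entries.

```python
def split_sequential(data, train_frac=0.85, test_frac=0.10):
    # Hypothetical helper mirroring the split above: the validation set
    # receives whatever remains after the training and test portions.
    train_end = int(len(data) * train_frac)
    test_end = train_end + int(len(data) * test_frac)
    return data[:train_end], data[train_end:test_end], data[test_end:]


toy_data = list(range(1100))  # stand-in for the 1,100 instruction entries
train_part, test_part, val_part = split_sequential(toy_data)

print(len(train_part), len(test_part), len(val_part))

# The parts do not overlap and together reproduce the original order
assert train_part + test_part + val_part == toy_data
```

Because the slices are taken in order without shuffling, the result is deterministic, but any ordering present in the source file carries over into the splits.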


## 7.3 Organizing Data into Training Batches

The next step is to build the training batches effectively. In previous chapters, the training batches were created automatically with the `DataLoader` class from PyTorch, which employs a default *collate function* to combine lists of samples into batches.

A collate function is responsible for taking a list of data samples and merging them into a single batch that can be processed by the model during training.
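
To see why the default behavior is not enough for variable-length sequences, the following sketch calls PyTorch's `default_collate` directly on toy tensors. It assumes a PyTorch version that exposes `default_collate` from `torch.utils.data` (1.11 or later); the tensors are made-up stand-ins for tokenized samples.

```python
import torch
from torch.utils.data import default_collate  # assumes PyTorch >= 1.11

# Samples of equal length can be stacked into one batch tensor
same_length = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])]
stacked = default_collate(same_length)
print("Stacked shape:", stacked.shape)

# Samples of different lengths cannot be stacked without padding
different_length = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
try:
    default_collate(different_length)
    collate_failed = False
except RuntimeError as err:
    collate_failed = True
    print("Default collation failed:", type(err).__name__)
```

A custom collate function resolves this by padding the shorter samples before stacking.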

For our instruction fine-tuning task, we need to implement a custom collate function that can handle the specific structure of our instruction dataset.

First, we need to define a `Dataset` class that can load and preprocess our instruction data.

In [9]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize text
        self.encoded_texts = []

        for entry in self.data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text

            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]
    
    def __len__(self):
        return len(self.data)

In [10]:
# As before, we use the `gpt2` tokenizer and allow the `<|endoftext|>` special token
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]


Then we will adopt a more sophisticated approach by developing a custom collate function that we can pass to the dataloader.

This custom collate function pads the training examples in each batch to the same length while allowing different batches to have different lengths. This approach minimizes unnecessary padding by only extending sequences to match the longest sequence in each batch, not the longest sequence in the entire dataset.
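
The saving from per-batch padding can be estimated with a few lines of plain Python. This is an illustrative sketch: `count_padding` is a hypothetical helper and the sequence lengths are made up.

```python
def count_padding(lengths, batch_size, pad_to_global_max):
    # Hypothetical helper: counts the pad tokens needed when a list of
    # sequence lengths is split into consecutive batches.
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch_lengths = lengths[start:start + batch_size]
        target = global_max if pad_to_global_max else max(batch_lengths)
        total += sum(target - n for n in batch_lengths)
    return total


lengths = [5, 2, 3, 40, 38, 41]  # made-up sequence lengths

global_padding = count_padding(lengths, batch_size=3, pad_to_global_max=True)
batch_padding = count_padding(lengths, batch_size=3, pad_to_global_max=False)

print("Pad tokens with dataset-wide maximum:", global_padding)  # 117
print("Pad tokens with per-batch maximum:", batch_padding)      # 9
```

In this toy case, the short sequences in the first batch are padded only to length 5 instead of 41, which accounts for most of the difference.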

In [11]:
def custom_collate_draft_1(
        batch,
        pad_token_id=50256,
        device='cpu'
):
    # Find the longest sequence in the batch
    # and increase by 1 for padding
    batch_max_length = max(len(item) + 1 for item in batch)

    # Pad and prepare inputs
    inputs_list = []

    for item in batch:
        new_item = item.copy()
        # Pad sequences to `batch_max_length`
        padded = (
            new_item + [pad_token_id] * (batch_max_length - len(new_item))
        )

        # Via `padded[:-1]`, we remove the one extra padding token
        # that was added via the +1 setting in `batch_max_length`
        # (the extra padding token will be relevant in later steps)
        inputs = torch.tensor(padded[:-1])
        inputs_list.append(inputs)

    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_list).to(device)

    return inputs_tensor

In [12]:
# Test
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = [inputs_1, inputs_2, inputs_3]

print("Batch before collation:\n", batch)
collated_batch = custom_collate_draft_1(batch)
print("Collated batch:\n", collated_batch)
print("Collated batch shape:", collated_batch.shape)


Batch before collation:
 [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]]
Collated batch:
 tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
Collated batch shape: torch.Size([3, 5])


The `custom_collate_draft_1` function creates batches from lists of inputs. We also need to create batches with the target token IDs corresponding to the batch of input IDs. As in the process we used to pretrain an LLM, the target token IDs match the input token IDs but are shifted by one position, so that the target at each position is the token that follows the corresponding input token. This allows the LLM to learn how to predict the next token in a sequence.
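
The input/target relationship can be illustrated on a single toy sequence using plain list slicing. The token IDs here are made up; only `50256`, the `<|endoftext|>` ID, comes from the text.

```python
token_ids = [7, 8, 9]        # made-up token IDs for one training example
end_of_text_id = 50256       # `<|endoftext|>` in the GPT-2 vocabulary

sequence = token_ids + [end_of_text_id]
inputs = sequence[:-1]       # everything except the last token
targets = sequence[1:]       # everything except the first token

# Each target is the token that follows the input at the same position
for input_id, target_id in zip(inputs, targets):
    print(f"input {input_id:>5} -> target {target_id:>5}")
```

Appending one extra token before slicing is what keeps the inputs and targets the same length as the original sequence.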

In [13]:
def custom_collate_draft_2(
        batch,
        pad_token_id=50256,
        device='cpu'
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item) + 1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_list, targets_list = [], []

    for item in batch:
        new_item = item.copy()

        # Add an `<|endoftext|>` token at the end of each sequence
        new_item += [pad_token_id]
        # Pad sequences to `batch_max_length`
        padded = (
            new_item + [pad_token_id] * (batch_max_length - len(new_item))
        )
        # Truncate the last token for inputs
        inputs = torch.tensor(padded[:-1])
        # Shift one position to the right for targets
        targets = torch.tensor(padded[1:])

        inputs_list.append(inputs)
        targets_list.append(targets)

    # Convert list of inputs/targets to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_list).to(device)
    targets_tensor = torch.stack(targets_list).to(device)

    return inputs_tensor, targets_tensor

In [14]:
# Test the same batch with the new collate function
collated_inputs, collated_targets = custom_collate_draft_2(batch)
print("Collated inputs:\n", collated_inputs)
print("Collated inputs shape:", collated_inputs.shape)
print("Collated targets:\n", collated_targets)
print("Collated targets shape:", collated_targets.shape)

Collated inputs:
 tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
Collated inputs shape: torch.Size([3, 5])
Collated targets:
 tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])
Collated targets shape: torch.Size([3, 5])


Next, we will assign a `-100` placeholder value to all padding tokens. This special value allows us to exclude these padding tokens from contributing to the training loss calculation, ensuring that only meaningful data influences model learning.

We only retain one `<|endoftext|>` token, ID `50256`, in the target list. Retaining it allows the LLM to learn when to generate an end-of-text token in response to instructions, which we use as an indicator that the generated response is complete.

In the following `custom_collate_fn` function, we replace all but the first token with ID `50256` in each target list with `-100`.

In addition, we introduce an `allowed_max_length` parameter to optionally limit the length of the samples. This will be useful if we plan to work with our own datasets that exceed the 1024-token context size supported by the GPT-2 model we are using for instruction fine-tuning.

In [15]:
def custom_collate_fn(
        batch,
        pad_token_id=50256,
        ignore_index=-100,
        allowed_max_length=None,
        device='cpu'
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item) + 1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_list, targets_list = [], []

    for item in batch:
        new_item = item.copy()

        # Add an `<|endoftext|>` token at the end of each sequence
        new_item += [pad_token_id]
        # Pad sequences to `batch_max_length`
        padded = (
            new_item + [pad_token_id] * (batch_max_length - len(new_item))
        )

        # Truncate the last token for inputs
        inputs = torch.tensor(padded[:-1])
        # Shift one position to the right for targets
        targets = torch.tensor(padded[1:])

        # NEW: Replace all but the first padding token in targets with `ignore_index`
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # NEW: Optionally truncate sequences to `allowed_max_length`
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_list.append(inputs)
        targets_list.append(targets)

    # Convert list of inputs/targets to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_list).to(device)
    targets_tensor = torch.stack(targets_list).to(device)

    return inputs_tensor, targets_tensor

In [16]:
# Test the same batch with the new collate function
collated_inputs, collated_targets = custom_collate_fn(batch)
print("Collated inputs:\n", collated_inputs)
print("Collated inputs shape:", collated_inputs.shape)
print("Collated targets:\n", collated_targets)
print("Collated targets shape:", collated_targets.shape)

Collated inputs:
 tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
Collated inputs shape: torch.Size([3, 5])
Collated targets:
 tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])
Collated targets shape: torch.Size([3, 5])
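
The same masking result can also be obtained for a whole batch at once. The following is a sketch of an alternative formulation, not the chapter's method: `mask_extra_padding` is a hypothetical helper that uses a running count of padding tokens per row instead of a per-example loop.

```python
import torch


def mask_extra_padding(targets, pad_token_id=50256, ignore_index=-100):
    # Hypothetical alternative to the per-example loop: operates on a
    # (batch_size, seq_len) tensor of target IDs in one step.
    is_pad = targets == pad_token_id
    # Running count of padding tokens seen so far in each row
    pad_count = is_pad.cumsum(dim=-1)
    masked = targets.clone()
    # Keep the first padding token per row, mask every later one
    masked[is_pad & (pad_count > 1)] = ignore_index
    return masked


targets = torch.tensor([
    [1, 2, 3, 4, 50256],
    [6, 50256, 50256, 50256, 50256],
    [8, 9, 50256, 50256, 50256],
])

masked_targets = mask_extra_padding(targets)
print(masked_targets)
```

Like the loop-based version, this sketch treats the first occurrence of ID `50256` in a row as the end-of-text token to keep, so it assumes that ID does not appear earlier inside the text itself.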


Consider the following example where each output logit corresponds to a potential token from the model's vocabulary. We can calculate the cross entropy loss during training when the model predicts a sequence of tokens:

In [17]:
logits_1 = torch.tensor(
    [[-1., 1.],   # 1st training example
     [-0.5, 1.5]] # 2nd training example
)
targets_1 = torch.tensor([0, 1])

loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print("Loss 1:", loss_1.item())

Loss 1: 1.1269280910491943


If we add a third token ID, it changes the loss because the average is now taken over three predictions:

In [18]:
logits_2 = torch.tensor(
    [[-1., 1.],
     [-0.5, 1.5],
     [-0.5, 1.5]] # New 3rd training example
)
targets_2 = torch.tensor([0, 1, 1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print("Loss 2:", loss_2.item())

Loss 2: 0.7935947775840759


If we set the additional token ID to `-100`, it will be ignored in the loss calculation:

In [19]:
# Same logits as `logits_2`
logits_3 = torch.tensor(
    [[-1., 1.],
     [-0.5, 1.5],
     [-0.5, 1.5]]
)

targets_3 = torch.tensor([0, 1, -100])

loss_3 = torch.nn.functional.cross_entropy(logits_3, targets_3)
print("Loss 3:", loss_3.item())

Loss 3: 1.1269280910491943


The resulting loss only considers the valid token predictions, effectively ignoring the padding token.

By default, PyTorch's `cross_entropy` function uses `ignore_index=-100`, which means that any target token with the value `-100` is excluded from the loss computation. By setting targets to `-100`, we can ignore the additional end-of-text tokens in the batches that we used to pad the training examples to equal lengths.

However, we do not want to ignore the first instance of the end-of-text token because it can help signal to the LLM when the response is complete.
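
The three loss values above can be reproduced by hand to confirm that an ignored target is simply left out of the average. This is a minimal pure-Python sketch; `token_loss` and `mean_loss` are hypothetical helpers, not PyTorch functions.

```python
import math


def token_loss(logits, target):
    # Cross entropy for one prediction: -log(softmax(logits)[target])
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target]


def mean_loss(all_logits, targets, ignore_index=-100):
    # Average only over targets that are not `ignore_index`
    losses = [
        token_loss(logits, target)
        for logits, target in zip(all_logits, targets)
        if target != ignore_index
    ]
    return sum(losses) / len(losses)


logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_two = mean_loss(logits[:2], [0, 1])
loss_three = mean_loss(logits, [0, 1, 1])
loss_ignored = mean_loss(logits, [0, 1, -100])

print("Two examples:", loss_two)
print("Three examples:", loss_three)
print("Third target ignored:", loss_ignored)
```

The ignored target affects neither the sum nor the count in the denominator, which is why the third result equals the first.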

## 7.4 Creating Dataloaders for an Instruction Dataset