# 2.6 Using sliding windows for data sampling

In the previous section, we covered the tokenization step and the conversion of string tokens into integer token IDs. Before we can create the embeddings for the LLM, we need to generate the input-target pairs required to train it.

What do these input-target pairs look like? As we learned in Chapter 1, LLMs are pre-trained by predicting the next word in a text, as shown in Figure 2.12.

**Figure 2.12 Given a text sample, an input block is extracted as an input subsample of the LLM. The task of the LLM during training is to predict the next word after the input block. During training, we mask out all words after the target word. Note that the text has been tokenized before being processed by the LLM. For ease of illustration, the tokenization step is omitted in this figure.**

![fig2.12](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-2-12.jpg?raw=true)

In this section, we implement a data loader that uses a sliding window approach to fetch the input-target pairs depicted in Figure 2.12 from the training dataset.

First, we will tokenize the entire short story "The Verdict" using the BPE tokenizer introduced in the previous section.

```python
import re
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
```

Executing this code prints `5145`, the total number of tokens in the training set after applying the BPE tokenizer.

Next, we remove the first 50 tokens from the dataset, which gives a slightly more interesting text passage in the steps that follow:

```python
enc_sample = enc_text[50:]
```

A simple and intuitive way to create input-target pairs for the next-word prediction task is to define two variables, `x` and `y`. `x` stores the input token sequence, and `y` stores the target token sequence, which is the input sequence shifted by one position. Together they form an input-target pair.

```python
context_size = 4  # Number of tokens included in each input
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")
```

Running the above code will print the following output:

```
x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]
```

Having created the targets by shifting the inputs by one position, we can now build the next-word prediction tasks depicted in Figure 2.12 as follows:

```python
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)
```

The above code will print the following:

```
[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257
```

The content on the left side of the arrow (---->) refers to the input received by the LLM, and the token ID on the right side of the arrow represents the target token ID that the LLM should predict.

For illustration, we repeat the previous code, but this time we convert the token IDs back into text:

```python
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
```

The following output shows the text format of the input and output:

```
 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a
```

We have now created the input-target pairs that we will use to train the LLM in later chapters.

Before we convert tokens into embedding vectors, there is one last task to complete. As mentioned at the beginning of this section, we also need to implement an efficient data loader that iterates over the input dataset and returns the input-target pairs as PyTorch tensors, which can be thought of as multidimensional arrays.

Specifically, we want to return two tensors: an input tensor containing the text that the LLM sees, and a target tensor containing the targets that the LLM should predict, as shown in Figure 2.13.

**Figure 2.13 To implement an efficient data loader, we store all inputs into a tensor called x, where each row represents an input context. At the same time, we create another tensor called y to store the corresponding predicted targets (i.e., the next word), which are obtained by shifting the input content one position to the right.**

![fig2.13](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-2-13.jpg?raw=true)

For ease of understanding, Figure 2.13 shows the tokens in string format, but the code implementation operates directly on token IDs. This is because the `encode` method of the BPE tokenizer performs both tokenization and conversion into token IDs in a single step.

To implement an efficient data loader, we will use PyTorch's built-in Dataset and DataLoader classes. For additional information and guidance on installing PyTorch, see Section A.1.3, "Installing PyTorch" in Appendix A.

The code of the dataset class is shown in Listing 2.5:

### Listing 2.5 A dataset for batched inputs and targets

```python
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt)

        # Slide a window of max_length tokens over the text in steps of stride
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        # Total number of input-target pairs (rows) in the dataset
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Return a single input-target pair
        return self.input_ids[idx], self.target_ids[idx]
```

The `GPTDatasetV1` class in Listing 2.5 is a subclass of PyTorch's `Dataset` class. It defines how individual samples are drawn from the dataset: each sample consists of `max_length` token IDs stored in the `input_chunk` tensor, while the `target_chunk` tensor holds the corresponding targets. Read on to see what the returned data looks like when the dataset is combined with PyTorch's `DataLoader`, which should make this process clearer.

If you are not familiar with the structure of PyTorch's Dataset class, as shown in Listing 2.5, I recommend that you read Section A.6, "Building Efficient Data Loaders" in Appendix A. There, you will find a detailed explanation of the basic structure and usage of PyTorch's Dataset and DataLoader classes.
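To look at the chunking logic of `GPTDatasetV1` in isolation, the following minimal sketch applies the same `range(0, len(token_ids) - max_length, stride)` loop to a short list of made-up token IDs, using plain Python lists instead of tensors. The helper name `sliding_window_pairs` and the toy IDs are assumptions made for illustration; they are not part of the listing above.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Build (input, target) chunks with the same loop as GPTDatasetV1.

    Hypothetical helper for illustration; it uses plain lists, not tensors.
    """
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        # The target is the input window shifted by one token.
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]  # made-up token IDs
pairs = sliding_window_pairs(toy_ids, max_length=4, stride=2)
for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
```

With nine tokens, a window of 4, and a stride of 2, the loop starts windows at indices 0, 2, and 4, so three pairs are produced. In every pair, the target repeats the input from its second token onward and adds one new token at the end.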

The following code will use `GPTDatasetV1` to batch load input data through PyTorch's `DataLoader`:

### Listing 2.6 A data loader to generate batches with input-target pairs

```python
def create_dataloader_v1(txt, batch_size=4,
        max_length=256, stride=128, shuffle=True, drop_last=True):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    # Create the dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    # drop_last=True drops the final batch if it is smaller than batch_size
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    return dataloader
```

To see how the `GPTDatasetV1` class in Listing 2.5 and the `create_dataloader_v1` function in Listing 2.6 work together, let's test the data loader with a batch size of 1 for an LLM with a context size of 4:

```python
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)  # Convert the data loader into a Python iterator
first_batch = next(data_iter)  # Fetch the next entry with next()
print(first_batch)
```

Executing the above code will print the following:

```
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
```

The `first_batch` variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs. Since `max_length` is 4, both tensors contain only 4 token IDs. It should be noted that the input size of 4 here is relatively small and is only used for demonstration. When training language models in practice, the input size is usually at least 256.

To illustrate the meaning of `stride=1`, let's get another batch of data from this dataset:

```python
second_batch = next(data_iter)
print(second_batch)
```

The second batch has the following contents:

```
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
```

By comparing the first and second batches of data, we can see that the token IDs of the second batch are shifted one position to the right compared to the first batch (for example, the second ID of the first batch input is 367, which is exactly the first ID of the second batch input). The `stride` parameter determines the number of positions that the input is shifted between batches, which simulates the concept of a sliding window, as shown in Figure 2.14.

**Figure 2.14 In the process of creating multiple batches from the input dataset, we slide an input window over the text. If the stride is set to 1, then when generating the next batch, we move the input window to the right by 1 position. If the stride is set to the size of the input window, then overlap between batches can be avoided.**

![fig2.14](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-2-14.jpg?raw=true)
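The effect of the stride described in Figure 2.14 can be checked with a few lines of plain Python. The sketch below is a hedged illustration, not code from the book: it assumes the same `range(0, num_tokens - max_length, stride)` loop used in `GPTDatasetV1`, and the helper name `window_starts` and the toy sizes are chosen only for this example.

```python
def window_starts(num_tokens, max_length, stride):
    """Start indices of the input windows, mirroring the dataset's range() loop."""
    return list(range(0, num_tokens - max_length, stride))


num_tokens, max_length = 13, 4  # toy sizes chosen for illustration
overlap_by_stride = {}
for stride in (1, 2, 4):
    starts = window_starts(num_tokens, max_length, stride)
    # Consecutive input windows share max_length - stride tokens (never fewer than 0).
    overlap_by_stride[stride] = max(0, max_length - stride)
    print(f"stride={stride}: starts={starts}, "
          f"shared tokens={overlap_by_stride[stride]}")
```

A stride of 1 yields the most windows, each sharing three of its four tokens with the next one. A stride equal to `max_length` yields the fewest windows and no shared input tokens.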

**Exercise 2.2 Loading data using data loaders with different strides and context sizes**

To further understand how the data loader works, try running it with different settings, such as `max_length=2` and `stride=2`, or `max_length=8` and `stride=2`.

So far, we have sampled batches of size 1 from the data loader, which is useful for illustration. If you have experience with deep learning, you may know that small batch sizes require less memory during training but lead to noisier model updates. As in regular deep learning, the batch size is a hyperparameter that involves a trade-off and calls for experimentation when training LLMs.

Before we dive into the last two sections on how to create embedding vectors from token IDs, let’s take a quick look at how to use data loaders for sampling with batch sizes greater than 1.

```python
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

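Because `shuffle` defaults to `True` in `create_dataloader_v1`, the rows are drawn in random order, so the exact values vary from run to run. One run produced the following output: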

```
Inputs:
 tensor([[  306,    11,   475,   465],
        [  338,   523, 14348,     0],
        [  423,   284,  1394, 26148],
        [  465,  1182,   284,   804],
        [  355,   339,  3830,   612],
        [22665,  4252, 10899,    13],
        [   13,   383, 18098,   373],
        [  683,  1207,  8344,   803]])

Targets:
 tensor([[   11,   475,   465,  2951],
        [  523, 14348,     0,   632],
        [  284,  1394, 26148,   526],
        [ 1182,   284,   804,   510],
        [  339,  3830,   612,   290],
        [ 4252, 10899,    13,   198],
        [  383, 18098,   373,  3940],
        [ 1207,  8344,   803,   306]])
```

Note that we increased the stride to 4, which matches the window size. This makes full use of the dataset (no tokens are skipped) while avoiding overlap between the inputs, since too much overlap may increase the risk of overfitting.
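As a rough sanity check on these settings, the arithmetic below counts how many samples and batches the data loader yields. It is a sketch under stated assumptions: it takes the 5145 tokens reported earlier for "The Verdict", reuses the `range()` loop from `GPTDatasetV1`, and assumes `drop_last=True` as in the default of `create_dataloader_v1`. The variable names are illustrative.

```python
num_tokens = 5145  # token count of "The Verdict" reported earlier
max_length = 4
stride = 4
batch_size = 8

# The dataset creates one sample per start index in this range.
num_samples = len(range(0, num_tokens - max_length, stride))

# With drop_last=True, an incomplete final batch is discarded.
num_batches = num_samples // batch_size
dropped_samples = num_samples - num_batches * batch_size

print(num_samples, num_batches, dropped_samples)
```

Under these assumptions the loop yields 1286 samples, which form 160 full batches of 8. The remaining 6 samples are dropped in each pass over the data. Since shuffling is enabled by default, which samples end up dropped can differ from epoch to epoch.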

In the final two sections of this chapter, we will implement embedding layers, which convert token IDs into continuous vector representations that serve as the input to the LLM.