# 2.8 Encoding word positions

In the previous section, we converted token IDs into continuous vector representations, also known as token encodings, to serve as input to the LLM. However, one shortcoming of LLMs is that their self-attention mechanism (which will be introduced in detail in Chapter 3) has no built-in notion of the position or order of tokens in a sequence.

With the embedding layer introduced earlier, the same token ID is always mapped to the same vector representation, regardless of where that token ID appears in the input sequence, as shown in Figure 2.17.

**Figure 2.17 The embedding layer converts a token ID into the same vector representation regardless of its position in the input sequence. For example, token ID 5 will produce the same encoding vector whether it is in the first or third position of the token ID input vector.**

![](../img/fig-2-17.jpg)
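
The position independence illustrated in Figure 2.17 can be checked with a small sketch. The sizes below (a vocabulary of 6 and 3 dimensions) and the name `demo_embedding` are assumptions chosen for illustration only; they are not the values used elsewhere in this chapter.

```python
import torch

torch.manual_seed(123)  # fixed seed so the randomly initialized weights are reproducible

# Tiny illustrative sizes (assumptions for this sketch, not the chapter's values)
demo_embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)

# Token ID 5 appears in the first and the third position
token_ids = torch.tensor([5, 1, 5, 2])
vectors = demo_embedding(token_ids)

# The embedding layer is a plain lookup table, so both rows are identical
same_vector = torch.equal(vectors[0], vectors[2])
print(same_vector)  # True
```

Because the lookup depends only on the token ID, nothing in `vectors` reveals where in the sequence each token occurred.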

In principle, deterministic, position-independent encodings are beneficial for reproducibility. However, since the self-attention mechanism of an LLM is itself position-agnostic, it is helpful to inject additional position information into the LLM.

To achieve this, there are two broad categories of position encodings: relative position encodings and absolute position encodings.

Absolute position encodings are associated with a specific position in the sequence. For each position in the input sequence, a unique position encoding is added to the token encoding to indicate its exact position. For example, the first token will have a specific position encoding, the second token will have a different position encoding, and so on, as shown in Figure 2.18.

**Figure 2.18 Position encodings are added to the token encodings to create the input to the LLM. The position vectors have the same dimension as the original token encodings. For simplicity, the token encodings are shown with the value 1.**

![](../img/fig-2-18.jpg)

Rather than focusing on the absolute position of tokens, relative position encodings focus on the relative position or distance between tokens. This means that the model learns relationships in terms of "how far apart" tokens are rather than "at which exact position" they sit. The benefit is that the model can generalize better to sequences of varying lengths, even lengths it has not seen during training.

Both types of position encodings are designed to enhance the LLM's ability to understand token order and relationships, ensuring more accurate and context-aware predictions. The choice between them often depends on the specific application and the nature of the data being processed.

OpenAI's GPT models use absolute position encodings that are optimized during training rather than being fixed or predefined like the position encodings in the original Transformer model. This optimization process is part of the model training itself, and we will implement it later in this book. For now, let's create the initial position encodings needed to build the LLM inputs for the upcoming chapters.
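
To make the distinction concrete, the following minimal sketch contrasts absolute position indices with a matrix of pairwise relative offsets. It only shows the raw position information each scheme starts from, not how any particular model turns that information into encodings; the variable names are assumptions made for this illustration.

```python
import torch

seq_len = 4  # illustrative sequence length

# Absolute positions: one index per slot in the sequence
absolute_positions = torch.arange(seq_len)

# Relative positions: signed distance (j - i) between every pair of positions
relative_positions = absolute_positions[None, :] - absolute_positions[:, None]

print(absolute_positions)
# tensor([0, 1, 2, 3])
print(relative_positions)
# tensor([[ 0,  1,  2,  3],
#         [-1,  0,  1,  2],
#         [-2, -1,  0,  1],
#         [-3, -2, -1,  0]])
```

The same offsets (for example, -1 for "the token directly before") recur no matter how long the sequence is, which is the intuition behind the generalization argument above.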
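
As a rough illustration of the difference between fixed and learned position encodings, the sketch below builds fixed sine/cosine encodings in the style of the original Transformer and compares them with a trainable embedding layer. The helper name `sinusoidal_encoding` and the small sizes are assumptions for this example; the sketch also assumes an even `dim`.

```python
import math
import torch

def sinusoidal_encoding(context_length, dim):
    """Fixed sine/cosine position encodings (assumes dim is even)."""
    positions = torch.arange(context_length, dtype=torch.float32).unsqueeze(1)
    # One frequency, 1 / 10000^(2i / dim), per pair of dimensions
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    encoding = torch.zeros(context_length, dim)
    encoding[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return encoding

fixed = sinusoidal_encoding(context_length=4, dim=8)
learned = torch.nn.Embedding(4, 8)

print(fixed.requires_grad)           # False: computed once, never updated
print(learned.weight.requires_grad)  # True: updated by the optimizer during training
```

The learned variant starts from random values and only becomes meaningful through training, which is why it is treated as part of the model's parameters.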

Earlier in this chapter, we focused on very small encoding sizes for illustration purposes. Now, let's consider more realistic and useful encoding sizes and encode the input tokens into 256-dimensional vector representations. This is smaller than the size used by the original GPT-3 model (in GPT-3, the encoding size is 12,288 dimensions), but still reasonable for experiments. In addition, we assume that the token IDs are created by the BPE tokenizer we implemented earlier, which has a vocabulary size of 50,257:

> import torch

> output_dim = 256

> vocab_size = 50257

> token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Using the token_embedding_layer above, if we sample data from the dataloader, each token in each batch is embedded into a 256-dimensional vector. If the batch size is 8 and each sample has four tokens, the result is an 8 x 4 x 256 tensor.

First, let's instantiate the dataloader from Section 2.6, "Data Sampling Using Sliding Windows":

> max_length = 4

> dataloader = create_dataloader_v1(
> raw_text, batch_size=8, max_length=max_length, stride=max_length)

> data_iter = iter(dataloader)

> inputs, targets = next(data_iter)

> print("Token IDs:\n", inputs)

> print("\nInputs shape:\n", inputs.shape)

The previous code prints the following output:
>Token IDs:

>tensor([[ 40, 367, 2885, 1464],\
[ 1807, 3619, 402, 271],\
[10899, 2138, 257, 7026],\
[15632, 438, 2016, 257],\
[ 922, 5891, 1576, 438],\
[ 568, 340, 373, 645],\
[ 1049, 5975, 284, 502],\
[ 284, 3285, 326, 11]])

>Inputs shape:
>torch.Size([8, 4])

As we can see, the token ID tensor has shape 8 x 4, which means that the data batch consists of 8 text samples with 4 tokens each.

Now let's embed these token IDs into 256-dimensional vectors using token_embedding_layer:
> token_embeddings = token_embedding_layer(inputs)
> 
> print(token_embeddings.shape)

The previous code prints the following output:

>torch.Size([8, 4, 256])

From the 8 x 4 x 256 tensor output, we can see that each token ID is now embedded as a 256-dimensional vector.

For the absolute position embedding approach of the GPT model, we just need to create another embedding layer with the same embedding dimension as token_embedding_layer:

>context_length = max_length

>pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

>pos_embeddings = pos_embedding_layer(torch.arange(context_length))

>print(pos_embeddings.shape)

As shown in the previous code example, the input to pos_embedding_layer is usually a placeholder vector torch.arange(context_length), which contains the sequence of numbers 0, 1, ..., up to the maximum input length minus 1. context_length is a variable representing the input size supported by the LLM. Here, we set it equal to the maximum length of the input text. In practice, the input text can be longer than the supported context length. In this case, we have to truncate the text.
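
Truncation can be sketched as a simple slice. The token IDs below are placeholder values for a text that is longer than the context length, and keeping the first `context_length` tokens is only one possible policy (keeping the most recent tokens is another); neither choice is prescribed by this chapter.

```python
context_length = 4  # supported input size, matching the value used in this section

# Hypothetical token IDs for a text longer than the context length
token_ids = [40, 367, 2885, 1464, 1807, 3619]

# Keep only the first context_length tokens
truncated = token_ids[:context_length]
print(truncated)  # [40, 367, 2885, 1464]
```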

The output of the print statement is as follows:

>torch.Size([4, 256])

As we can see, the position embedding tensor consists of four 256-dimensional vectors. We can now add them directly to the token embeddings. Through broadcasting, PyTorch adds the 4 x 256 pos_embeddings tensor to the 4 x 256 token embedding tensor of each of the 8 samples in the batch:
>input_embeddings = token_embeddings + pos_embeddings

>print(input_embeddings.shape)

The print output is as follows:

>torch.Size([8, 4, 256])
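
The broadcasting behavior can be verified on a smaller scale. The following sketch uses random tensors with assumed shapes (a batch of 2, 4 tokens, 3 dimensions) and illustrative variable names rather than the actual embeddings from this section, and compares the broadcast sum with an explicit sample-by-sample addition.

```python
import torch

torch.manual_seed(123)

# Small illustrative shapes: batch of 2, 4 tokens, 3 dimensions
demo_token_embeddings = torch.randn(2, 4, 3)
demo_pos_embeddings = torch.randn(4, 3)

# Broadcasting: the (4, 3) tensor is added to each of the 2 samples
broadcast_sum = demo_token_embeddings + demo_pos_embeddings

# Equivalent explicit version, adding the position tensor sample by sample
explicit_sum = torch.stack(
    [sample + demo_pos_embeddings for sample in demo_token_embeddings]
)

print(broadcast_sum.shape)                       # torch.Size([2, 4, 3])
print(torch.equal(broadcast_sum, explicit_sum))  # True
```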

The input_embeddings we created, summarized in Figure 2.19, is an example of an embedded input that can now be processed by the main LLM module we will start implementing in Chapter 3.

![](../img/fig-2-19.jpg)
**Figure 2.19 As part of the input processing pipeline, the input text is first broken down into individual tokens. These tokens are then converted into token IDs using the vocabulary. The token IDs are converted into encoding vectors, and positional encodings of the same size are added to produce the encodings used as input to the main LLM layers.**
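
The last two steps of this pipeline can be bundled into a single module. The class below is a hypothetical helper written for illustration (the name `InputEmbedding` does not come from this book), and a random tensor stands in for a real batch of token IDs from the dataloader.

```python
import torch

class InputEmbedding(torch.nn.Module):
    """Hypothetical helper combining token and absolute position embeddings."""

    def __init__(self, vocab_size, context_length, output_dim):
        super().__init__()
        self.token_embedding = torch.nn.Embedding(vocab_size, output_dim)
        self.pos_embedding = torch.nn.Embedding(context_length, output_dim)

    def forward(self, token_ids):
        # token_ids: (batch_size, num_tokens), with num_tokens <= context_length
        num_tokens = token_ids.shape[1]
        positions = torch.arange(num_tokens, device=token_ids.device)
        # The (num_tokens, output_dim) position tensor broadcasts over the batch
        return self.token_embedding(token_ids) + self.pos_embedding(positions)

embedder = InputEmbedding(vocab_size=50257, context_length=4, output_dim=256)
batch = torch.randint(0, 50257, (8, 4))  # random stand-in for a batch of token IDs
output = embedder(batch)
print(output.shape)  # torch.Size([8, 4, 256])
```

Keeping both embedding layers inside one module means both sets of weights are registered as parameters of the same object, so an optimizer can update them together during training.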