# Chapter 5 Pre-training on unlabeled data

**Introduction**:

- Calculate the loss of the training set and validation set to evaluate the quality of the text generated by the LLM during training
- Implement the training function and pre-train the LLM
- Save and load the model weights to continue training the LLM
- Load OpenAI's pre-trained weights

In the previous chapters, we implemented data sampling and the attention mechanism, and wrote the code for the LLM architecture. In this chapter, we focus on implementing the training function and pre-training the LLM, as shown in Figure 5.1.

**Figure 5.1 A mental model of the three main stages of building an LLM, including pre-training the LLM on a general text dataset and fine-tuning it on a labeled dataset. This chapter focuses primarily on pre-training the LLM, including implementing the training code, evaluating performance, and saving and loading model weights.**

![fig5.1](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-1.jpg?raw=true)

As shown in Figure 5.1, we will also learn basic model evaluation techniques for measuring the quality of generated text, which is a key step in optimizing the LLM during training. In addition, we will explore how to load pre-trained weights, which gives us a solid starting point for fine-tuning the LLM in subsequent chapters.

**Weight parameters**

In the context of LLMs and other deep learning models, weights refer to the trainable parameters that need to be adjusted during the learning process. These weights are also called weight parameters or simply parameters. In frameworks like PyTorch, these weights are stored in linear layers, such as the ones we used when implementing the multi-head attention module in Chapter 3 and the GPTModel in Chapter 4. After initializing a layer (`new_layer = torch.nn.Linear(...)`), we can access its weights through the `.weight` attribute, that is, `new_layer.weight`. In addition, for convenience, PyTorch allows direct access to all trainable parameters of the model, including weights and biases, through the `model.parameters()` method, which we will use later when implementing model training.
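As a small illustration of the attributes mentioned above, the following sketch creates a linear layer and counts its trainable parameters. The layer sizes (4 inputs, 3 outputs) are arbitrary values chosen for this example and are not settings used elsewhere in the chapter.

```python
import torch

torch.manual_seed(123)

# Hypothetical layer sizes, chosen only for illustration
new_layer = torch.nn.Linear(in_features=4, out_features=3)

# The weight matrix has shape (out_features, in_features)
print(new_layer.weight.shape)

# parameters() yields every trainable tensor: here the weight and the bias
num_params = sum(p.numel() for p in new_layer.parameters() if p.requires_grad)
print(num_params)  # 4 * 3 weights + 3 biases = 15
```

The same `sum(p.numel() ...)` pattern works for a full model, because `parameters()` walks through all nested layers.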

## 5.1 Evaluating the Text Generation Model

Starting from the code in the previous chapter, we show how to use the LLM for text generation and then discuss basic methods for evaluating the quality of generated text. The content of this section and the rest of this chapter is summarized in Figure 5.2.

**Figure 5.2 The main topics of this chapter. We first review the text generation content from the previous chapter and then implement basic techniques for model evaluation during the pre-training stage.**

![fig5.2](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-2.jpg?raw=true)

As shown in Figure 5.2, in the next section we will review the text generation content set up at the end of the previous chapter, and then in the subsequent sections we will delve into text evaluation and calculate training and validation losses.

### 5.1.1 Generating text using GPT

In this section, we will initialize LLM and briefly review the text generation process implemented in Chapter 4. First, we will initialize a GPT model, which will be evaluated and trained in this chapter. We will use the GPTModel class and GPT_CONFIG_124M dictionary from Chapter 4 to complete the model initialization:

```python
import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,  # shortened from the original 1024 tokens
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,       # dropout rate
    "qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
```

```
GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
          (3): Dropout(p=0.1, inplace=False)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_resid): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttent
```

For the GPT_CONFIG_124M dictionary, the only adjustment we made compared to the previous chapter was to reduce the context length (context_length) to 256 tokens. This change reduces the computational pressure of model training and makes it possible to train on an ordinary laptop.

The 124 million parameter GPT-2 model was originally configured to process 1024 tokens. After the training process is finished, we will update the context size setting at the end of this chapter and load the pre-trained weights to work with the model configured for a context length of 1024 tokens.

With the `GPTModel` instance in place, we use the `generate_text_simple` function introduced in the previous chapter and add two utility functions, `text_to_token_ids` and `token_ids_to_text`. These functions convert between text and token representations, and we will use them frequently in this chapter. To make the process clearer, Figure 5.3 illustrates it before we dive into the code.

**Figure 5.3 The text generation process involves encoding the text into token IDs, which the LLM processes into logit vectors. These logit vectors are then converted back to token IDs and finally decoded into text.**

![fig5.3](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-3.png?raw=true)

Figure 5.3 depicts the three steps of text generation using the GPT model. First, the tokenizer converts the input text into a series of token IDs, as described in Chapter 2. Second, the model takes these token IDs and generates corresponding logits, which are vectors of unnormalized scores with one entry per token in the vocabulary (the softmax function turns them into a probability distribution), as described in Chapter 4. Finally, these logits are converted back to token IDs, which the tokenizer decodes into human-readable text, completing the cycle from text input to text output.

We implemented the following code for the text generation process:

```python
import tiktoken
from chapter04 import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

Using the preceding code, the model generates the following text:

```
Output text:
 Every effort moves you rentingetic wasnم refres RexMeCHicular stren
```

From the output, it is clear that the model is not yet able to generate coherent text because it has not yet been trained. In order to define what is "coherent" or "high-quality" text, we need to implement a numerical method to evaluate the generated content. This method will allow us to monitor and improve the performance of the model throughout the training process.

The following sections describe how we compute a loss metric for the generated output. This loss serves as a measure of training progress and a sign of success. Additionally, in a subsequent section on fine-tuning the LLM, we will review other methods for evaluating model quality.

### 5.1.2 Calculating text generation loss

In this section, we explore a technique for calculating a text generation loss, which lets us quantitatively evaluate the text generated during training. We work through the topic step by step with a practical example to keep the concepts concrete. Let's start with a brief recap of how the data is loaded (Chapter 2) and how text is generated by the `generate_text_simple` function (Chapter 4).

Figure 5.4 clearly depicts the entire process from input text to LLM-generated text in a five-step process.

**Figure 5.4 For the three inputs shown on the left side of the image, we compute a vector for each input token that contains the probability scores for each token in the vocabulary. The index position with the highest probability score in each vector represents the most likely next token ID. The token IDs associated with the highest probability scores are selected and mapped back into text that represents the text generated by the model.**

![fig5.4](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-4.jpg?raw=true)

The text generation pipeline in Figure 5.4 details the inner workings of the `generate_text_simple` function from Chapter 4. We need to perform these same initial steps before we can compute a loss that measures the quality of the generated text later in this section.

Figure 5.4 outlines the text generation process with a small vocabulary of only 7 tokens so that the image fits on a single page. However, our GPT model works with a much larger vocabulary of 50,257 tokens; therefore, in the following code, token IDs range from 0 to 50,256 rather than 0 to 6.

In addition, for simplicity, Figure 5.4 shows only one text example ("every effort moves"). In the following code examples that implement the steps in Figure 5.4, we will use two input examples ("every effort moves" and "I really like") as input to the GPT model.

Consider two input examples that have been converted to corresponding token IDs, corresponding to step 1 in Figure 5.4:

```python
inputs = torch.tensor([[16833, 3626, 6100],  # ["every effort moves",
                       [40, 1107, 588]])     #  "I really like"]
```

Matching these inputs, the `targets` contain the token IDs we aim for the model to produce:

```python
targets = torch.tensor([[3626, 6100, 345],    # [" effort moves you",
                        [588, 428, 11311]])   #  " really like chocolate"]
```

Note that the target is the same as the input, just shifted one position forward, a concept we discussed when implementing data loaders in Chapter 2. This shifting strategy is crucial for training the model to predict the next element in the sequence.
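The shift can be sketched with plain Python lists. The token IDs below are the ones for "every effort moves you" that appear in the tensors above; the variable names are only illustrative.

```python
# Token IDs for "every effort moves you", taken from the tensors above
token_ids = [16833, 3626, 6100, 345]

inputs_row = token_ids[:-1]   # all tokens except the last
targets_row = token_ids[1:]   # the same tokens, shifted by one position

# Each growing prefix of the input is paired with the token that follows it
for i in range(len(inputs_row)):
    context = inputs_row[: i + 1]
    print(context, "->", targets_row[i])
```

Each target is simply the token that follows the corresponding input position, which is why the two rows overlap in all but one element.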

We feed the inputs into the model to compute the logit vectors for the two input examples, each consisting of three tokens, and apply the softmax function to convert these logits into probability scores, which corresponds to step 2 in Figure 5.4:

```python
with torch.no_grad():  # disable gradient tracking, since we are not training yet
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1)  # probability of each token in the vocabulary
print(probas.shape)
```

The resulting probability score (probas) tensor dimensions are as follows:

`torch.Size([2, 3, 50257])`

The first number, 2, represents the two examples (rows) in the input, also known as the batch size. The second number, 3, represents the number of tokens in each input (row). The last number is the size of each output vector, which equals the vocabulary size (50,257): the model produces one score for every token in the vocabulary, as we discussed in the previous section.
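To make the softmax step concrete, here is a minimal pure-Python sketch for a single position with a 7-token vocabulary. The logit values are made up for illustration; in the real model, each of the 2 × 3 positions holds a vector of 50,257 logits.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a 7-token vocabulary (illustrative values only)
toy_logits = [0.1, -1.2, 2.0, 0.3, -0.5, 1.1, 0.0]
toy_probas = softmax(toy_logits)

print(sum(toy_probas))                     # approximately 1.0
print(toy_probas.index(max(toy_probas)))   # index of the most likely token
```

Because softmax preserves the ordering of the logits, the most likely token is the one with the largest logit.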

After converting the logits to probabilities with the softmax function, the `generate_text_simple` function from Chapter 4 converts the resulting probability scores back into text, as shown in steps 3-5 of Figure 5.4.

We can implement steps 3 and 4 by applying the argmax function to the probability scores to obtain the corresponding token IDs:

```python
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)
```

Since we have a batch of two inputs, each containing 3 tokens, applying the argmax function to the probability scores (step 3 in Figure 5.4) produces two sets of outputs, each containing 3 predicted token IDs:

```
Token IDs:
 tensor([[[16657], # First batch
          [  339],
          [42826]],
         [[49906], # Second batch
          [29669],
          [41751]]])
```

Finally, step 5 converts the token IDs back into text:

```python
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")
```

When we decode these tokens, we find that these output tokens are completely different from the target tokens we want the model to generate:

```
Targets batch 1: effort moves you
Outputs batch 1: Armed heNetflix
```

The model produces random text that differs from the target text because it has not been trained yet. We now want to evaluate the model's generated text numerically by means of a loss, as shown in Figure 5.5. This loss is not only useful for measuring the quality of the generated text; it is also the basis for the training function implemented later, which uses it to update the model weights and improve the generated text.

**Figure 5.5 In the remainder of this section, we will implement the text evaluation function. In the next section, we will apply this evaluation function to the entire dataset used for model training.**

![fig5.5](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-5.png?raw=true)

In the rest of this section, we implement the text evaluation part of the process, as shown in Figure 5.5. This evaluation measures how "far" the generated tokens are from the correct predictions (the targets). The training function later in this chapter will use this information to adjust the model weights so that the generated text moves closer to (ideally, matches) the target text.

The goal of model training is to increase the softmax probability at the index positions corresponding to the correct target token IDs, as shown in Figure 5.6. This softmax probability is also used in the evaluation metric we implement in the remainder of this section to numerically assess the model's outputs: the higher the probability at the correct positions, the better.

**Figure 5.6 Before training, the model produces random next-token probability vectors. The goal of model training is to maximize the probability values corresponding to the target token IDs.**

![fig5.6](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-6.jpg?raw=true)

Note that Figure 5.6 shows the softmax probabilities for a compact vocabulary of only 7 tokens so that everything fits in a single figure. This means that the initial random values will be around 1/7 (about 0.14).

However, the vocabulary we use in the GPT-2 model has 50,257 tokens, so most of the initial probabilities will be around 0.00002 (1/50,257).

For each of the two input texts, we can print the initial softmax probability score corresponding to the target token with the following code:

```python
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
```

The probabilities of the 3 target token IDs for each input text are as follows:

```
Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05])
Text 2: tensor([3.9836e-05, 1.6783e-05, 4.7559e-06])
```
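The indexing expression `probas[text_idx, [0, 1, 2], targets[text_idx]]` pairs each token position with its target token ID and looks up the probability stored there. The following sketch spells out the same lookup with plain Python lists; the probabilities, the 5-token vocabulary, and the target IDs are made-up values for illustration.

```python
# Made-up probabilities for 3 positions over a 5-token vocabulary
probas_text = [
    [0.10, 0.20, 0.30, 0.25, 0.15],  # position 0
    [0.05, 0.05, 0.60, 0.20, 0.10],  # position 1
    [0.40, 0.10, 0.10, 0.10, 0.30],  # position 2
]
targets_text = [3, 2, 4]  # hypothetical target token ID for each position

# For each position, pick the probability stored at the target's index
target_probas = [probas_text[pos][tok] for pos, tok in enumerate(targets_text)]
print(target_probas)  # [0.25, 0.6, 0.3]
```

Only one probability per position is selected, so three positions always yield three values regardless of the vocabulary size.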

The goal of training an LLM is to maximize these probability values, pushing them as close to 1 as possible. This way, when the model generates the next token, it will consistently choose the target token, that is, the next word in the sentence.

**Backpropagation**

How can we maximize the softmax probability values corresponding to the target tokens? In general, we update the weights of the model so that the model outputs a higher value for each token ID we want to generate. The weight updates are done through a process called backpropagation, which is a standard technique for training deep neural networks (see Sections A.3 to A.7 of Appendix A for more details on backpropagation and model training).

Backpropagation requires a loss function, which is used to calculate the difference between the model's predicted output (here, the probability corresponding to the target token ID) and the actual expected output. This loss function is used to measure the degree of deviation between the model's prediction and the target value.

In the rest of this section, we compute the loss for the probability scores of the two example inputs, `target_probas_1` and `target_probas_2`. The main steps are shown in Figure 5.7.

**Figure 5.7 Calculating the loss involves multiple steps. Steps 1 to 3 calculate the token probabilities corresponding to the target tensor. Then, in steps 4 to 6, these probabilities are log-transformed and averaged.**

![fig5.7](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-7.jpg?raw=true)

Since we have already calculated `target_probas_1` and `target_probas_2` according to steps 1-3 in Figure 5.7, we will proceed to step 4 and apply the logarithmic function to these probability scores.

```python
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)
```

The results are as follows:

```
tensor([ -9.5042, -10.3796, -11.3677, -10.1307, -10.9951, -12.2561])
```

In mathematical optimization, it is more convenient to work with the logarithms of probability scores rather than directly with the scores themselves. This topic is beyond the scope of this book, but I have given a lecture on this in detail, which you can find a link to in the References section of Appendix B.
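One practical reason, offered here as general background rather than as a summary of the lecture mentioned above, is numerical: multiplying many small probabilities quickly underflows in floating-point arithmetic, whereas adding their logarithms does not. The sketch below assumes 100 tokens that each receive a probability of 1e-5, which is an illustrative value and not a measurement from our model.

```python
import math

# Assume 100 tokens that each receive a probability of 1e-5 (illustrative)
token_probas = [1e-5] * 100

product = 1.0
for p in token_probas:
    product *= p          # the joint probability shrinks toward zero

log_sum = sum(math.log(p) for p in token_probas)

print(product)   # underflows to 0.0 in 64-bit floating point
print(log_sum)   # stays a well-behaved finite number
```

Working with sums of logarithms also turns a product of probabilities into a sum, which is easier to differentiate during optimization.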

Next, we combine these log-probabilities into a single score by computing the mean (step 5 in Figure 5.7):

```python
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
```

The resulting average log probability score is:

`tensor(-10.7722)`

Our goal is to make the mean log probability as close to 0 as possible by updating the model weights during training, which we will achieve in Section 5.2.

However, in deep learning, a common approach is not to increase the mean log probability to 0, but to reduce the negative mean log probability to 0. The negative mean log probability is the mean log probability multiplied by -1, which corresponds to step 6 in Figure 5.7:

```python
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
```

This prints `tensor(10.7722)`.

In deep learning, this negated value (-10.7722 turned into 10.7722) is known as the cross entropy loss.
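Steps 4 to 6 can be double-checked without PyTorch by starting from the six target probabilities printed earlier. This is only a sanity-check sketch; because the printed probabilities are rounded, the result matches the value above only up to small rounding differences.

```python
import math

# Target-token probabilities printed earlier (rounded values)
printed_probas = [7.4541e-05, 3.1061e-05, 1.1563e-05,
                  3.9836e-05, 1.6783e-05, 4.7559e-06]

log_values = [math.log(p) for p in printed_probas]   # step 4: logarithm
avg_log_value = sum(log_values) / len(log_values)    # step 5: mean
neg_avg_log_value = -avg_log_value                   # step 6: negate

print(neg_avg_log_value)  # close to the 10.7722 computed above
```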

PyTorch comes in handy here because it has a built-in `cross_entropy` function that can handle all six steps in Figure 5.7 for us.

**Cross entropy loss**

Cross-entropy loss is a common metric in machine learning and deep learning that measures the difference between two probability distributions — typically the true distribution of labels (in this case, the tokens in the dataset) and the predicted distribution of the model (e.g., the token probabilities generated by an LLM).

In the field of machine learning, especially in frameworks like PyTorch, the `cross_entropy` function is used to calculate a metric for discrete outcomes, which is similar to the negative mean log probability of the target token given the probability of the token generated by the model. Therefore, the terms cross entropy and negative mean log probability are often used interchangeably in practice.

Before applying the cross entropy function, let's briefly review the shapes of the logits and target tensors:

```python
print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)
```

The shape is as follows:

```
Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])
```

As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens.

For the cross entropy loss function in PyTorch, we want to flatten these tensors by concatenating them across the batch dimension:

```python
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)
```

The resulting tensor dimensions are as follows:

```
Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])
```

Note that the target values are the token IDs we expect the LLM to generate, while the logits contain the unscaled model outputs before they pass through the softmax function to become probability scores.

Earlier, we applied the softmax function, selected the probability scores corresponding to the target ID, and calculated the negative mean log probability. PyTorch's `cross_entropy` function will handle all of these steps for us:

```python
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
```

The resulting loss is the same as we obtained earlier when we manually implemented the steps shown in Figure 5.7:

```
tensor(10.7722)
```
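The equivalence between the manual steps and the built-in function can be checked on a small stand-alone example. The sketch below uses random stand-in tensors (a batch of 2, 3 tokens each, and a hypothetical 7-token vocabulary) instead of the GPT model's outputs, so it runs without the chapter's other code.

```python
import torch

torch.manual_seed(123)

# Random stand-ins for the model outputs: batch of 2, 3 tokens, vocabulary of 7
toy_logits = torch.randn(2, 3, 7)
toy_targets = torch.randint(0, 7, (2, 3))

toy_logits_flat = toy_logits.flatten(0, 1)   # shape [6, 7]
toy_targets_flat = toy_targets.flatten()     # shape [6]

# Manual route: softmax -> pick target probabilities -> log -> mean -> negate
toy_probas = torch.softmax(toy_logits_flat, dim=-1)
row_idx = torch.arange(toy_targets_flat.shape[0])
picked = toy_probas[row_idx, toy_targets_flat]
manual_loss = -torch.log(picked).mean()

# Built-in route
builtin_loss = torch.nn.functional.cross_entropy(toy_logits_flat, toy_targets_flat)

print(torch.allclose(manual_loss, builtin_loss))
```

In practice the built-in function is preferable, since it applies the softmax and logarithm in a numerically stable way.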

**Perplexity**

Perplexity is a metric that is often used alongside cross entropy loss to evaluate models on tasks such as language modeling. It offers a more interpretable way to express the model's uncertainty when predicting the next token in a sequence.

Perplexity measures how well the probability distribution predicted by the model matches the actual distribution of words in the dataset. Similar to loss, lower perplexity indicates that the model's predictions are closer to the actual distribution.

The perplexity can be calculated using the formula `perplexity = torch.exp(loss)`. When we apply this formula to the previously calculated loss value, the result is `tensor(47678.8633)`.

Perplexity is often considered easier to interpret than the raw loss value because it indicates the effective vocabulary size about which the model is uncertain at each step. In this example, it means the model is unsure which of roughly 47,678 tokens in the vocabulary to generate as the next token.
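The relationship between loss and perplexity can be reproduced with the standard library alone. The sketch below starts from the rounded loss of 10.7722, so its result differs slightly from the tensor value quoted above; the uniform-model comparison is an added reference point, not a result from the chapter.

```python
import math

rounded_loss = 10.7722               # rounded loss from the previous section
perplexity = math.exp(rounded_loss)
print(perplexity)                    # close to the value quoted above

# A model that spreads probability evenly over the vocabulary has a
# perplexity equal to the vocabulary size
vocab_size = 50257
uniform_perplexity = math.exp(-math.log(1 / vocab_size))
print(uniform_perplexity)
```

Seen this way, the untrained model is only marginally better than guessing uniformly among all 50,257 tokens.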

In this section, we compute the loss on two small text inputs for illustration. In the next section, we will compute the loss on the entire training and validation sets.

### 5.1.3 Calculating the training and validation set losses

In this section, we first prepare the training and validation sets that will be used to train the LLM later in this chapter. We then calculate the cross entropy for the training and validation sets, as shown in Figure 5.8, which is an important part of the model training process.

**Figure 5.8 In the previous section, we calculated the cross entropy loss. Now we apply this loss calculation to the entire text dataset that we will use for model training.**

![fig5.8](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-8.jpg?raw=true)

To compute the training and validation losses shown in Figure 5.8, we use a very small text dataset, the short story "The Verdict" by Edith Wharton, which we already worked with in Chapter 2. By choosing a text in the public domain, we avoid any concerns about usage rights. In addition, such a small dataset lets us run the code examples on a standard laptop in a few minutes, even without a high-end GPU, which is particularly useful for teaching purposes.

For those interested, you can also use the additional code provided with this book to prepare a larger dataset of over 60,000 public domain books from Project Gutenberg and train an LLM on this data (see Appendix D for details).

**Cost of Pre-training LLM**

To better understand the scale of our project, consider the training of the 7 billion parameter Llama 2 model, a relatively well-known and publicly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs to process 2 trillion tokens. At the time of writing, running an 8xA100 cloud server on AWS costs about $30 per hour. A rough estimate puts the total training cost of such an LLM at about $690,000 (184,320 hours divided by 8, then multiplied by $30).
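The estimate above is simple enough to verify in a few lines. The figures are the ones quoted in the text; the variable names are only illustrative.

```python
gpu_hours = 184_320          # A100 GPU hours quoted above
gpus_per_server = 8          # one 8xA100 cloud server
cost_per_server_hour = 30    # approximate USD per hour, as quoted above

server_hours = gpu_hours / gpus_per_server
total_cost = server_hours * cost_per_server_hour

print(server_hours)  # 23040.0
print(total_cost)    # 691200.0
```

The result of $691,200 is what the text rounds to about $690,000.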

The following code loads the short story "The Verdict" that we used in Chapter 2:

```python
file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()
```

After loading the dataset, we can check the number of characters and tokens in the dataset:

```python
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)
```

The output is as follows:

```
Characters: 20479
Tokens: 5145
```

With only 5,145 tokens, this text may seem too small to train an LLM. However, as mentioned before, it serves educational purposes, allowing us to run the code in minutes rather than weeks. In addition, we will load OpenAI's pre-trained weights into our GPTModel code at the end of this chapter.

Next, we split the dataset into training and validation sets and use the data loader from Chapter 2 to prepare batches of data for LLM training. This process is visually demonstrated in Figure 5.9.

**Figure 5.9 To prepare the data loaders, we first split the input text into training and validation portions. Next, we tokenize the text (for simplicity, only the training portion is shown here) and divide the tokenized text into chunks of a user-specified length (6 in this case). Finally, we shuffle the rows and organize the chunked text into batches (here, with a batch size of 2) that we can use for model training.**

![fig5.9](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-9.jpg?raw=true)

For ease of visualization, Figure 5.9 uses a maximum length of 6 because of space limitations. In the actual data loader, however, we set the maximum length to the 256-token context length supported by our LLM, so that the LLM sees longer texts during training.

**Training with variable length**

We train the model with chunks of data of similar size for simplicity and efficiency reasons. However, in practice, it is also beneficial to train the LLM with inputs of different lengths, which helps the model better adapt to various types of inputs when used.

To achieve the data splitting and loading shown in Figure 5.9, we first define a train_ratio, using 90% of the data for training and the remaining 10% as validation data during model training:

```python
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
```

With the train_data and val_data subsets, we can now create the corresponding data loaders, reusing the `create_dataloader_v1` code from Chapter 2:

```python
from chapter02 import create_dataloader_v1
torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False
)
```

In the code above, we chose a small batch size to reduce the computational resource demand, since we are working with a very small dataset. In practice, it is common to train LLMs with batch sizes of 1,024 or larger.

As an optional check, we can iterate over the data loaders to ensure they were created correctly:

```python
print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)
```

We can see the following output:

```
Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])
```

According to the running results of the preceding code, we have a total of 9 training batches, each batch contains 2 samples, and each sample has 256 tokens. Since we only allocate 10% of the data for the validation process, there is only one validation batch, which contains 2 input examples.
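These batch counts can be roughly reconstructed from the token count. The sketch below rests on two assumptions that are not confirmed in this section: that the Chapter 2 data loader creates one sample per start index in `range(0, num_tokens - max_length, stride)`, and that the tokens divide between the two splits in the same 90/10 proportion as the characters. The helper `count_batches` is hypothetical and written only for this check.

```python
def count_batches(num_tokens, max_length, stride, batch_size, drop_last):
    # Assumed windowing rule: one sample per start index in
    # range(0, num_tokens - max_length, stride)
    num_samples = len(range(0, num_tokens - max_length, stride))
    if drop_last:
        return num_samples // batch_size      # incomplete last batch is dropped
    return -(-num_samples // batch_size)      # ceiling division keeps it

# Approximate token counts, assuming tokens follow the 90/10 character split
train_tokens = int(0.90 * 5145)
val_tokens = 5145 - train_tokens

print(count_batches(train_tokens, 256, 256, 2, drop_last=True))   # 9
print(count_batches(val_tokens, 256, 256, 2, drop_last=False))    # 1
```

Under these assumptions, the estimate agrees with the 9 training batches and 1 validation batch printed above.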

As expected, the input data (x) and target data (y) have the same shape (batch size times the number of tokens per sample), because the targets are the inputs shifted by one position, as we discussed in Chapter 2.

Next, we will implement a utility function that calculates the cross entropy loss for a specific batch returned by the training and validation loaders:

```python
def calc_loss_batch(input_batch, target_batch, model, device):
    # Move both tensors to the same device as the model
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    # Flatten the batch and token dimensions, as cross_entropy expects
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss
```

We can now use the `calc_loss_batch` utility function, which computes the loss for a single batch, to implement the following `calc_loss_loader` function, which computes the loss over all batches sampled by a given data loader:

**Code Listing 5.2 Functions for calculating training and validation losses**

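```python
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")  # guard against dividing by zero for an empty loader
    if num_batches is None:
        num_batches = len(data_loader)  # iterate over all batches by default
    else:
        # Never request more batches than the loader contains
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()  # accumulate the loss of each batch
        else:
            break
    return total_loss / num_batches  # average over the evaluated batches
```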

By default, the `calc_loss_loader` function iterates over all batches in the given data loader, accumulates the loss of each batch in the `total_loss` variable, and then averages the loss over the total number of batches. Alternatively, we can specify a smaller number of batches through the `num_batches` parameter to speed up evaluation during model training.
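The effect of the `num_batches` parameter can be seen on plain numbers, without a model or data loader. The helper `average_loss` below is a hypothetical stand-in that follows the same batch-limiting and averaging logic as `calc_loss_loader`, and the per-batch losses are made up.

```python
def average_loss(batch_losses, num_batches=None):
    # Same batch-limiting and averaging logic, applied to a list of numbers
    if num_batches is None:
        num_batches = len(batch_losses)
    else:
        num_batches = min(num_batches, len(batch_losses))
    total = 0.0
    for i, loss in enumerate(batch_losses):
        if i >= num_batches:
            break
        total += loss
    return total / num_batches

fake_losses = [11.0, 10.5, 10.0, 9.5]   # made-up per-batch losses

print(average_loss(fake_losses))                  # 10.25
print(average_loss(fake_losses, num_batches=2))   # 10.75
print(average_loss(fake_losses, num_batches=10))  # 10.25
```

Requesting more batches than exist simply falls back to averaging over all of them, while a smaller number trades accuracy of the estimate for speed.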

Now, let's apply this `calc_loss_loader` function to our training and validation set loaders to see it in action:

```python
# Use a CUDA GPU if one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Passing device ensures the data is moved to the same device as the model
train_loss = calc_loss_loader(train_loader, model, device)
val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

The resulting loss values are as follows:

```
Training loss: 10.98758347829183
Validation loss: 10.98110580444336
```

Since the model has not been trained yet, the loss values are relatively high. In contrast, if the model learns to generate the next tokens as they appear in the training and validation sets, the loss will approach 0.
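To put "relatively high" in perspective, a model that assigned the same probability to every token would have a cross entropy equal to the natural logarithm of the vocabulary size. The sketch below computes this baseline for the 7-token toy vocabulary from the figures and for the 50,257-token GPT-2 vocabulary; it is a back-of-the-envelope reference, not part of the training code.

```python
import math

uniform_losses = {}
for vocab_size in (7, 50257):
    uniform_prob = 1 / vocab_size
    # Loss of a model that assigns the same probability to every token
    uniform_losses[vocab_size] = -math.log(uniform_prob)
    print(vocab_size, uniform_prob, uniform_losses[vocab_size])
```

For 50,257 tokens the baseline is about 10.82, so the training and validation losses of about 10.98 show that the randomly initialized model does no better than uniform guessing.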

Now that we have a way to measure the quality of generated text, we will next train the LLM to reduce this loss so that it performs better at generating text, as shown in Figure 5.10.

**Figure 5.10 We have reviewed the text generation process and implemented basic model evaluation techniques to calculate the training and validation set losses. Next, we move on to the training function and pre-training the LLM.**

![fig5.10](https://github.com/Pr04Ark/llms-from-scratch-cn/blob/trans01/Translated_Book/img/fig-5-10.jpg?raw=true)

As shown in Figure 5.10, the following section focuses on pre-training the LLM. After training the model, we will apply different text generation strategies and save and load the pre-trained model weights.