<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Appendix A: Introduction to PyTorch (Part 2)

https://livebook.manning.com/book/build-a-large-language-model-from-scratch/appendix-a/270

## A.9 Optimizing training performance with GPUs

### A.9.1 PyTorch computations on GPU devices

In [2]:
import torch

print(torch.__version__)

2.5.1+cu124


In [3]:
print(torch.cuda.is_available())

False


In [4]:
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])

print(tensor_1 + tensor_2)

tensor([5., 7., 9.])


In [5]:
tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")

print(tensor_1 + tensor_2)

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

In [5]:
tensor_1 = tensor_1.to("cpu")
print(tensor_1 + tensor_2)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
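
Both errors above come down to device placement: either no CUDA device is present, or the two operands live on different devices. Below is a minimal sketch that avoids both, assuming a fallback to the CPU is acceptable when no GPU is found. The names `device`, `a`, `b`, and `result` are illustrative and not part of the original notebook.

```python
import torch

# Fall back to the CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([1., 2., 3.]).to(device)
b = torch.tensor([4., 5., 6.]).to(device)

# Both operands live on the same device, so the addition is valid
result = a + b
print(result.device.type, result.cpu())
```

Because every tensor is moved with the same `device` object, the code runs unchanged on machines with and without a GPU.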

### A.9.2 Single-GPU training

PyTorch implements a Dataset and a DataLoader class. The Dataset class is used to instantiate objects that define how each data record is loaded. The DataLoader handles how the data is shuffled and assembled into batches.

In [24]:
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

In [25]:
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

In [None]:
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=1,
    # Drop the last batch if it is smaller than batch_size. A much smaller
    # final batch yields a noisier, less representative gradient estimate,
    # which can disturb convergence. Layers such as batch normalization also
    # depend on batch statistics (mean, variance), and those can differ
    # noticeably for a small batch.
    drop_last=True
)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=1
)

# If you iterate over train_loader a second time, the shuffling order changes.
# This is desired: it keeps the network from getting caught in repetitive
# update cycles during training.
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, "--->", y)

Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) ---> tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) ---> tensor([0, 0])
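
Only two batches are printed above even though the training set has five records, which is the effect of `drop_last=True`. The following sketch makes this visible by comparing the batch sizes with and without that setting. It assumes a stand-in `TensorDataset` with five records; `toy_ds`, `keep_loader`, and `drop_loader` are illustrative names.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Five toy records, mirroring the size of the training set above
features = torch.arange(10, dtype=torch.float32).reshape(5, 2)
labels = torch.tensor([0, 0, 0, 1, 1])
toy_ds = TensorDataset(features, labels)

keep_loader = DataLoader(toy_ds, batch_size=2, shuffle=False, drop_last=False)
drop_loader = DataLoader(toy_ds, batch_size=2, shuffle=False, drop_last=True)

# The incomplete final batch only appears when drop_last=False
sizes_keep = [len(y) for _, y in keep_loader]
sizes_drop = [len(y) for _, y in drop_loader]
print(sizes_keep)  # [2, 2, 1]
print(sizes_drop)  # [2, 2]
```

With shuffling enabled, a different record is left out in each epoch, so no training example is permanently discarded.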


In [16]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()

        self.layers = torch.nn.Sequential(

            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits
    

model = NeuralNetwork(50, 3)
print(model)

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)


print(model.layers[0].weight.shape)
print(model.layers[0].weight)


NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)
Total number of trainable model parameters: 2213
torch.Size([30, 50])
Parameter containing:
tensor([[-0.0239,  0.0571, -0.0261,  ..., -0.0688,  0.1206, -0.0904],
        [ 0.0073,  0.0428,  0.0269,  ...,  0.0794,  0.1054,  0.0645],
        [ 0.0237, -0.0184, -0.1275,  ...,  0.0246, -0.0365,  0.0579],
        ...,
        [-0.0832, -0.0792, -0.0952,  ...,  0.0292,  0.0998, -0.0457],
        [-0.0445, -0.0387, -0.0648,  ..., -0.0557, -0.0108,  0.0277],
        [ 0.0604,  0.1190,  0.0029,  ..., -0.0247, -0.0509, -0.0709]],
       requires_grad=True)
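
The total of 2213 trainable parameters can be checked by hand. As a short sketch: each `Linear` layer contributes `in_features * out_features` weights plus `out_features` biases. The list `layer_shapes` is an illustrative name and simply restates the shapes printed above.

```python
# (in_features, out_features) of the three Linear layers printed above
layer_shapes = [(50, 30), (30, 20), (20, 3)]

# Each Linear layer has in*out weights plus out biases
per_layer = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
print(per_layer)       # [1530, 620, 63]
print(sum(per_layer))  # 2213
```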


In [22]:
### Generate a toy input and do a forward pass
torch.manual_seed(123)
X = torch.rand((1, 50))
out = model(X)
print(out)

# Apply softmax so that the outputs can be interpreted as
# class-membership probabilities that sum to 1.
with torch.no_grad():
    out = torch.softmax(out, dim=1)
print(out)


tensor([[ 0.1802, -0.1991, -0.2394]], grad_fn=<AddmmBackward0>)
tensor([[0.4270, 0.2922, 0.2807]])
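
To see where the second tensor comes from, the softmax can be reproduced by hand. This sketch assumes the rounded logits printed above as input, so the result matches only to about three decimal places. `manual` is an illustrative name.

```python
import torch

# Rounded logits copied from the printed output above
logits = torch.tensor([[0.1802, -0.1991, -0.2394]])

# Softmax: exponentiate, then normalize each row so that it sums to 1
exp_logits = torch.exp(logits)
manual = exp_logits / exp_logits.sum(dim=1, keepdim=True)

print(manual)
print(torch.allclose(manual, torch.softmax(logits, dim=1)))
```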


In [None]:
import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # NEW
model = model.to(device) # NEW

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):
    # Training mode matters for components that behave differently during
    # training and inference, such as dropout or batch normalization layers
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        features, labels = features.to(device), labels.to(device) # NEW
        logits = model(features)
        loss = F.cross_entropy(logits, labels) # Loss function

        # reset the gradients to 0
        optimizer.zero_grad()
        loss.backward()
        # use the gradients to update the model parameters to minimize the loss
        optimizer.step()

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader)-1:03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation
    

with torch.no_grad():
    # The inputs must be on the same device as the model
    outputs = model(X_train.to(device))
torch.set_printoptions(sci_mode=False)

# In PyTorch, dim=0 typically corresponds to the batch dimension, so for
# outputs of shape (batch_size, num_classes), dim=1 is the class dimension.
# Applying softmax along dim=1 normalizes each row (each sample in the batch)
# independently, so the values can be interpreted as class-membership
# probabilities that sum to 1.
probas = torch.softmax(outputs, dim=1)
print(outputs)
print(probas)

predictions = torch.argmax(probas, dim=1)
print(predictions)

# Number of correct predictions on the training set
torch.sum(predictions == y_train.to(device))

Total number of trainable model parameters: 752
Epoch: 001/003 | Batch 000/001 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/001 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/001 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/001 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/001 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/001 | Train/Val Loss: 0.00
tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])
tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])
tensor([0, 0, 0, 1, 1])


tensor(5)
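
The choice of `dim` in the cell above is easy to get wrong, so here is a small sketch contrasting `dim=1` with `dim=0`. It assumes a made-up score matrix with two samples and three classes. `scores`, `per_sample`, and `per_class` are illustrative names.

```python
import torch

# Two samples (rows), three classes (columns)
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [3.0, 1.0, 0.5]])

per_sample = torch.softmax(scores, dim=1)  # normalizes across classes
per_class = torch.softmax(scores, dim=0)   # normalizes across the batch

print(per_sample.sum(dim=1))  # each row sums to 1
print(per_class.sum(dim=0))   # each column sums to 1

# Predicted class per sample
preds = torch.argmax(scores, dim=1)
print(preds)  # tensor([2, 0])
```

For classification, `dim=1` is the one that yields one probability distribution per sample.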

In [34]:
def compute_accuracy(model, dataloader, device):
    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):

        features, labels = features.to(device), labels.to(device) # New

        with torch.no_grad():
            logits = model(features)

        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)

    return (correct / total_examples).item()

In [43]:
compute_accuracy(model, train_loader, device=device)

1.0

In [44]:
compute_accuracy(model, test_loader, device=device)

1.0

### A.9.3 Training with multiple GPUs

See [DDP-script.py](DDP-script.py)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/12.webp" width="600px">
<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/appendix-a_compressed/13.webp" width="600px">
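
One ingredient of multi-GPU training is giving each process its own shard of the data. The sketch below illustrates that idea on the CPU. It assumes the data is split with `DistributedSampler`, a common companion to `DistributedDataParallel`; whether DDP-script.py does exactly this is not shown in this notebook. `toy_ds` and `shards` are illustrative names, and the two "ranks" are simulated in a single process.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Five toy records, indexed 0..4
toy_ds = TensorDataset(torch.arange(5))

# Passing num_replicas and rank explicitly avoids the need for an
# initialized process group; in a real DDP run, each process would
# construct the sampler with its own rank.
shards = {}
for rank in range(2):
    sampler = DistributedSampler(toy_ds, num_replicas=2, rank=rank, shuffle=False)
    shards[rank] = list(sampler)
    print(f"rank {rank}: {shards[rank]}")
```

Since five records do not divide evenly across two ranks, the sampler repeats one index so that both shards have the same length.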