# PyTorch in One Hour

- https://sebastianraschka.com/teaching/pytorch-1h/

This tutorial covers the following topics:

- An overview of the PyTorch deep learning library
- Setting up an environment and workspace for deep learning
- Tensors as a fundamental data structure for deep learning
- The mechanics of training deep neural networks
- Training models on GPUs

## The three core components of PyTorch

![fig_1](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_01.webp)
Figure 1. PyTorch's three main components include a tensor library as a fundamental building block for computing, automatic differentiation for model optimization, and deep learning utility functions, making it easier to implement and train deep neural network models.

- PyTorch is a tensor library that extends the array-oriented programming model of NumPy with accelerated computation on GPUs, providing a seamless switch between CPUs and GPUs.

- PyTorch is an automatic differentiation engine, also known as *autograd*, which enables the <u>automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization</u>.

- PyTorch is a <u>deep learning library</u>, meaning that it offers modular, flexible, and efficient building blocks (including pre-trained models, loss functions, and optimizers) for designing and training a wide range of deep learning models, catering to both researchers and developers.

## Deep learning

Unlike traditional machine learning techniques, which excel at simple pattern recognition, deep learning is particularly good at handling unstructured data such as images, audio, or text, which makes it well suited for LLMs.

The typical predictive modeling workflow (also referred to as supervised learning) in machine learning and deep learning is summarized in Figure 2.

![fig_2](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_03.webp)

## Installing PyTorch


![fig_3](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_04.webp)
Figure 3. Access the PyTorch installation recommendation on https://pytorch.org to customize and select the installation command for your system.


In [1]:
import torch
torch.__version__

'2.8.0'

## Check GPU availability

In [2]:
# NVIDIA GPU
torch.cuda.is_available()

False

In [3]:
# Apple M series

In [4]:
torch.backends.mps.is_available()

True

In [5]:
torch.backends.mps.is_built()

True

| `is_available()` / `is_built()` | Meaning |
|-|-|
| True / True | The MPS backend is built and available, so Metal acceleration is usable. |
| True / False | Rare; indicates partial support. Usually, PyTorch was not compiled with MPS. |
| False / True | PyTorch has MPS compiled in, but it can’t access your GPU right now (e.g. running in a virtual env without GPU access). |
| False / False | MPS is not supported, or PyTorch was built without the Metal backend. |


# Understanding tensors

- Tensors represent a mathematical concept that generalizes vectors and matrices to potentially higher dimensions. 
- Tensors are mathematical objects that can be characterized by their order (or rank), which provides the number of dimensions.
- For example,
    - a scalar (just a number) is a tensor of rank 0
    - a vector is a tensor of rank 1
    - a matrix is a tensor of rank 2, as illustrated in Figure 4.

![fig_4](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_06.webp)
Figure 4. An illustration of tensors with different ranks. Here 0D corresponds to rank 0, 1D to rank 1, and 2D to rank 2. Note that a 3D vector, which consists of 3 elements, is still a rank 1 tensor.



## Scalars, vectors, matrices, and tensors

- PyTorch tensors are data containers for array-like structures.
- There is no specific term for higher-dimensional tensors, so we typically refer to a 3-dimensional tensor as just a 3D tensor, and so forth.




In [3]:
# create a 0D tensor (scalar) from a Python integer
tensor0d = torch.tensor(1)
print(tensor0d, type(tensor0d), tensor0d.shape)

tensor(1) <class 'torch.Tensor'> torch.Size([])


In [4]:
# note: wrapping the value in a list creates a 1D tensor with one element, not a 0D tensor
tensor0d_new = torch.tensor([1])
print(tensor0d_new, type(tensor0d_new), tensor0d_new.shape)

tensor([1]) <class 'torch.Tensor'> torch.Size([1])


In [8]:
# create a 1D tensor (vector) from a Python list
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d, type(tensor1d), tensor1d.shape)

tensor([1, 2, 3]) <class 'torch.Tensor'> torch.Size([3])


In [9]:
# create a 2D tensor from a nested Python list
tensor2d = torch.tensor([[1, 2], [3, 4]])
print(tensor2d, type(tensor2d), tensor2d.shape)

tensor([[1, 2],
        [3, 4]]) <class 'torch.Tensor'> torch.Size([2, 2])


In [10]:
# create a 3D tensor from a nested Python list
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(tensor3d, type(tensor3d), tensor3d.shape)

tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]]) <class 'torch.Tensor'> torch.Size([2, 2, 2])


In [11]:
import numpy as np

# create a 4D tensor from a NumPy array
tensor4d = torch.tensor(np.arange(1, 17).reshape(2, 2, 2, 2))
print(tensor4d, type(tensor4d), tensor4d.shape)

tensor([[[[ 1,  2],
          [ 3,  4]],

         [[ 5,  6],
          [ 7,  8]]],


        [[[ 9, 10],
          [11, 12]],

         [[13, 14],
          [15, 16]]]]) <class 'torch.Tensor'> torch.Size([2, 2, 2, 2])


## Tensor data types

- When we create tensors from Python integers, as in the examples above, PyTorch adopts the default 64-bit integer data type from Python.
- We can access the data type of a tensor via the `.dtype` attribute of a tensor:

In [12]:
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)

torch.int64


- If we create tensors from Python floats, PyTorch creates tensors with 32-bit precision by default:



In [13]:
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


- This choice is primarily due to the balance between precision and computational efficiency.
- A 32-bit floating point number offers sufficient precision for most deep learning tasks, while consuming less memory and computational resources than a 64-bit floating point number.
- Moreover, GPU architectures are optimized for 32-bit computations, and using this data type can significantly speed up model training and inference.
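
The memory argument above can be checked directly. The following is a small sketch, assuming PyTorch's default float dtype has not been changed; it uses `element_size()` and `numel()` to compare the storage needed by the same three values in 32-bit and 64-bit precision.

```python
import torch

f32 = torch.tensor([1.0, 2.0, 3.0])   # float32 by default
f64 = f32.to(torch.float64)           # same values in 64-bit precision

# bytes per element times number of elements
bytes_f32 = f32.element_size() * f32.numel()
bytes_f64 = f64.element_size() * f64.numel()

print(f32.dtype, bytes_f32)
print(f64.dtype, bytes_f64)
```

The 64-bit tensor needs twice the memory for the same data, which adds up quickly for models with millions of parameters.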

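We can readily change a tensor’s data type (and thus its precision) using its `.to` method. The following code converts the 64-bit integer tensor from above into a 32-bit float tensor: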

In [14]:
floatvec = tensor1d.to(torch.float32)
print(floatvec.dtype)

torch.float32


In [15]:
# requires_grad=True asks autograd to track the operations applied to x
x = torch.tensor([[1., -2.], [3., 4.]], requires_grad=True)
x

tensor([[ 1., -2.],
        [ 3.,  4.]], requires_grad=True)

In [16]:
out = x.pow(2).sum() # sum of squared entries, so d(out)/dx = 2 * x
out

tensor(30., grad_fn=<SumBackward0>)

In [17]:
out.backward()

In [18]:
x.grad

tensor([[ 2., -4.],
        [ 6.,  8.]])

## Common PyTorch tensor operations


We already introduced the `torch.tensor()` function to create new tensors.

In [19]:
tensor2d = torch.tensor([[1, 2, 3],
                         [4, 5, 6]])
tensor2d

tensor([[1, 2, 3],
        [4, 5, 6]])

In [20]:
tensor2d.shape

torch.Size([2, 3])

In [21]:
tensor2d.reshape(3, 2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

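However, note that much existing PyTorch code uses `.view()` for reshaping instead.

> Most traditional PyTorch code and tutorials use `.view()` because it was introduced first. `.view()` never copies data, but it requires the requested shape to be compatible with the tensor's memory layout (for example, a contiguous tensor) and raises an error otherwise. `.reshape()` is more versatile for general use: it returns a view when possible and falls back to copying the data when it is not.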
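
A minimal sketch of the practical difference between the two methods, using a transposed (and therefore non-contiguous) tensor as the example case. The variable names are only for illustration.

```python
import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

v = t.view(3, 2)                 # works: t is contiguous, no data is copied
shares_memory = v.data_ptr() == t.data_ptr()

t_T = t.T                        # transposing makes the tensor non-contiguous
try:
    t_T.view(6)                  # .view() cannot express this shape without a copy
    view_failed = False
except RuntimeError:
    view_failed = True

r = t_T.reshape(6)               # .reshape() falls back to copying the data

print(shares_memory, view_failed, r)
```

If `.view()` raises an error like this in practice, calling `.reshape()` (or `.contiguous().view(...)`) is the usual way out.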



Next, we can use `.T` to transpose a tensor, which means flipping it across its diagonal. 

In [22]:
tensor2d.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])

Lastly, the common way to multiply two matrices in PyTorch is the `.matmul` method or, equivalently, the more compact `@` operator:



In [23]:
tensor2d.matmul(tensor2d.T)

tensor([[14, 32],
        [32, 77]])

In [24]:
tensor2d @ tensor2d.T

tensor([[14, 32],
        [32, 77]])

# Seeing models as computation graphs
PyTorch’s autograd system provides functions to compute gradients in dynamic computational graphs automatically. 

- A computational graph (or computation graph for short) is a directed graph that allows us to express and visualize mathematical expressions.
- In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network.
- We will need this later to compute the required gradients for backpropagation, which is the main training algorithm for neural networks.

The following code implements the forward pass (prediction step) of a simple logistic regression classifier, which can be seen as a single-layer neural network, returning a score between 0 and 1 that is compared to the true class label (0 or 1) when computing the loss:

In [5]:
import torch.nn.functional as F

y = torch.tensor([1.0])  # true label
x1 = torch.tensor([1.1]) # input feature
w1 = torch.tensor([2.2]) # weight parameter
b = torch.tensor([0.0])  # bias unit

z = x1 * w1 + b          # net input
print(z)
a = torch.sigmoid(z)     # activation & output

loss = F.binary_cross_entropy(a, y)
print(loss)

tensor([2.4200])
tensor(0.0852)


- Sigmoid activation: $a = \sigma(z) = \frac{1}{1 + e^{-z}}$
- Binary cross-entropy loss: $\text{BCE}(a, y) = -[y \cdot \log(a) + (1-y) \cdot \log(1-a)]$
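
To connect the two formulas above with the printed numbers, here is a small sketch that evaluates them with Python's `math` module only, using the same values for `x1`, `w1`, `b`, and `y` as in the PyTorch code. It should agree with the net input of 2.42 and the loss of roughly 0.0852 printed earlier.

```python
import math

x1, w1, b, y = 1.1, 2.2, 0.0, 1.0

z = x1 * w1 + b                       # net input
a = 1.0 / (1.0 + math.exp(-z))        # sigmoid activation
loss = -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))  # binary cross-entropy

print(round(z, 4), round(a, 4), round(loss, 4))
```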

The point of this example is not to implement a logistic regression classifier but rather to illustrate how we can think of a sequence of computations as a computation graph.

![fig_5](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_07.webp)
A logistic regression forward pass as a computation graph. The input feature `x1` is multiplied by a model weight `w1` and passed through an activation function *σ* after adding the bias. The loss is computed by comparing the model output `a` with a given label `y`.

In fact, PyTorch builds such a computation graph in the background, and we can use this to calculate gradients of a loss function with respect to the model parameters (here `w1` and `b`) to train the model, which is the topic of the upcoming sections.

# Automatic differentiation made easy

If we carry out computations in PyTorch, it will build such a graph internally by default if one of its leaf nodes (terminal nodes) has the `requires_grad` attribute set to `True`. 

Gradients are required when training neural networks via the popular <u>backpropagation algorithm</u>, which can be thought of as an implementation of the chain rule from calculus for neural networks.

![fig_6](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_08.webp)

Partial derivatives and gradients. 

- The figure above shows partial derivatives, which measure the rate at which a function changes with respect to one of its variables.
- A gradient is a vector containing all of the partial derivatives of a multivariate function, a function with more than one variable as input.
- On a high level, the chain rule is a way to compute gradients of a loss function with respect to the model’s parameters in a computation graph.
- This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model’s performance, using a method such as gradient descent.

By tracking every operation performed on tensors, PyTorch’s autograd engine constructs a computational graph in the background. Then, calling the `grad` function, we can compute the gradients of the loss with respect to the model parameters `w1` and `b` as follows:

In [26]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])

# w1 = torch.tensor([2.2], requires_grad=False)
# b = torch.tensor([0.0], requires_grad=False)

w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)


z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

By default, PyTorch destroys the computation graph after calculating the gradients to free memory. However, since we are going to reuse this computation graph shortly, we set `retain_graph=True` so that it stays in memory.



In [27]:
print(grad_L_w1)
print(grad_L_b)

(tensor([-0.0898]),)
(tensor([-0.0817]),)


Above, we have been using the grad function “manually,” which can be useful for experimentation, debugging, and demonstrating concepts. 

But in practice, PyTorch provides even more high-level tools to automate this process. 

For instance, we can call `.backward()` on the loss, and PyTorch will compute the gradients of the loss with respect to all leaf nodes in the graph that require gradients, storing them in the tensors’ `.grad` attributes:

In [28]:
loss.backward()

print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
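
As a sanity check on the gradients printed above, the chain rule for this tiny graph can be worked out by hand. For a sigmoid output combined with binary cross-entropy, the derivative of the loss with respect to the net input `z` simplifies to `a - y`, so the gradients are `(a - y) * x1` for `w1` and `a - y` for `b`. The sketch below compares these hand-derived expressions (the `manual_*` names are only for illustration) with what autograd computes.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

a = torch.sigmoid(x1 * w1 + b)
loss = F.binary_cross_entropy(a, y)
loss.backward()                       # autograd fills w1.grad and b.grad

with torch.no_grad():
    manual_grad_b = a - y             # dL/db  = a - y
    manual_grad_w1 = (a - y) * x1     # dL/dw1 = (a - y) * x1

print(w1.grad, manual_grad_w1)
print(b.grad, manual_grad_b)
```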


# Implementing multilayer neural networks

![fig_7](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_09.webp)
An illustration of a multilayer perceptron with 2 hidden layers. Each node represents a unit in the respective layer. Each layer has only a very small number of nodes for illustration purposes.

When implementing a neural network in PyTorch, we typically subclass the `torch.nn.Module` class to define our own custom network architecture. This `Module` base class provides a lot of functionality, making it easier to build and train models. For instance, it allows us to encapsulate layers and operations and keep track of the model’s parameters.

Within this subclass, 
- we define the network layers in the `__init__` constructor and specify how they interact in the `forward` method.
- The `forward` method describes how the input data passes through the network and comes together as a computation graph.
- In contrast, the `backward` method, which we typically do not need to implement ourselves, is used during training to compute gradients of the loss function with respect to the model parameters.

The following code implements a classic multilayer perceptron with two hidden layers to illustrate a typical usage of the `Module` class:


In [29]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()

        self.layers = torch.nn.Sequential(

            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits

In [30]:
import torch
import torch.nn as nn
import torch.nn.functional as F

def create_model(num_inputs, num_outputs):
    model = nn.Sequential(
        nn.Linear(num_inputs, 30),
        nn.ReLU(),
        nn.Linear(30, 20),
        nn.ReLU(),
        nn.Linear(20, num_outputs)
    )
    return model

num_inputs = 50
num_outputs = 3
model_f = create_model(num_inputs, num_outputs)


x = torch.randn(5, num_inputs)

logits = model_f(x)

print(logits)
print(logits.shape)

tensor([[-0.1536,  0.0858,  0.3405],
        [-0.1462, -0.0495,  0.3382],
        [-0.1153,  0.0292,  0.2675],
        [-0.1191, -0.2425,  0.2436],
        [-0.0715,  0.0565,  0.1399]], grad_fn=<AddmmBackward0>)
torch.Size([5, 3])


We can then instantiate a new neural network object as follows:



In [31]:
model = NeuralNetwork(50, 3)
model

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)

In [32]:
model_f = create_model(num_inputs, num_outputs)
model_f

Sequential(
  (0): Linear(in_features=50, out_features=30, bias=True)
  (1): ReLU()
  (2): Linear(in_features=30, out_features=20, bias=True)
  (3): ReLU()
  (4): Linear(in_features=20, out_features=3, bias=True)
)

Note that we used the `Sequential` class when we implemented the `NeuralNetwork` class. 

Using `Sequential` is not required, but it can make our life easier if we have a series of layers that we want to execute in a specific order, as is the case here. 

This way, after instantiating `self.layers = Sequential(...)` in the `__init__` constructor, we just have to call `self.layers` instead of calling each layer individually in the `forward` method of `NeuralNetwork`.

Next, let’s check the total number of trainable parameters of this model:

In [33]:
num_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213


In [34]:
num_params = sum(
    p.numel() for p in model_f.parameters() if p.requires_grad
)
print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213


In [35]:
(50 * 30 + 30) + (30 * 20 + 20) + (20 * 3 + 3)

2213

In [36]:
for p in model.parameters():
    if p.requires_grad:
        print(p.numel())

1500
30
600
20
60
3


Note that each parameter for which `requires_grad=True` counts as a trainable parameter and will be updated during training.

In the case of our neural network model with the two hidden layers above, these trainable parameters are contained in the `torch.nn.Linear` layers. A linear layer multiplies the inputs by a weight matrix and adds a bias vector. This is sometimes also referred to as a ***feedforward*** or ***fully connected layer***.
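
What a linear layer computes can be written out by hand. `torch.nn.Linear` stores its weight with shape `(out_features, in_features)`, so the layer output is `x @ W.T + b`. The sketch below checks this on a small layer; the sizes 4 and 3, the batch size 2, and the seed are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Linear(4, 3)        # 4 input features, 3 output features
x = torch.randn(2, 4)                # batch of 2 examples

out = layer(x)
manual = x @ layer.weight.T + layer.bias   # the same computation written out

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(out, manual, atol=1e-6))
```

This also explains the parameter count of such a layer: `in_features * out_features` weights plus `out_features` biases.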

Based on the `print(model)` call we executed above, we can see that the first Linear layer is at index position 0 in the layers attribute. We can access the corresponding weight parameter matrix as follows:

In [37]:
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0189,  0.0539,  0.0336,  ...,  0.0314,  0.1205, -0.1234],
        [ 0.0750,  0.0147, -0.0704,  ..., -0.0581,  0.0733, -0.0049],
        [ 0.0690,  0.1288,  0.0683,  ..., -0.1351, -0.0269, -0.1198],
        ...,
        [ 0.0771,  0.0759,  0.0426,  ...,  0.0330, -0.0856,  0.1361],
        [-0.1026, -0.0767,  0.1187,  ..., -0.1289, -0.0432, -0.0453],
        [ 0.0311,  0.1265, -0.1335,  ...,  0.0178,  0.0390,  0.0291]],
       requires_grad=True)


In [38]:
model.layers[0].bias

Parameter containing:
tensor([-0.0926, -0.0136,  0.0910, -0.0180,  0.0377, -0.0586, -0.0292, -0.0024,
        -0.0862, -0.0338, -0.0357,  0.0130, -0.0719,  0.0206,  0.0377,  0.1295,
         0.0698, -0.0370, -0.0904,  0.0012, -0.1218, -0.0387, -0.0920, -0.0168,
         0.0341,  0.0153,  0.1381,  0.1095,  0.0670,  0.1255],
       requires_grad=True)

In [39]:
print(model.layers[0].weight.shape)

torch.Size([30, 50])


In [40]:
print(model.layers[0].bias.shape)

torch.Size([30])


- The weight matrix above is a 30x50 matrix, that is, `out_features` x `in_features`, because the first layer maps 50 inputs to 30 hidden units.
- We can see that `requires_grad` is set to `True`, which means its entries are trainable.
- This is the default setting for weights and biases in `torch.nn.Linear`.

In deep learning, initializing model weights with small random numbers is desired to break symmetry during training – otherwise, the nodes would be just performing the same operations and updates during backpropagation, which would not allow the network to learn complex mappings from inputs to outputs.

However, while we want to keep using small random numbers as initial values for our layer weights, we can make the random number initialization reproducible by seeding PyTorch’s random number generator via `manual_seed`:

In [41]:
torch.manual_seed(123)

model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)


In [42]:
torch.manual_seed(123)

model_f = create_model(50, 3)
print(model_f[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)


Now, after we spent some time inspecting the `NeuralNetwork` instance, let’s briefly see how it’s used via the forward pass:

In [43]:
torch.manual_seed(123)

X = torch.rand((1, 50))
out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


In [44]:
torch.manual_seed(123)

X = torch.rand((1, 50))
out = model_f(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


In the code above, we generated a single random training example `X` as a toy input (note that our network expects 50-dimensional feature vectors) and fed it to the model, returning three scores. When we call `model(X)`, it will automatically execute the forward pass of the model.

The forward pass refers to calculating output tensors from input tensors. This involves passing the input data through all the neural network layers, starting from the input layer, through hidden layers, and finally to the output layer.

These three numbers returned above correspond to a score assigned to each of the three output nodes. Notice that the output tensor also includes a `grad_fn` value.

Here, `grad_fn=<AddmmBackward0>` represents the last-used function to compute a variable in the computational graph. In particular, it means that the tensor we are inspecting was created via an `Addmm` operation, which stands for <u>matrix multiplication (mm) followed by an addition (Add)</u>. PyTorch will use this information when it computes gradients during backpropagation.

If we just want to use a network without training or backpropagation, for example, if we use it for prediction after training, constructing this computational graph for backpropagation can be wasteful as it performs unnecessary computations and consumes additional memory. 

So, when we <u>use a model for inference (for instance, making predictions) rather than training, it is a best practice to use the `torch.no_grad()` context manager</u>, as shown below. This tells PyTorch that it doesn’t need to keep track of the gradients, which can result in significant savings in memory and computation.

In [45]:
with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])
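
The effect of `torch.no_grad()` can be checked directly on the output tensor. The following sketch uses a small stand-in layer (its sizes are arbitrary assumptions) rather than the model above, and compares an output computed with and without gradient tracking.

```python
import torch

layer = torch.nn.Linear(3, 2)        # small stand-in for a model
x = torch.rand(1, 3)

tracked = layer(x)                   # graph is recorded for backpropagation

with torch.no_grad():
    untracked = layer(x)             # no graph is recorded

print(tracked.requires_grad, tracked.grad_fn)
print(untracked.requires_grad, untracked.grad_fn)
```

The numerical values are the same in both cases; only the bookkeeping for backpropagation is skipped.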


In PyTorch, it’s common practice to code models such that they return the outputs of the last layer (`logits`) without passing them to a nonlinear activation function. 

That’s because PyTorch’s commonly used loss functions combine the softmax (or sigmoid for binary classification) operation with the negative log-likelihood loss in a single class.

The reason for this is numerical efficiency and stability. So, if we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly:

In [46]:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)

tensor([[0.3113, 0.3934, 0.2952]])


In [47]:
out.sum()

tensor(1.)

The values can now be interpreted as class-membership probabilities that sum up to 1. The values are roughly equal for this random input, which is expected for a randomly initialized model without training.
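
To illustrate why models return logits, the sketch below shows that `F.cross_entropy` applied to logits gives the same value as a log-softmax followed by the negative log-likelihood loss, and the same value as taking the negative log of the softmax probability by hand. The logits are the three scores printed above; the target class 1 is an arbitrary assumption for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-0.1262, 0.1080, -0.1792]])
target = torch.tensor([1])           # hypothetical true class

# softmax and negative log-likelihood combined in one call
combined = F.cross_entropy(logits, target)

# the same computation in two explicit steps
two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)

# and fully by hand: negative log of the probability of the true class
probas = torch.softmax(logits, dim=1)
by_hand = -torch.log(probas[0, 1])

print(combined, two_step, by_hand)
```

Passing softmax probabilities to `F.cross_entropy` instead of logits would apply the softmax twice, which is a common source of silently wrong losses.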



# Setting up efficient data loaders

![fig_8](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_10.webp)

We will implement a custom Dataset class that we will use to create a training and a test dataset that we’ll then use to create the data loaders.

Let’s start by creating a simple toy dataset of five training examples with two features each.

In [48]:
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

In [49]:
X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

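Note that PyTorch expects class labels to start at 0, and the largest class label should not exceed the number of output nodes minus 1. For example, if we have class labels 0, 1, 2, 3, and 4, the neural network output layer should consist of 5 nodes.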

Next, we create a custom dataset class, `ToyDataset`, by subclassing from PyTorch’s `Dataset` parent class, as shown below.

In [50]:
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

The purpose of this custom `ToyDataset` class is to instantiate a PyTorch `DataLoader`. But before we get to this step, let’s briefly go over the general structure of the `ToyDataset` code.

In PyTorch, the three main components of a custom `Dataset` class are the `__init__` constructor, the `__getitem__` method, and the `__len__` method.

In the `__init__` method, 
- we set up attributes that we can access later in the `__getitem__` and `__len__` methods.
- This could be file paths, file objects, database connectors, and so on.
- Since we created a tensor dataset that sits in memory, we are simply assigning X and y to these attributes, which are placeholders for our tensor objects.

In the `__getitem__` method, 
- we define instructions for returning exactly one item from the dataset via an index.
- This means the features and the class label corresponding to a single training example or test instance.
- The data loader will provide this index, which we will cover shortly.

Finally, the `__len__` method
- contains instructions for retrieving the length of the dataset.
- Here, we use the `.shape` attribute of the label tensor to return the number of examples, which equals the number of rows in the feature array.
- In the case of the training dataset, we have five rows, which we can double-check as follows:

In [51]:
len(train_ds)

5

In [52]:
type(train_ds.features), type(train_ds.labels)

(torch.Tensor, torch.Tensor)

In [53]:
train_ds.features, train_ds.labels

(tensor([[-1.2000,  3.1000],
         [-0.9000,  2.9000],
         [-0.5000,  2.6000],
         [ 2.3000, -1.1000],
         [ 2.7000, -1.5000]]),
 tensor([0, 0, 0, 1, 1]))
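
To see `__getitem__` and `__len__` in action without a data loader, we can index a dataset object directly. The sketch below repeats the `ToyDataset` definition so that it runs on its own and uses a hypothetical two-example dataset.

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        # return exactly one (features, label) pair
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]

X = torch.tensor([[-1.2, 3.1], [2.3, -1.1]])
y = torch.tensor([0, 1])
ds = ToyDataset(X, y)

features, label = ds[1]    # square brackets call ds.__getitem__(1)
n = len(ds)                # len() calls ds.__len__()

print(features, label, n)
```

This is the same call the `DataLoader` makes internally for each index it samples.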

Now that we defined a PyTorch `Dataset` class we can use for our toy dataset, we can use PyTorch’s `DataLoader` class to sample from it, as shown in the code below:

In [54]:
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2, 
    shuffle=True,
    num_workers=0
)

In [55]:
test_ds = ToyDataset(X_test, y_test)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

After instantiating the training data loader, we can iterate over it as shown below. 

In [56]:
# batch_size=2, 
# shuffle=True,

for idx, (x, y) in enumerate(train_loader):
    print(f"Batch #{idx+1}\nx: {x},\ny: {y}\n")

Batch #1
x: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]),
y: tensor([1, 0])

Batch #2
x: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]),
y: tensor([0, 0])

Batch #3
x: tensor([[ 2.7000, -1.5000]]),
y: tensor([1])



As we can see based on the output above, the train_loader iterates over the training dataset visiting each training example exactly once. This is known as a training ***epoch***. 

Since we seeded the random number generator using `torch.manual_seed(123)` above, you should get the exact same shuffling order of training examples as shown above. 

However, <u>if you iterate over the dataset a second time, you will see that the shuffling order will change</u>. This is desired to prevent deep neural networks from getting caught in repetitive update cycles during training.
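
The following sketch illustrates this behavior on a hypothetical dataset that simply contains the indices 0 to 4. Instead of seeding the global random number generator, it passes a dedicated `torch.Generator` to the `DataLoader` via the `generator` argument, which is one possible way to make the shuffling reproducible. The helper `epoch_order` is only for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(5))   # five "examples": 0, 1, 2, 3, 4

def epoch_order(loader):
    # collect the example indices in the order the loader yields them
    return [int(i) for (batch,) in loader for i in batch]

g = torch.Generator().manual_seed(123)
loader = DataLoader(ds, batch_size=2, shuffle=True, generator=g)

first_epoch = epoch_order(loader)     # one full pass over the data
second_epoch = epoch_order(loader)    # reshuffled on the second pass

# a fresh loader with the same seed reproduces the first epoch
g2 = torch.Generator().manual_seed(123)
loader2 = DataLoader(ds, batch_size=2, shuffle=True, generator=g2)
first_epoch_again = epoch_order(loader2)

print(first_epoch, second_epoch, first_epoch_again)
```

Each epoch still visits every example exactly once; only the order differs between passes.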



In practice, having a substantially smaller batch as the last batch in a training epoch can disturb the convergence during training. To prevent this, it’s recommended to set `drop_last=True`, which will drop the last batch in each epoch, as shown below:

In [57]:
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True # drop the last batch
)

In [58]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch #{idx+1}\nx: {x},\ny: {y}\n")

Batch #1
x: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]),
y: tensor([0, 0])

Batch #2
x: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]),
y: tensor([1, 0])
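
The effect of `drop_last` on the number and size of batches can be verified without looking at the data itself. The sketch below uses a hypothetical five-example dataset and a batch size of 2, matching the setup above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(5))   # five examples

keep = DataLoader(ds, batch_size=2)                   # keeps the incomplete last batch
drop = DataLoader(ds, batch_size=2, drop_last=True)   # drops it

sizes_keep = [len(batch) for (batch,) in keep]
sizes_drop = [len(batch) for (batch,) in drop]

print(len(keep), sizes_keep)
print(len(drop), sizes_drop)
```

Note that with `drop_last=True` and shuffling enabled, a different example is left out in each epoch, so every example still contributes to training over time.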



Lastly, let’s discuss the setting `num_workers=0` in the DataLoader. 

- This parameter of PyTorch’s `DataLoader` class is crucial for <u>parallelizing data loading and preprocessing</u>.
- When `num_workers` is set to ***0***, the data loading will be done in the main process and not in separate worker processes.
- This might seem unproblematic, but it can lead to significant slowdowns during model training when we train larger networks on a GPU.
- This is because instead of focusing solely on the processing of the deep learning model, the CPU must also take time to load and preprocess the data.
- As a result, the GPU can sit idle while waiting for the CPU to finish these tasks.

In contrast, when `num_workers` is set to ***a number greater than zero***, 
- multiple worker processes are launched to load data in parallel, freeing the main process to focus on training your model and better utilizing your system’s resources.

![fig_9](https://sebastianraschka.com/images/teaching/pytorch-1h/figure_11.webp)
Figure. Loading data without multiple workers (setting `num_workers=0`) will create a data loading bottleneck where the model sits idle until the next batch is loaded as illustrated in the left subpanel. If multiple workers are enabled, the data loader can already queue up the next batch in the background as shown in the right subpanel.


If we are working with very small datasets, 
- setting `num_workers` to 1 or larger may not be necessary since the total training time takes only fractions of a second anyway.
- If you are working with tiny datasets or interactive environments such as Jupyter notebooks, increasing `num_workers` may not provide any noticeable speedup.
- It might, in fact, lead to some issues. One potential issue is the overhead of spinning up multiple worker processes, which could take longer than the actual data loading when your dataset is small.
- For Jupyter notebooks, setting `num_workers` to greater than 0 can sometimes lead to issues related to the sharing of resources between different processes, resulting in errors or notebook crashes.

Setting `num_workers=4` is often a good starting point on many real-world datasets, but the optimal setting depends on your hardware and on the code used for loading a training example defined in the `Dataset` class.

# Training loop

We’ve discussed all the requirements for training neural networks: 
- PyTorch’s tensor library, autograd, the Module API, and efficient data loaders.

Let’s now combine all these things and train a neural network on the toy dataset from the previous section. The training code is shown below.

In [59]:
import torch.nn.functional as F


torch.manual_seed(123)

model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

In [60]:
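# Binary cross-entropy written out by hand for reference; note that it expects
# probabilities (e.g., sigmoid outputs), not raw logits:
# f_loss = -(labels * np.log(probas) + (1 - labels) * np.log(1 - probas))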

In [61]:
num_epochs = 3

for epoch in range(num_epochs):

    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)

        loss = F.cross_entropy(logits, labels) # Loss function

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation

Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00


As we can see, the loss drops to roughly zero after three epochs, a sign that the model has converged on the training set. However, before we evaluate the model’s predictions, let’s go over some of the details of the preceding code.

First, note that we initialized a model with two inputs and two outputs. 
- That’s because the toy dataset from the previous section has <u>two input features and two class labels</u> to predict.
- We used a ***stochastic gradient descent (SGD) optimizer*** with a ***learning rate (lr)*** of 0.5.
- The learning rate is a hyperparameter, meaning it’s a <u>tunable setting that we have to experiment with based on observing the loss</u>.
- Ideally, we want to choose a learning rate such that the loss converges after a certain number of epochs – the number of epochs is another hyperparameter to choose.
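To illustrate why the learning rate needs tuning, here is a small sketch on a made-up one-parameter loss, `f(w) = w**2`, rather than on the network above. The function name `gradient_descent` and the candidate learning rates are assumptions chosen for illustration only.

```python
def gradient_descent(lr, num_steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final loss."""
    for _ in range(num_steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # gradient descent update
    return w ** 2


for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final loss {gradient_descent(lr):.6f}")
```

With the smallest learning rate the loss is still far from zero after 20 steps, the middle value converges, and the largest value makes the loss grow instead of shrink. Observing the loss in this way is what guides the choice in practice.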

In practice, 
- we often use a third dataset, a so-called ***validation dataset***, <u>to find the optimal hyperparameter settings</u>.
- A ***validation dataset*** is similar to a ***test set***. However, while we only want to use a ***test set*** <u>precisely once to avoid biasing the evaluation</u>, we usually use the ***validation set*** <u>multiple times to tweak the model settings</u>.

We also introduced two new method calls, `model.train()` and `model.eval()`. 
- As the names imply, they put the model into training mode and evaluation mode, respectively.
- This is necessary for components that behave differently during training and inference, such as ***dropout*** or ***batch normalization*** layers.
- Since we don’t have dropout or other components in our NeuralNetwork class that are affected by these settings, using `model.train()` and `model.eval()` is redundant in our code above.
- However, it’s best practice to include them anyway to avoid unexpected behaviors when we change the model architecture or reuse the code to train a different model.
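The difference between the two modes becomes visible as soon as a layer such as dropout is present. The sketch below uses a hypothetical model that consists of nothing but a dropout layer (`dropout_model` is not part of the tutorial code) to show the effect in isolation.

```python
import torch

torch.manual_seed(123)

# Hypothetical single-layer "model" that only applies dropout
dropout_model = torch.nn.Sequential(torch.nn.Dropout(p=0.5))
x = torch.ones(1, 8)

dropout_model.train()            # training mode: dropout is active
train_out = dropout_model(x)
print(dropout_model.training)    # True
print(train_out)                 # entries are either 0.0 or 2.0 (scaled by 1 / (1 - p))

dropout_model.eval()             # evaluation mode: dropout is a no-op
eval_out = dropout_model(x)
print(dropout_model.training)    # False
print(eval_out)                  # identical to the input
```

Forgetting `model.eval()` before inference would therefore make the predictions of such a model random from call to call.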

As discussed earlier, 
- we pass the ***logits*** directly into the ***cross_entropy loss function***, which will apply the ***softmax function*** internally for efficiency and numerical stability reasons. 

- Then, calling `loss.backward()` <u>will calculate the gradients in the computation graph that PyTorch constructed in the background</u>. 

- The `optimizer.step()` <u>method will use the gradients to update the model parameters to minimize the loss</u>. In the case of the ***SGD optimizer***, this <u>means multiplying the gradients by the learning rate and adding the scaled negative gradient to the parameters</u>.

- To prevent undesired gradient accumulation, it is important to include an `optimizer.zero_grad()` <u>call in each update round to reset the gradients to zero</u>. Otherwise, the gradients from successive `loss.backward()` calls will add up, which is usually not what we want.
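The interplay of these three calls can be checked on a tiny example. The following sketch assumes a hypothetical one-parameter problem with loss `(w * x - y)**2`; the names `w`, `x`, `y`, and `sgd` are illustrative and unrelated to the model above. It shows gradient accumulation when the reset is skipped, and it reproduces the plain SGD update by hand.

```python
import torch

# Hypothetical one-parameter problem: loss = (w * x - y)**2
w = torch.tensor([1.0], requires_grad=True)
x, y = torch.tensor([2.0]), torch.tensor([3.0])
sgd = torch.optim.SGD([w], lr=0.1)

loss = ((w * x - y) ** 2).sum()
loss.backward()
first_grad = w.grad.clone()        # d(loss)/dw = 2 * (w*x - y) * x = -4.0

# Without zero_grad(), a second backward pass adds to the stored gradient
loss = ((w * x - y) ** 2).sum()
loss.backward()
accumulated_grad = w.grad.clone()  # -8.0, i.e. twice the true gradient

# Resetting first gives the correct gradient again
sgd.zero_grad()
loss = ((w * x - y) ** 2).sum()
loss.backward()

# Plain SGD update done by hand: w_new = w - lr * grad
expected_w = w.detach() - 0.1 * w.grad
sgd.step()

print(first_grad, accumulated_grad)
print(w.detach(), expected_w)
```

The manual update and `sgd.step()` agree because this optimizer is configured without momentum or weight decay. With those options enabled, the update rule has additional terms.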

# Evaluation

Now that we have trained the model, we can use it to make predictions, as shown below:

In [62]:
model.eval()

with torch.no_grad():
    outputs = model(X_train)

print(outputs)

tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])


To obtain the class membership probabilities, we can then use PyTorch’s ***softmax function***, as follows:

In [63]:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])


Let’s consider the first row in the code output above. Here, the first value (column) means that the training example has a 99.91% probability of belonging to class 0 and a 0.09% probability of belonging to class 1. (The `set_printoptions` call is used here to make the outputs more legible.)
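To see where these probabilities come from, the softmax computation for the first row can be reproduced by hand. This is a minimal sketch in plain Python that copies the two logits from the printed model outputs; the variable names are illustrative.

```python
import math

logits = [2.8569, -4.1618]   # first row of the model outputs above

# Softmax: exponentiate each logit and normalize so the values sum to 1
exps = [math.exp(z) for z in logits]
probas_row = [e / sum(exps) for e in exps]
print([round(p, 4) for p in probas_row])   # [0.9991, 0.0009]

# Cross-entropy loss for this example, given that its true label is class 0
loss_row = -math.log(probas_row[0])
print(round(loss_row, 4))
```

The cross-entropy term is the negative log of the probability assigned to the true class, so a confident and correct prediction like this one contributes a loss close to zero.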

We can convert these values into class label predictions using PyTorch’s `argmax` function, which <u>returns the index position of the highest value in each row if we set</u> `dim=1` (setting `dim=0` <u>would return the index position of the highest value in each column</u>):



In [64]:
predictions = torch.argmax(probas, dim=1) # probas
print(predictions)

tensor([0, 0, 0, 1, 1])
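The role of the `dim` argument is easiest to see on a small tensor whose row-wise and column-wise results differ. The `scores` tensor below is made up for illustration.

```python
import torch

scores = torch.tensor([[0.1, 0.9],
                       [0.8, 0.2],
                       [0.3, 0.7]])

print(torch.argmax(scores, dim=1))  # tensor([1, 0, 1]): best column index per row
print(torch.argmax(scores, dim=0))  # tensor([1, 0]): best row index per column
```

For class predictions we want one label per example (row), which is why `dim=1` is the setting used here.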


Note that it is <u>unnecessary to compute softmax probabilities to obtain the class labels</u>. We could also apply the argmax function to the logits (outputs) directly:



In [65]:
predictions = torch.argmax(outputs, dim=1)
predictions

tensor([0, 0, 0, 1, 1])

Above, we computed the predicted labels for the training dataset. Since the training dataset is relatively small, we could compare the predictions to the true training labels by eye and see that the model is 100% correct. We can double-check this using the `==` comparison operator:

In [66]:
predictions == y_train

tensor([True, True, True, True, True])

In [67]:
torch.sum(predictions == y_train)

tensor(5)

To generalize the computation of the prediction accuracy, let’s implement a `compute_accuracy` function, as shown in the following code.

In [68]:
def compute_accuracy(model, dataloader):

    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):

        with torch.no_grad():
            logits = model(features)

        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)

    return (correct / total_examples).item()

Note that the `compute_accuracy` function above iterates over a data loader to compute the number and fraction of correct predictions. This is because, when we work with large datasets, we typically can only call the model on a small part of the dataset at a time due to memory limitations. 

The `compute_accuracy` function above is a general method that scales to datasets of arbitrary size since, in each iteration, the dataset chunk that the model receives is the same size as the batch size seen during training.

Notice that the internals of the `compute_accuracy` function are similar to what we used before when we converted the logits to the class labels.
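One design detail worth spelling out is that the function sums the correct predictions and divides once at the end, instead of averaging per-batch accuracies. The two can differ when the last batch is smaller than the others. The sketch below uses made-up per-example results (the `batches` list is hypothetical) to show the difference in plain Python.

```python
# Hypothetical per-example results (True = correct), in batches of 4, 4, and 1
batches = [
    [True, True, True, True],
    [True, True, True, True],
    [False],                    # smaller final batch
]

# Approach used by compute_accuracy: accumulate counts, divide once
correct = sum(sum(batch) for batch in batches)
total = sum(len(batch) for batch in batches)
overall_accuracy = correct / total

# Naive alternative: average the per-batch accuracies
mean_of_batch_accuracies = sum(sum(b) / len(b) for b in batches) / len(batches)

print(overall_accuracy)           # 8 correct out of 9
print(mean_of_batch_accuracies)   # (1.0 + 1.0 + 0.0) / 3
```

The naive average gives the single example in the last batch the same weight as four examples elsewhere, which understates the accuracy in this case.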

We can then apply the function to the training and test sets as follows:

In [69]:
compute_accuracy(model, train_loader)

1.0

In [70]:
compute_accuracy(model, test_loader)

1.0

# Saving and loading models

Here’s the recommended way to save and load models in PyTorch:

In [74]:
torch.save(model.state_dict(), 'model.pth')

The model’s `state_dict` is a Python dictionary object that maps each layer in the model to its trainable parameters (weights and biases). 

Note that `"model.pth"` is an arbitrary filename for the model file saved to disk. We can give it any name and file ending we like; however, `.pth` and `.pt` are the most common conventions.
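To get a feel for what a `state_dict` contains, we can print its keys and tensor shapes. The sketch below uses a hypothetical two-layer `tiny_model` built with `torch.nn.Sequential`, not the `NeuralNetwork` class from this tutorial, so the key names differ from those of the saved model.

```python
import torch

# Hypothetical two-layer model, used only to show the state_dict layout
tiny_model = torch.nn.Sequential(
    torch.nn.Linear(2, 3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 2),
)

for name, tensor in tiny_model.state_dict().items():
    print(name, tuple(tensor.shape))
```

Each key combines the layer's position in the model with the parameter name (`weight` or `bias`). The ReLU layer has no parameters and therefore no entry.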

Once we have saved the model, we can restore it from disk as follows:

In [75]:
model = NeuralNetwork(2, 2) # needs to match the original model exactly
model.load_state_dict(torch.load('model.pth', weights_only=True))

<All keys matched successfully>

The `torch.load("model.pth", weights_only=True)` call reads the file `"model.pth"` and reconstructs the Python dictionary object containing the model’s parameters, while `model.load_state_dict()` applies these parameters to the model, effectively restoring its learned state from when we saved it.

Note that the line `model = NeuralNetwork(2, 2)` above is not strictly necessary if you execute this code in the same session where you saved a model. 

However, I included it here to illustrate that we need an instance of the model in memory to apply the saved parameters. Here, the `NeuralNetwork(2, 2)` architecture needs to match the original saved model exactly.
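A quick way to convince ourselves that the round trip works is to save a model, load the parameters into a freshly initialized copy, and compare the outputs. The following sketch assumes a hypothetical `make_model` helper and a temporary directory instead of the tutorial's `NeuralNetwork` class and working directory.

```python
import os
import tempfile

import torch


def make_model():
    # Hypothetical stand-in architecture; it must be rebuilt identically before loading
    return torch.nn.Sequential(
        torch.nn.Linear(2, 4), torch.nn.ReLU(), torch.nn.Linear(4, 2)
    )


torch.manual_seed(123)
original = make_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pth")
    torch.save(original.state_dict(), path)

    torch.manual_seed(456)  # different seed, so the fresh model starts with other weights
    restored = make_model()
    restored.load_state_dict(torch.load(path, weights_only=True))

x = torch.tensor([[0.5, -1.0]])
with torch.no_grad():
    same_output = torch.equal(original(x), restored(x))
print(same_output)
```

If the architecture passed to `load_state_dict` did not match, for example a different hidden layer size, PyTorch would raise an error about mismatched keys or shapes instead of loading the parameters.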

# Optimizing training performance with GPUs

- https://sebastianraschka.com/teaching/pytorch-1h/