# Introduction to PyTorch


PyTorch is a Python library for deep learning. It is distinguished by its dynamic computational graph, which is built while the code runs and lets researchers and developers construct and modify models with ordinary Python code. Its ease of use and research-friendly features have led to wide adoption in science and engineering.

This notebook is based on the [PyTorch 60-Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).


### 1 Tensors
Tensors are a specialized data structure similar to arrays and matrices. They are comparable to NumPy's ndarrays, but they have the advantage that PyTorch tensors can run on GPUs.

In [4]:
import torch
import numpy as np

**1.1 Tensor Initialization**

There are many ways to initialise tensors, e.g. from data:

In [5]:
# Creating ndarray/tensor from data
n = np.array([[1, 2], [3, 4]])
t = torch.tensor([[1, 2], [3, 4]])

print(f"NumPy: \n {n} \n")
print(f"PyTorch: \n {t} \n")

NumPy: 
 [[1 2]
 [3 4]] 

PyTorch: 
 tensor([[1, 2],
        [3, 4]]) 



Or creating them from constant or random values:

In [7]:
# NumPy
n_zeros  = np.zeros((2, 3))
n_ones   = np.ones((2, 3))
n_random = np.random.random((2, 3))

# PyTorch: no parentheses around the shape needed (torch.full is the exception)
t_zeros  = torch.zeros(2, 3)
t_ones   = torch.ones(2, 3)
t_random = torch.rand(2, 3)
t_full = torch.full((2, 3), 3.1415)

n_zeros, n_ones, n_random, t_zeros, t_ones, t_random, t_full

(array([[0., 0., 0.],
        [0., 0., 0.]]),
 array([[1., 1., 1.],
        [1., 1., 1.]]),
 array([[0.5720494 , 0.90804453, 0.00744775],
        [0.72319214, 0.32139422, 0.85191981]]),
 tensor([[0., 0., 0.],
         [0., 0., 0.]]),
 tensor([[1., 1., 1.],
         [1., 1., 1.]]),
 tensor([[0.5425, 0.8958, 0.4804],
         [0.9227, 0.7835, 0.3031]]),
 tensor([[3.1415, 3.1415, 3.1415],
         [3.1415, 3.1415, 3.1415]]))

**1.2 Conversion between NumPy and PyTorch**

The conversion of an ndarray into a torch tensor is simple. Both will share the same memory.

In [8]:
# NumPy -> PyTorch
n = np.ones(3)
t = torch.from_numpy(n)

print(f"NumPy: {n}")
print(f"PyTorch: {t}")

# Same memory, so operations will affect both!
t += 1

print(f"NumPy: {n}")
print(f"PyTorch: {t}")

NumPy: [1. 1. 1.]
PyTorch: tensor([1., 1., 1.], dtype=torch.float64)
NumPy: [2. 2. 2.]
PyTorch: tensor([2., 2., 2.], dtype=torch.float64)


In [9]:
# PyTorch -> NumPy
t = torch.ones(3)
n = t.numpy()

print(f"NumPy: {n}")
print(f"PyTorch: {t}")

NumPy: [1. 1. 1.]
PyTorch: tensor([1., 1., 1.])
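The memory sharing also holds in the PyTorch -> NumPy direction, at least for CPU tensors. The following is a minimal sketch of that behaviour; the name `n_copy` is only illustrative, and `clone()` is one possible way to obtain an independent copy.

```python
import torch

t = torch.ones(3)
n = t.numpy()                 # shares memory with t (CPU tensors only)

t.add_(1)                     # in-place change on the tensor ...
print(n)                      # ... is visible in the ndarray as well

n_copy = t.clone().numpy()    # clone first to get an independent copy
t.add_(1)
print(n, n_copy)              # n follows t, n_copy does not
```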


**1.3 Tensor Attributes**

We can print information such as the tensor shape, the tensor datatype, and the device on which they are stored.

In [10]:
n = np.ones((3, 4))
t = torch.ones(3, 4)

print(f"Shape: {t.shape}")
print(f"Datatype: {t.dtype}")
print(f"Device tensor is stored on: {t.device}") # PyTorch tensors can be stored on the GPU!

Shape: torch.Size([3, 4])
Datatype: torch.float32
Device tensor is stored on: cpu


We can change the shape of a tensor via ``reshape``. Be aware that ``reshape`` returns a view of the original data when possible and a copy otherwise.

In [11]:
print(t.shape)

# Change shape
t = t.reshape(4, 3)
print(t.shape)

torch.Size([3, 4])
torch.Size([4, 3])
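Whether ``reshape`` returned a view or a copy can be checked by comparing the storage addresses. This is a small sketch under the assumption that `data_ptr()` is an acceptable proxy for "same memory"; the tensor names are made up for illustration.

```python
import torch

t = torch.arange(12).reshape(3, 4)

# Contiguous tensor: reshape can return a view on the same storage
v = t.reshape(4, 3)
print(v.data_ptr() == t.data_ptr())

# A transposed tensor is not contiguous, so flattening it requires a copy
c = t.T.reshape(12)
print(c.data_ptr() == t.data_ptr())

# Writing through the view changes the original, writing to the copy does not
v[0, 0] = 100
c[1] = -1
print(t)
```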


Each tensor has a specific data type. A list of dtypes can be found here: https://pytorch.org/docs/stable/tensors.html

In [12]:
print(t.dtype)

# Conversion of dtypes
t = t.type(torch.bool)
print(t.dtype)

# Shorter way
t = t.int()
print(t.dtype)


torch.float32
torch.bool
torch.int32


By default, tensors are created on the CPU. To use the GPU, we have to move them there explicitly.

In [13]:
# We move our tensor to the GPU if available
if torch.cuda.is_available():
  t = t.to("cuda")
  print(f"Device tensor is stored on: {t.device}")

  t += 1
  print(t)
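A common pattern, shown here only as a sketch, is to pick the device once and reuse it so that the same code runs with or without a GPU. The variable names `device`, `u` and `result` are illustrative.

```python
import torch

# Fall back to the CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.ones(3, 4, device=device)   # create directly on the chosen device
u = torch.zeros(3, 4).to(device)      # or move an existing tensor

result = (t + u).cpu()                # move back to the CPU, e.g. before .numpy()
print(result.device, result.shape)
```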

**1.4 Tensor Operations**

Many operations that you find in NumPy are also available in PyTorch, although some go by a different name.

In [18]:
# Indexing and Slicing
t = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(f"First row: {t[0]}")
print(f"First column: {t[:, 0]}")
print(f"Last column: {t[:, -1]}")
t[:, 1] = 0
print(t)
print(t.shape)

First row: tensor([1, 2, 3])
First column: tensor([1, 4, 7])
Last column: tensor([3, 6, 9])
tensor([[1, 0, 3],
        [4, 0, 6],
        [7, 0, 9]])
torch.Size([3, 3])


In [19]:
n0 = np.array([1, 2])
n1 = np.array([3, 4])

t0 = torch.tensor([1, 2])
t1 = torch.tensor([3, 4])

# Concatenating
print(f"NumPy: {np.concatenate([n0, n1])}")
print(f"PyTorch: {torch.cat([t0, t1])}\n")

# Stacking (along a new dimension)
print(f"NumPy: \n{np.stack([n0, n1])}")
print(f"PyTorch: \n{torch.stack([t0, t1])}\n")

# Check for shape!
t2 = torch.tensor([1, 2, 3])
#print(torch.stack([t0, t2]).shape)


NumPy: [1 2 3 4]
PyTorch: tensor([1, 2, 3, 4])

NumPy: 
[[1 2]
 [3 4]]
PyTorch: 
tensor([[1, 2],
        [3, 4]])
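The commented-out line in the cell above hints at a shape requirement: ``torch.stack`` needs tensors of equal size, whereas ``torch.cat`` only needs matching sizes in the dimensions that are not concatenated. A short sketch of what happens with mismatched lengths (the flag `stacked` is just for illustration):

```python
import torch

t0 = torch.tensor([1, 2])
t2 = torch.tensor([1, 2, 3])

try:
    torch.stack([t0, t2])            # sizes differ, so stacking fails
    stacked = True
except RuntimeError as err:
    stacked = False
    print(f"stack failed: {err}")

joined = torch.cat([t0, t2])         # concatenating along dim 0 works
print(joined.shape)
```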



In [20]:
# multiplying tensors
t0 = torch.full((3, 2), 2.0)
t1 = torch.full((3, 2), 4.0)

# element-wise product
print(f"tensor.mul(tensor) \n {t0.mul(t1)} \n")
# Alternative syntax:
print(f"tensor * tensor \n {t0 * t1}")

# matrix multiplication
print(f"tensor.matmul(tensor.T) \n {t0.matmul(t1.T)} \n")
# Alternative syntax:
print(f"tensor @ tensor.T \n {t0 @ t1.T}")

tensor.mul(tensor) 
 tensor([[8., 8.],
        [8., 8.],
        [8., 8.]]) 

tensor * tensor 
 tensor([[8., 8.],
        [8., 8.],
        [8., 8.]])
tensor.matmul(tensor.T) 
 tensor([[16., 16., 16.],
        [16., 16., 16.],
        [16., 16., 16.]]) 

tensor @ tensor.T 
 tensor([[16., 16., 16.],
        [16., 16., 16.],
        [16., 16., 16.]])


In [21]:
# inplace operations have a trailing underscore
print(f"Tensor before inplace addition \n {t} \n")
t.add_(5)
print(f"Tensor after inplace addition \n {t} \n")

Tensor before inplace addition 
 tensor([[1, 0, 3],
        [4, 0, 6],
        [7, 0, 9]]) 

Tensor after inplace addition 
 tensor([[ 6,  5,  8],
        [ 9,  5, 11],
        [12,  5, 14]]) 



For a full list of available tensor operations check out the corresponding [PyTorch documentation](https://pytorch.org/docs/stable/torch.html).

In [22]:
t = torch.full((2, 3), 0.5)

# Further operation examples
print(t.mean())
print(t.sum())
print(t.log())
print(t.sin())
print(t + 2*t ** 3 - 10)

tensor(0.5000)
tensor(3.)
tensor([[-0.6931, -0.6931, -0.6931],
        [-0.6931, -0.6931, -0.6931]])
tensor([[0.4794, 0.4794, 0.4794],
        [0.4794, 0.4794, 0.4794]])
tensor([[-9.2500, -9.2500, -9.2500],
        [-9.2500, -9.2500, -9.2500]])


### 2 Introduction to torch.autograd

``torch.autograd`` is PyTorch's engine for automatic differentiation. It is essential for the training of neural networks.

**2.1 Differentiation in Autograd**

The argument ``requires_grad=True`` signals to ``autograd`` that every operation on those tensors should be tracked. This allows ``autograd`` to collect gradients.

In [23]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3*a**3 - b**2

If ``a`` and ``b`` are parameters of a neural network, the error function ``Q`` could look like this:

\begin{align}Q = 3a^3 - b^2\end{align}

For the training of the neural network we need to calculate the gradients with respect to the parameters: 

\begin{align}\frac{\partial Q}{\partial a} = 9a^2\end{align}

\begin{align}\frac{\partial Q}{\partial b} = -2b\end{align}

With ``autograd`` you can calculate those gradients by calling ``.backward()`` on ``Q``. Because ``Q`` is a vector rather than a scalar, ``.backward()`` needs an explicit ``gradient`` argument with the same shape as ``Q``. It represents the gradient of ``Q`` with respect to itself and weights each element of ``Q`` in the backward pass. Passing ``external_grad`` gives every element of ``Q`` a weight of 1.0, which is equivalent to calling ``Q.sum().backward()``.

In [24]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
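As a cross-check, the same gradients can be obtained by reducing ``Q`` to a scalar first. The sketch below re-creates ``a`` and ``b`` so that it runs on its own.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Q = 3*a**3 - b**2
Q.sum().backward()      # scalar output, so no gradient argument is needed

print(a.grad)           # dQ/da = 9a^2
print(b.grad)           # dQ/db = -2b
```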


**2.2 Computational Graph**

``autograd`` functions by maintaining a record of both data (tensors) and executed operations in a directed acyclic graph (DAG) composed of Function objects. Within this graph structure, the input tensors serve as the starting point (leaves), and the output tensors act as the endpoints (roots). By traversing this graph in a reverse manner, one can automatically compute gradients using the chain rule.

- **Forward pass**: ``autograd`` performs operations to compute the resulting tensor and maintains the operation's gradient function in the DAG.

- **Backward pass**: ``autograd`` (triggered by calling ``.backward()`` on the DAG root) computes the gradients from each ``.grad_fn``, accumulates them in the corresponding tensor's ``.grad`` attribute, and applies the chain rule to propagate gradients to the leaf tensors.

``autograd`` tracks operations for tensors with ``requires_grad=True``, while setting ``requires_grad=False`` excludes them; if any input tensor has ``requires_grad=True``, the output tensor will also require gradients.

Note: In PyTorch, DAGs (Directed Acyclic Graphs) are dynamic, and it's important to know that a new graph is built from scratch after each ``.backward()`` call. This flexibility enables the use of control flow statements and the ability to modify the model's shape, size, and operations in each iteration as required.


In [25]:
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients?: {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients?: False
Does `b` require gradients?: True
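Because the graph is recorded while the code runs, ordinary Python control flow can decide which operations end up in it. The function `f` below is a made-up example to illustrate this, not part of the original tutorial.

```python
import torch

def f(x):
    # The branch taken at run time determines the recorded graph
    if x.sum() > 0:
        return (x ** 2).sum()
    return (-3 * x).sum()

x_pos = torch.tensor([1., 2.], requires_grad=True)
f(x_pos).backward()
print(x_pos.grad)       # gradient of sum(x^2) is 2x

x_neg = torch.tensor([-1., -2.], requires_grad=True)
f(x_neg).backward()
print(x_neg.grad)       # gradient of sum(-3x) is -3 for every element
```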


### 3 Neural Networks with PyTorch

The ``torch.nn`` package can be used to construct neural networks. An ``nn.Module`` contains layers and a method ``forward(input)`` that returns the ``output``.

Typical training procedure for a neural network:
1. Define neural network with some learnable parameters (weights)
2. Iterate over dataset of inputs
3. Process input through network
4. Compute the loss (how wrong is the output)
5. Backpropagate to calculate the gradient for each of the network's weights
6. Update the weights of the network

**torch.nn.Module** holds both the **weights** of the network and their corresponding **gradient** tensors. The weights and the gradients can be examined via ``model.parameters()``.
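To make this concrete, here is a minimal sketch that inspects the parameters of a single ``nn.Linear`` layer (chosen only because it is small) before and after a backward pass. It uses ``named_parameters()``, which yields the same tensors as ``parameters()`` together with their names.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 2)

for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.grad)     # .grad is None before any backward pass

model(torch.rand(4, 3)).sum().backward()    # dummy forward and backward pass

for name, p in model.named_parameters():
    print(name, tuple(p.grad.shape))        # gradients have the shape of their parameter
```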


**3.1 Defining a neural network**

In general, a neural network in PyTorch can be described by the following structure:

In [26]:
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        # SETUP

    def forward(self, x):
        # FORWARD PASS
        return x

net = NeuralNetwork()
print(net)


NeuralNetwork()


If we define the ``forward`` function, the ``backward`` function is automatically defined, which can be used for training of the network.

There are a variety of layers, activation functions etc. to choose from. For an overview, see: https://pytorch.org/docs/stable/nn.html
The most important ones for our tasks are the ``Linear`` layer for fully connected layers and the ``Conv2d`` layer for convolutions.

In [27]:
# Linear layer
in_features = 3
out_features = 5
linear_layer = nn.Linear(in_features, out_features)

# First dimension is the BATCH dimension, the second is the number of in_features
test_input = torch.rand(32, 3)
test_output = linear_layer(test_input)
print(f"Output shape: {test_output.shape}, gradient information available? {test_output.requires_grad}")


# Convolutional layer, no information about the height and width of the input required!
in_channels = 3
out_channels = 5
kernel_size = (3, 3)

cnn_layer = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=1, padding_mode='zeros')

# First dimension is the BATCH dimension, the second the number of in_channels
# The third and fourth dimensions are the height and width, respectively
test_input = torch.rand(32, 3, 32, 32)
test_output = cnn_layer(test_input)
print(f"Output shape: {test_output.shape}, gradient information available? {test_output.requires_grad}")


Output shape: torch.Size([32, 5]), gradient information available? True
Output shape: torch.Size([32, 5, 32, 32]), gradient information available? True
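The spatial output size of a convolution follows from kernel size, stride, padding and dilation. The helper `conv_out_size` below is a hypothetical function written for this sketch; it implements the formula given in the ``Conv2d`` documentation and compares it against actual layer outputs.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    # Output length along one spatial dimension
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

x = torch.rand(32, 3, 32, 32)

# 3x3 kernel with padding 1 and stride 1 keeps height and width
same = nn.Conv2d(3, 5, kernel_size=3, stride=1, padding=1)
print(conv_out_size(32, 3, stride=1, padding=1), same(x).shape)

# No padding and stride 2 shrinks the feature map
strided = nn.Conv2d(3, 5, kernel_size=3, stride=2, padding=0)
print(conv_out_size(32, 3, stride=2, padding=0), strided(x).shape)
```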


Here is a complete example of a neural network using a combination of convolutions and fully connected layers:

In [28]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, (5,5))
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 5*5 from image dimension, 32 -> (32-4)/2 -> (14-4)/2 -> 5
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify with a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)

        # BCHW -> B,N

        x = torch.flatten(x, 1) # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
print(net)

# learnable parameters of the model
params = list(net.parameters())
n_params = sum(p.numel() for p in params)  # plain Python sum, no NumPy needed
print(f"Number of learnable parameters: {n_params} \n")
print(params[0].shape)  # conv1's .weight

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
Number of learnable parameters: 61706 

torch.Size([6, 1, 5, 5])


There is also a much shorter version that uses nn.Sequential. It forwards the data through all layers in the given order. However, it is not as flexible as the definition of the network as a class:

In [29]:
net2 = nn.Sequential(
    nn.Conv2d(1, 6, (5, 5)),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),
    nn.Conv2d(6, 16, (5, 5)),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),
    nn.Flatten(start_dim=1, end_dim=-1),
    nn.Linear(16 * 5 * 5, 120),
    nn.ReLU(),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10),
)
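Since ``nn.Sequential`` applies its layers in order, one can also iterate over it and watch the shape change layer by layer. The sketch below rebuilds the same architecture in compact form so that it runs on its own, and confirms where the ``16 * 5 * 5`` input size of the first linear layer comes from.

```python
import torch
import torch.nn as nn

net2 = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

x = torch.randn(1, 1, 32, 32)
shapes = []
for layer in net2:                      # layers are visited in the given order
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")

n_params = sum(p.numel() for p in net2.parameters())
print(f"Number of learnable parameters: {n_params}")
```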

In [30]:
# Random input
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[ 0.0672, -0.0951,  0.0648, -0.0270,  0.0832, -0.0916, -0.0135,  0.0494,
         -0.0704, -0.0135]], grad_fn=<AddmmBackward0>)


In [31]:
# Zero the gradient buffers of all parameters and backprop with random gradients
net.zero_grad()
out.backward(torch.randn(1, 10))

**3.2 Loss Function**

The loss function computes a value that estimates how far away the output is from the target. For the full list of available loss functions check out the [PyTorch documentation](https://pytorch.org/docs/stable/nn.html#loss-functions).

Example: ``MSELoss`` (mean squared error)


In [32]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(1.2268, grad_fn=<MseLossBackward0>)
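With its default settings, ``MSELoss`` averages the squared differences over all elements. The following sketch checks this against a manual computation on small, hand-picked tensors (the values are arbitrary).

```python
import torch
import torch.nn as nn

output = torch.tensor([[0.5, -1.0, 2.0]])
target = torch.tensor([[1.0, -1.0, 0.0]])

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # mean of the squared differences

print(loss, manual)
```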


**3.3 Backprop**

To initiate error backpropagation, use ``loss.backward()``, but make sure to clear existing gradients; otherwise, the new gradients will accumulate onto the existing ones. This step is crucial for accurate gradient calculations.


In [33]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print("conv1.bias.grad before backward")
print(net.conv1.bias.grad)

loss.backward()

print("conv1.bias.grad after backward")
print(net.conv1.bias.grad)

conv1.bias.grad before backward
None
conv1.bias.grad after backward
tensor([ 0.0143, -0.0089,  0.0054, -0.0013, -0.0050, -0.0194])
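The accumulation behaviour can be seen on a single tensor. This is a minimal sketch with an illustrative tensor `w`; resetting ``w.grad`` by hand plays the role that ``zero_grad()`` plays for the parameters of a module.

```python
import torch

w = torch.tensor([1., 2.], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()          # 2w

(w ** 2).sum().backward()       # no reset: the new gradient is added on top
accumulated = w.grad.clone()    # 4w

w.grad = None                   # manual reset of the gradient
(w ** 2).sum().backward()
fresh = w.grad.clone()          # 2w again

print(first, accumulated, fresh)
```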


**3.4 Update the weights**

A simple update rule is stochastic gradient descent (SGD):

*weight = weight - learning_rate * gradient*

The ``torch.optim`` package implements different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.

In [34]:
import torch.optim as optim

# Create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)
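For plain SGD without momentum or weight decay, one optimizer step should match the update rule written above. The sketch below compares a manual update with ``optim.SGD`` on two identical toy tensors; the names `w_manual` and `w_optim` and the toy loss are assumptions for illustration.

```python
import torch
import torch.optim as optim

lr = 0.1
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)

# Manual update: weight = weight - learning_rate * gradient
(w_manual ** 2).sum().backward()
with torch.no_grad():                   # the update itself must not be tracked
    w_manual -= lr * w_manual.grad

# The same step via torch.optim
optimizer = optim.SGD([w_optim], lr=lr)
(w_optim ** 2).sum().backward()
optimizer.step()

print(w_manual, w_optim)
```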

Usually, you will find the following sequence of operations for one training iteration of a PyTorch network: zero the gradients, perform a forward pass, compute the loss, calculate the gradients with ``backward()``, and update the parameters of the network using an optimizer:

In [35]:
# In your training loop: (IDIOMATIC)
optimizer.zero_grad()               # zero the gradient buffers
output = net(input)                 # make prediction
loss = criterion(output, target)    # calculate loss between prediction and ground truth
loss.backward()                     # backpropagate the loss (gradient calculation)
optimizer.step()                    # update the weights
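Repeating this sequence over many iterations gives a complete training loop. The sketch below uses a toy regression problem invented for illustration (a single ``nn.Linear`` layer learning the sum of its inputs) instead of the network above, so that it runs quickly on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)                            # reproducible toy data

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

inputs = torch.rand(64, 3)
targets = inputs.sum(dim=1, keepdim=True)       # toy target: sum of the features

losses = []
for step in range(100):
    optimizer.zero_grad()                       # zero the gradient buffers
    loss = criterion(model(inputs), targets)    # forward pass and loss
    loss.backward()                             # gradient calculation
    optimizer.step()                            # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In practice the loop would iterate over mini-batches from a dataset rather than reuse one fixed batch.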