# Deep Learning with Pytorch

We'll cover:
1. Pytorch library specifics and implementation details
2. higher-level libraries on top of Pytorch, which aim to simplify common DL problems
3. the Pytorch Ignite library

## 1. Tensors

A `Tensor` is just a multi-dimensional array
1. tensors in DL are only partially related to the tensors used in `tensor calculus` or `tensor algebra`

## 2. The creation of tensors

1. Apart from dimensions, a tensor is characterized by the type of its elements. There are $13$ types supported by Pytorch:
    1. $4$ float types:
        - 16-bit, with $2$ variants:
            - `float16` has more bits for precision
            - `bfloat16` has larger exponent part
        - 32-bit
        - 64-bit
    2. $3$ complex types:
        - 32-bit
        - 64-bit
        - 128-bit
    3. $5$ integer types:
        - 8-bit signed
        - 8-bit unsigned
        - 16-bit signed
        - 32-bit signed
        - 64-bit signed
    4. Boolean type

2. There are also $4$ `quantized number` types, but they use the preceding types, just with a different bit representation and interpretation

3. Tensors of different types are represented by different classes, with the most commonly used being
    1. `torch.FloatTensor` (corresponding to a 32-bit float)
    2. `torch.ByteTensor` (an 8-bit unsigned integer)
    3. `torch.LongTensor` (a 64-bit signed integer)
    4. Refer to documentation for other tensor types' names

4. $3$ ways to create a tensor in Pytorch:
    1. calling a constructor of the required type
        - we can pass either the desired shape or a Python iterable (e.g. list or tuple) to the constructor
    2. asking Pytorch to create a tensor with specific data (e.g. `torch.zeros()`)
    3. converting a Numpy array or a Python list to a tensor using `torch.tensor()`
        - `torch.tensor()` keeps the dtype of a Numpy array; as Numpy creates 64-bit float (double) arrays by default, the converted tensor has type `torch.float64`, i.e. the `DoubleTensor` type
        - usually in DL, double precision is not required and it adds extra memory and performance overhead
        - common practice: use the 32-bit float type, or even a 16-bit float type

5. There are $2$ types of operation for tensors:
    1. `inplace` operations have an underscore appended to their name and operate on the tensor's content. After this, the object itself is returned
        - more efficient from a performance/memory point of view, but modification of an existing tensor might lead to hidden bugs
    2. `functional` operations create a copy of the tensor with the performed modification, leaving the original tensor untouched
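
The difference between the two kinds of operation can be sketched with `add()` and its in-place twin `add_()`. This is a minimal illustration; the variable names are chosen here and are not part of the original notes.

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])

# Functional: returns a new tensor, `t` is left untouched
t_plus = t.add(1.0)

# In-place (trailing underscore): modifies `t` and returns the same object
t_same = t.add_(1.0)

print(t_plus)        # new tensor holding the result
print(t)             # `t` itself has now changed
print(t_same is t)   # the in-place call hands back the very same object
```

Because the in-place call mutates `t`, any other code holding a reference to `t` sees the new values, which is where the hidden bugs mentioned above tend to come from.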

In [1]:
# 1
import torch
import numpy as np
a = torch.FloatTensor(3, 2)
a  # zeros in this run; don't rely on it, initialize explicitly

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In this run the tensor came back filled with zeros, which differs from the behavior of earlier versions:
- before, Pytorch just allocated memory and kept it uninitialized, which is slightly faster but less safe (as it might introduce tricky bugs and security issues)
- but we shouldn't rely on zero-filled memory: the behavior might change again (or differ on other hardware backends), so we should always initialize the contents of the tensor explicitly
- to do so, we can use either one of the tensor construction functions (e.g. `torch.zeros()`) or an in-place operation (e.g. `zero_()`):

In [None]:
# functional operation
torch.zeros(3, 4)

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [3]:
# inplace operation
a.zero_()

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In [4]:
# provide a Python iterable
torch.FloatTensor([[1,2,3],[3,2,1]])

tensor([[1., 2., 3.],
        [3., 2., 1.]])

In [None]:
# convert from Numpy array
n = np.zeros(shape=(3,2))
print("Numpy array:\n", n)

b = torch.tensor(n)
b

Numpy array:
 [[0. 0.]
 [0. 0.]
 [0. 0.]]


tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], dtype=torch.float64)

In [None]:
# set the 32-bit float type on the Numpy side; torch.tensor() keeps it
n = np.zeros(shape=(3,2), dtype=np.float32)
torch.tensor(n)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In [None]:
# set the 32-bit float type on the Pytorch side via the dtype argument
n = np.zeros(shape=(3,2))
torch.tensor(n, dtype=torch.float32)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])
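
To make the dtype rules concrete, here is a small sketch of how `torch.tensor()` picks the element type for a Python list versus a Numpy array, and how to down-cast afterwards. The variable names are illustrative only.

```python
import numpy as np
import torch

from_list = torch.tensor([1.0, 2.0])           # Python floats -> Pytorch default dtype
from_numpy = torch.tensor(np.zeros((3, 2)))    # Numpy default is float64, and it is kept
as_float32 = from_numpy.to(torch.float32)      # explicit down-cast to 32-bit
as_float16 = from_numpy.to(torch.float16)      # explicit down-cast to 16-bit

print(from_list.dtype, from_numpy.dtype, as_float32.dtype, as_float16.dtype)
print(as_float32.type())                       # legacy class name of the tensor
print(from_numpy.element_size(), as_float32.element_size())  # bytes per element
```

The element sizes show the memory overhead directly: a double-precision tensor takes twice the space of its 32-bit counterpart.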

### 2.1 Scalar tensors

Zero-dimensional tensors can be created by the `torch.tensor()` function. To access the actual Python value of a scalar tensor, we can use the special `item()` method:

In [14]:
torch.tensor(1)

tensor(1)

In [18]:
a = torch.tensor([1,2,3]).sum()
print("Sum: ", a)
a.item()

Sum:  tensor(6)


6

### 2.2 Tensor operations

1. Refer to the Pytorch documentation. There are $2$ places to look for operations:
    1. the `torch` package: the function usually accepts the tensor as an argument
    2. the `torch.Tensor` class: the method operates on the tensor it is called on

2. Most tensor operations in Pytorch mirror their Numpy equivalents
    - examples: `torch.stack()`, `torch.transpose()`, `torch.cat()` (the Numpy counterpart of the latter is `np.concatenate()`)
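
As a short illustration of the two places where operations live, the sketch below calls the same transposition once as a `torch` package function and once as a tensor method, and then shows `torch.stack()` and `torch.cat()`. The tensors `a` and `b` are made up for the example.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])
b = torch.tensor([[7, 8, 9], [10, 11, 12]])

# Same operation, two spellings: package function vs. tensor method
t1 = torch.transpose(a, 0, 1)
t2 = a.transpose(0, 1)

stacked = torch.stack([a, b])        # adds a new leading dimension
joined = torch.cat([a, b], dim=0)    # concatenates along an existing dimension

print(t1.shape, stacked.shape, joined.shape)
```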

### 2.3 GPU tensors (run in Google Colab to get access to a GPU)

1. Pytorch supports CUDA GPUs, i.e. all operations have $2$ versions - CPU and GPU
    - every tensor type we mentioned so far is for CPU and has its GPU equivalent
        - GPU tensors reside in the `torch.cuda` package (e.g. `torch.cuda.FloatTensor`)
        - CPU tensors reside in the `torch` package (e.g. `torch.FloatTensor`)

2. In fact, under the hood, Pytorch supports not just CPU and CUDA; it has a notion of a `backend`, which is an abstract computation device with memory
    - tensors can be allocated in the backend's memory and computations can be performed on them
    - for example, on Apple hardware, Pytorch supports `Metal Performance Shaders (MPS)` as a backend called `mps`

3. the tensor method `to(device)` returns a copy of the tensor on the specified device (or the tensor itself if it already resides there)
    - the device can be specified in different ways:
        1. pass the string name of the device:
            - `"cpu"` for CPU memory
            - `"cuda"` for the default GPU
            - `"cuda:n"` for a specific GPU (`n` is the zero-based GPU index)
        2. pass a `torch.device` instance to the `to()` method; its constructor accepts the device name and an optional index
    - the older methods from previous versions still work: `cpu()`, `cuda()`
    


In [31]:
a = torch.FloatTensor([2,3])

ca = a.to("cuda")

print("CPU tensor:\n", a)
print("GPU tensor:\n", ca)
print("CPU tensor computation:\n", a+1)
print("GPU tensor computation:\n", ca+1)
print("Get tensor's device:", ca.device)

AssertionError: Torch not compiled with CUDA enabled
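
The error above appears because this machine has no CUDA-enabled build. One way to keep such a cell runnable everywhere is to choose the device at run time. The helper `pick_device()` below is a hypothetical name introduced for this sketch, not a Pytorch function.

```python
import torch

def pick_device() -> torch.device:
    """Hypothetical helper: prefer CUDA, then Apple's MPS, then fall back to the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)   # absent in older Pytorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
a = torch.tensor([2.0, 3.0])
da = a.to(device)     # a copy on the device, or `a` itself if it already lives there

print("Selected device:", device)
print("Computation on that device:", da + 1)
print("Tensor's device:", da.device)
```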

## 3. Gradients

1. The `automatic computation of gradients` functionality was originally implemented in the Caffe toolkit and then became the de facto standard in DL libraries

2. The direction of data and gradients flow during the optimization process: ![Data and gradients flowing through the NN](../images/figure_3-2.png)
    - $2$ approaches to calculating gradients:
        1. `Static graph`: we need to define the calculations in advance, and it won't be possible to change them later
            - the graph will be processed and optimized by the DL library before any computation is made
            - was implemented in `TensorFlow < 2.0`, `Theano`, and many other DL toolkits
            - strengths:
                1. faster, as all computations can be moved to the GPU, minimizing data transfer overhead
                2. the library has more freedom in optimizing the order in which computations are performed, or even in removing parts of the graph
        2. `Dynamic graph`: we don't need to define the graph in advance exactly as it will be executed; we just execute the operations that we want to use for data transformation on actual data
            - during this, the library records the order of the operations performed, and when asked to calculate gradients, it unrolls the history of operations, accumulating the gradients of the network parameters
            - this method is also called `notebook gradients`
            - is implemented in `Pytorch`, `Chainer`, etc.
            - strengths:
                1. although it has a higher computation overhead, it gives developers more freedom
                2. (very appealing) it allows us to express transformations more naturally and in a more Pythonic way

3. `Pytorch 2.x` introduced the `torch.compile` function, which speeds up Pytorch code by `JIT-compiling` it into optimized kernels
    - this is an evolution of the `TorchScript` and `FX Tracing` compiling methods from earlier versions
    - from a historical perspective, it is amusing how the originally radically different approaches of `TensorFlow (static graph)` and `Pytorch (dynamic graph)` are converging over time
    - nowadays, Pytorch supports `torch.compile()` and TensorFlow has an `eager execution mode`
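
The dynamic graph approach can be sketched with a function whose operations depend on the data itself. The function `transform()` and its loop condition are invented for this illustration; the point is only that ordinary Python control flow decides what gets recorded, and `backward()` then unrolls exactly those operations.

```python
import torch

def transform(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Ordinary Python control flow decides which operations get recorded
    y = x * w
    steps = 0
    while y.norm() < 10 and steps < 5:   # data-dependent loop
        y = y * 2
        steps += 1
    return y.sum()

w = torch.tensor([1.0, 1.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])

out = transform(x, w)
out.backward()           # unroll the recorded history of operations
print(out.item(), w.grad)
```

With these inputs the loop doubles `y` three times, so the result is `8 * (x * w).sum()` and the gradient with respect to `w` is `8 * x`. A different input could take a different number of iterations, and the recorded graph would change accordingly.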

### 3.1 Tensors and gradients

1. Pytorch tensors have a built-in gradient calculation and tracking machinery, so all we need to do is convert data into tensors and perform computations using the tensor methods and functions provided by `torch`

2. several attributes related to gradients that every tensor has:
    - `grad`: holds a tensor of the same shape containing the computed gradients
    - `is_leaf`: 
        1. `True` if the tensor was constructed by the user
        2. `False` if it is the result of a function transformation (i.e. it has a parent in the computation graph)
    - `requires_grad`:
        1. `True` if the tensor requires gradients to be calculated
            - this property is inherited from leaf tensors, which get this value at the tensor construction step (`torch.zeros()` or `torch.tensor()`, etc.)
            - if one of the tensors involved in a computation has it set to `True`, all subsequent nodes also have it
        2. `False` is the constructor's default

3. for memory efficiency, gradients are stored only for leaf nodes with `requires_grad=True`
    - call the `retain_grad()` method to keep the gradients for non-leaf nodes

In [35]:
v1 = torch.tensor([1.0, 1.0], requires_grad=True)
v2 = torch.tensor([2.0, 2.0])

v_sum = v1 + v2
print(f"Sum: {v_sum}")
v_res = (v_sum*2).sum()
print(f"Result: {v_res}")
print(f"v1 is leaf: {v1.is_leaf} \nv2 is leaf: {v2.is_leaf}")
print(f"v_sum is leaf: {v_sum.is_leaf} \nv_res is leaf: {v_res.is_leaf}")
print(f"v1 requires grad: {v1.requires_grad} \nv2 requires grad: {v2.requires_grad}")
print(f"v_sum requires grad: {v_sum.requires_grad} \nv_res requires grad: {v_res.requires_grad}")

Sum: tensor([3., 3.], grad_fn=<AddBackward0>)
Result: 12.0
v1 is leaf: True 
v2 is leaf: True
v_sum is leaf: False 
v_res is leaf: False
v1 requires grad: True 
v2 requires grad: False
v_sum requires grad: True 
v_res requires grad: True


![Graph representation of the expression](../images/figure_3-3.png)

Let's compute the gradients of our graph:

In [36]:
v_res.backward()

print(f"Gradients of v1: {v1.grad}")
print(f"Gradients of v2: {v2.grad}")

Gradients of v1: tensor([2., 2.])
Gradients of v2: None
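
Gradients of intermediate (non-leaf) tensors are discarded by default. The following sketch repeats the same computation but calls `retain_grad()` on the intermediate sum so that its gradient survives the backward pass.

```python
import torch

v1 = torch.tensor([1.0, 1.0], requires_grad=True)
v2 = torch.tensor([2.0, 2.0])

v_sum = v1 + v2
v_sum.retain_grad()          # ask autograd to keep the gradient of this non-leaf tensor
v_res = (v_sum * 2).sum()
v_res.backward()

print(v1.grad)      # leaf with requires_grad=True: gradient is stored
print(v_sum.grad)   # non-leaf, kept only because of retain_grad()
print(v2.grad)      # leaf without requires_grad: stays None
```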


## 4. NN building blocks

1. the `torch.nn` package has tons of predefined classes providing the basic functionality blocks
    - all of them are designed with practice in mind (e.g. they support mini-batches, have sane default values, and their weights are properly initialized)
    - all modules follow the convention of `callable`: an instance of any class can act as a function when applied to its arguments

2. all classes in the `torch.nn` package inherit from the `nn.Module` base class, which can be used to implement higher-level NN blocks. Some useful methods that all `nn.Module` children provide:
    - `parameters()`: returns an iterator over all variables that require gradient computation (i.e. module weights)
    - `zero_grad()`: resets the gradients of all parameters (by default, Pytorch 2.x sets them to `None` instead of filling them with zeros)
    - `to(device)`: moves all module parameters to a given device
    - `state_dict()`: returns a dictionary with all module parameters, which is useful for model serialization
    - `load_state_dict()`: initializes the module from a state dictionary

3. the `Sequential` class combines other layers into a pipe:

In [37]:
from torch import nn

s = nn.Sequential(
    nn.Linear(2, 5),
    nn.ReLU(),
    nn.Linear(5, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.Dropout(p=0.3),
    nn.Softmax(dim=1)
)
s

Sequential(
  (0): Linear(in_features=2, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=20, bias=True)
  (3): ReLU()
  (4): Linear(in_features=20, out_features=10, bias=True)
  (5): Dropout(p=0.3, inplace=False)
  (6): Softmax(dim=1)
)

In [38]:
s(torch.FloatTensor([[1, 2]]))

tensor([[0.0856, 0.0722, 0.0618, 0.1016, 0.1192, 0.1016, 0.1534, 0.0887, 0.1278,
         0.0880]], grad_fn=<SoftmaxBackward0>)
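
The `nn.Module` methods listed earlier can be tried on a small network. In this sketch the names `net` and `clone` and the layer sizes are chosen for illustration: `parameters()` is used to count the trainable values, and `state_dict()` with `load_state_dict()` copies the weights from one network into another.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(2, 5), nn.ReLU(), nn.Linear(5, 3))

# parameters(): every weight and bias tensor of the module
n_params = sum(p.numel() for p in net.parameters())
print("Trainable values:", n_params)

# state_dict() / load_state_dict(): copy the weights into a second network
clone = nn.Sequential(nn.Linear(2, 5), nn.ReLU(), nn.Linear(5, 3))
clone.load_state_dict(net.state_dict())

x = torch.tensor([[1.0, 2.0]])
print(list(net.state_dict().keys()))
print(torch.equal(net(x), clone(x)))   # same weights, same output
```

The count works out to $2\times 5 + 5$ values for the first linear layer plus $5\times 3 + 3$ for the second.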

## 5. Custom layers

1. By subclassing `nn.Module` class, we can create our own building blocks, which can be stacked together, reused later, and integrated into the Pytorch framework flawlessly

2. At its core, `nn.Module` provides rich functionality to its children:
    1. track all submodules that the current module includes
        - to keep track of (register) a submodule, we just need to assign it to a field of the class
    2. provide functions to deal with all parameters of the registered submodules:
        - `parameters()` method: get the full list of the module's parameters
        - `zero_grad()` method: zero its gradients
        - `to(device)` method: move the module to a device
        - `state_dict()` method: serialize the module
        - `load_state_dict()` method: deserialize the module
        - `apply()` method: perform a generic transformation using our own callable
    3. establish the convention of `Module` application to data
        - every module needs to perform its data transformation in the `forward()` method by overriding it
    4. advanced use cases:
        - functions to register a hook function that tweaks the module's transformation or the gradient flow

3. to create a custom module, we usually have to do only $2$ things:
    1. register submodules
    2. implement the `forward()` method

Refer to [modules script](01_modules.py):

In [39]:
import torch
import torch.nn as nn


class OurModule(nn.Module):
    def __init__(self, num_inputs, num_classes, dropout_prob=0.3):
        super(OurModule, self).__init__()
        self.pipe = nn.Sequential(
            nn.Linear(num_inputs, 5),
            nn.ReLU(),
            nn.Linear(5, 20),
            nn.ReLU(),
            nn.Linear(20, num_classes),
            nn.Dropout(p=dropout_prob),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        return self.pipe(x)

1. **Line 6**: pass the parameters

2. **Line 7**: call the parent's constructor to let it initialize itself

3. **Line 8-16**: create an `nn.Sequential` instance and assign it to our class's field named `pipe`
    - by doing this, we automatically register this module (`nn.Sequential` inherits from `nn.Module`, as does everything in the `nn` package)
    - after the constructor finishes, all such fields will be registered automatically
    - the `add_module()` function in `nn.Module` registers submodules explicitly (useful if our module can have a variable number of layers that need to be created programmatically)

4. **Line 18-19**: override the `forward()` function with our implementation of the data transformation
    - **Line 19** asks `self.pipe` to transform the data
    - note: to apply a module to the data, call the module as a callable (i.e. pretend that the module instance is a function and call it with the argument) instead of calling the `forward()` function of the `nn.Module` class
        - this is because `nn.Module` overrides the `__call__()` method, which is used when we treat an instance as a callable
        - this method does some `nn.Module` magic and calls our `forward()` method
        - if we call `forward()` directly, we'll interfere with the `nn.Module` duty, which can give wrong results

In [40]:
if __name__ == "__main__":
    net = OurModule(num_inputs=2, num_classes=3)
    print(net)
    v = torch.FloatTensor([[2, 3]])
    out = net(v)
    print(out)
    print("Cuda's availability is %s" % torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Data from cuda: %s" % out.to('cuda'))

OurModule(
  (pipe): Sequential(
    (0): Linear(in_features=2, out_features=5, bias=True)
    (1): ReLU()
    (2): Linear(in_features=5, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
    (5): Dropout(p=0.3, inplace=False)
    (6): Softmax(dim=1)
  )
)
tensor([[0.3673, 0.3164, 0.3164]], grad_fn=<SoftmaxBackward0>)
Cuda's availability is False


1. **Line 2, 4-5**: create our module with the desired params, create an input tensor, and ask the module to transform it by calling it as a callable

2. **Line 3**: print the network's structure (`nn.Module` overrides `__repr__()`, which `print()` falls back on, to represent the inner structure in a nice way)

3. **Line 6**: print the result of the transformation
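
Two points from this section lend themselves to a quick experiment: registering a variable number of layers with `add_module()`, and the difference between calling a module and calling its `forward()` directly. The class name `StackedModule` and the layer sizes are hypothetical, chosen only for this sketch. A forward hook is used as a visible trace of the extra work that `__call__()` performs.

```python
import torch
from torch import nn

class StackedModule(nn.Module):   # hypothetical name, for illustration only
    def __init__(self, sizes):
        super().__init__()
        # Register a variable number of layers programmatically
        for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
            self.add_module(f"fc{i}", nn.Linear(n_in, n_out))

    def forward(self, x):
        for layer in self.children():   # registration order is preserved
            x = torch.relu(layer(x))
        return x

net = StackedModule([2, 5, 3])
calls = []
net.register_forward_hook(lambda module, inputs, output: calls.append(tuple(output.shape)))

x = torch.tensor([[2.0, 3.0]])
net(x)            # goes through __call__(), so the hook fires
net.forward(x)    # bypasses __call__(), so the hook is skipped
print(calls)
print([name for name, _ in net.named_children()])
```

Only one entry ends up in `calls`, which shows that the direct `forward()` call silently skipped the hook machinery.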

## 6. Loss functions and optimizers

1. We also need to define our learning objective: the `loss function`
    - it accepts $2$ arguments: the network's output and the desired output
    - it returns a single number: the `loss value`, which measures how close the network's prediction is to the desired result
    - we use the `loss value` to calculate gradients of network parameters and adjust them to decrease the loss value, which pushes our model to better results in the future

### 6.1 Loss functions

1. `loss functions` reside in the `nn` package and are implemented as `nn.Module` subclasses. The most commonly used standard loss functions:
    1. `nn.MSELoss`: standard loss for regression problems
    2. `nn.BCELoss` and `nn.BCEWithLogitsLoss`: for binary classification problems
        - `nn.BCELoss` expects a single probability value (output of the Sigmoid layer)
        - `nn.BCEWithLogitsLoss` assumes raw scores as input and applies Sigmoid itself - more numerically stable and efficient
    3. `nn.CrossEntropyLoss` and `nn.NLLLoss`: maximum likelihood criteria for multi-class classification problems
        - `nn.CrossEntropyLoss` expects raw scores for each class and applies `LogSoftmax` internally
        - `nn.NLLLoss` expects log probabilities as input
    4. we are free to write our own `Module` subclass to compare output and target
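
The relationship between the paired losses can be checked numerically. The sketch below uses random scores (the tensors are made up for the example) to show that `nn.CrossEntropyLoss` on raw scores matches `nn.NLLLoss` on `LogSoftmax` output, and that `nn.BCEWithLogitsLoss` on raw scores matches `nn.BCELoss` on Sigmoid output.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 3)              # raw scores: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])    # class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

bin_logits = torch.randn(4)             # raw scores for a binary problem
bin_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])

bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print(torch.allclose(ce, nll), torch.allclose(bce_logits, bce, atol=1e-6))
```

In practice this means a network trained with `nn.CrossEntropyLoss` or `nn.BCEWithLogitsLoss` should output raw scores, without a final Softmax or Sigmoid layer.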


### 6.2 Optimizers

1. Responsibility: take the gradients of the model params and change these params in order to decrease the loss value. Pytorch provides lots of popular optimizer implementations in the `torch.optim` package. Most widely known:
    1. `SGD`: vanilla stochastic gradient descent algorithm with an optional momentum extension
    2. `RMSprop`: proposed by Geoffrey Hinton
    3. `Adagrad`: adaptive gradients optimizer
    4. `Adam`: a combination of both `RMSprop` and `Adagrad`

2. All optimizers expose a unified interface --> easy to experiment with different optimization methods
    - on construction, we need to pass an iterable of tensors, which will be modified during the optimization process
    - usual practice: pass the result of the `parameters()` call of the upper-level `nn.Module` instance, which returns an iterable of all leaf tensors with gradients

3. Let's discuss the common blueprint of a training loop:
```python
for batch_x, batch_y in iterate_batches(data, batch_size=N):
    batch_x_t = torch.tensor(batch_x)
    batch_y_t = torch.tensor(batch_y)
    out_t = net(batch_x_t)
    loss_t = loss_function(out_t, batch_y_t)
    loss_t.backward()
    optimizer.step()
    optimizer.zero_grad()
```
- **Line 1**: iterate over our data over and over again
    - `epoch`: one iteration over the full set of examples
    - data is split into batches of equal size, since it is usually too large to fit into CPU or GPU memory at once
- **Line 2-3**: every batch includes data samples and target labels, and both of them have to be tensors
- **Line 4**: pass data samples to our network
    - all transformations of our network are nothing more than a graph of operations with intermediate tensor instances
- **Line 5**: feed the network's output and target labels to the loss function
    - all transformations inside the loss function are also a graph (we can think of these $2$ graphs as one single combined computation graph)
    - result: a tensor holding one single loss value
- **Line 6**: every tensor in this computation graph remembers its parent, so to calculate gradients for the whole network, call the `backward()` function on the loss function result
    - this call unrolls the graph of the performed computations and calculates gradients for every leaf tensor with `requires_grad=True`
    - usually, such tensors are our model's params: weights and biases of FFNs, and convolution filters
    - every time a gradient is calculated, it is accumulated in the `tensor.grad` field, so one tensor can participate in a transformation multiple times and its gradients will be properly summed together
    - example: one single RNN cell could be applied to multiple input items
- **Line 7**: after the `loss_t.backward()` call is finished, we have the gradients accumulated, and now it's time for the optimizer to do its job
    - it takes all gradients from the params we have passed to it on construction and applies them
    - all this is done with the method `step()`
- **Line 8**: call `zero_grad()` to zero the gradients of the parameters
    - because gradients accumulate across `backward()` calls, they must be reset once per iteration; the call can equally be placed at the beginning of the training loop
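
To see the blueprint run end to end, here is a minimal self-contained sketch on synthetic data. Everything specific in it is an assumption made for illustration: the regression target $y = 3x + 1$, the single `nn.Linear` layer, the learning rate, and the simple `iterate_batches()` helper. It places `zero_grad()` at the beginning of each iteration.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Synthetic regression data (an assumption for this sketch): y = 3x + 1 plus noise
x = torch.rand(64, 1)
y = 3.0 * x + 1.0 + 0.01 * torch.randn(64, 1)

net = nn.Linear(1, 1)
loss_function = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

def iterate_batches(xs, ys, batch_size):
    # Hypothetical helper: yield consecutive mini-batches
    for i in range(0, len(xs), batch_size):
        yield xs[i:i + batch_size], ys[i:i + batch_size]

epoch_losses = []
for epoch in range(100):
    batch_losses = []
    for batch_x_t, batch_y_t in iterate_batches(x, y, batch_size=16):
        optimizer.zero_grad()                    # start from clean gradients
        out_t = net(batch_x_t)                   # forward pass builds the graph
        loss_t = loss_function(out_t, batch_y_t)
        loss_t.backward()                        # accumulate gradients in .grad
        optimizer.step()                         # apply them to the parameters
        batch_losses.append(loss_t.item())
    epoch_losses.append(sum(batch_losses) / len(batch_losses))

print(f"first epoch loss {epoch_losses[0]:.4f}, last epoch loss {epoch_losses[-1]:.4f}")
```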

## 7. Monitoring with `TensorBoard`

1. A list of things that we should observe during training:
    1. `loss value`: normally consists of several components like base loss and regularization losses
        - should monitor both the total loss and the individual components over time
    2. results of `validation` on training and test datasets
    3. statistics about gradients and weights
    4. values produced by the network:
        - in a classification problem: measure the entropy of predicted class probabilities
        - in a regression problem: raw predicted values can give tons of data about the training
    5. `learning rates` and other hyperparameters, if they are adjusted over time

2. the list could also include:
    1. domain-specific metrics:
        - word embedding projections
        - audio samples
        - images generated by GANs
    2. monitor values related to training speed:
        - how long an epoch takes (to see the effect of our optimizations or problems with hardware)
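
Two of the quantities from the list above, the entropy of predicted class probabilities and basic gradient statistics, can be computed in a few lines. This is only a sketch under assumed names and sizes (`net`, a batch of 8 random samples, 3 classes); in a real project these numbers would be sent to a monitoring tool rather than printed.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(4, 3)
x = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

logits = net(x)
loss = nn.CrossEntropyLoss()(logits, targets)
loss.backward()

# Entropy of the predicted class distribution, averaged over the batch
log_probs = torch.log_softmax(logits, dim=1)
entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()

# Simple gradient statistics per parameter tensor
grad_stats = {
    name: {"l2": p.grad.norm().item(), "max": p.grad.abs().max().item()}
    for name, p in net.named_parameters()
}

print(f"loss={loss.item():.4f} entropy={entropy.item():.4f}")
print(grad_stats)
```

An entropy close to $\ln 3$ here would mean the network is still guessing almost uniformly among the three classes.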

### 7.1 `TensorBoard` 101

1. `TensorBoard`: originally developed as part of `TensorFlow` (later moved to a separate project, but still maintained by Google) to observe and analyze various NN characteristics during and after training
    - a Python web service that we can start on our computer, passing it the directory where our training process will save values to be analyzed. Then we point our browser to `TensorBoard`'s port `6006`, and it shows us an interactive web interface with values updated in real time
    - `TensorBoard` still uses the `TensorFlow` data format, so we'll need to write this data from our Pytorch program
    - Pytorch supports this data format with the `torch.utils.tensorboard` package

### 7.2 Plotting metrics

Refer to [tensorboard script](02_tensorboard.py):

In [41]:
import math
from torch.utils.tensorboard.writer import SummaryWriter


if __name__ == "__main__":
    writer = SummaryWriter()

    funcs = {"sin": math.sin, "cos": math.cos, "tan": math.tan}

    for angle in range(-360, 360):
        angle_rad = angle * math.pi / 180
        for name, fun in funcs.items():
            val = fun(angle_rad)
            writer.add_scalar(name, val, angle)

    writer.close()

1. **Line 6**: by default, `SummaryWriter()` creates a unique directory inside the `runs` directory for every launch, so that different rounds of training can be compared
    - the name of the new directory includes:
        1. current date and time
        2. host name
    - to override this, pass the `log_dir` argument to `SummaryWriter()`
    - we can also add a suffix to the name of the default directory by passing a `comment` argument
        - e.g. to capture different experiments' semantics, such as `dropout=0.3` or `strong_regularisation`

2. **Line 14**: every value is added to the writer using the `add_scalar()` function

3. **Line 16**: close the writer
    - note: the writer does a periodic flush (by default, every $2$ mins), so even in the case of a lengthy optimization process, we'll see our values
    - use the `flush()` method to flush `SummaryWriter()` data explicitly

4. Running this produces no output on the console
    - but we'll see a new directory created inside the `runs` directory with a single file
    - start `TensorBoard` and look at the result:
        1. open a terminal
        2. type `tensorboard --logdir runs` inside the `/Chapter03` folder
        3. open `http://localhost:6006/`
    - to run on a remote server and make it accessible from other machines:
        - `tensorboard --logdir runs --bind_all`
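
The `log_dir` argument and the `flush()` method can be tried without touching the `runs` directory. The sketch below writes into a throwaway temporary directory (an assumption made here to keep the example self-contained) and lists the files that exist right after an explicit flush. It requires the `tensorboard` package to be installed, as does the script above.

```python
import math
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

# A throwaway directory, used here only to keep the sketch self-contained
log_dir = tempfile.mkdtemp(prefix="tb_demo_")

writer = SummaryWriter(log_dir=log_dir)   # overrides the default naming under runs
for step in range(100):
    writer.add_scalar("sin", math.sin(step * math.pi / 50), step)
writer.flush()    # push pending events to disk without closing the writer
written = os.listdir(log_dir)
writer.close()

print(written)
```

Pointing `tensorboard --logdir` at that directory would then show the logged `sin` curve.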

## 8. GAN on Atari images

1. To show the power of DL, as an example, let's train a GAN to generate screenshots of various Atari games
    - practical usage of GAN: image quality improvement, realistic image generation, and feature learning
    - no practical usage in this example, but it'll be a good showcase of everything we have learned about Pytorch so far
    - refer to [atari gan script](03_atari_gan.py):
    

In [None]:
#!/usr/bin/env python
import cv2
import time
import random
import argparse
import typing as tt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard.writer import SummaryWriter

import torchvision.utils as vutils

import gymnasium as gym
from gymnasium import spaces

import numpy as np

# Register Atari environments
import ale_py
gym.register_envs(ale_py)

# Configure gymnasium logger - set minimum level directly
gym.logger.min_level = gym.logger.WARN  # Use gym.logger.ERROR for less verbose output

LATENT_VECTOR_SIZE = 100
DISCR_FILTERS = 64
GENER_FILTERS = 64
BATCH_SIZE = 16

# dimension input image will be rescaled
IMAGE_SIZE = 64

LEARNING_RATE = 0.0001
REPORT_EVERY_ITER = 100
SAVE_IMAGE_EVERY_ITER = 1000


class InputWrapper(gym.ObservationWrapper):
    """
    Preprocessing of input numpy array:
    1. resize image into predefined size
    2. move color channel axis to a first place
    """
    def __init__(self, *args):
        super(InputWrapper, self).__init__(*args)
        old_space = self.observation_space
        assert isinstance(old_space, spaces.Box)
        self.observation_space = spaces.Box(
            self.observation(old_space.low), self.observation(old_space.high),
            dtype=np.float32
        )

    def observation(self, observation: gym.core.ObsType) -> gym.core.ObsType:
        # resize image
        new_obs = cv2.resize(
            observation, (IMAGE_SIZE, IMAGE_SIZE))
        # transform (h, w, c) -> (c, h, w)
        new_obs = np.moveaxis(new_obs, 2, 0)
        return new_obs.astype(np.float32)


class Discriminator(nn.Module):
    def __init__(self, input_shape):
        super(Discriminator, self).__init__()
        # this pipe converts the image into a single number
        self.conv_pipe = nn.Sequential(
            nn.Conv2d(in_channels=input_shape[0], out_channels=DISCR_FILTERS,
                      kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS, out_channels=DISCR_FILTERS*2,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS*2),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 2, out_channels=DISCR_FILTERS * 4,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 4),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 4, out_channels=DISCR_FILTERS * 8,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 8),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 8, out_channels=1,
                      kernel_size=4, stride=1, padding=0),
            nn.Sigmoid()
        )

    def forward(self, x):
        conv_out = self.conv_pipe(x)
        return conv_out.view(-1, 1).squeeze(dim=1)


class Generator(nn.Module):
    def __init__(self, output_shape):
        super(Generator, self).__init__()
        # pipe deconvolves input vector into (3, 64, 64) image
        self.pipe = nn.Sequential(
            nn.ConvTranspose2d(in_channels=LATENT_VECTOR_SIZE, out_channels=GENER_FILTERS * 8,
                               kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(GENER_FILTERS * 8),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 8, out_channels=GENER_FILTERS * 4,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 4),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 4, out_channels=GENER_FILTERS * 2,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 2),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 2, out_channels=GENER_FILTERS,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS, out_channels=output_shape[0],
                               kernel_size=4, stride=2, padding=1),
            nn.Tanh()
        )

    def forward(self, x):
        return self.pipe(x)


def iterate_batches(envs: tt.List[gym.Env],
                    batch_size: int = BATCH_SIZE) -> tt.Generator[torch.Tensor, None, None]:
    batch = [e.reset()[0] for e in envs]
    env_gen = iter(lambda: random.choice(envs), None)

    while True:
        e = next(env_gen)
        action = e.action_space.sample()
        obs, reward, is_done, is_trunc, _ = e.step(action)
        if np.mean(obs) > 0.01:
            batch.append(obs)
        if len(batch) == batch_size:
            batch_np = np.array(batch, dtype=np.float32)
            # Normalising input to [-1..1] and convert to tensor
            yield torch.tensor(batch_np * 2.0 / 255.0 - 1.0)
            batch.clear()
        if is_done or is_trunc:
            e.reset()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dev", default="cpu", help="Device name, default=cpu")
    args = parser.parse_args()

    device = torch.device(args.dev)
    envs = [
        InputWrapper(gym.make(name))
        for name in ('Breakout-v4', 'AirRaid-v4', 'Pong-v4')
    ]
    shape = envs[0].observation_space.shape

    net_discr = Discriminator(input_shape=shape).to(device)
    net_gener = Generator(output_shape=shape).to(device)

    objective = nn.BCELoss()
    gen_optimizer = optim.Adam(params=net_gener.parameters(), lr=LEARNING_RATE,
                               betas=(0.5, 0.999))
    dis_optimizer = optim.Adam(params=net_discr.parameters(), lr=LEARNING_RATE,
                               betas=(0.5, 0.999))
    writer = SummaryWriter()

    gen_losses = []
    dis_losses = []
    iter_no = 0

    true_labels_v = torch.ones(BATCH_SIZE, device=device)
    fake_labels_v = torch.zeros(BATCH_SIZE, device=device)
    ts_start = time.time()

    for batch_v in iterate_batches(envs):
        # fake samples, input is 4D: batch, filters, x, y
        gen_input_v = torch.FloatTensor(BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1)
        gen_input_v.normal_(0, 1)
        gen_input_v = gen_input_v.to(device)
        batch_v = batch_v.to(device)
        gen_output_v = net_gener(gen_input_v)

        # train discriminator
        dis_optimizer.zero_grad()
        dis_output_true_v = net_discr(batch_v)
        dis_output_fake_v = net_discr(gen_output_v.detach())
        dis_loss = objective(dis_output_true_v, true_labels_v) + \
                   objective(dis_output_fake_v, fake_labels_v)
        dis_loss.backward()
        dis_optimizer.step()
        dis_losses.append(dis_loss.item())

        # train generator
        gen_optimizer.zero_grad()
        dis_output_v = net_discr(gen_output_v)
        gen_loss_v = objective(dis_output_v, true_labels_v)
        gen_loss_v.backward()
        gen_optimizer.step()
        gen_losses.append(gen_loss_v.item())

        iter_no += 1
        if iter_no % REPORT_EVERY_ITER == 0:
            dt = time.time() - ts_start
            print(f"Iter {iter_no} in {dt:.2f}s: gen_loss={np.mean(gen_losses):.3e}, dis_loss={np.mean(dis_losses):.3e}")
            ts_start = time.time()
            writer.add_scalar("gen_loss", np.mean(gen_losses), iter_no)
            writer.add_scalar("dis_loss", np.mean(dis_losses), iter_no)
            gen_losses = []
            dis_losses = []
        if iter_no % SAVE_IMAGE_EVERY_ITER == 0:
            img = vutils.make_grid(gen_output_v.data[:64], normalize=True)
            writer.add_image("fake", img, iter_no)
            img = vutils.make_grid(batch_v.data[:64], normalize=True)
            writer.add_image("real", img, iter_no)

1. **Line 40-61**: the `InputWrapper` is a wrapper around a Gym game that applies several transformations:
    1. resize the input image from $210\times 160$ (standard Atari resolution) to $64\times 64$
    2. move the color plane of the image from the last position to the first, to match the Pytorch convention for convolution layers (input tensor with shape `[channels, height, width]`)
    3. cast the image from `bytes` to `float`

2. **Line 64-121**: define $2$ `nn.Module` classes:
    1. **Line 64-87**: `Discriminator` takes our scaled color image as input and, by applying $5$ layers of convolutions, converts it into a single number passed through a Sigmoid nonlinearity
        - Sigmoid output interpretation: the probability that `Discriminator` thinks our input image is from the real dataset
    2. **Line 90-117**: `Generator` takes as input a vector of random numbers (latent vector) and, by using the `transposed convolution` (aka `deconvolution`), converts this vector into a color image of the original resolution
    3. as input, we'll use screenshots from several Atari games played simultaneously by a random agent ![Sample screenshots from three Atari games](../images/figure_3-6.png)

3. **Line 124-141**: images are combined in batches that are generated by the `iterate_batches()` function:
    - `iterate_batches()` infinitely samples the environment from the provided list, issues random actions, and remembers observations in the batch list
    - when the batch reaches the required size, we normalize the images, convert them to a tensor, and yield it from the generator
    - the check for a non-zero mean of the observation is needed because of a bug in one of the games; it prevents the images from flickering

4. **Line 144-213**: our `main` function, which prepares models and runs the training loops:
    1. **Line 144-154**: process the command-line arguments (there is only one optional argument, `--dev`, which specifies the device to use for computations) and create our environment pool with a wrapper applied
        - this environment array will be passed to the `iterate_batches()` function later to generate training data
    2. **Line 156-164**: instantiate our objects: both networks, a loss function, $2$ optimizers, and a summary writer. Why $2$ optimizers? Because a GAN is trained in two alternating steps, each updating only one network:
        - to train the discriminator, we need to show it both real and fake data samples with appropriate labels ($1$ for real and $0$ for fake)
            - during this pass, we update only the discriminator's params
        - after that, we pass the generated samples through the discriminator again, but this time the labels are all $1$s and we update only the generator's weights
            - the second pass teaches the generator to fool the discriminator into mistaking generated samples for real ones
    3. **Line 166-172**: define lists to accumulate losses, an iteration counter, and tensors with the true and fake labels; also store the current timestamp to report the time elapsed after every $100$ iterations of training
    4. **Line 174-180**: at the beginning of each iteration of the training loop, generate a random vector and pass it to the `Generator` network
    5. **Line 182-190**: then train the discriminator by applying it twice:
        1. once to the true data samples in our batch
        2. once to the generated ones
        - we need to call the `detach()` method on the generator's output to prevent gradients of this training pass from flowing into the generator
            - `detach()` is a tensor method that returns a new tensor sharing the same data but disconnected from the operation that produced it, i.e. detached from the parent's computation graph
    6. **Line 192-198**: now, it's the generator's training time:
        - we pass the generator's output to the discriminator, but now we don't stop the gradients
        - instead, we apply the `objective()` function with the true labels (all $1$s)
        - this pushes our generator in the direction where the samples it generates make the discriminator mistake them for real data
    7. **Line 200-213**: report losses and feed the image samples to TensorBoard
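
The layer counts in item 2 can be checked with a little arithmetic. The sketch below assumes the standard output-size formulas for `nn.Conv2d` and `nn.ConvTranspose2d` with `dilation=1` and `output_padding=0`, and uses the `(kernel_size, stride, padding)` values from the `Discriminator` and `Generator` definitions. The helper names `conv_out`, `deconv_out` and `trace` are made up for this illustration.

```python
def conv_out(n, kernel, stride, padding):
    """Output side length of a convolution (dilation=1)."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_out(n, kernel, stride, padding):
    """Output side length of a transposed convolution (dilation=1, output_padding=0)."""
    return (n - 1) * stride - 2 * padding + kernel


# (kernel_size, stride, padding) per layer, as in the Discriminator definition
DISCR_LAYERS = [(4, 2, 1)] * 4 + [(4, 1, 0)]
# The Generator uses the same settings in reverse order
GENER_LAYERS = [(4, 1, 0)] + [(4, 2, 1)] * 4


def trace(size, layers, fn):
    """Return the side length before the first layer and after every layer."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = fn(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


discr_sizes = trace(64, DISCR_LAYERS, conv_out)   # image -> single number
gener_sizes = trace(1, GENER_LAYERS, deconv_out)  # latent vector -> image
print(discr_sizes)
print(gener_sizes)
```

Under these assumptions the discriminator halves the $64\times 64$ input four times and the last layer collapses the remaining $4\times 4$ map into one value, while the generator walks the same sizes in the opposite direction.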

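The two labelling schemes used in the discriminator and generator passes can also be checked by hand. The sketch below is a pure-Python stand-in for the mean binary cross-entropy that `nn.BCELoss` computes by default. The helper name `bce` and the sample probabilities are invented for illustration; they are not outputs of the real networks.

```python
import math


def bce(probs, labels):
    """Mean binary cross-entropy: -(y*log(p) + (1-y)*log(1-p)) averaged over the batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical discriminator outputs (probability that the image is real)
d_real = [0.9, 0.8]  # on real screenshots
d_fake = [0.2, 0.1]  # on generated images

# Discriminator pass: real samples are labelled 1, generated samples 0
dis_loss = bce(d_real, [1.0, 1.0]) + bce(d_fake, [0.0, 0.0])

# Generator pass: the same generated samples, but scored against label 1
gen_loss = bce(d_fake, [1.0, 1.0])

print(f"dis_loss={dis_loss:.3f} gen_loss={gen_loss:.3f}")
```

With these made-up numbers the discriminator is doing well, so its loss is small, while the generator's loss is large: lowering it requires producing samples that the discriminator scores closer to $1$.
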
At the beginning, the generated images are completely random noise:
- after 10k-20k iterations, the generator becomes more and more proficient at its job and the generated images become more and more similar to the real game screenshots
- after 40k-50k iterations: ![Sample images produced by the generator network](../images/figure_3-7.png)


Let's take a look at the visualized results after running the code:
1. Generator Loss: ![Generator Loss](../images/ch03_gen_loss.png)
2. Discriminator Loss: ![Discriminator Loss](../images/ch03_disc_loss.png)
3. Example of real image at 197k iterations: ![Real image examples](../images/ch03_197k_iter_real_img.png)
4. Generated images after 32k iterations: ![32k generated images](../images/ch03_32k_iter_fake%20gen.png)
5. Generated images after 57k iterations: ![57k generated images](../images/ch03_32k_iter_fake%20gen.png)
6. Generated images after 197k iterations: ![197k generated images](../images/ch03_32k_iter_fake%20gen.png)


We'll simplify our code by using the add-on Pytorch library, Ignite

## 9. Pytorch Ignite

1. Pytorch's flexibility comes at a price: too much code has to be written to solve our problems. Sometimes, when we work on routine tasks, we don't need this flexibility
    - a non-exhaustive list of topics that are an essential part of any DL training procedure, but require some code to be written:
        1. data preparation and transformation, and the generation of batches
        2. calculation of training metrics (e.g. loss values, accuracy, F1-scores, etc.)
        3. periodic testing of the model being trained on the test and validation datasets
        4. model checkpointing after some number of iterations or when a new best metric is achieved
        5. sending metrics into a monitoring tool like TensorBoard
        6. changing hyperparameters over time, like a learning rate decrease/increase schedule
        7. writing training progress messages on the console
    - as these tasks occur in any DL project, it quickly becomes cumbersome to write the same code over and over again
    - the normal approach to solving the issue is to write the functionality once, wrap it into a library (ideally open source), and reuse it later
    - several libraries for Pytorch that simplify the solving of common tasks:
        1. `ptlearn`
        2. `fastai`
        3. `ignite`
        4. for RL, [`PTAN`](https://github.com/Shmuma/ptan) written by the author
    - [Pytorch ecosystem projects](https://landscape.pytorch.org/) 

### 9.1 Ignite concepts

1. At a high level, [`Ignite`](https://pytorch-ignite.ai/) simplifies the writing of the training loop in Pytorch DL
    - `Engine` class: the central piece, loops over the data source, applying the processing function to each data batch
    - `Events`: points in the training loop at which our handler functions are called, which could be at:
        1. beginning/end of the whole training process
        2. beginning/end of a training epoch (iteration over the data)
        3. beginning/end of a single batch processing
    - additionally, custom events allow us to specify our functions to be called every $N$ events

2. Example:
```python
from ignite.engine import Engine, Events

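def training(engine, batch):
    # model, optimizer, loss_fn and prepare_batch() are assumed to be defined elsewhere
    optimizer.zero_grad()
    # split the raw batch into inputs and targets
    x, y = prepare_batch(batch)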
    y_out = model(x)
    loss = loss_fn(y_out, y)
    loss.backward()
    optimizer.step()
    return loss.item()

engine = Engine(training)
engine.run(data)
```
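
To make the `Engine`/`Events` idea concrete without installing anything, here is a toy, pure-Python stand-in for the pattern. It is not Ignite's implementation: the class `ToyEngine`, its string event names, and the `process` function are all hypothetical and only mimic the shape of the API described above (a processing function called per batch, plus handlers fired at fixed points of the loop).

```python
from collections import defaultdict


class ToyEngine:
    """Toy stand-in for the Engine idea; not Ignite's implementation."""

    def __init__(self, process_fn):
        self.process_fn = process_fn
        self.handlers = defaultdict(list)
        self.iteration = 0
        self.output = None

    def on(self, event):
        # Decorator that registers a handler for the named event
        def register(fn):
            self.handlers[event].append(fn)
            return fn
        return register

    def _fire(self, event):
        for fn in self.handlers[event]:
            fn(self)

    def run(self, data):
        self._fire("started")
        for batch in data:
            self.iteration += 1
            self.output = self.process_fn(self, batch)
            self._fire("iteration_completed")
        self._fire("completed")


def process(engine, batch):
    # Stand-in for a training step: return the batch mean as the "loss"
    return sum(batch) / len(batch)


engine = ToyEngine(process)
log = []


@engine.on("iteration_completed")
def record(e):
    # "Every N iterations" behaviour written as a check inside the handler
    if e.iteration % 2 == 0:
        log.append((e.iteration, e.output))


@engine.on("completed")
def done(e):
    log.append(("done", e.iteration))


engine.run([[1, 2], [3, 4], [5, 6], [7, 8]])
print(log)
```

The training step knows nothing about logging, and the handlers know nothing about the model; that separation is the main thing the real library provides. In Ignite itself the "every N" filter is expressed on the event, e.g. `Events.ITERATION_COMPLETED(every=N)`, rather than inside the handler.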

### 9.2 GAN training on Atari using Ignite

Refer to [atari gan ignite script](04_atari_gan_ignite.py):

#!/usr/bin/env python
import random
import argparse
import cv2

import torch
import torch.nn as nn
import torch.optim as optim
from ignite.engine import Engine, Events
from ignite.handlers import Timer
from ignite.metrics import RunningAverage
from ignite.contrib.handlers import tensorboard_logger as tb_logger

import torchvision.utils as vutils
import gymnasium as gym
from gymnasium import spaces

import numpy as np

# Register Atari environments
import ale_py
gym.register_envs(ale_py)

# Configure gymnasium logger - set minimum level directly
gym.logger.min_level = gym.logger.WARN  # Use gym.logger.ERROR for less verbose output

LATENT_VECTOR_SIZE = 100
DISCR_FILTERS = 64
GENER_FILTERS = 64
BATCH_SIZE = 16

# dimension input image will be rescaled
IMAGE_SIZE = 64

LEARNING_RATE = 0.0001
REPORT_EVERY_ITER = 100
SAVE_IMAGE_EVERY_ITER = 1000


class InputWrapper(gym.ObservationWrapper):
    """
    Preprocessing of the input numpy array:
    1. resize the image to a predefined size
    2. move the color channel axis to the first position
    3. cast the values to float32
    """
    def __init__(self, *args):
        super(InputWrapper, self).__init__(*args)
        old_space = self.observation_space
        assert isinstance(old_space, spaces.Box)
        self.observation_space = gym.spaces.Box(self.observation(old_space.low), self.observation(old_space.high),
                                                dtype=np.float32)

    def observation(self, observation):
        # resize image
        new_obs = cv2.resize(observation, (IMAGE_SIZE, IMAGE_SIZE))
        # transform (h, w, c) -> (c, h, w)
        new_obs = np.moveaxis(new_obs, 2, 0)
        return new_obs.astype(np.float32)


class Discriminator(nn.Module):
    def __init__(self, input_shape):
        super(Discriminator, self).__init__()
        # this pipe converts the image into a single number
        self.conv_pipe = nn.Sequential(
            nn.Conv2d(in_channels=input_shape[0], out_channels=DISCR_FILTERS,
                      kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS, out_channels=DISCR_FILTERS*2,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS*2),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 2, out_channels=DISCR_FILTERS * 4,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 4),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 4, out_channels=DISCR_FILTERS * 8,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 8),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 8, out_channels=1,
                      kernel_size=4, stride=1, padding=0),
            nn.Sigmoid()
        )

    def forward(self, x):
        conv_out = self.conv_pipe(x)
        return conv_out.view(-1, 1).squeeze(dim=1)


class Generator(nn.Module):
    def __init__(self, output_shape):
        super(Generator, self).__init__()
        # pipe deconvolves input vector into (3, 64, 64) image
        self.pipe = nn.Sequential(
            nn.ConvTranspose2d(in_channels=LATENT_VECTOR_SIZE, out_channels=GENER_FILTERS * 8,
                               kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(GENER_FILTERS * 8),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 8, out_channels=GENER_FILTERS * 4,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 4),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 4, out_channels=GENER_FILTERS * 2,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 2),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 2, out_channels=GENER_FILTERS,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS, out_channels=output_shape[0],
                               kernel_size=4, stride=2, padding=1),
            nn.Tanh()
        )

    def forward(self, x):
        return self.pipe(x)


def iterate_batches(envs, batch_size=BATCH_SIZE):
    batch = [e.reset()[0] for e in envs]
    env_gen = iter(lambda: random.choice(envs), None)

    while True:
        e = next(env_gen)
        obs, reward, is_done, is_trunc, _ = e.step(e.action_space.sample())
        if np.mean(obs) > 0.01:
            batch.append(obs)
        if len(batch) == batch_size:
            # Normalising input between -1 to 1
            batch_np = np.array(batch, dtype=np.float32)
            yield torch.tensor(batch_np * 2.0 / 255.0 - 1.0)
            batch.clear()
        if is_done or is_trunc:
            e.reset()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dev", default="cpu",
                        help="Device name, default=cpu")
    args = parser.parse_args()

    device = torch.device(args.dev)
    envs = [InputWrapper(gym.make(name)) for name in ('Breakout-v4', 'AirRaid-v4', 'Pong-v4')]
    input_shape = envs[0].observation_space.shape

    net_discr = Discriminator(input_shape=input_shape).to(device)
    net_gener = Generator(output_shape=input_shape).to(device)

    objective = nn.BCELoss()
    gen_optimizer = optim.Adam(params=net_gener.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999))
    dis_optimizer = optim.Adam(params=net_discr.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999))

    true_labels_v = torch.ones(BATCH_SIZE, device=device)
    fake_labels_v = torch.zeros(BATCH_SIZE, device=device)

    def process_batch(trainer, batch):
        gen_input_v = torch.FloatTensor(BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1)
        gen_input_v.normal_(0, 1)
        gen_input_v = gen_input_v.to(device)
        batch_v = batch.to(device)
        gen_output_v = net_gener(gen_input_v)

        # train discriminator
        dis_optimizer.zero_grad()
        dis_output_true_v = net_discr(batch_v)
        dis_output_fake_v = net_discr(gen_output_v.detach())
        dis_loss = objective(dis_output_true_v, true_labels_v) + \
                   objective(dis_output_fake_v, fake_labels_v)
        dis_loss.backward()
        dis_optimizer.step()

        # train generator
        gen_optimizer.zero_grad()
        dis_output_v = net_discr(gen_output_v)
        gen_loss = objective(dis_output_v, true_labels_v)
        gen_loss.backward()
        gen_optimizer.step()

        if trainer.state.iteration % SAVE_IMAGE_EVERY_ITER == 0:
            fake_img = vutils.make_grid(gen_output_v.data[:64], normalize=True)
            trainer.tb.writer.add_image("fake", fake_img, trainer.state.iteration)
            real_img = vutils.make_grid(batch_v.data[:64], normalize=True)
            trainer.tb.writer.add_image("real", real_img, trainer.state.iteration)
            trainer.tb.writer.flush()
        return dis_loss.item(), gen_loss.item()

    engine = Engine(process_batch)
    tb = tb_logger.TensorboardLogger(log_dir=None)
    engine.tb = tb
    RunningAverage(output_transform=lambda out: out[1]).\
        attach(engine, "avg_loss_gen")
    RunningAverage(output_transform=lambda out: out[0]).\
        attach(engine, "avg_loss_dis")

    handler = tb_logger.OutputHandler(tag="train", metric_names=['avg_loss_gen', 'avg_loss_dis'])
    tb.attach(engine, log_handler=handler, event_name=Events.ITERATION_COMPLETED)

    timer = Timer()
    timer.attach(engine)

    @engine.on(Events.ITERATION_COMPLETED)
    def log_losses(trainer):
        if trainer.state.iteration % REPORT_EVERY_ITER == 0:
            print(f"Iter {trainer.state.iteration} in {timer.value():.2f}s: "
                  f"gen_loss={trainer.state.metrics['avg_loss_gen']:.3e}, "
                  f"dis_loss={trainer.state.metrics['avg_loss_dis']:.3e}")
            timer.reset()

    engine.run(data=iterate_batches(envs))

1. **Line 9-12**: import `Ignite` classes:
    1. `ignite.metrics` contains classes related to working with the performance metrics of the training process
        - in this case, we use `RunningAverage` to smooth time series values
        - previously we used `np.mean()` on an array of losses
    2. `tensorboard_logger` from the `ignite.contrib` package (the functionality contributed by others)
    3. `Timer` handler provides a simple way to calculate time elapsed between certain events

2. **Line 159-188**: `process_batch()` function takes the data batch and does an update of both the discriminator and generator models on this batch
    - it can return any data to be tracked during the training process (in this case, the $2$ loss values, one per model)
    - it also saves images to be displayed in TensorBoard

3. **Line 190-202**: create our engine, passing our `process_batch()` function and attaching $2$ `RunningAverage` transformations for our $2$ loss values
    - once attached, every `RunningAverage` produces a so-called `metric`, a derived value kept around during the training process
    - names of our smoothed metrics:
        1. `avg_loss_gen` for smoothed loss from the generator
        2. `avg_loss_dis` for smoothed loss from the discriminator
    - these $2$ values will be written to TensorBoard after every iteration
    - also attach the timer; created without any constructor arguments, it acts as a simple manually controlled timer (we call its `reset()` method ourselves), but with other configuration options it can work in a more flexible way

4. **Line 204-212**: attach another event handler, our own function, which `Engine` calls on every iteration completion:
    - it writes a log line with the iteration index, the time taken, and the values of the smoothed metrics
    - **Line 212** starts our engine, passing the output of the already defined `iterate_batches()` function as the data source
        - `iterate_batches()` is a generator function, so calling it returns a normal iterator over batches, which is perfectly fine to pass as the `data` argument
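
The smoothing done by `RunningAverage` differs from the `np.mean()` over a window used in the plain Pytorch version. As a rough sketch, assuming it behaves as an exponential moving average with a smoothing factor `alpha` (Ignite documents a default of `0.98`), the update can be written in a few lines of pure Python. The class `ToyRunningAverage` is a hypothetical illustration, not the library's code.

```python
class ToyRunningAverage:
    """Exponential moving average; a sketch, not Ignite's class."""

    def __init__(self, alpha=0.98):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            # The first observation initialises the average
            self.value = x
        else:
            # Keep alpha of the old value, blend in (1 - alpha) of the new one
            self.value = self.alpha * self.value + (1.0 - self.alpha) * x
        return self.value


avg = ToyRunningAverage(alpha=0.5)  # small alpha so the effect is easy to see
smoothed = [avg.update(x) for x in [4.0, 2.0, 2.0, 0.0]]
print(smoothed)
```

Unlike a windowed mean, this needs no list of past losses and never has to be reset, which is why the Ignite version of the script drops the `gen_losses` and `dis_losses` lists.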