# Deep Learning with PyTorch

## Tensors
*Tensors* represent
* *multilinear maps* between vector spaces (mathematics)
* generic *n-dimensional arrays* (computer science)

In [1]:
import numpy as np
import torch

# Create an uninitialized 3x2 tensor of 32-bit floats
a = torch.FloatTensor(3, 2)
a

tensor([[1.3556e-19, 3.0097e+29],
        [7.1853e+22, 4.5145e+27],
        [1.8040e+28, 1.5769e-19]])

In [2]:
# Initialize the tensor (in-place) with zeros
a.zero_()

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

There are two types of methods in the PyTorch API:
* Functional ones that return transformed copies have standard names like `some_function()`
* In-place mutating operations will have a trailing underscore in their name, e.g. `some_function_()`
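
The naming convention can be checked directly on a small tensor. The snippet below is a minimal sketch; the tensor values are arbitrary and `add()` / `add_()` merely stand in for any functional / in-place pair.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])

# Functional variant: returns a new tensor (x is still [1., 2., 3.] here)
y = x.add(1.0)

# In-place variant: mutates x and returns the very same object
z = x.add_(1.0)

print(x)       # x was modified by add_()
print(y is x)  # False: y is a separate tensor
print(z is x)  # True: add_() returned x itself
```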

In [3]:
# Create a tensor from a standard collection
torch.FloatTensor([[1, 2, 3], [3, 2, 1]])

tensor([[1., 2., 3.],
        [3., 2., 1.]])

In [4]:
n = np.zeros(shape=(3, 2))
n

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [5]:
# Create a tensor from a numpy ndarray
b = torch.tensor(n)
b

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], dtype=torch.float64)

In [6]:
# Change the numpy array to 32-bit float
#  - The dtype carries over to the tensor
n = np.zeros(shape=(3, 2), dtype=np.float32)
torch.tensor(n)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In [7]:
# Alternatively specify a PyTorch dtype
n = np.zeros(shape=(3, 2))
torch.tensor(n, dtype=torch.float32)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])
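
One detail worth knowing about the conversions above: `torch.tensor()` copies the data, so later changes to the ndarray do not show up in the tensor. The sketch below contrasts it with `torch.from_numpy()`, which is not used elsewhere in this notebook and is shown only for comparison.

```python
import numpy as np
import torch

n = np.zeros(shape=(3, 2), dtype=np.float32)

copied = torch.tensor(n)      # copies the data
shared = torch.from_numpy(n)  # shares memory with the ndarray

# Mutate the numpy array in place
n[0, 0] = 7.0

print(copied[0, 0].item())  # 0.0 (the copy is unaffected)
print(shared[0, 0].item())  # 7.0 (the shared tensor sees the change)
```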

In [8]:
a = torch.tensor([1, 2, 3])
a

tensor([1, 2, 3])

In [9]:
# Scalar tensors can be results of some aggregations
s = a.sum()
s

tensor(6)

In [10]:
# There's a convenient method to access the value of a scalar tensor
s.item()

6

In [11]:
torch.tensor(42)

tensor(42)

### Tensor Operations
Each tensor has an associated `device` where the computation takes place. The options are:
* `cpu` - computation takes place on the CPU
* `cuda` or `cuda:<index>` - computation takes place on the GPU (with a device id `<index>`)

In [12]:
# Determine computation device based on CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.FloatTensor([2, 3])

# Move the tensor to GPU (if there's CUDA available)
a = a.to(device)
a

tensor([2., 3.])

In [13]:
a.device

device(type='cpu')

### Tensors and Gradients
Each tensor has the following attributes related to automatic gradient computation:
* `grad` is a property holding computed gradients (tensor of the same shape)
* `is_leaf` is true if the tensor was constructed by a user and false if it's a result of a computation
* `requires_grad` is true if the tensor requires gradients to be computed

In [14]:
# Define some tensors
#  - The first one requires gradients to be computed
v1 = torch.tensor([1.0, 1.0], requires_grad=True)
v2 = torch.tensor([2.0, 2.0])

# Define a computational graph on these tensors
#  - Notice: The result contains a function computing the gradient.
v_sum = v1 + v2
v_res = (v_sum * 2).sum()
v_res

tensor(12., grad_fn=<SumBackward0>)

In [15]:
v1.is_leaf, v2.is_leaf

(True, True)

In [16]:
v_sum.is_leaf, v_res.is_leaf

(False, False)

In [17]:
v1.requires_grad, v2.requires_grad

(True, False)

In [18]:
v_sum.requires_grad, v_res.requires_grad

(True, True)

In [19]:
# Calculate the gradients of our graph
v_res.backward()

# Show backpropagated gradients in v1
v1.grad

tensor([2., 2.])

In [20]:
# v2 does not require gradients, so its `grad` is None (nothing is shown)
v2.grad
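
A related detail is that `backward()` adds to `grad` rather than overwriting it, which is why gradients have to be reset between optimization steps. The following is a minimal sketch on a fresh tensor; the names `w` and `c` are made up for this illustration.

```python
import torch

w = torch.tensor([1.0, 1.0], requires_grad=True)
c = torch.tensor([2.0, 2.0])  # no gradients requested

# First backward pass: d/dw sum(2 * (w + c)) = 2 for each element
((w + c) * 2).sum().backward()
first = w.grad.clone()

# Second backward pass on a fresh graph: gradients are added to w.grad
((w + c) * 2).sum().backward()
second = w.grad.clone()

# Reset the accumulated gradients in place
w.grad.zero_()

print(first)   # tensor([2., 2.])
print(second)  # tensor([4., 4.])
print(c.grad)  # None
```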

## Neural Network Building Blocks

In [21]:
import torch.nn as nn  # noqa

# Construct a 2-to-5 dense layer with an implicit bias
#  - Note: Weights of this layer are randomly initialized.
dense = nn.Linear(2, 5)

# Each PyTorch NN module acts as a callable
inputs = torch.FloatTensor([1, 2])
dense(inputs)

tensor([-0.2939, -1.2834,  0.0977,  1.6714,  0.0031], grad_fn=<AddBackward0>)

Some important methods from the PyTorch API:
* `parameters()` returns an iterator over all parameters of the module (these require gradients by default)
* `zero_grad()` resets the gradients of all parameters
* `to(device)` moves the module and its computation to a device
* `state_dict()` exports all weights to a dictionary for model serialization
* `load_state_dict()` is the opposite of the previous one and imports weights from such a dictionary
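
As a quick illustration of these methods, the sketch below copies the weights of one dense layer into another one via `state_dict()` and `load_state_dict()`. The layer sizes and the names `src` and `dst` are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

src = nn.Linear(2, 5)
dst = nn.Linear(2, 5)  # independently (randomly) initialized

# parameters() yields the weight matrix and the bias vector
n_params = sum(p.numel() for p in src.parameters())
print(n_params)  # 2 * 5 weights + 5 biases = 15

# state_dict() maps parameter names to tensors
state = src.state_dict()
print(list(state.keys()))  # ['weight', 'bias']

# Import the weights into the second layer
dst.load_state_dict(state)

# Both layers now compute the same function
x = torch.tensor([1.0, 2.0])
print(torch.equal(src(x), dst(x)))  # True
```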

In [22]:
# Build a sequential model with
#  - Three dense layers with ReLU activations
#  - A dropout layer
#  - And a softmax output over the feature dimension
model = nn.Sequential(
    nn.Linear(2, 5),
    nn.ReLU(),
    nn.Linear(5, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Softmax(dim=1),
)

model

Sequential(
  (0): Linear(in_features=2, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=20, bias=True)
  (3): ReLU()
  (4): Linear(in_features=20, out_features=10, bias=True)
  (5): ReLU()
  (6): Dropout(p=0.3, inplace=False)
  (7): Softmax(dim=1)
)

In [23]:
# Feed an input tensor through our sequential model
#  - There's a single instance in the input batch
model(torch.FloatTensor([[1, 2]]))

tensor([[0.1137, 0.0886, 0.1400, 0.1144, 0.0886, 0.0886, 0.0886, 0.0886, 0.0886,
         0.1005]], grad_fn=<SoftmaxBackward>)

### Custom Layers
Creating custom modules (layers) is as easy as inheriting from the `nn.Module` class and implementing the `forward()` method. Every module instance assigned to a field is automatically registered as a sub-module of this module.

Note that the convention is to call the module itself rather than its `forward()` method, because the `Module` class does some extra work in the `__call__` method.
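
One concrete piece of that extra work is running registered hooks. The sketch below, with an arbitrary layer and a made-up `record` hook, shows that a forward hook fires when the module is called but not when `forward()` is invoked directly.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 3)
calls = []


def record(module, inputs, output):
    # Remember the shape of every output the hook sees
    calls.append(tuple(output.shape))


handle = layer.register_forward_hook(record)
x = torch.zeros(4, 2)

layer(x)          # goes through __call__, so the hook fires
layer.forward(x)  # bypasses __call__, so the hook does not fire

handle.remove()
print(calls)  # [(4, 3)]
```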

In [24]:
from typing import TypeVar  # noqa

T = TypeVar("T", bound=torch.Tensor)


class MyModule(nn.Module):
    """Custom PyTorch module"""

    def __init__(
        self,
        n_inputs: int,
        n_outputs: int,
        dropout_prob: float = 0.3,
    ) -> None:
        super().__init__()
        # Build a sequential model
        #  - Every field that is a Module is auto-discovered
        self.pipe = nn.Sequential(
            nn.Linear(n_inputs, 5),
            nn.ReLU(),
            nn.Linear(5, 20),
            nn.ReLU(),
            nn.Linear(20, n_outputs),
            nn.Dropout(p=dropout_prob),
            nn.Softmax(dim=1),
        )

    def forward(self, x: T) -> T:
        # We must treat the sub-module as a callable!
        return self.pipe(x)


# Build an instance of this model and show its structure
net = MyModule(n_inputs=2, n_outputs=3)
net

MyModule(
  (pipe): Sequential(
    (0): Linear(in_features=2, out_features=5, bias=True)
    (1): ReLU()
    (2): Linear(in_features=5, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
    (5): Dropout(p=0.3, inplace=False)
    (6): Softmax(dim=1)
  )
)

In [25]:
# Feed an input batch to the model
net(torch.FloatTensor([[2, 3]]))

tensor([[0.3075, 0.3657, 0.3268]], grad_fn=<SoftmaxBackward>)

### Loss Functions and Optimizers

PyTorch includes a standard set of loss functions and of course allows simple implementation of custom ones. Here's a short list of some loss function classes:
* `nn.MSELoss` is the *mean squared error* typically used for regression problems
* `nn.BCELoss` and `nn.BCEWithLogitsLoss` are *binary cross-entropy* losses for binary classification problems - the former expects probabilities while the latter expects raw scores (usually preferable)
* `nn.CrossEntropyLoss` and `nn.NLLLoss` for multi-class classification problems
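
To make the relation between these pairs concrete, here is a small sketch on made-up scores and labels. It assumes the default `mean` reduction and only checks that each pair agrees once the activation is applied explicitly.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])  # raw scores
targets = torch.tensor([1.0, 0.0, 1.0])  # binary labels

# BCELoss expects probabilities, so the sigmoid is applied explicitly
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# BCEWithLogitsLoss applies the sigmoid internally
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(torch.allclose(loss_probs, loss_logits, atol=1e-6))  # True

# Multi-class analogue: CrossEntropyLoss = log-softmax followed by NLLLoss
scores = torch.tensor([[1.0, 2.0, 0.5]])  # one example, three classes
label = torch.tensor([1])                 # index of the true class

ce = nn.CrossEntropyLoss()(scores, label)
nll = nn.NLLLoss()(torch.log_softmax(scores, dim=1), label)

print(torch.allclose(ce, nll, atol=1e-6))  # True
```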

Similarly, there is a bunch of traditional optimizers such as vanilla `SGD`, `RMSprop`, `Adagrad` or the popular `Adam`. Finally, here's a typical training loop in PyTorch.
```python
import torch
import torch.optim as optim

# Define model and loss function
model = ...
loss_fn = ...

# Register all trainable parameters in an optimizer
optimizer = optim.Adam(params=model.parameters(), ...)

# Iterate over mini-batches of training data
for X_train, y_train in iterate_batches(data, batch_size=32):

    # Wrap examples and labels into tensors
    X_train = torch.tensor(X_train)
    y_train = torch.tensor(y_train)
    
    # Make predictions using the model
    y_pred = model(X_train)
    
    # Compute model's prediction loss
    loss = loss_fn(y_pred, y_train)
    
    # Compute gradients of the loss function w.r.t. all the weights
    #  - The loss is just a computation graph over tensors in the model
    #  - And because model's weights "require gradients" this calculates dL/dw
    loss.backward()
    
    # Perform one gradient descent step using computed gradients
    #  - Note: The optimizer has access to `grad` for registered `params`
    optimizer.step()
    
    # Clear gradients for this step
    #  - Alternatively this can be done at the beginning of a step
    #  - Note: This has the same effect as `model.zero_grad()` when the
    #    optimizer holds all of the model's parameters
    optimizer.zero_grad()
```
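
The template above leaves the model, the loss and the data open. A runnable sketch of the same loop follows, under loud assumptions: the data is synthetic (`y = 3x + 1`), the model is a single `nn.Linear`, and the learning rate, batch size and epoch count are arbitrary choices for this illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression data (assumption for illustration): y = 3x + 1
X = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * X + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(params=model.parameters(), lr=0.05)

losses = []
for epoch in range(200):
    # Iterate over mini-batches of 16 examples
    for start in range(0, len(X), 16):
        X_batch = X[start:start + 16]
        y_batch = y[start:start + 16]

        optimizer.zero_grad()  # clear gradients from the previous step
        loss = loss_fn(model(X_batch), y_batch)
        loss.backward()        # compute dL/dw for all parameters
        optimizer.step()       # update the weights

    # Record the loss of the last batch in this epoch
    losses.append(loss.item())

print(losses[-1] < losses[0])  # True: the loss went down
```

Here the gradients are cleared at the beginning of each step, which is the alternative placement mentioned in the comments of the template.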

## Monitoring with TensorBoard(X)
The following example shows how to log arbitrary metrics and investigate them with *TensorBoard*. Because TensorBoard expects data in *TensorFlow* format we use `tensorboardX` for easy integration (also because it's a dependency of PyTorch Ignite that we'll use later).

The example below computes values of a few trigonometric functions for varying angles and stores the output in the `runs/` directory. To view the results later, one can run TensorBoard with
```bash
tensorboard --logdir runs
```

In [27]:
import math  # noqa

from tensorboardX import SummaryWriter  # noqa

# Define some functions representing our metrics
funcs = {"sin": math.sin, "cos": math.cos, "tan": math.tan}

# Create a tensorboardX writer
#  - Note: The default output directory is './runs'
#  - Note 2: By default each call creates a new "run"
with SummaryWriter() as writer:

    # Register one metric per function
    for name, f in funcs.items():

        # Evaluate and record f on interval [-360, 360)
        for angle in range(-360, 360):
            val = f(angle * math.pi / 180)
            writer.add_scalar(name, val, angle)

In [28]:
!tree runs

runs
├── Mar10_17-31-44_mpc-xps
│   └── events.out.tfevents.1615393904.mpc-xps
└── Mar10_17-31-47_mpc-xps
    └── events.out.tfevents.1615393907.mpc-xps

2 directories, 2 files


## GAN on Atari Images
Let's train a *Generative Adversarial Network (GAN)* on randomly sampled screenshots from three Atari games and collect training metrics for TensorBoard.

In [34]:
!rm -rf runs

In [52]:
import random  # noqa
from typing import Any, Iterable, Sequence, Tuple  # noqa

import cv2  # noqa
import gym  # noqa
import torch.optim as optim  # noqa
import torchvision.utils as vutils  # noqa

# setup logging
log = gym.logger
log.set_level(gym.logger.INFO)

# Hyperparameters
IMAGE_SIZE = 64
LATENT_VECTOR_SIZE = 100
DISCR_FILTERS = 64
GENER_FILTERS = 64
BATCH_SIZE = 16
MAX_ITERS = 2_000
LEARNING_RATE = 0.0001
REPORT_PERIOD = 100
SAVE_IMAGE_PERIOD = 1000


class InputWrapper(gym.ObservationWrapper):
    """
    Preprocessing of the input numpy array:
    1. Resize the image to a predefined size
    2. Move the color channel axis to the first position
    """

    def __init__(self, *args: Any) -> None:
        super().__init__(*args)
        assert isinstance(self.observation_space, gym.spaces.Box)
        box = self.observation_space
        self.observation_space = gym.spaces.Box(
            self.observation(box.low),
            self.observation(box.high),
            dtype=np.float32,
        )

    def observation(self, observation: np.ndarray) -> np.ndarray:
        # Resize the image
        new_obs = cv2.resize(observation, (IMAGE_SIZE, IMAGE_SIZE))
        # Move the color dimension to the front
        #  - PyTorch's conv. layers expect shape [channels, height, width]
        new_obs = np.moveaxis(new_obs, 2, 0)
        return new_obs.astype(np.float32)
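
The axis shuffle in `observation()` can be checked on a dummy frame without any environment. This is a minimal sketch; the 210x160 frame size is an assumption chosen for illustration.

```python
import numpy as np

# A dummy RGB frame in image layout: [height, width, channels]
frame = np.zeros((210, 160, 3), dtype=np.uint8)
frame[..., 0] = 255  # fill only the first color channel

# Move the channel axis (index 2) to the front: [channels, height, width]
chw = np.moveaxis(frame, 2, 0)

print(frame.shape)  # (210, 160, 3)
print(chw.shape)    # (3, 210, 160)
print(chw[0].min(), chw[1].max())  # 255 0
```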

In [53]:
class Discriminator(nn.Module):
    """
    Converts an image into a single number
    representing the probability of positive class
    (the image being real).
    """

    def __init__(self, input_shape: Tuple[int, ...]) -> None:
        super().__init__()
        self.conv_pipe = nn.Sequential(
            nn.Conv2d(
                in_channels=input_shape[0],
                out_channels=DISCR_FILTERS,
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=DISCR_FILTERS,
                out_channels=DISCR_FILTERS * 2,
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.BatchNorm2d(DISCR_FILTERS * 2),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=DISCR_FILTERS * 2,
                out_channels=DISCR_FILTERS * 4,
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.BatchNorm2d(DISCR_FILTERS * 4),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=DISCR_FILTERS * 4,
                out_channels=DISCR_FILTERS * 8,
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.BatchNorm2d(DISCR_FILTERS * 8),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=DISCR_FILTERS * 8,
                out_channels=1,
                kernel_size=4,
                stride=1,
                padding=0,
            ),
            nn.Sigmoid(),
        )

    def forward(self, x: T) -> T:
        conv_out = self.conv_pipe(x)
        return conv_out.view(-1, 1).squeeze(dim=1)
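
To see why the discriminator ends up with a single number per image, it helps to trace the spatial size through the five convolutions. The sketch below uses the standard output-size formula for a convolution with dilation 1; the helper `conv_out` is a hypothetical name introduced only for this check.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel_size, stride, padding) of the five Conv2d layers
layers = [(4, 2, 1)] * 4 + [(4, 1, 0)]

size = 64  # IMAGE_SIZE
sizes = [size]
for kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [64, 32, 16, 8, 4, 1]
```

Each strided convolution halves the resolution and the last one collapses the remaining 4x4 map into a 1x1 output, which `forward()` then flattens into one value per image.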

In [54]:
class Generator(nn.Module):
    """Deconvolves input vector into 3x64x64 image"""

    def __init__(self, output_shape: Tuple[int, ...]) -> None:
        super().__init__()
        self.pipe = nn.Sequential(
            nn.ConvTranspose2d(
                in_channels=LATENT_VECTOR_SIZE,
                out_channels=GENER_FILTERS * 8,
                kernel_size=4,
                stride=1,
                padding=0,
            ),
            nn.BatchNorm2d(GENER_FILTERS * 8),
            nn.ReLU(),
            nn.ConvTranspose2d(
                in_channels=GENER_FILTERS * 8,
                out_channels=GENER_FILTERS * 4,
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.BatchNorm2d(GENER_FILTERS * 4),
            nn.ReLU(),
            nn.ConvTranspose2d(
                in_channels=GENER_FILTERS * 4,
                out_channels=GENER_FILTERS * 2,
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.BatchNorm2d(GENER_FILTERS * 2),
            nn.ReLU(),
            nn.ConvTranspose2d(
                in_channels=GENER_FILTERS * 2,
                out_channels=GENER_FILTERS,
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.BatchNorm2d(GENER_FILTERS),
            nn.ReLU(),
            nn.ConvTranspose2d(
                in_channels=GENER_FILTERS,
                out_channels=output_shape[0],
                kernel_size=4,
                stride=2,
                padding=1,
            ),
            nn.Tanh(),
        )

    def forward(self, x: T) -> T:
        return self.pipe(x)
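
The generator mirrors the discriminator: it starts from a 1x1 latent "image" and doubles the resolution until it reaches 64x64. The sketch below traces the sizes with the output-size formula for a transposed convolution, assuming dilation 1 and no output padding; `deconv_out` is a hypothetical helper name.

```python
def deconv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel_size, stride, padding) of the five ConvTranspose2d layers
layers = [(4, 1, 0)] + [(4, 2, 1)] * 4

size = 1  # latent vectors have shape [LATENT_VECTOR_SIZE, 1, 1]
sizes = [size]
for kernel, stride, padding in layers:
    size = deconv_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [1, 4, 8, 16, 32, 64]
```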

In [55]:
def normalize(batch: np.ndarray) -> np.ndarray:
    """Normalize to interval [-1, 1]"""
    return np.array(batch, dtype=np.float32) * 2.0 / 255.0 - 1.0


def iterate_batches(
    envs: Sequence[gym.Env],
    batch_size: int = BATCH_SIZE,
) -> Iterable[T]:
    # Initialize a buffer for current batch
    batch = [e.reset() for e in envs]

    # Create a sampler over the environments
    env_sampler = iter(lambda: random.choice(envs), None)

    while True:

        # Sample an action from random environment
        env = next(env_sampler)
        action = env.action_space.sample()

        # Get new observation for the action
        obs, _, done, _ = env.step(action)

        # Keep only non-blank frames (a hack to fix one of the envs)
        if np.mean(obs) > 0.01:
            batch.append(obs)

        # Generate new batch
        #  - Normalize values to [-1, 1]
        #  - Yield new tensor and clean the buffer
        if len(batch) == batch_size:
            yield torch.tensor(normalize(batch))
            batch.clear()

        # Restart environment when episode ends
        if done:
            env.reset()
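
The environment sampler in `iterate_batches` relies on the two-argument form of `iter()`, which keeps calling a function until it returns the sentinel value. Since `random.choice` never returns `None` here, the iterator never ends. A minimal sketch, with plain strings standing in for the environments:

```python
import itertools
import random

random.seed(0)

# Stand-ins for real environments (assumption for illustration)
envs = ["Breakout", "AirRaid", "Pong"]

# iter(callable, sentinel) calls the callable until it returns the sentinel
sampler = iter(lambda: random.choice(envs), None)

# Take a finite slice of the infinite iterator
picks = list(itertools.islice(sampler, 5))

print(len(picks))  # 5
print(all(p in envs for p in picks))  # True
```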

In [57]:
# Create new gym environments for 3 Atari games
games = ("Breakout-v0", "AirRaid-v0", "Pong-v0")
envs = [InputWrapper(gym.make(name)) for name in games]

input_shape = envs[0].observation_space.shape

# Create GAN components
discriminator = Discriminator(input_shape=input_shape).to(device)
generator = Generator(output_shape=input_shape).to(device)

# The objective is binary cross-entropy (binary classification)
loss = nn.BCELoss()

# Create two optimizers - one for each component
gen_optimizer = optim.Adam(
    params=generator.parameters(),
    lr=LEARNING_RATE,
    betas=(0.5, 0.999),
)

dis_optimizer = optim.Adam(
    params=discriminator.parameters(),
    lr=LEARNING_RATE,
    betas=(0.5, 0.999),
)

# Create labels for true and fake images
true_labels = torch.ones(BATCH_SIZE, device=device)
fake_labels = torch.zeros(BATCH_SIZE, device=device)

# Start recording of the training metrics for TensorBoard
with SummaryWriter() as writer:

    gen_losses = []
    dis_losses = []

    batch_iter = iter(iterate_batches(envs))

    # Main training loop
    for i in range(MAX_ITERS):
        real_inputs = next(batch_iter)

        # Move the input batch to selected device
        real_inputs = real_inputs.to(device)

        # Generate an equal-sized batch of codings for fake images
        #  - Latent vectors are drawn from N(0, 1)
        #  - And we move them to selected device too
        codings = torch.FloatTensor(BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1)
        codings.normal_(0, 1)
        codings = codings.to(device)

        # Generate fake images using the generator
        fake_inputs = generator(codings)

        # Train the discriminator
        #  - Note: We `detach` the generator's outputs so that this step
        #    does not backpropagate gradients into the generator
        dis_optimizer.zero_grad()
        pred_true = discriminator(real_inputs)
        pred_fake = discriminator(fake_inputs.detach())
        dis_loss = loss(pred_true, true_labels) + loss(pred_fake, fake_labels)
        dis_loss.backward()
        dis_optimizer.step()
        dis_losses.append(dis_loss.item())

        # Train the generator
        gen_optimizer.zero_grad()
        dis_pred = discriminator(fake_inputs)
        gen_loss = loss(dis_pred, true_labels)
        gen_loss.backward()
        gen_optimizer.step()
        gen_losses.append(gen_loss.item())

        # Collect training metrics
        if i % REPORT_PERIOD == 0:

            # Compute mean losses over buffered window
            mean_gen_loss = np.mean(gen_losses)
            mean_dis_loss = np.mean(dis_losses)

            log.info(
                "Iter: %.4d/%d Gen. Loss: %.3f Dis. Loss: %.3f",
                i,
                MAX_ITERS,
                mean_gen_loss,
                mean_dis_loss,
            )

            writer.add_scalar("loss_gen", mean_gen_loss, i)
            writer.add_scalar("loss_dis", mean_dis_loss, i)

            # Reset the loss buffers
            gen_losses = []
            dis_losses = []

        # Collect real and fake images
        if i % SAVE_IMAGE_PERIOD == 0:
            real_img = vutils.make_grid(real_inputs.data[:64], normalize=True)
            fake_img = vutils.make_grid(fake_inputs.data[:64], normalize=True)
            writer.add_image("img_real", real_img, i)
            writer.add_image("img_fake", fake_img, i)

INFO: Making new env: Breakout-v0
INFO: Making new env: AirRaid-v0
INFO: Making new env: Pong-v0
INFO: Iter: 0000/2000 Gen. Loss: 1.480 Dis. Loss: 1.415
INFO: Iter: 0100/2000 Gen. Loss: 5.438 Dis. Loss: 0.041
INFO: Iter: 0200/2000 Gen. Loss: 7.007 Dis. Loss: 0.005
INFO: Iter: 0300/2000 Gen. Loss: 7.449 Dis. Loss: 0.002
INFO: Iter: 0400/2000 Gen. Loss: 8.134 Dis. Loss: 0.084
INFO: Iter: 0500/2000 Gen. Loss: 7.594 Dis. Loss: 0.014
INFO: Iter: 0600/2000 Gen. Loss: 6.825 Dis. Loss: 0.006
INFO: Iter: 0700/2000 Gen. Loss: 8.798 Dis. Loss: 0.005
INFO: Iter: 0800/2000 Gen. Loss: 9.548 Dis. Loss: 0.120
INFO: Iter: 0900/2000 Gen. Loss: 5.932 Dis. Loss: 0.011
INFO: Iter: 1000/2000 Gen. Loss: 6.427 Dis. Loss: 0.011
INFO: Iter: 1100/2000 Gen. Loss: 6.506 Dis. Loss: 0.006
INFO: Iter: 1200/2000 Gen. Loss: 6.858 Dis. Loss: 0.004
INFO: Iter: 1300/2000 Gen. Loss: 7.343 Dis. Loss: 0.002
INFO: Iter: 1400/2000 Gen. Loss: 7.197 Dis. Loss: 0.003
INFO: Iter: 1500/2000 Gen. Loss: 6.564 Dis. Loss: 0.366
INFO: I

Not great, but we've trained the GAN for just a few iterations (2k) and it suffices as an example. Now, let's re-implement this using a higher-level library called *PyTorch Ignite*.