
# 4.3 PyTorch modules and batch processing

- Elements from `torch.nn.functional` are autograd-compliant functions which compute a result from the provided arguments alone.

- Subclasses of `torch.nn.Module` are losses and network components.

- Parameters are of type `torch.nn.Parameter`, which is a `Tensor` with `requires_grad = True` that many utility functions, in particular `torch.nn.Module.parameters()`, recognize as a model parameter.



- Functions and modules from `nn` process batches of inputs stored in a tensor whose first dimension indexes them, and produce a corresponding tensor with the same additional dimension.

- E.g. a fully connected layer $\mathbb{R}^C \to \mathbb{R}^D$ expects as input a tensor of size $N × C$ and computes a tensor of size $N × D$, where $N$ is the number of samples and can vary from one call to another. We come back to this in a second.

For instance, given that a sample from the MNIST dataset is a `28 × 28` grayscale image, a minibatch of `64` samples would be stored in a tensor of size `64 × 1 × 28 × 28`, and this is the type of tensor that a LeNet5 expects as input.
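The batch convention described above can be checked with a short sketch. The sizes `C = 10` and `D = 4` and the batch sizes are arbitrary choices for illustration, and the random tensor only stands in for real MNIST images.

```python
import torch
import torch.nn as nn

C, D = 10, 4                      # input / output feature sizes (arbitrary choices)
layer = nn.Linear(C, D)

# The same module handles any batch size N: only the first dimension changes.
for N in (1, 64, 117):
    out = layer(torch.randn(N, C))
    print(N, tuple(out.shape))    # (N, D)

# A minibatch of 64 grayscale 28x28 images, laid out as N x C x H x W.
minibatch = torch.randn(64, 1, 28, 28)
print(tuple(minibatch.shape))
```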

## The autograd-compliant function

In [10]:
import torch.nn.functional as F
import torch
import torch.nn as nn

In [6]:
x = torch.randn(4,2)
x

tensor([[-0.7706, -0.9017],
        [ 1.1591, -1.0446],
        [ 0.0979, -0.8242],
        [-0.6524, -0.4160]])

In [8]:
F.relu(x,inplace=True)

tensor([[0.0000, 0.0000],
        [1.1591, 0.0000],
        [0.0979, 0.0000],
        [0.0000, 0.0000]])

In [9]:
x

tensor([[0.0000, 0.0000],
        [1.1591, 0.0000],
        [0.0979, 0.0000],
        [0.0000, 0.0000]])

`inplace` indicates if the operation should modify the argument itself. This may be desirable to reduce the memory footprint of the processing.
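To make the difference concrete, here is a minimal sketch contrasting the default (out-of-place) call with the in-place one on a small hand-written tensor. Note that in-place operations can interfere with autograd when the overwritten values are still needed for the backward pass, so they are best used with care.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[-1.0, 2.0], [3.0, -4.0]])
x_before = x.clone()

y = F.relu(x)                  # out-of-place: returns a new tensor
print(torch.equal(x, x_before))   # True: x still holds its negative entries

z = F.relu(x, inplace=True)    # in-place: x itself is overwritten
print(x)                       # negatives replaced by zeros
print(z.data_ptr() == x.data_ptr())  # True: the result shares x's storage
```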

## nn.Linear

The module

```python
nn.Linear(in_features, out_features, bias=True)
```

implements an $\mathbb{R}^C \to \mathbb{R}^D$ fully connected layer. It takes as input a tensor of size $N × C$ and produces a tensor of size $N × D$.

In [11]:
f = nn.Linear(in_features=10, out_features=4)
for n,p in f.named_parameters():
    print(n,p)

weight Parameter containing:
tensor([[ 0.2760, -0.2507,  0.1026,  0.1083,  0.1244,  0.0929, -0.0557,  0.0760,
          0.0609,  0.2063],
        [ 0.2788,  0.2793, -0.2651,  0.1555,  0.3141,  0.2843,  0.1416, -0.1517,
         -0.2271,  0.3015],
        [ 0.0617, -0.2596,  0.2458,  0.0574, -0.0459, -0.0854,  0.2779,  0.2764,
          0.0516, -0.2394],
        [ 0.0197, -0.3107,  0.2328, -0.2008,  0.2526, -0.1454, -0.2113, -0.0471,
         -0.2837, -0.2571]], requires_grad=True)
bias Parameter containing:
tensor([-0.2396, -0.0426, -0.0525,  0.1604], requires_grad=True)
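The `weight` printed above has shape `out_features × in_features`, so the layer computes `x @ weight.T + bias`. The following sketch checks this by hand on a random batch; the batch size of 3 is an arbitrary choice.

```python
import torch
import torch.nn as nn

f = nn.Linear(in_features=10, out_features=4)
x = torch.randn(3, 10)                    # N = 3 samples, C = 10 features

y = f(x)                                  # calling the module
y_manual = x @ f.weight.t() + f.bias      # the same affine map written by hand

print(y.shape)
print(torch.allclose(y, y_manual, atol=1e-5))

# Total number of scalar parameters: 4 * 10 weights + 4 biases.
n_params = sum(p.numel() for p in f.parameters())
print(n_params)
```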


## The Module

```python
nn.MSELoss()
```

implements the Mean Square Error loss: the sum of the component-wise squared difference, divided by the total number of components in the tensors.

In [14]:
x = torch.tensor([[3.0]])
y = torch.tensor([[9.0]])
criterion = nn.MSELoss()   # instantiate the loss module first
loss = criterion(x, y)     # then call it on (input, target)
print(loss)

tensor(36.)
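A loss module is instantiated once and then called on an `(input, target)` pair. As a sketch of the definition given above, the value can be compared with the sum of squared differences divided by the number of components; the two small tensors are made up for illustration.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()                       # default reduction averages over all components
input = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.0, 0.0], [0.0, 4.0]])

loss = criterion(input, target)
by_hand = ((input - target) ** 2).sum() / input.numel()   # (0 + 4 + 9 + 0) / 4
print(loss, by_hand)
```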




- The first parameter of a loss is traditionally called the `input` and the second the `target`. These two quantities may be of different dimensions or even types for some losses (e.g. for classification).

Criteria traditionally did not accept a target with `requires_grad` set to `True`; recent PyTorch releases relax this for some losses, so the behaviour depends on the installed version.
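For a classification loss the input and target indeed differ in both shape and type. The sketch below uses `nn.CrossEntropyLoss` as one such criterion (the choice of loss, 5 samples and 3 classes are assumptions for illustration): the input holds one float score per class, while the target holds one integer class index per sample.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
input = torch.randn(5, 3, requires_grad=True)   # N = 5 samples, 3 class scores each (float)
target = torch.tensor([0, 2, 1, 1, 0])           # N class indices (int64), no gradient

loss = criterion(input, target)
loss.backward()                                   # the gradient flows back to the input

print(input.shape, target.shape, loss.shape)
print(input.dtype, target.dtype)
print(input.grad.shape)
```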

# Batch Processing

- Functions and modules from `nn` process samples by batches. This is motivated by the computational speed-up it induces.

- Memory transfers are slower than computation. Batch processing cuts down to one copy of the parameters to the cache per batch.

- It also cuts down the use of Python loops, which are awfully slow.

Let $f_1, \dots, f_D$ be some modules of a model, and $x_1, \dots, x_N$ be samples.
Evaluating
$f_D(f_{D-1}(\dots f_2(f_1(x_n))))$, $n = 1, \dots, N$, by batches is motivated by the following facts:

1. an $f(x)$ can be computed only if the parameters of $f$ and $x$ are in the cache memory,

2. moving data into the cache memory is slower than the computation per se,

3. the [parameters of the] $f_d$s cannot all fit in the cache memory at the same time,

4. several $x_n$ can [usually] fit in the cache memory at the same time.

Peak performance is achieved when computation never has to wait for data to be copied into memory. These copies, rather than the arithmetic, are the bottleneck to optimize.

Note that this is a rough explanation of the issues related to cache memory and batch processing. There are actually multiple levels of cache, sharing between computational cores, etc. Properly using highly parallel processing devices such as GPUs is a very difficult task, which explains in particular the complexity of drivers and libraries such as NVIDIA's cuDNN.
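A rough way to observe the effect is to evaluate the same model once sample by sample in a Python loop and once on the whole batch. The small two-layer model and the sizes below are arbitrary assumptions, and the measured times depend entirely on the machine, so only the equality of the results is meaningful here.

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10))
x = torch.randn(1000, 100)                 # N = 1000 samples

with torch.no_grad():
    t0 = time.perf_counter()
    # One forward pass per sample, driven by a Python loop.
    y_loop = torch.stack([model(x[n]) for n in range(x.size(0))])
    t1 = time.perf_counter()
    # One forward pass for the whole batch.
    y_batch = model(x)
    t2 = time.perf_counter()

print(torch.allclose(y_loop, y_batch, atol=1e-4))   # same values either way
print(f"loop: {t1 - t0:.4f}s  batch: {t2 - t1:.4f}s")
```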

## The autograd-compliant function

```python
F.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
```

Implements a 2d convolution, where `weight` is of dimension `D × C × h × w` and
contains the kernels, `bias` is of dimension `D`, `input` is of dimension `N × C × H × W`,
and the result (with the default stride, padding and dilation) is of dimension
`N × D × (H − h + 1) × (W − w + 1)`.

In [15]:
weight = torch.empty(5,4,2,3).normal_()
bias = torch.empty(5).normal_()
input = torch.empty(117,4,10,3).normal_()

output = F.conv2d(input,weight,bias)

In [16]:
output.shape

torch.Size([117, 5, 9, 1])

In [19]:
import torchvision

mnist_data = torchvision.datasets.FashionMNIST("./",download=True)


In [24]:
x = mnist_data.data[10].float().view(1,1,28,28)

In [25]:
x.shape

torch.Size([1, 1, 28, 28])

In [27]:
weight = torch.empty(5,1,3,3)
weight[0,0] = torch.tensor([[0.0,0.,0.],
                            [0.,1.,0.],
                            [0.,0.,0.]])

weight[1,0] = torch.tensor([[0.0,0.,0.],
                            [0.,1.,0.],
                            [0.,0.,0.]])

weight[2,0] = torch.tensor([[0.0,0.,0.],
                            [0.,1.,0.],
                            [0.,0.,0.]])

weight[3,0] = torch.tensor([[0.0,0.,0.],
                            [0.,1.,0.],
                            [0.,0.,0.]])

weight[4,0] = torch.tensor([[0.0,0.,0.],
                            [0.,1.,0.],
                            [0.,0.,0.]])

In [28]:
y = F.conv2d(x,weight)

In [29]:
y.shape

torch.Size([1, 5, 26, 26])
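Each kernel above has a single 1 at its centre, so every output channel should simply reproduce the input with its one-pixel border removed. The sketch below checks this on a random tensor standing in for the image (an assumption, so that it runs without downloading the dataset), and fills the kernels with `torch.zeros` plus one indexed assignment.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 28, 28)          # stand-in for one 28x28 image

weight = torch.zeros(5, 1, 3, 3)       # 5 output channels, 1 input channel, 3x3 kernels
weight[:, 0, 1, 1] = 1.0               # a single 1 at the centre of every kernel

y = F.conv2d(x, weight)
cropped = x[:, :, 1:-1, 1:-1]          # the input without its 1-pixel border

print(y.shape)
same = all(torch.allclose(y[:, d], cropped[:, 0]) for d in range(5))
print(same)
```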

```python
class torch.nn.Conv2d(in_channels, out_channels,
                    kernel_size, stride=1, padding=0, dilation=1,
                    groups=1, bias=True)
```
Wraps the convolution into a `Module`, with the kernels and biases stored as `Parameter`s
that are randomly initialized at creation.

In [33]:
f = nn.Conv2d(in_channels=4, out_channels=5, kernel_size=(2,3))

for n, p in f.named_parameters():
    print(n,p.size())

weight torch.Size([5, 4, 2, 3])
bias torch.Size([5])


In [35]:
x = torch.empty(100,4,28,28)
output = f(x)

output.size()

torch.Size([100, 5, 27, 26])
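Since the module only wraps the functional form, calling `F.conv2d` with the module's own `weight` and `bias` should give the same result. A minimal check, using a random input in place of the uninitialized one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Conv2d(in_channels=4, out_channels=5, kernel_size=(2, 3))
x = torch.randn(100, 4, 28, 28)

y_module = f(x)                                  # module call
y_functional = F.conv2d(x, f.weight, f.bias)     # same operation, parameters passed explicitly

print(y_module.shape)
print(torch.allclose(y_module, y_functional))
```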

## Padding, stride, and dilation


Convolutions have three additional parameters:

- The `padding` specifies the size of a zeroed frame added around the input,
- the `stride` specifies a step size when moving the kernel across the signal,
- the `dilation` modulates the expansion of the filter without adding weights.
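The effect of these three parameters on the output size can be sketched as follows. The helper `conv_out_size` is a hypothetical name introduced here; it implements the output-size formula given in the PyTorch `Conv2d` documentation, and the parameter combinations are arbitrary examples.

```python
import torch
import torch.nn.functional as F

def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    # Output length along one spatial dimension.
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

x = torch.randn(1, 1, 28, 28)
w = torch.randn(1, 1, 3, 3)

cases = [{}, {"padding": 1}, {"stride": 2}, {"dilation": 2},
         {"padding": 2, "stride": 3, "dilation": 2}]
for kwargs in cases:
    y = F.conv2d(x, w, **kwargs)
    # Actual spatial size next to the size predicted by the formula.
    print(kwargs, tuple(y.shape[-2:]), conv_out_size(28, 3, **kwargs))
```

With a `3 × 3` kernel, `padding=1` keeps the `28 × 28` size, `stride=2` roughly halves it, and `dilation=2` makes the kernel cover a `5 × 5` area without adding weights.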

```python
F.max_pool2d(input, kernel_size,
            stride=None, padding=0, dilation=1,
            ceil_mode=False, return_indices=False)
```

takes as input an $N × C × H × W$ tensor and a kernel size $(h, w)$, or $k$ interpreted as $(k, k)$, applies max-pooling on each channel of each sample separately, and produces (with the default stride and zero padding) an $N × C × ⌊H/h⌋ × ⌊W/w⌋$ output.

In [None]:
x = torch.empty(1, 2, 2, 6).random_(3)
print(x)
print(x.size())
F.max_pool2d(x, (1, 2))
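Because the cell above draws random values, a deterministic variant makes the result easier to follow. In this sketch the input is filled with `0, 1, ..., 11`, so each `1 × 2` window keeps the larger of two neighbouring values in a row.

```python
import torch
import torch.nn.functional as F

x = torch.arange(12.0).view(1, 1, 2, 6)
# x[0, 0] = [[0, 1, 2, 3,  4,  5],
#            [6, 7, 8, 9, 10, 11]]

y = F.max_pool2d(x, (1, 2))
print(y)
# y[0, 0] = [[1, 3,  5],
#            [7, 9, 11]]
print(y.size())   # torch.Size([1, 1, 2, 3])
```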

- As for convolution, pooling operations can be modulated through their stride and padding.
- While for convolution the default stride is 1, for pooling it is equal to the kernel size, but this is not obligatory. The default padding is zero.
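The following sketch shows the default stride next to an explicit `stride=1` on a small `4 × 4` input filled with `0, 1, ..., 15` (values chosen only so that the windows are easy to read).

```python
import torch
import torch.nn.functional as F

x = torch.arange(16.0).view(1, 1, 4, 4)

a = F.max_pool2d(x, 2)             # stride defaults to the kernel size: non-overlapping windows
b = F.max_pool2d(x, 2, stride=1)   # overlapping windows

print(a.shape, a[0, 0].tolist())   # 2 x 2 output
print(b.shape, b[0, 0].tolist())   # 3 x 3 output
```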

```python
class torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1,
                        return_indices=False, ceil_mode=False) 
```

Wraps the max-pooling operation into a Module.
As for convolutions, the kernel size is either a pair (h,w) or a single value k interpreted as (k,k).
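As a closing sketch, the module form can be compared with the functional one and combined with the other modules of this section. The small `nn.Sequential` stack, its 8 output channels and its kernel sizes are arbitrary assumptions for illustration, not a reference architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pool = nn.MaxPool2d(kernel_size=2)
x = torch.randn(64, 1, 28, 28)                    # a minibatch of 64 single-channel images

print(torch.equal(pool(x), F.max_pool2d(x, 2)))   # module and function agree
print(len(list(pool.parameters())))               # pooling has no trainable parameters

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),   # -> 64 x 8 x 24 x 24
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),      # -> 64 x 8 x 12 x 12
)
out = model(x)
print(out.shape)
```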