Convolutional neural networks (CNNs) form the backbone of the most accurate image classifiers around today. We now build a CNN to solve the problem from the previous chapter and show that it is quicker to train, more accurate, and less prone to overfitting.

For this new architecture we use ```nn.Sequential()```, which lets us create a chain of layers. When we use one of these chains in ```forward()```, the input goes through each layer of the chain in succession. This can be used to break the model into more logical arrangements. In this network there will be two chains: the **features block** and the **classifier**.
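
As a minimal sketch of the idea, two chains can be composed in ```forward()``` as shown here. The class name `TinyNet` and all layer sizes are placeholders chosen for illustration, not the architecture built later in this notebook.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical two-chain model: a features block and a classifier."""

    def __init__(self, num_classes=2):
        super().__init__()
        # Chain 1: convolutional feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        # Chain 2: fully connected classifier
        self.classifier = nn.Sequential(
            nn.Linear(8 * 4 * 4, num_classes),
        )

    def forward(self, x):
        x = self.features(x)       # input flows through every layer in order
        x = torch.flatten(x, 1)    # keep the batch dimension, flatten the rest
        return self.classifier(x)

tiny_out = TinyNet()(torch.randn(5, 3, 8, 8))  # batch of five 8x8 RGB images
print(tiny_out.shape)
```

Each chain is itself a module, so it can be called, printed, or swapped out on its own.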

**Convolutions**

```Conv2d()``` is a 2D convolution. A grayscale image is an array $x$ pixels wide and $y$ pixels high, with each entry holding a value that indicates whether the pixel is black, white, or somewhere in between (0 to 255 for an 8-bit image).

e.g., a $4 \times 4$ matrix:

$$\begin{bmatrix} 10 & 11 & 9 & 3 \\ 2 & 123 & 4 & 0 \\ 45 & 237 & 23 & 99 \\ 20 & 67 & 22 & 255 \end{bmatrix}$$

Next we introduce a *convolutional kernel* (also called a *filter*). This is another matrix, most likely smaller, which we drag across the image. Here's a $2 \times 2$ filter:

$$\begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}$$

To produce an output, we take the smaller filter and pass it over the original input, starting at the top left:

$$\begin{bmatrix} 10 & 11 \\ 2 & 123\end{bmatrix}\begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}$$

Then multiply each element in the matrix by its corresponding member in the other matrix and sum:

$(10\cdot1) + (11\cdot0) + (2\cdot1) + (123\cdot0) = 12$

Having done that, we move the filter across and begin again. But how far should we move the filter? In this case we move by 2, so the filter now covers:

$$\begin{bmatrix} 9 & 3 \\ 4 & 0\end{bmatrix}\begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}$$

This gives an output of 13. We now move down and back to the left and repeat. The final result is a *feature map*:

$$\begin{bmatrix} 12 & 13 \\ 65 & 45\end{bmatrix}$$
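
The worked example above can be checked with a few lines of plain Python. This is only a sketch of the sliding-window arithmetic (no padding, one channel); the helper name `convolve2d` is made up for illustration and is not how PyTorch implements convolution internally.

```python
def convolve2d(image, kernel, stride):
    """Slide `kernel` over `image` (lists of lists) with no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for top in range(0, len(image) - k_h + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k_w + 1, stride):
            # Multiply element-wise and sum over the window
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(total)
        out.append(row)
    return out

image = [[10, 11, 9, 3],
         [2, 123, 4, 0],
         [45, 237, 23, 99],
         [20, 67, 22, 255]]
kernel = [[1, 0],
          [1, 0]]

feature_map = convolve2d(image, kernel, stride=2)
print(feature_map)  # [[12, 13], [65, 45]]
```

Calling the same helper with `stride=1` visits more positions and returns a $3 \times 3$ map.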

A convolutional layer will have many of these filters, whose values are learned during training. In PyTorch's ```Conv2d```, each filter also has its own learned bias value (one per output channel) unless ```bias=False``` is passed:

```nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)```

```in_channels``` is the number of input channels the layer receives. At the beginning of the network we take in an RGB image as input, so the number of input channels is three. ```out_channels``` is the number of output channels, which corresponds to the number of filters in the conv layer. ```kernel_size``` describes the height and width of the filter. This can be a single scalar specifying a square (e.g., the first conv layer of our network uses an $11 \times 11$ filter), or a tuple (such as ```(3,5)``` for a $3 \times 5$ filter).

The next two parameters seem harmless enough, but they have a big effect on the downstream layers, and even on what that particular layer ends up looking at. ```stride``` indicates how many steps across the input we move when we shift the filter to a new position. In the previous example we used a stride of 2, which makes the feature map half the size of the input. We could also have moved by 1; to get a $4 \times 4$ feature map, the same size as the input, the filter would then also have to start on the last column and row. We can also pass a tuple ```(a,b)```, where $a$ is the step down (height) and $b$ is the step across (width). So if we move with a stride of 1, we eventually get to the point where the filter hangs over the edge of the input:

$$\begin{bmatrix} 3 & ? \\ 0 & ?\end{bmatrix}$$

We don't have enough elements in our input to do a full convolution. This is where the ```padding``` parameter comes in. If we give it a value of 1, our input looks like this:

$$\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 10 & 11 & 9 & 3 & 0\\ 0 & 2 & 123 & 4 & 0 & 0 \\ 0& 45 & 237 & 23 & 99 & 0  \\0 & 20 & 67 & 22 & 255 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\end{bmatrix}$$

Now when we get to the edge, our values covered by the filter are:

$$\begin{bmatrix} 3 & 0 \\ 0 & 0\end{bmatrix}$$

If we don't set padding, any edge cases that PyTorch encounters in the last columns of the input are simply thrown away. Just as with ```kernel_size``` and ```stride```, you can pass a tuple for height and width padding instead of a single number that pads the same in both directions.

**Pooling**

In conjunction with the conv layers, *pooling* layers reduce the resolution of the output of the previous layer, which gives us fewer parameters in the layers that follow. This compression results in faster computation, and it helps prevent overfitting.

Using a ```kernel_size``` of 3 and a ```stride``` of 2 on a $5 \times 3$ input (five values wide, three high):

$$\begin{bmatrix} 1 & 2 & 1 & 4 &1\\ 5 & 6& 1& 2& 5\\ 5 & 0& 0& 9& 6\end{bmatrix}$$

With the $3 \times 3$ kernel and a stride of $2$, we get two $3 \times 3$ tensors from pooling:

$$\begin{bmatrix} 1 & 2 & 1\\ 5 & 6 & 1\\ 5 & 0 & 0\end{bmatrix}$$

$$\begin{bmatrix} 1 & 4 & 1\\ 1 & 2 & 5\\ 0 & 9 & 6\end{bmatrix}$$

In ```MaxPool``` we take the maximum value from each of these tensors, giving us an output tensor of $[6, 9]$. Just as in the convolutional layers, there's a padding option to ```MaxPool``` that adds a border around the tensor in case the stride takes the window outside it. PyTorch's max-pooling layers pad implicitly with negative infinity rather than zero, so the border never wins the max.

We can pool with other functions aside from taking the max value from a kernel. A popular alternative is to take the average of the tensor values, which allows all of the tensor data to contribute to the pool instead of just one value as in the ```max``` case (and if you think about an image, you can imagine that you might want to consider the nearest neighbors of a pixel). PyTorch also provides ```AdaptiveMaxPool``` and ```AdaptiveAvgPool``` layers, which work independently of the incoming input tensor's dimensions: you specify the output size you want. Recommended: ```AdaptiveMaxPool```, ```MaxPool```, ```AvgPool```.
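
To make the pooling arithmetic concrete, here is a plain-Python sketch that extracts the two windows from the example input and applies both max and average pooling. The helper `pool_windows` is a hypothetical name used only for this illustration.

```python
def pool_windows(rows, kernel_size, stride):
    """Yield the flattened values of each kernel_size x kernel_size window."""
    for top in range(0, len(rows) - kernel_size + 1, stride):
        for left in range(0, len(rows[0]) - kernel_size + 1, stride):
            yield [rows[top + i][left + j]
                   for i in range(kernel_size)
                   for j in range(kernel_size)]

pool_input = [[1, 2, 1, 4, 1],
              [5, 6, 1, 2, 5],
              [5, 0, 0, 9, 6]]

windows = list(pool_windows(pool_input, kernel_size=3, stride=2))
max_pooled = [max(w) for w in windows]            # keeps one value per window
avg_pooled = [sum(w) / len(w) for w in windows]   # every value contributes
print(max_pooled)  # [6, 9]
print(avg_pooled)
```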

**Dropout**

One recurring issue with neural networks is their tendency to overfit to training data. The ```Dropout``` layer is a devilishly simple way of combating this that has the benefit of being easy to understand and effective: on each training cycle we randomly ignore a fraction of the activations. By default ```Dropout``` layers are initialized with $p = 0.5$, meaning that on average 50% of the input tensor's elements are randomly zeroed out.

**Note: ```Dropout()``` should take place only during training. If it happened at inference time, you'd lose a chunk of your network's reasoning power, which is not what we want. PyTorch's ```Dropout``` checks which mode the module is in (set with ```model.train()``` or ```model.eval()```) and passes all data through unchanged in evaluation mode.**
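
The difference between the two modes can be seen on a tensor of ones. This is a small sketch using a standalone ```nn.Dropout``` layer; in a full model the same switch happens for every dropout layer when ```model.train()``` or ```model.eval()``` is called.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
ones = torch.ones(1000)

drop.train()                 # training mode: elements are randomly zeroed
train_out = drop(ones)       # survivors are scaled by 1 / (1 - p) = 2

drop.eval()                  # evaluation mode: dropout is a no-op
eval_out = drop(ones)

print(train_out.unique())            # only zeros and twos can appear
print(torch.equal(eval_out, ones))   # True
```

The scaling of the surviving values during training keeps the expected sum of activations the same in both modes.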

**TODO** Write a function that outputs the dimensions of each layer given the input dimensions, kernel sizes, strides and padding.
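
One possible sketch of such a function follows. It assumes square inputs and kernels and a dilation of 1, and uses the output-size formula from the PyTorch documentation for ```Conv2d``` and ```MaxPool2d```, $\lfloor (n + 2p - k) / s \rfloor + 1$. The names `conv_output_size`, `trace_sizes` and `feature_layers` are invented here; the layer list mirrors the features block of the network defined in the cells that follow.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension of a conv or pooling layer.

    Computes floor((input + 2*padding - kernel) / stride) + 1, which
    holds for Conv2d and MaxPool2d when dilation is 1.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1

# (kernel_size, stride, padding) for each layer of the features block
feature_layers = [
    (11, 4, 2),  # Conv2d
    (3, 2, 0),   # MaxPool2d
    (5, 1, 2),   # Conv2d
    (3, 2, 0),   # MaxPool2d
    (3, 1, 1),   # Conv2d
    (3, 1, 1),   # Conv2d
    (3, 1, 1),   # Conv2d
    (3, 2, 0),   # MaxPool2d
]

def trace_sizes(input_size, layers):
    """Return the spatial size after each layer, starting from input_size."""
    sizes = [input_size]
    for kernel_size, stride, padding in layers:
        sizes.append(conv_output_size(sizes[-1], kernel_size, stride, padding))
    return sizes

print(trace_sizes(64, feature_layers))   # [64, 15, 7, 7, 3, 3, 3, 3, 1]
print(trace_sizes(224, feature_layers))  # [224, 55, 27, 27, 13, 13, 13, 13, 6]
```

With the $64 \times 64$ images used in this notebook the features block ends at $1 \times 1$, so the adaptive average pool has to stretch that single value to $6 \times 6$; with $224 \times 224$ inputs the features block already produces $6 \times 6$.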


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data
import torch.nn.functional as F
import torchvision
from torchvision import transforms
from PIL import Image

In [2]:
class CNNNet(nn.Module):
    
    def __init__(self, num_classes=2):
        super(CNNNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256,256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2)
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6,6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Hardswish(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(),
            nn.Dropout(),
            nn.Hardswish(),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Linear(4096, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

In [3]:
cnnet = CNNNet()

In [4]:
def check_image(path):
    # Accept a file only if PIL can open it as an image
    try:
        with Image.open(path):
            return True
    except Exception:
        return False

In [5]:
img_transforms = transforms.Compose([
    transforms.Resize((64,64)),
    transforms.RandomCrop(64),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])


In [6]:
train_data_path = 'C02Dataset/train/'
train_data = torchvision.datasets.ImageFolder(root=train_data_path,
                                              transform=img_transforms,
                                              is_valid_file=check_image)

In [7]:
val_data_path = 'C02Dataset/val/'
val_data = torchvision.datasets.ImageFolder(root=val_data_path,
                                            transform=img_transforms,
                                            is_valid_file=check_image)

In [8]:
test_data_path = 'C02Dataset/test/'
test_data = torchvision.datasets.ImageFolder(root=test_data_path,
                                             transform=img_transforms,
                                             is_valid_file=check_image)

In [9]:
batch_size = 256
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_data_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_data_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=True)

In [10]:
def train(model, optimizer, loss_fn, train_loader, val_loader, epochs=20, device="cpu"):
    for epoch in range(1, epochs+1):
        training_loss = 0.0
        valid_loss = 0.0
        # Training pass: dropout active, weights updated
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            output = model(inputs)
            loss = loss_fn(output, targets)
            loss.backward()
            optimizer.step()
            # loss is a batch mean, so weight it by the batch size
            training_loss += loss.item() * inputs.size(0)
        training_loss /= len(train_loader.dataset)

        # Validation pass: dropout disabled, no gradients tracked
        model.eval()
        num_correct = 0
        num_examples = 0
        with torch.no_grad():
            for batch in val_loader:
                inputs, targets = batch
                inputs = inputs.to(device)
                targets = targets.to(device)
                output = model(inputs)
                loss = loss_fn(output, targets)
                valid_loss += loss.item() * inputs.size(0)
                # The predicted class is the index of the largest logit
                correct = torch.eq(output.argmax(dim=1), targets)
                num_correct += torch.sum(correct).item()
                num_examples += correct.shape[0]
        valid_loss /= len(val_loader.dataset)

        print('Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}, Accuracy = {:.2f}'.format(
            epoch, training_loss, valid_loss, num_correct / num_examples))

In [11]:
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

In [12]:
cnnet.to(device)

CNNNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU()
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU()
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU()
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Hardswish()
    (2): Linear(in_features=9216, out_features=4096, bias=True)
    (3): ReLU()
    (4): Dropout(p=0.5, i

In [13]:
optimizer = optim.Adam(cnnet.parameters(), lr=0.001)  # The Adam paper's suggested default: α = 0.001

In [14]:
train(cnnet, optimizer, torch.nn.CrossEntropyLoss(),
     train_data_loader, val_data_loader,
     epochs=10, device=device)

Epoch: 1, Training Loss: 1.89, Validation Loss: 0.65, Accuracy = 0.95
Epoch: 2, Training Loss: 0.66, Validation Loss: 0.49, Accuracy = 0.95
Epoch: 3, Training Loss: 0.58, Validation Loss: 0.35, Accuracy = 0.95
Epoch: 4, Training Loss: 0.54, Validation Loss: 0.37, Accuracy = 0.95
Epoch: 5, Training Loss: 0.51, Validation Loss: 0.31, Accuracy = 0.95
Epoch: 6, Training Loss: 0.52, Validation Loss: 0.48, Accuracy = 0.88
Epoch: 7, Training Loss: 0.49, Validation Loss: 0.34, Accuracy = 0.95
Epoch: 8, Training Loss: 0.47, Validation Loss: 0.25, Accuracy = 0.95
Epoch: 9, Training Loss: 0.41, Validation Loss: 0.30, Accuracy = 0.87
Epoch: 10, Training Loss: 0.39, Validation Loss: 0.29, Accuracy = 0.92


In [15]:
from os import listdir
from os.path import isfile, join

labels = ['cat','fish']

def predict_folder(folder):
    # Classify every file in `folder`; returns (file names, predicted labels)
    files = [f for f in listdir(folder) if isfile(join(folder, f))]
    preds = []
    cnnet.eval()
    with torch.no_grad():
        for name in files:
            # convert('RGB') matches what ImageFolder's default loader does
            img = Image.open(join(folder, name)).convert('RGB')
            img = img_transforms(img).to(device)
            img = torch.unsqueeze(img, 0)  # add the batch dimension
            prediction = F.softmax(cnnet(img), dim=1)
            preds.append(labels[prediction.argmax()])
    return files, preds

cats, cats_pred = predict_folder('C02Dataset/val/cat')
fishes, fishes_pred = predict_folder('C02Dataset/val/fish')

In [16]:
total_samples = len(cats_pred) + len(fishes_pred)
# 'cat' is treated as the positive class
true_positives = sum(1 for i in cats_pred if i == 'cat')
false_negative = sum(1 for i in cats_pred if i == 'fish')
false_positives = sum(1 for i in fishes_pred if i == 'cat')
true_negative = sum(1 for i in fishes_pred if i == 'fish')
classification_accuracy = (true_positives + true_negative) / total_samples * 100
prevalence = len(cats_pred) / total_samples
PPV = true_positives / (true_positives + false_positives)   # positive predictive value
FDR = false_positives / (true_positives + false_positives)  # false discovery rate
FOR = false_negative / (false_negative + true_negative)     # false omission rate
error_rate = (false_positives + false_negative) / total_samples * 100
x = torch.tensor([[true_positives, false_positives], [false_negative, true_negative]])

In [17]:
cats_expected = torch.ones(len(cats), dtype=torch.int8).tolist()
cats_predicted = [1 if x == 'cat' else 0 for x in cats_pred]
fishes_expected = torch.zeros(len(fishes), dtype=torch.int8).tolist()
fishes_predicted = [1 if x == 'cat' else 0 for x in fishes_pred]
expected = cats_expected + fishes_expected
predicted = cats_predicted + fishes_predicted

In [18]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(expected, predicted))

[[  26   44]
 [  84 1392]]


In [19]:
x

tensor([[1392,   44],
        [  84,   26]])

In [20]:
print('Accuracy:{:.2f}\n Prevalence: {:.2f}\n PPV: {:.2f}\n FDR: {:.2f}\n FOR: {:.2f}\n Error rate: {:.2f}'
      .format(classification_accuracy, prevalence, PPV, FDR, FOR, error_rate))

Accuracy:91.72
 Prevalence: 0.95
 PPV: 0.97
 FDR: 0.03
 FOR: 0.76
 Error rate: 8.28


In [21]:
#From https://github.com/geohot/tinygrad/blob/c83cebccdaf1f962c554237eaa597abcfa023c9d/tinygrad/utils.py#L7
def fetch(url):
  import requests, os, hashlib, tempfile
  fp = os.path.join(tempfile.gettempdir(), hashlib.md5(url.encode('utf-8')).hexdigest())
  if os.path.isfile(fp) and os.stat(fp).st_size > 0:
    with open(fp, "rb") as f:
      dat = f.read()
  else:
    print("fetching %s" % url)
    dat = requests.get(url).content
    with open(fp+".tmp", "wb") as f:
      f.write(dat)
    os.rename(fp+".tmp", fp)
  return dat

In [37]:
import io
url = input('URL')
img = Image.open(io.BytesIO(fetch(url)))
img = img_transforms(img).to(device)
img = torch.unsqueeze(img, 0)
cnnet.eval()
prediction = F.softmax(cnnet(img), dim=1)
prediction = prediction.argmax()
print(labels[prediction])

URLhttps://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP._N8SIT9lNaOftkl77G7zgAHaE7%26pid%3DApi&f=1
fetching https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP._N8SIT9lNaOftkl77G7zgAHaE7%26pid%3DApi&f=1
cat


In [23]:
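torch.save(cnnet, "cnnet")               # pickle the whole model (structure and weights)
cnnet = torch.load("cnnet")              # restore it (recent PyTorch releases may need weights_only=False here)
torch.save(cnnet.state_dict(), "cnnet")  # store only the parameters instead (overwrites the file above)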

**Using pretrained models**

In [24]:
# import torchvision.models as models
# resnet = models.resnet18(num_classes=2)

We can use ```models.alexnet(pretrained=True)``` to download a pretrained set of weights for *AlexNet*, allowing us to use it immediately for classification with no extra training (newer torchvision releases select weights with the ```weights=``` argument instead). Some additional training is still recommended to improve accuracy on the particular dataset it will be used on.

In [25]:
# print(resnet)

In the printed ResNet almost everything looks familiar, with the exception of the **Batch Normalization** layers (```BatchNorm2d```). Batch normalization is a simple layer that uses two learned parameters to try to ensure that each minibatch that goes through the network has a mean centered around zero with a variance of 1. For small nets like the one in Chapter 2 it has little effect, but for larger models, say 20 layers deep, the effect can be vast: repeated matrix multiplications may end up producing *vanishing or exploding gradients*, both of which are fatal for the training process. With BatchNorm layers in place, even if we use ResNet-152 the multiplications inside the network don't get out of hand. It is recommended to use ```print(model)``` to see which layers a model uses and in what order operations happen.
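
A short sketch of that normalizing effect, using a standalone ```nn.BatchNorm2d``` layer on a made-up minibatch whose values are deliberately far from zero mean and unit variance (the tensor sizes and the scale and shift of 50 and 10 are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(num_features=3)   # one (weight, bias) pair per channel
bn.train()                            # use statistics of the current minibatch

# A minibatch whose activations are far from zero mean / unit variance
skewed = torch.randn(16, 3, 8, 8) * 50 + 10
normalized = bn(skewed)

# Per-channel statistics over batch, height and width
channel_mean = normalized.mean(dim=(0, 2, 3))
channel_std = normalized.std(dim=(0, 2, 3), unbiased=False)
print(channel_mean)   # close to 0 for every channel
print(channel_std)    # close to 1 for every channel
```

The two learned parameters (```weight``` and ```bias```) start at 1 and 0, so a freshly created layer only normalizes; training can then rescale and shift each channel.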

For real use cases and competitions, PyTorch Hub is a good option: it provides an additional route to get models.

In [26]:
# import torch
# import torch.optim
# transfer_model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)

Next we need to freeze the layers. The way we do this is simple: we stop them from accumulating gradients by setting ```requires_grad``` to ```False```. We need to do this for every parameter in the network, but helpfully, PyTorch provides the ```parameters()``` and ```named_parameters()``` methods, which make this rather easy:

In [27]:
# for name, param in transfer_model.named_parameters():
#     if('bn' not in name): # Skip BatchNorm parameters so they keep being updated
#         param.requires_grad = False

Then we need to replace the final classification block with a new one that we will train for detecting cats or fish. In this example we replace it with a couple of ```Linear()``` layers, a ```ReLU()``` and ```Dropout()```, but we could add extra CNN layers here too. Happily, the definition of PyTorch's implementation of ResNet stores the final classifier block as an instance variable, fc, so all we need to do is replace that with our new structure.

In [28]:
# transfer_model.fc = nn.Sequential(nn.Linear(transfer_model.fc.in_features,500),
#                                  nn.ReLU(),
#                                  nn.Dropout(), nn.Linear(500,2))

In the preceding code, we take advantage of the ```in_features``` attribute, which lets us grab the number of activations coming into a layer (2048 in this case). You can also use ```out_features``` to discover the number of activations coming out. These are handy attributes when you're snapping networks together like building bricks.
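
The freeze-and-replace steps can be tried without downloading anything. The sketch below uses a tiny hypothetical `Backbone` class as a stand-in for the pretrained ResNet; only the pattern (freeze by name, then swap ```fc```) is taken from the text, and all layer sizes are invented.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Hypothetical stand-in for a pretrained network with an `fc` head."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(8)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 1000)

    def forward(self, x):
        x = self.pool(self.bn1(self.conv1(x)))
        return self.fc(x.flatten(1))

stand_in = Backbone()

# Freeze everything except the BatchNorm parameters
for name, param in stand_in.named_parameters():
    if 'bn' not in name:
        param.requires_grad = False

# Swap in a new head; freshly created layers are trainable by default
stand_in.fc = nn.Sequential(
    nn.Linear(stand_in.fc.in_features, 500),
    nn.ReLU(),
    nn.Dropout(),
    nn.Linear(500, 2),
)

trainable = [n for n, p in stand_in.named_parameters() if p.requires_grad]
print(trainable)
print(stand_in(torch.randn(2, 3, 8, 8)).shape)
```

Listing the trainable parameter names like this is a quick way to confirm that only the BatchNorm layers and the new head will receive updates.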

**Finding the learning rate**

A learning rate that has empirically been observed to work well with the Adam optimizer is $3 \times 10^{-4}$ (```3e-4```). This is known as Karpathy's constant. On the other hand, there is a way to search for a good learning rate, popularized by the fast.ai library: over the course of an epoch, start out with a small learning rate and increase it with each minibatch, ending at a high rate by the end of the epoch. Calculate the loss for each rate, plot the loss against the learning rate, and pick the learning rate that gives the greatest decline. Here's a simplified version of the function:

In [29]:
# import math

# def find_lr(model, loss_fn, optimizer, train_loader, init_value=1e-8,
#            final_value=10.0, device="cpu"):
#     number_in_epoch = len(train_loader) - 1
#     update_step = (final_value / init_value) ** (1 / number_in_epoch)
#     lr = init_value
#     optimizer.param_groups[0]["lr"] = lr
#     best_loss = 0.0
#     batch_num = 0
#     losses = []
#     log_lrs = []
#     for data in train_loader:
#         batch_num += 1
#         inputs, labels = data
#         inputs, labels = inputs.to(device), labels.to(device)
#         optimizer.zero_grad()
#         outputs = model(inputs)
#         loss = loss_fn(outputs, labels)
#         #Crash out if loss explodes (gradient explosion)
#         if batch_num > 1 and loss.item() > 4 * best_loss:
#             return log_lrs[10:-5], losses[10:-5]
#         #Record the best loss
#         if loss.item() < best_loss or batch_num == 1:
#             best_loss = loss.item()
#         #Store values (plain floats, so the graph is not kept alive)
#         losses.append(loss.item())
#         log_lrs.append(math.log10(lr))
#         #Do backward pass and optimize
#         loss.backward()
#         optimizer.step()
#         #Update the lr for the next step and store
#         lr *= update_step
#         optimizer.param_groups[0]["lr"] = lr
#     return log_lrs[10:-5], losses[10:-5]
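
To see which learning rates the finder visits, the schedule can be computed on its own. This is a sketch of the same exponential sweep, separated from any model; `lr_schedule` is a hypothetical helper and the ten-batch epoch is an arbitrary example.

```python
import math

def lr_schedule(num_batches, init_value=1e-8, final_value=10.0):
    """Learning rates visited by an exponential sweep over one epoch."""
    steps = num_batches - 1
    update_step = (final_value / init_value) ** (1 / steps)
    return [init_value * update_step ** i for i in range(num_batches)]

rates = lr_schedule(num_batches=10)
for rate in rates:
    print(f"{rate:.1e}  (log10 = {math.log10(rate):.1f})")
```

Because each step multiplies the rate by a constant factor, the rates are evenly spaced on a logarithmic axis, which is why the finder records ```log10(lr)``` for plotting.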
        

**Differential Learning Rates**

In [30]:
# optimizer = torch.optim.Adam([
#     {'params':transfer_model.layer4.parameters(), 'lr':
#     found_lr/3},
#     {'params':transfer_model.layer3.parameters(), 'lr':
#     found_lr/9}
# ]) # found_lr after training loop
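
The optimizer above gives each group of layers its own learning rate. A runnable sketch of the same mechanism follows, with two small linear layers standing in for ```layer3``` and ```layer4``` and a placeholder value `assumed_lr` in place of ```found_lr``` (both are assumptions made for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical two-stage model standing in for ResNet's layer3 / layer4
stages = nn.ModuleDict({
    'layer3': nn.Linear(8, 8),
    'layer4': nn.Linear(8, 2),
})
assumed_lr = 3e-3  # placeholder for a rate picked from the finder's plot

group_optimizer = torch.optim.Adam([
    {'params': stages['layer4'].parameters(), 'lr': assumed_lr / 3},
    {'params': stages['layer3'].parameters(), 'lr': assumed_lr / 9},
])

# Each dict becomes one parameter group with its own learning rate
for group in group_optimizer.param_groups:
    print(len(group['params']), group['lr'])
```

Layers closer to the input get the smaller rate, so the more general early features change more slowly than the task-specific later ones.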

In [31]:
# unfreeze_layers = [transfer_model.layer3,
#                   transfer_model.layer4]
# for layer in unfreeze_layers:
#     for param in layer.parameters():
#         param.requires_grad = True

**Data Augmentation**



Data augmentation means swapping, flipping, cropping, or stretching the training images so that the network is less able to fit the exact input data and then fail to generalize to other data (overfitting). ```torchvision.transforms``` offers a large collection of transforms that can be used for data augmentation, plus two ways of constructing new transformations. In this section, we look at the most useful ones.

```
torchvision.transforms.ColorJitter(brightness=0, contrast=0, saturation=0, hue=0)
```
This randomly changes the brightness, contrast, saturation and hue of an image. For each of these values, either a float or a tuple of floats can be passed.

```
torchvision.transforms.RandomHorizontalFlip(p=0.5)
torchvision.transforms.RandomVerticalFlip(p=0.5)
```
These flip the image horizontally or vertically, each with a 50% chance (by default) of occurring whenever an image is loaded during training.

```
torchvision.transforms.RandomCrop(size, padding=None,
    pad_if_needed=False, fill=0, padding_mode='constant')
torchvision.transforms.RandomResizedCrop(size, scale=(0.08, 1.0),
    ratio=(0.75, 1.333333), interpolation=2)
```
For these we have to be careful, because if the crops are too small, we run the risk of cutting out important features of the image and making the model learn the wrong thing. ```RandomResizedCrop``` resizes the crop to fill the given size, whereas ```RandomCrop``` may take a crop close to the edge and into the darkness (the padding) beyond the image.

```RandomResizedCrop``` uses bilinear interpolation by default, but we can select nearest-neighbor or bicubic interpolation by changing the ```interpolation``` parameter.

Some results suggest that training on HSV images can be slightly more accurate than training on RGB. Combined with ensembles, you could easily create a series of models that combines the results of training on the RGB, HSV, YUV and LAB color spaces. One problem is that PyTorch doesn't offer a transform that can do this. But it does provide a couple of tools that we can use to randomly change images from standard RGB into HSV or any other color space. First, looking at the [PIL Documentation](https://pillow.readthedocs.io/en/stable/), we can use ```Image.convert()``` to translate a PIL image from one color space to another. We could write a custom transform class to carry out this conversion, but PyTorch adds a ```transforms.Lambda``` class so that we can easily wrap any function and make it available to the transform pipeline.

In [32]:
# def _random_colour_space(x):
#     output = x.convert('HSV')
#     return output

This is then wrapped in a ```transforms.Lambda``` class and can be used in any standard transformation pipeline:

In [33]:
# colour_transform = transforms.Lambda(lambda x: _random_colour_space(x))

This is fine if we want to convert *every* image into HSV, but we don't really want that. We'd like it to randomly change images in each batch, so that an image is likely to be presented in different color spaces in different epochs. We could update the original function to generate a random number and use that as the probability of changing the image, but instead we're even lazier and use **RandomApply**:

In [34]:
# random_color_transform = torchvision.transforms.RandomApply([colour_transform])

**Custom Transform Classes** 

Sometimes a simple lambda isn't enough, for example because we have some initialization or state that we want to track. In these cases we can create a custom transform that operates on either PIL image data or a tensor. Such a class has to implement two dunder methods: ```__call__```, which the transform pipeline will invoke during the transformation process; and ```__repr__```, which should return a string representation of the transform, along with any state that may be useful for diagnostic purposes.

In the following code, we implement a transform class that adds random Gaussian noise to a tensor. When the class is initialized, we pass in the mean and standard deviation of the noise we require, and during the ```__call__``` method, we sample from this distribution and add it to the incoming tensor:

In [35]:
# class Noise():
#     """Adds gaussian noise to a tensor.
#     transforms.Compose([
#     transforms.ToTensor(),
#     Noise(0.1, 0.05)
#     ])"""
#     def __init__(self, mean, stddev):
#         self.mean = mean
#         self.stddev = stddev
#
#     def __call__(self, tensor):
#         # normal_ fills the zero tensor in place with samples from N(mean, stddev)
#         noise = torch.zeros_like(tensor).normal_(self.mean, self.stddev)
#         return tensor.add_(noise)
#
#     def __repr__(self):
#         repr = f'{self.__class__.__name__}(mean={self.mean}, stddev={self.stddev})'
#         return repr

Because transforms don't have any restrictions and just inherit from the base Python object class, you can do anything. Aside from transformations, there are a few more ways of squeezing as much performance from a model as possible.

**Start small, get bigger**

A tip for CNNs: if we are training on $256 \times 256$ images, for example, create a few other datasets in which the images have been scaled to $64 \times 64$ and $128 \times 128$. Create the model with the $64 \times 64$ dataset and fine-tune as normal, then train the *exact same model* with the $128 \times 128$ dataset, not from scratch, but using the parameters that have already been trained. Once it looks like we have squeezed the most out of the $128 \times 128$ data, move on to the $256 \times 256$ dataset. We'll probably find a percentage point or two of improvement in accuracy.

If we don't want to have multiple copies of a dataset hanging around in storage, we can use ```torchvision``` transforms to do this on the fly with the ```Resize``` transform:

```
resize = transforms.Compose([transforms.Resize(64),
    # ...other augmentation transforms...
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
```
The trade-off is that training takes more time, since PyTorch has to apply the resize every time. Consider also starting with small architectures and working your way up to bigger ones. Small models cut prediction time while experimenting, and they can be reused later by adding them to an ensemble model.

**Ensembles**

*Ensembling* is a technique that is fairly common in more traditional machine learning methods, and it works rather well in deep learning too. The idea is to obtain a prediction from a series of models and combine those predictions to produce a final answer, because different models will have different strengths in different areas.

Assuming you have a list of models in ```models```, and ```input``` is your input tensor:

In [36]:
# predictions = [m(input) for m in models]
# avg_prediction = torch.stack(predictions).mean(0).argmax()

The ```stack``` function joins the list of tensors along a new first dimension, so if we were working on the cat/fish problem and had four models in our ensemble, we'd end up with a $4 \times 1 \times 2$ tensor constructed from the four $1 \times 2$ tensors. ```mean``` does what we'd expect, taking the average, although we have to pass in a dimension of 0 to ensure that it averages across the models instead of averaging all the tensor elements into a single scalar. Finally, ```argmax``` picks out the index of the highest element.
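
A small sketch of the averaging step, using four made-up softmax outputs in place of real model predictions (the probabilities are invented purely to show the shapes and the vote):

```python
import torch

# Hypothetical softmax outputs of four models for one image (cat, fish)
model_outputs = [
    torch.tensor([[0.9, 0.1]]),
    torch.tensor([[0.6, 0.4]]),
    torch.tensor([[0.3, 0.7]]),
    torch.tensor([[0.8, 0.2]]),
]

stacked = torch.stack(model_outputs)   # shape (4, 1, 2): one slice per model
averaged = stacked.mean(0)             # shape (1, 2): average over the models
winner = averaged.argmax().item()      # index of the highest averaged score

print(stacked.shape, averaged, winner)
```

Here one model votes for fish, but the averaged scores still favor index 0 (cat), which is the kind of smoothing an ensemble is meant to provide.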