# Tutorial 2.3. How to handle a more realistic dataset

Author: [Maren Gröne](mailto:maren.groene@s2016.tu-chemnitz.de)

In this tutorial, we introduce a new and more realistic dataset: [Imagenette](https://github.com/fastai/imagenette).

The previous MNIST examples are great for an introduction, but they are generally easy to solve, with classification accuracies of up to 99%. Imagenette is a subset of the much larger [ImageNet](https://image-net.org/index.php) dataset, which formed the basis of a well-known image classification competition in which AlexNet achieved a breakthrough in 2012. Nowadays, ImageNet is often used to train image classification models from scratch due to its massive image volume and number of classes.

For now, ImageNet is unnecessarily big for our purposes. That is why we use its smaller derivative Imagenette, with only 10 distinct classes and about 10,000 images. Nonetheless, these images are larger than the MNIST digits, in RGB color, and taken from the real world, which makes them much more difficult to handle.

Before diving in, let us check and set the device.

In [None]:
import torch

is_cuda = torch.cuda.is_available()
if is_cuda:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Device is {device}")

From this notebook onwards, we use modules defined in the *Utils* folder. You can import them like a Python package after adding the root directory to the system path.

In [None]:
# add root directory to system path
import os, sys
notebook_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(notebook_dir, ".."))
if root_path not in sys.path:
    sys.path.append(root_path)
    print(f"Added {root_path} to sys.path")

# load packages from Python files    
from Utils.dataloaders import prepare_imagenette

### Image preprocessing 

Now we implement the necessary transformations and load the data as `DataLoader` objects. For training, all images must be converted to the same size so that they can be passed to the fixed-size input layer of a neural network. Since $224 \times 224$ is a common size (for example, for the Vision Transformer later), we use it here.

We also use two data augmentation techniques: cropping and flipping. Data augmentation is generally used to enrich the dataset with more diverse images, which in turn helps the machine learning model generalize. For further transformations, see [transforming and augmenting images - PyTorch Website](https://pytorch.org/vision/main/transforms.html).

The functions we use are:
- `v2.RandomResizedCrop` chooses a random part of the image and resizes it to the desired image dimensions.
- `v2.RandomHorizontalFlip` flips the image horizontally with probability `p=0.5`, i.e. a 50/50 chance.

In [None]:
import torchvision.transforms.v2 as v2

## prepare data
transform = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToTensor(),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

batch_size = 4

trainloader, testloader, classes = prepare_imagenette(
    train_compose=transform,
    test_compose=transform,
    save_path="../Dataset/",
    batch_size=batch_size,
)

Let us have a look at the images we are working with. For simplicity, only the first (red) color channel of each image is shown, in grayscale.

In [None]:
import matplotlib.pyplot as plt

def imshow(images):
    plt.figure()
    for i in range(4):
        img = images[i, 0]  # first (red) channel of the i-th image
        img = img * 0.229 + 0.485  # undo v2.Normalize for the red channel
        npimg = img.numpy()
        plt.subplot(1, 4, i + 1)
        plt.imshow(npimg, cmap='gray')
        plt.axis('off')
    plt.show()

dataiter = iter(trainloader)
images, labels = next(dataiter)
imshow(images)

print('Classes are: ')
print('| '.join(f'{classes[labels[j]]:5s} ' for j in range(4)))
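
`v2.Normalize` maps each channel value $x$ to $(x - \text{mean}) / \text{std}$, so undoing it for display means multiplying by the standard deviation of that channel and adding its mean. A minimal pure-Python sketch of this round trip, using the mean and standard deviation values from the transform above (the helper names `normalize` and `unnormalize` are made up for illustration):

```python
# Per-channel statistics taken from the v2.Normalize call above (R, G, B).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize(value, channel):
    """Map a pixel value in [0, 1] to its normalized value."""
    return (value - MEAN[channel]) / STD[channel]

def unnormalize(value, channel):
    """Invert normalize(): recover the original pixel value in [0, 1]."""
    return value * STD[channel] + MEAN[channel]

pixel = 0.8
for channel, name in enumerate(["red", "green", "blue"]):
    z = normalize(pixel, channel)
    restored = unnormalize(z, channel)
    print(f"{name}: normalized={z:.3f}, restored={restored:.3f}")
```

The same per-channel arithmetic applies element-wise to a whole image tensor.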

## Create the convolutional neural network

### The network class

As before, we create a class that inherits from `torch.nn.Module` to build our neural network. It contains the layer objects and the forward pass.

#### Define the network structure

Let's adjust the network model of the previous tutorial to our new dataset.

Remember that a `Conv2d` layer takes the parameters `(input channels, feature maps, kernel size)`. Since the Imagenette images are in RGB, the input now has 3 color channels. Therefore, the number of input channels of the first convolutional layer is $3$. Again, we want $6$ feature maps and a kernel size of $5 \times 5$. In contrast to fully connected layers, the pixel dimensions of the image are not relevant here.

The first convolutional layer is:

`self.conv1 = nn.Conv2d(3,6,5)`

The second convolutional layer and the max pooling operation stay unchanged.

`self.conv2 = nn.Conv2d(6,16,5)`

`self.pool = nn.MaxPool2d(2,2)`
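
One way to see why the pixel dimensions do not matter for a convolutional layer is to count its trainable parameters: each output feature map has one kernel per input channel plus one bias, regardless of the image size. A small sketch (the helper name `conv2d_param_count` is hypothetical, and bias terms are assumed to be enabled, which is the `nn.Conv2d` default):

```python
def conv2d_param_count(in_channels, out_channels, kernel_size):
    """Number of trainable parameters of a square-kernel 2D convolution with bias."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    biases = out_channels
    return weights + biases

conv1_params = conv2d_param_count(3, 6, 5)
conv2_params = conv2d_param_count(6, 16, 5)
print(conv1_params)  # 456
print(conv2_params)  # 2416
```

Neither number changes when the input images get larger or smaller; only the fully connected layers depend on the image size.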

Now, at the transition to the fully connected layers, we have to do some math again. Remember, the size of an image after a convolutional layer is
`(W-K+2P)/S+1` with input dimension `W`, kernel size `K`, padding `P` and stride `S`. Each max pooling operation then halves this size.

After the first convolution and pooling, the dimension is $((224-5+2 \cdot 0)/1+1)/2=110$, and after the second (here abbreviated) $(110-5+1)/2=53$. The last convolutional layer has $16$ feature maps. Therefore, the number of outputs equals $16 \times 53 \times 53$, which is the number of inputs of the first fully connected layer.

`self.fc1 = nn.Linear(16*53*53,120)`
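
The arithmetic above can be checked with a few lines of Python. This sketch assumes square inputs and kernels and rounds down, as PyTorch's convolution and pooling layers do by default; `conv_output_size` and `pool_output_size` are illustrative helper names.

```python
def conv_output_size(width, kernel, padding=0, stride=1):
    """Spatial size after a convolution: (W - K + 2P) / S + 1, rounded down."""
    return (width - kernel + 2 * padding) // stride + 1

def pool_output_size(width, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (width - kernel) // stride + 1

size = 224
for _ in range(2):  # two conv + pool stages, each with a 5x5 kernel
    size = pool_output_size(conv_output_size(size, kernel=5))
    print(size)  # prints 110, then 53

fc1_inputs = 16 * size * size
print(fc1_inputs)  # 44944
```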

The rest and the forward pass stay the same as before.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        ## define the network structure with the layers
        self.conv1 = nn.Conv2d(3,6,5) # in_channels, out_channels, kernel_size
        self.conv2 = nn.Conv2d(6,16,5) # in_channels, out_channels, kernel_size
        self.pool  = nn.MaxPool2d(2,2) # kernel_size, stride
        self.fc1   = nn.Linear(16*53*53, 120) # in_features, out_features
        self.fc2   = nn.Linear(120,84) # in_features, out_features
        self.fc3   = nn.Linear(84,10) # in_features, out_features (10 classes)

        
    def forward(self, x):
        ## define the functionality of each layer/between the layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x,1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
import torch.optim as optim

net = Net().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
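
A quick way to double-check the `16*53*53` input size of the first fully connected layer is to push a dummy batch through the convolutional part only and inspect the resulting shape. The sketch below rebuilds just those layers with `nn.Sequential` so that it runs on its own; the name `features` is chosen for illustration.

```python
import torch
import torch.nn as nn

# Convolutional part only, mirroring conv1 -> ReLU -> pool -> conv2 -> ReLU -> pool.
features = nn.Sequential(
    nn.Conv2d(3, 6, 5),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
)

dummy = torch.randn(1, 3, 224, 224)  # one fake RGB image of size 224 x 224
with torch.no_grad():
    out = features(dummy)

print(out.shape)  # torch.Size([1, 16, 53, 53])
flat_features = torch.flatten(out, 1).shape[1]
print(flat_features)  # 44944
```

If you change the convolutional layers, the printed number is the value to use as the first argument of `nn.Linear` in `fc1`.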

## Training


For training the network, we also use a function from the *Utils* folder, which comes with a modified visualization of the training progress.

We use the same batch size and number of epochs as before, and the neural network is largely the same.

How well do you think it will perform?

In [None]:
from Utils.functions import train_model

num_epochs = 2
history = train_model(net, trainloader,testloader,criterion,optimizer,scheduler=None,device=device,num_epochs=num_epochs)

The accuracy does not even reach 50%, but we can assume that the model is simply underfitting: it has not yet reached its full potential. Therefore, we increase the number of epochs, for example to 50. The `train_model` function implements early stopping via patience, which stops the training process when the validation accuracy (the accuracy on data not used during training) no longer improves.
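
The idea behind patience-based early stopping can be sketched without any training code. The following is an illustration only, not the implementation in `Utils.functions.train_model`: the function name `early_stopping_epoch`, the value `patience=3`, and the accuracy values are all made up for this example. Training stops once the validation accuracy has not reached a new best for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the 1-based epoch at which training would stop.

    Stops after `patience` consecutive epochs without a new best
    validation accuracy; returns the last epoch if that never happens.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accuracies)

# Made-up validation accuracies for illustration.
val_history = [0.41, 0.48, 0.55, 0.58, 0.57, 0.58, 0.56, 0.55, 0.54]
stop_epoch = early_stopping_epoch(val_history, patience=3)
print(stop_epoch)  # 7
```

A larger patience tolerates longer plateaus before stopping, at the cost of more training time.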

After watching the training process for more epochs, you can see that the accuracy still plateaus at around 60%.

So how can we improve the model? Play around with the number of feature maps in the convolutional layers, the number of convolutional layers, the number of neurons in the fully connected layers, and the number of fully connected layers. If you change something in the convolutional layers, remember to also change the number of input features of the first fully connected layer! You can also play around with parameters outside the model itself, e.g. the batch size, the optimizer, or the loss function.

This process is called **hyperparameter tuning**.
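
As a starting point for a manual search, the candidate settings can be enumerated systematically. The sketch below builds a small grid with `itertools.product`; the hyperparameter names and candidate values are arbitrary examples, not recommendations, and each resulting dictionary would still have to be passed to your own model-building and training code.

```python
from itertools import product

# Candidate values per hyperparameter (arbitrary examples).
search_space = {
    "batch_size": [4, 16, 64],
    "learning_rate": [0.01, 0.001],
    "conv1_feature_maps": [6, 16],
}

# Build one configuration dictionary per combination of values.
names = list(search_space)
configs = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(configs))  # 12
print(configs[0])    # {'batch_size': 4, 'learning_rate': 0.01, 'conv1_feature_maps': 6}
```

The number of combinations grows multiplicatively with every added hyperparameter, which is one reason to keep the grid small.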

If you cannot find a good solution (above 70%), look up other CNN architectures online and rebuild them.

A little reminder: re-run all the necessary cells to refresh their content; otherwise, you will keep using the old network. If the performance always stays the same, restarting the kernel and clearing all outputs might help.

Generally, it is a good idea to automate the process of hyperparameter tuning. If you want to try it out, see [Hyperparameter tuning with Ray Tune](https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html).

## Exercise: Go deeper

Specifically, test how the accuracy changes if you add many convolutional and fully connected layers. You should notice a severe degradation in accuracy, or at least no significant increase. This is due to the vanishing gradient problem: during backpropagation, the error signal becomes so small in the earlier layers that they can no longer be trained.
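
The effect can be illustrated with a toy calculation. By the chain rule, the gradient reaching an early layer is a product of one factor per layer it passes through. Assuming, purely for illustration, that every layer scales the gradient by the same constant factor of 0.5, the signal shrinks exponentially with depth (real networks have varying factors that depend on the weights and activations):

```python
def gradient_scale(num_layers, per_layer_factor=0.5):
    """Gradient magnitude after backpropagating through `num_layers` layers,
    assuming each layer multiplies it by the same constant factor."""
    return per_layer_factor ** num_layers

for depth in (2, 5, 10, 20, 50):
    print(f"{depth:2d} layers: {gradient_scale(depth):.2e}")
```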

The next tutorial introduces Residual Networks, which are designed to circumvent this issue.