In [None]:
DEBUG = True
MODEL_DEBUG = True

### References

<img src="https://www.testdriller.com/pictures/blog/57043786ab6fa09.jpg" width=700/>

## CIFAR100 Classification

> Can we develop a model that performs well on the benchmark dataset CIFAR100?

### Context

The CIFAR-100 dataset (Canadian Institute for Advanced Research) is a subset of the Tiny Images dataset and consists of <strong>60000</strong> <code>32x32</code> colour images in <strong>100</strong> classes, with <strong>600</strong> images per class. The <strong>100</strong> classes in the CIFAR-100 are <strong>grouped into 20 superclasses</strong>. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are <strong>50000 training images</strong> and <strong>10000 test images.</strong>

Credit: <a href="https://www.kaggle.com/datasets/fedesoriano/cifar100?select=meta">Kaggle Link</a>

### Objectives

<ol>
	<li>To explore and understand the CIFAR100 dataset</li>
	<li>To understand the effects of different data augmentation techniques on the performance of the model</li>
	<li>To discover techniques and approaches for handling the <strong>3 color-channel (RGB) nature</strong> of the dataset</li>
	<li>To develop and experiment with models that can rival state-of-the-art (SOTA) benchmark scores</li>
</ol>

## Importing Libraries
We import the necessary libraries for the notebook to run below.

In [158]:
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
import copy
import math

import random
import torch
import torch.nn as nn
import os

We set the seed so that the notebook produces reproducible results when run.   
We also set the device to CUDA so that torch can use our GPU.

In [159]:
seed = 1234
np.random.seed(seed)
random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
# When running on the cuDNN backend, two further options must be set
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Set a fixed value for the hash seed
os.environ['PYTHONHASHSEED'] = str(seed)
    
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device available now:', device)
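if device.type != 'cuda':
    # Fail loudly instead of relying on `assert False`, which is skipped under `python -O`
    raise RuntimeError('CUDA is not available; this notebook expects a GPU.')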

Device available now: cuda
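
As a quick sanity check on the seeding idea, the minimal sketch below re-seeds Python's `random` module and NumPy twice with the same value and confirms that both produce identical draws. The helper name `draw` and the second seed `4321` are made up for illustration; `torch.manual_seed` plays the same role for torch's generator and is left out here so the snippet runs without a GPU.

```python
import random
import numpy as np

def draw(seed):
    # Re-seed both generators, then draw a few numbers from each
    random.seed(seed)
    np.random.seed(seed)
    return [random.random() for _ in range(3)], np.random.rand(3)

py_a, np_a = draw(1234)
py_b, np_b = draw(1234)   # same seed as above
py_c, np_c = draw(4321)   # hypothetical different seed

same_seed_matches = (py_a == py_b) and np.array_equal(np_a, np_b)
different_seed_differs = (py_a != py_c) and not np.array_equal(np_a, np_c)
print(same_seed_matches, different_seed_differs)
```

Note that seeding only fixes the random number streams; results can still vary if the order of operations that consume random numbers changes between runs.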


## Utility Functions
Below we define some utility functions that will ease and help us with our analysis.

In [160]:
def loc_data(data, loc):
	datacopy = copy.deepcopy(data)
	arr = np.array(datacopy.loc[loc].drop('label'))
	label = datacopy.loc[loc]['label']
	root = int(len(arr) ** 0.5)
	arr.resize((root, root))
	return label, arr

def imshow(arr: list, label: list = None, grayscale=True, figsize=None):
	if label is None:
		label = [''] * len(arr)

	# Lay the images out on a roughly square grid
	height = int(len(arr) ** 0.5)
	width = math.ceil(len(arr) / height)

	if figsize is None:
		fig = plt.figure()
	else:
		fig = plt.figure(figsize=figsize)
	for i in range(height):
		for j in range(width):
			idx = i * width + j  # row-major position in the grid
			if idx >= len(arr):
				break  # the last row may be only partially filled
			ax = fig.add_subplot(height, width, idx + 1)
			ax.grid(False)
			ax.set_xticks([])
			ax.set_yticks([])
			ax.imshow(arr[idx], cmap='gray' if grayscale else None)
			ax.set_title(label[idx])

def df_to_tensor(df, shape = (28, 28)):
	return torch.tensor(df.values.reshape((-1, *shape)), dtype=torch.float32)

def preprocess(df):
	# Scale pixel values from [0, 255] to [0, 1]; vectorised division replaces the slower applymap
	return df / 255

def mse(t1, t2, shape=(28, 28)):
	loss = nn.MSELoss(reduction='none')
	loss_result = torch.sum(loss(t1, t2), dim=2)
	loss_result = torch.sum(loss_result, dim=2)
	loss_result = loss_result / np.prod([*shape])
	return loss_result

## Dataset
Let's take a look at the dataset. The data is the CIFAR-100 dataset described above, retrieved from Kaggle (<a href="https://www.kaggle.com/datasets/fedesoriano/cifar100?select=meta">Kaggle Link</a>).

<table>
	<tr>
		<th>
			Column Name
		</th>
		<th>
			Description
		</th>
	</tr>
	<tr>
		<td>
			label
		</td>
		<td>
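			The true class of the image, represented as an integer ranging from 0 to 99<strong>*</strong>
		</td>
	</tr>
	<tr>
		<td>
			pixel 1<br/>...<br/>pixel 3072
		</td>
		<td>
			Pixel values of the image, each ranging from 0 to 255. Each image has dimensions <code>32x32x3</code>.
		</td>
	</tr>
</table>

<strong>\*</strong>Each label corresponds to one of the 100 fine classes. The list below shows them grouped by superclass, five per group; the numbers on the left only enumerate the list, while the integer stored in the data is the 0-based index of the class in <code>fine_label_names</code>.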
```
1-5 -> beaver, dolphin, otter, seal, whale
6-10 -> aquarium fish, flatfish, ray, shark, trout
11-15 -> orchids, poppies, roses, sunflowers, tulips
16-20 -> bottles, bowls, cans, cups, plates
21-25 -> apples, mushrooms, oranges, pears, sweet peppers
26-30 -> clock, computer keyboard, lamp, telephone, television
31-35 -> bed, chair, couch, table, wardrobe
36-40 -> bee, beetle, butterfly, caterpillar, cockroach
41-45 -> bear, leopard, lion, tiger, wolf
46-50 -> bridge, castle, house, road, skyscraper
51-55 -> cloud, forest, mountain, plain, sea
56-60 -> camel, cattle, chimpanzee, elephant, kangaroo
61-65 -> fox, porcupine, possum, raccoon, skunk
66-70 -> crab, lobster, snail, spider, worm
71-75 -> baby, boy, girl, man, woman
76-80 -> crocodile, dinosaur, lizard, snake, turtle
81-85 -> hamster, mouse, rabbit, shrew, squirrel
86-90 -> maple, oak, palm, pine, willow
91-95 -> bicycle, bus, motorcycle, pickup truck, train
96-100 -> lawn-mower, rocket, streetcar, tank, tractor
```

In [161]:
import pickle

def unpickle(file):
    # encoding='bytes' means the dictionary keys are bytes, e.g. b'data'
    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data

metadata_path = 'data/meta' # change this path
metadata = unpickle(metadata_path)
superclass_dict = dict(list(enumerate(metadata[b'coarse_label_names'])))

data_pre_path = 'data/' # change this path
# File paths
data_train_path = data_pre_path + 'train'
data_test_path = data_pre_path + 'test'
# Read dictionary
data_train_dict = unpickle(data_train_path)
data_test_dict = unpickle(data_test_path)
# Get data (switch b'fine_labels' to b'coarse_labels' to use the 20 superclasses instead of the 100 classes)
X_train = data_train_dict[b'data']
y_train = np.array(data_train_dict[b'fine_labels'])
X_test = data_test_dict[b'data']
y_test = np.array(data_test_dict[b'fine_labels'])

classes = np.array(list(map(lambda x: x.decode('utf-8'), metadata[b'fine_label_names'])))
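
Since every image carries both a fine and a coarse label, the two label arrays can be paired up to recover which superclass each class belongs to. The sketch below is one possible way to do this; the function name `fine_to_coarse` and the toy arrays are made up for illustration. With the real data, the inputs would presumably be `data_train_dict[b'fine_labels']` and `data_train_dict[b'coarse_labels']`.

```python
import numpy as np

def fine_to_coarse(fine_labels, coarse_labels):
    """Build a {fine label: coarse label} lookup from two aligned label arrays."""
    mapping = {}
    for f, c in zip(fine_labels, coarse_labels):
        f, c = int(f), int(c)
        # setdefault stores c on first sight and returns the stored value afterwards
        if mapping.setdefault(f, c) != c:
            raise ValueError(f'fine label {f} maps to more than one coarse label')
    return mapping

# Toy stand-in for the real label arrays (4 fine classes, 2 superclasses)
fine_demo = np.array([0, 1, 2, 0, 3, 1])
coarse_demo = np.array([0, 0, 1, 0, 1, 0])

mapping = fine_to_coarse(fine_demo, coarse_demo)
print(mapping)
```

The consistency check is a cheap safeguard: if a fine label ever appeared under two different superclasses, the labels would be misaligned and the function raises instead of silently overwriting.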

Let's take a look at the training dataset

In [162]:
X_train.shape

(50000, 3072)

We observe that there are a total of <code>50000</code> rows and <code>3072</code> columns: one row per training image and one column per pixel value (<code>32 x 32 x 3 = 3072</code>).

### Testing for missing values and invalid data
Let's try to identify if there are any missing values.

In [163]:
print("Feature missing values:", pd.DataFrame(X_train).isnull().sum().sum())
print("Label missing values:", pd.DataFrame(y_train).isnull().sum().sum())

Feature missing values: 0
Label missing values: 0
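
The check above covers missing values only. A minimal sketch of a few further validity checks follows: pixel values within 0 to 255, labels within the expected class range, matching array lengths, and an equal number of images per class. The function name `check_images_and_labels` and the synthetic arrays are assumptions for illustration; on the real data one would pass `X_train` and `y_train`.

```python
import numpy as np

def check_images_and_labels(X, y, n_classes=100):
    """Return basic sanity checks for flattened images X and integer labels y."""
    X = np.asarray(X)
    y = np.asarray(y)
    labels_ok = bool(y.min() >= 0 and y.max() < n_classes)
    # bincount rejects negative values, so only count when the labels are in range
    counts = np.bincount(y, minlength=n_classes) if labels_ok else None
    return {
        'pixels_in_range': bool(X.min() >= 0 and X.max() <= 255),
        'labels_in_range': labels_ok,
        'same_length': len(X) == len(y),
        'balanced': bool(labels_ok and counts.min() == counts.max()),
    }

# Synthetic stand-in: 200 random "images", 2 per class
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(200, 3072), dtype=np.uint8)
y_demo = np.repeat(np.arange(100), 2)

report = check_images_and_labels(X_demo, y_demo)
print(report)
```

If the pixel array is already stored as `uint8`, the pixel range check passes by construction; it becomes informative once the data has been cast to another dtype or rescaled.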


In [164]:
# Unflatten to (N, channels, height, width), then move channels last for plotting
temp = np.transpose(X_test.reshape((-1, 3, 32, 32)), axes=[0,2,3,1])
# plt.imshow(temp[0])
temp.shape

(10000, 32, 32, 3)
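
The reshape-then-transpose order matters. In the Python version of the CIFAR data, each flattened row stores the 1024 red values first, then the 1024 green values, then the 1024 blue values. The sketch below uses a single synthetic image (the constant channel values are made up) to show that reshaping to `(N, 3, 32, 32)` and moving the channel axis last gives one RGB triple per pixel, whereas reshaping directly to `(N, 32, 32, 3)` mixes the channels.

```python
import numpy as np

# One fake flattened image: 1024 red values, then 1024 green, then 1024 blue
red = np.full(1024, 255, dtype=np.uint8)
green = np.full(1024, 128, dtype=np.uint8)
blue = np.zeros(1024, dtype=np.uint8)
flat = np.concatenate([red, green, blue])[np.newaxis, :]  # shape (1, 3072)

# Split into channel planes first, then move the channel axis last
images = flat.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)

# Reshaping straight to channels-last reads three consecutive red values as one pixel
wrong = flat.reshape(-1, 32, 32, 3)

print(images.shape)     # (1, 32, 32, 3)
print(images[0, 0, 0])  # [255 128   0]
print(wrong[0, 0, 0])   # [255 255 255]
```

The channels-last layout is what `plt.imshow` expects for colour images, while torch convolution layers expect channels first, so the untransposed `(N, 3, 32, 32)` array is typically the one to feed to a model.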

## Exploratory Data Analysis