<a href="https://colab.research.google.com/github/krzs13/JellyfishNet/blob/master/JellyfishNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of Contents


0. Dependencies
1. Introduction
2. Data Preparation
    * 2.1 Load Data
    * 2.2 Data Normalization
    * 2.3 Train/Validation Set Split
    * 2.4 Data Augmentation
3. JellyfishNet
    * 3.1 Main Assumption
    * 3.2 Architectural Highlights
    * 3.3 Model
4. Hyperparameters Tuning
    * 4.1 Keras Tuner
    * 4.2 Regularization Layers
    * 4.3 Activation Layers
    * 4.4 Dense Units
    * 4.5 Data Augmentation Parameters
    * 4.6 Learning Rate and Learning Rate Reduction
5. Model Evaluation
    * 5.1 TensorBoard
6. Prediction and Model Ensemble









# 0. Dependencies

In [1]:
import datetime
import os
import random
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from IPython import display
from sklearn.model_selection import train_test_split

# 1. Introduction

**99.75% test accuracy with only 234,330 trainable parameters!**

The moment I learned about attention in machine learning, I wanted to build my own model around the concept. Attention helps a model focus on the essentials, which should allow for a simple solution. That is why JellyfishNet was created: to make a model that is as simple as possible, uses attention, and has a small number of trainable parameters.



# 2. Data Preparation

## 2.1 Load Data

For this task I am using the original MNIST dataset: the training set contains 60,000 images and the test set contains 10,000 images.

In [None]:
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

## 2.2 Data Normalization

The MNIST dataset contains grayscale images with pixel values from 0 to 255, stored as 2D NumPy arrays. Keras convolutional layers expect each image as a 3D array (height, width, channels), so I add a 'channel' dimension with the expand_dims function from the NumPy package. Convolutional neural networks converge faster on values in the 0-1 range than in the 0-255 range, so pixel values are divided by 255 and stored as float32. Training labels, which are integers, are one-hot encoded, so a label of 2 becomes [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]. Test labels stay as integers because they are later compared with argmax predictions. I also make sure that all labels share the int8 data type.

In [3]:
train_images = (np.expand_dims(train_images, axis=-1) / 255).astype(np.float32)
train_labels = tf.keras.utils.to_categorical(
    train_labels, num_classes=10, dtype=np.int8)

test_images = (np.expand_dims(test_images, axis=-1) / 255).astype(np.float32)
test_labels = (test_labels).astype(np.int8)
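
To make the shapes and the one-hot encoding concrete, here is a minimal NumPy-only sketch on two randomly generated stand-in images. The `images` and `labels` arrays are made up for illustration, and the identity-matrix trick is just one way to get the same encoding that `to_categorical` produces.

```python
import numpy as np

# Two fake 28x28 grayscale "images" with pixel values in 0..255 (stand-ins for MNIST).
images = np.random.default_rng(0).integers(0, 256, size=(2, 28, 28), dtype=np.uint8)
labels = np.array([2, 7])

# Add a trailing channel axis and rescale to the 0-1 range as float32.
scaled = (np.expand_dims(images, axis=-1) / 255).astype(np.float32)

# One-hot encode by picking rows of a 10x10 identity matrix.
one_hot = np.eye(10, dtype=np.int8)[labels]

print(scaled.shape, scaled.dtype)  # (2, 28, 28, 1) float32
print(one_hot[0])                  # [0 0 1 0 0 0 0 0 0 0]
```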

## 2.3 Train/Validation Set Split

The training set is split into training and validation sets with the train_test_split function from the scikit-learn package. The validation set contains 10% of the images from the original training set. The stratify parameter makes sure that the training and validation sets have the same proportions of class labels, which is important for proper learning. The random state is set to 42 (a popular random seed) so that the split is identical across runs, which makes results more reliable and models comparable.

In [4]:
train_images, val_images, train_labels, val_labels = train_test_split(
    train_images, 
    train_labels, 
    test_size=0.1, 
    random_state=42, 
    stratify=train_labels,
)
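
What stratification does can be sketched with plain NumPy. This is only an illustration of the idea on toy integer labels, not scikit-learn's implementation; `stratified_split_indices` is a hypothetical helper name.

```python
import numpy as np

def stratified_split_indices(labels, val_fraction, seed):
    """Return (train_idx, val_idx) so that each class keeps its share in both parts."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        n_val = int(round(len(cls_idx) * val_fraction))
        val_idx.extend(cls_idx[:n_val])
        train_idx.extend(cls_idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# Toy integer labels: 100 samples of class 0, 50 of class 1, 50 of class 2.
labels = np.repeat([0, 1, 2], [100, 50, 50])
train_idx, val_idx = stratified_split_indices(labels, val_fraction=0.1, seed=42)

# Class shares in the validation part mirror the full set (0.5, 0.25, 0.25).
print(np.bincount(labels[val_idx]) / len(val_idx))
```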

## 2.4 Data Augmentation

Data augmentation is used to prevent overfitting. The idea is to alter the training data with small transformations so the neural network 'sees' more images while learning.
* rotation_range - randomly rotates images. A small range is chosen to avoid confusion between 6 and 9
* zoom_range - randomly zooms images
* width_shift_range - randomly shifts images horizontally
* height_shift_range - randomly shifts images vertically

ImageDataGenerator offers other image augmentation methods, but I didn't find them useful for the MNIST dataset because they don't make the model learn better.



In [5]:
datagen = tf.keras.preprocessing.image.ImageDataGenerator(  
    rotation_range=10,  
    zoom_range=0.2, 
    width_shift_range=0.2, 
    height_shift_range=0.2,
)

datagen.fit(train_images)
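
As a rough illustration of what a shift range of 0.2 means for a 28x28 image (up to about 5 pixels in each direction), here is a simplified NumPy sketch. It is not how ImageDataGenerator is implemented: `random_shift` is a hypothetical helper that uses whole-pixel offsets and fills the exposed border with zeros, whereas ImageDataGenerator fills with the nearest pixel value by default.

```python
import numpy as np

def random_shift(image, max_fraction, rng):
    """Shift a (height, width, channels) image by a random offset, padding with zeros."""
    height, width = image.shape[:2]
    max_dy, max_dx = int(height * max_fraction), int(width * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))

    shifted = np.zeros_like(image)
    # Copy the part of the image that stays inside the frame after the shift.
    src_y = slice(max(0, -dy), min(height, height - dy))
    dst_y = slice(max(0, dy), min(height, height + dy))
    src_x = slice(max(0, -dx), min(width, width - dx))
    dst_x = slice(max(0, dx), min(width, width + dx))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted

rng = np.random.default_rng(42)
image = np.ones((28, 28, 1), dtype=np.float32)
augmented = random_shift(image, max_fraction=0.2, rng=rng)
print(augmented.shape)  # (28, 28, 1)
```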

# 3. JellyfishNet

## 3.1 Main Assumption

The goal of JellyfishNet was to create a simple model with attention that has a low number of trainable parameters and doesn't need much computational power. I assumed that multiplying a convolution with softmax activation by a convolution with a non-linear activation in the first layers of the network would be good enough for a task as simple as digit recognition. Gaussian Noise was chosen as regularization for the head layers, and Gaussian Dropout was chosen for the spine and fully connected layers. The idea was to apply regularization with a small rate after each block, instead of regularization with a bigger rate after several blocks, as is usually seen.

## 3.2 Architectural Highlights

JellyfishNet is made of two heads with simple attention modules, connected to a spine made of convolutional blocks. The last block of the network is a fully connected layer.

* Each head contains two parallel blocks, each built around a strided convolution, with kernel sizes 3 and 7. The convolution with the smaller kernel is followed by a ReLU activation and the one with the bigger kernel by a Softmax activation. Each block has its own Gaussian Noise and Batch Normalization layers. Finally, the outputs of the ReLU and Softmax blocks are multiplied.

* The heads are concatenated. Before concatenation, the head with the lower stride is passed through a convolution so that its output has the same number of filters and the same feature map size as the other head. This convolution layer uses the same regularization and activation layers as the spine.

* The spine contains three convolutional blocks that lower the number of filters from 128 to 16. The convolutions have a stride of 1, so the feature map size stays constant from the beginning to the end of the spine. Each block has its own Gaussian Dropout and Batch Normalization layers. Swish activation was chosen in the hyperparameter tuning process.

* The fully connected part of JellyfishNet begins with a Flatten layer. The flattened output is passed through just one Dense layer to keep the number of trainable parameters small. The Dense block has its own Gaussian Dropout and Batch Normalization layers. LeakyReLU activation was chosen in the hyperparameter tuning process. The last layer is a standard classification layer: Dense with Softmax activation.

* Model plot (my lovely girlfriend told me it looks like a jellyfish, and that's where the name comes from 😉): ![JellyfishNet](https://drive.google.com/uc?export=view&id=1oBEaipwtevLA5fivWHiYMV-nPhdWorBx)
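
The parameter count quoted in the introduction can be checked by hand from the layer sizes used in the model code in section 3.3. The sketch below is plain Python arithmetic and assumes 28x28x1 inputs and 'same' padding, so the strides of 4 and 2 give 7x7 and 14x14 feature maps. The helper names are mine, not part of the model.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * filters + filters

def batchnorm_params(channels):
    """Trainable scale (gamma) and offset (beta) of a BatchNormalization layer."""
    return 2 * channels

def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return inputs * units + units

def head_params(in_channels, filters):
    # ReLU branch (3x3) and Softmax branch (7x7), each with its own BatchNormalization.
    return (conv_params(3, in_channels, filters)
            + conv_params(7, in_channels, filters)
            + 2 * batchnorm_params(filters))

def conv_block_params(in_channels, filters):
    # One 3x3 convolution followed by BatchNormalization.
    return conv_params(3, in_channels, filters) + batchnorm_params(filters)

total = (
    head_params(1, 64)                # head 1: 28x28x1 -> 7x7x64 (stride 4)
    + head_params(1, 128)             # head 2: 28x28x1 -> 14x14x128 (stride 2)
    + conv_block_params(128, 64)      # head 2 convolution: 14x14x128 -> 7x7x64
    + conv_block_params(128, 64)      # spine 1 on the concatenated 7x7x128 tensor
    + conv_block_params(64, 32)       # spine 2
    + conv_block_params(32, 16)       # spine 3
    + dense_params(7 * 7 * 16, 64)    # Dense layer after Flatten (784 inputs)
    + batchnorm_params(64)            # BatchNormalization of the Dense block
    + dense_params(64, 10)            # classification layer
)
print(total)  # 234330
```

Noise, dropout, activation, and multiply layers have no trainable parameters, which is why they do not appear in the sum.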


## 3.3 Model

### Head Layer

In [6]:
class HeadLayer(tf.keras.layers.Layer):
    def __init__(
        self, 
        filters: int, 
        strides: Tuple[int, int], 
        gaussian_noise_ratio: float,
    ):
        super().__init__()
        self._conv = tf.keras.layers.Conv2D(
            filters,
            3,
            strides=strides,
            padding='same',
        )
        self._conv_noise = tf.keras.layers.GaussianNoise(gaussian_noise_ratio)
        self._conv_activation = tf.keras.layers.Activation(tf.nn.relu)
        self._conv_batchnorm = tf.keras.layers.BatchNormalization()
        self._soft = tf.keras.layers.Conv2D(
            filters,
            7,
            strides=strides,
            padding='same',
        )
        self._soft_noise = tf.keras.layers.GaussianNoise(gaussian_noise_ratio)
        self._soft_activation = tf.keras.layers.Softmax()
        self._soft_batchnorm = tf.keras.layers.BatchNormalization()
        self._multiply = tf.keras.layers.Multiply()

    def call(self, inputs):
        conv = self._conv(inputs)
        conv = self._conv_noise(conv)
        conv = self._conv_activation(conv)
        conv = self._conv_batchnorm(conv)

        soft = self._soft(inputs)
        soft = self._soft_noise(soft)
        soft = self._soft_activation(soft)
        soft = self._soft_batchnorm(soft)

        multiply = self._multiply([conv, soft])

        return multiply
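
The gating idea behind the head can be shown without TensorFlow. The NumPy sketch below uses random arrays as stand-ins for the two convolution outputs and leaves out the noise and batch normalization layers, so it only illustrates the ReLU-times-Softmax product. The softmax runs over the last (filter) axis, which is the default axis of `tf.keras.layers.Softmax`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Stand-ins for the outputs of the 3x3 and 7x7 convolutions: (batch, height, width, filters).
conv_out = rng.normal(size=(1, 7, 7, 64))
soft_out = rng.normal(size=(1, 7, 7, 64))

features = np.maximum(conv_out, 0.0)   # ReLU branch
weights = softmax(soft_out, axis=-1)   # Softmax branch: weights that sum to 1 per position
gated = features * weights             # element-wise product, as in Multiply()

print(gated.shape)                           # (1, 7, 7, 64)
print(np.allclose(weights.sum(axis=-1), 1))  # True
```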

In [7]:
class HeadConvolutionLayer(tf.keras.layers.Layer):
    def __init__(
        self, 
        filters: int,
        strides: Tuple[int, int],  
        gaussian_dropout_ratio: float,
    ):
        super().__init__()
        self._conv = tf.keras.layers.Conv2D(
            filters,
            3,
            strides=strides,
            padding='same',
        )
        self._dropout = tf.keras.layers.GaussianDropout(gaussian_dropout_ratio)
        self._activation = tf.keras.layers.Activation(tf.nn.swish)
        self._batchnorm = tf.keras.layers.BatchNormalization()

    def call(self, inputs):
        x = self._conv(inputs)
        x = self._dropout(x)
        x = self._activation(x)
        x = self._batchnorm(x)

        return x

### Spine Layer

In [8]:
class SpineLayer(tf.keras.layers.Layer):
    def __init__(
        self, 
        filters: int, 
        gaussian_dropout_ratio: float,
    ):
        super().__init__()
        self._conv = tf.keras.layers.Conv2D(
            filters,
            3,
            strides=(1, 1),
            padding='same',
        )
        self._dropout = tf.keras.layers.GaussianDropout(gaussian_dropout_ratio)
        self._activation = tf.keras.layers.Activation(tf.nn.swish)
        self._batchnorm = tf.keras.layers.BatchNormalization()

    def call(self, inputs):
        x = self._conv(inputs)
        x = self._dropout(x)
        x = self._activation(x)
        x = self._batchnorm(x)

        return x

### Fully Connected Layer

In [9]:
class FullyConnectedLayer(tf.keras.layers.Layer):
    def __init__(
        self, 
        units: int, 
        gaussian_dropout_ratio: float,
    ):
        super().__init__()
        self._flatten = tf.keras.layers.Flatten()
        self._dense = tf.keras.layers.Dense(units)
        self._dropout = tf.keras.layers.GaussianDropout(gaussian_dropout_ratio)
        self._activation = tf.keras.layers.LeakyReLU(alpha=0.2)
        self._batchnorm = tf.keras.layers.BatchNormalization()

    def call(self, inputs):
        x = self._flatten(inputs)
        x = self._dense(x)
        x = self._dropout(x)
        x = self._activation(x)
        x = self._batchnorm(x)

        return x

### JellyfishNet

In [10]:
class JellyfishNet(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self._head_1 = HeadLayer(
            filters=64, strides=(4, 4), gaussian_noise_ratio=0.02)
        self._head_2 = HeadLayer(
            filters=128, strides=(2, 2), gaussian_noise_ratio=0.02)
        self._head_2_conv = HeadConvolutionLayer(
            filters=64, strides=(2, 2), gaussian_dropout_ratio=0.07)
        self._concatenate = tf.keras.layers.Concatenate()
        self._spine_1 = SpineLayer(filters=64, gaussian_dropout_ratio=0.07)
        self._spine_2 = SpineLayer(filters=32, gaussian_dropout_ratio=0.07)
        self._spine_3 = SpineLayer(filters=16, gaussian_dropout_ratio=0.07)
        self._fully_connected = FullyConnectedLayer(
            units=64, gaussian_dropout_ratio=0.07)
        self._classification = tf.keras.layers.Dense(10, activation='softmax')

    def call(self, inputs):
        head_1 = self._head_1(inputs)
        head_2 = self._head_2(inputs)
        head_2 = self._head_2_conv(head_2)
        concatenate = self._concatenate([head_1, head_2])

        spine = self._spine_1(concatenate)
        spine = self._spine_2(spine)
        spine = self._spine_3(spine)

        fully_connected = self._fully_connected(spine)

        classification = self._classification(fully_connected)

        return classification

# 4. Hyperparameters Tuning

## 4.1 Keras Tuner

Keras Tuner is an amazing package that makes hyperparameter tuning pretty easy. For JellyfishNet I used the Hyperband method, which is an optimized random search. Hyperband works similarly to genetic algorithms: it starts with many networks with different parameters, and only the best-performing ones get to the next round. This is repeated until only the single best network is left. This section doesn't contain code because the whole hyperparameter tuning process took about 9 hours on a single V100 GPU instance.
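
The elimination rounds described above can be sketched in a few lines of plain Python. This is a simplified illustration of the successive-halving idea only, not Keras Tuner's code: real Hyperband runs several such brackets with different starting budgets and a configurable reduction factor. `toy_score` is a hypothetical stand-in for training a network for `budget` epochs and returning its validation accuracy.

```python
import random

def successive_halving(configs, evaluate, budget=1):
    """Keep the better half of the candidates each round, doubling the budget."""
    candidates = list(configs)
    while len(candidates) > 1:
        scored = sorted(candidates, key=lambda c: evaluate(c, budget), reverse=True)
        candidates = scored[:max(1, len(scored) // 2)]
        budget *= 2
    return candidates[0]

def toy_score(config, budget):
    # Hypothetical score: best at a rate of 0.07, independent of the budget.
    return -abs(config['rate'] - 0.07)

random.seed(0)
# Sixteen random candidates for a regularization rate between 0.01 and 0.10.
configs = [{'rate': round(random.uniform(0.01, 0.10), 3)} for _ in range(16)]
best = successive_halving(configs, toy_score)
print(best)
```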

## 4.2 Regularization Layers

The types of regularization layers were chosen by hand (Gaussian Noise for the heads and Gaussian Dropout for the rest of JellyfishNet), but their rates were chosen by Keras Tuner. The range given to Keras Tuner was from 0.01 to 0.10.

## 4.3 Activation Layers

The only activation layers chosen by hand were the Softmax layers in the heads and in the classification layer. Keras Tuner did its job for the rest of the network. It could choose between ReLU, LeakyReLU and Swish activations. Because the search used activations from the tf.nn package, the LeakyReLU alpha is 0.2, which is the default for tf.nn.leaky_relu, instead of 0.3, which is the default for tf.keras.layers.LeakyReLU.
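
For reference, the candidate activations and the effect of the two alpha values are easy to reproduce in NumPy. These are the standard textbook definitions written out by hand, not the TensorFlow implementations.

```python
import numpy as np

def leaky_relu(x, alpha):
    """Pass positive inputs through and scale negative inputs by alpha."""
    return np.where(x >= 0, x, alpha * x)

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, -1.0, 0.0, 1.0])
print(leaky_relu(x, alpha=0.2))  # negative inputs scaled by 0.2, e.g. -2 -> -0.4
print(leaky_relu(x, alpha=0.3))  # negative inputs scaled by 0.3, e.g. -2 -> -0.6
print(swish(x))                  # smooth, slightly negative for negative inputs
```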

## 4.4 Dense Units

The number of units in the Dense layer of the fully connected block was chosen by Keras Tuner from the range 32 to 192 with a step of 16.

## 4.5 Data Augmentation Parameters

Data augmentation parameters were chosen by hand. I found that stronger data augmentation results in worse single-model performance but a better ensemble result. I decided not to use Keras Tuner here because it chooses the best parameters only for a single model.

## 4.6 Learning Rate and Learning Rate Reduction

These parameters were also chosen by hand, by looking at the accuracy/loss plots. I chose the ReduceLROnPlateau callback from the tf.keras.callbacks package. It reduces the learning rate by a chosen factor when the network stops improving. The patience parameter was chosen according to my observations of JellyfishNet's training behavior. There is also an early stopping callback that stops training after 7 epochs without progress.
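
How a plateau-based schedule behaves can be simulated in plain Python. The sketch below is a simplified version of the idea (it ignores the min_delta and cooldown options of the real callback) and uses the factor of 0.5, patience of 2, and minimum learning rate from the training code in section 6. The validation accuracies are invented for illustration.

```python
def simulate_reduce_on_plateau(val_accuracies, initial_lr=1e-3, factor=0.5,
                               patience=2, min_lr=1e-10):
    """Return the learning rate in effect after each epoch (simplified logic)."""
    lr, best, wait, history = initial_lr, float('-inf'), 0, []
    for accuracy in val_accuracies:
        if accuracy > best:
            best, wait = accuracy, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation accuracies: improvement stalls after the third epoch.
accuracies = [0.980, 0.990, 0.993, 0.993, 0.992, 0.993, 0.991]
# The learning rate is halved at epoch 5 and again at epoch 7.
print(simulate_reduce_on_plateau(accuracies))
```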

# 5. Model Evaluation

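This model learns pretty well: the validation accuracy stays above the training accuracy throughout the training session, which suggests that JellyfishNet doesn't overfit the training set. Part of this gap is expected, because data augmentation and the noise and dropout layers are active only during training.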

![Accuracy](https://drive.google.com/uc?export=view&id=1fjI01d5PuydANg1kABxxfa93KMKN07ty)

## 5.1 TensorBoard

In [None]:
!rm -rf ./logs/
!mkdir -p logs

%load_ext tensorboard
%tensorboard --logdir logs

# 6. Prediction and Model Ensemble

The model learns well with a batch size of 50 and an initial learning rate of 0.001. The maximum number of epochs doesn't matter much because early stopping ends training after 7 epochs without progress. There is also a model checkpoint callback that saves the weights with the best validation accuracy. The final submission is an ensemble of 15 models: their predictions are averaged to produce a more reliable result.

In [12]:
!mkdir -p models

BATCH_SIZE = 50
MAX_EPOCHS = 200
AMOUNT_OF_MODELS_TO_EVALUATE = 15

def train_model(
    batch_size: int, max_epochs: int, amount_of_models_to_evaluate: int
) -> str:
    """Train the ensemble members, then report ensemble accuracy on the test set."""
    ensemble_models = []

    for i in range(1, amount_of_models_to_evaluate + 1):
        print(f'STEP: {i}/{amount_of_models_to_evaluate}\n')

        model = JellyfishNet()

        early_stopping = tf.keras.callbacks.EarlyStopping(
            monitor='val_accuracy', patience=7)
        learning_rate_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
            monitor='val_accuracy', 
            patience=2, 
            verbose=1, 
            factor=0.5, 
            min_lr=1e-10,
        )
        model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
            f'models/jellyfishnet_step_{str(i).zfill(3)}',
            monitor='val_accuracy',
            verbose=1,
            save_best_only=True, 
            save_weights_only=True,
        )
        log_dir = f'logs/{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}'
        tensorboard_callback = tf.keras.callbacks.TensorBoard(
            log_dir=log_dir, histogram_freq=1, write_graph=False)
        callbacks = [
            early_stopping, 
            learning_rate_scheduler,
            model_checkpoint,
            tensorboard_callback,
        ]

        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
            loss='categorical_crossentropy',
            metrics=['accuracy'],
        )

        model.fit(
            datagen.flow(train_images, train_labels, batch_size=batch_size),
            # steps_per_epoch should be an integer, so use floor division.
            steps_per_epoch=len(train_images) // batch_size,
            epochs=max_epochs, 
            validation_data=(val_images, val_labels), 
            callbacks=callbacks,
        )

        display.clear_output(wait=True)

    for i in range(1, amount_of_models_to_evaluate + 1):
        model = JellyfishNet()
        model.load_weights(f'models/jellyfishnet_step_{str(i).zfill(3)}')
        ensemble_models.append(model)

    predictions = [model.predict(test_images) for model in ensemble_models]
    predictions = np.array(predictions)
    predictions = np.mean(predictions, axis=0)
    predictions = np.argmax(predictions, axis=1)

    accuracy = tf.keras.metrics.Accuracy()
    accuracy.update_state(test_labels, predictions)
    accuracy = accuracy.result().numpy()

    return f'Ensemble accuracy: {str(accuracy * 100)[:7]}%'

train_model(BATCH_SIZE, MAX_EPOCHS, AMOUNT_OF_MODELS_TO_EVALUATE)

'Ensemble accuracy: 99.7500%'
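
The averaging step at the end of `train_model` can be illustrated on a tiny hand-made example. The probabilities below are invented; they stand in for the softmax outputs of three models on two samples with three classes.

```python
import numpy as np

# Hypothetical softmax outputs, shape (models, samples, classes).
model_probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.4, 0.5, 0.1], [0.1, 0.7, 0.2]],
    [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]],
])

# Average the probabilities over the models, then pick the most likely class.
mean_probs = model_probs.mean(axis=0)
ensemble_pred = mean_probs.argmax(axis=1)
individual_preds = model_probs.argmax(axis=2)

print(individual_preds)  # the second and third model each disagree on one sample
print(ensemble_pred)     # [0 1]
```

Here two of the three models each get one sample "wrong" relative to the majority, yet the averaged prediction follows the majority on both samples, which is the effect the ensemble relies on.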