<a href="https://colab.research.google.com/github/gkdivya/EVA/blob/master/13.ResNet/ResNet18_CIFAR10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Super Convergence using ResNet18
In this notebook, we modify the code from [CIFAR-10 - 92% Acc within 10 Epochs](https://mc.ai/tutorial-1-cifar10-with-google-colabs-free-gpu%E2%80%8A-%E2%80%8A92-5/) as follows, targeting 90% accuracy or more:

*   Use ResNet18 model instead
*   The model must be structured as Conv->B1->B2->B3->B4 rather than as individually called convolutions.
*   Batch Size 128
*   Use Normalization values of: (0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)
*   Random Crop of 32 with padding of 4px
*   Horizontal Flip (0.5)
*   Optimizer: SGD, Weight-Decay: 5e-4
*   NOT-OneCycleLR
*   Train for 300 Epochs




In [4]:
import numpy as np
import time, math
from tqdm import tqdm_notebook as tqdm

import tensorflow as tf
import tensorflow.contrib.eager as tfe

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## Eager Execution
This notebook uses TensorFlow's eager execution runtime. More details on eager execution:

https://github.com/gkdivya/EVA/blob/master/12.SuperConvergence/EagerExecution.md




In [0]:
tf.enable_eager_execution()

In [0]:
BATCH_SIZE = 128 #@param {type:"integer"}
MOMENTUM = 0.9 #@param {type:"number"}
LEARNING_RATE = 0.4 #@param {type:"number"}
WEIGHT_DECAY = 5e-4 #@param {type:"number"}
EPOCHS = 300 #@param {type:"integer"}

## Initializing weights

**Initialization of Convolution layer**

To initialize the weights of a convolution layer, Keras uses the `Xavier Glorot init` algorithm, while PyTorch uses a variant of `Kaiming He init`. Since we are replicating DavidNet's PyTorch code in Keras, we initialize the weights the same way PyTorch does.

PyTorch takes the inverse square root of the layer's fan-in as a bound and then draws each initial weight uniformly from the range [-bound, bound].

In [0]:
def init_pytorch(shape, dtype=tf.float32, partition_info=None):
  fan = np.prod(shape[:-1])
  bound = 1 / math.sqrt(fan)
  return tf.random.uniform(shape, minval=-bound, maxval=bound, dtype=dtype)
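
As a quick sanity check of the bound, here is a minimal sketch that uses only the standard library. The kernel shape `(3, 3, 64, 128)` is an illustrative assumption (a 3×3 convolution from 64 to 128 channels in Keras's height, width, in, out layout), and `fan_in_bound` is a hypothetical helper that mirrors the arithmetic in `init_pytorch`.

```python
import math

def fan_in_bound(shape):
    """Return the uniform bound 1/sqrt(fan_in) for a Keras-style kernel shape.

    Keras stores conv kernels as (height, width, in_channels, out_channels),
    so the fan-in is the product of every dimension except the last.
    """
    fan_in = math.prod(shape[:-1])
    return 1.0 / math.sqrt(fan_in)

# Assumed example: 3x3 convolution, 64 input channels, 128 output channels.
bound = fan_in_bound((3, 3, 64, 128))
print(bound)  # fan-in = 3 * 3 * 64 = 576, so the bound is 1/24
```

Wider layers get a smaller bound, so their initial weights start closer to zero.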

**Initialization of Batch Normalization layer**

In PyTorch, the weights of a batch normalization layer are initialized randomly, whereas Keras initializes them to 1 by default. DavidNet fixes the batch normalization weights to 1, so the Keras default needs no change.
The only settings we change are `momentum` (set to 0.9) and `epsilon` (set to 1e-5):

*   `momentum`: Momentum for the moving mean and the moving variance.
*   `epsilon`: Small float added to the variance to avoid dividing by zero.
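
To make the two parameters concrete, here is a minimal scalar sketch of how they enter the computation. It follows the Keras convention, where `momentum` is the weight kept on the old moving statistic (PyTorch's `momentum` is the weight on the new value, so Keras 0.9 corresponds to PyTorch 0.1). The function names and the batch means are made up for illustration.

```python
def update_moving_stat(moving, batch_value, momentum=0.9):
    """Keras-style exponential moving average used for BN statistics."""
    return momentum * moving + (1.0 - momentum) * batch_value

def normalize_value(x, mean, variance, epsilon=1e-5):
    """Normalize a scalar; epsilon keeps the denominator away from zero."""
    return (x - mean) / (variance + epsilon) ** 0.5

moving_mean = 0.0
for batch_mean in [1.0, 1.0, 1.0]:  # made-up batch means
    moving_mean = update_moving_stat(moving_mean, batch_mean)
print(moving_mean)  # moves slowly towards 1.0 because momentum is 0.9

# With zero variance the division is still well defined thanks to epsilon.
print(normalize_value(2.0, mean=2.0, variance=0.0))
```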

## Creating blocks

**Convolution Block:** Inputs pass through a convolution layer -> dropout -> batch normalization -> ReLU

In [0]:
class ConvBN(tf.keras.Model):
  def __init__(self, c_out):
    super().__init__()
    self.conv = tf.keras.layers.Conv2D(filters=c_out, kernel_size=3, padding="SAME", kernel_initializer=init_pytorch, use_bias=False)
    self.bn = tf.keras.layers.BatchNormalization(momentum=0.9, epsilon=1e-5)
    self.drop = tf.keras.layers.Dropout(0.05)

  def call(self, inputs):
    # Chain the layers: conv -> dropout -> batch norm -> ReLU
    return tf.nn.relu(self.bn(self.drop(self.conv(inputs))))

**Residual Block:** A Conv-BN-ReLU block, followed by a 2×2 max pool, and then (optionally) two Conv-BN-ReLU blocks with a residual connection

In [0]:
class ResBlk(tf.keras.Model):
  def __init__(self, c_out, pool, res = False):
    super().__init__()
    self.conv_bn = ConvBN(c_out)
    self.pool = pool
    self.res = res
    if self.res:
      self.res1 = ConvBN(c_out)
      self.res2 = ConvBN(c_out)

  def call(self, inputs):
    h = self.pool(self.conv_bn(inputs))
    if self.res:
      h = h + self.res2(self.res1(h))
    return h

In [0]:

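# Imports needed by resnet_v1 (missing from the original cell).
from tensorflow import keras
from tensorflow.keras.layers import Activation, AveragePooling2D, Dense, Flatten, Input
from tensorflow.keras.models import Model

# Note: resnet_layer (the Conv2D-BN-activation helper) is not defined in this notebook.
def resnet_v1(input_shape, depth, num_classes=10):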
    """ResNet Version 1 Model builder [a]

    Stacks of 2 x (3 x 3) Conv2D-BN-ReLU
    Last ReLU is after the shortcut connection.
    At the beginning of each stage, the feature map size is halved (downsampled)
    by a convolutional layer with strides=2, while the number of filters is
    doubled. Within each stage, the layers have the same number of filters and
    the same feature map sizes.
    Feature map sizes:
    stage 0: 32x32, 16
    stage 1: 16x16, 32
    stage 2:  8x8,  64
    The Number of parameters is approx the same as Table 6 of [a]:
    ResNet20 0.27M
    ResNet32 0.46M
    ResNet44 0.66M
    ResNet56 0.85M
    ResNet110 1.7M

    # Arguments
        input_shape (tensor): shape of input image tensor
        depth (int): number of core convolutional layers
        num_classes (int): number of classes (CIFAR10 has 10)

    # Returns
        model (Model): Keras model instance
    """
    if (depth - 2) % 6 != 0:
        raise ValueError('depth should be 6n+2 (eg 20, 32, 44 in [a])')
    # Start model definition.
    num_filters = 16
    num_res_blocks = int((depth - 2) / 6)

    inputs = Input(shape=input_shape)
    x = resnet_layer(inputs=inputs)
    # Instantiate the stack of residual units
    for stack in range(3):
        for res_block in range(num_res_blocks):
            strides = 1
            if stack > 0 and res_block == 0:  # first layer but not first stack
                strides = 2  # downsample
            y = resnet_layer(inputs=x,
                             num_filters=num_filters,
                             strides=strides)
            y = resnet_layer(inputs=y,
                             num_filters=num_filters,
                             activation=None)
            if stack > 0 and res_block == 0:  # first layer but not first stack
                # linear projection residual shortcut connection to match
                # changed dims
                x = resnet_layer(inputs=x,
                                 num_filters=num_filters,
                                 kernel_size=1,
                                 strides=strides,
                                 activation=None,
                                 batch_normalization=False)
            x = keras.layers.add([x, y])
            x = Activation('relu')(x)
        num_filters *= 2

    # Add classifier on top.
    # v1 does not use BN after last shortcut connection-ReLU
    x = AveragePooling2D(pool_size=8)(x)
    y = Flatten()(x)
    outputs = Dense(num_classes,
                    activation='softmax',
                    kernel_initializer='he_normal')(y)

    # Instantiate model.
    model = Model(inputs=inputs, outputs=outputs)
    return model
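
The depth rule enforced at the top of `resnet_v1` can be checked by hand. The sketch below is a standalone restatement of that check, not part of the model; `blocks_per_stack` is a hypothetical helper. Each residual block holds 2 convolutions and there are 3 stacks, which gives 6n layers, plus the first convolution and the final dense layer.

```python
def blocks_per_stack(depth):
    """Residual blocks per stack for a 6n+2 ResNet v1, as in resnet_v1."""
    if (depth - 2) % 6 != 0:
        raise ValueError('depth should be 6n+2 (eg 20, 32, 44)')
    return (depth - 2) // 6

for depth in (20, 32, 44, 56, 110):
    print(depth, blocks_per_stack(depth))
```

Under this rule a depth of 18 is rejected, since 16 is not a multiple of 6.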

In [13]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
len_train, len_test = len(x_train), len(x_test)
y_train = y_train.astype('int64').reshape(len_train)
y_test = y_test.astype('int64').reshape(len_test)

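# Per-channel mean and std of CIFAR-10 on the [0, 1] pixel scale
X_train_mean = np.array([0.4914, 0.4822, 0.4465])
X_train_std = np.array([0.2023, 0.1994, 0.2010])

# load_data() returns uint8 pixels in [0, 255]: rescale to [0, 1], then standardize each channel
normalize = lambda x: ((x / 255.0 - X_train_mean) / X_train_std).astype('float32')

# Reflect-pad 4 px on each side so that a random 32x32 crop can shift the image
pad4 = lambda x: np.pad(x, [(0, 0), (4, 4), (4, 4), (0, 0)], mode='reflect')

x_train = normalize(pad4(x_train))
x_test = normalize(x_test)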

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
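
The reflect padding above prepares the images for the random 32×32 crop and horizontal flip listed in the requirements. The notebook does this with TensorFlow ops; the following is only a NumPy sketch of the same idea on stand-in data. The helper name `random_crop_flip`, the seed, and the random batch are assumptions for illustration.

```python
import numpy as np

def random_crop_flip(padded, rng, size=32):
    """Crop a random size x size window from one padded HWC image, then
    flip it left-right with probability 0.5."""
    max_offset = padded.shape[0] - size  # 40 - 32 = 8 for 4 px padding
    top = rng.integers(0, max_offset + 1)
    left = rng.integers(0, max_offset + 1)
    crop = padded[top:top + size, left:left + size, :]
    if rng.random() < 0.5:
        crop = crop[:, ::-1, :]  # horizontal flip
    return crop

rng = np.random.default_rng(0)  # seed chosen arbitrarily
batch = rng.random((2, 32, 32, 3)).astype('float32')  # stand-in for images
padded = np.pad(batch, [(0, 0), (4, 4), (4, 4), (0, 0)], mode='reflect')
print(padded.shape)  # (2, 40, 40, 3)
print(random_crop_flip(padded[0], rng).shape)  # (32, 32, 3)
```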


In [11]:
model = resnet_v1(input_shape=input_shape, depth=depth)

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=lr_schedule(0)),
              metrics=['accuracy'])
model.summary()

NameError: ignored

## Building DavidNet

The DAWNBench-winning solution for CIFAR-10 has only 8 convolution layers and 1 fully connected layer:

1.   A Conv-BN-ReLU block
2.   3 ResBlocks (two with residual components and one without)
3.   A global max-pool layer
4.   A fully connected layer that outputs logits, followed by a “Multiply by 0.125” operation

The model returns two values: the cross-entropy loss and the number of correct predictions in a batch, from which accuracy is computed.

![DavidNet](https://github.com/gkdivya/EVA/blob/master/12.SuperConvergence/assets/DavidNet.png?raw=true)


In [0]:
class DavidNet(tf.keras.Model):
  def __init__(self, c=64, weight=0.125):
    super().__init__()
    pool = tf.keras.layers.MaxPooling2D()
    self.init_conv_bn = ConvBN(c)
    self.blk1 = ResBlk(c*2, pool, res = True)
    self.blk2 = ResBlk(c*4, pool)
    self.blk3 = ResBlk(c*8, pool, res = True)
    self.pool = tf.keras.layers.GlobalMaxPool2D()
    self.linear = tf.keras.layers.Dense(10, kernel_initializer=init_pytorch, use_bias=False)
    self.weight = weight

  def call(self, x, y):
    h = self.pool(self.blk3(self.blk2(self.blk1(self.init_conv_bn(x)))))
    #Fully-connected classifier layer, multiplies the logits by 0.125. This scaling factor 0.125 is hand-tuned in DavidNet
    h = self.linear(h) * self.weight
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=h, labels=y)
    loss = tf.reduce_sum(ce)
    correct = tf.reduce_sum(tf.cast(tf.math.equal(tf.argmax(h, axis = 1), y), tf.float32))
    return loss, correct
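
To see how the tensor shape changes through the network, here is a small pure-Python trace. It is a sketch under stated assumptions: 'SAME' 3×3 convolutions keep the spatial size, and each `MaxPooling2D` with default arguments halves it. `trace_davidnet` and `RES_BLOCKS` are hypothetical names used only for this illustration.

```python
# (name, has_residual_branch) for the three ResBlk stages in DavidNet
RES_BLOCKS = (('blk1', True), ('blk2', False), ('blk3', True))

def trace_davidnet(size=32, c=64):
    """Return (stage, height, width, channels) tuples for one forward pass."""
    stages = [('init_conv_bn', size, size, c)]
    channels = c
    for name, _ in RES_BLOCKS:
        channels *= 2  # each block doubles the channel count
        size //= 2     # 2x2 max pooling halves height and width
        stages.append((name, size, size, channels))
    stages.append(('global_max_pool', 1, 1, channels))
    return stages

for stage in trace_davidnet():
    print(stage)

# One conv in the stem, one per ResBlk, plus two more where res=True.
conv_layers = 1 + sum(3 if has_res else 1 for _, has_res in RES_BLOCKS)
print(conv_layers)
```

The count agrees with the 8 convolution layers mentioned in the description, and the final dense layer receives a 512-dimensional vector.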

**Model creation:** The global step is initialized so that `lr_schedule` can be driven by it.

**Optimizer:** Define the optimizer with the scheduled learning rate.

**Data Augmentation:** Random crop and horizontal flip are used.

In [14]:
model = DavidNet()
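# Note: summary() raises a ValueError on a subclassed model until the model
# has been called on data, because its weights are not built yet.
model.summary()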
batches_per_epoch = len_train//BATCH_SIZE + 1

lr_schedule = lambda t: np.interp([t], [0, (EPOCHS+1)//5, EPOCHS], [0, LEARNING_RATE, 0])[0]
global_step = tf.train.get_or_create_global_step()

#Learning rate is scaled down by batch size to counter the loss which is scaled up by batch size
lr_func = lambda: lr_schedule(global_step/batches_per_epoch)/BATCH_SIZE


opt = tf.train.MomentumOptimizer(lr_func, momentum=MOMENTUM, use_nesterov=True)
data_aug = lambda x, y: (tf.image.random_flip_left_right(tf.random_crop(x, [32, 32, 3])), y)

ValueError: ignored

## Hyper-parameters
**Loss:** The loss is summed over the batch rather than averaged, so it is scaled up by a factor of the batch size. Each gradient is therefore `BATCH_SIZE` times larger, so the learning rate must be scaled down by a factor of `1/BATCH_SIZE`.

**Weight decay**: The weight decay term is multiplied by the learning rate when the weights are updated. To cancel out the downscaling of the learning rate, the weight decay must therefore be scaled up by the batch size.
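
The cancellation can be checked with plain arithmetic. This is a minimal sketch that uses the hyper-parameter values from the configuration cell and ignores momentum; the variable names other than the three constants are made up.

```python
BATCH_SIZE = 128
LEARNING_RATE = 0.4
WEIGHT_DECAY = 5e-4

# The loss is summed over the batch, so the step size is divided by BATCH_SIZE.
effective_lr = LEARNING_RATE / BATCH_SIZE

# The decay term added to each gradient is multiplied by BATCH_SIZE ...
scaled_decay = WEIGHT_DECAY * BATCH_SIZE

# ... so the per-step shrink factor of a weight is lr * weight_decay again.
shrink_per_step = effective_lr * scaled_decay
print(shrink_per_step, LEARNING_RATE * WEIGHT_DECAY)
```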

**Momentum:**
DavidNet trains the model using stochastic gradient descent with Nesterov momentum and a slanted triangular learning rate schedule.
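
The slanted triangular schedule can be evaluated on its own. In the sketch below, `EPOCHS = 24` is an assumption chosen to match the 24 epochs shown in the training log (the configuration cell sets 300); the schedule itself is the same `np.interp` expression used in the notebook.

```python
import numpy as np

EPOCHS = 24  # assumption: matches the 24 epochs shown in the training log
LEARNING_RATE = 0.4

def lr_schedule(t):
    """Piecewise-linear schedule: 0 -> LEARNING_RATE -> 0 over EPOCHS."""
    peak_epoch = (EPOCHS + 1) // 5  # epoch 5 when EPOCHS is 24
    return float(np.interp(t, [0, peak_epoch, EPOCHS], [0, LEARNING_RATE, 0]))

for epoch in (1, 5, 6, 24):
    print(epoch, lr_schedule(epoch))
```

With these values the rate rises linearly to 0.4 at epoch 5 and then decays linearly to 0 at epoch 24, which agrees with the `lr:` column printed during training.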


In [0]:

t = time.time()
test_set = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BATCH_SIZE)

for epoch in range(EPOCHS):
  train_loss = test_loss = train_acc = test_acc = 0.0
  train_set = tf.data.Dataset.from_tensor_slices((x_train, y_train)).map(data_aug).shuffle(len_train).batch(BATCH_SIZE).prefetch(1)

  tf.keras.backend.set_learning_phase(1)
  for (x, y) in tqdm(train_set):
    with tf.GradientTape() as tape:
      loss, correct = model(x, y)

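    var = model.trainable_variables
    # In eager mode, the GradientTape records the operations run inside its
    # context so that gradients can be computed from them afterwards
    grads = tape.gradient(loss, var)
    # Add the weight decay term to each gradient. Build a new list: `g += ...`
    # on a tensor only rebinds the loop variable and leaves `grads` unchanged
    grads = [g + v * WEIGHT_DECAY * BATCH_SIZE for g, v in zip(grads, var)]
    opt.apply_gradients(zip(grads, var), global_step=global_step)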

    #Loss is summed for each batch
    train_loss += loss.numpy()
    train_acc += correct.numpy()

  tf.keras.backend.set_learning_phase(0)
  for (x, y) in test_set:
    loss, correct = model(x, y)
    test_loss += loss.numpy()
    test_acc += correct.numpy()
    
  print('epoch:', epoch+1, 'lr:', lr_schedule(epoch+1), 'train loss:', train_loss / len_train, 'train acc:', train_acc / len_train, 'val loss:', test_loss / len_test, 'val acc:', test_acc / len_test, 'time:', time.time() - t)




epoch: 1 lr: 0.08 train loss: 1.6336391064453124 train acc: 0.40734 val loss: 1.0610911560058593 val acc: 0.6206 time: 33.05153250694275


epoch: 2 lr: 0.16 train loss: 0.8933230285644531 train acc: 0.68136 val loss: 1.0772883575439454 val acc: 0.6555 time: 58.68683624267578


epoch: 3 lr: 0.24 train loss: 0.6655279177856446 train acc: 0.7685 val loss: 0.7744013854980468 val acc: 0.7541 time: 84.04165172576904


epoch: 4 lr: 0.32 train loss: 0.5791167834472656 train acc: 0.80038 val loss: 0.9863092926025391 val acc: 0.7125 time: 109.61860418319702


epoch: 5 lr: 0.4 train loss: 0.5016365876770019 train acc: 0.82744 val loss: 0.5565072830200195 val acc: 0.8116 time: 135.28209280967712


epoch: 6 lr: 0.37894736842105264 train loss: 0.4197472592163086 train acc: 0.85578 val loss: 0.5377095321655273 val acc: 0.823 time: 160.83615565299988


epoch: 7 lr: 0.35789473684210527 train loss: 0.3454934747314453 train acc: 0.88002 val loss: 0.4544006896972656 val acc: 0.8505 time: 186.4078643321991


epoch: 8 lr: 0.33684210526315794 train loss: 0.2990135446166992 train acc: 0.89546 val loss: 0.48860083312988284 val acc: 0.8506 time: 211.74214005470276


epoch: 9 lr: 0.31578947368421056 train loss: 0.2645912223815918 train acc: 0.90888 val loss: 0.38752456665039064 val acc: 0.8733 time: 237.22713947296143


epoch: 10 lr: 0.2947368421052632 train loss: 0.22748498809814452 train acc: 0.92282 val loss: 0.35482265548706055 val acc: 0.8854 time: 262.75570917129517


epoch: 11 lr: 0.2736842105263158 train loss: 0.20997696502685548 train acc: 0.92768 val loss: 0.36620839767456054 val acc: 0.8809 time: 288.3285961151123


epoch: 12 lr: 0.25263157894736843 train loss: 0.18034590217590332 train acc: 0.93668 val loss: 0.33830790481567385 val acc: 0.8914 time: 313.81680274009705


epoch: 13 lr: 0.23157894736842108 train loss: 0.15810416015625 train acc: 0.94548 val loss: 0.2983163108825684 val acc: 0.9023 time: 339.2895791530609


epoch: 14 lr: 0.2105263157894737 train loss: 0.138950482711792 train acc: 0.95154 val loss: 0.3551513786315918 val acc: 0.8889 time: 364.98002529144287


epoch: 15 lr: 0.18947368421052635 train loss: 0.12186662895202637 train acc: 0.95782 val loss: 0.4001967620849609 val acc: 0.8805 time: 390.6096396446228


epoch: 16 lr: 0.16842105263157897 train loss: 0.10866021209716797 train acc: 0.9629 val loss: 0.29612521057128904 val acc: 0.9102 time: 416.12506556510925


epoch: 17 lr: 0.1473684210526316 train loss: 0.09134000160217286 train acc: 0.96972 val loss: 0.28178701400756834 val acc: 0.9114 time: 441.57280921936035


epoch: 18 lr: 0.12631578947368421 train loss: 0.08025247726440429 train acc: 0.97324 val loss: 0.27596197395324706 val acc: 0.9179 time: 466.97564482688904


epoch: 19 lr: 0.10526315789473689 train loss: 0.07022912216186523 train acc: 0.97722 val loss: 0.28936398010253905 val acc: 0.9147 time: 492.38785886764526


epoch: 20 lr: 0.08421052631578951 train loss: 0.05916305568695068 train acc: 0.98182 val loss: 0.27614187316894534 val acc: 0.9203 time: 517.8384346961975


epoch: 21 lr: 0.06315789473684214 train loss: 0.05028404010772705 train acc: 0.98496 val loss: 0.25275745582580567 val acc: 0.9251 time: 543.2611935138702


epoch: 22 lr: 0.04210526315789476 train loss: 0.042335688152313235 train acc: 0.9869 val loss: 0.25204232559204104 val acc: 0.9257 time: 568.6265816688538


epoch: 23 lr: 0.02105263157894738 train loss: 0.03891898441314697 train acc: 0.98862 val loss: 0.25043766174316406 val acc: 0.9262 time: 594.0733613967896


epoch: 24 lr: 0.0 train loss: 0.03417562875747681 train acc: 0.99074 val loss: 0.24939545059204102 val acc: 0.9289 time: 619.4773788452148
