In [1]:
from __future__ import print_function

import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (
    Activation, BatchNormalization, Conv2D, Dense, Dropout,
    GlobalAveragePooling2D, MaxPooling2D,
)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.activations import relu
from tensorflow.keras.datasets import cifar10, mnist
from tensorflow.keras.regularizers import l2, l1, l1_l2
from tensorflow.keras.constraints import max_norm
from tensorflow.keras import backend as K

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print("Packages Loaded")

Packages Loaded


---

ResNets ('Residual Networks') are built from 'residual blocks', each of which adds its input back onto its output through a skip (shortcut) connection. The input, x, can therefore pass through a block unchanged, and the layers inside the block only have to learn a correction (the residual) to x rather than a completely new representation. This is a major advantage over classical deep models, where every layer has to learn a useful transformation before any signal can reach the layers behind it.

![resnet-e1548261477164.png](attachment:resnet-e1548261477164.png)
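
As a minimal sketch of the idea, assume a block computes `y = x + F(x)`, where `F` is a small two-layer transform. The names `residual_block`, `w1`, and `w2` are illustrative and do not appear elsewhere in this notebook. With all-zero weights `F(x)` is zero, so the block reduces to the identity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Return x + F(x), where F is a small two-layer transform (illustrative)."""
    fx = relu(x @ w1) @ w2   # F(x): the residual the block learns
    return x + fx            # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# With all-zero weights F(x) is zero, so the block is exactly the identity.
zeros = np.zeros((8, 8))
y_identity = residual_block(x, zeros, zeros)

# With non-zero weights the block outputs the input plus a learned correction.
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
y = residual_block(x, w1, w2)

print(np.allclose(y_identity, x))
```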

One thing to note is that the output of each layer normally passes through batch normalization before it enters the ReLU activation function, so the residual blocks will look more like this:

![1_hx7aqbQmlspkrA7wLinFpg.png](attachment:1_hx7aqbQmlspkrA7wLinFpg.png)

This is what a typical ResNet looks like in the simplest terms possible. The head of the network, on the left, is a convolutional layer followed by a max pooling layer, which extract basic low-level features from the data. From there, every ResNet block refines the representation it receives by adding a learned correction to it. You can think of this as gaining knowledge about the data step by step: the more residual blocks you have, the more refinement the network can apply.

This idea also has two big benefits other than improved accuracy: it greatly reduces the vanishing gradient problem, and it makes very deep networks much easier to train. How can that be? The skip connections let activations pass through the network without being changed. This means that even if you had 1,001 layers, your model can learn to rely on only a few of them and keep the rest close to the identity. Gradient descent still updates every layer on every step, but early in training most residual branches contribute little, so the network can behave like a much shallower one. As training goes on, more blocks contribute, and at test time the full 1,001-layer network can be more accurate and stable than a 3-layer network.

Backpropagation still follows the chain rule, but each skip connection adds a direct path along which the gradient reaches earlier layers without being multiplied by the weights of the layers in between. This makes vanishing gradients far less likely, and batch normalization helps keep exploding gradients in check as well. All of this together makes ResNets and their variants much easier to train at depth than standard deep learning models.
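
A toy scalar calculation can illustrate the effect on gradients. It assumes, purely for illustration, that every layer is the linear map `f(h) = w * h` with the same small `w`; the function `end_to_end_gradient` is a hypothetical helper. In a plain stack the end-to-end gradient is a product of `w` factors, while in a residual stack each factor becomes `1 + w`.

```python
def end_to_end_gradient(num_layers, w, residual):
    """Gradient of the last activation w.r.t. the first, for scalar linear layers."""
    grad = 1.0
    for _ in range(num_layers):
        local = w  # derivative of f(h) = w * h
        # Plain layer: h_next = f(h).  Residual layer: h_next = h + f(h).
        grad *= (1.0 + local) if residual else local
    return grad

plain_grad = end_to_end_gradient(20, w=0.01, residual=False)
residual_grad = end_to_end_gradient(20, w=0.01, residual=True)

print(f"plain stack:    {plain_grad:.3e}")
print(f"residual stack: {residual_grad:.3f}")
```

In this toy setting the plain gradient collapses toward zero while the residual gradient stays close to 1. The same arithmetic shows that factors of `1 + w` can also grow when `w` is large, so skip connections reduce the risk of vanishing gradients rather than removing every gradient problem.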

![The-structure-of-ResNet-12.png](attachment:The-structure-of-ResNet-12.png)

A powerful variation of the ResNet is the DenseNet. In a DenseNet, every layer within a block receives the outputs of all the layers before it, so gradients can reach every layer directly at any point in training. It is the idea of ResNets taken to the extreme. DenseNets are a bit harder to create and tune, but they have matched or beaten ResNets on many image classification benchmarks. The original DenseNet concatenates the earlier outputs; we will show a simplified version that adds them instead, and without the bottleneck variant that makes DenseNets more powerful.

![1_QWtz0S27vYsDzb5luRfKSA.png](attachment:1_QWtz0S27vYsDzb5luRfKSA.png)
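
The difference between the two ways of combining features can be sketched with plain NumPy. The sizes `width` and `growth` and the random weights below are illustrative assumptions: ResNet-style addition keeps the feature width fixed, while DenseNet-style concatenation makes it grow with every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, width, growth = 4, 16, 8  # illustrative sizes

x0 = rng.normal(size=(batch, width))

# ResNet-style: each block adds its output to its input, so the width stays fixed.
res_features = x0
for _ in range(3):
    w = rng.normal(scale=0.1, size=(width, width))
    res_features = res_features + np.maximum(res_features @ w, 0.0)

# DenseNet-style: each layer sees the concatenation of all earlier outputs
# and appends `growth` new features, so the width grows layer by layer.
dense_features = [x0]
for _ in range(3):
    stacked = np.concatenate(dense_features, axis=1)
    w = rng.normal(scale=0.1, size=(stacked.shape[1], growth))
    dense_features.append(np.maximum(stacked @ w, 0.0))
dense_out = np.concatenate(dense_features, axis=1)

print(res_features.shape, dense_out.shape)
```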

----

In [2]:
# Now let's look at a simple model and then a more complex model that both use residual blocks.
# We will look at a CNN and at a fully connected residual network.

# This is all just standard data loading and prepping for training. We will be working with MNIST like in lab 3.
batch_size = 128
epochs = 10

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Here is where the model begins. I will separate each block with blank lines.
# This part includes the input and the first Dense layers, which are needed
# before you pass the activations on to the residual blocks.
inputs = tf.keras.Input(shape=(784,), name='img')
x = Dense(128, activation='relu')(inputs)
block_1_output = Dense(128, activation='relu')(x)

# This is the first Res Block. It has two layers, so its skip connection is called a 'double skip' connection:
# the network can bypass both layers and move on.
x = Dense(128)(block_1_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
block_2_output = tf.keras.layers.add([x, block_1_output])
# This is the end of the first Res Block. Notice the 'add' layer, which adds the output of the last layer
# in the Res Block (x) to the block's input (block_1_output). Every later block adds its own input back the same way.
# If every block learned to output zeros, each 'block output' would just be block_1_output.

# We will repeat the above as many times as we think our computer can handle. For now let's make just three Residual Blocks.
x = Dense(128)(block_2_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
block_3_output = tf.keras.layers.add([x, block_2_output])

x = Dense(128)(block_3_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
block_4_output = tf.keras.layers.add([x, block_3_output])


# Now we pass the final Residual output through a dense layer to make the output easier to classify.
x = Dense(128, activation='relu')(block_4_output)
x = Dropout(0.5)(x)
outputs = Dense(10, activation='softmax')(x)
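# The rest is just standard Keras code.
model = tf.keras.Model(inputs, outputs, name='resnet')

# The labels are one-hot encoded over 10 classes, so the loss is categorical cross-entropy.
model.compile(optimizer=Adam(amsgrad=True), loss='categorical_crossentropy', metrics=['accuracy'])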

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1cf63f4a9e8>
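
The three residual blocks above are identical, so the same architecture could also be built with a small helper. This is a sketch rather than the code that produced the run above: the names `dense_residual_block`, `build_dense_resnet`, and `sketch_model` are hypothetical, and the layer sizes simply mirror the cell above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_residual_block(block_input, units=128, dropout_rate=0.5):
    """Two Dense -> BatchNorm -> ReLU -> Dropout layers plus a skip connection."""
    x = block_input
    for _ in range(2):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.Dropout(dropout_rate)(x)
    # The addition requires `units` to match the width of `block_input`.
    return layers.add([x, block_input])

def build_dense_resnet(input_dim=784, num_blocks=3, units=128, num_classes=10):
    """Fully connected ResNet with `num_blocks` residual blocks (illustrative)."""
    inputs = tf.keras.Input(shape=(input_dim,), name='img')
    x = layers.Dense(units, activation='relu')(inputs)
    x = layers.Dense(units, activation='relu')(x)
    for _ in range(num_blocks):
        x = dense_residual_block(x, units=units)
    x = layers.Dense(units, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs, name='dense_resnet')

sketch_model = build_dense_resnet()
sketch_model.summary()
```

Changing `num_blocks` is then enough to make the network deeper or shallower without copying any code.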

In [3]:
# Now let's train on MNIST again but with a deep CNN ResNet.
# We will change the dense layers to convolutional layers, but the residual structure is the same.

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape((x_train.shape[0], 28, 28, 1))
x_test = x_test.reshape((x_test.shape[0], 28, 28, 1))

x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

inputs = tf.keras.Input(shape=(28, 28, 1), name='img')
x = Conv2D(32, 3, activation='relu')(inputs)
x = Conv2D(64, 3, activation='relu')(x)
block_1_output = MaxPooling2D(3)(x)

# Residual blocks 2 through 20 are identical, so we build them in a loop.
block_output = block_1_output
for _ in range(19):
    x = Conv2D(64, 3, activation='relu', padding='same')(block_output)
    x = BatchNormalization()(x)
    x = Conv2D(64, 3, activation='relu', padding='same')(x)
    block_output = tf.keras.layers.add([x, block_output])
block_20_output = block_output

x = Conv2D(64, 3, activation='relu')(block_20_output)
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs, name='resnet')


model.compile(optimizer=Adam(amsgrad=True), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=256,
          epochs=10,
          validation_split=0.2)

Train on 48000 samples, validate on 12000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1cdada87898>

As we can see, the CNN ResNet does better on image data than the fully connected one, and the deeper ResNet does better than the shallow one. Let's now test this on a dataset that is harder to classify, CIFAR-10. We will try both a fully connected ResNet and a CNN ResNet and see how they compare with each other and with the MNIST results.

---

In [4]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

x_train = x_train.reshape(50000, 3072)
x_test = x_test.reshape(10000, 3072)

x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

batch_size = 128
epochs = 10

inputs = tf.keras.Input(shape=(3072,), name='img')
x = Dense(128, activation='relu')(inputs)
block_1_output = Dense(128, activation='relu')(x)

x = Dense(128)(block_1_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
block_2_output = tf.keras.layers.add([x, block_1_output])

x = Dense(128)(block_2_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
block_3_output = tf.keras.layers.add([x, block_2_output])

x = Dense(128)(block_3_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
block_4_output = tf.keras.layers.add([x, block_3_output])

x = Dense(128, activation='relu')(block_4_output)
x = Dropout(0.5)(x)
outputs = Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs, name='resnet')

model.compile(optimizer=Adam(amsgrad=True), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 50000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1cdcb72e630>

We see that this is worse than on MNIST, which is to be expected since CIFAR-10 is much more complex data. Let's see how the CNN ResNet does, and how much of an improvement we get from a deep CNN ResNet over a shallow CNN ResNet.

In [5]:
# First the shallow
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

inputs = tf.keras.Input(shape=(32, 32, 3), name='img')
x = Conv2D(32, 3, activation='relu')(inputs)
x = Conv2D(64, 3, activation='relu')(x)
block_1_output = MaxPooling2D(3)(x)

x = Conv2D(64, 3, padding='same')(block_1_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(64, 3, padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
block_2_output = tf.keras.layers.add([x, block_1_output])

x = Conv2D(64, 3, padding='same')(block_2_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(64, 3, padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
block_3_output = tf.keras.layers.add([x, block_2_output])

x = Conv2D(64, 3, activation='relu')(block_3_output)
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs, name='resnet')

model.compile(optimizer=Adam(amsgrad=True), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=64,
          epochs=10,
          validation_split=0.2)

Train on 40000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1cf83f275c0>

We see that it trains at roughly the same speed as the fully connected ResNet, but with a much more stable accuracy improvement over time and a MUCH higher final accuracy.

In [6]:
# Now the deep.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

inputs = tf.keras.Input(shape=(32, 32, 3), name='img')
x = Conv2D(32, 3, activation='relu')(inputs)
x = Conv2D(64, 3, activation='relu')(x)
block_1_output = MaxPooling2D(3)(x)

# Residual blocks 2 through 20 are identical, so we build them in a loop.
block_output = block_1_output
for _ in range(19):
    x = Conv2D(64, 3, activation='relu', padding='same')(block_output)
    x = BatchNormalization()(x)
    x = Conv2D(64, 3, activation='relu', padding='same')(x)
    block_output = tf.keras.layers.add([x, block_output])
block_20_output = block_output

x = Conv2D(64, 3, activation='relu')(block_20_output)
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs, name='resnet')


model.compile(optimizer=Adam(amsgrad=True), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=256,
          epochs=10,
          validation_split=0.2)

Train on 40000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1cfadbe2780>

So we see that training time roughly doubles, but it is still faster than a standard CNN for this data. We also see a small boost in validation accuracy over the shallow net. This difference should grow as we train for more epochs, because deeper networks generally need more epochs to train than shallow networks. This isn't outright a bad thing, though. Models can stall during training at poor 'local minima' or plateaus of the loss that stop accuracy from improving. Skip connections make a deep network easier to optimize, so deep ResNets tend to reach better minima. They are not guaranteed to find the 'global minimum', but the solutions they reach often give good, stable results on new unseen data.

Now I want to show how a DenseNet-style network differs, leaving out the bottleneck feature that makes real DenseNets so much better. More information on that can be found here: https://www.youtube.com/watch?v=-W6y8xnd--U

---

In [7]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

batch_size = 128
epochs = 10

inputs = tf.keras.Input(shape=(784,), name='img')
x = Dense(128, activation='relu')(inputs)
block_1_output = Dense(128, activation='relu')(x)

x = Dense(128)(block_1_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
block_2_output = tf.keras.layers.add([x, block_1_output])

# Everything before here is the same as in the ResNet. The only change is in the 'add' layer of every block after the first one.
x = Dense(128)(block_2_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
# Here we reuse 'block_1_output' in this block and in every one afterward, so the model can use these unchanged activations the entire time.
block_3_output = tf.keras.layers.add([x, block_1_output, block_2_output])

x = Dense(128)(block_3_output)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dropout(0.5)(x)
# We do that again here with the block 1, block 2, and block 3 outputs.
# This gives backprop a direct path to every earlier block, no matter how many blocks come after it.
# Each block receives its gradient signal directly, which can speed up training.
block_4_output = tf.keras.layers.add([x, block_1_output, block_2_output, block_3_output])


x = Dense(128, activation='relu')(block_4_output)
x = Dropout(0.5)(x)
outputs = Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs, name='resnet')

model.compile(optimizer=Adam(amsgrad=True), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1cf8b333b70>

We see that simply changing the way the blocks are added gives the model the same performance as the shallow fully connected ResNet, but it trains in less time. If we were to implement the bottleneck layer, we should get better accuracy and still train faster than standard ResNets.
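
For reference, here is a rough sketch of what a bottleneck layer could look like for a convolutional DenseNet. It follows the ordering described in the DenseNet paper (BatchNorm, ReLU, 1x1 convolution with four times the growth rate, then BatchNorm, ReLU, 3x3 convolution) and concatenates the new features onto the input. The names `bottleneck_dense_layer`, `sketch_inputs`, and `dense_block`, as well as the filter counts, are assumptions chosen for illustration, and the block has not been trained or tuned here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_dense_layer(x, growth_rate=12):
    """BN -> ReLU -> 1x1 Conv -> BN -> ReLU -> 3x3 Conv, concatenated onto the input."""
    y = layers.BatchNormalization()(x)
    y = layers.Activation('relu')(y)
    # The 1x1 convolution is the bottleneck: it limits the channels the 3x3 conv sees.
    y = layers.Conv2D(4 * growth_rate, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(growth_rate, 3, padding='same', use_bias=False)(y)
    # Concatenation (not addition) keeps every earlier feature map available.
    return layers.Concatenate()([x, y])

sketch_inputs = tf.keras.Input(shape=(32, 32, 3))
features = layers.Conv2D(24, 3, padding='same')(sketch_inputs)
for _ in range(4):
    features = bottleneck_dense_layer(features, growth_rate=12)
dense_block = tf.keras.Model(sketch_inputs, features, name='dense_block_sketch')

print(dense_block.output_shape)
```

Each layer adds `growth_rate` channels, so four layers on top of 24 input channels give 72 output channels.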

## Papers

ResNets:
https://arxiv.org/pdf/1512.03385.pdf

Highway Nets:
https://arxiv.org/pdf/1505.00387.pdf

Variations:
https://arxiv.org/pdf/1611.05431.pdf, https://arxiv.org/pdf/1603.09382.pdf

DenseNets:
https://arxiv.org/pdf/1608.06993v3.pdf

## Code

All of these and more come prebuilt with TensorFlow and can be used without coding them from scratch if you don't need a custom version of these models!
https://www.tensorflow.org/api_docs/python/tf/keras/applications
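
As a small sketch of how a prebuilt model could be used, the snippet below builds `tf.keras.applications.ResNet50` with random weights for 32x32 RGB inputs and 10 classes, which matches the shape of the CIFAR-10 data used earlier. The variable name `prebuilt` and the choice of optimizer are illustrative assumptions, and nothing is trained here.

```python
import tensorflow as tf

# weights=None builds the architecture with random initialization (no download).
prebuilt = tf.keras.applications.ResNet50(
    weights=None,
    input_shape=(32, 32, 3),
    classes=10,
)
prebuilt.compile(optimizer='adam',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])

print(prebuilt.output_shape)
```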

ResNets:
https://www.tensorflow.org/guide/keras/functional#a_toy_resnet_model,
https://towardsdatascience.com/implementing-a-resnet-model-from-scratch-971be7193718,
https://datascience-enthusiast.com/DL/Residual_Networks_v2.html


DenseNets:
https://github.com/titu1994/DenseNet/blob/master/densenet.py,
https://github.com/flyyufelix/DenseNet-Keras,
https://github.com/YixuanLi/densenet-tensorflow/blob/master/cifar10-densenet.py