# 1.0 Introduction

Milestone models provide discrete architectural elements that you can reuse when designing your own convolutional neural networks. Specifically, models that have achieved state-of-the-art results on tasks like image classification repeat a small set of building blocks many times, such as the VGG block in the [VGG](https://arxiv.org/abs/1409.1556) models, the inception module in [GoogLeNet](https://arxiv.org/abs/1409.4842), and the residual module in [ResNet](https://arxiv.org/abs/1512.03385). Once you can implement parameterized versions of these architectural elements, you can use them to design your own models for computer vision and other applications. In this lesson, you will discover how to implement the critical architectural elements from milestone convolutional neural network models from scratch. After completing this lesson, you will know:

- How to implement a VGG module used in the VGG-16 and VGG-19 convolutional neural network models.
- How to implement the naive and optimized inception module used in the GoogLeNet model.

# 2.0 VGGNet

In our previous lesson, we discussed LeNet and AlexNet, two seminal Convolutional Neural Networks in the deep learning and computer vision literature. Simonyan and Zisserman first introduced VGGNet (sometimes referred to as simply VGG) in their 2015 paper, [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556).

> The primary contribution of their work was demonstrating that architecture with tiny (3x3) filters can be trained to increasingly higher depths (16-19 layers) and obtain state-of-the-art classification on the challenging ImageNet classification challenge.

This network is **characterized by its simplicity**, using only 3×3 convolutional layers stacked on top of each other in increasing depth. **Reducing the spatial dimensions of volumes is accomplished through the usage of max pooling**. Two fully connected layers, each with 4,096 nodes (and dropout in between), are followed by a softmax classifier.

VGG is often used today for <font color="red">transfer learning</font> (we will describe this technique later in this course) **as the network demonstrates an above-average ability to generalize to datasets it was not trained on** (compared to other network types such as GoogLeNet and ResNet). More often than not, if you are reading a publication or lab journal that applies transfer learning, it likely uses VGG as the base model.

Unfortunately, training VGG from scratch is a pain, to say the least. The network is brutally slow to train, and the network architecture weights themselves are quite large (over 500MB). Due to the depth of the network along with the fully-connected layers, the backpropagation phase is excruciatingly slow.
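To see where a figure like "over 500MB" comes from, the sketch below counts the weights and biases of a VGG16-style network. It assumes configuration D from the paper, a 224x224x3 input, 1,000 output classes, and 32-bit floats; the helper name `count_vgg_params` is ours, not part of any library.

```python
# Configuration D (VGG16): integers are 3x3 CONV output channels,
# "M" marks a 2x2 max-pooling layer.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def count_vgg_params(config, input_size=224, in_channels=3,
                     fc_units=(4096, 4096), classes=1000):
    """Count weights and biases of a VGG-style network (illustrative helper)."""
    conv_params, size, channels = 0, input_size, in_channels
    for item in config:
        if item == "M":
            size //= 2  # 2x2 max pooling with stride 2 halves width and height
        else:
            conv_params += 3 * 3 * channels * item + item  # weights + biases
            channels = item

    fc_params, fan_in = 0, size * size * channels  # flattened feature volume
    for units in list(fc_units) + [classes]:
        fc_params += fan_in * units + units
        fan_in = units
    return conv_params, fc_params


conv_params, fc_params = count_vgg_params(VGG16_CONFIG)
total = conv_params + fc_params
print(f"CONV parameters: {conv_params:,}")
print(f"FC parameters:   {fc_params:,}")
print(f"Total:           {total:,} (~{total * 4 / 1e6:.0f} MB as float32)")
```

Under these assumptions, the fully connected layers hold the large majority of the parameters, which is why they dominate the size of the saved weights.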

> According to practitioners such as [Adrian Rosebrock](https://www.pyimagesearch.com/), training VGG on eight GPUs took $\approx$ 10 days – with fewer than four GPUs, training VGG from scratch will likely take prohibitively long (unless you can be very patient).

That said, it’s important as a deep learning practitioner to understand the history of deep learning, especially the concept of <font color="red">pre-training</font> and how **we later learned to avoid this expensive operation by optimizing our initialization weight functions**.

## 2.1 Implementing VGGNet

When implementing VGG, Simonyan and Zisserman tried variants of VGG that increased in depth. The figure below was extracted from their publication and highlights their experiments. In particular, we are most interested in configurations A, B, D, and E. Looking at these architectures, you will notice two patterns:

1. The first is that the **network uses only 3×3 filters**. 
2. The second is that as the depth of the network increases, the number of filters learned increases as well – to be exact, **the number of filters doubles each time max pooling is applied** to reduce volume size.

> The notion of doubling the number of filters each time you decrease spatial dimensions is of historical importance in the deep learning literature and even a pattern you will see today.

<img width="600" src="https://drive.google.com/uc?export=view&id=19cSs4edO6rQ4ZodRzgU54Inv8P3m566K"/>

We perform this doubling of filters to ensure no single layer block is more biased than the others. Layers earlier in the network architecture have fewer filters, but their spatial volumes are also much larger, implying there is **more (spatial) data** to learn from.

However, we know that applying a max-pooling operation will reduce our spatial input volumes. 

> If we reduce the spatial volumes without increasing the number of filters, our layers become unbalanced and potentially biased, implying that layers earlier in the network may influence our output classification more than layers deeper. 

To combat this imbalance, we keep in mind the ratio of volume size to the number of filters. If we reduce the input volume size by 50-75%, we double the number of filters in the next set of CONV layers to maintain the balance.
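The balancing rule can be written down in a few lines. The sketch below assumes a 224x224 input, 64 filters in the first stage, five stages, and a cap of 512 filters (the values VGG uses); the function name `vgg_stage_plan` is hypothetical.

```python
def vgg_stage_plan(input_size=224, base_filters=64, stages=5, max_filters=512):
    """Return (spatial_size, n_filters) for the CONV layers of each stage."""
    plan, size, filters = [], input_size, base_filters
    for _ in range(stages):
        plan.append((size, filters))
        size //= 2                               # 2x2 max pooling, stride 2
        filters = min(filters * 2, max_filters)  # double the filters, up to a cap
    return plan


for size, filters in vgg_stage_plan():
    print(f"{size:>3} x {size:<3} volume -> {filters} filters per CONV layer")
```

Halving both the width and the height removes 75% of the spatial positions, so each doubling of the filter count only partly offsets the loss in volume size.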

The issue with training such deep architectures is that Simonyan and Zisserman found training VGG16 and VGG19 to be extremely challenging due to their depth. **If these architectures were randomly initialized and trained from scratch, they would often struggle to learn and gain any initial traction – the networks were too deep for basic random initialization**. Therefore, to train deeper variants of VGG, Simonyan and Zisserman came up with a clever concept called <font color="red">pre-training</font>.

> Pre-training is the practice of training smaller versions of your network architecture with fewer weight layers first and then using these converged network weights as the initializations for larger,
deeper networks.

In the case of VGG, the authors first trained configuration A, VGG11. VGG11 was able to converge to a reasonably low loss, but not to state-of-the-art accuracy.

The weights from VGG11 were then used as initializations to configuration B, VGG13. The conv3-64 and conv3-128 layers (highlighted in bold in the figure above) in VGG13 were randomly initialized while the remainder of the layers were copied over from the pre-trained VGG11 network. Using these initializations, Simonyan and Zisserman were able to train VGG13 successfully, but still did not obtain state-of-the-art accuracy.

This pre-training pattern continued to configuration D, which we commonly know as VGG16. This time three new layers were randomly initialized while the other layers were copied over from VGG13. The network was then trained using these **warmed-up, pre-trained** layers, thereby allowing the randomly initialized layers to converge and learn discriminating patterns. Ultimately, VGG16 was able to perform very well on the ImageNet classification challenge.
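The copy-then-extend step can be illustrated without any deep learning framework. The sketch below is only a toy model of the idea: weights are stored in plain dictionaries, the layer names are hypothetical, and the standard deviation used for the new layers is an arbitrary illustrative value.

```python
import numpy as np


def init_from_pretrained(layer_shapes, pretrained, rng, std=0.01):
    """Build initial weights for a deeper network (illustrative helper).

    layer_shapes: dict of layer name -> weight shape for the deeper network.
    pretrained:   dict of layer name -> converged weights of the shallower one.
    """
    weights, copied = {}, []
    for name, shape in layer_shapes.items():
        if name in pretrained and pretrained[name].shape == tuple(shape):
            weights[name] = pretrained[name].copy()  # reuse converged weights
            copied.append(name)
        else:
            # layer that only exists in the deeper network: random initialization
            weights[name] = rng.normal(0.0, std, size=shape)
    return weights, copied


rng = np.random.default_rng(0)

# Hypothetical shallow network that has already been trained.
shallow = {
    "conv1_1": rng.normal(size=(3, 3, 3, 64)),
    "conv2_1": rng.normal(size=(3, 3, 64, 128)),
}
# Deeper variant: the same layers plus one new CONV layer per block.
deep_shapes = {
    "conv1_1": (3, 3, 3, 64),
    "conv1_2": (3, 3, 64, 64),
    "conv2_1": (3, 3, 64, 128),
    "conv2_2": (3, 3, 128, 128),
}

deep, copied = init_from_pretrained(deep_shapes, shallow, rng)
print("copied from the shallow network:", copied)
```

In a real framework the same pattern would be applied layer by layer to the model's weight tensors before training the deeper network.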

> As a final experiment, Simonyan and Zisserman once again applied pre-training to configuration E, VGG19. This very deep architecture copied the weights from the pre-trained VGG16 architecture and then added three additional convolutional layers. After training, it was found that VGG19 obtained the highest classification accuracy from their experiments; however, the size of the model (574MB) and the amount of time it took to train and evaluate the network, all for meager gains, made it less appealing to deep learning practitioners.

**If pre-training sounds like a painful, tedious process, that is because it is**. Training smaller variations of your network architecture and then using the converged weights as initializations to your deeper versions of the network is a clever trick; however, it requires training and tuning the
hyperparameters to N separate networks, where N is your final network architecture along with the number of previous (smaller) networks required to obtain the end model. Performing this process is
extremely time-consuming, especially for deeper networks with many fully-connected layers such as VGG.

<font color="red">The good news is that we no longer perform pre-training</font> when training very deep Convolutional Neural Networks – instead, we rely on a good initialization function. Instead of pure random weight initializations, we now use [Xavier/Glorot](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) or MSRA (also known as He et al. initialization).
Through the work of Mishkin and Matas in their 2015 paper, [All you need is a good init](https://arxiv.org/abs/1511.06422), and He et al. in [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852), we found that we can skip the pre-training phase entirely and jump directly to the deeper variations of the network architectures.

> After these papers were published, Simonyan and Zisserman re-evaluated their experiments and found that these **smarter initialization** schemes and activation functions could replicate their previous performance without the usage of tedious pre-training.
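To make these initialization schemes concrete, the sketch below computes the scale each one would assign to a 3x3 CONV layer mapping 64 channels to 64 channels. It assumes the common convention that fan-in and fan-out are the receptive field size times the number of input and output channels; the function names are ours.

```python
import math


def conv_fans(kernel_h, kernel_w, in_channels, out_channels):
    """Fan-in and fan-out of a CONV layer (receptive field x channels)."""
    receptive_field = kernel_h * kernel_w
    return receptive_field * in_channels, receptive_field * out_channels


def glorot_uniform_limit(fan_in, fan_out):
    # Xavier/Glorot uniform: weights are drawn from U(-limit, limit).
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    # He et al. (MSRA): zero-mean normal with this standard deviation.
    return math.sqrt(2.0 / fan_in)


fan_in, fan_out = conv_fans(3, 3, 64, 64)
print(f"fan_in={fan_in}, fan_out={fan_out}")
print(f"Glorot uniform limit: {glorot_uniform_limit(fan_in, fan_out):.4f}")
print(f"He normal std:        {he_normal_std(fan_in):.4f}")
```

Both schemes shrink the initial weights as the number of incoming connections grows, which keeps activations from exploding or vanishing in the first forward passes of a deep network.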

Additionally, **we recommend using batch normalization after the activation functions in the network**. 

> Applying batch normalization was not discussed in the original Simonyan and Zisserman paper, but as other lessons have discussed, batch normalization can stabilize your training and reduce the total number of epochs required to obtain a reasonably performing model.

## 2.2 The VGG Family of Networks

Two key components can characterize the VGG family of Convolutional Neural Networks:

1. All CONV layers in the network use only 3x3 filters.
2. Stacking multiple CONV => RELU layer sets (where the number of consecutive CONV => RELU layers typically increases the deeper we go) before applying a POOL operation.

In this section, we will discuss a variant of the VGGNet architecture, which we call “MiniVGGNet,” because the network is substantially more shallow than its big brother.

### 2.2.1 The (Mini) VGGNet Architecture

In LeNet-5, we applied a series of CONV => RELU => POOL layers. However, in VGGNet, we stack multiple CONV => RELU layers before applying a single POOL layer. This allows the network to learn richer features from the CONV layers before downsampling the spatial input size via the POOL operation.

Overall, MiniVGGNet consists of two sets of CONV => RELU => CONV => RELU => POOL layers, followed by a set of FC => RELU => FC => SOFTMAX layers. The first two CONV layers will learn 32 filters, each of size 3x3. The second two CONV layers will learn 64 filters, again, each of size 3x3. Our POOL layers will perform max pooling over a 2x2 window with a 2x2 stride. We will also be inserting batch normalization layers after the activations and dropout layers (DO) after the POOL and FC layers.
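The description above can be checked by tracing the output shape of each stage. This sketch assumes "same" padding for the CONV layers and 512 nodes in the first FC layer, as in the implementation in Section 2.3; the helper `trace_minivggnet` is hypothetical.

```python
def trace_minivggnet(height=32, width=32, depth=3, classes=10):
    """Return (stage description, output shape) pairs for MiniVGGNet."""
    shapes = [("INPUT", (height, width, depth))]
    h, w = height, width
    for filters in (32, 64):
        # Two 3x3 CONV layers with "same" padding keep the spatial size.
        shapes.append((f"CONV x2 ({filters} filters)", (h, w, filters)))
        # 2x2 max pooling with a 2x2 stride halves width and height.
        h, w = h // 2, w // 2
        shapes.append(("POOL 2x2", (h, w, filters)))
    shapes.append(("FLATTEN", (h * w * 64,)))
    shapes.append(("FC", (512,)))
    shapes.append(("SOFTMAX", (classes,)))
    return shapes


for name, shape in trace_minivggnet():
    print(f"{name:<22} {shape}")
```

Batch normalization and dropout layers are left out of the trace because they do not change the shape of the volume passing through them.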

The network architecture itself is detailed in the figure below, where the initial input image size is assumed to be 32x32x3 as we will be training MiniVGGNet on [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) later in this section.

<img width="450" src="https://drive.google.com/uc?export=view&id=1hcjInhUrh0AuZXV73mu49RoksoIkzayU"/>


Again, notice how the batch normalization and dropout layers are included in the network architecture based on **Best Practices** described in Lesson 5. Applying batch normalization will help reduce the effects of overfitting and increase our classification accuracy on CIFAR-10.

### 2.2.2 CIFAR-10

Just like MNIST, CIFAR-10 is considered another standard benchmark dataset for image classification
in the computer vision and machine learning literature. CIFAR-10 consists of 60,000
32x32x3 (RGB) images resulting in a feature vector dimensionality of 3072.

As the name suggests, CIFAR-10 consists of 10 classes, including: airplanes, automobiles,
birds, cats, deer, dogs, frogs, horses, ships, and trucks.

While it’s quite easy to train a model that obtains > 97% classification accuracy on MNIST,
it’s substantially harder to obtain such a model for CIFAR-10 (and its bigger brother, CIFAR-100).

The challenge comes from the dramatic variance in how objects appear. For example, we can
no longer assume that an image containing a green pixel at a given (x, y)-coordinate is a frog. This
pixel could be part of the background of a forest that contains a deer. Or, the pixel could simply be
the color of a green truck.

These assumptions are a stark contrast to the MNIST dataset where the network can learn assumptions regarding the spatial distribution of pixel intensities. For example, the spatial distribution of foreground pixels of the number 1 is substantially different than a 0 or 5.

> While being a small dataset, CIFAR-10 is still regularly used to benchmark new CNN architectures.

## 2.3 Implementing MiniVGGNet

In [None]:
# import the necessary packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras import backend as K

class MiniVGGNet:
	@staticmethod
	def build(width, height, depth, classes):
		# initialize the model along with the input shape to be
		# "channels last" and the channels dimension itself
		model = Sequential()
		inputShape = (height, width, depth)
		chanDim = -1

		# if we are using "channels first", update the input shape
		# and channels dimension
		if K.image_data_format() == "channels_first":
			inputShape = (depth, height, width)
			chanDim = 1

		# first CONV => RELU => CONV => RELU => POOL layer set
		model.add(Conv2D(32, (3, 3), padding="same",input_shape=inputShape))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(Conv2D(32, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(2, 2)))
		model.add(Dropout(0.25))

		# second CONV => RELU => CONV => RELU => POOL layer set
		model.add(Conv2D(64, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(Conv2D(64, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(2, 2)))
		model.add(Dropout(0.25))

		# first (and only) set of FC => RELU layers
		model.add(Flatten())
		model.add(Dense(512))
		model.add(Activation("relu"))
		model.add(BatchNormalization())
		model.add(Dropout(0.5))

		# softmax classifier
		model.add(Dense(classes))
		model.add(Activation("softmax"))

		# return the constructed network architecture
		return model

In [None]:
model = MiniVGGNet.build(width=32, height=32, depth=3, classes=10)
model.summary()

## 2.4 MiniVGGNet on CIFAR-10

In [None]:
%%capture
!pip install wandb

In [None]:
!wandb login

In [None]:
import matplotlib

# import the necessary packages
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import cifar10
import matplotlib.pyplot as plt
import numpy as np
import wandb
from wandb.keras import WandbCallback

# load the training and testing data, then scale it into the
# range [0, 1]
print("[INFO] loading CIFAR-10 data...")
(train_x, train_y), (test_x, test_y) = cifar10.load_data()
train_x = train_x.astype("float") / 255.0
test_x = test_x.astype("float") / 255.0

# convert the labels from integers to vectors
lb = LabelBinarizer()
train_y = lb.fit_transform(train_y)
test_y = lb.transform(test_y)

# initialize the label names for the CIFAR-10 dataset
labelNames = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]

In [None]:
# Set an experiment name to group training and evaluation
experiment_name = wandb.util.generate_id()

# setup wandb
wandb.init(project="lesson07_VGG", 
           group=experiment_name,
           config={
               "epoch": 40,
               "batch_size": 64,
           })
config = wandb.config

In [None]:
%%wandb

# initialize the optimizer and model
print("[INFO] compiling model...")

# A decay parameter is used to slowly reduce the learning rate over time.
# Decaying the learning rate is helpful in reducing overfitting and obtaining
# higher classification accuracy: the smaller the learning rate is, the smaller
# the weight updates will be. A common setting for decay is to divide the
# initial learning rate by the total number of epochs. In this case, we'll be
# training our network for a total of 40 epochs with an initial learning rate
# of 0.01, therefore decay = 0.01 / 40.
# Note: the `decay` argument is only accepted by the older (legacy) Keras
# optimizers; recent releases expect a learning rate schedule instead.

opt = SGD(learning_rate=0.01, decay=0.01/40, momentum=0.9, nesterov=True)
model = MiniVGGNet.build(width=32, height=32, depth=3, classes=10)
model.compile(loss="categorical_crossentropy", optimizer=opt,metrics=["accuracy"])

# train the network
print("[INFO] training network...")
H = model.fit(train_x, train_y, validation_data=(test_x, test_y),
              batch_size=config.batch_size, 
              epochs=config.epoch,
              verbose=1,
              callbacks=[WandbCallback()])
wandb.finish()
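To get a feel for what this decay setting does, the sketch below evaluates the time-based decay formula at a few points during training. It assumes the behavior of the legacy Keras optimizers, where the learning rate is divided by `1 + decay * iterations` and `iterations` counts weight updates (batches), not epochs; the function `decayed_lr` is ours.

```python
import math


def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay: the rate shrinks with the number of updates so far."""
    return initial_lr / (1.0 + decay * iteration)


initial_lr, epochs, batch_size = 0.01, 40, 64
decay = initial_lr / epochs
# CIFAR-10 has 50,000 training images, so one epoch is this many updates.
steps_per_epoch = math.ceil(50_000 / batch_size)

for epoch in (0, 1, 10, 20, 40):
    lr = decayed_lr(initial_lr, decay, epoch * steps_per_epoch)
    print(f"after epoch {epoch:>2}: lr = {lr:.6f}")
```

Because the decay is applied per update, the learning rate falls to roughly one ninth of its initial value by the end of the 40 epochs under these assumptions.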

**Log Analysis**

Next, log an analysis run, using the same experiment name as the group parameter so that this run and the previous run are grouped together in W&B.

In [None]:
%%capture
# Install dependencies
!pip install scikit-plot -qqq

In [None]:
import numpy as np
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import tensorflow as tf
from scikitplot.metrics import plot_confusion_matrix, plot_roc, plot_precision_recall

wandb.init(project="lesson07_VGG", group=experiment_name)

# Class proportions
train_y_labels = [labelNames[i] for i in np.argmax(train_y, axis=1)]
test_y_labels = [labelNames[i] for i in np.argmax(test_y, axis=1)]
wandb.log({'Class Proportions': wandb.sklearn.plot_class_proportions(train_y_labels,
                                                                     test_y_labels,
                                                                     labelNames)},
           commit=False) # Hold on, more incoming!

# Log F1 Score
test_y_pred = np.asarray(model.predict(test_x))
test_y_pred_class = np.argmax(test_y_pred, axis=1)
f1 = f1_score(np.argmax(test_y, axis=1), test_y_pred_class, average='micro')
wandb.log({"f1": f1}, commit=False)

#test_y_labels = [labelNames[i] for i in np.argmax(test_y, axis=1)]
test_y_pred_labels = [labelNames[i] for i in test_y_pred_class]


# Log Confusion Matrix
fig, ax = plt.subplots(figsize=(16, 12))
plot_confusion_matrix(test_y_labels, test_y_pred_labels, ax=ax)
wandb.log({"confusion_matrix": wandb.Image(fig)}, commit=False)

# Log ROC Curve
fig, ax = plt.subplots(figsize=(16, 12))
plot_roc(test_y_labels, test_y_pred, ax=ax)
wandb.log({"plot_roc": wandb.Image(fig)}, commit=False)

# Precision vs Recall
fig, ax = plt.subplots(figsize=(16, 12))
plot_precision_recall(test_y_labels, test_y_pred, ax=ax)
wandb.log({"plot_precision_recall": wandb.Image(fig)}, commit=False)

# Class Scores
class_score_data = []
for test, pred in zip(test_y_labels, test_y_pred):
    class_score_data.append([test, pred])

wandb.log({"class_scores": wandb.Table(data=class_score_data,
                                           columns=["test", "pred"])}, commit=False)

# 
# Visualize Predictions
# 
# visualize 18 images
def show_image(train_image, label, index):
    plt.subplot(3, 6, index+1)
    plt.imshow(tf.squeeze(train_image), cmap=plt.cm.gray)
    plt.title(label)
    plt.grid(False)

# predictions
predictions = model.predict(test_x)
results = np.argmax(predictions, axis = 1)

# visualize the first 18 test results
plt.figure(figsize=(12, 8))
for index in range(18):
    label = results[index]
    image_pixels = test_x[index,:,:,:]
    show_image(image_pixels, labelNames[label], index)
plt.tight_layout()

wandb.log({"Predictions": plt}, commit=True)

wandb.finish()

In [None]:
# evaluate the network
print("[INFO] evaluating network...")
predictions = model.predict(test_x, batch_size=64)
print(classification_report(test_y.argmax(axis=1),
                            predictions.argmax(axis=1), target_names=labelNames))

## 2.5 Extensions

When evaluating MiniVGGNet, you can perform some experiments:
1. Disable the GPU and use only the CPU to investigate the benefit of using more powerful hardware.
2. Train the model without batch normalization.
3. Compare the results from 2 against the original run to see how network performance increases when applying batch normalization.
4. Consider using data augmentation in combination with 2 and 3. Tell us what you find.
5. Try to combat the overfitting problem using techniques presented in previous lessons.

# 3.0 GoogLeNet

This section will study the **GoogLeNet architecture**, introduced by Szegedy et al. in their 2014 paper, [Going Deeper With Convolutions](https://arxiv.org/pdf/1409.4842.pdf). This paper is important for two reasons. First, the model architecture is tiny compared to AlexNet and VGGNet (about 28MB for the weights themselves). The authors obtain such a dramatic drop in network architecture size (while still increasing the depth of the overall network) by removing fully connected layers and using global average pooling instead. Most of the weights in a CNN can be found in the dense FC layers – if these layers can be removed, the memory savings are massive.
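A quick count shows how much global average pooling (GAP) saves. The sketch below compares a classifier attached to a flattened volume with one attached to a globally averaged volume; the 7x7x1024 volume and the 1,000 classes are illustrative values, and `dense_params` is our own helper.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


height, width, channels, classes = 7, 7, 1024, 1000  # illustrative final volume

flatten_then_fc = dense_params(height * width * channels, classes)
gap_then_fc = dense_params(channels, classes)  # GAP leaves one value per channel

print(f"Flatten -> FC: {flatten_then_fc:,} parameters")
print(f"GAP -> FC:     {gap_then_fc:,} parameters")
print(f"Reduction:     {flatten_then_fc / gap_then_fc:.1f}x")
```

Averaging each feature map down to a single value removes the spatial factor (here 7 x 7 = 49) from the size of the classifier's weight matrix.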

Secondly, the Szegedy et al. paper makes use of a network-in-network, or micro-architecture, when constructing the overall macro-architecture. Up to this point, we have seen only sequential neural networks where the output of one layer feeds directly into the next. We are now going to see micro-architectures, small building blocks used inside the rest of the architecture, where the output from one layer can split into various paths and be rejoined later.

Specifically, Szegedy et al. contributed the Inception module to the deep learning community, a building block that fits into a Convolutional Neural Network, enabling it to learn CONV layers with **multiple filter sizes**, turning the module into a multi-level feature extractor.

Micro-architectures such as Inception have inspired other significant variants, including the Residual module in [ResNet](https://arxiv.org/abs/1512.03385) and the Fire module in [SqueezeNet](https://arxiv.org/abs/1602.07360). We will be discussing the Inception module (and its variants) later in this section. Once we have examined the Inception module and ensured we know how it works, we will then implement a smaller version of GoogLeNet called "MiniGoogLeNet" – we will train this architecture on the CIFAR-10 dataset and obtain higher accuracy than in our previous section using VGG.

From there, we will move on to the more difficult [cs231n Tiny ImageNet Challenge](http://cs231n.stanford.edu/project.html). This challenge is offered to students enrolled in [Stanford’s cs231n Convolutional Neural Networks for Visual Recognition class](http://cs231n.stanford.edu/) as part of their final project. It is meant to give them a taste of the challenges associated with large-scale deep learning on modern architectures without being as time-consuming or taxing to work with as the entire ImageNet dataset.

By training GoogLeNet from scratch on Tiny ImageNet, we will demonstrate how to obtain a top ranking position on the Tiny ImageNet leaderboard. Furthermore, in our next section, we will utilize ResNet to claim the top position from models trained from scratch. Let us go ahead and get this section started by discussing the Inception module.

## 3.1 The Inception Module (and its Variants)

Modern state-of-the-art Convolutional Neural Networks utilize **micro-architectures**, also called **network-in-network modules**, initially proposed by [Lin et al](https://arxiv.org/abs/1312.4400). I prefer the term micro-architecture because it better describes these modules as building blocks in the context of the overall macro-architecture (i.e., what you build and train).

Micro-architectures are tiny building blocks designed by deep learning practitioners to enable networks to learn (1) faster and (2) more efficiently, all while increasing network depth. 
> These micro-architecture building blocks are stacked, along with conventional layer types such as CONV, POOL, etc., to form the overall macro-architecture.

In 2014, Szegedy et al. introduced the Inception module. The general idea behind the Inception module is two-fold:

1. It can be hard to decide the size of the filter you need to learn at given CONV layers. Should they be 5x5 filters? What about 3x3 filters? Should we learn local features using 1x1 filters? Instead, why not learn them all and let the model decide? Inside the Inception module, we learn all three 5x5, 3x3, and 1x1 filters (computing them in parallel), concatenating the resulting feature maps along the channel dimension. The next layer in
the GoogLeNet architecture (which could be another Inception module) receives these concatenated, mixed filters and performs the same process. Taken as a whole, this process enables GoogLeNet to learn both local features via smaller convolutions and abstracted
features with larger convolutions – we do not have to sacrifice our level of abstraction at the expense of smaller features.
2. By learning multiple filter sizes, we can turn the module into a multi-level feature extractor. The 5x5 filters have a larger receptive field and can learn more abstract features. The 1x1 filters are, by definition, local. The 3x3 filters sit as a balance in between.

### 3.1.1 Inception

Now that we’ve discussed the motivation behind the Inception module, let’s look at the actual
module itself in the figure below (the original Inception module used in GoogLeNet).

<img width="450" src="https://drive.google.com/uc?export=view&id=1ja1dLUSBtZSxBwsbimHUbho0cEvno0Cb"/>

Specifically, take note of how the Inception module branches into four distinct paths from the input layer. The first branch in the Inception module learns a series of 1x1 local features from the input.

The second branch first applies 1x1 convolutions, not only as a form of learning local features but also as dimensionality reduction. Larger convolutions (i.e., 3x3 and 5x5) by definition take more computation to perform. Therefore, if we can reduce the dimensionality of the inputs to these larger filters by applying 1x1 convolutions, we can reduce the computation required by our network. Consequently, the number of filters learned in the 1x1 CONV in the second branch will always be smaller than the number of 3x3 filters learned directly afterward.

The third branch applies the same logic as the second branch, only this time to learn 5x5 filters. We once again reduce dimensionality via 1x1 convolutions, then feed the output into the 5x5 filters.
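The savings from the 1x1 reduction are easy to estimate by counting multiplications. The sketch below uses illustrative sizes (a 28x28x192 input, 16 reduction filters, and 32 filters of size 5x5) and assumes stride 1 with "same" padding; `conv_multiplies` is a hypothetical helper.

```python
def conv_multiplies(height, width, kernel, in_channels, out_channels):
    """Multiplications for a stride-1, "same"-padded CONV layer."""
    return height * width * kernel * kernel * in_channels * out_channels


# Illustrative sizes: 28x28x192 input, 16 reduction filters, 32 5x5 filters.
h, w, c_in, c_reduce, c_out = 28, 28, 192, 16, 32

direct = conv_multiplies(h, w, 5, c_in, c_out)
reduced = (conv_multiplies(h, w, 1, c_in, c_reduce)
           + conv_multiplies(h, w, 5, c_reduce, c_out))

print(f"5x5 applied directly:    {direct:,} multiplications")
print(f"1x1 reduction, then 5x5: {reduced:,} multiplications")
print(f"Savings:                 {direct / reduced:.1f}x")
```

The cheap 1x1 layer pays for itself because the expensive 5x5 filters then only have to span 16 input channels instead of 192.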

The fourth and final branch of the Inception module performs 3x3 max pooling with a stride of 1x1 – this branch is commonly referred to as the pool projection branch. Historically, models that perform pooling have demonstrated an ability to obtain higher accuracy, although we now
know through the work of Springenberg et al. in their 2014 paper, [Striving for Simplicity: The All Convolutional Net](https://arxiv.org/abs/1412.6806) that this is not necessarily true and that POOL layers can be replaced with CONV layers for reducing volume size.

In the case of Szegedy et al., this POOL layer was added simply because POOL layers were thought to be needed for CNNs to perform reasonably. The output of the POOL is then fed into another series of 1x1 convolutions to learn local features.

Finally, all four branches of the Inception module converge and are concatenated together along the channel dimension. Special care is taken during the implementation (via zero padding) to ensure the output of each branch has the same volume size, thereby allowing the outputs to be concatenated. The output of the Inception module is then fed into the next layer in the network. In practice, we often stack multiple Inception modules on top of each other before performing a pooling operation to reduce volume size.
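The concatenation step is simple to verify with NumPy arrays standing in for the branch outputs. The filter counts below are illustrative, and the helper `inception_output_shape` is ours; the sketch only checks shapes and does not perform any convolution.

```python
import numpy as np


def inception_output_shape(input_shape, n1x1, n3x3, n5x5, n_pool_proj):
    """Shape after concatenating the four branches along the channel axis."""
    height, width, _ = input_shape
    # With zero padding and stride 1, every branch keeps height and width;
    # only the channel count differs, so the concatenation is well defined.
    branches = [np.zeros((height, width, n))
                for n in (n1x1, n3x3, n5x5, n_pool_proj)]
    return np.concatenate(branches, axis=-1).shape


# Illustrative filter counts for a 28x28x192 input volume.
print(inception_output_shape((28, 28, 192), 64, 128, 32, 32))
```

The output depth is the sum of the filters in the four branches, regardless of the depth of the input volume.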

### 3.1.2 Miniception

Of course, the original Inception module was designed for GoogLeNet so that it could be trained on the ImageNet dataset (where each input image is assumed to be 224x224x3) and obtain state-of-the-art accuracy. We can simplify the Inception module for smaller datasets (with smaller image spatial dimensions) where fewer network parameters are required.

A tiny version of Inception was developed by Zhang et al. in their 2017
publication, [Understanding Deep Learning Requires Rethinking Generalization](https://arxiv.org/pdf/1611.03530.pdf). The top row of the figure below describes three modules used in their MiniGoogLeNet implementation.

<img width="750" src="https://drive.google.com/uc?export=view&id=1TlTRgCzY_Af-laTm_x29VT-Ud1V5lgpA"/>

- **Left**: A convolution module responsible for performing convolution, batch normalization, and activation.
- **Middle**: The Miniception module, which performs two sets of convolutions, one for 1x1 filters and the other for 3x3 filters, then concatenates the results. No dimensionality reduction is performed before the 3x3 filter as (1) the input volumes will be smaller already (since we’ll be using the CIFAR-10 dataset) and (2) to reduce the number of parameters in
the network.
- **Right**: A downsample module which applies both convolution and max-pooling to reduce dimensionality, then concatenates across the filter dimension.


These building blocks are then used to build the MiniGoogLeNet architecture on the bottom row. You’ll notice here that the authors placed the batch normalization before the activation (presumably because this is what Szegedy et al. did as well), in contrast to what is now recommended when implementing CNNs.


## 3.2 MiniGoogLeNet on CIFAR-10

In this section, we are going to implement the MiniGoogLeNet architecture using the Miniception module. We’ll then train MiniGoogLeNet on the CIFAR-10 dataset. As our results demonstrate, this architecture will obtain > 90% accuracy on CIFAR-10, far better than our previous attempts using MiniVGGNet.

### 3.2.1 Implementing MiniGoogLeNet

In [None]:
# import the necessary packages
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import concatenate
from tensorflow.keras import backend as K

class MiniGoogLeNet:
	@staticmethod
	def conv_module(x, K, kX, kY, stride, chanDim, padding="same"):
		# define a CONV => BN => RELU pattern
		x = Conv2D(K, (kX, kY), strides=stride, padding=padding)(x)
		x = BatchNormalization(axis=chanDim)(x)
		x = Activation("relu")(x)

		# return the block
		return x

	@staticmethod
	def inception_module(x, numK1x1, numK3x3, chanDim):
		# define two CONV modules, then concatenate across the
		# channel dimension
		conv_1x1 = MiniGoogLeNet.conv_module(x, numK1x1, 1, 1,(1, 1), chanDim)
		conv_3x3 = MiniGoogLeNet.conv_module(x, numK3x3, 3, 3,(1, 1), chanDim)
		x = concatenate([conv_1x1, conv_3x3], axis=chanDim)

		# return the block
		return x

	@staticmethod
	def downsample_module(x, K, chanDim):
		# define the CONV module and POOL, then concatenate
		# across the channel dimensions
		conv_3x3 = MiniGoogLeNet.conv_module(x, K, 3, 3, (2, 2),chanDim, padding="valid")
		pool = MaxPooling2D((3, 3), strides=(2, 2))(x)
		x = concatenate([conv_3x3, pool], axis=chanDim)

		# return the block
		return x

	@staticmethod
	def build(width, height, depth, classes):
		# initialize the input shape to be "channels last" and the
		# channels dimension itself
		inputShape = (height, width, depth)
		chanDim = -1

		# if we are using "channels first", update the input shape
		# and channels dimension
		if K.image_data_format() == "channels_first":
			inputShape = (depth, height, width)
			chanDim = 1

		# define the model input and first CONV module
		inputs = Input(shape=inputShape)
		x = MiniGoogLeNet.conv_module(inputs, 96, 3, 3, (1, 1),chanDim)

		# two Inception modules followed by a downsample module
		x = MiniGoogLeNet.inception_module(x, 32, 32, chanDim)
		x = MiniGoogLeNet.inception_module(x, 32, 48, chanDim)
		x = MiniGoogLeNet.downsample_module(x, 80, chanDim)

		# four Inception modules followed by a downsample module
		x = MiniGoogLeNet.inception_module(x, 112, 48, chanDim)
		x = MiniGoogLeNet.inception_module(x, 96, 64, chanDim)
		x = MiniGoogLeNet.inception_module(x, 80, 80, chanDim)
		x = MiniGoogLeNet.inception_module(x, 48, 96, chanDim)
		x = MiniGoogLeNet.downsample_module(x, 96, chanDim)

		# two Inception modules followed by global POOL and dropout
		x = MiniGoogLeNet.inception_module(x, 176, 160, chanDim)
		x = MiniGoogLeNet.inception_module(x, 176, 160, chanDim)
		x = AveragePooling2D((7, 7))(x)
		x = Dropout(0.5)(x)

		# softmax classifier
		x = Flatten()(x)
		x = Dense(classes)(x)
		x = Activation("softmax")(x)

		# create the model
		model = Model(inputs, x, name="googlenet")

		# return the constructed network architecture
		return model
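
To see why the final average pooling uses a 7x7 window, it helps to trace the spatial size of the feature map through the network. The sketch below is only an illustration under the standard Keras size rules for "same" and "valid" padding; the helper `conv_output_size` is a hypothetical name and not part of the lesson's code.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a CONV or POOL layer (Keras conventions)."""
    if padding == "same":
        # "same" padding: output = ceil(input / stride)
        return -(-size // stride)
    # "valid" padding: output = floor((input - kernel) / stride) + 1
    return (size - kernel) // stride + 1

size = 32  # CIFAR-10 images are 32x32

# first conv_module: 3x3, stride 1, "same" padding keeps the size
size = conv_output_size(size, 3, 1, "same")

# each downsample_module uses a 3x3 kernel, stride 2, "valid" padding
# (the Inception modules in between use stride 1 and "same" padding,
# so they leave the spatial size unchanged)
after_first_downsample = conv_output_size(size, 3, 2, "valid")
after_second_downsample = conv_output_size(after_first_downsample, 3, 2, "valid")

print(size, after_first_downsample, after_second_downsample)
```

The last value is the width and height of the volume that reaches `AveragePooling2D((7, 7))`, which is why a 7x7 window reduces it to a single value per channel.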

### 3.2.2 Training and Evaluating MiniGoogLeNet on CIFAR-10

Our training script uses the **LearningRateScheduler class**, which means we’ll define a specific learning rate schedule for our optimizer to follow when training the network. Specifically, we’ll define a polynomial decay learning rate schedule, which follows the equation:

$$
\displaystyle \alpha = \alpha_0 \times (1 - \frac{e}{e_{max}})^p
$$

Where $\alpha_0$ is the **initial learning rate**, $e$ is the **current epoch number**, $e_{max}$ is the **maximum number of epochs** we are going to perform, and $p$ is the **power of the polynomial**. Applying this equation yields the learning rate $\alpha$ for the current epoch.

In [None]:
# define the total number of epochs to train for along with the
# initial learning rate
NUM_EPOCHS = 70
INIT_LR = 5e-3
power_base = 1.0

def poly_decay(epoch):
	# initialize the maximum number of epochs, base learning rate,
	# and power of the polynomial
	maxEpochs = NUM_EPOCHS
	baseLR = INIT_LR
	power = power_base

	# compute the new learning rate based on polynomial decay
	alpha = baseLR * (1 - (epoch / float(maxEpochs))) ** power

	# return the new learning rate
	return alpha


Because the schedule is tied to the **maximum number of epochs**, the learning rate decays toward zero as training approaches the final epoch. Setting the power to 1.0 turns the schedule into a linear decay, which is a common choice and what we are going to do in this example. The figure below shows a number of example polynomial learning rate schedules using a maximum of **70 epochs**, an initial learning rate of $5e-3$, and varying powers. Notice that the larger the power, the faster the learning rate drops.
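
As a quick numeric check of the formula, the sketch below evaluates the schedule at a few epochs for different powers. The function `poly_decay_value` is a hypothetical, parameterized variant written only for this illustration; it takes the same quantities as the equation above as explicit arguments instead of reading global variables.

```python
def poly_decay_value(epoch, max_epochs=70, init_lr=5e-3, power=1.0):
    # alpha = alpha_0 * (1 - e / e_max) ** p
    return init_lr * (1 - epoch / float(max_epochs)) ** power

# print the learning rate at the first, middle, and last epoch
for power in (1.0, 2.0, 3.0):
    rates = [poly_decay_value(e, power=power) for e in (0, 35, 69)]
    print(power, ["{:.6f}".format(r) for r in rates])
```

Halfway through training (epoch 35 of 70), the factor $(1 - e/e_{max})$ equals 0.5, so the learning rate is one half, one quarter, and one eighth of its initial value for powers 1, 2, and 3, respectively.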

In [None]:
result = dict()
for p in [1.0, 1.5, 2.0, 3.0]:
  power_base = p
  result[power_base] = []
  for i in range(70):
    result[power_base].append(poly_decay(i))

In [None]:
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
fig, ax = plt.subplots(1,1, figsize=(8,6))
for i in range(4):
  ax.plot(range(70),list(result.values())[i])

ax.set_title("Examples of Polynomial Learning Rate Decay")
ax.set_ylabel("Learning Rate")
ax.set_xlabel("Epoch")
ax.legend(["p = 1.0", "p = 1.5", "p = 2.0", "p = 3.0"])
plt.show()

In [None]:
%%capture
!pip install wandb

In [None]:
!wandb login

In [None]:
import wandb

# Set an experiment name to group training and evaluation
experiment_name = wandb.util.generate_id()

# setup wandb
wandb.init(project="lesson07_MiniGoogLeNet", 
           group=experiment_name,
           config={
               "epoch": 70,
               "init_lr": 5e-3
           })
config = wandb.config

In [None]:
%%wandb

# set the matplotlib backend so figures can be saved in the background
import matplotlib

# import the necessary packages
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import cifar10
import numpy as np
from wandb.keras import WandbCallback


# define the total number of epochs to train for along with the
# initial learning rate
NUM_EPOCHS = config.epoch
INIT_LR = config.init_lr
power_base = 1.0

# load the training and testing data, converting the images from
# integers to floats
print("[INFO] loading CIFAR-10 data...")
((train_x, train_y), (test_x, test_y)) = cifar10.load_data()
train_x = train_x.astype("float")
test_x = test_x.astype("float")

# apply mean subtraction to the data
mean = np.mean(train_x, axis=0)
train_x -= mean
test_x -= mean

# convert the labels from integers to vectors
lb = LabelBinarizer()
train_y = lb.fit_transform(train_y)
test_y = lb.transform(test_y)

# construct the image generator for data augmentation
aug = ImageDataGenerator(width_shift_range=0.1,
                         height_shift_range=0.1,
                         horizontal_flip=True,
                         fill_mode="nearest")

# construct the set of callbacks
callbacks = [LearningRateScheduler(poly_decay),
             WandbCallback()]

# initialize the optimizer and model
print("[INFO] compiling model...")
opt = SGD(learning_rate=INIT_LR, momentum=0.9)
model = MiniGoogLeNet.build(width=32, height=32, depth=3, classes=10)
model.compile(loss="categorical_crossentropy", optimizer=opt,metrics=["accuracy"])

# train the network
print("[INFO] training network...")
model.fit(aug.flow(train_x, train_y, batch_size=64),
                    validation_data=(test_x, test_y),
                    steps_per_epoch=len(train_x) // 64,
                    epochs=NUM_EPOCHS, 
                    callbacks=callbacks, verbose=1)

wandb.finish()

In [None]:
model.summary()

**Log Analysis**

Next, log an analysis run, using the same experiment name as the group parameter so that this run and the previous run are grouped together in W&B.

In [None]:
%%capture
# Install dependencies
!pip install scikit-plot -qqq

In [None]:
import numpy as np
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import tensorflow as tf
from scikitplot.metrics import plot_confusion_matrix, plot_roc, plot_precision_recall

wandb.init(project="lesson07_MiniGoogLeNet", group=experiment_name)

# initialize the label names for the CIFAR-10 dataset
labelNames = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]

# Class proportions
train_y_labels = [labelNames[i] for i in np.argmax(train_y, axis=1)]
test_y_labels = [labelNames[i] for i in np.argmax(test_y, axis=1)]
wandb.log({'Class Proportions': wandb.sklearn.plot_class_proportions(train_y_labels,
                                                                     test_y_labels,
                                                                     labelNames)},
           commit=False) # Hold on, more incoming!

# Log F1 Score
test_y_pred = np.asarray(model.predict(test_x))
test_y_pred_class = np.argmax(test_y_pred, axis=1)
f1 = f1_score(np.argmax(test_y, axis=1), test_y_pred_class, average='micro')
wandb.log({"f1": f1}, commit=False)

#test_y_labels = [labelNames[i] for i in np.argmax(test_y, axis=1)]
test_y_pred_labels = [labelNames[i] for i in test_y_pred_class]

# Log Confusion Matrix
fig, ax = plt.subplots(figsize=(16, 12))
plot_confusion_matrix(test_y_labels, test_y_pred_labels, ax=ax)
wandb.log({"confusion_matrix": wandb.Image(fig)}, commit=False)

# Log ROC Curve
fig, ax = plt.subplots(figsize=(16, 12))
plot_roc(test_y_labels, test_y_pred, ax=ax)
wandb.log({"plot_roc": wandb.Image(fig)}, commit=False)  # hold the commit; more to log

# Precision vs Recall
fig, ax = plt.subplots(figsize=(16, 12))
plot_precision_recall(test_y_labels, test_y_pred, ax=ax)
wandb.log({"plot_precision_recall": wandb.Image(fig)}, commit=False)  # hold the commit; more to log

# Class Scores
class_score_data = []
for test, pred in zip(test_y_labels, test_y_pred):
    class_score_data.append([test, pred])

wandb.log({"class_scores": wandb.Table(data=class_score_data,
                                           columns=["test", "pred"])}, commit=False)

# 
# Visualize Predictions
# 
# visualize 18 images
def show_image(train_image, label, index):
    plt.subplot(3, 6, index+1)
    plt.imshow(tf.squeeze(train_image), cmap=plt.cm.gray)
    plt.title(label)
    plt.grid(False)

# predictions
predictions = model.predict(test_x)
results = np.argmax(predictions, axis = 1)

# visualize the first 18 test results
plt.figure(figsize=(12, 8))
for index in range(18):
    label = results[index]
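    # add the training mean back (it was subtracted before training) so
    # the pixel values fall in [0, 1] for display
    image_pixels = (test_x[index,:,:,:] + mean)/255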
    show_image(image_pixels, labelNames[label], index)
plt.tight_layout()

wandb.log({"Predictions": plt}, commit=True)

wandb.finish()

In [None]:
from sklearn.metrics import classification_report

# evaluate the network
print("[INFO] evaluating network...")
predictions = model.predict(test_x, batch_size=64)
print(classification_report(test_y.argmax(axis=1),
                            predictions.argmax(axis=1), target_names=labelNames))

## 3.3 The Tiny ImageNet Challenge

The Tiny ImageNet Visual Recognition Challenge (a sample of which can be seen in the figure below) is part of the Stanford course [cs231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/). As part of their final project, students can compete in the classification challenge by either training a CNN from scratch or performing transfer learning via fine-tuning (this topic will be presented later in our course).

<img width="850" src="https://drive.google.com/uc?export=view&id=1IVTouM03vjm30eOb7M6p9pM0yHLHLcpm"/>

The [Tiny ImageNet dataset](https://drive.google.com/file/d/1ZZcGmX3s5bOb9A_El5RAHeJyau_zFgn_/view?usp=sharing) is a subset of the full [ImageNet dataset](https://image-net.org/), consisting of 200 diverse classes, including everything from Egyptian cats to volleyballs to lemons. With 200 classes, guessing at random would be correct 1/200 = 0.5% of the time; therefore, our CNN needs to exceed 0.5% accuracy to demonstrate it has learned underlying discriminative patterns in the respective classes.

> Each class includes 500 training images, 50 validation images, and 50 testing images. 

Ground-truth labels are only provided for the training and validation images. Since we do not have access to the Tiny ImageNet evaluation server, we will hold out part of the training set (50 images per class, leaving 450 per class for training) to form our testing set and evaluate the performance of our classification algorithms.

The images in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) have varying widths and heights. Therefore, whenever we work with ILSVRC, we first need to resize all images in the dataset to a fixed width and height before we can train our network. To help students focus strictly on the deep learning and image classification component (and not get caught up in image processing details), all images in the Tiny ImageNet dataset have been resized to 64x64 pixels and center cropped.

In some ways, having the images resized makes Tiny ImageNet a bit more challenging than its bigger brother, ILSVRC. In ILSVRC, we are free to apply any type of resizing, cropping, etc., operations that we see fit. However, with Tiny ImageNet, much of the image has already been discarded for us. As we’ll find out, obtaining a reasonable rank-1 and rank-5 accuracy on Tiny ImageNet isn’t as easy as one might think, making it a great, insightful dataset for budding deep learning practitioners to learn and practice on.

In the next few sections, you will learn how to obtain the Tiny ImageNet dataset, understand its structure, and create HDF5 files for the training, validation, and testing images.


In [None]:
# --- tiny-imagenet-200
#   |--- test
#       | --- images
#             | --- test_00.JPEG 
#   |--- train
#       | --- nxxyyzzuu 
#             | --- images
#                   | --- nxxyyzzuu_00.JPEG 
#             | --- nxxyyzzuu_boxes.txt
#   |--- val
#       | --- images
#             | --- val_00.JPEG 
#             | --- val_annotations.txt
#   |--- wnids.txt
#   |--- words.txt

# download Tiny ImageNet dataset (tiny-imagenet-200.zip)
!gdown https://drive.google.com/uc?id=1ZZcGmX3s5bOb9A_El5RAHeJyau_zFgn_

In [None]:
!unzip tiny-imagenet-200.zip

Inside the test directory are the testing images – we will be ignoring these images since we do not have access to the cs231n evaluation server (the labels are purposely left out from the download to ensure no one can “cheat” in the challenge).

We then have the train directory, which contains subdirectories with strange names starting with the letter <font color="red">n</font> followed by a series of numbers. These subdirectory names are [WordNet](https://wordnet.princeton.edu/) IDs, each identifying a “synonym set”, or “synset” for short. Each WordNet ID maps to a specific word/object, and every image inside a given WordNet subdirectory is an example of that object.
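
Because the class is encoded in the directory name, the label of a training image can be read straight from its path. The sketch below assumes the `train/<WordNet ID>/images/<file>` layout shown in the directory tree above; the function name and the example path are placeholders chosen for illustration.

```python
from pathlib import PurePosixPath

def wordnet_id_from_path(image_path):
    # layout: .../train/<wnid>/images/<wnid>_<n>.JPEG, so the WordNet ID
    # is the third path component counting from the end
    return PurePosixPath(image_path).parts[-3]

# illustrative path following the layout above
example = "tiny-imagenet-200/train/n01443537/images/n01443537_0.JPEG"
print(wordnet_id_from_path(example))
```

`PurePosixPath` is used here so the example behaves the same on any operating system; splitting the path on the platform's separator is an equivalent approach.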

We can look up the human-readable label for a WordNet ID by parsing the **words.txt** file, which is simply a tab-separated file with the WordNet ID in the first column and the human-readable word/object in the second column. The **wnids.txt** file lists the 200 WordNet IDs (one per line) used in the Tiny ImageNet dataset.
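
A minimal sketch of that lookup is shown below. It works on the text content of the two files so that it can run without the dataset; `build_label_map` is a hypothetical helper, and the rows are made-up stand-ins rather than real entries from **words.txt**.

```python
def build_label_map(wnids_text, words_text):
    # WordNet IDs used by the dataset, one per line
    wanted = set(wnids_text.split())

    # keep only the rows of the tab-separated file whose ID is wanted
    labels = {}
    for line in words_text.strip().split("\n"):
        wnid, _, words = line.partition("\t")
        if wnid in wanted:
            labels[wnid] = words
    return labels

# made-up stand-ins for the contents of wnids.txt and words.txt
wnids_text = "n00000001\nn00000003\n"
words_text = ("n00000001\tfirst object, first synonym\n"
              "n00000002\tsecond object\n"
              "n00000003\tthird object\n")

label_map = build_label_map(wnids_text, words_text)
print(label_map)
```

With the real dataset, the two arguments would be the contents read from the **wnids.txt** and **words.txt** files.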

Finally, the val directory stores our validation set. Inside the val directory, you will find an images subdirectory and a file named **val_annotations.txt**. The **val_annotations.txt** provides the WordNet IDs for every image in the val directory.

Therefore, before we can even get started training GoogLeNet on Tiny ImageNet, we first need to write a script to parse these files and put the images into the HDF5 format. Keep in mind that being a deep learning practitioner is not only about implementing Convolutional Neural Networks and training them from scratch. It also involves using your programming skills to build simple scripts that can parse data.

The more general-purpose programming skills you have, the better deep learning practitioner you can become – while other deep learning researchers are struggling to organize files on disk or understand how a dataset is structured, you’ll have already converted your entire dataset to a format suitable for training a CNN.

In the next section, we’ll teach you how to define your project configuration file and create a single, simple Python script that will convert the Tiny ImageNet dataset into an HDF5 representation.

### 3.3.1 Building the Tiny ImageNet Dataset

Let’s go ahead and define the project configuration for Tiny ImageNet + GoogLeNet:

In [None]:
# import the necessary packages
from os import path

# define the paths to the training and validation directories
TRAIN_IMAGES = "tiny-imagenet-200/train"
VAL_IMAGES = "tiny-imagenet-200/val/images"

# define the path to the file that maps validation filenames to
# their corresponding class labels
VAL_MAPPINGS = "tiny-imagenet-200/val/val_annotations.txt"

# define the paths to the WordNet hierarchy files which are used
# to generate our class labels
WORDNET_IDS = "tiny-imagenet-200/wnids.txt"
WORD_LABELS = "tiny-imagenet-200/words.txt"

# since we do not have access to the testing data we need to
# take a number of images from the training data and use it instead
NUM_CLASSES = 200
NUM_TEST_IMAGES = 50 * NUM_CLASSES

# define the path to the output training, validation, and testing
# HDF5 files
TRAIN_HDF5 = "tiny-imagenet-200/hdf5/train.hdf5"
VAL_HDF5 = "tiny-imagenet-200/hdf5/val.hdf5"
TEST_HDF5 = "tiny-imagenet-200/hdf5/test.hdf5"

# define the path to the dataset mean
DATASET_MEAN = "tiny-imagenet-200/output/tiny-image-net-200-mean.json"

# define the path to the output directory used for storing plots,
# classification reports, etc.
OUTPUT_PATH = "tiny-imagenet-200/output"
MODEL_PATH = path.sep.join([OUTPUT_PATH,"epoch_60.hdf5"])
FIG_PATH = path.sep.join([OUTPUT_PATH,"deepergooglenet_tinyimagenet.png"])
JSON_PATH = path.sep.join([OUTPUT_PATH,"deepergooglenet_tinyimagenet.json"])

In [None]:
!mkdir tiny-imagenet-200/hdf5
!mkdir tiny-imagenet-200/output

As you can see, this configuration file is fairly straightforward. We are mainly just defining paths to input directories of images/label mappings along with output files. However, taking the time to create this configuration file makes our life much easier when actually building Tiny ImageNet and converting it to HDF5.

In [None]:
# import the necessary packages
import h5py
import os

class HDF5DatasetWriter:
	def __init__(self, dims, outputPath, dataKey="images",bufSize=1000):
		# check to see if the output path exists, and if so, raise
		# an exception
		if os.path.exists(outputPath):
			raise ValueError("The supplied `outputPath` already "
				"exists and cannot be overwritten. Manually delete "
				"the file before continuing.", outputPath)

		# open the HDF5 database for writing and create two datasets:
		# one to store the images/features and another to store the
		# class labels
		self.db = h5py.File(outputPath, "w")
		self.data = self.db.create_dataset(dataKey, dims,dtype="float")
		self.labels = self.db.create_dataset("labels", (dims[0],),dtype="int")

		# store the buffer size, then initialize the buffer itself
		# along with the index into the datasets
		self.bufSize = bufSize
		self.buffer = {"data": [], "labels": []}
		self.idx = 0

	def add(self, rows, labels):
		# add the rows and labels to the buffer
		self.buffer["data"].extend(rows)
		self.buffer["labels"].extend(labels)

		# check to see if the buffer needs to be flushed to disk
		if len(self.buffer["data"]) >= self.bufSize:
			self.flush()

	def flush(self):
		# write the buffers to disk then reset the buffer
		i = self.idx + len(self.buffer["data"])
		self.data[self.idx:i] = self.buffer["data"]
		self.labels[self.idx:i] = self.buffer["labels"]
		self.idx = i
		self.buffer = {"data": [], "labels": []}

	def storeClassLabels(self, classLabels):
		# create a dataset to store the actual class label names,
		# then store the class labels
		dt = h5py.special_dtype(vlen=str) # `vlen=unicode` for Py2.7
		labelSet = self.db.create_dataset("label_names",(len(classLabels),), dtype=dt)
		labelSet[:] = classLabels

	def close(self):
		# check to see if there are any other entries in the buffer
		# that need to be flushed to disk
		if len(self.buffer["data"]) > 0:
			self.flush()

		# close the dataset
		self.db.close()

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imutils import paths
import numpy as np
import progressbar
import json
import cv2
import os

# grab the paths to the training images, then extract the training
# class labels and encode them
trainPaths = list(paths.list_images(TRAIN_IMAGES))
trainLabels = [p.split(os.path.sep)[-3] for p in trainPaths]
le = LabelEncoder()
trainLabels = le.fit_transform(trainLabels)

# perform stratified sampling from the training set to construct
# a testing set
split = train_test_split(trainPaths, trainLabels,test_size=NUM_TEST_IMAGES, 
                         stratify=trainLabels,random_state=42)
(trainPaths, testPaths, trainLabels, testLabels) = split

# load the validation filename => class from file and then use these
# mappings to build the validation paths and label lists
M = open(VAL_MAPPINGS).read().strip().split("\n")
M = [r.split("\t")[:2] for r in M]
valPaths = [os.path.sep.join([VAL_IMAGES, m[0]]) for m in M]
valLabels = le.transform([m[1] for m in M])

# construct a list pairing the training, validation, and testing
# image paths along with their corresponding labels and output HDF5
# files
datasets = [
	("train", trainPaths, trainLabels, TRAIN_HDF5),
	("val", valPaths, valLabels, VAL_HDF5),
	("test", testPaths, testLabels, TEST_HDF5)]

# initialize the lists of RGB channel averages
(R, G, B) = ([], [], [])

# loop over the dataset tuples
for (dType, imagePaths, labels, outputPath) in datasets:
	# create HDF5 writer
	print("[INFO] building {}...".format(outputPath))
	writer = HDF5DatasetWriter((len(imagePaths), 64, 64, 3), outputPath)

	# initialize the progress bar
	widgets = ["Building Dataset: ", progressbar.Percentage(), " ",
		progressbar.Bar(), " ", progressbar.ETA()]
	pbar = progressbar.ProgressBar(maxval=len(imagePaths),widgets=widgets).start()

	# loop over the image paths (the loop variables are named so they do
	# not shadow the imported `paths` and `path` modules)
	for (i, (imagePath, label)) in enumerate(zip(imagePaths, labels)):
		# load the image from disk
		image = cv2.imread(imagePath)

		# if we are building the training dataset, then compute the
		# mean of each channel in the image, then update the
		# respective lists
		if dType == "train":
			(b, g, r) = cv2.mean(image)[:3]
			R.append(r)
			G.append(g)
			B.append(b)

		# add the image and label to the HDF5 dataset
		writer.add([image], [label])
		pbar.update(i)

	# close the HDF5 writer
	pbar.finish()
	writer.close()

# construct a dictionary of averages, then serialize the means to a
# JSON file
print("[INFO] serializing means...")
D = {"R": np.mean(R), "G": np.mean(G), "B": np.mean(B)}
f = open(DATASET_MEAN, "w")
f.write(json.dumps(D))
f.close()
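
The JSON file written above is what a mean-subtraction step reads back at training time. The sketch below shows the round trip with made-up channel means and a temporary file; the values and the file location are assumptions for illustration only, not results computed from Tiny ImageNet.

```python
import json
import os
import tempfile

# made-up channel means standing in for the values computed above
means = {"R": 120.0, "G": 110.0, "B": 100.0}

# serialize the means to a temporary JSON file
mean_path = os.path.join(tempfile.mkdtemp(), "mean.json")
with open(mean_path, "w") as f:
    json.dump(means, f)

# read the means back, as a training script would do
with open(mean_path) as f:
    loaded = json.load(f)

print(loaded["R"], loaded["G"], loaded["B"])
```

Using `with` blocks closes the file even if an error occurs while writing or reading.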

In [None]:
# evaluate the generated dataset
import h5py

filenames = [TRAIN_HDF5, VAL_HDF5, TEST_HDF5]
for filename in filenames:
  db = h5py.File(filename, "r")
  print(db["images"].shape)
  db.close()

In [None]:
# copy hdf5 files to your google drive
!cp -r tiny-imagenet-200/hdf5 /content/drive/MyDrive/Atividades/Ensino/Disciplinas/POS-GRADUAÇÃO/Deep\ Learning/Lessons/Lesson\ #07/dataset
!cp -r tiny-imagenet-200/output /content/drive/MyDrive/Atividades/Ensino/Disciplinas/POS-GRADUAÇÃO/Deep\ Learning/Lessons/Lesson\ #07/

<font color="red"> Only execute the cell below if you already have hdf5 files stored in your google drive </font>

In [None]:
# 
# only in case loading data from drive
#
!mkdir tiny-imagenet-200
!cp -r /content/drive/MyDrive/Atividades/Ensino/Disciplinas/POS-GRADUAÇÃO/Deep\ Learning/Lessons/Lesson\ #07/dataset/hdf5 tiny-imagenet-200
!cp -r /content/drive/MyDrive/Atividades/Ensino/Disciplinas/POS-GRADUAÇÃO/Deep\ Learning/Lessons/Lesson\ #07/output tiny-imagenet-200

## 3.4 DeeperGoogLeNet on Tiny ImageNet

Now that we have our HDF5 representation of the **Tiny ImageNet dataset**, we are ready to train GoogLeNet on it – but instead of using **MiniGoogLeNet** as in the previous section, we are going to use a deeper variant which more closely models the Szegedy et al. implementation. This deeper variation will use the original Inception module, which will help you understand the original architecture and implement it on your own in the future. To get started, we’ll first learn how to implement this deeper network architecture. 

We’ll then train DeeperGoogLeNet on the Tiny ImageNet dataset and evaluate the results in terms of **rank-1** and **rank-5** accuracy.
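
As a reminder of what these two metrics measure: rank-1 accuracy counts a prediction as correct only when the top-scoring class is the true label, while rank-5 accuracy counts it as correct when the true label is among the five highest-scoring classes. The sketch below is a plain-Python illustration with made-up probabilities; `rank_k_accuracy` is a hypothetical helper, not a function from the lesson.

```python
def rank_k_accuracy(probabilities, labels, k):
    correct = 0
    for probs, label in zip(probabilities, labels):
        # indices of the k classes with the largest probabilities
        top_k = sorted(range(len(probs)), key=lambda i: probs[i],
                       reverse=True)[:k]
        if label in top_k:
            correct += 1
    return correct / float(len(labels))

# made-up predictions over 6 classes for 3 images
probabilities = [
    [0.05, 0.60, 0.10, 0.10, 0.10, 0.05],  # true class 1 is the top guess
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true class 3 is the 4th guess
    [0.40, 0.30, 0.15, 0.10, 0.04, 0.01],  # true class 5 is the last guess
]
labels = [1, 3, 5]

rank1 = rank_k_accuracy(probabilities, labels, 1)
rank5 = rank_k_accuracy(probabilities, labels, 5)
print(rank1, rank5)
```

In this toy example one of three images is correct at rank-1 and two of three at rank-5, which shows why rank-5 accuracy is always at least as high as rank-1 accuracy.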



## 3.5 Training DeeperGoogLeNet on Tiny ImageNet


The table below (replicated and modified from Szegedy et al.) details our DeeperGoogLeNet architecture. There are only two primary differences between our implementation and the **full GoogLeNet architecture** used by Szegedy et al. when training the network on the complete ImageNet dataset:

1. Instead of using 7x7 filters with a stride of 2x2 in the first CONV layer, we use 5x5 filters with a 1x1 stride. We use these due to the fact that our implementation of GoogLeNet is only able to accept 64x64x3 input images while the original implementation was constructed to accept 224x224x3 images. If we applied 7x7 filters with a 2x2 stride, we would reduce our input dimensions too quickly.
2. Our implementation is slightly shallower with two fewer Inception modules – in the original Szegedy et al. paper, two more Inception modules were added prior to the average pooling operation. This implementation of GoogLeNet will be more than enough for us to perform well on Tiny ImageNet and claim a spot on the cs231n Tiny ImageNet leaderboard. 

| type           | patch size/stride | output size | depth | #1x1 | #3x3 reduce | #3x3 | #5x5 reduce | #5x5 | pool proj |
|----------------|-------------------|-------------|-------|------|-------------|------|-------------|------|-----------|
| convolution    | 5x5/1             | 64, 64, 64  | 1     |      |             |      |             |      |           |
| max pool       | 3x3/2             | 32, 32, 64  | 0     |      |             |      |             |      |           |
| convolution    | 3x3/1             | 32, 32, 192 | 2     |      | 64          | 192  |             |      |           |
| max pool       | 3x3/2             | 16, 16, 192 | 0     |      |             |      |             |      |           |
| inception (3a) |                   | 16, 16, 256 | 2     | 64   | 96          | 128  | 16          | 32   | 32        |
| inception (3b) |                   | 16, 16, 480 | 2     | 128  | 128         | 192  | 32          | 96   | 64        |
| max pool       | 3x3/2             | 8, 8, 480   | 0     |      |             |      |             |      |           |
| inception (4a) |                   | 8, 8, 512   | 2     | 192  | 96          | 208  | 16          | 48   | 64        |
| inception (4b) |                   | 8, 8, 512   | 2     | 160  | 112         | 224  | 24          | 64   | 64        |
| inception (4c) |                   | 8, 8, 512   | 2     | 128  | 128         | 256  | 24          | 64   | 64        |
| inception (4d) |                   | 8, 8, 528   | 2     | 112  | 144         | 288  | 32          | 64   | 64        |
| inception (4e) |                   | 8, 8, 832   | 2     | 256  | 160         | 320  | 32          | 128  | 128       |
| max pool       | 3x3/2             | 4, 4, 832   | 0     |      |             |      |             |      |           |
| avg pool       | 4x4/1             | 1, 1, 832   | 0     |      |             |      |             |      |           |
| dropout (40%)  |                   | 1, 1, 832   | 0     |      |             |      |             |      |           |
| linear         |                   | 1, 1, 200   | 1     |      |             |      |             |      |           |
| softmax        |                   | 1, 1, 200   | 0     |      |             |      |             |      |           |
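
To make the first difference concrete, the sketch below traces the spatial size of a 64x64 input through the first CONV layer and the four max pooling layers in the table, once with a stride of 1 and once with a stride of 2 in the first CONV layer. It assumes "same" padding throughout, as in the table's output sizes; the helper names are hypothetical.

```python
import math

def same_padding_size(size, stride):
    # with "same" padding, the output size is ceil(input / stride)
    return math.ceil(size / stride)

def trace_spatial_size(input_size, first_conv_stride):
    # first CONV layer
    size = same_padding_size(input_size, first_conv_stride)
    # four max pooling layers, each with stride 2; all other layers
    # use stride 1 and keep the spatial size unchanged
    for _ in range(4):
        size = same_padding_size(size, 2)
    return size

ours = trace_spatial_size(64, 1)      # 5x5 filters, 1x1 stride
original = trace_spatial_size(64, 2)  # 7x7 filters, 2x2 stride
print(ours, original)
```

With a 1x1 stride, the volume reaching the average pooling layer is 4x4, matching the table. With a 2x2 stride, it would shrink to 2x2, which is smaller than the 4x4 average pooling window.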


In [None]:
# import the necessary packages
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import concatenate
from tensorflow.keras.regularizers import l2
from tensorflow.keras import backend as K

class DeeperGoogLeNet:
	@staticmethod
	def conv_module(x, K, kX, kY, stride, chanDim,padding="same", reg=0.0005, name=None):
		# initialize the CONV, BN, and RELU layer names
		(convName, bnName, actName) = (None, None, None)

		# if a layer name was supplied, prepend it
		if name is not None:
			convName = name + "_conv"
			bnName = name + "_bn"
			actName = name + "_act"

		# define a CONV => BN => RELU pattern
		x = Conv2D(K, (kX, kY), strides=stride, padding=padding, kernel_regularizer=l2(reg), name=convName)(x)
		x = BatchNormalization(axis=chanDim, name=bnName)(x)
		x = Activation("relu", name=actName)(x)

		# return the block
		return x

	@staticmethod
	def inception_module(x, num1x1, num3x3Reduce, num3x3,num5x5Reduce, num5x5, num1x1Proj, chanDim, stage, reg=0.0005):
		# define the first branch of the Inception module which
		# consists of 1x1 convolutions
		first = DeeperGoogLeNet.conv_module(x, num1x1, 1, 1,
			(1, 1), chanDim, reg=reg, name=stage + "_first")

		# define the second branch of the Inception module which
		# consists of 1x1 and 3x3 convolutions
		second = DeeperGoogLeNet.conv_module(x, num3x3Reduce, 1, 1, (1, 1), chanDim, reg=reg, name=stage + "_second1")
		second = DeeperGoogLeNet.conv_module(second, num3x3, 3, 3,  (1, 1), chanDim, reg=reg, name=stage + "_second2")

		# define the third branch of the Inception module which
		# are our 1x1 and 5x5 convolutions
		third = DeeperGoogLeNet.conv_module(x, num5x5Reduce, 1, 1, (1, 1), chanDim, reg=reg, name=stage + "_third1")
		third = DeeperGoogLeNet.conv_module(third, num5x5, 5, 5, (1, 1), chanDim, reg=reg, name=stage + "_third2")

		# define the fourth branch of the Inception module which
		# is the POOL projection
		fourth = MaxPooling2D((3, 3), strides=(1, 1), padding="same", name=stage + "_pool")(x)
		fourth = DeeperGoogLeNet.conv_module(fourth, num1x1Proj, 1, 1, (1, 1), chanDim, reg=reg, name=stage + "_fourth")

		# concatenate across the channel dimension
		x = concatenate([first, second, third, fourth], axis=chanDim,
			name=stage + "_mixed")

		# return the block
		return x

	@staticmethod
	def build(width, height, depth, classes, reg=0.0005):
		# initialize the input shape to be "channels last" and the
		# channels dimension itself
		inputShape = (height, width, depth)
		chanDim = -1

		# if we are using "channels first", update the input shape
		# and channels dimension
		if K.image_data_format() == "channels_first":
			inputShape = (depth, height, width)
			chanDim = 1

		# define the model input, followed by a sequence of CONV =>
		# POOL => (CONV * 2) => POOL layers
		inputs = Input(shape=inputShape)
		x = DeeperGoogLeNet.conv_module(inputs, 64, 5, 5, (1, 1), chanDim, reg=reg, name="block1")
		x = MaxPooling2D((3, 3), strides=(2, 2), padding="same", name="pool1")(x)
		x = DeeperGoogLeNet.conv_module(x, 64, 1, 1, (1, 1), chanDim, reg=reg, name="block2")
		x = DeeperGoogLeNet.conv_module(x, 192, 3, 3, (1, 1), chanDim, reg=reg, name="block3")
		x = MaxPooling2D((3, 3), strides=(2, 2), padding="same", name="pool2")(x)

		# apply two Inception modules followed by a POOL
		x = DeeperGoogLeNet.inception_module(x, 64, 96, 128, 16, 32, 32, chanDim, "3a", reg=reg)
		x = DeeperGoogLeNet.inception_module(x, 128, 128, 192, 32, 96, 64, chanDim, "3b", reg=reg)
		x = MaxPooling2D((3, 3), strides=(2, 2), padding="same", name="pool3")(x)

		# apply five Inception modules followed by POOL
		x = DeeperGoogLeNet.inception_module(x, 192, 96, 208, 16, 48, 64, chanDim, "4a", reg=reg)
		x = DeeperGoogLeNet.inception_module(x, 160, 112, 224, 24, 64, 64, chanDim, "4b", reg=reg)
		x = DeeperGoogLeNet.inception_module(x, 128, 128, 256, 24, 64, 64, chanDim, "4c", reg=reg)
		x = DeeperGoogLeNet.inception_module(x, 112, 144, 288, 32, 64, 64, chanDim, "4d", reg=reg)
		x = DeeperGoogLeNet.inception_module(x, 256, 160, 320, 32, 128, 128, chanDim, "4e", reg=reg)
		x = MaxPooling2D((3, 3), strides=(2, 2), padding="same", name="pool4")(x)

		# apply a POOL layer (average) followed by dropout
		x = AveragePooling2D((4, 4), name="pool5")(x)
		x = Dropout(0.4, name="do")(x)

		# softmax classifier
		x = Flatten(name="flatten")(x)
		x = Dense(classes, kernel_regularizer=l2(reg), name="labels")(x)
		x = Activation("softmax", name="softmax")(x)

		# create the model
		model = Model(inputs, x, name="googlenet")

		# return the constructed network architecture
		return model

In [None]:
model = DeeperGoogLeNet.build(64,64,3,200)
model.summary()
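
To see why the final average pool uses a 4×4 window, it can help to trace the spatial size through the four max-pooling layers. The sketch below assumes a 64×64 input (as in the `build` call above) and assumes that the convolution and Inception modules preserve spatial size (stride 1 with "same" padding, which is an assumption about `conv_module` and `inception_module`, whose definitions are not shown here). `same_pool_output` is a hypothetical helper for illustration, not part of Keras.

```python
import math

def same_pool_output(size, stride=2):
    # with padding="same", the output size is ceil(input / stride)
    return math.ceil(size / stride)

# trace the height/width of a 64x64 input through pool1..pool4
size = 64
sizes = [size]
for _ in range(4):
    size = same_pool_output(size)
    sizes.append(size)

print(sizes)  # [64, 32, 16, 8, 4]
```

Under these assumptions the volume entering `pool5` is 4×4 spatially, so the 4×4 average pool reduces it to 1×1 before the dropout and softmax layers.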

### 3.5.1 Creating the training pre-processing

#### 3.5.1.1 Image Preprocessors

In [None]:
# import the necessary packages
from tensorflow.keras.preprocessing.image import img_to_array

class ImageToArrayPreprocessor:
	def __init__(self, dataFormat=None):
		# store the image data format
		self.dataFormat = dataFormat

	def preprocess(self, image):
		# apply the Keras utility function that correctly rearranges
		# the dimensions of the image
		return img_to_array(image, data_format=self.dataFormat)

#### 3.5.1.2 Mean preprocessor

In [None]:
# import the necessary packages
import cv2

class MeanPreprocessor:
	def __init__(self, rMean, gMean, bMean):
		# store the Red, Green, and Blue channel averages across a
		# training set
		self.rMean = rMean
		self.gMean = gMean
		self.bMean = bMean

	def preprocess(self, image):
		# split the image into its respective Red, Green, and Blue
		# channels
		(B, G, R) = cv2.split(image.astype("float32"))

		# subtract the means for each channel
		R -= self.rMean
		G -= self.gMean
		B -= self.bMean

		# merge the channels back together and return the image,
		# keeping in mind that OpenCV represents images in BGR order
		return cv2.merge([B, G, R])
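
As a rough illustration of what the mean preprocessor computes, the following NumPy-only sketch performs the same per-channel subtraction with broadcasting instead of `cv2.split` and `cv2.merge`. The function name `subtract_bgr_means`, the tiny image, and the mean values are all made up for the example.

```python
import numpy as np

def subtract_bgr_means(image, r_mean, g_mean, b_mean):
    # OpenCV stores channels as B, G, R along the last axis, so the
    # means are arranged in that order before broadcasting
    means = np.array([b_mean, g_mean, r_mean], dtype="float32")
    return image.astype("float32") - means

# hypothetical 1x2 BGR image and made-up channel means
image = np.array([[[10, 20, 30], [40, 50, 60]]], dtype="uint8")
shifted = subtract_bgr_means(image, r_mean=30.0, g_mean=20.0, b_mean=10.0)
print(shifted)
```

The first pixel equals the means exactly, so it becomes all zeros, while the second pixel ends up 30 above the mean in every channel.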

#### 3.5.1.3 HDF5 dataset generators

In [None]:
# import the necessary packages
from tensorflow.keras.utils import to_categorical
import numpy as np
import h5py

class HDF5DatasetGenerator:
	def __init__(self, dbPath, batchSize, preprocessors=None, aug=None, binarize=True, classes=2):
		# store the batch size, preprocessors, and data augmentor,
		# whether or not the labels should be binarized, along with
		# the total number of classes
		self.batchSize = batchSize
		self.preprocessors = preprocessors
		self.aug = aug
		self.binarize = binarize
		self.classes = classes

		# open the HDF5 database for reading and determine the total
		# number of entries in the database
		self.db = h5py.File(dbPath, "r")
		self.numImages = self.db["labels"].shape[0]

	def generator(self, passes=np.inf):
		# initialize the epoch count
		epochs = 0

		# keep looping infinitely -- the model will stop once we have
		# reached the desired number of epochs
		while epochs < passes:
			# loop over the HDF5 dataset
			for i in np.arange(0, self.numImages, self.batchSize):
				# extract the images and labels from the HDF dataset
				images = self.db["images"][i: i + self.batchSize]
				labels = self.db["labels"][i: i + self.batchSize]

				# check to see if the labels should be binarized
				if self.binarize:
					labels = to_categorical(labels,
						self.classes)

				# check to see if our preprocessors are not None
				if self.preprocessors is not None:
					# initialize the list of processed images
					procImages = []

					# loop over the images
					for image in images:
						# loop over the preprocessors and apply each
						# to the image
						for p in self.preprocessors:
							image = p.preprocess(image)

						# update the list of processed images
						procImages.append(image)

					# update the images array to be the processed
					# images
					images = np.array(procImages)

				# if the data augmentor exists, apply it
				if self.aug is not None:
					(images, labels) = next(self.aug.flow(images,
						labels, batch_size=self.batchSize))

				# yield a tuple of images and labels
				yield (images, labels)

			# increment the total number of epochs
			epochs += 1

	def close(self):
		# close the database
		self.db.close()
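
The batching logic of the generator can be tried out without an HDF5 file. The sketch below is a simplified stand-in that works on in-memory NumPy arrays: `batch_generator` is a hypothetical function, `np.eye` indexing is used here in place of `to_categorical`, and the preprocessing and augmentation steps are left out.

```python
import numpy as np

def batch_generator(images, labels, batch_size, num_classes, passes=np.inf):
    # `passes` counts full sweeps over the data, as in the class above
    epochs = 0
    while epochs < passes:
        for i in range(0, len(labels), batch_size):
            batch_images = images[i: i + batch_size]
            # one-hot encode the integer labels of this batch
            one_hot = np.eye(num_classes, dtype="float32")
            batch_labels = one_hot[labels[i: i + batch_size]]
            yield (batch_images, batch_labels)
        epochs += 1

# hypothetical data: 5 tiny "images" with integer labels from 3 classes
images = np.zeros((5, 4, 4, 3), dtype="float32")
labels = np.array([0, 2, 1, 1, 0])
batches = list(batch_generator(images, labels, batch_size=2,
                               num_classes=3, passes=1))
print([x.shape[0] for (x, _) in batches])  # [2, 2, 1]
```

Note that the last batch of each pass is smaller whenever the number of images is not a multiple of the batch size.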

#### 3.5.1.4 Simple preprocessor

In [None]:
# import the necessary packages
import cv2

class SimplePreprocessor:
	def __init__(self, width, height, inter=cv2.INTER_AREA):
		# store the target image width, height, and interpolation
		# method used when resizing
		self.width = width
		self.height = height
		self.inter = inter

	def preprocess(self, image):
		# resize the image to a fixed size, ignoring the aspect
		# ratio
		return cv2.resize(image, (self.width, self.height),
			interpolation=self.inter)

#### 3.5.1.5 Training monitor

In [None]:
# import the necessary packages
from tensorflow.keras.callbacks import BaseLogger
import matplotlib.pyplot as plt
import numpy as np
import json
import os

class TrainingMonitor(BaseLogger):
	def __init__(self, figPath, jsonPath=None, startAt=0):
		# store the output path for the figure, the path to the JSON
		# serialized file, and the starting epoch
		super(TrainingMonitor, self).__init__()
		self.figPath = figPath
		self.jsonPath = jsonPath
		self.startAt = startAt

	def on_train_begin(self, logs={}):
		# initialize the history dictionary
		self.H = {}

		# if the JSON history path exists, load the training history
		if self.jsonPath is not None:
			if os.path.exists(self.jsonPath):
				self.H = json.loads(open(self.jsonPath).read())

				# check to see if a starting epoch was supplied
				if self.startAt > 0:
					# loop over the entries in the history log and
					# trim any entries that are past the starting
					# epoch
					for k in self.H.keys():
						self.H[k] = self.H[k][:self.startAt]

	def on_epoch_end(self, epoch, logs={}):
		# loop over the logs and update the loss, accuracy, etc.
		# for the entire training process
		for (k, v) in logs.items():
			l = self.H.get(k, [])
			l.append(float(v))
			self.H[k] = l

		# check to see if the training history should be serialized
		# to file
		if self.jsonPath is not None:
			f = open(self.jsonPath, "w")
			f.write(json.dumps(self.H))
			f.close()

		# ensure at least two epochs have passed before plotting
		# (epoch starts at zero)
		if len(self.H["loss"]) > 1:
			# plot the training loss and accuracy
			N = np.arange(0, len(self.H["loss"]))
			plt.style.use("ggplot")
			plt.figure()
			plt.plot(N, self.H["loss"], label="train_loss")
			plt.plot(N, self.H["val_loss"], label="val_loss")
			plt.plot(N, self.H["accuracy"], label="train_acc")
			plt.plot(N, self.H["val_accuracy"], label="val_acc")
			plt.title("Training Loss and Accuracy [Epoch {}]".format(
				len(self.H["loss"])))
			plt.xlabel("Epoch #")
			plt.ylabel("Loss/Accuracy")
			plt.legend()

			# save the figure
			plt.savefig(self.figPath)
			plt.close()
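
The `startAt` trimming in `on_train_begin` is easiest to understand on a small dictionary. This is a minimal sketch with a hypothetical helper `trim_history` and made-up loss values. It mirrors the slicing `self.H[k][:self.startAt]` used above.

```python
def trim_history(history, start_at):
    # keep only the entries recorded before the epoch we resume from
    return {k: v[:start_at] for (k, v) in history.items()}

# made-up history covering four epochs
history = {
    "loss": [2.0, 1.5, 1.2, 1.1],
    "val_loss": [2.1, 1.7, 1.5, 1.6],
}
trimmed = trim_history(history, start_at=2)
print(trimmed)  # {'loss': [2.0, 1.5], 'val_loss': [2.1, 1.7]}
```

Resuming from epoch 2 therefore discards the entries logged for later epochs, so the plot does not show stale values from an abandoned run.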

#### 3.5.1.6 Epoch CheckPoint

In [None]:
# import the necessary packages
from tensorflow.keras.callbacks import Callback
import os

class EpochCheckpoint(Callback):
	def __init__(self, outputPath, every=5, startAt=0):
		# call the parent constructor
		super(EpochCheckpoint, self).__init__()

		# store the base output path for the model, the number of
		# epochs that must pass before the model is serialized to
		# disk and the current epoch value
		self.outputPath = outputPath
		self.every = every
		self.intEpoch = startAt

	def on_epoch_end(self, epoch, logs={}):
		# check to see if the model should be serialized to disk
		if (self.intEpoch + 1) % self.every == 0:
			p = os.path.sep.join([self.outputPath,
				"epoch_{}.hdf5".format(self.intEpoch + 1)])
			self.model.save(p, overwrite=True)

		# increment the internal epoch counter
		self.intEpoch += 1
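
To check which checkpoint files a run would produce, the counter logic of `on_epoch_end` can be replayed in plain Python. `checkpoint_epochs` is a hypothetical helper written only for this illustration. It does not save anything and only lists the epoch numbers that would appear in the `epoch_{}.hdf5` file names.

```python
def checkpoint_epochs(every, start_at, num_epochs_run):
    # replay the internal counter used by EpochCheckpoint
    saved = []
    int_epoch = start_at
    for _ in range(num_epochs_run):
        if (int_epoch + 1) % every == 0:
            saved.append(int_epoch + 1)
        int_epoch += 1
    return saved

# a fresh 40-epoch run, saving every 5 epochs
print(checkpoint_epochs(every=5, start_at=0, num_epochs_run=40))
# [5, 10, 15, 20, 25, 30, 35, 40]

# resuming at epoch 25 and training for 10 more epochs
print(checkpoint_epochs(every=5, start_at=25, num_epochs_run=10))
# [30, 35]
```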

#### 3.5.1.7 Rank accuracy


In [None]:
# import the necessary packages
import numpy as np

def rank5_accuracy(preds, labels):
	# initialize the rank-1 and rank-5 accuracies
	rank1 = 0
	rank5 = 0

	# loop over the predictions and ground-truth labels
	for (p, gt) in zip(preds, labels):
		# sort the probabilities by their index in descending
		# order so that the more confident guesses are at the
		# front of the list
		p = np.argsort(p)[::-1]

		# check if the ground-truth label is in the top-5
		# predictions
		if gt in p[:5]:
			rank5 += 1

		# check to see if the ground-truth is the #1 prediction
		if gt == p[0]:
			rank1 += 1

	# compute the final rank-1 and rank-5 accuracies
	rank1 /= float(len(preds))
	rank5 /= float(len(preds))

	# return a tuple of the rank-1 and rank-5 accuracies
	return (rank1, rank5)
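
A small worked example may make the rank computation more concrete. Since a three-class toy problem cannot show rank-5 meaningfully, the sketch below uses a generalized, hypothetical `rank_k_accuracy` function with the same sorting logic. The probabilities and labels are made up.

```python
import numpy as np

def rank_k_accuracy(preds, labels, k):
    hits = 0
    for (p, gt) in zip(preds, labels):
        # class indices sorted from most to least confident
        top_k = np.argsort(p)[::-1][:k]
        if gt in top_k:
            hits += 1
    return hits / float(len(preds))

preds = np.array([
    [0.1, 0.6, 0.3],  # ranked classes: 1, 2, 0
    [0.5, 0.2, 0.3],  # ranked classes: 0, 2, 1
    [0.2, 0.3, 0.5],  # ranked classes: 2, 1, 0
    [0.7, 0.2, 0.1],  # ranked classes: 0, 1, 2
])
labels = np.array([1, 2, 0, 0])

rank1 = rank_k_accuracy(preds, labels, k=1)
rank2 = rank_k_accuracy(preds, labels, k=2)
print(rank1, rank2)  # 0.5 0.75
```

The second sample is wrong at rank-1 but correct at rank-2, which is why the rank-2 accuracy is higher. Rank-k accuracy can never be lower than rank-1 accuracy.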

#### 3.5.1.8 SimpleDatasetLoader

In [None]:
# import the necessary packages
import numpy as np
import cv2
import os

# helper to load images
class SimpleDatasetLoader:
	def __init__(self, preprocessors=None):
		# store the image preprocessor
		self.preprocessors = preprocessors

		# if the preprocessors are None, initialize them as an
		# empty list
		if self.preprocessors is None:
			self.preprocessors = []

	def load(self, imagePaths, verbose=-1):
		# initialize the list of features and labels
		data = []
		labels = []

		# loop over the input images
		for (i, imagePath) in enumerate(imagePaths):
			# load the image and extract the class label assuming
			# that our path has the following format:
			# /path/to/dataset/{class}/{image}.jpg
			image = cv2.imread(imagePath)
			label = imagePath.split(os.path.sep)[-2]

			# check to see if our preprocessors are not None
			if self.preprocessors is not None:
				# loop over the preprocessors and apply each to
				# the image
				for p in self.preprocessors:
					image = p.preprocess(image)

			# treat our processed image as a "feature vector"
			# by updating the data list followed by the labels
			data.append(image)
			labels.append(label)

			# show an update every `verbose` images
			if verbose > 0 and i > 0 and (i + 1) % verbose == 0:
				print("[INFO] processed {}/{}".format(i + 1,
					len(imagePaths)))

		# return a tuple of the data and labels
		return (np.array(data), np.array(labels))
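
The label extraction in `load` relies on the class name being the directory that directly contains the image. A minimal sketch, using a hypothetical `label_from_path` helper and a made-up path:

```python
import os

def label_from_path(image_path):
    # the second-to-last path component is taken as the class label
    return image_path.split(os.path.sep)[-2]

# made-up path following the /path/to/dataset/{class}/{image}.jpg layout
example = os.path.sep.join(["dataset", "cats", "001.jpg"])
print(label_from_path(example))  # cats
```

If a dataset places an extra directory between the class folder and the image file, the index `-2` would pick up that directory instead, so the index has to match the actual layout.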

### 3.5.2 Creating the Training Script

In [None]:
# import the necessary packages
from os import path

# define the paths to the training and validation directories
TRAIN_IMAGES = "tiny-imagenet-200/train"
VAL_IMAGES = "tiny-imagenet-200/val/images"

# define the path to the file that maps validation filenames to
# their corresponding class labels
VAL_MAPPINGS = "tiny-imagenet-200/val/val_annotations.txt"

# define the paths to the WordNet hierarchy files which are used
# to generate our class labels
WORDNET_IDS = "tiny-imagenet-200/wnids.txt"
WORD_LABELS = "tiny-imagenet-200/words.txt"

# since we do not have access to the testing data we need to
# take a number of images from the training data and use it instead
NUM_CLASSES = 200
NUM_TEST_IMAGES = 50 * NUM_CLASSES

# define the path to the output training, validation, and testing
# HDF5 files
TRAIN_HDF5 = "tiny-imagenet-200/hdf5/train.hdf5"
VAL_HDF5 = "tiny-imagenet-200/hdf5/val.hdf5"
TEST_HDF5 = "tiny-imagenet-200/hdf5/test.hdf5"

# define the path to the dataset mean
DATASET_MEAN = "tiny-imagenet-200/output/tiny-image-net-200-mean.json"

# define the path to the output directory used for storing plots,
# classification reports, etc.
OUTPUT_PATH = "tiny-imagenet-200/output"
MODEL_PATH = path.sep.join([OUTPUT_PATH,"epoch_40.hdf5"])
FIG_PATH = path.sep.join([OUTPUT_PATH,"deepergooglenet_tinyimagenet.png"])
JSON_PATH = path.sep.join([OUTPUT_PATH,"deepergooglenet_tinyimagenet.json"])

In [None]:
# global variables used during training

RESUME_MODEL = "tiny-imagenet-200/output/epoch_40.hdf5"
RESUME = False
START_EPOCH = 0

In [None]:
# set the matplotlib backend so figures can be saved in the background
import matplotlib
matplotlib.use("Agg")

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam,SGD
from tensorflow.keras.models import load_model
import tensorflow.keras.backend as K
import argparse
import json

# construct the training image generator for data augmentation
aug = ImageDataGenerator(rotation_range=18, zoom_range=0.15, 
                         width_shift_range=0.2, height_shift_range=0.2, shear_range=0.15,
                         horizontal_flip=True, fill_mode="nearest")

# load the RGB means for the training set
means = json.loads(open(DATASET_MEAN).read())

# initialize the image preprocessors
sp = SimplePreprocessor(64, 64)
mp = MeanPreprocessor(means["R"], means["G"], means["B"])
iap = ImageToArrayPreprocessor()

# initialize the training and validation dataset generators
trainGen = HDF5DatasetGenerator(TRAIN_HDF5, 64, aug=aug,
                                preprocessors=[sp, mp, iap], classes=NUM_CLASSES)
valGen = HDF5DatasetGenerator(VAL_HDF5, 64,
                              preprocessors=[sp, mp, iap], classes=NUM_CLASSES)

In [None]:
# if there is no specific model checkpoint supplied, then initialize the network and compile the model
if not RESUME:
  print("[INFO] compiling model...")
  model = DeeperGoogLeNet.build(width=64, height=64, depth=3, classes=NUM_CLASSES, reg=0.0002)
  opt = SGD(1e-3,momentum=0.9)
  model.compile(loss="categorical_crossentropy", optimizer=opt,metrics=["accuracy"])
# otherwise, load the checkpoint from disk
else:
  print("[INFO] loading {}...".format(RESUME_MODEL))
  model = load_model(RESUME_MODEL)

  # update the learning rate
  print("[INFO] old learning rate: {}".format(K.get_value(model.optimizer.lr)))
  K.set_value(model.optimizer.lr, 1e-5)
  print("[INFO] new learning rate: {}".format(K.get_value(model.optimizer.lr)))

In [None]:
# construct the set of callbacks
callbacks = [EpochCheckpoint("tiny-imagenet-200/output",every=5,startAt=START_EPOCH),
             TrainingMonitor(FIG_PATH, 
                             jsonPath=JSON_PATH,
                             startAt=START_EPOCH)]

In [None]:
# train the network
model.fit(trainGen.generator(),
          steps_per_epoch=trainGen.numImages // 64,
          validation_data=valGen.generator(),
          validation_steps=valGen.numImages // 64,
          epochs=40,
          max_queue_size=10,
          callbacks=callbacks, verbose=1,initial_epoch=START_EPOCH)

# close the databases
trainGen.close()
valGen.close()

In [None]:
# save a backup of results
!cp tiny-imagenet-200/output/deepergooglenet_tinyimagenet.png /content/drive/MyDrive/Atividades/Ensino/Disciplinas/POS-GRADUAÇÃO/Deep\ Learning/Lessons/Lesson\ #07/output/models_origin/deepergooglenet_tinyimagenet.png 
!cp tiny-imagenet-200/output/epoch_40.hdf5 /content/drive/MyDrive/Atividades/Ensino/Disciplinas/POS-GRADUAÇÃO/Deep\ Learning/Lessons/Lesson\ #07/output/models_origin/epoch_40.hdf5 

### 3.5.3 Creating the Evaluation Script

Once we are satisfied with our model performance on the training and validation set, we can move on to evaluating the network on the testing set.

In [None]:
from tensorflow.keras.models import load_model
import json

# load the RGB means for the training set
means = json.loads(open(DATASET_MEAN).read())

# initialize the image preprocessors
sp = SimplePreprocessor(64, 64)
mp = MeanPreprocessor(means["R"], means["G"], means["B"])
iap = ImageToArrayPreprocessor()

# initialize the testing dataset generator
testGen = HDF5DatasetGenerator(TEST_HDF5, 64, 
                               preprocessors=[sp, mp, iap],
                               classes=NUM_CLASSES)

# load the pre-trained network
print("[INFO] loading model...")
model = load_model(MODEL_PATH)

In [None]:
# make predictions on the testing data
print("[INFO] predicting on test data...")
predictions = model.predict(testGen.generator(),
                            steps=testGen.numImages // 64,
                            max_queue_size=10)

# compute the rank-1 and rank-5 accuracies
(rank1, rank5) = rank5_accuracy(predictions, testGen.db["labels"])
print("[INFO] rank-1: {:.2f}%".format(rank1 * 100))
print("[INFO] rank-5: {:.2f}%".format(rank5 * 100))

# close the database
testGen.close()
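
One detail worth checking is how many test images the prediction step actually covers, because `steps` is computed with integer division. The arithmetic below assumes the test set holds `NUM_TEST_IMAGES = 10000` images and uses the batch size of 64 from the cell above. `prediction_counts` is a hypothetical helper.

```python
import math

def prediction_counts(num_images, batch_size):
    # number of full batches and the images they cover
    full_steps = num_images // batch_size
    covered = full_steps * batch_size
    # number of steps needed to see every image once
    all_steps = math.ceil(num_images / batch_size)
    return full_steps, covered, all_steps

full_steps, covered, all_steps = prediction_counts(10000, 64)
print(full_steps, covered, all_steps)  # 156 9984 157
```

Under that assumption, 156 steps yield 9,984 predictions, so the last 16 images are not scored. Because `rank5_accuracy` zips predictions with labels and divides by the number of predictions, the reported accuracies are still consistent, just computed over slightly fewer images.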

### 3.5.4 DeeperGoogLeNet Experiments

In the following sections we have included the results of three separate experiments we ran when training DeeperGoogLeNet on Tiny ImageNet. After each experiment we evaluated the results and then made an educated decision on how the hyperparameters and network architecture should be updated to increase accuracy.

Case studies like these are especially helpful to you as a budding deep learning practitioner. Not only do they demonstrate that deep learning is an iterative process requiring many experiments, but they also show which parameters you should be paying attention to and how to update them.

Finally, it’s worth noting that some of these experiments required changes to the code. 

#### 3.5.4.1 Experiment 01

Given that this was our first time training a network on the Tiny ImageNet challenge, we weren’t sure what the optimal depth should be for a given architecture on this dataset. While we knew Tiny ImageNet would be a challenging classification task, <font color="red">we didn’t think Inception modules 4a-4e were required</font>, so we **removed** them from our DeeperGoogLeNet implementation above, leading to a substantially shallower network architecture.

1. We decided to train DeeperGoogLeNet using SGD with an initial learning rate of $1e-2$ and a momentum term of $0.9$ (no Nesterov acceleration was applied).
2. [<font color="red">best practice</font>] You should first try SGD to obtain a baseline and then, if need be, use more advanced optimization methods.
3. We then used the learning rate schedule detailed in the table below. This table implies that after epoch 25 we stopped training, lowered the learning rate to $1e-3$, then resumed training for another 10 epochs.
4. After epoch 35 we again stopped training, lowered the learning rate to $1e-4$, and then resumed training for thirty more epochs. Training for an extra thirty epochs is excessive, to say the least; however, we wanted to get a feel for the level of overfitting to expect for a large number of epochs after the original learning rate had been dropped (as this was the first time we had worked with GoogLeNet + Tiny ImageNet).

| Epoch   | Learning Rate |
|---------|---------------|
| 1 - 25  | $1e-02$      |
| 26 - 35 | $1e-03$      |
| 36 - 65 | $1e-04$      |
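
The schedule in the table can also be written as a simple lookup function. This is only a sketch: in the experiment described above, training was stopped and resumed by hand, whereas `experiment01_lr` is a hypothetical helper that returns the learning rate for a given 1-indexed epoch.

```python
def experiment01_lr(epoch):
    # `epoch` is 1-indexed, matching the table above
    if epoch <= 25:
        return 1e-2
    if epoch <= 35:
        return 1e-3
    return 1e-4

for epoch in (1, 25, 26, 35, 36, 65):
    print(epoch, experiment01_lr(epoch))
```

If a function like this were passed to Keras's `LearningRateScheduler` callback, the epoch index it receives would be 0-indexed, so the boundaries would need to be shifted by one.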

#### 3.5.4.2 Experiment 02

In the second experiment with DeeperGoogLeNet + Tiny ImageNet, I decided to switch out the SGD optimizer for Adam. This decision was made strictly because I wasn’t convinced that the network architecture needed to be deeper (yet). The Adam optimizer was used with the default initial learning rate of $1e-3$. I then used the learning rate schedule in the table below.

| Epoch   | Learning Rate |
|---------|---------------|
| 1 - 20  | 1e-3          |
| 21 - 30 | 1e-4          |
| 31 - 40 | 1e-5          |

#### 3.5.4.3 Experiment 03

Assuming that learning had stagnated, we can postulate that the network was not deep enough to model the underlying patterns in the Tiny ImageNet dataset. Therefore, I decided to re-enable Inception modules 4a-4e, creating a much deeper network architecture capable of learning deeper, more discriminative features. The Adam optimizer with an initial learning rate of $1e-3$ was used to train the network. I left the L2 weight decay term at $0.0002$. DeeperGoogLeNet was then trained according to the table below.

| Epoch   | Learning Rate |
|---------|---------------|
| 1 - 40  | 1e-3          |
| 41 - 60 | 1e-4          |
| 61 - 70 | 1e-5          |

#### 3.5.4.4 Tips

For readers interested in trying to boost the accuracy of DeeperGoogLeNet further, I would suggest the following experiments:

1. Change the **conv_module** to use CONV => RELU => BN instead of the original CONV => BN => RELU ordering.
2. Attempt using **ELUs** instead of **ReLUs**.
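
For the second tip, it may help to see how the two activations differ on negative inputs. The NumPy sketch below implements the standard element-wise definitions with `alpha=1.0`. It illustrates the functions themselves and is not a modification of `conv_module`.

```python
import numpy as np

def relu(x):
    # negative inputs are clipped to zero
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    # negative inputs decay smoothly towards -alpha; np.minimum keeps
    # exp() from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))
print(elu(x))
```

Both functions agree for positive inputs. For negative inputs ReLU outputs exactly zero, while ELU returns small negative values. In Keras, the swap itself would amount to changing the activation name used inside `conv_module` from `"relu"` to `"elu"`, assuming the module applies its activation through an `Activation` layer.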