In [None]:
"""
(*) Advanced Architecture Patterns:


1. Residual Connections
2. Batch Normalization
3. Depthwise Separable Convolution
"""

In [None]:
"""
Batch Normalization

-- Normalization is a broad category of methods that seek to make the different samples seen by a machine-learning model more similar
   to each other, which helps the model learn and generalize well to new data. The most common form of data normalization
   centers the data on 0 by subtracting the mean, and gives it a unit standard deviation by dividing by its standard
   deviation. In effect, this assumes that the data follows a normal (or Gaussian) distribution and
   makes sure this distribution is centered and scaled to unit variance.
      
-- In earlier examples, data was normalized before being fed into models. But data normalization should be a concern after every transformation
   operated by the network: even if the data entering a Dense or Conv2D network has a 0 mean and unit variance, there’s no reason to
   expect a priori that this will be the case for the data coming out. Batch normalization is a type of layer (BatchNormalization in Keras)
   that can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining
   an exponential moving average of the batch-wise mean and variance of the data seen during training. The main effect of batch
   normalization is that it helps with gradient propagation, much like residual connections, and thus allows for deeper networks.
   Some very deep networks can only be trained if they include multiple BatchNormalization layers. For instance, BatchNormalization is
   used liberally in many of the advanced convnet architectures that come packaged with Keras, such as ResNet50, Inception V3, and Xception.

-- The BatchNormalization layer takes an axis argument, which specifies the feature axis that should be normalized. This argument
   defaults to -1, the last axis in the input tensor. This is the correct value when using Dense layers, Conv1D layers, RNN layers,
   and Conv2D layers with data_format set to "channels_last". But in the niche use case of Conv2D layers with data_format set to
   "channels_first", the features axis is axis 1; the axis argument in BatchNormalization should accordingly be set to 1.
"""

from keras import layers
from keras.models import Sequential

model = Sequential()

# The BatchNormalization layer is typically used after a convolutional or densely connected layer.
model.add(layers.Conv2D(32, 3, activation='relu'))
model.add(layers.BatchNormalization())

model.add(layers.Dense(32, activation='relu'))
model.add(layers.BatchNormalization())
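
To make the mechanism described above concrete, here is a minimal NumPy sketch of what a batch normalization layer does: normalize with batch statistics during training, keep an exponential moving average of those statistics, and reuse the averages at inference time. The class name `SimpleBatchNorm` is hypothetical, the scale and offset are left untrained, and the `momentum` and `epsilon` values are assumptions chosen for illustration; this is not the Keras implementation.

```python
import numpy as np


class SimpleBatchNorm:
    """Toy batch normalization over the last (feature) axis. Illustrative only."""

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum  # assumed value, controls the moving average
        self.epsilon = epsilon    # assumed value, avoids division by zero
        self.gamma = np.ones(num_features)   # scale (would be learned; fixed here)
        self.beta = np.zeros(num_features)   # offset (would be learned; fixed here)
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Statistics are computed over every axis except the feature axis.
            axes = tuple(range(x.ndim - 1))
            mean = x.mean(axis=axes)
            var = x.var(axis=axes)
            # Exponential moving average of the batch-wise statistics.
            self.moving_mean = (self.momentum * self.moving_mean
                                + (1 - self.momentum) * mean)
            self.moving_var = (self.momentum * self.moving_var
                               + (1 - self.momentum) * var)
        else:
            # At inference time, reuse the statistics accumulated in training.
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.epsilon)
        return self.gamma * x_hat + self.beta


rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=4)
batch = rng.normal(loc=5.0, scale=3.0, size=(256, 4))  # synthetic activations

out = bn(batch, training=True)             # normalized with batch statistics
inference_out = bn(batch, training=False)  # normalized with moving averages
```

After one training step the output has roughly zero mean and unit variance per feature, while the moving averages have only moved slightly toward the batch statistics. Because of this, the inference-mode output is still far from normalized until many batches have been seen.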

In [None]:
"""
Depthwise Separable Convolution


-- The depthwise separable convolution layer (SeparableConv2D in Keras) can be used as a drop-in replacement for Conv2D. It makes
   your model lighter (fewer trainable weight parameters) and faster (fewer floating-point operations), and it can perform a few
   percentage points better on its task.
   
-- This layer performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise
   convolution (a 1 × 1 convolution). This is equivalent to separating the learning of spatial features and the learning of channel-wise
   features, which makes a lot of sense if you assume that spatial locations in the input are highly correlated, but different channels are
   fairly independent. It requires significantly fewer parameters and involves fewer computations, thus resulting in smaller, speedier
   models. And because it’s a more representationally efficient way to perform convolution, it tends to learn better representations using
   less data, resulting in better-performing models.
"""
from keras.models import Sequential, Model
from keras import layers

height = 64
width = 64
channels = 3
num_classes = 10

model = Sequential()
model.add(layers.SeparableConv2D(32, 3,
                                 activation='relu',
                                 input_shape=(height, width, channels)))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.GlobalAveragePooling2D())

model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(num_classes, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
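
The claim that a separable convolution needs far fewer parameters can be checked with simple arithmetic. The sketch below counts weights for a regular convolution and for a depthwise separable one, assuming a square kernel, a depth multiplier of 1, and a single bias per output filter. The helper names `conv2d_params` and `separable_conv2d_params` are hypothetical and are not part of Keras.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights of a regular convolution: one k x k x C_in kernel per output filter, plus biases."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + out_channels


def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Weights of a depthwise separable convolution (assumes depth multiplier 1)."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k kernel per input channel
    pointwise = in_channels * out_channels               # 1 x 1 convolution mixing channels
    return depthwise + pointwise + out_channels          # plus one bias per output filter


# A 3 x 3 layer going from 64 to 128 channels, as in the model above.
regular = conv2d_params(3, 64, 128)
separable = separable_conv2d_params(3, 64, 128)
print(regular, separable, round(regular / separable, 1))
```

Under these assumptions the regular layer has 73,856 parameters and the separable one has 8,896, roughly an eightfold reduction for this layer.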

In [None]:
"""
Hyperparameter Optimization

(*) You need to explore the space of possible decisions automatically, systematically, in a principled way. You need to search
    the architecture space and find the best-performing ones empirically. That’s what the field of automatic hyperparameter optimization
    is about: it’s an entire field of research, and an important one.

(*) The process of optimizing hyperparameters typically looks like this:
    -- Choose a set of hyperparameters (automatically)
    -- Build the corresponding model
    -- Fit it to your training data, and measure the final performance on the validation data
    -- Choose the next set of hyperparameters to try (automatically)
    -- Repeat
    -- Eventually, measure performance on your test data
    
(*) The key to this process is the algorithm that uses this history of validation performance, given various sets of hyperparameters,
    to choose the next set of hyperparameters to evaluate. Many different techniques are possible: Bayesian optimization, genetic algorithms,
    simple random search, and so on.

(*) Training the weights of a model is relatively easy: you compute a loss function on a mini-batch of data and then use the backpropagation
    algorithm to move the weights in the right direction. Updating hyperparameters, on the other hand, is extremely challenging.
    -- Computing the feedback signal (does this set of hyperparameters lead to a high-performing model on this task?) can be extremely
       expensive: it requires creating and training a new model from scratch on your dataset.
    -- The hyperparameter space is typically made of discrete decisions and thus isn’t continuous or differentiable. Hence, you typically
       can’t do gradient descent in hyperparameter space. Instead, you must rely on gradient-free optimization techniques, which naturally
       are far less efficient than gradient descent.

(*) One important issue to keep in mind when doing automatic hyperparameter optimization at scale is validation-set overfitting. Because
    you’re updating hyperparameters based on a signal that is computed using your validation data, you’re effectively training them
    on the validation data, and thus they will quickly overfit to the validation data.

(*) Overall, hyperparameter optimization is a powerful technique that is an absolute requirement to get to state-of-the-art models on any
    task or to win machine-learning competitions. Think about it: once upon a time, people handcrafted the features that went into shallow
    machine-learning models. That was very much suboptimal. Now, deep learning automates the task of hierarchical feature engineering —
    features are learned using a feedback signal, not hand-tuned, and that’s the way it should be. In the same way, you shouldn’t
    handcraft your model architectures; you should optimize them in a principled way.
"""
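
The optimization loop listed above can be sketched with simple random search, the most basic of the techniques mentioned. Everything here is illustrative: `SEARCH_SPACE` is a made-up search space, and `train_and_validate` is a hypothetical stand-in that returns a synthetic score instead of building and training a real model, which is the expensive step in practice.

```python
import math
import random

# Hypothetical discrete search space, chosen only for illustration.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "num_units": [16, 32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
}


def sample_hyperparameters(rng):
    """Choose a set of hyperparameters automatically (here: uniformly at random)."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def train_and_validate(hparams):
    """Stand-in for 'build the model, fit it, measure validation performance'.

    Returns a synthetic score so the sketch runs instantly; a real version
    would train a new model from scratch for every call.
    """
    score = 1.0
    score -= 0.1 * abs(math.log10(hparams["learning_rate"]) + 3)
    score -= abs(hparams["num_units"] - 64) / 640
    score -= 0.2 * abs(hparams["dropout"] - 0.25)
    return score


def random_search(num_trials, seed=0):
    """Repeat sample -> train -> validate, then keep the best trial."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_trials):
        hparams = sample_hyperparameters(rng)
        history.append((train_and_validate(hparams), hparams))
    best_score, best_hparams = max(history, key=lambda item: item[0])
    return best_hparams, best_score, history


best_hparams, best_score, history = random_search(num_trials=20)
print(best_hparams, best_score)
```

Random search ignores the history when picking the next candidate. Bayesian optimization or a genetic algorithm would replace `sample_hyperparameters` with a step that uses `history`. In either case, the selected configuration should be scored once on held-out test data, because the validation score used for selection is optimistically biased.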

In [None]:
"""
Model Ensembling

1. Ensembling relies on the assumption that different good models trained independently are likely to be good for different reasons:
   each model looks at slightly different aspects of the data to make its predictions, getting part of the “truth” but not all of it.

2. The easiest way to pool the predictions of a set of classifiers (to ensemble the classifiers) is to average their predictions
   at inference time. A smarter way to ensemble classifiers is to do a weighted average, where the weights are learned on the validation
   data—typically, the better classifiers are given a higher weight, and the worse classifiers are given a lower weight. There are many
   possible variants: you can do an average of an exponential of the predictions, for instance. In general, a simple weighted average with
   weights optimized on the validation data provides a very strong baseline.

3. The key to making ensembling work is the diversity of the set of classifiers. In machine-learning terms, if all of your models
   are biased in the same way, then your ensemble will retain this same bias. If your
   models are biased in different ways, the biases will cancel each other out, and the ensemble will be more robust and more accurate.
   For this reason, you should ensemble models that are as good as possible while being as different as possible. This typically means using
   very different architectures or even different brands of machine-learning approaches.

4. One thing that is largely not worth doing is ensembling the same network trained several times independently, from different random
   initializations. If the only difference between your models is their random initialization and the order in which they were exposed
   to the training data, then your ensemble will be low-diversity and will provide only a tiny improvement over any single model.

5. In recent times, one style of basic ensemble that has been very successful in practice is the wide and deep category of models, blending
   deep learning with shallow learning. Such models consist of jointly training a deep neural network with a large linear model. The joint
   training of a family of diverse models is yet another option to achieve model ensembling.
"""
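
The weighted average from point 2 can be sketched in a few lines of NumPy. The prediction arrays below are small made-up examples, not real model outputs, and the helper names `weighted_ensemble` and `accuracy` are hypothetical. Setting each weight proportional to validation accuracy is a simple heuristic assumed here for brevity; the weights could instead be tuned with a search over the validation data.

```python
import numpy as np


def weighted_ensemble(predictions, weights):
    """Weighted average of class-probability arrays from several classifiers."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # normalize so the weights sum to 1
    stacked = np.stack(predictions, axis=0)  # (n_models, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)


def accuracy(probs, labels):
    """Fraction of samples whose most probable class matches the label."""
    return float((probs.argmax(axis=1) == labels).mean())


# Toy validation labels and made-up predictions from three classifiers.
val_labels = np.array([0, 1, 1, 0])
preds_a = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])
preds_b = np.array([[0.6, 0.4], [0.6, 0.4], [0.3, 0.7], [0.8, 0.2]])
preds_c = np.array([[0.4, 0.6], [0.3, 0.7], [0.6, 0.4], [0.55, 0.45]])
all_preds = [preds_a, preds_b, preds_c]

# Heuristic: better classifiers on the validation data get a higher weight.
val_weights = [accuracy(p, val_labels) for p in all_preds]
ensemble = weighted_ensemble(all_preds, val_weights)
print(val_weights, accuracy(ensemble, val_labels))
```

Here the weights are fitted and evaluated on the same toy validation set, which is enough to show the mechanics. In practice the weights found on validation data would be applied to the models' test-time predictions.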