# Advanced Deep-Learning Best Practices

Using the Keras functional API, we can build graph-like models, share a layer across different inputs, and use Keras models just like Python functions. Keras callbacks and the TensorBoard browser-based visualization tool let us monitor models during training. We’ll also look at several other best practices including batch normalization, residual connections, hyperparameter optimization, and model ensembling.

### Going beyond the Sequential model: the Keras functional API

Until now, we have implemented Sequential model. The Sequential model makes the assumption that the network has exactly one input and exactly one output, and that it consists of a linear stack of layers.
![capture](https://user-images.githubusercontent.com/13174586/51603886-17859600-1f31-11e9-8cf2-38f472988022.JPG)

This is a commonly verified assumption. But this set of assumptions is too inflexible in a number of cases. Some networks require several independent inputs, others require multiple outputs, and some networks have internal branching between layers that makes them look like graphs of layers rather than linear stacks of layers.

Some tasks, for instance, require multimodal inputs: they merge data coming from different input sources, processing each type of data using different kinds of neural layers. Imagine a deep-learning model trying to predict the most likely market price of
a second-hand piece of clothing, using the following inputs: user-provided metadata (such as the item’s brand, age, and so on), a user-provided text description, and a picture of the item. If we had only the metadata available, we could one-hot encode it
and use a densely connected network to predict the price. If we had only the text description available, we could use an RNN or a 1D convnet. If we had only the picture, we could use a 2D convnet. But how can we use all three at the same time? A
naive approach would be to train three separate models and then do a weighted average of their predictions. But this may be suboptimal, because the information extracted by the models may be redundant. A better way is to jointly learn a more accurate
model of the data by using a model that can see all available input modalities simultaneously: a model with three input branches.
![capture](https://user-images.githubusercontent.com/13174586/51604126-d3df5c00-1f31-11e9-9c8f-0327286eb9e8.JPG)

Similarly, some tasks need to predict multiple target attributes of input data. Given the text of a novel or short story, we might want to automatically classify it by genre (such as romance or thriller) but also predict the approximate date it was written. Of course, we could train two separate models: one for the genre and one for the date. But because these attributes aren’t statistically independent, we could build a better model by learning to jointly predict both genre and date at the same time. Such a joint model would then have two outputs, or heads. Due to correlations between genre and date, knowing the date of a novel would help the model learn rich, accurate representations of the space of novel genres, and vice versa.

![capture](https://user-images.githubusercontent.com/13174586/51604229-26207d00-1f32-11e9-9d10-6a1a9954bfb7.JPG)

### Introduction to The Functional API

In the functional API, we directly manipulate tensors, and we use layers as functions that take tensors and return tensors (hence, the name *functional* API):

In [3]:
from keras import Input, layers

input_tensor= Input(shape=(32,))  #A Tensor

dense= layers.Dense(32, activation='relu') #layer is a function
output_tensor= dense(input_tensor)  #A layer may be called on a tensor, and it returns a tensor

Let’s start with a minimal example that shows side by side a simple `Sequential` model and its equivalent in the functional API:

In [4]:
from keras.models import Sequential, Model
from keras import layers
from keras import Input

In [5]:
seq_model= Sequential() #Sequential model, which we already know about
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(32, activation='softmax'))

In [6]:
input_tensor= Input(shape=(64,))
x =layers.Dense(32, activation='relu')(input_tensor)   #Its functional
x= layers.Dense(32, activation= 'relu')(x)             #equivalent
output_tensor= layers.Dense(10, activation='softmax')(x)

In [8]:
model= Model(input_tensor, output_tensor) #The Model class turns an input tensor and output tensor into a model

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_7 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_8 (Dense)              (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________


The only part that may seem a bit magical at this point is instantiating a `Model` object using only an input tensor and an output tensor. Behind the scenes, Keras retrieves every layer involved in going from `input_tensor` to `output_tensor`, bringing them together into a graph-like data structure—a Model. Of course, the reason it works is that `output_tensor` was obtained by repeatedly transforming `input_tensor`. If we tried to build a model from inputs and outputs that weren’t related, we’d get a 

`RuntimeError:`

`>>> unrelated_input = Input(shape=(32,))`

`>>> bad_model = model = Model(unrelated_input, output_tensor)`

This error tells us, in essence, that Keras couldn’t reach input_1 from the provided output tensor.

When it comes to compiling, training, or evaluating such an instance of Model, the API is the same as that of `Sequential`:

In [9]:
model.compile(optimizer='rmsprop', loss= 'categorical_crossentropy')

In [13]:
import numpy as np
x_train= np.random.random((10000,64))
y_train= np.random.random((10000,10))

In [14]:
model.fit(x_train, y_train, epochs=10, batch_size=256)

score= model.evaluate(x_train, y_train)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
print(score)

11.468967539978028


### Multi-input Models

The functional API can be used to build models that have multiple inputs. Typically, such models at some point merge their different input branches using a layer that can combine several tensors: by adding them, concatenating them, and so on. This is usually done via a Keras merge operation such as `keras.layers.add`, `keras.layers.concatenate`, and so on. Let’s look at a very simple example of a multi-input model: a **question-answering model**.

A typical question-answering model has two inputs: a natural-language question and a text snippet (such as a news article) providing information to be used for answering the question. The model must then produce an answer: in the simplest possible
setup, this is a one-word answer obtained via a softmax over some predefined vocabulary.

![capture](https://user-images.githubusercontent.com/13174586/51657265-22d8d000-1fca-11e9-9557-98efc7de866a.jpeg)

Following is an example of how we can build such a model with the functional API. We set up two independent branches, encoding the text input and the question input as representation vectors; then, concatenate these vectors; and finally, add a softmax
classifier on top of the concatenated representations.

### Functional API Implementation of a Two-Input Question-Answering Model

In [79]:
from keras.models import Model
from keras import layers
from keras import Input

In [80]:
text_vocabulary_size=10000
question_vocabulary_size=10000
answer_vocabulary_size=500

In [81]:
text_input= Input(shape=(None,), dtype='int32', name='text') #The text input is a variable length sequence of integers.
                                                            #We can optionally name the inputs.
print(text_input)

embedded_text= layers.Embedding(64, text_vocabulary_size)(text_input) #Embeds the inputs into a sequence of vectors of size 64

encoded_text= layers.LSTM(32)(embedded_text) #Encodes the vectors in a single vector via an LSTM

Tensor("text:0", shape=(?, ?), dtype=int32)


In [82]:
question_input= Input(shape=(None,), dtype='int32', name='question')

embedded_question= layers.Embedding(64, question_vocabulary_size)(question_input) #Same process (with different layer
                                                                                  #instances) for the question

encoded_question= layers.LSTM(16)(embedded_question)

In [83]:
concatenated= layers.concatenate([encoded_text, encoded_question], axis=-1) #Concatenates the encoded question and encoded text
print(concatenated)

Tensor("concatenate_1/concat:0", shape=(?, 48), dtype=float32)


In [84]:
answer= layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated) #Adds a softmax classifier on top

In [85]:
model= Model([text_input, question_input], answer) #At model instantiation, we specify the two inputs and the output

In [86]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

Now, how do we train this two-input model? There are two possible APIs: we can feed the model a list of Numpy arrays as inputs, or we can feed it a dictionary that maps input names to Numpy arrays. Naturally, the latter option is available only if we give
names to our inputs.

### Feed Data to a Multi-Input Model

In [87]:
import numpy as np

num_samples=1000
max_length= 100

text= np.random.randint(1, text_vocabulary_size, size= (num_samples, max_length)) #Generates dummy Numpy data
text.shape

(1000, 100)

In [88]:
question= np.random.randint(1, question_vocabulary_size, size= (num_samples, max_length)) #Generates dummy Numpy data
question.shape

(1000, 100)

In [89]:
answer= np.random.randint(0, 1, size= (num_samples, answer_vocabulary_size)) #Answers are one-hot encoded, not integers
answer.shape

(1000, 500)

In [90]:
model.fit([text, question], answer, epochs=10, batch_size=128) #Fitting using a list of inputs

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x296463dfcf8>

In [91]:
model.fit({'text':text, 'question':question}, answer, epochs=10, batch_size=128) #Fitting using a dictionary of
                                                                                 #inputs (only if inputs are named)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2964abb4fd0>

### Multi-output models
In the same way, we can use the functional API to build models with multiple outputs (or multiple heads). A simple example is a network that attempts to simultaneously predict different properties of the data, such as a network that takes as input a series
of social media posts from a single anonymous person and tries to predict attributes of that person, such as age, gender, and income level.

### Functional API Implementation of a Three-Output Model

In [1]:
from keras import layers
from keras.models import Model
from keras import Input

Using TensorFlow backend.


In [3]:
vocabulary_size=5000
num_income_groups=10

In [85]:
posts_input= Input(shape=(None,), dtype='int32', name='posts')
embedded_posts= layers.Embedding(256, vocabulary_size)(posts_input)

x= layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x= layers.MaxPooling1D(5)(x)
x= layers.Conv1D(256, 5, activation='relu')(x)
x= layers.Conv1D(256, 5, activation='relu')(x)
x= layers.MaxPooling1D(5)(x)
x= layers.Conv1D(256, 5, activation='relu')(x)
x= layers.Conv1D(256, 5, activation='relu')(x)
x= layers.GlobalMaxPooling1D()(x)
x= layers.Dense(128, activation='relu')(x)

age_prediction= layers.Dense(1, name='age')(x)
income_prediction= layers.Dense(num_income_groups, activation='softmax', name='income')(x)
gender_prediction= layers.Dense(1, activation='sigmoid', name='gender')(x)

In [86]:
model= Model(posts_input, [age_prediction, income_prediction, gender_prediction])

![capture](https://user-images.githubusercontent.com/13174586/51662626-718e6600-1fda-11e9-920b-f910d34c6095.JPG)

Importantly, training such a model requires the ability to specify different loss functions for different heads of the network: for instance, age prediction is a scalar regression task, but gender prediction is a binary classification task, requiring a different training procedure. But because gradient descent requires us to minimize a scalar, we must combine these losses into a single value in order to train the model. The simplest way to combine different losses is to sum them all. In Keras, we can use
either a list or a dictionary of losses in compile to specify different objects for different outputs; the resulting loss values are summed into a global loss, which is minimized during training.

### Compilation Options of a Multi-Output Model: Multiple Losses

In [87]:
model.compile(optimizer='rmsprop', loss=['mse', 'categorical_crossentropy','binary_crossentropy'])

In [88]:
model.compile(optimizer='rmsprop', loss={'age':'mse', 'income': 'categorical_crossentropy',
                                        'gender':'binary_crossentropy'}) #Equivalent (possible only if we give names
                                                                         #to the output layers)

Note that very imbalanced loss contributions will cause the model representations to be optimized preferentially for the task with the largest individual loss, at the expense of the other tasks. To remedy this, we can assign different levels of importance to the loss values in their contribution to the final loss. This is useful in particular if the losses’ values use different scales. For instance, the **mean squared error (MSE)** loss used for the age-regression task typically takes a value around 3–5, whereas the **crossentropy** loss used for the gender-classification task can be as low as 0.1. In such a situation,
to balance the contribution of the different losses, we can assign a weight of 10 to the crossentropy loss and a weight of 0.25 to the MSE loss.

### Compilation Options of a Multi-Output Model: Loss Weighting

In [89]:
model.compile(optimizer='rmsprop', loss=['mse', 'categorical_crossentropy','binary_crossentropy'], 
              loss_weights=[0.25,1.0,10.0])

In [90]:
model.compile(optimizer='rmsprop', loss={'age':'mse', 'income': 'categorical_crossentropy',
                                        'gender':'binary_crossentropy'},
             loss_weights={'age':0.25, 'income': 1.0,
                                        'gender':10.0}) #Equivalent (possible only if we give names
                                                                         #to the output layers)

Much as in the case of multi-input models, we can pass Numpy data to the model for training either via a list of arrays or via a dictionary of arrays.

### Feed Data to a Multi-Output Model

In [91]:
import numpy as np
num_samples=1000
max_length= 100

posts= np.random.randint(1, vocabulary_size, size= (num_samples, 500))
age_targets= np.random.randint(0, 1, size= (num_samples, 1))
income_targets= np.random.randint(0, 11, size= (num_samples, 10))
gender_targets= np.random.randint(0, 1, size= (num_samples, 1))

In [92]:
model.fit(posts, [age_targets, income_targets, gender_targets],
          epochs=10, batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x12ec58c9400>

In [94]:
op1,op2, op3=model.predict(posts)

In [100]:
print("age:", op1, "income:", op2, "gender:", op3)

age: [[-2.14726515e-02]
 [ 4.70847021e+03]
 [-1.51037150e+06]
 [-1.56608922e+05]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.17819828e+05]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-1.74845862e+06]
 [-6.40380438e+05]
 [-2.14726515e-02]
 [-9.90492109e+04]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-6.88065234e+04]
 [-2.14726515e-02]
 [ 2.81642344e+05]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.46825609e+05]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-1.59953906e+05]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.02547875e+05]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.55838906e+04]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-9.00847969e+04]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-2.14726515e-02]
 [-6.31065375e+05]
 [-1.64

### Directed acyclic graphs of layers
With the functional API, not only can we build models with multiple inputs and multiple outputs, but we can also implement networks with a complex internal topology. Neural networks in Keras are allowed to be arbitrary ***directed acyclic graphs*** of layers. The qualifier acyclic is important: these graphs can’t have cycles. It’s impossible for a tensor x to become the input of one of the layers that generated x. The only processing loops that are allowed (that is, recurrent connections) are those internal to recurrent layers. 

Several common neural-network components are implemented as graphs. Two notable ones are Inception modules and residual connections. To better understand how the functional API can be used to build graphs of layers, let’s take a look at how
we can implement both of them in Keras.

#### INCEPTION MODULES
Inception3 is a popular type of network architecture for convolutional neural networks. It consists of a stack of modules
that themselves look like small independent networks, split into several parallel branches. The most basic form of an Inception module has three to four branches starting with a 1 × 1 convolution, followed by a 3 × 3 convolution, and ending with the
concatenation of the resulting features. This setup helps the network separately learn spatial features and channel-wise features, which is more efficient than learning them jointly. More-complex versions of an Inception module are also possible, typically involving pooling operations, different spatial convolution sizes (for example, 5 × 5 instead of 3 × 3 on some branches), and branches without a spatial convolution (only a 1 × 1 convolution). An example of such a module, taken from Inception V3.

![capture](https://user-images.githubusercontent.com/13174586/51670595-026e3d00-1fed-11e9-9c03-4717b16408ce.JPG)

>>#### The purpose of 1 × 1 convolutions
We already know that convolutions extract spatial patches around every tile in an input tensor and apply the same transformation to each patch. An edge case is when the patches extracted consist of a single tile. The convolution operation then becomes equivalent to running each tile vector through a Dense layer: it will compute features that mix together information from the channels of the input tensor, but it won’t mix information across space (because it’s looking at one tile at a time). Such
1 × 1 convolutions (also called pointwise convolutions) are featured in Inception modules, where they contribute to factoring out channel-wise feature learning and spacewise feature learning—a reasonable thing to do if you assume that each channel is
highly autocorrelated across space, but different channels may not be highly correlated with each other.

In [None]:
from keras import layers

branch_a= layers.Conv2D(128, 1, activation='relu', strides=2)(x) #Every branch has the same stride value (2),
                                                                 #which is necessary to keep all branch outputs 
                                                                 #the same size so you can concatenate them.

branch_b= layers.Conv2D(128,1,activation='relu')(x)                   #In this branch, the striding occurs
branch_b= layers.Conv2D(128,3,activation='relu', strides=2)(branch_b) #in the spatial convolution layer.

branch_c= layers.AvgPooling2D(3, strides=2)(x)                        #In this branch, the striding occurs
branch_c= layers.Conv2D(3, activation='relu')(branch_c)               #in the average pooling layer.

branch_d= layers.Conv2D(1, activation='relu')(x)
branch_d= layers.Conv2D(3, activation='relu')(branch_d)
branch_d= layers.Conv2D(3, activation='relu', strides=2)(branch_d)

output= layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis=-1) #Concatenates the branch outputs to
                                                                              #obtain the module output

Note that the full Inception V3 architecture is available in Keras as `keras.applications.inception_v3.InceptionV3`, including weights pretrained on the ImageNet dataset.

Another closely related model available as part of the Keras applications module is `Xception`. **Xception**, which stands for **extreme inception**, is a convnet architecture loosely inspired by Inception. It takes the idea of separating the learning of channel-wise and space-wise features to its logical extreme, and replaces Inception modules with depthwise separable convolutions consisting of a depthwise convolution (a spatial convolution where every input channel is handled separately) followed by a pointwise convolution (a 1 × 1 convolution)—effectively, an extreme form of an Inception module, where spatial features and channel-wise features are fully separated. Xception has roughly the same number of parameters as Inception V3, but it shows better runtime performance and higher accuracy on ImageNet as well as other large-scale datasets, due to a more efficient use of model parameters.

#### RESIDUAL CONNECTIONS
Residual connections are a common graph-like network component found in many post- 2015 network architectures, including Xception.They tackle two common problems that plague any large-scale deep-learning model: ***vanishing gradients*** and ***representational bottlenecks***. In general, adding residual connections to any model that has more than 10 layers is likely to be beneficial.

A residual connection consists of making the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network. Rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size. If they’re different sizes, we can use a linear transformation to reshape the earlier activation into the target shape (for example, a Dense layer without an activation or, for convolutional feature maps, a 1 × 1 convolution without an activation). 

Here’s how to implement a residual connection in Keras when the feature-map sizes are the same, using identity residual connections. This example assumes the existence of a 4D input tensor x:

In [None]:
from keras import layers
x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x) #Applies a transformation to x
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.add([y, x]) #Adds the original x back to the output features

And the following implements a residual connection when the feature-map sizes differ, using a linear residual connection (again, assuming the existence of a 4D input tensor x):

In [None]:
from keras import layers
x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)
residual = layers.Conv2D(128, 1, strides=2, padding='same')(x) #Uses a 1 × 1 convolution to linearly downsample the original
                                                               #x tensor to the same shape as y

y = layers.add([y, residual])#Adds the residual tensor back to the output features

>> #### Representational Bottlenecks in Deep Learning
In a Sequential model, each successive representation layer is built on top of the revious one, which means it only has access to information contained in the activation of the previous layer. If one layer is too small (for example, it has features that
are too low-dimensional), then the model will be constrained by how much information can be crammed into the activations of this layer. We can grasp this concept with a signal-processing analogy: if we have an audioprocessing pipeline that consists of a series of operations, each of which takes as input the output of the previous operation, then if one operation crops your signal to a low-frequency range (for example, 0–15 kHz), the operations downstream will never be able to recover the dropped frequencies. Any loss of information is permanent. Residual connections, by reinjecting earlier information downstream, partially solve this issue for deep-learning models.

>> #### Vanishing Gradients in Deep Learning
Backpropagation, the master algorithm used to train deep neural networks, works by propagating a feedback signal from the output loss down to earlier layers. If this feedback signal has to be propagated through a deep stack of layers, the signal may
become tenuous or even be lost entirely, rendering the network untrainable. This issue is known as vanishing gradients. <br/>
This problem occurs both with deep networks and with recurrent networks over very long sequences—in both cases, a feedback signal must be propagated through a long series of operations. You’re already familiar with the solution that the LSTM layer
uses to address this problem in recurrent networks: it introduces a carry track that propagates information parallel to the main processing track. Residual connections work in a similar way in feedforward deep networks, but they’re even simpler: they
introduce a purely linear information carry track parallel to the main layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers.

### Layer Weight Sharing

One more important feature of the functional API is the ability to reuse a layer instance several times. When we call a layer instance twice, instead of instantiating a new layer for each call, we reuse the same weights with every call. This allows us to build models that have shared branches—several branches that all share the same knowledge and perform the same operations. That is, they share the same representations and learn these representations simultaneously for different sets of inputs.

For example, consider a model that attempts to assess the semantic similarity between two sentences. The model has two inputs (the two sentences to compare) and outputs a score between 0 and 1, where 0 means unrelated sentences and 1 means sentences that are either identical or reformulations of each other. Such a model could be useful in many applications, including deduplicating natural-language queries in a dialog system.

In this setup, the two input sentences are interchangeable, because semantic similarity is a symmetrical relationship: the similarity of A to B is identical to the similarity of B to A. For this reason, it wouldn’t make sense to learn two independent models for processing each input sentence. Rather, we want to process both with a single LSTM layer. The representations of this LSTM layer (its weights) are learned based on both inputs simultaneously. This is what we call a **``Siamese LSTM``** model or a **``shared LSTM``**.

Here’s how to implement such a model using layer sharing (layer reuse) in the
Keras functional API:

In [None]:
from keras.models import Model 
from keras import layers
from keras import Input

lstm= layers.LSTM(32) #Instantiates a single LSTM layer once

left_input= Input(shape=(None, 128)) #Building the left branch of the model: inputs are 
left_output= lstm(left_input)        #variable-length sequences of vectors of size 128.

right_input= Input(shape=(None, 128)) #Building the right branch of the model:
right_output= lstm(left_input)        #when we call an existing layer instance, we reuse its weights.

merged= layers.concatenate([left_output, right_output], axis=-1) #Builds the classifier on top
predictions= layer.Dense(1, activation='sigmoid')(merged)


model= Model([left_input, right_input], predictions)   #Instantiating and training the model: when we
model.fit([left_data, right_data], targets)            #train such a model, the weights of the LSTM layer
                                                       #are updated based on both inputs.

Naturally, a layer instance may be used more than once—it can be called arbitrarily many times, reusing the same set of weights every time.

### Models as layers
Importantly, in the functional API, models can be used as we’d use layers—effectively, we can think of a model as a “bigger layer.” This is true of both the Sequential and Model classes. This means we can call a model on an input tensor and retrieve an output tensor:

`y = model(x)`

If the model has multiple input tensors and multiple output tensors, it should be
called with a list of tensors:

`y1, y2 = model([x1, x2])`

When we call a model instance, we’re reusing the weights of the model—exactly like what happens when we call a layer instance. Calling an instance, whether it’s a layer instance or a model instance, will always reuse the existing learned representations of the instance—which is intuitive.

One simple practical example of what we can build by reusing a model instance is a vision model that uses a dual camera as its input: two parallel cameras, a few centimeters (one inch) apart. Such a model can perceive depth, which can be useful in
many applications. We shouldn’t need two independent models to extract visual features from the left camera and the right camera before merging the two feeds.  Such low-level processing can be shared across the two inputs: that is, done via layers
that use the same weights and thus share the same representations. Here’s how we’d implement a Siamese vision model (shared convolutional base) in Keras:

In [None]:
from keras import layers
from keras import applications
from keras import Input

xception_base= applications.Xception(weights=None, include_top=False) #The base image-processing model is the 
                                                                      #Xception network (convolutional base only)

left_input= Input(shape=(250,250, 3)) #The inputs are 250 × 250 RGB images
left_features= xcception_base(left_input) 

right_input= Input(shape=(250,250, 3)) #The inputs are 250 × 250 RGB images
right_features= xcception_base(right_input)

merged_features= layers.concatenate([lefty_features, right_features], axis=-1) #The merged features contain 
                                                                               #information from the right visual 
                                                                               #feed and the left visual feed.

### Inspect and Monitor Deep-Learning Models Using Keras Callbacks and TensorBoard

We’ll review ways to gain greater access to and control over what goes on inside our model during training. Launching a training run on a large dataset for tens of epochs using `model.fit()` or `model.fit_generator()` can be a bit like launching a paper airplane: past the initial impulse, we don’t have any control over its trajectory or its landing spot. If we want to avoid bad outcomes (and thus wasted paper airplanes), it’s smarter to use not a paper plane, but a drone that can sense its environment, send data back to its operator, and automatically make steering decisions based on its current state. The techniques we present here will transform the call to `model.fit()` from a paper airplane into a smart, autonomous drone that can self-introspect
and dynamically take action.

#### Use Callbacks to Act on a Model During Training
When we’re training a model, there are many things we can’t predict from the start. In particular, we can’t tell how many epochs will be needed to get to an optimal validation loss. The examples so far have adopted the strategy of training for enough
epochs that we begin overfitting, using the first run to figure out the proper number of epochs to train for, and then finally launching a new training run from scratch using this optimal number. Of course, this approach is wasteful.

A much better way to handle this is to stop training when we measure that the validation loss in no longer improving. This can be achieved using a `Keras callback`. A `callback` is an object (a class instance implementing specific methods) that is passed to the model in the call to fit and that is called by the model at various points during training. It has access to all the available data about the state of the model and its performance, and it can take action: interrupt training, save a model, load a different weight set, or otherwise alter the state of the model.

Here are some examples of ways we can use callbacks:
 - ***Model checkpointing—*** Saving the current weights of the model at different points during training.
 - ***Early stopping—*** Interrupting training when the validation loss is no longer improving (and of course, saving the best model obtained during training).
 - ***Dynamically adjusting the value of certain parameters during training—*** Such as the learning rate of the optimizer.
 - ***Logging training and validation metrics during training, or visualizing the representations learned by the model as they’re updated—*** The Keras progress bar that we’re familiar with is a callback!

The keras.callbacks module includes a number of built-in callbacks:
`keras.callbacks.ModelCheckpoint`<br/>
`keras.callbacks.EarlyStopping`<br/>
`keras.callbacks.LearningRateScheduler`<br/>
`keras.callbacks.ReduceLROnPlateau`<br/>
`keras.callbacks.CSVLogger`<br/>

Let’s review a few of them to have an idea of how to use them: `ModelCheckpoint`, `EarlyStopping`, and `ReduceLROnPlateau`.

#### THE MODELCHECKPOINT AND EARLYSTOPPING CALLBACKS

We can use the `EarlyStopping` callback to interrupt training once a target metric being monitored has stopped improving for a fixed number of epochs. For instance, this callback allows us to interrupt training as soon as we start overfitting, thus
avoiding having to retrain our model for a smaller number of epochs. This callback is typically used in combination with `ModelCheckpoint`, which lets us continually save the model during training (and, optionally, save only the current best model so far: the version of the model that achieved the best performance at the end of an epoch):

In [None]:
import keras

callbacks_list=[   #Callbacks are passed to the model via the callbacks argument in fit, which takes a list of callbacks. 
                   #We can pass any number of callbacks.
    keras.callbacks.EarlyStopping( #Interrupts training when improvement stops
    monitoring='acc',             #Monitors the model’s validation accuracy
    patience=1),                  #Interrupts training when accuracy has stopped improving 
                                  #for more than one epoch (that is, two epochs)
    
    keras,callbacks.ModelCheckpoint(  #Saves the current weights after every epoch
    filepath='my_model.h5',           #Path to the destination model file
    monitor='val_loss',                      #These two arguments mean we won’t overwrite the model file unless val_loss has
                                             #improved, which allows you to keep the best model seen during training
    save_best_only=True),   
]


model.compile= (optimizer= 'rmsprop', loss='binary_crossentropy', metrics= ['acc']) #We monitor accuracy, so it should
                                                                                    #be part of the model’s metrics

model.fit(x_train, y_train, epochs=100,batch_size=128               #Note that because the callback will monitor validation 
          callbacks= callback_list, validation_data=(x_val, y_val)) #loss and validation accuracy, we need to pass
                                                                    #validation_data to the call to fit 

#### THE REDUCELRONPLATEAU CALLBACK
We can use this callback to reduce the learning rate when the validation loss has stopped improving. Reducing or increasing the learning rate in case of a loss plateau is is an effective strategy to get out of local minima during training. The following example uses the `ReduceLROnPlateau` callback:

In [None]:
callback_list=[
    keras.callbacks.ReduceLROnPlateau(
        monitor= 'val_loss',          #Monitors the model’s validation loss
        factor=0.1,                   #Divides the learning rate by 10 when triggered
        patience=10)                  #The callback is triggered after the validation loss has stopped improving for 10 epochs
]

model.fit(x_train, y_train, epochs=100,batch_size=128               #Note that because the callback will monitor validation 
          callbacks= callback_list, validation_data=(x_val, y_val)) #loss and validation accuracy, we need to pass
                                                                    #validation_data to the call to fit 

#### WRITING OUR OWN CALLBACK
If we need to take a specific action during training that isn’t covered by one of the built-in callbacks, we can write our own callback. Callbacks are implemented by subclassing the class `keras.callbacks.Callback`. We can then implement any number
of the following transparently named methods, which are called at various points during training:

`on_epoch_begin`  <- Called at the start of every epoch<br/>
`on_epoch_end`    <- Called at the end of every epoch

`on_batch_begin`  <- Called right before processing each batch<br/>
`on_batch_end`    <- Called right after processing each batch

`on_train_begin`  <- Called at the start of training<br/>
`on_train_end`    <- Called at the end of training

These methods all are called with a `logs` argument, which is a dictionary containing information about the previous batch, epoch, or training run: training and validation metrics, and so on. Additionally, the callback has access to the following attributes:
 - `self.model`— The model instance from which the callback is being called
 - `self.validation_data`— The value of what was passed to *fit* as validation data

Here’s a simple example of a custom callback that saves to disk (as Numpy arrays) the activations of every layer of the model at the end of every epoch, computed on the first sample of the validation set:

In [None]:
import keras
import numpy as np

class ActivationLogger(keras.callbacks.Callback):
    
    def set_model(self, model):
        self.model= model   #Called by the parent model before training, to inform the callback of what model will be calling it          
        layer_outputs= [layer.output for layer in model.layers]
        self.activations_model= keras.models.Model(model.inputs, layer_outputs) #Model instance that returns the 
                                                                                #activations of every layer
        
    def on_epoch_end(self, epochs, logs=None):
        if self.validation_data=None:
            raise RuntimeError('Requires validation data')
            
        validation_sample= self.validation_data[0][0:1]     #Obtains the first input sample of the validation data
        activations= self.activations_model,predict(validation_sample) 
        f= open('activations_at_epoch_' + str(epoch) + '.npz', 'w')         #Saves arrays to disk
        np.savez(f, activations)
        f.close()
        
        

This is all we need to know about callbacks—the rest is technical details, which we can easily look up. Now we’re equipped to perform any sort of logging or preprogrammed intervention on a Keras model during training.

### Introduction to TensorBoard: The TensorFlow Visualization Framework

To do good research or develop good models, we need rich, frequent feedback about what’s going on inside our models during our experiments. That’s the point of running experiments: to get information about how well a model performs—as much information as possible. Making progress is an iterative process, or loop: we start with an idea and express it as an experiment, attempting to validate or invalidate our idea. We run this experiment and process the information it generates. This inspires our next idea. The more iterations of this loop we’re able to run, the more refined and powerful our ideas become. Keras helps us go from idea to experiment in the least possible time, and fast GPUs can help you get from experiment to result as quickly as possible. But what about processing the experiment results? That’s where TensorBoard comes in.

![capture](https://user-images.githubusercontent.com/13174586/51678525-a1516400-2002-11e9-8f86-734afb56dafa.JPG)

This section introduces TensorBoard, a browser-based visualization tool that comes packaged with TensorFlow. Note that it’s only available for Keras models when we’re using Keras with the TensorFlow backend.

The key purpose of TensorBoard is to help us visually monitor everything that goes on inside our model during training. If we’re monitoring more information than just the model’s final loss, we can develop a clearer vision of what the model does and doesn’t do, and we can make progress more quickly. TensorBoard gives us access to several neat features, all in our browser:

Visually monitoring metrics during training
 - Visualizing your model architecture
 - Visualizing histograms of activations and gradients
 - Exploring embeddings in 3D

Let’s demonstrate these features on a simple example. We’ll train a 1D convnet on the IMDB sentiment-analysis task.
You’ll consider only the top 2,000 words in the IMDB vocabulary, to make visualizing word embeddings more tractable.

#### Text-Classification Model to Use With TensorBoard

In [1]:
import keras
from keras import Input
from keras import layers
from keras.models import Model
from keras.datasets import imdb
from keras.preprocessing import sequence

Using TensorFlow backend.


In [2]:
max_features= 2000 #Number of words to consider as features
max_len=500         #Cuts off texts after this number of words (among max_features most common words)

In [3]:
(x_train, y_train), (x_test, y_test)= imdb.load_data(num_words= max_features)

x_train= sequence.pad_sequences(x_train, maxlen= max_len)
x_test= sequence.pad_sequences(x_test, maxlen= max_len)

x_train_input= Input(shape=(None,), dtype='int32', name='x_train')
#y_train_output= Input(shape=(None,), dtype='int32', name='y_train')

embedded_x_train= layers.Embedding(max_features, 128, input_length=max_len, name='embed')(x_train_input)

x= layers.Conv1D(32, 7, activation='relu')(embedded_x_train)
x= layers.MaxPooling1D(5)(x)
x= layers.Conv1D(32, 7, activation='relu')(x)
x= layers.GlobalMaxPooling1D()(x)
op= layers.Dense(1)(x)


In [4]:
model= Model(x_train_input, op)
model.summary()

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
x_train (InputLayer)         (None, None)              0         
_________________________________________________________________
embed (Embedding)            (None, 500, 128)          256000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 494, 32)           28704     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 98, 32)            0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 92, 32)            7200      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total para

Before we start using TensorBoard, we need to create a directory where we’ll store the log files it generates.

#### Create a Directory for TensorBoard Log Files

Let’s launch the training with a TensorBoard callback instance. This callback will write log events to disk at the specified location.

#### Train The model With a TensorBoard Callback

In [5]:
callbacks= [
    keras.callbacks.TensorBoard(
    log_dir='my_keras_log_dir',                                            #Log files will be written at this location
    histogram_freq=1,                                                      #Records activation histograms every 1 epoch
    embeddings_freq=0,
    embeddings_data = None
    )
]        #Records embedding data every 1 epoch,



In [5]:
history = model.fit(x_train, y_train, epochs=20, batch_size=128,
                    validation_data= (x_test, y_test), callbacks=callbacks)

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


![capture](https://user-images.githubusercontent.com/13174586/51729853-5d5e6d80-209b-11e9-86ce-2875631c0c3f.JPG)
![capture 2jpg](https://user-images.githubusercontent.com/13174586/51730231-afec5980-209c-11e9-8f71-50429b2849c5.JPG)
![capture3](https://user-images.githubusercontent.com/13174586/51729855-5d5e6d80-209b-11e9-8d88-71aa62e30a5a.JPG)
![capture4](https://user-images.githubusercontent.com/13174586/51729856-5df70400-209b-11e9-854f-f263b3dfa4e1.JPG)
![capture5](https://user-images.githubusercontent.com/13174586/51729857-5df70400-209b-11e9-899a-2622e28b6a79.JPG)

Note that Keras also provides another, cleaner way to plot models as graphs of layers rather than graphs of TensorFlow operations: the utility keras.utils.plot_model. Using it requires that we’ve installed the Python pydot and pydot-ng libraries as well as the graphviz library. Let’s take a quick look:

In [8]:
from keras.utils import plot_model
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

plot_model(model, to_file='model.png')

![model](https://user-images.githubusercontent.com/13174586/51730390-4f115100-209d-11e9-8f37-bb2a6f174c4d.png)

We also have the option of displaying shape information in the graph of layers. This example visualizes model topology using plot_model and the show_shapes option:

In [9]:
from keras.utils import plot_model
plot_model(model, show_shapes=True, to_file='model_shape.png')

![model_shape](https://user-images.githubusercontent.com/13174586/51730470-9d265480-209d-11e9-80c6-8e70973e9ba1.png)

### Get The Most Out of Our Models
Trying out architectures blindly works well enough if we just need something that works okay. In this section, we’ll go beyond “works okay” to “works great”.

#### Advanced Architecture Patterns
We covered one important design pattern in detail in the previous section: residual connections. There are two more design patterns we should know about: normalization and depthwise separable convolution. These patterns are especially relevant when we’re building high-performing deep convnets, but they’re commonly found in many other types of architectures as well.

#### BATCH NORMALIZATION

Normalization is a broad category of methods that seek to make different samples seen by a machine-learning model more similar to each other, which helps the model learn and generalize well to new data. The most common form of data normalization is one
we’ve seen already: centering the data on 0 by subtracting the mean from the data, and giving the data a unit standard deviation by dividing the data by its standard deviation. In effect, this makes the assumption that the data follows a normal (or Gaussian) distribution and makes sure this distribution is centered and scaled to unit variance:

`normalized_data = (data - np.mean(data, axis=...)) / np.std(data, axis=...)`

Previous examples normalized data before feeding it into models. But data normalization should be a concern after every transformation operated by the network: even if the data entering a Dense or Conv2D network has a 0 mean and unit variance, there’s no reason to expect a priori that this will be the case for the data coming out. Batch normalization is a type of layer (BatchNormalization in Keras). It can adaptively normalize data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training. The main effect of batch normalization is that it helps with gradient propagation— much like residual connections—and thus allows for deeper networks. Some very deep networks can only be trained if they include multiple BatchNormalization layers. For instance, BatchNormalization is used liberally in many of the advanced convnet architectures that come packaged with Keras, such as ResNet50, Inception V3, and Xception.

The BatchNormalization layer is typically used after a convolutional or densely connected layer:

`conv_model.add(layers.Conv2D(32, 3, activation='relu'))`<br/>
`conv_model.add(layers.BatchNormalization())`

`dense_model.add(layers.Dense(32, activation='relu'))`<br/>
`dense_model.add(layers.BatchNormalization())`

The BatchNormalization layer takes an axis argument, which specifies the feature axis that should be normalized. This argument defaults to -1, the last axis in the input tensor. This is the correct value when using Dense layers, Conv1D layers, RNN layers,
and Conv2D layers with data_format set to "channels_last". But in the niche use case of Conv2D layers with data_format set to "channels_first", the features axis is axis 1; the axis argument in BatchNormalization should accordingly be set to 1.

#### DEPTHWISE SEPARABLE CONVOLUTION
There’s a layer we can use as a drop-in replacement for Conv2D that will make our model lighter (fewer trainable weight parameters) and faster (fewer floating-point operations) and cause it to perform a few percentage points better on its task? That is precisely what the depthwise separable convolution layer does (`SeparableConv2D`). This layer performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution (a 1 × 1 convolution). This is equivalent to separating the learning of spatial features and the learning of channel-wise features, which makes a lot of sense if we assume that spatial locations in the input are highly correlated, but different channels are fairly independent. It requires significantly fewer parameters and involves fewer computations, thus resulting in smaller, speedier models. And because it’s a more representationally efficient way to perform convolution, it tends to learn better representations using less data, resulting in better-performing models.

![capture](https://user-images.githubusercontent.com/13174586/51733219-aa940c80-20a6-11e9-9b06-e35b57e0bf93.JPG)

These advantages become especially important when we’re training small models from scratch on limited data. For instance, here’s how we can build a lightweight, depthwise separable convnet for an image-classification task (softmax categorical classification) on a small dataset:

In [2]:
from keras.models import Sequential, Model
from keras import layers

height=64
width=64
channels=3
num_classes=10

model= Sequential()
model.add(layers.SeparableConv2D(32, 3, activation='relu', input_shape=(height, width, channels)))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.GlobalAveragePooling2D())

model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(num_classes, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

When it comes to larger-scale models, depthwise separable convolutions are the basis of the Xception architecture, a high-performing convnet that comes packaged with Keras. We can read more about the theoretical grounding for depthwise separable convolutions and Xception in paper “Xception: Deep Learning with Depthwise
Separable Convolutions” by Francois Chollet.

### Hyperparameter optimization

When building a deep-learning model, we have to make many seemingly arbitrary decisions: How many layers should we stack? How many units or filters should go in each layer? Should we use relu as activation, or a different function? Should we use `BatchNormalization` after a given layer? How much dropout should we use? And so on. These architecture-level parameters are called hyperparameters to distinguish them from the parameters of a model, which are trained via backpropagation.

In practice, experienced machine-learning engineers and researchers build intuition over time as to what works and what doesn’t when it comes to these choices— they develop hyperparameter-tuning skills. But there are no formal rules. If we want to get to the very limit of what can be achieved on a given task, we can’t be content with arbitrary choices made by a fallible human. Our initial decisions are almost always suboptimal, even if we have good intuition. We can refine your choices by tweaking them by hand and retraining the model repeatedly— that’s what machinelearning engineers and researchers spend most of their time doing. But it shouldn’t be our job as a human to fiddle with hyperparameters all day— that is better left to a machine.

Thus we need to explore the space of possible decisions automatically, systematically, in a principled way. We need to search the architecture space and find the best performing ones empirically. That’s what the field of automatic hyperparameter optimization is about: it’s an entire field of research, and an important one.

The process of optimizing hyperparameters typically looks like this:
 - Choose a set of hyperparameters (automatically)
 - Build the corresponding model
 - Fit it to our training data, and measure the final performance on the validation data
 - Choose the next set of hyperparameters to try (automatically)
 - Repeat
 - Eventually, measure performance on your test data

The key to this process is the algorithm that uses this history of validation performance, given various sets of hyperparameters, to choose the next set of hyperparameters to evaluate. Many different techniques are possible: Bayesian optimization, genetic algorithms, simple random search, and so on.

Training the weights of a model is relatively easy: we compute a loss function on a mini-batch of data and then use the Backpropagation algorithm to move the weights in the right direction. Updating hyperparameters, on the other hand, is extremely
challenging. We should consider the following:

 - Computing the feedback signal (does this set of hyperparameters lead to a high-performing model on this task?) can be extremely expensive: it requires creating and training a new model from scratch on our dataset.
 - The hyperparameter space is typically made of discrete decisions and thus isn’t continuous or differentiable. Hence, we typically can’t do gradient descent in hyperparameter space. Instead, we must rely on gradient-free optimization techniques, which naturally are far less efficient than gradient descent.
 
Because these challenges are difficult and the field is still young, we currently only have access to very limited tools to optimize models. Often, it turns out that random search (choosing hyperparameters to evaluate at random, repeatedly) is the best solution, despite being the most naive one. But one tool we have found reliably better than random search is Hyperopt (https://github.com/hyperopt/hyperopt), a Python library for hyperparameter optimization that internally uses trees of Parzen estimators to predict sets of hyperparameters that are likely to work well. Another library called Hyperas (https://github.com/maxpumperla/hyperas) integrates Hyperopt for use with Keras models.

>NOTE One important issue to keep in mind when doing automatic hyperparameter optimization at scale is validation-set overfitting. Because we’re updating hyperparameters based on a signal that is computed using our validation data, we’re effectively training them on the validation data, and thus they will quickly overfit to the validation data. Always keep this in mind.

Overall, hyperparameter optimization is a powerful technique that is an absolute requirement to get to state-of-the-art models on any task or to win machine-learning competitions. Think about it: once upon a time, people handcrafted the features that
went into shallow machine-learning models. That was very much suboptimal. Now, deep learning automates the task of hierarchical feature engineering—features are learned using a feedback signal, not hand-tuned, and that’s the way it should be. In the same way, we shouldn’t handcraft our model architectures; we should optimize them in a principled way. At the time of writing, the field of automatic hyperparameter optimization is very young and immature, as deep learning was some years ago, but we expect it to boom in the next few years.

### Model Ensembling

Another powerful technique for obtaining the best possible results on a task is model ensembling. Ensembling consists of pooling together the predictions of a set of different models, to produce better predictions. If we look at machine-learning competitions, in particular on Kaggle, we’ll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good.

Ensembling relies on the assumption that different good models trained independently are likely to be good for different reasons: each model looks at slightly different aspects of the data to make its predictions, getting part of the “truth” but not all of it. We may be familiar with the ancient parable of the blind men and the elephant: a group of blind men come across an elephant for the first time and try to understand what the elephant is by touching it. Each man touches a different part of the elephant’s body—just one part, such as the trunk or a leg. Then the men describe to each other what an elephant is: “It’s like a snake,” “Like a pillar or a tree,” and so on. The blind men are essentially machine-learning models trying to understand the manifold of the training data, each from its own perspective, using its own assumptions (provided by the unique architecture of the model and the unique random weight initialization). Each of them gets part of the truth of the data, but not the whole truth. By pooling their perspectives together, we can get a far more accurate description of the data. The elephant is a combination of parts: not any single blind man gets it quite right, but, interviewed together, they can tell a fairly accurate story.

Let’s use classification as an example. The easiest way to pool the predictions of a set of classifiers (to ensemble the classifiers) is to average their predictions at inference time:

`preds_a = model_a.predict(x_val)`<br/>
`preds_b = model_b.predict(x_val)`<br/>
`preds_c = model_c.predict(x_val)`<br/>
`preds_d = model_d.predict(x_val)`<br/>

This new prediction array should be more accurate than any of the initial ones:
`final_preds = 0.25 * (preds_a + preds_b + preds_c + preds_d)` 

This will work only if the classifiers are more or less equally good. If one of them is significantly worse than the others, the final predictions may not be as good as the best classifier of the group.

A smarter way to ensemble classifiers is to do a weighted average, where the weights are learned on the validation data—typically, the better classifiers are given a higher weight, and the worse classifiers are given a lower weight. To search for a good set of ensembling weights, we can use random search or a simple optimization algorithm such as Nelder-Mead:

`preds_a = model_a.predict(x_val)`<br/>
`preds_b = model_b.predict(x_val)`<br/>
`preds_c = model_c.predict(x_val)`<br/>
`preds_d = model_d.predict(x_val)`<br/>

`final_preds = 0.5 * preds_a + 0.25 * preds_b + 0.1 * preds_c + 0.15 * preds_d`<br/>
These weights (0.5, 0.25, 0.1, 0.15) are assumed to be learned empirically

There are many possible variants: we can do an average of an exponential of the predictions, for instance. In general, a simple weighted average with weights optimized on the validation data provides a very strong baseline.

The key to making ensembling work is the diversity of the set of classifiers. Diversity is strength. If all the blind men only touched the elephant’s trunk, they would agree that elephants are like snakes, and they would forever stay ignorant of the truth of the elephant. Diversity is what makes ensembling work. In machine-learning terms, if all of our models are biased in the same way, then our ensemble will retain this same bias. If our models are biased in different ways, the biases will cancel each other out, and the ensemble will be more robust and more accurate.

For this reason, we should ensemble models that are as good as possible while being as different as possible. This typically means using very different architectures or even different brands of machine-learning approaches. One thing that is largely not worth doing is ensembling the same network trained several times independently, from different random initializations. If the only difference between your models is their random initialization and the order in which they were exposed to the training data, then our ensemble will be low-diversity and will provide only a tiny improvement over any single model.

One thing works well in practice—but that doesn’t generalize to every problem domain—is the use of an ensemble of tree-based methods (such as random forests or gradient-boosted trees) and deep neural networks. 

In recent times, one style of basic ensemble that has been very successful in practice is the wide and deep category of models, blending deep learning with shallow learning. Such models consist of jointly training a deep neural network with a large linear
model. The joint training of a family of diverse models is yet another option to achieve model ensembling.