<a href="https://colab.research.google.com/github/kilos11/Learn-Keras-for-Deep-Neural-Networks-by-JoJo-Moolayil/blob/main/Tuning_and%C2%A0Deploying_Deep_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**L1 Regularization**

In L1 regularization, the sum of the absolute values of the weights, scaled
by a factor λ, is added to the loss function. To make the model generalize
better, this penalty pushes many weights to exactly 0, so the method is
often preferred when we are trying to compress the model for faster
computation.
In Keras, the L1 penalty can be added to a layer by passing a regularizer
object to the ‘kernel_regularizer’ parameter. The following code snippet
demonstrates adding an L1 regularizer to a dense layer in Keras.

In [None]:
from keras import regularizers
from keras import Sequential
from keras.layers import Dense

model = Sequential()
# Dense layer with an L1 penalty (λ = 0.01) on its kernel weights
model.add(Dense(256, input_dim=128,
                kernel_regularizer=regularizers.l1(0.01)))
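
To make the penalty concrete, here is a minimal pure-Python sketch of the term that L1 regularization adds to a loss value. The function name `l1_penalty`, the sample weights, and the base loss are illustrative assumptions and not part of the Keras API; Keras computes the equivalent quantity internally once a regularizer is attached to a layer.

```python
def l1_penalty(weights, lam):
    """Return lam * sum(|w|), the term L1 regularization adds to the loss."""
    return lam * sum(abs(w) for w in weights)


# Hypothetical weights and data loss, chosen only for illustration.
weights = [0.5, -1.5, 0.0, 2.0]
data_loss = 0.30

total_loss = data_loss + l1_penalty(weights, lam=0.01)
print(total_loss)
```

Because the penalty grows with the magnitude of every nonzero weight, the optimizer is rewarded for setting unneeded weights to 0.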

#**L2 Regularization**

In L2 regularization, the sum of the squared weights, scaled by a factor λ,
is added to the loss function. To make the model generalize better, the
weights are shrunk toward 0 (but not to exactly 0), and hence this is also
called the “weight decay” method. In most cases, L2 is recommended over L1
for reducing overfitting.

In [None]:
model = Sequential()
model.add(Dense(256, input_dim=128,
kernel_regularizer=regularizers.l2(0.01)))
#The value of 0.01 is the hyperparameter value we set for λ.
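
The “weight decay” name can be illustrated with a small pure-Python sketch. It applies gradient steps on the L2 penalty alone (the data loss is deliberately left out), and the function name `l2_decay_step`, the learning rate, and the starting weights are assumptions made for illustration.

```python
def l2_decay_step(weights, lam, lr):
    """One gradient step on the penalty lam * sum(w**2) alone.

    The gradient of the penalty with respect to w is 2 * lam * w, so each
    weight is multiplied by (1 - 2 * lr * lam).
    """
    return [w - lr * 2.0 * lam * w for w in weights]


weights = [0.5, -1.5, 2.0]   # hypothetical starting weights
for _ in range(100):
    weights = l2_decay_step(weights, lam=0.01, lr=0.1)

print(weights)  # every weight has moved toward 0, none is exactly 0
```

Each step scales the weights by a constant factor slightly below 1, which is why they decay toward 0 without reaching it.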

#**Dropout Regularization**


*In addition to L1 and L2 regularization, there is another popular technique
in DL to reduce overfitting: dropout. In this method, the model randomly
drops (deactivates) a few neurons of a layer during each training iteration.
Therefore, in each iteration the model optimizes a slightly different
structure of itself, as some neurons and their connections are deactivated.
Say we have two successive layers, H1 and H2, with 15 and 20 neurons,
respectively. Applying the dropout technique between these two layers
randomly drops a few neurons of H1 (based on a defined percentage), which
reduces the connections between H1 and H2. This process repeats with fresh
randomness in each iteration, so after the model has learned from one batch
and updated the weights, the next batch might train a fairly different set
of weights and connections. The process not only reduces the computation in
each iteration but also works intuitively to reduce overfitting and thereby
improve overall performance.
The idea of dropout can be visually understood using the following
figure. We can see that the regular network has all neurons and connections
between two successive layers intact. With dropout, each iteration induces a
certain defined degree of randomness by arbitrarily deactivating or dropping
a few neurons and their associated weight connections.*

In [None]:
from keras import Sequential
from keras.layers import Dropout, Dense

model = Sequential()
model.add(Dense(100, input_dim=50, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(1, activation="linear"))

#The parameter value of 0.25 indicates the dropout rate
#(i.e., the fraction of the neurons to be dropped).
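
What a dropout layer does to the outputs of H1 during training can be sketched in pure Python. This is an illustrative sketch of inverted dropout, not Keras code: the function name `dropout`, the seed, and the all-ones activations are assumptions. The Keras layer likewise rescales the surviving units by 1 / (1 - rate) during training.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout for one training step.

    Each activation is zeroed with probability `rate`; survivors are
    scaled by 1 / (1 - rate) so the expected sum stays unchanged.
    """
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0
            for a in activations]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
h1 = [1.0] * 15                 # outputs of a hypothetical 15-neuron layer H1
print(dropout(h1, rate=0.25, rng=rng))
print(dropout(h1, rate=0.25, rng=rng))  # a fresh mask is drawn on each call
```

At prediction time no units are dropped, so the layer passes its input through unchanged.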

#**Hyperparameter Tuning**

Hyperparameters are the parameters that define a model’s overall
structure and, with it, the learning process. We can also think of
hyperparameters as the metaparameters of a model. They differ from a
model’s actual parameters (say, the model weights), which the model learns
during the training process. Unlike model parameters, hyperparameters
cannot be learned; we need to tune them with different approaches to get
improved performance.



###**Number of Neurons in a Layer**

For most classification and regression use cases using tabular
cross-sectional data, DNNs can be made robust by experimenting with the
width of the network (i.e., the number of neurons in a layer). Generally,
a simple rule of thumb for selecting the number of neurons in the first
layer is to refer to the number of input dimensions. If the final number of
input dimensions in a given training dataset (including the one-hot
encoded features) is x, we should use at least the power of 2 closest to
2x. Let’s say you have 100 input dimensions in your training dataset:
start with 2 × 100 = 200, and take the closest power of 2, so 256. It is
good to have the number of neurons be a power of 2, as it helps the
computation of the network to be faster; good choices are 8, 16, 32, 64,
128, 256, 512, 1024, and so on. Likewise, when you have 300 input
dimensions, 2 × 300 = 600, so try using 512 neurons.
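
The rule of thumb above can be written as a small helper. This is a sketch with a hypothetical function name, `suggested_width`; breaking ties toward the smaller power of 2 is an assumption, since the text does not say what to do when both neighbours are equally close.

```python
def suggested_width(input_dims):
    """Return the power of 2 closest to 2 * input_dims."""
    target = 2 * input_dims
    lower = 1
    while lower * 2 <= target:
        lower *= 2
    upper = lower * 2
    # Pick whichever neighbouring power of 2 is nearer to the target.
    return lower if target - lower <= upper - target else upper


print(suggested_width(100))  # 256
print(suggested_width(300))  # 512
```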



###**Number of Layers**

Adding a few more layers will generally increase the performance, at least
marginally. But with an increased number of layers, the training time and
computation increase significantly. Moreover, you would need a higher
number of epochs to see promising results. Avoiding deeper networks is not
always an option; in cases when you have to use them, try a few best
practices.
If you are using a very large network, say more than 20 layers,
try using a tapering architecture (i.e., gradually reduce the number
of neurons in each layer as the depth increases). So, if you are using an
architecture of 30 layers with 512 neurons in each layer, try reducing the
number of neurons in the layers slowly. An improved architecture would
have the first 8 layers with 512 neurons, the next 8 with 256, the next
8 with 128, and so on. For the last hidden layer (not the output layer),
try keeping the number of neurons to at least around 30–40% of the input
size.
Alternatively, if you are using wider networks (i.e., not reducing the
number of neurons in the lower layers), always use L2 regularization or
dropout layers with a drop rate of around 30%. This greatly reduces the
chances of overfitting.
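
The tapering scheme can be sketched as a function that produces the layer widths. The function name `tapering_widths` and the `min_width` floor of 32 are assumptions for illustration; in practice the floor would follow the 30–40% guideline for the last hidden layer.

```python
def tapering_widths(n_layers, start_width, layers_per_stage, min_width):
    """Halve the width every `layers_per_stage` layers, never below `min_width`."""
    widths = []
    width = start_width
    for i in range(n_layers):
        if i > 0 and i % layers_per_stage == 0:
            width = max(width // 2, min_width)
        widths.append(width)
    return widths


# 30 hidden layers tapering 512 -> 256 -> 128 -> 64 in stages of 8 layers.
widths = tapering_widths(n_layers=30, start_width=512,
                         layers_per_stage=8, min_width=32)
print(widths)
```

Each entry of the returned list could then be used as the number of units of one Dense layer when the model is built in a loop.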



###**Number of Epochs**

Sometimes, just increasing the number of epochs for model training
delivers better results, although this comes at the cost of increased
computation and training time.



###**Weight Initialization**

Initializing the weights for your network also has a large impact
on the overall performance. A good weight initialization technique not
only speeds up the training process but also helps the training avoid
getting stuck. By default, the Keras framework uses Glorot uniform
initialization, also called Xavier uniform initialization, but this can be
changed as per your needs. We can initialize the weights for a layer
using the kernel_initializer parameter and the bias using the
bias_initializer parameter. Other popular options are ‘he_normal’ and
‘he_uniform’ initialization and ‘lecun_normal’ and ‘lecun_uniform’
initialization. There are quite a few other options available in Keras
too, but the aforementioned choices are the most popular.

The following code snippet showcases an example of initializing
weights in a layer of a DNN with random_uniform.

In [None]:
from keras import Sequential
from keras.layers import Dense

model = Sequential()
# Hidden layer: weights drawn from a uniform distribution, biases set to 0
model.add(Dense(64, activation="relu", input_dim=32,
                kernel_initializer="random_uniform",
                bias_initializer="zeros"))
model.add(Dense(1, activation="sigmoid"))
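
For intuition about the default, Glorot (Xavier) uniform initialization draws each weight from a uniform distribution whose range depends on the layer's input and output sizes. The following pure-Python sketch shows the rule; the function name `glorot_uniform` and the seed are illustrative assumptions, and Keras performs this sampling internally.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit),
    where limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)
kernel = glorot_uniform(fan_in=32, fan_out=64, rng=rng)
print(math.sqrt(6.0 / (32 + 64)))  # 0.25, the sampling limit for this layer
```

Wider layers get a smaller limit, which keeps the scale of the activations roughly stable from layer to layer.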



###**Batch Size**

Using a moderate batch size usually helps achieve a smoother learning
process for the model. A batch size of 32 or 64, irrespective of the dataset
size, will deliver a smooth learning curve in most cases. Even in scenarios
where your hardware environment has enough memory to accommodate a bigger
batch size, I would still recommend staying with a batch size of 32 or 64.
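
One practical consequence of the batch size is the number of weight updates the model receives per epoch. The arithmetic is sketched below; the helper name `updates_per_epoch` and the dataset size of 100,000 samples are assumptions for illustration.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


for batch_size in (32, 64, 1024):
    print(batch_size, updates_per_epoch(100_000, batch_size))
```

With a moderate batch size the model gets many more updates from the same pass over the data than with a very large one.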




###**Learning Rate**

Learning rate is defined in the context of the optimization algorithm. It
defines the length of each step or, in simple terms, how large the updates
to the weights in each iteration can be. Throughout this book, we
have not set or changed the learning rate, as we have used the
default values for the respective optimization algorithms, in our case
Adam. The default value is 0.001, and this is a great choice for most
scenarios. However, in some special cases, you might come across a
use case where it is better to go with a lower or a slightly higher
learning rate.
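
The effect of the step length can be seen on a toy problem. The sketch below uses plain gradient descent (not Adam) on the one-dimensional loss f(w) = w², and the three learning rates are chosen only to illustrate slow, reasonable, and diverging behaviour; all names and values are assumptions.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimize f(w) = w**2 with plain gradient descent; f'(w) = 2 * w."""
    w = start
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

A very small rate barely moves the weight in 20 steps, a moderate rate gets close to the minimum at 0, and a rate that is too large overshoots further on every step.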



###**Activation Function**

We have a generous choice of activation functions for the neurons. In most
cases, ReLU works well. You could almost always go ahead with ReLU
as the activation for any use case and get favorable results. In cases where
ReLU does not deliver great results, experimenting with PReLU is a
great option.
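
The difference between the two activations is easiest to see on scalar inputs. This is a plain-Python sketch: in a PReLU layer the slope `alpha` is learned during training, and the fixed value 0.25 used here is only an assumption for illustration.

```python
def relu(x):
    """Pass positive inputs through and clamp everything else to 0."""
    return x if x > 0 else 0.0


def prelu(x, alpha):
    """Like ReLU for x > 0; negative inputs are scaled by alpha, which a
    PReLU layer learns during training."""
    return x if x > 0 else alpha * x


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), prelu(x, alpha=0.25))
```

Because PReLU keeps a small, nonzero output for negative inputs, gradients can still flow through neurons that ReLU would switch off.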



###**Optimization**

Similar to activation functions, we also have a fairly generous number of
choices available for the optimization algorithm of the network. While
the most recommended is Adam, in scenarios where Adam might not
be delivering the best results for your architecture, you could explore
Adamax as well as Nadam optimizers. Adamax has mostly been a better
choice for architectures that have sparsely updated parameters like word
embeddings, which are mostly used in natural language processing
techniques. We have not covered these advanced topics in the book, but it
is good to keep these points in mind while exploring various architectures.
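
For readers curious about what Adam does in each step, here is a minimal sketch of the update for a single scalar weight. The default values mirror the ones Keras uses for Adam; the function name `adam_step` and the toy loss w² are assumptions for illustration, and the real optimizer applies the same rule to every weight tensor.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single scalar weight; returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * w            # gradient of the toy loss w**2
    w, m, v = adam_step(w, grad, m, v, t)
    print(t, w)
```

Dividing by the root of the squared-gradient average means the first steps move each weight by roughly the learning rate, regardless of the gradient's scale.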





#**Approaches for Hyperparameter Tuning**

So far, we have discussed various hyperparameters that are available
for our DL models and also studied the most recommended options for
generic situations. However, selecting the most appropriate value for
a hyperparameter based on the data and the type of problem is more
of an art, and an arduous one: hyperparameter tuning in DL is almost
always slow and resource intensive. Based on the style of selecting a
value for a hyperparameter and further tuning model performance, we can
roughly divide the different types of approaches into four broad categories:


*   Manual Search
*   Grid Search
*   Random Search
*   Bayesian Optimization

Out of the four aforementioned approaches, we will have a brief look
at the first three. Bayesian optimization is a long and difficult
topic that is beyond the scope of this book.


###**Manual Search**

Manual search, as the name implies, is a completely manual way of
selecting the best candidate value for the desired hyperparameters
in a DL model. This approach requires considerable experience in
training networks to find the right set of candidate values for all desired
hyperparameters using the least number of experiments. Often this
approach is highly efficient, provided you have sound experience in
training such networks. The best way to start with manual search is simply
to use all the recommended values for the hyperparameters and then start
training the network. The results may not be the best, but will rarely
be the worst. It’s a good starting point for any newcomer to the field to
experiment with a few lowest-risk hyperparameter candidates.



###**Grid Search**

In the grid search approach, you experiment with all possible
combinations of a defined set of values for each hyperparameter. The name
“grid” is derived from the gridlike combinations of the provided values of
each hyperparameter. The following is a sample view of how a logical grid
would look for three hyperparameters with three distinct values in each.
We can use the grid search method in sklearn to accomplish the results. The
following code snippet showcases the means to use grid search from the
sklearn package by using the Keras wrapper for a dummy model.











In [None]:
from keras import Sequential
from sklearn.model_selection import GridSearchCV
# Note: this wrapper may not be available in recent Keras/TensorFlow
# releases; the SciKeras package provides a replacement KerasClassifier.
from keras.wrappers.scikit_learn import KerasClassifier
from keras.layers import Dense
import numpy as np

#Generate dummy data for 3 features and 1000 samples
x_train = np.random.random((1000, 3))
#Generate dummy results for 1000 samples: 1 or 0
y_train = np.random.randint(2, size=(1000, 1))

#Create a python function that returns a compiled DNN model
def create_dnn_model():
    model = Sequential()
    model.add(Dense(12, input_dim=3, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

#Use Keras wrapper to package the model as an sklearn object
model = KerasClassifier(build_fn=create_dnn_model)
#Define the grid search parameters
batch_size = [32, 64, 128]
epochs = [15, 30, 60]
#Create a dictionary with the parameters
param_grid = {"batch_size": batch_size, "epochs": epochs}
#Invoke the grid search method with the dictionary of hyperparameters
grid_model = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
#Train the model
grid_model.fit(x_train, y_train)
#Extract the best model from the grid search
best_model = grid_model.best_estimator_
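
To see how quickly a grid grows, the combinations can be enumerated with the standard library alone. The three hyperparameters and their candidate values below are hypothetical; the sketch only shows how many models a grid search over them would have to train.

```python
from itertools import product

# Hypothetical candidate values for three hyperparameters.
grid = {
    "batch_size": [32, 64, 128],
    "epochs": [15, 30, 60],
    "learning_rate": [0.1, 0.01, 0.001],
}

# One dictionary per point in the grid: every value paired with every other.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))   # 27
print(combinations[0])
```

Three values for each of three hyperparameters already mean 3 × 3 × 3 = 27 training runs, before any cross-validation folds are counted.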

###**Random Search**


An improved alternative to grid search is random search. In a random
search, rather than selecting a value for a hyperparameter such as the
learning rate from a defined list of numbers, we instead choose it randomly
from a distribution. This is, however, only possible for numeric
hyperparameters. So, instead of trying a learning rate of 0.1, 0.01, or
0.001, the search picks a random value for the learning rate from a
distribution we define with some properties. The parameter now has a larger
range of values to experiment with and also a higher chance of reaching
better performance. It overcomes the disadvantage of a human guessing the
best value for the hyperparameter within a fixed list by introducing
randomness into the selection. In practice, random search often
outperforms grid search.
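
Drawing a learning rate from a distribution instead of a fixed list can be sketched as follows. The log-uniform distribution, the bounds of 0.0001 and 0.1, and the function name `sample_learning_rate` are assumptions chosen for illustration; any distribution suited to the hyperparameter could be used.

```python
import math
import random


def sample_learning_rate(rng, low=1e-4, high=1e-1):
    """Draw a learning rate log-uniformly between low and high."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent


rng = random.Random(42)     # fixed seed so the sketch is repeatable
candidates = [sample_learning_rate(rng) for _ in range(5)]
print(candidates)
```

Sampling the exponent rather than the value itself gives each order of magnitude an equal chance of being tried.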

##**Model Deployment**

Now, we can finally discuss a few important pointers on model
deployment. We started with learning Keras and DL, experimented with
actual DNNs for regression and classification, and then discussed tuning
hyperparameters for improved model performance. We can now discuss a
few guidelines for deploying a DL model in a production environment.
I want to clarify that we won’t actually be learning the process of deploying
a model in production as a software engineer or discuss the DL software
pipeline and architecture for a large enterprise project. We will instead
focus on a couple of important aspects to be kept in mind while deploying
the actual model.

##**Saving Models to Memory**

Another useful point we didn’t discuss during the course of this chapter is
saving the model as a file and reusing it at some other point
in time. The reason this becomes extremely important in DL is the time
consumed in training large models. You shouldn’t be surprised when
you encounter DL engineers who have been training models for weeks
at a stretch on a supercomputer. Modern DL models that work with
image, audio, and unstructured text data consume a significant amount
of time for training. A handy practice in such scenarios is to have
the ability to pause and resume training for a DL model and also save the
intermediate results so that the training performed up to a certain point
in time doesn’t go to waste. This can be achieved with a simple callback (a
procedure in Keras that can be applied to the model at different stages of
training) that saves the weights of the model to a file along with the
model structure after a defined milestone. This saved model can later be
loaded again whenever you want to resume the training. All we need to do
is take care of saving the model structure as well as the weights after an
epoch or when we have the best model in place. Keras provides the ability
to save models after every epoch or to save only the best model during
training for multiple epochs.

An example of saving the best weights of a model during training for a
large number of epochs is shown in the following snippet.

In [None]:
from keras.callbacks import ModelCheckpoint

# Define the file path pattern for saving model weights
# {epoch:02d} will be replaced with the current epoch number
# {val_accuracy:.2f} will be replaced with the validation accuracy at that epoch
# (older Keras versions name this metric "val_acc")
filepath = "ModelWeights-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Create a checkpoint callback to save the best model weights during training
checkpoint = ModelCheckpoint(filepath, save_best_only=True, monitor="val_accuracy")

# Train the model using the defined checkpoint callback; validation_split
# holds out 20% of the data so that a validation accuracy is available
model.fit(x_train, y_train, callbacks=[checkpoint], epochs=100, batch_size=64,
          validation_split=0.2)

*Alternatively, you can also save a model in its entire form after
finishing training using the save_model method and later load it into
memory (maybe the next day) using the load_model method. An example
is shown in the following code snippet.*

In [None]:
from keras.models import load_model

#Train a model for defined number of epochs
model.fit(x_train, y_train, epochs=100, batch_size=64)
# Saves the entire model into a file named as 'dnn_model.h5'
model.save('dnn_model.h5')
# Later, (maybe another day), you can load the trained model for prediction.
model = load_model('dnn_model.h5')

##**Retraining the Models with New Data**

When you deploy your model into production, the ecosystem will continue
to generate more data, which can be used for training your models again.
Say, for the credit card fraud use case, you trained your model with 100K
samples and got a performance of 93% accuracy. You feel the performance
is good enough to get started, so you deploy your model into production.
Over a period of one month, an additional 10K samples become available from
the new transactions made by customers. Now you would want your model
to leverage this newly available data and improve its performance even
further. To achieve this, you don’t need to retrain the entire model from
scratch; you could instead use the pause-and-resume approach. All you need
to do is load the weights of the model already trained and train for a few
more epochs on the new samples. The weights it has already learned don’t
need to be discarded; you simply continue training with the incremental data.

##**Online Models**

An immediate question you may ponder after understanding the process
of retraining the model is how frequently you should do this: is it a
good approach to retrain every day, every week, or every month? The
right answer is to retrain as frequently as you want. There is no harm
in incrementally training your models every time a new data point is
available as long as the computation required is not a bottleneck. A good
practice would be to run a training iteration as soon as a new batch
of samples is available. So, if you have set a batch_size of 64, you could
automate the software infrastructure to ingest each newly available batch
of data, train the model on it, and thereby improve performance on
future predictions. An extremely aggressive way to keep the model
performance at its best would be to incrementally train with every new data
point and add previous samples as the remainder of the batch. This approach
is extremely computation intensive and also less rewarding, so becoming
ultra-real-time and incrementally training for every new sample
instead of a batch is usually not recommended.
Such models, which are always learning as and when a new batch of
data is available, are called online models. The most popular examples
of online models can be seen on your phone. Features like predictive text
and autocorrect improve dramatically over time. If you generally type in
a specific style, say combining two languages or shortening a few words or
using slang and so on, you will notice that the mobile phone quite actively
adapts to it.
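
The batch-at-a-time automation described above can be sketched as a small buffer that collects incoming samples and triggers training once a full batch is available. The class name `BatchTrainer` and the `train_on_batch` callback are hypothetical; in a real system the callback would hand the batch to the deployed Keras model for one incremental training step.

```python
class BatchTrainer:
    """Collect incoming samples and trigger a training step per full batch."""

    def __init__(self, batch_size, train_on_batch):
        self.batch_size = batch_size
        self.train_on_batch = train_on_batch   # callable taking a list of samples
        self.buffer = []

    def add_sample(self, sample):
        self.buffer.append(sample)
        if len(self.buffer) == self.batch_size:
            self.train_on_batch(self.buffer)
            self.buffer = []                   # start collecting the next batch


trained_batches = []                           # stand-in for real training
trainer = BatchTrainer(batch_size=64, train_on_batch=trained_batches.append)
for i in range(150):                           # 150 hypothetical new data points
    trainer.add_sample(i)

print(len(trained_batches), len(trainer.buffer))   # 2 22
```

Of the 150 samples, 128 have triggered two training steps and the remaining 22 wait in the buffer until the next batch is complete.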

##**Delivering Your Model As an API**

The best practice today for delivering your model as a service to a larger
software stack is to deliver it as an API. This is extremely useful and
effective, as it removes the tech-stack requirements. Your
model can easily integrate with a diverse and complex set of
components in a software ecosystem, and you can worry less about the
language or framework you used to develop the model. While Python and
Keras are almost universal in today’s modern
tech stack, we can still expect a few exceptions where this choice might
not be easy to integrate. Therefore, we can always choose an API
as the preferred mode of deployment for a DL model and define the
requirements for the data and calling style of the API appropriately.
There are two extremely useful and easy-to-operate options for
deploying your service as an API. You could either use Flask (a lightweight
Python web framework) or Amazon SageMaker (available on AWS). There
are other options too, and I encourage you to explore them. There is a
well-written article on the Keras blog on deploying your DL model
using Flask.
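
Whatever framework serves the API, the core of a prediction endpoint is a function that validates a request and returns a response. The sketch below is framework-agnostic and uses only the standard library; the JSON layout with a "features" key, the function name `handle_predict`, and the `dummy_predict` stand-in are all assumptions for illustration. A Flask route, for instance, would pass the request body to such a function and return its result.

```python
import json


def handle_predict(body, predict_fn, n_features):
    """Turn a JSON request body into (status_code, JSON response body)."""
    try:
        features = json.loads(body)["features"]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected JSON with a 'features' list"})
    if not isinstance(features, list) or len(features) != n_features:
        return 400, json.dumps({"error": f"expected {n_features} features"})
    return 200, json.dumps({"prediction": predict_fn(features)})


# Stand-in for a trained model; a real service would call the model here.
def dummy_predict(features):
    return sum(features) / len(features)


print(handle_predict('{"features": [0.2, 0.4, 0.6]}', dummy_predict, 3))
print(handle_predict('{"wrong_key": 1}', dummy_predict, 3))
```

Keeping the validation separate from the web framework makes the calling style of the API explicit and easy to test.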