## <font color='darkblue'>Preface</font>
([article source](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/)) <font size='3ptx'><b>Hyperparameter optimization is a big part of deep learning.</b> The reason is that neural networks are notoriously difficult to configure and there are a lot of parameters that need to be set. On top of that, individual models can be very slow to train.</font>

In this post you will discover how to use the grid search capability of the scikit-learn Python machine learning library to tune the hyperparameters of Keras deep learning models. After reading this post you will know:
* How to wrap Keras models for use in scikit-learn and how to use grid search.
* How to grid search common neural network parameters such as learning rate, dropout rate, epochs and number of neurons.
* How to define your own hyperparameter tuning experiments on your own projects.

<a id='sect0'></a>
### <font color='darkgreen'>Overview</font>
In this post, I want to show you both how you can use the scikit-learn grid search capability and give you a suite of examples that you can copy-and-paste into your own project as a starting point.

Below is a list of the topics we are going to cover:
1. <font size='3ptx'><b><a href='#sect1'>How to use Keras models in scikit-learn.</a></b></font>
2. <font size='3ptx'><b><a href='#sect2'>How to use grid search in scikit-learn.</a></b></font>
3. <font size='3ptx'><b><a href='#sect3'>How to tune batch size and training epochs.</a></b></font>
4. <font size='3ptx'><b><a href='#sect4'>How to tune optimization algorithms.</a></b></font>
5. <font size='3ptx'><b><a href='#sect5'>How to tune learning rate and momentum.</a></b></font>
6. <font size='3ptx'><b><a href='#sect6'>How to tune network weight initialization.</a></b></font>
7. <font size='3ptx'><b><a href='#sect7'>How to tune activation functions.</a></b></font>
8. <font size='3ptx'><b><a href='#sect8'>How to tune dropout regularization.</a></b></font>
9. <font size='3ptx'><b><a href='#sect9'>How to tune the number of neurons in the hidden layer.</a></b></font>
10. <font size='3ptx'><b><a href='#sect10'>Tips for Hyperparameter Optimization</a></b></font>

<a id='sect1'></a>
## <font color='darkblue'>How to Use Keras Models in scikit-learn</font>
<font size='3ptx'><b>Keras models can be used in scikit-learn by wrapping them with the <font color='blue'>KerasClassifier</font> or <font color='blue'>KerasRegressor</font> class from the module [SciKeras](https://pypi.org/project/scikeras/)</b>. You may need to run the command <font color='blue'>pip install scikeras</font> first to install the module.</font>

In [1]:
#!pip install scikeras

To use these wrappers you must define a function that creates and returns your Keras sequential model, then pass this function to the model argument when constructing the <b><font color='blue'>KerasClassifier</font></b> class. For example:
```python
def create_model():
	...
	return model

model = KerasClassifier(model=create_model)
```
<br/>

The constructor for the <b><font color='blue'>KerasClassifier</font></b> class can also take new arguments that can be passed to your custom <font color='blue'>create_model()</font> function. These new arguments must also be defined in the signature of your <font color='blue'>create_model()</font> function with default parameters:
```python
def create_model(dropout_rate=0.0):
	...
	return model

model = KerasClassifier(model=create_model, dropout_rate=0.2)
```
<br/>

You can learn more about these from the [SciKeras documentation](https://www.adriangb.com/scikeras/stable/index.html).

In [47]:
import os
import logging
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.constraints import MaxNorm
from tensorflow.keras.layers import Dropout
from scikeras.wrappers import KerasClassifier

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
np.seterr(all="ignore")
tf.get_logger().setLevel(logging.ERROR)
print(tf.__version__)

2.9.1


<a id='sect2'></a>
## <font color='darkblue'>How to Use Grid Search in scikit-learn</font>
<font size='3ptx'><b>Grid search is a model hyperparameter optimization technique.</b> In scikit-learn this technique is provided in the [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class.</font>

When constructing this class you must provide a dictionary of hyperparameters to evaluate in the <font color='violet'>param_grid</font> argument. This is a map of the model parameter name and an array of values to try.

By default, the estimator's own `score` method is used, which for a classifier is accuracy, but other scores can be specified in the <font color='violet'>scoring</font> argument of the [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) constructor.

<b>By default, the grid search will only use one thread. By setting the <font color='violet'>n_jobs</font> argument in the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) constructor to -1, the process will use all cores on your machine</b>. However, sometimes this may interfere with the main neural network training process.

The [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) process will then construct and evaluate one model for each combination of parameters. <b>Cross validation is used to evaluate each individual model. The default is 5-fold cross validation in scikit-learn 0.22 and later (3-fold in earlier versions)</b>, although this can be overridden by specifying the <font color='violet'>cv</font> argument to the [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) constructor. The examples in this post set `cv=3` explicitly.

Below is an example of defining a simple grid search:
```python
param_grid = dict(epochs=[10,20,30])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
```
<br/>

Once completed, you can access the outcome of the grid search in the result object returned from <font color='blue'>grid.fit()</font>. The `best_score_` member provides access to the best score observed during the optimization procedure and the `best_params_` describes the combination of parameters that achieved the best results.

You can learn more about the [GridSearchCV class in the scikit-learn API documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
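To make the cost of a grid concrete, the sketch below expands a parameter grid into its individual combinations using only the standard library. It illustrates the idea and is not scikit-learn's implementation; the helper name `expand_grid` is hypothetical, and the grid values are the ones used in the batch size example later in this post.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of values in param_grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]

param_grid = dict(batch_size=[10, 20, 40, 60, 80, 100], epochs=[10, 50, 100])
combinations = expand_grid(param_grid)

cv = 3  # number of cross-validation folds
n_fits = len(combinations) * cv  # one model is trained per combination per fold

print(len(combinations), "combinations,", n_fits, "model fits")
```

With the default `refit=True`, the grid search also trains one more model on the full dataset using the best combination, so the total training cost grows quickly as parameters are added to the grid.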

## <font color='darkblue'>Problem Description</font> ([back](#sect0))
<font size='3ptx'><b>Now that we know how to use Keras models with scikit-learn and how to use grid search in scikit-learn, let’s look at a bunch of examples.</b> All examples will be demonstrated on a small standard machine learning dataset called the [Pima Indians onset of diabetes classification dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes). This is a small dataset with all numerical attributes that is easy to work with.</font> 

[Download the dataset](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv) and place it in path <font color='olive'>datas/kaggle_pima-indians-diabetes-database/diabetes.csv</font>.

<b>As we proceed through the examples in this post, we will aggregate the best parameters</b>. This is not the best way to grid search because parameters can interact, but it is good for demonstration purposes.

In [3]:
# load dataset
df = pd.read_csv("../../datas/kaggle_pima-indians-diabetes-database/diabetes.csv")

In [4]:
df.sample(n=3)

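| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 178 | 5 | 143 | 78 | 0 | 0 | 45.0 | 0.19 | 47 | 0 |
| 26 | 7 | 147 | 76 | 0 | 0 | 39.4 | 0.257 | 43 | 1 |
| 134 | 2 | 96 | 68 | 13 | 49 | 21.1 | 0.647 | 26 | 0 |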


<a id='sect3'></a>
## <font color='darkblue'>How to Tune Batch Size and Number of Epochs</font> ([back](#sect0))
<font size='3ptx'><b>In this first simple example, we look at tuning the batch size and number of epochs used when fitting the network.</b></font>

The **batch size** in [iterative gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Iterative_method) is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The **number of epochs** is the number of times that the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.
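The relationship between batch size, epochs and the number of weight updates can be worked out with simple arithmetic. The sketch below assumes 512 training rows per fold, which is what 3-fold cross validation would give on a dataset of 768 rows; both numbers are assumptions for illustration, and the helper name is hypothetical.

```python
import math

def weight_updates(n_samples, batch_size, epochs):
    """Total number of weight updates when the last partial batch is kept."""
    updates_per_epoch = math.ceil(n_samples / batch_size)
    return updates_per_epoch * epochs

n_train = 512  # assumed training rows per fold (2/3 of 768)

for batch_size in [10, 20, 40, 60, 80, 100]:
    print(batch_size, weight_updates(n_train, batch_size, epochs=100))
```

Smaller batches give many more updates for the same number of epochs, which is one reason batch size and epochs are usually tuned together.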

Here we will evaluate mini batch sizes of 10, 20, 40, 60, 80 and 100, each with 10, 50 and 100 training epochs.

The full code listing is provided below.

In [5]:
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

In [6]:
# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [7]:
# split into input (X) and output (Y) variables
X = df.loc[:, df.columns != 'Outcome']
Y = df['Outcome']

# create model
model = KerasClassifier(model=create_model, verbose=0)

# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [8]:
%%time
grid_result = grid.fit(X, Y)

2022-07-09 16:48:54.267675: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-07-09 16:48:54.267728: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-09 16:48:54.267759: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ubuntu): /proc/driver/nvidia/version does not exist
2022-07-09 16:48:54.268119: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


CPU times: user 5.99 s, sys: 4.12 s, total: 10.1 s
Wall time: 1min 3s


In [9]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.691406 using {'batch_size': 20, 'epochs': 100}


In [10]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.610677 (0.023939) with: {'batch_size': 10, 'epochs': 10}
0.674479 (0.048824) with: {'batch_size': 10, 'epochs': 50}
0.688802 (0.020752) with: {'batch_size': 10, 'epochs': 100}
0.640625 (0.016877) with: {'batch_size': 20, 'epochs': 10}
0.640625 (0.011500) with: {'batch_size': 20, 'epochs': 50}
0.691406 (0.019918) with: {'batch_size': 20, 'epochs': 100}
0.626302 (0.034401) with: {'batch_size': 40, 'epochs': 10}
0.627604 (0.014382) with: {'batch_size': 40, 'epochs': 50}
0.658854 (0.032264) with: {'batch_size': 40, 'epochs': 100}
0.496094 (0.017758) with: {'batch_size': 60, 'epochs': 10}
0.656250 (0.044309) with: {'batch_size': 60, 'epochs': 50}
0.669271 (0.007366) with: {'batch_size': 60, 'epochs': 100}
0.537760 (0.084766) with: {'batch_size': 80, 'epochs': 10}
0.588542 (0.009207) with: {'batch_size': 80, 'epochs': 50}
0.679688 (0.014616) with: {'batch_size': 80, 'epochs': 100}
0.488281 (0.101262) with: {'batch_size': 100, 'epochs': 10}
0.566406 (0.044993) with: {'batch_size': 100, 'epo

We can see that a batch size of 20 with 100 epochs achieved the best result of about 69% accuracy.

<a id='sect4'></a>
## <font color='darkblue'>How to Tune the Training Optimization Algorithm</font> ([back](#sect0))
<font size='3ptx'><b>Keras offers a suite of different state-of-the-art optimization algorithms.</b> In this example, we tune the optimization algorithm used to train the network, each with default parameters.</font>

This is an odd example, because often you will choose one approach a priori and instead focus on tuning its parameters on your problem (<font color='brown'>e.g. see the next example</font>). Here we will evaluate the [suite of optimization algorithms supported by the Keras API](http://keras.io/optimizers/).

The full code listing is provided below.

In [12]:
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # return model without compile
    return model

In [13]:
# create model
model = KerasClassifier(
    model=create_model, loss="binary_crossentropy", epochs=100, batch_size=10, verbose=0)

# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [18]:
%%time
grid_result = grid.fit(X, Y)

CPU times: user 10.4 s, sys: 7.54 s, total: 17.9 s
Wall time: 1min 32s


In [16]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.709635 using {'optimizer': 'Adam'}


In [17]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.649740 (0.003683) with: {'optimizer': 'SGD'}
0.671875 (0.017758) with: {'optimizer': 'RMSprop'}
0.567708 (0.025582) with: {'optimizer': 'Adagrad'}
0.436198 (0.094847) with: {'optimizer': 'Adadelta'}
0.709635 (0.020256) with: {'optimizer': 'Adam'}
0.658854 (0.018688) with: {'optimizer': 'Adamax'}
0.695312 (0.006379) with: {'optimizer': 'Nadam'}


Note that the function <font color='blue'>create_model()</font> defined above does not return a compiled model, unlike the one in the previous example. This is because <b>setting an optimizer for a Keras model is done in the <font color='blue'>compile()</font> function call</b>, so it is better to leave compilation to the <b><font color='blue'>KerasClassifier</font></b> wrapper and let [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) vary the optimizer. Also note that we specified <font color='blue'>loss="binary_crossentropy"</font> in the wrapper, because the loss must also be set during the <font color='blue'>compile()</font> function call.

The <b><font color='blue'>KerasClassifier</font></b> wrapper will not compile your model again if it is already compiled. The other way to run [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is therefore to make `optimizer` an argument of the <font color='blue'>create_model()</font> function and have it return an appropriately compiled model, like the following:

In [20]:
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

In [21]:
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)

# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(model__optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [22]:
%%time
grid_result = grid.fit(X, Y)

CPU times: user 11.8 s, sys: 8.8 s, total: 20.6 s
Wall time: 1min 46s


In [23]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.700521 using {'model__optimizer': 'Adam'}




The results suggest that the ADAM optimization algorithm is the best with a score of about 70% accuracy.

Note that in the above, we have the prefix `model__` in the parameter dictionary `param_grid`. This is required for the <font color='blue'><b>KerasClassifier</b></font> in the [**SciKeras**](https://pypi.org/project/scikeras/) module to make clear that the parameter needs to be routed to the <font color='blue'>create_model()</font> function as an argument, rather than being a parameter to set up in <font color='blue'>compile()</font> or <font color='blue'>fit()</font>. See also the [routed parameter section](https://www.adriangb.com/scikeras/stable/advanced.html#routed-parameters) of the SciKeras documentation.
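The routing idea can be pictured as splitting a flat parameter dictionary by prefix. The helper below is a hypothetical illustration written for this post, not SciKeras code; the real wrapper handles more cases, which are described in its documentation.

```python
def split_by_prefix(params, prefixes=("model", "optimizer")):
    """Group parameters named '<prefix>__<name>' under their prefix.

    Hypothetical helper for illustration only; parameters without a known
    prefix are collected under the key 'wrapper'.
    """
    routed = {prefix: {} for prefix in prefixes}
    routed["wrapper"] = {}
    for key, value in params.items():
        prefix, sep, name = key.partition("__")
        if sep and prefix in prefixes:
            routed[prefix][name] = value
        else:
            routed["wrapper"][key] = value
    return routed

params = {
    "model__optimizer": "Adam",        # argument of create_model()
    "optimizer__learning_rate": 0.01,  # argument of the optimizer
    "epochs": 100,                     # no prefix: stays with the wrapper
}
print(split_by_prefix(params))
```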

<a id='sect5'></a>
## <font color='darkblue'>How to Tune Learning Rate and Momentum</font> ([back](#sect0))
<font size='3ptx'><b>It is common to pre-select an optimization algorithm to train your network and tune its parameters.</b></font>

By far the most common optimization algorithm is plain old [**Stochastic Gradient Descent**](http://keras.io/optimizers/#sgd) (SGD) because it is so well understood. In this example, we will look at optimizing the SGD learning rate and momentum parameters.

<b><font color='darkblue'>Learning rate</font> controls how much to update the weight at the end of each batch and the <font color='darkblue'>momentum</font> controls how much to let the previous update influence the current weight update.</b>

We will try a suite of small standard learning rates and momentum values from 0.0 to 0.8 in steps of 0.2, as well as 0.9 (because it can be a popular value in practice). In Keras, the way to set the learning rate and momentum is [the following](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD):
```python
...
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.2)
```
<br/>

In the [**SciKeras**](https://pypi.org/project/scikeras/) wrapper, we route these parameters to the optimizer with the prefix `optimizer__`.

Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (<font color='brown'>learning rate</font>), the number of updates per epoch (<font color='brown'>batch size</font>) and the number of epochs.
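To see what the learning rate and momentum do, here is a minimal sketch of the momentum update in plain Python, applied to the one-dimensional function f(w) = w². It follows the update rule given in the Keras SGD documentation (velocity = momentum * velocity - learning_rate * gradient, then w = w + velocity); the function, starting point and step count are arbitrary choices for illustration.

```python
def sgd_momentum(grad, w, learning_rate, momentum, steps):
    """Minimise a 1-D function with SGD plus momentum, given its gradient."""
    velocity = 0.0
    for _ in range(steps):
        # blend the previous update with the new gradient step
        velocity = momentum * velocity - learning_rate * grad(w)
        w = w + velocity
    return w

def grad(w):
    # gradient of f(w) = w ** 2
    return 2.0 * w

for momentum in [0.0, 0.5, 0.9]:
    w_final = sgd_momentum(grad, w=5.0, learning_rate=0.01,
                           momentum=momentum, steps=100)
    print(momentum, round(w_final, 4))
```

With the same learning rate and number of steps, a larger momentum carries more of each previous update forward, so the runs end at different distances from the minimum at w = 0.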

The full code listing is provided below:

In [24]:
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model

In [26]:
model = KerasClassifier(
    model=create_model, loss="binary_crossentropy", optimizer="SGD",
    epochs=100, batch_size=10, verbose=0)

# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(
    optimizer__learning_rate=learn_rate,
    optimizer__momentum=momentum)

grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [27]:
%%time
grid_result = grid.fit(X, Y)

CPU times: user 17.8 s, sys: 13.7 s, total: 31.5 s
Wall time: 5min 52s


In [28]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.677083 using {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.2}


In [29]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.675781 (0.019918) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.0}
0.677083 (0.040637) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.2}
0.673177 (0.025780) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.4}
0.675781 (0.022097) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.6}
0.670573 (0.006639) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.8}
0.673177 (0.027126) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.9}
0.652344 (0.019137) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.0}
0.615885 (0.076966) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.2}
0.651042 (0.003683) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.4}
0.649740 (0.003683) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.6}
0.652344 (0.003189) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.8}
0.651042 (0.001

We can see that SGD does relatively poorly on this problem. Nevertheless, the best results were achieved using a learning rate of 0.001 and a momentum of 0.2, with an accuracy of about 68%.

<a id='sect6'></a>
## <font color='darkblue'>How to Tune Network Weight Initialization</font> ([back](#sect0))
<font size='3ptx'><b>Neural network weight initialization used to be simple: use small random values.</b></font>

Now there is a suite of different techniques to choose from. [Keras provides a laundry list](https://keras.io/api/layers/initializers/).

<b>In this example, we will look at tuning the selection of network weight initialization by evaluating all of the available techniques.</b>

<b>We will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer.</b> In the example below we use rectifier for the hidden layer. We use sigmoid for the output layer because the predictions are binary. The weight initialization is now an argument to the <font color='blue'>create_model()</font> function, so we need the `model__` prefix to ask <font color='blue'><b>KerasClassifier</b></font> to route the parameter to the model creation function.
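For a sense of what the different schemes mean in practice, the sketch below computes the sampling limit of three of the uniform initializers for the hidden layer used here (8 inputs, 12 units). The formulas are the ones stated in the Keras initializer documentation; the helper name `uniform_limits` is hypothetical.

```python
import math

def uniform_limits(fan_in, fan_out):
    """Limits L such that weights are drawn from U(-L, L)."""
    return {
        'glorot_uniform': math.sqrt(6.0 / (fan_in + fan_out)),
        'he_uniform': math.sqrt(6.0 / fan_in),
        'lecun_uniform': math.sqrt(3.0 / fan_in),
    }

# hidden layer of the example network: 8 input features, 12 neurons
limits = uniform_limits(fan_in=8, fan_out=12)
for name, limit in limits.items():
    print("%s: +/- %.3f" % (name, limit))
```

By comparison, the plain `'uniform'` initializer draws from a fixed range (-0.05 to 0.05 by default in Keras) that does not depend on the layer size.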

The full code listing is provided below:

In [30]:
# Function to create model, required for KerasClassifier
def create_model(init_mode='uniform'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), kernel_initializer=init_mode, activation='relu'))
    model.add(Dense(1, kernel_initializer=init_mode, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [31]:
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)

# define the grid search parameters
init_mode = [
    'uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal',
    'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(model__init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [32]:
%%time
grid_result = grid.fit(X, Y)

CPU times: user 9.52 s, sys: 7.36 s, total: 16.9 s
Wall time: 1min 45s


In [33]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.727865 using {'model__init_mode': 'uniform'}


In [34]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.727865 (0.018136) with: {'model__init_mode': 'uniform'}
0.683594 (0.035084) with: {'model__init_mode': 'lecun_uniform'}
0.726562 (0.020915) with: {'model__init_mode': 'normal'}
0.651042 (0.001841) with: {'model__init_mode': 'zero'}
0.707031 (0.008438) with: {'model__init_mode': 'glorot_normal'}
0.688802 (0.024774) with: {'model__init_mode': 'glorot_uniform'}
0.687500 (0.019401) with: {'model__init_mode': 'he_normal'}
0.703125 (0.006379) with: {'model__init_mode': 'he_uniform'}


We can see that the best results were achieved with a uniform weight initialization scheme, with an accuracy of about 73%.
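The top scores above are close together, so it helps to compare the gap between them against the spread across folds. The sketch below ranks the scores printed above (copied by hand into a list) and flags every configuration whose mean lies within one standard deviation of the best one. The one-standard-deviation threshold is a rule of thumb chosen for illustration, not something scikit-learn applies.

```python
# (mean, std, init_mode) copied from the grid search output above
results = [
    (0.727865, 0.018136, 'uniform'),
    (0.683594, 0.035084, 'lecun_uniform'),
    (0.726562, 0.020915, 'normal'),
    (0.651042, 0.001841, 'zero'),
    (0.707031, 0.008438, 'glorot_normal'),
    (0.688802, 0.024774, 'glorot_uniform'),
    (0.687500, 0.019401, 'he_normal'),
    (0.703125, 0.006379, 'he_uniform'),
]

ranked = sorted(results, reverse=True)  # highest mean first
best_mean, best_std, best_name = ranked[0]

# configurations whose mean is within one std (of the best run) of the best mean
close = [name for mean, _, name in ranked if best_mean - mean <= best_std]

print("best:", best_name)
print("within one std of best:", close)
```

Read this way, the difference between the top configurations is smaller than the fold-to-fold variation, so the ranking among them should not be over-interpreted.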

<a id='sect7'></a>
## <font color='darkblue'>How to Tune the Neuron Activation Function</font> ([back](#sect0))
<font size='3ptx'><b>The activation function controls the non-linearity of individual neurons and when to fire.</b> Generally, the [rectifier activation function](https://keras.io/api/layers/activations/#relu-function) is the most popular, but it used to be the sigmoid and the tanh functions and these functions may still be more suitable for different problems.</font>

<b>In this example, we will evaluate the [suite of different activation functions available in Keras](https://keras.io/api/layers/activations/)</b>. We will only use these functions in the hidden layer, as we require a sigmoid activation function in the output for the binary classification problem. Similar to the previous example, this is an argument to the <font color='blue'>create_model()</font> function and we will use the `model__` prefix for the <font color='blue'><b>GridSearchCV</b></font> parameter grid.
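As a reminder of how a few of these functions differ, here they are written out in plain Python for a single input value. These are textbook definitions for illustration, not the Keras implementations, which operate on tensors and are numerically more careful.

```python
import math

def relu(x):
    # passes positive values through, clips negatives to zero
    return max(0.0, x)

def softplus(x):
    # smooth approximation of relu: log(1 + e^x)
    return math.log1p(math.exp(x))

def sigmoid(x):
    # squashes any input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for x in [-2.0, 0.0, 2.0]:
    print(x, relu(x), round(softplus(x), 4),
          round(sigmoid(x), 4), round(math.tanh(x), 4))
```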

Generally, it is a good idea to prepare data to the range of the different transfer functions, which we will not do in this case.

The full code listing is provided below:

In [35]:
# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), kernel_initializer='uniform', activation=activation))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [36]:
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(model__activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [37]:
%%time
grid_result = grid.fit(X, Y)

CPU times: user 11.4 s, sys: 9.29 s, total: 20.7 s
Wall time: 1min 58s


In [38]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.739583 using {'model__activation': 'softplus'}


In [39]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.656250 (0.005524) with: {'model__activation': 'softmax'}
0.739583 (0.028764) with: {'model__activation': 'softplus'}
0.680990 (0.008027) with: {'model__activation': 'softsign'}
0.692708 (0.038051) with: {'model__activation': 'relu'}
0.682292 (0.030314) with: {'model__activation': 'tanh'}
0.683594 (0.028705) with: {'model__activation': 'sigmoid'}
0.678385 (0.028940) with: {'model__activation': 'hard_sigmoid'}
0.705729 (0.007366) with: {'model__activation': 'linear'}




Surprisingly (to me at least), the ‘softplus’ activation function achieved the best results with an accuracy of about 74%.

<a id='sect8'></a>
## <font color='darkblue'>How to Tune Dropout Regularization</font> ([back](#sect0))
<font size='3ptx'><b>In this example, we will look at tuning the [dropout rate for regularization](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/) in an effort to limit overfitting and improve the model’s ability to generalize.</b></font>

To get good results, dropout is best combined with a weight constraint such as the max norm constraint. For more on using dropout in deep learning models with Keras see the post:
* [Dropout Regularization in Deep Learning Models With Keras](https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/)

This involves fitting both the dropout percentage and the weight constraint. We will try dropout percentages between 0.0 and 0.9 (<font color='darkbrown'>1.0 does not make sense</font>) and MaxNorm weight constraint values between 1 and 5.
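The two ideas being tuned here can be sketched in plain Python. The functions below are simplified illustrations with hypothetical names, not the Keras layers: `dropout` zeroes each value with probability `rate` and rescales the survivors by 1 / (1 - rate), and `max_norm` shrinks a weight vector whose L2 norm exceeds the limit. The rescaling also shows why a rate of 1.0 is excluded, since it would drop every unit and divide by zero.

```python
import math
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

def max_norm(weights, max_value):
    """Rescale a weight vector so its L2 norm does not exceed max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    return [w * max_value / norm for w in weights]

rng = random.Random(7)  # fixed seed so the sketch is repeatable
dropped = dropout([1.0, 1.0, 1.0, 1.0, 1.0], rate=0.2, rng=rng)
clipped = max_norm([3.0, 4.0], max_value=2.0)  # norm 5.0 is scaled down to 2.0
print(dropped)
print(clipped)
```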

The full code listing is provided below.

In [40]:
# Function to create model, required for KerasClassifier
def create_model(dropout_rate, weight_constraint):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,),
                    kernel_initializer='uniform',
                    activation='linear',
                    kernel_constraint=MaxNorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [43]:
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)

# define the grid search parameters
weight_constraint = [1.0, 2.0, 3.0, 4.0, 5.0]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(
    model__dropout_rate=dropout_rate,
    model__weight_constraint=weight_constraint)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [48]:
%%time
grid_result = grid.fit(X, Y)

CPU times: user 11.1 s, sys: 7.24 s, total: 18.3 s
Wall time: 10min 13s


In [49]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.725260 using {'model__dropout_rate': 0.0, 'model__weight_constraint': 2.0}


In [50]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.723958 (0.013279) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 1.0}
0.725260 (0.012075) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 2.0}
0.713542 (0.012075) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 3.0}
0.695312 (0.005524) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 4.0}
0.717448 (0.009207) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 5.0}
0.703125 (0.006379) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 1.0}
0.705729 (0.014731) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 2.0}
0.694010 (0.013279) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 3.0}
0.716146 (0.009744) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 4.0}
0.722656 (0.019137) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 5.0}
0.709635 (0.014382) with: {'model__dropout_rate': 0.2, 'model__weight_constraint': 1.0}
0.716146 (0.012890) with: {'mode

We can see that a dropout rate of 0% and a MaxNorm weight constraint of 2 resulted in the best accuracy of about 73%. You may notice that some of the results are nan. This is probably because the input is not normalized, so a run can end up with a degenerate model by chance.

<a id='sect9'></a>
## <font color='darkblue'>How to Tune the Number of Neurons in the Hidden Layer</font> ([back](#sect0))
<font size='3ptx'><b>The number of neurons in a layer is an important parameter to tune. Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.</b></font>

Also, generally, a large enough single layer network can approximate any other neural network, [at least in theory](https://en.wikipedia.org/wiki/Universal_approximation_theorem).

<b>In this example, we will look at tuning the number of neurons in a single hidden layer. We will try values from 15 to 35 in steps of 5.</b>

A larger network requires more training, so ideally at least the batch size and number of epochs should be tuned together with the number of neurons.
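To get a feel for what tuning these together would cost, the sketch below counts the model fits a joint grid requires during cross-validation. The batch size and epoch values are hypothetical choices for illustration. The neuron values and the 3 folds match the example in this section.

```python
from itertools import product

# Neuron values match the grid used in this section
neurons = list(range(15, 40, 5))
# Hypothetical values, chosen only to illustrate how the grid grows
batch_size = [10, 20, 40]
epochs = [50, 100]

param_grid = dict(model__neurons=neurons, batch_size=batch_size, epochs=epochs)

# Every combination is trained once per cross-validation fold
combinations = list(product(*param_grid.values()))
cv = 3
total_fits = len(combinations) * cv
print("%d combinations x %d folds = %d fits" % (len(combinations), cv, total_fits))
```

A dictionary of this shape should be usable as the `param_grid` argument of `GridSearchCV`, at the price of a much longer search than the single-parameter grid.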

The full code listing is provided below.

In [59]:
# Function to create model, required for KerasClassifier
def create_model(neurons):
    # create model
    model = Sequential()
    model.add(
        Dense(
            neurons, input_shape=(8,), kernel_initializer='uniform',
            activation='softplus', kernel_constraint=MaxNorm(2)))
    #model.add(Dropout(0.2))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [66]:
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)

# define the grid search parameters
neurons = range(15, 40, 5)
param_grid = dict(model__neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

In [67]:
%%time
grid_result = grid.fit(X, Y)

CPU times: user 10.7 s, sys: 7.79 s, total: 18.4 s
Wall time: 1min 16s


In [68]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.753906 using {'model__neurons': 35}


In [69]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.729167 (0.016367) with: {'model__neurons': 15}
0.722656 (0.027251) with: {'model__neurons': 20}
0.736979 (0.029635) with: {'model__neurons': 25}
0.740885 (0.032264) with: {'model__neurons': 30}
0.753906 (0.033754) with: {'model__neurons': 35}


We can see that the best results were achieved with a network with 35 neurons in the hidden layer with an accuracy of about 75%.
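One way to check how decisive this result is: compare the gap between the best mean and each other mean against the reported standard deviation. The sketch below uses the numbers printed above. The one-standard-deviation threshold is an informal rule of thumb assumed here, not a statistical test.

```python
# (neurons, mean accuracy, std) copied from the results printed above
results = [
    (15, 0.729167, 0.016367),
    (20, 0.722656, 0.027251),
    (25, 0.736979, 0.029635),
    (30, 0.740885, 0.032264),
    (35, 0.753906, 0.033754),
]

best_neurons, best_mean, best_std = max(results, key=lambda r: r[1])

# Configurations whose mean lies within one std of the best mean
close = [n for n, mean, _ in results if best_mean - mean <= best_std]
print("best:", best_neurons)
print("within one std of best:", close)
```

Here every configuration falls within one standard deviation of the best, which suggests that this run alone does not strongly support 35 neurons over the alternatives.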

<a id='sect10'></a>
## <font color='darkblue'>Tips for Hyperparameter Optimization</font> ([back](#sect0))
This section lists some handy tips to consider when tuning hyperparameters of your neural network.
* <font size='3ptx'>**k-fold Cross Validation**</font>. You can see that the results from the examples in this post show some variance. The examples used 3-fold cross-validation (`cv=3`), but perhaps k=5 or k=10 would be more stable. Carefully choose your cross validation configuration to ensure your results are stable.
* <font size='3ptx'>**Review the Whole Grid**</font>. Do not just focus on the best result. Review the whole grid of results and look for trends to support configuration decisions.
* <font size='3ptx'>**Parallelize**</font>. Use all your cores if you can: neural networks are slow to train and we often want to try many different parameters. Consider spinning up a lot of [**AWS instances**](https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-amazon-web-services/).
* <font size='3ptx'>**Use a Sample of Your Dataset**</font>. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.
* <font size='3ptx'>**Start with Coarse Grids**</font>. Start with coarse-grained grids and zoom into finer grained grids once you can narrow the scope.
* <font size='3ptx'>**Do not Transfer Results**</font>. Results are generally problem specific. Try to avoid favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead look for broader trends like number of layers or relationships between parameters.
* <font size='3ptx'>**Reproducibility is a Problem**</font>. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is presented in this post.
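
As a sketch of the coarse-to-fine tip, the helper below builds a finer grid between the neighbours of the best value found on a coarse grid. The function name `refine_grid` and its `step` parameter are hypothetical. The coarse grid and the best value are taken from the neuron example above.

```python
def refine_grid(coarse_values, best_value, step=1):
    """Return a finer integer grid spanning the coarse neighbours of best_value."""
    values = sorted(coarse_values)
    i = values.index(best_value)
    # Clamp to the ends of the coarse grid when the best value is on the edge
    low = values[max(i - 1, 0)]
    high = values[min(i + 1, len(values) - 1)]
    return list(range(low, high + 1, step))

coarse = list(range(15, 40, 5))  # 15, 20, 25, 30, 35
fine = refine_grid(coarse, best_value=35)
print(fine)
```

Because 35 sits at the upper edge of the coarse grid, the refined grid only covers 30 to 35. In that situation it may also be worth extending the coarse range upward before zooming in.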

## <font color='darkblue'>Supplement</font>
* [Stackoverflow - Disable Tensorflow debugging information](https://stackoverflow.com/questions/35911252/disable-tensorflow-debugging-information)
* [Stackoverflow - Why can't I suppress numpy warnings](https://stackoverflow.com/questions/29347987/why-cant-i-suppress-numpy-warnings)