Running on Mac 1.6GHz i5

### 1. Tune Batch Size and Number of Epochs

The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. 

It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The number of epochs is the number of times that the entire training dataset is shown to the network during training.
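The relationship between batch size and epochs can be made concrete with a small sketch. The helper names `iterate_minibatches` and `count_weight_updates` are illustrative and not part of Keras, and the sample count of 768 is an assumed value used only for the arithmetic.

```python
import math

def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of `samples` holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

def count_weight_updates(n_samples, batch_size, epochs):
    """One weight update per batch, repeated for every epoch."""
    updates_per_epoch = math.ceil(n_samples / batch_size)
    return updates_per_epoch * epochs

# Assumed sample count, for illustration only.
n_samples = 768
batches = list(iterate_minibatches(list(range(n_samples)), batch_size=10))
print(len(batches), len(batches[-1]))            # number of batches, size of the last one
print(count_weight_updates(n_samples, 10, 100))  # total updates over 100 epochs
```

A smaller batch size means more (and noisier) weight updates per epoch, which is one reason the two settings are usually tuned together.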

Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.

Here we will evaluate a suite of mini-batch sizes from 10 to 100 (10, 20, 40, 60, 80 and 100), each trained for 10, 50 and 100 epochs.
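To get a feel for how much work this search involves, the grid can be expanded by hand. This is only a sketch of the enumeration that a grid search performs: `expand_grid` is a hypothetical helper, and the fold count of 3 is an assumption, since the default depends on the scikit-learn version.

```python
from itertools import product

# Same values as the grid searched in this section.
param_grid = {
    "batch_size": [10, 20, 40, 60, 80, 100],
    "epochs": [10, 50, 100],
}

def expand_grid(grid):
    """Return every combination of the grid as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

combinations = expand_grid(param_grid)
n_folds = 3  # assumption: depends on the scikit-learn version and the cv argument
print(len(combinations), "combinations,", len(combinations) * n_folds, "model fits")
```

Every added hyperparameter multiplies the number of fits, so it helps to keep each list short.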

In [2]:
# Use scikit-learn to grid search the batch size and epochs
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, verbose=0)
# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.697917 using {'epochs': 100, 'batch_size': 60}
0.585938 (0.030425) with: {'epochs': 10, 'batch_size': 10}
0.447917 (0.164011) with: {'epochs': 50, 'batch_size': 10}
0.470052 (0.166534) with: {'epochs': 100, 'batch_size': 10}
0.466146 (0.086840) with: {'epochs': 10, 'batch_size': 20}
0.656250 (0.019137) with: {'epochs': 50, 'batch_size': 20}
0.552083 (0.153767) with: {'epochs': 100, 'batch_size': 20}
0.453125 (0.084565) with: {'epochs': 10, 'batch_size': 40}
0.664062 (0.014616) with: {'epochs': 50, 'batch_size': 40}
0.528646 (0.144982) with: {'epochs': 100, 'batch_size': 40}
0.555990 (0.121757) with: {'epochs': 10, 'batch_size': 60}
0.657552 (0.012890) with: {'epochs': 50, 'batch_size': 60}
0.697917 (0.007366) with: {'epochs': 100, 'batch_size': 60}
0.415365 (0.128900) with: {'epochs': 10, 'batch_size': 80}
0.600260 (0.054345) with: {'epochs': 50, 'batch_size': 80}
0.576823 (0.178874) with: {'epochs': 100, 'batch_size': 80}
0.572917 (0.064133) with: {'epochs': 10, 'batch_size': 

#### We can see that a batch size of 60 with 100 epochs achieved the best result, about 70% accuracy.

### 2. Tune the Training Optimization Algorithm

Keras offers a suite of different state-of-the-art optimization algorithms.

In this example, we tune the optimization algorithm used to train the network, each with default parameters.

In [1]:
# Use scikit-learn to grid search the optimization algorithm
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Using TensorFlow backend.


Best: 0.691406 using {'optimizer': 'Adam'}
0.651042 (0.024774) with: {'optimizer': 'SGD'}
0.664063 (0.016877) with: {'optimizer': 'RMSprop'}
0.661458 (0.018136) with: {'optimizer': 'Adagrad'}
0.606771 (0.088004) with: {'optimizer': 'Adadelta'}
0.691406 (0.009568) with: {'optimizer': 'Adam'}
0.690104 (0.012075) with: {'optimizer': 'Adamax'}
0.678385 (0.019225) with: {'optimizer': 'Nadam'}


#### The results suggest that the Adam optimization algorithm is the best, with a score of about 69% accuracy.
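Because several optimizers score close together, it can help to check which ones fall within one standard deviation of the best mean. The sketch below uses the numbers printed above; the one-standard-deviation rule is a rough heuristic chosen for illustration, not a statistical test, and `within_one_std_of_best` is a hypothetical helper.

```python
# (mean accuracy, standard deviation) copied from the output above
results = {
    "SGD": (0.651042, 0.024774),
    "RMSprop": (0.664063, 0.016877),
    "Adagrad": (0.661458, 0.018136),
    "Adadelta": (0.606771, 0.088004),
    "Adam": (0.691406, 0.009568),
    "Adamax": (0.690104, 0.012075),
    "Nadam": (0.678385, 0.019225),
}

def within_one_std_of_best(scores):
    """Return the names whose mean is at least (best mean - best std)."""
    best_name = max(scores, key=lambda name: scores[name][0])
    best_mean, best_std = scores[best_name]
    threshold = best_mean - best_std
    return [name for name, (mean, _) in scores.items() if mean >= threshold]

print(within_one_std_of_best(results))
```

Optimizers that pass this check are hard to separate on this evidence alone, so repeating the search with a different seed may be worthwhile before settling on one.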

### 3. Tune Learning Rate and Momentum

A common optimization algorithm is Stochastic Gradient Descent (SGD), which is popular because it is well understood.

The learning rate controls how much to update the weights at the end of each batch, and the momentum controls how much the previous update influences the current weight update.
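These two roles are easier to see in a toy, one-dimensional version of the classical momentum update. This is a minimal sketch under stated assumptions: the quadratic loss, the starting point and the function names are made up for illustration, and the real optimizer applies the same idea to every weight in the network.

```python
def sgd_momentum_step(weight, velocity, gradient, learn_rate, momentum):
    """Apply one classical momentum update and return the new (weight, velocity)."""
    # The previous update (velocity) is carried over, scaled by the momentum.
    velocity = momentum * velocity - learn_rate * gradient
    return weight + velocity, velocity

def minimize_quadratic(learn_rate, momentum, steps=200, target=3.0):
    """Minimize the assumed loss f(w) = (w - target)^2 starting from w = 0."""
    weight, velocity = 0.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * (weight - target)
        weight, velocity = sgd_momentum_step(
            weight, velocity, gradient, learn_rate, momentum
        )
    return weight

print(minimize_quadratic(learn_rate=0.1, momentum=0.8))
print(minimize_quadratic(learn_rate=0.1, momentum=0.0))
```

With a momentum of 0 the update reduces to plain gradient descent, which is why 0.0 is a useful baseline value to include in the grid.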

We will try a suite of small standard learning rates and momentum values from 0.0 to 0.8 in steps of 0.2, as well as 0.9 (because it can be a popular value in practice).

Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size) and the number of epochs.

In [1]:
# Use scikit-learn to grid search the learning rate and momentum
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(learn_rate=learn_rate, momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 0.696615 using {'learn_rate': 0.001, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.0}
0.653646 (0.027498) with: {'learn_rate': 0.001, 'momentum': 0.2}
0.648438 (0.022326) with: {'learn_rate': 0.001, 'momentum': 0.4}
0.670573 (0.011201) with: {'learn_rate': 0.001, 'momentum': 0.6}
0.696615 (0.015073) with: {'learn_rate': 0.001, 'momentum': 0.8}
0.533854 (0.149269) with: {'learn_rate': 0.001, 'momentum': 0.9}
0.533854 (0.149269) with: {'learn_rate': 0.01, 'momentum': 0.0}
0.533854 (0.149269) with: {'learn_rate': 0.01, 'momentum': 0.2}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.4}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.6}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.9}
0.348958 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.0}
0.572917 (0.134575) with: {'learn_rate': 0.1, 'momentum': 0.2}
0.533854 (0.149269) with: {'learn_rate':

#### We can see that SGD performs relatively poorly on this problem; nevertheless, the best results were achieved using a learning rate of 0.001 and a momentum of 0.8, with an accuracy of about 70%.

### 4. Tune Network Weight Initialization

We will look at tuning the selection of network weight initialization by evaluating all of the available techniques.

We will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. 
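To illustrate how the schemes differ, the sketch below computes the sampling limits of three of the uniform initializers for the first layer of this network (8 inputs, 12 hidden units). The formulas follow the definitions given in the Keras initializer documentation; the function names are illustrative and the limits are shown only for comparison.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """Weights are drawn from U(-limit, limit) with limit = sqrt(6 / fan_in)."""
    return math.sqrt(6.0 / fan_in)

def lecun_uniform_limit(fan_in):
    """Weights are drawn from U(-limit, limit) with limit = sqrt(3 / fan_in)."""
    return math.sqrt(3.0 / fan_in)

fan_in, fan_out = 8, 12  # 8 input features feeding 12 hidden units
print("glorot_uniform:", glorot_uniform_limit(fan_in, fan_out))
print("he_uniform:    ", he_uniform_limit(fan_in))
print("lecun_uniform: ", lecun_uniform_limit(fan_in))
```

The He variants draw from a wider range than the Glorot variants for the same layer, so the starting weights can differ noticeably between schemes.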

In the example below we use rectifier for the hidden layer. We use sigmoid for the output layer because the predictions are binary.

In [1]:
# Use scikit-learn to grid search the weight initialization
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model(init_mode='uniform'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer=init_mode, activation='relu'))
    model.add(Dense(1, kernel_initializer=init_mode, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 0.727865 using {'init_mode': 'normal'}
0.714844 (0.019918) with: {'init_mode': 'uniform'}
0.692708 (0.015073) with: {'init_mode': 'lecun_uniform'}
0.727865 (0.009744) with: {'init_mode': 'normal'}
0.651042 (0.024774) with: {'init_mode': 'zero'}
0.710938 (0.005524) with: {'init_mode': 'glorot_normal'}
0.703125 (0.011049) with: {'init_mode': 'glorot_uniform'}
0.466146 (0.149269) with: {'init_mode': 'he_normal'}
0.660156 (0.012758) with: {'init_mode': 'he_uniform'}


#### We can see that the best results were achieved with a normal weight initialization scheme, with an accuracy of about 73%.

### 5. Tune Dropout Regularization

We will look at tuning the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.

To get good results, dropout is best combined with a weight constraint such as the max norm constraint.

This involves fitting both the dropout rate and the weight constraint. We will try dropout rates between 0.0 and 0.9 (a rate of 1.0 would drop every unit, so it does not make sense) and max-norm weight constraint values between 1 and 5.
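What the two settings do can be sketched in plain Python. This is an illustrative sketch, not the Keras implementation: `apply_max_norm` rescales a single weight vector and `inverted_dropout` acts on a list of activations, whereas the real layers operate on whole tensors during training.

```python
import math
import random

def apply_max_norm(weights, max_value):
    """Rescale a weight vector so that its L2 norm is at most `max_value`."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    scale = max_value / norm
    return [w * scale for w in weights]

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        # A rate of 1.0 would drop everything and divide by zero below.
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(7)
print(apply_max_norm([3.0, 4.0], max_value=1.0))
print(inverted_dropout([1.0] * 10, rate=0.3, rng=rng))
```

Surviving activations are scaled up by `1 / (1 - rate)` so that the expected sum stays the same, which is the usual "inverted dropout" formulation.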


In [1]:
# Use scikit-learn to grid search the dropout rate
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm
# Function to create model, required for KerasClassifier
def create_model(dropout_rate=0.0, weight_constraint=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='linear', kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
weight_constraint = [1, 2, 3, 4, 5]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(dropout_rate=dropout_rate, weight_constraint=weight_constraint)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 0.726563 using {'dropout_rate': 0.3, 'weight_constraint': 3}
0.717448 (0.009744) with: {'dropout_rate': 0.0, 'weight_constraint': 1}
0.707031 (0.006379) with: {'dropout_rate': 0.0, 'weight_constraint': 2}
0.712240 (0.025780) with: {'dropout_rate': 0.0, 'weight_constraint': 3}
0.717448 (0.020505) with: {'dropout_rate': 0.0, 'weight_constraint': 4}
0.722656 (0.024080) with: {'dropout_rate': 0.0, 'weight_constraint': 5}
0.720052 (0.026748) with: {'dropout_rate': 0.1, 'weight_constraint': 1}
0.707031 (0.008438) with: {'dropout_rate': 0.1, 'weight_constraint': 2}
0.718750 (0.016877) with: {'dropout_rate': 0.1, 'weight_constraint': 3}
0.722656 (0.020915) with: {'dropout_rate': 0.1, 'weight_constraint': 4}
0.716146 (0.007366) with: {'dropout_rate': 0.1, 'weight_constraint': 5}
0.700521 (0.020505) with: {'dropout_rate': 0.2, 'weight_constraint': 1}
0.722656 (0.025315) with: {'dropout_rate': 0.2, 'weight_constraint': 2}
0.718750 (0.019401) with: {'dropout_rate': 0.2, 'weight_constraint': 

#### We can see that a dropout rate of 0.3 (30%) and a max-norm weight constraint of 3 resulted in the best accuracy, about 73%.

### 6. Tune the Number of Neurons in the Hidden Layer
 
The number of neurons in a layer is an important parameter to tune. Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

Also, generally, a large enough single layer network can approximate any other neural network, at least in theory.

We will look at tuning the number of neurons in a single hidden layer. We will try values from 1 to 30 (1, 5, 10, 15, 20, 25 and 30).

A larger network requires more training and at least the batch size and number of epochs should ideally be optimized with the number of neurons.
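One way to see how capacity grows with the number of neurons is to count the trainable parameters of the network. The sketch below assumes the architecture used in this section (8 inputs, one fully connected hidden layer, one output unit); `count_parameters` is a hypothetical helper, not a Keras function.

```python
def count_parameters(n_inputs, n_hidden, n_outputs=1):
    """Count weights and biases of a network with one fully connected hidden layer."""
    hidden = (n_inputs + 1) * n_hidden   # one weight per input plus a bias, per hidden unit
    output = (n_hidden + 1) * n_outputs  # one weight per hidden unit plus a bias, per output
    return hidden + output

for neurons in [1, 5, 10, 15, 20, 25, 30]:
    print(neurons, count_parameters(8, neurons))
```

The parameter count grows linearly with the hidden layer width, so wider layers generally need more data or more training to fit well.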


In [1]:
# Use scikit-learn to grid search the number of neurons
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm
# Function to create model, required for KerasClassifier
def create_model(neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(neurons, input_dim=8, kernel_initializer='uniform', activation='linear', kernel_constraint=maxnorm(4)))
    model.add(Dropout(0.2))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 0.722656 using {'neurons': 25}
0.701823 (0.012890) with: {'neurons': 1}
0.712240 (0.009744) with: {'neurons': 5}
0.707031 (0.025315) with: {'neurons': 10}
0.712240 (0.021236) with: {'neurons': 15}
0.717448 (0.025780) with: {'neurons': 20}
0.722656 (0.022097) with: {'neurons': 25}
0.721354 (0.020505) with: {'neurons': 30}


#### We can see that the best results were achieved with a network with 25 neurons in the hidden layer with an accuracy of about 72%.

### 7. Tune the Neuron Activation Function

The activation function controls the non-linearity of individual neurons and when to fire.

Generally, the rectifier activation function is the most popular today. The sigmoid and tanh functions used to be the default choices, and they may still be more suitable for some problems.
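For reference, a few of these functions can be written out as scalar functions. This is a sketch of the textbook definitions in plain Python, not the Keras implementations, which operate on tensors.

```python
import math

def relu(x):
    """Rectifier: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def softplus(x):
    """Smooth approximation of the rectifier, log(1 + e^x), in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Logistic function with output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softsign(x):
    """x / (1 + |x|), with output in (-1, 1)."""
    return x / (1.0 + abs(x))

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), softplus(x), sigmoid(x), math.tanh(x), softsign(x))
```

The output ranges differ: sigmoid stays in (0, 1), tanh and softsign in (-1, 1), while the rectifier and softplus are unbounded above.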

In this example, we will evaluate the suite of different activation functions available in Keras. We will only use these functions in the hidden layer, as we require a sigmoid activation function in the output for the binary classification problem.

Generally, it is a good idea to prepare data to the range of the different transfer functions, which we will not do in this case.
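Although it is not done in this example, rescaling could look like the sketch below, which maps one input column to a chosen range such as [0, 1] for sigmoid or [-1, 1] for tanh. The helper `min_max_scale` is hypothetical; in practice a library scaler fitted on the training data only would be the more usual choice.

```python
def min_max_scale(column, low=0.0, high=1.0):
    """Linearly rescale the values of `column` to the range [low, high]."""
    smallest, largest = min(column), max(column)
    if largest == smallest:
        # A constant column carries no information; map it to the lower bound.
        return [low for _ in column]
    span = largest - smallest
    return [low + (high - low) * (value - smallest) / span for value in column]

print(min_max_scale([2.0, 4.0, 6.0]))             # range for sigmoid-like units
print(min_max_scale([2.0, 4.0, 6.0], -1.0, 1.0))  # range for tanh-like units
```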


In [1]:
# Use scikit-learn to grid search the activation function
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation=activation))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 0.718750 using {'activation': 'softplus'}
0.669271 (0.006639) with: {'activation': 'softmax'}
0.718750 (0.027251) with: {'activation': 'softplus'}
0.673177 (0.031304) with: {'activation': 'softsign'}
0.714844 (0.027621) with: {'activation': 'relu'}
0.667969 (0.015947) with: {'activation': 'tanh'}
0.691406 (0.014616) with: {'activation': 'sigmoid'}
0.671875 (0.027621) with: {'activation': 'hard_sigmoid'}
0.705729 (0.001841) with: {'activation': 'linear'}


#### Surprisingly (to me at least), the ‘softplus’ activation function achieved the best results with an accuracy of about 72%