# 1 Introduction

In this lesson, you will continue to discover techniques to **improve the optimization** problem of adapting neural network model weights to learn a training dataset. It also presents techniques that **improve the learning** of your deep learning neural network models. After completing this lesson, you will know:

- Improve the learning process
  - How the **training process is sensitive to the scale of input** and target variables and how normalization and standardization processes can dramatically improve model convergence.
  - How the **vanishing gradient problem** can be addressed with the rectified linear activation function, which dramatically improves the likelihood and speed of convergence.
  - How the **exploding gradient problem** can be addressed with gradient norm scaling and gradient value clipping.


# 2 Better Learning

## 2.1 Stabilize Learning with Data Scaling

Deep learning neural networks learn **how to map inputs to outputs from examples in a training dataset**. The weights of the model are initialized to small random values and updated via an optimization algorithm in response to estimates of error on the training dataset.

Given the use of small weights in the model and the use of error between predictions and actual values, the scale of inputs and outputs used to train the model is essential. 

- **Unscaled input variables** can result in a slow or unstable learning process
- **Unscaled target variables** on regression problems can result in exploding gradients, causing the learning process to fail.

Data preparation involves using **normalization** and **standardization** techniques to rescale input and output variables before training a neural network model. In this section, you will discover how to improve neural network stability and modeling performance by scaling data.

After completing this section, you will know:

- Data scaling is a recommended pre-processing step when working with deep learning neural networks.
- Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
- How to apply standardization and normalization to improve a Multilayer Perceptron model's performance on a regression predictive modeling problem.

### 2.1.1 Data Scaling

Deep learning neural network models learn a mapping from input variables to an output variable. As such, the scale and distribution of the data may be different for each variable. 

**Input variables** may have different units (e.g., feet, kilometers, and hours) that, in turn, may mean the variables have different scales. Differences in the scales across input variables may increase the difficulty of the problem being modeled. For example, large input values (e.g., a spread of hundreds or thousands of units) **can result in a model that learns large weight values**. A model with large weight values is **often unstable**, meaning that it may suffer from poor performance during learning and from sensitivity to input values, resulting in higher generalization error.

A **target variable** with a **large spread of values**, in turn, may result in **large error gradient** values causing **weight** values to **change dramatically**, making the learning process unstable. 

> Scaling input and output variables is a critical step in using neural network models.
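
As a minimal sketch of the two rescaling methods named in this lesson, the example below applies scikit-learn's `MinMaxScaler` (normalization to the range [0, 1]) and `StandardScaler` (standardization to zero mean and unit variance) to a small array. The data values are made up purely for illustration and are not part of the case study that follows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# hypothetical data: two input variables on very different scales
data = np.array([[1.0, 1000.0],
                 [2.0, 3000.0],
                 [3.0, 5000.0],
                 [4.0, 7000.0]])

# normalization: x' = (x - min) / (max - min), computed per column
normalized = MinMaxScaler().fit_transform(data)

# standardization: x' = (x - mean) / std, computed per column
standardized = StandardScaler().fit_transform(data)

print('normalized min/max:', normalized.min(axis=0), normalized.max(axis=0))
print('standardized mean/std:', standardized.mean(axis=0), standardized.std(axis=0))
```

After either transform, both columns share the same scale even though the raw values differ by three orders of magnitude.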

### 2.1.2 Data Scaling - Case Study



In [None]:
# regression predictive modeling problem
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# generate regression dataset
# each input variable has a Gaussian distribution, as does the target variable. 
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

fig, ax = plt.subplots(3,1,figsize=(10,6))
# histograms of input variables

ax[0].hist(x[:, 0])
ax[0].set_ylabel("Feature 0")
ax[1].hist(x[:, 1])
ax[1].set_ylabel("Feature 1")

# histogram of target variable
ax[2].hist(y)
ax[2].set_ylabel("Output")
plt.show()

#### 2.1.2.1 Multilayer Perceptron With Unscaled Data

In [None]:
# mlp with unscaled data for the regression problem
from sklearn.datasets import make_regression
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate regression dataset
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', 
                kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# compile model
model.compile(loss='mean_squared_error', 
              optimizer=SGD(learning_rate=0.01, momentum=0.9))

# fit model
history = model.fit(train_x, train_y, 
                    validation_data=(test_x, test_y), 
                    epochs=100, verbose=0)

# evaluate the model
train_mse = model.evaluate(train_x, train_y, verbose=0)
test_mse = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# plot loss during training
plt.title('Mean Squared Error')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

Running the example fits the model and calculates the mean squared error on the train and test sets. In this case, the model cannot learn the problem, resulting in predictions of **NaN values**. 

The model weights exploded during training given the very large errors and, in turn, error gradients calculated for weight updates.

This demonstrates that at the very least, **some data scaling is required for the target variable**. A line plot of training history is created but does not show anything as the model almost immediately results in a NaN mean squared error.
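
To make the scale problem concrete, the sketch below compares the spread of the raw target variable with its spread after standardization, along with the size of the squared-error gradient with respect to the prediction. The all-zero prediction is an assumption used only to illustrate the magnitude of the error signal early in training. It is not exactly what the Keras model computes.

```python
import numpy as np
from sklearn.datasets import make_regression

# same dataset as the case study
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# assumed starting point: the model predicts 0.0 for every example
pred = np.zeros_like(y)

# gradient of the squared error (pred - y)^2 with respect to pred is 2 * (pred - y)
grad_raw = 2.0 * (pred - y)

# the same quantity after standardizing the target
y_std = (y - y.mean()) / y.std()
grad_scaled = 2.0 * (pred - y_std)

print('target std: raw=%.3f, standardized=%.3f' % (y.std(), y_std.std()))
print('mean |gradient|: raw=%.3f, standardized=%.3f'
      % (np.abs(grad_raw).mean(), np.abs(grad_scaled).mean()))
```

The error signal on the raw target is larger by roughly the target's standard deviation, which is the factor that standardization removes.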

#### 2.1.2.2 Multilayer Perceptron With Scaled Output Variables

The example can be updated to scale the target variable. Reducing the scale of the target variable will, in turn, reduce the size of the gradient used to update the weights and result in a more stable model and training process. Given the Gaussian distribution of the target variable, a natural method for rescaling the variable would be to **standardize** it. This
requires estimating the mean and standard deviation of the variable and using these estimates to perform the rescaling. The best practice is to estimate the training dataset's mean and standard deviation and use these variables to scale the train and test dataset. This is to avoid
any data leakage during the model evaluation process. The scikit-learn transformers expect input data to be matrices of rows and columns. Therefore the 1D arrays for the target variable will have to be reshaped into 2D arrays prior to the transforms.

Rescaling the target variable means that estimating the model's performance and plotting the learning curves will calculate an MSE in squared units of the scaled variable rather than squared units of the original scale. This can make interpreting the error within the context of the domain challenging. In practice, it may be helpful to estimate the performance of the model by first inverting the transform on the test dataset target variable and on the model predictions and estimating model performance using the root mean squared error on the unscaled data. **This is left as an exercise to the student**.
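
As a starting point for that exercise, the sketch below shows only the mechanics of fitting the scaler on the training targets, inverting the transform, and computing the RMSE in the original units. The `pred_scaled` array is a stand-in for the output of `model.predict(test_x)`; it is generated from the scaled targets plus noise, which is an assumption made so the example runs without training a model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
n_train = 500
train_y, test_y = y[:n_train], y[n_train:]

# fit the scaler on the training targets only to avoid data leakage
scaler = StandardScaler()
train_y_scaled = scaler.fit_transform(train_y.reshape(-1, 1))
test_y_scaled = scaler.transform(test_y.reshape(-1, 1))

# placeholder for model.predict(test_x): scaled targets plus small noise (assumption)
rng = np.random.default_rng(1)
pred_scaled = test_y_scaled + rng.normal(0.0, 0.05, size=test_y_scaled.shape)

# invert the transform on predictions and targets to return to the original units
pred = scaler.inverse_transform(pred_scaled)
actual = scaler.inverse_transform(test_y_scaled)

rmse = np.sqrt(np.mean((pred - actual) ** 2))
print('RMSE in original units: %.3f' % rmse)
```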

In [None]:
# mlp with scaled outputs on the regression problem
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate regression dataset
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# reshape 1d arrays to 2d arrays
train_y = train_y.reshape(len(train_y), 1)
test_y = test_y.reshape(len(test_y), 1)

# create scaler
scaler = StandardScaler()

# fit scaler on training dataset
scaler.fit(train_y)

# transform training dataset
train_y = scaler.transform(train_y)

# transform test dataset
test_y = scaler.transform(test_y)

# define model
model = Sequential()
model.add(Dense(25, input_dim=20, 
                activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# compile model
model.compile(loss='mean_squared_error', 
              optimizer=SGD(learning_rate=0.01, momentum=0.9))

# fit model
history = model.fit(train_x, train_y, 
                    validation_data=(test_x, test_y),
                    epochs=100, verbose=0)

# evaluate the model
train_mse = model.evaluate(train_x, train_y, verbose=0)
test_mse = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# plot loss during training
plt.title('Mean Squared Error Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

A line plot of the mean squared error on the train (blue) and test (orange) dataset over each training epoch is created. In this case, we can see that the model rapidly learns to map inputs to outputs for the regression problem effectively and achieves good performance on both
datasets throughout the run, neither overfitting nor underfitting the training dataset.

> It may be interesting to repeat this experiment and normalize the target variable instead
and compare results.

#### 2.1.2.3 Multilayer Perceptron With Scaled Input Variables

We have seen that data scaling can stabilize the training process when fitting a regression model whose target variable has a wide spread of values. It is also possible to improve the stability and performance of the model by scaling the input variables. This section designs an experiment to compare the performance of different scaling methods for the input variables. The input variables also have a Gaussian data distribution, like the target variable. Therefore we would expect that standardizing the data would be the best approach.

> This is just a heuristic, and it is always best to evaluate different scaling methods and discover what works best.

In [None]:
# compare scaling methods for mlp inputs on regression problem
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt
import numpy as np

# prepare dataset with input and output scalers, can be none
def get_dataset(input_scaler, output_scaler):
  # generate dataset
  x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

  # split into train and test
  n_train = 500
  train_x, test_x = x[:n_train, :], x[n_train:, :]
  train_y, test_y = y[:n_train], y[n_train:]

  # scale inputs
  if input_scaler is not None:
    # fit scaler
    input_scaler.fit(train_x)
    
    # transform training dataset
    train_x = input_scaler.transform(train_x)
    
    # transform test dataset
    test_x = input_scaler.transform(test_x)
  if output_scaler is not None:
    # reshape 1d arrays to 2d arrays
    train_y = train_y.reshape(len(train_y), 1)
    test_y = test_y.reshape(len(test_y), 1)
    
    # fit scaler on training dataset
    output_scaler.fit(train_y)

    # transform training dataset
    train_y = output_scaler.transform(train_y)
    
    # transform test dataset
    test_y = output_scaler.transform(test_y)
  return train_x, train_y, test_x, test_y

# fit and evaluate mse of model on test set
def evaluate_model(train_x, train_y, test_x, test_y):
  # define model
  model = Sequential()
  model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
  model.add(Dense(1, activation='linear'))

  # compile model
  model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))

  # fit model
  model.fit(train_x, train_y, epochs=100, verbose=0)

  # evaluate the model
  test_mse = model.evaluate(test_x, test_y, verbose=0)
  return test_mse

# evaluate model multiple times with given input and output scalers
def repeated_evaluation(input_scaler, output_scaler, n_repeats=30):
  # get dataset
  train_x, train_y, test_x, test_y = get_dataset(input_scaler, output_scaler)

  # repeated evaluation of model
  results = list()
  for _ in range(n_repeats):
    test_mse = evaluate_model(train_x, train_y, test_x, test_y)
    print('>%.3f' % test_mse)
    results.append(test_mse)
  return results

In [None]:
# unscaled inputs
results_unscaled_inputs = repeated_evaluation(None, StandardScaler())

# normalized inputs
results_normalized_inputs = repeated_evaluation(MinMaxScaler(), StandardScaler())

# standardized inputs
results_standardized_inputs = repeated_evaluation(StandardScaler(), StandardScaler())

# summarize results
print('Unscaled: %.3f (%.3f)' % (np.mean(results_unscaled_inputs), np.std(results_unscaled_inputs)))
print('Normalized: %.3f (%.3f)' % (np.mean(results_normalized_inputs), np.std(results_normalized_inputs)))
print('Standardized: %.3f (%.3f)' % (np.mean(results_standardized_inputs), np.std(results_standardized_inputs)))

# plot results
results = [results_unscaled_inputs, results_normalized_inputs, results_standardized_inputs]
labels = ['unscaled input\nstandardized output', 'normalized input\nstandardized output', 'standardized input\nstandardized output']
plt.boxplot(results, labels=labels)
plt.show()

A figure with three box-and-whisker plots is created, summarizing the spread of error scores for each configuration. The plots show that there was little difference between the distributions of error scores for the unscaled and standardized input variables and that the normalized input variables result in better performance and a more stable, tighter distribution of error scores.

These results highlight that it is important to actually experiment and confirm the results of data scaling methods rather than assuming that a given data preparation scheme will work best based on the observed distribution of the data.

## 2.2 Fix Vanishing Gradients with ReLU

In a neural network, **the activation function** is responsible for transforming the summed weighted input from the node into the activation of the node or output for that input. **The rectified linear activation function** is a piecewise linear function that will output the input directly if it is
positive; otherwise, it will output zero.  

> ReLU has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.

In this section, you will discover the rectified linear activation function for deep learning neural networks. After completing this section, you will know:

- The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.
- The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn faster and perform better.
- <font color="red">The rectified linear activation is the default activation when developing Multilayer Perceptrons and convolutional neural networks</font>.

### 2.2.1 Vanishing Gradients and ReLU

A neural network is comprised of layers of nodes and learns to map examples of inputs to outputs. For a given node, the inputs are multiplied by the weights in a node and summed together. This value is referred to as the summed activation of the node. The summed activation is then
transformed via an activation function and denotes the node's specific output or activation. The most straightforward activation function is referred to as the linear activation, where no transform is applied at all. A network comprised of only linear activation functions is very easy to train
but cannot learn complex mapping functions. Linear activation functions are still used in the output layer for networks that predict a quantity (e.g., regression problems).

> Nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data. Traditionally, two widely used nonlinear activation functions are the **sigmoid** and **hyperbolic tangent** activation functions. 

The **sigmoid activation function**, also called the logistic function, is traditionally a very popular activation function for neural networks. The input to the function is transformed into a value between 0.0 and 1.0. Inputs that are much larger than 1.0 are transformed to a value close to 1.0. Similarly, values much smaller than 0.0 are snapped to values close to 0.0. The shape of the function for all possible inputs is an S-shape from zero up through 0.5 to 1.0. **For a long time, through the early 1990s, it was the default activation used on neural networks**.

The **hyperbolic tangent function**, or tanh for short, is a similar-shaped nonlinear activation function that outputs values between -1.0 and 1.0. In the later **1990s and through the 2000s, the tanh function was preferred over the sigmoid** activation function as models that used it were easier to train and often had a better predictive performance.


> A general problem with both the **sigmoid** and **tanh** functions is that they saturate.

This means that large values snap to 1.0 and small values snap to -1 or 0 for tanh and sigmoid, respectively. 

Further, the functions are only really sensitive to changes around the mid-point of their input, where the sigmoid outputs 0.5 and tanh outputs 0.0. The limited sensitivity and saturation of the function happen regardless of whether the summed activation from the node provided as input contains useful information or not. **Once saturated, it becomes challenging for the learning algorithm to continue to adapt the weights to improve the model's performance.**
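
The saturation described above can be seen by evaluating both functions and their derivatives at a few inputs. The sketch below uses plain NumPy; the helper names `sigmoid`, `sigmoid_grad`, and `tanh_grad` are introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    # logistic function
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    # derivative of tanh: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print('z=%6.1f  sigmoid=%.5f  d_sigmoid=%.5f  tanh=%.5f  d_tanh=%.5f'
          % (z, sigmoid(z), sigmoid_grad(z), np.tanh(z), tanh_grad(z)))
```

Both derivatives are largest at an input of zero and fall toward zero as the input moves away from it, so a saturated node passes back almost no gradient.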

Finally, as hardware's capability increased through GPUs, very deep neural networks using **sigmoid** and **tanh** activation functions could not easily be trained. Layers deep in large networks using these nonlinear activation functions fail to receive useful gradient information.
Error is backpropagated through the network and used to update the weights. 

> The amount of error decreases dramatically with each additional layer through which it is propagated, given the derivative of the chosen activation function. This is called the **vanishing gradient** problem and prevents deep (multilayered) networks from learning effectively.
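
A rough way to see the effect is to multiply the largest possible sigmoid derivative (0.25, reached at an input of zero) once per layer. The sketch below does only that; it deliberately ignores the weight matrices, which also scale the gradient, so it illustrates a best-case contribution of the activation derivatives rather than a full backpropagation.

```python
# the sigmoid derivative s(z) * (1 - s(z)) peaks at 0.25 when z = 0
max_sigmoid_grad = 0.25

# best-case factor contributed by the activation derivatives alone
bounds = {}
for n_layers in [1, 5, 10, 20]:
    bounds[n_layers] = max_sigmoid_grad ** n_layers
    print('%2d layers: factor <= %.3e' % (n_layers, bounds[n_layers]))
```

Even in this best case, the factor shrinks geometrically with depth, which is why layers far from the output receive very little error signal.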

###  2.2.2 Rectified Linear Activation Function

In order to use stochastic gradient descent with backpropagation of errors to train deep neural networks, an activation function is needed that looks and acts like a linear function but is a nonlinear function allowing complex relationships in the data to be learned. 

The function must also provide more sensitivity to the activation sum input and avoid easy saturation. The solution had been bouncing around in the field for some time, although it was not highlighted until papers in 2009 and 2011 shone a light on it. **The solution is to use the rectified linear activation function, or ReLU for short**. A node or unit that implements this activation function is referred to as a rectified linear activation unit or ReLU for short.

Often, networks that use the rectifier function for the hidden layers are referred to as rectified networks. Adoption of ReLU may easily be considered one of the few milestones in the deep learning revolution, i.e., the techniques that now permit the routine development of very deep neural networks.

The **rectified linear activation** function has rapidly become the default activation function when developing most types of neural networks. As such, it is worth taking a moment to review some of the benefits of the approach, first highlighted by **Xavier Glorot** et al. in their milestone 2011 paper on using ReLU titled <font color="red">Deep Sparse Rectifier Neural Networks</font>.

**Computational Simplicity**

> The rectifier function is trivial to implement, requiring a max() function. This is unlike the **tanh** and **sigmoid** activation functions that require the use of an exponential calculation.
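
As a minimal sketch, the rectifier can be written with NumPy's element-wise maximum. The function name `relu` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def relu(z):
    # element-wise max(0, z): positive inputs pass through, the rest become 0
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
```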

**Representational Sparsity**

An important benefit of the rectifier function is that it is capable of outputting a genuine zero value. This is unlike the tanh and sigmoid activation functions that learn to approximate a zero
output, e.g., a value close to zero but not a real zero value. This means that negative inputs can output true zero values allowing the activation of hidden layers in neural networks to contain true zero values. This is called a sparse representation and is a desirable property in representational learning as it can accelerate learning and simplify the model. 
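
The difference between a true zero and an approximate zero can be checked by counting exact zeros in the outputs of each function. The sketch below draws hypothetical summed activations from a standard Gaussian, which is an assumption made for illustration rather than a property of any particular trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(0.0, 1.0, size=10000)  # hypothetical summed activations

relu_out = np.maximum(0.0, z)
tanh_out = np.tanh(z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

# count outputs that are exactly 0.0
print('exact zeros: relu=%d, tanh=%d, sigmoid=%d'
      % (np.sum(relu_out == 0.0), np.sum(tanh_out == 0.0), np.sum(sigmoid_out == 0.0)))
```

With inputs centered on zero, about half of the ReLU outputs are exactly zero, while the sigmoid never reaches zero for finite inputs.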


**The ReLU does have some limitations.**

> Key among the limitations of ReLU is that large weight updates can leave the summed input to the activation function always negative, regardless of the input to the network. A node with this problem will forever output an activation value of 0.0. This is referred to as a **dying ReLU**.
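
The sketch below illustrates a single dead unit. The weights and bias are hypothetical values standing in for the result of a large weight update; with them, the summed input is negative for every input in [-1, 1], so both the activation and the ReLU derivative are zero everywhere and gradient descent has nothing to update.

```python
import numpy as np

rng = np.random.default_rng(1)
inputs = rng.uniform(-1.0, 1.0, size=(1000, 2))  # inputs scaled to [-1, 1]

# hypothetical parameters after a large update: the bias dominates the weighted sum
weights = np.array([0.5, -0.3])
bias = -5.0

z = inputs @ weights + bias            # summed input to the activation function
activation = np.maximum(0.0, z)        # ReLU output
relu_grad = (z > 0).astype(float)      # derivative of ReLU with respect to z

print('max summed input: %.3f' % z.max())
print('max activation: %.3f' % activation.max())
print('examples with non-zero gradient: %d' % int(relu_grad.sum()))
```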

### 2.2.3 ReLU Case Study

This section will demonstrate how to use ReLU to counter the vanishing gradient problem with an MLP on a simple classification problem. This example provides a template for exploring ReLU with your neural network for classification and regression problems.

In [None]:
# scatter plot of the circles dataset with points colored by class
from sklearn.datasets import make_circles
import numpy as np
import matplotlib.pyplot as plt

# generate circles
x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# select indices of points with each class label
for i in range(2):
	samples_ix = np.where(y == i)
	plt.scatter(x[samples_ix, 0], x[samples_ix, 1], label=str(i))
plt.legend()
plt.show()

#### 2.2.3.1 Multilayer Perceptron Model

In [None]:
# mlp with tanh for the two circles classification problem
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.initializers import RandomUniform
import matplotlib.pyplot as plt

# generate 2d classification dataset
x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# scale input data to [-1,1]
scaler = MinMaxScaler(feature_range=(-1, 1))
x = scaler.fit_transform(x)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()

# It was also good practice to initialize the network weights
# to small random values from a uniform distribution.
init = RandomUniform(minval=0, maxval=1)
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y),
                    epochs=500, verbose=0)

# evaluate the model
_, train_acc = model.evaluate(train_x, train_y, verbose=0)
_, test_acc = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot loss learning curves
plt.subplot(211)
plt.title('Cross-Entropy Loss', pad=-40)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()

# plot accuracy learning curves
plt.subplot(212)
plt.title('Accuracy', pad=-40)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.tight_layout()
plt.show()

A line plot of model loss and accuracy on the train and test sets is created, showing how performance changes over the 500 training epochs. The plots suggest, for this run, that the improvement in performance begins to slow around epoch 200 at about 80% accuracy for both the train and test sets.

Now that we have seen how to develop a classical MLP using the tanh activation function for the two circles problem, we can modify the model to have many more hidden layers.

#### 2.2.3.2 Deeper MLP Model

Traditionally, developing deep Multilayer Perceptron models was challenging. Deep models using the hyperbolic tangent activation function do not train quickly, and much of this poor performance is blamed on the vanishing gradient problem. We can attempt to investigate this
using the MLP model developed in the previous section. The number of hidden layers can be increased from 1 to 5; for example:

In [None]:
# deeper mlp with tanh for the two circles classification problem
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.initializers import RandomUniform
import matplotlib.pyplot as plt

# generate 2d classification dataset
x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
x = scaler.fit_transform(x)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y), epochs=500, verbose=0)

# evaluate the model
_, train_acc = model.evaluate(train_x, train_y, verbose=0)
_, test_acc = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot loss learning curves
plt.subplot(211)
plt.title('Cross-Entropy Loss', pad=-40)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()

# plot accuracy learning curves
plt.subplot(212)
plt.title('Accuracy', pad=-40)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.tight_layout()
plt.show()

#### 2.2.3.3 Deeper MLP Model with ReLU

The **rectified linear activation** function has supplanted the hyperbolic tangent activation function as the new preferred default when developing Multilayer Perceptron networks and other network types like CNNs. 

> This is because the activation function looks and acts like a linear
function, making it easier to train and less likely to saturate, but is, in fact, a nonlinear function, forcing negative inputs to the value 0. 

It is claimed as one possible approach to addressing the vanishing gradients problem when training deeper models. When using the rectified linear activation function (or ReLU for short), it is good practice to use the He weight initialization scheme. We can define the MLP with five hidden layers using ReLU and He initialization, listed below.
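
For reference, the He uniform scheme selected in Keras by `kernel_initializer='he_uniform'` draws weights from a uniform distribution on [-limit, limit] with limit = sqrt(6 / fan_in), where fan_in is the number of inputs to the layer. The NumPy sketch below reproduces that rule; the function name `he_uniform` and its signature are written here for illustration and are not the Keras implementation.

```python
import numpy as np

def he_uniform(fan_in, fan_out, rng):
    # sample from U(-limit, limit) with limit = sqrt(6 / fan_in),
    # which gives the weights a variance of 2 / fan_in
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(1)

# weights for a hidden layer with 5 inputs and 5 nodes, as in the model below
w = he_uniform(fan_in=5, fan_out=5, rng=rng)
print('limit: %.3f' % np.sqrt(6.0 / 5))
print('largest absolute weight: %.3f' % np.abs(w).max())
```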

In [None]:
# deeper mlp with relu for the two circles classification problem (5 hidden layers)
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate 2d classification dataset
x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
x = scaler.fit_transform(x)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y),
                    epochs=500, verbose=0)

# evaluate the model
_, train_acc = model.evaluate(train_x, train_y, verbose=0)
_, test_acc = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot loss learning curves
plt.subplot(211)
plt.title('Cross-Entropy Loss', pad=-40)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()

# plot accuracy learning curves
plt.subplot(212)
plt.title('Accuracy', pad=-40)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.tight_layout()
plt.show()

A line plot of model accuracy on the train and test sets over training epochs is also created. The plot shows quite different dynamics to what we have seen so far. The model appears to rapidly learn the problem, converging on a solution in about 200 epochs.

In [None]:
# deeper mlp with relu for the two circles classification problem (15 layers)
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate 2d classification dataset
x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
x = scaler.fit_transform(x)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y), epochs=500, verbose=0)

# evaluate the model
_, train_acc = model.evaluate(train_x, train_y, verbose=0)
_, test_acc = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot loss learning curves
plt.subplot(211)
plt.title('Cross-Entropy Loss', pad=-40)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()

# plot accuracy learning curves
plt.subplot(212)
plt.title('Accuracy', pad=-40)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.tight_layout()
plt.show()

The ReLU activation function has allowed us to fit a much deeper model for this simple problem, but this capability does not extend indefinitely. Increasing the number of layers results in slower learning, up to a point at about 20 layers where the model is no longer capable of learning the problem, at least with the chosen configuration. The model with 15 hidden layers above is still capable of learning the problem (at least some of the time); the example below repeats the experiment with 20 hidden layers.

In [None]:
# deeper mlp with relu for the two circles classification problem (20 layers)
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate 2d classification dataset
x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
x = scaler.fit_transform(x)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y),
                    epochs=500, verbose=0)

# evaluate the model
_, train_acc = model.evaluate(train_x, train_y, verbose=0)
_, test_acc = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot loss learning curves
plt.subplot(211)
plt.title('Cross-Entropy Loss', pad=-40)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()

# plot accuracy learning curves
plt.subplot(212)
plt.title('Accuracy', pad=-40)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.tight_layout()
plt.show()

## 2.3 Fix Exploding Gradients with Gradient Clipping

> Training a neural network can become unstable given the error function, learning rate, or even the target variable scale. 

Large updates to weights during training can cause a numerical overflow or underflow, often referred to as exploding gradients. The problem of exploding gradients is more common with recurrent neural networks, such as LSTMs, given the accumulation of gradients unrolled over hundreds of input time steps.

A common and relatively easy solution to the **exploding gradients problem** is to change the derivative of the error before propagating it backward through the network and using it to update the weights.

Two approaches include rescaling the gradients given a chosen vector norm and clipping gradient values that exceed a preferred range. Together, these methods are referred to as **gradient clipping**.

In this section, you will discover the exploding gradient problem and how to improve neural network training stability using gradient clipping.

### 2.3.1 Exploding Gradients and Clipping

Neural networks are trained using the stochastic gradient descent optimization algorithm. This requires estimating the loss on one or more training examples, then calculating the derivative of the loss, which is propagated backward through the network to update the weights. Weights are updated using a fraction of the backpropagated error controlled by the learning rate. 

> It is possible for the updates to the weights to be so large that the weights either overflow or underflow their numerical precision.

In practice, the **weights can take** on the value of NaN (not a number) or Inf (infinity) when they **overflow** or **underflow**, and for practical purposes, the network will be useless from that point forward, forever predicting NaN values as signals flow through the invalid weights.
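
As a minimal sketch of how overly large updates lead to Inf and then NaN, consider plain gradient descent on the one-dimensional loss $50w^2$. The loss, the function name `run_gradient_descent`, and the two learning rates are hypothetical choices for illustration and are not part of the case study in this lesson.

```python
import math


def run_gradient_descent(learning_rate, steps=1000, w=1.0):
    """Minimise loss(w) = 50 * w**2 with plain gradient descent (toy example)."""
    for _ in range(steps):
        grad = 100.0 * w                  # derivative of 50 * w**2
        w = w - learning_rate * grad      # standard weight update
    return w


# small learning rate: each step multiplies w by 0.9, so w shrinks toward 0
stable_w = run_gradient_descent(learning_rate=0.001)

# large learning rate: each step multiplies w by -9, so |w| grows until it
# overflows to infinity, after which inf - inf produces NaN
unstable_w = run_gradient_descent(learning_rate=0.1)

print('stable: %g, unstable: %s' % (stable_w, unstable_w))
print('unstable weight is NaN:', math.isnan(unstable_w))
```

Once the weight becomes NaN, every later update is also NaN, which mirrors the behaviour described above for a full network.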


> The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution.


The **underflow** or **overflow of weights** is generally referred to as an instability of the network training process and is known by the name <font color="red">exploding gradients</font>, as the unstable training process causes the network to fail to train in such a way that the model is essentially useless. In a given neural network, such as a Convolutional Neural Network or Multilayer Perceptron, exploding gradients can happen due to a poor configuration choice. Some examples include:

- Poor choice of learning rate that results in large weight updates.
- Poor choice of data preparation, allowing large differences in the target variable.
- Poor choice of the loss function, allowing the calculation of large error values.

Exploding gradients are also a problem in recurrent neural networks such as the Long Short-Term Memory network, given the accumulation of error gradients in the unrolled recurrent structure. Exploding gradients can be avoided in general by careful configuration of the network model, such as the choice of a small learning rate, scaled target variables, and a standard loss function. Nevertheless, exploding gradients may still be an issue with recurrent networks with a large number of input time steps.


<font color="red">A common solution to exploding gradients</font> is to change the error derivative before propagating it backward through the network and using it to update the weights. By rescaling the error derivative, the updates to the weights will also be rescaled, dramatically decreasing the likelihood
of an overflow or underflow. There are two main methods for updating the error derivative; they are:

- Gradient Scaling.
- Gradient Clipping.

**Gradient scaling** involves normalizing the error gradient vector such that the vector norm (magnitude) equals a defined value, such as 1.0.

On the other hand, **gradient clipping** involves forcing the gradient values (element-wise) to a specific minimum or maximum value if the gradient exceeds an expected range. **Together, these methods are often simply referred to as gradient clipping**.

> When the traditional gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size to be small enough that it is less likely to go outside the region where the gradient indicates the direction of approximately steepest descent.

It is a method that **only addresses the numerical stability** of training deep neural network models and **does not offer any general improvement in performance**. The value for the gradient vector norm or preferred range can be configured by trial and error, by using common values
used in the literature or by first observing common vector norms or ranges via experimentation and then choosing a sensible value.
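
The "observe first, then choose" approach can be sketched as follows. The recorded gradient vectors are made-up values, and using the median observed norm as the threshold is only one possible heuristic, assumed here for illustration; it is not a rule from the literature.

```python
import math
import statistics


def l2_norm(vector):
    """Square root of the sum of squared values."""
    return math.sqrt(sum(v * v for v in vector))


# hypothetical gradient vectors recorded during a short trial run
recorded_gradients = [[0.3, -0.4], [1.2, 0.5], [-6.0, 8.0], [0.6, 0.8]]

# observe the norms seen in practice; one of them is an obvious outlier
norms = [l2_norm(g) for g in recorded_gradients]

# assumed heuristic: use the median observed norm as the clipping threshold
threshold = statistics.median(norms)
print('observed norms:', norms)
print('chosen threshold: %.2f' % threshold)
```

A threshold chosen this way leaves typical updates unchanged and only intervenes on unusually large gradients, such as the third vector in the example.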

> It is common to use the same gradient clipping configuration for all layers in the network. Nevertheless, there are examples where a larger range of error gradients are permitted in the output layer compared to hidden layers.


#### 2.3.1.1 Gradient Norm Scaling

Gradient norm scaling involves changing the loss function derivatives to have a given vector norm when the L2 vector norm (the square root of the sum of the squared values) of the gradient vector exceeds a threshold value.

> For example, we could specify a norm of 1.0, meaning that if the vector norm
for a gradient exceeds 1.0, then the vector's values will be rescaled so that the norm of the vector equals 1.0. This can be used in Keras by specifying the **clipnorm** argument on the optimizer; for example:

```python
# configure sgd with gradient norm clipping
opt = SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)
```
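
To make the rescaling concrete, the following is a minimal sketch in plain Python of the operation described above. The helper name `scale_gradient_by_norm` and the example gradient are hypothetical, and this is an illustration of the idea rather than the code Keras runs internally.

```python
import math


def scale_gradient_by_norm(gradient, max_norm):
    """Rescale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))   # L2 norm
    if norm <= max_norm:
        return list(gradient)                        # small enough, unchanged
    factor = max_norm / norm
    return [g * factor for g in gradient]


# example gradient with L2 norm 5.0 (a 3-4-5 triangle)
scaled = scale_gradient_by_norm([3.0, 4.0], max_norm=1.0)
print(scaled)
```

Because every element is multiplied by the same factor, the direction of the gradient is preserved and only its length changes.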

#### 2.3.1.2 Gradient Value Clipping

Gradient value clipping involves clipping the loss function's derivatives to have a given value if a gradient value is less than a negative threshold or more than the positive threshold.

> For example, we could specify a clip value of 0.5, meaning that if a gradient value is less than -0.5, it is set to -0.5, and if it is more than 0.5, it is set to 0.5. This can be used in Keras by specifying the **clipvalue** argument on the optimizer, for example:

```python
# configure sgd with gradient value clipping
opt = SGD(learning_rate=0.01, momentum=0.9, clipvalue=0.5)
```
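
The element-wise operation can be sketched in plain Python as follows. The helper name `clip_gradient_by_value` and the example gradient are hypothetical; this illustrates the idea and is not the Keras implementation.

```python
def clip_gradient_by_value(gradient, clip_value):
    """Force each gradient element into the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in gradient]


# elements outside [-0.5, 0.5] are clipped; the middle element is unchanged
clipped = clip_gradient_by_value([-2.0, 0.1, 0.7], clip_value=0.5)
print(clipped)  # [-0.5, 0.1, 0.5]
```

Each element is treated independently, so the direction of the clipped vector can differ from that of the original gradient.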

### 2.3.2 Gradient Clipping Case Study

This section will demonstrate how to use gradient clipping to counter the exploding gradients problem with an MLP on a simple regression problem. This example provides a template for exploring gradient clipping with your own neural network for regression problems.

In [None]:
# regression predictive modeling problem
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# generate regression dataset
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# histogram of target variable
plt.subplot(121)
plt.hist(y)

# boxplot of target variable
plt.subplot(122)
plt.boxplot(y)
plt.show()

#### 2.3.2.1 MLP With Exploding Gradients

We can develop a Multilayer Perceptron (MLP) model for the regression problem. A model will be demonstrated on the raw data without any scaling of the input or output variables. This is an excellent example to demonstrate exploding gradients as a model trained to predict the unscaled target variable will result in error gradients with values in the hundreds or even thousands, depending on the batch size used during training. Such large gradient values are likely to lead to unstable learning or an overflow of the weight values.
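
To see why unscaled targets produce such large gradients, recall that the derivative of the squared error $(\hat{y} - y)^2$ with respect to the prediction $\hat{y}$ is $2(\hat{y} - y)$. The sketch below evaluates this for a single example. The helper name, the target values (250 and 1.25), and the assumption that an untrained model predicts roughly 0.0 are hypothetical and chosen only for illustration.

```python
def squared_error_gradient(prediction, target):
    """Derivative of (prediction - target)**2 with respect to the prediction."""
    return 2.0 * (prediction - target)


# assumption: an untrained model predicts roughly 0.0 at the start of training
unscaled_grad = squared_error_gradient(prediction=0.0, target=250.0)
scaled_grad = squared_error_gradient(prediction=0.0, target=1.25)

print('unscaled target: %.1f, scaled target: %.1f' % (unscaled_grad, scaled_grad))
```

The gradient grows in proportion to the size of the error, so a target in the hundreds yields an error gradient in the hundreds before it is even propagated back through the weights.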

In [None]:
# mlp with unscaled data for the regression problem
from sklearn.datasets import make_regression
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate regression dataset
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(25, input_dim=20, 
                activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# compile model
model.compile(loss='mean_squared_error', 
              optimizer=SGD(learning_rate=0.01, momentum=0.9))

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y),
                    epochs=100, verbose=0)

# evaluate the model
train_mse = model.evaluate(train_x, train_y, verbose=0)
test_mse = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# plot loss during training
plt.title('Mean Squared Error')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

Running the example, the model almost immediately results in a NaN mean squared error, so the line plot of training history that is created does not show anything. This demonstrates that some intervention is required concerning the target variable for the model to learn this problem.

> A traditional solution would be to rescale the target variable using either standardization or normalization, and this approach is recommended for MLPs. Nevertheless, the alternative that we will investigate in this case is the use of gradient clipping.

#### 2.3.2.2 MLP With Gradient Norm Scaling

In [None]:
# mlp with unscaled data for the regression problem with gradient norm scaling
from sklearn.datasets import make_regression
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate regression dataset
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(25, input_dim=20, 
                activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)
model.compile(loss='mean_squared_error', optimizer=opt)

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y),
                    epochs=100, verbose=0)

# evaluate the model
train_mse = model.evaluate(train_x, train_y, verbose=0)
test_mse = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# plot loss during training
plt.title('Mean Squared Error')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

A line plot is also created showing the mean squared error loss on the train and test datasets over training epochs. The plot shows how loss dropped rapidly from large values above 20,000 down to small values below 100 over 20 epochs.

There is nothing special about the vector norm value of 1.0, and other values could be
evaluated and the performance of the resulting model compared.

#### 2.3.2.3 MLP With Gradient Value Clipping

Another solution to the exploding gradient problem is to clip the gradient values if they become too large or too small. We can update the training of the MLP to use gradient value clipping by adding the `clipvalue` argument to the optimization algorithm configuration. For example, the code below clips the gradient to the range $[-5, 5]$.

In [None]:
# mlp with unscaled data for the regression problem with gradient clipping
from sklearn.datasets import make_regression
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import matplotlib.pyplot as plt

# generate regression dataset
x, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# split into train and test
n_train = 500
train_x, test_x = x[:n_train, :], x[n_train:, :]
train_y, test_y = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(25, input_dim=20, 
                activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9, clipvalue=5.0)
model.compile(loss='mean_squared_error', optimizer=opt)

# fit model
history = model.fit(train_x, train_y,
                    validation_data=(test_x, test_y),
                    epochs=100, verbose=0)

# evaluate the model
train_mse = model.evaluate(train_x, train_y, verbose=0)
test_mse = model.evaluate(test_x, test_y, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# plot loss during training
plt.title('Mean Squared Error')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

A line plot is also created showing the mean squared error loss on the train and test datasets over training epochs. The plot shows that the model learns the problem quickly, achieving a sub-100 MSE loss within just a few training epochs.

A clipped range of $[-5, 5]$ was chosen arbitrarily; you can experiment with different sized ranges and compare the speed of learning and final model performance.