<a href="https://colab.research.google.com/github/nfleischmann/Gradient-Based-Methods-for-the-Training-of-Neural-Networks/blob/main/experiment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 5.1 Empirical Analysis of Local Minima

In this experiment, we will empirically investigate the relationship between the local minima of the objective function and the number of hidden neurons of the neural network.

In [None]:
# Load libraries
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

## Training Set

We use a small training set generated by the function `make_circles` of the library scikit-learn throughout the experiment. It consists of 100 observations, each with a real input $x \in \mathbb{R}^2$ and a binary output $y \in \{0,1\}$.
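
To make the underlying distribution concrete, the following is a minimal NumPy-only sketch of what `make_circles` produces when `factor=0`: class 0 lies on the unit circle, class 1 collapses to the origin, and Gaussian noise with standard deviation `noise` is added to both. The function name `sample_circles_sketch` and its `seed` argument are hypothetical, and the sketch does not reproduce the exact random numbers of scikit-learn.

```python
import numpy as np

def sample_circles_sketch(n_samples, noise=0.35, seed=0):
    """Hypothetical NumPy-only stand-in for make_circles(factor=0)."""
    rng = np.random.default_rng(seed)
    n_out = n_samples // 2
    n_in = n_samples - n_out

    # Class 0: points evenly spaced on the unit circle
    angles = np.linspace(0, 2 * np.pi, n_out, endpoint=False)
    outer = np.column_stack((np.cos(angles), np.sin(angles)))

    # Class 1: with factor=0 the inner circle collapses to the origin
    inner = np.zeros((n_in, 2))

    X_sketch = np.vstack((outer, inner))
    y_sketch = np.concatenate((np.zeros(n_out, dtype=int),
                               np.ones(n_in, dtype=int)))

    # Isotropic Gaussian noise with standard deviation `noise`
    X_sketch = X_sketch + rng.normal(scale=noise, size=X_sketch.shape)
    return X_sketch, y_sketch

X_sketch, y_sketch = sample_circles_sketch(10000)

# Average distance from the origin per class: class 1 stays close to the
# origin, class 0 scatters around the unit circle
radius = np.linalg.norm(X_sketch, axis=1)
print(radius[y_sketch == 0].mean(), radius[y_sketch == 1].mean())
```

Because the noise is large relative to the unit radius, the two classes overlap, so no classifier can separate them perfectly.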

In [None]:
# Generate 10000 observations according to the function make_circles
X,y = make_circles(n_samples=10000, shuffle=False, random_state=1, factor=0, noise=0.35)

# Split the data in training (n = 100) and test set (n = 9900)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99, random_state=1)

# Compute the training set size and the input dimension
n = X_train.shape[0]
d_0 = X_train.shape[1]

## Neural Networks

We consider neural networks with two input neurons, $L-1 \in \{1,2,3\}$ hidden layers, and one output neuron. All hidden layers contain the same number of neurons $d_i \in \{1, 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 75, 100\}$. For the hidden neurons, we use the leaky ReLU, whereas the logistic function is employed as the activation function for the output neuron.
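
To get a feeling for how the size of the parameter space grows with depth and width, the sketch below counts the weights and biases of such a fully connected network. The helper `count_parameters` is hypothetical and not part of the experiment; for networks built only from dense layers, its result should agree with what Keras reports via `model.count_params()`.

```python
def count_parameters(depth, width, d_0=2):
    """Number of weights and biases of a fully connected network with
    d_0 inputs, `depth` hidden layers of equal width, and one output."""
    layer_sizes = [d_0] + [width] * depth + [1]
    # Each layer has fan_in * fan_out weights and fan_out biases
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Parameter counts for the smallest, a medium, and the largest width
for depth in (1, 2, 3):
    print(depth, [count_parameters(depth, w) for w in (1, 5, 100)])
```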

In [None]:
def build_model(n_hidden, width, input_shape=[2], initializer_range=100, seed=42):
  """ Builds a neural network model

    Args:
        n_hidden (int): Number of hidden layers
        width (int): Width of the hidden layers
        input_shape (list): Shape of the input layer
          (default is [2])
        initializer_range(int): Range of the Uniform distribution used to initialize
                                the parameters
          (default is 100)
        seed (int): Seed that makes the initialization of the model reproducible
          (default is 42)

    Returns:
        tf.keras.Model: Initialized neural network model
  """
  # Define initializer
  initializer = tf.keras.initializers.RandomUniform(minval=-initializer_range, maxval=initializer_range) 

  # Initialize neural network model
  tf.random.set_seed(seed)
  model = keras.models.Sequential() 

  # Add input layer
  model.add(keras.layers.InputLayer(input_shape=input_shape))

  # Add hidden layers
  for layer in range(n_hidden): 
    model.add(keras.layers.Dense(width, 
                                 activation=keras.layers.LeakyReLU(alpha=0.1), 
                                 kernel_initializer=initializer, 
                                 bias_initializer=initializer))

  # Add output layer
  model.add(keras.layers.Dense(1,
                               activation="sigmoid", 
                               kernel_initializer=initializer, 
                               bias_initializer=initializer))

  return model

## Experimental Design

For each neural network, we search for local minima of the respective objective function and compute their cost. We use gradient descent to determine the local minima: it first makes up to 40,000 steps with a fixed step size and then lowers the step size adaptively (after 50 steps without descent, the step size is halved). The method stops after 10,000 steps in which the cost improved by less than $10^{-7}$, or after 120,000 steps at the latest. To find different local minima, we run gradient descent several times; for each run, all parameters of the neural network are initialized according to a continuous uniform distribution. In this way, gradient descent is started 50 times for the neural networks with one hidden layer and 15 times for those with two or three hidden layers.
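
The schedule described above can be illustrated on a toy problem. The following is a minimal plain-Python sketch, not the Keras implementation used in the experiment: the function `gradient_descent_with_halving`, its argument names, and the one-dimensional quadratic objective are all assumptions chosen for illustration, and the step counts are scaled down so that the example runs quickly.

```python
def gradient_descent_with_halving(cost, grad, theta, lr, n_constant, n_max,
                                  halve_patience=50, min_delta=1e-7,
                                  stop_patience=10_000):
    """Gradient descent with a constant step size for n_constant steps,
    followed by halving the step size after halve_patience steps without
    descent. Stops after stop_patience steps without an improvement of at
    least min_delta, or after n_max steps."""
    best_cost = cost(theta)   # lowest cost so far (halving rule)
    stop_ref = best_cost      # reference cost (stopping rule)
    no_descent = 0            # steps since the cost last decreased
    no_progress = 0           # steps since the cost improved by min_delta

    for step in range(1, n_max + 1):
        theta = theta - lr * grad(theta)
        current = cost(theta)

        # Stopping rule
        if current < stop_ref - min_delta:
            stop_ref, no_progress = current, 0
        else:
            no_progress += 1
        if no_progress >= stop_patience:
            break

        # Halving rule, active only after the constant phase
        if current < best_cost:
            best_cost, no_descent = current, 0
        else:
            no_descent += 1
        if step <= n_constant:
            no_descent = 0
        elif no_descent >= halve_patience:
            lr, no_descent = lr / 2, 0

    return theta, current, lr, step

# Toy objective f(theta) = 5 * theta^2 with a step size that is too large:
# the iterates jump between +1 and -1 until the step size is halved
theta_final, cost_final, lr_final, steps = gradient_descent_with_halving(
    cost=lambda t: 5.0 * t ** 2, grad=lambda t: 10.0 * t,
    theta=1.0, lr=0.2, n_constant=100, n_max=1000,
    halve_patience=5, stop_patience=200)
print(theta_final, cost_final, lr_final, steps)
```

In this toy run the constant phase makes no progress at all, which is exactly the situation the adaptive phase is meant to resolve.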

### Neural Networks with one Hidden Layer

In [None]:
n_hidden = 1

# Define the widths for which we will execute this experiment
widths = [1,2,3,4,5,10,15,20,30,40,50,75,100]
# Define seeds used for initializing the neural networks
seeds = list(range(50))

# Define a callback that stops gradient descent after 10,000 iterations without
# significant progress
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', 
                                              min_delta=0.0000001, 
                                              patience=10000)

# Define a callback that halves the learning rate when the loss has not improved
# for 40 iterations, then waits 10 cooldown iterations before monitoring again
# (about 50 iterations per halving)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss",
                                                 factor=0.5,
                                                 patience=40,
                                                 cooldown=10,
                                                 min_lr=0.0000001)

# Initialize the dataframe in which we will store the results
results_1_hidden = pd.DataFrame(columns=['n_hidden', 'width', 'cost'])
idx = 0 

for seed in seeds:
  for width in widths:
    # Build neural network with 1 hidden layer and 'width' hidden neurons
    model = build_model(n_hidden, width, seed=seed)

    # Perform (at most) 40,000 gradient descent steps with a constant step size
    model.compile(loss="binary_crossentropy",
                  optimizer=keras.optimizers.SGD(learning_rate=1))
    model.fit(X_train, y_train, 
              batch_size=n, 
              epochs=40000,
              callbacks=[early_stop])

    # Perform (at most) 80,000 gradient descent steps with decreasing step sizes
    history = model.fit(X_train, y_train, 
                        batch_size=n, 
                        epochs=80000,
                        callbacks=[early_stop, reduce_lr])

    # Fill results dataframe
    results_1_hidden.loc[idx, 'n_hidden'] = n_hidden
    results_1_hidden.loc[idx, 'width'] = width
    results_1_hidden.loc[idx, 'cost'] = history.history['loss'][-1]
    idx = idx + 1

### Neural Networks with Two or Three Hidden Layers


In [None]:
hidden_layers = [2,3]

# Define the widths for which we will execute this experiment
widths = [1,2,3,4,5,10,15,20,30,40,50,75,100]

# Define seeds used for initializing the neural networks
seeds = list(range(15))

# Define a callback that stops gradient descent after 10,000 iterations without
# significant progress
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', 
                                              min_delta=0.0000001, 
                                              patience=10000)

# Define a callback that halves the learning rate when the loss has not improved
# for 40 iterations, then waits 10 cooldown iterations before monitoring again
# (about 50 iterations per halving)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss",
                                                 factor=0.5,
                                                 patience=40,
                                                 cooldown=10,
                                                 min_lr=0.0000001)

# Initialize the dataframe in which we will store the results
results_2_3_hidden = pd.DataFrame(columns=['n_hidden', 'width', 'cost'])
idx = 0 

for n_hidden in hidden_layers:
  # Neural networks with more hidden layers are more sensitive to the step size,
  # so we choose a smaller step size for the deeper networks
  if n_hidden == 2:
    lr = 0.5
  elif n_hidden == 3:
    lr = 0.1

  for seed in seeds:
    for width in widths:

      # Build neural network
      model = build_model(n_hidden, width, initializer_range=1, seed=seed)

      # Perform (at most) 40,000 gradient descent steps with a constant step size
      model.compile(loss="binary_crossentropy",
                    optimizer=keras.optimizers.SGD(learning_rate=lr))
      model.fit(X_train, y_train, 
                batch_size=n, 
                epochs=40000,
                callbacks=[early_stop])

      # Perform (at most) 80,000 gradient descent steps with decreasing step sizes
      history = model.fit(X_train, y_train, 
                          batch_size=n, 
                          epochs=80000,
                          callbacks=[early_stop, reduce_lr])

      # Fill results dataframe
      results_2_3_hidden.loc[idx, 'n_hidden'] = n_hidden
      results_2_3_hidden.loc[idx, 'width'] = width
      results_2_3_hidden.loc[idx, 'cost'] = history.history['loss'][-1]
      idx = idx + 1

## Plots

In [9]:
# Import libraries
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

### Figure 5.1

Depicts the training set and its underlying distribution.

In [None]:
# Draw a large sample to estimate the underlying distribution
X_plot, y_plot = make_circles(n_samples=2000000, shuffle=False, random_state=1, factor=0, noise=0.35)
Xy_df = pd.DataFrame({'x1': X_plot[:, 0], 'x2': X_plot[:, 1], 'y': y_plot})

# Separate the data into the two classes
y_1 = Xy_df.loc[Xy_df['y'] == 1]
y_0 = Xy_df.loc[Xy_df['y'] == 0]

sns.set_style("white")

# Plot the estimated density of the class y = 1
fig, ax = plt.subplots(figsize=(12,12))
kde = sns.kdeplot(x=y_1.x1, y=y_1.x2, cmap="Reds", fill=True, bw_adjust=1)
kde.set_xlabel("$x_1$", fontsize=30)
kde.set_ylabel("$x_2$", fontsize=30)
kde.tick_params(labelsize=20)
ax.set(xlim=(-2,2),
       ylim=(-2,2))
plt.show()

# Plot the estimated density of the class y = 0
fig, ax = plt.subplots(figsize=(12,12))
kde = sns.kdeplot(x=y_0.x1, y=y_0.x2, cmap="Blues", fill=True, bw_adjust=1)
kde.set_xlabel("$x_1$", fontsize=30)
kde.set_ylabel("$x_2$", fontsize=30)
kde.tick_params(labelsize=20)
ax.set(xlim=(-2,2),
      ylim=(-2,2))
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
scatter = ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.coolwarm, s=120, edgecolors='k')
ax.set_xlabel('$x_1$', fontsize=30)
ax.set_ylabel('$x_2$', fontsize=30)
plt.xlim((-2,2))
plt.ylim((-2,2))
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(*scatter.legend_elements(), title="y", fontsize=30, title_fontsize=30)

### Figure 5.2

Depicts prediction functions that correspond to some of the found local minima.

In [13]:
def plot_predictions(model, X_train, y_train):
  """ Plot the predictions of the model along with the training set

    Args:
        model : Keras model
        X_train (np.ndarray): Inputs of the training set
        y_train (np.ndarray): Outputs of the training set
  """
  # Define the grid
  x_1 = x_2 = np.arange(-2, 2.1, 0.025)
  x_len = np.size(x_1)
  X_1, X_2 = np.meshgrid(x_1, x_2)

  # Bring the grid into the right format to make predictions:
  # one row per grid point, columns (x_1, x_2)
  X_12 = np.column_stack((X_1.ravel(), X_2.ravel()))
  
  # Make Predictions for every element of the grid
  z = model.predict(X_12)
  Z = np.reshape(z,(x_len, x_len))

  # Plot the results
  fig, ax = plt.subplots(figsize=(12,12))
  pred = ax.contourf(X_1, X_2, Z, cmap=plt.cm.RdBu_r, alpha=0.6,vmax=1, vmin=0)
  scatter = ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.coolwarm, s=120, edgecolors='k')
  ax.set_xlabel('$x_1$', fontsize=30)
  ax.set_ylabel('$x_2$', fontsize=30)
  plt.xticks(fontsize=20)
  plt.yticks(fontsize=20)
  plt.xlim((-2,2))
  plt.ylim((-2,2))

In [None]:
n_hidden = 1
width_seeds = [[1,4],[3,29],[5,44],[10,50],[20,42],[50,31]]

# Define a callback that stops gradient descent after 10,000 iterations without
# significant progress
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', 
                                              min_delta=0.0000001, 
                                              patience=10000)

# Define a callback that halves the learning rate when the loss has not improved
# for 40 iterations, then waits 10 cooldown iterations before monitoring again
# (about 50 iterations per halving)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss",
                                                 factor=0.5,
                                                 patience=40,
                                                 cooldown=10,
                                                 min_lr=0.0000001)

for width_seed in width_seeds:
  width = width_seed[0]
  seed = width_seed[1]
  
  # Initialize neural network according to the seed
  model = build_model(n_hidden, width, seed=seed)

  # Optimize the neural network
  model.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1))
  model.fit(X_train, y_train, 
            batch_size=n, 
            epochs=40000,
            callbacks=[early_stop])
  model.fit(X_train, y_train, 
            batch_size=n, 
            epochs=80000,
            callbacks=[early_stop, reduce_lr])

  # Plot the predictions
  plot_predictions(model, X_train, y_train)