# Course 4 - Project - Part 7: Dense network

<a name="top-7"></a>
This notebook is concerned with *Part 7: Dense network*.

**Contents:**
* [Step 0: Loading data](#step-7.0)
* [Step 1: 1-layer dense network](#step-7.1)
* [Step 2: 2-layer dense network](#step-7.2)

## Step 0: Loading data<a name="step-7.0"></a> ([top](#top-7))
---

In [1]:
# !conda install -y nomkl

We begin with some imports.

In [2]:
# Standard library.
import pathlib
import typing as T

# 3rd party.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Project.
import utils

We load the datasets with the extracted high-level features.

In [3]:
separator = ''.center(80, '-')

path_train = pathlib.Path.cwd() / 'data' / 'swissroads-features-train.npz'
data_train = utils.load(path_train)
print(separator)
print(f'Dataset: train\n{utils.info(data_train)}')

path_valid = pathlib.Path.cwd() / 'data' / 'swissroads-features-valid.npz'
data_valid = utils.load(path_valid)
print(separator)
print(f'Dataset: valid\n{utils.info(data_valid)}')

path_test = pathlib.Path.cwd() / 'data' / 'swissroads-features-test.npz'
data_test = utils.load(path_test)
print(separator)
print(f'Dataset: test\n{utils.info(data_test)}')

--------------------------------------------------------------------------------
Dataset: train
data: shape=(280, 224, 224, 3), dtype=float32
label_idxs: shape=(280,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(280,), dtype=<U19
features: shape=(280, 1280), dtype=float32
--------------------------------------------------------------------------------
Dataset: valid
data: shape=(139, 224, 224, 3), dtype=float32
label_idxs: shape=(139,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(139,), dtype=<U19
features: shape=(139, 1280), dtype=float32
--------------------------------------------------------------------------------
Dataset: test
data: shape=(50, 224, 224, 3), dtype=float32
label_idxs: shape=(50,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(50,), dtype=<U19
features: shape=(50, 1280), dtype=float32


In [4]:
label_strs = data_train['label_strs']  # Same for all data sets.
assert (
    np.all(data_train['label_strs'] == data_valid['label_strs']) and
    np.all(data_train['label_strs'] == data_test['label_strs'])
)

X_train = data_train['data']
y_train = data_train['label_idxs']
F_train = data_train['features']
N_train = data_train['names']

X_valid = data_valid['data']
y_valid = data_valid['label_idxs']
F_valid = data_valid['features']
N_valid = data_valid['names']

X_test = data_test['data']
y_test = data_test['label_idxs']
F_test = data_test['features']
N_test = data_test['names']

We will fix the seed for the PRNGs in order to try to make computations deterministic.

In [5]:
RANDOM_SEED = 0
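
The cells below call `utils.reset_seeds()`, whose implementation is not shown in this notebook. As a rough illustration only, such a helper might reseed the global generators of Python and NumPy. The name `reset_seeds_sketch` is hypothetical, and a real helper would presumably also seed TensorFlow.

```python
import random

import numpy as np

RANDOM_SEED = 0


def reset_seeds_sketch(seed: int = RANDOM_SEED) -> None:
    """Hypothetical stand-in for utils.reset_seeds (not the project's code)."""
    random.seed(seed)     # Python's built-in PRNG.
    np.random.seed(seed)  # NumPy's legacy global PRNG.
    # Assumption: the real helper would also seed TensorFlow at this point.


# Reseeding before each draw reproduces the same numbers.
reset_seeds_sketch()
first = (random.random(), np.random.rand())
reset_seeds_sketch()
second = (random.random(), np.random.rand())
print(first == second)  # True
```

Reseeding only restores the generators covered by the helper; other sources of nondeterminism (e.g. parallel execution) may remain.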

## Step 1: 1-layer dense network<a name="step-7.1"></a> ([top](#top-7))
---

We begin with some imports.

In [6]:
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.model_selection import GridSearchCV
from tensorflow.keras import activations
from tensorflow.keras import initializers
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import optimizers

We define a utility function to build our Keras model.

In [7]:
# The number of classes.
n_classes = len(label_strs)
    

def build_model1(dropout_rate: float = 0.0,
                l2_alpha: float = 0.01):
    """\
    Builds, compiles and returns a Keras model.
    
    .. seealso:: https://keras.io/scikit-learn-api/
    """
    # Create model.
    model = keras.Sequential()

    # Add input layer.
    if dropout_rate:
        # Add drop-out layer.
        model.add(layers.Dropout(dropout_rate, seed=RANDOM_SEED, input_shape=(1280,)))
    else:
        model.add(layers.InputLayer(input_shape=(1280,)))

    # Add output layer.
    model.add(layers.Dense(
        units=n_classes, activation=activations.softmax,
        kernel_initializer=initializers.VarianceScaling(scale=1.0, seed=RANDOM_SEED),
        kernel_regularizer=keras.regularizers.l2(l=l2_alpha)
    ))

    # Compile the model.
    model.compile(
        optimizer=optimizers.Adam(),  # use defaults
        loss=losses.sparse_categorical_crossentropy,
        metrics=['acc']
    )

    return model

**Note:** Regarding regularization:
* *Dropout.* We were not sure whether it is a good idea to use dropout on the input layer. We decided to do it after reading online that this is done in the article that introduced the technique ([Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://jmlr.org/papers/v15/srivastava14a.html)).
* *L2 regularization.* We add an L2 penalty on the kernel weights of the dense layer. Its strength is controlled by `l2_alpha`, and a value of 0.0 disables it.
* *Early stopping.* We were not sure whether we should use early stopping. We decided against it after reading contradictory opinions online about combining it with grid-search (e.g. [this post](https://stackoverflow.com/a/48139341) on Stack Overflow).
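
To make the first two regularizers concrete, here is a minimal NumPy sketch of what they compute at training time. It is an illustration under stated assumptions, not the Keras implementation: the function names and toy arrays are hypothetical. Dropout zeroes a fraction `rate` of the inputs and rescales the survivors by `1 / (1 - rate)`, and the L2 penalty adds `alpha` times the sum of squared weights to the loss.

```python
import numpy as np

rng = np.random.default_rng(0)


def dropout_sketch(x: np.ndarray, rate: float) -> np.ndarray:
    """Inverted dropout at training time: zero a fraction `rate` of the
    inputs and rescale the survivors by 1 / (1 - rate)."""
    if rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


def l2_penalty_sketch(weights: np.ndarray, alpha: float) -> float:
    """L2 penalty added to the loss: alpha * sum of squared weights."""
    return float(alpha * np.sum(weights ** 2))


features = np.ones((4, 1280), dtype=np.float32)  # toy batch, not real data
dropped = dropout_sketch(features, rate=0.5)
# Each entry is either dropped (0.0) or rescaled (2.0), so the mean stays
# close to the original value of 1.0.
print(np.unique(dropped), float(dropped.mean()))

weights = np.full((1280, 6), 0.01)  # toy kernel matrix, same shape as the output layer
print(l2_penalty_sketch(weights, alpha=0.01))
```

At prediction time dropout is inactive, which is why the rescaling is applied during training.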

We define the grid of values to search.

In [8]:
nonzero_rates = [0.1, 1/4, 1/3, 1/2]

dropout_rates = (
    # No dropout.
    [0.0] + 
    # Input layer.
    nonzero_rates
)

l2_alphas = [0.0, 0.01]  # disable: 0.0, default: 0.01

param_grid1 = {
    'dropout_rate': dropout_rates,
    'l2_alpha': l2_alphas
}

**Note:** Regarding the grid-search: Given the ratio of training data (280 samples) to validation data (139 samples), using ``GridSearchCV`` with regular cross-validation would require at least 3 folds in order to train the model on folds of a size equivalent to the training set. This would take more time than we are willing to allocate. On the other hand, we very much like the convenience offered by ``GridSearchCV`` (parallelism, automatic refitting, results that are easy to convert to a data frame, etc.). It turns out that by manually providing the indices of the training and test sets, we can use ``GridSearchCV`` on a single fold. Since the training and validation sets are already properly "stratified" (similar distribution of classes), we can use them as-is.
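
The single-fold idea can be checked on toy arrays before running the real search. The sketch below uses placeholder labels (the `*_toy` names are hypothetical) with the same sizes as our sets. It builds one `(train indices, test indices)` pair, the form in which scikit-learn accepts custom splits through the `cv` argument.

```python
import numpy as np

# Toy stand-ins with the same sizes as the training and validation sets.
n_train, n_valid = 280, 139
y_train_toy = np.zeros(n_train, dtype=int)
y_valid_toy = np.ones(n_valid, dtype=int)

# Stack training samples first, validation samples second.
y_large_toy = np.concatenate([y_train_toy, y_valid_toy])

# One (train indices, test indices) pair is one "fold".
idx_train_toy = np.arange(0, n_train)
idx_valid_toy = np.arange(n_train, n_train + n_valid)
cv_toy = [(idx_train_toy, idx_valid_toy)]

# The indices recover the original sets and do not overlap.
assert np.array_equal(y_large_toy[idx_train_toy], y_train_toy)
assert np.array_equal(y_large_toy[idx_valid_toy], y_valid_toy)
assert np.intersect1d(idx_train_toy, idx_valid_toy).size == 0
print(len(cv_toy), idx_train_toy.size, idx_valid_toy.size)  # 1 280 139
```

With a single fold, the reported standard deviations of the scores are necessarily zero.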

In [9]:
X_train_large = np.concatenate([X_train, X_valid])
y_train_large = np.concatenate([y_train, y_valid])
F_train_large = np.concatenate([F_train, F_valid])
N_train_large = np.concatenate([N_train, N_valid])

We perform the grid-search.

In [10]:
# Try to make computations deterministic.
utils.reset_seeds()

model1 = keras.wrappers.scikit_learn.KerasClassifier(build_fn=build_model1)

# Compute the weight of each class.
class_weight = utils.get_class_weight(y_train_large)

# Compute the indices of the training and validation sets.
idx_train = np.arange(0, len(y_train))
idx_test = np.arange(len(y_train), len(y_train) + len(y_valid))
cv = [(idx_train, idx_test)]

nn_gscv1 = GridSearchCV(model1, param_grid1, n_jobs=-1, iid=False, refit=True, cv=cv, return_train_score=True)

# Fit/evaluate the estimator.
nn_gscv1.fit(F_train_large, y_train_large,
             batch_size=32, epochs=50, verbose=0,
             shuffle=True, class_weight=class_weight);
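
The helper `utils.get_class_weight` is part of the project and is not shown here. One common scheme, and only an assumption about what the helper computes, is "balanced" weighting, where each class is weighted inversely to its frequency. The function name and labels below are hypothetical.

```python
import numpy as np


def balanced_class_weight_sketch(y: np.ndarray) -> dict:
    """Hypothetical stand-in for utils.get_class_weight (assumed behavior):
    weight_c = n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    # Keras expects a dict mapping class index to weight.
    return {int(c): float(w) for c, w in zip(classes, weights)}


y_toy = np.array([0, 0, 0, 0, 1, 1, 2, 2])  # toy labels, not the real data
weights_toy = balanced_class_weight_sketch(y_toy)
print(weights_toy)
```

Under this scheme rarer classes receive larger weights, and the weighted sample count equals the original number of samples.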

We convert the results to a data frame.

In [11]:
df_report1 = (
    pd
    .DataFrame(nn_gscv1.cv_results_)
    .sort_values(by='mean_test_score', ascending=False)
)
df_report1[['param_dropout_rate', 'param_l2_alpha', 'mean_test_score', 'std_test_score', 'mean_train_score', 'std_train_score']].head()

| | param_dropout_rate | param_l2_alpha | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|
| 7 | 0.333333 | 0.01 | 0.920863 | 0.0 | 1.0 | 0.0 |
| 0 | 0.0 | 0.0 | 0.906475 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.01 | 0.906475 | 0.0 | 1.0 | 0.0 |
| 3 | 0.1 | 0.01 | 0.906475 | 0.0 | 1.0 | 0.0 |
| 4 | 0.25 | 0.0 | 0.906475 | 0.0 | 1.0 | 0.0 |


We compute the accuracy of the best model on the test set.

In [12]:
accuracy_test = nn_gscv1.best_estimator_.score(F_test, y_test)
print(f'test accuracy: {accuracy_test * 100:.1f} %')

test accuracy: 90.0 %


In [13]:
nn_gscv1.best_params_

{'dropout_rate': 0.3333333333333333, 'l2_alpha': 0.01}

In [14]:
# Persist the result.
desc = ', '.join([f'{key}={nn_gscv1.best_params_[key]}' for key in ['dropout_rate', 'l2_alpha']])
utils.persist_result('1-layer nn', 'part-07-a', desc, accuracy_test)

**Note:** We noticed by accident that manually refitting the model on the extended training set a few times yields a test accuracy that varies between 0.90 and 0.92 (and occasionally 0.94). The variation is probably due to the PRNGs being in a different state each time. Its amplitude is probably due to the very small training and validation sets.

In [15]:
scores = []
for _ in range(10):
    model1 = keras.wrappers.scikit_learn.KerasClassifier(build_fn=build_model1)
    model1.set_params(**nn_gscv1.best_params_)
    model1.fit(
        F_train_large, y_train_large,
        batch_size=32, epochs=50, verbose=0,
        shuffle=True, class_weight=class_weight)
    scores.append(model1.score(F_test, y_test))
print(f'min (max) score: {np.min(scores)} ({np.max(scores)})')

min (max) score: 0.9000000047683716 (0.9199999928474426)
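
To put this spread into perspective, the short calculation below converts the reported accuracies into numbers of correctly classified test images. It also gives a rough binomial standard error, which assumes independent test samples and is a back-of-the-envelope estimate rather than a formal analysis.

```python
import math

n_test = 50  # size of the test set

# One test image changes the accuracy by 1 / n_test.
resolution = 1 / n_test
print(f'resolution: {resolution:.2f}')

# Number of correctly classified test images behind each reported accuracy.
correct_counts = {}
for accuracy in (0.90, 0.92, 0.94):
    correct_counts[accuracy] = round(accuracy * n_test)
    print(f'{accuracy:.2f} -> {correct_counts[accuracy]} correct')

# Rough binomial standard error of an accuracy estimate near 0.90.
p = 0.90
std_error = math.sqrt(p * (1 - p) / n_test)
print(f'standard error: {std_error:.3f}')
```

The difference between 0.90 and 0.92 is therefore a single test image, well within one standard error.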


## Step 2: 2-layer dense network<a name="step-7.2"></a> ([top](#top-7))
---

We begin with some imports.

In [16]:
import itertools

We define a utility function to build our Keras model.

In [17]:
def build_model2(hidden_n_neurons: int,
                 dropout_rate: T.Tuple[float, float] = (0.0, 0.0),
                 l2_alpha: float = 0.01):
    """\
    Builds, compiles and returns a Keras model.
    
    .. seealso:: https://keras.io/scikit-learn-api/
    """
    dropout_rate_input, dropout_rate_hidden = dropout_rate
    
    # Create model.
    model = keras.Sequential()

    # Add input layer.
    if dropout_rate_input:
        # Add drop-out layer.
        model.add(layers.Dropout(dropout_rate_input, seed=RANDOM_SEED, input_shape=(1280,)))
    else:
        model.add(layers.InputLayer(input_shape=(1280,)))
        
    # Add hidden layer.
    if dropout_rate_hidden:
        # Add drop-out layer.
        model.add(layers.Dropout(dropout_rate_hidden, seed=RANDOM_SEED))
    model.add(layers.Dense(
        units=hidden_n_neurons, activation=activations.relu,
        kernel_initializer=initializers.VarianceScaling(scale=2.0, seed=RANDOM_SEED),
        kernel_regularizer=keras.regularizers.l2(l=l2_alpha)))

    # Add output layer.
    model.add(layers.Dense(
        units=n_classes, activation=activations.softmax,
        kernel_initializer=initializers.VarianceScaling(scale=1.0, seed=RANDOM_SEED),
        kernel_regularizer=keras.regularizers.l2(l=l2_alpha)
    ))

    # Compile the model.
    model.compile(
        optimizer=optimizers.Adam(),  # use defaults
        loss=losses.sparse_categorical_crossentropy,
        metrics=['acc']  # cannot use metrics.sparse_categorical_accuracy
    )

    return model

We define the grid of values to search.

In [18]:
nonzero_rates = [0.1, 1/4, 1/3, 1/2]

dropout_rates = (
    # No dropout.
    [(0.0, 0.0)] +
    # Hidden layer only.
    list(itertools.product([0.0], nonzero_rates)) +
    # Input and hidden layers (same rate).
    list(zip(nonzero_rates, nonzero_rates))
)

hidden_n_neurons = [10, 25, 50, 75, 100, 250, 500]

l2_alphas = [0.0, 0.01]  # disable: 0.0, default: 0.01

param_grid2 = {
    'dropout_rate': dropout_rates,
    'hidden_n_neurons': hidden_n_neurons,
    'l2_alpha': l2_alphas
}
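
As a quick sanity check on the cost of this search, the sketch below enumerates the grid with `itertools.product`. It repeats the definitions from the cell above so that it runs on its own. The variable `combinations` is introduced here for illustration only.

```python
import itertools

nonzero_rates = [0.1, 1/4, 1/3, 1/2]
dropout_rates = (
    [(0.0, 0.0)] +
    list(itertools.product([0.0], nonzero_rates)) +
    list(zip(nonzero_rates, nonzero_rates))
)
hidden_n_neurons = [10, 25, 50, 75, 100, 250, 500]
l2_alphas = [0.0, 0.01]

# Every combination corresponds to one model trained on the single fold.
combinations = list(itertools.product(dropout_rates, hidden_n_neurons, l2_alphas))
print(len(dropout_rates), len(combinations))  # 9 126
```

By comparison, the grid of the 1-layer network contains 5 × 2 = 10 combinations.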

We perform the grid-search.

In [19]:
# Try to make computations deterministic.
utils.reset_seeds()

model2 = keras.wrappers.scikit_learn.KerasClassifier(build_fn=build_model2)

# Compute the weight of each class.
class_weight = utils.get_class_weight(y_train_large)

# Compute the indices of the training and validation sets.
idx_train = np.arange(0, len(y_train))
idx_test = np.arange(len(y_train), len(y_train) + len(y_valid))
cv = [(idx_train, idx_test)]

nn_gscv2 = GridSearchCV(model2, param_grid2, n_jobs=-1, iid=False, refit=True, cv=cv, return_train_score=True)

# Fit/evaluate the estimator.
nn_gscv2.fit(F_train_large, y_train_large,
             batch_size=32, epochs=50, verbose=0,
             shuffle=True, class_weight=class_weight);

We convert the results to a data frame.

In [20]:
df_report2 = (
    pd
    .DataFrame(nn_gscv2.cv_results_)
    .sort_values(by='mean_test_score', ascending=False)
)
df_report2[['param_dropout_rate', 'param_hidden_n_neurons', 'param_l2_alpha', 'mean_test_score', 'std_test_score', 'mean_train_score', 'std_train_score']].head()

| | param_dropout_rate | param_hidden_n_neurons | param_l2_alpha | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|
| 26 | (0.0, 0.1) | 500 | 0 | 0.942446 | 0.0 | 1.0 | 0.0 |
| 122 | (0.5, 0.5) | 250 | 0 | 0.942446 | 0.0 | 1.0 | 0.0 |
| 96 | (0.25, 0.25) | 500 | 0 | 0.942446 | 0.0 | 1.0 | 0.0 |
| 98 | (0.3333333333333333, 0.3333333333333333) | 10 | 0 | 0.935252 | 0.0 | 1.0 | 0.0 |
| 114 | (0.5, 0.5) | 25 | 0 | 0.935252 | 0.0 | 1.0 | 0.0 |


We compute the accuracy of the best model on the test set.

In [21]:
accuracy_test = nn_gscv2.best_estimator_.score(F_test, y_test)
print(f'test accuracy: {accuracy_test * 100:.1f} %')

test accuracy: 94.0 %


In [22]:
nn_gscv2.best_params_

{'dropout_rate': (0.0, 0.1), 'hidden_n_neurons': 500, 'l2_alpha': 0.0}

In [23]:
# Persist the result.
desc = ', '.join([f'{key}={nn_gscv2.best_params_[key]}' for key in ['dropout_rate', 'hidden_n_neurons', 'l2_alpha']])
utils.persist_result('2-layer nn', 'part-07-b', desc, accuracy_test)