# Deep Learning
## Formative assessment
### Week 4: Neural network training

#### Instructions

In this notebook, you will write code to implement a binary classifier model in Keras. You will experiment with different sized models, datasets and regularisation techniques to validate your model and combat overfitting.

Some code cells are provided for you in the notebook. You should avoid editing the provided code, and make sure to execute the cells in order to avoid unexpected errors. Some cells begin with the line:

`#### GRADED CELL ####`

These cells require you to write your own code to complete them.

#### Let's get started!

We'll start by running some imports, and loading the dataset.

In [None]:
#### PACKAGE IMPORTS ####

# Run this cell first to import all required packages. Do not make any imports elsewhere in the notebook

import keras
from keras import ops
from keras.models import Sequential
from keras.layers import Input, Dense, BatchNormalization, Dropout
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

<center><img src="figures/lhc.jpg" title="Large Hadron Collider" style="width: 600px;"/></center>
<center><font style="font-size:12px">source: flickr/Image Editor <a href=http://www.flickr.com/>http://www.flickr.com/</a></font></center>

#### The HIGGS dataset
In this formative assessment, you will use the [HIGGS dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) from the UCI Machine Learning Repository. This dataset contains kinematic properties measured by the particle detectors in the accelerator, and a binary class label that distinguishes between a signal process which produces Higgs bosons and a background process which does not. For more information see the UCI website or the original paper:

* Baldi, P., Sadowski, P. & Whiteson, D. (2014), "Searching for Exotic Particles in High-energy Physics with Deep Learning", *Nature Communications* **5** 4308.

The full dataset contains 11,000,000 examples. We will be working with a small subset of the data in this assignment. Your goal is to develop a classifier to predict the presence of Higgs bosons using MLP models.

#### Load and prepare the data
For this assignment, you are provided with a subset of the HIGGS dataset. Note that the full dataset can be downloaded from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00280/), but it is not necessary to download it for this assignment. 

In [None]:
# Run this cell to load and describe the data

df = pd.read_csv(Path("./data/HIGGS-sample.csv"), header=None)
df.describe()

In [None]:
# View a sample of the data

df.sample(5)

The first column is the binary label, and the remaining columns are the features. 

In this assignment, we will use the `tf.data.Dataset` API. You should now complete the following function to build training, validation and test Datasets, according to the following specifications:

* Create a random train/validation/test data partition with an 80/10/10 percentage split
* Your function should be able to operate on a numeric `DataFrame` of any shape
* Load the separate splits into separate `tf.data.Dataset` objects
* Each Dataset should have an `element_spec` containing a single Tensor (of type `float32`) that represents an entire row of the CSV file
* The function should then return the tuple of `tf.data.Dataset` objects `(train_ds, valid_ds, test_ds)`
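
The split arithmetic in the specification above can be sketched in plain Python. This is only an illustration: the helper name `split_sizes` and its default fractions are assumptions made for this example, not part of the graded function. Truncating with `int` and giving the remainder to the test split means every row ends up in exactly one partition.

```python
def split_sizes(num_rows, train_frac=0.8, valid_frac=0.1):
    """Return (num_train, num_valid, num_test) for a dataset with num_rows rows."""
    num_train = int(num_rows * train_frac)  # int() truncates towards zero
    num_valid = int(num_rows * valid_frac)
    # The test split takes whatever is left, so no row is dropped
    num_test = num_rows - num_train - num_valid
    return num_train, num_valid, num_test


print(split_sizes(1000))  # (800, 100, 100)
print(split_sizes(11))    # (8, 1, 2)
```

When the number of rows is not a multiple of ten, the test split absorbs the rounding, as in the second call.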

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_datasets(dataframe):
    """
    This function takes in the loaded DataFrame, and builds training, validation
    and test Dataset objects as described above.
    Your function should return a tuple (train_ds, valid_ds, test_ds) of Datasets.
    """
    dataset_size = dataframe.shape[0]
    df = dataframe.sample(dataset_size)
    num_train = int(dataset_size * 0.8)
    num_valid = int(dataset_size * 0.1)
    train_ds = tf.data.Dataset.from_tensor_slices(df[:num_train].values.astype(np.float32))
    valid_ds = tf.data.Dataset.from_tensor_slices(df[num_train:num_train + num_valid].values.astype(np.float32))
    test_ds = tf.data.Dataset.from_tensor_slices(df[num_train + num_valid:].values.astype(np.float32))
    return train_ds, valid_ds, test_ds

In [None]:
# Run your function to create the Datasets

train_ds, valid_ds, test_ds = get_datasets(df)

In [None]:
# View the Dataset element_spec

train_ds.element_spec

You should now further process the Datasets, ready for training. The following functions will shuffle and batch the Datasets, and extract the input features and targets. 

First you should complete the following function to shuffle and batch the Datasets.

* The function takes `dataset` (a `tf.data.Dataset` object), `batch_size` and `shuffle_buffer` as inputs
* If `shuffle_buffer` is `None` (the default), then the Dataset should not be shuffled
* If `shuffle_buffer` is an integer, then it should be used to shuffle the Dataset
* The function should then batch the Dataset using `batch_size`
* Your function should then return the (maybe) shuffled and batched Dataset
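
To build intuition for what `shuffle_buffer` controls, here is a simplified pure-Python model of buffered shuffling followed by batching. It is a sketch of the general idea (sample from a fixed-size buffer that is refilled from the stream), not the `tf.data` implementation; the names `buffered_shuffle` and `make_batches` are hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomised order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the incoming one
    rng.shuffle(buffer)          # drain whatever is left at the end
    yield from buffer


def make_batches(items, batch_size):
    """Group items into lists of length batch_size (the last may be shorter)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


shuffled = buffered_shuffle(range(10), buffer_size=3, seed=0)
print(list(make_batches(shuffled, batch_size=4)))
```

In this model, a buffer smaller than the dataset only mixes nearby elements, while a buffer at least as large as the dataset gives a full shuffle. A buffer of size 1 leaves the order unchanged.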

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def shuffle_and_batch_dataset(dataset, batch_size, shuffle_buffer=None):
    """
    This function is used to shuffle and batch the dataset, using shuffle_buffer
    and batch_size.
    Your function should return the shuffled and batched Dataset.
    """
    if shuffle_buffer is not None:
        dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.batch(batch_size)
    return dataset

In [None]:
# Use your function to shuffle and batch the Datasets

train_ds = shuffle_and_batch_dataset(train_ds, 500, shuffle_buffer=1000)
valid_ds = shuffle_and_batch_dataset(valid_ds, 500)
test_ds = shuffle_and_batch_dataset(test_ds, 500)

The following `map_dataset` function should now extract the input features and targets.

Inside this function you should define an auxiliary function that you will use with the `map` method of the Dataset object. This auxiliary function should take the Tensor (as in the element_spec of the shuffled and batched Dataset), and return a tuple of two elements, with the input features in the first element, and the binary label in the second element. 

* The function takes `dataset` as an input (a `tf.data.Dataset` object)
* The function should define an inner function to extract the inputs and targets
* The inner function should be used to `map` over the Dataset
* The `map_dataset` should then return the mapped Dataset
* The resulting `element_spec` of the mapped Dataset should be a 2-tuple where the elements have shape `(batch_size, 28)` and `(batch_size, 1)` respectively
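
The shape requirement in the last bullet depends on slicing the label column rather than indexing it. The sketch below shows the same idea on nested Python lists; the function name `split_features_and_targets` and the toy rows are made up for this illustration.

```python
def split_features_and_targets(batch):
    """Split each row into (features, target), keeping the target as a length-1 list."""
    features = [row[1:] for row in batch]  # every column after the first
    targets = [row[:1] for row in batch]   # a slice keeps the trailing dimension
    return features, targets


toy_batch = [
    [1.0, 0.5, 0.2],  # label 1.0 followed by two features
    [0.0, 0.1, 0.9],  # label 0.0 followed by two features
]
features, targets = split_features_and_targets(toy_batch)
print(features)  # [[0.5, 0.2], [0.1, 0.9]]
print(targets)   # [[1.0], [0.0]]
```

Using `row[0]` instead of `row[:1]` would give one scalar per row, which corresponds to a shape of `(batch_size,)` rather than `(batch_size, 1)`.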

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def map_dataset(dataset):
    """
    This function is used to map over the Dataset object to extract the input features 
    and target variable. The function takes a Dataset object, and maps over the 
    Dataset to create input features and targets.
    Your function should return the mapped Dataset.
    """
    def extract_inputs_and_targets(batch_of_features):
        features = batch_of_features[..., 1:]
        targets = batch_of_features[..., :1]
        return features, targets
    return dataset.map(extract_inputs_and_targets)

In [None]:
# Use your function to map over the Datasets

train_ds = map_dataset(train_ds)
valid_ds = map_dataset(valid_ds)
test_ds = map_dataset(test_ds)

In [None]:
# Prefetch the Datasets

train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.prefetch(tf.data.AUTOTUNE)

In [None]:
# Print the Dataset element_spec

train_ds.element_spec

#### Build and train the small MLP model

You should now complete the following function to build an MLP classifier. We will experiment with different sized models, so this function needs to be able to build an MLP with different numbers of layers and units. 

The function should use the Sequential API, and build the model according to the following specifications:

* The function has `input_shape` and `hidden_units` arguments
* The `input_shape` should be used to define the Input layer
* The `hidden_units` argument is a list of integers (of any length), containing the number of units to use in subsequent `Dense` hidden layers
* Each `Dense` hidden layer should use a `selu` activation function
* There should also be a final output `Dense` layer with one unit and a linear (no) activation
* The function should then return the model
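
One quick sanity check on a model built to this specification is to compare the parameter count reported by `model.summary()` against a hand calculation. The helper below is a hypothetical sketch that assumes every `Dense` layer has a kernel and a bias (the Keras default).

```python
def count_mlp_parameters(num_inputs, hidden_units, num_outputs=1):
    """Count kernel and bias parameters of a fully connected network."""
    layer_sizes = [num_inputs, *hidden_units, num_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # kernel weights plus biases
    return total


print(count_mlp_parameters(28, [16, 16]))      # 753
print(count_mlp_parameters(28, [64, 64, 64]))  # 10241
```

The count grows roughly with the square of the layer width, which is why the wider models used later have far more capacity.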

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_mlp(input_shape, hidden_units):
    """
    This function is used to build the MLP model. It takes input_shape and hidden_units
    as arguments, which should be used to build the model as described above.
    Your function should return the model.
    """
    model = Sequential([
        Input(shape=input_shape),
    ])
    for units in hidden_units:
        model.add(Dense(units, activation='selu'))
    model.add(Dense(1, activation=None))
    return model

In [None]:
# Run your function to get the first (small) MLP

model = get_mlp(input_shape=(28,), hidden_units=[16, 16])
model.summary()

The following function defines the optimizer, loss function and metrics to use to compile the model. 

* The function should create an Adam optimizer object from the `keras.optimizers` module, with learning rate 0.0005
* It should also create an instance of the binary cross entropy loss from the `keras.losses` module, with the option `from_logits=True`, as the final layer of our model has a linear activation
* It should also create a binary accuracy metric object from the `keras.metrics` module, with the default settings
* The function should then return a tuple of the three objects `(optimizer, loss, metric)`
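
To see why `from_logits=True` matters, the sketch below computes the binary cross entropy for a single example directly from a logit, using a commonly used numerically stable formulation, and compares it with the version that first applies a sigmoid. This is an illustration in plain Python, not the Keras implementation, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y_true, logit):
    """Stable form: max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * y_true + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(y_true, prob):
    """Textbook form, which breaks down when prob is exactly 0 or 1."""
    return -(y_true * math.log(prob) + (1.0 - y_true) * math.log(1.0 - prob))


# Both forms agree for moderate logits
print(bce_from_logit(1.0, 2.0))
print(bce_from_probability(1.0, sigmoid(2.0)))

# The logit form stays finite for extreme values
print(bce_from_logit(0.0, 1000.0))  # 1000.0
```

A related detail: `BinaryAccuracy` thresholds the raw model output, so with a linear output layer the default threshold of 0.5 is applied to the logit, which corresponds to a probability of `sigmoid(0.5)`, roughly 0.62.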

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_metrics():
    """
    This function is used to create the optimizer, loss, metric objects. 
    Each of these should be created as instances from the corresponding classes in the
    optimizers, losses and metrics modules respectively, with the options as above.
    The function should then return the tuple (optimizer, loss, metric)
    """
    opt = keras.optimizers.Adam(learning_rate=0.0005)
    loss = keras.losses.BinaryCrossentropy(from_logits=True)
    acc = keras.metrics.BinaryAccuracy()
    return opt, loss, acc

The following function defines the `EarlyStopping` and `ModelCheckpoint` callbacks used to fit the model.

* The function takes a single input argument, `filepath`
* It should create the following callbacks:
  * An `EarlyStopping` callback, with patience set to 200
  * A `ModelCheckpoint` callback that saves the best model only (according to the validation loss), and saves weights only, using the filename `filepath`
* The function should then return a tuple of the two callback objects `(earlystopping, modelckpt)`
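
The interaction between the two callbacks can be sketched with a simplified patience rule in plain Python. This is an approximation for intuition only: it ignores options such as `min_delta`, and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both zero-indexed.

    stop_epoch is None if training would run to completion.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: a best-only checkpoint would be overwritten here
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


print(early_stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=2))  # (3, 1)
```

In this example training stops at epoch index 3, while the best weights come from epoch index 1. This is why the checkpoint file, rather than the final state of the model, is the natural thing to reload for evaluation.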

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_callbacks(filepath):
    """
    This function is used to create the callback objects. 
    Each of these should be created as instances of the corresponding classes in the
    callbacks module, with the options as above.
    The function should then return the tuple (earlystopping, modelckpt)
    """
    earlystopping = keras.callbacks.EarlyStopping(patience=200)
    modelckpt = keras.callbacks.ModelCheckpoint(filepath, save_best_only=True, 
                                                save_weights_only=True, monitor="val_loss")
    return earlystopping, modelckpt

The required Keras extension when saving model weights only is `.weights.h5` (for a full Keras model, it is `.keras`). The filepath can be passed into the `ModelCheckpoint` initialiser as a string or a `pathlib.Path` object.

In [None]:
# Run your functions to get the optimizer, loss, metric and callbacks

adam, bce_loss, bin_acc = get_metrics()
early_stopping, ckpt = get_callbacks(Path("./models/small.weights.h5"))

You are now ready to complete the following function to compile and fit the model.

* The function takes `model`, `optimizer`, `loss`, `num_epochs`, `train_dataset`, `validation_dataset`, `metrics` and `callbacks` arguments
* It should compile the `model` using the `optimizer`, `loss` and `metrics` list
* It should then fit the `model` using the `train_dataset`, `validation_dataset`, `num_epochs` arguments and `callbacks` list
* The `fit` method should be passed `verbose=0`, as there will be many epochs
* The function should then return the `History` object returned by the `fit` method

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def compile_and_fit(model, optimizer, loss, num_epochs, train_dataset, 
                    validation_dataset=None, metrics=None, callbacks=None):
    """
    This function should compile and fit the model according to the above specifications.
    It should then return the History object returned by the fit method
    """
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    history = model.fit(train_dataset, epochs=num_epochs, verbose=0,
                        validation_data=validation_dataset, callbacks=callbacks)
    return history

In [None]:
# Compile and fit the small MLP model

small_history = compile_and_fit(model, adam, bce_loss, 2000, train_ds, 
                                validation_dataset=valid_ds, metrics=[bin_acc],
                                callbacks=[early_stopping, ckpt])

In [None]:
# Plot the learning curves

fig = plt.figure(figsize=(14, 4))

fig.add_subplot(121)
plt.plot(small_history.history['loss'], label='train', color='C0', linestyle='-')
plt.plot(small_history.history['val_loss'], label='valid', color='C0', linestyle=':')
plt.title("Loss vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary cross entropy loss")
plt.legend()

fig.add_subplot(122)
plt.plot(small_history.history['binary_accuracy'], label='train', color='C0', linestyle='-')
plt.plot(small_history.history['val_binary_accuracy'], label='valid', color='C0', linestyle=':')
plt.title("Binary accuracy vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary accuracy")
plt.legend()

plt.show()

#### Build and train the medium and large MLP models
We will now see if we can improve the model performance by increasing the model capacity. We will reuse the `get_mlp` function to build these models.

In [None]:
# Build a medium-sized MLP model

model = get_mlp(input_shape=(28,), hidden_units=[64, 64, 64])
model.summary()

In [None]:
# Get fresh compile and fit arguments

adam, bce_loss, bin_acc = get_metrics()
early_stopping, ckpt = get_callbacks(Path("./models/medium.weights.h5"))

As we have silenced the printout from the `fit` method, we will create a custom callback to print the model progress less frequently than every epoch.

You should now complete the following class, which subclasses the base `Callback` class.

* The class initialiser takes one required argument, `num_epochs`, that defines the frequency to print logs
* After every `num_epochs` epochs of training, the class should print out a single line with the epoch number, training and validation loss and metric values
  * Make sure to account for the zero-indexing of Python, e.g. the first epoch is numbered 0, so your class should print `epoch + 1`
* The loss and metric values should be printed to 4 decimal places (_hint: use_ `f"{value:.4f}"`)
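
The print schedule and formatting described above can be tried out in isolation before wiring them into a callback. The function `format_progress` below is a hypothetical stand-alone sketch, and the log values are invented for the example.

```python
def format_progress(epoch, logs, num_epochs):
    """Return the progress line for a zero-indexed epoch, or None if nothing is due."""
    if (epoch + 1) % num_epochs != 0:
        return None
    values = ", ".join(f"{name}: {value:.4f}" for name, value in logs.items())
    return f"Epoch: {epoch + 1}, {values}"


example_logs = {"loss": 0.123456, "val_loss": 0.2}
print(format_progress(48, example_logs, 50))  # None
print(format_progress(49, example_logs, 50))  # Epoch: 50, loss: 0.1235, val_loss: 0.2000
```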

In [None]:
#### GRADED CELL ####

# Complete the following class. 
# Make sure to not change the class name or provided methods and signatures.

class PrintProgress(keras.callbacks.Callback):
    
    def __init__(self, num_epochs, **kwargs):
        """
        The initializer should call the base class initializer, forwarding any
        optional keyword arguments
        """
        super().__init__(**kwargs)
        self.num_epochs = num_epochs
        
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}  # guard against logs being None
        if (epoch + 1) % self.num_epochs == 0:
            loss_and_metrics = ', '.join([f'{k}: {v:.4f}' for k, v in logs.items()])
            print(f"Epoch: {epoch + 1}, {loss_and_metrics}")

In [None]:
# Create an instance of your callback class

print_progress = PrintProgress(num_epochs=50)

In [None]:
# Compile and fit the medium MLP model

medium_history = compile_and_fit(model, adam, bce_loss, 2000, train_ds, 
                                 validation_dataset=valid_ds, metrics=[bin_acc],
                                 callbacks=[early_stopping, print_progress, ckpt])

Finally, we will also build and train a large MLP model.

In [None]:
# Build a large-sized MLP model

model = get_mlp(input_shape=(28,), hidden_units=[512, 512, 512, 512])
model.summary()

In [None]:
# Get fresh compile and fit arguments

adam, bce_loss, bin_acc = get_metrics()
early_stopping, ckpt = get_callbacks(Path("./models/large.weights.h5"))

In [None]:
# Compile and fit the large MLP model

large_history = compile_and_fit(model, adam, bce_loss, 2000, train_ds, 
                                validation_dataset=valid_ds, metrics=[bin_acc],
                                callbacks=[early_stopping, print_progress, ckpt])

We now compare the performance of each model by plotting the training and validation loss and metrics.

In [None]:
# Plot the learning curves for all models

fig = plt.figure(figsize=(14, 6))

fig.add_subplot(121)
plt.plot(small_history.history['loss'], label='small (train)', color='C0', linestyle='-')
plt.plot(small_history.history['val_loss'], label='small (valid)', color='C0', linestyle=':')
plt.plot(medium_history.history['loss'], label='medium (train)', color='C1', linestyle='-')
plt.plot(medium_history.history['val_loss'], label='medium (valid)', color='C1', linestyle=':')
plt.plot(large_history.history['loss'], label='large (train)', color='C2', linestyle='-')
plt.plot(large_history.history['val_loss'], label='large (valid)', color='C2', linestyle=':')
plt.xscale('log')
plt.title("Loss vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary cross entropy loss")
plt.legend()

fig.add_subplot(122)
plt.plot(small_history.history['binary_accuracy'], label='small (train)', color='C0', linestyle='-')
plt.plot(small_history.history['val_binary_accuracy'], label='small (valid)', color='C0', linestyle=':')
plt.plot(medium_history.history['binary_accuracy'], label='medium (train)', color='C1', linestyle='-')
plt.plot(medium_history.history['val_binary_accuracy'], label='medium (valid)', color='C1', linestyle=':')
plt.plot(large_history.history['binary_accuracy'], label='large (train)', color='C2', linestyle='-')
plt.plot(large_history.history['val_binary_accuracy'], label='large (valid)', color='C2', linestyle=':')
plt.xscale('log')
plt.title("Binary accuracy vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary accuracy")
plt.legend()

plt.show()

#### Regularise the large model
As we can see clearly in the above plots, the large model achieves a low loss and high accuracy on the training data, but severely overfits it. We will now look to regularise this model.

First, you should write a new function, `get_regularised_mlp`, to build regularised MLP models.

The function should use the Sequential API, and build the model according to the following specifications:

* The function has `input_shape`, `hidden_units`, `l2_reg_coeff` and `dropout_rate` arguments
* The `input_shape` should be used to define the `Input` layer
* The `hidden_units` argument is a list of integers (of any length), containing the number of units to use in subsequent `Dense` hidden layers
* Each `Dense` hidden layer should use a `selu` activation function
* Each `Dense` layer should use the `l2_reg_coeff` argument to set kernel $l^2$ regularisation
* After each `Dense` hidden layer, there should be a `Dropout` layer with rate `dropout_rate`
* There should also be a final output `Dense` layer with one unit and a linear (no) activation
* The function should then return the model
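
The two regularisers in this specification can be illustrated numerically in plain Python. The sketch assumes the usual definitions: an $l^2$ penalty equal to the coefficient times the sum of squared kernel weights, and inverted dropout, which rescales the surviving activations during training. The function names are hypothetical.

```python
import random


def l2_penalty(kernel, l2_reg_coeff):
    """Penalty added to the loss: coefficient times the sum of squared weights."""
    return l2_reg_coeff * sum(w * w for row in kernel for w in row)


def inverted_dropout(activations, dropout_rate, rng):
    """Zero each activation with probability dropout_rate and rescale the rest."""
    keep_prob = 1.0 - dropout_rate
    return [a / keep_prob if rng.random() >= dropout_rate else 0.0
            for a in activations]


toy_kernel = [[1.0, 2.0], [3.0, 4.0]]
print(l2_penalty(toy_kernel, 0.1))  # approximately 3.0

rng = random.Random(0)
dropped = inverted_dropout([1.0] * 1000, 0.5, rng)
print(sum(dropped) / len(dropped))  # close to 1.0 on average
```

The rescaling keeps the expected activation unchanged, so no correction is needed at inference time, when dropout is switched off.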

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_regularised_mlp(input_shape, hidden_units, l2_reg_coeff, dropout_rate):
    """
    This function is used to build the regularised MLP model. It takes input_shape,
    hidden_units, l2_reg_coeff and dropout_rate as arguments, which should be used
    to build the model as described above.
    Your function should return the model.
    """
    model = Sequential([
        Input(shape=input_shape),
    ])
    for units in hidden_units:
        model.add(Dense(units, activation='selu', kernel_regularizer=keras.regularizers.l2(l2_reg_coeff)))
        model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation=None))
    return model

In [None]:
# Build a regularised version of the large MLP model

model = get_regularised_mlp(input_shape=(28,), hidden_units=[512, 512, 512, 512],
                            l2_reg_coeff=0.0001, dropout_rate=0.5)
model.summary()

In [None]:
# Get fresh compile and fit arguments

adam, bce_loss, bin_acc = get_metrics()
early_stopping, ckpt = get_callbacks(Path("./models/reg_large.weights.h5"))

In [None]:
# Compile and fit the regularised large MLP model

reg_large_history = compile_and_fit(model, adam, bce_loss, 2000, train_ds, 
                                    validation_dataset=valid_ds, metrics=[bin_acc],
                                    callbacks=[early_stopping, print_progress, ckpt])

In [None]:
# Plot the learning curves for all models

fig = plt.figure(figsize=(14, 6))

fig.add_subplot(121)
plt.plot(small_history.history['loss'], label='small (train)', color='C0', linestyle='-')
plt.plot(small_history.history['val_loss'], label='small (valid)', color='C0', linestyle=':')
plt.plot(medium_history.history['loss'], label='medium (train)', color='C1', linestyle='-')
plt.plot(medium_history.history['val_loss'], label='medium (valid)', color='C1', linestyle=':')
plt.plot(large_history.history['loss'], label='large (train)', color='C2', linestyle='-')
plt.plot(large_history.history['val_loss'], label='large (valid)', color='C2', linestyle=':')
plt.plot(reg_large_history.history['loss'], label='reg large (train)', color='C3', linestyle='-')
plt.plot(reg_large_history.history['val_loss'], label='reg large (valid)', color='C3', linestyle=':')
plt.xscale('log')
plt.title("Loss vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary cross entropy loss")
plt.legend()

fig.add_subplot(122)
plt.plot(small_history.history['binary_accuracy'], label='small (train)', color='C0', linestyle='-')
plt.plot(small_history.history['val_binary_accuracy'], label='small (valid)', color='C0', linestyle=':')
plt.plot(medium_history.history['binary_accuracy'], label='medium (train)', color='C1', linestyle='-')
plt.plot(medium_history.history['val_binary_accuracy'], label='medium (valid)', color='C1', linestyle=':')
plt.plot(large_history.history['binary_accuracy'], label='large (train)', color='C2', linestyle='-')
plt.plot(large_history.history['val_binary_accuracy'], label='large (valid)', color='C2', linestyle=':')
plt.plot(reg_large_history.history['binary_accuracy'], label='reg large (train)', color='C3', linestyle='-')
plt.plot(reg_large_history.history['val_binary_accuracy'], label='reg large (valid)', color='C3', linestyle=':')
plt.xscale('log')
plt.title("Binary accuracy vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary accuracy")
plt.legend()

plt.show()

Clearly the regularisation has helped to prevent overfitting in the large model.

#### Regularise with more data
Finally, we will demonstrate the regularising effect of more data. We will significantly increase the capacity of the network by making it wider and deeper.

You should now write the following function to build this model. The function should again use the Sequential API, and build the model according to the following specifications:

* The function has `input_shape`, `hidden_units`, `l2_reg_coeff` and `dropout_rate` arguments
* The `input_shape` should be used to define the `Input` layer
* The `hidden_units` argument is a list of integers (of any length), containing the number of units to use in subsequent `Dense` hidden layers
* Each `Dense` hidden layer should use a `selu` activation function
* Each `Dense` layer should use the `l2_reg_coeff` argument to set kernel $l^2$ regularisation
* After each `Dense` hidden layer, there should be a `BatchNormalization` layer
* After each `BatchNormalization` layer, there should be a `Dropout` layer with rate `dropout_rate`
* There should also be a final output `Dense` layer with one unit and a linear (no) activation
* The function should then return the model
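
For reference, the training-time computation performed by a batch normalisation layer on a single feature can be sketched in plain Python. This ignores the moving averages used at inference time. The defaults `gamma=1.0`, `beta=0.0` and `epsilon=1e-3` are assumptions chosen to mirror the layer's initial state, and the function name is hypothetical.

```python
def batch_norm_training(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalise one feature across the batch, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / (variance + epsilon) ** 0.5 + beta
            for v in values]


normalised = batch_norm_training([1.0, 2.0, 3.0, 4.0])
print(normalised)
print(sum(normalised) / len(normalised))  # approximately 0.0
```

The output has approximately zero mean and unit variance across the batch, whatever the scale of the incoming activations.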

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_regularised_bn_mlp(input_shape, hidden_units, l2_reg_coeff, dropout_rate):
    """
    This function is used to build the regularised MLP model with batch normalisation.
    It takes input_shape, hidden_units, l2_reg_coeff and dropout_rate as arguments,
    which should be used to build the model as described above, using the Sequential API.
    Your function should return the model.
    """
    model = Sequential([
        Input(shape=input_shape),
    ])
    for units in hidden_units:
        model.add(Dense(units, activation='selu', kernel_regularizer=keras.regularizers.l2(l2_reg_coeff)))
        model.add(BatchNormalization())
        model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation=None))
    return model

In [None]:
# Build a huge MLP model

model = get_regularised_bn_mlp(input_shape=(28,), hidden_units=[1024, 1024, 1024, 512, 512, 512],
                               l2_reg_coeff=0.0001, dropout_rate=0.5)
model.summary()

In [None]:
# Get fresh compile and fit arguments

adam, bce_loss, bin_acc = get_metrics()
early_stopping, ckpt = get_callbacks(Path("./models/huge.weights.h5"))

In [None]:
# Compile and fit the huge MLP model

huge_history = compile_and_fit(model, adam, bce_loss, 2000, train_ds, 
                               validation_dataset=valid_ds, metrics=[bin_acc],
                               callbacks=[early_stopping, print_progress, ckpt])

In [None]:
# Plot the performance of the huge model and large regularised model

fig = plt.figure(figsize=(14, 7))

fig.add_subplot(121)
plt.plot(reg_large_history.history['loss'], label='reg large (train)', color='C3', linestyle='-')
plt.plot(reg_large_history.history['val_loss'], label='reg large (valid)', color='C3', linestyle=':')
plt.plot(huge_history.history['loss'], label='huge (train)', color='C4', linestyle='-')
plt.plot(huge_history.history['val_loss'], label='huge (valid)', color='C4', linestyle=':')
plt.xscale('log')
plt.title("Loss vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary cross entropy loss")
plt.legend()

fig.add_subplot(122)
plt.plot(reg_large_history.history['binary_accuracy'], label='reg large (train)', color='C3', linestyle='-')
plt.plot(reg_large_history.history['val_binary_accuracy'], label='reg large (valid)', color='C3', linestyle=':')
plt.plot(huge_history.history['binary_accuracy'], label='huge (train)', color='C4', linestyle='-')
plt.plot(huge_history.history['val_binary_accuracy'], label='huge (valid)', color='C4', linestyle=':')
plt.xscale('log')
plt.title("Binary accuracy vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary accuracy")
plt.legend()

plt.show()

As we can see, this huge model is again overfitting to the training data. We will now retrain the same model on a much larger dataset to see the regularising effect of more data.

There is an additional CSV saved at `'./data/HIGGS-sample-extra.csv'` that contains an extra 50,000 data examples. You should now complete the following function to construct training, validation and test datasets using this CSV.

* The function takes the `csv_path`, `batch_size`, `map_dataset` function and `shuffle_buffer` size as arguments
* Your function should read in the CSV to a pandas `DataFrame`, using the `csv_path`. Make sure to use the option `header=None`
* Your function should then be able to operate on a numeric `DataFrame` of any shape
* It should then randomly shuffle the `DataFrame` and as before, create a train/validation/test data partition with a 80/10/10 percentage split
* It should load these data splits into `tf.data.Dataset` objects
* It should shuffle the training Dataset using the `shuffle_buffer` size, if it is not `None`
* It should then batch the training, validation and test Datasets using the `batch_size`
* It should then use your `map_dataset` function as above to parse the data into input features and targets
* Finally, it should then make a call to `prefetch`, with the argument `tf.data.AUTOTUNE` for each Dataset
* Your function should return the tuple of Dataset objects `(train_ds, valid_ds, test_ds)`

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_more_data(csv_path, batch_size, map_dataset=map_dataset, shuffle_buffer=None):
    """
    This function takes in the CSV filepath, batch_size, map_dataset function, and
    shuffle_buffer size. It should create train/valid/test Datasets according to the
    above specifications.
    Your function should then return the tuple (train_ds, valid_ds, test_ds) of Datasets.
    """
    df = pd.read_csv(csv_path, header=None)
    dataset_size = df.shape[0]
    df = df.sample(dataset_size)
    
    num_train = int(dataset_size * 0.8)
    num_valid = int(dataset_size * 0.1)
    train_ds = tf.data.Dataset.from_tensor_slices(df[:num_train].values.astype(np.float32))
    valid_ds = tf.data.Dataset.from_tensor_slices(df[num_train:num_train + num_valid].values.astype(np.float32))
    test_ds = tf.data.Dataset.from_tensor_slices(df[num_train + num_valid:].values.astype(np.float32))
    
    if shuffle_buffer is not None:
        train_ds = train_ds.shuffle(shuffle_buffer)
    
    train_ds = train_ds.batch(batch_size)
    valid_ds = valid_ds.batch(batch_size)
    test_ds = test_ds.batch(batch_size)
    
    train_ds = map_dataset(train_ds)
    valid_ds = map_dataset(valid_ds)
    test_ds = map_dataset(test_ds)
    
    train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
    valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)
    test_ds = test_ds.prefetch(tf.data.AUTOTUNE)
    
    return train_ds, valid_ds, test_ds

In [None]:
# Run your function to get the extra Datasets

train_ds_extra, valid_ds_extra, test_ds_extra = get_more_data(Path('./data/HIGGS-sample-extra.csv'), 500,
                                                              map_dataset, 1000)

In [None]:
# Concatenate the new Datasets with the existing ones

train_ds_full = train_ds_extra.concatenate(train_ds)
valid_ds_full = valid_ds_extra.concatenate(valid_ds)
test_ds_full = test_ds_extra.concatenate(test_ds)

In [None]:
# Build another instance of the huge MLP model

model = get_regularised_bn_mlp(input_shape=(28,), hidden_units=[1024, 1024, 1024, 512, 512, 512],
                               l2_reg_coeff=0.0001, dropout_rate=0.5)
model.summary()

In [None]:
# Get fresh compile and fit arguments

adam, bce_loss, bin_acc = get_metrics()
early_stopping, ckpt = get_callbacks(Path("./models/huge_full.weights.h5"))

This is now a very large model trained on a much bigger dataset, so it will take some time to train - you might want to go make yourself a cup of tea or coffee while it's running!

In [None]:
# Compile and fit the huge MLP model on the expanded dataset

huge_full_history = compile_and_fit(model, adam, bce_loss, 2000, train_ds_full, 
                                    validation_dataset=valid_ds_full, metrics=[bin_acc],
                                    callbacks=[early_stopping, print_progress, ckpt])

In [None]:
# Plot the performance of the huge models and large regularised model

fig = plt.figure(figsize=(14, 7))

fig.add_subplot(121)
plt.plot(reg_large_history.history['loss'], label='reg large (train)', color='C3', linestyle='-')
plt.plot(reg_large_history.history['val_loss'], label='reg large (valid)', color='C3', linestyle=':')
plt.plot(huge_history.history['loss'], label='huge (train)', color='C4', linestyle='-')
plt.plot(huge_history.history['val_loss'], label='huge (valid)', color='C4', linestyle=':')
plt.plot(huge_full_history.history['loss'], label='huge full (train)', color='C5', linestyle='-')
plt.plot(huge_full_history.history['val_loss'], label='huge full (valid)', color='C5', linestyle=':')
plt.xscale('log')
plt.title("Loss vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary cross entropy loss")
plt.legend()

fig.add_subplot(122)
plt.plot(reg_large_history.history['binary_accuracy'], label='reg large (train)', color='C3', linestyle='-')
plt.plot(reg_large_history.history['val_binary_accuracy'], label='reg large (valid)', color='C3', linestyle=':')
plt.plot(huge_history.history['binary_accuracy'], label='huge (train)', color='C4', linestyle='-')
plt.plot(huge_history.history['val_binary_accuracy'], label='huge (valid)', color='C4', linestyle=':')
plt.plot(huge_full_history.history['binary_accuracy'], label='huge full (train)', color='C5', linestyle='-')
plt.plot(huge_full_history.history['val_binary_accuracy'], label='huge full (valid)', color='C5', linestyle=':')
plt.xscale('log')
plt.title("Binary accuracy vs epochs")
plt.xlabel("Epochs")
plt.ylabel("Binary accuracy")
plt.legend()

plt.show()

We will conclude by evaluating each model on the held-out test dataset.

In [None]:
# Collect evaluation loss and metrics for each model

saved_models = {
    'small': {"build_fn": get_mlp, "args": {"input_shape": (28,), "hidden_units": [16, 16]}},
    'medium': {"build_fn": get_mlp, "args": {"input_shape": (28,), "hidden_units": [64, 64, 64]}},
    'large': {"build_fn": get_mlp, "args": {"input_shape": (28,), "hidden_units": [512, 512, 512, 512]}},
    'reg_large': {"build_fn": get_regularised_mlp, "args": {
        "input_shape": (28,), "hidden_units": [512, 512, 512, 512], 
        "l2_reg_coeff": 0.0001, "dropout_rate": 0.5
    }},
    'huge': {"build_fn": get_regularised_bn_mlp, "args": {
        "input_shape": (28,), "hidden_units": [1024, 1024, 1024, 512, 512, 512], 
        "l2_reg_coeff": 0.0001, "dropout_rate": 0.5
    }},
    'huge_full': {"build_fn": get_regularised_bn_mlp, "args": {
        "input_shape": (28,), "hidden_units": [1024, 1024, 1024, 512, 512, 512], 
        "l2_reg_coeff": 0.0001, "dropout_rate": 0.5
    }}
}

evaluation = {"Model": [], "Test loss": [], "Test accuracy": []}
for model_size, options in saved_models.items():
    model = options['build_fn'](**options['args'])
    adam, bce_loss, bin_acc = get_metrics()
    model.compile(loss=bce_loss, optimizer=None, metrics=[bin_acc])
    model.load_weights(Path(f"./models/{model_size}.weights.h5"))
    results = model.evaluate(test_ds_full, return_dict=True, verbose=0)
    evaluation["Model"].append(model_size)
    evaluation["Test loss"].append(results['loss'])
    evaluation["Test accuracy"].append(results['binary_accuracy'])
    
pd.DataFrame(evaluation)

In [None]:
# Clean up

! rm -r ./models

Congratulations on completing this week's assignment! In this assignment you have experimented with model capacity and various forms of regularisation, and seen their effects on the model training and performance.