# Assignment 4 - Jonas Gstöttenmayr

## 1) Ingesting Data

In [2]:
import shutil
import os
from itertools import product
from utils import train_model

if os.path.exists("logs"):
    shutil.rmtree("logs")
    print("Logs directory cleared")
else:
    print("No logs directory found")

No logs directory found


In [3]:
# load data
import keras
import numpy as np
from skimage.transform import resize


sym_dim=8
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

X_train = X_train.reshape(-1, 28 * 28).astype("float32") / 255.0 #this is how we normally scale with images
X_test = X_test.reshape(-1, 28 * 28).astype("float32") / 255.0 #same for test


X_train = resize(X_train.reshape(-1, 28, 28), (len(X_train), sym_dim, sym_dim)).reshape(-1, sym_dim*sym_dim).astype("float32")
X_test = resize(X_test.reshape(-1, 28, 28), (len(X_test), sym_dim, sym_dim)).reshape(-1, sym_dim*sym_dim).astype("float32")

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")
print(f"Classes: {np.unique(y_train)}")


X_train: (60000, 64), y_train: (60000,)
X_test: (10000, 64), y_test: (10000,)
Classes: [0 1 2 3 4 5 6 7 8 9]


In [4]:
rng = np.random.default_rng(42)

## Exercise 1: Extended Random Search (7 points)

### Part a) Extending the Hyperparameter Search (6 points)

Based on the notebook `mnist_fashion_fcnn_simple.ipynb` from exercise 9, extend the hyperparameter search with additional parameters and implement a systematic search strategy. **Feel free to adapt/add/delete code in any way you see fit.**

**Tasks:**

1. **Extend the search space** with at least 3 additional hyperparameters beyond `hidden_layers` and `dropout_rate`.

2. **Modify the necessary functions:**

   - Update `create_hyperparams()` to generate configurations for your new parameters
   - Adapt `create_fcnn()` to accept and use these parameters
   - Modify `run_search()` to pass the parameters correctly

3. **Run at least 100 different configurations** and document your findings.

4. **Explain your choices:** For each hyperparameter you add, explicitly describe:
   - Why you chose this parameter
   - What range/values you selected and why
   - What impact you expect it to have

**Note:** You will need to modify multiple functions, but you can base everything on the existing `mnist_fashion_fcnn_simple.ipynb` notebook structure.

### Part b) Fixing the Data Leak (1 point)

There is a subtle data leak in the current search implementation.

**Hint:** Look carefully at how we perform validation in the `train_model()` function in `utils.py`. Consider what data the model sees during training and how this might affect the fairness of comparing different random seeds or configurations.

Identify the issue, explain what's wrong, and fix it.

### a)

In [5]:
len(X_train)

60000

In [6]:
import keras.optimizers as opt

In [7]:
def create_fcnn(
    input_dim: int,
    num_classes: int,
    hidden_layers: list[int] = [512, 256, 128],
    dropout_rate: float = 0.3,
    name: str = "fcnn",
    learning_rate: float = 2e-4,
    activation_function: str = "relu",
    optimizer: opt.Optimizer = opt.Adam,
) -> keras.Model:
    
    inputs = keras.layers.Input(shape=(input_dim,))
    x = inputs
    
    for _, units in enumerate(hidden_layers):
        x = keras.layers.Dense(units)(x)
        x = keras.layers.Activation(activation_function)(x)
        if dropout_rate > 0:
            x = keras.layers.Dropout(dropout_rate)(x)
    
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
    model = keras.Model(inputs=inputs, outputs=outputs, name=name)
    model.compile(
        optimizer=optimizer(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model

In [8]:
# testing to see if the creat fun works
basic_model = create_fcnn(
    input_dim=sym_dim*sym_dim,
    num_classes=10,
    optimizer=keras.optimizers.Adam,
    hidden_layers=[512, 256, 128],
    dropout_rate=0.3,
    name="fashion_fcnn"
)
basic_model.summary()

I0000 00:00:1765469683.175884   30171 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14793 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5


### b) Data Leak

What was wrong?

We trained on the whole training set and than evaluated on the first 10000 items of the dataset, so we were using data the model had trained on to evaluate it a clear data leak! 

I fixed this by simply training on only the first 50,000 images, than evaluationg on the next 10,000. The test set is reserved for the best model at we choose at the end to see it's actuall estimated accuracy.

In [None]:
import pandas as pd

def create_hyperparams(
    n: int,
    num_layers_range: list[int] = [1, 3],
    units_choices: list[int] = [128, 256, 512],
    dropout_range: tuple[float, float] = (0.1, 0.5),
    lr: tuple[float, float] = (5e-4, 25e-4), # new
    activation_function: list[str] = ["relu"], # new
    optimizers: list[keras.optimizers.Optimizer] = [opt.Adam]
) -> pd.DataFrame:
    
    params_list = []
    i = 0
    # making sure it uses different configurations, by going over all unique ones
    for mix in product(num_layers_range, units_choices, activation_function, optimizers):
        params_list.append({
            "hidden_layers": [mix[1] for _ in range(mix[0])],
            "dropout_rate": round(rng.uniform(*dropout_range), 3),
            "learning_rate": rng.uniform(*lr),
            "activation_function": mix[2],
            "optimzer": mix[3]
        })
        i+= 1
        if i == n:
            break
    return pd.DataFrame(params_list)


def run_search(
    hp_df: pd.DataFrame,
    X_train: np.ndarray,
    y_train: np.ndarray,
    epochs: int = 10,
) -> tuple[keras.Model, dict, pd.DataFrame]:
    results = hp_df.copy()
    results["val_acc"] = -1.0
    models = {}
    
    for i, row in hp_df.iterrows():
        params = {
            "hidden_layers": row["hidden_layers"],
            "dropout_rate": row["dropout_rate"],
            "learning_rate": row["learning_rate"],
            "activation_function": row["activation_function"],
            "optimizer": row["optimzer"],
        }
        
        model = create_fcnn(input_dim=X_train.shape[1], num_classes=10, name=f"trial_{i}", **params)
        
        layers_str = "_".join(map(str, params["hidden_layers"]))
        model_name = f"rs_L{layers_str}_D{int(params['dropout_rate'])}"
        
        model, _ = train_model(model, X_train[:50000], y_train[:50000], model_name=model_name, epochs=epochs, early_stopping=True, verbose=0)
        _, val_acc = model.evaluate(X_train[50000:], y_train[50000:], verbose=0)
        
        results.loc[i, "val_acc"] = round(val_acc, 4)
        models[i] = model
        
        print(f"Trial {i+1}/{len(hp_df)}: acc={val_acc:.4f}")
    
    results = results.sort_values("val_acc", ascending=False).reset_index(drop=True)
    best_idx = results.iloc[0].name if "name" in dir(results.iloc[0]) else 0
    
    print(f"\n✓ Best: val_acc={results.iloc[0]['val_acc']:.4f}")
    return models[hp_df.index[results.index[0]]], results.iloc[0].to_dict(), results

#### 4. The hyperparameters

I added:

- **learning rate** (lr)
    How large a step we make each adjustment of the weights

    My choise comes from the default keras learning rate of 1e-3 or 0.0001, I decided to randomly choose values in a range of 0.00005-0.00025, so between half and 2.5 times the original rate, as it would deviate by a reasonable amount but not too much

    This parameter also interacts with the optimizers.
    
    Expected impact: with low eochs I hope a higher learning rate increase the accuracy.
- **activation function**
    Which function the hidden layers use for activation and backprobagation

    Why? Honest answer: I just wanted to use all elu variants\
    Techincal answer: Altough we don't have to deep a network so even sigmoid could work it is still interesting which, and if how the activation function effects the model. Especially with the non-scalled data and the small number of epochs, the way the activation function makes our deltas which are backprobagated could/will effect how much we can adjust our weights from the initial value to fit the data.

    All the functions are chosen for beeing part of the elu familiy, allowing them to train deep models like relue.\
    Very sort overview:

    celu: relu, but continously differentiable\
    elu: relu, but continously differentiable and it does not have 0 for no activation but negative numbers\
    gelu: weigh inputs by their value, unlike relu which gates by sign, the activation is x multiplied by the cumulative probability of x in standard normal distribution\
    relu: good old reliable\
    selu: elu but it has fixed parameters that scale it\
    silu: ~~I missread it as selu before~~, or swish is is unbounded above 0 but bouned below, it multiplied x with the sigmoid of x, so if the sigmoid is close to 0 it will have a almost 0 activation

    Expected impact: Little to none, maybe with our smaller epcochs certain ones can help learn faster by modifying the weights more.
- **Optimizer**
    Which optimizer we use, the optimizer calculates the update step using the learning rate and calculated error.

    Optimizers differ in how they calculate the step size for the weights, thus they require different batch-sizes and learning rates.

    With it I want to see which optimizer is good for my range of learning rates and if I can get way with using the more memory efficient lion over Adam (important for large training sets which require heaps of memory)

    My options:
    Adam: the classical choise, Bodenhofer approved -> uses stochasitc gradient descent based on first and second-order\
    Lion: also stochastic gradient descent, but unlike adam it usees the sign operator to control the magnitude of updates, it is more memory efficient than adam as no second-order derivative is used.

    Expected impact: Rather small as both are quite good and memory is not an issue, at most one will be slightly better with the current learning rate.

Was there:
- the number of layers: I choose a lower number to not have to wait forever to finish training
- units_choises: how many neurons a hidden layer has, once more i choose lower numbers to reduce training time
- dropout range: slightly below the recommended range of 0-0.5, 0.5 drops to many and won't be too good.

In [11]:
hp_df = create_hyperparams(
    n=128,
    num_layers_range= [2,3,4],
    units_choices=[32, 48, 64],
    dropout_range=(0.0, 0.4),
    lr = (5e-4, 25e-4),
    activation_function = ["elu", "celu", "gelu", "relu", "selu", "silu"],
    optimizers=[opt.Adam, opt.Lion]
)
hp_df

Unnamed: 0,hidden_layers,dropout_rate,learning_rate,activation_function,optimzer
0,"[32, 32]",0.310,0.001378,elu,<class 'keras.src.optimizers.adam.Adam'>
1,"[32, 32]",0.343,0.001895,elu,<class 'keras.src.optimizers.lion.Lion'>
2,"[32, 32]",0.038,0.002451,celu,<class 'keras.src.optimizers.adam.Adam'>
3,"[32, 32]",0.304,0.002072,celu,<class 'keras.src.optimizers.lion.Lion'>
4,"[32, 32]",0.051,0.001401,gelu,<class 'keras.src.optimizers.adam.Adam'>
...,...,...,...,...,...
103,"[64, 64, 64, 64]",0.053,0.001855,relu,<class 'keras.src.optimizers.lion.Lion'>
104,"[64, 64, 64, 64]",0.049,0.001513,selu,<class 'keras.src.optimizers.adam.Adam'>
105,"[64, 64, 64, 64]",0.278,0.001662,selu,<class 'keras.src.optimizers.lion.Lion'>
106,"[64, 64, 64, 64]",0.080,0.002108,silu,<class 'keras.src.optimizers.adam.Adam'>


In [None]:
from pathlib import Path
results_df_path = "/home/azureuser/cloudfiles/code/Users/s2410929009/NDL3Repo/assignments/assignment_4/exports/results.parquet"
best_model_path = "/home/azureuser/cloudfiles/code/Users/s2410929009/NDL3Repo/assignments/assignment_4/exports/best_mode.keras"

The message: "Restoring model weights from the end of the best epoch: x."
does not mean weights are beeing transferd, but that early stopping is doing it's job, so no impact on grid search, simply kears choosing the best fit for it

Took roughly 45 min to train all 108 configurations

In [None]:
if Path(results_df_path).exists() and Path(best_model_path).exists:
    best_model = keras.saving.load_model()
    results_df = pd.read_parquet(results_df_path)
    best_params = results_df.iloc[0].to_dict()
else:
    best_model, best_params, results_df = run_search(hp_df, X_train, y_train, epochs=10) # changed to fewer epochs to safe on compute time
    # safing the results in case of loosing the state, don't want to retrain for another 40 min
    copy = results_df
    copy["optimzer"] = copy["optimzer"].astype(str)
    copy.to_parquet(results_df_path) 
    best_model.save(best_model_path)
results_df

Restoring model weights from the end of the best epoch: 10.
Trial 1/108: acc=0.8000
Restoring model weights from the end of the best epoch: 8.
Trial 2/108: acc=0.8202
Restoring model weights from the end of the best epoch: 10.
Trial 3/108: acc=0.8453
Restoring model weights from the end of the best epoch: 7.
Trial 4/108: acc=0.8204
Restoring model weights from the end of the best epoch: 10.
Trial 5/108: acc=0.8433
Restoring model weights from the end of the best epoch: 9.
Trial 6/108: acc=0.8216
Restoring model weights from the end of the best epoch: 10.
Trial 7/108: acc=0.8229
Restoring model weights from the end of the best epoch: 10.
Trial 8/108: acc=0.8183
Restoring model weights from the end of the best epoch: 9.
Trial 9/108: acc=0.8001
Restoring model weights from the end of the best epoch: 9.
Trial 10/108: acc=0.8103
Restoring model weights from the end of the best epoch: 10.
Trial 11/108: acc=0.8123
Restoring model weights from the end of the best epoch: 6.
Trial 12/108: acc=0.

Unnamed: 0,hidden_layers,dropout_rate,learning_rate,activation_function,optimzer,val_acc
0,"[64, 64, 64, 64]",0.049,0.002162,elu,<class 'keras.src.optimizers.adam.Adam'>,0.8597
1,"[64, 64]",0.080,0.000515,elu,<class 'keras.src.optimizers.lion.Lion'>,0.8585
2,"[64, 64]",0.056,0.000729,gelu,<class 'keras.src.optimizers.lion.Lion'>,0.8576
3,"[64, 64, 64]",0.041,0.001675,gelu,<class 'keras.src.optimizers.adam.Adam'>,0.8568
4,"[64, 64, 64, 64]",0.006,0.000959,relu,<class 'keras.src.optimizers.adam.Adam'>,0.8560
...,...,...,...,...,...,...
103,"[32, 32, 32]",0.289,0.001424,relu,<class 'keras.src.optimizers.lion.Lion'>,0.7625
104,"[64, 64, 64, 64]",0.278,0.001662,selu,<class 'keras.src.optimizers.lion.Lion'>,0.7245
105,"[32, 32, 32, 32]",0.331,0.002292,celu,<class 'keras.src.optimizers.lion.Lion'>,0.7211
106,"[48, 48, 48, 48]",0.357,0.001997,gelu,<class 'keras.src.optimizers.lion.Lion'>,0.7124


In [64]:
results_df.head(10)

Unnamed: 0,hidden_layers,dropout_rate,learning_rate,activation_function,optimzer,val_acc
0,"[64, 64, 64, 64]",0.049,0.002162,elu,<class 'keras.src.optimizers.adam.Adam'>,0.8597
1,"[64, 64]",0.08,0.000515,elu,<class 'keras.src.optimizers.lion.Lion'>,0.8585
2,"[64, 64]",0.056,0.000729,gelu,<class 'keras.src.optimizers.lion.Lion'>,0.8576
3,"[64, 64, 64]",0.041,0.001675,gelu,<class 'keras.src.optimizers.adam.Adam'>,0.8568
4,"[64, 64, 64, 64]",0.006,0.000959,relu,<class 'keras.src.optimizers.adam.Adam'>,0.856
5,"[64, 64]",0.086,0.001317,silu,<class 'keras.src.optimizers.lion.Lion'>,0.853
6,"[48, 48, 48]",0.039,0.002305,relu,<class 'keras.src.optimizers.adam.Adam'>,0.8521
7,"[64, 64, 64]",0.132,0.000789,celu,<class 'keras.src.optimizers.lion.Lion'>,0.8511
8,"[48, 48]",0.155,0.001077,silu,<class 'keras.src.optimizers.lion.Lion'>,0.8508
9,"[48, 48]",0.188,0.000879,gelu,<class 'keras.src.optimizers.lion.Lion'>,0.8504


In [65]:
results_df.tail(10)

Unnamed: 0,hidden_layers,dropout_rate,learning_rate,activation_function,optimzer,val_acc
98,"[32, 32, 32]",0.341,0.000968,elu,<class 'keras.src.optimizers.adam.Adam'>,0.7892
99,"[64, 64, 64, 64]",0.053,0.001855,relu,<class 'keras.src.optimizers.lion.Lion'>,0.7854
100,"[64, 64, 64, 64]",0.286,0.001978,silu,<class 'keras.src.optimizers.lion.Lion'>,0.7757
101,"[32, 32, 32, 32]",0.332,0.002117,silu,<class 'keras.src.optimizers.lion.Lion'>,0.7744
102,"[48, 48, 48, 48]",0.208,0.001132,relu,<class 'keras.src.optimizers.lion.Lion'>,0.7672
103,"[32, 32, 32]",0.289,0.001424,relu,<class 'keras.src.optimizers.lion.Lion'>,0.7625
104,"[64, 64, 64, 64]",0.278,0.001662,selu,<class 'keras.src.optimizers.lion.Lion'>,0.7245
105,"[32, 32, 32, 32]",0.331,0.002292,celu,<class 'keras.src.optimizers.lion.Lion'>,0.7211
106,"[48, 48, 48, 48]",0.357,0.001997,gelu,<class 'keras.src.optimizers.lion.Lion'>,0.7124
107,"[32, 32, 32, 32]",0.291,0.002037,relu,<class 'keras.src.optimizers.lion.Lion'>,0.6524


In [30]:
print("Best params: ")
for k, v in best_params.items():
    print(k, v)

Best params: 
hidden_layers [64, 64, 64, 64]
dropout_rate 0.049
learning_rate 0.002162225344249812
activation_function elu
optimzer <class 'keras.src.optimizers.adam.Adam'>
val_acc 0.8597


In [43]:
_, test_acc = best_model.evaluate(X_test, y_test)
print(f"For the best model the accuracy on the test set is: {test_acc*100:.2f}%")

[1m  1/313[0m [37m━━━━━━━━━━━━━━━━━━━━[0m [1m7s[0m 24ms/step - accuracy: 0.7500 - loss: 0.6479

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.8016 - loss: 0.5468
For the best model the accuracy on the test set is: 79.74%


### Result:

The bes model is the one with the most neurons per layer and most layers, indicating that we are most likely underfitting, so either more neurons or more epcohs (very likley) are needed.

Otherwise as mentioned above Lion is quite good with small learning rates but VERY bad with larger learning rates (the 9 worst models are with lion optimization and large learning rates) so adam is more resistant to the learning rate (be it small or large), but if the learning rate is chosen well Lion can match adam as it takes up quite a few spots in the top 10.

Generally lower dropout rates are favoured (most top 10 have smaller dropout rates), which makes sense since we are most likly underfitting and dropout is to prevent overfitting.

The activation functions are very diverse, altough it seems that relue specificly leans more towards the bad side of the models.

Rather interesting is that a large model canno't safe bad hyperparameters as a model the same size as the best one is among the worst 5 models just through it's large dropout rate + large learningrate/lion combination.

## Exercise 2: Keras Tuner (4 points)

Explore systematic hyperparameter optimization using Keras Tuner. The principle behind it is quite similar to the implementation we did by hand, however it has support for other search algorithms than random search. Documentation can be found under: https://keras.io/keras_tuner/.

**Tasks:**

1. **Choose and explain one optimization strategy:**

   - Either Hyperband Optimization
   - Or Bayesian Optimization

   Briefly describe how the chosen method works (search for an appropriate reference paper or other academic resource! Hint: look at what keras-tuner cites) and compare to random search.

2. **Implement the search:**

   - Use Keras Tuner with your chosen strategy on the Fashion MNIST dataset
   - Build a small comparison experiment with random search from exercise 1. (e.g. convergence speed.)
   - Is such an approach inherently better than random search with additional manual tuning? Reason in one sentence.


### 1) Explaining Hyperband optimization

https://jmlr.org/papers/volume18/16-558/16-558.pdf

It is a newer (2018) highly efficient optimization method, that tries to avoid the problems of the older (2012) baysian optimizaion.

The hyperband algorithm is an extension of the SuccessiveHalving (SH) Algorithm, which as it names suggests halves the amount of hyperparameter configurations it considers every "round". The SH algorithm uniformly allocates a budget to each set of hyperparameters before evaluating them and throwing out the worse half. As the set of parameters get ruduced the freed up budged can be used on the better half.

The way it exstends upon the SH Algorithm is by doing an easentially grid-sear over feasible values of how many reasources each set should be able to use from the getgo. It does this by running mulitple instances of SH in parrallel in the beginning with different resource allogations to make sure not to discard good solutions to early.

<!-- ## 2) Implementing Search -->

## 2 Implementing Search

In [63]:
import keras_tuner as kt
from keras import layers

In [None]:
# %%time

def hypermodel(hp):
    # It's good practice to add an Input layer or define the input_shape 
    # in the first Dense layer. Assuming input shape is 784 for flat images.
    
    # starts here
    inputs = keras.layers.Input(shape=(sym_dim*sym_dim,))
    x = inputs
    
    units = hp.Choice('units_1', [32, 48, 64])
    activation_function = hp.Choice("activation_1", ["elu", "celu", "gelu", "relu", "selu", "silu"])
    dropout_rate = hp.Float('dropout_rate', min_value=0.0, max_value=0.4, sampling='linear')
    for i in range(hp.Choice("layers", [2,3,4])):
        x = keras.layers.Dense(units)(x)
        x = keras.layers.Activation(activation_function)(x)
        if dropout_rate > 0:
            x = keras.layers.Dropout(dropout_rate)(x)
        
    optimizer_choice = hp.Choice('optimizer', ['adam', 'lion'])
    learning_rate = hp.Float('learning_rate', min_value=5e-4, max_value=25e-4, sampling='linear')
    if optimizer_choice == 'adam':
        optimizer = opt.Adam(learning_rate=learning_rate)
    else:
        optimizer = opt.Lion(learning_rate=learning_rate)
    
    outputs = keras.layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs=inputs, outputs=outputs, name="hypermodel")
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    
    return model

tuner = kt.Hyperband(
    hypermodel=hypermodel,
    objective="val_accuracy", # monitor validation metric, not training loss
    max_epochs=10,
    hyperband_iterations=2,
    seed=42,
    project_name="hyperband"
)

tuner.search(X_train[:50_000], y_train[:50_000], epochs=10, validation_data=(X_train[50_000:], y_train[50_000:]))
best_tuned = tuner.get_best_models()[0]
# hypermodel(keras_tuner.HyperParameters())
best_tuned

Trial 60 Complete [00h 00m 34s]
val_accuracy: 0.8299000263214111

Best val_accuracy So Far: 0.8632000088691711
Total elapsed time: 00h 19m 24s
CPU times: user 19min 55s, sys: 1min 54s, total: 21min 50s
Wall time: 17min 46s


<Functional name=hypermodel, built=True>

In [104]:
tuner.get_best_hyperparameters(num_trials=1)[0].values

{'units_1': 32,
 'activation_1': 'celu',
 'learning_rate': 0.0006040691120683395,
 'layers': 4,
 'optimizer': 'lion',
 'tuner/epochs': 10,
 'tuner/initial_epoch': 0,
 'tuner/bracket': 0,
 'tuner/round': 0}

In [107]:
_, test_acc = best_tuned.evaluate(X_test, y_test)
print(f"For the best model the accuracy on the test set is: {test_acc*100:.2f}%")

[1m307/313[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 2ms/step - accuracy: 0.8560 - loss: 0.3996

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - accuracy: 0.8560 - loss: 0.3996
For the best model the accuracy on the test set is: 85.53%


In [108]:
best_tuned_path = "/home/azureuser/cloudfiles/code/Users/s2410929009/NDL3Repo/assignments/assignment_4/exports/best_tuned.keras"

best_model.save(best_tuned_path)

### Results

This is quite the surpirse for me, as my approach for random search was more grid-search as I tried most options.

But what this method has over the ranodom search is the ability to test all the different params in the ranges for dropout and learning rate.

So while a surprise it is quite logical for it to have better training accuracy than the random search, what is more surprising than that though is that the test accuracy is almost as high as the validation accuracy, so the hyperband optimizer was able to find a not only a objectivly better solution (with less complexity to boot).

As such I would recommend the Hyperband algorithm for parameter tuning and will be using it again in the future.

Personally the most intersting part of this is that the model is less complex than the one of random search, disproving my earlier claim of possible underfitting, with the test accuracy of this smaller model beating the test accuracy of the larger one it might have actually overfitted and I was simply fooled by it's circumstances, I will leave my earlier speculation in this notebook though so this reveal can happen.

## Comparison

I will take my results in EX 1 as random search because it used an element of randomness, as to which configurations were tried.

Due to a bug in the naming the droupout rate is not output in the end parameters (it was applied though in the model)

Convergence speed: 

- random search: 45 mins
- hyperband: 19 mins

Best result: 

- **random search**\
    hidden_layers [64, 64, 64, 64]\
    dropout_rate 0.049\
    learning_rate 0.0021\
    activation_function elu\
    optimzer Adam\
    val acc 85.97%
    test set acc: 79.74%
- **hyperband**\
    hidden_layers [32, 32, 32, 32]\
    learning_rate: 0.0006\
    activation: celu'\
    optimizer lion\
    validation acc 86.32%
    test set acc: 85.53%


#### Better?

No, such a approach is not **inherently** better than random search, while smarter and on **avarage probably yielding better** results (more exploitation), it is much easier to get stuck in local optima, while random search could simply get lucky and land in an even deeper optima (much more exploration).