# 13 Best practices for the real world

## Getting the most out of your models

### Hyperparameter optimisation

Deep learning engineers face many seemingly arbitrary decisions:

- How many layers?
- How many units per layer?
- What activation function(s)?
- Batch/layer normalisation?
- How much dropout/regularisation?
- What optimizer or learning rate?
- etc.

### Hyperparameter optimisation workflow

Throughout: **Document everything**.

1. Establish a *baseline* (one without a neural net, one with an untrained net);
2. Choose a set of hyperparameters;
3. Build the model;
4. Fit to training data and measure performance on validation data;
5. If good enough, stop and go to 7;
6. Otherwise, go back to 2;
7. Retrain with the same hyperparameters as the best run on *your entire training set* (partial train + validation, with no validation split any more), up until the epoch where your best model started overfitting, then evaluate on test data.
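
The loop above can be sketched in plain Python. This is only an illustration of the bookkeeping: `train_and_evaluate` is a hypothetical stand-in that returns a made-up score instead of building and fitting a model, and the baseline, the "good enough" threshold and the candidate configurations are all assumed values.

```python
import json

def train_and_evaluate(hparams):
    """Hypothetical stand-in for steps 3-4 (build, fit, validate).

    Replace the body with real model building and fitting; the formula
    below is made up so that the loop can run on its own.
    """
    score = 0.70 + 0.02 * hparams["n_layers"] - 0.1 * abs(hparams["dropout"] - 0.3)
    return round(score, 4)

baseline_acc = 0.60   # step 1: assumed baseline score
target_acc = 0.77     # assumed "good enough" threshold for step 5

candidates = [        # step 2: hand-picked configurations, tried in turn
    {"n_layers": 1, "dropout": 0.0},
    {"n_layers": 2, "dropout": 0.3},
    {"n_layers": 4, "dropout": 0.3},
    {"n_layers": 4, "dropout": 0.5},
]

log = []              # "Document everything": one record per run
for run_id, hparams in enumerate(candidates):
    val_acc = train_and_evaluate(hparams)            # steps 3-4
    log.append({"run": run_id, "hparams": hparams, "val_acc": val_acc})
    if val_acc >= target_acc:                        # step 5: good enough, stop
        break                                        # (otherwise step 6: next candidate)

best = max(log, key=lambda record: record["val_acc"])
print(json.dumps(log, indent=2))
print("best run:", best["run"], "| beats baseline:", best["val_acc"] > baseline_acc)
# Step 7 (not shown): retrain best["hparams"] on train + validation, then test.
```

Keeping the log as a list of plain dictionaries makes it easy to dump to a file after every run, so that no result is lost if a later run crashes.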

---

### Keras Tuner

Keras now comes with its own module for automated hyperparameter optimisation.

If you go for the Data Science option in Coursework 2, please **do not use it**!!  
We would like you to practice building this **yourself**.

You may use it in projects with a different focus.

#### References

[Documentation](https://keras.io/keras_tuner/)  
[TensorFlow tutorial](https://www.tensorflow.org/tutorials/keras/keras_tuner)

#### The future of hyperparameter tuning: automated machine learning

Automating hyperparameter search is an active area of research, known as AutoML. Here is [the Keras package trying to develop that](https://autokeras.com/).

---

### Model ensembling

Ensembling means pooling predictions from a number of **independent models**.

The idea is based on the observation that any model can only grasp **part of the truth**.

We hope that a *more diverse* set of models might access **more aspects of the truth**.

Ensembling generally produces the most competitive models.

Pooling can proceed by **averaging** model predictions.

But there is a disadvantage: a poor model can worsen the average, even dragging it below the performance of the best model.

The predictions can be weighted, with higher weights to better models.

The weights can even be automatically optimised.  
(Idea: give the best-performing model on the validation set the highest weight...)  

#### Equal weight for all predictions

```python
preds_a = model_a.predict(x_val) # Use four different
preds_b = model_b.predict(x_val) # models to compute
preds_c = model_c.predict(x_val) # initial predictions.
preds_d = model_d.predict(x_val)
             # ↓ This new prediction array should be more accurate than any of the initial ones.
final_preds = 0.25 * (preds_a + preds_b + preds_c + preds_d)
```

DLWP, p.420

#### Different weights for predictions

```python
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)
            # ↓ These weights (0.5, 0.25, 0.1, 0.15) are assumed to be learned empirically.
final_preds = 0.5 * preds_a + 0.25 * preds_b + 0.1 * preds_c + 0.15 * preds_d
```

DLWP, pp.420-1
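
One simple way to obtain such weights, following the idea of giving better validation performers a higher weight, is sketched below. This is a heuristic, not the method from DLWP: the "models" are synthetic stand-ins (noisy copies of the labels, with assumed noise levels), and `accuracy` is a helper defined only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for validation labels and four models' predicted probabilities.
y_val = rng.integers(0, 2, size=200)
noise_levels = [0.2, 0.3, 0.4, 0.6]   # assumed: model a is the best, model d the worst
preds = [np.clip(y_val + rng.normal(0, s, size=200), 0, 1) for s in noise_levels]

def accuracy(p, y):
    """Fraction of samples where thresholding the probability at 0.5 gives the label."""
    return float(np.mean((p > 0.5) == y))

# Heuristic: weight each model by its validation accuracy, normalised to sum to 1.
accs = np.array([accuracy(p, y_val) for p in preds])
weights = accs / accs.sum()

# Weighted average of the predictions, as in the snippet above.
final_preds = sum(w * p for w, p in zip(weights, preds))
print("weights:", weights.round(3))
print("ensemble accuracy:", accuracy(final_preds, y_val))
```

Since the weights are tuned on the validation set, the resulting ensemble should still be judged on data it has not seen, as with any other hyperparameter.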

#### Note: Diversity

The models in the ensemble need to be **as diverse** as possible.

A model in the ensemble might be comparatively poor and carry a low weight, yet still make a quantitative difference to the overall prediction, because it differs from the other models and may perform really well in specific situations.

The concern is for a **diversity of models** rather than how well the best model performs.

#### Note: different initialisations vs architectures

Avoid ensembling copies of the same model trained from different initialisations or with a different order of exposure to the data: they are not different enough!

Better to have different architectures or approaches (neural and not neural, for instance).

---

## Scaling-up model training

### Speeding up training on GPU with mixed precision

#### Understanding floating-point precision

There are three levels of precision you’d typically use:
- Half precision, or `float16`, where numbers are stored on 16 bits;
- Single precision, or `float32`, where numbers are stored on 32 bits;
- Double precision, or `float64`, where numbers are stored on 64 bits.

(DLWP, p.422)

Tradeoff: for some operations, you want more **precision** (requires more compute), for others you want more **speed** (the precision matters less).
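
To make the tradeoff concrete, here is a small NumPy sketch (NumPy rather than TensorFlow, only so that it runs anywhere). It prints the storage size and machine epsilon of each type, then checks whether a small increment of `1e-4` (an arbitrary value chosen for the illustration) survives when added to `1.0`.

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # itemsize: bytes per number; eps: gap between 1.0 and the next representable value.
    print(f"{info.dtype}: {np.dtype(dtype).itemsize} bytes, eps={info.eps}, max={info.max}")

small = 1e-4  # arbitrary small increment, e.g. the size of a weight update

# In float16 the increment is too small to register next to 1.0 ...
lost_in_half = np.float16(1.0) + np.float16(small) == np.float16(1.0)
# ... whereas float32 still resolves it.
kept_in_single = np.float32(1.0) + np.float32(small) != np.float32(1.0)
print("lost in float16:", lost_in_half, "| kept in float32:", kept_in_single)
```

Small increments vanishing in `float16` is one reason why lower precision is attractive for speed but risky for operations that accumulate small values.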

#### Manual tensor conversion

```python
import numpy as np
import tensorflow as tf

np_array = np.zeros((2, 2))                 # NumPy arrays default to float64
tf_tensor = tf.convert_to_tensor(np_array)
tf_tensor.dtype                             # tf.float64
```

```python
np_array = np.zeros((2, 2))
tf_tensor = tf.convert_to_tensor(np_array, dtype="float32")  # explicit conversion
tf_tensor.dtype                             # tf.float32
```

#### Mixed-precision training in practice

```python
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```

### Multi-GPU training

![Chollet  mirrored strategy](https://raw.githubusercontent.com/jchwenger/AI/main/lectures/10/images/chollet/figure18.2.png)

[DLWP](https://deeplearningwithpython.io/chapters/chapter18_best-practices-for-the-real-world/#model-parallelism-splitting-your-model-across-multiple-gpus), Figure 18.2

```python
# Create a “distribution strategy” object. (go-to solution: `MirroredStrategy`)
strategy = tf.distribute.MirroredStrategy()

print(f"Number of devices: {strategy.num_replicas_in_sync}")

with strategy.scope():           # Open scope: everything inside is distributed
    model = get_compiled_model() # All variables must be under the scope!
                                 # (Model construction and `compile()`)
```
```python
model.fit(                       # The training will automatically be
    train_dataset,               # distributed across devices!
    epochs=100,
    validation_data=val_dataset
)

```

#### Global batch size

When using `datasets`, the batch size you specify is a **global** batch size: each batch is split up across the devices.

You would then usually calculate it from the capacity of each GPU:

```python
batch_size = batch_size_per_device * n_devices
```
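
As a worked example of this arithmetic, here is a minimal sketch with assumed numbers (4 devices, 64 samples per device). The two helper functions are hypothetical names for this illustration, and the divisibility check is a choice made for the sketch rather than a documented TensorFlow requirement.

```python
def global_batch_size(batch_size_per_device, n_devices):
    """Batch size to give the dataset so each device receives `batch_size_per_device`."""
    return batch_size_per_device * n_devices

def per_device_batch_size(global_size, n_devices):
    """Share of a global batch that each device processes in one step."""
    if global_size % n_devices != 0:
        # This sketch insists on an even split to keep the arithmetic obvious.
        raise ValueError("global batch size is not divisible by the number of devices")
    return global_size // n_devices

n_devices = 4                 # assumed; in practice `strategy.num_replicas_in_sync`
batch_size_per_device = 64    # assumed to be what a single GPU can hold

batch_size = global_batch_size(batch_size_per_device, n_devices)
print("global batch size:", batch_size)
print("per device:", per_device_batch_size(batch_size, n_devices))
```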

### TPU training

#### Using a TPU via Google Colab

Colab still sometimes has a few available for free: try it with the starter code in [`lab-8-TPU`](https://github.com/jchwenger/AI/blob/main/labs/8-lab/lab-8-TPU.ipynb)! (Change the Runtime to TPU.)

This notebook also contains a link to another Colab notebook, which itself refers to **more** Colab notebooks!

#### TPU checklist

- TPUs have a lot of capacity: you can easily have larger batch sizes!
- Loading data can be a big bottleneck in these contexts. TensorFlow recommends using their [binary format TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord) (see also the [Keras page](https://keras.io/examples/keras_recipes/creating_tfrecords/));
- TPUs require that everything is compiled before it runs. You can even gain speed by *compiling several steps of training in one* (called **step fusing**): specify `steps_per_execution=n` in the `compile()` method.

# Summary

- the `Sequential` syntax;
- the `Functional` syntax;
- `Model` or `Layer` subclassing;
- Losses must match outputs!
- Multiple losses are **summed**, or combined as a weighted sum using `loss_weights`;
- Retrieve inner layers from models → build new ones;
- Mix & match: functional & subclassing can be combined;
- No `summary()` method with subclassing!

Personal advice: *when you reach subclassing, you may leave Keras behind and just use TF... :}*

### Hyperparameter search

- DL engineering requires exploration in hyperparameter space:
    - currently guided by intuition/experience;
    - automation is coming;
    - human or automatic search should be as **systematic** as possible;
    - three ideas:
        - grid search;
        - random search;
        - babysitting.
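
The first two ideas can be contrasted in a few lines of standard-library Python. This is a sketch with an assumed, hypothetical search space: it only generates the configurations to try and leaves out training and evaluation altogether.

```python
import itertools
import random

# Hypothetical search space: every name and value here is an assumption.
search_space = {
    "n_layers": [1, 2, 4],
    "units": [32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
}

def grid_search(space):
    """Yield every combination of values: exhaustive, but grows multiplicatively."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

def random_search(space, n_trials, seed=0):
    """Yield `n_trials` configurations, each value drawn independently at random."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {key: rng.choice(options) for key, options in space.items()}

grid_configs = list(grid_search(search_space))
random_configs = list(random_search(search_space, n_trials=5))

print(len(grid_configs), "grid configurations")   # 3 * 3 * 3 combinations
for config in random_configs:
    print(config)
```

Grid search covers the space evenly but its cost multiplies with every added hyperparameter, whereas random search lets you fix the budget (`n_trials`) in advance.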

### Ensembling

- Ensembling with the appropriate weighted average of contributions is very powerful;
- Component models should be as dissimilar as possible.