# MLOps for RSEs - 3. Model Development

In the previous notebook, we looked at setting up a data pipeline to preprocess our data, ready for use in a machine learning pipeline where we create and train models to make predictions for our example problem. In this notebook we will look at the key principles and tools, from the point of view of a Research Software Engineer (RSE), for building such a model training pipeline.

The key idea is that machine learning models are assets to be tracked and shared, like code and data. Users of the models need to know how they were trained, including the input data, the architecture of the model, the hyperparameters used and many other details, so that the results can be understood and trusted. This is what RSEs routinely do for code, so much of what is in this notebook is not new in principle, although it is sometimes implemented slightly differently to handle particular aspects of ML models.

![Met Office Logo](https://www.metoffice.gov.uk/webfiles/1661941781161/images/icons/social-icons/default_card_315.jpg)

## How to train your ~~dragon🐉~~ machine learning model 🤖 (reproducibly, efficiently, at scale)
Training a machine learning model is easy, at least as of 2022: there are many libraries and tools available, along with good documentation and tutorials to help researchers who are new to machine learning apply it to their problem. In general, then, training a small machine learning model does not require special RSE support. As with many research tasks, the challenges appear when you take a toy example trained on a small initial dataset and try to scale it up to a large, complex, real-world dataset. The technical infrastructure demands of machine learning at scale then become apparent, and this is where the expertise of an RSE becomes valuable: setting up the tools and infrastructure so that compute can easily be scaled to match the demands of the research.

In addition, in a research context one is rarely training a single large model with well-defined parameters; all the elements of the pipeline are part of the research. Researchers will likely want to perform many experiments, training different models with slightly altered inputs and hyperparameters. The amount of data and other assets produced by such ML experiments can quickly become overwhelming. *Experiment tracker* software is vital, especially in a research context, to manage and compare the results of different machine learning experiments and select the best results.

For larger models it is also important to be able to track performance while they are training, to see whether continued training is still improving the key metrics or whether improvements have stalled. Training dashboard tools let researchers check on the status of models during what can be quite a long training process.
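The idea of checking whether improvements have stalled can be sketched in a few lines. This is a minimal, library-free illustration rather than what any dashboard actually runs; the function name `has_stalled`, the `patience` and `min_delta` arguments and the loss values are all made up for the example.

```python
def has_stalled(val_losses, patience=5, min_delta=1e-4):
    """Return True if the validation loss has not improved by at least
    `min_delta` during the last `patience` epochs."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta


# Hypothetical validation-loss histories, one value per epoch.
improving = [1.0, 0.8, 0.65, 0.55, 0.48, 0.42, 0.37, 0.33]
plateaued = [1.0, 0.6, 0.4, 0.35, 0.35, 0.36, 0.35, 0.35, 0.36]

print("improving run stalled?", has_stalled(improving, patience=3))
print("plateaued run stalled?", has_stalled(plateaued, patience=3))
```

Keras offers the same idea as the `EarlyStopping` callback, which stops training automatically once a monitored metric stops improving.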

The steps for this part of the project are typically as follows:
* *create train/validate/test splits* - a key practice in training ML models is to train with only a portion of the data, and hold back smaller portions for evaluating model performance, both during the development cycle (validate or dev split) and right at the end of the project to check that the final solution generalises to unseen data (test split).
* *prepare the data* - Although most of the work will likely have been done during the data prep part of the project, there is usually at least a small amount of "last mile" preparation needed for the particular ML framework and architecture being used, in particular normalising the data.
* *build the model architecture* - Use the chosen modelling framework (e.g. TensorFlow, PyTorch) to set up the model.
  * *log hyperparameters* - log the details of the model architecture and any hyperparameters with the experiment tracking tool, to be able to recreate the architecture later.
* *train the model* - Run the training algorithm to determine the parameters for the model that best fit the training data.
  * *monitor training* - Training is usually the longest and most compute intensive part of the process for most machine learning projects. While it is ongoing, there are various tools to monitor the current state of the model, so you can abort training if training is going wrong and generally check if it is converging and loss is decreasing.
  * *log training metrics* - Either during training or afterwards, log the performance metrics for the model with the experiment tracking tool.
* *save the trained model* - Once the model is trained, you want to save the definition of the architecture and the weights to some sort of persistent storage for subsequent inference.
  * *log trained model with experiment tracker* - The architecture and weights/thresholds can also be saved as an artifact of the run in the experiment tracking tool, so that all the configuration and outputs of a particular run are accessible in the same place.

A key part of model development is finding the correct *hyperparameters* for the model. Hyperparameters are the configuration elements of the model that are not determined by training. Elements that are calculated during training, for example the weights and biases of a neural network, are called parameters. Those that are not, for example the number of hidden layers or the learning rate for a neural network, are called hyperparameters. Hyperparameters are usually found by trial and error for a specific problem. An additional outer loop that searches for the best hyperparameters is usually employed; this is called hyperparameter tuning. It is an inherently parallel process, which involves training many models independently.
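To make the distinction concrete, here is a minimal sketch using a toy linear model fitted by gradient descent instead of a neural network. The data, function names and candidate learning rates are invented for illustration. The slope `w` and intercept `b` are parameters found by training, while `learning_rate` and `n_epochs` are hyperparameters chosen from outside; the loop at the end is a very small manual tuning loop.

```python
def train_linear_model(xs, ys, learning_rate, n_epochs):
    """Fit y = w * x + b by gradient descent on the mean squared error.

    learning_rate and n_epochs are hyperparameters (chosen by us);
    w and b are parameters (found by training).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the fitted line on the given data."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# A tiny manual tuning loop: train once per candidate learning rate.
results = {}
for learning_rate in (0.001, 0.01, 0.05):
    w, b = train_linear_model(xs, ys, learning_rate, n_epochs=200)
    results[learning_rate] = mse(xs, ys, w, b)

best_learning_rate = min(results, key=results.get)
print(results)
print("best learning rate:", best_learning_rate)
```

Each iteration of the outer loop is independent of the others, which is why this kind of search parallelises so easily.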

### Key Principles
* *reusability* - ensure the elements of training infrastructure are easy to reuse/adapt for different models and projects.
* *scalability* - ensure researchers can select the size of training dataset and the size/complexity of the model architecture to match their requirements, rather than having to restrict the scale of their experiments due to technical limitations.
* *reproducibility* - provide tools to systematically record the inputs, configuration and outputs of each model training run, so that experiments can be tracked, results reproduced and evidence compiled for reports and papers.


### Key Tasks for RSEs
* Setting up infrastructure for training models
* Facilitating running code on ML targeted platforms e.g. GPU, TPU etc.
* Support good practices for ML development e.g. experiment tracking
* Applying FAIR principles to all ML assets (data, code, trained models)
* Applying good code management practices to ML code
* Setting up test suites for ML pipelines


### Key Terms
* *experiment tracking* - logging all of the configuration and outputs from a machine learning experiment in one place, allowing researchers to easily compare results from different runs, select the best run and compile evidence for papers, reports and other research outputs.
* *machine learning pipeline/workflow* - A series of tasks centred around training and evaluating machine learning models.
* *train/test split* - Splitting the available data into separate sets, so that some of the data is used to train the model and the rest, which the model is not exposed to during training, is used to evaluate how well the model generalises to independent data.
* *training* - The process of finding the *parameters* for a model with the specified architecture that best fit the training data.
* *hyperparameters* - Hyperparameters are the parts of the configuration of the model that are not determined by training. These include the structure or architecture of the model, such as the number of hidden layers or number of nodes in a layer for a neural network, or the depth of a decision tree.
  * *hyper parameter tuning* - As hyperparameters are not optimised as part of the training process, a separate outer training loop, called hyperparameter tuning, is run to automate the search for the optimal hyperparameters for the problem.

### Key Tools
* ML frameworks (scikit-learn, TensorFlow, PyTorch)
* Workflow tools (ray)
* Experiment tracking (MLflow)

### Running this notebook
This notebook should run from a conda environment created with the [requirements_model_development.yml file](requirements_model_development.yml). See the [readme file](https://github.com/informatics-lab/ukrse_2022_mlops_walkthrough/blob/main/README.md) for info on how to set up a conda environment for using this notebook.

## Example problem - Predicting wind rotors

In the previous notebook, we introduced the example problem of predicting wind rotor events using machine learning. Now we are going to train a model to do this prediction for that problem. We will use this example problem to demonstrate the following parts of training a model:
* loading the preprocessed data from an Intake catalog.
* setting up an experiment and run in our experiment tracking tool, MLflow, to track and store the configuration and outputs of our experiment.
* monitoring the training process using TensorBoard (not particularly necessary for such a small example, of course, but a lot more valuable for larger models and datasets).
* saving the model for subsequent inference in notebook 4 (model evaluation).

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pathlib
import datetime
import os
import functools
import math

In [3]:
import matplotlib
%matplotlib inline

In [4]:
import numpy
import pandas
import dask

In [5]:
import sklearn
import sklearn.preprocessing
import sklearn.model_selection

In [6]:
import tensorflow

import tensorflow.keras
import tensorflow.keras.layers
import tensorflow.keras.models
import tensorflow.keras.optimizers
import tensorflow.keras.metrics
import tensorflow.keras.layers
import tensorflow.keras.constraints

In [7]:
import tensorboard

In [8]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [9]:
import mlflow
mlflow.tensorflow.autolog()



In [10]:
import intake

In [11]:
mlflow_dash_port = 5001
tensorboard_dash_port = 5002
ray_dash_port = 5003

### Load the data

In the previous notebook we explored the data and prepared it for use in training a machine learning model. We saved the resulting dataset in a catalog, and now we will use that catalog to load the data. The advantage of this approach is that it is easy for others to find this dataset by looking at the catalog. For example, you could imagine a common catalog for a research group that allows members to find datasets they can use for their projects. This is useful not just for ML projects. In addition, catalogs support separation of tasks and responsibilities within a project and team: those responsible for creating and curating the data make it available through the catalog, and those responsible for the ML modelling aspects of the project can access the data from the catalog and get on with their part of the project.

In [12]:
try:
    rse_root_data_dir = pathlib.Path(os.environ['RSE22_ROOT_DATA_DIR'])
    print('reading from environment variable')
except KeyError as ke1:
    rse_root_data_dir = pathlib.Path(os.environ['HOME'])  / 'data' / 'ukrse2022'
    print('using default path')
rse_root_data_dir

using default path


PosixPath('/home/h01/shaddad/data/ukrse2022')

In [13]:
rotors_catalog = intake.open_catalog(rse_root_data_dir / 'rotors_catalog.yml')
rotors_catalog 

rotors_catalog:
  args:
    path: /home/h01/shaddad/data/ukrse2022/rotors_catalog.yml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}


In [14]:
list(rotors_catalog)

['rotors', 'rotors_preprocessed']

We see that our catalog contains preprocessed data ready to use in our machine learning development pipeline.

In [15]:
rotors_df = rotors_catalog['rotors_preprocessed'].read()

In [16]:
rotors_df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,air_temp_obs,dewpoint_obs,wind_direction_obs,wind_speed_obs,wind_gust_obs,air_temp_1,air_temp_2,air_temp_3,...,v_wind_18,u_wind_19,v_wind_19,u_wind_20,v_wind_20,u_wind_21,v_wind_21,u_wind_22,v_wind_22,time
0,1,1,283.9,280.7,110.0,4.1,-9999999.0,284.000,283.625,283.250,...,5.756768,-1.953409,5.673111,-2.674064,5.482644,-3.000000e+00,5.196152,-2.987221,4.971570,2015-01-01 00:00:00
1,2,2,280.7,279.7,90.0,7.7,-9999999.0,281.500,281.250,280.750,...,6.502872,-1.460878,5.094687,-0.790064,3.716961,-7.837740e-16,3.200000,0.727691,3.423517,2015-01-01 03:00:00
2,3,3,279.8,278.1,100.0,7.7,-9999999.0,279.875,279.625,279.125,...,5.481273,-1.423505,5.312592,-0.174497,4.996954,7.293223e-01,4.136193,2.462646,3.152043,2015-01-01 06:00:00
3,4,4,279.9,277.0,120.0,7.2,-9999999.0,279.625,279.250,278.875,...,2.475770,-1.311123,3.245143,-0.407661,3.878635,6.883116e-01,4.345829,1.723190,4.265046,2015-01-01 09:00:00
4,5,5,279.9,277.4,120.0,8.7,-9999999.0,279.250,278.875,278.375,...,-0.775695,-1.997259,0.104672,-1.928942,1.252670,-1.287595e+00,2.142918,-0.899056,2.225241,2015-01-01 12:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17481,20101,20101,276.7,275.5,270.0,3.6,-9999999.0,277.875,277.750,277.625,...,-8.555992,-8.047581,-8.629974,-7.479073,-8.603689,-7.111320e+00,-8.781749,-6.538771,-9.338333,2020-12-31 06:00:00
17482,20102,20102,277.9,276.9,270.0,3.1,-9999999.0,277.875,277.625,277.875,...,-6.956383,-8.273280,-6.942106,-8.886116,-7.456336,-8.995651e+00,-8.388580,-8.029567,-8.917738,2020-12-31 09:00:00
17483,20103,20103,283.5,277.1,220.0,3.6,-9999999.0,281.125,280.625,280.125,...,-8.332875,-7.326372,-9.377328,-8.397556,-9.660283,-7.962654e+00,-8.843423,-7.495332,-7.495332,2020-12-31 12:00:00
17484,20104,20104,286.1,276.9,250.0,3.6,-9999999.0,284.625,284.125,283.625,...,-6.646804,-5.294689,-6.776892,-4.398330,-7.038799,-5.356255e+00,-6.855694,-7.265332,-7.016050,2020-12-31 15:00:00


In [17]:
list(rotors_df.columns)

['Unnamed: 0',
 'Unnamed: 0.1',
 'air_temp_obs',
 'dewpoint_obs',
 'wind_direction_obs',
 'wind_speed_obs',
 'wind_gust_obs',
 'air_temp_1',
 'air_temp_2',
 'air_temp_3',
 'air_temp_4',
 'air_temp_5',
 'air_temp_6',
 'air_temp_7',
 'air_temp_8',
 'air_temp_9',
 'air_temp_10',
 'air_temp_11',
 'air_temp_12',
 'air_temp_13',
 'air_temp_14',
 'air_temp_15',
 'air_temp_16',
 'air_temp_17',
 'air_temp_18',
 'air_temp_19',
 'air_temp_20',
 'air_temp_21',
 'air_temp_22',
 'sh_1',
 'sh_2',
 'sh_3',
 'sh_4',
 'sh_5',
 'sh_6',
 'sh_7',
 'sh_8',
 'sh_9',
 'sh_10',
 'sh_11',
 'sh_12',
 'sh_13',
 'sh_14',
 'sh_15',
 'sh_16',
 'sh_17',
 'sh_18',
 'sh_19',
 'sh_20',
 'sh_21',
 'sh_22',
 'winddir_1',
 'windspd_1',
 'winddir_2',
 'windspd_2',
 'winddir_3',
 'windspd_3',
 'winddir_4',
 'windspd_4',
 'winddir_5',
 'windspd_5',
 'winddir_6',
 'windspd_6',
 'winddir_7',
 'windspd_7',
 'winddir_8',
 'windspd_8',
 'winddir_9',
 'windspd_9',
 'winddir_10',
 'windspd_10',
 'winddir_11',
 'windspd_11',
 'winddi

In [18]:
# one small bit of cleaning: ensuring the correct datetime type for our time feature
rotors_df['time'] = pandas.to_datetime(rotors_df['time'])

In [19]:

temp_feature_names = [f'air_temp_{i1}' for i1 in range(1,23)]
humidity_feature_names = [f'sh_{i1}' for i1 in range(1,23)]
wind_direction_feature_names = [f'winddir_{i1}' for i1 in range(1,23)]
wind_speed_feature_names = [f'windspd_{i1}' for i1 in range(1,23)]
u_wind_feature_names = [f'u_wind_{i1}' for i1 in range(1,23)]
v_wind_feature_names = [f'v_wind_{i1}' for i1 in range(1,23)]
target_feature_name = 'rotors_present'

### Train/test split

The next important part in preparing the data is to split it into parts with some used for training and some used for evaluation. The best practice for this is to split the data into three parts:
* training - This is the data that will be shown to the algorithm during training
* validation/dev - This is data set aside that is used to evaluate the trained model during active development of the model. Based on this performance you may choose to update the hyperparameters or refine/augment the dataset to improve results. 
* test - This is data that is put aside until the very end of the project. This ensures that neither the model parameters nor the hyperparameters are overfitted or too specifically tuned to the particular contents of the data, and that the model will actually generalise to the broader problem being solved.

For this example, we're just using two splits to keep things simple: a training set and a single held-out set (named `val` in the code below). Often train/validate/test sets are chosen at random. Sometimes though, specifically for time series or geospatial data, neighbouring points can be correlated in some way, so more care is required. In this case we are going to split our data by year to avoid correlations between the training and held-out sets.

In [20]:
train_df = rotors_df[rotors_df['time'] < datetime.datetime(2020,1,1,0,0)]
val_df = rotors_df[rotors_df['time'] >= datetime.datetime(2020,1,1,0,0)]
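One detail worth checking in a time-based split is the boundary timestamp: with strict inequalities on both sides, a record stamped exactly at the split time would fall into neither set. The following library-free sketch, using hypothetical three-hourly records and a hypothetical helper `split_by_time`, shows a split in which every record lands in exactly one set.

```python
import datetime


def split_by_time(records, split_time):
    """Split (timestamp, value) records into train and validation lists.

    Records strictly before split_time go to training; records at or after
    split_time go to validation, so every record lands in exactly one set.
    """
    train = [r for r in records if r[0] < split_time]
    val = [r for r in records if r[0] >= split_time]
    return train, val


# Hypothetical three-hourly records straddling the start of 2020.
start = datetime.datetime(2019, 12, 31, 18, 0)
records = [(start + datetime.timedelta(hours=3 * i), i) for i in range(5)]

train, val = split_by_time(records, datetime.datetime(2020, 1, 1, 0, 0))
print("train:", len(train), "validation:", len(val))
```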

In [21]:
input_feature_names = temp_feature_names + humidity_feature_names + u_wind_feature_names + v_wind_feature_names

In [22]:
preproc_dict = {}
for if1 in input_feature_names:
    scaler1 = sklearn.preprocessing.StandardScaler()
    scaler1.fit(train_df[[if1]])
    preproc_dict[if1] = scaler1
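The important point in the cell above is that each scaler is fitted on the training data only and then reused unchanged on any other split. As a rough illustration of what standard scaling does, here is a sketch in plain Python with made-up temperature values; the helper names are hypothetical, and the population standard deviation is used because that is the convention `StandardScaler` follows.

```python
import statistics


def fit_standard_scaler(values):
    """Return the (mean, standard deviation) learned from the training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    return mean, std


def apply_standard_scaler(values, mean, std):
    """Scale values using statistics learned elsewhere (i.e. on the training set)."""
    return [(v - mean) / std for v in values]


# Made-up temperatures in kelvin, for illustration only.
train_temps = [280.0, 282.0, 284.0, 286.0]
val_temps = [283.0, 290.0]

mean, std = fit_standard_scaler(train_temps)                  # fit on training data only
train_scaled = apply_standard_scaler(train_temps, mean, std)
val_scaled = apply_standard_scaler(val_temps, mean, std)      # reuse training statistics

print(train_scaled)
print(val_scaled)
```

The scaled training values have zero mean and unit variance; the validation values generally do not, which is expected because no statistics were learned from them.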

In [23]:
target_encoder = sklearn.preprocessing.LabelEncoder()
target_encoder.fit(train_df[target_feature_name])

LabelEncoder()

In [24]:
def preproc_input(data_subset, pp_dict):
    # scale each input feature with its fitted scaler and stack the results as columns
    return numpy.concatenate([scaler1.transform(data_subset[[if1]]) for if1,scaler1 in pp_dict.items()],axis=1)

def preproc_target(data_subset, enc1):
    # encode the target labels as integers
    return enc1.transform(data_subset[target_feature_name])

In [25]:
X_train = preproc_input(train_df, preproc_dict)
y_train = numpy.concatenate(
    [preproc_target(train_df, target_encoder).reshape((-1,1)),
    1.0 - (preproc_target(train_df, target_encoder).reshape((-1,1))),],
    axis=1
)

In [26]:
X_val = preproc_input(val_df, preproc_dict)
y_val = numpy.concatenate(
    [preproc_target(val_df, target_encoder).reshape((-1,1)),
    1.0 - (preproc_target(val_df, target_encoder).reshape((-1,1))),],
    axis=1
)

### Set up experiment tracking - ML Flow
Now that we've loaded and prepared our data, we can start training some machine learning models! Over the course of a project, we may want to train many models, varying different elements of the training configuration, model architecture, training data and many other aspects. Experiment tracking helps us log a complete description of the experiment so we can compare different results, select best configurations and be able to reproduce results when required.

An important distinction in experiment tracking frameworks generally, and in MLflow specifically, is between experiments and runs. An experiment is a particular setup of your machine learning pipeline that you want to develop and optimise. Within each experiment you are likely to have many runs, each representing one execution of the pipeline with a particular configuration. See the MLflow concepts documentation for details:
https://www.mlflow.org/docs/latest/concepts.html
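The relationship between experiments and runs can be pictured as a simple data structure. The sketch below is purely conceptual and is not the MLflow API: the `Run` and `Experiment` classes, the run names and the metric values are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of the pipeline: its configuration and its results."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


@dataclass
class Experiment:
    """A named collection of runs that tackle the same problem."""
    name: str
    runs: list = field(default_factory=list)

    def best_run(self, metric, lower_is_better=True):
        """Return the run with the best recorded value of `metric`."""
        scored = [r for r in self.runs if metric in r.metrics]
        chooser = min if lower_is_better else max
        return chooser(scored, key=lambda r: r.metrics[metric])


experiment = Experiment(name="rse_mlops_demo_rotors")
# Hypothetical runs with made-up metric values.
experiment.runs.append(Run("ffnn_a", {"n_layers": 2, "n_nodes": 100}, {"val_rmse": 0.21}))
experiment.runs.append(Run("ffnn_b", {"n_layers": 4, "n_nodes": 1000}, {"val_rmse": 0.18}))

best = experiment.best_run("val_rmse")
print(best.name, best.params)
```

A tracking server adds persistence, artifact storage and a UI on top of this basic shape, but comparing runs within an experiment is the core operation.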

In this example, we will use an MLflow server running locally. To start such a server, activate the conda environment used for this notebook, then run the following MLflow command from the command line:

```mlflow server --port $PORT --backend-store-uri sqlite://<PATH TO DB>/<FILENAME>.db  --default-artifact-root $ARTIFACT_PATH```

Alternatively, you can run the `run_mlflow.sh` script file from the `ukrse2022_mlops_model_dev` conda environment

To Note

* This will only be accessible on your local machine. To make it more widely accessible, use the `--host 0.0.0.0` option.
* When specifying the backend URI with a full (absolute) path you will need FOUR slashes after the `sqlite:`, for example `sqlite:////user/name/experiments/my_project.db`, otherwise you will get an error.
* The artifact store can be a local file path e.g. /path/to/artifacts/ or a remote object store like S3 e.g. s3://project_bucket/project_key/
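The four-slash rule follows from how the URI is assembled: the scheme prefix is `sqlite:///` and an absolute POSIX path contributes its own leading slash. A small helper (hypothetical, not part of MLflow) makes this explicit:

```python
import pathlib


def sqlite_backend_uri(db_path):
    """Build a SQLite URI of the form sqlite:///<path>.

    An absolute POSIX path already starts with '/', so the result has four
    slashes after 'sqlite:'; a relative path yields three.
    """
    return "sqlite:///" + pathlib.PurePosixPath(db_path).as_posix()


print(sqlite_backend_uri("/user/name/experiments/my_project.db"))
# sqlite:////user/name/experiments/my_project.db
print(sqlite_backend_uri("mlflowSQLserver.db"))
# sqlite:///mlflowSQLserver.db
```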

We will now set up our access to our server and set up an experiment.

In [27]:
rse_rotors_experiment_name = 'rse_mlops_demo_rotors'

In [28]:
timestamp_template = '{dt.year:04d}{dt.month:02d}{dt.day:02d}T{dt.hour:02d}{dt.minute:02d}{dt.second:02d}'

In [29]:
rse_run_name_template = 'rse_rotors_{network_name}_' + timestamp_template

In [33]:
# mlflow_server_address = '127.0.0.1'
mlflow_server_address = '10.152.49.196'
mlflow_server_port = mlflow_dash_port
mlflow_server_uri = f'http://{mlflow_server_address}:{mlflow_server_port:d}'
mlflow_server_uri

'http://10.152.49.196:5001'

In [34]:
mlflow.set_tracking_uri(mlflow_server_uri)

Example of required run: ```mlflow server --port 5001 --backend-store-uri sqlite:///mlflowSQLserver.db  --default-artifact-root ./mlflow_artifacts/```

In [35]:
try: 
    print('creating experiment')
    rse_rotors_exp_id = mlflow.create_experiment(rse_rotors_experiment_name)
    rse_rotors_exp = mlflow.get_experiment(rse_rotors_exp_id)
except mlflow.exceptions.RestException:
    rse_rotors_exp = mlflow.get_experiment_by_name(rse_rotors_experiment_name)
rse_rotors_exp



creating experiment


<Experiment: artifact_location='/home/h01/shaddad/data/ukrse2022/1', experiment_id='1', lifecycle_stage='active', name='rse_mlops_demo_rotors', tags={}>

### Set up model architecture hyperparameters

Now we will actually set up our models for this pipeline. In this case we're using a fairly simple feed-forward neural network. We don't need to do any specific logging of config or architecture at this point because the autologging capabilities of ML Flow for TensorFlow/Keras will ensure all these hyperparameters are logged.

In [36]:
def build_ffnn_model(hyperparameters, input_shape):
    """
    Build a feed-forward neural network model in TensorFlow for predicting the occurrence of turbulent, orographically driven wind gusts called rotors.
    """
    model = tensorflow.keras.models.Sequential()
    model.add(tensorflow.keras.layers.Dropout(hyperparameters['drop_out_rate'], 
                                              input_shape=input_shape))
    for i in numpy.arange(0,hyperparameters['n_layers']):
        model.add(tensorflow.keras.layers.Dense(hyperparameters['n_nodes'], 
                                                activation=hyperparameters['activation'], 
                                                kernel_constraint=tensorflow.keras.constraints.max_norm(3)))
        model.add(tensorflow.keras.layers.Dropout(hyperparameters['drop_out_rate']))
    model.add(tensorflow.keras.layers.Dense(2, activation='softmax'))             # This is the output layer
    return model


In [37]:
nx = X_train.shape[1]
input_shape = (nx,)

In [38]:
hyperparameters_dict = {
    'initial_learning_rate': 1.0e-4,
    'drop_out_rate': 0.2,
    'n_epochs': 100,
    'batch_size': 1000,
    'n_nodes': 1000,
    'n_layers': 4,
    'activation': 'relu',
    'loss': 'mse'
}
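To get a feel for the size of model these hyperparameters describe, the number of trainable parameters can be worked out by hand. This sketch assumes 88 input features (the four groups of 22 levels selected as inputs earlier) and the layer layout of `build_ffnn_model`; dropout layers have no trainable parameters, so they are ignored. The helper name is hypothetical.

```python
def count_dense_parameters(n_inputs, n_layers, n_nodes, n_outputs):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous_width = n_inputs
    for _ in range(n_layers):
        total += previous_width * n_nodes + n_nodes  # weights + biases
        previous_width = n_nodes
    total += previous_width * n_outputs + n_outputs  # output layer
    return total


n_inputs = 4 * 22  # temperature, humidity, u wind and v wind on 22 levels each
n_params = count_dense_parameters(n_inputs, n_layers=4, n_nodes=1000, n_outputs=2)
print(n_params)
# 3094002
```

The count is dominated by the `n_nodes * n_nodes` terms of the hidden layers, which is why `n_nodes` and `n_layers` are natural candidates for tuning.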

### Train model and setup monitoring

Now we're (almost) all set up to run our training. To monitor training while it is running we'll use TensorBoard, so we'll set that up ready to use. TensorBoard uses a callback in the Keras training loop to update the dashboard.

In [39]:
log_dir_tensorboard = rse_root_data_dir / 'log_tensorboard' 
if not log_dir_tensorboard.is_dir():
    log_dir_tensorboard.mkdir()
    print(f'created tensorboard log directory {log_dir_tensorboard}')    
log_dir_tensorboard

PosixPath('/home/h01/shaddad/data/ukrse2022/log_tensorboard')

In [40]:
tensorboard_callback = tensorflow.keras.callbacks.TensorBoard(log_dir=log_dir_tensorboard, 
                                                              histogram_freq=1)

In [41]:
log_dir_tensorboard

PosixPath('/home/h01/shaddad/data/ukrse2022/log_tensorboard')

In [42]:
%tensorboard --logdir {log_dir_tensorboard} --port {tensorboard_dash_port}

In [43]:
%%time 
current_run_name = rse_run_name_template.format(network_name='ffnn',
                                                dt=datetime.datetime.now()
                                               )
with mlflow.start_run(experiment_id=rse_rotors_exp.experiment_id, run_name=current_run_name) as current_run:
    rotors_ffnn_model = build_ffnn_model(hyperparameters=hyperparameters_dict,
                                     input_shape=input_shape,
                                    )
    rotors_ffnn_optimizer = tensorflow.optimizers.Adam(
        learning_rate=hyperparameters_dict['initial_learning_rate'])  
    
    rotors_ffnn_model.compile(optimizer=rotors_ffnn_optimizer, 
                          loss=hyperparameters_dict['loss'], 
                          metrics=[tensorflow.keras.metrics.RootMeanSquaredError()])
    
    history=rotors_ffnn_model.fit(
        X_train, 
        y_train, 
        validation_data=(X_val, 
                          y_val), 
        epochs=hyperparameters_dict['n_epochs'], 
        batch_size=hyperparameters_dict['batch_size'], 
        shuffle=True,
        verbose=0,
        callbacks=[tensorboard_callback],
    )    


2023-01-26 22:59:59.420581: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-26 23:03:12.955774: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: /var/tmp/tmpvw8rhqfm/model/data/model/assets
CPU times: user 17min 58s, sys: 1min 29s, total: 19min 27s
Wall time: 3min 21s


Once training has run, you can look at the outputs in MLflow to see everything that was logged. For a server running locally as described above, the MLflow dashboard is at http://127.0.0.1:5001

In [44]:
rse_rotors_exp

<Experiment: artifact_location='/home/h01/shaddad/data/ukrse2022/1', experiment_id='1', lifecycle_stage='active', name='rse_mlops_demo_rotors', tags={}>

## Example - Hyperparameter tuning in Ray

As explained in the introduction to this notebook, the process of training a machine learning algorithm calculates the *parameters* of the model. There are additional configuration values, not calculated by the training process, that are chosen by the data scientist doing the training; these are called hyperparameters. Suitable values are typically problem dependent and so change from problem to problem, even within a project. While over time one may develop an intuition for which hyperparameter values are suitable in a particular context, typically these values are determined in an outer training loop, called hyperparameter tuning.

In hyperparameter tuning, many models are trained, each with different hyperparameters. Common ways to select candidate hyperparameter combinations are through a grid search through the n-dimensional hyperparameter space, or specifying possible hyperparameter values as independent distributions that are randomly sampled n times for n trials. Typically the hyperparameters for the trained model that scores best on the particular metric of choice will be chosen for the particular problem from then on.
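The two selection strategies can be sketched with the standard library alone. This is a minimal illustration, not how Ray Tune implements them; the helper names and the grid values are made up, and the sampling ranges simply mirror those used later in this notebook.

```python
import itertools
import random


def grid_candidates(grid):
    """Every combination of the listed values (Cartesian product)."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in itertools.product(*grid.values())]


def random_candidates(samplers, n_trials, seed=0):
    """n_trials independent draws from each hyperparameter's sampler."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in samplers.items()}
            for _ in range(n_trials)]


# Grid search: the number of trials is fixed by the grid (4 x 3 = 12 here).
grid = {"n_layers": [2, 3, 4, 5], "n_nodes": [100, 300, 500]}

# Random search: one independent distribution per hyperparameter.
samplers = {
    "initial_learning_rate": lambda rng: rng.uniform(1e-5, 1e-3),
    "n_nodes": lambda rng: rng.randrange(100, 500),
    "n_layers": lambda rng: rng.randrange(2, 6),
}

grid_trials = grid_candidates(grid)
random_trials = random_candidates(samplers, n_trials=10)
print(len(grid_trials), "grid trials,", len(random_trials), "random trials")
```

With a grid the number of trials grows multiplicatively with every hyperparameter added, whereas with random sampling the number of trials is chosen directly.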

As each of these trials is completely independent, this is an easy workflow (conceptually at least) to execute in parallel. One could do this entirely manually by setting up many training runs with different hyperparameter values. However, there are many ways to make mistakes when gathering results, selecting the best one and saving outputs as evidence, and scaling up to run the trials in parallel adds further complexity. It is best to use existing tools to facilitate scaling and reproducibility.

In this example, we are going to do hyperparameter tuning using the *Ray* library, a Python library for parallel, distributed computing. Here we will just run on a local cluster to keep things simple for this tutorial, but Ray provides facilities to set up a cluster of Ray workers on a Linux cluster, a cloud VM, a Kubernetes cluster or almost any other distributed computing platform.

Facilitating such use of tools and compute resources is where RSEs add value to a project: research requirements then determine the technical implementation, not the other way around, and the best trained ML model can be used to produce the best research outputs.

Ray docs:
* [ray tune docs](https://docs.ray.io/en/latest/tune/index.html)
* [source for example](https://docs.ray.io/en/latest/tune/examples/tune_mnist_keras.html#tune-mnist-keras)


In [45]:
import rotors_hpt

In [46]:
import ray
import ray.tune
import ray.tune.schedulers 
import ray.tune.integration.keras

In [47]:
num_training_terations=20

Here we initialise ray (this should only be called once), so that we can view the ray dashboard to monitor progress on the specified URL `http://127.0.0.1:5003`.

In [48]:
ray.init(num_cpus=4, dashboard_port=ray_dash_port, dashboard_host='127.0.0.1')


RayContext(dashboard_url='', python_version='3.8.15', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '10.154.1.23', 'raylet_ip_address': '10.154.1.23', 'redis_address': None, 'object_store_address': '/var/tmp/ray/session_2023-01-26_23-03-21_440057_35330/sockets/plasma_store', 'raylet_socket_name': '/var/tmp/ray/session_2023-01-26_23-03-21_440057_35330/sockets/raylet', 'webui_url': '', 'session_dir': '/var/tmp/ray/session_2023-01-26_23-03-21_440057_35330', 'metrics_export_port': 51379, 'gcs_address': '10.154.1.23:58202', 'address': '10.154.1.23:58202', 'node_id': '721af103a47d0cce88301ef6f8cd7b9a506ce1f4db068531948fe75c'})

Now we specify distributions for the hyperparameters we want to tune.

In [49]:
rotors_hpt_config = {
    'initial_learning_rate': ray.tune.uniform(1e-5,1e-3),
    'n_nodes': ray.tune.randint(100,500),
    'n_layers': ray.tune.randint(2,6)
}

We then set up a hyperparameter job that will execute on our local ray cluster.

In [50]:
rotors_hpt_analysis = ray.tune.run(
    rotors_hpt.run_ml_pipeline,
    name="rse_rotors_hpt",
    scheduler=ray.tune.schedulers.AsyncHyperBandScheduler(
        time_attr="training_iteration", 
        max_t=400, 
        grace_period=20,
    ),
    metric="root_mean_squared_error",
    mode="max",
    stop={"root_mean_squared_error": 0.99, 
          "training_iteration": num_training_terations},
    num_samples=10,
    resources_per_trial={"cpu": 2,
                         "gpu": 0},
    config=rotors_hpt_config,
    progress_reporter=ray.tune.JupyterNotebookReporter(overwrite=True),
    )

Trial name,status,loc,initial_learning_rate,n_layers,n_nodes,iter,total time (s),root_mean_squared_error
run_ml_pipeline_a3df0_00000,TERMINATED,10.154.1.23:7866,0.000895869,5,250,20,29.6363,0.148189
run_ml_pipeline_a3df0_00001,TERMINATED,10.154.1.23:8186,0.000540599,3,435,20,30.772,0.148186
run_ml_pipeline_a3df0_00002,TERMINATED,10.154.1.23:7866,0.00066691,5,228,20,29.0618,0.148189
run_ml_pipeline_a3df0_00003,TERMINATED,10.154.1.23:8186,0.000385338,2,422,20,27.3634,0.148072
run_ml_pipeline_a3df0_00004,TERMINATED,10.154.1.23:7866,0.000996128,3,430,20,33.4955,0.148189
run_ml_pipeline_a3df0_00005,TERMINATED,10.154.1.23:8186,0.000710586,5,278,20,31.8206,0.148189
run_ml_pipeline_a3df0_00006,TERMINATED,10.154.1.23:7866,0.000325596,3,103,20,25.6335,0.147709
run_ml_pipeline_a3df0_00007,TERMINATED,10.154.1.23:8186,0.000966742,5,197,20,29.1157,0.148189
run_ml_pipeline_a3df0_00008,TERMINATED,10.154.1.23:7866,0.000737624,2,494,20,28.9961,0.148183
run_ml_pipeline_a3df0_00009,TERMINATED,10.154.1.23:8186,0.000818004,2,245,20,27.1055,0.146692


2023-01-26 23:06:05,247	INFO tune.py:747 -- Total run time: 159.10 seconds (158.92 seconds for the tuning loop).


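Once the job has run, we can interrogate the results and find the best hyperparameters from our search. This is a small, illustrative search, so we are unlikely to have found the best values. Note also that RMSE is an error, so lower is better, and `mode="min"` is the appropriate setting for this metric in `ray.tune.run`. The run above used `mode="max"`, so the `best_config`, `best_trial` and `best_result` attributes shown below refer to the trial with the highest RMSE.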

In [51]:
print('\n'.join([f'{k1} - {v1["root_mean_squared_error"]}' for k1,v1 in rotors_hpt_analysis.results.items()]))

a3df0_00000 - 0.14818881452083588
a3df0_00001 - 0.14818641543388367
a3df0_00002 - 0.1481887549161911
a3df0_00003 - 0.14807158708572388
a3df0_00004 - 0.1481887698173523
a3df0_00005 - 0.1481887847185135
a3df0_00006 - 0.14770904183387756
a3df0_00007 - 0.14818881452083588
a3df0_00008 - 0.14818333089351654
a3df0_00009 - 0.14669230580329895
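To make the selection step concrete, here is a minimal sketch in plain Python that ranks the ten trials using the RMSE values printed above. It does not use Ray. The dictionary is copied by hand from that output, and because RMSE is an error the sketch treats the smallest value as the best.

```python
# RMSE of each trial after 20 iterations, copied from the output above.
trial_rmse = {
    "a3df0_00000": 0.14818881452083588,
    "a3df0_00001": 0.14818641543388367,
    "a3df0_00002": 0.1481887549161911,
    "a3df0_00003": 0.14807158708572388,
    "a3df0_00004": 0.1481887698173523,
    "a3df0_00005": 0.1481887847185135,
    "a3df0_00006": 0.14770904183387756,
    "a3df0_00007": 0.14818881452083588,
    "a3df0_00008": 0.14818333089351654,
    "a3df0_00009": 0.14669230580329895,
}

# RMSE is an error, so the best trial is the one with the smallest value.
lowest_rmse_trial = min(trial_rmse, key=trial_rmse.get)
highest_rmse_trial = max(trial_rmse, key=trial_rmse.get)
rmse_spread = trial_rmse[highest_rmse_trial] - trial_rmse[lowest_rmse_trial]

print(f"lowest RMSE:  {lowest_rmse_trial} ({trial_rmse[lowest_rmse_trial]:.6f})")
print(f"highest RMSE: {highest_rmse_trial} ({trial_rmse[highest_rmse_trial]:.6f})")
print(f"spread:       {rmse_spread:.6f}")
```

The lowest RMSE belongs to trial `a3df0_00009`, and the gap between the lowest and highest values is only about 0.0015, so this short search barely distinguishes the configurations.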


In [52]:
rotors_hpt_analysis.best_config

{'initial_learning_rate': 0.0008958685115361621, 'n_nodes': 250, 'n_layers': 5}

In [53]:
rotors_hpt_analysis.best_trial

run_ml_pipeline_a3df0_00000

In [54]:
rotors_hpt_analysis.best_result

{'root_mean_squared_error': 0.14818881452083588,
 'time_this_iter_s': 0.28623247146606445,
 'done': True,
 'timesteps_total': None,
 'episodes_total': None,
 'training_iteration': 20,
 'trial_id': 'a3df0_00000',
 'experiment_id': 'b831a0e105624a08aae9774efa697a65',
 'date': '2023-01-26_23-04-01',
 'timestamp': 1674774241,
 'time_total_s': 29.636255741119385,
 'pid': 7866,
 'hostname': 'expspicesrv027',
 'node_ip': '10.154.1.23',
 'config': {'initial_learning_rate': 0.0008958685115361621,
  'n_nodes': 250,
  'n_layers': 5},
 'time_since_restore': 29.636255741119385,
 'timesteps_since_restore': 0,
 'iterations_since_restore': 20,
 'warmup_time': 0.004109621047973633,
 'experiment_tag': '0_initial_learning_rate=0.0009,n_layers=5,n_nodes=250'}

# Example - save the model 

Having trained a model, possibly through hyperparameter tuning, we want it to persist, so we need to store the trained weights and the architecture that describes the structure of the model to disk in a file of some sort. This is the first step in treating our models as *machine learning assets* that, like data and code, should follow *FAIR* principles.

We will explore several ways to save our models. The first is to use the ML framework we're using for our model, in this case the Keras interface to TensorFlow. Secondly, we'll use ML Flow, the experiment tracking tool, which provides a framework-independent way to save and load models. Thirdly, we'll look at a completely separate tool intended as an open standard for exchanging machine learning models, called the *Open Neural-network eXchange* format, or ONNX.

Docs:
* [Keras Model Saving/Loading](https://www.tensorflow.org/guide/keras/save_and_serialize)
* [ML Flow Models](https://www.mlflow.org/docs/latest/models.html)
* [ONNX](https://onnxruntime.ai/docs/)


### Keras model format

First we'll use the framework specific Keras format to save our model.

In [55]:
rotors_ffnn_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dropout (Dropout)           (None, 88)                0         
                                                                 
 dense (Dense)               (None, 1000)              89000     
                                                                 
 dropout_1 (Dropout)         (None, 1000)              0         
                                                                 
 dense_1 (Dense)             (None, 1000)              1001000   
                                                                 
 dropout_2 (Dropout)         (None, 1000)              0         
                                                                 
 dense_2 (Dense)             (None, 1000)              1001000   
                                                                 
 dropout_3 (Dropout)         (None, 1000)              0

In [56]:
model_keras_export_path = rse_root_data_dir / 'keras_model'
model_keras_export_path


PosixPath('/home/h01/shaddad/data/ukrse2022/keras_model')

In [57]:
rotors_ffnn_model.save(model_keras_export_path)

INFO:tensorflow:Assets written to: /home/h01/shaddad/data/ukrse2022/keras_model/assets


In [58]:
list(model_keras_export_path.rglob('*'))

[PosixPath('/home/h01/shaddad/data/ukrse2022/keras_model/saved_model.pb'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/keras_model/keras_metadata.pb'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/keras_model/assets'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/keras_model/variables'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/keras_model/variables/variables.index'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/keras_model/variables/variables.data-00000-of-00001')]
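Since the saved model is a directory of several files, one possible way to treat it as a trackable asset is to record the size and checksum of each file alongside it. The following is a minimal sketch using only the standard library. `sha256_of_file` and `build_manifest` are hypothetical helpers, not part of Keras or ML Flow, and the demonstration runs on a throwaway directory of stand-in files rather than the real export.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(file_path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with file_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir: Path) -> dict:
    """Map each file under model_dir to its size in bytes and SHA-256 digest."""
    manifest = {}
    for file_path in sorted(p for p in model_dir.rglob("*") if p.is_file()):
        relative_name = file_path.relative_to(model_dir).as_posix()
        manifest[relative_name] = {
            "size_bytes": file_path.stat().st_size,
            "sha256": sha256_of_file(file_path),
        }
    return manifest


# Demonstrate on a temporary directory with stand-in files; in the notebook
# you would pass model_keras_export_path instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir) / "keras_model"
    (demo_dir / "variables").mkdir(parents=True)
    (demo_dir / "saved_model.pb").write_bytes(b"stand-in graph")
    (demo_dir / "variables" / "variables.index").write_bytes(b"stand-in index")

    demo_manifest = build_manifest(demo_dir)
    manifest_json = json.dumps(demo_manifest, indent=2)
    print(manifest_json)
```

Storing such a manifest next to the model, or logging it with the training run, lets a later user check that the files they load are the ones that were saved.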

### ML Flow

When we trained our model earlier in this notebook, ML Flow logged all the details automatically, because at the start of the notebook we called `mlflow.tensorflow.autolog()`, which makes ML Flow log any call to `fit` on a TensorFlow model. So, as we'll see in the next notebook, we can access our saved models through the runs logged with ML Flow without explicitly saving the model. We can also do an explicit save outside the default artifact store of our ML Flow server, as demonstrated below.

In [59]:
mlflow_model_path = rse_root_data_dir / 'mflow_model'
mlflow_model_path

PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model')

In [60]:
help(mlflow_model_path.rmdir)

Help on method rmdir in module pathlib:

rmdir() method of pathlib.PosixPath instance
    Remove this directory.  The directory must be empty.



In [61]:
import shutil

In [62]:
if mlflow_model_path.is_dir():
    shutil.rmtree(mlflow_model_path)
    print('deleted existing directory')

deleted existing directory


In [63]:
mlflow.keras.save_model(rotors_ffnn_model, mlflow_model_path)

INFO:tensorflow:Assets written to: /home/h01/shaddad/data/ukrse2022/mflow_model/data/model/assets


In [64]:
list(mlflow_model_path.rglob('*'))

[PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/requirements.txt'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/conda.yaml'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/MLmodel'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/model'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/save_format.txt'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/keras_module.txt'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/model/saved_model.pb'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/model/keras_metadata.pb'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/model/assets'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/model/variables'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_model/data/model/variables/variables.index'),
 PosixPath('/home/h01/shaddad/data/ukrse2022/mflow_mod

In [65]:
mlflow.keras.load_model(str(mlflow_model_path) )

<keras.engine.sequential.Sequential at 0x2ab40bdb8f40>

We will now look at our model in the ML Flow GUI, so that we can explore how to register a model.

### ONNX

The third format comes from a standalone library for storing models in a common format, together with a runtime environment for inference that is independent of the framework used to train the models. This should (potentially) provide a more stable environment for model storage and inference that does not need to change if the model training pipeline and its dependencies change.

In [66]:
import onnx
import tf2onnx

In [67]:
help(tf2onnx.convert.from_keras)

Help on function from_keras in module tf2onnx.convert:

from_keras(model, input_signature=None, opset=None, custom_ops=None, custom_op_handlers=None, custom_rewriter=None, inputs_as_nchw=None, outputs_as_nchw=None, extra_opset=None, shape_override=None, target=None, large_model=False, output_path=None, optimizers=None)
    Returns a ONNX model_proto for a tf.keras model.
    
    Args:
        model: the tf.keras model we want to convert
        input_signature: a tf.TensorSpec or a numpy array defining the shape/dtype of the input
        opset: the opset to be used for the ONNX model, default is the latest
        custom_ops: if a model contains ops not recognized by onnx runtime,
            you can tag these ops with a custom op domain so that the
            runtime can still open the model. Type is a dictionary `{op name: domain}`.
        target: list of workarounds applied to help certain platforms
        custom_op_handlers: dictionary of custom ops handlers
        custom_rew

In [68]:
%%time
rotors_onnx_model, _ = tf2onnx.convert.from_keras(
    rotors_ffnn_model,
    [tensorflow.TensorSpec(
        shape=tensorflow.TensorShape([None,X_train.shape[1] ]),
        dtype=X_train.dtype,
        name='ukmo_rotors_model_input',
    )],
)

2023-01-26 23:06:15.459598: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2023-01-26 23:06:15.459736: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-01-26 23:06:15.461583: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1164] Optimization results for grappler item: graph_to_optimize
  function_optimizer: function_optimizer did nothing. time = 0.006ms.
  function_optimizer: function_optimizer did nothing. time = 0.001ms.

2023-01-26 23:06:15.696669: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2023-01-26 23:06:15.696797: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-01-26 23:06:15.839890: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:116

CPU times: user 1.83 s, sys: 171 ms, total: 2 s
Wall time: 1.97 s


In [69]:
onnx_model_path = rse_root_data_dir / 'onnx_model'
onnx_model_path

PosixPath('/home/h01/shaddad/data/ukrse2022/onnx_model')

In [70]:
onnx.save_model(rotors_onnx_model, str(onnx_model_path))

In [71]:
onnx_model_path.stat()

os.stat_result(st_mode=33188, st_ino=139553715229, st_dev=50, st_nlink=1, st_uid=10882, st_gid=1000, st_size=12378631, st_atime=1674769572, st_mtime=1674774377, st_ctime=1674774377)

In [72]:
# warning do not try to display, it will freeze your notebook!
reloaded_onnx_model = onnx.load_model(str(onnx_model_path))

### Next Steps / Further Reading
Many of these features can be integrated into new or existing ML projects straight away and provide benefits for researchers almost immediately. You can find out more about the core concepts through the links below:

* [Experiment Tracking concepts](https://www.mlflow.org/docs/latest/concepts.html)
* [Explanation of hyperparameter tuning - Google Cloud](https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview)
* [Running hyperparameter tuning at scale - Ray Tune](https://docs.ray.io/en/latest/tune/getting-started.html)

### References
* [ML Flow Models](https://www.mlflow.org/docs/latest/models.html)
* [TensorFlow Keras](https://keras.io/about/)
  * [Saving Keras Models](https://www.tensorflow.org/guide/keras/save_and_serialize)
* [TensorBoard](https://www.tensorflow.org/tensorboard)
* [ONNX](https://onnx.ai/)
* [Ray Tune](https://docs.ray.io/en/latest/tune/index.html)