# MLOps Walkthrough - Model Development

In the first notebook, we looked at setting up a data pipeline to preprocess our data ready to use with a machine learning pipeline where we create and train machine learning models to do some prediction for our example problem. In this notebook we will look at the key pricniples and tools from the point of a Research Software Engineer for building such a machine learning model training piepline. The key idea is that machine learning model are assets to be tracked and shared like code and data. Users of the models need to know all the details on how they were trained, including the input data, the architecture of the model, the hyperparameters used and many other details so that the results can be understood and trusted. This is just the same as RSEs routinely do for code, so in many ways much of what is in this notebook is not new in principle, but may be implemented slightly differenlty at times to handle different aspects of ML models.

## How to train your ~~dragon🐉~~ machine learning model 🤖 (reproducibly, efficiently, at scale)
Training a machine learning model is easy, at least as of 2022 when there are many libraries and tools available and a lot of good documentation and tutorials to help researchers get started in applying machine learning to their problem if they've not done it before. So in general training a small machine learning model is not something that requires special RSE support. As with many research tasks, the challenges appear when you try to take your toy example trained on a small initial dataset and try to scale it up to a large complex real world dataset. Suddenly the technical infrastrcuture demands of machine learning at scale become apparent and this where the expertise of a RSE becomes valuable in setting up the tools and infrastructure to easily scale up the compute to match demands of the research. 

In addition in a research context one is not training a single large model with well defined parameters, all the elements of the pipeline are part of the research. Researchers will likely want to many experiments training different models with slightly altered inputs and hyperaparameters. The amount of data and other assets produced by such ML experiments can quickly become overwhelming. *Experiment tracker* software is vital, especially in a research context, to manage and compare the results of different machine learning experiments and select the best results.

It is also important in larger models to be be able to track the performance of models whiles they are training to see whether progress in imroving performance with respect to key metrics is being made by continuing training or whether improvements have stalled. Training dashboard tools are useful for researchers to be able to check on the status of models during what can be quite a lon training process.

The steps for this part of the project are typically as follows:
* *create train/validate/test splits* - a key paractice in train ML models is to only train with a portion of the data, and hold back smaller portions of the data for for evaluating model performance both during the development cycle (validate or dev split), and right at the end of the project to check the final solution generalises to unseen data (test split)
* *prepare the data* - Although most of the work  will likely have been done duringthe data prep part of the project, there is likely at least a small amount of "last mile" preparation to prepare the data for use with the particular ML framework and architecture being used. In particular normalising the data
* *build the model architecture* - 
  * *log hyper parameters* - 
* *train the model*
  * *monitor training* - 
  * *log training metrics* - 
* *save the the trained model*  - 
  * *log trained model with experiment manager* 

A key part of model development is finding the correct *hyperparameters* for the model. Hyperparameters are the configuration elements of the model that are not determined by training the model. Those elements that are calculated, for example the werights and coefficients of a neural network, are called parameters. Those that are not, for example the number of hidden layers are learning rate for a neural network, are called hyperparameters. Hyperparameters are usually found by trial and error for specific problems. An addditional training loop to find the best hyperparameters is sually employed and this is called hyperparameter tuning. This is an inherently parallel process which inviolves training many models independentally.




### Key Principles
* reusable components - ensure the elements of training infrastrcuture are easy to reuse/adapt for different models and projects
* running at scale - ensure researchers can select the size of training dataset and the size/complexity  of the model architectu
* reproducible experiments - providing tools to systematically record the inputs, configurations and outputs of model training run to keep track of experiements for reproducability.


### Key Tasks for RSEs
* Setting up infrastrcuture for training models
* Facilitating running code on ML targeted platforms e.g. GPU, TPU etc.
* Support good practices for ML development e.g. experiment tracking
* Applying FAIR principles to all ML assets (data, code, trained models)
* applying good code management practices to ML code
* setting uop test suites for ML pipelines


### Key Terms
* Experiment tracking
* machine learning pipeline
* train/test split
* training
* model architecture
* hyper parameters
* hyper parameter tuning

### Key Tools
* ML frameworks (scikit-learn, tensorflow, pytorch)
* Workflow tools (ray)
* Experiment tracking (mlflow)

### Running this notebook
This notebook should run from a conda environment created with the [requirements_model_development.yml file](requirements_model_development.yml). See the readme file for info on how to set up a conda environment for using this notebook.

## Example problem - Predicting wind rotors

brief intro - refer to data prep notebook
what are we doing in this notebook


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pathlib
import datetime
import os
import functools
import math

In [3]:
import matplotlib
%matplotlib inline

In [4]:
import numpy
import pandas
import dask

In [5]:
import sklearn
import sklearn.preprocessing
import sklearn.model_selection

In [6]:
import tensorflow

import tensorflow.keras
import tensorflow.keras.layers
import tensorflow.keras.models
import tensorflow.keras.optimizers
import tensorflow.keras.metrics
import tensorflow.keras.layers
import tensorflow.keras.constraints

In [7]:
import tensorboard

In [8]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [9]:
import mlflow
mlflow.tensorflow.autolog()



In [10]:
import intake

In [11]:
mlflow_dash_port = 5001
tensorboard_dash_port = 5002
ray_dash_port = 5003

### Load the data

load the preprocessed data from the catalog
catalogs allow separation of tasks and responsibilities within a project and team

In [12]:
try:
    rse_root_data_dir = pathlib.Path(os.environ['RSE22_ROOT_DATA_DIR'])
    print('reading from environment variable')
except KeyError as ke1:
    rse_root_data_dir = pathlib.Path(os.environ['HOME'])  / 'data' / 'ukrse2022'
    print('using default path')
rse_root_data_dir

using default path


PosixPath('/Users/stephen.haddad/data/ukrse2022')

In [13]:
rotors_catalog = intake.open_catalog(rse_root_data_dir / 'rotors_catalog.yml')
rotors_catalog 

rotors_catalog:
  args:
    path: /Users/stephen.haddad/data/ukrse2022/rotors_catalog.yml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}


In [14]:
list(rotors_catalog)

['rotors', 'rotors_preprocessed']

We see that our catalog contains preprocessed data ready to use in our machine learning development pipeline.

In [15]:
rotors_df = rotors_catalog['rotors_preprocessed'].read()

In [16]:
rotors_df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,air_temp_obs,dewpoint_obs,wind_direction_obs,wind_speed_obs,wind_gust_obs,air_temp_1,air_temp_2,air_temp_3,...,v_wind_18,u_wind_19,v_wind_19,u_wind_20,v_wind_20,u_wind_21,v_wind_21,u_wind_22,v_wind_22,time
0,1,1,283.9,280.7,110.0,4.1,-9999999.0,284.000,283.625,283.250,...,5.756768,-1.953409,5.673111,-2.674064,5.482644,-3.000000e+00,5.196152,-2.987221,4.971570,2015-01-01 00:00:00
1,2,2,280.7,279.7,90.0,7.7,-9999999.0,281.500,281.250,280.750,...,6.502872,-1.460878,5.094687,-0.790064,3.716961,-7.837740e-16,3.200000,0.727691,3.423517,2015-01-01 03:00:00
2,3,3,279.8,278.1,100.0,7.7,-9999999.0,279.875,279.625,279.125,...,5.481273,-1.423505,5.312592,-0.174497,4.996954,7.293223e-01,4.136193,2.462646,3.152043,2015-01-01 06:00:00
3,4,4,279.9,277.0,120.0,7.2,-9999999.0,279.625,279.250,278.875,...,2.475770,-1.311123,3.245143,-0.407661,3.878635,6.883116e-01,4.345829,1.723190,4.265046,2015-01-01 09:00:00
4,5,5,279.9,277.4,120.0,8.7,-9999999.0,279.250,278.875,278.375,...,-0.775695,-1.997259,0.104672,-1.928942,1.252670,-1.287595e+00,2.142918,-0.899056,2.225241,2015-01-01 12:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17481,20101,20101,276.7,275.5,270.0,3.6,-9999999.0,277.875,277.750,277.625,...,-8.555992,-8.047581,-8.629974,-7.479073,-8.603689,-7.111320e+00,-8.781749,-6.538771,-9.338333,2020-12-31 06:00:00
17482,20102,20102,277.9,276.9,270.0,3.1,-9999999.0,277.875,277.625,277.875,...,-6.956383,-8.273280,-6.942106,-8.886116,-7.456336,-8.995651e+00,-8.388580,-8.029567,-8.917738,2020-12-31 09:00:00
17483,20103,20103,283.5,277.1,220.0,3.6,-9999999.0,281.125,280.625,280.125,...,-8.332875,-7.326372,-9.377328,-8.397556,-9.660283,-7.962654e+00,-8.843423,-7.495332,-7.495332,2020-12-31 12:00:00
17484,20104,20104,286.1,276.9,250.0,3.6,-9999999.0,284.625,284.125,283.625,...,-6.646804,-5.294689,-6.776892,-4.398330,-7.038799,-5.356255e+00,-6.855694,-7.265332,-7.016050,2020-12-31 15:00:00


In [17]:
list(rotors_df.columns)

['Unnamed: 0',
 'Unnamed: 0.1',
 'air_temp_obs',
 'dewpoint_obs',
 'wind_direction_obs',
 'wind_speed_obs',
 'wind_gust_obs',
 'air_temp_1',
 'air_temp_2',
 'air_temp_3',
 'air_temp_4',
 'air_temp_5',
 'air_temp_6',
 'air_temp_7',
 'air_temp_8',
 'air_temp_9',
 'air_temp_10',
 'air_temp_11',
 'air_temp_12',
 'air_temp_13',
 'air_temp_14',
 'air_temp_15',
 'air_temp_16',
 'air_temp_17',
 'air_temp_18',
 'air_temp_19',
 'air_temp_20',
 'air_temp_21',
 'air_temp_22',
 'sh_1',
 'sh_2',
 'sh_3',
 'sh_4',
 'sh_5',
 'sh_6',
 'sh_7',
 'sh_8',
 'sh_9',
 'sh_10',
 'sh_11',
 'sh_12',
 'sh_13',
 'sh_14',
 'sh_15',
 'sh_16',
 'sh_17',
 'sh_18',
 'sh_19',
 'sh_20',
 'sh_21',
 'sh_22',
 'winddir_1',
 'windspd_1',
 'winddir_2',
 'windspd_2',
 'winddir_3',
 'windspd_3',
 'winddir_4',
 'windspd_4',
 'winddir_5',
 'windspd_5',
 'winddir_6',
 'windspd_6',
 'winddir_7',
 'windspd_7',
 'winddir_8',
 'windspd_8',
 'winddir_9',
 'windspd_9',
 'winddir_10',
 'windspd_10',
 'winddir_11',
 'windspd_11',
 'winddi

In [18]:
# one small bit of cleaning: ensuring the correct datetime type for our time feature
rotors_df['time'] = pandas.to_datetime(rotors_df['time'])

In [19]:

temp_feature_names = [f'air_temp_{i1}' for i1 in range(1,23)]
humidity_feature_names = [f'sh_{i1}' for i1 in range(1,23)]
wind_direction_feature_names = [f'winddir_{i1}' for i1 in range(1,23)]
wind_speed_feature_names = [f'windspd_{i1}' for i1 in range(1,23)]
u_wind_feature_names = [f'u_wind_{i1}' for i1 in range(1,23)]
v_wind_feature_names = [f'v_wind_{i1}' for i1 in range(1,23)]
target_feature_name = 'rotors_present'

### Train/test split

Split based on year to avoid correlations between train and test sets.

In [20]:
train_df = rotors_df[rotors_df['time'] < datetime.datetime(2020,1,1,0,0)]
val_df = rotors_df[rotors_df['time'] > datetime.datetime(2020,1,1,0,0)]

In [21]:
input_feature_names = temp_feature_names + humidity_feature_names + u_wind_feature_names + v_wind_feature_names

In [22]:
preproc_dict = {}
for if1 in input_feature_names:
    scaler1 = sklearn.preprocessing.StandardScaler()
    scaler1.fit(train_df[[if1]])
    preproc_dict[if1] = scaler1

In [23]:
target_encoder = sklearn.preprocessing.LabelEncoder()
target_encoder.fit(train_df[[target_feature_name]])

LabelEncoder()

In [24]:
def preproc_input(data_subset, pp_dict):
    return numpy.concatenate([scaler1.transform(data_subset[[if1]]) for if1,scaler1 in pp_dict.items()],axis=1)

def preproc_target(data_subset, enc1):
     return enc1.transform(data_subset[[target_feature_name]])

In [25]:
X_train = preproc_input(train_df, preproc_dict)
y_train = numpy.concatenate(
    [preproc_target(train_df, target_encoder).reshape((-1,1)),
    1.0 - (preproc_target(train_df, target_encoder).reshape((-1,1))),],
    axis=1
)

In [26]:
X_val = preproc_input(val_df, preproc_dict)
y_val = numpy.concatenate(
    [preproc_target(val_df, target_encoder).reshape((-1,1)),
    1.0 - (preproc_target(val_df, target_encoder).reshape((-1,1))),],
    axis=1
)

### Set up experiment tracking - ML Flow
What is an experiment vs a run 

https://www.mlflow.org/docs/latest/concepts.html

In [27]:
rse_rotors_experiment_name = 'rse_mlops_demo_rotors'

In [28]:
timestamp_template = '{dt.year:04d}{dt.month:02d}{dt.day:02d}T{dt.hour:02d}{dt.minute:02d}{dt.second:02d}'

In [29]:
rse_run_name_template = 'rse_rotors_{network_name}_' + timestamp_template

In [30]:
mlflow_server_address = '127.0.0.1'
mlflow_server_port = mlflow_dash_port
mlflow_server_uri = f'http://{mlflow_server_address}:{mlflow_server_port:d}'
mlflow_server_uri

'http://127.0.0.1:5001'

In [31]:
mlflow.set_tracking_uri(mlflow_server_uri)

In [32]:
try: 
    print('creating experiment')
    rse_rotors_exp_id = mlflow.create_experiment(rse_rotors_experiment_name)
    rse_rotors_exp = mlflow.get_experiment(rse_rotors_exp_id)
except mlflow.exceptions.RestException:
    rse_rotors_exp = mlflow.get_experiment_by_name(rse_rotors_experiment_name)
rse_rotors_exp



creating experiment


<Experiment: artifact_location='/Users/stephen.haddad/data/ukrse2022/artifacts/1', experiment_id='1', lifecycle_stage='active', name='rse_mlops_demo_rotors', tags={}>

### Set up model architecture hyperparameters

In [33]:
def build_ffnn_model(hyperparameters, input_shape):
    """
    Build a feed forward neural network model in tensorflow for predicting the occurence of turbulent orographically driven wind gusts called Rotors.
    """
    model = tensorflow.keras.models.Sequential()
    model.add(tensorflow.keras.layers.Dropout(hyperparameters['drop_out_rate'], 
                                              input_shape=input_shape))
    for i in numpy.arange(0,hyperparameters['n_layers']):
        model.add(tensorflow.keras.layers.Dense(hyperparameters['n_nodes'], 
                                                activation=hyperparameters['activation'], 
                                                kernel_constraint=tensorflow.keras.constraints.max_norm(3)))
        model.add(tensorflow.keras.layers.Dropout(hyperparameters['drop_out_rate']))
    model.add(tensorflow.keras.layers.Dense(2, activation='softmax'))             # This is the output layer
    return model


In [34]:
nx = X_train.shape[1]
input_shape = (nx,)

In [35]:
hyperparameters_dict = {
    'initial_learning_rate': 1.0e-4,
    'drop_out_rate': 0.2,
    'n_epochs': 100,
    'batch_size': 1000,
    'n_nodes': 1000,
    'n_layers': 4,
    'activation': 'relu',
    'loss': 'mse'
}

### Train model and setup monitoring

Tools:
* tensorboard
* tensorflow
* mlflow


In [36]:
log_dir_tensorboard = rse_root_data_dir / 'log_tensorboard' 
if not log_dir_tensorboard.is_dir():
    log_dir_tensorboard.mkdir()
    print(f'created tensorboard log directory {log_dir_tensorboard}')    
log_dir_tensorboard

PosixPath('/Users/stephen.haddad/data/ukrse2022/log_tensorboard')

In [37]:
tensorboard_callback = tensorflow.keras.callbacks.TensorBoard(log_dir=log_dir_tensorboard, 
                                                              histogram_freq=1)

In [38]:
log_dir_tensorboard

PosixPath('/Users/stephen.haddad/data/ukrse2022/log_tensorboard')

In [39]:
%tensorboard --logdir /Users/stephen.haddad/data/ukrse2022/log_tensorboard

In [40]:
%time 
current_run_name = rse_run_name_template.format(network_name='ffnn',
                                                dt=datetime.datetime.now()
                                               )
with mlflow.start_run(experiment_id=rse_rotors_exp.experiment_id, run_name=current_run_name) as current_run:
    rotors_ffnn_model = build_ffnn_model(hyperparameters=hyperparameters_dict,
                                     input_shape=input_shape,
                                    )
    rotors_ffnn_optimizer = tensorflow.optimizers.Adam(
        learning_rate=hyperparameters_dict['initial_learning_rate'])  
    
    rotors_ffnn_model.compile(optimizer=rotors_ffnn_optimizer, 
                          loss=hyperparameters_dict['loss'], 
                          metrics=[tensorflow.keras.metrics.RootMeanSquaredError()])
    
    history=rotors_ffnn_model.fit(
        X_train, 
        y_train, 
        validation_data=(X_val, 
                          y_val), 
        epochs=hyperparameters_dict['n_epochs'], 
        batch_size=hyperparameters_dict['batch_size'], 
        shuffle=True,
        verbose=0,
        callbacks=[tensorboard_callback],
    )    
    

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 6.91 µs


2022-09-02 11:50:50.017514: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-02 11:53:48.183242: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: /var/folders/w0/2x361bn95wj7lfgl33vksx1w0000gn/T/tmpyjvz9l_j/model/data/model/assets


In [41]:
rse_rotors_exp

<Experiment: artifact_location='/Users/stephen.haddad/data/ukrse2022/artifacts/1', experiment_id='1', lifecycle_stage='active', name='rse_mlops_demo_rotors', tags={}>

##  Example - Hyperparameter tuning in ray

As explained in the introduction to this notebook, the processing of training a machine learning algorithm calculate the *parameters* of the model. There are addtional config values not calculated by the training process that are determined by the data scientist doing the training which are called hyperparameters. The values of these hyperparameters are typically problem dependant and so change from problem to problem, even within a project. While over time one may develop an intuition for what hyperparameter values are suitable in a particular context, typically these values are determined in an outer training loop, called hyperparameter tuning.

In hyperparameter tuning, many models are trained, each with different hyperparameters. Commons ways to select candidate hyperparameter combinations are through a grid search through the n-dimensional hyperparameter space, or specifying possible hyperparameter values as independent distributions that are randomly sampled n times for n trials. Typically the hyperparameters for the trained model that scores best on the particular metric of choice will be chosen for the particular problem from then on.

As each of these trials is completely independent, its an easy workflow (conceptually at least) to execute in parallel. One can actually do this entirely manually by just setting up many different training runs with different hyperparameter runs. There are many ways to make mistakes in gathering and selecting the best results and saving output for evidence, as well as easily scaling up to run these trials in parallel etc. It is best to use existing tools to facilitate scaling and reproducability.

In this example, we are now going to do hyperparameter tuning using the *Ray* library, which is a python library for facilitating parallel, distributed computing. In this example we are just going to run on a local cluster to make thing easier for this tutorial, but ray provides facilities to easily set up a cliuster of ray worker on a linux cluster or a cloud VM  or a kubernetes cluster or almost any distributed computing platform.

Facilitating such use of tools and compute resources is where RSEs add value to a project so that research requirements determine technical implementation, not the other way around, and ensuring the best trained ML model is used to produce the best research outputs.

Ray docs:
* [ray tune docs](https://docs.ray.io/en/latest/tune/index.html)
* []https://docs.ray.io/en/latest/tune/examples/tune_mnist_keras.html#tune-mnist-keras


In [42]:
import rotors_hpt

In [43]:
import ray
import ray.tune
import ray.tune.schedulers 
import ray.tune.integration.keras

In [44]:
num_training_terations=20

Here we initialise ray (this should only be called once), so that we can view the ray dashboard to monitor progress on the specifed URL `http://127.0.0.1:5003`.

In [45]:
ray.init(num_cpus=4, dashboard_port=ray_dash_port, dashboard_host='127.0.0.1')


RayContext(dashboard_url='', python_version='3.8.13', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-09-02_11-53-54_362644_9954/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-09-02_11-53-54_362644_9954/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2022-09-02_11-53-54_362644_9954', 'metrics_export_port': 61913, 'gcs_address': '127.0.0.1:61846', 'address': '127.0.0.1:61846', 'node_id': '16340e3d257b2916205421ea82b677185e1eb8699c77067ef59e0d72'})

Now we specify distributions of for the hyperparameters we want to tune.

In [46]:
rotors_hpt_config = {
    'initial_learning_rate': ray.tune.uniform(1e-5,1e-3),
    'n_nodes': ray.tune.randint(100,500),
    'n_layers': ray.tune.randint(2,6)
}

We then set up a hyperparameter job that will execute on our local ray cluster.

In [47]:
rotors_hpt_analysis = ray.tune.run(
    rotors_hpt.run_ml_pipeline,
    name="rse_rotors_hpt",
    scheduler=ray.tune.schedulers.AsyncHyperBandScheduler(
        time_attr="training_iteration", 
        max_t=400, 
        grace_period=20,
    ),
    metric="root_mean_squared_error",
    mode="max",
    stop={"root_mean_squared_error": 0.99, 
          "training_iteration": num_training_terations},
    num_samples=10,
    resources_per_trial={"cpu": 2,
                         "gpu": 0},
    config=rotors_hpt_config,
    progress_reporter=ray.tune.JupyterNotebookReporter(overwrite=True),
    )

Trial name,status,loc,initial_learning_rate,n_layers,n_nodes,iter,total time (s),root_mean_squared_error
run_ml_pipeline_8c286_00000,TERMINATED,127.0.0.1:10094,0.000146297,2,474,20,22.4654,0.145108
run_ml_pipeline_8c286_00001,TERMINATED,127.0.0.1:10097,0.000673807,3,479,20,26.7092,0.148188
run_ml_pipeline_8c286_00002,TERMINATED,127.0.0.1:10094,0.000993082,2,115,20,22.8935,0.139948
run_ml_pipeline_8c286_00003,TERMINATED,127.0.0.1:10097,0.000650049,3,129,20,20.7422,0.142927
run_ml_pipeline_8c286_00004,TERMINATED,127.0.0.1:10094,0.000934637,3,380,20,23.7122,0.148189
run_ml_pipeline_8c286_00005,TERMINATED,127.0.0.1:10097,0.000699673,2,235,20,21.8068,0.147493
run_ml_pipeline_8c286_00006,TERMINATED,127.0.0.1:10094,0.00051609,5,234,20,23.5405,0.148186
run_ml_pipeline_8c286_00007,TERMINATED,127.0.0.1:10097,0.000456528,5,139,20,23.1664,0.14426
run_ml_pipeline_8c286_00008,TERMINATED,127.0.0.1:10094,0.000210488,2,276,20,21.5508,0.148041
run_ml_pipeline_8c286_00009,TERMINATED,127.0.0.1:10097,0.000578361,2,305,20,20.989,0.148031


2022-09-02 11:56:00,839	INFO tune.py:747 -- Total run time: 122.10 seconds (121.93 seconds for the tuning loop).


Once the job has run, we can then interrogate the results and find the best hyperparameters found from our search. In this case we've done quite a sdmall search so we might not have found the best values, as this is more illustrative of how to do it.

In [48]:
print('\n'.join([f'{k1} - {v1["root_mean_squared_error"]}' for k1,v1 in rotors_hpt_analysis.results.items()]))

8c286_00000 - 0.1451084166765213
8c286_00001 - 0.1481884866952896
8c286_00002 - 0.13994833827018738
8c286_00003 - 0.1429273784160614
8c286_00004 - 0.14818879961967468
8c286_00005 - 0.14749304950237274
8c286_00006 - 0.14818613231182098
8c286_00007 - 0.144259974360466
8c286_00008 - 0.1480407565832138
8c286_00009 - 0.14803124964237213


In [49]:
rotors_hpt_analysis.best_config

{'initial_learning_rate': 0.0009346367481620676, 'n_nodes': 380, 'n_layers': 3}

In [50]:
rotors_hpt_analysis.best_trial

run_ml_pipeline_8c286_00004

In [51]:
rotors_hpt_analysis.best_result

{'root_mean_squared_error': 0.14818879961967468,
 'time_this_iter_s': 0.25099873542785645,
 'done': True,
 'timesteps_total': None,
 'episodes_total': None,
 'training_iteration': 20,
 'trial_id': '8c286_00004',
 'experiment_id': 'f2c7b77c247b4efa80f1309e442788b3',
 'date': '2022-09-02_11-55-12',
 'timestamp': 1662116112,
 'time_total_s': 23.712169885635376,
 'pid': 10094,
 'hostname': 'Stephens-MBP',
 'node_ip': '127.0.0.1',
 'config': {'initial_learning_rate': 0.0009346367481620676,
  'n_nodes': 380,
  'n_layers': 3},
 'time_since_restore': 23.712169885635376,
 'timesteps_since_restore': 0,
 'iterations_since_restore': 20,
 'warmup_time': 0.0039730072021484375,
 'experiment_tag': '4_initial_learning_rate=0.0009,n_layers=3,n_nodes=380'}

# Example - save the model 

Having trained a model, possibly through hyperparameter tuning, we want our model to persist and so we need to store the trained weights and the architecture do describe the strcuture of the model to disk in a file of some sort. This is the first step in treating our models as *machine learning assets* that like data and code should follow *FAIR* principles.

We will use explore several way to save our models. The first is to use the ML framework we're using for our model, in this case the Keras interface to Tensorflow. Secondly we'll use ML Flow, the experiment tracking tool which provides a framework independent way to save and load models. thirdly we'll look at a completely separate tool which is intended as an open standard way for exchanging machine learning models called *Open Neural-network eXchange* format or ONNX.

Docs:
* [Keras Model Saving/Loading](https://www.tensorflow.org/guide/keras/save_and_serialize)
* [ML Flow Models](https://www.mlflow.org/docs/latest/models.html)
* [ONNX](https://onnxruntime.ai/docs/)


### Keras model format

First we'll use the framework specific keras format to save our model.

In [61]:
rotors_ffnn_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dropout (Dropout)           (None, 88)                0         
                                                                 
 dense (Dense)               (None, 1000)              89000     
                                                                 
 dropout_1 (Dropout)         (None, 1000)              0         
                                                                 
 dense_1 (Dense)             (None, 1000)              1001000   
                                                                 
 dropout_2 (Dropout)         (None, 1000)              0         
                                                                 
 dense_2 (Dense)             (None, 1000)              1001000   
                                                                 
 dropout_3 (Dropout)         (None, 1000)              0

In [62]:
model_keras_export_path = rse_root_data_dir / 'keras_model'
model_keras_export_path


PosixPath('/Users/stephen.haddad/data/ukrse2022/keras_model')

In [63]:
rotors_ffnn_model.save(model_keras_export_path)

INFO:tensorflow:Assets written to: /Users/stephen.haddad/data/ukrse2022/keras_model/assets


In [67]:
list(model_keras_export_path.rglob('*'))

[PosixPath('/Users/stephen.haddad/data/ukrse2022/keras_model/keras_metadata.pb'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/keras_model/variables'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/keras_model/saved_model.pb'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/keras_model/assets'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/keras_model/variables/variables.data-00000-of-00001'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/keras_model/variables/variables.index')]

### ML FLOW

When we trained our model at the start of the model, we logged all the details using ml flow automatically because at the start of the notebook we called `mlflow.tensorflow.autolog()` so that mlflow automatically logged any calls to `fit` with a tensorflow model. So as we'll see in the next notebook, we can access our saved models through the runs that are logged with ML Flow without explictly saving the model. We can also do an excplicit save outside the default artifact store of our ML Flow server as deomstrasted below.

In [70]:
mlflow_model_path = rse_root_data_dir / 'mflow_model'
mlflow_model_path

PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model')

In [71]:
mlflow.keras.save_model(rotors_ffnn_model, mlflow_model_path)

INFO:tensorflow:Assets written to: /Users/stephen.haddad/data/ukrse2022/mflow_model/data/model/assets


In [72]:
list(mlflow_model_path.rglob('*'))

[PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/requirements.txt'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/MLmodel'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/conda.yaml'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/save_format.txt'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/keras_module.txt'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/model'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/model/keras_metadata.pb'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/model/variables'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/model/saved_model.pb'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/model/assets'),
 PosixPath('/Users/stephen.haddad/data/ukrse2022/mflow_model/data/model/variables/variables.data-00000-of-0

In [77]:
mlflow.keras.load_model(str(mlflow_model_path) )

<keras.engine.sequential.Sequential at 0x7fbeb529cd90>

### ONNX

The third format is a standard alone libary for storing models in a common format and a runtime envrionment for inference that is independant of the framework that trained the models. This should (potentially) provide a more stable envirnment for model storage and inference tat does not need to change if the model training pipeline and its dependencies change.

In [54]:
import onnx
import tf2onnx

In [80]:
help(tf2onnx.convert.from_keras)

Help on function from_keras in module tf2onnx.convert:

from_keras(model, input_signature=None, opset=None, custom_ops=None, custom_op_handlers=None, custom_rewriter=None, inputs_as_nchw=None, extra_opset=None, shape_override=None, target=None, large_model=False, output_path=None)
    Returns a ONNX model_proto for a tf.keras model.
    
    Args:
        model: the tf.keras model we want to convert
        input_signature: a tf.TensorSpec or a numpy array defining the shape/dtype of the input
        opset: the opset to be used for the ONNX model, default is the latest
        target: list of workarounds applied to help certain platforms
        custom_op_handlers: dictionary of custom ops handlers
        custom_rewriter: list of custom graph rewriters
        extra_opset: list of extra opset's, for example the opset's used by custom ops
        shape_override: dict with inputs that override the shapes given by tensorflow
        inputs_as_nchw: transpose inputs in list from nchw to 

In [96]:
%%time
rotors_onnx_model, _ = tf2onnx.convert.from_keras(
    rotors_ffnn_model,
    [tensorflow.TensorSpec(
        shape=tensorflow.TensorShape([None,X_train.shape[1] ]),
        dtype=X_train.dtype,
        name='ukmo_rotors_model_input',
    )],
)

2022-09-02 17:33:15.862789: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2022-09-02 17:33:15.862982: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2022-09-02 17:33:15.864590: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1164] Optimization results for grappler item: graph_to_optimize
  function_optimizer: function_optimizer did nothing. time = 0.004ms.
  function_optimizer: function_optimizer did nothing. time = 0.001ms.

2022-09-02 17:33:16.128831: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2022-09-02 17:33:16.128956: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2022-09-02 17:33:16.242258: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:116

CPU times: user 1.28 s, sys: 399 ms, total: 1.68 s
Wall time: 1.67 s


In [97]:
onnx_model_path = rse_root_data_dir / 'onnx_model'
onnx_model_path

PosixPath('/Users/stephen.haddad/data/ukrse2022/onnx_model')

In [101]:
onnx.save_model(rotors_onnx_model, str(onnx_model_path))

In [106]:
onnx_model_path.stat()

os.stat_result(st_mode=33188, st_ino=16819661, st_dev=16777223, st_nlink=1, st_uid=501, st_gid=20, st_size=12378620, st_atime=1662136459, st_mtime=1662136498, st_ctime=1662136498)

In [114]:
# warning do not try to display, it will freze your notebook!
reloaded_onnx_model = onnx.load_model(str(onnx_model_path))

### Next Steps / Further Reading

https://www.mlflow.org/docs/latest/concepts.html

https://docs.ray.io/en/latest/tune/getting-started.html
https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview


### References
* mlflow
* scikit-learn
* tensorflow
* tensorboard