# DRYML Tutorial 6 - Hyperparameter Searches with `ray`

## Ray tune

One primary use case for DRYML is hyperparameter tuning of your models. With `ObjectDef` providing an easy way to define model templates, and trainables providing a uniform training interface, DRYML slots neatly into the `ray` hyperparameter tuning process.

We'll briefly introduce `ray`, and write a simple hyperparameter tuning example.

Ray is a platform for remote process execution. It creates a server which manages connected resources. Jobs can then be sent to those resources as processes, each confined to the resources assigned to it. This is ideal for hyperparameter tuning, and Ray provides the `ray.tune` library for exactly this purpose.

This tutorial is not a tutorial for Ray itself. For that, please consult the Ray documentation at https://docs.ray.io/en/latest/index.html and the Tune documentation at https://docs.ray.io/en/latest/tune/index.html

Let's start the ray server, and write a simple method for generating models.

In [1]:
import ray

In [2]:
ray.init(num_gpus=4, num_cpus=8)

2023-03-21 16:51:39,519	INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.20.14',
 'raylet_ip_address': '192.168.20.14',
 'redis_address': None,
 'object_store_address': '/tmp/ray/session_2023-03-21_16-51-36_925744_63058/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2023-03-21_16-51-36_925744_63058/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2023-03-21_16-51-36_925744_63058',
 'metrics_export_port': 56378,
 'gcs_address': '192.168.20.14:63339',
 'address': '192.168.20.14:63339',
 'node_id': 'd7f5be4fa838268e878ba76905c1cd19140b94392c30d24969dd14f9'}

[2m[33m(raylet)[0m   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
[2m[33m(raylet)[0m   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
[2m[33m(raylet)[0m   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
[2m[33m(raylet)[0m   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
[2m[33m(raylet)[0m   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
[2m[33m(raylet)[0m Traceback (most recent call last):
[2m[33m(raylet)[0m   File "/home/mkrafcz2/.conda/envs/opence-v1.6.1-drymldev/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
[2m[33m(raylet)[0m     loop.run_until_complete(agent.run())
[2m[33m(raylet)[0m   File "/home/mkrafcz2/.conda/envs/opence-v1.6.1-drymldev/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
[2m[33m(raylet)[0m     return future.result()
[2m[33m(raylet)[0m   File "/home/mkrafcz2/.conda/envs/opence-v1.6.1-drymldev/lib/python3.9/s

## DRYML support

DRYML supports `ray.tune` through the `dryml.ray.tune.Trainer` class. This class defines a callable compatible with the ray tune functional API. We just need to supply it with a special callable which returns the few callables needed to set up and run the tune experiment.

`dryml.ray.tune.Trainer` expects the arguments:
* `name`: The name of the experiment to use
* `prep_method`: The callable for creating the necessary callables for setting up the experiment. This must be picklable via `dill`.
* `metrics`: A dictionary of metrics to compute after each step of training.

Once created, the user can then design their tune experiment in the usual way, and pass the `Trainer` as the callable trainable method.
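
The `dill` requirement on `prep_method` is easier to appreciate with a small experiment. The sketch below uses only the standard library and does not call DRYML or `dill`. It shows that the built-in `pickle` module cannot serialize a function defined inside another function, which is the style of callable a `prep_method` returns; `dill` is able to serialize such functions. The name `example_prep_method` and the helper `can_pickle` are hypothetical and exist only for this illustration.

```python
import pickle


def example_prep_method():
    # Nested callable, in the style of what a prep_method returns.
    def data_ctx_gen():
        return {'tf': {}}

    return {'data_ctx': data_ctx_gen}


def can_pickle(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Local (nested) functions cannot be pickled by reference.
        return False


callables = example_prep_method()
nested_ok = can_pickle(callables['data_ctx'])
print(f"standard pickle handles the nested callable: {nested_ok}")
```

In practice this means the callables can be written as closures, as long as everything they capture can itself be serialized by `dill`.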

We'll create a `prep_method` which can yield all needed callables for a simple experiment: how large a convolutional kernel is appropriate for a two-layer convolutional model for classifying MNIST digits.

## Define `prep_method`

In [3]:
import dryml
import os

In [4]:
# Name experiment so we can set model directory
experiment_name = 'TF_ray_test'
model_dir = os.path.realpath(os.path.join('./models', experiment_name))

# A callable to create the train/test `Dataset`
def data_gen():
    import tensorflow_datasets as tfds
    from dryml.data.tf import TFDataset
    
    # Check whether tensorflow support exists
    # For the current GPU.
    dryml.context.context_check({'tf': {}})
    
    (ds_train, ds_test), ds_info = tfds.load(
        'mnist',
        split=['train', 'test'],
        shuffle_files=True,
        as_supervised=True,
        with_info=True)
    
    train_ds = TFDataset(
        ds_train,
        supervised=True
    )
    test_ds = TFDataset(
        ds_test,
        supervised=True
    )
    return {
        'train': train_ds,
        'test': test_ds,
    }


def prep_method():
    # A callable to create a repo. This is needed to store completed models for later use.
    def repo_gen():
        return dryml.Repo(directory=model_dir, create=True)


    # We need another callable because the input datasets have a context
    # requirement; we want the Trainer function to incorporate this
    # requirement when building the compute context.
    def data_ctx_gen():
        return {'tf': {}}

    # Model generator method which takes a config and generates a model.
    # It can also take a repo keyword argument so that already-trained
    # components can be grabbed from the repo.
    def model_gen(config, repo=None):
        import dryml
        import dryml.models
        import dryml.models.tf
        import tensorflow as tf
        
        # Grab the existing Best Category data transformation
        best_cat_def = dryml.ObjectDef(dryml.data.transforms.BestCat)
        best_step = repo.get(best_cat_def)
        
        kernel_size = int(config['kernel_size'])

        filters = 32
        n_layers = 2
        layer_defs = []
        for i in range(n_layers):
            layer_defs.append(
                ['Conv2D', {'filters': filters, 'kernel_size': kernel_size, 'activation': 'relu'}])
        layer_defs.append(['Flatten', {}])
        layer_defs.append(['Dense', {'units': 10, 'activation': 'linear'}])
        
        mdl_def = dryml.ObjectDef(
            dryml.models.tf.keras.base.SequentialFunctionalModel,
            input_shape=(28, 28, 1),
            layer_defs=layer_defs,
        )
        
        # Create model definition
        mdl_def = dryml.ObjectDef(
            dryml.models.Pipe,
            dryml.ObjectDef(
                dryml.models.tf.keras.Trainable,
                # Model definition
                model=mdl_def,
                # Train method
                train_fn=dryml.ObjectDef(
                    dryml.models.tf.keras.base.BasicTraining,
                    epochs=5,
                ),
                # Optimizer
                optimizer=dryml.ObjectDef(
                    dryml.models.tf.Wrapper,
                    tf.keras.optimizers.Adam,
                ),
                # Loss
                loss=dryml.ObjectDef(
                    dryml.models.tf.Wrapper,
                    tf.keras.losses.SparseCategoricalCrossentropy,
                    from_logits=True
                )
            ),
            best_step,
        )
        
        ctx_reqs = {'tf': {'num_gpus': 1}}

        # Return a dictionary with the model and, optionally, its context requirements.
        return {
            'model': mdl_def.build(repo=repo),
            'ctx_reqs': ctx_reqs,
        }

    # Return dictionary with defined callables.
    return {
        'repo': repo_gen,
        'data_ctx': data_ctx_gen,
        'data': data_gen,
        'model': model_gen,
    }
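
To see what the `kernel_size` hyperparameter changes in this model, it helps to work out the tensor shapes by hand. The sketch below is plain arithmetic and does not use TensorFlow. It assumes that the `Conv2D` entries in `layer_defs` are built with the Keras defaults (`'valid'` padding and stride 1), which is an assumption about how `SequentialFunctionalModel` forwards these arguments. Under that assumption each convolution shrinks the image side by `kernel_size - 1`. The function names are hypothetical.

```python
def conv_output_side(input_side, kernel_size, n_layers):
    """Side length after n_layers 'valid', stride-1 convolutions."""
    side = input_side
    for _ in range(n_layers):
        side = side - kernel_size + 1
    return side


def flattened_features(kernel_size, input_side=28, n_layers=2, filters=32):
    """Number of values reaching the Dense layer after Flatten."""
    side = conv_output_side(input_side, kernel_size, n_layers)
    return side * side * filters


for k in (3, 5, 7):
    side = conv_output_side(28, k, 2)
    print(f"kernel_size={k}: {side}x{side}x32 -> {flattened_features(k)} features")
```

With these assumptions, a kernel size of 3 feeds 18432 features into the final `Dense` layer, while a kernel size of 7 feeds 8192, so larger kernels give larger convolutional filters but a smaller dense layer.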

Next, let's prepare the `Repo` with needed objects and the `Repo` directory.

In [5]:
# Let's create a repo pointing to the same directory.
repo = dryml.Repo(directory=model_dir, create=True)

# Create a Best Category trainable, and save it to the repo.
best_cat_def = dryml.ObjectDef(dryml.data.transforms.BestCat)
# Repo's get method has a special ability when `build_missing_def=True`.
# If a non-concrete definition is not in the repo, one instance will be
# created and stored in the repo.
repo.get(best_cat_def, build_missing_def=True)

# Save the objects
repo.save()
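
The "build if missing" behaviour described in the comments above can be pictured with a toy in-memory registry. This is only a sketch of the idea, not DRYML's `Repo` implementation: the class `ToyRepo`, its `factory` argument, and the choice of `KeyError` for a missing entry are all assumptions made for illustration.

```python
class ToyRepo:
    """Hypothetical in-memory stand-in; not DRYML's Repo."""

    def __init__(self):
        self._objects = {}

    def get(self, key, factory, build_missing=False):
        # Build and store the object only when asked to and only once.
        if key not in self._objects:
            if not build_missing:
                raise KeyError(f"no object stored for {key!r}")
            self._objects[key] = factory()
        return self._objects[key]


toy_repo = ToyRepo()
# First call creates and stores the object.
first = toy_repo.get('best_cat', factory=object, build_missing=True)
# Later calls, such as the one inside model_gen, find the stored object.
second = toy_repo.get('best_cat', factory=object)
same_instance = first is second
print(f"same stored instance returned: {same_instance}")
```

This mirrors the flow of the tutorial: the object is created once up front, and every trial's `model_gen` then fetches the stored one.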

Now, let's design the tune experiment!

In [7]:
from ray import tune
import dryml.ray
import dryml.metrics
import datetime

local_dir = os.path.realpath('./ray_results')

# Define study space
config = {
    'kernel_size': tune.randint(3, 8),  # integers 3..7; the upper bound is exclusive
}

# Create model trainer
model_trainer = dryml.ray.tune.Tune1Trainer(
    name=experiment_name,
    prep_method=prep_method,
    metrics={'accuracy': dryml.metrics.scalar.categorical_accuracy},
)

analysis = tune.run(
    model_trainer,
    config=config,
    metric='accuracy',
    mode='max',
    resources_per_trial={'cpu': 1, 'gpu': 1},
    num_samples=20,
    keep_checkpoints_num=2,
    local_dir=local_dir,
    log_to_file=True,
    reuse_actors=False,
    fail_fast=True,
    progress_reporter=tune.JupyterNotebookReporter(
        True,
        max_report_frequency=5,
    ),
    # resume=True, # Currently needs more experimentation to work properly
)

Trial name,status,loc,kernel_size,iter,total time (s),accuracy
TF_ray_test_31227_00000,TERMINATED,192.168.20.14:63213,7,6,45.5019,0.970152
TF_ray_test_31227_00001,TERMINATED,192.168.20.14:63214,3,6,42.6595,0.974159
TF_ray_test_31227_00002,TERMINATED,192.168.20.14:63217,5,6,43.9685,0.96875
TF_ray_test_31227_00003,TERMINATED,192.168.20.14:63215,6,6,46.7062,0.97506
TF_ray_test_31227_00004,TERMINATED,192.168.20.14:63211,5,6,43.1475,0.975861
TF_ray_test_31227_00005,TERMINATED,192.168.20.14:63210,3,6,43.3579,0.970553
TF_ray_test_31227_00006,TERMINATED,192.168.20.14:91299,6,6,43.9778,0.969151
TF_ray_test_31227_00007,TERMINATED,192.168.20.14:91300,7,6,44.6296,0.974559
TF_ray_test_31227_00008,TERMINATED,192.168.20.14:107899,5,6,42.1635,0.981671
TF_ray_test_31227_00009,TERMINATED,192.168.20.14:108433,6,6,44.1281,0.973057


2023-03-21 17:00:14,326	INFO tune.py:639 -- Total run time: 243.18 seconds (243.04 seconds for the tuning loop).
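
The search space in the cell above draws `kernel_size` with `tune.randint(3, 8)`, whose upper bound is exclusive, so only the values 3 through 7 can occur. The sketch below imitates that sampling with the standard library's `random.randrange`, which has the same half-open convention. It is a stand-in for illustration and does not reproduce Ray's random number stream; the seed is arbitrary.

```python
import random
from collections import Counter

rng = random.Random(0)  # arbitrary fixed seed so the sketch is repeatable

# randrange(3, 8) never returns 8, like tune.randint(3, 8).
samples = [rng.randrange(3, 8) for _ in range(20)]

# Count how many of the 20 draws landed on each kernel size.
counts = Counter(samples)
for kernel_size in range(3, 8):
    print(f"kernel_size={kernel_size}: {counts[kernel_size]} trials")
```

With only 20 random draws over five candidate values, some kernel sizes will be tried just a few times. For a search space this small, an exhaustive `tune.grid_search` over the same values is an alternative worth considering.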


In [8]:
# Get the best trial by max accuracy
results_df = analysis.results_df
best_trial_id = results_df['accuracy'].idxmax()



In [9]:
results_df

trial_id,accuracy,dry_id,done,time_this_iter_s,timesteps_total,episodes_total,training_iteration,experiment_id,date,timestamp,time_total_s,pid,hostname,node_ip,time_since_restore,timesteps_since_restore,iterations_since_restore,experiment_tag,config.kernel_size
31227_00000,0.970152,70237032-9694-4ae8-a781-3ed0d9c4660d,True,1.055847,,,6,3738207981a34fbe9ce2188b18d1aa8a,2023-03-21_16-56-57,1679435817,45.501856,63213,hal14,192.168.20.14,45.501856,0,6,0_kernel_size=7,7
31227_00001,0.974159,70f44434-4ede-42b6-84b6-7982217d4b92,True,0.597679,,,6,6cb6003011b34749bd3cfae54dd36746,2023-03-21_16-56-54,1679435814,42.659537,63214,hal14,192.168.20.14,42.659537,0,6,1_kernel_size=3,3
31227_00002,0.96875,545a008c-7dbc-4041-ae11-9726eb1eda2d,True,0.784724,,,6,45c24fa6af6949c4bbca46736e873299,2023-03-21_16-56-56,1679435816,43.968468,63217,hal14,192.168.20.14,43.968468,0,6,2_kernel_size=5,5
31227_00003,0.97506,82184280-fab7-4e8c-a1dc-4bd77f30a4c5,True,0.881449,,,6,c57b495c8100401b9dbfe2b565cc2ef2,2023-03-21_16-56-59,1679435819,46.706225,63215,hal14,192.168.20.14,46.706225,0,6,3_kernel_size=6,6
31227_00004,0.975861,478d1dd2-0d70-408a-adcd-6bf41f6d7028,True,1.268937,,,6,138df1d05a2f4c8fb544674ea6b3455d,2023-03-21_16-57-42,1679435862,43.147506,63211,hal14,192.168.20.14,43.147506,0,6,4_kernel_size=5,5
31227_00005,0.970553,f257cdac-f137-40e0-bdcb-4faa7759ef1e,True,0.620278,,,6,e225d6277a9546e5bfc9ea1c443d9c93,2023-03-21_16-57-44,1679435864,43.357932,63210,hal14,192.168.20.14,43.357932,0,6,5_kernel_size=3,3
31227_00006,0.969151,2d48b58c-1bd7-428b-a360-4a2b2ee9f371,True,1.012824,,,6,2b46e341bea5464f846436f5d8426568,2023-03-21_16-57-48,1679435868,43.977832,91299,hal14,192.168.20.14,43.977832,0,6,6_kernel_size=6,6
31227_00007,0.974559,cca85122-9e8d-4510-af76-87d223c7d0b1,True,1.104136,,,6,ff9e85f0613746879d69896853de5e45,2023-03-21_16-57-48,1679435868,44.629603,91300,hal14,192.168.20.14,44.629603,0,6,7_kernel_size=7,7
31227_00008,0.981671,31b2fa4a-15d7-4d99-b19b-720d2cfdd7af,True,1.020965,,,6,20b9277b936646cea95402b46e1a43ce,2023-03-21_16-58-31,1679435911,42.16352,107899,hal14,192.168.20.14,42.16352,0,6,8_kernel_size=5,5
31227_00009,0.973057,f82269af-6728-4f1d-92a9-14b04d6819b2,True,0.754099,,,6,a6da8d78832748b0a8a21332a2fc07b2,2023-03-21_16-58-34,1679435914,44.128121,108433,hal14,192.168.20.14,44.128121,0,6,9_kernel_size=6,6


In [10]:
# Get and show config of best trial
best_trial_data = results_df.loc[best_trial_id]
config_data = best_trial_data.loc[best_trial_data.index.str.contains('config')]
config_data

config.kernel_size    5
Name: 31227_00008, dtype: object

In [11]:
# Report accuracy and id
print(f"Accuracy: {best_trial_data.loc['accuracy']}")
best_model_id = best_trial_data.loc['dry_id'] 
print(f"id: {best_model_id}")

Accuracy: 0.9816706730769231
id: 31b2fa4a-15d7-4d99-b19b-720d2cfdd7af
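
Before reading too much into the single best trial, it can help to group the results by kernel size. The sketch below uses only the standard library and the ten rows printed in the results table above (trial id, `config.kernel_size`, and `accuracy` copied as shown). It is a small illustration of the analysis, not part of DRYML or Ray.

```python
from collections import defaultdict
from statistics import mean

# (trial_id, kernel_size, accuracy) copied from the results table above.
trials = [
    ('31227_00000', 7, 0.970152),
    ('31227_00001', 3, 0.974159),
    ('31227_00002', 5, 0.96875),
    ('31227_00003', 6, 0.97506),
    ('31227_00004', 5, 0.975861),
    ('31227_00005', 3, 0.970553),
    ('31227_00006', 6, 0.969151),
    ('31227_00007', 7, 0.974559),
    ('31227_00008', 5, 0.981671),
    ('31227_00009', 6, 0.973057),
]

# Same selection as idxmax on the accuracy column.
best = max(trials, key=lambda t: t[2])
worst = min(trials, key=lambda t: t[2])

# Collect accuracies per kernel size.
by_kernel = defaultdict(list)
for _, kernel_size, accuracy in trials:
    by_kernel[kernel_size].append(accuracy)

for kernel_size in sorted(by_kernel):
    accs = by_kernel[kernel_size]
    spread = max(accs) - min(accs)
    print(f"kernel_size={kernel_size}: n={len(accs)}, "
          f"mean={mean(accs):.4f}, spread={spread:.4f}")
```

In these ten rows, kernel size 5 produced both the highest and the lowest accuracy, so the variation between repeated runs of one configuration is at least as large as the differences between configurations. Repeating each configuration several times gives a more reliable comparison than a single best score.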


In [12]:
# Refresh the repository
repo.load_objects_from_directory()

In [13]:
# Fetch the best performing model
model = repo.get_obj_by_id(best_model_id)

In [14]:
# Let's define a method to test the model's accuracy
@dryml.compute
def test_model(model):
    import dryml.metrics
    import tensorflow_datasets as tfds
    from dryml.data.tf import TFDataset

    # Check whether tensorflow support exists
    # For the current GPU.
    dryml.context.context_check({'tf': {}})

    (ds_test,) = tfds.load(
        'mnist',
        split=['test'],
        shuffle_files=True,
        as_supervised=True)

    test_ds = TFDataset(
        ds_test,
        supervised=True
    )

    return dryml.metrics.scalar.categorical_accuracy(model, test_ds)

In [15]:
# And verify recorded accuracy!
test_model(model)

0.9816706730769231