# DRYML Tutorial

## Ray tune

One primary use case for DRYML is hyperparameter tuning. `ObjectDef` provides an easy way to define and construct models, the `train` method of trainables gives all models a uniform training interface, and DRYML provides a set of interface methods for running hyperparameter searches on your models.

We'll briefly introduce `ray`, then write a simple hyperparameter tuning example.

Ray is a platform for remote process execution. It starts a server that manages the connected resources. Jobs are then sent to those resources as processes, each confined to the resources it was given. This is ideal for hyperparameter tuning, and Ray provides the `ray.tune` library for exactly this purpose.

This tutorial is not a tutorial for Ray itself. For that, please consult the Ray documentation at https://docs.ray.io/en/latest/index.html and the Ray Tune documentation at https://docs.ray.io/en/latest/tune/index.html

Let's start the ray server, and write a simple method for generating models.

In [1]:
import ray

In [2]:
ray.init(num_gpus=1, num_cpus=8)

2022-10-13 14:45:53,329	INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.20.4',
 'raylet_ip_address': '192.168.20.4',
 'redis_address': None,
 'object_store_address': '/tmp/ray/session_2022-10-13_14-45-46_593281_3296046/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-10-13_14-45-46_593281_3296046/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2022-10-13_14-45-46_593281_3296046',
 'metrics_export_port': 58012,
 'gcs_address': '192.168.20.4:63946',
 'address': '192.168.20.4:63946',
 'node_id': 'd57470a9eb3eed63932ad86b49d612860fa422f8131f17b5e343bc2a'}


## DRYML support

DRYML supports `ray.tune` through the `dryml.ray.tune.Trainer` class. This class defines a callable compatible with the Ray Tune functional API. We just need to supply it with a special callable that returns the few callables needed to set up and run the tune experiment.

`dryml.ray.tune.Trainer` expects the arguments:
* `name`: The name of the experiment to use
* `prep_method`: The callable for creating the necessary callables for setting up the experiment. This must be picklable via `dill`.
* `metrics`: A dictionary of metrics to compute after each step of training.
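
The `dill` requirement matters because a `prep_method` usually returns functions defined inside another function, which the standard `pickle` module refuses to serialize. The sketch below illustrates the difference. It is not DRYML code: `toy_prep_method` and `repo_gen` are hypothetical stand-ins, and the `dill` round trip runs only if `dill` is installed.

```python
import pickle


def toy_prep_method():
    # Hypothetical stand-in for a prep_method: returns a nested callable.
    def repo_gen():
        return "repo placeholder"

    return {"repo": repo_gen}


callables = toy_prep_method()

# The standard library pickles functions by reference (module plus
# qualified name), so a function defined inside another function is rejected.
try:
    pickle.dumps(callables["repo"])
    stdlib_pickle_ok = True
except (AttributeError, pickle.PicklingError):
    stdlib_pickle_ok = False

# dill serializes the function itself, so nested functions can round-trip.
try:
    import dill
except ImportError:
    dill = None  # dill is optional for this sketch

if dill is not None:
    restored = dill.loads(dill.dumps(callables["repo"]))
    dill_result = restored()
else:
    dill_result = None

print("stdlib pickle succeeded:", stdlib_pickle_ok)
print("dill round trip result:", dill_result)
```

A quick round trip like this, run on your own `prep_method` before launching an experiment, can surface serialization problems early rather than inside a remote worker.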

Once the `Trainer` is created, design the tune experiment in the usual way and pass the `Trainer` as the trainable callable.

We'll create a `prep_method` which yields all the callables needed for a simple experiment: how large a convolutional kernel is appropriate for a two-layer convolutional model that classifies MNIST digits?

## Define `prep_method`

In [3]:
import dryml
import os

In [4]:
# Name experiment so we can set model directory
experiment_name = 'TF_ray_test'
model_dir = os.path.realpath(os.path.join('./models', experiment_name))

# A callable to create the train/test `Dataset`
def data_gen():
    import tensorflow_datasets as tfds
    from dryml.data.tf import TFDataset
    
    # Check whether TensorFlow support exists
    # for the current GPU.
    dryml.context.context_check({'tf': {}})
    
    (ds_train, ds_test), ds_info = tfds.load(
        'mnist',
        split=['train', 'test'],
        shuffle_files=True,
        as_supervised=True,
        with_info=True)
    
    train_ds = TFDataset(
        ds_train,
        supervised=True
    )
    test_ds = TFDataset(
        ds_test,
        supervised=True
    )
    return {
        'train': train_ds,
        'test': test_ds,
    }


def prep_method():
    # A callable to create a repo. This is needed to store completed models for later use.
    def repo_gen():
        return dryml.Repo(directory=model_dir, create=True)


    # The input datasets have a context requirement, so we need another
    # callable: the Trainer function incorporates this requirement
    # when building the compute context.
    def data_ctx_gen():
        return {'tf': {}}

    # Model generator method which takes a config and generates a model.
    # It can also take a repo keyword argument so that already-trained
    # components can be grabbed from the repo.
    def model_gen(config, repo=None):
        import dryml
        import dryml.models
        import dryml.models.tf
        import tensorflow as tf
        
        # Grab the existing Best Category data transformation
        best_cat_def = dryml.ObjectDef(dryml.data.transforms.BestCat)
        best_step = repo.get(best_cat_def)
        
        kernel_size = int(config['kernel_size'])

        filters = 32
        n_layers = 2
        layer_defs = []
        for i in range(n_layers):
            layer_defs.append(
                ['Conv2D', {'filters': filters, 'kernel_size': kernel_size, 'activation': 'relu'}])
        layer_defs.append(['Flatten', {}])
        layer_defs.append(['Dense', {'units': 10, 'activation': 'linear'}])
        
        mdl_def = dryml.ObjectDef(
            dryml.models.tf.keras.base.SequentialFunctionalModel,
            input_shape=(28, 28, 1),
            layer_defs=layer_defs,
        )
        
        # Create model definition
        mdl_def = dryml.ObjectDef(
            dryml.models.Pipe,
            dryml.ObjectDef(
                dryml.models.tf.keras.Trainable,
                # Model definition
                model=mdl_def,
                # Train method
                train_fn=dryml.ObjectDef(
                    dryml.models.tf.keras.base.BasicTraining,
                    epochs=5,
                ),
                # Optimizer
                optimizer=dryml.ObjectDef(
                    dryml.models.tf.ObjectWrapper,
                    tf.keras.optimizers.Adam,
                ),
                # Loss
                loss=dryml.ObjectDef(
                    dryml.models.tf.ObjectWrapper,
                    tf.keras.losses.SparseCategoricalCrossentropy,
                    obj_kwargs={
                        'from_logits': True
                    }
                )
            ),
            best_step,
        )
        
        ctx_reqs = {'tf': {'num_gpus': 1}}

        # Return a dictionary with the model and, optionally, its context requirements
        return {
            'model': mdl_def.build(repo=repo),
            'ctx_reqs': ctx_reqs,
        }

    # Return dictionary with defined callables.
    return {
        'repo': repo_gen,
        'data_ctx': data_ctx_gen,
        'data': data_gen,
        'model': model_gen,
    }
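
To make the contract concrete, the sketch below shows how a consumer might use the dictionary returned by `prep_method`. It is an illustration under assumptions only: `toy_prep_method` and `run_trial` are hypothetical names, the return values are placeholders, and the real `Trainer` may call these callables in a different order or do considerably more with them.

```python
def toy_prep_method():
    # Hypothetical stand-ins for the four callables used in this tutorial.
    def repo_gen():
        return {}  # a plain dict plays the role of the repo

    def data_ctx_gen():
        return {"tf": {}}

    def data_gen():
        return {"train": [1, 2, 3], "test": [4, 5]}

    def model_gen(config, repo=None):
        return {
            "model": f"model(kernel_size={config['kernel_size']})",
            "ctx_reqs": {"tf": {"num_gpus": 1}},
        }

    return {
        "repo": repo_gen,
        "data_ctx": data_ctx_gen,
        "data": data_gen,
        "model": model_gen,
    }


def run_trial(prep_method, config):
    # Hypothetical consumer: illustrates the contract only, not how
    # dryml.ray.tune actually orders or uses these calls.
    callables = prep_method()
    repo = callables["repo"]()
    data_ctx = callables["data_ctx"]()
    model_info = callables["model"](config, repo=repo)
    data = callables["data"]()
    return {
        "model": model_info["model"],
        "ctx_reqs": [data_ctx, model_info["ctx_reqs"]],
        "n_train": len(data["train"]),
    }


trial = run_trial(toy_prep_method, {"kernel_size": 5})
print(trial["model"])
```

The point of returning callables rather than objects is that nothing heavy (datasets, models, repos) has to be built until the trial is already running in its own process.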

Next, let's prepare the `Repo` with needed objects and the `Repo` directory.

In [5]:
# Let's create a repo pointing to the same directory.
repo = dryml.Repo(directory=model_dir, create=True)

# Create a Best Category trainable, and save it to the repo.
best_cat_def = dryml.ObjectDef(dryml.data.transforms.BestCat)
# Repo's get method has a special ability when `build_missing_def=True`.
# If a non-concrete definition is not in the repo, one instance will be
# created and stored in the repo.
repo.get(best_cat_def, build_missing_def=True)

# Save the objects
repo.save()


Now, let's design the tune experiment!

In [62]:
from ray import tune
import dryml.ray
import dryml.metrics
import datetime

local_dir = os.path.realpath('./ray_results')

# Define study space
config = {
    # tune.randint's upper bound is exclusive, so this samples 3 through 7
    'kernel_size': tune.randint(3, 8),
}

# Create model trainer
model_trainer = dryml.ray.tune.Tune1Trainer(
    name=experiment_name,
    prep_method=prep_method,
    metrics={'accuracy': dryml.metrics.scalar.categorical_accuracy},
)

analysis = tune.run(
    model_trainer,
    config=config,
    metric='accuracy',
    mode='max',
    resources_per_trial={'cpu': 8, 'gpu': 1},
    num_samples=2,
    keep_checkpoints_num=2,
    local_dir=local_dir,
    log_to_file=True,
    reuse_actors=False,
    fail_fast=True,
    progress_reporter=tune.JupyterNotebookReporter(
        True,
        max_report_frequency=5,
    ),
    # resume=True, # Currently needs more experimentation to work properly
)

2022-10-13 15:41:24,731	INFO registry.py:69 -- Detected unknown callable for trainable. Converting to class.


ValueError: You called resume (True) when no checkpoint exists in local directory (/home/mkrafcz2/HAL_Projects/DRYML/tutorials/ray_results/TF_ray_test_2022-10-13_15-41-24). If you want to start a new experiment, use `resume="AUTO"` or `resume=None`. If you expected an experiment to already exist, check if you supplied the correct `local_dir` to `tune.run()`.
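
The search space above uses `tune.randint`, which draws integers from a half-open interval: the lower bound 3 is included and the upper bound 8 is excluded, so the candidate kernel sizes are 3 through 7. The standard-library sketch below mimics that sampling with `random.randrange` (a stand-in, not Ray's sampler) and computes how often two independent draws coincide, which is relevant when `num_samples=2`.

```python
import random
from fractions import Fraction
from itertools import product

low, high = 3, 8
candidates = list(range(low, high))  # same half-open convention: 3..7

# Draw many samples with a seeded generator to check the range empirically.
rng = random.Random(0)
draws = [rng.randrange(low, high) for _ in range(1000)]

# Exact probability that two independent uniform draws pick the same value.
pairs = list(product(candidates, repeat=2))
p_same = Fraction(sum(a == b for a, b in pairs), len(pairs))

print("candidates:", candidates)
print("observed min/max:", min(draws), max(draws))
print("P(two trials share a kernel size):", p_same)
```

With only two samples, duplicated configurations are therefore not unusual. Raising `num_samples`, or switching to `tune.grid_search` over the explicit list of sizes, covers the space more reliably.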

In [30]:
# Get the best trial by max accuracy
results_df = analysis.results_df
best_trial_id = results_df['accuracy'].idxmax()

In [33]:
results_df

| trial_id | accuracy | dry_id | done | time_this_iter_s | timesteps_total | episodes_total | training_iteration | experiment_id | date | timestamp | time_total_s | pid | hostname | node_ip | time_since_restore | timesteps_since_restore | iterations_since_restore | experiment_tag | config.kernel_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c0912_00000 | 0.976863 | 434c6dfd-7aed-4eae-b2f8-631ebd512adf | True | 2.365537 |  |  | 6 | 6027c150368f47a19e15f4ad60a325c9 | 2022-10-13_14-47-53 | 1665690473 | 70.25405 | 3296964 | hal04 | 192.168.20.4 | 70.25405 | 0 | 6 | 0_kernel_size=6 | 6 |
| c0912_00001 | 0.976963 | 53e9ecc4-0967-40af-8dc2-56b22d354b5a | True | 2.33263 |  |  | 6 | b0aa37dd5c9a44ceb51bb2695270ebce | 2022-10-13_14-48-44 | 1665690524 | 47.492501 | 3296957 | hal04 | 192.168.20.4 | 47.492501 | 0 | 6 | 1_kernel_size=6 | 6 |

In [54]:
# Get and show config of best trial
best_trial_data = results_df.loc[best_trial_id]
config_data = best_trial_data.loc[best_trial_data.index.str.contains('config')]
config_data

config.kernel_size    6
Name: c0912_00001, dtype: object

In [56]:
# Report accuracy and id
print(f"Accuracy: {best_trial_data.loc['accuracy']}")
best_model_id = best_trial_data.loc['dry_id'] 
print(f"id: {best_model_id}")

Accuracy: 0.9769631410256411
id: 53e9ecc4-0967-40af-8dc2-56b22d354b5a


In [57]:
# Refresh the repository
repo.load_objects_from_directory()

In [58]:
# Fetch the best performing model
model = repo.get_obj_by_id(best_model_id)

In [59]:
# Let's define a method to test the model's accuracy
@dryml.compute
def test_model(model):
    import dryml.metrics
    import tensorflow_datasets as tfds
    from dryml.data.tf import TFDataset

    # Check whether TensorFlow support exists
    # for the current GPU.
    dryml.context.context_check({'tf': {}})

    (ds_test,) = tfds.load(
        'mnist',
        split=['test'],
        shuffle_files=True,
        as_supervised=True)

    test_ds = TFDataset(
        ds_test,
        supervised=True
    )

    return dryml.metrics.scalar.categorical_accuracy(model, test_ds)

In [60]:
# And verify recorded accuracy!
test_model(model)

0.9769631410256411