In [1]:
import dryml
from dryml import ObjectDef
import numpy as np

# DRYML Tutorial 5 - Model Generation

So far we have covered `Object`s, `ObjectDef`s, `Repo`s, `context`, and `Trainable`s. We've also seen some of the data transformations implemented in `dryml.data.transforms` and some of the metrics in `dryml.metrics`. Not all ML problems require huge models on the scale of GPT-3. Sometimes we have a smallish dataset and smallish models. In these cases it can be useful to check model statistics. For instance, how often does a model achieve a given accuracy? Is the model finicky, sometimes training well and other times terribly? We'll use DRYML's `ObjectDef`s as factories to define model templates and then generate fresh copies from them. We will then train each copy, check its accuracy, and see how much that accuracy varies!

## Experiment setup

First, we need to set up some utility functions. `gen_dataset` creates the datasets we need. `train_model` and `test_model` are small functions encapsulating the training and testing steps, which are the same for all `Trainable`s.

In [2]:
%%writefile temp_mod.py
# We use the %%writefile cell magic because dryml compute processes use `spawn` and will need to access the definition of
# gen_dataset

# Create function to generate the datasets for later use
def gen_dataset():
    # import some names
    import dryml
    import tensorflow_datasets as tfds
    from dryml.data.tf import TFDataset

    # Check that the context has tensorflow ability, but don't get specific.
    dryml.context.context_check({'tf': {}})

    (ds_train, ds_test), ds_info = tfds.load(
        'mnist',
        split=['train', 'test'],
        shuffle_files=True,
        as_supervised=True,
        with_info=True)
    
    train_ds = TFDataset(
        ds_train,
        supervised=True,
    )
    
    test_ds = TFDataset(
        ds_test,
        supervised=True,
    )
    
    return train_ds, test_ds

Writing temp_mod.py


In [3]:
# Create function to train a model.
# We use ctx_update_objs=True to indicate any objects we give the method should be updated with their
# state at the end of the method.
@dryml.compute_context(ctx_update_objs=True)
def train_model(model):
    from temp_mod import gen_dataset
    train_ds, _ = gen_dataset()

    model.prep_train()
    model.train(train_ds)


# Create function to test model
# Since this method doesn't change the models, we don't have to update them after calling it.
@dryml.compute
def test_model(model):
    from dryml.metrics import categorical_accuracy
    from temp_mod import gen_dataset
    _, test_ds = gen_dataset()

    model.prep_eval()
    return categorical_accuracy(model, test_ds)
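
The `test_model` function reports a categorical accuracy, and the pipelines built later end with a `BestCat` step. As a rough illustration, assuming `BestCat` picks the highest-scoring category and the metric is the fraction of predictions that match the labels, the computation can be sketched in plain Python. The helper names and example scores below are hypothetical and are not part of DRYML.

```python
def best_category(scores):
    """Return the index of the largest score (an argmax over categories)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(score_rows, labels):
    """Fraction of examples whose highest-scoring category matches the label."""
    predictions = [best_category(row) for row in score_rows]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical scores for four examples over three categories.
example_scores = [
    [0.1, 2.3, -1.0],   # highest score at index 1
    [1.5, 0.2, 0.3],    # highest score at index 0
    [0.0, 0.1, 4.0],    # highest score at index 2
    [0.9, 0.8, 0.7],    # highest score at index 0
]
example_labels = [1, 0, 2, 1]

# Three of the four predictions match their labels.
example_accuracy = accuracy(example_scores, example_labels)
```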

## Create ML Models

Now we'll create a few model classes using `ObjectDef`s. We'll then use `ObjectDef.build` to create instances of these models.
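
To make the factory idea concrete, here is a minimal sketch of a definition object that stores a class with its arguments and builds a fresh instance on every call. `SimpleDef`, `Scaler`, and `Pipeline` are hypothetical stand-ins written for this illustration. This is not how `ObjectDef` is implemented, only the general pattern.

```python
class SimpleDef:
    """Toy stand-in for a definition object; NOT DRYML's ObjectDef."""

    def __init__(self, cls, *args, **kwargs):
        self.cls = cls
        self.args = args
        self.kwargs = kwargs

    def build(self):
        # Build nested definitions first so every instance gets fresh parts.
        args = [a.build() if isinstance(a, SimpleDef) else a for a in self.args]
        kwargs = {
            k: (v.build() if isinstance(v, SimpleDef) else v)
            for k, v in self.kwargs.items()
        }
        return self.cls(*args, **kwargs)


class Scaler:
    """Hypothetical processing step."""

    def __init__(self, factor=1.0):
        self.factor = factor


class Pipeline:
    """Hypothetical container of steps."""

    def __init__(self, *steps):
        self.steps = list(steps)


pipeline_def = SimpleDef(Pipeline, SimpleDef(Scaler, factor=2.0))

# Each build yields an independent pipeline with its own Scaler.
first = pipeline_def.build()
second = pipeline_def.build()
```

Because nested definitions are rebuilt on each call, two generated models never share components, which is what makes repeated training runs independent of each other.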

In [4]:
import dryml.models
import dryml.data
import dryml.models.sklearn
import sklearn.neighbors

In [5]:
# Let's define some common processing steps so we don't have to build full definitions for them every time.
flatten_def = ObjectDef(dryml.data.transforms.Flatten)
best_cat_def = ObjectDef(dryml.data.transforms.BestCat)

In [6]:
# First, we'll build an sklearn model.
sklearn_mdl_def = ObjectDef(
    dryml.models.Pipe,
    flatten_def,
    ObjectDef(
        dryml.models.sklearn.Trainable,
        model=ObjectDef(
            dryml.models.sklearn.ClassifierModel,
            sklearn.neighbors.KNeighborsClassifier,
            n_neighbors=10,
        ),
        train_fn=ObjectDef(
            dryml.models.sklearn.BasicTraining,
            num_examples=500,
            shuffle=True,
            shuffle_buffer_size=5000,
        )
    ),
    best_cat_def,
)

In [7]:
# Now, we can generate, train and test a model.
simple_tf_reqs = {'tf': {}}
temp_model = sklearn_mdl_def.build()
train_model(temp_model, call_context_reqs=simple_tf_reqs)
test_model(temp_model, call_context_reqs=simple_tf_reqs)

2023-03-21 16:29:07.633084: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


0.8209134615384616

## Repeated Model Generation

Now, let's write a function which takes a definition, trains some number of models, tests them, and returns the trained models along with their accuracies. We can then compute the mean and standard deviation of those accuracies.

In [8]:
def train_multiple(model_def=None, num_to_train=None, ctx_reqs=None):
    models = []
    accuracies = []
    for i in range(num_to_train):
        new_model = model_def.build()
        train_model(new_model, call_context_reqs=ctx_reqs)
        acc = test_model(new_model, call_context_reqs=ctx_reqs)
        accuracies.append(acc)
        models.append(new_model)

    return models, accuracies

In [9]:
num_to_train = 5

In [10]:
sklearn_models, sklearn_accuracies = train_multiple(
    model_def=sklearn_mdl_def,
    num_to_train=num_to_train,
    ctx_reqs={'tf': {}})

2023-03-21 16:29:26.460073: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2023-03-21 16:29:39.289135: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2023-03-21 16:29:52.784692: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. I

In [11]:
# compute accuracy mean/stddev
print(f"sklearn accuracy: {np.mean(sklearn_accuracies)}+/-{np.std(sklearn_accuracies)}")

sklearn accuracy: 0.8150641025641026+/-0.005604178925329168
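With only a handful of runs, the choice of standard deviation estimator matters. `np.std` defaults to the population form (dividing by N), while the sample form (dividing by N - 1) gives a slightly larger value. The sketch below uses the standard library's `statistics` module to show both, and adds a small helper for the question "how often does a model reach a given accuracy?". The accuracies are hypothetical, and `fraction_at_least` is an illustrative name, not a DRYML function.

```python
import statistics

# Hypothetical accuracies from five training runs (not the values measured above).
accuracies = [0.81, 0.82, 0.80, 0.83, 0.81]

mean_acc = statistics.mean(accuracies)
population_std = statistics.pstdev(accuracies)  # divides by N, like np.std's default
sample_std = statistics.stdev(accuracies)       # divides by N - 1


def fraction_at_least(values, threshold):
    """Fraction of runs whose accuracy reaches the threshold."""
    return sum(v >= threshold for v in values) / len(values)


# Four of the five hypothetical runs reach 0.81.
share_above = fraction_at_least(accuracies, 0.81)

print(f"mean={mean_acc:.4f} population_std={population_std:.4f} sample_std={sample_std:.4f}")
print(f"share of runs with accuracy >= 0.81: {share_above}")
```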


### Training multiple TensorFlow models

In [12]:
import tensorflow as tf
import dryml.models.tf

In [13]:
# Now the same thing but with tensorflow instead

mdl_def = ObjectDef(
    dryml.models.tf.keras.SequentialFunctionalModel,
    input_shape=(28, 28, 1),
    layer_defs=[
        ['Conv2D', {'filters': 16, 'kernel_size': 3, 'activation': 'relu'}],
        ['Conv2D', {'filters': 16, 'kernel_size': 3, 'activation': 'relu'}],
        ['Flatten', {}],
        ['Dense', {'units': 10, 'activation': 'linear'}],
    ]
)
tf_mdl_def = ObjectDef(
    dryml.models.Pipe,
    ObjectDef(
        dryml.models.tf.keras.Trainable,
        model=mdl_def,
        train_fn=ObjectDef(
            dryml.models.tf.keras.BasicTraining,
            epochs=2
        ),
        optimizer=ObjectDef(
            dryml.models.tf.Wrapper,
            tf.keras.optimizers.Adam,
        ),
        loss=ObjectDef(
            dryml.models.tf.Wrapper,
            tf.keras.losses.SparseCategoricalCrossentropy,
            from_logits=True,
        )
    ),
    ObjectDef(
        dryml.data.transforms.BestCat
    )
)

In [14]:
tf_models, tf_accuracies = train_multiple(
    model_def=tf_mdl_def,
    num_to_train=num_to_train,
    ctx_reqs={'tf': {'gpu/0': 1.}}
)

2023-03-21 16:31:01.497771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13870 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0


Epoch 1/2


2023-03-21 16:31:10.469867: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101


Epoch 2/2


2023-03-21 16:31:26.735944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13867 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0
2023-03-21 16:31:35.243358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13866 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0


Epoch 1/2


2023-03-21 16:31:44.088383: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101


Epoch 2/2


2023-03-21 16:32:00.519170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13865 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0
2023-03-21 16:32:09.001445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13864 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0


Epoch 1/2


2023-03-21 16:32:17.699206: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101


Epoch 2/2


2023-03-21 16:32:33.999434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13865 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0
2023-03-21 16:32:42.845656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13871 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0


Epoch 1/2


2023-03-21 16:32:51.570611: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101


Epoch 2/2


2023-03-21 16:33:07.194587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13865 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0
2023-03-21 16:33:15.539049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13867 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0


Epoch 1/2


2023-03-21 16:33:24.280403: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101


Epoch 2/2


2023-03-21 16:33:40.898165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13874 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0


In [15]:
# compute accuracy mean/stddev
print(f"tf accuracy: {np.mean(tf_accuracies)}+/-{np.std(tf_accuracies)}")

tf accuracy: 0.9697315705128204+/-0.003142798386126868


### Training multiple PyTorch models

And now, let's have a look at a similar PyTorch model. We'll have to add another step to change the order of the data's indices, since PyTorch expects data in NCHW format while TensorFlow uses NHWC format. We'll also have to add a `TorchDevice` transformation to make sure the data is on the CPU.
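
As a small illustration of the index reordering, the sketch below moves the channel axis of a single example from last to first using nested lists. It assumes that `axes=(2, 0, 1)` follows the usual NumPy convention (new axis 0 is old axis 2, and so on) and that the transform is applied per example, without the batch dimension. `hwc_to_chw` and `shape` are hypothetical helpers, not DRYML functions.

```python
def hwc_to_chw(image):
    """Reorder a nested-list image from (height, width, channel) to (channel, height, width)."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# A tiny 2x3 image with 1 channel, standing in for a 28x28x1 MNIST example.
hwc_image = [[[1], [2], [3]],
             [[4], [5], [6]]]
chw_image = hwc_to_chw(hwc_image)

print(shape(hwc_image), "->", shape(chw_image))
```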

In [18]:
import dryml.models.torch
import dryml.data.torch
import torch

In [19]:
mdl_def = ObjectDef(
    dryml.models.torch.generic.Sequential,
    layer_defs=[
        [torch.nn.LazyConv2d, (16, 3), {}],
        [torch.nn.ReLU, (), {}],
        [torch.nn.LazyConv2d, (16, 3), {}],
        [torch.nn.ReLU, (), {}],
        [torch.nn.Flatten, (), {}],
        [torch.nn.LazyLinear, (10,), {}],
    ]
)
torch_mdl_def = ObjectDef(
    dryml.models.Pipe,
    ObjectDef(
        dryml.data.transforms.Transpose,
        axes=(2, 0, 1)
    ),
    ObjectDef(
        dryml.data.transforms.Cast,
        dtype='float32'
    ),
    ObjectDef(
        dryml.models.torch.generic.Trainable,
        model=mdl_def,
        train_fn=ObjectDef(
            dryml.models.torch.generic.BasicTraining,
            optimizer=ObjectDef(
                dryml.models.torch.generic.TorchOptimizer,
                torch.optim.Adam,
                mdl_def,
            ),
            loss=ObjectDef(
                dryml.models.torch.base.Wrapper,
                torch.nn.CrossEntropyLoss
            )
        )
    ),
    ObjectDef(
        dryml.data.torch.transforms.TorchDevice,
        device='cpu'
    ),
    ObjectDef(
        dryml.data.transforms.BestCat
    )
)

In [20]:
torch_models, torch_accuracies = train_multiple(
    model_def=torch_mdl_def,
    num_to_train=num_to_train,
    # We need to provide 'tf' with some resources because the input datasets are in tensorflow.
    ctx_reqs={'tf': {}, 'torch': {'gpu/0': 1.}}
)

100%|██████████| 1875/1875 [00:20<00:00, 93.68it/s, loss=0.00544] 


Epoch 1 - Average Loss: 0.0054381017005371785


100%|██████████| 1875/1875 [00:17<00:00, 105.02it/s, loss=0.00817]


Epoch 1 - Average Loss: 0.008173266277859026


100%|██████████| 1875/1875 [00:17<00:00, 109.48it/s, loss=0.00779]


Epoch 1 - Average Loss: 0.007787893315518158


100%|██████████| 1875/1875 [00:18<00:00, 102.80it/s, loss=0.00901]


Epoch 1 - Average Loss: 0.009005006158843268


100%|██████████| 1875/1875 [00:17<00:00, 104.92it/s, loss=0.00679]


Epoch 1 - Average Loss: 0.006792048954065831




In [21]:
# compute accuracy mean/stddev
print(f"torch accuracy: {np.mean(torch_accuracies)}+/-{np.std(torch_accuracies)}")

torch accuracy: 0.9693509615384615+/-0.00135716223445149
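One way to read the TensorFlow and PyTorch numbers side by side is to compare the gap between the two means with a rough standard error of that gap. The sketch below plugs in the values printed by the cells above. It is a back-of-the-envelope check under the assumption that the runs are independent, not a formal significance test, and with five runs per model the estimate is coarse.

```python
import math

# Means and standard deviations printed in the cells above (five runs each).
n_runs = 5
tf_mean, tf_std = 0.9697315705128204, 0.003142798386126868
torch_mean, torch_std = 0.9693509615384615, 0.00135716223445149

difference = tf_mean - torch_mean
# Rough standard error of the difference between the two means.
standard_error = math.sqrt(tf_std ** 2 / n_runs + torch_std ** 2 / n_runs)

print(f"difference: {difference:.4f}, standard error: {standard_error:.4f}")
if abs(difference) < 2 * standard_error:
    print("The gap is within about two standard errors; these runs do not separate the models.")
else:
    print("The gap exceeds about two standard errors.")
```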


## Wrap-up

We used DRYML's `ObjectDef` to create model templates, generated models from them, and used `context` to train those models in isolated sub-processes, which prevents GPUs from being locked up by one framework. Once the models were trained, we computed statistics about them to learn a little more about how well they perform. A model's performance can depend on any of the components that make up the `Trainable`, including the training function and the optimizer.