In [1]:
import numpy as np
from dryml import Object, save_object, load_object, Repo, Selector
import dill

# DRYML Tutorial 1

The first and most important classes in DRYML are `Object` and `ObjectDef`. These form the foundation on which automatic class serialization is based. `Object`s store the metadata required to recreate themselves. `ObjectDef`s represent *just* that metadata. The following diagram provides a visual representation of these two classes.

<img src="images/Object_1.svg">

## DRYML `Object` Basics

`Object` is the base class for Python objects we want to serialize. The starting point for using DRYML's machinery is to make a class inherit from `Object`. Once a class inherits from `Object`, it remembers how it was created, and can produce an `ObjectDef` of itself. Suppose we have some plain data we want to serialize. Let's create a new `Object` class to house it.

> Caveat: DRYML `Object`s use a special metaclass called `Meta` which handles the saving of constructor arguments. `Meta` handles the creation of proper `__init__` methods which do this, and enforces an order for calling superclass constructors. You should not call a superclass's constructor within your `__init__` methods; `Meta` will handle that.
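
To give a feel for how a metaclass can remember constructor arguments, here is a minimal sketch in plain Python. It is not DRYML's `Meta`: the names `RecordingMeta`, `SketchData`, `saved_args`, `saved_kwargs`, and `sketch_id` are hypothetical, and the sketch does not attempt the superclass constructor ordering mentioned above.

```python
import uuid


class RecordingMeta(type):
    """Hypothetical stand-in for the idea behind `Meta` (not DRYML's code)."""

    def __call__(cls, *args, **kwargs):
        # Give every instance an id unless the caller supplied one.
        kwargs.setdefault('sketch_id', str(uuid.uuid4()))
        obj = cls.__new__(cls)
        # Record the constructor arguments before running __init__.
        obj.saved_args = args
        obj.saved_kwargs = dict(kwargs)
        # Pass on only the arguments the user's __init__ declared.
        user_kwargs = {k: v for k, v in kwargs.items() if k != 'sketch_id'}
        obj.__init__(*args, **user_kwargs)
        return obj


class SketchData(metaclass=RecordingMeta):
    def __init__(self, data):
        pass


d = SketchData([1, 2, 3])
print(d.saved_args)            # ([1, 2, 3],)
print(sorted(d.saved_kwargs))  # ['sketch_id']
```

Because the recording happens in the metaclass's `__call__`, the user's `__init__` stays free of bookkeeping, which is the same division of labour the caveat describes.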

### `Object` definition and construction

In [2]:
# We create a new very simple class just to remember data
# in its arguments
class Data(Object):
    def __init__(self, data):
        pass

In [3]:
# Construct a new instance of Data
imp_obj = Data([1, 2, 3, 4, 5])

Now, `imp_obj` is an `Object` which carries information regarding the arguments used to construct it. This is available in two attributes, `dry_args` which stores the non-keyword arguments used, and `dry_kwargs` which stores the keyword arguments. Let's have a look at those.

In [4]:
# Check if imp_obj remembers its arguments
imp_obj.dry_args

([1, 2, 3, 4, 5],)

In [5]:
# Check if imp_obj remembers its keyword arguments
imp_obj.dry_kwargs

{'dry_id': '892c1223-af85-4771-8abe-2e228f940964',
 'dry_metadata': {'description': '', 'creation_time': 1679341115.453845}}

Interesting! There are keyword arguments here even though we didn't specify any! These are the `Object`'s `dry_id`, which is used to uniquely identify the `Object`, and the `Object`'s `dry_metadata`, meant to store any data you might find useful for later selection. The constructor which `Meta` and `Object` build for you automatically creates the id and populates some simple metadata if you don't specify them explicitly.
> Such uniquely identifying information is necessary when we want to create multiple neural networks which share the same hyperparameters (for example the number of layers, the units or filters per layer, and the training procedure, such as how many epochs to train for). Most neural network frameworks initialize a network's parameters by sampling from a random distribution, so copies of a given network can reach different results after training. Training multiple copies therefore gives us an idea of how reliably a given network trains to a certain performance level, and we can pick out exactly the 'best' network by using its `dry_id`.

### `ObjectDef` - Object Definitions

By calling `.definition()` on an `Object` we get an `ObjectDef` which contains just the metadata about the object. `ObjectDef` acts as a factory through its method `.build()`, which generates new `Object`s matching the definition it contains. It also acts as a minimal metadata 'banner' for an underlying `Object`. This enables the automatic creation of objects without user intervention, as well as searching and matching objects which may contain lots of data without loading the whole object from disk.

Users can create generic `ObjectDef`s without an id, which create new instances of the same underlying object. This is useful for studying robustness in machine learning model training.

Let's take a look at `imp_obj`'s definition.

> Why does DRYML have both `Object` _as well as_ `ObjectDef`? Well, `ObjectDef`s are guaranteed to only contain hyperparameter information. Large datasets will not be passed in through arguments to an `Object`'s constructor, so these large datasets do not exist in an `ObjectDef`. This allows us to be certain we aren't polluting our memory with large objects such as numpy arrays until we are ready to do so.

In [6]:
imp_obj.definition()

{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': '892c1223-af85-4771-8abe-2e228f940964', 'dry_metadata': {'description': '', 'creation_time': 1679341115.453845}}}
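
The factory role of a definition can be illustrated without DRYML. The sketch below is an assumption-laden stand-in: `SketchDef`, `Counter`, and `sketch_id` are hypothetical names, and the real `ObjectDef` may behave differently in its details. It shows a generic definition (no id) producing distinct instances, and a definition with a pinned id reproducing the same identity.

```python
import uuid
from dataclasses import dataclass, field


class Counter:
    # Plain class standing in for an `Object` subclass.
    def __init__(self, start=0, sketch_id=None):
        self.sketch_id = sketch_id or str(uuid.uuid4())
        self.value = start


@dataclass
class SketchDef:
    """Hypothetical definition record: a class plus its constructor arguments."""
    cls: type
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

    def build(self):
        # Re-run the constructor with the remembered arguments.
        return self.cls(*self.args, **self.kwargs)


# A generic definition carries no id, so each build gets a fresh one.
generic = SketchDef(Counter, kwargs={'start': 5})
a, b = generic.build(), generic.build()
print(a.value == b.value, a.sketch_id == b.sketch_id)  # True False

# A definition that pins the id rebuilds an object with the same identity.
pinned = SketchDef(Counter, kwargs={'start': 5, 'sketch_id': a.sketch_id})
print(pinned.build().sketch_id == a.sketch_id)  # True
```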

### `Object` - Serialize/Deserialize

We can now serialize (save) the object to disk, as well as load it from disk. There are multiple ways to do this. Each `Object` implements a `save_self` method which takes a filepath or file-like object to which the `Object` is serialized. DRYML also provides the `save_object` function, which takes any `Object` and saves it to a filepath or file-like object.

Similarly, we can load the object from a filepath or file-like object using the provided `load_object` function.

In [7]:
# We tell the object to save itself to a specific file
imp_obj.save_self('imp_obj.dry')

True

In [8]:
# We can now load a new copy of this object from disk by using the `load_object` method.
new_obj = load_object('imp_obj.dry')

`imp_obj` and `new_obj` are nearly indistinguishable! They contain the same data! Let's look at the first positional argument stored in each object's `dry_args` and see.
> Caveat: While `imp_obj` and `new_obj` are very similar, they are still different objects from python's perspective.

In [9]:
print(imp_obj.dry_args[0])
print(new_obj.dry_args[0])
assert imp_obj.dry_args[0] == new_obj.dry_args[0]

[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]


### `Object` -  Storing data

Now, let's try storing some data which isn't part of the object's hyperparameters. In this case, we need to implement a couple of methods to properly save and load the data. The `save_object_imp` method implements the class's logic for serializing its internal state. Similarly, the `load_object_imp` method implements the class's logic for loading its data from the serialized file. Both methods are given a `zipfile.ZipFile` object in which to store, or from which to load, their data. For now, DRYML serializes data using a zipfile. Let's show the updated class diagram with those new methods.

<img src="images/Object_2.svg">

To give you a better idea of what's going on here, have a look at these two sequence diagrams for an `Object` which the user has inherited from to define new classes. First is a diagram for Load. Notice how, for loading, we first traverse down to the base class `Object`, then load object material progressively, starting with the classes closest to `Object`, since classes higher up in the inheritance chain may require base class objects to be ready before they can load their data.

<img src="images/Object_Load_2.svg">

Second is the diagram for Save. Here, the highest levels in the inheritance hierarchy save their data first.

<img src="images/Object_Save_2.svg">

In [10]:
# Define the new Object type
class Array(Object):
    def __init__(self, array_shape=(32, 32)):
        # Initialize the array in the constructor
        self.data = np.zeros(array_shape)
    
    def save_object_imp(self, file):
        # Here we define how dryml should save data to disk
        with file.open('data.pkl', 'w') as f:
            f.write(dill.dumps(self.data))
        return True

    def load_object_imp(self, file):
        # Here we define how dryml should load data from disk.
        with file.open('data.pkl') as f:
            self.data = dill.loads(f.read())
        return True
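
The `file.open(...)` calls above are ordinary `zipfile.ZipFile` usage. The following standalone sketch shows the same write-then-read round trip on an in-memory archive using only the standard library. It substitutes `pickle` for `dill` so it runs without extra dependencies, and the `payload` dictionary is made up for illustration.

```python
import io
import pickle
import zipfile

payload = {'weights': [0.0, 1.5, 3.0]}

# Write the payload into a member of an in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as zf:
    with zf.open('data.pkl', 'w') as member:
        member.write(pickle.dumps(payload))

# Reopen the archive and read the member back.
buffer.seek(0)
with zipfile.ZipFile(buffer, mode='r') as zf:
    print(zf.namelist())  # ['data.pkl']
    with zf.open('data.pkl') as member:
        restored = pickle.loads(member.read())

assert restored == payload
```

As with any pickle-based format, only load archives from sources you trust, since unpickling can execute arbitrary code.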

The sequence diagrams above show how each of these methods fits into the procedure of saving and loading an `Object`.
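
The save and load ordering described in this section can also be sketched in plain Python. This is only an illustration of the ordering idea, reading the description as "most-derived class saves first, base class loads first". The classes `Base`, `Middle`, `Leaf` and the hook names `save_imp`/`load_imp` are hypothetical, and DRYML's own dispatch may be implemented differently.

```python
class Base:
    def save_imp(self, log):
        log.append('Base')

    def load_imp(self, log):
        log.append('Base')


class Middle(Base):
    def save_imp(self, log):
        log.append('Middle')

    def load_imp(self, log):
        log.append('Middle')


class Leaf(Middle):
    def save_imp(self, log):
        log.append('Leaf')

    def load_imp(self, log):
        log.append('Leaf')


def own_hooks(obj, name):
    # Collect the hook defined directly on each class, most-derived first.
    return [klass.__dict__[name]
            for klass in type(obj).__mro__ if name in klass.__dict__]


def save_all(obj):
    log = []
    for hook in own_hooks(obj, 'save_imp'):
        hook(obj, log)
    return log


def load_all(obj):
    log = []
    # Loading runs in the opposite order so base state exists first.
    for hook in reversed(own_hooks(obj, 'load_imp')):
        hook(obj, log)
    return log


leaf = Leaf()
print(save_all(leaf))  # ['Leaf', 'Middle', 'Base']
print(load_all(leaf))  # ['Base', 'Middle', 'Leaf']
```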

In [11]:
# Create the object
arr_obj = Array(array_shape=(8, 8))

In [12]:
# The object contains a numpy array with the specified shape.
arr_obj.data.shape

(8, 8)

### Modifying object state

We can now modify the state of this object, and save it to disk.

In [13]:
arr_obj.data[0,0] = 50

In [14]:
save_object(arr_obj, 'test_obj.dry')

True

Let's pretend we're trying to load this object from disk now, and check that the correct data gets loaded.

In [15]:
arr_obj_loaded = load_object('test_obj.dry')

In [16]:
assert np.all(arr_obj.data == arr_obj_loaded.data)

Great! Let's take a look at the objects' definitions.

In [17]:
arr_obj.definition()

{'cls': <class '__main__.Array'>, 'dry_mut': False, 'dry_args': (), 'dry_kwargs': {'array_shape': (8, 8), 'dry_id': '05258189-db5f-47cf-b928-57b2f29dc079', 'dry_metadata': {'description': '', 'creation_time': 1679341121.1506178}}}

In [18]:
arr_obj_loaded.definition()

{'cls': <class '__main__.Array'>, 'dry_mut': False, 'dry_args': (), 'dry_kwargs': {'array_shape': (8, 8), 'dry_id': '05258189-db5f-47cf-b928-57b2f29dc079', 'dry_metadata': {'description': '', 'creation_time': 1679341121.1506178}}}

However, constructing these `Object`s from an `ObjectDef` won't recover the data we modified!

In [19]:
new_arr_obj = arr_obj.definition().build()

In [20]:
print(arr_obj.data[0,0])
print(new_arr_obj.data[0,0])

50.0
0.0


### Nested `Object`s

`Object`s can take other `Object`s as arguments. This allows us to build more complex `Object`s out of components which we can reuse for other tasks! Let's create a nested `Object` and see how it works! We'll just store an `Array` object within a `Data` object since both of those have been defined in this session.

In [21]:
arr_data_container = Data(arr_obj)

In [22]:
# Verify the container we wrote has the right value
assert(arr_data_container.dry_args[0].data[0,0] == 50.0)

In [23]:
# Now we can write this container to disk, and reload it.
# Objects are saved recursively within the same file.
arr_data_container.save_self('test2.dry')

True

In [24]:
# Now we can load the object from disk.
# The object is created recursively, first creating the Array object
# then creating the Data object.
arr_data_container_2 = load_object('test2.dry')
assert(arr_data_container_2.dry_args[0].data[0,0] == 50.0)

## `Repo` - The `Object` store, `Selector` - The `Object` finder

A major problem with ML workflows is the management of different versions of trained models. A common scene is a directory filled with sub-directories, each cordoning off models of a certain variety. This often leads to either heavily nested directories, or directory names that are long and convoluted, specifying most properties of the network uniquely for the project a practitioner is working on. DRYML approaches this problem with the `Repo` object, which stores either a reference to where the object is stored on disk (an `ObjectFile`) or the `Object`s themselves within a Python dictionary indexed by each `Object`'s `dry_id`.

By itself, `Repo` isn't super useful beyond managing where `Object`s eventually get written to disk. However, DRYML defines another object, `Selector`. A `Selector` is a callable object which can be passed an `Object`, `ObjectDef`, or `ObjectFile` and says whether it 'matches' the selector's criteria. `Selector` is similar to Python's `slice` object, which acts to grab a subset from an array or other collection object. With `Selector`, `Repo` transforms into an `Object` store from which we can grab specific `Object`s or classes of `Object`s.

Here's a general diagram describing this:

<img src="images/Repo_Selector_1.svg">

Let's see this in action. Let's say we want to store many different kinds of `Array` objects like we defined earlier. But for a given analysis later, we're only interested in `Array`s with a specific shape.

In [25]:
# Create the repo
repo = Repo()

To add an object to the `Repo`, we use the `add_object` method.

In [26]:
# Generate several arrays of different shapes and add them to the store.
num_gen = 10
size_progression = [8, 10, 20, 100]
for s in size_progression:
    array_shape = (s,s)
    for i in range(num_gen):
        obj = Array(array_shape=array_shape)
        obj.data = np.random.random(array_shape)
        repo.add_object(obj)

We can see how many objects are currently stored by calling `len` on the `repo` object.

In [27]:
len(repo)

40

Imagine we were in a new notebook accessing these objects. There are tons of objects, so how do we quickly get the ones we're interested in? That's where the `Selector` comes in. Let's make a `Selector` to get arrays with shape `(20, 20)`, and use the `get` method of `Repo` to grab only those `Object`s matching the `Selector`.

In [28]:
sel = Selector(cls=Array, kwargs={'array_shape': (20, 20)})

In [29]:
selected_objs = repo.get(sel)

In [30]:
# We can check and see we only have objects with shape (20, 20)!
list(map(lambda o: o.data.shape, selected_objs))

[(20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20)]

Great! So we can get specific `Object`s matching the selector! Notice, however, that I didn't specify any `args`, and I also didn't specify `dry_id` in `kwargs`. This is because the `Selector` doesn't attempt to match keys which are missing from the `Selector`. This means that when we don't specify `dry_id`, it will return all `Object`s matching the other parts of the `Selector`'s requirements.
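
The "only compare the keys that were given" rule is easy to see in a small standalone sketch. `SketchSelector`, `Grid`, and the dictionary layout below are hypothetical and chosen for illustration. DRYML's `Selector` accepts richer inputs and may compare classes differently (this sketch uses an exact class match).

```python
class SketchSelector:
    """Hypothetical matcher: only the keys it was given are compared."""

    def __init__(self, cls=None, kwargs=None):
        self.cls = cls
        self.kwargs = kwargs or {}

    def __call__(self, definition):
        # `definition` is a plain dict with 'cls' and 'kwargs' entries.
        if self.cls is not None and definition['cls'] is not self.cls:
            return False
        return all(
            key in definition['kwargs'] and definition['kwargs'][key] == value
            for key, value in self.kwargs.items()
        )


class Grid:
    pass


defs = [
    {'cls': Grid, 'kwargs': {'shape': (8, 8), 'sketch_id': 'a'}},
    {'cls': Grid, 'kwargs': {'shape': (20, 20), 'sketch_id': 'b'}},
    {'cls': Grid, 'kwargs': {'shape': (20, 20), 'sketch_id': 'c'}},
]

by_shape = SketchSelector(cls=Grid, kwargs={'shape': (20, 20)})
by_id = SketchSelector(kwargs={'sketch_id': 'b'})

print([d['kwargs']['sketch_id'] for d in defs if by_shape(d)])  # ['b', 'c']
print([d['kwargs']['sketch_id'] for d in defs if by_id(d)])     # ['b']
```

Leaving a key out widens the match, while adding the id narrows it to a single entry.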

Now, suppose we know the `dry_id` of the specific object we're interested in. We can create a `Selector` which will only match the object with that `dry_id`. We can also use the `get_obj_by_id` method of `repo` and supply the `dry_id` directly.

In [31]:
print(id(selected_objs[0]))
id_of_interest = selected_objs[0].dry_id
id_of_interest

140269816539312


'1cf7f00f-30f3-4d3a-8201-358c0e229b09'

In [32]:
# We can create a selector which only matches on `dry_id`. We look at the Python id of the object
# to verify it's the same object.
specific_selector = Selector(None, kwargs={'dry_id': id_of_interest})
id(repo.get(specific_selector))

140269816539312

In [33]:
# We can also try to get it directly if we have an id.
id(repo.get_obj_by_id(id_of_interest))

140269816539312
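
The matching Python ids above are what one would expect from a store that keeps objects in a dictionary keyed by id: every lookup hands back the same in-memory object. The sketch below illustrates that idea only. `SketchRepo`, `Item`, and `sketch_id` are hypothetical names, and the real `Repo` also tracks objects on disk.

```python
import uuid


class Item:
    def __init__(self, shape):
        self.sketch_id = str(uuid.uuid4())
        self.shape = shape


class SketchRepo:
    """Hypothetical in-memory store keyed by each item's id."""

    def __init__(self):
        self._items = {}

    def add(self, item):
        self._items[item.sketch_id] = item

    def __len__(self):
        return len(self._items)

    def get_by_id(self, sketch_id):
        return self._items[sketch_id]

    def get(self, predicate):
        # Return every stored item the predicate accepts.
        return [item for item in self._items.values() if predicate(item)]


store = SketchRepo()
for shape in [(8, 8), (20, 20), (20, 20)]:
    store.add(Item(shape))

matches = store.get(lambda item: item.shape == (20, 20))
first = matches[0]
print(len(store), len(matches))                   # 3 2
print(store.get_by_id(first.sketch_id) is first)  # True
```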

## Wrap-up

The functionality of `Object`, `ObjectDef`, `Repo`, and `Selector` discussed here forms the core of the DRYML library. All `Object`s track their constructor parameters (a.k.a. hyperparameters), allowing reconstruction of the object without user intervention. `ObjectDef`s give the user a 'factory' system for creating new `Object`s matching a certain set of hyperparameters. `Repo`s and `Selector`s give the user the power to manage numerous `Object`s in a sane and coherent manner.

Importantly, the functionality of `Object` and all of its friends is independent of ML, and can be used in other contexts just as easily! Most components of DRYML are meant to be usable outside of the patterns described for common use in DRYML.