In [1]:
import numpy as np
from dryml import Object, save_object, load_object, Repo, Selector
import dill

# DRYML Tutorial 1

## DRYML `Object` Basics

The `Object` is the base class for Python objects we want to serialize. Suppose we have some plain data we want to serialize. Let's create a new `Object` class to house it.

> Caveat: DRYML `Object`s use a special metaclass called `Meta` which handles the saving of constructor arguments. `Meta` creates the proper `__init__` methods to do this and enforces an order for calling superclass constructors. You should not call a superclass's constructor within your `__init__` methods; `Meta` will handle that.

### `Object` definition and construction

In [None]:
class Data(Object):
    def __init__(self, data):
        pass

In [4]:
imp_obj = Data([1, 2, 3, 4, 5])

Now, `imp_obj` is an `Object` which carries information about the arguments used to construct it. This is available in two attributes: `dry_args`, which stores the non-keyword arguments, and `dry_kwargs`, which stores the keyword arguments. Let's have a look at those.

In [5]:
imp_obj.dry_args

([1, 2, 3, 4, 5],)

In [6]:
imp_obj.dry_kwargs

{'dry_id': '252fa7fe-8fdf-45de-beaa-df9e4553e1e6'}

Interesting! There's a keyword argument here even though we didn't specify any! That's the `Object`'s `dry_id`, which is used to uniquely identify the `Object`. The constructor which `Meta` and `Object` build for you automatically creates such an id if you don't specify one directly.
> Such uniquely identifying information is necessary when we want to create multiple neural networks which have the same hyperparameters (for example the number of layers, the units/filters per layer, and the training procedure, such as how many epochs to train for). Most neural network frameworks initialize a network's parameters by sampling from a random distribution, so copies of a given network can achieve different results after training. Training multiple copies therefore gives us an idea of how reliably a given network trains to a certain performance level, and once we have trained them, we can pick out exactly the 'best' network by using its `dry_id`.
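
To make the idea of recorded constructor arguments concrete, here is a minimal sketch in plain Python. It is not DRYML's implementation: `RecordingMeta` and `SketchData` are hypothetical names, and the sketch only assumes that a metaclass can intercept construction, store the arguments, and generate an id when none is supplied.

```python
import uuid


class RecordingMeta(type):
    """Hypothetical metaclass (not DRYML's `Meta`): records constructor arguments."""

    def __call__(cls, *args, dry_id=None, **kwargs):
        # Create the instance without running __init__ yet.
        obj = cls.__new__(cls)
        # Store the arguments before the user's __init__ runs.
        obj.dry_args = args
        obj.dry_kwargs = dict(kwargs)
        # Generate an identifier when the caller did not supply one.
        obj.dry_kwargs['dry_id'] = dry_id if dry_id is not None else str(uuid.uuid4())
        # Run the user's __init__ with the original arguments only.
        obj.__init__(*args, **kwargs)
        return obj


class SketchData(metaclass=RecordingMeta):
    def __init__(self, data):
        self.data = data


first = SketchData([1, 2, 3])
second = SketchData([1, 2, 3])                       # same arguments, new id
pinned = SketchData([1, 2, 3], dry_id='my-fixed-id')  # caller-chosen id

print(first.dry_args)
print(pinned.dry_kwargs)
```

Two objects built from identical arguments still receive different ids in this sketch, which is what lets otherwise identical models be told apart.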

### `ObjectDef` - Object Definitions

Why would we need this information? It allows objects to be created automatically, without user intervention. DRYML `Object`s have a method called `definition`. This method produces an `ObjectDef` object which contains all the information necessary to build a new `Object` nearly identical to the given `Object`. An `ObjectDef` also lets the user write a generic object definition. The user can then call the `build` method on that definition, and DRYML will construct a new object matching it. Let's take a look at `imp_obj`'s definition.

> Why does DRYML have both `Object` _as well as_ `ObjectDef`? Well, `ObjectDef`s are guaranteed to contain only hyperparameter information. Any datasets contained in the eventual object do not exist yet. This allows us to be certain we aren't polluting our memory with large objects such as numpy arrays until we are ready to do so.

In [7]:
imp_obj.definition()

{'cls': <class '__main__.Data'>, 'dry_mut': False, 'dry_args': ([1, 2, 3, 4, 5],), 'dry_kwargs': {'dry_id': '252fa7fe-8fdf-45de-beaa-df9e4553e1e6'}}
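
The factory idea behind a definition can be sketched without DRYML. The classes below (`SketchDef`, `Grid`) are hypothetical stand-ins, assuming only that a definition holds a class plus constructor arguments and that `build` replays the constructor call.

```python
class SketchDef:
    """Hypothetical stand-in for an object definition: a class plus its arguments."""

    def __init__(self, cls, args=(), kwargs=None):
        self.cls = cls
        self.args = tuple(args)
        self.kwargs = dict(kwargs or {})

    def build(self):
        # Building only replays the constructor call; nothing is allocated before this.
        return self.cls(*self.args, **self.kwargs)


class Grid:
    def __init__(self, shape=(2, 2)):
        self.shape = shape
        # The (potentially large) data only comes into existence at construction time.
        self.cells = [[0] * shape[1] for _ in range(shape[0])]


grid_def = SketchDef(Grid, kwargs={'shape': (3, 4)})
grid_a = grid_def.build()
grid_b = grid_def.build()

print(grid_a.shape, grid_b.shape)
print(grid_a is grid_b)
```

The definition itself stays small no matter how large the built object is, which is the memory argument made above.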

### `Object` - Serialize/Deserialize

We can now serialize (save) it to disk, as well as load it from disk. There are multiple ways to do this. Each `Object` implements a `save_self` method which takes a filepath or file-like object to which the `Object` is serialized. DRYML also provides the `save_object` method, which takes any `Object` and saves it to a filepath or file-like object.

Similarly, we can load the object from a filepath or file-like object using the provided `load_object` method.

In [8]:
# We tell the object to save itself to a specific file
imp_obj.save_self('imp_obj.dry')

True

In [9]:
# We can now load a new copy of this object from disk by using the `load_object` method.
new_obj = load_object('imp_obj.dry')

`imp_obj` and `new_obj` are nearly indistinguishable! They contain the same data! Let's compare the constructor argument each object recorded in `dry_args` and see.
> Caveat: While `imp_obj` and `new_obj` are very similar, they are still different objects from python's perspective.

In [10]:
print(imp_obj.dry_args[0])
print(new_obj.dry_args[0])
assert imp_obj.dry_args[0] == new_obj.dry_args[0]

[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]


### `Object` -  Storing data

Now, let's try storing some data which isn't part of the object's hyperparameters. In this case, we need to implement a couple of methods to properly save and load the data. The `save_object_imp` method implements the class's logic for serializing its internal state. Similarly, the `load_object_imp` method implements the class's logic for loading its data from the serialized file. Both methods are given a `zipfile.ZipFile` object in which to store, or from which to load, the data. For now, DRYML serializes data using a zipfile.

In [11]:
# Define the new Object type
class Array(Object):
    def __init__(self, array_shape=(32, 32)):
        self.data = np.zeros(array_shape)
    
    def save_object_imp(self, file):
        with file.open('data.pkl', 'w') as f:
            f.write(dill.dumps(self.data))
        return True

    def load_object_imp(self, file):
        with file.open('data.pkl') as f:
            self.data = dill.loads(f.read())
        return True
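
The two methods above rely on the standard library's `zipfile.ZipFile.open`, which returns a binary file handle for a single archive member. The following standalone sketch shows the same write-then-read pattern on an in-memory archive. It uses `pickle` in place of `dill` to stay within the standard library, and the `payload` dictionary is made-up example data.

```python
import io
import pickle
import zipfile

payload = {'weights': [0.1, 0.2, 0.3]}  # made-up example data

# Write: open one archive member in binary write mode and store the bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    with archive.open('data.pkl', 'w') as member:
        member.write(pickle.dumps(payload))

# Read: reopen the archive and load the member back.
buffer.seek(0)
with zipfile.ZipFile(buffer, mode='r') as archive:
    member_names = archive.namelist()
    with archive.open('data.pkl') as member:
        restored = pickle.loads(member.read())

print(member_names)
print(restored == payload)
```

Because each class writes to its own named member, several pieces of state can live side by side in one archive.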

In [12]:
# Create the object
arr_obj = Array(array_shape=(8, 8))

In [13]:
# The object contains a numpy array with the specified shape.
arr_obj.data.shape

(8, 8)

### Modifying object state

We can now modify the state of this object, and save it to disk.

In [14]:
arr_obj.data[0,0] = 50

In [15]:
save_object(arr_obj, 'test_obj.dry')

True

Let's pretend we're trying to load this object from disk now, and check that the correct data gets loaded.

In [16]:
arr_obj_loaded = load_object('test_obj.dry')

In [18]:
assert np.all(arr_obj.data == arr_obj_loaded.data)

Great! Let's take a look at the two objects' definitions.

In [19]:
arr_obj.definition()

{'cls': <class '__main__.Array'>, 'dry_mut': False, 'dry_args': (), 'dry_kwargs': {'array_shape': (8, 8), 'dry_id': 'b87ed684-45e1-4e1a-8e3a-724fcb57387d'}}

In [20]:
arr_obj_loaded.definition()

{'cls': <class '__main__.Array'>, 'dry_mut': False, 'dry_args': (), 'dry_kwargs': {'array_shape': (8, 8), 'dry_id': 'b87ed684-45e1-4e1a-8e3a-724fcb57387d'}}

However, constructing these `Object`s from an `ObjectDef` won't recover the data we modified!

In [21]:
new_arr_obj = arr_obj.definition().build()

In [22]:
print(arr_obj.data[0,0])
print(new_arr_obj.data[0,0])

50.0
0.0
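
The difference between rebuilding and loading can be sketched in plain Python. `Counter` and its `rebuild`, `snapshot`, and `restore` methods are hypothetical names chosen for illustration; the sketch assumes only that a rebuild replays the constructor while a saved file also carries the state.

```python
class Counter:
    def __init__(self, start=0):
        self.start = start   # hyperparameter: recorded in the definition
        self.value = start   # state: may change after construction

    def rebuild(self):
        # Replays only the constructor call, like building from a definition.
        return Counter(start=self.start)

    def snapshot(self):
        # Captures constructor arguments and current state, like a saved file.
        return {'kwargs': {'start': self.start}, 'state': self.value}

    @staticmethod
    def restore(snap):
        # Reconstruct from the arguments, then put the saved state back.
        obj = Counter(**snap['kwargs'])
        obj.value = snap['state']
        return obj


original = Counter(start=0)
original.value = 50

rebuilt = original.rebuild()
restored = Counter.restore(original.snapshot())

print(rebuilt.value)
print(restored.value)
```

In this sketch only the restored copy sees the modified value, mirroring the behaviour of `load_object` versus `definition().build()` shown above.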


## `Repo` - The `Object` store, `Selector` - The `Object` finder

A major problem with ML workflows is the management of different versions of trained models. A common scene is a directory filled with sub-directories, each cordoning off models of a certain variety. This often leads to either heavily nested directories or long, convoluted directory names that spell out most properties of the network in a way unique to the project a practitioner is working on. DRYML approaches this problem with the `Repo` object, which stores either a reference to where the object is stored on disk (an `ObjectFile`) or the `Object`s themselves within a python dictionary indexed by each `Object`'s `dry_id`.

By itself, `Repo` isn't super useful beyond managing where `Object`s eventually get written to disk. However, DRYML defines another object, `Selector`. A `Selector` is a callable object which can be passed an `Object`, `ObjectDef`, or `ObjectFile` and says whether it 'matches' the selector's criteria. `Selector` is similar to python's `slice` object, which grabs a subset of an array or other collection. With `Selector`, `Repo` transforms into an `Object` store from which we can grab specific `Object`s or classes of `Object`s.

Let's see this in action. Let's say we want to store many different kinds of `Array` objects like we defined earlier. But for a given analysis later, we're only interested in `Array`s with a specific shape.

In [23]:
# Create the repo
repo = Repo()

To add an object to the `Repo`, we use the `add_object` method.

In [24]:
# Generate several arrays of different shapes and add them to the store.
num_gen = 10
size_progression = [8, 10, 20, 100]
for s in size_progression:
    array_shape = (s,s)
    for i in range(num_gen):
        obj = Array(array_shape=array_shape)
        obj.data = np.random.random(array_shape)
        repo.add_object(obj)

We can see how many objects are currently stored by calling `len` on the `repo` object.

In [25]:
len(repo)

40

Imagine we were in a new notebook accessing these objects. There are tons of objects, so how do we quickly get the ones we're interested in? That's where the `Selector` comes in. Let's make a `Selector` to get arrays with shape `(20, 20)`, and use the `get` method of `Repo` to grab only those `Object`s matching the `Selector`.

In [26]:
sel = Selector(cls=Array, kwargs={'array_shape': (20, 20)})

In [27]:
selected_objs = repo.get(sel)

In [28]:
# We can check and see we only have objects with shape (20, 20)!
list(map(lambda o: o.data.shape, selected_objs))

[(20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20),
 (20, 20)]

Great! So we can get specific `Object`s matching the selector! Notice, however, that I didn't specify any `args`, and I also didn't specify `dry_id` in `kwargs`. This is because the `Selector` doesn't attempt to match keys which are missing from the `Selector`. This means that when we don't specify `dry_id`, it will return all `Object`s matching the other parts of the `Selector`'s requirements.
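
The "missing keys are not matched" rule can be sketched as a plain function over dictionaries shaped like the `definition()` output shown earlier. The `matches` function and the `Shape` placeholder class are hypothetical and only illustrate the rule; they are not how DRYML's `Selector` is implemented.

```python
def matches(definition, cls=None, args=None, kwargs=None):
    """Hypothetical matcher: criteria left as None, or keys left out, are not checked."""
    if cls is not None and definition['cls'] is not cls:
        return False
    if args is not None and tuple(definition['dry_args']) != tuple(args):
        return False
    for key, wanted in (kwargs or {}).items():
        stored = definition['dry_kwargs']
        # Only keys named in the criteria are compared.
        if key not in stored or stored[key] != wanted:
            return False
    return True


class Shape:  # placeholder class for the example
    pass


defs = [
    {'cls': Shape, 'dry_args': (), 'dry_kwargs': {'array_shape': (8, 8), 'dry_id': 'a'}},
    {'cls': Shape, 'dry_args': (), 'dry_kwargs': {'array_shape': (20, 20), 'dry_id': 'b'}},
    {'cls': Shape, 'dry_args': (), 'dry_kwargs': {'array_shape': (20, 20), 'dry_id': 'c'}},
]

# No dry_id in the criteria: every definition with the right shape matches.
by_shape = [d['dry_kwargs']['dry_id'] for d in defs
            if matches(d, cls=Shape, kwargs={'array_shape': (20, 20)})]

# Only dry_id in the criteria: class and shape are not checked at all.
by_id = [d['dry_kwargs']['dry_id'] for d in defs
         if matches(d, kwargs={'dry_id': 'c'})]

print(by_shape)
print(by_id)
```

Leaving a criterion out widens the match, and adding one narrows it, so the same mechanism serves both broad queries and lookups of a single object.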

Now, suppose we know the `dry_id` of the specific object we're interested in. We can create a `Selector` which will only match the object with that `dry_id`. We can also use the `get_obj_by_id` method of `repo` and supply the `dry_id` directly.

In [29]:
print(id(selected_objs[0]))
id_of_interest = selected_objs[0].dry_id
id_of_interest

140467178754640


'c15b7701-3346-4c20-b0f9-171e85decafc'

In [30]:
# We can create a selector which will only match against `dry_id`. We look at the python id
# of the returned object to verify it's the same object.
specific_selector = Selector(None, kwargs={'dry_id': id_of_interest})
id(repo.get(specific_selector))

140467178754640

In [31]:
# We can also get it directly if we have its `dry_id`.
id(repo.get_obj_by_id(id_of_interest))

140467178754640

## Wrap-up

The functionality of `Object`, `ObjectDef`, `Repo`, and `Selector` discussed here forms the core of the DRYML library. All `Object`s track their constructor parameters (a.k.a. hyperparameters), allowing reconstruction of the object without user intervention. `ObjectDef`s give the user a 'factory' system for creating new `Object`s matching a certain set of hyperparameters. `Repo`s and `Selector`s give the user the power to manage numerous `Object`s in a sane and coherent manner.

Importantly, the functionality of `Object` and all of its friends is independent of ML, and can be used in other contexts just as easily! Most components of DRYML are meant to be usable outside of the patterns described for common use in DRYML.