In [7]:
import dryml
from dryml.data import Dataset
import numpy as np

# DRYML Tutorial 2

## DRYML `Dataset` and `context` basics

Modern ML platforms suffer from two fairly annoying issues. First, iterating through datasets is in no way standardized: each platform offers its own version of dataset ingestion, batching, and iteration. DRYML attempts to remedy this with a wrapper class called `Dataset`, which implements a uniform set of functionality for two major ML platforms (tensorflow and pytorch) as well as generic numpy datasets.

Second, compute resources are often difficult to configure. By default, most platforms just allocate an entire GPU. DRYML attempts to remedy this in two ways. First, it provides a primitive compute `context` system through which methods requiring compute resources can request them. `context` can then launch a separate python process to contain compute operations, which allows device memory to be released when the method completes. Second, `Object` supports a 'compute' mode, in which the user specifies any objects that require memory allocation on a device such as a GPU.

We'll go over the basic functionality of each of these components in this tutorial.

## DRYML `Dataset` basics and `NumpyDataset`

The `Dataset` API is an attempt to standardize ML dataset interaction. It takes a functional-style approach, borrowing much from `tensorflow`'s `tf.data.Dataset` type. We'll explore the `Dataset` type by loading a generated data sample into a `NumpyDataset` wrapper. `NumpyDataset` implements most of the `Dataset` API, and allows easy use of the data as well as common ML operations on it.

### Creating `NumpyDataset`

In [9]:
from dryml.data import NumpyDataset

In [8]:
# Create a random numpy dataset.
num_examples = 10000
data_shape = (28, 28)
data_np = np.random.random((num_examples,)+data_shape)

In [14]:
# Create a NumpyDataset from the numpy array. supervised=False is necessary
# because this dataset has no supervised targets.
data_ds = NumpyDataset(data_np, supervised=False)

### `Dataset.peek`

Now we can take advantage of the `Dataset` API. First, we'll introduce the very useful method `peek`, which simply returns the first element of the `Dataset`. If the data is batched, it returns the first batch; if not, it returns the first single example. Let's check the shape of the first element of this dataset.

In [16]:
data_ds.peek().shape

(10000, 28, 28)

### `Dataset.batch` and `Dataset.unbatch`

Great! That's what we put into the dataset! By default, the whole array is treated as one giant batch. So what if we don't want that? We have the `unbatch` and `batch` methods: we can first `unbatch` the dataset, then `batch` it with the appropriate `batch_size`.

In [17]:
# Unbatch, then rebatch with new batch size
batch_size = 32
batched_ds = data_ds.unbatch().batch(batch_size=batch_size)

In [18]:
# Take a peek at the new dataset object's element shape.
batched_ds.peek().shape

(32, 28, 28)

We now see the dataset gives batches with a `batch_size` of 32! What if we want to look at one single example? We just leave off the last `batch` call. Notice that the result of each method call is a new `Dataset` object. This means we can re-use each one as often as we like, and changes to a dataset don't destroy the original.

In [19]:
data_ds.unbatch().peek().shape

(28, 28)

So the shape of a single example is what we intended at the start!
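
The semantics of `unbatch` followed by `batch` can be sketched in plain Python. This is only an illustration of the idea using generators, not DRYML's implementation; the helper functions `unbatch` and `batch` below are hypothetical stand-ins, and whether DRYML keeps a final partial batch is not something this sketch settles.

```python
from itertools import islice


def unbatch(batches):
    """Yield individual examples from an iterable of batches."""
    for one_batch in batches:
        yield from one_batch


def batch(examples, batch_size):
    """Group consecutive examples into lists of at most batch_size items."""
    it = iter(examples)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


# One "giant batch" holding 10 examples, mirroring how data_ds starts out.
giant = [list(range(10))]

# Flatten to single examples, then regroup into batches of 4.
rebatched = list(batch(unbatch(giant), batch_size=4))
print(rebatched)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because each helper returns a fresh iterator over its input, the original `giant` list is left untouched, which matches the non-destructive behaviour described above.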

### `Dataset.take` and `Dataset.skip`

`Dataset` implements the `take` and `skip` methods as well. These are very useful for getting a limited set of data to interact with when the dataset is infinite or just very large. `Dataset` also provides the method `count`, which literally counts the elements in the `Dataset` by iterating through them. (This is in lieu of a better heuristic method we are working on.)

In [23]:
data_ds.unbatch().take(10).count()

10
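
To see why `take` matters for infinite data, here is a rough plain-Python analogue built on `itertools`. The functions `take`, `skip`, and `count_elements` are hypothetical stand-ins written for this sketch; they illustrate the behaviour described above and are not DRYML's code.

```python
from itertools import count, islice


def take(iterable, n):
    """Yield at most the first n elements."""
    return islice(iterable, n)


def skip(iterable, n):
    """Drop the first n elements and yield the rest."""
    return islice(iterable, n, None)


def count_elements(iterable):
    """Count elements by iterating through all of them."""
    return sum(1 for _ in iterable)


# count() never ends, so it must be limited with take before counting.
window = list(take(skip(count(), 5), 3))
print(window)  # [5, 6, 7]

n = count_elements(take(count(), 10))
print(n)  # 10
```

Counting by iteration never finishes on an infinite source, which is why the limit has to be applied first.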

### `Dataset` iteration

`Dataset`s are python iterables! Let's have a look at that now!

In [25]:
# Iterate through the unbatched data and show the 0,0'th element
for el in data_ds.unbatch().take(10):
    print(el[0,0])

0.6099607496776515
0.8018996842937309
0.8826141329289703
0.7928764569998471
0.3964297929012688
0.9348840216061983
0.7806513078157304
0.4423940369166266
0.9003693292511876
0.9527959674923151


### `Dataset` - Supervised, Unsupervised, and Indexed datasets

`Dataset` currently supports several types of datasets. The most common differences between two `Dataset`s are whether they are supervised and whether they are indexed. Let's consider non-indexed `Dataset`s first. If such a `Dataset` is unsupervised, each element is just the `X` value (the value we would pass to a model). If it is supervised, each element is a tuple `(X, Y)`, where `Y` holds the known labels. If the dataset is batched, `X` and `Y` are batches as well. If the dataset is indexed, the index data for each element appears as the first element of a tuple: `(I, X)` for unsupervised data and `(I, (X, Y))` for supervised data, where `I` is the index for the element. Thus, to get the index for indexed data, we access `el[0]`, where `el` is the data element.
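
The four element layouts can be summarized with a small helper. This is a sketch for illustration only: `split_element` is a hypothetical function, not part of DRYML, and it assumes the caller already knows whether the dataset is indexed and supervised.

```python
def split_element(el, indexed, supervised):
    """Return (I, X, Y) from a dataset element, using None for absent parts."""
    index = None
    if indexed:
        # Indexed elements look like (I, X) or (I, (X, Y)).
        index, el = el
    if supervised:
        # Supervised elements look like (X, Y).
        x, y = el
    else:
        x, y = el, None
    return index, x, y


# Placeholder strings stand in for real arrays.
print(split_element("X", indexed=False, supervised=False))          # (None, 'X', None)
print(split_element(("X", "Y"), indexed=False, supervised=True))    # (None, 'X', 'Y')
print(split_element(("I", "X"), indexed=True, supervised=False))    # ('I', 'X', None)
print(split_element(("I", ("X", "Y")), indexed=True, supervised=True))  # ('I', 'X', 'Y')
```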

Let's have a look at how this works for a supervised dataset. We'll create a new supervised dataset by generating random data.

In [35]:
# Creating the new supervised data
num_examples = 10000
x_shape = (20, 5)
num_classes = 5
x_data = np.random.random((num_examples,)+x_shape)
y_data = np.random.choice(list(range(num_classes)), size=(num_examples,))

In [39]:
# Create NumpyDataset
ds = NumpyDataset((x_data, y_data), supervised=True)

We can verify this data has the right shape:

In [44]:
print(f"X shape: {ds.peek()[0].shape}")
print(f"Y shape: {ds.peek()[1].shape}")

X shape: (10000, 20, 5)
Y shape: (10000,)


We can index this dataset using the `as_indexed` method.

In [48]:
indexed_ds = ds.as_indexed()
first_el = indexed_ds.peek()
print(f"I shape: {first_el[0].shape}")
print(f"X shape: {first_el[1][0].shape}")
print(f"Y shape: {first_el[1][1].shape}")

I shape: (10000,)
X shape: (10000, 20, 5)
Y shape: (10000,)


By default, `as_indexed` numbers the examples starting from 0. We can look at the indices of the first 10 elements using `unbatch`, `take`, and the `index` method, which returns an iterable that yields only the index of each element.

In [54]:
for el in indexed_ds.unbatch().take(10).index():
    print(el)

0
1
2
3
4
5
6
7
8
9


We can also remove the supervised data from `ds` using the `as_not_supervised` method.

In [56]:
ds.as_not_supervised().peek().shape

(10000, 20, 5)

### `Dataset` - `map`, `map_el`, `apply`, `apply_X`, and `apply_Y`

`Dataset` supports mapping of a function across all elements of the dataset. This is useful for applying transformations to the dataset, and other components of DRYML operating on datasets use these methods in their implementations.

* `map` applies a given function to all content in the `Dataset`.
* `apply` applies a given function to all `X` and `Y` content in the `Dataset`.
* `apply_X` applies a given function only to the `X` dataset in a `Dataset`.
* `apply_Y` applies a given function only to the `Y` dataset in a `Dataset`.
* `map_el` is a special function. You may have noticed that every element yielded by a dataset is either a single piece of data or a tuple of data. `map_el` applies a given function to every primitive piece of data within supported collections such as tuples. So if the element is `(d1, (d2, d3))`, this will give `(f(d1), (f(d2), f(d3)))`.
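
The recursive behaviour described for `map_el` can be sketched in a few lines of plain Python. This is an illustration of the idea and not DRYML's implementation; it assumes tuples are the only collection type, whereas DRYML may support others.

```python
def map_el(f, el):
    """Apply f to every non-tuple leaf of a (possibly nested) tuple."""
    if isinstance(el, tuple):
        # Recurse into the collection and rebuild it with the same structure.
        return tuple(map_el(f, item) for item in el)
    # A primitive piece of data: apply the function directly.
    return f(el)


result = map_el(lambda d: d * 10, (1, (2, 3)))
print(result)  # (10, (20, 30))
```

The nesting of the input is preserved, so an indexed supervised element `(I, (X, Y))` would come back with the same `(I, (X, Y))` layout.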

Let's try `apply_X`

In [59]:
# We apply a function to the Dataset
el_sq = ds.apply_X(lambda x: x**2).peek()[0]
# We can check that the function was applied with an assert
el = ds.peek()[0]
assert np.all(el_sq == el**2)

## DRYML context basics

DRYML implements 'compute contexts' for specific ML frameworks. Resources for each context are requested under that context's keyword: `'tf'` for tensorflow, `'torch'` for pytorch, and `'default'` for the default context. We can then build up a dictionary of 'resource requests' like so:

In [2]:
ctx_reqs = {
    'default': {'num_gpus': 0},
    'tf': {'num_gpus': 1},
    'torch': {'num_gpus': 0},
}

A resource request is a dictionary whose keys signal a request for specific resources. Right now we can ask for a number of cpus or gpus with `num_cpus` and `num_gpus`. We can also ask for specific cpus and gpus with `cpu/<i>` and `gpu/<i>`, each with a float value between `0.` and `1.`. If you request a fraction of a gpu, DRYML will configure the corresponding framework for that when possible.

Thus, the above context requirements ask for tensorflow with one gpu, and for torch and the default context with no gpus.
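
As an illustration of the `gpu/<i>` form, a request for half of one GPU might look like the dictionary below. The exact key spelling (`'gpu/0'`) and the variable name `fractional_reqs` are assumptions based on the description above and have not been checked against DRYML itself; the loop is only a sanity check on the values, not a DRYML call.

```python
# Hypothetical request: half of GPU 0 for tensorflow, no GPUs for the others.
fractional_reqs = {
    'default': {'num_gpus': 0},
    'tf': {'gpu/0': 0.5},  # assumed key format: 'gpu/<i>' with a fraction
    'torch': {'num_gpus': 0},
}

# Sanity check: every per-device fraction must lie between 0. and 1.
for framework, request in fractional_reqs.items():
    for key, value in request.items():
        if key.startswith(('cpu/', 'gpu/')):
            assert 0.0 <= value <= 1.0, f"{framework}: {key} out of range"
```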

With the `ctx_reqs` dictionary, DRYML will create a `ContextManager`, which will attempt to create appropriate contexts with the correct resources. If successful, the user will have access to the necessary GPUs, and the corresponding libraries will be configured for the requested devices (if possible).

> Be aware that most frameworks currently have no way of enforcing limits on GPU memory consumption. This means the user is trusted to adhere to the memory requirements, which DRYML makes available at all times through the `dryml.get_context()` method, which returns the current `ContextManager`.

If the user wants their objects to avoid allocating memory on a device, they can simply not set a context, and if a context is required, DRYML will throw an exception.

For the rest of this tutorial, we'll set the compute context using the resources above.

In [4]:
dryml.context.set_context(ctx_reqs)

2022-09-23 10:13:00.080735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5780 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1


If there is code you suspect may require device memory, DRYML provides the `context_check` method to check whether the current context satisfies some resource constraints. Let's check if the current context has two GPUs allocated to tensorflow. (This should fail!)

In [5]:
dryml.context.context_check({'tf': {'num_gpus': 2}})

ContextIncompatibilityError: Context doesn't satisfy requirements {'tf': {'num_gpus': 2}}

And we'll just double check that the current context satisfies the requirements we set out earlier:

In [6]:
dryml.context.context_check(ctx_reqs)

## DRYML `Dataset` - ML Framework transformations

DRYML `Dataset` comes with a special power: `Dataset`s can implement transformations into the dataset types of other ML frameworks, such as tensorflow. The `NumpyDataset` class supports the method `tf`, which creates a new `tf.data.Dataset`, wrapped in a `TFDataset`, containing the data from the original `NumpyDataset`. This is very useful for moving data into tensorflow tensors to be used in tensorflow models. Since we now have a context which supports tensorflow, we can proceed with this transformation.

> This type of transformation or creation of `tf.data.Dataset`s is one good use for the `context_check` method we just saw.

In [64]:
tf_ds = ds.tf()
# We can now peek at the first element of the new dataset and see its type.
type(tf_ds.peek()[0])

tensorflow.python.framework.ops.EagerTensor

We can see it's now a tensorflow `EagerTensor` (it may also be just a `Tensor`).

Other transformations may exist as well. The `torch` method turns the data into pytorch tensors, and `numpy` turns it back into a `NumpyDataset`. Be aware that these transformations currently come with large performance hits, and there is a benefit to staying within a single ML ecosystem. However, this ability makes exploring new algorithms much simpler: we don't have to re-program our data source right away, and we can take advantage of data input pipelines already built in other frameworks when testing new ones.

In [66]:
# test for pytorch
type(ds.torch().peek()[0])

torch.Tensor

In [67]:
# We can go back to numpy!
type(ds.tf().numpy().peek()[0])

numpy.ndarray

## Wrap-up

Like other components of DRYML, `Dataset`s and `dryml.context` can be used independently of the rest of DRYML. `dryml.context` is very useful for automatically configuring an ML framework's device settings, and `Dataset` is great for inspecting existing datasets and bridging them between different frameworks.