In [1]:
import dryml
from dryml.data import Dataset
import numpy as np

# DRYML Tutorial 2 - `Dataset`s

Interacting with datasets is not standardized across the major ML platforms. Each platform offers its own version of dataset ingestion, batching, and iteration. DRYML attempts to remedy this with a wrapper class called `Dataset`, which implements a uniform set of functionality for two major ML platforms (and, in some cases, functionality that neither implements!). Once a common API is defined, developers using DRYML can rely on it and create pipelines which communicate with each other with minimal effort.

## `NumpyDataset`


The `Dataset` API is an attempt to standardize ML dataset interaction. It takes a functional-style approach, borrowing much from `tensorflow`'s `tf.data.Dataset` type. We'll explore the `Dataset` type by generating a data sample using `numpy` and loading it into a `NumpyDataset` object. The `NumpyDataset` object implements the `Dataset` API using `numpy` arrays and operations.

In [2]:
from dryml.data import NumpyDataset

In [3]:
# Create random numpy dataset. 
num_examples = 10000
data_shape = (28, 28)
data_np = np.random.random((num_examples,)+data_shape)

In [4]:
# Create a NumpyDataset from the numpy array. supervised=False is necessary
# because this dataset has no supervised targets.
data_ds = NumpyDataset(data_np, supervised=False)

### `Dataset.peek`

Let's start with the very useful method `peek`, which returns the first element of the `Dataset`. If the data is batched, it returns the first batch; if not, it returns the first single example. Let's check the shape of the first element of this dataset.

In [5]:
data_ds.peek().shape

(10000, 28, 28)

### `Dataset.batch` and `Dataset.unbatch`

Great! That's what we put into the dataset. Notice that the whole array is treated as a single batch. What if we don't want to treat the entire dataset as one giant batch? For that we have the `unbatch` and `batch` methods: we first `unbatch` the dataset, then `batch` it with the desired `batch_size`.

In [6]:
# Unbatch, then rebatch with new batch size
batch_size = 32
batched_ds = data_ds.unbatch().batch(batch_size=batch_size)

In [7]:
# Take a peek at the new dataset object's element shape.
batched_ds.peek().shape

(32, 28, 28)

The dataset now yields batches with a `batch_size` of 32! What if we want to look at a single example? We simply leave off the final `batch` call. Notice that the result of each method call is a new `Dataset` object. This means we can reuse each one as often as we like, and transformations don't destroy the original.

In [8]:
data_ds.unbatch().peek().shape

(28, 28)

So we see, the shape of a single example is what we intended at the start!
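
To make the `unbatch` then `batch` pattern concrete, here is a minimal pure-Python sketch of the idea using generators. It is not DRYML's implementation: the helper functions `unbatch` and `batch` below are hypothetical stand-ins, and keeping the final short batch is an assumption of this sketch rather than documented DRYML behavior.

```python
from itertools import islice


def unbatch(batches):
    """Yield individual examples from an iterable of batches."""
    for b in batches:
        yield from b


def batch(examples, batch_size):
    """Group examples into lists of at most batch_size items.

    Assumption of this sketch: the final, shorter batch is kept.
    """
    it = iter(examples)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


# One "giant batch" of 100 examples, mirroring a freshly created dataset.
giant = [list(range(100))]

# Rebatching builds new objects; `giant` itself is left untouched.
rebatched = list(batch(unbatch(giant), batch_size=32))
print([len(b) for b in rebatched])  # [32, 32, 32, 4]
```

Because each helper returns a new iterable instead of modifying its input, the original data can be rebatched again with a different `batch_size`.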

### `Dataset.take` and `Dataset.skip`

`take` and `skip` limit a dataset to a subset of its elements, which is useful for getting a manageable amount of data to interact with when the dataset is infinite or just very large. `Dataset` also provides the method `count`, which attempts to literally count the elements in the `Dataset`. (This is in lieu of a better heuristic method we are working on.)

In [9]:
data_ds.unbatch().take(10).count()

10
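
The cell above only exercises `take`. As a rough illustration of how `take`, `skip`, and `count` can behave on a lazy stream, here is a pure-Python sketch built on `itertools`. The functions below are hypothetical helpers written for this example, not DRYML methods, and the "drop the first `n` elements" meaning of `skip` is an assumption borrowed from the similarly named `tf.data.Dataset` method.

```python
from itertools import count as naturals, islice


def take(examples, n):
    """Lazily keep only the first n elements."""
    return islice(examples, n)


def skip(examples, n):
    """Lazily drop the first n elements (assumed semantics)."""
    return islice(examples, n, None)


def count(examples):
    """Count elements by iterating; never call this on an infinite stream."""
    return sum(1 for _ in examples)


# An infinite stream 0, 1, 2, ... stands in for a very large dataset.
window = list(take(skip(naturals(), 5), 10))
print(window)                       # [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
print(count(take(naturals(), 10)))  # 10
```

Counting by iteration touches every element, which is why it is best applied after `take` on large or unbounded data.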

### `Dataset` iteration

`Dataset`s are Python iterables! Let's have a look at that now.

In [10]:
# Iterate through the unbatched data and show the [0, 0] entry of each example
for el in data_ds.unbatch().take(10):
    print(el[0,0])

0.3003042853411838
0.24456574035714496
0.9984157078557113
0.9592794840556339
0.8656060126343419
0.5492409303870289
0.20690550626773896
0.29400305581091024
0.8367717639755184
0.9956552021875325


### `Dataset` - Supervised, Unsupervised, and Indexed datasets

`Dataset` currently supports several kinds of datasets. The most common differences between two `Dataset`s are whether they are supervised and whether they are indexed. Let's consider non-indexed `Dataset`s first. A supervised `Dataset` contains the 'input' elements (the values we would pass to a model), or `X` elements, and the target elements, or `Y` elements. When retrieving values from such a `Dataset`, you get a tuple like `(X, Y)`. If the `Dataset` is unsupervised, you just get the `X` value. If the dataset is batched, then `X` and `Y` are batches as well.

If the dataset is indexed, the returned data is nested in another tuple whose first element is the index data. For unsupervised data, it looks like `(I, X)`, where `I` is the index for the element. For supervised data, it looks like `(I, (X, Y))`. Thus, to get the index for indexed data, we access `el[0]`, where `el` is the data element.
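
The four element layouts described above can be summarized with a small helper. This is only an illustrative sketch: `unpack` is a hypothetical function written for this tutorial, not part of DRYML, and plain strings stand in for real arrays.

```python
def unpack(el, indexed, supervised):
    """Split a dataset element into (I, X, Y); missing parts are None."""
    index = None
    if indexed:
        # Indexed elements are nested as (I, payload).
        index, el = el
    if supervised:
        # Supervised payloads are (X, Y) tuples.
        x, y = el
    else:
        # Unsupervised payloads are just X.
        x, y = el, None
    return index, x, y


print(unpack("x", indexed=False, supervised=False))            # (None, 'x', None)
print(unpack(("x", "y"), indexed=False, supervised=True))      # (None, 'x', 'y')
print(unpack((0, "x"), indexed=True, supervised=False))        # (0, 'x', None)
print(unpack((0, ("x", "y")), indexed=True, supervised=True))  # (0, 'x', 'y')
```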

Let's have a look at how this works for a supervised dataset. We'll create a new supervised dataset by generating random data.

In [11]:
# Creating the new supervised data
num_examples = 10000
x_shape = (20, 5)
num_classes = 5
x_data = np.random.random((num_examples,)+x_shape)
y_data = np.random.choice(list(range(num_classes)), size=(num_examples,))

In [12]:
# Create NumpyDataset
ds = NumpyDataset((x_data, y_data), supervised=True)

We can verify that this data has the right shape.

In [13]:
print(f"X shape: {ds.peek()[0].shape}")
print(f"Y shape: {ds.peek()[1].shape}")

X shape: (10000, 20, 5)
Y shape: (10000,)


We can index this dataset using the `as_indexed` method.

In [14]:
indexed_ds = ds.as_indexed()
first_el = indexed_ds.peek()
print(f"I shape: {first_el[0].shape}")
print(f"X shape: {first_el[1][0].shape}")
print(f"Y shape: {first_el[1][1].shape}")

I shape: (10000,)
X shape: (10000, 20, 5)
Y shape: (10000,)


By default, `as_indexed` counts examples starting from 0. We can look at the indices of the first 10 elements using `unbatch`, `take`, and the `index` method, which returns an iterable that yields only the index.

In [15]:
for el in indexed_ds.unbatch().take(10).index():
    print(el)

0
1
2
3
4
5
6
7
8
9


We can also remove the supervised targets from `ds` using the `as_not_supervised` method.

In [16]:
ds.as_not_supervised().peek().shape

(10000, 20, 5)

### `Dataset` - `map`, `map_el`, `apply`, `apply_X`, and `apply_Y`

`Dataset` supports mapping of a function across all elements of the dataset. This is useful for applying transformations to the dataset, and other components of DRYML operating on datasets use these methods in their implementations.

* `map` applies a given function to all content in the `Dataset`.
* `apply` applies a given function to all `X` and `Y` content in the `Dataset`.
* `apply_X` applies a given function only to the `X` dataset in a `Dataset`.
* `apply_Y` applies a given function only to the `Y` dataset in a `Dataset`.
* `map_el` is a special function. You may have noticed by now that every element yielded by a dataset is either a single piece of data or a (possibly nested) tuple of them. `map_el` applies a given function to every primitive element of data within supported collections such as tuples. So if the element is `(d1, (d2, d3))`, this gives `(f(d1), (f(d2), f(d3)))`.
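
The recursive behavior described for `map_el` can be sketched in a few lines of plain Python. This is an illustration of the idea only, not DRYML's code: the standalone `map_el` function below is hypothetical, and it handles only tuples, whereas the set of collections DRYML actually supports is not spelled out here.

```python
def map_el(func, el):
    """Apply func to every primitive value inside (possibly nested) tuples."""
    if isinstance(el, tuple):
        # Recurse into the collection and rebuild it with the same nesting.
        return tuple(map_el(func, item) for item in el)
    # A primitive piece of data: apply the function directly.
    return func(el)


result = map_el(lambda d: d * 10, (1, (2, 3)))
print(result)  # (10, (20, 30))
```

The nesting of the output matches the nesting of the input, so an indexed, supervised element keeps its `(I, (X, Y))` layout after mapping.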

Let's try `apply_X`

In [17]:
# We apply a function to the Dataset
el_sq = ds.apply_X(lambda x: x**2).peek()[0]
# We can check that the function was applied with an assert
el = ds.peek()[0]
assert np.all(el_sq == el**2)
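
For intuition, the difference between `apply_X` and `apply_Y` on a single supervised element can be sketched as follows. The `apply_x` and `apply_y` helpers are hypothetical and operate on one `(X, Y)` tuple; the real methods operate on whole `Dataset` objects.

```python
def apply_x(func, el):
    """Transform only the X part of a supervised (X, Y) element."""
    x, y = el
    return func(x), y


def apply_y(func, el):
    """Transform only the Y part of a supervised (X, Y) element."""
    x, y = el
    return x, func(y)


el = ([1.0, 2.0, 3.0], 2)
squared = apply_x(lambda x: [v ** 2 for v in x], el)
shifted = apply_y(lambda y: y + 1, el)
print(squared)  # ([1.0, 4.0, 9.0], 2)
print(shifted)  # ([1.0, 2.0, 3.0], 3)
```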

## DRYML `Dataset` - ML Framework transformations

DRYML `Dataset`s come with a special power: they can implement transformations to datasets in other ML frameworks, for example TensorFlow! The `NumpyDataset` class supports the method `tf`, which creates a new `tf.data.Dataset` wrapped in a `TFDataset` (which implements the DRYML `Dataset` API for TensorFlow) containing the data from the original `NumpyDataset`. This is very useful for moving data into TensorFlow tensors to be used in TensorFlow models.

In [18]:
tf_ds = ds.tf()
# We can now peek at the first element of the new dataset and see its type.
type(tf_ds.peek()[0])


tensorflow.python.framework.ops.EagerTensor

We can see it's now a TensorFlow `EagerTensor` (it may also be just a `Tensor`).

Other transformations may exist as well. The `torch` method converts the data to PyTorch tensors, and `numpy` converts it back into a `NumpyDataset`. Be aware that these transformations currently come with large performance hits, and there is a benefit to staying within a single ML ecosystem. However, this ability makes exploring new algorithms much simpler: we don't have to reprogram our data source right away, and we can take advantage of data input pipelines already built in other frameworks when testing new frameworks.

In [19]:
# test for pytorch
type(ds.torch().peek()[0])


torch.Tensor

In [20]:
# We can go back to numpy!
type(ds.tf().numpy().peek()[0])

numpy.ndarray

## Wrap-up

Like other components of DRYML, `Dataset`s can be used outside of `DRYML`. `Dataset` is great for inspecting and bridging existing datasets between different frameworks.