# Fast minibatch sampling

This example shows how to create minibatches from a dataset, a common step in machine learning pipelines.
A SeqTools object can then easily serve as input to TensorFlow's [data module](https://www.tensorflow.org/guide/datasets) or to PyTorch's [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html).

## Data samples

For this example we consider a set of (X, y) data samples where X is a real vector observation and y an integer label.

The following script generates sample data and stores it into large chunks of `chunk_size` items to mock a dataset.

In [1]:
import os
import tempfile
import numpy as np

workdir = tempfile.TemporaryDirectory()
os.chdir(workdir.name)

n_samples = 18000
n_classes = 10
sample_shape = (248,)
chunk_size = 5000

# generate reference class centers
means = np.random.randn(n_classes, *sample_shape) * 3

# generate random class labels
targets = np.random.randint(n_classes, size=n_samples)
np.save('targets.npy', targets)

# generate noisy samples
n_chunks = n_samples // chunk_size + (1 if n_samples % chunk_size > 0 else 0)
for i in range(n_chunks):
    n = min((i + 1) * chunk_size, n_samples) - i * chunk_size
    chunk_file = "values_{:02d}.npy".format(i)
    values = means[targets[i * chunk_size:i * chunk_size + n]] \
        + np.random.randn(n, *sample_shape) * 0.1
    np.save(chunk_file, values)
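
As a quick sanity check of the chunking arithmetic in the script above, the sketch below recomputes the number of chunks and their sizes for the same `n_samples` and `chunk_size`. The helper name `chunk_sizes` is hypothetical and only serves the illustration.

```python
def chunk_sizes(n_samples, chunk_size):
    """Return the number of items stored in each chunk file."""
    # ceiling division: one extra chunk when there is a remainder
    n_chunks = -(-n_samples // chunk_size)
    return [min((i + 1) * chunk_size, n_samples) - i * chunk_size
            for i in range(n_chunks)]

sizes = chunk_sizes(18000, 5000)
print(sizes)  # [5000, 5000, 5000, 3000]
```

All chunks hold `chunk_size` items except the last one, which holds the remainder.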

## Data loading

We now load the targets and the chunks of values back from disk.

In [6]:
import os
import seqtools

targets = np.load("targets.npy")

values_files = sorted(f for f in os.listdir() if f.startswith('values_'))
# pass mmap_mode='r' to np.load if the data cannot fit in memory
values_chunks = [np.load(f) for f in values_files]
values = seqtools.concatenate(values_chunks)

assert len(values) == len(targets)

`seqtools.concatenate` consolidates the chunks back into a single sequence of items. In this particular case we could also use `values = seqtools.unbatch(values_chunks)`, because all chunks (except for the last one) have the same size.
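
To illustrate what presenting several chunks as one sequence involves, here is a minimal pure-Python sketch. It is not SeqTools' implementation: the class name `LazyConcat` is hypothetical, and the sketch only supports integer indexing.

```python
from bisect import bisect_right
from itertools import accumulate

class LazyConcat:
    """Read-only view over several sequences, indexed as if they were one."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # offsets[k] is the global index one past the last item of chunk k
        self.offsets = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # find which chunk holds the item, then the position inside it
        k = bisect_right(self.offsets, index)
        start = self.offsets[k - 1] if k > 0 else 0
        return self.chunks[k][index - start]

chunks = [[0, 1, 2], [3, 4, 5], [6, 7]]
view = LazyConcat(chunks)
print(len(view), view[4], view[-1])  # 8 4 7
```

When every chunk but the last has the same size, the lookup reduces to `divmod(index, chunk_size)`, which is presumably why an unbatching operation also applies here.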

Let's now assemble the samples with their targets to facilitate manipulation:

In [7]:
dataset = seqtools.collate([values, targets])

and split the dataset between training and testing samples:

In [8]:
train_dataset = dataset[:-10000]
test_dataset = dataset[-10000:]
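
With `n_samples = 18000`, the negative slice bounds above leave 8000 samples for training and 10000 for testing. In this small check a plain Python list stands in for the SeqTools dataset, so it only relies on standard slicing semantics.

```python
n_samples = 18000
n_test = 10000

# stand-in for the collated dataset: one placeholder item per sample
dataset = list(range(n_samples))

train_dataset = dataset[:-n_test]   # everything except the last 10000 items
test_dataset = dataset[-n_test:]    # the last 10000 items

print(len(train_dataset), len(test_dataset))  # 8000 10000
```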

In this example, training will be done iteratively using small batches of data sampled from the whole dataset.

In [9]:
batch_size = 64

def collate_fn(batch):
    inputs = np.stack([x for x, _ in batch])
    targets = np.stack([y for _, y in batch])
    return inputs, targets

batches = seqtools.batch(train_dataset, batch_size, collate_fn=collate_fn)
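
To make the role of `collate_fn` concrete, here is a minimal pure-Python sketch of batching. It is not SeqTools' implementation: `iter_batches` and `collate_pairs` are hypothetical names, and plain lists replace `np.stack` to keep the example dependency-free. The sketch keeps the last, possibly shorter, batch.

```python
def iter_batches(dataset, batch_size, collate_fn):
    """Yield consecutive groups of `batch_size` samples, merged by `collate_fn`."""
    for start in range(0, len(dataset), batch_size):
        yield collate_fn(dataset[start:start + batch_size])

def collate_pairs(batch):
    """Turn a list of (x, y) samples into (list of x, list of y)."""
    inputs = [x for x, _ in batch]
    targets = [y for _, y in batch]
    return inputs, targets

# 10 toy samples: x is a 2-item vector, y an integer label
samples = [([i, i + 0.5], i % 3) for i in range(10)]
toy_batches = list(iter_batches(samples, 4, collate_pairs))

print(len(toy_batches))   # 3 (batches of 4, 4 and 2 samples)
print(toy_batches[-1])    # ([[8, 8.5], [9, 9.5]], [2, 0])
```

With 8000 training samples and `batch_size = 64`, this scheme yields exactly 125 full batches.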

## Training

With the minibatches ready to be used, we create a Gaussian Naive Bayes model and train over the dataset several times:

In [11]:
import time
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
classes = np.arange(n_classes)

In [27]:
%%time

for epoch in range(50):
    for inputs, targets in batches:
        pass
        # model.partial_fit(inputs, targets, classes=classes)

CPU times: user 7.8 s, sys: 6.46 ms, total: 7.81 s
Wall time: 8.01 s


Since the model is very simple, building the batches actually takes more time than training: the timing above covers batch building alone, with the training step commented out.
While there is not much that can be done to build individual batches faster, prefetching can help by building batches concurrently using multiple CPU cores.
SeqTools offers two prefetching methods:

- `'thread'` has the smallest overhead but only offers true concurrency for specific loads, notably IO-bound operations.
- `'process'` offers true parallelism, but values computed by the workers must be sent back to the main process, which incurs serialization costs. For buffer data such as numpy arrays, this can be alleviated by the use of shared memory (`shm_size` argument).
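
The idea behind prefetching can be sketched with the standard library alone. The generator below is an illustration under simplifying assumptions, not SeqTools' implementation: `prefetch_threaded` and `SlowSquares` are hypothetical names, only a thread-based variant is shown, and its parameters merely mirror the `nworkers` and `max_buffered` arguments passed to `seqtools.prefetch` in this example.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def prefetch_threaded(sequence, nworkers=2, max_buffered=4):
    """Yield the items of `sequence` in order while threads compute ahead."""
    indices = iter(range(len(sequence)))
    pending = deque()  # futures for items requested but not yet consumed
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        # fill the buffer with the first requests
        for i in indices:
            pending.append(pool.submit(sequence.__getitem__, i))
            if len(pending) >= max_buffered:
                break
        while pending:
            item = pending.popleft().result()  # blocks until the item is ready
            # keep the buffer full by requesting one more item
            nxt = next(indices, None)
            if nxt is not None:
                pending.append(pool.submit(sequence.__getitem__, nxt))
            yield item

class SlowSquares:
    """Toy sequence whose items take a while to produce (simulated IO wait)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        time.sleep(0.01)
        return i * i

results = list(prefetch_threaded(SlowSquares(10)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Threads help in this toy case because `time.sleep` releases the interpreter lock, as blocking IO does. Item computations that hold the lock would see little benefit, which is the limitation of the `'thread'` method noted above.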

In [25]:
method = 'process'
prefetched_batches = seqtools.prefetch(
    batches, method=method, nworkers=2, max_buffered=40, shm_size=10 * 1024 ** 2)

model = GaussianNB()
classes = np.arange(n_classes)

In [26]:
%%time

for epoch in range(50):
    for inputs, targets in prefetched_batches:
        pass
        # model.partial_fit(inputs, targets, classes=classes)

CPU times: user 2.55 s, sys: 359 ms, total: 2.91 s
Wall time: 7.98 s


## Testing

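For completeness, we evaluate the accuracy of the model on the testing data. This requires a fitted model, so the `model.partial_fit` call in the training loop must be uncommented first.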

In [None]:
test_batches = seqtools.batch(test_dataset, batch_size, collate_fn=collate_fn)

predictions = []
targets = []

for X, y in test_batches:
    predictions.extend(model.predict(X))
    targets.extend(y)

accuracy = np.mean(np.array(predictions) == np.array(targets))
print("Accuracy: {:.0f}%".format(accuracy * 100))