# Fast minibatch sampling

In this example, we show how to create a fast minibatch generator, which is typically used in Machine Learning to feed a training routine.
It is not the intent of SeqTools to supplant specialized libraries such as TensorFlow's [data module](https://www.tensorflow.org/guide/datasets) or [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html), but these might lack simplicity and flexibility for certain use cases.
Besides, it is entirely possible to use SeqTools to provide the inputs for these modules.

**Note**: As a general guideline, special care should be taken when using worker-based functions along with these libraries.
Users are advised to become familiar with the behaviour of Python [threads](https://docs.python.org/3/library/threading.html) and [processes](https://docs.python.org/3/library/multiprocessing.html) before using them.

## Data samples

For this example we consider a set of (X, y) data samples composed of a real vector observation and an integer label.
Since it is common practice to store such samples in large groups spread over a few binary dump files, the following script generates such files to mock a dataset.

In [None]:
import os
import tempfile
import numpy as np

workdir = tempfile.TemporaryDirectory()
os.chdir(workdir.name)

n_samples = 18000
n_classes = 10
sample_shape = (248,)
chunk_size = 5000

# generate reference class centers
means = np.random.randn(n_classes, *sample_shape) * 3

# generate random class labels
targets = np.random.randint(n_classes, size=n_samples)
np.save('targets.npy', targets)

# generate noisy samples
n_chunks = n_samples // chunk_size + (1 if n_samples % chunk_size > 0 else 0)
for i in range(n_chunks):
    n = min((i + 1) * chunk_size, n_samples) - i * chunk_size
    chunk_file = "values_{:02d}.npy".format(i)
    values = means[targets[i * chunk_size:i * chunk_size + n]] \
        + np.random.randn(n, *sample_shape) * 0.1
    np.save(chunk_file, values)
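
As a quick check of the chunking arithmetic used in the script above, the following sketch recomputes the chunk boundaries and file names for the same `n_samples` and `chunk_size`, without writing any file. The names `chunk_bounds`, `chunk_sizes` and `chunk_names` are only illustrative.

```python
n_samples = 18000
chunk_size = 5000

# ceiling division: number of chunks needed to hold every sample
n_chunks = -(-n_samples // chunk_size)

# (start, stop) sample indices covered by each chunk
chunk_bounds = [
    (i * chunk_size, min((i + 1) * chunk_size, n_samples))
    for i in range(n_chunks)
]
chunk_sizes = [stop - start for start, stop in chunk_bounds]
chunk_names = ["values_{:02d}.npy".format(i) for i in range(n_chunks)]

print(chunk_sizes)  # [5000, 5000, 5000, 3000]
print(chunk_names)
```

Only the last chunk is shorter than `chunk_size`, which is why the zero-padded file names sort in the same order as the samples.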

## Data loading

Now begins the actual data loading.
Assuming the dataset is too big to fit in memory, memory mapping is used to access the files' content without loading it all at once:

In [None]:
import os
import seqtools

targets = np.load("targets.npy")

values_files = sorted(f for f in os.listdir() if f.startswith('values_'))
values_chunks = [np.load(f, mmap_mode='r') for f in values_files]
values = seqtools.concatenate(values_chunks)

assert len(values) == len(targets)

`concatenate` is easy to remember and does the job, but in this particular case we could also use `values = seqtools.unbatch(values_chunks)` since all of our data chunks (except for the last one) have the same size.
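
To make the idea of concatenating chunks without copying them more concrete, here is a minimal pure-Python sketch. It is not SeqTools' implementation: `LazyConcat` is a hypothetical class written for illustration, and it only supports integer indexing. It maps a global index to a (chunk, local index) pair using cumulative chunk lengths.

```python
from bisect import bisect_right
from itertools import accumulate


class LazyConcat:
    """Hypothetical stand-in illustrating lazy concatenation of chunks."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # offsets[k] is the global index one past the end of chunk k
        self.offsets = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # first chunk whose end offset is strictly greater than index
        k = bisect_right(self.offsets, index)
        start = self.offsets[k - 1] if k > 0 else 0
        return self.chunks[k][index - start]


# three chunks of unequal size, standing in for memory-mapped arrays
chunks = [list(range(0, 5)), list(range(5, 10)), list(range(10, 13))]
lazy = LazyConcat(chunks)
print(len(lazy), lazy[0], lazy[5], lazy[-1])
```

Each lookup only touches the chunk that holds the requested item, so nothing is read from the other chunks.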

Let's now assemble the samples with their targets to facilitate manipulation:

In [None]:
dataset = seqtools.collate([values, targets])

Then split the dataset into training and testing samples:

In [None]:
train_dataset = dataset[:-10000]
test_dataset = dataset[-10000:]

In this example, training will be done iteratively using small batches of data sampled from the whole dataset.

In [None]:
batch_size = 64

def collate_fn(batch):
    inputs = np.stack([x for x, _ in batch])
    targets = np.stack([y for _, y in batch])
    return inputs, targets

batches = seqtools.batch(train_dataset, batch_size, collate_fn=collate_fn)
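
To illustrate what batching with a collate function produces, the sketch below applies the same `collate_fn` to a toy list of samples using a plain generator. `iter_batches` is a hypothetical helper written for this illustration, not a SeqTools function, and it keeps the trailing partial batch, which may or may not match the options of a given batching utility.

```python
import numpy as np


def collate_fn(batch):
    # turn a list of (x, y) pairs into a pair of stacked arrays
    inputs = np.stack([x for x, _ in batch])
    targets = np.stack([y for _, y in batch])
    return inputs, targets


def iter_batches(samples, batch_size):
    # hypothetical helper: slice consecutive groups and collate each one
    for start in range(0, len(samples), batch_size):
        yield collate_fn(samples[start:start + batch_size])


rng = np.random.default_rng(0)
toy_samples = [(rng.standard_normal(4), int(rng.integers(3))) for _ in range(10)]

toy_batches = list(iter_batches(toy_samples, batch_size=4))
batch_shapes = [(x.shape, y.shape) for x, y in toy_batches]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the last batch holds only 2 samples. In the example above, the 8000 training samples divide evenly into batches of 64, so this case does not arise.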

## Training

With the minibatches ready to be used, we create a Gaussian Naive Bayes model and train over the dataset 50 times:

In [None]:
import time
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
classes = np.arange(n_classes)

t1 = time.time()
for epoch in range(50):
    for inputs, targets in batches:
        model.partial_fit(inputs, targets, classes=classes)

t2 = time.time()
print("training took {:.0f}s".format(t2 - t1))

Since the model is very simple, building the batches actually takes more time than training.
While there is not much that can be done to build individual batches faster, prefetching can help by building batches concurrently using multiple CPU cores.
SeqTools offers several prefetching methods, including:

- `'thread'` has the smallest overhead but only offers true concurrency for specific loads, notably IO-bound operations.
- `'process'` offers true parallelism, but values computed by the workers must be sent back to the main process, which incurs serialization costs. For buffer data such as numpy arrays, this can be alleviated by the use of shared memory.

In this example, either method can work.
`'thread'` will be the slowest since IO operations are not critical, whereas `'process'` should boost the execution speed compared to the serial loop.

In [None]:
method = 'process'
prefetched_batches = seqtools.prefetch(
    batches, method=method, nworkers=2, max_buffered=40)

model = GaussianNB()
classes = np.arange(n_classes)

t1 = time.time()
for epoch in range(50):
    for inputs, targets in prefetched_batches:
        model.partial_fit(inputs, targets, classes=classes)

t2 = time.time()
print("training took {:.0f}s".format(t2 - t1))
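
The prefetching idea used above can be sketched with the standard library alone. This is not how SeqTools implements it: `prefetch_sketch` and `make_item` are hypothetical names, and the parameters `nworkers` and `max_buffered` merely mirror the ones passed to `seqtools.prefetch` above. Worker threads compute a bounded number of items ahead of the consumer, and results are yielded in their original order.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetch_sketch(make_item, n_items, nworkers=2, max_buffered=4):
    """Yield make_item(0), make_item(1), ... in order, computing ahead in threads."""
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        pending = deque()  # futures for items submitted but not yet consumed
        next_index = 0
        while next_index < n_items or pending:
            # keep at most max_buffered items in flight
            while next_index < n_items and len(pending) < max_buffered:
                pending.append(pool.submit(make_item, next_index))
                next_index += 1
            # block until the oldest item is ready, preserving order
            yield pending.popleft().result()


squares = list(prefetch_sketch(lambda i: i * i, 6))
print(squares)  # [0, 1, 4, 9, 16, 25]
```

Because this sketch relies on threads, it only helps when `make_item` spends its time in operations that release the interpreter lock, such as file reads. A process-based variant would bring parallelism to pure-Python work at the cost of sending results back to the main process.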

## Testing

For completeness, we evaluate the accuracy of the results on the testing data.

In [None]:
test_batches = seqtools.batch(test_dataset, batch_size, collate_fn=collate_fn)

predictions = []
targets = []

for X, y in test_batches:
    predictions.extend(model.predict(X))
    targets.extend(y)

accuracy = np.mean(np.array(predictions) == np.array(targets))
print("Accuracy: {:.0f}%".format(accuracy * 100))