# Fast minibatch sampling

In this example, we show how to create a fast minibatch generator of the kind typically used in machine learning to feed a training routine.
SeqTools is not intended to supplant specialized libraries such as TensorFlow's [data module](https://www.tensorflow.org/guide/datasets) or [torch.utils.data](https://pytorch.org/docs/stable/data.html), but these might lack simplicity and flexibility for certain uses.
Besides, it is entirely possible to use SeqTools to provide the inputs for these modules.

**Note**: As a general guideline, special care should be taken when using worker-based functions along with these libraries.
Users are advised to become familiar with the behaviour of Python [threads](https://docs.python.org/3/library/threading.html) and [processes](https://docs.python.org/3/library/multiprocessing.html) before using them.

## Data samples

For this example, we consider a set of (X, y) data samples, each composed of a real-valued observation vector and an integer label.
Since it is common practice to store such samples in large groups spread over a few binary dump files, the following script generates such files to mock a dataset.

In [None]:
import os
import tempfile
import numpy as np

workdir = tempfile.TemporaryDirectory()
os.chdir(workdir.name)

n_samples = 18000
n_classes = 10
sample_shape = (248,)
chunk_size = 5000

# generate reference class centers
means = np.random.randn(n_classes, *sample_shape) * 3

# generate random class labels
labels = np.random.randint(n_classes, size=n_samples)
np.save('labels.npy', labels)

# generate noisy samples
n_chunks = n_samples // chunk_size + (1 if n_samples % chunk_size > 0 else 0)
for i in range(n_chunks):
    n = min((i + 1) * chunk_size, n_samples) - i * chunk_size
    chunk_file = "data_{:02d}.npy".format(i)
    data = means[labels[i * chunk_size:i * chunk_size + n]] \
        + np.random.randn(n, *sample_shape) * 0.1
    np.save(chunk_file, data)

## Data loading

Now begins the actual data loading.
Assuming the dataset is too big to fit in memory, data samples are not loaded but mapped into memory.

In [None]:
import os
import seqtools

labels = np.load("labels.npy")

data_files = sorted(f for f in os.listdir() if f.startswith('data_'))
data_chunks = [np.load(f, mmap_mode='r') for f in data_files]
data = seqtools.concatenate(data_chunks)

assert len(data) == n_samples

`seqtools.concatenate` is easy to remember and does the job, but in this particular case we could also use `data = seqtools.unbatch(data_chunks)` since all of our data chunks (except for the last one) have the same size.
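
To make the idea of lazy concatenation concrete, here is a minimal pure-Python sketch of how a global index can be routed to the right chunk without copying any data. The `LazyConcatenation` class is a hypothetical illustration written for this example; it is not SeqTools' implementation and only handles integer indexing.

```python
import bisect
from itertools import accumulate


class LazyConcatenation:
    """Hypothetical stand-in for a lazily concatenated sequence."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # offsets[i] is the global index one past the last item of chunk i
        self.offsets = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # first chunk whose end offset is strictly greater than index
        chunk_id = bisect.bisect_right(self.offsets, index)
        start = self.offsets[chunk_id - 1] if chunk_id > 0 else 0
        return self.chunks[chunk_id][index - start]


# three chunks of sizes 5, 5 and 3: equal-sized dumps with a shorter last one
chunks = [list(range(0, 5)), list(range(5, 10)), list(range(10, 13))]
concatenated = LazyConcatenation(chunks)
```

Only the chunk that holds the requested item is ever touched, which is what keeps memory-mapped chunks from being read in full.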

Let's now pair the samples with their labels to make them easier to manipulate, and split the dataset into training and testing samples:

In [None]:
dataset = seqtools.collate([data, labels])
train_dataset = dataset[:-10000]
test_dataset = dataset[-10000:]
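
Conceptually, collating behaves like a lazy `zip` that can still be indexed and sliced. The sketch below illustrates that idea with a hypothetical `LazyCollation` class and toy data; it is an assumption-laden illustration, not SeqTools' implementation.

```python
class LazyCollation:
    """Hypothetical lazy zip over sequences of equal length."""

    def __init__(self, sequences):
        self.sequences = list(sequences)
        if len({len(s) for s in self.sequences}) > 1:
            raise ValueError("all sequences must have the same length")

    def __len__(self):
        return len(self.sequences[0])

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slicing every member the same way keeps the pairing intact
            return LazyCollation([s[key] for s in self.sequences])
        return tuple(s[key] for s in self.sequences)


# toy observations and labels standing in for `data` and `labels`
observations = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
targets = [0, 1, 0, 1, 1]

pairs = LazyCollation([observations, targets])
train_pairs = pairs[:-2]  # everything except the last two samples
test_pairs = pairs[-2:]   # the last two samples
```

With the same negative-index slicing, the 18000 samples generated earlier split into 8000 training samples and 10000 testing samples.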

We now write a simple random minibatch sampler and pass it to `seqtools.load_buffers` to start generating samples using multiple background workers:

In [None]:
def assemble_minibatch(samples):
    """Assembles a bunch of samples into a minibatch."""
    batch_data = np.stack([data for data, _ in samples])
    batch_labels = np.stack([label for _, label in samples])
    return batch_data, batch_labels


batch_size = 64


def sample_minibatch():
    """Draws a random minibatch from the training samples only."""
    subset = np.random.choice(len(train_dataset), batch_size)
    samples = list(seqtools.gather(train_dataset, subset))
    return assemble_minibatch(samples)


minibatch_iter = seqtools.load_buffers(sample_minibatch, max_cached=10, nworkers=2)

`minibatch_iter` simply yields minibatches indefinitely by repeatedly calling `sample_minibatch` and puts the results into buffers which are returned at each iteration.

*Please note that the buffer slots are reused cyclically, so their content is overwritten across iterations.*
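
The practical consequence is that a minibatch kept around for later must be copied. The following sketch reproduces the overwriting behaviour with a hypothetical `cyclic_buffers` generator built on plain lists; it only mimics the slot reuse described above and makes no claim about how SeqTools manages its buffers internally.

```python
import itertools


def cyclic_buffers(produce, n_slots):
    """Yield results written into a fixed pool of reusable slots (hypothetical helper)."""
    slots = [[] for _ in range(n_slots)]
    for slot in itertools.cycle(slots):
        slot[:] = produce()  # overwrite the slot content in place
        yield slot


counter = itertools.count()
batches = cyclic_buffers(lambda: [next(counter)] * 3, n_slots=2)

kept_reference = next(batches)   # refers to the first slot itself
kept_copy = list(next(batches))  # independent copy of the second slot
next(batches)                    # reuses the first slot: kept_reference changes
next(batches)                    # reuses the second slot: kept_copy is unaffected
```

After the last two iterations, `kept_reference` holds the values of the third result, whereas `kept_copy` still holds the values of the second one. With NumPy arrays, the equivalent precaution would be to call `.copy()` on any buffer that must outlive the iteration.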


## Training

With the minibatches ready to be used, we create a Gaussian Naive Bayes model and start training:

In [None]:
import time
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
classes = np.arange(n_classes)

t1 = time.time()
for _ in range(4000):
    X, y = next(minibatch_iter)
    model.partial_fit(X, y, classes=classes)
    
t2 = time.time()
print("training took {:.0f}s".format(t2 - t1))

Without `seqtools.load_buffers` to prefetch the minibatches, the training procedure must wait for its input data at every step:

In [None]:
model = GaussianNB()
classes = np.arange(n_classes)

t1 = time.time()
for _ in range(4000):
    X, y = sample_minibatch()  # sample synchronously, without prefetching
    model.partial_fit(X, y, classes=classes)

t2 = time.time()
print("training took {:.0f}s".format(t2 - t1))
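
The gap between the two timings comes from overlapping data preparation with computation. As a rough illustration of that principle, here is a minimal sketch using only the standard library: one background thread fills a bounded queue while the consumer works. The names `prefetch`, `slow_sample` and `consume` are hypothetical, the sleeps merely stand in for real work, and this is not how `seqtools.load_buffers` is implemented.

```python
import queue
import threading
import time


def prefetch(produce, max_cached, n_items):
    """Yield n_items results of produce(), computed ahead by one thread."""
    results = queue.Queue(maxsize=max_cached)

    def worker():
        for _ in range(n_items):
            results.put(produce())  # blocks while the queue is full

    threading.Thread(target=worker, daemon=True).start()
    for _ in range(n_items):
        yield results.get()  # blocks until an item is available


def slow_sample():
    time.sleep(0.01)  # stands in for reading and stacking a minibatch
    return [0.0] * 4


def consume(batches):
    total = 0
    for batch in batches:
        time.sleep(0.01)  # stands in for one training step
        total += len(batch)
    return total


n_steps = 20
prefetched_total = consume(prefetch(slow_sample, max_cached=4, n_items=n_steps))
sequential_total = consume(slow_sample() for _ in range(n_steps))
```

Both loops process the same amount of data, but in the prefetched version the sampling delay is hidden behind the training step. Threads only bring such a benefit when the sampling work releases the interpreter lock, as I/O and most NumPy operations do; otherwise process-based workers are the usual alternative.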

## Testing

For completeness, we evaluate the accuracy of the model on the testing data.
Assuming the testing dataset is also too big to fit in memory, the evaluation proceeds by small chunks as well:

In [None]:
testing_chunks = seqtools.batch(test_dataset, 64, collate_fn=assemble_minibatch)

predictions = []
targets = []

for X, y in testing_chunks:
    predictions.extend(model.predict(X))
    targets.extend(y)

accuracy = np.mean(np.array(predictions) == np.array(targets))
print("Accuracy: {:.0f}%".format(accuracy * 100))
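
Chunked evaluation only gives the right accuracy if every test sample is visited exactly once, including those in the final, shorter chunk. The sketch below shows the slicing arithmetic with a hypothetical `iter_batches` helper on a plain list, assuming the last incomplete batch is kept rather than dropped.

```python
def iter_batches(sequence, batch_size):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(sequence), batch_size):
        yield sequence[start:start + batch_size]


# same sizes as the test split above: 10000 samples in chunks of 64
test_items = list(range(10000))
batch_sizes = [len(batch) for batch in iter_batches(test_items, 64)]
```

Here the 10000 samples yield 156 full batches followed by one batch of 16 samples. Accumulating the predictions and targets before averaging, as done above, avoids giving that smaller batch too much weight.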