# Data loading: batching and bucketing


In deep learning, it is common to process the training data in "mini-batches": at each optimization step, we use only a small random sample of the data.

For image processing tasks, all images typically have the same size; in sequence processing tasks such as text processing, however, the length of each sequence is arbitrary. This means that when we group sentences together in a mini-batch, we need to add "padding" tokens to make every sentence as long as the longest one in the mini-batch. Padding wastes space in the mini-batch. One way to mitigate this problem is to build batches from sentences of similar length, so that the amount of padding introduced is minimized. This way of preparing batches based on sequence length is known as "bucketing".
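To make the cost of padding concrete, here is a minimal sketch that uses only NumPy and a toy corpus. The helper names `pad_batch` and `padding_fraction` are hypothetical and are not part of `seqp`. Sorting the whole corpus by length is a simplification used only for illustration; it is not a description of how `seqp` assigns sequences to buckets.

```python
import numpy as np


def pad_batch(seqs, pad_value=0):
    """Right-pad variable-length sequences to the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    batch = np.full((len(seqs), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(seqs):
        batch[row, :len(seq)] = seq
    return batch


def padding_fraction(batches):
    """Fraction of all batch cells that hold padding instead of real tokens."""
    total_cells = sum(len(b) * max(len(s) for s in b) for b in batches)
    real_tokens = sum(len(s) for b in batches for s in b)
    return 1.0 - real_tokens / total_cells


# Toy corpus of token-ID sequences (IDs start at 1, so 0 is free for padding).
lengths = [2, 9, 3, 10, 2, 8]
corpus = [list(range(1, n + 1)) for n in lengths]

# Naive batching: group sentences in their original order, three per batch.
naive_batches = [corpus[0:3], corpus[3:6]]

# Length-aware batching: sort by length first, then group neighbours.
sorted_corpus = sorted(corpus, key=len)
bucketed_batches = [sorted_corpus[0:3], sorted_corpus[3:6]]

print(pad_batch(naive_batches[0]))
print("naive padding fraction:    {:.2f}".format(padding_fraction(naive_batches)))
print("bucketed padding fraction: {:.2f}".format(padding_fraction(bucketed_batches)))
```

On this toy corpus, about 40% of the cells are padding with naive batching, against about 13% when sentences of similar length are grouped together.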

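`seqp` offers bucketed sequence batch loading out of the box. This notebook illustrates the feature: it downloads a small corpus, builds a vocabulary from it, writes the encoded sentences to sharded HDF5 files, and then loads them back in bucketed batches.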

In [1]:
!wget -q http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng_news_2016_10K.tar.gz
!tar xzf eng_news_2016_10K.tar.gz
!cut -f2 eng_news_2016_10K/eng_news_2016_10K-sentences.txt | gshuf > ./corpus.en
!rm -rf eng_news_2016_10K eng_news_2016_10K.tar.gz

In [2]:
import re
from seqp.vocab import Vocabulary, VocabularyCollector

file_name = 'corpus.en'

collector = VocabularyCollector()

with open(file_name) as f:
    for line in f:
        line = line.strip().lower()
        # tokenize words (taken from https://stackoverflow.com/a/8930959/674487)
        tokens = re.findall(r"\w+|[^\w\s]", line, re.UNICODE)
        for token in tokens:
            collector.add_symbol(token)

vocab = collector.consolidate(max_num_symbols=5000)
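As a quick illustration of what the tokenization step produces, the sketch below applies the same regular expression to a single made-up sentence. The `tokenize` wrapper is a hypothetical helper introduced only for this example; the notebook itself calls `re.findall` inline.

```python
import re


def tokenize(line):
    """Lowercase a line and split it into word tokens and punctuation marks."""
    line = line.strip().lower()
    # \w+ matches runs of word characters; [^\w\s] matches any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", line, re.UNICODE)


print(tokenize("Hello, world! It's 2016."))
```

Each punctuation mark becomes its own token, so a contraction such as "it's" is split into three tokens.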

In [3]:
import numpy as np
from seqp.record import ShardedWriter
from seqp.hdf5 import Hdf5RecordWriter


with ShardedWriter(Hdf5RecordWriter,
                   'corpus.shard_{:02d}.hdf5',
                   max_records_per_shard=4000) as writer, open(file_name) as f:

    # save vocabulary along with the records
    writer.add_metadata({'vocab': vocab.to_json()})

    for idx, line in enumerate(f):
        line = line.strip().lower()
        tokens = re.findall(r"\w+|[^\w\s]", line, re.UNICODE)
        token_ids = vocab.encode(tokens, add_eos=False, use_unk=True)
        writer.write(idx, np.array(token_ids))

In [7]:
from glob import glob
from seqp.iteration import DataLoader
from seqp.hdf5 import Hdf5RecordReader
from seqp.vocab import Vocabulary

# batch size given as a number of tokens (see is_size_in_tokens below)
BATCH_SIZE_IN_TOKENS = 500

with Hdf5RecordReader(glob('corpus.shard_*.hdf5')) as reader:
    vocab = Vocabulary.from_json(reader.metadata('vocab'))
    
    loader = DataLoader(reader, pad_value=vocab.pad_id, num_buckets=8) 
    
    batch_it = loader.iterator(batch_size=BATCH_SIZE_IN_TOKENS,
                               is_size_in_tokens=True)
    
    for k, batch in enumerate(batch_it):
        # number of cells in the padded batch, padding included
        num_tokens = batch.shape[0] * batch.shape[1]
        print("Batch {:03d}. num_tokens={}, shape={}".format(k, num_tokens, batch.shape))
        if k > 5:
            break


Batch 000. num_tokens=480, shape=(20, 24)
Batch 001. num_tokens=496, shape=(31, 16)
Batch 002. num_tokens=496, shape=(31, 16)
Batch 003. num_tokens=492, shape=(41, 12)
Batch 004. num_tokens=493, shape=(17, 29)
Batch 005. num_tokens=496, shape=(31, 16)
Batch 006. num_tokens=500, shape=(25, 20)
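
The shapes printed above are consistent with a simple rule: each batch holds as many sentences as fit in the token budget once every sentence is padded to the longest one. The sketch below checks that arithmetic against the printed shapes. It is an observation drawn from this output, not a statement about how `seqp` computes batches internally, and `max_sentences` is a hypothetical helper.

```python
BATCH_SIZE_IN_TOKENS = 500


def max_sentences(max_len, budget=BATCH_SIZE_IN_TOKENS):
    """Largest number of rows such that rows * max_len stays within the budget."""
    return budget // max_len


# (number of sentences, padded length) pairs copied from the output above
observed_shapes = [(20, 24), (31, 16), (31, 16), (41, 12),
                   (17, 29), (31, 16), (25, 20)]

for rows, max_len in observed_shapes:
    # the padded batch never exceeds the token budget
    assert rows * max_len <= BATCH_SIZE_IN_TOKENS
    print("padded length {:2d}: observed {:2d} sentences, budget allows {:2d}".format(
        max_len, rows, max_sentences(max_len)))
```

Because the budget is expressed in tokens, batches of short sentences contain more rows than batches of long sentences, while the padded size of every batch stays close to 500.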
