# How to create simple generators for Machine Learning and Deep Learning

This tutorial explains how to **write different iterators** and which **features** these iterators can implement for machine learning tasks. It also shows how to use the **`machinelearning.iterator` tools**.

These tools are useful in cases such as **deep learning training (e.g. with Keras and `fit_generator`)**. They help you easily create **generators which yield batches of samples infinitely, using multiple processes**.

## A fake dataset

We create a fake dataset which contains **documents** and **labels**, with this structure (the label comes first, followed by the tokens of the document):
    
    A this is a document
    B an other document
    A this talk about things
    C the apple is green or yellow

In [1]:
import random
voc = "this is a an other document talk about things the apple is green or yellow".split()
paths = ["part" + str(i) + ".txt" for i in range(3)]
print(paths)
for path in paths:
    with open(path, "w") as f:
        for sampleId in range(2):
            # We first add the class of the document:
            line = random.choice(['A', 'B', 'C']) + " "
            # Then we add random words:
            for i in range(5):
                line += random.choice(voc) + " "
            f.write(line + "\n")

['part0.txt', 'part1.txt', 'part2.txt']


## A simple generator

Generators and iterators are useful when **you cannot load your dataset into memory**. Let's start with a **simple generator** which takes a container (a file) as a parameter:

In [2]:
def dataGenerator(container, *args, **kwargs):
    with open(container) as f:
        for line in f:
            line = line.strip().split(" ")
            yield (line[1:], line[0])

Then iterate over the whole dataset (in a single process):

In [3]:
for path in paths:
    for tokens, label in dataGenerator(path):
        print(label + " --> " + str(tokens))

B --> ['about', 'is', 'about', 'the', 'talk']
A --> ['document', 'about', 'document', 'things', 'the']
A --> ['things', 'is', 'green', 'this', 'or']
A --> ['green', 'is', 'the', 'a', 'things']
B --> ['yellow', 'apple', 'other', 'things', 'other']
A --> ['talk', 'a', 'a', 'apple', 'other']
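
Because `dataGenerator` is lazy, you can stop early without parsing the rest of the file. Here is a minimal sketch using `itertools.islice` from the standard library; the temporary file and its contents are made up for this illustration:

```python
import itertools
import os
import tempfile

def dataGenerator(container, *args, **kwargs):
    # Same generator as above: yields (tokens, label) for each line.
    with open(container) as f:
        for line in f:
            line = line.strip().split(" ")
            yield (line[1:], line[0])

# Hypothetical sample file, written only for this illustration.
tmpDir = tempfile.mkdtemp()
samplePath = os.path.join(tmpDir, "sample.txt")
with open(samplePath, "w") as f:
    f.write("A this is a document\n")
    f.write("B an other document\n")
    f.write("C the apple is green or yellow\n")

# islice pulls only the first two samples; the third line is never parsed.
firstTwo = list(itertools.islice(dataGenerator(samplePath), 2))
print(firstTwo)
```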


## A multi-processing iterator

Now, if you have **compressed files** and your **disk storage is sufficiently fast**, or if the preprocessing of the data is **time consuming**, it is beneficial to read the files in multiple processes so that they are decompressed on **multiple cores**.

To do this, you can simply use `machinelearning.iterator.ConsistentIterator`, which takes a **list of containers** (typically a list of files) and a **generator function**.

In [4]:
from machinelearning import iterator as mlit
ci = mlit.ConsistentIterator(paths, dataGenerator, verbose=False)

And iterate over the whole dataset:

In [5]:
for tokens, label in ci:
    print(label + " --> " + str(tokens))

B --> ['about', 'is', 'about', 'the', 'talk']
A --> ['things', 'is', 'green', 'this', 'or']
B --> ['yellow', 'apple', 'other', 'things', 'other']
A --> ['document', 'about', 'document', 'things', 'the']
A --> ['green', 'is', 'the', 'a', 'things']
A --> ['talk', 'a', 'a', 'apple', 'other']


**Features:**

 * Multi-processing
 * Always generates data in the same order
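
The output above suggests that samples are taken from each file in turn (the first line of every file, then the second line of every file). The single-process sketch below reproduces that ordering with `itertools.zip_longest`. It only illustrates a deterministic interleaving and is not how `ConsistentIterator` is implemented; `roundRobin` is a hypothetical helper:

```python
import itertools

_MISSING = object()  # sentinel marking generators that are already exhausted

def roundRobin(generators):
    """Yield one item from each generator in turn, always in the same order."""
    for group in itertools.zip_longest(*generators, fillvalue=_MISSING):
        for item in group:
            if item is not _MISSING:
                yield item

# Three in-memory "containers" standing in for the three files.
containers = [
    [("doc0a", "A"), ("doc0b", "B")],
    [("doc1a", "C"), ("doc1b", "A")],
    [("doc2a", "B")],
]
ordered = list(roundRobin(iter(c) for c in containers))
print(ordered)
```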

## Make it multi-iterable

The main problem with this iterator is that if you give an instance of it to an external tool like [`Doc2Vec` from Gensim](https://radimrehurek.com/gensim/models/doc2vec.html), the tool **won't be able to iterate several times over your dataset**. To demonstrate, if you try to iterate over the instance `ci` again, you will leave the loop immediately:

In [6]:
print("Start")
for _ in ci: print("Got one")
print("End")

Start
End
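
The same thing happens with any plain Python generator: once it is exhausted, iterating over it again yields nothing. A minimal sketch (`countTo` is a made-up generator for this illustration):

```python
def countTo(n):
    # A plain generator: it can be consumed only once.
    for i in range(n):
        yield i

gen = countTo(3)
firstPass = list(gen)   # consumes the generator
secondPass = list(gen)  # nothing left to yield
print(firstPass, secondPass)
```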


The solution is to **wrap your iterator initialization** in `AgainAndAgain`, which takes a generator function (or an iterator class like `ConsistentIterator`) and all the parameters to pass on to it:

In [7]:
aaaCI = mlit.AgainAndAgain(mlit.ConsistentIterator, paths, dataGenerator, verbose=False)

In [8]:
print("Start")
for _ in aaaCI: print("Got one")
for _ in aaaCI: print("Got one again")
print("End")

Start
Got one
Got one
Got one
Got one
Got one
Got one
Got one again
Got one again
Got one again
Got one again
Got one again
Got one again
End


**Features:**

 * Multi-iterable (the iterator instance embeds its init parameters)
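
The idea of storing the init parameters can be sketched in a few lines. The class below is a hypothetical stand-in written for this tutorial, not the actual `AgainAndAgain` code: it keeps the factory and its arguments, and builds a fresh iterator each time a loop starts.

```python
class ReIterable:
    """Hypothetical stand-in showing the idea behind a re-iterable wrapper."""

    def __init__(self, factory, *args, **kwargs):
        # Store how to build the iterator, not the iterator itself.
        self.factory = factory
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # Build a fresh iterator every time a `for` loop starts.
        return iter(self.factory(*self.args, **self.kwargs))

def countTo(n):
    for i in range(n):
        yield i

numbers = ReIterable(countTo, 3)
firstPass = list(numbers)
secondPass = list(numbers)
print(firstPass, secondPass)
```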

## Generate infinite batches

Some libraries take generators which must **yield batches of samples**. For instance, **[`fit_generator`](https://keras.io/models/sequential/#fit_generator) from Keras** takes a generator which must generate data this way (here with a batch size of 2):

    ([doc1, doc2], [label1, label2])
    ([doc3, doc4], [label3, label4])
    ([doc5, doc6], [label5, label6])

You just need to **wrap an AgainAndAgain instance**:

In [9]:
infiniteBatches = mlit.InfiniteBatcher(aaaCI, batchSize=2, toNumpyArray=False)

And Keras will iterate over your dataset this way:

In [10]:
for i in range(4):
    print(next(infiniteBatches))

([['about', 'is', 'about', 'the', 'talk'], ['things', 'is', 'green', 'this', 'or']], ['B', 'A'])
([['yellow', 'apple', 'other', 'things', 'other'], ['document', 'about', 'document', 'things', 'the']], ['B', 'A'])
([['green', 'is', 'the', 'a', 'things'], ['talk', 'a', 'a', 'apple', 'other']], ['A', 'A'])
([['about', 'is', 'about', 'the', 'talk'], ['things', 'is', 'green', 'this', 'or']], ['B', 'A'])
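
To make the behaviour concrete, here is a rough single-process sketch of an infinite batcher. It is an assumption about the general technique, not the code of `InfiniteBatcher`; `infiniteBatches` is a hypothetical function, and it expects an object that can be iterated several times (such as a list or an `AgainAndAgain` instance).

```python
import itertools

def infiniteBatches(reIterable, batchSize):
    """Yield (samples, labels) batches forever, restarting when the data ends."""
    samples, labels = [], []
    while True:
        sawData = False
        for sample, label in reIterable:
            sawData = True
            samples.append(sample)
            labels.append(label)
            if len(samples) == batchSize:
                yield (samples, labels)
                samples, labels = [], []
        if not sawData:
            return  # stop instead of looping forever on an empty dataset

# A list can be iterated several times, so it stands in for a re-iterable dataset.
data = [(["doc1"], "A"), (["doc2"], "B"), (["doc3"], "C"), (["doc4"], "A")]
firstThree = list(itertools.islice(infiniteBatches(data, 2), 3))
for batch in firstThree:
    print(batch)
```

In this sketch, a partial batch left at the end of a pass is completed with the first samples of the next pass; whether the library does the same is not covered here.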


To use it with **[`fit_generator`](https://keras.io/models/sequential/#fit_generator) from Keras**, pass the `InfiniteBatcher` instance, specify the **number of steps** (`steps_per_epoch`) after which an epoch ends, and specify **the number of epochs**.
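
Since the generator never stops, Keras relies on `steps_per_epoch` to know when an epoch ends. A common choice is the number of batches needed to cover the dataset once. The helper below is a hypothetical convenience function, shown only to make the arithmetic explicit:

```python
import math

def stepsPerEpoch(sampleCount, batchSize):
    # Number of batches needed to see every sample once
    # (the last batch may be only partially new).
    return math.ceil(sampleCount / batchSize)

fakeDatasetSteps = stepsPerEpoch(6, 2)     # the fake dataset: 6 samples, batches of 2
largerSteps = stepsPerEpoch(1000, 32)      # a made-up larger dataset
print(fakeDatasetSteps, largerSteps)
```

With the fake dataset this gives 3 steps, which matches the output above where the fourth batch is the first one again.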

You can use these **optional parameters**:

 * `shuffle` (integer), which indicates how many batches to aggregate and shuffle
 * `skip` (integer), which skips some samples at the beginning, for cases where you resume training from a previous run
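
The exact semantics of these two parameters are defined by the library. As a rough illustration of the underlying ideas only, the sketch below skips the first samples of a stream and shuffles a stream chunk by chunk; `skipSamples`, `shuffleInChunks`, `chunkSize` and `seed` are hypothetical names, not part of `machinelearning.iterator`.

```python
import itertools
import random

def skipSamples(samples, skip):
    # Drop the first `skip` samples, e.g. those already seen before a restart.
    return itertools.islice(samples, skip, None)

def shuffleInChunks(samples, chunkSize, seed=0):
    # Buffer `chunkSize` samples, shuffle them, emit them, then repeat.
    rng = random.Random(seed)
    samples = iter(samples)
    while True:
        chunk = list(itertools.islice(samples, chunkSize))
        if not chunk:
            return
        rng.shuffle(chunk)
        yield from chunk

resumed = list(skipSamples(range(10), 4))
shuffled = list(shuffleInChunks(range(10), chunkSize=5))
print(resumed)
print(shuffled)
```

Shuffling in chunks keeps memory bounded: samples only move within their chunk, so the first five outputs are always a permutation of the first five inputs.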

**Features:**

 * Generate batches
 * Infinite

## Use variables in the data preprocessing

Let's say you want to **remove stop words** in multiple processes while you iterate over your dataset. You just need to use `subProcessParseFunct` and `subProcessParseFunctKwargs`. This function takes one item from your base generator and must return the processed item.

First we define the stop words and a function which removes the stop words from a single sample:

In [11]:
stopWords = ["a", "an", "the", "or", "this", "is"]
def removeStopWords(data, *args, stopWords=set(), **kwargs):
    tokens, label = data
    tokens = [word for word in tokens if word not in stopWords]
    return (tokens, label)
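
Before handing the function to an iterator, it can be checked on a single hand-written sample in the main process. The sketch below uses `functools.partial` to bind the keyword arguments; this mimics passing a kwargs dictionary and is not a statement about how the library calls the function internally. The sample is made up.

```python
from functools import partial

def removeStopWords(data, *args, stopWords=set(), **kwargs):
    tokens, label = data
    tokens = [word for word in tokens if word not in stopWords]
    return (tokens, label)

# Bind the keyword arguments once, then call the function with one item at a time.
parse = partial(removeStopWords, stopWords={"a", "an", "the", "or", "this", "is"})

sample = (["the", "apple", "is", "green", "or", "yellow"], "C")
parsed = parse(sample)
print(parsed)
```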

Then we init a `ConsistentIterator`:

In [12]:
ci = mlit.ConsistentIterator(paths, dataGenerator, verbose=False,
                             subProcessParseFunct=removeStopWords,
                             subProcessParseFunctKwargs={"stopWords": stopWords})

We finally iterate the dataset:

In [13]:
for tokens, label in ci:
    print(label + " --> " + str(tokens))

B --> ['about', 'about', 'talk']
A --> ['things', 'green']
B --> ['yellow', 'apple', 'other', 'things', 'other']
A --> ['document', 'about', 'document', 'things']
A --> ['green', 'things']
A --> ['talk', 'apple', 'other']


Of course, as shown earlier, you can wrap it in `AgainAndAgain` and an `InfiniteBatcher`.

Note that **you cannot share variables across processes** (Python serializes each variable, so each process works on its own copy). In this example, **`stopWords` will be replicated** in each process. If you want to set global variables on the fly during the iteration, use `mainProcessParseFunct` and `mainProcessParseFunctKwargs`. This function takes one output of the previous function and must return the processed item (it runs in a single process). You can also pass parameters to the base generator using `itemGeneratorKwargs`.
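
The copy behaviour can be observed without starting any process. Python's `multiprocessing` typically sends objects between processes by pickling them, so the sketch below simulates what a worker receives with a pickle round trip; the variable names are made up for this illustration.

```python
import pickle

stopWords = ["a", "an", "the"]

# Simulate what a worker receives: a pickled-then-unpickled copy.
workerCopy = pickle.loads(pickle.dumps(stopWords))
workerCopy.append("or")  # a change made "inside the worker"

print(stopWords)                 # the original list is unchanged
print(workerCopy)                # only the copy has the new word
print(workerCopy is stopWords)   # they are two distinct objects
```

This is why changes made to a variable inside a sub-process function are not visible in the main process.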