
In this tutorial we will see what Python `generators` are and what important problem they solve.

In [None]:
# useful imports
import tensorflow as tf
import numpy as np

# Python Generators

## The Problem


Let's start with an example: the `fibonacci function`


In [None]:
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)
fib(6)

This function returns the `nth Fibonacci` number.

In [None]:
# Now let's get the fibonacci sequence:
def fib_sequence(n):
    seq = []
    for i in range(n):
        seq.append(fib(i))
    return seq

Use this code to get the first `n` numbers of the Fibonacci sequence.

Let's see what the problem with this implementation might be.

What if we set `n=1000`, `n=10000`, `n=100000`, and so on?

In [None]:
for fib_number in fib_sequence(36):
    print(fib_number)

Even for an input as small as `n=36`, this takes a noticeably long time, because the recursive `fib` recomputes the same values over and over.
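To get a feel for how much work the recursion repeats, here is a rough sketch that counts the calls made by a recursion equivalent to `fib` above. The names `fib_counting` and `call_count` are introduced here purely for illustration.

```python
call_count = 0

def fib_counting(n):
    """Same recursion as fib, but counts how many calls are made."""
    global call_count
    call_count += 1
    if n < 2:
        return n
    return fib_counting(n - 1) + fib_counting(n - 2)

result = fib_counting(20)
# The number of calls grows roughly as fast as the Fibonacci numbers themselves.
print(result, call_count)
```

For `n=20` this already makes tens of thousands of calls, and `fib_sequence(36)` repeats that kind of work for every index up to 35.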

We also notice that nothing is printed until the whole sequence has been computed.

What if we wanted to get the first results and use them while the remaining values are still to be computed, so that we don't have to wait for all the data to be generated?

This is where `generators` come in.


## What are `generators`?

### Definition and usage.

A generator in Python is a function that produces a sequence of results one at a time, pausing after each result until the next one is requested.
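To make the definition concrete, here is a minimal sketch with a hypothetical generator `count_up_to` (not part of this tutorial's Fibonacci code). It uses the `yield` keyword to hand back one value at a time, and the built-in `next()` to request each value.

```python
import types

def count_up_to(limit):
    """Yield 1, 2, ..., limit, one value per request."""
    current = 1
    while current <= limit:
        yield current
        current += 1

counter = count_up_to(3)
# Calling the function only creates a generator object; the body has not run yet.
is_generator = isinstance(counter, types.GeneratorType)
first = next(counter)   # runs the body up to the first yield
second = next(counter)  # resumes after the yield, with `current` preserved
print(is_generator, first, second)
```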

To use a generator in our example we would have to rewrite the `fib_sequence` function like this:


In [None]:
def fib_generator(n):
    # seq = []
    for i in range(n):
        # seq.append(fib(i))
        yield fib(i)
    # return seq

Here we have replaced the `seq.append(...)` call with the `yield` keyword, so we no longer need the temporary `seq` variable or the final `return`.

Now let's see how this function can be used.

In [None]:
fib_generated_sequence = fib_generator(6)

In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

Here we can see how each element of the sequence is generated one at a time, using the built-in function `next()`.

We say that the generator is being *_consumed_*.

Notice what happens when the generator gets fully consumed:

In [None]:
next(fib_generated_sequence)


Since we asked for only the first 6 elements of the sequence, the generator raises a `StopIteration` exception once it has nothing left to produce. That is perfectly normal.
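If hitting the exception is inconvenient, `next()` also accepts a default value as a second argument, which is returned instead of raising `StopIteration`. A small sketch, using a throwaway generator `two_values` introduced here for illustration:

```python
def two_values():
    """Yield exactly two values, then finish."""
    yield "a"
    yield "b"

gen = two_values()
consumed = [next(gen), next(gen)]

# The generator is now exhausted: a bare next() raises StopIteration...
try:
    next(gen)
    raised = False
except StopIteration:
    raised = True

# ...while next() with a default returns the default instead.
fallback = next(gen, None)
print(consumed, raised, fallback)
```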

### Consuming a generator with a `list`

There is, as you can probably guess, another way of consuming the whole generator:

In [None]:
# Here we rebuild the generator:
fib_generated_sequence = fib_generator(6)

In [None]:
# Here we consume all the generator
list(fib_generated_sequence)

This returns all the elements of the generator.

So the other way of consuming all the data in a generator is to call `list` on it, which collects its elements into a list.

Notice again what happens when we try to continue consuming that generator:

In [None]:
# Using next function
next(fib_generated_sequence)

This exception again means the generator has been fully consumed.


In [None]:
# Or using list
list(fib_generated_sequence)

Here it shows an empty list, meaning that the generator has been fully consumed.

So be careful when you consume a generator that has already been consumed!

You have to rebuild the generator any time you want to consume its values again.
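One way to make rebuilding less error-prone is to keep the generator function around and call it whenever a fresh pass is needed, rather than holding on to a used generator object. A sketch with a hypothetical `squares` generator:

```python
def squares(n):
    """Yield the squares of 0..n-1."""
    for i in range(n):
        yield i * i

gen = squares(4)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # already exhausted, so this is empty

# Calling the generator function again gives a fresh, independent generator.
fresh_pass = list(squares(4))
print(first_pass, second_pass, fresh_pass)
```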

Now the nice thing about generators is that they can also be consumed using a `for loop`:


### Consume a generator with a `for-loop`

In [None]:
# Rebuild the generator
fib_generated_sequence = fib_generator(6)

In [None]:
for fib_number in fib_generated_sequence:
    print(fib_number)

The generator has again been fully consumed:

In [None]:
next(fib_generated_sequence)


### The advantage of a generator

Let's now see what the advantage of using a generator is.

Remember that our initial problem was that we had to wait for all of the first 36 elements of the sequence to be computed before any of them could be printed.

Now what happens in the case of generators?

In [None]:
# Let's rebuild the generator that generates the 36 elements
fib_generated_sequence = fib_generator(36)

There we have our generator ready to be consumed, element by element, without waiting for the whole sequence of 36 elements to be retrieved.

Here we can just retrieve the first element if we wish, or the second too, and the third:


In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

In [None]:
next(fib_generated_sequence)

We are consuming our sequence of 36 elements one by one!

We can also get the `next` 10 elements, starting from the generator's current state.


In [None]:
for i in range(10):
    print(next(fib_generated_sequence))
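The standard library's `itertools.islice` does the same thing without an explicit `next()` loop: it takes up to a given number of items from wherever the generator currently is. A sketch, using a hypothetical `naturals` generator rather than `fib_generator` so that it runs instantly:

```python
from itertools import islice

def naturals(limit):
    """Yield 0, 1, ..., limit - 1."""
    for i in range(limit):
        yield i

numbers = naturals(36)
next(numbers)  # consume 0
next(numbers)  # consume 1

# Take the next 10 elements from the generator's current position.
next_ten = list(islice(numbers, 10))
print(next_ten)

# The generator keeps its position, so consumption continues right after them.
following = next(numbers)
print(following)
```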

Or we can consume the rest of the generator from this point using `list`:


In [None]:
print(list(fib_generated_sequence))

But now let's see why we no longer have to wait until the end of computation to print each and every element of the fibonacci sequence.


In [None]:
# Let's rebuild the generator that generates the 36 elements
fib_generated_sequence = fib_generator(36)

In [None]:
for fib_number in fib_generated_sequence:
    if fib_number <= 500000:
        continue  # skip the small values so we can watch the last, slowest elements being generated
    print(fib_number)

Nice.

We can see the great flexibility that the generator offers in order to handle huge amounts of computation.

This means a generator allows us to use each element as soon as it is produced, before the remaining elements have been computed.
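This interleaving of production and consumption can be made visible by recording events in order. In the sketch below, `events` and `logged_squares` are illustrative names, not part of the tutorial's code.

```python
events = []

def logged_squares(n):
    """Yield squares of 0..n-1, logging the moment each one is computed."""
    for i in range(n):
        events.append(f"compute {i}")
        yield i * i

for value in logged_squares(3):
    # Each value is used before the next one is computed.
    events.append(f"use {value}")

print(events)
```

The log alternates between "compute" and "use" entries, which is exactly what a list-returning function cannot do: it would log every "compute" before the first "use".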

This is the idea we're going to use when training a machine learning model on a huge amount of data, where preprocessing takes a lot of time and the full dataset takes up too much memory.

We are going to train our neural network on the part of the data that has already been generated (and possibly preprocessed), without waiting for the whole preprocessing to finish.
This also allows us to load just enough data into RAM (one batch) at a time, work with it, and then request the next batch.
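The batching idea can be sketched independently of any deep learning library. The helper `batches` below is hypothetical: it groups items from any iterable into lists of at most `batch_size` elements and hands them over one batch at a time.

```python
def batches(iterable, batch_size):
    """Yield lists of up to batch_size items taken from iterable."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch

first_batch = next(batches(range(10), 4))

batch_sizes = []
for batch in batches(range(10), 4):
    # Only the current batch needs to be in memory at this point.
    batch_sizes.append(len(batch))

print(first_batch, batch_sizes)
```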


# Application: Training a Deep Learning Model using a Generator


## Train a simple model

Let's build a simple neural network with Keras.

In [None]:
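# Build a simple binary classifier
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=(5,)))
model.add(tf.keras.layers.Dense(32))
# Sigmoid output so that BinaryCrossentropy receives probabilities, not raw logits
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# metrics must be a list; 'accuracy' resolves to binary accuracy for this loss
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

model.summary()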

Traditionally, to train such a model we create data like this


In [None]:
def get_batch(batch_size=30, features=5):
    # Random inputs of shape (batch_size, features)
    X = np.random.rand(batch_size, features)
    # Random binary labels of shape (batch_size, 1)
    Y = (np.random.rand(batch_size, 1) > .5).astype('int32')
    return X, Y

In [None]:
# Here we use a function to generate our data
def get_data(size, batch=30, features=5):
    dataX = []
    dataY = []
    i = 0
    while i < size//batch:
        # Draw one batch and keep its X and Y together
        X, Y = get_batch(batch, features)
        dataX.extend(X)
        dataY.extend(Y)
        # Every batch is accumulated before anything is returned: this takes time and memory!
        i+=1
    return np.array(dataX), np.array(dataY)

Then train the model


In [None]:
X, Y = get_data(size=30000)
model.fit(X, Y, epochs=5)

## Train on a generator

Now we replace this function with a generator that produces one batch of `batch` data points at a time.


The idea is to generate one batch of data, give it to the model to train on, then generate the next batch and train on that, and so on.

The advantage is that we don't need to wait for all the batches of the entire dataset to be generated before starting to train our model. This is also what keeps the RAM from being overloaded with all the data at once.

Let's take a look:

In [None]:
def get_data_generator(size, batch=30, features=5):
    # size: controls how many batches are yielded (the loop below runs size + 1 times)
    # batch: the number of samples in one batch
    # features: the number of features per sample
    i = 0
    while i <= size:
        # `yield` replaces the concatenation step of `get_data`
        # and is what makes this function a generator
        yield get_batch(batch, features)
        i += 1

Let's test this data generator first.

In [None]:
for batch in get_data_generator(3000):
    print(batch)

In the previous output we can see the data being generated one batch at a time, which is what we needed.

Now we can pass this generator to our model to train incrementally.

In [None]:
data_generator = get_data_generator(size=30000)
# 30000 samples in batches of 30 (the generator's default) gives 1000 steps per epoch
model.fit(data_generator, epochs=5, steps_per_epoch=30000//30)

## Comparisons: the old way vs the new one

What are the advantages of using a Python generator over the old method?

They become clear when we are dealing with huge amounts of data.

In the previous example our dataset had only 30,000 data points, so the difference isn't that noticeable.

Here we are going to build a dataset large enough to demonstrate the comparison.

In [None]:
# Here we reuse our get_data function
X, Y = get_data(3000000) # here we have up to 3 million data points
model.fit(X, Y, epochs=2, batch_size=300)

In the previous code execution we can see how long it takes to create the full dataset before training even starts.

We now remedy this by using the generator instead.

In [None]:
# Here we reuse our get_data_generator
data_generator = get_data_generator(3000000, batch=300) # batches of 300, matching batch_size=300 above
# 3 million samples in batches of 300 gives 10000 steps per epoch
model.fit(data_generator, epochs=2, steps_per_epoch=3000000//300)

In this code execution, training starts almost as soon as we run the cell, because the model only needs the first batch of `batch` data points from `get_data_generator` to begin training.
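One caveat, stated here as an assumption about Keras behaviour rather than something demonstrated above: when `steps_per_epoch` is set, `fit` draws `epochs * steps_per_epoch` batches from the generator in total, so the generator must be able to supply at least that many. A common pattern is a generator that loops forever and lets `steps_per_epoch` define the epoch length. The sketch below uses only the standard library (`random` in place of NumPy) and a hypothetical name, `endless_batches`; it illustrates the looping pattern, not this tutorial's exact data.

```python
import random
from itertools import islice

def endless_batches(batch=30, features=5, seed=0):
    """Yield (X, Y) batches forever; the consumer decides when to stop."""
    rng = random.Random(seed)
    while True:
        X = [[rng.random() for _ in range(features)] for _ in range(batch)]
        Y = [[int(rng.random() > 0.5)] for _ in range(batch)]
        yield X, Y

# Simulate what a training loop would do: draw epochs * steps_per_epoch batches.
epochs, steps_per_epoch = 2, 3
drawn = list(islice(endless_batches(), epochs * steps_per_epoch))
print(len(drawn), len(drawn[0][0]), len(drawn[0][0][0]))
```

Such a generator must never be passed to `list` or a plain `for` loop without a stopping condition, since it never ends on its own.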

**CONCLUSION**:

Next time you are dealing with huge amounts of data in your machine learning projects, try to take advantage of generators: they can save you a lot of time and memory!

Generators become especially useful when the data comes from an outside source such as a database, an IoT system or the Web.
We can also use generators to do `online learning` in reinforcement learning, among many other examples.