# 04 - Data Augmentation

Machine learning algorithms need a lot of data. One way to increase the size of an image dataset is to perform data augmentation. The goal is to apply modifications such as rotation or rescaling to the raw images and add the resulting images as if they were new data samples. We will use `ImageDataGenerator` from the Keras library to do data augmentation.

Be sure that you have Keras installed.


In [1]:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

import time

Using TensorFlow backend.


## Single Image

Use the `imageio` library again to load the image named `humming_bird.jpg`. Divide the values by 255 to rescale the pixel values to the range 0 to 1.

In [2]:
# Your code here
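
As a minimal sketch of the rescaling step: the snippet below uses a random 8-bit array as a stand-in for the loaded image, because `humming_bird.jpg` is not available here. The array size and the variable names `image_raw` and `image` are assumptions made for illustration.

```python
import numpy as np

# Stand-in for the loaded picture: a random 8-bit RGB image.
# With the real file you would instead write, for example:
#   import imageio
#   image_raw = imageio.imread("humming_bird.jpg")
rng = np.random.default_rng(0)
image_raw = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)

# Rescale the pixel values from [0, 255] to [0, 1]
image = image_raw / 255.0

print(image.dtype, image.min(), image.max())
```

Dividing an integer array by a float returns a floating point array, so no explicit cast is needed.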


### Width Shifts

You will create new images from this image that are modified versions using `ImageDataGenerator`. To do that you will need to:

- Create an instance of `ImageDataGenerator` called `datagen`. You can choose the following parameter for the transformation: `width_shift_range=0.3` (0.3 is the proportion of the image width). The instance is called a generator: it is something you can iterate on to create batches of new images.

- Use the `flow` method to do the iterations. Just be careful: the iterator never stops by itself, so a plain `for` loop would run forever and you need to find a way to stop it. For now, start by creating one new image.

- Plot the resulting image to see how it is different from the original one.

Try to do it several times to see that it creates a different image each time.

Be careful: in the `flow` method, the input must have 4 dimensions: `(number of images, height, width, number of channels)`.
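
One way to get this 4-dimensional shape from a single image is to add a leading batch axis with NumPy. This is only a sketch: the zero-filled array stands in for the rescaled image, and its size is an assumption.

```python
import numpy as np

# Stand-in for the rescaled image: (height, width, channels)
image = np.zeros((120, 160, 3))

# Add a leading axis so the array becomes a batch containing one image
batch = np.expand_dims(image, axis=0)

print(batch.shape)  # (1, 120, 160, 3)
```

Writing `image[np.newaxis, ...]` gives the same result.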

In [5]:
# Your code here
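
One possible sketch of these steps is shown below. It assumes a random array in place of the humming bird image and uses `next()` to take a single batch from the iterator, which is one way among others to avoid the infinite loop. The fallback import is an assumption for installations where `ImageDataGenerator` is only exposed through `tensorflow.keras`.

```python
import numpy as np

try:
    from keras.preprocessing.image import ImageDataGenerator
except ImportError:  # some installs only expose it through TensorFlow
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in for the rescaled image (values in [0, 1])
rng = np.random.default_rng(0)
image = rng.random((120, 160, 3))

# Generator that shifts images horizontally by up to 30% of their width
datagen = ImageDataGenerator(width_shift_range=0.3)

# flow() expects (number of images, height, width, number of channels)
batch = np.expand_dims(image, axis=0)

# The iterator never ends on its own, so take exactly one batch
iterator = datagen.flow(batch, batch_size=1)
new_batch = next(iterator)
new_image = new_batch[0]

print(new_image.shape)
```

You could then display the result with `matplotlib`, for example `plt.imshow(new_image)`, and call `next(iterator)` again to get a differently shifted image.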


### More Variety

You should see that each new image is shifted by a random amount.

Now, to illustrate the variety of new images that you can create, build a new data generator using, for instance, these parameters:

```python
width_shift_range=0.3,
height_shift_range=0.3,
shear_range=0.3,
zoom_range=0.3,
horizontal_flip=True,
fill_mode='nearest'
```

Here are a few details about these parameters:

- `rotation_range` (not used in the list above, but also available) is in degrees.
- `width_shift_range` and `height_shift_range` are expressed as a proportion of the image size.
- Shearing is a transformation that shifts part of the image in a direction and the other part in an opposite direction (https://docs.gimp.org/2.10/en/gimp-tool-shear.html).
- `horizontal_flip` allows the generator to flip the image horizontally.
- `fill_mode` controls how the generator fills the empty pixels that appear when a transformation (a shift or a rotation, for instance) moves the image content away from the border. With `'nearest'`, they take the value of the nearest existing pixel.

In [9]:
# Your code here
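
A hedged sketch using the parameters listed above follows. It again works on a random stand-in image, and it stops the loop with `break` after four images; the count of four and the variable names are arbitrary choices for illustration.

```python
import numpy as np

try:
    from keras.preprocessing.image import ImageDataGenerator
except ImportError:  # some installs only expose it through TensorFlow
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in for the rescaled image, already shaped as a batch of one
rng = np.random.default_rng(0)
batch = rng.random((1, 120, 160, 3))

datagen = ImageDataGenerator(
    width_shift_range=0.3,
    height_shift_range=0.3,
    shear_range=0.3,
    zoom_range=0.3,
    horizontal_flip=True,
    fill_mode='nearest',
)

# Collect a few augmented images, then leave the otherwise endless loop
new_images = []
for new_batch in datagen.flow(batch, batch_size=1):
    new_images.append(new_batch[0])
    if len(new_images) >= 4:
        break

print(len(new_images), new_images[0].shape)
```

Each element of `new_images` can be plotted side by side to compare the random transformations.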


Try to run your code multiple times to see the results.

**Question**: Try to explain to your buddy why and in what cases it can be useful to do these transformations.

## MNIST Fashion

This time, you will apply data augmentation to the whole MNIST Fashion dataset instead of a single image. The idea is the same: you create an image generator with the parameters you want, then you iterate with the `flow` method. You can use a `batch_size` of 60000 to get a new dataset with the same number of samples as `X_train` (you will then add it to `X_train` to double the size of the dataset).

To get the corresponding label of the new images, you can do `for X_batch, y_batch in datagen.flow(`... and pass `X_train` and `y_train` as parameters.

Usually, the documentation of Keras is great and there is a lot of content on the Internet, so feel free to have a look if you are stuck.

Then, plot a few examples corresponding to one class to check that it worked.

Now that you have a lot of new images (60,000 new images), try to concatenate them with `X_train` to create an augmented dataset `X_aug` (of shape `(120000, 28, 28, 1)`). You will also need to create the corresponding new variable `y_aug` with the right labels.

In [22]:
# Your code here
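
The sketch below outlines one way to build `X_aug` and `y_aug`. To stay small and self-contained, it assumes random stand-in arrays of 200 samples instead of the real 60,000 images, and the generator parameters are arbitrary choices, not values prescribed by the exercise.

```python
import numpy as np

try:
    from keras.preprocessing.image import ImageDataGenerator
except ImportError:  # some installs only expose it through TensorFlow
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-ins for the real data (the real X_train has 60000 samples)
rng = np.random.default_rng(0)
n_samples = 200
X_train = rng.random((n_samples, 28, 28, 1))
y_train = rng.integers(0, 10, size=n_samples)

# Arbitrary example parameters
datagen = ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)

# One batch as large as the dataset gives one new image per original image.
# Passing y_train makes flow() return the matching labels with each batch.
for X_batch, y_batch in datagen.flow(X_train, y_train, batch_size=n_samples):
    break  # a single batch is enough; the iterator would never stop

# Stack the original and augmented samples along the first axis
X_aug = np.concatenate([X_train, X_batch], axis=0)
y_aug = np.concatenate([y_train, y_batch], axis=0)

print(X_aug.shape, y_aug.shape)
```

Because `flow` shuffles the samples by default, `y_batch` must be used for the new labels rather than a second copy of `y_train`.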


Now that you have `X_aug` and `y_aug`, use them to train a new random forest model.

In [39]:
# Your code here
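
As a rough sketch, training could look like the following. It assumes scikit-learn's `RandomForestClassifier`, small random stand-in arrays, and an arbitrary `n_estimators` value. The images are flattened first because scikit-learn estimators expect 2-dimensional input of shape `(number of samples, number of features)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for the augmented training set and the test set
rng = np.random.default_rng(0)
X_aug = rng.random((400, 28, 28, 1))
y_aug = rng.integers(0, 10, size=400)
X_test = rng.random((50, 28, 28, 1))

# Flatten each 28x28x1 image into a vector of 784 features
X_aug_flat = X_aug.reshape(len(X_aug), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

# n_estimators is an arbitrary small value to keep the sketch fast
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_aug_flat, y_aug)

y_pred = model.predict(X_test_flat)
print(y_pred.shape)
```

With the real data, `model.score(X_test_flat, y_test)` would give the accuracy to compare with the model trained on the non-augmented dataset.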


Unfortunately, you should see that the accuracy is not as good as with the initial (non-augmented) dataset. This is because the train and test sets of MNIST Fashion are very homogeneous: the test images look like the original training images, so the augmented images differ noticeably from the data used to test the algorithm. This shows that the machine learning algorithms we have used are sensitive to transformations of the images, such as rotation. This is one reason why deep learning algorithms such as *convolutional neural networks* (CNNs) are mostly used for computer vision.
