## Manual data augmentation

This notebook consolidates the data augmentation code into a few lines.
The underlying scripts can be reviewed in `src/image_handling`.
`src.utils` should not need any further explanation.

The code essentially emulates the behavior of the Keras image generator, but also accepts an additional parameter, `repetitions`.
The modifications are applied while the data is processed into the interim folder, before training.
The manipulations therefore do not take place during training, but in a separate preprocessing phase that is independent of training time.
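
To illustrate what augmenting ahead of training with a `repetitions` parameter can look like, here is a minimal sketch. It is not the code from `src/image_handling`: the function `augment_offline` and its helpers are hypothetical, images are plain nested lists, and rotation is left out to keep the example dependency-free.

```python
import random


def flip_horizontal(image):
    """Mirror each row of a nested-list image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the row order of a nested-list image."""
    return image[::-1]


def augment_offline(image, repetitions=1, h_flip=False, v_flip=False, seed=0):
    """Return `repetitions` randomly flipped variants of `image`.

    Hypothetical stand-in for the project's augmentation; rotation is omitted.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(repetitions):
        variant = [row[:] for row in image]  # work on a copy
        if h_flip and rng.random() < 0.5:
            variant = flip_horizontal(variant)
        if v_flip and rng.random() < 0.5:
            variant = flip_vertical(variant)
        variants.append(variant)
    return variants


image = [[1, 2],
         [3, 4]]
variants = augment_offline(image, repetitions=4, h_flip=True, v_flip=True)
print(len(variants))  # 4
```

All variants exist before training starts, so the training loop only has to read them.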

The images are first copied to the `interim` directory.
Once the data is fully in `interim`, it is augmented in place: only the `interim` data is touched, and the result is an `interim` directory containing all the augmented data (in this case 32000 training images per class, i.e. 1000 images times 32 repetitions).
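
The copy-then-augment flow and the resulting file count can be sketched as follows. This is only an illustration under assumptions: `copy_then_augment_in_place` is a hypothetical helper, the "augmentation" merely duplicates bytes, and the sketch assumes the unmodified copy is dropped so that the count equals files times repetitions.

```python
import shutil
import tempfile
from pathlib import Path


def copy_then_augment_in_place(raw_dir, interim_dir, repetitions):
    """Copy all files to `interim_dir`, then replace each by `repetitions` variants."""
    interim_dir = Path(interim_dir)
    shutil.copytree(raw_dir, interim_dir)  # step 1: copy everything first
    for original in sorted(interim_dir.iterdir()):  # step 2: work only on interim
        for i in range(repetitions):
            variant = original.with_name(f"{original.stem}_{i}{original.suffix}")
            # A real pipeline would write a transformed image here;
            # this sketch only duplicates the bytes.
            shutil.copyfile(original, variant)
        original.unlink()  # assumption: the unmodified copy is not kept
    return len(list(interim_dir.iterdir()))


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "raw"
    raw_dir.mkdir()
    for n in range(5):
        (raw_dir / f"img{n}.png").write_bytes(b"placeholder")
    count = copy_then_augment_in_place(raw_dir, Path(tmp) / "interim", repetitions=32)
    print(count)  # 5 files * 32 repetitions = 160
```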

`inline_augment_images` returns a list of dictionaries.
These dictionaries contain all the information needed to create records in the next cell.

The parameters should be self-explanatory.

In [1]:
import shutil
from os.path import join

from src.image_handling import inline_augment_images, encode_record
from src.utils import reset_and_distribute_data

raw = join('data', 'raw')
interim = join('data', 'interim')
processed = join('data', 'processed')

reset_and_distribute_data(raw, interim, [1000, 100, 100])
shutil.rmtree(processed, ignore_errors=True)

target_size = (32, 32)

train = inline_augment_images(join(interim, 'train'),
    repetitions=32, h_flip=True, v_flip=True, rotation_range=360, target_size=target_size)

validation = inline_augment_images(join(interim, 'validation'), target_size=target_size)

test = inline_augment_images(join(interim, 'test'), target_size=target_size)

`encode_record` takes the list described above. The structure of each entry is hardcoded in this instance as:
```python
feature = {
    'image': ...,  # /path/to/image
    'label': ...,  # 1 or 0
    'angle': ...,  # a number in [0, 180)
}
```

These features are then used to create protobufs.
TensorFlow can read protobufs very efficiently, with no decoding overhead at training time, because the decoding already takes place during preprocessing.
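
The idea of decoding once and storing ready-to-use bytes can be shown with a small stand-in. This is explicitly not the TFRecord or protobuf format: it uses Python's `struct` module, and the layout (one byte per pixel, a single channel, a one-byte label, a 32-bit float angle) is an assumption made only for this sketch.

```python
import struct

WIDTH, HEIGHT = 32, 32           # same size as target_size in the first cell
PIXELS = WIDTH * HEIGHT          # assumption: one byte per pixel, single channel
RECORD = struct.Struct(f"<{PIXELS}sBf")  # pixel bytes, label, angle


def pack_record(pixels, label, angle):
    """Serialize already-decoded pixel values together with label and angle."""
    return RECORD.pack(bytes(pixels), label, angle)


def unpack_record(blob):
    """Read a record back; no image decoding is needed at this point."""
    pixels, label, angle = RECORD.unpack(blob)
    return pixels, label, angle


blob = pack_record([0] * PIXELS, label=1, angle=90.0)
pixels, label, angle = unpack_record(blob)
print(len(blob), label, angle)  # 1029 1 90.0
```

Reading a record is a fixed-size unpack, which is the property the paragraph above attributes to the stored protobufs.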

This procedure also gives much more control over how the data is stored. For example, the angle is saved in this instance because it may become interesting in the future to determine the angle of a linear node; data for that task is much easier to generate from vertical lines only, with the augmentation providing the randomization.

The next step is to create data generators for the model that can read these TensorFlow records.

In [2]:
from os import listdir

labels = listdir(raw)

encode_record(train, labels, processed, 'train')
encode_record(validation, labels, processed, 'validation')
encode_record(test, labels, processed, 'test')
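
One detail worth keeping in mind: `listdir` returns entries in arbitrary order, so the position of a class name in `labels` can differ between machines. If the integer label is derived from that position (an assumption, since the implementation of `encode_record` is not shown here), sorting the names first keeps the mapping reproducible. A minimal sketch, with hypothetical class names and a hypothetical helper `label_index`:

```python
def label_index(class_names):
    """Map each class name to a stable integer based on sorted order."""
    return {name: i for i, name in enumerate(sorted(class_names))}


# 'node' and 'background' are made-up class names for illustration.
mapping = label_index(['node', 'background'])
print(mapping)  # {'background': 0, 'node': 1}
```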