```
Data generation

This notebook deals with data augmentation and with generating the train and validation sets.
```

### Data augmentation

Given the small amount of usable data available (572 frames in the latest .h5 file), we decided to use data augmentation to generate more training data. To do so, we used the `imgaug` package, which provides all the transformations we needed.

We restricted ourselves to augmentations that make sense for our task. Since the network has to deal with images of warped, stretched, translated or rotated brains, we applied those transformations to the images and also added noise.

Using `iaa.SomeOf` and `iap.Choice`, we randomly choose 2 to 4 transformations to apply to a given image. Any black artifacts left by rotating or translating are filled with a randomly sampled background value. We used:

- `shearXY` for stretching
- `Rotate` for rotating
- `PiecewiseAffine` for distortion
- `Affine` for zoom (only `scale` is set in the code below; `Affine` also supports translation)
- `AdditiveGaussianNoise` for adding noise
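
The selection scheme described above can be sketched without `imgaug`. The snippet below only illustrates the sampling logic (a uniformly drawn count of 2 to 4, then that many distinct transformations in random order). The `pick_transformations` helper and the string labels are hypothetical, and this is not how `imgaug` implements `SomeOf` internally.

```python
import random

# Labels standing in for the five augmenters listed above (illustrative only)
TRANSFORMS = ["shearXY", "Rotate", "PiecewiseAffine", "Affine", "AdditiveGaussianNoise"]

def pick_transformations(rng):
    """Draw 2, 3 or 4 distinct transformations, in random order."""
    n = rng.choice([2, 3, 4])          # each count has probability 1/3
    return rng.sample(TRANSFORMS, n)   # sampling without replacement

rng = random.Random(0)  # seeded only so that the sketch is repeatable
for _ in range(3):
    print(pick_transformations(rng))
```

Because a new subset is drawn on every call, repeated passes over the same images are very unlikely to apply identical transformation chains.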

In [1]:
from imgaug import augmenters as iaa
from imgaug import parameters as iap
import numpy as np
import glob
from PIL import Image
from util.handle_files import *

%load_ext autoreload
%autoreload 2

In [2]:
#Uniformly sample the background value used to fill black spots
bg = iap.Uniform(16,18)
#Creating a series of augmentations
shearXY = iaa.Sequential([
                           iaa.ShearY(iap.Uniform(-10,10),cval=bg),
                           iaa.ShearX(iap.Uniform(-10,10),cval=bg)
                       ])
rotate = iaa.Rotate(rotate = iap.Choice([-30,-15,15,30],
                                        p=[0.25,0.25,0.25,0.25]),
                    cval = bg)

pwAff = iaa.PiecewiseAffine(scale=(0.01,0.06),
                            cval = bg)

affine = iaa.Affine(scale={"x":iap.Uniform(1.1,1.2),
                                       "y":iap.Uniform(1.1,1.2)},
                    cval = bg)

noise = iaa.AdditiveGaussianNoise(loc=0, scale=(0,0.025*255))
#Using SomeOf to randomly select some augmentations
someAug = iaa.SomeOf(iap.Choice([2,3,4], p = [1/3,1/3,1/3]),
                     [
                         affine,
                         shearXY,
                         pwAff,
                         rotate,
                         noise
                     ], random_order=True)

In [3]:
def augment_images(input_path, augs, times_target=1):
    """
    Augment every .jpg found in input_path (times_target) times and save
    the results in input_path/augmented/.
    input_path must end with a '/'.
    """
    #Load input images
    imgs = []
    print("Loading images from :", input_path)
    for f in glob.iglob(input_path+'*.jpg'):
        imgs.append(np.asarray(Image.open(f)))
        
    #List receiving the augmented images of every pass
    #(each pass draws new random augmentations)
    target_imgs = []
    #Create the directory to receive the images in
    #(no leading slash, since input_path already ends with one)
    target_path = input_path+'augmented/'
    target_path = makedir(target_path)
    
    #Applying (times_target) times the augmentation
    for i in range(times_target):
        print("Augmenting", i+1, "/", times_target)
        target_imgs += augs.augment_images(imgs)
        
    print("Saving images")
    for idx, augmented in enumerate(target_imgs):
        img = Image.fromarray(augmented)
        img.save(target_path+'augmented_frame_'+str(idx)+'.jpg')
    print("Done")

For now, we decided to use only RGB images.
For each batch, the transformations are selected randomly through `SomeOf` and `iap.Choice`. This way, even if we augment 10 times with the same set of transformations, the results differ from one pass to the next.

In the end, we get a 10-fold augmentation yielding a total of 5720 images (including the originals) for training and validation.


In [None]:
#Datapaths
paths = ['../worm_data/h5_data/512/rgb_frames/',
         '../worm_data/h5_data/240/rgb_frames/']

#Augmenting for all
for path in paths:
    augment_images(path, someAug, 10)

### Generating train/validation sets

This part initially required some manual handling. `generate_trainval.py` now removes the need for it by directly extracting the data, generating the augmentations and writing the csv files in the appropriate folders.

For a given training dataset:
- create a folder with an appropriate name in `./datasets/` containing a folder named `TrainVal`, and move all the images into it.
- create a folder with the same name, suffixed with `-random`, in `./training_data/`

For example, for the **rgb240_augmented** dataset:

- create `./datasets/rgb240_augmented/TrainVal` which will contain all the images
- create `./training_data/rgb240_augmented-random/`
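
The folder layout above could also be created programmatically. This is a minimal sketch, assuming only the directory names given in the example; the `prepare_dataset_dirs` helper is hypothetical and is not part of `util.handle_files` or `generate_trainval.py`. The images still have to be moved into `TrainVal` afterwards.

```python
import tempfile
from pathlib import Path

def prepare_dataset_dirs(name, root="."):
    """Create datasets/<name>/TrainVal and training_data/<name>-random under root."""
    root = Path(root)
    image_dir = root / "datasets" / name / "TrainVal"
    csv_dir = root / "training_data" / f"{name}-random"
    # exist_ok=True makes the call safe to repeat
    image_dir.mkdir(parents=True, exist_ok=True)
    csv_dir.mkdir(parents=True, exist_ok=True)
    return image_dir, csv_dir

# Demonstration in a temporary directory so that nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    image_dir, csv_dir = prepare_dataset_dirs("rgb240_augmented", root=tmp)
    print(image_dir.relative_to(tmp))
    print(csv_dir.relative_to(tmp))
```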

In [5]:
import csv, os
import pandas as pd
from sklearn.model_selection import train_test_split

In [6]:
def create_csv_split(image_dir, csv_dir='', ratio=0.2):
    """
    Create csv files containing the filenames to use in each of the 
    training and validation sets.
    
    If csv_dir is empty, it is derived from the dataset name in image_dir
    (./datasets/<name>/TrainVal -> ./training_data/<name>-random/).
    """
    if csv_dir == '':
        dataset = image_dir.split('datasets/')[1].split('/TrainVal')[0]
        csv_dir = './training_data/'+dataset+'-random/'
    csv_dir = makedir(csv_dir)
    
    #Write one image path per row, relative to the dataset folder
    #(the with block closes the file, no explicit close() needed)
    with open(csv_dir+'data.csv', 'w', newline='') as writeFile:
        writer = csv.writer(writeFile)
        writer.writerow(['image'])
        for filename in os.listdir(image_dir):
            writer.writerow(['TrainVal/'+filename])
    
    #Randomly split the rows into training and validation sets
    df = pd.read_csv(csv_dir+'data.csv')
    train, val = train_test_split(df, test_size=ratio)
    
    train.to_csv(csv_dir+'train.csv', index=False)
    val.to_csv(csv_dir+'val.csv', index=False)

In [9]:
imgdr = ['./datasets/rgb240_augmented/TrainVal', 
         './datasets/red240_augmented/TrainVal',
         './datasets/rgb512_augmented/TrainVal']

csvdr = ['./training_data/rgb240_augmented-random/',
         '', #use empty str here to test my method
         './training_data/rgb512_augmented-random/']

for x,y in zip(imgdr, csvdr):
    create_csv_split(x,y)

**Getting dataset means and stds to be used for standardization**

In [1]:
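from skimage import io
import numpy as np
#list_files is not defined in this cell; it is assumed to come from
#util.handle_files, as in the first cell of the notebook
from util.handle_files import *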

x = list_files('./datasets/rgb512_augmented/TrainVal/', 'jpg')
imgs = []
for f in x :
    imgs.append(io.imread(f))
arr = np.asarray(imgs)

In [2]:
print("red",arr[...,0].mean()/255, arr[...,0].std()/255)
print("green", arr[...,1].mean()/255, arr[...,1].std()/255)

red 0.0742920482408156 0.01979817471748977
green 0.0754988783485235 0.03203817782416048


In [3]:
print("blue", arr[...,2].mean()/255, arr[...,2].std()/255)

blue 0.00703948247409379 0.015462965792484733


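From a previous notebook, the datasets were found to have the following means and stds (RGB order, pixel values scaled to [0, 1]). They match the values computed above, rounded to four decimal places: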
```
means = [0.0743, 0.0755, 0.0070]
stds = [0.0198, 0.0320, 0.0155]
```
These values will be used for standardization.
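
As an illustration of how these values might be applied, here is a minimal sketch assuming the usual convention of scaling pixels to [0, 1] before subtracting the mean and dividing by the std of each channel. The `standardize` helper and the synthetic frame are hypothetical; the actual training code is not shown in this notebook and may differ.

```python
import numpy as np

# Per-channel statistics quoted above (RGB order, pixel values scaled to [0, 1])
MEANS = np.array([0.0743, 0.0755, 0.0070])
STDS = np.array([0.0198, 0.0320, 0.0155])

def standardize(img_uint8):
    """Scale an (H, W, 3) uint8 image to [0, 1], then standardize each channel."""
    scaled = img_uint8.astype(np.float64) / 255.0
    return (scaled - MEANS) / STDS  # broadcasts over the last (channel) axis

# Synthetic stand-in for a frame, filled with a constant value
frame = np.full((4, 4, 3), 17, dtype=np.uint8)
out = standardize(frame)
print(out.shape, out.dtype)
```

After this step, a pixel whose scaled value equals the channel mean maps to 0, and one std above the mean maps to 1.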
