# Brownspot Disease Data Preprocessing 

Preprocessing is the general term for every transformation applied to the data before it is fed into the model. Typical steps include centering, normalization, shifting, rotation, shearing and so on. There are two main reasons to preprocess data:

   - Cleaning up the data
   - Augmenting the data
   


**Importing Libraries**

In [None]:
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import tensorflow as tf
import cv2

In [None]:
import os
import json
import shutil

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img


# Data Preparation

In [None]:
train_path = 'C:\\Users\\Keith\\Desktop\\Brownspot Detector\\3. Data Preprocessing\\train'
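
The cells that follow call `compare_images`, which is not defined anywhere in this notebook. Below is a minimal sketch of such a helper, assuming it is meant to show two images side by side. The name `to_displayable`, the default titles, and the min-max scaling used for display are assumptions, not the original implementation.

```python
import numpy as np


def to_displayable(img):
    """Min-max scale an array to [0, 1] so that imshow can render it.

    Hypothetical helper: centered or standardized images contain negative
    values, which matplotlib would otherwise clip.
    """
    img = np.asarray(img, dtype=np.float32)
    if img.ndim == 3 and img.shape[-1] == 1:
        img = img[..., 0]  # drop the channel axis of grayscale images
    lo, hi = img.min(), img.max()
    if hi == lo:  # constant image: avoid dividing by zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def compare_images(left, right, titles=("Original", "Transformed")):
    """Plot two images next to each other (assumed behaviour)."""
    # Imported here so the scaling helper can be used without a plotting backend.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    for ax, img, title in zip(axes, (left, right), titles):
        shown = to_displayable(img)
        ax.imshow(shown, cmap="gray" if shown.ndim == 2 else None)
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```

Because the display copy is rescaled, the plot shows relative contrast only; the numeric effect of a transformation is better checked by printing array slices.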

# Cleaning Transformations 

**1. Rescaling**

Rescaling is an operation that moves the data from one numerical range to another by simple division using a predefined constant.

In deep neural networks, we usually want to restrict the input to the range 0 to 1 to avoid possible overflow, optimization and stability issues.
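
As a small illustration of the arithmetic, here is a NumPy sketch that applies the same constant to a made-up 2 x 2 RGB image. The pixel values are invented for the example.

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit intensities stored as floats.
pixels = np.array([[[0, 128, 255], [64, 64, 64]],
                   [[255, 255, 255], [10, 20, 30]]], dtype=np.float32)

# Same arithmetic as ImageDataGenerator(rescale=1. / 255.): one multiplication.
rescaled = pixels * (1.0 / 255.0)

print(rescaled.min(), rescaled.max())
```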

In [None]:
img_height = 4000
img_width = 3000

In [None]:
datagen_rescaled = ImageDataGenerator(rescale=1. / 255.)
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_rescaled = datagen_rescaled.flow_from_directory(train_path, 
                                                    target_size=(img_height, img_width), 
                                                    batch_size=1, 
                                                    shuffle=False, 
                                                    class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_rescaled = next(gen_rescaled)
compare_images(sample_default[0], sample_rescaled[0])

In [None]:
sample_default[0][:2, :2, 0]

In [None]:
sample_rescaled[0][:2, :2, 0]

**2. Grayscaling**

Grayscaling turns a color (RGB) image into a single-channel image in which each pixel is a shade of gray.
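
A minimal sketch of the conversion, assuming the ITU-R 601 luma weights that Pillow's `convert("L")` uses. Keras loads images through Pillow, so `color_mode="grayscale"` should behave similarly up to rounding. The function name `rgb_to_gray` is hypothetical.

```python
import numpy as np

# ITU-R 601 luma weights for the red, green and blue channels.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def rgb_to_gray(img):
    """Collapse an (H, W, 3) RGB array into an (H, W, 1) luma array."""
    gray = img.astype(np.float32) @ LUMA_WEIGHTS  # weighted sum over channels
    return gray[..., np.newaxis]                  # keep an explicit channel axis


rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.float32)
gray = rgb_to_gray(rgb)
print(gray.shape)
```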

In [None]:
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_grayscaled = datagen_default.flow_from_directory(train_path, 
                                                     target_size=(img_height, img_width), 
                                                     batch_size=1, 
                                                     shuffle=False, 
                                                     class_mode=None, 
                                                     color_mode="grayscale")

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_grayscaled = next(gen_grayscaled)
compare_images(sample_default[0], sample_grayscaled[0])

**3. Samplewise Centering**  **- across features inside one sample**

The raw data values in an image range from 0 to 255, so one sample is a 3D array of numbers from 0 to 255. We might want to normalize the dataset such that the mean value of each data sample is equal to 0.

Therefore, we calculate the mean value across one whole sample and subtract it from each number in it.
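
The same computation can be sketched in plain NumPy on a small random array standing in for one image. The array and its size are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(21)
# Stand-in for a single image: 4 x 4 pixels, 3 channels, values in 0..255.
sample = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float32)

# One mean over every pixel and channel of this sample, subtracted everywhere.
centered = sample - sample.mean()

print(sample.mean(), centered.mean())
```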


In [None]:
datagen_samplewise_mean = ImageDataGenerator(samplewise_center=True)
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_samplewise_mean = datagen_samplewise_mean.flow_from_directory(train_path, 
                                                                  target_size=(img_height, img_width), 
                                                                  batch_size=1, 
                                                                  shuffle=False,  
                                                                  class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_samplewise_mean = next(gen_samplewise_mean)
compare_images(sample_default[0], sample_samplewise_mean[0])

**4. Samplewise std normalization**  **- across features inside one sample**

This is similar to samplewise centering, but instead of setting the mean value to 0, we set the standard deviation to 1 by dividing each sample by its own standard deviation.

Std normalization is controlled by the Keras option **samplewise_std_normalization**. It is common practice to use the two samplewise normalization options together.

This can improve optimization stability by reducing the influence of the exploding gradients problem.
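
A short NumPy sketch of the idea on a made-up sample, including the combined centering and scaling. The small `EPSILON` guard against a zero standard deviation is an assumption of this sketch; to my understanding Keras adds a similar constant.

```python
import numpy as np

EPSILON = 1e-6  # assumed guard against dividing by zero on constant images

rng = np.random.default_rng(21)
sample = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float32)

# Std normalization alone: divide by the sample's own standard deviation.
scaled = sample / (sample.std() + EPSILON)

# Both samplewise options together: zero mean and unit standard deviation.
standardized = (sample - sample.mean()) / (sample.std() + EPSILON)

print(scaled.std(), standardized.mean(), standardized.std())
```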


In [None]:
datagen_samplewise_std = ImageDataGenerator(samplewise_std_normalization=True)
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_samplewise_std = datagen_samplewise_std.flow_from_directory(train_path, 
                                                                target_size=(img_height, img_width), 
                                                                batch_size=1, 
                                                                shuffle=False, 
                                                                class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_samplewise_std = next(gen_samplewise_std)
compare_images(sample_default[0], sample_samplewise_std[0])

**5. Featurewise centering** **- across samples in dataset**

Here each number in the image array is treated as a feature, so each image is represented by a vector of features.
Because the dataset contains many such vectors, we can treat them as samples drawn from some unknown distribution.

The distribution is multivariate, and the dimension of its space equals the number of features, which is **width x height x 3**.

Although we do not know the true distribution of the data, we can center it by subtracting an estimate of its mean.

The mean is a vector with the same dimension as the space, i.e., it is itself an image. (***we average across the dataset and not across one sample***)

First, the whole dataset is read into memory by setting the batch size to the size of the dataset.

Then the mean image across the dataset is calculated.

Finally, the mean is subtracted from the test image.
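
At 4000 x 3000 pixels with three 32-bit float channels, each image takes roughly 144 MB, so reading the whole dataset in one batch can exhaust memory. The sketch below shows one possible alternative: accumulating the mean image batch by batch. The function name `running_mean_image` and the random stand-in data are assumptions made for illustration.

```python
import numpy as np


def running_mean_image(batches):
    """Average images across the dataset one batch at a time.

    `batches` is any iterable of arrays shaped (N, H, W, C).
    """
    total = None
    count = 0
    for batch in batches:
        batch = np.asarray(batch, dtype=np.float64)
        if total is None:
            total = np.zeros(batch.shape[1:], dtype=np.float64)
        total += batch.sum(axis=0)  # add up images, keeping pixel positions
        count += batch.shape[0]
    if count == 0:
        raise ValueError("no images were provided")
    return total / count


rng = np.random.default_rng(21)
# Stand-in dataset: 6 tiny images instead of the real training folder.
dataset = rng.integers(0, 256, size=(6, 4, 4, 3)).astype(np.float32)

mean_image = running_mean_image(np.array_split(dataset, 3))
centered = dataset - mean_image  # broadcasting subtracts it from every image

print(mean_image.shape)
```

The result matches `dataset.mean(axis=0)`; only the peak memory use differs.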



In [None]:
datagen_default = ImageDataGenerator()

In [None]:
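# First pass: build a generator only to find out how many images there are.
gen_dataset = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  shuffle=False, 
                                                  class_mode=None)
# Second pass: use that count as the batch size so that one call to next()
# returns the entire dataset.
gen_dataset = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  shuffle=False, 
                                                  batch_size=gen_dataset.n, 
                                                  class_mode=None)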

In [None]:
dataset = next(gen_dataset)
dataset.shape

In [None]:
mean_image = dataset.mean(axis=0)

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_featurewise_mean = sample_default - mean_image
compare_images(sample_default[0], sample_featurewise_mean[0])

**6. Featurewise std normalization** **- across samples in dataset**

The idea behind featurewise standard deviation normalization is exactly the same as featurewise centering. The difference is that here we divide by the standard deviation computed across the dataset instead of subtracting the mean value.
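
A minimal NumPy sketch on made-up data. It also shows why a small constant in the denominator is useful: a pixel position that never changes across the dataset has a standard deviation of zero. The `EPSILON` value is an assumption of this sketch.

```python
import numpy as np

EPSILON = 1e-6  # assumed guard for positions with zero standard deviation

rng = np.random.default_rng(21)
dataset = rng.integers(0, 256, size=(6, 4, 4, 3)).astype(np.float32)
dataset[:, 0, 0, :] = 255.0  # one position that is constant across the dataset

# One standard deviation per pixel position and channel, taken across images.
std_image = dataset.std(axis=0)
normalized = dataset / (std_image + EPSILON)

print(std_image.shape, std_image[0, 0])
```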



In [None]:
datagen_default = ImageDataGenerator()

In [None]:
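# First pass: build a generator only to find out how many images there are.
gen_dataset = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  shuffle=False, 
                                                  class_mode=None)
# Second pass: use that count as the batch size to load everything at once.
gen_dataset = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  shuffle=False, 
                                                  batch_size=gen_dataset.n, 
                                                  class_mode=None)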

In [None]:
dataset = next(gen_dataset)
dataset.shape

In [None]:
std_image = dataset.std(axis=0)

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
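# The small constant avoids division by zero where a pixel is constant across the dataset.
sample_featurewise_std = sample_default / (std_image + 1e-6)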
compare_images(sample_default[0], sample_featurewise_std[0])

# Augmentation Transformations

These are data-dependent transformations which explicitly use the graphical nature of data. 

**1. Rotation**

This transformation rotates the image in a certain direction (clockwise or anticlockwise).
The parameter that enables rotations is called **rotation_range**. It specifies the range, in degrees, from which a random angle is chosen uniformly for each rotation.

Note that some image regions are cropped out by the rotation, and some regions of the new image need to be filled. The filling behaviour is set with the **fill_mode** parameter.
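
To make the cropping and filling concrete, here is a simplified NumPy sketch that samples an angle uniformly and rotates by nearest-neighbour lookup with a constant fill value. It is an illustration under those assumptions, not the interpolation Keras performs; `rotate_constant_fill` and `cval` are hypothetical names.

```python
import numpy as np


def rotate_constant_fill(img, angle_deg, cval=0.0):
    """Rotate an (H, W, C) image about its centre, filling gaps with cval."""
    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0

    # Coordinates of every output pixel, measured from the image centre.
    ys, xs = np.mgrid[0:h, 0:w]
    ys, xs = ys - cy, xs - cx

    # Inverse mapping: for each output pixel, locate the source pixel.
    src_x = np.rint(np.cos(theta) * xs + np.sin(theta) * ys + cx).astype(int)
    src_y = np.rint(-np.sin(theta) * xs + np.cos(theta) * ys + cy).astype(int)

    # Output pixels whose source falls outside the image keep the fill value.
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full_like(img, cval)
    out[inside] = img[src_y[inside], src_x[inside]]
    return out


rng = np.random.default_rng(21)
rotation_range = 45
angle = rng.uniform(-rotation_range, rotation_range)  # uniform random angle

img = np.arange(5 * 5 * 3, dtype=np.float32).reshape(5, 5, 3)
rotated = rotate_constant_fill(img, angle)
print(round(float(angle), 2), rotated.shape)
```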

In [None]:
datagen_rotated = ImageDataGenerator(rotation_range=45, fill_mode="constant")
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_rotated = datagen_rotated.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_rotated = next(gen_rotated)
compare_images(sample_default[0], sample_rotated[0])

**2. Horizontal shift**

This transformation shifts the image in a certain direction along the horizontal axis (left or right).

The size of the shift is set with the **width_shift_range** parameter and is measured as a fraction of the total width.
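
A simplified NumPy sketch of a fractional shift with constant fill. The sign convention (positive moves content to the right) and the name `shift_horizontal` are choices made for this example and are not taken from Keras.

```python
import numpy as np


def shift_horizontal(img, fraction, cval=0.0):
    """Shift an (H, W, C) image along the width axis by a fraction of W.

    Positive fractions move content to the right; vacated columns get cval.
    """
    w = img.shape[1]
    pixels = int(round(fraction * w))
    pixels = max(-w, min(w, pixels))  # a shift cannot exceed the image width
    out = np.full_like(img, cval)
    if pixels > 0:
        out[:, pixels:] = img[:, :w - pixels]
    elif pixels < 0:
        out[:, :pixels] = img[:, -pixels:]
    else:
        out[...] = img
    return out


rng = np.random.default_rng(21)
fraction = rng.uniform(-0.4, 0.4)  # mirrors width_shift_range=0.4

row = np.arange(4, dtype=np.float32).reshape(1, 4, 1)
print(shift_horizontal(row, 0.5)[0, :, 0])
print(shift_horizontal(row, fraction)[0, :, 0])
```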

In [None]:
datagen_hshifted = ImageDataGenerator(width_shift_range=0.4, fill_mode="constant")
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_hshifted = datagen_hshifted.flow_from_directory(train_path, 
                                                    target_size=(img_height, img_width), 
                                                    batch_size=1, 
                                                    shuffle=False, 
                                                    class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_hshifted = next(gen_hshifted)
compare_images(sample_default[0], sample_hshifted[0])

**3. Vertical shift**

This transformation shifts the image in a certain direction along the vertical axis (down or up).

The size of the shift is set with the **height_shift_range** parameter and is measured as a fraction of the total height.

In [None]:
datagen_vshifted = ImageDataGenerator(height_shift_range=0.4, fill_mode="constant")
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_vshifted = datagen_vshifted.flow_from_directory(train_path, 
                                                    target_size=(img_height, img_width), 
                                                    batch_size=1, 
                                                    shuffle=False, 
                                                    class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_vshifted = next(gen_vshifted)
compare_images(sample_default[0], sample_vshifted[0])

**4. Shearing**

Shear mapping or shearing displaces each point in the vertical direction by an amount proportional to its distance from an edge of the image.

**NB:** The direction doesn't have to be vertical and can be arbitrary.

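The parameter that controls the displacement is called **shear_range** and corresponds to the deviation angle between a horizontal line in the original picture and the image of this line in the transformed image. Recent Keras releases interpret this angle in degrees, whereas older releases used radians, so check the documentation of the installed version before choosing a value.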
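
The geometry can be sketched with a textbook vertical shear applied to the corners of a rectangle. This is the generic formulation, not necessarily the exact matrix Keras builds; `shear_points` is a hypothetical name and the angle is given in degrees.

```python
import numpy as np


def shear_points(points, angle_deg):
    """Apply a vertical shear to an array of (x, y) points.

    Each point moves vertically by tan(angle) times its x coordinate, so a
    horizontal line ends up tilted by `angle_deg`.
    """
    k = np.tan(np.deg2rad(angle_deg))
    matrix = np.array([[1.0, 0.0],
                       [k, 1.0]])
    return points @ matrix.T


# Corners of a 10 x 5 rectangle, listed as (x, y).
corners = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 5.0], [0.0, 5.0]])
sheared = shear_points(corners, 45.0)
print(sheared)
```

Points on the left edge (x = 0) stay in place, while points further from that edge move proportionally more.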

In [None]:
datagen_sheared = ImageDataGenerator(shear_range=0.75, fill_mode="constant")
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_sheared = datagen_sheared.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_sheared = next(gen_sheared)
compare_images(sample_default[0], sample_sheared[0])

**5. Zoom**

This transformation zooms the initial image in or out. The **zoom_range** parameter controls the zooming factor.

For example, setting **zoom_range** to 0.5 means that the zooming factor will be chosen from the range [0.5, 1.5].
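
A small sketch of how a scalar setting turns into sampling bounds. As far as I know, Keras also accepts an explicit `[lower, upper]` pair and draws separate factors for the two image axes; the helper name `zoom_bounds` is hypothetical.

```python
import numpy as np


def zoom_bounds(zoom_range):
    """Turn a scalar or (low, high) zoom_range into explicit bounds."""
    if np.isscalar(zoom_range):
        return 1.0 - zoom_range, 1.0 + zoom_range
    low, high = zoom_range
    return float(low), float(high)


rng = np.random.default_rng(21)
low, high = zoom_bounds(0.5)

# One factor per image axis, each drawn uniformly from [low, high].
zx, zy = rng.uniform(low, high, size=2)
print(low, high)
```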

In [None]:
datagen_zoomed = ImageDataGenerator(zoom_range=0.5, fill_mode="constant")
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_zoomed = datagen_zoomed.flow_from_directory(train_path, 
                                                target_size=(img_height, img_width), 
                                                batch_size=1, 
                                                shuffle=False, 
                                                class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_zoomed = next(gen_zoomed)
compare_images(sample_default[0], sample_zoomed[0])

**6. Horizontal Flip**

This flips the image with respect to the vertical axis. It is switched on or off with the **horizontal_flip** parameter. When it is on, the flip is applied at random, so only some of the generated images are mirrored.
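
In array terms, a horizontal flip simply reverses the width axis. The sketch below applies it with probability `p`; the function name and the default of 0.5 are assumptions made for illustration.

```python
import numpy as np


def random_horizontal_flip(img, rng, p=0.5):
    """Mirror an (H, W, C) image left-to-right with probability p."""
    if rng.random() < p:
        return img[:, ::-1, :]  # reverse the width axis
    return img


rng = np.random.default_rng(21)
img = np.arange(2 * 3 * 1, dtype=np.float32).reshape(2, 3, 1)

always = random_horizontal_flip(img, rng, p=1.0)
never = random_horizontal_flip(img, rng, p=0.0)
print(always[0, :, 0], never[0, :, 0])
```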

In [None]:
datagen_hflipped = ImageDataGenerator(horizontal_flip=True)
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_hflipped = datagen_hflipped.flow_from_directory(train_path, 
                                                    target_size=(img_height, img_width), 
                                                    batch_size=1, 
                                                    shuffle=False, 
                                                    class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_hflipped = next(gen_hflipped)
compare_images(sample_default[0], sample_hflipped[0])

**7. Vertical Flip**

This flips the image with regard to the horizontal axis. The **vertical_flip** Boolean parameter controls the presence of this transformation.

In [None]:
datagen_vflipped = ImageDataGenerator(vertical_flip=True)
datagen_default = ImageDataGenerator()

In [None]:
gen_default = datagen_default.flow_from_directory(train_path, 
                                                  target_size=(img_height, img_width), 
                                                  batch_size=1, 
                                                  shuffle=False, 
                                                  class_mode=None)
gen_vflipped = datagen_vflipped.flow_from_directory(train_path, 
                                                    target_size=(img_height, img_width), 
                                                    batch_size=1, 
                                                    shuffle=False, 
                                                    class_mode=None)

In [None]:
np.random.seed(21)
sample_default = next(gen_default)
sample_vflipped = next(gen_vflipped)
compare_images(sample_default[0], sample_vflipped[0])

# Combination of the Techniques

In [None]:
datagen = ImageDataGenerator(rotation_range=45, 
                             width_shift_range=0.2, 
                             height_shift_range=0.2, 
                             shear_range=0.2, 
                             zoom_range=0.3, 
                             horizontal_flip=True, 
                             vertical_flip=True, 
                             fill_mode="nearest")

In [None]:
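# NOTE: train_pos_path and valid_pos_path are not defined in this notebook;
# point them at the image folders before running this cell.
try:
    img = load_img(os.path.join(train_pos_path, "Firehydrant2.jpg"))
except FileNotFoundError:
    img = load_img(os.path.join(valid_pos_path, "Firehydrant2.jpg"))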

In [None]:
img

In [None]:
img = img_to_array(img)
img = img.reshape((1,) + img.shape)

In [None]:
n_augmentations = 8

In [None]:
# NOTE: data_path is not defined in this notebook; set it before running this cell.
save_dir = os.path.join(data_path, "augmentation_preview")
if os.path.exists(save_dir):
    shutil.rmtree(save_dir)
os.mkdir(save_dir)

In [None]:
plt.figure(figsize=(15, 6))    
i = 0

for batch in datagen.flow(img, 
                          batch_size=1, 
                          seed=21, 
                          save_to_dir=save_dir, 
                          save_prefix="hydrant", 
                          save_format="jpeg"):
    
    plt.subplot(2, int(np.ceil(n_augmentations * 1. / 2)), i + 1)
    plt.imshow(array_to_img(batch[0]))
    plt.axis("off")
    
    i += 1
    if i >= n_augmentations:
        break