# **02 - DataPreparation**

## Objectives

* Clean data - Remove non-image files from the dataset.
* Split data - Split the dataset into 'train', 'validation' and 'test' sets.
* Check data for imbalance - Check all data subsets for imbalance between labels to determine if there is cause for under-sampling or over-sampling.
* Check image sizes - Check image sizes to determine if there is cause to resize the images for a uniform dataset.

## Inputs

* Raw dataset - The dataset provided by the client, collected and stored in the directory "/inputs/cherry_leaves_dataset/cherry-leaves", output by notebook "01 - DataCollection".

## Outputs

* Plot showing data balance
* Plot showing image size variation

## Additional Comments

* No additional comments. 



---

## Change working directory

Change the working directory from the current folder to /workspace/Mildew-Detection-in-Cherry-Leaves

The output from the below cell should be '/workspace/Mildew-Detection-in-Cherry-Leaves'

In [1]:
import os
os.chdir(os.path.dirname(os.getcwd()))
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-Detection-in-Cherry-Leaves'
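Note that the cell above moves one folder up every time it runs, so running it twice leaves the notebook outside the project. A minimal sketch of a guarded alternative is shown below. It assumes the project folder is named `Mildew-Detection-in-Cherry-Leaves` (taken from the expected output above); the helper name `find_project_root` is hypothetical and not part of the original notebook.

```python
import os

# Assumed project folder name, taken from the expected output above
PROJECT_NAME = "Mildew-Detection-in-Cherry-Leaves"


def find_project_root(start, project_name=PROJECT_NAME):
    """Walk upwards from `start` until a folder named `project_name` is found.

    Returns the absolute path of that folder, or None if no parent matches.
    """
    path = os.path.abspath(start)
    while True:
        if os.path.basename(path) == project_name:
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root without a match
            return None
        path = parent
```

A possible use is `root = find_project_root(os.getcwd())` followed by `os.chdir(root)` when `root` is not `None`, which gives the same result no matter how often the cell is run.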

---

# Clean Data

## Filtering out non-image files

In [6]:
import os
healthy_leaves='inputs/cherry_leaves_dataset/cherry-leaves/healthy'
powdery_mildew='inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew'
file_extension = ('.png', '.jpg', '.jpeg')

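def drop_non_imgs(folder):
    """Delete every file in folder that does not have an image extension."""
    files = os.listdir(folder)
    non_img_counter = 0
    for file in files:
        # str.endswith accepts a tuple, so all extensions are tested at once
        if not file.lower().endswith(file_extension):
            file_path = folder + '/' + file
            os.remove(file_path)
            non_img_counter += 1  # count each removed file
    print(f"{folder} contained {non_img_counter} non image files before removal")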

drop_non_imgs(healthy_leaves)
drop_non_imgs(powdery_mildew)

inputs/cherry_leaves_dataset/cherry-leaves/healthy contained 0 non image files before removal
inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew contained 0 non image files before removal
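Because the removal is permanent, it can be useful to list the affected files before deleting anything. The sketch below is one way to do such a dry run; the helper name `list_non_images` is hypothetical, and the extension set simply mirrors `file_extension` from the cell above.

```python
from pathlib import Path

# Same extensions as `file_extension` in the cell above
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_non_images(folder):
    """Return the sorted names of files in `folder` without an image extension.

    Nothing is deleted; sub-folders are ignored.
    """
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() not in IMAGE_SUFFIXES
    )
```

Calling `list_non_images(healthy_leaves)` before `drop_non_imgs(healthy_leaves)` would show exactly which files are about to be removed.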


## Split dataset into train, validation and test sets

* Set the ratio for the train, validation and test sets

In [11]:
train_set_ratio = 0.7
validation_set_ratio = 0.1
test_set_ratio = 0.2

ratio_sum = round(train_set_ratio + validation_set_ratio + test_set_ratio, 2)
if ratio_sum != 1:
    print(f'WARNING: The sum of the ratios must be 1. It is currently {ratio_sum} and needs to be corrected.')
else:
    print('Data set ratio is set.')


Data set ratio is set.
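The check above rounds the sum to two decimals to sidestep floating-point error. A minimal sketch of an alternative that compares with a tolerance and also rejects ratios outside the open interval (0, 1) is given below; the function name `ratios_are_valid` and the tolerance value are assumptions made for illustration.

```python
import math


def ratios_are_valid(train, validation, test, tol=1e-9):
    """Return True if each ratio lies strictly between 0 and 1 and they sum to 1."""
    ratios = (train, validation, test)
    each_in_range = all(0 < r < 1 for r in ratios)
    # math.isclose avoids false alarms from floating-point rounding
    return each_in_range and math.isclose(sum(ratios), 1.0, abs_tol=tol)
```

With the values set above, `ratios_are_valid(train_set_ratio, validation_set_ratio, test_set_ratio)` would be expected to return `True`.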


* Create the new train, validation and test folders within the dataset directory

In [None]:
import os

data_dir = '/workspace/Mildew-Detection-in-Cherry-Leaves/inputs/cherry_leaves_dataset/cherry-leaves'

data_labels = os.listdir(data_dir)
folder_names = ['train', 'validation', 'test']

for foldername in folder_names:
    for data_label in data_labels:
        os.makedirs(name=data_dir + '/' + foldername + '/' + data_label)

* Split and move the data to the new folders and delete the previous folders

In [8]:
import os
import shutil
import random

for data_label in data_labels:
    data = os.listdir(data_dir + '/' + data_label)
    random.shuffle(data)

    train_data_amount = round(len(data) * train_set_ratio)
    validation_data_amount = round(len(data) * validation_set_ratio)

    for index, img in enumerate(data):
        # The position in the shuffled list decides the target folder:
        # first the train share, then the validation share, the rest is test
        if index < train_data_amount:
            target = 'train'
        elif index < train_data_amount + validation_data_amount:
            target = 'validation'
        else:
            target = 'test'
        shutil.move(data_dir + '/' + data_label + '/' + img,
                    data_dir + '/' + target + '/' + data_label + '/' + img)

    os.rmdir(data_dir + '/' + data_label)
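Since `random.shuffle` is unseeded, every run of the cell above produces a different split. One way to make the split reproducible is to compute it on the list of file names first and move the files afterwards. The sketch below illustrates that idea under stated assumptions: the function name `split_filenames` and the default seed are hypothetical, and the default ratios copy the values set earlier in this notebook.

```python
import random


def split_filenames(filenames, train_ratio=0.7, validation_ratio=0.1, seed=42):
    """Split file names into train, validation and test lists.

    The names are sorted before shuffling so that the same seed always
    yields the same split, regardless of the order os.listdir returns.
    """
    files = sorted(filenames)
    random.Random(seed).shuffle(files)

    n_train = round(len(files) * train_ratio)
    n_validation = round(len(files) * validation_ratio)

    return {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_validation],
        "test": files[n_train + n_validation:],  # whatever remains
    }
```

Each returned list could then be passed to `shutil.move` in a loop, one target folder at a time.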

## Check for Data Imbalance

In [None]:
import os

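def count_label_images(label):
    """Count the images of one label across the train, validation and test folders."""
    return sum(len(os.listdir(data_dir + '/' + folder + '/' + label))
               for folder in folder_names)

healthy_leaves_sum = count_label_images('healthy')
powdery_mildew_leaves_sum = count_label_images('powdery_mildew')
dataset_sum = healthy_leaves_sum + powdery_mildew_leaves_sum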

healthy_leaves_ratio = round(healthy_leaves_sum / dataset_sum, 2)
powdery_mildew_leaves_ratio = round(powdery_mildew_leaves_sum / dataset_sum, 2)

print('Optimal ratio would be 0.5 each for both healthy leaves and leaves with powdery mildew')
print(f'Dataset contains a ratio of {healthy_leaves_ratio} for healthy leaves')
print(f'Dataset contains a ratio of {powdery_mildew_leaves_ratio} for leaves with powdery mildew')

Optimal ratio would be 0.5 each for both healthy leaves and leaves with powdery mildew
Dataset contains a ratio of 0.5 for healthy leaves
Dataset contains a ratio of 0.5 for leaves with powdery mildew
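The objectives call for checking every subset, not only the dataset as a whole. A minimal sketch of a per-subset check follows; the helper name `label_ratios` is hypothetical, and it assumes each subset folder contains one sub-folder per label, as created earlier in this notebook.

```python
import os


def label_ratios(subset_dir):
    """Return the share of images per label inside one subset folder."""
    counts = {
        label: len(os.listdir(os.path.join(subset_dir, label)))
        for label in sorted(os.listdir(subset_dir))
        if os.path.isdir(os.path.join(subset_dir, label))
    }
    total = sum(counts.values())
    # Guard against an empty subset to avoid dividing by zero
    return {
        label: round(n / total, 2) if total else 0.0
        for label, n in counts.items()
    }
```

Looping over `folder_names` and calling `label_ratios(data_dir + '/' + foldername)` would print one ratio dictionary for each of the train, validation and test sets.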


## Check image sizes

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
import joblib

from matplotlib.image import imread

In [None]:
train_data_dir = data_dir + '/train'
labels = os.listdir(train_data_dir)

img_width, img_height = [], []
for label in labels:
  for img_file in os.listdir(train_data_dir + '/'+ label):
    img = imread(train_data_dir + '/' + label + '/' + img_file)
    img_shape = img.shape  # (height, width, channels)
    img_width.append(img_shape[1])
    img_height.append(img_shape[0])

fig, axes = plt.subplots()
sns.scatterplot(x=img_width, y=img_height, alpha=0.2)
axes.set_xlabel("Width [px]")
axes.set_ylabel("Height [px]")
img_width_mean = int(np.array(img_width).mean())
img_height_mean = int(np.array(img_height).mean())
axes.axvline(x=img_width_mean,color='r', linestyle='--')
axes.axhline(y=img_height_mean,color='r', linestyle='--')
plt.show()
print(f"Width average: {img_width_mean} \nHeight average: {img_height_mean}")
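The scatter plot and the averages show the spread, but an average alone cannot confirm that every image has the same size. A short sketch that counts the distinct (width, height) pairs is given below; the function name `summarise_sizes` is hypothetical, and it expects lists shaped like `img_width` and `img_height` from the cell above.

```python
from collections import Counter


def summarise_sizes(widths, heights):
    """Count distinct (width, height) pairs and report the most frequent one."""
    sizes = Counter(zip(widths, heights))
    return {
        "distinct_sizes": len(sizes),
        # most_common(1) returns [((width, height), count)]
        "most_common": sizes.most_common(1)[0] if sizes else None,
    }
```

If `summarise_sizes(img_width, img_height)` reports a single distinct size, the dataset is uniform and no resizing is needed for consistency.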

---

# Conclusions and Next Steps

* The data is balanced, so no under-sampling or over-sampling is needed.
* The images are of uniform size (256 px × 256 px), so no resizing is required for uniformity.
  * The image size fed to the model will be managed in notebook "04 - Modelling", in order to control the trade-off between model performance and model size.