# EDA and Data Cleaning
In this notebook I explore the dataset further and clean it where required.

In [None]:
import pandas as pd
import numpy as np
import os
import glob
import csv
import matplotlib.pyplot as plt

%matplotlib inline

plt.rcParams['figure.figsize'] = [8,8]

## Exploratory Data Analysis
Let's start by exploring the dataset further. As a first step, let's create a pandas dataframe containing all image filenames and their classes. This makes it easy to plot the data distribution, to create different data splits, and possibly to oversample the dataset in a later step (in case some imbalance is present).

In [None]:
def get_size_image(image_path):
    image = plt.imread(image_path)
    dimensions = image.shape
    # check if last is missing -> image is grayscale -> add depth of 1
    if len(dimensions) < 3:
        return np.expand_dims(image, axis=-1).shape
    else:
        return dimensions

In [None]:
def create_dataframe(dataset_path):
    # first step: get list of files (one subfolder per class)
    list_files = glob.glob(os.path.join(dataset_path, "**", "*.jpg"))

    # collect one record per image and build the dataframe once at the end;
    # DataFrame.append was removed in pandas 2.0 and was slow inside a loop
    rows = []
    list_labels = []
    for image in list_files:
        # remove dataset foldername from path
        image_shortened = os.path.relpath(image, dataset_path)
        # the class name is the folder that directly contains the image
        label = os.path.basename(os.path.dirname(image))
        height, width, depth = get_size_image(image)
        rows.append({"filepath": image_shortened,
                     "class_name": label,
                     "image_height": height,
                     "image_width": width,
                     "image_depth": depth})
        list_labels.append(label)

    df = pd.DataFrame(rows, columns=["filepath", "class_name", "image_height", "image_width", "image_depth"])
    # make sure the size columns are stored as integers
    size_columns = ["image_height", "image_width", "image_depth"]
    df[size_columns] = df[size_columns].astype(int)

    return df, list_labels

In [None]:
df, class_names = create_dataframe("dataset")
df.head()

In [None]:
df.info()

Okay, perfect! Kaggle states that the dataset should contain 6862 images in total, so all images have been read. Now let's plot the data distribution to get a better understanding of the dataset.

In [None]:
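# count the samples per class and draw one bar per class;
# a histogram with default bins can merge neighbouring classes into one bar
df["class_name"].value_counts().sort_index().plot(kind="bar")
plt.title("Dataset Distribution")
plt.xlabel("Class Name", labelpad=10)
plt.ylabel("Number of Samples")
plt.show()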
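
The plot can be complemented with exact numbers. The following is a minimal sketch on a made-up toy dataframe (the class names `a`, `b`, `c` and their counts are placeholders, not values from the real dataset) that prints the per-class counts, the relative shares, and the ratio between the largest and the smallest class.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe; class names and counts are made up.
toy_df = pd.DataFrame({"class_name": ["a"] * 6 + ["b"] * 3 + ["c"] * 1})

# absolute number of samples per class
counts = toy_df["class_name"].value_counts()
# share of each class in the whole dataset
shares = toy_df["class_name"].value_counts(normalize=True)
# ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print(f"Imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}")
```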

Okay. The dataset is definitely imbalanced! Later, let's apply some imbalance strategies and check whether model performance improves compared to a model trained on the imbalanced dataset. <br> <br>
<b> IMPORTANT: </b> Don't change the distribution of the validation set and the hold-out test set! They should reflect the real data distribution. Therefore, we first split the data into the respective sets and only afterwards apply imbalance strategies, and only to the training set.
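
The order of operations described above can be sketched as follows. This is only an illustration under assumptions: the helper names `stratified_split` and `oversample_training_set`, the 20 % hold-out fraction, and the toy data are all hypothetical, and random oversampling is just one of several possible imbalance strategies.

```python
import pandas as pd

def stratified_split(df, label_col="class_name", holdout_frac=0.2, seed=42):
    """Hold out the same fraction of every class, so the hold-out set
    keeps the original class distribution."""
    holdout = df.groupby(label_col).sample(frac=holdout_frac, random_state=seed)
    train = df.drop(holdout.index)
    return train, holdout

def oversample_training_set(train, label_col="class_name", seed=42):
    """Randomly duplicate samples of smaller classes until every class
    is as large as the largest one. Apply to the training set only."""
    target = train[label_col].value_counts().max()
    parts = []
    for _, group in train.groupby(label_col):
        parts.append(group)
        deficit = target - len(group)
        if deficit > 0:
            # draw the missing samples with replacement from the same class
            parts.append(group.sample(n=deficit, replace=True, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Toy data: 10 samples of class "a", 40 samples of class "b" (made up).
toy_df = pd.DataFrame({
    "filepath": [f"img_{i}.jpg" for i in range(50)],
    "class_name": ["a"] * 10 + ["b"] * 40,
})

train, holdout = stratified_split(toy_df)
train_balanced = oversample_training_set(train)

print(holdout["class_name"].value_counts())         # still imbalanced, as intended
print(train_balanced["class_name"].value_counts())  # balanced
```

Splitting first also guarantees that duplicated training samples cannot leak into the validation or test set.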

## Image Sizes
Let's now check the sizes of all images and whether they differ from each other. This is important because the model needs images of a fixed size as input. <br> <br>
The first rows of the dataframe already show that the images have different shapes. Therefore, we later need to choose one image size to which all images are resized. A suitable image size can be found by training models on different image sizes and comparing their performance. <br> <br>
Let's now quickly check the largest and the smallest image sizes.

In [None]:
print(f'Min height: {df["image_height"].min()} | Max height: {df["image_height"].max()}')
print(f'Min width: {df["image_width"].min()} | Max width: {df["image_width"].max()}')
print(f'Min depth: {df["image_depth"].min()} | Max depth: {df["image_depth"].max()}')
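
Minimum and maximum values only show the extremes. As a hedged sketch on a made-up toy dataframe (the numbers below are placeholders, not measurements from the real dataset), the same columns can also be summarised with `describe()`, a count per depth, and the aspect ratio, which may help when choosing a target size later.

```python
import pandas as pd

# Toy stand-in with the same size columns as the notebook's dataframe.
toy_df = pd.DataFrame({
    "image_height": [100, 200, 200, 400],
    "image_width": [150, 200, 300, 400],
    "image_depth": [1, 3, 3, 4],
})

# count, mean, quartiles, min and max for height and width
size_summary = toy_df[["image_height", "image_width"]].describe()
# how many images have 1, 3 or 4 channels
depth_counts = toy_df["image_depth"].value_counts().sort_index()
# width divided by height; values far from 1 get distorted by square resizing
aspect_ratio = toy_df["image_width"] / toy_df["image_height"]

print(size_summary)
print(depth_counts)
print(aspect_ratio.describe())
```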

Okay, so the ranges are quite large! There are grayscale images, RGB images, and even some images that include a transparency layer (an alpha channel on top of the color channels). I tend to train my networks on grayscale images, because in my experience RGB images often didn't boost the model's performance. But let's first keep the images in their original format, except for the images with a depth of 4. They can be transformed to RGB by dropping the fourth channel.
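
The channel handling described above can be illustrated on in-memory arrays first. This is a minimal sketch; the helper name `drop_alpha` is hypothetical, and it assumes that the fourth channel really is an alpha channel stored last, as described in the text.

```python
import numpy as np

def drop_alpha(image):
    """Return the image without its fourth channel.

    Grayscale (H, W) or (H, W, 1) and RGB (H, W, 3) arrays are returned
    unchanged; only (H, W, 4) arrays are sliced down to (H, W, 3).
    """
    if image.ndim == 3 and image.shape[-1] == 4:
        return image[:, :, :3]
    return image

# Dummy images with made-up sizes, only used to show the resulting shapes.
rgba = np.zeros((2, 3, 4), dtype=np.uint8)
rgb = np.zeros((2, 3, 3), dtype=np.uint8)
gray = np.zeros((2, 3), dtype=np.uint8)

print(drop_alpha(rgba).shape)
print(drop_alpha(rgb).shape)
print(drop_alpha(gray).shape)
```

Note that slicing simply discards the transparency information; it does not blend the image with a background color.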

In [None]:
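def clean_file(row):
    # access the columns by name; positional access such as row[-1]
    # is deprecated for a Series with string labels
    depth = row["image_depth"]
    filepath = row["filepath"]
    if depth == 4 and filepath:
        image = plt.imread(os.path.join("dataset", filepath))
        # keep only the first three channels and overwrite the file in place
        plt.imsave(os.path.join("dataset", filepath), image[:, :, :3])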

In [None]:
df.apply(clean_file, axis=1)

Let's now reload the data and check if this problem is solved.

In [None]:
df, class_names = create_dataframe("dataset")
df.head()

In [None]:
df.info()

In [None]:
print(f'Min height: {df["image_height"].min()} | Max height: {df["image_height"].max()}')
print(f'Min width: {df["image_width"].min()} | Max width: {df["image_width"].max()}')
print(f'Min depth: {df["image_depth"].min()} | Max depth: {df["image_depth"].max()}')

Perfect! Let's now store the created dataframe as a CSV file so that it can be easily accessed later during model training.

In [None]:
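# df_encoded is never defined in this notebook, so the dataframe created above is saved;
# os.path.join keeps the path portable across operating systems
df.to_csv(os.path.join("dataset", "data.csv"), index=False)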
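
Before relying on the file during training, it can be worth reloading it once to confirm that the columns and dtypes survive the round trip. The following is a small sketch on a made-up toy dataframe written to a temporary directory; the file names and values are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in with the same columns as the notebook's dataframe.
toy_df = pd.DataFrame({
    "filepath": [os.path.join("class_a", "img_0.jpg"),
                 os.path.join("class_b", "img_1.jpg")],
    "class_name": ["class_a", "class_b"],
    "image_height": [100, 200],
    "image_width": [150, 300],
    "image_depth": [3, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")
    # index=False avoids writing the row index as an extra unnamed column
    toy_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.dtypes)
print(reloaded.head())
```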