# Image data explorations

For the image data we use a waste recycling plant dataset, which can be found on [Kaggle](https://www.kaggle.com/datasets/parohod/warp-waste-recycling-plant-dataset). There are three different versions of this dataset. We are using WaRP-C, which contains cutout images of single objects.
In this notebook we are going to explore the image dataset using pandas, PIL and matplotlib. The first step is to import these libraries.

In [1]:
import os
from PIL import Image
import pandas as pd
import matplotlib.pyplot as plt

Below you can find the code that loads all images into a pandas dataframe, together with some extra information about each image: its width and height, its area, and its aspect ratio (width divided by height). We also add the label of the image, taken from the name of the folder it is stored in. Lastly, the image itself is saved in the "image" column as a PIL Image.

In [2]:
rows = []
directory = 'datasets/Warp-C/'
for root_dir, cur_dir, files in os.walk(directory):
    print("root dir: " + str(root_dir))
    # The folder name is the class label of every image inside it.
    label = os.path.basename(os.path.normpath(root_dir))
    for file in files:
        # Match on the file extension only, not on ".jpg" anywhere in the name.
        if file.endswith(".jpg"):
            file_name = os.path.join(root_dir, file)
            image = Image.open(file_name)
            width, height = image.size
            rows.append([width, height, width * height, width / height, label, image])

# Building the dataframe once is faster than appending row by row,
# and the size columns are inferred as numeric so they can be plotted later.
images = pd.DataFrame(rows, columns=["width", "height", "area", "ratio", "label", "image"])

print("file count: " + str(len(images)))
print(images)

root dir: datasets/Warp-C/
root dir: datasets/Warp-C/bottle-blue
root dir: datasets/Warp-C/bottle-blue-full
root dir: datasets/Warp-C/bottle-blue5l
root dir: datasets/Warp-C/bottle-blue5l-full
root dir: datasets/Warp-C/bottle-dark
root dir: datasets/Warp-C/bottle-dark-full
root dir: datasets/Warp-C/bottle-green
root dir: datasets/Warp-C/bottle-green-full
root dir: datasets/Warp-C/bottle-milk
root dir: datasets/Warp-C/bottle-milk-full
root dir: datasets/Warp-C/bottle-multicolor
root dir: datasets/Warp-C/bottle-multicolor-full
root dir: datasets/Warp-C/bottle-oil
root dir: datasets/Warp-C/bottle-oil-full
root dir: datasets/Warp-C/bottle-transp
root dir: datasets/Warp-C/bottle-transp-full
root dir: datasets/Warp-C/bottle-yogurt
root dir: datasets/Warp-C/canister
root dir: datasets/Warp-C/cans
root dir: datasets/Warp-C/cardboard-juice
root dir: datasets/Warp-C/cardboard-milk
root dir: datasets/Warp-C/detergent-box
root dir: datasets/Warp-C/detergent-color
root dir: datasets/Warp-C/detergen

## Looking at the images
Your first step is to look at a sample of the dataset to get a sense of what the images look like.

To get a random sample of a pandas dataframe you can use the following method:
<code>dataframe.sample(n=size_of_sample)</code>

To be able to show the images you will have to iterate over the dataframe. This can be done as follows:
<code>for index, row in dataframe.iterrows():</code>

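Lastly, to show the images you can plot them using subplots in matplotlib (note that <code>axes</code> is a 2-D array only when both the number of rows and the number of columns are greater than one):
</br>
<code>
fig, axes = plt.subplots(nrows=number_of_rows, ncols=number_of_columns, figsize=(width, height)) </br>
axes[x,y].imshow(PIL_image)
</code>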
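
The three steps above can be combined as in the following sketch. It is only an illustration: the `demo` dataframe and its solid-colour images are hypothetical stand-ins for the `images` dataframe loaded earlier, and the grid size is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

# Hypothetical stand-in for the `images` dataframe: solid-colour PIL images.
colours = ["red", "green", "blue", "orange", "purple", "gray", "black", "brown"]
demo = pd.DataFrame({
    "label": [f"class-{i % 3}" for i in range(len(colours))],
    "image": [Image.new("RGB", (40 + 10 * i, 60), c) for i, c in enumerate(colours)],
})

n_rows, n_cols = 2, 3
# random_state makes the sample reproducible; drop it for a fresh sample each run.
sample = demo.sample(n=n_rows * n_cols, random_state=0)

fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(9, 6))
# axes is a 2-D array here; axes.flat walks over all cells with a single loop.
for ax, (index, row) in zip(axes.flat, sample.iterrows()):
    ax.imshow(row["image"])
    ax.set_title(row["label"])
    ax.axis("off")
fig.tight_layout()
```

In a notebook the figure is rendered at the end of the cell. Replacing `demo` with the real dataframe should work the same way, as long as it has "label" and "image" columns.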

## Looking at the image sizes

The next step is to get a grasp of the different image sizes, which is why we saved the width, height, area and aspect ratio of each image. To explore the image sizes you can use the following functions.

- In pandas you can easily create histograms of all numeric columns using:</br> <code>dataframe.hist(figsize=[width, height], bins=n_bins)</code>
- To create a scatter plot of two columns, you can use matplotlib:</br><code>plt.scatter(dataframe["column_1"], dataframe["column_2"]) </br> plt.show()</code>
- To get the row of a dataframe where a certain column has its maximum value, you can use:</br> <code>maximum = dataframe.loc[dataframe['column'].idxmax()]</code>
- Similarly, to get the row where a column has its minimum value: </br><code>minimum = dataframe.loc[dataframe['column'].idxmin()]</code>
- To display a single PIL Image you can use: </br><code>display(pil_image)</code>
</br></br>

With this information, explore the distributions of the different image sizes, explore the width vs. height distribution, and find out what the minimum and maximum image area and aspect ratio are, as well as the minimum and maximum width and height. Lastly, think about how these sizes will impact the training of your machine learning model.
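
As a starting point, the functions listed above can be combined as in the sketch below. The `sizes` dataframe and its values are made up for illustration; it only mimics the numeric columns of the real `images` dataframe.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical size table with the same numeric columns as `images`.
sizes = pd.DataFrame({"width": [40, 120, 300, 64, 500],
                      "height": [80, 60, 300, 256, 100]})
sizes["area"] = sizes["width"] * sizes["height"]
sizes["ratio"] = sizes["width"] / sizes["height"]

# One histogram per numeric column.
sizes.hist(figsize=[10, 8], bins=5)

# Width against height: points far from the diagonal are strongly elongated.
fig, ax = plt.subplots()
ax.scatter(sizes["width"], sizes["height"])
ax.set_xlabel("width (px)")
ax.set_ylabel("height (px)")

# Rows holding the smallest and largest value of each column.
for column in ["width", "height", "area", "ratio"]:
    smallest = sizes.loc[sizes[column].idxmin()]
    largest = sizes.loc[sizes[column].idxmax()]
    print(f"{column}: min={smallest[column]}, max={largest[column]}")
```

With the real dataframe, `display(largest["image"])` would then show the image that belongs to an extreme row.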

## Class distributions

Besides the image sizes, the class distribution is another important aspect to explore. You can use the following functions.
- To get the unique values of a pandas column: <code>unique_values = dataframe["column"].unique()</code>
- To count how often each unique value occurs in a column: <code>unique_values_counted = dataframe["column"].value_counts()</code>
- To create a bar plot from the counted values: <code>unique_values_counted.plot(kind="bar")</code>

What does this distribution mean for the creation of our machine learning model?
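
One way to start on that question is to put a number on how uneven the classes are. The sketch below counts the classes, plots them, and computes the ratio between the largest and smallest class. The label names are taken from the folder list above, but the counts are invented for illustration and are not the real WaRP-C counts.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical labels; the counts are made up, not the real dataset counts.
labels = pd.DataFrame(
    {"label": ["cans"] * 6 + ["canister"] * 2 + ["bottle-blue"] * 4}
)

class_names = labels["label"].unique()
class_counts = labels["label"].value_counts()

# value_counts() sorts from most to least frequent, so the bars are ordered too.
ax = class_counts.plot(kind="bar")
ax.set_ylabel("number of images")

# Ratio between the most and least frequent class as a rough imbalance measure.
imbalance = class_counts.max() / class_counts.min()
print(f"{len(class_names)} classes, largest/smallest ratio: {imbalance:.1f}")
```

A ratio far above 1 suggests that some classes are rare, which is worth keeping in mind when splitting the data and when choosing an evaluation metric.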