**D3APL: Aplicações em Ciência de Dados** <br/>
IFSP Campinas

Prof. Dr. Samuel Martins (Samuka) <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

# Animal Dataset - v0
We will evaluate some **multiclass classification** CNNs to predict the classes of the **Animal Dataset**: https://www.kaggle.com/datasets/alessiocorrado99/animals10


Target goals:
- Dataset Organization
    + Understand the dataset's structure
    + Handle the _class imbalance_ by _undersampling_
    + Save the balanced dataset
    + Code a function to load the _animals dataset_ as a _dataframe_

## 1. Dataset
**Animal Dataset**: https://www.kaggle.com/datasets/alessiocorrado99/animals10

Dataset locally stored in `../datasets/animal-dataset`

### 1.1 Dataset Info
**Dataset Folder:** `../datasets/animal-dataset/raw-img`

Each **animal** (a **class** in your _classification problem_) has a _folder_ inside the _dataset folder_. <br/>
There are **10 animals (classes)** in the dataset.

Each _class folder_ contains _all_ **images** (**samples**) from the corresponding _class_.

We see that it is an **imbalanced dataset**. 

The class with the _fewest images_ is _"elefante"_ (1446 images) and the one with the most images is _"cane"_ (4863 images).
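The per-class counts can be checked with a few lines of Python. This is a minimal sketch, assuming every subfolder of the dataset folder is a class and every file inside it is an image; the helper name `count_images_per_class` is hypothetical.

```python
import os


def count_images_per_class(dataset_dir):
    """Return {class_folder_name: number_of_files} for each class folder."""
    counts = {}
    for class_name in sorted(os.listdir(dataset_dir)):
        class_dir = os.path.join(dataset_dir, class_name)
        if os.path.isdir(class_dir):
            # assumption: every file in a class folder is an image
            counts[class_name] = len(os.listdir(class_dir))
    return counts


dataset_dir = '../datasets/animal-dataset/raw-img'
if os.path.isdir(dataset_dir):
    counts = count_images_per_class(dataset_dir)
    # print the classes from the fewest to the most images
    for class_name, n_images in sorted(counts.items(), key=lambda kv: kv[1]):
        print(class_name, n_images)
```

Sorting by count makes the smallest class, which bounds how many images per class undersampling can keep, appear first.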

### 1.2 Handling Class Imbalance by Undersampling
To handle the **class imbalance** we will consider the **_undersampling_** technique:

<img src='./figs/undersampling_x_oversampling.jpg' width=600/> <br/>
Source: https://www.mastersindatascience.org/learning/statistics-data-science/undersampling/#:~:text=Undersampling%20is%20a%20technique%20to,information%20from%20originally%20imbalanced%20datasets.

<br/>

Although _the lowest number of samples per class_ in our dataset is _1446 images_, we will consider a "rounded number": **1400 images**.

In [None]:
# ita ==> en
translate = {'cane': 'dog', 'cavallo': 'horse', 'elefante': 'elephant', 'farfalla': 'butterfly', 'gallina': 'chicken', 'gatto': 'cat', 'mucca': 'cow', 'pecora': 'sheep', 'scoiattolo': 'squirrel', 'ragno': 'spider'}

translate

In [None]:
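# creating a dataframe to store the image full pathnames and their corresponding classes
import os
import pandas as pd

# assumption: `img_full_paths` and `img_classes` are not defined in this section,
# so we build them by walking the class folders of the raw dataset
dataset_dir = '../datasets/animal-dataset/raw-img'
img_full_paths, img_classes = [], []
for class_folder in sorted(os.listdir(dataset_dir)):
    for img_filename in sorted(os.listdir(os.path.join(dataset_dir, class_folder))):
        img_full_paths.append(os.path.join(dataset_dir, class_folder, img_filename))
        img_classes.append(translate.get(class_folder, class_folder))  # ita ==> en

dataset_df = pd.DataFrame({
    'image_pathname': img_full_paths,
    'class': img_classes
})

dataset_df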
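With a dataframe of pathnames and classes, undersampling amounts to randomly keeping the same number of rows for every class. The following is a minimal sketch on a toy dataframe, assuming pandas 1.1 or later (for `DataFrameGroupBy.sample`); the helper name `undersample` and the seed `random_state=42` are illustrative choices, not part of the original notebook.

```python
import pandas as pd


def undersample(df, n_per_class, class_col='class', random_state=42):
    """Randomly keep exactly `n_per_class` rows of each class."""
    balanced = df.groupby(class_col).sample(n=n_per_class, random_state=random_state)
    # rebuild a clean 0..N-1 index after sampling
    return balanced.reset_index(drop=True)


# toy example: 5 'dog' rows and 3 'cat' rows, undersampled to 3 per class
toy_df = pd.DataFrame({
    'image_pathname': [f'img_{i}.jpeg' for i in range(8)],
    'class': ['dog'] * 5 + ['cat'] * 3,
})
balanced_df = undersample(toy_df, n_per_class=3)
print(balanced_df['class'].value_counts())
```

Applied to the real dataset, `undersample(dataset_df, n_per_class=1400)` would keep 1400 images for each of the 10 classes, i.e. 14,000 rows in total.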

##### **Alternative to Undersampling from Scratch**

The [`imbalanced-learn` package](https://imbalanced-learn.org/stable/references/index.html) provides several functionalities to deal with **class imbalance**, such as _undersampling_ and _oversampling_.

### 1.3 Saving the undersampled dataset

**Output directory structure:**

<pre>
+ ../datasets/animal-dataset/raw-img-balanced
    + cat
    + dog
    ...
</pre>

In [None]:
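import os
import shutil

# assumption: `dataset_df` holds the balanced (undersampled) dataset,
# with English class names in its 'class' column
out_dir = '../datasets/animal-dataset/raw-img-balanced'

# if the directory exists, we will delete it
if os.path.exists(out_dir):
    shutil.rmtree(out_dir)

# (re)create the output dir
os.makedirs(out_dir)

# create the output class folders inside `out_dir`
for class_name in dataset_df['class'].unique():
    os.makedirs(os.path.join(out_dir, class_name))

# copy each image from the balanced dataset
for index, (img_path, class_name) in enumerate(zip(dataset_df['image_pathname'], dataset_df['class'])):
    # copy the image from its original location to the new one
    out_img_path = os.path.join(out_dir, class_name, os.path.basename(img_path))
    shutil.copy(img_path, out_img_path)

    # verbose - print every 1000 iterations
    if index % 1000 == 0:
        print(f'{index + 1}/{dataset_df.shape[0]}\nFrom: {img_path}\nTo: {out_img_path}\n')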

### 1.4 Function to load the animal dataset images from the disk as a DataFrame

**Required dataset structure:**
<pre>
+ dataset_folder
  + class_folder_1
    - image_path_name_1.jpeg
    - image_path_name_2.jpeg
    ...
  + class_folder_2
    - image_path_name_1.jpeg
    - image_path_name_3.jpeg
    ...
  ...
</pre>
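The definition of `load_animal_dataset_as_dataframe` is not shown in this section. A minimal sketch of what such a function could look like is given here, assuming the folder structure above, that each class folder name is the class label, and that the dataframe uses the same `image_pathname` and `class` columns as before.

```python
import os

import pandas as pd


def load_animal_dataset_as_dataframe(dataset_folder):
    """Build a dataframe with one row per image: its full pathname and its class."""
    img_full_paths, img_classes = [], []

    for class_folder in sorted(os.listdir(dataset_folder)):
        class_dir = os.path.join(dataset_folder, class_folder)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders

        for img_filename in sorted(os.listdir(class_dir)):
            img_full_paths.append(os.path.join(class_dir, img_filename))
            img_classes.append(class_folder)  # the folder name is the class label

    return pd.DataFrame({
        'image_pathname': img_full_paths,
        'class': img_classes,
    })
```

Sorting the folder and file names keeps the row order reproducible across runs and operating systems, since `os.listdir` returns entries in arbitrary order.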

In [None]:
# load the balanced dataset
dataset_df = load_animal_dataset_as_dataframe('../datasets/animal-dataset/raw-img-balanced')

In [None]:
dataset_df

Since this function may be useful to load the _animal dataset_ in different notebooks, let's create a **Python file/module** to make it available:

**File:** `animals_utils.py`

In [None]:
import animals_utils

In [None]:
# load the balanced dataset
dataset_df = animals_utils.load_animal_dataset_as_dataframe('../datasets/animal-dataset/raw-img-balanced')

dataset_df

### 1.5 Inspect some images

**OpenCV** for Image Processing: https://pypi.org/project/opencv-python/

In [None]:
import cv2

**Read image:** https://www.askpython.com/python-modules/python-imread-opencv#:~:text=Return%20Value%3A%20cv2.,%2C%20unsupported%20or%20invalid%20format

In [None]:
# read an image
img = cv2.imread(dataset_df.loc[0, 'image_pathname'])

In [None]:
print(type(img))

**Image's Shape**
- **Color Image:** (_height_, _width_, _channels_)
- **Channel Order:** **BGR** (Blue, Green, Red), which is what `imread()` returns with its default flag, `cv2.IMREAD_COLOR`.

In [None]:
img.shape

**Image depth (number of bits)**:

In [None]:
img.min(), img.max()

Each channel stores values in the range [0, 255], i.e., 8 bits per channel. With 3 channels, it's a 24-bit color image.
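The bit depth can also be derived from the array itself instead of from its minimum and maximum values. The sketch below uses a small synthetic array (`toy_img`, a hypothetical stand-in for an image returned by `cv2.imread`) so that it runs without any image file.

```python
import numpy as np

# stand-in for a color image: height x width x 3 channels, 8-bit unsigned integers
toy_img = np.zeros((4, 6, 3), dtype=np.uint8)

bits_per_channel = toy_img.dtype.itemsize * 8  # uint8 -> 1 byte -> 8 bits
n_channels = toy_img.shape[2]                  # 3 channels
bits_per_pixel = bits_per_channel * n_channels

print(bits_per_pixel)  # 24
```

Checking `dtype` is more reliable than checking `min()` and `max()`, because a dark image may never reach the value 255 even though it is stored with 8 bits per channel.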

**Visualizing the image**

In [None]:
import matplotlib.pyplot as plt
plt.imshow(img)

Note that the colors look wrong: the **color channels** are in a _different order_ than the one Matplotlib expects (RGB). <br/>
We need to _reorganize the channels_ from **BGR** to **RGB**.

In [None]:
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
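For a 3-channel image, converting from BGR to RGB simply swaps the first and last channels, so the result has the same values as reversing the channel axis with NumPy. The toy array below is a hypothetical 1x2 image used only to illustrate this.

```python
import numpy as np

# toy 1x2 "BGR image": the first pixel is pure blue, the second is pure red
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# reversing the last axis swaps the B and R channels: BGR -> RGB
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # [0, 0, 255]  (pure blue, now in RGB order)
print(rgb[0, 1].tolist())  # [255, 0, 0]  (pure red, now in RGB order)
```

Note that the slice returns a view with negative strides; `cv2.cvtColor` returns a new contiguous array, which some OpenCV functions require.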

In [None]:
plt.imshow(img)

**Another image:**

In [None]:
img = cv2.imread(dataset_df.loc[9999, 'image_pathname'])
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img)

In [None]:
img.shape

Note that the **images' shapes are different**, so we will need to **rescale the images** to a _standard shape_ according to the _considered network's architecture_.

In [None]:
img.min(), img.max()

It's also a 24-bit color image (8 bits for each of the 3 channels).