# Deep Learning - Preprocessing

### *Facial age prediction - an SML regression problem*

# 1. References


1. What is the purpose of image preprocessing in deep learning?, [blog article](https://www.isahit.com/blog/what-is-the-purpose-of-image-preprocessing-in-deep-learning#:~:text=Even%20though%20geometric%20transformations%20of,features%20crucial%20for%20subsequent%20processing)
2. Balancing an imbalanced dataset with keras image generator, [Stack Overflow question](https://stackoverflow.com/questions/41648129/balancing-an-imbalanced-dataset-with-keras-image-generator)
3. `tf.keras.utils.image_dataset_from_directory`, [link](https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory)


# 2. Initial Treatment

## 2.1. Configuration and library imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install imageio --quiet
!pip install kaggle --quiet
!pip install imagehash --quiet

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets
from tensorflow.keras.utils import image_dataset_from_directory  # location documented in [3]
from tensorflow.keras import Sequential, layers, initializers, regularizers, optimizers, metrics
import imageio.v2 as imageio
from google.colab import files

from sklearn.metrics import pairwise_distances
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

import os
import time
import shutil
import random 
import zipfile
import math
from collections import defaultdict
import imagehash
import itertools
import pickle

import gdown # To download zip file from URL
from PIL import Image

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib.colors import ListedColormap
import seaborn as sns

import numpy as np
import pandas as pd

## 2.2. Auxiliary functions

Collection of all user-defined functions used in this notebook.

In [None]:
def create_image_grid_per_group(image_groups, images_per_row, show_class=False):
  '''
  Creates a grid of images for each group of images in a dictionary.

  Args:
  --
    image_groups (dict): A dictionary where the keys are group labels and the values are lists of paths to the images in each group.
    images_per_row (int): The number of images to display in each row of the grid.
    show_class (bool): Parameter to show or not the class of the image

  Returns:
  --
    None. The resulting image grids are saved to .jpg files and displayed on screen.
  '''

  for group_label, image_paths in image_groups.items():
    print("--------------------------------------------------------------------")
    print(group_label)
    if show_class:
        groups_list = [path.split('/')[-2] for path in image_paths] 
        print(groups_list)
    print("--------------------------------------------------------------------")
    image_files = [f for f in image_paths if os.path.splitext(f)[1].lower() in ('.png', '.jpg', '.jpeg')]

    img_sizes = [Image.open(f).size for f in image_files]
    max_width = max([size[0] for size in img_sizes])
    max_height = max([size[1] for size in img_sizes])

    num_rows = math.ceil(len(image_files) / images_per_row)
    grid_size = (max_width * images_per_row, max_height * num_rows)
    grid_image = Image.new('RGB', grid_size, color='white')

    for i, image_file in enumerate(image_files):
        img = Image.open(image_file)

        x = (i % images_per_row) * max_width
        y = (i // images_per_row) * max_height
        grid_image.paste(img, (x, y))

    grid_image.save(f'image_grid_{group_label}.jpg')
    grid_image.show()

In [None]:
def save_images(dataset, output_folder):
    '''
    Saves a set of images in a specified folder.

    Args:
    --
    dataset (tf.data.Dataset): Output from the function image_dataset_from_directory().
    output_folder (str): Path to the folder where we want to store the images of the dataset.

    Returns:
    --
      None. The images from the dataset are copied to the output_folder.
    '''

    # Initialize timer
    t0 = time.time()
    
    i = 0
    print('Saving images for', output_folder.split('/')[-1])
    for file_path in dataset.file_paths:
        if i % 500 == 0:
            print('Saved', i, 'images')
        class_age = str(int(file_path.split('/')[-2]))
        file_name = file_path.split('/')[-1]

        folder_path = os.path.join(output_folder, class_age)
        os.makedirs(folder_path, exist_ok=True)

        image_path = os.path.join(folder_path, file_name)
        shutil.copy(file_path, image_path)
        i += 1
    print('Saved all images - Total', i)
    print("Saved in %0.3f seconds" % (time.time() - t0))
    print("Waiting 5 minutes to sync all files in Drive")
    time.sleep(60 * 5)
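
One side effect of `save_images()` is worth keeping in mind: `str(int(...))` strips the zero padding from the class folder names, so a folder such as `076` would be written as `76`. According to the documentation in [3], `image_dataset_from_directory()` orders inferred classes alphanumerically unless `class_names` is given, so unpadded names no longer sort in numeric order. The sketch below illustrates this with made-up folder names; `index_to_age` is a hypothetical helper and not part of this notebook.

```python
# Made-up folder names, following the zero-padded pattern of the source dataset
padded = ['001', '002', '010', '075']
# Same names after the str(int(...)) conversion used in save_images()
unpadded = [str(int(name)) for name in padded]

print(sorted(padded))    # ['001', '002', '010', '075']
print(sorted(unpadded))  # ['1', '10', '2', '75']


def index_to_age(class_names):
    """Map an integer label (position in class_names) back to the age it stands for."""
    return {index: int(name) for index, name in enumerate(class_names)}


# Hypothetical lookup to recover ages from integer labels
age_lookup = index_to_age(sorted(unpadded))
print(age_lookup)  # {0: 1, 1: 10, 2: 2, 3: 75}
```

If the integer labels are later used as regression targets, a lookup of this kind (or the dataset's `class_names` attribute) may be needed to recover the actual ages.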

## 2.3. Configuring paths and loading files

Load the files generated in the exploration notebook.

In [None]:
# Define the path, where the dataset should be saved
vm_path = "/content"
path = "/content/drive/MyDrive/FacialAgeProject/"

data_path = os.path.join(path, 'data')
metadata_path = os.path.join(path, 'metadata')
dataset_path = os.path.join(data_path, "facial_age_dataset_unsplit/")

duplicated_path = os.path.join(metadata_path, 'ahash_duplicated_keys.pkl')
preprocessed_path = os.path.join(data_path, 'facial_age_dataset_preprocessed')

In [None]:
with open(duplicated_path, "rb") as openfile:
    duplicated_keys = pickle.load(openfile)

duplicated_paths = [path for path_list in duplicated_keys.values() for path in path_list]

# 3. Preprocessing

Preprocessing is a crucial part of computer vision projects because it improves image quality and makes the data suitable for training deep learning models [1]. During exploration, we found that our dataset is unbalanced: the younger age groups are strongly represented, and the proportion of images decreases for the older age groups. We also found duplicates, some of which appear in groups of three or four. Both issues must be handled during preprocessing, since we want all images and classes to be treated equally and do not want duplicates to carry undue weight. After cleaning the dataset, we split it into a temporary training set and the final test set.

## 3.1. Removing the duplicates from the main folder

To remove duplicates, we chose the set identified by the ahash method. We preferred ahash over phash and the geometric distance method because it identified a greater number of duplicates. In addition, as mentioned in the explore notebook, ahash is better suited to detecting near-duplicates in image sets that have not undergone significant transformations, which applies to our scenario. Some of the images flagged by ahash may not be true duplicates, so this choice deletes more unique images than strictly necessary; we do not consider this a problem, as the dataset remains substantial and well represented. The next step is to remove every pair of duplicates, as well as the groups of three and four, from the primary folder. Because some duplicates have been assigned to different classes, we cannot know the actual age of the person, so we discard all copies rather than keeping one. The removed files are stored temporarily in a separate folder.
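
As a reminder of the idea behind average hashing, here is a minimal pure-Python sketch. It assumes the image has already been converted to grayscale and shrunk to a small grid (8x8 is a common choice); the tiny 2x2 grids and the function names are made up for illustration. It is not the code that produced `ahash_duplicated_keys.pkl`.

```python
def average_hash_bits(gray_pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the grid mean, else 0.

    gray_pixels is a list of rows of grayscale intensities (0-255).
    """
    flat = [value for row in gray_pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count the positions where two equally long bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Made-up 2x2 "images": a base image, a uniformly brighter copy, and an inverted pattern
original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]
different = [[200, 10], [30, 220]]

print(hamming_distance(average_hash_bits(original), average_hash_bits(brighter)))   # 0
print(hamming_distance(average_hash_bits(original), average_hash_bits(different)))  # 4
```

A uniform brightness shift leaves the hash unchanged, which is why this kind of hash groups lightly altered copies together, and also why it can occasionally group images that are not true duplicates.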

In [None]:
new_folder = os.path.join(data_path, 'similar_images')
# Create the new folder if it doesn't exist
os.makedirs(new_folder, exist_ok=True)

for file_path in duplicated_paths:
    # Get the parent directory of the file
    age_folder = file_path.split('/')[-2]
    file_name = file_path.split('/')[-1]
    # Create the destination folder path
    destination_folder_path = os.path.join(new_folder, age_folder)
    # Create the destination folder if it doesn't exist
    os.makedirs(destination_folder_path, exist_ok=True)
    # Create the destination file path
    destination_file_path = os.path.join(destination_folder_path, file_name)
    # Moving similar files identified in hash method
    shutil.move(file_path, destination_file_path)

print("Waiting 5 minutes to sync all files in Drive")
time.sleep(60*5)

Waiting 5 minutes to sync all files in Drive
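
Before moving files on Drive, it can help to build the list of moves first and check that no two sources map to the same destination. The sketch below does this under the assumption that paths are POSIX-style, as on Colab; `plan_moves`, `find_collisions`, and the example paths are hypothetical and only illustrate the idea.

```python
import posixpath
from collections import Counter


def plan_moves(file_paths, new_root):
    """Pair each source path with a destination that keeps its class folder and file name."""
    plan = []
    for src in file_paths:
        class_folder = posixpath.basename(posixpath.dirname(src))
        file_name = posixpath.basename(src)
        plan.append((src, posixpath.join(new_root, class_folder, file_name)))
    return plan


def find_collisions(plan):
    """Return the destinations that more than one source would be moved to."""
    counts = Counter(dst for _, dst in plan)
    return [dst for dst, n in counts.items() if n > 1]


# Made-up example paths
example_paths = ['/data/unsplit/001/a.png', '/data/unsplit/020/b.png']
plan = plan_moves(example_paths, '/data/similar_images')
print(plan[0][1])             # /data/similar_images/001/a.png
print(find_collisions(plan))  # []
```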


In [None]:
# Verify the moved files
verification = {}
for key in duplicated_keys.keys():
    verification[key] = [path.replace('facial_age_dataset_unsplit', 'similar_images') for path in duplicated_keys[key]]

# Calculate the max size of possible duplicated images
images_per_row = max([len(count_keys) for count_keys in verification.values()])

# Creating a visual inspection for the hashes
create_image_grid_per_group(image_groups=verification, images_per_row=images_per_row, show_class=True)


## 3.2. Removing less represented classes (>75)


It is important to maintain a balanced dataset when training a deep learning model so that the model's predictions are not skewed towards particular classes. In our case, the classes for individuals over 75 years of age are poorly represented, which can lead to a biased model that cannot accurately predict the age of individuals in that age group.

Moreover, because the dataset is unbalanced, the model may give more weight to the majority classes, leading to poorer performance on the minority classes. To avoid this, we remove the images that belong to the poorly represented classes, that is, individuals over 75 years of age, which should help create a more balanced dataset and improve the model's performance. Once again, we temporarily store the removed files in a separate folder.

**Note:** Early in the project, we explored using ImageDataGenerator to address the imbalance by undersampling the overrepresented classes and oversampling the underrepresented classes through data augmentation. We eventually abandoned this approach because it alters the class distributions: the larger class keeps a wide range of variation, while the smaller class ends up consisting of many similar images that differ only by minor affine transforms. As a result, the smaller class occupies a significantly smaller area of the image space than the majority class [2].
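
To check which classes are sparse before removing anything, one option is to count the image files in each class folder. The sketch below assumes one sub-folder per age class, as in our dataset; the function names and the `min_images` threshold are hypothetical. In this notebook the cut-off is an age (above 75), not a count.

```python
import os


def count_images_per_class(root, extensions=('.png', '.jpg', '.jpeg')):
    """Return {class_folder_name: number_of_image_files} for each sub-folder of root."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if not os.path.isdir(class_dir):
            continue  # Ignore stray files at the top level
        counts[entry] = sum(
            1 for name in os.listdir(class_dir)
            if os.path.splitext(name)[1].lower() in extensions
        )
    return counts


def sparse_classes(counts, min_images):
    """Return the class names that have fewer than min_images images."""
    return [name for name, n in counts.items() if n < min_images]
```

For example, `sparse_classes(count_images_per_class(dataset_path), min_images=20)` would list the folders with fewer than 20 images, where 20 is an arbitrary illustrative value.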

In [None]:
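# Create a list of folder paths for ages above 75 years old
above_75_images_path = []
for folder_path, _, _ in os.walk(dataset_path):
    # The folder name is the age class; the dataset root yields an empty name and is skipped
    age_class = os.path.basename(folder_path)
    if age_class.isdigit() and int(age_class) > 75:
        above_75_images_path.append(folder_path)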

print(above_75_images_path)

['/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/076', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/077', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/078', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/079', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/080', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/081', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/082', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/083', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/084', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/085', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/086', '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/087', '/content/drive/MyDrive/FacialAgeProject/data/facia

In [None]:
for folder_path in above_75_images_path:
    # Create the destination folder path
    destination_folder = folder_path.replace('facial_age_dataset_unsplit', 'above_75_images')
    # Create the parent of the destination folder if it doesn't exist
    os.makedirs(os.path.dirname(destination_folder), exist_ok=True)
    # Moving folders
    print(f'Moving folder: {folder_path} to {destination_folder}')
    shutil.move(folder_path, destination_folder)
print('All folders with age >75 moved!')
print("Waiting 5 minutes to sync all files in Drive")
time.sleep(60*5)

Moving folder: /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/076 to /content/drive/MyDrive/FacialAgeProject/data/above_75_images/076
Moving folder: /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/077 to /content/drive/MyDrive/FacialAgeProject/data/above_75_images/077
Moving folder: /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/078 to /content/drive/MyDrive/FacialAgeProject/data/above_75_images/078
Moving folder: /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/079 to /content/drive/MyDrive/FacialAgeProject/data/above_75_images/079
Moving folder: /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/080 to /content/drive/MyDrive/FacialAgeProject/data/above_75_images/080
Moving folder: /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_unsplit/081 to /content/drive/MyDrive/FacialAgeProject/data/above_75_images/081
Moving folder: /content/drive/MyDrive/FacialAg

## 3.3. Splitting the dataset into a temporary training set and a definitive test set




Next, we use the `image_dataset_from_directory()` function [3] to divide the cleaned image folder into two sets: a temporary training set and a final test set. We then use the previously defined `save_images()` function to copy the images of each set into its designated folder.

We save both sets because the temporary training set will later be subdivided into the definitive training set and a validation set in the model_handcrafted notebook. The validation set lets us assess and improve our models on a representative sample that is not used directly for training. The test set is stored so that it can be used to evaluate our final models.

In [None]:
# Divide the data into a temporary set (to be split later into training and validation sets) and a test set.
# With subset="both", the function returns the (training, validation) pair;
# here the "validation" subset serves as our test set.
train_ds, test_ds = image_dataset_from_directory(
    dataset_path,
    labels='inferred',
    label_mode='int',
    color_mode='rgb',
    batch_size=64,
    image_size=(200,200),
    shuffle=True,
    seed=0,
    validation_split=0.2,
    subset="both"
    )

Found 8823 files belonging to 75 classes.
Using 7059 files for training.
Using 1764 files for validation.
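
The counts reported above are consistent with truncating the held-out size to an integer and taking it from the end of a seeded shuffle. The sketch below reproduces that arithmetic in plain Python; it is an illustration under this assumption, not the Keras implementation, and the function names are made up.

```python
import random


def split_counts(num_files, validation_split):
    """Return (train_count, held_out_count), truncating the held-out share to an integer."""
    num_held_out = int(validation_split * num_files)
    return num_files - num_held_out, num_held_out


def shuffled_split(items, validation_split, seed):
    """Shuffle items with a fixed seed, then hold out the last share of them."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) - int(validation_split * len(items))
    return items[:cut], items[cut:]


print(split_counts(8823, 0.2))  # (7059, 1764)
```

Fixing the seed makes the split reproducible, which matters here because the two subsets are copied to separate folders and reused by later notebooks.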


In [None]:
# Create directories for the train and test datasets
!mkdir /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_preprocessed -p;
!mkdir /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_preprocessed/train -p;
!mkdir /content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_preprocessed/test -p;

In [None]:
# Paths where the train and test images are saved
test_path = '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_preprocessed/test'
train_path = '/content/drive/MyDrive/FacialAgeProject/data/facial_age_dataset_preprocessed/train'

In [None]:
# Save images in the temporary training folder, to be split into train and validation sets afterwards
save_images(train_ds, output_folder=train_path)

Saving images for train
Saved 0 images
Saved 500 images
Saved 1000 images
Saved 1500 images
Saved 2000 images
Saved 2500 images
Saved 3000 images
Saved 3500 images
Saved 4000 images
Saved 4500 images
Saved 5000 images
Saved 5500 images
Saved 6000 images
Saved 6500 images
Saved 7000 images
Saved all images - Total 7059
Saved in 59.238 seconds
Waiting 5 minutes to sync all files in Drive


In [None]:
# Save images in test folder
save_images(test_ds, output_folder=test_path)

Saving images for test
Saved 0 images
Saved 500 images
Saved 1000 images
Saved 1500 images
Saved all images - Total 1764
Saved in 15.580 seconds
Waiting 5 minutes to sync all files in Drive
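
Once both folders are written, a quick sanity check is to confirm that the file counts match the split sizes and that no image appears in both sets. The sketch below is one way to do this with the standard library; `list_relative_files` and `check_split` are hypothetical helpers. It compares `class/file` relative paths, which assumes file names are unique within each class folder.

```python
import os


def list_relative_files(root):
    """Return the set of file paths under root, expressed relative to root."""
    found = set()
    for folder, _, file_names in os.walk(root):
        for name in file_names:
            found.add(os.path.relpath(os.path.join(folder, name), root))
    return found


def check_split(train_root, test_root):
    """Summarise the number of files in each set and any files present in both."""
    train_files = list_relative_files(train_root)
    test_files = list_relative_files(test_root)
    return {
        'train': len(train_files),
        'test': len(test_files),
        'overlap': sorted(train_files & test_files),
    }
```

Calling `check_split(train_path, test_path)` should then report the same totals as the save logs and an empty `overlap` list.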


# 4. Conclusion


In conclusion, preprocessing is a crucial aspect of computer vision projects as it enhances image quality and makes the data appropriate for training deep learning models. Throughout this notebook, we performed the following actions:

*   Removed all the duplicates identified by the ahash method, some of which appear in groups of three and four, and stored the removed images in a separate folder.

*   Removed the poorly represented classes (ages above 75), having found during exploration that the dataset was unbalanced, and stored the removed images in a separate folder.

*   Divided the cleaned image folder into a temporary training set and a final test set, and stored the corresponding images in designated folders.

We also briefly explored removing outliers as a further way to improve the quality of the dataset. However, due to time constraints and the lack of a satisfactory method, we ultimately decided against it.

In the upcoming model_handcrafted notebook, we plan to subdivide our temporary training set into a definitive training set and a validation set. We will then begin experimenting and handcrafting multiple CNN models.