# Face classifier - Exploratory Data Analysis

In this notebook I shall perform the exploratory data analysis (EDA) of the [Kaggle dataset](https://www.kaggle.com/datasets/nipunarora8/age-gender-and-ethnicity-face-data-csv) used in this project. The data set contains images of people together with three attributes of each person: age, gender, and ethnicity.

## Set up

In [None]:
# 3rd party imports
import pandas as pd
import numpy as np
from numpy import random as rd
import seaborn as sns
import matplotlib.pyplot as plt

# Local imports
from facecls import fcaux

## Load data

In [None]:
data = pd.read_csv("data/age_gender.csv")

## EDA

### What does the data look like?

Let's first get some basic information about the data set: its shape, what individual examples look like, what the data types are, whether there are NULL values, etc.

In [None]:
# Shape of the data set
data.shape

In [None]:
# Get the first five specific examples
data.head()

In [None]:
# Let's print some info about data types, NULL value counts, etc.
data.info()

From this very basic analysis it can be seen that the data set contains 23705 examples with five columns each. The first three columns correspond to the age, the ethnicity, and the gender of the person in the image, all represented as integers. The fourth column is a string containing the image name and the fifth is a stringified version of the image itself.

I found by coincidence that some images occur more than once. This is not desired, so the repeated occurrences should be removed.

In [None]:
# The next line removes rows whose pixel string has already occurred
# in an earlier row (the first occurrence is kept)
data = data[~data["pixels"].duplicated()]
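
Before dropping rows, it can be useful to check how many duplicates there actually are. The following is a minimal sketch on a toy data frame (the values are invented for illustration and are not taken from the Kaggle data). It shows that `duplicated()` by default flags only the second and later occurrences, so the first copy of each image survives the filter.

```python
import pandas as pd

# Toy stand-in for the real data set; all values are invented for illustration.
toy = pd.DataFrame({
    "age": [25, 25, 40, 31],
    "pixels": ["1 2 3 4", "1 2 3 4", "9 8 7 6", "5 5 5 5"],
})

# True for every row whose pixel string already appeared in an earlier row
dup_mask = toy["pixels"].duplicated()
n_duplicates = int(dup_mask.sum())

# Keep only the first occurrence of each pixel string
deduplicated = toy[~dup_mask]
print(f"{n_duplicates} duplicate(s) removed, {len(deduplicated)} rows left")
```

Note that the original row labels are kept after filtering, so the index of the filtered frame is no longer contiguous.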

Next, let's print some images. To do so, we first need to process the fifth data column: as seen above, this column contains strings of space-separated pixel values (integers between 0 and 255). Each string must be split into individual values, each value must be cast to the integer data type, and the resulting 1D array must be reshaped into a 2D array. The splitting and casting are done by the function fcaux.pxlstring2pxlvec and the reshaping by fcaux.pxlvec2pxlarray.
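
To make the three steps concrete, here is a minimal sketch of how such a conversion could look. The function names `pixel_string_to_vector` and `pixel_vector_to_array` are hypothetical and are not the actual `fcaux` helpers (which take the data frame and an index as arguments and may be implemented differently). The sketch also assumes that the images are square.

```python
import math

import numpy as np


def pixel_string_to_vector(pixel_string):
    """Split a space-separated pixel string and cast the values to integers."""
    return np.array([int(v) for v in pixel_string.split()], dtype=np.uint8)


def pixel_vector_to_array(pixel_vector):
    """Reshape a 1D pixel vector into a 2D image (assumption: square image)."""
    side = math.isqrt(pixel_vector.size)
    if side * side != pixel_vector.size:
        raise ValueError("pixel vector length is not a perfect square")
    return pixel_vector.reshape(side, side)


# Tiny invented example: a 2x2 "image"
example = "0 64 128 255"
image = pixel_vector_to_array(pixel_string_to_vector(example))
print(image.shape)
```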

The sample of images plotted below is a hardcoded selection of 15 images, consisting of three subsets with 5 images each. Each of the three subsets represents a different age/gender group.

In [None]:
fig, axs = plt.subplots(3,5, figsize=(10,6))

for i in range(3):
    for j in range(5):
        img = fcaux.pxlvec2pxlarray(fcaux.pxlstring2pxlvec(data,i*2000+j))
        axs[i,j].imshow(img, interpolation = "nearest", cmap="gray")
        axs[i,j].axis("off")
        axs[i,j].set_title(f"Image #{i*2000+j}")

fig.suptitle("Example images")
plt.show()

Looking closely, we can see that there are imperfections in this data set: the middle row is supposed to contain images of females; however, image #2003 shows a male. We learn that the data set contains wrong labels.

To get a sense of how common wrong labels are in this data set, let's recreate the above plot array with a random collection of images that changes every time the next cell is executed. To get more info, let's add the age (a), ethnicity (e) and gender (g) information to the title of each picture.

In [None]:
fig, axs = plt.subplots(3,5, figsize=(10,6))

for i, row in enumerate(sorted(rd.choice(range(10), size=3, replace=False))):
    for j, col in enumerate(sorted(rd.choice(range(2000), size=5, replace=False))):
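        # Use the randomly drawn column offset (col), not the loop counter (j),
        # so that the plotted image matches the labels shown in its title
        img = fcaux.pxlvec2pxlarray(fcaux.pxlstring2pxlvec(data,row*2000+col))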
        axs[i,j].imshow(img, interpolation = "nearest", cmap="gray")
        axs[i,j].axis("off")
        current_img = data.iloc[row*2000+col]
        axs[i,j].set_title(f"#{row*2000+col}: a{current_img['age']}, e{current_img['ethnicity']}, g{current_img['gender']}")

fig.suptitle("Example images")
plt.show()
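
If a particular random draw needs to be revisited later (for example, to look at a suspicious image again), the draw can be made reproducible with a seeded generator. This is a sketch only: it assumes the same layout of 10 blocks with 2000 images each as in the cell above, and the seed value is arbitrary.

```python
import numpy as np

# A seeded generator returns the same draw on every run (seed chosen arbitrarily)
rng = np.random.default_rng(seed=0)

positions = []
# Draw 3 distinct blocks of 2000 images, then 5 distinct offsets within each
for row in sorted(rng.choice(10, size=3, replace=False)):
    for col in sorted(rng.choice(2000, size=5, replace=False)):
        positions.append(int(row) * 2000 + int(col))

print(positions)
```

The resulting list of positions can then be used in place of the freshly drawn values in the plotting loop.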

It becomes evident that there are several mislabelings. While wrong gender labels and vastly wrong age labels are quite easily spotted, determining the correctness of the ethnicity labels is tricky for several reasons: (a) one would need clear criteria by which ethnicity is defined, (b) there are people with mixed ethnicities, etc.

<div class="alert alert-block alert-warning">
    <b>CONCLUSION</b>
    
Due to the many wrong labels, this data set is far from optimal: machine learning models trained on it will be confused by the wrong labels. As this project is more about the methods than about the actual result (I am not trying to sell a product in the end), I will go on with this data set, but it shall be noted that a better-quality data set would of course be preferable.
</div>

Let's randomly sample 15 different images and save them to disk:

In [None]:
# 0 = Caucasian
# 1 = Black
# 2 = Asian
# 3 = Indian
# 4 = Latino

sample_imgs = data.sample(n=15, random_state=42)

fig, axs = plt.subplots(3,5, figsize=(10,6))

for i in range(3):
    for j in range(5):
        img = fcaux.pxlvec2pxlarray(fcaux.pxlstring2pxlvec(sample_imgs,sample_imgs.index[i*5+j]))
        axs[i,j].imshow(img, interpolation = "nearest", cmap="gray")
        axs[i,j].axis("off")
        # Same index as used for the image above (5 images per row)
        axs[i,j].set_title(f"Image #{sample_imgs.index[i*5+j]}")

fig.suptitle("Example images")
plt.savefig("imgs/random_face_images.png")
plt.show()

### Statistical data analysis

Next, let's analyze the value distributions in the three label columns age, ethnicity and gender to find out whether there are imbalances and, if so, where.

In [None]:
sns.set_style("whitegrid")

fig, axs = plt.subplots(1,3, figsize=(12,4))
sns.histplot(data = data, 
             x="age", 
             binrange=(0,120),
             bins=30,
             ax=axs[0]
             )
axs[0].set_title("Age distribution")

sns.countplot(data = data, 
             x="ethnicity", 
             ax=axs[1]
             )
axs[1].set_title("Ethnicity distribution")

sns.countplot(data = data, 
             x="gender", 
             ax=axs[2]
             )
axs[2].set_title("Gender distribution")
plt.tight_layout()
plt.savefig("imgs/label_distributions.png")
plt.show()

Clearly, the age and ethnicity labels are significantly imbalanced while the gender label is quite balanced. This makes the gender label attractive for a simple model as it does not require very sophisticated preprocessing to deal with class imbalances.
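
The visual impression can be backed up with numbers. The following sketch computes the class shares and a simple imbalance ratio (most frequent class divided by least frequent class) for a label column. The toy labels are invented for illustration; on the real data one would pass `data["ethnicity"]` or `data["gender"]` instead. The ratio is just one possible summary of imbalance, not a measure used elsewhere in this project.

```python
import pandas as pd

# Toy label column; the counts are invented for illustration.
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 3], name="ethnicity")

# Relative frequency of each class, ordered by class label
shares = labels.value_counts(normalize=True).sort_index()

# Ratio of the most frequent to the least frequent class
# (1.0 would mean perfectly balanced classes)
counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```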