# **The Problem: Automatic Apparent Age Estimation**



# Auxiliary and introductory material

Getting Started with TensorFlow in Google Colaboratory
Intro to Google Colab:
https://www.youtube.com/watch?v=inN8seMm7UI

Installing Tensorflow (CPU or GPU):
https://www.youtube.com/watch?v=PitcORQSjNM

# Pre-requisites:
Installing tensorflow-gpu (GPU) and OpenCV.
Check GPU usage instructions [here](https://research.google.com/colaboratory/faq.html#gpu-availability)

# IMPORTANT:
**1:** THE CODE WAS TESTED ON TENSORFLOW VERSION 2.4.0

**2:** Sometimes the code downloads data/models from our server. If the server is unstable, you may get a "file not found" error. In this case, please keep trying! If the error persists, please contact me.

In [None]:
#!pip install tensorflow-gpu==2.4.0
#!pip install opencv-python
#!pip install h5py

# to enable Colab-GPU version:
# 1) Runtime -> reset runtime
# 2) Runtime -> Change runtime type


# Downloading and decompressing the Appa-Real Age Dataset [(source)](http://chalearnlap.cvc.uab.es/challenge/13/track/13/description/)

- By default, RGB images (cropped faces) are in the range of [0, 255], and labels are in the range of ~0.9 to ~90 (years old).
- The data is divided into train, validation and test sets.
- Metadata is also provided:
  - gender: male / female 
  - ethnicity: asian / afroamerican / caucasian
  - facial expression: neutral / slightlyhappy / happy / other


In [None]:
# downloading the data
from zipfile import ZipFile
!wget https://data.chalearnlap.cvc.uab.cat/Colab_2021/app_data.zip

# decompressing the data

with ZipFile('app_data.zip', 'r') as zip:
    zip.extractall()
    print('Data decompressed successfully')

# removing the .zip file after extraction to clean space
!rm app_data.zip


# Loading the train/validation data, and re-scaling the labels to [0..1]
- X_[train,valid,test] = Face images
- Y_[train,valid,test] = Ground truth 
- M_[train,valid,test] = Metadata (gender, ethnicity, facial expression)

In [None]:
import numpy as np

# loading the train data
X_train = np.load('./data/data_train.npy')
Y_train = np.load('./data/labels_train.npy')
M_train = np.load('./data/meta_data_train.npy')

# loading the validation data
X_valid = np.load('./data/data_valid.npy')
Y_valid = np.load('./data/labels_valid.npy')
M_valid = np.load('./data/meta_data_valid.npy')

# loading the test data
X_test = np.load('./data/data_test.npy')
Y_test = np.load('./data/labels_test.npy')
M_test = np.load('./data/meta_data_test.npy')

# train labels are real numbers, ranging from ~0.9 to ~89 (years old);
# we will re-scale the labels to [0,1] by using a normalization factor of 100,
# assuming there is no sample with age > 100.
Y_train = Y_train/100
Y_valid = Y_valid/100
# Y_test = Y_test/100 # -> we don't normalize the test labels as we will evaluate
# them using the raw data, i.e., the apparent age values

print('Train data size and shape', X_train.shape)
print('Train labels size and shape', Y_train.shape)
print('Train metadata size and shape', M_train.shape)
print('----')
print('Valid data size and shape', X_valid.shape)
print('Valid labels size and shape', Y_valid.shape)
print('Valid metadata size and shape', M_valid.shape)
print('----')
print('Test data size and shape', X_test.shape)
print('Test labels size and shape', Y_test.shape)
print('Test metadata size and shape', M_test.shape)


# Visualizing some training samples
Next, we multiply the normalized age labels by 100 to show the original age values on top of each sample.

In [None]:
import cv2
import random
from matplotlib import pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 20))
for i, ax in enumerate(axes):
    # random.randint includes both endpoints, so the upper bound is len - 1
    idx = random.randint(0, len(X_train) - 1)
    ax.imshow(cv2.cvtColor(X_train[idx, :, :, :], cv2.COLOR_BGR2RGB))
    ax.set_title(Y_train[idx]*100)
    ax.set(xlabel=[M_train[idx][0], M_train[idx][1], M_train[idx][2]])


# Visualizing the age distribution of Train data

In [None]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 4))
fig.suptitle('Age distribution', fontsize=14, fontweight='bold')

# labels are multiplied by 100 to show the original values
ax1.hist(Y_train*100, bins=50)
ax1.set_title('Y_train labels')
ax1.set(xlabel='Apparent age', ylabel='Num. of samples')
ax1.set_xlim([0, 100])

ax2.hist(Y_valid*100, bins=50)
ax2.set_title('Y_valid labels')
ax2.set(xlabel='Apparent age', ylabel='Num. of samples')
ax2.set_xlim([0, 100])


# Visualizing the distributions of metadata (Train data)

In [None]:
gender = []
ethnicity = []
emotion = []
for sample in M_train:
    gender.append(sample[0])
    ethnicity.append(sample[1])
    emotion.append(sample[2])

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 4))
fig.suptitle('Metadata distribution', fontsize=14, fontweight='bold')

ax1.hist(gender)
ax2.hist(ethnicity)
ax3.hist(emotion)


# Visualizing the age distribution per Ethnicity
- First, we define a function to visualize the age distribution per ethnicity. Then, we visualize the distributions of train / validation / test sets.

In [None]:
def compute_hist_per_ethnicity(y_data, metadata, set):

    vec_as = []
    vec_af = []
    vec_ca = []
    for i in range(0, len(y_data)):
        if(metadata[i][1] == 'asian'):
            vec_as.append(y_data[i])
        if(metadata[i][1] == 'afroamerican'):
            vec_af.append(y_data[i])
        if(metadata[i][1] == 'caucasian'):
            vec_ca.append(y_data[i])

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 3))
    # the title is built as a single string rather than passed as a list
    fig.suptitle('Age distribution per Ethnicity: ' + set,
                 fontsize=14, fontweight='bold')

    ax1.hist(vec_as, bins=50)
    ax1.set_xlim([0, 100])
    ax1.set(xlabel='Asian', ylabel='Num. of samples')

    ax2.hist(vec_af, bins=50)
    ax2.set_xlim([0, 100])
    ax2.set(xlabel='Afroamerican', ylabel='Num. of samples')

    ax3.hist(vec_ca, bins=50)
    ax3.set_xlim([0, 100])
    ax3.set(xlabel='Caucasian', ylabel='Num. of samples')


In [None]:
# train set
compute_hist_per_ethnicity(Y_train*100, M_train, 'Train set')

# validation set
compute_hist_per_ethnicity(Y_valid*100, M_valid, 'Validation set')

# test set
# note, we do not multiply 'Y_test' by 100 because it was not normalized
# to be in the range of [0,1] as the train and validation sets.
compute_hist_per_ethnicity(Y_test, M_test, 'Test set')


# Preprocessing the data (face images)
- Later, we will define our model based on ResNet50 (our backbone). ResNet50 expects inputs that have been normalized in a way that changes the range of the input images. Thus, to be aligned with the ResNet50 input, we preprocess our input images using the respective 'preprocess_input' function. Later, if you decide to use another model as backbone (rather than ResNet), you may skip or replace the following preprocessing stage.

In [None]:
from tensorflow.keras.applications.resnet50 import preprocess_input

# train
for i in range(0, X_train.shape[0]):
    x = X_train[i, :, :, :]
    x = np.expand_dims(x, axis=0)
    X_train[i, ] = preprocess_input(x)

# validation
for i in range(0, X_valid.shape[0]):
    x = X_valid[i, :, :, :]
    x = np.expand_dims(x, axis=0)
    X_valid[i, ] = preprocess_input(x)

# test
for i in range(0, X_test.shape[0]):
    x = X_test[i, :, :, :]
    x = np.expand_dims(x, axis=0)
    X_test[i, ] = preprocess_input(x)


# Downloading the ResNet50 model pre-trained on Faces
We are using ResNet50 pre-trained on Faces (source [here](https://github.com/ox-vgg/vgg_face2))

In [None]:
# downloading the data
!wget https://data.chalearnlap.cvc.uab.cat/Colab_2021/model.zip

# decompressing the data
with ZipFile('model.zip', 'r') as zip:
    zip.extractall()
    print('Model decompressed successfully')

# removing the .zip file after extraction  to clean space
!rm model.zip


# Loading the pre-trained model
- You can see the data we have downloaded and the file structure of Colab by clicking on 'Files', on the left side of this interface.



In [None]:
import h5py
import tensorflow as tf
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau


# loading the pretrained model
model = tf.keras.models.load_model('./model/weights.h5')

# print the model summary
# print(model.summary())


# Adapting the model to our needs
- In summary, we will ignore the last layer 'classifier_low_dim' and add a few other layers on top of our backbone. Here, we also define the activation function used as output of the last FC layer (Sigmoid, in this case).

In [None]:
# Using the FC layer before the 'classifier_low_dim' layer as feature vector
fc_512 = model.get_layer('dim_proj').output

# adding a dropout layer to minimize overfitting problems
dp_layer = Dropout(0.5)(fc_512)

# adding a hidden FC layer to learn hidden representations
fc_64 = Dense(64, activation='relu', name='f_64')(dp_layer)

# Including an additional FC layer with sigmoid activation, used to regress
# the apparent age
output = Dense(1, activation='sigmoid', name='predict')(fc_64)

# building and printing the final model
model = Model(inputs=model.get_layer('base_input').output, outputs=output)
#print(model.summary())

# Freezing the first layers to allow the fine-tuning of the last FC layers (only)
- Next, we set each layer to be trainable or not, and (optionally) print whether each layer is set to trainable = True or False.


In [None]:
counter = 0
for layer in model.layers:
    if counter <= 174:
        layer.trainable = False
    else:
        layer.trainable = True
    #print(counter, layer.name, layer.trainable)
    counter += 1


# Printing the MODEL (summary) we have just defined

In [None]:
# print(model.summary())


# IMPORTANT: Mounting your google drive to save your results
- Colab gives you LIMITED GPU access. Thus, it may kill your training process if you exceed a limited number of training hours. To allow you to save your model while training, you can mount your google drive, as detailed next. This way, if the process is killed, you can (in a new session) load your checkpoints (trained model, from your google drive) and, for example, continue training or make predictions with the model you obtained (even if trained for a few epochs).
- In the following examples, **the best model (based on validation loss) is saved in my google drive inside a "/temp/" directory. You will need to adapt this path to your case.**
- To save time, and to allow you to quickly 'play' and run the notebook, we have pre-trained some models, which are loaded (or not) based on some boolean variables (later, you will need to change/adapt this code to achieve the goals of this course).

In [None]:
# --------------------------
MOUNT_GOOGLE_DRIVE = False
# --------------------------

if(MOUNT_GOOGLE_DRIVE == True):
    from google.colab import drive
    drive.mount('/content/gdrive')
    # Note, the default path will be: '/content/gdrive/MyDrive/'
    # In my case, the final path will be: '/content/gdrive/MyDrive/temp/' as I
    # created a '/temp/' folder in my google drive for this purpose.


# Training the Model / or downloading a model already trained
- If LOAD_BEST_MODEL_ST1 is set to True, the code below downloads a model we have already trained.
- If LOAD_BEST_MODEL_ST1 is set to False (the value used in the cell below), the training is performed.
  - The code below uses Early stopping (es) with patience = 20 (that is, the training will stop if no improvement on valid_loss is observed on the last 20 epochs).
  - It uses the Mean Squared Error (MSE) as loss function ('loss=tf.keras.losses.MeanSquaredError()'). The code also evaluates the Mean Absolute Error (MAE) during training ('metrics=['mae']'). The optimizer is Adam with 'amsgrad=True' and 'learning_rate=1e-4', batch size = 32, and the model will be trained for up to 200 epochs (if Colab allows it based on the time budget).
  - The learning rate is multiplied by 0.1 (reduce_lr) if valid_loss does not improve for 10 epochs, down to a minimum of 1e-6.
  - The model checkpoint callback (mc) is set to save the best model based on valid_loss (that is, if validation loss decreases from one epoch to another, a new model is saved on the path you specify).
  - Other hyperparameters you can play with are: defining another optimizer, loss function, learning rate, batch size, num of epochs.

- Note: in case you want to save your model, stop training, and resume training, check the end of this file **"II) illustrating how to train + save + stop training + RESUME TRAINING"** where we provide a more detailed example about this procedure. Recommendation: first train your model for a few epochs to avoid the need to resume training. This way, you will get used to the code and the general pipeline. Later, you can play with that.
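
To make the patience rule concrete, here is a minimal pure-Python sketch of the stopping criterion described above. It is a simplified imitation for illustration only, not the Keras `EarlyStopping` implementation; the function name `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# hypothetical validation losses, one value per epoch
demo_losses = [0.50, 0.40, 0.41, 0.42, 0.39, 0.40, 0.41, 0.42]
stop_at = early_stopping_epoch(demo_losses, patience=3)
print('training would stop at epoch index', stop_at)
```

A larger patience tolerates longer plateaus at the cost of more training time, which matters under the Colab time budget.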

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import pickle

# load a model and train history (defined and trained
# as below, trained for 38 epochs)
# --------------------------
LOAD_BEST_MODEL_ST1 = False  # (training only the last FC layers)
# --------------------------


if(LOAD_BEST_MODEL_ST1 == True):
    # downloading the trained model
    !wget https://data.chalearnlap.cvc.uab.cat/Colab_2021/best_model_st1.zip
    # decompressing the data
    with ZipFile('best_model_st1.zip', 'r') as zip:
        zip.extractall()
        print('Model decompressed successfully')
    # removing the .zip file after extraction  to clean space
    !rm best_model_st1.zip

else:
    # defining the early stop criteria
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=20)
    # saving the best model based on val_loss
    mc= ModelCheckpoint('./checkpoint/best_model.h5', monitor='val_loss', mode='min', save_best_only=True)
    mc_2 = ModelCheckpoint('./checkpoint/best_model.h5', monitor='val_loss', mode='min', save_best_only=True)
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=1e-6)

    # defining the optimizer
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-4, amsgrad = True), loss=tf.keras.losses.MeanSquaredError(), metrics=['mae'])

    # training the model
    history = model.fit(X_train, Y_train, validation_data=(
        X_valid, Y_valid), batch_size=32, epochs=200, shuffle=True, verbose=1, callbacks=[es, mc, mc_2, reduce_lr])

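    # saving training history (for future visualization);
    # the target directory is created first in case it does not exist yet
    import os
    os.makedirs('./history', exist_ok=True)
    with open('./history/train_history.pkl', 'wb') as handle:
        pickle.dump(history.history, handle, protocol=pickle.HIGHEST_PROTOCOL)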


# Visualizing the train history


In [None]:
from matplotlib import pyplot as plt

# here, it loads the history of the model we have already trained, or loads the
# history of the model you defined and trained
# (the same path is used in both cases)
with open("./history/train_history.pkl", "rb") as handle:
    train_hist = pickle.load(handle)

# we plot both, the LOSS and MAE
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 4))
fig.suptitle('Training history (stage 1)', fontsize=14, fontweight='bold')

ax1.plot(train_hist['loss'])
ax1.plot(train_hist['val_loss'])
ax1.set(xlabel='epoch', ylabel='LOSS')
ax1.legend(['train', 'valid'], loc='upper right')

ax2.plot(train_hist['mae'])
ax2.plot(train_hist['val_mae'])
ax2.set(xlabel='epoch', ylabel='MAE')
ax2.legend(['train', 'valid'], loc='upper right')


# Loading the saved model and Making predictions on the Test set
- Next, we load the trained model and make predictions on the Test set


In [None]:
from tensorflow.keras.models import load_model

# --------------------------
ENABLE_EVALUATION_ST1 = True
# --------------------------

# loading the saved model
if(LOAD_BEST_MODEL_ST1 == True):
    saved_model = load_model('best_model.h5')
else:
    saved_model = load_model('./checkpoint/best_model.h5')

if(ENABLE_EVALUATION_ST1 == True):
    # predict on the test data
    predictions_st1 = saved_model.predict(X_test, batch_size=32, verbose=1)


# Evaluating the model on the Test set, using the Mean Absolute Error (MAE) as metric.
- Note, as the train/validation labels were re-scaled to be in the range of [0,1], the predictions will be in the same range [0,1]. 
- To evaluate the model on the test set (which was not normalized), we re-scale the predictions back using the normalization factor = 100 (previously defined), in order to have the Mean Absolute Error with respect to the original apparent age labels.
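
The re-scaling and the MAE computation described above can be sketched in a few lines of plain Python. This is only an illustration with made-up numbers; the function name `mae_in_years` is hypothetical, and the normalization factor of 100 is the one defined earlier in this notebook.

```python
def mae_in_years(predictions_01, ages, factor=100.0):
    """MAE between predictions in [0,1] (re-scaled by `factor`) and raw ages."""
    if len(predictions_01) != len(ages):
        raise ValueError('predictions and labels must have the same length')
    errors = [abs(p * factor - a) for p, a in zip(predictions_01, ages)]
    return sum(errors) / len(errors)


# made-up predictions (normalized) and ground-truth ages (in years)
demo_preds = [0.25, 0.40, 0.61]
demo_truth = [23.0, 45.0, 60.0]
example_mae = mae_in_years(demo_preds, demo_truth)
print('MAE = %.2f years' % example_mae)
```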

In [None]:
if(ENABLE_EVALUATION_ST1 == True):
    # re-scaling the output predictions (from [0,1] to age range) using the
    # the normalization factor mentioned before
    predictions_st1_f = predictions_st1*100

    # evaluating on test data
    error = []
    for i in range(0, len(Y_test)):
        error.append(abs(np.subtract(predictions_st1_f[i][0], Y_test[i])))

    print('MAE = %.8f' % (np.mean(error)))


In [None]:
if(ENABLE_EVALUATION_ST1 == True):
    # printing some predictions
    for i in range(0, 10):
        # each prediction is a 1-element array, so we take its first element
        print('predicted age = %.3f - Ground truth = %.3f' %
              (predictions_st1_f[i][0], Y_test[i]))


---
# Performing a 2nd Stage of training, where ALL Layers are set to "trainable"
- Up to here, we have just trained the last FC layers of our model. Now, we will load the model we have trained (referred to as the 1st stage), set all layers to TRAINABLE, and train the whole model. Training will take more time, but we expect to get better results.

In [None]:
# setting all layers of the model to trainable
saved_model.trainable = True

counter = 0
for layer in saved_model.layers:
    #print(counter, layer.name, layer.trainable)
    counter += 1


# Training the WHOLE Model (2nd Stage)

- By default, the code below will load a pre-trained model, which was obtained by running the same code with LOAD_BEST_MODEL_ST2 set to False.
- Later, you can set LOAD_BEST_MODEL_ST2 to False to perform the training.
  - As before, the code below uses Early stopping (es) with patience = 5 (that is, the training will stop if no improvement on valid_loss is observed on the last 5 epochs).
  - It uses the Mean Squared Error (MSE) as loss function ('loss=tf.keras.losses.MeanSquaredError()'). The code also evaluates the Mean Absolute Error (MAE) during training ('metrics=['mae']'). Learning rate is set to 'learning_rate=1e-5', batch size = 16, and the model will be trained for 12 epochs (if Colab allows it based on the time budget). Note, if you increase the batch size too much, data may not fit the GPU capacity (as the number of parameters to train increased compared to the 1st stage). This is why we reduced it from 32 to 16.
  - The model callback (mc) is set to save the best model based on valid_loss (that is, if validation loss decreases from one epoch to another, a new model is saved on the path you specify).
  - Other hyperparameters you can play with are: defining another optimizer, loss function, learning rate, batch size, num of epochs.
- WARNING: at this stage, training takes more time, and Colab may close before you finish training due to time constraints. Thus, you will need to define a good strategy! In case you want to save your model, stop training, and resume training, check the end of this file **"II) illustrating how to train + save + stop training + RESUME TRAINING"** where we provide a more detailed example about this procedure.
- WARNING: if you save your model and resume training, the train history will be lost. To monitor the training history, you may need to save the train history in another way (e.g., you can copy and paste the logs into a text file before resuming the training).
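
As one possible alternative to copying logs by hand, the history of each session could be appended to a single pickle file. The sketch below assumes that each session's `history.history` is a dict mapping metric names to per-epoch lists (the object pickled earlier in this notebook); the helper name `append_history` and the numbers are made up.

```python
import os
import pickle
import tempfile


def append_history(path, new_history):
    """Append `new_history` (dict: metric -> list) to the pickle at `path`."""
    merged = {}
    if os.path.exists(path):
        with open(path, 'rb') as handle:
            merged = pickle.load(handle)
    for key, values in new_history.items():
        merged.setdefault(key, []).extend(values)
    with open(path, 'wb') as handle:
        pickle.dump(merged, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return merged


# usage with made-up numbers standing in for two training sessions
demo_path = os.path.join(tempfile.mkdtemp(), 'train_history.pkl')
append_history(demo_path, {'loss': [0.030, 0.021], 'val_loss': [0.028, 0.024]})
full = append_history(demo_path, {'loss': [0.018], 'val_loss': [0.023]})
print(full['loss'])
```

If the file is stored on the mounted google drive, it survives a killed session together with the checkpoints.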

---
---
# Accuracy is not enough! We also need to evaluate how biased our model is!
- Next, we define different functions, used to compute a bias score given different attributes.
  - Age bias
  - Gender bias
  - Ethnicity bias
  - Facial Expression bias
- In a nutshell, given a particular attribute, we compute the MAE for different groups. For the case of age, detailed next, we will have 4 groups based on different age ranges. Then, we will have $MAE_1$, $MAE_2$, $MAE_3$ and $MAE_4$. Then, we compute the Absolute Difference between every pair of groups. That is,
  - $D_{1,2} = |MAE_1-MAE_2|$
  - $D_{1,3} = |MAE_1-MAE_3|$
  - $D_{1,4} = |MAE_1-MAE_4|$
  - $D_{2,3} = |MAE_2-MAE_3|$
  - $D_{2,4} = |MAE_2-MAE_4|$
  - $D_{3,4} = |MAE_3-MAE_4|$

- The final score is obtained by the average of the absolute differences. In the case of age:
  - $B_a = (D_{1,2} + D_{1,3} + D_{1,4} + D_{2,3} + D_{2,4} + D_{3,4})/6$

- To minimize your bias score, given a particular attribute, you will need to minimize the Absolute Difference among the different groups being evaluated.
- The big challenge here is to minimize ALL bias scores (i.e., age, gender, ethnicity and face expression).
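
The score defined above can be written generically for any number of groups. The following is a minimal sketch under the assumption that the per-group MAE values are already available as a list; the function name `bias_score` and the example values are made up.

```python
from itertools import combinations


def bias_score(group_maes):
    """Average absolute difference between all pairs of per-group MAEs."""
    pairs = list(combinations(group_maes, 2))
    if not pairs:
        raise ValueError('at least two groups are needed')
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# made-up MAE values for the four age groups (MAE_1 ... MAE_4)
example_maes = [6.0, 8.0, 9.0, 13.0]
age_b = bias_score(example_maes)
print('Ba = %.4f' % age_b)
```

With four groups there are six pairs, which matches the division by 6 in $B_a$; with two groups (gender) the score reduces to a single absolute difference.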

---
# Age Bias ($B_a$) 

- Evaluates (on the TEST set) how accurate the model is with respect to different age ranges.
  - group 1: age < 20
  - group 2: 20 <= age < 40
  - group 3: 40 <= age < 60
  - group 4: 60 <= age



In [None]:
def age_bias(predictions, gt):
    error_g1 = []
    error_g2 = []
    error_g3 = []
    error_g4 = []
    for i in range(0, len(gt)):
        if(gt[i] < 20):
            error_g1.append(abs(predictions[i]-gt[i]))
        if(gt[i] >= 20 and gt[i] < 40):
            error_g2.append(abs(predictions[i]-gt[i]))
        if(gt[i] >= 40 and gt[i] < 60):
            error_g3.append(abs(predictions[i]-gt[i]))
        if(gt[i] >= 60):
            error_g4.append(abs(predictions[i]-gt[i]))

    print('=============================')
    print('Age analysis:')
    print('Size group 1 = %d, MAE = %f' % (len(error_g1), np.mean(error_g1)))
    print('Size group 2 = %d, MAE = %f' % (len(error_g2), np.mean(error_g2)))
    print('Size group 3 = %d, MAE = %f' % (len(error_g3), np.mean(error_g3)))
    print('Size group 4 = %d, MAE = %f' % (len(error_g4), np.mean(error_g4)))

    age_bias = (abs(np.mean(error_g1)-np.mean(error_g2)) +
                abs(np.mean(error_g1)-np.mean(error_g3)) +
                abs(np.mean(error_g1)-np.mean(error_g4)) +
                abs(np.mean(error_g2)-np.mean(error_g3)) +
                abs(np.mean(error_g2)-np.mean(error_g4)) +
                abs(np.mean(error_g3)-np.mean(error_g4)))/6

    print('---------')
    print('Age bias (Ba) = ', age_bias)


# Gender Bias ($B_g$) 
- Evaluates (on the test set) how accurate the model is with respect to different genders.
  - group 1: male
  - group 2: female


In [None]:
def gender_bias(predictions, gt, metadata):
    error_m = []
    error_f = []
    for i in range(0, len(gt)):
        if(metadata[i][0] == 'female'):
            error_f.append(abs(predictions[i]-gt[i]))
        else:
            error_m.append(abs(predictions[i]-gt[i]))

    print('=============================')
    print('Gender analysis:')
    print('Size group female = %d, MAE = %f' %
          (len(error_f), np.mean(error_f)))
    print('Size group male = %d, MAE = %f' % (len(error_m), np.mean(error_m)))

    gender_bias = abs(np.mean(error_f)-np.mean(error_m))

    print('---------')
    print('Gender bias (Bg) = ', gender_bias)


# Ethnicity Bias ($B_e$)
- Evaluates (on the test set) how accurate the model is with respect to different ethnicity categories.
  - group 1: asian
  - group 2: afroamerican
  - group 3: caucasian


In [None]:
def ethnicity_bias(predictions, gt, metadata):
    error_as = []
    error_af = []
    error_ca = []
    for i in range(0, len(gt)):
        if(metadata[i][1] == 'asian'):
            error_as.append(abs(predictions[i]-gt[i]))
        if(metadata[i][1] == 'afroamerican'):
            error_af.append(abs(predictions[i]-gt[i]))
        if(metadata[i][1] == 'caucasian'):
            error_ca.append(abs(predictions[i]-gt[i]))

    print('=============================')
    print('Ethnicity Analysis:')
    print('Size group asian = %d, MAE = %f' %
          (len(error_as), np.mean(error_as)))
    print('Size group afroamerican = %d, MAE = %f' %
          (len(error_af), np.mean(error_af)))
    print('Size group caucasian = %d, MAE = %f' %
          (len(error_ca), np.mean(error_ca)))

    ethnicity_bias = (abs(np.mean(error_as)-np.mean(error_af)) +
                      abs(np.mean(error_as)-np.mean(error_ca)) +
                      abs(np.mean(error_af)-np.mean(error_ca)))/3

    print('---------')
    print('Ethnicity bias (Be) = ', ethnicity_bias)


# Face expression bias ($B_f$)
- Evaluates (on the test set) how accurate the model is with respect to different face expression categories.
  - group 1: neutral
  - group 2: slightlyhappy
  - group 3: happy
  - group 4: other


In [None]:
def face_expression_bias(predictions, gt, metadata):
    error_h = []
    error_s = []
    error_n = []
    error_o = []
    for i in range(0, len(gt)):
        if(metadata[i][2] == 'happy'):
            error_h.append(abs(predictions[i]-gt[i]))
        if(metadata[i][2] == 'slightlyhappy'):
            error_s.append(abs(predictions[i]-gt[i]))
        if(metadata[i][2] == 'neutral'):
            error_n.append(abs(predictions[i]-gt[i]))
        if(metadata[i][2] == 'other'):
            error_o.append(abs(predictions[i]-gt[i]))

    print('=============================')
    print('Face expression Analysis:')
    print('Size group happy = %d, MAE = %f' % (len(error_h), np.mean(error_h)))
    print('Size group slightlyhappy = %d, MAE = %f' %
          (len(error_s), np.mean(error_s)))
    print('Size group neutral = %d, MAE = %f' %
          (len(error_n), np.mean(error_n)))
    print('Size group other = %d, MAE = %f' % (len(error_o), np.mean(error_o)))

    face_bias = (abs(np.mean(error_h)-np.mean(error_s)) +
                 abs(np.mean(error_h)-np.mean(error_n)) +
                 abs(np.mean(error_h)-np.mean(error_o)) +
                 abs(np.mean(error_s)-np.mean(error_n)) +
                 abs(np.mean(error_s)-np.mean(error_o)) +
                 abs(np.mean(error_n)-np.mean(error_o)))/6

    print('---------')
    print('Face Expression bias (Bf) = ', face_bias)


---
---
# Strategies to improve Accuracy (i.e., to reduce the Error):
# 2) Custom Loss: sample weights to deal with imbalanced categories
- Next, we will create a "customized loss", which gives more weight to groups having fewer samples in the train data. For this, **we will consider the age range only**. This way, we believe the model will be able to generalize a little bit better to those particular groups.

# Load the Train data again (to remove the augmented data) and generate the weights
- First, we will generate a weight for each age group (for $j = 1$ to $4$);
- The formula used to calculate the weight for each group $j$ is:

  $w_j=n_{samples} / (n_{classes} * n_{samples,j}),$

  Where

    - $w_j$ is the weight for each group $j$,
    - $n_{samples}$ is the number of samples in the train set,
    - $n_{classes}$ is the number of classes (4 in our case, as we divided the ages in 4 groups),
    - $n_{samples,j}$ is the number of samples of class (group) $j$.
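
As a worked example of the formula above, the sketch below computes the weights for four groups. The group sizes are made up and the function name `group_weights` is hypothetical.

```python
def group_weights(group_sizes):
    """w_j = n_samples / (n_classes * n_samples_j) for each group j."""
    n_samples = sum(group_sizes)
    n_classes = len(group_sizes)
    return [n_samples / (n_classes * n_j) for n_j in group_sizes]


# made-up sizes for the four age groups
sizes = [100, 500, 300, 100]
weights = group_weights(sizes)
print('weights per group =', weights)
```

Smaller groups receive larger weights, and the weighted group sizes sum back to the total number of samples. Note that a group with zero samples would raise a division-by-zero error in this sketch.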


In [None]:
"""
from tensorflow.keras.applications.resnet50 import preprocess_input
# loading the train data again (original face images, before preprocessing):
X_train = np.load('./data/data_train.npy')
Y_train = np.load('./data/labels_train.npy')
Y_train = Y_train/100 # normalizing the age values to be between [0,1]

# preprocessing the train data with respect to ResNet-50 Inputs.
for i in range(0,X_train.shape[0]):
  x = X_train[i,:,:,:]
  x = np.expand_dims(x, axis=0)
  X_train[i,] = preprocess_input(x)

# counting the number of samples per group in the train data (age attribute only)
g1 = g2 = g3 = g4 = 0
for i in range(0,Y_train.shape[0]):
    if(Y_train[i]*100<20):
      g1 +=1
    if(Y_train[i]*100>=20 and Y_train[i]*100<40):
      g2 +=1
    if(Y_train[i]*100>=40 and Y_train[i]*100<60):
      g3 +=1
    if(Y_train[i]*100>=60):
      g4 +=1
print('group(s) size = ', [g1, g2, g3, g4])

# generating the weights for each group using the equation defined above
w = sum(np.array([g1, g2, g3, g4]))/(4*np.array([g1, g2, g3, g4]))
print('weights per group = ', w)

# creating a vector with same size as Y_train, that will link a particular label to its weight
sample_weights = []
for i in range(0,Y_train.shape[0]):
    if(Y_train[i]*100<20):
      sample_weights.append(w[0])
    if(Y_train[i]*100>=20 and Y_train[i]*100<40):
      sample_weights.append(w[1])
    if(Y_train[i]*100>=40 and Y_train[i]*100<60):
      sample_weights.append(w[2])
    if(Y_train[i]*100>=60):
      sample_weights.append(w[3])
sample_weights = np.array(sample_weights)
"""

# 48 groups
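
The 48 groups counted in the cell below come from 4 age ranges x 3 ethnicity categories x 4 facial expressions. As a compact illustration of the same counting, the sketch below keys each group by a tuple. It assumes the metadata layout used in this notebook (index 1 = ethnicity, index 2 = facial expression); all names and sample values are made up.

```python
from collections import Counter
from itertools import product

AGE_GROUPS = (0, 1, 2, 3)  # <20, 20-39, 40-59, >=60
ETHNICITIES = ('caucasian', 'asian', 'afroamerican')
EXPRESSIONS = ('neutral', 'slightlyhappy', 'happy', 'other')


def age_bin(age):
    """Return the 0-based age range index for an age in years."""
    if age < 20:
        return 0
    if age < 40:
        return 1
    if age < 60:
        return 2
    return 3


def count_groups(ages, metadata):
    """Count samples per (age range, ethnicity, expression) group."""
    # every counter starts at 1, mirroring the initialization in the cell below
    keys = product(AGE_GROUPS, ETHNICITIES, EXPRESSIONS)
    counts = Counter({key: 1 for key in keys})
    for age, meta in zip(ages, metadata):
        key = (age_bin(age), meta[1], meta[2])
        if key in counts:  # samples with unknown categories are ignored
            counts[key] += 1
    return counts


# made-up samples: (gender, ethnicity, facial expression)
demo_ages = [12.0, 35.0, 35.5, 71.0]
demo_meta = [('male', 'asian', 'happy'),
             ('female', 'caucasian', 'neutral'),
             ('male', 'caucasian', 'neutral'),
             ('female', 'afroamerican', 'other')]
demo_counts = count_groups(demo_ages, demo_meta)
print(len(demo_counts), 'groups')
```

The iteration order of `product` here follows the same nesting as the counters g1 to g48 (age range, then ethnicity, then expression).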

In [None]:
from tensorflow.keras.applications.resnet50 import preprocess_input
# loading the train data again (original face images, before preprocessing):
X_train = np.load('./data/data_train.npy')
Y_train = np.load('./data/labels_train.npy')
M_train = np.load('./data/meta_data_train.npy')
Y_train = Y_train/100 # normalizing the age values to be between [0,1]

# preprocessing the train data with respect to ResNet-50 Inputs.
for i in range(0,X_train.shape[0]):
  x = X_train[i,:,:,:]
  x = np.expand_dims(x, axis=0)
  X_train[i,] = preprocess_input(x)

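# counting the number of samples per group in the train data
# (age range x ethnicity x facial expression = 48 groups); counters start at 1,
# which avoids a division by zero for empty groups when computing the weights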
g1 = g2 = g3 = g4 = g5 = g6 = g7 = g8 = g9 = g10 = g11 = g12 = g13 = g14 = g15 = g16 = g17 = g18 = g19 = g20 = g21 = g22 = g23 = g24 = 1
g25 = g26 = g27 = g28 = g29 = g30 = g31 = g32 = g33 = g34 = g35 = g36 = g37 = g38 = g39 = g40 = 1
g41 = g42 = g43 = g44 = g45 = g46 = g47 = g48 = 1
for i in range(0,Y_train.shape[0]):
    if(Y_train[i]*100<20):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                g1 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g2 +=1
            if(M_train[i][2] == 'happy'):
                g3 +=1
            if(M_train[i][2] == 'other'):
                g4 +=1
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                g5 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g6 +=1
            if(M_train[i][2] == 'happy'):
                g7 +=1
            if(M_train[i][2] == 'other'):
                g8 +=1
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                g9 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g10 +=1
            if(M_train[i][2] == 'happy'):
                g11 +=1
            if(M_train[i][2] == 'other'):
                g12 +=1
    if(Y_train[i]*100>=20 and Y_train[i]*100<40):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                g13 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g14 +=1
            if(M_train[i][2] == 'happy'):
                g15 +=1
            if(M_train[i][2] == 'other'):
                g16 +=1
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                g17 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g18 +=1
            if(M_train[i][2] == 'happy'):
                g19 +=1
            if(M_train[i][2] == 'other'):
                g20 +=1
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                g21 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g22 +=1
            if(M_train[i][2] == 'happy'):
                g23 +=1
            if(M_train[i][2] == 'other'):
                g24 +=1
    if(Y_train[i]*100>=40 and Y_train[i]*100<60):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                g25 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g26 +=1
            if(M_train[i][2] == 'happy'):
                g27 +=1
            if(M_train[i][2] == 'other'):
                g28 +=1
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                g29 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g30 +=1
            if(M_train[i][2] == 'happy'):
                g31 +=1
            if(M_train[i][2] == 'other'):
                g32 +=1
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                g33 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g34 +=1
            if(M_train[i][2] == 'happy'):
                g35 +=1
            if(M_train[i][2] == 'other'):
                g36 +=1
    if(Y_train[i]*100>=60):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                g37 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g38 +=1
            if(M_train[i][2] == 'happy'):
                g39 +=1
            if(M_train[i][2] == 'other'):
                g40 +=1
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                g41 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g42 +=1
            if(M_train[i][2] == 'happy'):
                g43 +=1
            if(M_train[i][2] == 'other'):
                g44 +=1
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                g45 +=1
            if(M_train[i][2] == 'slightlyhappy'):
                g46 +=1
            if(M_train[i][2] == 'happy'):
                g47 +=1
            if(M_train[i][2] == 'other'):
                g48 +=1
print('group(s) size = ', [g1 , g2 , g3 , g4 , g5 , g6 , g7 , g8 , g9 , g10 , g11 , g12 , g13 , g14 , g15 , g16 , g17 , g18 , g19 , g20 , g21 , g22 , g23 , g24 , 
g25 , g26 , g27 , g28 , g29 , g30 , g31 , g32 , g33 , g34 , g35 , g36 , g37 , g38 , g39 , g40 , 
g41 , g42 , g43 , g44 , g45 , g46 , g47 , g48])

# generating the weights for each group using the equation defined above
# note: a group with zero samples gets an infinite weight (NumPy emits a division-by-zero warning);
# that weight is never used, because no sample belongs to an empty group
group_sizes = np.array([g1 , g2 , g3 , g4 , g5 , g6 , g7 , g8 , g9 , g10 , g11 , g12 , g13 , g14 , g15 , g16 , g17 , g18 , g19 , g20 , g21 , g22 , g23 , g24 , 
g25 , g26 , g27 , g28 , g29 , g30 , g31 , g32 , g33 , g34 , g35 , g36 , g37 , g38 , g39 , g40 , 
g41 , g42 , g43 , g44 , g45 , g46 , g47 , g48])
w = group_sizes.sum()/(48*group_sizes)
print('weights per group = ', w)

# creating a vector with same size as Y_train, that will link a particular label to its weight
sample_weights = []
for i in range(0,Y_train.shape[0]):
    if(Y_train[i]*100<20):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[0])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[1])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[2])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[3])
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[4])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[5])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[6])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[7])
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[8])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[9])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[10])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[11])
    if(Y_train[i]*100>=20 and Y_train[i]*100<40):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[12])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[13])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[14])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[15])
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[16])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[17])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[18])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[19])
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[20])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[21])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[22])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[23])
    if(Y_train[i]*100>=40 and Y_train[i]*100<60):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[24])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[25])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[26])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[27])
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[28])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[29])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[30])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[31])
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[32])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[33])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[34])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[35])
    if(Y_train[i]*100>=60):
        if(M_train[i][1]=='caucasian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[36])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[37])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[38])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[39])
        if(M_train[i][1]=='asian'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[40])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[41])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[42])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[43])
        if(M_train[i][1]=='afroamerican'):
            if(M_train[i][2] == 'neutral'):
                sample_weights.append(w[44])
            if(M_train[i][2] == 'slightlyhappy'):
                sample_weights.append(w[45])
            if(M_train[i][2] == 'happy'):
                sample_weights.append(w[46])
            if(M_train[i][2] == 'other'):
                sample_weights.append(w[47])
sample_weights = np.array(sample_weights)

# 10 groups

In [None]:
from tensorflow.keras.applications.resnet50 import preprocess_input
# loading the train data again (original face images, before preprocessing):
X_train = np.load('./data/data_train.npy')
Y_train = np.load('./data/labels_train.npy')
M_train = np.load('./data/meta_data_train.npy')
Y_train = Y_train/100 # normalizing the age values to be between [0,1]

# preprocessing the train data with respect to ResNet-50 Inputs.
for i in range(0,X_train.shape[0]):
  x = X_train[i,:,:,:]
  x = np.expand_dims(x, axis=0)
  X_train[i,] = preprocess_input(x)

# counting the number of samples per group in the train data
# (age range x ethnicity; all samples with age >= 60 form a single group)
g1 = g2 = g3 = g4 = g5 = g6 = g7 = g8 = g9 = g10 = 0
for i in range(0,Y_train.shape[0]):
    if(Y_train[i]*100<20):
        if(M_train[i][1]=='caucasian'):
            g1 +=1
        if(M_train[i][1]=='asian'):
            g2 +=1
        if(M_train[i][1]=='afroamerican'):
            g3 +=1
    if(Y_train[i]*100>=20 and Y_train[i]*100<40):
        if(M_train[i][1]=='caucasian'):
            g4 +=1
        if(M_train[i][1]=='asian'):
            g5 +=1
        if(M_train[i][1]=='afroamerican'):
            g6 +=1
    if(Y_train[i]*100>=40 and Y_train[i]*100<60):
        if(M_train[i][1]=='caucasian'):
            g7 +=1
        if(M_train[i][1]=='asian'):
            g8 +=1
        if(M_train[i][1]=='afroamerican'):
            g9 +=1
    if(Y_train[i]*100>=60):
        g10 +=1
                
print('group(s) size = ', [g1 , g2 , g3 , g4 , g5 , g6 , g7 , g8 , g9 , g10])

# generating the weights for each group using the equation defined above
w = sum(np.array([g1 , g2 , g3 , g4 , g5 , g6 , g7 , g8 , g9 , g10]))/(10*np.array([g1 , g2 , g3 , g4 , g5 , g6 , g7 , g8 , g9 , g10]))
print('weights per group = ', w)

# creating a vector with same size as Y_train, that will link a particular label to its weight
sample_weights = []
for i in range(0,Y_train.shape[0]):
    if(Y_train[i]*100<20):
        if(M_train[i][1]=='caucasian'):
            sample_weights.append(w[0])
        if(M_train[i][1]=='asian'):
            sample_weights.append(w[1])
        if(M_train[i][1]=='afroamerican'):
            sample_weights.append(w[2])
    if(Y_train[i]*100>=20 and Y_train[i]*100<40):
        if(M_train[i][1]=='caucasian'):
            sample_weights.append(w[3])
        if(M_train[i][1]=='asian'):
            sample_weights.append(w[4]) 
        if(M_train[i][1]=='afroamerican'):
            sample_weights.append(w[5])
    if(Y_train[i]*100>=40 and Y_train[i]*100<60):
        if(M_train[i][1]=='caucasian'):
            sample_weights.append(w[6])
        if(M_train[i][1]=='asian'):
            sample_weights.append(w[7])
        if(M_train[i][1]=='afroamerican'):
            sample_weights.append(w[8])
    if(Y_train[i]*100>=60):
        sample_weights.append(w[9])
sample_weights = np.array(sample_weights)
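
The nested `if` blocks above can be written more compactly. The following is a minimal sketch, not the code used to produce the reported results: the helper names `age_range` and `balanced_sample_weights` and the toy data are hypothetical. It applies the same rule (total number of samples divided by the number of groups times the group size), with two differences that are assumptions of this sketch: only groups that actually occur are counted, and the `>=60` range is also split by ethnicity.

```python
from collections import Counter

def age_range(age_years):
    """Map an age in years to one of the four ranges used above."""
    if age_years < 20:
        return '<20'
    if age_years < 40:
        return '20-39'
    if age_years < 60:
        return '40-59'
    return '>=60'

def balanced_sample_weights(group_keys):
    """One weight per sample: n_samples / (n_groups * size of the sample's group)."""
    counts = Counter(group_keys)
    n_samples, n_groups = len(group_keys), len(counts)
    return [n_samples / (n_groups * counts[key]) for key in group_keys]

# toy data (hypothetical values, for illustration only)
toy_ages = [10, 25, 30, 45, 70, 15]  # ages in years, i.e. Y_train * 100
toy_ethnicity = ['asian', 'caucasian', 'caucasian', 'asian', 'caucasian', 'asian']

# each sample's group is the pair (age range, ethnicity)
toy_keys = [(age_range(a), e) for a, e in zip(toy_ages, toy_ethnicity)]
toy_weights = balanced_sample_weights(toy_keys)
print(toy_weights)
```

Samples from smaller groups receive larger weights, and the weights sum to the number of samples. To use such a list with the `fit` function, it would first be converted with `np.array(...)`.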

# I) Using the SAMPLE WEIGHTS to train our model
- Next, you will see the code we used to train our model (2nd stage), starting from the model we obtained at the 1st stage and using the customized loss option with sample weights.
- When the boolean variable 'LOAD_BEST_MODEL_ST2_WEIGHTED_LOSS' is True, the code loads the model we have already trained.
- Set 'LOAD_BEST_MODEL_ST2_WEIGHTED_LOSS' to False to train your own model.
- Note that we now include two other variables ('RESUME_TRAINING' and 'RESUME_FROM_EPOCH'), which allow us to resume training and to specify the epoch from which we want to resume, as detailed below.

# II) illustrating how to train + save + stop training + RESUME TRAINING
- **Imagine** you set 'LOAD_BEST_MODEL_ST2_WEIGHTED_LOSS = False', 'NUM_EPOCHS = 12' and 'RESUME_TRAINING = False' to train your model the first time.
- Due to Colab limitations, your process stopped in the middle of epoch 10, and the best model (based on validation loss) had been saved at epoch 9.
- In the above example, you can resume training from epoch 9 by setting the following parameters:
  - 'RESUME_TRAINING = True'
  - 'RESUME_FROM_EPOCH = 9'

- IMPORTANT: to resume training, you will need to monitor the epoch number where your model stopped before resuming the training, and change the defined variables properly. 
  - Note that the fit function is adapted to receive the sample weights ('sample_weight=sample_weights').
  - Also note that the fit function changes if you are training from epoch 0 (initial_epoch=0) or resume training (initial_epoch=RESUME_FROM_EPOCH). 
  - Finally, note that when you are resuming training, you load your 'best_model_2nd_stage_weighted.h5' instead of the model trained at stage 1 ('best_model.h5').
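
The two rules above (which checkpoint to load, and which `initial_epoch` to pass to `fit`) can be summarized in a small helper. This is only an illustrative sketch: the function `resume_settings` is hypothetical and not part of the starting kit, while the two file paths are the ones mentioned above.

```python
def resume_settings(resume_training, resume_from_epoch,
                    stage1_path='./checkpoint/best_model.h5',
                    stage2_path='./checkpoint/best_model_2nd_stage_weighted.h5'):
    """Return (checkpoint_path, initial_epoch) for load_model and fit."""
    if resume_training:
        # resuming: continue from the best stage-2 checkpoint and its epoch number
        return stage2_path, resume_from_epoch
    # first run of stage 2: start from the stage-1 model at epoch 0
    return stage1_path, 0

# first run, then a resumed run after the interruption described above
first_run = resume_settings(resume_training=False, resume_from_epoch=9)
resumed_run = resume_settings(resume_training=True, resume_from_epoch=9)
print(first_run)
print(resumed_run)
```

Keeping both decisions in one place makes it harder to load the stage-1 model by mistake when resuming, or to skip epochs on a fresh run.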


In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau
# --------------------------
LOAD_BEST_MODEL_ST2_WEIGHTED_LOSS = False
NUM_EPOCHS = 300
# --------------------------
RESUME_TRAINING = False
RESUME_FROM_EPOCH = 90
# --------------------------

if(LOAD_BEST_MODEL_ST2_WEIGHTED_LOSS == True):
    # downloading the trained model
    !wget https://data.chalearnlap.cvc.uab.cat/Colab_2021/best_model_weighted.zip
    # decompressing the data
    with ZipFile('best_model_weighted.zip', 'r') as zip:
        zip.extractall()
        print('Model decompressed successfully')
    # removing the .zip file after extraction  to clean space
    !rm best_model_weighted.zip

else:
    # loading the saved model (best model learned at stage 1)
    if(RESUME_TRAINING == False):
        # load model from stage 1 best_model
        saved_model = load_model('./checkpoint/best_model.h5')
    else:
        # resume training (stage 2) from the best stage-2 checkpoint saved so far
        saved_model = load_model('./checkpoint/best_model_2nd_stage_weighted.h5')

    # setting all layers to trainable
    saved_model.trainable = True

    # =================================================
    # training all layers (2nd stage), given the model saved on stage 1
    saved_model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5, amsgrad=False), loss=tf.keras.losses.MeanSquaredError(), metrics=['mae'])
    # =================================================

    # defining the early stop criteria
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)
    mc = ModelCheckpoint('./checkpoint/best_model_2nd_stage_weighted.h5',
                         monitor='val_loss', mode='min', save_best_only=True)
    # second checkpoint: keeps the model with the lowest validation MAE
    mc_2 = ModelCheckpoint('./checkpoint/best_model_2nd_stage_weighted_mae.h5',
                         monitor='val_mae', mode='min', save_best_only=True)
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=6, min_lr=1e-6)


    if(RESUME_TRAINING == False):
        # first run of stage 2: epochs are counted from 0
        history = saved_model.fit(X_train, Y_train, sample_weight=sample_weights, validation_data=(
            X_valid, Y_valid), batch_size=16, epochs=NUM_EPOCHS, initial_epoch=0, shuffle=True, verbose=1, callbacks=[es, mc, mc_2, reduce_lr])
    else:
        # resuming: epochs are counted from RESUME_FROM_EPOCH
        history = saved_model.fit(X_train, Y_train, sample_weight=sample_weights, validation_data=(
            X_valid, Y_valid), batch_size=16, epochs=NUM_EPOCHS, initial_epoch=RESUME_FROM_EPOCH, shuffle=True, verbose=1, callbacks=[es, mc, mc_2, reduce_lr])


In [None]:
from matplotlib import pyplot as plt

# here, we use the history returned by fit in the previous cell; it is only
# available if the model was trained in this session (not if it was downloaded)

train_hist = history.history

# we plot both, the LOSS and MAE
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 4))
fig.suptitle('Training history (stage 2, weighted loss)', fontsize=14, fontweight='bold')

ax1.plot(train_hist['loss'])
ax1.plot(train_hist['val_loss'])
ax1.set(xlabel='epoch', ylabel='LOSS')
ax1.legend(['train', 'valid'], loc='upper right')

ax2.plot(train_hist['mae'])
ax2.plot(train_hist['val_mae'])
ax2.set(xlabel='epoch', ylabel='MAE')
ax2.legend(['train', 'valid'], loc='upper right')

# Making predictions on the Test set and Evaluating
- Note that in this case, the model obtained MAE = 12.45038828, which is not the best score compared to the ones obtained before. However, are the evaluated biases better? Let's see!

In [None]:
from tensorflow.keras.models import load_model

if(LOAD_BEST_MODEL_ST2_WEIGHTED_LOSS == True):
    saved_model_2nd_weighted = load_model('best_model_2nd_stage_weighted.h5')
else:
    saved_model_2nd_weighted = load_model('./checkpoint/best_model_2nd_stage_weighted.h5')

# --------------------------
ENABLE_EVALUATION_WEIGHTED = True
# --------------------------

if(ENABLE_EVALUATION_WEIGHTED == True):
    # predict on the test data
    predictions_2nd_weighted = saved_model_2nd_weighted.predict(
        X_test, batch_size=32, verbose=1)
    # re-scaling the output predictions (from [0,1] to age range) using
    # the normalization factor mentioned before
    predictions_2nd_weighted_f = predictions_2nd_weighted*100


In [None]:
if(ENABLE_EVALUATION_WEIGHTED == True):
    # evaluating on test data
    error = []
    for i in range(0, len(Y_test)):
        error.append(
            abs(np.subtract(predictions_2nd_weighted_f[i][0], Y_test[i])))

    print('MAE = %.8f' % (np.mean(error)))


In [None]:
# printing some predictions
for i in range(0, 20):
    # each prediction is a 1-element array, so we take its first (only) value
    print('predicted age = %.3f - Ground truth = %.3f' %
          (predictions_2nd_weighted_f[i][0], Y_test[i]))


In [None]:

# computing the age bias (model_stage_2)
age_bias(predictions_2nd_weighted_f, Y_test)

# computing the gender bias (model_stage_2)
gender_bias(predictions_2nd_weighted_f, Y_test, M_test)

# computing the ethnicity bias (model_stage_2)
ethnicity_bias(predictions_2nd_weighted_f, Y_test, M_test)

# computing the face bias (model_stage_2)
face_expression_bias(predictions_2nd_weighted_f, Y_test, M_test)


In [None]:
import csv
# saving the predictions as a csv file
# (the 'with' block closes the file automatically; newline='' is recommended by the csv module)
with open('predictions.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerows(predictions_2nd_weighted_f)

# compressing the csv file (to be submitted to codalab as prediction)
! zip predictions.zip predictions.csv

# BEST MAE

In [None]:
from tensorflow.keras.models import load_model

if(LOAD_BEST_MODEL_ST2_WEIGHTED_LOSS == True):
    saved_model_2nd_weighted = load_model('best_model_2nd_stage_weighted_mae.h5')
else:
    saved_model_2nd_weighted = load_model('./checkpoint/best_model_2nd_stage_weighted_mae.h5')

# --------------------------
ENABLE_EVALUATION_WEIGHTED = True
# --------------------------

if(ENABLE_EVALUATION_WEIGHTED == True):
    # predict on the test data
    predictions_2nd_weighted = saved_model_2nd_weighted.predict(
        X_test, batch_size=32, verbose=1)
    # re-scaling the output predictions (from [0,1] to age range) using
    # the normalization factor mentioned before
    predictions_2nd_weighted_f = predictions_2nd_weighted*100


In [None]:
if(ENABLE_EVALUATION_WEIGHTED == True):
    # evaluating on test data
    error = []
    for i in range(0, len(Y_test)):
        error.append(
            abs(np.subtract(predictions_2nd_weighted_f[i][0], Y_test[i])))

    print('MAE = %.8f' % (np.mean(error)))


In [None]:
# printing some predictions
for i in range(0, 20):
    # each prediction is a 1-element array, so we take its first (only) value
    print('predicted age = %.3f - Ground truth = %.3f' %
          (predictions_2nd_weighted_f[i][0], Y_test[i]))


# Comparing the 2nd stage of training: 
case a) without augmentation and without custom loss **VS.** case b) without augmentation but with custom loss.
- Age bias:
  - case a: 8.988896687825521
  - case b: 3.965182622273763
- Gender bias:
  - case a: 0.6280031
  - case b: 0.54932594
- Ethnicity bias:
  - case a: 2.447519620259603
  - case b: 2.094111124674479
- Face Expression bias:
  - case a: 0.8196892738342285
  - case b: 1.220861275990804

As can be observed, the model with custom loss and weighted samples obtained smaller bias scores on all evaluated attributes except face expression, even though the weights were defined based on the age attribute only.
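
To make the comparison easier to read, the relative change of each bias score from case a to case b can be computed directly from the values listed above. This is a small illustrative sketch: the dictionary and the helper `relative_change` are hypothetical names, and the numbers are simply copied from the list.

```python
# bias scores reported above, as (case a, case b)
reported_bias = {
    'Age': (8.988896687825521, 3.965182622273763),
    'Gender': (0.6280031, 0.54932594),
    'Ethnicity': (2.447519620259603, 2.094111124674479),
    'Face expression': (0.8196892738342285, 1.220861275990804),
}

def relative_change(before, after):
    """Relative change from 'before' to 'after' (negative means the bias decreased)."""
    return (after - before) / before

changes = {name: relative_change(a, b) for name, (a, b) in reported_bias.items()}
for name, change in changes.items():
    print('%s bias: %+.1f%%' % (name, 100 * change))
```

A negative value means the weighted model reduced the bias for that attribute; a positive value means the bias increased.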

In [None]:
age_bias(predictions_2nd_f, Y_test)
age_bias(predictions_2nd_weighted_f, Y_test)

gender_bias(predictions_2nd_f, Y_test, M_test)
gender_bias(predictions_2nd_weighted_f, Y_test, M_test)

ethnicity_bias(predictions_2nd_f, Y_test, M_test)
ethnicity_bias(predictions_2nd_weighted_f, Y_test, M_test)

face_expression_bias(predictions_2nd_f, Y_test, M_test)
face_expression_bias(predictions_2nd_weighted_f, Y_test, M_test)


In [None]:
import csv
# saving the predictions as a csv file
# (the 'with' block closes the file automatically; newline='' is recommended by the csv module)
with open('predictions.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerows(predictions_2nd_weighted_f)

# compressing the csv file (to be submitted to codalab as prediction)
! zip predictions_2.zip predictions.csv

---
---
# Practical Exercises 
Next, we define a series of practical exercises (Task 1 and 2, and an optional extra exercise).  **Your goal is to maximize accuracy (i.e., reduce the Mean Absolute Error) and minimize the evaluated bias scores on the different attributes**. Task 1 and 2 have some restrictions so that you can compare the results when following different strategies. Note that you can edit and improve the starting kit on each task, but you are also free to start from scratch and create a new solution. At the end, you will be evaluated based on a set of items (detailed in the practical classes) and **creativity**. 

- IMPORTANT: we will use **Codalab** to motivate the students, as they can submit their results on the platform, compete with each other and improve their solutions, but the ranking shown in the leaderboard will not be considered in the evaluation. The reason is that more creative solutions will be preferred even if they do not provide the best results.
- Note: you will be requested to share with the lecturers (Sergio and Julio) your final **Colab file** (with clean, well-documented code) and a **Report document** where you describe your experiments and solution, and compare and discuss the obtained results in a progressive and clear way. Please check the class material associated with the practical sessions for more details. 

---
- **Task 1 (with data augmentation):** For this task, you should define your model (e.g., a generic backbone with some small changes to solve the problem at hand, like including/removing some layers, etc), play with the different hyperparameters (e.g., number of epochs, learning rate, batch size, etc), regularizers (e.g., dropout layer), and loss function (e.g., MSE, MAE, etc). You can also play with the training strategy (e.g., training in different stages or not, freezing different layers during training or not, etc). Then, **you will be requested to perform some data augmentation** to achieve your goal. Note that you could simply expand the idea of the starting kit to cover other attributes (e.g., age > 40 or the "happy" expression). However, we expect more creative solutions, where different approaches are employed (e.g., new transformations, covering different attributes, etc). Then, you should submit your solution to codalab, receive real-time feedback, and improve it based on your results.

- **Task 2 (custom loss, without data augmentation):** For this task, you should fix the model employed in Task 1, but you can also play with the different hyperparameters, regularizers, loss function and training strategy. Then, **you will be requested to use a custom loss (e.g., sample weights)** to achieve your goal, **without any data augmentation** method. This way, you will be able to compare the different solutions (Task 1 vs. Task 2). Then, you should submit your solution to codalab, receive real-time feedback, and improve it based on your results.

- **Extra (optional):** For this task, you can exploit your creativity as much as you can. You are free to employ any strategy, data augmentation, custom loss, etc, all together in order to achieve your goal. Then, you will be able to compare the obtained results for Task 1 vs. Task 2 vs. the extra (optional) task. Then, you can also submit your solution to codalab and receive real-time feedback, and improve it based on your results.
---


**Codalab Competition link:** https://codalab.lisn.upsaclay.fr/competitions/2321?secret_key=b66c95cb-997c-4fc9-af4e-987721abfa6c 

