## Initialization

## Load Data

The dataset is stored in the `/datasets/faces/` folder, which contains:
- The `final_files` folder with 7.6k photos
- The `labels.csv` file with labels, with two columns: `file_name` and `real_age`

Because the number of image files is large, avoid reading them all at once, as this would consume a lot of memory. We recommend building a generator with `ImageDataGenerator`. This method was explained in Chapter 3, Lesson 7 of this course.
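
To illustrate why a generator helps, here is a minimal, library-free sketch of lazy batching. The helper name `iter_batches` and the example file names are hypothetical; `ImageDataGenerator` also handles image decoding and resizing, which this sketch does not attempt to reproduce.

```python
from typing import Iterator, List, Sequence


def iter_batches(file_names: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of file names, one batch at a time."""
    for start in range(0, len(file_names), batch_size):
        # Only the current slice is materialized, so memory use stays bounded
        yield list(file_names[start:start + batch_size])


# Hypothetical file names, used only to show the batching behaviour
example_files = [f"{i:06d}.jpg" for i in range(10)]
batches = list(iter_batches(example_files, batch_size=4))
print([len(batch) for batch in batches])
```

The last batch is smaller than the others whenever the number of files is not a multiple of the batch size.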

The label file can be loaded as a regular CSV file.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf

from tensorflow.keras.preprocessing.image import load_img

from tensorflow.keras.preprocessing.image import ImageDataGenerator

from tensorflow.keras.applications.resnet import ResNet50
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

In [None]:
labels_path = 'https://practicum-content.s3.us-west-1.amazonaws.com/datasets/faces/labels.csv'
# photos_path = 'https://practicum-content.s3.us-west-1.amazonaws.com/datasets/faces/final_files/'


labels = pd.read_csv(labels_path)

In [None]:
labels.info()
labels.sample(5)

In [None]:
# Check if all file_name items are unique
are_filenames_unique = labels['file_name'].is_unique

# Print the result
print(f"Are all file names unique? {are_filenames_unique}")

In [None]:
print(labels.isnull().sum())
print(labels['real_age'].describe())

The dataset has no missing values in either the `file_name` or the `real_age` column, so no imputation or row removal is needed and the full dataset can be used for analysis and model training.
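
A complementary check is whether every listed file name actually exists on disk. The following is a minimal sketch; `find_missing_files` is a hypothetical helper, and the demo runs on a temporary directory with made-up file names rather than on the project data.

```python
import os
import tempfile
from typing import Iterable, List


def find_missing_files(file_names: Iterable[str], directory: str) -> List[str]:
    """Return the file names that have no matching file in `directory`."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(directory, name))]


# Demo on a temporary directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "000000.jpg"), "w").close()
    missing = find_missing_files(["000000.jpg", "000001.jpg"], tmp_dir)
print(missing)
```

In the notebook, the same helper could be called as `find_missing_files(labels['file_name'], '/datasets/faces/final_files/')`, assuming the images are available at that local path.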

Age Distribution:

- The dataset contains a total of 7,591 entries, indicating a substantial amount of data for training and evaluating the model.
- The mean age is approximately 31.2 years, with a standard deviation of about 17.14 years. This suggests a wide range of ages among the individuals in the dataset.
- The age range is from 1 to 100 years old, demonstrating a very diverse set of data in terms of age. This diversity is beneficial for training a model that can accurately predict a wide range of ages.
- The 25th percentile is 20 years, the median (50th percentile) is 29 years, and the 75th percentile is 41 years, so the middle half of the individuals are between 20 and 41 years old. The mean (31.2) exceeds the median (29), which is consistent with a distribution concentrated at younger ages with a long tail of older ones.
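
The quartile reasoning above can be reproduced programmatically. This is a minimal sketch: `summarize_ages` is a hypothetical helper, and `toy_ages` is made-up illustration data, not the project dataset.

```python
import pandas as pd


def summarize_ages(ages: pd.Series) -> dict:
    """Return quartiles, the interquartile range, and a mean-vs-median skew hint."""
    q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
        # A positive value (mean above median) hints at a right skew
        "mean_minus_median": ages.mean() - median,
    }


toy_ages = pd.Series([5, 18, 20, 25, 29, 30, 41, 60, 90])  # made-up values
summary = summarize_ages(toy_ages)
print(summary)
```

On the real data, the call would be `summarize_ages(labels['real_age'])`.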

Conclusions:

- Data Quality: The high quality of the dataset (no missing values, unique file names) makes it a solid foundation for further analysis and model training.
- Age Diversity: The broad age range and standard deviation suggest the dataset captures a wide variety of age groups, which is crucial for developing a model capable of accurately assessing ages across the spectrum.
- Model Training Implications: The diversity in age and the distribution skewed slightly towards younger ages might influence how the model is trained, potentially requiring techniques to ensure it does not become biased towards more frequently represented age groups.
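
One possible technique for reducing bias towards frequent age groups is inverse-frequency sample weighting. The sketch below is an illustration under assumptions, not the method used in this project: `inverse_frequency_weights`, the bin edges, and the toy ages are all hypothetical.

```python
import numpy as np


def inverse_frequency_weights(ages, bin_edges):
    """Weight each sample by the inverse size of its age bin, normalized to mean 1."""
    ages = np.asarray(ages)
    bin_ids = np.digitize(ages, bin_edges)   # index of the age bin for each sample
    counts = np.bincount(bin_ids)            # number of samples in each bin
    weights = 1.0 / counts[bin_ids]          # rarer bins receive larger weights
    return weights / weights.mean()          # keep the average weight at 1


toy_ages = [22, 25, 27, 28, 70]              # made-up values: four young, one old
weights = inverse_frequency_weights(toy_ages, bin_edges=[20, 40, 60])
print(weights)
```

Such weights could be stored in an extra dataframe column. Recent versions of `flow_from_dataframe` document a `weight_col` argument for this purpose, but check the installed version before relying on it.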

Given this analysis, it's clear that the dataset is well-prepared for the next steps in the project, including more detailed exploratory data analysis, model training, and evaluation. The broad age range and substantial dataset size are promising for training a robust model capable of accurately determining individuals' ages from photographs, which is essential for the project's goal of adhering to alcohol laws.

## EDA

In [None]:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Plot the distribution of ages
plt.figure(figsize=(10, 6))
sns.histplot(labels['real_age'], bins=30, kde=True)
plt.title('Distribution of Ages in the Dataset')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

The age distribution in the dataset is roughly bell-shaped but is skewed to the right, indicating a larger proportion of younger individuals. The most common age range appears to be between approximately 20 and 30 years old. There is a significant decline in frequency as age increases, with very few individuals in the older age range (60+ years). This skewness towards younger ages might suggest that the model trained on this dataset could perform better at estimating the ages of younger individuals compared to older ones, due to more examples to learn from.
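
To quantify the imbalance visible in the histogram, the ages can be counted per age group. This is a minimal sketch; `age_group_counts`, the bin edges, and the toy ages are hypothetical choices made for illustration.

```python
import pandas as pd


def age_group_counts(ages: pd.Series, bin_edges: list) -> pd.Series:
    """Count samples per left-closed age interval, in ascending interval order."""
    groups = pd.cut(ages, bins=bin_edges, right=False)
    return groups.value_counts(sort=False)


toy_ages = pd.Series([3, 21, 24, 28, 35, 62])  # made-up values
counts = age_group_counts(toy_ages, bin_edges=[0, 20, 40, 60, 101])
print(counts)
```

On the real data, `age_group_counts(labels['real_age'], [0, 20, 40, 60, 101])` would show how few samples fall into the 60+ group relative to the 20-40 group.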

In [None]:
# Define a function to display a row of sample images
def display_sample_images(image_paths, n_samples=10):
    image_paths = image_paths[:n_samples]  # never exceed the subplot grid
    plt.figure(figsize=(20, 4))
    for i, image_path in enumerate(image_paths):
        img = load_img(image_path, target_size=(224, 224))
        plt.subplot(1, n_samples, i + 1)
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

# Select random sample file names
sample_files = labels.sample(n=10)['file_name'].values
sample_image_paths = ['/datasets/faces/final_files/' + file_name for file_name in sample_files]

# Display the images
display_sample_images(sample_image_paths)

From the provided sample images, we can draw the following conclusions:

- Variety in Age: The images represent a range of ages from children to older adults. This variety is crucial for training a model that needs to recognize and predict ages across a broad spectrum.
- Image Quality: The quality of the images varies. Some images appear to be clear, while others seem to be of lower resolution or have some blur, which could impact the model's ability to extract age-related features.
- Lighting Conditions: There is a noticeable variation in lighting conditions across the images. Some faces are well-lit, while others are in shadow or have uneven lighting. Such variations can pose a challenge for age estimation models and may require the use of data augmentation techniques to make the model more robust to different lighting conditions.
- Background and Pose: The backgrounds vary from neutral to noisy, and the subjects have different head poses. The diversity in background and pose is beneficial for training a model to focus on facial features rather than background elements. However, extreme poses or occlusions could make age prediction more challenging.
- Facial Expressions: The sample shows a variety of facial expressions, which can affect apparent age. Training a model to account for these variations is important for accurate age estimation.
- Accessories and Hairstyles: Some individuals are wearing glasses, hats, or have hairstyles that partially obscure their faces. These factors can influence age perception and should be considered when training the model.

The sample images reflect the diversity in age, image quality, lighting, background, pose, and facial expressions that we would expect in a real-world scenario.
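
The idea of augmenting for lighting variation can be sketched with plain NumPy. This is a simplified illustration, not Keras' implementation: `random_brightness`, the factor range, and the uniform grey test image are assumptions.

```python
import numpy as np


def random_brightness(image: np.ndarray, low: float, high: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Scale the intensities of an image in [0, 1] by a random factor, then clip."""
    factor = rng.uniform(low, high)
    return np.clip(image * factor, 0.0, 1.0)


rng = np.random.default_rng(12345)
toy_image = np.full((224, 224, 3), 0.5)   # hypothetical uniform grey image
augmented = random_brightness(toy_image, low=0.7, high=1.3, rng=rng)
print(augmented.shape, augmented.min(), augmented.max())
```

In a Keras pipeline, `ImageDataGenerator` exposes a `brightness_range` argument that serves a similar purpose.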

### Findings

Age Distribution:

- The dataset contains a wide range of ages, from 1 to 100 years old, with a total of 7,591 images.
- The mean age is approximately 31.2 years, with the age distribution being right-skewed, indicating a higher concentration of younger individuals, particularly in the 20-30 year age range.
- This skewness suggests that there is more data available for younger individuals, which could result in the model being more accurate for these ages due to the larger amount of training data.

Image Samples:

- A visual inspection of a subset of images reveals a diversity in age, suggesting that the dataset has a broad representation that could be beneficial for training a model to recognize a range of ages.
- The sample images also show variability in image quality, lighting conditions, and background noise. These variations are representative of real-world conditions but could pose challenges for the model's performance.
- Some images have accessories like glasses and hats, and there are varying facial expressions and head poses, which could affect age perception and need to be considered during model training.

Conclusions:

- Data Quality and Quantity: The dataset is comprehensive and lacks missing values, providing a strong foundation for model training. However, the skew towards younger ages could lead to biases in the model's performance, which would need to be addressed, potentially through data augmentation or weighted loss functions during training.
- Model Training and Validation: Given the variations in image quality and conditions, the model should be robust to such variations, which might involve using a pre-trained network or incorporating data augmentation techniques that simulate different lighting conditions and pose variations.
- Potential for Bias: The imbalance in age distribution points to the potential for age prediction bias, which should be a consideration when splitting the dataset into training and validation sets. Stratified sampling could help ensure that the model is validated against an age distribution that mirrors the training set.
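
The stratified sampling idea can be sketched with pandas alone. This is an illustration under assumptions: `stratified_split`, the bin edges, and the toy dataframe are hypothetical, and rows whose age falls outside the bin edges would stay in the training part.

```python
import pandas as pd


def stratified_split(df: pd.DataFrame, age_col: str, bin_edges: list,
                     val_fraction: float = 0.2, seed: int = 12345):
    """Split df so that each age bin sends about val_fraction of its rows to validation."""
    bins = pd.cut(df[age_col], bins=bin_edges, right=False)
    # Sample the validation rows separately inside every observed age bin
    val_df = df.groupby(bins, observed=True).sample(frac=val_fraction, random_state=seed)
    train_df = df.drop(val_df.index)
    return train_df, val_df


# Made-up data: ten young and five old individuals
toy = pd.DataFrame({
    "file_name": [f"{i:06d}.jpg" for i in range(15)],
    "real_age": [25] * 10 + [65] * 5,
})
train_df, val_df = stratified_split(toy, "real_age", bin_edges=[0, 40, 101])
print(len(train_df), len(val_df))
```

The two resulting dataframes could then be passed to separate `flow_from_dataframe` calls instead of relying on `validation_split`.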

## Modelling

Define the necessary functions to train your model on the GPU platform and build a single script containing all of them along with the initialization section.

To make this task easier, you can define them in this notebook and run the ready-made code in the next section to automatically compose the script.

The definitions below will be checked by project reviewers as well, so that they can understand how you built the model.

In [None]:
def load_train(path):
    
    """
    It loads the train part of dataset from path
    """
    
    # Create an instance of the ImageDataGenerator class
    datagen = ImageDataGenerator(
        rescale=1./255,       # Rescale the image by normalizing pixel values
        validation_split=0.2, # Reserve 20% of the data for validation
        horizontal_flip=True, # Augment the data by flipping images horizontally
        vertical_flip=True,   # Augment the data by flipping images vertically
    )
    
    # Create a generator that will read the training data
    train_gen_flow = datagen.flow_from_dataframe(
        dataframe=pd.read_csv(path + 'labels.csv'), # Load labels
        directory=path + 'final_files/',           # Path to the image files
        x_col='file_name',                          # Column in dataframe that contains the filenames
        y_col='real_age',                           # Column in dataframe that contains the target
        target_size=(224, 224),                     # The dimensions to which all images found will be resized
        batch_size=32,                              # Size of the batches of data
        class_mode='raw',                           # Determines the type of label arrays that are returned
        subset='training',                          # Specifies that this is training data
        seed=12345                                  # Random seed for shuffling and transformations
    )


    return train_gen_flow

In [None]:
def load_test(path):
    
    """
    It loads the validation/test part of dataset from path
    """
    
    # Create an instance of the ImageDataGenerator class
    # Here we only rescale the validation data, without augmentation
    datagen = ImageDataGenerator(
        rescale=1./255,       # Rescale the image by normalizing pixel values
        validation_split=0.2  # Reserve 20% of the data for validation
    )
    
    # Create a generator that will read the test data
    test_gen_flow = datagen.flow_from_dataframe(
        dataframe=pd.read_csv(path + 'labels.csv'), # Load labels
        directory=path + 'final_files/',           # Path to the image files
        x_col='file_name',                          # Column in dataframe that contains the filenames
        y_col='real_age',                           # Column in dataframe that contains the target
        target_size=(224, 224),                     # The dimensions to which all images found will be resized
        batch_size=32,                              # Size of the batches of data
        class_mode='raw',                           # Determines the type of label arrays that are returned
        subset='validation',                        # Specifies that this is validation data
        seed=12345                                  # Random seed for shuffling and transformations
    )

    return test_gen_flow

In [None]:
def create_model(input_shape):
    
    """
    It defines the model
    """
    
    # Define the base model, ResNet50, with weights pre-trained on ImageNet
    backbone = ResNet50(input_shape=input_shape,
                        weights='imagenet',
                        include_top=False)
    
    # Freeze the layers of the backbone
    backbone.trainable = False
    
    # Define the custom head for our network
    model = Sequential([
        backbone,
        GlobalAveragePooling2D(),      # Add GAP layer to reduce the spatial dimensions
        Flatten(),                     # Flatten the output
        Dense(256, activation='relu'), # Add a fully connected layer with 256 units
        Dropout(0.5),                  # Add dropout for regularization
        Dense(1, activation='linear')  # Output layer with a single neuron (for regression)
    ])
    
    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.0001), # Optimizer (`lr` is a deprecated alias)
                  loss='mean_squared_error',  # Loss function for regression
                  metrics=['mae'])           # Metric to monitor

    return model

In [None]:
def train_model(model, train_data, test_data, batch_size=None, epochs=20,
                steps_per_epoch=None, validation_steps=None):

    """
    Trains the model given the parameters
    """
    
    if steps_per_epoch is None:
        steps_per_epoch = len(train_data)
    if validation_steps is None:
        validation_steps = len(test_data)
        
    # Adding learning rate scheduler and early stopping for better training control
    lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, verbose=1, min_lr=0.000001)
    early_stopping = EarlyStopping(monitor='val_loss', patience=10, verbose=1, restore_best_weights=True)
    callbacks = [lr_scheduler, early_stopping]

    model.fit(train_data,
              validation_data=test_data,
              epochs=epochs,
              batch_size=batch_size,
              steps_per_epoch=steps_per_epoch,
              validation_steps=validation_steps,
              callbacks=callbacks,  # apply the LR scheduler and early stopping
              verbose=2)

    return model
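
The plateau-based schedule configured in `train_model` (`factor=0.5`, `patience=5`, `min_lr=0.000001`) can be understood through a small simulation. This is a simplified sketch of the idea, not Keras' actual implementation: it ignores `min_delta` and `cooldown`, and the validation losses are made up.

```python
def simulate_reduce_lr(val_losses, initial_lr=1e-4, factor=0.5,
                       patience=5, min_lr=1e-6):
    """Return the learning rate in effect after each epoch of a plateau schedule."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement in this epoch
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Made-up losses: two improving epochs followed by five stagnating ones
toy_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
lr_history = simulate_reduce_lr(toy_losses)
print(lr_history)
```

With `patience=5`, the rate is halved only after the fifth consecutive epoch without improvement, so a 20-epoch run can see at most a few reductions.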

### Prepare the Script to Run on the GPU Platform

Once you have defined the necessary functions, you can compose a script for the GPU platform, download it via the "File|Open..." menu, and upload it later to run on the GPU platform.

N.B.: The script should include the initialization section as well. An example of this is shown below.

In [None]:
# prepare a script to run on the GPU platform

init_str = """
import pandas as pd

import tensorflow as tf

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.resnet import ResNet50
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
"""

import inspect

with open('run_model_on_gpu.py', 'w') as f:
    
    f.write(init_str)
    f.write('\n\n')
        
    for fn in [load_train, load_test, create_model, train_model]:
        
        src = inspect.getsource(fn)
        f.write(src)
        f.write('\n\n')
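
Before uploading, it can be useful to confirm that the composed script parses and contains the expected imports and functions. This is a minimal sketch using the standard `ast` module; `inspect_script` is a hypothetical helper, and the toy source string stands in for the contents of `run_model_on_gpu.py`.

```python
import ast
from typing import List, Tuple


def inspect_script(source: str) -> Tuple[List[str], List[str]]:
    """Return (imported module names, top-level function names) found in source."""
    tree = ast.parse(source)  # raises SyntaxError if the script is malformed
    imports, functions = [], []
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module)
        elif isinstance(node, ast.FunctionDef):
            functions.append(node.name)
    return imports, functions


# Toy source standing in for the generated script
toy_source = "import pandas as pd\n\ndef load_train(path):\n    return path\n"
imports, functions = inspect_script(toy_source)
print(imports, functions)
```

For the real file, the source could be read with `open('run_model_on_gpu.py').read()`. Parsing does not import anything, so the check also works on a machine without TensorFlow.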

In [None]:
if __name__ == '__main__':
    path = '/datasets/faces/' 

    train_data = load_train(path)
    test_data = load_test(path)

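    # Input shape must match target_size=(224, 224) used by the generators
    model = create_model(input_shape=(224, 224, 3))

    # The generators already yield batches of 32, so batch_size is not passed on
    model = train_model(model, train_data, test_data, epochs=20)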
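
When `steps_per_epoch` and `validation_steps` are left as `None`, `train_model` falls back to the number of batches in each generator. The arithmetic can be sketched as follows. The 7,591 rows, the 0.2 split, and the batch size of 32 come from this notebook; the exact rounding of the split in `split_sizes` is an assumption about how the generator partitions the rows.

```python
import math


def split_sizes(n_total: int, validation_split: float):
    """Assumed split: validation receives int(n_total * validation_split) rows."""
    n_val = int(n_total * validation_split)
    return n_total - n_val, n_val


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover n_samples once (the last may be smaller)."""
    return math.ceil(n_samples / batch_size)


n_train, n_val = split_sizes(7591, 0.2)
train_steps = steps_per_epoch(n_train, 32)
val_steps = steps_per_epoch(n_val, 32)
print(n_train, train_steps, n_val, val_steps)
```

Under these assumptions, one epoch consists of roughly 190 training batches and 48 validation batches.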

### Output

Place the output from the GPU platform as a Markdown cell here.

## Conclusions

# Checklist

- [ ]  Notebook was opened
- [ ]  The code is error free
- [ ]  The cells with code have been arranged by order of execution
- [ ]  The exploratory data analysis has been performed
- [ ]  The results of the exploratory data analysis are presented in the final notebook
- [ ]  The model's MAE score is not higher than 8
- [ ]  The model training code has been copied to the final notebook
- [ ]  The model training output has been copied to the final notebook
- [ ]  The findings have been provided based on the results of the model training