# Face classifier - Baseline Model

In this notebook I train a baseline model to establish a lower bound for the performance metric. To do so, I train a plain logistic regression, which is arguably one of the simplest classification models there is. The goal for the CNN architectures in the notebook fc_nn.ipynb will be to beat the baseline performance established here.

## Set up

In [None]:
# Stdlib imports
import random
from pathlib import Path
from datetime import datetime as dt

# 3rd party imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score 
from sklearn.metrics import classification_report, roc_curve, roc_auc_score, f1_score

# Local imports
from facecls import fcaux

## Configurations

In this section I configure settings and variables. I start by setting the random seed for the sake of reproducibility. I then fix the "target" and "model_type" variables, which are used in several places in this notebook, and finally create the required folder structure.

In [None]:
# set the seed
seed = 42
np.random.seed(seed)
random.seed(seed)

In [None]:
# define variables
target = "gender"
model_type = "logreg" 

In [None]:
# create folder structure
models_dir = Path(f"results/models/{target.title()}Classifier/")

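try:
    # Parse the numeric suffix from the folder name only, so that underscores
    # elsewhere in the path cannot shift the position of the id.
    last_model_id = max(int(folder.name.split("_")[-1]) for folder in models_dir.glob(f'{model_type}*'))
except ValueError:
    last_model_id = 0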

print("Last model id:", last_model_id)

new_model_id = last_model_id + 1
file_suffix = f"{model_type}_{target}_{str(new_model_id).zfill(3)}"
new_model_dir = models_dir / file_suffix
print(f"Creating folder \"{new_model_dir}\"...")
new_model_dir.mkdir(parents=True, exist_ok=True)
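
The numbering scheme used above can be tried out in isolation. The following is a minimal sketch, assuming run folders are named like `logreg_gender_001`; the helper name `next_model_id` is hypothetical and not part of this notebook or of `facecls`, and a temporary directory stands in for the real `results/` folder.

```python
import tempfile
from pathlib import Path


def next_model_id(models_dir: Path, model_type: str) -> int:
    """Return one more than the highest numeric suffix among matching folders."""
    ids = []
    for folder in models_dir.glob(f"{model_type}_*"):
        suffix = folder.name.rsplit("_", 1)[-1]
        if folder.is_dir() and suffix.isdigit():
            ids.append(int(suffix))
    # max(..., default=0) covers the case of no previous run
    return max(ids, default=0) + 1


# Demonstration in a throwaway directory (not the real results/ folder)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    first_id = next_model_id(demo_dir, "logreg")
    (demo_dir / "logreg_gender_001").mkdir()
    (demo_dir / "logreg_gender_007").mkdir()
    later_id = next_model_id(demo_dir, "logreg")

print(first_id, later_id)
```

Using `max(..., default=0)` avoids the `try`/`except` around an empty sequence, and ignoring non-numeric suffixes keeps unrelated folders from breaking the count.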

## Load data

Load the data from file and print some examples:

In [None]:
data = pd.read_csv("data/age_gender_preproc.csv")

In [None]:
data.head()

## Baseline model: Logistic Regression

### Data preprocessing

As seen in fc_eda.ipynb, the actual images are stored in the "pixels" column of the "data" DataFrame as strings of space-separated pixel values. The next cell converts this column into a 2D numpy array: each entry (a string) becomes a 1D numpy array, so that the full column becomes an array with one row per image.

In [None]:
full_img_vec_list = np.array([fcaux.pxlstring2pxlvec(data, i) for i in range(data.shape[0])])

In [None]:
full_img_vec_list

This resulting 2D array represents the input data to the baseline model.
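
The implementation of `fcaux.pxlstring2pxlvec` is not shown in this notebook. As a rough illustration of the conversion described above, here is a sketch on made-up data; the function `pixel_string_to_vector` and the toy DataFrame are assumptions for illustration only and may differ from what `fcaux` actually does (for example in dtype or scaling).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "pixels" column: two tiny "images" stored as strings
toy = pd.DataFrame({"pixels": ["0 255 128 64", "12 34 56 78"]})


def pixel_string_to_vector(pixel_string: str) -> np.ndarray:
    """Parse a space-separated pixel string into a 1D float array."""
    return np.array([float(v) for v in pixel_string.split()], dtype=np.float32)


# Stack the per-image vectors into a 2D array with one row per image
toy_matrix = np.stack([pixel_string_to_vector(s) for s in toy["pixels"]])
print(toy_matrix.shape)
```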

### Data split

As usual, we split the data set into a training, a validation and a test set. The test set comprises 20% of the entire data set and the validation set 10% of the remaining 80% (i.e. 8% of the entire data set), which leaves 72% of the full data set for training.

*REMARK:* Here, we will only use the training and test sets (and not the validation set), as there is no hyperparameter tuning for this baseline model. However, to use a consistent number of examples in both the baseline and the competitor models, the split is performed identically in both cases.

Notice that we perform the split on indices and not on the feature and target data directly. This way we can later simply save the train, validation and test example indices in a CSV file, which takes far less disk space than saving new copies of the full data for each model.

In [None]:
indices = list(range(len(full_img_vec_list)))

In [None]:
attrs = data[["gender", "ethnicity", "age_decades"]]
all_indices = range(full_img_vec_list.shape[0])

# Stratification is only possible for categorical targets
if target == "age":
    strat = None
else:
    strat = attrs[target].values

# Perform the train-test split
idx_train, idx_test = train_test_split(all_indices,
                                       test_size = 0.2,
                                       stratify = strat,
                                       random_state=seed
                                      )

# Perform the train-val split (stratify only if a stratification vector exists)
idx_train, idx_val  = train_test_split(idx_train,
                                       test_size = 0.1,
                                       stratify = None if strat is None else strat[idx_train],
                                       random_state=seed
                                      )
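
To double-check the 72% / 8% / 20% arithmetic, the same two-stage, index-based split can be run on toy data. This is only a sketch: the 1000 examples and the 60/40 label distribution are made up, and the variable names are chosen so that they do not clash with the ones used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 examples with a 60/40 binary label (made up for illustration)
n_examples = 1000
labels = np.array([0] * 600 + [1] * 400)
toy_indices = np.arange(n_examples)

# Stage 1: hold out 20% as test set
idx_trainval, idx_te = train_test_split(
    toy_indices, test_size=0.2, stratify=labels, random_state=42
)
# Stage 2: hold out 10% of the remainder as validation set
idx_tr, idx_va = train_test_split(
    idx_trainval, test_size=0.1, stratify=labels[idx_trainval], random_state=42
)

fractions = {name: len(idx) / n_examples
             for name, idx in [("train", idx_tr), ("val", idx_va), ("test", idx_te)]}
print(fractions)
print("share of class 1 in test:", labels[idx_te].mean())
```

Because the second split stratifies on `labels[idx_trainval]`, the class balance should stay close to that of the full data set in all three subsets.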

Now use those indices to extract the corresponding features/images and targets:

In [None]:
# First, extract column index for target column ...
target_idx = attrs.columns.get_loc(target)

# ... because we need this to slice the data sets using iloc
X_train = full_img_vec_list[idx_train]
y_train = attrs.iloc[idx_train, target_idx]

X_val = full_img_vec_list[idx_val]
y_val = attrs.iloc[idx_val, target_idx]

X_test = full_img_vec_list[idx_test]
y_test = attrs.iloc[idx_test, target_idx]
attrs_test = attrs.iloc[idx_test, :]

In [None]:
# Just checking: number of elements per data subset
print("#training:", len(X_train))
print("#validation:", len(X_val))
print("#test:", len(X_test))

Now save the three different index data sets to file:

In [None]:
# In order to pack all three index vectors into one single pd.DataFrame
# they all need to be of the same length. To achieve this, we fill the
# test and validation index vectors with NaNs until they have the same
# length as the training index vector.
idx_val += (len(idx_train) - len(idx_val))*[np.nan]
idx_test += (len(idx_train) - len(idx_test))*[np.nan]

# Check that the vectors are now all of equal length
assert len(idx_train) == len(idx_val)
assert len(idx_train) == len(idx_test)

# Pack all three index vectors into a single pd.DataFrame for easy
# and convenient writing to file.
idx_df = pd.DataFrame({"train_idx": idx_train,
                       "val_idx": idx_val,
                       "test_idx": idx_test}, dtype="Int64")

idx_df.to_csv(new_model_dir / f"data_set_indices__{file_suffix}.csv", index=False)
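
For completeness, here is a sketch of how such an index file could be read back later, together with an alternative way of packing index lists of unequal length that does not modify the lists in place. The toy index values are made up and an in-memory buffer stands in for the real CSV file.

```python
import io

import pandas as pd

# Toy index lists of unequal length (made-up values)
toy_idx = {"train_idx": [4, 0, 3], "val_idx": [1], "test_idx": [2, 5]}

# Series of different lengths are aligned on their index; missing slots become NA
toy_idx_df = pd.concat(
    {name: pd.Series(values, dtype="Int64") for name, values in toy_idx.items()},
    axis=1,
)

# Round trip through CSV (an in-memory buffer stands in for the real file)
buffer = io.StringIO()
toy_idx_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Drop the padding and cast back to plain integers
restored_idx = {col: restored[col].dropna().astype(int).tolist() for col in restored}
print(restored_idx)
```

Dropping the missing values on load restores the original index lists, so the padding never leaks into the slicing of features and targets.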

### Training the Logistic Regression Model

As the current goal is merely to establish a baseline model, I construct the logistic regression model using default hyperparameters. I only set the "random_state" parameter for reproducibility, "n_jobs" for efficiency and "verbose" for transparency.

In [None]:
# Instantiate the logistic regression model
model = LogisticRegression(random_state = seed, 
                           n_jobs = -1,
                           verbose=True
                           )

# Train the model and measure the time it takes to do so.
start = dt.now()
model.fit(X_train, y_train)
elapsed = dt.now() - start
print(f"Elapsed: {elapsed.total_seconds():.1f}s")

Use the trained logistic regression model to predict the class probabilities for the test set:

In [None]:
y_prob_test = model.predict_proba(X_test)

With these results, compute the false positive rate (fpr) and the true positive rate (tpr) at various thresholds (thr), save them to a file, and use them to plot the ROC curve and to compute the ROC AUC score:

In [None]:
# Compute the ROC curve
fpr, tpr, thr = roc_curve(y_test, y_prob_test[:,1])
pd.DataFrame({"FPR": fpr, "TPR": tpr}).to_csv(new_model_dir / f"fpr_vs_tpr__{file_suffix}.csv")

In [None]:
# Generate the ROC curve plot and compute the ROC AUC metric
fig, ax = plt.subplots()
ax.plot(fpr,tpr)
ax.plot([0,1], [0,1], ls="--", c="k")
ax.grid(True)
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title(f"ROC AUC: {np.round(roc_auc_score(y_test, y_prob_test[:,1]),4)}")
plt.savefig(new_model_dir / f"roc_curve__{file_suffix}.png",
            bbox_inches='tight')
plt.show()
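
To make the relationship between thresholds, FPR and TPR more tangible, the same two scikit-learn functions can be applied to a handful of made-up labels and scores. The numbers below are purely illustrative and unrelated to the face data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and scores, only to show the shape of the outputs
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, thr_toy = roc_curve(y_true_toy, y_score_toy)
auc_toy = roc_auc_score(y_true_toy, y_score_toy)

# One (FPR, TPR) point per threshold; lowering the threshold moves up and right
for f, t, th in zip(fpr_toy, tpr_toy, thr_toy):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print("ROC AUC:", auc_toy)
```

The three returned arrays have the same length, which is why the thresholds could be stored alongside FPR and TPR in the same CSV file if they are needed later.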

Last but not least, compute the performance metrics for all three data sets (train, validation and test) and write them to file:

In [None]:
# Compute class and probability predictions for all three data sets
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)
y_pred_test = model.predict(X_test)

y_proba_train = model.predict_proba(X_train)
y_proba_val = model.predict_proba(X_val)
y_proba_test = model.predict_proba(X_test)

In [None]:
# Create the classification report and write it to file
cls_report = pd.DataFrame(classification_report(y_test, y_pred_test, output_dict=True))
cls_report.to_csv(new_model_dir / f"classification_report__{file_suffix}.csv")

In [None]:
# Create more performance metrics and write them to file, too.
train_metrics = {"accuracy": accuracy_score(y_train, y_pred_train),
                "balanced_accuracy": balanced_accuracy_score(y_train, y_pred_train),
                "roc_auc": roc_auc_score(y_train, y_proba_train[:,1]),
                "F1": f1_score(y_train, y_pred_train)}

val_metrics = {"accuracy": accuracy_score(y_val, y_pred_val),
                "balanced_accuracy": balanced_accuracy_score(y_val, y_pred_val),
                "roc_auc": roc_auc_score(y_val, y_proba_val[:,1]),
                "F1": f1_score(y_val, y_pred_val)}

test_metrics = {"accuracy": accuracy_score(y_test, y_pred_test),
                "balanced_accuracy": balanced_accuracy_score(y_test, y_pred_test),
                "roc_auc": roc_auc_score(y_test, y_proba_test[:,1]),
                "F1": f1_score(y_test, y_pred_test)}

# Organize the results in a data frame for better readability
metrics_df = pd.DataFrame({"train": train_metrics, 
                           "val": val_metrics, 
                           "test": test_metrics})

display(metrics_df)
metrics_df.to_csv(new_model_dir / f"metrics__{file_suffix}.csv")
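
The three metric dictionaries above share the same structure, so they could also be produced by one small helper. The following is a sketch under stated assumptions: the function `binary_metrics` is hypothetical, the data is synthetic, and the toy model is fitted on all toy examples, so the "train" and "test" columns only illustrate the layout of the resulting table and say nothing about generalization.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)


def binary_metrics(fitted_model, X, y) -> dict:
    """Compute accuracy, balanced accuracy, ROC AUC and F1 for one data subset."""
    y_pred = fitted_model.predict(X)
    y_score = fitted_model.predict_proba(X)[:, 1]
    return {"accuracy": accuracy_score(y, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y, y_pred),
            "roc_auc": roc_auc_score(y, y_score),
            "F1": f1_score(y, y_pred)}


# Synthetic, well separated data stands in for the image vectors
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y_toy = np.array([0] * 50 + [1] * 50)
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Two arbitrary subsets, used only to show one column per subset
subsets = {"train": (X_toy[::2], y_toy[::2]), "test": (X_toy[1::2], y_toy[1::2])}
toy_metrics_df = pd.DataFrame(
    {name: binary_metrics(toy_model, X, y) for name, (X, y) in subsets.items()}
)
print(toy_metrics_df)
```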