

<div class="alert alert-block alert-info" style="width:400px, text-color:black"> 
For this task we will use the <a href="https://www.kaggle.com/datasets/nipunarora8/age-gender-and-ethnicity-face-data-csv?select=age_gender.csv">age-gender-and-ethnicity-face dataset</a>, which contains a number of images, each accompanied by features recording the age, gender, and ethnicity of the person in the image. This is a version of the <a href="https://susanqq.github.io/UTKFace/" >UTKFace dataset</a>, modified to make it easier to load as one file.<br>
Please note that
    <ul>
        <li>It happens not to contain any people who self-identified as non-binary, so gender is labelled as 0 (male) or 1 (female).</li>
        <li>Ethnicity is coded as an integer from 0 to 4, denoting White, Black, Asian, Indian, and Others (like Hispanic, Latino, Middle Eastern) (terms from the original site).</li>
        <li>Each row of this version of the dataset contains integer values for the three features, a string with the name of the original jpg from the UTK archive, and a feature called 'pixels' which contains the 48x48 pixel values as a string.</li>
    </ul>

The next two cells import the libraries we need, load the data into a pandas dataframe, and then show the first five rows.

In [None]:
import numpy as np 
import pandas as pd
import os
import socket
import random
from matplotlib import pyplot as plt

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split


import tensorflow as tf # used directly later on (tf.data, tf.expand_dims, tf.get_logger)
from tensorflow.keras.utils import to_categorical # convert to one-hot-encoding
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers as keras_layers

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
if (socket.gethostname()=='csctcloud'): #on csctcloud
    path="/home/common/datasets/"
else: #machine specific- this is for jim's development
    path = "../datasets"
dataframe = pd.read_csv(path+'/utk/teacher_pupil.csv')
dataframe.head()

### Let's start by splitting the images into appropriate numpy arrays
- We first convert the 'pixels' column of the dataframe into a numpy array of strings,
- then split each string into a row of numbers using a space as the separator,
- before reshaping our array from 23705 x 2304 floats into 23705 images of 48 x 48 pixels (with one channel).
- The conversion of the labels first makes a straightforward 1d array,
  then uses that to put 1s into the right column of a 2d array for one-hot encoding.
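
The steps above can be tried on a toy example first. This is only a sketch: the two 2x2 "images" and their labels are made up for illustration, and the one-hot encoding is written out by hand in numpy rather than with `to_categorical`.

```python
import numpy as np

# Two made-up "images" of 2x2 pixels, stored as space-separated strings
# in the same way as the 'pixels' column (illustrative values only).
pixel_strings = np.array(["0 64 128 255", "10 20 30 40"])
toy_labels = np.array([0, 1])

# Split each string on spaces and convert to floats: shape (2, 4)
flat = np.array([s.split(" ") for s in pixel_strings], dtype=float)

# Reshape into (n_images, height, width, channels) and convert to integers
toy_images = flat.reshape(-1, 2, 2, 1).astype(int)

# One-hot encoding by hand: put a 1 in the column given by each label
one_hot = np.zeros((toy_labels.size, 2))
one_hot[np.arange(toy_labels.size), toy_labels] = 1

print(toy_images.shape)
print(one_hot)
```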

In [None]:
imgs = dataframe['pixels'].to_numpy()
print(f'original shape of imgs array {imgs.shape}')
imgs = np.array([x.split(' ') for x in imgs], dtype=float)
print(f' shape  after splitting: {imgs.shape}')
imgs = imgs.reshape(-1,48,48,1).astype(int)
print(f'shape  after reshaping into 2d images with one channel: {imgs.shape}')

In [None]:
labels= dataframe['gender'].to_numpy()
y = to_categorical(labels,num_classes=2)
print(f'shape of labels is {labels.shape} and y is {y.shape}')
print(f'Split of males:females in the labels is {np.unique(labels,return_counts=True)[1]}')

### This is what ten randomly chosen images look like

In [None]:
fig,axs=plt.subplots(2,5,figsize=(7.5,5))
for i in range(10):
    img = random.randrange(labels.shape[0]) #randint includes its upper bound, which would be out of range
    axs[i//5][i%5].imshow(imgs[img],cmap='gray')
    axs[i//5][i%5].set_title(f'{ "male" if labels[img]==0 else "female"}\n y[i]= {y[img]}')
                                   

### Finally, use the standard sklearn function to split the data into training and test sets

In [None]:

X_train,X_test,y_train,y_test= train_test_split(imgs,y,test_size=7705,shuffle=True,stratify=y)
print(f'For sanity-checking: train and test arrays have shapes {X_train.shape}, {X_test.shape},{y_train.shape},{y_test.shape}')

## Now the convnet bit
- Start by specifying a function to create a straightforward CNN using the keras Sequential model interface.
- Then make a model and train it.

The architecture is inspired by [this kaggle post](https://www.kaggle.com/code/amishaasrani/gender-detection-by-cnn).  
It introduces some new types of layer into each convolution-maxpooling block, which implement some standard tricks to improve the training of deep networks.
1. *Batch normalisation* is a method that attempts to reduce the effect of a layer's outputs shifting in scale from one batch of data to the next.  
   During training it standardises the outputs of the previous layer using the mean and standard deviation of the current batch, then applies a learned scale and offset. When the trained model makes predictions it uses running averages of those statistics collected during training.  
   The net effect is usually to make it **faster to train** a network.  
   [keras documentation here](https://keras.io/api/layers/normalization_layers/batch_normalization/)  
   [Machine Learning Mastery blog here](https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/)
2. *Dropout* is a **regularisation** technique that tries to stop the learned model from relying too heavily on any individual node.  
   During training a randomly chosen fraction (0.2 in this case) of the outputs of the previous layer are 'switched off' (set to zero) at each step, and the remaining outputs are scaled up to compensate, so the network has to spread what it learns across many nodes. Nothing is dropped when the trained model makes predictions.  
   The net effect is usually **to help prevent over-fitting**.  
   [keras documentation here](https://keras.io/api/layers/regularization_layers/dropout/)  
   [Machine Learning Mastery blog here](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/)
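
To make the two ideas concrete, here is a minimal numpy sketch of what each layer does to a batch of activations during training. It is an illustration under simplifying assumptions, not the keras implementation: the function names `batch_norm_train` and `dropout_train` are made up for this example, the scale and offset are fixed rather than learned, and the running averages used at prediction time are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
# A made-up batch of 64 examples with 8 features each
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 8))

def batch_norm_train(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Standardise each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def dropout_train(x, rate, rng):
    """Zero a random fraction `rate` of the values and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

normalised = batch_norm_train(activations)
dropped = dropout_train(normalised, rate=0.2, rng=rng)

print(normalised.mean(axis=0).round(3))   # close to 0 for every feature
print(normalised.std(axis=0).round(3))    # close to 1 for every feature
print((dropped == 0).mean())              # roughly the dropout rate
```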

In [None]:
def conv_model(num_classes):
    model = Sequential()
    
    #first block of layers
    model.add(keras_layers.Conv2D(32, kernel_size=(3, 3), activation='relu', padding = "same", input_shape=(48,48,1)))
    model.add(keras_layers.BatchNormalization())
    model.add(keras_layers.MaxPool2D(pool_size=(2,2)))
    model.add(keras_layers.Dropout(0.2))
    
    #second block of layers
    model.add(keras_layers.Conv2D(64, kernel_size=(3,3),activation="relu",padding="same"))
    model.add(keras_layers.BatchNormalization())
    model.add(keras_layers.MaxPool2D(pool_size=(2,2)))
    model.add(keras_layers.Dropout(0.2))
    
    #third block of layers
    model.add(keras_layers.Conv2D(64, kernel_size=(3,3),activation="relu",padding="same"))
    model.add(keras_layers.BatchNormalization())
    model.add(keras_layers.MaxPool2D(pool_size=(2,2)))
    model.add(keras_layers.Dropout(0.2))
    
    #fully connected layers followed by softmax output
    model.add(keras_layers.Flatten())
    model.add(keras_layers.Dense(256,activation="relu"))#256
    model.add(keras_layers.Dense(num_classes, activation="softmax"))
    
    model.compile(optimizer='Adam',
              loss= 'BinaryCrossentropy',
              metrics=['accuracy'])
    return model

In [None]:
convnet = conv_model(num_classes=2)
convnet.summary()

### Train the model using an early stopping criterion
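
As a rough guide to what `patience` and `min_delta` mean, the sketch below replays the idea on a list of validation losses. It is a simplified illustration, not the keras code: the function name `early_stopping_epoch` and the loss values are made up for this example.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.001):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last real improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # improved by more than min_delta: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stops after epoch 2
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model is put back to the weights it had at the best epoch rather than the ones from the epoch where training stopped.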

In [None]:
early_stopping = EarlyStopping(monitor='val_loss',patience=10, 
                               min_delta=0.001,
                               restore_best_weights=True)


history= convnet.fit(X_train,y_train,validation_split=0.1,epochs=50,batch_size=64, callbacks=[early_stopping],verbose=True)

## plot training history

In [None]:
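#plot the two curves separately so each gets its own legend entry
plt.plot(history.epoch, history.history["accuracy"],label='training')
plt.plot(history.epoch, history.history['val_accuracy'],label='validation')
plt.legend()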
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.suptitle('Training History')
plt.show()

### show test accuracy and confusion matrix

In [None]:
loss, acc = convnet.evaluate(X_test, y_test, verbose=0)
print("Test loss: {}".format(loss))
print(f"Test Accuracy: {acc*100}%")
ypred = convnet.predict(X_test,verbose=0)
#convert ypred and y_test back to categorical labels
y_true= np.argmax(y_test,axis=1)
y_pred = np.argmax(ypred,axis=1)

ConfusionMatrixDisplay.from_predictions(y_true,y_pred,display_labels=['male','female'])


## Keras Support for data augmentation
There is a list of preprocessing layers, including data augmentation, and different ways of using them in a workflow in [this Keras guide](https://keras.io/guides/preprocessing_layers/).

For now we will begin by illustrating the effects of some common augmentation layers.
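
Before looking at real faces, it can help to see what a flip does to the underlying pixel array. The sketch below uses plain numpy slicing on a tiny made-up array; it is only an illustration of the operation, not how the keras layers are implemented. Note that `RandomFlip()` with no arguments uses the mode `"horizontal_and_vertical"`, so it may apply either kind of flip.

```python
import numpy as np

# A tiny made-up "image" with 2 rows and 3 columns
image = np.arange(6).reshape(2, 3)   # rows are [0 1 2] and [3 4 5]

# Horizontal flip: mirror left-to-right by reversing the column order
horizontal_flip = image[:, ::-1]

# Vertical flip: mirror top-to-bottom by reversing the row order
vertical_flip = image[::-1, :]

print(horizontal_flip)
print(vertical_flip)
```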

In [None]:
#Sequential expects a list of layers, so each layer is wrapped in a list
data_augmentation = [Sequential([keras_layers.RandomFlip()]),
                     Sequential([keras_layers.RandomContrast(1)]), 
                     Sequential([keras_layers.RandomRotation(0.1)]),
                     Sequential([keras_layers.RandomZoom(0.5)])]
names=['flip','contrast','rotation','zoom']


In [None]:
tf.get_logger().setLevel('ERROR')
fig,axs= plt.subplots(len(names),5,figsize=(10, 10))
first_image = X_train[0]
for row in range(len(names)):
    axs[row][0].set_ylabel(names[row])
    for col in range(5):
        #training=True makes sure the random transformation is actually applied
        augmented_image = data_augmentation[row](
            tf.expand_dims(first_image, 0), training=True
        )
        axs[row][col].imshow(augmented_image[0].numpy().astype("int32"),cmap='gray')
        #hide the ticks on every subplot but keep the row labels
        axs[row][col].set_xticks([])
        axs[row][col].set_yticks([])

# Now create a pipeline that will be used in training

In [1]:
def get_gender_face_data(seed=12345):
    if (socket.gethostname()=='csctcloud'): 
        path="/home/common/datasets/"
    else: #machine specific
        path = "../datasets"
    dataframe = pd.read_csv(path+'/utk/teacher_pupil.csv')
    
    imgs = dataframe['pixels'].to_numpy()
    imgs = np.array([x.split(' ') for x in imgs], dtype=float)
    imgs = imgs.reshape(-1,48,48,1).astype(int)
    labels= dataframe['gender'].to_numpy()
    y = to_categorical(labels,num_classes=2)
    X_train,X_test,y_train,y_test= train_test_split(imgs,y,test_size=7705,shuffle=True,stratify=y,random_state=seed)
    return X_train,X_test,y_train,y_test

In [None]:
# make a function that will let us choose which augmentations to use
def make_augmenter(flip=True,contrast=True,rotation=True,zoom=True):
    augmenter = Sequential()
    if flip:
        augmenter.add(keras_layers.RandomFlip())
    if contrast:
        augmenter.add(keras_layers.RandomContrast(1))
    if rotation:
        augmenter.add(keras_layers.RandomRotation(0.1))
    if zoom:
        augmenter.add(keras_layers.RandomZoom(0.5))
    return augmenter

In [None]:
# Create a tf.data pipeline of augmented images (and their labels)
#this time we have to take out the validation set manually

def get_augmented_data_streams(X_train,y_train,batchsize,augmenter=None):
    #fixed-size split: 15000 faces for training and 1000 for validation
    split=15000
    valsize=1000
    assert  split+valsize <=len(y_train),f"can't split data we don't have {len(y_train)}"
    x_tr= X_train[:split,:]
    y_tr=y_train[:split]
    x_val= X_train[split:split+valsize,:]
    y_val=y_train[split:split+valsize]
    assert split%batchsize==0,"training set size must be multiple of batchsize"
    assert valsize%batchsize==0,"validation set size must be multiple of batchsize"
    if augmenter is None: #default to using every augmentation
        augmenter = make_augmenter()

    train_dataset = tf.data.Dataset.from_tensor_slices((x_tr, y_tr)).batch(batchsize)
    if len(augmenter.layers)>0: #nothing to apply if every augmentation is switched off
        #training=True is needed so the random layers are active inside the map
        train_dataset = train_dataset.map(
            lambda x, y: (augmenter(tf.cast(x, tf.float32), training=True), y))
    validation_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batchsize)
    return train_dataset,validation_dataset

In [None]:
#now put all the pieces together into a function that can be called in a loop 

def run_experiment (flip=False,contrast=False,rotation=False,zoom=False,batchsize=50):
    X_train,X_test,y_train,y_test = get_gender_face_data()
    data_augmentation= make_augmenter(flip=flip,contrast=contrast,rotation=rotation,zoom=zoom)
    train_dataset,validation_dataset = get_augmented_data_streams(X_train,y_train, batchsize,
                                                                  augmenter=data_augmentation)
    print('Data made')
    
    augmented_cnn = conv_model(num_classes=2)
    early_stopping = EarlyStopping(monitor='val_loss',patience=10, 
                               min_delta=0.001,
                               restore_best_weights=True)
    #the datasets are already batched, so fit() needs no batch_size or steps_per_epoch
    history= augmented_cnn.fit(train_dataset,
                               epochs=50,
                              validation_data=validation_dataset, callbacks=[early_stopping],
                               verbose=True)
    loss, acc = augmented_cnn.evaluate(X_test, y_test, verbose=0)
    print("Test loss: {}".format(loss))
    print(f"Test Accuracy: {acc*100}%")
    ypred = augmented_cnn.predict(X_test,verbose=0)
    #convert ypred and y_test back to categorical labels
    y_true= np.argmax(y_test,axis=1)
    y_pred = np.argmax(ypred,axis=1)

    cm=ConfusionMatrixDisplay.from_predictions(y_true,y_pred,display_labels=['male','female']) 
    return cm.confusion_matrix
    

In [None]:
#this runs an experiment
test_res= run_experiment(flip=True)

In [None]:
test_res

# Questions to investigate:
1. Which of these are valid transformations for human faces?
 - horizontal shifts
 - vertical shifts
 - rotation
 - horizontal flips
 - vertical flips
 
Does it make a difference what the task is, i.e. gender recognition vs. recognising a specific person?
 
To do this investigation using an appropriate scientific method, treat each of these as a hypothesis to be tested. Take a number of observations (e.g. accuracy of the trained model) for each case (e.g. using horizontal flips vs. not using horizontal flips), then compare the mean results and use appropriate statistical tests to determine whether the differences are statistically significant.
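
As one possible way of comparing two sets of runs, the sketch below uses a simple permutation test on the difference in mean accuracy. It is an illustration only: the function name `permutation_test` is made up for this example, other tests may be more appropriate for your design, and the accuracy values are placeholders rather than real results, so replace them with your own measurements.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        # shuffle the group labels and recompute the difference in means
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            count += 1
    # add one to both counts so the estimate is never exactly zero
    return (count + 1) / (n_permutations + 1)

# PLACEHOLDER accuracies, not real results: substitute your own runs
acc_with_flip = [0.89, 0.90, 0.91, 0.90, 0.89]
acc_without_flip = [0.87, 0.88, 0.88, 0.87, 0.89]

p_value = permutation_test(acc_with_flip, acc_without_flip)
print(f"estimated p-value: {p_value:.3f}")
```

A small p-value suggests the difference in means would be unlikely if the augmentation made no difference. With only a handful of runs per case the test has little power, so more repeats give more trustworthy conclusions.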

# The main task:

- Use the different pipeline components above to experiment with the different sorts of data augmentation available within keras, e.g. rotations, zoom, contrast changes, and vertical/horizontal flips. There are others available that only require a minor extension to my make_augmenter() function.
- Design an appropriate methodology to evaluate what difference they make, singly or in combination, to the classification accuracy of the trained system.  
  *Hint*: If you are making several changes to a system you need some way of knowing which have had an effect: [illustrated in 200 words](https://thaddeus-segura.com/data-aug/)
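
One possible starting point, sketched below, is to list every on/off combination of the four augmentation flags so that each can be tested the same number of times. This is only a sketch of one design: the helper name `all_configurations` is made up for this example, and the keys are assumed to match the keyword arguments of `run_experiment()` above.

```python
from itertools import product

# Names assumed to match the keyword arguments of run_experiment()
FLAGS = ("flip", "contrast", "rotation", "zoom")

def all_configurations(flags=FLAGS):
    """Return one dict per on/off combination of the given flags."""
    return [dict(zip(flags, values))
            for values in product([False, True], repeat=len(flags))]

configs = all_configurations()
print(len(configs))   # 2**4 combinations
print(configs[0])     # baseline: no augmentation at all
print(configs[-1])    # every augmentation switched on

# Each configuration could then be run several times, for example:
# for config in configs:
#     for repeat in range(5):
#         confusion = run_experiment(**config)
```

Including the all-off configuration gives a baseline to compare against, and repeating each configuration is what makes a statistical comparison possible.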


### Class discussion
Does data augmentation provide a way of addressing:
- ethical concerns about under-representation of certain groups?
- safety concerns, for example with respect to autonomous vehicles?

