In [None]:
# Import all libraries
# Make sure to run this first (or anytime the session times out)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
import cv2
from PIL import Image
from pathlib import Path
import os

In [None]:
# Relative paths for both csv files
path_fer2013 = 'fer2013.csv'
path_phoebe = 'phoebe_AU.csv'

# Store the databases
data_fer = pd.read_csv(path_fer2013)
data_phoebe = pd.read_csv(path_phoebe)

# Part A - Support Vector Machine

## 1. fer2013.csv

In [None]:
# Get the training data

# Parse each space-separated pixel string and reshape it to 48x48
x_train = np.array([np.fromstring(image, dtype=int, sep=' ').reshape(48, 48) for image in data_fer[data_fer['Usage'] == 'Training']['pixels']])
y_train = np.array(data_fer[data_fer['Usage'] == 'Training']['emotion'])

In [None]:
# Train a SVM model
svm_fer = SVC()
svm_fer.fit(x_train.reshape(len(x_train), -1), y_train)

In [None]:
# Read and process private test data
x_privatetest = np.array([np.fromstring(image, dtype=int, sep=' ').reshape(48, 48) for image in data_fer[data_fer['Usage'] == 'PrivateTest']['pixels']])
y_privatetest = np.array(data_fer[data_fer['Usage'] == 'PrivateTest']['emotion'])

In [None]:
# Predict emotions using SVM
y_pred_a1 = svm_fer.predict(x_privatetest.reshape(len(x_privatetest), -1))

In [None]:
# Look at the classification report
print("Classification Report:")
print(classification_report(y_privatetest, y_pred_a1))

In [None]:
# Create confusion matrix
conf_matrix_a1 = confusion_matrix(y_privatetest, y_pred_a1)

# Plot it
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_a1, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()


As shown in the classification report above, group 1 has perfect precision (1.0), meaning it has no false positives: whenever the classifier predicts group 1, it is correct. Despite this, group 1 has the lowest f1-score (0.10) of all the groups because its recall is very low. The classifier rarely predicts group 1, which leads to many false negatives.
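The link between perfect precision and a low f1-score follows from the f1 formula, which is the harmonic mean of precision and recall. The sketch below is only an illustration: the helper name `f1_from` and the count of 3 correct predictions out of 55 are assumptions chosen to land near an f1-score of 0.10, not values read from the report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 55 true group 1 samples, only 3 predicted as group 1.
support = 55
true_positives = 3

precision = 1.0                     # no false positives, as in the report
recall = true_positives / support   # low because most group 1 samples are missed
f1 = f1_from(precision, recall)

print(f"recall={recall:.3f}, f1={f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a recall this low keeps the f1-score low no matter how high the precision is.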

Groups 3 and 5 have the highest f1-scores (0.58) indicating they are the best, by some margin, at balancing between precision and recall.

The overall accuracy of the classifier is 0.45, meaning it predicts the correct emotional label 45% of the time. The macro average treats all classes equally, while the weighted average takes the size of each group into account. Since the weighted average is greater than the macro average, either the larger groups have higher f1-scores or some small outlier groups have low f1-scores. Looking at the data, the outlier is group 1, which has by far the smallest f1-score (0.10) and size (55) of the 7 groups.
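The difference between the two averages can be sketched with a few lines of plain Python. The f1-scores and group sizes below are made-up values for three groups, and the function names are hypothetical; the point is only that one small, low-scoring group drags the macro average down more than the weighted average.

```python
def macro_average(scores):
    """Plain mean: every class counts equally."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean weighted by the number of samples in each class."""
    total = sum(supports)
    return sum(score * n for score, n in zip(scores, supports)) / total

# Hypothetical values: one small group with a low f1-score, two larger groups.
f1_scores = [0.1, 0.5, 0.6]
supports = [10, 100, 90]

macro = macro_average(f1_scores)
weighted = weighted_average(f1_scores, supports)

print(f"macro={macro:.3f}, weighted={weighted:.3f}")
```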

The confusion matrix can be seen above. Every group other than group 1 has its largest value in the correctly predicted cell. Groups 3, 4, 5, and 6 have the largest numbers of correctly predicted labels.
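Per-class recall and precision can be read straight off a confusion matrix laid out like the plot above (rows are true labels, columns are predicted labels). The sketch below uses a small made-up 3x3 matrix and hypothetical helper names; it is not the matrix produced by the SVM.

```python
def per_class_recall(matrix):
    """Diagonal entry divided by its row total (all samples of that true class)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

def per_class_precision(matrix):
    """Diagonal entry divided by its column total (all predictions of that class)."""
    size = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [matrix[c][c] / column_totals[c] if column_totals[c] else 0.0
            for c in range(size)]

# Hypothetical matrix: the middle class is rarely predicted, but never wrongly.
toy_matrix = [
    [8, 0, 2],
    [4, 1, 5],
    [1, 0, 9],
]

recall = per_class_recall(toy_matrix)
precision = per_class_precision(toy_matrix)

print("recall:", [round(value, 2) for value in recall])
print("precision:", [round(value, 2) for value in precision])
```

The middle class in this toy matrix shows the same pattern as group 1: precision of 1.0 alongside very low recall.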

In [None]:
# Paths for images
image_one = 'images/unknown/1_01.jpg'
image_two = 'images/unknown/4_01.jpg'
image_three = 'images/unknown/4_20.jpg'
image_four = 'images/unknown/8_01.jpg'
image_five = 'images/unknown/9_41.jpg'
image_six = 'images/unknown/26_123.jpg'
image_seven = 'images/unknown/35_42.jpg'
image_eight = 'images/unknown/41_06.jpg'
image_nine = 'images/unknown/44_01.jpg'
image_ten = 'images/unknown/46_03.jpg'
image_eleven = 'images/unknown/48_01.jpg'
image_twelve = 'images/unknown/52_31.jpg'

# Store paths in an array
image_paths = [image_one, image_two, image_three, image_four, image_five,
               image_six, image_seven, image_eight, image_nine, image_ten,
               image_eleven, image_twelve]

In [None]:
# Initialize lists to store processed images
processed_images = []

# Loop over each image path
for image_path in image_paths:
    # Read image, turn to gray, and resize
    image = cv2.imread(image_path)
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    resized_image = cv2.resize(gray_image, (48, 48))

    # Flatten image to create feature vector
    flattened_image = resized_image.flatten()

    # Move into array
    processed_images.append(flattened_image)

# Convert the list to a numpy array
X_phoebe_unknown = np.array(processed_images)

# Predict emotions for the Phoebe unknown dataset
y_phoebe_unknown_pred = svm_fer.predict(X_phoebe_unknown)

# Print the predictions
print("Predicted emotions for Phoebe unknown dataset:", y_phoebe_unknown_pred)


The model predicted only emotional label 3 or 4 for all 12 of the unknown images. Label 3 represents happy and label 4 represents sad. Looking at the images, it seems the model predicted happy for all images where Phoebe shows her teeth and sad when she does not (apart from the first two images, 1_01 and 4_01). Overall, I would say the model predicted 5 of the images correctly: 4_20 (sad), 8_01 (happy), 9_41 (happy), 41_06 (happy), and 46_03 (happy). This gives an accuracy of 5/12 = 0.417, which is similar to the accuracy and average scores from the model's classification report.

## 2. SVM Using Action Units

In [None]:
# Get rid of all the unknown labels
filtered_data = data_phoebe[data_phoebe['label'] != 'unknown']

# Separate the AU columns and label column
X_phoebe = filtered_data.drop(columns=['label', 'file_name'])
y_phoebe = filtered_data['label']

In [None]:
# Initialize the SVM classifier
svm_phoebe = SVC()

In [None]:
# 5-fold cross-validation
cv_scores = cross_val_score(svm_phoebe, X_phoebe, y_phoebe, cv=5)

In [None]:
# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)
print("Mean accuracy:", cv_scores.mean())

The cross-validation scores show the accuracy achieved by the model on each fold of the dataset. The highest accuracy came from the second fold and the lowest from the third fold. The mean accuracy across the 5 folds was 0.5, which is an estimate of how the model will perform on unseen data. So we can expect this model to predict the emotions of the unknown images of Phoebe correctly about 50% of the time.
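As a rough picture of what 5-fold cross-validation does, the sketch below splits sample indices into five folds and averages one accuracy per fold. It is a simplified illustration with hypothetical names and made-up fold accuracies; for a classifier, scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds, which this plain split does not reproduce.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mean_accuracy(fold_accuracies):
    """Average of the per-fold accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)

folds = k_fold_indices(n_samples=12, n_folds=5)
for number, test_indices in enumerate(folds, start=1):
    # Every fold takes one turn as the test set; the rest is used for training.
    train_indices = [i for fold in folds if fold is not test_indices for i in fold]
    print(f"fold {number}: train on {len(train_indices)}, test on {len(test_indices)}")

# Made-up per-fold accuracies, only to show how the mean is formed.
example_scores = [0.4, 0.7, 0.3, 0.5, 0.6]
example_mean = mean_accuracy(example_scores)
print("mean accuracy:", round(example_mean, 2))
```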

In [None]:
# Train the SVM model
svm_phoebe.fit(X_phoebe, y_phoebe)

In [None]:
# Predict emotions for all 12 unknown samples
unknown_samples = data_phoebe[data_phoebe['label'] == 'unknown']
X_unknown = unknown_samples.drop(columns=['label', 'file_name'])
y_unknown_pred = svm_phoebe.predict(X_unknown)

# Print predictions
print("Predicted emotions for unknown samples:")
print(y_unknown_pred)

As shown above, the model predicts the emotions of the 12 images as surprise, surprise, sad, happy, happy, sad, sad, happy, surprise, happy, surprise, and surprise. These images are in order of file_name (the order they are in the csv file). For example, the first emotion label is for image 1_01 and the second is for 4_01, and so on.

I would label the 12 unknown images (using only the labels used in the csv file) as: 1_01=sad, 4_01=angry, 4_20=sad, 8_01=happy, 9_41=happy, 26_123=angry, 35_42=surprise, 41_06=happy, 44_01=angry, 46_03=happy, 48_01=disgusted, and 52_31=surprise (these images shown in cell below).

Comparing the model's predictions with my own labels of the unknown images, the model got 6/12 (50%) correct. The model correctly predicted happy all four times, with perfect precision and recall (f1-score = 1). It also predicted sad and surprise correctly once each. The model never predicted anger or disgust. It predicted 50% of the images correctly, which matches the cross-validation estimate of accuracy on unseen data. However, it is important to remember the unknown dataset is very small.
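The 6/12 figure can be checked by lining up the predictions and my labels listed above in file order. This is just a small tally in plain Python; the variable names are hypothetical, and the labels are my own subjective reading of the images.

```python
from collections import Counter

file_names = ["1_01", "4_01", "4_20", "8_01", "9_41", "26_123",
              "35_42", "41_06", "44_01", "46_03", "48_01", "52_31"]

# Predictions from the AU-based SVM, in csv order.
predicted = ["surprise", "surprise", "sad", "happy", "happy", "sad",
             "sad", "happy", "surprise", "happy", "surprise", "surprise"]

# My own labels for the same images.
manual = ["sad", "angry", "sad", "happy", "happy", "angry",
          "surprise", "happy", "angry", "happy", "disgusted", "surprise"]

matches = [name for name, p, m in zip(file_names, predicted, manual) if p == m]
correct_by_emotion = Counter(m for p, m in zip(predicted, manual) if p == m)
accuracy = len(matches) / len(file_names)

print("Correct images:", matches)
print("Correct per emotion:", dict(correct_by_emotion))
print("Accuracy:", accuracy)
```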

In [None]:
for img_path in image_paths:
    print(img_path)
    img = mpimg.imread(img_path)
    imgplot = plt.imshow(img)
    plt.show()

# Part B - Neural Network

## 1. Neural Network

In [None]:
# Convert pixel values to numpy arrays and normalize
X = np.array([np.fromstring(image, dtype=int, sep=' ').reshape(48, 48, 1) for image in data_fer['pixels']])
X = X / 255.0

In [None]:
# Convert emotion labels to categorical
y = to_categorical(data_fer['emotion'])

In [None]:
# Split the data into training, public test and private test
X_train_b = X[data_fer['Usage'] == 'Training']
y_train_b = y[data_fer['Usage'] == 'Training']

X_public = X[data_fer['Usage'] == 'PublicTest']
y_public = y[data_fer['Usage'] == 'PublicTest']

X_private = X[data_fer['Usage'] == 'PrivateTest']
y_private = y[data_fer['Usage'] == 'PrivateTest']

In [None]:
# Create the nn model
def create_nn_model():
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(48, 48, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Conv2D(128, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(7, activation='softmax')
    ])
    return model

model_nn = create_nn_model()

In [None]:
# Compile the model
model_nn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
# Data augmentation to help with overfitting
over = ImageDataGenerator(rotation_range=10, width_shift_range=0.1, height_shift_range=0.1, zoom_range=0.1, horizontal_flip=True)
over.fit(X_train_b)

In [None]:
# Testing different batch size and epoch values
# Batch size = 32 and epochs = 6
# Takes about 15 mins to run on my machine
history_nn1 = model_nn.fit(over.flow(X_train_b, y_train_b, batch_size=32), epochs=6, validation_data=(X_public, y_public))

In [None]:
model_nn = create_nn_model()
model_nn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Batch size = 64 and epochs = 5
# Takes about 15 mins to run on my machine
history_nn2 = model_nn.fit(over.flow(X_train_b, y_train_b, batch_size=64), epochs=5, validation_data=(X_public, y_public))

In [None]:
model_nn = create_nn_model()
model_nn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Batch size = 10 and epoch = 8
# Takes about 15 mins to run on my machine
history_nn3 = model_nn.fit(over.flow(X_train_b, y_train_b, batch_size=10), epochs=8, validation_data=(X_public, y_public))


After testing the 3 combinations of batch size and epochs numerous times, batch size = 10 and epochs = 8 consistently performed the best. I will stick with these parameters for the next parts of the assignment.
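One way to compare the three settings is to count how many weight updates each one performs, since every batch triggers one update. The sketch below assumes the Fer2013 training split has 28,709 images (an assumption, not a number printed in this notebook) and uses hypothetical names; it is a back-of-the-envelope count, not a measure of model quality.

```python
import math

N_TRAIN = 28709  # assumed size of the Fer2013 'Training' split

def total_updates(n_samples, batch_size, epochs):
    """Batches per epoch (rounded up) times the number of epochs."""
    return math.ceil(n_samples / batch_size) * epochs

settings = [(32, 6), (64, 5), (10, 8)]  # (batch_size, epochs) as tried above
updates = {setting: total_updates(N_TRAIN, *setting) for setting in settings}

for (batch_size, epochs), count in updates.items():
    print(f"batch_size={batch_size}, epochs={epochs}: {count} updates")
```

Under that assumption the batch size of 10 with 8 epochs performs several times more updates than the other two settings, which may be part of why it scores better.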

In [None]:
# Evaluate the model and calculate performance
y_pred = np.argmax(model_nn.predict(X_private), axis=1)
y_true = np.argmax(y_private, axis=1)

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)

In [None]:
# Print performance metrics
print("Accuracy:", accuracy)
print("Classification Report:\n", report)

In [None]:
conf_matrix = confusion_matrix(y_true, y_pred)

# Plotting the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

The classification report and confusion matrix are shown above. These values will be different on every run because training the model involves randomness. The discussion below reflects the results from the last time I ran this model (results from other runs are usually very similar).

Looking at the classification report, group 3 performed the best with precision, recall, and f1-score of 0.74, 0.81, and 0.77 respectively. Group 5 also performed much better than the other groups with an f1-score of 0.66.

One thing that stands out is group 1 having an f1-score of 0.00. Looking at the confusion matrix, we can see the model never predicted group 1. This is similar to Part A.1, where group 1 is not being predicted very much relative to the other 6 groups. Also similar to Part A.1, group 3 still has by far the most true positives.

The accuracy of the model is 0.54 with the macro average being 0.43 and weighted average being 0.52. This discrepancy between macro and weighted makes sense since group 1 is really small and has a very low score, while a lot of the bigger groups have higher scores. This trend is consistently true every time I run this model.

## 2. Test

In [None]:
# Function to preprocess the image
def preprocess_image(image_path):
    image = cv2.imread(image_path)
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    resized_image = cv2.resize(gray_image, (48, 48))
    normalized_image = resized_image / 255.0
    preprocessed_image = normalized_image.reshape(1, 48, 48, 1)
    return preprocessed_image

# Classify image using the model
def classify_image(image_path, model):
    preprocessed_image = preprocess_image(image_path)
    prediction = model.predict(preprocessed_image)
    predicted_class = np.argmax(prediction)
    return predicted_class

In [None]:
# Predict each unknown image with neural network model
print("0=angry, 1=disgust, 2=fear, 3=happy, 4=sad, 5=surprise, 6=neutral")

for img in image_paths:
  predicted_class = classify_image(img, model_nn)
  print(f"Predicted class for {img}: {predicted_class}")


The predicted emotion for each image is printed above. The results of these predictions will be slightly different every time the neural network model is run. However, most of the time there are not too many big fluctuations in the results.

When writing this, the model had an accuracy of about 54%, and based on the results it was able to predict the unknown Phoebe images at a similar rate. It correctly predicted 6/12 images (4_20, 8_01, 9_41, 35_42, 41_06, and 46_03). It also predicted neutral for images 1_01 and 52_31. Even though the Phoebe dataset does not have neutral in it, I could argue both of those images could be interpreted as neutral.

Comparing with the SVM model from A.2, both models were able to correctly predict the labels at about the same rate. Both models were very good at predicting happy and sad, but did not predict anger or disgust much. Overall, both models had similar accuracy scores, and their predictions on the images reflect that. However, just based on looking at the images, I would prefer the Neural Network model over the SVM.

## 3. Fine-Tune the Neural Network and Re-Classify

In [None]:
# Store the paths of every image into an array

# Directory we are starting from
base_dir = Path("images")

image_paths2 = []

# Iterate through all subdirectories in the base directory
for emotion_dir in base_dir.glob("*"):
    # Check if the subdirectory is a directory and do not include unknown images
    if emotion_dir.is_dir() and emotion_dir.name != "unknown":
        # Iterate through all image files in the subdirectory
        for image_file in emotion_dir.glob("*.jpg"): 
            # Add the relative path of the image file to the list
            relative_path = "images/" + emotion_dir.name + "/" + image_file.name
            # Add the path to our paths array
            image_paths2.append(relative_path)

In [None]:
# Create array to store emotional labels in order
labels_known = []

for image_path in image_paths2:
    # Extract label from image path
    label = os.path.basename(os.path.dirname(image_path))
    # Append label to array
    labels_known.append(label)

In [None]:
# Function to preprocess each image
def preprocess_image2(image_path, target_size=(48, 48)):
    image = Image.open(image_path)
    image = image.resize(target_size)
    # Convert the image to grayscale
    image = image.convert('L')
    image_array = np.array(image).reshape((*target_size, 1))
    image_array = image_array / 255.0
    return image_array

# Preprocess every image with an emotional label
X_ft = np.array([preprocess_image2(image_path) for image_path in image_paths2])

In [None]:
# Turn the emotions into number so we can categorize them
# 0=angry, 1=disgust, 2=fear, 3=happy, 4=sad, 5=surprise, 6=neutral
y_ft = []

for i in labels_known:
  if i == 'angry':
    y_ft.append('0')
  elif i == 'disgust':
    y_ft.append('1')
  elif i == 'fear':
    y_ft.append('2')
  elif i == 'happy':
    y_ft.append('3')
  elif i == 'sad':
    y_ft.append('4')
  elif i == 'surprise':
    y_ft.append('5')
  elif i == 'neutral':
    y_ft.append('6')


In [None]:
y_phoebe_ft = to_categorical(y_ft, num_classes=7)
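For readers unfamiliar with one-hot encoding, the sketch below shows in plain Python the kind of vector that a call like `to_categorical(..., num_classes=7)` produces for a single label. The function name `one_hot` is hypothetical and this is not the Keras implementation. Keeping 7 classes, even though the Phoebe folders cover fewer emotions, keeps the labels compatible with the 7-unit softmax layer of the network.

```python
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at the given class index."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

happy_vector = one_hot(EMOTIONS.index("happy"), num_classes=len(EMOTIONS))
print(happy_vector)
```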

In [None]:
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_ft, y_phoebe_ft, test_size=0.2, random_state=42)

# Freeze layers
for layer in model_nn.layers[:4]:
    layer.trainable = False

# Compile with different learning rate
model_nn.compile(optimizer=Adam(learning_rate=0.002), loss='categorical_crossentropy', metrics=['accuracy'])

# Increase the epochs since it will run much faster (20 is more than enough since it converges quickly)
history = model_nn.fit(X_train, y_train, batch_size=10, epochs=20, validation_data=(X_val, y_val))

In [None]:
# Evaluate the model
y_pred2 = np.argmax(model_nn.predict(X_val), axis=1)
y_true2 = np.argmax(y_val, axis=1)

In [None]:
# Calculate performance metrics
accuracy2 = accuracy_score(y_true2, y_pred2)
report2 = classification_report(y_true2, y_pred2)

print("Accuracy:", accuracy2)
print("Classification Report:\n", report2)

In [None]:
conf_matrix2 = confusion_matrix(y_true2, y_pred2)

# Plotting the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix2, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

The classification report and confusion matrix are shown above. These values will be different every single time due to the randomness or variation of the model that is being run.

Looking at the classification report, the results from the fine-tuned neural network are much better compared to our neural network in Part B.1. The accuracy consistently increases by about 20%. Also, the f1-scores for each group are much better. Groups 3 and 5 are scoring very high (over 0.85).

A lot of the trends seen in the previous neural network are present in the fine-tuned one as well. As discussed previously, group 1 still has a low f1-score, while groups 3 and 5 have the best scores. However, the macro and weighted averages are very similar to each other compared to the other models.

The sample size is very small: the entire dataset is only 87 images and the held-out validation set is only 18 images. This is much smaller than the data used for the model in Part B.1.
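The 18-image figure follows from the 20% split of 87 images. The sketch below assumes the held-out count is rounded up, which is my understanding of how scikit-learn's `train_test_split` treats a fractional `test_size`; the variable names are hypothetical.

```python
import math

n_images = 87
test_fraction = 0.2

# Assumed rounding rule: the held-out count is rounded up, the rest is training data.
n_validation = math.ceil(test_fraction * n_images)
n_training = n_images - n_validation

print(f"training images: {n_training}, validation images: {n_validation}")
```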

In [None]:
print("0=angry, 1=disgust, 2=fear, 3=happy, 4=sad, 5=surprise, 6=neutral")

for img in image_paths:
  predicted_class = classify_image(img, model_nn)
  print(f"Predicted class for {img}: {predicted_class}")

The predicted emotion for each image is printed above. The results of these predictions will be slightly different every time the neural network model is run. However, most of the time there are not too many big fluctuations in the results.

The fine-tuned model's predictions are usually much better than the predictions of the model in Part B.2. This is due to the fine-tuned model improving its accuracy by about 20%. So, a lot of the predictions are accurate for the fine-tuned model. This model is able to predict about 8-9 of the 12 images correctly. It is really good at predicting happy and sad but struggles with disgust. It typically does not even attempt to predict disgust, which makes 48_01 the image it gets wrong most often. It also sometimes mistakes anger for sadness, as with images 4_01 and 26_123.

The fine-tuned model usually has an accuracy of about 70% and hits at about that rate when making the predictions on the unknown dataset. It is important to remember that the sample size of the unknown dataset is very small so results can fluctuate by a lot. There are only 12 unknown images, so even if the model predicts just one more image correctly or incorrectly, this would alter the prediction accuracy for the unknown Phoebe images by over 8% (1/12).
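The arithmetic behind that 1/12 step is short enough to spell out. The snippet below only lists what 8 and 9 correct predictions translate to; the variable names are hypothetical.

```python
n_images = 12
step = 1 / n_images  # change in accuracy caused by a single image

accuracies = {correct: correct / n_images for correct in (8, 9)}

print(f"one image = {step:.3f} accuracy")
for correct, accuracy in accuracies.items():
    print(f"{correct}/{n_images} correct -> accuracy {accuracy:.3f}")
```

So the "about 70%" figure sits between the only two nearby values the unknown set can actually produce, 8/12 and 9/12.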

# Part C - Comparison Between Methods

The accuracies of the four models were: SVM-Fer2013 = 0.45, SVM-OpenFace = 0.5, NN-Fer2013 = ~0.5, and NN-FineTuned = ~0.7. The NN-FineTuned model has a much greater accuracy than the other 3 models. This can be seen when predicting the unknown Phoebe image dataset, where it consistently predicts about 8-9 images correctly while the other 3 models predict only about 6 images correctly. All the models were very good at predicting happy and sad emotions, but other than NN-FineTuned they struggled with predicting other emotions such as anger and disgust. This is partly reflected in the f1-scores, where group 3 (happy) had the highest or joint-highest score in every classification report.

The SVM-OpenFace model only predicted three emotions (happy, sad, and surprise), while the NN-FineTuned model predicted every emotion other than disgust at least once (excluding fear and neutral since they are not emotions in the Phoebe dataset). Since the SVM-OpenFace model is only predicting a few emotions, we cannot expect it to be very accurate, especially compared to the NN-FineTuned model. The SVM-OpenFace model was very good at predicting happiness, as 4 out of its 6 correct predictions are happiness. The NN-FineTuned model is able to correctly predict more emotions consistently, such as happy, sad, and surprise. The NN-FineTuned model being trained with the Fer2013 data and then fine-tuned with the Phoebe dataset gives it a big advantage and explains why it is much more accurate than a model like SVM-OpenFace, which has only been trained on the Phoebe images (fewer than 100 samples).

The NN-FineTuned model worked the best out of the four models. NN-FineTuned is an upgraded version of the NN-Fer2013 model, since it was trained with the Phoebe images as well, so we expect it to perform better and it does. Compared to the SVM-OpenFace model, NN-FineTuned consistently had about 20% greater accuracy and was able to predict images much better. The SVM-OpenFace model predicted about half of the unknown images correctly, while the NN-FineTuned model predicted about 70% of them correctly. The NN-FineTuned model was also much better than the SVM-Fer2013 model at predicting. It was consistently able to predict about 3-4 more images correctly than the SVM-Fer2013 model. The NN-FineTuned model performed much better than the other 3 models and is the one I would select for this dataset.

A big limitation in this assignment was Fer2013 having 7 different emotional labels, while the Phoebe dataset only had 5. This made creating the NN-FineTuned model much more difficult because I had to account for 2 extra emotions that were not available in the Phoebe dataset. This can lead to the model being biased towards certain more frequent emotions and performing poorly on less represented ones. If Fer2013 had the same emotional labels as the Phoebe dataset, we could have potentially seen a more accurate model. Also, since we are going from one dataset to another, features learned from one dataset may not generalize well to another dataset with different characteristics. This is a potential issue other "emotional recognition" systems face in the real world, since emotions are very subjective and you do not know exactly how a given dataset groups or interprets certain emotions.
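One possible way to soften the label mismatch, which was not done in this assignment, is to ignore the two Fer2013-only classes when turning the network's 7 softmax outputs into a prediction. The sketch below is a plain-Python illustration with made-up probabilities and hypothetical function names; it does not change how the model is trained.

```python
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]
PHOEBE_EMOTIONS = {"angry", "disgust", "happy", "sad", "surprise"}

def predict_unrestricted(probabilities):
    """Ordinary argmax over all 7 classes."""
    best = max(range(len(EMOTIONS)), key=lambda i: probabilities[i])
    return EMOTIONS[best]

def predict_restricted(probabilities):
    """Argmax over only the classes that exist in the Phoebe dataset."""
    allowed = [i for i, name in enumerate(EMOTIONS) if name in PHOEBE_EMOTIONS]
    best = max(allowed, key=lambda i: probabilities[i])
    return EMOTIONS[best]

# Made-up softmax output where 'neutral' narrowly wins.
softmax_output = [0.05, 0.02, 0.10, 0.20, 0.25, 0.08, 0.30]

unrestricted = predict_unrestricted(softmax_output)
restricted = predict_restricted(softmax_output)
print(unrestricted, "->", restricted)
```

With this restriction an image can no longer be labelled neutral or fear, so every prediction can at least be compared against one of the 5 Phoebe labels.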

Another limitation is that the Phoebe dataset is much smaller than the Fer2013 data. When training and evaluating models with the Fer2013 dataset, we know there is a large amount of data, so we do not expect the results to fluctuate by a lot and are more confident in the accuracy scores. The Phoebe dataset, by comparison, has fewer than 100 images and only 12 images for predicting. This can potentially lead to high variance in performance across different subsets of data.
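A rough way to quantify this is the standard error of an accuracy estimate, sqrt(p(1 - p) / n), which treats each prediction as an independent success or failure. That independence, the accuracy of 0.5 used for both cases, and the test-set size of 3,589 for Fer2013 are assumptions made for illustration; the function name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples predictions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)

se_unknown = accuracy_standard_error(0.5, 12)    # the 12 unknown Phoebe images
se_fer = accuracy_standard_error(0.5, 3589)      # assumed Fer2013 test-set size

print(f"12 images: +/- {se_unknown:.3f}")
print(f"3589 images: +/- {se_fer:.3f}")
```

Under these assumptions the uncertainty on the 12-image accuracy is more than ten times larger than on the Fer2013 test set.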

One more limitation in this assignment is that I have to compare the predicted emotions with how I interpreted Phoebe's emotions from the unknown dataset. This can be very subjective; for example, in images 44_01 and 52_31 it was difficult to tell what her emotion was. This can lead to the model's accuracy being different depending on how the emotions are interpreted by someone. Also, you can never tell for sure what emotions someone is feeling just from one picture, which is the biggest challenge "emotional recognition" models face.