# Affective Computing - Programming Assignment 3

### Objective

Your task is to use the **feature-level** method to combine facial expression features and audio features. A **multi-modal emotion recognition system** is constructed to distinguish happy from sad facial expressions (a binary classification problem) using a classifier training and testing structure.

The data comes from lab1 and lab2 and contains recordings of ten actors acting happy and sad behaviors.
* Task 1: **Subspace-based feature fusion** method: In this case, z-score normalization is utilized. Please read “Fusing Gabor and LBP feature sets for kernel-based face recognition” and learn how to use subspace-based feature fusion method for multi-modal system.

* Task 2: Based on Task 1, use **Canonical Correlation Analysis (CCA)** to calculate the correlation coefficients of facial expression and audio features. Finally, use CCA to build a multi-modal emotion recognition system. The method is described in the conference paper “Feature fusion method based on canonical correlation analysis and handwritten character recognition”.
* Task 3: Based on Task 1, create a **Leave-One-Subject-Out (LOSO) cross-validation** to estimate the performance more reliably.

To build the emotion recognition system, Support Vector Machine (SVM) classifiers are trained. 50 videos from 5 participants are used to train the emotion recognition systems using spatiotemporal features. The rest of the data (50 videos) is used to evaluate the performance of the trained recognition systems.

## Task 1. Subspace-based method
Please read “Fusing Gabor and LBP feature sets for kernel-based face recognition” and apply their framework in this exercise. We use a Support Vector Machine (SVM) with a linear kernel for classification. Instead of Gabor features, we use the prosodic features from the previous exercise.


### Setting up the environment 

First, we need to import the basic modules for loading and processing the data.

In [1]:
import sys
sys.path.append('../')
from skimage import io
from skimage import transform
from skimage import color
from skimage import img_as_ubyte
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import sklearn
import scipy.io as sio

### Loading data 

We load the facial expression data (training data, training class, testing data, testing class) and audio data (training data, testing data)

In [2]:
mdata = sio.loadmat('lab3_data.mat')
print([name for name in mdata])
#Facial expression training and testing data, training and testing class
# print(mdata)
training_data = mdata["training_data"]
testing_data = mdata["testing_data"]
training_class = mdata["training_class"]
testing_class = mdata["testing_class"]

#Audio training and testing data
training_data_proso = mdata["training_data_proso"]
testing_data_proso = mdata["testing_data_proso"]

print(training_data.shape)
print(testing_data.shape)
print(training_data_proso.shape)
print(testing_data_proso.shape)
print(mdata['training_personID'].ravel())
print(mdata['testing_personID'].ravel())


['__header__', '__version__', '__globals__', 'speech_sample', 'testing_class', 'testing_data_mfcc', 'testing_data_proso', 'testing_personID', 'training_class', 'training_data_mfcc', 'training_data_proso', 'training_personID', 'training_data', 'testing_data']
(50, 708)
(50, 708)
(50, 15)
(50, 15)
[1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
 4 4 4 5 5 5 5 5 5 5 5 5 5]
[10 10 10 10 10 10 10 10 10 10 12 12 12 12 12 12 12 12 12 12  7  7  7  7
  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9  9  9
  9  9]


### Extract the subspace for facial expression features and audio features
Extract the subspace for facial expression features and audio features with principal component analysis, using the __[`sklearn.decomposition.PCA()`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)__ function.
`reduced_dim` is the dimensionality of the reduced subspace.
Set `reduced_dim` to 20 and 15 for facial expression features and audio features, respectively. Apply z-score normalization across samples, that is, with one mean and one standard deviation per feature. The test data should be normalized with the values computed from the training data.
To concatenate the features, use the __[`np.concatenate()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html)__ function. Set the random state to 0. For data of this size, PCA may use a randomized truncated SVD, meaning that the results can vary depending on the seed.

In [6]:
from sklearn.decomposition import PCA 
from scipy import stats

#Set Reduced_dim for facial expression features and audio features, respectively.
reduced_dim_v = 20
reduced_dim_a = 15

print(training_data.shape)

#Extract the subspace for facial expression features through PCA
pca_v = PCA(reduced_dim_v, random_state=0) #Random state ensures we get same results on different runs
pca_v.fit(training_data)

#Transform training_data and testing data respectively
train_trans_v = pca_v.transform(training_data)
print(train_trans_v.shape)
test_trans_v = pca_v.transform(testing_data)


#Extract the subspace for audio features through PCA
pca_a = PCA(reduced_dim_a, random_state=0) #Random state ensures we get same results on different runs
pca_a.fit(training_data_proso)

#Transform the training_data and testing_data respectively
train_trans_a = pca_a.transform(training_data_proso)
test_trans_a = pca_a.transform(testing_data_proso)

#Normalize the features
std_v = np.std(train_trans_v,0)
std_a = np.std(train_trans_a,0)

mean_v = np.mean(train_trans_v,0)
mean_a = np.mean(train_trans_a,0)



train_norm_v = (train_trans_v - mean_v)/std_v
test_norm_v = (test_trans_v - mean_v)/std_v
train_norm_a = (train_trans_a - mean_a)/std_a
test_norm_a = (test_trans_a - mean_a)/std_a




#Concatenate the transformed training data of facial expression features and audio features together
combined_train = np.concatenate((train_norm_v, train_norm_a),1)



#Concatenate the transformed testing data of facial expression features and audio features together
combined_test = np.concatenate((test_norm_v, test_norm_a),1)



(50, 708)
(50, 20)
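The rule that test data is normalized with training statistics can be checked on toy data. The following is a minimal sketch, assuming synthetic arrays and a hypothetical helper `zscore_with_train_stats` that is not part of the assignment template:

```python
import numpy as np

def zscore_with_train_stats(train, test):
    """Standardize each feature (column) with the mean/std of the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Synthetic stand-ins for PCA-projected features (assumption, not the lab data)
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(50, 4))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(50, 4))

toy_train_norm, toy_test_norm = zscore_with_train_stats(toy_train, toy_test)

# Training features are exactly zero-mean and unit-variance per column;
# test features are only approximately so, because they reuse training statistics.
print(toy_train_norm.mean(axis=0), toy_train_norm.std(axis=0))
print(toy_test_norm.mean(axis=0), toy_test_norm.std(axis=0))
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.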


### Question 1. Why is PCA used? Why not just concatenate the extracted features without PCA?

### Your answer:

With more dimensions, exponentially more data is needed to provide reliable results. In our case, the sample size is far smaller than the number of features (50 samples against 708 facial expression features), so the number of features needs to decrease substantially before a classifier can be trained reliably. With PCA, the information in the original features is condensed into fewer, more informative features, reducing the number of samples needed.
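One way to see how little independent information 50 samples can carry in 708 dimensions is to count the principal components with non-zero variance. This is a small sketch on random data of the same shape as the video training set; the random values and variable names are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data with the same shape as the video training set (not the lab data)
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 708))

# Keep all components; svd_solver='full' makes the result deterministic
pca_full = PCA(svd_solver='full').fit(toy)

# After centering, 50 samples span at most 49 directions,
# so only that many components can carry variance.
n_informative = int(np.sum(pca_full.explained_variance_ > 1e-10))
print(pca_full.n_components_, n_informative)
```

However many features are measured, the centered training set cannot support more than `n_samples - 1` meaningful directions, which is why a low-dimensional subspace loses less than the raw feature count suggests.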

### Feature classification
Use the __[`SVM`](http://scikit-learn.org/stable/modules/svm.html)__ function to train Support Vector Machine (SVM) classifiers.
Construct an SVM using the combined training data and a linear kernel. The `training_class` group vector contains the class of samples: 1 = happy, 2 = sadness, corresponding to the rows of the training data matrices.

Then, calculate average classification performances for both training and testing data. The correct class labels corresponding with the rows of the training and testing data matrices are in the variables ‘training_class’ and ‘testing_class’, respectively.

In [57]:
from sklearn import svm

# Train SVM classifier
clf = svm.SVC(kernel='linear')
clf.fit(combined_train, training_class.ravel())
#The prediction results
prediction_train = clf.predict(combined_train)
prediction = clf.predict(combined_test)

#Calculate and print the training accuracy and testing accuracy. 
correct_num_train = 0
for pred, label in zip(prediction_train, training_class.ravel()):
    if pred == label:
        correct_num_train += 1
accuracy_train = correct_num_train/len(prediction_train)
print("training data accuracy: {}".format(accuracy_train))
# print(prediction_train)
# print(training_class.ravel())


correct_num_test = 0
for pred, label in zip(prediction, testing_class.ravel()):
    if pred == label:
        correct_num_test += 1
accuracy_test = correct_num_test/len(prediction)
print("testing data accuracy: {}".format(accuracy_test))
# print(prediction)
# print(testing_class.ravel())

training data accuracy: 1.0
testing data accuracy: 0.98
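The counting loop used above should give the same number as a vectorized comparison or `sklearn.metrics.accuracy_score`. A minimal sketch with made-up labels (the arrays are illustrative, not the lab predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration
toy_true = np.array([1, 1, 2, 2, 2])
toy_pred = np.array([1, 2, 2, 2, 2])

# 1) explicit counting loop, as in the cell above
loop_acc = sum(int(p == t) for p, t in zip(toy_pred, toy_true)) / len(toy_pred)
# 2) vectorized comparison
vec_acc = np.mean(toy_pred == toy_true)
# 3) scikit-learn helper
sk_acc = accuracy_score(toy_true, toy_pred)

print(loop_acc, vec_acc, sk_acc)
```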


Compute the confusion matrices using the __[`sklearn.metrics.confusion_matrix()`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)__ function for both the training data and testing data.

In [58]:
from sklearn.metrics import confusion_matrix
train_cm = confusion_matrix(training_class, prediction_train)
print("training confusion matrix:\n {}".format(train_cm))

test_cm = confusion_matrix(testing_class, prediction)
print("testing confusion matrix:\n {}".format(test_cm))


training confusion matrix:
 [[25  0]
 [ 0 25]]
testing confusion matrix:
 [[25  0]
 [ 1 24]]
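The testing confusion matrix can be read row by row: in scikit-learn, rows are true classes and columns are predicted classes, ordered by label (here 1 = happy, 2 = sadness). A short sketch that recovers the accuracy and per-class recall from the matrix printed above; the variable names are illustrative:

```python
import numpy as np

# Testing confusion matrix copied from the output above
test_cm_values = np.array([[25, 0],
                           [1, 24]])

# Correct predictions sit on the diagonal
accuracy_from_cm = np.trace(test_cm_values) / test_cm_values.sum()

# Recall per true class: diagonal divided by row sums
recall_per_class = np.diag(test_cm_values) / test_cm_values.sum(axis=1)

print(accuracy_from_cm)   # matches the reported testing accuracy of 0.98
print(recall_per_class)   # happy: 1.0, sadness: 0.96
```

The single error is a sadness sample predicted as happy.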


## Task 2. Canonical Correlation Analysis (CCA)
As opposed to a simple concatenation we can try something smarter that utilizes the common characteristics of the fused features. This is achieved using the CCA. Use the PCA transformed vectors and set the number of components for the CCA to be 15.


Use the __[`sklearn.cross_decomposition.CCA()`](http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html)__ function to calculate the correlation coefficients of facial expression features and audio features. For `n_components` of CCA, use the same number as the reduced dimensionality of the audio features in the previous task.

In [59]:
from sklearn.cross_decomposition import CCA
import numpy as np

#Use CCA to construct the Canonical Projective Vector (CPV)
cca = CCA(15)
cca.fit(train_norm_v, train_norm_a)

#Construct Canonical Correlation Discriminant Features (CCDF) for both the training data and testing data
train_scores_v, train_scores_a = cca.transform(train_norm_v, train_norm_a)
test_scores_v, test_scores_a = cca.transform(test_norm_v, test_norm_a)



# Concatenate the CCA transformed features for training data and testing data
cca_combined_train = np.concatenate((train_scores_v, train_scores_a), 1)
cca_combined_test = np.concatenate((test_scores_v, test_scores_a), 1)

print(cca_combined_test.shape)

(50, 30)


In [60]:
print(np.array(train_norm_v).shape)

(50, 20)


Train an SVM classifier using a linear kernel, print the training and testing accuracy and compute the confusion matrix.

In [61]:
#Train svm classifier 
cca_clf = svm.SVC(kernel='linear')
cca_clf.fit(cca_combined_train, training_class.ravel())  

#The prediction results
cca_prediction_train = cca_clf.predict(cca_combined_train)
cca_prediction = cca_clf.predict(cca_combined_test)

#Calculate and print the training accuracy and testing accuracy. 
cca_correct_num_train = 0
for train, pred in zip(cca_prediction_train, training_class.ravel()):
    if train == pred:
        cca_correct_num_train += 1
cca_accuracy_train = cca_correct_num_train/len(cca_prediction_train)
print("cca training data accuracy: {}".format(cca_accuracy_train))

cca_correct_num_test = 0
for train, pred in zip(cca_prediction, testing_class.ravel()):
    if train == pred:
        cca_correct_num_test += 1
cca_accuracy_test = cca_correct_num_test/len(cca_prediction)
print("cca test data accuracy: {}".format(cca_accuracy_test))


# Compute the confusion matrix using sklearn.metrics.confusion_matrix() function for training data and testing data respectively
cca_train_cm = confusion_matrix(training_class, cca_prediction_train)
print("cca training confusion matrix:\n {}".format(cca_train_cm))

cca_test_cm = confusion_matrix(testing_class, cca_prediction)
print("cca testing confusion matrix:\n {}".format(cca_test_cm))


cca training data accuracy: 1.0
cca test data accuracy: 0.92
cca training confusion matrix:
 [[25  0]
 [ 0 25]]
cca testing confusion matrix:
 [[25  0]
 [ 4 21]]


### Question 2. In this exercise a feature-level method was used to fuse the features. What are the other types of methods for data fusion?

### Your answer:

Sensor-level fusion is another pre-classification data fusion method, in addition to feature-level fusion. Match-score-level and decision-level fusion are post-classification methods.
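As an illustration of the post-classification idea, a minimal sketch of score-level fusion on synthetic data might look like the following. The data, the variable names, and the choice of summing the SVM decision values are assumptions made for illustration and are not part of the assignment:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data for two modalities (assumption, not the lab data)
rng = np.random.default_rng(0)
n = 60
y = np.repeat([1, 2], n // 2)
X_v = rng.normal(size=(n, 5)) + (y == 2)[:, None] * 1.5   # "video" view
X_a = rng.normal(size=(n, 3)) + (y == 2)[:, None] * 1.0   # "audio" view

# Alternate samples between training and testing
train = np.arange(n) % 2 == 0
test = ~train

# One classifier per modality, trained independently
clf_v = SVC(kernel='linear').fit(X_v[train], y[train])
clf_a = SVC(kernel='linear').fit(X_a[train], y[train])

# Fuse after classification: sum the signed distances to each hyperplane
fused_score = clf_v.decision_function(X_v[test]) + clf_a.decision_function(X_a[test])

# For a binary SVC, a positive decision value corresponds to classes_[1]
fused_pred = np.where(fused_score > 0, clf_v.classes_[1], clf_v.classes_[0])
fused_acc = np.mean(fused_pred == y[test])
print(fused_acc)
```

In contrast to feature-level fusion, each modality keeps its own classifier here and only their outputs are combined.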

### Question 3. Compare the results from all the different methods from assignments 1, 2 and 3. Which method performed the best? Which was the worst? Hypothesize as to why certain methods performed better than others.

### Your answer:

<p>
    Assignment 1: 0.72 accuracy
    <br>
    Assignment 2: prosodic: 0.62, mfcc: 0.84
    <br>
    Assignment 3: Subspace-based feature fusion: 0.98, Subspace-based feature fusion + cca: 0.92
</p>
<p>
    Out of these methods, the subspace-based feature fusion of the 3rd assignment performed the best, with an accuracy of 0.98. The worst was the prosodic features of the 2nd assignment, with 0.62.
    <br>
    The sample size is quite small, which might introduce bias into both the training and the evaluation of each method. Feature fusion methods performed better overall, which I think makes sense. Adding different kinds of features is a good idea, as long as the dimensionality is handled.
    
</p>




## Task 3. Leave-One-Subject-Out (LOSO) cross-validation
For a more reliable evaluation, often the **Leave-One-Subject-Out (LOSO) cross-validation** is used instead of the common train-test split. Cross-validation gives us a more reliable measure of the performance as all of the data is used for both training and testing. LOSO is used as emotions are highly dependent on the subject. By using LOSO, we guarantee that a subject is always in either the training or testing data and not in both.

* Join the training/testing data matrices and the class vectors. Combine also the ‘training_personID’ and ‘testing_personID’ vectors.

* Assume we have a total of $n$ subjects. Now, we will create a total of $n$ folds (loops), where each fold's training set contains the data from $n-1$ subjects and the testing set consists of only $1$ subject.

* Follow the steps taken in the first task: project the data to a subspace using PCA, concatenate the audio and video features together, train an SVM and finally evaluate the performance.

* The solution should be able to generalize over different numbers of subjects and samples, *e.g.*, a dataset may have 24 subjects, where subject1 has 4 samples and subject2 has 32 samples.
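The fold construction described in the list above can also be produced with scikit-learn's `LeaveOneGroupOut`, which handles subjects with different numbers of samples. This is a sketch on a toy subject vector (all `toy_*` names and values are made up), offered as an alternative to the boolean indexing used in this notebook rather than as the required solution:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy dataset: three subjects with 4, 2 and 3 samples (made-up values)
toy_subjects = np.array([1, 1, 1, 1, 2, 2, 3, 3, 3])
toy_X = np.arange(len(toy_subjects) * 2).reshape(-1, 2)
toy_y = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1])

logo = LeaveOneGroupOut()
folds = []
for train_rows, test_rows in logo.split(toy_X, toy_y, groups=toy_subjects):
    # Each fold holds out every sample of exactly one subject
    held_out = np.unique(toy_subjects[test_rows])
    folds.append((held_out.tolist(), len(train_rows), len(test_rows)))
    print(held_out, len(train_rows), len(test_rows))
```

Each fold's test set contains a single subject, and the training set contains all samples from the remaining subjects, regardless of how many samples each subject has.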

In [62]:
mdata = sio.loadmat('lab3_data.mat')
# print([name for name in mdata])
#Combine the training data, testing data, labels and person ID for video and audio respectively,
#in order to get the whole dataset. 
# print(type(np.array(mdata['testing_data'])))
video_data = np.concatenate((mdata['training_data'], mdata['testing_data']))
proso_data = np.concatenate((mdata['training_data_proso'], mdata['testing_data_proso']))

labels = np.concatenate((mdata['training_class'], mdata['testing_class'])).ravel()
subjects = np.concatenate((mdata['training_personID'], mdata['testing_personID'])).ravel()

#Get the number of the subject
subject_ids = np.unique(subjects)

#Print the shapes and the list of subject_ids for a sanity check
print(video_data.shape)
print(proso_data.shape)
print(labels.shape)
print(subjects.shape)
print(subject_ids)
# print(subjects.ravel())


(100, 708)
(100, 15)
(100,)
(100,)
[ 1  2  3  4  5  7  8  9 10 12]


In [63]:
# Shape of video_data: (100, 708)
# Shape of proso_data: (100, 15)
# Shape of labels: (100,)
# Shape of subjects: (100,)
# Value of subject_ids: [ 1  2  3  4  5  7  8  9 10 12]

In [69]:
accuracies = []
#Loop over each subject
for subject_id in subject_ids:
    #Create a boolean array for the training and testing set indices
    #The train_idx should be a list of form [True, True, False, ...], where True indicates the position
    #for the samples that are not the current subject_id
    train_idx = subjects != subject_id

    
    #Similar for the test_idx, True indicates the position of the current subject_id
    test_idx = subjects == subject_id

    
    #Create the training and testing sets for video, proso and labels by indexing video_data, proso_data and labels
    #with the boolean arrays train_idx and test_idx

    loso_training_v = video_data[train_idx,:]
    loso_training_a = proso_data[train_idx,:]
    loso_training_label = labels[train_idx]
    
    
    loso_testing_v = video_data[test_idx,:]
    loso_testing_a = proso_data[test_idx,:]
    loso_testing_label = labels[test_idx]
    
    #Create the PCA for both lbp and proso. We take a slight shortcut compared to task 1,
    #by using the whiten=True parameter for normalizing the features. This means that
    #there is no need for normalization afterwards
    pca_v = PCA(n_components=20, whiten=True, random_state=0)
    pca_a = PCA(n_components=15, whiten=True, random_state=0)
    
    #Fit the PCAs with the training data
    pca_v.fit(loso_training_v)
    pca_a.fit(loso_training_a)
    #Transform both the training and testing data with the PCA
    loso_train_trans_v = pca_v.transform(loso_training_v)
    loso_test_trans_v = pca_v.transform(loso_testing_v)
    
    loso_train_trans_a = pca_a.transform(loso_training_a)
    loso_test_trans_a = pca_a.transform(loso_testing_a)
    
    #Concatenate the features together
    loso_combined_train = np.concatenate((loso_train_trans_v, loso_train_trans_a),1)
    loso_combined_test = np.concatenate((loso_test_trans_v, loso_test_trans_a),1)
    
    #Create a linear SVM and train it
    loso_clf = svm.SVC(kernel='linear')
    loso_clf.fit(loso_combined_train, np.array(loso_training_label))
    
    loso_pred_test = loso_clf.predict(loso_combined_test)
    
    #Calculate the accuracy for the testing data and add it to the list of accuracies
    num_correct_test = 0
    

    for pred, label in zip(loso_pred_test, loso_testing_label):
        if label == pred:
            num_correct_test += 1
    
    

    loso_accuracy_test = num_correct_test/len(loso_pred_test)
    accuracies.append(loso_accuracy_test)
    

#Calculate the average of the accuracies. Print both the list of accuracies and the average
print("mean of accuracies: {}".format(np.mean(accuracies)))
print(accuracies)


mean of accuracies: 0.93
[0.9, 0.8, 1.0, 0.9, 0.9, 1.0, 1.0, 1.0, 0.8, 1.0]


In [65]:
# List of accuracies: [0.9, 0.8, 1.0, 0.9, 0.9, 1.0, 1.0, 1.0, 0.8, 1.0]
# Mean of accuracies: 0.93
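The `whiten=True` shortcut used in the LOSO loop can be compared with a manual normalization of the PCA projections. The sketch below uses synthetic data and illustrative names; it assumes `svd_solver='full'` so that both PCA fits are deterministic:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with different variances per feature (assumption)
rng = np.random.default_rng(0)
feature_scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
toy_train = rng.normal(size=(40, 6)) * feature_scales
toy_test = rng.normal(size=(10, 6)) * feature_scales

pca_plain = PCA(n_components=3, svd_solver='full').fit(toy_train)
pca_white = PCA(n_components=3, whiten=True, svd_solver='full').fit(toy_train)

# Manual route: project, then divide by the training standard deviation.
# The projections are already zero-mean on the training data.
plain_train = pca_plain.transform(toy_train)
scale = plain_train.std(axis=0, ddof=1)
manual_test = pca_plain.transform(toy_test) / scale

# Shortcut: whitening does the scaling inside transform()
white_test = pca_white.transform(toy_test)

print(np.allclose(manual_test, white_test))
```

Whitening matches the manual route when the standard deviation is computed with `ddof=1`. Task 1 uses `np.std` with its default `ddof=0`, so its features differ from whitened ones by a constant factor.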

### Question 4. The accuracy of LOSO (0.93) is lower than the accuracy achieved by the train-test split (0.98) in task 1. Hypothesize as to why the two are different. Which one is better for evaluation?

### Your answer:

LOSO combats identity bias. If the training and testing sets include the same subject, the model is biased towards that subject and may appear to perform better than it really does. That does not seem to explain the difference here, as the subject IDs in the training and testing sets of Task 1 are all different. Our sample is quite small, so the difference might be due to random noise.
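To put the gap between 0.93 and 0.98 in context, the spread of the per-fold accuracies can be computed from the list printed in Task 3. This is a small sketch; the variable names are illustrative:

```python
import numpy as np

# Per-subject accuracies copied from the LOSO output above
loso_accuracies = np.array([0.9, 0.8, 1.0, 0.9, 0.9, 1.0, 1.0, 1.0, 0.8, 1.0])

mean_acc = loso_accuracies.mean()
# Sample standard deviation across the ten folds
std_acc = loso_accuracies.std(ddof=1)

print(mean_acc, std_acc)
```

Each fold contains only 10 test samples, so a single misclassification changes that fold's accuracy by 0.1. The fold-to-fold standard deviation of roughly 0.08 is larger than the 0.05 gap between the two evaluation schemes.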