# GFE Classification: Training & Testing Using Support Vector Machines

#####  This notebook uses Support Vector Machines (SVM) to classify facial capture footage as a specified emotion. Data exploration, training, and modelling are discussed below, along with mitigations for the issues encountered. 
###### (i) Train/test SVM on GFE data on a single emotion and evaluate performance measures
###### (ii) Repeat test on a different facial expression
###### (iii) Invert the roles of the Users
###### (iv) Use Dimensionality Reduction

# Part A: Training 
## Train "Negative" Emotion

In [None]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split,GridSearchCV,KFold, cross_val_score
from sklearn.metrics import accuracy_score, precision_score,plot_roc_curve, recall_score
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from collections import Counter
from sklearn.decomposition import PCA
import pickle

In [None]:
# select emotion
emotion = "negative" 

# read in data file
df_neg = pd.read_csv(f"grammatical_facial_expression/a_{emotion}_datapoints.txt",delimiter = " ",)
df_neg_target = pd.read_csv(f"grammatical_facial_expression/a_{emotion}_targets.txt",delimiter = " ",header=None)

##### Data Exploration 

In [None]:
# combine both dataframes using the target dataset
df_neg['target'] = df_neg_target
df_neg

In [None]:
# summary statistics (exclude first/last columns)
df_neg.iloc[:,1:-1].stack().describe()

The statistics indicate that the data ranges from 0 to 1585 with a standard deviation of 471, which suggests that the values are widely spread. 

In [None]:
# collect x,y, & z coordinates as separate dataframes
xs = df_neg[df_neg.columns[1::3]]
ys = df_neg[df_neg.columns[2::3]]
zs = df_neg[df_neg.columns[3::3]]

# remove target col
xs = xs.drop(["target"],axis=1)

# array of 3 coordinate axes
df_neg_coord = np.array((xs,ys,zs))

print(xs.stack().describe())
print(ys.stack().describe())
print(zs.stack().describe())

The descriptive statistics show significant differences between the X/Y coordinates and the Z coordinates. Normalization and standardization will both be applied to determine whether scaling makes a significant difference to the results. 
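The stride-3 slicing used above can be checked on a toy frame. This is a minimal sketch that assumes the file layout is a leading timestamp column followed by (x, y, z) triples, with `target` appended last; the column names and `n_points` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the assumed layout: timestamp, (x, y, z) per landmark, target.
n_points = 4  # hypothetical; the real files hold many more landmarks
cols = ["timestamp"] + [f"{i}{axis}" for i in range(n_points) for axis in "xyz"]
toy = pd.DataFrame(np.arange(3 * (1 + 3 * n_points)).reshape(3, -1), columns=cols)
toy["target"] = [0, 1, 0]

# Every third column starting at index 1, 2 and 3 respectively.
x_cols = list(toy.columns[1::3])
y_cols = list(toy.columns[2::3])
z_cols = list(toy.columns[3::3])

# With 3 * n_points coordinate columns, the stride-3 slice starting at
# index 1 also lands on the trailing "target" column, hence the drop.
print(x_cols)
x_cols_clean = [c for c in x_cols if c != "target"]
```

Under this layout only the x slice picks up `target`, which is why the notebook drops it from `xs` alone.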

In [None]:
# collect x,y, & z coordinates as separate dataframes
xs = df_neg[df_neg.columns[1::3]]
ys = df_neg[df_neg.columns[2::3]]
zs = df_neg[df_neg.columns[3::3]]

# remove target col
xs = xs.drop(['target'],axis=1)

# Look at the distribution of values along each coordinate axis
fig = make_subplots(rows=1, cols=3,subplot_titles=("X-Axis Values Hist.","Y-Axis Values Hist.", "Z-Axis Values Hist."))
fig.add_trace(go.Histogram(x=xs.values.ravel(),name="x-axis"),
    row=1, col=1
)
fig.add_trace(go.Histogram(x=ys.values.ravel(),name="y-axis"),
    row=1, col=2
)
fig.add_trace(go.Histogram(x=zs.values.ravel(),name="z-axis"),
    row=1, col=3
)
fig.update_layout(title_text="GFE (Negative) Data Distribution")
fig.show()

The data distributions above indicate varying ranges between all three axes. Because of the varying scales, standardization would be an optimal preprocessing technique to apply to the data to ensure more accurate results.

This analysis will therefore test the modelling effects with and without scaling the data, with the former likely to produce superior results.
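The two scaling techniques compared in this notebook reduce to simple per-column formulas. The sketch below uses synthetic data (not the GFE files) to show that the manual formulas agree with `StandardScaler` and `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (an assumption for illustration).
X_demo = np.column_stack([rng.normal(300, 50, 200), rng.normal(1200, 20, 200)])

# Standardization: subtract the column mean, divide by the column std.
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
# Normalization: map each column onto [0, 1].
mm_manual = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))

z_sklearn = StandardScaler().fit_transform(X_demo)
mm_sklearn = MinMaxScaler().fit_transform(X_demo)
```

Both scalers work column by column, so each coordinate feature is rescaled using its own statistics.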

###### Split Data

In [None]:
# separate features (X) from the target (y)
X_neg = df_neg.iloc[:,1:-1]
y_neg = df_neg.iloc[:,-1]

In [None]:
# scale data
scaler = preprocessing.StandardScaler()
X_neg_scaled = scaler.fit_transform(X_neg)

scaler_norm = preprocessing.MinMaxScaler()
X_neg_norm = scaler_norm.fit_transform(X_neg)

In [None]:
# Look at Distributions before and after standardizing and normalizing
fig = make_subplots(rows=1, cols=3,subplot_titles=("Raw Data", "After Normalized","After Standardized"))
fig.add_trace(go.Violin(y=X_neg.unstack(),name="Raw Data"),
    row=1, col=1
)
fig.add_trace(go.Violin(y=pd.DataFrame(X_neg_norm).unstack(),name="After Normalized"),
    row=1, col=2
)
fig.add_trace(go.Violin(y=pd.DataFrame(X_neg_scaled).unstack(),name="After Standardized"),
    row=1, col=3
)
fig.update_layout(title_text="GFE (Negative) Data Distributions: User A")
fig.show()

The above figures indicate that standardization removes the widely varying distributions observed in the raw data. Normalization reduces the large contrasts between the distribution peaks; however, standardization scaled the data best, removing large deviations whilst keeping the distribution normal.

### Support Vector Machine Classification

To tune the hyperparameters, 5-fold cross validation accuracy will be compared across the kernel type, the regularization parameter C, and, for the polynomial kernel, the degree of the function.

Kernel Types:  linear, polynomial, radial   
Polynomial Degrees: 2-9    
C-Value: 0.1, 1, 10   
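Because `degree` only affects the polynomial kernel, crossing it with every kernel repeats identical linear and rbf fits. One possible alternative, sketched here and not the grid used in this notebook, is to pass `GridSearchCV` a list of dictionaries; `ParameterGrid` shows how many candidates each layout generates.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10]
degrees = [2, 3, 4, 5, 6, 7, 8, 9]

# Single dictionary: every kernel is crossed with every degree.
full_grid = dict(kernel=["linear", "poly", "rbf"], degree=degrees, C=C_values)

# List of dictionaries: degree is only varied for the polynomial kernel.
split_grid = [
    dict(kernel=["linear", "rbf"], C=C_values),
    dict(kernel=["poly"], degree=degrees, C=C_values),
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

The split grid covers the same distinct models with 30 candidates instead of 72, which shortens the search without changing which model can win.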

##### Un-Standardized Data

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9]
C = [0.1,1,10]
params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params)
best = grid.fit(X_neg,y_neg)


In [None]:
print("Best 5-Fold CrossValidation Estimates for Un-Standardized Data")
print("Best Kernel:", best.best_estimator_.get_params()['kernel'])
print("Best Degree:", best.best_estimator_.get_params()['degree'])
print("Best C:", best.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best.best_score_)
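The best score printed above is the mean accuracy over the five folds for the winning parameter combination. To see how the other combinations compare, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below uses synthetic data and a reduced grid as stand-ins, so the names `X_demo`, `y_demo`, and `demo_search` are illustrative.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the GFE features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_search = GridSearchCV(svm.SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1]}, cv=5)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, sorted by rank.
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_kernel", "param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Looking at `std_test_score` next to the mean helps judge whether small accuracy gaps between configurations are meaningful.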

##### Standardized Data 

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9]
C = [0.1,1,10,100]

params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params)
best_scaled = grid.fit(X_neg_scaled,y_neg)


In [None]:
print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best Kernel:", best_scaled.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_scaled.best_estimator_.get_params()['degree'])
print("Best C:", best_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_scaled.best_score_)

###### Normalized Data

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9]
C = [0.1,1,10,100]
params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params)
best_norm = grid.fit(X_neg_norm,y_neg)

In [None]:
print("Best 5-Fold CrossValidation Estimates for Normalized Data")
print("Best Kernel:", best_norm.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_norm.best_estimator_.get_params()['degree'])
print("Best C:", best_norm.best_estimator_.get_params()['C'])
print("Best Gamma:", best_norm.best_estimator_.get_params()['gamma'])
print("K-Fold Accuracy:", best_norm.best_score_)

In [None]:
# select best model
optimized_clf_neg_usera = best_scaled.best_estimator_

Raw Data Average Accuracy: 88.7% 

Standardized Average Accuracy: 90.5% 

Normalized Average Accuracy: 90.6% 

The above findings indicate that both Normalized and Standardized data perform better than the raw data. This can be attributed to the varying scales between the coordinate axes, with the z-axis (mm) varying from the x and y-axes coordinate systems which was evident in the exploratory analysis above.

An observation of concern would be for the normalized dataset, using the "polynomial" kernel with a degree = 4. A 4th degree polynomial may fit this training data best, however it may pose a higher risk of overfitting in comparison to the "linear" kernel used by the standardized data cross validation. 

Because standardization improved the distribution of the data prior to cross validation, and because the K-fold cross validation accuracy was nearly identical between normalization and standardization, standardization will be the preferred technique when testing on this dataset. 

Best 5-Fold CrossValidation Estimates for Standardized Data    
Best Kernel: linear   
Best Degree: 3   
Best C: 0.1      
K-Fold Accuracy: 0.9047698412698413    
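In the cells above the scaler is fitted on the full dataset before cross validation, so each validation fold has already influenced the scaling statistics. A common alternative, sketched here on synthetic data and not what this notebook does, is to wrap the scaler and the classifier in a `Pipeline` so that scaling is refitted inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the GFE features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
# Step-name prefixes ("svc__") route each parameter to the right step.
pipe_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5)
pipe_search.fit(X_demo, y_demo)
print(pipe_search.best_params_, pipe_search.best_score_)
```

The fitted `best_estimator_` is then the whole pipeline, so it can be applied to unscaled test data directly.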

## Repeat Training on "Emphasis" User A

In [None]:
# select emotion
emotion2 = "emphasis" 

# read in data file
df_emp = pd.read_csv(f"grammatical_facial_expression/a_{emotion2}_datapoints.txt",delimiter = " ",)
df_emp_target = pd.read_csv(f"grammatical_facial_expression/a_{emotion2}_targets.txt",delimiter = " ",header=None)

# combine both dataframes using the target dataset
df_emp['target'] = df_emp_target
df_emp

In [None]:
# summary statistics (exclude first/last columns)
# collect x,y, & z coordinates as separate dataframes
xs = df_emp[df_emp.columns[1::3]]
ys = df_emp[df_emp.columns[2::3]]
zs = df_emp[df_emp.columns[3::3]]

# remove target col
xs = xs.drop(["target"],axis=1)

# array of 3 coordinate axes
df_emp_coord = np.array((xs,ys,zs))

print(xs.stack().describe())
print(ys.stack().describe())
print(zs.stack().describe())

The data distributions are similar to the "Negative" emotions from above. 

##### Split Data

In [None]:
# separate features (X) from the target (y)
X_emp = df_emp.iloc[:,1:-1]
y_emp = df_emp.iloc[:,-1]

In [None]:
# scale data
scaler = preprocessing.StandardScaler()
X_emp_scaled = scaler.fit_transform(X_emp)

scaler_norm = preprocessing.MinMaxScaler()
X_emp_norm = scaler_norm.fit_transform(X_emp)

In [None]:
# Look at Distributions before and after standardizing and normalizing
fig = make_subplots(rows=1, cols=3,subplot_titles=("Raw Data","After Normalized","After Standardized"))
fig.add_trace(go.Violin(y=X_emp.unstack(),name="Raw Data"),
    row=1, col=1
)
fig.add_trace(go.Violin(y=pd.DataFrame(X_emp_norm).unstack(),name="After Normalized"),
    row=1, col=2
)
fig.add_trace(go.Violin(y=pd.DataFrame(X_emp_scaled).unstack(),name="After Standardized"),
    row=1, col=3
)
fig.update_layout(title_text="GFE (Emphasis) Data Distributions: User A")
fig.show()

The above plots indicate that standardizing the data removes the unbalanced distributions best, which matches the result for the "negative" emotion. 

### Support Vector Machine Classification

##### Un-Standardized Data

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9,10]
C = [0.1,1,10,100]

params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs=4)
best_emp = grid.fit(X_emp,y_emp)

In [None]:
print("Best 5-Fold CrossValidation Estimates for Un-Standardized Data")
print("Best Kernel:", best_emp.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_emp.best_estimator_.get_params()['degree'])
print("Best C:", best_emp.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_emp.best_score_)

##### Standardized Data

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9]
C = [0.1,1,10]
params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs = 4)
best_emp_scaled = grid.fit(X_emp_scaled,y_emp)

In [None]:
print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best Kernel:", best_emp_scaled.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_emp_scaled.best_estimator_.get_params()['degree'])
print("Best C:", best_emp_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_emp_scaled.best_score_)

##### Normalized Data

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9]
C = [0.1,1,10]
params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params)
best_emp_norm = grid.fit(X_emp_norm,y_emp)

In [None]:
print("Best 5-Fold CrossValidation Estimates for Normalized Data")
print("Best Kernel:", best_emp_norm.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_emp_norm.best_estimator_.get_params()['degree'])
print("Best C:", best_emp_norm.best_estimator_.get_params()['C'])
print("Best Gamma:", best_emp_norm.best_estimator_.get_params()['gamma'])
print("K-Fold Accuracy:", best_emp_norm.best_score_)

In [None]:
# select best model
optimized_clf_emp_usera = best_emp_scaled.best_estimator_

Raw Data Average Accuracy: 97.0% 

Standardized Average Accuracy: 97.8% 

Normalized Average Accuracy: 97.9% 

For "Emphasis" emotions, the average accuracy varies only slightly between the cross validated models using different scaling techniques. Normalization and standardization are once again the top two results. Standardization still gives the best transformation of the data distribution and, given its nearly identical accuracy to normalization, it will be chosen as the preferred scaling technique.
 
Best 5-Fold CrossValidation Estimates for Standardized Data  
Best Kernel: linear   
Best Degree: 3   
Best C: 1        
Best Gamma: scale     
K-Fold Accuracy: 0.9778927300457549   

## Implementation of Classifier from Scratch: K-Nearest Neighbor

A K-Nearest Neighbor Classifier will be implemented below using a class of functions. In this implementation Euclidean, Hamming, and Manhattan Distances will be compared to determine the superior distance function. 

This classification will only be using standardized data, since the previous classification indicated that standardized data produced highly accurate results and transformed the distribution of the data best.
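For reference, the three distance functions can be written in a few lines of NumPy. The toy vectors below are illustrative only. Note that the textbook Hamming distance counts positions that differ, which is why it is usually applied to categorical or binary features instead of continuous coordinates.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 0.0, 3.0, 1.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences
hamming = np.mean(a != b)                   # fraction of positions that differ
```

For binary 0/1 vectors, the mean absolute difference and the fraction of differing positions coincide.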

In [None]:
class knn:
    '''This is the implemented classifier for K-Nearest Neighbor Classification. Default value for k is 3.
    '''
    def __init__(self, k=3, distance_func="euclidean"):
        self.k = k
        self.distance_func = distance_func
        
    def euclidean_distance(self, row1, row2):
        return np.sqrt(np.sum((row1 - row2)**2))
    
    def hamming_distance(self, row1, row2):
        # mean absolute difference; equals the normalized Hamming distance only for binary (0/1) features
        return sum(abs(e1 - e2) for e1, e2 in zip(row1, row2)) / len(row1)
    
    def manhattan_distance(self, row1, row2):
        return sum(abs(e1-e2) for e1, e2 in zip(row1,row2))
        
    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # loop thru each row of test data and calculate the nearest neighbor
        y_pred = []
        for x in X:
            y_pred.append(self.nearest_neighbor(x))
            
        return np.array(y_pred)

    def nearest_neighbor(self, x):
        # use the selected distance function to calculate distances to every training row
        distances = []
        for x_train in self.X_train:
            if self.distance_func == 'euclidean':
                distances.append(self.euclidean_distance(x,x_train))
            elif self.distance_func == 'hamming':
                distances.append(self.hamming_distance(x,x_train))
            else: 
                distances.append(self.manhattan_distance(x,x_train))
                
        # sort by minimum distance and return the index
        index = np.argsort(distances)[:self.k]
        
        # np.take uses the index to return the actual label
        k_neighbor_labels = np.take(self.y_train,index)   
        
        # the Counter function returns the most common label
        label = Counter(k_neighbor_labels).most_common(1)
        
        return label[0][0]
    
    def get_params(self, deep=True):
        # require this setting in order to be compatible with GridSearchCV
        return {"k": self.k, 
                "distance_func": self.distance_func}
    
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

In [None]:
# create empty model object
clf = knn()

# optimization parameters
distance_func = ['euclidean','hamming','manhattan']
k = [5,7,9,11,13,15,17,19,21]

params = dict(k=k,distance_func=distance_func)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params, scoring="accuracy")
best_knn_scaled = grid.fit(X_neg_scaled,y_neg)

In [None]:
print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best k:", best_knn_scaled.best_estimator_.get_params()['k'])
print("Best Distance Function:", best_knn_scaled.best_estimator_.get_params()['distance_func'])
print("K-Fold Accuracy:", best_knn_scaled.best_score_)

Best 5-Fold CrossValidation Estimates for Standardized Data   
Best k: 15  
Best Distance Function: euclidean   
K-Fold Accuracy: 0.8353531746031747   

Using the KNN implementation from scratch on "negative" emotion data, the results produced a model that was 83.5% accurate after 5-Fold Cross Validation. Euclidean distance was the preferred distance function, and the optimized number for k was 15. This is similar to using the "rule of thumb" method for determining k which involves taking the square-root of the number of features and dividing by 2, which yielded a k value of 13.

Overall, this classification method is not as accurate as SVM, and it required nearly 8 hours of computing time to optimize the parameters. Therefore, the "implemented from scratch" KNN classifier will be abandoned for the remainder of this analysis due to its inefficiency and lower predictive capability.
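Much of the run time of a row-by-row KNN comes from Python-level loops over every pair of rows. As a sketch of how this could be reduced, the function below computes all pairwise Euclidean distances with matrix operations. It is a hypothetical helper, not part of the class above, and it assumes binary 0/1 labels and an odd `k` so that votes cannot tie.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority-vote KNN with Euclidean distance, without Python loops over rows."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (
        np.sum(X_test ** 2, axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + np.sum(X_train ** 2, axis=1)[None, :]
    )
    # Indices of the k closest training rows for every test row.
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    votes = y_train[nearest]
    # Binary 0/1 labels assumed: predict 1 when more than half of the neighbours are 1.
    return (votes.mean(axis=1) > 0.5).astype(int)

# Synthetic, well-separated clusters used only to exercise the function.
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(4, 1, (50, 5))])
y_tr = np.array([0] * 50 + [1] * 50)
X_te = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(4, 1, (10, 5))])
y_te = np.array([0] * 10 + [1] * 10)

pred = knn_predict(X_tr, y_tr, X_te, k=5)
```

Ranking by squared distance gives the same neighbours as ranking by distance, so the square root can be skipped.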

# Part B: Testing on User B

## Test on "Negative" Emotion User B

In [None]:
# select emotion
emotion = "negative" 

# read in data file
df_neg_userb = pd.read_csv(f"grammatical_facial_expression/b_{emotion}_datapoints.txt",delimiter = " ",)
df_neg_target_userb = pd.read_csv(f"grammatical_facial_expression/b_{emotion}_targets.txt",delimiter = " ",header=None)

# combine both dataframes using the target dataset
df_neg_userb['target'] = df_neg_target_userb

##### Split Data

In [None]:
# separate features (X) from the target (y)
X_neg_userb = df_neg_userb.iloc[:,1:-1]
y_neg_userb = df_neg_userb.iloc[:,-1]

# scale data
scaler = preprocessing.StandardScaler()
X_neg_scaled_userb = scaler.fit_transform(X_neg_userb)

scaler = preprocessing.MinMaxScaler()
X_neg_norm_userb = scaler.fit_transform(X_neg_userb)
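The cell above fits a fresh scaler on User B, so each user is standardized on their own statistics. Another option is to reuse the scaler fitted on the training user. Which is appropriate depends on whether unlabeled data from the new user is available in advance; the sketch below only contrasts the two on synthetic stand-in arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(100, 10, (50, 3))   # stand-in for User A features
X_test_demo = rng.normal(110, 12, (40, 3))    # stand-in for User B features

# Option 1 (as in the cell above): the test user is standardized on their own statistics.
X_test_own = StandardScaler().fit_transform(X_test_demo)

# Option 2: reuse the statistics learned from the training user.
train_scaler = StandardScaler().fit(X_train_demo)
X_test_shared = train_scaler.transform(X_test_demo)

print(X_test_own.mean(axis=0).round(2), X_test_shared.mean(axis=0).round(2))
```

With option 1 every feature of the test user is centred at zero; with option 2 any offset between the two users remains visible to the classifier.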

##### Results

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_neg_usera.predict(X_neg_scaled_userb)

# calculate model accuracy
acc = accuracy_score(y_neg_userb, y_pred)

# calculate model precision
prec = precision_score(y_neg_userb, y_pred)

# calculate model recall
recall = recall_score(y_neg_userb, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_neg_usera,X_neg_scaled_userb,y_neg_userb)

Model Accuracy: 0.5474083438685209  
Model Precision: 0.4973753280839895  
Model Recall: 0.5323033707865169  

The above results for the testing on "Negative" emotion User B indicate that the model performed very poorly. Accuracy, precision, and model recall are all near 50% and the ROC curve indicates that the model is unable to correctly classify true positives. Overall the model is making predictions at random, similar to flipping a coin. 

One possible explanation is the value of 0.1 chosen for the SVM regularization parameter C, which is a fairly typical value. The smaller this parameter is, the more training points the SVM is allowed to misclassify, which widens the margin and yields a more "generalized" model. Re-tuning with smaller values of C may therefore allow the model to generalize better and perform better on the User B test set. 

Therefore, the SVM will be optimized again on user A using smaller values for C and re-tested on user B. Optimally, we would like to improve the testing accuracy (user B) without compromising the training accuracy (user A) in order to prevent the model from being biased.
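The relationship between C and the margin can be illustrated on synthetic data. For a linear SVM the margin width is 2 / ||w||, so it can be read directly from the fitted coefficients; the dataset and C values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Overlapping synthetic classes (not the GFE data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    class_sep=0.8, random_state=0,
)

margins = {}
for C in [0.01, 1, 10]:
    model = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    # For a linear SVM the margin width is 2 / ||w||.
    margins[C] = 2.0 / np.linalg.norm(model.coef_)
    print(f"C={C:<6} margin={margins[C]:.3f} support vectors={model.n_support_.sum()}")
```

Smaller C gives a wider margin and typically more support vectors, since more points are allowed inside or beyond the margin.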

###### Re-Optimize parameters using smaller value of C and test again

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9]
C = [0.01,0.001,0.0001]  # smaller values only; the exact grid is an assumption (reported best C is 0.01)
params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params)
best_scaled = grid.fit(X_neg_scaled,y_neg)

print("Best 5-Fold CrossValidation Estimates for Smaller Values of C")
print("Best Kernel:", best_scaled.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_scaled.best_estimator_.get_params()['degree'])
print("Best C:", best_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_scaled.best_score_)

# save model as optimized v2
optimized_clf_neg_usera_v2 = best_scaled.best_estimator_

Best 5-Fold CrossValidation Estimates for Smaller Values of C   
Best Kernel: linear   
Best Degree: 3   
Best C: 0.01   
K-Fold Accuracy: 0.9012222222222223     

After re-tuning using smaller values of C, accuracy was only compromised by a reduction of 0.3% (90.1%) and the optimal C value was 0.01. Smaller values of C were explored, however large reductions in accuracy were observed in excess of 5%.

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_neg_usera_v2.predict(X_neg_scaled_userb)

# calculate model accuracy
acc = accuracy_score(y_neg_userb, y_pred)

# calculate model precision
prec = precision_score(y_neg_userb, y_pred)

# calculate model recall
recall = recall_score(y_neg_userb, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_neg_usera_v2,X_neg_scaled_userb,y_neg_userb)

Model Accuracy: 0.5689001264222503   
Model Precision: 0.5201612903225806   
Model Recall: 0.5435393258426966   

Overall, re-tuning had little effect on the model: accuracy and precision improved by only about two percentage points each, and there was no overall change in the ROC curve.

## Test on "Emphasis" Emotion User B

In [None]:
# select emotion
emotion2 = "emphasis" 

# read in data file
df_emp_userb = pd.read_csv(f"grammatical_facial_expression/b_{emotion2}_datapoints.txt",delimiter = " ",)
df_emp_target_userb = pd.read_csv(f"grammatical_facial_expression/b_{emotion2}_targets.txt",delimiter = " ",header=None)

# combine both dataframes using the target dataset
df_emp_userb['target'] = df_emp_target_userb

###### Split Data/Scale

In [None]:
# separate features (X) from the target (y)
X_emp_userb = df_emp_userb.iloc[:,1:-1]
y_emp_userb = df_emp_userb.iloc[:,-1]

# scale data
scaler = preprocessing.StandardScaler()
X_emp_scaled_userb = scaler.fit_transform(X_emp_userb)

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_emp_usera.predict(X_emp_scaled_userb)

# calculate model accuracy
acc = accuracy_score(y_emp_userb, y_pred)

# calculate model precision
prec = precision_score(y_emp_userb, y_pred)

# calculate model recall
recall = recall_score(y_emp_userb, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_emp_usera,X_emp_scaled_userb,y_emp_userb)

Model Accuracy: 0.7715773809523809   
Model Precision: 0.6744548286604362   
Model Recall: 0.815442561205273    

On User B, the "emphasis" model shows decent accuracy (77.2%) and recall (81.5%) but weaker precision (67.4%). The ROC curve bends toward the top-left corner and has an AUC value of 0.86, so overall this is a good model. 
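The AUC can also be computed numerically instead of being read off the plot. The sketch below uses `roc_curve` and `roc_auc_score` with the SVM decision function as the ranking score; the data is synthetic and the variable names are illustrative. In newer scikit-learn releases, where `plot_roc_curve` is no longer available, `RocCurveDisplay.from_estimator` takes the same estimator and data for plotting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the train/test users (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = SVC(kernel="linear", C=0.1).fit(X_tr, y_tr)
# Signed distance to the separating hyperplane serves as the ranking score.
scores = model.decision_function(X_te)

fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.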

The metrics above state that this model correctly identifies the emotion "emphasis" 81.5% of the time, however when it does predict "emphasis" it is correct only 67% of the time. Therefore the model is slightly overpredicting. 
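The distinction between recall and precision can be made concrete with a confusion matrix. The labels below are toy values chosen for illustration, not results from this dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels chosen for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
recall_manual = tp / (tp + fn)        # share of actual positives that were found
precision_manual = tp / (tp + fp)     # share of positive predictions that were right
```

Here three of the four actual positives are found (recall 0.75), but two of the five positive predictions are false alarms (precision 0.6), which is the same overprediction pattern described above.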

We saw an increase in model performance using smaller values of C with the "negative" emotion. We will explore this possibility with "Emphasis" as well below.

###### Re-Optimize parameters using smaller value of C to see if improvement

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear']
C = [0.01,0.0001,0.00001]

params = dict(kernel=kernel, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs = 4)
best_emp_scaled = grid.fit(X_emp_scaled,y_emp)

print("Best 5-Fold CrossValidation Estimates for Smaller Values of C")
print("Best Kernel:", best_emp_scaled.best_estimator_.get_params()['kernel'])
print("Best C:", best_emp_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_emp_scaled.best_score_)

# select best model and save as v2
optimized_clf_emp_usera_v2 = best_emp_scaled.best_estimator_

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_emp_usera_v2.predict(X_emp_scaled_userb)

# calculate model accuracy
acc = accuracy_score(y_emp_userb, y_pred)

# calculate model precision
prec = precision_score(y_emp_userb, y_pred)

# calculate model recall
recall = recall_score(y_emp_userb, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_emp_usera_v2,X_emp_scaled_userb,y_emp_userb)

Model Accuracy: 0.8244047619047619   
Model Precision: 0.7547495682210709   
Model Recall: 0.8229755178907722    

Results above were very good. A regularization parameter value of 0.01 exhibited only a 1% decrease in accuracy during training on User A whilst improving accuracy (+ 5.3%), precision (+ 8.0%), recall (+ 0.7%), and AUC (+ 3.0 %) when testing on User B. 

Again, using a small C value increases the margin size and allows for more misclassified points and produces a more "generalized" model. 

# Part C: Additional Experimentation
Training and testing will be performed by swapping the users now, with training on User B and testing on User A.
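Since the same train-on-one-user, test-on-the-other procedure is repeated in both directions, it could be wrapped in a small helper. The function `evaluate_transfer` and the synthetic `fake_user` data below are hypothetical and only sketch the idea; the cells that follow perform the steps explicitly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_transfer(X_train, y_train, X_test, y_test, **svc_params):
    """Fit an SVC on one user and score it on another (hypothetical helper)."""
    # Each user is standardized separately, mirroring the cells in this notebook.
    X_train_s = StandardScaler().fit_transform(X_train)
    X_test_s = StandardScaler().fit_transform(X_test)
    model = SVC(**svc_params).fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }

rng = np.random.default_rng(0)

def fake_user(n):
    # Synthetic two-class data standing in for one user's recordings.
    X = np.vstack([rng.normal(0, 1, (n, 4)), rng.normal(2, 1, (n, 4))])
    y = np.array([0] * n + [1] * n)
    return X, y

X_a, y_a = fake_user(60)
X_b, y_b = fake_user(60)

a_to_b = evaluate_transfer(X_a, y_a, X_b, y_b, kernel="linear", C=0.1)
b_to_a = evaluate_transfer(X_b, y_b, X_a, y_a, kernel="linear", C=0.1)
```

Comparing `a_to_b` with `b_to_a` then shows directly whether the transfer is symmetric between the two users.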

## Train on "Negative" emotion User B and test on User A

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear','poly','rbf']
degree = [2,3,4,5,6,7,8,9]
C = [0.1,1,10]
params = dict(kernel=kernel,degree=degree, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs=5)
best_scaled = grid.fit(X_neg_scaled_userb, y_neg_userb)

In [None]:
print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best Kernel:", best_scaled.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_scaled.best_estimator_.get_params()['degree'])
print("Best C:", best_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_scaled.best_score_)

In [None]:
# select best model for User B
optimized_clf_neg_userb = best_scaled.best_estimator_

Best 5-Fold CrossValidation Estimates for Standardized Data  
Best Kernel: rbf    
Best Degree: 3  
Best C: 0.1   
K-Fold Accuracy: 0.7483787884838079   

5-Fold Cross Validation on the "Negative" emotion for User B yields an average accuracy of only 74.8%, with 'rbf' as the best kernel. This is less accurate than training on User A (90.5%). The reason may lie in the distribution of the data, which is explored below.

In [None]:
# summary statistics (exclude first/last columns)
# collect x,y, & z coordinates as separate dataframes
xs = df_neg_userb[df_neg_userb.columns[1::3]]
ys = df_neg_userb[df_neg_userb.columns[2::3]]
zs = df_neg_userb[df_neg_userb.columns[3::3]]

# remove target col
xs = xs.drop(["target"],axis=1)

# array of 3 coordinate axes
df_neg_coord = np.array((xs,ys,zs))

print(xs.stack().describe())
print(ys.stack().describe())
print(zs.stack().describe())

In [None]:
# Compare distributions between User A and User B
fig = make_subplots(rows=1, cols=2,subplot_titles=("Negative: User A","User B"))
fig.add_trace(go.Violin(y=X_neg.unstack(),name="Negative: User A"),
    row=1, col=1
)
fig.add_trace(go.Violin(y=X_neg_userb.unstack(),name="Negative: User B"),
    row=1, col=2
)

fig.update_layout(title_text="GFE Data (Negative) Distribution Before Standardization")
fig.show()

Both the descriptive statistics and the distributions of the data are very similar between User A and User B, with the only significant difference being that User B is a slightly larger dataset. There is no obvious difference between the two datasets that could account for the large gap in k-fold cross validation training accuracy between User A and User B.
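Pooled statistics and violin plots combine all features, so a shift confined to a few features could go unnoticed. One way to check, sketched here on synthetic data with a hypothetical helper, is a per-feature standardized mean difference between the two users.

```python
import numpy as np

def standardized_mean_difference(X_a, X_b):
    """Per-feature |mean_a - mean_b| divided by the pooled standard deviation."""
    pooled_std = np.sqrt((X_a.var(axis=0) + X_b.var(axis=0)) / 2.0)
    return np.abs(X_a.mean(axis=0) - X_b.mean(axis=0)) / pooled_std

# Synthetic users: only the third feature is shifted between them.
rng = np.random.default_rng(0)
user_a = rng.normal([0.0, 5.0, 10.0], 1.0, size=(500, 3))
user_b = rng.normal([0.0, 5.0, 12.0], 1.0, size=(600, 3))

smd = standardized_mean_difference(user_a, user_b)
print(smd.round(2))
```

Features with a large value are the ones where the two users differ most and are candidates for closer inspection.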

###### Test on User A

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_neg_userb.predict(X_neg_scaled)

# calculate model accuracy
acc = accuracy_score(y_neg, y_pred)

# calculate model precision
prec = precision_score(y_neg, y_pred)

# calculate model recall
recall = recall_score(y_neg, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_neg_userb,X_neg_scaled,y_neg)

Model Accuracy: 0.6067615658362989  
Model Precision: 0.6162162162162163   
Model Recall: 0.4318181818181818     

Results of testing on User A produced a model that is 60.7% accurate, 61.6% precise, and has a recall of 43.2%.

The AUC of 0.68 is also very poor. Compared with training on User A and testing on User B, this model performs similarly in terms of accuracy and precision. 

Adjusting the regularization parameter may improve the results; this is tried below.

###### Adjustment of regularization parameter

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters (degree is not searched because only the rbf kernel is used)
kernel = ['rbf']
C = [1,0.1,0.01]

params = dict(kernel=kernel, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs=5)
best_scaled = grid.fit(X_neg_scaled_userb, y_neg_userb)

print("Best 5-Fold CrossValidation Estimates for Smaller Values of C")
print("Best Kernel:", best_scaled.best_estimator_.get_params()['kernel'])
print("Best C:", best_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_scaled.best_score_)

# save model as optimized v2
optimized_clf_neg_userb_v2 = best_scaled.best_estimator_

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_neg_userb_v2.predict(X_neg_scaled)

# calculate model accuracy
acc = accuracy_score(y_neg, y_pred)

# calculate model precision
prec = precision_score(y_neg, y_pred)

# calculate model recall
recall = recall_score(y_neg, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_neg_userb_v2,X_neg_scaled,y_neg)

Model Accuracy: 0.7215302491103203    
Model Precision: 0.7494199535962877     
Model Recall: 0.6117424242424242   

After adjusting the regularization parameter to C=1, the cross-validated mean accuracy decreased by 2.7%. However, when testing the model, accuracy and precision increased by about 11.5 and 13.3 percentage points respectively, and recall by about 18.0 percentage points. In addition, the ROC curve improved (AUC +0.13). 

Overall the adjustment saw a slight worsening of the model training performance, but the tradeoff was a significant improvement on the testing performance of the model.


## Train on "Emphasis" User B and test on User A

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear']
degree = [2,3,4,5,6]
C = [0.001, 0.01,0.1, 1, 10]

params = dict(kernel=kernel, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs=5)
best_emp_scaled = grid.fit(X_emp_scaled_userb, y_emp_userb)

In [None]:
print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best Kernel:", best_emp_scaled.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_emp_scaled.best_estimator_.get_params()['degree'])
print("Best C:", best_emp_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_emp_scaled.best_score_)

In [None]:
# select best model for User B
optimized_clf_emp_userb = best_emp_scaled.best_estimator_

##### Test on User A

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_emp_userb.predict(X_emp_scaled)

# calculate model accuracy
acc = accuracy_score(y_emp, y_pred)

# calculate model precision
prec = precision_score(y_emp, y_pred)

# calculate model recall
recall = recall_score(y_emp, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_emp_userb,X_emp_scaled,y_emp)

Model Accuracy: 0.8460441910192444    
Model Precision: 0.7375          
Model Recall: 0.5363636363636364       

For "Emphasis", testing on User A resulted in an accuracy of 84.6%, precision of 73.8%, and recall of 53.6%. The accuracy is similar to the earlier experiment that trained on User A and tested on User B (82.4% accuracy), but precision and recall are significantly lower. The ROC curve indicates that the model does not have a good balance between specificity and sensitivity: with a recall of 53.6%, it misses almost half of the positive samples. Overall, the AUC value is 0.66, which is very poor.
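
To see why accuracy can stay high while recall is low, it helps to look at the underlying counts. The counts below are the smallest confusion matrix consistent with the three printed metrics; they are inferred from those values (an assumption), not taken from a notebook output.

```python
# Counts inferred from the printed metrics (assumption, not a notebook output)
tn, fp, fn, tp = 1010, 63, 153, 177

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)                 # recall is the same quantity as sensitivity
majority_baseline = (tn + fp) / total   # accuracy of always predicting the negative class

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Majority-class baseline:", majority_baseline)
```

Under these counts the negative class makes up roughly three quarters of the samples, so accuracy alone says little about how well the positive class is detected.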

Adjusting the regularization parameter may improve these metrics; the results are shown below. Decreasing C widens the SVM margin, which regularizes the model more strongly and should help it generalize.
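
The link between C and the margin can be checked numerically. For a linear SVM the margin width is 2 / ||w||, so a smaller C should give a larger value. This is a minimal sketch on synthetic two-class data (an assumption), not on the GFE features.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, slightly overlapping two-class data (assumption)
X, y = make_blobs(n_samples=200, centers=[[-1.5, 0.0], [1.5, 0.0]],
                  cluster_std=1.0, random_state=0)

margins = {}
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = SVC(kernel="linear", C=C).fit(X, y)
    # for a linear SVM the margin width is 2 / ||w||
    margins[C] = 2 / np.linalg.norm(model.coef_)
    n_sv = int(model.n_support_.sum())
    print(f"C={C:<6} margin width={margins[C]:.3f} support vectors={n_sv}")
```

A wider margin usually comes with more support vectors, because more training points fall inside or on the margin.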

##### Optimize regularization parameter

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear']
# degree only applies to the 'poly' kernel; it is not passed to the grid below
degree = [2,3,4,5,6]
C = [0.0001]

params = dict(kernel=kernel, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs=5)
best_emp_scaled = grid.fit(X_emp_scaled_userb, y_emp_userb)

print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best Kernel:", best_emp_scaled.best_estimator_.get_params()['kernel'])
print("Best Degree:", best_emp_scaled.best_estimator_.get_params()['degree'])
print("Best C:", best_emp_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_emp_scaled.best_score_)

In [None]:
# save optimized regularization model as v2 
optimized_clf_emp_userb_v2 = best_emp_scaled.best_estimator_

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_emp_userb_v2.predict(X_emp_scaled)

# calculate model accuracy
acc = accuracy_score(y_emp, y_pred)

# calculate model precision
prec = precision_score(y_emp, y_pred)

# calculate model recall
recall = recall_score(y_emp, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_emp_userb_v2,X_emp_scaled,y_emp)

Model Accuracy: 0.8524590163934426   
Model Precision: 0.7757847533632287   
Model Recall: 0.5242424242424243    
AUC: 0.85   

After decreasing the C value to 0.0001, the mean cross-validation accuracy of the model decreased from 90.9% to 85%. However, the test accuracy on User A increased by 0.6% and, more importantly, the ROC curve is significantly improved (AUC = 0.85, up from 0.66). Overall this would suggest a robust model that carries a good tradeoff between specificity and sensitivity.
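
Rather than re-running the grid search with a single C to obtain its cross-validation score, the per-candidate scores can be read from `cv_results_` after one search. The sketch below uses synthetic stand-in data (an assumption) and the same linear-kernel grid style as above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

params = dict(kernel=["linear"], C=[0.0001, 0.001, 0.01, 0.1, 1, 10])
grid = GridSearchCV(SVC(), params).fit(X, y)  # default is 5-fold cross-validation

# one row per candidate C, with the mean and spread of the fold accuracies
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = pd.DataFrame(grid.cv_results_)[columns]
print(cv_table)
```

This makes the cost of choosing a smaller C visible in one table: the row for the chosen C can be compared against the top-ranked row.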

## Dimensionality Reduction: Principal Component Analysis

### PCA on User A "Negative"
PCA will be fitted on the User A "Negative" data, a model will be trained on the reduced components, and it will then be tested on User B to view the performance.


In [None]:
# Perform PCA on User A "Negative" emotion
pca_neg = PCA(.99).fit(X_neg_scaled)

# plot to see total variance explained by components
px.scatter(np.cumsum(pca_neg.explained_variance_ratio_), title="PCA: 99% Cumulative Variance for Negative Emotion", labels={
    "value":"Cumulative Variance", "index":"Components"
})

The graph above suggests that 99% of the variance lies within the first 23 components of the 300 components. Therefore, PCA will only use the first 23 components.
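
The component count does not have to be read off the plot. When `PCA` is given a float between 0 and 1, the fitted object stores how many components it kept in `n_components_`. A minimal sketch on synthetic data (an assumption: 30 features driven by 5 latent factors):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 30 features built from 5 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.05 * rng.normal(size=(500, 30))
X_scaled = StandardScaler().fit_transform(X)

# a float in (0, 1) asks PCA to keep just enough components to exceed that variance fraction
pca_99 = PCA(0.99).fit(X_scaled)

n_comp = pca_99.n_components_
print("Components kept:", n_comp)
print("Variance explained:", pca_99.explained_variance_ratio_.sum())
```

The fitted `pca_99` can also be used directly for `transform`, which avoids fitting a second PCA with a hard-coded component count.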

##### Select PCA components and train model on User A

In [None]:
# save 23 components of PCA
n_comp = 23

# create pca instance and fit it to data
pca = PCA(n_comp)
pca.fit(X_neg_scaled)

# transform dataset
X_neg_scaled_pca = pca.transform(X_neg_scaled)

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear']
# degree only applies to the 'poly' kernel; it is not passed to the grid below
degree = [2,3,4,5,6,7,8,9]
C = [0.01,0.1,1,10]

params = dict(kernel=kernel, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params)
best_scaled = grid.fit(X_neg_scaled_pca,y_neg)

print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best Kernel:", best_scaled.best_estimator_.get_params()['kernel'])
print("Best C:", best_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_scaled.best_score_)

In [None]:
# select best model
optimized_clf_neg_usera_pca = best_scaled.best_estimator_

###### Test on User B

In [None]:
# project User B onto the principal components fitted on User A
X_neg_scaled_userb_pca = pca.transform(X_neg_scaled_userb)

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_neg_usera_pca.predict(X_neg_scaled_userb_pca)

# calculate model accuracy
acc = accuracy_score(y_neg_userb, y_pred)

# calculate model precision
prec = precision_score(y_neg_userb, y_pred)

# calculate model recall
recall = recall_score(y_neg_userb, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_neg_usera_pca,X_neg_scaled_userb_pca,y_neg_userb)

Model Accuracy: 0.5960809102402023    
Model Precision: 0.548472775564409    
Model Recall: 0.5800561797752809    


The results from PCA are similar to the non-PCA modelling, with an accuracy of 59.6% (slightly higher than before) and a precision and recall of 54.8% and 58.0% respectively. The ROC curve does not suggest a good balance between specificity and sensitivity, so the model performs similarly to the non-PCA model.

Overall PCA was successful as it produced a similar model using only 23 components in comparison to 300.

Compare User A and User B below to see whether the number of principal components varies between the two users.

###### Compare PCA between User A and User B

In [None]:
# Perform PCA on User B "Negative" emotion
pca_neg_userb = PCA().fit(X_neg_scaled_userb)

# cumulative explained variance per user
# (pca_neg was fitted with PCA(.99), so it only holds the components needed to reach 99%)
cum_a = np.cumsum(pca_neg.explained_variance_ratio_)
cum_b = np.cumsum(pca_neg_userb.explained_variance_ratio_)

# Compare cumulative explained variance for User A and User B
fig = make_subplots(rows=1, cols=2,subplot_titles=("User A","User B"))
fig.add_trace(go.Scatter(x=np.arange(1,len(cum_a)+1),y=cum_a,name="User A"),
    row=1, col=1
)
fig.add_trace(go.Scatter(x=np.arange(1,len(cum_b)+1),y=cum_b,name="User B"),
    row=1, col=2
)
fig.update_layout(title_text="PCA Comparisons: User A vs. User B")
fig.update_yaxes(
        title_text = "Cumulative Variance")
fig.update_xaxes(
        title_text = "# of Components")
fig.show()

The visualization above suggests that for User A ~99% of the variance is within the first 23 components, whilst for User B it is within the first 81 components. Using fewer components showed very little change in model performance, which would suggest that for "negative" emotions only a handful of components are of significance, and perhaps that the more data used, the worse the predictions.


### PCA on User A "Emphasis"
PCA will be fitted on the User A "Emphasis" data, a model will be trained on the reduced components, and it will then be tested on User B to view the performance.

In [None]:
# Perform PCA on User A "Emphasis" emotion
pca_emp = PCA(0.99).fit(X_emp_scaled)

# plot to see total variance explained by components
px.scatter(np.cumsum(pca_emp.explained_variance_ratio_), title="PCA: 99% Cumulative Variance for Emphasis Emotion", labels={
    "value":"Cumulative Variance", "index":"Components"
})

The graph above suggests that 99% of the variance lies within the first 22 components of the 300 components. Therefore, PCA will only use the first 22 components.

##### Select PCA components and train model on User A

In [None]:
# save 22 components of PCA
n_comp = 22

pca = PCA(n_comp)
pca.fit(X_emp_scaled)

X_emp_scaled_pca = pca.transform(X_emp_scaled)

In [None]:
# create empty model object
clf = svm.SVC()

# optimization parameters
kernel = ['linear']
# degree only applies to the 'poly' kernel; it is not passed to the grid below
degree = [2,3,4,5,6,7,8,9]
C = [0.001,0.01,0.1,1,10]

params = dict(kernel=kernel, C=C)

# Use GridSearch to optimize parameters through 5-Fold Cross Validation
grid = GridSearchCV(clf, params,n_jobs = 4)
best_emp_scaled = grid.fit(X_emp_scaled_pca,y_emp)

print("Best 5-Fold CrossValidation Estimates for Standardized Data")
print("Best Kernel:", best_emp_scaled.best_estimator_.get_params()['kernel'])
print("Best C:", best_emp_scaled.best_estimator_.get_params()['C'])
print("K-Fold Accuracy:", best_emp_scaled.best_score_)

In [None]:
# select best model
optimized_clf_emp_usera_pca = best_emp_scaled.best_estimator_

###### Test on User B

In [None]:
# project User B onto the principal components fitted on User A
X_emp_scaled_userb_pca = pca.transform(X_emp_scaled_userb)

In [None]:
# make predictions on test data using optimized model
y_pred = optimized_clf_emp_usera_pca.predict(X_emp_scaled_userb_pca)

# calculate model accuracy
acc = accuracy_score(y_emp_userb, y_pred)

# calculate model precision
prec = precision_score(y_emp_userb, y_pred)

# calculate model recall
recall = recall_score(y_emp_userb, y_pred)

print("Model Accuracy:", acc)
print("Model Precision:", prec)
print("Model Recall:", recall)

# plot ROC Curve
plot_roc_curve(optimized_clf_emp_usera_pca,X_emp_scaled_userb_pca,y_emp_userb)

Model Accuracy: 0.8139880952380952   
Model Precision: 0.7513416815742398    
Model Recall: 0.7909604519774012     

Using only 22 components for the emotion "Emphasis", the model performance decreased by 1% in terms of accuracy in comparison to the non-PCA results. The ROC curve is also very similar, with only a 1% decrease in AUC.

Again, PCA shows a slight decrease in performance but a large reduction in dimensionality. This would again suggest that only a handful of components are of importance when predicting "emphasis" emotions.
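
The scale, PCA, and SVM steps used in both PCA sections can also be chained in a scikit-learn `Pipeline`, so that the transforms fitted on the training user are reused unchanged on the test user. This is a sketch under assumptions: the data is synthetic, the two "users" are simply the two halves of it, and the scaler is fitted on the training user only, which may differ from how the scaled arrays in this notebook were produced.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption): first half plays the training user, second half the test user
X, y = make_classification(n_samples=600, n_features=40, n_informative=6,
                           n_clusters_per_class=1, random_state=0)
X_user_a, y_user_a = X[:300], y[:300]
X_user_b, y_user_b = X[300:], y[300:]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.99)),
    ("svm", SVC(kernel="linear")),
])

# step parameters are addressed as <step name>__<parameter>
c_values = [0.001, 0.01, 0.1, 1, 10]
grid = GridSearchCV(pipe, {"svm__C": c_values}).fit(X_user_a, y_user_a)

# scaling and PCA fitted on the training user are applied as-is to the test user
test_accuracy = grid.score(X_user_b, y_user_b)
print("Best C:", grid.best_params_["svm__C"])
print("Components kept:", grid.best_estimator_.named_steps["pca"].n_components_)
print("Test accuracy:", test_accuracy)
```

One design difference from the cells above: inside the grid search the scaler and PCA are refitted on each training fold, so the validation fold never influences the components.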