# Assignment Description

You can see that we provide two parts of the dataset: dataset.csv and test_to_submit.csv. While the former contains all the features and labels, the latter only contains features. During this assignment you will train and test a model using the first dataset. Then, using your trained model, you will be asked to classify the samples in test_to_submit.csv and submit the output of your model. We will check, using our internal labels, the performance of your model on this last dataset.

To pass this Assignment, you are required to perform the following steps:

1. Read and preprocess the dataset into a format that is appropriate for training.
2. Do a balanced split of the dataset for train/val/test.
3. Select an appropriate model for the task. You can choose among the following:
    * A model that was presented during Lab 2
    * A model that is part of scikit-learn that was not presented in Lab 2
    * A different model using PyTorch
4. Do some kind of hyperparameter tuning/model selection using the validation dataset. Some examples are the following:
    * Using different kernels with Support Vector Machines (SVM)
    * Using different k values when using k-Nearest Neighbours
    * Changing the depth and breadth of a Multi-Layer Perceptron (MLP)
    * Testing different models, e.g. k-NN vs SVM vs MLP
5. Analyse and report the performance of your selected model with your selected hyperparameter(s) on your test set.
6. Classify the samples of test_to_submit.csv
7. Submit everything on Studium
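
As a rough illustration of step 4, the sketch below picks k for a tiny hand-written k-NN classifier by comparing accuracy on a validation set. The toy data and the helper names (`knn_predict`, `val_accuracy`) are made up for this example and are not part of the assignment material; one mislabelled training point is included on purpose so that k=1 does worse than larger k.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    # Rank training points by squared Euclidean distance to the query.
    order = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    # Majority vote among the k nearest neighbours.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def val_accuracy(train_points, train_labels, points, labels, k):
    hits = sum(
        knn_predict(train_points, train_labels, p, k) == y
        for p, y in zip(points, labels)
    )
    return hits / len(points)

# Toy 1-D training data; (0.06,) is deliberately mislabelled noise.
train_X = [(0.0,), (0.06,), (0.1,), (0.2,), (0.9,), (1.0,), (1.1,)]
train_y = ["sad", "happy", "sad", "sad", "happy", "happy", "happy"]
val_X = [(0.05,), (0.15,), (0.95,), (1.05,)]
val_y = ["sad", "sad", "happy", "happy"]

# Hyperparameter tuning: score every candidate k on the validation set only.
val_scores = {k: val_accuracy(train_X, train_y, val_X, val_y, k) for k in (1, 3, 5)}
best_k = max(val_scores, key=val_scores.get)
print(val_scores, best_k)
```

The test set plays no role in this selection; it is only used once, afterwards, to report the performance of the chosen k.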

For this assignment you will be asked to submit:

A very brief report (in pdf) with bullet points answering the following questions (also see example answers):

* Name: Your name
* Train/val/test split percentage: 70/20/10
* Selected model(s): k-NN
* Hyperparameter tuning or model selection: Hyperparameter tuning
* If Hyperparameter tuning, parameter that was tuned and range of values: k as in the number of neighbours, values: 1, 3, 5, 15
* Best model/hyperparameter: k=5
* Performance of best model on your test set (accuracy): 42%

Besides this, you are also asked to submit a file containing your model's classifications of the samples in test_to_submit.csv. The file must be named outputs (with no extension, i.e. outputs.txt will not be accepted) and have exactly one word per line. Each word is the emotion label given by your model, i.e. the i-th line of your file is the label your model assigns to the i-th sample in test_to_submit.csv.

Finally, you need to submit your code, in a single script or notebook. Do not zip your files.

Failure to comply with the given name/structure will result in automatic failure of the assignment. In that case, you can still resubmit your assignment until the deadline.

## Imports and Init

In [93]:
#imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

import torch
import torch.nn as nn
import torch.nn.functional as F

import optuna

# use the first GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# On a CUDA machine this prints a CUDA device; here it prints cpu:

print(device)

cpu


# Data Handling

## Data acquisition and inspection

In [94]:
#read files
data = pd.read_csv("dataset.csv")
eval_set = pd.read_csv("test_to_submit.csv")

In [95]:
#inspect data
data.head(5)

Unnamed: 0,emotion,AU01,AU02,AU04,AU05,AU06,AU07,AU09,AU10,AU11,...,AU14,AU15,AU17,AU20,AU23,AU24,AU25,AU26,AU28,AU43
0,neutral,0.450774,0.289915,0.409713,0.518726,0.086218,0.0,0.187309,0.354838,0.0,...,0.32069,0.411641,0.431646,0.0,0.277122,0.335435,0.262999,0.189863,0.051967,0.05137
1,disgust,0.50045,0.314694,0.625174,0.335747,0.262984,0.0,0.504238,0.383201,0.0,...,0.544159,0.440429,0.495913,0.0,0.514737,0.420401,0.052358,0.143576,0.500994,0.155117
2,sad,0.273191,0.191327,0.140938,0.358091,0.246593,0.0,0.312881,0.188845,1.0,...,0.284598,0.761539,0.491468,0.0,0.134049,0.670237,0.024796,0.109462,0.325429,0.191367
3,neutral,0.464508,0.301702,0.50037,0.296161,0.189114,0.0,0.521304,0.039475,0.0,...,0.491734,0.16343,0.552469,0.0,0.419418,0.30692,0.224105,0.072518,0.652248,0.505568
4,happy,0.274483,0.232007,0.601821,0.281365,0.900241,1.0,0.784789,0.198816,0.0,...,0.703261,0.549239,0.425561,0.0,0.203916,0.561599,0.966706,0.108249,0.464104,0.786888


In [96]:
data.shape

(1161, 21)

In [97]:
eval_set.head(5)

Unnamed: 0,AU01,AU02,AU04,AU05,AU06,AU07,AU09,AU10,AU11,AU12,AU14,AU15,AU17,AU20,AU23,AU24,AU25,AU26,AU28,AU43
0,0.405237,0.479319,0.265762,0.274633,0.580491,1.0,0.548356,0.023748,0.0,0.860109,0.660568,0.330212,0.474759,0.0,0.560595,0.339413,0.712796,0.041511,0.144623,0.377551
1,0.409071,0.38834,0.281202,0.318302,0.275725,0.0,0.43824,0.73788,1.0,0.280625,0.297689,0.618692,0.373158,1.0,0.362552,0.071052,0.999756,0.706376,0.097503,0.371425
2,0.35426,0.398113,0.184397,0.412723,0.119522,0.0,0.170188,0.195084,1.0,0.100838,0.534023,0.552444,0.511086,0.0,0.424541,0.537576,0.805593,0.156587,0.06454,0.101149
3,0.157341,0.140977,0.329866,0.341054,0.150011,0.0,0.263753,0.078781,0.0,0.114097,0.323225,0.151836,0.487765,0.0,0.253994,0.273257,0.035888,0.080303,0.563623,0.283418
4,0.273054,0.354161,0.177498,0.337357,0.155787,0.0,0.309611,0.002358,1.0,0.134032,0.228593,0.482891,0.363071,1.0,0.351714,0.436756,0.979297,0.319223,0.172709,0.204042


## Split data
Train/val/test split percentage: 70/20/10

In [99]:
labels = data["emotion"]
inputs = data.drop(labels="emotion", axis=1)

In [100]:
set(labels)

{'angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise'}

In [101]:
label_map = {
    'angry': 0, 
    'disgust': 1, 
    'fear': 2, 
    'happy': 3, 
    'neutral': 4, 
    'sad': 5, 
    'surprise': 6
    }
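
The same mapping can also be built programmatically instead of typed by hand. The sketch below is only an illustration: the names `label_map_sketch` and `reverse_map_sketch` are hypothetical, and the label list is copied from the set printed above.

```python
# Label names as they appear in the dataset (copied from the printed set).
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

# Sorting first makes the label -> index assignment deterministic.
label_map_sketch = {name: idx for idx, name in enumerate(sorted(emotions))}

# The inverse mapping is needed later to turn predictions back into words.
reverse_map_sketch = {idx: name for name, idx in label_map_sketch.items()}

encoded = [label_map_sketch[name] for name in ['neutral', 'disgust', 'sad']]
decoded = [reverse_map_sketch[idx] for idx in encoded]
print(encoded, decoded)
```

Because the labels are sorted alphabetically, this yields the same indices as the hand-written dictionary.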

In [None]:
#relabel data to integers so predictions are in numeric format
y = labels.map(label_map)
y

0       4
1       1
2       5
3       4
4       3
       ..
1156    5
1157    5
1158    4
1159    4
1160    6
Name: emotion, Length: 1161, dtype: int64

In [103]:
#convert to tensors for compatibility with pytorch
features = torch.tensor(inputs.values, dtype=torch.float32)  
y = torch.tensor(y.values, dtype=torch.long)

print(y.shape)
print(features.shape)   

torch.Size([1161])
torch.Size([1161, 20])


In [104]:
X_data, X_test, y_labels, y_test = train_test_split(
    features,
    y,
    test_size=.1,  # hold out 10% of all samples as the test set
    random_state=42,  # fixed seed so the shuffle is reproducible
    stratify=y,  # keep the label distribution the same in both parts
)
X_train, X_val, y_train, y_val = train_test_split(
    X_data,
    y_labels,
    test_size=.2222222222,  # 0.2 / 0.9: 20% of the full dataset, taken from the remaining 90%
    random_state=42,  # fixed seed so the shuffle is reproducible
    stratify=y_labels,  # keep the label distribution the same in train and val
)

In [105]:
len(X_test)/len(inputs)

0.10077519379844961

In [106]:
len(X_val)/len(inputs)

0.19982773471145565

In [107]:
len(X_train)/len(inputs)

0.6993970714900948

close enough!
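
The slightly odd `test_size` of 0.2222 in the second split, and the not-quite-round ratios above, both follow from simple arithmetic. The sketch below reproduces the split sizes without scikit-learn. It assumes that a fractional `test_size` is rounded up to a whole number of samples, which is my understanding of `train_test_split` and matches the ratios printed above.

```python
import math

n_samples = 1161      # rows in dataset.csv, from data.shape
test_fraction = 0.1   # first split: share of the full dataset
val_target = 0.2      # desired validation share of the full dataset

# The second split only sees the 90% left after removing the test set,
# so the validation fraction is rescaled: 0.2 / 0.9, about 0.2222.
val_fraction = round(val_target / (1 - test_fraction), 10)

# Assumption: a float test_size is rounded up to whole samples.
n_test = math.ceil(test_fraction * n_samples)
n_val = math.ceil(val_fraction * (n_samples - n_test))
n_train = n_samples - n_test - n_val

shares = [round(n / n_samples, 4) for n in (n_train, n_val, n_test)]
print(n_train, n_val, n_test, shares)
```

With 1161 rows an exact 70/20/10 split is impossible, so the shares land a fraction of a percent away from the targets.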

# Model Definition
See https://optuna.org/ for a tutorial on hyperparameter tuning.
See https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_simple.py for example code.

In [None]:
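# Simple sequential classifier (20 input features, 7 output classes).
# Pass an Optuna trial to sample the hyperparameters, or a params dict
# (e.g. study.best_trial.params) to rebuild a specific architecture.
# Note for the future: don't be lazy, just extend nn.Module next time.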
def create_model(trial = None, params = None):

    n_layers = trial.suggest_int('n_layers', 1, 3) if trial else params['n_layers']
    layers = []

    in_features = 20
    for i in range(n_layers):
        out_features = trial.suggest_int(f'n_units_l{i}', 32, 128,step=32) if trial else params[f'n_units_l{i}']
        layers.append(torch.nn.Linear(in_features, out_features))
        layers.append(torch.nn.ReLU())
        perc = trial.suggest_float("dropout_l{}".format(i), 0, 0.5, step=.1) if trial else params[f'dropout_l{i}']
        layers.append(torch.nn.Dropout(perc))
        in_features = out_features
    
    layers.append(torch.nn.Linear(in_features, 7))
    return nn.Sequential(*layers)
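
To get a feeling for how large the architecture search space defined by `create_model` is, the discrete choices can simply be counted. This is a back-of-the-envelope sketch based on the `suggest_int` and `suggest_float` ranges in the function; the variable names are made up for the illustration, and the continuous learning rate tuned in the objective is not included.

```python
# Discrete choices taken from the suggest_* calls in create_model.
layer_counts = range(1, 4)                               # n_layers: 1, 2, 3
unit_choices = list(range(32, 129, 32))                  # 32, 64, 96, 128
dropout_choices = [round(0.1 * i, 1) for i in range(6)]  # 0.0, 0.1, ..., 0.5

# Every hidden layer independently picks a width and a dropout rate.
per_layer = len(unit_choices) * len(dropout_choices)

# A network with n hidden layers therefore has per_layer ** n variants.
n_architectures = sum(per_layer ** n for n in layer_counts)
print(per_layer, n_architectures)
```

A study with a few dozen trials therefore samples only a small part of this space, which is why the sampler and pruning matter.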

In [None]:
def objective(trial):
    global best_acc, best_model_state
    # initialise the model; everything runs on the CPU, the `device` defined above is not used
    model = create_model(trial)
    
    learning_rate_init = trial.suggest_float(
        "learning_rate_init", 1e-5, 1e-3, log=True
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate_init)
    criterion = nn.CrossEntropyLoss()   #multiclass classification
    epochs = 10
    
    #Training loop
    for epoch in range(epochs):
        model.train() #enables dropout
        #loop over data
        for (data, target) in zip(X_train,y_train):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output,target)
            loss.backward()
            optimizer.step()
            
        #validation
        model.eval()    #eval() for disabling dropout
        correct = 0
        with torch.no_grad():
            for (data,target) in zip(X_val,y_val):
                output = model(data)
                pred = output.argmax()
                correct += pred.eq(target.view_as(pred)).sum().item()
                
        accuracy = correct/len(X_val)   # note: validation loss would give a smoother objective than validation accuracy
        
        trial.report(accuracy, epoch)

        # Handle pruning based on the intermediate value.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()
    
    # remember the weights of the best completed trial so far (pruned trials never reach this point)
    if accuracy > best_acc:
        best_acc = accuracy
        best_model_state = model.state_dict()
    
    return accuracy
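
The objective above feeds one sample at a time through the network, which is part of why each trial takes minutes. A possible alternative, sketched below, is to train on mini-batches with a `DataLoader` and to compute accuracy on a whole tensor at once. The random data, the fixed small network, the batch size of 16 and the helper name `batch_accuracy` are all assumptions made for the illustration; this is not the code that produced the results in this notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for X_train / y_train: 20 features, 7 classes.
X_demo = torch.rand(64, 20)
y_demo = torch.randint(0, 7, (64,))

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 7))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Batches of 16 samples, reshuffled every epoch.
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=16, shuffle=True)

model.train()
for batch_x, batch_y in loader:
    optimizer.zero_grad()
    # logits have shape (batch, 7), targets have shape (batch,)
    loss = criterion(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()

def batch_accuracy(model, X, y):
    """Fraction of samples whose arg-max logit matches the label."""
    model.eval()
    with torch.no_grad():
        preds = model(X).argmax(dim=1)
    return (preds == y).float().mean().item()

demo_acc = batch_accuracy(model, X_demo, y_demo)
print(len(loader), demo_acc)
```

Batching changes the optimisation dynamics (fewer, less noisy updates per epoch), so the learning rate and epoch count would probably need retuning.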

# Tuning


In [138]:
best_acc = 0
best_model_state = None
#create study and start tuning!
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20, timeout=600)

best = study.best_trial
torch.save(best_model_state, "best_model.pth")

[I 2024-11-22 15:21:19,001] A new study created in memory with name: no-name-7a1130ff-fe57-4675-95cb-a0bea67b6724
[I 2024-11-22 15:25:23,394] Trial 0 finished with value: 0.5818965517241379 and parameters: {'n_layers': 3, 'n_units_l0': 64, 'dropout_l0': 0.30000000000000004, 'n_units_l1': 128, 'dropout_l1': 0.5, 'n_units_l2': 32, 'dropout_l2': 0.5, 'learning_rate_init': 0.0001892126622843035}. Best is trial 0 with value: 0.5818965517241379.
[I 2024-11-22 15:26:49,062] Trial 1 finished with value: 0.5948275862068966 and parameters: {'n_layers': 2, 'n_units_l0': 128, 'dropout_l0': 0.2, 'n_units_l1': 128, 'dropout_l1': 0.2, 'learning_rate_init': 0.0008210126044095739}. Best is trial 1 with value: 0.5948275862068966.
[I 2024-11-22 15:28:13,190] Trial 2 finished with value: 0.6120689655172413 and parameters: {'n_layers': 3, 'n_units_l0': 96, 'dropout_l0': 0.30000000000000004, 'n_units_l1': 128, 'dropout_l1': 0.5, 'n_units_l2': 96, 'dropout_l2': 0.4, 'learning_rate_init': 0.000170762424824720

In [139]:
#best trial parameters
print("  Params: ")
for key, value in best.params.items():
    print("    {}: {}".format(key, value))

  Params: 
    n_layers: 3
    n_units_l0: 96
    dropout_l0: 0.30000000000000004
    n_units_l1: 128
    dropout_l1: 0.5
    n_units_l2: 96
    dropout_l2: 0.4
    learning_rate_init: 0.00017076242482472065


# Evaluation

Load the best trial's parameters and saved weights, then evaluate on the held-out test set.

In [142]:
#params
best_model = create_model(params = best.params)

#load weights
best_model.load_state_dict(torch.load("best_model.pth"))

#eval
best_model.eval()    #eval() for disabling dropout
correct = 0
with torch.no_grad():
    for (data,target) in zip(X_test,y_test):
        output = best_model(data)
        pred = output.argmax()
        correct += pred.eq(target.view_as(pred)).sum().item()
                
accuracy = correct/len(X_test)


In [None]:
# test accuracy of the best tuned model as-is (no early stopping or further training)
accuracy

0.6239316239316239
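
With a test set this small, the accuracy figure carries noticeable uncertainty. The sketch below gives a rough sense of it using a normal-approximation interval for a proportion. The test-set size of 117 is inferred from the 10% split of 1161 rows rather than printed anywhere in the notebook, and the interval is only an approximation.

```python
import math

n_test = 117                   # assumption: size of the 10% test split
accuracy = 0.6239316239316239  # test accuracy printed above

# Number of correctly classified test samples implied by the accuracy.
n_correct = round(accuracy * n_test)

# Normal-approximation standard error of a proportion.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

# Uniform guessing over 7 classes, for comparison.
chance_level = 1 / 7
print(n_correct, round(std_err, 3), round(low, 3), round(high, 3))
```

Even the lower end of this rough interval sits well above uniform guessing, but the point estimate should not be read as precise to more than a few percentage points.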

# Classify

In [144]:
test_to_submit = pd.read_csv("test_to_submit.csv")

In [145]:
test_to_submit.head(5)

Unnamed: 0,AU01,AU02,AU04,AU05,AU06,AU07,AU09,AU10,AU11,AU12,AU14,AU15,AU17,AU20,AU23,AU24,AU25,AU26,AU28,AU43
0,0.405237,0.479319,0.265762,0.274633,0.580491,1.0,0.548356,0.023748,0.0,0.860109,0.660568,0.330212,0.474759,0.0,0.560595,0.339413,0.712796,0.041511,0.144623,0.377551
1,0.409071,0.38834,0.281202,0.318302,0.275725,0.0,0.43824,0.73788,1.0,0.280625,0.297689,0.618692,0.373158,1.0,0.362552,0.071052,0.999756,0.706376,0.097503,0.371425
2,0.35426,0.398113,0.184397,0.412723,0.119522,0.0,0.170188,0.195084,1.0,0.100838,0.534023,0.552444,0.511086,0.0,0.424541,0.537576,0.805593,0.156587,0.06454,0.101149
3,0.157341,0.140977,0.329866,0.341054,0.150011,0.0,0.263753,0.078781,0.0,0.114097,0.323225,0.151836,0.487765,0.0,0.253994,0.273257,0.035888,0.080303,0.563623,0.283418
4,0.273054,0.354161,0.177498,0.337357,0.155787,0.0,0.309611,0.002358,1.0,0.134032,0.228593,0.482891,0.363071,1.0,0.351714,0.436756,0.979297,0.319223,0.172709,0.204042


In [None]:
#convert to tensors for compatibility with pytorch
features = torch.tensor(test_to_submit.values, dtype=torch.float32)


print(features.shape)

torch.Size([233, 20])


In [162]:
output = best_model(features).argmax(axis=1)

In [152]:
output.shape

torch.Size([233])

In [153]:
reverse_label_map = dict(zip(label_map.values(), label_map.keys()))
reverse_label_map

{0: 'angry',
 1: 'disgust',
 2: 'fear',
 3: 'happy',
 4: 'neutral',
 5: 'sad',
 6: 'surprise'}

In [163]:
output = pd.Series(output.numpy())

In [164]:
output.head(5)

0    3
1    6
2    4
3    4
4    6
dtype: int64

In [165]:
output = output.map(reverse_label_map)

In [166]:
output.head()

0       happy
1    surprise
2     neutral
3     neutral
4    surprise
dtype: object

In [169]:
output.to_csv("outputs",header=False,index=False)
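
Since a wrongly named or malformed file means automatic failure, it may be worth checking the written file before submitting. The helper below is a hypothetical sketch (the name `check_outputs_file` is not part of the assignment); it checks the file name, the number of lines and that every line is one of the seven emotion labels. It is demonstrated on a throwaway file so that it does not touch the real submission.

```python
import os
import tempfile

VALID_LABELS = {'angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise'}

def check_outputs_file(path, expected_rows):
    """Return a list of problems found in the submission file (empty if none)."""
    problems = []
    if os.path.basename(path) != "outputs":
        problems.append("file must be named 'outputs' with no extension")
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    if len(lines) != expected_rows:
        problems.append(f"expected {expected_rows} lines, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        if line not in VALID_LABELS:
            problems.append(f"line {number}: unexpected content {line!r}")
    return problems

# Demo on a temporary file; for the real submission one would call
# check_outputs_file("outputs", 233), since test_to_submit.csv has 233 rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "outputs")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("happy\nsurprise\nneutral\n")
    demo_problems = check_outputs_file(demo_path, expected_rows=3)
    short_problems = check_outputs_file(demo_path, expected_rows=4)

print(demo_problems, short_problems)
```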