# EmoDJ Music Emotion Recognition Model

This notebook trains, tunes, and tests the EmoDJ Music Emotion Recognition Model, a support vector regressor (SVR) with an RBF kernel.

Two models are built: one predicts the valence value and the other predicts the arousal value. The trained models are saved so that the EmoDJ Music Player can use them to recognise music emotion.

### Load dataset
The Emotion in Music Database (1000 songs) is used. It was downloaded from http://cvml.unige.ch/databases/emoMusic/

Reference:<br>
1000 Songs for Emotional Analysis of Music. Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia.
https://ibug.doc.ic.ac.uk/media/uploads/documents/cmm13-soleymani.pdf



In [1]:
import pandas as pd
import numpy as np

MODEL_FOLDER = 'model/'
DATASET_ANNOT_FOLDER = 'dataset/annotations/'
DATASET_FEATURE_FOLDER = 'dataset/default_features/'
DATASET_CLIPS_FOLDER = 'dataset/clips_45seconds/'
ID_FIELD = 'song_id'

def load_dataset():
    # Continuous arousal and valence annotations, one row per song ID
    arousal_df = pd.read_csv(MODEL_FOLDER + DATASET_ANNOT_FOLDER + 'arousal_cont_average.csv', header=0)
    valence_df = pd.read_csv(MODEL_FOLDER + DATASET_ANNOT_FOLDER + 'valence_cont_average.csv', header=0)

    # Clip file names are derived from the song IDs in the arousal annotations
    song_list = [s + '.mp3' for s in arousal_df[ID_FIELD].astype(str).values]

    # Keep only the annotation values
    arousal_df = arousal_df.drop(ID_FIELD, axis=1).astype(np.float16)
    valence_df = valence_df.drop(ID_FIELD, axis=1).astype(np.float16)

    return song_list, arousal_df, valence_df

train_song_list,arousal_df, valence_df = load_dataset()
arousal_vector = arousal_df.values.flatten()
valence_vector = valence_df.values.flatten()
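
The flattening step fixes the order of the targets: `flatten()` walks the table row by row, so all time steps of the first song come first, then all time steps of the second song, and so on. The feature rows built later have to follow the same song-major order. A minimal sketch with a made-up array (the shape and values are hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical annotations: 2 songs (rows) x 3 time steps (columns), made-up values
toy_annotations = np.array([[0.1, 0.2, 0.3],
                            [-0.2, -0.1, 0.0]])
n_steps = toy_annotations.shape[1]

toy_vector = toy_annotations.flatten()

# The value for song index s at time step t sits at position s * n_steps + t
song, step = 1, 2
print(toy_vector[song * n_steps + step] == toy_annotations[song, step])
```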

### Pre-process Features
Extract MFCCs from the music clips to use as training features.

Reference:<br>
Feature Selection for Content-Based, Time-Varying Musical Emotion Regression http://music.ece.drexel.edu/files/Navigation/Publications/schmidt2010.pdf

In [4]:
import os
import librosa
import sklearn

def preprocess_feature(folder_path, song_list, istrain):
    n_mfcc = 12
    n_frames = 22      # MFCC frames per 500 ms window at librosa's default sr and hop length
    n_intervals = 61   # feature rows per song needed to match the training annotations
    mfcc_all = []
    # MFCCs per time period (500 ms)
    for file_name in song_list:
        count = 0
        x, sr = librosa.load(folder_path + file_name)
        window = int(sr * 500 / 1000)
        # For training data, skip the first 15 s to align with the dataset
        start = int(sr * 15) if istrain else 0
        for i in range(start, len(x), window):
            x_cont = x[i:i + window]
            # Recent librosa versions require the signal as the keyword argument y
            mfccs = librosa.feature.mfcc(y=x_cont, sr=sr, n_mfcc=n_mfcc)
            # Zero-pad intervals shorter than 500 ms to a fixed number of frames
            mfccs = np.hstack((mfccs, np.zeros((n_mfcc, n_frames - mfccs.shape[1]))))
            mfccs = mfccs.flatten()
            mfcc_all.append(mfccs)
            count += 1
        # For training data, repeat the last feature row for clips shorter than 45 s
        # so that every song contributes n_intervals rows
        if istrain:
            for _ in range(n_intervals - count):
                mfcc_all.append(mfccs)

    return np.vstack(mfcc_all)

feature_matrix = preprocess_feature(MODEL_FOLDER + DATASET_CLIPS_FOLDER, train_song_list, True)
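
The padding step is what keeps every feature row the same length even when the last interval of a clip is shorter than 500 ms. The sketch below isolates that step on a stand-in matrix; `pad_and_flatten` is a hypothetical helper written for illustration, and the input is an array of ones rather than real MFCCs.

```python
import numpy as np

def pad_and_flatten(mfccs, n_frames=22):
    """Zero-pad an (n_mfcc, frames) matrix to n_frames columns, then flatten it."""
    n_mfcc, frames = mfccs.shape
    padded = np.hstack((mfccs, np.zeros((n_mfcc, n_frames - frames))))
    return padded.flatten()

# Stand-in for the MFCCs of a short final interval: 12 coefficients x 5 frames
short_interval = np.ones((12, 5))
row = pad_and_flatten(short_interval)
print(row.shape)  # (264,)
```

Because the flattening is row-major, each coefficient occupies a block of 22 consecutive values, with the padding zeros at the end of each block.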

In [5]:
print(feature_matrix.shape)

(45384, 264)
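
The printed shape can be checked by hand. The arithmetic below assumes librosa's defaults, which the code above does not override: a 22050 Hz sampling rate from `librosa.load`, and a hop length of 512 samples with centred frames in `librosa.feature.mfcc`.

```python
sr = 22050          # assumed: default sampling rate of librosa.load
hop_length = 512    # assumed: default hop length of librosa.feature.mfcc
n_mfcc = 12

window = int(sr * 500 / 1000)        # samples in one 500 ms interval
frames = 1 + window // hop_length    # frames per interval with centred framing
features_per_row = n_mfcc * frames   # flattened MFCC values per interval

rows_per_song = 61                   # rows each training song is padded to
n_songs = 45384 // rows_per_song     # songs implied by the printed row count

print(window, frames, features_per_row, n_songs)  # 11025 22 264 744
```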


### Hyperparameter Tuning
An SVR with an RBF kernel is used. The regularisation parameter `C` is tuned with a 5-fold grid search on a 75% training split.

Reference:<br>
Automated Music Emotion Recognition: A Systematic Evaluation 
https://pdfs.semanticscholar.org/1a43/325def098ee57ff6f1c0b19a30811fe92304.pdf <br>

In [6]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

grid_parameters = [{'C': [1, 10, 100, 1000]}]

feature_matrix_mean = feature_matrix.mean(axis=0)
feature_matrix_std = feature_matrix.std(axis=0)
scaler = StandardScaler()

feature_matrix_scaled = scaler.fit_transform(feature_matrix)

X_aro_train, X_aro_test, y_aro_train, y_aro_test = train_test_split(feature_matrix_scaled, arousal_vector, test_size=0.25,
                                              random_state=5)

X_val_train, X_val_test, y_val_train, y_val_test = train_test_split(feature_matrix_scaled, valence_vector, test_size=0.25,
                                              random_state=5)

svr_arousal=SVR(kernel='rbf',gamma='auto')
svr_arousal_cv=GridSearchCV(svr_arousal,grid_parameters,cv=5,return_train_score=True)
svr_arousal_cv.fit(X_aro_train, y_aro_train)
print("Best parameters set found for arousal SVR:")
print(svr_arousal_cv.best_params_)
print("Score:", svr_arousal_cv.best_score_)

svr_valence=SVR(kernel='rbf',gamma='auto')
svr_valence_cv=GridSearchCV(svr_valence,grid_parameters,cv=5,return_train_score=True)
svr_valence_cv.fit(X_val_train, y_val_train)
print("Best parameters set found for valence SVR:")
print(svr_valence_cv.best_params_)
print("Score:", svr_valence_cv.best_score_)


Best parameters set found for arousal SVR:
{'C': 1}
Score: 0.6597154521291105
Best parameters set found for valence SVR:
{'C': 1}
Score: 0.4534642313757411
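
The scores above are R² values, which is what `GridSearchCV` reports for a regressor when no `scoring` argument is given. In the cell above the scaler is fitted on the whole feature matrix before splitting, so the held-out folds contribute to the scaling statistics. One common alternative, sketched here on synthetic data (the random features and targets are stand-ins, not the dataset), is to put the scaler inside a `Pipeline` so that it is refitted on the training part of each fold.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 8))                            # synthetic features
y_toy = X_toy[:, 0] * 0.5 + rng.normal(scale=0.1, size=120)  # synthetic target

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('svr', SVR(kernel='rbf', gamma='auto')),
])
# Pipeline parameters are addressed as <step name>__<parameter name>
toy_grid = {'svr__C': [1, 10, 100]}

search = GridSearchCV(pipeline, toy_grid, cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

Whether this changes the selected `C` for the real features is not something this sketch can show; it only illustrates the mechanics.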


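### Compute 3-fold cross validation on average distance and mean absolute errors
The evaluation below repeats a random 95%/5% train/test split three times, which is a repeated hold-out rather than a partitioned k-fold split. For each run it records the mean absolute error of each regressor and the average Euclidean distance between predicted and annotated points in the valence-arousal plane, and the results are then averaged over the three runs.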

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

pred_aro_cv = []
pred_val_cv = []
average_distance_cv = []
mae_aro_cv = []
mae_val_cv = []

feature_matrix_mean = feature_matrix.mean(axis=0)
feature_matrix_std = feature_matrix.std(axis=0)
scaler = StandardScaler()

feature_matrix_scaled = scaler.fit_transform(feature_matrix)

for i in range(3):
    # Pair arousal and valence targets so that both are split identically
    y_for_split = np.column_stack((arousal_vector, valence_vector))

    X_train, X_test, y_train, y_test = train_test_split(feature_matrix_scaled, y_for_split, test_size=0.05)

    y_aro_train, y_val_train = np.hsplit(y_train, 2)
    y_aro_test, y_val_test = np.hsplit(y_test, 2)
    
    y_aro_train = y_aro_train.flatten()
    y_val_train = y_val_train.flatten()
    y_aro_test = y_aro_test.flatten()
    y_val_test = y_val_test.flatten()
    
    svr_arousal_score=SVR(kernel='rbf',gamma='auto',C=1.0)
    svr_valence_score=SVR(kernel='rbf',gamma='auto',C=1.0)
    svr_arousal_score.fit(X_train,y_aro_train)
    svr_valence_score.fit(X_train,y_val_train)
    pred_aro_test = svr_arousal_score.predict(X_test)
    pred_val_test = svr_valence_score.predict(X_test)
    
    mae_aro = mean_absolute_error(y_aro_test,pred_aro_test)
    mae_val = mean_absolute_error(y_val_test,pred_val_test)
    average_distance = np.sqrt((pred_aro_test-y_aro_test)**2+(pred_val_test-y_val_test)**2).mean()
    
    pred_aro_cv.append(pred_aro_test)
    pred_val_cv.append(pred_val_test)
    mae_aro_cv.append(mae_aro)
    mae_val_cv.append(mae_val)
    average_distance_cv.append(average_distance)

In [7]:
print('Average distance on 3-fold cross validation:', sum(average_distance_cv)/len(average_distance_cv))
print('Mean absolute error of valence value on 3-fold cross validation:', sum(mae_val_cv)/len(mae_val_cv))
print('Mean absolute error of arousal value on 3-fold cross validation:', sum(mae_aro_cv)/len(mae_aro_cv))

Average distance on 3-fold cross validation: 0.20419702700876996
Mean absolute error of valence value on 3-fold cross validation: 0.13571890856879773
Mean absolute error of arousal value on 3-fold cross validation: 0.12937129979933767
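
The two kinds of metric measure different things: the mean absolute error is computed per dimension, while the average distance treats each prediction as a point in the valence-arousal plane and measures how far it lands from the annotated point. A small worked example with made-up numbers (not dataset values) shows the same formulas as the loop above:

```python
import numpy as np

# Made-up ground truth and predictions for three samples
y_aro = np.array([0.10, -0.20, 0.30])
y_val = np.array([0.00, 0.40, -0.10])
p_aro = np.array([0.13, -0.20, 0.30])
p_val = np.array([0.04, 0.40, -0.40])

# Per-dimension mean absolute errors
mae_aro_toy = np.abs(p_aro - y_aro).mean()
mae_val_toy = np.abs(p_val - y_val).mean()

# Euclidean distance per sample in the valence-arousal plane, then averaged
per_sample = np.sqrt((p_aro - y_aro) ** 2 + (p_val - y_val) ** 2)
avg_dist_toy = per_sample.mean()

print(mae_aro_toy, mae_val_toy, avg_dist_toy)
```

Here the per-sample distances are 0.05, 0 and 0.3, so the average distance is about 0.117, while the two mean absolute errors are 0.01 and about 0.113.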


### 1-sided T Test against mean absolute error of baseline
Reference: <br>
Automated Music Emotion Recognition: A Systematic Evaluation <br>
Table 7. MAE and R2 achieved by the best regressors (SVR-RBF with parameter search and using all standardized features)<br>
https://pdfs.semanticscholar.org/1a43/325def098ee57ff6f1c0b19a30811fe92304.pdf

In [17]:
import scipy.stats as stats
print("P-value for H1 MAE of proposed valence regressor < Baseline:")
print(stats.ttest_1samp(mae_val_cv,0.198).pvalue*0.5)

print("P-value for H1 MAE of proposed arousal regressor < Baseline:")
print(stats.ttest_1samp(mae_aro_cv,0.156).pvalue*0.5)

P-value for H1 MAE of proposed valence regressor < Baseline:
0.00023089405986836825
P-value for H1 MAE of proposed arousal regressor < Baseline:
0.0010279466571317222
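
Halving the two-sided p-value gives the one-sided p-value only when the t statistic points in the direction of H1, that is, when the sample mean is below the baseline. A sketch of the same test using the `alternative` argument of `scipy.stats.ttest_1samp`, which assumes SciPy 1.6 or later; the three MAE values are hypothetical stand-ins for `mae_val_cv`:

```python
import math

import numpy as np
from scipy import stats

toy_mae = np.array([0.135, 0.137, 0.134])  # hypothetical per-run MAE values
baseline = 0.198                           # valence baseline used above

two_sided = stats.ttest_1samp(toy_mae, baseline)
# H1: the mean MAE is less than the baseline
one_sided = stats.ttest_1samp(toy_mae, baseline, alternative='less')

# The halved two-sided p-value matches only because the statistic is negative
print(two_sided.statistic < 0)
print(math.isclose(one_sided.pvalue, two_sided.pvalue / 2, rel_tol=1e-9))
```

With only three runs the test has two degrees of freedom, so the result is sensitive to each individual run.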


### Train Arousal and Valence SVR with all training data

In [8]:
arousal_model = SVR(kernel='rbf',gamma='auto',C=1)
valence_model = SVR(kernel='rbf',gamma='auto',C=1)

arousal_model.fit(feature_matrix_scaled, arousal_vector)
valence_model.fit(feature_matrix_scaled, valence_vector)

SVR(C=1, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

### Save Models

In [11]:
import pickle

with open(MODEL_FOLDER + 'arousal_model.pkl', 'wb') as f:
    pickle.dump(arousal_model,f)
with open(MODEL_FOLDER + 'valence_model.pkl', 'wb') as f:
    pickle.dump(valence_model,f)
with open(MODEL_FOLDER + 'feature_matrix_mean.pkl', 'wb') as f:
    pickle.dump(feature_matrix_mean,f)
with open(MODEL_FOLDER + 'feature_matrix_std.pkl', 'wb') as f:
    pickle.dump(feature_matrix_std,f)
with open(MODEL_FOLDER + 'feature_matrix_scaled.pkl', 'wb') as f:
    pickle.dump(feature_matrix_scaled,f)    
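
The saved mean and standard deviation are what a consumer needs in order to scale new features the same way as the training data, since `np.std` and `StandardScaler` both use the population standard deviation by default. The sketch below shows one way the saved statistics could be loaded and applied; it is an assumption about usage, not the EmoDJ Music Player's actual code, and it uses synthetic features and a temporary folder rather than the real files.

```python
import os
import pickle
import tempfile

import numpy as np

rng = np.random.default_rng(0)
toy_features = rng.normal(loc=3.0, scale=2.0, size=(50, 4))  # synthetic stand-in
toy_mean = toy_features.mean(axis=0)
toy_std = toy_features.std(axis=0)

with tempfile.TemporaryDirectory() as folder:
    # Save the statistics with pickle, then load them back
    for name, obj in (('mean', toy_mean), ('std', toy_std)):
        with open(os.path.join(folder, name + '.pkl'), 'wb') as f:
            pickle.dump(obj, f)
    with open(os.path.join(folder, 'mean.pkl'), 'rb') as f:
        loaded_mean = pickle.load(f)
    with open(os.path.join(folder, 'std.pkl'), 'rb') as f:
        loaded_std = pickle.load(f)

# Guard against zero-variance columns before dividing
safe_std = np.where(loaded_std == 0, 1.0, loaded_std)

new_row = toy_features[:1]                         # one new feature row
new_row_scaled = (new_row - loaded_mean) / safe_std
# With a loaded model, the next step would be e.g. arousal_model.predict(new_row_scaled)
print(new_row_scaled.shape)  # (1, 4)
```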