## Evaluating regression techniques for speaker characterization
### Laura Fernández Gallardo

As was done for classification, this notebook assesses the performance of different regression techniques for characterizing users by their voices.

I am addressing the prediction of each of the 34 interpersonal speaker [characteristics](https://github.com/laufergall/Subjective_Speaker_Characteristics) (continuous numeric labels of the [NSC corpus](http://www.qu.tu-berlin.de/?id=nsc-corpus)). These characteristics are: 

'non_likable', 'secure', 'attractive', 'unsympathetic', 'indecisive', 'unobtrusive', 'distant', 'bored', 'emotional', 'not_irritated', 'active', 'pleasant', 'characterless', 'sociable', 'relaxed', 'affectionate', 'dominant', 'unaffected', 'hearty', 'old', 'personal', 'calm', 'incompetent', 'ugly', 'friendly', 'masculine', 'submissive', 'indifferent', 'interesting', 'cynical', 'artificial', 'intelligent', 'childish', 'modest'.

I will consider the common RMSE (root-mean-square error) and the MAPE (median absolute percentage error), which is more robust to outliers, as the metrics for success:

\begin{equation}
RMSE = \sqrt{\frac{\sum_i(y_i-\hat{y}_i)^2}{n}}  
\end{equation}

\begin{equation}
MAPE = \operatorname{median}_i \left( \left| \frac{y_i-\hat{y}_i}{y_i} \right| \right)
\end{equation}

where $y_i$ and $\hat{y}_i$ are the observed and the predicted values for the $i^{th}$ data point, and $n$ is the number of data points.
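
The two metrics defined above can be sketched with NumPy as follows. This is a minimal illustration, not the notebook's own implementation: the function names `rmse` and `mape` and the example values are assumptions introduced here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Median absolute percentage error, returned as a fraction.

    Assumes that no observed value is zero, since each error is
    divided by the corresponding observed value.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.median(np.abs((y_true - y_pred) / y_true))

# hypothetical observed and predicted ratings, for illustration only
y_obs = [50, 60, 40]
y_hat = [45, 66, 40]
print(rmse(y_obs, y_hat), mape(y_obs, y_hat))
```

Because the median ignores the magnitude of the largest errors, a single badly predicted speaker would inflate the RMSE far more than the MAPE.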


In [1]:
import io
import requests
import time # for timestamps

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# fix random seed for reproducibility
seed = 2302
np.random.seed(seed)

## Speaker characteristics 

The file "SC_ratings_medians.csv" and the "ratings_SC_means.csv" file loaded below have been generated in the ..\data folder.

We use the same train/test partition of speakers as for the WAAT classification task.

In [3]:
# load scores (averaged across listeners)

path = "https://raw.githubusercontent.com/laufergall/ML_Speaker_Characteristics/master/data/generated_data/"

url = path + "ratings_SC_means.csv"
s = requests.get(url).content
ratings = pd.read_csv(io.StringIO(s.decode('utf-8')))

# names of the speaker characteristics: all columns except the speaker metadata
sc_names = list(ratings.drop(['speaker_ID','speaker_gender'], axis=1))

ratings.head()


| | speaker_ID | speaker_gender | non_likable | secure | attractive | unsympathetic | indecisive | unobtrusive | distant | bored | ... | friendly | masculine | submissive | indifferent | interesting | cynical | artificial | intelligent | childish | modest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | female | 36.571429 | 65.214286 | 59.785714 | 37.357143 | 33.714286 | 66.857143 | 35.642857 | 35.642857 | ... | 75.428571 | 20.285714 | 59.0 | 34.571429 | 60.571429 | 43.071429 | 35.785714 | 65.285714 | 46.857143 | 61.071429 |
| 1 | 2 | female | 66.666667 | 57.2 | 39.333333 | 54.066667 | 33.066667 | 57.266667 | 56.466667 | 55.733333 | ... | 55.6 | 18.333333 | 55.133333 | 58.733333 | 38.533333 | 51.533333 | 63.2 | 51.133333 | 33.533333 | 60.266667 |
| 2 | 3 | female | 45.8125 | 72.5625 | 47.125 | 30.9375 | 27.9375 | 46.25 | 38.625 | 33.4375 | ... | 64.125 | 19.9375 | 46.4375 | 41.5625 | 55.5625 | 50.25 | 40.6875 | 60.25 | 14.4375 | 54.8125 |
| 3 | 4 | male | 40.071429 | 59.857143 | 44.571429 | 54.428571 | 35.071429 | 52.285714 | 48.571429 | 49.785714 | ... | 51.428571 | 75.785714 | 47.071429 | 51.357143 | 49.142857 | 55.857143 | 38.071429 | 55.785714 | 40.5 | 46.928571 |
| 4 | 5 | male | 42.117647 | 60.529412 | 53.823529 | 50.764706 | 35.705882 | 59.764706 | 49.764706 | 42.647059 | ... | 54.176471 | 80.764706 | 47.823529 | 53.235294 | 57.352941 | 47.705882 | 35.823529 | 62.823529 | 29.294118 | 49.823529 |


In [6]:
# train/test partitions

# read the partitions created for the multioutput multiclass classification
url = path + "classes_train.csv"
s = requests.get(url).content
classes_train = pd.read_csv(io.StringIO(s.decode('utf-8')))

url = path + "classes_test.csv"
s = requests.get(url).content
classes_test = pd.read_csv(io.StringIO(s.decode('utf-8')))
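
Reusing the classification partition for regression amounts to selecting the rating rows of the speakers in each split. The sketch below illustrates this on toy data. It assumes that the partition files identify speakers in a `speaker_ID` column, which is not shown in this notebook. The helper `split_by_speaker`, the toy dataframes, and the assignment of speakers to splits are hypothetical.

```python
import pandas as pd

# Toy stand-in for `ratings` (values rounded from the head shown above).
ratings_toy = pd.DataFrame({
    "speaker_ID": [1, 2, 3, 4, 5],
    "non_likable": [36.57, 66.67, 45.81, 40.07, 42.12],
})

# Assumption: each partition file lists its speakers in a 'speaker_ID' column.
# The split below is made up for illustration.
train_ids_toy = pd.DataFrame({"speaker_ID": [1, 2, 4]})
test_ids_toy = pd.DataFrame({"speaker_ID": [3, 5]})

def split_by_speaker(ratings, train_ids, test_ids, id_col="speaker_ID"):
    """Return the rating rows belonging to the train and test speakers."""
    train = ratings[ratings[id_col].isin(train_ids[id_col])]
    test = ratings[ratings[id_col].isin(test_ids[id_col])]
    # the two partitions must not share any speaker
    if not set(train[id_col]).isdisjoint(test[id_col]):
        raise ValueError("train and test partitions share speakers")
    return train, test

ratings_train, ratings_test = split_by_speaker(
    ratings_toy, train_ids_toy, test_ids_toy
)
print(len(ratings_train), len(ratings_test))
```

Splitting by speaker rather than by row keeps every speaker entirely within one partition, so the test error reflects performance on unseen voices.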


### Select trait for regression

In [5]:
# select a trait

target_trait = sc_names[0]

target = ratings[['speaker_ID', 'speaker_gender', target_trait]]
target.head()

| | speaker_ID | speaker_gender | non_likable |
|---|---|---|---|
| 0 | 1 | female | 36.571429 |
| 1 | 2 | female | 66.666667 |
| 2 | 3 | female | 45.8125 |
| 3 | 4 | male | 40.071429 |
| 4 | 5 | male | 42.117647 |
