# Contents
* [Intro](#Intro)
* [Imports and config](#Imports-and-config)
* [Load data](#Load-data)
  * [Undersample data](#Undersample-data)
* [Trim and pad](#Trim-and-pad)
* [Train test split](#Train-test-split)
* [Random Interval Spectral Ensemble](#Random-Interval-Spectral-Ensemble)
* [Results](#Results)
* [Discussion](#Discussion)

## Intro

This notebook explores the Random Interval Spectral Ensemble (RISE) method, which plays a role in time series classification similar to that of the random forest in tabular classification. It is applied here, without hyperparameter tuning, to binary (negative vs. non-negative) classification of the samples of medium duration. The model is trained directly on the padded wav arrays.

The results are better than a dummy classifier but not by much: accuracy on the held-out speakers is about 0.597, against 0.555 for the dummy. Training takes 1 h 45 min despite undersampling to less than 1% of the full dataset.

## Imports and config

In [1]:
# Core
import numpy as np
import pandas as pd
import librosa

# time series
from sktime.classification.interval_based import RandomIntervalSpectralForest

# util
import gc

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Extensions
%load_ext autotime
%load_ext lab_black
%load_ext nb_black

In [3]:
SEED = 2021

# Location of medium.pkl, which contains the samples of medium duration
PICKLED_DF_FOLDER = "../1.0-mic-divide_data_by_duration"

# Location where this notebook will output
DATA_OUT_FOLDER = "."

# The preprocessed data from the Unified Multilingual Dataset of Emotional Human utterances
WAV_DIRECTORY = (
    "../../unified_multilingual_dataset_of_emotional_human_utterances/data/preprocessed"
)

time: 2 ms


## Load data

In [4]:
medium_df = pd.read_pickle(f"{PICKLED_DF_FOLDER}/medium.pkl")
medium_df.head()

id,file,duration,source,speaker_id,speaker_gender,emo,valence,lang1,lang2,neg,neu,pos,length
0,00000+aesdd+aesdd.1+f+ang+-1+ell+el-gr.wav,4.129,aesdd,aesdd.1,f,ang,-1,ell,el-gr,1,0,0,medium
1,00001+aesdd+aesdd.2+f+ang+-1+ell+el-gr.wav,3.448,aesdd,aesdd.2,f,ang,-1,ell,el-gr,1,0,0,medium
2,00002+aesdd+aesdd.3+m+ang+-1+ell+el-gr.wav,3.98,aesdd,aesdd.3,m,ang,-1,ell,el-gr,1,0,0,medium
3,00003+aesdd+aesdd.4+m+ang+-1+ell+el-gr.wav,3.39,aesdd,aesdd.4,m,ang,-1,ell,el-gr,1,0,0,medium
4,00004+aesdd+aesdd.5+f+ang+-1+ell+el-gr.wav,4.042,aesdd,aesdd.5,f,ang,-1,ell,el-gr,1,0,0,medium


time: 233 ms


In [5]:
medium_df.duration.value_counts()

2.000000    226
3.004000    218
1.045000    201
2.603000    196
1.835000    196
           ... 
5.355937      1
3.723937      1
3.124938      1
2.721938      1
0.780000      1
Name: duration, Length: 6324, dtype: int64

time: 20 ms


### Undersample data

Let's grab 10% of samples from each data source. We only need enough to quickly test out several models.

In [6]:
sample_df = medium_df.groupby("source").sample(frac=0.10, random_state=SEED)
len(medium_df)
len(sample_df)
np.unique(medium_df.source) == np.unique(sample_df.source)
sample_df.head()

81099

8109

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True])

id,file,duration,source,speaker_id,speaker_gender,emo,valence,lang1,lang2,neg,neu,pos,length
1392,01392+BAUM1+BAUM1.s019+f+neu+0+tur+tr-tr.wav,3.821,BAUM1,BAUM1.s019,f,neu,0,tur,tr-tr,0,1,0,medium
723,00723+BAUM1+BAUM1.s017+f+dis+-1+tur+tr-tr.wav,3.16,BAUM1,BAUM1.s017,f,dis,-1,tur,tr-tr,1,0,0,medium
702,00702+BAUM1+BAUM1.s014+m+hap+1+tur+tr-tr.wav,2.755,BAUM1,BAUM1.s014,m,hap,1,tur,tr-tr,0,0,1,medium
1585,01585+BAUM1+BAUM1.s022+f+con+-1+tur+tr-tr.wav,1.795,BAUM1,BAUM1.s022,f,con,-1,tur,tr-tr,1,0,0,medium
652,00652+BAUM1+BAUM1.s008+m+ang+-1+tur+tr-tr.wav,4.513,BAUM1,BAUM1.s008,m,ang,-1,tur,tr-tr,1,0,0,medium


time: 246 ms


This is still more data than is necessary to test out a model, so we will undersample again.

In [7]:
smaller_sample = sample_df.sample(frac=0.10, random_state=SEED)

time: 8 ms
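
As a quick check on how much data survives the two rounds of undersampling, the arithmetic can be sketched from the row counts printed above. The only assumption here is that `sample(frac=...)` rounds `frac * len` to the nearest integer, which matches the 811 rows reported later in the notebook.

```python
# Row counts reported by the notebook outputs
n_medium = 81099  # len(medium_df)
n_first = 8109  # len(sample_df), 10% drawn within each source

# Second 10% draw (assumes the count is rounded to the nearest integer)
n_second = round(0.10 * n_first)

fraction_kept = n_second / n_medium
print(n_second, f"{fraction_kept:.2%}")
```

So roughly 1% of the medium-duration samples remain before the train/test split, and the training split is smaller still.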


In [8]:
test_df = smaller_sample.copy()

time: 2 ms


## Trim and pad

Some samples begin with short leading silences (less than 10 ms in duration). In this section, we trim the leading zero-valued samples and then left-pad every sample with zeros up to the maximum length in the set, so that all arrays have the same length.

In [9]:
# Trim leading silence (more precise than originally)
test_df["ragged"] = test_df.apply(
    lambda row: np.trim_zeros(
        librosa.load(path=f"{WAV_DIRECTORY}/{row.file}", sr=None)[0], trim="f"
    ).astype(np.float32),
    axis=1,
)

max_ragged = test_df.ragged.apply(len).max()

# Zero pad with leading silence
test_df["padded"] = test_df.apply(
    lambda row: np.pad(
        row.ragged,
        (max_ragged - len(row.ragged), 0),
        mode="constant",
        constant_values=0,
    ),
    axis=1,
)

time: 1.33 s
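
To make the effect of the trim-and-pad step easier to see, here is a minimal sketch on two toy arrays. The values are hypothetical and stand in for the loaded waveforms; the calls are the same `np.trim_zeros` and `np.pad` used in the cell above.

```python
import numpy as np

# Two toy "waveforms" with leading zeros (hypothetical values)
signals = [
    np.array([0.0, 0.0, 0.3, -0.2, 0.0, 0.1], dtype=np.float32),
    np.array([0.0, 0.5, 0.4], dtype=np.float32),
]

# trim="f" strips zeros from the front only; interior and trailing zeros stay
ragged = [np.trim_zeros(s, trim="f") for s in signals]
max_len = max(len(r) for r in ragged)

# Left-pad with zeros so every array has length max_len
padded = np.stack(
    [np.pad(r, (max_len - len(r), 0), mode="constant") for r in ragged]
)
print(padded.shape)
print(padded)
```

Note that only exact zeros are trimmed, so low-level noise at the start of a recording is kept, and that left-padding right-aligns all samples at their final value.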


In [10]:
test_df.ragged.apply(len).describe()
test_df.padded.apply(len).describe()

count      811.000000
mean     42168.583231
std      16137.424269
min       8223.000000
25%      31748.500000
50%      41600.000000
75%      50864.000000
max      85665.000000
Name: ragged, dtype: float64

count      811.0
mean     85665.0
std          0.0
min      85665.0
25%      85665.0
50%      85665.0
75%      85665.0
max      85665.0
Name: padded, dtype: float64

time: 28 ms


## Train test split

To avoid leakage of speaker characteristics, we segregate the speakers of the train split from the test/validation split.

In [11]:
test_speakers = (
    pd.DataFrame(np.unique(test_df.speaker_id))
    .sample(frac=0.30, random_state=SEED)[0]
    .values
)

X_test = (_ := test_df.loc[test_df.speaker_id.isin(test_speakers)])[["padded"]]
y_test = _.neg
X_train = (_ := test_df.loc[~test_df.speaker_id.isin(test_speakers)])[["padded"]]
y_train = _.neg
len(test_df) == len(y_test) + len(y_train)

True

time: 19 ms
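
The speaker-disjoint split above can be sanity-checked with a small sketch like the following. The speaker labels are hypothetical, and NumPy's random generator is used here instead of `DataFrame.sample`, so the chosen speakers will not match the notebook's.

```python
import numpy as np

rng = np.random.default_rng(2021)

# Hypothetical speaker label for each of 12 utterances
speaker_id = np.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "e", "f", "f"])

speakers = np.unique(speaker_id)
n_test = round(0.30 * len(speakers))  # hold out about 30% of speakers
test_speakers = rng.choice(speakers, size=n_test, replace=False)

# Rows belong to the test split if their speaker was held out
test_mask = np.isin(speaker_id, test_speakers)
train_speakers = set(speaker_id[~test_mask])

# No speaker may appear on both sides of the split
overlap = train_speakers & set(test_speakers)
print(sorted(test_speakers), overlap)
```

Because the split is made over speakers rather than rows, the share of rows that ends up in the test split depends on how many utterances each held-out speaker contributed and will generally not be exactly 30%.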


In [12]:
# Share of the majority class in the test labels, i.e. the accuracy a dummy classifier would reach
score_to_beat = (
    _ := test_df.loc[test_df.speaker_id.isin(test_speakers)]
).neg.value_counts().values[0] / len(_)

time: 6 ms


In [13]:
del medium_df
del sample_df
del smaller_sample
del test_df
gc.collect()

4957

time: 98.3 ms


## Random Interval Spectral Ensemble

RISE is a tree-ensemble method for time series classification, analogous to a random forest. It might do well on audio since it extracts spectral features from intervals of each series.
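
To give a feel for what "spectral features from a random interval" means, here is a rough sketch in plain NumPy. It is an illustration only and not the sktime implementation: the function name, the `min_len` parameter and the toy tone are all hypothetical, and as far as I understand the actual method also uses autocorrelation features, which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(2021)


def random_interval_power_spectrum(series, min_len=16):
    """Power spectrum of one randomly placed interval of a 1-D series."""
    n = len(series)
    length = int(rng.integers(min_len, n + 1))  # interval length in [min_len, n]
    start = int(rng.integers(0, n - length + 1))  # start chosen so the interval fits
    segment = series[start : start + length]
    spectrum = np.abs(np.fft.rfft(segment)) ** 2 / length
    return start, length, spectrum


# Hypothetical input: a 440 Hz tone sampled at 16 kHz for 0.1 s
t = np.arange(1600) / 16000
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

start, length, spectrum = random_interval_power_spectrum(tone)
print(start, length, spectrum.shape)
```

In an ensemble of this kind, each tree would be trained on features from its own interval, so the number of features per tree grows with the interval length. With series of 85,665 points that is one plausible reason for the long training time.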

In [14]:
rise = RandomIntervalSpectralForest(random_state=SEED)

time: 994 µs


In [15]:
gc.collect()

442

time: 72 ms


In [16]:
fitted_rise = rise.fit(X_train, y_train)

time: 1h 45min 35s


In [17]:
gc.collect()

207

time: 325 ms


## Results

How well would a dummy classifier do? (The task is to distinguish between negative and non-negative.)

In [18]:
score_to_beat

0.5551724137931034

time: 4 ms
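
For reference, the baseline above is the accuracy of always predicting the most frequent label. A minimal sketch on hypothetical labels (not the notebook's `y_test`):

```python
import numpy as np

# Hypothetical binary labels: 1 = negative, 0 = non-negative
y_test_toy = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])

# A majority-class dummy always predicts the most frequent label
values, counts = np.unique(y_test_toy, return_counts=True)
majority_label = values[np.argmax(counts)]
baseline_accuracy = counts.max() / len(y_test_toy)

print(majority_label, baseline_accuracy)
```

Computing the majority share on the test labels themselves, as `score_to_beat` does, gives the best score any constant prediction could reach on that split, so it is a slightly generous baseline compared with a dummy fitted on the training labels.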


How well did RISE do in comparison?

In [19]:
(rise_score := rise.score(X_test, y_test))
rise_score - score_to_beat

0.596551724137931

0.04137931034482756

time: 16min 45s


## Discussion

In this notebook, I tried using RISE to classify the audio signal directly. On the held-out speakers it reached an accuracy of about 0.597, against 0.555 for a majority-class dummy classifier. Although certainly better than a coin flip, the margin of about 4 percentage points is small.

Moreover, training took 1 h 45 min, and scoring the test split took almost 17 min more! I am already using less than 1% of the full dataset and this was supposed to be a "quick" method. Feature engineering to reduce each sample to a set of features of manageable size may be warranted.
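
As one possible direction for such feature engineering, the sketch below collapses a waveform into a handful of band energies. This is an assumption about what "manageable size" could look like, not something tested in this notebook: the function name, the choice of 8 equal-width bands and the random stand-in waveform are all hypothetical.

```python
import numpy as np


def band_energy_features(signal, n_bands=8):
    """Summarize a 1-D signal as log energies in n_bands equal-width frequency bands."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(power, n_bands)
    return np.log1p(np.array([band.sum() for band in bands]))


# Hypothetical stand-in for one padded waveform of the length reported above
rng = np.random.default_rng(2021)
waveform = rng.standard_normal(85665).astype(np.float32)

features = band_energy_features(waveform)
print(features.shape)
```

A representation like this shrinks each sample from 85,665 values to 8, at the cost of discarding all timing information. Whether it retains enough signal for valence classification would need to be checked separately.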

[^top](#Contents)