# Contents
* [Intro](#Intro)
* [Imports and config](#Imports-and-config)
* [Load data](#Load-data)
  * [Undersample data](#Undersample-data)
* [Trim and pad](#Trim-and-pad)
* [Train test split](#Train-test-split)
* [Minimally Random Convolutional Kernel Transform](#Minimally-Random-Convolutional-Kernel-Transform)
* [Results](#Results)
* [Discussion](#Discussion)

## Intro

This notebook applies the MINImally RandOm Convolutional KErnel Transform (MINIROCKET) to the samples of medium duration, without hyperparameter tuning. ROCKET transforms a time series with random convolutional kernels to extract features, which are then modeled by a linear classifier. MINIROCKET is a scaled-down version that trains much faster at a marginal cost in classification performance. Here, the model is trained directly on the padded wav arrays.

The results are better than a dummy classifier, but not by much (an accuracy of about 0.566 against a baseline of about 0.538). Since this method is acclaimed for its training speed, it may still be worth applying it to MFCCs or spectrograms treated as time series.
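
To make the idea concrete, here is a minimal NumPy sketch of a ROCKET-style transform followed by a linear (ridge) classifier. It is a simplification for illustration only, not the tsai or sktime implementation: the helper names (`random_kernel_features`, `fit_ridge`, `predict_ridge`), the kernel length, the lack of dilation and padding, and the toy sine-wave data are all assumptions.

```python
import numpy as np


def random_kernel_features(X, n_kernels=50, kernel_len=9, seed=0):
    """Slide random kernels over each series and pool every response into
    one feature: the proportion of positive values (PPV)."""
    rng = np.random.default_rng(seed)
    features = np.empty((X.shape[0], n_kernels))
    for k in range(n_kernels):
        weights = rng.normal(size=kernel_len)
        weights -= weights.mean()  # zero-mean kernel
        bias = rng.uniform(-1.0, 1.0)
        for i, series in enumerate(X):
            # np.correlate slides the kernel over the series without flipping it
            response = np.correlate(series, weights, mode="valid") + bias
            features[i, k] = np.mean(response > 0)
    return features


def fit_ridge(features, y, alpha=1.0):
    """Closed-form ridge regression on -1/+1 targets.
    For brevity, the intercept column is regularized as well."""
    A = np.hstack([features, np.ones((len(features), 1))])
    targets = np.where(np.asarray(y) == 1, 1.0, -1.0)
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ targets)


def predict_ridge(coef, features):
    """Predict class 1 where the linear score is positive, else class 0."""
    A = np.hstack([features, np.ones((len(features), 1))])
    return (A @ coef > 0).astype(int)


# Toy data (hypothetical): slow versus fast noisy sine waves
rng = np.random.default_rng(2021)
t = np.linspace(0, 1, 200)
slow = np.sin(2 * np.pi * 3 * t) + 0.1 * rng.normal(size=(20, 200))
fast = np.sin(2 * np.pi * 15 * t) + 0.1 * rng.normal(size=(20, 200))
X = np.vstack([slow, fast])
y = np.array([0] * 20 + [1] * 20)

features = random_kernel_features(X)
coef = fit_ridge(features, y)
train_accuracy = np.mean(predict_ridge(coef, features) == y)
print(features.shape, train_accuracy)
```

The expensive part is the transform, since every kernel must be slid over every sample; the linear classifier on top is cheap by comparison.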

## Imports and config

In [1]:
# Extensions
%load_ext lab_black
%load_ext nb_black
%load_ext autotime

In [2]:
# Core
import numpy as np
import pandas as pd
import librosa

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

# suppress warnings
import warnings

warnings.filterwarnings("ignore")

from gc import collect as gc_collect

time: 5.5 s


In [3]:
from tsai.all import *

computer_setup()

os             : Windows-10-10.0.22000-SP0
python         : 3.8.12
tsai           : 0.2.23
fastai         : 2.5.2
fastcore       : 1.3.26
torch          : 1.9.1+cpu
n_cpus         : 8
device         : cpu
time: 11.5 s


In [4]:
# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

# suppress warnings
import warnings

warnings.filterwarnings("ignore")

time: 10.3 ms


In [5]:
SEED = 2021

# Location of medium.pkl, which contains the samples of medium duration
PICKLED_DF_FOLDER = "../1.0-mic-divide_data_by_duration"

# Location where this notebook will output
DATA_OUT_FOLDER = "."

# The preprocessed data from the Unified Multilingual Dataset of Emotional Human utterances
WAV_DIRECTORY = (
    "../../unified_multilingual_dataset_of_emotional_human_utterances/data/preprocessed"
)

time: 11.6 ms


## Load data

In [6]:
medium_df = pd.read_pickle(f"{PICKLED_DF_FOLDER}/medium.pkl")
medium_df.head()

| id | file | duration | source | speaker_id | speaker_gender | emo | valence | lang1 | lang2 | neg | neu | pos | length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00000+aesdd+aesdd.1+f+ang+-1+ell+el-gr.wav | 4.129 | aesdd | aesdd.1 | f | ang | -1 | ell | el-gr | 1 | 0 | 0 | medium |
| 1 | 00001+aesdd+aesdd.2+f+ang+-1+ell+el-gr.wav | 3.448 | aesdd | aesdd.2 | f | ang | -1 | ell | el-gr | 1 | 0 | 0 | medium |
| 2 | 00002+aesdd+aesdd.3+m+ang+-1+ell+el-gr.wav | 3.98 | aesdd | aesdd.3 | m | ang | -1 | ell | el-gr | 1 | 0 | 0 | medium |
| 3 | 00003+aesdd+aesdd.4+m+ang+-1+ell+el-gr.wav | 3.39 | aesdd | aesdd.4 | m | ang | -1 | ell | el-gr | 1 | 0 | 0 | medium |
| 4 | 00004+aesdd+aesdd.5+f+ang+-1+ell+el-gr.wav | 4.042 | aesdd | aesdd.5 | f | ang | -1 | ell | el-gr | 1 | 0 | 0 | medium |


time: 241 ms


In [7]:
medium_df.duration.value_counts()

2.000000    226
3.004000    218
1.045000    201
2.603000    196
1.835000    196
           ... 
5.355937      1
3.723937      1
3.124938      1
2.721938      1
0.780000      1
Name: duration, Length: 6324, dtype: int64

time: 23.1 ms


### Undersample data

Let's grab 10% of the samples from each data source, then keep a further 10% of that subset (811 samples, about 1% of the original data). We only need enough to quickly test out several models.

In [8]:
sample_df = medium_df.groupby("source").sample(frac=0.10, random_state=SEED)
len(medium_df)
len(sample_df)
np.unique(medium_df.source) == np.unique(sample_df.source)
sample_df.head()

81099

8109

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True])

| id | file | duration | source | speaker_id | speaker_gender | emo | valence | lang1 | lang2 | neg | neu | pos | length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1392 | 01392+BAUM1+BAUM1.s019+f+neu+0+tur+tr-tr.wav | 3.821 | BAUM1 | BAUM1.s019 | f | neu | 0 | tur | tr-tr | 0 | 1 | 0 | medium |
| 723 | 00723+BAUM1+BAUM1.s017+f+dis+-1+tur+tr-tr.wav | 3.16 | BAUM1 | BAUM1.s017 | f | dis | -1 | tur | tr-tr | 1 | 0 | 0 | medium |
| 702 | 00702+BAUM1+BAUM1.s014+m+hap+1+tur+tr-tr.wav | 2.755 | BAUM1 | BAUM1.s014 | m | hap | 1 | tur | tr-tr | 0 | 0 | 1 | medium |
| 1585 | 01585+BAUM1+BAUM1.s022+f+con+-1+tur+tr-tr.wav | 1.795 | BAUM1 | BAUM1.s022 | f | con | -1 | tur | tr-tr | 1 | 0 | 0 | medium |
| 652 | 00652+BAUM1+BAUM1.s008+m+ang+-1+tur+tr-tr.wav | 4.513 | BAUM1 | BAUM1.s008 | m | ang | -1 | tur | tr-tr | 1 | 0 | 0 | medium |


time: 251 ms


In [9]:
smaller_sample = sample_df.sample(frac=0.10, random_state=SEED)

time: 15.3 ms


In [10]:
test_df = smaller_sample.copy()

time: 6.84 ms
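
Per-source sampling keeps every source represented in the subset. The toy frame below (hypothetical sources `a` and `b`, not the real dataset) is a small sketch of how `groupby(...).sample(frac=...)` draws the same fraction from each group. It assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available.

```python
import pandas as pd

SEED = 2021

# Hypothetical toy data: 20 rows from source "a", 10 rows from source "b"
toy_df = pd.DataFrame(
    {
        "source": ["a"] * 20 + ["b"] * 10,
        "neg": [0, 1] * 15,
    }
)

# Draw half of the rows within each source
per_source = toy_df.groupby("source").sample(frac=0.5, random_state=SEED)
counts = per_source.source.value_counts().to_dict()
print(counts)
```

Note that a plain `DataFrame.sample(frac=...)` on the whole frame, as in the second undersampling step, is not stratified by source, so small sources may end up with very few rows or none.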


## Trim and pad

Some samples have leading silences shorter than 10 ms. In this section, we trim those leading silences and left-pad every sample with zeros up to the maximum length in the set.

In [11]:
# Trim leading silence (more precise than originally)
test_df["ragged"] = test_df.apply(
    lambda row: np.trim_zeros(
        librosa.load(path=f"{WAV_DIRECTORY}/{row.file}", sr=None)[0], trim="f"
    ).astype(np.float32),
    axis=1,
)

max_ragged = test_df.ragged.apply(len).max()

# Zero pad with leading silence
test_df["padded"] = test_df.apply(
    lambda row: np.pad(
        row.ragged,
        (max_ragged - len(row.ragged), 0),
        mode="constant",
        constant_values=0,
    ),
    axis=1,
)

time: 1.36 s
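
The trim-and-pad step above can be checked on toy arrays. This sketch uses two made-up signals rather than the real wav files and follows the same two calls, `np.trim_zeros(..., trim="f")` and `np.pad` with all padding on the left.

```python
import numpy as np

# Hypothetical signals: the first has two leading zeros and one trailing zero
signals = [
    np.array([0.0, 0.0, 0.5, -0.2, 0.0], dtype=np.float32),
    np.array([0.3, 0.1], dtype=np.float32),
]

# trim="f" strips zeros from the front only; trailing zeros are kept
trimmed = [np.trim_zeros(s, trim="f") for s in signals]
max_len = max(len(s) for s in trimmed)

# Left-pad with zeros so that every row has the same length
padded = np.stack(
    [np.pad(s, (max_len - len(s), 0), mode="constant") for s in trimmed]
)
print(padded)
```

Because `np.trim_zeros` removes only samples that are exactly zero, low-level noise at the start of a recording is left untouched.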


In [12]:
test_df.ragged.apply(len).describe()
test_df.padded.apply(len).describe()

count      811.000000
mean     42168.583231
std      16137.424269
min       8223.000000
25%      31748.500000
50%      41600.000000
75%      50864.000000
max      85665.000000
Name: ragged, dtype: float64

count      811.0
mean     85665.0
std          0.0
min      85665.0
25%      85665.0
50%      85665.0
75%      85665.0
max      85665.0
Name: padded, dtype: float64

time: 30.8 ms


## Train test split

To avoid leakage of speaker characteristics, we keep the speakers in the train split disjoint from those in the test split: 30% of the speakers are held out for testing.

In [13]:
test_speakers = (
    pd.DataFrame(np.unique(test_df.speaker_id))
    .sample(frac=0.30, random_state=SEED)[0]
    .values
)

X_test = (_ := test_df.loc[test_df.speaker_id.isin(test_speakers)])[["padded"]]
y_test = _.neg
X_train = (_ := test_df.loc[~test_df.speaker_id.isin(test_speakers)])[["padded"]]
y_train = _.neg
len(test_df) == len(y_test) + len(y_train)

True

time: 38.1 ms
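
The speaker-disjoint split can be sanity-checked with a small sketch. The speaker IDs below are made up, and `rng.choice` stands in for the pandas `.sample` call used above, so this is an illustration of the idea and not a copy of the cell.

```python
import numpy as np

rng = np.random.default_rng(2021)

# Hypothetical speaker ID for each of ten samples
speaker_ids = np.array(
    ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5", "s5", "s6"]
)

# Hold out roughly 30% of the unique speakers
unique_speakers = np.unique(speaker_ids)
n_test = int(round(0.30 * len(unique_speakers)))
test_speakers = rng.choice(unique_speakers, size=n_test, replace=False)

# Assign every sample to a split according to its speaker
is_test = np.isin(speaker_ids, test_speakers)
train_speakers = set(speaker_ids[~is_test])
held_out = set(speaker_ids[is_test])
print(sorted(held_out), int(is_test.sum()))
```

Holding out 30% of the speakers does not guarantee 30% of the samples, because speakers contribute different numbers of recordings.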


In [14]:
# Baseline to beat: share of the most frequent class in the full sample (train and test combined)
score_to_beat = test_df.neg.value_counts().values[0] / len(test_df)

time: 9.4 ms


In [15]:
del medium_df
del sample_df
del smaller_sample
del test_df
gc_collect()

23346

time: 140 ms


## Minimally Random Convolutional Kernel Transform

MiniRocket was [published in August 2021](https://doi.org/10.1145/3447548.3467231), touting state-of-the-art performance on benchmark time series classification tasks.

In [16]:
model = MiniRocketClassifier(random_state=SEED, verbose=True)

time: 7.33 ms


In [17]:
gc_collect()

455

time: 117 ms


In [18]:
fitted_minirocket = model.fit(X_train, y_train)

[Pipeline]  (step 1 of 2) Processing minirocketmultivariate, total= 8.0min
[Pipeline] . (step 2 of 2) Processing ridgeclassifiercv, total=   0.7s
time: 8min 2s


In [19]:
gc_collect()

310

time: 184 ms


## Results

How well would a dummy classifier do? (The task is to distinguish between negative and non-negative.)

In [20]:
score_to_beat

0.5376078914919852

time: 8.19 ms
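
The baseline above is the majority-class share over the full sample. A stricter variant, sketched below with made-up labels, picks the majority class from the training labels only and measures its accuracy on the test labels. The function name `majority_baseline_accuracy` is hypothetical.

```python
import numpy as np


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))


# Hypothetical labels: 1 = negative valence, 0 = non-negative
y_train_toy = np.array([1, 1, 1, 0])
y_test_toy = np.array([1, 0, 0, 1, 1])

baseline = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(baseline)  # 0.6
```

With a speaker-based split, the class balance of the test split can differ from that of the full sample, so the two baselines need not agree.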


How well did MINIROCKET do in comparison?

In [21]:
(minirocket_score := fitted_minirocket.score(X_test, y_test))
minirocket_score - score_to_beat

0.5655172413793104

0.027909349887325186

time: 4min 33s


## Discussion

In this notebook, I tried using MINIROCKET to classify the raw audio signal directly. A cursory analysis shows that it performs slightly better than a dummy classifier would: an accuracy of about 0.566 against a baseline of about 0.538, a margin of roughly 2.8 percentage points. Although that is better than a coin flip, the margin is small. Inference was also slower than I expected: scoring the test split took 4 min 33 s, more than half of the 8 min 2 s spent on training.

It may yet be interesting to try MINIROCKET on the spectrograms. Its convolutional nature may be compared to a Convolutional Neural Network, a common architecture for applying computer vision techniques to spectrograms.
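
As a rough sketch of what "spectrograms as time series" could look like, the block below computes a magnitude spectrogram with NumPy alone and arranges it so that each frequency bin is one channel of a multivariate series. The function name, the frame and hop sizes, the synthetic tone, and the (samples, channels, time steps) layout are all assumptions; check which input shape `MiniRocketClassifier` accepts in your tsai version before relying on it.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier magnitudes, shape (n_freq_bins, n_frames)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row j holds the sample indices of frame j
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1)).T


# Hypothetical input: a 1 kHz tone sampled at 8 kHz
sr = 8000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 1000 * t).astype(np.float32)

spec = magnitude_spectrogram(tone)
# Add a leading sample axis: (samples, channels, time steps)
X_multivariate = spec[None, :, :]
print(X_multivariate.shape)  # (1, 129, 7)
```

This trades one long univariate series for many short channels, which may also reduce the transform time observed above, though that would need to be measured.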

[^top](#Contents)