# Contents
* [Intro](#Intro)
* [Imports and config](#Imports-and-config)
* [Load data](#Load-data)
* [Add features](#Add-features)
* [Results](#Results)

## Intro

Earlier, I undersampled the medium set to 800+ samples to evaluate the performance of Random Interval Spectral Ensemble (RISE) from `sktime` and MiniROCKET from `tsai`. Despite sampling <1% of the dataset, both of these took longer to train than I expected. If I recall correctly, RISE took between 2 and 2.5 hours; MiniROCKET was much faster, but still not what I would have expected for a bunch of short audio clips. Moreover, the accuracy scores were only a couple of points higher than a dummy classifier.

I also tried Time Series Support Vector Classifier and Learning Shapelets from `tslearn`, but I kept getting out-of-memory errors.

I didn't include the notebooks I used for these initial probes since it became clear I would need to adjust my exploration approach to iterate faster. I considered three options: using my Paperspace credits for access to more compute, applying dimensionality reduction techniques, and using PySpark.

Perhaps one of the reasons training is so slow is that the main feature is an array of some 80,000+ elements. I might be able to iterate faster if I trained on the MFCCs instead. Accordingly, I am extracting the MFCCs and the mel-and-decibel-scaled spectrograms before I try other models. (Some of the other techniques I'd like to try will use the spectrograms.)
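To make the size argument concrete, here is a back-of-the-envelope sketch in plain Python. It assumes a padded clip of exactly 80,000 samples (the text above only says "some 80,000+") and librosa's default settings for `librosa.feature.mfcc` and `librosa.feature.melspectrogram` (`n_mfcc=20`, `n_mels=128`, `hop_length=512`, centered frames). The helper name `n_frames` is hypothetical.

```python
# Back-of-the-envelope feature sizes; all constants below are assumptions.
N_SAMPLES = 80_000  # assumed padded clip length ("some 80,000+ elements")
HOP_LENGTH = 512    # librosa default hop length
N_MFCC = 20         # librosa default number of MFCCs
N_MELS = 128        # librosa default number of mel bands


def n_frames(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Number of frames produced by a centered short-time transform."""
    return 1 + n_samples // hop_length


frames = n_frames(N_SAMPLES)
mfcc_values = N_MFCC * frames
melspec_values = N_MELS * frames

print(f"frames per clip: {frames}")
print(f"MFCC values per clip: {mfcc_values}")
print(f"mel spectrogram values per clip: {melspec_values}")
print(f"raw / MFCC size ratio: {N_SAMPLES / mfcc_values:.1f}")
```

Under these assumptions, the MFCC matrix holds roughly 25 times fewer values than the raw waveform, while the mel spectrogram holds about a quarter as many.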

For now, we are only dealing with the short set to avoid the memory issues that the full dataset entails.

## Imports and config

In [1]:
# Core
import numpy as np
import pandas as pd
import librosa

# util
from gc import collect as gc_collect
from os import remove
from shutil import rmtree

In [2]:
# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [3]:
# pyspark
import findspark

findspark.init()

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import (
    ArrayType,
    ByteType,
    FloatType,
    StringType,
    StructField,
    StructType,
)

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession.builder.getOrCreate()

In [4]:
# Extensions
%load_ext lab_black
%load_ext nb_black
%load_ext autotime

In [5]:
SEED = 2021

# Location of padded data (as pickle)
PICKLED_DF_FOLDER = "../3.0-mic-pad_data_short"

# Location where this notebook will output
DATA_OUT_FOLDER = "."

# The preprocessed data from the Unified Multilingual Dataset of Emotional Human utterances
WAV_DIRECTORY = (
    "../../../unified_multilingual_dataset_of_emotional_human_utterances/data/preprocessed"
)

time: 11.4 ms


## Load data

In [6]:
schema = StructType(
    [
        StructField("file", StringType(), nullable=False),
        StructField("duration", StringType(), nullable=False),
        StructField("source", StringType(), nullable=False),
        StructField("speaker_id", StringType(), nullable=False),
        StructField("speaker_gender", StringType(), nullable=False),
        StructField("emo", StringType(), nullable=False),
        StructField("valence", StringType(), nullable=False),
        StructField("lang1", StringType(), nullable=False),
        StructField("lang2", StringType(), nullable=False),
        StructField("neg", ByteType(), nullable=False),
        StructField("neu", ByteType(), nullable=False),
        StructField("pos", ByteType(), nullable=False),
        StructField("length", StringType(), nullable=False),
        StructField("padded", ArrayType(FloatType()), nullable=False),
    ]
)

time: 47 ms


In [7]:
gc_collect()
short_df = spark.createDataFrame(
    pd.read_pickle(f"{PICKLED_DF_FOLDER}/short_padded.pkl"),
    schema=schema,
)

4122

time: 10.1 s


In [8]:
short_df.printSchema()
gc_collect()
short_df.show()

root
 |-- file: string (nullable = false)
 |-- duration: string (nullable = false)
 |-- source: string (nullable = false)
 |-- speaker_id: string (nullable = false)
 |-- speaker_gender: string (nullable = false)
 |-- emo: string (nullable = false)
 |-- valence: string (nullable = false)
 |-- lang1: string (nullable = false)
 |-- lang2: string (nullable = false)
 |-- neg: byte (nullable = false)
 |-- neu: byte (nullable = false)
 |-- pos: byte (nullable = false)
 |-- length: string (nullable = false)
 |-- padded: array (nullable = false)
 |    |-- element: float (containsNull = true)



369

+--------------------+---------+-------+-------------------+--------------+---+-------+-----+-----+---+---+---+------+--------------------+
|                file| duration| source|         speaker_id|speaker_gender|emo|valence|lang1|lang2|neg|neu|pos|length|              padded|
+--------------------+---------+-------+-------------------+--------------+---+-------+-----+-----+---+---+---+------+--------------------+
|01788+BAUM1+BAUM1...|    0.387|  BAUM1|         BAUM1.s028|             f|hap|      1|  tur|tr-tr|  0|  0|  1| short|[0.0, 0.0, 0.0, 0...|
|02024+BAUM2+BAUM2...|    0.417|  BAUM2|         BAUM2.S087|             f|ang|     -1|  eng|   en|  1|  0|  0| short|[0.0, 0.0, 0.0, 0...|
|02196+BAUM2+BAUM2...|    0.417|  BAUM2|         BAUM2.S239|             f|ang|     -1|  tur|tr-tr|  1|  0|  0| short|[0.0, 0.0, 0.0, 0...|
|10245+ekorpus+eko...|    0.485|ekorpus|          ekorpus.0|             f|hap|      1|  est|et-ee|  0|  0|  1| short|[0.0, 0.0, 0.0, 0...|
|10512+ekorpus+eko..

## Add features

Next, we'll add a column for the mel-frequency cepstral coefficients (MFCCs).

In [10]:
mfcc = librosa.feature.mfcc

time: 4.01 ms


In [11]:
# Extract MFCCs
@pandas_udf(returnType=ArrayType(ArrayType(FloatType())))
def make_mfcc(field: pd.Series) -> pd.Series:
    """Given a Series of wav arrays, return a Series of extracted mel-frequency cepstral coefficients."""
    # Pass the signal as y=...; recent librosa releases require keyword arguments here
    return field.apply(
        lambda wav: mfcc(y=np.asarray(wav, dtype=np.float32), sr=16000).tolist()
    )

time: 14 ms


In [12]:
short_df = short_df.withColumn("mfcc", make_mfcc("padded"))

time: 136 ms


In [13]:
gc_collect()
short_df.show()

216

+--------------------+---------+-------+-------------------+--------------+---+-------+-----+-----+---+---+---+------+--------------------+--------------------+
|                file| duration| source|         speaker_id|speaker_gender|emo|valence|lang1|lang2|neg|neu|pos|length|              padded|                mfcc|
+--------------------+---------+-------+-------------------+--------------+---+-------+-----+-----+---+---+---+------+--------------------+--------------------+
|01788+BAUM1+BAUM1...|    0.387|  BAUM1|         BAUM1.s028|             f|hap|      1|  tur|tr-tr|  0|  0|  1| short|[0.0, 0.0, 0.0, 0...|[[-680.11646, -68...|
|02024+BAUM2+BAUM2...|    0.417|  BAUM2|         BAUM2.S087|             f|ang|     -1|  eng|   en|  1|  0|  0| short|[0.0, 0.0, 0.0, 0...|[[-573.9075, -419...|
|02196+BAUM2+BAUM2...|    0.417|  BAUM2|         BAUM2.S239|             f|ang|     -1|  tur|tr-tr|  1|  0|  0| short|[0.0, 0.0, 0.0, 0...|[[-651.4625, -595...|
|10245+ekorpus+eko...|    0.485|ek

We will also add a column for a decibel-scaled spectrogram on the mel scale. Like the MFCCs, each spectrogram is stored as a nested array.
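As a rough illustration of the decibel scaling, the sketch below mirrors in plain Python what `librosa.amplitude_to_db` does when called with `ref=np.max` and its defaults (`amin=1e-5`, `top_db=80.0`): values are expressed relative to the loudest entry and then clipped 80 dB below the peak. The function name `amplitude_to_db_sketch` is hypothetical, and this is a simplification for a flat list of values, not librosa's implementation.

```python
import math


def amplitude_to_db_sketch(values, top_db=80.0, amin=1e-5):
    """Convert amplitudes to dB relative to the largest magnitude, with a floor."""
    ref = max(abs(v) for v in values)  # plays the role of ref=np.max
    db = [
        20.0 * math.log10(max(amin, abs(v))) - 20.0 * math.log10(max(amin, ref))
        for v in values
    ]
    floor = max(db) - top_db  # nothing may fall more than top_db below the peak
    return [max(d, floor) for d in db]


# The loudest value maps to about 0 dB; silence is clipped to the floor.
result = amplitude_to_db_sketch([1.0, 0.1, 0.0])
print(result)
```

This clipping is why the quiet, zero-padded regions of the clips show up as `-80.0` in the `melspec_db` column.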

In [14]:
melspectrogram = librosa.feature.melspectrogram
amplitude_to_db = librosa.amplitude_to_db
gc_collect()

169

time: 96.3 ms


In [15]:
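@pandas_udf(returnType=ArrayType(ArrayType(FloatType())))
def make_melspec_db(field: pd.Series) -> pd.Series:
    """Given a Series of wav arrays, return a Series of extracted mel-db-scaled spectrogram arrays."""
    # NOTE: melspectrogram returns a power spectrogram by default (power=2.0), so
    # librosa.power_to_db is the conventional match; amplitude_to_db is kept here
    # so the values stay consistent with the outputs shown in this notebook.
    return field.apply(
        lambda wav: amplitude_to_db(
            melspectrogram(y=np.asarray(wav, dtype=np.float32), sr=16000), ref=np.max
        ).tolist()
    )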

time: 17 ms


In [16]:
short_df = short_df.withColumn("melspec_db", make_melspec_db(short_df.padded))

time: 81.8 ms


In [17]:
gc_collect()
short_df.show()

688

+--------------------+---------+-------+-------------------+--------------+---+-------+-----+-----+---+---+---+------+--------------------+--------------------+--------------------+
|                file| duration| source|         speaker_id|speaker_gender|emo|valence|lang1|lang2|neg|neu|pos|length|              padded|                mfcc|          melspec_db|
+--------------------+---------+-------+-------------------+--------------+---+-------+-----+-----+---+---+---+------+--------------------+--------------------+--------------------+
|01788+BAUM1+BAUM1...|    0.387|  BAUM1|         BAUM1.s028|             f|hap|      1|  tur|tr-tr|  0|  0|  1| short|[0.0, 0.0, 0.0, 0...|[[-680.11646, -68...|[[-80.0, -80.0, -...|
|02024+BAUM2+BAUM2...|    0.417|  BAUM2|         BAUM2.S087|             f|ang|     -1|  eng|   en|  1|  0|  0| short|[0.0, 0.0, 0.0, 0...|[[-573.9075, -419...|[[-80.0, -80.0, -...|
|02196+BAUM2+BAUM2...|    0.417|  BAUM2|         BAUM2.S239|             f|ang|     -1|  t

## Results

In [18]:
short_df.select("mfcc", "melspec_db").show()
gc_collect()

+--------------------+--------------------+
|                mfcc|          melspec_db|
+--------------------+--------------------+
|[[-680.11646, -68...|[[-80.0, -80.0, -...|
|[[-573.9075, -419...|[[-80.0, -80.0, -...|
|[[-651.4625, -595...|[[-80.0, -80.0, -...|
|[[-387.688, -308....|[[-80.0, -80.0, -...|
|[[-499.3053, -235...|[[-80.0, -66.6326...|
|[[-507.36606, -43...|[[-80.0, -80.0, -...|
|[[-479.65567, -35...|[[-80.0, -80.0, -...|
|[[-540.9684, -540...|[[-80.0, -80.0, -...|
|[[-575.67035, -57...|[[-80.0, -80.0, -...|
|[[-548.1814, -548...|[[-80.0, -80.0, -...|
|[[-552.3849, -552...|[[-80.0, -80.0, -...|
|[[-504.175, -401....|[[-80.0, -80.0, -...|
|[[-352.87973, -23...|[[-80.0, -80.0, -...|
|[[-269.65677, -18...|[[-80.0, -80.0, -...|
|[[-490.36093, -45...|[[-80.0, -80.0, -...|
|[[-593.6202, -518...|[[-80.0, -80.0, -...|
|[[-603.3398, -603...|[[-80.0, -80.0, -...|
|[[-607.82526, -60...|[[-80.0, -80.0, -...|
|[[-627.8071, -627...|[[-80.0, -80.0, -...|
|[[-640.9115, -640...|[[-80.0, -

207

time: 6.86 s


Now we can save our work.

In [19]:
save_file = f"{DATA_OUT_FOLDER}/short_plus.parquet"
try:
    remove(save_file)
    print("removed old parquet file")
except OSError:
    pass

try:
    rmtree(f"{save_file}/")
    print("removed old parquet tree")
except OSError:
    pass

removed old parquet tree
time: 13 ms
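The two `try`/`except` blocks above cover both cases: the old output may be a single file or a directory tree (Spark writes parquet output as a directory). A minimal standard-library sketch of the same cleanup as one helper follows; the name `remove_path` is hypothetical and the demo runs against a temporary directory only.

```python
import shutil
import tempfile
from pathlib import Path


def remove_path(path) -> str:
    """Delete a file or a directory tree and report what was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return "tree"
    if target.exists():
        target.unlink()
        return "file"
    return "nothing"


# Usage against a throwaway directory, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "short_plus.parquet"
    demo.mkdir()
    (demo / "part-00000.parquet").write_text("placeholder")
    first = remove_path(demo)   # directory case
    second = remove_path(demo)  # already gone
    print(first, second)
```

Spark's `DataFrameWriter` also accepts `mode("overwrite")`, which may make the manual cleanup unnecessary.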


In [20]:
short_df.write.save(save_file)

time: 31.7 s


[^top](#Contents)