# Contents
* [Intro](#Intro)
* [Imports and config](#Imports-and-config)
* [Load data](#Load-data)
* [Train test split](#Train-test-split)
* [Convolutional Neural Network](#Convolutional-Neural-Network)
  * [Ternary](#Ternary)
    * [Fit ternary](#Fit-ternary)
    * [Results ternary](#Results-ternary)
  * [Binary](#Binary)
    * [Fit binary](#Fit-binary)
    * [Results binary](#Results-binary)
* [Discussion](#Discussion)

## Intro

This notebook sets up a Convolutional Neural Network (CNN) to classify the valence of audio clips from their mel spectrograms. Both ternary and binary (one-vs-rest) classification are considered. In every case except the binary positive/non-positive one, the trained classifier outperformed the dummy classifier.

## Imports and config

In [1]:
# set seed
from numpy.random import seed

seed(SEED := 2021)

In [2]:
# Extensions
%load_ext lab_black
%load_ext nb_black
%load_ext autotime

In [3]:
# Core
import numpy as np
import pandas as pd
from collections import namedtuple

# keras
from keras.models import Sequential
from keras.layers import (
    Conv2D,
    GlobalMaxPooling2D,
    Dense,
)
import tensorflow as tf

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

# suppress warnings
import warnings

warnings.filterwarnings("ignore")

time: 6.88 s


In [4]:
# Location of parquet
PARQUET_DF_FOLDER = "../5.0-mic-extract_spectrograms_and_MFCCs_short"

# Location where this notebook will output
DATA_OUT_FOLDER = "."

# The preprocessed data from the Unified Multilingual Dataset of Emotional Human utterances
WAV_DIRECTORY = (
    "../../unified_multilingual_dataset_of_emotional_human_utterances/data/preprocessed"
)

time: 8.98 ms


## Load data

In [5]:
short_df = pd.read_parquet(f"{PARQUET_DF_FOLDER}/short_plus.parquet")
short_df.head(1)

Unnamed: 0,file,duration,source,speaker_id,speaker_gender,emo,valence,lang1,lang2,neg,neu,pos,length,padded,mfcc,melspec_db
0,01788+BAUM1+BAUM1.s028+f+hap+1+tur+tr-tr.wav,0.387,BAUM1,BAUM1.s028,f,hap,1,tur,tr-tr,0,0,1,short,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[[-680.11646, -680.11646, -673.7514, -377.4224...","[[-80.0, -80.0, -80.0, -80.0, -80.0, -80.0, -7..."


time: 366 ms


## Train test split

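The custom split holds out 30% of the speakers rather than 30% of the rows, so no speaker appears in both the training and test sets and speaker characteristics cannot leak between them.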

In [6]:
short_speakers = (
    pd.DataFrame(np.unique(short_df.speaker_id))
    .sample(frac=0.30, random_state=SEED)[0]
    .values
)

time: 15 ms


In [7]:
X_test = (_ := short_df.loc[short_df.speaker_id.isin(short_speakers)]).melspec_db
y_test = _.valence
X_train = (_ := short_df.loc[~short_df.speaker_id.isin(short_speakers)]).melspec_db
y_train = _.valence
len(short_df) == len(y_test) + len(y_train)
print(f"{len(y_test)} in test, {len(y_train)} in train")

True

190 in test, 290 in train
time: 22.3 ms
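
The speaker-level split above can be checked with a small sketch. The toy `speaker_id` array and the variable names below are assumptions made for illustration, not the notebook's data. Note that sampling 30% of the speakers does not guarantee 30% of the rows: in the output above, 190 of 480 rows (about 40%) ended up in the test split.

```python
import numpy as np

# Hypothetical toy data: six utterances from four speakers.
speaker_id = np.array(["s1", "s1", "s2", "s3", "s3", "s4"])

rng = np.random.default_rng(2021)
speakers = np.unique(speaker_id)

# Hold out roughly 30% of the speakers (at least one), not 30% of the rows.
n_test = max(1, round(0.30 * len(speakers)))
test_speakers = rng.choice(speakers, size=n_test, replace=False)

# Row-level mask derived from the speaker-level sample.
is_test = np.isin(speaker_id, test_speakers)
train_speakers = set(speaker_id[~is_test])

# No speaker may appear on both sides of the split.
assert train_speakers.isdisjoint(test_speakers)
# Every row lands in exactly one split.
assert is_test.sum() + (~is_test).sum() == len(speaker_id)
```

An explicit disjointness assertion like this is cheap and makes the no-leakage claim verifiable whenever the sampling code changes.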


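Some additional preprocessing is needed to format the data for Keras: the labels are one-hot encoded and the spectrograms are stacked into a single array of shape `(n, 128, 16, 1)`.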

In [8]:
y_test = tf.keras.utils.to_categorical(y_test, num_classes=3, dtype="float32")
y_train = tf.keras.utils.to_categorical(y_train, num_classes=3, dtype="float32")

def reshaper(x: pd.Series) -> np.ndarray:
    # stack each nested spectrogram, then the whole column, and add a channel axis
    return np.stack(x.apply(lambda spec: np.stack(spec))).reshape(len(x), 128, 16, 1)

X_train, X_test = reshaper(X_train), reshaper(X_test)

time: 86.6 ms
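
As a minimal sketch of the formatting step above, the following uses NumPy only. The stand-in column, the label values, and the `class_index` mapping are assumptions for illustration; in particular, the column order chosen here is not necessarily the order that `to_categorical` produces in the notebook.

```python
import numpy as np

# Hypothetical stand-in: four spectrograms of shape (128, 16) held in an
# object array, loosely resembling a DataFrame column of per-sample arrays.
n_samples = 4
column = np.empty(n_samples, dtype=object)
for i in range(n_samples):
    column[i] = np.full((128, 16), -80.0, dtype=np.float32)

# Stack into one dense array and append a trailing channel axis.
X = np.stack(list(column))[..., np.newaxis]
print(X.shape)  # (4, 128, 16, 1)

# Explicit label-to-column mapping (an assumption chosen for this sketch).
class_index = {-1: 0, 0: 1, 1: 2}
valence = np.array([1, 0, -1, 0])
columns = np.array([class_index[int(v)] for v in valence])

# Row i of the identity matrix is the one-hot vector for class i.
y = np.eye(3, dtype=np.float32)[columns]
print(y.shape)  # (4, 3)
```

Writing the mapping out makes it unambiguous which output unit corresponds to which valence when reading predictions later.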


## Convolutional Neural Network

### Ternary

We will start with a simple CNN: one convolutional layer, a global max pooling layer, and a softmax output layer for the three classes. I chose global max pooling over local pooling because the relevant spectral features are not tied to a single region of the spectrogram (unlike the edges of a shape in object detection, for instance). Some trial and error with the architecture (not documented here) suggested that this configuration works better.
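
The effect of global max pooling described above can be sketched with NumPy. The random feature maps are an assumption standing in for the convolutional layer's output; the point is only that each filter is reduced to its strongest response, wherever in the time-frequency plane it occurs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps: batch of 2, 126 x 14 positions, 128 filters.
feature_maps = rng.normal(size=(2, 126, 14, 128))

# Global max pooling: keep each filter's largest response over both
# spatial axes, discarding where it occurred.
pooled = feature_maps.max(axis=(1, 2))
print(pooled.shape)  # (2, 128)

# Circularly shifting the maps along either spatial axis leaves the
# pooled result unchanged.
shifted = np.roll(feature_maps, shift=(5, 3), axis=(1, 2))
assert np.array_equal(shifted.max(axis=(1, 2)), pooled)
```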

In [9]:
model_cnn = Sequential(
    [
        Conv2D(filters=128, kernel_size=3, activation="relu", input_shape=(128, 16, 1)),
        GlobalMaxPooling2D(),
        Dense(3, activation="softmax"),
    ]
)

model_cnn.compile(
    loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)
model_cnn.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 126, 14, 128)      1280      
_________________________________________________________________
global_max_pooling2d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 3)                 387       
Total params: 1,667
Trainable params: 1,667
Non-trainable params: 0
_________________________________________________________________
time: 652 ms
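
The parameter counts and output shape in the summary above can be reproduced by hand. This is a small worked sketch; the helper function names are hypothetical.

```python
def conv2d_params(kernel_size: int, in_channels: int, filters: int) -> int:
    # One weight per kernel cell per input channel per filter, plus one bias per filter.
    return kernel_size * kernel_size * in_channels * filters + filters


def dense_params(inputs: int, units: int) -> int:
    # One weight per input-unit pair, plus one bias per unit.
    return inputs * units + units


def valid_output_size(size: int, kernel_size: int) -> int:
    # Without padding ("valid"), a stride-1 convolution shrinks each axis by kernel_size - 1.
    return size - kernel_size + 1


conv = conv2d_params(kernel_size=3, in_channels=1, filters=128)
dense = dense_params(inputs=128, units=3)
print(conv, dense, conv + dense)  # 1280 387 1667
print(valid_output_size(128, 3), valid_output_size(16, 3))  # 126 14
```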


#### Fit ternary

Let's see how well the ternary classifier works.

In [10]:
model_cnn.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    epochs=15,
    batch_size=1,
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x232995d3bb0>

time: 32.5 s
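
The cell above discards the `History` object that `fit` returns. If it were assigned to a variable, the best epoch could be read from its `history` dictionary. The sketch below uses a plain dictionary with made-up values as a stand-in; the key name `val_accuracy` is an assumption that holds for recent TensorFlow versions with `metrics=["accuracy"]` (older Keras releases used `val_acc`).

```python
# Stand-in for `history.history`; the values are made up for illustration only.
history_dict = {
    "accuracy": [0.40, 0.48, 0.55, 0.60],
    "val_accuracy": [0.42, 0.50, 0.47, 0.49],
}

val_acc = history_dict["val_accuracy"]

# Index of the epoch with the highest validation accuracy (0-based).
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__)

print(f"best val_accuracy {val_acc[best_epoch]:.3f} at epoch {best_epoch + 1}")
```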


#### Results ternary

How well would a dummy classifier do in ternary classification?

In [11]:
print("dummy classifier:")
len_full_test = len(X_test)
for valence in {"-1", "0", "1"}:
    test_valence_set = short_df.loc[
        (short_df.valence == valence) & short_df.speaker_id.isin(short_speakers)
    ]
    print(
        f"{(_ := len(test_valence_set))} samples of valence {valence} in test split ({(__ := _ / len_full_test):.3f} / {1 - __:.3f})"
    )

dummy classifier:
85 samples of valence 0 in test split (0.447 / 0.553)
66 samples of valence -1 in test split (0.347 / 0.653)
39 samples of valence 1 in test split (0.205 / 0.795)
time: 24.8 ms


The best validation accuracy over the fifteen epochs above was about 54.7%. That is roughly 10 percentage points above the dummy classifier, which scores 44.7% by always predicting the most frequent class (valence 0).
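
The dummy baseline can be recomputed from the class counts printed above. The counts and the 54.7% figure are taken from this notebook; the variable names are illustrative.

```python
# Test-split class counts as printed in the cell output above.
counts = {"0": 85, "-1": 66, "1": 39}
total = sum(counts.values())

# A majority-class dummy classifier always predicts the most frequent label.
majority_label, majority_count = max(counts.items(), key=lambda kv: kv[1])
dummy_accuracy = majority_count / total

best_val_accuracy = 0.547  # figure quoted in the text
margin_points = (best_val_accuracy - dummy_accuracy) * 100

print(f"majority label {majority_label}, dummy accuracy {dummy_accuracy:.3f}")
print(f"margin: {margin_points:.1f} percentage points")  # margin: 10.0 percentage points
```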

### Binary

Now we will run the same architecture with slight modifications for the binary (one-vs-rest) cases. Namely, the output layer has a single unit with a sigmoid activation rather than three softmax units, and the loss function changes from categorical cross-entropy to binary cross-entropy.
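
The following NumPy sketch illustrates why a single sigmoid unit is enough for two classes, and what binary cross-entropy computes. The logits and labels are made-up values, and the functions are simplified stand-ins rather than the Keras implementations.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample negative log-likelihood.
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


logits = np.array([2.0, -1.0, 0.0])  # made-up outputs of the single unit
y_true = np.array([1.0, 0.0, 1.0])   # made-up binary targets

p = sigmoid(logits)

# A two-class softmax over the logits [0, z] assigns the second class the
# same probability as sigmoid(z), so one output unit suffices.
two_class = softmax(np.stack([np.zeros_like(logits), logits], axis=-1))
assert np.allclose(two_class[:, 1], p)

loss = binary_crossentropy(y_true, p)
print(f"binary cross-entropy: {loss:.4f}")
```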

In [12]:
criterion = short_df.speaker_id.isin(short_speakers)

time: 4 ms


In [13]:
OvrSet = namedtuple("OvrSet", "name, test, train, dummy")

time: 3.73 ms


In [14]:
binary_valence = [
    OvrSet(
        name=valence,
        test=(_ := short_df.loc[criterion][valence]),
        train=short_df.loc[~criterion][valence],
        dummy=_.apply(lambda _: _ == 1).sum() / len(_),
    )
    for valence in ("neg", "neu", "pos")
]

time: 22.9 ms
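
The `dummy` field above stores the positive-class fraction; the dummy score is the larger of that fraction and its complement. As a sketch, the values can be recomputed from the test-split counts reported in the ternary results (66 negative, 85 neutral, 39 positive out of 190); the dictionary below is an illustrative restatement of those counts.

```python
# Positive-class counts per one-vs-rest task, from the ternary test-split output.
positives = {"neg": 66, "neu": 85, "pos": 39}
total = 190

dummy_scores = {}
for name, count in positives.items():
    fraction = count / total
    # Always predicting the more frequent of the two classes.
    dummy_scores[name] = max(fraction, 1 - fraction)
    print(f"{name}: {dummy_scores[name]:.3f}")
# neg: 0.653
# neu: 0.553
# pos: 0.795
```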


#### Fit binary

The following cell loops through the binary classification sets.

In [15]:
for ovr_set in binary_valence:
    dummy = ovr_set.dummy
    print("valence:", ovr_set.name, "dummy score:", dummy if dummy > 0.5 else 1 - dummy)
    model_cnn = Sequential(
        [
            Conv2D(
                filters=128, kernel_size=3, activation="relu", input_shape=(128, 16, 1)
            ),
            GlobalMaxPooling2D(),
            Dense(1, activation="sigmoid"),
        ]
    )

    model_cnn.compile(
        loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    model_cnn.summary()
    model_cnn.fit(
        X_train,
        ovr_set.train,
        validation_data=(X_test, ovr_set.test),
        epochs=15,
        batch_size=1,
    )

valence: neg dummy score: 0.6526315789473685
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 126, 14, 128)      1280      
_________________________________________________________________
global_max_pooling2d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 1,409
Trainable params: 1,409
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x2329c2e5c40>

valence: neu dummy score: 0.5526315789473684
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_2 (Conv2D)            (None, 126, 14, 128)      1280      
_________________________________________________________________
global_max_pooling2d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 1,409
Trainable params: 1,409
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x2329c6911c0>

valence: pos dummy score: 0.7947368421052632
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 126, 14, 128)      1280      
_________________________________________________________________
global_max_pooling2d_3 (Glob (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 129       
Total params: 1,409
Trainable params: 1,409
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x2329dad5b50>

time: 1min 24s


#### Results binary

In the negative/non-negative case, the dummy score on the test set was 65.3%, about 3.1 percentage points below the CNN classifier's best validation score of 68.4%.

In the neutral/non-neutral case, the dummy score on the test set was 55.3%, about 10.0 percentage points below the CNN classifier's best validation score of 65.3%.

In the positive/non-positive case, the dummy score on the test set was 79.5%, approximately equal to the classifier's best validation score.

## Discussion

In the ternary case, the best validation score across 15 epochs was noticeably higher than the dummy classifier's score, but it did not surpass 55%.

In the binary cases, only the positive/non-positive classifier failed to surpass the dummy classifier. This is also the only case where the accuracy surpassed 70%. Class imbalance may be a factor: about 79.5% of the test samples are non-positive.
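
One common way to address class imbalance is to weight the loss by inverse class frequency; Keras `fit` accepts such weights through its `class_weight` argument. The sketch below uses the "balanced" heuristic, `n_samples / (n_classes * count)`. The counts are placeholders borrowed from the test-split proportions for illustration only; in practice they should come from the training labels, which are not shown here. Whether this would help in this notebook is untested.

```python
# Placeholder counts (0 = non-positive, 1 = positive), for illustration only.
counts = {0: 151, 1: 39}
total = sum(counts.values())

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

for label, weight in class_weight.items():
    print(f"class {label}: weight {weight:.3f}")
# class 0: weight 0.629
# class 1: weight 2.436
```

With these weights, both classes contribute the same total weight to the loss (`count * weight` is equal for each class).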

It may be better for the three one-vs-rest classifiers to share their lower layers; ensembling them might also yield better performance.
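
One simple way to combine three one-vs-rest classifiers into a ternary prediction is to pick the class whose classifier gives the highest score. The sigmoid outputs below are made up for illustration, and this is only one of several possible ensembling schemes; it was not evaluated in this notebook.

```python
import numpy as np

# Hypothetical sigmoid outputs for four samples; columns are neg, neu, pos.
scores = np.array(
    [
        [0.70, 0.40, 0.10],
        [0.20, 0.55, 0.30],
        [0.30, 0.35, 0.60],
        [0.45, 0.50, 0.05],
    ]
)
labels = np.array(["neg", "neu", "pos"])

# Each sample is assigned the class of its most confident one-vs-rest model.
predictions = labels[scores.argmax(axis=1)]
print(predictions)  # ['neg' 'neu' 'pos' 'neu']
```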

Overall, the results are unremarkable, but there are many possible improvements to consider. First, the subsample contains only 480 observations. Second, the architecture could be reconfigured with more layers and units.

[^top](#Contents)