# Contents
* [Introduction](#Introduction)
* [Imports and configuration](#Imports-and-configuration)
* [Load data](#Load-data)
* [Preprocess data](#Preprocess-data)
* [Development split](#Development-split)
* [Load audio](#Load-audio)
* [Save prepared interim data](#Save-prepared-interim-data)

## Introduction

This notebook preprocesses the data in preparation for the extraction of FRILL embeddings.

I tried several storage formats, but pickle was the only one I could get to work for saving the data in a form that is ready to load for extraction.

## Imports and configuration

In [1]:
# set random seeds

from os import environ
from random import seed as random_seed
from numpy.random import seed as np_seed
from tensorflow.random import set_seed


def reset_seeds(seed: int) -> None:
    """Utility function for resetting random seeds"""
    environ["PYTHONHASHSEED"] = str(seed)
    random_seed(seed)
    np_seed(seed)
    set_seed(seed)


reset_seeds(SEED := 2021)

In [2]:
# Extensions
%load_ext lab_black
%load_ext nb_black
%load_ext autotime

In [3]:
# Core
import numpy as np
import pandas as pd
from torchaudio import load as torchaudio_load
from sklearn.model_selection import StratifiedGroupKFold

# util
import swifter
from gc import collect as gc_collect
from os import remove
from shutil import rmtree

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

# hide warnings
import warnings

warnings.filterwarnings("ignore")

time: 3.88 s


In [4]:
# Location of pickled dataframes
PICKLED_DF_FOLDER = "../../Step 6 - Experiment With Various Models + Step 7 - Machine Learning Prototype/1.0-mic-divide_data_by_duration"

# Location where this notebook will output
DATA_OUT_FOLDER = "D:/interim_data"

# The preprocessed data from the Unified Multilingual Dataset of Emotional Human utterances
WAV_DIRECTORY = "../../../unified_multilingual_dataset_of_emotional_human_utterances/data/preprocessed"

time: 4.98 ms


## Load data

In [5]:
keep_columns = [
    "id",
    "file",
    "source",
    "length",
    "speaker_id",
    "speaker_gender",
    "lang1",
    "emo",
    "valence",
    "neg",
    "neu",
    "pos",
]
df = pd.read_pickle(f"{PICKLED_DF_FOLDER}/trimmed_dataframe.pkl").reset_index()[
    keep_columns
]

time: 249 ms


## Preprocess data

In this section, the combined dataframe is preprocessed before feature extraction. Three data sources contain non-speech samples, so we recode their language attributes first.

In [6]:
df.loc[df.source.isin({"vivae", "LimaCastroScott", "MAV"}), "lang1"] = "___"
df.lang1.value_counts()

eng    59410
cmn    17500
pes     2876
fra     1370
tur     1324
___     1292
est      863
ell      603
arz      579
deu      535
urd      400
Name: lang1, dtype: int64

time: 39 ms


In [7]:
# non-negative integer labels
# {0: negative/-1, 1: neutral/0, 2: positive/1}
df.valence = np.int8(df.valence) + 1

# encode as boolean for compactness
for valence in {"neg", "neu", "pos"}:
    df[valence] = np.bool_(df[valence])

# "category" is more compact than object dtype
for categorical in {"source", "length", "speaker_id", "speaker_gender", "lang1", "emo"}:
    df[categorical] = df[categorical].astype("category")

# np.uint32 can store 0 to 4294967295
df.id = df.id.astype(np.uint32)

time: 151 ms
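
The dtype choices above are about memory. As a rough illustration on invented data (the values and row count are assumptions, not the real dataset), this sketch shifts valence to non-negative labels and compares memory before and after the casts.

```python
import numpy as np
import pandas as pd

# toy frame; values are made up for illustration only
toy = pd.DataFrame(
    {
        "source": ["aesdd", "MELD", "BAUM2", "MELD"] * 2500,
        "valence": [-1, 0, 1, 0] * 2500,
    }
)
before = toy.memory_usage(deep=True).sum()

# shift valence from {-1, 0, 1} to {0, 1, 2} and store it in one byte
toy["valence"] = (toy["valence"] + 1).astype(np.int8)
# repeated strings are stored once as categories plus small integer codes
toy["source"] = toy["source"].astype("category")
after = toy.memory_usage(deep=True).sum()

print(f"{before:,} bytes -> {after:,} bytes")
```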


In [8]:
df.info()
df.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86752 entries, 0 to 86751
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   id              86752 non-null  uint32  
 1   file            86752 non-null  object  
 2   source          86752 non-null  category
 3   length          86752 non-null  category
 4   speaker_id      86752 non-null  category
 5   speaker_gender  86752 non-null  category
 6   lang1           86752 non-null  category
 7   emo             86752 non-null  category
 8   valence         86752 non-null  int8    
 9   neg             86752 non-null  bool    
 10  neu             86752 non-null  bool    
 11  pos             86752 non-null  bool    
dtypes: bool(3), category(6), int8(1), object(1), uint32(1)
memory usage: 1.9+ MB


|   | id | file | source | length | speaker_id | speaker_gender | lang1 | emo | valence | neg | neu | pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 00000+aesdd+aesdd.1+f+ang+-1+ell+el-gr.wav | aesdd | medium | aesdd.1 | f | ell | ang | 0 | True | False | False |


time: 55 ms


## Development split

In this section, we prepare a separate development set with stratified grouped sampling.

First, we set up the strata, making sure each stratum has at least two observations.

In [9]:
# copy so that the column assignment below does not write to a slice of df
strata = df[["speaker_id", "emo", "valence", "lang1", "length"]].copy()
strata.valence = strata.valence.astype(str)
strata = strata.swifter.apply("".join, axis=1)


def fix_strata_counts(strata: pd.Series) -> None:
    """Given a Series of strata labels, randomly pair those that only appear once (in place)."""
    solos = (
        strata.loc[
            strata.isin(
                (strata_counts := strata.value_counts()).loc[strata_counts == 1].index
            )
        ]
        .sample(frac=1, random_state=SEED)
        .reset_index(drop=True)
    )
    count = 0
    n_solos = len(solos)
    if n_solos == 1:
        pass
    else:
        for stratum1, stratum2 in zip(solos[: (_ := n_solos // 2)], solos[_:]):
            strata.replace((stratum1, stratum2), f"stratum_pair_{count}", inplace=True)
            count += 1
        if n_solos % 2:
            # odd one out joins the first pair
            strata.replace(solos[n_solos - 1], "stratum_pair_0", inplace=True)


fix_strata_counts(strata)
strata.value_counts()

Dask Apply: 100%|██████████| 16/16 [00:02<00:00,  5.94it/s]


MELD.Joeyneu1engmedium              825
MELD.Rossneu1engmedium              815
MELD.Chandlerneu1engmedium          728
MELD.Rachelneu1engmedium            666
MELD.Phoebeneu1engmedium            659
                                   ... 
BAUM2.S055sad0engmedium               2
stratum_pair_536                      2
stratum_pair_194                      2
BAUM2.S055sad0englong                 2
LEGOv2.20061122)023ang0engmedium      2
Length: 4344, dtype: int64

time: 13.7 s
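
The pairing step can be hard to follow inside the pandas chaining. Below is a minimal plain-Python sketch of the same idea on made-up labels. `pair_singletons` is a hypothetical helper, not part of the notebook, and it returns a new list rather than editing in place.

```python
import random
from collections import Counter


def pair_singletons(labels, seed=0):
    """Return a copy of labels in which labels that occur once are merged in random pairs."""
    counts = Counter(labels)
    solos = [label for label in labels if counts[label] == 1]
    random.Random(seed).shuffle(solos)
    half = len(solos) // 2
    merged = {}
    # pair the first half of the shuffled singletons with the second half
    for k, (first, second) in enumerate(zip(solos[:half], solos[half:])):
        merged[first] = merged[second] = f"stratum_pair_{k}"
    # with an odd count, zip leaves the last singleton out; add it to pair 0
    if len(solos) % 2 and half:
        merged[solos[-1]] = "stratum_pair_0"
    return [merged.get(label, label) for label in labels]


toy_labels = ["x", "x", "p", "q", "r", "y", "y"]  # p, q, r occur once
paired = pair_singletons(toy_labels)
print(Counter(paired))
```

As in the notebook function, a lone singleton with nothing to pair with is left untouched.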


Using these strata, we select the development set.

In [10]:
# hack for a ~25% stratified sample with non-overlapping speakers:
# take one fold of a 2-fold split (~50%), then one fold of that half (~25%)
_, presplit = next(
    StratifiedGroupKFold(n_splits=2, shuffle=True, random_state=SEED).split(
        X=df, y=strata, groups=df.speaker_id
    )
)

X = df.iloc[presplit]
y = strata.iloc[presplit]
fix_strata_counts(y)

_, dev_idx = next(
    StratifiedGroupKFold(n_splits=2, shuffle=True, random_state=SEED).split(
        X=X, y=y, groups=X.speaker_id
    )
)
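# split() yields positions within X, so map them back to positions within df
dev_idx = presplit[dev_idx]
nondev_idx = df.drop(df.index[dev_idx]).index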

f"{100 * len(dev_idx) / len(df):.1f}% in dev set"

'20.4% in dev set'

time: 2.01 s
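
scikit-learn splitters yield positions relative to the array they are given, so positions from a split of a subset refer to the subset, not to the original frame. Here is a small sketch with made-up index arrays of composing two such splits. `outer` and `inner` are hypothetical stand-ins for `presplit` and `dev_idx`.

```python
import numpy as np

n_rows = 12
# positions in the full data chosen by the first split
outer = np.array([1, 4, 5, 8, 9, 11])
# positions within the subset chosen by the second split
inner = np.array([0, 2, 5])

# compose the two: positions in the full data
dev_positions = outer[inner]
# everything else is the non-development set
nondev_positions = np.setdiff1d(np.arange(n_rows), dev_positions)

print(dev_positions)
print(nondev_positions)
```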


In [11]:
# clean up names
del _
del fix_strata_counts
del PICKLED_DF_FOLDER
del presplit
del strata
del X
del y
_ = gc_collect()

time: 148 ms


## Load audio

In [12]:
# Load wav files
df["ragged"] = df.file.apply(
    lambda row: torchaudio_load(filepath=f"{WAV_DIRECTORY}/{row}")[0][0]
)
_ = gc_collect()

time: 16min 18s


Next, we trim leading and trailing zeros from each waveform, cast it to float32, and add a leading axis so that each clip has shape (1, n_samples), ready for FRILL processing.

In [13]:
df.ragged = df.ragged.swifter.apply(
    lambda row: np.expand_dims(np.float32(np.trim_zeros(row.numpy())), axis=0)
)
_ = gc_collect()

Pandas Apply: 100%|██████████| 86752/86752 [00:14<00:00, 5801.54it/s] 


time: 15.7 s
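
To make the effect of that cell concrete, here is a minimal sketch on a made-up six-sample waveform. It uses `astype` instead of calling `np.float32` on the array, which should be equivalent here. The leading axis is presumably the batch dimension the FRILL model expects, which is an assumption on my part.

```python
import numpy as np

# invented waveform: zero padding at both ends, one zero in the middle
wave = np.array([0.0, 0.0, 0.25, 0.0, -0.5, 0.0], dtype=np.float64)

# trim_zeros removes only leading and trailing zeros; interior zeros stay
trimmed = np.trim_zeros(wave)

# cast to float32 and add a leading axis
batched = np.expand_dims(trimmed.astype(np.float32), axis=0)

print(batched.shape, batched.dtype)
```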


## Save prepared interim data

In [14]:
# save it somewhere you have space
save_file_dev_labels = f"{DATA_OUT_FOLDER}/dev_labels.feather"
save_file_nondev_labels = f"{DATA_OUT_FOLDER}/nondev_labels.feather"


def remove_old(old_file: str) -> None:
    """Given a file path, remove it if it exists."""
    try:
        remove(old_file)
        print("removed old file")
    except OSError:
        pass

    try:
        rmtree(f"{old_file}/")
        print("removed old tree")
    except OSError:
        pass


remove_old(save_file_dev_labels)
remove_old(save_file_nondev_labels)

_ = gc_collect()

removed old file
removed old file
time: 2.77 s


In [15]:
df.drop(columns="ragged").iloc[dev_idx].reset_index().to_feather(save_file_dev_labels)
df.drop(columns="ragged").iloc[nondev_idx].reset_index().to_feather(
    save_file_nondev_labels
)
_ = gc_collect()

time: 466 ms


In [21]:
def save_prefrill(dev: bool = True, div_size: int = 2500) -> None:
    """Save the pre-frill features in batches of div_size."""
    # select index set
    idx = dev_idx if dev else nondev_idx
    prefix = f"{DATA_OUT_FOLDER}/{'dev' if dev else 'nondev_prefrill/nondev'}_prefrill_"
    len_idx = len(idx)
    for i in range(last_i := len_idx // div_size):  # each "fold"
        remove_old(save_path := f"{prefix}{i}.pkl")
        _ = gc_collect()
        # save a pickle
        df[["id", "ragged"]].iloc[idx[(_ := i * div_size) : _ + div_size]].reset_index(
            drop=True
        ).to_pickle(save_path)
    if len_idx % div_size:  # leftovers
        remove_old(save_path := f"{prefix}{last_i}.pkl")
        df[["id", "ragged"]].iloc[idx[(last_i * div_size) :]].reset_index(
            drop=True
        ).to_pickle(save_path)

time: 32 ms
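
`save_prefrill` writes the full batches first and then any leftovers. The same slicing can be sketched with a single stepped `range`, shown here on plain integers. `batch_slices` is a hypothetical helper, not part of the notebook.

```python
def batch_slices(n_items, div_size):
    """Yield (batch_number, start, stop) covering range(n_items) in chunks of div_size."""
    for batch, start in enumerate(range(0, n_items, div_size)):
        # the final chunk is shorter when n_items is not a multiple of div_size
        yield batch, start, min(start + div_size, n_items)


slices = list(batch_slices(7, 3))
print(slices)  # [(0, 0, 3), (1, 3, 6), (2, 6, 7)]
```

Each tuple gives the batch number that would go in the file name and the positional slice of the index set to save.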


Let's save and preview the pickles.

In [17]:
_ = gc_collect()

time: 146 ms


In [18]:
save_prefrill(dev=True)
pd.read_pickle(f"{DATA_OUT_FOLDER}/dev_prefrill_0.pkl").head(1)
_ = gc_collect()

removed old file
removed old file
removed old file
removed old file
removed old file
removed old file


|   | id | ragged |
|---|---|---|
| 0 | 0 | [[0.0007324219, 0.0012207031, 0.002380371, 0.0... |


time: 56.1 s


In [22]:
# reduced div_size so that FRILL extraction on the nondev batches can continue on the side while I work with the dev features
save_prefrill(dev=False, div_size=500)
pd.read_pickle(f"{DATA_OUT_FOLDER}/nondev_prefrill/nondev_prefrill_0.pkl").head(1)
_ = gc_collect()

removed old file


|   | id | ragged |
|---|---|---|
| 0 | 1 | [[0.0020446777, 0.0004272461, -9.1552734e-05, ... |


time: 3min 17s


[^top](#Contents)