# Contents
* [Intro](#Intro)
* [Imports and config](#Imports-and-config)
* [Load data](#Load-data)
* [Additional preprocessing](#Additional-preprocessing)
  * [Trim again](#Trim-again)
  * [Find maximum duration](#Find-maximum-duration)
  * [Pad](#Pad)
* [Results](#Results)

## Intro

This notebook reads the short-duration files categorized in the previous notebook and loads each one as an array of waveform samples. Leading zeros are trimmed from each array, and the maximum array length is then computed. Every array is left-padded with zeros up to that maximum length. The resulting ragged and padded dataframes are saved to disk.

The medium and long sets are not processed in this notebook because the short set provides enough data to test several models quickly. In addition, storing the sample arrays as new columns increases the size of the data drastically.

## Imports and config

In [1]:
# Core
import numpy as np
import pandas as pd
import librosa

# util
from gc import collect as gc_collect
import swifter

In [2]:
# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [3]:
# Location of pickled dataframe
PICKLED_DF_FOLDER = "../1.0-mic-divide_data_by_duration"

# Location where this notebook will output
DATA_OUT_FOLDER = "."

# The preprocessed data from the Unified Multilingual Dataset of Emotional Human utterances
WAV_DIRECTORY = (
    "../../unified_multilingual_dataset_of_emotional_human_utterances/data/preprocessed"
)

In [4]:
# Extensions
%load_ext lab_black
%load_ext nb_black
%load_ext autotime

## Load data

In [5]:
_ = gc_collect()
short_df = pd.read_pickle(f"{PICKLED_DF_FOLDER}/short.pkl")
short_df.head(3)

| id | file | duration | source | speaker_id | speaker_gender | emo | valence | lang1 | lang2 | neg | neu | pos | length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1788 | 01788+BAUM1+BAUM1.s028+f+hap+1+tur+tr-tr.wav | 0.387 | BAUM1 | BAUM1.s028 | f | hap | 1 | tur | tr-tr | 0 | 0 | 1 | short |
| 2024 | 02024+BAUM2+BAUM2.S087+f+ang+-1+eng+en.wav | 0.417 | BAUM2 | BAUM2.S087 | f | ang | -1 | eng | en | 1 | 0 | 0 | short |
| 2196 | 02196+BAUM2+BAUM2.S239+f+ang+-1+tur+tr-tr.wav | 0.417 | BAUM2 | BAUM2.S239 | f | ang | -1 | tur | tr-tr | 1 | 0 | 0 | short |


time: 97 ms


## Additional preprocessing

Although the original dataset already trimmed leading silences, it did so with a 10 ms window. Leading zeros may therefore remain wherever the residual leading silence is shorter than 10 ms. The following sections remove those zeros and then left-pad the arrays with zeros.

### Trim again

In [6]:
trim_zeros = np.trim_zeros
load = librosa.load

time: 4 ms


In [7]:
_ = gc_collect()
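# Trim leading silence (more precisely than the original 10 ms window allowed)
# sr=None keeps each file's native sampling rate; [0] selects the sample array
short_df["ragged"] = short_df.file.apply(
    lambda filename: np.float32(
        trim_zeros(load(path=f"{WAV_DIRECTORY}/{filename}", sr=None)[0], trim="f")
    )
)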

time: 281 ms
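As a minimal sketch of what the trim does, the toy array below stands in for a loaded waveform (the values are made up for illustration and are not from the dataset). Note that `np.trim_zeros` removes only samples that are exactly zero, so low-level noise at the start of a file would not be trimmed.

```python
import numpy as np

# Hypothetical signal: three exact-zero samples of leading silence, then audio
signal = np.array([0.0, 0.0, 0.0, 0.25, -0.5, 0.0, 0.125, 0.0], dtype=np.float32)

# trim="f" strips zeros from the front only; interior and trailing zeros remain
trimmed = np.trim_zeros(signal, trim="f")

print(len(signal), len(trimmed))
print(trimmed)
```

Only the front is trimmed here, which matches the `trim="f"` argument used in the cell above.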


### Find maximum duration

The following cell obtains the maximum array length, in samples, after the sequences have been freshly trimmed.

In [8]:
_ = gc_collect()
max_ragged = short_df.ragged.swifter.apply(len).max()

Pandas Apply: 100%|██████████| 480/480 [00:00<00:00, 245610.09it/s]

time: 73 ms
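The value computed above is a sample count rather than a time. The sketch below shows the same maximum-length computation on toy data and one way to convert it to seconds. The `sample_rate` value is an assumption for illustration only; the notebook loads files with `sr=None`, so the real rate depends on the files themselves.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "ragged" column: arrays of differing lengths
ragged = pd.Series(
    [
        np.zeros(3, dtype=np.float32),
        np.zeros(5, dtype=np.float32),
        np.zeros(2, dtype=np.float32),
    ]
)

# Longest array, measured in samples
max_len = ragged.map(len).max()

# Assumed sampling rate (hypothetical); needed to express the length in seconds
sample_rate = 16_000
max_seconds = max_len / sample_rate

print(max_len, max_seconds)
```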





### Pad

The following cell left-pads each wav array with zeros up to the length of the longest array, so that all sequences end up the same length.

In [9]:
_ = gc_collect()
# Zero pad with leading silence
short_df["padded"] = short_df.ragged.swifter.apply(
    lambda row: np.pad(
        row,
        (max_ragged - len(row), 0),
        mode="constant",
        constant_values=0,
    ).tolist()
    # The arrays are cast to lists for downstream type concordance with PySpark
)

Pandas Apply: 100%|██████████| 480/480 [00:00<00:00, 2887.52it/s]

time: 292 ms
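The padding step can be illustrated on toy arrays. This is a sketch with made-up values, not the notebook's data: the pad width `(target - len(a), 0)` puts all the zeros before the samples and none after, so the arrays can be stacked into a rectangular matrix.

```python
import numpy as np

# Toy ragged arrays standing in for trimmed waveforms
ragged = [
    np.array([1.0, 2.0, 3.0], dtype=np.float32),
    np.array([4.0], dtype=np.float32),
]

# Length of the longest array
target = max(len(a) for a in ragged)

# (before, after) pad widths: all padding goes in front, none at the end
padded = [
    np.pad(a, (target - len(a), 0), mode="constant", constant_values=0)
    for a in ragged
]

# Equal lengths allow stacking into a 2-D array
stacked = np.stack(padded)
print(stacked)
```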





## Results

In [10]:
_ = gc_collect()
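assert len(short_df.ragged) == len(short_df.padded)
# Every padded sequence should be as long as the longest trimmed array
assert short_df.padded.apply(len).eq(max_ragged).all()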

time: 94 ms


In [11]:
_ = gc_collect()
short_df_ragged = short_df.drop(columns="padded")
short_df.drop(columns="ragged", inplace=True)

time: 103 ms


Finally, we will save our work. We will save the padded and ragged dataframes separately.

In [12]:
_ = gc_collect()
short_df.to_pickle(path=f"{DATA_OUT_FOLDER}/short_padded.pkl")
short_df_ragged.to_pickle(path=f"{DATA_OUT_FOLDER}/short_ragged.pkl")

time: 251 ms
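A downstream notebook would presumably read these pickles back with `pd.read_pickle`. The sketch below shows that round trip on a small made-up dataframe written to a temporary directory; the file name `demo_padded.pkl` and the column values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe with a list-valued "padded" column, as in the notebook
demo = pd.DataFrame(
    {"file": ["a.wav", "b.wav"], "padded": [[0.0, 0.5], [0.25, 0.75]]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_padded.pkl"  # hypothetical file name
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# After padding, every sequence should have the same length
lengths = restored["padded"].map(len)
is_rectangular = lengths.nunique() == 1
print(is_rectangular)
```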


[^top](#Contents)