# Contents
* [Introduction](#Introduction)
* [Imports and configuration](#Imports-and-configuration)
* [Metadata and labels](#Metadata-and-labels)
  * [emotiontts](#emotiontts)
  * [Thorsten_OGV_emotional](#Thorsten_OGV_emotional)
  * [x4nth055_SER_custom](#x4nth055_SER_custom)
* [Preprocess data](#Preprocess-data)
* [Labels](#Labels)
* [Prepare for FRILL extraction](#Prepare-for-FRILL-extraction)
* [Discussion](#Discussion)

# Introduction

Three holdout datasets have been identified. This notebook builds the metadata and labels for these datasets, preprocesses the audio, and extracts FRILL embeddings.

In [1]:
from time import time

notebook_begin_time = time()

# set random seeds

from os import environ
from random import seed as random_seed
from numpy.random import seed as np_seed
from tensorflow.random import set_seed


def reset_seeds(seed: int) -> None:
    """Utility function for resetting random seeds"""
    environ["PYTHONHASHSEED"] = str(seed)
    random_seed(seed)
    np_seed(seed)
    set_seed(seed)


reset_seeds(SEED := 2021)
del environ
del random_seed
del np_seed
del set_seed
del reset_seeds

In [2]:
# extensions
%load_ext autotime
%load_ext lab_black
%load_ext nb_black

In [3]:
# core
import numpy as np
import pandas as pd

# audio processing
from pydub import AudioSegment, effects
from pydub.silence import detect_leading_silence
from torchaudio import load as torchaudio_load

# tensorflow & tensorflow_hub
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub

# utility
from pathlib import Path
from gc import collect as gc_collect
from tqdm.notebook import tqdm

# faster
import swifter
from sklearnex import patch_sklearn

patch_sklearn()
del patch_sklearn

# typing
from typing import List

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
del InteractiveShell

time: 4.44 s


Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [4]:
# Location of audio directories
IN_FOLDER = "."

# Location where this notebook will output
OUT_FOLDER = "./interim"

# Location where the FRILL module is stored locally
LOCAL_FRILL = "../../../FRILL/"

_ = gc_collect()

time: 117 ms


# Metadata and labels

## emotiontts

https://github.com/emotiontts/emotiontts_open_db

In [5]:
dataset = "emotiontts"
emo_codes = (
    lambda code: "neu"
    if code <= 100
    else "hap"
    if code <= 200
    else "ang"
    if code <= 300
    else "sad"
)
valence = {"neu": "1", "hap": "2", "ang": "0", "sad": "0"}  # non-negative coding
lang1 = "kor"  # ISO 639-3 Korean
with open(f"{dataset}_data_files.tsv", "w") as f:
    for file in (Path(".") / dataset / "Emotional").glob("**/*.wav"):
        _ = f.write(
            "\t".join(
                # dataset+speakerid+gender+emo+val+lang1
                [
                    f"{dataset}/Emotional/{file.parents[2].name}/{(speaker := file.parents[1].name)}/wav/{(fname := file.name)}",
                    dataset,
                    f"{dataset}.{speaker}",
                    speaker[-1],
                    emo_code := emo_codes(int(fname[5:8])),
                    valence[emo_code],
                    f"{lang1}\n",
                ]
            )
        )

time: 36 ms
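The chained conditional in the cell above assigns emotions in blocks of 100 utterance numbers. As a minimal sketch of the same mapping, the thresholds can be kept in a sorted list and looked up with `bisect_left`. The names `UPPER_BOUNDS`, `EMOTIONS`, and `emo_code` are illustrative, and the file name in the usage line is hypothetical; only the slice `[5:8]` and the thresholds come from the cell above.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three blocks; anything above 300 is "sad".
UPPER_BOUNDS = [100, 200, 300]
EMOTIONS = ["neu", "hap", "ang", "sad"]


def emo_code(utterance_number: int) -> str:
    """Map an utterance number to its emotion code using the block thresholds."""
    return EMOTIONS[bisect_left(UPPER_BOUNDS, utterance_number)]


# Hypothetical file name whose characters 5..7 hold the utterance number.
example_name = "abcde150.wav"
example_emotion = emo_code(int(example_name[5:8]))
```

Keeping the thresholds in data rather than in nested conditionals makes the boundary cases (100, 200, 300) easy to check in isolation.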


## Thorsten_OGV_emotional

https://zenodo.org/record/5525023

In [6]:
dataset = "Thorsten_OGV_emotional"
# I deleted some of the unused folders already
# non-negative coding
valence = dict.fromkeys(["ang", "dis", "sur"], "0")
valence["neu"] = "1"
valence["amu"] = "2"
lang1 = "deu"  # ISO 639-3 German
with open(f"{dataset}_data_files.tsv", "w") as f:
    for file in tqdm((Path(".") / dataset).glob("**/*.wav")):
        _ = f.write(
            "\t".join(
                [
                    # dataset+speakerid+gender+emo+val+lang1
                    f"{dataset}/thorsten-emotional_v02/{(emotion := file.parent.name)}/{(fname := file.name)}",
                    dataset,
                    f"{dataset}.thorsten",
                    "m",
                    emo_code := emotion[:3],
                    valence[emo_code],
                    f"{lang1}\n",
                ]
            )
        )

0it [00:00, ?it/s]

time: 68 ms


## x4nth055_SER_custom

https://github.com/x4nth055/emotion-recognition-using-speech

In [7]:
dataset = "x4nth055_SER_custom"
# non-negative coding
valence = {"neu": "1", "hap": "2"}  # neutral is 1 under the non-negative coding (0=neg, 1=neu, 2=pos)
speaker_gender = {  # I tried my best, but buyer beware: no guarantee of accuracy
    "soumaya": "f",
    "walidlite": "m",
    "zaki": "m",
    "hamed": "m",
    "nadjib": "m",
    "oumaima": "f",
    "rockikz": "m",
}
lang1 = "arb"  # ISO 639-3 Modern Standard Arabic
with open(f"{dataset}_data_files.tsv", "w") as f:
    for file in tqdm((Path(".") / dataset).glob("**/*.wav")):
        # one English sample
        if file.stem == "this-is-good_happy":
            _ = f.write(
                "\t".join(
                    [
                        # dataset+speakerid+gender+emo+val+lang1
                        f"{dataset}/{file.parent.name}/this-is-good_happy.wav",
                        dataset,
                        f"{dataset}.unk",
                        "u",
                        "hap",
                        "2",
                        "eng\n",
                    ]
                )
            )
        else:
            speaker, emotion = file.stem.split("_")
            _ = f.write(
                "\t".join(
                    [
                        # dataset+speakerid+gender+emo+val+lang1
                        f"{dataset}/{file.parent.name}/{file.name}",
                        dataset,
                        f"{dataset}.{(speaker := speaker[:-1])}",
                        speaker_gender[speaker],
                        emo_code := emotion[:3],
                        valence[emo_code],
                        f"{lang1}\n",
                    ]
                )
            )

0it [00:00, ?it/s]

time: 40 ms
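The cell above splits each file stem on `_` and drops the last character of the speaker part, which assumes a single trailing digit. A slightly more defensive sketch, assuming stems of the form `<speaker><digits>_<emotion>`, uses a regular expression instead. `STEM_PATTERN`, `parse_stem`, and the example stems are hypothetical and only illustrate the idea.

```python
import re
from typing import Tuple

# Letters for the speaker name, optional trailing digits, then "_" and the emotion.
STEM_PATTERN = re.compile(r"^(?P<speaker>[A-Za-z]+)\d*_(?P<emotion>[A-Za-z]+)$")


def parse_stem(stem: str) -> Tuple[str, str]:
    """Return (speaker, three-letter emotion code) parsed from a file stem."""
    match = STEM_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file stem: {stem!r}")
    return match["speaker"], match["emotion"][:3]


# Hypothetical stems in the assumed naming scheme.
single_digit = parse_stem("zaki1_neutral")
double_digit = parse_stem("zaki12_happy")
```

A stem such as `this-is-good_happy` does not match the pattern and raises `ValueError`, so the special case for the English sample would still need its own branch.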


# Preprocess data

In [8]:
def load_file(in_path: str) -> AudioSegment:
    "Load and return an AudioSegment from file, trying progressively looser decoders."
    loaders = (
        lambda p: AudioSegment.from_file(p, format=p.rsplit(".", maxsplit=1)[-1]),
        lambda p: AudioSegment.from_file(p),
        lambda p: AudioSegment.from_file(p, format="wav"),
        lambda p: AudioSegment.from_wav(p),
        lambda p: AudioSegment.from_file_using_temporary_files(p),
        lambda p: AudioSegment.from_file(p, format="aac"),
    )
    for loader in loaders:
        try:
            return loader(in_path)
        except Exception:  # fall through to the next decoder
            continue
    # last resort; an exception raised here propagates to the caller
    return AudioSegment.from_file(in_path, codec="pcm_u16be")


def trim_leading_silence(x: AudioSegment) -> AudioSegment:
    "Return x without its leading audio quieter than -60 dBFS."
    return x[detect_leading_silence(x, silence_threshold=-60.0) :]


# detect_leading_silence(_) should return 0 if no leading silence found.
# detect_leading_silence() should return len(_) if no end is found (it's all silent) and raise IndexError as a result
# default silence_threshold is -50.0 dBFS; lowering it reduces the amount of information lost (in theory)


def process_audio(file_: str, line_info: List[str], count: int = 0) -> int:
    """Load, set sample width, normalize, trim, set to mono, set frame rate, and write a sound file to OUT_FOLDER.

    Returns 1 if the audio is all silence or shorter than 200 ms after trimming,
    2 if longer than 1 minute, or 0 after export.
    """

    # Load
    _ = load_file(in_file := f"{IN_FOLDER}/{file_}")

    # Sample width 16 bits
    try:
        _ = _.set_sample_width(2)
    except Exception:
        _ = AudioSegment.from_file_using_temporary_files(in_file).set_sample_width(2)
    # Normalize
    _ = effects.normalize(_)

    # Trim silence
    try:
        _ = trim_leading_silence(_)
        # if not DISCOURSE_CONTEXT[file_.split("+", maxsplit=2)[1]]:
        # none of the holdout data were sampled from a discourse context
        _ = trim_leading_silence(_.reverse()).reverse()
    except IndexError:
        print(f"all silence detected during trim for {file_}")
        return 1
    # Filter short and long
    if (duration := len(_)) < 200:
        print(f"{file_} is less than 200 ms after trim")
        return 1
    if duration > 60000:
        print(f"{file_} exceeds 1 minute in duration")
        return 2
    # Set mono 16 kHz
    _ = _.set_channels(1)
    _ = _.set_frame_rate(16000)

    # Write
    # count+dataset+speakerid+gender+emo+val+lang1
    fname = "+".join([f"{count:05}"] + line_info)
    _.export(
        out_f=f"{OUT_FOLDER}/{fname}.wav",
        format="wav",
        codec="pcm_s16le",
        bitrate="128k",
    )

    return 0


def get_list_of_files() -> List[str]:
    """Return the files and labels to process by concatenating the TSVs from above"""
    datasets = ["emotiontts", "Thorsten_OGV_emotional", "x4nth055_SER_custom"]
    lines = []
    for name in datasets:
        with open(f"{name}_data_files.tsv", "r") as f:
            lines.extend(f.readlines())
    return lines

time: 3 ms
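The trimming step above removes leading silence, reverses the audio, removes leading silence again, and reverses back. The following is a simplified, dependency-free sketch of that idea, not pydub's implementation: it works on a list of float samples in [-1, 1] rather than on milliseconds of an `AudioSegment`, and the helper names (`dbfs`, `leading_silence`, `trim_both_ends`) are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def dbfs(chunk: Sequence[float]) -> float:
    """RMS level of a chunk in dB relative to full scale (samples in [-1, 1])."""
    if not chunk:
        return -math.inf
    rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
    return 20 * math.log10(rms) if rms > 0 else -math.inf


def leading_silence(
    samples: Sequence[float], threshold_db: float = -60.0, chunk_size: int = 10
) -> int:
    """Count leading samples, in whole chunks, that stay below threshold_db."""
    pos = 0
    while pos < len(samples) and dbfs(samples[pos : pos + chunk_size]) < threshold_db:
        pos += chunk_size
    return min(pos, len(samples))


def trim_both_ends(samples: List[float], threshold_db: float = -60.0) -> List[float]:
    """Trim leading silence, then trailing silence via the reversed signal."""
    trimmed = samples[leading_silence(samples, threshold_db) :]
    tail = leading_silence(trimmed[::-1], threshold_db)
    return trimmed[: len(trimmed) - tail]


# 30 silent samples, 20 loud samples, 25 silent samples.
signal = [0.0] * 30 + [0.5, -0.5] * 10 + [0.0] * 25
trimmed_signal = trim_both_ends(signal)
all_silent = trim_both_ends([0.0] * 40)
```

Because detection advances in whole chunks, up to `chunk_size - 1` silent samples can survive at either end; here five trailing zeros remain. An all-silent input comes back empty in this sketch, which is the case the duration filter in `process_audio` is meant to catch.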


In [9]:
bad_files, short_files, long_files = [], [], []

count = 0
for line in tqdm(get_list_of_files()):
    line = line.rstrip().split("\t")
    sample_file = line.pop(0)
    try:
        if (return_code := process_audio(sample_file, line, count)) == 1:
            short_files.append(sample_file)
        elif return_code == 2:
            long_files.append(sample_file)
    except Exception as e:
        bad_files.append((sample_file, e))
    count += 1
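# a for/else clause would always run here because the loop never breaks
if not (bad_files or short_files or long_files):
    print("no problems in conversion")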

print(
    f"There were {len(short_files)} files discarded because they were shorter than 200 milliseconds."
)
print(
    f"There were {len(long_files)} files discarded because duration exceeded 1 minute."
)
print(f"There were {len(bad_files)} bad files.")
if bad_files:
    print(*bad_files, sep="\n")

  0%|          | 0/1957 [00:00<?, ?it/s]

no problems in conversion
There were 0 files discarded because they were shorter than 200 milliseconds.
There were 0 files discarded because duration exceeded 1 minute.
There were 0 bad files.
time: 3min 12s


# Labels

In [10]:
labels = {
    "id": [],
    "file": [],
    "source": [],
    "speaker_id": [],
    "speaker_gender": [],
    "lang1": [],
    "emo": [],
    "valence": [],
    "neg": [],
    "neu": [],
    "pos": [],
}
for file in tqdm(Path("./interim").glob("*.wav")):
    labels["file"].append(file := file.name)
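    # drop the extension; str.rstrip(".wav") would strip trailing characters, not the suffix
    line = file.rsplit(".", maxsplit=1)[0].split("+")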
    for key in (
        "id",
        "source",
        "speaker_id",
        "speaker_gender",
        "emo",
        "valence",
        "lang1",
    ):
        labels[key].append(line.pop(0))
    valence = labels["valence"][-1]
    labels["neg"].append(valence == "0")
    labels["neu"].append(valence == "1")
    labels["pos"].append(valence == "2")
labels = pd.DataFrame(labels)
for categorical in {"source", "speaker_id", "speaker_gender", "lang1", "emo"}:
    # "category" is more compact than object dtype
    labels[categorical] = labels[categorical].astype("category")
labels.valence = labels.valence.astype(np.int8)
labels.id = labels.id.astype(np.uint32)  # np.uint32 can store 0 to 4294967295
labels.to_feather("./holdout_labels.feather")
labels.info()

0it [00:00, ?it/s]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   id              1957 non-null   uint32  
 1   file            1957 non-null   object  
 2   source          1957 non-null   category
 3   speaker_id      1957 non-null   category
 4   speaker_gender  1957 non-null   category
 5   lang1           1957 non-null   category
 6   emo             1957 non-null   category
 7   valence         1957 non-null   int8    
 8   neg             1957 non-null   bool    
 9   neu             1957 non-null   bool    
 10  pos             1957 non-null   bool    
dtypes: bool(3), category(5), int8(1), object(1), uint32(1)
memory usage: 41.8+ KB
time: 126 ms
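The labels are recovered from the `+`-delimited file names written during preprocessing. As a small sketch of that round trip, the helpers below build a name in the `count+dataset+speakerid+gender+emo+val+lang1` layout and parse it back. `FIELDS`, `build_name`, `parse_name`, and the example values are hypothetical; the last lines illustrate why removing the extension with `str.rstrip` is fragile.

```python
from typing import Dict, List

FIELDS = ["id", "source", "speaker_id", "speaker_gender", "emo", "valence", "lang1"]


def build_name(count: int, line_info: List[str]) -> str:
    """Join a zero-padded counter and the label fields into a .wav file name."""
    return "+".join([f"{count:05}"] + line_info) + ".wav"


def parse_name(file_name: str) -> Dict[str, str]:
    """Split a file name back into its label fields, dropping the extension."""
    stem = file_name.rsplit(".", maxsplit=1)[0]
    return dict(zip(FIELDS, stem.split("+")))


name = build_name(7, ["ds", "ds.spk1", "f", "hap", "2", "eng"])
record = parse_name(name)

# str.rstrip removes any trailing run of the given characters, not a suffix,
# so a final field made of the letters w, a, v would be eaten as well.
fragile = "x+ava.wav".rstrip(".wav")
robust = "x+ava.wav".rsplit(".", maxsplit=1)[0]
```

Splitting on the last dot keeps speaker ids such as `ds.spk1` intact because only the final `.wav` is removed.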


# Prepare for FRILL extraction

In [11]:
df = labels.loc[:, ["id", "file"]]
del labels
_ = gc_collect()

time: 135 ms


In [12]:
# Load wav files
df["ragged"] = df.file.apply(
    lambda row: torchaudio_load(filepath=f"interim/{row}")[0][0]
)
_ = gc_collect()

time: 18.7 s


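Next, exact zeros are trimmed from both ends of each waveform, which is then cast to float32 and given a leading batch dimension so that it can be passed to the FRILL module below.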

In [13]:
df.ragged = df.ragged.swifter.apply(
    lambda row: np.expand_dims(np.float32(np.trim_zeros(row.numpy())), axis=0)
)
_ = gc_collect()

Pandas Apply: 100%|██████████| 1957/1957 [00:00<00:00, 46596.50it/s]

time: 170 ms
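To make the effect of that step concrete, here is a minimal sketch on a toy array. `prepare_for_frill` is an illustrative name, and the sketch uses `astype` where the cell above calls `np.float32` directly; it assumes the waveform is already a one-dimensional NumPy array.

```python
import numpy as np


def prepare_for_frill(waveform: np.ndarray) -> np.ndarray:
    """Trim exact zeros from both ends, cast to float32, add a batch axis."""
    trimmed = np.trim_zeros(waveform)  # default trims front and back
    return np.expand_dims(trimmed.astype(np.float32), axis=0)


# Toy waveform: zeros at both ends and one zero in the interior.
waveform = np.array([0.0, 0.0, 0.25, 0.0, -0.5, 0.0], dtype=np.float64)
batch = prepare_for_frill(waveform)
```

Only samples that are exactly zero at the ends are removed; interior zeros and near-zero noise are kept, so this is a much lighter trim than the dBFS-based one applied during preprocessing.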





In [14]:
# Load FRILL
tf.enable_v2_behavior()
# module = hub.load("https://tfhub.dev/google/nonsemantic-speech-benchmark/frill/1")
module = hub.load(LOCAL_FRILL)

time: 31.2 s


In [15]:
df["frill"] = df.ragged.swifter.apply(lambda _: module(_)["embedding"][0])
_ = gc_collect()
id = df.id

Pandas Apply: 100%|██████████| 1957/1957 [13:49<00:00,  2.36it/s]


time: 15min 3s


In [16]:
df = pd.DataFrame(df.frill.tolist())
_ = gc_collect()

time: 13min 30s


In [17]:
df = df.astype(np.float32)
df.columns = df.columns.astype(str)
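# NOTE: attribute assignment does not create a new column, so the saved frame
# keeps only the 2048 embedding columns; df["id"] = id.to_numpy() would add it
df.id = id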
_ = gc_collect()

time: 27min 31s
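The cell above assigns `df.id = id` on a frame that has no `id` column. In pandas, attribute-style assignment of a new name sets a plain Python attribute and does not add a column. The sketch below shows the difference on a toy frame; the data and the name `ids` are made up for illustration.

```python
import warnings

import pandas as pd

df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], columns=["0", "1"])
ids = pd.Series([10, 11], name="id")

# Attribute assignment: pandas warns and stores a plain attribute only.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    df.id = ids
has_id_after_attribute = "id" in df.columns

# Item assignment: this is what actually creates the column.
df["id"] = ids.to_numpy()
has_id_after_item = "id" in df.columns
```

Passing `ids.to_numpy()` sidesteps index alignment, which matters when the two objects do not share the same index.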


In [18]:
df.to_feather("./holdout_FRILL.feather")
_ = pd.read_feather("./holdout_FRILL.feather")
_.head()
_.info()
del _
_ = gc_collect()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2038 | 2039 | 2040 | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.035624 | 0.148898 | -0.109508 | -0.016199 | -0.084427 | -0.067872 | 0.174007 | -0.033383 | -0.017896 | -0.024012 | ... | -0.153698 | 0.041551 | -0.003445 | -0.059062 | -0.028037 | -0.026091 | 0.101007 | -0.078234 | 0.112814 | 0.098813 |
| 1 | -0.103035 | 0.082011 | 0.110948 | -0.060209 | -0.073397 | 0.009772 | -0.02083 | 0.044276 | 0.046892 | -0.115516 | ... | 0.065289 | 0.075569 | -0.089013 | -0.161597 | 0.017945 | 0.073644 | 0.058319 | 0.030772 | 0.066552 | 0.088602 |
| 2 | -0.065101 | 0.038458 | -0.029423 | 0.005429 | -0.023002 | 0.152135 | 0.053979 | -0.039011 | 0.017791 | -0.077413 | ... | -0.042246 | -0.00432 | -0.0402 | -0.049798 | -0.081791 | -0.091844 | 0.164188 | 0.060485 | -0.003494 | 0.04829 |
| 3 | 0.083481 | 0.138969 | -0.030347 | 0.062699 | 0.073357 | -0.039354 | -0.089715 | 0.113456 | 0.039636 | -0.041954 | ... | 0.02292 | -0.181787 | 0.011134 | -0.073927 | -0.014689 | 0.049779 | 0.044533 | -0.063511 | 0.020476 | -0.018876 |
| 4 | 0.019141 | 0.170539 | -0.056348 | 0.032532 | -0.089482 | 0.079703 | 0.037175 | 0.046567 | -0.053409 | -0.061816 | ... | 0.078809 | -0.044437 | -0.086229 | 0.048539 | 0.104972 | -0.050087 | 0.052254 | 0.117372 | 0.042914 | 0.049517 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957 entries, 0 to 1956
Columns: 2048 entries, 0 to 2047
dtypes: float32(2048)
memory usage: 15.3 MB
time: 5.21 s


# Discussion

emotiontts is the freely available sample of a larger Korean corpus for text-to-speech synthesis.

Thorsten_OGV_emotional consists of utterances from a male German speaker.

x4nth055_SER_custom comes from GitHub user x4nth055's repository for a speech emotion recognition project. While that project used much of the same data as ours, the author also uploaded some custom-recorded speech samples.

In [19]:
print(f"Time elapsed since notebook_begin_time: {time() - notebook_begin_time} s")
_ = gc_collect()

Time elapsed since notebook_begin_time: 3625.6541702747345 s
time: 2 s


[^top](#Contents)