# Contents
* [Introduction](#Introduction)
* [Imports and configuration](#Imports-and-configuration)
* [Data](#Data)
* [Methods](#Methods)
  * [Method 1](#Method-1)
  * [Method 2](#Method-2)
* [Results](#Results)
  * [Trial 1](#Trial-1)
  * [Trial 2](#Trial-2)
  * [Trial 3](#Trial-3)
* [Discussion](#Discussion)

## Introduction

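In this notebook, we compare two methods for reformatting the extracted FRILL embeddings into a `pandas` DataFrame: a list comprehension and the `.tolist()` method. A slow conversion is an avoidable drag on development. Across three trials on 125-row samples, `.tolist()` was slightly faster than the list comprehension (by roughly 0.4 to 1.2 s per loop), although that gap is smaller than the reported standard deviation of each measurement.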

## Imports and configuration

In [1]:
# set random seeds

from os import environ
from random import seed as random_seed
from numpy.random import seed as np_seed
from tensorflow.random import set_seed


def reset_seeds(seed: int) -> None:
    """Utility function for resetting random seeds"""
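    # PYTHONHASHSEED is read at interpreter startup, so setting it here
    # only affects child processes, not hashing in the running kernel.
    environ["PYTHONHASHSEED"] = str(seed)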
    random_seed(seed)
    np_seed(seed)
    set_seed(seed)


reset_seeds(SEED := 2021)

In [2]:
# extensions
%load_ext autotime
%load_ext lab_black
%load_ext nb_black

In [3]:
# core
import numpy as np
import pandas as pd
import swifter

# tensorflow & tensorflow_hub
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub

# utility
from gc import collect as gc_collect

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

time: 1.93 s


In [4]:
# Location of trial data
TRIAL_DATA_FOLDER = "D:/interim_data"

# Location where the FRILL module is stored locally
LOCAL_FRILL = "../../../FRILL/"

time: 1e+03 µs


In [5]:
# Load FRILL
tf.enable_v2_behavior()
# module = hub.load("https://tfhub.dev/google/nonsemantic-speech-benchmark/frill/1")
module = hub.load(LOCAL_FRILL)

time: 31 s


## Data

In [6]:
df = pd.read_pickle(f"{TRIAL_DATA_FOLDER}/dev_prefrill_0.pkl").sample(
    frac=1, random_state=SEED
)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 2394 to 1140
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      2500 non-null   uint32
 1   ragged  2500 non-null   object
dtypes: object(1), uint32(1)
memory usage: 48.8+ KB
time: 421 ms


## Methods

### Method 1

In [7]:
def method1(sample: pd.DataFrame) -> pd.DataFrame:
    """Test list comprehension"""
    return pd.DataFrame([_ for _ in sample.frill])

time: 982 µs


### Method 2

In [8]:
def method2(sample: pd.DataFrame) -> pd.DataFrame:
    """Test .tolist()"""
    return pd.DataFrame(sample.frill.tolist())

time: 1 ms
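
Before timing the two methods, it may help to confirm that they produce the same DataFrame. The following is a minimal sketch on a hypothetical stand-in for `sample`: the name `toy`, the 5-row size, and the 8-dimensional NumPy vectors are assumptions for illustration only. In the notebook itself, each `frill` cell holds the output of the FRILL module, so absolute timings on this toy data would not be comparable to the results reported here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)

# Hypothetical stand-in for `sample`: each cell of "frill" holds one
# fixed-length vector (NumPy arrays here, not real FRILL embeddings).
toy = pd.DataFrame({"id": np.arange(5, dtype=np.uint32)})
toy["frill"] = [rng.normal(size=8) for _ in range(len(toy))]


def method1(sample: pd.DataFrame) -> pd.DataFrame:
    """Test list comprehension"""
    return pd.DataFrame([_ for _ in sample.frill])


def method2(sample: pd.DataFrame) -> pd.DataFrame:
    """Test .tolist()"""
    return pd.DataFrame(sample.frill.tolist())


out1 = method1(toy)
out2 = method2(toy)

# Both methods should yield one row per sample and one column per dimension.
same_result = out1.equals(out2)
print(out1.shape, same_result)
```

If the two frames match, any timing difference comes from how the intermediate list is built rather than from a difference in the output.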


## Results

### Trial 1

In [10]:
trial_num = 1

time: 1 ms


In [11]:
sample = df.sample(frac=0.05, random_state=SEED + trial_num)
_ = gc_collect()
sample["frill"] = sample.ragged.swifter.apply(lambda _: module(_)["embedding"][0])

Pandas Apply: 100%|██████████| 125/125 [01:21<00:00,  1.54it/s]

time: 1min 27s





In [12]:
_ = gc_collect()
%timeit _ = method1(sample)
del _

30.9 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 4min 11s


In [13]:
_ = gc_collect()
%timeit _ = method2(sample)
del _

29.9 s ± 1.01 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 4min 2s


### Trial 2

In [14]:
del sample
trial_num = 2

time: 2.99 ms


In [15]:
sample = df.sample(frac=0.05, random_state=SEED + trial_num)
_ = gc_collect()
sample["frill"] = sample.ragged.swifter.apply(lambda _: module(_)["embedding"][0])

Pandas Apply: 100%|██████████| 125/125 [01:21<00:00,  1.54it/s]

time: 1min 26s





In [16]:
_ = gc_collect()
%timeit _ = method1(sample)
del _

31.3 s ± 822 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 4min 12s


In [17]:
_ = gc_collect()
%timeit _ = method2(sample)
del _

30.9 s ± 975 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 4min 10s


### Trial 3

In [18]:
del sample
trial_num = 3

time: 6 ms


In [19]:
sample = df.sample(frac=0.05, random_state=SEED + trial_num)
_ = gc_collect()
sample["frill"] = sample.ragged.swifter.apply(lambda _: module(_)["embedding"][0])

Pandas Apply: 100%|██████████| 125/125 [01:27<00:00,  1.44it/s]

time: 1min 31s





In [20]:
_ = gc_collect()
%timeit _ = method1(sample)
del _

32.2 s ± 1.91 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 4min 23s


In [21]:
_ = gc_collect()
%timeit _ = method2(sample)
del _

31 s ± 1.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 4min 12s


## Discussion

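The second method (`.tolist()`) wins by a small margin in all three trials: 0.4 to 1.2 s per loop, or roughly 1 to 4% of the loop time. In each trial the margin is smaller than the reported standard deviation of either measurement, so the advantage is consistent in direction but may not be significant.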
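
The margins can be set against the run-to-run spread with a short calculation. This sketch hard-codes the `%timeit` means and standard deviations reported in the three trials; the dictionary layout and the "margin smaller than the smaller standard deviation" criterion are illustrative assumptions, not a formal significance test.

```python
# (method1 mean, method1 std, method2 mean, method2 std), in seconds per loop,
# copied from the %timeit outputs of Trials 1-3.
trials = {
    1: (30.9, 1.37, 29.9, 1.01),
    2: (31.3, 0.822, 30.9, 0.975),
    3: (32.2, 1.91, 31.0, 1.55),
}

summary = {}
for trial, (m1, s1, m2, s2) in trials.items():
    margin = m1 - m2  # positive means method2 (.tolist()) was faster
    summary[trial] = {
        "margin_s": round(margin, 2),
        "margin_pct": round(100 * margin / m1, 1),
        # Rough noise check: is the gap smaller than the smaller std. dev.?
        "within_noise": margin < min(s1, s2),
    }
    print(trial, summary[trial])
```

A larger sample, or more `%timeit` runs per method, would be needed to tell whether the gap reflects a real difference.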

[^top](#Contents)