# Contents
* [Introduction](#Introduction)
* [Imports and configuration](#Imports-and-configuration)
* [Data](#Data)
* [Methods](#Methods)
  * [Method 1](#Method-1)
  * [Method 2](#Method-2)
  * [Method 3](#Method-3)
  * [Method 4](#Method-4)
* [Results](#Results)
  * [Trial 1](#Trial-1)
  * [Trial 2](#Trial-2)
  * [Trial 3](#Trial-3)
* [Discussion](#Discussion)

## Introduction

In this notebook, we compare four methods for reformatting the extracted FRILL embeddings into a `pandas` DataFrame of `np.float32` values. A slow conversion is an avoidable drag on development. Across three trials, casting with the `.astype()` method appears to be slightly faster, although the margins are small.

## Imports and configuration

In [1]:
# set random seeds

from os import environ
from random import seed as random_seed
from numpy.random import seed as np_seed
from tensorflow.random import set_seed


def reset_seeds(seed: int) -> None:
    """Utility function for resetting random seeds"""
    environ["PYTHONHASHSEED"] = str(seed)
    random_seed(seed)
    np_seed(seed)
    set_seed(seed)


reset_seeds(SEED := 2021)

In [2]:
# extensions
%load_ext autotime
%load_ext lab_black
%load_ext nb_black

In [3]:
# core
import numpy as np
import pandas as pd
import swifter

# tensorflow & tensorflow_hub
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub

# utility
from gc import collect as gc_collect

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

time: 2.1 s


In [4]:
# Location of trial data
TRIAL_DATA_FOLDER = "D:/interim_data"

# Location where the FRILL module is stored locally
LOCAL_FRILL = "../../../FRILL/"

time: 999 µs


In [5]:
# Load FRILL
tf.enable_v2_behavior()
# module = hub.load("https://tfhub.dev/google/nonsemantic-speech-benchmark/frill/1")
module = hub.load(LOCAL_FRILL)

time: 33.4 s


## Data

In [6]:
df = pd.read_pickle(f"{TRIAL_DATA_FOLDER}/dev_prefrill_0.pkl").sample(
    frac=1, random_state=SEED
)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 2394 to 1140
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      2500 non-null   uint32
 1   ragged  2500 non-null   object
dtypes: object(1), uint32(1)
memory usage: 48.8+ KB
time: 412 ms


## Methods

### Method 1

In [7]:
def method1(sample: pd.DataFrame) -> pd.DataFrame:
    """Test cast by type argument"""
    _ = pd.DataFrame(sample.frill.tolist(), dtype=np.float32)
    return _

time: 1 ms


### Method 2

In [8]:
def method2(sample: pd.DataFrame) -> np.ndarray:
    """Test cast by function call (np.float32 on a DataFrame yields an ndarray)"""
    _ = pd.DataFrame(sample.frill.tolist())
    return np.float32(_)

time: 2.01 ms


### Method 3

In [10]:
def method3(sample: pd.DataFrame) -> pd.DataFrame:
    """Test cast by swifter apply"""
    _ = pd.DataFrame(sample.frill.tolist())
    return _.swifter.apply(np.float32)

time: 985 µs


### Method 4

In [11]:
def method4(sample: pd.DataFrame) -> pd.DataFrame:
    """Test cast by .astype()"""
    _ = pd.DataFrame(sample.frill.tolist())
    return _.astype(np.float32)

time: 1.04 ms
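
The casts used by Methods 1, 2 and 4 can be compared on a small synthetic frame to check that they agree on values while differing in return type. This is a minimal sketch: the `toy` frame and its 8-dimensional random vectors are hypothetical stand-ins for the real FRILL embeddings, and Method 3 is left out to avoid the `swifter` dependency.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the FRILL output: 5 rows of 8-dim float64 vectors
rng = np.random.default_rng(2021)
toy = pd.DataFrame({"frill": [rng.normal(size=8) for _ in range(5)]})

# Cast via the constructor's dtype argument (as in Method 1)
by_dtype_arg = pd.DataFrame(toy.frill.tolist(), dtype=np.float32)

# Cast via a call to np.float32 (as in Method 2)
by_func_call = np.float32(pd.DataFrame(toy.frill.tolist()))

# Cast via .astype() (as in Method 4)
by_astype = pd.DataFrame(toy.frill.tolist()).astype(np.float32)

# Report the container type produced by each cast
for name, result in [
    ("dtype argument", by_dtype_arg),
    ("function call", by_func_call),
    ("astype", by_astype),
]:
    print(name, type(result).__name__, np.asarray(result).dtype)

# The float32 values should match regardless of how the cast was done
print(np.array_equal(by_dtype_arg.to_numpy(), by_astype.to_numpy()))
print(np.array_equal(np.asarray(by_func_call), by_astype.to_numpy()))
```

If a DataFrame is needed downstream, the function-call variant would need to be wrapped in `pd.DataFrame(...)` again, which is worth keeping in mind when reading its timings.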


## Results

In [12]:
def make_sample(trial_num: int) -> pd.DataFrame:
    """Draw a 100-row sample with a seed that differs per trial"""
    return df.sample(n=100, random_state=SEED + trial_num)

time: 979 µs
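
For anyone rerunning the comparison outside IPython, the `%timeit` cells in the trials can be approximated with the standard-library `timeit` module. The sketch below is an assumption-laden illustration, not the notebook's own procedure: `toy` and `cast_astype` are hypothetical names, and the data are random vectors rather than FRILL embeddings.

```python
import timeit
from statistics import mean, pstdev

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `sample`
rng = np.random.default_rng(2021)
toy = pd.DataFrame({"frill": [rng.normal(size=16) for _ in range(50)]})


def cast_astype(frame: pd.DataFrame) -> pd.DataFrame:
    """Same steps as Method 4, applied to the toy frame"""
    return pd.DataFrame(frame.frill.tolist()).astype(np.float32)


# 7 runs of 1 loop each, mirroring the "7 runs, 1 loop each" setting
runs = timeit.repeat(lambda: cast_astype(toy), repeat=7, number=1)

# Summarize the runs as a mean and a spread, in seconds
print(f"{mean(runs):.6f} s ± {pstdev(runs):.6f} s per loop (7 runs, 1 loop each)")
```

Absolute numbers from this toy frame will not resemble the timings reported in the trials; only the measurement pattern carries over.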


### Trial 1

In [13]:
trial_num = 1

time: 1.03 ms


In [14]:
sample = make_sample(trial_num)
_ = gc_collect()
sample["frill"] = sample.ragged.swifter.apply(lambda _: module(_)["embedding"][0])

Pandas Apply: 100%|██████████| 100/100 [01:04<00:00,  1.54it/s]

time: 1min 9s





In [15]:
_ = gc_collect()
%timeit _ = method1(sample)
del _

25.7 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 25s


In [16]:
_ = gc_collect()
%timeit _ = method2(sample)
del _

26.8 s ± 1.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 35s


In [17]:
_ = gc_collect()
%timeit _ = method3(sample)
del _

25.9 s ± 773 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 29s


In [18]:
_ = gc_collect()
%timeit _ = method4(sample)
del _

24.8 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 20s


### Trial 2

In [19]:
del sample
trial_num = 2

time: 3.05 ms


In [20]:
sample = make_sample(trial_num)
_ = gc_collect()
sample["frill"] = sample.ragged.swifter.apply(lambda _: module(_)["embedding"][0])

Pandas Apply: 100%|██████████| 100/100 [00:59<00:00,  1.68it/s]

time: 1min 3s





In [21]:
_ = gc_collect()
%timeit _ = method1(sample)
del _

23.8 s ± 809 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 13s


In [22]:
_ = gc_collect()
%timeit _ = method2(sample)
del _

23.2 s ± 294 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 6s


In [23]:
_ = gc_collect()
%timeit _ = method3(sample)
del _

23.3 s ± 684 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 9s


In [24]:
_ = gc_collect()
%timeit _ = method4(sample)
del _

23.1 s ± 479 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 6s


### Trial 3

In [25]:
del sample
trial_num = 3

time: 3 ms


In [26]:
sample = make_sample(trial_num)
_ = gc_collect()
sample["frill"] = sample.ragged.swifter.apply(lambda _: module(_)["embedding"][0])

Pandas Apply: 100%|██████████| 100/100 [00:53<00:00,  1.87it/s]

time: 56.8 s





In [27]:
_ = gc_collect()
%timeit _ = method1(sample)
del _

22.7 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 3s


In [28]:
_ = gc_collect()
%timeit _ = method2(sample)
del _

23.4 s ± 618 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 7s


In [29]:
_ = gc_collect()
%timeit _ = method3(sample)
del _

24.8 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 22s


In [30]:
_ = gc_collect()
%timeit _ = method4(sample)
del _

24.8 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
time: 3min 19s


## Discussion

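Method 4 (`.astype()`) had the lowest mean time in 2 of 3 trials: 24.8 s in Trial 1 and 23.1 s in Trial 2. Method 1 was fastest in Trial 3 at 22.7 s. The margins are small; in Trial 2 the four means fall within 0.7 s of each other, which is comparable to the reported standard deviations.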
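
The tally above can be reproduced from the mean times reported by `%timeit` in the three trials. The following sketch simply transcribes those means into a dictionary (the `means`, `winners` and `overall` names are illustrative) and picks the lowest per trial; it also averages each method's means across trials as a second, equally rough view.

```python
# Mean seconds per loop, transcribed from the %timeit output of Trials 1-3
means = {
    1: {"method1": 25.7, "method2": 26.8, "method3": 25.9, "method4": 24.8},
    2: {"method1": 23.8, "method2": 23.2, "method3": 23.3, "method4": 23.1},
    3: {"method1": 22.7, "method2": 23.4, "method3": 24.8, "method4": 24.8},
}

# Fastest method per trial (lowest mean); ties go to the first name encountered
winners = {trial: min(times, key=times.get) for trial, times in means.items()}

# Average of the three per-trial means for each method
overall = {
    name: round(sum(means[t][name] for t in means) / len(means), 2)
    for name in means[1]
}

print(winners)
print(overall)
```

Neither view accounts for the run-to-run spread, so both are best read as a rough ranking rather than a significance test; on these figures the averaged means of all four methods lie within about 0.6 s of one another.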

[^top](#Contents)