In [1]:
# !pip install memory_profiler

In [2]:
%load_ext autoreload
%load_ext memory_profiler
%autoreload 2

In [3]:
import numpy as np
import pandas as pd

# Series Processing pipeline

`TODO`

# Feature extraction

The classical way to extract features from time series is to slide a **strided window** over the signal and compute each feature per window.
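As a rough illustration of strided-window extraction (a minimal sketch using plain `numpy`, not this library's implementation; the helper name `strided_windows` is hypothetical), every window of `window` samples is taken and only every `stride`-th one is kept:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def strided_windows(x: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Return a 2-D array view with one row per window of `window` samples."""
    # Build all windows with a step of 1, then keep every `stride`-th window.
    return sliding_window_view(x, window)[::stride]


signal = np.arange(10)
windows = strided_windows(signal, window=4, stride=2)

# One feature value (here: the mean) per window
window_means = windows.mean(axis=1)
```

Each row of `windows` is one segment of the signal, so any function that reduces along `axis=1` yields one feature value per window.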

**Challenges**:

1. Existing solutions often assume a **single** stride & window size for all features to be calculated.
This raises 2 problems:
    - There is no clean interface for _multiple stride-window_ feature calculation.
    - You are responsible for efficient execution yourself, e.g., you need to do the bookkeeping that ensures feature calculations on the same stride-window pair run on the same time-series expansion (so that the series is expanded only once).
2. Additionally, these solutions seldom offer easy support for aggregating multivariate series, each possibly having a different sampling rate  
   *(e.g., polysomnography data, wearable data, building data)*

3. No efficient implementations for timestamped data (e.g., `pd.Series` or `pd.DataFrame` with a time index):
    - `pd.rolling`: assumes the same input<->output dimensions, hence no stride is possible.
    See: [https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html) and [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html)
    - `tsfresh`: has a significant memory and time overhead for strided rolling (feature calculation), see
    [https://tsfresh.readthedocs.io/en/latest/api/tsfresh.utilities.html#tsfresh.utilities.dataframe_functions.roll_time_series](https://tsfresh.readthedocs.io/en/latest/api/tsfresh.utilities.html#tsfresh.utilities.dataframe_functions.roll_time_series).
    Moreover:
        - There is no convenient way to retain the time-index
        - It inherently makes a constant window-stride assumption for the features.
    - `seglearn`: `TODO`
    - `sktime`: `TODO`
4. Little focus on serialization (+parallelization) of local-scope objects  
   To the best of our knowledge, no existing time-series library takes this into consideration, thus hampering deployment in different environments.
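The bookkeeping mentioned in challenge 1 can be sketched as follows. This is only an illustration under assumptions: the `requests` list, the `registry` dictionary and the output naming are hypothetical and do not reflect how this library stores its descriptors. The point is that features sharing a (window, stride) pair reuse one expansion of the signal.

```python
from collections import defaultdict

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical feature requests: (name, function, window, stride)
requests = [
    ("mean", np.mean, 4, 2),
    ("std", np.std, 4, 2),
    ("max", np.max, 6, 3),
]

# Group the requested functions by their (window, stride) pair
registry = defaultdict(list)
for name, func, window, stride in requests:
    registry[(window, stride)].append((name, func))

signal = np.arange(12, dtype=float)
features = {}
n_expansions = 0
for (window, stride), funcs in registry.items():
    # The signal is expanded once per (window, stride) pair ...
    expanded = sliding_window_view(signal, window)[::stride]
    n_expansions += 1
    # ... and every function registered for that pair reuses the expansion
    for name, func in funcs:
        features[f"{name}__w={window}_s={stride}"] = func(expanded, axis=1)
```

Three features are requested here, but the signal is expanded only twice, because `mean` and `std` share the same window and stride.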

**What tslib does**:

= Intuitive *time-first interface* for **multiple** stride-window feature calculation on multiple (possibly differently sampled) time-series signals. <a style="color:orange">*(solving 1 & 2.)*</a>

Providing the following features:

- A single registry, in which all bookkeeping is done, enabling efficient processing <a style="color:orange">(solves 2)</a>
- Maintains the time-index (after feature calculation) <a style="color:orange">(solves 3)</a>
- Time-first interface
- Serialization of `lambda's` and local-scope methods <a style="color:orange">(solves 4.)</a>
- Batching - series gap detection

---

**Current assumptions**:

- `pd.DateTimeIndexed` data
- The time series are sampled at fixed rates (there are no time-series gaps).

    Series processing is one option to mitigate this; `chunk_signals` also mitigates this.

**Future work(s)**:

- [x]  Big-Data: Perform feature-extraction in batch -> batch-like generators in stridedrolling objects.
- [x]  Time-wise strided rolling?
- [ ]  Make sklearn-estimation compatible

In [5]:
import sys

# Serialization
import dill as pickle
import scipy.stats as ss

# load our library
sys.path.append('../')
from time_series import FeatureCollection, NumpyFuncWrapper
from time_series.features import FeatureDescriptor, MultipleFeatureDescriptors

pickle.settings["recurse"] = True  # allows to serialize lambda's YAY!

`TODO` Maybe add an image / class diagram of how the feature extraction works?

## Defining functions

* `TODO` what with categorical functions? 

Functions are defined by making use of the [NumpyFuncWrapper](time_series/features/function_wrapper.py) class.  

The `NumpyFuncWrapper` interface is easy and convenient; you define:

|      attribute 	|          type         	| info                                                     	|
|---------------:	|:---------------------:	|----------------------------------------------------------	|
|         `func` 	|        Callable       	| The wrapped function that will operate on `numpy` arrays 	|
| `output_names` 	| Union[List[str], str] 	| The name of the outputs of `func`                        	|
|     `**kwargs` 	|        Optional       	| Additional keyword-arguments for the `func`              	|

**Note**: this library does **not** provide any feature functions, because:
* Many other libraries already provide feature functions, such as numpy, scipy, and tsfresh.
* (Relevant) features depend on the objective and the signal modalities, making feature methods very problem-specific.
* Finally, as can be seen below, the `func` attribute of our `NumpyFuncWrapper` is versatile enough to wrap the end user's desired features.
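To make the three attributes from the table concrete, here is a minimal, hypothetical stand-in for such a wrapper. The class name `SketchFuncWrapper` and its behaviour are assumptions made for illustration; the actual `NumpyFuncWrapper` may differ in its details.

```python
from typing import Callable, List, Union

import numpy as np


class SketchFuncWrapper:
    """Hypothetical stand-in that mimics the interface described above."""

    def __init__(self, func: Callable, output_names: Union[List[str], str], **kwargs):
        self.func = func
        # Accept a single name or a list of names
        self.output_names = (
            [output_names] if isinstance(output_names, str) else list(output_names)
        )
        self.kwargs = kwargs

    def __call__(self, x: np.ndarray) -> dict:
        # Apply the wrapped function and normalise scalar outputs to 1-D
        out = np.atleast_1d(self.func(x, **self.kwargs))
        if len(out) != len(self.output_names):
            raise ValueError("func returned a different number of outputs than named")
        return dict(zip(self.output_names, out.tolist()))


quantiles = [0.25, 0.5, 0.75]
f_quantiles = SketchFuncWrapper(
    np.quantile, [f"quantile_{q}" for q in quantiles], q=quantiles
)
result = f_quantiles(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Binding the keyword arguments at construction time lets a one-to-many function such as `np.quantile` be called with just the windowed data later on.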

In [6]:
# --------------------- some custom feature extraction functions ---------------------
# -- 1. one-to-many functions
#    To compute quantiles, you need to sort the windowed data, which is a rather expensive
#    operation O(n*log(n)). Hence, you might want to calculate all your desired 
#    quantiles in a single function-wrapper, returning multiple outputs.

quantiles = [0.25, 0.5, 0.75]
f_quantiles = NumpyFuncWrapper(
    func=np.quantile,  # the wrapped function that will operate on numpy arrays
    output_names=[f"quantile_{q}" for q in quantiles],  # the output column names
    q=quantiles,  # optional - additional function-related kwargs
)


# -- 2. in-line functions
#    You can define your functions locally; these will serialize flawlessly
def slope(x):
    return np.polyfit(np.arange(0, len(x)), x, 1)[0]


f_slope = NumpyFuncWrapper(slope, output_names="slope")

# -- 3. Lambda's
#    Or even use lambda's and other modules' functions
f_rms = NumpyFuncWrapper(lambda x: np.sqrt(np.mean(x ** 2)), output_names="rms")
f_area = NumpyFuncWrapper(np.sum, output_names="area")


# (For convenience) we store the constructed `NumpyFuncWrappers` in a list
segment_funcs = [
    np.mean,
    np.std,
    np.var,
    np.max,
    np.min,
    ss.skew,  # use other libraries such as scipy
    ss.kurtosis,
    f_quantiles,
    f_slope,
    f_rms,
    f_area,
]
segment_funcs

[<function numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)>,
 <function numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>)>,
 <function numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>)>,
 <function numpy.amax(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>,
 <function numpy.amin(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>,
 <function scipy.stats.stats.skew(a, axis=0, bias=True, nan_policy='propagate')>,
 <function scipy.stats.stats.kurtosis(a, axis=0, fisher=True, bias=True, nan_policy='propagate')>,
 NumpyFuncWrapper(quantile, ['quantile_0.25', 'quantile_0.5', 'quantile_0.75'], {'q': [0.25, 0.5, 0.75]}),
 NumpyFuncWrapper(slope, ['slope'], {}),
 NumpyFuncWrapper(<lambda>, ['rms'], {}),
 NumpyFuncWrapper(sum, ['area'], {})]

## Single series feature extraction

The defined functions above will be encapsulated in a [FeatureDescriptor](time_series/features/feature.py) object.

A `FeatureDescriptor` describes a feature, and has 4 main attributes:

|  attribute 	|                  type                 	| info                                                                                                             	|
|-----------:	|:-------------------------------------:	|------------------------------------------------------------------------------------------------------------------	|
| `function` 	| Union[Callable, <br>NumpyFuncWrapper] 	| The `function` that calculates this feature.                                                                     	|
|      `key` 	|                 string                	| The signal key; i.e., the `pd.DataFrame` column name or <br> `pd.Series` name on which the function will operate.     	|
|   `window` 	|                  int                  	| The window size on which this feature will be applied, <br> expressed in the number of samples from the input signal. 	|
|   `stride` 	|                  int                  	| The stride of the window rolling process, also as a <br> number of samples of the input signal.                       	|

**note**: [MultipleFeatureDescriptors](time_series/features/feature.py) is actually a factory for `FeatureDescriptor` objects.
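One way such a factory could work is sketched below. It assumes that every combination of function, key, window and stride becomes one descriptor; that combination rule, as well as the names `SketchDescriptor` and `expand_descriptors`, are assumptions made for illustration and are not taken from the library's source.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass(frozen=True)
class SketchDescriptor:
    """Hypothetical descriptor with the four attributes from the table above."""

    function: Callable
    key: str
    window: int
    stride: int


def expand_descriptors(functions, keys, windows, strides) -> List[SketchDescriptor]:
    # Assumed behaviour: one descriptor per (function, key, window, stride) combination
    return [
        SketchDescriptor(f, k, w, s)
        for f, k, w, s in product(functions, keys, windows, strides)
    ]


descriptors = expand_descriptors(
    functions=[min, max], keys=["TMP"], windows=[240], strides=[120, 60]
)
```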

### Fixed window size & stride

**note**: this functionality is exposed by most existing time-series libraries.

In this example, we will use the _temperature_ signal from a wearable

In [7]:
df_tmp = pd.read_feather("data/tmp.feather").set_index("timestamp")
df_tmp.sample(2)

| timestamp                        |   TMP |
|:---------------------------------|------:|
| 2017-06-13 12:26:58.750000+02:00 | 31.65 |
| 2017-06-13 12:27:04.750000+02:00 | 31.69 |


In [8]:
# The data is datetime indexed
type(df_tmp.index)

pandas.core.indexes.datetimes.DatetimeIndex

Note how the `TMP`-column is used as signal_key in the `FeatureCollection`

In [9]:
# Define the sample frequency and window size
fs_tmp = 4  # 4Hz
tmp_win_size: int = 60 * fs_tmp  # window of 60s
tmp_stride_size: int = 30 * fs_tmp  # stride of 30s


tmp_feat_extr = FeatureCollection(
    feature_descriptors=[
        MultipleFeatureDescriptors(
            functions=segment_funcs,  # The list of functions we constructed earlier
            keys=["TMP"],
            windows=[tmp_win_size],
            strides=[tmp_stride_size],
        )
    ]
)

# The FeatureCollection's __repr__() gives a nice overview of the structure
print(tmp_feat_extr)

# to extract the features we just call the collection's `calculate()` function
extracted_feats = tmp_feat_extr.calculate(
    data=df_tmp,  # The signals on which features are calculated
    merge_dfs=True,  # If true, an outer merge on the feature-outputs will be performed
    n_jobs=1         # If > 1, the feature extraction is parallelized
)

extracted_feats.sample(2)

TMP: (
	win: 240 samples, stride: 120 samples: [
		FeatureDescriptor - func: NumpyFuncWrapper(mean, ['mean'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(std, ['std'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(var, ['var'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(amax, ['amax'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(amin, ['amin'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(skew, ['skew'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(kurtosis, ['kurtosis'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(quantile, ['quantile_0.25', 'quantile_0.5', 'quantile_0.75'], {'q': [0.25, 0.5, 0.75]}),
		FeatureDescriptor - func: NumpyFuncWrapper(slope, ['slope'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(<lambda>, ['rms'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(sum, ['area'], {}),
	]
)



Unnamed: 0_level_0,TMP__mean__w=240_s=120,TMP__std__w=240_s=120,TMP__var__w=240_s=120,TMP__amax__w=240_s=120,TMP__amin__w=240_s=120,TMP__skew__w=240_s=120,TMP__kurtosis__w=240_s=120,TMP__quantile_0.25__w=240_s=120,TMP__quantile_0.5__w=240_s=120,TMP__quantile_0.75__w=240_s=120,TMP__slope__w=240_s=120,TMP__rms__w=240_s=120,TMP__area__w=240_s=120
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2017-06-13 11:56:39.750000+02:00,32.727167,0.013915,0.000194,32.77,32.68,-0.153887,1.962549,32.725,32.73,32.73,-4.1e-05,32.72717,7854.52
2017-06-13 12:13:09.750000+02:00,32.162667,0.027681,0.000766,32.21,32.11,-0.324479,-0.516593,32.15,32.16,32.18,-0.000312,32.162679,7719.04


### Multiple `time-based` window sizes and strides

_In this example, we use **multiple** stride-window-size combinations on a wearable's ElectroDermal Activity (EDA) signal._

In [10]:
df_gsr = pd.read_feather("data/gsr.feather").set_index("timestamp")
df_gsr.sample(2)

| timestamp                        |      EDA |
|:---------------------------------|---------:|
| 2017-06-13 11:48:50.250000+02:00 | 0.153767 |
| 2017-06-13 12:41:00.250000+02:00 | 0.160163 |


Note that we do not use int-based window-stride combinations, but `time-based` ones. Also take a closer look at the `__repr__` string.
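For intuition, a time-based window can be related to a sample-based one via the sampling frequency. The sketch below uses `pd.Timedelta` and assumes a fixed rate of 4 Hz for the EDA signal (the value used in the `fs_dict` further down); the helper `to_samples` is hypothetical and only shows the arithmetic, not how the library handles time-based windows internally.

```python
import pandas as pd


def to_samples(duration: str, fs: float) -> int:
    """Convert a time string such as '30s' into a number of samples at `fs` Hz."""
    return int(round(pd.Timedelta(duration).total_seconds() * fs))


fs_eda = 4  # Hz; assumed fixed sampling rate of the EDA signal
window_size_s = ["30s", "120s", "90s", "1h"]
window_samples = {w: to_samples(w, fs_eda) for w in window_size_s}
```

Under this assumption, a `'120s'` window corresponds to 480 samples and a `'15s'` stride to 60 samples.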

In [11]:
# PoC: each feature gets a randomly chosen window-size / stride combination
window_size_s = ['30s', '120s', '90s', '1h']
stride_size_s = ['15s', '30s']

import random

gsr_feat_extr = FeatureCollection(
    [
        FeatureDescriptor(
            key="EDA",
            window=random.choice(window_size_s),
            stride=random.choice(stride_size_s),
            function=f,
        )
        for f in segment_funcs
    ]
)

# the __repr__ string outputs the windows & strides in a time-string representation :)
print(gsr_feat_extr)

gsr_feat_extr.calculate(df_gsr, merge_dfs=True, show_progress=False, n_jobs=0).sample(2)

EDA: (
	win: 1m30s , stride: 30s: [
		FeatureDescriptor - func: NumpyFuncWrapper(mean, ['mean'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(skew, ['skew'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(quantile, ['quantile_0.25', 'quantile_0.5', 'quantile_0.75'], {'q': [0.25, 0.5, 0.75]}),
		FeatureDescriptor - func: NumpyFuncWrapper(sum, ['area'], {}),
	]
	win: 2m    , stride: 30s: [
		FeatureDescriptor - func: NumpyFuncWrapper(std, ['std'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(slope, ['slope'], {}),
	]
	win: 30s   , stride: 15s: [
		FeatureDescriptor - func: NumpyFuncWrapper(var, ['var'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(amin, ['amin'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(kurtosis, ['kurtosis'], {}),
	]
	win: 1h    , stride: 15s: [
		FeatureDescriptor - func: NumpyFuncWrapper(amax, ['amax'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(<lambda>, ['rms'], {}),
	]
)



Unnamed: 0_level_0,EDA__mean__w=1m30s_s=30s,EDA__skew__w=1m30s_s=30s,EDA__quantile_0.25__w=1m30s_s=30s,EDA__quantile_0.5__w=1m30s_s=30s,EDA__quantile_0.75__w=1m30s_s=30s,EDA__area__w=1m30s_s=30s,EDA__std__w=2m_s=30s,EDA__slope__w=2m_s=30s,EDA__var__w=30s_s=15s,EDA__amin__w=30s_s=15s,EDA__kurtosis__w=30s_s=15s,EDA__amax__w=1h_s=15s,EDA__rms__w=1h_s=15s
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2017-06-13 11:08:24.750000+02:00,,,,,,,,,2e-06,0.133301,4.558216,,
2017-06-13 11:01:39.750000+02:00,0.298254,0.471538,0.248426,0.295756,0.33509,107.371338,0.098884,-0.000692,0.000144,0.212609,-1.029535,,
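The empty cells in the sampled rows above are what an outer merge of features with different windows and strides would be expected to produce: a timestamp that exists for one feature does not necessarily exist for another. A toy sketch with made-up values (this uses `pd.concat` and is not the library's merge code):

```python
import pandas as pd

start = pd.Timestamp("2017-06-13 11:00:00")

# Two toy feature series with different strides (values are made up)
feat_15s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range(start, periods=4, freq="15s"),
    name="EDA__var__w=30s_s=15s",
)
feat_30s = pd.Series(
    [10.0, 20.0],
    index=pd.date_range(start, periods=2, freq="30s"),
    name="EDA__mean__w=1m30s_s=30s",
)

# Outer join on the time index: timestamps missing in one series become NaN
merged = pd.concat([feat_15s, feat_30s], axis=1, join="outer")
n_missing = int(merged["EDA__mean__w=1m30s_s=30s"].isna().sum())
```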


## Multiple series feature extraction

In [12]:
import itertools

In [13]:
list(itertools.chain.from_iterable(gsr_feat_extr._feature_desc_dict.values()))

[FeatureDescriptor(('EDA',), 0 days 00:01:30, 0 days 00:00:30),
 FeatureDescriptor(('EDA',), 0 days 00:01:30, 0 days 00:00:30),
 FeatureDescriptor(('EDA',), 0 days 00:01:30, 0 days 00:00:30),
 FeatureDescriptor(('EDA',), 0 days 00:01:30, 0 days 00:00:30),
 FeatureDescriptor(('EDA',), 0 days 00:02:00, 0 days 00:00:30),
 FeatureDescriptor(('EDA',), 0 days 00:02:00, 0 days 00:00:30),
 FeatureDescriptor(('EDA',), 0 days 00:00:30, 0 days 00:00:15),
 FeatureDescriptor(('EDA',), 0 days 00:00:30, 0 days 00:00:15),
 FeatureDescriptor(('EDA',), 0 days 00:00:30, 0 days 00:00:15),
 FeatureDescriptor(('EDA',), 0 days 01:00:00, 0 days 00:00:15),
 FeatureDescriptor(('EDA',), 0 days 01:00:00, 0 days 00:00:15)]

In [14]:
# Construct the feature FeatureCollection
#   = higher-order wrapper which aggregates the feature descriptors
multimodal_feature_extraction = FeatureCollection(
    feature_descriptors=[gsr_feat_extr, tmp_feat_extr]
)

print(multimodal_feature_extraction)

df_feat = multimodal_feature_extraction.calculate(
    [df_gsr, df_tmp], merge_dfs=True
)
df_feat.sample(2)

EDA: (
	win: 1m30s , stride: 30s: [
		FeatureDescriptor - func: NumpyFuncWrapper(mean, ['mean'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(skew, ['skew'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(quantile, ['quantile_0.25', 'quantile_0.5', 'quantile_0.75'], {'q': [0.25, 0.5, 0.75]}),
		FeatureDescriptor - func: NumpyFuncWrapper(sum, ['area'], {}),
	]
	win: 2m    , stride: 30s: [
		FeatureDescriptor - func: NumpyFuncWrapper(std, ['std'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(slope, ['slope'], {}),
	]
	win: 30s   , stride: 15s: [
		FeatureDescriptor - func: NumpyFuncWrapper(var, ['var'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(amin, ['amin'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(kurtosis, ['kurtosis'], {}),
	]
	win: 1h    , stride: 15s: [
		FeatureDescriptor - func: NumpyFuncWrapper(amax, ['amax'], {}),
		FeatureDescriptor - func: NumpyFuncWrapper(<lambda>, ['rms'], {}),
	]
)
TMP: (
	win: 240 samples, stride: 120 samples: [
		FeatureDes

Unnamed: 0_level_0,EDA__mean__w=1m30s_s=30s,EDA__area__w=1m30s_s=30s,EDA__std__w=2m_s=30s,EDA__amin__w=30s_s=15s,EDA__quantile_0.25__w=1m30s_s=30s,EDA__quantile_0.5__w=1m30s_s=30s,EDA__quantile_0.75__w=1m30s_s=30s,EDA__var__w=30s_s=15s,EDA__slope__w=2m_s=30s,EDA__skew__w=1m30s_s=30s,...,TMP__amin__w=240_s=120,TMP__area__w=240_s=120,TMP__rms__w=240_s=120,TMP__quantile_0.25__w=240_s=120,TMP__quantile_0.5__w=240_s=120,TMP__quantile_0.75__w=240_s=120,EDA__rms__w=1h_s=15s,TMP__slope__w=240_s=120,TMP__kurtosis__w=240_s=120,TMP__skew__w=240_s=120
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2017-06-13 10:54:24.750000+02:00,,,,0.202376,,,,0.000161,,,...,,,,,,,,,,
2017-06-13 12:00:09.750000+02:00,2.087682,751.565357,0.52328,1.984113,1.945938,2.031702,2.228435,0.069581,0.003251,0.648958,...,32.66,7846.64,32.69434,32.68,32.695,32.71,0.38966,-0.00018,-1.123432,-0.053414


## Logging

`TODO`

## Use case: batch-based feature extraction

In [16]:
from time_series import chunk_data

* maybe execute this on a high-dimensional series, like the `sleep data`

In [17]:
same_range_chunks = chunk_data(
    data=[df_tmp],
    fs_dict={"EDA": 4, "TMP": 4},
    max_chunk_dur_s=60 * 10
)
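The idea behind chunking can be sketched as follows: split a datetime-indexed series wherever a gap occurs, then cap each gap-free segment at a maximum duration. This is a hypothetical illustration, not the implementation of `chunk_data`; the helper name `chunk_by_gaps` and the gap tolerance of 1.5 sample periods are assumptions.

```python
import pandas as pd


def chunk_by_gaps(series: pd.Series, fs: float, max_chunk_dur_s: float) -> list:
    """Split `series` at time gaps and into chunks of at most `max_chunk_dur_s`."""
    expected_step = pd.Timedelta(seconds=1 / fs)
    # Assumption: a gap is any step larger than 1.5 sample periods
    is_gap = series.index.to_series().diff() > 1.5 * expected_step
    segment_ids = is_gap.cumsum().to_numpy()

    max_len = int(max_chunk_dur_s * fs)  # maximum number of samples per chunk
    chunks = []
    for _, segment in series.groupby(segment_ids):
        for start in range(0, len(segment), max_len):
            chunks.append(segment.iloc[start:start + max_len])
    return chunks


# Toy 4 Hz signal of 40 samples from which samples 10..19 are removed (a gap)
index = pd.date_range("2017-06-13 11:00:00", periods=40, freq="250ms")
signal = pd.Series(range(40), index=index, dtype=float)
signal = signal.drop(index[10:20])

chunks = chunk_by_gaps(signal, fs=4, max_chunk_dur_s=2)
chunk_lengths = [len(c) for c in chunks]
```

Each resulting chunk is gap-free, so the fixed-rate assumption holds within a chunk and features can be extracted chunk by chunk.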

In [15]:
# %mprun?
# %mprun -c -f chunk_signals  chunk_signals(signals=[df_gsr, df_tmp], fs_dict={'EDA': 4, 'TMP': 4}, max_chunk_dur_s=60*10, copy=True)

## Serialization

Serialization is mandatory to store and share your pipelines.
`TODO`
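As background for why this notebook imports `dill` instead of relying on the standard library: plain `pickle` stores functions by reference (module plus name), so an anonymous lambda cannot be pickled. A small check, using only the standard library (the helper `can_pickle` is hypothetical):

```python
import math
import pickle


def can_pickle(obj) -> bool:
    """Return True if the standard-library pickle can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A function that can be looked up by name in an importable module works ...
importable_ok = can_pickle(math.sqrt)
# ... but a lambda has no importable name, so pickling it fails
lambda_ok = can_pickle(lambda x: x ** 2)
```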

In [25]:
multimodal_feature_extraction.serialize("data/example_serialization.pkl")

## Other packages

### tsfresh

https://tsfresh.readthedocs.io/en/latest/api/tsfresh.utilities.html#tsfresh.utilities.dataframe_functions.roll_time_series

In [None]:
from tsfresh.feature_extraction import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series

In [None]:
# define the window-size and stride
# use the largest window and the smallest stride defined above
window = 480
stride = 40

df_gsr_id = df_gsr.reset_index(drop=False).copy()  # .set_index('timestamp', drop=True)
df_gsr_id["id"] = 1
df_gsr_id.sample(2)

**Note**: This outputs merely one expansion with a fixed window and stride.

In [21]:
%%memit
tsf_out = roll_time_series(
    df_gsr_id,
    column_id="id",
    max_timeshift=window,
    min_timeshift=window,
    rolling_direction=stride,
)

Rolling: 100%|██████████| 80/80 [00:01<00:00, 51.69it/s]


peak memory: 373.57 MiB, increment: 133.35 MiB


In [22]:
%%time
roll_time_series(
    df_gsr_id.reset_index(drop=True),
    column_id="id",
    max_timeshift=window,
    min_timeshift=window,
    rolling_direction=stride,
).sample(2)

Rolling: 100%|██████████| 80/80 [00:01<00:00, 50.77it/s]


CPU times: user 1.89 s, sys: 442 ms, total: 2.33 s
Wall time: 2.34 s


Unnamed: 0,timestamp,EDA,id,sort
81134,2017-06-13 11:03:10.750000+02:00,0.21133,"(1, 7237)",7083
199035,2017-06-13 11:44:14.750000+02:00,0.151209,"(1, 17037)",16939


In [23]:
%%memit
tsf_feats = extract_features(tsf_out.drop(columns="timestamp"), column_id="id")

Feature Extraction: 100%|██████████| 79/79 [01:13<00:00,  1.08it/s]


peak memory: 702.43 MiB, increment: 305.52 MiB


In [24]:
# some logic is still needed to add the timestamps back to the features

In [25]:
%%time
extract_features(
    roll_time_series(
        df_gsr_id.reset_index(drop=True),
        column_id="id",
        max_timeshift=window,
        min_timeshift=window,
        rolling_direction=stride,
    ).drop(columns="timestamp"),
    column_id="id",
).sample(2)

Rolling: 100%|██████████| 80/80 [00:01<00:00, 56.32it/s]
Feature Extraction: 100%|██████████| 79/79 [01:13<00:00,  1.08it/s]


CPU times: user 5.25 s, sys: 1.45 s, total: 6.7 s
Wall time: 1min 17s


Unnamed: 0,Unnamed: 1,EDA__variance_larger_than_standard_deviation,EDA__has_duplicate_max,EDA__has_duplicate_min,EDA__has_duplicate,EDA__sum_values,EDA__abs_energy,EDA__mean_abs_change,EDA__mean_change,EDA__mean_second_derivative_central,EDA__median,...,sort__permutation_entropy__dimension_5__tau_1,sort__permutation_entropy__dimension_6__tau_1,sort__permutation_entropy__dimension_7__tau_1,sort__query_similarity_count__query_None__threshold_0.0,"sort__matrix_profile__feature_""min""__threshold_0.98","sort__matrix_profile__feature_""max""__threshold_0.98","sort__matrix_profile__feature_""mean""__threshold_0.98","sort__matrix_profile__feature_""median""__threshold_0.98","sort__matrix_profile__feature_""25""__threshold_0.98","sort__matrix_profile__feature_""75""__threshold_0.98"
1,30957,0.0,0.0,0.0,1.0,71.245141,10.60511,0.001655,-6.1e-05,4e-06,0.146092,...,-0.0,-0.0,-0.0,,0.0,0.0,0.0,0.0,0.0,0.0
1,19877,0.0,0.0,1.0,1.0,111.884515,26.125586,0.002033,8e-06,5e-06,0.235635,...,-0.0,-0.0,-0.0,,0.0,0.0,0.0,0.0,0.0,0.0


### Seglearn

https://dmbee.github.io/seglearn/

In [None]:
# !pip install -U seglearn

In [27]:
from numpy.random import rand
from seglearn.pipe import Pype
from seglearn.transform import FeatureRep, Segment
from seglearn.base import TS_Data
from seglearn.util import ts_stats, check_ts_data

In [28]:
s = Segment(width=480, step=40, order="F")

In [29]:
%%time
s.fit_transform(np.column_stack(df_gsr['EDA']), y=None)
s.transform(np.column_stack(df_tmp['TMP']), y=None)

CPU times: user 216 ms, sys: 9.13 ms, total: 225 ms
Wall time: 223 ms


(array([[382.21, 382.21, 382.21, ...,  31.35,  31.35,  31.35],
        [ 31.13,  31.13,  31.13, ...,  31.37,  31.37,  31.37],
        [ 31.15,  31.15,  31.15, ...,  31.37,  31.37,  31.37],
        ...,
        [ 31.39,  31.39,  31.39, ...,  31.35,  31.35,  31.35],
        [ 31.41,  31.41,  31.41, ...,  31.37,  31.37,  31.37],
        [ 31.39,  31.39,  31.39, ...,  31.35,  31.35,  31.35]]),
 None,
 None)

Speed seems to be in the same order of magnitude, but the time index is gone.
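When segmenting on raw arrays like this, the time index has to be reconstructed by hand. A minimal sketch with `numpy` and `pandas`, assuming a gap-free signal and assuming each window is labelled with the timestamp of its last sample (a labelling choice made here for illustration):

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Toy 4 Hz signal with a datetime index
index = pd.date_range("2017-06-13 11:00:00", periods=12, freq="250ms")
signal = pd.Series(np.arange(12, dtype=float), index=index)

window, stride = 4, 2
segments = sliding_window_view(signal.to_numpy(), window)[::stride]

# The i-th window ends at sample position (window - 1) + i * stride
window_end_times = signal.index[window - 1::stride]

feature = pd.Series(segments.mean(axis=1), index=window_end_times, name="mean")
```

The slice `index[window - 1::stride]` yields exactly one timestamp per segment, so the feature values can be re-attached to the time axis.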

In [30]:
f_extr_pype = Pype([
    ("segment", Segment(width=480, step=40, order="F")),
    ("features", FeatureRep(features="default")),
])



In [33]:
f_extr_pype.fit_transform(np.column_stack(df_gsr['EDA']), y=None)

TypeError: object of type 'NoneType' has no len()

`TODO`: this still needs further investigation, see https://dmbee.github.io/seglearn/auto_examples/plot_feature_rep_mix_example.html#sphx-glr-auto-examples-plot-feature-rep-mix-example-py

# Serialization

## Series Pipeline

`TODO`

## Feature extraction

In [None]:
# restart the kernel
import os

os._exit(0)

In [1]:
import pickle
import sys

import pandas as pd

time_series_dir = "../time_series/"
data_dir = "data/"

sys.path.append(time_series_dir)

In [2]:
with open(f"{data_dir}example_serialization.pkl", "rb") as f:
    multimodal_feature_extraction = pickle.load(f)

df_gsr = pd.read_feather(f"{data_dir}gsr.feather").set_index("timestamp")
df_tmp = pd.read_feather(f"{data_dir}tmp.feather").set_index("timestamp")

**note**: This is truly amazing: we do not need to redefine which local functions were used.  
We only need a Python kernel that knows the paths to the modules used in the serialization.

In [3]:
df_feat = multimodal_feature_extraction.calculate([df_gsr, df_tmp], merge_dfs=True)
df_feat.sample(2)

HBox(children=(FloatProgress(value=0.0, max=22.0), HTML(value='')))




Unnamed: 0_level_0,EDA_mean__w=1m30s_s=15s,EDA_slope__w=1m30s_s=15s,EDA_var__w=2m_s=30s,EDA_rms__w=2m_s=30s,EDA_area__w=2m_s=30s,EDA_std__w=1h_s=30s,EDA_amax__w=1h_s=15s,EDA_amin__w=30s_s=30s,TMP_mean__w=240_s=120,TMP_var__w=240_s=120,...,EDA_kurtosis__w=30s_s=30s,TMP_slope__w=240_s=120,TMP_quantile_0.25__w=240_s=120,TMP_quantile_0.5__w=240_s=120,TMP_quantile_0.75__w=240_s=120,EDA_quantile_0.25__w=1h_s=15s,EDA_quantile_0.5__w=1h_s=15s,EDA_quantile_0.75__w=1h_s=15s,TMP_kurtosis__w=240_s=120,TMP_skew__w=240_s=120
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2017-06-13 12:03:39.750000+02:00,2.423588,-0.000846,0.027532,2.477573,1186.564884,0.624615,3.111371,2.213085,32.472333,0.000315,...,-0.592603,-0.000173,32.47,32.47,32.49,0.125626,0.158884,0.183188,0.11472,-0.624008
2017-06-13 12:20:39.750000+02:00,0.344951,-0.000151,0.000675,0.358119,171.443896,0.854842,3.119045,0.317502,31.963333,0.000262,...,-0.828058,9.4e-05,31.95,31.97,31.97,0.157605,0.190863,0.874332,-0.526573,-0.076756


# Bonus - Get LAYD: Look At Your Data

And as a bonus for running/reading this notebook, you get some nice visualization code for,
of course, time series.

In [4]:
import ipywidgets as widgets
import plotly.graph_objects as go
from ipywidgets import interact_manual
from plotly.subplots import make_subplots

In [5]:
df_dict = {"tmp": df_tmp, "gsr": df_gsr}

In [6]:
feat_widget = widgets.SelectMultiple(options=df_feat.columns)
sig_widget = widgets.SelectMultiple(options=["gsr", "tmp"])

In [7]:
@interact_manual
def visualize(features=feat_widget, signals=sig_widget):
    # Parenthesize the conditional so the signal rows are kept when no feature is selected
    row_titles = list(signals) + (["features"] if len(features) else [])
    fig = make_subplots(
        rows=len(row_titles),
        cols=1,
        shared_xaxes=True,
        vertical_spacing=0.1 / len(row_titles),
        row_titles=row_titles,
    )
    fig.update_layout(height=300 * len(row_titles))

    # first, visualize the "raw" signals
    row_idx = 1
    for sig in signals:
        df_sig = df_dict[sig][10:].resample("1s").mean()
        for col in set(df_sig.columns).difference(["index", "timestamp"]):
            fig.add_trace(
                go.Scattergl(x=df_sig.index, y=df_sig[col], name=col, hoverinfo="skip"),
                row=row_idx,
                col=1,
            )
        row_idx += 1

    # then visualize the features
    for feature in features:
        df_ff = df_feat[feature].dropna()
        fig.add_trace(
            go.Scattergl(
                connectgaps=True,
                x=df_ff.index,
                y=df_ff,
                name=feature,
                hoverinfo="skip",
                mode="markers",
                showlegend=True,
            ),
            row=row_idx,
            col=1,
        )

    return fig

interactive(children=(SelectMultiple(description='features', options=('EDA_mean__w=1m30s_s=15s', 'EDA_slope__w…