# Feature extraction with the tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that any scikit-learn estimator can then be fitted on the resulting feature table.

## Preliminaries
tsfresh must be installed before running this notebook. If it is not installed yet, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

index,dim_0
74,0 -1.8914 1 -1.8894 2 -1.8531 3 ...
108,0 -2.0692 1 -2.0677 2 -2.0495 3 ...
52,0 -1.9858 1 -1.9843 2 -1.9625 3 ...
42,0 -1.9921 1 -2.0144 2 -1.9611 3 ...
153,0 -1.6628 1 -1.6740 2 -1.6541 3 ...


In [5]:
# multiclass classification task with three classes
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract features with tsfresh's "efficient" parameter set
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,249.999057,83.688048,0.299973,0.330694,0.07811,-0.288386,-0.609495,-1.264774,0.106336,0.749217,...,1.0,0.053377,0.015943,-0.00844,0.0,0.0,0.0,0.996012,0.0,-508111.6
1,250.000002,84.814658,0.273345,0.27391,0.067191,-0.27749,-0.55869,-1.112529,0.082668,0.612604,...,1.0,0.061994,0.005993,-0.028464,0.0,0.0,0.0,0.996016,0.0,1518179.0
2,249.999929,88.35475,0.25014,0.285763,0.077826,-0.287491,-0.565849,-1.245895,0.11399,0.607339,...,1.0,0.069554,0.015329,-0.010554,0.0,0.0,0.0,0.996016,0.0,-805464.5
3,249.999669,92.154196,0.157192,0.182119,0.103257,-0.109189,-0.412927,-1.369386,0.166426,0.884207,...,1.0,0.066259,0.023101,-0.001812,0.0,0.0,0.0,0.996015,0.0,-613969.0
4,250.000864,76.331544,0.355974,0.412909,0.085993,-0.380598,-0.670945,-1.147883,0.076846,0.667578,...,1.0,0.036064,0.007578,-0.001184,0.0,0.0,1.0,0.996019,0.0,632575.6
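
To make some of the column names above concrete, here is a minimal sketch that recomputes four of the simpler features by hand. The formulas follow our understanding of the tsfresh definitions (sum of squares, sum of absolute consecutive differences, population variance, and standard deviation divided by mean). The toy series and the helper functions are made up for illustration; they are not the tsfresh implementation.

```python
import math


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def variation_coefficient(x):
    """Standard deviation divided by the mean (NaN if the mean is zero)."""
    mean = sum(x) / len(x)
    return math.sqrt(variance(x)) / mean if mean != 0 else float("nan")


# Hypothetical toy series, not taken from the data set
toy = [1.0, 2.0, 4.0, 3.0]

features = {
    "abs_energy": abs_energy(toy),
    "absolute_sum_of_changes": absolute_sum_of_changes(toy),
    "variance": variance(toy),
    "variation_coefficient": variation_coefficient(toy),
}
print(features)
```

If these definitions hold, the values in the table are consistent with series whose mean is close to zero: `abs_energy` divided by `variance` gives roughly the same number for every row, and dividing by a near-zero mean would explain the very large `variation_coefficient` entries.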


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


0.8867924528301887

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

index,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
31,0 0.036607 1 0.036607 2 0.265778 3...,0 0.341686 1 0.341686 2 -0.164943 3...,0 -0.694948 1 -0.694948 2 -0.635560 3...,0 -0.253020 1 -0.253020 2 -0.354229 3...,0 -0.082565 1 -0.082565 2 -0.516694 3...,0 -0.090555 1 -0.090555 2 1.470182 3...
0,0 0.079106 1 0.079106 2 -0.903497 3...,0 0.394032 1 0.394032 2 -3.666397 3...,0 0.551444 1 0.551444 2 -0.282844 3...,0 0.351565 1 0.351565 2 -0.095881 3...,0 0.023970 1 0.023970 2 -0.319605 3...,0 0.633883 1 0.633883 2 0.972131 3...
39,0 1.211973 1 1.211973 2 -0.605948 3...,0 -0.247107 1 -0.247107 2 -3.855673 3...,0 0.327837 1 0.327837 2 7.113185 3...,0 0.058594 1 0.058594 2 0.900220 3...,0 -0.527348 1 -0.527348 2 -1.326360 3...,0 -0.042614 1 -0.042614 2 -0.095881 3...
30,0 -0.771623 1 -0.771623 2 -2.32382...,0 0.372042 1 0.372042 2 -0.29603...,0 -0.145753 1 -0.145753 2 1.71501...,0 -0.031960 1 -0.031960 2 0.383526 3...,0 0.167792 1 0.167792 2 0.229050 3...,0 -0.362219 1 -0.362219 2 -0.23970...
20,0 -0.294498 1 -0.294498 2 -0.050044 3...,0 0.540218 1 0.540218 2 -0.515245 3...,0 0.218114 1 0.218114 2 -0.301108 3...,0 -0.045277 1 -0.045277 2 0.103872 3...,0 -0.002663 1 -0.002663 2 -0.183773 3...,0 0.031960 1 0.031960 2 0.037287 3...


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,5923.622075,351.902197,-0.022711,-0.054728,0.045949,18.952323,4.7868,-1.586263,44.834159,28.00206,...,1.0,-1.76202,4.734516,13.574916,0.0,0.0,0.0,12.647317,1.0,-97.393541
1,10.629914,22.690124,0.039365,0.029099,0.008885,1.021608,0.068493,-0.493076,0.172195,1.6382,...,1.0,0.019919,-0.005089,-0.02841,0.0,0.0,0.0,0.260379,0.0,9.377847
2,8508.951625,459.048651,-0.004866,-0.047343,0.044864,19.405911,7.093272,-1.529865,54.095927,29.210932,...,1.0,-41.246265,26.299019,59.697679,0.0,0.0,0.0,21.230098,1.0,58.366824
3,12647.878199,542.656064,-0.026511,-0.067215,0.036655,23.291787,7.266123,-0.715149,64.776848,28.768162,...,1.0,-32.549793,16.94087,21.610465,0.0,0.0,0.0,34.694581,1.0,-5.073784
4,110.735119,85.854825,-0.018927,-0.03565,0.027377,2.309696,0.354539,-0.770196,0.975286,3.041947,...,1.0,-0.025912,0.047528,0.317551,0.0,0.0,0.0,1.089646,1.0,184.873936
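
For multivariate input, every feature column is prefixed with the dimension it was computed from (`dim_0` to `dim_5` above), followed by the feature name and any parameters, all separated by double underscores. The sketch below splits a few of the column names shown above into these parts. The helper `parse_feature_name` is hypothetical, and treating everything after the last underscore of a parameter as its value is an assumption that may not hold for every tsfresh feature.

```python
from collections import Counter


def parse_feature_name(column):
    """Split '<dimension>__<feature>__<param>_<value>__...' into its parts."""
    dimension, feature, *raw_params = column.split("__")
    params = {}
    for raw in raw_params:
        # Assumption: the value is whatever follows the last underscore.
        key, _, value = raw.rpartition("_")
        params[key] = value
    return dimension, feature, params


# Column names copied from the outputs above
columns = [
    "dim_0__abs_energy",
    'dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"',
    "dim_5__value_count__value_-1",
    "dim_5__variance",
]

parsed = [parse_feature_name(c) for c in columns]
for dimension, feature, params in parsed:
    print(dimension, feature, params)

# Count how many of the listed columns belong to each dimension
per_dimension = Counter(dimension for dimension, _, _ in parsed)
print(per_dimension)
```

Grouping columns by their dimension prefix in this way can help when checking which input dimension a downstream estimator relies on most.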
