# Feature extraction with tsfresh transformer

In this tutorial, we show how to use `sktime` with [`tsfresh`](https://tsfresh.readthedocs.io) to extract features from time series, so that any scikit-learn estimator can then be applied to the resulting tabular features.

## Preliminaries
Install tsfresh if you have not already. To do so, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_gunpoint
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor
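
The import path for `TSFreshFeatureExtractor` appears to differ between sktime releases. Below is a minimal sketch of a fallback import. It assumes the class lives either at the path used in this notebook or under `sktime.transformations.panel.tsfresh`; the second path is an assumption about newer releases, not something this notebook confirms. The helper `import_first_available` is hypothetical and not part of sktime.

```python
import importlib

# Candidate module paths, tried in order.
CANDIDATE_MODULES = (
    "sktime.transformers.series_as_features.summarize",  # path used in this notebook
    "sktime.transformations.panel.tsfresh",  # ASSUMPTION: location in newer releases
)


def import_first_available(name, module_paths):
    """Return attribute `name` from the first importable module that has it, else None."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, name):
            return getattr(module, name)
    return None


extractor_cls = import_first_available("TSFreshFeatureExtractor", CANDIDATE_MODULES)
if extractor_cls is None:
    print("TSFreshFeatureExtractor not found; check the installed sktime version")
```

If `extractor_cls` is not `None`, it can be used wherever `TSFreshFeatureExtractor` appears in the cells below.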

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/01_classification_univariate.ipynb).

In [3]:
X, y = load_gunpoint(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(150, 1) (150,) (50, 1) (50,)
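
The shapes printed above follow from the defaults of `train_test_split`: when no `test_size` is given, scikit-learn holds out 25% of the samples and rounds the test count up. A small sketch of that arithmetic follows. The helper `default_split_sizes` is hypothetical; it mirrors only the size calculation, not the shuffling.

```python
import math


def default_split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the remainder goes to training
    return n_train, n_test


# 200 samples in total, as implied by the shapes printed above
print(default_split_sizes(200))
```

Because no `random_state` is passed in the cell above, the rows that land in each split change from run to run, so the indices and scores shown in this notebook will not reproduce exactly.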


In [4]:
X_train.head()

| | dim_0 |
|---|---|
| 97 | 0 -0.93466 1 -0.93115 2 -0.93276 3... |
| 15 | 0 -0.74395 1 -0.74394 2 -0.74876 3... |
| 89 | 0 -0.65473 1 -0.62292 2 -0.60388 3... |
| 148 | 0 -1.2651 1 -1.2561 2 -1.2594 3 ... |
| 67 | 0 -0.70328 1 -0.70312 2 -0.69938 3... |


In [5]:
# binary classification task
np.unique(y_train)

array(['1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract tsfresh's "efficient" feature set (skips the most expensive calculators)
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.70s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 149.000377 | 33.04328 | 0.502065 | 0.528335 | 0.045675 | -1.05146 | -1.064215 | -0.983288 | -0.017221 | -0.574658 | ... | 1.0 | -0.000613 | 0.006701 | 0.018449 | 0.0 | 0.0 | 0.0 | 0.993336 | 0.0 | 24916560.0 |
| 1 | 149.000528 | 26.809051 | 0.487139 | 0.548132 | 0.083345 | -1.138353 | -1.141532 | -0.83979 | -0.110429 | -0.349328 | ... | 1.0 | 0.001232 | -0.000904 | -0.002479 | 0.0 | 0.0 | 0.0 | 0.993337 | 0.0 | 725725.4 |
| 2 | 148.999027 | 24.7671 | 0.483854 | 0.542441 | 0.09118 | -1.140132 | -1.161469 | -0.791838 | -0.105415 | -0.412108 | ... | 1.0 | 0.000576 | -0.001456 | -0.002994 | 0.0 | 0.0 | 0.0 | 0.993327 | 0.0 | -655695.9 |
| 3 | 149.000736 | 41.57984 | 0.334902 | 0.315385 | 0.0673 | -0.279243 | -0.751708 | -1.502206 | 0.347713 | 0.993578 | ... | 1.0 | -0.001412 | -0.001238 | -0.005247 | 0.0 | 0.0 | 0.0 | 0.993338 | 0.0 | -566286.1 |
| 4 | 149.000824 | 25.52531 | 0.488875 | 0.547169 | 0.086151 | -1.194932 | -1.170003 | -0.78831 | -0.105411 | -0.35274 | ... | 1.0 | 0.001306 | -0.0038 | -0.003115 | 0.0 | 0.0 | 0.0 | 0.993339 | 0.0 | 799463.0 |
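
To make a few of the column names above concrete, here is a minimal pure-Python sketch of three calculators as they are commonly defined: `abs_energy` as the sum of squared values, `absolute_sum_of_changes` as the sum of absolute consecutive differences, and `variance` as the population variance. These are illustrative re-implementations, not tsfresh's own code. The toy series is an assumption chosen to have 150 points and to be z-normalized with the sample standard deviation, which would explain why `abs_energy` is close to 149 and `variance` close to 149/150 in the table; that reading is an inference from the numbers, not something the notebook states.

```python
import statistics


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    return statistics.pvariance(x)


# Toy series of length 150 (assumed), z-normalized with the sample standard deviation.
raw = [float(i % 7) for i in range(150)]
mean, sd = statistics.mean(raw), statistics.stdev(raw)
z = [(v - mean) / sd for v in raw]

print(round(abs_energy(z), 6))  # n - 1 for a series normalized this way
print(round(variance(z), 6))    # (n - 1) / n for the same series
print(absolute_sum_of_changes([1.0, 3.0, 2.0, 5.0]))
```

A mean very close to zero would also account for the extreme `variation_coefficient` values in the last column, since that feature divides by the mean.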


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.65s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:06<00:00,  1.21s/it]




1.0
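
The output above shows two feature-extraction runs, presumably one when the pipeline is fitted on the training data and one when it scores the test data. The following toy sketch illustrates that flow in plain Python. `MeanRangeExtractor`, `NearestCentroid`, and `TinyPipeline` are hypothetical stand-ins for the tsfresh extractor, the random forest, and scikit-learn's pipeline; they are not the real implementations.

```python
class MeanRangeExtractor:
    """Toy transformer: maps each series to [mean, range]."""

    def __init__(self):
        self.n_transform_calls = 0

    def fit(self, X, y=None):
        return self  # nothing to learn for these features

    def transform(self, X):
        self.n_transform_calls += 1
        return [[sum(s) / len(s), max(s) - min(s)] for s in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class NearestCentroid:
    """Toy classifier: predicts the label of the closest class centroid."""

    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids_ = {
            label: [sum(col) / len(col) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: sq_dist(row, self.centroids_[lab]))
            for row in X
        ]


class TinyPipeline:
    """Chains one transformer and one classifier."""

    def __init__(self, transformer, classifier):
        self.transformer = transformer
        self.classifier = classifier

    def fit(self, X, y):
        Xt = self.transformer.fit_transform(X, y)  # first extraction: training data
        self.classifier.fit(Xt, y)
        return self

    def score(self, X, y):
        Xt = self.transformer.transform(X)  # second extraction: test data
        predictions = self.classifier.predict(Xt)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


X_toy_train = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 5.0, 0.0], [1.0, 6.0, 1.0]]
y_toy_train = ["1", "1", "2", "2"]
X_toy_test = [[0.0, 0.2, 0.0], [0.0, 4.0, 1.0]]
y_toy_test = ["1", "2"]

pipeline = TinyPipeline(MeanRangeExtractor(), NearestCentroid())
pipeline.fit(X_toy_train, y_toy_train)
print(pipeline.score(X_toy_test, y_toy_test))
print(pipeline.transformer.n_transform_calls)
```

Keeping the extractor inside the pipeline means the test series are transformed with the same settings as the training series, without a separate manual step.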

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 30 | 0 -0.623875 1 -0.623875 2 -1.081529 3... | 0 -2.123436 1 -2.123436 2 -0.121519 3... | 0 -0.513654 1 -0.513654 2 0.809464 3... | 0 -0.143822 1 -0.143822 2 -1.081329 3... | 0 0.058594 1 0.058594 2 -0.127842 3... | 0 1.086656 1 1.086656 2 0.066584 3... |
| 4 | 0 -0.123238 1 -0.123238 2 -0.249547 3... | 0 0.379341 1 0.379341 2 0.541501 3... | 0 -0.286006 1 -0.286006 2 0.208420 3... | 0 -0.098545 1 -0.098545 2 -0.023970 3... | 0 0.058594 1 0.058594 2 0.175783 3... | 0 -0.074574 1 -0.074574 2 0.114525 3... |
| 28 | 0 -0.373788 1 -0.373788 2 0.076140 3... | 0 0.248056 1 0.248056 2 -1.703104 3... | 0 0.164594 1 0.164594 2 0.803796 3... | 0 -0.143822 1 -0.143822 2 0.026634 3... | 0 -0.183773 1 -0.183773 2 -0.620566 3... | 0 -0.015980 1 -0.015980 2 -3.941791 3... |
| 15 | 0 -0.359319 1 -0.359319 2 4.011746 3... | 0 0.152819 1 0.152819 2 1.04881... | 0 -0.064578 1 -0.064578 2 -1.804903 3... | 0 0.039951 1 0.039951 2 -1.347667 3... | 0 -0.042614 1 -0.042614 2 0.572625 3... | 0 0.125179 1 0.125179 2 -1.861697 3... |
| 15 | 0 -0.159076 1 -0.159076 2 -0.97770... | 0 0.376722 1 0.376722 2 0.38349... | 0 -0.445368 1 -0.445368 2 1.695360 3... | 0 -0.029297 1 -0.029297 2 -0.255684 3... | 0 0.029297 1 0.029297 2 0.375536 3... | 0 -0.047941 1 -0.047941 2 0.516694 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:39<00:00,  7.81s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4402.264342 | 368.61827 | -0.02578 | -0.068173 | 0.027653 | 14.404859 | 3.550912 | -2.012951 | 33.151709 | 28.820866 | ... | 1.0 | -81.172944 | 21.164744 | 10.776898 | 0.0 | 0.0 | 0.0 | 15.749446 | 1.0 | -112.371986 |
| 1 | 8.912128 | 11.658267 | -0.014735 | -0.007014 | 0.043948 | -0.03043 | -0.274553 | -0.528818 | 0.026664 | 0.126049 | ... | 1.0 | -0.000509 | -0.000807 | -0.000318 | 0.0 | 2.0 | 0.0 | 0.026815 | 0.0 | 12.272156 |
| 2 | 193.578195 | 99.122458 | -0.026095 | -0.032476 | 0.025073 | 2.9846 | 0.760303 | -0.864634 | 1.540174 | 5.334237 | ... | 1.0 | -0.15158 | -0.110877 | -0.458514 | 0.0 | 0.0 | 0.0 | 2.194275 | 1.0 | -27.767232 |
| 3 | 18144.435446 | 998.086426 | 0.005495 | -0.061116 | 0.163137 | 21.954887 | 5.377214 | -18.450373 | 190.873929 | 23.318779 | ... | 1.0 | 20.476114 | 23.430968 | 52.614071 | 0.0 | 0.0 | 0.0 | 32.054682 | 1.0 | 1742.432343 |
| 4 | 20089.782616 | 936.012458 | -0.031604 | -0.070448 | 0.144797 | 24.032611 | 6.174375 | -16.526685 | 200.755496 | 27.548164 | ... | 1.0 | 5.090285 | 19.718272 | 76.965414 | 0.0 | 1.0 | 0.0 | 34.822337 | 1.0 | -24.347572 |
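
The column names above suggest that, for multivariate input, each feature is computed separately per dimension and the results are placed side by side with a `dim_k__` prefix. A minimal sketch of that naming scheme follows, using a hypothetical helper `extract_panel_features` and two toy calculators. This illustrates the layout only and is not how tsfresh or sktime implement it.

```python
def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def extract_panel_features(panel, calculators):
    """Compute every calculator on every dimension of every instance.

    panel: list of instances, each a dict {dimension_name: list_of_values}
    calculators: dict {feature_name: function(series) -> number}
    Returns one flat dict per instance, keyed "<dimension>__<feature>".
    """
    rows = []
    for instance in panel:
        row = {}
        for dim_name, series in instance.items():
            for feature_name, func in calculators.items():
                row[f"{dim_name}__{feature_name}"] = func(series)
        rows.append(row)
    return rows


toy_panel = [
    {"dim_0": [1.0, 2.0], "dim_1": [0.0, 3.0]},
    {"dim_0": [0.0, 0.0], "dim_1": [2.0, 2.0]},
]
toy_calculators = {
    "abs_energy": abs_energy,
    "absolute_sum_of_changes": absolute_sum_of_changes,
}

for row in extract_panel_features(toy_panel, toy_calculators):
    print(row)
```

With this layout the number of feature columns grows linearly with the number of dimensions, which is consistent with the longer extraction time shown for the six-dimensional data above.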
