# Feature extraction with tsfresh transformer

This tutorial shows how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that the resulting tabular features can be passed to any scikit-learn estimator.

## Preliminaries
tsfresh must be installed to run this notebook. If it is not installed yet, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

| | dim_0 |
|---|---|
| 59 | 0 -1.9969 1 -2.0076 2 -2.0010 3 ... |
| 141 | 0 -1.7993 1 -1.7962 2 -1.7725 3 ... |
| 16 | 0 -0.79626 1 -0.77368 2 -0.66440 3... |
| 68 | 0 -1.9245 1 -1.9210 2 -1.9066 3 ... |
| 96 | 0 -2.3110 1 -2.3084 2 -2.2969 3 ... |


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# use tsfresh's "efficient" feature set and suppress tsfresh warnings
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:21<00:00,  4.26s/it]




| | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 249.999003 | 85.444066 | 0.146004 | 0.177188 | 0.126852 | -0.003343 | -0.287678 | -1.033774 | 0.119964 | 0.957359 | ... | 1.0 | 0.067221 | 0.003165 | -0.013674 | 0.0 | 0.0 | 0.0 | 0.996012 | 0.0 | -1123314.0 |
| 1 | 249.999633 | 80.266136 | 0.301887 | 0.325918 | 0.077819 | -0.319642 | -0.589998 | -1.13815 | 0.088368 | 0.677951 | ... | 1.0 | 0.043408 | 0.008022 | -0.016433 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | -2341115.0 |
| 2 | 250.000369 | 62.313686 | -0.021289 | -0.060843 | 0.348135 | 0.562619 | -0.217615 | -1.120359 | 0.263682 | 2.23282 | ... | 1.0 | 0.00173 | 0.003613 | 0.013407 | 0.0 | 0.0 | 0.0 | 0.996017 | 0.0 | 1291235.0 |
| 3 | 249.99969 | 86.85108 | 0.179057 | 0.203519 | 0.108137 | -0.162086 | -0.430559 | -1.220161 | 0.121316 | 0.799149 | ... | 1.0 | 0.054972 | 0.012204 | -0.012224 | 0.0 | 0.0 | 0.0 | 0.996015 | 0.0 | 1138633.0 |
| 4 | 250.00059 | 81.795692 | 0.28222 | 0.269636 | 0.053123 | -0.243313 | -0.590967 | -0.961988 | 0.083091 | 0.483924 | ... | 1.0 | 0.059395 | -0.010595 | -0.052214 | 0.0 | 0.0 | 0.0 | 0.996018 | 0.0 | -1681207.0 |
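
To make some of the column names above more concrete, the following is a minimal pure-Python sketch of three of the listed features (`abs_energy`, `absolute_sum_of_changes` and `variance`). It assumes the usual definitions: the sum of squared values, the sum of absolute differences between consecutive values, and the population variance. The helper functions and the toy series are hypothetical illustrations; tsfresh's own feature calculators remain the reference.

```python
def abs_energy(x):
    """Sum of the squared values of the series."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


# Hypothetical toy series standing in for one cell of the dim_0 column.
series = [1.0, -1.0, 2.0, -2.0]

# Column names follow the "<dimension>__<feature>" pattern seen above.
features = {
    "dim_0__abs_energy": abs_energy(series),
    "dim_0__absolute_sum_of_changes": absolute_sum_of_changes(series),
    "dim_0__variance": variance(series),
}
print(features)
```

Under these definitions, a series whose mean is close to zero has a variance of roughly `abs_energy` divided by the series length, which is consistent with the `abs_energy` values near 250 and `variance` values near 0.996 shown above.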


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:21<00:00,  4.30s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:07<00:00,  1.42s/it]




0.8867924528301887
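
For a scikit-learn classifier pipeline, `score` reports the mean accuracy on the given test data, that is, the fraction of test instances whose predicted label equals the true label. The sketch below spells this out in plain Python. The `accuracy` helper and the label lists are hypothetical: 47 correct predictions out of 53 test instances is an assumption chosen because it gives a value matching the score printed above, not a count reported by the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 53 test instances, 47 of them predicted correctly.
y_true = ["0"] * 53
y_pred = ["0"] * 47 + ["1"] * 6

acc = accuracy(y_true, y_pred)
print(acc)
```

Because `train_test_split` is called without a fixed `random_state`, the split (and hence the score) will generally differ from run to run.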

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 13 | 0 2.580342 1 2.580342 2 -7.26891... | 0 -0.850954 1 -0.850954 2 -6.06223... | 0 -0.150030 1 -0.150030 2 0.96421... | 0 -0.005327 1 -0.005327 2 0.002663 3... | 0 0.050604 1 0.050604 2 -0.364882 3... | 0 0.311615 1 0.311615 2 -0.772378 3... |
| 38 | 0 -0.396060 1 -0.396060 2 -0.268022 3... | 0 -0.686878 1 -0.686878 2 0.103510 3... | 0 -0.350328 1 -0.350328 2 -1.281489 3... | 0 -0.157139 1 -0.157139 2 0.274327 3... | 0 -0.058594 1 -0.058594 2 -0.471417 3... | 0 0.151812 1 0.151812 2 0.332922 3... |
| 39 | 0 0.901645 1 0.901645 2 -0.05469... | 0 2.581916 1 2.581916 2 -0.01142... | 0 -0.353783 1 -0.353783 2 -0.009521 3... | 0 -0.455437 1 -0.455437 2 -0.250357 3... | 0 0.106535 1 0.106535 2 -0.069248 3... | 0 0.245030 1 0.245030 2 0.005327 3... |
| 34 | 0 0.140313 1 0.140313 2 0.903629 3... | 0 -0.604627 1 -0.604627 2 1.621493 3... | 0 -0.221660 1 -0.221660 2 0.486719 3... | 0 0.079901 1 0.079901 2 0.420813 3... | 0 -0.085228 1 -0.085228 2 -0.428803 3... | 0 -0.010653 1 -0.010653 2 1.171884 3... |
| 28 | 0 0.369660 1 0.369660 2 -0.635316 3... | 0 -0.645952 1 -0.645952 2 -4.169368 3... | 0 0.063500 1 0.063500 2 -0.315898 3... | 0 -0.101208 1 -0.101208 2 0.122515 3... | 0 -0.029297 1 -0.029297 2 -0.205080 3... | 0 0.045277 1 0.045277 2 0.197090 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:35<00:00,  7.17s/it]




| | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10764.169856 | 671.27214 | 0.016019 | -0.037014 | 0.161924 | 10.704797 | 1.946699 | -12.205448 | 65.296221 | 15.367535 | ... | 1.0 | 4.948201 | 5.331894 | 16.386624 | 0.0 | 0.0 | 0.0 | 19.585905 | 1.0 | 196.180932 |
| 1 | 2898.605554 | 309.024483 | -0.02014 | -0.074435 | 0.025175 | 8.286761 | 2.022176 | -1.280324 | 13.471884 | 19.513924 | ... | 1.0 | -16.34425 | -19.345028 | 6.058669 | 0.0 | 1.0 | 0.0 | 11.472258 | 1.0 | 9.708549 |
| 2 | 6541.21025 | 400.323922 | -0.01683 | -0.063614 | 0.02744 | 13.214625 | 3.573906 | -2.799902 | 28.263487 | 27.77237 | ... | 1.0 | -44.790364 | 6.344127 | 32.062888 | 0.0 | 1.0 | 0.0 | 15.337062 | 1.0 | -13.147464 |
| 3 | 6047.333821 | 397.974224 | 0.022006 | -0.046906 | 0.041267 | 13.382946 | 6.20152 | 0.965148 | 26.606829 | 29.394165 | ... | 1.0 | 7.155026 | 14.952607 | 22.434919 | 0.0 | 0.0 | 0.0 | 14.874024 | 1.0 | -238.557676 |
| 4 | 123.411342 | 83.323817 | -0.021147 | -0.008992 | 0.074056 | 2.027459 | 0.411596 | -0.957229 | 0.800756 | 2.808579 | ... | 1.0 | 0.149593 | 0.361487 | 0.451426 | 0.0 | 0.0 | 0.0 | 1.521051 | 1.0 | 13.044029 |
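
The column names above run from `dim_0__...` to `dim_5__...`, which suggests that features are computed separately for each dimension and then placed side by side in one row per instance. The sketch below illustrates that layout in plain Python. It is an assumption-laden illustration, not the sktime or tsfresh implementation: the `FEATURES` registry, the `extract_features_per_dimension` helper and the toy instance are all hypothetical.

```python
def abs_energy(x):
    """Sum of the squared values of the series."""
    return sum(v * v for v in x)


def variance(x):
    """Population variance (divides by n)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


# Hypothetical registry of feature functions, keyed by feature name.
FEATURES = {"abs_energy": abs_energy, "variance": variance}


def extract_features_per_dimension(instance):
    """Flatten one multivariate instance into a single feature row.

    `instance` maps a dimension name to that dimension's list of values.
    Each output key follows the "<dimension>__<feature>" naming pattern.
    """
    row = {}
    for dim_name, series in instance.items():
        for feat_name, func in FEATURES.items():
            row[f"{dim_name}__{feat_name}"] = func(series)
    return row


# Hypothetical instance with two dimensions of equal length.
instance = {"dim_0": [1.0, 2.0, 3.0], "dim_1": [0.0, 0.0, 4.0]}
row = extract_features_per_dimension(instance)
print(row)
```

With this layout, the number of feature columns grows linearly with the number of dimensions, so six dimensions yield six times as many columns as a univariate series for the same feature set.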
