# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that the resulting tabular features can be passed to any scikit-learn estimator.

## Preliminaries
Install tsfresh if you haven't already. To do so, uncomment the line in the cell below and run it:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_arrow_head, load_basic_motions
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)
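
The shapes printed above follow from how `train_test_split` sizes a split when no sizes are given. As a rough sketch, assuming the scikit-learn default of `test_size=0.25` with the test count rounded up (the helper `default_split_sizes` is hypothetical and written only for illustration):

```python
import math


def default_split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a split with only a test fraction given."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the remainder goes to training
    return n_train, n_test


# 158 + 53 = 211 instances in total, as in the shapes printed above
print(default_split_sizes(211))  # (158, 53)
```

Because no `random_state` is passed in the cell above, which instances land in each part changes from run to run, even though the sizes stay the same.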


In [4]:
X_train.head()

| | dim_0 |
|---|---|
| 154 | 0 -1.6106 1 -1.6097 2 -1.5854 3 ... |
| 136 | 0 -1.6425 1 -1.6378 2 -1.6359 3 ... |
| 16 | 0 -2.0537 1 -2.0369 2 -2.0330 3 ... |
| 80 | 0 -1.9186 1 -1.9088 2 -1.8903 3 ... |
| 69 | 0 -1.7998 1 -1.7987 2 -1.7942 3 ... |


In [5]:
# multiclass classification task: three class labels
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# "efficient" selects tsfresh's feature set without the most expensive calculators
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.42s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,250.000382,76.65071,0.36614,0.41698,0.080675,-0.382364,-0.701712,-1.235681,0.099315,0.621025,...,1.0,0.032009,0.010176,-0.005448,0.0,0.0,0.0,0.996017,0.0,-640664.2
1,250.000498,77.475422,0.350508,0.398169,0.082753,-0.310222,-0.681474,-1.174927,0.086283,0.616033,...,1.0,0.032245,0.0048,-0.013597,0.0,0.0,0.0,0.996018,0.0,1456394.0
2,250.000758,81.133902,0.311659,0.325673,0.063002,-0.326345,-0.655443,-1.096561,0.084894,0.59166,...,1.0,0.051569,-0.003595,-0.041251,0.0,0.0,0.0,0.996019,0.0,-1368852.0
3,249.998779,82.283528,0.271333,0.303863,0.096954,-0.092147,-0.457649,-1.064945,0.098616,0.791964,...,1.0,0.055075,-0.000432,-0.011301,0.0,0.0,0.0,0.996011,0.0,50099780.0
4,249.998516,83.557328,0.234075,0.274195,0.106038,-0.229716,-0.491494,-1.1766,0.101083,0.736171,...,1.0,0.051651,0.015625,0.013673,0.0,0.0,0.0,0.99601,0.0,-3131234.0
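
Each column in the table above is named after the input column and the tsfresh feature calculator that produced it, for example `dim_0__abs_energy`. To make a few of these features concrete, here is a minimal pure-Python sketch of four of them. It follows the definitions as we understand them from the tsfresh documentation (population variance, and standard deviation divided by mean for the variation coefficient); it is not tsfresh's own implementation, and the toy series is made up for illustration.

```python
import math


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def variation_coefficient(x):
    """Standard deviation divided by the mean (unstable when the mean is near 0)."""
    mean = sum(x) / len(x)
    return math.sqrt(variance(x)) / mean


series = [1.0, 3.0, 2.0, 6.0]  # toy data, not taken from ArrowHead

print(abs_energy(series))               # 50.0
print(absolute_sum_of_changes(series))  # 7.0
print(variance(series))                 # 3.5
print(variation_coefficient(series))
```

In the output above, `dim_0__abs_energy` divided by `dim_0__variance` is roughly 251 in every row, which is what these definitions give for a series of about 251 points whose mean is close to zero. A near-zero mean would also explain the very large `dim_0__variation_coefficient` values. This reading is an inference from the printed numbers, not something the notebook states.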


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.46s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:07<00:00,  1.49s/it]




0.8113207547169812

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 14 | 0 1.186069 1 1.186069 2 0.02547... | 0 0.013225 1 0.013225 2 1.92628... | 0 -0.377872 1 -0.377872 2 -1.253871 3... | 0 0.367545 1 0.367545 2 0.221060 3... | 0 -0.253020 1 -0.253020 2 -0.500714 3... | 0 0.114525 1 0.114525 2 -0.173119 3... |
| 22 | 0 0.175924 1 0.175924 2 0.194403 3... | 0 0.548757 1 0.548757 2 -3.699192 3... | 0 -1.191314 1 -1.191314 2 -0.554051 3... | 0 0.039951 1 0.039951 2 0.042614 3... | 0 0.263674 1 0.263674 2 -0.178446 3... | 0 0.937507 1 0.937507 2 0.071911 3... |
| 0 | 0 0.079106 1 0.079106 2 -0.903497 3... | 0 0.394032 1 0.394032 2 -3.666397 3... | 0 0.551444 1 0.551444 2 -0.282844 3... | 0 0.351565 1 0.351565 2 -0.095881 3... | 0 0.023970 1 0.023970 2 -0.319605 3... | 0 0.633883 1 0.633883 2 0.972131 3... |
| 8 | 0 0.498121 1 0.498121 2 0.196889 3... | 0 0.031305 1 0.031305 2 -3.122323 3... | 0 -0.358509 1 -0.358509 2 0.258171 3... | 0 0.047941 1 0.047941 2 0.143822 3... | 0 -0.119852 1 -0.119852 2 0.015980 3... | 0 0.005327 1 0.005327 2 0.010653 3... |
| 17 | 0 0.324449 1 0.324449 2 9.29442... | 0 -0.977516 1 -0.977516 2 -6.96322... | 0 -1.260218 1 -1.260218 2 -2.498493 3... | 0 -0.788358 1 -0.788358 2 2.434323 3... | 0 0.316941 1 0.316941 2 -0.079901 3... | 0 0.588605 1 0.588605 2 6.535916 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:38<00:00,  7.78s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,15899.87279,936.942443,-0.011471,-0.101987,0.132759,18.427284,4.24486,-14.080067,116.261213,21.03813,...,1.0,6.013089,15.974199,78.793027,0.0,0.0,0.0,33.619891,1.0,59.79233
1,383.560959,127.01821,-0.000228,0.001823,0.125313,3.757798,1.359549,-0.84666,2.566076,4.442557,...,1.0,0.2535,0.439563,1.125258,0.0,0.0,0.0,2.834777,1.0,-56.442891
2,10.629914,22.690124,0.039365,0.029099,0.008885,1.021608,0.068493,-0.493076,0.172195,1.6382,...,1.0,0.019919,-0.005089,-0.02841,0.0,0.0,0.0,0.260379,0.0,9.377847
3,9.806233,19.646853,0.008286,0.009657,0.029126,0.480531,-0.025416,-0.654871,0.153734,0.979739,...,1.0,0.000909,0.004847,0.008813,0.0,3.0,0.0,0.123778,0.0,44.626858
4,13876.020277,736.256653,0.005391,-0.171305,0.18451,14.628194,5.399811,-12.349274,75.352924,18.280817,...,1.0,2.145671,14.394118,24.538999,0.0,0.0,0.0,26.76324,1.0,-23.953564
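
For multivariate input, the extracted columns run from `dim_0__...` through `dim_5__...`, so the prefix before the first double underscore tells you which input dimension a feature came from. The sketch below shows one way to group feature names by dimension using plain string handling. The helper `split_feature_name` is hypothetical, and it assumes that the input column names themselves contain no double underscore; the example names are copied from the table header above.

```python
from collections import defaultdict


def split_feature_name(name):
    """Split '<dim>__<feature>[__<param>...]' into (dim, feature, params)."""
    dim, feature, *params = name.split("__")
    return dim, feature, params


# a handful of column names copied from the output above
columns = [
    "dim_0__abs_energy",
    'dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40',
    "dim_5__variance",
    "dim_5__time_reversal_asymmetry_statistic__lag_3",
]

# collect the feature calculators seen for each input dimension
by_dim = defaultdict(list)
for col in columns:
    dim, feature, params = split_feature_name(col)
    by_dim[dim].append(feature)

print(dict(by_dim))
```

With the real `Xt`, passing `Xt.columns` instead of the hand-written list would give the same grouping for all six dimensions.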
