# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that any scikit-learn estimator can then be trained on the resulting tabular data.

## Preliminaries
Install tsfresh if you haven't already. To do so, uncomment the line in the cell below and run it:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

Unnamed: 0,dim_0
164,0 -1.8055 1 -1.7985 2 -1.7606 3 ...
60,0 -1.9674 1 -1.9672 2 -1.9512 3 ...
38,0 -2.1322 1 -2.1192 2 -2.0902 3 ...
18,0 -1.9501 1 -1.9645 2 -1.9495 3 ...
135,0 -1.7096 1 -1.7071 2 -1.6632 3 ...


In [5]:
# multiclass classification task with three classes
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract features using tsfresh's "efficient" parameter set
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.59s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,250.000226,76.69021,0.293866,0.330204,0.090908,-0.282153,-0.547447,-0.969357,0.073632,0.690201,...,1.0,0.044604,0.002661,-0.016694,0.0,0.0,0.0,0.996017,0.0,409313.1
1,249.998592,87.688278,0.205576,0.230787,0.094525,-0.119874,-0.455357,-1.269181,0.139167,0.980217,...,1.0,0.066438,0.015675,-0.003061,0.0,0.0,0.0,0.99601,0.0,813307.8
2,249.999868,91.793236,0.101214,0.104178,0.118303,0.075591,-0.302352,-1.243037,0.17488,0.91731,...,1.0,0.074464,0.003644,-0.034402,0.0,0.0,0.0,0.996015,0.0,-1431425.0
3,250.000417,87.342758,0.205766,0.223395,0.098595,-0.170609,-0.461739,-1.268942,0.126348,0.80362,...,1.0,0.067745,0.018134,-0.00187,0.0,0.0,0.0,0.996018,0.0,515431.5
4,249.999798,77.424482,0.336698,0.384817,0.085215,-0.286724,-0.624776,-1.114332,0.091544,0.682365,...,1.0,0.037713,0.006068,-0.012956,0.0,0.0,0.0,0.996015,0.0,566740.7
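To make the column names above more concrete, the following minimal sketch recomputes three of the simpler features by hand for a toy series, using only the Python standard library. The helper functions and the toy values are illustrative assumptions and are not part of sktime or tsfresh; the definitions follow the usual meaning of these features (sum of squares, sum of absolute consecutive differences, population variance).

```python
import statistics


def abs_energy(x):
    """Sum of squared values of the series."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    return statistics.pvariance(x)


# Toy series standing in for one cell of the nested input (hypothetical values).
series = [1.0, 3.0, 2.0, 5.0]

# Keys mimic the "<dimension>__<feature>" naming seen in the output above.
features = {
    "dim_0__abs_energy": abs_energy(series),
    "dim_0__absolute_sum_of_changes": absolute_sum_of_changes(series),
    "dim_0__variance": variance(series),
}
print(features)
```

Each row of the extracted table holds values of this kind for one time series, so the result is an ordinary tabular data set with one row per instance.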


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.41s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:07<00:00,  1.47s/it]




0.8867924528301887
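For a classifier, `score` reports the mean accuracy on the given data. With 53 test series, the value above is consistent with 47 correct predictions (47 / 53 ≈ 0.8868). Because neither the train/test split nor the random forest uses a fixed seed here, the exact number will vary between runs. The sketch below spells out the accuracy computation in plain Python; the `accuracy` helper and the toy labels are illustrative assumptions, not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, using the same class names as the data set above.
y_true = ["0", "1", "2", "1"]
y_pred = ["0", "1", "1", "1"]
print(accuracy(y_true, y_pred))  # 3 of 4 correct

# Arithmetic check for the score reported above: 47 of 53 correct.
print(47 / 53)
```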

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
26,0 -0.761604 1 -0.761604 2 0.121078 3...,0 0.260125 1 0.260125 2 -1.423255 3...,0 -0.064487 1 -0.064487 2 0.075600 3...,0 0.069248 1 0.069248 2 -0.282318 3...,0 0.242367 1 0.242367 2 -0.332922 3...,0 -0.007990 1 -0.007990 2 0.239704 3...
24,0 0.383922 1 0.383922 2 -0.272575 3...,0 0.302612 1 0.302612 2 -1.381236 3...,0 -0.398075 1 -0.398075 2 -0.681258 3...,0 0.071911 1 0.071911 2 -0.761725 3...,0 0.175783 1 0.175783 2 -0.114525 3...,0 -0.087891 1 -0.087891 2 -0.503377 3...
5,0 -1.182602 1 -0.765368 2 -0.519464 3...,0 -0.612973 1 -2.759566 2 -3.213704 3...,0 0.167450 1 0.414760 2 0.907956 3...,0 -0.276991 1 -0.508704 2 -0.077238 3...,0 -0.082565 1 -0.114525 2 -0.261010 3...,0 -0.213070 1 -0.426140 2 0.215733 3...
15,0 -0.159076 1 -0.159076 2 -0.97770...,0 0.376722 1 0.376722 2 0.38349...,0 -0.445368 1 -0.445368 2 1.695360 3...,0 -0.029297 1 -0.029297 2 -0.255684 3...,0 0.029297 1 0.029297 2 0.375536 3...,0 -0.047941 1 -0.047941 2 0.516694 3...
35,0 1.102297 1 1.102297 2 0.73238...,0 -1.790773 1 -1.790773 2 0.661191 3...,0 0.001413 1 0.001413 2 -1.57956...,0 0.258347 1 0.258347 2 -0.127842 3...,0 -0.165129 1 -0.165129 2 -0.16779...,0 0.516694 1 0.516694 2 -0.58860...


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:37<00:00,  7.49s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,220.949429,104.677565,-0.009585,-0.050995,0.059427,2.753245,0.62732,-0.862156,1.634518,3.490647,...,1.0,0.029599,0.139377,0.368081,0.0,0.0,0.0,1.671577,1.0,62.315369
1,354.117244,114.057871,-0.023083,-0.032382,0.030085,3.329963,1.457159,-0.272249,1.32189,6.289511,...,1.0,0.15313,0.718881,1.765228,0.0,0.0,0.0,2.662879,1.0,-20.101505
2,14.584008,25.125394,-0.011048,-0.021158,0.022999,0.803062,-0.027104,-0.612088,0.193713,1.887386,...,1.0,0.01075,0.028248,0.022197,0.0,0.0,0.0,0.2009,0.0,103.88259
3,20089.782616,936.012458,-0.031604,-0.070448,0.144797,24.032611,6.174375,-16.526685,200.755496,27.548164,...,1.0,5.090285,19.718272,76.965414,0.0,1.0,0.0,34.822337,1.0,-24.347572
4,8664.94077,324.813455,-0.034966,-0.112034,0.071626,20.912103,5.560128,-1.942906,63.322971,28.026657,...,1.0,32.364692,48.229669,93.801281,0.0,1.0,0.0,19.9436,1.0,-14.591905
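In the multivariate case, every feature column is prefixed with the dimension it was computed from (`dim_0__...` through `dim_5__...`). A minimal sketch of how such column names could be grouped by dimension is shown below; the `group_by_dimension` helper is a hypothetical illustration, and the column list is a small hand-picked subset of the names shown above.

```python
from collections import defaultdict


def group_by_dimension(columns):
    """Group feature names by the dimension prefix before the first '__'."""
    groups = defaultdict(list)
    for name in columns:
        dim, _, feature = name.partition("__")
        groups[dim].append(feature)
    return dict(groups)


# A few column names copied from the output above.
columns = [
    "dim_0__abs_energy",
    "dim_0__absolute_sum_of_changes",
    "dim_5__variance",
    "dim_5__variation_coefficient",
]

grouped = group_by_dimension(columns)
for dim, feats in grouped.items():
    print(dim, feats)
```

With the real output, passing `Xt.columns` instead of the hand-written list would give the features available for each dimension.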
