# Feature extraction with the tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that any scikit-learn estimator can then be trained on the resulting tabular data.

## Preliminaries
tsfresh must be installed to run this notebook. If you have not installed it yet, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

,dim_0
15,0 -2.1645 1 -2.1785 2 -2.0660 3 ...
2,0 -1.8660 1 -1.8420 2 -1.8350 3 ...
10,0 -2.0336 1 -2.0052 2 -1.9754 3 ...
165,0 -1.8302 1 -1.8123 2 -1.8122 3 ...
168,0 -1.5317 1 -1.5413 2 -1.5150 3 ...
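Each cell of `dim_0` above holds an entire time series rather than a single number. The following is a minimal sketch of that nested layout, built with pandas only; the values are random and the name `X_toy` is hypothetical, so it illustrates the structure rather than the actual ArrowHead data.

```python
import numpy as np
import pandas as pd

# Three toy series of length 5; the values are made up for illustration.
rng = np.random.default_rng(0)
cells = np.empty(3, dtype=object)
for i in range(3):
    cells[i] = pd.Series(rng.normal(size=5))

# One row per instance, one column per dimension; each cell holds a whole series.
X_toy = pd.DataFrame({"dim_0": cells})

print(X_toy.shape)             # (n_instances, n_dimensions)
print(type(X_toy.iloc[0, 0]))  # type of a single cell
print(len(X_toy.iloc[0, 0]))   # number of time points in the first series
```

This is why `X_train.shape` reports one column per dimension even though every series contains many time points.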


In [5]:
# multi-class classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)
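To see how the instances are distributed over the classes, `np.unique` can also return counts. A small sketch on made-up labels (the array `y_toy` is hypothetical and does not reflect the real class balance):

```python
import numpy as np

# Toy labels stored as strings, mirroring the dtype of y_train above.
y_toy = np.array(["0", "1", "2", "1", "0", "1"], dtype=object)

# return_counts=True adds the number of occurrences of each unique label.
classes, counts = np.unique(y_toy, return_counts=True)
for label, count in zip(classes, counts):
    print(label, count)
```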

## Using tsfresh to extract features

In [6]:
# "efficient" selects tsfresh's reduced feature set, which omits the costliest calculators
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.40s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,250.000147,90.75308,0.025819,0.027341,0.156608,0.58199,0.081349,-1.1014,0.294958,1.408367,...,1.0,0.080009,0.020304,-0.006633,0.0,0.0,0.0,0.996017,0.0,2216810.0
1,250.000681,78.336636,0.272725,0.30754,0.093725,-0.289365,-0.539812,-1.012311,0.076638,0.66019,...,1.0,0.048147,0.004981,-0.01761,0.0,0.0,0.0,0.996019,0.0,1482248.0
2,250.000207,92.344784,0.140379,0.166116,0.115197,0.002486,-0.350879,-1.450639,0.229205,0.898647,...,1.0,0.070395,0.035235,0.011076,0.0,0.0,0.0,0.996017,0.0,642306.7
3,250.000184,79.858768,0.3017,0.332384,0.080103,-0.327842,-0.590065,-1.135305,0.090937,0.676802,...,1.0,0.05093,0.009605,-0.010717,0.0,0.0,0.0,0.996017,0.0,-2070245.0
4,250.000137,74.267442,0.377671,0.444606,0.092169,-0.226793,-0.669387,-1.186474,0.095314,0.709578,...,1.0,0.027791,0.00408,-0.010069,0.0,0.0,0.0,0.996016,0.0,1278059.0
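The column names follow the pattern `<dimension>__<feature>__<parameters>`. As a rough illustration of what a few of the simpler columns measure, the sketch below recomputes three of them by hand with NumPy on a made-up series. It reflects our understanding of the tsfresh definitions and is not the library's own code.

```python
import numpy as np

# Toy series; the values are made up for illustration.
x = np.array([1.0, 2.0, 4.0, 3.0])

# abs_energy: sum of squared values
abs_energy = np.sum(x ** 2)

# absolute_sum_of_changes: sum of absolute differences between consecutive points
absolute_sum_of_changes = np.sum(np.abs(np.diff(x)))

# variance: population variance (ddof=0)
variance = np.var(x)

print(abs_energy, absolute_sum_of_changes, variance)
```

Under these definitions, `abs_energy` equals the series length times (variance plus squared mean), so the table values of about 250 and 0.996 would be consistent with series of roughly 251 points whose mean is close to zero.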


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:28<00:00,  5.75s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:07<00:00,  1.49s/it]




0.8301886792452831
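The pipeline pattern itself does not depend on tsfresh: any transformer that turns series into a table of features can sit in front of a scikit-learn estimator. The sketch below shows the same pattern with scikit-learn only, under loud assumptions: `summary_features` is a hypothetical stand-in for the feature extractor, the data are synthetic, and `random_state` is fixed so that the split and the forest are reproducible (the cells above do not fix it, so their score varies between runs).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summary_features(X):
    """Hypothetical stand-in extractor: (n_instances, n_timepoints) -> 4 features."""
    return np.column_stack(
        [X.mean(axis=1), X.std(axis=1), X.min(axis=1), X.max(axis=1)]
    )


# Synthetic data: class 0 is low-variance noise, class 1 is high-variance noise.
rng = np.random.default_rng(42)
X_syn = np.vstack(
    [rng.normal(0.0, 1.0, size=(50, 100)), rng.normal(0.0, 3.0, size=(50, 100))]
)
y_syn = np.array([0] * 50 + [1] * 50)

# Stratified, seeded split so that the result is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)

# Feature extraction followed by a tabular classifier, as in the cell above.
clf = make_pipeline(
    FunctionTransformer(summary_features),
    RandomForestClassifier(random_state=0),
)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(accuracy)
```

Because the transformer is part of the pipeline, features are extracted again when scoring, which is why the tsfresh cell above shows two feature extraction runs: one for `fit` and one for `score`.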

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
15,0 -0.359319 1 -0.359319 2 4.011746 3...,0 0.152819 1 0.152819 2 1.04881...,0 -0.064578 1 -0.064578 2 -1.804903 3...,0 0.039951 1 0.039951 2 -1.347667 3...,0 -0.042614 1 -0.042614 2 0.572625 3...,0 0.125179 1 0.125179 2 -1.861697 3...
31,0 0.130669 1 0.130669 2 0.06882...,0 -0.119724 1 -0.119724 2 -4.08360...,0 -1.019916 1 -1.019916 2 5.39025...,0 0.684487 1 0.684487 2 0.394179 3...,0 0.290308 1 0.290308 2 0.617902 3...,0 0.679160 1 0.679160 2 1.595360 3...
13,0 2.580342 1 2.580342 2 -7.26891...,0 -0.850954 1 -0.850954 2 -6.06223...,0 -0.150030 1 -0.150030 2 0.96421...,0 -0.005327 1 -0.005327 2 0.002663 3...,0 0.050604 1 0.050604 2 -0.364882 3...,0 0.311615 1 0.311615 2 -0.772378 3...
26,0 -0.761604 1 -0.761604 2 0.121078 3...,0 0.260125 1 0.260125 2 -1.423255 3...,0 -0.064487 1 -0.064487 2 0.075600 3...,0 0.069248 1 0.069248 2 -0.282318 3...,0 0.242367 1 0.242367 2 -0.332922 3...,0 -0.007990 1 -0.007990 2 0.239704 3...
3,0 0.289855 1 0.289855 2 -0.669185 3...,0 0.284130 1 0.284130 2 -0.210466 3...,0 0.213680 1 0.213680 2 0.252267 3...,0 -0.314278 1 -0.314278 2 0.018644 3...,0 0.074574 1 0.074574 2 0.007990 3...,0 -0.079901 1 -0.079901 2 0.237040 3...


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:38<00:00,  7.70s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,18144.435446,998.086426,0.005495,-0.061116,0.163137,21.954887,5.377214,-18.450373,190.873929,23.318779,...,1.0,20.476114,23.430968,52.614071,0.0,0.0,0.0,32.054682,1.0,1742.432343
1,7638.280878,494.592749,-0.037506,-0.046871,0.015511,13.940877,5.248257,-0.141364,21.7896,27.20738,...,1.0,-11.139032,8.109147,-0.337394,0.0,0.0,0.0,13.456041,1.0,-11.257922
2,10764.169856,671.27214,0.016019,-0.037014,0.161924,10.704797,1.946699,-12.205448,65.296221,15.367535,...,1.0,4.948201,5.331894,16.386624,0.0,0.0,0.0,19.585905,1.0,196.180932
3,220.949429,104.677565,-0.009585,-0.050995,0.059427,2.753245,0.62732,-0.862156,1.634518,3.490647,...,1.0,0.029599,0.139377,0.368081,0.0,0.0,0.0,1.671577,1.0,62.315369
4,6.150112,19.595197,0.009984,0.018959,0.011848,0.183837,-0.084124,-0.391862,0.037496,0.289855,...,1.0,0.002119,0.003174,-0.002938,0.0,3.0,0.0,0.061584,0.0,8.401754
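For multivariate input, the features of every dimension end up side by side in one table, distinguished by the `dim_<k>__` prefix. A small pandas sketch of grouping or selecting columns by that prefix follows; the column names are copied from the output above, while `Xt_toy` and its zero values are placeholders.

```python
import pandas as pd

# Column names copied from the output above; the values are placeholders.
columns = [
    "dim_0__abs_energy",
    "dim_0__absolute_sum_of_changes",
    "dim_5__variance",
    "dim_5__variation_coefficient",
]
Xt_toy = pd.DataFrame([[0.0] * len(columns)], columns=columns)

# Everything before the first double underscore is the dimension name.
dimension = Xt_toy.columns.str.split("__").str[0]

# Count how many features were derived from each dimension.
features_per_dimension = (
    pd.Series(Xt_toy.columns, index=dimension).groupby(level=0).size()
)
print(features_per_dimension.to_dict())

# Keep only the features derived from one dimension.
Xt_dim5 = Xt_toy.loc[:, dimension == "dim_5"]
print(list(Xt_dim5.columns))
```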
