# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that the extracted features can then be passed to any scikit-learn estimator.

## Preliminaries
You have to install tsfresh if you haven't already. To install it, uncomment the cell below:

In [2]:
# !pip install --upgrade tsfresh

In [3]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

from sktime.datasets import load_arrow_head, load_basic_motions
from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/main/examples/02_classification_univariate.ipynb).

In [4]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [5]:
X_train.head()

Unnamed: 0,dim_0
42,0 -1.9921 1 -2.0144 2 -1.9611 3 ...
128,0 -1.6729 1 -1.6837 2 -1.6643 3 ...
117,0 -2.0520 1 -2.0515 2 -2.0022 3 ...
100,0 -1.9503 1 -1.9472 2 -1.9191 3 ...
21,0 -1.8127 1 -1.8257 2 -1.7844 3 ...


In [6]:
# three-class classification task
np.unique(y_train)

array(['0', '1', '2'], dtype='<U1')

## Using tsfresh to extract features

In [7]:
# extract the "efficient" tsfresh feature set from each training series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.74s/it]


Unnamed: 0,dim_0__variance_larger_than_standard_deviation,dim_0__has_duplicate_max,dim_0__has_duplicate_min,dim_0__has_duplicate,dim_0__sum_values,dim_0__abs_energy,dim_0__mean_abs_change,dim_0__mean_change,dim_0__mean_second_derivative_central,dim_0__median,...,dim_0__permutation_entropy__dimension_5__tau_1,dim_0__permutation_entropy__dimension_6__tau_1,dim_0__permutation_entropy__dimension_7__tau_1,dim_0__query_similarity_count__query_None__threshold_0.0,"dim_0__matrix_profile__feature_""min""__threshold_0.98","dim_0__matrix_profile__feature_""max""__threshold_0.98","dim_0__matrix_profile__feature_""mean""__threshold_0.98","dim_0__matrix_profile__feature_""median""__threshold_0.98","dim_0__matrix_profile__feature_""25""__threshold_0.98","dim_0__matrix_profile__feature_""75""__threshold_0.98"
0,0.0,0.0,0.0,1.0,-0.000408,249.999669,0.057172,9.4e-05,-4.8e-05,-0.18541,...,2.043598,2.341616,2.5723,0.0,1.80637,11.244777,6.309331,7.787638,3.15103,8.403401
1,0.0,1.0,0.0,1.0,-9.8e-05,250.000942,0.046874,2.1e-05,4.9e-05,0.28796,...,2.291622,2.559976,2.749413,0.0,1.538801,16.826697,13.190382,16.705296,9.914928,16.729647
2,0.0,0.0,0.0,0.0,9.7e-05,250.001186,0.062056,-1.3e-05,-2e-06,-0.021274,...,2.972583,3.488558,3.851388,0.0,1.48854,8.077041,3.175156,2.463372,1.860125,3.919197
3,0.0,1.0,0.0,1.0,0.000295,250.00022,0.053527,0.000101,-0.000104,-0.09258,...,1.894422,2.153446,2.383973,0.0,2.016207,10.271474,5.152559,5.38364,3.628404,6.327998
4,0.0,0.0,0.0,0.0,-8.1e-05,250.000555,0.066529,-6e-05,2e-05,-0.006648,...,1.828227,2.075237,2.289825,0.0,1.985902,7.510787,5.18588,5.343722,4.692984,6.303666
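
To make the column names above more concrete, here is a minimal sketch of how a few of the simpler features (`sum_values`, `abs_energy`, `mean_abs_change`, `mean_change`, `median`) can be computed by hand. It uses only the Python standard library and a made-up toy series. The functions are our own illustrative re-implementations, not tsfresh's code, so small numerical differences from tsfresh are possible.

```python
from statistics import median


def sum_values(x):
    """Sum of all observations."""
    return sum(x)


def abs_energy(x):
    """Sum of squared observations."""
    return sum(v * v for v in x)


def mean_abs_change(x):
    """Mean absolute difference between consecutive observations."""
    diffs = [abs(b - a) for a, b in zip(x, x[1:])]
    return sum(diffs) / len(diffs)


def mean_change(x):
    """Mean difference between consecutive observations.

    The consecutive differences telescope, so only the first and
    last values matter.
    """
    return (x[-1] - x[0]) / (len(x) - 1)


# toy series (made up for illustration, not taken from ArrowHead)
series = [1.0, 3.0, 2.0, 5.0]

features = {
    "sum_values": sum_values(series),
    "abs_energy": abs_energy(series),
    "mean_abs_change": mean_abs_change(series),
    "mean_change": mean_change(series),
    "median": median(series),
}
print(features)
```

Each training series is reduced to one row of such numbers, which is why `Xt` has one row per series and one column per feature.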


## Using tsfresh with sktime

In [8]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier(),
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

Feature Extraction: 100%|██████████| 5/5 [00:15<00:00,  3.10s/it]
Feature Extraction: 100%|██████████| 5/5 [00:05<00:00,  1.03s/it]


0.8679245283018868
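
The pipeline runs feature extraction twice: once on the training data inside `fit`, and once on the test data inside `score`. This would explain the two progress bars above. The sketch below mimics that control flow in plain Python. `MeanFeature`, `NearestCentroidClassifier` and `TinyPipeline` are hypothetical stand-ins written for this illustration; they are not sktime or scikit-learn classes, and the data is made up.

```python
class MeanFeature:
    """Toy transformer: maps each series to a single feature, its mean."""

    def __init__(self):
        self.n_transform_calls = 0

    def fit(self, X, y=None):
        # nothing to learn for this toy feature
        return self

    def transform(self, X):
        self.n_transform_calls += 1
        return [[sum(series) / len(series)] for series in X]


class NearestCentroidClassifier:
    """Toy classifier on one feature: predict the class with the closest mean."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + row[0]
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda c: abs(row[0] - self.centroids_[c]))
            for row in X
        ]


class TinyPipeline:
    """Chains one transformer and one estimator, like a two-step pipeline."""

    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, X, y):
        # first feature extraction: on the training data
        Xt = self.transformer.fit(X, y).transform(X)
        self.estimator.fit(Xt, y)
        return self

    def score(self, X, y):
        # second feature extraction: on the data being scored
        Xt = self.transformer.transform(X)
        predictions = self.estimator.predict(Xt)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# made-up toy data: two classes with clearly different levels
X_toy_train = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.2], [5.0, 5.1, 4.9], [5.2, 4.8, 5.0]]
y_toy_train = ["0", "0", "1", "1"]
X_toy_test = [[0.2, 0.1, 0.1], [4.9, 5.0, 5.1]]
y_toy_test = ["0", "1"]

pipeline = TinyPipeline(MeanFeature(), NearestCentroidClassifier())
pipeline.fit(X_toy_train, y_toy_train)
accuracy = pipeline.score(X_toy_test, y_toy_test)
print(accuracy, pipeline.transformer.n_transform_calls)
```

Wrapping the extractor and the classifier in one pipeline means the test data only ever passes through an extractor that was fitted on the training data.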

## Multivariate time series classification data

In [9]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [10]:
#  multivariate input data
X_train.head()

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
10,0 0.206148 1 0.206148 2 6.53436...,0 -0.658294 1 -0.658294 2 4.597327 3...,0 0.469612 1 0.469612 2 -2.723661 3...,0 -0.106535 1 -0.106535 2 -0.439456 3...,0 0.306288 1 0.306288 2 1.717875 3...,0 0.950824 1 0.950824 2 -1.041379 3...
25,0 -0.185181 1 -0.185181 2 -1.319727 3...,0 0.059288 1 0.059288 2 -1.194247 3...,0 0.250270 1 0.250270 2 0.418052 3...,0 0.154476 1 0.154476 2 0.047941 3...,0 0.167792 1 0.167792 2 -0.215733 3...,0 0.732428 1 0.732428 2 -0.050604 3...
22,0 -0.697643 1 -0.697643 2 -0.199924 3...,0 -0.561693 1 -0.561693 2 -0.820724 3...,0 -0.950458 1 -0.950458 2 1.146612 3...,0 -1.158567 1 -1.158567 2 -0.479407 3...,0 0.727101 1 0.727101 2 -0.410159 3...,0 -1.376964 1 -1.376964 2 0.130505 3...
31,0 0.036607 1 0.036607 2 0.265778 3...,0 0.341686 1 0.341686 2 -0.164943 3...,0 -0.694948 1 -0.694948 2 -0.635560 3...,0 -0.253020 1 -0.253020 2 -0.354229 3...,0 -0.082565 1 -0.082565 2 -0.516694 3...,0 -0.090555 1 -0.090555 2 1.470182 3...
0,0 0.079106 1 0.079106 2 -0.903497 3...,0 0.394032 1 0.394032 2 -3.666397 3...,0 0.551444 1 0.551444 2 -0.282844 3...,0 0.351565 1 0.351565 2 -0.095881 3...,0 0.023970 1 0.023970 2 -0.319605 3...,0 0.633883 1 0.633883 2 0.972131 3...


In [11]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

Feature Extraction: 100%|██████████| 5/5 [00:26<00:00,  5.29s/it]


Unnamed: 0,dim_0__variance_larger_than_standard_deviation,dim_0__has_duplicate_max,dim_0__has_duplicate_min,dim_0__has_duplicate,dim_0__sum_values,dim_0__abs_energy,dim_0__mean_abs_change,dim_0__mean_change,dim_0__mean_second_derivative_central,dim_0__median,...,dim_5__permutation_entropy__dimension_5__tau_1,dim_5__permutation_entropy__dimension_6__tau_1,dim_5__permutation_entropy__dimension_7__tau_1,dim_5__query_similarity_count__query_None__threshold_0.0,"dim_5__matrix_profile__feature_""min""__threshold_0.98","dim_5__matrix_profile__feature_""max""__threshold_0.98","dim_5__matrix_profile__feature_""mean""__threshold_0.98","dim_5__matrix_profile__feature_""median""__threshold_0.98","dim_5__matrix_profile__feature_""25""__threshold_0.98","dim_5__matrix_profile__feature_""75""__threshold_0.98"
0,1.0,0.0,0.0,1.0,395.985445,11192.65897,6.5837,0.099344,0.0,8.60897,...,3.01413,3.525453,3.919983,0.0,0.785774,1.535116,1.103338,1.090985,0.976776,1.180772
1,1.0,0.0,0.0,1.0,54.45523,182.497205,0.870803,0.011501,0.006315,0.515937,...,2.501944,3.018153,3.441942,0.0,0.969245,2.097508,1.282048,1.256701,1.09597,1.437341
2,1.0,1.0,0.0,1.0,52.882361,185.780037,0.89986,0.01135,-0.00048,0.280859,...,2.598501,3.17495,3.608497,0.0,0.854373,3.400177,1.402914,1.380205,1.079456,1.514417
3,1.0,0.0,0.0,1.0,409.281059,5923.622075,3.217789,0.046896,0.002163,2.145581,...,3.725071,4.239265,4.434494,0.0,0.731863,2.992405,1.661758,1.592637,1.337768,2.071533
4,0.0,0.0,0.0,1.0,-8.618429,10.629914,0.16445,-0.002871,-6.1e-05,-0.164268,...,3.222908,3.878028,4.281449,0.0,0.875058,3.187555,1.615318,1.556645,1.18298,1.809264


## Classification of extracted features with SVM

As `Xt.shape` shows below, after feature extraction the data has dimensions 60 x 4686. Here 60 is the number of samples and 4686 = 6 * 781, where 6 is the number of channels in the motion measurements and 781 is the number of "efficient" features that `TSFreshFeatureExtractor` extracts per channel.

### Result
On this train/test split, the SVM classifier copes well with the large number of features and reaches 100% accuracy.

In [12]:
Xt.shape

(60, 4686)
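
The column count can be checked from the column names. As the table above shows, each name has the form `<channel>__<feature>__<parameter>_<value>...`, so the text before the first double underscore identifies the channel. Below is a minimal sketch that uses a handful of names copied from the table instead of the full `Xt.columns`. `split_feature_name` is a hypothetical helper written for this illustration, and it assumes that no parameter value itself contains a double underscore.

```python
from collections import Counter

# a few column names copied from the feature table above
columns = [
    "dim_0__sum_values",
    "dim_0__abs_energy",
    "dim_0__permutation_entropy__dimension_5__tau_1",
    "dim_5__sum_values",
    "dim_5__permutation_entropy__dimension_5__tau_1",
]


def split_feature_name(name):
    """Split a column name into (channel, feature, list of parameter parts)."""
    channel, feature, *params = name.split("__")
    return channel, feature, params


# count how many features were extracted for each channel
per_channel = Counter(split_feature_name(name)[0] for name in columns)
print(per_channel)

# the arithmetic from the text: 6 channels times 781 features per channel
n_channels, n_features_per_channel = 6, 781
print(n_channels * n_features_per_channel)
```

Applied to the real `Xt.columns`, the same counting should give one entry per channel, each with the same number of features.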

In [13]:
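from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scale the extracted features, then fit an RBF-kernel SVM on them
clf = make_pipeline(StandardScaler(), SVC(gamma="auto"))
clf.fit(Xt, y_train)

# reuse the extractor fitted on the training data; do not refit it on the test set
Xtest = t.transform(X_test)
clf.score(Xtest, y_test)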

Feature Extraction: 100%|██████████| 5/5 [00:08<00:00,  1.80s/it]


1.0
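
The `StandardScaler` step matters here because the extracted features live on very different scales (compare `abs_energy` with `mean_change` in the table above). The scaler learns each column's mean and standard deviation from the training features only, then applies those same statistics to the test features. Below is a minimal standard-library sketch of that idea. `ColumnStandardizer` is a hypothetical stand-in, not the scikit-learn class, and the numbers are made up.

```python
from statistics import fmean, pstdev


class ColumnStandardizer:
    """Toy column-wise standardizer: subtract the mean, divide by the std."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means_ = [fmean(col) for col in columns]
        # constant columns have zero spread; use 1.0 to avoid dividing by zero
        self.scales_ = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means_, self.scales_)]
            for row in rows
        ]


# made-up feature rows; the second column is constant, like the
# query_similarity_count column in the table above
train = [[1.0, 0.0], [3.0, 0.0]]
test = [[2.0, 0.0], [5.0, 0.0]]

scaler = ColumnStandardizer().fit(train)  # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # test rows reuse the train statistics
print(train_scaled, test_scaled)
```

Because the test rows are scaled with the training statistics, they do not have to end up with zero mean themselves.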

## Using tsfresh for forecasting
You can also use tsfresh to do univariate forecasting. To find out more about forecasting, check out our forecasting tutorial notebook.

In [14]:
from sklearn.ensemble import RandomForestRegressor

from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import make_reduction
from sktime.forecasting.model_selection import temporal_train_test_split

y = load_airline()
y_train, y_test = temporal_train_test_split(y)

regressor = make_pipeline(
    TSFreshFeatureExtractor(show_warnings=False, disable_progressbar=True),
    RandomForestRegressor(),
)
forecaster = make_reduction(
    regressor, scitype="time-series-regressor", window_length=12
)
forecaster.fit(y_train)

fh = ForecastingHorizon(y_test.index, is_relative=False)
y_pred = forecaster.predict(fh)
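
The reduction above turns forecasting into a regression problem. The training series is cut into sliding windows of `window_length` values, each window becomes one input, and the value that follows the window becomes its target. The sketch below shows that idea in plain Python, together with a recursive multi-step forecast, where each prediction is appended to the history and fed back in. We assume the recursive strategy here; check the `make_reduction` documentation for the strategy your sktime version uses. `make_windows`, `recursive_forecast` and `last_plus_mean_step` are hypothetical helpers, and the last one stands in for the fitted tsfresh and random forest pipeline.

```python
def make_windows(series, window_length):
    """Cut a series into (window, next value) training pairs."""
    X, y = [], []
    for start in range(len(series) - window_length):
        X.append(series[start : start + window_length])
        y.append(series[start + window_length])
    return X, y


def recursive_forecast(series, window_length, predict_one, steps):
    """Forecast several steps ahead by feeding predictions back in."""
    history = list(series)
    forecasts = []
    for _ in range(steps):
        window = history[-window_length:]
        next_value = predict_one(window)
        forecasts.append(next_value)
        history.append(next_value)
    return forecasts


def last_plus_mean_step(window):
    """Toy stand-in for a fitted regressor: extend the average trend."""
    steps = [b - a for a, b in zip(window, window[1:])]
    return window[-1] + sum(steps) / len(steps)


# made-up toy series with a constant upward trend
series = [1, 2, 3, 4, 5, 6]

X, y = make_windows(series, window_length=3)
print(X, y)

forecasts = recursive_forecast(series, 3, last_plus_mean_step, steps=2)
print(forecasts)
```

In the notebook cell above, each window of 12 values is treated as a short time series, so tsfresh extracts features from every window before the random forest sees it.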