## TSFRESH QUICK START
https://tsfresh.readthedocs.io/en/latest/text/quick_start.html  
https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html  
https://github.com/blue-yonder/tsfresh/tree/main/notebooks/examples  
  
https://github.com/blue-yonder/tsfresh/blob/main/notebooks/examples/01%20Feature%20Extraction%20and%20Selection.ipynb  


In [5]:
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures

download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()

print(timeseries.head())

timeseries.describe()


   id  time  F_x  F_y  F_z  T_x  T_y  T_z
0   1     0   -1   -1   63   -3   -1    0
1   1     1    0    0   62   -3   -1    0
2   1     2   -1   -1   61   -3    0    0
3   1     3   -1   -1   63   -2   -1    0
4   1     4   -1   -1   63   -3   -1    0


                id         time          F_x          F_y          F_z          T_x          T_y          T_z
count       1320.0       1320.0       1320.0       1320.0       1320.0       1320.0       1320.0       1320.0
mean          44.5          7.0    -2.345455     8.913636  -128.214394    -39.02803    -4.517424     0.868182
std      25.411399     4.322131     50.36796    45.845475   346.816091   147.269399   101.609308     18.31725
min            1.0          0.0       -260.0       -353.0      -1547.0       -672.0       -646.0       -137.0
25%          22.75          3.0         -4.0         -2.0      -117.25       -39.25        -13.0         -1.0
50%           44.5          7.0         -1.0          1.0         46.0         -9.0         -3.0          0.0
75%          66.25         11.0          3.0         11.0         60.0         -1.0          3.0          2.0
max           88.0         14.0        342.0        236.0        157.0        686.0        601.0        123.0


In [2]:
print(y)  # y records, for each robot id, whether a failure was reported

1      True
2      True
3      True
4      True
5      True
      ...  
84    False
85    False
86    False
87    False
88    False
Length: 88, dtype: bool


The first column is the DataFrame index and has no meaning here. There are six time series **(F<sub>x</sub>, F<sub>y</sub>, F<sub>z</sub>, T<sub>x</sub>, T<sub>y</sub>, T<sub>z</sub>)**, one per sensor. The `id` column identifies the robot, and the `time` column orders the readings within each robot.  

`y`, on the other hand, records for each robot id whether a failure was reported.  
The following cell plots the time series for id 3, which reported no failure, and for id 20, which reported a failure:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
timeseries[timeseries['id'] == 3].plot(subplots=True, sharex=True, figsize=(10,20))
plt.show()

timeseries[timeseries['id'] == 20].plot(subplots=True, sharex=True, figsize=(10,20))
plt.show()
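
The layout described above (an `id` column, a `time` column, and one column per sensor) is easy to reproduce for your own data. The following is a minimal sketch, assuming pandas is installed; the dictionary `raw`, its values, and the names `flat` and `labels` are made up for illustration and are not part of the robot dataset.

```python
import pandas as pd

# Hypothetical raw readings: two robots (ids 1 and 2), three time steps each,
# two sensors. All values are invented for illustration.
raw = {
    1: {"F_x": [-1, 0, -1], "F_y": [-1, 0, -1]},
    2: {"F_x": [3, 5, 4], "F_y": [0, 2, 1]},
}

# Build one row per (id, time) pair, with one column per sensor.
rows = []
for robot_id, sensors in raw.items():
    n_steps = len(next(iter(sensors.values())))
    for t in range(n_steps):
        row = {"id": robot_id, "time": t}
        for name, values in sensors.items():
            row[name] = values[t]
        rows.append(row)

flat = pd.DataFrame(rows, columns=["id", "time", "F_x", "F_y"])

# One label per id, indexed by id, mirroring the shape of `y` above.
labels = pd.Series({1: True, 2: False})

print(flat)
print(labels)
```

A frame shaped like `flat` can then be passed to `extract_features` with `column_id="id"` and `column_sort="time"`, exactly as in the next cell.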

In [3]:
# To extract features, we do:

from tsfresh import extract_features
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")

Feature Extraction: 100%|██████████| 38/38 [00:03<00:00,  8.33it/s]


You end up with a DataFrame `extracted_features` containing more than 1200 extracted features.  
Next, we replace all NaN values (produced by feature calculators that cannot be applied  
to the given data, e.g. because the series has too few data points) and then select  
only the relevant features:  

In [4]:
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

impute(extracted_features)
features_filtered = select_features(extracted_features, y)

print(features_filtered)

 'F_x__partial_autocorrelation__lag_8'
 'F_x__partial_autocorrelation__lag_9' ...
 'T_z__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"min"'
 'T_z__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
 'T_z__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'] did not have any finite values. Filling with zeros.


    F_x__value_count__value_-1  F_x__abs_energy  \
1                         14.0             14.0   
2                          7.0             25.0   
3                         11.0             12.0   
4                          5.0             16.0   
5                          9.0             17.0   
..                         ...              ...   
84                         0.0          96833.0   
85                         0.0           1683.0   
86                         0.0          83497.0   
87                         0.0        1405437.0   
88                         0.0           1427.0   

    F_x__range_count__max_1__min_-1  F_y__abs_energy  T_y__variance  \
1                              15.0             13.0       0.222222   
2                              13.0             76.0       4.222222   
3                              14.0             40.0       3.128889   
4                              10.0             60.0       7.128889   
5                              1
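
The column names in the output above encode the series, the feature calculator, and the calculator's parameters, separated by double underscores. As a rough illustration, the sketch below recomputes three of the printed features for robot id 1 with plain Python. The helper functions are simplified reimplementations written for this example, not tsfresh's own code, and the last ten `F_x` readings are an assumption chosen to be consistent with the printed feature values (only the first five appear in `timeseries.head()`).

```python
# F_x readings for robot id 1. The first five values come from the
# timeseries.head() output; the remaining ten are assumed.
f_x = [-1, 0, -1, -1, -1] + [-1] * 10


def abs_energy(values):
    """Sum of squared values."""
    return sum(v * v for v in values)


def value_count(values, value):
    """Number of entries equal to `value`."""
    return sum(1 for v in values if v == value)


def range_count(values, min_value, max_value):
    """Number of entries in the half-open interval [min_value, max_value)."""
    return sum(1 for v in values if min_value <= v < max_value)


def split_feature_name(column):
    """Split a column name into (series, calculator, list of parameters)."""
    series, calculator, *params = column.split("__")
    return series, calculator, params


print(split_feature_name("F_x__range_count__max_1__min_-1"))
print(abs_energy(f_x), value_count(f_x, -1), range_count(f_x, -1, 1))
```

The three numbers correspond to the id 1 entries of `F_x__abs_energy`, `F_x__value_count__value_-1`, and `F_x__range_count__max_1__min_-1` in the table.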

In [None]:
# Only around 300 features were classified as relevant.
# You can also perform the extraction, imputation, and
# filtering in a single step with the
# tsfresh.extract_relevant_features() function:

from tsfresh import extract_relevant_features
features_filtered_direct = extract_relevant_features(timeseries, y, column_id='id', column_sort='time')

Let's try visualizing some of this data.  
https://github.com/blue-yonder/tsfresh/blob/main/notebooks/examples/01%20Feature%20Extraction%20and%20Selection.ipynb  

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
from tsfresh.examples import robot_execution_failures

robot_execution_failures.download_robot_execution_failures()
df, y = robot_execution_failures.load_robot_execution_failures()
df.head()

In [None]:
# Plot the raw sensor time series for one success and one failure example.

df[df.id == 3][['time', 'F_x', 'F_y', 'F_z', 'T_x', 'T_y', 'T_z']].plot(x='time', title='Success example (id 3)', figsize=(12, 6));
df[df.id == 20][['time', 'F_x', 'F_y', 'F_z', 'T_x', 'T_y', 'T_z']].plot(x='time', title='Failure example (id 20)', figsize=(12, 6));

### Feature Extraction

In [None]:
# We are explicit here and pass `default_fc_parameters`. If you omit this argument,
# ComprehensiveFCParameters (= all feature calculators) is used by default anyway.
# Have a look into the documentation (https://tsfresh.readthedocs.io/en/latest/text/feature_extraction_settings.html)
# or one of the other notebooks to learn more about this.
extraction_settings = ComprehensiveFCParameters()

X = extract_features(df, column_id='id', column_sort='time',
                     default_fc_parameters=extraction_settings,
                     # impute replaces NaN and infinite feature values automatically
                     impute_function=impute)
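
To make the effect of `impute` concrete, here is a rough standard-library sketch of the replacement rule as I understand it from the tsfresh documentation: NaN becomes the column median, `+inf` the column maximum, `-inf` the column minimum, and a column without any finite value is filled with zeros (compare the "Filling with zeros" warning earlier). The function `impute_column` is a hypothetical helper for a single column, not tsfresh's implementation.

```python
import math
import statistics


def impute_column(values):
    """Return a copy of `values` with non-finite entries replaced.

    NaN -> median, +inf -> max, -inf -> min of the finite entries.
    A column with no finite entries becomes all zeros.
    """
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return [0.0] * len(values)

    col_median = statistics.median(finite)
    col_max, col_min = max(finite), min(finite)

    result = []
    for v in values:
        if math.isnan(v):
            result.append(col_median)
        elif v == math.inf:
            result.append(col_max)
        elif v == -math.inf:
            result.append(col_min)
        else:
            result.append(v)
    return result


print(impute_column([1.0, float("nan"), 3.0, float("inf")]))  # [1.0, 2.0, 3.0, 3.0]
print(impute_column([float("nan"), float("nan")]))            # [0.0, 0.0]
```

Because the replacement values are derived from the column itself, imputing before a train/test split lets test rows influence training rows; for a quick start this is usually acceptable.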

In [None]:
X.head()

### Feature Selection

In [None]:
X_filtered = select_features(X, y)
X_filtered.head()
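
`select_features` tests each feature against `y` and then corrects for the large number of tests performed. To my knowledge the tsfresh documentation names the Benjamini-Yekutieli procedure for that correction, with a default false discovery rate of 0.05; treat both as assumptions to check against your installed version. The sketch below shows only the correction step on made-up p-values. The function name and signature are my own, not tsfresh's API.

```python
def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return one bool per p-value: True if that hypothesis is rejected,
    i.e. the corresponding feature would be kept as relevant."""
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic correction factor used by the Benjamini-Yekutieli variant.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= k * fdr_level / (m * c_m).
    last_rejected = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr_level / (m * c_m):
            last_rejected = rank

    # Reject every hypothesis up to and including that rank.
    rejected = [False] * m
    for idx in order[:last_rejected]:
        rejected[idx] = True
    return rejected


# Made-up p-values for five hypothetical features.
print(benjamini_yekutieli([0.6, 0.001, 0.041, 0.008, 0.039]))
```

With these values only the features with p-values 0.001 and 0.008 survive, even though 0.039 and 0.041 are below 0.05 on their own.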

### Train and Evaluate Classifier

In [None]:
# Split into train and test sets (the split is random, so results vary between runs)
X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.4)
X_filtered_train, X_filtered_test = X_full_train[X_filtered.columns], X_full_test[X_filtered.columns]

# Train and evaluate a classifier on all features
classifier_full = DecisionTreeClassifier()
classifier_full.fit(X_full_train, y_train)
print(classification_report(y_test, classifier_full.predict(X_full_test)))

In [None]:
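# Using only the relevant features (classifier_filtered) typically matches or beats the
# classifier trained on all features (classifier_full) while using far fewer columns.
# Exact scores vary with the random split. Note that select_features above saw the
# labels of all ids, including the test ids, so this comparison is somewhat optimistic.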

classifier_filtered = DecisionTreeClassifier()
classifier_filtered.fit(X_filtered_train, y_train)
print(classification_report(y_test, classifier_filtered.predict(X_filtered_test)))

In [None]:
# Above, we performed the feature extraction and selection independently. 
# If you are only interested in the list of selected features, you can run this in one step:
X_filtered_2 = extract_relevant_features(df, y, column_id='id', column_sort='time', default_fc_parameters=extraction_settings)

(X_filtered.columns == X_filtered_2.columns).all()
