# Feature Engineering on Time Series

Extracting relevant features from time series data is an advanced feature engineering task with many real-world applications. Characterizing time series - or segments of time series - by tabular attributes allows us to use them as input to classical machine learning methods.

However, deriving features and selecting the relevant ones is not trivial. In the following, we look at examples and demonstrate tools that can simplify and improve the feature engineering process.

## Preamble

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import seaborn
import matplotlib.pyplot as plt
import pandas
import numpy

In [None]:
import data_science_learning_paths
data_science_learning_paths.setup_plot_style(dark=True)

## Example Dataset: Kepler Exoplanet Search

For the following examples we use data from [NASA's Kepler telescope](https://www.nasa.gov/mission_pages/kepler/). Kepler detects exoplanets by the **transit method**: a small, temporary dip in a star's brightness reveals a planet passing in front of it. The shape of the light intensity (flux) curve over time can therefore reveal the presence of a planet.
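To build some intuition for what a transit looks like in numbers, here is a minimal sketch on synthetic data. The period, duration, depth and noise level are made-up values chosen for illustration; they are not taken from the Kepler data.

```python
import numpy

rng = numpy.random.default_rng(0)

# Synthetic, normalized light curve: constant brightness plus measurement noise.
n_points = 3000
flux = 1.0 + rng.normal(scale=0.0005, size=n_points)

# Hypothetical planet: every 500 samples the star dims by 1% for 20 samples.
period, duration, depth = 500, 20, 0.01
in_transit = (numpy.arange(n_points) % period) < duration
flux_with_planet = flux - depth * in_transit

# The dips barely move the mean, but they clearly lower the minimum.
print("mean without / with planet:", flux.mean(), flux_with_planet.mean())
print("min  without / with planet:", flux.min(), flux_with_planet.min())
```

In this toy setting a simple statistic such as the minimum already separates the two curves. Real light curves are noisier and less regular, which is why richer features are needed.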

In [None]:
from IPython.display import HTML

In [None]:
HTML(
    """
    <iframe width="560" height="315" src="https://www.youtube.com/embed/S_HRh0ZynjE" 
    frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    """,
)

For each star, we receive about 3000 light intensity data points ordered in time, as well as a label: Was a planet orbiting the star confirmed? If we can manage to derive the right features from the light curve, this could make it possible to train a reliable classifier that automates detection:

In [None]:
data = pandas.read_parquet("../.assets/data/kepler/kepler_flux.parquet")

In [None]:
data

We have data from over 5000 stars, but there is a strong class imbalance:

In [None]:
data.shape

In [None]:
data["LABEL"].value_counts()

In [None]:
data.head()

### Example Time Series

Let us plot a randomly sampled example from each class. Positives (`LABEL == 2`)...

In [None]:
data[data["LABEL"] == 2]\
    .drop(["LABEL"], axis="columns")\
    .sample(n=1)\
    .transpose()\
    .plot()

... and negatives (`LABEL == 1`):

In [None]:
data[data["LABEL"] == 1]\
    .drop(["LABEL"], axis="columns")\
    .sample(n=1)\
    .transpose()\
    .plot()

## Manual Feature Extraction

### Exercise: Space Exploration

Apply data visualization and exploration to these time series. Can you identify features that point to the presence of a planet? 

In [None]:
# Your code here...

## Automated Feature Extraction and Supervised Selection with `tsfresh`

If you have struggled so far with extracting relevant features, not all is lost. This is work that you may be able to automate: [**tsfresh**](https://github.com/blue-yonder/tsfresh), short for "Time Series Feature extraction based on Scalable Hypothesis tests", is a Python library that promises to do just this:

> TSFRESH automatically extracts 100s of features from time series. Those features describe basic characteristics of the time series such as the number of peaks, the average or maximal value or more complex features such as the time reversal symmetry statistic.

The vast number of automatically generated features can also be tested against the target variable to select only those features that are robustly correlated with the target.
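Before any selection takes place, each feature is simply a function that maps one series to one number. The following hand-rolled sketch computes a few features of the kinds named in the quote. The helper functions are our own simplified versions written for illustration, not tsfresh's implementations, and the time reversal formula is an assumption based on the commonly published definition.

```python
import numpy


def number_of_peaks(x):
    """Count strict local maxima (points higher than both direct neighbours)."""
    x = numpy.asarray(x, dtype=float)
    return int(numpy.sum((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])))


def time_reversal_asymmetry(x, lag=1):
    """Mean of x[t+2*lag]**2 * x[t+lag] - x[t+lag] * x[t]**2 over all valid t."""
    x = numpy.asarray(x, dtype=float)
    first, middle, last = x[: -2 * lag], x[lag:-lag], x[2 * lag :]
    return float(numpy.mean(last**2 * middle - middle * first**2))


# Sawtooth: slow rise, sudden drop. It looks different when played backwards.
sawtooth = numpy.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0, 0.0])

toy_features = {
    "mean": float(sawtooth.mean()),
    "maximum": float(sawtooth.max()),
    "number_of_peaks": number_of_peaks(sawtooth),
    "time_reversal_asymmetry": time_reversal_asymmetry(sawtooth),
}
print(toy_features)
```

Reversing the sawtooth (`sawtooth[::-1]`) leaves the mean, maximum and number of peaks unchanged but flips the sign of the time reversal statistic, which is what makes it sensitive to the direction of time.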

![](graphics/tsfresh.png)

Since this is a rather brute-force approach, it is quite compute intensive. Fortunately, it is also embarrassingly parallel and can be accelerated by adding more cores.

In the following, we demonstrate step by step how `tsfresh` can be applied to the example data.

## Preprocessing

Since the `tsfresh` feature extraction algorithm is compute intensive, we need to work with a small sample of time series here:

In [None]:
n = 42
data = pandas.concat(
    [
        data[data["LABEL"] == 2].sample(n=n),
        data[data["LABEL"] == 1].sample(n=2*n)    
    ]
)

We convert the label to booleans:

In [None]:
label = data["LABEL"]
label = label - 1 # to 0/1
label = label.astype("bool")

In [None]:
label.value_counts()

`tsfresh` expects the data set to be in a specific format: A long-form data frame with
- the values of _all_ time series in one column
- the identifier of the time series in another column, annotating every data point
- a third column denoting time 
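The three columns listed above can be illustrated on a tiny toy table. This is a sketch under assumptions: the column names `FLUX.1` to `FLUX.3` and the ids 10 and 20 are invented for the example and need not match the Kepler file.

```python
import pandas

# Toy wide table: one row per star, one column per time step.
wide = pandas.DataFrame(
    {"FLUX.1": [1.0, 5.0], "FLUX.2": [2.0, 6.0], "FLUX.3": [3.0, 7.0]},
    index=pandas.Index([10, 20], name="id"),
)

# Wide -> long: one row per (star, time step) pair.
long_form = wide.reset_index().melt(id_vars="id", var_name="time", value_name="flux")

# Turn the column name "FLUX.3" into the integer time step 3.
long_form["time"] = long_form["time"].str.replace("FLUX.", "", regex=False).astype(int)
long_form = long_form.sort_values(["id", "time"]).reset_index(drop=True)

print(long_form)
```

Here the time column restarts at 1 for every series. For sorting purposes any column that increases within each series would do as well.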

In [None]:
data = data.sample(n=5)  # remove sampling for full feature extraction
y = label.loc[data.index]  # boolean target, aligned with the sampled rows

In [None]:
ts = data.drop(["LABEL"], axis="columns")\
    .transpose()\
    .melt(var_name="id", value_name="flux")

In [None]:
ts["time"] = ts.index  # running row counter; it increases within each series, which is all the sort column needs

In [None]:
ts.head()

In [None]:
ts.dtypes

## Applying `tsfresh` Feature Extraction

In [None]:
import tsfresh

This function applies brute-force feature generation without selection:

In [None]:
%%time 
features = tsfresh.extract_features(
    ts, 
    column_id="id", 
    column_sort="time"
)

In [None]:
features.head()

In [None]:
features.columns
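Depending on the data, some generated columns may be undefined (NaN) for some series or constant across all series, and many classifiers cannot handle or benefit from such columns. A minimal clean-up sketch with plain pandas follows. The toy table and its feature names are invented for illustration, and dropping columns is only one possible strategy (imputing missing values is another).

```python
import numpy
import pandas

# Toy stand-in for an extracted feature matrix: one row per series id.
toy_matrix = pandas.DataFrame(
    {
        "flux__mean": [1.0, 0.9, 1.1],
        "flux__some_undefined_feature": [numpy.nan, 0.2, 0.3],
        "flux__some_constant_feature": [0.0, 0.0, 0.0],
    },
    index=pandas.Index([10, 20, 30], name="id"),
)

has_missing = toy_matrix.isna().any()                  # any NaN in the column?
is_constant = toy_matrix.nunique(dropna=False) <= 1    # same value for every series?
cleaned = toy_matrix.loc[:, ~(has_missing | is_constant)]

print(cleaned.columns.tolist())
```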

### Exercise: Automated Feature Selection

Apply the following function to perform automated feature selection. Inspect the features and compare the two feature sets.

In [None]:
tsfresh.extract_relevant_features?
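As background for this exercise, the idea of selecting features through hypothesis tests can be sketched without `tsfresh`: test every feature against the target, then control the false discovery rate across all tests. The sketch below uses a permutation test and the Benjamini-Hochberg rule on synthetic data. Both are our own choices for illustration; `tsfresh` may use different tests and a different correction, so treat this as a picture of the principle, not of the library's internals.

```python
import numpy

rng = numpy.random.default_rng(0)
target = numpy.array([False] * 30 + [True] * 30)

# Toy features: four pure-noise columns and one column shifted for the positive class.
names = ["noise_0", "noise_1", "noise_2", "noise_3", "informative"]
table = rng.normal(size=(60, 5))
table[target, 4] += 3.0


def permutation_p_value(values, labels, n_permutations=2000):
    """Two-sided permutation test on the difference of class means."""
    observed = abs(values[labels].mean() - values[~labels].mean())
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(labels)
        hits += abs(values[shuffled].mean() - values[~shuffled].mean()) >= observed
    return (hits + 1) / (n_permutations + 1)


p_values = numpy.array([permutation_p_value(table[:, j], target) for j in range(5)])

# Benjamini-Hochberg step-up rule: keep the k smallest p-values, where k is the
# largest rank whose p-value is at most rank / m * alpha.
alpha = 0.05
order = numpy.argsort(p_values)
m = len(p_values)
passed = p_values[order] <= alpha * numpy.arange(1, m + 1) / m
k = passed.nonzero()[0].max() + 1 if passed.any() else 0
selected = [names[j] for j in order[:k]]

print(selected)
```

The correction step matters because hundreds of features are tested at once: without it, some pure-noise features would pass a 5% threshold by chance alone.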

### Exercise: Model Trained on Generated Features

Train a classifier on the generated features and properly evaluate its performance. Does the classifier improve with feature selection?

In [None]:
# Your code here

Some useful building blocks and tools:

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
from sklearn.model_selection import cross_val_score, ShuffleSplit

In [None]:
from sklearn.metrics import f1_score, make_scorer

In [None]:
y.astype("int").value_counts()

In [None]:
cv_gen = ShuffleSplit(n_splits=10, test_size=0.2)

In [None]:
f1_scorer = make_scorer(f1_score, greater_is_better=True)

In [None]:
cross_val_score(
    estimator=RandomForestClassifier(),
    X=features,
    y=y.loc[features.index],  # align labels with the feature rows by id
    scoring=f1_scorer,
    cv=cv_gen
)

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
cross_val_score(
    estimator=DummyClassifier(strategy="stratified"),
    X=features,
    y=y.loc[features.index],  # align labels with the feature rows by id
    scoring=f1_scorer,
    cv=cv_gen
)
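The F1 score and the dummy baseline are used here because plain accuracy is misleading under strong class imbalance. A minimal sketch with made-up class counts (990 negatives, 10 positives, not the counts of the Kepler data) shows the effect for a trivial classifier that never predicts a planet:

```python
import numpy

# Made-up class counts for illustration only.
y_true = numpy.array([False] * 990 + [True] * 10)
y_pred = numpy.zeros_like(y_true)  # trivial classifier: always predicts "no planet"

accuracy = float((y_true == y_pred).mean())

true_positives = int((y_true & y_pred).sum())
precision = true_positives / max(int(y_pred.sum()), 1)  # avoid division by zero
recall = true_positives / int(y_true.sum())
f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy, "F1:", f1)
```

The trivial classifier reaches 99% accuracy while its F1 score is zero, so a useful model has to be judged against a baseline on a metric that focuses on the positive class.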

### Summary `tsfresh`

**pro**

- extracts a large number of generic features from time series
- given labels on the time series, selects relevant features through statistical tests
- easy to apply

**con**

- very compute intensive (but parallelized)
- fresh library, expect a few stability issues (and report them to the developers)

## References


- [TSFRESH Paper: Distributed and parallel time series feature extraction for industrial big data applications](https://arxiv.org/pdf/1610.07717.pdf)

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_