# Machine Learning – Additional materials: Timeseries analysis




In [None]:
!pip3 -q install tsfresh
!pip3 -q install pandas

## Time-series analysis using the `tsfresh` library

The `tsfresh` library provides many (though, of course, not all) features that can be extracted from timeseries (the full list of its feature calculators can be found [here](https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#module-tsfresh.feature_extraction.feature_calculators)). We can let `tsfresh` automatically generate and select a set of features for us, or we can call these feature calculators directly. We'll look at both cases below.

### Sample data

To illustrate the use of the `tsfresh` library, we'll use some sample data from a wearable sensor that records a collection of movement parameters, such as accelerometer and gyroscope readings.

In [None]:
import tsfresh as ts
import pandas as pd

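# Keep a 1000-row excerpt; .copy() avoids pandas' SettingWithCopyWarning when we add columns later
df = pd.read_csv('aljaz_short.csv')[1900:2900].copy()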

df

Since `tsfresh` supports only numeric values for feature calculation, we remove any string variables first. For demonstration purposes, we'll pick an arbitrary value to predict, e.g., `quat0`.

In [None]:
# Clean out some superfluous columns
del df["placement"], df["scenario"], df["Timestamp"], df["timestamp_2"], df["quat_check"], df["SerialNo"], df["label"]

targets = df["quat0"]
del df["quat0"]

### Automatic feature generation

`tsfresh` can automatically generate every feature it knows about through `extract_features`. The function requires an ID column, which indicates the timeseries each measurement belongs to (a dataframe can hold more than one timeseries). For now, we'll treat all of the data as a single timeseries, so we'll assign the same ID to all measurements.

In [None]:
df["id"] = 1

features = ts.extract_features(df[:200], column_id="id", column_sort="time")

features

We get a very large set of features calculated over the entire timeseries -- what we have achieved so far is only a summarization/aggregation of the entire timeseries. This can be useful if we want to learn from multiple timeseries of the same type, for example if we have measurements of multiple people.
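
To make the "one summary per timeseries" idea concrete, here is a minimal sketch that uses plain `pandas` instead of `tsfresh`. The ids `"a"` and `"b"` and all values are invented for illustration, and only two hand-picked aggregates stand in for the much larger feature set that `tsfresh` computes.

```python
import pandas as pd

# Invented long-format data: two series (ids "a" and "b"), three time steps each.
long_df = pd.DataFrame({
    "id":    ["a", "a", "a", "b", "b", "b"],
    "time":  [0, 1, 2, 0, 1, 2],
    "pitch": [1.0, 2.0, 3.0, 10.0, 10.0, 13.0],
})

# One summary row per id, here with just two simple aggregates.
summary = long_df.sort_values("time").groupby("id")["pitch"].agg(["mean", "max"])
print(summary)
```

Each id collapses into a single row, which is why a dataframe with only one id produces a feature table with only one row.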

However, if we want to do timeseries prediction, we want to predict some values as the timeseries progresses. This means that we can't treat the entire timeseries as a single entity; we must break it down into smaller sub-series. To this end, we'll modify the `id` column so that every 10 consecutive measurements form a separate sub-series.

In [None]:
df["id"] = (df.index // 10) * 10
df["id"][:25]
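
To see what the integer division does, here is a small sketch on a toy frame. The column name `value` and the chunk size of 3 are made up for illustration. Rows whose index falls into the same block of `chunk` consecutive integers receive the same `id`, namely the first index of that block.

```python
import pandas as pd

# Toy frame with a non-zero starting index, like a slice of a larger file.
toy = pd.DataFrame({"value": range(7)}, index=range(10, 17))

chunk = 3  # hypothetical chunk size, chosen only for this illustration
toy["id"] = (toy.index // chunk) * chunk

chunk_ids = toy["id"].tolist()
chunk_sizes = toy.groupby("id").size().to_dict()
print(chunk_ids)
print(chunk_sizes)
```

Note that the first chunk in this toy example is shorter than the others because the starting index (10) is not a multiple of the chunk size. In our data the slice starts at index 1900, a multiple of 10, so every chunk holds exactly 10 measurements.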

Now we can recalculate the features for the sliced up timeseries.

In [None]:
features = ts.extract_features(df[:200], column_id="id", column_sort="time")

features

We can use `impute` and `select_features` to fill in any missing values and then make a meaningful selection of the features, respectively. Feature selection is done based on the target column using various significance tests (this can be configured through parameters). Since we condensed the original timeseries into sub-timeseries, we must do the same with the target values. To simplify, we'll just use the first target value of each sub-series in our example, even though a majority voting scheme might be more appropriate.

In [None]:
condensed_targets = targets[:200:10]

ts.utilities.dataframe_functions.impute(features)

features_filtered = ts.select_features(features, condensed_targets)

features_filtered
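
Taking the first value of each sub-series is only one way to condense the targets. As a rough sketch of two alternatives, assuming a chunk `id` computed as before: for a categorical label one could take the most frequent value per chunk (a majority vote), while for a continuous target such as `quat0` the per-chunk mean may be a more natural choice. The column names `label` and `target` and all values below are invented.

```python
import pandas as pd

# Invented toy data: six measurements split into two chunks of three.
toy = pd.DataFrame({
    "id":     [0, 0, 0, 3, 3, 3],
    "label":  ["walk", "walk", "run", "run", "run", "walk"],
    "target": [0.1, 0.2, 0.6, 1.0, 2.0, 3.0],
})

# Majority vote per chunk for a categorical label
# (on a tie, mode() is sorted, so the first value in sorted order wins).
majority = toy.groupby("id")["label"].agg(lambda s: s.mode().iloc[0])

# Mean per chunk for a continuous target.
mean_target = toy.groupby("id")["target"].mean()

print(majority.to_dict())
print(mean_target.round(3).to_dict())
```

Both results are indexed by the chunk `id`, so they line up with the index of a feature table extracted with `column_id="id"`.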

We can also run this entire process in a single step by using `extract_relevant_features`:

In [None]:
features_filtered_direct = ts.extract_relevant_features(df[:200], condensed_targets, column_id='id', column_sort='time')

features_filtered_direct

### Sliding windows

The above approach is not ideal, as we cut up the timeseries into non-overlapping sub-series. As such, we can only predict a value for one out of every ten measurements. We want to use as much information as possible, so for each measurement we should use all the data within a time window that ends at that measurement. This process is called _sliding windows_ or _timeseries rolling_.

![alt text](https://tsfresh.readthedocs.io/en/latest/_images/rolling_mechanism_2.png "A sliding window")

In `tsfresh`, this approach is available through the [`roll_time_series`](https://tsfresh.readthedocs.io/en/latest/api/tsfresh.utilities.html#tsfresh.utilities.dataframe_functions.roll_time_series) function. As before, we provide the `column_id` and `column_sort` parameters, but we can also provide `max_timeshift`, which specifies how many time steps back (in addition to the current time step) each window should cover. For example, `max_timeshift=2` yields windows of length 3 at most. If the parameter is omitted, the windows keep getting longer and longer as time progresses. Whether this is desirable depends largely on the problem being solved, though for long timeseries it can incur significant resource demands.

In [None]:
from tsfresh.utilities.dataframe_functions import roll_time_series

df["id"] = 1

# for demonstration purposes, we'll shrink our data a bit
shrink = df[["roll", "pitch", "yaw", "time", "id"]]

shrink_rolled = roll_time_series(shrink[:20], column_id="id", column_sort="time", max_timeshift=2)

shrink_rolled
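
The windowing logic itself can be sketched in a few lines of plain Python. The helper `rolling_windows` below is hypothetical and written only for illustration; it is not part of `tsfresh` and ignores ids and sorting. It assumes that the first windows are simply shorter, because not enough history exists yet.

```python
def rolling_windows(values, max_timeshift):
    """Return, for each time step, the window that ends at that step.

    Hypothetical helper for illustration only; it is not part of tsfresh.
    Each window holds the current value plus up to `max_timeshift` earlier ones.
    """
    windows = []
    for end in range(len(values)):
        start = max(0, end - max_timeshift)
        windows.append(values[start:end + 1])
    return windows


series = [10, 20, 30, 40, 50]
windows = rolling_windows(series, max_timeshift=2)
for w in windows:
    print(w)
```

Every time step gets its own window, so after feature extraction there is one row of features per measurement rather than one per block of ten.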

Once we have rolled our timeseries, we can apply the previous feature calculation process using `extract_features`, now using the rolled `id`.

In [None]:
features = ts.extract_features(shrink_rolled, column_id="id", column_sort="time")

features

### Calculating features directly

The `tsfresh` library also allows direct access to all of its feature calculation functions. They are located in the [`tsfresh.feature_extraction.feature_calculators`](https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#module-tsfresh.feature_extraction.feature_calculators) module.

Let's calculate the _absolute energy_ ($E = \sum_i x_i^2$) for some of the columns in our timeseries.

In [None]:
from tsfresh.feature_extraction.feature_calculators import abs_energy

abs_energy(df["pitch"])
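
As a sanity check on the formula, the same quantity can be computed by hand. The function `abs_energy_manual` is a hypothetical name used only in this sketch, and the input values are invented.

```python
def abs_energy_manual(values):
    """Sum of squared values, following E = sum_i x_i^2."""
    return sum(v * v for v in values)


signal = [1.0, -2.0, 3.0]
energy = abs_energy_manual(signal)  # 1 + 4 + 9
print(energy)
```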

These calculator functions always take the timeseries as a parameter, but they may also have other configurable parameters. For example, if we want to extract specific fast Fourier transform coefficients, we can specify which ones we are interested in.

In [None]:
from tsfresh.feature_extraction.feature_calculators import fft_coefficient

list(fft_coefficient(df["pitch"], [{"coeff": 1, "attr": "real"}, {"coeff": 1, "attr": "imag"}]))
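
To illustrate what a single Fourier coefficient represents, here is a minimal sketch that evaluates the discrete Fourier transform definition directly. It assumes the common sign convention $X_k = \sum_t x_t e^{-2\pi i k t / N}$ (the one used by NumPy's FFT routines). That `tsfresh` follows the same convention is an assumption here. The helper `dft_coefficient` is hypothetical and the signal is invented.

```python
import cmath


def dft_coefficient(values, k):
    """k-th DFT coefficient, computed straight from the definition (O(N) per coefficient)."""
    n = len(values)
    return sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(values))


signal = [0.0, 1.0, 0.0, -1.0]  # one period of a sine wave, sampled four times
c1 = dft_coefficient(signal, 1)

# The "real" and "imag" attributes correspond to these two parts of the coefficient.
print(c1.real, c1.imag)
```

For this pure sine wave the first coefficient is approximately $-2i$: the real part vanishes (up to floating-point error) and the imaginary part carries the whole signal.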