# Time Series Regression with Panel Data

Source:

Time Series Classification, Regression, Clustering & More
https://www.sktime.net/en/latest/examples/02_classification.html

Notebook:
https://github.com/sktime/sktime/blob/main/examples/02_classification.ipynb

### Overview of this notebook

* Introduction to time series classification, regression, clustering
* `sktime` data format for "time series panels" = collections of time series
* Basic vignettes for TSC, TSR, TSCl
* Advanced vignettes - pipelines, ensembles, tuning


Deal with *collections of time series* = "panel data"



Regression = try to assign one continuous *numerical value* per time series, after training on time series/value examples

- Example: Temperature/pressure/time profile of chemical reactor - Predict total purity (fraction of 1)



Time Series Classification:

<img src="./img/tsc.png" width="600" alt="time series classification">

In [67]:
import numpy as np
import pandas as pd

# Increase display width
pd.set_option("display.width", 1000)

## Panel data - `sktime` data formats <a name="panel"></a>

`Panel` is an abstract data type where the values are observed for:

* `instance`, e.g., patient
* `variable`, e.g., blood pressure, body temperature of the patient
* `time`/`index`, e.g., January 12, 2023 (usually but not necessarily a time index!)

One value X is: "patient 'A' had blood pressure 'X' on January 12, 2023"

Time series classification, regression, clustering: slices `Panel` data by instance


Preferred format 1: `pd.DataFrame` with 2-level `MultiIndex`, (instance, time) and columns: variables

Preferred format 2: 3D `np.ndarray` with index (instance, variable, time)

* `sktime` supports and recognizes multiple data formats for convenience and internal use, e.g., `dask`, `xarray`
* abstract data type = "scitype"; in-memory specification = "mtype"
* More information in tutorial on [in-memory data representations and data loading](https://www.sktime.net/en/latest/examples/AA_datatypes_and_datasets.html#In-memory-data-representations-and-data-loading)
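
The relationship between the two preferred formats can be sketched with plain `pandas` and `numpy`. This is an illustration on a hand-built toy panel, not `sktime`'s own conversion machinery; the variable names and sizes are made up, and the reshape assumes all instances have the same length and are stored in (instance, time) order.

```python
import numpy as np
import pandas as pd

# hypothetical toy panel: 3 instances, 4 time points, 2 variables
n_instances, n_timepoints = 3, 4
variables = ["blood_pressure", "body_temp"]  # illustrative names only

# format 1: DataFrame with 2-level MultiIndex (instance, time), columns = variables
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["instance", "time"]
)
rng = np.random.default_rng(0)
X_mi = pd.DataFrame(
    rng.normal(size=(n_instances * n_timepoints, len(variables))),
    index=index,
    columns=variables,
)

# format 2: 3D array with index (instance, variable, time)
# rows are ordered by instance, then time, so reshape to (instance, time, variable)
# first and then swap the last two axes
X_np = (
    X_mi.to_numpy()
    .reshape(n_instances, n_timepoints, len(variables))
    .transpose(0, 2, 1)
)

print(X_np.shape)
```

The same value is then addressed as `X_mi.loc[(1, 2), "blood_pressure"]` in the first format and `X_np[1, 0, 2]` in the second.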

### 1. Preferred format 1 - `pd-multiindex` specification

`pd-multiindex` = `pd.DataFrame` with 2-level `MultiIndex`, (instance, time) and columns: variables

In [68]:
from sktime.datasets import load_italy_power_demand

# load an example time series panel in pd-multiindex mtype
X, _ = load_italy_power_demand(return_type="pd-multiindex")

# renaming columns for illustrative purposes
X.columns = ["total_power_demand"]
X.index.names = ["day_ID", "hour_of_day"]

The Italy power demand dataset has:

* 1096 individual time series instances = single days of total power demand (mean subtracted)
* a single variable per time series instance, `total_power_demand`
    * total power demand on that day, in that hourly period
    * Since there's only one column, it is a univariate dataset
* individual time series are observed at 24 time (period) points (the same number for all instances)

In the dataset, days are jumbled and of different scope (independent sampling).
* considered independent - because `hour_of_day` in one sample doesn't affect `hour_of_day` in another
* for task, e.g., "identify season or weekday/weekend from pattern"

In [69]:
X

day_ID,hour_of_day,total_power_demand
0,0,-0.710518
0,1,-1.183320
0,2,-1.372442
0,3,-1.593083
0,4,-1.467002
...,...,...
1095,19,0.180490
1095,20,-0.094058
1095,21,0.729587
1095,22,0.210995


In [70]:
from sktime.datasets import load_basic_motions

# load an example time series panel in pd-multiindex mtype
X, _ = load_basic_motions(return_type="pd-multiindex")

# renaming columns for illustrative purposes
X.columns = ["accel_1", "accel_2", "accel_3", "gyro_1", "gyro_2", "gyro_3"]
X.index.names = ["trial_no", "timepoint"]

The basic motions dataset has:

* 80 individual time series instances = trials = person engaging in an activity like running, badminton, etc.
* six variables per time series instance, `dim_0` to `dim_5` (renamed according to the values they represent)
    * 3 accelerometer and 3 gyroscope measurements
    * hence a multivariate dataset
* individual time series are observed at 100 time points (the same number for all instances)

In [71]:
# The outer index level is the instance (trial) number,
# whereas the inner index level is the time point
# within that instance.
X

trial_no,timepoint,accel_1,accel_2,accel_3,gyro_1,gyro_2,gyro_3
0,0,0.079106,0.394032,0.551444,0.351565,0.023970,0.633883
0,1,0.079106,0.394032,0.551444,0.351565,0.023970,0.633883
0,2,-0.903497,-3.666397,-0.282844,-0.095881,-0.319605,0.972131
0,3,1.116125,-0.656101,0.333118,1.624657,-0.569962,1.209171
0,4,1.638200,1.405135,0.393875,1.187864,-0.271664,1.739182
...,...,...,...,...,...,...,...
79,95,28.459024,-16.633770,3.631869,8.978229,-3.611533,-1.491489
79,96,10.260094,0.102775,1.269261,-1.645964,-3.377157,1.283746
79,97,4.316471,-3.574319,2.063831,-1.717875,-1.843054,0.484734
79,98,0.704446,-4.920444,2.851857,-2.982977,-0.809665,-0.721774


pandas provides a simple way to access a range of values in the multi-indexed dataframe:

In [72]:
# Select:
# * the fourth variable (gyroscope 1)
# * of the first instance (trial 1 = 0 in python)
# * values at all 100 timestamps
#
X.loc[0, "gyro_1"]

timepoint
0     0.351565
1     0.351565
2    -0.095881
3     1.624657
4     1.187864
        ...   
95    0.039951
96   -0.029297
97    0.000000
98    0.000000
99   -0.007990
Name: gyro_1, Length: 100, dtype: float64

Or if you want to access the individual values:

In [73]:
# Select:
# * the fifth time point (5 = 4 in python, because of 0-indexing)
# * the third variable (accelerometer 3)
# * of the forty-third instance (trial 43 = 42 in python)

X.loc[(42, 4), "accel_3"]

np.float64(-1.27952)
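
A few further selections that are often handy on a `pd-multiindex` panel, sketched on a small hand-built frame (the toy values and sizes are made up; only standard `pandas` calls are used):

```python
import numpy as np
import pandas as pd

# hypothetical toy panel: 3 trials, 5 time points, 2 variables
index = pd.MultiIndex.from_product(
    [range(3), range(5)], names=["trial_no", "timepoint"]
)
X_toy = pd.DataFrame(
    {"accel_1": np.arange(15.0), "gyro_1": np.arange(15.0) * 10},
    index=index,
)

# cross-section: all variables at time point 2, one row per trial
at_t2 = X_toy.xs(2, level="timepoint")

# window: time points 1 to 3 (inclusive) of trials 0 and 1
idx = pd.IndexSlice
window = X_toy.loc[idx[0:1, 1:3], :]

# number of observed time points per trial
lengths = X_toy.groupby(level="trial_no").size()
```

Note that label-based slices in `.loc` include both endpoints, and that slicing on a `MultiIndex` requires the index to be sorted, as it is here.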

### 2. Preferred format 2 - `numpy3D` specification

`numpy3D` = 3D `np.ndarray` with index (instance, variable, time)

instance/time index is interpreted as integer

IMPORTANT: unlike `pd-multiindex`, this assumes:

* all individual series have the same length
* all individual series have the same index

In [74]:
from sktime.datasets import load_italy_power_demand

# load an example time series panel in numpy mtype
X, _ = load_italy_power_demand(return_type="numpy3D")

The Italy power demand dataset has:

* 1096 individual time series instances = single days of total power demand (mean subtracted)
* a single variable per time series instance, unnamed in numpy
* individual time series are observed at 24 time (period) points (the same number for all instances)

In [75]:
# (num_instances, num_variables, length)
X.shape

(1096, 1, 24)

In [76]:
from sktime.datasets import load_basic_motions

# load an example time series panel in numpy mtype
X, _ = load_basic_motions(return_type="numpy3D")

The basic motions dataset has:

* 80 individual time series instances = trials = person engaging in activity (running, badminton, etc)
* six variables per time series instance, unnamed in numpy
* individual time series are observed at 100 time points (the same number for all instances)

In [77]:
X.shape

(80, 6, 100)
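
The `pd-multiindex` lookups shown earlier have direct positional counterparts in `numpy3D`. A sketch on a synthetic array of the same shape (the values are placeholders, not the basic motions data):

```python
import numpy as np

# placeholder array with shape (instance, variable, time) = (80, 6, 100)
X_toy = np.arange(80 * 6 * 100, dtype=float).reshape(80, 6, 100)

# fourth variable of the first instance, all 100 time points
fourth_var_trial_0 = X_toy[0, 3, :]

# third variable, forty-third instance, fifth time point
single_value = X_toy[42, 2, 4]

# all instances and variables at the first time point
all_trials_at_t0 = X_toy[:, :, 0]
```

Since variables are unnamed in `numpy3D`, the mapping from position to meaning (e.g., position 3 = first gyroscope channel) has to be tracked separately.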

## Time Series Regression

Above tasks are very similar to "tabular" classification, regression, clustering, as in `sklearn`

Main distinction:
* in "tabular" classification etc, one (feature) instance is a row vector of features
* in TSC/TSR, one (feature) instance is a full time series, possibly of unequal length and with a distinct index set

![](./img/tsc.png)

### Time Series Regression - basic vignettes

TSR vignettes are exactly the same as TSC, except that:

* `y` in `fit` input and `predict` output should be float 1D `np.ndarray`, not categorical
* other algorithms are commonly used and/or performant
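
A minimal check of the target format, assuming targets arrive as strings (the toy values below are made up):

```python
import numpy as np

# hypothetical regression targets that were read in as strings
y_raw = np.array(["0.02", "0.15", "0.07"])

# cast to float so that the regressor receives a float 1D array
y = y_raw.astype("float")

print(y.dtype, y.ndim)
```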

In [78]:
# steps 1, 2 - prepare dataset (train and new)
from sktime.datasets import load_covid_3month
from sktime.regression.distance_based import KNeighborsTimeSeriesRegressor
from sktime.dists_kernels import FlatDist, ScipyDist

X_train, y_train = load_covid_3month(split="train")
y_train = y_train.astype("float")
X_new, _ = load_covid_3month(split="test")
X_new = X_new.loc[:2]  # smaller dataset for faster notebook runtime

# step 3 - specify the regressor
eucl_dist = FlatDist(ScipyDist())
reg_model = KNeighborsTimeSeriesRegressor(n_neighbors=3, distance=eucl_dist)

# step 4 - fit/train the regressor
reg_model.fit(X_train, y_train)

# step 5 - predict target values on new data
y_pred = reg_model.predict(X_new)

In [79]:
y_pred  # predictions are an array of floats

array([0.02957762, 0.0065062 , 0.00183655])
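
Conceptually, the regressor above can be pictured as follows. This is a plain `numpy` sketch, not `sktime`'s implementation; it assumes that the distance is Euclidean on flattened series (as the name `eucl_dist` suggests) and that neighbours are weighted uniformly. The function name and toy data are hypothetical.

```python
import numpy as np


def knn_regress_flat(X_train, y_train, X_new, n_neighbors=3):
    """k-NN regression on numpy3D panels using Euclidean distance on flattened series."""
    # flatten (instance, variable, time) -> (instance, variable * time)
    A = X_new.reshape(len(X_new), -1)
    B = X_train.reshape(len(X_train), -1)
    # pairwise Euclidean distances, shape (n_new, n_train)
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # indices of the k closest training instances per new instance
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    # uniform weights: prediction is the mean target of the neighbours
    return y_train[nearest].mean(axis=1)


# toy data: 20 univariate series of length 10, target = series mean
rng = np.random.default_rng(42)
X_train_toy = rng.normal(size=(20, 1, 10))
y_train_toy = X_train_toy.mean(axis=(1, 2))
X_new_toy = X_train_toy[:2]

y_pred_toy = knn_regress_flat(X_train_toy, y_train_toy, X_new_toy, n_neighbors=3)
```

With `n_neighbors=1` and new instances taken from the training set, each prediction reproduces the corresponding training target, which is a quick sanity check for any nearest-neighbour regressor.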

## Pipelines, Feature Extraction, Tuning, Composition


similar to `sklearn` for "tabular" classification, regression, etc., `sktime` has a rich set of tools for:

* feature extraction via transformers
* pipeline transformers with any estimator
* tuning individual estimators or pipelines via grid search and similar
* building ensembles out of individual estimators, or other composites

`sktime` is also fully interoperable with the `sklearn` interface if `numpy`-based data mtypes are used (although this loses support for unequal length time series).

### Primer on `sktime` transformers for feature extraction

all `sktime` transformers work natively with panel data:

In [80]:
from sktime.datasets import load_italy_power_demand
from sktime.transformations.series.detrend import Detrender

# load some panel data
X, _ = load_italy_power_demand(return_type="pd-multiindex")

# specify a linear detrender
detrender = Detrender()

# detrend X by removing linear trend from each instance
X_detrended = detrender.fit_transform(X)
X_detrended

Unnamed: 0,Unnamed: 1,dim_0
0,0,0.267711
0,1,-0.290155
0,2,-0.564339
0,3,-0.870044
0,4,-0.829027
...,...,...
1095,19,-0.425904
1095,20,-0.781304
1095,21,-0.038512
1095,22,-0.637956


for panel tasks such as TSC, TSR, clustering, there are two distinctions to be aware of:

* series-to-series transformers transform individual series to series, panels to panels. E.g., instance-wise detrender above
* series-to-primitive transformers transform individual series to a set of tabular features. E.g., summary feature extractor

either type of transform can be instance-wise or non-instance-wise:

* instance-wise transforms use only the i-th series to transform the i-th series. E.g., instance-wise detrender
* non-instance-wise transforms train on all series to transform the i-th series. E.g., PCA, overall mean detrender
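
The difference between the two can be sketched in plain `numpy`. The two functions below are hypothetical stand-ins, not `sktime` transformers: one removes a linear trend fitted per series, the other subtracts the mean over all series at each time point.

```python
import numpy as np

# toy panel as a 2D array (instance, time): linear trend plus noise
rng = np.random.default_rng(0)
t = np.arange(24)
X_toy = 0.5 * t + rng.normal(size=(5, 24))


def detrend_instancewise(X):
    """Instance-wise: fit and remove a linear trend separately for each series."""
    t = np.arange(X.shape[1])
    out = np.empty_like(X, dtype=float)
    for i, series in enumerate(X):
        slope, intercept = np.polyfit(t, series, deg=1)
        out[i] = series - (slope * t + intercept)
    return out


def detrend_overall_mean(X):
    """Non-instance-wise: subtract the mean over all series at each time point."""
    return X - X.mean(axis=0, keepdims=True)


X_inst = detrend_instancewise(X_toy)
X_mean = detrend_overall_mean(X_toy)
```

Changing one series leaves the instance-wise result for every other series untouched, whereas the overall-mean result changes for all series, since the mean is estimated from the whole panel.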

In [81]:
# example of a series-to-primitive transformer
from sktime.transformations.series.summarize import SummaryTransformer

# specify summary transformer
summary_trafo = SummaryTransformer()

# extract summary features - one per instance in the panel
X_summaries = summary_trafo.fit_transform(X)
X_summaries

Unnamed: 0,mean,std,min,max,0.1,0.25,0.5,0.75,0.9
0,-1.041667e-09,1.0,-1.593083,1.464375,-1.372442,-0.805078,0.030207,0.936412,1.218518
1,-1.958333e-09,1.0,-1.630917,1.201393,-1.533955,-0.999388,0.384871,0.735720,1.084018
2,-1.775000e-09,1.0,-1.397118,2.349344,-1.003740,-0.741487,-0.132687,0.265374,1.515756
3,-8.541667e-10,1.0,-1.646458,1.344487,-1.476779,-0.898722,0.266022,0.776495,1.039641
4,-3.416667e-09,1.0,-1.620240,1.303502,-1.511644,-0.978061,0.405495,0.692648,1.061249
...,...,...,...,...,...,...,...,...,...
1091,-1.041667e-09,1.0,-1.817799,1.630397,-1.323058,-0.643414,0.081208,0.568453,1.390523
1092,-4.166666e-10,1.0,-1.550077,1.513605,-1.343747,-0.768526,0.075550,0.857101,1.276013
1093,4.166667e-09,1.0,-1.706992,1.052255,-1.498879,-1.139943,0.467669,0.713195,0.993797
1094,1.583333e-09,1.0,-1.673857,2.420163,-0.744173,-0.479768,-0.266538,0.159923,1.550184
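
What a series-to-primitive transform does can be approximated with a `pandas` group-by over the instance level. This is a rough sketch on toy data, not the `SummaryTransformer` implementation, and it computes only a subset of the features shown above.

```python
import pandas as pd

# hypothetical toy panel: 3 instances, 4 time points, one variable
index = pd.MultiIndex.from_product(
    [range(3), range(4)], names=["instance", "time"]
)
X_toy = pd.DataFrame(
    {"dim_0": [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0]},
    index=index,
)

# one row of summary features per instance
grouped = X_toy["dim_0"].groupby(level="instance")
summaries = grouped.agg(["mean", "std", "min", "max"])
summaries["0.5"] = grouped.quantile(0.5)

print(summaries)
```

The result is an ordinary table with one row per instance, which is why such features can be passed on to tabular estimators.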


just like classifiers, we can search for transformers of either type via the right tag:

* `"scitype:transform-input"` and `"scitype:transform-output"` define input and output, e.g., "series-to-series" (both are scitype strings)
* `"scitype:instancewise"` is boolean and tells us whether the transform is instance-wise

In [82]:
# example: looking for all series-to-primitive transformers that are instance-wise
from sktime.registry import all_estimators

all_estimators(
    "transformer",
    as_dataframe=True,
    filter_tags={
        "scitype:transform-input": "Series",
        "scitype:transform-output": "Primitives",
        "scitype:instancewise": True,
    },
)

Unnamed: 0,name,object
0,Catch22,<class 'sktime.transformations.panel.catch22.C...
1,Catch22Wrapper,<class 'sktime.transformations.panel.catch22wr...
2,FittedParamExtractor,<class 'sktime.transformations.panel.summarize...
3,HurstExponentTransformer,<class 'sktime.transformations.series.hurst.Hu...
4,RandomIntervalFeatureExtractor,<class 'sktime.transformations.panel.summarize...
5,RandomIntervals,<class 'sktime.transformations.panel.random_in...
6,RandomShapeletTransform,<class 'sktime.transformations.panel.shapelet_...
7,SignatureMoments,<class 'sktime.transformations.series.signatur...
8,SignatureTransformer,<class 'sktime.transformations.panel.signature...
9,SummaryTransformer,<class 'sktime.transformations.series.summariz...


Further details on transformations and feature extraction can be found in tutorial 3 (transformers).

All composition steps therein (e.g., chaining, column subsetting) work together with all estimator types in `sktime`, including classifiers, regressors, clusterers.

### Pipelines for time series panel tasks

all panel estimators pipeline with `sktime` transformers, via the `*` dunder or `make_pipeline`.

The pipeline does the following:

* in `fit`: runs the transformers' `fit_transform` in sequence, then `fit` of the panel estimator
* in `predict`: runs the fitted transformers' `transform` in sequence, then `predict` of the panel estimator

(the logic is the same as for `sklearn` pipelines)
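
The `fit`/`predict` logic described above can be written out as a toy class. All class names here are hypothetical and deliberately minimal; this is a sketch of the control flow, not the `sktime` pipeline implementation.

```python
import numpy as np


class ToyPipeline:
    """Chains transformers with a final estimator (illustrative only)."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, X, y):
        # fit_transform of each transformer in sequence, then fit the estimator
        for trafo in self.transformers:
            X = trafo.fit_transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # transform with the already fitted transformers, then predict
        for trafo in self.transformers:
            X = trafo.transform(X)
        return self.estimator.predict(X)


class ToyCenterer:
    """Subtracts the overall mean seen in fit."""

    def fit_transform(self, X):
        self.mean_ = X.mean()
        return X - self.mean_

    def transform(self, X):
        return X - self.mean_


class ToyMeanRegressor:
    """Always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.y_mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.y_mean_)


pipe_toy = ToyPipeline([ToyCenterer()], ToyMeanRegressor())
pipe_toy.fit(np.ones((4, 1, 5)), np.array([1.0, 2.0, 3.0, 4.0]))
y_toy = pipe_toy.predict(np.zeros((2, 1, 5)))
```

The key point is that `predict` reuses the state learned in `fit` (here `mean_`) rather than refitting the transformers on new data.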

In [83]:
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.transformations.series.exponent import ExponentTransformer

pipe = ExponentTransformer() * KNeighborsTimeSeriesClassifier()

# this constructs a ClassifierPipeline, which is also a classifier
pipe

In [84]:
# alternative to construct:
from sktime.pipeline import make_pipeline

pipe = make_pipeline(ExponentTransformer(), KNeighborsTimeSeriesClassifier())

In [85]:
from sktime.datasets import load_unit_test

X_train, y_train = load_unit_test(split="TRAIN")
X_test, _ = load_unit_test(split="TEST")

# this is a ClassifierPipeline with the same interface as knn-classifier
# first applies exponent transform, then knn-classifier
pipe.fit(X_train, y_train)

`sktime` transformers pipeline with `sklearn` classifiers!

This allows building "time series feature extraction, then `sklearn` classifier" pipelines:

In [86]:
from sklearn.ensemble import RandomForestClassifier

from sktime.transformations.series.summarize import SummaryTransformer

# pipeline: summary feature extraction, then sklearn random forest
summary_rf = SummaryTransformer() * RandomForestClassifier()

summary_rf.fit(X_train, y_train)

### Using transformers to deal with unequal length or missing values

pro tip: useful transformers to pipeline are those that "improve" capabilities!

Search for these transformer tags:

* `"capability:unequal_length:removes"` - ensures all instances in the panel have equal length afterwards. Examples: padding, cutting, resampling.
* `"capability:missing_values:removes"` - removes all missing values from the data (e.g., series, panel) passed to it. Example: mean imputation

In [87]:
# all transformers that guarantee that the output is equal length and equal index
from sktime.registry import all_estimators

all_estimators(
    "transformer",
    as_dataframe=True,
    filter_tags={"capability:unequal_length:removes": True},
)

Unnamed: 0,name,object
0,ClearSky,<class 'sktime.transformations.series.clear_sk...
1,IntervalSegmenter,<class 'sktime.transformations.panel.segment.I...
2,PaddingTransformer,<class 'sktime.transformations.panel.padder.Pa...
3,RandomIntervalSegmenter,<class 'sktime.transformations.panel.segment.R...
4,SlopeTransformer,<class 'sktime.transformations.panel.slope.Slo...
5,SubsequenceExtractionTransformer,<class 'sktime.transformations.series.subseque...
6,TimeBinAggregate,<class 'sktime.transformations.series.binning....
7,TruncationTransformer,<class 'sktime.transformations.panel.truncatio...


In [88]:
# all transformers that guarantee the output has no missing values
from sktime.registry import all_estimators

all_estimators(
    "transformer",
    as_dataframe=True,
    filter_tags={"capability:missing_values:removes": True},
)

Unnamed: 0,name,object
0,ClearSky,<class 'sktime.transformations.series.clear_sk...
1,ClustererAsTransformer,<class 'sktime.clustering.compose._as_transfor...
2,DetectorAsTransformer,<class 'sktime.detection.compose._as_transform...
3,Imputer,<class 'sktime.transformations.series.impute.I...


minor note:

some transformers guarantee "no missing values" under some conditions but not always, e.g., `TimeBinAggregate`

let's check the tags in one example

In [89]:
# inspect the tags of a classifier that does not support missing values
from sktime.classification.feature_based import SummaryClassifier

no_missing_reg = SummaryClassifier()

no_missing_reg.get_tags()

{'python_version': None,
 'python_dependencies': None,
 'env_marker': None,
 'sktime_version': '0.35.0',
 'object_type': 'classifier',
 'X_inner_mtype': 'numpy3D',
 'y_inner_mtype': 'numpy1D',
 'capability:multioutput': False,
 'capability:multivariate': True,
 'capability:unequal_length': False,
 'capability:missing_values': False,
 'capability:train_estimate': False,
 'capability:feature_importance': False,
 'capability:contractable': False,
 'capability:multithreading': True,
 'capability:predict_proba': True,
 'requires_cython': False,
 'authors': ['MatthewMiddlehurst'],
 'maintainers': 'sktime developers',
 'classifier_type': 'feature'}

In [90]:
from sktime.transformations.series.impute import Imputer

reg_can_do_missing = Imputer() * SummaryClassifier()

reg_can_do_missing.get_tags()

{'python_version': None,
 'python_dependencies': None,
 'env_marker': None,
 'sktime_version': '0.35.0',
 'object_type': 'classifier',
 'X_inner_mtype': 'pd-multiindex',
 'y_inner_mtype': 'numpy1D',
 'capability:multioutput': False,
 'capability:multivariate': True,
 'capability:unequal_length': False,
 'capability:missing_values': True,
 'capability:train_estimate': False,
 'capability:feature_importance': False,
 'capability:contractable': False,
 'capability:multithreading': False,
 'capability:predict_proba': True,
 'requires_cython': False,
 'authors': ['fkiraly'],
 'maintainers': 'sktime developers'}

### Tuning and model selection

`sktime` classifiers are compatible with `sklearn` model selection and composition tools using `sktime` data formats.

This extends to grid tuning and cross-validation, as long as `numpy` based formats or length/instance indexed formats are used.

In [91]:
from sktime.datasets import load_unit_test

X_train, y_train = load_unit_test(split="TRAIN")
X_test, _ = load_unit_test(split="TEST")

Cross-validation using the `sklearn` `cross_val_score` and `KFold` functionality:

In [92]:
from sklearn.model_selection import KFold, cross_val_score

from sktime.classification.feature_based import SummaryClassifier

reg_model = SummaryClassifier()

cross_val_score(reg_model, X_train, y=y_train, cv=KFold(n_splits=4))

array([0.4, 0.8, 0.6, 0.8])
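
Cross-validation on panel data splits along the instance axis only; each series stays intact. The sketch below mimics unshuffled K-fold splitting with plain `numpy` on a placeholder array (the helper `kfold_indices` is hypothetical and not part of `sklearn` or `sktime`):

```python
import numpy as np


def kfold_indices(n_instances, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_instances), n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_splits) if j != k]
        )
        yield train_idx, test_idx


# placeholder panel in numpy3D format: 20 instances, 1 variable, 24 time points
X_toy = np.zeros((20, 1, 24))

splits = list(kfold_indices(len(X_toy), n_splits=4))

# indexing with instance indices keeps variable and time axes unchanged
fold_shapes = [(X_toy[tr].shape, X_toy[te].shape) for tr, te in splits]
```

This is why `numpy`-based or instance-indexed formats are needed: the splitter must be able to select whole instances by position.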

Parameter tuning using `sklearn` `GridSearchCV`; here we tune _k_ and the distance measure of a K-NN classifier:

In [93]:
from sklearn.model_selection import GridSearchCV

from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

knn = KNeighborsTimeSeriesClassifier()
param_grid = {"n_neighbors": [1, 5], "distance": ["euclidean", "dtw"]}
parameter_tuning_method = GridSearchCV(knn, param_grid, cv=KFold(n_splits=4))

parameter_tuning_method.fit(X_train, y_train)
y_pred = parameter_tuning_method.predict(X_test)
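
The candidates that a grid search evaluates are simply all combinations of the listed parameter values. A sketch of that enumeration with the standard library only (this is not `GridSearchCV` itself, just the combinatorics behind it):

```python
from itertools import product

param_grid = {"n_neighbors": [1, 5], "distance": ["euclidean", "dtw"]}

# every combination of one value per parameter
names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

for candidate in candidates:
    print(candidate)
```

With 4 candidates and 4 folds, the search above amounts to 16 model fits, plus one final refit on the full training set when `refit` is left at its default.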