# Activity recognition with accelerometer data

This demo shows how the `sklearn_xarray` package works with `Pipeline` and `GridSearchCV` from scikit-learn to provide a metadata-aware, grid-searchable pipeline mechanism.

The package combines the metadata-handling capabilities of `xarray` with the machine-learning framework of `sklearn`. It lets the user apply preprocessing steps group by group, use transformers that change the number of samples, use metadata directly as labels for classification tasks, and more.

The example performs activity recognition from raw accelerometer data with a feedforward neural network. It uses the [WISDM activity prediction dataset](http://www.cis.fordham.edu/wisdm/dataset.php), which contains recordings of six activities (walking, jogging, walking upstairs, walking downstairs, sitting and standing) from 36 different subjects.

In [1]:
import sklearn_xarray.dataarray as da
from sklearn_xarray import Target
from sklearn_xarray.preprocessing import (Splitter, Sanitizer, Featurizer)

from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline, make_pipeline

## Load the dataset

In [2]:
def get_wisdm(url='http://www.cis.fordham.edu/wisdm/includes/'
                  'datasets/latest/WISDM_ar_latest.tar.gz',
              folder='.',
              file='WISDM_ar_v1.1/WISDM_ar_v1.1_raw.txt',
              tmp_file='widsm.tar.gz'):

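    # urllib.request must be imported explicitly; `import urllib` alone
    # does not guarantee that the `request` submodule is loaded
    import os, tarfile, urllib.request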
    import pandas as pd
    import xarray as xr

    if not os.path.isfile(folder + '/' + file):
        urllib.request.urlretrieve(url, tmp_file)
        tar = tarfile.open(tmp_file)
        tar.extractall(folder)
        tar.close()
        os.remove(tmp_file)

    column_names = ['subject', 'activity', 'timestamp', 'x', 'y', 'z']
    df = pd.read_csv(folder + '/' + file, header=None, names=column_names,
                     comment=';')

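    # Synthetic time index with one sample every 50 ms. Recent pandas versions
    # no longer accept start/periods/freq in the TimedeltaIndex constructor,
    # so use timedelta_range instead.
    time = pd.timedelta_range(start='0s', periods=df.shape[0], freq='50ms')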

    coords = {'subject': ('sample', df.subject),
              'activity': ('sample', df.activity),
              'sample': time,
              'axis': ['x', 'y', 'z']}

    X = xr.DataArray(df.iloc[:, 3:6], coords=coords, dims=('sample', 'axis'))

    return X

In [3]:
X = get_wisdm()

## Define the pipeline

We define a pipeline with various preprocessing steps and a classifier.

The preprocessing consists of splitting the data into segments, removing segments that contain `nan` values, and standardizing. Since the accelerometer data is three-dimensional after splitting but the standardizer and classifier expect a one-dimensional feature vector per sample, we have to vectorize the samples.

Finally, we use a feedforward neural network to perform the classification.

In [4]:
pl = make_pipeline(
    Splitter(groupby=['subject', 'activity'], new_dim='timepoints', new_len=30),
    Sanitizer(),
    Featurizer(),
    da.TransformerWrapper(StandardScaler()),
    da.EstimatorWrapper(MLPClassifier(hidden_layer_sizes=(100,)), reshapes='features')
)
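
To make the segmenting, sanitizing and vectorizing steps more concrete, here is a minimal NumPy sketch of the same idea on toy data. It is not how `sklearn_xarray` implements `Splitter`, `Sanitizer` or `Featurizer`; the helper `segment_and_flatten` is hypothetical, and it assumes non-overlapping windows with any incomplete tail of a group discarded.

```python
import numpy as np

def segment_and_flatten(data, groups, new_len):
    """Illustrative helper (hypothetical, not part of sklearn_xarray).

    Splits each group into non-overlapping windows of `new_len` samples,
    drops windows that contain NaN, and flattens each window to a vector.
    """
    segments, labels = [], []
    for g in np.unique(groups):
        block = data[groups == g]              # all samples belonging to this group
        n_windows = block.shape[0] // new_len  # assumption: incomplete tail is dropped
        for i in range(n_windows):
            window = block[i * new_len:(i + 1) * new_len]
            if np.isnan(window).any():         # "sanitize": skip windows with NaN
                continue
            segments.append(window.ravel())    # "featurize": (new_len, 3) -> (new_len * 3,)
            labels.append(g)
    return np.array(segments), np.array(labels)

# Toy data: 10 samples of 3-axis readings, two groups, one NaN value
data = np.arange(30, dtype=float).reshape(10, 3)
data[1, 0] = np.nan
groups = np.array(["walk"] * 6 + ["sit"] * 4)

features, labels = segment_and_flatten(data, groups, new_len=2)
print(features.shape)  # (4, 6): one walking window was dropped because of the NaN
print(labels)
```

Note that the number of samples changes from 10 raw readings to 4 segments, which is why the pipeline needs transformers that are allowed to change the sample count.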

The label to classify is the activity, which we convert to a binary (one column per class) representation for the classifier.

In [7]:
y = Target('activity', LabelBinarizer(), dims=['feature'])
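
As a rough illustration of what a binary label representation looks like when there are more than two classes, the following NumPy sketch one-hot encodes a few made-up activity labels. It only mimics the idea behind `LabelBinarizer` and does not go through `Target`; the sample labels are invented for the example.

```python
import numpy as np

# Made-up labels for illustration only
activities = np.array(["Walking", "Jogging", "Sitting", "Walking"])

# `classes` holds the sorted unique labels; `inverse` maps each sample
# to the index of its class
classes, inverse = np.unique(activities, return_inverse=True)

# Pick one row of the identity matrix per sample: one column per class,
# with a single 1 marking the sample's activity
one_hot = np.eye(len(classes), dtype=int)[inverse]

print(classes)
print(one_hot)
```

Each row has exactly one nonzero entry, so a classifier with one output per class can be trained on it directly.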

Finally, we fit the pipeline to our dataset.

In [8]:
pl.fit(X, y);

Pipeline(memory=None,
     steps=[('splitter', Splitter(dim='sample', group_dim='sample', groupby=['subject', 'activity'],
     keep_coords_as=None, new_dim='timepoints',
     new_index_func=<built-in function arange>, new_len=30,
     reduce_index='subsample')), ('sanitizer', Sanitizer(dim='sample', group_dim='sample', group...1, validation_fraction=0.1,
       verbose=False, warm_start=False),
         reshapes='features'))])