# Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data, typically via statistical techniques. These features form the basis of machine learning models. Here, feature engineering is performed on the windowed time series data to convert the raw (time- or frequency-domain) signals into a set of informative characteristics.

## Data Loading

Import the required libraries to perform feature engineering.

In [1]:
import sys
import math
import numpy as np
import scipy as sp
import pandas as pd
sys.path.insert(1, '../src/')
from glob import glob
from utils import load_raw_data, load_dataset_data
from scipy.signal import medfilt
from scipy.fft import fft, fftfreq, ifft  # scipy.fftpack is the legacy interface
from statsmodels.robust import mad
from scipy.stats import iqr, entropy

Load the data using the functions defined earlier.

In [2]:
acc_df, gyro_df, label_df = load_raw_data()

## Feature Engineering

We'll explore a basic set of feature types that can be applied to both the frequency and time signals.
These features are a set of statistical measures that describe the time series.

Rather than defining these features manually, you can often use existing Python libraries (e.g. tsfresh) to extract a large set of time series features (https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html).

### Common Axial Features

In [8]:
# the mean of the signal
def _mean(signal):
    return np.mean(signal)

# the standard deviation of the signal
def _std(signal):
    return np.std(signal)

# the median absolute deviation of the signal
def _mad(signal):
    return mad(signal)

# the max of the signal
def _max(signal):
    return np.max(signal)

# the min of the signal
def _min(signal):
    return np.min(signal)

# the interquartile range of the signal
def _iqr(signal):
    return iqr(signal)

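# the entropy of the signal, treating the shifted samples as an unnormalised distribution
# note: adding np.max(signal) can still leave values of mixed sign, in which case
# scipy.stats.entropy returns -inf
def _entropy(signal):
    return entropy(signal+np.max(signal))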

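def calculate_signal_time_features(df, feature_cols, window_size=30, sampling_frequency=50):
    # statistics computed for every signal in every window
    features_f = [_mean, _std, _mad, _max, _min, _iqr, _entropy]
    len_signal = len(df[feature_cols[0]].values)
    # column names: the window start index, then one column per (signal, feature) pair
    feature_names = ["window"] + [signal + feature.__name__
                                  for signal in feature_cols for feature in features_f]

    features = []
    # window starts are spaced sampling_frequency * window_size samples apart
    for window in range(0, len_signal, sampling_frequency*window_size):
        window_features = [window]
        for signal in feature_cols:
            for feature in features_f:
                # note: the slice covers only the first window_size samples after each start
                window_features.append(feature(df[signal].values[window:window+window_size]))
        features.append(window_features)
    features_df = pd.DataFrame(data=features, columns=feature_names)

    return features_df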
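
The function above starts a window every `sampling_frequency * window_size` samples but evaluates each statistic on only the first `window_size` samples after that start. If the intent is for each statistic to cover the whole stride (an assumption; the notebook does not state it), a minimal NumPy-only sketch could look like the following. The helper name `iter_full_windows` and the `drop_last` flag are hypothetical and not part of the notebook's `utils` module.

```python
import numpy as np

def iter_full_windows(signal, window_size=30, sampling_frequency=50, drop_last=True):
    """Yield (start, segment) pairs that cover the signal without overlap.

    Each segment spans window_size seconds, i.e. window_size * sampling_frequency samples.
    """
    samples_per_window = window_size * sampling_frequency
    signal = np.asarray(signal)
    for start in range(0, len(signal), samples_per_window):
        segment = signal[start:start + samples_per_window]
        # optionally skip a trailing segment that is shorter than a full window
        if drop_last and len(segment) < samples_per_window:
            continue
        yield start, segment

# usage on synthetic data: 35 samples, windows of 1 s at 10 Hz (10 samples each)
demo = np.arange(35, dtype=float)
window_means = [
    (start, float(np.mean(segment)))
    for start, segment in iter_full_windows(demo, window_size=1, sampling_frequency=10)
]
```

With this variant, the trailing five samples are dropped because they do not fill a complete window; setting `drop_last=False` would keep them as a shorter final segment.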

We'll calculate the features for an entire activity. The time series are first windowed: a window starts every sampling_frequency * window_size samples (1,500 with the defaults), and the features are calculated on the non-overlapping segment of window_size samples (30 with the defaults) at the start of each window.

In [9]:
features = calculate_signal_time_features(acc_df, ["acc_X", "acc_Y", "acc_Z"])
features

Unnamed: 0,window,acc_X_mean,acc_X_std,acc_X_mad,acc_X_max,acc_X_min,acc_X_iqr,acc_X_entropy,acc_Y_mean,acc_Y_std,...,acc_Y_min,acc_Y_iqr,acc_Y_entropy,acc_Z_mean,acc_Z_std,acc_Z_mad,acc_Z_max,acc_Z_min,acc_Z_iqr,acc_Z_entropy
0,0,0.186620,0.019663,0.013385,0.238889,0.163889,0.018750,3.400148,0.007361,0.019860,...,-0.027778,0.036111,3.285185,0.985602,0.033895,0.029858,1.031945,0.904167,0.044097,3.401056
1,1500,0.979676,0.001880,0.002059,0.984722,0.976389,0.002431,3.401197,-0.179167,0.002913,...,-0.184722,0.004167,3.401163,-0.280324,0.005161,0.006178,-0.272222,-0.290278,0.009028,3.401154
2,3000,1.016898,0.002709,0.004118,1.023611,1.012500,0.004167,3.401197,0.083287,0.003237,...,0.076389,0.002778,3.401020,-0.107546,0.005218,0.004118,-0.095833,-0.116667,0.006597,3.400868
3,4500,0.944769,0.002862,0.002059,0.950000,0.938889,0.003819,3.401196,-0.195648,0.002772,...,-0.202778,0.002778,3.401172,-0.365231,0.004281,0.005148,-0.358333,-0.373611,0.006944,3.401180
4,6000,0.189167,0.002164,0.002059,0.193056,0.184722,0.002778,3.401181,0.561019,0.004332,...,0.550000,0.006944,3.401190,0.815000,0.006732,0.007207,0.826389,0.798611,0.009375,3.401189
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
744,1116000,0.993889,0.181121,0.157527,1.380556,0.654167,0.253472,3.398283,-0.238194,0.197082,...,-0.790278,0.237500,-inf,-0.026389,0.091415,0.047361,0.243056,-0.190278,0.069097,3.314550
745,1117500,0.978009,0.125361,0.136935,1.233333,0.783333,0.176389,3.399603,-0.154537,0.166952,...,-0.533333,0.167014,-inf,0.055926,0.088494,0.086485,0.193056,-0.190278,0.099653,3.317437
746,1119000,0.996898,0.313886,0.279017,1.665278,0.488889,0.304514,3.394306,-0.255370,0.183766,...,-0.638889,0.151389,3.228200,-0.032824,0.101505,0.077219,0.144444,-0.202778,0.126736,-inf
747,1120500,0.945370,0.239891,0.200769,1.656945,0.631944,0.298958,3.397088,-0.237593,0.158394,...,-0.620833,0.183681,3.196823,0.085231,0.134957,0.162674,0.351389,-0.109722,0.214931,3.353895


From the current calculations, we end up with 21 computed features (7 statistics for each of the 3 axes) that describe the activity, or 22 columns including the window index.
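
The `-inf` entries in the entropy columns arise because `scipy.stats.entropy` expects non-negative values, and adding `np.max(signal)` can leave a window with values of mixed sign. One possible alternative, sketched here with NumPy only, shifts each window by its minimum so that every value is non-negative before normalising. The name `shifted_entropy` is hypothetical, and this is a stand-in under that assumption, not the definition used for the precalculated dataset.

```python
import numpy as np

def shifted_entropy(signal):
    """Shannon entropy (natural log) of a window after shifting it to be non-negative."""
    values = np.asarray(signal, dtype=float)
    shifted = values - values.min()   # the smallest value becomes 0
    total = shifted.sum()
    if total == 0:                    # constant window: nothing to normalise
        return 0.0
    p = shifted / total               # normalise to a probability-like vector
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

# usage: a window with mixed-sign values still yields a finite result
mixed = np.array([-0.2, -0.1, 0.05, 0.1])
mixed_entropy = shifted_entropy(mixed)
```

Returning 0.0 for a constant window is a design choice made for this sketch; note also that the smallest sample in each window is shifted to zero and so contributes nothing to the sum.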

Let's investigate the range of features that have already been precalculated across all experiments.

In [13]:
X_train, y_train, subject_train, X_test, y_test, subject_test = load_dataset_data()
list(X_train.columns), len(list(X_train.columns))

(['tBodyAcc-Mean-1_0',
  'tBodyAcc-Mean-2_1',
  'tBodyAcc-Mean-3_2',
  'tBodyAcc-STD-1_3',
  'tBodyAcc-STD-2_4',
  'tBodyAcc-STD-3_5',
  'tBodyAcc-Mad-1_6',
  'tBodyAcc-Mad-2_7',
  'tBodyAcc-Mad-3_8',
  'tBodyAcc-Max-1_9',
  'tBodyAcc-Max-2_10',
  'tBodyAcc-Max-3_11',
  'tBodyAcc-Min-1_12',
  'tBodyAcc-Min-2_13',
  'tBodyAcc-Min-3_14',
  'tBodyAcc-SMA-1_15',
  'tBodyAcc-Energy-1_16',
  'tBodyAcc-Energy-2_17',
  'tBodyAcc-Energy-3_18',
  'tBodyAcc-IQR-1_19',
  'tBodyAcc-IQR-2_20',
  'tBodyAcc-IQR-3_21',
  'tBodyAcc-ropy-1_22',
  'tBodyAcc-ropy-1_23',
  'tBodyAcc-ropy-1_24',
  'tBodyAcc-ARCoeff-1_25',
  'tBodyAcc-ARCoeff-2_26',
  'tBodyAcc-ARCoeff-3_27',
  'tBodyAcc-ARCoeff-4_28',
  'tBodyAcc-ARCoeff-5_29',
  'tBodyAcc-ARCoeff-6_30',
  'tBodyAcc-ARCoeff-7_31',
  'tBodyAcc-ARCoeff-8_32',
  'tBodyAcc-ARCoeff-9_33',
  'tBodyAcc-ARCoeff-10_34',
  'tBodyAcc-ARCoeff-11_35',
  'tBodyAcc-ARCoeff-12_36',
  'tBodyAcc-Correlation-1_37',
  'tBodyAcc-Correlation-2_38',
  'tBodyAcc-Correlation-3_39',


The features we've extracted so far have only been calculated in the time domain, and not the frequency domain. Additionally, the precalculated dataset covers a larger range of features (e.g. 'Energy', 'SMA', 'ARCoeff', 'Kurtosis'), resulting in a total of 561 features.
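
To illustrate what frequency-domain and additional features might look like, the sketch below applies a real FFT to one window and computes an energy value, plus a signal magnitude area (SMA) across three axes. The formulas follow common definitions (mean of squared values for energy, mean of summed absolute values across axes for SMA) and are assumptions; the precalculated dataset may define or normalise them differently. All function names here are hypothetical.

```python
import numpy as np

def energy(values):
    """Mean of the squared values (one common definition of signal energy)."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values ** 2))

def signal_magnitude_area(x, y, z):
    """Mean over samples of |x| + |y| + |z|."""
    return float(np.mean(np.abs(x) + np.abs(y) + np.abs(z)))

def frequency_features(window, sampling_frequency=50):
    """Basic statistics of the one-sided FFT magnitude spectrum of a window."""
    window = np.asarray(window, dtype=float)
    magnitudes = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sampling_frequency)
    return {
        "f_mean": float(np.mean(magnitudes)),
        "f_std": float(np.std(magnitudes)),
        "f_max": float(np.max(magnitudes)),
        "f_energy": energy(magnitudes),
        "dominant_frequency": float(freqs[np.argmax(magnitudes)]),
    }

# usage: a 5 Hz sine sampled at 50 Hz for 2 seconds
t = np.arange(100) / 50.0
sine = np.sin(2 * np.pi * 5.0 * t)
feats = frequency_features(sine)
```

The same time-domain statistics used earlier (mean, standard deviation, max) are simply reapplied to the magnitude spectrum, which mirrors the idea of applying one feature set to both time and frequency signals.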

In the next notebooks, we'll investigate how an activity classifier can be built on top of these features.