# Chapter 23 How to Develop ML Models for Human Activity Recognition

In this tutorial, you will discover how to evaluate a diverse suite of machine learning algorithms on the Activity Recognition Using Smartphones dataset. After completing this tutorial, you will know:
- How to load and evaluate nonlinear and ensemble machine learning algorithms on the feature-engineered version of the activity recognition dataset.
- How to load and evaluate machine learning algorithms on the raw signal data for the activity recognition dataset.
- How to define reasonable lower and upper bounds on the expected performance of more sophisticated algorithms capable of feature learning, such as deep learning methods.
Let’s get started.

## 23.1 Tutorial Overview
This tutorial is divided into three parts; they are:
1. Activity Recognition Using Smartphones Dataset 
2. Modeling Feature Engineered Data
3. Modeling Raw Data

## 23.2 Activity Recognition Using Smartphones Dataset

See Chapter 22 for a description of the dataset.

## 23.3 Modeling Feature Engineered Data

In this section, we will develop code to load the feature-engineered version of the dataset and evaluate a suite of nonlinear machine learning algorithms, including the SVM used in the original paper. The goal is to achieve at least 89% accuracy on the test dataset, the figure reported for SVM in the original paper. This section is divided into five parts:

1. Load Dataset
2. Define Models
3. Evaluate Models
4. Summarize Results 
5. Complete Example

### 23.3.1 Load Dataset

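The feature-engineered data is stored in four files, an input file (X) and an output file (y) for each of the train and test sets:

- HARDataset/train/X_train.txt
- HARDataset/train/y_train.txt
- HARDataset/test/X_test.txt
- HARDataset/test/y_test.txt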

In [6]:
from pandas import read_csv
# load a single file as a numpy array
def load_file(filepath):
    # sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
    dataframe = read_csv(filepath, header=None, sep=r'\s+')
    return dataframe.values

# load a dataset group, such as train or test
def load_dataset_group(group, prefix=''):
    # load input data
    X = load_file(prefix + group + '/X_'+group+'.txt') 
    # load class output
    y = load_file(prefix + group + '/y_'+group+'.txt') 
    return X, y

# load the dataset, returns train and test X and y elements
def load_dataset(prefix=''):
    # load all train
    trainX, trainy = load_dataset_group('train', prefix + 'HARDataset/') 
    print("train:",trainX.shape, trainy.shape)
    # load all test
    testX, testy = load_dataset_group('test', prefix + 'HARDataset/') 
    print("test: ",testX.shape, testy.shape)
    # flatten y
    trainy, testy = trainy[:,0], testy[:,0]
    print('flatten y:', trainX.shape, trainy.shape, testX.shape, testy.shape)
    return trainX, trainy, testX, testy


# load dataset
trainX, trainy, testX, testy = load_dataset()

train: (7352, 561) (7352, 1)
test:  (2947, 561) (2947, 1)
flatten y: (7352, 561) (7352,) (2947, 561) (2947,)
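
To make the behavior of load_file() concrete without needing the dataset on disk, the minimal sketch below parses a small in-memory string in the same whitespace-separated layout. The sample values are made up for illustration, and the sketch uses sep=r'\s+', which recent pandas versions recommend in place of delim_whitespace=True.

```python
from io import StringIO
from pandas import read_csv

# two rows of whitespace-separated values with irregular spacing
# (made-up numbers, not taken from the HAR dataset)
sample_text = "  0.25 -0.50   1.00\n -1.00  0.75   0.00\n"

# sep=r'\s+' splits on any run of whitespace;
# header=None keeps the first row as data rather than column names
frame = read_csv(StringIO(sample_text), header=None, sep=r'\s+')

# .values returns the underlying NumPy array, as in load_file()
values = frame.values
print(values.shape)
```

Each row of the text becomes one row of the array, which is why the feature file loads as [samples, features].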


### 23.3.2 Define Models

We will evaluate the models using default configurations. We are not looking for optimal configurations of these models at this point, just a general idea of how well sophisticated models with default configurations perform on this problem. We will evaluate a diverse set of nonlinear and ensemble machine learning algorithms, specifically:

Nonlinear Algorithms:
- k-Nearest Neighbors
- Classification and Regression Tree 
- Support Vector Machine
- Naive Bayes

Ensemble Algorithms:
- Bagged Decision Trees
- Random Forest
- Extra Trees
- Gradient Boosting Machine

In [8]:
# create a dict of standard models to evaluate {name:object}
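def define_models(models=None):
    # create a fresh dict on each call to avoid a shared mutable default argument
    if models is None:
        models = dict()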
    # nonlinear models
    models['knn'] = KNeighborsClassifier(n_neighbors=7) 
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['bayes'] = GaussianNB()
    # ensemble models
    models['bag'] = BaggingClassifier(n_estimators=100) 
    models['rf'] = RandomForestClassifier(n_estimators=100) 
    models['et'] = ExtraTreesClassifier(n_estimators=100) 
    models['gbm'] = GradientBoostingClassifier(n_estimators=100) 
    print('Defined %d models' % len(models))
    return models

### 23.3.3 Evaluate Models

In this case, we will use classification accuracy, which is a reasonable measure of model performance (or error) given the balanced observations across the six activities (or classes).

The evaluate_model() function below implements this behavior, evaluating a given model and returning **the classification accuracy** as a percentage.

In [9]:
# evaluate a single model
def evaluate_model(trainX, trainy, testX, testy, model):
    # fit the model
    model.fit(trainX, trainy)
    # make predictions
    yhat = model.predict(testX)
    # evaluate predictions
    accuracy = accuracy_score(testy, yhat)
    return accuracy * 100.0

In [10]:
# evaluate a dict of models {name:object}, returns {name:score}
def evaluate_models(trainX, trainy, testX, testy, models):
    results = dict()
    for name, model in models.items():
        # evaluate the model
        results[name] = evaluate_model(trainX, trainy, testX, testy, model)
        # show process
        print('>%s: %.3f' % (name, results[name])) 
    return results
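
Accuracy is only a fair summary if the classes are roughly balanced, so it can be worth checking the class distribution before trusting the scores. The sketch below is one way to do this. The labels array is a hypothetical stand-in for trainy, and class_distribution() is a helper name introduced here for illustration.

```python
import numpy as np

# hypothetical label vector standing in for trainy (classes 1 to 6)
labels = np.array([1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3])

# summarize the percentage of observations in each class
def class_distribution(y):
    classes, counts = np.unique(y, return_counts=True)
    percentages = counts / counts.sum() * 100.0
    return {int(c): float(p) for c, p in zip(classes, percentages)}

dist = class_distribution(labels)
for label, pct in dist.items():
    print('Class=%d, Percentage=%.1f' % (label, pct))
```

Calling class_distribution(trainy) and class_distribution(testy) on the loaded data would show whether any activity dominates either set.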

### 23.3.4 Summarize Results

The final step is to summarize the findings. We can sort all of the results by the classification accuracy in descending order because we are interested in maximizing accuracy. 

In [11]:
# print the results ranked by score
def summarize_results(results, maximize=True):
    # create a list of (name, score) tuples
    mean_scores = [(k,v) for k,v in results.items()]
    # sort tuples by score in ascending order
    mean_scores = sorted(mean_scores, key=lambda x: x[1])
    # reverse for descending order (e.g. for accuracy)
    if maximize:
        mean_scores = list(reversed(mean_scores))
    print()
    for name, score in mean_scores: 
        print('Name=%s, Score=%.3f' % (name, score))
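
The sort-then-reverse logic above can also be expressed in a single call using the reverse argument of sorted(). The sketch below is an alternative, not a replacement for the function above. The rank_results() name and the scores are made up for illustration.

```python
# made-up accuracy scores, for illustration only
example_results = {'a': 71.5, 'b': 90.0, 'c': 84.25}

# rank (name, score) pairs, best first when maximize=True
def rank_results(results, maximize=True):
    return sorted(results.items(), key=lambda item: item[1], reverse=maximize)

ranked = rank_results(example_results)
for name, score in ranked:
    print('Name=%s, Score=%.3f' % (name, score))
```

The two approaches can order tied scores differently, which does not matter for a quick spot check.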

### 23.3.5 Complete Example

In [12]:
# spot check ml algorithms on engineered-features from the har dataset
from pandas import read_csv
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier

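# load a single file as a numpy array
def load_file(filepath):
    # sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
    dataframe = read_csv(filepath, header=None, sep=r'\s+')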
    return dataframe.values

# load a dataset group, such as train or test
def load_dataset_group(group, prefix=''):
    # load input data
    X = load_file(prefix + group + '/X_'+group+'.txt') 
    # load class output
    y = load_file(prefix + group + '/y_'+group+'.txt') 
    return X, y

# load the dataset, returns train and test X and y elements
def load_dataset(prefix=''):
    # load all train
    trainX, trainy = load_dataset_group('train', prefix + 'HARDataset/') 
    print("train:",trainX.shape, trainy.shape)
    # load all test
    testX, testy = load_dataset_group('test', prefix + 'HARDataset/') 
    print("test: ",testX.shape, testy.shape)
    # flatten y
    trainy, testy = trainy[:,0], testy[:,0]
    print('flatten y:', trainX.shape, trainy.shape, testX.shape, testy.shape)
    return trainX, trainy, testX, testy

# create a dict of standard models to evaluate {name:object}
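def define_models(models=None):
    # create a fresh dict on each call to avoid a shared mutable default argument
    if models is None:
        models = dict()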
    # nonlinear models
    models['knn'] = KNeighborsClassifier(n_neighbors=7) 
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['bayes'] = GaussianNB()
    # ensemble models
    models['bag'] = BaggingClassifier(n_estimators=100) 
    models['rf'] = RandomForestClassifier(n_estimators=100) 
    models['et'] = ExtraTreesClassifier(n_estimators=100) 
    models['gbm'] = GradientBoostingClassifier(n_estimators=100) 
    print('Defined %d models' % len(models))
    return models

# evaluate a single model
def evaluate_model(trainX, trainy, testX, testy, model):
    # fit the model
    model.fit(trainX, trainy)
    # make predictions
    yhat = model.predict(testX)
    # evaluate predictions
    accuracy = accuracy_score(testy, yhat)
    return accuracy * 100.0

# evaluate a dict of models {name:object}, returns {name:score}
def evaluate_models(trainX, trainy, testX, testy, models):
    results = dict()
    for name, model in models.items():
        # evaluate the model
        results[name] = evaluate_model(trainX, trainy, testX, testy, model)
        # show process
        print('>%s: %.3f' % (name, results[name])) 
    return results

# print the results ranked by score
def summarize_results(results, maximize=True):
    # create a list of (name, score) tuples
    mean_scores = [(k,v) for k,v in results.items()]
    # sort tuples by score in ascending order
    mean_scores = sorted(mean_scores, key=lambda x: x[1])
    # reverse for descending order (e.g. for accuracy)
    if maximize:
        mean_scores = list(reversed(mean_scores))
    print()
    for name, score in mean_scores: 
        print('Name=%s, Score=%.3f' % (name, score))
        
if __name__ == '__main__':        
    # load dataset
    trainX, trainy, testX, testy = load_dataset()
    # get model list
    models = define_models()
    # evaluate models
    results = evaluate_models(trainX, trainy, testX, testy, models)
    # summarize results
    summarize_results(results)

train: (7352, 561) (7352, 1)
test:  (2947, 561) (2947, 1)
flatten y: (7352, 561) (7352,) (2947, 561) (2947,)
Defined 8 models
>knn: 90.329
>cart: 85.409
>svm: 95.046
>bayes: 77.027
>bag: 89.922
>rf: 92.535
>et: 93.824
>gbm: 93.892

Name=svm, Score=95.046
Name=gbm, Score=93.892
Name=et, Score=93.824
Name=rf, Score=92.535
Name=knn, Score=90.329
Name=bag, Score=89.922
Name=cart, Score=85.409
Name=bayes, Score=77.027


Running the example first loads the train and test datasets. The eight models are then evaluated in turn, printing the performance for each. Finally, a rank of the models by their performance on the test set is displayed.

We can see that the Support Vector Machine achieves the best performance, at about 95% accuracy on the test set, followed by Gradient Boosting and ExtraTrees at about 94%. This is a great result, exceeding the 89% reported for SVM in the original paper. Your specific results may vary given the stochastic nature of the ensemble algorithms.

## 23.4 Modeling Raw Data

The raw data does require some more work to load. There are three main signal types in the raw data: **total acceleration**, **body acceleration**, and **body gyroscope**. Each has three axes of data. This means that there are a total of **nine variables** for each time step. 

Further, each series of data has been partitioned into overlapping windows of **2.56** seconds of data, or **128 time steps**. These windows of data correspond to the windows of engineered features (rows) in the previous section.
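
To illustrate how a continuous signal might be cut into overlapping windows like these, here is a minimal sketch on a synthetic signal. The 50 Hz sampling rate and the 50% overlap (a step of 64 samples) are assumptions not stated in this chapter, and sliding_windows() is a hypothetical helper. The dataset itself is already windowed, so this step is not needed to run the tutorial.

```python
import numpy as np

# split a 1D signal into fixed-length windows that overlap
# window=128 matches the time steps described above; step=64 is an assumed 50% overlap
def sliding_windows(signal, window=128, step=64):
    starts = range(0, len(signal) - window + 1, step)
    return np.array([signal[s:s + window] for s in starts])

# synthetic signal: 512 consecutive sample values
signal = np.arange(512, dtype=float)
windows = sliding_windows(signal)
print(windows.shape)

# window duration in seconds at an assumed 50 Hz sampling rate
print(128 / 50.0)
```

Each row of windows is one window of 128 time steps, and consecutive rows share half of their samples.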

This means that one row of data has 128 × 9 or 1,152 elements. This is a little more than double the size of the 561-element vectors in the previous section, and it is likely that there is some redundant data.

The signals are stored in the /Inertial Signals/ directory under the train and test subdirectories. Each axis of each signal is stored in a separate file, meaning that each of the train and test datasets has nine input files to load and one output file to load. We can batch the loading of these files into groups given the consistent directory structures and file naming conventions.

First, we can load all data for a given group into a single three-dimensional NumPy array, where the dimensions of the array are [samples, timesteps, features]. To make this clearer, there are 128 time steps and nine features, where the number of samples is the number of rows in any given raw signal data file. The load_group() function below implements this behavior.

In [None]:
# load a list of files into a 3D array of [samples, timesteps, features]
def load_group(filenames, prefix=''): 
    loaded = list()
    for name in filenames:
        data = load_file(prefix + name)
        loaded.append(data)
    # stack group so that features are the 3rd dimension
    loaded = dstack(loaded)
    return loaded
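
To see what dstack() does to the loaded files, the sketch below stacks three small synthetic arrays in place of real signal files. The array names and the sample count of four are made up for illustration.

```python
from numpy import dstack, zeros, ones

# three hypothetical signal files, each shaped [samples, timesteps] = [4, 128]
x_axis = zeros((4, 128))
y_axis = ones((4, 128))
z_axis = ones((4, 128)) * 2.0

# dstack joins the 2D arrays along a new third axis: [samples, timesteps, features]
stacked = dstack([x_axis, y_axis, z_axis])
print(stacked.shape)

# the three feature values for the first sample at the first time step
print(stacked[0, 0, :])
```

The order of the files in the list determines the order of the features along the third axis.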

In [None]:
# load a dataset group, such as train or test
def load_dataset_group(group, prefix=''):
    filepath = prefix + group + '/Inertial Signals/'
    # load all 9 files as a single array
    filenames = list()
    # total acceleration
    filenames += ['total_acc_x_'+group+'.txt', 'total_acc_y_'+group+'.txt',
    'total_acc_z_'+group+'.txt']
    # body acceleration
    filenames += ['body_acc_x_'+group+'.txt', 'body_acc_y_'+group+'.txt',
    'body_acc_z_'+group+'.txt']
    # body gyroscope
    filenames += ['body_gyro_x_'+group+'.txt', 'body_gyro_y_'+group+'.txt',
    'body_gyro_z_'+group+'.txt']
    # load input data
    X = load_group(filenames, filepath)
    # load class output
    y = load_file(prefix + group + '/y_'+group+'.txt') 
    return X, y

# load the dataset, returns train and test X and y elements
def load_dataset(prefix=''):
    # load all train
    trainX, trainy = load_dataset_group('train', prefix + 'HARDataset/') 
    print(trainX.shape, trainy.shape)
    # load all test
    testX, testy = load_dataset_group('test', prefix + 'HARDataset/') 
    print(testX.shape, testy.shape)
    # flatten X
    trainX = trainX.reshape((trainX.shape[0], trainX.shape[1] * trainX.shape[2]))
    testX = testX.reshape((testX.shape[0], testX.shape[1] * testX.shape[2]))
    # flatten y
    trainy, testy = trainy[:,0], testy[:,0]
    print(trainX.shape, trainy.shape, testX.shape, testy.shape)
    return trainX, trainy, testX, testy
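
The flattening step in load_dataset() can be checked on a small synthetic array. The sketch below uses two made-up samples to show that each [128, 9] window becomes one row of 1,152 elements, with the nine features of each time step kept next to each other.

```python
import numpy as np

# synthetic stand-in for the raw data: [samples, timesteps, features]
samples, timesteps, features = 2, 128, 9
X = np.arange(samples * timesteps * features).reshape((samples, timesteps, features))

# flatten each window into a single row vector, as in load_dataset()
flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
print(flat.shape)

# the first 9 entries of a row are the 9 features at time step 0,
# the next 9 entries are the features at time step 1, and so on
print(flat[0, :9])
print(X[0, 0, :])
```

Because the time ordering is only implicit in the column positions, the classical models treat each of the 1,152 values as an independent input feature.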

**Putting this all together, the complete example is listed below.**

In [13]:
# spot check on raw data from the har dataset
from numpy import dstack
from pandas import read_csv
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier

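# load a single file as a numpy array
def load_file(filepath):
    # sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
    dataframe = read_csv(filepath, header=None, sep=r'\s+')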
    return dataframe.values

# load a list of files into a 3D array of [samples, timesteps, features]
def load_group(filenames, prefix=''): 
    loaded = list()
    for name in filenames:
        data = load_file(prefix + name)
        loaded.append(data)
    # stack group so that features are the 3rd dimension
    loaded = dstack(loaded)
    return loaded

# load a dataset group, such as train or test
def load_dataset_group(group, prefix=''):
    filepath = prefix + group + '/Inertial Signals/'
    # load all 9 files as a single array
    filenames = list()
    # total acceleration
    filenames += ['total_acc_x_'+group+'.txt', 'total_acc_y_'+group+'.txt',
    'total_acc_z_'+group+'.txt']
    # body acceleration
    filenames += ['body_acc_x_'+group+'.txt', 'body_acc_y_'+group+'.txt',
    'body_acc_z_'+group+'.txt']
    # body gyroscope
    filenames += ['body_gyro_x_'+group+'.txt', 'body_gyro_y_'+group+'.txt',
    'body_gyro_z_'+group+'.txt']
    # load input data
    X = load_group(filenames, filepath)
    # load class output
    y = load_file(prefix + group + '/y_'+group+'.txt') 
    return X, y

# load the dataset, returns train and test X and y elements
def load_dataset(prefix=''):
    # load all train
    trainX, trainy = load_dataset_group('train', prefix + 'HARDataset/') 
    print(trainX.shape, trainy.shape)
    # load all test
    testX, testy = load_dataset_group('test', prefix + 'HARDataset/') 
    print(testX.shape, testy.shape)
    # flatten X
    trainX = trainX.reshape((trainX.shape[0], trainX.shape[1] * trainX.shape[2]))
    testX = testX.reshape((testX.shape[0], testX.shape[1] * testX.shape[2]))
    # flatten y
    trainy, testy = trainy[:,0], testy[:,0]
    print(trainX.shape, trainy.shape, testX.shape, testy.shape)
    return trainX, trainy, testX, testy

# create a dict of standard models to evaluate {name:object}
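def define_models(models=None):
    # create a fresh dict on each call to avoid a shared mutable default argument
    if models is None:
        models = dict()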
    # nonlinear models
    models['knn'] = KNeighborsClassifier(n_neighbors=7) 
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['bayes'] = GaussianNB()
    # ensemble models
    models['bag'] = BaggingClassifier(n_estimators=100) 
    models['rf'] = RandomForestClassifier(n_estimators=100) 
    models['et'] = ExtraTreesClassifier(n_estimators=100) 
    models['gbm'] = GradientBoostingClassifier(n_estimators=100) 
    print('Defined %d models' % len(models))
    return models

# evaluate a single model
def evaluate_model(trainX, trainy, testX, testy, model):
    # fit the model
    model.fit(trainX, trainy)
    # make predictions
    yhat = model.predict(testX)
    # evaluate predictions
    accuracy = accuracy_score(testy, yhat)
    return accuracy * 100.0

# evaluate a dict of models {name:object}, returns {name:score}
def evaluate_models(trainX, trainy, testX, testy, models):
    results = dict()
    for name, model in models.items():
        # evaluate the model
        results[name] = evaluate_model(trainX, trainy, testX, testy, model)
        # show process
        print('>%s: %.3f' % (name, results[name])) 
    return results

# print the results ranked by score
def summarize_results(results, maximize=True):
    # create a list of (name, score) tuples
    mean_scores = [(k,v) for k,v in results.items()]
    # sort tuples by score in ascending order
    mean_scores = sorted(mean_scores, key=lambda x: x[1])
    # reverse for descending order (e.g. for accuracy)
    if maximize:
        mean_scores = list(reversed(mean_scores))
    print()
    for name, score in mean_scores: 
        print('Name=%s, Score=%.3f' % (name, score))
        
if __name__ == '__main__':        
    # load dataset
    trainX, trainy, testX, testy = load_dataset()
    # get model list
    models = define_models()
    # evaluate models
    results = evaluate_models(trainX, trainy, testX, testy, models)
    # summarize results
    summarize_results(results)

(7352, 128, 9) (7352, 1)
(2947, 128, 9) (2947, 1)
(7352, 1152) (7352,) (2947, 1152) (2947,)
Defined 8 models
>knn: 61.893
>cart: 71.802
>svm: 88.734
>bayes: 72.480
>bag: 84.934
>rf: 84.628
>et: 86.495
>gbm: 87.207

Name=svm, Score=88.734
Name=gbm, Score=87.207
Name=et, Score=86.495
Name=bag, Score=84.934
Name=rf, Score=84.628
Name=bayes, Score=72.480
Name=cart, Score=71.802
Name=knn, Score=61.893


These results provide a lower bound on accuracy for more sophisticated methods that attempt to learn higher-order features automatically from the raw data (e.g. via feature learning in deep learning methods). In summary, based on the results printed above, the bounds for such methods on this dataset extend from about 89% accuracy with SVM on the raw data to about 95% with SVM on the highly processed dataset.


## 23.5 Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- More Algorithms. Only eight machine learning algorithms were evaluated on the problem; try some linear methods and perhaps some more nonlinear and ensemble methods.
- Algorithm Tuning. No tuning of the machine learning algorithms was performed; mostly default configurations were used. Pick a method such as SVM, ExtraTrees, or Gradient Boosting and grid search a suite of different hyperparameter configurations to see if you can further lift performance on the problem.
- Data Scaling. The data is already scaled to [-1,1], perhaps per subject. Explore whether additional scaling, such as standardization, can result in better performance, perhaps on methods sensitive to such scaling such as kNN.
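
As a starting point for the tuning and scaling ideas above, the sketch below combines standardization and kNN in a scikit-learn Pipeline and grid searches the number of neighbors. It runs on a small synthetic dataset from make_classification() rather than the HAR data, and the grid values, split size, and random seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the HAR data (not the real dataset)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# scaling inside the pipeline is fit on the training folds only
pipeline = Pipeline([('scale', StandardScaler()),
                     ('knn', KNeighborsClassifier())])

# candidate hyperparameter values (arbitrary, for illustration)
param_grid = {'knn__n_neighbors': [3, 5, 7, 9]}

# 3-fold cross-validated grid search on the training set
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)

# evaluate the best configuration on the held-out test set
score = search.score(X_test, y_test)
print('Best params: %s' % search.best_params_)
print('Test accuracy: %.3f' % (score * 100.0))
```

To try this on the tutorial data, trainX, trainy, testX, and testy from load_dataset() could be used in place of the synthetic split.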

## 23.7 Summary
In this tutorial, you discovered how to evaluate a diverse suite of machine learning algorithms on the Activity Recognition Using Smartphones dataset. Specifically, you learned:
- How to load and evaluate nonlinear and ensemble machine learning algorithms on the feature-engineered version of the activity recognition dataset.
- How to load and evaluate machine learning algorithms on the raw signal data for the activity recognition dataset.
- How to define reasonable lower and upper bounds on the expected performance of more sophisticated algorithms capable of feature learning, such as deep learning methods.