HAR_Prediction.Rmd

---
title: "Human Activity Recognition - A Model Stacking Prediction Approach"
author: "Dung Nguyen"
date: "12/8/2020"
output:
  md_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 1. Background

### a. Sypnosis
Using devices such as *Jawbone Up*, *Nike FuelBand*, and *Fitbit* it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how *much* of a particular activity they do, but they rarely quantify *how well they do it*. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

### b. Prediction goal
The goal is to predict the manner in which they did the exercise. This is the "classe" variable in the training set.

This report covers how the model was built, what the expected out-of-sample error is, and evaluates the prediction model.

## 2. Data

### a. Training set and testing set

The training set is downloaded and stored as "pml-training.csv". I load the data and partition it into a training and test set.

```{r}
#Load the caret package
library(caret)

#Download and read and store the csv data used for training to train_dat
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
              destfile = "pml-training.csv",
              method = "curl")
train_dat <- read.csv("pml-training.csv", 
                      na.strings = c("NA","#DIV/0!", ""))
train_dat$classe <- as.factor(train_dat$classe) #coerce into factor

#Remove the first 7 columns of the data since they are the personal information and clearly of no use for the prediction. Also remove "Near Zero variables" and columns with more than 60% of missing data
train_dat <- train_dat[,-c(1:7)]
train_dat <- train_dat[,-nearZeroVar(train_dat)]
tempData <- colSums(is.na(train_dat)) <= 0.6*nrow(train_dat)
train_dat <- train_dat[, tempData]

#Create data partitions: train and test for training and testing data, with 75% and 25% of the data points respectively.
trainIndex <- createDataPartition(train_dat$classe,
                                  p=0.75, list = FALSE)
train <- train_dat[trainIndex,]
test <- train_dat[-trainIndex,]
```

### b. Evaluation set

The evaluation set contains of 20 different test cases.The final model will be evaluate using this evaluation set.

```{r}
#Download and load evaluation data
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
              destfile = "pml-testing.csv",
              method = "curl")
eva <- read.csv("pml-testing.csv",
                na.strings = c("NA","#DIV/0!", ""))[,-c(1:7)]
```

## 3. Prediction algorithm

### a. Prediction approach

My approach to predict "classe" from the Human Activity Recognition dataset is rather simple and naive. I will choose a model from the linear model family and a tree-based algorithm, estimate their expected out-of-sample error using cross-validation and blend (stacking) them. I will use Linear Discriminant Analysis (LDA) and Random Forest. I will stack them using Support Vector Machine (SVM).

### b. Fitting models and prediction algorithms

Throughout the model building process, I will use 7-fold cross validation integrated in caret to estimate accuracy

#### Model 1: Linear discriminant analysis (LDA)

```{r}
#Set seed and set cross validation parameters
set.seed(4796)
t.control <- trainControl(method = "repeatedcv", number = 7, repeats = 3,
                              allowParallel = TRUE)

#Fit Linear Discriminant Analysis
model_lda <- train(classe ~ ., method = "lda", trControl = t.control, data = train)
model_lda
```

The accuracy of LDA model is `r model_lda$results$Accuracy`. This accuracy is estimated from repeated 7-fold cross-validation with 3 repeats. The estimated out-of-sample error = 1 - estimated out-of-sample accuracy: `r 1 - model_lda$results[[2]]`.

#### Model 2: Random Forest

```{r}
#Fit Random Forest with total 128 trees grown and 7 variables each split with parallel computing
library(doParallel)
cl = makeCluster(4)
registerDoParallel(cl)

model_rf <- train(classe ~ ., method = "rf", ntree=128,
                  tuneGrid=data.frame(.mtry = 7),
                  trControl = t.control, 
                  data = train)
stopCluster(cl)
remove(cl)
registerDoSEQ()
model_rf
```
The accuracy of Random Forest model is `r model_rf$results$Accuracy`. This accuracy is estimated from repeated 7-fold cross-validation with 3 repeats. The estimated out-of-sample error = 1 - estimated out-of-sample accuracy: `r 1 - model_rf$results[[2]]`.

#### Model stacking

Now we can prepare and compare the accuracy of predicted classes produced from 2 models.

```{r}
#Prediction from LDA
pred_lda <- predict(model_lda, newdata = test)

#Prediction from Random Forest
pred_rf <- predict(model_rf, newdata = test)

#Data frame contains the actual classe value in testing set and 2 prediction values from 2 models
meta <- data.frame(classe_sb = test$classe,
                   lda = pred_lda, 
                   rf = pred_rf)
```
Now I compared the accuracy of 2 models to the test data.

```{r}
confusionMatrix(pred_lda, test$classe)
confusionMatrix(pred_rf, test$classe)
```

From the plot above, the LDA model seems to classify not so good compared to the Random Forest model.

To stack 2 models for even better prediction, I combine 2 models using Support Vector Machine (SVM). This is "meta-learning" actually.

```{r}
#Create prediction from 2 models using SVM
model_svm <- train(classe_sb ~ ., data = meta, method = "svmLinear",
                   trControl = t.control)
pred_svm <- predict(model_svm, data = meta)

#Result
confusionMatrix(pred_svm, test$classe)
```

The accuracy of the stacking model is `r model_svm$results$Accuracy`. This accuracy is estimated from repeated 7-fold cross-validation with 3 repeats. The estimated out-of-sample error = 1 - estimated out-of-sample accuracy: `r 1 - model_svm$results$Accuracy`.

```{r}
#Compare models
compare <- resamples(list(LDA=model_lda, RF=model_rf, stacking=model_svm))
bwplot(compare, metric = "Accuracy")
```

### 4. Submission

In the evaluation set, I will apply 2 models of LDA and Random Forest and stack them using SVM model.

I will make prediction for 20 cases using model 1 (LDA), model 2 (Random Forest) and final model (stacking model using SVM to blend LDA and Random Forest)

```{r}
#Prediction from 2 models
eva_mod_lda <- predict(model_lda, eva)
eva_mod_rf <- predict(model_rf, eva)

#Model stacking
stacking <- data.frame(lda = eva_mod_lda,
                       rf = eva_mod_rf)
pred <- predict(model_svm, newdata = stacking)
data.frame(eva_mod_lda, eva_mod_rf, pred)
```

The prediction using the final stacking model gives the same result with the prediction from the Random Forest model.