## Prepare Problem
- Load libraries
- Load Dataset
- Split-out validation dataset

In [None]:
install.packages("mlbench")
library(mlbench)
data(package="mlbench")
data(PimaIndiansDiabetes)
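
The cell above loads the library and the dataset but does not split out a validation set. A minimal sketch of that third task, using only base R, might look like this. The 80/20 ratio, the seed, and the names `validationIndex`, `validation` and `training` are illustrative assumptions and not part of the original notebook.

```r
library(mlbench)
data(PimaIndiansDiabetes)

# fix the seed so the random split is reproducible
set.seed(7)
n <- nrow(PimaIndiansDiabetes)
# hold back roughly 20% of the rows for validation
validationIndex <- sample(n, size=floor(0.2 * n))
validation <- PimaIndiansDiabetes[validationIndex, ]
# use the remaining rows for training and evaluating models
training <- PimaIndiansDiabetes[-validationIndex, ]
dim(training)
dim(validation)
```

This split is purely random. If the class balance should be preserved in both parts, caret's createDataPartition() function is a stratified alternative.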

## Summarize Data

### Understand Data with Descriptive Statistics

- Understand your data using the head() function to look at the first few rows.
- Review the distribution of your data with the summary() function.
- Review the dimensions of your data with the dim() function.
- Calculate pair-wise correlation between your variables using the cor() function.

In [None]:
head(PimaIndiansDiabetes)

In [None]:
summary(PimaIndiansDiabetes)

In [None]:
dim(PimaIndiansDiabetes)
[1] 768   9
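
The cells above cover head(), summary() and dim(), but not cor(). A short sketch of the pair-wise correlation step could be the following. It assumes the first eight columns are the numeric attributes and the ninth is the class label, as in the pre-processing cell later in this notebook. The name `correlations` is an illustrative choice.

```r
library(mlbench)
data(PimaIndiansDiabetes)

# pair-wise Pearson correlation between the eight numeric attributes
correlations <- cor(PimaIndiansDiabetes[,1:8])
# round for easier reading
print(round(correlations, 2))
```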

### Understand Data with Visualization

- Use the hist() function to create a histogram of each attribute.
- Use the boxplot() function to create box and whisker plots of each attribute.
- Use the pairs() function to create pair-wise scatterplots of all attributes.

In [None]:
# box plots of the eight numeric attributes (column 9 is the class label)
boxplot(PimaIndiansDiabetes[,1:8])
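
The hist() and pairs() bullets can be tried in the same way. The sketch below is one possible layout, not the only one: the 2x4 plotting grid and the name `predictors` are assumptions made for illustration.

```r
library(mlbench)
data(PimaIndiansDiabetes)

# keep only the eight numeric attributes
predictors <- PimaIndiansDiabetes[,1:8]

# one histogram per attribute, arranged in a 2x4 grid
par(mfrow=c(2, 4))
for (name in names(predictors)) {
  hist(predictors[[name]], main=name, xlab=name)
}
# restore the default single-plot layout
par(mfrow=c(1, 1))

# pair-wise scatterplots, coloured by class label
pairs(predictors, col=as.integer(PimaIndiansDiabetes$diabetes))
```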

## Prepare For Modeling by Pre-Processing Data

- Standardize numerical data (e.g. mean of 0 and standard deviation of 1) using the center and scale options of caret's preProcess() function.
- Normalize numerical data (e.g. to a range of 0-1) using the range option.
- Explore more advanced power transforms like the Box-Cox power transform with the BoxCox option.

In [None]:
install.packages("caret")
# load caret package
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,1:8], method=c("range"))
# transform the dataset using the pre-processing parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,1:8])
# summarize the transformed dataset
summary(transformed)
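
The cell above demonstrates only the range option. As a hedged sketch, the other two bullets might be tried as follows. Box-Cox requires strictly positive values, so the example applies it only to the `pedigree` and `age` columns, which contain no zeros in this dataset. The variable names are illustrative assumptions.

```r
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# standardize: subtract the mean and divide by the standard deviation
standardizeParams <- preProcess(PimaIndiansDiabetes[,1:8], method=c("center", "scale"))
standardized <- predict(standardizeParams, PimaIndiansDiabetes[,1:8])
# column means should be close to 0 and standard deviations close to 1
print(round(colMeans(standardized), 3))
print(round(apply(standardized, 2, sd), 3))

# Box-Cox power transform on two strictly positive attributes
positiveColumns <- PimaIndiansDiabetes[,c("pedigree", "age")]
boxcoxParams <- preProcess(positiveColumns, method=c("BoxCox"))
boxcoxed <- predict(boxcoxParams, positiveColumns)
summary(boxcoxed)
```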

## Algorithm Evaluation

### Algorithm Evaluation With Resampling Methods

We can use statistical methods called resampling methods to split our training dataset into subsets: some are used to train the model, and the others are held back to estimate the accuracy of the model on unseen data.
The different resampling methods are available in the caret package. Look up the help on the **createDataPartition()**, **trainControl()** and **train()** functions in R.
- Split a dataset into training and test sets.
- Estimate the accuracy of an algorithm using k-fold cross validation.
- Estimate the accuracy of an algorithm using repeated k-fold cross validation.

In [None]:
# define training control
trainControl <- trainControl(method="cv", number=10)
# estimate the accuracy of Naive Bayes on the dataset
fit <- train(diabetes~., data=PimaIndiansDiabetes, trControl=trainControl, method="nb")
# summarize the estimated accuracy
print(fit)
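
The cell above uses plain 10-fold cross validation. A sketch of the repeated k-fold variant follows. It swaps in logistic regression (`method="glm"`) only so that no extra package is needed. The three repeats, the seed, and the names `repeatedControl` and `fitRepeated` are assumptions for illustration.

```r
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# 10-fold cross validation, repeated 3 times with different fold assignments
repeatedControl <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(7)
fitRepeated <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=repeatedControl)
# the reported accuracy is averaged over all 30 held-out folds
print(fitRepeated)
```

Repeating the procedure costs more compute but usually gives a less noisy accuracy estimate than a single pass of k-fold cross validation.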

### Algorithm Evaluation Metrics
There are many different metrics that you can use to evaluate the skill of a machine learning algorithm on a dataset.

You can specify the metric for your test harness with the metric argument of caret's **train()** function. By default, caret uses Accuracy for classification problems and RMSE for regression problems.

- Practice using the Accuracy and Kappa metrics on a classification problem (e.g. iris dataset).
- Practice using the RMSE and Rsquared metrics on a regression problem (e.g. longley dataset).
- Practice using the ROC metrics on a binary classification problem (e.g. PimaIndiansDiabetes dataset from the mlbench package).

In [None]:
# prepare 5-fold cross validation and keep the class probabilities
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
# estimate the skill of the CART algorithm using the logLoss metric
fit <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="logLoss", trControl=control)
# display results
print(fit)
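
The cell above evaluates with logLoss. For the ROC bullet, one possible setup is sketched below. It relies on caret's twoClassSummary() summary function, which reports the area under the ROC curve together with sensitivity and specificity. The choice of logistic regression and the names `rocControl` and `fitRoc` are assumptions.

```r
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# 5-fold cross validation that keeps class probabilities for the ROC calculation
rocControl <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=twoClassSummary)
set.seed(7)
# select and report the model by area under the ROC curve
fitRoc <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="ROC", trControl=rocControl)
print(fitRoc)
```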

### Spot-Check Algorithms

We have to discover which algorithm will perform best on our data through a process of trial and error, called spot-checking algorithms. The caret package provides an interface to many machine learning algorithms, along with tools to compare their estimated accuracy.
- Spot check linear algorithms on a dataset (e.g. linear regression, logistic regression and linear discriminant analysis).
- Spot check some non-linear algorithms on a dataset (e.g. KNN, SVM and CART).
- Spot-check some sophisticated ensemble algorithms on a dataset (e.g. random forest and stochastic gradient boosting).

**Help:** We can get a list of models that we can use in caret by typing: **names(getModelInfo())**

In [None]:
# prepare 10-fold cross validation
trainControl <- trainControl(method="cv", number=10)
# estimate accuracy of logistic regression
set.seed(7)
fit.lr <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=trainControl)
# estimate accuracy of linear discriminant analysis
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=trainControl)
# collect resampling statistics
results <- resamples(list(LR=fit.lr, LDA=fit.lda))
# summarize results
summary(results)
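
The cell above spot-checks two linear algorithms. The same harness can be reused for non-linear ones, as in this sketch with KNN and CART. The names `control`, `fit.knn`, `fit.cart` and `nonlinear` are illustrative assumptions. The seed is reset before each call so both models are evaluated on the same folds.

```r
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# prepare 10-fold cross validation
control <- trainControl(method="cv", number=10)
# estimate accuracy of k-nearest neighbours
set.seed(7)
fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", trControl=control)
# estimate accuracy of CART
set.seed(7)
fit.cart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", trControl=control)
# collect and summarize resampling statistics
nonlinear <- resamples(list(KNN=fit.knn, CART=fit.cart))
summary(nonlinear)
```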

### Model Comparison and Selection
Now that we know how to spot check machine learning algorithms on our dataset, we need to know how to compare the estimated performance of different algorithms and select the best model.

The caret package provides a suite of tools to plot and summarize the differences in performance between models.
- Use the summary() caret function to create a table of results.
- Use the dotplot() caret function to compare results.
- Use the bwplot() caret function to compare results.
- Use the diff() caret function to calculate the statistical significance of the differences between results.

In [None]:
# plot the results
dotplot(results)
bwplot(results)

In [None]:
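# calculate the statistical significance of the differences between models
# (summary() reports the estimated differences and their p-values)
summary(diff(results))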

## Improve Accuracy

### Algorithm Tuning
Once we have found one or two algorithms that perform well on our dataset, we may want to improve the performance of those models. One way to increase the performance of an algorithm is to tune its parameters to our specific dataset.

The caret package provides three ways to search for combinations of parameters for a machine learning algorithm.

- Tune the parameters of an algorithm automatically (e.g. see the tuneLength argument to train()).
- Tune the parameters of an algorithm using a grid search that we specify.
- Tune the parameters of an algorithm using a random search.

Take a look at the help for the **trainControl()** and **train()** functions and take note of the method and the tuneGrid arguments.

In [None]:
# load the library
library(caret)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# define training control
trainControl <- trainControl(method="cv", number=10)
# define a grid of mtry values to search for random forest (the dataset has 8 predictors)
grid <- expand.grid(.mtry=1:8)
# estimate the accuracy of Random Forest on the dataset
fit <- train(diabetes~., data=PimaIndiansDiabetes, trControl=trainControl, tuneGrid=grid, method="rf")
# summarize the estimated accuracy
print(fit)
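
The cell above shows the grid search only. The automatic and random search options might be sketched as follows, using KNN to keep the run short. The choice of algorithm, the `tuneLength` of 5, and the names `gridControl`, `randomControl`, `fitAuto` and `fitRandom` are assumptions for illustration.

```r
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# automatic tuning: caret builds its own grid of 5 candidate values of k
gridControl <- trainControl(method="cv", number=5)
set.seed(7)
fitAuto <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", tuneLength=5, trControl=gridControl)
print(fitAuto$results[,c("k", "Accuracy")])

# random search: caret samples candidate values of k at random
randomControl <- trainControl(method="cv", number=5, search="random")
set.seed(7)
fitRandom <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", tuneLength=5, trControl=randomControl)
# show the best value of k found by the random search
print(fitRandom$bestTune)
```

With random search, `tuneLength` sets how many candidate parameter combinations are drawn rather than the size of a regular grid.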

### Ensemble Predictions
Another way that we can improve the performance of our models is to combine the predictions from multiple models.

Some models provide this capability built-in such as **random forest** for *bagging* and **stochastic gradient boosting** for *boosting*. Another type of ensembling called **stacking (or blending)** can learn how to best combine the predictions from multiple models and is provided in the package *caretEnsemble.*

- Bagging ensembles with the random forest and bagged CART algorithms in caret.
- Boosting ensembles with the gradient boosting machine and C5.0 algorithms in caret.
- Stacking ensembles using the caretEnsemble package and the caretStack() function.

In [None]:
# Load packages
library(mlbench)
library(caret)
library(caretEnsemble)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# create sub-models
trainControl <- trainControl(method="cv", number=5, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('knn', 'glm')
set.seed(7)
models <- caretList(diabetes~., data=PimaIndiansDiabetes, trControl=trainControl, methodList=algorithmList)
print(models)
# learn how to best combine the predictions
stackControl <- trainControl(method="cv", number=5, savePredictions=TRUE, classProbs=TRUE)
set.seed(7)
stack.glm <- caretStack(models, method="glm", trControl=stackControl)
print(stack.glm)

## Finalize And Save Model
The following tasks relate to finalizing our model:

- Using the predict() function to make predictions with a model trained using caret.
- Training standalone versions of well performing models.
- Saving trained models to file and loading them up again using the saveRDS() and readRDS() functions.

In [None]:
# load package
library(randomForest)
# load the Pima Indians Diabetes dataset
data(PimaIndiansDiabetes)
# train random forest model
finalModel <- randomForest(diabetes~., PimaIndiansDiabetes, mtry=2, ntree=2000)
# display the details of the final model
print(finalModel)
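
The cell above trains the standalone model. Saving it, loading it again and making predictions could look like the sketch below. The temporary file path, the smaller `ntree` value (chosen to keep the example quick) and the names `model`, `modelPath`, `restored` and `predictions` are assumptions. In practice the new data would be unseen rows, not rows taken from the training set.

```r
library(mlbench)
library(randomForest)
data(PimaIndiansDiabetes)

# train a standalone random forest model
set.seed(7)
model <- randomForest(diabetes~., PimaIndiansDiabetes, mtry=2, ntree=500)

# save the trained model to a file
modelPath <- file.path(tempdir(), "finalModel.rds")
saveRDS(model, modelPath)

# later: load the model and predict the class of some input rows
restored <- readRDS(modelPath)
predictions <- predict(restored, PimaIndiansDiabetes[1:5,1:8])
print(predictions)
```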