---
title: "The Dangers(?) of Improperly Centering and Scaling Your Data"
author: "Joel Carlson"
date: "August 15, 2015"
output: pdf_document
permalink: /reproducible/centrescale/
---

Recently I was at a conference where many of the presentations involved some sort of machine learning. One thing I noticed was that the speakers often made no mention of taking care to keep information from leaking between their training and test sets.

For example, for certain machine learning algorithms, such as support vector machines, centering and scaling your data is essential for the algorithm to perform well. Centering is the process of subtracting the mean of the data from each observation so that the mean of the data is zero. Scaling divides each centered observation by the standard deviation so that the data has a standard deviation of one. If you center and scale your data before splitting it into training and test sets, then the mean and standard deviation used to transform your training set were computed partly from your test set (and vice versa). This is a problem. Or could be, at least.
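
To make the two operations concrete, here is a minimal sketch using a small made-up vector (the values are hypothetical, chosen only for illustration). It centers and scales by hand and checks the result against base R's `scale()`, which by default subtracts the mean and divides by the sample standard deviation.

```r
# Hypothetical measurements, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Centering: subtract the mean so the result has mean zero
x_centered <- x - mean(x)

# Scaling: divide by the sample standard deviation so the result has SD one
x_scaled <- x_centered / sd(x)

# Base R's scale() performs both steps; it returns a one-column matrix
x_builtin <- as.numeric(scale(x))

print(all.equal(x_scaled, x_builtin))
print(c(mean = mean(x_scaled), sd = sd(x_scaled)))
```

Note that the mean and standard deviation are the only pieces of information the transformation needs, which is why it matters which rows they are computed from.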

This concerned me at the conference, but I didn't speak up about it: although I had been instructed to avoid this mistake, and it made intuitive sense to me, I had never actually tested whether there was a significant difference between the two methods.

In this post, I am going to explore this by assessing the accuracy of support vector machine models in three situations:

  - Not centered or scaled
  - Training and testing centered and scaled together
  - Training and testing centered and scaled separately
  
For each situation we will train 1000 svm models using the default parameters from the `e1071` R package. Datawise, we will use a set from the [UCI machine learning repository](http://archive.ics.uci.edu/ml/datasets/Wine). The dataset is a series of wine measurements, quantifying things like "Alcohol Content", "Color Intensity", and "Hue" for 178 wines. The goal is to predict the region the wine came from based on the measurements. 

There are actually 13 variables in the data, but we are going to use only the 5 with the highest standard deviation, since this is for demonstration purposes, not pure predictive accuracy.

##Summarizing the data

```{r, message=FALSE, warning=FALSE, echo=FALSE}
#Enter the Hadleyverse
library(dplyr);library(reshape2);library(ggplot2); library(knitr);
library(ggthemes);library(e1071); library(caret); library(scales)

#Globals
test_len <- 1000
DATA_PATH <-"http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"

wine <- read.csv(DATA_PATH, header=FALSE)

#Colnames are from the documentation
colnames(wine) <- c("Region", "Alcohol", "Acid","Ash",
                    "Alcalinity","Magnesium","Phenols",
                    "Flavanoids","NonFlavanoidPhenols",
                    "Proanthocyanins","ColorIntensity","Hue",
                    "OpticalDensity","Proline")

wine <- as.data.frame(apply(wine, 2, as.numeric))
wine$Region <- as.factor(wine$Region)
```
 
Let's take a quick look at the mean and standard deviation of each column in the data (sorted by SD):

```{r,echo=FALSE}
kable( wine %>% 
         melt(id.vars="Region", variable.name="Variable") %>%
         group_by(Variable) %>% 
         summarize("Mean" = round(mean(value), 2), "SD" = round(sd(value), 2)) %>% 
         arrange(desc(SD)),
       format="html"
     )
```

As previously stated, we are only going to use the five variables with the highest standard deviation (and alcohol, because hey, we're talking about wine!).

```{r, echo=FALSE}
wine <- select(wine, Region, Alcohol, Proline, Magnesium, Alcalinity, ColorIntensity, Acid)
```
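
Rather than typing the column names by hand, the "highest standard deviation" selection could also be computed. The sketch below is one way to do it; the helper name `top_sd_columns` is hypothetical, and the built-in `mtcars` data stands in for the wine data so the block runs without a download.

```r
# Return the names of the n numeric columns with the largest standard deviation
top_sd_columns <- function(df, n = 5) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  sds <- vapply(numeric_cols, sd, numeric(1))
  names(sort(sds, decreasing = TRUE))[seq_len(min(n, length(sds)))]
}

# Demonstration on the built-in mtcars data (a stand-in for the wine data)
keep <- top_sd_columns(mtcars, n = 5)
print(keep)
```

Factor columns such as `Region` are skipped by the `is.numeric` filter, so the class label would need to be added back to the selection separately.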

##A bit of intuition

Just so that we have a feel for the data, let's plot a variable or two. We might be able to guess that there is some relationship between alcohol content and region:

```{r, echo=FALSE}
ggplot(data=wine, aes(x=Region, y=Alcohol, fill=Region)) + 
  geom_boxplot(outlier.size = 0) +
  theme_hc() +
  scale_fill_hc() +
  labs(x = "Region",y ="Alcohol Content (%)") + 
  guides(fill=FALSE)

```

And indeed there is, which is not a big surprise.

Wikipedia tells me that, during brewing, [Proline](https://en.wikipedia.org/wiki/Proline) may produce haze, so perhaps it would be related to Color Intensity?

```{r, echo=FALSE}
ggplot(data=wine, aes(x=ColorIntensity, y=Proline, col=Region)) + 
  geom_point() +
  theme_hc() +
  scale_colour_hc() +
  labs(x = "Color Intensity",y ="Proline Content") + 
  guides(fill=FALSE)
```

As expected. It looks like these variables are going to be able to feed a pretty accurate model. Based on the way the second plot is organized into clusters I imagine even a simple approach like KNN would do well.

##Building the models

The features of this data vary wildly in scale. For example, the `Proline` column ranges from 278 to 1680, whereas the `Acid` column goes from 0.74 to 5.80. This is an issue for support vector machines (and some other algorithms), so we need to center and scale the data.
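
A rough way to see why this matters: the radial kernel that `e1071::svm()` uses by default depends on the squared Euclidean distance between observations, so a feature measured in the hundreds swamps one measured in single digits. The sketch below uses two hypothetical wines whose values sit at the ends of the ranges quoted above; they are illustrative, not rows from the dataset.

```r
# Two hypothetical wines at the extremes of the ranges quoted in the text
a <- c(Proline = 278,  Acid = 0.74)
b <- c(Proline = 1680, Acid = 5.80)

# Contribution of each feature to the squared Euclidean distance
sq_diff <- (a - b)^2
share <- sq_diff / sum(sq_diff)

print(round(share, 5))
```

On the raw scale, `Proline` accounts for essentially all of the distance, so `Acid` has almost no say in which observations look similar to the model.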

For each situation described above, I trained 1000 svm models (with default parameters) and recorded the accuracy on a held out test set. Each model was trained on a random subset of 50% of the data, and tested on the remaining 50%. The test set accuracy distributions in each situation are presented as histograms.
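
The resampling step can be sketched in base R as follows. This is a simplified stand-in for `caret::createDataPartition()`, which the code in this post uses; the helper name `stratified_half` is hypothetical, and `iris` substitutes for the wine data so the block runs on its own.

```r
# Draw a stratified sample of row indices, class by class
stratified_half <- function(y, p = 0.5) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)  # fixed seed so the split is repeatable
in_train <- stratified_half(iris$Species)
train <- iris[in_train, ]
test  <- iris[-in_train, ]

print(table(train$Species))
print(table(test$Species))
```

Sampling within each class keeps the class proportions roughly equal in the two halves, which matters for a dataset this small.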

###No centering and scaling

To demonstrate the importance of centering and scaling for support vector machines, the models were trained on the raw data.

```{r, echo=FALSE, warning=FALSE}

noSC <- data.frame("name"="NoSC", "Accuracy"=numeric(test_len), stringsAsFactors=FALSE)
for( i in 1:(test_len)){
  #if(i%%100 == 0) print(i)
  set.seed(i)  # seed depends only on the loop index, so reruns give the same splits
  inTrain <- createDataPartition(y=wine$Region, p=0.50)
  train <- wine[inTrain$Resample1,]
  test <- wine[-inTrain$Resample1,]
  noSCmod <- svm(Region ~ . , data=train, scale=FALSE)
  preds <- predict(noSCmod, test) 
  levels(preds) <- levels(test$Region)
  
  acc <- confusionMatrix( test$Region, preds)
  noSC[i,"Accuracy"] <- acc$overall["Accuracy"]
}

ggplot(data=noSC, aes(x=Accuracy)) + 
  theme_hc()+
  geom_histogram(aes(y=..count../sum(..count..)), binwidth=diff(range(noSC$Accuracy))/length(table(noSC$Accuracy))) +
  scale_y_continuous(labels = percent) +
  scale_x_continuous(labels = percent) +
  labs(y="Percent") + 
  ggtitle(paste("No Centering or Scaling\nMean Accuracy =", round(mean(noSC$Accuracy), 3), "\nSD =",  round(sd(noSC$Accuracy), 3)))

```

As we can see, the accuracy is pretty abysmal. There are three categories, so the mean of `r round(mean(noSC$Accuracy),3)` is better than guessing, but not by much. Of course, this was expected: centering and scaling are a necessary step for SVMs.
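
For reference, accuracy here is simply the fraction of test observations whose predicted class matches the true class. A minimal sketch with made-up labels (hypothetical values, not output from the models above):

```r
truth <- factor(c(1, 1, 2, 2, 3, 3, 3, 2), levels = 1:3)
preds <- factor(c(1, 2, 2, 2, 3, 1, 3, 2), levels = 1:3)

# Confusion table: rows are predictions, columns are the true classes
conf <- table(Predicted = preds, Actual = truth)

# Accuracy is the diagonal (correct predictions) over the total
accuracy <- sum(diag(conf)) / sum(conf)

# Uniform random guessing over k classes is right 1/k of the time on average
chance <- 1 / nlevels(truth)

print(conf)
print(c(accuracy = accuracy, chance = chance))
```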

###Scaling Training and Test together

This time we scale the training and test together, and then split them apart for training of the model. This means that there is some information from the test set leaking into the training set, and vice versa.
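
To see concretely what leaks: when the full dataset is scaled first, the center and scale applied to the training rows are computed from all rows, test rows included. The sketch below compares those parameters with the ones obtained from the training rows alone. It uses `iris` as a stand-in for the wine data and a plain (unstratified) random split, both of which are simplifying assumptions.

```r
set.seed(42)  # arbitrary seed for a repeatable split
features <- iris[, 1:4]
in_train <- sample(nrow(features), size = nrow(features) / 2)

# Parameters when scaling everything together (test rows contribute)
full_center <- colMeans(features)
full_scale  <- apply(features, 2, sd)

# Parameters when only the training rows are used
train_center <- colMeans(features[in_train, ])
train_scale  <- apply(features[in_train, ], 2, sd)

# The gap between the two is what the test rows contributed
print(round(full_center - train_center, 3))
print(round(full_scale - train_scale, 3))
```

With a 50/50 split of a reasonably homogeneous dataset the gaps tend to be small, which hints at why the effect on accuracy may be modest.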

```{r, warning=FALSE, echo=FALSE}
togetherSC <- data.frame("name"="Together","Accuracy"=numeric(test_len), stringsAsFactors=FALSE)
for( i in 1:test_len){
  #if(i%%100 == 0) print(i)
  set.seed(i)  # seed depends only on the loop index, so reruns give the same splits

  dat_scaled <- data.frame(scale(wine[,-1]), "Region"=wine[,1])
  inTrain <- createDataPartition(y=dat_scaled$Region, p=0.50)
  train <- dat_scaled[inTrain$Resample1,]
  test <- dat_scaled[-inTrain$Resample1,]
  togetherSCmod <- svm(Region ~ . , data=train, scale=FALSE)
  preds <- predict(togetherSCmod, test) 
  levels(preds) <- levels(test$Region)
  
  acc <- confusionMatrix( test$Region, preds)
  togetherSC[i, "Accuracy"] <- acc$overall["Accuracy"]
}

ggplot(data=togetherSC, aes(x=Accuracy)) + 
  theme_hc()+
  geom_histogram(aes(y=..count../sum(..count..)), binwidth=diff(range(togetherSC$Accuracy))/length(table(togetherSC$Accuracy))) + 
  scale_y_continuous(labels = percent) +
  scale_x_continuous(labels = percent) +
  labs(y="Percent") + 
  ggtitle(paste("Centered and Scaled Together",
                "\nMean Accuracy =", round(mean(togetherSC$Accuracy), 3),
                "\nSD =",  round(sd(togetherSC$Accuracy), 3)))

```

Clearly that made the difference: accuracy is nearly perfect.

###Scaling separately (the right way)

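Well, this is the right way. This time we split the data first and then center and scale the training and test sets separately, so neither set's mean or standard deviation is influenced by the other.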

```{r, warning=FALSE, echo=FALSE}

separateSC <- data.frame("name"="Separate", "Accuracy"=numeric(test_len), stringsAsFactors=FALSE)
for( i in 1:test_len){
  #if(i%%100 == 0) print(i)
  set.seed(i)  # seed depends only on the loop index, so reruns give the same splits
  inTrain <- createDataPartition(y=wine$Region, p=0.50)

  train <- wine[inTrain$Resample1,]
  train_scaled <- data.frame(scale(train[,-1]), "Region"=train[,1])

  test <- wine[-inTrain$Resample1,]
  test_scaled <- data.frame(scale(test[,-1]), "Region"=test[,1])

  separateSCmod <- svm(Region ~ . , data=train_scaled, scale=FALSE)
  preds <- predict(separateSCmod, test_scaled) 
  levels(preds) <- levels(test_scaled$Region)
  acc <- confusionMatrix( test_scaled$Region, preds)
  separateSC[i, "Accuracy"] <- acc$overall["Accuracy"]
}

ggplot(data=separateSC, aes(x=Accuracy)) + 
  theme_hc()+
  geom_histogram(aes(y=..count../sum(..count..)), binwidth=diff(range(separateSC$Accuracy))/length(table(separateSC$Accuracy))) + 
  scale_y_continuous(labels = percent) +
  scale_x_continuous(labels = percent) +
  labs(y="Percent") + 
  ggtitle(paste("Centered and Scaled Separately",
                "\nMean Accuracy =", round(mean(separateSC$Accuracy), 3),
                "\nSD =", round(sd(separateSC$Accuracy), 3)))

```

So, it appears that there is just the teensiest increase in accuracy if we do it the right way. Is it statistically significant, though?

###Statistical Significance

We can assess whether or not there is a difference in the mean accuracy between scaling together and scaling separately by using a t-test. From the histograms we can see that the accuracy scores are very roughly normal, so t-tests are appropriate here.

```{r,echo=FALSE}
z <- t.test(separateSC$Accuracy,togetherSC$Accuracy)
z
```
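
When given two vectors, `t.test()` performs Welch's unequal-variance test by default. The sketch below runs it on simulated accuracy scores (random numbers, not the results above) and pulls out the same fields that the plot title further down relies on.

```r
set.seed(2015)  # arbitrary seed; the data below are simulated
acc_a <- rnorm(1000, mean = 0.955, sd = 0.02)
acc_b <- rnorm(1000, mean = 0.950, sd = 0.02)

res <- t.test(acc_a, acc_b)  # Welch two-sample t-test by default

# Difference in sample means, its p-value, and the confidence interval
diff_in_means <- unname(res$estimate[1] - res$estimate[2])
print(round(diff_in_means, 5))
print(res$p.value)
print(res$conf.int)
```

With 1000 observations per group, even a very small difference in means can reach significance, so the size of the estimate deserves as much attention as the p-value.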

```{r,echo=FALSE, warning=FALSE}

all <- rbind(separateSC, togetherSC)
ggplot(data=all, aes(x=name, y=Accuracy,fill=name)) +
  theme_hc() +
  geom_boxplot(outlier.size = 0) +
  guides(fill=FALSE) +
  labs(x="") + 
  ylim(0.875, 1) + 
  ggtitle(paste("Difference in Mean Accuracy from Scaling and Centering\n",
                "Estimate =", round(z$estimate[1] - z$estimate[2], 5),
                "(p =", round(z$p.value, 5), ")"))
```

The difference is extremely small, but it exists and is statistically significant.

##Conclusion

As we saw, centering and scaling the data the proper way actually leads, for this dataset, to a tiny but statistically significant **increase** in the mean accuracy of the svm models. I'm surprised; I thought it was going to be the other way around! It would be interesting to see how the size of the dataset and the number of variables in the model impact this difference.

Of course, this was no fully rigorous test, but I think this adds to the evidence that you should always center and scale your training and test data separately (not least because it's the correct way!).
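
One variant that was not tested in this post, and that may be worth comparing in a follow-up, is to estimate the center and scale on the training set only and then apply those same parameters to the test set. The sketch below shows the mechanics using the attributes that base R's `scale()` attaches to its result. `iris` is an assumed stand-in for the wine data, and the split is a plain random one.

```r
set.seed(7)  # arbitrary seed for a repeatable split
features <- as.matrix(iris[, 1:4])
in_train <- sample(nrow(features), size = nrow(features) / 2)

# Estimate center and scale on the training rows only
train_scaled <- scale(features[in_train, ])
center <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test rows
test_scaled <- scale(features[-in_train, ], center = center, scale = spread)

# Training columns have mean 0 by construction; test columns only roughly
print(round(colMeans(train_scaled), 3))
print(round(colMeans(test_scaled), 3))
```

Under this scheme the test set contributes nothing to the transformation, and a single new observation can be transformed without needing a whole batch to compute a mean from.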