analysis/rf_exploration.Rmd

---
title: "Random Forest"
author: Darius Goergen
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: true
editor_options:
  chunk_output_type: console
bibliography: library.bib
csl: elsevier-harvard.csl
link-citations: yes
---

```{r setup, include=FALSE, warning=FALSE, message=FALSE}
source("code/setup_website.R")
source("code/functions.R")
```

## Overview
Random Forest (RF) is a machine-learning algorithm which is based on the 
concept of traditional decision trees. Its popular implementation was
developed by @Breiman2001. It represents an ensemble classifier which has been 
reported to be the primary choice between different types of ensemble classifiers due to 
its easy handling and high classification accuracy [@Sagi2018]. It is based on 
a non-parametric classification method where each branch of a tree decides on randomly 
chosen variables to split the input data into finer sub-categories. 
In the case of RF, a user-specified number of trees is grown. The final class 
decision for an observation is made by a simple majority vote from all trees
in the forest. Internal accuracy assessment of the RF classifier is traditionally
obtained through an out-of-bag (OOB) error estimation. In many cases, this error estimation
is considered to be robust enough, so that no independent validation dataset is used
to test the generalization capacity. Here, we kept the number of trees fixed at 
500 since above that threshold no substantial gain in accuracy was observed. 
We also evaluated the RF model by a randomly chosen validation set of 50% of the 
original database. 


## Adding Noise

The representation of the data presented to RF is of high importance here.
Different techniques of data pre-processing might emphasize different features of 
the patterns to be learned by an algorithm. To grasp this, different transformations
of the data were presented to the RF algorithm. Additionally, the raw data signal
was jittered to test which transformation might prove beneficial in delivering 
a high classification accuracy even in the presence of noise. To test this, 
we define a function which adds noise to the raw data and returns a list with
the number of elements equal to the levels of noise applied.

```{r rf-addNoise}
addNoise = function(data, levels = c(0), category="class"){
  data.return = list()
  index = which(names(data) == category)
  for (n in levels){
    tmp = as.matrix(data[ , -index])
    if (n == 0){
      tmp = data
    }else{
      tmp = as.data.frame(jitter(tmp, n))
      tmp[category] = data[category]
    }
    data.return[[paste("noise", n, sep="")]] = tmp
  }
  return(data.return)
}

data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)
noisy_data = addNoise(data, levels = c(0,10,100,250,500), category = "class")

# indivitual elements can be selected by using [[ and refering to the index or the name
head(noisy_data[["noise100"]])[1:3,1:3]
```

## Data Pre-processing

In another function which uses the `noisy_data`-object as input,
specific data transformations are applied. These is normalization which centers
and scales the input data, different forms of the Savitzkiy-Golay filter [@Savitzky1964],
and first and second derivatives of the raw spectra. The function iterates through
the elements in the `noisy_data` object and returns each specified 
transformation in a list element below the noise level. The exemplary function below
applies the pre-processing for normalization, standard filtering and first 
derivative only. The implementation of the function used in the project
can be found [here](https://github.com/goergen95/polymeRID/blob/master/code/functions.R#12).

```{r rf-TrainingSet}
createTrainingSet = function(data, category = "class",
                             SGpara = list(p=3,w=11), lag = 15){
  
  data.return = list()
  for (noise in names(data)){
    
    tmp = as.data.frame(data[[noise]])
    classes = tmp[,category]
    tmp = tmp[!names(tmp) %in% category]
    
    # original data
    data.return[[noise]][["raw"]] = as.data.frame(data[[noise]])
    
    # normalised data
    data_norm = preprocess(tmp, type="norm")
    data_norm[category] = classes
    data.return[[noise]][["norm"]] = data_norm
    
    # SG-filtered data
    data_sg = preprocess(tmp, type="sg", SGpara = SGpara)
    data_sg[category] = classes
    data.return[[noise]][["sg"]] = data_sg
    
    # first derivative of original data
    data_rawd1 = preprocess(tmp, type="raw.d1", lag = lag)
    data_rawd1[category] = classes
    data.return[[noise]][["raw.d1"]] = data_rawd1
    
  }
  return(data.return)
}

# applying the function
test_dataset = createTrainingSet(noisy_data, category = "class")

# individual transformations at a certain noise level can be accessed with [[
head(test_dataset[["noise500"]][["raw.d1"]])[1:3,1:3]
```

## Dimensionality Reduction

The database of @Primpke2018 currently shows 1863 variables for each observation.
Most of these data points do not hold relevant information to distinguish between
different types of particles. To shorten the computation time, one can use 
dimensionality reduction techniques such as principal component analysis (PCA).
PCA has already been used to transform spectral data of microplastic in marine 
ecosystems [@Jung2018; @Lorenzo-Navarro2018]. PCA basically takes the input
data for a given number of observations and performs a orthogonal transformation
to  derive uncorrelated principal components from the possibly correlated variables.
Both redundancies in the data as well as the presence of noise can be accounted for this way. 
PCA has previously been successfully applied to FTIR-spectroscopy data [@Hori2003; @Nieuwoudt2004; @Mueller2013; @Ami2013; @Fu2014].
Simultaneously, the number of variables can be significantly reduced by applying
PCA and thus speeding up the training process. Below we apply a PCA to the
raw data as an example only.

```{r PCA-raw , message=FALSE}
library(factoextra)
tmp = test_dataset[["noise0"]][["raw"]]
pca = prcomp(tmp[ ,-1864]) # omitting class variable
var_info = factoextra::get_eigenvalue(pca)
# setting a threshold of 99% explained variance
threshold = 99
thresInd = which(var_info$cumulative.variance.percent>=threshold)[1]
pca_data = pca$x[,1:thresInd]
```
We can use the index variable `thresInd` we just defined to take a look upon
all the principal components which explain 99% of the variance of the data.

```{r pca-table, echo=FALSE}
library(knitr)
knitr::kable(var_info[1:thresInd,])
```

We effectively reduced the number of variables from 1683 to 15 which still bears 
99% of the variance we find in the original data. When it comes
to machine learning, however, it is important to realize that this new data set is not fit
to be used in a training process. If we now randomly split the observations into
a training and testing set, we effectively mix up these two sets because information of 
the testing set has already influenced the outcome of the PCA. Therefore, the data set
needs to be split before applying the PCA. The analysis is done on the training data 
only and then the same orthogonal transformations is applied to the testing data.
This way it can be ensured that the test set is truly independent of the 
training process.

## Cross Validation

We apply a 10-fold cross-validation which is repeated five times. The following 
code takes a complete data set as input, applies a splitting function from the
`caret` package and then builds the PCA upon the the training set and finally applies
the same transformation to the testing set. Here, it is only applied for the raw data.
We also randomly split the data to a 50% training and a 50% testing set.

```{r PCA_loop}
folds = 10
repeats = 5
split_percentage = 0.5
threshold = 99
tmp = test_dataset[["noise0"]][["raw"]]

set.seed(42) # ensure reproducibility
fold_index = lapply(1:repeats, caret::createDataPartition, y=tmp$class,
                   times = folds, p = split_percentage)
fold_index = do.call(c, fold_index)

pcaData = list()
for (rep in 1:repeats){
  rep_index = fold_index[(rep*folds-folds+1):(rep*folds)] # jumps to the correct number of folds forward in each repeat
  
  pcadata_fold = lapply(1:folds,function(x){
    
    # splitting for current fold
    training = tmp[unlist(rep_index[x]),]
    validation = tmp[-unlist(rep_index[x]),]
    
    # keep response
    responseTrain = training$class
    responseVal = validation$class
    
    # apply PCA
    pca = prcomp(training[,1:1863])
    varInfo = factoextra::get_eigenvalue(pca)
    thresInd = which(varInfo$cumulative.variance.percent >= threshold)[1]
    pca_training = pca$x[ ,1:thresInd]
    pca_validation = predict(pca, validation)[ ,1:thresInd]
    
    training = as.data.frame(pca_training)
    training$response = responseTrain
    validation = as.data.frame(pca_validation)
    validation$response = responseVal
    foldtmp = list(training, validation)
    names(foldtmp) = c("training","validation")
    return(foldtmp)
  })
  names(pcadata_fold) = paste("fold", 1:folds, sep ="")
  pcaData[[paste0("repeat",rep)]] = pcadata_fold
}

```
We now have a list object with the number of elements equivalent to the repeats.
Below the level of repeats individual folds can be accessed.
There the splitted data set can be accessed by referring to `"training"`
and `"testing"`.
```{r PCA_access}
pcaData[["repeat5"]][["fold10"]][["training"]][1:3,1:3]
pcaData[["repeat5"]][["fold10"]][["validation"]][1:3,1:3]

summary(pcaData[["repeat5"]][["fold10"]][["training"]]$response)
summary(pcaData[["repeat5"]][["fold10"]][["validation"]]$response)
```

## Parameter Tuning

For the RF algorithm only the parameter `mtry` needs a search pattern since we
hold the number of trees constant at a value of 500. `mtry` effectively specifies
the number of variables to look at each split within a tree. We took the square 
root of the number of variables divided by 3 as the first `mtry` value, the square
root itself as the second and the maximum number of variables as a third value.
The optimal parameter is then selected to build the final model and its performance is 
evaluated with the test data.

```{r}
training = pcaData[["repeat1"]][["fold1"]][["training"]]
validation =pcaData[["repeat1"]][["fold1"]][["validation"]]
x_train = training[ ,1:ncol(training)-1]
y_train = training$response
x_test = validation[ ,1:ncol(validation)-1]
y_test = validation$response


first = floor(sqrt(ncol(x_train)))/3
if(first <= 1) first <- 1
second = floor(sqrt(ncol(x_train)))
last = ncol(x_train)
mtries = c(first,second,last)

Mods = lapply(1:length(mtries),function(x){return(0)})
accuracy = c()
for (mtry in mtries){
  Mods[which(mtries == mtry)] = list(randomForest::randomForest(x_train, 
                                                                y_train, 
                                                                ntree=500,
                                                                mtry=mtry))
  
  pred = predict(Mods[[which(mtries==mtry)]], x_test)
  conf = caret::confusionMatrix(pred, y_test)
  accuracy = c(accuracy, conf$overall["Kappa"])
}

best_model = Mods[[which(accuracy == max(accuracy))[1]]]
prediction = predict(best_model, x_test)
confMat = caret::confusionMatrix(prediction, y_test)
print(confMat$table)

```


The process of splitting the data set into training and testing was automated by
putting the above code in a function which can be found [here](https://github.com/goergen95/polymeRID/blob/master/code/functions.R#239).
Finally, this function was applied to the different pre-procesesing levels, as discussed before.

```{r CV-apply, eval = FALSE}
source("code/functions.R")
wavenumbers = readRDS(paste0(ref,"wavenumbers.rds"))
# add noise to data
noisyData = addNoise(data,levels = c(0,10,100,250,500), category = "class")

# preprocessing
testDataset = createTrainingSet(noisyData, category = "class",
                                SGpara = list(p=3, w=11), lag=15,
                                type = c("raw", "norm", "sg", "sg.d1", "sg.d2",
                                  "sg.norm", "sg.norm.d1", "sg.norm.d2",
                                  "raw.d1", "raw.d2", "norm.d1", "norm.d2"))


types = names(testDataset[[1]])

levels = lapply(names(testDataset), function(x){
  rep(x, length(types))
})
levels = unlist(levels)

results = data.frame(level=levels,type = types, kappa = rep(0,length(levels)))

for (level in unique(levels)){
  for (type in types){

    print(paste0("Level: ",level," Type: ",type))
    tmpData = testDataset[[level]][[type]]
    tmpData[which(wavenumbers<=2420 & wavenumbers>=2200)] = 0 # setting C02 window to 0
    tmpModel = pcaCV(tmpData, folds = 10, repeats = 5, threshold = 99, metric = "Kappa", p=0.5, method="rf")
    saveRDS(tmpModel,file = paste0(output,"rf/model_",level,"_",type,"_",round(tmpModel[[1]],2),".rds"))
    results[which(results$level==level & results$type==type),"kappa"] = as.numeric(tmpModel[[1]])
    print(results)

  }
}
saveRDS(paste0(output,"rf/exploration.rds"))

```

```{r read-exploration, include = FALSE}
results = readRDS(paste0(output,"rf/exploration.rds"))
```

## Results

The plot below shows the Kappa scores the algorithm achieved during training
for different representations of the data at increasing noise levels.

```{r plot-results, echo = FALSE, warning=FALSE, message=FALSE}
library(plotly)
results$level = as.character(results$level)
# let's visualize the accuracy results with noise levels on x-axis and kappa score on y-axis
test = ggplot(data=results)+
  geom_line(aes(y=kappa,group=type,x=level,color=type),size=1.5)+
  ylab("Kappa score")+
  xlab("Noise level")+
  scale_color_discrete(name="Type of\nPre-Processing")+
  theme_minimal()

t = ggplotly(test)
t
```
It can be observed that with higher noise levels the Kappa score is reduced significantly.
However, there are some data transformations which are able to maintain a 
relatively high level of accuracy even in the presence of noise. One of the best
tranformations might be the simple Savitzkiy-Golay filter, as well as the same filter applied
to the normalized data. Equal robust results are observed for the raw and the 
normalized data. The other transformations do not show the same level of robustness
to noise. Looking at this more mathematically by calculating the average 
slopes of the Kappa scores reveals the transformations with the most stable results.

```{r slope-calculation}
noise0 = results[results$level == "noise0", ]
noise500 = results[results$level == "noise500", ]
slopes = noise500$kappa - noise0$kappa 
types = unique(results$type)
df = data.frame(type = types, slope = slopes, row.names = NULL)
df = df[order(-slopes),]
```
```{r print-slope-table, echo = FALSE}
knitr::kable(df)
```

This confirms that the average decrease in Kappa score is the lowest for
the Savitzkiy-Golay filter applied to the raw data, and the normalized data followed
by the raw data and the normalized data itself. The numbers can be interpreted as a loss in
the Kappa score when the noise level increases from 0 to 500. 


## Citations on this page