analysis/rf_exploration.Rmd

---
title: "Preparation"
author: Darius Goergen
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: false
editor_options:
  chunk_output_type: console
bibliography: library.bib
link-citations: yes
---

```{r setup, include=FALSE, warning=FALSE, message=FALSE}
source("code/setup_website.R")
source("code/functions.R")
```

Random Forest (RF) is a machine learning algorithm which is based on the 
traditional concept of decision trees. It's popular implementation was
developed by @Breiman2001. It represents an ensemble classifier which has been 
reported to be the primary choice between different ensemble classifiers due to 
its easy handling and high classification accuracies [@Sagi2018]. 
**More elaborations on scientific background**


Different levels of data reprocessing might emphasize different features of 
the patterns to be learned by an algorithm. To grasp this, different transformations
of the data were presented to the RF algorithm. Additionally, the raw data signal
was jittered to test which transformation might prove beneficial in delivering 
high classification accuracies even in the presence of noise. To test this, 
we define a function which adds noise to the raw data and returns a list with
the number of elements equal to the levels of noise applied.

```{r}
addNoise = function(data, levels = c(0), category="class"){
  data.return = list()
  index = which(names(data) == category)
  for (n in levels){
    tmp = as.matrix(data[ , -index])
    if (n == 0){
      tmp = data
    }else{
      tmp = as.data.frame(jitter(tmp, n))
      tmp[category] = data[category]
    }
    data.return[[paste("noise", n, sep="")]] = tmp
  }
  return(data.return)
}

data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)
noisy_data = addNoise(data, levels = c(0,10,100,250,500), category = "class")

# indivitual elements can be selected by using [[ and refering to the index or the name
head(noisy_data[["noise100"]])[1:3,1:3]
```

Then, in another user defined function which uses the `noisy_data` objected as input
specified data transformations are applied. These are normalization which centers
and scales the input data, as well as different forms of the Savitkiy-Golay filter [@Savitzky1964]
and first and second derivative of a raw spectrum. The functions iterates through
the noise level elements in the `noisy_data` object and returns each specified 
transformation in a list element below the noise level. The exemplary function below
applies the preprocessing for normalization, standard filtering and first 
derivative only. The implementation of the function used in the project
can be found [here](https://github.com/goergen95/polymeRID/blob/master/code/functions.R#12).

```{r}
createTrainingSet = function(data, category = "class",
                             SGpara = list(p=3,w=11), lag = 15){
  
  data.return = list()
  for (noise in names(data)){
    
    tmp = as.data.frame(data[[noise]])
    classes = tmp[,category]
    tmp = tmp[!names(tmp) %in% category]
    
    # original data
    data.return[[noise]][["raw"]] = as.data.frame(data[[noise]])
    
    # normalised data
    data_norm = preprocess(tmp, type="norm")
    data_norm[category] = classes
    data.return[[noise]][["norm"]] = data_norm
    
    # SG-filtered data
    data_sg = preprocess(tmp, type="sg", SGpara = SGpara)
    data_sg[category] = classes
    data.return[[noise]][["sg"]] = data_sg
    
    # first derivative of original data
    data_rawd1 = preprocess(tmp, type="raw.d1", lag = lag)
    data_rawd1[category] = classes
    data.return[[noise]][["raw.d1"]] = data_rawd1
    
  }
  return(data.return)
}

# applying the function
test_dataset = createTrainingSet(noisy_data, category = "class")

# individual transformations at a certain noise level can be accessed with [[
head(test_dataset[["noise500"]][["raw.d1"]])[1:3,1:3]
```

The data base of [@Primpke2018] currently shows 1863 variables for each observations.
Most of these data points do not bear relevant information to distinguish between
different types of particles. To shorten the computation time, one can use 
dimensionality reduction techniques such as principal component analysis (PCA).
PCA also has been used to transform spectral data of micro-plastics in marine 
ecosystems before [@Jung2018;@Lorenzo-Navarro2018]. PCA basically takes the input
data for a given number of observation and by performing a orthogonal transformation
to the data transforms these possible correlated variables to uncorrelated 
principal components. This way, both redundancies in the data as well as possible
noise can be accounted for. PCA previously has been successfully applied to
FTIR-spectrometer data [@Hori2003;@Nieuwoudt2004;@Mueller2013;@Ami2013;Fu2014].
Simultaniously, the number of variables can be significantly reduced by applying
PCA and thus speeding up the training process. Below we will apply a PCA to the
raw data as an example.
```{r PCA-raw , message=FALSE}
library(factoextra)
tmp = test_dataset[["noise0"]][["raw"]]
pca = prcomp(tmp[ ,-1864]) # omitting class variable
var_info = factoextra::get_eigenvalue(pca)
# setting a threshold of 99% explained variance
threshold = 99
thresInd = which(var_info$cumulative.variance.percent>=threshold)[1]
pca_data = pca$x[,1:thresInd]
```
We can use the index variable `thresInd` we just defined to take a look upon
all the principal components which explain 99% of the variance in the data set.

```{r pca-table, echo=FALSE}
library(knitr)
knitr::kable(var_info[1:thresInd,])
```

We effectivly reduced the number of variables from 1683 to 15 which still bear 
99% of the variance we can find in the original data set. However, when it comes
to machine learning, it is important to realise that this new dataset is not fit
to be used in a training process. If we now randomly split the observations into
training and test, we effectivly mix up these two sets because information of 
the test set has already influenced the outcome of the PCA. Therefor, the data set
need to be split beforhand of the PCA. The analysis is done on the training data 
only and then the same orthogonal transformations will be applied to the test data.
This way it can be ensured that the test set is truely independent from the 
training process.
Here, we apply a 10-fold cross-validation which is repeated 5 times. The following 
code takes a complete data set as input, applies a splitting function from the
`caret` package and then builds the PCA upon the the test set and finally applies
the same transformation to the test set. We apply it for the raw data only.
Also, we randomly split the data to 50% training and 50% test.

```{r PCA_loop}
folds = 10
repeats = 5
split_percentage = 0.5
threshold = 99
tmp = test_dataset[["noise0"]][["raw"]]

set.seed(42) # ensure reproducibility
fold_index = lapply(1:repeats, caret::createDataPartition, y=tmp$class,
                   times = folds, p = split_percentage)
fold_index = do.call(c, fold_index)

pcaData = list()
for (rep in 1:repeats){
  rep_index = fold_index[(rep*folds-folds+1):(rep*folds)] # jumps to the correct number of folds forward in each repeat
  
  pcadata_fold = lapply(1:folds,function(x){
    
    # splitting for current fold
    training = tmp[unlist(rep_index[x]),]
    validation = tmp[-unlist(rep_index[x]),]
    
    # keep response
    responseTrain = training$class
    responseVal = validation$class
    
    # apply PCA
    pca = prcomp(training[,1:1863])
    varInfo = factoextra::get_eigenvalue(pca)
    thresInd = which(varInfo$cumulative.variance.percent >= threshold)[1]
    pca_training = pca$x[ ,1:thresInd]
    pca_validation = predict(pca, validation)[ ,1:thresInd]
    
    training = as.data.frame(pca_training)
    training$response = responseTrain
    validation = as.data.frame(pca_validation)
    validation$response = responseVal
    foldtmp = list(training, validation)
    names(foldtmp) = c("training","validation")
    return(foldtmp)
  })
  names(pcadata_fold) = paste("fold", 1:folds, sep ="")
  pcaData[[paste0("repeat",rep)]] = pcadata_fold
}

```
We now have a list object with the number of elements equivalent to the repeats.
Below each repeat element we can access the individual folds.
There we find two elements which we can access by refering to `"training"`
and `"testing"`.
```{r PCA_access}
head(pcaData[["repeat5"]][["fold10"]][["training"]])
tail(pcaData[["repeat5"]][["fold10"]][["validation"]])

summary(pcaData[["repeat5"]][["fold10"]][["training"]]$response)
summary(pcaData[["repeat5"]][["fold10"]][["validation"]]$response)
```

This process of splitting the dataset into training and test was automised by
putting the above code in a function which can be found [here](https://github.com/goergen95/polymeRID/blob/master/code/functions.R#239).
Finally, we can apply this function to the different levels of preprocssing which
were discussed before.

```{r CV-apply, eval = FALSE}
source("code/functions.R")
wavenumbers = readRDS(paste0(ref,"wavenumbers.rds"))
# add noise to data
noisyData = addNoise(data,levels = c(0,10,100,250,500), category = "class")

# preprocessing
testDataset = createTrainingSet(noisyData, category = "class",
                                SGpara = list(p=3, w=11), lag=15,
                                type = c("raw", "norm", "sg", "sg.d1", "sg.d2",
                                  "sg.norm", "sg.norm.d1", "sg.norm.d2",
                                  "raw.d1", "raw.d2", "norm.d1", "norm.d2"))


types = names(testDataset[[1]])

levels = lapply(names(testDataset), function(x){
  rep(x, length(types))
})
levels = unlist(levels)

results = data.frame(level=levels,type = types, kappa = rep(0,length(levels)))

for (level in unique(levels)){
  for (type in types){

    print(paste0("Level: ",level," Type: ",type))
    tmpData = testDataset[[level]][[type]]
    tmpData[which(wavenumbers<=2420 & wavenumbers>=2200)] = 0 # setting C02 window to 0
    tmpModel = pcaCV(tmpData, folds = 10, repeats = 5, threshold = 99, metric = "Kappa", p=0.5, method="rf")
    saveRDS(tmpModel,file = paste0(output,"rf/model_",level,"_",type,"_",round(tmpModel[[1]],2),".rds"))
    results[which(results$level==level & results$type==type),"kappa"] = as.numeric(tmpModel[[1]])
    print(results)

  }
}

```

```{r read-exploration, include = FALSE}
results = readRDS(paste0(output,"rf/exploration.rds"))
```

We can now take a look at the kappa scores for the algorithm achived during training
with different representations of the data and noise levels.
```{r plot-results, echo = FALSE}
library(plotly)
results$level = as.character(results$level)
# let's visualize the accuracy results with noise levels on x-axis and kappa score on y-axis
test = ggplot(data=results)+
  geom_line(aes(y=kappa,group=type,x=level,color=type),size=1.5)+
  ylab("Kappa score")+
  xlab("Noise level")+
  scale_color_discrete(name="Type of\nPre-Processing")+
  theme_minimal()

t = ggplotly(test)
t
```
We can observe that with the more noise the kappa score is reduced significantly.
However, there are some data transformations which are able to maintain a 
relativly high level of accuracy even in the presence of noise. One of the best
might be the simple Savitzkiy-Golay filter, as well as the same filter applied
to the normalised data. Equal robust results are observed for the raw and the 
normalised data. The other transformations do not show the same level of robustness
to noise.