analysis/cnn_exploration.Rmd

---
title: "Convolutional Neural Network"
author: Darius Goergen
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: true
editor_options:
  chunk_output_type: console
bibliography: library.bib
csl: elsevier-harvard.csl
link-citations: yes
---

```{r setup, include=FALSE, warning=FALSE, message=FALSE}
source("code/setup_website.R")
source("code/functions.R")
```

## Overview

Convolutional Neural Networks (CNN) are mainly used in image processing 
tasks [@Rawat2017]. However, they can also be applied to one-dimensional data,
such as time series or spectral data [@Liu2017;@IsmailFawaz2019;@Ghosh2019;@Berisha2019].
They mainly consist of three different types of layers, which generally
are stacked into a sequential model for learning various patterns from the input data
to model the desired output. The most relevant layer is the convolutional layer,
which serves as an extractor for features found in the input [@Rawat2017]. They work
on a specificed number of neurons, or filters, each serving as a mapping function 
for a specific range in the input data, also referred to as the kernel size. 
Not only do they map features from the raw input data, but also detect features in 
the output of previous convolutional layers. This is achieved through adjusting 
the weights associated with each filter based on a non-linear activation function. 
Additionally, after some convolutional layers, pooling layers are most commonly 
included in CNNs. These layers reduce the feature maps of previous layers and 
are also associated with a function to choose which parameters are preserved. 
Nowadays, most commonly max-pooling layers are used, preserving the maximum signal 
from a feature map. Finally, most CNNs end with a fully-connected layer. 
These layers are used to transform the last feature map to the output. 
For regression problems, this layer may only contain a single neuron, while for 
classification problems it may contain `n`-neurons, each modelling the output of 
a specific class. The network learns through what is called "backpropagation". 
It means that the training data is repeatedly presented to the network, 
depending on an optimizer function the distance to the desired output is calculated.
Then, the weights associated with the filters are updated and a new epoch of 
presenting the training data to the CNN is started.

## Model Architecture

In contrast to random forest and support vector machines, the effect of noise on
the classification outcome was not tested. This is mainly due to limitations in computation time.
The computation time was significantly reduced by installing the `keras` package
in GPU mode based on the `CUDA` library from [Nvidia](https://developer.nvidia.com/cuda-zone). 
However, CNNs remain computational intensive, since depending on the architecture several thousands
of weights have to be trained. Here, we developed a simple two-block architecture
of four convolutional layers in total. The number of filters, or feature extractors,
increases with each layer by a factor of 2. We chose this network
architechture with four layers and only a small number of filters because it delivered
relatively high accuracies in short computation times. The code below
defines a function to set up and compile a CNN for a given kernel size.

```{r cnn-function}
# expects that you have installed keras and tensorflow properly
library(keras)

buildCNN <- function(kernel, nVariables, nOutcome){
  model = keras_model_sequential()
  model %>%
    # block 1
    layer_conv_1d(filters = 8,
                  kernel_size = kernel,
                  input_shape = c(nVariables,1),
                  name = "block1_conv1",) %>%
    layer_activation_relu(name="block1_relu1") %>%
    layer_conv_1d(filters = 16,
                  kernel_size = kernel,
                  name = "block1_conv2") %>%
    layer_activation_relu(name="block1_relu2") %>%
    layer_max_pooling_1d(strides=2,
                         pool_size = 5,
                         name="block1_max_pool1") %>%
    
    # block 2
    layer_conv_1d(filters = 32,
                  kernel_size = kernel,
                  name = "block2_conv1") %>%
    layer_activation_relu(name="block2_relu1") %>%
    layer_conv_1d(filters = 64,
                  kernel_size = kernel,
                  name = "block2_conv2") %>%
    layer_activation_relu(name="block2_relu2") %>%
    layer_max_pooling_1d(strides=2,
                         pool_size = 5,
                         name="block2_max_pool1") %>%
    
    # exit block
    layer_global_max_pooling_1d(name="exit_max_pool") %>%
    layer_dropout(rate=0.5) %>%
    layer_dense(units = nOutcome, activation = "softmax")
  
  # we compile for a classification with the categorcial crossentropy loss function
  # and use adam as optimizer function
  compile(model, loss="categorical_crossentropy", optimizer="adam", metrics="accuracy")
}

```

The function expects three arguments as input. The first is the kernel size which
specifies the width of the window, extracting features from the input data
and subsequent layer outputs. Note that the kernel size is held constant through out
the network. The second argument expects an integer representing the number of
input variables which relates to the amount of wavenumbers in the present case.
The third argument also expects an integer value, specifiying the number of 
classes in the output. Each convolutional layer is associated with a ReLU-activation
function. At the end of each block we added a max-pooling layer with `stride = 2`,
which takes the maximum values of its respective input and discards unrequired data points,
effectively reducing the feature space by half. The exit block again consists of a 
global-max-pooling layer and is followed by a dropout layer which randomly silences
half of the neurons to reduce the influence of overfitting. The last layer is a
fully-connected layer which maps its input to `nOutcome` classes via the softmax
activation function. The last line of code compiles the model so it is ready for training.
We use categorical crossentropy as the loss function in our network because we currently 
have 14 different classes which perfectly fit for one-hot encoding. If the number
of classes is too high, for example in speech recognition problems, sparse categorical
crossentropy would be the loss function of choice. As an optimizer function we chose 
`adam` because it ensures that the learning rate and decay values will be changed adaptively during training.
Finally, we tell the model to optimize the training process based on overall accuracy.
We can now compile a first model and take a look at its structure:

```{r cnn-build}
model = buildCNN(kernel = 50, nVariables = 1863, nOutcome = 12)
model
```

In total, the current network consists of 135,700 weights to be trained. 
In the column `output shape` we observe the shape transformation of the input 
data from a 1D-array of 1814 in size on the top layer of the network,
to a 1D-output of size 12 on the bottom layer. 

## Training a CNN

We can use our database to start a training process with the CNN defined before.
First, the input data needs to be transformed to arrays which can be understood
by the `keras::fit()` function. Here, we used `keras-backend` functionality 
to achieve this. Additionally, every training process needs to be initiated with 
information on the number of epochs the training data is going to be presented. 
We used a fixed value of 300 epochs because beyond that value no substantial gain
in accuracy was observed. Also, the training data is going to be presented in 
batches, each of 10 observations.

```{r cnn-testing, eval=FALSE}
data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)

K <- keras::backend()
x_train = as.matrix(data[,1:ncol(data)-1])
x = K$expand_dims(x_train, axis = 2L)
x_train = K$eval(x)
y_train = keras::to_categorical(as.numeric(data$class)-1, length(unique(data$class)))

history = keras::fit(model, x = x_train, y = y_train,
                               epochs=300,
                               batch_size = 10)
history
plot(history)
```

```{r cnn-testing-background, fig.align="center", echo=FALSE, fig.cap= "**Fig. 1**: Accuracy and loss values for an exemplary training process."}
history = readRDS(paste0(output,"cnn_exploration_history.rds"))
history
p = plot(history)
p = p + theme_minimal()
plotly::ggplotly(p)

```

## Kernel Search

We achived an accuracy of 0.98 in 300 epochs (**Fig. 1**). This single value is still hardly 
an indicator for the generalization potential of the CNN because we did not use
an independent validation dataset to evaluate the performance of the model on
unseen data. Before evaluating the generalization performance, we analyze how
the CNN reacts to different kernel sizes as well as to specific data transformations.
These are normalization, Savitzkiy-Golay filter, and the first and second derivative 
of the different representations.
We then apply a loop to calculate all the different combinations. Note, we
apply the `set.seed()` function before splitting the data. This way all the different
combinations of kernels and data transformations actually trains and evaluates on the
exact same dataset. This would not be valid if the generalization potential of
the model was going to be assessed. But since we are interested in the performance
of different kernel sizes and data transformation techniques, it is actually beneficial
for comparison if the different models train on the same data. It would otherwise
not be possible to account variations in performance, either to the kernel size, or
to a different split in the training and validation sets. 

```{r cnn-kernel-test, eval = FALSE}
kernels = c(10,20,30,40,50,60,70,80,90,100,125,150,175,200)
types = c("raw","norm","sg","sg.norm","raw.d1", "sg.norm.d1", "sg.norm.d2",
          "raw.d1","raw.d2","norm.d1","norm.d2")
results = data.frame(types = rep(0, length(kernels) * length(types)),
                     kernel =rep(0, length(kernels) * length(types)),
                     loss = rep(0, length(kernels) * length(types)),
                     acc = rep(0, length(kernels) * length(types)),
                     val_loss=rep(0, length(kernels) * length(types)),
                     val_acc=rep(0, length(kernels) * length(types)))

variables = ncol(data)-1
counter = 1

for (type in types){
    if (type == "raw"){
      tmp = data
      variables = ncol(data)-1
    }else{
      tmp = preprocess(data[ ,1:ncol(data)-1], type = type)
      variables = ncol(tmp)
      tmp$class =data$class
    }
  for (kernel in kernels){

    # splitting between training and test
    set.seed(42)
    index = caret::createDataPartition(y=tmp$class,p=.5)
    training = tmp[index$Resample1,]
    validation = tmp[-index$Resample1,]


    # splitting predictors and labels
    x_train = training[,1:variables]
    y_train = training[,1+variables]
    x_test = validation[,1:variables]
    y_test = validation[,1+variables]

    #number of preditors and unique targets
    nOutcome = length(levels(y_train))

    # function to get keras array for dataframes
    K <- keras::backend()
    df_to_karray <- function(df){
      d = as.matrix(df)
      d = K$expand_dims(d, axis = 2L)
      d = K$eval(d)
    }

    # coerce data to keras structure
    x_train = df_to_karray(x_train)
    x_test = df_to_karray(x_test)
    y_train = keras::to_categorical(as.numeric(y_train)-1,nOutcome)
    y_test = keras::to_categorical(as.numeric(y_test)-1,nOutcome)
    # contstruction of "large" neural network
    model = prepNNET(kernel, variables, nOutcome)
    history = keras::fit(model, x = x_train, y = y_train,
                          epochs=200, validation_data = list(x_test,y_test),
                          #callbacks =  callback_tensorboard(paste0(output,"nnet/logs")),
                          batch_size = 10 )
    results$types[counter] = type
    results$kernel[counter] = kernel
    results$loss[counter] = history$metrics$loss[100]
    results$acc[counter] = history$metrics$acc[100]
    results$val_loss[counter] = history$metrics$val_loss[100]
    results$val_acc[counter] = history$metrics$val_acc[100]
    write.csv(results, file = paste0(output,"nnet/kernels.csv"))
    counter = counter + 1
  }

  print(results)
}

```

## Results

We can now plot the results with increasing kernel sizes on the x-axis, accuracy 
values for the validation dataset on the y-axis and different lines for the data 
transformations (**Fig. 2**).

```{r cnn-kernel-results, echo = FALSE}
results = read.csv(paste0(output,"nnet/kernels.csv"))
```


```{r cnn-plot-kernel, message=FALSE, warning=FALSE, echo= FALSE, fig.cap="**Fig. 2**: Accuracy results for different kernel sizes."}
library(ggplot2)
library(plotly)
kernelPlot = ggplot(data = results, aes(x = kernel, y = val_acc))+
  geom_line(aes(color=types,group=types), size = 1.5)+
  ylab("validation accuracy")+
  xlab("kernel size")+
  theme_minimal()
ggplotly(kernelPlot)

```
We can observe that the pattern of accuracy is highly variable dependent on the 
kernel size as well as between different data transformations. For example, the use
of the second derivative of the Savitzkiy-Golay filtered data yields to very low
accuracies across all kernel sizes. To aid the selection of an appropriate kernel 
size and data transformation we calculated some descriptive statistic values 
to find optimal configurations. One indicator are the kernel sizes which deliver 
the highest accuracy results on average. Another indicator for optimal configurations
might be the highest accurcies achieved in absolute terms.

```{r cnn-calc-kernelStats}
kernelAcc = aggregate(val_acc ~ kernel, results, mean)
kernelAcc = kernelAcc[order(-kernelAcc$val_acc), ]

type = results[which(results$kernel == kernelAcc$kernel[1]),]
type = type[order(-type$val_acc), ]

highest = results[order(-results$val_acc),]

```

On average, a kernel size of 50 delivered the highest accuracy of 0.81 (**Tab. 1**). 
A kernel size of 90 yielded to the second-highest accuracy.

```{r cnn-kernel-table, echo=FALSE}
knitr::kable(kernelAcc, caption = "**Tab. 1**: Average performance of kernel size across data
             preprocessing types.")
```

When we order the results according to the absolute accuracies achieved,
it can be observed that there are only four pre-processing types and kernel sizes which yielded
to an accuracy of 0.9 or higher (**Tab. 2**). The second derivative of the normalized
data yielded to an accuracy of 0.91 at a kernel size of 90. The simple 
Savitzkiy-Golay smoothed data yielded to an accuracy of 0.9 at a kernel size of 70. 
The second derivative of the raw data yielded to an accuracy of 0.9 at a kernel size of
90. The first derivative of the normalised data yielded to an accuracy of 0.9 at a 
kernel size of 150.

```{r cnn-highest hits, echo=FALSE}
knitr::kable(highest[1:10,], caption = "**Tab. 2**: The ten highest accuracy results for 
             different preprocessing types at varying kernel sizes.")
```


## Cross Validation

After finding the optimal kernel sizes for different pre-processing techniques,
a cross-validation approach was used to find the configuration with the optimal
generalization potential. The documentation of the results can be found [here](cnn_crossvalidation.html).

## Citations on this page