-
Notifications
You must be signed in to change notification settings - Fork 0
/
cnn_exploration.Rmd
341 lines (289 loc) · 15 KB
/
cnn_exploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: "Convolutional Neural Network"
author: Darius Goergen
site: workflowr::wflow_site
output:
workflowr::wflow_html:
toc: true
editor_options:
chunk_output_type: console
bibliography: library.bib
csl: elsevier-harvard.csl
link-citations: yes
---
```{r setup, include=FALSE, warning=FALSE, message=FALSE}
source("code/setup_website.R")
source("code/functions.R")
```
## Overview
Convolutional Neural Networks (CNN) are mainly used in image processing
tasks [@Rawat2017]. However, they can also be applied to one-dimensional data,
such as time series or spectral data [@Liu2017;@IsmailFawaz2019;@Ghosh2019;@Berisha2019].
They mainly consist of three different types of layers, which generally
are stacked into a sequential model for learning various patterns from the input data
to model the desired output. The most relevant layer is the convolutional layer,
which serves as an extractor for features found in the input [@Rawat2017]. They work
on a specificed number of neurons, or filters, each serving as a mapping function
for a specific range in the input data, also referred to as the kernel size.
Not only do they map features from the raw input data, but also detect features in
the output of previous convolutional layers. This is achieved through adjusting
the weights associated with each filter based on a non-linear activation function.
Additionally, after some convolutional layers, pooling layers are most commonly
included in CNNs. These layers reduce the feature maps of previous layers and
are also associated with a function to choose which parameters are preserved.
Nowadays, most commonly max-pooling layers are used, preserving the maximum signal
from a feature map. Finally, most CNNs end with a fully-connected layer.
These layers are used to transform the last feature map to the output.
For regression problems, this layer may only contain a single neuron, while for
classification problems it may contain `n`-neurons, each modelling the output of
a specific class. The network learns through what is called "backpropagation".
It means that the training data is repeatedly presented to the network,
depending on an optimizer function the distance to the desired output is calculated.
Then, the weights associated with the filters are updated and a new epoch of
presenting the training data to the CNN is started.
## Model Architecture
In contrast to random forest and support vector machines, the effect of noise on
the classification outcome was not tested. This is mainly due to limitations in computation time.
The computation time was significantly reduced by installing the `keras` package
in GPU mode based on the `CUDA` library from [Nvidia](https://developer.nvidia.com/cuda-zone).
However, CNNs remain computational intensive, since depending on the architecture several thousands
of weights have to be trained. Here, we developed a simple two-block architecture
of four convolutional layers in total. The number of filters, or feature extractors,
increases with each layer by a factor of 2. We chose this network
architechture with four layers and only a small number of filters because it delivered
relatively high accuracies in short computation times. The code below
defines a function to set up and compile a CNN for a given kernel size.
```{r cnn-function}
# expects that you have installed keras and tensorflow properly
library(keras)
buildCNN <- function(kernel, nVariables, nOutcome){
model = keras_model_sequential()
model %>%
# block 1
layer_conv_1d(filters = 8,
kernel_size = kernel,
input_shape = c(nVariables,1),
name = "block1_conv1",) %>%
layer_activation_relu(name="block1_relu1") %>%
layer_conv_1d(filters = 16,
kernel_size = kernel,
name = "block1_conv2") %>%
layer_activation_relu(name="block1_relu2") %>%
layer_max_pooling_1d(strides=2,
pool_size = 5,
name="block1_max_pool1") %>%
# block 2
layer_conv_1d(filters = 32,
kernel_size = kernel,
name = "block2_conv1") %>%
layer_activation_relu(name="block2_relu1") %>%
layer_conv_1d(filters = 64,
kernel_size = kernel,
name = "block2_conv2") %>%
layer_activation_relu(name="block2_relu2") %>%
layer_max_pooling_1d(strides=2,
pool_size = 5,
name="block2_max_pool1") %>%
# exit block
layer_global_max_pooling_1d(name="exit_max_pool") %>%
layer_dropout(rate=0.5) %>%
layer_dense(units = nOutcome, activation = "softmax")
# we compile for a classification with the categorcial crossentropy loss function
# and use adam as optimizer function
compile(model, loss="categorical_crossentropy", optimizer="adam", metrics="accuracy")
}
```
The function expects three arguments as input. The first is the kernel size which
specifies the width of the window, extracting features from the input data
and subsequent layer outputs. Note that the kernel size is held constant through out
the network. The second argument expects an integer representing the number of
input variables which relates to the amount of wavenumbers in the present case.
The third argument also expects an integer value, specifiying the number of
classes in the output. Each convolutional layer is associated with a ReLU-activation
function. At the end of each block we added a max-pooling layer with `stride = 2`,
which takes the maximum values of its respective input and discards unrequired data points,
effectively reducing the feature space by half. The exit block again consists of a
global-max-pooling layer and is followed by a dropout layer which randomly silences
half of the neurons to reduce the influence of overfitting. The last layer is a
fully-connected layer which maps its input to `nOutcome` classes via the softmax
activation function. The last line of code compiles the model so it is ready for training.
We use categorical crossentropy as the loss function in our network because we currently
have 14 different classes which perfectly fit for one-hot encoding. If the number
of classes is too high, for example in speech recognition problems, sparse categorical
crossentropy would be the loss function of choice. As an optimizer function we chose
`adam` because it ensures that the learning rate and decay values will be changed adaptively during training.
Finally, we tell the model to optimize the training process based on overall accuracy.
We can now compile a first model and take a look at its structure:
```{r cnn-build}
model = buildCNN(kernel = 50, nVariables = 1863, nOutcome = 12)
model
```
In total, the current network consists of 135,700 weights to be trained.
In the column `output shape` we observe the shape transformation of the input
data from a 1D-array of 1814 in size on the top layer of the network,
to a 1D-output of size 12 on the bottom layer.
## Training a CNN
We can use our database to start a training process with the CNN defined before.
First, the input data needs to be transformed to arrays which can be understood
by the `keras::fit()` function. Here, we used `keras-backend` functionality
to achieve this. Additionally, every training process needs to be initiated with
information on the number of epochs the training data is going to be presented.
We used a fixed value of 300 epochs because beyond that value no substantial gain
in accuracy was observed. Also, the training data is going to be presented in
batches, each of 10 observations.
```{r cnn-testing, eval=FALSE}
data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)
K <- keras::backend()
x_train = as.matrix(data[,1:ncol(data)-1])
x = K$expand_dims(x_train, axis = 2L)
x_train = K$eval(x)
y_train = keras::to_categorical(as.numeric(data$class)-1, length(unique(data$class)))
history = keras::fit(model, x = x_train, y = y_train,
epochs=300,
batch_size = 10)
history
plot(history)
```
```{r cnn-testing-background, fig.align="center", echo=FALSE, fig.cap= "**Fig. 1**: Accuracy and loss values for an exemplary training process."}
history = readRDS(paste0(output,"cnn_exploration_history.rds"))
history
p = plot(history)
p = p + theme_minimal()
plotly::ggplotly(p)
```
## Kernel Search
We achived an accuracy of 0.98 in 300 epochs (**Fig. 1**). This single value is still hardly
an indicator for the generalization potential of the CNN because we did not use
an independent validation dataset to evaluate the performance of the model on
unseen data. Before evaluating the generalization performance, we analyze how
the CNN reacts to different kernel sizes as well as to specific data transformations.
These are normalization, Savitzkiy-Golay filter, and the first and second derivative
of the different representations.
We then apply a loop to calculate all the different combinations. Note, we
apply the `set.seed()` function before splitting the data. This way all the different
combinations of kernels and data transformations actually trains and evaluates on the
exact same dataset. This would not be valid if the generalization potential of
the model was going to be assessed. But since we are interested in the performance
of different kernel sizes and data transformation techniques, it is actually beneficial
for comparison if the different models train on the same data. It would otherwise
not be possible to account variations in performance, either to the kernel size, or
to a different split in the training and validation sets.
```{r cnn-kernel-test, eval = FALSE}
kernels = c(10,20,30,40,50,60,70,80,90,100,125,150,175,200)
types = c("raw","norm","sg","sg.norm","raw.d1", "sg.norm.d1", "sg.norm.d2",
"raw.d1","raw.d2","norm.d1","norm.d2")
results = data.frame(types = rep(0, length(kernels) * length(types)),
kernel =rep(0, length(kernels) * length(types)),
loss = rep(0, length(kernels) * length(types)),
acc = rep(0, length(kernels) * length(types)),
val_loss=rep(0, length(kernels) * length(types)),
val_acc=rep(0, length(kernels) * length(types)))
variables = ncol(data)-1
counter = 1
for (type in types){
if (type == "raw"){
tmp = data
variables = ncol(data)-1
}else{
tmp = preprocess(data[ ,1:ncol(data)-1], type = type)
variables = ncol(tmp)
tmp$class =data$class
}
for (kernel in kernels){
# splitting between training and test
set.seed(42)
index = caret::createDataPartition(y=tmp$class,p=.5)
training = tmp[index$Resample1,]
validation = tmp[-index$Resample1,]
# splitting predictors and labels
x_train = training[,1:variables]
y_train = training[,1+variables]
x_test = validation[,1:variables]
y_test = validation[,1+variables]
#number of preditors and unique targets
nOutcome = length(levels(y_train))
# function to get keras array for dataframes
K <- keras::backend()
df_to_karray <- function(df){
d = as.matrix(df)
d = K$expand_dims(d, axis = 2L)
d = K$eval(d)
}
# coerce data to keras structure
x_train = df_to_karray(x_train)
x_test = df_to_karray(x_test)
y_train = keras::to_categorical(as.numeric(y_train)-1,nOutcome)
y_test = keras::to_categorical(as.numeric(y_test)-1,nOutcome)
# contstruction of "large" neural network
model = prepNNET(kernel, variables, nOutcome)
history = keras::fit(model, x = x_train, y = y_train,
epochs=200, validation_data = list(x_test,y_test),
#callbacks = callback_tensorboard(paste0(output,"nnet/logs")),
batch_size = 10 )
results$types[counter] = type
results$kernel[counter] = kernel
results$loss[counter] = history$metrics$loss[100]
results$acc[counter] = history$metrics$acc[100]
results$val_loss[counter] = history$metrics$val_loss[100]
results$val_acc[counter] = history$metrics$val_acc[100]
write.csv(results, file = paste0(output,"nnet/kernels.csv"))
counter = counter + 1
}
print(results)
}
```
## Results
We can now plot the results with increasing kernel sizes on the x-axis, accuracy
values for the validation dataset on the y-axis and different lines for the data
transformations (**Fig. 2**).
```{r cnn-kernel-results, echo = FALSE}
results = read.csv(paste0(output,"nnet/kernels.csv"))
```
```{r cnn-plot-kernel, message=FALSE, warning=FALSE, echo= FALSE, fig.cap="**Fig. 2**: Accuracy results for different kernel sizes."}
library(ggplot2)
library(plotly)
kernelPlot = ggplot(data = results, aes(x = kernel, y = val_acc))+
geom_line(aes(color=types,group=types), size = 1.5)+
ylab("validation accuracy")+
xlab("kernel size")+
theme_minimal()
ggplotly(kernelPlot)
```
We can observe that the pattern of accuracy is highly variable dependent on the
kernel size as well as between different data transformations. For example, the use
of the second derivative of the Savitzkiy-Golay filtered data yields to very low
accuracies across all kernel sizes. To aid the selection of an appropriate kernel
size and data transformation we calculated some descriptive statistic values
to find optimal configurations. One indicator are the kernel sizes which deliver
the highest accuracy results on average. Another indicator for optimal configurations
might be the highest accurcies achieved in absolute terms.
```{r cnn-calc-kernelStats}
kernelAcc = aggregate(val_acc ~ kernel, results, mean)
kernelAcc = kernelAcc[order(-kernelAcc$val_acc), ]
type = results[which(results$kernel == kernelAcc$kernel[1]),]
type = type[order(-type$val_acc), ]
highest = results[order(-results$val_acc),]
```
On average, a kernel size of 50 delivered the highest accuracy of 0.81 (**Tab. 1**).
A kernel size of 90 yielded to the second-highest accuracy.
```{r cnn-kernel-table, echo=FALSE}
knitr::kable(kernelAcc, caption = "**Tab. 1**: Average performance of kernel size across data
preprocessing types.")
```
When we order the results according to the absolute accuracies achieved,
it can be observed that there are only four pre-processing types and kernel sizes which yielded
to an accuracy of 0.9 or higher (**Tab. 2**). The second derivative of the normalized
data yielded to an accuracy of 0.91 at a kernel size of 90. The simple
Savitzkiy-Golay smoothed data yielded to an accuracy of 0.9 at a kernel size of 70.
The second derivative of the raw data yielded to an accuracy of 0.9 at a kernel size of
90. The first derivative of the normalised data yielded to an accuracy of 0.9 at a
kernel size of 150.
```{r cnn-highest hits, echo=FALSE}
knitr::kable(highest[1:10,], caption = "**Tab. 2**: The ten highest accuracy results for
different preprocessing types at varying kernel sizes.")
```
## Cross Validation
After finding the optimal kernel sizes for different pre-processing techniques,
a cross-validation approach was used to find the configuration with the optimal
generalization potential. The documentation of the results can be found [here](cnn_crossvalidation.html).
## Citations on this page