-
Notifications
You must be signed in to change notification settings - Fork 0
/
rf_exploration.Rmd
378 lines (310 loc) · 14.6 KB
/
rf_exploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
---
title: "Random Forest"
author: Darius Goergen
site: workflowr::wflow_site
output:
workflowr::wflow_html:
toc: true
editor_options:
chunk_output_type: console
bibliography: library.bib
link-citations: yes
---
```{r setup, include=FALSE, warning=FALSE, message=FALSE}
source("code/setup_website.R")
source("code/functions.R")
```
## Overview
Random Forest (RF) is a machine learning algorithm which is based on the
traditional concept of decision trees. It's popular implementation was
developed by @Breiman2001. It represents an ensemble classifier which has been
reported to be the primary choice between different ensemble classifiers due to
its easy handling and high classification accuracies [@Sagi2018]. It represents
a non-parametric classification method. At each split of a given tree a random
number variables are used for the decision of how to split the input data.
In the case of RF, a user specified number of trees is grown. The final class
decision for an observation is made by a simple majority vote from all trees
in the forest. Internal accuracy assessment of the RF classifier traditionally
is obtained through an out-of-bag (OOB) error estimation. Here, we kept the
number of trees fixed at 500 since above that threshold no substantial gain
in accuracy was observed.
## Adding Noise
The modality of the data presented to RF is of high importance here.
Different levels of data preprocessing might emphasize different features of
the patterns to be learned by an algorithm. To grasp this, different transformations
of the data were presented to the RF algorithm. Additionally, the raw data signal
was jittered to test which transformation might prove beneficial in delivering
high classification accuracies even in the presence of noise. To test this,
we define a function which adds noise to the raw data and returns a list with
the number of elements equal to the levels of noise applied.
```{r rf-addNoise}
addNoise = function(data, levels = c(0), category="class"){
data.return = list()
index = which(names(data) == category)
for (n in levels){
tmp = as.matrix(data[ , -index])
if (n == 0){
tmp = data
}else{
tmp = as.data.frame(jitter(tmp, n))
tmp[category] = data[category]
}
data.return[[paste("noise", n, sep="")]] = tmp
}
return(data.return)
}
data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)
noisy_data = addNoise(data, levels = c(0,10,100,250,500), category = "class")
# indivitual elements can be selected by using [[ and refering to the index or the name
head(noisy_data[["noise100"]])[1:3,1:3]
```
## Data Preprocessing
Then, in another function which uses the `noisy_data` objected as input
specified data transformations are applied. These are normalization which centers
and scales the input data, as well as different forms of the Savitzkiy-Golay filter [@Savitzky1964]
and first and second derivative of a raw spectrum. The function iterates through
the noise level elements in the `noisy_data` object and returns each specified
transformation in a list element below the noise level. The exemplary function below
applies the preprocessing for normalization, standard filtering and first
derivative only. The implementation of the function used in the project
can be found [here](https://github.com/goergen95/polymeRID/blob/master/code/functions.R#12).
```{r rf-TrainingSet}
createTrainingSet = function(data, category = "class",
SGpara = list(p=3,w=11), lag = 15){
data.return = list()
for (noise in names(data)){
tmp = as.data.frame(data[[noise]])
classes = tmp[,category]
tmp = tmp[!names(tmp) %in% category]
# original data
data.return[[noise]][["raw"]] = as.data.frame(data[[noise]])
# normalised data
data_norm = preprocess(tmp, type="norm")
data_norm[category] = classes
data.return[[noise]][["norm"]] = data_norm
# SG-filtered data
data_sg = preprocess(tmp, type="sg", SGpara = SGpara)
data_sg[category] = classes
data.return[[noise]][["sg"]] = data_sg
# first derivative of original data
data_rawd1 = preprocess(tmp, type="raw.d1", lag = lag)
data_rawd1[category] = classes
data.return[[noise]][["raw.d1"]] = data_rawd1
}
return(data.return)
}
# applying the function
test_dataset = createTrainingSet(noisy_data, category = "class")
# individual transformations at a certain noise level can be accessed with [[
head(test_dataset[["noise500"]][["raw.d1"]])[1:3,1:3]
```
## Dimensionality Reduction
The data base of @Primpke2018 currently shows 1863 variables for each observations.
Most of these data points do not hold relevant information to distinguish between
different types of particles. To shorten the computation time, one can use
dimensionality reduction techniques such as principal component analysis (PCA).
PCA also has been used to transform spectral data of micro-plastics in marine
ecosystems before [@Jung2018; @Lorenzo-Navarro2018]. PCA basically takes the input
data for a given number of observation and by performing a orthogonal transformation
to the data transforms these possible correlated variables to uncorrelated
principal components. This way, both redundancies in the data as well as possible
noise can be accounted for. PCA previously has been successfully applied to
FTIR-spectrometer data [@Hori2003; @Nieuwoudt2004; @Mueller2013; @Ami2013; @Fu2014].
Simultaneously, the number of variables can be significantly reduced by applying
PCA and thus speeding up the training process. Below we will apply a PCA to the
raw data as an example only.
```{r PCA-raw , message=FALSE}
library(factoextra)
tmp = test_dataset[["noise0"]][["raw"]]
pca = prcomp(tmp[ ,-1864]) # omitting class variable
var_info = factoextra::get_eigenvalue(pca)
# setting a threshold of 99% explained variance
threshold = 99
thresInd = which(var_info$cumulative.variance.percent>=threshold)[1]
pca_data = pca$x[,1:thresInd]
```
We can use the index variable `thresInd` we just defined to take a look upon
all the principal components which explain 99% of the variance in the data set.
```{r pca-table, echo=FALSE}
library(knitr)
knitr::kable(var_info[1:thresInd,])
```
We effectively reduced the number of variables from 1683 to 15 which still bear
99% of the variance we can find in the original data set. However, when it comes
to machine learning, it is important to realize that this new data set is not fit
to be used in a training process. If we now randomly split the observations into
training and test, we effectively mix up these two sets because information of
the test set has already influenced the outcome of the PCA. Therefor, the data set
need to be split beforehand of the PCA. The analysis is done on the training data
only and then the same orthogonal transformations will be applied to the test data.
This way it can be ensured that the test set is truly independent from the
training process.
## Cross Validation
We apply a 10-fold cross-validation which is repeated 5 times. The following
code takes a complete data set as input, applies a splitting function from the
`caret` package and then builds the PCA upon the the test set and finally applies
the same transformation to the test set. We apply it for the raw data only.
Also, we randomly split the data to 50% training and 50% test.
```{r PCA_loop}
folds = 10
repeats = 5
split_percentage = 0.5
threshold = 99
tmp = test_dataset[["noise0"]][["raw"]]
set.seed(42) # ensure reproducibility
fold_index = lapply(1:repeats, caret::createDataPartition, y=tmp$class,
times = folds, p = split_percentage)
fold_index = do.call(c, fold_index)
pcaData = list()
for (rep in 1:repeats){
rep_index = fold_index[(rep*folds-folds+1):(rep*folds)] # jumps to the correct number of folds forward in each repeat
pcadata_fold = lapply(1:folds,function(x){
# splitting for current fold
training = tmp[unlist(rep_index[x]),]
validation = tmp[-unlist(rep_index[x]),]
# keep response
responseTrain = training$class
responseVal = validation$class
# apply PCA
pca = prcomp(training[,1:1863])
varInfo = factoextra::get_eigenvalue(pca)
thresInd = which(varInfo$cumulative.variance.percent >= threshold)[1]
pca_training = pca$x[ ,1:thresInd]
pca_validation = predict(pca, validation)[ ,1:thresInd]
training = as.data.frame(pca_training)
training$response = responseTrain
validation = as.data.frame(pca_validation)
validation$response = responseVal
foldtmp = list(training, validation)
names(foldtmp) = c("training","validation")
return(foldtmp)
})
names(pcadata_fold) = paste("fold", 1:folds, sep ="")
pcaData[[paste0("repeat",rep)]] = pcadata_fold
}
```
We now have a list object with the number of elements equivalent to the repeats.
Below each repeat element we can access the individual folds.
There we find two elements which we can access by referring to `"training"`
and `"testing"`.
```{r PCA_access}
pcaData[["repeat5"]][["fold10"]][["training"]][1:3,1:3]
pcaData[["repeat5"]][["fold10"]][["validation"]][1:3,1:3]
summary(pcaData[["repeat5"]][["fold10"]][["training"]]$response)
summary(pcaData[["repeat5"]][["fold10"]][["validation"]]$response)
```
## Parameter Tuning
For the RF algorithm only the parameter `mtry` needs a search pattern since we
hold the number of trees constant at a value of 500. `mtry` effectively specifies
the number of variables to look at each split within a tree. We took the square
root of the number of variables divided by 3 as the first `mtry` value, the square
root itself as the second and the maximum number of variables as a third value.
The optimal parameter is then selected to build the final model and it is
evaluated with the test data.
```{r}
training = pcaData[["repeat1"]][["fold1"]][["training"]]
validation =pcaData[["repeat1"]][["fold1"]][["validation"]]
x_train = training[ ,1:ncol(training)-1]
y_train = training$response
x_test = validation[ ,1:ncol(validation)-1]
y_test = validation$response
first = floor(sqrt(ncol(x_train)))/3
if(first <= 1) first <- 1
second = floor(sqrt(ncol(x_train)))
last = ncol(x_train)
mtries = c(first,second,last)
Mods = lapply(1:length(mtries),function(x){return(0)})
accuracy = c()
for (mtry in mtries){
Mods[which(mtries == mtry)] = list(randomForest::randomForest(x_train,
y_train,
ntree=500,
mtry=mtry))
pred = predict(Mods[[which(mtries==mtry)]], x_test)
conf = caret::confusionMatrix(pred, y_test)
accuracy = c(accuracy, conf$overall["Kappa"])
}
best_model = Mods[[which(accuracy == max(accuracy))[1]]]
prediction = predict(best_model, x_test)
confMat = caret::confusionMatrix(prediction, y_test)
print(confMat$table)
```
This process of splitting the data set into training and test was automated by
putting the above code in a function which can be found [here](https://github.com/goergen95/polymeRID/blob/master/code/functions.R#239).
Finally, we can apply this function to the different levels of prepossessing which
were discussed before.
```{r CV-apply, eval = FALSE}
source("code/functions.R")
wavenumbers = readRDS(paste0(ref,"wavenumbers.rds"))
# add noise to data
noisyData = addNoise(data,levels = c(0,10,100,250,500), category = "class")
# preprocessing
testDataset = createTrainingSet(noisyData, category = "class",
SGpara = list(p=3, w=11), lag=15,
type = c("raw", "norm", "sg", "sg.d1", "sg.d2",
"sg.norm", "sg.norm.d1", "sg.norm.d2",
"raw.d1", "raw.d2", "norm.d1", "norm.d2"))
types = names(testDataset[[1]])
levels = lapply(names(testDataset), function(x){
rep(x, length(types))
})
levels = unlist(levels)
results = data.frame(level=levels,type = types, kappa = rep(0,length(levels)))
for (level in unique(levels)){
for (type in types){
print(paste0("Level: ",level," Type: ",type))
tmpData = testDataset[[level]][[type]]
tmpData[which(wavenumbers<=2420 & wavenumbers>=2200)] = 0 # setting C02 window to 0
tmpModel = pcaCV(tmpData, folds = 10, repeats = 5, threshold = 99, metric = "Kappa", p=0.5, method="rf")
saveRDS(tmpModel,file = paste0(output,"rf/model_",level,"_",type,"_",round(tmpModel[[1]],2),".rds"))
results[which(results$level==level & results$type==type),"kappa"] = as.numeric(tmpModel[[1]])
print(results)
}
}
saveRDS(paste0(output,"rf/exploration.rds"))
```
```{r read-exploration, include = FALSE}
results = readRDS(paste0(output,"rf/exploration.rds"))
```
## Results
We can now take a look at the kappa scores the algorithm achieved during training
for different representations of the data at increasing noise levels.
```{r plot-results, echo = FALSE, warning=FALSE, message=FALSE}
library(plotly)
results$level = as.character(results$level)
# let's visualize the accuracy results with noise levels on x-axis and kappa score on y-axis
test = ggplot(data=results)+
geom_line(aes(y=kappa,group=type,x=level,color=type),size=1.5)+
ylab("Kappa score")+
xlab("Noise level")+
scale_color_discrete(name="Type of\nPre-Processing")+
theme_minimal()
t = ggplotly(test)
t
```
We can observe that with higher noise ratios the kappa score is reduced significantly.
However, there are some data transformations which are able to maintain a
relatively high level of accuracy even in the presence of noise. One of the best
might be the simple Savitzkiy-Golay filter, as well as the same filter applied
to the normalized data. Equal robust results are observed for the raw and the
normalized data. The other transformations do not show the same level of robustness
to noise. We can look at this more mathematically by calculating the average
slopes of the data transformations methods and order the data frame from low to
high slopes. Note that we only take the kappa score at noise level 0 and 500
to calculate the average slope.
```{r slope-calculation}
noise0 = results[results$level == "noise0", ]
noise500 = results[results$level == "noise500", ]
slopes = noise500$kappa - noise0$kappa
types = unique(results$type)
df = data.frame(type = types, slope = slopes, row.names = NULL)
df = df[order(-slopes),]
```
```{r print-slope-table, echo = FALSE}
knitr::kable(df)
```
We can now confirm, that the average decrease in kappa score is the lowest for
the Savitzkiy-Golay filter applied to the raw and the normalized data followed
by the raw and the normalized data. The numbers can be interpreted as the loss in
Kappa score when the noise level increases from 0 to 500.
## Citations on this page