/
response-normality.Rmd
212 lines (150 loc) · 5.42 KB
/
response-normality.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
---
title: "Checking for normality of response variable"
output:
workflowr::wflow_html:
includes:
in_header: header.html
editor_options:
chunk_output_type: console
author: "Patrick Schratz"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
fig.retina = 3,
fig.align = "center",
fig.width = 6.93,
fig.height = 6.13,
out.width = "100%",
echo = FALSE
)
```
This document originated from the fear of having a response variable which is not normally distributed "enough".
The response variable looks as follows:
```{r response-normality-1}
loadd(vi_data)
library(ggplot2)
ggplot(vi_data, aes(x = defoliation)) +
geom_histogram(bins = 30)
```
When applying the [Shapiro-Wilk](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) test we get
```{r response-normality-2}
shapiro.test(vi_data$defoliation)
```
## Exploring model residuals
Visualizing model residuals of LASSO and RF to see how they differ.
The LASSO "predicted vs. fitted" plot shows limited model power.
### Lasso model with no transformation of the response variable
```{r response-normality-3}
loadd(bm_vi_task_lasso_no_filter)
pred <- bm_vi_task_lasso_no_filter$results$`defoliation-a ll-plots-VI`$`Lasso-MBO`$pred
# fit lm of y and ŷ
resid <- lm(pred$data$truth ~ pred$data$response)
# https://stats.stackexchange.com/a/53257/101464
# plot(density(resid(resid)))
par(mfrow = c(1, 2))
plot(resid, which = c(1:2), ask = FALSE, main = "No Transform")
par(mfrow = c(1, 1))
# without outlier
resid1 <- resid
resid1$fitted.values[1526] <- NA
par(mfrow = c(1, 2))
plot(resid1, which = c(1:2), ask = FALSE, main = "No Transform")
par(mfrow = c(1, 1))
```
### RF model with no transformation of the response variable
```{r response-normality-4}
loadd(bm_vi_task_rf_no_filter)
pred_rf <- bm_vi_task_rf_no_filter$results$`defoliation-all-plots-VI`$RF$pred
# fit lm of y and ŷ
resid_rf <- lm(pred$data$truth ~ pred$data$response)
# https://stats.stackexchange.com/a/53257/101464
# plot(density(resid(resid)))
par(mfrow = c(1, 2))
plot(resid_rf, which = c(1:2), ask = FALSE, main = "No Transform")
par(mfrow = c(1, 1))
# without outlier
resid_rf1 <- resid_rf
resid1_rf$fitted.values[1526] <- NA
par(mfrow = c(1, 2))
plot(resid1, which = c(1:2), ask = FALSE, main = "No Transform")
par(mfrow = c(1, 1))
```
# Variable Transformations
The following transformations of the response variable were done to check if it they have an effect on the "residuals vs. fitted" and "QQ-Plot" shown above.
## Power transformation
One option to enforce more normality of a variable is by applying a power transformation.
The [Box-Cox](https://en.wikipedia.org/wiki/Power_transform#Box%E2%80%93Cox_transformation) power transformation estimates a lambda value from the variable.
Next, the transformation can be applied via
$$(y^lambda - 1) / lambda$$
There is a [Stackoverflow question](https://stackoverflow.com/questions/33999512/how-to-use-the-box-cox-power-transformation-in-r) that shows how to do this.
### Box-Cox Transformation
Applying it on the response of the data.
```{r response-normality-5}
library(MASS)
n <- 1759
x <- runif(n, 1, 5)
y <- vi_data$defoliation
# 1st approach to estimate lambda for power transformation
bc <- boxcox(y ~ x)
(lambda <- bc$x[which.max(bc$y)])
# apply power transform (boxcox)
y_new1 <- (y^lambda - 1) / lambda
shapiro.test(y_new1)
```
The "W" value is a little less than before.
There is another way to do this via package _car_.
```{r response-normality-6}
# 2nd approach to estimate lambda for power transformation
# boxcox (0.84)
y_new2 <- (y^(car::powerTransform(y)$lambda) - 1) / car::powerTransform(y)$lambda
shapiro.test(y_new2)
```
Exploring the residuals of a Lasso model with no transformation of the response variable.
```{r response-normality-7}
loadd(bm_vi_task_boxcox_lasso_no_filter)
pred <- bm_vi_task_boxcox_lasso_no_filter$results$`defoliation-all-plots-VI`$`Lasso-MBO`$pred
# fit lm of y and ŷ
resid <- lm(pred$data$truth ~ pred$data$response)
# https://stats.stackexchange.com/a/53257/101464
# plot(density(resid(resid)))
# residuals vs. fitted & QQ plot
par(mfrow = c(1, 2))
plot(resid, which = c(1:2), ask = FALSE, main = "No Transform")
par(mfrow = c(1, 1))
```
### Tukey Transformation
A slightly different way is to se the so called "Tukey" transformation instead of the "Box-Cox" transformation.
```{r response-normality-8}
# tukey trans (0.846)
y_new3 <- y^(car::powerTransform(y)$lambda)
shapiro.test(y_new3)
```
## Log transform
Beforehand we did a log transformation of the data.
However, since the data is a bit lef-skewed, this was enforced even more by that operation.
A substantially lower "W" value is the result.
```{r response-normality-9}
# log transformed response (0.53)
shapiro.test(log(y))
```
Also if we take a look at the residuals, they do not show normality.
```{r response-normality-10}
loadd(bm_models_vi_task_lasso_no_filter)
pred_bc <- bm_models_vi_task_lasso_no_filter$results$`defoliation-all-plots-VI`$lasso$pred
# fit lm of y and ŷ
resid_bc <- lm(pred_bc$data$truth ~ pred$data$response)
# https://stats.stackexchange.com/a/53257/101464
# plot(density(resid(resid_bc)))
par(mfrow = c(1, 2))
plot(resid_bc, which = c(1:2), ask = FALSE, main = "Boxcox")
par(mfrow = c(1, 1))
```
Also the Shapiro test does not look good
```{r response-normality-11}
shapiro.test(resid$residuals)
```
## Inverse Box-Cox Transformation
Just for completeness.
```{r response-normality-12}
y_old <- (lambda * y_new1 + 1)^(1 / lambda)
```