-
Notifications
You must be signed in to change notification settings - Fork 4
/
model-select.qmd
427 lines (318 loc) · 15.2 KB
/
model-select.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
# Model Selection {#sec-modelselect}
## Generate the data {#sec-data}
Here we will demonstrate two approaches to model comparison. But first let's generate data, in the same way we did for multiple linear regression in (@sec-ols). Note that in this case, we will specify that two of the input variables have zero slope (i.e., no linear association with the outcome variable).
```{r}
n = 40
n_covariate = 4
p = n_covariate + 1
betas = vector("numeric", length = p)
xmat = matrix(0, nrow = n, ncol = p)
sigma = 2.25
# Column for intercept
xmat[,1] = 1
# Generate the covariate data randomly:
set.seed(5)
xmat[,2] = rnorm(n, mean = 5, sd = 8)
xmat[,3] = runif(n, min = 0, max = 20)
xmat[,4] = rchisq(n, df = 50)
xmat[,5] = rpois(n, lambda = 10)
# Set the betas:
betas[1] = 1.0
betas[2] = 0.0
betas[3] = -0.2
betas[4] = 0.0
betas[5] = 1.8
# Calculate the observed 'y', adding residual error
y = xmat %*% betas + rnorm(n, mean = 0, sd = sigma)
par(mfrow=c(2,2))
for(i in 2:p){
plot(y ~ xmat[,i],
xlab = paste("covariate ", i-1))
}
# Create a data.frame
my_df = data.frame(y, xmat[,2:5])
head(my_df)
# Run the model, report the summary
m1 = lm(y ~ 1 + X1 + X2 + X3 + X4, data = my_df)
m1_summary = summary(m1)
m1_summary
```
## Parsimony via model simplification
We will successively simplify the model until we find a "minimally acceptable" model that explains the most variability in the outcome variable.
There are several built-in functions in R that can help us make quantitatively justified decisions about which input variables can be dropped from the full model to determine our minimally acceptable model. First, we can use the $F$-test as described in lecture. This can be implemented by the `anova()` function.
Based on the `summary()` output, we see that input variable 3 ($x_3$) has the least significant effect on $y$, so we will drop that first and proceed from there.
```{r}
# The full model lives in object m1
formula(m1)
# Create a new model with the update() function
# This function has a strange notation, but so it goes...
m2 = update(m1, .~. -X3)
formula(m2)
summary(m2)
# Use anova() to test if the drop of X3 is justified
anova(m2, m1)
```
Remember that the hypothesis tested is:
$$H_0:\text{simple model}$$
$$H_A:\text{complex model}$$
So if the $p \ge 0.05$, as usual, we cannot reject the null hypothesis. In this case, it means that the simple model is just as good as the more complex model. Therefore, we are justified in dropping $x_3$. From the data, we could not detect that $x_3$ has a statistically meaningful linear relationship with the outcome data $y$.
Let's manually calculate that $F$ test statistic and associated $p$-value to verify that we understand how the test works.
```{r}
# Extract the residuals (errors)
resid_null = m2$residuals
resid_full = m1$residuals
# Sum of Square Errors (SSE)
sse_null = crossprod(resid_null)
sse_full = crossprod(resid_full)
# degrees of freedom
df_null = n-(p-1) # we dropped one input variable
df_full = n-p
# Calculate F_stat
f_test = ((sse_null - sse_full)/(df_null - df_full)) / (sse_full/df_full)
# Degrees of freedom for the F distribution:
df_numerator = df_null - df_full
df_denominator = df_full
p_m1vm2 = 1 - pf(f_test,
df1 = df_numerator,
df2 = df_denominator)
# Compare to anova()
f_test; p_m1vm2
anova_m1vm2 = anova(m2,m1)
anova_m1vm2$`F`; anova_m1vm2$`Pr(>F)`
```
Let's continue with the simplification process, using the `anova()` function.
```{r}
# model 2 is the current best.
summary(m2)
# Now, drop x1 and check
m3 = update(m2, .~. -X1)
# Check:
anova(m3, m2)
# The p-value is not significant, so we
# can accept the null (simpler model)
summary(m3)
# Remove X2 and check
m4 = update(m3, .~. -X2)
anova(m4, m3)
# Ok now the p-value is significant
# We need to reject the null (simpler model)
# We *cannot* reliably remove X2
# Try X4 just in case:
m5 = update(m3, .~. -X4)
anova(m3, m5)
# p-value is significant again
# need to reject the null
# we cannot drop X4
# Therefore, m3 is most parsimonious
summary(m3)
```
Therefore, model 3, is the minimum acceptable model:
$$y_i = \beta_0 + \beta_2 x_{2,i} + \beta_4 x_{4,i} + \epsilon_i$$
We could actually come to the same result, using a different, more automated function, `step()`. However, this function uses a different metric to test the null vs. full model hypothesis, the Akaike nformation criterion (AIC), which is calculated as:
$$\text{AIC} = - 2ln(\text{Model Likelihood}) + 2k$$
And $k$ is the number of estimated parameters in the model. We can then compare the AIC values to decide which models are "best". We will learn more about AIC later.
Let's use the `step()` function and verify it gives us the same final outcome.
```{r}
m1_step = step(m1)
summary(m1_step)
```
We can see the selected model only includes $x_2$ and $x_4$, just like our decision based on the $F$-test.
## Model averaging
Recall from lecture that model averaging represents another philosophical approach to model selection and model comparison. In this case, the idea is that we cannot know with certainty which model of a nested sub-set of models is "true". Therefore, instead of reporting the slopes and intercepts from the single "best" model that based on parsimony, we should report "averaged" values of slopes and intercepts. These averages will take into account all of the possible nested subset of models in which those slopes and intercepts could have been calculated. This averaging procedure can produce slope and intercept estimates (as well as estimates of their uncertainty) that are less biased, and can perhaps yield better predictions of future data.
We will use the `MuMIn` package (Multimodel Inference) to do model averaging later, but first we will do it manually.
### Required calculations
Recall that to come up with averaged estimates of model parameters (e.g., model-averaged slopes) we need to calculate weighted averages. These averages are weighted by how well sub-models explain the data. Following lecture, we will use the corrected $AIC$, noted as $AIC_c$, to calculate how well a model explains the data.
$$\text{AIC}_c = - 2ln(\text{Model Likelihood}) + 2k + \frac{2K(K+1)}{n-K-1}$$
where $n$ is the number of data observations. Then, to calculate the weights we need to see how much each sub-model deviates from the best model. For this deviation we calculate, for sub-model $i$:
$$\Delta \text{AIC}_{c,i} = \text{AIC}_{c,\text{best}} - \text{AIC}_{c,i}$$
The weight of sub-model $i$ is:
$$w_i = \frac{\text{exp}(-\Delta \text{AIC}_{c,i} / 2)}{\sum_{r=1}^{R} \text{exp}(-\Delta \text{AIC}_{c,r} / 2)} $$
And $R$ is the number of submodels being examined.
We are now ready to calculate the weighted average of any parameter of interest in the full model, $\theta$:
$$\hat{\bar{\theta}} = \sum_{r=1}^{R} w_r \hat{\theta}_{r}$$
Here, $\hat{\bar{\theta}}$ is the model-averaged estimate of parameter $\theta$, $w_r$ is the weight of sub-model $r$, and $\hat{\theta}_r$ is the parameter estimate derived from sub-model $r$.
We can also calculate the new averaged uncertainty in the parameter estimate:
$$\hat{\text{var}}(\hat{\bar{\theta}}) = \sum_{r=1}^{R} w_r \left( \hat{\text{var}}(\hat{\theta})_r + (\hat{\theta} - \hat{\bar{\theta}})^2 \right) $$
Here, $\hat{\text{var}}(\hat{\bar{\theta}})$ is the standard error of the averaged model parameter, whereas $\hat{\text{var}}(\hat{\theta})_r$ is the standard error of model parameter $\theta$ estimated from sub-model $r$.
### Manual calculation
Let's see if we can manually calculate all of this from a less complex example model. Imagine our full model is a model that only has two input variables. We'll use our simulated data set from above. (Of course, we know this is a poor model, but we're just doing a case-study here.)
```{r}
# Full model:
full_mod = lm(y~1+X1+X2, data = my_df)
```
If this full model has two inputs, then the number of sub-models is $2^2 = 4$, which iteratively drop one or both input variables. Now, run each sub-model. I know, this is tedious.
```{r}
sub_m2 = update(full_mod, .~. -X2)
sub_m3 = update(full_mod, .~. -X1)
sub_m4 = update(full_mod, .~. -X2-X1) # Intercept only
# Store all models in a list for easy looping later:
model_list = list(
full_mod, sub_m2, sub_m3, sub_m4
)
# how many models?
n_mod = 4
```
To get the model-averaged slopes of inputs $x_1$ and $x_2$, we'll need to calculate $AIC_c$ values and model weights. We'll store calculations in arrays as much as possible, so we can loop through.
Let's start with $AIC_c$. Fortunately, there's a built-in function for this in the `MuMIn` package, but we'll do one manually first.
```{r}
# Extract neg-log-likelihood from full model:
nll_full = -1*logLik(full_mod)
# This is in a weird format, so we'll convert:
nll_full = as.numeric(nll_full)
k_full = 4 # two slopes + 1 intercept + residual sigma
# Calculate AIC_c
aic_c_full = 2*nll_full + 2*k_full + (2*k_full*(k_full + 1))/(n - k_full - 1)
aic_c_full
# Check with built-in
library(MuMIn)
AICc(full_mod)
# Now calculate all:
AICc_vec = NULL
for(i in 1:n_mod){
AICc_vec[i] = AICc(model_list[[i]])
}
AICc_vec
```
We can see that the 'best' model, according to $AIC_c$ is the `sub_m3`, which includes the intercept and only input $x_2$. This makes sense, because we know that the slope of $x_1$ was simulated as zero.
Let's now calculate the $\Delta \text{AIC}_c$.
```{r}
# Best AICc
AICc_best = min(AICc_vec)
#\Delta AIC_c
Delta_AICc_vec = AICc_vec - AICc_best
Delta_AICc_vec
```
Now we can calculate the model weights (i.e., the value representing how "good" each model is, relative to the best model).
```{r}
# Calculate the denominator of the weight calculation
weight_denom = 0
for(i in 1:n_mod){
weight_denom =
weight_denom +
exp( -Delta_AICc_vec[i] / 2 )
}
weight_denom
# Now the individual weights:
weight = NULL
for(i in 1:n_mod){
weight[i] =
exp( -Delta_AICc_vec[i] / 2) / weight_denom
}
weight
# Sum to 1?
sum(weight)
```
We can see the "better" models, based on $AIC_c$ have higher weights, and the `weight` vector should add to 1.
Let's calculate the model-averaged slope estimate for input $x_2$. To do this, we'll first need to extract the estimate from each sub-model. This is a little tedious, because we need to know which coefficient refers to $x_2$ in each sub-model object (or if the coefficient is absent and therefore equal to zero).
```{r}
coef_x2 = NULL
coef_x2[1] = coef(model_list[[1]])[3]
coef_x2[2] = 0 # Absent from this sub-model
coef_x2[3] = coef(model_list[[3]])[2]
coef_x2[4] = 0 # Absent from this sub-model
coef_x2
# Averaged, based on model weight:
avg_coef_x2 = 0
for(i in 1:n_mod){
avg_coef_x2 =
avg_coef_x2 +
weight[i] * coef_x2[i]
}
avg_coef_x2
```
We can see the model-averaged slope estimate for input $x_2$ is slightly less than the estimate from the full model.
```{r}
summary(full_mod)
coef(full_mod)[3]
```
I'll leave calculating the model-averaged standard error of the slopes as an exercise for you as a student.
As I mentioned above, fortunately someone created a package to do this model averaging for us and remove a lot of the tedium.
First, run all sub-models using the `MuMIn::dredge()` function.
```{r}
# Required for MuMIn::dredge functionality
options(na.action = "na.fail")
# Fit all sub-models:
dredge_test = dredge(full_mod)
dredge_test
```
See how this output has run all sub-models, calculated the likelihoods, the $AIC_c$, the $\Delta AIC_c$, and the model weights.
Now, we can average all of the models.
```{r}
# Average the models:
test_average = model.avg(dredge_test, fit = TRUE)
summary(test_average)
```
This `summary()` statement shows the model-averaged values of slopes of $x_1$ and $x_2$, and the intercept. We care about the "full average". You can see the averaged estimate of the slope for $x_2$ matches our manual calculation. For emphasis:
```{r}
test_average$coefficients[1,2];
avg_coef_x2
```
### Back to more complex model
Ok, but our full model had four input variables, which means the number of sub-models is $4^2 = 16$. Let's not do that manually, but instead use the `MuMIn::model.avg()` function.
```{r}
# Reminder, m1 was our full model:
summary(m1)
# Fit all sub-models:
dredge_m1 = dredge(m1)
# Average the models:
m1_average = model.avg(dredge_m1, fit = TRUE)
summary(m1_average)
```
Notice how we see the model with inputs $x_2$ and $x_4$ is the best, based on $AIC_c$ (note this is the model labeled as `24` meaning it inclues inputs 2 and 4).
We can also plot the model parameter estimates with their confidence intervals:
```{r}
# Plot the coefficient estimates (from the averaged model)
plot(m1_average)
```
Finally, we can use the `predict()` function as we have before to visualize the effect of each input variable on the outcome. Here, we will show the independent, model-averaged effect of $x_2$, when all other input variables are held at their average values. Then, we'll do the same for $x_4$.
```{r}
# Predict from the average model:
# How does y change as a function of x2, while
# other inputs held at their average?
new_df = data.frame(
X1 = rep(mean(my_df$X1), 100),
X2 = seq(0, 19, length.out = 100),
X3 = rep(mean(my_df$X3), 100),
X4 = rep(mean(my_df$X4), 100)
)
pred_m1_avg_x2 =
predict(m1_average,
newdata = new_df,
se.fit = TRUE)
plot(my_df$y ~ my_df$X2,
xlab = "input x2", ylab = "y", pch = 19)
lines(pred_m1_avg_x2$fit ~ new_df$X2)
lines(pred_m1_avg_x2$fit-2*pred_m1_avg_x2$se.fit ~ new_df$X2, lty = 2)
lines(pred_m1_avg_x2$fit+2*pred_m1_avg_x2$se.fit ~ new_df$X2, lty = 2)
# Predict from the average model:
# How does y change as a function of x4, while
# other inputs held at their average?
new_df = data.frame(
X1 = rep(mean(my_df$X1), 100),
X2 = rep(mean(my_df$X2), 100),
X3 = rep(mean(my_df$X3), 100),
X4 = seq(0, 19, length.out = 100)
)
pred_m1_avg_x4 =
predict(m1_average,
newdata = new_df,
se.fit = TRUE)
plot(my_df$y ~ my_df$X4,
xlab = "input x4", ylab = "y", pch = 19)
lines(pred_m1_avg_x4$fit ~ new_df$X4)
lines(pred_m1_avg_x4$fit-2*pred_m1_avg_x4$se.fit ~ new_df$X4, lty = 2)
lines(pred_m1_avg_x4$fit+2*pred_m1_avg_x4$se.fit ~ new_df$X4, lty = 2)
```
Another, perhaps simpler way to vizualize how well a model matches the data is to plot the model predictions of the data versus the observed data. We can even compare this to the non-averaged model.
```{r}
raw_predict_avg = predict(m1_average)
raw_predict_nonavg = predict(m1)
raw_predict_bad = predict(sub_m2)
plot(my_df$y ~ raw_predict_avg,
xlab = "Model Prediction", ylab = "Data, y", pch = 19, col = "red")
points(my_df$y ~ raw_predict_nonavg, pch = 19, col = "black")
points(my_df$y ~ raw_predict_bad, pch = 19, col = "orange")
# 1-to-1 line
abline(a = 0, b = 1)
```
It is hard to see, and likely not significant in this case, but the red points (model-averaged) tend to be closer to the 1:1 line, compared to the black points, meaning the averaged model makes slightly better predictions of the observed data. What is more clear, is that the "bad" model (which only included covariate $x_1$), does not match the 1:1 line at all; it's more of a shot-gun of points. Therefore, this clearly indicates the model is not predictive of the $y$ data. This is a good visualization of how well your models' within-sample prediction (i.e., how close the model predictions of observed data match the observed data).