-
Notifications
You must be signed in to change notification settings - Fork 0
/
mvabund_model_sel.Rmd
359 lines (322 loc) · 13.9 KB
/
mvabund_model_sel.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
---
title: "mvabund model selection"
author: "JBW"
date: "12/7/2021"
output: pdf_document
---
First, load the libraries
```{r}
library(zoo) # as.Date
library(mvabund) #manyglm, etc
library(vegan) #rda
library(ggplot2)
library(ggfortify)
library(ggpubr) #ggarrange
library(gridExtra) ##grid.arrange
library(grid) #textGrob
library(reshape2)
```
Then, load the data
```{r}
# Load data
data = read.csv("data/combined_counts.csv")
data$Date = as.Date(data$Date,"%m/%d/%y")
data$Depth_ctgry = factor(data$Depth_ctgry)
# Pull out only the columns of plankton counts
plankton_counts = data[,8:18]; plankton_counts$copepods = data$copepods
# Make smaller data set omitting 3 rarest groups (fish, crabs, shrimp)
plankton_common = cbind(plankton_counts[,2:6],plankton_counts[,9:12])
# Create mvabund object for common plankton groups
plankton_mv = mvabund(plankton_common)
```
To avoid redundant predictor terms, run quick PCA on physical data
```{r}
data_waterqual = as.data.frame(cbind(data$DO,data$temp_C,data$turbidity,
data$salinity,data$chlorophyll,data$DOM,data$BGA))
colnames(data_waterqual) = c("DO","temp","turbidity","sal","chlor","DOM","BGA")
water.pca = rda(na.omit(data_waterqual), scale=T) #different usits, have to scale
#summary(water.pca)
screeplot(water.pca,type="l")
biplot(water.pca,type="text")
```
Turbidity, chlorophyll, and blue-green algae all (understandably) co-vary. So, we will only include 1 of the three in our model. We will stick with chlorophyll, since it is commonly measured in other studies.
Now, model comparison is best if run on the same dataset (AKA NA values have to be dealt with). So, we will run linear interpolation to fill in the 16 instances of NA's in chlorophyll and DOM.
```{r}
#create new object and use linear interpolation to fill NAs
data_nona = data
data_nona$chlorophyll = na.approx(data_nona$chlorophyll,method="linear",na.rm=F)
data_nona$DOM = na.approx(data_nona$DOM,method="linear",na.rm=F)
```
Now for model selection...
First, run the full model (omitting pH due to QC concerns), allowing for all possible 2-way interactions between factors, and all possible 2-way interactions between continuous variables
```{r}
glm1 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$salinity +
data_nona$DO:data_nona$chlorophyll +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity +
data_nona$temp_C:data_nona$chlorophyll +
data_nona$temp_C:data_nona$DOM +
data_nona$salinity:data_nona$chlorophyll +
data_nona$salinity:data_nona$DOM +
data_nona$chlorophyll:data_nona$DOM,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
#for some reason, the above model would not run until I stopped allowing interactions between the factor variables (site, depth category) and the numeric ones...
plot(glm1) #check residuals
glm1 #get summary info
```
Then use drop1 function to test models with fewer terms --> 2022 edit: double check that p-values agree
```{r}
#drop_tests = drop1(glm1,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_tests
model_glm1 = anova.manyglm(glm1) #took 29 min
model_glm1
```
This suggests dropping the interaction between chlorophyll and DOM... (AND p-vals agree!)
So run the model again, dropping the term that yielded the lowest AIC
```{r}
glm2 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$salinity +
data_nona$DO:data_nona$chlorophyll +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity +
data_nona$temp_C:data_nona$chlorophyll +
data_nona$temp_C:data_nona$DOM +
data_nona$salinity:data_nona$chlorophyll +
data_nona$salinity:data_nona$DOM,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
plot(glm2) #check residuals
#drop_test_2 = drop1(glm2,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_test_2
model_glm2 = anova.manyglm(glm2) #took 26 min
model_glm2
```
Since we still have a lot of terms, we will go ahead and toss out the interaction term with the next lowest AIC: the interaction between DO and salinity.
EDIT: By p-values, we will instead toss out temp*chlor
```{r}
glm3 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$salinity +
data_nona$DO:data_nona$chlorophyll +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity +
data_nona$temp_C:data_nona$DOM +
data_nona$salinity:data_nona$chlorophyll +
data_nona$salinity:data_nona$DOM,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
plot(glm3) #check residuals
#drop_test_3 = drop1(glm3,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_test_3
model_glm3 = anova.manyglm(glm3) #took 23 min
model_glm3
```
2022 EDIT: NOW ALL REMAINING TERMS RESULT IN HIGHER AICs! THIS SUGGESTS THAT GLM3 IS THE BEST MODEL FOR THE COMMUNITY >> 1/7: But! salinity*chlorophyll has the lowest p-value. So out it goes.
```{r}
#model_2022 = anova.manyglm(glm3) #took 23 min
#model_2022
```
Again. since we still have a lot of terms, we will go ahead and toss out the interaction term with the next lowest AIC: the interaction between salinity and chlorophyll. >> 2022 EDIT! THE AICs WERE WORSE, BUT THE FINAL COMMUNITY MODEL STILL HAS NON-SIGNIFICANT TERMS...... SO LET'S TOSS TEMPxCHLOR
^ ignore the edit. We are again tossing sal*chlor (nice that we are achieving consistency!)
```{r}
glm4 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$salinity +
data_nona$DO:data_nona$chlorophyll +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity +
data_nona$temp_C:data_nona$DOM +
data_nona$salinity:data_nona$DOM,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
plot(glm4) #check residuals
#drop_test_4 = drop1(glm4,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_test_4
model_glm4 = anova.manyglm(glm4) #took 21 min
model_glm4
```
At this point, no single (dropped) term can make the model significantly better. However, we will keep dropping the lowest AIC value just to see if we get to another point when the model improves again. >> 2022 EDIT: SAME REASONING, BUT *NOW* DROPPING SALxCHLOR, WHICH GETS UP BACK TO THE AIC FROM GLM3
^ Edit: Now we drop temp*DOM
```{r}
glm5 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$salinity +
data_nona$DO:data_nona$chlorophyll +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity +
data_nona$salinity:data_nona$DOM,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
plot(glm5) #check residuals
#drop_test_5 = drop1(glm5,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_test_5
model_glm5 = anova.manyglm(glm5) #took 25 min
model_glm5
```
...It only got worse... So we will stick with glm4, which kept the term dropped in glm5. >> 2022 EDIT, GLM5 LOOKS GOOD
So next, we get the univariate tests...
########
```{r}
model1 = anova.manyglm(glm4,p.uni = "adjusted") #took 26 min
model1 #get p-values
#saveRDS(model1,"results/GLM_model_12.5.21.rds")
```
In the full output, we see that the interaction between temperature and chlorophyll is not significant in either the multivariate test or in any univariate tests. Therefore, we will try the same test for glm5 (when we dropped that term and the AIC stayed the same).
^ Edit, now we remove salinity*DOM
```{r}
#model2 = anova.manyglm(glm5) #took 19 min
#model2 = readRDS("results/GLM_model2_12.5.21.rds")
#model2 #get p-values
#saveRDS(model2,"results/GLM_model2_12.5.21.rds")
```
Just out of curiousity, what happens when we drop all interaction terms...
```{r}
glm6 = manyglm(plankton_mv ~ data_nona$Site +
data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
plot(glm6) #check residuals
drop_test_6 = drop1(glm6,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
drop_test_6
```
2022: Let's keep dropping terms from glm5 to see what happens! Specifically, let's drop temp*DOM
^ Edit: as stated above, we are dropping sal*DOM
```{r}
glm7 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$salinity +
data_nona$DO:data_nona$chlorophyll +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
plot(glm7) #check residuals
#drop_test_5 = drop1(glm5,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_test_5
model_glm7 = anova.manyglm(glm7) #took 21 min
model_glm7
```
2022: The AICs are all worse, but let's see if the model still has non-significant terms...
```{r}
#model3 = anova.manyglm(glm7) #took 17.5 min
#model3 #get p-values
```
2022: Again, we have terms that are borderline significant (sal*DOM) so let's drop them!
^ Edit: now we are dropping DO*sal
```{r}
glm8 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$chlorophyll +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI") #default
plot(glm8) #check residuals
#drop_test_8 = drop1(glm8,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_test_8
model_glm8 = anova.manyglm(glm8) #took 15 min
model_glm8
```
2022: The AICs are still getting worse, but let's go ahead and see whether the remaining terms are all significant.
```{r}
#model5 = anova.manyglm(glm8) #took 15 min
#model2 = readRDS("results/GLM_model2_12.5.21.rds")
#model5 #get p-values
#saveRDS(model2,"results/GLM_model2_12.5.21.rds")
glm8_sum <- summary(glm8) #still has marginal DO*chlor
glm8_sum
```
2022: Out of curiosity, what happens when we ditch that DO*chlor term.....
^ Edit: And we re-converge. So, agreed to drop DO*chlor, since it is marginally significant.
```{r}
glm9 = manyglm(plankton_mv ~ data_nona$Depth_ctgry +
data_nona$DO +
data_nona$temp_C +
data_nona$salinity +
data_nona$chlorophyll +
data_nona$DOM +
data_nona$DO:data_nona$temp_C +
data_nona$DO:data_nona$DOM +
data_nona$temp_C:data_nona$salinity,
family="negative_binomial", #better for counts
nBoot = 1000, #default
k=1, #default
theta.method = "PHI", #default
show.coef = F)
plot(glm9) #check residuals
#drop_test_9 = drop1(glm9,test="Chisq",trace=T) #likelihood ratio test recommended for GLMs
#drop_test_9
model_glm9 = anova.manyglm(glm9) #took 13 min
#model2 = readRDS("results/GLM_model_01.05.22.rds")
model_glm9 #get p-values
#glm9_sum <- summary.manyglm(glm9) #still has marginal DO*chlor
#glm9_sum
```
Ok, let's try to dig out the deviance from these things
```{r}
glm8_9_aov <- anova(glm9, glm8,
p.uni = 'adjusted', test = 'LR')
glm8_9_aov
```
```{r}
```