-
Notifications
You must be signed in to change notification settings - Fork 0
/
spr_analysis_colloc.jmd
271 lines (207 loc) · 12 KB
/
spr_analysis_colloc.jmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
# Modeling: SPR Modifier-Noun Collocations
## Setup
```julia results = "hidden"
using MixedModels
using CSV, DataFrames, DataFramesMacros
using StatsBase
using RCall, JellyMe4
using Random
R"""
library(lme4)
library(tidyverse)
library(effects)
library(lattice)
library(car)
library(corrplot)
library(ggeffects)
library(LMERConvenienceFunctions)
library(viridis)
""";
```
## Stimuli & Theoretical background
Background:
- previous experiment suggested that backwards TP predicts reading times to modifier-noun bigrams
- this effect developed over the spillover region (noun and two following words), with strongest effect on the word directly following the bigram (spillover 1)
- collinearity issues in previous experiment; found that a good part of the effect was attributable to first / single word frequencies and bigram frequency
Stimuli sets (SPR):
A: absolute control
B: absolute silence
(A & B are automatically extracted for the same first word and a maximally similar log bigram frequency)
C: total control
D: total silence
(C & D were constructed from a synonymous W1)
A & B control for iosyncratic features of W1 (not just word frequency, but also idiosyncratic features of that word in isolation, such as imageability, orthographic neighbors, etc.) They are matched on bigram frequency so that we can disentangle the effect of the entire chunk freq vs. the effect of its individual component words and their association to each other.
C & D control for idiosyncratic features of word 2, but are not matched on bigram frequency.
## Model
Hypothesis: Backwards transition probability (BTP_lz) and/or forward transition probability (FTP_lz) will predict log-transformed RTs over the spillover region (noun, spillover_1, spillover_2)
Further: BTP and FTP will interact with age (age_z), reading experience (reading_exp_z), and position -- potentially all four will enter into an interaction
Predictors of interest:
- BTP_lz: log-transformed backward transition probability of words in bigram (z-scored)
- FTP_lz: log-transformed forward transition probability of words in bigram (z-scored)
- age_z: age in years (z-scored)
- reading_exp_z: added score from Words that Go Together task, Author Recognition task and synonym-based vocabulary task, then z-scored
Other covariates:
- length_z: length of current word (z-scored)
- `prev_length_z`: length of previous word (z-scored)
- `trial_number_z`: trial number (z-scored)
- education: level of education (4 levels from high school to graduate degree)
## Load data
- Preprocessed in R (file: spr_preparer.Rmd):
- Added trial number
- Added item IDs and corpus-derived stats
- Log-transformed RTs and word frequencies using log()
- Centered and scaled numeric predictors using scale() -> new variables end in _z or _lz (log/z-score transformed)
- Removed RTs < 100ms and outside of 2.5SD of each participant's mean
- Extracted reading times to the critical region only (noun, spillover_1 and spillover_2)
```julia
spr_critregion = CSV.read(joinpath(@__DIR__,"data/critregion_exp.txt"), delim="\t", header = 1, DataFrame);
```
## Correlation between predictors
```julia results = "hidden"
@rput spr_critregion
R"""
spr_critregion %>%
select(bfreq_lz, freq_lz, prev_freq_lz, BTP_lz, FTP_lz, length_z, prev_length_z) %>%
cor(method="spearman") %>%
corrplot(method="number", type="lower")
"""
R"""
spr_critregion %>%
distinct(id, .keep_all=TRUE) %>%
select(vocab_z, wgt_z, art_z, reading_exp_z, age_z) %>%
drop_na() %>%
cor(method = "spearman") %>%
corrplot(method="number", type="lower")
"""
```
Correlation between predictors is only moderate. In any case, correlation should cause standard errors to go up (and t-values to go down), resulting in a loss of power, but it shouldn't affect coefficients.
However, the inclusion of current word frequencies has a different meaning based on position -- in the critical position (the noun), current and previous word frequencies are vital parts of the calculation of forward and backward TP. In the spillover region, on the other hand, this is greatly reduced (previous word freq. of spillover_1 is part of back TP, but all others are not used in TP calculations). For this reason, we use word length to proxy word frequency.
Correlation between experience-based measures (and age) stronger, as would be expected. Thus we create an overall measure of reading experience taken by adding the z-scores of the three reading assessments.
### Set contrasts
```julia results = "hidden"
cntrsts = merge(
Dict(:position => EffectsCoding(base="spillover_2"),
:education => HelmertCoding(levels=["High school", "Trade school", "Undergraduate", "Grad school"]),
:origin => DummyCoding(base="USA"),
:id => Grouping(),
:w1 => Grouping(),
:w2 => Grouping())
);
```
## Maximal model critical word
Partially-crossed random effects on W1 and W2 to distinguish between the different types of control in item design. This should also take care of any differences between pairings, like differences in average bigram frequency, but also control for variability in the quality of synonymy and the lexical characteristics of each word.
Step 1 - Initial (essentially) maximal models:
```julia results = "hidden"
formula_maximal_ftp = @formula (logRT ~ 1 + FTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + length_z + prev_length_z + education +
(1 + FTP_lz + trial_number_z + word_number_z + length_z + prev_length_z | id) +
(1 + trial_number_z + reading_exp_z * age_z | w1) +
(1 + trial_number_z + reading_exp_z * age_z | w2));
formula_maximal_btp = @formula (logRT ~ 1 + BTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + length_z + prev_length_z + education +
(1 + BTP_lz + trial_number_z + word_number_z + length_z + prev_length_z | id) +
(1 + trial_number_z + reading_exp_z * age_z | w1) +
(1 + trial_number_z + reading_exp_z * age_z | w2));
#maximal_ftp = fit(MixedModel, formula_maximal_ftp, spr_critregion, contrasts = cntrsts);
#maximal_ftp.rePCA
#VarCorr(maximal_ftp)
#maximal_btp = fit(MixedModel, formula_maximal_btp, spr_critregion, contrasts = cntrsts);
#maximal_btp.rePCA
#VarCorr(maximal_btp)
```
To avoid introducing differences between W1/W2 random effects and between age/reading_exp, age * reading_exp slope was removed on both the words and the effects on their own as well. These terms were already not very informative and keeping one of the two as a random slope or keeping them on one word but not the other would have introduced an imbalance in the model structure.
Also removed the TPs from id to allow for the same random effects structure as they were already not very informative, and word_number was also not very informative.
Step 2 - Reduced random effect structure:
```julia
formula_btp = @formula (logRT ~ 1 + BTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + length_z + prev_length_z + education +
(1 + trial_number_z + length_z + prev_length_z | id) +
(1 + trial_number_z | w1) +
(1 + trial_number_z | w2));
formula_ftp = @formula (logRT ~ 1 + FTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + length_z + prev_length_z + education +
(1 + trial_number_z + length_z + prev_length_z | id) +
(1 + trial_number_z | w1) +
(1 + trial_number_z | w2));
region_ftp = fit(MixedModel, formula_ftp, spr_critregion, contrasts=cntrsts);
region_btp = fit(MixedModel, formula_btp, spr_critregion, contrasts=cntrsts);
show(region_ftp)
show(region_btp)
```
## Plot interactions using R
```julia results = "hidden"
region_ftp_r = (region_ftp, spr_critregion); #tuple that wraps the fitted model and the dataframe
@rput region_ftp_r;
region_btp_r = (region_btp, spr_critregion); #tuple that wraps the fitted model and the dataframe
@rput region_btp_r;
R"summary(spr_critregion$age)"
R"""
spr_critregion %>%
filter(age %in% c(18, 31, 54, 76)) %>%
distinct(age, age_z) %>%
arrange(age)
"""
R"""
ftp_region <- plot(ggpredict(region_ftp_r, terms = c("FTP_lz", "position")), facet = T, colors = "viridis")
ggsave("fig/Figure8_int_ftp_pos.jpg", ftp_region, height = 4, width = 7, units = "in")
"""
R"""
int_age <- plot(ggpredict(region_ftp_r, terms = c("FTP_lz", "age_z [-1.4285578, -0.3035959, 1.6867213, 3.5905030]")), colors = "viridis") + scale_fill_viridis(name = "age", discrete = TRUE, labels = c("18", "31", "54", "76")) + scale_color_viridis(name = "age", discrete = TRUE, labels = c("18", "31", "54", "76"))
ggsave("fig/Figure9_int_ftp_age.jpg", int_age, height = 7, width = 7, units = "in")
"""
R"""
ftp_read <- plot(ggpredict(region_ftp_r, terms = c("FTP_lz", "reading_exp_z [quart]")), colors = "viridis")
ggsave("fig/Figure10_int_ftp_read.jpg", ftp_read, height = 7, width = 7, units = "in")
"""
R"""
btp_position <- plot(ggpredict(region_btp_r, terms = c("BTP_lz", "position")), facet = T, colors = "viridis")
ggsave("fig/Figure11_int_btp_pos.jpg", btp_position, height = 4, width = 7, units = "in")
"""
R"""
btp_age <- plot(ggpredict(region_btp_r, terms = c("BTP_lz", "age_z [-1.4285578, -0.3035959, 1.6867213, 3.5905030]")), colors = "viridis") + scale_fill_viridis(name = "age", discrete = TRUE, labels = c("18", "31", "54", "76")) + scale_color_viridis(name = "age", discrete = TRUE, labels = c("18", "31", "54", "76"))
ggsave("fig/Figure12_int_btp_age.jpg", btp_age, height = 7, width = 7, units = "in")
"""
R"""
btp_read <- plot(ggpredict(region_btp_r, terms = c("BTP_lz", "reading_exp_z [quart]")), colors = "viridis")
ggsave("fig/Figure13_int_btp_read.jpg.jpg", btp_read, height = 7, width = 7, units = "in")
"""
```
## Model diagnostics
```julia results = "hidden"
R"""
library(performance)
check_model(region_btp_r)
"""
R"""
check_model(region_ftp_r)
"""
```
## Post-hoc LMM: Nationality
To assure that nationality is not a source of significant interparticipant difference, in particular that the FTP and BTP scores are are not processed in a significantly different way by US and British nationals, we conducted the following post-hoc LMM on 37 matched participants.
```julia
formula_btp_nation = @formula (logRT ~ 1 + BTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + length_z + prev_length_z + education + BTP_lz * origin +
(1 + trial_number_z + length_z + prev_length_z | id) +
(1 + trial_number_z | w1) +
(1 + trial_number_z | w2));
formula_ftp_nation = @formula (logRT ~ 1 + FTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + length_z + prev_length_z + education + FTP_lz * origin +
(1 + trial_number_z + length_z + prev_length_z | id) +
(1 + trial_number_z | w1) +
(1 + trial_number_z | w2));
ftp_nation = fit(MixedModel, formula_ftp_nation, spr_critregion, contrasts=cntrsts);
btp_nation = fit(MixedModel, formula_btp_nation, spr_critregion, contrasts=cntrsts);
show(ftp_nation)
show(btp_nation)
```
## Post-hoc LMM: Freq v. word length
```julia
formula_btp_f = @formula (logRT ~ 1 + BTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + freq_lz + prev_freq_lz + education +
(1 + trial_number_z + freq_lz + prev_freq_lz | id) +
(1 + trial_number_z | w1) +
(1 + trial_number_z | w2));
formula_ftp_f = @formula (logRT ~ 1 + FTP_lz * reading_exp_z * age_z * position + trial_number_z + word_number_z + freq_lz + prev_freq_lz + education +
(1 + trial_number_z + freq_lz + prev_freq_lz | id) +
(1 + trial_number_z | w1) +
(1 + trial_number_z | w2));
ftp_freq = fit(MixedModel, formula_ftp_f, spr_critregion, contrasts=cntrsts);
btp_freq = fit(MixedModel, formula_btp_f, spr_critregion, contrasts=cntrsts);
show(ftp_freq)
show(btp_freq)
mods = [region_ftp, ftp_freq, region_btp, btp_freq];
gof_summary = DataFrame(name = [:region_ftp, :ftp_freq, :region_btp, :btp_freq], dof=dof.(mods), deviance=deviance.(mods), AIC = aic.(mods), AICc = aicc.(mods), BIC = bic.(mods))
```