/
Australia_trends.Rmd
375 lines (285 loc) · 16.5 KB
/
Australia_trends.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
---
title: "How did Australia do in the PISA study"
author: "The Freemasons"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
fig_height: 10
fig_width: 14
number_sections: true
vignette: >
%\VignetteIndexEntry{How did Australia do in the PISA study}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = FALSE,
message = FALSE,
error = FALSE,
outwidth = "100%",
fig.width = 8,
fig.height = 6)
```
# Introduction
The purpose of this article is to explore some of the variables that influenced Australia's performance in PISA study. Note that this is an observational study (as oppose to controlled experiment), and we are inferring on factors that are correlated with academic performance rather than specific causes.
# Loading the packages and data
```{r}
#loading the data and libraries
library(learningtower)
library(tidyverse)
library(lme4)
library(ggfortify)
library(sjPlot)
library(patchwork)
library(ggrepel)
library(kableExtra)
student <- load_student("all")
data(countrycode)
theme_set(theme_classic(18)
+ theme(legend.position = "bottom"))
```
# Visualise predictors over time
Since we are expecting some time variations in the data, let's quickly visualize the time trends.
```{r}
#filtering the data for Australia
aus_data = student %>%
dplyr::filter(country %in% c("AUS")) %>%
dplyr::mutate(mother_educ = mother_educ %>% fct_relevel("less than ISCED1"),
father_educ = father_educ %>% fct_relevel("less than ISCED1"))
```
## Numeric variables
A boxplot is a standardized method of presenting data distribution. It informs whether or not our data is symmetrical. Box plots are important because they give a visual overview of the data, allowing researchers to rapidly discover mean values, data set dispersion, and skewness. In this data we visualize the numeric distribution across the years via boxplots.
```{r, fig.height = 9, fig.width = 15}
# plotting the distribution of numeric variables via boxplots
aus_data %>%
select(where(is.numeric)) %>%
bind_cols(aus_data %>% select(year)) %>%
pivot_longer(cols = -year) %>%
ggplot(aes(x = year, y = value,
colour = year)) +
geom_boxplot() +
facet_wrap(~name, scales = "free_y") +
theme(legend.position = "none") +
labs(x = "Year",
y = "",
title = "The distribution of numerical variables in the student dataset over all years")
```
## Factor variables
Missing data is a common issue that data professionals must deal with on a daily basis. In this section we visualize the number of missing values across the years for all the factor variables in the student dataset.
```{r, fig.height = 15, fig.width = 15}
#checking the missing values in the factor variables of the data
aus_fct_plotdata = aus_data %>%
select(where(is.factor)) %>%
dplyr::select(-country, -school_id, -student_id) %>%
pivot_longer(cols = -year) %>%
group_by(year, name, value) %>%
tally() %>%
dplyr::mutate(
value = coalesce(value, "missing"),
percent = n/sum(n),
year = year %>% as.character() %>% as.integer()) %>%
group_by(name, value) %>%
dplyr::mutate(last_point = ifelse(year == max(year), as.character(value), NA))
aus_fct_plotdata %>%
ggplot(aes(x = year, y = percent,
label = last_point,
group = value)) +
geom_point() +
geom_line() +
geom_label_repel(direction = "both", nudge_x = 3, seed = 2020, segment.size = 0) +
facet_wrap(~name, scales = "free_y", ncol = 3) +
scale_x_continuous(breaks = c(2000, 2003, 2006, 2009, 2012, 2015, 2018)) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Year",
y = "Percentage of missing values",
title = "Missing values in the student dataset's factor variables")
```
We initially investigate the most current 2018 data before generalizing the models/results into any patterns due to the quantity of missing values in the data in previous years and also to decrease the time complexity in modeling.
# Linear regression model for the 2018 study
Linear regression analysis predicts the value of one variable depending on the value of other variables. Because they are well known and can be trained rapidly, linear regression models have become a effective way of scientifically and consistently predicting the future.
We begin by doing a basic data exploration using linear regression models. To begin, we fit three linear models (one for each subject of math, reading, and science) to the 2018 Australian data to gain an understanding of the key variables that may be impacting test scores.
We filter the student data (we use the complete student data obtained by the load student("all") function) to pick the scores in Australia and re level some variables for further analyses.
```{r}
#filtering the data to Australia, defining the predictors and selecting the scores
student_predictors = c("mother_educ", "father_educ", "gender", "internet",
"desk", "room", "television", "computer_n",
"car", "book", "wealth", "escs")
student_formula_rhs = paste(student_predictors, collapse = "+")
aus2018 = aus_data %>%
filter(year == "2018") %>%
dplyr::select(
math, read, science,
all_of(student_predictors)) %>%
na.omit()
```
## Checking correlation matrix of the numeric variables
A correlation matrix is a table that displays the coefficients of correlation between variables. Each cell in the table represents the relationship between two variables.
```{r}
#correlation matrix for the numeric variables
aus2018 %>%
select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs") %>%
round(2) %>%
kbl(caption = "Correlation Matrix") %>%
kable_styling(full_width = NULL,
position = "center",
bootstrap_options = c("hover", "striped"))
```
## Fitting three linear models
```{r}
#fitting linear models for the three subjects maths, reading and science
aus2018_math = lm(formula = as.formula(paste("math ~ ", student_formula_rhs)) , data = aus2018)
aus2018_read = lm(formula = as.formula(paste("read ~ ", student_formula_rhs)) , data = aus2018)
aus2018_science = lm(formula = as.formula(paste("science ~ ", student_formula_rhs)) , data = aus2018)
sjPlot::tab_model(aus2018_math, aus2018_read, aus2018_science,
show.ci = FALSE, show.aic = TRUE, show.se = TRUE,
show.stat = TRUE,
show.obs = FALSE)
```
Some interesting discoveries from these models:
1. All three response variables seem to be influenced by the same set of factors.
2. Father's education level (`father_educ`) seems to have a much stronger effect than mother's education level (`mother_educ`).
3. While most estimates agree in signs across the three subjects, the most notable exception to this is `gender`, where girls tend to perform better than boys in reading.
4. The most influential predictors are those associated with socioeconomic status (`escs`) and education (`book`). A number of variables that should not be directly causal to academic performance also showed up as significant. This is likely due to their associations with socio-economic status.
Upon checking the classical diagnostic plots of these models, we see no major violation on the assumptions of linear models. The large amount of variations in the data may help to explain why the models only has a moderately low $R^2$ values (~ 0.20).
```{r, fig.height = 30, fig.width = 12}
#plotting the outcome of linear models
autoplot(aus2018_math) + labs(title = "2018 Australia maths model") +
autoplot(aus2018_read) + labs(title = "2018 Australia read model") +
autoplot(aus2018_science) + labs(title = "2018 Australia science model")
```
# Linear mixed model
Linear mixed models are a subset of simple linear models that allow for both fixed and random effects.
We already know that the socio-economic status (SES) of a student is often the most influential predictor and it is likely that students with similar SES will attend the same schools in their neighborhood and receive similar level of quality of education from the same teachers.
Thus, it is likely that there will be a grouping effect on the students if they attended the same school. This would imply that some observations in our data are not independent observations.
By building random effects in our linear model, that is building a linear mixed model, we should be able to produce a model with better fit if we consider this grouping effect of schools into our model.
```{r}
# joining school and student data, building a linear mixed model
lmm2018 = aus_data %>%
filter(year == "2018") %>%
select(
school_id,
math, read, science,
all_of(student_predictors)) %>%
na.omit()
lmm2018_math = lmer(formula = as.formula(paste("math ~ ", student_formula_rhs, "+ (escs | school_id)")), data = lmm2018)
lmm2018_read = lmer(formula = as.formula(paste("read ~ ", student_formula_rhs, "+ (escs | school_id)")), data = lmm2018)
lmm2018_science = lmer(formula = as.formula(paste("science ~ ", student_formula_rhs, "+ (escs | school_id)")), data = lmm2018)
sjPlot::tab_model(lmm2018_math, lmm2018_read, lmm2018_science,
show.ci = FALSE, show.aic = TRUE, show.se = TRUE,
show.stat = TRUE,
show.obs = FALSE)
```
We see that the linear mixed model improved on the fit of the model, as judged by the AIC.
```{r}
# subtracting AIC values of the two models
bind_cols(
AIC(aus2018_math) - AIC(lmm2018_math),
AIC(aus2018_read) - AIC(lmm2018_read),
AIC(aus2018_science) - AIC(lmm2018_science)
) %>%
rename(maths = ...1,
read = ...2,
science = ...3) %>%
kbl(caption = "AIC Values") %>%
kable_styling(full_width = NULL,
position = "center",
bootstrap_options = c("hover", "striped"))
```
# Integrating with `school` data
We now take this dataset on students and merge it with some variables from the `school` data which is also a part of this `learningtower` package. This allows us to gain more access to the school level variables this is helpful in modelling the data.
```{r}
#taking into account the school dataset variables and fitting a linear mixed model
selected_vars = c("father_educ", "gender", "internet",
"desk", "computer_n", "car",
"book", "wealth", "escs")
data(school)
aus_school_2018 = school %>%
dplyr::filter(country == "AUS", year == "2018") %>%
dplyr::mutate(school_size = log10(school_size)) %>% ## We take the log due to the scale
dplyr::select(-year, -country, -contains("fund"), -sch_wgt)
lmm2018_sch = lmm2018 %>%
left_join(aus_school_2018, by = c("school_id")) %>% na.omit()
school_predictors = c("stratio", "public_private", "staff_shortage", "school_size")
school_formula_rhs = paste(school_predictors, collapse = "+")
lmm2018_sch_math = lmer(formula = as.formula(paste("math ~ ", student_formula_rhs, "+ (escs | school_id) + ",
school_formula_rhs)), data = lmm2018_sch)
lmm2018_sch_read = lmer(formula = as.formula(paste("read ~ ", student_formula_rhs, "+ (escs | school_id) + ",
school_formula_rhs)), data = lmm2018_sch)
lmm2018_sch_science = lmer(formula = as.formula(paste("science ~ ", student_formula_rhs, "+ (escs | school_id) + ",
school_formula_rhs)), data = lmm2018_sch)
sjPlot::tab_model(lmm2018_sch_math, lmm2018_sch_read, lmm2018_sch_science,
show.ci = FALSE, show.aic = TRUE, show.se = TRUE,
show.stat = TRUE,
show.obs = FALSE)
```
We note the following:
1. The school size (`school_size`) is a strong predictor for academic performance, implying larger schools tend to do better. This is likely a confounding variable for the urban/rural region of the school which can imply a difference in available funding of school facilities.
2. Private school tends to better than public schools (note the reference level and the negative coefficient estimate in the variable `public_private`).
3. Perhaps surprisingly, the student-teacher ratio (`stratio`) wasn't found to be significant but the shortage of staff (`staff_shortage`) was significant. This would imply that as long as the school is adequately supported by staff, further reduction in the student-teacher ratio does not have a statistical significant effect on student performance.
# Visualising coefficient estimates over the years
All analyses above focused on the year 2018 for Australia, but what about the other years? We also visualize the academic performances of students as a function of time in the [time trend article](https://kevinwang09.github.io/learningtower/articles/exploring_time.html), so in this section, we attempt to visualize the effect of some interesting variables and their linear model coefficient estimates for each of the PISA study over time.
We would expect the availability of technology (e.g. computer) could be beneficial for students at the start of the 21st century, but it is not clear if students will be helped by these technologies as time goes by.
The construction goes as follow:
1. We first split the entire Australian data by year and fit a linear model, with `math` as the response variable.
2. We extract the coefficient estimate for every predictor from every linear model and combine the result.
3. We then plot the years on the x-axis and the coefficient estimates on the y-axis as points and join each variable using a line. For categorical variables, we split the categories as separate lines.
4. Additionally, we show the 95% confidence interval of each coefficient estimate using a transparent ribbon and show the y = 0 line. i.e. whenever the ribbon crosses the horizontal line, the p-value for testing this level will be < 0.05.
```{r, fig.height = 12, fig.width = 15}
#Fitting a linear model, extracting the coefficients and visualizing every predictor
aus_student_years = aus_data %>%
dplyr::select(
math,
all_of(student_predictors),
year) %>%
na.omit()
aus_student_years_coef = aus_student_years %>%
group_by(year) %>%
nest() %>%
dplyr::mutate(math_lm_coef = purrr::map(.x = data,
.f = ~ lm(formula = as.formula(paste("math ~ ", student_formula_rhs)), data = .x) %>%
broom::tidy())) %>%
dplyr::select(-data) %>%
tidyr::unnest(math_lm_coef)
aus_student_years_coef %>%
dplyr::filter(str_detect(term, "computer|father_educ|escs|wealth")) %>%
dplyr::mutate(
year = year %>% as.character() %>% as.integer(),
facet = case_when(
str_detect(term, "computer") ~ "Number of computer",
str_detect(term, "father_educ") ~ "Education of father",
# str_detect(term, "mother_educ") ~ "Education of mother",
str_detect(term, "wealth") ~ "Wealth",
str_detect(term, "escs") ~ "Socio-economic index"),
last_point = ifelse(year == 2018, term, NA)) %>%
ggplot(aes(x = year, y = estimate,
colour = term,
group = term,
label = last_point)) +
geom_hline(yintercept = 0) +
geom_point(position = position_dodge(width = 0.8), size = 2) +
geom_line(position = position_dodge(width = 0.8), size = 1.5) +
geom_linerange(aes(ymin = estimate - 2*std.error,
ymax = estimate + 2*std.error),
size = 4, alpha = 0.7,
position = position_dodge(width = 0.8)) +
geom_label_repel(direction = "both", nudge_x = 2, seed = 2020, segment.size = 0) +
scale_x_continuous(limits = c(2005.5, 2022),
breaks = c(2006, 2009, 2012, 2015, 2018)) +
facet_wrap(~facet, scales = "free_y") +
theme(legend.position = "none") +
labs(x = "Year",
y = "Estimate",
title = "Graphing coefficient estimates throughout time")
```
We note the following:
1. Even though in the 2018, we found the education of father was statistically significant against students' academic performance, this was not always the case. From 2006 to 2018, the education of father seems to have ever positive influence on students.
2. It is clear that access to computers is ever more prevalent in Australia. But surprisingly, the positive influence of computers are decreasing. It is not clear why this would be the case. One possible reason is that students might have access to computers outside of their homes (e.g. from schools) and thus the advantages of accessing computers are dampened.
3. Quite interestingly, the influence of socio-economic index is dropping, implying a gradual move towards equality.
# Session info
```{r}
sessionInfo()
```