vignettes/articles/Australia_trends.Rmd

---
title: "How did Australia do in the PISA study"
author: "The Freemasons"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    fig_height: 10
    fig_width: 14
    number_sections: true
vignette: >
  %\VignetteIndexEntry{How did Australia do in the PISA study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
  
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = FALSE,
message = FALSE,
error = FALSE,
outwidth = "100%",
fig.width = 8, 
fig.height = 6)
```

# Introduction
  
The purpose of this article is to explore some of the variables that influenced Australia's performance in PISA study. Note that this is an observational study (as oppose to controlled experiment), and we are inferring on factors that are correlated with academic performance rather than specific causes.
  
  
# Loading the packages and data
  
```{r}
#loading the data and libraries
library(learningtower)
library(tidyverse)
library(lme4)
library(ggfortify)
library(sjPlot)
library(patchwork)
library(ggrepel)
library(kableExtra)

student <- load_student("all")
data(countrycode)

theme_set(theme_classic(18) 
          + theme(legend.position = "bottom"))
```

# Visualise predictors over time 

Since we are expecting some time variations in the data, let's quickly visualize the time trends.

```{r}
#filtering the data for Australia
aus_data = student %>% 
  dplyr::filter(country %in% c("AUS")) %>% 
  dplyr::mutate(mother_educ = mother_educ %>% fct_relevel("less than ISCED1"),
    father_educ = father_educ %>% fct_relevel("less than ISCED1"))
```


## Numeric variables

A boxplot is a standardized method of presenting data distribution. It informs whether or not our data is symmetrical. Box plots are important because they give a visual overview of the data, allowing researchers to rapidly discover mean values, data set dispersion, and skewness. In this data we visualize the numeric distribution across the years via boxplots.

```{r, fig.height = 9, fig.width = 15}
# plotting the distribution of numeric variables via boxplots
aus_data %>% 
  select(where(is.numeric)) %>% 
  bind_cols(aus_data %>% select(year)) %>% 
  pivot_longer(cols = -year) %>% 
  ggplot(aes(x = year, y = value,
             colour = year)) +
  geom_boxplot() +
  facet_wrap(~name, scales = "free_y") +
  theme(legend.position = "none") +
  labs(x = "Year", 
       y = "", 
       title = "The distribution of numerical variables in the student dataset over all years")
```

## Factor variables

Missing data is a common issue that data professionals must deal with on a daily basis. In this section we visualize the number of missing values across the years for all the factor variables in the student dataset.

```{r, fig.height = 15, fig.width = 15}
#checking the missing values in the factor variables of the data
aus_fct_plotdata = aus_data %>% 
  select(where(is.factor)) %>% 
  dplyr::select(-country, -school_id, -student_id) %>% 
  pivot_longer(cols = -year) %>% 
  group_by(year, name, value) %>% 
  tally() %>% 
  dplyr::mutate(
    value = coalesce(value, "missing"),
    percent = n/sum(n),
    year = year %>% as.character() %>% as.integer()) %>% 
  group_by(name, value) %>% 
  dplyr::mutate(last_point = ifelse(year == max(year), as.character(value), NA))

aus_fct_plotdata %>% 
  ggplot(aes(x = year, y = percent,
             label = last_point,
             group = value)) +
  geom_point() + 
  geom_line() +
  geom_label_repel(direction = "both", nudge_x = 3, seed = 2020, segment.size = 0) +
  facet_wrap(~name, scales = "free_y", ncol = 3) +
  scale_x_continuous(breaks = c(2000, 2003, 2006, 2009, 2012, 2015, 2018)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year", 
       y = "Percentage of missing values",
       title = "Missing values in the student dataset's factor variables")
```

We initially investigate the most current 2018 data before generalizing the models/results into any patterns due to the quantity of missing values in the data in previous years and also to decrease the time complexity in modeling.

# Linear regression model for the 2018 study

Linear regression analysis predicts the value of one variable depending on the value of other variables. Because they are well known and can be trained rapidly, linear regression models have become a effective way of scientifically and consistently predicting the future.

We begin by doing a basic data exploration using linear regression models. To begin, we fit three linear models (one for each subject of math, reading, and science) to the 2018 Australian data to gain an understanding of the key variables that may be impacting test scores.

We filter the student data (we use the complete student data obtained by the load student("all") function) to pick the scores in Australia and re level some variables for further analyses.

```{r}
#filtering the data to Australia, defining the predictors and selecting the scores
student_predictors = c("mother_educ", "father_educ", "gender", "internet", 
                       "desk", "room", "television", "computer_n", 
                       "car", "book", "wealth", "escs")

student_formula_rhs = paste(student_predictors, collapse = "+")

aus2018 = aus_data %>% 
  filter(year == "2018") %>% 
  dplyr::select(
    math, read, science, 
    all_of(student_predictors)) %>% 
  na.omit()
```

## Checking correlation matrix of the numeric variables

A correlation matrix is a table that displays the coefficients of correlation between variables. Each cell in the table represents the relationship between two variables.

```{r}
#correlation matrix for the numeric variables
aus2018 %>% 
  select(where(is.numeric)) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  round(2) %>% 
  kbl(caption = "Correlation Matrix") %>% 
  kable_styling(full_width = NULL,
                position = "center",
                bootstrap_options = c("hover", "striped"))
```

## Fitting three linear models


```{r}
#fitting linear models for the three subjects maths, reading and science

aus2018_math = lm(formula = as.formula(paste("math ~ ", student_formula_rhs)) , data = aus2018)

aus2018_read = lm(formula = as.formula(paste("read ~ ", student_formula_rhs)) , data = aus2018)

aus2018_science = lm(formula = as.formula(paste("science ~ ", student_formula_rhs)) , data = aus2018)

sjPlot::tab_model(aus2018_math, aus2018_read, aus2018_science,
                  show.ci = FALSE, show.aic = TRUE, show.se = TRUE,
                  show.stat = TRUE,
                  show.obs = FALSE)
```


Some interesting discoveries from these models:

1. All three response variables seem to be influenced by the same set of factors.

2. Father's education level (`father_educ`) seems to have a much stronger effect than mother's education level (`mother_educ`). 

3. While most estimates agree in signs across the three subjects, the most notable exception to this is `gender`, where girls tend to perform better than boys in reading.

4. The most influential predictors are those associated with socioeconomic status (`escs`) and education (`book`). A number of variables that should not be directly causal to academic performance also showed up as significant. This is likely due to their associations with socio-economic status. 

Upon checking the classical diagnostic plots of these models, we see no major violation on the assumptions of linear models. The large amount of variations in the data may help to explain why the models only has a moderately low $R^2$ values (~ 0.20).

```{r, fig.height = 30, fig.width = 12}
#plotting the outcome of linear models
autoplot(aus2018_math) + labs(title = "2018 Australia maths model") +
autoplot(aus2018_read) + labs(title = "2018 Australia read model") + 
autoplot(aus2018_science) + labs(title = "2018 Australia science model")
```


# Linear mixed model

Linear mixed models are a subset of simple linear models that allow for both fixed and random effects.

We already know that the socio-economic status (SES) of a student is often the most influential predictor and it is likely that students with similar SES will attend the same schools in their neighborhood and receive similar level of quality of education from the same teachers. 

Thus, it is likely that there will be a grouping effect on the students if they attended the same school. This would imply that some observations in our data are not independent observations. 

By building random effects in our linear model, that is building a linear mixed model, we should be able to produce a model with better fit if we consider this grouping effect of schools into our model. 

```{r}
# joining school and student data, building a linear mixed model
lmm2018 = aus_data %>% 
  filter(year == "2018") %>% 
  select(
    school_id,
    math, read, science, 
    all_of(student_predictors)) %>% 
  na.omit()

lmm2018_math = lmer(formula = as.formula(paste("math ~ ", student_formula_rhs, "+ (escs | school_id)")), data = lmm2018)

lmm2018_read = lmer(formula = as.formula(paste("read ~ ", student_formula_rhs, "+ (escs | school_id)")), data = lmm2018)

lmm2018_science = lmer(formula = as.formula(paste("science ~ ", student_formula_rhs, "+ (escs | school_id)")), data = lmm2018)

sjPlot::tab_model(lmm2018_math, lmm2018_read, lmm2018_science,
                  show.ci = FALSE, show.aic = TRUE, show.se = TRUE,
                  show.stat = TRUE,
                  show.obs = FALSE)
```

We see that the linear mixed model improved on the fit of the model, as judged by the AIC. 

```{r}
# subtracting AIC values of the two models
bind_cols(
AIC(aus2018_math) - AIC(lmm2018_math),
AIC(aus2018_read) - AIC(lmm2018_read),
AIC(aus2018_science) - AIC(lmm2018_science)
) %>% 
  rename(maths = ...1,
         read = ...2,
         science = ...3) %>% 
  kbl(caption = "AIC Values") %>% 
  kable_styling(full_width = NULL,
                position = "center",
                bootstrap_options = c("hover", "striped"))
```

# Integrating with `school` data

We now take this dataset on students and merge it with some variables from the `school` data which is also a part of this `learningtower` package. This allows us to gain more access to the school level variables this is helpful in modelling the data. 

```{r}
#taking into account the school dataset variables and fitting a linear mixed model
selected_vars = c("father_educ", "gender", "internet", 
                  "desk", "computer_n", "car",
                  "book", "wealth", "escs")

data(school)

aus_school_2018 = school %>% 
  dplyr::filter(country == "AUS", year == "2018") %>% 
  dplyr::mutate(school_size = log10(school_size)) %>% ## We take the log due to the scale
  dplyr::select(-year, -country, -contains("fund"), -sch_wgt)

lmm2018_sch = lmm2018 %>% 
  left_join(aus_school_2018, by = c("school_id")) %>% na.omit()

school_predictors = c("stratio", "public_private", "staff_shortage", "school_size")
school_formula_rhs = paste(school_predictors, collapse = "+")

lmm2018_sch_math = lmer(formula = as.formula(paste("math ~ ", student_formula_rhs, "+ (escs | school_id) + ",
                                                   school_formula_rhs)), data = lmm2018_sch)

lmm2018_sch_read = lmer(formula = as.formula(paste("read ~ ", student_formula_rhs, "+ (escs | school_id) + ",
                                                   school_formula_rhs)), data = lmm2018_sch)

lmm2018_sch_science = lmer(formula = as.formula(paste("science ~ ", student_formula_rhs, "+ (escs | school_id) + ",
                                                      school_formula_rhs)), data = lmm2018_sch)


sjPlot::tab_model(lmm2018_sch_math, lmm2018_sch_read, lmm2018_sch_science,
                  show.ci = FALSE, show.aic = TRUE, show.se = TRUE,
                  show.stat = TRUE,
                  show.obs = FALSE)
```

We note the following:

1. The school size (`school_size`) is a strong predictor for academic performance, implying larger schools tend to do better. This is likely a confounding variable for the urban/rural region of the school which can imply a difference in available funding of school facilities.  

2. Private school tends to better than public schools (note the reference level and the negative coefficient estimate in the variable `public_private`). 

3. Perhaps surprisingly, the student-teacher ratio (`stratio`) wasn't found to be significant but the shortage of staff (`staff_shortage`) was significant. This would imply that as long as the school is adequately supported by staff, further reduction in the student-teacher ratio does not have a statistical significant effect on student performance. 

# Visualising coefficient estimates over the years

All analyses above focused on the year 2018 for Australia, but what about the other years? We also visualize the academic performances of students as a function of time in the [time trend article](https://kevinwang09.github.io/learningtower/articles/exploring_time.html), so in this section, we attempt to visualize the effect of some interesting variables and their linear model coefficient estimates for each of the PISA study over time. 

We would expect the availability of technology (e.g. computer) could be beneficial for students at the start of the 21st century, but it is not clear if students will be helped by these technologies as time goes by. 

The construction goes as follow: 

1. We first split the entire Australian data by year and fit a linear model, with `math` as the response variable. 

2. We extract the coefficient estimate for every predictor from every linear model and combine the result. 

3. We then plot the years on the x-axis and the coefficient estimates on the y-axis as points and join each variable using a line. For categorical variables, we split the categories as separate lines. 

4. Additionally, we show the 95% confidence interval of each coefficient estimate using a transparent ribbon and show the y = 0 line. i.e. whenever the ribbon crosses the horizontal line, the p-value for testing this level will be < 0.05. 

```{r, fig.height = 12, fig.width = 15}
#Fitting a linear model, extracting the coefficients and visualizing every predictor
aus_student_years = aus_data %>% 
  dplyr::select(
    math,
    all_of(student_predictors),
    year) %>% 
  na.omit()

aus_student_years_coef = aus_student_years %>% 
  group_by(year) %>% 
  nest() %>% 
  dplyr::mutate(math_lm_coef = purrr::map(.x = data, 
                              .f = ~ lm(formula = as.formula(paste("math ~ ", student_formula_rhs)), data = .x) %>% 
                                broom::tidy())) %>% 
  dplyr::select(-data) %>% 
  tidyr::unnest(math_lm_coef)

aus_student_years_coef %>% 
  dplyr::filter(str_detect(term, "computer|father_educ|escs|wealth")) %>% 
  dplyr::mutate(
    year = year %>% as.character() %>% as.integer(),
    facet = case_when(
      str_detect(term, "computer") ~ "Number of computer",
      str_detect(term, "father_educ") ~ "Education of father",
      # str_detect(term, "mother_educ") ~ "Education of mother",
      str_detect(term, "wealth") ~ "Wealth",
      str_detect(term, "escs") ~ "Socio-economic index"),
    last_point = ifelse(year == 2018, term, NA)) %>% 
  ggplot(aes(x = year, y = estimate,
             colour = term,
             group = term,
             label = last_point)) +
  geom_hline(yintercept = 0) +
  geom_point(position = position_dodge(width = 0.8), size = 2) +
  geom_line(position = position_dodge(width = 0.8), size = 1.5) +
  geom_linerange(aes(ymin = estimate - 2*std.error,
                     ymax = estimate + 2*std.error),
                 size = 4, alpha = 0.7,
                 position = position_dodge(width = 0.8)) +
  geom_label_repel(direction = "both", nudge_x = 2, seed = 2020, segment.size = 0) +
  scale_x_continuous(limits = c(2005.5, 2022),
                     breaks = c(2006, 2009, 2012, 2015, 2018)) +
  facet_wrap(~facet, scales = "free_y") +
  theme(legend.position = "none") +
  labs(x = "Year",
       y = "Estimate",
       title = "Graphing coefficient estimates throughout time")
```

We note the following: 

1. Even though in the 2018, we found the education of father was statistically significant against students' academic performance, this was not always the case. From 2006 to 2018, the education of father seems to have ever positive influence on students.

2. It is clear that access to computers is ever more prevalent in Australia. But surprisingly, the positive influence of computers are decreasing. It is not clear why this would be the case. One possible reason is that students might have access to computers outside of their homes (e.g. from schools) and thus the advantages of accessing computers are dampened. 

3. Quite interestingly, the influence of socio-economic index is dropping, implying a gradual move towards equality.

# Session info
```{r}
sessionInfo()
```