Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
81 lines (64 sloc) 4.68 KB
---
title: "Age distributions in samples from cross-national survey projects"
author: "Marta Kołczyńska"
date: 2018-09-02T10:09:00
categories: ["SDR"]
tags: ["surveys", "SDR", "R", "cross-national research", "age", "data quality", "survey data harmonization", "shiny"]
output:
blogdown::html_page:
toc: true
toc_depth: 4
---
Cross-national survey projects conduct surveys on representative samples of adult populations.
How do the distributions of respondents' age vary across surveys carried out in the same country in different years and different projects?
Like in a couple of previous posts ([here](https://martakolczynska.com/post/sdr-exploration/){target="_blank"}, [here](https://martakolczynska.com/post/education-sdr-oecd/) and [here](https://martakolczynska.com/post/sdr-demonstrations-multiplets/)) I use data from the [Survey Data Recycling dataset (SDR) version 1](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VWGF5Q){target="_blank"}, which includes selected harmonized variables from 22 cross-national survey projects. SDR only includes surveys that claim to have samples representative for adult populations. The code for manipulating the data and plotting can be found [here](https://github.com/mkolczynska/martakolczynska/blob/master/content/post/sdr-age-distributions.Rmd){target="_blank"} on GitHub.
The ridgeline plot (thanks to the [`ggridges` package](https://github.com/clauswilke/ggridges){target="_blank"}) presents the distributions of age in all 32 samples from Poland from the SDR v.1 dataset, arranged from the oldest surveys at the bottom to the most recent surveys at the top. Colors identify age quartiles, and differences in the hight of graphs between surveys reflect differences in the sample sizes. Distributions are weighted with the case weights provided with the original data. In the SDR dataset values higher than 96 were recoded to missing, because many surveys has surprisingly high numbers of respondents aged 97-99 (likely because these are common missing value codes).
Age distributions for all surveys included in the SDR v.1 dataset can be plotted with this Shiny app: [https://mkolczynska.shinyapps.io/sdr_age_distributions/](https://mkolczynska.shinyapps.io/sdr_age_distributions/){target="_blank"}.
How does this variation influence the comparability of sample estimates between samples? And how to adjust samples when calculating and comparing, e.g., rates of participation in politics?
```{r setup, warning=FALSE, message=FALSE, echo = FALSE}
library(tidyverse) # manipulating data
library(ggplot2) # pretty plots
library(ggridges) # plots
library(viridis) # color palettes
library(RColorBrewer) # more color palettes
master_prep <- read.csv("data/sdr-age-distributions.csv")
```
```{r wrangling, warning=FALSE, message=FALSE, eval = FALSE, echo = FALSE}
## recode of the SDR v.1 MAster File.
master_prep <- master %>%
select(t_country_year, t_survey_name, t_survey_edition, t_country_l1u, t_country_set, t_age, t_weight_l1u_2) %>%
group_by(t_country_year, t_survey_name, t_survey_edition, t_country_l1u, t_country_set) %>%
arrange(t_country_l1u, t_survey_name, t_survey_edition, t_country_set, t_country_year, t_age) %>%
mutate(id = paste(t_country_year, t_survey_name, t_survey_edition),
ncases = round(sum(t_weight_l1u_2,0)),
pos_age = row_number(),
pos_age_w = cumsum(t_weight_l1u_2),
qagew = case_when(
pos_age_w / ncases <= 0.25 ~ 1,
pos_age_w / ncases > 0.25 & pos_age_w / ncases <= 0.5 ~ 2,
pos_age_w / ncases > 0.5 & pos_age_w / ncases <= 0.75 ~ 3,
pos_age_w / ncases > 0.75 ~ 4),
qage = case_when(
pos_age / ncases <= 0.25 ~ 1,
pos_age / ncases > 0.25 & pos_age / ncases <= 0.5 ~ 2,
pos_age / ncases > 0.5 & pos_age / ncases <= 0.75 ~ 3,
pos_age / ncases > 0.75 ~ 4)) %>%
group_by(t_country_l1u, t_survey_name, t_survey_edition, t_country_set, t_country_year, t_age) %>%
summarise(freq = n(),
freqw = sum(t_weight_l1u_2),
qage = floor(mean(qage)),
qagew = floor(mean(qagew))) %>%
filter(!is.na(t_age))
```
```{r graph-pl, warning=FALSE, message=FALSE, echo = FALSE, fig.width = 8, fig.height = 10}
master_prep %>% filter(t_country_l1u == "PL") %>%
ggplot(., aes(x = t_age, y = id, height = freqw / 100, group = id, fill = factor(qagew))) +
geom_ridgeline_gradient() +
scale_fill_viridis(option = "D", discrete = TRUE, direction = -1, begin = 0.4, end = 0.98) +
xlim(13, 100) +
xlab("Age (years)") +
ylab("") +
guides(fill=guide_legend(title="Quartiles")) +
ggtitle("Distribution of age (weighted) in national surveys") +
labs(caption="Source: SDR v.1")
```