Open Case Studies: Disparities in Youth Disconnection
- HTML: https://www.opencasestudies.org/ocs-bp-youth-disconnection
- GitHub: https://github.com/opencasestudies/ocs-bp-youth-disconnection
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.
To cite this case study:
Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie C. (2020). https://github.com/opencasestudies/ocs-youth-disconnection-case-study. Disparities in Youth Disconnection.
We would like to acknowledge Tamar Mendelson for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.
The total reading time for this case study was calculated with koRpus: About 85 minutes
The Flesch-Kincaid Readability Index was also calculated with koRpus: Grade 8, Age 13
Disparities in Youth Disconnection
According to this report from Measure of America (a nonpartisan project), youth disconnection, defined as “young people between the ages of 16 and 24 who are neither working nor in school,” has generally decreased over the past 7 years but shows racial and ethnic disparities, with some groups showing increased rates of disconnection.
Thus in this case study we aim to look further at youth disconnection rates among gender and racial and ethnic subgroups to identify groups that may be particularly vulnerable.
- How have youth disconnection rates in American youth changed since 2008?
- In particular, how have these rates changed for different gender and ethnic groups? Are any groups particularly disconnected?
In this case study we will be using data related to youth disconnection from the two following reports from the Measure of America project:
Measure of America is a nonpartisan project of the nonprofit Social Science Research Council founded in 2007 to create easy-to-use yet methodologically sound tools for understanding well-being and opportunity in America. Through reports, interactive apps, and custom-built dashboards, Measure of America works with partners to breathe life into numbers, using data to identify areas of highest need, pinpoint levers for change, and track progress over time.
- Lewis, Kristen. Making the Connection: Transportation and Youth Disconnection. New York: Measure of America, Social Science Research Council, 2019. (Data up to 2017)
- Lewis, Kristen. A Decade Undone: Youth Disconnection in the Age of Coronavirus. New York: Measure of America, Social Science Research Council, 2020. (Data up to 2018)
These reports use data from the American Community Survey (ACS).
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data Science Learning Objectives:
- Importing text from PDF files using images and the magick package
- Apply action verbs in dplyr for data wrangling
- How to reshape data by pivoting between “long” and “wide” formats
and separating columns into additional columns (tidyr)
- How to fill in data based on previous values (tidyr)
- How to create data visualizations with ggplot2 that are in a similar style to an existing image
- How to add images to plots using cowplot
- How to create effective bar plots for multiple comparisons,
including adding gaps between bars in bar plots, adding figure
legends to the plot area, and adding comparison lines (ggplot2)
Statistical Learning Objectives:
- Implementation of the Mann-Kendall trend test
- Interpretation of the Mann-Kendall trend test
- Difference between linear regression and Mann-Kendall trend test
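As a rough illustration of the last objective, the sketch below contrasts a linear regression slope with Kendall's rank correlation, the statistic underlying the Mann-Kendall test. The two series are invented for illustration and are not taken from the reports. Both rise every year, so the rank-based statistic is identical, while the regression slope is pulled upward by a single extreme value.

```r
# Two made-up series that both increase every year (illustrative values only)
years  <- 2008:2017
steady <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
jumpy  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # same ordering, one extreme value

# Kendall's tau depends only on the ordering of the values
tau_steady <- cor(years, steady, method = "kendall")
tau_jumpy  <- cor(years, jumpy, method = "kendall")

# The least-squares slope depends on the magnitudes of the values
slope_steady <- coef(lm(steady ~ years))[["years"]]
slope_jumpy  <- coef(lm(jumpy ~ years))[["years"]]

print(c(tau_steady = tau_steady, tau_jumpy = tau_jumpy))
print(c(slope_steady = slope_steady, slope_jumpy = slope_jumpy))
```

A rank-based test of this kind asks only whether values tend to move in one direction over time, whereas linear regression additionally assumes a straight-line relationship and estimates its slope.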
Data is imported from several tables within two PDF documents by taking
screenshots of the tables of interest and using the
magick package to
import the text from the screenshots.
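A minimal sketch of this import step is shown below. It assumes the magick and tesseract packages are installed; the helper names and the file path are hypothetical and do not come from the case study. The function `image_ocr()` from magick relies on tesseract to extract the text.

```r
# Split raw OCR output (one long string) into trimmed, non-empty lines.
split_ocr_lines <- function(raw_text) {
  lines <- strsplit(raw_text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines[nzchar(lines)]
}

# Read a screenshot and return its text, one element per line.
# Requires the magick and tesseract packages to be installed.
read_screenshot_text <- function(path) {
  img <- magick::image_read(path)
  raw_text <- magick::image_ocr(img)
  split_ocr_lines(raw_text)
}

# Hypothetical usage; the path below is a placeholder:
# read_screenshot_text("img/table_screenshot.png")

# The line-splitting helper can be tried on an invented string:
example_lines <- split_ocr_lines("United States 11.2\n\nMale 11.5 \n")
print(example_lines)
```

OCR output usually needs further cleaning, which is where the wrangling steps described next come in.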
This case study particularly focuses on renaming variables, modifying
variables, creating new variables, and modifying the shape of the data,
as well as modifying specific variables.
This case study also covers combining data with functions such as
add_rows().
We also cover removing NA values with the drop_na() function of the
tidyr package, separating one column into multiple columns using the
separate() function of the tidyr package, filling in NA values
based on previous values using the fill() function and replacing NA
values using the replace_na() function, both of the tidyr package, as
well as arranging levels of factors using the forcats package.
Finally, this case study also covers many of the stringr functions to
manipulate character strings.
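The sketch below shows how a few of these tidyr and dplyr steps might fit together on a toy table. The column names and values are invented for illustration and are not taken from the reports.

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative values only): the group label appears once per block,
# and race/ethnicity and year are packed into a single column
raw <- tibble(
  group = c("Female", NA, "Male", NA),
  label = c("Asian_2017", "Asian_2018", "Asian_2017", "Asian_2018"),
  rate  = c(6.9, NA, 7.1, 6.5)
)

tidy <- raw %>%
  fill(group) %>%                                   # carry the previous label down
  separate(label, into = c("race_ethnicity", "year"), sep = "_") %>%
  mutate(year = as.numeric(year)) %>%               # year was text after splitting
  drop_na(rate)                                     # drop rows with a missing rate

# Reshape from long to wide: one column per year
wide <- tidy %>%
  pivot_wider(names_from = year, values_from = rate)

print(tidy)
print(wide)
```

Here `fill()` runs before `drop_na()` so that a row with a missing rate can still pass its group label to the rows below it.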
We include an example of creating a plot to match the style of a plot in
an existing report. We also demonstrate how to make effective bar plots,
covering details such as creating gaps between groups, taking
advantage of these gaps to move the legend to within the plot area, and
using horizontal lines to allow for additional comparisons among
groups. We also demonstrate how to add images to plots and combine plots.
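As a small, hedged sketch of two of these ideas (spacing between bars and a horizontal comparison line), the ggplot2 code below uses invented subgroup names and rates. It is not the plot built in the case study, and the reference value is hypothetical.

```r
library(ggplot2)

# Illustrative values only; not taken from the Measure of America reports
rates <- data.frame(
  subgroup = rep(c("Group A", "Group B", "Group C"), each = 2),
  gender   = rep(c("Female", "Male"), times = 3),
  rate     = c(6.2, 7.0, 12.4, 13.1, 17.9, 16.5)
)
overall_rate <- 11.2  # hypothetical reference value for the comparison line

p <- ggplot(rates, aes(x = subgroup, y = rate, fill = gender)) +
  # Bars narrower than the dodge width leave a small gap between neighbors
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  # Dashed horizontal line for comparing every bar to one reference value
  geom_hline(yintercept = overall_rate, linetype = "dashed") +
  labs(x = NULL, y = "Disconnection rate (%)", fill = NULL) +
  theme_minimal()

print(p)
```

Setting the bar `width` below the dodge width is one simple way to separate bars; the case study may create its gaps differently.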
The analysis in this case study covers some basics of probability and hypothesis testing, as well as the Mann-Kendall trend test and how it differs from simple linear regression. In this analysis we use the Mann-Kendall test to determine whether there has been a trend in the disconnection rates of particular groups of youths over time.
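To make the mechanics of the test concrete, here is a minimal base R sketch of the Mann-Kendall statistic with the usual normal approximation. It assumes there are no tied values, and the rates are invented for illustration. The helper name `mann_kendall` is hypothetical; the case study itself relies on the Kendall package, whose `MannKendall()` function may report a slightly different p-value.

```r
# Mann-Kendall trend test via the normal approximation (assumes no tied values
# and at least two observations).
mann_kendall <- function(x) {
  n <- length(x)
  # S: number of increases minus number of decreases over all ordered pairs
  s <- 0
  for (i in seq_len(n - 1)) {
    s <- s + sum(sign(x[(i + 1):n] - x[i]))
  }
  var_s <- n * (n - 1) * (2 * n + 5) / 18
  # Continuity correction: move S one unit toward zero before standardizing
  z <- if (s > 0) {
    (s - 1) / sqrt(var_s)
  } else if (s < 0) {
    (s + 1) / sqrt(var_s)
  } else {
    0
  }
  list(S = s, z = z, p_value = 2 * pnorm(-abs(z)))  # two-sided p-value
}

# Made-up yearly disconnection rates (illustrative only)
rates <- c(14.7, 14.1, 13.8, 13.2, 12.3, 11.7, 11.5, 11.2)
result <- mann_kendall(rates)
print(result)

# With the Kendall package installed, the equivalent call would be:
# Kendall::MannKendall(rates)
```

A negative S indicates that later values tend to be lower than earlier ones, and a small p-value suggests that this pattern is unlikely under the null hypothesis of no trend.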
Other notes and resources
simple linear regression
Kendall rank correlation coefficient
one-sided and two-sided hypotheses
Nonparametric Parametric significance threshold
Z score table
Z score to p-value calculator
To learn more about importing and wrangling PDFs using the pdftools
package, see this case study.
To learn more about what you can do with the
magick package see this
To learn more about hypothesis testing, see this case study.
Packages used in this case study:
|Package||Use in this case study|
|here||to easily load and save data|
|pdftools||to import PDF documents|
|magick||for importing images and extracting text from images|
|tesseract||for extracting text from images with magick|
|knitr||for showing images in reports|
|dplyr||to filter, subset, join, add rows to, and modify the data|
|stringr||to manipulate strings|
|magrittr||to pipe sequential commands|
|tidyr||to change the shape or format of tibbles to wide and long, to drop rows with NA values, to separate columns, to fill in values, and to replace NA values|
|tibble||to create tibbles|
|ggplot2||to create plots|
|directlabels||to add labels directly to lines in plots|
|cowplot||to add images to plots|
|forcats||to reorder factor for plot|
|Kendall||to implement the Mann-Kendall trend test in R|
|patchwork||to combine plots|
Instructors can start at the Data Visualization or Data Analysis sections. Instructors can also skip the Subgroup plots section if they don’t wish to instruct students about making bar plots in depth.
This case study is appropriate for those new to R programming and new to statistics. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some fundamental knowledge of statistics.
- For the Asian and Latinx subgroup bar plots made across years, modify these plots to consider gender differences (instead of differences across time).
- Taking the plot you made above, modify the plot to facet across years.
- Find another table in one of the reports to import using the
magick package (for example, the data about different states over time in the 2019 report called Making the Connection). Look for differences between groups by plotting the data and evaluating with the Mann-Kendall test.
Estimate of RMarkdown Compilation Time:
About 36 to 46 seconds
This compilation time was measured on a PC running Windows 10. This range should only be used as an estimate, as compilation time will vary with different machines and operating systems.