- Static version: https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity
- Interactive version: https://rsconnect.biostat.jhsph.edu/ocs-bp-rural-and-urban-obesity-interactive/
- GitHub: https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.
To cite this case study:
Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity. Exploring global patterns of obesity across rural and urban regions (Version v1.0.0).
We would like to acknowledge Jessica Fanzo for assisting in framing the major direction of the case study.
We would like to acknowledge Michael Breshock for his contributions to this case study and developing the
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.
The total reading time for this case study was calculated with koRpus: About 70 minutes
The Flesch-Kincaid Readability Index was also calculated with koRpus: Grade 9, Age 14
Exploring Global Patterns of Obesity from 1985 to 2017
Body Mass Index (BMI) is often used as a proxy for adiposity, with BMI-based classifications defining “underweight”, “normal”, “overweight”, and “obese”. Higher BMI has been associated with increased mortality and increased rates of type 2 diabetes, cancer, heart disease, and stroke. A recent paper showed that, contrary to the widely reported view that urbanization is one of the most important drivers of the global rise in obesity, BMI is increasing at the same rate or faster in rural areas than in cities, particularly in low- and middle-income regions. There is also a gender discrepancy: women have a higher BMI in rural communities.
Here, we explore this data to understand global patterns in obesity. This analysis is important because it may indicate the need to provide better access (financial and physical access) to healthy foods in rural communities, especially in low-income countries, to address the obesity crisis.
- Is there a difference between rural and urban BMI estimates around the world? In particular, what does this difference look like for women?
- How have BMI estimates changed from 1985 to 2017? In particular, what does this change over time look like for women?
- How do different countries compare for BMI estimates? In particular, how does the United States compare to the rest of the world?
The data used in this analysis comes from a supplementary table for the following article:
This article is freely available online.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data science Learning Objectives:
- Importing data from a PDF (pdftools)
- Subsetting and filtering data (dplyr)
- Working with character strings (stringr)
- Reshaping data into different formats (tidyr)
- Applying functions to all columns of a tibble (purrr)
- Creating data visualizations (ggplot2) with labels (ggrepel)
- Combining multiple plots (cowplot and patchwork)
Statistical Learning Objectives:
- Familiarity with the use of Quantile-Quantile plots to assess normality
- Define and understand the utility of alpha and the p value
- Describe the difference between nonparametric and parametric tests
- Be able to identify paired data
- Implementation of a paired t-test
- Interpretation of a paired t-test
- Implementation of a Wilcoxon signed-rank test
- Interpretation of a Wilcoxon signed-rank test
- Understanding of the need for multiple testing correction
In this case study, we will largely focus on methods for comparing two groups using parametric and nonparametric hypothesis tests. We also cover multiple testing correction and fairly advanced data visualization methods using ggplot2.
Data is imported from a PDF using
pdftools to obtain data from a large
table. The beginning of this table looks like this:
This case study covers many wrangling techniques and largely involves
using the stringr package, including:
- Dividing data into separate lines
- Removing excess white-space
- Removing redundant header information
- Correcting spacing issues
- Dealing with NA values that are labeled in an unusual manner
- Splitting the data into columns using a delimiter
- Changing variable names
- Sorting the data
- Converting to long format
- Separating a column into multiple columns
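The wrangling steps listed above can be sketched on a toy example. This is only an illustration under stated assumptions: the text in `page`, the column names, and the `NA*` label are all hypothetical stand-ins for what `pdftools::pdf_text()` might return, and the case study's own code may take a different route.

```r
library(stringr)
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical raw text standing in for one page of PDF output;
# the real supplementary table has different columns and values.
page <- "Country   Sex    Rural_1985   Urban_1985\nCountryA   Men    21.5   23.1\nCountryB   Women    NA*   24.0\n"

# Divide the page into separate lines and trim surrounding white-space
lines <- str_trim(str_split(page, "\n")[[1]])
lines <- lines[lines != ""]

# Remove the header line, then turn runs of spaces into a single delimiter
body <- lines[-1]
body <- str_replace_all(body, " {2,}", ",")

# Relabel the unusually coded missing value (assumed here to be "NA*")
body <- str_replace_all(body, fixed("NA*"), "NA")

# Split into columns using the delimiter and assign variable names
wide <- separate(
  tibble(raw = body), raw,
  into = c("country", "sex", "rural_1985", "urban_1985"),
  sep = ",", convert = TRUE
)

# Convert to long format, separating the old column names into two columns
long <- pivot_longer(
  wide,
  cols = c(rural_1985, urban_1985),
  names_to = c("region", "year"), names_sep = "_",
  values_to = "bmi"
)

# Sort the data
long <- arrange(long, country, region)
long
```

The long format puts one BMI estimate on each row, which is the shape that ggplot2 and the grouped summaries in dplyr generally expect.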
To explore the data we use the
summarize() function as well as plots
to look at the distribution of the data. Quantile-Quantile plots are
used to evaluate the distribution and compare it to the theoretical
normal distribution.
This case study covers fundamental concepts in statistics such as type 1 error, alpha thresholds, p-values, hypothesis testing, parametric two-sample mean tests, and nonparametric two-sample tests, as well as the assumptions of the statistical tests covered and what to do when data is paired.
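As a rough illustration of how these pieces fit together, the sketch below runs a Quantile-Quantile plot, a paired t-test, a Wilcoxon signed-rank test, and a Bonferroni correction using only base R. The `rural` and `urban` vectors are invented numbers, not values from the case study data, and the choice of two tests and an alpha of 0.05 is an assumption made for the example.

```r
# Hypothetical paired BMI estimates: one rural and one urban value per country.
# These numbers are invented for illustration only.
rural <- c(22.1, 24.3, 25.0, 23.4, 26.2, 21.8, 24.9, 25.5)
urban <- c(23.0, 24.9, 25.1, 24.6, 26.0, 23.1, 25.7, 26.5)
diffs <- urban - rural

# Q-Q plot of the paired differences against the theoretical normal distribution
qqnorm(diffs)
qqline(diffs)

# Parametric test: paired t-test on the mean of the differences
t_res <- t.test(urban, rural, paired = TRUE)

# Nonparametric alternative: Wilcoxon signed-rank test on the same pairs
w_res <- wilcox.test(urban, rural, paired = TRUE)

# Bonferroni correction: each p-value is multiplied by the number of tests
# (here 2) and capped at 1
p_raw <- c(t_test = t_res$p.value, wilcoxon = w_res$p.value)
p_adj <- p.adjust(p_raw, method = "bonferroni")

# Compare the adjusted p-values with the chosen alpha threshold
alpha <- 0.05
p_adj < alpha
```

Setting `paired = TRUE` matters because each country contributes both a rural and an urban estimate; treating the two groups as independent would ignore that pairing.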
Other notes and resources
- Long and Wide Data Formats
- Distributions
- Normal Distribution
- Skewed Distributions
- Bimodal Distribution
- ggplot2
- Parametric and Nonparametric Methods
- Balanced Study Design
- Central Limit Theorem
- Mood’s Two-Sample Scale Test
- Wilcoxon Signed Rank Test
- Wilcoxon Rank Sum Test
- Two-sample Kolmogorov-Smirnov Test
- Type 1 Error
- Bonferroni Method of Multiple Testing Correction
Packages used in this case study:
| Package | Use in this case study |
|---|---|
| here | to easily load and save data with relative paths |
| pdftools | to read text from a PDF into R |
| stringr | to manipulate the text data |
| readr | to separate the text data from the PDF into individual lines |
| dplyr | to arrange/filter/select subsets of the data |
| tibble | to create data objects that we can manipulate with |
| magrittr | to use the %>% pipe operator |
| glue | to paste or combine character strings and data together |
| purrr | to perform functions on all columns of a tibble |
| tidyr | to convert data from ‘wide’ to ‘long’ format |
| ggplot2 | to make visualizations with multiple layers |
| ggrepel | to allow labels in figures not to overlap |
| cowplot and patchwork | to allow plots to be combined |
Our goal is for instructors to use this case study as the starting point
for a set of lectures. We provide one R Markdown file
(index.Rmd) for an instructor to use. However, we
anticipate the instructor may either break this file up into smaller R
Markdown files for multiple lectures or extract only a portion of the
material (e.g. the Data Wrangling or Data Analysis sections) to use in
the classroom. With the latter goal in mind, we save a
Wrangled_data.rda object at the end of the Data Wrangling section,
which is loaded at the start of the Data Exploration section.
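For instructors splitting the material, the save-and-load handoff between sections might look like the sketch below. The object name `bmi_long` and the temporary file location are hypothetical; the case study's own object names and directory layout may differ.

```r
# Hypothetical wrangled object and file location, for illustration only
bmi_long <- data.frame(
  country = c("CountryA", "CountryB"),
  bmi = c(23.1, 24.0)
)
rda_path <- file.path(tempdir(), "Wrangled_data.rda")

# End of the Data Wrangling section: write the object to disk
save(bmi_long, file = rda_path)

# Start of the Data Exploration section: restore it, as in a fresh session
rm(bmi_long)
load(rda_path)
head(bmi_long)
```

Because `load()` restores objects under the names they were saved with, a lecture that begins at the Data Exploration section can pick up the wrangled data without rerunning the earlier code.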
This case study is designed for undergraduate students who have not taken a statistics course. While we do not discuss the theoretical aspects of the statistics concepts used in this case study, the case study discusses the motivation behind them.
Students can repeat a similar analysis, but evaluate the change in BMI over time using the global data available for each year between 2015 and 2017.
Estimate of RMarkdown Compilation Time:
About 31 to 41 seconds
This compilation time was measured on a PC running Windows 10. The range should be treated only as an estimate, since compilation time varies across machines and operating systems.