
Important links


The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, does not claim to be the most appropriate way to analyze a given dataset, and should not be used to make policy decisions without external consultation from scientific experts.


This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.


To cite this case study:

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). Exploring global patterns of obesity across rural and urban regions (Version v1.0.0).


We would like to acknowledge Jessica Fanzo for assisting in framing the major direction of the case study.

We would like to acknowledge Michael Breshock for his contributions to this case study and developing the OCSdata package.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

Reading Metrics

The total reading time for this case study was calculated with koRpus: About 70 minutes

The Flesch-Kincaid Readability Index was also calculated with koRpus: Grade 9, Age 14


Exploring Global Patterns of Obesity from 1985 to 2017


Body Mass Index (BMI) is often used as a proxy for adiposity, and BMI-based classifications define “underweight”, “normal”, “overweight”, and “obese”. Higher BMI has been associated with increased mortality and increased rates of type 2 diabetes, cancer, heart disease, and stroke. A recent paper showed that, contrary to the widely reported view that urbanization is one of the most important drivers of the global rise in obesity, BMI is increasing at the same rate or faster in rural areas than in cities, particularly in low- and middle-income regions. There is also a gender discrepancy: women have a higher BMI in rural communities.

Here, we explore this data to understand global patterns in obesity. This analysis is important because it may indicate the need to provide better access (financial and physical access) to healthy foods in rural communities, especially in low-income countries, to address the obesity crisis.

Motivating questions

  1. Is there a difference between rural and urban BMI estimates around the world? In particular, what does this difference look like for women?
  2. How have BMI estimates changed from 1985 to 2017? In particular, what does this change over time look like for women?
  3. How do different countries compare for BMI estimates? In particular, how does the United States compare to the rest of the world?


The data used in this analysis comes from a supplementary table for the following article:

NCD Risk Factor Collaboration (NCD-RisC). Rising rural body-mass index is the main driver of the global obesity epidemic in adults. Nature 569, 260–264 (2019).

This article is freely available online.

While gender and sex are not actually binary, the data used in this analysis only contain groups of individuals described as men or women.

Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data science Learning Objectives:

  1. Importing data from a PDF (pdftools)
  2. Subsetting and filtering data (dplyr)
  3. Working with character strings (stringr)
  4. Reshaping data into different formats (tidyr)
  5. Applying functions to all columns of a tibble (purrr)
  6. Creating data visualizations (ggplot2) with labels (ggrepel)
  7. Combining multiple plots (cowplot and patchwork)

Statistical Learning Objectives:

  1. Familiarity with the use of Quantile-Quantile plots to assess normality
  2. Define and understand the utility of alpha and the p value
  3. Describe the difference between nonparametric and parametric tests
  4. Be able to identify paired data
  5. Implementation of a paired t-test
  6. Interpretation of a paired t-test
  7. Implementation of a Wilcoxon signed-rank test
  8. Interpretation of a Wilcoxon signed-rank test
  9. Understanding of the need for multiple testing correction


In this case study, we will largely focus on methods for comparing two groups using parametric and nonparametric hypothesis tests. We also cover multiple testing correction and fairly advanced data visualization methods using ggplot2.
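As a rough illustration of why multiple testing correction matters, the sketch below applies the Bonferroni method to a handful of p-values. The case study itself works in R; this is a small Python sketch, and the p-values and the `bonferroni` helper name are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the Bonferroni-adjusted threshold and a significance flag per test."""
    m = len(p_values)
    # With m tests, each test is held to alpha / m so that the chance of
    # at least one type 1 error across all tests stays at or below alpha.
    threshold = alpha / m
    return threshold, [p < threshold for p in p_values]

# Made-up p-values from four hypothetical tests (not results from the case study).
p_values = [0.001, 0.02, 0.04, 0.30]
threshold, significant = bonferroni(p_values)
print(threshold, significant)
```

Three of these four p-values fall below an uncorrected alpha of 0.05, but only the smallest one passes the corrected threshold.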

Data import

Data is imported from a PDF using pdftools to obtain data from a large table.

[Image: the beginning of this table]

Data wrangling

This case study covers many wrangling techniques and largely involves using the package stringr.

  1. Dividing data into separate lines
  2. Removing excess white-space
  3. Removing redundant header information
  4. Correcting spacing issues
  5. Dealing with NA values that are labeled in an unusual manner
  6. Splitting the data into columns using a delimiter
  7. Changing variable names
  8. Sorting the data
  9. Converting to long format
  10. Separating a column into multiple columns
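
Several of the steps above can be sketched on a tiny example. The case study performs them in R (largely with stringr); the Python sketch below only illustrates the general idea. The text in `raw_page`, the column layout, the "NA*" label, and the `wrangle` helper are all made up for illustration and are not taken from the actual table.

```python
import re

# Made-up text standing in for one extracted PDF page (not the real table).
raw_page = """Country        Sex     Rural 1985   Urban 1985
Afghanistan    Men       19.7         22.4
Afghanistan    Women     20.1         23.2
Bermuda        Men       NA*          25.5
"""

def wrangle(page_text):
    # Divide the text into separate lines, skipping blank ones,
    # and trim excess white-space at the ends of each line.
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    # Remove the header line; the remaining lines are data rows.
    _header, *rows = lines
    records = []
    for row in rows:
        # Split into columns, using runs of two or more spaces as the delimiter.
        country, sex, rural, urban = re.split(r"\s{2,}", row)
        for region, value in (("Rural", rural), ("Urban", urban)):
            # Treat the unusually labeled "NA*" as a missing value.
            bmi = None if value.startswith("NA") else float(value)
            # Long format: one record per country, sex, and region.
            records.append({"country": country, "sex": sex,
                            "region": region, "year": 1985, "bmi": bmi})
    # Sort the data.
    return sorted(records, key=lambda r: (r["country"], r["sex"], r["region"]))

records = wrangle(raw_page)
for record in records:
    print(record)
```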

Data exploration

To explore the data we use the summarize() function as well as plots to look at the distribution of the data. Quantile-Quantile plots are used to evaluate the distribution and compare it to the theoretical normal distribution.
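
To make the idea behind a Quantile-Quantile plot concrete, the sketch below computes the points such a plot would show: each sorted observation paired with the quantile expected under a normal distribution. The case study draws these plots in R; this Python sketch uses made-up values and a hypothetical `qq_points` helper. The plotting positions (i - 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    observed = sorted(sample)
    # Reference normal distribution with the sample's own mean and standard deviation.
    reference = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n keep every probability strictly inside (0, 1).
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Made-up BMI-like values, for illustration only.
bmi_values = [21.3, 22.8, 23.1, 24.0, 24.6, 25.2, 26.9, 28.4]
points = qq_points(bmi_values)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the data are close to normal, the pairs fall near the line where the theoretical and observed values are equal; systematic curvature away from that line suggests skew or heavy tails.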

Statistical concepts

This case study covers fundamental concepts in statistics such as type 1 error, alpha threshold, p-values, hypothesis testing, parametric two sample mean tests, and nonparametric two sample tests, as well as the assumptions of the various included statistical tests and what to do when data is paired.
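
When data are paired, both the parametric and the nonparametric test work on the within-pair differences. The sketch below computes the paired t statistic and the Wilcoxon signed-rank sums by hand. It is a minimal Python illustration with made-up numbers and hypothetical helper names. It stops short of p-values, which need the reference distributions; in R, base functions such as `t.test()` and `wilcox.test()` report them.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

def signed_rank_sums(before, after):
    """Return (W+, W-): rank sums of the positive and negative differences."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(after, before) if a != b]
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of = {}
    i = 0
    while i < len(abs_sorted):
        j = i
        while j < len(abs_sorted) and abs_sorted[j] == abs_sorted[i]:
            j += 1
        # Tied absolute differences share the average of ranks i+1 .. j.
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2
        i = j
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return w_plus, w_minus

# Made-up paired measurements for five hypothetical groups (not real data).
before = [24.0, 25.5, 23.0, 26.5, 22.0]
after = [25.0, 27.5, 22.5, 29.5, 26.0]

t_stat, df = paired_t_statistic(before, after)
w_plus, w_minus = signed_rank_sums(before, after)
print(t_stat, df, w_plus, w_minus)
```

The t statistic uses the sizes of the differences directly and so relies on them being roughly normal, while the signed-rank sums use only their ranks and signs.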

Other notes and resources

Long and Wide Data Formats
Distributions
Normal Distribution
Skewed Distributions
Bimodal Distribution
ggplot2
Q-Q Plots
Student t-test
Paired Data
Welch’s t-test
Parametric and Nonparametric Methods
Balanced Study Design
Independent Observations
Permutation/Resampling Methods
Central Limit Theorem
Mood’s Two-Sample Scale Test
Wilcoxon Signed Rank Test
Wilcoxon Rank Sum Test
Two-sample Kolmogorov-Smirnov Test
Type 1 Error
Multiple Testing
Bonferroni Method of Multiple Testing Correction

Packages used in this case study:

| Package | Use in this case study |
|---|---|
| here | to easily load and save data with relative paths |
| pdftools | to read text from a PDF into R |
| stringr | to manipulate the text data |
| readr | to split the text data from the PDF into individual lines |
| dplyr | to arrange/filter/select subsets of the data |
| tibble | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr |
| magrittr | to use the %<>% piping operator |
| glue | to paste or combine character strings and data together |
| purrr | to apply functions to all columns of a tibble |
| tidyr | to convert data from ‘wide’ to ‘long’ format |
| ggplot2 | to make visualizations with multiple layers |
| ggrepel | to keep labels in figures from overlapping |
| cowplot and patchwork | to allow plots to be combined |

For users

There is a Makefile in this folder. Typing make knits the case study contained in index.Rmd to index.html, and also knits README.Rmd to a markdown file.

For instructors

Our goal is for instructors to use this case study as the starting point for a set of lectures. We provide one R Markdown file (index.Rmd) for an instructor to use. However, we anticipate the instructor may either break this file up into smaller R Markdown files for multiple lectures or extract only a portion of the material (e.g. the Data Wrangling or Data Analysis sections) to use in the classroom. With the latter goal in mind, we save a Wrangled_data.rda object at the end of the Data Wrangling section, which is loaded at the start of the Data Exploration section.

Target audience

This case study is designed for undergraduate students who have not taken a statistics course. While we do not discuss the theoretical aspects of the statistics concepts used in this case study, the case study discusses the motivation behind them.

Suggested homework

Students can repeat a similar analysis, but evaluate the change in BMI over time using the global data available for each year between 1985 and 2017.

Estimate of RMarkdown Compilation Time:

About 31 to 41 seconds

This compilation time was measured on a PC running Windows 10. This range should only be used as an estimate, as compilation time will vary with different machines and operating systems.