
Important links


The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.


This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.


To cite this case study:

Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). Exploring CO2 emissions across time (Version v1.0.0).


We would like to acknowledge Megan Latshaw for assisting in framing the major direction of the case study.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.


Exploring CO2 emissions across time


CO2 emissions have been on the rise for many countries. CO2 emissions trap heat in the atmosphere, which can raise global temperatures and, in turn, broadly affect the health of people and our planet. In this case study we explore national differences in CO2 emissions over time. We evaluate the relationship between CO2 emissions and average annual temperatures in the US. We also examine the relationship between emissions and natural disasters, as well as other factors that may influence or be influenced by CO2 emissions.

Motivating questions

  1. How have global CO2 emission rates changed over time? In particular, how have they changed for the US, and how does the US compare to other countries?
  2. Are CO2 emissions in the US, global temperatures, and natural disaster rates in the US associated?


In this case study we will be using data related to CO2 emissions, as well as other data that may influence, be influenced by, or otherwise relate to CO2 emissions.

This case study uses data from Gapminder that was originally obtained from the World Bank.

In addition, we will use some data that is specific to the United States from the National Oceanic and Atmospheric Administration (NOAA), which is an agency that collects weather and climate data.

Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. Importing data from various types of Excel files and CSV files
  2. Applying action verbs in dplyr for data wrangling
  3. Pivoting between “long” and “wide” datasets
  4. Joining together multiple datasets using dplyr
  5. Creating effective longitudinal data visualizations with ggplot2
  6. Adding text, color, and labels to ggplot2 plots
  7. Creating faceted ggplot2 plots

Statistical Learning Objectives:

  1. Introduction to correlation coefficient as a summary statistic
  2. Relationship between correlation and linear regression
  3. Correlation is not causation

Data import

Data from several .xlsx files and a couple of .csv files were imported using readxl and readr, respectively.

Data wrangling

This case study particularly focuses on renaming variables, modifying variables, creating new variables, and modifying the shape of the data, using functions such as rename() and mutate() from the dplyr package and pivot_longer() and pivot_wider() from the tidyr package.
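The renaming, modifying, and reshaping steps named above can be sketched on a toy table. This is a minimal illustration, not the case study's own code: the object names (co2_wide, co2_long, co2_back), the column names, and all values are made-up placeholders.

```r
library(dplyr)
library(tidyr)

# Made-up placeholder values in a "wide" layout:
# one row per country, one column per year. Not real emissions data.
co2_wide <- tibble(
  country = c("A", "B"),
  `2000`  = c(1.0, 2.0),
  `2001`  = c(1.5, 2.5)
)

# Wide -> long: one row per country-year pair.
co2_long <- co2_wide %>%
  pivot_longer(cols = -country, names_to = "Year", values_to = "Emissions") %>%
  rename(Country = country) %>%      # rename(new_name = old_name)
  mutate(Year = as.numeric(Year))    # modify a variable: character -> numeric

# Long -> wide again, showing that the two pivots undo each other here.
co2_back <- co2_long %>%
  pivot_wider(names_from = Year, values_from = Emissions)
```

The long layout is usually the more convenient one for grouping and plotting, while the wide layout is closer to how such spreadsheets are often distributed.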

This case study also covers combining data with bind_rows() and full_join() of the dplyr package, including a comparison of the two functions.
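As a rough sketch of how the two functions differ, the example below combines two small tables both ways. The tables and their values are made-up placeholders, and the key columns (Country, Year) are an assumption chosen for illustration.

```r
library(dplyr)

# Made-up placeholder tables, not data from the case study.
emissions <- tibble(Country = c("A", "B"), Year = c(2000, 2000), Emissions = c(1.0, 2.0))
energy    <- tibble(Country = c("A", "C"), Year = c(2000, 2000), Energy    = c(10, 30))

# bind_rows() stacks the tables on top of each other;
# a column missing from one table is filled with NA for that table's rows.
stacked <- bind_rows(emissions, energy)

# full_join() matches rows on the key columns and keeps
# unmatched rows from both tables, filling the gaps with NA.
joined <- full_join(emissions, energy, by = c("Country", "Year"))
```

Here bind_rows() yields one row per input row, so country A appears twice, whereas full_join() yields one row per Country and Year combination with both measurements side by side.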

We also cover filtering with the filter() function of the dplyr package, removing NA values with the drop_na() function of the tidyr package, arranging data with the arrange() function of the dplyr package, as well as grouping and summarizing data with the group_by() and summarize() functions of the dplyr package.
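These verbs are typically chained into a single pipeline. The sketch below assumes a small made-up table (co2) with hypothetical column names. It shows one plausible ordering of the steps, not necessarily the one used in the case study.

```r
library(dplyr)
library(tidyr)

# Made-up placeholder values, including one missing measurement.
co2 <- tibble(
  Country   = c("A", "A", "A", "B", "B", "C"),
  Year      = c(1999, 2000, 2001, 2000, 2001, 2001),
  Emissions = c(0.5, 1.0, 1.5, 2.0, NA, 4.0)
)

summary_tbl <- co2 %>%
  filter(Year >= 2000) %>%                 # keep rows from 2000 onward
  drop_na(Emissions) %>%                   # drop rows with missing emissions
  group_by(Country) %>%                    # one group per country
  summarize(
    mean_emissions = mean(Emissions),      # average over the remaining years
    n_years        = n()                   # number of rows that contributed
  ) %>%
  arrange(desc(mean_emissions))            # largest average first
```

Dropping NA values before summarizing avoids a mean of NA for any country with a missing year. Passing na.rm = TRUE to mean() would be an alternative.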

Data Visualization

We include a thorough introduction to ggplot2, including how to add color, facets, and labels to plots.
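A minimal sketch of a plot that uses all three features follows. The data frame, its column names, and the label text are made-up placeholders rather than the case study's own figures.

```r
library(ggplot2)

# Made-up placeholder values for two hypothetical countries.
co2 <- data.frame(
  Country   = rep(c("A", "B"), each = 3),
  Year      = rep(c(2000, 2001, 2002), times = 2),
  Emissions = c(1.0, 1.5, 2.0, 2.0, 2.2, 2.1)
)

p <- ggplot(co2, aes(x = Year, y = Emissions, color = Country)) +
  geom_line() +                       # one line per country, colored by Country
  facet_wrap(~ Country) +             # one panel (facet) per country
  labs(
    title = "Toy emissions over time",
    x     = "Year",
    y     = "Emissions (arbitrary units)"
  )

# print(p) draws the plot in an interactive session.
```

Mapping color inside aes() ties it to a variable, so ggplot2 builds the legend automatically. Setting a fixed color outside aes() would apply it to every line instead.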


In this case study we look at the correlation between CO2 emissions and annual average temperatures in the US. We also evaluate the association between the two using a linear regression. We discuss the relationship between correlation and linear regression and how we interpret the findings.
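The link between the two summaries can be checked numerically. In a simple linear regression with one predictor, the fitted slope equals the Pearson correlation scaled by the ratio of standard deviations, and the R-squared equals the squared correlation. The sketch below uses base R and made-up placeholder values, not the case study's data.

```r
# Made-up placeholder values, not real measurements.
emissions   <- c(1, 2, 3, 4, 5, 6)
temperature <- c(10.1, 10.4, 10.2, 10.9, 11.0, 11.3)

# Pearson correlation coefficient (the default method of cor()).
r <- cor(emissions, temperature)

# Simple linear regression of temperature on emissions.
fit   <- lm(temperature ~ emissions)
slope <- unname(coef(fit)["emissions"])

# Identity 1: slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(temperature) / sd(emissions)

# Identity 2: R-squared of the fit = r^2
r_squared <- summary(fit)$r.squared

isTRUE(all.equal(slope, slope_from_r))
isTRUE(all.equal(r_squared, r^2))
```

The correlation is unitless and symmetric in the two variables, whereas the slope carries units and depends on which variable is treated as the outcome. Neither one establishes that one variable causes the other.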

Other notes and resources

RStudio cheatsheets
Introduction to correlation
Correlation coefficient
Correlation does not imply causation
Locally estimated scatterplot smoothing
Local polynomial regression
Time series
Methods to account for autocorrelation
US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report
National Climate Assessment Report
Greenhouse gases
Climate change

Packages used in this case study:

Package        Use in this case study
here           to easily load and save data
readxl         to import the Excel file data
readr          to import the CSV file data
dplyr          to view and wrangle the data by modifying variables, renaming variables, selecting variables, creating variables, and arranging values within a variable
magrittr       to use and reassign data objects using the %<>% pipe operator
stringr        to select only the first 4 characters of date data
purrr          to apply a function on a list of tibbles (tibbles are the tidyverse version of a data frame)
tidyr          to drop rows with NA values from a tibble
forcats        to reorder the levels of a factor
ggplot2        to make visualizations
directlabels   to add labels to plots easily
ggrepel        to add labels that don’t overlap to plots
broom          to make the output from statistical tests easier to work with
patchwork      to combine plots

For users

There is a Makefile in this folder that allows you to type make to knit the case study contained in index.Rmd to index.html. It will also knit README.Rmd to a markdown file. Users can start at any section after the “What are the data?” section, although some aspects of the code may be explained in an earlier section.

For instructors

Instructors can start at any section after the “What are the data?” section. There is additional data about mortality over time in different countries from the World Bank in the extra subdirectory of the data directory. This could be used for additional analyses.

Target audience

This case study is appropriate for those new to R programming and new to statistics. It is also appropriate for more advanced R users who are new to the Tidyverse.

Suggested homework

Ask students to create a plot with labels showing the countries with the lowest CO2 emission levels.

Ask students to plot CO2 emissions and other variables (e.g. energy use) on a scatter plot, calculate Pearson’s correlation coefficient, and discuss the results.

