Tidy data principles (Elements of Computing II, S21, University of Notre Dame)


Tidy Data

This tutorial was written by Katherine Walden and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Lab Goals

This lab provides an overview of tidy data principles (Wickham et al). It covers how to recognize and address pattern errors in structured data using the data cleaning tool Open Refine and common spreadsheet programs like Microsoft Excel or Google Sheets. It also covers how to use survey design and data validation options to minimize user error in data entry.

By the end of this lab, students will be able to:

  • Understand the core components and principles of tidy data
  • Recognize pattern errors in structured datasets
  • Understand how to approach data cleaning and wrangling using OpenRefine and spreadsheet programs
  • Understand the affordances of survey design and data validation as part of data entry/wrangling workflows
Panopto logo Lecture/live coding playlist

Acknowledgements

The author consulted the following resources when building this tutorial:

Table of Contents

Lecture & Live Coding

Throughout this lab, you will see a Panopto icon at the start of select sections.

This icon indicates there is lecture/live coding asynchronous content that accompanies this section of the lab.

You can click the link in the figure caption to access these materials (ND users only).

Example:

Panopto logo Lecture/live coding playlist

Lab Notebook Template

Link to lab notebook template (Google Doc)

Data

Panopto logo Overview

The following data files are used in this tutorial, with Google Drive links (ND users only) provided below:

You can also download the files in a compressed folder (.zip): Google Drive link, ND users

Tools

We'll be opening some structured data files as part of our work in this lab. You can use a spreadsheet program or text editor to access these files.

We'll also be working with a free software program called OpenRefine as part of our work in this lab.

Tidy Data Principles

Panopto logo Tidy Data Principles

Hadley Wickham's 2014 article in the Journal of Statistical Software outlines the foundations and principles of tidy data. These principles have become widely used in data science and other statistical software applications.

  • "A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores." (Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.)
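
The three structural rules from Wickham's abstract (each variable a column, each observation a row, each observational unit a table) can be sketched in pandas. The table below is hypothetical: a "messy" layout spreads one variable (attendance) across several season columns, and `melt()` reshapes it into tidy form.

```python
import pandas as pd

# Hypothetical messy table: the "attendance" variable is spread
# across one column per season.
messy = pd.DataFrame({
    "team": ["Cubs", "Tigers"],
    "2019": [3094865, 1501430],
    "2020": [0, 0],
})

# Tidy form: team, season, and attendance are each a column,
# and each row is one observation.
tidy = messy.melt(id_vars="team", var_name="season", value_name="attendance")
print(tidy)
```

After melting, the two-row, three-column table becomes four rows of (team, season, attendance) observations.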

To prepare for this lab, we read Karl W. Broman and Kara H. Woo's 2018 "Data Organization in Spreadsheets" from The American Statistician.

  • "Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files." (Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989)

What Are the Principles?

Broman and Woo recommend designing spreadsheets that are “tidy, consistent, and as resistant to mistakes as possible” (2).

  1. Be Consistent:
  • Use consistent codes for categorical variables
  • Use a consistent fixed code for any missing values
  • Use consistent variable names
  • Use consistent subject identifiers
  • Use a consistent data layout in multiple files
  • Use consistent file names
  • Use a consistent format for all dates
  • Use consistent phrases in your notes
  • Be careful about extra spaces within cells
  2. Choose Good Names for Things:
  • Avoid spaces
  • Avoid special characters
  • Be short but meaningful
  3. Write Dates as YYYY-MM-DD:
  • Or have separate columns for YEAR, MONTH, DAY
  4. No Empty Cells

  5. Put Just One Thing in a Cell

  6. Make It a Rectangle:

  • Single first row with variable names

  7. Create a Data Dictionary:

  • “This is part of the metadata that you will want to prepare: information about the data” (6)
  • You might also find this information in a codebook that goes with a dataset
  • Things to include:
    • The exact variable name as in the data file
    • A version of the variable name that might be used in data visualizations
    • A longer explanation of what the variable means
    • The measurement units
    • Expected minimum and maximum values
  8. No Calculations in the Raw Data Files

  9. Do Not Use Font Color or Highlighting as Data

  10. Make Backups

  • Multiple locations (OneDrive, local computer, etc.)
  • Version control program (e.g., Git)
  • Write-protect the file when not entering data
  11. Use Data Validation to Avoid Errors

  12. Save a Copy of the Data in Plain Text Files

  • File formats can include comma-separated values (CSV) or plain text (TXT)
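
Principle 3 (write dates as YYYY-MM-DD) can be sketched in pandas. The date strings below are hypothetical examples of inconsistent formats; each is parsed and re-serialized in the recommended ISO form.

```python
import pandas as pd

# Hypothetical, inconsistently formatted date strings.
raw_dates = ["3/14/1994", "1994-03-15", "March 16, 1994"]

# Parse each value and re-serialize it as YYYY-MM-DD.
iso_dates = [pd.to_datetime(d).strftime("%Y-%m-%d") for d in raw_dates]
print(iso_dates)
```

Normalizing dates this way also makes files sort chronologically when sorted alphabetically, one reason Broman and Woo recommend the format.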

The principles are also available as a PDF:

Common Spreadsheet Errors

As described in Library Carpentry's "Tidy data for librarians" tutorial, common formatting problems for data in spreadsheets include:

  • Multiple tables
  • Multiple tabs
  • Not filling in zeros
  • Using bad null values
  • Using formatting to convey information
  • Using formatting to make the data sheet look pretty
  • Placing comments or units in cells
  • More than one piece of information in a cell
  • Field name problems
  • Special characters in data
  • Inclusion of metadata in data table
  • Date formatting
Q1: What questions do you have about these principles? Which ones are unclear or confusing?

Dealing With Messy Data

Panopto logo Dealing With Messy Data

Double check that you have the needed data files for this lab.

Open the Combined_Workbook file in a spreadsheet program (you can also open the two .csv files in a spreadsheet program or text editor).

Explore the tables (sheets/tabs in the combined workbook or separate .csv files), thinking about the types of information included or represented in the different tables and fields.

Q2: Explore both tables, thinking about what fields are represented in these datasets. Describe the data fields in your own words: what types of information do they contain, and what types of data are you seeing?
| Type | Description | Example |
| --- | --- | --- |
| String | Used to store text or a string of non-integer characters | "This classroom is in Bond Hall" or "student" |
| Integer | Used to store positive or negative whole numbers | -25, 0, 25 |
| Double | Used to store precise numerical values that include decimal points | 3.14159265359 |
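
The three types in the table map directly onto Python's built-in types (Python calls a double a `float`). A quick illustration:

```python
# The three data types from the table, as Python values.
string_value = "This classroom is in Bond Hall"  # string: text
integer_value = -25                              # integer: whole number
double_value = 3.14159265359                     # double: float in Python

print(type(string_value).__name__)   # str
print(type(integer_value).__name__)  # int
print(type(double_value).__name__)   # float
```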

Identify Patterns and Brainstorm Solutions

Compare what you see in these tables to the tidy data principles outlined above. Start by looking for small-scale discrepancies and inconsistencies within the datasets.

Q3: Provide three (3) distinct examples from the sample datasets that do not conform to tidy data principles. Include the example as well as an explanation of how this example does not conform to tidy data principles.
Q4: Take a step back and consider the pattern errors you're seeing in these datasets. What trends do you notice? Any thoughts or ideas as to how or why these pattern errors might have occurred?
Q5: How would you address these pattern errors so the data conforms to tidy data principles? Explain what steps you would take to address at least 3 pattern errors. Each error explanation should include three parts:
  1. an example of the error
  2. an explanation of your method to address the error
  3. the same example as tidy data

Data Wrangling Using OpenRefine

Panopto logo Data Cleaning in OpenRefine

As described in Library Carpentry's "Introduction to OpenRefine":

"OpenRefine is described as 'a power tool for working with messy data' (David Huynh) - but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you solve.


OpenRefine is most useful where you have data in a simple tabular format such as a spreadsheet, a comma separated values file (csv) or a tab delimited file (tsv) but with internal inconsistencies either in data formats, or where data appears, or in terminology used. OpenRefine can be used to standardize and clean data across your file.


It can help you:
  • Get an overview of a data set
  • Resolve inconsistencies in a data set, for example standardizing date formatting
  • Help you split data up into more granular parts, for example splitting up cells with multiple authors into separate cells
  • Match local data up to other data sets, for example in matching local subjects against the Library of Congress Subject Headings
  • Enhance a data set with data from other sources

Some common scenarios where you might use OpenRefine include:
  • Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
  • Where you want to know how values are distributed across your whole data set
  • Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format."
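
One of the uses listed above, splitting a cell with multiple values into more granular parts, can be sketched in pandas. The `authors` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical column where one cell holds multiple authors,
# violating the "one thing per cell" principle.
books = pd.DataFrame({"authors": ["Wickham; Broman", "Woo"]})

# Split each cell on the delimiter into a list of individual values.
split_authors = books["authors"].str.split("; ")
print(list(split_authors))
```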

Installing and Loading Data in OpenRefine

We'll also be working with a free software program called OpenRefine as part of our work in this lab. Navigate to https://openrefine.org/download.html in a web browser and download the appropriate version for your operating system.


Launch OpenRefine and click Create Project from the menu on the left-hand side.

Select the option to Get data from This Computer, and select Player_Birthplaces.csv file. Click Next.


You now have a variety of configuration options before creating your project in OpenRefine.

  • Select an appropriate character encoding schema.
  • Select the option to Parse next 1 line(s) as column headers.
    • This option treats the first row of your file as a table header.
  • Check to be sure the Parse cell text into numbers, dates, ... option is NOT selected.
  • In the Project name window, name your project.

Click Create Project >> to begin to work with your data as an OpenRefine project.

OpenRefine's Layout


As described in Library Carpentry's "Introduction to OpenRefine":

"OpenRefine displays data in a tabular format. Each row will usually represent a ‘record’ in the data, while each column represents a type of information. This is very similar to how you might view data in a spreadsheet or database. As with a spreadsheet, the individual bits of data live in ‘cells’ at the intersection of a row and a column.


OpenRefine only displays a limited number of rows of data at one time. You can adjust the number choosing between 5, 10 (the default), 25 and 50 at the top left of the table of data. You can navigate through the records by using the previous/next/first/last navigation options at the top right of the table of data."

Faceting and Filtering

As described in Library Carpentry's "Introduction to OpenRefine":

"Facets are one of the most useful features of OpenRefine and can help in both getting an overview of the data and to improve the consistency of the data.


A ‘Facet’ groups all the values that appear in a column, and then allows you to filter the data by these values and edit values across many records at the same time.


The simplest type of Facet is called a ‘Text facet’. This simply groups all the text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.


To create a Text Facet for a column, click on the drop down menu at the top of the publisher column and choose `Facet -> Text Facet`. The facet will then appear in the left hand panel.


The facet consists of a list of values used in the data. You can filter the data displayed by clicking on one of these headings.


You can include multiple values from the facet in a filter at one time by using the `Include` option which appears when you put your mouse over a value in the Facet.


You can also invert the filter to show all records which do not match your selected values. This option appears at the top of the Facet panel when you select a value from the facet to apply as a filter."


Select the drop-down arrow for one of the columns that contains a pattern error. Select Facet > Text Facet.


The facet will now appear on the left-hand side of the page. Click a line in the facet to select rows with that value.

  • Use the Include option to select multiple values.
  • Use the Edit option to address a pattern error.
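
The facet-then-edit workflow above has a direct pandas analogue, sketched here with a hypothetical `state` column containing a pattern error (two codes for the same state): `value_counts()` gives the text-facet overview, and `replace()` edits every matching cell at once.

```python
import pandas as pd

# Hypothetical column with a pattern error: "AL" and "Alabama"
# are two codes for the same value.
players = pd.DataFrame({"state": ["AL", "Alabama", "AL", "NY", "AL"]})

# A text facet lists each value with its record count.
print(players["state"].value_counts())

# Editing a facet value fixes all matching records in one step.
players["state"] = players["state"].replace({"Alabama": "AL"})
print(players["state"].value_counts())
```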

Do this for other pattern errors. Consult the following resources as needed to understand this data:

Exporting from OpenRefine


Click on the Export button in the top right-hand corner and select the option to export your OpenRefine project as a CSV file.

NOTE: OpenRefine will only export the rows of data selected in your current review. Be sure to remove all filters or facets before exporting.

Make sure this file has a unique name and is saved in a location where you can find it again. Open this new CSV file in a spreadsheet program. Check to see the pattern errors have been addressed.

Go through this same process for the Team_Locations.csv file.

Q6: Compare your experience working in OpenRefine to other experiences you have had in a text editor or spreadsheet program. In what ways do you understand, perceive, or relate to the data differently through working in OpenRefine? Describe your experience cleaning this data in OpenRefine.

We are barely scratching the surface of what is possible with data wrangling in OpenRefine. The program can also standardize capitalization, remove leading and trailing spaces, and address other commonly found data errors. Library Carpentry's OpenRefine tutorial goes into greater detail about many of these other functions.

Data Wrangling Using A Spreadsheet Program

Panopto logo Data Wrangling in a Spreadsheet Program

Spreadsheet software programs are another commonly-used tool for interacting with structured data. Some spreadsheet programs like Microsoft Office's Excel or Apple's Numbers are proprietary software installed on a local computer. Other proprietary spreadsheet programs like Google Sheets run online and are not installed locally. Open-source spreadsheet programs include OpenOffice's Calc and LibreOffice's Calc.

Q7: Describe a past experience working with a spreadsheet program. What were you trying to do? How did it go? What were some of your overall impressions about working with data in a spreadsheet program?

There are two options for the next section of the lab: working in Google Sheets or Microsoft Excel. You are only expected to complete one of those options. For either option, the workflow covered in this section of the lab loads individual tables into a workbook and uses spreadsheet program tools to interact with and clean (or "tidy") the data.

Microsoft Excel

Loading Data

There are two options for loading data files in Excel.

  • Import CSV files
  • Open Excel workbook

Loading Data From CSV Files

Open a blank Microsoft Excel file. Save the blank file as an Excel workbook.


Click on Data in the top menu bar. Under Get Data select the From Text/CSV option. In Sheet1, select the Player_Birthplaces.csv file.


In the pop-up window, make sure Comma is selected as the delimiter, and switch File Origin to UTF-8. Click Load.


You should now see the CSV data in the Excel workbook. Go through the same process for the team_locations.csv file. Save the updated workbook.

Loading Data as an Excel Workbook


Alternatively, you can download the Google Sheets file as an Excel workbook (.xlsx). That workbook file includes both tables needed for this lab.

Data Cleaning in Excel


Click the drop-down arrow next to a column header to see additional options for that field. Use these sort, search, and filter options to address data pattern errors.

Find and Replace


Alternatively, use the Replace option under Find & Select (in the Home menu section) to address pattern errors.


Click the Options button to see additional options.

Consult the following resources as needed to understand this data:

To change all cells in a column, click the cell in the first non-header row. Press Control/Command + Shift and the down arrow key to select all cells with data in that column. Press Control/Command and D to copy the first value into the other selected cells. Alternatively, move your cursor over the bottom right-hand corner of the cell in the first non-header row. Click and drag the plus icon that appears down through the column to copy the value in the first cell into the subsequent cells.
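
The fill-down shortcut described above has a one-line equivalent in pandas: `ffill()` propagates the first entered value down through the empty cells. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical table where only the first "league" cell was entered.
teams = pd.DataFrame({
    "team": ["A", "B", "C"],
    "league": ["NL", None, None],
})

# ffill() copies the last non-empty value down, like the fill-down shortcut.
teams["league"] = teams["league"].ffill()
print(teams)
```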

Go through this same process for the team_locations table.

Saving and Exporting in Excel

The default file type in Microsoft Excel is an Excel workbook (.xlsx).


Click on the File menu section to show additional export options. Under Export you can see some of the other options for exporting the data in Excel.

  • While plain-text formats (tab-separated values, .tsv; comma-separated values, .csv; etc.) are best for digital preservation and interoperability, they only accept a single table.

Google Sheets

Loading Data

There are two options for loading data files in Google Sheets.

Loading Data From CSV Files


Open a blank Google Sheets project. Replace Untitled spreadsheet in the top-left hand corner with a meaningful file name.


Select Import under File in the top-level menu.


Select the option to Upload a file from your local computer, or drag and drop the player_birthplaces.csv file. Click Select.


A few settings to check before we import:

  • Change the Import Location to Insert new sheet(s)
  • Select Comma as the Separator type
  • Uncheck the Convert text to numbers, dates, and formulas box

Click Import data.


The player_birthplaces table is now loaded in Google Sheets. Go through the same workflow for the team_locations.csv file.

Copying the Google Sheets Project

Alternatively, you can make a copy of the Google Sheets project to your local Drive. The Google Sheets project includes both tables needed for this lab.

Data Cleaning in Google Sheets

Freezing the Header Row

The first thing we want to do is freeze the first row in the table, which has our column headers. This way when we start sorting specific columns, the column labels will not be affected.


Click View on the top-level menu, and select Freeze. Click 1 row to freeze the first row in the table.

Sorting

Then, we can select a single column to sort by character or number values.


Right click on a column label and select one of the Sort sheet options.


Or, click Data -> Sort sheet -> Sort range to sort by a particular column or selection. We can also sort by multiple columns.


Click Data -> Sort range -> Advanced range sorting options to access these options.


In the pop-up window, you can check the Data has header row box and then choose to sort by multiple columns. Click Sort to apply these settings.
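
Multi-column sorting, as configured in the dialog above, can be sketched in pandas with `sort_values()`. The columns and values are hypothetical stand-ins for fields in the lab data.

```python
import pandas as pd

# Hypothetical two-column table.
df = pd.DataFrame({
    "state": ["NY", "AL", "NY", "AL"],
    "city":  ["Buffalo", "Mobile", "Albany", "Auburn"],
})

# Sort by state first, then by city within each state.
sorted_df = df.sort_values(by=["state", "city"]).reset_index(drop=True)
print(sorted_df)
```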

Filters

We can also add filters to only see specific rows or values.


Click Create a filter under Data.


Now we can set up filters for individual columns in the table.


Click on the three horizontal lines next to a column label to open the filter menu.


One option is to filter by condition.


Another option is to filter by values, which allows you to search for particular characters or select specific values. Click OK to apply the filter.


Google Sheets allows you to save particular filter settings/configurations as a Filter view.

  • Data -> Filter views -> Save as filter view


You can also remove any filters from the Google Sheets project.

  • Data -> Remove filter
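
Filtering by condition and filtering by values both have pandas analogues via boolean indexing. The columns below are hypothetical.

```python
import pandas as pd

# Hypothetical table to filter.
df = pd.DataFrame({"year": [1899, 1901, 1950], "team": ["A", "B", "C"]})

# Filter by condition: keep rows where year is 1900 or later.
by_condition = df[df["year"] >= 1900]

# Filter by values: keep rows matching a selected set of values.
by_value = df[df["team"].isin(["A", "C"])]

print(by_condition)
print(by_value)
```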

Find and Replace in Google Sheets

Google Sheets also supports find and replace operations.


Click Find and replace under the Edit menu tab.


Google Sheets gives you options to match by case or for an entire cell's contents. You can also search using regular expressions.


Click on the Search drop-down menu to see additional options for searching in Google Sheets.
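
Regular-expression find and replace, as supported in the dialog above, can be sketched in pandas with `str.replace()`. The example fixes a common pattern error, extra internal spaces, on hypothetical values.

```python
import pandas as pd

# Hypothetical values with inconsistent internal spacing.
cities = pd.Series(["Chicago", "New  York", "St.  Louis"])

# Replace any run of whitespace with a single space.
cleaned = cities.str.replace(r"\s+", " ", regex=True)
print(list(cleaned))
```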

Saving and Exporting in Google Sheets

Google Sheets projects are cloud-based, which means they are stored online (in the cloud).


Click on the File -> Download menu to show export options.

Under Download you can see some of the other options for exporting the data.

  • An Excel workbook (.xlsx) supports multiple tables or sheets in a single workbook file.
  • Plain-text formats (tab-separated values, .tsv; comma-separated values, .csv; etc.) are best for digital preservation and interoperability, but they only accept a single table.

Additional Resources and Next Steps

Q8: Compare your experience working in a spreadsheet program (Excel or Google Sheets) to your experience working in OpenRefine. In what ways do you understand, perceive, or relate to the data differently through working in a spreadsheet program? Describe your experience cleaning this data in a spreadsheet program.

We are barely scratching the surface of what is possible with data wrangling in Excel or Google Sheets. These programs can also standardize capitalization, remove leading and trailing spaces, and address other commonly found data errors.

Workflows for Data Entry

Panopto logo Workflows for Data Entry
Q9: For the baseball datasets we have been working with in this lab, what do you think may have contributed to or caused the pattern errors we needed to address? How could these pattern errors be addressed in the data entry process?

According to Library Carpentry's Tidy Data for Librarians tutorial, "Quality assurance stops bad data from ever being entered by checking to see if values are valid during data entry. For example, if research is being conducted at sites A, B, and C, then the value V (which is right next to B on the keyboard) should never be entered. Likewise if one of the kinds of data being collected is a count, only integers greater than or equal to zero should be allowed."

Building quality assurance constraints into a data entry workflow can help minimize pattern errors that require later cleaning.
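
The sites A/B/C example from the quoted passage can be sketched as an entry-time check. The function below is a hypothetical illustration, not part of the lab tools: it rejects invalid site codes and negative counts before they enter the dataset.

```python
# Quality assurance at data entry: reject bad values immediately,
# as in the sites A/B/C example above. Names here are hypothetical.
VALID_SITES = {"A", "B", "C"}

def validate_entry(site, count):
    # A mistyped site (e.g., "V", next to "B" on the keyboard) is rejected.
    if site not in VALID_SITES:
        raise ValueError(f"invalid site: {site!r}")
    # Counts must be non-negative whole numbers.
    if not isinstance(count, int) or count < 0:
        raise ValueError(f"count must be a non-negative integer: {count!r}")

validate_entry("B", 12)      # passes silently
# validate_entry("V", 12)    # would raise ValueError
```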

Survey Versus Spreadsheet

One option is to use a survey with some pre-defined choices or drop-down options. Log into your Notre Dame Google Drive account.


Select the option to create a Google Form. Click on the Blank Form option.


Explore the different question types.

Q10: Describe how you would go about building a survey form or template for the Tidy_Data_Lab_PlayerBirthplaces.csv file. You DO NOT need to actually create or submit a survey form. Describe what types of questions and pre-defined question or field options you could use to more effectively generate the data in this file.

Data Validation

You can also use some of the built-in data validation options in a spreadsheet program.

Data Validation in Excel


Under Data, select the Data Validation option.


In the pop-up window, you can limit the types of values that will be valid in a particular field. You can also define a list of options or values which will show up as drop-down options for cells in that field.

Visit Microsoft Office's "Apply data validation to cells" article to learn more about data validation options.

Data Validation in Google Sheets

Google Sheets also supports some data validation operations.


Select Data validation under the Data menu tab.


You can customize the cell range for specific validation settings.


The Criteria drop-down list includes a range of options for data validation settings.

Visit Zapier's "What is data validation in Google Sheets?" article to learn more about data validation options.

Q11: Describe how you would go about using data validation to build a template for the player_birthplaces.csv file. You DO NOT need to actually create or submit a template. Describe what data validation options and pre-defined field options you could use to more effectively generate the data in this file.

Lab Notebook Questions

Link to lab notebook template (Google Doc)

Q1: What questions do you have about these principles? Which ones are unclear or confusing?

Q2: Explore both tables, thinking about what fields are represented in these datasets. Describe the data fields in your own words: what types of information do they contain, and what types of data are you seeing?

Q3: Provide 3 distinct examples from the sample datasets that do not conform to tidy data principles. Include the example as well as an explanation.

Q4: Take a step back and consider the pattern errors you're seeing in these datasets. What trends do you notice? Any thoughts or ideas as to how or why these pattern errors might have occurred?

Q5: How would you address these pattern errors so the data conforms to tidy data principles? Explain what steps you would take to address at least 3 pattern errors. Each error explanation should include three parts:

  1. an example of the error
  2. an explanation of your method to address the error
  3. the same example as tidy data

Q6: Compare your experience working in OpenRefine to other experiences you have had in a text editor or spreadsheet program. In what ways do you understand, perceive, or relate to the data differently through working in OpenRefine? Describe your experience cleaning this data in OpenRefine.

Q7: Describe a past experience working with a spreadsheet program. What were you trying to do? How did it go? What were some of your overall impressions about working with data in a spreadsheet program?

Q8: Compare your experience working in a spreadsheet program to your experience working in OpenRefine. In what ways do you understand, perceive, or relate to the data differently through working in a spreadsheet program? Describe your experience cleaning this data in a spreadsheet program.

Q9: For the baseball datasets we have been working with in this lab, what do you think may have contributed to or caused the pattern errors we needed to address? How could these pattern errors be addressed in the data entry process?

Q10: Describe how you would go about building a survey form or template for the player_birthplaces.csv file. You DO NOT need to actually create or submit a survey form. Describe what types of questions and pre-defined question or field options you could use to more effectively generate the data in this file.

Q11: Describe how you would go about using data validation to build a template for the player_birthplaces.csv file. You DO NOT need to actually create or submit a template. Describe what data validation options and pre-defined field options you could use to more effectively generate the data in this file.
