Tidy Data

This tutorial was written by Katherine Walden and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Lab Goals

This lab provides an overview of tidy data principles (Wickham et al). It covers how to recognize and address pattern errors in structured data using the data cleaning tool Open Refine and common spreadsheet programs like Microsoft Excel or Google Sheets. It also covers how to use survey design and data validation options to minimize user error in data entry.

By the end of this lab, students will be able to:

Understand the core components and principles of tidy data
Recognize pattern errors in structured datasets
Understand how to approach data cleaning and wrangling using OpenRefine and spreadsheet programs
Understand the affordances of survey design and data validation as part of data entry/wrangling workflows

Lecture/live coding playlist

Acknowledgements

The author consulted the following resources when building this tutorial:

Lecture & Live Coding

Throughout this lab, you will see a Panopto icon at the start of select sections.

This icon indicates there is lecture/live coding asynchronous content that accompanies this section of the lab.

You can click the link in the figure caption to access these materials (ND users only).

Example:

Lecture/live coding playlist

Lab Notebook Template

Link to lab notebook template (Google Doc)

Screenshots (as needed) are HIGHLY RECOMMENDED.
- Windows (Snipping Tool for folks running older versions of Windows; Snip & Sketch for folks running updated versions)
- Mac
- Google Chromebook
Tutorial for adding images/tables/drawings to a Google Doc

Data

Overview

The following data files are used in this tutorial, with Google Drive links (ND users only) provided below:

Google Sheets project with all needed files:
- Combined_Workbook.xlsx
Individual CSV files:
- Player_Birthplaces.csv
- Team_Locations.csv

You can also download the files in a compressed folder (.zip): Google Drive link, ND users

Tools

We'll be opening some structured data files as part of our work in this lab. You can use a spreadsheet program or text editor to access these files.

Text editors:
- TextEdit (Mac)
- Notepad or Notepad++ (Windows)
- Atom, Brackets (free download options for Mac and Windows)
- Text (Chromebook)
Spreadsheet programs:
- Microsoft Excel (Windows or Mac)
  - ND students have free access to the Microsoft Office suite through Office365
- Apple Numbers (Mac)
- LibreCalc (open source Excel/Numbers alternative for Mac or Windows users)
- Google Sheets (web-based option available through Google Drive)

We'll also be working with a free software program called OpenRefine as part of our work in this lab.

Navigate to https://openrefine.org/download.html in a web browser and download the appropriate version for your operating system.
If you are getting memory-related error messages, visit https://docs.openrefine.org/manual/installing#increasing-memory-allocation to troubleshoot.

Tidy Data Principles

Hadley Wickham's 2014 article in the Journal of Statistical Software outlines the foundations and principles of tidy data. These principles have become widely used in data science and other statistical software applications.

"A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores." (Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.)

To prepare for this lab, we read Karl W. Broman and Kara H. Woo's 2018 "Data Organization in Spreadsheets" from The American Statistician.

"Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files." (Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989)

What Are the Principles

Designing spreadsheets that are “tidy, consistent, and as resistant to mistakes as possible” (2)

Be Consistent:

Use consistent codes for categorical variables
Use a consistent fixed code for any missing values
Use consistent variable names
Use consistent subject identifiers
Use a consistent data layout in multiple files
Use consistent file names
Use a consistent format for all dates
Use consistent phrases in your notes
Be careful about extra spaces within cells

Choose Good Names for Things:

Avoid spaces
Avoid special characters
Be short but meaningful

Write Dates as YYYY-MM-DD

Or have separate columns for YEAR, MONTH, DATE

No Empty Cells
Put Just One Thing in a Cell
Make it a Rectangle

Single first row with variable names

7.- Create a Data Dictionary

“This is part of the metadata that you will want to prepare: information about the data” (6)
You might also find this information in a codebook that goes with a dataset
Things to include:
- The exact variable name as in the data file
- A version of the variable name that might be used in data visualizations
- A longer explanation of what the variable means
- The measurement units
- Expected minimum and maximum values

No Calculations in the Raw Data Files
Do Not Use Font Color or Highlighting as Data
Make Backups

Multiple locations (OneDrive, local computer, etc.)
Version control program (i.e. Git)
Write protect the file when not entering data

Use Data Validation to Avoid Errors
Save a Copy of the Data in Plain Text Files

File formats can include comma-separated values (CSV) or plain-text (TXT)

The principles are also available as a PDF:

Common Spreadsheet Errors

As described in Library Carpentry's "Tidy data for librarians" tutorial, common formatting problems for data in spreadsheets include:

Multiple tables
Multiple tabs
Not filling in zeros
Using bad null values
Using formatting to convey information
Using formatting to make the data sheet look pretty
Placing comments or units in cells
More than one piece of information in a cell
Field name problems
Special characters in data
Inclusion of metadata in data table
Date formatting

Q1: What questions do you have about these principles? Which ones are unclear or confusing?

Dealing With Messy Data

Double check that you have the needed data files for this lab.

Google Sheets project with all needed files:
- Combined_Workbook.xlsx
Individual CSV files:
- Player_Birthplaces.csv
- Team_Locations.csv

Open the the Combined_Workbook file in a spreadsheet program (you can also open the two .csv files in a spreadsheet program or text editor).

Text editors:
- TextEdit (Mac)
- Notepad or Notepad++ (Windows)
- Atom, Brackets (free download options for Mac and Windows)
- Text (Chromebook)
Spreadsheet programs:
- Microsoft Excel (Windows or Mac)
  - ND students have free access to the Microsoft Office suite through Office365
- Apple Numbers (Mac)
- LibreCalc (open source Excel/Numbers alternative for Mac or Windows users)
- Google Sheets (web-based option available through Google Drive)

Explore the tables tables (sheets/tabs in the combined workbook or separate .csv files), thinking about the types of information included or represented in different tables and fields

Q2: Explore both tables, thinking about what fields are represented in these datasets. Describe the data fields in your own words- what types of information do they contain, and what types of data are you seeing?

Type	Description	Example
String	Used to store text or a string of non-integer characters	"This classroom is in Bond Hall" or "student"
Integer	Used to store positive or negative whole numbers	-25, 0, 25
Double	Used to store precise numerical values that include decimal points	3.14159265359

Identify Patterns and Brainstorm Solutions

Compare what you see in these tables to the tidy data principles outlined above. Start by looking for small-scale discrepencies and inconsistencies within the datasets.

Q3: Provide three (3) distinct examples from the sample datasets that do not conform to tidy data principles. Include the example as well as an explanation of how this example does not conform to tidy data principles.

Q4: Take a step back and consider the pattern errors you're seeing in these datasets. What trends do you notice? Any thoughts or ideas as to how or why these pattern errors might have occurred?

Q5: How would you address these pattern errors so the data conforms to tidy data principles? Explain what steps you would take to address at least 3 pattern errors. Each error explanation should include three parts:

an example of the error

an explanation of your method to address the error

the same example as tidy data

Data Wrangling Using OpenRefine

Data Cleaning in Open Refine

As described in Library Carpentry's "Introduction to OpenRefine":

OpenRefine is described as 'a power tool for working with messy data' (David Huynh) - but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you solve.

OpenRefine is most useful where you have data in a simple tabular format such as a spreadsheet, a comma separated values file (csv) or a tab delimited file (tsv) but with internal inconsistencies either in data formats, or where data appears, or in terminology used. OpenRefine can be used to standardize and clean data across your file.

It can help you:

Get an overview of a data set

Resolve inconsistencies in a data set, for example standardizing date formatting

Help you split data up into more granular parts, for example splitting up cells with multiple authors into separate cells

Match local data up to other data sets, for example in matching local subjects against the Library of Congress Subject Headings

Enhance a data set with data from other sources

Some common scenarios where you might use OpenRefine include:
Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data

Where you want to know how values are distributed across your whole data set

Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format."

Installing and Loading Data in OpenRefine

We'll also be working with a free software program called OpenRefine as part of our work in this lab. Navigate to https://openrefine.org/download.html in a web browser and download the appropriate version for your operating system.

If you are getting memory-related error messages, visit https://docs.openrefine.org/manual/installing#increasing-memory-allocation to troubleshoot.

Launch OpenRefine and click Create Project from the menu on the left-hand side.

Select the option to Get data from This Computer, and select Player_Birthplaces.csv file. Click Next.

You now have a variety of configuration options before creating your project in OpenRefine.

Select an appropriate character encoding schema.
Select the option to Parse next 1 line(s) as column headers.
- This option treats the first row of your file as a table header.
Check to be sure the Parse cell text into numbers, dates, ... option is NOT selected.
In the Project name window, name your project.

Click Create Project >> to begin to work with your data as an OpenRefine project.

OpenRefine's Layout

As described in Library Carpentry's "Introduction to OpenRefine":

"OpenRefine displays data in a tabular format. Each row will usually represent a ‘record’ in the data, while each column represents a type of information. This is very similar to how you might view data in a spreadsheet or database. As with a spreadsheet, the individual bits of data live in ‘cells’ at the intersection of a row and a column.

OpenRefine only displays a limited number of rows of data at one time. You can adjust the number choosing between 5, 10 (the default), 25 and 50 at the top left of the table of data. You can navigate through the records by using the previous/next/first/last navigation options at the top right of the table of data."

Faceting and Filtering

As described in Library Carpentry's "Introduction to OpenRefine":

"Facets are one of the most useful features of OpenRefine and can help in both getting an overview of the data and to improve the consistency of the data.

A ‘Facet’ groups all the values that appear in a column, and then allows you to filter the data by these values and edit values across many records at the same time.

The simplest type of Facet is called a ‘Text facet’. This simply groups all the text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.

To create a Text Facet for a column, click on the drop down menu at the top of the publisher column and choose `Facet -> Text Facet`. The facet will then appear in the left hand panel.

The facet consists of a list of values used in the data. You can filter the data displayed by clicking on one of these headings.

You can include multiple values from the facet in a filter at one time by using the `Include` option which appears when you put your mouse over a value in the Facet.

You can also invert the filter to show all records which do not match your selected values. This option appears at the top of the Facet panel when you select a value from the facet to apply as a filter."

Select the drop-down arrow for one of the columns that contains a pattern error. Select Facet > Text Facet.

The facet will now appear on the left-hand side of the page. Click a line in the facet to select rows with that value.

Use the Include option to select multiple values.
Use the Edit option to address a pattern error.

Do this for other pattern errors. Consult the following resources as needed to understand this data:

Exporting from OpenRefine

Click on the Export button in the top right-hand corner and select the option to export your OpenRefine project as a CSV file.

NOTE: OpenRefine will only export the rows of data selected in your current review. Be sure to remove all filters or facets before exporting.

Make sure this file has a unique name and is saved in a location where you can find it again. Open this new CSV file in a spreadsheet program. Check to see the pattern errors have been addressed.

Go through this same process for the Team_Locations.csv file.

Q6: Compare your experience working in OpenRefine to other experiences you have had in a text editor or spreadsheet program. In what ways do you understand, perceive, or relate to the data differently through working in OpenRefine? Describe your experience cleaning this data in OpenRefine.

We are barely scratching the surface of what is possible with data wrangling in OpenRefine. The progam can also standardize capitalization, remove leading and trailing spaces, and address other commonly-found data errors. Library Carpentry's OpenRefine tutorial goes in to greater detail about many of these other functions.

Data Wrangling Using A Spreadsheet Program

Data Wrangling in a Spreadsheet Program

Spreadsheet software programs are another commonly-used tool for interacting with structured data. Some spreadsheet programs like Microsoft Office's Excel or Apple's Numbers are proprietary software installed on a local computer. Other proprietary spreadsheet programs like Google Sheets run online and are not installed locally. Open-source spreadsheet programs include OpenOffice's Calc and LibreOffice's Calc.

Q7: Describe a past experience working with a spreadsheet program. What were you trying to do? How did it go? What were some of your overall impressions about working with data in a spreadsheet program?

There are two options for the next section of the lab- working in Google Sheets or Microsoft Excel. You are only expected to complete one of those options. For either option, the workflow covered in this section of the lab loads individual tables into a workbook and uses spreadsheet program tools to interact with and clean (or "tidy") the data.

Microsoft Excel
- Steps 76-95
Google Sheets
- Steps 96-134

Microsoft Excel

Loading Data

There are two options for loading data files in Excel.

Import CSV files
Open Excel workbook

Loading Data From CSV Files

Open a blank Microsoft Excel file. Save the blank file as an Excel workbook.

Click on Data in the top menu bar. Under Get Data select the From Text/CSV option. In Sheet1, select the Player_Birthplaces.csv file.

In the pop-up window, make sure Comma is selected as the delimiter, and switch File Origin to UTF-8. Click Load

You should now see the CSV data in the Excel workbook. Go through the same process for the team_locations.csv file. Save the updated workbook.

Loading Data as an Excel Workbook

Alternatively, you can download the Google Sheets file as an Excel workbook (.xlsx). That workbook file includes both tables needed for this lab.

Data Cleaning in Excel

Click the drop-down arrow next to a column header to see additional options for that field. Use these sort, search, and filter options to address data pattern errors.

Find and Replace

Alternatively, use the Replace option under Find & Select (in the Home menu section) to address pattern errors.

Click the Options button to see additional options.

Consult the following resources as needed to understand this data:

To change all cells in a column, click the cell in the first non-header row. Press Control or Command and the down arrow key to select all cells with data in that column. Press Control/Command and D to copy the first value into the other selected cells. Alternatively, move your cursor over the bottom right-hand corner of the cell in the first non-header row. Click and drag the plus icon that appears down through the column to copy the value in the first cell into the subsequent cells.

Go through this same process for the team_locations table.

Saving and Exporting in Excel

The default file type in Microsoft Excel is an Excel workbook (.xlsx).

Click on the File menu section to show additional export options. Under Export you can see some of the other options for exporting the data in Excel.

While plain-text formats (tab separated values, tsv; comma separated values, csv, etc) are best for digital preservation and interoperability, they only accept a single table.

Google Sheets

Loading Data

There are two options for loading data files in Google Sheets.

Import CSV files
Copy Google Sheets project

Loading Data From CSV Files

Open a blank Google Sheets project. Replace Untitled spreadsheet in the top-left hand corner with a meaningful file name.

Select Import under File in the top-level menu.

Select the option to Upload a file from your local computer, or drag and drop the player_birthplaces.csv file. Click Select.

A few settings to check before we import:

Change the Import Location to Insert new sheet(s)
Select Comma as the Separator type
Uncheck the Convert text to numbers, dates, and formulas box

Click Import data.

The player_birthplaces table is now loaded in Google Sheets. Go through the same workflow for the team_locations.csv file.

Copying the Google Sheets Project

Alternatively, you can make a copy of the Google Sheets project to your local Drive. The Google Sheets project includes both tables needed for this lab.

Data Cleaning in Google Sheets

Freezing the Header Row

The first thing we want to do is freeze the first row in the table, which has our column headers. This way when we start sorting specific columns, the column labels will not be affected.

Click View on the top-level menu, and select Freeze. Click 1 row to freeze the first row in the table.

Sorting

Then, we can select a single column to sort by character or number values.

Right click on a column label and select one of the Sort sheet options.

Or, click Data -> Sort sheet -> Sort range to sort by a particular column or selection. We can also sort by multiple columns.

Click Data -> Sort range -> Advanced range sorting options to access these options.

In the pop-up window, you can check the Data has header row box and then choose to sort by multiple columns. Click Sort to apply these settings.

Filters

We can also add filters to only see specific rows or values.

Click Create a filter under Data.

Now we can set up filters for individual columns in the table.

Click on the three horizontal lines next to a column label to open the filter menu.

One option is to filter by condition.

Another option is to filter by values, which allows you to search for particular characters or select specific values. Click OK to apply the filter.

Google Sheets allows you to save particular filter settings/configurating as a Filter view.

Data -> Filter views -> Save as filter view

You can also remove any filters from the Google Sheets project.

Data -> Remove filter

Find and Replace in Google Sheets

Google Sheets also supports find and replace operations.

Click Find and replace under the Edit menu tab.

Google Sheets gives you options to match by case or for an entire cell's contents. You can also search using regular expressions.

To learn more: "Find and replace items using regular expressions," Google Docs Support

Click on the Search drop-down menu to see additional options for searching in Google Sheets.

Saving and Exporting in Google Sheets

Google Sheets projects are cloud-based, which means they are stored online (in the cloud).

Click on the File - Download menu to show export options.

Under Download you can see some of the other options for exporting the data.

An Excel workbook (.xlsx) supports multiple tables or sheets in a single workbook file.
Plain-text formats (tab separated values, tsv; comma separated values, csv, etc) are best for digital preservation and interoperability, but they only accept a single table.

Additional Resources and Next Steps

Q8: Compare your experience working in a spreadsheet program (Excel or Google Sheets) to your experience working in OpenRefine. In what ways do you understand, perceive, or relate to the data differently through working in a spreadsheet program? Describe your experience cleaning this data in a spreadsheet program.

We are barely scratching the surface of what is possible with data wrangling in Excel or Google Sheets. The progam can also standardize capitalization, remove leading and trailing spaces, and address other commonly-found data errors.

For more on Excel: Library Carpentry's Tidy Data for Librarians tutorial goes in to greater detail about many of these other functions.
For more on Google Sheets: 10 Google Workspace tips to clean up data

Workflows for Data Entry

Q9: For the baseball datasets we have been working with in this lab, what do you think may have contributed to or caused the pattern errors we needed to address? How could these pattern errors be addressed in the data entry process?

According to Library Carpentry's Tidy Data for Librarians tutorial, "Quality assurance stops bad data from ever being entered by checking to see if values are valid during data entry. For example, if research is being conducted at sites A, B, and C, then the value V (which is right next to B on the keyboard) should never be entered. Likewise if one of the kinds of data being collected is a count, only integers greater than or equal to zero should be allowed."

Building in quality assurance constraints into a data entry workflow can help minimize pattern errors that require later cleaning.

Survey Versus Spreadsheet

One option is use a survey with some pre-defined choices or drop-down options. Log into your Notre Dame Google Drive account.

Select the option to create a Google Form. Click on the Blank Form option.

Explore the different question types.

Q10: Describe how you would go about building a survey form or template for the Tidy_Data_Lab_PlayerBirthplaces.csv file. You DO NOT need to actually create or submit a survey form. Describe what types of questions and pre-defined question or field options could you use to more effectively generate the data in this file.

Data Validation

You can also use some of the built-in data validation options in a spreadsheet program.

Microsoft Excel
Google Sheets

Data Validation in Excel

Under Data, select the Data Validation option.

In the pop-up window, you can limit the types of values that will be valid in a particular field. You can also define a list of options or values which will show up as drop-down options for cells in that field.

Visit Microsoft Office's "Apply data validation to cells" article to learn more about data validation options.

Data Validation in Google Sheets

Google Sheets also supports some data validation operations.

Select Data validation under the Data menu tab.

You can customize the cell range for specific validation settings.

The Criteria drop-down list includes as range of options for data validation settings.

Visit Zapier's "What is data validation in Google Sheets?" article to learn more about data validation options.

Q11: Describe how you would go about using data validation to build a template for the player_birthplaces.csv file. You DO NOT need to actually create or submit a template. Describe what data validation options and pre-defined field options could you use to more effectively generate the data in this file.

Lab Notebook Questions

Link to lab notebook template (Google Doc)

Screenshots (as needed) are HIGHLY RECOMMENDED.
- Windows (Snipping Tool for folks running older versions of Windows; Snip & Sketch for folks running updated versions)
- Mac
- Google Chromebook
Tutorial for adding images/tables/drawings to a Google Doc

Q1: What questions do you have about these principles? Which ones are unclear or confusing?

Q2: Explore both tables, thinking about what fields are represented in these datasets. Describe the data fields in your own words- what types of information do they contain, and what types of data are you seeing?

Q3: Provide 3 distinct examples from the sample datasets that do not conform to tidy data principles. Include the example as well as an explanation.

Q4: Take a step back and consider the pattern errors you're seeing in these datasets. What trends do you notice? Any thoughts or ideas as to how or why these pattern errors might have occurred?

Q5: How would you address these pattern errors so the data conforms to tidy data principles? Explain what steps you would take to address at least 3 pattern errors. Each error explanation should include three parts:

an example of the error
an explanation of your method to address the error
the same example as tidy data

Q6: Compare your experience working in OpenRefine to other experiences you have had in a text editor or spreadsheet program. In what ways do you understand, perceive, or relate to the data differently through working in OpenRefine? Describe your experience cleaning this data in OpenRefine.

Q7: Describe a past experience working with a spreadsheet program. What were you trying to do? How did it go? What were some of your overall impressions about working with data in a spreadsheet program?

Q8: Compare your experience working in a spreadsheet program to your experience working in OpenRefine. In what ways do you understand, perceive, or relate to the data differently through working in Excel? Describe your experience cleaning this data in a spreadsheet program.

Q9: For the baseball datasets we have been working with in this lab, what do you think may have contributed to or caused the pattern errors we needed to address? How could these pattern errors be addressed in the data entry process?

Q10: Describe how you would go about building a survey form or template for the player_birthplaces.csv file. You DO NOT need to actually create or submit a survey form. Describe what types of questions and pre-defined question or field options could you use to more effectively generate the data in this file.

Q11: Describe how you would go about using data validation to build a template for the player_birthplaces.csv file. You DO NOT need to actually create or submit a template. Describe what data validation options and pre-defined field options could you use to more effectively generate the data in this file.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
figures		figures
QualData_TidyData_Principles.pdf		QualData_TidyData_Principles.pdf
README.md		README.md
data.zip		data.zip

kwaldenphd/tidy-data-principles

Folders and files

Latest commit

History

Repository files navigation

Tidy Data

Lab Goals

Acknowledgements

Table of Contents

Lecture & Live Coding

Lab Notebook Template

Data

Tools

Tidy Data Principles

What Are the Principles

Common Spreadsheet Errors

Dealing With Messy Data

Identify Patterns and Brainstorm Solutions

Data Wrangling Using OpenRefine

Installing and Loading Data in OpenRefine

OpenRefine's Layout

Faceting and Filtering

Exporting from OpenRefine

Data Wrangling Using A Spreadsheet Program

Microsoft Excel

Loading Data

Loading Data From CSV Files

Loading Data as an Excel Workbook

Data Cleaning in Excel

Find and Replace

Saving and Exporting in Excel

Google Sheets

Loading Data

Loading Data From CSV Files

Copying the Google Sheets Project

Data Cleaning in Google Sheets

Freezing the Header Row

Sorting

Filters

Find and Replace in Google Sheets

Saving and Exporting in Google Sheets

Additional Resources and Next Steps

Workflows for Data Entry

Survey Versus Spreadsheet

Data Validation

Data Validation in Excel

Data Validation in Google Sheets

Lab Notebook Questions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages