Data Cleaning with OpenRefine

Notes from lesson Data Cleaning with OpenRefine for Ecologists

Useful Links

Clustering https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

Setup

Download the data and save it to your desktop
Download the latest stable release of OpenRefine and unzip it to a convenient location on your computer

Full teaching database

These data are observations from a small mammal community in southern Arizona.

The data come from a project studying the effects of rodents and ants on the plant community that has been running for almost forty years.

The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.

This is a real data set that has been used in over 100 publications.

It has been simplified for this workshop.

Reminders

OpenRefine is self-contained within its own directory
Navigate to the executable and double click to launch it
It will launch in your web browser
If it does launch, copy the URL in the command-line window into a browser (http://127.0.0.1:3333/ or http://localhost:3333)
OpenRefine is a Java program that runs locally like a web page; it's not connecting to the internet
OpenRefine will not run correctly in Internet Explorer

Introduction

Questions:

What is OpenRefine useful for?

Objectives:

Describe OpenRefine's uses and applications.
Differentiate data cleaning from data organization.
Experiment with OpenRefine's user interface.
Locate helpful resources to learn more about OpenRefine.

What is OpenRefine

OpenRefine is a powerful, free, and open source tool that helps you with data wrangling
OpenRefine started as Google Refine before it was released to the open source community
It combines the GUI interface of Excel with the reproducibility of scripting languages like R and Python

Benefits of OpenRefine

OpenRefine automatically keeps a log of every change you make
- These can be added to a publication as supplemental material
OpenRefine does not allow you to modify your original file
- You must export to a new file to save your work
- OpenRefine keeps your original data pristine
Any operation can be undone
OpenRefine can repeat your steps for more than one data set
OpenRefine provides a user-friendly interface for complex clustering algorithms

Uses of OpenRefine

Overview a data set
Resolve inconsistencies
Help you split data into granular parts
Match local data with other data sets
Save a set of data cleaning steps for replay on multiple files
Limited to smaller data sets (100,000 rows)
- Still great for prototyping large data sets to determine steps

Working with OpenRefine

Questions:

How can we import data into OpenRefine?
How can we sort and summarize data with OpenRefine?
How can we find and correct errors with OpenRefine?

Objectives:

Create a new OpenRefine project from a .csv file.
Look at facets and how they sort and summarize data.
Look at clustering and how to apply it to edit groups of typos.
Undo/redo steps.
Split values into multiple columns.
Remove white spaces from cells.

Creating a Project

OpenRefine can import many different file types
- Tab-separated (.tsv)
- Comma-separated (.csv)
- Excel (.xls, .xlsx)
- JSON (.json)
- XML
- Google Spreadsheets
See the Importers documentation for more information

Launch OpenRefine
Click Create Project and select This Computer
Click Browse and select Portal_rodents_19772002_scinameUUIDs.csv. Click Open
Click Next >>
Look at the preview and ensure all the settings are correct
- You can adjust how OpenRefine interprets the file to ensure it is reading it correctly
Rename your project to portal_rodents
click Create Project >>

Point out key items in the user interface
- Navigating through pages
- How many rows to display

Faceting

Faceting is a core OpenRefine operation
Faceting allows you to group rows and see big-picture trends
Faceting allows you to change rows in bulk
Faceting groups values according to a criterion
Use faceting to look for errors in the scientificName column

Scroll to the scientificName column
Click the down arrow and choose Facet > Text facet
1. The facet displays every unique value along with a count of how many times it appears
2. Sort by name or count
Hover over one of the names. Notice the edit option
1. You could use this option to fix all instances of an misspelled name.
2. Find Crotalus viridis and click edit
3. Add a space in front of the name and click Apply
4. Point out the (blank) category

Other available facets

Numeric (displays graphs)
Time line (for date)
Custom
Scatter plot (displays graphs)

Clustering

Clustering allows you to find things that are similar to each other
e.g. New York and new york are likely the same item, but the computer will not necessarily equate them
OpenRefine comes will built-in clustering algorithms to identify things that are like each other

In the scientificName text facet, lick the Cluster button
1. You can change the Method and Keying Function
Try different combinations and see what clusters emerge
Select the key collision method and the metaphone3 keying function.
Click the Merge? box beside each, then click Merge Selected and Recluster to apply corrections
Try selecting methods again and see if you can find further improvements, but don't re-cluster again
Click Close

An in-depth introduction to clustering

Split

You can split columns based on a separator

Split scientificName into separate columns for genus and species
Click the down arrow at the top of scientificName. Choose Edit Column > Split into several columns...
Replace the comma with a space in the Separator box
Un-check the box that says Remove this column
Click OK
Notice you can use any separator you want

Undo/Redo

OpenRefine lets us undo and redo steps for the entire duration of the session

Click where it says Undo / Redo. You can see all the steps you have made.
Click on the steps you want to go back to (1. Mass edit 739 cells in column scientificName in this case)
Notice that you can go forwards and backwards
Go back before the split
Notice you must go to the step before the change you want to undo

Trim leading and trailing white space

OpenRefine provides tools for removing hidden white space characters at the beginning or end of words

In scientificName, choose Edit cells > Tranform > Trim leading and trailing whitespace
Perform the same split operation on `scientificName that you did earlier.
This time you only get two new columns, why?

Filtering and Sorting with OpenRefine

Questions:

How can we select only a subset of our data to work with?
How can we sort our data?

Objectives:

Employ text filter or include/exclude to filter a subset of rows.
Sort tables by a column.
Sort tables by multiple columns.

Filtering

We can filter entries to work on a subset of data in OpenRefine

scientificName > Text filter
Close the text facet
Type bai and press Return
Change the view to show fifty rows
You can now see all entries with a scientific name that starts with "bai"

Excluding Entries

You have two different species in this subset
scientificName > Facet > Text facet
You can combine filters and facets
Filters give you a subset of your data by filtering out other data
Facets give you a general overview of the currently selected data
Hover over the two species and include/exclude them

Sort

You can sort columns according to all kind of criteria by selecting Sort... from the columns drop down arrow
You can also specify what order Blanks and Errors are sorted in the results

Exercise: Sort by month. How can you ensure that months are in order?

Ensure you are sorting by number and not alphabetically.

If you already have a sort in place, the menu changes to remind you that you have already set a sort
You get additional options...
- Sort > Sort... - Modify the original sort
- Sort > Reverse - Reverse the order
- Sort > Remove sort - Undo your sort

Sorting by multiple columns

You can sort by multiple columns by repeating this process for each column
The order of the sort will depend on the order you set the sorts
To restart the sorting process, check the sort by this column alone box in the Sort pop-up menu

mo > Sort > Remove sort

Exercise: Sort by year, month, and day in that order. Try putting some in reverse order. Use mo > Sort > Remove sort. Notice how it changes the order. Remove all your sorts once your are finished.

Examining numbers in OpenRefine

Questions:

How can we convert a column from one data type to another?
How can we visualize relationships among columns?

Objectives:

Transform a text column into a number column.
Identify and modify non-numeric values in a column using facets.
Use the scatter plot facet to examine relationships among columns.

Numbers

OpenRefine treats all imported values as text values by default
We can change a column's type by using Edit cells > Common transforms

Remove any facets and filters
recordID > Edit cells > Common transforms... > To number
Try transforming scientificName to numerical values. Did it work?

Numeric facet

Now that we have numeric values, we can work with numeric facets
Numeric facets can help us uncover non-number values or blanks

Change a few values in recordID to letters
Apply a numeric facet to the column
Experiment with the slider and check boxes in this facet
1. The slider is a histogram of the values in the column
Remove the facet and undo the non-numeric changes

Scatter plot facet

The scatter plot facet allows us to examine the relationship between different numeric columns

Change on the columns to numeric columns that should be: recordID, mo, dy, yr, period, plot, decimalLatitude, decimalLongitude.
recordID > Facet > ScatterPlot facet
We can see the values plotted against each of the other numeric values.

Scripts from OpenRefine

Questions:

How can we document the data-cleaning steps we've applied to our data?
How can we apply these steps to additional data sets?

Objectives:

Describe how OpenRefine generates JSON code.
Demonstrate ability to export JSON code from OpenRefine.
Save JSON code from and analysis.
Apply saved JSON code to an analysis.

Scripts

OpenRefine keeps track of all your steps and saves them in a format called JSON
You can export your steps and then re-apply them to another file
This allows you to repeat an analysis or cleaning on multiple files.
You can even share your steps with collaborators

Undo / Redo > Extract...
Paste the code into a plain-text editor
Save as plain text

Reapply the steps to an unclean version of the data set

Download an unclean version
Click Open... to start a new project and import the file (name it something different)
Undo / Redo > Apply
Paste in the steps
Click Perform operations

This allows you to break data entry out to different team members (so long as they have the same headers)

Exporting and saving data from OpenRefine

Questions:

How can we save and export our cleaned data from OpenRefine?

Objectives:

Save an OpenRefine project.
Export cleaned data from and OpenRefine project.

Saving and Exporting

OpenRefine saves automatically in the background
If you close it and open it again, you can see a list of your projects
You can export a project and share it with others

Click Export
Select *Export project
A tar.gz file will begin downloading just like an online download

tar.gz files require special software on Windows (7-zip or WinZip)
To import: Open... > Import Project > Select your file
If you want to export just the data (and not the OpenRefine project): Export > Select file type > Look in Downloads folder

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

openRefine.md

openRefine.md

Data Cleaning with OpenRefine

Useful Links

Setup

Reminders

Introduction

What is OpenRefine

Benefits of OpenRefine

Uses of OpenRefine

Working with OpenRefine

Creating a Project

Faceting

Other available facets

Clustering

Split

Undo/Redo

Trim leading and trailing white space

Filtering and Sorting with OpenRefine

Filtering

Excluding Entries

Sort

Sorting by multiple columns

Examining numbers in OpenRefine

Numbers

Numeric facet

Scatter plot facet

Scripts from OpenRefine

Scripts

Exporting and saving data from OpenRefine

Saving and Exporting

Files

openRefine.md

Latest commit

History

openRefine.md

File metadata and controls

Data Cleaning with OpenRefine

Useful Links

Setup

Reminders

Introduction

What is OpenRefine

Benefits of OpenRefine

Uses of OpenRefine

Working with OpenRefine

Creating a Project

Faceting

Other available facets

Clustering

Split

Undo/Redo

Trim leading and trailing white space

Filtering and Sorting with OpenRefine

Filtering

Excluding Entries

Sort

Sorting by multiple columns

Examining numbers in OpenRefine

Numbers

Numeric facet

Scatter plot facet

Scripts from OpenRefine

Scripts

Exporting and saving data from OpenRefine

Saving and Exporting