- Notes from lesson Data Cleaning with OpenRefine for Ecologists
- Download the data and save it to your desktop
- Download the latest stable release of OpenRefine and unzip it to a convenient location on your computer
- These data are observations from a small mammal community in southern Arizona.
- The data come from a project studying the effects of rodents and ants on the plant community that has been running for almost forty years.
- The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.
- This is a real data set that has been used in over 100 publications.
- It has been simplified for this workshop.
- OpenRefine is self-contained within its own directory
- Navigate to the executable and double click to launch it
- It will launch in your web browser
- If it does launch, copy the URL in the command-line window into a browser (
http://127.0.0.1:3333/
orhttp://localhost:3333
) - OpenRefine is a Java program that runs locally like a web page; it's not connecting to the internet
- OpenRefine will not run correctly in Internet Explorer
Questions:
- What is OpenRefine useful for?
Objectives:
- Describe OpenRefine's uses and applications.
- Differentiate data cleaning from data organization.
- Experiment with OpenRefine's user interface.
- Locate helpful resources to learn more about OpenRefine.
- OpenRefine is a powerful, free, and open source tool that helps you with data wrangling
- OpenRefine started as Google Refine before it was released to the open source community
- It combines the GUI interface of Excel with the reproducibility of scripting languages like R and Python
- OpenRefine automatically keeps a log of every change you make
- These can be added to a publication as supplemental material
- OpenRefine does not allow you to modify your original file
- You must export to a new file to save your work
- OpenRefine keeps your original data pristine
- Any operation can be undone
- OpenRefine can repeat your steps for more than one data set
- OpenRefine provides a user-friendly interface for complex clustering algorithms
- Overview a data set
- Resolve inconsistencies
- Help you split data into granular parts
- Match local data with other data sets
- Save a set of data cleaning steps for replay on multiple files
- Limited to smaller data sets (100,000 rows)
- Still great for prototyping large data sets to determine steps
Questions:
- How can we import data into OpenRefine?
- How can we sort and summarize data with OpenRefine?
- How can we find and correct errors with OpenRefine?
Objectives:
- Create a new OpenRefine project from a
.csv
file. - Look at facets and how they sort and summarize data.
- Look at clustering and how to apply it to edit groups of typos.
- Undo/redo steps.
- Split values into multiple columns.
- Remove white spaces from cells.
- OpenRefine can import many different file types
- Tab-separated (
.tsv
) - Comma-separated (
.csv
) - Excel (
.xls
,.xlsx
) - JSON (
.json
) - XML
- Google Spreadsheets
- Tab-separated (
- See the Importers documentation for more information
- Launch OpenRefine
- Click Create Project and select This Computer
- Click Browse and select
Portal_rodents_19772002_scinameUUIDs.csv
. Click Open - Click Next >>
- Look at the preview and ensure all the settings are correct
- You can adjust how OpenRefine interprets the file to ensure it is reading it correctly
- Rename your project to
portal_rodents
- click Create Project >>
- Point out key items in the user interface
- Navigating through pages
- How many rows to display
-
Faceting is a core OpenRefine operation
-
Faceting allows you to group rows and see big-picture trends
-
Faceting allows you to change rows in bulk
-
Faceting groups values according to a criterion
-
Use faceting to look for errors in the
scientificName
column
- Scroll to the
scientificName
column - Click the down arrow and choose Facet > Text facet
- The facet displays every unique value along with a count of how many times it appears
- Sort by name or count
- Hover over one of the names. Notice the edit option
- You could use this option to fix all instances of an misspelled name.
- Find
Crotalus viridis
and click edit - Add a space in front of the name and click Apply
- Point out the
(blank)
category
- Numeric (displays graphs)
- Time line (for date)
- Custom
- Scatter plot (displays graphs)
- Clustering allows you to find things that are similar to each other
- e.g.
New York
andnew york
are likely the same item, but the computer will not necessarily equate them - OpenRefine comes will built-in clustering algorithms to identify things that are like each other
- In the
scientificName
text facet, lick the Cluster button- You can change the Method and Keying Function
- Try different combinations and see what clusters emerge
- Select the key collision method and the metaphone3 keying function.
- Click the Merge? box beside each, then click Merge Selected and Recluster to apply corrections
- Try selecting methods again and see if you can find further improvements, but don't re-cluster again
- Click Close
- An in-depth introduction to clustering
- You can split columns based on a separator
- Split
scientificName
into separate columns for genus and species - Click the down arrow at the top of
scientificName
. Choose Edit Column > Split into several columns... - Replace the comma with a space in the Separator box
- Un-check the box that says Remove this column
- Click OK
- Notice you can use any separator you want
- OpenRefine lets us undo and redo steps for the entire duration of the session
- Click where it says Undo / Redo. You can see all the steps you have made.
- Click on the steps you want to go back to (
1. Mass edit 739 cells in column scientificName
in this case) - Notice that you can go forwards and backwards
- Go back before the split
- Notice you must go to the step before the change you want to undo
- OpenRefine provides tools for removing hidden white space characters at the beginning or end of words
- In
scientificName
, choose Edit cells > Tranform > Trim leading and trailing whitespace - Perform the same split operation on `scientificName that you did earlier.
- This time you only get two new columns, why?
Questions:
- How can we select only a subset of our data to work with?
- How can we sort our data?
Objectives:
- Employ text filter or include/exclude to filter a subset of rows.
- Sort tables by a column.
- Sort tables by multiple columns.
- We can filter entries to work on a subset of data in OpenRefine
- scientificName > Text filter
- Close the text facet
- Type bai and press Return
- Change the view to show fifty rows
- You can now see all entries with a scientific name that starts with "bai"
- You have two different species in this subset
- scientificName > Facet > Text facet
- You can combine filters and facets
- Filters give you a subset of your data by filtering out other data
- Facets give you a general overview of the currently selected data
- Hover over the two species and include/exclude them
- You can sort columns according to all kind of criteria by selecting
Sort...
from the columns drop down arrow - You can also specify what order
Blanks
andErrors
are sorted in the results
Exercise: Sort by month. How can you ensure that months are in order?
Ensure you are sorting by number and not alphabetically.
- If you already have a sort in place, the menu changes to remind you that you have already set a sort
- You get additional options...
- Sort > Sort... - Modify the original sort
- Sort > Reverse - Reverse the order
- Sort > Remove sort - Undo your sort
- You can sort by multiple columns by repeating this process for each column
- The order of the sort will depend on the order you set the sorts
- To restart the sorting process, check the sort by this column alone box in the Sort pop-up menu
- mo > Sort > Remove sort
Exercise: Sort by
year
,month
, andday
in that order. Try putting some in reverse order. Use mo > Sort > Remove sort. Notice how it changes the order. Remove all your sorts once your are finished.
Questions:
- How can we convert a column from one data type to another?
- How can we visualize relationships among columns?
Objectives:
- Transform a text column into a number column.
- Identify and modify non-numeric values in a column using facets.
- Use the scatter plot facet to examine relationships among columns.
- OpenRefine treats all imported values as text values by default
- We can change a column's type by using Edit cells > Common transforms
- Remove any facets and filters
- recordID > Edit cells > Common transforms... > To number
- Try transforming
scientificName
to numerical values. Did it work?
- Now that we have numeric values, we can work with numeric facets
- Numeric facets can help us uncover non-number values or blanks
- Change a few values in
recordID
to letters - Apply a numeric facet to the column
- Experiment with the slider and check boxes in this facet
- The slider is a histogram of the values in the column
- Remove the facet and undo the non-numeric changes
- The scatter plot facet allows us to examine the relationship between different numeric columns
- Change on the columns to numeric columns that should be:
recordID
,mo
,dy
,yr
,period
,plot
,decimalLatitude
,decimalLongitude
. - recordID > Facet > ScatterPlot facet
- We can see the values plotted against each of the other numeric values.
Questions:
- How can we document the data-cleaning steps we've applied to our data?
- How can we apply these steps to additional data sets?
Objectives:
- Describe how OpenRefine generates JSON code.
- Demonstrate ability to export JSON code from OpenRefine.
- Save JSON code from and analysis.
- Apply saved JSON code to an analysis.
- OpenRefine keeps track of all your steps and saves them in a format called JSON
- You can export your steps and then re-apply them to another file
- This allows you to repeat an analysis or cleaning on multiple files.
- You can even share your steps with collaborators
- Undo / Redo > Extract...
- Paste the code into a plain-text editor
- Save as plain text
- Reapply the steps to an unclean version of the data set
- Download an unclean version
- Click Open... to start a new project and import the file (name it something different)
- Undo / Redo > Apply
- Paste in the steps
- Click Perform operations
- This allows you to break data entry out to different team members (so long as they have the same headers)
Questions:
- How can we save and export our cleaned data from OpenRefine?
Objectives:
- Save an OpenRefine project.
- Export cleaned data from and OpenRefine project.
- OpenRefine saves automatically in the background
- If you close it and open it again, you can see a list of your projects
- You can export a project and share it with others
- Click Export
- Select *Export project
- A
tar.gz
file will begin downloading just like an online download
-
tar.gz
files require special software on Windows (7-zip or WinZip) -
To import: Open... > Import Project > Select your file
-
If you want to export just the data (and not the OpenRefine project): Export > Select file type > Look in
Downloads
folder