Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
267 lines (189 sloc) 13.7 KB
- Class: meta
Course: tidyswirl
Lesson: tidyr
Author: Matt Dray
Type: Standard
Version: 2.4.3
# Swirl info reminder
- Class: text
Output: Reminder -- you can type info() at any time to see your options. For example, 'bye()' will exit Swirl.
# Intro ---
# Single sentence purpose
- Class: text
Output: The {tidyr} package by Hadley Wickham and Lionel Henry helps you organise your data frame to make it tidy.
# Short definition/expansion
- Class: text
Output: Tabular data is 'tidy' if there's one variable per column and one row per observation. Tidyverse packages prefer tidy data for simplicity and consistency.
# Tidyverse context
- Class: text
Output: Where is {tidyr} used in a tidyverse workflow? You're likely to use {tidyr} after you've read in your data with {readr} or {tibble} so you can prepare it for further manipulation with {dplyr}.
# Package installation
- Class: cmd_question
Output: You first need to install {tidyr} (version 1.0.0 or greater) into your computer's package library. Run install.packages("tidyr") to install it, or skip() if you've installed version 1.0.0 or greater before.
CorrectAnswer: if (!require("tidyr")) install.packages("tidyr")
AnswerTests: any_of_exprs('install.packages("tidyr")', 'install.packages(pkgs = "tidyr")', 'if (!require("tidyr")) install.packages("tidyr")', 'skip()')
Hint: Remember to type quotation marks around "tidyr" when you pass it to the function.
- Class: cmd_question
Output: Access {tidyr}'s functions by calling it from your computer's package library. Run library(tidyr).
CorrectAnswer: library(tidyr)
AnswerTests: any_of_exprs('library("tidyr")', 'library(tidyr)', 'library(package = "tidyr")')
Hint: Type the package name between the brackets of library(). You don't have to use quotation marks around the package name.
- Class: text
Output: Great, the {tidyr} package is installed and loaded. Now we're going to look at some of its functions.
# Group: reshaping data ---
- Class: text
Output: We'll start with two functions -- pivot_longer() and pivot_wider() -- for reshaping a data frame into a tidy format.
# Function: pivot_longer()
- Class: text
Output: pivot_longer() helps us reshape a dataset from a 'wide' format to a 'long' format.
- Class: text
Output: Wide data has multiple columns whose names are actually data too. For example, a data frame might have separate columns for each year of data. This is untidy -- each row would contain more than one observation (we would have as many observations as there are years) and each column is not a distinct variable (they're all years).
- Class: cmd_question
Output: The {tidyr} package has a built-in dataset that's wide and untidy. Type 'table4a' and hit enter to see it.
CorrectAnswer: table4a
AnswerTests: omnitest(correctVal = 'table4a')
Hint: Type 'table4a' (without quotation marks) and hit enter.
- Class: text
Output: Did you notice separate columns for each year that an observation was made (1999 and 2000)? We should gather these column names into a new column (the 'names_to' argument) and put their corresponding values into a new column too (the 'values_to' argument).
- Class: text
Output: pivot_longer() needs four things -- (1) the data frame name (table4a) (2) a vector of the columns to be gathered (c(`1999`, `2000`)), (3) a name for the new column we're gathering into ("year") and (4) a name for the new column of corresponding values ("cases"). Note that we quote the new column names.
- Class: cmd_question
Output: Use the pivot_longer() function to make table4a tidy.
CorrectAnswer: pivot_longer(table4a, c(`1999`, `2000`), "year", "cases")
AnswerTests: any_of_exprs('pivot_longer(table4a, c(`1999`, `2000`), "year", "cases")', 'pivot_longer(table4a, cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")')
Hint: Use the form pivot_longer(data = dataframe, cols = c(col1, col2), names_to = "new_col_1", values_to = "new_col_2")
- Class: text
Output: Excellent, you used pivot_longer() to help tidy the data from wide to long format.
# Function: pivot_wider()
- Class: text
Output: The second {tidyr} function for reshaping data is pivot_wider(). It does the opposite to pivot_longer() -- it helps us reshape from a 'long' to a 'wide' format.
- Class: text
Output: For example, we might have multiple rows for the same observation. For each country-year combination we might have a column with two different types of data in it that should really be in separate columns. Let's take a look at an example.
- Class: cmd_question
Output: The {tidyr} package has a built-in dataset that's long and untidy. Type 'table2' and hit enter to see it.
CorrectAnswer: table2
AnswerTests: omnitest(correctVal = 'table2')
Hint: Type table2 (without quotation marks) and hit enter.
- Class: text
Output: Did you notice the 'type' column and its corresponding 'count' column contain two types of data ('cases' and 'population')? We need to widen the unique values of 'type' (the argument 'names_from') into separate columns filled with the 'count' data (the 'values_from' argument).
- Class: text
Output: To do this, you need to give pivot_wider() three things -- (1) the name of the data frame (table2), (2) the name of the column (type) whose unique values will become new columns, and (3) the name of the column (count) that has the data that will fill the new columns.
- Class: cmd_question
Output: Use the pivot_wider() function to make table2 tidy.
CorrectAnswer: pivot_wider(table2, names_from = type, values_from = count)
AnswerTests: any_of_exprs('pivot_wider(table2, type, count)', 'pivot_wider(data = table2, names_from = type, values_from = count)')
Hint: Use the form pivot_wider(data = dataframe, names_from = names_col, values_from = values_col).
- Class: text
Output: Excellent, pivot_wider() helped tidy the data from long to wide format.
# Test: group
- Class: mult_question
Output: Which of these are always properties of 'tidy' data? (A) There's one row per observation, (B) there's one column per variable (C) long data is always tidier than wide data, (D) wide data is always tidier than long data.
AnswerChoices: All of them; (A) and (B); (A), (B) and (C); (A), (B) and (D); (A) only
CorrectAnswer: (A) and (B)
AnswerTests: omnitest(correctVal='(A) and (B)')
Hint: pivot_longer() makes wide data long and pivot_wider() does the opposite. Both create the right number of columns to allow for a single observation per row.
- Class: text
Output: Okay, we're done with reshaping for now. Let's take a look at some simpler functions from {tidyr} that will help prepare data for further analysis.
# Group: deal with missing values ---
- Class: text
Output: Often your dataset will have missing values. How can you treat these?
- Class: cmd_question
Output: I've added a table with NAs (missing values) to your environment. Run 'table_na' to print the table.
CorrectAnswer: table_na
AnswerTests: omnitest(correctVal = 'table_na')
Hint: Type 'table_na' and press enter to see the table in the console.
- Class: text
Output: The table has two columns and five rows. Both columns have a mixture of data and missing values (NA).
# Function: drop_na()
- Class: text
Output: drop_na() lets you discard rows that contain an NA by supplying (1) the name of the data frame and (2) the column containing NA values.
- Class: cmd_question
Output: Use drop_na() to remove the rows of table_na that contain an NA in the x2 column.
CorrectAnswer: drop_na(table_na, x2)
AnswerTests: omnitest(correctVal = 'drop_na(table_na, x2)')
Hint: Type 'table_na' and 'x2' into the brackets of drop_na() separated by a comma.
- Class: text
Output: So the rows that contained an NA in the x2 column are gone. We only have four rows now.
# Function: replace_na()
- Class: text
Output: Sometimes NAs can be safely replaced with a value. replace_na() lets you do this. You supply (1) the name of the data frame and (2) a list containing the column to convert and value to replace NAs with, in the form list(col_name = "replacement_value").
- Class: cmd_question
Output: Use replace_na() to convert the NA values in the x2 column of table_na to a zero.
CorrectAnswer: replace_na(table_na, list(x2 = 0))
AnswerTests: omnitest(correctVal = 'replace_na(table_na, list(x2 = 0))')
Hint: Supply table_na as the first argument to replace_na() and list(x2 = 0) as the second argument.
- Class: text
Output: Each instance of NA in the x2 column of table_na has been replaced with a zero.
# Function: fill()
- Class: text
Output: Sometimes only the first row of a group of data will be labelled in a column (often true if reading data from a spreadsheet). For example, column x1 of table_na has two groups -- A and B -- but only the first row of each is labelled. We want every row to contain this information, not just the first one.
- Class: text
Output: You can solve this with the fill() function, which replaces NAs in a column with the the closest valid value above it.
- Class: cmd_question
Output: Use fill() to 'fill down' the x1 column and replace the NAs.
CorrectAnswer: fill(table_na, x1)
AnswerTests: omnitest(correctVal = 'fill(table_na, x1)')
Hint: Pass table_na as the first argument to fill() and x1 as the second argument.
- Class: text
Output: The NAs in rows 2 and 3 of column x2 have been replaced with 'A' and the NAs in rows 5 and 6 are now a 'B'.
# Test: group
# Group: splitting data ---
- Class: text
Output: Sometimes multiple data items are stored in the same 'cell' of a data frame, which is untidy. {tidyr} can help split these items into new columns or rows.
- Class: cmd_question
Output: Run table3 to see an example of a table that needs separation.
CorrectAnswer: table3
AnswerTests: omnitest(correctVal = 'table3')
Hint: Type 'table3' and press enter to see the table in the console.
- Class: text
Output: The 'rate' column contains two data items with a forward slash ('/') between them, like '745/19987071' in the first row. These are two distinct bits of data that need to be separated.
# Function: separate()
- Class: text
Output: The separate() function splits data by some delimiter and puts each part into its own column. Supply (1) the data frame name, (2) the column that needs separating, (3) the delimiter, (4) a vector of column names to received the separated data.
- Class: cmd_question
Output: Use separate() on table3 to split the 'rate' column by '/' into new columns named 'cases' and 'pop'.
CorrectAnswer: separate(table3, rate, sep = "/", into = c("cases", "pop"))
AnswerTests: any_of_exprs('separate(table3, rate, into = c("cases", "pop"))', 'separate(table3, rate, "/", into = c("cases", "pop"))', 'separate(table3, rate, sep = "/", into = c("cases", "pop"))')
Hint: Use the form separate(dataframe, col_to_separate, sep = "/", into = c("new_col_1", "new_col_2"))
- Class: text
Output: See how the values in the 'rate' column of the original dataset are now separated into the 'cases' and 'pop' columns? For example, '745/19987071' is now separated into '745' and '19987071' across two columns.
# Function: separate_rows()
- Class: text
Output: The separate_rows() function is like separate(), but the separated data are inserted into new rows rather than new columns. This means you only need to supply (1) the data frame name, (2) the name of the column to be separated into rows and (3) in the form sep = "/".
- Class: cmd_question
Output: Use separate_rows() on table3 to split the 'rate' column by '/' into new rows.
CorrectAnswer: separate_rows(table3, rate, sep = "/")
AnswerTests: omnitest(correctVal = 'separate_rows(table3, rate, sep = "/")')
Hint: Your answer should be in the form separate_rows(dataframe, col_to_separate, sep = "/")
- Class: text
Output: So now the values in the rate column have been separated into different rows rather than different columns. For example, '745/19987071' has now been separated into '745' and '19987071' across two rows.
# Function: unite()
- Class: text
Output: We can also unite() columns, which is the opposite of separating them. You pass (1) the name of the data frame, (2) the columns from which data will be united, (3) the delimiter, (4) a name for the column that will hold the newly-united data.
- Class: cmd_question
Output: Use unite() on table5 to combine the 'century' and 'year' columns with a blank delimiter ("") into a new column ("full_year").
CorrectAnswer: unite(table5, century, year, sep = "", col = "full_year")
AnswerTests: omnitest(correctVal = 'unite(table5, century, year, sep = "", col = "full_year")')
Hint: Your answer should be in the form unite(dataframe, col_to_unite_1, col_to_unite_2, sep = "/", col = "new_col")
- Class: text
Output: The values from the 'century' and 'year' column have been united -- '19' and '99' in the first row are now joined as '1999' in the 'full_year' column.
# Test: group
# Group: List-columns ---
# Function: nest()
# Function: unnest()
# Test: group
# Group: expand tables ---
# Function: complete()
# Function: expand()
# Test: group
# Test knowledge of package ---
# Outro: what have we learnt? ---
- Class: text
Output: So what did we learn in this lesson? We've seen how {tidyr} can be used to tidy a data frame by (1) reshaping it with pivot_longer() and pivot_wider(), (2) deal with missing values using drop_na(), replace_na() and fill(), and (3) split and join columns and rows with separate() and unite().
# Outro: links
- Class: text
Output: The tidyverse webpage for the {tidyr} package is https://tidyr.tidyverse.org/
- Class: text
Output: Get the data import cheatsheet from RStudio at https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf
- Class: text
Output: Read about tidy data from the Journal of Statistical Software https://www.jstatsoft.org/article/view/v059i10
You can’t perform that action at this time.