(dplyr)=
# dplyr Library

The dplyr library is part of a family of R packages called the [tidyverse](https://www.tidyverse.org/). See the [dplyr documentation](https://dplyr.tidyverse.org/reference/index.html) for a summary of all the functions in this library. See also the [Data Transformations](https://r4ds.had.co.nz/transform.html) chapter of *R for Data Science* for more explanation and examples. 

In [5]:
# You can ignore the warning about objects from other libraries being masked.
library(dplyr)

# Running the above command twice will get rid of the warning message.

<div class="admonition note">
<div class="title" style="background: lightblue; padding: 10px">Note</div>
<p>If you try to run a dplyr function before loading the library, you will get an error message stating that R could not find the function.</p>
</div>

## Example data
In this section, we will use the `ChickWeight` dataset, which is built into R. Run `?ChickWeight` in a code cell to view the documentation for this dataset.

In [None]:
# Open the documentation for the built in ChickWeight dataset.
?ChickWeight

In [1]:
# Run this code to restore ChickWeight to its built-in value.
rm(ChickWeight)

# If you see a warning message indicating that ChickWeight was not found, 
# it means that there were no saved edits to ChickWeight. 

“object 'ChickWeight' not found”
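The following sketch illustrates why `rm()` restores the dataset. Assigning to `ChickWeight` creates a copy in your workspace that hides the built-in one, and removing that copy makes the built-in version visible again. The comparison against `datasets::ChickWeight` is added here only for illustration.

```r
# Overwrite ChickWeight with a shortened copy in the workspace.
ChickWeight <- head(ChickWeight, 3)
nrow(ChickWeight)   # 3

# Remove the workspace copy. The built-in dataset becomes visible again.
rm(ChickWeight)
nrow(ChickWeight)   # 578

# datasets::ChickWeight always refers to the built-in version.
identical(ChickWeight, datasets::ChickWeight)
```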


(mutate)=
## mutate
Use the dplyr function `mutate()` to add or change columns. The basic syntax for this function is the following:

`mutate(DATA_FRAME, NEW_COLUMN = FORMULA)`

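The `mutate()` function outputs a **new** data frame in which the column NEW_COLUMN holds the values computed by FORMULA. If DATA_FRAME already has a column with that name, its values are replaced; otherwise the column is added. The original DATA_FRAME is not modified.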

In [52]:
# Convert the weight from grams to pounds.
# 1 gram is 0.00220462 pounds.
# mutate(ChickWeight, weight = weight * 0.00220462) # produces a large output

head(mutate(ChickWeight, weight = weight * 0.00220462), 3)
# We use the head() function to limit the output to the first three observations.

Unnamed: 0_level_0,weight,Time,Chick,Diet
Unnamed: 0_level_1,<dbl>,<dbl>,<ord>,<fct>
1,0.09259404,0,1,1
2,0.11243562,2,1,1
3,0.13007258,4,1,1
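The example above replaces an existing column. As a small sketch of the other use of `mutate()`, the code below adds a column instead. The column name `weight_lb` and the variable name `chicks_lb` are made up for this illustration, and `datasets::ChickWeight` is used so that the result does not depend on any edits saved in your workspace.

```r
library(dplyr)

# Add a new column and keep the original weight column.
# weight_lb is a hypothetical column name chosen for this illustration.
chicks_lb <- mutate(datasets::ChickWeight, weight_lb = weight * 0.00220462)

# The new column is appended after the existing columns.
names(chicks_lb)   # "weight" "Time" "Chick" "Diet" "weight_lb"
head(chicks_lb, 3)
```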


In [41]:
# Notice that the ChickWeight data frame is unchanged.
head(ChickWeight,3)

Unnamed: 0_level_0,weight,Time,Chick,Diet
Unnamed: 0_level_1,<dbl>,<dbl>,<ord>,<fct>
1,42,0,1,1
2,51,2,1,1
3,59,4,1,1


<div class="admonition warning">
<div class="title" style="background: pink; padding: 10px">Warning</div>
    <p>If you want to keep the updated data frame that you produced with a dplyr function, then you must <b>store the output in a variable</b>.</p>
    <code>VARIABLE  &lt;- mutate(DATA_FRAME, NEW_COLUMN = FORMULA)</code>
</div>

In [67]:
# Store the output of mutate() in the ChickWeight data frame
# Assigning a value to a variable generates no output
ChickWeight <- mutate(ChickWeight, weight = weight * 0.00220462)

# Look at the first three observations of ChickWeight
head(ChickWeight,3)

Unnamed: 0_level_0,weight,Time,Chick,Diet
Unnamed: 0_level_1,<dbl>,<dbl>,<ord>,<fct>
1,0.09259404,0,1,1
2,0.11243562,2,1,1
3,0.13007258,4,1,1


### Compare to base R

In [68]:
# Restore ChickWeight to its original state
remove(ChickWeight)

# Use base R to convert the weight from grams to pounds and store the result.
ChickWeight$weight <- ChickWeight$weight * 0.00220462

# Look at the first three observations of ChickWeight
head(ChickWeight,3)

Unnamed: 0_level_0,weight,Time,Chick,Diet
Unnamed: 0_level_1,<dbl>,<dbl>,<ord>,<fct>
1,0.09259404,0,1,1
2,0.11243562,2,1,1
3,0.13007258,4,1,1


(arrange)=
## arrange
Use the dplyr function `arrange()` to sort the observations of a data frame. The basic syntax for this function is the following:

`arrange(DATA_FRAME, COLUMN_NAME)`

By default, the data frame is sorted so that the values of the specified column are in *ascending* order. Put COLUMN_NAME inside the `desc()` function to sort in descending order.

`arrange(DATA_FRAME, desc(COLUMN_NAME))`

In [57]:
# Sort the ChickWeight data frame so that the values in the Time column are in ascending order.
arrange(ChickWeight, Time)

# This code allows us to look at the sorted data but does not save it anywhere. 

weight,Time,Chick,Diet
<dbl>,<dbl>,<ord>,<fct>
42,0,1,1
40,0,2,1
43,0,3,1
42,0,4,1
41,0,5,1
41,0,6,1
41,0,7,1
42,0,8,1
42,0,9,1
41,0,10,1


<div class="admonition warning">
<div class="title" style="background: pink; padding: 10px">Warning</div>
    <p>The sorted data frame is the output; <code>arrange()</code> does not change the original data frame. To replace the original data frame with the sorted one, use the following.</p>
    <code>DATA_FRAME  &#60;- arrange(DATA_FRAME, COLUMN_NAME)</code>
</div>

In [58]:
# Sort the ChickWeight data frame so that the values in the Time column are in descending order.
arrange(ChickWeight, desc(Time))

# This code allows us to look at the sorted data but does not save it anywhere. 

weight,Time,Chick,Diet
<dbl>,<dbl>,<ord>,<fct>
205,21,1,1
215,21,2,1
202,21,3,1
157,21,4,1
223,21,5,1
157,21,6,1
305,21,7,1
98,21,9,1
124,21,10,1
175,21,11,1


<div class="admonition note">
<div class="title" style="background: lightblue; padding: 10px">Note</div>
    <p><b>R cares about capitalization</b>. The above code would not work if <code>Time</code> were replaced with <code>time</code>.</p>
</div>

### Sorting with respect to multiple columns
You can sort with respect to multiple columns by listing multiple column names in the arrange function:

`arrange(DATA_FRAME, COLUMN_NAME1, COLUMN_NAME2, COLUMN_NAME3, etc.)`

In [2]:
# Sort with respect to both Time and weight.
# Time values dominate the sort order.
arrange(ChickWeight, Time, weight)

# This code allows us to look at the sorted data but does not save it anywhere. 



<div class="admonition note">
<div class="title" style="background: lightblue; padding: 10px">Note</div>
    <p>The order of the column names in the <code>arrange()</code> function matters. The farther left a column name is, the higher the priority its values have in the sort order. Columns farther right are only used to break ties.</p>
</div>
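To make the effect of the column order easier to see, here is a small sketch on a made-up data frame. The data frame `toy` and its values are hypothetical and chosen so that the two sort orders differ.

```r
library(dplyr)

# A small made-up data frame (hypothetical values) with ties in the Time column.
toy <- data.frame(
    Time   = c(2, 0, 2, 0),
    weight = c(41, 42, 49, 50)
)

# Time has priority; weight only breaks ties within each Time value.
by_time_then_weight <- arrange(toy, Time, weight)
by_time_then_weight$weight   # 42 50 41 49

# weight has priority; Time would only break ties between equal weights.
by_weight_then_time <- arrange(toy, weight, Time)
by_weight_then_time$Time     # 2 0 2 0
```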

In [3]:
# Sort with respect to both Time and weight.
# Weight values dominate the sort order.
arrange(ChickWeight, weight, Time)

# This code allows us to look at the sorted data but does not save it anywhere. 



(piping)=
## Piping
The dplyr library provides a useful mechanism, the pipe operator `%>%`, for passing the output of one function to another function as its first input. The general syntax for this is shown below.

`INPUT_1a %>% FUNCTION_1(INPUT_1b) %>% FUNCTION_2(INPUT_2b)`

The example of piping above is equivalent to the following.

`FUNCTION_2(FUNCTION_1(INPUT_1a, INPUT_1b), INPUT_2b)`

We could also use the following code to achieve the same output as the given example of piping. 

`INPUT_2a <- FUNCTION_1(INPUT_1a, INPUT_1b)`

`FUNCTION_2(INPUT_2a, INPUT_2b)`

In the following sections, all examples will be given both with and without piping. You can choose the method that you find easiest to understand. Either way, it is a good idea to know the basics of piping so that you will be able to understand code that uses it. 
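As a concrete sketch of the three equivalent forms, the code below finds the three heaviest observations using `arrange()` and `head()`, which were introduced earlier. The variable names are made up for this illustration, and `datasets::ChickWeight` is used so that the result does not depend on any edits saved in your workspace.

```r
library(dplyr)

# Nested call: read from the inside out.
nested <- head(arrange(datasets::ChickWeight, desc(weight)), 3)

# Intermediate variable: one step per line.
sorted <- arrange(datasets::ChickWeight, desc(weight))
stepwise <- head(sorted, 3)

# Piping: read from left to right.
piped <- datasets::ChickWeight %>%
    arrange(desc(weight)) %>%
    head(3)

# All three approaches produce the same data frame.
identical(nested, piped)     # TRUE
identical(stepwise, piped)   # TRUE
```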

## filter
The `filter()` function allows you to extract the observations in a data frame that meet specified criteria. The general syntax is given below.

`filter(DATA_FRAME, CRITERIA)`

### Criteria

Use `==` to extract observations that have a particular value in a given column.
`COLUMN_NAME == VALUE`

Use inequalities (`>`, `<`, `<=`,`>=`) to specify upper or lower bounds for the values in a given column.
`COLUMN_NAME >= BOUND`

<div class="admonition warning">
<div class="title" style="background: pink; padding: 10px">Warning</div>
    <p>If the value you are using in the filter is text, then it must be in quote marks. Consider the following example. </p>
    <code> filter(students, First_Name == "Elisabeth") </code>
</div>
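The two kinds of criteria can be tried out on a small made-up data frame. The `students` data frame below, including its names and scores, is hypothetical and only serves to illustrate the syntax.

```r
library(dplyr)

# A small made-up data frame; the names and scores are hypothetical.
students <- data.frame(
    First_Name = c("Elisabeth", "Omar", "Elisabeth", "Mei"),
    Score      = c(88, 92, 75, 60)
)

# Text values must be in quote marks.
elisabeths <- filter(students, First_Name == "Elisabeth")
nrow(elisabeths)    # 2

# Numeric bounds are written without quote marks.
high_scores <- filter(students, Score >= 80)
high_scores$First_Name   # "Elisabeth" "Omar"
```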

### Example: Weight of chicks on the last day
Suppose we were only interested in the weight of the chicks on the last day of the study. We can use the following code to create a data frame that only contains the observations that were made on the last day. 

In [40]:
# without piping
ChickWeight_last_day <- filter(ChickWeight, Time == max(Time))
head(ChickWeight_last_day, 3)

Unnamed: 0_level_0,weight,Time,Chick,Diet
Unnamed: 0_level_1,<dbl>,<dbl>,<ord>,<fct>
1,205,21,1,1
2,215,21,2,1
3,202,21,3,1


In [41]:
# with piping
ChickWeight_last_day <- ChickWeight %>% 
    filter(Time == max(Time))
head(ChickWeight_last_day, 3)

Unnamed: 0_level_0,weight,Time,Chick,Diet
Unnamed: 0_level_1,<dbl>,<dbl>,<ord>,<fct>
1,205,21,1,1
2,215,21,2,1
3,202,21,3,1


### Filtering by multiple criteria
We can combine multiple criteria using `&` for "and" as well as `|` for "or". Commas also function as "and" inside `filter()`.

<div class="admonition warning" name="html-admonition">
<div class="title" style="background: pink; padding: 10px">Warning</div>
    <p>Criteria connected by <code>&amp;</code>, <code>|</code> and commas must each be complete on their own. The following code will <b>not</b> work.</p>
    <code>filter(weather_data, Temp &lt; 60 | &gt; 90) </code>
    <p>Instead, use this.</p>
    <code>filter(weather_data, Temp &lt; 60 | Temp &gt; 90) </code>
</div>

In [42]:
# Weights on day 21 of the chicks that were fed diet 2
# Strategy 1: Use & for "and"
filter(ChickWeight, Time == max(Time) & Diet == 2)

weight,Time,Chick,Diet
<dbl>,<dbl>,<ord>,<fct>
331,21,21,2
167,21,22,2
175,21,23,2
74,21,24,2
265,21,25,2
251,21,26,2
192,21,27,2
233,21,28,2
309,21,29,2
150,21,30,2


In [43]:
# Weights on day 21 of the chicks that were fed diet 2
# Strategy 2: Use commas for "and"
filter(ChickWeight, Time == max(Time), Diet == 2)

weight,Time,Chick,Diet
<dbl>,<dbl>,<ord>,<fct>
331,21,21,2
167,21,22,2
175,21,23,2
74,21,24,2
265,21,25,2
251,21,26,2
192,21,27,2
233,21,28,2
309,21,29,2
150,21,30,2
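The examples above only use "and". As a brief sketch of "or", the code below keeps the last-day observations of chicks that were fed diet 2 or diet 3. The variable name `diet_2_or_3` is made up for this illustration, and `datasets::ChickWeight` is used so that the result does not depend on any edits saved in your workspace.

```r
library(dplyr)

# The comma acts as "and"; the | inside the second criterion acts as "or".
# Each side of | is a complete criterion of its own.
diet_2_or_3 <- filter(
    datasets::ChickWeight,
    Time == max(Time),
    Diet == 2 | Diet == 3
)

# Count the observations that were kept for each diet.
table(diet_2_or_3$Diet)
nrow(diet_2_or_3)   # 20
```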


## select

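Use the `select()` function to keep only the specified columns of a data frame. The general syntax is given below.

`select(DATA_FRAME, COLUMN_1, COLUMN_2)`

Alternatively, you can use `-` to mark columns that you wish to exclude.

`select(DATA_FRAME, -COLUMN_1, -COLUMN_2)`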
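A short sketch of both forms is shown below. The variable names are made up for this illustration, and `datasets::ChickWeight` is used so that the result does not depend on any edits saved in your workspace.

```r
library(dplyr)

# Keep only the Chick and weight columns, in the order listed.
chick_and_weight <- select(datasets::ChickWeight, Chick, weight)
names(chick_and_weight)   # "Chick" "weight"

# Keep every column except Diet and Time.
no_diet_time <- select(datasets::ChickWeight, -Diet, -Time)
names(no_diet_time)       # "weight" "Chick"
```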

### Example: Remove the time column from `ChickWeight_last_day`
In the previous section, we created the `ChickWeight_last_day` data frame, which contains only the observations made on the last day. The Time column of this filtered data set is not useful, because all of its entries are 21. The code below shows how we can use `select()` after filtering to remove the Time column.

In [30]:
# without piping
ChickWeight_last_day <- filter(ChickWeight, Time == 21)
ChickWeight_last_day <- select(ChickWeight_last_day, -Time)
head(ChickWeight_last_day, 3)

Unnamed: 0_level_0,weight,Chick,Diet
Unnamed: 0_level_1,<dbl>,<ord>,<fct>
1,205,1,1
2,215,2,1
3,202,3,1


In [31]:
# with piping
ChickWeight_last_day <- ChickWeight %>% 
    filter(Time == 21) %>% 
    select(-Time)
head(ChickWeight_last_day, 3)

Unnamed: 0_level_0,weight,Chick,Diet
Unnamed: 0_level_1,<dbl>,<ord>,<fct>
1,205,1,1
2,215,2,1
3,202,3,1


### Example: Renaming a column
Notice that `weight` is the only column in the `ChickWeight` data frame that is not capitalized. We can combine `select()` and `mutate()` to rename the weight column. 

Note that this can also be done using the `rename()` function from the dplyr library as shown at the end of this example. 

In [2]:
# without piping
ChickWeight <- mutate(ChickWeight, Weight = weight)
ChickWeight <- select(ChickWeight, -weight)



In [3]:
# Restore ChickWeight to its built-in value.
rm(ChickWeight)

“object 'ChickWeight' not found”


In [None]:
# with piping
ChickWeight <- ChickWeight %>%
    mutate(Weight = weight) %>%
    select(-weight)

In [3]:
# Restore ChickWeight to its built-in value.
rm(ChickWeight)

“object 'ChickWeight' not found”


In [None]:
# using rename()
ChickWeight <- rename(ChickWeight, Weight = weight)
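The two approaches give the same column names but not the same column order, which the following sketch makes visible. The variable names `via_mutate` and `via_rename` are made up for this illustration, and `datasets::ChickWeight` is used so that the result does not depend on any edits saved in your workspace.

```r
library(dplyr)

# mutate() appends Weight after the existing columns; select() then drops weight.
via_mutate <- datasets::ChickWeight %>%
    mutate(Weight = weight) %>%
    select(-weight)
names(via_mutate)   # "Time" "Chick" "Diet" "Weight"

# rename() keeps the column in its original position.
via_rename <- rename(datasets::ChickWeight, Weight = weight)
names(via_rename)   # "Weight" "Time" "Chick" "Diet"
```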

## group_by and summarize

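Use `group_by()` to split the observations of a data frame into groups according to the values in a column. Then use `summarize()` to compute one row of summary values for each group. The general syntax is given below.

`group_by(DATA_FRAME, COLUMN_NAME)`

`summarize(GROUPED_DATA_FRAME, COLUMN_NAME = FORMULA)`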

### Example: Average weight of each diet group on last day
Suppose we wanted to know both how many chicks were in each diet group and what the average weight of each group was on day 21. We can do this by combining `filter()`, `group_by()` and `summarize()` as shown below.

In [4]:
# Remove any existing grouping.
ChickWeight <- ungroup(ChickWeight)

In [28]:
# without piping
ChickWeight_last_day <- filter(ChickWeight, Time == 21)
ChickWeight_last_day <- group_by(ChickWeight_last_day, Diet)
summarize(ChickWeight_last_day, Count = n(), Avg_Weight = mean(Weight))

Diet,Count,Avg_Weight
<fct>,<int>,<dbl>
1,16,177.75
2,10,214.7
3,10,270.3
4,9,238.5556
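`summarize()` can compute several summary columns at once. The sketch below adds a maximum to the count and average. The names `summary_by_diet` and `Max_Weight` are made up for this illustration. Because it reads from `datasets::ChickWeight`, the unmodified built-in copy, the column is called `weight` rather than `Weight`.

```r
library(dplyr)

# datasets::ChickWeight is the built-in copy, so the column name is weight.
summary_by_diet <- datasets::ChickWeight %>%
    filter(Time == 21) %>%
    group_by(Diet) %>%
    summarize(
        Count      = n(),          # number of chicks in the group
        Avg_Weight = mean(weight), # average weight in grams
        Max_Weight = max(weight)   # heaviest chick in the group
    )

summary_by_diet
```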


In [1]:
# With piping
ChickWeight %>% 
    filter(Time == 21) %>%
    group_by(Diet) %>%
    summarize(Count = n(), Avg_Weight = mean(Weight))

