# dplyr Library

The dplyr library is part of a family of R packages called the [tidyverse](https://www.tidyverse.org/). See the [dplyr documentation](https://dplyr.tidyverse.org/reference/index.html) for a summary of all the functions in this library. See also the [Data Transformations](https://r4ds.had.co.nz/transform.html) chapter of *R for Data Science* for more explanation and examples. 

In [16]:
# You can ignore the warning about objects from other libraries being masked.
library(dplyr)

In [17]:
# The warn.conflicts = FALSE option suppresses the warning about objects from other libraries being masked.
library(dplyr, warn.conflicts = FALSE)

<div class="admonition note" name="html-admonition">
<div class="title" style="background: lightblue; padding: 10px">Note</div>
<p>If you try to execute a dplyr function before loading the library, you will get an error message stating that R could not find the function.</p>
</div>

## Example data
In this section, we will use the `ChickWeight` dataset, which is built into R. Execute `?ChickWeight` in a code cell to view the documentation for this dataset.

In [None]:
# Open the documentation for the built-in ChickWeight dataset.
?ChickWeight

In [51]:
# Execute this code to restore ChickWeight to its original state.
# Removing your modified copy makes R fall back to the built-in dataset.
remove(ChickWeight)

## mutate
Use the dplyr function `mutate()` to add or change columns. The basic syntax for this function is the following:

`mutate(DATA_FRAME, NEW_COLUMN = FORMULA)`

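The `mutate()` function returns a **new** data frame in which the column NEW_COLUMN holds the values computed by FORMULA from the columns of DATA_FRAME. If DATA_FRAME already has a column named NEW_COLUMN, that column is overwritten in the output; otherwise, a new column is added.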

In [52]:
# Convert the weight from grams to pounds.
# 1 gram is 0.00220462 pounds.
# mutate(ChickWeight, weight = weight * 0.00220462) # produces a large output

head(mutate(ChickWeight, weight = weight * 0.00220462), 3)
# We use the head() function to limit the output to the first three observations.

| | weight | Time | Chick | Diet |
|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 1 | 0.09259404 | 0 | 1 | 1 |
| 2 | 0.11243562 | 2 | 1 | 1 |
| 3 | 0.13007258 | 4 | 1 | 1 |


In [41]:
# Notice that the ChickWeight data frame is unchanged.
head(ChickWeight,3)

| | weight | Time | Chick | Diet |
|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 1 | 42 | 0 | 1 | 1 |
| 2 | 51 | 2 | 1 | 1 |
| 3 | 59 | 4 | 1 | 1 |


<div class="admonition warning" name="html-admonition">
<div class="title" style="background: pink; padding: 10px">Warning</div>
    <p>If you want to keep the updated data frame that you produced with a dplyr function, then you must <b>store the output in a variable</b>.</p>
    <code>VARIABLE  &#60;- mutate(DATA_FRAME, NEW_COLUMN = FORMULA)</code>
</div>

In [47]:
# Store the output of mutate() in the ChickWeight data frame
# Assigning a value to a variable generates no output
ChickWeight <- mutate(ChickWeight, weight = weight * 0.00220462)

# Look at the first three observations of ChickWeight
head(ChickWeight,3)

| | weight | Time | Chick | Diet |
|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 1 | 0.09259404 | 0 | 1 | 1 |
| 2 | 0.11243562 | 2 | 1 | 1 |
| 3 | 0.13007258 | 4 | 1 | 1 |
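
The examples above overwrite the existing `weight` column. Since `mutate()` can also add columns, the following minimal sketch keeps the weight in grams and adds the weight in pounds as a separate column. The names `chicks`, `chick_lb`, and `weight_lb` are illustrative choices, not part of the dataset, and `datasets::ChickWeight` is used so that the example always starts from the unmodified built-in data.

```r
library(dplyr, warn.conflicts = FALSE)

# Start from the built-in dataset, regardless of any modified copy in the workspace.
chicks <- datasets::ChickWeight

# weight_lb is a hypothetical column name chosen for this sketch.
chick_lb <- mutate(chicks, weight_lb = weight * 0.00220462)

# The weight column is still in grams; weight_lb is added as a fifth column.
head(chick_lb, 3)
```

Because the result is stored in `chick_lb`, both the original data and the converted values remain available.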


### Compare to base R

In [50]:
# Restore ChickWeight to its original state
remove(ChickWeight)

# Use base R to convert the weight from grams to pounds and store the result.
ChickWeight$weight <- ChickWeight$weight * 0.00220462

# Look at the first three observations of ChickWeight
head(ChickWeight,3)

| | weight | Time | Chick | Diet |
|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 1 | 0.09259404 | 0 | 1 | 1 |
| 2 | 0.11243562 | 2 | 1 | 1 |
| 3 | 0.13007258 | 4 | 1 | 1 |
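
As a quick check that the two approaches agree, the sketch below applies both conversions to separate copies of the built-in data and compares the resulting `weight` columns. The variable names `with_dplyr` and `with_base` are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

# dplyr version: mutate() returns a modified copy.
with_dplyr <- mutate(datasets::ChickWeight, weight = weight * 0.00220462)

# Base R version: copy the data, then overwrite the column in the copy.
with_base <- datasets::ChickWeight
with_base$weight <- with_base$weight * 0.00220462

# TRUE if both approaches produce the same weights.
all.equal(with_dplyr$weight, with_base$weight)
```

The main practical difference is that the base R assignment modifies the data frame it is applied to, while `mutate()` leaves its input alone unless you store the output back into the same variable.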


## arrange
Use the dplyr function `arrange()` to sort the observations of a data frame. The basic syntax for this function is the following:

`arrange(DATA_FRAME, COLUMN_NAME)`

By default, the data frame is sorted so that the values of the specified column are in *ascending* order. Put COLUMN_NAME inside the `desc()` function to sort in descending order.

`arrange(DATA_FRAME, desc(COLUMN_NAME))`

In [57]:
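# Restore ChickWeight to its original state so that the weights are in grams again.
remove(ChickWeight)

# Sort the ChickWeight data frame so that the values in the Time column are in ascending order.
arrange(ChickWeight, Time)

# This code allows us to look at the sorted data but does not save it anywhere.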

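| weight | Time | Chick | Diet |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 42 | 0 | 1 | 1 |
| 40 | 0 | 2 | 1 |
| 43 | 0 | 3 | 1 |
| 42 | 0 | 4 | 1 |
| 41 | 0 | 5 | 1 |
| 41 | 0 | 6 | 1 |
| 41 | 0 | 7 | 1 |
| 42 | 0 | 8 | 1 |
| 42 | 0 | 9 | 1 |
| 41 | 0 | 10 | 1 |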


<div class="admonition warning" name="html-admonition">
<div class="title" style="background: pink; padding: 10px">Warning</div>
    <p>The sorted data frame is the output; <code>arrange()</code> does not change the original data frame. To replace the original data frame with the sorted one, <b>store the output in the original variable</b>.</p>
    <code>DATA_FRAME  &#60;- arrange(DATA_FRAME, COLUMN_NAME)</code>
</div>

In [58]:
# Sort the ChickWeight data frame so that the values in the Time column are in descending order.
arrange(ChickWeight, desc(Time))

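| weight | Time | Chick | Diet |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 205 | 21 | 1 | 1 |
| 215 | 21 | 2 | 1 |
| 202 | 21 | 3 | 1 |
| 157 | 21 | 4 | 1 |
| 223 | 21 | 5 | 1 |
| 157 | 21 | 6 | 1 |
| 305 | 21 | 7 | 1 |
| 98 | 21 | 9 | 1 |
| 124 | 21 | 10 | 1 |
| 175 | 21 | 11 | 1 |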
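
The behavior of `arrange()` and `desc()` is easier to see on a data frame small enough to read at a glance. The sketch below uses a made-up data frame `df` (not part of the ChickWeight data); the names `df`, `asc`, and `dsc` are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

# A small made-up data frame for illustration only.
df <- data.frame(
  name = c("a", "b", "c", "d"),
  score = c(3, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Ascending order by score: rows b, d, a, c.
asc <- arrange(df, score)

# Descending order by score: rows c, a, d, b.
dsc <- arrange(df, desc(score))

# df itself is unchanged; the sorted copies exist only because we stored them.
df$name
asc$name
dsc$name
```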


<div class="admonition note" name="html-admonition">
<div class="title" style="background: lightblue; padding: 10px">Note</div>
    <p><b>R is case-sensitive</b>. The above code would not work if <code>Time</code> were replaced with <code>time</code>, because the column is named <code>Time</code>.</p>
</div>

### Sorting with respect to multiple columns
You can sort with respect to multiple columns by listing multiple column names in the `arrange()` function:

`arrange(DATA_FRAME, COLUMN_NAME1, COLUMN_NAME2, COLUMN_NAME3, ...)`

In [63]:
# Sort with respect to both Time and weight.
# Rows are sorted by Time first; rows with equal Time are then sorted by weight.
arrange(ChickWeight, Time, weight)

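| weight | Time | Chick | Diet |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 39 | 0 | 18 | 1 |
| 39 | 0 | 27 | 2 |
| 39 | 0 | 28 | 2 |
| 39 | 0 | 29 | 2 |
| 39 | 0 | 33 | 3 |
| 39 | 0 | 36 | 3 |
| 39 | 0 | 48 | 4 |
| 40 | 0 | 2 | 1 |
| 40 | 0 | 21 | 2 |
| 40 | 0 | 25 | 2 |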


<div class="admonition note" name="html-admonition">
<div class="title" style="background: lightblue; padding: 10px">Note</div>
    <p>The order of the column names in the <code>arrange()</code> function matters. The farther left a column name appears, the higher the priority of its values in the sort order.</p>
</div>

In [64]:
# Sort with respect to both weight and Time.
# Rows are sorted by weight first; rows with equal weight are then sorted by Time.
arrange(ChickWeight, weight, Time)

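| weight | Time | Chick | Diet |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<ord>` | `<fct>` |
| 35 | 2 | 18 | 1 |
| 39 | 0 | 18 | 1 |
| 39 | 0 | 27 | 2 |
| 39 | 0 | 28 | 2 |
| 39 | 0 | 29 | 2 |
| 39 | 0 | 33 | 3 |
| 39 | 0 | 36 | 3 |
| 39 | 0 | 48 | 4 |
| 39 | 2 | 3 | 1 |
| 40 | 0 | 2 | 1 |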
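
Each column in a multi-column sort can have its own direction. As a sketch, the call below sorts by `Time` in ascending order and breaks ties by `weight` in descending order, so the heaviest chick at each time point comes first. The variable name `sorted` is illustrative, and `datasets::ChickWeight` is used so that the example starts from the unmodified built-in data.

```r
library(dplyr, warn.conflicts = FALSE)

# Time ascending, then weight descending within each Time value.
sorted <- arrange(datasets::ChickWeight, Time, desc(weight))

# Look at the first three observations of the sorted data.
head(sorted, 3)
```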


## filter

## group_by and summarize