# dplyr and the Tidyverse


Being able to quickly modify datasets -- often referred to as "data wrangling" -- is critical to being a social scientist. Indeed, most social scientists and data scientists spend a huge proportion of their time cleaning and organizing their data ([about 80 percent in surveys](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=44a70ffb6f63)).

In our previous readings, we learned how to accomplish tasks like subsetting and modifying variables using what's called "array indexing" (using those `[]` square brackets). 

There is, however, another approach to manipulating dataframes in R that is very popular: a set of packages known collectively as the [tidyverse](https://www.tidyverse.org/) originally developed by [Hadley Wickham](http://had.co.nz/).

In this reading, we'll explore a tidyverse library called `dplyr`, which provides a set of simple functions for subsetting, sorting, renaming variables, and extracting unique values. To be clear, **dplyr doesn't allow you to do anything you couldn't do with array indexing; it just provides different ways to write your commands.** But the way it allows you to write commands is something that many people find quite compelling.

## The Philosophy of dplyr

Before we get into dplyr, however, a quick word of caution: there's a reason we learned to manipulate dataframes using array indexing before I introduced dplyr. That's because, despite its popularity, I will confess to having some misgivings about dplyr.

The basic issue with dplyr is that it provides a set of specific commands to do lots of specific dataframe manipulation tasks. And if all you want to do is manipulate dataframes, dplyr is great. *But*...

Learning the tidyverse amounts to learning lots of specific functions. There's no concept of *generalized abstractions*, like array indexing. As we've seen in our past readings, in regular R the logic that dictates how vectors work informs how matrices work, which in turn informs how dataframes work. And if you move into three-dimensional arrays at some point, or other domains (like network analysis), what you know about vectors and matrices will still be relevant. 

Indeed, the concept of an array and the idea of array indexing are such fundamental abstractions in data science that you'll also find them in other languages you may someday end up using, like Python, Matlab, and Julia.

As such, I worry that over-reliance on the tidyverse amounts to moving away from learning to *program* in R by composing more sophisticated commands from basic building blocks and towards just learning to chain a series of specific commands together.

(If you want to read a more eloquent version of this critique, [you can find one here](https://towardsdatascience.com/a-thousand-gadgets-my-thoughts-on-the-r-tidyverse-2441d8504433).)

None of that is to suggest you should avoid dplyr or the rest of the tidyverse entirely. To the contrary, I think the tidyverse plotting library (`ggplot2`) is the best plotting library around, and I'm a fan of several dplyr functions (especially `rename`, which makes an otherwise tedious task quite simple). But as you use it, be mindful of its different philosophy of programming, and approach it with intentionality.

## Installing dplyr

To use dplyr, you must:

- Install dplyr with the command `install.packages("dplyr")`. You only have to do this once on a given computer.
- Load it into your R session with `library(dplyr)`. You have to run this every time you open R and want to use dplyr.
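
If you share a script with others, one common pattern is to install the package only when it is missing and then load it. The following is a minimal sketch of that idea, not a required step; in a non-interactive script you may also need to pass a `repos` argument to `install.packages()`, depending on your setup.

```r
# Install dplyr only if it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load dplyr for the current session (needed every time you start R)
library(dplyr)
```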

In [3]:
library(dplyr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




## Subsetting with filter

To demonstrate dplyr, we'll rely on the small data frame we used before, which you can create as follows:

In [2]:
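# Build three vectors, one per column
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
# GDP per capita is drawn at random, so your values will differ from those shown
gdp_pc <- round(runif(9, 1000, 20000))

# Combine the vectors into a data frame and print it
countries <- data.frame(country, year, gdp_pc)
countries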

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,11197
China,1994,12877
Sudan,1994,2326
USA,1995,15058
China,1995,5549
Sudan,1995,7658
USA,1996,2965
China,1996,19660
Sudan,1996,5971


Now suppose we want to subset just to observations from China. With array indexing, we'd run:

In [7]:
countries[countries$country == "China",]

,country,year,gdp_pc
,<chr>,<dbl>,<dbl>
2,China,1994,12877
5,China,1995,5549
8,China,1996,19660


To subset with dplyr, we use the `filter` command together with a similar logical statement:

In [8]:
filter(countries, country == "China")

country,year,gdp_pc
<chr>,<dbl>,<dbl>
China,1994,12877
China,1995,5549
China,1996,19660


Or, to filter to just middle-income countries, we could run:

In [9]:
filter(countries, gdp_pc > 5000 & gdp_pc < 14000)

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,11197
China,1994,12877
China,1995,5549
Sudan,1995,7658
Sudan,1996,5971


As is probably evident, the first argument in the `filter()` function
specifies the dataset on which to carry out the operation. The second
argument specifies the logical condition used to filter the data: only
rows for which the condition is `TRUE` are kept.
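
To make the role of the second argument concrete, here is a small sketch. It assumes a fixed copy of `countries` built from the values printed earlier (so the results do not depend on the random draw), and it shows two variations not used above: passing several conditions separated by commas, which `filter()` combines with AND, and using `%in%` to match any of several values.

```r
library(dplyr)

# Fixed copy of the example data, using the values printed earlier
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year    = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc  = c(11197, 12877, 2326, 15058, 5549, 7658, 2965, 19660, 5971)
)

# Two ways to write the same AND condition
middle_a <- filter(countries, gdp_pc > 5000 & gdp_pc < 14000)
middle_b <- filter(countries, gdp_pc > 5000, gdp_pc < 14000)
all.equal(middle_a, middle_b)

# Keep rows whose country matches any element of a vector
china_sudan <- filter(countries, country %in% c("China", "Sudan"))
china_sudan
```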

## Sorting


Use `arrange()` to sort a dataset. As with `filter()`, the first
argument is the data frame; the arguments that follow name the columns
to sort by. Here are a few examples.

In [10]:
# Sort by country names 
arrange(countries, country) 

country,year,gdp_pc
<chr>,<dbl>,<dbl>
China,1994,12877
China,1995,5549
China,1996,19660
Sudan,1994,2326
Sudan,1995,7658
Sudan,1996,5971
USA,1994,11197
USA,1995,15058
USA,1996,2965


In [11]:
# Sort by GDP (ascending is default) 
arrange(countries, gdp_pc)

country,year,gdp_pc
<chr>,<dbl>,<dbl>
Sudan,1994,2326
USA,1996,2965
China,1995,5549
Sudan,1996,5971
Sudan,1995,7658
USA,1994,11197
China,1994,12877
USA,1995,15058
China,1996,19660


In [12]:
# Sort by GDP (descending)
arrange(countries, desc(gdp_pc))

country,year,gdp_pc
<chr>,<dbl>,<dbl>
China,1996,19660
USA,1995,15058
China,1994,12877
USA,1994,11197
Sudan,1995,7658
Sudan,1996,5971
China,1995,5549
USA,1996,2965
Sudan,1994,2326
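
`arrange()` also accepts more than one sorting column, with later columns breaking ties in earlier ones. The sketch below assumes a fixed copy of `countries` built from the values printed earlier, and compares the dplyr call with one way of writing the same sort using array indexing and `order()`.

```r
library(dplyr)

# Fixed copy of the example data, using the values printed earlier
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year    = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc  = c(11197, 12877, 2326, 15058, 5549, 7658, 2965, 19660, 5971)
)

# Sort by country name, then by year from most recent to oldest
sorted_dplyr <- arrange(countries, country, desc(year))
sorted_dplyr

# A comparable sort with array indexing: order() returns the row positions
sorted_base <- countries[order(countries$country, -countries$year), ]
sorted_base
```

Note that the array-indexing version keeps the original row names, while `arrange()` renumbers the rows.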


## Selecting Columns

Just as we can select columns by name with array indexing:


In [30]:
countries[, c("country", "gdp_pc")]

country,gdp_pc
<chr>,<dbl>
USA,11197
China,12877
Sudan,2326
USA,15058
China,5549
Sudan,7658
USA,2965
China,19660
Sudan,5971


We can also select columns with dplyr's `select()`:

In [14]:
# Keep country and GDP
select(countries, country, gdp_pc)

country,gdp_pc
<chr>,<dbl>
USA,11197
China,12877
Sudan,2326
USA,15058
China,5549
Sudan,7658
USA,2965
China,19660
Sudan,5971


In [15]:
# Same result using '-', which drops the named variable
select(countries, -year)

country,gdp_pc
<chr>,<dbl>
USA,11197
China,12877
Sudan,2326
USA,15058
China,5549
Sudan,7658
USA,2965
China,19660
Sudan,5971


In [16]:
# Selecting and renaming in one
select(countries, country_name = country, gdp_pc)

country_name,gdp_pc
<chr>,<dbl>
USA,11197
China,12877
Sudan,2326
USA,15058
China,5549
Sudan,7658
USA,2965
China,19660
Sudan,5971


## Renaming


As illustrated in the last line of code above, you can rename variables
using `select()`. But this can also be done using `rename()`, which,
unlike `select()`, keeps all the other columns:

In [17]:
# Rename GDP per capita
rename(countries, GDP.PC = gdp_pc)

country,year,GDP.PC
<chr>,<dbl>,<dbl>
USA,1994,11197
China,1994,12877
Sudan,1994,2326
USA,1995,15058
China,1995,5549
Sudan,1995,7658
USA,1996,2965
China,1996,19660
Sudan,1996,5971
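
For comparison, here is a sketch of one way to do the same rename with base R, assuming a fixed copy of `countries` built from the values printed earlier. It looks up the position of the old name in `names()` and assigns the new name there, which is the kind of bookkeeping `rename()` saves you.

```r
library(dplyr)

# Fixed copy of the example data, using the values printed earlier
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year    = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc  = c(11197, 12877, 2326, 15058, 5549, 7658, 2965, 19660, 5971)
)

# Base R: find the old name among the column names and overwrite it
renamed_base <- countries
names(renamed_base)[names(renamed_base) == "gdp_pc"] <- "GDP.PC"

# dplyr: new_name = old_name
renamed_dplyr <- rename(countries, GDP.PC = gdp_pc)

# Both approaches produce the same column names
identical(names(renamed_base), names(renamed_dplyr))
```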


## New variables 

In our last reading, we saw how we could create new variables by pulling out a column, modifying it, and re-inserting it. For example, if we wanted GDP per capita in 1000s instead of in dollars, we could do:

In [18]:
countries$gdp_pc_in_1000s <- countries$gdp_pc / 1000
countries

country,year,gdp_pc,gdp_pc_in_1000s
<chr>,<dbl>,<dbl>,<dbl>
USA,1994,11197,11.197
China,1994,12877,12.877
Sudan,1994,2326,2.326
USA,1995,15058,15.058
China,1995,5549,5.549
Sudan,1995,7658,7.658
USA,1996,2965,2.965
China,1996,19660,19.66
Sudan,1996,5971,5.971


In `dplyr` one uses `mutate()`:

In [19]:
# Create a new variable that has GDP per capita in 1000s
mutate(countries, gdppc_1k = gdp_pc / 1000)

country,year,gdp_pc,gdp_pc_in_1000s,gdppc_1k
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
USA,1994,11197,11.197,11.197
China,1994,12877,12.877,12.877
Sudan,1994,2326,2.326,2.326
USA,1995,15058,15.058,15.058
China,1995,5549,5.549,5.549
Sudan,1995,7658,7.658,7.658
USA,1996,2965,2.965,2.965
China,1996,19660,19.66,19.66
Sudan,1996,5971,5.971,5.971


In [25]:
# Create a new variable with lower-case country names
mutate(countries, country_lc = tolower(country))

country,year,gdp_pc,gdp_pc_in_1000s,country_lc
<chr>,<dbl>,<dbl>,<dbl>,<chr>
USA,1994,11197,11.197,usa
China,1994,12877,12.877,china
Sudan,1994,2326,2.326,sudan
USA,1995,15058,15.058,usa
China,1995,5549,5.549,china
Sudan,1995,7658,7.658,sudan
USA,1996,2965,2.965,usa
China,1996,19660,19.66,china
Sudan,1996,5971,5.971,sudan


In [29]:
# Both in one statement
mutate(countries,
       gdppc_1k = gdp_pc / 1000,
       country_lc = tolower(country))

country,year,gdp_pc,gdp_pc_in_1000s,gdppc_1k,country_lc
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
USA,1994,11197,11.197,11.197,usa
China,1994,12877,12.877,12.877,china
Sudan,1994,2326,2.326,2.326,sudan
USA,1995,15058,15.058,15.058,usa
China,1995,5549,5.549,5.549,china
Sudan,1995,7658,7.658,7.658,sudan
USA,1996,2965,2.965,2.965,usa
China,1996,19660,19.66,19.66,china
Sudan,1996,5971,5.971,5.971,sudan


Often it makes more sense to overwrite an existing variable
than to add a new one.

In [22]:
mutate(countries, country = tolower(country))

country,year,gdp_pc,gdp_pc_in_1000s
<chr>,<dbl>,<dbl>,<dbl>
usa,1994,11197,11.197
china,1994,12877,12.877
sudan,1994,2326,2.326
usa,1995,15058,15.058
china,1995,5549,5.549
sudan,1995,7658,7.658
usa,1996,2965,2.965
china,1996,19660,19.66
sudan,1996,5971,5.971
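
One thing the outputs above imply but is easy to miss: `mutate()` (like the other dplyr functions in this reading) returns a new data frame and leaves the original untouched. To keep the change, you have to assign the result to a name. A minimal sketch, assuming a fixed copy of `countries` built from the values printed earlier:

```r
library(dplyr)

# Fixed copy of the example data, using the values printed earlier
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year    = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc  = c(11197, 12877, 2326, 15058, 5549, 7658, 2965, 19660, 5971)
)

# Calling mutate() without assignment prints a result but changes nothing
mutate(countries, gdppc_1k = gdp_pc / 1000)
"gdppc_1k" %in% names(countries)  # FALSE: the original is unchanged

# Assigning the result is what makes the change stick
countries <- mutate(countries, gdppc_1k = gdp_pc / 1000)
"gdppc_1k" %in% names(countries)  # TRUE
```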


## Chaining

The last feature of dplyr to be aware of is *chaining*. Chaining is a way of combining commands to make code more concise. Basically, you use the operator `%>%` (often called the "pipe") to tell R to take the result of one function and make it the first argument in the next.

For example, rather than writing: 

```r
mutate(countries, country = tolower(country))
```

you can write:

```r
countries %>% mutate(country = tolower(country))
```

Here, `countries` is understood to be the first argument of `mutate`.

Chaining doesn't save much with only one command, but it pays off with a long series of commands.
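
To convince yourself the two forms really are equivalent, you can compare their results directly. This is a small sketch assuming a fixed copy of `countries` built from the values printed earlier; the second half shows how a nested two-step call reads when rewritten with `%>%`.

```r
library(dplyr)

# Fixed copy of the example data, using the values printed earlier
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year    = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc  = c(11197, 12877, 2326, 15058, 5549, 7658, 2965, 19660, 5971)
)

# One step: ordinary call versus piped call
plain_call <- mutate(countries, country = tolower(country))
piped_call <- countries %>% mutate(country = tolower(country))
identical(plain_call, piped_call)

# Two steps: nested calls are read inside-out...
nested <- arrange(filter(countries, year != 1994), country)

# ...while the piped version is read top to bottom
piped <- countries %>%
  filter(year != 1994) %>%
  arrange(country)
all.equal(nested, piped)
```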

Suppose we wanted to use `countries` to create a new data frame called `countries_new`, which should have observations from years 1995 and 1996 (dropping 1994), should be sorted by country name (in lower case), and should have a new variable equal to GDP per capita in 1000s.

Here's how we could do this *without* chaining: 

In [27]:
countries_new <- filter(countries, year != 1994)  # drop year 1994
countries_new <- arrange(countries_new, country)  # sort by country names
countries_new <- mutate(countries_new,
                        country = tolower(country),  # convert name to lower-case
                        gdppc_1k = gdp_pc / 1000)    # create GDP pc in 1000s
countries_new

country,year,gdp_pc,gdp_pc_in_1000s,gdppc_1k
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
china,1995,5549,5.549,5.549
china,1996,19660,19.66,19.66
sudan,1995,7658,7.658,7.658
sudan,1996,5971,5.971,5.971
usa,1995,15058,15.058,15.058
usa,1996,2965,2.965,2.965


Here's the same thing using chaining: 

In [28]:
countries_new <- countries %>%
    filter(year != 1994) %>%
    arrange(country) %>%
    mutate(country = tolower(country), gdppc_1k = gdp_pc / 1000)
countries_new

country,year,gdp_pc,gdp_pc_in_1000s,gdppc_1k
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
china,1995,5549,5.549,5.549
china,1996,19660,19.66,19.66
sudan,1995,7658,7.658,7.658
sudan,1996,5971,5.971,5.971
usa,1995,15058,15.058,15.058
usa,1996,2965,2.965,2.965


Chaining always begins with specifying the data frame we want to operate on (e.g., `countries`). Each subsequent function then operates on the result of the step before it, starting with the function that comes right after the data frame and working its way down. In our case, the first thing we do to `countries` is subset it. We then sort it by country name. Lastly, we overwrite the country name to be lower-case and create a new variable representing GDP per capita in 1000s.

Is chaining *better*? Some people find chaining makes code more readable. It certainly makes it more concise. 

Personally, my preference is actually to break down long manipulations like this into a series of distinct commands because it allows me to look at each intermediate step and make sure I didn't mess something up. And as we'll discuss in a later reading, I think you should *always* assume you've messed something up, because humans are bad at programming! But again, chaining is definitely the more popular approach to R these days, so it's important to introduce!
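
As a sketch of what checking intermediate steps can look like, the example below repeats the same manipulation one command at a time and uses base R's `stopifnot()` to verify each result before moving on. The intermediate names (`step1`, `step2`) and the particular checks are illustrative choices, and the data is a fixed copy of `countries` built from the values printed earlier.

```r
library(dplyr)

# Fixed copy of the example data, using the values printed earlier
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year    = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc  = c(11197, 12877, 2326, 15058, 5549, 7658, 2965, 19660, 5971)
)

# Step 1: drop 1994, then confirm it is really gone
step1 <- filter(countries, year != 1994)
stopifnot(!any(step1$year == 1994), nrow(step1) == 6)

# Step 2: sort by country, then confirm the column is in order
step2 <- arrange(step1, country)
stopifnot(!is.unsorted(step2$country))

# Step 3: lower-case names and rescale GDP, then confirm both changes
countries_new <- mutate(step2,
                        country = tolower(country),
                        gdppc_1k = gdp_pc / 1000)
stopifnot(all(countries_new$country == tolower(countries_new$country)),
          all(countries_new$gdppc_1k < 1000))

countries_new
```

If any check fails, R stops with an error at that step, which tells you where in the sequence things went wrong.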

## Summing Up

In conclusion, dplyr allows you to write more concise commands with more familiar terminology -- `select` and `rename` rather than array notation. Chaining, similarly, can definitely make code more concise. As a result, many people are drawn to dplyr, and you may be too! And while I do have some misgivings about it, I can certainly appreciate the draw.

So should you use it? That's up to you! At this point, you know enough about the different approaches to dataframe manipulation that you can make your own educated decision, and change that decision in the future if you want. 

### Want to Learn More?

If there's anything the tidyverse is good at, it's documentation! [Here are the docs for dplyr.](https://dplyr.tidyverse.org/)

