# Piping in R

Piping is a way to chain multiple operations together so that the code stays easy to read. In R, the pipe operator `%>%` is especially convenient for this.

The `%>%` operator originally comes from the `magrittr` package, but it is also made available by the `dplyr` package (and thereby by the `tidyverse` package). So let us load the `tidyverse` package:


In [3]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.1     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


To get started, we need some sample data to work with. We will use the marketing example data from chapter 3 of the book "Introduction to R for Business Intelligence" by Jay Gendron. Here is how to load it into R:

In [4]:
marketingData <- read.csv("https://raw.githubusercontent.com/jgendron/com.packtpub.intro.r.bi/master/Chapter3-ExploratoryDataAnalysis/data/Ch3_marketing.csv")
head(marketingData)

google_adwords,facebook,twitter,marketing_total,revenues,employees,pop_density
65.66,47.86,52.46,165.98,39.26,5,High
39.1,55.2,77.4,171.7,38.9,7,Medium
174.81,52.01,68.01,294.83,49.51,11,Medium
34.36,61.96,86.86,183.18,40.56,7,High
78.21,40.91,30.41,149.53,40.21,9,Low
34.19,15.09,12.79,62.07,38.09,3,High


Assume we want to subset the data to the rows where `pop_density` is `High`, then group by `employees`, and then calculate the mean of `marketing_total` for each of these groups. This can be done in, for instance, the following two ways:

In [5]:
newData <- filter(marketingData, pop_density == "High")
newData2 <- group_by(newData, employees)
newData3 <- select(newData2, employees, marketing_total)
newData4 <- summarise_all(newData3, mean)
newData4 

employees,marketing_total
3,94.51
4,105.67
5,148.465
6,198.8944
7,174.5425
8,233.5122
9,299.8514
10,278.4633
12,360.594


In [6]:
summarise_all(select(group_by(filter(marketingData, pop_density == "High"), employees), employees, marketing_total), mean)

employees,marketing_total
3,94.51
4,105.67
5,148.465
6,198.8944
7,174.5425
8,233.5122
9,299.8514
10,278.4633
12,360.594


The two pieces of code do the same thing and illustrate two different ways to execute several consecutive operations on some data. The operations are: first filter on the rows for which `pop_density` is `High`, then group by the `employees` variable, then select the columns `marketing_total` and `employees`, and finally summarise all variables by their mean.

In the first bit of code we assign the output of each operation to a new variable (starting with `newData`). Then we use this new variable as input to the next operation (grouping by the variable `employees`). We keep doing this for as many operations as we want to execute.

In the second bit of code we skip all the intermediate assignments by nesting the calls inside each other, with the last operation as the outermost call. This is clearly less code, but it is also much harder to read, because the operations have to be read from the inside out.

Can we write something simpler than the first version that is still readable? The answer is "yes, with the pipe operator `%>%`". We can simply do:

In [7]:
marketingData %>%
    filter(pop_density == "High") %>%
    group_by(employees) %>%
    select(employees, marketing_total) %>%
    summarise_all(mean)

employees,marketing_total
3,94.51
4,105.67
5,148.465
6,198.8944
7,174.5425
8,233.5122
9,299.8514
10,278.4633
12,360.594


A few things are worth noticing about this last piece of code. First of all, the multiple lines are only there for readability; we could have put everything on one line. (For the multi-line version to work, however, each line that continues must end with the pipe `%>%`.)
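The line-break rule can be illustrated with a small sketch. It assumes the `magrittr` package is installed and uses a made-up vector `values` that is not part of the marketing data:

```r
library(magrittr)

# Hypothetical toy vector, only for illustration
values <- c(1, 4, 9)

# Works: every line that continues ends with %>%,
# so R knows the expression is not finished yet.
total <- values %>%
    sqrt() %>%
    sum()

# The same pipeline written on a single line.
total_one_line <- values %>% sqrt() %>% sum()

# Does not work at the top level of a script or the console:
# R treats `values` as a complete expression, and the next line
# then starts with %>%, which is a syntax error.
# values
#     %>% sqrt()
```

Both `total` and `total_one_line` hold the same value; only the layout of the code differs.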

Secondly, notice how, compared to the first code example, we have completely left out the **first** argument in the calls to the functions `filter`, `group_by`, `select`, and `summarise_all`. The pipe operator takes the output of the operation on its left and puts it in as the **first** argument to the next operation. That is, the output of the operation `filter(pop_density == "High")` is used as the first input to the next `group_by` operation. Formally, an expression of the form `function(argument1, argument2)` can be rewritten as `argument1 %>% function(argument2)`. In itself this rewriting does not change much, but when multiple function calls are chained together it helps a lot.
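As a minimal sketch of this rewriting rule, the following compares ordinary calls with their piped equivalents. It assumes the `magrittr` package is installed, and the numbers are made up for illustration:

```r
library(magrittr)

# Ordinary call: the value is passed as the first argument.
rounded_call <- round(3.14159, 2)

# Piped call: the left-hand side becomes the first argument of round().
rounded_pipe <- 3.14159 %>% round(2)

# Both forms give the same result.
identical(rounded_call, rounded_pipe)

# With several calls the difference in readability shows:
# the nested form is read inside out, the piped form left to right.
nested <- sum(sqrt(abs(c(-1, -4, -9))))
piped <- c(-1, -4, -9) %>% abs() %>% sqrt() %>% sum()
identical(nested, piped)
```

Both calls to `identical` return `TRUE`, since the pipe only changes how the call is written, not what is computed.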

Finally, notice how easy it is to read: each line represents one operation applied to the output of the operation on the line above.

### *Exercise*

Use the `%>%` operator on `marketingData` to group it by the `pop_density` variable and then summarise `revenues` by adding the values together with `sum`.

For more on `%>%` and piping in R in general, see the following tutorial on DataCamp: https://www.datacamp.com/community/tutorials/pipe-r-tutorial