# Data Manipulation in R using `dplyr`

## What is `dplyr`?
`dplyr` is an R package that provides a set of tools for efficiently manipulating datasets. It is the next iteration of `plyr`, focusing only on data frames. With `dplyr`, anything you can do to a local data frame you can also do to a remote database table.

## Why `dplyr`?
- Great for data exploration and transformation
- Intuitive to write and easy to read, especially when using the “chaining” syntax (covered below)
- Fast on data frames

## `dplyr` functionality
- Five basic verbs: `filter`, `select`, `arrange`, `mutate`, and `summarise`, plus `group_by` for grouped operations
- Can work with data stored in databases and data tables
- Joins: inner join, left join, semi-join, anti-join
- Window functions for calculating ranking, offsets, and more
- Better than plyr if you’re only working with data frames (though it doesn’t yet duplicate all of the plyr functionality)
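
The join types listed above can be sketched on two small tables. The `carriers` and `routes` tibbles below are made up for illustration and are not part of `hflights`:

```r
library(dplyr)

# Hypothetical lookup tables, invented for illustration only
carriers <- tibble(code = c("AA", "UA", "WN"),
                   name = c("American", "United", "Southwest"))
routes   <- tibble(code = c("AA", "AA", "DL"),
                   dest = c("DFW", "MIA", "ATL"))

inner <- inner_join(routes, carriers, by = "code")  # only rows with a match in both tables
left  <- left_join(routes, carriers, by = "code")   # every row of routes; name is NA for DL
semi  <- semi_join(carriers, routes, by = "code")   # carriers that appear in routes
anti  <- anti_join(carriers, routes, by = "code")   # carriers that never appear in routes
```

`semi_join` and `anti_join` only filter the first table; they never add columns from the second.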

## Load Packages and Data 

In [1]:
install.packages('hflights', repos = 'http://cran.us.r-project.org')

Installing package into ‘/home/jubayer/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)



In [2]:
# Load packages 
library(dplyr)
library(hflights)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [3]:
# Explore data 
data(hflights)
head(hflights)

(row name),Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
,<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>
5424,2011,1,1,6,1400,1500,AA,428,N576AA,60,⋯,-10,0,IAH,DFW,224,7,13,0,,0
5425,2011,1,2,7,1401,1501,AA,428,N557AA,60,⋯,-9,1,IAH,DFW,224,6,9,0,,0
5426,2011,1,3,1,1352,1502,AA,428,N541AA,70,⋯,-8,-8,IAH,DFW,224,5,17,0,,0
5427,2011,1,4,2,1403,1513,AA,428,N403AA,70,⋯,3,3,IAH,DFW,224,9,22,0,,0
5428,2011,1,5,3,1405,1507,AA,428,N492AA,62,⋯,-3,5,IAH,DFW,224,9,9,0,,0
5429,2011,1,6,4,1359,1503,AA,428,N262AA,64,⋯,-7,-1,IAH,DFW,224,6,13,0,,0


In [4]:
# `as_tibble` creates a “local data frame” (a tibble); it replaces `tbl_df`, deprecated as of dplyr 1.0.0
# A tibble is simply a wrapper for a data frame that prints nicely
flights <- as_tibble(hflights)


In [5]:
# Examine first few rows 
head(flights)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>
2011,1,1,6,1400,1500,AA,428,N576AA,60,⋯,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,2,7,1401,1501,AA,428,N557AA,60,⋯,-9,1,IAH,DFW,224,6,9,0,,0
2011,1,3,1,1352,1502,AA,428,N541AA,70,⋯,-8,-8,IAH,DFW,224,5,17,0,,0
2011,1,4,2,1403,1513,AA,428,N403AA,70,⋯,3,3,IAH,DFW,224,9,22,0,,0
2011,1,5,3,1405,1507,AA,428,N492AA,62,⋯,-3,5,IAH,DFW,224,9,9,0,,0
2011,1,6,4,1359,1503,AA,428,N262AA,64,⋯,-7,-1,IAH,DFW,224,6,13,0,,0


In [6]:
# Examine last few rows 
tail(flights)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>
2011,12,6,2,1307,1600,WN,471,N632SW,113,⋯,0,7,HOU,TPA,781,5,10,0,,0
2011,12,6,2,1818,2111,WN,1191,N284WN,113,⋯,-9,8,HOU,TPA,781,5,11,0,,0
2011,12,6,2,2047,2334,WN,1674,N366SW,107,⋯,4,7,HOU,TPA,781,4,9,0,,0
2011,12,6,2,912,1031,WN,127,N777QC,79,⋯,-4,-3,HOU,TUL,453,4,14,0,,0
2011,12,6,2,656,812,WN,621,N727SW,76,⋯,-13,-4,HOU,TUL,453,3,9,0,,0
2011,12,6,2,1600,1713,WN,1597,N745SW,73,⋯,-12,0,HOU,TUL,453,3,11,0,,0


## Command structure (for all dplyr verbs)
- first argument is a **data frame**
- return value is a data frame
- nothing is modified in place
- Note: dplyr generally does not **preserve row names**
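
A minimal sketch of the "nothing is modified in place" rule, using a toy tibble `df` that is assumed here purely for illustration:

```r
library(dplyr)

# Toy data, assumed for illustration
df <- tibble(x = 1:5, y = c("a", "b", "c", "d", "e"))

small <- filter(df, x > 3)  # returns a new data frame
nrow(df)                    # df itself still has all of its rows
nrow(small)

df <- filter(df, x > 3)     # assign the result back to keep the change
```

Because each verb returns a new data frame, a result that is not assigned (or piped onward) is only printed and then lost.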

## `filter`: Keep rows matching criteria

### `AND` Operator( & )

In [7]:
# Note: you can use a comma or an ampersand (&) to represent an AND condition
filter(flights, Month==1, DayofMonth==1)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>
2011,1,1,6,1400,1500,AA,428,N576AA,60,⋯,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,1,6,728,840,AA,460,N520AA,72,⋯,5,8,IAH,DFW,224,6,25,0,,0
2011,1,1,6,1631,1736,AA,1121,N4WVAA,65,⋯,-9,1,IAH,DFW,224,16,12,0,,0
2011,1,1,6,1756,2112,AA,1294,N3DGAA,136,⋯,-3,1,IAH,MIA,964,9,14,0,,0
2011,1,1,6,1012,1347,AA,1700,N3DAAA,155,⋯,7,-8,IAH,MIA,964,12,26,0,,0
2011,1,1,6,1211,1325,AA,1820,N593AA,74,⋯,15,6,IAH,DFW,224,6,29,0,,0
2011,1,1,6,557,906,AA,1994,N3BBAA,129,⋯,-9,-3,IAH,MIA,964,5,11,0,,0
2011,1,1,6,1824,2106,AS,731,N614AS,282,⋯,-4,-1,IAH,SEA,1874,7,20,0,,0
2011,1,1,6,654,1124,B6,620,N324JB,210,⋯,5,-6,HOU,JFK,1428,6,23,0,,0
2011,1,1,6,1639,2110,B6,622,N324JB,211,⋯,61,54,HOU,JFK,1428,12,11,0,,0


In [8]:
# Using & 
filter(flights, Month==1 & DayofMonth==1)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>
2011,1,1,6,1400,1500,AA,428,N576AA,60,⋯,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,1,6,728,840,AA,460,N520AA,72,⋯,5,8,IAH,DFW,224,6,25,0,,0
2011,1,1,6,1631,1736,AA,1121,N4WVAA,65,⋯,-9,1,IAH,DFW,224,16,12,0,,0
2011,1,1,6,1756,2112,AA,1294,N3DGAA,136,⋯,-3,1,IAH,MIA,964,9,14,0,,0
2011,1,1,6,1012,1347,AA,1700,N3DAAA,155,⋯,7,-8,IAH,MIA,964,12,26,0,,0
2011,1,1,6,1211,1325,AA,1820,N593AA,74,⋯,15,6,IAH,DFW,224,6,29,0,,0
2011,1,1,6,557,906,AA,1994,N3BBAA,129,⋯,-9,-3,IAH,MIA,964,5,11,0,,0
2011,1,1,6,1824,2106,AS,731,N614AS,282,⋯,-4,-1,IAH,SEA,1874,7,20,0,,0
2011,1,1,6,654,1124,B6,620,N324JB,210,⋯,5,-6,HOU,JFK,1428,6,23,0,,0
2011,1,1,6,1639,2110,B6,622,N324JB,211,⋯,61,54,HOU,JFK,1428,12,11,0,,0


### `OR` Operator( | )

In [9]:
# Use the vertical bar (|) for an OR condition
filter(flights, UniqueCarrier == "AA" | UniqueCarrier == "UA" )

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>
2011,1,1,6,1400,1500,AA,428,N576AA,60,⋯,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,2,7,1401,1501,AA,428,N557AA,60,⋯,-9,1,IAH,DFW,224,6,9,0,,0
2011,1,3,1,1352,1502,AA,428,N541AA,70,⋯,-8,-8,IAH,DFW,224,5,17,0,,0
2011,1,4,2,1403,1513,AA,428,N403AA,70,⋯,3,3,IAH,DFW,224,9,22,0,,0
2011,1,5,3,1405,1507,AA,428,N492AA,62,⋯,-3,5,IAH,DFW,224,9,9,0,,0
2011,1,6,4,1359,1503,AA,428,N262AA,64,⋯,-7,-1,IAH,DFW,224,6,13,0,,0
2011,1,7,5,1359,1509,AA,428,N493AA,70,⋯,-1,-1,IAH,DFW,224,12,15,0,,0
2011,1,8,6,1355,1454,AA,428,N477AA,59,⋯,-16,-5,IAH,DFW,224,7,12,0,,0
2011,1,9,7,1443,1554,AA,428,N476AA,71,⋯,44,43,IAH,DFW,224,8,22,0,,0
2011,1,10,1,1443,1553,AA,428,N504AA,70,⋯,43,43,IAH,DFW,224,6,19,0,,0


### `%in%` Operator

In [10]:
# Use of %in%  operator 
filter(flights, UniqueCarrier %in% c("AA", "UA"))

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>
2011,1,1,6,1400,1500,AA,428,N576AA,60,⋯,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,2,7,1401,1501,AA,428,N557AA,60,⋯,-9,1,IAH,DFW,224,6,9,0,,0
2011,1,3,1,1352,1502,AA,428,N541AA,70,⋯,-8,-8,IAH,DFW,224,5,17,0,,0
2011,1,4,2,1403,1513,AA,428,N403AA,70,⋯,3,3,IAH,DFW,224,9,22,0,,0
2011,1,5,3,1405,1507,AA,428,N492AA,62,⋯,-3,5,IAH,DFW,224,9,9,0,,0
2011,1,6,4,1359,1503,AA,428,N262AA,64,⋯,-7,-1,IAH,DFW,224,6,13,0,,0
2011,1,7,5,1359,1509,AA,428,N493AA,70,⋯,-1,-1,IAH,DFW,224,12,15,0,,0
2011,1,8,6,1355,1454,AA,428,N477AA,59,⋯,-16,-5,IAH,DFW,224,7,12,0,,0
2011,1,9,7,1443,1554,AA,428,N476AA,71,⋯,44,43,IAH,DFW,224,8,22,0,,0
2011,1,10,1,1443,1553,AA,428,N504AA,70,⋯,43,43,IAH,DFW,224,6,19,0,,0
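
The equivalences used in the cells above can be checked on a small made-up table (the `toy` tibble and its column names are assumptions for illustration):

```r
library(dplyr)

# Toy data, assumed for illustration
toy <- tibble(carrier = c("AA", "UA", "WN", "AA"),
              month   = c(1, 1, 2, 2))

# A comma and & both mean AND
by_comma <- filter(toy, carrier == "AA", month == 1)
by_amp   <- filter(toy, carrier == "AA" & month == 1)
all.equal(by_comma, by_amp)

# Repeated == joined by | can be written more compactly with %in%
by_or <- filter(toy, carrier == "AA" | carrier == "UA")
by_in <- filter(toy, carrier %in% c("AA", "UA"))
all.equal(by_or, by_in)
```

`%in%` tends to read better once more than two values are compared.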


## `select`: Pick columns by name 
- dplyr approach uses similar syntax to filter

In [11]:
# Selecting columns
select(flights, DepTime, ArrTime, FlightNum)

DepTime,ArrTime,FlightNum
<int>,<int>,<int>
1400,1500,428
1401,1501,428
1352,1502,428
1403,1513,428
1405,1507,428
1359,1503,428
1359,1509,428
1355,1454,428
1443,1554,428
1443,1553,428


### `contains`

In [12]:
# Use colon to select multiple contiguous columns, and use `contains` to match columns by name
# Note: `starts_with`, `ends_with`, and `matches` 
# (for regular expressions) can also be used to match columns by name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))

Year,Month,DayofMonth,TaxiIn,TaxiOut,ArrDelay,DepDelay
<int>,<int>,<int>,<int>,<int>,<int>,<int>
2011,1,1,7,13,-10,0
2011,1,2,6,9,-9,1
2011,1,3,5,17,-8,-8
2011,1,4,9,22,3,3
2011,1,5,9,9,-3,5
2011,1,6,6,13,-7,-1
2011,1,7,12,15,-1,-1
2011,1,8,7,12,-16,-5
2011,1,9,8,22,44,43
2011,1,10,6,19,43,43
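
The other helpers named in the comment above (`starts_with`, `ends_with`, `matches`) work the same way as `contains`. A short sketch on a one-row toy tibble, whose values are assumed for illustration:

```r
library(dplyr)

# One-row toy table that reuses a few hflights column names
toy <- tibble(Year = 2011, TaxiIn = 7, TaxiOut = 13, ArrDelay = -10, DepDelay = 0)

taxi  <- select(toy, starts_with("Taxi"))    # columns whose names begin with "Taxi"
delay <- select(toy, ends_with("Delay"))     # columns whose names end with "Delay"
regex <- select(toy, matches("^(Arr|Dep)"))  # columns matching a regular expression
rest  <- select(toy, -Year)                  # a minus sign drops a column
```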


## “Chaining” or “Pipelining”
- The usual way to perform multiple operations in one line is to nest function calls, which must be read from the inside out
- With the `%>%` infix operator (which can be pronounced as “then”), commands can be written in their natural order

In [13]:
# Nesting Method 
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

UniqueCarrier,DepDelay
<chr>,<int>
AA,90
AA,67
AA,74
AA,125
AA,82
AA,99
AA,70
AA,61
AA,74
AS,73


In [14]:
# Chaining method 
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60) %>%
    head()

UniqueCarrier,DepDelay
<chr>,<int>
AA,90
AA,67
AA,74
AA,125
AA,82
AA,99


## `arrange`: Reorder rows

In [15]:
# Ascending order
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(DepDelay) %>%
    head()

UniqueCarrier,DepDelay
<chr>,<int>
OO,-33
MQ,-23
XE,-19
XE,-19
CO,-18
EV,-18


In [16]:
# Use `desc` for descending
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay)) %>%
    head()

UniqueCarrier,DepDelay
<chr>,<int>
CO,981
AA,970
MQ,931
UA,869
MQ,814
MQ,803
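
`arrange` also accepts several sort keys, and `desc` can be applied to any of them. A minimal sketch on made-up data (the `toy` tibble is an assumption for illustration):

```r
library(dplyr)

# Toy data, assumed for illustration
toy <- tibble(carrier = c("UA", "AA", "AA", "UA"),
              delay   = c(5, 20, -3, 40))

# Sort by carrier (ascending), then by delay (descending) within each carrier
sorted <- arrange(toy, carrier, desc(delay))
sorted
```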


## `mutate`: Add new variable

In [17]:
# Add a new variable and print the result without storing it
flights %>%
    select(Distance, AirTime) %>%
    mutate(Speed = Distance/AirTime * 60) %>%
    head()

Distance,AirTime,Speed
<int>,<int>,<dbl>
224,40,336.0
224,45,298.6667
224,48,280.0
224,39,344.6154
224,44,305.4545
224,45,298.6667


In [18]:
# Store the new variable 
flights <- flights %>% mutate(Speed = Distance/AirTime * 60)

In [19]:
# See Dataset
head(flights)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Speed
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>,<dbl>
2011,1,1,6,1400,1500,AA,428,N576AA,60,⋯,0,IAH,DFW,224,7,13,0,,0,336.0
2011,1,2,7,1401,1501,AA,428,N557AA,60,⋯,1,IAH,DFW,224,6,9,0,,0,298.6667
2011,1,3,1,1352,1502,AA,428,N541AA,70,⋯,-8,IAH,DFW,224,5,17,0,,0,280.0
2011,1,4,2,1403,1513,AA,428,N403AA,70,⋯,3,IAH,DFW,224,9,22,0,,0,344.6154
2011,1,5,3,1405,1507,AA,428,N492AA,62,⋯,5,IAH,DFW,224,9,9,0,,0,305.4545
2011,1,6,4,1359,1503,AA,428,N262AA,64,⋯,-1,IAH,DFW,224,6,13,0,,0,298.6667
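
One further detail of `mutate`, sketched on made-up numbers: a column created in a call can be used by later expressions in the same call. The `toy` tibble and the `Hours` column are assumptions for illustration:

```r
library(dplyr)

# Toy data, assumed for illustration (distance in miles, air time in minutes)
toy <- tibble(Distance = c(224, 964), AirTime = c(40, 120))

out <- toy %>%
    mutate(Hours = AirTime / 60,       # created first ...
           Speed = Distance / Hours)   # ... and used immediately afterwards
out
```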


## `summarise`: Reduce variables to values
- Primarily useful with data that has been grouped by one or more variables
- `group_by` creates the groups that will be operated on
- `summarise` uses the provided aggregation function to summarise each group

In [20]:
# create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE)) %>%
    head()

`summarise()` ungrouping output (override with `.groups` argument)



Dest,avg_delay
<chr>,<dbl>
ABQ,7.226259
AEX,5.839437
AGS,4.0
AMA,6.840095
ANC,26.080645
ASE,6.794643
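
The message printed above mentions the `.groups` argument. As a hedged sketch, assuming dplyr 1.0.0 or later and a made-up `toy` tibble, this is how it controls the grouping of the result:

```r
library(dplyr)

# Toy data, assumed for illustration
toy <- tibble(Month = c(1, 1, 1, 2),
              Day   = c(1, 1, 2, 1),
              delay = c(10, 20, 30, 40))

# Default: only the last grouping variable (Day) is dropped
by_default <- toy %>%
    group_by(Month, Day) %>%
    summarise(avg = mean(delay))

# .groups = "drop" returns an ungrouped result and silences the message
dropped <- toy %>%
    group_by(Month, Day) %>%
    summarise(avg = mean(delay), .groups = "drop")

group_vars(by_default)
group_vars(dropped)
```

Dropping the groups explicitly avoids surprises when later verbs such as `mutate` or `filter` would otherwise keep operating per group.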


In [21]:
# across() lets you apply the same summary function to multiple columns at once
# (it replaces summarise_each() and funs(), which are deprecated)
# Note: across() also works inside mutate()

# for each carrier, calculate the proportion of flights cancelled or diverted
flights %>%
    group_by(UniqueCarrier) %>%
    summarise(across(c(Cancelled, Diverted), mean)) %>%
    head()


UniqueCarrier,Cancelled,Diverted
<chr>,<dbl>,<dbl>
AA,0.018495684,0.001849568
AS,0.0,0.002739726
B6,0.025899281,0.005755396
CO,0.006782614,0.00262737
DL,0.015903067,0.003029156
EV,0.034482759,0.003176044


In [22]:
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
    group_by(UniqueCarrier) %>%
    summarise(across(matches("Delay"),
                     list(min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE)))) %>%
    head()

UniqueCarrier,ArrDelay_min,ArrDelay_max,DepDelay_min,DepDelay_max
<chr>,<int>,<int>,<int>,<int>
AA,-39,978,-15,970
AS,-43,183,-15,172
B6,-44,335,-14,310
CO,-55,957,-18,981
DL,-32,701,-17,730
EV,-40,469,-18,479


- Helper function n() counts the number of rows in a group
- Helper function n_distinct(vector) counts the number of unique items in that vector

In [23]:
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(flight_count = n()) %>%
    arrange(desc(flight_count)) %>%
    head()

`summarise()` regrouping output by 'Month' (override with `.groups` argument)



Month,DayofMonth,flight_count
<int>,<int>,<int>
8,4,706
8,11,706
8,12,706
8,5,705
8,3,704
8,10,704


In [24]:
# rewrite more simply with the `tally` function
flights %>%
    group_by(Month, DayofMonth) %>%
    tally(sort = TRUE) %>%
    head()

Month,DayofMonth,n
<int>,<int>,<int>
8,4,706
8,11,706
8,12,706
8,5,705
8,3,704
8,10,704
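
`count` combines the `group_by` and `tally` steps into one call. A minimal sketch on a made-up table (the `toy` tibble is an assumption for illustration):

```r
library(dplyr)

# Toy data, assumed for illustration
toy <- tibble(Month = c(8, 8, 8, 9), DayofMonth = c(4, 4, 5, 1))

# Two-step version, as in the cell above
with_tally <- toy %>%
    group_by(Month, DayofMonth) %>%
    tally(sort = TRUE)

# One-step version
with_count <- toy %>%
    count(Month, DayofMonth, sort = TRUE)
```

One difference to keep in mind: `count` returns an ungrouped result when its input is ungrouped, while the `group_by` plus `tally` version stays grouped by `Month`.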


In [25]:
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum)) %>%
    head()

`summarise()` ungrouping output (override with `.groups` argument)



Dest,flight_count,plane_count
<chr>,<int>,<int>
ABQ,2812,716
AEX,724,215
AGS,1,1
AMA,1297,158
ANC,125,38
ASE,125,60


In [26]:
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
    group_by(Dest) %>%
    select(Cancelled) %>%
    table() %>%
    head()

Adding missing grouping variables: `Dest`



     Cancelled
Dest     0    1
  ABQ 2787   25
  AEX  712   12
  AGS    1    0
  AMA 1265   32
  ANC  125    0
  ASE  120    5

## Window Functions
- Aggregation function (like `mean`) takes n inputs and returns 1 value
- Window function takes n inputs and returns n values
- Includes ranking and ordering functions (like `min_rank`), offset functions (`lead` and `lag`), and cumulative aggregates (like `cummean`)
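
The difference between an aggregation and a window function is easiest to see on a short vector. The values below are made up for illustration:

```r
library(dplyr)

x <- c(10, 30, 20)

mean(x)            # aggregation: 3 inputs, 1 value (20)

min_rank(x)        # window: 1 3 2 (smallest value gets rank 1)
min_rank(desc(x))  # window: 3 1 2 (largest value gets rank 1)
lag(x)             # window: NA 10 30 (previous element)
lead(x)            # window: 30 20 NA (next element)
cummean(x)         # window: 10 20 20 (running mean)
```

Each window function returns a vector of the same length as its input, which is why it can be used inside `mutate` or `filter`.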

In [27]:
# for each carrier, calculate which two days of the year they had their longest departure delays
# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    filter(min_rank(desc(DepDelay)) <= 2) %>%
    arrange(UniqueCarrier, desc(DepDelay))

Adding missing grouping variables: `UniqueCarrier`



UniqueCarrier,Month,DayofMonth,DepDelay
<chr>,<int>,<int>,<int>
AA,12,12,970
AA,11,19,677
AS,2,28,172
AS,7,6,138
B6,10,29,310
B6,8,19,283
CO,8,1,981
CO,1,20,780
DL,10,25,730
DL,4,5,497


In [28]:
# rewrite more simply with the `top_n` function
flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    top_n(2) %>%
    arrange(UniqueCarrier, desc(DepDelay))

Adding missing grouping variables: `UniqueCarrier`

Selecting by DepDelay



UniqueCarrier,Month,DayofMonth,DepDelay
<chr>,<int>,<int>,<int>
AA,12,12,970
AA,11,19,677
AS,2,28,172
AS,7,6,138
B6,10,29,310
B6,8,19,283
CO,8,1,981
CO,1,20,780
DL,10,25,730
DL,4,5,497
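
In dplyr 1.0.0 and later, `slice_max` is offered as the successor to `top_n`. A hedged sketch on made-up data (the `toy` tibble is an assumption for illustration):

```r
library(dplyr)

# Toy data, assumed for illustration
toy <- tibble(carrier  = c("AA", "AA", "AA", "UA", "UA"),
              DepDelay = c(5, 90, 40, 12, 70))

# Two largest departure delays per carrier
top2 <- toy %>%
    group_by(carrier) %>%
    slice_max(DepDelay, n = 2) %>%
    ungroup()
top2
```

Unlike `top_n`, `slice_max` takes the ordering column explicitly, so there is no "Selecting by ..." guess to double-check.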


In [29]:
# for each month, calculate the number of flights and the change from the previous month
flights %>%
    group_by(Month) %>%
    summarise(flight_count = n()) %>%
    mutate(change = flight_count - lag(flight_count))

`summarise()` ungrouping output (override with `.groups` argument)



Month,flight_count,change
<int>,<int>,<int>
1,18910,
2,17128,-1782.0
3,19470,2342.0
4,18593,-877.0
5,19172,579.0
6,19600,428.0
7,20548,948.0
8,20176,-372.0
9,18065,-2111.0
10,18696,631.0


In [30]:
# rewrite more simply with the `tally` function
flights %>%
    group_by(Month) %>%
    tally() %>%
    mutate(change = n - lag(n))

Month,n,change
<int>,<int>,<int>
1,18910,
2,17128,-1782.0
3,19470,2342.0
4,18593,-877.0
5,19172,579.0
6,19600,428.0
7,20548,948.0
8,20176,-372.0
9,18065,-2111.0
10,18696,631.0


## Other Useful Convenience Functions

In [31]:
# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Speed
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>,<dbl>
2011,10,8,6,1816,2136,AA,1294,N3FSAA,140,⋯,51,IAH,MIA,964,5,11,0,,0,466.4516
2011,1,25,2,1655,1811,CO,763,N76523,76,⋯,0,IAH,MSY,305,7,22,0,,0,389.3617
2011,3,20,7,717,1038,CO,1590,N11641,141,⋯,-8,IAH,MIA,964,7,17,0,,0,494.359
2011,12,15,4,1349,1454,WN,30,N298WN,65,⋯,19,HOU,DAL,239,2,7,0,,0,256.0714
2011,8,6,6,1743,1853,WN,2516,N348SW,70,⋯,33,HOU,MAF,441,3,5,0,,0,426.7742


In [32]:
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,⋯,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Speed
<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<int>,⋯,<int>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<int>,<dbl>
2011,3,27,7,1754,1855,XE,2562,N16963,61,⋯,24,IAH,DAL,217,4,19,0,,0,342.6316
2011,3,12,6,1125,1235,XE,2082,N11127,130,⋯,5,IAH,ABQ,744,6,10,0,,0,391.5789
2011,10,16,7,1122,1222,XE,4262,N11565,60,⋯,-3,IAH,BRO,308,4,7,0,,0,377.1429
2011,4,27,3,1829,2006,OO,1123,N716SK,157,⋯,49,IAH,COS,809,6,24,0,,0,382.2047
2011,9,18,7,1543,1651,XE,4693,N14943,68,⋯,38,IAH,LRD,301,4,16,0,,0,376.2500
2011,1,17,1,2115,2236,XE,2648,N26141,141,⋯,-5,IAH,ABQ,744,5,14,0,,0,365.9016
2011,4,8,5,1059,1155,XE,2407,N13970,56,⋯,-1,IAH,LFT,201,4,18,0,,0,354.7059
2011,11,10,4,1450,1749,CO,1423,N76269,299,⋯,5,IAH,SEA,1874,3,29,0,,0,421.1236
2011,3,23,3,2105,2231,CO,299,N14120,146,⋯,0,IAH,DEN,862,7,14,0,,0,413.7600
2011,3,13,7,1903,2017,CO,417,N76514,194,⋯,8,IAH,SAN,1303,4,12,0,,0,439.2135
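
Because the samples are random, rerunning the two cells above gives different rows each time. A minimal sketch of making them reproducible with `set.seed`; the seed value and the `toy` tibble are arbitrary assumptions:

```r
library(dplyr)

# Toy data, assumed for illustration
toy <- tibble(id = 1:100)

set.seed(42)                 # arbitrary seed
a <- toy %>% sample_n(5)

set.seed(42)                 # same seed, so the same rows are drawn again
b <- toy %>% sample_n(5)

identical(a, b)

# A fraction of the rows, drawn with replacement
frac <- toy %>% sample_frac(0.25, replace = TRUE)
nrow(frac)
```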


In [33]:
str(flights)

tibble [227,496 × 22] (S3: tbl_df/tbl/data.frame)
 $ Year             : int [1:227496] 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ Month            : int [1:227496] 1 1 1 1 1 1 1 1 1 1 ...
 $ DayofMonth       : int [1:227496] 1 2 3 4 5 6 7 8 9 10 ...
 $ DayOfWeek        : int [1:227496] 6 7 1 2 3 4 5 6 7 1 ...
 $ DepTime          : int [1:227496] 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
 $ ArrTime          : int [1:227496] 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
 $ UniqueCarrier    : chr [1:227496] "AA" "AA" "AA" "AA" ...
 $ FlightNum        : int [1:227496] 428 428 428 428 428 428 428 428 428 428 ...
 $ TailNum          : chr [1:227496] "N576AA" "N557AA" "N541AA" "N403AA" ...
 $ ActualElapsedTime: int [1:227496] 60 60 70 70 62 64 70 59 71 70 ...
 $ AirTime          : int [1:227496] 40 45 48 39 44 45 43 40 41 45 ...
 $ ArrDelay         : int [1:227496] -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
 $ DepDelay         : int [1:227496] 0 1 -8 3 5 -1 -1 -5 43 

In [34]:
# dplyr approach: better formatting, and adapts to your screen width
glimpse(flights)

Rows: 227,496
Columns: 22
$ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 201…
$ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, …
$ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355, 144…
$ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454, 155…
$ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA…
$ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
$ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492AA", "…
$ ActualElapsedTime <int> …

## References
- https://rpubs.com/justmarkham/dplyr-tutorial
- https://rpubs.com/justmarkham/dplyr-tutorial-part-2
- https://rafalab.github.io/dsbook/
- [Official dplyr reference manual and vignettes on CRAN](http://cran.r-project.org/web/packages/dplyr/index.html)