# Data Manipulation in R using `dplyr`

## What is `dplyr`?
`dplyr` is a package that provides a set of tools for efficiently manipulating datasets in R. It is the next iteration of `plyr`, focusing only on data frames. With `dplyr`, anything you can do to a local data frame you can also do to a remote database table.

## Why `dplyr`?
- Great for data exploration and transformation
- Intuitive to write and easy to read, especially when using the “chaining” syntax (covered below)
- Fast on data frames

## `dplyr` functionality
- Five basic verbs: `filter`, `select`, `arrange`, `mutate`, `summarise` (plus `group_by`)
- Can work with data stored in databases and data tables
- Joins: inner join, left join, semi-join, anti-join
- Window functions for calculating ranking, offsets, and more
- Better than plyr if you’re only working with data frames (though it doesn’t yet duplicate all of the plyr functionality)
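
The four join types listed above can be sketched on two tiny tables. The tables `flights_small` and `carriers` and their contents are invented for illustration and are not part of the `hflights` data.

```r
library(dplyr)

# Hypothetical tables, invented for illustration
flights_small <- data.frame(
  carrier = c("AA", "UA", "ZZ"),
  flight  = c(428, 100, 999),
  stringsAsFactors = FALSE
)
carriers <- data.frame(
  carrier = c("AA", "UA", "WN"),
  name    = c("American", "United", "Southwest"),
  stringsAsFactors = FALSE
)

# inner join: only rows with a match in both tables
inner <- inner_join(flights_small, carriers, by = "carrier")

# left join: every row of the left table; unmatched rows get NA in `name`
left <- left_join(flights_small, carriers, by = "carrier")

# semi-join: rows of the left table that have a match (no columns are added)
semi <- semi_join(flights_small, carriers, by = "carrier")

# anti-join: rows of the left table that have no match
anti <- anti_join(flights_small, carriers, by = "carrier")
```

Semi-joins and anti-joins only filter the left table, which makes them handy for checking which keys would be lost or kept before running a real join.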

## Load Packages and Data 

In [2]:
install.packages('hflights', repos = 'http://cran.us.r-project.org')

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [4]:
# Load packages 
library(dplyr)
library(hflights)

In [6]:
# Explore data 
data(hflights)
head(hflights)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
5424,2011,1,1,6,1400,1500,AA,428,N576AA,60,...,-10,0,IAH,DFW,224,7,13,0,,0
5425,2011,1,2,7,1401,1501,AA,428,N557AA,60,...,-9,1,IAH,DFW,224,6,9,0,,0
5426,2011,1,3,1,1352,1502,AA,428,N541AA,70,...,-8,-8,IAH,DFW,224,5,17,0,,0
5427,2011,1,4,2,1403,1513,AA,428,N403AA,70,...,3,3,IAH,DFW,224,9,22,0,,0
5428,2011,1,5,3,1405,1507,AA,428,N492AA,62,...,-3,5,IAH,DFW,224,9,9,0,,0
5429,2011,1,6,4,1359,1503,AA,428,N262AA,64,...,-7,-1,IAH,DFW,224,6,13,0,,0


In [7]:
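# `tbl_df` creates a “local data frame”
# A local data frame is simply a wrapper for a data frame that prints nicely
# Note: `tbl_df` is deprecated as of dplyr 1.0.0; `as_tibble(hflights)` is the current equivalent
flights <- tbl_df(hflights)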

In [8]:
# Examine first few rows 
head(flights)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
2011,1,1,6,1400,1500,AA,428,N576AA,60,...,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,2,7,1401,1501,AA,428,N557AA,60,...,-9,1,IAH,DFW,224,6,9,0,,0
2011,1,3,1,1352,1502,AA,428,N541AA,70,...,-8,-8,IAH,DFW,224,5,17,0,,0
2011,1,4,2,1403,1513,AA,428,N403AA,70,...,3,3,IAH,DFW,224,9,22,0,,0
2011,1,5,3,1405,1507,AA,428,N492AA,62,...,-3,5,IAH,DFW,224,9,9,0,,0
2011,1,6,4,1359,1503,AA,428,N262AA,64,...,-7,-1,IAH,DFW,224,6,13,0,,0


In [9]:
# Examine last few rows 
tail(flights)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
2011,12,6,2,1307,1600,WN,471,N632SW,113,...,0,7,HOU,TPA,781,5,10,0,,0
2011,12,6,2,1818,2111,WN,1191,N284WN,113,...,-9,8,HOU,TPA,781,5,11,0,,0
2011,12,6,2,2047,2334,WN,1674,N366SW,107,...,4,7,HOU,TPA,781,4,9,0,,0
2011,12,6,2,912,1031,WN,127,N777QC,79,...,-4,-3,HOU,TUL,453,4,14,0,,0
2011,12,6,2,656,812,WN,621,N727SW,76,...,-13,-4,HOU,TUL,453,3,9,0,,0
2011,12,6,2,1600,1713,WN,1597,N745SW,73,...,-12,0,HOU,TUL,453,3,11,0,,0


## Command structure (for all dplyr verbs)
- first argument is a **data frame**
- return value is a data frame
- nothing is modified in place
- Note: dplyr generally does not **preserve row names**
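
A minimal sketch of these rules, using a three-row data frame `df` invented for illustration: the verb takes the data frame as its first argument, returns a new data frame, and leaves the original untouched.

```r
library(dplyr)

# Toy data, invented for illustration
df <- data.frame(x = c(3, 1, 2))

# The data frame is the first argument; the sorted copy is returned
result <- arrange(df, x)

is.data.frame(result)  # TRUE: the return value is a data frame
result$x               # 1 2 3
df$x                   # 3 1 2: the original was not modified in place
```

To keep a result, assign it to a name, as in `result <- arrange(df, x)`.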

## `filter`: Keep rows matching criteria

### `AND` Operator (`&`)

In [13]:
# Note: you can use a comma or an ampersand (&) to represent an AND condition
filter(flights, Month==1, DayofMonth==1)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
2011,1,1,6,1400,1500,AA,428,N576AA,60,...,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,1,6,728,840,AA,460,N520AA,72,...,5,8,IAH,DFW,224,6,25,0,,0
2011,1,1,6,1631,1736,AA,1121,N4WVAA,65,...,-9,1,IAH,DFW,224,16,12,0,,0
2011,1,1,6,1756,2112,AA,1294,N3DGAA,136,...,-3,1,IAH,MIA,964,9,14,0,,0
2011,1,1,6,1012,1347,AA,1700,N3DAAA,155,...,7,-8,IAH,MIA,964,12,26,0,,0
2011,1,1,6,1211,1325,AA,1820,N593AA,74,...,15,6,IAH,DFW,224,6,29,0,,0
2011,1,1,6,557,906,AA,1994,N3BBAA,129,...,-9,-3,IAH,MIA,964,5,11,0,,0
2011,1,1,6,1824,2106,AS,731,N614AS,282,...,-4,-1,IAH,SEA,1874,7,20,0,,0
2011,1,1,6,654,1124,B6,620,N324JB,210,...,5,-6,HOU,JFK,1428,6,23,0,,0
2011,1,1,6,1639,2110,B6,622,N324JB,211,...,61,54,HOU,JFK,1428,12,11,0,,0


In [14]:
# Using & 
filter(flights, Month==1 & DayofMonth==1)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
2011,1,1,6,1400,1500,AA,428,N576AA,60,...,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,1,6,728,840,AA,460,N520AA,72,...,5,8,IAH,DFW,224,6,25,0,,0
2011,1,1,6,1631,1736,AA,1121,N4WVAA,65,...,-9,1,IAH,DFW,224,16,12,0,,0
2011,1,1,6,1756,2112,AA,1294,N3DGAA,136,...,-3,1,IAH,MIA,964,9,14,0,,0
2011,1,1,6,1012,1347,AA,1700,N3DAAA,155,...,7,-8,IAH,MIA,964,12,26,0,,0
2011,1,1,6,1211,1325,AA,1820,N593AA,74,...,15,6,IAH,DFW,224,6,29,0,,0
2011,1,1,6,557,906,AA,1994,N3BBAA,129,...,-9,-3,IAH,MIA,964,5,11,0,,0
2011,1,1,6,1824,2106,AS,731,N614AS,282,...,-4,-1,IAH,SEA,1874,7,20,0,,0
2011,1,1,6,654,1124,B6,620,N324JB,210,...,5,-6,HOU,JFK,1428,6,23,0,,0
2011,1,1,6,1639,2110,B6,622,N324JB,211,...,61,54,HOU,JFK,1428,12,11,0,,0


### `OR` Operator (`|`)

In [15]:
# Use the vertical bar (|) for an OR condition
filter(flights, UniqueCarrier == "AA" | UniqueCarrier == "UA" )

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
2011,1,1,6,1400,1500,AA,428,N576AA,60,...,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,2,7,1401,1501,AA,428,N557AA,60,...,-9,1,IAH,DFW,224,6,9,0,,0
2011,1,3,1,1352,1502,AA,428,N541AA,70,...,-8,-8,IAH,DFW,224,5,17,0,,0
2011,1,4,2,1403,1513,AA,428,N403AA,70,...,3,3,IAH,DFW,224,9,22,0,,0
2011,1,5,3,1405,1507,AA,428,N492AA,62,...,-3,5,IAH,DFW,224,9,9,0,,0
2011,1,6,4,1359,1503,AA,428,N262AA,64,...,-7,-1,IAH,DFW,224,6,13,0,,0
2011,1,7,5,1359,1509,AA,428,N493AA,70,...,-1,-1,IAH,DFW,224,12,15,0,,0
2011,1,8,6,1355,1454,AA,428,N477AA,59,...,-16,-5,IAH,DFW,224,7,12,0,,0
2011,1,9,7,1443,1554,AA,428,N476AA,71,...,44,43,IAH,DFW,224,8,22,0,,0
2011,1,10,1,1443,1553,AA,428,N504AA,70,...,43,43,IAH,DFW,224,6,19,0,,0


### `%in%` Operator

In [16]:
# Use of %in%  operator 
filter(flights, UniqueCarrier %in% c("AA", "UA"))

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
2011,1,1,6,1400,1500,AA,428,N576AA,60,...,-10,0,IAH,DFW,224,7,13,0,,0
2011,1,2,7,1401,1501,AA,428,N557AA,60,...,-9,1,IAH,DFW,224,6,9,0,,0
2011,1,3,1,1352,1502,AA,428,N541AA,70,...,-8,-8,IAH,DFW,224,5,17,0,,0
2011,1,4,2,1403,1513,AA,428,N403AA,70,...,3,3,IAH,DFW,224,9,22,0,,0
2011,1,5,3,1405,1507,AA,428,N492AA,62,...,-3,5,IAH,DFW,224,9,9,0,,0
2011,1,6,4,1359,1503,AA,428,N262AA,64,...,-7,-1,IAH,DFW,224,6,13,0,,0
2011,1,7,5,1359,1509,AA,428,N493AA,70,...,-1,-1,IAH,DFW,224,12,15,0,,0
2011,1,8,6,1355,1454,AA,428,N477AA,59,...,-16,-5,IAH,DFW,224,7,12,0,,0
2011,1,9,7,1443,1554,AA,428,N476AA,71,...,44,43,IAH,DFW,224,8,22,0,,0
2011,1,10,1,1443,1553,AA,428,N504AA,70,...,43,43,IAH,DFW,224,6,19,0,,0
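
The equivalences shown in this section can be checked on a small data frame. The `toy` table is invented for illustration; it also includes one missing value to show that `filter` drops rows where the condition evaluates to `NA`.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  Month         = c(1, 1, 2, 2),
  DayofMonth    = c(1, 2, 1, NA),
  UniqueCarrier = c("AA", "UA", "WN", "AA"),
  stringsAsFactors = FALSE
)

# A comma and & both express AND
by_comma <- filter(toy, Month == 1, DayofMonth == 1)
by_amp   <- filter(toy, Month == 1 & DayofMonth == 1)
all.equal(by_comma, by_amp)  # TRUE

# %in% is a compact way to write several == comparisons joined by |
by_or <- filter(toy, UniqueCarrier == "AA" | UniqueCarrier == "UA")
by_in <- filter(toy, UniqueCarrier %in% c("AA", "UA"))
all.equal(by_or, by_in)      # TRUE

# The row with a missing DayofMonth is dropped, not kept
day_one <- filter(toy, DayofMonth == 1)
nrow(day_one)                # 2
```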


## `select`: Pick columns by name 
- dplyr approach uses similar syntax to filter

In [19]:
# Selecting columns
select(flights, DepTime, ArrTime, FlightNum)

DepTime,ArrTime,FlightNum
1400,1500,428
1401,1501,428
1352,1502,428
1403,1513,428
1405,1507,428
1359,1503,428
1359,1509,428
1355,1454,428
1443,1554,428
1443,1553,428


### `contains`

In [20]:
# Use colon to select multiple contiguous columns, and use `contains` to match columns by name
# Note: `starts_with`, `ends_with`, and `matches` 
# (for regular expressions) can also be used to match columns by name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))

Year,Month,DayofMonth,TaxiIn,TaxiOut,ArrDelay,DepDelay
2011,1,1,7,13,-10,0
2011,1,2,6,9,-9,1
2011,1,3,5,17,-8,-8
2011,1,4,9,22,3,3
2011,1,5,9,9,-3,5
2011,1,6,6,13,-7,-1
2011,1,7,12,15,-1,-1
2011,1,8,7,12,-16,-5
2011,1,9,8,22,44,43
2011,1,10,6,19,43,43
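
The other helpers mentioned in the comment above work the same way as `contains`. This is a small sketch on a one-row data frame, `toy`, invented for illustration with the same column names as the output above.

```r
library(dplyr)

# One-row toy data, invented for illustration
toy <- data.frame(
  Year = 2011, Month = 1, DayofMonth = 1,
  TaxiIn = 7, TaxiOut = 13, ArrDelay = -10, DepDelay = 0
)

names(select(toy, starts_with("Taxi")))  # "TaxiIn" "TaxiOut"
names(select(toy, ends_with("Delay")))   # "ArrDelay" "DepDelay"

# `matches` takes a regular expression (case-insensitive by default)
names(select(toy, matches("month$")))    # "Month" "DayofMonth"

# A leading minus drops the matching columns instead of keeping them
names(select(toy, -contains("Delay")))
```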


## “Chaining” or “Pipelining”
- The usual way to perform multiple operations in one line is by nesting
- Chaining lets you write commands in a natural order by using the `%>%` infix operator (which can be pronounced as “then”)

In [21]:
# Nesting Method 
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

UniqueCarrier,DepDelay
AA,90
AA,67
AA,74
AA,125
AA,82
AA,99
AA,70
AA,61
AA,74
AS,73


In [38]:
# Chaining method 
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60) %>%
    head()

UniqueCarrier,DepDelay
AA,90
AA,67
AA,74
AA,125
AA,82
AA,99
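
The pipe passes its left-hand side as the first argument of the function on its right, so `x %>% f(y)` is the same call as `f(x, y)`. A minimal check that the nested and chained forms agree, on a toy data frame invented for illustration:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "AS", "UA"),
  DepDelay      = c(90, 10, 73, 61),
  ArrDelay      = c(85, 2, 70, 55),
  stringsAsFactors = FALSE
)

# Nested: read from the inside out
nested <- filter(select(toy, UniqueCarrier, DepDelay), DepDelay > 60)

# Chained: read top to bottom as "select, then filter"
piped <- toy %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60)

all.equal(nested, piped)  # TRUE
```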


## `arrange`: Reorder rows

In [39]:
# Ascending order
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(DepDelay) %>%
    head()

UniqueCarrier,DepDelay
OO,-33
MQ,-23
XE,-19
XE,-19
CO,-18
EV,-18


In [40]:
# Use `desc` for descending
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay)) %>%
    head()

UniqueCarrier,DepDelay
CO,981
AA,970
MQ,931
UA,869
MQ,814
MQ,803
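
`arrange` also accepts several sort keys, applied left to right. The sketch below uses a toy data frame invented for illustration; it sorts by carrier and then by descending delay, and shows that missing values are placed at the end of their group even inside `desc`.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  UniqueCarrier = c("UA", "AA", "AA", "UA"),
  DepDelay      = c(5, 20, NA, 50),
  stringsAsFactors = FALSE
)

# First key: carrier (ascending); second key: delay (descending)
sorted <- arrange(toy, UniqueCarrier, desc(DepDelay))

sorted$UniqueCarrier  # "AA" "AA" "UA" "UA"
sorted$DepDelay       # 20 NA 50 5
```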


## `mutate`: Add new variable

In [41]:
# Add a new variable and print the result without storing it
flights %>%
    select(Distance, AirTime) %>%
    mutate(Speed = Distance/AirTime * 60) %>%
    head()

Distance,AirTime,Speed
224,40,336.0
224,45,298.6667
224,48,280.0
224,39,344.6154
224,44,305.4545
224,45,298.6667


In [28]:
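# Store the new variable
# Note: this cell multiplies by 600 rather than the 60 used in the previous cell,
# so the stored Speed values are ten times those printed above
flights <- flights %>% mutate(Speed = Distance/AirTime * 600)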

In [29]:
# See Dataset
head(flights)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Speed
2011,1,1,6,1400,1500,AA,428,N576AA,60,...,0,IAH,DFW,224,7,13,0,,0,3360.0
2011,1,2,7,1401,1501,AA,428,N557AA,60,...,1,IAH,DFW,224,6,9,0,,0,2986.667
2011,1,3,1,1352,1502,AA,428,N541AA,70,...,-8,IAH,DFW,224,5,17,0,,0,2800.0
2011,1,4,2,1403,1513,AA,428,N403AA,70,...,3,IAH,DFW,224,9,22,0,,0,3446.154
2011,1,5,3,1405,1507,AA,428,N492AA,62,...,5,IAH,DFW,224,9,9,0,,0,3054.545
2011,1,6,4,1359,1503,AA,428,N262AA,64,...,-1,IAH,DFW,224,6,13,0,,0,2986.667
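
A single `mutate` call can create several variables, and later ones may refer to earlier ones. This is a small sketch on toy data invented for illustration; the intermediate `Hours` column is a hypothetical helper, and the sketch assumes distance in miles and air time in minutes.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(Distance = c(224, 964), AirTime = c(40, 120))

with_speed <- toy %>%
    mutate(
        Hours = AirTime / 60,     # convert minutes to hours
        Speed = Distance / Hours  # uses the Hours column created just above
    )

with_speed$Speed  # 336 482
ncol(toy)         # 2: the original data frame is unchanged
```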


## `summarise`: Reduce variables to values
- Primarily useful with data that has been grouped by one or more variables
- `group_by` creates the groups that will be operated on
- `summarise` uses the provided aggregation function to summarise each group

In [42]:
# create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE)) %>%
    head()

Dest,avg_delay
ABQ,7.226259
AEX,5.839437
AGS,4.0
AMA,6.840095
ANC,26.080645
ASE,6.794643


In [43]:
# summarise_each allows you to apply the same summary function to multiple columns at once
# Note: mutate_each is also available
# (both are deprecated in recent dplyr releases in favour of `across()`)

# for each carrier, calculate the proportion of flights cancelled or diverted
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(mean), Cancelled, Diverted) %>%
    head()

UniqueCarrier,Cancelled,Diverted
AA,0.018495684,0.001849568
AS,0.0,0.002739726
B6,0.025899281,0.005755396
CO,0.006782614,0.00262737
DL,0.015903067,0.003029156
EV,0.034482759,0.003176044


In [44]:
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay")) %>%
    head()

UniqueCarrier,ArrDelay_min,DepDelay_min,ArrDelay_max,DepDelay_max
AA,-39,-15,978,970
AS,-43,-15,183,172
B6,-44,-14,335,310
CO,-55,-18,957,981
DL,-32,-17,701,730
EV,-40,-18,469,479
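
In recent releases, `summarise_each` and `funs` are deprecated. Assuming dplyr 1.0.0 or later, the two summaries above could be sketched with `across()` as follows. The `toy` data frame is invented for illustration, and the output columns come out grouped by input column rather than by function.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "UA", "UA"),
  Cancelled     = c(0, 1, 0, 0),
  Diverted      = c(0, 0, 0, 1),
  ArrDelay      = c(-5, 30, NA, 12),
  DepDelay      = c(0, 25, 3, NA),
  stringsAsFactors = FALSE
)

# Same function applied to several named columns
rates <- toy %>%
    group_by(UniqueCarrier) %>%
    summarise(across(c(Cancelled, Diverted), mean))

# Several functions applied to every column whose name matches "Delay"
ranges <- toy %>%
    group_by(UniqueCarrier) %>%
    summarise(across(
        matches("Delay"),
        list(min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))
    ))

names(ranges)
# "UniqueCarrier" "ArrDelay_min" "ArrDelay_max" "DepDelay_min" "DepDelay_max"
```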


- Helper function `n()` counts the number of rows in a group
- Helper function `n_distinct(vector)` counts the number of unique items in that vector

In [45]:
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(flight_count = n()) %>%
    arrange(desc(flight_count)) %>%
    head()

Month,DayofMonth,flight_count
8,4,706
8,11,706
8,12,706
8,5,705
8,3,704
8,10,704


In [46]:
# rewrite more simply with the `tally` function
flights %>%
    group_by(Month, DayofMonth) %>%
    tally(sort = TRUE) %>%
    head()

Month,DayofMonth,n
8,4,706
8,11,706
8,12,706
8,5,705
8,3,704
8,10,704


In [47]:
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum)) %>%
    head()

Dest,flight_count,plane_count
ABQ,2812,716
AEX,724,215
AGS,1,1
AMA,1297,158
ANC,125,38
ASE,125,60
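
The counting helpers are easy to verify by hand on a small table. The sketch below uses toy data invented for illustration, and also shows `count()`, a dplyr shortcut that combines `group_by` and `tally` in one step.

```r
library(dplyr)

# Toy data, invented for illustration: four flights, three distinct planes
toy <- data.frame(
  Dest    = c("ABQ", "ABQ", "ABQ", "AEX"),
  TailNum = c("N1", "N1", "N2", "N3"),
  stringsAsFactors = FALSE
)

# n() counts rows per group; n_distinct() counts unique values per group
per_dest <- toy %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum))

per_dest$flight_count  # 3 1
per_dest$plane_count   # 2 1

# count() groups, tallies, and (optionally) sorts in a single call
counted <- toy %>% count(Dest, sort = TRUE)
counted$n              # 3 1
```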


In [37]:
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
    group_by(Dest) %>%
    select(Cancelled) %>%
    table() %>%
    head()

Adding missing grouping variables: `Dest`


     Cancelled
Dest     0    1
  ABQ 2787   25
  AEX  712   12
  AGS    1    0
  AMA 1265   32
  ANC  125    0
  ASE  120    5

## Window Functions
- An aggregation function (like `mean`) takes n inputs and returns 1 value
- A window function takes n inputs and returns n values
- Includes ranking and ordering functions (like `min_rank`), offset functions (`lead` and `lag`), and cumulative aggregates (like `cummean`)
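
The difference between the two kinds of function shows up on a plain vector. The values in `x` are invented for illustration.

```r
library(dplyr)

x <- c(10, 30, 20, 30)

# Aggregation: four inputs, one output
avg <- mean(x)                  # 22.5

# Window functions: four inputs, four outputs
ranks      <- min_rank(x)       # 1 3 2 3 (ties share the lowest rank)
ranks_desc <- min_rank(desc(x)) # 4 1 3 1 (largest value ranked first)
previous   <- dplyr::lag(x)     # NA 10 30 20
following  <- dplyr::lead(x)    # 30 20 30 NA
running    <- cummean(x)        # 10 20 20 22.5
```

`lag` is called as `dplyr::lag` here only to make it explicit that the dplyr version is meant, not `stats::lag`.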

In [48]:
# for each carrier, calculate which two days of the year they had their longest departure delays
# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    filter(min_rank(desc(DepDelay)) <= 2) %>%
    arrange(UniqueCarrier, desc(DepDelay))

Adding missing grouping variables: `UniqueCarrier`


UniqueCarrier,Month,DayofMonth,DepDelay
AA,12,12,970
AA,11,19,677
AS,2,28,172
AS,7,6,138
B6,10,29,310
B6,8,19,283
CO,8,1,981
CO,1,20,780
DL,10,25,730
DL,4,5,497


In [49]:
# rewrite more simply with the `top_n` function
flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    top_n(2) %>%
    arrange(UniqueCarrier, desc(DepDelay))

Adding missing grouping variables: `UniqueCarrier`
Selecting by DepDelay


UniqueCarrier,Month,DayofMonth,DepDelay
AA,12,12,970
AA,11,19,677
AS,2,28,172
AS,7,6,138
B6,10,29,310
B6,8,19,283
CO,8,1,981
CO,1,20,780
DL,10,25,730
DL,4,5,497
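
Assuming dplyr 1.0.0 or later, `slice_max` offers another way to write the same query, with the ordering column named explicitly. This is a sketch on toy data invented for illustration, compared against the `min_rank` version.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "AA", "AS", "AS", "AS"),
  DepDelay      = c(970, 677, 12, 172, 138, -3),
  stringsAsFactors = FALSE
)

# Ranking approach, as in the cell above
by_rank <- toy %>%
    group_by(UniqueCarrier) %>%
    filter(min_rank(desc(DepDelay)) <= 2) %>%
    arrange(UniqueCarrier, desc(DepDelay))

# slice_max keeps the n largest rows per group (ties are kept by default)
by_slice <- toy %>%
    group_by(UniqueCarrier) %>%
    slice_max(DepDelay, n = 2)

by_slice$DepDelay  # 970 677 172 138
```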


In [50]:
# for each month, calculate the number of flights and the change from the previous month
flights %>%
    group_by(Month) %>%
    summarise(flight_count = n()) %>%
    mutate(change = flight_count - lag(flight_count))

Month,flight_count,change
1,18910,
2,17128,-1782.0
3,19470,2342.0
4,18593,-877.0
5,19172,579.0
6,19600,428.0
7,20548,948.0
8,20176,-372.0
9,18065,-2111.0
10,18696,631.0


In [51]:
# rewrite more simply with the `tally` function
flights %>%
    group_by(Month) %>%
    tally() %>%
    mutate(change = n - lag(n))

Month,n,change
1,18910,
2,17128,-1782.0
3,19470,2342.0
4,18593,-877.0
5,19172,579.0
6,19600,428.0
7,20548,948.0
8,20176,-372.0
9,18065,-2111.0
10,18696,631.0
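
The empty `change` value in the first row is an `NA`: `lag` has no previous month to look at. The sketch below reuses the first three monthly counts from the output above and adds a percentage change; the `pct_change` column is a hypothetical extension, not part of the original analysis.

```r
library(dplyr)

# First three monthly counts, copied from the output above
monthly <- data.frame(Month = 1:3, n = c(18910, 17128, 19470))

with_change <- monthly %>%
    mutate(
        change     = n - lag(n),            # NA for the first month
        pct_change = 100 * change / lag(n)  # relative to the previous month
    )

with_change$change  # NA -1782 2342
```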


## Other Useful Convenience Functions

In [52]:
# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Speed
2011,12,21,3,1946,2113,XE,4580,N14558,87,...,6,IAH,BHM,562,5,13,0,,0,4886.957
2011,11,18,5,1518,1731,CO,1511,N14230,133,...,48,IAH,ORD,925,9,10,0,,0,4868.421
2011,9,23,5,1838,1934,CO,5,N23708,56,...,48,IAH,MSY,305,3,10,0,,0,4255.814
2011,5,30,1,1435,1537,XE,2591,N12934,62,...,5,IAH,DAL,217,10,14,0,,0,3426.316
2011,8,8,1,1302,1411,CO,1576,N73275,69,...,-3,IAH,MSY,305,2,19,0,,0,3812.5


In [53]:
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)

Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Speed
2011,11,29,2,859,1038,CO,1724,N77520,219,...,-1,IAH,SNA,1346,10,16,0,,0,4184.456
2011,2,10,4,742,1210,CO,408,N14604,208,...,-8,IAH,EWR,1400,8,22,0,,0,4719.101
2011,8,31,3,1551,1705,CO,1733,N76515,194,...,-4,IAH,LAX,1379,9,13,0,,0,4810.465
2011,7,26,2,1043,1359,XE,2520,N19554,136,...,-9,IAH,CVG,871,5,11,0,,0,4355.000
2011,11,1,2,1901,2001,WN,2219,N472WN,60,...,-4,HOU,MSY,302,4,11,0,,0,4026.667
2011,5,12,4,1500,1716,WN,565,N520SW,136,...,40,HOU,MDW,937,5,10,0,,0,4646.281
2011,12,5,1,1856,2310,CO,1111,N27724,194,...,6,IAH,DCA,1208,14,34,0,,0,4964.384
2011,10,27,4,1148,1258,WN,15,N475WN,190,...,13,HOU,LAS,1235,3,9,0,,0,4162.921
2011,4,14,4,859,1000,WN,1204,N241WN,121,...,-1,HOU,ELP,677,3,13,0,,0,3868.571
2011,11,19,6,1256,1434,WN,1457,N692SW,98,...,6,HOU,BNA,670,5,6,0,,0,4620.690
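
Because the rows are drawn at random, the output of these two cells changes on every run. A minimal sketch of making a sample reproducible with `set.seed`, on a toy data frame invented for illustration (the seed value 42 is arbitrary). In dplyr 1.0.0 and later, `slice_sample()` is the newer interface for the same task.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(id = 1:20)

# Fixing the seed before sampling makes the draw repeatable
set.seed(42)
five <- toy %>% sample_n(5)
set.seed(42)
five_again <- toy %>% sample_n(5)
identical(five$id, five_again$id)  # TRUE

# Without replacement, no row can appear twice
anyDuplicated(five$id)             # 0

# 25% of 20 rows, with replacement (duplicates are possible)
boot <- toy %>% sample_frac(0.25, replace = TRUE)
nrow(boot)                         # 5
```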


In [54]:
# base R approach: view the structure of an object
str(flights)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	227496 obs. of  22 variables:
 $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
 $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
 $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
 $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
 $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
 $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
 $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
 $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
 $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
 $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
 $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
 $ Dest             : chr  "DFW" "DFW" "DFW" "

In [55]:
# dplyr approach: better formatting, and adapts to your screen width
glimpse(flights)

Observations: 227,496
Variables: 22
$ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 201…
$ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, …
$ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355, 144…
$ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454, 155…
$ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA…
$ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
$ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492AA", "…
$ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56, 63,…
$ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41, 44,…
$ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5, -9,…
$ DepDelay      

## Resources
- https://rpubs.com/justmarkham/dplyr-tutorial
- https://rpubs.com/justmarkham/dplyr-tutorial-part-2
- https://rafalab.github.io/dsbook/
- [Official dplyr reference manual and vignettes on CRAN](http://cran.r-project.org/web/packages/dplyr/index.html)