### Load libraries

In [1]:
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




## Data preparation task
- What is the average arrival delay of each airline?
- Which airline is the worst in terms of average arrival delay?

### Load data

In [2]:
flights_data <- read.csv('L2-data/flights.csv')
airlines_data <- read.csv('L2-data/airlines.csv')

In [3]:
head(flights_data)

,X,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>
1,1,2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2,2,2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
3,3,2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
4,4,2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
5,5,2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
6,6,2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


In [4]:
head(airlines_data)

,X,carrier,name
,<int>,<chr>,<chr>
1,1,9E,Endeavor Air Inc.
2,2,AA,American Airlines Inc.
3,3,AS,Alaska Airlines Inc.
4,4,B6,JetBlue Airways
5,5,DL,Delta Air Lines Inc.
6,6,EV,ExpressJet Airlines Inc.
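
Both tables carry an extra `X` column that looks like a row index saved along with the CSV. Assuming the files were written with `write.csv()` and its default row names (an assumption; the notebook does not say how they were produced), the sketch below reproduces the effect on a temporary file and shows one way to avoid it when reading. The toy data frame is illustrative only.

```r
# Toy table (two rows copied from the airlines output above)
toy <- data.frame(
  carrier = c("9E", "AA"),
  name    = c("Endeavor Air Inc.", "American Airlines Inc.")
)

tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp)                  # row names are written by default

# The unnamed first column comes back as `X`
with_index <- read.csv(tmp)
names(with_index)

# Treating the first column as row names avoids the extra column
without_index <- read.csv(tmp, row.names = 1)
names(without_index)

unlink(tmp)                          # clean up the temporary file
```

Dropping the column later with `select(airlines_data, -X)`, as this notebook does, gives the same result.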


### Subset columns
- select `arr_delay` and `carrier`, the only columns relevant to the task

In [5]:
flights_select <- select(flights_data, arr_delay, carrier)
head(flights_select)

,arr_delay,carrier
,<int>,<chr>
1,11,UA
2,20,UA
3,33,AA
4,-18,B6
5,-25,DL
6,12,UA


### Filter rows
- drop flights with `NA` arrival delay
- drop flights that arrived on time or earlier than scheduled

In [6]:
flights_select <- filter(flights_select, !is.na(arr_delay) & arr_delay > 0)
head(flights_select)

,arr_delay,carrier
,<int>,<chr>
1,11,UA
2,20,UA
3,33,AA
4,12,UA
5,19,B6
6,8,AA
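
A small sketch of how `filter()` treats missing values, using toy data rather than `flights.csv` (the values are made up for illustration). `filter()` keeps only rows where the condition is `TRUE`, so rows where `arr_delay > 0` evaluates to `NA` are dropped even without the explicit `!is.na()` check. Keeping the check, as the cell above does, simply makes the intent visible.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from flights.csv)
toy <- data.frame(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  arr_delay = c(11L, NA, 33L, -18L, 0L)
)

# Explicit NA check, as in the notebook cell
explicit <- filter(toy, !is.na(arr_delay) & arr_delay > 0)

# Same comparison without the NA check
implicit <- filter(toy, arr_delay > 0)

# Both keep only the two late flights
explicit
all.equal(explicit, implicit)
```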


### Compute mean delay
- group the filtered data by carrier and compute the mean arrival delay of each group (late flights only)

In [7]:
mean_arr_delay <- group_by(flights_select, carrier) %>% summarise(mean_delay=mean(arr_delay))
mean_arr_delay

carrier,mean_delay
<chr>,<dbl>
9E,49.27271
AA,38.26555
AS,34.36508
B6,40.00906
DL,37.74356
EV,48.26858
F9,47.57908
FL,41.09446
HA,35.03093
MQ,37.85205
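
Because early and on-time flights were filtered out beforehand, `mean_delay` is the mean over late flights only. The sketch below, on made-up toy data, contrasts that with the mean over all flights that have a recorded delay. The column names `mean_all`, `mean_late`, and `n_late` are hypothetical and do not appear in the notebook.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- data.frame(
  carrier   = c("UA", "UA", "UA", "AA", "AA", "AA"),
  arr_delay = c(30L, -10L, NA, 20L, 40L, -6L)
)

delay_summary <- toy %>%
  group_by(carrier) %>%
  summarise(
    # early arrivals count as negative delay
    mean_all  = mean(arr_delay, na.rm = TRUE),
    # late flights only, matching the notebook's definition
    mean_late = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    # number of late flights behind mean_late
    n_late    = sum(arr_delay > 0, na.rm = TRUE)
  )

delay_summary
```

Which definition fits depends on the question being asked; the rest of this notebook uses the late-flights-only mean.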


### Sort data
- sort the aggregated data by mean arrival delay in descending order

In [8]:
mean_arr_delay <- arrange(mean_arr_delay, desc(mean_delay))
mean_arr_delay

carrier,mean_delay
<chr>,<dbl>
OO,60.6
YV,51.0814
9E,49.27271
EV,48.26858
F9,47.57908
VX,43.84708
FL,41.09446
WN,40.74755
B6,40.00906
AA,38.26555
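
If only the worst carrier is needed, sorting the whole table is optional. A minimal sketch, assuming dplyr 1.0.0 or later (where `slice_max()` is available) and using three rows copied from the output above:

```r
library(dplyr)

# Three rows copied from the sorted output above
mean_arr_delay_toy <- data.frame(
  carrier    = c("9E", "OO", "YV"),
  mean_delay = c(49.27271, 60.6, 51.0814)
)

# Keep the single row with the largest mean delay
worst <- slice_max(mean_arr_delay_toy, mean_delay, n = 1)
worst
```

`slice_min()` works the same way for the smallest value.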


### Join data
- join the aggregated data with the `airlines` table to add the full airline name for each carrier code

In [9]:
joined <- left_join(mean_arr_delay, select(airlines_data, -X), by='carrier')
joined

carrier,mean_delay,name
<chr>,<dbl>,<chr>
OO,60.6,SkyWest Airlines Inc.
YV,51.0814,Mesa Airlines Inc.
9E,49.27271,Endeavor Air Inc.
EV,48.26858,ExpressJet Airlines Inc.
F9,47.57908,Frontier Airlines Inc.
VX,43.84708,Virgin America
FL,41.09446,AirTran Airways Corporation
WN,40.74755,Southwest Airlines Co.
B6,40.00906,JetBlue Airways
AA,38.26555,American Airlines Inc.
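
A short sketch of why `left_join()` is a reasonable choice here: it keeps every carrier from the aggregated table even when the lookup table has no matching name. The carrier code `ZZ` and its delay are hypothetical, added only to show an unmatched row.

```r
library(dplyr)

# Two rows from the output above plus a hypothetical unmatched carrier "ZZ"
delays <- data.frame(
  carrier    = c("OO", "YV", "ZZ"),
  mean_delay = c(60.6, 51.0814, 12.5)
)

# Lookup table with the stray index column X, as in airlines.csv
airlines <- data.frame(
  X       = 1:2,
  carrier = c("OO", "YV"),
  name    = c("SkyWest Airlines Inc.", "Mesa Airlines Inc.")
)

# left_join keeps "ZZ" with a missing name
joined_left <- left_join(delays, select(airlines, -X), by = "carrier")

# inner_join would silently drop it
joined_inner <- inner_join(delays, select(airlines, -X), by = "carrier")

joined_left
joined_inner
```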


### Pipeline
- combine all the data preparation steps into a single pipeline

In [10]:
flights_data %>%
  select(carrier, arr_delay) %>%
  filter(!is.na(arr_delay) & arr_delay > 0) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(arr_delay)) %>%
  arrange(desc(mean_delay)) %>%
  left_join(select(airlines_data, -X), by = 'carrier')

carrier,mean_delay,name
<chr>,<dbl>,<chr>
OO,60.6,SkyWest Airlines Inc.
YV,51.0814,Mesa Airlines Inc.
9E,49.27271,Endeavor Air Inc.
EV,48.26858,ExpressJet Airlines Inc.
F9,47.57908,Frontier Airlines Inc.
VX,43.84708,Virgin America
FL,41.09446,AirTran Airways Corporation
WN,40.74755,Southwest Airlines Co.
B6,40.00906,JetBlue Airways
AA,38.26555,American Airlines Inc.
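
The same steps could also be wrapped in a function so they can be tried on small inputs. This is only a sketch: `mean_late_delay()` is a hypothetical helper, not something defined in the notebook, and the toy tables below are made up.

```r
library(dplyr)

# Hypothetical helper that mirrors the pipeline above
mean_late_delay <- function(flights, airlines) {
  flights %>%
    select(carrier, arr_delay) %>%
    filter(!is.na(arr_delay) & arr_delay > 0) %>%
    group_by(carrier) %>%
    summarise(mean_delay = mean(arr_delay)) %>%
    arrange(desc(mean_delay)) %>%
    left_join(airlines, by = "carrier")
}

# Toy inputs (hypothetical values)
toy_flights <- data.frame(
  carrier   = c("UA", "UA", "AA", "AA", "DL"),
  arr_delay = c(10L, 30L, 50L, NA, -5L)
)
toy_airlines <- data.frame(
  carrier = c("AA", "UA", "DL"),
  name    = c("American Airlines Inc.", "United Air Lines Inc.", "Delta Air Lines Inc.")
)

result <- mean_late_delay(toy_flights, toy_airlines)
result
```

Note that `DL` disappears from the result because it has no late flights in the toy data; carriers without any late arrival are not ranked by this pipeline.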


### Conclusion
- among flights that arrived late, US Airways has the lowest mean arrival delay
- SkyWest Airlines has the highest mean arrival delay, making it the worst airline by this measure
- the mean arrival delay of SkyWest Airlines (the worst) is about double that of US Airways (the best)
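
The best and worst carriers and their ratio can be read off the aggregated table in one step. A minimal sketch on toy values: the `OO` and `AA` rows are copied from the output above, while `ZZ` and its mean delay are hypothetical stand-ins for the best carrier, whose row is not shown in the displayed output.

```r
library(dplyr)

# OO and AA copied from the output above; ZZ is a hypothetical best carrier
toy_means <- data.frame(
  carrier    = c("OO", "AA", "ZZ"),
  mean_delay = c(60.6, 38.26555, 30.3)
)

extremes <- toy_means %>%
  summarise(
    worst = carrier[which.max(mean_delay)],    # highest mean delay
    best  = carrier[which.min(mean_delay)],    # lowest mean delay
    ratio = max(mean_delay) / min(mean_delay)  # worst relative to best
  )

extremes
```

Running the same `summarise()` on the full `mean_arr_delay` table would give the figures behind the conclusions above.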