# Flight delays

https://www.kaggle.com/usdot/flight-delays

### Description of data 

- YEAR, MONTH, DAY, DAY_OF_WEEK: date of the flight
- AIRLINE: an identifier assigned by US DOT to identify a unique airline; in this dataset it appears as a two-letter code such as AA or AS
- ORIGIN_AIRPORT and DESTINATION_AIRPORT: codes assigned by IATA to identify the airports
- SCHEDULED_DEPARTURE and SCHEDULED_ARRIVAL: scheduled times of take-off and landing
- DEPARTURE_TIME and ARRIVAL_TIME: actual times at which take-off and landing took place
- DEPARTURE_DELAY and ARRIVAL_DELAY: difference (in minutes) between scheduled and actual times; negative values mean the flight was early
- DISTANCE: distance (in miles)

### Exercises

1. Combine YEAR, MONTH and DAY into a datetime.date object
2. Convert SCHEDULED_DEPARTURE to a datetime.time object
3. Merge the airlines and the flights datasets
4. For each airline, find the min, max, mean and count of DEPARTURE_DELAY
5. How many airports are there? How many airports does each airline visit?
6. What is the average delay by airport?
7. What is the average delay by time of departure?


In [39]:
import pandas as pd
import os
import numpy as np
import datetime as dt

folder = '../datasets/flight-delays'


In [15]:

airlines = pd.read_csv(os.path.join(folder, 'airlines.csv'))
airports = pd.read_csv(os.path.join(folder, 'airports.csv'))
flights = pd.read_csv(os.path.join(folder, 'flights.csv'))

In [None]:
## 1. Combine YEAR, MONTH and DAY into a datetime.date object
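
One possible approach to exercises 1 and 2, sketched on a small hand-made frame rather than the real `flights` data. The toy values, the `DATE` and `SCHEDULED_DEPARTURE_TIME` column names and the `hhmm_to_time` helper are all assumptions made for illustration; the sketch also assumes SCHEDULED_DEPARTURE is stored as an HHMM integer, which is what the `flights.head()` output suggests.

```python
import datetime as dt
import pandas as pd

# Toy rows standing in for flights.csv (values invented for illustration).
toy = pd.DataFrame({
    "YEAR": [2015, 2015, 2015],
    "MONTH": [1, 1, 12],
    "DAY": [1, 2, 31],
    "SCHEDULED_DEPARTURE": [5, 1430, 2359],  # assumed HHMM stored as an integer
})

# Exercise 1: pd.to_datetime can assemble timestamps from year/month/day
# columns (renamed to lowercase here); .dt.date yields datetime.date objects.
parts = toy[["YEAR", "MONTH", "DAY"]].rename(columns=str.lower)
toy["DATE"] = pd.to_datetime(parts).dt.date


def hhmm_to_time(value):
    """Convert an HHMM number such as 1430 to datetime.time(14, 30)."""
    if pd.isna(value):
        return None
    hours, minutes = divmod(int(value), 100)
    # Assumption: a value of 2400 means midnight, so wrap it to 00:00.
    return dt.time(hours % 24, minutes)


# Exercise 2: apply the helper row by row.
toy["SCHEDULED_DEPARTURE_TIME"] = toy["SCHEDULED_DEPARTURE"].apply(hhmm_to_time)
```

On the real data the same two steps would be applied to `flights` instead of `toy`.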

In [None]:
## 3. Merge the airlines and the flights dataset
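
A minimal sketch of exercises 3 to 7 on toy data, not a reference solution. It assumes that `airlines.csv` has `IATA_CODE` and `AIRLINE` (full name) columns; the toy rows, the `AIRLINE_NAME` column and the variable names are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for flights.csv and airlines.csv (values invented).
toy_flights = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "AS", "AS", "AS"],
    "ORIGIN_AIRPORT": ["LAX", "LAX", "ANC", "SEA", "SEA"],
    "DESTINATION_AIRPORT": ["PBI", "MIA", "SEA", "ANC", "LAX"],
    "SCHEDULED_DEPARTURE": [10, 20, 5, 25, 1430],
    "DEPARTURE_DELAY": [-8.0, -5.0, -11.0, -1.0, 30.0],
})
# Assumption: airlines.csv maps IATA_CODE to the full AIRLINE name.
toy_airlines = pd.DataFrame({
    "IATA_CODE": ["AA", "AS"],
    "AIRLINE": ["American Airlines Inc.", "Alaska Airlines Inc."],
})

# Exercise 3: a left join keeps every flight, even if its code is unknown.
merged = toy_flights.merge(
    toy_airlines.rename(columns={"AIRLINE": "AIRLINE_NAME"}),
    left_on="AIRLINE", right_on="IATA_CODE", how="left",
)

# Exercise 4: several aggregates of one column per airline.
delay_stats = (merged.groupby("AIRLINE_NAME")["DEPARTURE_DELAY"]
               .agg(["min", "max", "mean", "count"]))

# Exercise 5: an airport counts if it appears as origin or destination.
visited = pd.concat([
    merged[["AIRLINE", "ORIGIN_AIRPORT"]].rename(columns={"ORIGIN_AIRPORT": "AIRPORT"}),
    merged[["AIRLINE", "DESTINATION_AIRPORT"]].rename(columns={"DESTINATION_AIRPORT": "AIRPORT"}),
])
nb_airports = visited["AIRPORT"].nunique()
airports_per_airline = visited.groupby("AIRLINE")["AIRPORT"].nunique()

# Exercise 6: mean departure delay per origin airport.
delay_by_airport = merged.groupby("ORIGIN_AIRPORT")["DEPARTURE_DELAY"].mean()

# Exercise 7: integer division by 100 extracts the hour from an HHMM value.
departure_hour = merged["SCHEDULED_DEPARTURE"] // 100
delay_by_hour = merged.groupby(departure_hour)["DEPARTURE_DELAY"].mean()
```

The left join is a deliberate choice here: an inner join would silently drop flights whose airline code is missing from the airlines table.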

## Graphs With Altair

In [8]:
import altair as alt

# https://altair-viz.github.io/

In [17]:
flights.head()

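| | YEAR | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 1 | 1 | 4 | AS | 98 | N407AS | ANC | SEA | 5 | ... | 408.0 | -22.0 | 0 | 0 | | | | | | |
| 1 | 2015 | 1 | 1 | 4 | AA | 2336 | N3KUAA | LAX | PBI | 10 | ... | 741.0 | -9.0 | 0 | 0 | | | | | | |
| 2 | 2015 | 1 | 1 | 4 | US | 840 | N171US | SFO | CLT | 20 | ... | 811.0 | 5.0 | 0 | 0 | | | | | | |
| 3 | 2015 | 1 | 1 | 4 | AA | 258 | N3HYAA | LAX | MIA | 20 | ... | 756.0 | -9.0 | 0 | 0 | | | | | | |
| 4 | 2015 | 1 | 1 | 4 | AS | 135 | N527AS | SEA | ANC | 25 | ... | 259.0 | -21.0 | 0 | 0 | | | | | | |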


### Let's plot ORIGIN_AIRPORT vs DESTINATION_AIRPORT

Objective: explore whether some origin/destination pairs are particularly prone to delays.

In [28]:
# Point size encodes how many of the first 5000 flights share each origin/destination pair
basic_scatter = alt.Chart(flights.head(5000), title='Number of flights by airport pair').mark_point().encode(
    alt.Y('ORIGIN_AIRPORT', title="Origin"),
    alt.X('DESTINATION_AIRPORT', title="Destination"),
    size='count()',
)
basic_scatter

### Selecting the data we need

In [49]:
# Top 20 origin airports by number of flights
top_airports = flights.groupby(['ORIGIN_AIRPORT'])['YEAR'].count().sort_values(ascending=False).head(20).index

# Keep only flights between two top airports, then summarise each pair
between_top = flights.ORIGIN_AIRPORT.isin(top_airports) & flights.DESTINATION_AIRPORT.isin(top_airports)
airport_pairs = (
    flights.loc[between_top]
    .groupby(['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
    .agg(
        median_arrival_delay=('ARRIVAL_DELAY', 'median'),
        distance=('DISTANCE', 'mean'),
        # 'count' ignores missing values, so flights without a departure delay are not counted
        nb_flights=('DEPARTURE_DELAY', 'count'),
    )
    .reset_index()
)
airport_pairs.head()

| | ORIGIN_AIRPORT | DESTINATION_AIRPORT | median_arrival_delay | distance | nb_flights |
|---|---|---|---|---|---|
| 0 | ATL | BOS | -5.0 | 946 | 85 |
| 1 | ATL | CLT | -6.0 | 226 | 107 |
| 2 | ATL | DEN | -1.0 | 1199 | 92 |
| 3 | ATL | DFW | -2.5 | 731 | 132 |
| 4 | ATL | DTW | -3.0 | 594 | 88 |
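
Before charting, the pairs can also be ranked directly to see which ones are latest. The sketch below retypes the five rows shown in the output above; `MIN_FLIGHTS` is a hypothetical threshold chosen only for illustration, meant to keep medians computed from very few flights out of the ranking.

```python
import pandas as pd

# The five rows of airport_pairs.head() shown above, retyped by hand.
pairs = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ATL", "ATL", "ATL", "ATL", "ATL"],
    "DESTINATION_AIRPORT": ["BOS", "CLT", "DEN", "DFW", "DTW"],
    "median_arrival_delay": [-5.0, -6.0, -1.0, -2.5, -3.0],
    "distance": [946, 226, 1199, 731, 594],
    "nb_flights": [85, 107, 92, 132, 88],
})

MIN_FLIGHTS = 90  # hypothetical cut-off, not taken from the notebook

# Drop thinly observed pairs, then sort from most to least delayed.
busy = pairs[pairs["nb_flights"] >= MIN_FLIGHTS]
most_delayed = busy.sort_values("median_arrival_delay", ascending=False)
```

On the full `airport_pairs` frame the same two lines would apply unchanged.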


In [55]:
basic_scatter = alt.Chart(airport_pairs, title='Arrival Delay by airport Pair').mark_point().encode(
    alt.Y('ORIGIN_AIRPORT', title="Origin"),
    alt.X('DESTINATION_AIRPORT', title="Destination"),
    size='nb_flights',
    color=alt.Color('median_arrival_delay', 
                           scale=alt.Scale(scheme='inferno'), 
                           legend=alt.Legend(title="Median Arrival Delay")
                          )
)

basic_scatter



### Adding a Tooltip

In [56]:

basic_scatter = alt.Chart(airport_pairs, title='Arrival Delay by airport Pair').mark_point().encode(
    alt.Y('ORIGIN_AIRPORT', title="Origin"),
    alt.X('DESTINATION_AIRPORT', title="Destination"),
    size='nb_flights',
    color=alt.Color('median_arrival_delay', 
                           scale=alt.Scale(scheme='inferno'), 
                           legend=alt.Legend(title="Median Arrival Delay")
                          ),
    tooltip=[  alt.Tooltip('ORIGIN_AIRPORT')
             , alt.Tooltip('DESTINATION_AIRPORT')
             , alt.Tooltip('nb_flights')
             , alt.Tooltip('median_arrival_delay')

            ],     
)

basic_scatter




### Interactive Charts: selection v1

In [85]:
airport_pairs['distance_bins'] = pd.cut(airport_pairs['distance'], 4).astype(str)

## CHART
# Note: Altair 5 renames selection_multi to selection_point and add_selection to add_params
selection = alt.selection_multi(fields=['distance_bins'], bind='legend')

basic_scatter = alt.Chart(airport_pairs, title='Arrival Delay by airport Pair').mark_point().encode(
    alt.Y('ORIGIN_AIRPORT', title="Origin"),
    alt.X('DESTINATION_AIRPORT', title="Destination"),
    size='distance_bins',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),    
    color=alt.Color('median_arrival_delay', 
                           scale=alt.Scale(scheme='inferno'), 
                           legend=alt.Legend(title="Median Arrival Delay")
                          ),
    tooltip=[  alt.Tooltip('ORIGIN_AIRPORT')
             , alt.Tooltip('DESTINATION_AIRPORT')
             , alt.Tooltip('nb_flights')
             , alt.Tooltip('median_arrival_delay')

            ],     
).add_selection(
selection
)

basic_scatter




### Interactive Charts: selection v2

In [88]:
# DISTANCE is in miles, so the bin labels use miles as well
bins = pd.cut(airport_pairs['distance'], 4, labels=['1. short (<812 mi)', '2. medium (<1447 mi)', '3. long (<2082 mi)', '4. v.long (<2727 mi)'])
airport_pairs['distance_bins'] = bins

## CHART
selection = alt.selection_multi(fields=['distance_bins'], bind='legend')

basic_scatter = alt.Chart(airport_pairs, title='Arrival Delay by airport Pair').mark_point().encode(
    alt.Y('ORIGIN_AIRPORT', title="Origin"),
    alt.X('DESTINATION_AIRPORT', title="Destination"),
    size='distance_bins',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),    
    color=alt.Color('median_arrival_delay', 
                           scale=alt.Scale(scheme='inferno'), 
                           legend=alt.Legend(title="Median Arrival Delay")
                          ),
    tooltip=[  alt.Tooltip('ORIGIN_AIRPORT')
             , alt.Tooltip('DESTINATION_AIRPORT')
             , alt.Tooltip('nb_flights')
             , alt.Tooltip('median_arrival_delay')

            ],     
).add_selection(
selection
)

basic_scatter
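
The bin labels in the cell above are typed by hand, so they go stale if the data changes. One way to avoid this, sketched here on invented distances, is to ask `pd.cut` for its bin edges with `retbins=True` and build the labels from them. The `names` list and the label format are assumptions for illustration.

```python
import pandas as pd

# Toy distances in miles (values invented for illustration).
distance = pd.Series([226, 594, 731, 946, 1199, 2475])

# retbins=True also returns the bin edges that pd.cut computed.
_, edges = pd.cut(distance, 4, retbins=True)

# One label per bin, built from the upper edge of that bin.
names = ["short", "medium", "long", "v.long"]
labels = [
    f"{i}. {name} (<={edge:.0f} mi)"
    for i, (name, edge) in enumerate(zip(names, edges[1:]), start=1)
]

# Reuse the same edges so the labels and the bins stay in sync.
distance_bins = pd.cut(distance, bins=edges, labels=labels)
```

The numeric prefix keeps the legend entries in distance order when they are sorted alphabetically.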


