In [1]:
#| label: libraries
#| include: false
import pandas as pd
import numpy as np
import statistics as st
import plotly.express as px

## Elevator pitch

_This report provides key insights for the aviation industry. I determine which airport has the worst delays using the proportion of delayed flights, and I identify the best months to fly using the same measure. I calculate total weather-related delays, covering both severe and mild weather, after replacing missing values with the mean. The report includes a bar chart of weather-related delays at each airport, and missing data is handled consistently throughout. Use this analysis to inform operations and decision-making._


In [40]:
#| label: project data
#| code-summary: Read and format project data
df = pd.read_json('https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json')


## Fixing Missing Values

__Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__

_I replaced all the NA-like values that I came across during my initial exploration with NumPy’s NaN so that missing values are represented consistently._


In [3]:
#| label: Q1
#| code-summary: Read and format data
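# Work on a DataFrame built from the raw data
data = pd.DataFrame(df)

# Convert the NA-like strings found during exploration to NaN
data.replace(['None','NA','NULL','n/a','null',''],np.nan,inplace=True)

# Preview rows where the airport name is missing
data[data['airport_name'].isna()].head(10)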

Unnamed: 0,airport_code,airport_name,month,year,num_of_flights_total,num_of_delays_carrier,num_of_delays_late_aircraft,num_of_delays_nas,num_of_delays_security,num_of_delays_weather,num_of_delays_total,minutes_delayed_carrier,minutes_delayed_late_aircraft,minutes_delayed_nas,minutes_delayed_security,minutes_delayed_weather,minutes_delayed_total
2,IAD,,January,2005.0,12381,414,1058,895,4,61,2430,,70919,35660.0,208,4497,134881
13,SLC,,Febuary,2005.0,12404,645,463,752,10,79,1947,32336.0,23087,24544.0,293,4614,84874
25,SAN,,April,2005.0,7091,364,369,343,2,15,1095,15602.0,15994,10015.0,59,792,42462
41,SLC,,June,2005.0,13860,788,642,670,7,95,2203,37008.0,31149,,162,5223,98063
170,IAD,,January,2007.0,8572,530,778,458,2,51,1822,,51350,16439.0,47,2902,99051
186,SAN,,March,2007.0,7824,537,544,250,5,26,1365,24347.0,27938,8755.0,140,2295,63475
213,ORD,,July,2007.0,32047,1500+,3086,3441,12,238,8364,109326.0,214217,214364.0,604,20933,559444
230,SLC,,,2007.0,12015,670,671,324,9,29,1704,28345.0,31081,10544.0,272,2335,72577
232,DEN,,October,2007.0,19967,837,1293,891,5,53,3084,42264.0,70580,,151,2718,143385
261,IAD,,Febuary,2008.0,6411,577,707,357,2,51,1693,40829.0,52543,15093.0,38,3559,112062
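The question also asks for one record in raw JSON that displays "NaN". The following is a minimal sketch of one way to produce it, run on hypothetical sample rows rather than the project data. It assumes that `-999` is a numeric placeholder for a missing count, which the cell above does not handle; `sample`, `cleaned`, and `record_json` are illustrative names.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical rows that imitate the missing-value markers seen in the data.
sample = pd.DataFrame({
    "airport_code": ["ATL", "IAD"],
    "airport_name": ["Atlanta, GA: Hartsfield-Jackson Atlanta International", ""],
    "month": ["January", "n/a"],
    "num_of_delays_late_aircraft": [-999, 1058],
})

# String markers taken from the replace() call in the cell above.
cleaned = sample.replace(["None", "NA", "NULL", "n/a", "null", ""], np.nan)

# Assumption: -999 stands for "missing" rather than a real count.
cleaned["num_of_delays_late_aircraft"] = (
    cleaned["num_of_delays_late_aircraft"].replace(-999, np.nan)
)

# json.dumps writes a float NaN as the bare token NaN,
# whereas pandas' own to_json() would write null instead.
record_json = json.dumps(cleaned.iloc[1].to_dict())
print(record_json)
```

Note that `NaN` is not part of strict JSON; Python's `json` module emits and parses it by default, but other parsers may reject it.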


## Which airport has the worst delays?

__Which airport has the worst delays? Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.__

_Of the airports in the dataset, SFO, ORD, and ATL have the worst delays. I used the total flights and total delays columns to compute a delay ratio for each record and averaged it by airport, because I wanted a high-level overview of what delays look like across the airports._


In [4]:
#| label: Q2
#| code-summary: Read and format data
# Start from the cleaned data
data_delays = pd.DataFrame(data)

# Share of flights delayed in each airport-month record
data_delays['total_delay_ratio'] = round(data_delays['num_of_delays_total'] / data_delays['num_of_flights_total'],2)

# Average the per-record ratios for each airport, worst first
data_delays = data_delays[['airport_code','total_delay_ratio']].groupby(['airport_code'],as_index=False)\
    .mean().sort_values(by='total_delay_ratio',ascending=False)

data_delays.head()

Unnamed: 0,airport_code,total_delay_ratio
5,SFO,0.260455
3,ORD,0.228712
0,ATL,0.20197
2,IAD,0.195227
4,SAN,0.189621
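The question asks for a summary table with total flights, total delayed flights, the proportion delayed, and the average delay in hours. The sketch below shows one way such a table could be built. It runs on made-up numbers, not the project data, and the output column names (`total_flights`, `avg_delay_hours`, and so on) are illustrative.

```python
import pandas as pd

# Made-up monthly records using the column names from the dataset.
flights = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SFO", "SFO"],
    "num_of_flights_total": [1000, 2000, 500, 500],
    "num_of_delays_total": [200, 300, 150, 100],
    "minutes_delayed_total": [12000, 18000, 9000, 6000],
})

# Sum the raw counts per airport first, then derive the ratios.
summary = flights.groupby("airport_code", as_index=False).agg(
    total_flights=("num_of_flights_total", "sum"),
    total_delays=("num_of_delays_total", "sum"),
    total_delay_minutes=("minutes_delayed_total", "sum"),
)
summary["proportion_delayed"] = summary["total_delays"] / summary["total_flights"]
# Average delay per delayed flight, converted from minutes to hours.
summary["avg_delay_hours"] = (
    summary["total_delay_minutes"] / summary["total_delays"] / 60
)

summary = summary.sort_values("proportion_delayed", ascending=False)
print(summary)
```

Dividing summed delays by summed flights weights every flight equally, while averaging the per-record ratios (as in the cell above) weights every month equally, so the two approaches can rank airports slightly differently.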


In [5]:
#| label: Q2 chart
#| code-summary: plot example
#| fig-align: center
graph_delay = px.bar(data_delays, x='airport_code', y='total_delay_ratio', title= "Delay Ratio AVGs by Airport",
                    labels= {
                        'total_delay_ratio': 'Delay Ratio',
                        'airport_code': 'Airport'
                    }, text_auto=True)

graph_delay.show()

## Best month to fly?

__What is the best month to fly if you want to avoid delays of any length? Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)__

_Based on the table and chart below, the three best months to fly are September, November, and October. I divided total delays by total flights for each month to get a high-level ratio that shows when a flight is least likely to be delayed._


In [6]:
#| label: Q3
#| code-summary: Read and format data
data_month = data[['month','num_of_flights_total','num_of_delays_total']]\
    .groupby(['month'],as_index=False).agg({'num_of_delays_total': 'sum', 'num_of_flights_total':'sum'})
        
data_month['month_delay_ratio'] = round(data_month['num_of_delays_total'] / data_month['num_of_flights_total'],3)

data_month = data_month.sort_values(by='month_delay_ratio')

data_month.head(12)

Unnamed: 0,month,num_of_delays_total,num_of_flights_total,month_delay_ratio
11,September,201905,1227208,0.165
9,November,197768,1185434,0.167
10,October,235166,1301612,0.181
0,April,231408,1259723,0.184
8,May,233494,1227795,0.19
7,March,250142,1213370,0.206
1,August,279699,1335158,0.209
3,Febuary,248033,1115814,0.222
4,January,265001,1193018,0.222
5,July,319960,1371741,0.233


In [7]:
#| label: Q3 chart
#| code-summary: plot example
#| fig-align: center
graph_delay_month = px.bar(data_month, x='month', y='month_delay_ratio', title='Delay Ratio AVGs by Month',
                           labels= {
                               'month': 'Month',
                               'month_delay_ratio': 'Delay Ratio'
                           }, text_auto=True)

graph_delay_month.show()
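The question asks for the x-axis to be ordered by month, while the table above is sorted by delay ratio. A minimal sketch of calendar ordering with an ordered categorical follows, using made-up ratios. It also assumes that the "Febuary" spelling in the data should be normalized to "February"; `MONTH_ORDER` and `monthly` are illustrative names.

```python
import pandas as pd

MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Made-up ratios; the None row imitates a record with a missing month.
monthly = pd.DataFrame({
    "month": ["September", "Febuary", "January", None],
    "month_delay_ratio": [0.165, 0.222, 0.222, 0.200],
})

# Drop rows without a month, then fix the misspelling seen in the data.
monthly = monthly.dropna(subset=["month"]).copy()
monthly["month"] = monthly["month"].replace({"Febuary": "February"})

# An ordered categorical sorts by calendar position instead of alphabetically.
monthly["month"] = pd.Categorical(
    monthly["month"], categories=MONTH_ORDER, ordered=True
)
monthly = monthly.sort_values("month")
print(monthly)
```

Passing `category_orders={'month': MONTH_ORDER}` to `px.bar` is another way to fix the axis order without sorting the data.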

## Including "mild" weather conditions as delay reasons

__According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table.__

_I created a new column that holds the values specified in the assignment requirements. While examining the delay columns, I noticed that some values were negative, so I added logic to convert those values to positive ones before aggregating._

In [8]:
#| label: Q4
#| code-summary: Read and format data
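data_weather = data

# Fill missing Late Aircraft values with the column mean
data_weather['num_of_delays_late_aircraft'] = data_weather['num_of_delays_late_aircraft'].fillna(data_weather['num_of_delays_late_aircraft'].mean())

# Note: assigning dropna() back to the column re-aligns on the index, so rows with a missing month are kept
data_weather['month'] = data_weather['month'].dropna()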
    
data_weather['num_delays_weather_total'] = (data_weather['num_of_delays_late_aircraft'] * 0.3) + data_weather['num_of_delays_weather'] +\
    (data_weather.apply(lambda row: row['num_of_delays_nas'] * -1 if row['num_of_delays_nas'] < 0 else
                        row['num_of_delays_late_aircraft'] * -1 if row['num_of_delays_late_aircraft'] < 0 else
                        row['num_of_delays_weather'] * -1 if row['num_of_delays_weather'] < 0 else
                        row['num_of_delays_nas'] * 0.4 if row['month'] in ['April','May','June','July','August'] else
                        row['num_of_delays_nas'] * 0.65, axis=1)) 

data_weather.head(5)    

Unnamed: 0,airport_code,airport_name,month,year,num_of_flights_total,num_of_delays_carrier,num_of_delays_late_aircraft,num_of_delays_nas,num_of_delays_security,num_of_delays_weather,num_of_delays_total,minutes_delayed_carrier,minutes_delayed_late_aircraft,minutes_delayed_nas,minutes_delayed_security,minutes_delayed_weather,minutes_delayed_total,num_delays_weather_total
0,ATL,"Atlanta, GA: Hartsfield-Jackson Atlanta Intern...",January,2005.0,35048,1500+,-999,4598,10,448,8355,116423.0,104415,207467.0,297,36931,465533,1147.3
1,DEN,"Denver, CO: Denver International",January,2005.0,12687,1041,928,935,11,233,3153,53537.0,70301,36817.0,363,21779,182797,1119.15
2,IAD,,January,2005.0,12381,414,1058,895,4,61,2430,,70919,35660.0,208,4497,134881,960.15
3,ORD,"Chicago, IL: Chicago O'Hare International",January,2005.0,28194,1197,2255,5415,5,306,9178,88691.0,160811,364382.0,151,24859,638894,4502.25
4,SAN,"San Diego, CA: San Diego International",January,2005.0,7283,572,680,638,7,56,1952,27436.0,38445,21127.0,218,4326,91552,674.7
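The first row above still shows `-999` in the Late Aircraft column, so the mean fill did not reach it. The sketch below is an alternative under the assumption that `-999` marks a missing value: it converts the placeholder to NaN before filling with the mean, then applies the same weights used in the cell above (30% of late-aircraft delays, and 40% of NAS delays from April through August or 65% otherwise). The rows are made up and `late` and `nas_share` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows using the column names from the dataset.
flights = pd.DataFrame({
    "month": ["January", "June", "March"],
    "num_of_delays_late_aircraft": [-999, 1000, 500],
    "num_of_delays_nas": [1000, 1000, 200],
    "num_of_delays_weather": [100, 50, 10],
})

# Assumption: -999 is a placeholder for a missing count.
late = flights["num_of_delays_late_aircraft"].replace(-999, np.nan)
late = late.fillna(late.mean())  # mean of the remaining values

# Share of NAS delays attributed to weather depends on the month.
mild_months = ["April", "May", "June", "July", "August"]
nas_share = np.where(flights["month"].isin(mild_months), 0.40, 0.65)

flights["num_delays_weather_total"] = (
    flights["num_of_delays_weather"]
    + 0.30 * late
    + nas_share * flights["num_of_delays_nas"]
)
print(flights)
```

Treating the placeholder as missing keeps it from pulling the mean down and avoids adding a sign-flipped sentinel to the total.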


## Which airports really have the most delays?

__Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.__

_I took the previous dataset, kept only the two columns that I needed, and prepared them for the bar chart. It is clear that ORD and ATL both suffer from high volumes of weather-related delays._

In [9]:
#| label: Q5
#| code-summary: Read and format data
data_weather_bar = data_weather

data_weather_bar = data_weather_bar[['airport_code','num_delays_weather_total']]\
    .groupby(['airport_code'],as_index=False).sum().sort_values(by='num_delays_weather_total', ascending=False)
    
data_weather_bar['num_delays_weather_total'] = round(data_weather_bar['num_delays_weather_total'] / 1000, 2)

data_weather_bar.head(10)

Unnamed: 0,airport_code,num_delays_weather_total
3,ORD,297.54
0,ATL,244.61
5,SFO,159.59
1,DEN,149.11
6,SLC,60.35
2,IAD,50.84
4,SAN,48.92


In [10]:
#| label: Q5 chart
#| code-summary: plot example
#| fig-align: center
graph_delays_total = px.bar(data_weather_bar, x='airport_code', y='num_delays_weather_total',
                            title='Number of Delays Due to Weather by Airport (All Time)',
                            labels= {
                                'airport_code': 'Airport',
                                'num_delays_weather_total': 'Total Delays (1000)'
                            },text_auto=True)

graph_delays_total.show()
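The chart above shows total weather delays in thousands, which favors the busiest airports. The question asks for the proportion of all flights delayed by weather, and a rough sketch of that calculation follows. It uses made-up numbers, and `by_airport` and `weather_delay_proportion` are illustrative names.

```python
import pandas as pd

# Made-up records; num_delays_weather_total is the column created earlier.
weather = pd.DataFrame({
    "airport_code": ["ORD", "ORD", "SLC"],
    "num_of_flights_total": [1000, 1000, 500],
    "num_delays_weather_total": [80.0, 120.0, 20.0],
})

# Sum flights and weather delays per airport, then divide.
by_airport = weather.groupby("airport_code", as_index=False)[
    ["num_of_flights_total", "num_delays_weather_total"]
].sum()
by_airport["weather_delay_proportion"] = (
    by_airport["num_delays_weather_total"] / by_airport["num_of_flights_total"]
)

by_airport = by_airport.sort_values("weather_delay_proportion", ascending=False)
print(by_airport)
```

Plotting `weather_delay_proportion` instead of the raw totals makes airports of different sizes directly comparable.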