## Intro - Northwoods Airlines Analysis

The client, Northwoods Airlines, has requested a POV using both Snowflake and Databricks. It is part of an investigation into the benefits of using these platforms.

The data owners have provided datasets for airports, airlines, and flights and shared them over [Google Drive](https://drive.google.com/drive/folders/18Mkt2Ku3gIxenT-zjYi68kcufpcvNwbv).

## Assumptions

- The data is good. No major cleaning needs to be performed.

## Reports

The following reports are typical of the industry reporting and insight gathering that provide a competitive advantage.

- Total Number Of Flights By Airline and Airport, Month Granularity
- On Time Percentage Per Airline For The Year 2015
- Airlines With The Largest Number Of Delays
- Cancellation Reasons By Airport
- Delay Reasons By Airport
- Airline With The Most Unique Routes

## Data Prep

### Step 1 - Importing Customer Data

The customer provided data for **flights**, **airlines** and **airports**.

In [0]:
# Import data files provided by the customer.

# Airline and airport reference data.
df_airlines = spark.read.csv("/FileStore/tables/airlines.csv", header=True, inferSchema=True)
df_airports = spark.read.csv("/FileStore/tables/airports.csv", header=True, inferSchema=True)

# Flight data arrives as eight partition files; read each one and union them into a single dataframe.
df_flights = None
for i in range(1, 9):
    df_partition = spark.read.csv(f"/FileStore/tables/partition_{i:02d}.csv", header=True, inferSchema=True)
    df_flights = df_partition if df_flights is None else df_flights.union(df_partition)

### Step 2 - Merging Airport Data to Flight Data

We are building a wide table for reporting. First, let's combine the airport and flight data.

In [0]:
# Merging airports to flights.

# Assuming all analysis is expected with regards to origin airport unless specified otherwise
# df_flights(origin_airport,destination_airport)
# df_airports(iata_code)

# Origin airport dataframe
df_origin_airport = (
    df_airports
       .withColumnRenamed('IATA_CODE', 'ORIGIN_AIRPORT_IATA_CODE')
       .withColumnRenamed('AIRPORT', 'ORIGIN_AIRPORT_NAME')
       .withColumnRenamed('CITY','ORIGIN_AIRPORT_CITY')
       .withColumnRenamed('STATE', 'ORIGIN_AIRPORT_STATE')
       .withColumnRenamed('COUNTRY','ORIGIN_AIRPORT_COUNTRY')
       .withColumnRenamed('LATITUDE', 'ORIGIN_AIRPORT_LATITUDE')
       .withColumnRenamed('LONGITUDE','ORIGIN_AIRPORT_LONGITUDE') )

# Join Flights To Origin Airport
df_flights = df_flights.join(df_origin_airport, df_flights.ORIGIN_AIRPORT ==  df_origin_airport.ORIGIN_AIRPORT_IATA_CODE,"inner")

# Destination airport dataframe
df_destination_airport = (
    df_airports
       .withColumnRenamed('IATA_CODE', 'DESTINATION_AIRPORT_IATA_CODE')
       .withColumnRenamed('AIRPORT', 'DESTINATION_AIRPORT_NAME')
       .withColumnRenamed('CITY','DESTINATION_AIRPORT_CITY')
       .withColumnRenamed('STATE', 'DESTINATION_AIRPORT_STATE')
       .withColumnRenamed('COUNTRY','DESTINATION_AIRPORT_COUNTRY')
       .withColumnRenamed('LATITUDE', 'DESTINATION_AIRPORT_LATITUDE')
       .withColumnRenamed('LONGITUDE','DESTINATION_AIRPORT_LONGITUDE') )

# Join Flights To Destination Airport
df_flights = df_flights.join(df_destination_airport, df_flights.DESTINATION_AIRPORT ==  df_destination_airport.DESTINATION_AIRPORT_IATA_CODE,"inner")

# todo - can drop ORIGIN_AIRPORT_IATA_CODE, DESTINATION_AIRPORT_IATA_CODE
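
Both joins in this step are inner joins, so a flight whose origin or destination code has no match in the airports file drops out of `df_flights` without any warning. The following is a minimal plain-Python sketch of a check for such codes, under the assumption that the codes can be collected into small lists; the rows and the code `XXX` are made up for illustration. In PySpark, the same idea could probably be expressed as a `left_anti` join followed by `count()`.

```python
def unmatched_codes(flights, airport_codes):
    """Return airport codes used by flights that are missing from the airports table."""
    used = set()
    for flight in flights:
        used.add(flight["ORIGIN_AIRPORT"])
        used.add(flight["DESTINATION_AIRPORT"])
    return sorted(used - set(airport_codes))


# Hypothetical rows for illustration only; "XXX" is not a code taken from the data.
sample_flights = [
    {"ORIGIN_AIRPORT": "DEN", "DESTINATION_AIRPORT": "IAH"},
    {"ORIGIN_AIRPORT": "MSP", "DESTINATION_AIRPORT": "XXX"},
]
sample_airport_codes = ["DEN", "IAH", "MSP"]

print(unmatched_codes(sample_flights, sample_airport_codes))
```

An empty result would suggest the inner joins lose no flights; a non-empty one would argue for a left join or a cleaning pass first.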

### Step 3 - Merging Airline Data to Flight Data

Lastly, let's add the airline data to our table.

In [0]:
# Merging airlines to flights

# df_flights(airline)
# df_airlines(iata_code)

# Airline dataframe
df_airlines = (
    df_airlines
       .withColumnRenamed('IATA_CODE', 'AIRLINE_IATA_CODE')
       .withColumnRenamed('AIRLINE', 'AIRLINE_NAME') )

# Join Flights To Airlines
df_flights = df_flights.join(df_airlines, df_flights.AIRLINE ==  df_airlines.AIRLINE_IATA_CODE,"inner")

# todo - can drop AIRLINE_IATA_CODE
display(df_flights)

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,ORIGIN_AIRPORT_IATA_CODE,ORIGIN_AIRPORT_NAME,ORIGIN_AIRPORT_CITY,ORIGIN_AIRPORT_STATE,ORIGIN_AIRPORT_COUNTRY,ORIGIN_AIRPORT_LATITUDE,ORIGIN_AIRPORT_LONGITUDE,DESTINATION_AIRPORT_IATA_CODE,DESTINATION_AIRPORT_NAME,DESTINATION_AIRPORT_CITY,DESTINATION_AIRPORT_STATE,DESTINATION_AIRPORT_COUNTRY,DESTINATION_AIRPORT_LATITUDE,DESTINATION_AIRPORT_LONGITUDE,AIRLINE_IATA_CODE,AIRLINE_NAME
2015,1,1,4,MQ,3393,N918MQ,TXK,DFW,1125,,,,,60,,,181,,,1225,,,0,1,B,,,,,,TXK,Texarkana Regional Airport (Webb Field),Texarkana,AR,USA,33.45371,-93.99102,DFW,Dallas/Fort Worth International Airport,Dallas-Fort Worth,TX,USA,32.89595,-97.0372,MQ,American Eagle Airlines Inc.
2015,1,1,4,B6,1716,N507JB,TPA,LGA,1127,1123.0,-4.0,11.0,1134.0,153,172.0,129.0,1010,1343.0,32.0,1400,1415.0,15.0,0,0,,15.0,0.0,0.0,0.0,0.0,TPA,Tampa International Airport,Tampa,FL,USA,27.97547,-82.53325,LGA,LaGuardia Airport (Marine Air Terminal),New York,NY,USA,40.77724,-73.87261,B6,JetBlue Airways
2015,1,1,4,UA,1642,N78866,DEN,IAH,1127,1159.0,32.0,21.0,1220.0,140,148.0,120.0,862,1520.0,7.0,1447,1527.0,40.0,0,0,,8.0,0.0,21.0,11.0,0.0,DEN,Denver International Airport,Denver,CO,USA,39.85841,-104.667,IAH,George Bush Intercontinental Airport,Houston,TX,USA,29.98047,-95.33972,UA,United Air Lines Inc.
2015,1,1,4,UA,1171,N77066,DEN,SFO,1127,1213.0,46.0,12.0,1225.0,167,132.0,114.0,967,1319.0,6.0,1314,1325.0,11.0,0,0,,,,,,,DEN,Denver International Airport,Denver,CO,USA,39.85841,-104.667,SFO,San Francisco International Airport,San Francisco,CA,USA,37.619,-122.37484,UA,United Air Lines Inc.
2015,1,1,4,EV,4485,N13161,RDU,IAH,1125,1116.0,-9.0,22.0,1138.0,190,219.0,178.0,1042,1336.0,19.0,1335,1355.0,20.0,0,0,,20.0,0.0,0.0,0.0,0.0,RDU,Raleigh-Durham International Airport,Raleigh,NC,USA,35.87764,-78.78747,IAH,George Bush Intercontinental Airport,Houston,TX,USA,29.98047,-95.33972,EV,Atlantic Southeast Airlines
2015,1,1,4,OO,4770,N823AS,DIK,MSP,1125,1115.0,-10.0,15.0,1130.0,109,92.0,66.0,481,1336.0,11.0,1414,1347.0,-27.0,0,0,,,,,,,DIK,Dickinson Theodore Roosevelt Regional Airport,Dickinson,ND,USA,46.79739,-102.80195,MSP,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,OO,Skywest Airlines Inc.
2015,1,1,4,EV,4472,N12163,IAH,SLC,1129,1130.0,1.0,28.0,1158.0,207,194.0,160.0,1195,1338.0,6.0,1356,1344.0,-12.0,0,0,,,,,,,IAH,George Bush Intercontinental Airport,Houston,TX,USA,29.98047,-95.33972,SLC,Salt Lake City International Airport,Salt Lake City,UT,USA,40.78839,-111.97777,EV,Atlantic Southeast Airlines
2015,1,1,4,OO,4556,N161PQ,MSP,BNA,1125,1415.0,170.0,25.0,1440.0,130,126.0,94.0,695,1614.0,7.0,1335,1621.0,166.0,0,0,,0.0,0.0,0.0,166.0,0.0,MSP,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,BNA,Nashville International Airport,Nashville,TN,USA,36.12448,-86.67818,OO,Skywest Airlines Inc.
2015,1,1,4,OO,7420,N446SW,MSP,IMT,1125,1112.0,-13.0,10.0,1122.0,75,52.0,37.0,257,1159.0,5.0,1240,1204.0,-36.0,0,0,,,,,,,MSP,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,IMT,Ford Airport,Iron Mountain/Kingsford,MI,USA,45.81835,-88.11454,OO,Skywest Airlines Inc.
2015,1,1,4,OO,3495,N227AG,FAT,SEA,1125,1124.0,-1.0,13.0,1137.0,135,126.0,107.0,748,1324.0,6.0,1340,1330.0,-10.0,0,0,,,,,,,FAT,Fresno Yosemite International Airport,Fresno,CA,USA,36.77619,-119.71814,SEA,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,OO,Skywest Airlines Inc.


### Step 3.5 - Defining Snowflake Connection, Refactoring Column Names

In [0]:
# Snowflake Connection Details. 
options = {
  "sfUrl": "https://hra22104.us-east-1.snowflakecomputing.com",
  "sfUser": "mariotalavera",
  "sfPassword": "Password123!!", # todo - figure out how to use token instead
  "sfDatabase": "phData",
  "sfSchema": "public",
  "sfWarehouse": "INTERVIEW_WH"
}

# Refactoring names here so that I do not have to do in S...
df_flights = (
  df_flights
  .withColumnRenamed("AIR_SYSTEM_DELAY", "Air System Delay")
  .withColumnRenamed("AIRLINE_DELAY", "Airline Delay")
  .withColumnRenamed("AIRLINE", "AIRLINE_CODE")
  .withColumnRenamed("AIRLINE_NAME", "Airline")
  .withColumnRenamed("CANCELLATION_REASON", "Cancellation Reason")
  .withColumnRenamed("DEPARTURE_DELAY", "Departure Delay")
  .withColumnRenamed("LATE_AIRCRAFT_DELAY", "Late Aircraft Delay")
  .withColumnRenamed("MONTH", "Month")
  .withColumnRenamed("ORIGIN_AIRPORT_NAME", "Airport")
  .withColumnRenamed("SECURITY_DELAY", "Security Delay")
  .withColumnRenamed("WEATHER_DELAY", "Weather Delay")
)

# We could send wide table to Snowflake
# df_flights.write.format("snowflake").options(**options).option("dbtable", "stg_flights_combo").mode("overwrite").save()

YEAR,Month,DAY,DAY_OF_WEEK,AIRLINE_CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,Departure Delay,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,Cancellation Reason,Air System Delay,Security Delay,Airline Delay,Late Aircraft Delay,Weather Delay,ORIGIN_AIRPORT_IATA_CODE,Airport,ORIGIN_AIRPORT_CITY,ORIGIN_AIRPORT_STATE,ORIGIN_AIRPORT_COUNTRY,ORIGIN_AIRPORT_LATITUDE,ORIGIN_AIRPORT_LONGITUDE,DESTINATION_AIRPORT_IATA_CODE,DESTINATION_AIRPORT_NAME,DESTINATION_AIRPORT_CITY,DESTINATION_AIRPORT_STATE,DESTINATION_AIRPORT_COUNTRY,DESTINATION_AIRPORT_LATITUDE,DESTINATION_AIRPORT_LONGITUDE,AIRLINE_IATA_CODE,Airline
2015,1,1,4,MQ,3393,N918MQ,TXK,DFW,1125,,,,,60,,,181,,,1225,,,0,1,B,,,,,,TXK,Texarkana Regional Airport (Webb Field),Texarkana,AR,USA,33.45371,-93.99102,DFW,Dallas/Fort Worth International Airport,Dallas-Fort Worth,TX,USA,32.89595,-97.0372,MQ,American Eagle Airlines Inc.
2015,1,1,4,B6,1716,N507JB,TPA,LGA,1127,1123.0,-4.0,11.0,1134.0,153,172.0,129.0,1010,1343.0,32.0,1400,1415.0,15.0,0,0,,15.0,0.0,0.0,0.0,0.0,TPA,Tampa International Airport,Tampa,FL,USA,27.97547,-82.53325,LGA,LaGuardia Airport (Marine Air Terminal),New York,NY,USA,40.77724,-73.87261,B6,JetBlue Airways
2015,1,1,4,UA,1642,N78866,DEN,IAH,1127,1159.0,32.0,21.0,1220.0,140,148.0,120.0,862,1520.0,7.0,1447,1527.0,40.0,0,0,,8.0,0.0,21.0,11.0,0.0,DEN,Denver International Airport,Denver,CO,USA,39.85841,-104.667,IAH,George Bush Intercontinental Airport,Houston,TX,USA,29.98047,-95.33972,UA,United Air Lines Inc.
2015,1,1,4,UA,1171,N77066,DEN,SFO,1127,1213.0,46.0,12.0,1225.0,167,132.0,114.0,967,1319.0,6.0,1314,1325.0,11.0,0,0,,,,,,,DEN,Denver International Airport,Denver,CO,USA,39.85841,-104.667,SFO,San Francisco International Airport,San Francisco,CA,USA,37.619,-122.37484,UA,United Air Lines Inc.
2015,1,1,4,EV,4485,N13161,RDU,IAH,1125,1116.0,-9.0,22.0,1138.0,190,219.0,178.0,1042,1336.0,19.0,1335,1355.0,20.0,0,0,,20.0,0.0,0.0,0.0,0.0,RDU,Raleigh-Durham International Airport,Raleigh,NC,USA,35.87764,-78.78747,IAH,George Bush Intercontinental Airport,Houston,TX,USA,29.98047,-95.33972,EV,Atlantic Southeast Airlines
2015,1,1,4,OO,4770,N823AS,DIK,MSP,1125,1115.0,-10.0,15.0,1130.0,109,92.0,66.0,481,1336.0,11.0,1414,1347.0,-27.0,0,0,,,,,,,DIK,Dickinson Theodore Roosevelt Regional Airport,Dickinson,ND,USA,46.79739,-102.80195,MSP,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,OO,Skywest Airlines Inc.
2015,1,1,4,EV,4472,N12163,IAH,SLC,1129,1130.0,1.0,28.0,1158.0,207,194.0,160.0,1195,1338.0,6.0,1356,1344.0,-12.0,0,0,,,,,,,IAH,George Bush Intercontinental Airport,Houston,TX,USA,29.98047,-95.33972,SLC,Salt Lake City International Airport,Salt Lake City,UT,USA,40.78839,-111.97777,EV,Atlantic Southeast Airlines
2015,1,1,4,OO,4556,N161PQ,MSP,BNA,1125,1415.0,170.0,25.0,1440.0,130,126.0,94.0,695,1614.0,7.0,1335,1621.0,166.0,0,0,,0.0,0.0,0.0,166.0,0.0,MSP,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,BNA,Nashville International Airport,Nashville,TN,USA,36.12448,-86.67818,OO,Skywest Airlines Inc.
2015,1,1,4,OO,7420,N446SW,MSP,IMT,1125,1112.0,-13.0,10.0,1122.0,75,52.0,37.0,257,1159.0,5.0,1240,1204.0,-36.0,0,0,,,,,,,MSP,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,IMT,Ford Airport,Iron Mountain/Kingsford,MI,USA,45.81835,-88.11454,OO,Skywest Airlines Inc.
2015,1,1,4,OO,3495,N227AG,FAT,SEA,1125,1124.0,-1.0,13.0,1137.0,135,126.0,107.0,748,1324.0,6.0,1340,1330.0,-10.0,0,0,,,,,,,FAT,Fresno Yosemite International Airport,Fresno,CA,USA,36.77619,-119.71814,SEA,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,OO,Skywest Airlines Inc.
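
The `todo` on `sfPassword` in the connection cell notes that the password should not stay in the notebook. One possible direction, sketched below, is to read the credentials from environment variables. The variable names `SNOWFLAKE_URL`, `SNOWFLAKE_USER`, and `SNOWFLAKE_PASSWORD` and the helper `snowflake_options_from_env` are assumptions made for this sketch, not part of the original setup.

```python
import os


def snowflake_options_from_env(env=None):
    """Build Snowflake connector options, reading credentials from the environment.

    The environment variable names are hypothetical. A missing variable raises
    KeyError, so a misconfigured cluster fails early instead of at write time.
    """
    env = os.environ if env is None else env
    return {
        "sfUrl": env["SNOWFLAKE_URL"],
        "sfUser": env["SNOWFLAKE_USER"],
        "sfPassword": env["SNOWFLAKE_PASSWORD"],
        "sfDatabase": "phData",
        "sfSchema": "public",
        "sfWarehouse": "INTERVIEW_WH",
    }
```

On Databricks, a secret scope read with `dbutils.secrets.get(scope=..., key=...)` is another option that keeps the password out of the notebook source.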


### Step 4 - How Many Years Of Data Were Provided?

Looking at the flight data provided, we see that it covers only the year 2015.

```python
  display(df_flights.groupBy("YEAR").count())
```

Because of this, we suppress the year from the results and show it in the report headers instead.

In [0]:
# Let's find out how many years of data we have in the set provided

display(df_flights.groupBy("YEAR").count())

# Because we have only year 2015, let's suppress it from the results and show it in the header info instead.

## Reporting

**Note** - Internal dataframe reference. Each dataframe below is sent to Snowflake for its corresponding report.

| Dataframe | Report |
| --- | --- |
| df_report_1 | Total Number Of Flights By Airline And Airport, 2015 |
| df_report_2 | Airline On Time Percentage, 2015 |
| df_report_3 | Airlines With Largest Number Of Delays |
| df_report_4 | Cancellation Reasons By Airport |
| df_report_5 | Delay Reasons By Airport |
| df_report_6 | Airline With The Largest Number Of Unique Routes |

### Total Number Of Flights By Airline And Airport, 2015

This report returns the monthly number of flights by airline and airport for the available data.

In [0]:
# Report 1 - Total number of flights by airline and airport on a monthly basis

import pyspark.sql.functions as F

# For sending to Snowflake
df_report_1 = (
  df_flights[["Airline","Airport","Month"]]
)

# Sending to Snowflake
df_report_1.write.format("snowflake").options(**options).option("dbtable", "df_report_1").mode("overwrite").save()

# For display at Databricks
df_report_1 = (
  df_report_1
  .groupBy("Airline","Airport","Month")
  .count().withColumnRenamed("count", "Flights")
  .withColumn("Flights", F.format_number("Flights", 0))
  .orderBy("Airline","Airport","Month", ascending=True)
)

display(df_report_1)

Airline,Airport,Month,Flights
Alaska Airlines Inc.,Adak Airport,1,9
Alaska Airlines Inc.,Adak Airport,2,8
Alaska Airlines Inc.,Adak Airport,3,9
Alaska Airlines Inc.,Adak Airport,4,9
Alaska Airlines Inc.,Adak Airport,5,9
Alaska Airlines Inc.,Adak Airport,6,8
Alaska Airlines Inc.,Adak Airport,7,9
Alaska Airlines Inc.,Adak Airport,8,9
Alaska Airlines Inc.,Albuquerque International Sunport,1,31
Alaska Airlines Inc.,Albuquerque International Sunport,2,28


### Airline On Time Percentage, 2015

This report returns the percentage of the flights that were on-time by airline.

- We define an on-time flight as a flight with zero delays.  
- A zero-delay flight is a flight with **ARRIVAL_DELAY = 0**.
- Flight data without **ARRIVAL_DELAY** is discarded.  These records are missing either **SCHEDULED_ARRIVAL** or **ARRIVAL_TIME**.
- Flight data with **ARRIVAL_DELAY < 0** is discarded as well.  These are flights that arrived early.
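
The definition above can be sketched in plain Python to make the arithmetic concrete. This is only an illustration of the rule, not the PySpark code used for the report; the helper name `on_time_percentage` is hypothetical. Because only an arrival delay of exactly zero counts as on time, the resulting percentages tend to be low.

```python
def on_time_percentage(arrival_delays):
    """Percentage of flights with ARRIVAL_DELAY == 0 among those with ARRIVAL_DELAY >= 0.

    Missing values (None) and early arrivals (negative delays) are discarded,
    following the definition above. Returns None when no flights remain.
    """
    kept = [d for d in arrival_delays if d is not None and d >= 0]
    if not kept:
        return None
    on_time = sum(1 for d in kept if d == 0)
    return round(on_time / len(kept) * 100.0, 2)


# ARRIVAL_DELAY values from the ten sample rows displayed in Step 3.
sample_delays = [None, 15.0, 40.0, 11.0, 20.0, -27.0, -12.0, 166.0, -36.0, -10.0]
print(on_time_percentage(sample_delays))  # 0.0: five flights kept, none with zero delay

# Hypothetical values: one exact on-time arrival out of four kept flights.
print(on_time_percentage([0.0, 5.0, 30.0, 12.0, -3.0, None]))  # 25.0
```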

In [0]:
# Report 2 - On time percentage of each airline for the year 2015

from pyspark.sql.functions import concat, col, lit
import pyspark.sql.functions as func

# Let's find the count of flights that are on time (ARRIVAL_DELAY = 0)
df_flights_onTime = (
  df_flights[["Airline"]]
  .filter("YEAR = 2015")
  .filter("arrival_delay IS NOT NULL")
  .filter("arrival_delay >= 0")
  .filter("arrival_delay = 0")
  .groupBy("Airline").count()  
  .withColumnRenamed("count", "onTime") 
)

# Let's find the total count of flights considered: on time or delayed (ARRIVAL_DELAY >= 0)
df_flights_delayed = (
  df_flights[["Airline"]]
  .filter("YEAR = 2015")
  .filter("arrival_delay IS NOT NULL")
  .filter("arrival_delay >= 0")
  .groupBy("Airline").count()  
  .withColumnRenamed("Airline", "AIRLINE_NAME_DROP")
  .withColumnRenamed("count", "Total") 
)

df_report_2 = (
  df_flights_onTime
  .join(df_flights_delayed, df_flights_onTime.Airline ==  df_flights_delayed.AIRLINE_NAME_DROP,"inner")
  .drop(col("AIRLINE_NAME_DROP"))
  .withColumn("On-Time (%)", col("onTime")/col("Total") * 100.0)
  .drop(col("onTime"))
  .drop(col("Total"))
  .orderBy("Airline") 
)

df_report_2 = (
  df_report_2
  .withColumn("On-Time (%)", func.round(df_report_2["On-Time (%)"], 2))
)

df_report_2.write.format("snowflake").options(**options).option("dbtable", "df_report_2").mode("overwrite").save()

display(df_report_2)

Airline,On-Time (%)
Alaska Airlines Inc.,7.23
American Airlines Inc.,4.96
American Eagle Airlines Inc.,4.46
Atlantic Southeast Airlines,5.68
Delta Air Lines Inc.,6.21
Frontier Airlines Inc.,4.29
Hawaiian Airlines Inc.,9.86
JetBlue Airways,4.36
Skywest Airlines Inc.,6.05
Southwest Airlines Co.,5.56


### Airlines With Largest Number Of Delays

This report returns the five airlines with the largest number of delays. A flight counts as delayed when **Departure Delay > 0**.

In [0]:
# Airlines with the largest number of delays

import pyspark.sql.functions as F

# For sending to Snowflake
df_report_3 = (
  df_flights[["Departure Delay","Airline"]]
  .filter("`Departure Delay` > 0")
)

# Sending to Snowflake
df_report_3.write.format("snowflake").options(**options).option("dbtable", "df_report_3").mode("overwrite").save()

# For display at Databricks
df_report_3 = (
  df_report_3
  .groupBy("Airline")
  .count()
  .withColumnRenamed("count", "Delays")
  .orderBy("Delays", ascending=False)
  .withColumn("Delays", F.format_number("Delays", 0)) 
  .limit(5)
)

display(df_report_3)

Airline,Delays
Southwest Airlines Co.,395439
Delta Air Lines Inc.,200207
United Air Lines Inc.,185714
American Airlines Inc.,155094
Atlantic Southeast Airlines,125056
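
The cell above sorts on the numeric count and only then applies `format_number`, which returns a string. The order matters: sorting after formatting would compare text rather than numbers. A small plain-Python sketch of the pitfall follows, using two counts from the output above and one hypothetical small count.

```python
# Two counts from the report output plus a hypothetical single-digit count.
counts = [395439, 200207, 9]

# Thousands-separated strings, similar in spirit to what format_number produces.
formatted = [f"{n:,}" for n in counts]

numeric_desc = sorted(counts, reverse=True)
text_desc = sorted(formatted, reverse=True)

print(numeric_desc)  # [395439, 200207, 9]
print(text_desc)     # ['9', '395,439', '200,207']
```

The text sort puts `9` first because it compares the leading characters, which is why the formatting step is best kept last.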


### Cancellation Reasons By Airport

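This report returns the number of cancellations by reason for each airport. Rows with an empty **Cancellation Reason** are flights that were not cancelled, because the cell below does not filter on **CANCELLED**.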

In [0]:
# Cancellation reasons by airport

import pyspark.sql.functions as F

# For sending to Snowflake
df_report_4 = (
  df_flights[["Airport","Cancellation Reason"]]
)

# Sending to Snowflake
df_report_4.write.format("snowflake").options(**options).option("dbtable", "df_report_4").mode("overwrite").save()

# For display at Databricks
df_report_4 = (
  df_report_4
  .groupBy("Airport","Cancellation Reason")
  .count()
  .withColumnRenamed("count", "Cancelations")
  .orderBy("Airport","Cancellation Reason", ascending=True)
  .withColumn("Cancelations", F.format_number("Cancelations", 0)) 
)

display(df_report_4)

Airport,Cancellation Reason,Cancelations
Aberdeen Regional Airport,,480
Aberdeen Regional Airport,A,5
Aberdeen Regional Airport,B,1
Abilene Regional Airport,,1657
Abilene Regional Airport,A,8
Abilene Regional Airport,B,70
Abraham Lincoln Capital Airport,,1102
Abraham Lincoln Capital Airport,A,15
Abraham Lincoln Capital Airport,B,34
Abraham Lincoln Capital Airport,C,3


### Delay Reasons By Airport

This report provides the total delay by reason for each airport. The figures are sums of the delay columns, not counts of delayed flights.

In [0]:
# Delay reasons by airport

import pyspark.sql.functions as F

df_report_5 = (
  df_flights[["Airport","Air System Delay","Security Delay","Airline Delay","Late Aircraft Delay","Weather Delay"]]
)

# Sending to Snowflake
df_report_5.write.format("snowflake").options(**options).option("dbtable", "df_report_5").mode("overwrite").save()

# For display at Databricks
df_report_5 = (
  df_report_5
  .groupBy("Airport")
  .sum("Air System Delay","Security Delay","Airline Delay","Late Aircraft Delay","Weather Delay")
  .orderBy("Airport")
  .withColumnRenamed("sum(Air System Delay)", "Air System Delay")
  .withColumnRenamed("sum(Security Delay)", "Security Delay")
  .withColumnRenamed("sum(Airline Delay)", "Airline Delay")
  .withColumnRenamed("sum(Late Aircraft Delay)", "Late Aircraft Delay")
  .withColumnRenamed("sum(Weather Delay)", "Weather Delay")
  .withColumn("Air System Delay", F.format_number("Air System Delay", 0))
  .withColumn("Security Delay", F.format_number("Security Delay", 0))
  .withColumn("Airline Delay", F.format_number("Airline Delay", 0))
  .withColumn("Late Aircraft Delay", F.format_number("Late Aircraft Delay", 0))
  .withColumn("Weather Delay", F.format_number("Weather Delay", 0)) 
)

display(df_report_5)

Airport,Air System Delay,Security Delay,Airline Delay,Late Aircraft Delay,Weather Delay
Aberdeen Regional Airport,1066,9,3949,1986,101
Abilene Regional Airport,3823,46,5646,6000,3769
Abraham Lincoln Capital Airport,4619,44,3061,6759,477
Adak Airport,173,485,43,188,32
Akron-Canton Regional Airport,11321,0,11981,16260,2724
Albany International Airport,9144,93,17693,18323,2521
Albert J. Ellis Airport,1406,0,3335,5288,0
Albuquerque International Sunport,27369,75,44070,72720,6503
Alexandria International Airport,7099,106,12146,12309,2321
Alpena County Regional Airport,344,0,757,3563,228


### Airline With The Largest Number Of Unique Routes

This report returns the airline offering the largest number of unique routes.

In [0]:
# Airline with the most unique routes

from pyspark.sql.functions import concat, col, lit
import pyspark.sql.functions as F

# For sending to Snowflake
df_report_6 = (
  df_flights[["Airline","Airport","DESTINATION_AIRPORT"]]
  # Let's define a flight route as the ordered pair of its origin and destination locations
  .withColumn("Route",concat(col("Airport"),lit('-'),col("DESTINATION_AIRPORT")))
)

# Sending to Snowflake
df_report_6.write.format("snowflake").options(**options).option("dbtable", "df_report_6").mode("overwrite").save()

# For display at Databricks
df_report_6 = (
  df_report_6
  .groupBy("Airline","Route")
  .count()
  .orderBy("Airline", ascending=True)
  .drop("count","Route")
  .groupBy("Airline")
  .count()
  .withColumnRenamed("count", "Unique Routes")
  .orderBy("Unique Routes", ascending=False)
  .withColumn("Unique Routes", F.format_number("Unique Routes", 0))
  .limit(1)
)

display(df_report_6)

Airline,Unique Routes
Atlantic Southeast Airlines,1351
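
As a cross-check on the logic, the unique-route count can be sketched in plain Python on the ten sample rows shown in Step 3. The helper name `unique_routes_by_airline` is hypothetical, and airline codes stand in for airline names. Routes are ordered pairs, so `DEN-IAH` and `IAH-DEN` count separately, matching the concatenated `Route` column.

```python
from collections import defaultdict


def unique_routes_by_airline(flights):
    """Map each airline to its number of distinct (origin, destination) pairs."""
    routes = defaultdict(set)
    for airline, origin, destination in flights:
        routes[airline].add((origin, destination))
    return {airline: len(pairs) for airline, pairs in routes.items()}


# (airline code, origin, destination) from the ten sample rows shown in Step 3.
sample = [
    ("MQ", "TXK", "DFW"), ("B6", "TPA", "LGA"), ("UA", "DEN", "IAH"),
    ("UA", "DEN", "SFO"), ("EV", "RDU", "IAH"), ("OO", "DIK", "MSP"),
    ("EV", "IAH", "SLC"), ("OO", "MSP", "BNA"), ("OO", "MSP", "IMT"),
    ("OO", "FAT", "SEA"),
]
counts = unique_routes_by_airline(sample)
top_airline = max(counts, key=counts.get)
print(top_airline, counts[top_airline])  # OO 4
```

In PySpark, `countDistinct` from `pyspark.sql.functions` could likely replace the two chained `groupBy` calls with a single aggregation.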
