# Group Assignment: C

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
findspark.find()

'/opt/spark-2.4.4-bin-hadoop2.7'

In [3]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('appName').setMaster('local[4]')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [4]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window, WindowSpec

## Introduction to the Flights dataset

According to a 2010 report by the US Federal Aviation Administration, domestic flight delays cost passengers, airlines and other parts of the economy USD 32.9 billion a year. More than half of that amount comes out of passengers' pockets: they lose time waiting for their planes to leave, miss connecting flights, and spend money on food and hotel rooms while they are stranded.

The report, focusing on data from the year 2007, estimated that air transportation delays put a dent of USD 4 billion in the country's gross domestic product in that year. The full report can be found
<a href="http://www.isr.umd.edu/NEXTOR/pubs/TDI_Report_Final_10_18_10_V3.pdf">here</a>.

But what causes these delays?

To answer this question, we analyze the provided dataset, which contains up to 1,936,758 domestic US flights from 2008 together with the causes of their delays, diversions and cancellations, if any.

The data comes from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS).

This dataset is composed of the following variables:
1. **Year** 2008
2. **Month** 1
3. **DayofMonth** 1-31
4. **DayOfWeek** 1 (Monday) - 7 (Sunday)
5. **DepTime** actual departure time (local, hhmm)
6. **CRSDepTime** scheduled departure time (local, hhmm)
7. **ArrTime** actual arrival time (local, hhmm)
8. **CRSArrTime** scheduled arrival time (local, hhmm)
9. **UniqueCarrier** unique carrier code
10. **FlightNum** flight number
11. **TailNum** plane tail number: aircraft registration, unique aircraft identifier
12. **ActualElapsedTime** in minutes
13. **CRSElapsedTime** in minutes
14. **AirTime** in minutes
15. **ArrDelay** arrival delay, in minutes: A flight is counted as "on time" if it operated less than 15 minutes later than the scheduled time shown in the carriers' Computerized Reservations Systems (CRS).
16. **DepDelay** departure delay, in minutes
17. **Origin** origin IATA airport code
18. **Dest** destination IATA airport code
19. **Distance** in miles
20. **TaxiIn** taxi in time, in minutes
21. **TaxiOut** taxi out time in minutes
22. **Cancelled** was the flight cancelled?
23. **CancellationCode** reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24. **Diverted** 1 = yes, 0 = no
25. **CarrierDelay** in minutes: Carrier delay is within the control of the air carrier. Examples of occurrences that may determine carrier delay are: aircraft cleaning, aircraft damage, awaiting the arrival of connecting passengers or crew, baggage, bird strike, cargo loading, catering, computer, outage-carrier equipment, crew legality (pilot or attendant rest), damage by hazardous goods, engineering inspection, fueling, handling disabled passengers, late crew, lavatory servicing, maintenance, oversales, potable water servicing, removal of unruly passenger, slow boarding or seating, stowing carry-on baggage, weight and balance delays.
26. **WeatherDelay** in minutes: Weather delay is caused by extreme or hazardous weather conditions that are forecasted or manifest themselves on point of departure, enroute, or on point of arrival.
27. **NASDelay** in minutes: Delay that is within the control of the National Airspace System (NAS) may include: non-extreme weather conditions, airport operations, heavy traffic volume, air traffic control, etc.
28. **SecurityDelay** in minutes: Security delay is caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
29. **LateAircraftDelay** in minutes: Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport. The ripple effect of an earlier delay at downstream airports is referred to as delay propagation

Read the CSV file using Spark's default delimiter (","). The first line contains the headers so it is not part of the data. Hence we set the header option to true.

In [5]:
flightsDF = spark.read.option("inferSchema", "true").option("header", "true").csv("flights_jan08.csv")
airportsDF = spark.read.option("inferSchema", "true").option("header", "true").csv("Airports.csv")
flightsDF.cache()
airportsDF.cache()

DataFrame[IATA_CODE: string, LINK: string, LOCATION: string, TYPE: string, NAME: string, TERMINALS: int, RUNWAYS: int, BUILD_DATE: int, CITY_POPULATION: int]

## Your topic: Arrival Delay related to the morphology of the arrival airport

The team's goal was to investigate what happens in each of the cities, and if there is a relation between the city and the delay that goes beyond the airports. For that purpose, **you have to get a small dataset with features of the airport**, such as the number of terminals, the build year, the average length of the arriving runway, the number of runways as well as other metrics. No need for a lot of features, just 4 or 5 are fine. Once you have it, answer the following questions:

* Is there a relation between the year it was built and the number of flights arriving to it?
* Are modern airports located mostly in large cities or not necessarily?
* Do airports with more terminals have larger delays, or is the opposite true? What about the runways?
* Is there any threshold in the average arriving flights per terminal, so that above that value the delays tend to increase a lot?
* Discretize the arrival delay as in the reference notebook, and relate it to the average number of arriving flights per runway. Is there a relation? Support your conclusions with data.

## Business Questions:

### 1. Is there a relation between the year it was built and the number of flights arriving to it?

In [6]:
DestDF = flightsDF.join(airportsDF,col('Dest')==col('IATA_CODE'))\
                      .withColumn('YEARS_BUILT',(2008-col('BUILD_DATE')))\
                      .withColumn('AGE_CATEGORY', when(col("YEARS_BUILT")<=40,"MODERN")\
                            .when((col("YEARS_BUILT")>40) & (col("YEARS_BUILT")<=70),"MIDDLE-AGE")\
                            .when((col("YEARS_BUILT")>70),"OLD"))\
                      .withColumn('CITY_SIZE', when(col("CITY_POPULATION")<=550000,"SMALL")\
                            .when((col("CITY_POPULATION")>550000) & (col("CITY_POPULATION")<=1600000),"MEDIUM")\
                            .when((col("CITY_POPULATION")>1600000),"LARGE"))
DestDF.cache()


DataFrame[Year: int, Month: int, DayofMonth: int, DayOfWeek: int, DepTime: string, CRSDepTime: int, ArrTime: string, CRSArrTime: int, UniqueCarrier: string, FlightNum: int, TailNum: string, ActualElapsedTime: string, CRSElapsedTime: int, AirTime: string, ArrDelay: string, DepDelay: string, Origin: string, Dest: string, Distance: int, TaxiIn: string, TaxiOut: string, Cancelled: int, CancellationCode: string, Diverted: int, CarrierDelay: string, WeatherDelay: string, NASDelay: string, SecurityDelay: string, LateAircraftDelay: string, IATA_CODE: string, LINK: string, LOCATION: string, TYPE: string, NAME: string, TERMINALS: int, RUNWAYS: int, BUILD_DATE: int, CITY_POPULATION: int, YEARS_BUILT: int, AGE_CATEGORY: string, CITY_SIZE: string]

In [7]:
oldDF = DestDF.select(col('Dest').alias('AIRPORT'),'BUILD_DATE','YEARS_BUILT','AGE_CATEGORY')\
                  .groupBy('AIRPORT','BUILD_DATE','YEARS_BUILT','AGE_CATEGORY')\
                  .agg(count(lit(1)).alias('ARRIVALS'))\
                  .withColumn('RATIO', round(col('ARRIVALS')/col('YEARS_BUILT'),2))\
                  .orderBy('RATIO',ascending=False)
oldDF.cache()

oldDF2= oldDF.groupBy('AGE_CATEGORY')\
             .agg(round(avg('YEARS_BUILT'),1).alias('Avg_YEARS_BUILT'),\
                  count(lit(1)).alias('TOTAL_AIRPORTS'),\
                  sum('ARRIVALS').alias('TOTAL_ARRIVALS'),\
                  round(avg('ARRIVALS'),1).alias('Avg_ARRIVALS'))\
             .withColumn('ARRIVALS PROPORTION %', round(col('TOTAL_ARRIVALS')/(sum('TOTAL_ARRIVALS').over(Window.partitionBy()))*100,1))\
             .orderBy('Avg_ARRIVALS',ascending=False)

oldDF2.show()

+------------+---------------+--------------+--------------+------------+---------------------+
|AGE_CATEGORY|Avg_YEARS_BUILT|TOTAL_AIRPORTS|TOTAL_ARRIVALS|Avg_ARRIVALS|ARRIVALS PROPORTION %|
+------------+---------------+--------------+--------------+------------+---------------------+
|         OLD|           78.8|            32|         43143|      1348.2|                 43.1|
|  MIDDLE-AGE|           60.8|            41|         50164|      1223.5|                 50.2|
|      MODERN|           21.8|             8|          6693|       836.6|                  6.7|
+------------+---------------+--------------+--------------+------------+---------------------+



An aggregation was conducted on the destination airports from the 2008 US Flights Dataset, and three main categories were created based on the airport's build year:

1. OLD: More than 70 years old.
2. MIDDLE-AGE: More than 40 and up to 70 years old.
3. MODERN: 40 years old or younger.
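
The bucketing rule can be sketched in plain Python, mirroring the boundaries of the `when` chain in cell `In [6]`. This is only an illustration: the function name `age_category` and the sample build years are assumptions chosen to exercise the boundaries, not values taken from `Airports.csv`.

```python
def age_category(years_built):
    """Return the age bucket for an airport that is `years_built` years old in 2008."""
    if years_built is None:
        return None  # the Spark when() chain without otherwise() also yields null here
    if years_built <= 40:
        return "MODERN"
    if years_built <= 70:
        return "MIDDLE-AGE"
    return "OLD"


# Hypothetical build years, picked only to hit each boundary.
for build_year in (1968, 1967, 1938, 1937):
    print(build_year, age_category(2008 - build_year))
```

Note that the boundaries are inclusive on the upper side: an airport that is exactly 40 years old counts as MODERN, and one that is exactly 70 counts as MIDDLE-AGE.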

#### Quick observations: 
- More than 90% of the airports are older than 40 years.
- Considering only arrivals by AGE CATEGORY, the table above shows that MIDDLE-AGE airports account for about 50% of the total arrivals.

To evaluate the relation between the year an airport was built and the number of flights arriving at it, the team calculated the AVERAGE ARRIVALS for each AGE CATEGORY, which compensates for the uneven number of airports per category:

#### Based on this measure we can affirm that there is a direct relation between the age of an airport and its arrivals, whereby:    
- The older the airport, the higher the average number of arrivals per airport: the 32 "OLD" airports had the highest average number of arrivals for 2008, with 1,348 flights per airport. 
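
As a cross-check, the per-category averages and shares can be recomputed from the counts printed in the table above. This is a minimal sketch in plain Python; the variable names are ours, and it should be run as a standalone script because the notebook's `from pyspark.sql.functions import *` replaces the built-in `round` and `sum`.

```python
# (TOTAL_AIRPORTS, TOTAL_ARRIVALS) per AGE_CATEGORY, copied from the table above.
groups = {"OLD": (32, 43143), "MIDDLE-AGE": (41, 50164), "MODERN": (8, 6693)}

total_airports = sum(airports for airports, _ in groups.values())
total_arrivals = sum(arrivals for _, arrivals in groups.values())

summary = {
    name: {
        "avg_arrivals": round(arrivals / airports, 1),
        "arrivals_pct": round(arrivals / total_arrivals * 100, 1),
    }
    for name, (airports, arrivals) in groups.items()
}

# Share of airports older than 40 years (OLD plus MIDDLE-AGE).
older_than_40_pct = round(
    (groups["OLD"][0] + groups["MIDDLE-AGE"][0]) / total_airports * 100, 1
)

for name, stats in summary.items():
    print(name, stats)
print("Airports older than 40 years (%):", older_than_40_pct)
```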

### 2. Are modern airports located mostly in large cities or not necessarily?

In [8]:
popDF = DestDF.groupBy('CITY_SIZE')\
                  .pivot('AGE_CATEGORY')\
                  .agg(countDistinct('Dest'))\
                  .select('CITY_SIZE',col('MODERN').alias('MODERN AIRPORTS'))\
                  .orderBy('MODERN AIRPORTS', ascending = False)\
                  .withColumn('Proportion %', round(col('MODERN AIRPORTS')/(sum('MODERN AIRPORTS').over(Window.partitionBy()))*100,1))
popDF.show()

popDF2 = DestDF.groupBy('AGE_CATEGORY')\
                  .pivot('CITY_SIZE')\
                  .agg(countDistinct('Dest'))
                  
popDF2= popDF2.withColumn('LARGE CITY Prop%', round(col('LARGE')/(sum('LARGE').over(Window.partitionBy()))*100,1))\
            .withColumn('MEDIUM CITY Prop%', round(col('MEDIUM')/(sum('MEDIUM').over(Window.partitionBy()))*100,1))\
            .withColumn('SMALL CITY Prop%', round(col('SMALL')/(sum('SMALL').over(Window.partitionBy()))*100,1))

popDF2.orderBy(col("LARGE CITY Prop%").desc(),col("MEDIUM CITY Prop%").desc(),col("SMALL CITY Prop%").desc()).show()


+---------+---------------+------------+
|CITY_SIZE|MODERN AIRPORTS|Proportion %|
+---------+---------------+------------+
|   MEDIUM|              4|        50.0|
|    SMALL|              3|        37.5|
|    LARGE|              1|        12.5|
+---------+---------------+------------+

+------------+-----+------+-----+----------------+-----------------+----------------+
|AGE_CATEGORY|LARGE|MEDIUM|SMALL|LARGE CITY Prop%|MEDIUM CITY Prop%|SMALL CITY Prop%|
+------------+-----+------+-----+----------------+-----------------+----------------+
|         OLD|    3|     7|   22|            75.0|             31.8|            40.0|
|      MODERN|    1|     4|    3|            25.0|             18.2|             5.5|
|  MIDDLE-AGE| null|    11|   30|            null|             50.0|            54.5|
+------------+-----+------+-----+----------------+-----------------+----------------+



To ascertain whether modern airports are located in large cities, the team aggregated the MODERN AIRPORTS according to the CITY_SIZE category defined as follows:

1. SMALL: City population ≤ 550K people.
2. MEDIUM: City population > 550K and ≤ 1.6M people.
3. LARGE: City population > 1.6M people.
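
The city-size buckets, and the share of MODERN airports that fall in small or medium cities according to the first output table of this section, can be sketched as follows. The function name `city_size` is an assumption for illustration; the thresholds mirror cell `In [6]`, and the counts are copied from the printed output. Run it as a standalone script, since the notebook's wildcard PySpark import replaces the built-in `sum`.

```python
def city_size(population):
    """Return the CITY_SIZE bucket, using the same thresholds as the when() chain."""
    if population is None:
        return None  # a null population stays unclassified, as in Spark
    if population <= 550_000:
        return "SMALL"
    if population <= 1_600_000:
        return "MEDIUM"
    return "LARGE"


# Number of MODERN airports per CITY_SIZE, copied from the printed table.
modern_by_city_size = {"MEDIUM": 4, "SMALL": 3, "LARGE": 1}

total_modern = sum(modern_by_city_size.values())
small_or_medium_pct = (
    (modern_by_city_size["SMALL"] + modern_by_city_size["MEDIUM"]) / total_modern * 100
)
print(f"{small_or_medium_pct:.1f}% of MODERN airports are in small or medium cities")
```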

Quick Observations:

- There are only 8 modern airports regardless of CITY_SIZE.
- 87.5% of MODERN AIRPORTS are in small or medium cities.
- Only 1 MODERN airport is in a large city.

To further investigate the relationship between the AGE of the AIRPORT and the CITY_SIZE, we calculated the share of each AGE_CATEGORY within each CITY_SIZE.

Based on this investigation it can be concluded:

#### The United States does not have many “MODERN” airports in LARGE cities. Moreover, airports in large cities are typically “OLD” (3 of 4, or 75%).

### 3. Do airports with more terminals have larger delays, or is the opposite true? What about the runways?

#### 3.1. Terminal Delays

In [9]:
TerminalDF=DestDF.where(col("ArrDelay")!="NA")\
              .where(col("Cancelled")==0)\
              .groupBy('TERMINALS')\
              .agg(count(lit(1)).alias("NUM FLIGHTS"),\
                   countDistinct('DEST').alias('NUM AIRPORTS'),\
                  round(avg("ArrDelay"),1).alias("AVG FLIGHT DELAY/AIRPORT(Min)")) 
TerminalDF.cache()

TerminalDF=TerminalDF.withColumn('AVG FLIGHT DELAY/TERMINAL(Min)', round(col('AVG FLIGHT DELAY/AIRPORT(Min)')/col('TERMINALS'),1))\
          .orderBy(col('AVG FLIGHT DELAY/TERMINAL(Min)').desc())
TerminalDF.show()

+---------+-----------+------------+-----------------------------+------------------------------+
|TERMINALS|NUM FLIGHTS|NUM AIRPORTS|AVG FLIGHT DELAY/AIRPORT(Min)|AVG FLIGHT DELAY/TERMINAL(Min)|
+---------+-----------+------------+-----------------------------+------------------------------+
|        1|      37580|          44|                          4.2|                           4.2|
|        2|      38712|          21|                          6.8|                           3.4|
|        4|       1929|           2|                         11.9|                           3.0|
|        3|      13174|          10|                          6.2|                           2.1|
|        9|       3283|           1|                         11.2|                           1.2|
|        7|       1678|           1|                          3.0|                           0.4|
|        5|       2342|           2|                         -0.6|                          -0.1|
+---------+-----------+------------+-----------------------------+------------------------------+

#### To answer this question, it is important to understand the definition of terminals, concourses and gates:

"A concourse is that part of a Terminal where Aircraft park at Gates to exchange Passengers. A Terminal may have more than one Concourse. A Concourse may have more than one Gate." -  [Terminals by FAA](https://www.faa.gov/)


#### Quick observations: 
- As expected, airports with only 1 terminal have the highest average delay per terminal (4.2 minutes).
- The 2 airports with 5 terminals were the most operationally efficient in 2008 according to average flight delay per terminal, with a significant number of flights (over 2,000).

To explore whether airports with more terminals have larger delays, the airports were aggregated by their number of terminals, and both the AVERAGE FLIGHT DELAY per AIRPORT and the AVERAGE DELAY per TERMINAL were calculated: 

#### Based on these measures a trend could be seen, whereby the smaller the number of terminals, the larger the delay per terminal:

However, there are exceptions. The two airports with 5 terminals show no delay on average (-0.6 minutes per airport), and the two airports with 9 and 7 terminals do not follow the trend either, with delays of 1.2 and 0.4 minutes per terminal, respectively.

We therefore suggest extending the analysis with other measures, such as the number of gates, which are likely a better indicator of traffic and delays at an airport (during our research we found that some of the largest airports in the US have one terminal but many concourses and gates).
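
To make the two measures easier to compare, the sketch below recomputes the delay per terminal from the per-airport averages printed in the table above and reports which terminal count ranks worst under each measure. It is plain Python with names of our choosing, meant to be run as a standalone script (in the notebook, `round` and `max` are replaced by PySpark functions).

```python
# (TERMINALS, average arrival delay per airport in minutes), copied from the table above.
rows = [(1, 4.2), (2, 6.8), (4, 11.9), (3, 6.2), (9, 11.2), (7, 3.0), (5, -0.6)]

# Same arithmetic as the notebook: average delay divided by the number of terminals.
per_terminal = {terminals: round(delay / terminals, 1) for terminals, delay in rows}

# Terminal count with the largest value under each measure.
worst_per_airport = max(rows, key=lambda row: row[1])[0]
worst_per_terminal = max(per_terminal, key=per_terminal.get)

print("Delay per terminal:", per_terminal)
print("Worst by average delay per airport:", worst_per_airport, "terminals")
print("Worst by average delay per terminal:", worst_per_terminal, "terminal(s)")
```

The two rankings disagree: by average delay per airport the 4-terminal group is worst, while by delay per terminal the 1-terminal group is. Because the divisor grows with the terminal count, part of the downward trend in the per-terminal figure comes from the metric itself, which is one more reason to look at additional measures.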

#### 3.2. Runway Delays

In [10]:
RunwayDF=DestDF.where(col("ArrDelay")!="NA")\
              .where(col("Cancelled")==0)\
              .groupBy('RUNWAYS')\
              .agg(count(lit(1)).alias("NUM FLIGHTS"),\
                   countDistinct('Dest').alias('NUM AIRPORTS'),\
                  round(avg("ArrDelay"),1).alias("AVG FLIGHT DELAY/AIRPORT(Min)")) 
RunwayDF.cache()

RunwayDF=RunwayDF.withColumn('AVG FLIGHT DELAY/RUNWAY(Min)', round(col('AVG FLIGHT DELAY/AIRPORT(Min)')/col('RUNWAYS'),1))\
          .orderBy(col('AVG FLIGHT DELAY/RUNWAY(Min)').desc())
RunwayDF.show()

+-------+-----------+------------+-----------------------------+----------------------------+
|RUNWAYS|NUM FLIGHTS|NUM AIRPORTS|AVG FLIGHT DELAY/AIRPORT(Min)|AVG FLIGHT DELAY/RUNWAY(Min)|
+-------+-----------+------------+-----------------------------+----------------------------+
|      1|       3556|           3|                          9.0|                         9.0|
|      2|      25026|          29|                          5.6|                         2.8|
|      4|      31074|          14|                          7.1|                         1.8|
|      3|      30779|          30|                          4.4|                         1.5|
|      5|       6117|           3|                          4.6|                         0.9|
|      6|       2146|           2|                          4.7|                         0.8|
+-------+-----------+------------+-----------------------------+----------------------------+



Using the same approach, we also investigated whether airports with more RUNWAYS have larger delays.

#### Quick observations: 

- The 2 airports with 6 runways had the lowest average flight delay per runway (0.8 minutes) in 2008, with a significant number of flights (over 2,000).

Airports were aggregated by the number of runways, and the AVERAGE FLIGHT DELAY per AIRPORT and the AVERAGE DELAY per RUNWAY were calculated: 

#### Based on these measures a trend was observed whereby the larger the number of runways, the smaller the delay per runway:
- As expected, airports with only 1 runway have the highest delays on average (9.0 minutes).
- The number of runways appears to be related to the operational efficiency of the airport. However, airports with 3 runways are an exception: they have the lowest Average Flight Delay per Airport (4.4 minutes). Future analysis should therefore include more explanatory variables (factors) that could affect delays.

### 4. Is there any threshold in the average arriving flights per terminal, so that above that value the delays tend to increase a lot?

In [11]:
ThresholdDF=DestDF.where(col("ArrDelay")!="NA")\
                  .where(col("Cancelled")==0)\
                  .select(col('Dest').alias('AIRPORT'),'TERMINALS','ArrDelay')\
                  .groupBy('AIRPORT','TERMINALS')\
                  .agg(count(lit(1)).alias('NUM FLIGHTS'),\
                       round((count(lit(1))/col('TERMINALS')),1).alias('AVG FLIGHTS/TERM'),\
                       round(avg('ArrDelay')/col('TERMINALS'),1).alias('AVG FLIGHT DELAY/TERMINAL'))\
                  .orderBy(col('AVG FLIGHT DELAY/TERMINAL').desc(),col('AVG FLIGHTS/TERM').asc())

ThresholdDF.cache()
ThresholdDF.show(10)

print("Linear Correlation Average Flight per Terminal vs Average Delay per Terminal")
ThresholdDF.select(round(corr(col("AVG FLIGHT DELAY/TERMINAL"),col("AVG FLIGHTS/TERM")),2).alias("Correlation")).show()

TermDF=TerminalDF.withColumn('AVG FLIGHTS/TERM', round(col('NUM FLIGHTS')/(col('TERMINALS')*col('NUM AIRPORTS')),1))\
          .select('TERMINALS','NUM AIRPORTS','NUM FLIGHTS', 'AVG FLIGHTS/TERM', 'AVG FLIGHT DELAY/TERMINAL(Min)')\
          .orderBy(col('AVG FLIGHTS/TERM').asc())

TermDF.show()
print("Linear Correlation Average Flight per Terminal vs Average Delay per Terminal")
TermDF.select(round(corr(col("AVG FLIGHT DELAY/TERMINAL(Min)"),col("AVG FLIGHTS/TERM")),2).alias("Correlation")).show()


+-------+---------+-----------+----------------+-------------------------+
|AIRPORT|TERMINALS|NUM FLIGHTS|AVG FLIGHTS/TERM|AVG FLIGHT DELAY/TERMINAL|
+-------+---------+-----------+----------------+-------------------------+
|    SAV|        1|          1|             1.0|                     31.0|
|    GSO|        1|          2|             2.0|                     27.0|
|    ROC|        1|          1|             1.0|                     24.0|
|    MYR|        1|          1|             1.0|                     18.0|
|    RIC|        1|          1|             1.0|                     13.0|
|    BOI|        1|        637|           637.0|                     10.0|
|    EUG|        1|         26|            26.0|                      9.3|
|    BFL|        2|         60|            30.0|                      9.0|
|    MAF|        1|        316|           316.0|                      8.9|
|    SFO|        4|        698|           174.5|                      8.7|
+-------+---------+-----------+----------------+-------------------------+

To identify whether there is a threshold in the average arriving flights per terminal, the team calculated the AVG FLIGHTS per TERMINAL and compared it with the AVG FLIGHT DELAY per TERMINAL. 

#### Quick observations: 

Analyzing the airports individually, there is no linear relation between the average flights per terminal and the average delay per terminal: the correlation value is close to 0, so the relationship between both appears random. Under this condition, it was not possible to identify a threshold at any level. 

The team then aggregated the airports by their number of terminals to try to detect clearer patterns:

- The airports with the lowest average number of flights per terminal (234) have no delays, but the airports with the highest average delay per terminal (4.2 minutes, at 854 flights per terminal) are not the ones with the highest average flights per terminal. 
- Even though there is a linear dependency between them at this aggregated level (positive correlation value close to 1), other explanatory variables should still be analyzed to account for the remaining random behavior (as mentioned before, the number of gates could be more indicative). 
- There is no obvious threshold in the average number of flights per terminal above which delays start to increase.
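
The group-level comparison can be reproduced without Spark from the table printed in section 3.1. The sketch below recomputes the average flights per terminal with the same formula as `TermDF` and then a Pearson correlation against the delay per terminal. The `pearson` helper is our own plain-Python stand-in for Spark's `corr`, so treat it as an illustration rather than the exact computation the notebook ran; run it as a standalone script, since the notebook's wildcard import replaces `round` and `sum`.

```python
from math import sqrt

# (TERMINALS, NUM FLIGHTS, NUM AIRPORTS, avg delay per terminal in minutes),
# copied from the table printed in section 3.1.
rows = [
    (1, 37580, 44, 4.2),
    (2, 38712, 21, 3.4),
    (4, 1929, 2, 3.0),
    (3, 13174, 10, 2.1),
    (9, 3283, 1, 1.2),
    (7, 1678, 1, 0.4),
    (5, 2342, 2, -0.1),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Same formula as TermDF: flights / (terminals * airports in the group).
flights_per_terminal = [
    round(flights / (terminals * airports), 1)
    for terminals, flights, airports, _ in rows
]
delay_per_terminal = [delay for *_, delay in rows]

r = pearson(flights_per_terminal, delay_per_terminal)
for (terminals, *_), load, delay in zip(rows, flights_per_terminal, delay_per_terminal):
    print(terminals, load, delay)
print(f"Pearson r = {r:.2f}")
```

On these seven aggregated rows the coefficient is positive, but seven points is a very small sample, so it supports the cautious reading above rather than a firm threshold.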

#### Based on this measure, a threshold for the average number of flights per terminal above which the average flight delay per terminal increases cannot be definitively identified.




### 5. Discretize the arrival delay as in the reference notebook, and relate it to the average number of arriving flights per runway. Is there a relation? Support your conclusions with data.

In [12]:
RunDF=RunwayDF.withColumn('AVG FLIGHTS/RUNWAY', round(col('NUM FLIGHTS')/(col('RUNWAYS')*col('NUM AIRPORTS')),1))\
          .select('RUNWAYS','NUM AIRPORTS','NUM FLIGHTS', 'AVG FLIGHTS/RUNWAY', 'AVG FLIGHT DELAY/RUNWAY(Min)')\
          .orderBy(col('AVG FLIGHTS/RUNWAY').asc())

TotalFlightsDF = DestDF.groupBy("RUNWAYS").agg(count("ArrDelay").alias("TotalFlights"))

DelayDF = DestDF.where(col("ArrDelay")!="NA")\
                               .withColumn("DelaySeverity", when(col("ArrDelay")<=0,"1.nodelay")\
                               .when((col("ArrDelay")>0) & (col("ArrDelay")<=15),"2.acceptable")\
                               .when((col("ArrDelay")>15) & (col("ArrDelay")<=30),"3.annoying")\
                               .when((col("ArrDelay")>30) & (col("ArrDelay")<=60),"4.impactful")\
                               .otherwise("5.unacceptable"))

DelayDF.cache() 

SevereDelaysDF = DelayDF.where((col("Cancelled")==0))\
                           .where((col("DelaySeverity")!="1.nodelay") & (col("DelaySeverity")!="2.acceptable"))\
                           .withColumn("IntArrDelay", col("ArrDelay").cast(IntegerType()))\
                           .select("DelaySeverity", "IntArrDelay","RUNWAYS")\
                           .groupBy("RUNWAYS", "DelaySeverity")\
                           .agg(count("IntArrDelay").alias("NumSevereDelayedFlights"))

combinedDF = SevereDelaysDF.join(TotalFlightsDF, "RUNWAYS")\
                             .withColumn("SevereDelayedRatio", round(col("NumSevereDelayedFlights")/col("TotalFlights")*100,1))\
                             .orderBy(col("SevereDelayedRatio").desc())
combinedDF.cache()


combinedDF=combinedDF.groupBy("RUNWAYS")\
          .pivot("DelaySeverity")\
          .min("SevereDelayedRatio")\
          .orderBy(col("`5.unacceptable`").desc(), col("`4.impactful`").desc(), col("`3.annoying`").desc())

FinalDF= RunDF.join(combinedDF,"RUNWAYS")\
              .withColumn("Total Severe Delays %", round(col("`5.unacceptable`")+col("`4.impactful`")+col("`3.annoying`"),1))\
              .select('RUNWAYS', 'AVG FLIGHTS/RUNWAY',"Total Severe Delays %","`5.unacceptable`","`4.impactful`","`3.annoying`")\
              .orderBy(col('AVG FLIGHTS/RUNWAY').asc())

print("Runways with % severe delayed flights by category:")
FinalDF.show()


Runways with % severe delayed flights by category:
+-------+------------------+---------------------+--------------+-----------+----------+
|RUNWAYS|AVG FLIGHTS/RUNWAY|Total Severe Delays %|5.unacceptable|4.impactful|3.annoying|
+-------+------------------+---------------------+--------------+-----------+----------+
|      6|             178.8|                 19.3|           4.8|        5.7|       8.8|
|      3|             342.0|                 17.4|           4.1|        5.6|       7.7|
|      5|             407.8|                 16.6|           4.8|        5.3|       6.5|
|      2|             431.5|                 18.3|           4.1|        6.0|       8.2|
|      4|             554.9|                 20.5|           5.4|        6.5|       8.6|
|      1|            1185.3|                 22.7|           5.9|        7.7|       9.1|
+-------+------------------+---------------------+--------------+-----------+----------+



In general terms, any busy airport with 1-2 runways has to run them as efficiently as possible, with little room for error; otherwise it ends up with delays. These delays can be severe or acceptable depending on the US FAA standards.

For this analysis the team compared the average number of arriving flights per runway with a classification of the arrival delay as follows:

1. No Delay: Delay ≤ 0 minutes. 
2. Acceptable: Delay > 0 and ≤ 15 minutes.
3. Annoying: Delay > 15 and ≤ 30 minutes.
4. Impactful: Delay > 30 and ≤ 60 minutes.
5. Unacceptable: Delay > 60 minutes.
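
The classification can be sketched as a plain Python function with the same boundaries as the `when` chain in cell `In [12]`. The names `delay_severity`, `SEVERE` and `is_severe` are ours, and the sample delays are hypothetical values chosen to show each boundary.

```python
def delay_severity(arr_delay):
    """Map an arrival delay in minutes to the label used in DelayDF."""
    if arr_delay <= 0:
        return "1.nodelay"
    if arr_delay <= 15:
        return "2.acceptable"
    if arr_delay <= 30:
        return "3.annoying"
    if arr_delay <= 60:
        return "4.impactful"
    return "5.unacceptable"


# Labels that the analysis groups together as severe delays.
SEVERE = {"3.annoying", "4.impactful", "5.unacceptable"}


def is_severe(arr_delay):
    return delay_severity(arr_delay) in SEVERE


# Hypothetical delays, one on each side of every boundary.
for minutes in (-5, 0, 1, 15, 16, 30, 31, 60, 61):
    print(minutes, delay_severity(minutes), is_severe(minutes))
```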

The last three categories are considered "Severe Delays", and the team focused on understanding the behavior of this macro-category for the airports grouped by the number of runways:


 
Based on this investigation we can conclude:

#### In general, the airports with the highest average number of flights per runway have more severe delays:

- The airports with the highest average flights per runway (1,185) have the highest share of severely delayed flights (22.7%).
- There is a clear exception to this statement: the airports with 6 runways have the lowest average number of flights per runway (179) yet still show a high share of severe delays (19.3% in total).
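
To support these two points with the numbers in the table above, the sketch below rebuilds the total severe-delay share from its three components and compares the ordering of the runway groups by load with their ordering by severe share. It is plain Python with our own variable names; run it as a standalone script, since the notebook's wildcard import replaces the built-in `round`.

```python
# (RUNWAYS, avg flights per runway, % unacceptable, % impactful, % annoying),
# copied from the table above.
rows = [
    (6, 178.8, 4.8, 5.7, 8.8),
    (3, 342.0, 4.1, 5.6, 7.7),
    (5, 407.8, 4.8, 5.3, 6.5),
    (2, 431.5, 4.1, 6.0, 8.2),
    (4, 554.9, 5.4, 6.5, 8.6),
    (1, 1185.3, 5.9, 7.7, 9.1),
]

# Total severe share = sum of the three severe categories, as in FinalDF.
severe_pct = {
    runways: round(unacceptable + impactful + annoying, 1)
    for runways, _, unacceptable, impactful, annoying in rows
}

# Runway groups ordered from lowest to highest load, and from lowest to highest severe share.
by_load = [row[0] for row in sorted(rows, key=lambda row: row[1])]
by_severity = sorted(severe_pct, key=severe_pct.get)

print("Severe delay share (%):", severe_pct)
print("Ordered by flights per runway:", by_load)
print("Ordered by severe share:", by_severity)
```

The busiest group (1 runway) comes last in both orderings, while the 6-runway group is first by load but only fourth by severe share, which is the exception noted above.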

### Sources:

- https://www.faa.gov/airports/airport_safety/airportdata_5010/
- https://aci.aero/data-centre
- https://www.citypopulation.de/en/usa/states/admin/
