
**The data set ‘Domestic Flight Delay Records’ contains comprehensive data about domestic flights in the United States, such as flight dates, airtime, flight distance, scheduled departure/arrival times and departure and arrival delays.**



Using the provided data set, complete the following tasks **with PySpark**.

**Task 1:** Create a function that gives back how many flights arrived earlier than expected.

**Task 2:** Create a function that determines the typical departure time for flights over 2000 miles.

**Task 3:** Create a function that gives back the proportion of flights that have arrival delays longer than 60 minutes.

**Task 4:** Create a function that gives the average airtime for flights that left earlier than 9:00 am.

**Task 5:** Create a function that determines the maximum arrival delay for flights that did not experience a delay upon departure.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('FlightRecord').getOrCreate()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df_flight = spark.read.csv('/content/drive/MyDrive/Assignments/EDA/Flight Dataset - CSV(in).csv', header=True)
df_flight.show(5)
df_flight.printSchema()

+--------+---------+---------+--------+--------+---------+---------+
| FL_DATE|DEP_DELAY|ARR_DELAY|AIR_TIME|DISTANCE| DEP_TIME| ARR_TIME|
+--------+---------+---------+--------+--------+---------+---------+
|1/1/2006|        5|       19|     350|    2475| 9.083333|12.483334|
|1/2/2006|      167|      216|     343|    2475|11.783334|15.766666|
|1/3/2006|       -7|       -2|     344|    2475| 8.883333|12.133333|
|1/4/2006|       -5|      -13|     331|    2475| 8.916667|    11.95|
|1/5/2006|       -3|      -17|     321|    2475|     8.95|11.883333|
+--------+---------+---------+--------+--------+---------+---------+
only showing top 5 rows
root
 |-- FL_DATE: string (nullable = true)
 |-- DEP_DELAY: string (nullable = true)
 |-- ARR_DELAY: string (nullable = true)
 |-- AIR_TIME: string (nullable = true)
 |-- DISTANCE: string (nullable = true)
 |-- DEP_TIME: string (nullable = true)
 |-- ARR_TIME: string (nullable = true)



### Clean the data

* Convert FL_DATE to a date
* Cast the numeric columns (DEP_DELAY, ARR_DELAY, AIR_TIME, DISTANCE) to integer


In [None]:
from pyspark.sql import functions as F
df_cleaned = df_flight.withColumn("FL_DATE", F.to_date("FL_DATE", "M/d/yyyy"))
df_cleaned = df_cleaned.withColumn("DEP_DELAY", df_cleaned["DEP_DELAY"].cast("int"))
df_cleaned = df_cleaned.withColumn("AIR_TIME", df_cleaned["AIR_TIME"].cast("int"))
df_cleaned = df_cleaned.withColumn("ARR_DELAY", df_cleaned["ARR_DELAY"].cast("int"))
df_cleaned = df_cleaned.withColumn("DISTANCE", df_cleaned["DISTANCE"].cast("int"))

df_cleaned.printSchema()

root
 |-- FL_DATE: date (nullable = true)
 |-- DEP_DELAY: integer (nullable = true)
 |-- ARR_DELAY: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- DEP_TIME: string (nullable = true)
 |-- ARR_TIME: string (nullable = true)



In [None]:
from pyspark.sql import functions as F

# Convert DEP_TIME and ARR_TIME to double, overwriting the original string columns
df_cleaned = df_cleaned.withColumn("DEP_TIME", F.col("DEP_TIME").cast("double"))
df_cleaned = df_cleaned.withColumn("ARR_TIME", F.col("ARR_TIME").cast("double"))

# Extract hours for DEP_TIME and ARR_TIME (now that they are double)
df_cleaned = df_cleaned.withColumn("DEP_HOUR", F.floor(F.col("DEP_TIME")).cast("int"))
df_cleaned = df_cleaned.withColumn("ARR_HOUR", F.floor(F.col("ARR_TIME")).cast("int"))

# Show the updated schema and some data to verify the conversion
df_cleaned.printSchema()
df_cleaned.select("FL_DATE", "DEP_TIME", "DEP_HOUR", "ARR_TIME", "ARR_HOUR").show(5)

root
 |-- FL_DATE: date (nullable = true)
 |-- DEP_DELAY: integer (nullable = true)
 |-- ARR_DELAY: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- DEP_TIME: double (nullable = true)
 |-- ARR_TIME: double (nullable = true)
 |-- DEP_HOUR: integer (nullable = true)
 |-- ARR_HOUR: integer (nullable = true)

+----------+---------+--------+---------+--------+
|   FL_DATE| DEP_TIME|DEP_HOUR| ARR_TIME|ARR_HOUR|
+----------+---------+--------+---------+--------+
|2006-01-01| 9.083333|       9|12.483334|      12|
|2006-01-02|11.783334|      11|15.766666|      15|
|2006-01-03| 8.883333|       8|12.133333|      12|
|2006-01-04| 8.916667|       8|    11.95|      11|
|2006-01-05|     8.95|       8|11.883333|      11|
+----------+---------+--------+---------+--------+
only showing top 5 rows


#### **Task 1:** Create a function that gives back how many flights arrived earlier than expected.

In [None]:
import pyspark.sql.functions as F
from pyspark.sql.functions import col

def flights_arrived_early(df):
  # A negative ARR_DELAY means the flight arrived ahead of schedule
  return df.filter(col("ARR_DELAY") < 0).count()

print("Flights that arrived earlier than expected: ", flights_arrived_early(df_cleaned))

Flights that arrived earlier than expected:  534655


#### **Task 2:** Create a function that determines the typical departure time for flights over 2000 miles.

In [None]:
def typical_departure_time(df):
  # DEP_TIME is in decimal hours (e.g. 9.083333 is 09:05), so the average is also in decimal hours
  return df.filter(col("DISTANCE") > 2000).select(F.avg("DEP_TIME")).collect()[0][0]

print("Typical departure time for flights over 2000 miles: ", round(typical_departure_time(df_cleaned), 2))

Typical departure time for flights over 2000 miles:  13.97
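
The result 13.97 is in decimal hours, the same unit as the DEP_TIME column, so it reads more naturally as a clock time. A minimal sketch of that conversion follows; `decimal_hours_to_clock` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
def decimal_hours_to_clock(hours: float) -> str:
    """Format decimal hours (e.g. 13.97) as HH:MM, rounded to the nearest minute."""
    total_minutes = round(hours * 60)
    h, m = divmod(total_minutes, 60)
    # h % 24 wraps a value that rounds up to 24:00 back to 00:00
    return f"{h % 24:02d}:{m:02d}"

print(decimal_hours_to_clock(13.97))     # 13:58
print(decimal_hours_to_clock(9.083333))  # 09:05
```

Under this reading, the typical departure time for flights over 2000 miles is roughly 13:58.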


#### **Task 3:** Create a function that gives back the proportion of flights that have arrival delays longer than 60 minutes.

In [None]:
def arrival_delays_longer_than_60(df):
  # Numerator: rows with ARR_DELAY > 60; denominator: all rows, including any with a null ARR_DELAY
  return df.filter(col("ARR_DELAY") > 60).count() / df.count()

print("Proportion of flights that have arrival delays longer than 60 minutes: ", round(arrival_delays_longer_than_60(df_cleaned), 2))

Proportion of flights that have arrival delays longer than 60 minutes:  0.05
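
One detail worth checking is the denominator: the comparison `ARR_DELAY > 60` never matches a missing value, while a plain row count still includes it. Whether this data set actually contains missing arrival delays is not shown above, so the plain-Python sketch below uses made-up sample values only to illustrate how the two choices differ; `proportion_over` is a hypothetical helper, not part of the notebook.

```python
from typing import Optional, Sequence

def proportion_over(delays: Sequence[Optional[int]],
                    threshold: int = 60,
                    skip_missing: bool = False) -> float:
    """Share of delays above `threshold`; optionally ignore missing values in the denominator."""
    late = sum(1 for d in delays if d is not None and d > threshold)
    if skip_missing:
        total = sum(1 for d in delays if d is not None)
    else:
        total = len(delays)
    return late / total

# Made-up sample: one missing arrival delay, two delays above 60 minutes
sample = [19, 216, -2, None, 75]
print(proportion_over(sample))                     # 0.4
print(proportion_over(sample, skip_missing=True))  # 0.5
```

If the column has no missing values, both denominators agree and the result is unaffected.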


#### **Task 4:** Create a function that gives the average airtime for flights that left earlier than 9:00 am.

In [None]:
def average_airtime_early_departure(df):
  # Filter for flights that left earlier than 9:00 am (DEP_HOUR < 9)
  early_flights = df.filter(F.col("DEP_HOUR") < 9)
  # Calculate the average airtime for these flights
  avg_airtime = early_flights.agg(F.avg("AIR_TIME")).collect()[0][0]
  return avg_airtime

print("Average airtime for flights that left earlier than 9:00 am: ", round(average_airtime_early_departure(df_cleaned), 2))

Average airtime for flights that left earlier than 9:00 am:  111.36


#### **Task 5:** Create a function that determines the maximum arrival delay for flights that did not experience a delay upon departure.

In [None]:
def max_arrival_delay_no_departure_delay(df):
  # Filter for flights that did not experience a delay upon departure (DEP_DELAY <= 0)
  on_time_departure_flights = df.filter(F.col("DEP_DELAY") <= 0)
  # Find the maximum arrival delay for these flights
  max_arr_delay = on_time_departure_flights.agg(F.max("ARR_DELAY")).collect()[0][0]
  return max_arr_delay

print("Maximum arrival delay for flights that did not experience a delay upon departure: ", max_arrival_delay_no_departure_delay(df_cleaned))

Maximum arrival delay for flights that did not experience a delay upon departure:  701
