# Example - Flight Data Processing 


## Step 1: Download the data files and move them to the **./data/** folder

The commands below run on Unix-based operating systems.

Download the flights dataset from https://s3.amazonaws.com/metcs777/flights.csv.bz2
(file size is about 135 MB).



In [9]:
%%bash 
# The dataset is stored on AWS S3:
#   https://s3.amazonaws.com/metcs777/flights.csv.bz2
# or, as an S3 URI:
#   s3://metcs777/flights.csv.bz2

# Download the dataset and move it into ./data/ (created first if it is missing)
mkdir -p ./data
wget -q https://s3.amazonaws.com/metcs777/flights.csv.bz2
mv flights.csv.bz2 ./data/
ls -la ./data/

total 295896
drwxr-xr-x   6 kiat  staff        192 Jul 18 18:16 .
drwxr-xr-x  16 kiat  staff        512 Jul 18 18:16 ..
-rw-r--r--@  1 kiat  staff        244 Sep  5  2018 airlines.csv.bz2
-rw-r--r--@  1 kiat  staff       8071 Sep  5  2018 airports.csv.bz2
-rw-r--r--@  1 kiat  staff  141245764 Sep  5  2018 flights.csv.bz2
-rw-r--r--   1 kiat  staff          0 Feb 25 21:34 placeholder


In [10]:
%%bash
# Now let us get the airlines and airports datasets;
# we will link them to the flights data later.

# Airlines dataset: https://s3.amazonaws.com/metcs777/airlines.csv.bz2 or s3://metcs777/airlines.csv.bz2
# Airports dataset: https://s3.amazonaws.com/metcs777/airports.csv.bz2 or s3://metcs777/airports.csv.bz2
wget -q https://s3.amazonaws.com/metcs777/airlines.csv.bz2

mv airlines.csv.bz2 ./data/

wget -q https://s3.amazonaws.com/metcs777/airports.csv.bz2

mv airports.csv.bz2 ./data/
ls -la ./data/

total 295896
drwxr-xr-x   6 kiat  staff        192 Jul 18 18:16 .
drwxr-xr-x  16 kiat  staff        512 Jul 18 18:16 ..
-rw-r--r--@  1 kiat  staff        244 Sep  5  2018 airlines.csv.bz2
-rw-r--r--@  1 kiat  staff       8071 Sep  5  2018 airports.csv.bz2
-rw-r--r--@  1 kiat  staff  141245764 Sep  5  2018 flights.csv.bz2
-rw-r--r--   1 kiat  staff          0 Feb 25 21:34 placeholder


--2020-07-18 18:16:22--  https://s3.amazonaws.com/metcs777/airports.csv.bz2
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.32.150
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.32.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8071 (7.9K) [application/x-bzip]
Saving to: ‘airports.csv.bz2’

     0K .......                                               100%  115M=0s

2020-07-18 18:16:22 (115 MB/s) - ‘airports.csv.bz2’ saved [8071/8071]



In [11]:
lines = sc.textFile("./data/flights.csv.bz2")

# First line is the header. 
lines.first()

'YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY'

In [12]:
# Show the header and the first data row
lines.take(2)

['YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY',
 '2015,1,1,4,AS,98,N407AS,ANC,SEA,0005,2354,-11,21,0015,205,194,169,1448,0404,4,0430,0408,-22,0,0,,,,,,']

In [13]:
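# Remove the header from the RDD.
# subtract() drops every line equal to the header; it shuffles the data,
# so the order of the remaining rows is not preserved.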
linesHeader = lines.first()
header = sc.parallelize([linesHeader])
linesWithOutHeader = lines.subtract(header)
linesWithOutHeader.first()

'2015,1,1,4,US,2013,N584UW,LAX,CLT,0030,0044,14,13,0057,273,249,228,2125,0745,8,0803,0753,-10,0,0,,,,,,'
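Because `subtract` shuffles the data, the row returned above is not the first data row of the file. An alternative that keeps the original order is to filter out any line equal to the header. The sketch below shows the idea in plain Python on a small in-memory sample; `sample_lines` is a stand-in for the `lines` RDD, and its header and rows are shortened for readability (an assumption made only for this illustration).

```python
# Plain-Python sketch of header removal by filtering.
# sample_lines stands in for the `lines` RDD; rows are shortened for the sketch.
sample_lines = [
    "YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE",  # header (shortened)
    "2015,1,1,4,AS",
    "2015,1,1,4,US",
]

header_line = sample_lines[0]

# Keep every line that is not identical to the header; order is preserved.
rows_without_header = [line for line in sample_lines if line != header_line]

print(rows_without_header[0])

# Assumed RDD equivalent (not run here):
# linesWithOutHeader = lines.filter(lambda line: line != linesHeader)
```

With the RDD version, `first()` would be expected to return the row that directly follows the header in the file.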

In [14]:
# Each record describes one flight and has the following attributes:
# YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
flights = linesWithOutHeader.map(lambda x: x.split(','))
flights.first()

['2015',
 '1',
 '1',
 '4',
 'US',
 '2013',
 'N584UW',
 'LAX',
 'CLT',
 '0030',
 '0044',
 '14',
 '13',
 '0057',
 '273',
 '249',
 '228',
 '2125',
 '0745',
 '8',
 '0803',
 '0753',
 '-10',
 '0',
 '0',
 '',
 '',
 '',
 '',
 '',
 '']
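Splitting on `,` is enough as long as no field contains a quoted comma. If that assumption is in doubt, Python's standard `csv` module can parse one line at a time and honors quoting. A minimal sketch, using the first data row shown earlier; `parse_csv_line` is a hypothetical helper name, and the quoted example line is made up for illustration.

```python
import csv


def parse_csv_line(line):
    """Parse a single CSV line into a list of fields, honoring quotes."""
    return next(csv.reader([line]))


# First data row from the flights file, as shown in the output of lines.take(2).
row = ("2015,1,1,4,AS,98,N407AS,ANC,SEA,0005,2354,-11,21,0015,205,194,169,"
       "1448,0404,4,0430,0408,-22,0,0,,,,,,")
fields = parse_csv_line(row)
print(len(fields))

# Made-up line with a comma inside a quoted field.
quoted_fields = parse_csv_line('ABC,"Some Airport, Inc.",TX')
print(quoted_fields)

# Assumed RDD usage (not run here):
# flights = linesWithOutHeader.map(parse_csv_line)
```

For a file without quoted fields, the result should be the same as `x.split(',')`, including the empty strings for blank columns.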

In [15]:
# We expect each row to have 31 data elements.
# Remove all rows that do not have exactly 31 elements.
dataFiltered = flights.filter(lambda x: len(x) == 31)
dataFiltered.first()

['2015',
 '1',
 '1',
 '4',
 'US',
 '2013',
 'N584UW',
 'LAX',
 'CLT',
 '0030',
 '0044',
 '14',
 '13',
 '0057',
 '273',
 '249',
 '228',
 '2125',
 '0745',
 '8',
 '0803',
 '0753',
 '-10',
 '0',
 '0',
 '',
 '',
 '',
 '',
 '',
 '']

In [16]:
# We only need the following fields:
# YEAR, MONTH, DAY, DAY_OF_WEEK, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, ORIGIN_AIRPORT, DESTINATION_AIRPORT, SCHEDULED_DEPARTURE, DEPARTURE_TIME, DEPARTURE_DELAY, CANCELLED
mainFlightsData = dataFiltered.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[24]))

# Cache this RDD; we will use it a lot
mainFlightsData.cache()

# Show the first record
mainFlightsData.first()

# Note: each record in this new RDD has only 13 fields (max index 12)

# 0 YEAR,
# 1 MONTH,
# 2 DAY,
# 3 DAY_OF_WEEK,
# 4 AIRLINE, 
# 5 FLIGHT_NUMBER,
# 6 TAIL_NUMBER,
# 7 ORIGIN_AIRPORT,
# 8 DESTINATION_AIRPORT,
# 9 SCHEDULED_DEPARTURE,
# 10 DEPARTURE_TIME,
# 11 DEPARTURE_DELAY, 
# 12 CANCELLED

('2015',
 '1',
 '1',
 '4',
 'US',
 '2013',
 'N584UW',
 'LAX',
 'CLT',
 '0030',
 '0044',
 '14',
 '0')
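All 13 fields are still strings, and raw rows can contain empty fields (the trailing delay columns above are blank). Before aggregating delays it helps to convert values to numbers and decide how blanks are treated. The sketch below is one possible approach: `to_int` and the index constants are hypothetical names, and reading `'1'` in the CANCELLED field as "cancelled" is an assumption about the data encoding.

```python
def to_int(text, default=None):
    """Convert a numeric string to int; return `default` for blank or invalid input."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return default


# Hypothetical index constants for the 13-field record listed above.
AIRLINE, ORIGIN_AIRPORT, DEPARTURE_DELAY, CANCELLED = 4, 7, 11, 12

# The record returned by mainFlightsData.first() above.
record = ('2015', '1', '1', '4', 'US', '2013', 'N584UW',
          'LAX', 'CLT', '0030', '0044', '14', '0')

delay = to_int(record[DEPARTURE_DELAY])
is_cancelled = record[CANCELLED] == '1'  # assumption: '1' marks a cancelled flight

print(record[AIRLINE], record[ORIGIN_AIRPORT], delay, is_cancelled)
```

Returning `None` for blanks makes it explicit that a missing delay is not the same as a delay of zero; such rows can then be filtered out before averaging.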

In [17]:
airlines = sc.textFile("./data/airlines.csv.bz2")
airlines.take(2)

['IATA_CODE,AIRLINE', 'UA,United Air Lines Inc.']

In [18]:
airports = sc.textFile("./data/airports.csv.bz2")
airports.take(2)

['IATA_CODE,AIRPORT,CITY,STATE,COUNTRY,LATITUDE,LONGITUDE',
 'ABE,Lehigh Valley International Airport,Allentown,PA,USA,40.65236,-75.44040']

In [19]:
# Remove the header from the RDD
airlinesHeader = airlines.first()
header1 = sc.parallelize([airlinesHeader])
airlinesWithOutHeader = airlines.subtract(header1)
airlinesWithOutHeader.first()

'UA,United Air Lines Inc.'

In [20]:
# Remove the header from the RDD
airportsHeader = airports.first()
header1 = sc.parallelize([airportsHeader])
airportsWithOutHeader = airports.subtract(header1)
airportsWithOutHeader.first()

'ABI,Abilene Regional Airport,Abilene,TX,USA,32.41132,-99.68190'

In [None]:
# Q1 - Find a list of unique Origin Airports

In [None]:
# Q2 - Find a list of (Origin, Destination) pairs

In [None]:
# Q3 - Find the Origin airport which had the largest departure delay 
# in the month of January

In [None]:
# Q4 - Find out which carrier has the largest delay on Weekends. 


In [None]:
# Q5 - Which airport has the most flight cancellations?


In [None]:
# Q6 - Find percentage of flights cancelled for each carrier.
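One way to approach this question is to carry a pair (cancelled count, total count) per carrier and divide at the end. The sketch below works through the arithmetic in plain Python on made-up `(AIRLINE, CANCELLED)` pairs; the sample values, and the reading of `"1"` as "cancelled", are assumptions for illustration only.

```python
from collections import defaultdict

# Made-up (AIRLINE, CANCELLED) pairs; in the notebook these would be p[4] and p[12].
sample = [("AA", "0"), ("AA", "1"), ("AA", "0"), ("AA", "0"),
          ("UA", "1"), ("UA", "0")]

# carrier -> [number of cancelled flights, total number of flights]
counts = defaultdict(lambda: [0, 0])
for carrier, cancelled_flag in sample:
    counts[carrier][0] += 1 if cancelled_flag == "1" else 0
    counts[carrier][1] += 1

percent_cancelled = {carrier: 100.0 * cancelled / total
                     for carrier, (cancelled, total) in counts.items()}
print(percent_cancelled)

# Assumed RDD equivalent (not run here):
# mainFlightsData.map(lambda p: (p[4], (1 if p[12] == "1" else 0, 1))) \
#     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
#     .mapValues(lambda t: 100.0 * t[0] / t[1])
```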


In [None]:
# Q7 - Find the largest departure delay for each carrier


In [None]:
# Q8 - Find the largest departure delay for each carrier in each month

In [None]:
# Q9 - For each carrier, find the average departure delay
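An average cannot be combined pairwise the way a sum or maximum can, so a common pattern is to reduce (sum, count) pairs per carrier and divide afterwards. The plain-Python sketch below illustrates this on made-up `(AIRLINE, DEPARTURE_DELAY)` pairs; skipping blank delays is an assumption about how missing values should be handled.

```python
# Made-up (AIRLINE, DEPARTURE_DELAY) pairs; in the notebook these would be p[4] and p[11].
sample = [("AA", "10"), ("AA", "-4"), ("AA", ""), ("UA", "30"), ("UA", "0")]

# Drop blank delays, then turn each delay into a (sum, count) pair.
pairs = [(carrier, (int(delay), 1)) for carrier, delay in sample if delay != ""]

# Combine the pairs per carrier: add the sums and add the counts.
sum_and_count = {}
for carrier, (delay, count) in pairs:
    running_sum, running_count = sum_and_count.get(carrier, (0, 0))
    sum_and_count[carrier] = (running_sum + delay, running_count + count)

average_delay = {carrier: total / count
                 for carrier, (total, count) in sum_and_count.items()}
print(average_delay)

# Assumed RDD equivalent (not run here):
# mainFlightsData.filter(lambda p: p[11] != "") \
#     .map(lambda p: (p[4], (int(p[11]), 1))) \
#     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
#     .mapValues(lambda t: t[0] / t[1])
```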


In [None]:
# Q10 - For each carrier find the average Departure delay for each month


In [None]:
# Q11 - Which date of the year has the highest rate of flight cancellations?
# The cancellation rate is calculated by dividing the number of
# cancelled flights by the total number of flights.
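The rate defined above can be worked through on a tiny example. The sketch below uses made-up `(MONTH, DAY, CANCELLED)` triples and plain Python counters; the sample values and the reading of `"1"` as "cancelled" are assumptions for illustration.

```python
from collections import Counter

# Made-up (MONTH, DAY, CANCELLED) triples; in the notebook: p[1], p[2], p[12].
sample = [("1", "1", "0"), ("1", "1", "1"),
          ("1", "2", "0"), ("1", "2", "0"),
          ("2", "14", "1"), ("2", "14", "1"), ("2", "14", "0")]

# Count all flights and cancelled flights per (month, day).
total_per_date = Counter((month, day) for month, day, _ in sample)
cancelled_per_date = Counter((month, day) for month, day, flag in sample if flag == "1")

# rate = cancelled flights / total flights, for each date
rate_per_date = {date: cancelled_per_date[date] / total_per_date[date]
                 for date in total_per_date}

worst_date = max(rate_per_date, key=rate_per_date.get)
print(worst_date, rate_per_date[worst_date])
```

On the full dataset the same two counts per date could be produced with a single `reduceByKey` over (cancelled, total) pairs keyed by `(p[1], p[2])`.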


In [None]:
# Q12 - Calculate the number of flights to each destination state.
# For each carrier, which state has the largest average delay?
# You will need the airlines and airports datasets for this question.
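The first part of this question needs a lookup from airport code to state. The sketch below shows the linking step in plain Python, using the two airport rows printed earlier; the list of destination codes is made up, and `"XXX"` is a hypothetical code that is absent from the airports data.

```python
from collections import Counter

# Two airport rows as shown in the outputs above (IATA_CODE is field 0, STATE is field 3).
airport_lines = [
    "ABE,Lehigh Valley International Airport,Allentown,PA,USA,40.65236,-75.44040",
    "ABI,Abilene Regional Airport,Abilene,TX,USA,32.41132,-99.68190",
]
state_by_airport = {fields[0]: fields[3]
                    for fields in (line.split(",") for line in airport_lines)}

# Made-up DESTINATION_AIRPORT values; in the notebook this would be p[8].
destinations = ["ABE", "ABI", "ABE", "XXX"]

# Count flights per destination state; unknown codes are grouped under "UNKNOWN".
flights_per_state = Counter(state_by_airport.get(code, "UNKNOWN")
                            for code in destinations)
print(flights_per_state)

# Assumed RDD equivalent (not run here); an inner join drops unknown codes:
# airportState = airportsWithOutHeader.map(lambda l: l.split(",")) \
#     .map(lambda f: (f[0], f[3]))
# mainFlightsData.map(lambda p: (p[8], 1)).join(airportState) \
#     .map(lambda kv: (kv[1][1], kv[1][0])) \
#     .reduceByKey(lambda a, b: a + b)
```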