In [1]:
# Put the correct credentials here and mount S3 bucket
ACCESS_KEY = "ACCESS_KEY"
SECRET_KEY = "SECRET_KEY"
AWS_BUCKET_NAME = "AWS_BUCKET_NAME"
MOUNT_NAME = "MOUNT_NAME"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")

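try:
  dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
except Exception as e:
  # Most likely the bucket is already mounted; report the error instead of silently hiding it
  print("Mount skipped: %s" % e)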

In [2]:
# For now we will just read data from a text file
input_rdd = sc.textFile("/mnt/%s/flights.csv" % MOUNT_NAME)
flight_delay_list_rdd = input_rdd.filter(lambda line: 'YEAR' not in line).map(lambda line: line.split(','))

### Transformations on RDD of key-value Pairs

In [4]:
# show fields 4 and 11 of the first line (the header): these are the two columns used below
first_row = input_rdd.map(lambda line: line.split(',')).first()
print(first_row[4], first_row[11])

In [5]:
# let's do some data cleaning: assume the delay is 0 when the value is missing
# missing values are a small portion of our dataset, so this has little effect here
# a more careful approach would impute them, for example with the mean delay
airline_departureDelay_rdd = flight_delay_list_rdd.map(lambda line: (line[4], int(line[11]) if line[11] != '' else 0))
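
The mean-based alternative mentioned in the comment can be sketched in plain Python. This is only an illustration on a made-up list of raw fields (`raw_delays` is hypothetical, not taken from flights.csv), and it does not use Spark:

```python
# Hypothetical raw delay fields, as they might look after splitting CSV lines
raw_delays = ["12", "", "-3", "", "21"]

# Mean of the values that are actually present
known = [int(value) for value in raw_delays if value != ""]
mean_delay = sum(known) / len(known)

# Replace each missing field with that mean instead of 0
imputed = [int(value) if value != "" else mean_delay for value in raw_delays]
print(imputed)  # [12, 10.0, -3, 10.0, 21]
```

Unlike filling with 0, this keeps the overall mean of the column unchanged.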

In [6]:
# we just created a key-value pair RDD with airline and departure delay
airline_departureDelay_rdd.take(10)

In [7]:
# let's do some more cleaning: treat negative departure delays (early departures) as 0
airline_posDepartureDelay_rdd = airline_departureDelay_rdd.map(lambda line: (line[0], line[1] if line[1] > 0 else 0))

In [8]:
airline_posDepartureDelay_rdd.take(10)

### reduceByKey

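The higher-order reduceByKey method takes an associative and commutative binary operator as input and reduces values with the same key to a single value using the specified binary operator.

A binary operator takes two values as input and returns a single value as output. An associative operator returns the same result regardless of the grouping of the operands; a commutative operator returns the same result regardless of their order. Both properties matter because Spark combines values in whatever grouping and order the partitions happen to produce.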

The reduceByKey method can be used for aggregating values by key. For example, it can be used for calculating sum, product, minimum or maximum of all the values mapped to the same key.
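
To make the semantics concrete, here is a minimal plain-Python sketch of what reducing by key computes. The helper `reduce_by_key` and the sample pairs are hypothetical stand-ins written for this illustration; they are not Spark APIs and the data is made up:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, op):
    """Plain-Python stand-in for reduceByKey: fold the values of each key with op."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(op, values) for key, values in grouped.items()}

# Hypothetical (airline, delay) pairs
sample_pairs = [("AA", 5), ("DL", 0), ("AA", 12), ("DL", 7), ("AA", 3)]

total_delay = reduce_by_key(sample_pairs, lambda a, b: a + b)
max_delay = reduce_by_key(sample_pairs, max)
print(total_delay)  # {'AA': 20, 'DL': 7}
print(max_delay)    # {'AA': 12, 'DL': 7}
```

Addition and `max` are both associative and commutative, so the result does not depend on how the values are grouped or ordered.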

In [10]:
# reduce by key (airline) to get the total departure delay per airline
posDepartureDelay_reduced_rdd = airline_posDepartureDelay_rdd.reduceByKey(lambda value1, value2: value1 + value2)

In [11]:
# print the total departure delay per airline, sorted in ascending order
sorted(posDepartureDelay_reduced_rdd.collect(), key=lambda x: x[1])

### groupByKey

The groupByKey method returns an RDD of pairs, where the first element in a pair is a key from the source RDD and the second element is a collection of all the values that have the same key. It is similar to the groupBy method that we saw earlier. The difference is that groupBy is a higher-order method that takes as input a function that returns a key for each element in the source RDD. The groupByKey method operates on an RDD of key-value pairs, so a key generator function is not required as input.

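**The groupByKey method should be avoided where possible. It is an expensive operation because it shuffles every value across the network without first combining values that share a key. For most aggregations, better alternatives such as reduceByKey are available.**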

https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
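
The cost difference can be illustrated with a small plain-Python simulation. This is a rough sketch, not Spark code: the two lists stand in for partitions, the record counts stand in for shuffled data, and all names and values are hypothetical:

```python
from collections import defaultdict

# Two hypothetical partitions of (airline, delay) pairs
partitions = [
    [("AA", 5), ("DL", 0), ("AA", 12)],
    [("DL", 7), ("AA", 3), ("DL", 1)],
]

def group_by_key_style(parts):
    """Every record crosses the 'shuffle'; values are summed only afterwards."""
    shuffled = [pair for part in parts for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)

def reduce_by_key_style(parts):
    """Each partition is pre-summed, so at most one record per key per partition moves."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_records = group_by_key_style(partitions)
reduced_totals, reduced_records = reduce_by_key_style(partitions)
print(grouped_totals == reduced_totals)  # True
print(grouped_records, reduced_records)  # 6 4
```

Both approaches give the same totals, but the pre-combining variant moves fewer records. The gap grows with the number of records per key.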

### join
The join method takes an RDD of key-value pairs as input and performs an inner join on the source and input RDDs. It returns an RDD of pairs, where the first element in a pair is a key found in both source and input RDD and the second element is a tuple containing values mapped to that key in the source and input RDD.
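
The shape of the joined result is easiest to see on a tiny example. The following plain-Python sketch mimics an inner join on key-value pairs; `inner_join`, the airline codes, and the names are all made up for illustration and are not part of Spark or the dataset:

```python
def inner_join(left, right):
    """Plain-Python stand-in for an inner join: pair up values whose keys match."""
    right_index = {}
    for key, value in right:
        right_index.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]

# Hypothetical data: "XC" has no name and "XD" has no delay, so both are dropped
delays = [("XA", 20), ("XB", 8), ("XC", 4)]
names = [("XA", "Example Air A"), ("XB", "Example Air B"), ("XD", "Example Air D")]

joined = inner_join(delays, names)
print(joined)
# [('XA', (20, 'Example Air A')), ('XB', (8, 'Example Air B'))]
```

Each result element has the form `(key, (value_from_source, value_from_input))`, which is why the cells below index into `line[1][0]` and `line[1][1]`.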

In [14]:
# We want to get the airline names, not just the codes
posDepartureDelay_reduced_rdd.first()

In [15]:
# lucky for us, we have a table that translates these codes into actual airline names
airlines_input_rdd = sc.textFile("/mnt/%s/airlines.csv" % MOUNT_NAME)

In [16]:
airlines_input_rdd.take(5)

In [17]:
# let's remove the header and convert our strings to lists
airlines_rdd = airlines_input_rdd.filter(lambda line: 'IATA_CODE' not in line).map(lambda line: line.split(','))
airlines_rdd.take(5)

In [18]:
# We can use join to translate the code into names
posDepartureDelay_reduced_rdd.join(airlines_rdd).collect()

In [19]:
# we can keep just the airline name and total departure delay from the join
posDepartureDelay_reduced_rdd.join(airlines_rdd).map(lambda line: (line[1][1], line[1][0])).collect()

In [20]:
# let's store it and print it sorted
totalDepartureDelay = posDepartureDelay_reduced_rdd.join(airlines_rdd).map(lambda line: (line[1][1], line[1][0])).collect()
sorted(totalDepartureDelay, key=lambda x: x[1])

### Can we say Hawaiian Airlines Inc. is your best bet in order to avoid late departure?
#### No: Hawaiian Airlines Inc. may have smaller operations than the others, so let's get the mean!

In [22]:
# we already have the total delay per airline
posDepartureDelay_reduced_rdd.collect()

In [23]:
# let's create an RDD of flight counts per airline
flight_count_rdd = airline_posDepartureDelay_rdd.map(lambda x: (x[0],1)).reduceByKey(lambda v1, v2: v1 + v2)

In [24]:
# we can join these two RDDs to get the total delay and the flight count per airline
posDepartureDelay_flight_rdd = posDepartureDelay_reduced_rdd.join(flight_count_rdd)

In [25]:
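# we can divide the total delay by the flight count to get the mean
# (float() avoids integer division if the notebook runs on Python 2)
posDepartureDelay_mean_rdd = posDepartureDelay_flight_rdd.map(lambda x: (x[0], float(x[1][0]) / x[1][1]))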
sorted(posDepartureDelay_mean_rdd.collect(), key=lambda x: x[1])
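
As an aside, the total and the count can also be carried together as a `(total, count)` pair and reduced in one pass, which avoids the extra join. The sketch below shows the idea in plain Python on made-up pairs; the airline codes and the helper `add_sum_count` are hypothetical. On an RDD the same idea would presumably be written with mapValues followed by reduceByKey, but that translation is not run here:

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (airline, delay) pairs: XC flies once, XA and XB three times each
sample_pairs = [("XA", 10), ("XB", 0), ("XA", 20), ("XB", 6),
                ("XA", 0), ("XB", 6), ("XC", 9)]

def add_sum_count(a, b):
    """Combine two (total_delay, flight_count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

# Turn every delay into a (delay, 1) pair, grouped by airline
by_airline = defaultdict(list)
for airline, delay in sample_pairs:
    by_airline[airline].append((delay, 1))

sum_count = {airline: reduce(add_sum_count, values) for airline, values in by_airline.items()}
mean_delay = {airline: total / count for airline, (total, count) in sum_count.items()}

by_total = sorted(sum_count, key=lambda airline: sum_count[airline][0])
by_mean = sorted(mean_delay, key=lambda airline: mean_delay[airline])
print(by_total)  # ['XC', 'XB', 'XA']
print(by_mean)   # ['XB', 'XC', 'XA']
```

In this toy data the airline with the smallest total delay is not the one with the smallest mean delay, because it operates far fewer flights.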

In [26]:
# we can also do one more join to print out the airlines' full names
departureDelay = posDepartureDelay_mean_rdd.join(airlines_rdd).map(lambda line: (line[1][1], line[1][0])).collect()
sorted(departureDelay, key=lambda x: x[1])

#### Let's compare with our previous result of total departure delay. What do you see?

In [28]:
sorted(totalDepartureDelay, key=lambda x: x[1])