In [1]:
from pyspark import SparkContext, SparkConf

In [2]:
if 'sc' not in globals():  # This check makes sure the SparkContext sc is initialized exactly once
    conf = SparkConf().setMaster('local[*]')
    sc = SparkContext(conf=conf)

In [3]:
# Replace with your values
# NOTE: Set the access to this notebook appropriately to protect the security of your keys.
# Or you can delete this cell after you run the mount command below once successfully.
ACCESS_KEY = "none"
SECRET_KEY = "none"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "none"
MOUNT_NAME = "none"

In [4]:
# Only execute this cell once
try:
  dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
except Exception:
  # Ignore mount errors (e.g. when this cell is run a second time)
  pass

In [5]:
# For now we will just read data from a text file
input_rdd = sc.textFile("/mnt/%s/flights.csv" % MOUNT_NAME)

### Transformations
Transformations are lazy operations: they are not executed when you run the cell that defines them, but only once you call an action on the resulting RDD. Examples of transformations are converting each integer to a float or filtering a set of values. This section covers the basic transformations that can be applied to an RDD.
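To get a feel for lazy evaluation without a cluster, here is a plain-Python analogy (it is not Spark code). In Python 3 the built-in `map` and `filter` are also lazy, so nothing runs until the result is consumed. The `to_float` helper and the `calls` list are made up for this illustration.

```python
calls = []

def to_float(value):
    # Record each call so we can see when the work actually happens
    calls.append(value)
    return float(value)

# Building the pipeline does no work yet (like chaining RDD transformations)
pipeline = filter(lambda x: x > 1.0, map(to_float, [1, 2, 3]))
calls_before_action = list(calls)  # snapshot taken before anything is consumed

# Consuming the pipeline plays the role of an action such as collect()
result = list(pipeline)
calls_after_action = list(calls)
```

`calls_before_action` stays empty, while `calls_after_action` holds every input value: the function only ran once the result was requested.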
##### map
The map method is a higher-order method that takes a function as input and applies it to each element in
the source RDD to create a new RDD. The input function to map must take a single input parameter and
return a value.

In [7]:
# just to show you the first line of the RDD
input_rdd.first()

In [8]:
# using the map function, I transformed my original RDD
input_list = input_rdd.map(lambda line: line.split(','))

In [9]:
# Each element is now a list instead of a plain string
# first() is an action: only at this point does Spark start processing
input_list.first()[:10]
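
One caveat with `line.split(',')`: it also splits on commas that sit inside quoted fields. If your file may contain such fields, a function like the one sketched below, built on Python's standard `csv` module, could be passed to `map` instead. The name `parse_csv_line` and the sample line are hypothetical, and I have not checked whether flights.csv actually needs this.

```python
import csv

def parse_csv_line(line):
    # csv.reader expects an iterable of lines, so wrap the single line in a list
    return next(csv.reader([line]))

sample = 'AA,"Dallas, TX",15'    # made-up line with a quoted comma
naive = sample.split(',')        # breaks the quoted field in two
parsed = parse_csv_line(sample)  # keeps the quoted field together

# With Spark this would be used as: input_rdd.map(parse_csv_line)
```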

##### filter
The filter method is a higher-order method that takes a Boolean function as input and applies it to each
element in the source RDD to create a new RDD. A Boolean function takes an input and returns true or
false. The filter method returns a new RDD formed by selecting only those elements for which the input
Boolean function returned true. Thus, the new RDD contains a subset of the elements in the original RDD.

In [11]:
# the original input RDD has a header line
input_rdd.take(4)

In [12]:
# for processing the data I can get rid of the header with a filter operation
flight_delay_rdd = input_rdd.filter(lambda line: 'YEAR' not in line)

In [13]:
# we got rid of the header
# again, only at this point spark processes the data
flight_delay_rdd.first()
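
The `'YEAR' not in line` test would also drop any data line that happens to contain the text YEAR. A slightly stricter variant compares each line with the header itself. Below is a plain-Python sketch of that predicate on a few made-up lines (they are not taken from the dataset). With Spark the same idea would be `header = input_rdd.first()` followed by `input_rdd.filter(lambda line: line != header)`.

```python
# Made-up sample standing in for the first lines of the file
lines = [
    "YEAR,MONTH,DAY",
    "2015,1,1",
    "2015,1,2",
]

header = lines[0]  # plays the role of input_rdd.first()

# Keep every line that is not exactly the header
data_lines = [line for line in lines if line != header]
```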

##### flatMap
The flatMap method is a higher-order method that takes an input function, which returns a sequence for
each input element passed to it. The flatMap method returns a new RDD formed by flattening this collection
of sequences.
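
As a plain-Python analogy (no Spark involved), the difference between `map` and `flatMap` looks like this: mapping a function that returns a list gives a list of lists, while flattening the result gives one list of elements. The sample rows are made up.

```python
from itertools import chain

rows = ["1,2", "3", "4,5,6"]

# map-like: one output element (a list) per input element
mapped = [row.split(',') for row in rows]

# flatMap-like: the per-row lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(row.split(',') for row in rows))
```

`mapped` still has three elements, one per row, whereas `flat_mapped` has six, one per value.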

In [15]:
# imagine we want to know the total delay and we don't have ARRIVAL_DELAY and DEPARTURE_DELAY
# so we must find out using the following fields
input_list.first()[26:]

In [16]:
# split string input to a list object in python and then subset to the delays
different_delays_rdd = flight_delay_rdd.map(lambda line: line.split(',')).map(lambda line: line[26:])

In [17]:
# Now we only have different delays in minutes
different_delays_rdd.first()

In [18]:
# If we want to replace the empty items on the list with 0 we can use the map function
different_delays_rdd.map(lambda delays: [delay if len(delay)>0 else 0 for delay in delays]).first()
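
Note that the cell above leaves the non-empty delays as strings and only turns the empty ones into the integer 0. If the goal is a total delay per flight, one possible approach is to convert every field to an integer first. A sketch, where `total_delay` and the sample row are hypothetical:

```python
def total_delay(delays):
    # Treat empty strings as 0, convert the rest to integers, then add them up
    return sum(int(delay) if delay else 0 for delay in delays)

sample_delays = ['', '25', '', '10', '']  # made-up row with five delay fields

row_total = total_delay(sample_delays)

# With Spark this could be applied as: different_delays_rdd.map(total_delay)
```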

In [19]:
# there is the same number of lines as in the input (minus one for the header)
different_delays_rdd.count()

In [20]:
# using flatMap we can represent each delay per line as a single object within our RDD
different_delay_rdd = different_delays_rdd.flatMap(lambda line: line)

In [21]:
# let's see how many lines we have now within our RDD
different_delay_rdd.count()

In [22]:
# our RDD became 5 times its original size
5819079 * 5

In [23]:
# each delay is now its own element, so we can drop the empty ones and look at the first 10 non-empty delays
different_delay_rdd.filter(lambda line: line!='').take(10)

##### Writing Custom Functions
You can write custom functions to process each element of an RDD, as illustrated below.

In [25]:
# if our line is empty this function will return 0
# otherwise it will convert it to an integer
def convert_to_int(line):
    try:
        return int(line)
    except:
        return 0
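
The bare `except:` above catches every exception, including ones you would probably want to see. A narrower variant, sketched here under the assumption that the only bad inputs are empty or non-numeric strings (or `None`), catches just `ValueError` and `TypeError`. The name `safe_int` is mine.

```python
def safe_int(value):
    # int() raises ValueError for '' or 'abc' and TypeError for None
    try:
        return int(value)
    except (ValueError, TypeError):
        return 0

converted = [safe_int(v) for v in ['12', '', '-5', 'abc', None]]
```

Because it is an ordinary Python function, it can be tried on a small list like this before it is passed to `map` on the full RDD.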

In [26]:
different_delay_int_rdd = different_delay_rdd.map(convert_to_int)

In [28]:
# print out a non-zero delay from our RDD
different_delay_int_rdd.filter(lambda delay: delay != 0).first()

In [29]:
# to get the total flight delay (in hours) in the USA for the year 2015, use sum(), which is an action
different_delay_int_rdd.sum() / 60.0

##### union
The union method takes an RDD as input and returns a new RDD that contains the union of the elements in the source RDD and the RDD passed to it as an input. Duplicates are kept; call distinct() on the result if you need each element only once.

```
linesFile1 = sc.textFile("...")
linesFile2 = sc.textFile("...")
linesFromBothFiles = linesFile1.union(linesFile2)
```

In [31]:
mammals = sc.parallelize(["Lion", "Dolphin", "Whale"])
aquatics = sc.parallelize(["Shark", "Dolphin", "Whale"])
zoo = mammals.union(aquatics)
zoo.collect()

##### intersection
The intersection method takes an RDD as input and returns a new RDD that contains the intersection of
the elements in the source RDD and the RDD passed to it as an input.

```
linesFile1 = sc.textFile("...")
linesFile2 = sc.textFile("...")
linesPresentInBothFiles = linesFile1.intersection(linesFile2)
```

In [33]:
mammals = sc.parallelize(["Lion", "Dolphin", "Whale"])
aquatics = sc.parallelize(["Shark", "Dolphin", "Whale"])
aquaticMammals = mammals.intersection(aquatics)
aquaticMammals.collect()

##### subtract
The subtract method takes an RDD as input and returns a new RDD that contains elements in the source
RDD but not in the input RDD.

```
linesFile1 = sc.textFile("...")
linesFile2 = sc.textFile("...")
linesInFile1Only = linesFile1.subtract(linesFile2)
```

In [35]:
mammals = sc.parallelize(["Lion", "Dolphin", "Whale"])
aquatics = sc.parallelize(["Shark", "Dolphin", "Whale"])
fishes = aquatics.subtract(mammals)
fishes.collect()
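
To summarize the three set-like operations, here is a plain-Python sketch (not Spark code) of what each one returns for the two lists used above. It mirrors two details of the Spark behavior: `union` keeps duplicates, while `intersection` removes them. Spark does not guarantee the order of the results of `intersection` and `subtract`, so the sketch sorts them.

```python
mammals = ["Lion", "Dolphin", "Whale"]
aquatics = ["Shark", "Dolphin", "Whale"]

# union: simple concatenation, duplicates are kept
union_like = mammals + aquatics

# intersection: elements present in both, without duplicates
intersection_like = sorted(set(mammals) & set(aquatics))

# subtract: elements of aquatics that do not appear in mammals
mammal_set = set(mammals)
subtract_like = sorted(x for x in aquatics if x not in mammal_set)
```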

##### distinct
The distinct method of an RDD returns a new RDD containing the distinct elements in the source RDD.

In [37]:
# The airline code is at index 4 (the fifth field) of each list in our RDD
input_list.map(lambda line: line[4]).first()

In [38]:
airlines_rdd = flight_delay_rdd.map(lambda line: line.split(',')[4])

In [39]:
# these are two-letter airline codes
airlines_rdd.first()

In [40]:
# how many different airlines do we have in our dataset
airlines_rdd.distinct().count()

In [41]:
# show all the distinct airlines in our dataset
airlines_rdd.distinct().collect()