# Spark Programming - Resilient Distributed Dataset

## 1. Basic RDD Operations

**Transformations** return new RDDs

- `map`, `flatMap`
- `reduceByKey`
- `filter`

**Actions** return values

- `collect`
- `reduce`
- `take`
- `count`

Everything starts with a `SparkContext`

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

#from pyspark import SparkContext
#sc = SparkContext.getOrCreate()

#conf = SparkConf().setAppName("PySpark App").setMaster("spark://spark-master:7077")
#sc = SparkContext(conf=conf)

In [None]:
sc.stop()

In [None]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [None]:
#the spark configuration
sc.getConf().getAll()

In [None]:
# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)

In [None]:
rdd = sc.parallelize(range(0, 100, 1), 5)
print(rdd.collect())

What does this look like?

- `glom`: Returns an RDD in which the elements of each partition are gathered into a single list.
- `collect`: Returns all elements of an RDD to the driver program as a list.

In [None]:
for x in rdd.glom().collect():
    print(x) 

Use `map` to square each element of the RDD, then `filter` to keep only the squares greater than 1000

In [None]:
#two transformations (lazy); both are narrow, so Spark pipelines them within a single stage
mappedRdd = (rdd.map(lambda x:x*x).filter(lambda x:x>1000))

#one action
print(mappedRdd.collect())

In [None]:
mappedRdd.count()

Use reduce to sum the squares

In [None]:
reduced = mappedRdd.reduce(lambda x,y:x+y)
print(reduced)
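As a quick sanity check, the same map, filter, and reduce steps can be reproduced in plain Python on the driver. This is only a sketch for verifying the numbers on a small input; `local_data`, `squares_over_1000`, and `total` are names introduced here and are not part of the notebook.

```python
from functools import reduce

local_data = range(0, 100, 1)  # same values as the RDD above

# map (square) followed by filter (> 1000), written as a comprehension
squares_over_1000 = [x * x for x in local_data if x * x > 1000]

# reduce with the same lambda that was applied to the RDD
total = reduce(lambda x, y: x + y, squares_over_1000)

print(len(squares_over_1000))  # should agree with mappedRdd.count()
print(total)                   # should agree with the reduced value
```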

## 2. Map and Reduce

### `map` and `flatMap`

Create a new RDD  - a list of lists

In [None]:
rdd = sc.parallelize([ [2, 3, 4], [0, 1, 2, 3], [5, 6, 7, 8] ])
rdd.collect()

In [None]:
firstElement = rdd.map(lambda x: x[0])
firstElement.collect()

In [None]:
lastElement = rdd.map(lambda x: x[-1])
lastElement.collect()

In [None]:
lastElement = rdd.map(lambda x: x[-1:])
lastElement.collect()

Apply a function to each element (here, a range over the length of each list). With `map`, the result keeps one entry per input list.

In [None]:
# wrap in list() so the output shows the values rather than range objects
y = rdd.map(lambda x: list(range(len(x)))).collect()
y

Or I can flatten the results...

In [None]:
z = rdd.flatMap(lambda x: range(len(x))).collect()
z

Or flatten the original results

In [None]:
rdd.flatMap(lambda x: x).collect()
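The difference between `map` and `flatMap` can be modelled in plain Python. The sketch below is an illustration only: `local_map` and `local_flat_map` are hypothetical helpers that mimic the behaviour on an ordinary list and ignore partitioning entirely.

```python
from itertools import chain

nested = [[2, 3, 4], [0, 1, 2, 3], [5, 6, 7, 8]]

def local_map(f, items):
    """One output entry per input entry, whatever f returns."""
    return [f(x) for x in items]

def local_flat_map(f, items):
    """f returns an iterable per input; the iterables are concatenated."""
    return list(chain.from_iterable(f(x) for x in items))

mapped = local_map(lambda x: list(range(len(x))), nested)
flattened = local_flat_map(lambda x: range(len(x)), nested)

print(mapped)     # a list of lists, one per input list
print(flattened)  # a single flat list
```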

### Reduce

In [None]:
rdd = sc.parallelize(range(1000), 5)

In [None]:
rdd.reduce(lambda x,y: x+y)

In [None]:
rdd = sc.parallelize([ [2, 3, 4], [0, 1, 2, 3], [5, 6, 7, 8] ])

In [None]:
rdd.flatMap(lambda x: x).reduce(lambda x,y: x+y)

## 3. RDD with Key Value Pairs

In [None]:
rdd = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
rdd.collect()

In [None]:
first = rdd.map(lambda x: x[0])
first.collect()

In [None]:
rdd.reduceByKey(lambda x,y: x+y).collect()

In [None]:
rdd.groupByKey().mapValues(lambda x: sum(x)).collect()
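The `reduceByKey` and `groupByKey` cells produce the same per-key sums. A plain-Python model may make the difference clearer: `reduceByKey` folds each value into a running result per key, while `groupByKey` first gathers every value for a key and only then aggregates. The helpers `local_reduce_by_key` and `local_group_by_key` are hypothetical, and the model ignores partitions and shuffling.

```python
from collections import defaultdict

pairs = [("cat", 1), ("dog", 1), ("cat", 2)]

def local_reduce_by_key(f, kv_pairs):
    """Keep one running value per key and fold each new value into it."""
    acc = {}
    for k, v in kv_pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

def local_group_by_key(kv_pairs):
    """Collect all values for each key before any aggregation happens."""
    groups = defaultdict(list)
    for k, v in kv_pairs:
        groups[k].append(v)
    return dict(groups)

reduced = local_reduce_by_key(lambda x, y: x + y, pairs)
grouped = local_group_by_key(pairs)
grouped_then_summed = {k: sum(vs) for k, vs in grouped.items()}

print(reduced)
print(grouped)
print(grouped_then_summed)
```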

In [None]:
rdd = sc.parallelize([("cat", 1), ("dog", 2), ("cat", 1)])

In [None]:
rdd.countByKey()

In [None]:
rdd.reduceByKey(lambda x,y: x+y).collect()

In [None]:
#optional example - generate key value pair RDD from regular RDD
simple_rdd = sc.parallelize([6, 3, 4, 53, 654, 2, 5, 8, 1 , 65, 66, 54])

In [None]:
key_value_rdd = simple_rdd.map(lambda x: (x % 2, x))
key_value_rdd.collect()

In [None]:
key_value_rdd = simple_rdd.keyBy(lambda x: x % 2)
key_value_rdd.collect()

In [None]:
key_value_rdd.countByKey()

In [None]:
even_odd_summed_rdd = key_value_rdd.reduceByKey(lambda x,y: x+y)
even_odd_summed_rdd.collect()

In [None]:
grouped_rdd = key_value_rdd.groupByKey()
grouped_rdd = grouped_rdd.mapValues(list)
grouped_rdd.collect()

In [None]:
grouped_rdd = simple_rdd.groupBy(lambda x: x % 2)
print(grouped_rdd.collect())
grouped_rdd = grouped_rdd.mapValues(list)
grouped_rdd.collect()

## 4. RDD Partitions

*Partitioning* is the process of distributing data across workers so that they can process it in parallel.  The partial results are then collected and combined.

Under the hood, each worker machine runs one or more "executors", and each executor has one or more cores that run tasks.

You should have at LEAST as many partitions as the total number of executor cores in your cluster; otherwise some cores will just sit idle.

Each partition is processed sequentially on a single executor core.  There is one *task* per partition (in each stage), not one per element.
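One way to picture partitioning is a plain-Python model in which a list is cut into contiguous chunks and each chunk is processed sequentially by one function call. The even, contiguous split in `split_into_partitions` is an assumption made for illustration and is not claimed to match Spark's exact slicing; `process_partition` is likewise a hypothetical stand-in for the work done on one partition.

```python
def split_into_partitions(items, num_partitions):
    """Cut items into num_partitions contiguous chunks of near-equal size."""
    items = list(items)
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

def process_partition(partition):
    """Handle the elements of a single partition one after another."""
    return [x * x for x in partition]

partitions = split_into_partitions(range(10), 3)

# In a cluster these calls could run in parallel; here they run in a loop.
results = [process_partition(p) for p in partitions]

print(partitions)
print(results)
```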

In [None]:
#string concatenation
strconcat = "hello" + "there"
print(strconcat)
strconcat = "hello" + " " + "there"
print(strconcat)

#converting numbers into strings
numstr = str(5.6)
print(numstr)

In [None]:
#a simple function to combine numbers
def combine_num(x, y):
    return "(" + str(x) + " " + str(y) + ")"

combine_num(5, 7)

In [None]:
combine_num(5, combine_num(7, 10))

In [None]:
combine_num(combine_num(5, 7), 10)

In [None]:
#create a simple RDD with exactly one partition (the second argument sets the number of partitions)
simple_rdd = sc.parallelize([6, 8, 2, 9, 10, 13, 7, 4], 1)

#let's reduce an RDD that has only a single partition
simple_rdd.reduce(combine_num)

In [None]:
# let's split the same data into 2 partitions
simple_rdd2 = sc.parallelize([6, 8, 2, 9, 10, 13, 7, 4], 2)  # second argument specifies number of partitions

In [None]:
# let's reduce an RDD that has 2 partitions
simple_rdd2.reduce(combine_num)

### The associative and commutative rule of reduce

The result differs depending on the number of partitions you specify. In other words, if we divide a large problem into partitions, different partitionings can give different results.  Because of this, `reduce` only gives a well-defined result when the operation is *associative* (in the mathematical sense).

For example, ((A + B) + C) = (A + (B + C)); but we can't guarantee ((A / B) / C) = (A / (B / C))

The result also won't be consistent from run to run unless the combining operation is *commutative* (in the mathematical sense).

For example, (A + B) = (B + A); but (A / B) is not necessarily equal to (B / A)

So if the reducing function is not associative and commutative, you will sometimes get wrong results depending on how your data is partitioned.
Let's look at another example.

In [None]:
#dividing the values in list order, we expect to get 10 (up to floating-point rounding)
simple_rdd = sc.parallelize([1, 2, 0.5, 0.1, 5, 0.2], 1)
simple_rdd.reduce(lambda x, y: x / y)

However, if you split the same data into 3 partitions, the result is different (and wrong).

In [None]:
simple_rdd2 = sc.parallelize([1, 2, 0.5, 0.1, 5, 0.2], 3)
simple_rdd2.reduce(lambda x, y: x / y)
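The partition dependence described in this section can be reproduced without Spark. The sketch below assumes a simplified model in which each partition is reduced on its own and the partial results are then reduced in partition order; `partitioned_reduce` is a hypothetical helper, and the partition boundaries are chosen by hand rather than by Spark.

```python
from functools import reduce
from operator import add, truediv

def partitioned_reduce(f, partitions):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(f, p) for p in partitions if p]
    return reduce(f, partials)

def combine_num(x, y):
    return "(" + str(x) + " " + str(y) + ")"

values = [6, 8, 2, 9, 10, 13, 7, 4]
one_part = partitioned_reduce(combine_num, [values])
two_parts = partitioned_reduce(combine_num, [values[:4], values[4:]])
print(one_part)   # grouping follows the list order
print(two_parts)  # grouping follows the partition boundaries

ratios = [1, 2, 0.5, 0.1, 5, 0.2]
div_single = partitioned_reduce(truediv, [ratios])
div_three = partitioned_reduce(truediv, [ratios[0:2], ratios[2:4], ratios[4:6]])
print(div_single, div_three)  # division: the split changes the answer

# addition is associative and commutative, so the split does not matter
add_single = partitioned_reduce(add, [values])
add_two = partitioned_reduce(add, [values[:4], values[4:]])
print(add_single, add_two)
```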

## 5. RDD LAZY Evaluation

In [None]:
#regular Python is also LAZY when programming in the functional style: map returns an iterator and computes nothing yet
data = [1.6, 2.4, 7.8, 4.6, 2.3]

newdata_lazy = map(lambda x: x+1, data)
newdata_lazy

In [None]:
#you can force the "action" by explicitly converting to a list
newdata = list(newdata_lazy)
newdata

In [None]:
#let's look at this in spark
data_rdd = sc.parallelize(data)
#nothing has been done at this point
newdata_rdd = data_rdd.map(lambda x: x+1)  

In [None]:
# perform the .collect() action to bring the modified list back to driver
newdata = newdata_rdd.collect()
newdata
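To see the laziness directly, one option is to record each call of the mapped function. The sketch below uses Python's built-in `map` as an analogy for a Spark transformation; the `calls` list and the `add_one` function are hypothetical helpers introduced only for this illustration.

```python
calls = []

def add_one(x):
    calls.append(x)  # record that the function actually ran
    return x + 1

data = [1.6, 2.4, 7.8, 4.6, 2.3]

lazy = map(add_one, data)   # like a transformation: nothing has run yet
calls_before = len(calls)

result = list(lazy)         # like an action: forces the computation
calls_after = len(calls)

print(calls_before, calls_after)
print(result)
```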

## 6. Task - Simple WordCount

In [None]:
wordsList = ['dog', 'cat', 'cat', 'bird', 'dog', 'elephant', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
#check the type of wordsRDD
print(type(wordsRDD))

In [None]:
#look at RDD partitions
for x in wordsRDD.glom().collect():
    print(x)

In [None]:
wordPairs = wordsRDD.map(lambda x:(x,1))
wordPairs.collect()

Find the number of unique words

In [None]:
# Note that groupByKey requires no parameters
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print('{}: {}'.format(key, list(value)))
wordsGrouped.collect()

In [None]:
uniqueWords = wordsGrouped.keys().count()
print(uniqueWords)

Reduce to get the word count

In [None]:
# Note that reduceByKey takes in a function that accepts two values and returns a single value
wordCounts = wordPairs.reduceByKey(lambda x,y: x+y)
wordCounts.collect()

In [None]:
totalCount = (wordCounts.map(lambda x:x[1]).reduce(lambda x,y: x+y))
average = totalCount / float(uniqueWords)
print(totalCount)
print(round(average, 2))
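For a list this small, the Spark results can be cross-checked on the driver with `collections.Counter` from the standard library. This is a verification sketch only; the `local_*` names are introduced here and are not part of the task.

```python
from collections import Counter

words = ['dog', 'cat', 'cat', 'bird', 'dog', 'elephant', 'cat']

local_counts = Counter(words)             # word -> number of occurrences
local_unique = len(local_counts)          # number of distinct words
local_total = sum(local_counts.values())  # total number of words
local_average = local_total / local_unique

print(sorted(local_counts.items()))
print(local_unique, local_total, round(local_average, 2))
```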