reference: https://github.com/edyoda/pyspark-tutorial

#### Introduction to Spark

- At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster.
- The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
- RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing collection in the driver program (a Python list, in PySpark), and transforming it.
- Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
- Finally, RDDs automatically recover from node failures.

In [3]:
import pyspark
from pyspark import SparkContext

#### RDD

- A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark.
- It represents an immutable, partitioned collection of elements that can be operated on in parallel.

#### Two ways to create an RDD

- parallelize - from a collection in the driver program
- textFile - from an external file

In [4]:
sc = SparkContext("local", "count app")

#### parallelize

In [10]:
rdd = sc.parallelize([1,2,3,4,5],2)

In [11]:
rdd.collect()

[1, 2, 3, 4, 5]

#### External Datasets

In [22]:
baby_name = sc.textFile('data/Baby_Names_Beginning_2007.csv')

#### Basics of RDD

In [23]:
lines = sc.textFile('data/Baby_Names_Beginning_2007.csv')

lines.first()

'Year,First Name,County,Sex,Count'

In [26]:
lines.take(5)

['Year,First Name,County,Sex,Count',
 '2013,GAVIN,ST LAWRENCE,M,9',
 '2013,LEVI,ST LAWRENCE,M,9',
 '2013,LOGAN,NEW YORK,M,44',
 '2013,HUDSON,NEW YORK,M,49']

In [27]:
# length of first 5 elements
lines.map(lambda s: len(s)).take(5)

[32, 26, 25, 24, 25]

In [28]:
# return twice the total number of characters:
# each line length is doubled by the second map before the sum
rdd = lines.map(lambda s: len(s))
rdd = rdd.map(lambda s: 2*s)
print(rdd.reduce(lambda a,b: a+b))

2424036


#### Key-Value Pairs RDD

In [29]:
rdd = sc.parallelize(["hello", "world", "good", "hello"])

In [30]:
rdd = rdd.map(lambda w: (w,1))
rdd.collect()

[('hello', 1), ('world', 1), ('good', 1), ('hello', 1)]

- reduceByKey applies the lambda to the values that share the same key, merging them two at a time
- Note: any RDD made of (key, value) pairs can use the {Any}ByKey operations (reduceByKey, groupByKey, aggregateByKey, ...)

In [31]:
rdd.reduceByKey(lambda x, y: x+y).collect()

[('hello', 2), ('world', 1), ('good', 1)]
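
To make the semantics concrete, here is a plain-Python sketch of what reduceByKey computes for the pairs above. It is a model under simplifying assumptions (a single machine, no partitions), and `reduce_by_key` is a hypothetical helper, not part of PySpark.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Model of reduceByKey: fold the values of each key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # reduce applies func pairwise, just like the lambda passed to reduceByKey
    return {key: reduce(func, values) for key, values in grouped.items()}

pairs = [("hello", 1), ("world", 1), ("good", 1), ("hello", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```

Because the function is applied pairwise and in no guaranteed order across partitions, Spark expects it to be associative and commutative, as addition is here.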

#### Transformation
- E.g. map, filter, flatMap ...
- Changes data from one format to another, producing a new RDD
- Lazy execution - Spark delays execution until it finds an 'Action', so that it can prepare an optimized lineage (Spark's internal pipeline of operations)

#### Actions
- E.g. count, collect, reduce ...
- Trigger execution of the pipeline and return a result to the driver
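
The split between lazy transformations and actions can be illustrated without Spark. The sketch below is only an analogy: it uses Python's built-in lazy `map` and `filter` in place of RDD transformations and `list()` in place of an action. The names `traced_double` and `log` are hypothetical helpers, and Spark's real scheduler does far more than this.

```python
# Analogy only: Python iterators stand in for RDD transformations.
log = []

def traced_double(x):
    log.append(x)  # record the moment the function actually runs
    return 2 * x

data = [1, 2, 3, 4, 5]

# "Transformations": building the pipeline executes nothing yet.
doubled = map(traced_double, data)
large = filter(lambda x: x > 4, doubled)
calls_before_action = len(log)

# "Action": consuming the pipeline forces every step to run.
result = list(large)
calls_after_action = len(log)

print(calls_before_action, result, calls_after_action)
```

No call to `traced_double` happens until `list()` asks for the data, which mirrors how an RDD pipeline only runs once an action such as `collect` is called.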

#### Shuffle Operations

- Many operations in Spark trigger a shuffle, i.e. movement of data from one node to another.
- Data movement is expensive and should be kept to a minimum

In [33]:
# Create an RDD with 2 partitions
rdd = sc.parallelize(["hello", "world", "good", "hello"], 2)

In [34]:
# glom - returns the data of each partition as a list
rdd.glom().collect()

[['hello', 'world'], ['good', 'hello']]

In [36]:
rdd = rdd.map(lambda w:(w,1))

In [37]:
rdd.glom().collect()

[[('hello', 1), ('world', 1)], [('good', 1), ('hello', 1)]]

In [38]:
# reduceByKey - generates a new RDD where all the values of the same key are combined;
# here the lambda packs them into a tuple
rdd.reduceByKey(lambda a,b:(a,b)).collect()

[('world', 1), ('good', 1), ('hello', (1, 1))]

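- The above operation brings all data with the same key onto one node
- This operation causes data shuffling

Note - reduceByKey combines the values of each key inside every partition before the shuffle, so it usually moves less data than groupByKey, which sends every (key, value) pair across the network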
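
The effect of combining values inside each partition before the shuffle can be sketched in plain Python. This is an illustrative model only: the partition contents and the helper `combine_locally` are made up for the example, and counting records ignores the serialization and network details of a real shuffle.

```python
from collections import Counter

# Hypothetical data: two partitions of (word, 1) pairs.
partitions = [
    [("hello", 1), ("world", 1), ("hello", 1)],
    [("good", 1), ("hello", 1), ("hello", 1)],
]

def combine_locally(part):
    """Sum the values per key inside one partition (map-side combine)."""
    local = Counter()
    for key, value in part:
        local[key] += value
    return local

# groupByKey-style: every record crosses the network unchanged.
records_without_combine = sum(len(part) for part in partitions)

# reduceByKey-style: one record per key per partition is sent.
combined = [combine_locally(part) for part in partitions]
records_with_combine = sum(len(local) for local in combined)

# After the shuffle the partial sums are merged per key.
final = Counter()
for local in combined:
    final.update(local)

print(records_without_combine, records_with_combine, dict(final))
```

In this toy case 6 records shrink to 4 before the shuffle; the saving grows with the number of repeated keys per partition.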

#### RDD Operations

#### aggregate
- First aggregates the elements of each partition (seqOp).
- Then aggregates the results of all partitions (combOp).
- zeroValue is the initial value used for both steps

In [42]:
# seqOp folds one element into a (sum, count) accumulator
seqOp = lambda x, y: (x[0] + y, x[1] + 1)
# combOp merges two (sum, count) accumulators
combOp = lambda x, y: (x[0] + y[0], x[1] + y[1])

print(sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp))

print(sc.parallelize([]).aggregate((0, 0), seqOp, combOp))

(10, 4)
(0, 0)
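
The two steps can be traced by hand with a plain-Python sketch. It assumes the four elements are split into two partitions as `[[1, 2], [3, 4]]` (Spark decides the real split), and `aggregate`, `seq_op` and `comb_op` are hypothetical stand-ins written for this illustration, not Spark's implementation.

```python
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    """Model of RDD.aggregate over an explicit list of partitions."""
    # Step 1: fold each partition on its own, starting from zero_value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Step 2: merge the per-partition results, again starting from zero_value.
    return reduce(comb_op, partials, zero_value)

def seq_op(acc, x):
    # acc is a (sum, count) pair; x is one element
    return (acc[0] + x, acc[1] + 1)

def comb_op(a, b):
    # a and b are both (sum, count) pairs
    return (a[0] + b[0], a[1] + b[1])

partitions = [[1, 2], [3, 4]]  # assumed split
total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
empty = aggregate([[]], (0, 0), seq_op, comb_op)

print((total, count), total / count, empty)
```

Step 1 yields `(3, 2)` and `(7, 2)`, and step 2 merges them into `(10, 4)`, from which the mean 2.5 follows. An empty partition simply returns the zero value.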


#### aggregateByKey
- seqOp folds the values of each key within a partition
- combOp merges the per-partition results of each key
- ByKey means the aggregation is carried out separately for every key
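
As a sketch of these three points, the plain-Python model below computes a per-key (sum, count). The data and the helper `aggregate_by_key` are hypothetical; in PySpark the corresponding call is `rdd.aggregateByKey(zeroValue, seqFunc, combFunc)` on an RDD of (key, value) pairs.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Model of aggregateByKey over an explicit list of partitions."""
    # Step 1: inside each partition, fold the values of every key with seq_op.
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero_value), value)
        partials.append(local)
    # Step 2: merge the per-partition accumulators of each key with comb_op.
    merged = {}
    for local in partials:
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged

def seq_op(acc, value):
    return (acc[0] + value, acc[1] + 1)

def comb_op(a, b):
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical data: two partitions of (key, value) pairs.
partitions = [
    [("a", 3), ("b", 5)],
    [("a", 7), ("a", 1)],
]
result = aggregate_by_key(partitions, (0, 0), seq_op, comb_op)
print(result)
```

Key `a` appears in both partitions, so its accumulators `(3, 1)` and `(8, 2)` are merged by `comb_op`; key `b` lives in one partition and needs no merge.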