In [1]:
simple = ['Cat','camel','bird','Zebra']
sorted(simple, key=lambda arg: arg.lower())

['bird', 'camel', 'Cat', 'Zebra']

In [2]:
# filter()
list(filter(lambda arg: len(arg) < 4, simple))

['Cat']

In [3]:
# map()
list(map(lambda arg: arg.upper(), simple))

['CAT', 'CAMEL', 'BIRD', 'ZEBRA']

In [4]:
# reduce()
from functools import reduce

In [5]:
reduce(lambda val1, val2: val1 + val2, simple)

'CatcamelbirdZebra'

In [6]:
import findspark

In [7]:
findspark.init('/home/daniel/spark-2.4.7-bin-hadoop2.7')

In [8]:
import pyspark

#### To interact with PySpark, you create specialized data structures called
#### Resilient Distributed Datasets (RDDs).
#### RDDs hide the complexity of transforming and distributing data across multiple nodes.
#### A SparkContext object lets you connect to a Spark cluster and create RDDs.
#### The local[*] string is a special master URL: it runs Spark locally with as many worker threads as there are logical cores.

In [9]:
sc = pyspark.SparkContext('local[*]')

#### Creating a SparkContext can be more involved when using a cluster. To connect to a Spark cluster, you may need to configure authentication.

In [10]:
conf = pyspark.SparkConf()
conf.setMaster('spark://head_node:56887')
conf.set('spark.authenticate', True)
conf.set('spark.authenticate.secret', 'secret-key')
# sc = pyspark.SparkContext(conf=conf)  # left commented out: sc already exists in this session

<pyspark.conf.SparkConf at 0x7fad6c1725f8>

#### parallelize() can transform some Python data structures, like lists and tuples, into RDDs.

In [11]:
# The following code creates a range of 10,000 elements and then uses parallelize() to distribute that data into 2 partitions.
big_list = range(10000)
rdd = sc.parallelize(big_list, 2)
odds = rdd.filter(lambda x: x % 2 != 0)
odds.take(5)

[1, 3, 5, 7, 9]

### NOTE: Because this uses the RDD filter() method (not the built-in filter()), the operation runs in a distributed manner across several CPUs or computers.
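
#### A rough, Spark-free sketch of the idea behind a partitioned filter: split the data, filter each partition in its own worker, then combine the results.
#### The helper names (split_into_partitions, partitioned_filter) are hypothetical, and threads stand in for executors. This is not how Spark is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Cut data into num_partitions contiguous slices of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partitioned_filter(predicate, data, num_partitions=2):
    """Filter each partition in its own worker, then concatenate in partition order."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        filtered = pool.map(lambda part: [x for x in part if predicate(x)], partitions)
        return [x for part in filtered for x in part]


odds_sketch = partitioned_filter(lambda x: x % 2 != 0, range(10000))
print(odds_sketch[:5])  # [1, 3, 5, 7, 9]
```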

# Passing Functions to Spark

In [12]:
class MyClass(object):
    def func(self, s):
        return s
    def doStuff(self, rdd):
        return rdd.map(self.func)

In [13]:
# Simplest fix: copy the field into a local variable so the lambda does not reference self
# (referencing self would send the whole object to the executors):
def doStuff(self, rdd):
    field = self.field
    return rdd.map(lambda s: field + s)
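
#### The difference between the two versions can be inspected in plain Python, without Spark.
#### A minimal sketch, assuming a hypothetical Greeter class with a large big_payload attribute: the bound method keeps a reference to the whole object, while the lambda built from a local copy captures only the string it needs.

```python
class Greeter:
    """Hypothetical class, used only to show what each callable holds on to."""

    def __init__(self):
        self.field = "Hello, "
        self.big_payload = list(range(100000))  # stands in for heavy state

    def func(self, s):
        return self.field + s

    def mapper_with_local_copy(self):
        field = self.field  # copy the one value the lambda needs
        return lambda s: field + s


greeter = Greeter()

# A bound method references its instance, so shipping it means shipping the object.
bound = greeter.func
print(bound.__self__ is greeter)  # True

# The lambda's closure holds only the copied string, not the Greeter instance.
mapper = greeter.mapper_with_local_copy()
captured = [cell.cell_contents for cell in mapper.__closure__]
print(captured)         # ['Hello, ']
print(mapper("Spark"))  # Hello, Spark
```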

## Understanding closures

### One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster.
### RDD operations that modify variables outside of their scope are a frequent source of confusion.

In [23]:
# Example of what NOT to do!
# Source: Stack Overflow, 'Understanding closures and parallelism in Spark'
data = [1,2,3,4,5]
counter = 0
rdd = sc.parallelize(data)

# Wrong
def increment_counter(x):
    global counter
    counter += 1
rdd.foreach(increment_counter)

print("counter value: ", counter)

counter value:  0


#### The cause of the issue is the life cycle of the variable counter.
#### Assume a Spark cluster with 1 driver and 2 executor nodes.
#### Each executor gets its own copy of the variables and methods, so all updates take place in that copy only.
#### The counter on the driver is still zero because the executors modified their own copies and left the driver's variable untouched.
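
#### The same effect can be imitated in plain Python. This is only a simulation under stated assumptions: simulate_foreach is a hypothetical helper, and copy.deepcopy stands in for Spark shipping the closure to each executor.

```python
import copy


def count_element(state, x):
    """Runs 'on an executor': updates whatever state it was handed."""
    state["counter"] += 1


def simulate_foreach(data, driver_state, num_executors=2):
    """Give each simulated executor a private copy of the driver's state."""
    partitions = [data[i::num_executors] for i in range(num_executors)]
    executor_states = []
    for partition in partitions:
        local_state = copy.deepcopy(driver_state)  # the executor's own copy
        for x in partition:
            count_element(local_state, x)
        executor_states.append(local_state)
    return executor_states


driver_state = {"counter": 0}
executor_states = simulate_foreach([1, 2, 3, 4, 5], driver_state)

print("driver counter:", driver_state["counter"])                     # 0
print("executor counters:", [s["counter"] for s in executor_states])  # [3, 2]
```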

## Instead... use Accumulators 

#### The closure is the set of variables and methods that must be visible to the executor for it to perform its computations on the RDD.
#### Next time, before updating a variable in a loop while working with Spark, think about its scope and how Spark will break down the execution.

In [28]:
# Let's apply the accumulator.
# It starts at 1, so adding 2, 3, 4 and 5 below gives 15.
num = sc.accumulator(1)

In [29]:
def inc_counter(x):
    global num
    num += x

rdd = sc.parallelize([2,3,4,5])
rdd.foreach(inc_counter)
print("counter value: ", num)

counter value:  15


# PySpark Serializers

### Serialization matters for performance tuning on Apache Spark, because all data that is sent over the network, written to disk, or persisted in memory must be serialized.
### PySpark supports two types of serializers: MarshalSerializer & PickleSerializer
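
#### MarshalSerializer builds on Python's marshal module and PickleSerializer on pickle; the Spark docs describe the marshal-based one as faster but supporting fewer datatypes.
#### The sketch below shows that trade-off with the standard-library modules directly, not through PySpark. The sample values (plain, rich) are made up for illustration.

```python
import datetime
import marshal
import pickle

plain = [1, 2.5, "spark", (3, 4)]  # only core built-in types
rich = datetime.date(2020, 1, 1)   # a non-core type

# Core built-in types round-trip through both modules.
plain_via_marshal = marshal.loads(marshal.dumps(plain))
plain_via_pickle = pickle.loads(pickle.dumps(plain))
print(plain_via_marshal == plain, plain_via_pickle == plain)  # True True

# pickle also handles the date object.
rich_via_pickle = pickle.loads(pickle.dumps(rich))
print(rich_via_pickle == rich)  # True

# marshal rejects it with ValueError.
try:
    marshal.dumps(rich)
    marshal_handles_date = True
except ValueError:
    marshal_handles_date = False
print(marshal_handles_date)  # False
```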

# Storage Level in Spark 

### PySpark StorageLevel decides how an RDD is stored in Apache Spark (in memory, on disk, or both).
### It also decides whether to serialize the RDD and whether to replicate its partitions.