## Spark Tutorial (PySpark)

### Initialize the SparkContext

In [1]:
from pyspark import SparkConf
from pyspark.context import SparkContext
# appName is the name your application shows on the cluster UI.
# master is a Spark, Mesos, or YARN cluster URL, or "local[*]" to run locally on all cores.
conf = SparkConf().setAppName('appName').setMaster('local[*]')

# The first thing a Spark program must do is create a SparkContext object,
# which tells Spark how to access a cluster.
sc = SparkContext(conf=conf)
 


## RDD 
- An RDD (resilient distributed dataset) is a fault-tolerant collection of elements that can be operated on in parallel.
- There are two ways to create RDDs:
     - Parallelizing an existing collection in your driver program
     - Referencing a dataset in an external storage system

### RDD by Parallelized Collections
- Created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program.
- The given data is copied to form a distributed dataset that can be operated on in parallel.
- One important parameter is the number of partitions to cut the dataset into:
     - Normally, Spark tries to set the number of partitions automatically based on your cluster.
     - You can also set it manually by passing it as a second parameter to parallelize.
  

In [56]:
# rdd from parallelization
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData_with_10_partition = sc.parallelize(data, 10)

In [None]:
# sum all elements of the collection with reduce
distData.reduce(lambda a, b: a + b)

### RDD from External source 
- PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc.
- Spark supports text files, SequenceFiles, and any other Hadoop InputFormat
- The textFile method takes a URI for the file (either a local path on the machine, or a hdfs://, s3a://, etc. URI) and reads it as a collection of lines.
- If you use a path on the local filesystem, the file must also be accessible at the same path on worker nodes.
- Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS).
- RDD.saveAsPickleFile and SparkContext.pickleFile support saving an RDD in a simple format consisting of pickled Python objects.
- SequenceFiles hold key-value pairs and can be saved and loaded by specifying the path. When PySpark loads one, Writables are automatically converted to base Java types and then pickled into Python objects.

In [81]:
# rdd from external dataset
# this method is preferred if the Spark version is < 2.4
local_text_rdd = sc.textFile("file:///Users/kaustuv/test.txt")

# Number of lines in this RDD
local_text_rdd.count()

# First line in this RDD
local_text_rdd.first()

 

# add up the sizes of all the lines using the map and reduce operations
local_text_rdd.map(lambda s: len(s)).reduce(lambda a, b: a + b)

# save the RDD as a pickle file (pickled Python objects, not a Hadoop SequenceFile)
distData.saveAsPickleFile("file:///Users/kaustuv/Documents/Courses/Spark/pyspark/sequence_file")

## RDD Operations


In [66]:
local_text_rdd.collect()

['just to test',
 "If you're inspired by innovation, hard work and a passion for data, this may be the ideal opportunity to leverage your background in Big Data and Software Engineering, Data Engineering or Data Analytics experience to design, develop and innovate big data solutions for a diverse set of global and enterprise clients.  ",
 '',
 "At phData, our proven success has skyrocketed the demand for our services, resulting in quality growth at our company headquarters conveniently located in Downtown Minneapolis and expanding throughout the US. Notably we've also been voted Best Company to Work For in Minneapolis for three (3) consecutive years.   ",
 '',
 'As the world’s largest pure-play Big Data services firm, our team includes Apache committers, Spark experts and the most knowledgeable Scala development team in the industry. phData has earned the trust of customers by demonstrating our mastery of Hadoop services and our commitment to excellence.',
 '',
 'In addition to a pheno

In [65]:
lineLengths = local_text_rdd.map(lambda s: len(s))

lineLengths.collect()

[12, 318, 0, 314, 0, 297, 0, 277]

In [69]:
totalLength = lineLengths.reduce(lambda a, b: a + b)
totalLength

1218

In [70]:
# persist lineLengths if you want to reuse it later in the pipeline
lineLengths.persist() 

[12, 318, 0, 314, 0, 297, 0, 277]

### PySpark Transformations

- Transformations are lazy: processing starts only when an action is called, not when a transformation is defined.

#### map(func)
   - Return a new distributed dataset formed by passing each element of the source through a function func. 


#### filter(func)
 - Return a new dataset formed by selecting those elements of the source on which func returns true. 

#### flatMap(func) 
 - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a sequence or other iterable rather than a single item). 
 
#### mapPartitions(func)
 - Similar to map, but runs separately on each partition (block) of the RDD, so func must take an iterator over a partition's elements and return an iterator (or other iterable) of results. 
 
#### mapPartitionsWithIndex(func)
 - Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must take two arguments (the partition index and an iterator over that partition's elements) and return an iterable of results.
    
#### sample(withReplacement, fraction, seed)
   - Sample a fraction (given by the fraction argument) of the data, with or without replacement, using a given random number generator seed. 
    
#### union(otherDataset) 
   - Return a new dataset that contains the union of the elements in the source dataset and the argument. 
    
#### intersection(otherDataset) 
   - Return a new RDD that contains the intersection of elements in the source dataset and the argument. 
    
#### distinct([numTasks]) 
   - Return a new dataset that contains the distinct elements of the source dataset.
    
#### groupByKey([numTasks]) 
   - When called on a dataset of (K, V) pairs, returns a dataset of (K, iterable of V) pairs.
   - Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
   - Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks. 


#### reduceByKey(func, [numTasks]) 
   - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. 
    
#### aggregateByKey(zeroValue, seqOp, combOp, [numTasks]) 
   - When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. 
    
#### sortByKey([ascending], [numTasks]) 
   - When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
    
#### join(otherDataset, [numTasks]) 
   - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin. 
    
#### cogroup(otherDataset, [numTasks]) 
   - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (iterable of V, iterable of W)) tuples. This operation is also called groupWith. 
    
#### cartesian(otherDataset) 
 - When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). 
 
#### pipe(command, [envVars]) 
 - Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings. 
 
#### coalesce(numPartitions)
   - Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. 
    
#### repartition(numPartitions) 
 - Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network. 
 
#### repartitionAndSortWithinPartitions(partitioner)
 - Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery. 



### PySpark Actions

#### reduce(func)
 - Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. 
 
#### collect() 
 - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. 
 
#### count()
 - Return the number of elements in the dataset. 
 
#### first()
 - Return the first element of the dataset (similar to take(1)). 
 
#### take(n)
 - Return an array with the first n elements of the dataset. 
 
#### takeSample(withReplacement, num, [seed]) 
 - Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
 
#### takeOrdered(n, [ordering]) 
 - Return the first n elements of the RDD using either their natural order or a custom comparator. 
 
#### saveAsTextFile(path)
 - Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString (str in Python) on each element to convert it to a line of text in the file. 
 
#### countByKey() 
 - Only available on RDDs of type (K, V). Returns a hashmap (a dictionary in Python) of (K, Int) pairs with the count of each key. 
 
#### foreach(func) 
 - Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. 
 

In [88]:
# DataFrame from an external dataset (despite its name, local_text_rdd is a DataFrame here)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

local_text_rdd = spark.read.text("file:///Users/kaustuv/test.txt")



In [89]:
## Number of rows in this DataFrame
local_text_rdd.count()



8

In [90]:
# First row in this DataFrame
local_text_rdd.first()



Row(value='just to test')

In [91]:
# let’s transform this DataFrame to a new one
linesWithSpark = local_text_rdd.filter(local_text_rdd.value.contains("this"))

In [93]:
linesWithSpark.collect()

[Row(value="If you're inspired by innovation, hard work and a passion for data, this may be the ideal opportunity to leverage your background in Big Data and Software Engineering, Data Engineering or Data Analytics experience to design, develop and innovate big data solutions for a diverse set of global and enterprise clients.  ")]

In [96]:
# chaining together transformations and actions:
linesWithSpark.filter(linesWithSpark.value.contains("this")).count()

1

In [97]:
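# more complex: find the largest number of words in any line

from pyspark.sql.functions import col, explode, size, split
from pyspark.sql.functions import max as sql_max  # avoid shadowing Python's built-in max
linesWithSpark.select(size(split(linesWithSpark.value, r"\s+")).alias("numWords")).agg(sql_max(col("numWords"))).collect()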

[Row(max(numWords)=52)]

In [98]:
# Spark can implement MapReduce flows as below (a raw string keeps the regex escape intact)

wordCounts = local_text_rdd.select(explode(split(local_text_rdd.value, r"\s+")).alias("word")).groupBy("word").count()

In [99]:
wordCounts.collect()

[Row(word='growth', count=2),
 Row(word='If', count=1),
 Row(word='throughout', count=1),
 Row(word='experts', count=1),
 Row(word='Data', count=4),
 Row(word='proven', count=1),
 Row(word='set', count=1),
 Row(word='demand', count=1),
 Row(word='committers,', count=1),
 Row(word='certifications', count=1),
 Row(word='data,', count=1),
 Row(word='Engineering', count=1),
 Row(word='by', count=2),
 Row(word='innovate', count=1),
 Row(word='success', count=1),
 Row(word='headquarters', count=1),
 Row(word='pure-play', count=1),
 Row(word='In', count=1),
 Row(word='opportunity', count=1),
 Row(word='PTO', count=1),
 Row(word='for', count=5),
 Row(word='ideal', count=1),
 Row(word='develop', count=1),
 Row(word='diverse', count=1),
 Row(word='skyrocketed', count=1),
 Row(word='in', count=6),
 Row(word='Work', count=1),
 Row(word="we've", count=1),
 Row(word='Company', count=1),
 Row(word='includes', count=1),
 Row(word='design,', count=1),
 Row(word='addition', count=2),
 Row(word='training

### Passing Functions to Spark

In [None]:
# To pass a function longer than a lambda can express, define it with def:
if __name__ == "__main__":
    def myFunc(s):
        words = s.split(" ")
        return len(words)

    sc.textFile("file.txt").map(myFunc)


# Passing a reference to a method of a class instance
# (this sends the whole object to the cluster along with the method)
class MyClass(object):
    def func(self, s):
        return s
    def doStuff(self, rdd):
        return rdd.map(self.func)


# Accessing a field of the outer object also references the whole object
class MyClassWithField(object):
    def __init__(self):
        self.field = "Hello"
    def doStuff(self, rdd):
        return rdd.map(lambda s: self.field + s)

    # Alternatively, copy the field into a local variable first so that
    # only the field value is captured by the closure
    def doStuffLocal(self, rdd):
        field = self.field
        return rdd.map(lambda s: field + s)
    


### Closure 
- Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor.
- Prior to execution, Spark computes the task’s closure. The closure is the set of variables and methods that must be visible for the executor to perform its computations on the RDD.
- The closure is serialized and sent to each executor.
- Because each executor works on its own copy of the closure, updates to variables in the driver's global scope are not propagated back. When tasks need to update shared state, use a special variable, the Accumulator.
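The effect of shipping a serialized closure can be imitated without a cluster. The sketch below is a simplified model, not Spark's actual mechanism: it uses Python's pickle module to stand in for closure serialization, and the names driver_state and run_task_on_executor are hypothetical.

```python
import pickle

# State that lives in the driver program
driver_state = {"counter": 0}


def run_task_on_executor(serialized_closure, value):
    """Model an executor: deserialize a private copy of the closure and update it."""
    local_state = pickle.loads(serialized_closure)
    local_state["counter"] += value
    return local_state["counter"]


# The closure is serialized once and a copy travels with every task
serialized = pickle.dumps(driver_state)
executor_results = [run_task_on_executor(serialized, v) for v in [1, 2, 3, 4]]

# Each task updated only its private copy, so the driver's counter is unchanged
driver_counter = driver_state["counter"]
print(executor_results, driver_counter)
```

This is why a plain global counter does not add up across tasks, whereas an Accumulator, whose updates Spark merges back on the driver, does.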

### Shuffle 

- The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions.
- This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

- For example, consider the reduceByKey operation. Not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result. To organize all the data for a single reduceByKey reduce task, Spark needs to perform an all-to-all operation: it must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key. This is called the shuffle.

- The shuffle also writes intermediate files to disk, and these are kept until the corresponding RDDs are no longer used and are garbage collected, so long-running Spark jobs may consume a large amount of disk space.



In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

### Spark Execution mode 

#### Local vs. cluster modes




### Word Count in Pyspark

In [51]:
# word count
text_data = sc.textFile("file:///Users/kaustuv/test.txt")
counts = (text_data.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
# saveAsTextFile creates a directory with this name containing one part file per partition
counts.saveAsTextFile('file:///Users/kaustuv/Documents/Courses/Spark/pyspark/out.txt')

### Caching : RDD Persistence

- One of Spark's most important capabilities is persisting (or caching) a dataset in memory across operations.
- Caching is a key tool for iterative algorithms and fast interactive use.
- You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.
- Spark’s cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
- Each persisted RDD can be stored using a different storage level, for example on disk, or in memory but as serialized Java objects.
- The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

- In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level.

### Storage Levels

MEMORY_ONLY : Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_ONLY_2 : Same as MEMORY_ONLY, but replicate each partition on two cluster nodes.

MEMORY_AND_DISK : Store the RDD in memory; if it does not fit, store the partitions that don't fit on disk and read them from there when they're needed.

MEMORY_AND_DISK_2 : Same as MEMORY_AND_DISK, but replicate each partition on two cluster nodes.

DISK_ONLY : Store the RDD partitions only on disk.

DISK_ONLY_2 : Same as DISK_ONLY, but replicate each partition on two cluster nodes.

DISK_ONLY_3 : Same as DISK_ONLY, but replicate each partition on three cluster nodes.

MEMORY_ONLY_SER (Java and Scala) : Store the RDD as serialized Java objects.




#### Choosing Storage Levels

- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option.

- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient.

- Use the replicated storage levels if you want fast fault recovery. All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.


#### Data Cleanup in Spark 

- Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
 
 - To remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method. It does not block by default; to block until resources are freed, specify blocking=True when calling it.

In [100]:
linesWithSpark.cache()

DataFrame[value: string]

In [101]:
linesWithSpark.count()

1

In [79]:
counts.persist()

PythonRDD[261] at RDD at PythonRDD.scala:53

In [80]:
counts.unpersist()

PythonRDD[261] at RDD at PythonRDD.scala:53


 
 
### Shared Variables

- Broadcast variable (read-only)

- Accumulator (tasks can only add to it; similar to a MapReduce counter)


In [None]:
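# For reference: casting a column from string to integer.
# Assumes df is an existing DataFrame with a string column "age".
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df.withColumn("age", df.age.cast(IntegerType()))
df.withColumn("age", df.age.cast('int'))
df.withColumn("age", df.age.cast('integer'))

# Using select
df.select(col("age").cast('int').alias("age"))

# Using selectExpr()
df.selectExpr("cast(age as int) age")

# Using spark.sql(); assumes df was registered as the view CastExample,
# e.g. with df.createOrReplaceTempView("CastExample")
spark.sql("SELECT INT(age),BOOLEAN(isGraduated),DATE(jobStartDate) from CastExample")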

### Broadcast Variables

- Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
- They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
- Spark automatically broadcasts the common data needed by tasks within each stage.
- Explicitly creating broadcast variables is therefore only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
- A broadcast variable is created from a variable v by calling SparkContext.broadcast(v). It is a wrapper around v, and its value can be accessed through the value attribute.

- To release the resources that the broadcast variable copied onto executors, call .unpersist()
- To permanently release all resources used by the broadcast variable, call .destroy()


In [71]:
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value

[1, 2, 3]

### Accumulators

- Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel.
- They can be used to implement counters.
- Spark natively supports accumulators of numeric types, and programmers can add support for new types.
- In the web UI, Spark displays the value of each accumulator modified by a task in the “Tasks” table (tracking accumulators in the UI is not yet supported in Python).

- To support a new type, subclass AccumulatorParam.
- The AccumulatorParam interface has two methods: zero for providing a “zero value” for your data type, and addInPlace for adding two values together.
- Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action.



In [72]:
accum = sc.accumulator(0)
accum

Accumulator<id=0, value=0>

In [74]:
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value

10

## Building application 

- If you are building a packaged PySpark application or library, you can add PySpark to your setup.py file as:

```
    install_requires=[
        'pyspark==3.1.1'
    ]
```
   
- An example of a simple Spark application, SimpleApp.py:

```
"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()

```

- And run this application using the bin/spark-submit script:

```
spark-submit \
  --master local[4] \
  SimpleApp.py
  
```

## Submitting Spark Applications

- Jobs are submitted using the spark-submit script in Spark’s bin directory.

### Bundling Your Application’s Dependencies
- Package dependencies alongside your application in order to distribute the code to a Spark cluster.
- Spark and Hadoop should be listed as provided dependencies and not bundled in the package, since they are provided by the cluster manager at runtime.
- For JVM applications, pass the assembled jar when calling the bin/spark-submit script. For Python, use the --py-files argument of spark-submit to add .py, .zip, or .egg files to be distributed with your application.

### Launching Applications with spark-submit

- Application can be launched using the bin/spark-submit script
- Some of the commonly used options are:
    - --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
    
    - --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077) (* see next section)
    
    - --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client). In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. Currently, the standalone mode does not support cluster mode for Python applications. 
    - --conf: Arbitrary Spark configuration property in key=value format
    - application-jar: Path to a bundled jar including your application and all dependencies
    - application-arguments: Arguments passed to the main method of your main class, if any
    
    
- To enumerate all such options available to spark-submit, run it with --help

- Spark uses the following URL schemes to allow different strategies for disseminating jars: file:, hdfs:, http:, https:, ftp:, and local:.

### Master URLs
The master URL passed to Spark can be in one of the following formats:

- local : Run Spark locally with one worker thread (i.e. no parallelism at all).
- local[K] : Run Spark locally with K worker threads.
- local[K,F] : Run Spark locally with K worker threads and at most F task failures (maxFailures).
- local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
- local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and at most F task failures (maxFailures).
- spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
- spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters managed by ZooKeeper.
- mesos://HOST:PORT : Connect to the given Mesos cluster.
- yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode.
- k8s://HOST:PORT : Connect to a Kubernetes cluster in cluster mode.

In [None]:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
    
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster in cluster deploy mode
# (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://xx.yy.zz.ww:443 \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  http://path/to/examples.jar \
  1000

## Monitoring

- Each driver program has a web UI.
- By default it can be accessed on port 4040, e.g. http://driver-node:4040
- If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
 
- After an application has finished, its UI can still be reconstructed through Spark’s history server, provided the application’s event logs exist. Start the server with ./sbin/start-history-server.sh

- Job metrics are also available as JSON, so developers can create new visualizations and monitoring tools for Spark.
 
- Information about the application includes:
    - A list of scheduler stages and tasks
    - A summary of RDD sizes and memory usage
    - Environmental information
    - Information about the running executors


## Job Scheduling in Spark

- The cluster managers that Spark runs on provide facilities for scheduling across applications.

- Within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads.

- Spark also includes a fair scheduler to schedule resources within each SparkContext.

- Dynamic resource allocation is disabled by default. One way to enable it is to set both of the following properties to true:
     - spark.dynamicAllocation.enabled  
     - spark.dynamicAllocation.shuffleTracking.enabled
 

## Tuning Spark

### Data Serialization
- Storing RDDs in serialized form reduces memory usage.
- Spark provides two serialization libraries:
    - Java serialization: By default, Spark serializes objects using Java’s ObjectOutputStream framework, which works with any class that implements java.io.Serializable. You can control serialization performance more closely by extending java.io.Externalizable.

    - Kryo serialization: 
        - Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
        - To use Kryo, initialize your job with a SparkConf and call conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
        
    
    
### Memory Tuning

- There are three considerations in tuning memory usage:
    - The amount of memory used by your objects
    - The cost of accessing those objects
    - The overhead of garbage collection
- The sections below describe how to determine the memory usage of your objects and how to improve it, either by changing your data structures or by storing data in a serialized format. They then cover tuning Spark’s cache size and the Java garbage collector.


#### Memory Management Overview

- In Spark, execution and storage share a unified region (M).

#### Determining Memory Consumption

- To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method (a JVM utility, available from Scala and Java).

#### Tuning Data Structures

- Avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this:

    - Prefer arrays of objects and primitive types over the standard Java or Scala collection classes.
    - Avoid nested structures with a lot of small objects and pointers when possible.
    - Consider using numeric IDs or enumeration objects instead of strings for keys.
    - If you have less than 32 GiB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight. You can add these options in spark-env.sh.
    

#### Serialized RDD Storage

- Use the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER.
- The downside of storing data in serialized form is slower access times, because each object has to be deserialized on the fly.

- The Spark documentation highly recommends using Kryo if you want to cache data in serialized form.

#### Garbage Collection Tuning

- The cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects greatly lowers this cost.

- An even better method is to persist objects in serialized form, as described above: now there will be only one object (a byte array) per RDD partition.

- The first step in GC tuning is measuring the impact of GC: collect statistics on how frequently garbage collection occurs and the amount of time spent on GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options.

### Other Considerations

- Level of Parallelism:
     - Set the level of parallelism for each operation high enough to fully utilize the cluster.
     - You can pass the level of parallelism as an argument to shuffle operations such as reduceByKey, or change the default with the config property spark.default.parallelism.
        
- Parallel Listing on Input Paths:
    - Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories; otherwise the process could take a very long time, especially against an object store like S3.
        

- Broadcasting Large Variables:
    - Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster.
        
- Data Locality:
    - If data and the code that operates on it are together, computation tends to be fast.
    - Spark prefers to schedule all tasks at the best locality level, but this is not always possible.
    - When there is no unprocessed data on any idle executor, Spark typically waits a bit in the hope that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU.
    - The wait timeout for fallback between each level can be configured individually or all together in one parameter; see the spark.locality parameters in the Spark configuration documentation.
    



## Unit Testing in Spark

- Test outside the cluster.
- Test to prevent bugs from propagating to the cluster.
- Test your own logic, not Spark itself.
- An IDE is preferred over a notebook for testing (and debugging).
- Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.

- For PySpark, the unittest.mock library can be used to replace external dependencies in tests.


In [None]:
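import unittest

from pyspark import SparkConf
from pyspark.sql import SparkSession


class PySparkTestCase(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        conf = SparkConf().setMaster("local[2]").setAppName("testing")
        # SparkSession replaces the deprecated SQLContext as the DataFrame entry point
        cls.spark = SparkSession.builder.config(conf=conf).getOrCreate()
        cls.sc = cls.spark.sparkContext

    @classmethod
    def tearDownClass(cls):
        # Stopping the session also stops the underlying SparkContext
        cls.spark.stop()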
        
        
class SimpleTestCase(PySparkTestCase):

    def test_with_rdd(self):
        test_input = [
            ' hello spark ',
            ' hello again spark spark'
        ]

        input_rdd = self.sc.parallelize(test_input, 1)

        from operator import add

        results = input_rdd.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(add).collect()
        # collect() gives no ordering guarantee after a shuffle, so compare sorted lists
        self.assertEqual(sorted(results), [('again', 1), ('hello', 2), ('spark', 3)])

    def test_with_df(self):
        df = self.spark.createDataFrame(data=[[1, 'a'], [2, 'b']], 
                                        schema=['c1', 'c2'])
        self.assertEqual(df.count(), 2)
