# Unit 7 Optimizing, Monitoring and Debugging Applications

## Contents

```
1. Performance considerations
1.1. RDD lineage
1.2. RDD persistence
1.3. Broadcast variables
1.4. Accumulators
1.5. Repartition and coalesce
2. Monitoring and Debugging
2.1. YARN RM UI
2.2. Spark UI and Spark History
2.2.1. Spark Event Timeline
2.2.2. Spark DAG Visualization
2.2.3. How to interpret the DAG
2.3. How to see the logs of a job
2.4. How to change the log level
2.5. Understanding how to configure memory limits
2.6. How to tune the partitioner

```

# Performance considerations

### RDD lineage
Each time you apply a transformation to an RDD, Spark does not execute it immediately. Instead, it records it in what is called the RDD lineage.

This lineage keeps track of all the transformations that have to be applied to produce the final RDD and in which order, starting from reading the data from HDFS.

The lineage is what makes RDDs fault tolerant: if something goes wrong and an executor is lost, Spark can recompute the lost partitions of the RDD from the original data in HDFS.
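
As a minimal sketch of how a lineage builds up, the function below chains two transformations and returns the resulting RDD without running anything. The name `build_squares` is hypothetical, and `sc` is assumed to be an existing SparkContext such as the one provided by the pyspark shell or the notebook.

```python
def build_squares(sc):
    """Chain lazy transformations; `sc` is an existing SparkContext."""
    numbers = sc.parallelize(range(1000), 4)
    # Transformations only extend the lineage: nothing is computed yet
    evens = numbers.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)
    return squares


# In the pyspark shell or a notebook, where `sc` already exists:
#   squares = build_squares(sc)
#   print(squares.toDebugString())  # the lineage Spark has recorded
#   squares.count()                 # an action: now the lineage is executed
```

The output of `toDebugString()` depends on the Spark version, so it is not reproduced here.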

### RDD persistence

If you are going to reuse an RDD, it is very useful to persist it so that Spark does not need to recompute it each time you operate on it (by default it is persisted in memory).

In [None]:
rdd.cache()

The same can be done for a DataFrame:

In [None]:
df.cache()

In a similar way when you no longer need it you can unpersist it:

In [None]:
rdd.unpersist()

It is also possible to specify the storage level:

In [None]:
from pyspark import StorageLevel
# Pick ONE of the following: the storage level of an RDD cannot be
# changed once it has been set (call rdd.unpersist() first)
# Equivalent to rdd.cache()
rdd.persist(StorageLevel.MEMORY_ONLY)
# Use disk instead of memory
rdd.persist(StorageLevel.DISK_ONLY)
# Use disk if it does not fit in memory (spilling)
rdd.persist(StorageLevel.MEMORY_AND_DISK)

If you want to know more about why persistence is important and the different persistence options you can read:
* [RDD persistence](https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)
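
The typical pattern is to cache an RDD, run several actions on it, and release it afterwards. The following is only a sketch of that pattern: `count_and_max` is a hypothetical helper, and `rdd` is assumed to be any RDD of comparable values.

```python
def count_and_max(rdd):
    """Run two actions on the same RDD, caching it in between."""
    rdd.cache()              # only marks the RDD; nothing is computed yet
    try:
        n = rdd.count()      # first action: computes and caches the partitions
        largest = rdd.max()  # second action: reuses the cached partitions
    finally:
        rdd.unpersist()      # free the cached blocks when done
    return n, largest


# In the pyspark shell:
#   count_and_max(sc.parallelize([3, 9, 4]))
```

Without the call to `cache()`, the second action would recompute the RDD from its lineage.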

### Broadcast variables
If you have a **read-only** variable that must be shared between all the tasks, you can share it more efficiently using a broadcast variable:

In [None]:
# You create a broadcast variable in the driver
centroidsBC = sc.broadcast([1, 2, 3])

# And then you can read it in the different tasks with
centroidsBC.value

"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost." Source: [Spark Programming Guide](https://spark.apache.org/docs/1.6.1/programming-guide.html#broadcast-variables)
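
As an illustration related to the KMeans exercise, the sketch below broadcasts a list of centroids and uses it inside a `map`. It assumes one-dimensional points to keep the distance trivial; `closest_centroid` and `assign_points` are hypothetical names, and `sc` is assumed to be an existing SparkContext.

```python
def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to a 1-D point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))


def assign_points(sc, points, centroids):
    """Pair each point with the index of its closest centroid."""
    # Created once in the driver, cached on each machine
    centroids_bc = sc.broadcast(centroids)
    return (sc.parallelize(points)
              # Tasks read the broadcast value instead of receiving a copy
              .map(lambda p: (closest_centroid(p, centroids_bc.value), p))
              .collect())


# In the pyspark shell:
#   assign_points(sc, [0.9, 2.2, 5.0], [1, 2, 3])
```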

### Accumulators

Accumulators are **write-only** variables from the point of view of the tasks (only the driver can read them) that can be used to implement counters (as in MapReduce) or sums.

In [None]:
# Integer accumulator
events = sc.accumulator(0)
# Float accumulator
amount = sc.accumulator(0.0)

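Tasks can only add to an accumulator. For updates performed inside actions, Spark guarantees that each task's update is applied only once, even if the task is restarted: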

In [None]:
# On the executors
events += 1

Only the driver can access the value:

In [None]:
# Only works in the driver
events.value

For more information: [Spark Programming Guide](https://spark.apache.org/docs/1.6.1/programming-guide.html#accumulators)
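
A common use is counting malformed records while parsing. The sketch below is one possible way to do it: `parse_int` and `sum_and_count_bad` are hypothetical names, and `sc` is assumed to be an existing SparkContext. Since the accumulator is updated inside a transformation (`flatMap`), a re-executed task may be counted more than once, so the count is best treated as debugging information.

```python
def parse_int(line, bad_records):
    """Parse a line as an int; count failures in the accumulator."""
    try:
        return [int(line)]
    except ValueError:
        bad_records.add(1)   # same effect as `bad_records += 1`
        return []


def sum_and_count_bad(sc, lines):
    """Sum the valid integers and report how many lines were invalid."""
    bad_records = sc.accumulator(0)
    values = sc.parallelize(lines).flatMap(
        lambda line: parse_int(line, bad_records))
    total = values.sum()     # action: runs the tasks, updating the accumulator
    # Only the driver can read the value, and only after an action has run
    return total, bad_records.value


# In the pyspark shell:
#   sum_and_count_bad(sc, ["1", "2", "x", "4", ""])
```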

### Repartition and Coalesce

You can change the number of partitions of an RDD using:

In [None]:
rdd.repartition(10)

If you only need to reduce the number of partitions, this is done more efficiently using:

In [None]:
rdd.coalesce(4)

**coalesce()** is an optimized version of repartition() that avoids a full shuffle of the data, but only if you are decreasing the number of RDD partitions.
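
Both methods return a new RDD and leave the original one unchanged, so the result has to be kept. The sketch below picks between them depending on the target; `resize` is a hypothetical helper and `rdd` is assumed to be any existing RDD.

```python
def resize(rdd, target):
    """Return an RDD with `target` partitions, shuffling only if needed."""
    current = rdd.getNumPartitions()
    if target == current:
        return rdd                    # nothing to do
    if target < current:
        return rdd.coalesce(target)   # merges partitions, no full shuffle
    return rdd.repartition(target)    # increasing requires a full shuffle


# In the pyspark shell:
#   rdd = sc.parallelize(range(100), 8)
#   smaller = resize(rdd, 4)
#   smaller.getNumPartitions()
```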

# Monitoring and Debugging

## YARN RM UI
To see the status of the cluster you can connect to the YARN Resource Manager User Interface and see the list of running applications:

* [Running Applications](http://yarn.hdp.cesga.es:8088/cluster/apps/RUNNING)

## Spark UI and Spark History
From YARN RM by following the ApplicationMaster link you can access the Spark UI of the running application or the Spark History server in case the application has finished.

WARN: The VPN is needed to access the private addresses.

### Understanding your Apache Spark Application Through Visualization
A Spark application is composed of:
* jobs
* stages
* tasks

#### Spark Event Timeline
The timeline view is available on three levels: across all jobs, within one job, and within one stage.
![Event Timeline](https://databricks.com/wp-content/uploads/2015/06/Screen-Shot-2015-06-19-at-1.55.07-PM-1024x481.png)

We can get more details about one of the jobs:
![Event Timeline Job](https://databricks.com/wp-content/uploads/2015/06/Screen-Shot-2015-06-19-at-1.56.30-PM-1024x426.png)

And finally for a stage:
![Event Timeline Stage](https://databricks.com/wp-content/uploads/2015/06/Screen-Shot-2015-06-19-at-1.57.36-PM-1024x823.png)

#### Execution DAG
A job is associated with a chain of RDD dependencies organized in a directed acyclic graph (DAG) that we can also visualize in the Spark UI:

![Execution DAG](https://databricks.com/wp-content/uploads/2015/06/Screen-Shot-2015-06-19-at-2.00.59-PM.png)




Source: [Understanding your Apache Spark Application Through Visualization](https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html)

## How to see the logs of a job
YARN keeps the aggregated logs produced by the job, which you can retrieve with:

    yarn logs -applicationId application_1489083567361_0070 | less

## Configuring the log level
For debugging it can be useful to change the log level.

Spark uses log4j for logging, so the most versatile way to do it is to edit the log4j.properties file.

In some cases it can be useful to set the log level from the SparkContext:

    sc.setLogLevel("INFO")
    sc.setLogLevel("WARN")

This allows you to tune the information shown in order to debug your application.
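
The level names are the standard log4j ones. As a small sketch, the hypothetical helper below checks the name before passing it to `sc.setLogLevel()`, so that a typo is reported in the driver with a clear message; `sc` is assumed to be an existing SparkContext.

```python
# Standard log4j level names, from most to least verbose (OFF disables logging)
VALID_LOG_LEVELS = ("ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF")


def set_log_level(sc, level):
    """Validate `level` and apply it to the given SparkContext."""
    level = level.upper()
    if level not in VALID_LOG_LEVELS:
        raise ValueError("Unknown log level %r, expected one of %s"
                         % (level, ", ".join(VALID_LOG_LEVELS)))
    sc.setLogLevel(level)
    return level


# In the pyspark shell:
#   set_log_level(sc, "debug")   # verbose while investigating a problem
#   set_log_level(sc, "warn")    # back to a quieter level
```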

## Understanding how to configure memory limits
To increase performance, Spark uses off-heap memory through [Project Tungsten](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html).

![Container memory layout](http://bigdata.cesga.gal/files/spark_memory_limits.png)

In case you are facing a **memoryOverhead issue**:
* The first thing to do is to increase **spark.yarn.executor.memoryOverhead**, the off-heap memory of each executor (used, for example, by Tungsten); the recommended value is about 10% of the executor memory
* The second thing to take into account is whether your data is balanced across the partitions

When using Python, decreasing the value of **spark.executor.memory** can help: the Python processes run entirely in off-heap memory and do not use the RAM reserved for the JVM heap. Since **spark.executor.memory** sizes the JVM heap only, reserving less space for the heap leaves more space for the off-heap operations, which is where Python operates.
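
The arithmetic can be sketched as follows. This is an estimate under two assumptions that should be checked against your Spark version: the YARN container holds the JVM heap plus the memory overhead, and when the overhead is not set explicitly it defaults to the larger of 384 MB and 10% of the executor memory. The function name is hypothetical.

```python
def executor_container_mb(executor_memory_mb, overhead_mb=None):
    """Estimate the YARN container size (in MB) needed by one executor."""
    if overhead_mb is None:
        # Assumed default: 10% of the heap, with a minimum of 384 MB
        overhead_mb = max(384, int(0.10 * executor_memory_mb))
    return executor_memory_mb + overhead_mb


print(executor_container_mb(1024))        # 1024 + 384  -> 1408
print(executor_container_mb(8192))        # 8192 + 819  -> 9011
# Same 8 GB container, but trading heap for off-heap room (e.g. for Python)
print(executor_container_mb(6144, 2048))  # 6144 + 2048 -> 8192
```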

Sources and further details:
* [Memory Overhead](https://gsamaras.wordpress.com/code/memoryoverhead-issue-in-spark/)
* [Understanding memory management in spark for fun and profit](https://www.slideshare.net/SparkSummit/understanding-memory-management-in-spark-for-fun-and-profit)
* [Project Tungsten: Bringing Apache Spark Closer to Bare Metal](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)

## Verifying Spark Configuration

When debugging an application it can be useful to verify the values of all the Spark Properties.

There are two options to do it:
* Connecting to the Spark UI and checking the Environment tab
* Programmatically using:

In [1]:
sc._conf.getAll()

[(u'spark.executor.extraLibraryPath',
  u'/opt/intel/mkl/lib/intel64/:/opt/intel/lib/intel64/'),
 (u'spark.history.kerberos.keytab', u'none'),
 (u'spark.eventLog.enabled', u'true'),
 (u'spark.driver.extraClassPath',
  u'/mnt/EMC/fs-hadoop/netlib-dependencies/target/netlib-dependencies-0.0.1-SNAPSHOT-jar-with-dependencies.jar'),
 (u'spark.yarn.scheduler.heartbeat.interval-ms', u'5000'),
 (u'spark.history.ui.port', u'18080'),
 (u'spark.shuffle.service.enabled', u'true'),
 (u'spark.history.fs.logDirectory', u'hdfs:///spark-history'),
 (u'spark.master', u'yarn-client'),
 (u'spark.yarn.containerLauncherMaxThreads', u'25'),
 (u'spark.yarn.historyServer.address', u'c13-18.node.int.cesga.es:18080'),
 (u'spark.yarn.queue', u'interactive'),
 (u'spark.app.name', u'PySparkShell'),
 (u'spark.eventLog.dir', u'hdfs:///spark-history'),
 (u'spark.yarn.preserve.staging.files', u'false'),
 (u'spark.yarn.submit.file.replication', u'3'),
 (u'spark.history.kerberos.principal', u'none'),
 (u'spark.rdd.compre
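
Since the full list is long, it can help to look only at the properties of interest. The helper below is a hypothetical sketch that filters a list of (key, value) pairs, such as the one returned by `sc._conf.getAll()`, by key prefix.

```python
def filter_properties(properties, prefix):
    """Return the (key, value) pairs whose key starts with `prefix`, sorted by key."""
    return sorted((k, v) for k, v in properties if k.startswith(prefix))


# A few entries copied from the output above
sample = [
    ("spark.eventLog.enabled", "true"),
    ("spark.yarn.queue", "interactive"),
    ("spark.master", "yarn-client"),
    ("spark.yarn.submit.file.replication", "3"),
]
print(filter_properties(sample, "spark.yarn."))

# In the pyspark shell:
#   filter_properties(sc._conf.getAll(), "spark.yarn.")
```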

## Tuning the partitioner

The partitioner decides how the data is split into the different partitions. The default is the HashPartitioner, but in some cases you may want to use another partitioner in order to get a more balanced distribution of the data between partitions.

Apart from the HashPartitioner, Spark provides the [RangePartitioner](https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/RangePartitioner.html).

You can also implement your own partitioner.
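
In PySpark a custom partitioner is simply a function that maps a key to an integer, passed as the second argument of `partitionBy()`; Spark then takes that integer modulo the number of partitions. The sketch below simulates this in plain Python to show how hash partitioning can be unbalanced. It assumes integer keys (for which the default hash is the key itself), and `decade_partition` and `partition_sizes` are hypothetical names.

```python
from collections import Counter


def decade_partition(key):
    """Hypothetical partition function for keys that are multiples of 10."""
    return key // 10


def partition_sizes(keys, num_partitions, partition_func):
    """Count how many keys land in each partition."""
    counts = Counter(partition_func(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


keys = range(0, 1000, 10)   # 100 keys: 0, 10, 20, ...

# Hash partitioning: every key is a multiple of 10, so all end up in partition 0
print(partition_sizes(keys, 10, hash))
# Custom function: 10 keys in each of the 10 partitions
print(partition_sizes(keys, 10, decade_partition))

# With a pair RDD in the pyspark shell:
#   pairs = sc.parallelize([(k, None) for k in keys])
#   balanced = pairs.partitionBy(10, decade_partition)
```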


## Exercises
* Exercise: Optimize the KMeans exercise by making use of RDD caching and broadcast variables.
* Exercise: Explore the monitoring information from application_1498464222862_3294 and answer the following questions:
  * How many jobs were run by Spark? 846
  * Explore the executors tab: How many executors were used? 54 + driver
  * Explore the Storage tab: How much data was cached? 3.7GB What was the fraction of the RDD cached in memory? 100% How many partitions? 71
  * Explore the Environment tab: What was the executor.memoryOverhead value? 384MB
  * In the jobs tab, explore the global event timeline
  * What was the typical duration of each job? Most less than 1 second and some around 10 seconds.
  * Look into Job 569 and explore its DAG visualization. How many stages formed the Job? 2
  * Inside this job look into Stage 1215 and open the Event Timeline:
    * What was the dominant time in each task of this stage? Computing time.
    * How many tasks were executed by each executor? 1 or 2.
    * Could we take advantage of more executors? Yes.
    * In the case of the second stage of this job (Stage 1216) how was the time distributed? In this case the computing time is reduced and we have scheduler delay time, task deserialization time, shuffle read and result serialization time taking an important amount of the total time of each task.