# Unit 7 Optimizing, Monitoring and Debugging Applications

## Contents
```
7.1. Performance considerations
  7.1.1 RDD lineage
  7.1.2 RDD persistence
  7.1.3 Broadcast variables
  7.1.4 Accumulators
  7.1.5 Repartition and coalesce
  
7.2. Monitoring and Debugging
  7.2.1. HUE/YARN UI
  7.2.2. Spark UI and Spark History
    7.2.2.1. Spark Event Timeline
    7.2.2.2. Spark DAG Visualization
    7.2.2.3. How to interpret the DAG
  7.2.3. How to see the logs of a job
  7.2.4. How to change the log level
  7.2.5. Understanding how to configure memory limits
  7.2.6. How to tune the partitioner
```

# Performance considerations

### RDD lineage
Each time you apply a transformation to an RDD, Spark does not execute it immediately. Instead, it records the transformation in what is called the RDD lineage.

The lineage keeps track of all the transformations that have to be applied to produce the final RDD, and in which order: from reading the data from HDFS through each subsequent transformation.

The lineage is what makes RDDs fault tolerant: if something goes wrong and an executor is lost, Spark can re-compute the lost partitions of the RDD from the original data in HDFS.
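The idea of a lineage can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not how Spark is implemented: `ToyRDD`, `read_source` and `reads` are hypothetical names introduced for the illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it records transformations instead of running them."""

    def __init__(self, source, lineage=()):
        self.source = source      # callable that (re-)reads the original data
        self.lineage = lineage    # ordered tuple of (name, function) steps

    def map(self, fn):
        # Transformation: only extends the lineage, nothing is computed yet
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: only extends the lineage, nothing is computed yet
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the whole lineage starting from the original data
        data = list(self.source())
        for name, fn in self.lineage:
            if name == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


reads = []  # one entry is appended every time the source is read


def read_source():
    reads.append(1)
    return range(1, 6)


rdd = ToyRDD(read_source).map(lambda x: x * 10).filter(lambda x: x > 20)
steps = [name for name, _ in rdd.lineage]
reads_before_action = len(reads)   # still 0: nothing has been executed

first = rdd.collect()    # replays the lineage from the source
second = rdd.collect()   # replays it again, as after losing the computed data
```

Because the lineage is replayed from the source on every action, the same result can always be rebuilt, at the cost of reading and transforming the data again. In a real PySpark session, `rdd.toDebugString()` prints the lineage that Spark has recorded for an RDD.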

### RDD persistence

If you are going to reuse an RDD, it is very useful to persist it so that Spark does not have to re-compute it each time you operate on it. By default, `cache()` persists an RDD in memory:

    rdd.cache()

The same can be done for a DataFrame:

    df.cache()

Similarly, when you no longer need it, you can unpersist it:

    rdd.unpersist()

It is also possible to specify the storage level. Only one storage level can be assigned to an RDD: to change it, call `rdd.unpersist()` first.

    from pyspark import StorageLevel

    # The following is equivalent to rdd.cache()
    rdd.persist(StorageLevel.MEMORY_ONLY)

    # Alternative: use disk instead of memory
    # rdd.persist(StorageLevel.DISK_ONLY)

    # Alternative: use disk for what does not fit in memory (spilling)
    # rdd.persist(StorageLevel.MEMORY_AND_DISK)

If you want to know more about why persistence is important and the different persistence options you can read:
* [RDD persistence](https://spark.apache.org/docs/2.4.0/programming-guide.html#rdd-persistence)

### Broadcast variables
If you have a **read-only** variable that must be shared between all the tasks, you can share it more efficiently using a broadcast variable:

    # You create a broadcast variable in the driver
    centroidsBC = sc.broadcast([1, 2, 3])

    # And then you can read it in the different tasks with
    centroidsBC.value

"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost." Source: [Spark Programming Guide](https://spark.apache.org/docs/2.4.0/programming-guide.html#broadcast-variables)

### Accumulators

Accumulators are variables that are **write-only** for the tasks (only the driver can read them). They can be used to implement counters (as in MapReduce) or sums.

    # Integer accumulator
    events = sc.accumulator(0)
    # Float accumulator
    amount = sc.accumulator(0.0)

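Tasks running on the executors can only add to the accumulator. For updates performed inside an action, Spark guarantees that each task's update is applied only once, even if the task is restarted; updates performed inside a transformation may be applied more than once if a task or stage is re-executed.

    # On the executors, inside a function passed to an action
    rdd.foreach(lambda x: events.add(1))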

Only the driver can access the value:

    # Only works in the driver
    events.value

For more information: [Spark Programming Guide](https://spark.apache.org/docs/1.6.1/programming-guide.html#accumulators)

### Repartition and Coalesce

You can change the number of partitions of an RDD with `repartition()`, which always performs a full shuffle. It returns a new RDD, so keep the result:

    rdd = rdd.repartition(10)

You can also reduce the number of partitions. This is done more efficiently using `coalesce()`:

    rdd = rdd.coalesce(4)

**coalesce()** is an optimized version of repartition() that avoids a full shuffle of the data, but only when you are decreasing the number of RDD partitions.
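The trade-off between the two operations can be sketched with a toy model in plain Python, where an RDD is represented as a list of partitions. This is a simplification under stated assumptions: `toy_coalesce` and `toy_repartition` are hypothetical helpers, and Spark's real grouping of partitions also takes data locality into account.

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into n groups; individual records are not redistributed."""
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index % n].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Full shuffle: every record is dealt out again, round-robin, over n partitions."""
    records = [record for part in partitions for record in part]
    shuffled = [[] for _ in range(n)]
    for index, record in enumerate(records):
        shuffled[index % n].append(record)
    return shuffled


# Four partitions with very different sizes
partitions = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

coalesced_sizes = [len(p) for p in toy_coalesce(partitions, 2)]
repartitioned_sizes = [len(p) for p in toy_repartition(partitions, 2)]

print(coalesced_sizes)       # sizes after merging whole partitions
print(repartitioned_sizes)   # sizes after redistributing every record
```

In this model, merging whole partitions moves less data but can leave the result unbalanced, while the full shuffle evens out the sizes. In PySpark you can inspect the real values with `rdd.getNumPartitions()` and `rdd.glom().map(len).collect()`.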

# Monitoring and Debugging

## Big Data WebUI
To see the status of the cluster you can connect to the Big Data Web UI and, from there, open the HUE web interface.

This interface will allow you to monitor your applications from a graphical interface and to access the Spark UI information.


## Spark UI and Spark History
From HUE, by looking at the Properties tab and following the `trackingURL` link, you can access the Spark UI of the running application, or the Spark History server if the application has already finished.

### Understanding your Apache Spark Application Through Visualization
A Spark application is composed of:
* jobs
* stages
* tasks

#### Spark Event Timeline
The timeline view is available on three levels: across all jobs, within one job, and within one stage.
![Event Timeline](https://bigdata.cesga.es/img/spark-ui-jobs.png)

We can get more details about a specific job, for example Job 0:
![Event Timeline Job](https://bigdata.cesga.es/img/spark-ui-details_for_job_0.png)

And finally we can drill down further by selecting a specific stage:
![Event Timeline Stage](https://bigdata.cesga.es/img/spark-ui-details_for_stage_0.png)

#### Execution DAG
A job is associated with a chain of RDD dependencies organized in a directed acyclic graph (DAG) that we can also visualize in the Spark UI:

![Execution DAG](https://bigdata.cesga.es/img/spark-ui-execution_dag.png)

The greyed-out stage indicates that its data was fetched from cache, so that stage did not need to be re-executed; for that reason it appears as **skipped**. Whenever a shuffle is involved, Spark automatically persists the intermediate data it generates.


More information: [Understanding your Apache Spark Application Through Visualization](https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html)

## How to see the logs of a job
YARN aggregates the logs produced by the job, and you can retrieve them with:

    yarn logs -applicationId application_1489083567361_0070 | less

## Configuring the log level
For debugging, it can be useful to modify the log level.

Spark uses log4j for logging, so the most versatile way to do it is to change the `log4j.properties` file.

In some cases it can be useful to set the log level from the SparkContext:

    sc.setLogLevel("INFO")
    sc.setLogLevel("WARN")

This allows you to tune the amount of information shown in order to debug your application.

## Understanding how to configure memory limits
To increase performance, Spark can use off-heap memory through [Project Tungsten](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html).

![Container memory layout](http://bigdata.cesga.gal/files/spark_memory_limits.png)

In case you are facing a **memoryOverhead issue**:
* The first thing to do is to increase `spark.yarn.executor.memoryOverhead` (named `spark.executor.memoryOverhead` since Spark 2.3). This covers the off-heap memory of each executor, such as the memory used by Tungsten; the recommended value is 10% of the executor memory.
* The second thing to check is whether your data is balanced across the partitions.

When using Python, decreasing the value of `spark.executor.memory` can help. `spark.executor.memory` sets the size of the JVM heap only, while the Python worker processes run outside the JVM, so all the memory they use is off-heap. By decreasing this value you reserve less space for the heap and leave more space for the off-heap operations, which is where Python works.

Sources and further details:
* [Memory Overhead](https://gsamaras.wordpress.com/code/memoryoverhead-issue-in-spark/)
* [Understanding memory management in spark for fun and profit](https://www.slideshare.net/SparkSummit/understanding-memory-management-in-spark-for-fun-and-profit)
* [Project Tungsten: Bringing Apache Spark Closer to Bare Metal](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)
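The relation between the heap and the overhead described above can be worked out with a few lines of arithmetic. This is a sketch that assumes the defaults documented for Spark on YARN (overhead of 10% of the executor memory, with a minimum of 384 MiB); the function names are hypothetical, and the rounding that YARN applies to container requests is not modeled.

```python
MIN_OVERHEAD_MIB = 384     # assumed minimum overhead, in MiB
OVERHEAD_FRACTION = 0.10   # assumed default fraction of the executor memory


def default_overhead_mib(executor_memory_mib):
    """Overhead used when no explicit memoryOverhead value is configured."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FRACTION))


def container_request_mib(executor_memory_mib, overhead_mib=None):
    """Memory requested per executor container: JVM heap plus off-heap overhead."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


small = container_request_mib(2048)   # the 384 MiB minimum applies
large = container_request_mib(8192)   # 10% of the heap applies

# Python-heavy job: a smaller heap and a larger explicit overhead
python_heavy = container_request_mib(4096, overhead_mib=4096)

print(small, large, python_heavy)
```

The last case shows the idea behind lowering `spark.executor.memory` for Python jobs: for a similar total container size, half of the memory is now available outside the JVM heap.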

## Verifying Spark Configuration

When debugging an application it can be useful to verify the values of all the Spark Properties.

There are two options to do it:
* Connecting to the Spark UI and checking the Environment tab
* Programmatically, from the SparkContext

For example, from the notebook:

    sc.getConf().getAll()

The output looks like this (truncated):

    [(u'spark.eventLog.enabled', u'true'),
     (u'spark.yarn.jars',
      u'local:/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/spark/hive/*'),
     (u'spark.yarn.appMasterEnv.MKL_NUM_THREADS', u'1'),
     (u'spark.sql.queryExecutionListeners',
      u'com.cloudera.spark.lineage.NavigatorQueryListener'),
     (u'spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
      u'c14-18.bd.cluster.cesga.es,c14-19.bd.cluster.cesga.es'),
     (u'spark.ui.killEnabled', u'true'),
     (u'spark.lineage.log.dir', u'/var/log/spark/lineage'),
     (u'spark.eventLog.dir', u'hdfs://nameservice1/user/spark/applicationHistory'),
     (u'spark.dynamicAllocation.executorIdleTimeout', u'60'),
     (u'spark.serializer', u'org.apache.spark.serializer.KryoSerializer'),
     (u'spark.io.encryption.enabled', u'false'),
     (u'spark.authenticate', u'false'),
     (u'spark.serializer.objectStreamReset', u'100'),
     (u'spark.org.apache.hadoop.yarn.server.webprox

## Tuning the partitioner

The partitioner is the component that decides how to split the data into the different partitions. The default is to use the HashPartitioner, but in some cases you may want to use another partitioner in order to produce a more balanced distribution of the data between partitions.

Apart from the HashPartitioner Spark provides the [RangePartitioner](https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/RangePartitioner.html).

You can also implement your own partitioner.
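Why the choice of partitioner matters can be sketched in plain Python by counting how many keys land in each partition. This is a toy model: `partition_sizes` and `range_partition` are hypothetical helpers, and it relies on the fact that in CPython the hash of a small integer is the integer itself.

```python
from collections import Counter


def partition_sizes(keys, num_partitions, partition_func):
    """Count how many keys land in each partition for a given partition function."""
    counts = Counter(partition_func(key) % num_partitions for key in keys)
    return [counts[i] for i in range(num_partitions)]


def range_partition(key):
    """Hypothetical range-style partition function: one partition per block of 20 key values."""
    return key // 20


# Integer keys that are all multiples of 4: 0, 4, 8, ..., 76
keys = list(range(0, 80, 4))

hash_sizes = partition_sizes(keys, 4, hash)              # hash partitioning
range_sizes = partition_sizes(keys, 4, range_partition)  # range-style partitioning

print(hash_sizes)    # keys per partition with hash partitioning
print(range_sizes)   # keys per partition with the range-style function
```

With these keys, hash partitioning sends every key to the same partition, while the range-style function spreads them evenly. In PySpark, a function like `range_partition` can be passed as the second argument of `rdd.partitionBy(numPartitions, partitionFunc)` on a key-value RDD.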


## Exercises
* Exercise: Optimize the KMeans exercise by making use of RDD caching and broadcast variables.
* Exercise: Explore the monitoring information for your optimized KMeans notebook, comparing it with the information for the non-optimized version, and answer the following questions:

  * Explore the Jobs tab:
    * How many jobs were run by Spark?
    * What was the typical duration of each job? You can sort the jobs by duration by clicking on the "Duration" column label
    * Explore the global event timeline
    * Explore the job with Job Id 7:
      * Explore the Event Timeline
      * Explore the DAG: How many stages were run?
      
  * Explore the Stages tab:
    * What was the total number of stages for all jobs?
    * Explore Stage 12:
      * What was the 75th percentile duration of the tasks? 
      * What was the Input Size?
      * Expand the Event Timeline: 
        * How was the time distributed?
        * Compare with Stage 0: In this case the percentage of computing time is reduced, compared to the scheduler delay and task deserialization parts.

  * Explore the Storage tab (the notebook must still be running; the tab is blank for finished applications):
    * How much data is cached?
    * How many partitions are cached?
    * What is the fraction of the RDD cached in memory?

  * Explore the Environment tab: 
    * Was dynamic resource allocation enabled? Look at the value of the spark.dynamicAllocation.enabled property.
    
  * Explore the Executors tab: 
    * How many executors were used? The driver also appears in the list.
    * On which cluster node did executor 1 run?
    * Notice that, when using dynamic allocation, the executors not being used will be automatically shut down
    * Could we take advantage of more executors? Check if there are executors that did not run any task.
    