# Spark Optimization for Advanced Users

<a id="contents"></a>

<a id='rdd'></a>

[<< Contents](#contents)

<a id='partitions'></a>

<a id="joins"></a>

<a id="dataserialisation"></a>

<a id="udf"></a>

<a id="data_skew"></a>

<a id="cache"></a>

<a id="storage"></a>

<a id="jvm"></a>

----------------------------------------------------------------------------------------------------------------------------------------------------

## Java Garbage Collection

To read more on the topic, see [this overview of garbage collection in Java](https://javarevisited.blogspot.com/2011/04/garbage-collection-in-java.html).

<img src="images/JavaGarbageCollectionheap.png" />

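CMS and G1 are alternative collectors, and the JVM refuses to start when both are selected, so pick one of the two.

With G1: `spark.executor.extraJavaOptions = -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20`

With CMS: `spark.executor.extraJavaOptions = -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled`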

  - Analyse the GC logs for memory usage; most often the problem lies in how the data is partitioned.
  - Spark tuning guide: https://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
  - EMR : https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html
  - Tool to analyse the logs: https://gceasy.io/
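The option string above is easy to get wrong by hand, so here is a minimal Python sketch of one way to assemble and sanity-check it. The helper names (`executor_gc_options`, `selected_collectors`) are hypothetical and not part of Spark. The logging flags are the Java 8 style ones mentioned in the Spark tuning guide, so this assumes executors run on a Java 8 JVM.

```python
# Collector-selection flags; HotSpot accepts at most one of these at a time.
COLLECTOR_FLAGS = {
    "g1": "-XX:+UseG1GC",
    "cms": "-XX:+UseConcMarkSweepGC",
    "parallel": "-XX:+UseParallelGC",
}


def executor_gc_options(collector="g1", ihop_percent=35,
                        conc_gc_threads=20, log_gc=True):
    """Build a value for spark.executor.extraJavaOptions (hypothetical helper)."""
    if collector not in COLLECTOR_FLAGS:
        raise ValueError(f"unknown collector: {collector}")
    opts = [COLLECTOR_FLAGS[collector]]
    if collector == "g1":
        # Start concurrent marking earlier and set the marking thread count.
        opts.append(f"-XX:InitiatingHeapOccupancyPercent={ihop_percent}")
        opts.append(f"-XX:ConcGCThreads={conc_gc_threads}")
    elif collector == "cms":
        opts.append("-XX:+CMSClassUnloadingEnabled")
    if log_gc:
        # Java 8 style GC logging flags (assumption: Java 8 executors).
        opts += ["-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps"]
    return " ".join(opts)


def selected_collectors(java_options):
    """Return the collectors selected in an option string, in order of appearance."""
    by_flag = {flag: name for name, flag in COLLECTOR_FLAGS.items()}
    return [by_flag[tok] for tok in java_options.split() if tok in by_flag]


if __name__ == "__main__":
    value = executor_gc_options()
    # More than one selected collector means the JVM would refuse to start.
    assert len(selected_collectors(value)) == 1
    print(f"spark.executor.extraJavaOptions={value}")
```

The resulting string can be passed with `--conf` on *spark-submit*; the GC logs then appear in each executor's stdout, which is what the tools listed above consume.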


[<<< Contents](#contents)

<a id="monitoring"></a>

----------------------------------------------------------------------------------------------------------------------------------------------------

## Monitoring Spark applications

  - Spark includes a configurable metrics system based on the [*dropwizard.metrics*](http://metrics.dropwizard.io/) library. It is set up via the Spark configuration. As we are already heavy users of **Graphite** and **Grafana**, we use the provided [Graphite sink](https://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/).

    The Graphite sink needs to be **used with caution**. This is due to the fact that, for each metric, Graphite creates a fixed-size database to store data points. These zeroed-out [*Whisper*](https://graphite.readthedocs.io/en/latest/whisper.html) files consume quite a lot of disk space.

    By default, the application id is added to the namespace, which means that every *spark-submit* run creates a new set of metrics, and with it a new set of Whisper files. Thanks to [SPARK-5847](https://issues.apache.org/jira/browse/SPARK-5847) it is now possible to **mitigate this Whisper behavior** by replacing the *spark.app.id* in the namespace with a fixed value:

  ```
  	spark.metrics.namespace=$name
  ```

  
  - https://github.com/hammerlab/grafana-spark-dashboards
  - https://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
  - https://github.com/qubole/sparklens
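To make the Graphite setup and the disk-space concern concrete, the following Python sketch builds the sink configuration as `--conf` arguments and estimates the size of one Whisper file. The property names follow the `spark.metrics.conf.*.sink.graphite.*` scheme from the Spark monitoring documentation. The host name, the retention schedule, and both helper names are assumptions made for illustration. The size formula assumes Whisper's documented on-disk layout (16-byte header, 12 bytes per archive, 12 bytes per data point).

```python
def graphite_sink_conf(host, port=2003, namespace="${spark.app.name}",
                       period_seconds=10, prefix=None):
    """Spark properties enabling the Graphite sink (hypothetical helper)."""
    sink = "spark.metrics.conf.*.sink.graphite"
    conf = {
        f"{sink}.class": "org.apache.spark.metrics.sink.GraphiteSink",
        f"{sink}.host": host,
        f"{sink}.port": str(port),
        f"{sink}.period": str(period_seconds),
        f"{sink}.unit": "seconds",
        # A stable namespace keeps metric names identical across runs,
        # so Graphite reuses Whisper files instead of creating new ones.
        "spark.metrics.namespace": namespace,
    }
    if prefix:
        conf[f"{sink}.prefix"] = prefix
    return conf


def as_submit_args(conf):
    """Flatten a property dict into spark-submit style arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def whisper_file_bytes(retentions):
    """Size of one Whisper file for (seconds_per_point, retention_seconds) archives."""
    points = sum(retention // step for step, retention in retentions)
    return 16 + 12 * len(retentions) + 12 * points


if __name__ == "__main__":
    # "graphite.example.internal" and "nightly_aggregation" are placeholders.
    conf = graphite_sink_conf("graphite.example.internal",
                              namespace="nightly_aggregation")
    print(" ".join(as_submit_args(conf)))

    # Assumed retention schedule: 10 s points for 6 h, 1 min points for 7 days.
    per_metric = whisper_file_bytes([(10, 6 * 3600), (60, 7 * 86400)])
    print(f"{per_metric} bytes per metric, allocated up front")
```

Multiplying the per-metric size by the number of metrics an application emits, and again by the number of runs when the application id is part of the namespace, shows why a fixed namespace matters.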


[<<< Contents](#contents)

<a id="misc"></a>

----------------------------------------------------------------------------------------------------------------------------------------------------

## Misc 

* Try alternatives to AWS EMR that run Spark on plain EC2:

  - https://github.com/nchammas/flintrock
  - https://heather.miller.am/blog/launching-a-spark-cluster-part-1.html


----------------------------------------------------------------------------------------------------------------------------------------------------

## Good reads

- https://medium.com/teads-engineering/lessons-learned-while-optimizing-spark-aggregation-jobs-f93107f7867f
- https://www.slideshare.net/databricks/apache-spark-coredeep-diveproper-optimization
- https://michalsenkyr.github.io/2018/01/spark-performance

