# Performance Tuning

For some workloads, it is possible to improve performance by either caching data in memory or by turning on some experimental options.

## Caching Data in Memory

Spark SQL can cache tables using an in-memory columnar format. It will scan only the required columns and will automatically tune compression to minimize memory usage and GC pressure.

For caching:

* `spark.catalog.cacheTable("<tableName>")`
* `dataFrame.cache()`

For uncaching:

* `spark.catalog.uncacheTable("<tableName>")`
* `dataFrame.unpersist()`
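
The cache and uncache calls above are usually paired, so one way to keep them together is a small context manager. This is a minimal sketch, not part of Spark itself: the helper name `cached_table` is hypothetical, and it assumes `spark` is an active `SparkSession` whose catalog already contains the named table.

```python
from contextlib import contextmanager


@contextmanager
def cached_table(spark, table_name):
    """Cache `table_name` for the duration of a `with` block, then uncache it.

    `cached_table` is a hypothetical helper; it only wraps the
    `spark.catalog.cacheTable` / `spark.catalog.uncacheTable` calls listed above.
    """
    spark.catalog.cacheTable(table_name)
    try:
        # Hand the table back as a DataFrame so the caller can query it.
        yield spark.table(table_name)
    finally:
        # Always release the cached data, even if the block raised.
        spark.catalog.uncacheTable(table_name)


# Usage, assuming an active SparkSession `spark` and a registered table "src":
#
# with cached_table(spark, "src") as df:
#     df.count()                          # an action that reads the table
#     df.select("key").distinct().count() # later queries can reuse the cache
```

The `finally` clause is the point of the design: cached tables hold executor memory until they are explicitly uncached, so tying the release to the end of the block avoids leaving stale caches behind.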

Configuration can be done with the `setConf` method on SparkSession (exposed as `spark.conf.set(key, value)` in PySpark) or by running `SET key=value` commands in SQL.

<u>Configuration Options</u>

Note: every option below is prefixed with `spark.sql.`, for example `spark.sql.inMemoryColumnarStorage.compressed`.

* `inMemoryColumnarStorage.compressed` -- default `true` -- when enabled, Spark SQL automatically selects a compression codec for each column
* `inMemoryColumnarStorage.batchSize` -- default `10000` -- controls the batch size for columnar caching. Larger batch sizes improve memory utilization but risk OOMs when caching data
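
To make the `spark.sql` prefix rule concrete, the sketch below expands the short option names into full keys and renders them as `SET key=value` statements. The helper names `full_key` and `set_statements` are hypothetical, and the values shown are simply the defaults listed above.

```python
SPARK_SQL_PREFIX = "spark.sql."


def full_key(short_name):
    """Expand a short option name into its full `spark.sql.*` key."""
    if short_name.startswith(SPARK_SQL_PREFIX):
        return short_name
    return SPARK_SQL_PREFIX + short_name


def set_statements(options):
    """Render a dict of options as SQL `SET key=value` statements."""
    return [f"SET {full_key(name)}={value}" for name, value in options.items()]


# The two caching options from the list above, at their default values.
cache_options = {
    "inMemoryColumnarStorage.compressed": "true",
    "inMemoryColumnarStorage.batchSize": 10000,
}

statements = set_statements(cache_options)

# With an active SparkSession `spark` (assumed, not created here), either of
# these would apply the settings:
#
# for stmt in statements:
#     spark.sql(stmt)
#
# for name, value in cache_options.items():
#     spark.conf.set(full_key(name), value)
```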

## Other Configuration Options

Find the other configuration options [here](https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options).

## Join Strategy Hints for SQL Queries

Join strategy hints instruct Spark to use the hinted strategy on each specified relation when joining it with another relation. Spark prioritizes the hinted strategy even if the size of the hinted table suggested by the statistics is above the configured threshold (for `BROADCAST`, `spark.sql.autoBroadcastJoinThreshold`).

__Join Methods__

* `DataFrame.join(<df>, <on_col>)` -- join two DataFrames together
* `<df>.hint(<join_strategy>)` -- specify the type of join strategy to apply to that side of the join

__Join Strategies__

* `BROADCAST`
* `MERGE`
* `SHUFFLE_HASH`
* `SHUFFLE_REPLICATE_NL`

When different hints are specified on the two sides of a join, Spark prioritizes them in this order: `BROADCAST` > `MERGE` > `SHUFFLE_HASH` > `SHUFFLE_REPLICATE_NL`.

There is no guarantee that Spark will choose the hinted join strategy, since a given strategy might not support all join types.
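
The prioritization order can be illustrated with a toy resolver. This is only a model of the documented ordering, not Spark's planner code: the function name `preferred_hint` is hypothetical, and a real query may still end up with a different strategy when the hinted one does not support the join type.

```python
# Documented priority, highest first.
HINT_PRIORITY = ["BROADCAST", "MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL"]


def preferred_hint(left_hint, right_hint):
    """Return the hint that wins when each side of a join may carry a hint.

    Either argument may be None, meaning that side has no hint.
    """
    candidates = [h.upper() for h in (left_hint, right_hint) if h is not None]
    unknown = [h for h in candidates if h not in HINT_PRIORITY]
    if unknown:
        raise ValueError(f"unknown join strategy hint(s): {unknown}")
    if not candidates:
        return None  # no hints: the choice is left entirely to the optimizer
    # A lower index in HINT_PRIORITY means a higher priority.
    return min(candidates, key=HINT_PRIORITY.index)


print(preferred_hint("merge", "broadcast"))
print(preferred_hint("shuffle_replicate_nl", "shuffle_hash"))
```

Here the first call resolves to `BROADCAST` and the second to `SHUFFLE_HASH`, matching the order given above.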

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Join Strategies").getOrCreate()

# Hint that the "records" side should be broadcast when joining on "key".
spark.table("src").join(spark.table("records").hint("broadcast"), "key").show()
```

## Adaptive Query Execution

Adaptive Query Execution (AQE) is an optimization technique that uses runtime statistics to choose the most efficient query execution plan. It is enabled by default since Apache Spark 3.2.0.

Three major features in AQE: 

1. Coalescing post-shuffle partitions
2. Converting sort-merge join to broadcast join 
3. Skew join optimization

* `spark.sql.adaptive.enabled` -- `true`/`false` -- turns AQE on or off
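
A minimal sketch of toggling this flag at runtime follows. The helper name `set_aqe` is hypothetical; it assumes `spark` is an active `SparkSession` and relies only on `spark.conf.get` and `spark.conf.set`.

```python
AQE_KEY = "spark.sql.adaptive.enabled"


def set_aqe(spark, enabled):
    """Turn AQE on or off and return the previous setting as a string."""
    previous = spark.conf.get(AQE_KEY)
    # Spark configuration values are strings, so pass "true"/"false".
    spark.conf.set(AQE_KEY, "true" if enabled else "false")
    return previous


# Usage, assuming an active SparkSession `spark`:
#
# previous = set_aqe(spark, False)   # e.g. to compare plans without AQE
# spark.table("src").join(spark.table("records"), "key").explain()
# spark.conf.set(AQE_KEY, previous)  # restore the earlier setting
```

Returning the previous value makes it easy to restore the session afterwards, which is useful when comparing query plans with and without AQE.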