# Shuffle Operation and Shuffle Service


## Shuffle Operation

The primary goal of a shuffle operation is to re-distribute a dataset across partitions.
Re-distribution may be needed in order to:

#### Increase or Decrease the number of data partitions
Since a data partition represents the quantum of data processed together by a single Spark Task, re-distribution helps in these situations:
- there are too few data partitions to make full use of the available resources
- individual data partitions are too heavy to be computed reliably without memory overruns
- there are so many data partitions that task scheduling overhead becomes the bottleneck in the overall processing time
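
The effect of changing the partition count can be sketched in plain Python. This is only an illustration: the `repartition` helper below is hypothetical and uses a simple round-robin assignment, which is a simplification of what Spark does. In PySpark the corresponding calls are `df.repartition(n)` and, for decreasing the count without a full shuffle, `df.coalesce(n)`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def repartition(partitions: Sequence[Sequence[T]], num_partitions: int) -> List[List[T]]:
    """Redistribute all records round-robin into num_partitions new partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be positive")
    out: List[List[T]] = [[] for _ in range(num_partitions)]
    position = 0
    for partition in partitions:
        for record in partition:
            out[position % num_partitions].append(record)
            position += 1
    return out


# Two heavy partitions of 6 records each are split into 4 lighter ones,
# so that 4 tasks can work in parallel.
heavy = [list(range(0, 6)), list(range(6, 12))]
lighter = repartition(heavy, 4)
sizes_after_increase = [len(p) for p in lighter]

# Ten tiny partitions are collapsed into 2 to reduce scheduling overhead.
tiny = [[i] for i in range(10)]
fewer = repartition(tiny, 2)
sizes_after_decrease = [len(p) for p in fewer]

print(sizes_after_increase, sizes_after_decrease)
```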

#### Perform Aggregation/Join on a data collection
To perform an aggregation or a join on data collection(s), all data records that share an aggregation key or a join key must reside in a single data partition. If this condition is not met, data re-distribution is triggered.
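
A small plain-Python sketch of why co-location matters for aggregation. The records, the key names and the `partition_for` helper are made up for illustration, and `zlib.crc32` is only a deterministic stand-in for a real hash function. After the records are routed by key, every shuffled partition can be aggregated on its own, because no key is split across partitions.

```python
import zlib
from collections import Counter, defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Hypothetical routing rule: deterministic hash of the key, modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Before the shuffle, records with the same key are spread over both partitions.
input_partitions = [
    [("berlin", 1), ("oslo", 1), ("berlin", 1)],
    [("oslo", 1), ("berlin", 1), ("rome", 1)],
]
num_shuffle_partitions = 2

# Shuffle: route every record to the partition chosen for its key.
shuffled = [[] for _ in range(num_shuffle_partitions)]
for partition in input_partitions:
    for key, value in partition:
        shuffled[partition_for(key, num_shuffle_partitions)].append((key, value))

# Record in which shuffled partitions each key ended up.
locations = defaultdict(set)
for index, partition in enumerate(shuffled):
    for key, _ in partition:
        locations[key].add(index)

# Each shuffled partition is aggregated independently of the others.
totals = {}
for partition in shuffled:
    local = Counter()
    for key, value in partition:
        local[key] += value
    totals.update(local)

print(totals)
```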

![repartition](images/repartition.png)


#### Shuffle Partitions
**The number of shuffle partitions** specifies the number of output partitions after the shuffle is executed on a data collection.

**Partitioner** decides the target shuffle partition for each data record.
- Hash Partitioner picks the output partition from the hash code computed for the key of the data record.
- Range Partitioner picks the output partition by comparing the key value against the range of key values estimated for each shuffled partition.

![shuffle_partitions](images/shuffle_partitions.png)
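
The two partitioners can be sketched in plain Python. This is a minimal illustration and not Spark's implementation: `zlib.crc32` stands in for the key's hash code, and the range bounds are hypothetical values rather than bounds estimated from the data.

```python
import bisect
import zlib
from typing import List


def hash_partition(key: str, num_partitions: int) -> int:
    """Partition index derived from a hash of the key, modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def range_partition(key: int, upper_bounds: List[int]) -> int:
    """Partition index found by comparing the key with sorted upper bounds.

    len(upper_bounds) + 1 partitions exist; a key equal to a bound
    stays in the lower partition.
    """
    return bisect.bisect_left(upper_bounds, key)


# Hash partitioning: the same key always maps to the same partition.
same_target = hash_partition("station-42", 4) == hash_partition("station-42", 4)

# Range partitioning with assumed bounds:
# partition 0: key <= 10, partition 1: 10 < key <= 20, partition 2: key > 20
bounds = [10, 20]
range_targets = [range_partition(k, bounds) for k in (3, 10, 11, 20, 25)]

print(same_target, range_targets)
```

Hash partitioning spreads keys without preserving order, while range partitioning keeps neighbouring keys together, which is what a global sort needs.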

#### Shuffle Blocks

A shuffle block is a block of data that belongs to a single shuffled partition and is produced by executing the shuffle write operation (in a ShuffleMap task) on a single input partition during a shuffle write stage of a Spark application.

![shuffle_blocks](images/shuffle_blocks.png)

The unique identifier (corresponding to a shuffle block) is represented as a tuple of ShuffleId, MapId and ReduceId.
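
A sketch of the identifier as a small Python type. The `shuffle_<ShuffleId>_<MapId>_<ReduceId>` naming follows the block names Spark shows in its logs; the consecutive integer ids used here are an assumption made for illustration, and the exact meaning of MapId differs between Spark versions.

```python
from typing import List, NamedTuple


class ShuffleBlockId(NamedTuple):
    shuffle_id: int  # which shuffle of the application
    map_id: int      # which map-side (input) partition wrote the block
    reduce_id: int   # which shuffled (output) partition the block belongs to

    @property
    def name(self) -> str:
        return f"shuffle_{self.shuffle_id}_{self.map_id}_{self.reduce_id}"


def blocks_for_shuffle(shuffle_id: int, num_map_tasks: int,
                       num_shuffle_partitions: int) -> List[ShuffleBlockId]:
    """Every map task can produce one block per shuffled partition."""
    return [
        ShuffleBlockId(shuffle_id, map_id, reduce_id)
        for map_id in range(num_map_tasks)
        for reduce_id in range(num_shuffle_partitions)
    ]


# 3 input partitions shuffled into 2 output partitions: up to 3 * 2 blocks.
blocks = blocks_for_shuffle(shuffle_id=0, num_map_tasks=3, num_shuffle_partitions=2)
print(len(blocks), blocks[0].name, blocks[-1].name)
```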

#### Shuffle Write

The shuffle write operation is executed independently for each input partition that needs to be shuffled.

The shuffle writer produces an **index file** and a **data file** for each input partition to be shuffled.
- The index file contains the location inside the data file of each shuffled partition.
- The data file contains the actual shuffled data records, ordered by shuffled partition.

![shuffle_write](images/shuffle_write.png)
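
The relationship between the two files can be sketched in plain Python. This is a toy layout under stated assumptions: records are strings stored with a 4-byte length prefix, the "files" are in-memory byte strings, and the index is a list of offsets with one more entry than there are shuffled partitions. `write_shuffle_output`, `read_partition` and `first_letter_partition` are hypothetical helpers, not Spark APIs.

```python
import io
import struct
from typing import Callable, List, Sequence, Tuple


def write_shuffle_output(records: Sequence[str], num_partitions: int,
                         partition_of: Callable[[str], int]) -> Tuple[bytes, List[int]]:
    """Return (data_file, index_file) for one input partition."""
    # Group the records by their target shuffled partition.
    by_partition: List[List[str]] = [[] for _ in range(num_partitions)]
    for record in records:
        by_partition[partition_of(record)].append(record)

    data = io.BytesIO()
    offsets = [0]  # offsets[p] is where shuffled partition p starts
    for partition_records in by_partition:
        for record in partition_records:
            payload = record.encode("utf-8")
            data.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
            data.write(payload)
        offsets.append(data.tell())  # end of this partition, start of the next
    return data.getvalue(), offsets


def read_partition(data_file: bytes, index_file: List[int], partition: int) -> List[str]:
    """Use the index to read only the bytes of one shuffled partition."""
    segment = data_file[index_file[partition]:index_file[partition + 1]]
    records, pos = [], 0
    while pos < len(segment):
        (length,) = struct.unpack_from(">I", segment, pos)
        pos += 4
        records.append(segment[pos:pos + length].decode("utf-8"))
        pos += length
    return records


def first_letter_partition(record: str) -> int:
    # Assumed partitioning rule for the example: "a" -> 0, "b" -> 1, "c" -> 2.
    return "abc".index(record[0])


data_file, index_file = write_shuffle_output(["b1", "a1", "c1", "a2"], 3, first_letter_partition)
print(index_file, read_partition(data_file, index_file, 0))
```

A reader that needs shuffled partition `p` only has to look up two neighbouring offsets in the index and read that byte range from the data file.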

#### Shuffle Read

Shuffle read pulls/fetches the shuffle blocks that belong to a shuffled partition from their respective locations using the block manager module.

Finally, an iterator over the shuffled data records derived from the fetched shuffle blocks (sorted, when the shuffle requires an ordering) is returned for further use.
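
The read side can be sketched in plain Python as well. The example assumes that an ordering by key is required, that blocks are addressed by a `(map_id, reduce_id)` pair, and that all blocks are already available in a dictionary; in a real cluster they would be fetched over the network. `shuffle_read` is a hypothetical helper.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, int]
BlockKey = Tuple[int, int]  # (map_id, reduce_id)


def shuffle_read(blocks: Dict[BlockKey, List[Record]], num_map_tasks: int,
                 reduce_id: int) -> Iterator[Record]:
    """Fetch the block of every map output for one shuffled partition and merge them."""
    fetched = [sorted(blocks.get((map_id, reduce_id), []))
               for map_id in range(num_map_tasks)]
    # Merge the individually sorted blocks into one sorted stream.
    return heapq.merge(*fetched)


# Blocks written by 2 map tasks for 2 shuffled partitions (made-up records).
map_outputs = {
    (0, 0): [("oslo", 2), ("berlin", 5)],
    (1, 0): [("berlin", 1)],
    (0, 1): [("rome", 7)],
    (1, 1): [("paris", 3), ("rome", 4)],
}

reduce_0 = list(shuffle_read(map_outputs, num_map_tasks=2, reduce_id=0))
reduce_1 = list(shuffle_read(map_outputs, num_map_tasks=2, reduce_id=1))
print(reduce_0, reduce_1)
```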

#### Shuffle Spill

##### Shuffle Write Spill
Before the final index and data files are written, a buffer stores the data records (while iterating over the input partition) so that they can be sorted by target shuffled partition.

If the memory limit of this buffer is breached, its contents are first sorted and then spilled to disk in a temporary shuffle file.

After the iteration is over, the spilled files are read back and merged to produce the final shuffle index and data files.

![shuffle_write_spill](images/shuffle_write_spill.png)
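
The spill-and-merge cycle can be sketched in plain Python. Two simplifying assumptions are made: the buffer limit is a record count instead of a memory size, and the spilled "files" are kept as in-memory lists. `sort_with_spill` is a hypothetical helper, not a Spark API.

```python
import heapq
from typing import Iterable, List, Tuple

Record = Tuple[int, str]  # (target shuffled partition, payload)


def sort_with_spill(records: Iterable[Record], max_buffered: int) -> Tuple[List[Record], int]:
    """Sort records by target partition, spilling whenever the buffer is full."""
    buffer: List[Record] = []
    spills: List[List[Record]] = []  # each entry stands in for a temporary spill file
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_buffered:
            buffer.sort()          # sort the buffer contents first ...
            spills.append(buffer)  # ... then spill them
            buffer = []
    buffer.sort()
    # Merge all sorted spills and the remaining buffer into the final output.
    merged = list(heapq.merge(*spills, buffer))
    return merged, len(spills)


records = [(2, "c"), (0, "a"), (1, "b"), (0, "d"), (2, "e")]
merged, spill_count = sort_with_spill(records, max_buffered=2)
print(spill_count, merged)
```

Every spill adds extra write and read work, which is why a larger buffer, or smaller partitions, reduces the cost of the shuffle.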

##### Shuffle Read Spill
A similar process happens during the shuffle read operation.

![shuffle_read_spill](images/shuffle_read_spill.png)

## External Shuffle Service

The External Shuffle Service (ESS) is a proxy between the Executors that write shuffle blocks and the Executors that read them; it serves the shuffle blocks to the readers.

Its lifecycle is independent of a Spark Application and any of the Executors.

ESS runs on each Worker Node. Each Executor registers with the ESS and reports where it stores its shuffle files, so the ESS is able to stream those files to the reading Executors.

![shuffle_service](images/shuffle_service.png)

Benefits:

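- Reliability: if an Executor dies, its shuffle files are not lost
- Dynamic Allocation (DA): DA has traditionally required ESS, so that idle Executors can be removed without losing their shuffle files (Spark 3.0+ also offers shuffle tracking as an alternative)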
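
The reliability benefit can be illustrated with a toy model in plain Python. Everything here is hypothetical: the `ToyShuffleService` class, the directory path and the dictionary that stands in for the local disk of a Worker Node. It only shows the idea that a registered location is enough to serve blocks after the writing Executor is gone.

```python
from typing import Dict, Tuple

# Stand-in for the local disk of one Worker Node: (directory, block name) -> bytes.
Disk = Dict[Tuple[str, str], bytes]


class ToyShuffleService:
    """Lives on the Worker Node and outlives the Executors that register with it."""

    def __init__(self, disk: Disk) -> None:
        self._disk = disk
        self._executor_dirs: Dict[str, str] = {}

    def register_executor(self, executor_id: str, shuffle_dir: str) -> None:
        # The Executor reports where it stores its shuffle files.
        self._executor_dirs[executor_id] = shuffle_dir

    def fetch_block(self, executor_id: str, block_name: str) -> bytes:
        # Served from disk; the writing Executor does not take part.
        return self._disk[(self._executor_dirs[executor_id], block_name)]


disk: Disk = {}
service = ToyShuffleService(disk)

# Executor "exec-1" writes a block and registers its shuffle directory.
disk[("/tmp/exec-1", "shuffle_0_0_0")] = b"records"
service.register_executor("exec-1", "/tmp/exec-1")

# Even if "exec-1" has exited by now, a reading Executor can still get the block.
fetched = service.fetch_block("exec-1", "shuffle_0_0_0")
print(fetched)
```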

#### Configuration

*spark.shuffle.service.enabled* - whether ESS is enabled

*spark.shuffle.service.port* - port on which the ESS listens

*spark.shuffle.service.index.cache.size* - size of the cache that holds shuffle index files
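
A sketch of how these settings might be collected and turned into `spark-submit` arguments. The values are illustrative: `7337` and `100m` are believed to match the documented defaults, but check the configuration page of your Spark version. `to_submit_args` is a hypothetical helper, and the ESS itself still has to be started on every Worker Node.

```python
from typing import Dict, List

ess_settings: Dict[str, str] = {
    "spark.shuffle.service.enabled": "true",
    "spark.shuffle.service.port": "7337",
    "spark.shuffle.service.index.cache.size": "100m",
    # Dynamic Allocation is the usual reason to enable ESS.
    "spark.dynamicAllocation.enabled": "true",
}


def to_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as a list of `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


submit_args = to_submit_args(ess_settings)
print(" ".join(submit_args))
```

The same key/value pairs could also be passed one by one to `SparkSession.builder.config(key, value)` when the session is created.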

# Questions

Let's run a query and investigate the different shuffle parameters in the Spark UI.


In [None]:
import datetime
import time
from random import randint

from pyspark import StorageLevel
from pyspark.sql import functions as F, SQLContext, SparkSession, Window
from pyspark.sql.types import FloatType, StringType, StructField, StructType

spark = (SparkSession.builder
         .appName("explore-data")
         .master("spark://spark-master:7077")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "/opt/workspace/history")
         .enableHiveSupport()
         .getOrCreate()
         )

meteo_data_file = "data/meteo-data/parquet"
meteo_df = spark.read.parquet(meteo_data_file)
observation_type_file = "data/meteo-data/observation_type.csv"

# Schema of the observation type lookup file (no header row).
observation_type_schema = StructType([
    StructField('observation_type', StringType(), True),
    StructField('description', StringType(), True)
])

observation_type_df = (spark.read
                       .schema(observation_type_schema)
                       .option("header", "false")
                       .csv(observation_type_file)
                      )

stations_meta_file = "data/meteo-data/stations.csv"

# Schema of the station metadata file (no header row).
stations_schema = StructType([
    StructField('station_identifier', StringType(), True),
    StructField('latitude', FloatType(), True),
    StructField('longitude', FloatType(), True),
    StructField('height_above_sea_level', FloatType(), True),
    StructField('station_name', StringType(), True)
])

stations_df = (spark.read
               .schema(stations_schema)
               .option("header", "false")
               .csv(stations_meta_file)
              )

In [None]:
spark.sql("SET spark.sql.shuffle.partitions=100")
df2 = (meteo_df
       .where("yyyy > 2015")
       .join(stations_df, meteo_df["station_identifier"] == stations_df["station_identifier"], "inner")
      )
# cache() marks df2 for caching; count() triggers the job that materializes it.
count = df2.cache().count()
print(count)
df2.unpersist()

1. What are spill to disk and spill to memory? Why is each good or bad?
2. What do you need to do to minimize spill, and when is that needed?
3. What does the spark.sql.shuffle.partitions setting control?
4. What is the ideal value for this setting?
5. Why is the Shuffle Write significantly bigger than the Data Read?

To answer these questions, you can change the setting in the previous example and observe the result.

Bonus question: where is the shuffle stage if we remove the cache() step?

In [None]:
spark.stop()