![title](img/this-is-fine-spark.jpeg)

## 🔥 Spark fires 🔥 - more cores than partitions

In this scenario, we will demonstrate how having too few in-memory partitions can leave some of the available executor cores sitting idle.

For this experiment, we will create a small number of input file-splits to highlight the issue.

### Bootstrapping

In [None]:
import os

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession
    .builder.master("spark://spark:7077")
    # .config("spark.eventLog.enabled", "true")
    # .config("spark.eventLog.dir", "/data/tmp/spark-events")
    .appName("spark-fires-more-cores-than-partitions")
    .getOrCreate()
)

spark.version

### Let's prep our data

We are going to borrow some test data from the excellent _Spark, The Definitive Guide_ Git repo. 

In [None]:
# !mkdir -p /data/bike-data
# !wget https://raw.githubusercontent.com/udacity/data-analyst/master/projects/bike_sharing/201508_station_data.csv -P /data/bike-data
# !wget https://raw.githubusercontent.com/udacity/data-analyst/master/projects/bike_sharing/201508_trip_data.csv -P /data/bike-data

In [None]:
!ls /data/bike-data

In [None]:
input_data_path_2s = '/data/bike-data-2-splits'
input_data_path_12s = '/data/bike-data-12-splits'
output_data_path = '/data/bike-data-partitioned-out'

In [None]:
# !rm -rf /data/bike-data-2-splits
# !rm -rf /data/bike-data-12-splits

Next we will create two copies of the input data for demonstration purposes: one written as 2 file splits and one written as 12.

In [None]:
sample_size = 0.25

if not os.path.exists(input_data_path_2s):
    df = spark.read.option("header", True).csv("/data/bike-data/201508_trip_data.csv")
    df = df.sample(fraction=sample_size)
    df.repartition(2).write.format('parquet').save(input_data_path_2s)

if not os.path.exists(input_data_path_12s):
    df = spark.read.option("header", True).csv("/data/bike-data/201508_trip_data.csv")
    df = df.sample(fraction=sample_size)
    df.repartition(12).write.format('parquet').save(input_data_path_12s)
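
To double-check that the writes above produced the expected number of file splits, something like the following sketch could help. The helper name `count_part_files` is hypothetical (it is not part of the notebook), and it assumes the output directory is on a filesystem the driver can see and that Spark names its output files `part-*.parquet`. An empty partition may not produce a file, so treat the count as a sanity check rather than a guarantee.

```python
import glob
import os


def count_part_files(path: str, extension: str = ".parquet") -> int:
    """Count the part files under `path` (hypothetical helper, local filesystem only)."""
    # Spark output files start with "part-"; markers such as _SUCCESS and
    # hidden .crc checksum files do not match this pattern.
    pattern = os.path.join(path, f"part-*{extension}")
    return len(glob.glob(pattern))


# Usage in the notebook (paths are the ones defined earlier):
# count_part_files(input_data_path_2s)
# count_part_files(input_data_path_12s)
```

If the counts are not what you expect, that is a hint that the number of tasks in the read stage may differ too.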

In [None]:
# !ls -lh /data/bike-data-2-splits

In [None]:
# !ls -lh /data/bike-data-12-splits

### Now let's do some data processing

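For this scenario we simulate an expensive per-row computation by sleeping for 1 ms on every row inside `mapPartitions`, then derive a new column and write the result out. Each in-memory partition is processed by a single task, so the number of partitions caps how many cores can work at the same time.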

In [None]:
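from time import sleep

def process_partition(iterator):
    # Simulate ~1 ms of work per row; each partition is handled by a single task.
    for item in iterator:
        sleep(0.001)
        yield item

def process_data(input_path: str) -> None:
    df = spark.read.parquet(input_path)
    # The number of input partitions determines how many tasks can run in parallel here.
    mapped = df.rdd.mapPartitions(process_partition).toDF()

    # Derive a column, then overwrite any previous output.
    out_df = mapped.withColumn('someCalc', F.col('Start Terminal') - F.col('End Terminal'))
    out_df.write.mode('overwrite').parquet(output_data_path)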

In [None]:
%%time

process_data(input_data_path_2s)

### Putting the fire out  🔥🔥🔥 🚒 🚒 🚒 🧯🧯🧯

When we dig into the Spark UI at http://localhost:4040/jobs, we see the job with a single stage and 2 tasks; it takes ~90 seconds to process on my laptop. Oops! We have 6 cores available, but because of our file-splits we only end up with 2 tasks (you can see this under the stage details in the UI), so we only use a third of the available cores 😭😭😭

This can happen as a result of:
 * the number of input file-splits
 * repartitioning or shuffling, which could result in a small(er) number of in-memory partitions.
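
One way to spot this before a job runs is to compare a DataFrame's partition count with the parallelism the cluster offers. The sketch below is an illustration, not the notebook's own method: the helper names `describe_parallelism` and `ensure_min_partitions` are hypothetical, and it assumes `sparkContext.defaultParallelism` roughly reflects the total executor cores, which is typical on a standalone cluster unless `spark.default.parallelism` has been overridden.

```python
def describe_parallelism(df, spark_session) -> dict:
    """Compare a DataFrame's in-memory partitions with the default parallelism."""
    num_partitions = df.rdd.getNumPartitions()
    default_parallelism = spark_session.sparkContext.defaultParallelism
    return {
        "partitions": num_partitions,
        "default_parallelism": default_parallelism,
        # Fewer partitions than cores means some cores will sit idle.
        "under_partitioned": num_partitions < default_parallelism,
    }


def ensure_min_partitions(df, min_partitions: int):
    """Repartition only when df has fewer partitions than wanted.

    Note that repartition() triggers a full shuffle, which has its own cost.
    """
    if df.rdd.getNumPartitions() < min_partitions:
        return df.repartition(min_partitions)
    return df


# Usage in the notebook:
# df = spark.read.parquet(input_data_path_2s)
# describe_parallelism(df, spark)
# df = ensure_min_partitions(df, spark.sparkContext.defaultParallelism)
```

Whether a shuffle is cheaper than running under-partitioned depends on the workload, so it is worth measuring both.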

Let's try processing the same data off of 12 file-splits ...

In [None]:
%%time

process_data(input_data_path_12s)

### ... result  🌟🌟🌟

Bonza! With this change, I see a **3x speed-up in the run-time**: in my case it is down to ~30 secs.
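
Why 3x? A back-of-the-envelope model makes it plausible. The sketch below assumes evenly sized tasks, no scheduling overhead, and tasks running in "waves" of at most one task per core. The function `estimated_runtime` is hypothetical, and the 180 seconds of total work is an assumption derived from the ~90 seconds observed with 2 tasks.

```python
import math


def estimated_runtime(total_work_secs: float, num_partitions: int, total_cores: int) -> float:
    """Rough wall-clock estimate for evenly sized tasks run in waves."""
    task_secs = total_work_secs / num_partitions          # work per task
    waves = math.ceil(num_partitions / total_cores)       # rounds of tasks needed
    return task_secs * waves


total_work = 180.0  # assumed: 2 tasks x ~90 s each
two_splits = estimated_runtime(total_work, num_partitions=2, total_cores=6)
twelve_splits = estimated_runtime(total_work, num_partitions=12, total_cores=6)

print(two_splits, twelve_splits, two_splits / twelve_splits)  # 90.0 30.0 3.0
```

In this idealised model 6 partitions would do just as well as 12; having a few more partitions than cores mostly helps when tasks turn out to be uneven.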

Note, there are a number of Spark configuration settings which can affect the number of in-memory partitions we end up with:
* spark.default.parallelism
* spark.sql.shuffle.partitions
* spark.sql.files.maxPartitionBytes

If you are not familiar with these settings it is [worth reading up and understanding how they can impact your jobs](https://spark.apache.org/docs/latest/configuration.html).
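
As a rough sketch of how such settings might be applied, the snippet below sets the two SQL options at runtime. The values are illustrative assumptions, not recommendations, and `apply_partition_settings` is a hypothetical helper. `spark.sql.files.maxPartitionBytes` is the option that applies to DataFrame file reads such as `spark.read.parquet`. `spark.default.parallelism` is left out on purpose, since it is normally supplied when the session is built (for example with `.config(...)` on the builder) rather than changed afterwards.

```python
# Illustrative values only; tune them for your own cluster and data.
PARTITION_SETTINGS = {
    # Partitions produced by DataFrame shuffles (Spark's default is 200).
    "spark.sql.shuffle.partitions": "12",
    # Upper bound on bytes packed into one partition when reading files
    # (Spark's default is 128 MB); here 32 MB.
    "spark.sql.files.maxPartitionBytes": str(32 * 1024 * 1024),
}


def apply_partition_settings(spark_session, settings=None) -> None:
    """Apply runtime-settable SQL options to an existing session."""
    for key, value in (settings or PARTITION_SETTINGS).items():
        spark_session.conf.set(key, value)


# Usage in the notebook:
# apply_partition_settings(spark)
# spark.conf.get("spark.sql.shuffle.partitions")
```

A smaller `maxPartitionBytes` only yields more read partitions when the input files are large enough to be split, so check the task count in the UI after changing it.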

In [None]:
# spark.stop()