![title](img/this-is-fine-spark.jpeg)

## 🔥 Spark fires 🔥 - show() on uncached dataframe

In this scenario we call `df.show()` on a dataframe which has not been cached. Because Spark evaluates lazily, the work done for `show()` is thrown away and the later write recomputes that part of our DAG, slowing our job down.
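To make the recomputation concrete, here is a toy pure-Python model of lazy evaluation. `LazyFrame` is a hypothetical stand-in written for this illustration, not a Spark API, and it simplifies things: real Spark caches per partition, and `show()` may only evaluate as many partitions as it needs.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (hypothetical, not Spark)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the rows
        self._use_cache = False   # set by cache()
        self._cached = None       # filled by the first action when caching is on
        self.runs = 0             # how many times the lineage was executed

    def cache(self):
        # Like Spark, this only marks the frame; nothing is computed yet.
        self._use_cache = True
        return self

    def _materialise(self):
        if self._use_cache and self._cached is not None:
            return self._cached   # reuse the earlier result
        self.runs += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    def show(self, n=20):
        # An "action": forces the computation, then prints the first n rows.
        for row in self._materialise()[:n]:
            print(row)

    def write(self):
        # Another "action": forces the computation again unless cached.
        return list(self._materialise())


uncached = LazyFrame(lambda: [i * 2 for i in range(100)])
uncached.show(3)
uncached.write()

cached = LazyFrame(lambda: [i * 2 for i in range(100)]).cache()
cached.show(3)
cached.write()

print(uncached.runs, cached.runs)  # 2 1
```

Each action on the uncached frame re-runs the whole computation, while the cached frame computes once and reuses the result.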

### Bootstrap

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession
    .builder.master("spark://spark:7077")
    # .config("spark.eventLog.enabled", "true")
    # .config("spark.eventLog.dir", "/data/tmp/spark-events")
    .appName("spark-fires-show-on-uncached-df")
    .getOrCreate()
)

spark.version

In [None]:
%%time 

df = spark.range(0, 7200 * 6).repartition(12).cache()
df2 = df.withColumn('created_at', F.current_timestamp()).repartition(df.rdd.getNumPartitions() * 2).cache()
df2.show(10, truncate=False)

In [None]:
%%time

from time import sleep
import os

def process_partition(iterator):
    for item in iterator:
        sleep(0.01)
        yield item

mapped = df.rdd.mapPartitions(process_partition).toDF()  #.cache()
joined = mapped.join(df2, 'id') #.cache()  # -- change 1
joined.show(truncate=False)  # -- change 2 - just remove the show call altogether, and make sure we are not caching in `-- change 1`
joined = joined.withColumn('one_up', F.col('id') + 1)
joined.write.format("parquet").mode('overwrite').save("/data/range_nums")

In [None]:
spark.catalog.clearCache() 

In [None]:
# spark.stop()

### Putting the fire out  🔥🔥🔥 🚒 🚒 🚒 🧯🧯🧯

On the first run through I see an execution time of ~ 2 min 15 secs. If we look at the Spark UI on http://localhost:4040/ and dig through into the jobs, we see the following:
* http://localhost:4040/jobs/job/?id=7 - our show() call at ~ 45s
* http://localhost:4040/jobs/job/?id=9 - our save() call at ~ 1m 30s 

So that's fine, right? Well, no. The part of the DAG executed for our `show()` call is recomputed by `save()`, costing us time. Let's look at our options:
1. Firstly, we can leave the `show()` call in there but cache the target dataframe, so that work materialised during the show call is saved and reused, rather than being recomputed from scratch. Try uncommenting `-- change 1` for this. Boom, our runtime drops to ~ 1m 30s and we see a **> 30% drop in runtime**.
1. Next, we can look at the show call itself. It is rarely a good idea to leave debug code in your app code. Let's remove it (`-- change 2`) along with the dataframe caching (`-- change 1`). In this case, with our noddy test scenario, the runtime reduction is about the same as for our last change. However, in a real-world example our data would probably be much larger, so we would take a bigger performance hit for caching it, and may not even be able to fit it into memory.
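If you do go with the caching option, it can help to release the cache as soon as the dataframe is no longer needed, since cached data competes for executor memory. A minimal sketch, assuming a small helper called `cached_df` (a hypothetical name, not part of PySpark) that wraps the standard `DataFrame.cache()` and `DataFrame.unpersist()` calls:

```python
from contextlib import contextmanager


@contextmanager
def cached_df(df):
    """Cache `df` for the duration of the block, then release it.

    `cached_df` is a hypothetical helper for illustration; it relies only on
    the dataframe having `cache()` and `unpersist()` methods.
    """
    df = df.cache()       # marks the dataframe for caching (still lazy)
    try:
        yield df
    finally:
        df.unpersist()    # free the cached blocks once we are done


# Usage with the names from the cell above (needs a running Spark session):
#
# with cached_df(mapped.join(df2, 'id')) as joined:
#     joined.show(truncate=False)
#     joined.withColumn('one_up', F.col('id') + 1) \
#         .write.format("parquet").mode('overwrite').save("/data/range_nums")
```

The `finally` clause means the cache is released even if the write fails part-way through.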

In some ways you could argue this is a 'noobie' mistake, but I have seen multiple examples of this in production jobs: some of these had multiple _show_ calls which accounted for > 60% of their runtime 😮.
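One way to stop stray _show_ calls reaching production is to route them through a guard that is off by default. This is only a sketch: `debug_show` and the `SPARK_DEBUG_SHOW` environment variable are hypothetical names invented for this example, not Spark settings.

```python
import os


def debug_show(df, n=10, truncate=False, enabled=None):
    """Call `df.show()` only when debugging is switched on.

    `SPARK_DEBUG_SHOW` is a hypothetical environment variable used for this
    sketch; pass `enabled=True/False` to override it explicitly.
    Returns True if the dataframe was shown, False otherwise.
    """
    if enabled is None:
        enabled = os.environ.get("SPARK_DEBUG_SHOW", "0") == "1"
    if enabled:
        df.show(n, truncate=truncate)  # triggers a Spark job, so keep it opt-in
    return enabled


# Usage with the names from the notebook (needs a running Spark session):
#
# debug_show(joined)                 # no job unless SPARK_DEBUG_SHOW=1
# debug_show(joined, enabled=True)   # force it while exploring
```

With the guard off, no extra Spark job is triggered, so the production run only pays for the write.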