![title](img/this-is-fine-spark.jpeg)

## 🔥 Spark fires 🔥 - the perils of small files

In this scenario, we will demonstrate the cost of reading lots of small files, and why you should consider scheduling some house-keeping if your write patterns tend to produce them.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DoubleType
import pyspark.sql.functions as F

spark = (
    SparkSession
    .builder.master("spark://spark:7077")
    # .config("spark.eventLog.enabled", "true")
    # .config("spark.eventLog.dir", "/Users/owen/dev/data/tmp/spark-events")
    .appName("spark-fires-small-files")
    .getOrCreate()
)

spark.version

### Let's create some test data

We are going to create a synthetic field named `pid`, which we will use to partition our data. We will randomly assign it across our rows.

In [None]:
num_rows = 720000
num_partitions = 300

df = spark.range(0, num_rows).withColumn('pid', F.floor(F.rand() * num_partitions)).cache()
df.count()

Then we will save our data to two locations: one with a single file per `pid` partition, and the other with 12 files per `pid` partition.

In [None]:
%%time

small_files_path = "/data/small-files"
big_files_path = "/data/big-files"

df.repartition(1).write.format("parquet").mode('overwrite').partitionBy('pid').save(big_files_path)
df.repartition(12).write.format("parquet").mode('overwrite').partitionBy('pid').save(small_files_path)
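To confirm what actually landed on disk, we can count the data files under each location. With 300 `pid` values we would expect roughly 300 files in the big-files location and up to 300 × 12 = 3,600 in the small-files one. The sketch below is only a rough check: `summarize_parquet_files` is a hypothetical helper, and it assumes the notebook kernel can see `/data` as a local (or mounted) file-system path.

```python
import os


def summarize_parquet_files(root):
    """Return (file_count, total_bytes) for the Parquet data files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip side files such as _SUCCESS markers and .crc checksums.
            if name.startswith(("_", ".")) or not name.endswith(".parquet"):
                continue
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes


# Assumption: these are the same paths used in the write cell above.
for label, path in [("big", "/data/big-files"), ("small", "/data/small-files")]:
    count, size = summarize_parquet_files(path)
    avg = size / max(count, 1)  # avoid dividing by zero if the path is empty
    print(f"{label}: {count} files, {size} bytes, ~{avg:.0f} bytes per file")
```

If the data lives on a remote object store instead, the same idea applies, but the listing would have to go through that store's own client.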

### Now let's read the data and do some basic transforms on it

First let's read the small files.

In [None]:
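def process_data(input_path: str) -> None:
    """Read Parquet from input_path, add a derived column, and write the result out."""
    mapped = spark.read.format('parquet').load(input_path)
    # A deliberately light transform, so the job is dominated by reading the input.
    mapped = mapped.withColumn('incd', F.col('id') + 1).repartition(6)
    mapped.write.format("parquet").mode('overwrite').save("/data/mapped")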

In [None]:
%%time

process_data(small_files_path)

### Putting the fire out  🔥🔥🔥 🚒 🚒 🚒 🧯🧯🧯

Now what if we did some house-keeping and rolled up our data into fewer, larger files?
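In this notebook the rolled-up copy was already written to `big_files_path` up front. In a real pipeline the roll-up would more likely be a separate scheduled job over data that has already landed. A minimal sketch of such a job is shown here, under a few assumptions: `compact_partitioned_parquet` and the destination path in the usage comment are hypothetical names, and the function expects an existing `SparkSession` to be passed in.

```python
def compact_partitioned_parquet(spark, src_path, dst_path, partition_col="pid"):
    """Rewrite a partitioned Parquet dataset with one file per partition value.

    `spark` is assumed to be an existing SparkSession.
    """
    # Overwriting the location we are reading from would fail (or lose data),
    # so insist on a separate destination.
    if src_path.rstrip("/") == dst_path.rstrip("/"):
        raise ValueError("dst_path must differ from src_path")

    df = spark.read.format("parquet").load(src_path)
    (
        # Shuffle so that all rows for a given partition value end up in the
        # same task; each output directory then receives a single file.
        df.repartition(partition_col)
        .write.format("parquet")
        .mode("overwrite")
        .partitionBy(partition_col)
        .save(dst_path)
    )


# Hypothetical usage (the destination path is an assumption):
# compact_partitioned_parquet(spark, "/data/small-files", "/data/compacted")
```

Writing to a separate location and swapping it in afterwards keeps the original files intact until the compacted copy is known to be good.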

In [None]:
%%time

process_data(big_files_path)

Wow, so straight away we see a **~4x speed increase**, boom. (For me, the runtime drops from ~12 secs to ~3 secs.) But there are a few things to note here:
1. One factor is simply the overhead of handling the larger number of files.
2. Another factor, which we are not modelling here, is network latency and remote file-system/object-store interactions, where every extra file adds a significant cost. Our experiment is missing this real-world cost.
3. The processing in our noddy test is very light on computation, so the jobs are dominated by I/O costs, and the impact of small files will vary from application to application. That said, because all the data in this test is local, we are not seeing the true cost of small files; real-world impacts are often far more significant.
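The points above can be illustrated with a toy cost model: a fixed overhead per file plus the time to stream the bytes. This is only a back-of-the-envelope sketch. The function name and every number below are hypothetical placeholders rather than measurements, and the model ignores parallelism across executors.

```python
def estimated_read_seconds(num_files, total_bytes, per_file_overhead_s, throughput_bytes_per_s):
    """Toy model: fixed per-file overhead plus time to stream the bytes."""
    return num_files * per_file_overhead_s + total_bytes / throughput_bytes_per_s


# All values are made-up placeholders, chosen only to show the shape of the effect.
total_bytes = 10 * 1024 * 1024            # same data volume in both layouts
throughput = 100 * 1024 * 1024            # bytes per second
scenarios = {"local disk": 0.002, "remote object store": 0.05}  # per-file overhead (s)

for name, overhead in scenarios.items():
    small = estimated_read_seconds(3600, total_bytes, overhead, throughput)
    big = estimated_read_seconds(300, total_bytes, overhead, throughput)
    print(f"{name}: small files ~{small:.1f}s, big files ~{big:.1f}s")
```

Because the data volume is identical in both layouts, the whole difference comes from the per-file term, which is why it grows as the per-file overhead rises.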

In [None]:
# spark.stop()