Here we kept `maxPartitionBytes` fixed at 128 MB and changed only `openCostInBytes`.

We used a dataset of 1,467 small Parquet files totaling ~654 MB of actual data.
The goal was to see how changing `openCostInBytes` affects:

- Padded size (actual bytes plus the per-file open cost)
- Number of partitions
- Spark task count and job runtime
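
To make "padded size" concrete, here is a minimal sketch in plain Python (no Spark session needed). The function name `padded_size` is illustrative and not part of Spark; the inputs are the file count, byte total, and 4 MB open cost reported in this notebook.

```python
def padded_size(actual_bytes: int, file_count: int, open_cost: int) -> dict:
    """Split the padded size into its real and virtual parts."""
    virtual_bytes = file_count * open_cost       # per-file open-cost padding
    padded_bytes = actual_bytes + virtual_bytes  # total used when sizing partitions
    return {
        "actual_bytes": actual_bytes,
        "virtual_bytes": virtual_bytes,
        "padded_bytes": padded_bytes,
        "virtual_share": virtual_bytes / padded_bytes,
    }


breakdown = padded_size(
    actual_bytes=654_715_376,
    file_count=1_467,
    open_cost=4 * 1024 * 1024,
)
print(f"{breakdown['padded_bytes']:,}")     # 6,807,759,344
print(f"{breakdown['virtual_share']:.0%}")  # 90%
```

With the 4 MB open cost, about 90% of the padded size is virtual padding, which is why this dataset reacts so strongly to the setting.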

In [0]:
sc.setJobDescription("Step A-1: Basic initialization")
# Disable the Databricks disk cache so every run reads the files from storage.
spark.conf.set("spark.databricks.io.cache.enabled", "false")
# The settings come back as strings that may end in "b" (bytes); strip it before converting.
defaultMaxPartitionBytes = int(spark.conf.get("spark.sql.files.maxPartitionBytes").replace("b",""))
openCostInBytes = int(spark.conf.get("spark.sql.files.openCostInBytes").replace("b",""))
displayHTML(f"""<table>
  <tr><td>Max Partition Bytes:</td><td><b>{defaultMaxPartitionBytes/1024/1024}</b> MB</td></tr>
  <tr><td>Open Cost In Bytes: </td><td><b>{openCostInBytes/1024/1024}</b> MB</td></tr>
</table>""")




| Setting | Value |
|---------|-------|
| Max Partition Bytes | 128.0 MB |
| Open Cost In Bytes | 4.0 MB |
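
The cells in this notebook convert the settings with `.replace("b", "")`, which works for values like `134217728b` but would raise a `ValueError` if a setting had been written as `128MB`. A more tolerant parser might look like the sketch below. `parse_byte_string` is a hypothetical helper, not a Spark API, and the suffix table (binary multiples, matched case-insensitively) reflects my understanding of how Spark reads byte-size settings, so treat it as an assumption.

```python
import re

# Assumed suffix table: binary multiples, matched case-insensitively.
_UNITS = {
    "": 1, "b": 1,
    "k": 1024, "kb": 1024,
    "m": 1024**2, "mb": 1024**2,
    "g": 1024**3, "gb": 1024**3,
    "t": 1024**4, "tb": 1024**4,
}


def parse_byte_string(value) -> int:
    """Convert a size such as '134217728b', '128MB', or 524288 to a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", str(value))
    if match is None:
        raise ValueError(f"Unrecognized byte size: {value!r}")
    number, unit = match.group(1), match.group(2).lower()
    if unit not in _UNITS:
        raise ValueError(f"Unknown unit suffix in: {value!r}")
    return int(number) * _UNITS[unit]


print(parse_byte_string("134217728b"))  # 134217728
print(parse_byte_string("128MB"))       # 134217728
print(parse_byte_string(524288))        # 524288
```

In the cells here it could stand in for `int(spark.conf.get(...).replace("b", ""))`.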


In [0]:
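sc.setJobDescription("Step A-2: Utility Function")
def predict_num_partitions(files):
    """Estimate the number of read partitions for `files` and display the intermediate values."""
    import math
    # Read the current settings; strip a trailing "b" (bytes) suffix before converting.
    open_cost = int(spark.conf.get("spark.sql.files.openCostInBytes").replace("b", ""))
    max_partition_bytes = int(spark.conf.get("spark.sql.files.maxPartitionBytes").replace("b", ""))

    # Every file is padded by the open cost before partitions are sized.
    actual_bytes = sum(f.size for f in files)
    padded_bytes = actual_bytes + (len(files) * open_cost)

    # Target size: padded bytes per core, at least open_cost and at most max_partition_bytes.
    bytes_per_core = padded_bytes // sc.defaultParallelism
    max_of_cost_bpc = max(open_cost, bytes_per_core)
    target_size = min(max_partition_bytes, max_of_cost_bpc)
    partitions = padded_bytes / target_size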

    def row(label, value, extra=""):
        return f'<tr><td>{label}:</td><td style="text-align:right; font-weight:bold">{value:,}</td><td style="padding-left:1em">{extra}</td></tr>'

    html = "<table>" + \
        row("File Count", len(files)) + \
        row("Actual Bytes", actual_bytes) + \
        row("Padded Bytes", padded_bytes, "Actual_Bytes + (File_Count * Open_Cost)") + \
        row("Average Size", padded_bytes // len(files)) + \
        '<tr><td colspan="2" style="border-top:1px solid black">&nbsp;</td></tr>' + \
        row("Open Cost", open_cost, "spark.sql.files.openCostInBytes") + \
        row("Bytes-Per-Core", bytes_per_core) + \
        row("Max Cost", max_of_cost_bpc, "(max of Open_Cost & Bytes-Per-Core)") + \
        '<tr><td colspan="2" style="border-top:1px solid black">&nbsp;</td></tr>' + \
        row("Max Partition Bytes", max_partition_bytes, "spark.sql.files.maxPartitionBytes") + \
        row("Target Size", target_size, "(min of Max_Cost & Max_Partition_Bytes)") + \
        '<tr><td colspan="2" style="border-top:1px solid black">&nbsp;</td></tr>' + \
        row("Number of Partitions", math.ceil(partitions), f"({partitions} from Padded_Bytes / Target_Size)") + \
        "</table>"
    displayHTML(html)


In [0]:
from pyspark.sql.functions import window, col
from pyspark.sql.types import StructType, StructField, DecimalType, StringType


trxPath = "s3a://tks-dados-responsys/EXPORT_FILES/files/STATUS_OPT/"
trxFiles = [f for f in dbutils.fs.ls(trxPath) if f.name.endswith(".parquet")]
trxSchema = StructType([
    StructField("_SDC_SOURCE_LINENO", DecimalType(38, 0)),
    StructField("EMAIL_ADDRESS_", StringType()),
    StructField("EMAIL_PERMISSION_STATUS_", StringType()),
    StructField("_SDC_SOURCE_FILE", StringType()),
    StructField("_SDC_SEQUENCE", DecimalType(38, 0)),
    StructField("_SDC_RECEIVED_AT", StringType()),
    StructField("_SDC_BATCHED_AT", StringType()),
    StructField("_SDC_TABLE_VERSION", DecimalType(38, 0)),
])

sc.setJobDescription("Step C: OCB 4MB")
predict_num_partitions(trxFiles)
# The "noop" sink reads every file but writes nothing, so the job measures the scan alone.
spark.read.schema(trxSchema).parquet(trxPath).write.format("noop").mode("overwrite").save()


| Metric | Value | Notes |
|--------|-------|-------|
| File Count | 1,467 | |
| Actual Bytes | 654,715,376 | |
| Padded Bytes | 6,807,759,344 | Actual_Bytes + (File_Count * Open_Cost) |
| Average Size | 4,640,599 | |
| Open Cost | 4,194,304 | spark.sql.files.openCostInBytes |
| Bytes-Per-Core | 1,701,939,836 | |
| Max Cost | 1,701,939,836 | (max of Open_Cost & Bytes-Per-Core) |
| Max Partition Bytes | 134,217,728 | spark.sql.files.maxPartitionBytes |


In [0]:
sc.setJobDescription("Step D: OCB 1/2 MB")
spark.conf.set("spark.sql.files.openCostInBytes", 524288) # Reduce to 1/2 MB
predict_num_partitions(trxFiles)                     
spark.read.schema(trxSchema).parquet(trxPath).write.format("noop").mode("overwrite").save()   

| Metric | Value | Notes |
|--------|-------|-------|
| File Count | 1,467 | |
| Actual Bytes | 654,715,376 | |
| Padded Bytes | 1,423,845,872 | Actual_Bytes + (File_Count * Open_Cost) |
| Average Size | 970,583 | |
| Open Cost | 524,288 | spark.sql.files.openCostInBytes |
| Bytes-Per-Core | 142,384,587 | |
| Max Cost | 142,384,587 | (max of Open_Cost & Bytes-Per-Core) |
| Max Partition Bytes | 134,217,728 | spark.sql.files.maxPartitionBytes |


In [0]:
sc.setJobDescription("Step E: OCB 1/8 MB")
spark.conf.set("spark.sql.files.openCostInBytes", 131072) # Reduce to 1/8 MB
predict_num_partitions(trxFiles)                     
spark.read.schema(trxSchema).parquet(trxPath).write.format("noop").mode("overwrite").save()   

| Metric | Value | Notes |
|--------|-------|-------|
| File Count | 1,467 | |
| Actual Bytes | 654,715,376 | |
| Padded Bytes | 846,998,000 | Actual_Bytes + (File_Count * Open_Cost) |
| Average Size | 577,367 | |
| Open Cost | 131,072 | spark.sql.files.openCostInBytes |
| Bytes-Per-Core | 84,699,800 | |
| Max Cost | 84,699,800 | (max of Open_Cost & Bytes-Per-Core) |
| Max Partition Bytes | 134,217,728 | spark.sql.files.maxPartitionBytes |


##  Results Summary

| Step | `openCostInBytes` | Padded Size | Predicted Partitions | Spark Tasks | Runtime | Notes |
|------|-------------------|-------------|------------------------|-------------|---------|-------|
| C    | 4 MB              | 6.8 GB      | 51                     | 51          | 56 s    | Default setting |
| D    | 0.5 MB            | 1.4 GB      | 11                     | 12          | 51 s    | Much less padding |
| E    | 0.125 MB          | 847 MB      | 10                     | 11          | 39 s    | Padding adds ~29% to actual size |
| F    | 0 MB              | 654 MB      | 5                      | 6           | 37 s    | Only real size used |
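
The Predicted Partitions column can be reproduced without a cluster. The sketch below restates the arithmetic of `predict_num_partitions` as a pure function; `estimate_partitions` is an illustrative name. The `parallelism` values are assumptions: 4 for Step C and 10 for Steps D and E are inferred from the logged outputs (Padded Bytes divided by Bytes-Per-Core), and 4 for Step F is a guess that happens to reproduce the 5 in the table, since no output is shown for that step.

```python
import math


def estimate_partitions(actual_bytes, file_count, open_cost, max_partition_bytes, parallelism):
    """Return (padded_bytes, target_size, predicted_partitions)."""
    padded_bytes = actual_bytes + file_count * open_cost
    bytes_per_core = padded_bytes // parallelism
    target_size = min(max_partition_bytes, max(open_cost, bytes_per_core))
    return padded_bytes, target_size, math.ceil(padded_bytes / target_size)


ACTUAL_BYTES = 654_715_376
FILE_COUNT = 1_467
MAX_PARTITION_BYTES = 128 * 1024 * 1024

# (step, openCostInBytes, assumed parallelism)
steps = [
    ("C", 4 * 1024 * 1024, 4),
    ("D", 524_288, 10),
    ("E", 131_072, 10),
    ("F", 0, 4),
]

predicted = {}
for step, open_cost, parallelism in steps:
    padded, target, partitions = estimate_partitions(
        ACTUAL_BYTES, FILE_COUNT, open_cost, MAX_PARTITION_BYTES, parallelism
    )
    predicted[step] = partitions
    print(f"Step {step}: padded={padded:,} target={target:,} partitions={partitions}")
```

Under these assumptions the function returns 51, 11, 10, and 5 partitions for Steps C to F. Step E is the only case where the target size falls below `maxPartitionBytes`, because Bytes-Per-Core (84,699,800) becomes the limiting value.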

---

##  Observations

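- As `openCostInBytes` ↓, the **padded size ↓**, so Spark creates **fewer partitions**.
- With fewer partitions, the **number of tasks ↓** in Stage 0.
- **Job runtime improved** from 56 s to 39 s as the task count dropped from 51 to 11.
- Below roughly 10 tasks, **runtime gains flatten** (39 s at 11 tasks vs. 37 s at 6 tasks), because fewer tasks also means less parallelism.
- The logged Bytes-Per-Core values imply that `sc.defaultParallelism` was 4 in Step C and 10 in Steps D and E, so part of the runtime difference may come from the cluster rather than from `openCostInBytes`.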
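
In the Results Summary, the Spark Tasks column is sometimes one higher than the prediction. One plausible reason is that Spark does not divide the padded bytes evenly; it packs whole files into partitions one at a time. The sketch below imitates that greedy packing as I understand it from Spark's file-partitioning logic: sort by size descending, start a new partition when the next file would not fit, and charge the open cost for every file added. `pack_files` and the sample sizes are illustrative, the splitting of large files is ignored, and the real planner may differ in details, so treat this as an approximation.

```python
import math


def pack_files(file_sizes, open_cost, max_split_bytes):
    """Greedily pack files into partitions; returns a list of lists of file sizes."""
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would push it past the limit.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost  # each file also costs one "open"
    if current:
        partitions.append(current)
    return partitions


# Toy example: three files that cannot share a partition.
sizes = [60, 60, 60]
open_cost, max_split = 5, 100

naive = math.ceil((sum(sizes) + len(sizes) * open_cost) / max_split)
packed = pack_files(sizes, open_cost, max_split)
print(naive, len(packed))  # 2 3
```

The division-based estimate gives 2 partitions, while packing whole files yields 3, which is the same kind of gap seen between the Predicted Partitions and Spark Tasks columns.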

---

##  When to Adjust `openCostInBytes`

| Scenario | Recommendation |
|----------|----------------|
| You have **many small files** | Keep a **higher** open cost (e.g. 4 MB) to avoid huge partitions full of tiny files |
| Your files are already **large** | `openCostInBytes` has **little to no effect** |
| You want **fewer partitions** and opening files is cheap on your storage | Consider reducing `openCostInBytes` |
| You use **autotuning or dynamic partitioning** | Spark might override this value during optimization |

---

## Key Insight

> `openCostInBytes` is a **soft penalty** added to each file to model I/O overhead.  
> It's not about physical size, but about **how Spark plans the workload.**

Reducing it produces fewer, larger tasks, but **too few tasks** might underutilize your cluster.

Note that a moderate value of `openCostInBytes` matters when the dataset holds a relatively small number of small files, as in this case with 1,467 files.