# Spark IO config

This tutorial explores Spark's I/O configuration. Spark itself focuses on data transformation; reading and writing files is delegated to Hadoop.

In [1]:
from pyspark.sql import SparkSession

In [2]:
# dir of spark temp file for shuffle, rdd, broadcast
spark_temp_dir = "C:/Users/PLIU/Documents/git/SparkInternals/notebooks/temp/spark_temp"
# dir of hadoop temp file for writing hadoop based file format
hadoop_temp_dir = "C:/Users/PLIU/Documents/git/SparkInternals/notebooks/temp/hadoop_temp"

# create a spark session with custom config
spark = SparkSession.builder \
    .appName("Explore hadoop fs") \
    .master("local[*]") \
    .config("spark.driver.memory", "10g") \
    .config("spark.local.dir", spark_temp_dir) \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()




In [3]:
# set hadoop conf

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("hadoop.tmp.dir", "/mnt/ssd1/tmp_hadoop")

## 1. Get all default conf value

We have three types of conf:
- spark conf
- SQL conf
- spark/hadoop conf
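
To make the three layers concrete, here is a minimal sketch that sorts a configuration key into one of them by its prefix. The function name `classify_conf_key` is hypothetical; the only Spark behavior it relies on is that keys starting with `spark.hadoop.` are copied into the Hadoop configuration with that prefix removed.

```python
def classify_conf_key(key: str) -> str:
    """Return the configuration layer a key belongs to (illustrative helper)."""
    if key.startswith("spark.sql."):
        return "sql"
    if key.startswith("spark.hadoop."):
        # Spark forwards these to the Hadoop configuration without the prefix
        return "hadoop (via spark)"
    if key.startswith("spark."):
        return "spark"
    # anything else (fs.*, io.*, dfs.*, ...) is treated as a plain Hadoop key
    return "hadoop"


for k in ["spark.sql.shuffle.partitions", "spark.local.dir",
          "spark.hadoop.fs.s3a.access.key", "io.file.buffer.size"]:
    print(f"{k}: {classify_conf_key(k)}")
```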

### 1.1 Get all spark sql conf

`spark.conf` returns a `RuntimeConfig` object. It does not have a `getAll` method for listing all available key-value pairs (at least in the PySpark version used for this notebook). You can use the code below to print
some commonly used Spark SQL confs.

```python
from typing import List


def get_sql_conf(keys: List[str]) -> None:
    for key in keys:
        value = spark.conf.get(key)
        print(f"{key}: {value}")

# popular sql conf keys
sql_conf_keys = [
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.sql.execution.arrow.pyspark.enabled",
    "spark.sql.files.maxPartitionBytes",
    "spark.sql.broadcastTimeout",
    "spark.sql.autoBroadcastJoinThreshold"
]
get_sql_conf(sql_conf_keys)
```
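
`spark.conf.get(key)` raises an error when a key has neither an explicit value nor a built-in default, which stops the loop at the first unknown key. A slightly more defensive sketch catches that error and returns a dict instead of printing. The helper name `collect_conf` and the `"<unset>"` placeholder are assumptions made for this example; `conf` is expected to be `spark.conf` or any object with a compatible `get(key)` method.

```python
from typing import Dict, Iterable


def collect_conf(conf, keys: Iterable[str], missing: str = "<unset>") -> Dict[str, str]:
    """Read each key from `conf`, substituting `missing` when the lookup fails."""
    values: Dict[str, str] = {}
    for key in keys:
        try:
            values[key] = conf.get(key)
        except Exception:  # the exact exception class differs between Spark versions
            values[key] = missing
    return values


# Usage with a live session (assumes `spark` exists):
# values = collect_conf(spark.conf, ["spark.sql.shuffle.partitions", "spark.sql.not.a.real.key"])
# for key, value in values.items():
#     print(f"{key}: {value}")
```

The lookup deliberately calls `get` without a default, so keys that have a built-in default still report it.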

In [4]:
# get shuffle partition numbers
key = "spark.sql.shuffle.partitions"
value = spark.conf.get(key)

print(f"{key}: {value}")

spark.sql.shuffle.partitions: 8


In [5]:
from typing import List


def get_sql_conf(keys: List[str]) -> None:
    for key in keys:
        value = spark.conf.get(key)
        print(f"{key}: {value}")

In [6]:
# popular sql conf keys
sql_conf_keys = [
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.sql.execution.arrow.pyspark.enabled",
    "spark.sql.files.maxPartitionBytes",
    "spark.sql.broadcastTimeout",
    "spark.sql.autoBroadcastJoinThreshold"
]
get_sql_conf(sql_conf_keys)

spark.sql.shuffle.partitions: 8
spark.sql.adaptive.enabled: true
spark.sql.execution.arrow.pyspark.enabled: false
spark.sql.files.maxPartitionBytes: 134217728b
spark.sql.broadcastTimeout: 300000ms
spark.sql.autoBroadcastJoinThreshold: 10485760b
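
The size and time values above come back as strings with a unit suffix (`b`, `ms`). Here is a small sketch for turning them into friendlier units; `to_mib` and `to_seconds` are hypothetical helper names, and the parsing assumes the value is either a plain integer or carries exactly the suffix shown in the output above.

```python
def to_mib(value: str) -> float:
    """Convert a byte count such as '134217728b' (or '134217728') to MiB."""
    digits = value[:-1] if value.endswith("b") else value
    return int(digits) / (1024 * 1024)


def to_seconds(value: str) -> float:
    """Convert a millisecond value such as '300000ms' (or '300000') to seconds."""
    digits = value[:-2] if value.endswith("ms") else value
    return int(digits) / 1000


print(to_mib("134217728b"))    # spark.sql.files.maxPartitionBytes
print(to_mib("10485760b"))     # spark.sql.autoBroadcastJoinThreshold
print(to_seconds("300000ms"))  # spark.sql.broadcastTimeout
```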


### 1.2 Get spark conf

In [8]:
# get the SparkConf object of the running context
spark_conf = spark.sparkContext.getConf()

In [9]:
# the spark local dir stores temp data for shuffle, rdd, broadcast
local_dir = spark_conf.get("spark.local.dir")
print("default spark.local.dir value=", local_dir)

default spark.local.dir value= C:/Users/PLIU/Documents/git/SparkInternals/notebooks/temp/spark_temp


In [11]:
# spark serializer
serializer = spark_conf.get("spark.serializer")
print("default spark.serializer value=", serializer)

default spark.serializer value= org.apache.spark.serializer.KryoSerializer


In [13]:
for k, v in spark_conf.getAll():
    print(f"{k} = {v}")

spark.app.id = local-1753780234380
spark.driver.extraJavaOptions = -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
spark.driver.memory = 10g
spark.app.name = Explore hadoop fs
spark.executor.i
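
The full dump is long, so it can help to group keys by their prefix before printing. This is a minimal sketch, assuming the input is the list of `(key, value)` tuples that `spark_conf.getAll()` returns; `group_by_prefix` is a hypothetical helper and the sample pairs are copied from the output above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_prefix(pairs: Iterable[Tuple[str, str]], depth: int = 2) -> Dict[str, List[str]]:
    """Group config keys by their first `depth` dot-separated segments."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, _value in pairs:
        prefix = ".".join(key.split(".")[:depth])
        groups[prefix].append(key)
    return dict(groups)


sample = [
    ("spark.driver.memory", "10g"),
    ("spark.driver.extraJavaOptions", "..."),  # value shortened for this example
    ("spark.app.name", "Explore hadoop fs"),
    ("spark.app.id", "local-1753780234380"),
]
for prefix, keys in sorted(group_by_prefix(sample).items()):
    print(prefix, "->", sorted(keys))

# With a live session: group_by_prefix(spark_conf.getAll())
```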

In [14]:
# get default fs
hadoop_conf.get("fs.defaultFS")

'file:///'

Below is a list of Spark configurations, grouped by category:

1. Memory and Resource Allocation
2. Disk Spill and Local Temp Directory
3. Shuffle and Join Optimization
4. Serialization and Format I/O
5. Execution & Adaptive Query Execution (AQE)
6. Security and Credential Propagation
7. Debugging, Logging, and UI
8. Misc / Execution Control


#### 1.2.1. Memory and Resource Allocation

```text
spark.driver.memory:	Memory for the driver process (e.g. 4g)
spark.executor.memory:	Memory for each executor process (e.g. 4g)
spark.memory.fraction:	Fraction of (JVM heap - 300MB) used for execution and storage (default: 0.6)
spark.memory.storageFraction:	Share of that region where cached RDD/DataFrame blocks are immune to eviction (default: 0.5)
spark.executor.cores:	Cores per executor
spark.driver.cores:	Cores used by the driver (cluster mode only)
spark.task.cpus:	Cores allocated to each task (default: 1)
```
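
As a rough worked example of how `spark.memory.fraction` and `spark.memory.storageFraction` combine: Spark reserves about 300 MB of heap, applies `spark.memory.fraction` to the rest to get the shared execution/storage region, and `spark.memory.storageFraction` marks the share of that region where cached blocks are protected from eviction. The sketch below only does this arithmetic. It assumes the usable heap equals the configured memory, which the JVM does not guarantee, so treat the numbers as estimates; the function names are hypothetical.

```python
RESERVED_MB = 300  # heap that Spark sets aside before applying spark.memory.fraction


def unified_memory_mb(heap_mb: float, memory_fraction: float = 0.6) -> float:
    """Approximate size of the shared execution + storage region, in MB."""
    return (heap_mb - RESERVED_MB) * memory_fraction


def protected_storage_mb(heap_mb: float, memory_fraction: float = 0.6,
                         storage_fraction: float = 0.5) -> float:
    """Approximate storage memory that is immune to eviction, in MB."""
    return unified_memory_mb(heap_mb, memory_fraction) * storage_fraction


heap = 10 * 1024  # spark.driver.memory = 10g, as configured in this notebook
print(f"unified region:    ~{unified_memory_mb(heap):.0f} MB")
print(f"protected storage: ~{protected_storage_mb(heap):.0f} MB")
```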

#### 1.2.2. Disk Spill and Local Temp Directory

```text
spark.local.dir:	Where Spark stores temp/shuffle data (e.g., /mnt/ssd1/spark)
hadoop.tmp.dir:	Hadoop-level temporary directory (used by Hadoop I/O)
spark.shuffle.spill.compress:	Compress spilled shuffle files (true/false)
spark.shuffle.spill.numElementsForceSpillThreshold:	Force spill when element count is reached
```

#### 1.2.3. Shuffle and Join Optimization
```text
spark.sql.shuffle.partitions:	Number of partitions after shuffles (default: 200)
spark.sql.autoBroadcastJoinThreshold:	Max size (bytes) for auto broadcast joins (default: 10MB)
spark.shuffle.compress:	Compress shuffle data
spark.shuffle.file.buffer:	Buffer size for shuffle file writes (default: 32k)
spark.shuffle.sort.bypassMergeThreshold:	Use bypass merge sort if num partitions ≤ this
```
#### 1.2.4. Serialization and Format I/O
```text
spark.serializer:	Class used to serialize objects (e.g., KryoSerializer)
spark.kryo.registrationRequired:	Enforce class registration for Kryo
spark.sql.parquet.compression.codec:	Compression codec for Parquet (snappy, gzip, lz4)
spark.sql.orc.compression.codec:	ORC compression
spark.sql.files.maxPartitionBytes:	Bytes per partition when reading files (default: 128MB)
```
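
To get a feel for what `spark.sql.files.maxPartitionBytes` does, here is a simplified model of how a split size is chosen for file-based sources. It is a sketch under assumptions, not Spark's exact planner: it assumes splittable files, uses `spark.sql.files.openCostInBytes` (default 4 MB, not listed above) as the per-file overhead, and takes `parallelism` as the number of available cores. The function names are hypothetical, and details may differ between Spark versions.

```python
import math
from typing import Sequence

MIB = 1024 * 1024


def estimate_max_split_bytes(file_sizes: Sequence[int],
                             max_partition_bytes: int = 128 * MIB,
                             open_cost_bytes: int = 4 * MIB,
                             parallelism: int = 8) -> int:
    """Simplified model of the split size used when reading files."""
    # every file is padded with the open cost before averaging over the cores
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_split_count(file_sizes: Sequence[int], **kwargs) -> int:
    """Number of splits if every file is cut into chunks of the estimated split size."""
    split_bytes = estimate_max_split_bytes(file_sizes, **kwargs)
    return sum(math.ceil(size / split_bytes) for size in file_sizes)


print(estimate_split_count([1024 * MIB]))  # one 1 GiB file
print(estimate_split_count([10 * MIB]))    # one 10 MiB file
```

Under this model a large file is cut at `maxPartitionBytes`, while a small input is cut into smaller pieces so that more cores get work. Spark may later pack several small splits into one partition, so the split count is an upper bound on the partition count.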
#### 1.2.5. Execution & Adaptive Query Execution (AQE)
```text
spark.sql.adaptive.enabled:	Enable AQE (true/false)
spark.sql.adaptive.shuffle.targetPostShuffleInputSize:	Target partition size for coalescing (deprecated since Spark 3.0; use spark.sql.adaptive.advisoryPartitionSizeInBytes)
spark.sql.adaptive.coalescePartitions.enabled:	Allow AQE to coalesce shuffle partitions
spark.sql.adaptive.skewJoin.enabled:	Enable skew join handling
```
#### 1.2.6. Security and Credential Propagation
```text
spark.authenticate:	Enable Spark internal authentication
spark.authenticate.secret:	Shared secret for Spark auth
spark.hadoop.fs.s3a.access.key:	AWS S3 credentials (if needed)
spark.hadoop.fs.s3a.secret.key:	AWS S3 credentials (if needed)
spark.hadoop.dfs.client.use.datanode.hostname:	Needed for some HDFS setups
```
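
Any Hadoop property can be passed through Spark by prefixing it with `spark.hadoop.`, which is what the `fs.s3a.*` entries above do. Here is a minimal sketch that builds such options from environment variables instead of hard-coding secrets. The helper name `s3a_options_from_env` is hypothetical, and `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` are the conventional AWS variable names, assumed to be set in your environment.

```python
import os
from typing import Dict, Mapping


def s3a_options_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build spark.hadoop.* options for S3A from environment variables, skipping unset ones."""
    mapping = {
        "AWS_ACCESS_KEY_ID": "fs.s3a.access.key",
        "AWS_SECRET_ACCESS_KEY": "fs.s3a.secret.key",
    }
    return {
        f"spark.hadoop.{hadoop_key}": env[var]
        for var, hadoop_key in mapping.items()
        if var in env
    }


# Usage (assumes a SparkSession builder as in the first cell):
# builder = SparkSession.builder.appName("Explore hadoop fs")
# for key, value in s3a_options_from_env().items():
#     builder = builder.config(key, value)
```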
#### 1.2.7. Debugging, Logging, and UI
```text
spark.eventLog.enabled:	Enable Spark event logging
spark.eventLog.dir:	Where to store event logs
spark.ui.port:	Web UI port (default 4040)
spark.executor.logs.rolling.strategy:	Rolling log config
spark.history.fs.logDirectory:	Spark History Server config
```
#### 1.2.8. Misc / Execution Control

```text
spark.default.parallelism:	Default number of partitions (e.g., for RDDs)
spark.dynamicAllocation.enabled:	Enable dynamic executor allocation
spark.sql.broadcastTimeout:	Timeout for broadcast joins (default 300s)
spark.cleaner.ttl:	TTL for cached RDD metadata (legacy setting, removed in Spark 2.0)
```


In [5]:
# print every key-value pair of the hadoop conf (defaults plus any overrides)
for entry in hadoop_conf.iterator():
    print(f"{entry.getKey()} = {entry.getValue()}")

yarn.log-aggregation.file-formats = TFile
fs.s3a.select.output.csv.record.delimiter = \n
mapreduce.jobhistory.client.thread-count = 10
hadoop.security.group.mapping.ldap.posix.attr.uid.name = uidNumber
yarn.admin.acl = *
yarn.app.mapreduce.am.job.committer.cancel-timeout = 60000
yarn.federation.enabled = false
mapreduce.job.emit-timeline-data = false
fs.s3a.select.input.csv.quote.character = "
yarn.nodemanager.runtime.linux.sandbox-mode.local-dirs.permissions = read
yarn.resourcemanager.leveldb-state-store.path = ${hadoop.tmp.dir}/yarn/system/rmstore
ipc.client.connection.maxidletime = 10000
yarn.nodemanager.health-checker.scripts = script
yarn.nodemanager.process-kill-wait.ms = 5000
yarn.minicluster.use-rpc = false
io.map.index.interval = 128
mapreduce.task.profile.reduces = 0-2
hadoop.util.hash.type = murmur
yarn.webapp.api-service.enable = false
yarn.resourcemanager.nodemanagers.heartbeat-interval-min-ms = 1000
yarn.nodemanager.aux-services.manifest.reload-ms = 0
fs.s3a.path.style.a

## 2. Local file system conf

In Spark local mode, Spark reads and writes data on the local file system. The following Hadoop settings can be adjusted for that setup:


```python
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# Use the local filesystem explicitly (already the default for file://, but it can be overridden)
hadoop_conf.set("fs.defaultFS", "file:///")
hadoop_conf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem")
# Use a larger I/O buffer (Hadoop's default is 4 KB; Spark sets 64 KB via spark.buffer.size)
hadoop_conf.set("io.file.buffer.size", "131072")  # 128 KB

# Temp / staging directories
hadoop_conf.set("hadoop.tmp.dir", "/tmp/hadoop_spark_tmp")
hadoop_conf.set("mapreduce.cluster.local.dir", "/tmp/spark_local_dir")

# The replication factor is not used in local mode; set it to 1 to avoid unnecessary checks
hadoop_conf.set("dfs.replication", "1")

# Enable fast compression
hadoop_conf.set("mapreduce.output.fileoutputformat.compress", "true")
hadoop_conf.set("mapreduce.output.fileoutputformat.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec")
```
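
When experimenting with these settings, it is handy to apply them in one call and keep the previous values for comparison. This is a minimal sketch, assuming `conf` exposes `get(key)` and `set(key, value)` the way the Hadoop `Configuration` object does; `apply_hadoop_settings` is a hypothetical helper, not part of Spark or Hadoop.

```python
from typing import Dict, Optional


def apply_hadoop_settings(conf, settings: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Set each key on `conf` and return the values that were there before."""
    previous: Dict[str, Optional[str]] = {}
    for key, value in settings.items():
        previous[key] = conf.get(key)  # None when the key was not set
        conf.set(key, value)
    return previous


local_fs_settings = {
    "io.file.buffer.size": "131072",  # 128 KB
    "hadoop.tmp.dir": "/tmp/hadoop_spark_tmp",
}

# With a live session:
# old = apply_hadoop_settings(hadoop_conf, local_fs_settings)
# print(old)
```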


In [7]:
# get the current fs.file.impl value (None means it is not set explicitly)
print(hadoop_conf.get("fs.file.impl"))

None


In [8]:
# the default value under Spark is 65536 bytes (64 KB)
hadoop_conf.get("io.file.buffer.size")

'65536'

In [15]:
# This folder stores temp files for Hadoop I/O when writing Hadoop-based file formats such as .orc or .avro
hadoop_conf.get("hadoop.tmp.dir")

'/mnt/ssd1/tmp_hadoop'