# PySpark: Zero to Hero
## Module 16: Writing Data in Spark

Reading data is only half the battle. In this module, we will learn how to write data back to storage effectively. We will dive deep into how Spark's distributed architecture influences file generation and how to control the output structure.

### Agenda:
1.  **Spark Writer API:** Basic syntax for writing data.
2.  **Under the Hood:** How Partitions relate to Output Files.
3.  **Partitioning Data:** Using `partitionBy` to create directory structures.
4.  **Write Modes:** `append`, `overwrite`, `ignore`, and `error`.
5.  **Bonus Tip:** How to write a single output file (handling the `part-00000` naming convention).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder \
    .appName("Writing_Data") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

In [None]:
# We will use the Employee dataset for this exercise.
data = [
    ("001", "John Doe", 30, "Male", 50000, "2015-01-01", "101"),
    ("002", "Jane Smith", 25, "Female", 45000, "2016-02-15", "101"),
    ("003", "Bob Brown", 35, "Male", 55000, "2014-05-01", "102"),
    ("004", "Alice Lee", 28, "Female", 48000, "2017-09-30", "102"),
    ("005", "Jack Chan", 40, "Male", 60000, "2013-08-21", "103"),
    ("006", "Jill Wong", 32, "Female", 52000, "2018-12-01", "103")
]

schema = StructType([
    StructField("employee_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("hire_date", StringType(), True),
    StructField("department_id", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()

## 1. How Spark Writes Files

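In Spark, **one task processes one partition, and each task that has rows to write typically produces one file**.

If your DataFrame has 8 non-empty partitions, Spark will launch 8 tasks (run in parallel when enough cores are available), and you will end up with 8 data files (e.g., `part-00000-...`, `part-00001-...`, etc.) inside the output folder, usually alongside a `_SUCCESS` marker file.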

Let's check the default parallelism and the number of partitions.

In [None]:
# Check the default parallelism of the cluster (local machine)
print(f"Default Parallelism (Cores): {spark.sparkContext.defaultParallelism}")

# Check how many partitions our DataFrame has
print(f"DataFrame Partitions: {df.rdd.getNumPartitions()}")

# Let's verify which data resides in which partition ID
df.withColumn("partition_id", spark_partition_id()).show()

In [None]:
# Basic Write Syntax: df.write.format(...).save(...)
# Note: Spark creates a DIRECTORY with the given name, not a file.

output_path_basic = "data/output/module_16/basic_write"

# Writing in Parquet format
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .save(output_path_basic)

print(f"Data written to: {output_path_basic}")
# If you check this folder in your OS file explorer, you will see part-files,
# typically one per non-empty partition. With only 6 rows, some partitions may be
# empty, so you can see fewer files than you have cores.
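
To compare the number of files on disk with the partition count printed earlier, the output folder can be inspected from Python. The following is a minimal sketch using only the standard library; `list_part_files` is a hypothetical helper name, and it assumes the output was written to the local filesystem (not HDFS or object storage).

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the sorted names of Spark data files (part-*) directly inside output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    # The "part-*" pattern skips the "_SUCCESS" marker and the hidden ".crc"
    # checksum files, because neither name starts with "part-".
    return sorted(p.name for p in root.glob("part-*") if p.is_file())


# Path taken from the basic write example; an empty list is returned if it does not exist yet.
part_files = list_part_files("data/output/module_16/basic_write")
print(f"Data files found: {len(part_files)}")
for name in part_files:
    print(" ", name)
```

If the count is lower than `df.rdd.getNumPartitions()`, the most likely reason is that some partitions held no rows.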

## 2. Partitioning Data on Write

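Partitioning organizes data into sub-folders based on column values (e.g., `department_id=101/`). The partition column's value is stored in the folder name rather than inside the data files.
This improves read performance later because Spark can skip folders that a filter on the partition column rules out (Partition Pruning).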

**Syntax:** `.partitionBy("column_name")`

In [None]:
output_path_partitioned = "data/output/module_16/partitioned_write"

df.write \
    .format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .partitionBy("department_id") \
    .save(output_path_partitioned)

print(f"Partitioned data written to: {output_path_partitioned}")

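# Structure created on disk (real file names also carry a unique ID, and a folder
# can hold more than one part-file if its rows came from several partitions):
# /partitioned_write
#    |-- department_id=101/
#           |-- part-XXXXX-<unique-id>.csv
#    |-- department_id=102/
#           |-- part-XXXXX-<unique-id>.csv
#    |-- department_id=103/
#           |-- part-XXXXX-<unique-id>.csv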
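
The folder names themselves carry the partition values, which is what makes skipping folders possible. As a rough illustration, the sketch below reads those values back from the directory names using only the standard library. `partition_values` is a hypothetical helper, and the code assumes a local filesystem path with a single partitioning column.

```python
from pathlib import Path


def partition_values(output_dir, column):
    """Collect the values of `column` encoded in folder names of the form column=value."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    prefix = f"{column}="
    # Keep only sub-folders whose name starts with "column=" and strip that prefix.
    return sorted(
        p.name[len(prefix):]
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )


# Path and column name taken from the partitioned write example.
print(partition_values("data/output/module_16/partitioned_write", "department_id"))
```

A reader that only needs one department can go straight to the matching folder and never open the others.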

## 3. Write Modes

Spark provides different modes to handle existing data at the destination path:

1.  **`error` / `errorifexists` (default):** Throws an error if the destination already exists.
2.  **`append`:** Adds new files to the existing directory. (Be careful, this can result in duplicate data if run multiple times).
3.  **`overwrite`:** Deletes the existing data at the destination and writes fresh data.
4.  **`ignore`:** If the directory exists, does nothing (silently skips the write).

In [None]:
output_path_modes = "data/output/module_16/modes_test"

# First write (Initial creation)
print("1. Initial Write...")
df.write.format("csv").mode("overwrite").save(output_path_modes)

# Second write (Append) - File count will double
print("2. Appending data...")
df.write.format("csv").mode("append").save(output_path_modes)

# Third write (Error) - This should fail
print("3. Testing Error mode (expecting failure)...")
try:
    df.write.format("csv").mode("error").save(output_path_modes)
except Exception as e:
    print(f"Error Caught: {e}")

## 4. Bonus: Writing a Single Output File

Often, downstream systems expect a single CSV file, not a folder with `part-00000`, `part-00001`, etc.

**Solution:**
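Use `repartition(1)` or `coalesce(1)` before writing. Either call puts all data into a single partition, processed by a single task, resulting in one output file.

*Note: This is expensive for large datasets. `repartition(1)` performs a full shuffle, while `coalesce(1)` avoids the shuffle but can reduce the parallelism of the preceding stage. In both cases a single task ends up writing every record.*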

In [None]:
output_path_single = "data/output/module_16/single_file"

# Force data into 1 partition
df_single = df.repartition(1)

df_single.write \
    .format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save(output_path_single)

print(f"Single file written to: {output_path_single}")
# Check the folder: it will contain exactly one 'part-00000-....csv' data file holding all records
# (usually alongside a _SUCCESS marker file).
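
Spark still picks the `part-00000-...` name itself, so a downstream system that expects a fixed file name needs one more step. One common approach is to move the part-file after the write completes. The sketch below does this with the standard library; `export_single_file` and the target path in the comment are hypothetical, and the code assumes the output lives on the local filesystem (HDFS or cloud storage would need that system's own file API).

```python
import shutil
from pathlib import Path


def export_single_file(spark_output_dir, target_file, extension=".csv"):
    """Move the single part-file from a Spark output folder to target_file."""
    candidates = sorted(Path(spark_output_dir).glob(f"part-*{extension}"))
    # Refuse to guess: the folder must hold exactly one matching data file.
    if len(candidates) != 1:
        raise ValueError(
            f"Expected exactly 1 part-file in {spark_output_dir}, found {len(candidates)}"
        )
    target = Path(target_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(candidates[0]), str(target))
    return target


# Example call (run it after the single-file write above has finished):
# export_single_file("data/output/module_16/single_file",
#                    "data/output/module_16/employees.csv")
```

The helper raises instead of picking a file when the folder holds zero or several part-files, which guards against exporting a partial result.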

## Summary

1.  **Output Files:** The number of output files typically equals the number of non-empty DataFrame partitions, since each partition is written by its own task.
2.  **Organization:** Use `.partitionBy()` to create folder structures for optimized querying.
3.  **Modes:** Use `overwrite` to replace data, `append` to add data.
4.  **Single File:** Use `.repartition(1)` to merge all partitions before writing if a single file is required.

**Next Steps:**
In the next module, we will explore **Spark Clusters**, including their internal configuration and how tasks are distributed across nodes.