# Writing Partitioned Data, `repartition`, and `coalesce`

This notebook uses the built-in dataset `samples.nyctaxi.trips` to show:
1. How to write data to files
2. How `partitionBy` works when writing
3. How `repartition` / `coalesce` affect the number of output files

> ⚠️ These examples write to `/tmp/...` in DBFS.  
> You can safely delete those folders later with `dbutils.fs.rm(path, recurse=True)`.


In [None]:
from pyspark.sql import functions as F

nyc_df = spark.read.table("samples.nyctaxi.trips")

print("Total rows:", nyc_df.count())
display(nyc_df.limit(5))


## 1. Writing a Small Sample Without Partitioning

We will:
- Take a small subset
- Write it as Parquet to `/tmp/student_nyc_basic`


In [None]:
# Path for basic write (adjust if needed)
path_basic = "/tmp/student_nyc_basic"

# Clean up old data if it exists
dbutils.fs.rm(path_basic, recurse=True)

sample_df = nyc_df.select(
    "vendor_id",
    "passenger_count",
    "trip_distance",
    "fare_amount"
).limit(5000)

(
    sample_df
    .write
    .mode("overwrite")
    .parquet(path_basic)
)

print("Files written (no partitionBy):")
display(dbutils.fs.ls(path_basic))


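Each `part-*` file inside the folder is a **partition file**: it holds the rows of one DataFrame partition. Files whose names start with `_` (for example `_SUCCESS`) are markers and hold no data.

By default:
- The number of output files follows the number of **partitions** of the DataFrame at write time (typically one file per partition).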
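
To compare the file count with the partition count, count only the data files in the listing. The helper below is a minimal sketch: `data_files` is a hypothetical name, and it assumes Spark's usual `part-` prefix for data files.

```python
def data_files(names):
    """Return only the Spark data files (names starting with 'part-')."""
    return [name for name in names if name.startswith("part-")]


# Made-up listing for illustration; real file names will differ.
listing = ["_SUCCESS", "part-00000-example.snappy.parquet"]
print(len(data_files(listing)), "data file(s)")
```

In the notebook, pass `[f.name for f in dbutils.fs.ls(path_basic)]` as `names` and compare the result with `sample_df.rdd.getNumPartitions()`.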


## 2. Writing Partitioned Data with `partitionBy`

We will:
- Write the same sample, but partition by `passenger_count`.
- Each distinct `passenger_count` value will get its **own folder**.


In [None]:
path_part = "/tmp/student_nyc_partitioned"

dbutils.fs.rm(path_part, recurse=True)

(
    sample_df
    .write
    .mode("overwrite")
    .partitionBy("passenger_count")
    .parquet(path_part)
)

print("Top-level folders when partitioned by passenger_count:")
display(dbutils.fs.ls(path_part))


In [None]:
# Inspect one of the partition folders
# (change the passenger_count value based on what's shown above)
# Example for passenger_count=1:
example_folder = path_part + "/passenger_count=1"
print("Files in one partition folder:")
display(dbutils.fs.ls(example_folder))


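When using `partitionBy`:
- Spark writes one subfolder per distinct value, such as `passenger_count=1`, `passenger_count=2`, etc.
- The partition column's value is stored in the folder name rather than inside the Parquet files.
- **Filtering** by that column is very efficient because Spark can skip whole folders (partition pruning).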
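
The folder skipping described above can be modelled in plain Python. This is only a sketch of the idea, not Spark's implementation: `parse_partition_dir` and `prune` are hypothetical helpers, and the folder names are made up to match the `column=value` layout.

```python
def parse_partition_dir(name):
    """Split 'passenger_count=1/' into ('passenger_count', '1').

    Returns None when the name is not a column=value folder.
    """
    name = name.rstrip("/")
    if "=" not in name:
        return None
    column, value = name.split("=", 1)
    return column, value


def prune(dir_names, column, wanted_value):
    """Keep only the folders whose name matches column=wanted_value."""
    target = (column, str(wanted_value))
    return [d for d in dir_names if parse_partition_dir(d) == target]


# Made-up listing for illustration only.
dirs = ["passenger_count=1/", "passenger_count=2/", "_SUCCESS"]
print(prune(dirs, "passenger_count", 1))
```

A filter such as `spark.read.parquet(path_part).where("passenger_count = 1")` lets Spark make the same kind of decision from the folder names, so the data files in the other folders are never opened.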


## 3. `repartition` and `coalesce` Before Writing

We can control the number of output files by changing the number of partitions **before the write**:
- `repartition(n)` → full shuffle to exactly **n** partitions; it can increase or decrease the count.
- `coalesce(n)` → merges existing partitions down to at most **n** without a full shuffle; it cannot increase the count, so use it only when decreasing partitions.
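
The difference between the two can be illustrated with a toy model in plain Python, where a DataFrame is a list of partitions and each partition is a list of rows. This is a sketch under simplifying assumptions, not how Spark is implemented: `repartition_model` and `coalesce_model` are hypothetical names, and Spark's actual grouping of partitions may differ.

```python
def repartition_model(parts, n):
    """Toy model of repartition(n): every row may move to a new partition."""
    rows = [row for part in parts for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)  # spread rows evenly, round-robin
    return out


def coalesce_model(parts, n):
    """Toy model of coalesce(n): merge whole partitions, never split them."""
    n = min(n, len(parts))  # the partition count can only go down
    out = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i % n].extend(part)  # rows of one partition stay together
    return out


# Three uneven partitions, made up for illustration.
parts = [[1, 2, 3, 4, 5, 6], [7], [8]]
print([len(p) for p in repartition_model(parts, 2)])
print([len(p) for p in coalesce_model(parts, 2)])
print(len(coalesce_model(parts, 10)))
```

With these inputs the repartition model yields two partitions of 4 rows each, while the coalesce model yields partitions of 7 and 1 rows, and asking it for 10 partitions still returns 3. Coalescing is cheaper but can leave the data unevenly spread.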


In [None]:
path_repart = "/tmp/student_nyc_repartitioned"

dbutils.fs.rm(path_repart, recurse=True)

# Example: repartition to 8 partitions, then write
sample_repart = sample_df.repartition(8)

print("Partitions before write:", sample_repart.rdd.getNumPartitions())

(
    sample_repart
    .write
    .mode("overwrite")
    .parquet(path_repart)
)

print("Files written after repartition(8):")
display(dbutils.fs.ls(path_repart))


In [None]:
path_coalesce = "/tmp/student_nyc_coalesced"

dbutils.fs.rm(path_coalesce, recurse=True)

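# Example: coalesce down to at most 2 partitions, then write
# (coalesce never increases the count; if sample_df has fewer than 2 partitions, it stays as is)
sample_coal = sample_df.coalesce(2)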

print("Partitions before write:", sample_coal.rdd.getNumPartitions())

(
    sample_coal
    .write
    .mode("overwrite")
    .parquet(path_coalesce)
)

print("Files written after coalesce(2):")
display(dbutils.fs.ls(path_coalesce))


## 4. Summary for Students

- When you **write** a DataFrame, each partition usually becomes **one file**.
- `partitionBy(col)` creates **subfolders** based on column values (great for filtering).
- Use `repartition(n)` to change to a specific number of partitions **with a shuffle**.
- Use `coalesce(n)` to **reduce** partitions without a full shuffle; it cannot increase the count.
- Too many small files add overhead, because each file must be listed and opened. Too few very large files can reduce parallelism.
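
To judge whether a write produced too many small files, it can help to summarize the file sizes in an output folder. The helper below is a minimal sketch; `summarize_sizes` is a hypothetical name and the sizes are made up.

```python
def summarize_sizes(sizes_in_bytes):
    """Return (file count, total bytes, average bytes per file)."""
    count = len(sizes_in_bytes)
    total = sum(sizes_in_bytes)
    average = total / count if count else 0.0
    return count, total, average


# Made-up sizes for illustration only.
count, total, average = summarize_sizes([1_000, 3_000, 2_000])
print(f"{count} files, {total} bytes total, {average:.0f} bytes on average")
```

In the notebook, the sizes can come from `[f.size for f in dbutils.fs.ls(path_repart) if f.name.startswith("part-")]`, and the same call on `path_coalesce` gives a comparison.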
