# Partitions, "glom"-style view, Narrow vs Wide (DataFrame-only)

This version avoids the **RDD API**, so it also runs on **serverless** environments
where only DataFrame / SQL operations are supported.

We will use:
- `spark.range()` (built-in) for tiny examples
- `samples.nyctaxi.trips` for a real dataset

Concepts:
1. What is a partition?
2. How to inspect partitions with `spark_partition_id()`
3. A DataFrame-only "glom" style view
4. Narrow vs Wide transformations
5. `repartition` vs `coalesce` (DataFrame API)


In [None]:
from pyspark.sql import functions as F

# Tiny DataFrame with numbers 0..19
df = spark.range(0, 20).toDF("id")

display(df)


## 1. What is a partition? (DataFrame-only)

- A **partition** is a chunk of your data.
- Spark processes partitions in parallel, one task per partition.
- We can see which row is in which partition with the built-in function:
  `spark_partition_id()`.


In [None]:
# Add a column showing the partition id for each row
df_with_pid = df.withColumn("partition_id", F.spark_partition_id())

display(df_with_pid.orderBy("partition_id", "id"))


In [None]:
# Count how many non-empty partitions we have (DataFrame-only).
# Partitions that hold no rows produce no partition id, so they are not counted.
num_partitions = (
    df_with_pid
    .agg(F.countDistinct("partition_id").alias("num_partitions"))
    .collect()[0]["num_partitions"]
)

print("Number of non-empty partitions (DataFrame-only):", num_partitions)
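
Counting distinct `spark_partition_id()` values only sees partitions that hold at least one row. The plain-Python sketch below is a toy model, not Spark code: `split_into_partitions` is a hypothetical helper, and the contiguous-slice layout is an assumption chosen for illustration.

```python
def split_into_partitions(rows, num_partitions):
    """Toy model: slice rows into contiguous chunks; some chunks may be empty."""
    n = len(rows)
    return [
        rows[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

# 3 rows spread over 8 partitions: most partitions end up empty
partitions = split_into_partitions(list(range(3)), 8)

# Tag every row with the id of the partition it lives in
tagged = [(pid, row) for pid, part in enumerate(partitions) for row in part]

total_partitions = len(partitions)
non_empty_partitions = len({pid for pid, _ in tagged})

print("total partitions:    ", total_partitions)
print("non-empty partitions:", non_empty_partitions)
```

With tiny DataFrames the two numbers can differ, which is worth keeping in mind when reading the count above.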


## 2. "glom"-style view without RDD

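In the RDD API, `glom()` returns the elements of each partition as one list per partition.

We can mimic this with DataFrame operations:
- Add `partition_id` with `spark_partition_id()`
- Group by `partition_id`
- Collect the `id` values of each group with `collect_list` (the order inside each list is not guaranteed)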


In [None]:
glom_like = (
    df_with_pid
    .groupBy("partition_id")
    .agg(F.collect_list("id").alias("rows_in_partition"))
    .orderBy("partition_id")
)

display(glom_like)
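
The grouping step can also be sketched in plain Python to show what the result looks like. This is a toy model, not Spark code: `tagged_rows` is made-up data and `glom_like_py` is a hypothetical helper name.

```python
from collections import defaultdict

# (partition_id, id) pairs, shaped like the rows of df_with_pid; the layout is made up
tagged_rows = [(0, 0), (0, 1), (1, 2), (1, 3), (1, 4), (2, 5)]

def glom_like_py(rows):
    """Group values by partition id, returning one list per partition."""
    grouped = defaultdict(list)
    for pid, value in rows:
        grouped[pid].append(value)
    # Sort by partition id, mirroring orderBy("partition_id")
    return dict(sorted(grouped.items()))

glom_result = glom_like_py(tagged_rows)
print(glom_result)
```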


## 3. Narrow vs Wide (Concepts Recap)

**Narrow transformations**:
- Each input partition contributes to **only one** output partition.
- Examples: `select`, `filter`, `withColumn` (most simple column ops).
- No shuffle of data between machines.

**Wide transformations**:
- Each input partition may contribute to **many** output partitions, so rows are **redistributed** across the cluster.
- Examples: `groupBy`, `distinct`, `join`, `repartition`.
- These cause a **shuffle** (expensive).


In [None]:
# Example of NARROW transformations
narrow_df = (
    df
    .withColumn("id_times_2", F.col("id") * 2)  # withColumn = narrow
    .filter(F.col("id") % 2 == 0)              # filter = narrow
    .withColumn("partition_id", F.spark_partition_id())
)

display(narrow_df.orderBy("partition_id", "id"))


Notice that each remaining row keeps the `partition_id` it had in `df_with_pid`:
narrow transformations leave rows in their original partition (as long as we
don't explicitly change the partitioning).
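
A plain-Python toy model can make the "rows stay put" idea concrete. It is not Spark code: the two-partition layout is an assumption and `narrow_step` is a hypothetical helper that works on one partition at a time.

```python
# Two partitions holding ids 0..9; this layout is an assumption for illustration
input_partitions = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

def narrow_step(partition):
    """Add a derived value, then filter, looking only at this one partition."""
    with_col = [(x, x * 2) for x in partition]            # like withColumn
    return [(x, y) for x, y in with_col if x % 2 == 0]    # like filter

# Each output partition is computed from exactly one input partition
narrow_result = [narrow_step(p) for p in input_partitions]

for pid, part in enumerate(narrow_result):
    print(pid, part)
```

No row ever needs data from another partition, which is why no shuffle is required.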


In [None]:
# Example of a WIDE transformation: groupBy (causes shuffle)
wide_df = (
    df
    .withColumn("key", F.col("id") % 3)
    .groupBy("key")
    .agg(F.collect_list("id").alias("ids"))
)

wide_df.explain()  # Look for 'Exchange' in the plan (shuffle)
display(wide_df)


In the physical plan printed by `wide_df.explain()`, you should see an **Exchange**
operator: this is the shuffle Spark performs for the `groupBy`.
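
To see why a `groupBy` needs a shuffle, here is a toy model in plain Python (not Spark code). `shuffle_by_key` is a hypothetical helper; Spark hashes the grouping key to choose a target partition, and the simple modulo rule below is a stand-in for that.

```python
def shuffle_by_key(partitions, key_fn, num_output_partitions):
    """Send each row to the output partition chosen by its key."""
    out = [[] for _ in range(num_output_partitions)]
    for part in partitions:
        for row in part:
            out[key_fn(row) % num_output_partitions].append(row)
    return out

# Same made-up two-partition layout: ids 0..9
input_partitions = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# Group by key = id % 3, as in the wide_df example
shuffled = shuffle_by_key(input_partitions, lambda x: x % 3, 3)

for pid, part in enumerate(shuffled):
    print(pid, part)
```

Every output partition receives rows from both input partitions, which is exactly the data movement a narrow transformation avoids.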


## 4. `repartition` vs `coalesce` (DataFrame-only)

- `repartition(n)`:
  - Can **increase or decrease** partitions.
  - Causes a **shuffle** (wide transformation).
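- `coalesce(n)`:
  - Only **decreases** partitions; asking for more than the current number leaves it unchanged.
  - Merges existing partitions **without a shuffle** (narrow), so the resulting partitions can be uneven in size.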


In [None]:
# Start with our tiny df
df_small = spark.range(0, 20).toDF("id")

# Repartition to 4 partitions (this will shuffle)
df_repart = df_small.repartition(4)
df_repart_pid = df_repart.withColumn("partition_id", F.spark_partition_id())

print("Number of partitions after repartition(4):",
      df_repart_pid.select(F.countDistinct("partition_id")).first()[0])
display(df_repart_pid.orderBy("partition_id", "id"))


In [None]:
# Coalesce down to 2 partitions (merges existing partitions, no shuffle)
df_coal = df_repart.coalesce(2)
df_coal_pid = df_coal.withColumn("partition_id", F.spark_partition_id())

print("Number of partitions after coalesce(2):",
      df_coal_pid.select(F.countDistinct("partition_id")).first()[0])
display(df_coal_pid.orderBy("partition_id", "id"))
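
The difference between the two operations can be sketched with a plain-Python toy model (not Spark code). Both helper names are hypothetical, and the exact row placement Spark chooses may differ; only the overall pattern (redistribute every row vs. merge whole partitions) is the point.

```python
def repartition_round_robin(partitions, n):
    """Toy repartition: every row may move; rows are dealt out in turn."""
    rows = [r for p in partitions for r in p]
    out = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        out[i % n].append(r)
    return out

def coalesce_merge(partitions, n):
    """Toy coalesce: whole partitions are merged; the count never increases."""
    n = min(n, len(partitions))
    out = [[] for _ in range(n)]
    for i, p in enumerate(partitions):
        out[i * n // len(partitions)].extend(p)
    return out

# Four partitions holding ids 0..19; this layout is an assumption
parts = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15)), list(range(15, 20))]

repartitioned = repartition_round_robin(parts, 4)
coalesced = coalesce_merge(parts, 2)
not_increased = coalesce_merge(parts, 8)   # asks for more, stays at 4

print("repartition(4):", repartitioned)
print("coalesce(2):   ", coalesced)
print("coalesce(8) ->", len(not_increased), "partitions")
```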


## 5. Real Dataset Example – `samples.nyctaxi.trips`

Let's repeat some ideas on a real built-in dataset.


In [None]:
nyc_df = spark.read.table("samples.nyctaxi.trips")

nyc_with_pid = nyc_df.withColumn("partition_id", F.spark_partition_id())

# Show the non-empty partitions and how many rows each one holds
partition_counts = (
    nyc_with_pid
    .groupBy("partition_id")
    .count()
    .orderBy("partition_id")
)

display(partition_counts)


In [None]:
# Narrow transformation: add column + filter
nyc_narrow = (
    nyc_df
    .withColumn(
        "trip_duration_min",
        (F.col("tpep_dropoff_datetime").cast("long") -
         F.col("tpep_pickup_datetime").cast("long")) / 60.0
    )
    .filter(F.col("trip_distance") > 1.0)
    .withColumn("partition_id", F.spark_partition_id())
)

display(
    nyc_narrow
    .select("tpep_pickup_datetime", "trip_distance", "trip_duration_min", "partition_id")
    .orderBy("partition_id", "tpep_pickup_datetime")
    .limit(50)
)
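
The `trip_duration_min` arithmetic relies on a timestamp cast to `long` giving seconds since the epoch, so the difference is in seconds and dividing by 60 yields minutes. A plain-Python check with made-up pickup and dropoff times (not taken from the dataset):

```python
from datetime import datetime, timezone

# Made-up example timestamps, not real rows from samples.nyctaxi.trips
pickup = datetime(2016, 2, 14, 16, 52, 13, tzinfo=timezone.utc)
dropoff = datetime(2016, 2, 14, 17, 16, 4, tzinfo=timezone.utc)

# Whole seconds since the epoch, analogous to .cast("long") on a timestamp
pickup_s = int(pickup.timestamp())
dropoff_s = int(dropoff.timestamp())

# Difference in seconds, converted to minutes
trip_duration_min = (dropoff_s - pickup_s) / 60.0

print(trip_duration_min)
```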


In [None]:
# Wide transformation: groupBy (shuffle)
nyc_grouped = (
    nyc_df
    .groupBy("passenger_count")
    .agg(F.avg("trip_distance").alias("avg_distance"))
)

nyc_grouped.explain()  # Look for Exchange (shuffle)
display(nyc_grouped.orderBy("passenger_count"))
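
For aggregations like this, the plan typically shows a partial aggregation before the Exchange and a final one after it, so only small per-partition summaries are shuffled. The plain-Python sketch below is a toy model of that two-phase average, not Spark code; the data and both helper names are hypothetical.

```python
from collections import defaultdict

# Made-up (passenger_count, trip_distance) rows in two partitions
partitions = [
    [(1, 2.0), (2, 5.0), (1, 4.0)],
    [(1, 6.0), (2, 1.0)],
]

def partial_aggregate(partition):
    """Phase 1 (before the shuffle): per-key (sum, count) within one partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, distance in partition:
        acc[key][0] += distance
        acc[key][1] += 1
    return {k: tuple(v) for k, v in acc.items()}

def final_aggregate(partials):
    """Phase 2 (after the shuffle): merge partial sums and counts, then divide."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    return {k: s / c for k, (s, c) in sorted(merged.items())}

avg_distance = final_aggregate(partial_aggregate(p) for p in partitions)
print(avg_distance)
```

Keeping a (sum, count) pair per key is what makes the average mergeable; averaging the per-partition averages would give the wrong answer when partition sizes differ.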


## 6. Summary for Students (DataFrame-only)

- We can understand partitions using **`spark_partition_id()`** without touching RDDs.
- Our "glom"-style view is just:
  - add `partition_id`
  - group by `partition_id`
  - `collect_list` the rows.
- **Narrow transforms** (e.g. `withColumn`, `filter`) usually keep partitioning.
- **Wide transforms** (e.g. `groupBy`, `repartition`) cause shuffles.
- `repartition(n)` → change to n partitions with a shuffle.
- `coalesce(n)` → reduce to n partitions by merging existing ones, without a shuffle.
