**⭐ 1. What This Pattern Solves**

Efficiently controls the number of partitions in a PySpark DataFrame to optimize parallelism and shuffle costs.

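repartition → increases, decreases, or redistributes partitions with a full shuffle; rows are spread evenly when no column is given and hash-partitioned when one is.

coalesce → reduces partitions by merging existing ones without a full shuffle, which is usually faster for downsizing.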

Use cases:

Scaling out joins or aggregations (repartition).

Writing output as fewer, larger files (coalesce).

Optimizing memory and execution in distributed jobs.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Repartition-like effect (force shuffle)
CREATE TABLE new_table AS
SELECT *
FROM big_table
DISTRIBUTE BY some_column;

-- Coalesce-like effect (reduce partitions, and so output files)
-- Spark SQL exposes this as a partitioning hint:
INSERT OVERWRITE TABLE small_table
SELECT /*+ COALESCE(4) */ *
FROM big_table;


**⭐ 3. Core Idea**

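Repartition: Full shuffle → redistributes data into the requested number of partitions (evenly when no column is given; by hash of the column otherwise, so skewed keys stay skewed) → can raise or lower parallelism.

Coalesce: Avoids shuffle → merges existing partitions → faster for downsizing, but the merged partitions can be uneven.

Always balance the number of partitions against data size: too few partitions leave cores idle, too many add scheduling overhead and small files.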
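As a rough illustration of balancing partition count against data size, the sketch below derives a partition count from an estimated dataset size. The helper name `estimate_num_partitions` and the 128 MiB target are assumptions for illustration (128 MB is a common rule of thumb and matches the default of `spark.sql.files.maxPartitionBytes`); tune the target for your own cluster.

```python
import math

# Assumed target size per partition (hypothetical default; tune per workload).
DEFAULT_TARGET_BYTES = 128 * 1024 ** 2  # 128 MiB


def estimate_num_partitions(data_size_bytes,
                            target_partition_bytes=DEFAULT_TARGET_BYTES,
                            min_partitions=1):
    """Return a partition count that keeps each partition near the target size."""
    if data_size_bytes < 0:
        raise ValueError("data_size_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Round up so the average partition does not exceed the target size.
    needed = math.ceil(data_size_bytes / target_partition_bytes)
    return max(min_partitions, needed)


# Example: a 100 GiB dataset at 128 MiB per partition.
num_partitions = estimate_num_partitions(100 * 1024 ** 3)
print(num_partitions)  # 800
```

The result could then be passed to `df.repartition(num_partitions)`; treat it as a starting point and adjust after checking task durations in the Spark UI.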

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
# Repartition: increase or redistribute partitions
df_repart = df.repartition(num_partitions, "column_name")

# Coalesce: reduce partitions without shuffle
df_coal = df.coalesce(num_partitions)

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PerfPatterns").getOrCreate()

data = [("Alice", 100), ("Bob", 200), ("Charlie", 300)]
df = spark.createDataFrame(data, ["name", "score"])

# Original partition count (depends on the session's default parallelism)
print(df.rdd.getNumPartitions())  # e.g., 1

# Redistribute into 4 partitions (full shuffle)
df_repart = df.repartition(4)
print(df_repart.rdd.getNumPartitions())  # 4

# Reduce partitions (no shuffle)
df_coal = df_repart.coalesce(2)
print(df_coal.rdd.getNumPartitions())  # 2


**Explanation:**

Repartition → forces a shuffle → rows are redistributed across 4 partitions (with only 3 rows, at least one partition stays empty).

Coalesce → merges the 4 existing partitions into 2 → no shuffle → faster than repartition(2).
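To make the difference concrete, here is a toy model in plain Python, not Spark's actual implementation. Partitions are modelled as lists of rows; `toy_repartition` and `toy_coalesce` are hypothetical names, and the real coalescer also takes data locality into account. The point is only that a shuffle can rebalance rows, while merging whole partitions keeps any existing skew.

```python
def toy_repartition(partitions, num_partitions):
    """Model a full shuffle: deal every row round-robin into new partitions."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        out[i % num_partitions].append(row)
    return out


def toy_coalesce(partitions, num_partitions):
    """Model coalesce: merge whole neighbouring partitions, never split one."""
    current = len(partitions)
    if num_partitions >= current:
        # Like DataFrame.coalesce, asking for more partitions changes nothing.
        return [list(part) for part in partitions]
    out = []
    for i in range(num_partitions):
        start = i * current // num_partitions
        end = (i + 1) * current // num_partitions
        out.append([row for part in partitions[start:end] for row in part])
    return out


# One large partition and three tiny ones (9 rows in total).
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print([len(p) for p in toy_repartition(skewed, 2)])  # [5, 4]
print([len(p) for p in toy_coalesce(skewed, 2)])     # [7, 2]
print(len(toy_coalesce(skewed, 8)))                  # 4
```

In this model the shuffle evens out the skew at the cost of touching every row, while the merge is cheap but leaves one partition much larger than the other.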

**⭐ 6. Mini Practice Problems**

Increase DataFrame df from 3 to 8 partitions using column "score".

Reduce a 10-partition DataFrame to 3 partitions without shuffle.

When writing a DataFrame to disk, how would you reduce the output files efficiently?

**⭐ 7. Full Data Engineering Problem**

Scenario: You have a 100 GB transactional dataset in Bronze. You need to run a join-heavy aggregation on it for analytics and write the result to Silver as 10 Parquet files.

In [0]:
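# Step 1: Hash-partition by the grouping key into 200 shuffle partitions
df_silver = df_bronze.repartition(200, "customer_id")

# Step 2: Aggregate per customer (can reuse the partitioning from Step 1)
agg_df = df_silver.groupBy("customer_id").sum("amount")

# Step 3: Coalesce so the write produces 10 files
# Note: coalesce is not a stage boundary, so the final aggregation also runs in 10 tasks;
# use repartition(10) instead if that stage becomes the bottleneck
agg_df.coalesce(10).write.parquet("s3://silver/aggregates/")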


**⭐ 8. Time & Space Complexity**

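| Operation   | Time Complexity                                              | Space Complexity                                          |
| ----------- | ------------------------------------------------------------ | --------------------------------------------------------- |
| repartition | O(n): every row is serialized and moved in a full shuffle    | O(n) shuffle data on disk; about n/p rows per partition   |
| coalesce    | O(n/p) per output task; rows are read in place, no shuffle   | About n/p rows per merged partition; no shuffle files     |

Here n is the number of rows and p is the target number of partitions.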


**⭐ 9. Common Pitfalls**

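Using repartition unnecessarily → triggers a costly full shuffle.

Using coalesce to increase partitions → has no effect; the DataFrame keeps its current partition count. Use repartition instead.

Ignoring partitioning when writing large datasets → too many partitions at write time cause the small files problem.

Ignoring cluster resources → far more partitions than cores adds scheduling overhead; far fewer leaves cores idle.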
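The first two pitfalls can be summarized as a small decision helper. This is a minimal sketch in plain Python; `choose_partition_op` is a hypothetical name, and its suggestion is only a default, since a drastic coalesce can also reduce the parallelism of the upstream computation.

```python
def choose_partition_op(current, target, by_column=False):
    """Suggest 'repartition', 'coalesce', or 'none' for a partition-count change."""
    if current < 1 or target < 1:
        raise ValueError("partition counts must be at least 1")
    if by_column:
        # Redistributing rows by a key always requires a shuffle.
        return "repartition"
    if target > current:
        # coalesce cannot raise the partition count.
        return "repartition"
    if target < current:
        # Shrinking without a key can skip the shuffle.
        return "coalesce"
    # Same count and no key: any operation would be wasted work.
    return "none"


print(choose_partition_op(3, 8))    # repartition
print(choose_partition_op(10, 3))   # coalesce
print(choose_partition_op(4, 4))    # none
```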