**⭐ 1. What This Pattern Solves**

Mitigates the performance degradation that occurs when a join key is skewed (a few keys hold most of the rows), which concentrates the work in a handful of tasks and produces long-running stragglers.

**⭐ 2. SQL Equivalent**

Standard SQL has no single hint for this; the fix is implemented through data transformations such as salting or joining the heavy keys separately.

**⭐ 3. Core Idea**

Two complementary moves: (a) handle the heavy keys separately from the rest, and (b) salt the join key, i.e. append a random suffix so that the rows of one heavy key spread across many partitions. The small side can also be replicated selectively, or the data pre-aggregated map-side before the join. Main approaches: salting, range/bucket join, and special-casing the skewed keys.

**⭐ 4. Template Code (MEMORIZE THIS) — SALTING**

from pyspark.sql import functions as F

# Assumes `left` (large, skewed) and `right` (smaller) are existing DataFrames with a 'key' column.

# 1. Mark heavy keys (example list heavy_keys)
heavy_keys = [...]  # discovered from stats
is_heavy = F.col('key').isin(heavy_keys)

# 2. Left: random salt in [0, salt_count) for heavy keys, 0 for everything else
salt_count = 10
left_salted = (left
               .withColumn('salt', F.when(is_heavy, (F.rand() * salt_count).cast('int'))
                                    .otherwise(F.lit(0)))
               .withColumn('salted_key', F.concat_ws('_', F.col('key'), F.col('salt'))))

# 3. Right: replicate heavy keys once per salt value 0..salt_count-1
replicated = (right.filter(is_heavy)
              .withColumn('salt', F.explode(F.array([F.lit(i) for i in range(salt_count)])))
              .withColumn('salted_key', F.concat_ws('_', F.col('key'), F.col('salt'))))

# 4. Right: non-heavy keys keep a single copy with salt=0
right_rest = (right.filter(~is_heavy)
              .withColumn('salt', F.lit(0))
              .withColumn('salted_key', F.concat_ws('_', F.col('key'), F.col('salt'))))

# 5. Drop the right side's key/salt so the join output has no duplicate column names,
#    join on the salted key, then remove the helper columns from the result
right_salted = replicated.unionByName(right_rest).drop('key', 'salt')
joined = (left_salted.join(right_salted, on='salted_key', how='inner')
          .drop('salt', 'salted_key'))
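
The heavy_keys placeholder in the template is left open. One possible way to fill it, sketched here in plain Python, is to threshold on each key's share of the total row count. The function name `pick_heavy_keys`, the `min_share` parameter, and the counts below are all illustrative assumptions, not part of the original template.

```python
from typing import Dict, Hashable, List


def pick_heavy_keys(key_counts: Dict[Hashable, int], min_share: float = 0.05) -> List[Hashable]:
    """Return keys holding at least `min_share` of all rows, largest first."""
    total = sum(key_counts.values())
    if total == 0:
        return []
    heavy = [k for k, n in key_counts.items() if n / total >= min_share]
    return sorted(heavy, key=lambda k: key_counts[k], reverse=True)


# In Spark the counts could be collected with something like:
#   key_counts = {r['key']: r['count'] for r in left.groupBy('key').count().collect()}
# Collecting is only reasonable when the number of distinct keys is modest.
counts = {"US": 7000, "DE": 1200, "FR": 900, "JP": 500, "BR": 400}  # made-up counts
heavy_keys = pick_heavy_keys(counts, min_share=0.10)
print(heavy_keys)  # ['US', 'DE']
```

The threshold is a judgment call: a key is worth salting only if its rows alone would make one task much larger than the others.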


**⭐ 5. Detailed Example**

Discover skew: left.groupBy('key').count().orderBy(F.desc('count')).show(10) → find keys with huge counts.

Choose salt_count (e.g., 8–32).

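Add salt to left only for heavy keys; replicate the right side's heavy keys by exploding over all salt values, so that every salt a left row can receive has a matching row on the right.

Join on salted_key, which spreads each heavy key's workload across up to salt_count tasks, then drop the salt columns.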
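
The steps above can be sanity-checked without a cluster. The following is a minimal plain-Python sketch, not Spark code, that models both sides as lists of (key, value) pairs and confirms that the salted join returns the same rows as a plain inner join. The function names and sample rows are hypothetical.

```python
import random
from collections import defaultdict


def plain_join(left, right):
    """Inner join of (key, value) pairs on key."""
    by_key = defaultdict(list)
    for k, rv in right:
        by_key[k].append(rv)
    return sorted((k, lv, rv) for k, lv in left for rv in by_key.get(k, []))


def salted_join(left, right, heavy_keys, salt_count, seed=0):
    """Same join, but heavy keys are matched on (key, salt) instead of key."""
    rng = random.Random(seed)
    # Left: random salt for heavy keys, salt 0 for everything else.
    left_salted = [
        ((k, rng.randrange(salt_count) if k in heavy_keys else 0), lv)
        for k, lv in left
    ]
    # Right: heavy keys replicated once per salt value, others kept once with salt 0.
    by_salted = defaultdict(list)
    for k, rv in right:
        salts = range(salt_count) if k in heavy_keys else (0,)
        for s in salts:
            by_salted[(k, s)].append(rv)
    # Join on (key, salt), then drop the salt from the output.
    return sorted(
        (k, lv, rv)
        for (k, s), lv in left_salted
        for rv in by_salted.get((k, s), [])
    )


left_rows = [("US", i) for i in range(6)] + [("DE", 100), ("FR", 200)]
right_rows = [("US", "United States"), ("DE", "Germany")]

expected = plain_join(left_rows, right_rows)
actual = salted_join(left_rows, right_rows, heavy_keys={"US"}, salt_count=4)
print(actual == expected)  # True
```

The result does not depend on which salts the random generator picks, because the right side carries every salt value for each heavy key.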

**⭐ 7. Full Data Engineering Problem**

You run a daily enrichment job: clicks (large, highly skewed toward ip_country='US') joins dim_country (small). US accounts for 70% of the rows, which causes stragglers. Implement skew mitigation: salt the US rows in clicks, replicate the US row of dim_country once per salt value, join on the salted key, then remove the salt. Provide metrics showing that the maximum task time dropped.
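
For the metrics part of the problem, one simple option is to compare the maximum against the median of a per-task measurement before and after salting. The sketch below is plain Python; `skew_summary` is a hypothetical helper and the task durations are made-up illustrative numbers, not measurements. In practice the inputs could be task durations read from the Spark UI stage page, or row counts per partition obtained by grouping on F.spark_partition_id().

```python
from statistics import median


def skew_summary(per_task):
    """Summarize per-task measurements (durations or row counts)."""
    if not per_task:
        raise ValueError("per_task must not be empty")
    largest = max(per_task)
    mid = median(per_task)
    ratio = largest / mid if mid else float("inf")
    return {"max": largest, "median": mid, "max_over_median": ratio}


# Made-up task durations in seconds, for illustration only.
before = [410, 12, 11, 13, 12, 10, 11, 12]   # one straggler dominates the stage
after = [55, 48, 52, 50, 47, 51, 49, 53]     # work spread across tasks

summary_before = skew_summary(before)
summary_after = skew_summary(after)
print(summary_before)
print(summary_after)
```

A max_over_median ratio close to 1 suggests the stage is balanced; a large ratio points at a straggler.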

**⭐ 8. Time & Space Complexity**

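Replication is the extra cost: each right-side heavy-key row is copied salt_count times, which increases the amount of data shuffled and joined. In return, the largest task shrinks by up to a factor of salt_count, assuming the salts are spread evenly and the heavy key dominated that task. The tradeoff is extra I/O for replication versus the parallelism gained.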
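
The tradeoff can be put into rough numbers. This is a back-of-the-envelope sketch that assumes salts are distributed evenly; `salting_estimate` is a hypothetical helper and the row counts are invented for illustration.

```python
import math


def salting_estimate(heavy_left_rows, heavy_right_rows, salt_count):
    """Estimate the effect of salting one heavy key, assuming evenly spread salts."""
    return {
        # Left rows of the heavy key that end up in each salted group.
        "largest_group_rows": math.ceil(heavy_left_rows / salt_count),
        # Right rows for the heavy key after copying once per salt value.
        "right_rows_after_replication": heavy_right_rows * salt_count,
        # Additional right rows created purely by replication.
        "extra_right_rows": heavy_right_rows * (salt_count - 1),
    }


# Invented sizes: 700 million left rows for the heavy key, 1 matching right row.
estimate = salting_estimate(heavy_left_rows=700_000_000, heavy_right_rows=1, salt_count=10)
print(estimate)
```

With a tiny right side, replication is almost free and the gain is close to salt_count. When the heavy key also has many right-side rows, the replicated volume grows linearly with salt_count and the benefit shrinks.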

**⭐ 9. Common Pitfalls**

Choosing too high a salt_count → replication overhead outweighs the benefit.

Forgetting to replicate right-side heavy keys with the same salt values → left rows whose salt has no match silently drop out of an inner join.

Not removing the salt column afterwards (pollutes downstream).

Overlooking alternative fixes: filtering heavy keys and joining separately, or rethinking schema.