# PySpark: Zero to Hero
## Module 12: Partitioning Strategies & Joins

In this module, we cover two critical topics:
1.  **Partitioning:** Controlling how data is physically distributed across the cluster using `repartition()` and `coalesce()`.
2.  **Joins:** Combining two DataFrames using Inner and Left joins, and handling duplicate column names.

### Agenda:
1.  **Data Creation:** Employee and Department datasets.
2.  **Partitioning:**
    *   `repartition()` vs `coalesce()`.
    *   Partitioning by Column.
    *   Visualizing partition distribution using `spark_partition_id()`.
3.  **Joins:**
    *   Inner Join.
    *   Left Join.
    *   Handling **Ambiguous Columns** (a common source of join errors).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, spark_partition_id

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Partitions_and_Joins") \
    .master("local[*]") \
    .getOrCreate()

# 1. Employee Data
emp_data = [
    ("001", "John Doe", "101", 50000),
    ("002", "Jane Smith", "102", 60000),
    ("003", "Bob Brown", "101", 55000),
    ("004", "Alice Lee", "103", 52000),
    ("005", "Jack Chan", "102", 48000),
    ("006", "N/A", "104", 40000) # Dept 104 exists in Emp but not Dept
]
emp_cols = ["emp_id", "name", "dept_id", "salary"]
emp_df = spark.createDataFrame(emp_data, emp_cols)

# 2. Department Data
dept_data = [
    ("101", "HR", "NY"),
    ("102", "Finance", "CA"),
    ("103", "Marketing", "TX"),
    ("105", "Sales", "FL") # Dept 105 exists in Dept but not Emp
]
dept_cols = ["dept_id", "dept_name", "location"]
dept_df = spark.createDataFrame(dept_data, dept_cols)

print("--- Employee Data ---")
emp_df.show()
print("--- Department Data ---")
dept_df.show()

In [None]:
# Check current number of partitions
print(f"Current Emp Partitions: {emp_df.rdd.getNumPartitions()}")

# VISUALIZE PARTITIONS: 
# We add a column 'partition_id' to see which partition each row resides in.
emp_df.withColumn("partition_id", spark_partition_id()).show()

In [None]:
# 1. Repartition (Full Shuffle)
# Can increase or decrease the number of partitions.
# Without a column argument, rows are spread roughly evenly (round-robin).
# Expensive operation: every row may move across the network.
df_repartitioned = emp_df.repartition(4)
print(f"Repartition Count: {df_repartitioned.rdd.getNumPartitions()}")

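# 2. Coalesce (Minimize Shuffle)
# Can ONLY decrease partitions; asking for more than the current count leaves it unchanged.
# Merges existing partitions instead of reshuffling every row, so it is cheaper.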
df_coalesced = df_repartitioned.coalesce(2)
print(f"Coalesce Count: {df_coalesced.rdd.getNumPartitions()}")
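
The difference between the two calls can be sketched without a cluster. The following is a plain-Python model, not Spark's implementation: the function names `round_robin_repartition` and `coalesce_model` are hypothetical, and real Spark decides which partitions to merge based on where they live, which this sketch ignores.

```python
# Plain-Python model of repartition(n) vs coalesce(n).
# Assumption: a "partition" is just a list of row ids; this is NOT Spark code.

def round_robin_repartition(partitions, n):
    """Model of repartition(n): flatten everything and deal rows out again."""
    rows = [row for part in partitions for row in part]
    new_parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)  # every row may land somewhere new
    return new_parts

def coalesce_model(partitions, n):
    """Model of coalesce(n): merge whole partitions, never split one."""
    if n >= len(partitions):
        return partitions  # cannot increase the partition count
    new_parts = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new_parts[i % n].extend(part)  # a partition moves as one unit
    return new_parts

start = [["001", "002", "003"], ["004", "005", "006"]]
four = round_robin_repartition(start, 4)
two = coalesce_model(four, 2)
unchanged = coalesce_model(four, 8)

print([len(p) for p in four])
print([len(p) for p in two])
print(len(unchanged))
```

In this model, repartitioning six rows into four partitions gives sizes 2, 2, 1, 1, coalescing to two gives 3 and 3, and asking `coalesce_model` for eight partitions leaves the count at four.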

In [None]:
# Repartitioning by a column hash-partitions the rows, so all rows with the same key end up in the same partition.
# This can help before joins or groupBy on that key, because matching rows are already co-located.

df_by_dept = emp_df.repartition(4, "dept_id")

print("--- Data Repartitioned by Dept ID ---")
# Notice how rows with the same dept_id share a partition_id (different dept_ids may still share one).
df_by_dept.withColumn("partition_id", spark_partition_id()).show()
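
Why equal keys end up together can be shown with a small hash-partitioning model. This is a sketch under assumptions: `partition_for` is a hypothetical helper, and `zlib.crc32` is only a deterministic stand-in for the hash function Spark uses internally, so the partition numbers will not match what `spark_partition_id()` reports.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition index (crc32 is a stand-in hash, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# (emp_id, dept_id) pairs taken from the employee data above.
emp_rows = [
    ("001", "101"), ("002", "102"), ("003", "101"),
    ("004", "103"), ("005", "102"), ("006", "104"),
]

# Group rows by the partition their dept_id hashes to.
assignment = {}
for emp_id, dept_id in emp_rows:
    pid = partition_for(dept_id, 4)
    assignment.setdefault(pid, []).append((emp_id, dept_id))

# For every dept_id, collect the set of partitions that hold it.
partitions_per_dept = {}
for pid, rows in assignment.items():
    for _, dept_id in rows:
        partitions_per_dept.setdefault(dept_id, set()).add(pid)

for pid in sorted(assignment):
    print(pid, assignment[pid])
```

Each `dept_id` maps to exactly one partition because the index depends only on the key. The reverse does not hold: two different keys can hash to the same partition, and some of the four partitions may stay empty.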

In [None]:
# Inner Join: Returns only matching records (Dept 101, 102, 103)
# Syntax: df1.join(df2, condition, type)

join_condition = emp_df["dept_id"] == dept_df["dept_id"]

df_inner = emp_df.join(dept_df, join_condition, "inner")

print("--- Inner Join ---")
df_inner.show()
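
Which rows survive the inner join follows directly from comparing the key sets of the two datasets. The snippet below is a plain-Python check on the sample `dept_id` values, not Spark code, and the variable names are illustrative.

```python
from collections import Counter

# dept_id values as they appear in the employee and department data above.
emp_dept_ids = ["101", "102", "101", "103", "102", "104"]
dept_dept_ids = ["101", "102", "103", "105"]

matched = sorted(set(emp_dept_ids) & set(dept_dept_ids))     # kept by an inner join
emp_only = sorted(set(emp_dept_ids) - set(dept_dept_ids))    # dropped employees' dept
dept_only = sorted(set(dept_dept_ids) - set(emp_dept_ids))   # dropped departments

# Each employee row is emitted once per matching department row.
dept_counts = Counter(dept_dept_ids)
inner_row_count = sum(dept_counts[d] for d in emp_dept_ids)

print(matched, emp_only, dept_only, inner_row_count)
```

The matched keys are 101, 102, and 103, so the inner join returns five rows: the employee in department 104 and the Sales department (105) both drop out.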

In [None]:
# PROBLEM: In the previous join, 'dept_id' appears twice (once from emp, once from dept).
# If we select "dept_id" by name, Spark cannot tell which one is meant and raises an AnalysisException (ambiguous reference).

# df_inner.select("dept_id").show()  # <--- This will FAIL

# SOLUTION 1: Reference the specific DataFrame
df_inner.select(emp_df["dept_id"], "name", "dept_name").show()

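# SOLUTION 2 (Better): Drop the duplicate column after joining
# (renaming one of the two columns before the join also works).
# Alternative: pass the column name instead of an expression, e.g.
# emp_df.join(dept_df, "dept_id", "inner"), which keeps a single dept_id column.
df_clean_join = emp_df.join(dept_df, join_condition, "inner") \
    .drop(dept_df["dept_id"])  # Drop the duplicate column from the right side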

print("--- Clean Join (No Duplicate Columns) ---")
df_clean_join.show()

In [None]:
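# Left Join: Returns all rows from Left (Emp) and matched rows from Right (Dept).
# Unmatched left rows get NULL in every Dept column (the Dept 104 employee has NULL dept_name and location).
# Dept 105 (Sales) has no employees, so it does not appear in the result.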

df_left = emp_df.join(dept_df, join_condition, "left")

print("--- Left Join ---")
df_left.show()
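
The NULL-filling behaviour can be modelled in a few lines of plain Python. This is an illustrative sketch, not Spark's algorithm: `left_join` is a hypothetical helper, rows are dictionaries, and the key column is kept only once, whereas the expression-based Spark join above keeps both `dept_id` columns.

```python
def left_join(left_rows, right_rows, key):
    """Return every left row, extended with matching right columns or None."""
    # Index the right side by key so each left row can look up its matches.
    right_by_key = {}
    for right in right_rows:
        right_by_key.setdefault(right[key], []).append(right)
    # Columns contributed by the right side (the key is kept once, from the left).
    right_cols = [c for c in right_rows[0] if c != key]

    joined = []
    for left in left_rows:
        matches = right_by_key.get(left[key], [])
        if not matches:
            # No match: keep the left row and fill right columns with None.
            joined.append({**left, **dict.fromkeys(right_cols)})
        for right in matches:
            joined.append({**left, **{c: right[c] for c in right_cols}})
    return joined

# A subset of the sample data above.
emp_rows = [
    {"name": "John Doe", "dept_id": "101"},
    {"name": "Alice Lee", "dept_id": "103"},
    {"name": "N/A", "dept_id": "104"},
]
dept_rows = [
    {"dept_id": "101", "dept_name": "HR"},
    {"dept_id": "103", "dept_name": "Marketing"},
    {"dept_id": "105", "dept_name": "Sales"},
]

left_result = left_join(emp_rows, dept_rows, "dept_id")
for row in left_result:
    print(row)
```

All three employee rows come back. The department 104 row carries `None` for `dept_name`, and Sales never appears because nothing on the left references department 105.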

In [None]:
# You can combine multiple conditions in a join using & (AND) / | (OR).
# Wrap each condition in parentheses: in Python, & and | bind tighter than == and >.

# Example: Join where Dept ID matches AND Salary > 50000
complex_condition = (emp_df["dept_id"] == dept_df["dept_id"]) & (emp_df["salary"] > 50000)

df_complex = emp_df.join(dept_df, complex_condition, "inner")

print("--- Complex Join (Match + Salary > 50k) ---")
df_complex.show()
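
To see which rows this condition keeps, the same predicate can be applied to the sample data in plain Python. This is a model of the logic only, and it assumes each `dept_id` appears once in the department data so that a dictionary lookup is enough.

```python
# (name, dept_id, salary) from the employee data above.
emp_rows = [
    ("John Doe", "101", 50000),
    ("Jane Smith", "102", 60000),
    ("Bob Brown", "101", 55000),
    ("Alice Lee", "103", 52000),
    ("Jack Chan", "102", 48000),
    ("N/A", "104", 40000),
]
# dept_id -> dept_name from the department data above.
dept_names = {"101": "HR", "102": "Finance", "103": "Marketing", "105": "Sales"}

# Keep a row only when BOTH parts of the condition hold.
complex_result = [
    (name, dept_names[dept_id], salary)
    for name, dept_id, salary in emp_rows
    if dept_id in dept_names and salary > 50000
]

for row in complex_result:
    print(row)
```

Three employees qualify: Jane Smith, Bob Brown, and Alice Lee. John Doe is excluded because his salary is exactly 50000 and the comparison is strict. The row order Spark prints may differ from this list.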

## Summary

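1.  **`repartition(n)`**: Increases or decreases partitions. Performs a full shuffle. Useful for raising parallelism or co-locating rows by key before joins and aggregations.
2.  **`coalesce(n)`**: Only decreases partitions. Avoids a full shuffle. Useful for reducing the number of output files before a write.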
3.  **`spark_partition_id()`**: Useful debugging function to see data distribution.
4.  **Joins**:
    *   Always be careful of **Ambiguous Columns** (columns with same name in both tables).
    *   Best practice: Explicitly reference `df['col']` or drop the duplicate column immediately after join.

**Next Steps:**
In the next module, we will look at **Reading and Writing Files** (CSV, Parquet) and understanding **Spark Schemas** in depth.