### Broadcast Join in Spark SQL

In PySpark, the `pyspark.sql.functions.broadcast()` function marks a DataFrame as small enough to broadcast, which lets Spark join it efficiently with a larger one. Because Spark distributes data across multiple nodes, joining two DataFrames typically requires a shuffle: rows are exchanged between nodes so that rows with matching join keys end up on the same node. That network transfer makes ordinary joins expensive in both time and resources. Broadcasting avoids the overhead by sending a full copy of the smaller DataFrame to every executor, so each one can join its own partitions of the larger DataFrame locally (a map-side join) without moving them. The broadcast DataFrame must be small enough to fit in memory on the driver and on each executor.
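To make the difference concrete, here is a minimal plain-Python sketch of the two strategies. It is only an illustration of the idea, not how Spark implements joins: the list-of-lists "partitions", the function names `shuffle_join` and `broadcast_join`, and the row counter are all assumptions made for this example.

```python
from collections import defaultdict

# Toy "partitions" of the large table, as if spread over three executors.
# Each row is (emp_id, emp_name, dept_id).
employee_partitions = [
    [(101, "Alice", 1), (102, "Bob", 2)],
    [(103, "Charlie", 1), (104, "David", 3)],
    [(105, "Eva", 2)],
]
# Small lookup table: (dept_id, dept_name).
departments = [(1, "HR"), (2, "Engineering"), (3, "Finance")]


def shuffle_join(partitions, small_rows, num_partitions=2):
    """Repartition both sides by join key, then join bucket by bucket."""
    big_buckets = defaultdict(list)
    small_buckets = defaultdict(list)
    large_rows_moved = 0
    for part in partitions:
        for row in part:
            # Every large-table row is routed to the bucket that owns its key.
            big_buckets[row[2] % num_partitions].append(row)
            large_rows_moved += 1
    for row in small_rows:
        small_buckets[row[0] % num_partitions].append(row)
    joined = []
    for p in range(num_partitions):
        lookup = dict(small_buckets[p])
        joined += [(e, n, d, lookup[d]) for (e, n, d) in big_buckets[p] if d in lookup]
    return joined, large_rows_moved


def broadcast_join(partitions, small_rows):
    """Copy the small table to every partition and join locally."""
    lookup = dict(small_rows)  # hash table built once from the small side
    joined = []
    for part in partitions:
        # Each partition probes its own copy; large-table rows never move.
        joined += [(e, n, d, lookup[d]) for (e, n, d) in part if d in lookup]
    return joined, 0


shuffled, moved_by_shuffle = shuffle_join(employee_partitions, departments)
broadcasted, moved_by_broadcast = broadcast_join(employee_partitions, departments)
print(sorted(shuffled) == sorted(broadcasted))  # same result either way
print(moved_by_shuffle, moved_by_broadcast)     # large-table rows moved
```

Both functions return the same joined rows; the difference is that the shuffle version has to move every row of the large table, while the broadcast version only copies the small one.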

In [0]:
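from pyspark.sql import SparkSession

# Reuse the notebook's existing session if there is one, otherwise create it
spark = SparkSession.builder.getOrCreate()

# Sample employee data (stands in for the larger dataset)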
employee_data = [
    (101, "Alice", 1, 60000),
    (102, "Bob", 2, 70000),
    (103, "Charlie", 1, 75000),
    (104, "David", 3, 50000),
    (105, "Eva", 2, 72000)
]
employee_columns = ["emp_id", "emp_name", "dept_id", "salary"]
employee_df = spark.createDataFrame(employee_data, employee_columns)

# Sample department data (small lookup table)
department_data = [
    (1, "HR"),
    (2, "Engineering"),
    (3, "Finance")
]
department_columns = ["dept_id", "dept_name"]
department_df = spark.createDataFrame(department_data, department_columns)

### Perform a regular (shuffle) join

In [0]:
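# Disable automatic broadcast joins (-1) so Spark does not broadcast
# the small table on its own and the shuffle-based plan is visible
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)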

In [0]:
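# Plain join with no broadcast hint
joined_df = employee_df.join(department_df, on="dept_id", how="inner")
joined_df.show()
# With auto-broadcast disabled, the physical plan should typically show
# a SortMergeJoin with an Exchange (shuffle) on both sides
joined_df.explain()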

### Perform a broadcast join

In [0]:
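# Restore the default auto-broadcast threshold of 10 MB (value is in bytes).
# An explicit broadcast() hint is honored regardless of this setting.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)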

In [0]:
from pyspark.sql.functions import broadcast

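# Hint that department_df should be copied to every executor
joined_df = employee_df.join(
    broadcast(department_df),
    on="dept_id",
    how="inner"
)

joined_df.show()
# The physical plan should now show a BroadcastHashJoin, with a
# BroadcastExchange on the department side and no shuffle of employee_df
joined_df.explain()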