BroadcastVariable 
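- Optimizes join operations by sending a full copy of the smaller dataset to every executor, so the larger dataset does not have to be shuffled across the cluster.
- Suitable when one side of the join is fairly small. Spark broadcasts a side automatically when its estimated size is below the threshold set by "spark.sql.autoBroadcastJoinThreshold", which defaults to 10 MB.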
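
To see why the size of the smaller side matters, here is a rough back-of-the-envelope sketch in plain Python. It is not Spark code: the function names, row counts, and executor count are made up for illustration, and the cost model (a shuffle join moves every row of both sides once, a broadcast join copies the small side to each executor) is a deliberate simplification.

```python
def shuffle_join_rows_moved(large_rows: int, small_rows: int) -> int:
    """Simplified model: both sides are repartitioned by the join key,
    so in the worst case every row crosses the network once."""
    return large_rows + small_rows


def broadcast_join_rows_moved(small_rows: int, num_executors: int) -> int:
    """Simplified model: the small side is copied to every executor,
    while the large side stays where it is."""
    return small_rows * num_executors


# Hypothetical sizes, chosen only for illustration.
large_rows = 100_000_000
small_rows = 10_000
num_executors = 50

print(shuffle_join_rows_moved(large_rows, small_rows))       # 100010000
print(broadcast_join_rows_moved(small_rows, num_executors))  # 500000
```

Under this model the broadcast cost grows with the small side and the number of executors, which is why broadcasting stops paying off once the "small" side is no longer small.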

**Explicit Broadcast Join:**

Usage: In this type of broadcast join, you explicitly mark which DataFrame should be broadcast by wrapping it in the broadcast() function from pyspark.sql.functions.

Advantage: Gives you fine-grained control over which DataFrame to broadcast.


**Auto Broadcast Join:**

Usage: In an auto broadcast join, Spark's optimizer decides on its own whether to use a broadcast join, based on the estimated size of each DataFrame and the autoBroadcastJoinThreshold configuration parameter.

Advantage: Simplifies the process as the system automatically decides whether to broadcast a DataFrame.
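
The size check described above can be sketched in plain Python. This is a simplified illustration, not Spark's planner: the function name would_auto_broadcast is hypothetical, and the real optimizer also takes the join type, hints, and the quality of its size statistics into account.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # documented default of 10 MB


def would_auto_broadcast(estimated_size_bytes: int,
                         threshold_bytes: int = DEFAULT_THRESHOLD_BYTES) -> bool:
    """Hypothetical helper: True if a side of this estimated size
    falls within the auto-broadcast threshold."""
    # A threshold of -1 disables automatic broadcasting entirely.
    if threshold_bytes < 0:
        return False
    return 0 <= estimated_size_bytes <= threshold_bytes


print(would_auto_broadcast(5 * 1024 * 1024))                     # True
print(would_auto_broadcast(20 * 1024 * 1024))                    # False
print(would_auto_broadcast(20 * 1024 * 1024, 50 * 1024 * 1024))  # True
print(would_auto_broadcast(1024, -1))                            # False
```

The third call mirrors raising the threshold to 50 MB: a 20 MB side that would not be broadcast under the default now qualifies.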

![](/Workspace/Users/jif170122@gmail.com/spark/broadcast_join.png)
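
Conceptually, a broadcast hash join builds a hash table from the small side once and then probes it from each partition of the large side, with no repartitioning of the large side. The following is a minimal plain-Python sketch of that idea. The function name and the list-of-lists "partitions" are assumptions for illustration and do not reflect how Spark implements the join internally.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against a hash table
    built from the small side. Illustrative only."""
    # Build phase: hash the small side by join key. In Spark this table
    # would be shipped (broadcast) to every executor.
    lookup = defaultdict(list)
    for key, category in small_rows:
        lookup[key].append(category)

    # Probe phase: each partition is processed independently, in place.
    result = []
    for partition in large_partitions:
        for key, value in partition:
            for category in lookup.get(key, []):
                result.append((key, value, category))
    return result


large_partitions = [[(1, "A"), (2, "B")], [(3, "C")]]
small_rows = [(1, "Category_1"), (2, "Category_2")]

print(broadcast_hash_join(large_partitions, small_rows))
# [(1, 'A', 'Category_1'), (2, 'B', 'Category_2')]
```

The row with id 3 has no match in the small side, so an inner join drops it.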

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Threshold size:
# autoBroadcastJoinThreshold is a size in bytes. If the estimated size of a DataFrame is below
# this threshold, Spark will automatically choose to broadcast it in join operations.
# Here the threshold is raised from the default 10 MB to 50 MB.

spark = SparkSession.builder.appName("BroadcastJoinExample"). \
        config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024). \
        getOrCreate()

In [0]:
# Create the "large" DataFrame (a tiny stand-in for a big table, for demonstration only)
large_data = [(1, "A"), (2, "B"), (3, "C")]
large_df = spark.createDataFrame(large_data, ["id", "value"])

# Create a small DataFrame
small_data = [(1, "Category_1"), (2, "Category_2")]
small_df = spark.createDataFrame(small_data, ["id", "category"])

# Perform a broadcast join operation
result = large_df.join(broadcast(small_df), "id")

# The broadcast() hint above asks Spark to broadcast small_df regardless of the threshold.
# Without the hint, Spark's optimizer estimates the size of each DataFrame involved in the join.
# If the estimated size of one side is below the threshold, that side is broadcast to all executors,
# so the larger side does not need to be shuffled.


# Show the results
result.show()

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("BroadcastVariablePattern").getOrCreate()

# Create large and lookup DataFrames:

large_data = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "E"), (6, "F"), (7, "G"), (8, "H")]
large_df = spark.createDataFrame(large_data, ["id", "value"])

lookup_data = [(1, "Category_1"), (2, "Category_2")]
lookup_df = spark.createDataFrame(lookup_data, ["id", "category"])

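# Mark the small DataFrame for broadcasting.
# broadcast() only attaches a hint; no data is sent until a join using it is executed.
broadcast_lookup_df = broadcast(lookup_df)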


# Perform broadcast join and regular join:
result_broadcast_join = large_df.join(broadcast_lookup_df, "id")

# Disable automatic broadcasting on the existing session, so the un-hinted join below
# is not auto-broadcast (lookup_df is far below the default 10 MB threshold).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
result_regular_join = large_df.join(lookup_df, "id")

#Display Results:

print("Broadcast Join Result:")
result_broadcast_join.show()


print("Regular Join Result:")
result_regular_join.show()

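# Display execution plans:
# The hinted join should typically show a BroadcastHashJoin in its physical plan,
# while the un-hinted join (with auto broadcast disabled) typically shows a SortMergeJoin.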

print("Broadcast Join Execution Plan:")
result_broadcast_join.explain()

print("Regular Join Execution Plan:")
result_regular_join.explain()
