# Broadcast variables

In Spark, broadcast variables are read-only shared variables that are cached and made available on every node in a
cluster. Instead of shipping a copy of the data along with every task,
Spark distributes broadcast variables to the workers using efficient broadcast algorithms to reduce
communication costs. Each executor caches its own copy, which all tasks running on that executor can reuse.
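
As a rough illustration of the saving, the sketch below compares the bytes transferred when a lookup table travels with every task against the bytes transferred when each executor receives it once. The function name and the numbers (a 50 MiB table, 1,000 tasks, 4 executors) are hypothetical, and the model is deliberately simplified: it ignores serialization overhead and how Spark actually splits and routes the data.

```python
def bytes_shipped(payload_bytes: int, copies: int) -> int:
    """Total bytes sent over the network for `copies` copies of the payload."""
    return payload_bytes * copies

# Hypothetical workload, chosen only for illustration
payload_bytes = 50 * 1024 * 1024   # a 50 MiB lookup table
n_tasks = 1_000                    # tasks that need the table
n_executors = 4                    # executors running those tasks

per_task_total = bytes_shipped(payload_bytes, n_tasks)          # one copy per task
per_executor_total = bytes_shipped(payload_bytes, n_executors)  # one copy per executor

print(f"per task:     {per_task_total / 1024**2:,.0f} MiB")
print(f"per executor: {per_executor_total / 1024**2:,.0f} MiB")
print(f"reduction:    {per_task_total // per_executor_total}x")
```

Under these assumed numbers the per-task strategy moves 250 times more data, which is why the benefit grows with the number of tasks that read the same value.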

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.


In [3]:
from pyspark.sql import SparkSession
import os

local=True
if local:
    spark = SparkSession.builder\
        .master("local[4]")\
        .appName("BroadcastVariable")\
        .config("spark.executor.memory", "2g")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443")\
        .appName("BroadcastVariable")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory","2g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()

# Enable eager evaluation so that DataFrames render as formatted tables in the notebook
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)

## Create a broadcast variable

We can create a broadcast variable from any picklable Python object; in our case, it is a dictionary called **states_map**. The command **broadcast_states = spark.sparkContext.broadcast(states_map)** creates a broadcast variable called **broadcast_states**. **broadcast_states** is a wrapper around **states_map**, and the wrapped value can be read through its **value** attribute (**broadcast_states.value**).

Note that calling **sc.broadcast()** does not send the data to the workers. The broadcast variable is sent to an executor only **when a task on that executor first uses it**.
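
The deferred delivery described above can be pictured with a small stand-in class. This is a toy model in plain Python, not Spark's implementation: `LazyBroadcast` and `fetch_count` are hypothetical names, and the "transfer" is just a `pickle` round trip. The first access to `value` pays for the transfer, and later accesses reuse the cached copy.

```python
import pickle


class LazyBroadcast:
    """Toy stand-in for a broadcast handle (hypothetical, not a Spark class)."""

    def __init__(self, value):
        # Creating the handle only serializes the value; nothing is "sent" yet
        self._payload = pickle.dumps(value)
        self._cached = None
        self.fetch_count = 0  # how many times the payload was deserialized

    @property
    def value(self):
        # The first access deserializes the payload; later accesses hit the cache
        if self._cached is None:
            self._cached = pickle.loads(self._payload)
            self.fetch_count += 1
        return self._cached


handle = LazyBroadcast({"NY": "New York", "CA": "California", "FL": "Florida"})
print(handle.fetch_count)   # 0
print(handle.value["CA"])   # California
print(handle.value["NY"])   # New York
print(handle.fetch_count)   # 1
```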
 

In [2]:
states_map = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcast_states = spark.sparkContext.broadcast(states_map)

data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"),
        ("Maria", "Jones", "USA", "FL")]
print("source data: \n {}".format(data))

source data: 
 [('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY'), ('Robert', 'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')]


## Use a broadcast variable

After the broadcast variable (i.e. broadcast_states) is created, pay attention to two points:
1. Always use the broadcast variable (i.e. broadcast_states) instead of the value (i.e. states_map) in any functions run on the cluster so that **states_map** is shipped to the nodes only once. 
2. The value **states_map** should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
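
The second point can be illustrated without a cluster. The sketch below uses `pickle` to imitate the serialized copy that is taken when a value is broadcast; it is an analogy rather than Spark code, and `driver_states`, `snapshot` and `worker_states` are hypothetical names. A change made to the driver-side dict after the snapshot never reaches the copy a worker deserializes, so the driver and the workers would disagree about the contents.

```python
import pickle

driver_states = {"NY": "New York", "CA": "California", "FL": "Florida"}

# Imitate broadcasting: a serialized snapshot of the value is taken at this moment
snapshot = pickle.dumps(driver_states)

# Mutate the original afterwards, which is what point 2 warns against
driver_states["TX"] = "Texas"

# A worker that deserializes the snapshot sees the value as it was at snapshot time
worker_states = pickle.loads(snapshot)

print("TX" in driver_states)   # True
print("TX" in worker_states)   # False
```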


In [5]:
rdd = spark.sparkContext.parallelize(data)

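# Read the broadcast value inside the function that runs on the executors,
# so the closure captures only the lightweight broadcast handle rather than the dict itself
result = rdd.map(lambda x: (x[0], x[1], x[2], broadcast_states.value[x[3]])).collect()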
print("Exp1: after the rdd map on broadcast var ")
print(result)

Exp1: after the rdd map on broadcast var 
[('James', 'Smith', 'USA', 'California'), ('Michael', 'Rose', 'USA', 'New York'), ('Robert', 'Williams', 'USA', 'California'), ('Maria', 'Jones', 'USA', 'Florida')]


                                                                                

## Use a broadcast variable in a DataFrame

In [7]:
columns = ["firstname", "lastname", "country", "state"]

df = spark.createDataFrame(data, schema=columns)
print("Source data frame")
df.show()

# Read the broadcast value inside the mapped function so that only the broadcast handle is captured
df1 = df.rdd.map(lambda x: (x[0], x[1], x[2], broadcast_states.value[x[3]])).toDF(columns)
print("Exp2: after the data frame map on broadcast var ")
df1.show()

# A plain local list also works with isin(): the values are embedded as literals in the
# query plan built on the driver, so no broadcast variable is needed for this filter
local_states_map = ["NY"]
df2 = df.where(df.state.isin(local_states_map))
df2.show()

# isin() takes a list of values, so convert the keys of the broadcast dict to a list
# and keep only the first key ("NY")
keys = list(broadcast_states.value.keys())[0:1]
df3 = df.where(df["state"].isin(keys))
df3.show()

Source data frame
+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|    James|   Smith|    USA|   CA|
|  Michael|    Rose|    USA|   NY|
|   Robert|Williams|    USA|   CA|
|    Maria|   Jones|    USA|   FL|
+---------+--------+-------+-----+

Exp2: after the data frame map on broadcast var 
+---------+--------+-------+----------+
|firstname|lastname|country|     state|
+---------+--------+-------+----------+
|    James|   Smith|    USA|California|
|  Michael|    Rose|    USA|  New York|
|   Robert|Williams|    USA|California|
|    Maria|   Jones|    USA|   Florida|
+---------+--------+-------+----------+

+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|  Michael|    Rose|    USA|   NY|
+---------+--------+-------+-----+

+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|  Michael|    Rose|    USA|   NY|
+---------+----