In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

sc = spark.sparkContext

- Shared variables are variables that functions and methods running in parallel need to access.
- Spark provides two types of shared variables:
   - Broadcast
   - Accumulator

#### Broadcast Variables
- Broadcast variables let the programmer keep a read-only variable cached on each machine rather than
shipping a copy of it with every task.
- They are immutable and are cached on each worker node only once.
- They are an efficient way to give every node a copy of a large dataset.
- The broadcast value should fit in memory on each node.


#### When to use Broadcast Variable:
- To run a task, executors need the variables and methods that the task's functions reference. Spark serializes this
information and sends it to each executor; it is known as the closure.
- Suppose a huge array is referenced from Spark closures on a 5-node cluster with 100 partitions
(20 partitions per node). The array is then shipped at least 100 times, once per task (20 times to each node). If we
broadcast it instead, it is distributed once per node using an efficient peer-to-peer protocol.
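
The arithmetic in the example above can be checked with a few lines of plain Python. This is only a simplified counting model, not a measurement of what Spark transfers, and the helper name `copies_shipped` is hypothetical.

```python
def copies_shipped(num_nodes, partitions_per_node, broadcast):
    """Count how often a captured value is sent to the cluster (simplified model)."""
    if broadcast:
        # A broadcast value is distributed once per node.
        return num_nodes
    # Without broadcast, the value travels with every task (one task per partition).
    return num_nodes * partitions_per_node


without_broadcast = copies_shipped(5, 20, broadcast=False)
with_broadcast = copies_shipped(5, 20, broadcast=True)
print(without_broadcast, with_broadcast)
```

Under this model the array is sent 100 times without broadcasting and 5 times with it.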
#### What not to do:
- Once a value has been broadcast to the nodes, we shouldn't change it, so that every node has
exactly the same copy of the data. A modified value might be sent to another node later, which would give unexpected results.
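
One way to picture the risk: a value is serialized at the moment it is shipped, so a later change on the driver is not reflected in copies that were already sent. The sketch below mimics this with the standard-library `pickle` module. It is an analogy rather than Spark's actual broadcast mechanism, and the variable names are illustrative only.

```python
import pickle

days = {"sun": "Sunday", "mon": "Monday"}

# Mimic shipping: serialize the value as it is right now.
shipped_bytes = pickle.dumps(days)

# Mutate the driver-side object after it has been "sent".
days["sun"] = "SUNDAY"

# A node that received the earlier bytes still sees the old value.
node_copy = pickle.loads(shipped_bytes)
print(node_copy["sun"])
print(days["sun"])
```

The two copies now disagree. If the data really has to change, creating a new broadcast variable from the updated value is generally safer than mutating the old one.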

In [5]:
#Broadcast a Dictionary
days={"sun": "Sunday", "mon" : "Monday", "tue":"Tuesday"}
bcDays = spark.sparkContext.broadcast(days)
bcDays.value
bcDays.value['sun']


'Sunday'

In [6]:
# Broadcast a list
numbers = [1, 2, 3]
broadcastNumbers=spark.sparkContext.broadcast(numbers)
broadcastNumbers.value
broadcastNumbers.value[0]

1

Use the broadcast dictionary to convert the day abbreviations into full day names:

In [32]:
data= (("James","Smith","USA","mon"),
("Michael","Rose","USA","tue"),
("Robert","Williams","USA","sun"),
("Maria","Jones","USA","tue")
)
days={"sun": "Sunday", "mon" : "Monday", "tue":"Tuesday"}
bcDays = spark.sparkContext.broadcast(days)

rdd = spark.sparkContext.parallelize(data)

def days_convert(dict_key):
    return bcDays.value[dict_key]

rdd.map(lambda x: (x[0],x[1],x[2],days_convert(x[3]))).collect()

[('James', 'Smith', 'USA', 'Monday'),
 ('Michael', 'Rose', 'USA', 'Tuesday'),
 ('Robert', 'Williams', 'USA', 'Sunday'),
 ('Maria', 'Jones', 'USA', 'Tuesday')]

#### Accumulator Variables
- An accumulator is a shared variable used for sums and counters.
- All executors can add to it, and their updates are combined through an operation that is both associative and
commutative.
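
Why the operation must be associative and commutative: the partial results from different tasks can arrive in any order, so the final value should not depend on that order. The following plain-Python sketch (no Spark involved, and the partial counts are made up for illustration) merges the same partial results in every possible order, first with addition and then with subtraction.

```python
from functools import reduce
from itertools import permutations
import operator

# Hypothetical partial counts produced by three tasks.
partials = [3, 5, 7]

# Addition is associative and commutative: every merge order gives the same total.
sums = {reduce(operator.add, order) for order in permutations(partials)}

# Subtraction is neither: the result depends on the merge order.
diffs = {reduce(operator.sub, order) for order in permutations(partials)}

print(sums)
print(sorted(diffs))
```

Addition yields a single total regardless of order, while subtraction yields several different results, which is why it would not be a safe accumulator operation.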

In [34]:
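counter = 0
def f1(x):
    # Each executor increments its own copy of `counter`, not the driver's
    global counter
    counter += 1
rdd = spark.sparkContext.parallelize((1,2,3))
rdd.foreach(f1)
print(counter)  # the driver's counter is unchanged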

0


In [35]:
# sparkContext.accumulator() is used to define accumulator variables
counter = spark.sparkContext.accumulator(0)
def f2(x):
    global counter
    counter.add(1)  # add() adds a value to the accumulator
rdd.foreach(f2)
counter.value  # the value can only be read on the driver

3