---
---

<center> <h1> Shared Variables: Broadcast </h1> </center>

---

* **A broadcast variable is a read-only variable cached on each node, rather than shipped as a copy with every task.**
* **It gives every node a copy of a large input dataset in an efficient manner.**
* **Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.**


---

#### `Creating the Spark Session`

---

In [1]:
# importing the required libraries
from pyspark.sql import SparkSession

In [2]:
# Spark session object
spark = SparkSession.builder.appName("shared-Variables-example").master("local").getOrCreate()

In [3]:
spark

#### `Creating the sample data and RDD of it`

---

In [4]:
# Sample input data
data = [("James","Smith","USA","CA"),
        ("Michael","Rose","USA","NY"),
        ("Robert","Williams","USA","CA"),
        ("Maria","Jones","USA","FL")]

In [5]:
# Parallelize input data into an RDD
rdd_0 = spark.sparkContext.parallelize(data, numSlices=10)

In [6]:
# check the number of partitions
rdd_0.getNumPartitions()

10

#### `Create the mappings to be broadcast`

---

In [7]:
# mappings to be broadcast
states = {"NY":"New York","CA":"California","FL":"Florida", "NZ" : "New Zealand", "IND" :"India"}
countries = {"USA":"United States of America","IN":"India","SA":"South Africa","NZ":"New Zealand"}

---

#### `Broadcasting the variables`

---

In [8]:
# Broadcast variables
broadcastStates = spark.sparkContext.broadcast(states)
broadcastCountries = spark.sparkContext.broadcast(countries)

---

#### `Map the countries & states using the broadcast variables`

----

In [9]:
# Use broadcast variable
rdd_1 = rdd_0.map(lambda f: (f[0], f[1], broadcastCountries.value[f[2]], broadcastStates.value[f[3]]))

#### `Collect the Results`

---

In [10]:
# Collect
rdd_1.collect()

[('James', 'Smith', 'United States of America', 'California'),
 ('Michael', 'Rose', 'United States of America', 'New York'),
 ('Robert', 'Williams', 'United States of America', 'California'),
 ('Maria', 'Jones', 'United States of America', 'Florida')]

----
----


<center> <h1> Shared Variables: Accumulators </h1> </center> 

---

 * **Accumulators are variables that are only "added" to through an associative and commutative operation.**
 * **They can be used to implement counters or sums.**
 * **Spark natively supports accumulators of numeric types, and programmers can add support for new types.**


----
----

---

#### `Initialize the accumulator & create a sample RDD`

---

In [11]:
# initialize the accumulator
accuSum = spark.sparkContext.accumulator(0)

# create an RDD from a list of numbers
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5], numSlices=8)

#### `Define the function`

---

In [12]:
# define a function that adds each element to the accumulator
def countFun(x):
    global accuSum
    accuSum += x

#### `Get the sum`

----

In [13]:
# run the function on every element; each task adds to the accumulator
rdd.foreach(countFun)

In [14]:
# get the value
accuSum.value

15

#### `Let's try to find the sum without using an accumulator.`

---

In [15]:
# use a plain Python integer instead of an accumulator
accuSum2 = 0  # spark.sparkContext.accumulator(0)

# create an RDD from a list of numbers
rdd2 = spark.sparkContext.parallelize([1, 2, 3, 4, 5], numSlices=8)

# define a function that adds each element to the plain variable
def countFun2(x):
    global accuSum2
    accuSum2 += x

# run the function on every element; each task updates only its own copy
rdd2.foreach(countFun2)

# read the variable on the driver, which is unchanged
accuSum2

0

---

### Will an empty partition be used to compute the result of an accumulator variable?

-  First, we initialize the accumulator variable with 0.
-  We create a list of numbers and build an RDD from it.
-  Check the number of partitions.
-  Check for empty partitions.
-  Add 1 to the accumulator for each element and get the result.
---

In [16]:
# implement a counter
accumCount = spark.sparkContext.accumulator(0)

# create a list of numbers and create its RDD
rdd2 = spark.sparkContext.parallelize([1, 2, 3, 4, 5], numSlices=8)

In [17]:
# get the number of partitions
rdd2.getNumPartitions()

8

---

We have only 5 numbers in the list, but the RDD is created with 8 partitions, so at least 3 of the 8 partitions must be empty. Let's verify this with the `glom` function, which gathers the elements of each partition into a list.

---

#### `Check for the empty partitions`

---

In [18]:
# check for the empty partitions
rdd2.glom().collect()

[[], [1], [], [2], [3], [], [4], [5]]

---

Now, we will add 1 to the accumulator for each element and get the result.

---

In [19]:
# add 1 to the accumulator for each element
rdd2.foreach(lambda x: accumCount.add(1))

# final result
accumCount.value

5


We get 5, the number of elements, not 8, the number of partitions. `foreach` calls the function once per element, so the empty partitions contribute nothing to the accumulator.

---