# SparkContext - number of workers and lazy evaluation

## Impact of number of workers

### One worker per core of the machine is the usual recommendation, but you can configure more or fewer. Let's check what effect the number of workers has on the computation time.

In [1]:
from time import time
from pyspark import SparkContext

In [3]:
for j in range(1,5):
    sc= SparkContext(master = "local[%d]"%(j))
    t0=time()
    for i in range(10):
        sc.parallelize([1,2]*10000).reduce(lambda x,y:x+y)
    print(f"{j} executors, time = {time()-t0}")
    sc.stop()

1 executors, time = 0.8812801837921143
2 executors, time = 1.0833406448364258
3 executors, time = 1.0433084964752197
4 executors, time = 1.0877187252044678


### In this run the number of workers made little difference: one worker took about 0.88 s, while 2 to 4 workers took between 1.04 s and 1.09 s. The job is tiny (ten reductions over a 20,000-element list), so the fixed cost of scheduling tasks likely outweighs any gain from parallelism. With a larger workload, the time should drop as workers are added up to the number of cores (2 on this local machine; a 4-core machine could keep improving up to 4 workers) and then flatten out. Beyond that point the time may even increase because of context switching between workers.

## Lazy Evaluation

### Lazy evaluation, or call-by-need, is an evaluation strategy that delays the evaluation of an expression until its value is needed
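
The idea can be seen without Spark. The sketch below uses Python's built-in `map`, which is lazy in Python 3; the helper `add_four` and the `calls` list are illustrative names introduced here, not part of the notebook. The function is not called until the result is actually consumed.

```python
calls = []  # records every element the function is actually applied to

def add_four(x):
    calls.append(x)
    return x + 4

# Building the map object does no work yet.
lazy = map(add_four, [1, 2, 3])
calls_before = list(calls)

# Consuming the map object forces the evaluation.
result = list(lazy)
calls_after = list(calls)

print(calls_before)  # []
print(result)        # [5, 6, 7]
print(calls_after)   # [1, 2, 3]
```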

### Spark has two types of operations
1. Transformations - create a new RDD from an existing RDD. For example, map() applies a function to each element of the RDD and returns a new RDD with the results, and filter() applies a predicate to each element and returns a new RDD containing only the elements that pass.
2. Actions - produce a non-RDD result from an RDD. For example, count() returns the number of elements in the RDD, and top(n) returns the n largest elements. In both cases the output is not an RDD.

#### Because Spark is designed for large datasets, it does not evaluate the transformations on an RDD until an action is performed. The series of transformations is recorded as a DAG (Directed Acyclic Graph), which Spark uses to plan an efficient way to execute all of them together. When an action is performed, the pending transformations are evaluated for the first time and then the action itself runs. This type of evaluation is known as lazy evaluation.
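
A toy model may make the split between transformations and actions concrete. The `ToyRDD` class below is hypothetical and written only for illustration; it is not how Spark is implemented and it ignores partitions and scheduling. Transformations only record a step, and work happens when an action such as `collect()` or `count()` is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded transformations, not yet run

    # Transformations: return a new ToyRDD and do no work.
    def map(self, fn):
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._steps + (("filter", pred),))

    def _evaluate(self):
        # Run every recorded step on each element, one element at a time.
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the element
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the evaluation and return a non-RDD value.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


evaluated = []  # records which inputs were actually processed

def traced_add_four(x):
    evaluated.append(x)
    return x + 4

toy = ToyRDD(range(1, 10)).map(traced_add_four).filter(lambda x: x % 2 == 0)
work_before_action = len(evaluated)  # still 0: nothing has run
even_values = toy.collect()          # the action triggers the work
work_after_action = len(evaluated)

print(work_before_action, even_values, work_after_action)
```

Like an uncached RDD, this toy recomputes the whole chain for every action, so calling `toy.count()` afterwards runs `traced_add_four` on all nine inputs again.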

### The series of transformations is maintained as an RDD lineage, which can be inspected with the function toDebugString()

In [14]:
sc.stop()
sc= SparkContext(master = "local[2]")
# create a sample list
my_list = list(range(1, 10))
# parallelize the data
rdd_0 = sc.parallelize(my_list)
rdd_0

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195

#### Add 4 to each element of the RDD and check the lineage

In [15]:
# add value 4 to each number
rdd_1 = rdd_0.map(lambda x : x+4)
# RDD object
print(rdd_1)
# get the RDD Lineage 
print(rdd_1.toDebugString())

PythonRDD[1] at RDD at PythonRDD.scala:53
b'(2) PythonRDD[1] at RDD at PythonRDD.scala:53 []\n |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []'


#### Add 20 to each element of the RDD and check the lineage again

In [16]:
# add value 20 to each number
rdd_2 = rdd_1.map(lambda x : x+20)
# RDD Object
print(rdd_2)
# get the RDD Lineage
print(rdd_2.toDebugString())

PythonRDD[2] at RDD at PythonRDD.scala:53
b'(2) PythonRDD[2] at RDD at PythonRDD.scala:53 []\n |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []'


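#### The lineage has the same shape as before: a single PythonRDD on top of ParallelCollectionRDD[0], with only the RDD id changing from 1 to 2. Spark does not rewrite the two lambdas into a single "add 24". Instead, PySpark pipelines consecutive map() calls so that both functions are applied to each element in one pass, without materializing the intermediate RDD. Nothing is computed until an action is called, at which point Spark plans the execution from the recorded lineage. This is lazy evaluation, and it is especially helpful when processing large amounts of data.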
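
As a rough illustration of how two chained map() calls can run in a single pass while each function is still called separately, here is a toy sketch in plain Python. The helper `pipeline` and the functions `add_4` and `add_20` are hypothetical names introduced for this example; PySpark's own mechanism works on partition iterators and is more involved.

```python
trace = []  # records the order in which the functions are called

def add_4(x):
    trace.append(("add_4", x))
    return x + 4

def add_20(x):
    trace.append(("add_20", x))
    return x + 20

def pipeline(*funcs):
    """Chain functions so each element passes through all of them in one pass."""
    def run(x):
        for f in funcs:
            x = f(x)
        return x
    return run

fused = pipeline(add_4, add_20)
out = [fused(x) for x in [1, 2, 3]]

print(out)    # [25, 26, 27]
print(trace)  # add_4 and add_20 alternate element by element
```

The trace shows that no intermediate list of "x + 4" values is ever built: each element goes through both functions before the next element is touched.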