### Lineage: How to check the lineage between RDDs

We can track lineage across a wide range of transformations. Dependencies between RDDs fall into two types:

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
Examples: map, filter, union, sample, mapPartitions, and join when the inputs are co-partitioned.

Wide dependency: each partition of the parent RDD may be used by multiple partitions of the child RDD.
Examples: groupByKey, intersection, distinct, reduceByKey, repartition, cartesian, coalesce (when called with shuffle=True; by default it avoids a shuffle), and join when the inputs are not co-partitioned.
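
The two definitions can be made concrete with a small plain-Python model that needs no Spark installation. This is only an illustrative sketch: partitions are ordinary lists, the function names `narrow_children` and `wide_children` are hypothetical, and `key % num_children` is an assumed stand-in for Spark's hash partitioner.

```python
# Plain-Python model of narrow vs. wide dependencies (illustrative only).
# For each parent partition we record which child partitions read from it.

def narrow_children(num_partitions):
    """map/filter style: parent partition i feeds only child partition i."""
    return {i: {i} for i in range(num_partitions)}

def wide_children(parent_partitions, num_children):
    """groupByKey style: every record is routed to a child partition by its key."""
    children = {}
    for parent_index, records in enumerate(parent_partitions):
        # key % num_children is an assumed stand-in for hash partitioning
        children[parent_index] = {key % num_children for key, _ in records}
    return children

# Two parent partitions holding (key, value) records with integer keys.
parent = [[(0, "a"), (1, "b")], [(0, "c"), (1, "d")]]

narrow = narrow_children(len(parent))
wide = wide_children(parent, num_children=2)

print(narrow)  # each parent partition feeds exactly one child partition
print(wide)    # each parent partition feeds both child partitions
```

In Spark itself, `rdd.toDebugString()` returns a description of an RDD's lineage, which is the usual way to inspect these dependencies on a real job.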


### Transformation Examples

A transformation is an operation applied to an RDD to create a new RDD, e.g. map, filter, flatMap, etc.


#### map

We use the map transformation when we need to transform an RDD by applying a function to each element.

def func(a):
  a = a * a 
  return a
rdd.map(func)

Alternatively, we can use a lambda function: rdd.map(lambda z : z*z)


In [1]:
from pyspark import SparkContext 
#sc.stop()
sc = SparkContext("local","name1")
arr = [2,4,5,6]
rdd = sc.parallelize(arr)
rdd1 = rdd.map(lambda a : a*a)
rdd1.collect()

[4, 16, 25, 36]

#### flatMap

flatMap is similar to map, but each input element can be mapped to zero or more output elements.
It returns a new RDD by first applying the function to every element of the RDD and then flattening the results.

rdd.flatMap(lambda b : (b, b*b, b+1))

See the example below.


In [3]:
from pyspark import SparkContext
sc.stop()
sc = SparkContext("local","name")

rdd = sc.parallelize([4,5,6,8])

rdd_flatmap = rdd.flatMap(lambda x : (x,x*x,100))

print(rdd.collect())

print(rdd_flatmap.collect())

[4, 5, 6, 8]
[4, 16, 100, 5, 25, 100, 6, 36, 100, 8, 64, 100]
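
To see what "flattening" and "zero or more output elements" mean without starting Spark, here is a plain-Python sketch of the same logic. It is a model of the semantics, not Spark code; the helper names `expand` and `keep_if_even` are made up for this illustration.

```python
from itertools import chain

data = [4, 5, 6, 8]

def expand(x):
    """Produce three outputs for one input, like the lambda in the cell above."""
    return (x, x * x, 100)

# map-like behaviour: exactly one result (here a tuple) per input element
mapped = [expand(x) for x in data]

# flatMap-like behaviour: the per-element results are concatenated
flat_mapped = list(chain.from_iterable(expand(x) for x in data))

def keep_if_even(x):
    """Return an empty list for odd inputs, i.e. zero outputs for that element."""
    return [x] if x % 2 == 0 else []

evens = list(chain.from_iterable(keep_if_even(x) for x in data))

print(mapped)       # list of 4 tuples
print(flat_mapped)  # flat list of 12 numbers
print(evens)        # odd inputs contribute nothing
```

The last case shows why flatMap can shrink an RDD as well as grow it: an element whose function result is empty simply disappears from the output.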


#### filter

Returns a new RDD containing only the elements that satisfy the given condition.

arr = [2,4,6,8,7,5,9,10,11,13]

rdd.filter(lambda x : x%2 == 0)

A new RDD is created containing only the elements for which x%2 == 0, i.e. the even numbers.

arr1 = ['apple', 'mango','allahabad','andaman','alcohol','bananan','aprk']
rdd.filter(lambda x : x[0] =='a')

This returns a new RDD containing only the elements that start with the letter 'a'.


In [3]:
rdd_filter = sc.parallelize([2,4,6,8,7,5,9,10,11,13])
rdd_filter1 = rdd_filter.filter(lambda x: x%2 == 0)
rdd_filter1.collect()

arr1 = ['apple', 'mango','allahabad','andaman','alcohol','bananan','aprk']
rdd_flter = sc.parallelize(arr1)
rdd_flter1 = rdd_flter.filter(lambda x : x[0] =='a')
rdd_flter1.collect()

['apple', 'allahabad', 'andaman', 'alcohol', 'aprk']
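
One caveat about the predicate `x[0] == 'a'`: indexing an empty string raises `IndexError`. The plain-Python sketch below uses `str.startswith` instead, which simply returns `False` for an empty string. The extra `''` entry in the list and the name `starts_with_a` are assumptions added for this illustration; they are not part of the original data.

```python
# Same word list as above, plus a hypothetical empty string to show the edge case.
words = ['apple', 'mango', 'allahabad', 'andaman', 'alcohol', 'bananan', 'aprk', '']

def starts_with_a(word):
    """Safe predicate: works for empty strings, unlike word[0] == 'a'."""
    return word.startswith('a')

# Python's built-in filter keeps the elements for which the predicate is True,
# which is the same rule the RDD filter transformation applies.
kept = list(filter(starts_with_a, words))
print(kept)
```

The same function can be passed to an RDD, for example `rdd_flter.filter(starts_with_a)`.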

#### groupBy

Groups the data in the original RDD and creates pairs where the key is the output of the user function and the value is the collection of all items for which the function yields that key.

arr1 = ['apple', 'mango','allahabad','andaman','alcohol','bananan','aprk']

For the above input, grouping by the first letter creates a new RDD of key-value pairs like this:

[('a', ['apple', 'allahabad', 'andaman', 'alcohol', 'aprk']), ('b', ['bananan']), ('m', ['mango'])]


In [4]:
rdd_grpby = rdd_flter.groupBy(lambda x : x[0])
print(type(rdd_grpby))
val =[(k,list(v))  for (k,v) in rdd_grpby.collect()]
print(val)

<class 'pyspark.rdd.PipelinedRDD'>
[('a', ['apple', 'allahabad', 'andaman', 'alcohol', 'aprk']), ('m', ['mango']), ('b', ['bananan'])]
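
The grouping logic can be sketched in plain Python as follows. This is a model of what groupBy computes, not how Spark executes it across partitions; `group_by` is a hypothetical helper name.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Collect items into lists keyed by key_func(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    # Sorted only to make the printed output reproducible.
    return sorted(groups.items())

words = ['apple', 'mango', 'allahabad', 'andaman', 'alcohol', 'bananan', 'aprk']
grouped = group_by(words, lambda w: w[0])
print(grouped)
```

The sort is a convenience of this sketch. Spark does not guarantee any key order, which is why the cell output above lists 'm' before 'b'.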


#### groupByKey

Groups the values for each key in a pair RDD, creating new pairs where each original key corresponds to the collected group of its values.

arr = [('a',1), ('a',3),('a',5) ,('b',3),('b',9),('c',8),('d',4),('d',5)]

If we apply the groupByKey transformation to the above list, the result will be

[('a', [1, 3, 5]), ('b', [3, 9]), ('c', [8]), ('d', [4, 5])]

In [5]:
grp_arr = [('a',1), ('a',3),('a',5) ,('b',3),('b',9),('c',8),('d',4),('d',5)]
grp_rdd = sc.parallelize(grp_arr)
grp_rdd1 = grp_rdd.groupByKey()
#print(grp_rdd.collect())

print(list((j[0],list(j[1])) for j in grp_rdd1.collect()))  

[('a', [1, 3, 5]), ('b', [3, 9]), ('c', [8]), ('d', [4, 5])]
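
For comparison, the same result can be reproduced in plain Python with `itertools.groupby`. This is only a sketch of the semantics under the assumption that all pairs fit in one list; the per-key sum at the end is an added illustration of what the grouped values are typically used for.

```python
from itertools import groupby
from operator import itemgetter

pairs = [('a', 1), ('a', 3), ('a', 5), ('b', 3), ('b', 9), ('c', 8), ('d', 4), ('d', 5)]

# itertools.groupby only merges adjacent items, so sort by key first.
# sorted() is stable, so values keep their original order within each key.
sorted_pairs = sorted(pairs, key=itemgetter(0))

grouped = [
    (key, [value for _, value in group])
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
]
print(grouped)

# Once the values are grouped, a per-key aggregate is a simple reduction.
totals = {key: sum(values) for key, values in grouped}
print(totals)
```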


#### mapPartitions

The mapPartitions transformation applies a function to each partition of the source RDD as a whole. The function receives an iterator over the partition's elements and can yield zero or more results.
mapPartitions is like map, but the function runs once per partition rather than once per element.


In [6]:
print (sc.defaultParallelism)

1


In [7]:
# mapPartitions, first way: explicitly split the data into 3 partitions
input_data = range(1,10)
result_mp = sc.parallelize(input_data, 3)
def mapspart(iterator): yield sum(iterator)
result_mp.mapPartitions(mapspart).collect()

[6, 15, 24]

In [8]:
# this local setup has only one default partition
mappar_check = sc.parallelize(input_data)
mappar_Cmp = mappar_check.mapPartitions(mapspart)
mappar_Cmp.collect()

[45]

The results above show how mapPartitions works:

1. In the first case, the data was explicitly split into 3 partitions:
partition 1 = 1,2,3 => sum = 6
partition 2 = 4,5,6 => sum = 15
partition 3 = 7,8,9 => sum = 24

2. In the second case, no partition count was given, so the default parallelism was used, which is 1 on this setup:
partition 1 = 1,2,3,4,5,6,7,8,9 => sum = 45
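
The two cases above can be reproduced with a plain-Python model, which may help if no Spark session is at hand. This is a sketch under stated assumptions: `split_into_partitions` and `map_partitions` are hypothetical helpers, and the even slicing only approximates how the data was split across partitions in the cells above.

```python
def split_into_partitions(data, num_partitions):
    """Cut the data into num_partitions contiguous, roughly equal slices."""
    items = list(data)
    size = len(items)
    bounds = [i * size // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

def map_partitions(partitions, func):
    """Call func once per partition with an iterator and collect what it yields."""
    results = []
    for partition in partitions:
        results.extend(func(iter(partition)))
    return results

def sum_partition(iterator):
    # Same idea as mapspart above: one output value per partition.
    yield sum(iterator)

three = split_into_partitions(range(1, 10), 3)
one = split_into_partitions(range(1, 10), 1)

sums_three = map_partitions(three, sum_partition)
sums_one = map_partitions(one, sum_partition)

print(three)       # contents of each of the 3 partitions
print(sums_three)  # one sum per partition
print(sums_one)    # a single sum over all elements
```

In Spark, `rdd.glom().collect()` returns the elements of each partition as a separate list, which is a quick way to check how a real RDD was split.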

