# CombineByKey

In this notebook we will use the transformation combineByKey to get the max and min values by key.

First we define a dataset:

In [3]:
data2=[('A',1),('B',10),('A',5),('B',16),('A',3),('B',1)]
rdd2 = sc.parallelize(data2)
rdd2.collect()

To understand how combineByKey works, we will use it to calculate the average by key:

In [5]:
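SumCount = rdd2.combineByKey(
    lambda val: (val, 1),                      # createCombiner: first value of a key -> (sum, count)
    lambda x, val: (x[0] + val, x[1] + 1),     # mergeValue: add another value within a partition
    lambda x, y: (x[0] + y[0], x[1] + y[1]))   # mergeCombiners: merge (sum, count) pairs across partitions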

SumCount.collect()

In [6]:
avg = SumCount.mapValues(lambda x: x[0]/x[1])
avg.collect()
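
To make the three functions concrete, here is a minimal plain-Python sketch of what combineByKey does with them. It is a model, not Spark's implementation: the helper name simulate_combine_by_key and the split of the data into two partitions are assumptions made for illustration, since Spark decides the real partitioning.

```python
def simulate_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Plain-Python model of combineByKey over a list of partitions (hypothetical helper)."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key not in combiners:
                # First time this key is seen in this partition
                combiners[key] = create_combiner(value)
            else:
                # Key already seen in this partition: fold the value in
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    # Merge the per-partition results key by key
    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged


# Assumed split of data2 into two partitions
partitions = [
    [('A', 1), ('B', 10), ('A', 5)],
    [('B', 16), ('A', 3), ('B', 1)],
]

sum_count = simulate_combine_by_key(
    partitions,
    lambda val: (val, 1),
    lambda x, val: (x[0] + val, x[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {key: s / c for key, (s, c) in sum_count.items()}

print(sum_count)  # {'A': (9, 3), 'B': (27, 3)}
print(averages)   # {'A': 3.0, 'B': 9.0}
```

The result does not depend on how the data is split, because the sums and counts are merged the same way whichever partition they come from.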

Now we will calculate the max and min by key using groupByKey.

For this we will define the function maxmin that takes a list as an input and returns a list with the max and the min value:

In [8]:
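def maxmin(l):
  # Return the result directly instead of storing it in a variable named `list`,
  # which would shadow the built-in type
  return [max(l), min(l)]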

Testing the function:

In [10]:
l=[1,2,3]
maxmin(l)

We create an RDD with the keys and all the values in a list:

In [12]:
# groupByKey returns an iterable per key; convert it to a list so it can be displayed
key_values = rdd2.groupByKey().mapValues(list)
key_values.collect()

Finally we call the function maxmin to get the max and the min value of each key:

In [14]:
key_values.mapValues(lambda v: maxmin(v)).collect()

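The previous method requires grouping all the values for each key, which is inefficient on large data: every value has to be shuffled across the network and held in memory before max and min are applied. So, we will get the max and the min using combineByKey instead, which merges values within each partition before anything is shuffled.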

To get the max, we create the function maximum that takes two values as input and returns the larger of them:

In [16]:
def maximum(x, y):
    if x > y:
        maxval = x
    else:
        maxval = y
    return maxval

Testing the function:

In [18]:
maximum(1,5)

Now we will use combineByKey to get the max value. It takes three functions, which we explain step by step:

- lambda val: val : createCombiner. Called the first time a key is seen in a partition; it returns the value unchanged, so the first value becomes the running max
- lambda x, val: maximum(x,val) : mergeValue. Called for every further value of that key in the same partition; it compares the value with the running max x and keeps the larger
- lambda x, y: maximum(x,y) : mergeCombiners. Called to merge the running maxima that different partitions produced for the same key; it keeps the larger of the two

In [20]:
rdd2.combineByKey(lambda val: val, 
                 lambda x, val: maximum(x,val), 
                 lambda x, y: maximum(x,y)).collect()
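
The steps can be traced by hand in plain Python. This is only a sketch: the two-partition split below is an assumption, and the names local_max and global_max are introduced here for illustration.

```python
# Assumed split of data2 into two partitions (Spark decides the real split)
partitions = [
    [('A', 1), ('B', 10), ('A', 5)],
    [('B', 16), ('A', 3), ('B', 1)],
]

def maximum(x, y):
    return x if x > y else y

# Inside each partition: the first value of a key is kept as is,
# every later value is compared with the running max.
local_max = []
for part in partitions:
    acc = {}
    for key, val in part:
        acc[key] = maximum(acc[key], val) if key in acc else val
    local_max.append(acc)

# Across partitions: the per-partition maxima of each key are compared.
global_max = {}
for acc in local_max:
    for key, m in acc.items():
        global_max[key] = maximum(global_max[key], m) if key in global_max else m

print(local_max)   # [{'A': 5, 'B': 10}, {'B': 16, 'A': 3}]
print(global_max)  # {'A': 5, 'B': 16}
```

In a real session, rdd2.glom().collect() shows how the elements are actually distributed over the partitions.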

For the minimum we follow the same steps that we followed for the max but using the function minimum:

In [22]:
def minimum(x, y):
    if x < y:
        minval = x
    else:
        minval = y
    return minval

In [23]:
minimum(5,1)

In [24]:
rdd2.combineByKey(lambda val: val, 
                 lambda x, val: minimum(x,val), 
                 lambda x, y: minimum(x,y)).collect()

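Finally, we can combine the maximum and minimum functions to get the max and min in a single transformation. The combiner is now a tuple (max, min): the first value of a key initialises both fields, and each merge updates them independently: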

In [26]:
rdd2.combineByKey(lambda val: (val,val), 
                 lambda x, val: (maximum(x[0],val), minimum(x[1],val)), 
                 lambda x, y: (maximum(x[0],y[0]), minimum(x[1],y[1]))).collect()
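
As a quick sanity check, the following plain-Python sketch compares the two approaches on the same data without Spark. It uses the built-in max and min in place of maximum and minimum, and the names pairs, grouped and from_groups are introduced here for illustration.

```python
data2 = [('A', 1), ('B', 10), ('A', 5), ('B', 16), ('A', 3), ('B', 1)]

# Single pass: keep only a (max, min) pair per key, as the combineByKey version does
pairs = {}
for key, val in data2:
    if key in pairs:
        pairs[key] = (max(pairs[key][0], val), min(pairs[key][1], val))
    else:
        pairs[key] = (val, val)

# Grouping approach: collect every value per key first, then apply max and min
grouped = {}
for key, val in data2:
    grouped.setdefault(key, []).append(val)
from_groups = {key: (max(vals), min(vals)) for key, vals in grouped.items()}

print(pairs)                 # {'A': (5, 1), 'B': (16, 1)}
print(pairs == from_groups)  # True
```

Both approaches give the same answer; the difference is that the single-pass version stores two numbers per key, while the grouping version stores every value.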