# FlatMap ReduceByKey

In this tutorial, we are going to learn two new features of PySpark: flatMap and reduceByKey.

At the beginning of this session, create a new Spark session to work through this section of the tutorial. Another part of this tutorial is to run the popular Spark program, Word Count, on a document.

In [2]:
from pyspark.sql import SparkSession

ss = SparkSession.builder.master("local").appName("FlatMap-ReduceByKey").getOrCreate()
sc = ss.sparkContext


## ReduceByKey()

The reduceByKey function operates on (key, value) pairs. It merges the values that share the same key, combining them two at a time with the function (usually a lambda) that we pass.

**Note that reduceByKey is not an action like reduce; it is a transformation. This means the data is not processed until an action (such as collect) is called.**

### Syntax
reduceByKey(func), where func takes two values that share a key and returns their combined value
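
To make the role of the function argument concrete, here is a plain-Python sketch of what reduceByKey computes. It is only a model of the result, not how Spark implements it: the helper name `reduce_by_key` and the sample pairs are made up for illustration. Spark merges values inside each partition before combining partitions, which is why the function should be associative and commutative.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Plain-Python model (hypothetical helper): group values by key, then fold each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # func is applied pairwise to the values that share a key
    return {key: reduce(func, values) for key, values in groups.items()}


pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)  # {'a': 3, 'b': 1, 'c': 1}
```

A key that appears only once keeps its value unchanged, because there is nothing to combine it with.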

In [3]:
# From Previous Examples

rdd = sc.parallelize([1,2,2,3,1,5,2,3,1,4,2,4,1,5,6,12])
rdd = rdd.map(lambda x: (x, 1))
print(rdd.collect())

[(1, 1), (2, 1), (2, 1), (3, 1), (1, 1), (5, 1), (2, 1), (3, 1), (1, 1), (4, 1), (2, 1), (4, 1), (1, 1), (5, 1), (6, 1), (12, 1)]


In [4]:
# Now reduce by key, adding up the values that share the same key
rdd = rdd.reduceByKey(lambda x,y: x+y)
rdd.collect()

[(1, 4), (2, 4), (3, 2), (5, 2), (4, 2), (6, 1), (12, 1)]

As you can see, each key now appears only once, paired with the sum of its values.


## FlatMap
PySpark flatMap() is a transformation that applies a function to every element of an RDD and then flattens the results, returning a new RDD. The function returns a collection (a list, for example) for each input element, and every item of that collection becomes its own element in the output. In current PySpark, flatMap() is an RDD method; to use it on a DataFrame, go through `df.rdd` first.

In one of the previous exercises, we processed an array of strings and split each string. The returned object was an array containing all the pieces of the split string.

If we want to count the occurrences of words in these sentences, we can try something similar.

Let's create some sample data first, and then map over it.

In [5]:
lines = [
    "word count from Wikipedia the free encyclopedia",
    "the word count is the number of words in a document or passage of text Word counting may be needed when a text",
    "is required to stay within certain numbers of words This may particularly be the case in academia legal",
    "proceedings journalism and advertising Word count is commonly used by translators to determine the price for"
]

In [6]:
wordsRdd = sc.parallelize(lines).map(lambda x: x.split(" ")) #we are splitting on the space
wordsRdd.collect()

[['word', 'count', 'from', 'Wikipedia', 'the', 'free', 'encyclopedia'],
 ['the',
  'word',
  'count',
  'is',
  'the',
  'number',
  'of',
  'words',
  'in',
  'a',
  'document',
  'or',
  'passage',
  'of',
  'text',
  'Word',
  'counting',
  'may',
  'be',
  'needed',
  'when',
  'a',
  'text'],
 ['is',
  'required',
  'to',
  'stay',
  'within',
  'certain',
  'numbers',
  'of',
  'words',
  'This',
  'may',
  'particularly',
  'be',
  'the',
  'case',
  'in',
  'academia',
  'legal'],
 ['proceedings',
  'journalism',
  'and',
  'advertising',
  'Word',
  'count',
  'is',
  'commonly',
  'used',
  'by',
  'translators',
  'to',
  'determine',
  'the',
  'price',
  'for']]

**Our data is now an array of words for each element, where each element represents a line. But this is not what we want. We want every word to be a separate element, so that we can use the word itself as a key and count its occurrences.**

As discussed above, flatMap is a transformation that flattens each returned collection (an array, for example) and makes every item in it an element of the RDD.
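
The difference can also be seen without Spark. The sketch below is a plain-Python model, with a made-up two-line sample (`sample_lines`), that applies the same split function in both styles: a list comprehension stands in for map, and `itertools.chain.from_iterable` stands in for the flattening step of flatMap.

```python
from itertools import chain

sample_lines = ["word count", "the word count is"]

# map: one output element per input element (here, one list per line)
mapped = [line.split(" ") for line in sample_lines]

# flatMap: same function, but the per-line lists are chained into one flat sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in sample_lines))

print(mapped)       # [['word', 'count'], ['the', 'word', 'count', 'is']]
print(flat_mapped)  # ['word', 'count', 'the', 'word', 'count', 'is']
```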

In [7]:
# This example will run using flatMap

wordsRdd = sc.parallelize(lines).flatMap(lambda x: x.split(" ")) #we are splitting on the space
wordsRdd.collect()

['word',
 'count',
 'from',
 'Wikipedia',
 'the',
 'free',
 'encyclopedia',
 'the',
 'word',
 'count',
 'is',
 'the',
 'number',
 'of',
 'words',
 'in',
 'a',
 'document',
 'or',
 'passage',
 'of',
 'text',
 'Word',
 'counting',
 'may',
 'be',
 'needed',
 'when',
 'a',
 'text',
 'is',
 'required',
 'to',
 'stay',
 'within',
 'certain',
 'numbers',
 'of',
 'words',
 'This',
 'may',
 'particularly',
 'be',
 'the',
 'case',
 'in',
 'academia',
 'legal',
 'proceedings',
 'journalism',
 'and',
 'advertising',
 'Word',
 'count',
 'is',
 'commonly',
 'used',
 'by',
 'translators',
 'to',
 'determine',
 'the',
 'price',
 'for']
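
With one word per element, the word count mentioned at the start of the tutorial is one map and one reduceByKey away. In PySpark the chain would presumably look like `sc.parallelize(lines).flatMap(lambda x: x.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`. The sketch below mirrors those three stages in plain Python on a shortened, made-up sample (`sample`), so it can be run and checked without a Spark session.

```python
from itertools import chain

sample = [
    "word count from Wikipedia",
    "the word count is the number of words",
]

# Stage 1 (flatMap): one element per word
words = chain.from_iterable(line.split(" ") for line in sample)

# Stage 2 (map): pair every word with a count of 1
pairs = ((word, 1) for word in words)

# Stage 3 (reduceByKey): add up the counts that share a word
word_counts = {}
for word, n in pairs:
    word_counts[word] = word_counts.get(word, 0) + n

print(word_counts["word"], word_counts["the"], word_counts["words"])  # 2 2 1
```

Keys are compared exactly, so in the full data 'word' and 'Word' are counted as different words unless each word is lower-cased before pairing.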

### Exercise 1

Did you see any difference? What is it? Explain.

.map() splits each string and keeps each result as its own list, so the RDD holds 4 elements, one list of words per line.

.flatMap() splits each string in the same way but flattens the results, so every word becomes its own element in a single flat RDD.

### Exercise 2

The following is the same data used in the previous notebook that represents houses with the following columns:
"Sell", "List", "Living", "Rooms", "Beds", "Baths", "Age", "Acres", "Taxes"

Find the sum of all taxes paid per number of beds in the house. For example, for all the houses with 4 beds the total taxes paid is X.


In [2]:
# Given Data

data = [
"142, 160, 28, 10, 5, 3, 60, 0.28, 3167",
"175, 180, 18, 8, 4, 1, 12, 0.43, 4033",
"129, 132, 13, 6, 3, 1, 41, 0.33, 1471",
"138, 140, 17, 7, 3, 1, 22, 0.46, 3204",
"232, 240, 25, 8, 4, 3, 5, 2.05, 3613",
"135, 140, 18, 7, 4, 3, 9, 0.57, 3028",
"150, 160, 20, 8, 4, 3, 18, 4.00, 3131",
"207, 225, 22, 8, 4, 2, 16, 2.22, 5158",
"271, 285, 30, 10, 5, 2, 30, 0.53, 5702",
"89,  90, 10, 5, 3, 1, 43, 0.30, 2054",
"153, 157, 22, 8, 3, 3, 18, 0.38, 4127",
"87,  90, 16, 7, 3, 1, 50, 0.65, 1445",
"234, 238, 25, 8, 4, 2, 2, 1.61, 2087",
"106, 116, 20, 8, 4, 1, 13, 0.22, 2818",
"175, 180, 22, 8, 4, 2, 15, 2.06, 3917",
"165, 170, 17, 8, 4, 2, 33, 0.46, 2220",
"166, 170, 23, 9, 4, 2, 37, 0.27, 3498",
"136, 140, 19, 7, 3, 1, 22, 0.63, 3607",
"148, 160, 17, 7, 3, 2, 13, 0.36, 3648",
"151, 153, 19, 8, 4, 2, 24, 0.34, 3561",
"180, 190, 24, 9, 4, 2, 10, 1.55, 4681",
"293, 305, 26, 8, 4, 3, 6, 0.46, 7088",
"167, 170, 20, 9, 4, 2, 46, 0.46, 3482",
"190, 193, 22, 9, 5, 2, 37, 0.48, 3920",
"184, 190, 21, 9, 5, 2, 27, 1.30, 4162",
"157, 165, 20, 8, 4, 2, 7, 0.30, 3785",
"110, 115, 16, 8, 4, 1, 26, 0.29, 3103",
"135, 145, 18, 7, 4, 1, 35, 0.43, 3363",
"567, 625, 64, 11, 4, 4, 4, 0.85, 12192",
"180, 185, 20, 8, 4, 2, 11, 1.00, 3831",
"183, 188, 17, 7, 3, 2, 16, 3.00, 3564",
"185, 193, 20, 9, 3, 2, 56, 6.49, 3765",
"152, 155, 17, 8, 4, 1, 33, 0.70, 3361",
"148, 153, 13, 6, 3, 2, 22, 0.39, 3950",
"152, 159, 15, 7, 3, 1, 25, 0.59, 3055",
"146, 150, 16, 7, 3, 1, 31, 0.36, 2950",
"170, 190, 24, 10, 3, 2, 33, 0.57, 3346",
"127, 130, 20, 8, 4, 1, 65, 0.40, 3334",
"265, 270, 36, 10, 6, 3, 33, 1.20, 5853",
"157, 163, 18, 8, 4, 2, 12, 1.13, 3982",
"128, 135, 17, 9, 4, 1, 25, 0.52, 3374",
"110, 120, 15, 8, 4, 2, 11, 0.59, 3119",
"123, 130, 18, 8, 4, 2, 43, 0.39, 3268",
"212, 230, 39, 12, 5, 3, 202, 4.29, 3648",
"145, 145, 18, 8, 4, 2, 44, 0.22, 2783",
"129, 135, 10, 6, 3, 1, 15, 1.00, 2438",
"143, 145, 21, 7, 4, 2, 10, 1.20, 3529",
"247, 252, 29, 9, 4, 2, 4, 1.25, 4626",
"111, 120, 15, 8, 3, 1, 97, 1.11, 3205",
"133, 145, 26, 7, 3, 1, 42, 0.36, 3059"]

In [7]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("houses").getOrCreate()

def beds_and_taxes(line):
    # Columns: Sell, List, Living, Rooms, Beds, Baths, Age, Acres, Taxes
    fields = line.split(",")
    return (int(fields[4]), int(fields[8]))

masterRDD = spark.sparkContext.parallelize(data)

# Key each house by its number of beds, then add up the taxes for each key
taxesPerBeds = masterRDD.map(beds_and_taxes).reduceByKey(lambda x, y: x + y)
print(taxesPerBeds.sortByKey().collect())

[(3, 48888), (4, 109965), (5, 20599), (6, 5853)]


### Exercise 3 (Level: Hard)
Find the mean of the taxes paid per house for each number of beds.

The answer should be something like this:
(5.0, 4119.8), (4.0, 3927.3214285714284), (3.0, 3055.5), (6.0, 5853.0)

In [12]:
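def to_list(a):
    # createCombiner: start a list from the first value seen for a key
    return [a]

def append(a, b):
    # mergeValue: add another value to a key's list within a partition
    a.append(b)
    return a

def extend(a, b):
    # mergeCombiners: join the lists built on different partitions
    a.extend(b)
    return a

def avg(arr):
    return sum(arr) / len(arr)

# Build (beds, taxes) pairs; beds is column 4 and taxes is column 8
rdd = masterRDD.map(lambda x: (int(x.split(",")[4]), int(x.split(",")[8])))
print(rdd.collect())

# Collect every tax value per number of beds, then average each list
grprdd = rdd.combineByKey(to_list, append, extend)
avgrdd = grprdd.map(lambda x: (x[0], avg(x[1])))
print(avgrdd.collect())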

[(5, 3167), (4, 4033), (3, 1471), (3, 3204), (4, 3613), (4, 3028), (4, 3131), (4, 5158), (5, 5702), (3, 2054), (3, 4127), (3, 1445), (4, 2087), (4, 2818), (4, 3917), (4, 2220), (4, 3498), (3, 3607), (3, 3648), (4, 3561), (4, 4681), (4, 7088), (4, 3482), (5, 3920), (5, 4162), (4, 3785), (4, 3103), (4, 3363), (4, 12192), (4, 3831), (3, 3564), (3, 3765), (4, 3361), (3, 3950), (3, 3055), (3, 2950), (3, 3346), (4, 3334), (6, 5853), (4, 3982), (4, 3374), (4, 3119), (4, 3268), (5, 3648), (4, 2783), (3, 2438), (4, 3529), (4, 4626), (3, 3205), (3, 3059)]
[(4, 3927.3214285714284), (5, 4119.8), (6, 5853.0), (3, 3055.5)]
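
The solution above gathers every tax value into a list before averaging. A common alternative is to carry only a running (sum, count) pair per key, which keeps the intermediate data small. The sketch below models that idea in plain Python; the function names and the split into two partitions are assumptions made for illustration. In PySpark, the three functions could presumably be passed as `rdd.combineByKey(create_combiner, merge_value, merge_combiners)`, followed by a `mapValues` step that divides the sum by the count.

```python
def create_combiner(tax):
    # First value seen for a key within a partition
    return (tax, 1)


def merge_value(acc, tax):
    # Fold another value from the same partition into the running (sum, count)
    return (acc[0] + tax, acc[1] + 1)


def merge_combiners(a, b):
    # Merge the running totals produced by two partitions
    return (a[0] + b[0], a[1] + b[1])


def combine_partition(pairs):
    # Hypothetical helper: what happens inside a single partition
    acc = {}
    for key, value in pairs:
        acc[key] = merge_value(acc[key], value) if key in acc else create_combiner(value)
    return acc


# Two hypothetical partitions of (beds, taxes) pairs taken from the data above
partition_a = [(5, 3167), (5, 5702), (6, 5853)]
partition_b = [(5, 3920), (5, 4162), (5, 3648)]

totals = combine_partition(partition_a)
for key, acc in combine_partition(partition_b).items():
    totals[key] = merge_combiners(totals[key], acc) if key in totals else acc

means = {key: total / count for key, (total, count) in totals.items()}
print(means)  # {5: 4119.8, 6: 5853.0}
```

The means for 5 and 6 beds match the output above, since the sample contains every house with those bed counts.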
