# Map, Filter, Reduce


MapReduce is a software framework for processing large data sets in a distributed fashion across several machines. The core idea behind MapReduce is mapping your data set into a collection of (key, value) pairs, and then reducing over all pairs with the same key.
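The (key, value) idea can be sketched in plain Python, without Spark. This is only an illustration: the word list and the explicit grouping step are assumptions made for this example, and a real framework would spread each step across machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input data for the illustration.
words = ["spark", "map", "reduce", "map", "spark", "map"]

# Map step: turn every element into a (key, value) pair.
pairs = [(word, 1) for word in words]

# Grouping step: collect the values that share a key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Reduce step: combine the values of each key into a single result.
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in grouped.items()}
print(counts)
```

Each key ends up with exactly one reduced value, which is the word count in this case.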

To try these ideas in PySpark, we first need to create a Spark session.


In [3]:
import pyspark
from pyspark.sql import SparkSession
sparksession = SparkSession.builder.master("local[4]") \
                    .appName('mapfiltershufflereduce') \
                    .getOrCreate()
print(sparksession)

<pyspark.sql.session.SparkSession object at 0x7f04fc4dad10>


## Map()
Map is a transformation that takes a function describing how each data element is transformed. Everything we do in Spark has to work in a fully distributed system, which means the transformation of one element must not depend on any other element in the dataset. In other words, map is an RDD transformation that applies the given function (often a lambda) to every element of the RDD and returns a new RDD.
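A plain-Python sketch may help show why element independence matters. The two-way split below is a hypothetical stand-in for how the data could be divided between workers; it is not how Spark actually chooses partitions.

```python
def square(x):
    return x * x

data = list(range(10))

# Hypothetical split of the data across two workers.
partitions = [data[:5], data[5:]]

# Each partition is mapped on its own, with no knowledge of the other one.
mapped_partitions = [[square(x) for x in part] for part in partitions]

# Concatenating the partition results matches mapping the whole list at once.
combined = [y for part in mapped_partitions for y in part]
whole = list(map(square, data))
print(combined == whole)
```

Because `square` only looks at its own argument, the result does not depend on where the data is split.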

### Map() Syntax
`map(f, preservesPartitioning=False)`, where `f` is the function (typically a lambda expression) that is applied to each element.

Before we can perform a Map() operation, we first need to create an RDD; we can then call map() on it.

### Example 1


In [2]:
rdd = sparksession.sparkContext.parallelize(range(0, 10))
print(rdd.collect())

rdd.map(lambda x: x*x)
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here, the rdd.map() line calls map on the RDD, and the lambda function takes a number and returns its square. However, when we call collect on the RDD, the values are still the same!

**Remember that RDDs are immutable. Every transformation we call will return a new RDD that needs to be stored in a variable if we want to use it.**

In [3]:
squaredRDD = rdd.map(lambda x: x*x)
squaredRDD.collect()

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

## Filter()
The PySpark filter() function keeps only the elements that satisfy a given condition. On an RDD, the condition is a function that returns True or False for each element. On a DataFrame, filter() takes a column condition or an SQL expression; if you are coming from an SQL background you can use where() instead, which is an alias for filter() and behaves exactly the same.

### Filter() syntax

filter(condition)

### Example 2

Suppose that from squaredRDD, the RDD we created last, we want to drop the values that are divisible by 2 and keep only those that are not.

In [9]:
filterRDD = squaredRDD.filter(lambda x: x%2 != 0)
print(filterRDD.collect())

[1, 9, 25, 49, 81]


## Reduce()
The reduce function is an action operation that will result in one output at the end. We pass to Reduce a function that takes two arguments of the same type and returns a value of the same type as well. 

Basically, two input elements are passed to the reduce function and the return value replaces them. This process is repeated, in parallel across partitions, until a single value is left, which is why the two inputs and the returned value must be of the same type. Because the order in which elements are combined is not guaranteed, the function should also be commutative and associative.
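The pairwise combination can be imitated in plain Python to see how the choice of function interacts with partitioning. The helper `distributed_reduce` and the two splits are hypothetical, made up for this sketch; Spark's real scheduling is more involved.

```python
from functools import reduce

def distributed_reduce(f, partitions):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(f, part) for part in partitions if part]
    return reduce(f, partials)

data = list(range(10))
split_a = [data[:5], data[5:]]   # one hypothetical partitioning
split_b = [data[:3], data[3:]]   # another hypothetical partitioning

def add(x, y):
    return x + y

def sub(x, y):
    return x - y

# Addition is commutative and associative: both splits agree.
sum_a = distributed_reduce(add, split_a)
sum_b = distributed_reduce(add, split_b)

# Subtraction is neither: the result depends on how the data was split.
diff_a = distributed_reduce(sub, split_a)
diff_b = distributed_reduce(sub, split_b)

print(sum_a, sum_b, diff_a, diff_b)
```

With addition both splits give the same total, while subtraction gives a different answer for each split, so it is not a safe function to pass to reduce.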

### Syntax
reduce(f), where f is a function (typically a lambda expression) that takes two elements and returns a single value of the same type.

### Example 3

In [None]:
rdd.reduce(lambda x,y: x+y)

45

## Exercise

For every exercise, start with a new RDD created with `parallelize` from the data you need. Use at least 15 randomly generated values in the range 1 to 100.

### Exercise 1

$$\sum _{i=0}^{10}{i^2}$$

In [25]:
import random
import pyspark
from pyspark.sql import SparkSession

sparka = SparkSession.builder.master("local[2]").appName("sparta").getOrCreate()
ranval = [random.randint(1, 100) for _ in range(15)]
rdd = sparka.sparkContext.parallelize(ranval)
print(rdd.collect())
mapped = rdd.map(lambda x: x ** 2)
print(mapped.collect())
mapped.reduce(lambda x,y: x + y)

[66, 22, 53, 8, 95, 42, 73, 24, 54, 40, 43, 74, 7, 12, 23]
[4356, 484, 2809, 64, 9025, 1764, 5329, 576, 2916, 1600, 1849, 5476, 49, 144, 529]


36970

### Exercise 2

$$\sum _{i=0}^{10}{(i+1)^2}$$

In [28]:

ranval = [random.randint(1, 100) for _ in range(15)]
rdd = sparka.sparkContext.parallelize(ranval)
print(rdd.collect())
ordd = rdd.map(lambda x: (x + 1) ** 2)
print(ordd.collect())
ordd.reduce(lambda x,y: x + y)

[93, 28, 68, 72, 66, 60, 23, 18, 42, 8, 67, 9, 63, 95, 78]
[8836, 841, 4761, 5329, 4489, 3721, 576, 361, 1849, 81, 4624, 100, 4096, 9216, 6241]


55121

### Exercise 3

$$\prod  _{i=0}^{10}{i^2}$$
Here $\prod$ denotes the product (multiplication) of the terms.

In [29]:
import random

ranval = [random.randint(1, 100) for _ in range(15)]
rdd = sparksession.sparkContext.parallelize(ranval)
print(rdd.collect())
ordd = rdd.map(lambda x: x ** 2)
print(ordd.collect())
ordd.reduce(lambda x,y: x * y)

[69, 48, 45, 81, 37, 16, 41, 29, 99, 15, 17, 24, 85, 5, 38]
[4761, 2304, 2025, 6561, 1369, 256, 1681, 841, 9801, 225, 289, 576, 7225, 25, 1444]


6913550365378912646186401233769601433600000000

### Exercise 4

We have the following data collection. Each element has two values: x and y. 

- calculate the sum of all x values
- calculate the sum of all y values
- for each element calculate x*y and return the list of numbers

ex4RDD = sc.parallelize([(1,2),(3,4),(5,6),(7,8),(9,10)])

In [3]:
import pyspark
from pyspark.sql import SparkSession
sparksession = SparkSession.builder.master("local[2]").appName("ex-4").getOrCreate()
ex4RDD = sparksession.sparkContext.parallelize([(1,2),(3,4),(5,6),(7,8),(9,10)])
print(ex4RDD.collect())
xrdd = ex4RDD.map(lambda x: x[0])
print(xrdd.collect())
sx = xrdd.reduce(lambda x, y: x + y)
print(sx)

yrdd = ex4RDD.map(lambda x: x[1])
print(yrdd.collect())
sy = yrdd.reduce(lambda x, y: x + y)
print(sy)

mrdd = ex4RDD.map(lambda x: x[0] * x[1])
print(mrdd.collect())

[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
[1, 3, 5, 7, 9]
25
[2, 4, 6, 8, 10]
30
[2, 12, 30, 56, 90]
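As an alternative to two separate map and reduce passes, both sums can be accumulated in one reduce whose inputs and output are all (x, y) tuples. The sketch below uses plain Python's `functools.reduce` on the same data; the same two-argument lambda should also be usable with `ex4RDD.reduce`, though that is an assumption not run here.

```python
from functools import reduce

points = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

# Combine two (x, y) tuples into one tuple of running sums.
# Inputs and output share the same type, as reduce requires.
sum_x, sum_y = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), points)

print(sum_x, sum_y)
```

Element-wise tuple addition is commutative and associative, so the order in which pairs are combined does not affect the result.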


### Exercise 5

The following data describes houses, with these columns:
"Sell", "List", "Living", "Rooms", "Beds", "Baths", "Age", "Acres", "Taxes"

Each element is a string in which the column values are separated by commas.

Find the total amount of Taxes paid for all the houses. 
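One possible way to work with such rows is to pair each value with its column name. The helper `parse_row` below is hypothetical and plain Python; it is a sketch of the parsing step only, using the first row of the data as its sample input.

```python
COLUMNS = ["Sell", "List", "Living", "Rooms", "Beds",
           "Baths", "Age", "Acres", "Taxes"]

def parse_row(line):
    """Split one comma-separated row and label each value with its column."""
    # float() ignores the spaces that follow each comma.
    values = [float(v) for v in line.split(",")]
    return dict(zip(COLUMNS, values))

row = parse_row("142, 160, 28, 10, 5, 3, 60, 0.28, 3167")
print(row["Taxes"])
```

A function like this could be passed to map so that later steps refer to columns by name instead of by position.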




In [4]:
data = [
"142, 160, 28, 10, 5, 3, 60, 0.28, 3167",
"175, 180, 18, 8, 4, 1, 12, 0.43, 4033",
"129, 132, 13, 6, 3, 1, 41, 0.33, 1471",
"138, 140, 17, 7, 3, 1, 22, 0.46, 3204",
"232, 240, 25, 8, 4, 3, 5, 2.05, 3613",
"135, 140, 18, 7, 4, 3, 9, 0.57, 3028",
"150, 160, 20, 8, 4, 3, 18, 4.00, 3131",
"207, 225, 22, 8, 4, 2, 16, 2.22, 5158",
"271, 285, 30, 10, 5, 2, 30, 0.53, 5702",
"89,  90, 10, 5, 3, 1, 43, 0.30, 2054",
"153, 157, 22, 8, 3, 3, 18, 0.38, 4127",
"87,  90, 16, 7, 3, 1, 50, 0.65, 1445",
"234, 238, 25, 8, 4, 2, 2, 1.61, 2087",
"106, 116, 20, 8, 4, 1, 13, 0.22, 2818",
"175, 180, 22, 8, 4, 2, 15, 2.06, 3917",
"165, 170, 17, 8, 4, 2, 33, 0.46, 2220",
"166, 170, 23, 9, 4, 2, 37, 0.27, 3498",
"136, 140, 19, 7, 3, 1, 22, 0.63, 3607",
"148, 160, 17, 7, 3, 2, 13, 0.36, 3648",
"151, 153, 19, 8, 4, 2, 24, 0.34, 3561",
"180, 190, 24, 9, 4, 2, 10, 1.55, 4681",
"293, 305, 26, 8, 4, 3, 6, 0.46, 7088",
"167, 170, 20, 9, 4, 2, 46, 0.46, 3482",
"190, 193, 22, 9, 5, 2, 37, 0.48, 3920",
"184, 190, 21, 9, 5, 2, 27, 1.30, 4162",
"157, 165, 20, 8, 4, 2, 7, 0.30, 3785",
"110, 115, 16, 8, 4, 1, 26, 0.29, 3103",
"135, 145, 18, 7, 4, 1, 35, 0.43, 3363",
"567, 625, 64, 11, 4, 4, 4, 0.85, 12192",
"180, 185, 20, 8, 4, 2, 11, 1.00, 3831",
"183, 188, 17, 7, 3, 2, 16, 3.00, 3564",
"185, 193, 20, 9, 3, 2, 56, 6.49, 3765",
"152, 155, 17, 8, 4, 1, 33, 0.70, 3361",
"148, 153, 13, 6, 3, 2, 22, 0.39, 3950",
"152, 159, 15, 7, 3, 1, 25, 0.59, 3055",
"146, 150, 16, 7, 3, 1, 31, 0.36, 2950",
"170, 190, 24, 10, 3, 2, 33, 0.57, 3346",
"127, 130, 20, 8, 4, 1, 65, 0.40, 3334",
"265, 270, 36, 10, 6, 3, 33, 1.20, 5853",
"157, 163, 18, 8, 4, 2, 12, 1.13, 3982",
"128, 135, 17, 9, 4, 1, 25, 0.52, 3374",
"110, 120, 15, 8, 4, 2, 11, 0.59, 3119",
"123, 130, 18, 8, 4, 2, 43, 0.39, 3268",
"212, 230, 39, 12, 5, 3, 202, 4.29, 3648",
"145, 145, 18, 8, 4, 2, 44, 0.22, 2783",
"129, 135, 10, 6, 3, 1, 15, 1.00, 2438",
"143, 145, 21, 7, 4, 2, 10, 1.20, 3529",
"247, 252, 29, 9, 4, 2, 4, 1.25, 4626",
"111, 120, 15, 8, 3, 1, 97, 1.11, 3205",
"133, 145, 26, 7, 3, 1, 42, 0.36, 3059",]

In [16]:
def getTaxes(entry):
    # Taxes is the last comma-separated column; int() tolerates the leading space.
    values = entry.split(",")
    return int(values[-1])

rdd = sparksession.sparkContext.parallelize(data)
print(rdd.collect())

trdd = rdd.map(getTaxes)
print(trdd.collect())

tasum = trdd.reduce(lambda x,y: x + y)
print(f"Sum of taxes: {tasum}")

['142, 160, 28, 10, 5, 3, 60, 0.28, 3167', '175, 180, 18, 8, 4, 1, 12, 0.43, 4033', '129, 132, 13, 6, 3, 1, 41, 0.33, 1471', '138, 140, 17, 7, 3, 1, 22, 0.46, 3204', '232, 240, 25, 8, 4, 3, 5, 2.05, 3613', '135, 140, 18, 7, 4, 3, 9, 0.57, 3028', '150, 160, 20, 8, 4, 3, 18, 4.00, 3131', '207, 225, 22, 8, 4, 2, 16, 2.22, 5158', '271, 285, 30, 10, 5, 2, 30, 0.53, 5702', '89,  90, 10, 5, 3, 1, 43, 0.30, 2054', '153, 157, 22, 8, 3, 3, 18, 0.38, 4127', '87,  90, 16, 7, 3, 1, 50, 0.65, 1445', '234, 238, 25, 8, 4, 2, 2, 1.61, 2087', '106, 116, 20, 8, 4, 1, 13, 0.22, 2818', '175, 180, 22, 8, 4, 2, 15, 2.06, 3917', '165, 170, 17, 8, 4, 2, 33, 0.46, 2220', '166, 170, 23, 9, 4, 2, 37, 0.27, 3498', '136, 140, 19, 7, 3, 1, 22, 0.63, 3607', '148, 160, 17, 7, 3, 2, 13, 0.36, 3648', '151, 153, 19, 8, 4, 2, 24, 0.34, 3561', '180, 190, 24, 9, 4, 2, 10, 1.55, 4681', '293, 305, 26, 8, 4, 3, 6, 0.46, 7088', '167, 170, 20, 9, 4, 2, 46, 0.46, 3482', '190, 193, 22, 9, 5, 2, 37, 0.48, 3920', '184, 190, 21, 9, 5