<a href="https://colab.research.google.com/github/jmbanda/BigDataProgramming_2019/blob/master/Chapter4_Working_with_Key_Value_Pairs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Key/Value Pairs

Key/value RDDs are commonly used to perform aggregations, and often we will do some initial ETL (extract, transform, and load) to get our data into a key/value format. Key/value RDDs expose new operations (e.g., counting up reviews for each product, grouping together data with the same key, and grouping together two different RDDs).

Colab-only code:

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
import findspark
findspark.init()

**If you are not on Colab, start from HERE:**

Creating an RDD with textFile() in Python

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Learning_Spark") \
    .getOrCreate()

sc = spark.sparkContext
lines = sc.textFile("spark-2.4.4-bin-hadoop2.7/README.md")

Creating a pair RDD using the first word as the key:

In [0]:
pairs = lines.map(lambda x: (x.split(" ")[0], x))

Let's see what our key/value pairs look like:

In [0]:
print(pairs.collect())
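
To make the shape of these pairs concrete without a Spark session, here is a plain-Python sketch of the same mapping. The three sample lines are made up for illustration and are not the actual README contents.

```python
# Plain-Python sketch of lines.map(lambda x: (x.split(" ")[0], x)).
# sample_lines is hypothetical data, not taken from the Spark README.
sample_lines = ["# Apache Spark", "", "Spark is a fast engine"]

# The key is the first space-separated token; the value is the whole line.
sample_pairs = [(line.split(" ")[0], line) for line in sample_lines]

for key, value in sample_pairs:
    print(repr(key), "->", repr(value))
```

Note that an empty line produces an empty-string key, because `"".split(" ")` returns `[""]`.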

# Transformations on pair RDDs

Let's use {(1, 2), (3, 4), (3, 6)} as our example.

In [0]:
exampleRDDt = sc.parallelize([(1, 2), (3, 4), (3, 6)])
exampleRDD = exampleRDDt.map(lambda x: (int(x[0]), int(x[1])))   # What is this doing?
print(exampleRDD.collect())

***reduceByKey(func)***

Combine values with the same key.

In [0]:
rbK = exampleRDD.reduceByKey(lambda x, y: x + y)
print(rbK.collect())
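
As a rough mental model, `reduceByKey` folds all values that share a key into one value using the supplied function. The sketch below reproduces that on a plain Python list; `reduce_by_key` is a hypothetical helper written for this note, not part of PySpark, and it sorts the result only to make the output deterministic (Spark does not guarantee an order).

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Hypothetical helper: fold the values of each key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sorting is only for a stable printout; Spark gives no ordering guarantee.
    return sorted((key, reduce(func, values)) for key, values in grouped.items())

example = [(1, 2), (3, 4), (3, 6)]
print(reduce_by_key(example, lambda x, y: x + y))  # [(1, 2), (3, 10)]
```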

***groupByKey()***

Group values with the same key.

In [0]:
exampleRDD.groupByKey().collect()

How do we actually see the elements?

In [0]:
exampleRDD.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

***mapValues(func)***

Apply a function to each value of a pair RDD without changing the key.



In [0]:
exampleRDD.mapValues(lambda x: x+1).collect()

***flatMapValues(func)***

Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization.

In [0]:
exampleRDD.flatMapValues(lambda x: range(x,6)).collect()
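
The result of `flatMapValues` can be surprising, because a value whose function returns nothing disappears from the output. The plain-Python sketch below mimics the call above; `flat_map_values` is a hypothetical helper for illustration only.

```python
def flat_map_values(pairs, func):
    """Hypothetical helper: emit one (key, item) pair per item that func(value) yields."""
    return [(key, item) for key, value in pairs for item in func(value)]

example = [(1, 2), (3, 4), (3, 6)]
expanded = flat_map_values(example, lambda x: range(x, 6))
print(expanded)
# [(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)]
```

The pair `(3, 6)` contributes nothing because `range(6, 6)` is empty.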

***keys()***

Return an RDD of just the keys.

In [0]:
exampleRDD.keys().collect()

***values()***

Return an RDD of just the values.

In [0]:
exampleRDD.values().collect()

***sortByKey()***

Return an RDD sorted by the key.

In [0]:
exampleRDD.sortByKey().collect()

***Go back to slides!!!***

# Transformations on two pair RDDs 

rdd = {(1, 2), (3, 4), (3, 6)} 
other = {(3, 9)}

In [0]:
exampleRDDt = sc.parallelize([(1, 2), (3, 4), (3, 6)])
rdd = exampleRDDt.map(lambda x: (int(x[0]), int(x[1])))  

othert = sc.parallelize([(3, 9)])
other = othert.map(lambda x: (int(x[0]), int(x[1])))  

***subtractByKey***

Remove elements with a key present in the other RDD.

In [0]:
rdd.subtractByKey(other).collect()

***join***

Perform an inner join between two RDDs.

In [0]:
rdd.join(other).collect()

***rightOuterJoin***

Perform a join between two RDDs where the key must be present in the other RDD. Keys that appear only in the first RDD are dropped.

In [0]:
rdd.rightOuterJoin(other).collect()

***leftOuterJoin***

Perform a join between two RDDs where the key must be present in the first RDD. Keys missing from the other RDD are paired with `None`.

In [0]:
rdd.leftOuterJoin(other).collect()
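
To see which keys survive each kind of join, the three operations can be modelled on plain Python lists. The helpers below are hypothetical stand-ins written for this note, not PySpark functions, and the order of the output is just the iteration order of the lists (Spark makes no ordering promise).

```python
from collections import defaultdict

def _index(pairs):
    """Group values by key so lookups are easy."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index

def join(left, right):
    # Inner join: only keys present on both sides.
    right_index = _index(right)
    return [(k, (v, w)) for k, v in left for w in right_index.get(k, [])]

def left_outer_join(left, right):
    # Every key of the left side survives; missing matches become None.
    right_index = _index(right)
    return [(k, (v, w)) for k, v in left for w in right_index.get(k, [None])]

def right_outer_join(left, right):
    # Every key of the right side survives; missing matches become None.
    left_index = _index(left)
    return [(k, (v, w)) for k, w in right for v in left_index.get(k, [None])]

rdd_pairs = [(1, 2), (3, 4), (3, 6)]
other_pairs = [(3, 9)]

print(join(rdd_pairs, other_pairs))              # [(3, (4, 9)), (3, (6, 9))]
print(left_outer_join(rdd_pairs, other_pairs))   # [(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
print(right_outer_join(rdd_pairs, other_pairs))  # [(3, (4, 9)), (3, (6, 9))]
```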

***cogroup***

Group data from both RDDs sharing the same key.

In [0]:
rdd.cogroup(other).collect()
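
The output of `cogroup` is a pair of iterables per key, which prints as opaque objects unless converted to lists. A plain-Python sketch of what `subtractByKey` and `cogroup` compute for this example follows; both helpers are hypothetical and exist only to illustrate the semantics.

```python
def subtract_by_key(left, right):
    """Hypothetical helper: keep pairs of left whose key is absent from right."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]

def cogroup(left, right):
    """Hypothetical helper: for each key, collect the values from each side."""
    keys = sorted({key for key, _ in left} | {key for key, _ in right})
    return [
        (key, ([v for k, v in left if k == key], [w for k, w in right if k == key]))
        for key in keys
    ]

rdd_pairs = [(1, 2), (3, 4), (3, 6)]
other_pairs = [(3, 9)]

print(subtract_by_key(rdd_pairs, other_pairs))  # [(1, 2)]
print(cogroup(rdd_pairs, other_pairs))          # [(1, ([2], [])), (3, ([4, 6], [9]))]
```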

Pair RDDs are also still RDDs (of Python tuples), and thus support the same functions as RDDs. For instance, we can take our pair RDD from the previous section and filter out lines that are 20 characters or longer, keeping only those shorter than 20 characters:

In [0]:
result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)
result.collect()

***Back to Slides***

# WordCount Example

Using the file: https://norvig.com/big.txt

Let's do the following tasks:

1) Open and read a text file from the web. NOTE: I know the encoding is UTF-8, but you might have to check, as recommended here (https://stackoverflow.com/questions/37369901/attributeerror-httpresponse-object-has-no-attribute-split)

In [0]:
import urllib.request
data = urllib.request.urlopen("https://norvig.com/big.txt").read().decode('utf-8')
data = data.split("\n")

Now that we have the file by lines, let's do a word count:

In [0]:
words = sc.parallelize(data).flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a,b:a +b)
print(wordCounts.collect())

Is there another way?

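Yes, and it is faster to write! `countByValue()` counts each distinct element in one call and returns the result to the driver as a dictionary.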

In [0]:
# words is already split into individual words, so no second flatMap is needed
result = words.countByValue()
print(result)
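
For intuition, counting values this way is close to what `collections.Counter` does on a local list. The sketch below uses two made-up lines instead of `big.txt`, so the data is an assumption of this example.

```python
from collections import Counter

# Hypothetical sample lines standing in for the downloaded text.
sample = ["the quick fox", "the lazy dog"]

# Same split as the flatMap step: one entry per space-separated token.
words_local = [word for line in sample for word in line.split(" ")]

counts_local = Counter(words_local)
print(counts_local["the"])  # 2
print(dict(counts_local))
```

Because `countByValue()` is an action that brings the whole dictionary back to the driver, it suits cases where the number of distinct values fits in driver memory.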

**BACK TO SLIDES**

# combineByKey

In order to aggregate an RDD’s elements in parallel, Spark’s ***combineByKey*** method requires three functions:

```
createCombiner
mergeValue
mergeCombiner
```

**Creating a Combiner:**

`lambda value: (value, 1)`

The first required argument in the ***combineByKey*** method is a function to be used as the very first aggregation step for each key. The argument of this function corresponds to the value in a key-value pair. If we want to compute the `sum` and `count` using ***combineByKey***, then we can create this “combiner” to be a tuple in the form of `(sum, count)`. The very first step in this aggregation is then `(value, 1)`, where `value` is the first RDD value that ***combineByKey*** comes across and `1` initializes the count.

**Merge a Value:**

`lambda x, value: (x[0] + value, x[1] + 1)`

The next required function tells ***combineByKey*** what to do when a combiner is given a new value. The arguments to this function are a combiner and a new value. The structure of the combiner is defined above as a tuple in the form of `(sum, count)`, so we merge the new value by adding it to the first element of the tuple and adding `1` to the second element.

**Merge two Combiners:**

`lambda x, y: (x[0] + y[0], x[1] + y[1])`

The final required function tells ***combineByKey*** how to merge two combiners. In this example, with tuples as combiners in the form of `(sum, count)`, all we need to do is add the two sums together and the two counts together.
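
To see when each of the three functions fires, here is a single-machine model of the process described above. It assumes the data is already split into partitions (a list of lists) and is only a sketch of the idea, not Spark's implementation; `combine_by_key` and the sample partitions are hypothetical.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Hypothetical single-machine model of combineByKey over a list of partitions."""
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key not in combiners:
                # First time this key is seen in this partition.
                combiners[key] = create_combiner(value)
            else:
                # Key already has a combiner in this partition.
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    # Combine the per-partition results for each key.
    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged

# Two made-up partitions; key "a" appears in both.
partitions = [[("a", 3), ("b", 4)], [("a", 1), ("a", 5)]]

sum_count = combine_by_key(
    partitions,
    lambda value: (value, 1),
    lambda x, value: (x[0] + value, x[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {key: total / count for key, (total, count) in sum_count.items()}

print(sum_count)  # {'a': (9, 3), 'b': (4, 1)}
print(averages)   # {'a': 3.0, 'b': 4.0}
```

The combiner is created once per key per partition, which is why the merge step between combiners is needed at all.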

In [0]:
student_rdd = sc.parallelize([
  ("Joseph", "Maths", 83), ("Joseph", "Physics", 74), ("Joseph", "Chemistry", 91), 
    ("Joseph", "Biology", 82), ("Jimmy", "Maths", 69), ("Jimmy", "Physics", 62), 
    ("Jimmy", "Chemistry", 97), ("Jimmy", "Biology", 80), ("Tina", "Maths", 78), 
    ("Tina", "Physics", 73), ("Tina", "Chemistry", 68), ("Tina", "Biology", 87), 
    ("Thomas", "Maths", 87), ("Thomas", "Physics", 93), ("Thomas", "Chemistry", 91), 
    ("Thomas", "Biology", 74), ("Cory", "Maths", 56), ("Cory", "Physics", 65), 
    ("Cory", "Chemistry", 71), ("Cory", "Biology", 68), ("Jackeline", "Maths", 86), 
    ("Jackeline", "Physics", 62), ("Jackeline", "Chemistry", 75), ("Jackeline", "Biology", 83), 
    ("Juan", "Maths", 63), ("Juan", "Physics", 69), ("Juan", "Chemistry", 64), 
    ("Juan", "Biology", 60)], 3)
 
# Defining createCombiner, mergeValue and mergeCombiner functions
def createCombiner(tpl):
    return (tpl[1], 1)
    
def mergeValue(accumulator, element): 
    return (accumulator[0] + element[1], accumulator[1] + 1)
    
def mergeCombiner(accumulator1, accumulator2): 
    return (accumulator1[0] + accumulator2[0], accumulator1[1] + accumulator2[1])
 
comb_rdd = student_rdd.map(lambda t: (t[0], (t[1], t[2]))) \
                    .combineByKey(createCombiner, mergeValue, mergeCombiner) \
                    .map(lambda t: (t[0], t[1][0]/t[1][1]))
 
# See output nicely
for tpl in comb_rdd.collect():
    print(tpl)

**Important Points**

* Apache Spark ***combineByKey*** is a transformation, hence its evaluation is lazy
* It is a wide transformation: it shuffles data in the last stage of aggregation and creates another RDD
* It is recommended when you need to do further aggregation on grouped data
* Use ***combineByKey*** when the return type differs from the source type (i.e., when you cannot use ***reduceByKey***)



**Go back to Slides!**

# **reduceByKey()** with custom parallelism

In [0]:
dataF = [("a", 3), ("b", 4), ("a", 1)]
sample1 = sc.parallelize(dataF).reduceByKey(lambda x,y: x + y) # Default parallelism
sample2 = sc.parallelize(dataF).reduceByKey(lambda x,y: x + y, 10) # Custom parallelism

print(sample1.collect())
print(sample2.collect())
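
The second argument changes how many partitions the result is spread over, not the values that come out. The sketch below models that idea in plain Python: keys are bucketed by a hash modulo the partition count and reduced inside each bucket. Using `zlib.crc32` as the hash is an assumption made for reproducibility and is not the hash function Spark uses; `reduce_by_key_partitioned` is a hypothetical helper.

```python
import zlib

def reduce_by_key_partitioned(pairs, func, num_partitions):
    """Hypothetical model: bucket keys by hash, then reduce inside each bucket."""
    buckets = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        # crc32 is a stand-in hash (an assumption of this sketch, not Spark's).
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        bucket = buckets[index]
        bucket[key] = func(bucket[key], value) if key in bucket else value
    return buckets

dataF_local = [("a", 3), ("b", 4), ("a", 1)]
for n in (2, 10):
    buckets = reduce_by_key_partitioned(dataF_local, lambda x, y: x + y, n)
    merged = sorted(pair for bucket in buckets for pair in bucket.items())
    print(n, "partitions ->", merged)  # [('a', 4), ('b', 4)] in both cases
```

In PySpark itself, `rdd.getNumPartitions()` reports how many partitions an RDD has.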

# Joins

***Inner join*** example.

The simple join operator is an inner join. Only keys that are present in both pair RDDs are output. When there are multiple values for the same key in one of the inputs, the resulting pair RDD will have an entry for every possible pair of values with that key from the two input RDDs.

In [0]:
names1 = sc.parallelize(("abe", "abby", "apple")).map(lambda a: (a, 1))
names2 = sc.parallelize(("apple", "beatty", "beatrice")).map(lambda a: (a, 1))
names1.join(names2).collect()

Sometimes we don’t need the key to be present in both RDDs to want it in our result. For example, if we were joining customer information with recommendations we might not want to drop customers if there were not any recommendations yet. `leftOuterJoin(other)` and `rightOuterJoin(other)` both join pair RDDs together by key, where one of the pair RDDs can be missing the key.

In [0]:
names1.leftOuterJoin(names2).collect()

In [0]:
names1.rightOuterJoin(names2).collect()

**Go back to slides**

# **Sorting Data**

Syntax: **sortByKey(*ascending=True, numPartitions=None, keyfunc=lambda x: x*)**

Sort and show the first element:


In [0]:
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
sc.parallelize(tmp).sortByKey().first()

In [0]:
sc.parallelize(tmp).sortByKey(True, 1).collect()

In [0]:
sc.parallelize(tmp).sortByKey(True, 2).collect()

Another example. Note the use of `extend` to append more pairs to the list before sorting, and of `keyfunc` to make the sort case-insensitive:

In [0]:
tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
tmp2.extend([('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)])
sc.parallelize(tmp2).sortByKey(True, 3, keyfunc=lambda k: k.lower()).collect()
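
The ordering in these examples follows ordinary Python string comparison, so it can be previewed locally with `sorted`. This is a plain-Python sketch of the ordering only and does not involve partitions.

```python
tmp_local = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
# Digits compare lower than letters, so '1' and '2' come first.
print(sorted(tmp_local, key=lambda kv: kv[0]))
# [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

tmp2_local = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5),
              ('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)]

# Without a key function, uppercase 'M' sorts before every lowercase letter.
case_sensitive = sorted(tmp2_local, key=lambda kv: kv[0])
# Lowercasing the key mirrors keyfunc=lambda k: k.lower().
case_insensitive = sorted(tmp2_local, key=lambda kv: kv[0].lower())

print(case_sensitive[0])    # ('Mary', 1)
print(case_insensitive[0])  # ('a', 3)
```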

# Additional Examples

**Quick Example**: Count all words (naively) in the README.md file:

In [0]:

counts = lines.flatMap(lambda line: line.split(" ")). \
             map(lambda word: (word, 1)). \
             reduceByKey(lambda a, b: a + b)

print(counts.collect())


Sources:

https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html

https://backtobazics.com/big-data/apache-spark-combinebykey-example/

https://supergloo.com/spark-python/apache-spark-transformations-python-examples/