# Unit 4: Programming with Pair RDDs

## Contents
**Pair RDDs**

**Transformations**

**Actions**

**Considerations about performance**

## Pair RDDs

We have seen that a normal RDD is just a collection of elements.

In [6]:
normal_rdd = sc.parallelize(['a', 'b', 'c'])

A Pair RDD is a special type of RDD whose elements are key-value tuples (*pairs*):

In [3]:
pair_rdd = sc.parallelize([('a', 1), ('b', 1), ('c', 1)])

The interesting part about Pair RDDs is that they provide additional transformations and actions.

## Transformations

### keyBy

In [11]:
rdd1 = sc.parallelize(['cat', 'lion', 'dog', 'tiger', 'elephant'])
rdd1.collect()

['cat', 'lion', 'dog', 'tiger', 'elephant']

In [12]:
rdd2 = rdd1.keyBy(lambda line: len(line))
rdd2.collect()

[(3, 'cat'), (4, 'lion'), (3, 'dog'), (5, 'tiger'), (8, 'elephant')]

### groupByKey

![groupByKey](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/images/group_by.png)

In [19]:
rdd1 = sc.parallelize([('a', 1), ('b', 1), ('a', 1), ('a', 1), ('b', 1), ('b', 1), ('a', 1), ('a', 1), ('a', 1), ('b', 1), ('b',1), ('b', 1)])

In [18]:
rdd2 = rdd1.groupByKey()
rdd2.collect()

[('a', <pyspark.resultiterable.ResultIterable at 0x7f0c9a392a10>),
 ('b', <pyspark.resultiterable.ResultIterable at 0x7f0c9c55ca90>)]
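The `ResultIterable` objects in the output above hide the grouped values. In PySpark, one common way to inspect them is to convert each group to a list first, for example with `rdd2.mapValues(list).collect()`. The following is a minimal plain-Python sketch (no Spark involved) of what the grouping computes; `group_by_key` is a hypothetical helper written only for this illustration, and the result is sorted by key just to make it easy to read.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Plain-Python model of groupByKey: collect every value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sort by key only for a stable, readable result.
    return sorted(groups.items())


pairs = [('a', 1), ('b', 1), ('a', 1), ('a', 1), ('b', 1), ('b', 1),
         ('a', 1), ('a', 1), ('a', 1), ('b', 1), ('b', 1), ('b', 1)]

grouped = group_by_key(pairs)
print(grouped)
```

Note that every individual value is kept in the group, which is why this transformation can move a lot of data between nodes.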

### reduceByKey

![reduceByKey](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/images/reduce_by.png)

In [20]:
rdd1 = sc.parallelize([('a', 1), ('b', 1), ('a', 1), ('a', 1), ('b', 1), ('b', 1), ('a', 1), ('a', 1), ('a', 1), ('b', 1), ('b',1), ('b', 1)])

In [21]:
rdd2 = rdd1.reduceByKey(lambda x, y: x + y)
rdd2.collect()

[('a', 6), ('b', 6)]

### sortByKey

In [38]:
rdd1 = sc.parallelize([('c', 1), ('b', 1), ('a', 1)])

In [42]:
rdd2 = rdd1.sortByKey()
rdd2.collect()

[('a', 1), ('b', 1), ('c', 1)]

In [41]:
rdd2 = rdd1.sortByKey(ascending=False)
rdd2.collect()

[('c', 1), ('b', 1), ('a', 1)]

### join

In [24]:
rdd1 = sc.parallelize([('a', 1), ('b', 2), ('c', 3)])
rdd2 = sc.parallelize([('a', 4), ('b', 5), ('c', 6)])

In [25]:
rdd3 = rdd1.join(rdd2)
rdd3.collect()

[('a', (1, 4)), ('c', (3, 6)), ('b', (2, 5))]

The join transformation performs an **inner join**: keys without a match in the other RDD are dropped, and a key that appears several times produces one output pair for each matching combination:

In [28]:
rdd1 = sc.parallelize([('a', 1), ('b', 2)])
rdd2 = sc.parallelize([('a', 4), ('a', 5), ('c', 6)])

In [29]:
rdd3 = rdd1.join(rdd2)
rdd3.collect()

[('a', (1, 4)), ('a', (1, 5))]
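To make the inner-join behaviour above easier to follow, here is a minimal plain-Python sketch of the same semantics on ordinary lists. It is not how Spark implements `join` (Spark shuffles data by key across partitions); `inner_join` is a hypothetical helper used only for illustration.

```python
def inner_join(left, right):
    """Plain-Python model of an inner join on (key, value) pairs."""
    result = []
    for left_key, left_value in left:
        for right_key, right_value in right:
            # Emit one pair per matching combination of keys.
            if left_key == right_key:
                result.append((left_key, (left_value, right_value)))
    return result


left = [('a', 1), ('b', 2)]
right = [('a', 4), ('a', 5), ('c', 6)]

joined = inner_join(left, right)
print(joined)
```

The key `'b'` (only on the left) and the key `'c'` (only on the right) do not appear in the result, while `'a'` appears once for each of its two matches. In Spark the order of the collected pairs is not guaranteed.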

### leftOuterJoin

In [30]:
rdd1 = sc.parallelize([('a', 1), ('b', 2)])
rdd2 = sc.parallelize([('a', 4), ('a', 5), ('c', 6)])

In [31]:
rdd3 = rdd1.leftOuterJoin(rdd2)
rdd3.collect()

[('a', (1, 5)), ('a', (1, 4)), ('b', (2, None))]
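The output above keeps the key `'b'` even though it has no match in `rdd2`, filling the missing side with `None`. A minimal plain-Python sketch of these semantics, assuming ordinary lists instead of RDDs (`left_outer_join` is a hypothetical helper for illustration, not Spark's implementation):

```python
def left_outer_join(left, right):
    """Plain-Python model of a left outer join on (key, value) pairs."""
    result = []
    for left_key, left_value in left:
        matches = [value for key, value in right if key == left_key]
        if not matches:
            # Keep the left element and mark the missing right side with None.
            result.append((left_key, (left_value, None)))
        for right_value in matches:
            result.append((left_key, (left_value, right_value)))
    return result


left = [('a', 1), ('b', 2)]
right = [('a', 4), ('a', 5), ('c', 6)]

outer = left_outer_join(left, right)
print(outer)
```

Every key of the left input survives; the key `'c'`, which exists only on the right, is still dropped.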

## Actions

### countByKey

In [37]:
rdd1 = sc.parallelize([('a', 1), ('b', 1), ('a', 1), ('a', 1), ('b', 1)])
rdd1.countByKey()

defaultdict(int, {'a': 3, 'b': 2})

## Considerations about performance

In general, [avoid using groupByKey](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html).

If you want to perform an aggregation inside each group, it is more efficient to use reduceByKey, aggregateByKey or combineByKey, because they make use of **combiners** to reduce the amount of data passed between the nodes (the same concept as combiners in Hadoop MapReduce).
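The effect of a combiner can be illustrated with a small plain-Python simulation. This is only a sketch under stated assumptions: the three-partition layout of the data is invented for the example, and the helper names (`shuffle_without_combiner`, `shuffle_with_combiner`, `merge`) are hypothetical and do not correspond to Spark functions. It counts how many records would have to leave their partition with and without a local pre-aggregation step.

```python
def shuffle_without_combiner(partitions):
    """Model of groupByKey: every (key, value) pair is sent as it is."""
    return [pair for part in partitions for pair in part]


def shuffle_with_combiner(partitions, func):
    """Model of reduceByKey: reduce inside each partition before sending."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key leaves each partition.
        shuffled.extend(local.items())
    return shuffled


def merge(shuffled, func):
    """Final reduction on the receiving side."""
    totals = {}
    for key, value in shuffled:
        totals[key] = func(totals[key], value) if key in totals else value
    return sorted(totals.items())


def add(x, y):
    return x + y


# Assumed layout: the 12 pairs used earlier, split into 3 partitions.
partitions = [
    [('a', 1), ('b', 1), ('a', 1), ('a', 1)],
    [('b', 1), ('b', 1), ('a', 1), ('a', 1)],
    [('a', 1), ('b', 1), ('b', 1), ('b', 1)],
]

plain = shuffle_without_combiner(partitions)
combined = shuffle_with_combiner(partitions, add)

print(len(plain), len(combined))   # records sent in each case
print(merge(plain, add), merge(combined, add))
```

Both approaches end with the same totals, but in this layout the combiner version sends 6 records instead of 12. The saving grows with the number of repeated keys per partition.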

## Exercises
Now you can try to apply the above concepts to solve the following problems:
* Unit 4 WordCount
* Unit 4 Working with meteorological data 2
* Unit 4 KMeans