 
### Creating a Pair RDD

In [1]:
%spark
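// Key each line by its first word
val pairs = lines.map(x => (x.split(" ")(0), x))
// Only needed when `lines` is a local collection; if `lines` is already an RDD, `pairs` is already a pair RDD
val ppairs = sc.parallelize(pairs)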

### Transformations on a Pair RDD
Example: rdd = {(1, 2), (3, 4), (3, 6)}

**reduceByKey**

rdd.reduceByKey((x,y)=>x+y)
*{(1, 2), (3, 10)}*

**groupByKey**

rdd.groupByKey
*{(1, [2]), (3, [4, 6])}*

**flatMapValues(func)**
rdd.flatMapValues(x => (x to 5))
*{(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)}*

**combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)**

val result = in.combineByKey(
    (v) => (v, 1),                                     // createCombiner: first value seen for a key
    (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),  // mergeValue: fold another value into (sum, count)
    (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)  // mergeCombiners
    ).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().foreach(println)
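
The three functions are easier to follow on plain Scala collections. The sketch below is a local model only, not Spark's implementation: `CombineByKeyModel`, its `combineByKey` helper, and the "partitions as nested Seq" representation are assumptions made for illustration.

```scala
object CombineByKeyModel {
  // Hypothetical helper, not part of Spark: folds each partition with
  // createCombiner/mergeValue, then merges partition results with mergeCombiners.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: inside one partition, the first value for a key creates a combiner;
    // later values for that key are merged into it.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))
          case None    => acc.updated(k, createCombiner(v))
        }
      }
    }
    // Step 2: combiners for the same key from different partitions are merged.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.get(k) match {
          case Some(existing) => a.updated(k, mergeCombiners(existing, c))
          case None           => a.updated(k, c)
        }
      }
    }
  }

  // Per-key average, using the same three functions as the snippet above.
  def averages(partitions: Seq[Seq[(Int, Int)]]): Map[Int, Double] =
    combineByKey[Int, Int, (Int, Int)](partitions)(
      v => (v, 1),
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, count)) => k -> sum.toDouble / count }
}
```

With the example data split into two made-up partitions, `CombineByKeyModel.averages(Seq(Seq((1, 2), (3, 4)), Seq((3, 6))))` gives 2.0 for key 1 and 5.0 for key 3; key 3 is the case where `mergeCombiners` is actually needed.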

**keys & values**

rdd.keys
*{1, 3, 3}*

rdd.values
*{2, 4, 6}*

**sortByKey**

rdd.sortByKey()
*{(1, 2), (3, 4), (3, 6)}*

### Transformations on two pair RDDs
rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}

**subtractByKey**
rdd.subtractByKey(other)
*res = {(1,2)}*

**join**
rdd.join(other)
*res = {(3, (4, 9)), (3, (6, 9))}*

**rightOuterJoin**
rdd.rightOuterJoin(other)
*res = {(3, (Some(4),9)), (3, (Some(6), 9)) }*

**leftOuterJoin**
rdd.leftOuterJoin(other)
*res = {(1, (2, None)), (3, (4,Some(9))), (3, (6, Some(9))) }*

**cogroup**
rdd.cogroup(other)
*res = {(1, ([2], [])), (3, ([4, 6], [9]))}*
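
To make the result shapes above concrete, here is a rough model of `join`, `leftOuterJoin` and `cogroup` on local `Seq`s. It is a sketch of the semantics only: `JoinModel` and its methods are illustrative names, and Spark itself does not guarantee the output order shown here.

```scala
object JoinModel {
  // Inner join: one output pair per matching (left value, right value) combination.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

  // Left outer join: every left pair survives; a key missing on the right yields None.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches: Seq[Option[W]] = right.collect { case (k2, w) if k2 == k => Some(w) }
      val padded: Seq[Option[W]] = if (matches.isEmpty) Seq(None) else matches
      padded.map(m => (k, (v, m)))
    }

  // cogroup: one entry per key seen on either side, holding all values from each side.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      (k, (left.collect { case (`k`, v) => v }, right.collect { case (`k`, w) => w }))
    }.toMap
  }
}
```

Note that `cogroup` keeps key 1 with an empty right-hand side, while `join` drops it entirely.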

### Functions
**filter**
rdd.filter { case (key, value) => value.length < 20 }

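**mapValues (not a filter)**
rdd.mapValues { x => x.length < 20 }
*Returns (key, Boolean) pairs and keeps every element; use filter to actually drop pairs.*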

### Aggregations


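### *Per-key avg with reduceByKey and mapValues*
rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
*For rdd = {(1, 2), (3, 4), (3, 6)}: res = {(1, (2, 1)), (3, (10, 2))}, i.e. (sum, count) per key; divide sum by count to get the average.*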


### *Word count*
val in = sc.textFile("s3://...")
val words = in.flatMap(x => x.split(" "))
val res = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
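
The same pipeline can be tried without a cluster. This is a local sketch on a `Seq[String]`, not Spark code; `WordCountModel` is a made-up name and the comments say which Spark step each line stands in for.

```scala
object WordCountModel {
  // Counts words in local lines of text.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(_.split(" "))   // flatMap: one element per word
    val pairs = words.map(w => (w, 1))        // map: (word, 1) pairs
    // Stand-in for reduceByKey(_ + _): sum the 1s for each word.
    pairs.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).sum }
  }
}
```

For example, `WordCountModel.wordCount(Seq("a b a", "b c"))` counts `a` twice, `b` twice and `c` once.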

### Grouping

**groupByKey** *groups the data by key: on an RDD[(K, V)] we get back an RDD[(K, Iterable[V])]*

### Example sort

In [23]:
%spark
val storeAddr = sc.parallelize(Array(
    (3, "3101 24th St"),
    (4, "23 Seattle"),
    (1, "1026 Valencia St"),
    (2, "748 Van Ness Ave"))
)

In [24]:
%spark
implicit val sortIntegersByString = new Ordering[Int] {
    override def compare(a: Int, b: Int) = a.toString.compare(b.toString)
}
storeAddr.sortByKey().collect.take(4)
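
With store IDs 1 to 4 the string ordering and the natural Int ordering give the same result, so the effect of the custom `Ordering` is easy to miss. The local sketch below uses multi-digit keys to show the difference; `SortByStringOrdering` and its `sortByKey` helper are stand-ins written for this note, not Spark's `sortByKey`.

```scala
object SortByStringOrdering {
  // Same comparison as the notebook cell: Int keys compared by their decimal text.
  val intsAsStrings: Ordering[Int] = new Ordering[Int] {
    override def compare(a: Int, b: Int): Int = a.toString.compare(b.toString)
  }

  // Local stand-in for sortByKey(): sorts pairs by key with whichever
  // Ordering[Int] is in scope (the default numeric one unless overridden).
  def sortByKey[V](pairs: Seq[(Int, V)])(implicit ord: Ordering[Int]): Seq[(Int, V)] =
    pairs.sortBy(_._1)(ord)
}
```

Sorting keys 2, 10, 1 with `intsAsStrings` yields 1, 10, 2, because "10" sorts before "2" as text; with the default ordering the result is 1, 2, 10.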

### Actions
rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}

rdd.countByKey()
*res = {(1, 1), (3, 2)}*

rdd.collectAsMap()
*res = Map(1 -> 2, 3 -> 6)* (a Map holds one value per key, so only one of the two values for key 3 is kept)

rdd.lookup(3)
*res = [4,6]*
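
A quick local model of the three actions, assuming plain Scala collections in place of an RDD. `ActionModel` is an illustrative name; in particular, Scala's `toMap` keeps the last value for a duplicate key, whereas Spark only promises that one value per key is preserved.

```scala
object ActionModel {
  // countByKey: number of pairs stored under each key.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, ps) => k -> ps.size.toLong }

  // collectAsMap: a Map holds one value per key, so duplicate keys collapse.
  def collectAsMap[K, V](pairs: Seq[(K, V)]): Map[K, V] = pairs.toMap

  // lookup: all values stored under one key, in input order.
  def lookup[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
    pairs.collect { case (k, v) if k == key => v }
}
```

If every value per key matters, `lookup` or a grouping transformation is the safer choice than `collectAsMap`.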