# Key/Value RDDs
and the "friends by age" example

## RDDs can hold key/value pairs
- example: number of friends by age
- key is age, value is number of friends
- instead of just a list of ages or a list of friend counts, we can store pairs: (age, # of friends), (age, # of friends), etc.

## Creating a Key/Value Pair
- just map each element into a (key, value) tuple. for example:
        `totalsByAge = rdd.map(lambda x: (x,1))`
    _voilà, you now have a key/value RDD_


- it's ok to have lists as values as well
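
The shape of the data can be tried without Spark. The following is a minimal plain-Python sketch, not Spark code: `ages` and `friends_lists` are made-up sample data, and the list comprehension only stands in for `rdd.map(lambda x: (x, 1))`.

```python
# Plain-Python stand-in for rdd.map(lambda x: (x, 1)).
# `ages` is hypothetical sample data, not taken from the dataset.
ages = [33, 26, 33, 40]

# Each element becomes a (key, value) tuple: key = age, value = 1.
pairs = [(age, 1) for age in ages]
print(pairs)

# Values do not have to be numbers; a list per key works too.
# `friends_lists` is also hypothetical.
friends_lists = [(33, [385, 2]), (26, [2])]
for age, counts in friends_lists:
    print(age, counts)
```

A key/value RDD is not a special type: it is an ordinary RDD whose elements happen to be two-element tuples.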

## Spark can do special stuff with key/value data
- reduceByKey(): combine values with the same key using some function. `rdd.reduceByKey(lambda x,y: x+y)` adds them up.
- groupByKey(): Group values with the same key
- sortByKey(): Sort RDD by key values
- keys(), values(): Create an RDD of just the keys, or just the values
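
What these operations compute can be modelled locally. This is a rough plain-Python sketch of the semantics only, under the assumption that a list of tuples stands in for the RDD; the helper names `group_by_key` and `reduce_by_key` and the sample `pairs` are hypothetical, and nothing here is distributed.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (age, numFriends) pairs; not taken from the dataset.
pairs = [(33, 385), (26, 2), (33, 100), (26, 8), (55, 221)]

def group_by_key(pairs):
    """Local model of groupByKey(): collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(pairs, func):
    """Local model of reduceByKey(func): fold each key's values with func."""
    return {key: reduce(func, values)
            for key, values in group_by_key(pairs).items()}

grouped = group_by_key(pairs)
totals = reduce_by_key(pairs, lambda x, y: x + y)

sorted_pairs = sorted(pairs, key=lambda kv: kv[0])  # like sortByKey()
keys = [k for k, _ in pairs]                        # like keys()
values = [v for _, v in pairs]                      # like values()

print(grouped)
print(totals)
```

In Spark itself, reduceByKey() combines values within each partition before shuffling, so it is usually the better choice than groupByKey() followed by a sum.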

## You can do SQL-style joins on two key/value RDDs
- join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey
- we'll look at an example of this later
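
In the meantime, the semantics of three of these can be sketched in plain Python. This is only a local model under stated assumptions: lists of tuples stand in for the two RDDs, and the helpers `join`, `left_outer_join`, and `subtract_by_key` and the sample data are hypothetical names for illustration, not the Spark API.

```python
# Hypothetical key/value data keyed by age.
names = [(33, "Will"), (26, "Jean-Luc"), (55, "Hugh")]
friends = [(33, 385), (26, 2), (40, 465)]

def join(left, right):
    """Local model of join(): one output per matching pair of keys."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def left_outer_join(left, right):
    """Local model of leftOuterJoin(): keep every left pair,
    using None where the right side has no matching key."""
    right_keys = {k for k, _ in right}
    unmatched = [(k, (v, None)) for k, v in left if k not in right_keys]
    return join(left, right) + unmatched

def subtract_by_key(left, right):
    """Local model of subtractByKey(): drop left pairs whose key is in right."""
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]

print(join(names, friends))
print(left_outer_join(names, friends))
print(subtract_by_key(names, friends))
```

The output order here follows the input lists; Spark makes no such ordering promise after a join.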

## Mapping just the values of a key/value RDD?
- with key/value data, use mapValues() and flatMapValues() if your transformation doesn't affect the keys.
- it's more efficient: Spark can keep the RDD's existing partitioning, since the keys are known not to change
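
The difference between the two can be shown with a small plain-Python model. It is a sketch of the semantics only: `map_values` and `flat_map_values` are hypothetical stand-ins for mapValues() and flatMapValues(), and both sample lists are made up.

```python
def map_values(pairs, func):
    """Local model of mapValues(func): keys pass through untouched."""
    return [(k, func(v)) for k, v in pairs]

def flat_map_values(pairs, func):
    """Local model of flatMapValues(func): func returns several values,
    and each one is paired with the original key."""
    return [(k, x) for k, v in pairs for x in func(v)]

# Hypothetical (age, numFriends) pairs.
pairs = [(33, 385), (26, 2)]
with_counts = map_values(pairs, lambda v: (v, 1))
print(with_counts)

# Hypothetical (id, text) pairs, split into one pair per word.
lines = [("a", "hello world"), ("b", "hi")]
words = flat_map_values(lines, lambda s: s.split())
print(words)
```

mapValues() always yields exactly one output pair per input pair, while flatMapValues() can yield zero, one, or many.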

## Friends by age example
#### Data format
- Input data: ID, name, age, number of friends
        0,Will,33,385
        1,Jean-Luc,26,2
        2,Hugh,55,221
        3,Deanna,40,465
        4,Quark,68,21

In [1]:
from pyspark import SparkContext

sc = SparkContext("local","FriendsByAge")

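def parseLine(line):
    # each line is: ID,name,age,number of friends
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    # emit a key/value pair: key = age, value = number of friends
    return (age, numFriends)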
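
The parser can be checked on its own, without a SparkContext. In this small sketch the function is repeated so the snippet runs by itself, and the input is the first sample row from the data format section.

```python
def parseLine(line):
    # Same logic as the notebook cell: keep only age and friend count.
    fields = line.split(',')
    return (int(fields[2]), int(fields[3]))

sample = "0,Will,33,385"
parsed = parseLine(sample)
print(parsed)  # (33, 385)
```

The ID and name fields are dropped here because the average only needs the age and the friend count.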

In [6]:
lines = sc.textFile("../data/fakefriends.csv")
rdd = lines.map(parseLine)

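totalsByAge = (
    rdd
    .mapValues(lambda x: (x, 1))  # value becomes (numFriends, 1)
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # (total friends, number of people) per age
)

averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])  # average = total friends / number of people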

results = averagesByAge \
    .sortByKey() \
    .collect()

for result in results:
    print(result)
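
The (sum, count) trick in the cell above can be traced by hand. The following plain-Python sketch mirrors the three steps on three made-up rows; `rows` is hypothetical and not taken from fakefriends.csv, and the dictionary loop only models what reduceByKey() does.

```python
# Hypothetical parsed rows: (age, numFriends).
rows = [(33, 385), (33, 2), (55, 221)]

# Step 1, like mapValues(lambda x: (x, 1)): attach a count of 1.
with_counts = [(age, (n, 1)) for age, n in rows]

# Step 2, like reduceByKey(...): add sums and counts for each age.
totals = {}
for age, (n, c) in with_counts:
    s, k = totals.get(age, (0, 0))
    totals[age] = (s + n, k + c)

# Step 3, like mapValues(lambda x: x[0] / x[1]): sum divided by count.
averages = {age: s / k for age, (s, k) in totals.items()}

for age in sorted(averages):
    print((age, averages[age]))
# (33, 193.5)
# (55, 221.0)
```

Carrying the count alongside the sum is what lets a single reduceByKey() pass produce everything the average needs. The notebook's own output for the full dataset follows.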


(18, 343.375)
(19, 213.27272727272728)
(20, 165.0)
(21, 350.875)
(22, 206.42857142857142)
(23, 246.3)
(24, 233.8)
(25, 197.45454545454547)
(26, 242.05882352941177)
(27, 228.125)
(28, 209.1)
(29, 215.91666666666666)
(30, 235.8181818181818)
(31, 267.25)
(32, 207.9090909090909)
(33, 325.3333333333333)
(34, 245.5)
(35, 211.625)
(36, 246.6)
(37, 249.33333333333334)
(38, 193.53333333333333)
(39, 169.28571428571428)
(40, 250.8235294117647)
(41, 268.55555555555554)
(42, 303.5)
(43, 230.57142857142858)
(44, 282.1666666666667)
(45, 309.53846153846155)
(46, 223.69230769230768)
(47, 233.22222222222223)
(48, 281.4)
(49, 184.66666666666666)
(50, 254.6)
(51, 302.14285714285717)
(52, 340.6363636363636)
(53, 222.85714285714286)
(54, 278.0769230769231)
(55, 295.53846153846155)
(56, 306.6666666666667)
(57, 258.8333333333333)
(58, 116.54545454545455)
(59, 220.0)
(60, 202.71428571428572)
(61, 256.22222222222223)
(62, 220.76923076923077)
(63, 384.0)
(64, 281.3333333333333)
(65, 298.2)
(66, 276.4444444444444