Filter. Sometimes we are only interested in the elements of an RDD where a certain column has a particular value or falls within a certain range. For example, in the previous fake friends problem, suppose we are only interested in 33-year-olds:

In [1]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("FriendsByAge")
sc = SparkContext(conf = conf)

def parseLine(line):
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

lines = sc.textFile("fakefriends.csv")
rdd = lines.map(parseLine)
thirtythreeyo = rdd.filter(lambda x: x[0]==33)
thirtythreeyo.collect()

[(33, 385),
 (33, 74),
 (33, 471),
 (33, 275),
 (33, 245),
 (33, 356),
 (33, 460),
 (33, 294),
 (33, 243),
 (33, 463),
 (33, 228),
 (33, 410)]

Or a certain range, say between 40 and 49 inclusive:

In [8]:
forties = rdd.filter(lambda x: 40 <= x[0] <= 49)
forties.collect()

[(40, 465),
 (43, 49),
 (45, 455),
 (42, 363),
 (49, 476),
 (48, 364),
 (43, 249),
 (40, 254),
 (41, 278),
 (44, 194),
 (48, 135),
 (45, 184),
 (40, 459),
 (40, 407),
 (46, 88),
 (46, 63),
 (44, 178),
 (40, 18),
 (41, 244),
 (45, 400),
 (45, 439),
 (47, 429),
 (40, 284),
 (45, 252),
 (46, 462),
 (45, 340),
 (42, 427),
 (45, 470),
 (49, 340),
 (40, 389),
 (44, 360),
 (48, 57),
 (47, 87),
 (43, 404),
 (47, 488),
 (44, 84),
 (48, 287),
 (47, 225),
 (40, 349),
 (45, 497),
 (48, 381),
 (46, 125),
 (41, 206),
 (41, 394),
 (40, 406),
 (44, 277),
 (40, 198),
 (49, 22),
 (48, 345),
 (46, 154),
 (45, 332),
 (41, 260),
 (40, 172),
 (40, 33),
 (49, 106),
 (44, 353),
 (47, 13),
 (46, 300),
 (44, 499),
 (43, 101),
 (40, 56),
 (45, 395),
 (49, 147),
 (46, 319),
 (41, 340),
 (45, 59),
 (43, 48),
 (44, 61),
 (46, 407),
 (40, 7),
 (47, 4),
 (46, 151),
 (46, 352),
 (41, 397),
 (48, 266),
 (47, 97),
 (43, 335),
 (42, 467),
 (45, 147),
 (40, 261),
 (44, 388),
 (45, 54),
 (42, 275),
 (42, 95),
 (48, 394),
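
When the same kind of range filter is needed for several ranges, the condition can be wrapped in a small function that builds the predicate. The following is a minimal sketch in plain Python; the helper name age_between is our own and not part of Pyspark, and the sample pairs are taken from the outputs shown in this section.

```python
def age_between(low, high):
    """Build a predicate for (age, numFriends) pairs with low <= age <= high."""
    return lambda pair: low <= pair[0] <= high


# A few (age, numFriends) pairs copied from the outputs in this section.
sample = [(33, 385), (40, 465), (49, 476), (26, 2)]

# Python's built-in filter keeps only the pairs for which the predicate is True.
forties = list(filter(age_between(40, 49), sample))
print(forties)
```

Assuming the rdd defined earlier, the same predicate could be passed to Pyspark as rdd.filter(age_between(40, 49)).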
 

Find minimum. Say we only want to know the minimum number of friends for each age. The code is:

In [9]:
minim = rdd.reduceByKey(lambda x,y: min(x,y))
minim.collect()

[(33, 74),
 (26, 2),
 (55, 57),
 (40, 7),
 (68, 21),
 (59, 14),
 (37, 46),
 (54, 7),
 (38, 2),
 (27, 53),
 (53, 86),
 (57, 8),
 (56, 15),
 (43, 48),
 (36, 49),
 (22, 6),
 (35, 13),
 (45, 54),
 (60, 2),
 (67, 35),
 (19, 5),
 (30, 17),
 (51, 81),
 (25, 1),
 (21, 89),
 (42, 95),
 (49, 17),
 (48, 57),
 (50, 119),
 (39, 68),
 (32, 24),
 (58, 6),
 (64, 65),
 (31, 15),
 (52, 77),
 (24, 49),
 (20, 1),
 (62, 12),
 (41, 62),
 (44, 61),
 (69, 9),
 (65, 101),
 (61, 2),
 (28, 32),
 (66, 41),
 (46, 63),
 (29, 11),
 (18, 24),
 (47, 4),
 (34, 48),
 (63, 342),
 (23, 65)]

That tells us, for example, that of all the people who are 33 years old, the one with the fewest friends has 74 friends.
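
To see what this reduction does, here is a rough plain-Python sketch of the same idea: group the values by key, then fold each group pairwise with the given function. The helper reduce_by_key is a hypothetical stand-in for illustration only and is not how Pyspark implements it; the sample pairs come from the outputs above.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Fold the values of each key two at a time with func.
    return {key: reduce(func, values) for key, values in grouped.items()}


# (age, numFriends) pairs copied from the outputs in this section.
sample = [(33, 385), (33, 74), (33, 471), (40, 465), (40, 7)]
minimums = reduce_by_key(sample, min)
print(minimums)
```

Since min already takes two arguments, it can be passed directly in place of the lambda.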

Flat map. The <code>map</code> function transforms an RDD into another, but the new RDD has the same number of elements as the original. To transform an RDD into one with a different number of elements, we need a different function called <code>flatMap</code>. For example, say we have the text of a book in a file and we want to know how many words it contains. We first load the book into an RDD:

In [4]:
lines = sc.textFile("Book")
lines.collect()[0:10]

['Self-Employment: Building an Internet Business of One',
 'Achieving Financial and Personal Freedom through a Lifestyle Technology Business',
 'By Frank Kane',
 '',
 '',
 '',
 'Copyright � 2015 Frank Kane. ',
 'All rights reserved worldwide.',
 '',
 '']

We can see that the RDD holds the book one line per element. To transform it into an RDD with one word per element, we need the <code>flatMap</code> function. First, to split the first line, which is a Python string, into a list of words, assuming words are separated by whitespace, the code is as follows:

In [5]:
lines.collect()[0].split()

['Self-Employment:', 'Building', 'an', 'Internet', 'Business', 'of', 'One']

Writing this as a lambda function and applying it to the whole RDD with <code>flatMap</code>, the code is as follows:

In [6]:
words = lines.flatMap(lambda x: x.split())
words.collect()[0:25]

['Self-Employment:',
 'Building',
 'an',
 'Internet',
 'Business',
 'of',
 'One',
 'Achieving',
 'Financial',
 'and',
 'Personal',
 'Freedom',
 'through',
 'a',
 'Lifestyle',
 'Technology',
 'Business',
 'By',
 'Frank',
 'Kane',
 'Copyright',
 '�',
 '2015',
 'Frank',
 'Kane.']
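
The difference between the two functions can be sketched in plain Python, without Spark. This is only an illustration of the semantics, using three lines taken from the start of the book: the map-style result keeps one list per line, while the flatMap-style result flattens everything into a single sequence of words.

```python
from itertools import chain

lines = [
    "Self-Employment: Building an Internet Business of One",
    "By Frank Kane",
    "",
]

# map-like: exactly one output element (a list of words) per input line.
mapped = [line.split() for line in lines]

# flatMap-like: each line yields zero or more words, flattened into one list.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(len(mapped), len(flat_mapped))
```

Note that the empty line contributes an empty list to the map-style result but nothing at all to the flattened one.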

Pyspark also has a simple function that counts the number of elements in an RDD:

In [7]:
words.count()

46249

Let's say we want to know the frequency of each distinct word in the book. Pyspark has another simple function for that, <code>countByValue</code>. It returns a dictionary whose keys are the unique words and whose values are their respective frequencies:

In [None]:
words.countByValue()

Looking at the result, we can see some potential improvements. For example, a word followed by a punctuation mark is counted separately from the same word without it, as is the same word written in a different letter case.
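
One possible way to address both issues is to lowercase each line and split it on runs of non-word characters instead of on whitespace. The sketch below uses Python's re module; the function name normalize_words is our own choice, and the approach is a simplification: it also splits contractions such as "you're" into "you" and "re".

```python
import re

# One or more non-word characters; splitting on this drops punctuation.
_NON_WORD = re.compile(r"\W+")


def normalize_words(text):
    """Lowercase text, split on non-word runs, and drop empty strings."""
    return [word for word in _NON_WORD.split(text.lower()) if word]


print(normalize_words("Kane. kane KANE,"))
```

Under these assumptions, the function could replace the lambda in the earlier cell, as in lines.flatMap(normalize_words).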

Sorting. One way we might want to present our word count result is by frequency, with the most common word first. But Pyspark's <code>sortByKey</code> function sorts by key, which here is the word, so the result would come out in alphabetical order. One neat trick is to simply flip the key and value in each key-value pair, then sort.

In [69]:
wordCount = words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
wordCountSorted = wordCount.map(lambda x: (x[1],x[0])).sortByKey(ascending=False)

In [71]:
wordCountSorted.collect()[:50]

[(1789, 'to'),
 (1339, 'your'),
 (1267, 'you'),
 (1176, 'the'),
 (1148, 'a'),
 (941, 'of'),
 (901, 'and'),
 (641, 'that'),
 (552, 'in'),
 (531, 'is'),
 (500, 'for'),
 (399, 'on'),
 (391, 'are'),
 (347, 'be'),
 (322, 'I'),
 (319, 'can'),
 (311, 'it'),
 (299, 'have'),
 (297, 'as'),
 (292, 'with'),
 (267, 'or'),
 (261, 'business'),
 (237, 'If'),
 (220, 'will'),
 (208, 'this'),
 (199, 'my'),
 (192, 'they'),
 (192, 'but'),
 (189, 'at'),
 (187, 'more'),
 (181, 'about'),
 (177, 'what'),
 (174, '�'),
 (174, 'if'),
 (172, 'an'),
 (169, 'not'),
 (166, 'need'),
 (165, 'time'),
 (161, 'from'),
 (159, "you're"),
 (156, 'do'),
 (155, 'up'),
 (144, 'You'),
 (143, 'new'),
 (138, 'out'),
 (131, 'just'),
 (127, 'how'),
 (125, 'product'),
 (122, 'people'),
 (117, 'their')]
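
If the counts are already in a Python dictionary on the driver, as with the result of countByValue, they can also be ordered locally with plain Python. This is a minimal sketch that assumes the number of distinct words is small enough to fit in memory; the helper name most_common is hypothetical, and the sample counts are copied from the output above.

```python
def most_common(counts, n):
    """Return the n (word, count) pairs with the highest counts."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Counts copied from the top of the sorted output above.
counts = {"your": 1339, "the": 1176, "to": 1789, "you": 1267}
top = most_common(counts, 2)
print(top)
```

Here the pairs stay in (word, count) order, so no flipping is needed.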

Exercise: total spent by customers. The customer-orders.csv file contains data about orders some customers made in a store. The first column of the table is the customer ID, the second column the order ID, and the third column the cost of that order. Using Pyspark, compile a list of the highest spending customers in the store.

In [77]:
lines = sc.textFile("customer-orders.csv")
def parseLine(line):
    fields = line.split(",")
    cust = int(fields[0])
    amount = float(fields[2])
    return (cust,amount)
rdd = lines.map(parseLine)
totalByCustomer = rdd.reduceByKey(lambda x,y: x+y)
totalByCustomer.map(lambda x: (x[1],x[0])).sortByKey(ascending=False).collect()

[(6375.449999999997, 68),
 (6206.199999999999, 73),
 (6193.109999999999, 39),
 (6065.389999999999, 54),
 (5995.660000000003, 71),
 (5994.59, 2),
 (5977.189999999995, 97),
 (5963.109999999999, 46),
 (5696.840000000003, 42),
 (5642.89, 59),
 (5637.62, 41),
 (5524.949999999998, 0),
 (5517.240000000001, 8),
 (5503.43, 85),
 (5497.479999999998, 61),
 (5496.050000000004, 32),
 (5437.7300000000005, 58),
 (5415.150000000001, 63),
 (5413.510000000001, 15),
 (5397.879999999998, 6),
 (5379.280000000002, 92),
 (5368.83, 43),
 (5368.249999999999, 70),
 (5337.44, 72),
 (5330.8, 34),
 (5322.649999999999, 9),
 (5298.090000000002, 55),
 (5290.409999999998, 90),
 (5288.689999999996, 64),
 (5265.750000000001, 93),
 (5259.920000000003, 24),
 (5254.659999999998, 33),
 (5253.3200000000015, 62),
 (5250.4, 26),
 (5245.059999999999, 52),
 (5206.4, 87),
 (5186.429999999999, 40),
 (5155.419999999999, 35),
 (5152.290000000002, 11),
 (5140.3499999999985, 65),
 (5123.010000000001, 69),
 (5112.709999999999, 81),
 (5