# Resilient Distributed Dataset (RDD)
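- Resilient: lost partitions can be recomputed from the recorded transformations
- Distributed: data is split into partitions that are processed across the cluster
- Dataset: a collection of records (lines, numbers, tuples, ...)
- Lazy Evaluation: Transformations are only recorded; Spark does not begin execution until an action is called.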
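
As a rough analogy for lazy evaluation, Python's built-in `map` also defers work until its result is consumed. The sketch below is plain Python, not Spark, and the `calls` list is a hypothetical helper that only records when the function actually runs.

```python
# Plain-Python analogy for lazy evaluation (not Spark itself).
calls = []  # hypothetical log of every real invocation of square()

def square(x):
    calls.append(x)
    return x * x

# Comparable to a transformation: this builds a recipe and runs nothing yet.
lazy_squares = map(square, [1, 2, 3, 4])
calls_before_action = len(calls)

# Comparable to an action: consuming the iterator forces the work to happen.
squares = list(lazy_squares)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, squares)
```

The function is not called at all until `list()` consumes the iterator, which mirrors how an RDD transformation does nothing until an action such as `collect()` is invoked.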

### Spark Context
- Created by driver program
- Creates RDDs
- Responsible for making RDDs resilient and distributed
- SparkContext (sc) object created automatically

### Common RDD Transformations
- map
- flatMap
- filter
- distinct
- sample
- union
- intersection
- subtract
- cartesian
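
For intuition, most of the transformations listed above can be mimicked on ordinary Python lists. This is only a local sketch of the semantics, assuming small in-memory data; the lists `a` and `b` are made up for illustration and nothing here runs on Spark.

```python
from itertools import product

# Hypothetical input data.
a = [1, 2, 2, 3]
b = [3, 4]

mapped = [x * 10 for x in a]                    # map: one output per input
flat_mapped = [y for x in a for y in (x, -x)]   # flatMap: several outputs per input
filtered = [x for x in a if x % 2 == 1]         # filter: keep elements passing the test
distinct = sorted(set(a))                       # distinct (sorted only for readability)
union = a + b                                   # union: concatenation, duplicates kept
intersection = sorted(set(a) & set(b))          # intersection: common elements
subtract = [x for x in a if x not in b]         # subtract: elements of a not in b
cartesian = list(product([1, 2], b))            # cartesian: every pairing

print(mapped, flat_mapped, filtered, distinct)
print(union, intersection, subtract, cartesian)
```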

### Common RDD Actions
- collect
- count
- countByValue
- take
- top
- reduce
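
The actions listed above all return ordinary Python values to the driver. A minimal local sketch of what each one computes, using standard-library stand-ins on a made-up list (an approximation of the semantics, not the Spark implementation):

```python
from collections import Counter
from functools import reduce
import heapq

# Hypothetical input data.
data = [3, 1, 4, 1, 5]

collected = list(data)                     # collect: every element
count = len(data)                          # count: number of elements
count_by_value = dict(Counter(data))       # countByValue: value -> occurrences
first_two = data[:2]                       # take(2): first two elements
top_two = heapq.nlargest(2, data)          # top(2): two largest, descending
total = reduce(lambda x, y: x + y, data)   # reduce: combine pairwise

print(count, count_by_value, first_two, top_two, total)
```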

In [0]:
# import modules
# from pyspark import SparkConf, SparkContext
import re
import collections

### Simple Example

In [0]:
# map example using lambda
rdd1 = sc.parallelize([1, 2, 3, 4])
rdd1.map(lambda x: x*x)

# same result as the following
rdd2 = sc.parallelize([1, 2, 3, 4])
def squareIt(x):
    return x*x

rdd2.map(squareIt)


Out[60]: PythonRDD[237] at RDD at PythonRDD.scala:58

### Full RDD Example (Getting Started Module)

In [0]:
# manually create a Spark context (only needed when sc is not already provided, e.g. in a standalone script)
# conf = SparkConf() \
#     .setMaster("local") \
#     .setAppName("RatingsHistogram")

# sc = SparkContext(conf = conf)

In [0]:
# load data from datalake
# schema: userId, movieId, ratingValue, timeStamp
ratings = sc.textFile("abfss://pyspark-course@dbcourselakehouse.dfs.core.windows.net/ml-100k/u.data")

# display the first five lines of the text file
# take(5) fetches only those lines; collect() would pull the whole file to the driver
for line in ratings.take(5):
    print(line)

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596


In [0]:
# split each line on whitespace and keep index 2 (ratingValue)
ratings = ratings.map(lambda x: x.split()[2])

# count records per rating
result = ratings.countByValue()

# result is collected as key: value pair
# rating: count
result

Out[63]: defaultdict(int, {'3': 27145, '1': 6110, '2': 11370, '4': 34174, '5': 21201})

In [0]:
# sort and print results
sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))

1 6110
2 11370
3 27145
4 34174
5 21201


### Key: Value RDDs
- Each element is a (key, value) tuple, similar to a record in a key-value NoSQL store
- Allows aggregation by keys

### Common Key: Value Functions
- mapValues()
- flatMapValues()
- reduceByKey()
- groupByKey()
- sortByKey()
- keys(), values()
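
To make the key-value functions above concrete, here is a small plain-Python sketch of their semantics. The helpers `map_values`, `reduce_by_key`, and `group_by_key` are hypothetical local stand-ins written for this note, not PySpark APIs, and the input pairs are invented.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs used only for illustration.
pairs = [("a", 1), ("b", 2), ("a", 3)]

def map_values(pairs, fn):
    """Apply fn to each value and leave the key untouched (like mapValues)."""
    return [(key, fn(value)) for key, value in pairs]

def reduce_by_key(pairs, fn):
    """Merge all values sharing a key with a two-argument function (like reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())

def group_by_key(pairs):
    """Collect all values sharing a key into one list (like groupByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

doubled = map_values(pairs, lambda v: v * 2)
summed = reduce_by_key(pairs, lambda x, y: x + y)
grouped = group_by_key(pairs)
keys = [k for k, _ in pairs]      # like keys()
values = [v for _, v in pairs]    # like values()

print(doubled, summed, grouped, keys, values)
```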

In [0]:
# key value rdd example (age: number of friends)
# load data from datalake
# schema: userId, userName, age, numFriends
friends = sc.textFile("abfss://pyspark-course@dbcourselakehouse.dfs.core.windows.net/fakefriends.csv")

In [0]:
# parse (map) input data
def parseFriends(line):
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

rdd3 = friends.map(parseFriends)

In [0]:
# calculate average friends by age
totalsByAge = rdd3 \
    .mapValues(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

averagesByAge = totalsByAge.mapValues(lambda x: int(x[0] / x[1]))
sortedAveragesByAge = averagesByAge.sortBy(lambda x: x[0])

results = sortedAveragesByAge.collect()

for result in results:
    print(result)

(18, 343)
(19, 213)
(20, 165)
(21, 350)
(22, 206)
(23, 246)
(24, 233)
(25, 197)
(26, 242)
(27, 228)
(28, 209)
(29, 215)
(30, 235)
(31, 267)
(32, 207)
(33, 325)
(34, 245)
(35, 211)
(36, 246)
(37, 249)
(38, 193)
(39, 169)
(40, 250)
(41, 268)
(42, 303)
(43, 230)
(44, 282)
(45, 309)
(46, 223)
(47, 233)
(48, 281)
(49, 184)
(50, 254)
(51, 302)
(52, 340)
(53, 222)
(54, 278)
(55, 295)
(56, 306)
(57, 258)
(58, 116)
(59, 220)
(60, 202)
(61, 256)
(62, 220)
(63, 384)
(64, 281)
(65, 298)
(66, 276)
(67, 214)
(68, 269)
(69, 235)
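
The averaging cell above carries a (sum, count) pair per age because an average cannot be merged pairwise, while sums and counts can. A minimal local sketch of that pattern, assuming a few invented (age, numFriends) records rather than the real file:

```python
# Hypothetical (age, numFriends) records; not taken from fakefriends.csv.
pairs = [(33, 385), (33, 2), (55, 221)]

# Accumulate (sum of friends, number of people) per age,
# which is what mapValues + reduceByKey produce in the cell above.
totals = {}
for age, num_friends in pairs:
    running_sum, running_count = totals.get(age, (0, 0))
    totals[age] = (running_sum + num_friends, running_count + 1)

# Divide once at the end; int() truncates, as in the original code.
averages = sorted((age, int(s / c)) for age, (s, c) in totals.items())

print(averages)
```

For age 33 the exact mean is 193.5, and `int()` drops the fraction. If decimals matter, `round(s / c, 2)` would be one alternative.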


### Filtering RDDs
- Keeps only the elements for which a boolean condition is true; all others are dropped from the resulting RDD

In [0]:
# load data from datalake
# schema: weatherStationId, date, entryType, temperature
temps = sc.textFile("abfss://pyspark-course@dbcourselakehouse.dfs.core.windows.net/1800.csv")

In [0]:
# parse input data
def parseTemps(line):
    fields = line.split(',')
    stationId = fields[0]
    entryType = fields[2]
    # raw value is in tenths of a degree Celsius; convert to Fahrenheit
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return (stationId, entryType, temperature)

rdd4 = temps.map(parseTemps)

In [0]:
# calculate min temp for each weather station
minTemps = rdd4.filter(lambda x: 'TMIN' in x[1])
minStationTemps = minTemps.map(lambda x: (x[0], x[2]))
minStationTempsPer = minStationTemps.reduceByKey(lambda x, y: min(x,y))

results = minStationTempsPer.collect()

for result in results:
    print(result[0] + f'\t{round(result[1], 4)}F')

ITE00100554	5.36F
EZE00100082	7.7F


In [0]:
# calculate max temp for each weather station
maxTemps = rdd4.filter(lambda x: 'TMAX' in x[1])
maxStationTemps = maxTemps.map(lambda x: (x[0], x[2]))
maxStationTempsPer = maxStationTemps.reduceByKey(lambda x, y: max(x,y))

results = maxStationTempsPer.collect()

for result in results:
    print(result[0] + f'\t{round(result[1], 4)}F')

ITE00100554	90.14F
EZE00100082	90.14F
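
The filter-then-reduce pattern in the two cells above can be traced by hand on a few records. The sketch below is plain Python with invented station names and raw readings; it assumes, as the conversion formula in `parseTemps` implies, that raw values are tenths of a degree Celsius.

```python
def tenths_celsius_to_fahrenheit(raw):
    """Same arithmetic as parseTemps: tenths of a degree C -> degrees F."""
    return raw * 0.1 * (9.0 / 5.0) + 32.0

# Hypothetical (stationId, entryType, rawTemperature) records.
records = [
    ("STATION_A", "TMIN", -148.0),
    ("STATION_A", "TMAX", 20.0),
    ("STATION_B", "TMIN", -75.0),
    ("STATION_A", "TMIN", -100.0),
]

# filter: keep only the TMIN entries
tmin_only = [r for r in records if r[1] == "TMIN"]

# reduceByKey with min: keep the lowest temperature seen per station
min_by_station = {}
for station, _, raw in tmin_only:
    temp = tenths_celsius_to_fahrenheit(raw)
    min_by_station[station] = min(temp, min_by_station.get(station, temp))

for station, temp in sorted(min_by_station.items()):
    print(f"{station}\t{round(temp, 2)}F")
```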


### Applying map() and flatMap() to RDDs
- map() transforms each element of an RDD into one new element
- flatMap() can create many new elements from each element
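
The difference is easiest to see on two short lines of text. This local sketch uses list comprehensions in place of the RDD methods and the same regular expression as `normalizeWords` in this section; the sample lines are invented.

```python
import re

def normalize_words(text):
    """Lower-case the text and split on runs of non-word characters."""
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

# Hypothetical input lines.
lines = ["Hello, world!", "Spark is fun"]

# map-like: one output element (a list of words) per input line
mapped = [normalize_words(line) for line in lines]

# flatMap-like: the per-line lists are flattened into one list of words
flat_mapped = [word for line in lines for word in normalize_words(line)]

print(mapped)
print(flat_mapped)
```

Splitting a line that ends in punctuation leaves a trailing empty string, which is why the word-count loop in this section skips empty words before printing.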

In [0]:
# load data from datalake
book = sc.textFile("abfss://pyspark-course@dbcourselakehouse.dfs.core.windows.net/book.txt")

In [0]:
# normalize words function
def normalizeWords(text):
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

In [0]:
# calculate unique word count and sort
words = book.flatMap(normalizeWords)
wordCounts = words.countByValue()
sortedWordCounts = sorted(wordCounts.items(), key = lambda x: x[1], reverse = True)

for word, count in sortedWordCounts:
    cleanWord = word.encode('ascii', 'ignore')

    if cleanWord:
        print(cleanWord.decode() + ' ' + str(count))

you 1878
to 1828
your 1420
the 1292
a 1191
of 970
and 934
that 747
it 649
in 616
is 560
for 537
on 428
are 424
if 411
s 391
i 387
business 383
can 376
be 369
as 343
have 321
with 315
t 301
this 280
or 278
time 255
but 242
they 234
will 231
what 229
at 220
my 215
re 214
do 207
not 203
about 202
more 200
product 182
an 178
up 177
need 174
them 166
from 166
how 163
there 162
out 161
new 153
people 145
work 144
so 143
just 142
own 140
all 137
don 133
get 123
customers 123
by 122
want 122
company 122
their 122
some 121
ll 114
self 111
website 109
make 108
may 107
even 104
when 102
one 100
ve 95
than 92
also 91
job 90
much 90
who 88
money 86
was 85
these 82
find 81
sales 80
only 79
into 79
yourself 78
other 78
like 78
no 76
probably 76
employment 75
ads 75
day 73
good 72
many 71
before 70
most 70
might 70
ad 70
should 69
those 68
products 67
market 66
well 66
sure 65
still 65
plan 64
google 63
someone 62
over 62
any 62
software 60
idea 60
enough 59
once 59
then 59
very 58
working 58
think 58

# Exercise: Find Total Amount by Customer

In [0]:
# load data from datalake
# schema: customerId, itemId, amountSpent

orders = sc.textFile("abfss://pyspark-course@dbcourselakehouse.dfs.core.windows.net/customer-orders.csv")

In [0]:
# extract (customerId, amountSpent) pair from each line
def extractCustomerPricePairs(line):
    fields = line.split(',')
    return (int(fields[0]), float(fields[2]))

In [0]:
# calculate sum by customer
mappedOrders = orders.map(extractCustomerPricePairs)
totalByCustomer = mappedOrders.reduceByKey(lambda x, y: x + y)
sortedTotalByCustomer = totalByCustomer.sortBy(lambda x: x[1], ascending = False)

results = sortedTotalByCustomer.collect()

for result in results:
    customer = result[0]
    totalAmount = round(result[1], 2)
    print(f'CustomerId: {customer} \tTotal Amt: ${totalAmount}')

CustomerId: 68 	Total Amt: $6375.45
CustomerId: 73 	Total Amt: $6206.2
CustomerId: 39 	Total Amt: $6193.11
CustomerId: 54 	Total Amt: $6065.39
CustomerId: 71 	Total Amt: $5995.66
CustomerId: 2 	Total Amt: $5994.59
CustomerId: 97 	Total Amt: $5977.19
CustomerId: 46 	Total Amt: $5963.11
CustomerId: 42 	Total Amt: $5696.84
CustomerId: 59 	Total Amt: $5642.89
CustomerId: 41 	Total Amt: $5637.62
CustomerId: 0 	Total Amt: $5524.95
CustomerId: 8 	Total Amt: $5517.24
CustomerId: 85 	Total Amt: $5503.43
CustomerId: 61 	Total Amt: $5497.48
CustomerId: 32 	Total Amt: $5496.05
CustomerId: 58 	Total Amt: $5437.73
CustomerId: 63 	Total Amt: $5415.15
CustomerId: 15 	Total Amt: $5413.51
CustomerId: 6 	Total Amt: $5397.88
CustomerId: 92 	Total Amt: $5379.28
CustomerId: 43 	Total Amt: $5368.83
CustomerId: 70 	Total Amt: $5368.25
CustomerId: 72 	Total Amt: $5337.44
CustomerId: 34 	Total Amt: $5330.8
CustomerId: 9 	Total Amt: $5322.65
CustomerId: 55 	Total Amt: $5298.09
CustomerId: 90 	Total Amt: $5290.41
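
Some totals above print with a single decimal (for example `$6206.2`) because `round()` returns a float and trailing zeros are not shown. One possible tweak, sketched here on two of the rounded totals from the output above, is to use a format specifier instead:

```python
# Two (customerId, total) pairs copied from the printed output above.
totals = [(73, 6206.2), (34, 5330.8)]

# :.2f always renders two decimal places, so amounts line up as currency.
formatted = [
    f"CustomerId: {customer} \tTotal Amt: ${amount:.2f}"
    for customer, amount in totals
]

for line in formatted:
    print(line)
```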