# Class 7 Notebook 1: Spark functional

Class 7 (6 Dec 2016) of [BS1804-1617 Fundamentals of Database Technologies](https://imperialbusiness.school/category/bs1804-1617/) by [Piotr Migdał](http://p.migdal.pl/)

References:

* [Lambda Expressions - The Python Tutorial](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions)
* [Apache Spark](http://spark.apache.org/)
* [Spark Programming Guide](http://spark.apache.org/docs/latest/programming-guide.html)
* [Gentle Introduction to Apache Spark on Databricks](https://docs.databricks.com/_static/notebooks/gentle-introduction-to-apache-spark.html)
* [Apache Spark: An Introduction - Srini Kadamati](http://www.dataquest.io/blog/apache-spark/)

## Python

In [1]:
# run it just to see it works
# (confirm creation of a cluster)
2 + 2

4

In [2]:
# note: the output below comes from Python 3 (true division)
# in Python 2 this would be integer division, giving 1:
3 / 2

1.5

In [3]:
# (Spark works both with Python 2 and Python 3)

In [4]:
# function in Python
def add_one(x):
    return x + 1

In [5]:
add_one(5)

6

In [6]:
# there is another syntax:
add_two = lambda x: x + 2

In [7]:
add_two(5)

7

In [None]:
# why? we will see!
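One reason the short form is handy: a lambda can be written inline wherever a function is expected as an argument. Below is a minimal plain-Python sketch (no Spark needed) using the built-in `map` and `filter`; the names `numbers`, `plus_two`, and `evens` are illustrative only and are not part of the original notebook.

```python
# Plain Python, no Spark required; all variable names are illustrative.
numbers = [5, 21, 3, 14, 6, 9]

# a lambda passed directly as an argument to map
plus_two = list(map(lambda x: x + 2, numbers))

# the same idea for selecting elements with filter
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(plus_two)
print(evens)
```

The Spark methods used later in this notebook take functions as arguments in the same way.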

## Spark

In [9]:
import pyspark
sc = pyspark.SparkContext('local[*]')
# Spark context object
# in Databricks it is provided by default
sc

<pyspark.context.SparkContext at 0x7f3893f998d0>

In [10]:
# creating a Resilient Distributed Dataset (RDD)
# from a Python list
rdd = sc.parallelize([5, 21, 3, 14, 6, 9])

In [11]:
# an RDD is just a reference to the data; displaying it does not fetch anything
rdd

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423

In [12]:
# to get the actual data we need to call collect()
# WARNING: it brings all the data to the driver
# be sure you want that (e.g. it is not terabytes)
rdd.collect()

[5, 21, 3, 14, 6, 9]

In [13]:
# count the number of elements
rdd.count()

6

In [14]:
# get the first one
rdd.first()

5

In [15]:
# get only the first few elements (similar to LIMIT or TOP in SQL)
rdd.take(2)

[5, 21]

In [16]:
# take the elements with the highest values
rdd.top(2)

[21, 14]

In [17]:
# you can also pass a key function that decides the ordering
# (here -x, so the smallest values come first)
# we will discuss this notation shortly
rdd.top(2, key=lambda x: -x)

[3, 5]
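For intuition, the same ordering logic can be reproduced on a local list with `heapq.nlargest` from the standard library. This is only a local analogy, assuming a plain Python list instead of an RDD; it says nothing about how Spark distributes the work. The names `data`, `largest`, and `smallest` are illustrative.

```python
import heapq

# local stand-in for the RDD contents (illustrative)
data = [5, 21, 3, 14, 6, 9]

# two largest elements, like rdd.top(2)
largest = heapq.nlargest(2, data)

# ranking by -x makes the smallest values rank highest,
# like rdd.top(2, key=lambda x: -x)
smallest = heapq.nlargest(2, data, key=lambda x: -x)

print(largest)
print(smallest)
```

The key function changes only how elements are ranked; the returned values are the original elements, not the keys.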

In [18]:
# take sample (without replacement)
rdd.takeSample(False, 5)

[14, 5, 21, 9, 6]

In [19]:
# take sample (with replacement)
rdd.takeSample(True, 5)

[5, 14, 14, 9, 3]
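The difference between the two sampling modes can be sketched locally with the standard `random` module. This is a plain-Python analogy under the assumption of a small in-memory list, not Spark code; `data`, `without_replacement`, and `with_replacement` are illustrative names.

```python
import random

# local stand-in for the RDD contents (illustrative)
data = [5, 21, 3, 14, 6, 9]

# without replacement: each element can be picked at most once
without_replacement = random.sample(data, 5)

# with replacement: the same element may appear several times
with_replacement = random.choices(data, k=5)

print(without_replacement)
print(with_replacement)
```

Because the values in `data` are distinct, the first sample never contains duplicates, while the second one may.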

In [20]:
# taking only some elements
# operations can be chained
rdd.filter(lambda x: x % 2 == 0).collect()

[14, 6]

In [21]:
# if we want to do it in many lines (one instruction per line)
# we need to use \ because a newline normally ends a Python statement
rdd \
  .filter(lambda x: x % 2 == 0) \
  .collect()

[14, 6]

In [22]:
# changing (mapping) elements
rdd.map(lambda x: 10 * x).collect()

[50, 210, 30, 140, 60, 90]

In [23]:
# reducing elements
# (no collect!)
rdd.reduce(lambda x, y: x + y)

58

In [25]:
# if we want to make it shorter, we can import add
from operator import add
rdd.reduce(add)

58

In [26]:
# chaining things is crucial for longer instructions
rdd \
  .map(lambda x: x + 3) \
  .filter(lambda x: x % 2 == 0) \
  .reduce(add)

50
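To check the arithmetic of the chain above without a cluster, here is a minimal plain-Python sketch of the same map, filter, reduce steps using built-ins and `functools.reduce`. It assumes the data fits in a local list; the intermediate names `shifted` and `even_only` are illustrative.

```python
from functools import reduce
from operator import add

# local stand-in for the RDD contents (illustrative)
data = [5, 21, 3, 14, 6, 9]

# map step: add 3 to every element -> 8, 24, 6, 17, 9, 12
shifted = map(lambda x: x + 3, data)

# filter step: keep only even values -> 8, 24, 6, 12
even_only = filter(lambda x: x % 2 == 0, shifted)

# reduce step: fold the remaining values with addition
total = reduce(add, even_only)

print(total)  # 50
```

As with an RDD, `map` and `filter` here are lazy: nothing is computed until `reduce` consumes the values.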

## Exercises

Using Spark:

* What is the number of letters in each word in `rdd2`?
* What is the total number of letters in `rdd2`?
* What is the total number of letters in words with an even number of letters?
* What is the product of `[1, 2, 3, 4, 5, 6]`?

In [27]:
rdd2 = sc.parallelize(["a", "few", "words", "we", "use", "here"])

## Map Reduce

In [37]:
text = ["I like cats", "cats like me", "Cats are fun",
        "CATS are nice", "I LIKE fun"]

In [38]:
rdd3 = sc.parallelize(text)

In [39]:
# to get words
rdd3.flatMap(lambda x: x.split()).take(2)

['I', 'like']
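The difference between mapping and flat-mapping can be sketched in plain Python: mapping `split` over the lines gives one list per line, while flat-mapping concatenates those lists into a single sequence of words. This is a local analogy using `itertools.chain`, with a shortened illustrative input; `nested` and `flat` are made-up names.

```python
from itertools import chain

# shortened, illustrative input
lines = ["I like cats", "cats like me"]

# map-like: one list of words per line (nested result)
nested = [line.split() for line in lines]

# flatMap-like: the per-line lists are joined into one flat list
flat = list(chain.from_iterable(nested))

print(nested)
print(flat)
```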

In [None]:
# how to get lowercased words?
# tip: "CATS".lower()

In [40]:
rdd3 \
  .flatMap(lambda x: x.split()) \
  .map(lambda x: (x, 1)) \
  .reduceByKey(add) \
  .collect()

[('like', 2),
 ('cats', 2),
 ('nice', 1),
 ('CATS', 1),
 ('are', 2),
 ('LIKE', 1),
 ('I', 2),
 ('Cats', 1),
 ('fun', 2),
 ('me', 1)]
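What the key-based reduction does can be sketched locally: values that share a key are combined pairwise with the given function. The helper `reduce_by_key` below is hypothetical (written for this illustration, not a Spark API) and assumes everything fits in a local dictionary.

```python
from operator import add

text = ["I like cats", "cats like me", "Cats are fun",
        "CATS are nice", "I LIKE fun"]

# flatMap + map step: one (word, 1) pair per word
pairs = [(word, 1) for line in text for word in line.split()]


def reduce_by_key(pairs, func):
    """Hypothetical local helper: combine values sharing a key with func."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


counts = reduce_by_key(pairs, add)
print(counts)
```

Words differing only in case ("cats", "Cats", "CATS") remain separate keys here, just as in the output above. In Spark the order of the collected pairs is not guaranteed.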

In [7]:
# change the above to exclude "I" and lowercase all words

# if you want to get only some of the results, try:
# .take(3)
# .top(3)
# .top(3, key=lambda kv: kv[1])
# what is the difference?

In [24]:
rdd.collect()

[5, 21, 3, 14, 6, 9]

In [44]:
rdd3 \
  .flatMap(lambda x: x.split()) \
  .map(lambda x: (x, 1)) \
  .reduceByKey(add) \
  .top(3, key=lambda kv: kv[1])

[('like', 2), ('cats', 2), ('are', 2)]