# Outline

* Intro / motivation to dive into functional programming (FP)
* Spark
* Intro to FP in Scala

# How can FP help you produce better code?

* Relieve some of the burden of flow control
* Control side effects
* Improve error handling
* Produce more reliable programs
* Ease the burden of testing by using the type system to ensure correctness

# Example: calculate the histogram / PDF (relative frequency of each element)

In [11]:
from typing import Any, Sequence, Tuple
def pdf(xs: Sequence[Any]) -> Sequence[Tuple[Any, float]]:
    # TODO: implement
    raise NotImplementedError("!")

In [18]:
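from collections import defaultdict
from typing import Any, Sequence, Tuple
def pdf(xs: Sequence[Any]) -> Sequence[Tuple[Any, float]]:
    """Return a list of (element, probability) tuples."""
    freq = defaultdict(int)
    total = 0
    # First pass: count occurrences of each element and the overall total
    for x in xs:
        freq[x] += 1
        total += 1
    # Second pass: overwrite each count with its relative frequency
    for k in freq:
        freq[k] /= total
    return list(freq.items())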

xs = ['man', 'woman', 'person', 'camera', 'tv']
print(pdf(xs))

[('man', 0.2), ('woman', 0.2), ('person', 0.2), ('camera', 0.2), ('tv', 0.2)]


![map](https://d33wubrfki0l68.cloudfront.net/f0494d020aa517ae7b1011cea4c4a9f21702df8b/2577b/diagrams/functionals/map.png)

![reduce](https://d33wubrfki0l68.cloudfront.net/9c239e1227c69b7a2c9c2df234c21f3e1c74dd57/eec0e/diagrams/functionals/reduce.png)
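
As a rough illustration of the two diagrams above, here is a minimal sketch in plain Python using the built-in `map` and `functools.reduce`. The sample list and variable names are made up for the example.

```python
from functools import reduce

xs = [1, 2, 3, 4]

# map: apply a function to every element independently, producing a new sequence
squares = list(map(lambda x: x * x, xs))

# reduce: fold the sequence into a single value by repeatedly combining
# an accumulator with the next element
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Neither call modifies `xs`; each step returns a new value.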

In [23]:
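from functional import seq  # provided by the PyFunctional package
import numpy as np
# Draw 10 random letters as sample data (unseeded, so the counts vary per run)
xs = np.random.choice(['a', 'b', 'c'], size=10)

# Pair every element with a 1, then add up the 1s for each distinct element
counts = seq(xs).map(lambda x: (x, 1)).reduce_by_key(lambda x, y: x + y)
counts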

| 0 | 1 |
|---|---|
| c | 5 |
| b | 3 |
| a | 2 |
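
To make the `reduce_by_key` step less opaque, the following is a minimal plain-Python sketch of what it is assumed to do here: merge the values of equal keys with the supplied function. The name `reduce_by_key_sketch` and the sample data are hypothetical; this is not PyFunctional's implementation.

```python
from typing import Callable, Hashable, Iterable, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")

def reduce_by_key_sketch(
    pairs: Iterable[Tuple[K, V]], combine: Callable[[V, V], V]
) -> List[Tuple[K, V]]:
    """Hypothetical stand-in: merge the values of equal keys with `combine`."""
    merged = {}
    for key, value in pairs:
        # The first value seen for a key is stored as-is; later ones are combined
        merged[key] = combine(merged[key], value) if key in merged else value
    return list(merged.items())

letters = ["c", "a", "c", "b", "c"]
pairs = [(x, 1) for x in letters]
print(reduce_by_key_sketch(pairs, lambda x, y: x + y))
# [('c', 3), ('a', 1), ('b', 1)]
```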


In [26]:
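# Total number of observations: the sum of all per-element counts
total = counts.map(lambda x: x[1]).sum()
# Divide each count by the total, then sort by element
result_pdf = counts.map(lambda x: (x[0], x[1] / total)).sorted().list()
result_pdf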

[('a', 0.2), ('b', 0.3), ('c', 0.5)]

In [27]:
from typing import Any, Sequence, Tuple
def pdf(xs: Sequence[Any]) -> Sequence[Tuple[Any, float]]:
    """Return a sorted list of (element, probability) tuples."""
    counts = seq(xs).map(lambda x: (x, 1)).reduce_by_key(lambda x, y: x + y)
    total = counts.map(lambda x: x[1]).sum()
    result_pdf = counts.map(lambda x: (x[0], x[1] / total)).sorted().list()
    return result_pdf

In [30]:
print(pdf(xs))

[('a', 0.2), ('b', 0.3), ('c', 0.5)]


# Comparison

```
    freq = defaultdict(int)
    total = 0
    for x in xs:
        freq[x] += 1
        total += 1
    for k in freq:
        freq[k] /= total
    return list(freq.items())
```
## VS
```
    counts = seq(xs).map(lambda x: (x, 1)).reduce_by_key(lambda x, y: x + y)
    total = counts.map(lambda x: x[1]).sum()
    result_pdf = counts.map(lambda x: (x[0], x[1] / total)).sorted().list()
    return result_pdf
```
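* Side effects: the loop version updates `freq` and `total` in place; the pipeline only builds new values
* State / immutability: no intermediate value in the pipeline is modified after it is created
* Referential transparency / pure functions: each lambda depends only on its arguments, so a call can be replaced by its result
* Scalability / parallelization: the `map` and `reduce_by_key` steps share no mutable state, so the work can be split across workers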
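
The difference between a pure function and one with side effects can be sketched as follows. Both helper names (`normalize`, `normalize_in_place`) and the sample counts are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple

def normalize_in_place(counts: Dict[str, float]) -> None:
    """Impure: overwrites the caller's dict, so the original counts are lost."""
    total = sum(counts.values())
    for key in counts:
        counts[key] /= total

def normalize(counts: Dict[str, float]) -> List[Tuple[str, float]]:
    """Pure: reads its input, returns a new value, mutates nothing."""
    total = sum(counts.values())
    return [(key, value / total) for key, value in counts.items()]

counts = {"a": 2, "b": 3, "c": 5}

# Referential transparency: the same input always gives the same output,
# so the call can be repeated, cached, or moved to another worker.
first = normalize(counts)
second = normalize(counts)
print(first == second)  # True
print(counts)           # {'a': 2, 'b': 3, 'c': 5}

# The in-place version replaces the original counts as a side effect.
normalize_in_place(counts)
print(counts)           # {'a': 0.2, 'b': 0.3, 'c': 0.5}
```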


# Spark

* Cluster computing software
* Extends the map-reduce model to queries, DataFrames and stream processing, while abstracting over cluster infrastructure and its complexity
* In-memory intermediate storage
* Handles structured data (tables) and semi-structured data (JSON, XML)
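
To give a feel for the map-reduce model mentioned above, here is a minimal single-machine sketch in plain Python. It does not use Spark's API; the helper names and the round-robin partitioning are assumptions made for the example.

```python
from collections import Counter
from functools import reduce
from typing import List

def split_into_partitions(xs: List[str], n: int) -> List[List[str]]:
    """Hypothetical helper: deal elements round-robin into n partitions."""
    return [xs[i::n] for i in range(n)]

def count_partition(partition: List[str]) -> Counter:
    """Map step: each partition is counted on its own, with no shared state."""
    return Counter(partition)

def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: an associative merge, so partial results combine in any order."""
    return left + right

words = ["a", "b", "a", "c", "b", "a"]
partitions = split_into_partitions(words, 3)
# In a cluster, each of these calls could run on a separate worker
partial = [count_partition(p) for p in partitions]
totals = reduce(merge_counts, partial, Counter())
print(sorted(totals.items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

Because the map step touches only its own partition and the merge is associative, the result does not depend on how the data is split.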

# Spark components

* Language support: Scala, Java, Python, R...
* Additional libraries: Spark SQL, MLlib, GraphX, Spark Streaming
* Base libraries: Spark Core, RDD API, DataFrame API
* Cluster management: YARN, Mesos, Standalone, Kubernetes (K8s)
* Storage / data sources: Local, HDFS, S3, RDBMS, NoSQL

![spark architecture](https://spark.apache.org/docs/latest/img/cluster-overview.png)

# Key Spark abstractions & concepts

* RDD (Resilient Distributed Dataset): a collection of elements.
    - Immutable
    - Partitioned
    - Enables efficient data reuse
    - Fault-tolerant parallel data structure
    - Intermediate persistence (persist)
    - Partition & placement control
    - Manipulation through coarse-grained transforms (map, filter, groupByKey, join...)
* DAG (Directed Acyclic Graph) scheduler
    - Transforms the logical execution plan (the RDD lineage of dependencies) into a physical execution plan of stages and tasks.
* DataFrame: a 2-dimensional data structure with named columns of heterogeneous types
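
Several of the RDD properties listed above (immutable, partitioned, rebuilt from lineage) can be illustrated with a toy model. `MiniRDD` is a hypothetical class written for this sketch in plain Python; it is not Spark's implementation and ignores scheduling, shuffles and persistence.

```python
from typing import Any, Callable, List, Optional

Partition = List[Any]

class MiniRDD:
    """Hypothetical toy model of an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(
        self,
        partitions: Optional[List[Partition]] = None,
        parent: Optional["MiniRDD"] = None,
        fn: Optional[Callable[[Partition], Partition]] = None,
    ) -> None:
        self._partitions = partitions  # source data; None for derived datasets
        self._parent = parent          # lineage: the dataset this one derives from
        self._fn = fn                  # per-partition function applied to the parent

    def map(self, f: Callable[[Any], Any]) -> "MiniRDD":
        # Coarse-grained transform: returns a new MiniRDD and computes nothing yet
        return MiniRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred: Callable[[Any], bool]) -> "MiniRDD":
        return MiniRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._partitions)
        return self._parent.num_partitions()

    def compute_partition(self, i: int) -> Partition:
        # A single partition can be rebuilt from lineage, e.g. after a worker is lost
        if self._parent is None:
            return list(self._partitions[i])
        return self._fn(self._parent.compute_partition(i))

    def collect(self) -> Partition:
        # Action: triggers evaluation of every partition
        return [x for i in range(self.num_partitions())
                for x in self.compute_partition(i)]

source = MiniRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.compute_partition(1))  # [16, 36]
print(evens_squared.collect())             # [4, 16, 36]
```

The chain of `_parent` references plays the role of the lineage graph: nothing is computed until `collect` is called, and the source data is never modified.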