<img src='https://www.rc.colorado.edu/sites/all/themes/research/logo.png'>

# Introduction to Spark

Many examples courtesy of Monte Lunacek.

## Landscape of Distributed Computing

How do you process hundreds of gigabytes of data?

- Filtering unstructured data
- Aggregation
- Large-scale machine learning
- Graph analysis

## Outline

- Functional programming in Python
- Spark's programming model
- As many examples as we can get through!

## Functional Python

<blockquote>
Python acquired lambda, reduce, filter and map, courtesy of a Lisp hacker who missed them and submitted working patches. -Guido van Rossum
</blockquote>

- `map` 
- `reduce`
- `filter`
- `lambda`
- And more: [itertools](https://docs.python.org/2/library/itertools.html), [pytoolz](https://github.com/pytoolz/toolz/)

We will use these concepts (and more) in Spark.

### The `map` abstraction

In [None]:
def square(x):
    return x*x

numbers = [1,2,3]

def map_squares(nums):
    res = []
    for x in nums:
        res.append( square(x) )
    return res

Or, using the built-in `map`:

In [None]:
results = map(square, numbers)

For parallel computing in Python, `map` is a key abstraction: each call to the function is independent of the others, so the work can be divided among workers.

In [None]:
from multiprocessing import Pool

pool = Pool(5)  # start 5 worker processes
results = pool.map(square, numbers)  # same call shape as the built-in map
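
A pool's `map` returns results in input order, so it can stand in for the serial version. The sketch below checks that. It swaps in `ThreadPool` (threads rather than processes) purely so that it runs anywhere without a `__main__` guard; that substitution is an assumption of this example, not the setup used in the cell above.

```python
from multiprocessing.pool import ThreadPool  # thread-backed pool with the same map API

def square(x):
    return x * x

numbers = list(range(10))

pool = ThreadPool(4)  # 4 worker threads; the count is arbitrary
try:
    parallel = pool.map(square, numbers)  # results come back in input order
finally:
    pool.close()
    pool.join()

serial = [square(x) for x in numbers]
print(parallel == serial)
```

Because the function has no side effects, it does not matter which worker handles which element.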

### `lambda`

An anonymous function: a function defined without a name.

In [None]:
lambda_square = lambda x: x*x
map(lambda_square, range(10))

In [None]:
map(lambda x: x*x, range(10))

In [None]:
res = map(lambda x: x*x, range(10))
print res

### `reduce`

Applies a function of **two** arguments cumulatively to the items of a container, from left to right, reducing them to a single value.

In [None]:
def add_num(x1, x2):
    return x1+x2

print reduce(add_num, res)

In [None]:
print reduce(lambda x,y: x+y, res)
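
To make "cumulatively" concrete, here is a minimal sketch of the loop that `reduce` performs when no initial value is given. The helper `fold` is a hypothetical name introduced for illustration; the code is written to run on both Python 2 and Python 3.

```python
from functools import reduce  # reduce is a builtin in Python 2; this import works on both

def fold(func, items):
    """Hypothetical re-implementation of reduce without an initial value."""
    iterator = iter(items)
    acc = next(iterator)  # the first item seeds the accumulator
    for item in iterator:
        acc = func(acc, item)  # combine the running result with the next item
    return acc

squares = [x * x for x in range(10)]
total = fold(lambda x, y: x + y, squares)
print(total == reduce(lambda x, y: x + y, squares))
```

Unlike the real `reduce`, which raises `TypeError` on an empty container, this sketch raises `StopIteration`.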

### `filter`

Constructs a new list from the items for which the applied function returns `True`.

In [None]:
def greater_than_10(x):
    return x>10

filter(greater_than_10, res)

In [None]:
filter(lambda x: x>10, res)
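
The three functions compose into a pipeline, which is the shape most Spark programs take. As a small sketch (written to run on both Python 2 and Python 3), this sums the squares of 0 through 9 that are greater than 10:

```python
from functools import reduce  # reduce is a builtin in Python 2; this import works on both

squares = map(lambda x: x * x, range(10))   # 0, 1, 4, 9, 16, ..., 81
large = filter(lambda x: x > 10, squares)   # 16, 25, 36, 49, 64, 81
total = reduce(lambda x, y: x + y, large)   # collapse to a single value
print(total)  # 271
```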

## Spark Programming Model

Everything starts with a `SparkContext`, the entry point that connects your program to Spark.

In [None]:
import findspark
import os
findspark.init()  # must be called before importing pyspark

import pyspark

In [None]:
sc = pyspark.SparkContext()

This [gist](http://nbviewer.ipython.org/gist/fperez/6384491/00-Setup-IPython-PySpark.ipynb) by Fernando Perez explains how to initialize the `CLUSTER_URL` during the startup of IPython.

- `local`, to run Spark on the local machine
- The URL of a distributed cluster
    - e.g. `spark://node1239:7077`

### Create RDDs

[RDD Documentation](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)

The `parallelize` method is a utility for creating an RDD from a local Python collection.

- Not efficient: it writes the data to a file and reads it back in.

In [None]:
import numpy as np

rdd = sc.parallelize(np.arange(20), numSlices=5)

### Transformations and Actions

**Actions** return values to the driver program. Beware of memory limitations!

- `collect`
- `reduce`
- `take`
- `count`

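**Transformations** return a new RDD, adding an edge to a new vertex in the DAG of computations. They are evaluated lazily: nothing runs until an action needs the result.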

- `map`, `flatMap`
- `reduceByKey`
- `filter`
- `glom`

What does this look like?

- `glom`: Returns an RDD in which all the elements of each partition are coalesced into a list.
- `collect`: Returns a list containing all the elements of an RDD.

In [None]:
for x in rdd.glom().collect():
    print x

In [None]:
rdd = sc.parallelize(np.arange(20), numSlices=10)
for x in rdd.glom().collect():
    print x
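
For intuition about what `glom().collect()` shows, here is a pure-Python sketch that splits a collection into contiguous, near-equal slices. The function `split_into_slices` is hypothetical and is not part of PySpark; it assumes even contiguous slicing, and the way Spark actually assigns elements to partitions may differ.

```python
def split_into_slices(items, num_slices):
    """Hypothetical helper: split items into num_slices contiguous pieces."""
    items = list(items)
    n = len(items)
    return [items[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

# Stands in for rdd.glom().collect(): one list per partition.
partitions = split_into_slices(range(20), 5)
for part in partitions:
    print(part)

# Stands in for rdd.collect(): all elements in one flat list.
flattened = [x for part in partitions for x in part]
```

More slices means smaller partitions, and so more (but smaller) units of parallel work.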

### `map` and `flatMap`

`map` returns a new RDD by applying a function to each element. `flatMap` returns a new RDD by first applying a function to each element and then flattening the results.

In [None]:
rdd = sc.parallelize([ [2, 3, 4],[0, 1],[5, 6, 7, 8] ])
rdd.collect()

In [None]:
rdd.map(lambda x: range(len(x))).collect()

Or I can flatten the results...

In [None]:
rdd.flatMap(lambda x: range(len(x))).collect()

Or flatten the original nested lists:

In [None]:
rdd.flatMap(lambda x: x).collect()
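
The difference between the two can be reproduced without Spark. The following is a plain-Python sketch of the semantics only, with `itertools.chain.from_iterable` doing the flattening; it says nothing about how Spark distributes the work.

```python
from itertools import chain

data = [[2, 3, 4], [0, 1], [5, 6, 7, 8]]
func = lambda x: range(len(x))

# map: one output per input element, so the nesting is preserved
mapped = [list(func(x)) for x in data]

# flatMap: apply the function, then concatenate the per-element results
flat_mapped = list(chain.from_iterable(func(x) for x in data))

# flatMap with the identity function just removes one level of nesting
flattened = list(chain.from_iterable(data))

print(mapped)       # [[0, 1, 2], [0, 1], [0, 1, 2, 3]]
print(flat_mapped)  # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(flattened)    # [2, 3, 4, 0, 1, 5, 6, 7, 8]
```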

### Reduction

In [None]:
rdd.flatMap(lambda x: x).reduce(lambda x,y: x+y)

In [None]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
rdd.collect()

In [None]:
rdd.reduceByKey(lambda x,y: x+y).collect()
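
What `reduceByKey` computes can be sketched in plain Python: group the pairs by key and fold each group's values with the supplied function. The helper `reduce_by_key` below is hypothetical and returns a dict for convenience, whereas Spark returns an RDD of `(key, value)` pairs in no guaranteed order.

```python
def reduce_by_key(func, pairs):
    """Hypothetical single-machine stand-in for rdd.reduceByKey(func)."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)  # merge into the running value
        else:
            result[key] = value  # first value seen for this key
    return result

pairs = [("a", 1), ("b", 1), ("a", 2)]
totals = reduce_by_key(lambda x, y: x + y, pairs)
print(sorted(totals.items()))  # [('a', 3), ('b', 1)]
```

In Spark the function should be associative and commutative, because values for a key may be merged in any order across partitions.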

In [None]:
rdd = sc.parallelize([("hamlet", 1), ("claudius", 1), ("hamlet", 1)])

In [None]:
rdd.countByKey()
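
`countByKey` is an action: it returns a dictionary of counts to the driver and ignores the values in the pairs. A plain-Python sketch of the same result, with `collections.Counter` standing in for Spark:

```python
from collections import Counter

pairs = [("hamlet", 1), ("claudius", 1), ("hamlet", 1)]

# Count how many times each key occurs; the values are never read.
counts = Counter(key for key, _ in pairs)
print(dict(counts))

# Changing the values does not change the counts.
same_counts = Counter(key for key, _ in [("hamlet", 5), ("claudius", 7), ("hamlet", 9)])
print(counts == same_counts)
```

Here the result happens to match `reduceByKey` with addition only because every value is 1.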