----
Apache Spark
============
<img src="images/spark-logo.png">

> “Spark is a Swiss Army knife of Big Data analytics tools”

> — Reynold Xin (@rxin), Berkeley AmpLab Development Lead

__When we learn a new framework__

<img src="http://tclhost.com/vgTJ7Zi.gif" style="width: 400px;"/>

[Source](http://thecodinglove.com/post/142839984399/when-we-learn-a-new-framework)

----
What is Spark?
------

- Spark is a framework for distributed processing.

- It is a streamlined alternative to MapReduce.


Why learn Spark?
---------

- Spark enables you to analyze petabytes of data using a variety of methods (e.g., streaming, SQL, and machine learning).
- Spark is significantly faster than MapReduce, especially for iterative workloads.
- Paradoxically, Spark's API is also simpler than the MapReduce API.
- [Spark skills are in high demand (and pay well)](http://www.indeed.com/salary/q-Apache-Spark-Developer-l-San-Francisco,-CA.html)

---
By the end of this session, you will be able to:
---

- Understand the fundamental architecture and abstractions of Spark
- Use the terminology: Resilient Distributed Dataset (RDD), transformation, and action
- Create RDDs to distribute data across a cluster
- Use Spark to analyze data

Matei Zaharia
-------------

<img src="images/matei.jpg" style="width: 400px;"/>

----
Essence of Spark (aka, the world's worst cologne)
----------------

What is the basic idea of Spark? __Leverage Memory__

![](images/memory.png)

- Spark takes the MapReduce paradigm and changes it in some critical
  ways.

- Instead of chaining separate MapReduce jobs, a single Spark job
  consists of a series of map and reduce functions.

- The intermediate data between those steps is kept in memory instead
  of being written to local disk or to HDFS.

![](images/memory_detailed.png)

---
RTFM for Spark
----

- In-memory analytics, many times faster than Hadoop
- A general-purpose computation framework that leverages distributed memory
    - More flexible than MapReduce (it supports general execution graphs)
    - Linear scalability and fault tolerance
- Designed for running iterative algorithms
- It supports a rich set of higher-level tools, including:
    - Spark SQL for SQL and structured data
    - MLlib for machine learning
    - GraphX for graph processing
    - Spark Streaming for real-time data processing
- Highly compatible with Hadoop's storage APIs
- Programming in Scala, Python, or Java

Points to Ponder
--------

<details><summary>
Since Spark keeps intermediate data in memory to gain speed, what does it make us give up? Where's the catch?
</summary>
1. Spark trades memory for performance.
<br>
2. Spark apps are faster, but they also consume more memory.
<br>
3. Spark outshines MapReduce on iterative algorithms, where the overhead of saving each step's results to HDFS slows MapReduce down.
</details>

----
Spark Architecture Simple
---------------

<img src="images/spark-cluster.png">


----
Spark Architecture More Complex
---------------

![](images/architecture2.png)

Spark Terminology
-----------------

Term                   |Meaning
----                   |-------
Driver                 |Process that contains the Spark Context
Executor               |Process that executes one or more Spark tasks
Master                 |Process which manages applications across the cluster
Worker                 |Process which manages executors on a particular worker node

----
Spark Demo
---------

<img src="http://images.mentalfloss.com/sites/default/files/styles/article_640x430/public/flip-coin_5.jpg" style="width: 400px;"/>

Flip a coin a "big data" number of times. What fraction of the time do you get heads?

---
Spark Install Sidebar
----

Installing and configuring software suuuuuux!

Try using Databricks' cloud. There is also a local install:

In [4]:
%%bash

brew install apache-spark # Automatically builds the right version with pyspark!



Ways to Launch
----

```shell
# PySpark with IPython
IPYTHON=1 pyspark

# PySpark with IPython Notebook / Jupyter Notebook
IPYTHON_OPTS="notebook" pyspark 

# Spark with Scala REPL
spark-shell --master local[*]

```

In [3]:
from pyspark import SparkContext
sc = SparkContext()

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by <module> at /Users/brianspiering/anaconda/envs/py2_de/lib/python2.7/site-packages/IPython/utils/py3compat.py:288 

`IPYTHON_OPTS="notebook" pyspark` automagically makes the Spark context (`sc`) for you, which is why creating a second one raises the error above.

In [1]:
sc #=> <pyspark.context.SparkContext at 0x10b477b10>

<pyspark.context.SparkContext at 0x1117bb4d0>

In [7]:
import random
flips = 1000000
heads = (sc.parallelize(xrange(flips))
         .map(lambda i: random.random())
         .filter(lambda r: r < 0.51)
         .count())

ratio = float(heads)/float(flips)

print(heads)
print("{:.0%}".format(ratio))

509452
51%


Highlights
-----

- `sc.parallelize` creates an RDD (more on that later)

- `map` and `filter` are __transformations__ (functional fun!)

- `count` is an __action__: it triggers the computation and returns the
  result (here, a single number) to the driver.
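
A plain-Python analogy may help here. The sketch below is not Spark code: `lazy_map`, `lazy_filter`, and `count` are hypothetical helpers built on generators, and the `calls` list exists only to show when work happens. It mirrors the idea that transformations merely describe a computation, while an action makes it run.

```python
# Local analogy only: plain Python generators, no Spark involved.
calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


def lazy_map(func, items):
    """Transformation analogue: describes the work, does not run it yet."""
    return (func(item) for item in items)


def lazy_filter(pred, items):
    """Transformation analogue: also lazy."""
    return (item for item in items if pred(item))


def count(items):
    """Action analogue: forces the whole pipeline to run."""
    return sum(1 for _ in items)


pipeline = lazy_filter(lambda v: v > 4, lazy_map(double, range(5)))
work_before_action = len(calls)  # nothing has been computed so far
result = count(pipeline)         # the action triggers the computation
work_after_action = len(calls)

print(work_before_action, result, work_after_action)
```

Building `pipeline` calls `double` zero times; only `count` pulls the values through.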


----
Spark Context
---

Only one SparkContext can be active per driver process, so stop the existing one before creating another.

In [8]:
# Stop the current SparkContext
sc.stop()

Some data

In [9]:
%%bash
echo '1,Hillary,Clinton,hclinton@hotmail.com,Female,7.247.200.34
2,Donald,Trump,the_boss_man@trump.com,Male,212.79.109.69
3,Ted,Cruz,cruzing_with_ted@yahoo.com,Female,150.106.140.235
4,Bernie,Sanders,sanders@freenet.org,Male,175.21.69.76' >logs.txt

Create a new context, this time with an explicit master (`local`) and app name

In [10]:
from pyspark import SparkContext

logFile = "logs.txt"  # Should be some file on your system

sc = SparkContext("local", "Simple App")

logData = sc.textFile(logFile).cache()
male_count = logData.filter(lambda s: 'Male' in s).count()
print("Number of 'Males': {}".format(male_count))

Number of 'Males': 2


---
Programming model 
-----

Resilient Distributed Datasets (RDDs) are basic building blocks:

- Distributed collections of objects that can be cached in memory across cluster nodes
- Automatically rebuilt on failure

RDD operations:

- Transformations: create new RDDs from existing ones
- Actions: return a value to the driver program after running a computation on the dataset
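
The two lists above can be sketched with a toy class. `MiniRDD` is a hypothetical, single-machine stand-in and not part of PySpark; it only records how each dataset was derived (its lineage) so that lost data can be recomputed. Unlike real Spark, this toy keeps every intermediate result in memory without being asked.

```python
# Toy model of RDD lineage. MiniRDD is a made-up name, not a PySpark class.
class MiniRDD(object):
    def __init__(self, source=None, parent=None, func=None):
        self.source = source  # base data (only set on the root)
        self.parent = parent  # the MiniRDD this one was derived from
        self.func = func      # how to derive this one from its parent
        self._cache = None    # stands in for data held in memory

    def map(self, f):
        # Transformation: returns a new MiniRDD and computes nothing yet.
        return MiniRDD(parent=self, func=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        # Transformation: same idea, keeps rows where pred(row) is true.
        return MiniRDD(parent=self,
                       func=lambda rows: [r for r in rows if pred(r)])

    def _compute(self):
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = self.func(self.parent._compute())
        return self._cache

    def lose_cache(self):
        # Simulate a failure that wipes all derived in-memory data.
        node = self
        while node is not None:
            node._cache = None
            node = node.parent

    def collect(self):
        # Action: returns a plain local list to the caller (the "driver").
        return list(self._compute())


evens_squared = (MiniRDD(source=range(6))
                 .filter(lambda x: x % 2 == 0)
                 .map(lambda x: x * x))
first = evens_squared.collect()
evens_squared.lose_cache()          # pretend the data was lost
rebuilt = evens_squared.collect()   # recomputed from the recorded lineage
print(first, rebuilt)
```

The point of the design is that only the recipe (parent plus function) has to survive a failure, not the data itself.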

Check out the [RDD docs](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)

Spark Terminology
-----------------

Term                   |Meaning
----                   |-------
RDD                    |*Resilient Distributed Dataset* or a distributed sequence of records
Transformation         |Spark operation that produces an RDD
Action                 |Spark operation that produces a local object

![](images/api.png)

![](images/workflow.png)

---
Spark Operations
---

Transformations:

- map
- filter
- groupBy

Actions:

- count
- collect
- saveAsTextFile

![](images/transformations.png)

![](images/actions.png)

Lambda (😍)
-------------------

- Instead of a `lambda`, you can pass a fully defined function to
  `map`, `filter`, and other RDD transformations.

- Use `lambda` for short functions. 

- Use `def` for more substantial functions.
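
A small illustration of that choice, using Python's built-in `filter` and `map` as local stand-ins for the RDD methods. The word list and the `is_long_word` helper are made up for this example; with an RDD you would pass the same function object, e.g. `rdd.filter(is_long_word)`.

```python
def is_long_word(word):
    """Keep words longer than four characters, ignoring edge punctuation."""
    return len(word.strip(".,!?")) > 4


words = ["Spark", "is", "faster,", "they", "say."]

# A named function reads better once the logic needs a docstring.
long_words = list(filter(is_long_word, words))

# A lambda is fine for a one-liner.
shouted = list(map(lambda w: w.upper(), words))

print(long_words)
print(shouted)
```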

Common RDD Constructors
-----------------------

Expression                               |Meaning
----------                               |-------
`sc.parallelize(list1)`                  |Create RDD of elements of list
`sc.textFile(path)`                      |Create RDD of lines from file

Common Transformations
----------------------

Expression                               |Meaning
----------                               |-------
`filter(lambda x: x % 2 == 0)`           |Discard non-even elements
`map(lambda x: x * 2)`                   |Multiply each RDD element by `2`
`map(lambda x: x.split())`               |Split each string into words
`flatMap(lambda x: x.split())`           |Split each string into words and flatten sequence
`sample(withReplacement=True, fraction=0.25)` |Create sample of roughly 25% of elements with replacement
`union(rdd)`                             |Append `rdd` to existing RDD
`distinct()`                             |Remove duplicates in RDD
`sortBy(lambda x: x, ascending=False)`   |Sort elements in descending order


Common Actions
--------------

Expression                             |Meaning
----------                             |-------
`collect()`                            |Return all RDD elements to the driver as a list
`take(3)`                              |First 3 elements of RDD
`top(3)`                               |Largest 3 elements of RDD, in descending order
`takeSample(withReplacement=True, num=3)` |Create sample of 3 elements with replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)
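
The three numeric actions have close local equivalents in Python's `statistics` module. This sketch assumes Python 3 and is not Spark code; the sample values are made up. As far as the PySpark docs describe it, `stdev()` on an RDD is the population standard deviation (there is a separate `sampleStdev()`), so the nearest local match is `statistics.pstdev`.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

total = sum(values)                        # analogue of rdd.sum()
average = statistics.mean(values)          # analogue of rdd.mean()
population_sd = statistics.pstdev(values)  # analogue of rdd.stdev()
sample_sd = statistics.stdev(values)       # analogue of rdd.sampleStdev()

print(total, average, population_sd, sample_sd)
```

The sample version divides by `n - 1` instead of `n`, so it is always a little larger on the same data.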

Pop Quiz
--------

Q: What will this output?

In [11]:
sc.parallelize([1,3,2,2,1]).distinct().collect()

[1, 2, 3]

Q: What will this output?

In [13]:
sc.parallelize([1,3,2,2,1]).sortBy(lambda x: x).collect()

[1, 1, 2, 2, 3]

Create this input file.

In [15]:
%%writefile input.txt
hello world
another line
yet another line
yet another another line

Writing input.txt


- What do you get when you run this code?

In [16]:
sc.textFile('input.txt') \
    .map(lambda x: x.split()) \
    .count()

4

- What about this?

In [17]:
sc.textFile('input.txt') \
    .flatMap(lambda x: x.split()) \
    .count()

11

---
Check for understanding
--------

<details><summary>
In this Spark job, what is the transformation and what is the action? 
`sc.parallelize(xrange(10)).filter(lambda x: x % 2 == 0).collect()`
</summary>
1. `filter` is the transformation.
<br>
2. `collect` is the action.
</details>

---
Spark Job
---

- A Spark job consists of a series of transformations followed by an
  action.

- The driver pushes the data and the work out to the cluster, all
  computation happens on the *executors*, and the result is then sent
  back to the driver.
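
The flow of a job can be imitated on one machine. In this sketch (Python 3, no Spark), threads stand in for executors and the helper names `split_into_partitions`, `count_evens`, and `run_job` are hypothetical; it only shows the shape of "distribute, compute per partition, send small results back".

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: deal the data out into roughly equal chunks."""
    data = list(data)
    return [data[i::num_partitions] for i in range(num_partitions)]


def count_evens(partition):
    """Executor side: filter and count within a single partition."""
    return sum(1 for x in partition if x % 2 == 0)


def run_job(data, num_partitions=4):
    partitions = split_into_partitions(data, num_partitions)
    # Threads stand in for executors; each one sees only its own partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_evens, partitions))
    # Only the small per-partition results travel back to the driver.
    return sum(partial_counts)


even_count = run_job(range(1000))
print(even_count)
```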
    
Spark Terminology
-----------------

Term                   |Meaning
----                   |-------
Spark Job              |Sequence of transformations on data with a final action
Spark Application      |Sequence of Spark jobs and other code


----
Finding Primes
----

Find all the primes less than 100.

In [20]:
def is_prime(number):
    "Determine if a number is prime"
    factor_min = 2
    factor_max = int(number**0.5)+1
    for factor in xrange(factor_min,factor_max):
        if number % factor == 0:
            return False
    return True

Use this to filter out non-primes.

In [24]:
numbers = xrange(2,100)
primes = sc.parallelize(numbers)\
            .filter(is_prime)\
            .collect()
    
print(primes)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]


Check for understanding
--------

<img src="images/spark-cluster.png">

<details><summary>
Q: Where does `is_prime` execute?
</summary>
On the executors.
</details>

<details><summary>
Q: Where does the RDD code execute?
</summary>
On the driver. The driver builds the chain of transformations; the executors run the functions passed to them.
</details>

---
Map vs FlatMap
----

[Source](http://stackoverflow.com/questions/22350722/can-someone-explain-to-me-the-difference-between-map-and-flatmap-and-what-is-a-g)

Map:

In [25]:
sc.textFile('input.txt') \
    .map(lambda x: x.split()) \
    .collect()

[[u'hello', u'world'],
 [u'another', u'line'],
 [u'yet', u'another', u'line'],
 [u'yet', u'another', u'another', u'line']]

`map` transforms an RDD of length N into another RDD of length N.

FlatMap:

In [26]:
sc.textFile('input.txt') \
    .flatMap(lambda x: x.split()) \
    .collect()

[u'hello',
 u'world',
 u'another',
 u'line',
 u'yet',
 u'another',
 u'line',
 u'yet',
 u'another',
 u'another',
 u'line']

On the other hand, `flatMap` (loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.

Data scientists often reach for `flatMap` when tokenizing text, because they want one flat RDD of words rather than one list per line.
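
The same difference can be reproduced without Spark. This is a local analogy using a list comprehension for `map` and `itertools.chain.from_iterable` for the flattening step; the `lines` list repeats the contents of `input.txt`.

```python
from itertools import chain

lines = [
    "hello world",
    "another line",
    "yet another line",
    "yet another another line",
]

# map analogue: one output (a list of words) per input line.
mapped = [line.split() for line in lines]

# flatMap analogue: split each line, then flatten into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(len(mapped), len(flat_mapped))
```

The lengths match the counts from the pop quiz: 4 lines versus 11 words.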

---
Word Count
----

![](images/word_count_demo.png)

### Hadoop

```java

package org.apache.hadoop.examples;
 
import java.io.IOException;
import java.util.StringTokenizer;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
 
public class WordCount {
 
    public static class TokenizerMapper extends
        Mapper<Object, Text, Text, IntWritable> {
 
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
 
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
 
    public static class IntSumReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
 
        private IntWritable result = new IntWritable();
 
        public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
 
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
 
        Job job = new Job(conf, "word count");
 
        job.setJarByClass(WordCount.class);
 
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
 
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
 
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
 
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

### Spark

```python
from pyspark import SparkContext
 
logFile = "hdfs://localhost:9000/user/bigdatavm/input"
 
sc = SparkContext("spark://bigdata-vm:7077", "WordCount")
 
textFile = sc.textFile(logFile)
 
wordCounts = textFile\
    .flatMap(lambda line: line.split())\
    .map(lambda word: (word, 1))\
    .reduceByKey(lambda a, b: a+b)
 
wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")

```
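
To see what each stage of the Spark version contributes, here is a local sketch of the same three steps in plain Python. It is an analogy, not the Spark implementation: `word_count` is a hypothetical helper, and the dictionary update plays the role of `reduceByKey`.

```python
from collections import defaultdict


def word_count(lines):
    # flatMap stage: every line becomes zero or more words.
    words = (word for line in lines for word in line.split())
    # map stage: every word becomes a (word, 1) pair.
    pairs = ((word, 1) for word in words)
    # reduceByKey stage: add up the values that share a key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


counts = word_count([
    "hello world",
    "another line",
    "yet another line",
    "yet another another line",
])
print(counts)
```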

---
Architecture Revisited
---

![](images/spark_detailed_overivew.png)

---
Summary
----

- Spark is a general-purpose, in-memory Big Data analytics framework
- Beats down the elephant (Hadoop)
- Simple API in Scala, Python, and Java
- Functional, idiomatic programming style
- Enables advanced programming models (e.g., SQL, machine learning, and graph)

<br>
<br>
<br>
<br>
---
Bonus Materials
====

Key Value Pairs
-----

PairRDD
-------

At this point we know how to aggregate values across an RDD. If we
have an RDD containing sales transactions we can find the total
revenue across all transactions.

Q: Using the following sales data, find the total revenue across all
transactions.

In [27]:
%%writefile sales.txt
#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00

Writing sales.txt


- Read the file.

In [28]:
sc.textFile('sales.txt')\
    .take(2)

[u'#ID    Date           Store   State  Product    Amount',
 u'101    11/13/2014     100     WA     331        300.00']

- Split the lines.

In [29]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .take(2)

[[u'#ID', u'Date', u'Store', u'State', u'Product', u'Amount'],
 [u'101', u'11/13/2014', u'100', u'WA', u'331', u'300.00']]

- Remove the header line, which starts with `#`.

In [30]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: x[0].startswith('#'))\
    .take(2)

[[u'#ID', u'Date', u'Store', u'State', u'Product', u'Amount']]

- Oops, that kept only the header. Try again with `not`.

In [31]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .take(2)

[[u'101', u'11/13/2014', u'100', u'WA', u'331', u'300.00'],
 [u'104', u'11/18/2014', u'700', u'OR', u'329', u'450.00']]

- Pick off last field.

In [32]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: x[-1])\
    .take(2)

[u'300.00', u'450.00']

- Convert to float and then sum.

In [33]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: float(x[-1]))\
    .sum()

2230.0

ReduceByKey
-----------

Q: How do you calculate revenue per state?

- Instead of creating a sequence of revenue numbers we can create
  tuples of states and revenue.

In [34]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .collect()

[(u'WA', 300.0),
 (u'OR', 450.0),
 (u'CA', 200.0),
 (u'CA', 330.0),
 (u'WA', 750.0),
 (u'CA', 200.0)]

- Now use `reduceByKey` to add them up.

In [35]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .collect()

[(u'CA', 730.0), (u'OR', 450.0), (u'WA', 1050.0)]

Q: Find the state with the highest total revenue.

- You can either use the action `top` or the transformation `sortBy` then `take`.

In [40]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .sortBy(lambda state_amount:state_amount[1],ascending=False) \
    .take(3)

[(u'WA', 1050.0), (u'CA', 730.0), (u'OR', 450.0)]

In [42]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .top(3, key=lambda state_amount:state_amount[1])

[(u'WA', 1050.0), (u'CA', 730.0), (u'OR', 450.0)]
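
The two approaches have familiar local counterparts: sorting everything and slicing, versus asking directly for the largest items. The sketch below is plain Python, not Spark; `by_amount` is a hypothetical key function and `totals` copies the per-state results from above.

```python
import heapq

totals = [("CA", 730.0), ("OR", 450.0), ("WA", 1050.0)]


def by_amount(state_amount):
    """Key function: order (state, amount) pairs by the amount."""
    return state_amount[1]


# sortBy + take analogue: sort the whole collection, then slice.
via_sort = sorted(totals, key=by_amount, reverse=True)[:1]

# top analogue: ask only for the n largest items.
via_top = heapq.nlargest(1, totals, key=by_amount)

print(via_sort, via_top)
```

Both return the same answer; the second avoids ordering elements you never look at.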

Check for understanding
--------

<details><summary>
Q: What does `reduceByKey` do?
</summary>
1. It is like a reducer.
<br>
2. If the RDD is made up of key-value pairs, it combines the values
   across all tuples with the same key by using the function we pass
   to it.
<br>
3. It only works on RDDs made up of key-value pairs or 2-tuples.
</details>

Notes
-----

- `reduceByKey` only works on RDDs made up of 2-tuples.

- `reduceByKey` works as both a reducer and a combiner.

- It requires that the operation is associative and commutative.
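
Why the function has to be well behaved becomes clearer once the combiner step is spelled out. The sketch below is a single-machine analogy, not Spark's implementation: `combine_partition` and `reduce_by_key` are hypothetical helpers, and the two "partitionings" are just different ways of splitting the same list. The same function is applied first inside each partition and then across partitions, so the grouping must not change the answer.

```python
def combine_partition(pairs, func):
    """Combiner step: merge values per key inside one partition."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def reduce_by_key(partitions, func):
    """Reducer step: merge the per-partition results with the same function."""
    partials = [combine_partition(part, func) for part in partitions]
    shuffled = [pair for partial in partials for pair in partial.items()]
    return combine_partition(shuffled, func)


sales = [("WA", 300.0), ("OR", 450.0), ("CA", 200.0),
         ("CA", 330.0), ("WA", 750.0), ("CA", 200.0)]


def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


# Addition: any split of the data gives the same totals.
totals_a = reduce_by_key([sales[:3], sales[3:]], add)
totals_b = reduce_by_key([sales[:1], sales[1:]], add)

# Subtraction is not associative: the split changes the result.
diff_a = reduce_by_key([[("k", 10), ("k", 3)], [("k", 2)]], subtract)
diff_b = reduce_by_key([[("k", 10)], [("k", 3), ("k", 2)]], subtract)

print(totals_a, totals_b, diff_a, diff_b)
```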

----
Even more architecture
---

![](images/spark_platform.png)

---
Cluster Manager
---

![](images/manager.png)

----
Spark Logging
-------------

Q: How can I make Spark logging less verbose?

- By default Spark logs messages at the `INFO` level.

- Here are the steps to make it only print out warnings and errors.

```sh
cd $SPARK_HOME/conf
cp log4j.properties.template log4j.properties
```

- Edit `log4j.properties` and replace `rootCategory=INFO` with `rootCategory=WARN`

<br>
<br> 
<br>

----