---
<img src="images/spark-logo.png">



---
Where are we?
----

![](http://www.platfora.com/wp-content/uploads/2015/06/gartner-hype-cycle.png)

Spark is (relatively) platform agnostic. 

The goal is:
> Write once, run everywhere

By the end of this session, you should be able to:
----

- Describe Spark's architecture and execution model
- Persist and cache data
- Implement word count in Spark
- Understand and use Spark DataFrames

--- 
Spark Architecture Review
---

![](images/spark-cluster.png)

<img src="images/plan.png" style="width: 400px;"/>

<img src="images/save_time.png" style="width: 400px;"/>

---
RDD Laziness
------------

![](https://prateekvjoshi.files.wordpress.com/2014/10/3-laziness.png)

Q: What is this Spark job doing?

In [1]:
max = 10000000
%time sc.parallelize(xrange(max)).map(lambda x:x+1).count()

CPU times: user 17.2 ms, sys: 22.5 ms, total: 39.8 ms
Wall time: 7.94 s


10000000

Q: How is the following job different from the previous one? How long do you expect it to take?

In [2]:
%time sc.parallelize(xrange(max)).map(lambda x:x+1)

CPU times: user 2.27 ms, sys: 1.93 ms, total: 4.2 ms
Wall time: 6.35 ms


PythonRDD[3] at RDD at PythonRDD.scala:43

Check for understanding
--------

<details><summary>
Why did the second job complete so much faster?
</summary>
1. Because Spark is lazy. 
<br>
2. Transformations return new RDDs without touching the data.
<br>
3. Nothing happens until an action is applied to an RDD.
<br>
4. An RDD is the *recipe* for a transformation, rather than the
   *result* of the transformation.
</details>
<br>
<br>
<details><summary>
What is the benefit of keeping the recipe instead of the result of the transformation?
</summary>
1. It saves memory.
<br>
2. It produces *resilience*. 
<br>
3. If an RDD loses data on a machine, it always knows how to recompute it.
</details>
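
The recipe-versus-result idea can be mimicked in plain Python. The sketch below is only an analogy, not how Spark is implemented: `LazyRecipe` is a hypothetical toy class (assuming Python 3) that records `map` steps and runs them only when `count()` is called.

```python
class LazyRecipe:
    """Toy stand-in for an RDD: it stores the recipe, not the result."""

    def __init__(self, source, steps=()):
        self.source = source        # callable that returns a fresh iterable
        self.steps = tuple(steps)   # functions to apply, in order

    def map(self, fn):
        # Transformation: returns a new recipe and touches no data.
        return LazyRecipe(self.source, self.steps + (fn,))

    def count(self):
        # Action: only now is the data walked and every step applied.
        total = 0
        for item in self.source():
            for fn in self.steps:
                item = fn(item)
            total += 1
        return total


calls = []

def add_one(x):
    calls.append(x)   # record each time the function really runs
    return x + 1

recipe = LazyRecipe(lambda: range(5)).map(add_one)
calls_before_action = len(calls)   # map() alone did no work
n = recipe.count()                 # the action triggers the work
calls_after_action = len(calls)
```

Calling `recipe.count()` a second time runs `add_one` on every element again, because the recipe keeps no results. That recomputation is what caching avoids.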



---
Deep Dive on DAGs
---

![](http://image.slidesharecdn.com/2014-05-29-spark-140529034646-phpapp01/95/spark-internals-hadoop-source-code-reading-16-in-japan-10-638.jpg?cb=1401392040)

![](http://image.slidesharecdn.com/sparkoverview-140729190732-phpapp01/95/spark-overview-18-638.jpg?cb=1406660911)

----
Caching and Persistence
-----

Consider this Spark job:

In [5]:
import random
num_count = 500*1000
num_list = [random.random() for i in xrange(num_count)]
rdd1 = sc.parallelize(num_list)
rdd2 = rdd1.sortBy(lambda num: num)

Let's time running `count()` on `rdd2`.

In [6]:
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

CPU times: user 9.36 ms, sys: 4.91 ms, total: 14.3 ms
Wall time: 2.26 s
CPU times: user 6.98 ms, sys: 1.84 ms, total: 8.82 ms
Wall time: 849 ms
CPU times: user 6.22 ms, sys: 1.66 ms, total: 7.88 ms
Wall time: 954 ms


500000

An RDD does no work until an action is called. When an action is called, Spark computes the answer and then throws away the intermediate data.

If you are going to reuse an RDD in your computation, call `cache()` to tell Spark to keep it.

Let's cache it and try again.

In [7]:
rdd2.cache()
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

CPU times: user 5.3 ms, sys: 3.04 ms, total: 8.34 ms
Wall time: 1.29 s
CPU times: user 4.99 ms, sys: 1.58 ms, total: 6.57 ms
Wall time: 168 ms
CPU times: user 7.14 ms, sys: 1.94 ms, total: 9.08 ms
Wall time: 377 ms


500000

Caching the RDD speeds up the job because the RDD does not have to be computed from scratch again.

Notes
-----

- Calling `cache()` flips a flag on the RDD. 

- The data is not cached until an action is called.

- You can uncache an RDD using `unpersist()`.
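
The notes above can be checked in a few lines. This is a sketch that assumes a live SparkContext is passed in as `sc` (as in the notebook) and Python 3; `cache_lifecycle` is a hypothetical helper name, and `is_cached` is the flag that PySpark keeps on the RDD object.

```python
def cache_lifecycle(sc):
    """Walk an RDD through cache -> action -> unpersist and report the flag."""
    rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)

    rdd.cache()                        # flips the flag; nothing is computed or stored yet
    flag_after_cache = rdd.is_cached

    rdd.count()                        # the first action computes the partitions and caches them
    rdd.unpersist()                    # clears the flag right away, no action needed
    flag_after_unpersist = rdd.is_cached

    return flag_after_cache, flag_after_unpersist
```

In the notebook, `cache_lifecycle(sc)` should return `(True, False)`.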

Check for understanding
--------

<details><summary>
Q: Will `unpersist` uncache the RDD immediately or does it wait for an
action?
</summary>
It unpersists immediately.
</details>

Caching and Persistence
-----------------------

Q: How do you persist an RDD to disk instead of caching it in memory?

- You can cache RDDs at different levels.

- Here is an example.

In [8]:
import pyspark
rdd = sc.parallelize(xrange(100))
rdd.persist(pyspark.StorageLevel.DISK_ONLY)

PythonRDD[20] at RDD at PythonRDD.scala:43

Check for understanding
--------

<details><summary>
Q: Will the RDD be stored on disk at this point?
</summary>
No. It will get stored after we call an action.
</details>

Persistence Levels
------------------

Level                      |Meaning
-----                      |-------
`MEMORY_ONLY`              |Same as `cache()`
`MEMORY_AND_DISK`          |Cache in memory, spill to disk whatever does not fit
`MEMORY_AND_DISK_SER`      |Like above, but keep cached objects serialized instead of live
`DISK_ONLY`                |Save to disk, not to memory

Notes
-----

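- `MEMORY_AND_DISK_SER` is a good compromise between the levels: fast, but not too expensive in memory.

- In PySpark, stored objects are always serialized with pickle, so the serialized and non-serialized levels behave alike.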

- Make sure you unpersist when you don't need the RDD any more.


----
Word Count
---

The "Hello, world!" of Big Data.

Know it, 😍 it

1) Create some input.

In [9]:
%%writefile quotes.txt
You’re fired
You’re fired
I will build a great wall and nobody builds walls better than me, believe me and I’ll build them very inexpensively. 
You’re fired

Writing quotes.txt


2) Count the words.

In [10]:
(sc.textFile('quotes.txt')
     .flatMap(lambda line: line.split(" "))
     .map(lambda word: (word,1))
     .reduceByKey(lambda count1, count2: count1+count2)
     .collect())

[(u'and', 2),
 (u'', 1),
 (u'them', 1),
 (u'I', 1),
 (u'very', 1),
 (u'will', 1),
 (u'great', 1),
 (u'better', 1),
 (u'me', 1),
 (u'walls', 1),
 (u'a', 1),
 (u'I\u2019ll', 1),
 (u'wall', 1),
 (u'fired', 3),
 (u'me,', 1),
 (u'nobody', 1),
 (u'You\u2019re', 3),
 (u'build', 2),
 (u'inexpensively.', 1),
 (u'believe', 1),
 (u'than', 1),
 (u'builds', 1)]
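
The output above contains an empty token (`u''`) and tokens with punctuation attached (`u'me,'`, `u'inexpensively.'`), because the job splits on single spaces only. One possible cleanup, sketched here assuming Python 3, is to tokenize with a regular expression before counting. `tokenize` and `word_count` are hypothetical helper names, and `sc` is assumed to be the notebook's SparkContext.

```python
import re

# Runs of word characters, keeping straight and curly apostrophes
# so that contractions such as "You’re" stay in one piece.
WORD_RE = re.compile(r"[\w'’]+")

def tokenize(line):
    """Lowercase a line and return its words, without punctuation or empty strings."""
    return WORD_RE.findall(line.lower())

def word_count(sc, path):
    """The word count pipeline, with tokenize() in place of split(' ')."""
    return (sc.textFile(path)
              .flatMap(tokenize)
              .map(lambda word: (word, 1))
              .reduceByKey(lambda count1, count2: count1 + count2)
              .collect())
```

Note that lowercasing is a design choice: it merges `I` with `i` and `You’re` with `you’re`, which is usually what a word count wants but is not what the original job does.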

---
Handling tabular data
---

In [11]:
sales_by_state = (sc.textFile('sales.csv')
                  .map(lambda x: x.split(","))
                  .filter(lambda x: not x[0].startswith('#'))
                  .map(lambda x: (x[-3],float(x[-1])))
                  .reduceByKey(lambda amount1,amount2: amount1+amount2)
                  .sortBy(lambda state_amount:state_amount[1],ascending=False))

In [12]:
sales_by_state.collect()

[(u'WA', 1050.0), (u'CA', 730.0), (u'OR', 450.0)]

While this code looks reasonable, the list indexes are cryptic and hard to read.
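
One way to make the indexes readable without leaving the RDD API is to parse each row into a named record first. This sketch assumes, as the code above does, that the state is the third-from-last field and the amount is the last one. `Sale`, `parse_sale` and `sales_by_state_named` are hypothetical names, `sc` is assumed to be the notebook's SparkContext, and the sample row in the note below is made up.

```python
from collections import namedtuple

# Only the two fields the job uses get a name.
Sale = namedtuple("Sale", ["state", "amount"])

def parse_sale(line):
    """Turn one CSV line into a Sale, naming the fields the job actually uses."""
    fields = line.split(",")
    return Sale(state=fields[-3], amount=float(fields[-1]))

def sales_by_state_named(sc, path):
    """Total sales per state, largest first, using named fields."""
    return (sc.textFile(path)
              .filter(lambda line: not line.startswith('#'))   # skip comment lines
              .map(parse_sale)
              .map(lambda sale: (sale.state, sale.amount))
              .reduceByKey(lambda amount1, amount2: amount1 + amount2)
              .sortBy(lambda state_amount: state_amount[1], ascending=False))
```

For a made-up row such as `"101,Widget,WA,online,300.0"`, `parse_sale` returns `Sale(state='WA', amount=300.0)`.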

<br>
<br> 
<br>

----
DataFrames, you might remember them from Pandas 🐼
----

Let's just automagically load a CSV.

[Source](https://community.cloud.databricks.com/?o=6058142077065523#externalnotebook/https%3A%2F%2Fdocs.cloud.databricks.com%2Fdocs%2Flatest%2Fdatabricks_guide%2Findex.html%2303%2520Accessing%2520Data%2F3%2520Common%2520File%2520Formats%2F1%2520CSV%2520-%2520py.html)

In [6]:
# # Read CSV data as a DataFrame, Spark 1.x style (needs the spark-csv package)
# sales = (sqlContext.read.format('com.databricks.spark.csv')
#             .options(header='true', inferSchema='true')
#             .load('/FileStore/tables/3onpii8c1465685311162/sales.csv'))

In [None]:
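# # Read CSV data as a DataFrame, Spark 2.0 style (built-in reader)
# # inferSchema=True makes `amount` numeric so that it can be summed
# sales = spark.read.csv('/FileStore/tables/aef5f0rv1465685940318/sales.csv',
#                    header=True, inferSchema=True)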

In [7]:
# display(sales)

In [None]:
# sales.printSchema()

In [None]:
# sales_by_state = (sales
#                   .groupBy("state")
#                   .sum("amount")
#                   .withColumnRenamed("sum(amount)", "total_sales")
#                   .orderBy("total_sales"))

In [None]:
# display(sales_by_state)

---
Why Spark Sparkles ✨
----

### DataFrame API

> A DataFrame is a distributed collection of data organized into __named__ columns. 

> It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. 

![](images/dataframes_in_spark_stack.png)

DataFrames and Spark SQL are high-level APIs for working with structured data (e.g. database tables, JSON files) that let Spark automatically optimize both storage and computation. The related DataSet API adds a type-safe, object-oriented programming interface.

DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. 

Behind these APIs, the Catalyst optimizer and Tungsten execution engine optimize applications in ways that were not possible with Spark’s object-oriented (RDD) API.

---
Where is the DataFrames API?
----

`pyspark.sql.SQLContext`

Main entry point for DataFrame and SQL functionality.

[RTFM](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html)

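Spark 2.0 has a simpler API: `SparkSession`, available as `spark` in the notebook, is the single entry point and wraps `SQLContext`.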

In [None]:
# spark.

---
Spark DataSet API
----

A DataSet is a new interface that tries to combine the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

A DataSet can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). 

A Spark DataSet is a strongly typed, immutable collection of objects that are mapped to a relational schema, with an object-oriented programming interface.

[Source 1](https://docs.cloud.databricks.com/docs/latest/databricks_guide/05%20Spark/1%20Intro%20Datasets.html)  
[Source 2](https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html)

DataSets are "closer to the metal" (the JVM), which allows for more optimizations.

<img src="https://www.stayathomemum.com.au/wp-content/uploads/2013/10/tantrum.jpg" style="width: 400px;"/>

DataSets are only available in Scala and Java, not in Python.

[DataSet documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/sql-programming-guide.html#creating-datasets)

---
Summary
---

- Spark is the 💩: can do Word Count simply and has features for advanced operations.
- There are tricks to persisting and caching data well.
- Spark DataFrames 💗 ~ Big Data Pandas
    - Use them as much as possible
    - More _Data Science_ features
- Keep an eye out for DataSets (or learn Scala)

<br>
<br> 
<br>

----