<img src="images/sads-logo.jpeg" align="left" width="14%">

<img src="images/spark-logo-hd.png" align="right" width="21%">
<img src="images/python-logo-notext.png" align="right" width="11%">

<h1 align='center'>Intro to PySpark Workshop</h1>
<h2 align='center'>Meghann Agarwal</h2>
<h3 align='center'>September 14, 2017</h3>



### Bottlenecks and Blockers
#### CPU Bound
- Optimize algorithms
- Optimize compilation
- Parallelize code
#### Storage Bound
- Store data elsewhere and transfer to compute machine when needed
- Spread data across multiple machines and run computations on all machines
#### Network I/O Bound
- Decrease network transfer time
    - Compress data for faster transfer
    - Move computation to the data
- Consider multithreading
    - Unblock code by allowing data to transfer in a separate thread
#### Disk I/O Bound
- Cache data in RAM for faster access
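
As a small illustration of one suggestion above (compressing data before a network transfer), here is a minimal sketch using Python's standard `zlib` module. The payload is a made-up, highly repetitive log sample, so the ratio it prints is not representative of real data.

```python
import zlib

# Hypothetical payload: repetitive log lines, which compress very well.
payload = b"2017-09-14 INFO request served in 12 ms\n" * 10_000

# Compress before "sending"; level 6 is a middle-of-the-road setting.
compressed = zlib.compress(payload, 6)
print(f"raw: {len(payload)} bytes, compressed: {len(compressed)} bytes")

# The receiver restores the original bytes exactly.
restored = zlib.decompress(compressed)
```

Compression trades CPU time for transfer time, so it helps most when the job is network bound rather than CPU bound.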

# Outline
- **Definitions:**
    - MapReduce
    - Hadoop
- **PySpark:**
    - Exercises:
        - Word Count
        - Logistic Regression
        - Clickstream
    - References and Resources

# Definitions

## What is MapReduce?
https://research.google.com/archive/mapreduce.html

## Why MapReduce?

>"The primary benefit of MapReduce is that it allows us to distribute computations by moving the processing to the data."  
-- J. Grus, *Data Science from Scratch*

### Example: Word Count
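
A minimal single-machine sketch of the MapReduce idea, written in plain Python rather than Hadoop or Spark. The function names (`mapper`, `shuffle`, `reducer`, `word_count`) and the sample documents are illustrative assumptions, not part of any framework.

```python
from collections import defaultdict

def mapper(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, counts):
    """Reduce step: collapse the values for one key into a single result."""
    return (word, sum(counts))

def word_count(documents):
    pairs = (pair for doc in documents for pair in mapper(doc))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())

documents = ["data science", "big data", "science fiction"]
counts = word_count(documents)
print(counts)
```

In a real cluster the map and reduce steps would run on the machines that hold the data, and the shuffle would move the grouped pairs across the network.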

## What is Hadoop?



## Why Hadoop?

# PySpark

## Environment Setup

### Spark Deployment Modes
- Spark Standalone Cluster Mode
- Spark on Hadoop YARN Cluster
- Spark on Apache Mesos Cluster

## Installation Options
Here are a few options for "installing" Spark:

### Databricks Community Edition
1. Use the Google Chrome browser (Firefox should also work, but not Internet Explorer, Safari, etc.)
2. Sign up for a free Community Edition account here: https://databricks.com/try-databricks

### Local on Mac OS X
1. Download http://apache.spinellicreations.com/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
2. Extract the Spark package and set `SPARK_HOME`
```
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz
sudo mv spark-2.2.0-bin-hadoop2.7 /opt/spark
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
```
3. Run the included Pi Estimator example by executing the following command:
```
$SPARK_HOME/bin/run-example SparkPi 10
```
Expect to see something like:
```
Pi is roughly 3.140576
```

### Amazon Web Services (AWS)
1. Deploy Spark on Elastic Compute Cloud (EC2) cluster
2. Deploy Spark on Elastic MapReduce (EMR) cluster

## What is Spark?
Spark is a fast and expressive cluster computing system for doing Big Data computation. It's good for:
- iterative tasks
- large batch processing jobs
- interactive data exploration

It's compatible with Hadoop-supported file systems and data formats (HDFS, S3, SequenceFile, ...), so if you've been using Hadoop you can use it with your existing data and deploy it on your existing clusters.

It achieves fault tolerance through *lineage*: if you lose a partition (chunk) of data, Spark can rebuild it by replaying the *transformations* that produced it from the original input. This is in contrast to distributed shared memory systems, which typically recover by checkpointing state to disk and rolling back.
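
The lineage idea can be imitated in a few lines of plain Python. This is a toy sketch, not Spark's implementation: the `Dataset` class and its attributes are hypothetical, and it models a single partition held on one machine.

```python
class Dataset:
    """A toy, single-partition stand-in for an RDD that remembers its lineage."""

    def __init__(self, source=None, parent=None, func=None):
        self.source = source    # base data (only set on the root dataset)
        self.parent = parent    # dataset this one was derived from
        self.func = func        # transformation applied to the parent
        self.cached = None      # in-memory copy of the computed result

    def map(self, func):
        # A transformation only records lineage; it computes nothing yet.
        return Dataset(parent=self, func=func)

    def compute(self):
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = [self.func(x) for x in self.parent.compute()]
        return self.cached


base = Dataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
shifted = squared.map(lambda x: x + 1)

first = shifted.compute()       # [2, 5, 10]

# Simulate losing the in-memory data for the derived datasets.
squared.cached = None
shifted.cached = None

recovered = shifted.compute()   # rebuilt by replaying the recorded lineage
```

Because each derived dataset stores only its parent and a function, recovery needs no replicated copy of the data, only the ability to re-run the transformations.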

## Why use Spark?
https://spark.apache.org/

<img src="images/speed.png" align="left" width="70%">

<img src="images/ease-of-use.png" align="left" width="70%">

<img src="images/generality.png" align="left" width="70%">

<img src="images/runs-everywhere.png" align="left" width="70%">


>"Although current frameworks provide numerous abstractions for accessing a cluster’s computational resources, they <font color='red'> lack abstractions for leveraging distributed memory</font>. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. <font color='blue'>Data reuse is common in many iterative machine learning and graph algorithms</font>, including PageRank, K-means clustering, and logistic regression. <font color='blue'>Another compelling use case is interactive data mining</font>, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to <font color='red'>write it to an external stable storage system</font>, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times."  
-- Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," *In NSDI '12*, April 2012

https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

## Spark vs MapReduce vs MPI vs ...
- [MapReduce](https://en.wikipedia.org/wiki/MapReduce) --> [Hadoop](http://hadoop.apache.org/): heavily used in business computing
- [Message Passing Interface (MPI)](https://en.wikipedia.org/wiki/Message_Passing_Interface) --> [MVAPICH](http://mvapich.cse.ohio-state.edu/): heavily used in scientific computing
- [Spark](http://spark.apache.org/): complement to Hadoop, faster for iterative applications, rich set of APIs in Scala, Python, and Java, and an interactive shell

## Spark Architecture
- Spark Driver and Workers
- SparkContext (in version 2.X, SparkSession is the unified entry point and wraps a SparkContext)
- Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
- SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN)

<img src="images/cluster-overview.png" align="center" width="75%">

<h4 align='right'>https://spark.apache.org/docs/1.1.0/cluster-overview.html</h4>

## Spark Programming Concepts (version 1.X)
- ****SparkContext****: entry point to Spark functions
- ****Resilient Distributed Datasets (RDDs)****:
    - Immutable, distributed collections of objects
    - Can be cached in memory for fast reuse
- ****Operations on RDDs****:
    - *Transformations*: define a new RDD (map, join, ...)
    - *Actions*: return or output a result (count, save, ...)
- ****Two ways to create RDDs****:
    1. By parallelizing an existing collection in your driver program:  
        `data = [1, 2, 3, 4, 5]`  
        `distData = sc.parallelize(data)`  
    2. Or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat:  
        `distFile = sc.textFile("data.txt")`
<h4 align='right'>http://spark.apache.org/docs/latest/programming-guide.html</h4>

## Spark Data Interfaces

- [RDD API](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)
- [DataFrame API](https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)
- [Machine Learning API](https://spark.apache.org/docs/latest/mllib-guide.html)

There are several key interfaces that you should understand when you go to use Spark.

-   ****The Dataset****
    -   The Dataset is Apache Spark's newest distributed collection and can be considered a combination of DataFrames and RDDs. It provides the typed interface that is available in RDDs while providing a lot of conveniences of DataFrames. It will be the core abstraction going forward.
-   ****The DataFrame****
    -   The DataFrame is a collection of distributed `Row` types. These provide a flexible interface and are similar in concept to the DataFrames you may be familiar with in Python (pandas) as well as in the R language.
-   ****The RDD (Resilient Distributed Dataset)****
    -   Apache Spark's first abstraction was the RDD or Resilient Distributed Dataset. Essentially it is an interface to a sequence of data objects that consist of one or more types that are located across a variety of machines in a cluster. RDDs can be created in a variety of ways and are the "lowest level" API available to the user. While this is the original data structure made available, new users should focus on Datasets as those will be supersets of the current RDD functionality.

*(slide taken from "Introduction to Apache Spark on Databricks" notebook)*

## What is PySpark?
- The Python API for Spark
- Run interactive jobs in the shell
- Supports NumPy, pandas and other Python libraries

## Why use PySpark?
- If you already know Python
- Can use Spark in tandem with your favorite Python libraries
- If you don't need Python libraries, maybe just write code in Scala

### PySpark's core classes (version 1.X):
- ****pyspark.SparkContext****  
Main entry point for Spark functionality.
- ****pyspark.RDD****  
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
- ****pyspark.streaming.StreamingContext****  
Main entry point for Spark Streaming functionality.
- ****pyspark.streaming.DStream****  
A Discretized Stream (DStream), the basic abstraction in Spark Streaming.
- ****pyspark.sql.SQLContext****  
Main entry point for DataFrame and SQL functionality.
- ****pyspark.sql.DataFrame****  
A distributed collection of data grouped into named columns.

## Transformations
- Transform one RDD to another, **new** RDD (immutable)

| Transformation | Description | Type |
| :------:  | :-----------: | :-----: |
| `map(func)`     | Apply a function over each element | Narrow |
| `flatMap(func)` | Map then flatten output | Narrow |
| `filter(func)`  | Keep only elements where function is `True` | Narrow |
| `sample(withReplacement, fraction, seed)` | Return a sampled subset of this RDD | Narrow |
| `groupByKey()` | Group the values for each key in the RDD into a single sequence | Wide |
| `reduceByKey(func)` | Merge the values for each key using an associative reduce function | Wide |


<img src="images/narrow_wide_transformations.png" align="center" width="90%">

<h4 align='right'>https://dzone.com/articles/big-data-processing-spark</h4>
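
The narrow/wide distinction can be sketched in plain Python by treating each inner list as one partition. This is only an illustration under simplifying assumptions: the two partitions, the toy partitioner, and the variable names are hypothetical, and no Spark API is used.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs, as if on two workers.
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
]

# Narrow (map-like): each output partition depends on exactly one input
# partition, so every partition can be processed independently.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide (reduceByKey-like): records with the same key may live in different
# partitions, so they must first be shuffled to a common partition.
num_out = 2
shuffled = [defaultdict(list) for _ in range(num_out)]
for part in doubled:
    for k, v in part:
        target = sum(k.encode()) % num_out   # deterministic toy partitioner
        shuffled[target][k].append(v)

reduced = [{k: sum(vs) for k, vs in bucket.items()} for bucket in shuffled]
print(reduced)   # [{'b': 2}, {'a': 4, 'c': 2}]
```

The shuffle loop is the expensive part in a real cluster, because it is where data crosses the network between machines.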

## Actions
- Return or output a result

| Action | Description | Try it Out\*|
| :------:  | :-----------:| :---: |
| `collect()`     | Return a list that contains all of the elements in this RDD | `sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()` |
| `count()`  | Return the number of elements | `sc.parallelize([2, 3, 4]).count()` |
| `saveAsTextFile(path)` | Save as a text file, using string representations of elements | `sc.parallelize(['foo', '-', 'bar', '!']).saveAsTextFile("/FileStore/foo-bar.txt")`|
| `first()`    | Return the first element | `sc.parallelize([2, 3, 4]).first()` |
| `take(num)`    | Take the first num elements | `sc.parallelize([2, 3, 4, 5, 6]).take(2)` |

### \* Try it Out:
1. Go to your Databricks Workspace and create a new directory within your Users directory called "2017-09-14-sads-pyspark"
2. Create a notebook called "0-Introduction"  within this directory
3. Type or copy/paste lines of code into separate cells and run them (you will be prompted to launch a cluster) 

When using Databricks the `SparkContext` is created for you automatically as `sc`.

In the Databricks Community Edition there are no separate worker nodes; the driver program (master) executes all of the code.

### Try a couple more examples with transformations and actions:

```python
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.groupByKey().mapValues(len).collect())   # [('a', 2), ('b', 1)]
sorted(rdd.groupByKey().mapValues(list).collect())  # [('a', [1, 1]), ('b', [1])]
```

```python
from operator import add

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())  # [('a', 2), ('b', 1)]
```

We've shown only a subset of possible *transformations* and *actions*. Check out others for your application in the docs: http://spark.apache.org/docs/latest/api/python/pyspark.html

### Example: Log Mining

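```python
# Load the log file and keep only the error lines.
lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
# The message is the third tab-separated field.
messages = errors.map(lambda line: line.split("\t")[2])

messages.filter(lambda msg: "foo" in msg).count()
```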

The computation is expressed declaratively: the `filter` and `map` transformations are lazy, and nothing is actually computed until the `count` action is called at the end.
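
This laziness can be imitated with plain Python generators. The sketch below is an analogy, not Spark code: the sample log lines and the `examined` list (used only to record when a line is actually read) are made up for illustration.

```python
log_lines = [
    "ERROR\tdb\tfoo timeout",
    "INFO\tweb\trequest ok",
    "ERROR\tweb\tbar failed",
    "ERROR\tcache\tfoo miss",
]

examined = []  # records which lines were actually read

def read_lines():
    for line in log_lines:
        examined.append(line)
        yield line

# "Transformations": build the pipeline lazily with generator expressions.
lines = read_lines()
errors = (line for line in lines if line.startswith("ERROR"))
messages = (line.split("\t")[2] for line in errors)
foo_messages = (msg for msg in messages if "foo" in msg)

lines_read_before = len(examined)   # still 0: nothing has run yet

# "Action": consuming the pipeline forces the whole chain to execute.
foo_count = sum(1 for _ in foo_messages)
lines_read_after = len(examined)

print(lines_read_before, foo_count, lines_read_after)
```

Deferring work until an action is requested is what lets an engine see the whole chain of transformations before deciding how to execute it.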

# Exercises

## Exercise 1: Word Count
Create a few transformations to build a dataset of `(str, int)` pairs called `counts`, and then save it to a file.

1. Create a notebook in "2017-09-14-sads-pyspark" called "1-WordCount"
2. Try to implement the following Word Count example:

http://spark.apache.org/examples.html

```python
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
```

## Exercise 2: Logistic Regression

1. Create a notebook in "2017-09-14-sads-pyspark" called "2-LogisticRegression"
2. Try to implement one of the following Logistic Regression examples:
    - http://spark.apache.org/examples.html (Prediction with Logistic Regression)
    - https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/logistic_regression.py
    - https://github.com/apache/spark/blob/master/examples/src/main/python/logistic_regression.py

```python
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
# `data` is assumed to already exist as a collection of (label, features) rows.
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)

# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
```
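
To see why logistic regression counts as an iterative workload, here is a minimal plain-Python sketch of batch gradient descent on a tiny, made-up 1-D dataset. It is not how Spark's `LogisticRegression` is implemented; the data, learning rate, and variable names are assumptions chosen only to show that every iteration re-reads the full dataset.

```python
import math

# Hypothetical 1-D training set: (feature, label) pairs, separable near x = 0.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0          # model parameters
learning_rate = 0.5

for _ in range(10):      # a fixed number of passes, like maxIter=10
    # Each pass scans the whole dataset; in Spark this is the data you
    # would want cached in memory between iterations.
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in points)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in points)
    w -= learning_rate * grad_w / len(points)
    b -= learning_rate * grad_b / len(points)

predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x, _ in points]
print(predictions)
```

If each pass had to reload the data from disk, the load time would be paid on every iteration, which is the overhead that in-memory reuse is meant to avoid.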

## Exercise 3: Clickstream

***Switch over to Databricks***
- Import notebook "3-Clickstream" into "2017-09-14-sads-pyspark"


***Objective: Get some practice using Spark DataFrames***

***Data Source: February 2015 English Wikipedia Clickstream dataset***
- https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
- http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82

>"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."

The data is approximately 1.2 GB and is hosted in the following Databricks file: `/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed`

*Notebook translated from the Databricks "Quick Start DataFrames" tutorial in Scala, which was based on a lab developed by [Sameer Farooqui](https://www.linkedin.com/in/blueplastic).*

# Resources and References
- ****MOOCs****:
    - "Introduction to Apache Spark": https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x
    - "Hadoop Platform and Application Framework" (week 5 covers Spark): https://www.coursera.org/learn/hadoop/home/week/5
- ****Spark/PySpark Docs****:
    - (v2.0.0) http://spark.apache.org/docs/2.0.0-preview/
    - http://spark.apache.org/research.html
    - http://spark.apache.org/examples.html
- ****Other****:
    - Spark 2.0 Webinar (2016): http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin
    - PySpark Talk (J. Rosen, 2013): https://www.youtube.com/watch?v=xc7Lc8RA8wE
    - "Apache Spark in 24 Hours": https://www.amazon.com/Apache-Spark-Hours-Teach-Yourself/dp/0672338513

# Thanks for Coming!

<agarwal.meghann@gmail.com>