# Overview

In this notebook, I want to cover areas related to performance as well as some more advanced Spark functionality (MLlib, Streaming, DataFrames). This notebook is meant to give you a taste of what is possible with Spark, and is **not** a comprehensive guide.

## Terminology

Let's get some key terminology down first before we delve into the nuts and bolts of performance enhancements. I will motivate this with a simple example:

### Spark Job

Whenever we call an action (e.g. `collect()`), Spark has to compute and return a result. As such, we can say that each action results in a **spark job**.

Each **spark job** is made up of a set of **stages**. It is important to note here that the **number of stages in a job is directly related to the number of shuffle operations required to deliver a result**: every shuffle introduces a stage boundary, so a simple linear job with *n* shuffles has *n* + 1 stages.

### Stages

A stage is a group of **tasks** that need to be executed in unison to compute **an operation on several machines**.

Spark is optimized to pipeline as many operations as possible into a single stage; however, a **new stage is always created** after a **shuffle** occurs.

The tasks within a stage are **executed in parallel**, while the stages themselves run in the order dictated by their dependencies.

### Shuffle

A shuffle represents a **repartitioning** of the data, e.g. when sorting or grouping data.

When we run these types of operations, the executors must co-ordinate in order for the data to move around in the appropriate order.

Spark is also optimized to keep track of the stages and the order in which they must be executed in order to complete the spark job.
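To make the relationship between shuffles and stages concrete, here is a toy model in plain Python. It is **not** how Spark's scheduler is implemented; it only applies the rule "a shuffle starts a new stage" to a linear chain of operation names. The function `split_into_stages` and the set `SHUFFLE_OPS` are names I made up for this sketch, and the set is deliberately not exhaustive.

```python
# Toy model of stage boundaries: NOT Spark's scheduler, just the rule
# "a shuffle starts a new stage" applied to a linear chain of operations.

# Illustrative, non-exhaustive set of operations that trigger a shuffle.
SHUFFLE_OPS = {"repartition", "reduceByKey", "groupByKey", "sortByKey"}


def split_into_stages(lineage):
    """Group a linear list of operation names into stages."""
    stages = [[]]
    for op in lineage:
        if op in SHUFFLE_OPS:
            # The shuffled data is read at the start of a new stage.
            stages.append([op])
        else:
            # Operations that need no shuffle are pipelined into the current stage.
            stages[-1].append(op)
    return stages


lineage = ["parallelize", "map", "repartition", "map"]
stages = split_into_stages(lineage)

for number, ops in enumerate(stages):
    print(f"stage {number}: {ops}")
print("number of stages:", len(stages))
```

With one shuffle (`repartition`) in the chain, the sketch reports two stages.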

### Tasks

As stated before, a stage consists of various **tasks**. A **task** is simply a combination of chunks of data and a series of transformations that will be run on a **single executor.**

You can think of a task as a **unit of computation** that is applied to a **unit of data**. In this context, the unit of data is a **partition**.

As such, if we have 8 partitions, we would have 8 tasks.

Furthermore, the tasks within a stage are executed in parallel. As such, the higher we set our number of partitions, the more parallelism we can achieve, up to the number of cores available in the cluster.

Recall, the Spark documentation recommends 2-4 partitions per CPU core in your cluster.
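The "one task per partition" idea can be sketched without Spark at all. The snippet below is a simplified stand-in, assuming a thread pool plays the role of the executors' cores; `make_partitions` and `run_task` are hypothetical helper names, and Python threads are used only to illustrate scheduling, not to get real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, num_partitions):
    """Split data into num_partitions chunks (round-robin, for simplicity)."""
    data = list(data)
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """One task: apply the transformation to every element of one partition."""
    return [x ** 2 for x in partition]


partitions = make_partitions(range(32), num_partitions=8)

# One task per partition; the pool size stands in for the available cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    task_results = list(pool.map(run_task, partitions))

print("partitions:", len(partitions))
print("tasks run:", len(task_results))
```

With 8 partitions and 4 workers, each worker handles roughly 2 tasks, which is in the spirit of having a few partitions per core.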

# Performance

### Broadcast Variables

Let's imagine for a moment that we are dealing with a very large dataset. 

Furthermore, let's say that we need all of our executors to access this dataset.

Now, we could do this by simply assigning the data to a variable, and then passing this variable around:

In [1]:
really_large_dataset = 100

In [2]:
rdd = sc.parallelize(range(10))

In [3]:
rdd.map(lambda x: x**2 + really_large_dataset).collect()

[100, 101, 104, 109, 116, 125, 136, 149, 164, 181]

Now, the above code is perfectly correct. Simply referencing your large dataset when you have a single stage to process is no problem, but what if we had multiple stages?

In [4]:
rdd.map(lambda x: x**2 + really_large_dataset).repartition(4).map(
lambda x: x ** 2 + really_large_dataset).collect()

[10100, 10301, 10916, 11981, 13556, 15725, 18596, 22301, 26996, 32861]

Again, the above code might look fine to you. However, the above code is **masking** a very important consideration.

The variable `really_large_dataset` is being **serialized** and sent with the task information for **both** stages.

Now, the above example is incredibly simplistic, but if we really did have a large amount of data, **serializing it numerous times** would be highly inefficient.

Now, since sharing data between executors is so common, spark has a way to **efficiently** handle this. Instead of just referencing a variable, we use **broadcast variables.**

In essence, a broadcast variable ensures that your data remains available to all executors **between stages**. The data is simply sent to the executors once and cached there, instead of being sent as many times as there are stages.
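To get a feel for why this matters, here is a back-of-the-envelope calculation in plain Python. It follows the simplified picture described above (data shipped with every task versus shipped once and referenced by a small handle) and does not measure what Spark actually sends over the wire. The job shape (`num_stages`, `tasks_per_stage`, `num_executors`) and the dictionary standing in for the handle are assumptions made up for illustration.

```python
import pickle

# Stand-in for a large lookup table that every task needs.
lookup = {i: i * i for i in range(10_000)}

# Assumed job shape (hypothetical numbers, for illustration only).
num_stages = 2
tasks_per_stage = 4
num_executors = 2

payload_bytes = len(pickle.dumps(lookup))
# A tiny reference standing in for a broadcast handle.
handle_bytes = len(pickle.dumps({"broadcast_id": 0}))

total_tasks = num_stages * tasks_per_stage

# Plain variable: the data travels with every task.
closure_total = payload_bytes * total_tasks

# Broadcast: the data is shipped once per executor; tasks carry only a handle.
broadcast_total = payload_bytes * num_executors + handle_bytes * total_tasks

print(f"plain variable: {closure_total:,} bytes")
print(f"broadcast:      {broadcast_total:,} bytes")
```

Under these assumptions the broadcast approach ships far fewer bytes, and the gap widens as the number of stages and tasks grows.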

Here is an example of a broadcast variable:

In [5]:
# this data is now sent out to the executors.
really_large_dataset = sc.broadcast(100)

In [6]:
really_large_dataset #broadcast variable

<pyspark.broadcast.Broadcast at 0x1094bf080>

In [7]:
really_large_dataset.value #Broadcast variable VALUE

100

In [8]:
#same result as before!
rdd.map(lambda x: x**2 + really_large_dataset.value).collect()

[100, 101, 104, 109, 116, 125, 136, 149, 164, 181]

### Broadcast Variable vs Broadcast Value

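As we can see from above, we obtain the same results as before when using a broadcast variable. Note that we accessed the data through the `.value` attribute *inside* the function that runs on the executors.

Be careful where you extract the value. Pulling `.value` out on the driver and referencing the resulting plain object inside your functions would defeat the purpose of using a broadcast variable, since the data would once again be serialized with every task. Rather, we should just pass around the broadcast variable and extract the value at the very end, where it is actually needed. This will ensure that we are making good use of our efficiency gains.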

#### Read Only

Broadcast variables are read-only. If you change the value in one place, you cannot expect all executors to be made aware of the change.

It is good practice to set your broadcast variables up at the very beginning and leave them unchanged. Attempts to change their value in the executors will lead to very odd results.

#### Persistence

One of the side effects of broadcast variables is that spark creates a cached copy of the data. This is what allows the data to be available to all executors between stages.

As such, when you are done with a particular broadcast variable, be sure to remove the cached copies to ensure continued efficiency of your code. Unpersisting the data will allow you to reclaim memory space.

In [9]:
really_large_dataset.unpersist()

### Accumulators

For those of you experienced with functional programming, accumulators will be nothing new. Haskell constantly uses accumulators.

In essence, an accumulator is a structure that can **always be added to** but is **only visible** in the **main driver.**

Let's illustrate this with a simple example:

In [14]:
even_counter = sc.accumulator(0)

data = sc.parallelize(range(20))

def add_one(val):
    global even_counter
    if (val%2 == 0):
        even_counter += 1
    return val + 1

In [15]:
# results 
data.map(add_one).collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [16]:
# value of the accumulator - the number of even numbers in data
even_counter.value

10

Now, the above code is pretty self-explanatory; however, let's talk about some common mistakes that occur with accumulators.

#### Running actions numerous times

If I were to run `data.map(add_one).collect()` again, our `even_counter` would now have a value of 20.

Recall that we did **not** reset the accumulator value to 0. As such, for every action, our accumulator value will update.

#### Accumulators & transformations/failures

There are times when a task in a spark job will fail or need to be re-executed (for example, when a lost partition has to be recomputed). When this happens, the code inside a transformation may run more than once for the same data, and your accumulator will be incremented each time.

Thus, Spark only guarantees that each task's update is applied exactly once when the accumulator is updated inside an **action** (e.g. `foreach`). Updates made inside a **transformation** (e.g. `map`) carry no such guarantee.
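The double-counting risk can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's accumulator implementation: `ToyAccumulator` and `process_partition` are hypothetical names, and the "retry" is simulated by simply processing one partition a second time.

```python
class ToyAccumulator:
    """Add-only counter, loosely modelled on the accumulator idea."""

    def __init__(self, value=0):
        self.value = value

    def add(self, amount):
        self.value += amount


even_counter = ToyAccumulator()


def add_one(val):
    # Side effect: count even inputs while transforming the data.
    if val % 2 == 0:
        even_counter.add(1)
    return val + 1


def process_partition(partition):
    """Stand-in for one task applying the transformation to a partition."""
    return [add_one(v) for v in partition]


partitions = [list(range(0, 10)), list(range(10, 20))]

# First attempt: every partition is processed exactly once.
for partition in partitions:
    process_partition(partition)
count_after_first_run = even_counter.value

# Simulated recomputation of the first partition (e.g. after a lost result).
process_partition(partitions[0])
count_after_retry = even_counter.value

print(count_after_first_run)  # 10
print(count_after_retry)      # 15
```

The data did not change, yet the counter moved from 10 to 15 because the side effect ran again for the recomputed partition.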

#### Lazy Evaluation

Recall that spark relies on lazy evaluation, thus if I had just run `data.map(add_one)`, our counter would have remained at 0.
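Python's built-in `map` happens to be lazy as well, so the same effect can be reproduced without Spark. The snippet below is an analogy only: `list(...)` plays the role of an action such as `collect()`, and the dictionary `counter` is a stand-in for the accumulator.

```python
counter = {"evens": 0}


def add_one(val):
    # Side effect: count even inputs while transforming the data.
    if val % 2 == 0:
        counter["evens"] += 1
    return val + 1


# Like a transformation: this only describes the work, nothing runs yet.
lazy_result = map(add_one, range(20))
count_before = counter["evens"]

# Like an action: consuming the iterator forces the computation.
materialized = list(lazy_result)
count_after = counter["evens"]

print(count_before)  # 0
print(count_after)   # 10
```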

Accumulators are often used for internal statistics or maintaining long pipelines.

They are a very useful feature!

# Advanced Spark

### Spark Streaming

Spark Streaming is the abstraction required for dealing with data in real-time. It allows you to read in input data in discrete time chunks and run various processes on those chunks.

This is incredibly important for scalable/production grade solutions using spark architecture.

There is incredibly detailed and well written documentation available.

Below I will give a very simple example.

The central construct in spark streaming is not an RDD, but rather a [Discretized Stream (DStream)](https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html#discretized-streams-dstreams).

In [1]:
from pyspark.streaming import StreamingContext

In [2]:
# the 5 is the batch interval: incoming data is grouped into 5 second batches!
stream_context = StreamingContext(sc, 5)

In [3]:
input_file = stream_context.textFileStream("test")

In [4]:
pairs = input_file.flatMap(lambda x: x.split(" ")).map(lambda v: (v,1))

count = pairs.reduceByKey(lambda x,y: x+y)

count.pprint()

In [5]:
stream_context.start()

-------------------------------------------
Time: 2018-02-25 20:17:00
-------------------------------------------

-------------------------------------------
Time: 2018-02-25 20:17:05
-------------------------------------------

-------------------------------------------
Time: 2018-02-25 20:17:10
-------------------------------------------
('B', 1)
('C', 1)
('A', 1)
('words', 1)
('some', 1)

-------------------------------------------
Time: 2018-02-25 20:17:15
-------------------------------------------

-------------------------------------------
Time: 2018-02-25 20:17:20
-------------------------------------------

-------------------------------------------
Time: 2018-02-25 20:17:25
-------------------------------------------



In [6]:
stream_context.stop()

What is important to note in the above example is that I **first** started the stream and **then created a file on the fly** in the `test` directory.

This mimics the arrival of new data.
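The idea of chopping incoming data into discrete time chunks can be sketched in plain Python. This is a toy model of micro-batching, not the DStream API: `to_micro_batches` and `count_words` are hypothetical helpers, arrival times are plain integers in seconds, and, unlike the real output above, empty intervals are simply skipped.

```python
from collections import Counter, defaultdict


def to_micro_batches(events, interval):
    """Group (arrival_second, line) events into fixed-width time buckets."""
    batches = defaultdict(list)
    for arrival_second, line in events:
        batch_start = (arrival_second // interval) * interval
        batches[batch_start].append(line)
    return dict(sorted(batches.items()))


def count_words(lines):
    """Word count over the lines of a single batch."""
    return Counter(word for line in lines for word in line.split(" "))


# Made-up arrivals: two lines land in the same 5 second window, one lands later.
events = [(11, "some words A B C"), (13, "A B"), (27, "more words")]

batches = to_micro_batches(events, interval=5)
counts = {start: count_words(lines) for start, lines in batches.items()}

for start, word_counts in counts.items():
    print(f"batch starting at t={start}s: {dict(word_counts)}")
```

Each batch is counted independently, which mirrors how the word count above is printed once per interval rather than accumulated across the whole stream.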

### DataFrame & SQL

Spark's API allows you to interface with DataFrames (think Pandas, R) as well as SQL.

Let's walk through an example:

In [12]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
data = sqlContext.read.json("samples/sample.json")

In [13]:
#show the inferred schema SQL style
data.printSchema()

root
 |-- Height: double (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)



In [14]:
# see data in tabular form!
data.show()

+------+---+-------+
|Height| id|   name|
+------+---+-------+
| 181.5|  1|Ibrahim|
| 175.4|  2|   Juan|
| 160.7|  3| Andrew|
+------+---+-------+



Now, there are 2 ways to run SQL queries. The first way is by applying SQL methods to `data`.

In [15]:
data.filter(data['Height'] > 161).select(data['id']).show()

+---+
| id|
+---+
|  1|
|  2|
+---+



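The second way is to write SQL queries directly! For this, we first need to register the DataFrame as a table using the `registerTempTable` method (deprecated since Spark 2.0 in favor of `createOrReplaceTempView`, which is called the same way).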

In [16]:
# I am giving our data a name and instantiating a SQL table
data.registerTempTable("my_data")

In [17]:
sqlContext.sql("SELECT id FROM my_data WHERE Height > 161").collect()

[Row(id=1), Row(id=2)]

Spark [DataFrames](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#datasets-and-dataframes) are used extensively for Machine Learning. Be sure to check them out in the documentation.

### ML & MLlib

There are 2 machine learning APIs for spark, ML & MLlib:

- [ML](https://spark.apache.org/docs/2.2.0/ml-guide.html): A higher level API that is built upon **DataFrames**

- [MLlib](https://spark.apache.org/docs/2.2.0/mllib-guide.html): A lower level API that is built upon **RDDs**

Where possible, you should always aim to use **ML**. DataFrames are built upon RDDs, but they should be the data structure of choice when using spark, especially in the context of machine learning, unless you really need the low-level control of an RDD.

The ML API uses the abstraction of a Pipeline - click [here](https://spark.apache.org/docs/2.2.0/ml-pipeline.html#example-pipeline) to see an excellent example of this in the documentation.

# Conclusion

That's all I have for this short series on PySpark!

In essence, this series focused on the [RDD](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#resilient-distributed-datasets-rdds) which is at the core of all spark abstractions!

In practice, it is rare that you will be manipulating RDDs directly all the time - but understanding how RDDs work is essential when thinking about how to write efficient code.

It is more likely that you will interface with [DataFrames](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#datasets-and-dataframes) in your day to day spark'ing. The documentation is pretty excellent for this data structure. You should have everything you need to shift over to dataframes! Especially if you have any experience with Pandas/R.

I would highly recommend the following [book](https://www.amazon.com/Spark-Definitive-Guide-Processing-Simple/dp/1491912219) for those who want to continue their spark journey.

This blog post was far from a comprehensive overview - it is important to keep in mind that Spark is an **active** project and this will change over time. Rather, I hope this blog series gave you an insight into thinking about code in the context of big data as well as a flavor of the tools, abstractions and methods that can be used to streamline your work flow in addition to generating insight along the way.

As always, feel free to get in touch: igabr@uchicago.edu or [@Gabr\_Ibrahim](https://twitter.com/Gabr_Ibrahim)

Cheers!