# Introduction to Spark 🎇🎇

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Apache_Spark_logo-f31fc9de-9456-4351-b459-2fe24f92628b.png)

## What you will learn in this course 🧐🧐
This course will teach you some theory about the Spark framework, how it works and what advantages it has over other distributed computing frameworks. Here's the outline:

* Numbers everyone should know
* Apache Spark
* Hadoop vs Spark
    * Faster through In-Memory computation
    * Simpler (high-level APIs) and execution engine optimisation
* Need for a DFS (Distributed File System)
* Spark's mechanics
    * DAG (Directed Acyclic Graphs) scheduling
    * Lazy execution
        * Transformations
        * Actions
    * Mixed language
* PySpark


## Numbers everyone should know 🔢🔢

> Remember that there is a "computer" in "computer science".
— Peter Norvig

- CPU ≈ 1 ns
- Memory ≈ 100 ns
- Disk ≈ 20 μs
- Network ≈ 150 ms

This means a CPU operation ($10^{-9}$ s) is about a hundred times faster than a memory access ($10^{-7}$ s), which is about two hundred times faster than a disk access ($2 \times 10^{-5}$ s), which is itself about 7,500 times faster than sending data over the network ($1.5 \times 10^{-1}$ s). This gives you a high-level perspective on how developers can make code run faster: rely primarily on the faster components for computation.
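To make these ratios concrete, here is a minimal sketch that recomputes them from the approximate figures listed above. The dictionary and the `slowdown` helper are illustrative names, not part of any library.

```python
# Approximate latencies from the list above, expressed in seconds.
LATENCY_S = {
    "cpu": 1e-9,        # ~1 ns
    "memory": 100e-9,   # ~100 ns
    "disk": 20e-6,      # ~20 microseconds
    "network": 150e-3,  # ~150 ms
}


def slowdown(slower: str, faster: str) -> float:
    """Return how many times slower one component is than another."""
    return LATENCY_S[slower] / LATENCY_S[faster]


# Compare each component with the next slower one.
for faster, slower in [("cpu", "memory"), ("memory", "disk"), ("disk", "network")]:
    print(f"{slower} is ~{slowdown(slower, faster):,.0f}x slower than {faster}")
```

The figures are orders of magnitude, so the ratios matter more than the exact values.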

Before we dive into Spark, it's important to understand some computer science basics, in particular these latency numbers every programmer should know, listed by Peter Norvig in his famous essay [Teach Yourself Programming in Ten Years](http://norvig.com/21-days.html#answers).

## Apache Spark ✨✨

Apache Spark is an open-source distributed cluster-computing system designed to be fast and general-purpose.

Created in 2009 at Berkeley's AMPLab by Matei Zaharia, the Spark codebase was donated in 2013 to the Apache Software Foundation. It has since become one of its most active projects.

Spark provides high-level APIs in Scala, Java, Python and R, and an optimized execution engine. On top of this sit higher-level tools including Spark SQL, MLlib, GraphX and Spark Streaming.

## Hadoop vs Spark 🐘🆚✨

- Faster through In-Memory computation
- Simpler (high-level APIs) and execution engine optimisation

### Faster through In-Memory computation ⚡

Because memory access is much faster than disk access, Spark's in-memory computation makes it much faster than Hadoop MapReduce, which writes intermediate results to disk between steps.

### Simpler (high-level APIs) and execution engine optimisation 🧸

Spark's high-level APIs combined with lazy computation *mean we don't have to optimize each query by hand. Spark's execution engine takes care of building an optimized physical execution plan.*

Also, code you write in "local" mode will work on a cluster "out-of-the-box" thanks to Spark's higher level API.

That doesn't mean it will be easy to write Spark code, but Spark makes it much easier to write optimized code that will run at big data scale.

## The need for distributed storage 🔀🔀

If compute is distributed, all the machines need access to the data. Without distributed storage, that would be **very tedious**.

Unlike Hadoop, Spark doesn't come with its own file system, but it can interface with many existing storage systems, such as the Hadoop Distributed File System (HDFS), Cassandra, Amazon S3 and many more.

Spark also supports a pseudo-distributed local mode (for development or testing purposes). In this case, Spark runs on a single machine with one executor per CPU core, and distributed file storage is not required.

## Spark mechanics ⚙️⚙️

> At a high level, every Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains your application's main function and defines distributed datasets on the cluster, then applies operations on them.
- Learning Spark, page 14

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/cluster-overview-273ddf73-9063-47bb-9060-e094443700eb.png" />

### DAG (Directed Acyclic Graph) Scheduling 📅

In order to distribute the execution among the worker nodes, Spark transforms the logical execution plan into a physical execution plan (how the computation will actually take place). While doing so, it builds an execution plan that maximizes performance, in particular by avoiding moving data across the network because, as we've seen, network latency is the worst.
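As a rough illustration of what "directed acyclic graph" means for scheduling, the sketch below orders a handful of hypothetical steps so that each one runs only after the steps it depends on. It uses Python's standard `graphlib` module (Python 3.9+) and is not Spark's scheduler, which also has to decide where data lives and when it must cross the network.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each step maps to the set of steps it depends on.
dag = {
    "read_logs": set(),
    "read_users": set(),
    "filter_errors": {"read_logs"},
    "join": {"filter_errors", "read_users"},
    "count_by_user": {"join"},
}

# A valid execution order always places dependencies before dependents.
execution_order = list(TopologicalSorter(dag).static_order())
print(execution_order)
```

Because the graph has no cycles, such an order always exists. The two `read_*` steps do not depend on each other, so they could run in parallel.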

### Lazy Execution 😴

Part of what makes Spark so efficient is lazy execution. This concept sets Spark (and therefore PySpark, the Python API for the Spark framework) apart from classic Python.

Python runs in what we call **eager execution**, meaning every time you write some code and execute it, the operations your code asks the computer to perform happen immediately and the result is returned.

In Spark, things work a little differently. There are two types of operations: **transformations** and **actions**.
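The difference between eager and lazy execution can be seen in plain Python, without Spark. The sketch below is only an analogy: a list comprehension runs immediately, while a generator expression merely describes the work until something consumes it. The `calls` list and `square` function are illustrative names.

```python
calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


# Eager: the list comprehension runs square() right away.
eager = [square(x) for x in range(3)]
calls_after_eager = len(calls)  # 3

# Lazy: the generator expression only describes the work to do.
lazy = (square(x) for x in range(3))
calls_after_lazy_definition = len(calls)  # still 3, nothing new has run

# Consuming the generator plays the role of an "action".
result = list(lazy)
calls_after_consumption = len(calls)  # 6
```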

#### Transformations 🧙

**Transformations** are all operations that do not explicitly require the computer to return a result that should be stored, displayed or saved somewhere. These operations are only written to the graph, waiting for an action to come up. For example, if you wish to calculate the frequency of all the words in a set of text data, you may want to

1. isolate each word,
2. assign them a value of 1,
3. group elements by key (meaning the word itself) so all occurrences of the same word are grouped together,
4. aggregate by summing the values associated with the words for each group.

None of these operations requires direct display or storage of a result. Together they just form a roadmap that can be optimized whenever you request to see the result!
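The four steps above can be sketched in plain Python with generator functions. This is an analogy rather than Spark code: the function names and the sample `lines` are made up for illustration, and everything runs on one machine.

```python
from collections import defaultdict

lines = ["spark is fast", "spark is lazy"]  # sample data for illustration


def split_words(lines):
    """Step 1: isolate each word."""
    for line in lines:
        yield from line.split()


def pair_with_one(words):
    """Step 2: assign each word a value of 1."""
    for word in words:
        yield (word, 1)


def group_by_key(pairs):
    """Step 3: group the values by word."""
    groups = defaultdict(list)
    for word, value in pairs:
        groups[word].append(value)
    yield from groups.items()


def sum_values(groups):
    """Step 4: sum the values of each group."""
    for word, values in groups:
        yield (word, sum(values))


# Chaining generator functions runs nothing yet: this is only the plan.
plan = sum_values(group_by_key(pair_with_one(split_words(lines))))

# Asking for the result is what finally triggers the work.
counts = dict(plan)
print(counts)
```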

#### Actions 🦸

**Actions** are operations that explicitly ask the computer to display or store the result of an operation. Taking our previous example, if we ask to see the complete list of words with their frequencies, then all the previously mentioned transformations will actually execute one after the other. This can be very efficient because Spark knows all the operations that need to be done and can plan accordingly. Additionally, if you only want an extract of the result rather than the full result, for example to make sure the code runs correctly, Spark will do only enough work to give you what you asked for and then stop (think of it as testing a piece of code on a sample instead of the full dataset for speed reasons).
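The "only work enough" idea can be illustrated in plain Python with `itertools.islice`. This is again an analogy and not Spark's actual mechanism: the `processed` list simply records which records were touched, and the one-million-row source is hypothetical.

```python
from itertools import islice

processed = []  # records which input records were actually touched


def to_upper(records):
    """A lazy step: uppercase each record as it is requested."""
    for record in records:
        processed.append(record)
        yield record.upper()


# A hypothetical large source; nothing is generated until requested.
records = (f"row-{i}" for i in range(1_000_000))
pipeline = to_upper(records)

# Ask for only the first three results.
preview = list(islice(pipeline, 3))
print(preview, len(processed))
```

Only three records are processed here, even though the source describes a million of them.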

Lazy execution makes Spark very efficient, but it also makes it harder to debug when something goes wrong. Only some errors can be detected when you define transformations, because Spark does not actually run the code at that point. You can be met with a runtime error later, when an action makes the code actually run, and if the result you get is not the one you expected, you'll need to go back and inspect every transformation to find out where something went wrong.
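The same kind of deferred error can be reproduced with a plain Python generator. The sketch below is an analogy under that assumption: defining the step succeeds even though the data is bad, and the failure only surfaces when the result is requested.

```python
def parse_int(values):
    """A lazy step: convert each value to an integer when requested."""
    for value in values:
        yield int(value)  # raises ValueError on non-numeric input


raw = ["1", "2", "oops", "4"]  # sample data with one bad record

# Defining the step does not raise, even though the data is bad.
plan = parse_int(raw)

# The error appears only when the result is actually requested.
try:
    total = sum(plan)
    error = None
except ValueError as exc:
    total = None
    error = exc

print(total, repr(error))
```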

This may seem intimidating, but you can always add actions, such as displaying the first few lines of data, after each transformation to run sanity checks on what you are doing. It's a fair price to pay to be able to work with huge amounts of data.
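In the plain-Python analogy, such a sanity check could look like the hypothetical `peek` helper below, which prints the first few items of a lazy step and then hands back an iterator over all of them. It is a sketch of the habit, not a Spark API.

```python
from itertools import chain, islice


def peek(iterable, n=2, label=""):
    """Print the first n items, then return an iterator over all items."""
    iterator = iter(iterable)
    head = list(islice(iterator, n))
    print(f"{label}: {head}")
    # Put the inspected items back in front of the remaining ones.
    return chain(head, iterator)


# Inspect the data after each step of a small pipeline.
words = peek(iter("spark is lazy".split()), label="words")
pairs = peek(((word, 1) for word in words), label="pairs")
result = list(pairs)
```

Each check forces part of the pipeline to run, so it is best kept for development and removed once the code works.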

## Mixed language ☯️☯️

Apache Spark is written in Scala, runs on the Java Virtual Machine, and can be used from Scala (primary), Java, Python (PySpark) and R.

Because Spark is written in Scala, PySpark, the Python interface, tends to follow Scala's conventions, whether for small details like naming (PySpark's API is frequently inconsistent with Python's standard good practices, for example using camelCase instead of snake_case) or for the overall programming paradigm, such as functional programming.

The functional paradigm is particularly well suited to distributed computing because it relies on concepts like immutability.
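A small plain-Python sketch of why immutability helps: if each chunk of data is never modified in place, chunks can be processed independently and in any order. The `partitions` tuple and the `double` function are illustrative assumptions, not Spark objects.

```python
# Two "partitions" of the same dataset, stored as immutable tuples.
partitions = (
    (1, 2, 3),
    (4, 5, 6),
)


def double(partition):
    """Pure function: returns a new tuple and leaves its input untouched."""
    return tuple(x * 2 for x in partition)


# Each partition is transformed independently of the others.
doubled = tuple(map(double, partitions))
total = sum(sum(partition) for partition in doubled)

print(doubled, total)
```

Since `double` never changes its input, the original partitions are still available afterwards, which is what makes it safe to recompute a lost piece of work.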

## PySpark 🐍✨

PySpark is the Python API for Apache Spark.

Powerful, but some caveats:

- *Not as exhaustive as other Python libraries for data analysis and modeling (pandas, sklearn, etc.)*
- *Will be slower than these on small data*
- *Mixed language (harder to debug, common to find resources for Scala and not Python)*

Debugging PySpark is hard:

- *Debugging distributed systems is hard*
- *Debugging mixed languages is hard*
- *Lazy evaluation can be difficult to debug*

## Resources 📚📚

* The [official Spark documentation](http://spark.apache.org/docs/latest/)
* [cluster-overview](https://spark.apache.org/docs/latest/cluster-overview.html)
* Interesting notes on clusters: [https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-cluster.html](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-cluster.html)
* You can take a look at [Spark Basics : RDDs,Stages,Tasks and DAG](https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454) but this covers concepts we haven't seen yet
* [Debugging PySpark](https://www.youtube.com/watch?v=McgG09XriEI)