# How to Run PySpark in a Jupyter Notebook on Your Laptop

Well, I'm not going to fuss around and try to sugar coat this for us, so I'll just say it: it's time to learn pyspark.
Yes, yes, I know, I can hear you saying that we just spent all that time converting from R and learning python and pandas and why the hell do we need yet another API for working with dataframes?
That's a totally fair question.

Basically it comes down to scalability. 
If our pythonic prototypes are going to be of much use in the real world where datasets are humongous, we need a way for our computations and datasets to scale across multiple nodes in a distributed system.

Enter spark.

[Apache Spark](https://spark.apache.org/) is a unified analytics engine for large-scale data processing. 
[PySpark](https://spark.apache.org/docs/latest/api/python/) is essentially a way to access the functionality of spark via python code.
While there are other high-level interfaces to spark (such as Java, Scala, and R), for data scientists who are already working extensively with python, pyspark will be the natural interface of choice.

So, here's the plan. 
First we're going to get set up to run pyspark locally in a jupyter notebook on our laptop.
This is my preferred environment for interactively playing with pyspark and learning the ropes.
Then we're going to get up and running in pyspark as quickly as possible by reviewing the most essential functionality and comparing it to how we would do things in pandas.
Once we're comfortable running pyspark on the laptop, it's going to be much easier to jump onto a distributed cluster and run pyspark at scale.

Great, let's do this.

## How to Run PySpark in a Jupyter Notebook on Your Laptop

Ok, I'm going to walk us through how to get things installed on a Mac or Linux machine where we're using homebrew to install packages and conda to manage virtual environments.
If you have a different setup, your favorite search engine will help you.

### Install Spark

Install Scala.

```
$ brew install scala
```

Install Spark.

```
$ brew install apache-spark
```

Check where spark is installed.

```
$ brew info apache-spark
apache-spark: stable 3.1.1, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/3.1.2 (1,361 files, 242.6MB) *
...
```

Set the `SPARK_HOME` environment variable to the path returned by `brew info` with `/libexec` appended to the end.
Don't forget to add the export to your `.zshrc` file too (or the equivalent startup file for your shell), so it's set in every new terminal.

```
$ export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2/libexec
```

Test the installation by starting the spark shell.

```
$ spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 14.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
```

If you get the `scala>` prompt, then you've successfully installed spark on your laptop!

### Install PySpark

Use conda to install the pyspark python package.
As usual, it's advisable to do this in a new virtual environment.


```
$ conda install pyspark
```

You should be able to launch an interactive pyspark REPL by running `pyspark`.

```
$ pyspark
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.8.3 (default, Jul  2 2020 11:26:31)
Spark context Web UI available at http://192.168.100.47:4041
Spark context available as 'sc' (master = local[*], app id = local-1624127229929).
SparkSession available as 'spark'.
>>> 
```

This time we get a familiar python `>>>` prompt.
This is an interactive pyspark shell where we can easily experiment with pyspark.
Feel free to run the example code in this post here in the pyspark shell, or, if you prefer a notebook, read on and we'll get set up to run pyspark in a jupyter notebook.

### The Spark Session Object

You may have noticed that when we launched that pyspark interactive shell, it told us that something called `SparkSession` was available as `'spark'`.
So basically, what's happening here is that when we launch the pyspark shell, it instantiates an object called `spark` which is an instance of class `pyspark.sql.session.SparkSession`.
The spark session object is going to be our entry point for all kinds of pyspark functionality, i.e., we're going to be saying things like `spark.this` and `spark.that()` to make stuff happen. 

The pyspark interactive shell is kind enough to instantiate one of these spark session objects for us automatically.
However, when we're using another interface to pyspark (like say a jupyter notebook running a python kernel), we'll have to make a spark session object for ourselves.

### Create a PySpark Session in a Jupyter Notebook

There are a few ways to run pyspark in jupyter which you can read about [here](https://www.datacamp.com/community/tutorials/apache-spark-python).

For derping around with pyspark on your laptop, I think the best way is to run a standard jupyter notebook session on a regular python kernel and instantiate the spark session from there.
We'll use the findspark package to locate the spark installation and make it available to python before we create the session.
So, first install the findspark package.

```
$ conda install findspark
```
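Before launching jupyter, it can be worth confirming that both packages are actually importable from the active environment, since a notebook kernel pointing at a different environment is a common source of import errors. Here's a minimal sketch using only the standard library. Note that `package_status` is a helper name I made up for illustration; it is not part of pyspark or findspark.

```python
import importlib.util
from importlib import metadata


def package_status(name):
    """Return the installed version of `name`, or None if it can't be imported.

    Hypothetical helper for illustration only.
    """
    # find_spec looks the package up on sys.path without importing it.
    if importlib.util.find_spec(name) is None:
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata is available for it.
        return "unknown"


for pkg in ("pyspark", "findspark"):
    version = package_status(pkg)
    if version is None:
        print(f"{pkg}: not importable from this environment")
    else:
        print(f"{pkg}: found (version {version})")
```

If either package shows up as not importable, the likely culprit is that the conda environment you installed into isn't the one currently active.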

Launch jupyter as usual.

``` 
$ jupyter notebook
```


Go ahead and fire up a new notebook using a regular python 3 kernel.
Once you land inside the notebook, there are a couple things we need to do to get a spark session instantiated.
You can think of this as boilerplate code that we need to run in the first cell of a notebook where we're going to use pyspark.

First import findspark and run its `init` function.

```
import findspark

findspark.init()
```

If you get errors, re-read the spark installation instructions above to make sure you correctly set the `SPARK_HOME` environment variable.
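One way to narrow down such errors is to check, from inside the notebook, what `SPARK_HOME` looks like to the python kernel, which may have been launched from a shell that never ran the export. This is a rough sketch using only the standard library. `check_spark_home` is a hypothetical helper, and treating the presence of a `bin/spark-submit` launcher as the sign of a usable install is my own assumption about what to look for.

```python
import os
from pathlib import Path


def check_spark_home(env=os.environ):
    """Return (ok, message) describing whether SPARK_HOME looks usable.

    Hypothetical helper for illustration only.
    """
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return False, "SPARK_HOME is not set in this process"
    # Assumption: a usable spark install has its launcher script under bin/.
    launcher = Path(spark_home) / "bin" / "spark-submit"
    if not launcher.is_file():
        return False, f"no bin/spark-submit found under {spark_home}"
    return True, f"SPARK_HOME looks good: {spark_home}"


ok, message = check_spark_home()
print(message)
```

If the variable turns out to be missing, restarting jupyter from a terminal where the export has been run is usually the simplest fix.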

Next, import pyspark and instantiate a spark session.

```
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('My Spark App').getOrCreate()
```

Once you run this, you're ready to rock and roll with pyspark in your jupyter notebook.

Note that Spark provides a handy web UI that you can use for monitoring and debugging.
Once you instantiate the spark session, you can open the UI in your web browser at [http://localhost:4040/jobs/](http://localhost:4040/jobs/).

For a next step, [this tutorial](https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning#install) walks us through a simple ML application.