# How to Get PySpark Running in Your Local Jupyter Notebook

## How to Install Spark

Install Scala.

```
$ brew install scala
```

Install Spark.

```
$ brew install apache-spark
```

Check where spark is installed.

```
$ brew info apache-spark
apache-spark: stable 3.1.1, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/3.1.2 (1,361 files, 242.6MB) *
...
```

Set the `SPARK_HOME` environment variable to the install path reported by `brew info` (the line starting with `/usr/local/Cellar`), with `/libexec` appended to the end.
Don't forget to add the export to your `.zshrc` file too, so that it's set in every new shell.

```
$ export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2/libexec
```

Test the installation by starting the spark shell.

```
$ spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 14.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
```

If you get the `scala>` prompt, then you've successfully installed spark on your laptop!

## How to Install PySpark

Use conda to install the pyspark python package.


```
$ conda install pyspark
```

You should now be able to launch an interactive pyspark REPL by running `pyspark`.

```
$ pyspark
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.8.3 (default, Jul  2 2020 11:26:31)
Spark context Web UI available at http://192.168.100.47:4041
Spark context available as 'sc' (master = local[*], app id = local-1624127229929).
SparkSession available as 'spark'.
>>> 
```

This time we get a familiar `>>>` prompt.
This is an interactive pyspark shell where we can easily experiment with pyspark.
Feel free to run the example code in this post here in the pyspark shell, or, if you prefer a notebook, read on and we'll get set up to run pyspark in a jupyter notebook.

## The Spark Session Object

You may have noticed that when we launched that pyspark interactive shell, it told us that something called `SparkSession` was available as `'spark'`.
When we launch the pyspark shell, it instantiates an object called `spark`, which is an instance of the class `pyspark.sql.session.SparkSession`.
The spark session object is going to be our entry point for all kinds of pyspark functionality, i.e., we're going to be saying things like `spark.this` and `spark.that()` to make stuff happen.
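
To make `spark.this` and `spark.that()` a bit more concrete, here's a minimal sketch you could paste into the pyspark shell, where `spark` already exists. The helper name `explore` is made up for illustration; `version`, `range()`, and `sql()` are regular members of `SparkSession`.

```python
def explore(spark):
    """Poke at a few members of a SparkSession and return what they report."""
    # An attribute: the version string of the running spark install.
    version = spark.version
    # A method: build a one-column DataFrame holding the ids 0 through 4.
    row_count = spark.range(5).count()
    # Another method: run a SQL query and pull the single value out of the result.
    two = spark.sql("SELECT 1 + 1 AS two").collect()[0]["two"]
    return version, row_count, two
```

In the pyspark shell, `explore(spark)` should hand back your spark version string followed by `5` and `2`.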

The pyspark interactive shell is kind enough to instantiate one of these spark session objects for us automatically.
However, when we're using another interface to pyspark (like, say, a jupyter notebook running a python kernel), we'll have to make a spark session object for ourselves.

## Run PySpark in a Jupyter Notebook

There are a few ways to run pyspark in jupyter, which you can read about [here](https://www.datacamp.com/community/tutorials/apache-spark-python).

For derping around with pyspark on your laptop, I think the best way is to instantiate a spark session from a jupyter notebook running on a regular python kernel.
The method we'll use involves running a standard jupyter notebook session with a python kernel and using the findspark package to locate our spark install so that we can create the spark session.

Install the findspark package.

```
$ conda install findspark
```

Launch jupyter as usual.

```
$ jupyter notebook
```

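After firing up a fresh notebook, there are a couple of things we need to do to create a new spark session.
First, we import findspark and run its `init` function. This locates your spark install using `SPARK_HOME` and adds its python libraries to `sys.path`, so that `import pyspark` works from the notebook.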

```
import findspark

findspark.init()
```

If you get errors, re-read the spark installation instructions above to make sure you correctly set the `SPARK_HOME` environment variable.
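
If it isn't obvious what's wrong, a small diagnostic along these lines can help narrow it down from inside the notebook. This is only a sketch: `check_spark_home` is a hypothetical helper, and the two paths it looks for (`bin/spark-submit` and `python`) are an assumption about what a standard spark distribution directory contains.

```python
import os
from pathlib import Path


def check_spark_home(spark_home=None):
    """Return a list of problems found with SPARK_HOME (empty if it looks fine)."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set in this process's environment"]
    root = Path(spark_home)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumed layout: launcher scripts live in bin/ and the python
    # bindings live in python/.
    for relative in ("bin/spark-submit", "python"):
        if not (root / relative).exists():
            problems.append(f"{root / relative} is missing")
    return problems


print(check_spark_home() or "SPARK_HOME looks like a spark install")
```

Keep in mind that the notebook only sees the environment it was launched with, so if you exported `SPARK_HOME` after starting jupyter, restart jupyter from a shell where the variable is set.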

Next, import pyspark and instantiate a spark session.

```
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('My Spark App').getOrCreate()
```
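
As a quick check that the session really works, something like the following could go in the next cell. It's a sketch rather than a recipe: `make_local_session` and `smoke_test` are hypothetical helper names, and the explicit `local[*]` master is an assumption that mirrors what the pyspark shell reported earlier.

```python
def make_local_session(app_name, master="local[*]"):
    """Return a SparkSession for local experiments, creating it if needed."""
    # Imported here so the function can be defined before findspark.init() runs.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()


def smoke_test(spark):
    """Run a tiny job and return the labels of the rows that survive the filter."""
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
    kept = df.filter(df["id"] > 1).orderBy("id").collect()
    return [row["label"] for row in kept]
```

With a working session, `smoke_test(spark)` should return `['b', 'c']`. Calling `make_local_session('My Spark App')` when a session already exists just hands back the existing one, since `getOrCreate()` reuses it.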

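Spark provides a handy web UI that you can use for monitoring and debugging.
While the session is running, you can open it in your web browser at [http://localhost:4040/jobs/](http://localhost:4040/jobs/).
If port 4040 is already taken (for example, by a pyspark shell you left running), spark moves on to the next free port, which would explain the 4041 in the shell output above. You can check `spark.sparkContext.uiWebUrl` to see the address your session is actually using.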

[This tutorial](https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning#install) walks us through a simple ML application.