# How to Get PySpark Running in Your Local Jupyter Notebook

## How to Install Spark

Install Scala.

```
$ brew install scala
```

Install Spark.

```
$ brew install apache-spark
```

Check where Spark is installed.

```
$ brew info apache-spark
apache-spark: stable 3.1.1, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/3.1.1 (1,361 files, 242.6MB) *
...
```

Set the `SPARK_HOME` environment variable to the path returned by `brew info`, with `/libexec` appended.
Add the same export to your `.zshrc` file so it persists across shell sessions.

```
$ export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.1/libexec
```

Test the installation by starting the Spark shell.

```
$ spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 14.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
```



## How to Install PySpark

Use conda to install the `pyspark` and `findspark` Python packages.

```
$ conda install pyspark
$ conda install findspark
```



## Run PySpark in a Jupyter Notebook

There are a few ways to run PySpark in Jupyter, which are described in [this DataCamp tutorial](https://www.datacamp.com/community/tutorials/apache-spark-python).
The method we'll use is to run a standard Jupyter notebook session with a Python kernel and use the `findspark` package to make Spark available to it.

Launch Jupyter as usual.

``` 
$ jupyter notebook
```

Once you land in the notebook, import `findspark` and run its `init` function.

```
import findspark

findspark.init()
```
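
For context, `findspark.init()` roughly works by locating the Spark home directory and putting Spark's bundled Python sources and its py4j zip on `sys.path`, so that `import pyspark` succeeds. The following is a simplified sketch of that idea, not findspark's actual code. `spark_python_paths` is a hypothetical name, and the `python/lib/py4j-*.zip` layout is an assumption about a standard Spark distribution.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the sys.path entries PySpark needs under a given Spark home."""
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j (the Python-to-JVM bridge) as a versioned zip file.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    if not py4j_zips:
        raise FileNotFoundError(
            f"no py4j zip under {spark_python}/lib; check SPARK_HOME"
        )
    return [spark_python, py4j_zips[0]]
```

In a notebook, the equivalent of `findspark.init()` would then be along the lines of `sys.path[:0] = spark_python_paths(os.environ["SPARK_HOME"])`. If `SPARK_HOME` is not set in the environment Jupyter was launched from, passing the path directly, as in `findspark.init("/usr/local/Cellar/apache-spark/3.1.1/libexec")`, should have the same effect.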

Then import `pyspark` and create the Spark context.

```
import pyspark

sc = pyspark.SparkContext.getOrCreate()
```

You can open the Spark UI in your web browser at [http://localhost:4040/jobs/](http://localhost:4040/jobs/).

[This tutorial](https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning#install) walks through a simple ML application. As a first check that the context works, create a small RDD.

```
rdd1 = sc.parallelize([('a', 1), ('b', 2), ('c', 3)])
rdd1
```

```
ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274
```