# Installing PySpark

To begin, it is often good practice to create a conda environment for a project. These environments have their own installed set of packages; therefore, you can keep the 'base' environment as a 'clean' environment that has only the basics, and then create separate environments for any major projects (or types of work).

Each environment can also have its own Python version.

We will create a new conda environment called spark that uses Python 3.10. (Note: Python 3.11 was recently released, but as of March 2023 the current version of PySpark does not run on it.)

## Install Java 8

Spark requires Java 8, so we need to install the Java runtime environment first. Follow the instructions to download and install Java from https://www.java.com/en/download/

NOTE: If you are using an M1 or M2 Mac, you will need to take one of two approaches:
1. Install the Rosetta 2 emulator (https://support.apple.com/en-us/HT211861) and then install Java 8.
2. RECOMMENDED: Install Azul Zulu JDK 8 (https://www.azul.com/downloads/zulu-community/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk), which provides Java 8 for ARM Macs.
   1. Make sure you set your JAVA_HOME environment variable to the location of the JDK (e.g. /Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home)
      * Edit .zshrc and add the following lines:
        * ```export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home```
        * ```export PATH=$JAVA_HOME/bin:$PATH```
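
Once the variable is set (open a new terminal so .zshrc is re-read), a quick check from Python can confirm that JAVA_HOME points at a folder containing a Java launcher. This is a minimal sketch; the helper name `check_java_home` is hypothetical and not part of any library.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return a short status message about JAVA_HOME in the given mapping."""
    # `env` defaults to the real process environment; passing a dict makes
    # the helper easy to try out with made-up values.
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    # The launcher is bin/java on macOS and Linux, bin/java.exe on Windows.
    bin_dir = Path(java_home) / "bin"
    if (bin_dir / "java").is_file() or (bin_dir / "java.exe").is_file():
        return f"JAVA_HOME looks valid: {java_home}"
    return f"JAVA_HOME is set to {java_home}, but no java executable was found in {bin_dir}"


print(check_java_home())
```

If the message says that no java executable was found, the path in .zshrc most likely does not match the folder where the JDK was actually installed.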

For Windows or Macs with x86 processors, you can install Java 8 from https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html

Make sure to test that you have java 8 installed by running the following command in your terminal:

    java -version

If you get an error or a different version, you may need to update your environment variables.
* On Windows, you can access your environment variables by searching for 'environment variables' in the Start menu.
  * Also, see here https://docs.oracle.com/en/database/oracle/machine-learning/oml4r/1.5.1/oread/creating-and-modifying-environment-variables-on-windows.html
* On Macs with x86 processors, you can access your environment variables using the terminal.
  * https://phoenixnap.com/kb/set-environment-variable-mac
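
The same version check can be done from Python, which is handy inside a notebook. The sketch below assumes `java` is on your PATH; the function names are hypothetical. Java 8 reports itself as version 1.8, so the parser treats a leading `1.` specially.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_361").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` and return the major version, or None if unknown."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


major = installed_java_major()
print("Could not determine a Java version" if major is None else f"Java major version: {major}")
```

For this guide the printed major version should be 8.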
   

## Create a conda environment for your spark coding

```conda create --name spark python=3.10```

We then need to activate the environment. This will change the terminal prompt to show the name of the environment.

```conda activate spark```
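
To confirm that Python is really running inside the new environment, you can inspect the interpreter from Python itself. This sketch relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; the helper name `describe_environment` is hypothetical.

```python
import os
import sys


def describe_environment(env=None, version_info=None):
    """Summarize the active conda environment name and the Python version."""
    env = os.environ if env is None else env
    version_info = sys.version_info if version_info is None else version_info
    # conda sets CONDA_DEFAULT_ENV to the name of the activated environment.
    name = env.get("CONDA_DEFAULT_ENV", "(no conda environment active)")
    return f"conda env: {name}, python: {version_info[0]}.{version_info[1]}"


print(describe_environment())
print("interpreter:", sys.executable)
```

With the environment from this guide active, the summary should name `spark` and Python 3.10, and the interpreter path should sit inside the environment's folder.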

## Install pyspark and findspark

Now we can install pyspark and findspark. We will also install JupyterLab, which is a great tool for running and sharing code.

```conda install pyspark findspark jupyterlab```

To launch a jupyter notebook, we can simply type ```jupyter notebook``` in the terminal. This notebook will use the 'spark' conda environment that is currently active. 
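
Before going further, it can help to check that the packages are importable from the active environment. The following is a small sketch using the standard library's `importlib`; the helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["pyspark", "findspark", "jupyterlab"]
missing = missing_packages(required)
print("All packages found" if not missing else f"Missing: {', '.join(missing)}")
```

If anything is reported as missing, make sure the 'spark' environment is active and rerun the install command.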



### Optional: Creating a pyspark jupyter kernel

If you wish to have this environment selectable within jupyter lab, you can install the conda environment as a kernel. 

```python -m ipykernel install --user --name=spark --display-name="Python (spark)"```

* NOTE: To verify that the kernel is installed, you can run ```jupyter kernelspec list``` in the terminal. This will list all of the kernels that are installed.
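
The kernel list can also be read programmatically, since `jupyter kernelspec list --json` prints the same information as JSON. The sketch below parses that JSON and maps each kernel name to its display name. The `sample` string is a hypothetical, trimmed stand-in for real output, and `kernel_display_names` is an illustrative helper, not part of Jupyter.

```python
import json


def kernel_display_names(kernelspec_json):
    """Map kernel name -> display name from `jupyter kernelspec list --json`."""
    data = json.loads(kernelspec_json)
    return {
        name: info.get("spec", {}).get("display_name", name)
        for name, info in data.get("kernelspecs", {}).items()
    }


# Hypothetical sample output, trimmed to the fields used here.
sample = """
{"kernelspecs": {
  "python3": {"resource_dir": "/path/to/python3", "spec": {"display_name": "Python 3 (ipykernel)"}},
  "spark": {"resource_dir": "/path/to/spark", "spec": {"display_name": "Python (spark)"}}
}}
"""

names = kernel_display_names(sample)
print("spark" in names, names.get("spark"))
```

To use it on your machine, capture the real output with `subprocess.run(["jupyter", "kernelspec", "list", "--json"], capture_output=True, text=True).stdout` and pass that string in place of `sample`.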

## Install other packages

This new 'spark' environment has only the basics installed, so we will need to install the packages we need.

For now, we will install pandas, matplotlib, and sparkmagic (you may need to install more packages later, but you should know how to do this by now).

```conda install -c conda-forge pandas=1.5.3 matplotlib sparkmagic```

#### NOTE: Pandas is transitioning to PyArrow. At the time of this writing, Pandas 2.0 has been released and is the version that conda installs by default. However, PySpark does not yet seem to support Pandas with PyArrow, so we need to install Pandas 1.5.3. Look for this to change soon.
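
To confirm that the pin took effect, you can check the installed Pandas version from Python. This is a minimal sketch; the helper names are hypothetical, and it only looks at the major version number.

```python
def pandas_major(version):
    """Return the major version number from a string such as '1.5.3'."""
    return int(version.split(".")[0])


def is_compatible_pandas(version):
    """True when the version is in the 1.x series that this guide pins."""
    return pandas_major(version) == 1


try:
    import pandas as pd

    status = "OK" if is_compatible_pandas(pd.__version__) else "expected a 1.x release"
    print(pd.__version__, status)
except ImportError:
    print("pandas is not installed in this environment")
```

If a 2.x version is reported, rerunning the install command with the `pandas=1.5.3` pin should downgrade it.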

## Testing your PySpark installation

Now, let's try to create and use a Spark session. Make sure you download the BostonHousing.csv dataset (available in Canvas and in the class GitHub repos).

    import findspark
    findspark.init()

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").appName("ISM6562 Spark App01").enableHiveSupport().getOrCreate()

    # Note: if you have multiple Spark sessions running (for example, from a previous
    # notebook you've run), this session's web UI will be on a different port than the
    # default (4040). One way to identify this port is with the following line. If only
    # one Spark session is running, this will be 4040. If it's higher, other Spark
    # sessions are still running.
    spark_session_port = spark.sparkContext.uiWebUrl.split(":")[-1]
    print("Spark Session WebUI Port: " + spark_session_port)

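If the code above fails with `RuntimeError: Java gateway process exited before sending its port number`, PySpark could not start Java. This usually means that Java is not installed or that JAVA_HOME is not set correctly, so revisit the Java installation steps above.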

Click on the link "Spark UI" that is displayed after running the code above. If everything is working, you should see that your spark session is running.
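
With the session running, a natural next test is to load the BostonHousing.csv file mentioned earlier. The sketch below assumes the file is in the notebook's working directory and that its first row holds the column names; `read_housing` is a hypothetical helper that simply wraps `spark.read.csv`.

```python
def read_housing(spark, path="BostonHousing.csv"):
    """Load the CSV into a Spark DataFrame, using the first row as column names."""
    # header=True uses the first line as column names;
    # inferSchema=True asks Spark to guess column types from the data.
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage, with the `spark` session created earlier:
# df = read_housing(spark)
# df.printSchema()
# df.show(5)
```

If the schema and the first rows print without errors, Spark, Java, and the Python packages are all working together.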

*NOTE: Keep in mind that when you're running a notebook, it's not just the code in the notebook that's running. Until you hit the shutdown or restart kernel button, the Spark session is still running. So if you run the above code in another notebook while that session is still active, you'll see that the port is 4041. Run it in a third notebook and it will be 4042, then 4043, and so on.*
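
The port arithmetic in this note can be sketched in a few lines. This assumes the default starting port of 4040 and gives only a rough estimate, since another program could also be occupying one of those ports. The function names and the example URL are hypothetical.

```python
DEFAULT_UI_PORT = 4040


def ui_port(ui_web_url):
    """Extract the port number from a Spark UI URL such as 'http://host:4041'."""
    return int(ui_web_url.rstrip("/").split(":")[-1])


def sessions_ahead(ui_web_url):
    """Estimate how many lower ports were already taken when this session started."""
    return max(ui_port(ui_web_url) - DEFAULT_UI_PORT, 0)


# Hypothetical URL; in a notebook you would pass spark.sparkContext.uiWebUrl.
print(sessions_ahead("http://localhost:4042"))  # prints 2
```

A result above zero is a hint that older sessions are still holding ports and may be worth stopping.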

If you need to stop the Spark session, you can restart or shut down the kernel, or run the following code:

    spark.stop()