## What's this all about?

To run Spark applications on your local machine, you must have **Java 8**, **Spark**, and the **PySpark** package installed. Additionally, for the Jupyter Notebook kernel to use your locally installed **Spark**, you may need the **findspark** module. If you are unsure whether you meet any or all of these requirements, we recommend following the instructions in this notebook.

### Automatic Java Installation

Executing the following cell will install **Java 8** into its own directory (`$HOME/java/`) inside your home directory. If you would rather install Java yourself, skip the following cell, download Java, and follow the instructions for your operating system on the [Java website](https://www.java.com/en/download/).

*Note: If you have Java already installed, we recommend uninstalling it before running the following cell.  The next cell of code will delete anything located in `$HOME/java/` and add the installed Java directory to your `PATH` variable.  This installation code also assumes you use Bash as your default shell (and will modify your `$HOME/.bashrc` file).*

**OS-specific notes:**

*Linux distros: This code has been tested and works for most distributions of Linux (32 and 64 bit).*

*Mac OSX: The code to install Java may or may not work on your machine. If you experience an error, please download Java and follow the installation instructions on the [Java website](https://www.java.com/en/download/).*

*Windows OS: You must download Java and follow the installation instructions on the [Java website](https://www.java.com/en/download/).*

In [1]:
import platform as arch
from sys import platform
   
print('Beginning Java 8 installation!')

# Check which OS we are running on
if platform.startswith('linux'):
    print('Now installing Java on Linux...')
    if arch.architecture()[0] == '64bit':
        !curl -o ~/java.tar.gz -L https://javadl.oracle.com/webapps/download/AutoDL?BundleId=241526_1f5b5a70bf22433b84d0e960903adac8
    elif arch.architecture()[0] == '32bit':
        !curl -o ~/java.tar.gz -L https://javadl.oracle.com/webapps/download/AutoDL?BundleId=241524_1f5b5a70bf22433b84d0e960903adac8
    !rm -rf ~/java && mkdir ~/java && tar -xzf ~/java.tar.gz -C ~/java --strip-components=1 && rm ~/java.tar.gz

    # Define JAVA_HOME and add to PATH
    !echo 'export JAVA_HOME=$HOME/java' >> ~/.bashrc
    !echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
    !. ~/.bashrc
    
    print('Installation of Java 8 complete!')

elif platform == 'darwin':
    print('Now installing Java on Mac...')
    !curl -o ~/java.dmg -L http://javadl.oracle.com/webapps/download/AutoDL?BundleId=234465_96a7b8442fe848ef90c96a2fad6ed6d1
    !hdiutil attach ~/java.dmg
    !sudo installer -pkg /Volumes/Java\ 8\ Update\ 181/Java\ 8\ Update\ 181.app/Contents/Resources/JavaAppletPlugin.pkg -target /
    !diskutil umount /Volumes/Java\ 8\ Update\ 181 
    print('Installation of Java 8 complete (maybe)... If there was an error, please mount java.dmg located in your home directory and follow the instructions to install')

elif platform == 'win32':
    print('You are running a Windows OS.  Please download the correct version of Java from here: https://java.com/en/download/manual.jsp and install following the instructions.')

else:
    print('We had trouble determining which OS you are running.  Please ask for help.')

Beginning Java 8 installation!
Now installing Java on Linux...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   909  100   909    0     0   5443      0 --:--:-- --:--:-- --:--:--  5443
100 82.8M  100 82.8M    0     0  6720k      0  0:00:12  0:00:12 --:--:-- 6947k
Installation of Java 8 complete!
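
One caveat worth knowing: each `!` command runs in its own subshell, so `!. ~/.bashrc` does not change the environment of the running notebook kernel. The new `JAVA_HOME` and `PATH` entries only appear in shells started afterwards. The sketch below shows one way to make the install visible to the current kernel by editing `os.environ` directly. The helper name `activate_home` is hypothetical, and the path assumes the default `$HOME/java` location used above.

```python
import os


def activate_home(var_name, home_dir, env=os.environ):
    """Set var_name to home_dir and put home_dir/bin at the front of PATH.

    `activate_home` is an illustrative helper, not part of any library.
    """
    bin_dir = os.path.join(home_dir, 'bin')
    env[var_name] = home_dir

    # Prepend bin_dir only if it is not already on PATH
    entries = [p for p in env.get('PATH', '').split(os.pathsep) if p]
    if bin_dir not in entries:
        entries.insert(0, bin_dir)
    env['PATH'] = os.pathsep.join(entries)
    return env


# Assumes Java was extracted to $HOME/java by the cell above
java_home = os.path.join(os.path.expanduser('~'), 'java')
if os.path.isdir(java_home):
    activate_home('JAVA_HOME', java_home)
```

The same call with `'SPARK_HOME'` and `$HOME/spark` should cover a Spark install placed in that directory.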


### Automatic Spark Installation

Executing the following cell will install **Apache Spark** 2.4.4 (pre-built for Hadoop 2.7) into its own directory (`$HOME/spark/`) inside your home directory.

If you would rather install Spark yourself, skip the following cell and download a pre-built version of Spark from the [Spark website](https://spark.apache.org/downloads.html). You will need to extract the contents of the tarball and add Spark to your `PATH` variable.

*Note: If you have Spark already installed, we recommend uninstalling it before running the following cell.  The next cell of code will delete anything located in `$HOME/spark/` and add the installed Spark directory to your `PATH` variable.  This installation code also assumes you use Bash as your default shell (and will modify your `$HOME/.bashrc` file).*

**OS-specific notes:**

*Linux distros: This code has been tested and works for most distributions of Linux (32 and 64 bit).*

*Mac OSX: The code has been tested and should work for modern Macs.*

*Windows OS: You must download Spark and follow the installation instructions on the [Spark website](https://spark.apache.org/downloads.html).*

In [2]:
from sys import platform

print('Beginning Apache Spark installation!')

# Check which OS we are running on
if (platform.startswith('linux')) or (platform == 'darwin'):
    # Download Spark and extract into $HOME/spark/
    !curl -o ~/spark.tar.gz -L http://apache.cs.utah.edu/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
    !rm -rf ~/spark && mkdir ~/spark && tar -xzf ~/spark.tar.gz -C ~/spark --strip-components=1 && rm ~/spark.tar.gz
    
    # Define SPARK_HOME and add to PATH
    !echo 'export SPARK_HOME=$HOME/spark' >> ~/.bashrc
    !echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
    
    # Bind Spark to the loopback address (may not be necessary)
    !echo 'export SPARK_LOCAL_IP="127.0.0.1"' >> ~/.bashrc
    !. ~/.bashrc
    
    print('Installation of Apache Spark complete!')

elif platform == 'win32':
    print('You are running a Windows OS.  Please download the correct version of Spark from here: https://spark.apache.org/downloads.html and install following the instructions.')

else:
    print('We had trouble determining which OS you are running.  Please ask for help.')

Beginning Apache Spark installation!
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  219M  100  219M    0     0  1660k      0  0:02:15  0:02:15 --:--:-- 2020k
Installation of Apache Spark complete!
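
Both installation cells append their `export` lines to `~/.bashrc` with `>>` every time they run, so rerunning a cell leaves duplicate lines behind. A minimal sketch of an append that skips lines already present is shown below. The helper `append_line_once` is hypothetical, and the usage lines are commented out so that running the block does not touch your files.

```python
import os


def append_line_once(path, line):
    """Append line to the file at path unless an identical line is already there.

    Returns True if the line was written, False if it was already present.
    `append_line_once` is an illustrative helper, not part of any library.
    """
    text = ''
    if os.path.exists(path):
        with open(path) as fh:
            text = fh.read()

    if line in text.splitlines():
        return False

    with open(path, 'a') as fh:
        # Start on a fresh line if the file does not end with a newline
        if text and not text.endswith('\n'):
            fh.write('\n')
        fh.write(line + '\n')
    return True


# Example usage (assumes Bash and the default install location):
# bashrc = os.path.expanduser('~/.bashrc')
# append_line_once(bashrc, 'export SPARK_HOME=$HOME/spark')
# append_line_once(bashrc, 'export PATH=$SPARK_HOME/bin:$PATH')
```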


### Automatic PySpark Installation

Executing the following cell will install **pyspark** and **findspark**. It uses `conda` if you have Anaconda3 or Miniconda installed and falls back to `pip` otherwise, so at a minimum you must have `pip` installed. You likely have one of these already! If you do not, you can download and install [Anaconda3](https://www.anaconda.com/download/) or [pip](https://pip.pypa.io/en/stable/installing/). The cell also installs **matplotlib** so you can plot results from your assignments, and pre-builds its font cache to save time later!

*Note: These instructions should work regardless of the OS you are running!*

In [3]:
import shutil

# Method to build Matplotlib font cache
def buildFontCache():
    import matplotlib
    matplotlib.use('AGG')
    from matplotlib import pyplot as plt
    plt.plot([0],[0])
    plt.show()
    plt.clf()

# Check for conda
if shutil.which('conda'):
    # Update conda
    !conda update -n base conda --yes
    
    # Install pyspark and findspark
    !conda install pyspark --yes
    !conda install -c conda-forge findspark --yes
    
    # Install matplotlib and build font cache
    !conda install matplotlib --yes
    buildFontCache()
    
    print('Python package installation complete!')

    
# Check for pip if conda is not found
elif shutil.which('pip'):
    # Update pip
    !pip install --upgrade pip
    
    # Install pyspark and findspark
    !pip install pyspark
    !pip install findspark
    
    # Install matplotlib and build font cache
    !pip install matplotlib
    buildFontCache()
    
    print('Python package installation complete!')
    
else:
    print('Could not find conda or pip, please follow the instructions above to install either Anaconda3 or pip.')

Requirement already up-to-date: pip in /home/nigel/Documents/Courses/COSC526/spark_env/lib/python3.7/site-packages (20.0.2)
Collecting findspark
  Using cached findspark-1.3.0-py2.py3-none-any.whl (3.0 kB)
Installing collected packages: findspark
Successfully installed findspark-1.3.0
Collecting matplotlib
  Downloading matplotlib-3.1.3-cp37-cp37m-manylinux1_x86_64.whl (13.1 MB)
     |████████████████████████████████| 13.1 MB 46 kB/s  eta 0:00:01
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.1.0-cp37-cp37m-manylinux1_x86_64.whl (90 kB)
     |████████████████████████████████| 90 kB 199 kB/s eta 0:00:01
Collecting cycler>=0.10
  Downloading cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Collecting numpy>=1.11
  Downloading numpy-1.18.1-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
     |████████████████████████████████| 20.1 MB 97 kB/s  eta 0:00:01
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1
  Downloading pyparsing-2.4.6-py2.py3-none-any.whl (67 kB
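
Because `!pip` and `!conda` are resolved through the shell's `PATH`, they may install into a different Python environment than the one the notebook kernel is running in. The sketch below checks, from inside the kernel, whether the three packages can be found for import. The helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the names (top-level packages) that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages(['pyspark', 'findspark', 'matplotlib'])
if missing:
    print('Not importable from this kernel:', ', '.join(missing))
else:
    print('pyspark, findspark, and matplotlib are all importable.')
```

If a package is reported as missing even though the install cell succeeded, the kernel and the installer are probably using different environments.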



### Does it Work?

Let's test that **Java**, **Spark**, **pyspark**, and **findspark** were all installed correctly. The following cell should create a `SparkContext`, create an `RDD` from a Python `range`, and print the values in the `RDD`.

**Expected output:**
`[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`

In [4]:
import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

data = sc.parallelize(range(10))
print(data.collect())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


**If the output of the previous cell did not match the expected output or you received an error message, please ask for assistance!**
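
When asking for help, it usually speeds things up to include a few facts about your setup. The sketch below gathers them from the running kernel. The helper `environment_report` and the choice of fields are assumptions made for illustration, not a required format.

```python
import os
import platform as arch
import shutil
import sys


def environment_report():
    """Collect a few facts about the current kernel's environment."""
    return {
        'python': sys.version.split()[0],
        'os': arch.platform(),
        'java_on_path': shutil.which('java'),        # None if java is not on PATH
        'JAVA_HOME': os.environ.get('JAVA_HOME'),    # None if unset in this kernel
        'SPARK_HOME': os.environ.get('SPARK_HOME'),  # None if unset in this kernel
    }


for key, value in environment_report().items():
    print(f'{key}: {value}')
```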

### Writing Spark Code

Now that everything needed to run **Spark** code in a Jupyter Notebook is installed, remember to put the following code at the beginning of each notebook that uses **Spark**. It creates a `SparkContext` (or reuses one that already exists), which you can access via `sc`.

In [None]:
import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()