# Spark Setup and Tutorial

We will be using Docker again today, this time to explore Spark. You'll want to start the AWS EC2 instance that you were using yesterday. (Or, alternatively, recreate it if you terminated the instance.)

Note that the IP address will usually change when an instance is restarted on AWS.

## Sharing directories with Docker

Yesterday, we did all our work *inside* the Docker container itself. This is often not ideal: a Docker image is meant to be reused across many different applications, and files created inside a container are lost when that container is removed.

Instead, data and analysis files are often placed in a directory that is shared between the host machine (the AWS instance, in this case) and the Docker container.

We will create a directory on the AWS instance called `notebooks` that resides in the home directory of the default user (`ubuntu`). This directory will be shared with the Spark Docker container.

To create the directory:

```bash
mkdir -p ~/notebooks
```

## Running the Spark Docker image

Like yesterday, we will have to pull and run the new Docker image:

```bash
docker pull mlgill/metis-spark-python:latest

docker run -d -p 8888:8888 -v /home/ubuntu/notebooks:/home/ubuntu/notebooks \
                              mlgill/metis-spark-python:latest
```

The run command looks a bit different from the one we used yesterday. This command does multiple things:

1. Runs the image called `mlgill/metis-spark-python:latest` as a container in the background (`-d`).
2. Makes port 8888 of the container available on the AWS instance at the same port number (`-p 8888:8888`). Recall that this is the port Jupyter notebook runs on by default. This will allow us to interact with the notebook.
3. Shares the directory we created (called `notebooks`) with the container, where it appears at the same path (`-v`).
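As a rough illustration of how the pieces of the run command fit together, the sketch below assembles the same argument list in Python. The helper name `build_docker_run` and its parameters are hypothetical and exist only for this example; the flags themselves are the ones used in the command above.

```python
import shlex


def build_docker_run(image, host_port, container_port, host_dir, container_dir):
    """Assemble a `docker run` argument list (hypothetical helper for illustration)."""
    return [
        "docker", "run",
        "-d",                                   # run the container in the background
        "-p", f"{host_port}:{container_port}",  # publish the container port on the host
        "-v", f"{host_dir}:{container_dir}",    # share a host directory with the container
        image,                                  # image to start the container from
    ]


cmd = build_docker_run(
    image="mlgill/metis-spark-python:latest",
    host_port=8888,
    container_port=8888,
    host_dir="/home/ubuntu/notebooks",
    container_dir="/home/ubuntu/notebooks",
)

# Print the command as a single shell-ready string.
print(shlex.join(cmd))
```

In both `-p` and `-v`, the value on the left of the colon belongs to the host (the AWS instance) and the value on the right belongs to the container.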

As we did yesterday, verify the container is running with the following command:

```bash
docker ps -a
```

The output should look something like this:

```bash
CONTAINER ID IMAGE              COMMAND                CREATED       STATUS       PORTS   
41f84d1bf747 metis-spark-python "/bin/bash /home/ubun" 2 seconds ago Up 2 seconds 0.0.0.0:8888->8888/tcp
```

This indicates our container is running! Note the container ID, which will be needed in a moment.

## PySpark Tutorial

Before we start working with Spark in Jupyter, let's see how the interactive shell (PySpark) works.

Run the following command from the AWS instance to enter PySpark. Be sure to substitute `_container_id_` with the real container ID from above.

```bash
docker exec -it _container_id_ pyspark
```

This command tells Docker to open an interactive terminal (`-it`) inside the running container and run the final command there, which is `pyspark` in this case.

Once PySpark is launched, the output should resemble the following: 

```bash
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
SparkSession available as 'spark'.
>>>
```

### Spark context

Confirm that the Spark context is available by typing `sc`:

```bash
>>> sc
<pyspark.context.SparkContext object at 0x107083e50>
```

### More code to test

```bash
>>> input = [1, 2, 3, 4, 5]
>>> input
[1, 2, 3, 4, 5]
>>> type(input)
<class 'list'>
>>> 
```

### Exit PySpark

To exit interactive PySpark and return to the AWS instance, type `exit()`:

```bash
>>> exit()
```

## Connect to Jupyter on Docker

What we are really interested in is connecting to the Jupyter instance running on Docker and using that instance to analyze some data.

First, `scp` the notebook [Spark_Supervised_Machine_Learning.ipynb](Spark_Supervised_Machine_Learning.ipynb) into the notebooks directory (`/home/ubuntu/notebooks`) on AWS. Recall that this directory is shared with Docker, so we will be able to access this notebook with Jupyter running on Docker.

The `scp` command will look something like this:

```bash
scp -i ~/.ssh/aws_key_file.pem Spark_Supervised_Machine_Learning.ipynb ubuntu@XX.XXX.XXX.XXX:~/notebooks/
```

Jupyter running on Docker is accessible to the host (the AWS instance), but not currently to other computers. What we'd really like is to access Jupyter from our laptop. To do this, we can use an SSH tunnel, which forwards a port on our own machine to a port on a remote machine over SSH.

To set up the tunnel, open a terminal and run the following from your laptop (*not* while logged into AWS):

```bash
ssh -i ~/.ssh/aws_key_file.pem -NL 12345:localhost:8888 ubuntu@XX.XXX.XXX.XXX
```

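For the command above, substitute the path to the key file and the IP address of your instance. The `-N` flag tells SSH not to run a remote command, and `-L 12345:localhost:8888` sets up the port forwarding.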

When you press return, the command will remain in the foreground, so a prompt will not reappear in the window. **DO NOT** close this window.

Then open http://localhost:12345 in the browser on your laptop. Voila! You should see the supervised machine learning notebook.

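What's going on here? Recall that Jupyter is running on port 8888 (the default) on AWS. The tunnel command forwards port 12345 on our laptop to that port. Note that the `localhost` in `12345:localhost:8888` is resolved on the AWS instance, so it refers to the instance itself, while the `localhost` in the browser URL refers to our laptop.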
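To make the forwarding argument easier to read, here is a small sketch that splits `12345:localhost:8888` into its three fields. The `Forward` type and `parse_forward_spec` function are hypothetical names for this example, and the sketch only handles the three-part form used in this tutorial (SSH also accepts an optional bind address in front).

```python
from typing import NamedTuple


class Forward(NamedTuple):
    local_port: int    # port opened on the laptop
    remote_host: str   # host name as resolved by the remote machine
    remote_port: int   # port on that host to connect to


def parse_forward_spec(spec: str) -> Forward:
    """Split a three-part `-L` argument of the form local_port:host:remote_port."""
    local_port, remote_host, remote_port = spec.split(":")
    return Forward(int(local_port), remote_host, int(remote_port))


fwd = parse_forward_spec("12345:localhost:8888")
print(f"laptop port {fwd.local_port} -> {fwd.remote_host}:{fwd.remote_port} (as seen from the instance)")
```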

Could we have used a local port other than 12345? Definitely. However, it's best to avoid 8888, 8889, etc., since these ports may be occupied by instances of Jupyter running locally on your laptop.
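If you are unsure whether a local port is already taken, one way to check is to try binding to it. The sketch below is a minimal example using Python's standard `socket` module; the function names `port_is_free` and `first_free_port` and the candidate port list are assumptions made for illustration.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Binding fails when another process already holds the port.
            return False
        return True


def first_free_port(candidates):
    """Return the first candidate port that is free, or None if all are taken."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None


# Example candidates for the local end of the tunnel.
print(first_free_port([12345, 12346, 12347]))
```

Whichever port this reports can be used in place of 12345 in both the `ssh` command and the browser URL.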

## Reconnecting to Jupyter

If you close your laptop or lose internet connectivity, the SSH tunnel, and thus the connection to Jupyter, can be disconnected. Never fear, though: Jupyter is still running on the instance and continues any calculations that were started before the disconnect. This is one of the cool things about running on AWS or other cloud instances: our computations keep running even when we are not working!

It's also easy to reconnect to the existing Jupyter instance from your laptop. First, you will probably need to restart the SSH tunnel. Trying to open the tunnel a second time while the first is still running produces an error, because local port 12345 is already in use. For this reason, it's best to go to the terminal window that was used to start the tunnel, press `Ctrl-C` to stop the tunnel if it's still running, and then restart it.
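To see whether the local end of the tunnel is still listening, you can attempt a TCP connection to it. The following is a minimal sketch using Python's standard `socket` module; `port_is_open` is a hypothetical helper name. It only checks that something accepts connections on the port, not that Jupyter is actually reachable through the tunnel.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening on that port.
        return False


# 12345 is the local port used for the tunnel in this tutorial.
print(port_is_open("localhost", 12345))
```

If this prints `False`, the tunnel is not running and can be restarted without first stopping anything.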

Finally, open (or refresh) http://localhost:12345 in your browser.

Alternatively, if you're absolutely certain that the tunnel is still running, you can try to reconnect to the kernel from within Jupyter notebook. In the Jupyter menu, go to `Kernel --> Reconnect`. If this doesn't work, though, it's best to try to restart the tunnel and reload the notebook as suggested above.