# Notebook Installations 1: Jupyter

## Usage Notes

This notebook prepares our servers by installing the prerequisite libraries used by jobs that we run from a notebook, whether that notebook is Jupyter or Zeppelin.

## Notebook Imports

In [None]:
from aws_request import *
from aws_util import *

## Check Spot Instance Request

If you already have specific servers running, specify their host names here as a list. Otherwise, leave this as `None` and the host names will be looked up from the instance request below.

In [None]:
app_host_names = None

The instances for the application were generated by the previous notebook.

In [None]:
app_request = InstanceRequest('app')
app_instances = app_request.get_fulfilled()

if app_host_names is None and app_instances is not None:
    app_host_names = [instance['PublicDnsName'] for instance in app_instances]

assert(app_host_names is not None)
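The list comprehension above assumes every fulfilled instance already reports a public DNS name. As a minimal sketch, assuming the instance dictionaries follow the EC2 describe-instances shape (where `PublicDnsName` can be an empty string for an instance that is not yet running), a hypothetical helper such as `extract_host_names` could skip those entries. The helper name is an illustration and is not part of `aws_request` or `aws_util`.

```python
def extract_host_names(instances):
    """Return the public DNS names of the given instance dictionaries.

    Hypothetical helper: instances with a missing or empty
    'PublicDnsName' are skipped, and None is treated as an empty list.
    """
    return [
        instance['PublicDnsName']
        for instance in (instances or [])
        if instance.get('PublicDnsName')
    ]
```

Checking that the result is non-empty before continuing would catch the case where the request was fulfilled but no instance is reachable yet.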

## Specify SSH User

If you created your cluster through EMR, the user is `hadoop`. If you created your cluster as a standard EC2 instance using this notebook series, the user is either `ec2-user` or `ubuntu`.

In [None]:
#user_name = 'ubuntu'
#user_name = 'hadoop'
user_name = 'ec2-user'
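If you switch between cluster types often, the choice above could be captured in a small lookup instead of commented-out lines. This is only a sketch: the dictionary keys (`'emr'`, `'amazon-linux'`, `'ubuntu'`) are hypothetical labels, and pairing `ec2-user` with Amazon Linux is an assumption about which image the EC2 instances use.

```python
# Default SSH users described above; the keys are hypothetical labels.
DEFAULT_SSH_USERS = {
    'emr': 'hadoop',
    'amazon-linux': 'ec2-user',  # assumption: Amazon Linux image
    'ubuntu': 'ubuntu',
}


def ssh_user_for(cluster_type):
    """Look up the SSH user for a cluster type, failing loudly if unknown."""
    try:
        return DEFAULT_SSH_USERS[cluster_type]
    except KeyError:
        raise ValueError('Unknown cluster type: %r' % (cluster_type,))
```

With this in place, `user_name = ssh_user_for('amazon-linux')` would give the same value as the cell above.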

## Performance Libraries

Next, we'll want to make sure that we install utilities that make Python jobs (such as `mrjob` jobs) run faster, such as `re2` and `ujson`.

In [None]:
%%writefile scripts/faster_mrjob.sh
#!/bin/bash

if [ "" != "$(uname -a | grep Ubuntu)" ]; then
    sudo apt-get -y install build-essential git libre2-dev
else
    sudo yum -y install gcc-c++ git

    # Download and install re2

    if [ ! -d re2 ]; then
        git clone https://code.googlesource.com/re2

        pushd re2 > /dev/null

        make test
        sudo -H make install
        make testinstall

        popd > /dev/null
    fi
fi

# Download and install pyre2

if [ ! -d pyre2 ]; then
    sudo -H ldconfig

    # GitHub no longer serves the unauthenticated git:// protocol, so clone over HTTPS
    git clone https://github.com/axiak/pyre2.git

    pushd pyre2 > /dev/null

    sudo -H python setup.py install

    popd > /dev/null
fi

# Install ujson

sudo -H /usr/local/bin/pip install ujson

We'll run this script on all of the servers.

In [None]:
run_script(user_name, app_host_names, 'faster_mrjob.sh')
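Once the script has run, job code can opt in to the faster libraries while still working on machines where they are missing. The following is a minimal sketch of that fallback pattern. It assumes that `pyre2` installs a module named `re2` and that `ujson` exposes `loads`, as the standard `json` module does. The `parse_record` function and its `text` field are hypothetical.

```python
# Prefer the faster libraries when present; otherwise fall back to the
# standard library, which offers the same calls used below.
try:
    import ujson as json
except ImportError:
    import json

try:
    import re2 as re
except ImportError:
    import re


def parse_record(line):
    """Decode one JSON line and return the lowercase words in its 'text' field."""
    record = json.loads(line)
    return re.findall(r'[a-z]+', record['text'].lower())
```

Because the fallbacks share the same interface for these calls, the rest of the job does not need to know which implementation was imported.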

## Install Display Libraries

We'll want to do some plotting (whether it's through Jupyter or through Zeppelin), which requires some additional libraries.

In [None]:
%%writefile scripts/install_matplotlib.sh
#!/bin/bash

# Install numpy

if [ "" != "$(uname -a | grep Ubuntu)" ]; then
    sudo apt-get -y install build-essential
    sudo apt-get -y install libblas-dev liblapack-dev libatlas-base-dev gfortran
elif [ "hadoop" == "$USER" ]; then
    sudo yum -y install make gcc gcc-c++ kernel-devel
    sudo yum -y install lapack-devel atlas-sse3-devel
fi

sudo -H /usr/local/bin/pip install numpy

# Install matplotlib and other stuff related to it

if [ "" != "$(uname -a | grep Ubuntu)" ]; then
    sudo apt-get -y install libfreetype6-dev libpng12-dev pkg-config python-qt4
fi

sudo -H /usr/local/bin/pip install matplotlib networkx pandas scikit-learn seaborn

Because the installation runs on all servers in parallel, installing everywhere costs us no extra time, so we'll install these libraries on all servers.

In [None]:
run_script(user_name, app_host_names, 'install_matplotlib.sh')
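The `run_script` helper comes from `aws_util`, and its implementation is not shown in this notebook. As a rough sketch of why installing on every server adds no wall-clock time, the following fans one script out over a thread pool, so the total time is roughly that of the slowest host. Every name here (`build_ssh_command`, `run_on_host`, `run_everywhere`) and the `ssh ... bash -s` invocation are assumptions for illustration, not the actual `aws_util` code.

```python
from multiprocessing.pool import ThreadPool
import subprocess


def build_ssh_command(user_name, host_name):
    # Hypothetical invocation: the remote bash reads the script from stdin.
    return ['ssh', '-o', 'StrictHostKeyChecking=no',
            '%s@%s' % (user_name, host_name), 'bash', '-s']


def run_on_host(job):
    # Run the local script on one host and return (host name, exit code).
    user_name, host_name, script_path = job
    with open(script_path) as script:
        return_code = subprocess.call(
            build_ssh_command(user_name, host_name), stdin=script)
    return host_name, return_code


def run_everywhere(user_name, host_names, script_path, worker=run_on_host):
    # One thread per host, so the hosts are processed concurrently.
    pool = ThreadPool(max(1, len(host_names)))
    try:
        jobs = [(user_name, host, script_path) for host in host_names]
        return pool.map(worker, jobs)
    finally:
        pool.close()
        pool.join()
```

The `worker` parameter exists so that the fan-out can be exercised without SSH access, for example by passing a function that only echoes the host name.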

## Install Jupyter

Now we install Jupyter notebook. We'll install it (along with plotting libraries) to all servers so that we can theoretically run it on any server we want, since waiting for the installation to finish on one server takes the same amount of time as waiting for it on all servers.

Unlike local installations that might use Miniconda or Anaconda, we will install it using regular Python. This is to ensure that it uses the same libraries as any jobs that may run on this server and also to ensure that we remember to install any necessary libraries on other members of the cluster.

In [None]:
%%writefile scripts/install_jupyter.sh
#!/bin/bash
source ~/.profile

if [ "ubuntu" == "$USER" ]; then
    sudo apt-get -y install unzip
fi

# Install jupyter, findspark

sudo -H /usr/local/bin/pip install jupyter findspark

# Install test_helper

sudo -H /usr/local/bin/pip install test_helper

# Fix MathJax

if [ "" != "$CONDA_ENV_PATH" ]; then
    PYTHON_PACKAGES=$CONDA_ENV_PATH/lib/python2.7/site-packages
elif [ "" != "$(which conda)" ]; then
    PYTHON_PACKAGES=$(dirname $(dirname $(which conda)))/lib/python2.7/site-packages
elif [ -d /usr/local/lib/python2.7/dist-packages ]; then
    PYTHON_PACKAGES=/usr/local/lib/python2.7/dist-packages
elif [ -d /usr/local/lib/python2.7/site-packages ]; then
    PYTHON_PACKAGES=/usr/local/lib/python2.7/site-packages
else
    PYTHON_PACKAGES=
fi

NOTEBOOK_COMPONENTS=$PYTHON_PACKAGES/notebook/static/components

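# Skip if a MathJax-2.6* directory already exists (the glob is left unquoted so that it expands)
if [ -d "$NOTEBOOK_COMPONENTS" ] && [ -z "$(ls -d "$NOTEBOOK_COMPONENTS"/MathJax-2.6* 2> /dev/null)" ]; then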
    wget --quiet https://github.com/mathjax/MathJax/archive/v2.6-latest.zip
    unzip -qq v2.6-latest.zip
    rm v2.6-latest.zip

    NEW_MATHJAX_VERSION=MathJax-$(
        grep -o "\.fileversion=\"[^\"]*\"" MathJax-2.6-latest/MathJax.js | \
            cut -d '"' -f 2
    )

    sudo mv MathJax-2.6-latest $NOTEBOOK_COMPONENTS/$NEW_MATHJAX_VERSION

    pushd $NOTEBOOK_COMPONENTS

    OLD_MATHJAX_VERSION=MathJax-$(
        grep -o "\.fileversion=\"[^\"]*\"" MathJax/MathJax.js | \
            cut -d '"' -f 2
    )

    sudo mv MathJax $OLD_MATHJAX_VERSION
    sudo ln -s $NEW_MATHJAX_VERSION MathJax

    popd
fi

In [None]:
run_script(user_name, app_host_names, 'install_jupyter.sh')
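The MathJax step in the script names each directory after the version string found in `MathJax.js`, using `grep -o` and `cut`. The same extraction can be sketched in Python to make the logic easier to follow. The function name `mathjax_directory_name` is hypothetical. Unlike the shell pipeline, this sketch uses only the first match if the pattern appears more than once.

```python
import re

# Same pattern as the grep in the script: .fileversion="<version>"
FILEVERSION = re.compile(r'\.fileversion="([^"]*)"')


def mathjax_directory_name(mathjax_js_source):
    """Return 'MathJax-<version>' for the given MathJax.js text, or None."""
    match = FILEVERSION.search(mathjax_js_source)
    if match is None:
        return None
    return 'MathJax-' + match.group(1)
```

Returning `None` when no version is found makes it possible to stop before moving any directories, which the shell version does not check.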

## Start Jupyter Notebook

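With Jupyter and all of its dependencies installed, we can start the notebook server. The script starts a server only if nothing is already listening on port 8888, and if no Jupyter configuration exists, it writes one with an empty token, which turns off token authentication.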

In [None]:
%%writefile scripts/start_jupyter.sh
#!/bin/bash
source ~/.profile

# Start the notebook

if [ "" == "$(netstat -an | grep 8888 | grep LISTEN)" ]; then
    mkdir -p notebook

    if [ ! -f .jupyter/jupyter_notebook_config.py ]; then
        mkdir -p .jupyter
        echo "c.NotebookApp.token = u''" > .jupyter/jupyter_notebook_config.py
    fi

    nohup jupyter notebook --ip="0.0.0.0" \
        --no-browser --notebook-dir="$HOME/notebook" \
        > jupyter.out 2> jupyter.err < /dev/null &
fi

In [None]:
run_script(user_name, app_host_names, 'start_jupyter.sh')
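Since the server is started in the background with `nohup`, the script returns before Jupyter is necessarily accepting connections. A minimal sketch of a reachability check from the local notebook follows. The function name `is_port_open` and the two-second timeout are assumptions, and the check also requires that port 8888 be reachable from wherever this notebook runs.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out.
        return False
    connection.close()
    return True
```

For example, `[name for name in app_host_names if not is_port_open(name, 8888)]` would list the servers where Jupyter is not yet reachable.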

## Access Notebook GUI

In [None]:
# Parenthesized single-argument print works in both Python 2 and Python 3
print('Jupyter Servers:')

for app_host_name in app_host_names:
    print('http://' + app_host_name + ':8888/')