In [None]:
%%javascript
require(['base/js/utils'], function(utils) {
    utils.load_extensions('usability/ruler/main');
    utils.load_extensions('usability/toc2/main');
});

# Hadoop Installations 3: Ambari Cluster

## Usage Notes

This notebook uses Apache Ambari to install a Hadoop cluster. If you already have a Hadoop cluster from some other installation (such as the previous notebook), you should skip this notebook.

https://ambari.apache.org

We'll prepare our Ambari master server, then use its Admin GUI to perform the actual software installation. Afterwards, we'll update all of our support nodes to recognize the installation that was performed.

## Notebook Imports

In [None]:
from aws_request import *
from aws_util import *

## Check Spot Instance Request

The instances for the application were generated by the previous notebook.

In [None]:
app_request = InstanceRequest('app')
app_instances = app_request.get_fulfilled()

app_host_names = [instance['PublicDnsName'] for instance in app_instances]
app_host_names

## Identify Server Tasks

This notebook can be used both for the initial cluster creation and to expand the capacity of a cluster after initial setup. If this is an initial setup, set `is_initial_setup` to `True`. Otherwise, set it to `False`.

In [None]:
is_initial_setup = True

Given that setting, the following cell decides whether Ambari and Jupyter need to be installed and identifies all the other nodes in the cluster. Check the output to make sure you know which host names will be in your cluster.

In [None]:
ambari_host_name = None

if is_initial_setup:
    ambari_host_name = app_host_names[0]

    print 'Ambari', ambari_host_name

for host_name in app_host_names:
    if host_name != ambari_host_name:
        print 'Support', host_name
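
The role assignment above can also be written as a small function, which makes it easier to reuse when expanding a cluster. This is only a sketch: `assign_roles` and the example host names are hypothetical and not part of `aws_util`.

```python
def assign_roles(host_names, is_initial_setup):
    """Split host names into (ambari_host, support_hosts).

    On an initial setup the first host becomes the Ambari server;
    otherwise every host is treated as a support node.
    """
    ambari_host = None
    if is_initial_setup and host_names:
        ambari_host = host_names[0]
    support_hosts = [name for name in host_names if name != ambari_host]
    return ambari_host, support_hosts


# Hypothetical host names, for illustration only.
hosts = ['host-a.example.com', 'host-b.example.com', 'host-c.example.com']
ambari, support = assign_roles(hosts, True)
```

Returning both values keeps the decision in one place, so later cells do not need to repeat the comparison against the Ambari host name.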

## Install Ambari Server

Installing Ambari requires identifying the version of Ambari we wish to install. At the time of this notebook's creation, the latest version available from a public repository is 2.2.2.0 (released April 29, 2016).

In [None]:
%%writefile scripts/install_ambari.sh
#!/bin/bash

# Add Ambari release repository

AMBARI_VERSION=2.2.2.0
UBUNTU_REPO=http://public-repo-1.hortonworks.com/ambari/ubuntu14

if [ ! -f /etc/apt/sources.list.d/ambari.list ]; then
    curl -Ls $UBUNTU_REPO/2.x/updates/$AMBARI_VERSION/ambari.list | \
        sudo tee /etc/apt/sources.list.d/ambari.list

    sudo apt-key adv --recv-keys \
        --keyserver keyserver.ubuntu.com B9733A7A07513CAD

    sudo apt-get update
fi

# Install Ambari

if [ "" == "$(which ambari-server)" ]; then
    sudo apt-get install --no-install-recommends --yes ambari-server
fi

# Configure Ambari

AMBARI_JAVA_HOME=$(grep java.home /etc/ambari-server/conf/ambari.properties | cut -d'=' -f 2)

if [ "/usr/lib/jvm/java-7-oracle" != "$AMBARI_JAVA_HOME" ]; then
    sudo ambari-server setup -j /usr/lib/jvm/java-7-oracle -s
fi

if [ "" == "$(netstat -an | grep 8080)" ]; then
    sudo ambari-server start
fi

And we will install it only on the Ambari server.

In [None]:
if ambari_host_name in app_host_names:
    run_script('ubuntu', [ambari_host_name], 'install_ambari.sh')
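
The "Configure Ambari" step in the script reads `java.home` from `ambari.properties` with `grep` and `cut` and only reruns setup when the value differs. A rough Python equivalent of that lookup is sketched below; `read_property` is a hypothetical helper and the sample text is made up for illustration, not copied from a real configuration file.

```python
def read_property(properties_text, key):
    """Return the value of `key` from Java-style properties text, or None."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        name, sep, value = line.partition('=')
        if sep and name.strip() == key:
            return value.strip()
    return None


# Hypothetical excerpt of a properties file, for illustration only.
sample = (
    "# example settings\n"
    "server.os_type=ubuntu14\n"
    "java.home=/usr/lib/jvm/java-7-oracle\n"
)

# Setup would only need to run again if the configured JDK differs.
needs_setup = read_property(sample, 'java.home') != '/usr/lib/jvm/java-7-oracle'
```

Unlike the `grep` in the script, this matches the key exactly rather than any line containing the text.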

## Access Admin GUI

From here, the next step is to use the Ambari user interface to configure your cluster. It may take a few seconds before the UI is available (until then you will get connection errors). Once it's up, log in with:

* **username**: `admin`
* **password**: `admin`

In [None]:
print 'Ambari Server:'
print 'http://' + ambari_host_name + ':8080/'
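
Rather than refreshing the browser until the connection errors stop, the wait can be scripted. The following is a minimal sketch that polls the port until a TCP connection succeeds; `wait_for_port` and its default timings are assumptions, and an open port only suggests, rather than guarantees, that the UI has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection succeeds before `timeout` seconds pass.
    """
    deadline = time.time() + timeout
    while True:
        try:
            conn = socket.create_connection((host, port), timeout=interval)
            conn.close()
            return True
        except (socket.error, socket.timeout):
            if time.time() >= deadline:
                return False
            time.sleep(interval)  # wait before the next attempt
```

In this notebook it could be called as `wait_for_port(ambari_host_name, 8080)` before opening the URL printed above.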

Most of the steps are self-explanatory. The version of the Hortonworks Data Platform (HDP) you will want to install depends on the version of Ambari (usually you will want to install the latest). For Ambari 2.2.2, the latest HDP version is 2.4.

Use the private host names for your EC2 instances when asked for the host names, with `ubuntu` as the user and the same private key file you have been using for SSH authentication.

In [None]:
for instance in app_instances:
    print instance['PrivateDnsName']

Install just the bare minimum services that you need (Hadoop, YARN, ZooKeeper) and allow Ambari to select where to install its master servers. When you reach the Slaves and Clients step, make sure all servers participate as a DataNode and have a Client installed.

If you mounted extra volumes (the optional step from earlier in this notebook) and you are using an instance with local storage, they should have been added automatically on the Customize Services screen.

Click the deploy button to proceed with the changes.

## Recognize Ambari Clients

Now that the cluster has been deployed through Ambari, we'll want all the servers in the cluster to know how to access the Hadoop and Spark clients.

In [None]:
%%writefile scripts/find_clients.sh
#!/bin/bash
source ~/.profile

if [ ! -d /usr/hdp ]; then
    echo Hortonworks Data Platform has not yet been deployed.
    exit 1
fi

# Find Hadoop Streaming JAR

if [ "" == "$HADOOP_HOME" ]; then
    HADOOP_HOME=$(find /usr/hdp -maxdepth 2 -name hadoop)
    HADOOP_STREAMING_JAR=$HADOOP_HOME/hadoop-streaming.jar

    if [ ! -f "$HADOOP_STREAMING_JAR" ]; then
        HADOOP_STREAMING_JAR=$(find /usr/hdp -name hadoop-streaming.jar)

        if [ "" != "$HADOOP_STREAMING_JAR" ]; then
            sudo ln -s $HADOOP_STREAMING_JAR $HADOOP_HOME
        else
            echo Please place hadoop-streaming.jar in $HADOOP_HOME first.
            exit 1
        fi
    fi

    echo >> $HOME/.profile
    echo "# Added for MRJob" >> $HOME/.profile
    echo export HADOOP_HOME="$HADOOP_HOME" >> $HOME/.profile
fi

# Find Spark installation

if [ "" == "$SPARK_HOME" ]; then
    if [ -d "/usr/hdp/current/spark-client" ]; then
        echo >> $HOME/.profile
        echo "# Added for PySpark" >> $HOME/.profile
        echo export SPARK_HOME=/usr/hdp/current/spark-client >> $HOME/.profile
    fi
fi

And we'll do this on all the servers.

In [None]:
run_script('ubuntu', app_host_names, 'find_clients.sh')
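
The script locates `hadoop-streaming.jar` with `find /usr/hdp -name hadoop-streaming.jar`. The same search can be sketched in Python with `os.walk`, which is handy for checking a node interactively. `find_file` is a hypothetical helper, and unlike `find` it stops at the first match instead of listing every one.

```python
import os


def find_file(root, file_name):
    """Return the first path under `root` whose base name is `file_name`.

    Returns None when no such file exists.
    """
    for dir_path, dir_names, file_names in os.walk(root):
        dir_names.sort()  # visit subdirectories in a deterministic order
        if file_name in file_names:
            return os.path.join(dir_path, file_name)
    return None
```

On a deployed node, `find_file('/usr/hdp', 'hadoop-streaming.jar')` would return `None` if the JAR were missing, which corresponds to the error branch in the script.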

## Initialize HDFS

We'll want to make sure that the proper directories exist on HDFS: the home directory for the `ubuntu` user, which is the default location where an MRJob stores its data, and a world-writable `/tmp`.

In [None]:
%%writefile scripts/init_hdfs.sh
#!/bin/bash
source ~/.profile

sudo su -c "$HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/ubuntu" hdfs
sudo su -c "$HADOOP_HOME/bin/hdfs dfs -chown ubuntu:ubuntu /user/ubuntu" hdfs

sudo su -c "$HADOOP_HOME/bin/hdfs dfs -mkdir -p /tmp" hdfs
sudo su -c "$HADOOP_HOME/bin/hdfs dfs -chmod a+rwx /tmp" hdfs

And now we run the script on our designated notebook server (the first host in the list).

In [None]:
run_script('ubuntu', app_host_names[:1], 'init_hdfs.sh')
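
If more users than `ubuntu` need a home directory on HDFS, the four commands in `init_hdfs.sh` can be generated per user. The sketch below only builds the command strings and does not run them; `hdfs_init_commands` is a hypothetical helper and `/path/to/hadoop` is a placeholder for whatever `HADOOP_HOME` resolves to on your nodes.

```python
def hdfs_init_commands(hadoop_home, user):
    """Build shell commands that create /user/<user> and a world-writable /tmp."""
    hdfs = hadoop_home + '/bin/hdfs dfs'
    steps = [
        '-mkdir -p /user/' + user,
        '-chown {0}:{0} /user/{0}'.format(user),
        '-mkdir -p /tmp',
        '-chmod a+rwx /tmp',
    ]
    # Each step runs as the hdfs superuser, as in init_hdfs.sh.
    return ['sudo su -c "{0} {1}" hdfs'.format(hdfs, step) for step in steps]


# Placeholder path; substitute the real HADOOP_HOME from ~/.profile.
for command in hdfs_init_commands('/path/to/hadoop', 'ubuntu'):
    print(command)
```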