# Hadoop Installations 1: Preparing Hadoop Nodes

## Usage Notes

This notebook prepares the cluster nodes by installing MRJob along with any prerequisite libraries that our jobs may use.

## Notebook Imports

In [None]:
from aws_request import *
from aws_util import *
from aws_volumes import *

## Check Spot Instance Request

The instances for the application were generated by the previous notebook.

In [None]:
app_request = InstanceRequest('app')
app_instances = app_request.get_fulfilled()

app_host_names = [instance['PublicDnsName'] for instance in app_instances]
app_host_names

## Enable Swap Space

The nodes need swap space to absorb overflow during shuffle phases. This call creates an 8 GB swap partition on the first available local device of each host.

In [None]:
enable_swap('ubuntu', app_host_names, 8)
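
The `enable_swap` helper comes from the local `aws_util` module, whose source is not shown here. As a rough sketch of the kind of commands such a helper might run on each node, the hypothetical function below builds the shell commands for a swap file. The name `swap_commands`, the path `/mnt/swapfile`, and the use of a file rather than a partition are all assumptions for illustration, not the helper's actual behavior.

```python
def swap_commands(size_gb, path='/mnt/swapfile'):
    """Return shell commands that create and enable a swap file (hypothetical sketch)."""
    if size_gb <= 0:
        raise ValueError('size_gb must be positive')
    return [
        # Reserve the space; fallocate accepts sizes such as 8G
        'sudo fallocate -l {0}G {1}'.format(size_gb, path),
        # Swap files should not be readable by other users
        'sudo chmod 600 {0}'.format(path),
        # Format the file as swap, then activate it
        'sudo mkswap {0}'.format(path),
        'sudo swapon {0}'.format(path),
    ]

for command in swap_commands(8):
    print(command)
```

Building the commands as plain strings keeps the sketch easy to inspect before anything is executed on a remote host.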

## Install Prerequisites

### Install NTP

To keep the node clocks synchronized, Ambari requires the Network Time Protocol daemon (`ntp`). You can read more about it on its website.

http://www.ntp.org/

NTP does not come pre-installed on Ubuntu, so we need to install it ourselves.

In [None]:
%%writefile scripts/install_ntp.sh
#!/bin/bash

if [ ! -f /etc/init.d/ntp ]; then
    sudo apt-get -y install ntp
fi

And we will install it on all servers.

In [None]:
run_script('ubuntu', app_host_names, 'install_ntp.sh')
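
`run_script` is likewise defined in the local helper modules and its implementation is not shown. One plausible way to run a local script on a remote host is to pipe it to `bash -s` over SSH. The function name `build_ssh_command`, the example host name, and the `scripts/` directory default (taken from the `%%writefile` paths) are assumptions in this sketch, not the helper's real code.

```python
import os

def build_ssh_command(user, host, script_name, script_dir='scripts'):
    """Return (argv, script_path) for running a local script on a remote host."""
    script_path = os.path.join(script_dir, script_name)
    argv = [
        'ssh',
        # Skip the interactive host-key prompt for freshly launched instances
        '-o', 'StrictHostKeyChecking=no',
        '{0}@{1}'.format(user, host),
        # bash -s reads the script to execute from standard input
        'bash', '-s',
    ]
    return argv, script_path

argv, script_path = build_ssh_command('ubuntu', 'ec2-host.example.com', 'install_ntp.sh')
print(' '.join(argv), '<', script_path)
```

To actually execute it, the `argv` list could be passed to `subprocess.run` with the opened script file as `stdin`, once per host name.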

### Install Java

Every instance in the cluster, whether it is the Ambari master node or a slave node running jobs, requires Java, so we need a script that downloads and installs Java on Ubuntu. We could point to an S3 bucket that contains the installer, but Java can also be installed with `apt`, which is what we do here.

Based on [HADOOP-11090](https://issues.apache.org/jira/browse/HADOOP-11090), the safest version to use has been Java 7, even though Oracle has technically dropped support for it as of April 2015. Note that the script below installs Oracle Java 8 (`oracle-java8-installer`) and auto-accepts the Oracle license agreement, which you are encouraged to read on the Oracle website:

http://www.oracle.com/technetwork/java/javase/terms/license/index.html

In [None]:
%%writefile scripts/install_java.sh
#!/bin/bash

# Add webupd8team to apt sources

if [ ! -f /etc/apt/sources.list.d/webupd8team-java-trusty.list ]; then
    sudo add-apt-repository -y ppa:webupd8team/java
    sudo apt-get update
fi

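# Install Oracle Java 8 (only if no java binary is on the PATH)

if ! command -v java > /dev/null 2>&1; then
    # Pre-accept the Oracle license so the installer can run unattended
    echo debconf shared/accepted-oracle-license-v1-1 select true | \
        sudo debconf-set-selections
    echo debconf shared/accepted-oracle-license-v1-1 seen true | \
        sudo debconf-set-selections
    sudo apt-get -y install oracle-java8-installer

    # Make JAVA_HOME available in future login shells
    echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle" >> "$HOME/.profile"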
fi

And we will install it on all servers.

In [None]:
run_script('ubuntu', app_host_names, 'install_java.sh')
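
To confirm which Java ended up on a node, one option is to parse the banner that `java -version` prints (it writes to stderr). The helper below is a sketch: `parse_java_major` is a hypothetical name, and the sample banner is illustrative rather than captured from these nodes.

```python
import re

def parse_java_major(banner):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Releases before Java 9 report themselves as 1.x (1.8 means Java 8)
    if first == '1' and second is not None:
        return int(second)
    return int(first)

sample = 'java version "1.8.0_131"'
print(parse_java_major(sample))
```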

### Disable Transparent HugePages

According to Cloudera, transparent huge pages cause a performance problem for Hadoop workloads. The following post describes the issue:

https://www.ghostar.org/2015/02/transparent-huge-pages-on-hadoop-makes-me-sad/

Cloudera therefore recommends disabling transparent huge pages wherever the Linux distribution enables them. Our base AMI is Ubuntu 14.04, which is affected by this problem, so we need a script that disables them on every node.

In [None]:
%%writefile scripts/disable_thp.sh
#!/bin/bash

# Install the hugeadm tool if it is missing
if ! command -v hugeadm > /dev/null 2>&1; then
    sudo apt-get -y install hugepages
fi

# Set the flag to disable transparent huge pages
sudo hugeadm --thp-never

And we will disable it on all servers.

In [None]:
run_script('ubuntu', app_host_names, 'disable_thp.sh')
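
The kernel reports the current setting in `/sys/kernel/mm/transparent_hugepage/enabled`, with the active mode shown in square brackets (for example `always madvise [never]`). The sketch below parses that format so the result of the script can be checked; the function name `active_thp_mode` is hypothetical, and the sample string stands in for the file contents read from a node.

```python
import re

def active_thp_mode(contents):
    """Return the bracketed mode from the THP sysfs file contents, or None."""
    match = re.search(r'\[(\w+)\]', contents)
    return match.group(1) if match else None

# Sample contents as they would appear after disabling THP
print(active_thp_mode('always madvise [never]'))
```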

## Prepare MRJob

Next, we install `mrjob` on every node ahead of time, so that jobs run on the cluster can skip the bootstrapping step.

In [None]:
%%writefile scripts/install_mrjob.sh
#!/bin/bash

sudo apt-get -y install build-essential

# Install MRJob

sudo -H pip install mrjob

And we will install it on all servers.

In [None]:
run_script('ubuntu', app_host_names, 'install_mrjob.sh')
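
A quick way to confirm that the package is visible to Python on a node is to look it up without importing it. This sketch uses only the standard library and assumes Python 3.4 or later (which may not be the interpreter that `pip` targets on these nodes); `is_installed` is a hypothetical helper name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

# Prints True on a node where the install succeeded for this interpreter
print(is_installed('mrjob'))
```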