nirvanesque edited this page Sep 3, 2015 · 7 revisions

Introduction

The Apache Hadoop project provides an open-source framework for reliable, scalable, distributed computing. As such, it can be deployed and used on the Grid'5000 platform. However, configuring and managing it can be difficult, mainly because the set of nodes in a cluster changes from one Grid'5000 reservation to the next. Execo, in turn, offers a Python API to manage process execution. It is well suited to the quick and easy creation of reproducible experiments on distributed hosts.

The project presented here is called hadoop_g5k. It provides a layer built on top of Execo that makes it possible to manage Hadoop clusters and prepare reproducible Hadoop experiments on the Grid'5000 platform. It offers a set of scripts that can be used from the command line, as well as a Python interface.

It also provides the classes and scripts needed to manage Apache Spark clusters and link them to the corresponding Hadoop clusters.

Installation

Because hadoop_g5k is based on Execo, Execo must be installed first, here with easy_install. Start by enabling the web proxy:

export http_proxy="http://proxy:3128"; export https_proxy="https://proxy:3128"

Then install Execo:

easy_install --user execo

You are now ready to install hadoop_g5k with setuptools. Download the sources from the repository and unzip them:

wget https://github.com/mliroz/hadoop_g5k/archive/master.zip
unzip master.zip

Then use the setup script to install hadoop_g5k. Move to the uncompressed directory and execute:

python setup.py install --user
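
Once the installation has finished, a quick check that both packages can be imported may help catch proxy or path problems early. The following is a minimal sketch, not part of hadoop_g5k: is_installed is a hypothetical helper, and the import names execo and hadoop_g5k are assumed to match the package names. Run it with the same Python interpreter that was used for the installation.

```python
import importlib


def is_installed(module_name):
    """Return True if the module can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Assumption: the import names are the same as the package names.
    for name in ("execo", "hadoop_g5k"):
        status = "found" if is_installed(name) else "MISSING"
        print("{}: {}".format(name, status))
```

If a package is reported as missing, repeating the corresponding installation step with the proxy variables set is a reasonable first thing to try.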

The directory where the scripts are installed depends on your Python configuration (with --user on Linux, it is typically ~/.local/bin). Add this directory to your PATH to be able to call the scripts from any directory.
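
To find out which directory applies to a given interpreter, the standard library can be queried directly. The sketch below is an illustration under assumptions: it targets a POSIX system, where a --user installation places scripts in the bin subdirectory of the user base, and user_scripts_dir and on_path are hypothetical helper names.

```python
import os
import site


def user_scripts_dir():
    """Directory where a --user installation places scripts (POSIX layout)."""
    return os.path.join(site.getuserbase(), "bin")


def on_path(directory):
    """Return True if the directory is listed in the PATH variable."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    target = os.path.realpath(directory)
    return any(os.path.realpath(entry) == target for entry in entries if entry)


if __name__ == "__main__":
    scripts = user_scripts_dir()
    print("User scripts directory: " + scripts)
    if not on_path(scripts):
        # Suggest the shell command that would extend PATH for this session.
        print('Not on PATH; consider: export PATH="{}:$PATH"'.format(scripts))
```

The suggested export line only affects the current shell session; adding it to a shell start-up file makes the change persistent.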

Hadoop Cluster Management

Hadoop_g5k provides a main Python class, called HadoopCluster, which exposes several useful methods to manage a Hadoop cluster deployed on top of Grid'5000. Alternatively, a command-line script, hg5k, can be used to manage the cluster from the terminal. More information about it can be found in the hg5k wiki page.

Newer versions of Hadoop (>= 2.0) are supported through the class HadoopV2Cluster.

The documentation of the methods of HadoopCluster, HadoopV2Cluster, and all the related classes can be found on Read the Docs.

Spark Cluster Management

Hadoop_g5k also provides support for Apache Spark, both through a Python class, SparkCluster, and through a command-line script called spark_g5k. More information about the script can be found in the spark_g5k wiki page.
