
Make it easier to run the example.

- Make use of the new jobtracker-ip function to get the IP address of the jobtracker.
- Create a script on each node that downloads the data needed to run the example.
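
A rough sketch of the workflow this enables, pieced together from the README and `core.clj` changes below (the `ec2-service` binding name and the alternate cluster sizing are only illustrative):

    user=> (use 'pallet-hadoop-example.core)
    user=> (bootstrap)
    user=> (def ec2-service (compute-service-from-config-file :aws))
    user=> (create-cluster example-cluster ec2-service)
    ;; or pick your own size, e.g. 4 slaves with 8 GB of RAM per node:
    ;; user=> (create-cluster (make-example-cluster 4 (* 8 1024)) ec2-service)
    user=> (jobtracker-ip ec2-service)

The last call prints the IP address of the jobtracker node, which the README previously had you dig out of the EC2 console by hand.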
1 parent f0118ed commit f56097cb19dcca53a9ec2fd90b3982055537917c, tbatchelli committed Jul 22, 2011
Showing with 58 additions and 28 deletions.
  1. +17 −19 README.md
  2. +2 −2 project.clj
  3. +39 −7 src/pallet_hadoop_example/core.clj
36 README.md
@@ -11,10 +11,15 @@ I'm going to assume that you have some basic knowledge of clojure, and know how
$ git clone git://github.com/pallet/pallet-hadoop-example.git
$ cd pallet-hadoop-example
-Open up `./src/pallet-hadoop-example/core.clj` with your favorite text editor. `example-cluster` contains a data description of a full hadoop cluster with:
+Open up `./src/pallet_hadoop_example/core.clj` with your favorite text
+editor. `make-example-cluster` is a function that builds a data description of a full Hadoop cluster with:
* One master node functioning as jobtracker and namenode
-* Two slave nodes (`(slave-group 2)`), each acting as datanode and tasktracker.
+* A number of slave nodes (`slave-count`), each acting as datanode and
+ tasktracker.
+
+For convenience, we have created an `example-cluster` that
+defines a cluster with 2 slaves and the jobtracker/namenode node.
Start a repl:
@@ -40,7 +45,7 @@ Alternatively, if you want to keep these out of your code base, save the followi
and define `ec2-service` with:
- user=> (def ec2-service (service :aws))
+ user=> (def ec2-service (compute-service-from-config-file :aws))
#'user/ec2-service
### Booting the Cluster ###
@@ -57,13 +62,11 @@ Once `create-cluster` returns, we're done! We now have a fully configured, multi
To test our new cluster, we're going to log in and run a word-counting MapReduce job on a number of books from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page).
-Point your browser to the [EC2 Console](https://console.aws.amazon.com/ec2/), log in, and click "Instances" on the left.
+At the REPL, type
-You should see three nodes running; click on the node whose security group contains "jobtracker", and scroll the lower pane down to retrieve the public DNS address for the node. It'll look something like
+ user=> (jobtracker-ip ec2-service)
- ec2-50-17-103-174.compute-1.amazonaws.com
-
-I'll refer to this address as `jobtracker.com`.
+This will print out the IP address of the jobtracker node. I'll refer to this address as `jobtracker.com`.
Point your browser to `jobtracker.com:50030`, and you'll see the JobTracker web console. (Keep this open, as it will allow us to watch our MapReduce job in action.) `jobtracker.com:50070` points to the NameNode console, with information about HDFS.
@@ -86,17 +89,12 @@ Our first step will be to collect a bunch of text to process. We start by downlo
* [The Devil’s Dictionary by Ambrose Bierce](http://www.gutenberg.org/cache/epub/972/pg972.txt)
* [Encyclopaedia Britannica, 11th Edition, Volume 4, Part 3](http://www.gutenberg.org/cache/epub/19699/pg19699.txt)
-Running the following commands at the remote shell should do the trick.
-
- $ mkdir /tmp/books
- $ cd /tmp/books
- $ curl -O https://hadoopbooks.s3.amazonaws.com/pg20417.txt
- $ curl -O https://hadoopbooks.s3.amazonaws.com/pg5000.txt
- $ curl -O https://hadoopbooks.s3.amazonaws.com/pg4300.txt
- $ curl -O https://hadoopbooks.s3.amazonaws.com/pg132.txt
- $ curl -O https://hadoopbooks.s3.amazonaws.com/pg1661.txt
- $ curl -O https://hadoopbooks.s3.amazonaws.com/pg972.txt
- $ curl -O https://hadoopbooks.s3.amazonaws.com/pg19699.txt
+For convenience, pallet-hadoop-example places a script on each node that will
+download all of these files for you. Running the following commands at the
+remote shell should do the trick. The books will be downloaded into `/tmp/books`:
+
+ $ cd /tmp
+ $ ./download-books.sh
Next, navigate to the Hadoop directory:
4 project.clj
@@ -1,10 +1,10 @@
-(defproject pallet-hadoop-example "0.0.1"
+(defproject pallet-hadoop-example "0.0.2"
:description "Example project for running Hadoop on Pallet."
:repositories {"sonatype"
"http://oss.sonatype.org/content/repositories/releases/"}
:dependencies [[org.clojure/clojure "1.2.0"]
[org.clojure/clojure-contrib "1.2.0"]]
- :dev-dependencies [[pallet-hadoop "0.1.0"]
+ :dev-dependencies [[pallet-hadoop "0.3.1"]
[org.jclouds/jclouds-all "1.0-beta-9c"]
[org.jclouds.driver/jclouds-jsch "1.0-beta-9c"]
[org.jclouds.driver/jclouds-log4j "1.0-beta-9c"]
46 src/pallet_hadoop_example/core.clj
@@ -1,12 +1,17 @@
(ns pallet-hadoop-example.core
(:use pallet-hadoop.node
[pallet.crate.hadoop :only (hadoop-user)]
- [pallet.extensions :only (def-phase-fn)])
+ [pallet.extensions :only (def-phase-fn)]
+ [pallet.phase :only (phase-fn)]
+ [pallet.stevedore :only (script)]
+ [pallet.script.lib :only (download-file mkdir)])
(:require [pallet.core :as core]
- [pallet.resource.directory :as d]))
+ [pallet.resource.directory :as d]
+ [pallet.resource.remote-file :as rf]))
(defn bootstrap []
- (use 'pallet.compute))
+ (use 'pallet.compute)
+ (use '[pallet-hadoop.node :only (jobtracker-ip)]))
(def remote-env
{:algorithms {:lift-fn pallet.core/parallel-lift
@@ -21,13 +26,37 @@
:group hadoop-user
:mode "0755"))
+(def-phase-fn download-data
+ []
+ (d/directory "/tmp/books"
+ :owner hadoop-user
+ :group hadoop-user
+ :mode "0755")
+ (rf/remote-file "/tmp/download-books.sh"
+ :content
+ (script
+ (~mkdir "/tmp/books")
+ (cd "/tmp/books")
+ (doseq [f ["pg20417.txt" "pg5000.txt" "pg4300.txt"
+ "pg132.txt" "pg1661.txt" "pg972.txt" "pg19699.txt"]]
+ (println @f)
+ (~download-file
+ (str "https://hadoopbooks.s3.amazonaws.com/" @f)
+ (str "/tmp/books/" @f) )))
+ :owner hadoop-user
+ :group hadoop-user
+ :mode "0755"
+ :literal true))
+
(defn create-cluster
[cluster compute-service]
(do (boot-cluster cluster
:compute compute-service
:environment remote-env)
(lift-cluster cluster
- :phase authorize-mnt
+ :phase (phase-fn
+ authorize-mnt
+ download-data)
:compute compute-service
:environment remote-env)
(start-cluster cluster
@@ -40,14 +69,15 @@
:compute compute-service
:environment remote-env))
-(def example-cluster
+(defn make-example-cluster
+ [slave-count ram-size-in-mb]
(cluster-spec :private
{:jobtracker (node-group [:jobtracker :namenode])
- :slaves (slave-group 2)}
+ :slaves (slave-group slave-count)}
:base-machine-spec {:os-family :ubuntu
:os-version-matches "10.10"
:os-64-bit true
- :min-ram (* 4 1024)}
+ :min-ram ram-size-in-mb}
:base-props {:hdfs-site {:dfs.data.dir "/mnt/dfs/data"
:dfs.name.dir "/mnt/dfs/name"}
:mapred-site {:mapred.local.dir "/mnt/hadoop/mapred/local"
@@ -57,6 +87,8 @@
:mapred.tasktracker.reduce.tasks.maximum 3
:mapred.child.java.opts "-Xms1024m"}}))
+(def example-cluster (make-example-cluster 2 (* 4 1024)))
+
(comment
(use 'pallet-hadoop-example.core)
(bootstrap)
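
For reference, the `download-data` phase above writes `/tmp/download-books.sh` onto each node. Assuming pallet's `download-file` script function expands to a plain curl/wget fetch, the generated script behaves roughly like the hand-written sketch below (not the literal stevedore output):

    #!/usr/bin/env bash
    # Fetch the Project Gutenberg texts used by the word-count example.
    mkdir -p /tmp/books
    cd /tmp/books
    for f in pg20417.txt pg5000.txt pg4300.txt pg132.txt pg1661.txt pg972.txt pg19699.txt; do
      echo "$f"
      curl -o "/tmp/books/$f" "https://hadoopbooks.s3.amazonaws.com/$f"
    done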
