<h1>Hadoop Workshop Student Guide</h1>
<p>This is a guide to the exercises Matthew executes during his Hadoop training sessions. Once Hadoop is set up, most of the commands can be run in any order. The only overarching requirement is that the Shakespeare files have been processed by MapReduce first, since that output is used as input to the Pig and Hive exercises.</p>
<h2>Install Hadoop and Subprojects</h2>
<p>Download core binaries:</p>
<pre><code>http://hadoop.apache.org/common/releases.html
http://www.apache.org/dyn/closer.cgi/hadoop/core/
http://mirror.cc.columbia.edu/pub/software/apache/hadoop/core/hadoop-0.20.2/
</code></pre>
<p>Download ancillary projects (Pig, Hive, HBase, Zookeeper, Avro, Chukwa):</p>
<pre><code>http://mirror.cc.columbia.edu/pub/software/apache/hadoop/
</code></pre>
<h3>Cloudera's VMWare Image</h3>
<pre><code>http://www.cloudera.com/developers/downloads/virtual-machine/
http://cloudera-vm.s3.amazonaws.com/cloudera-training-0.3.3.tar.bz2
</code></pre>
<h3>VMWare Player, Fusion</h3>
<pre><code>http://www.vmware.com/download/player/
</code></pre>
<h2>Go to the Hadoop home directory</h2>
<pre><code>cd $HADOOP_HOME
</code></pre>
<h2>Examine input files</h2>
<pre><code>cd /home/training/git/data/input
tail -1000 all-shakespeare
</code></pre>
<h2>Tail Hadoop Logs</h2>
<p>Open a separate terminal window:</p>
<pre><code>cd /usr/lib/hadoop-0.20/logs
tail -f *
</code></pre>
<h2>Format the HDFS File System</h2>
<p>Note: formatting erases any existing HDFS data, so only do this on a fresh installation.</p>
<pre><code>hadoop namenode -format
</code></pre>
<h2>HDFS</h2>
<h3>Put a file</h3>
<pre><code>hadoop fs -put ~/cloudera/cloudera_black.png .
hadoop fs -lsr .
hadoop fsck /
hadoop fsck / -racks
</code></pre>
<h3>Reset the processing (delete old files)</h3>
<pre><code>hadoop fs -rmr shakes*
</code></pre>
<h2>Map Reduce</h2>
<h3>List all example tools</h3>
<pre><code>hadoop jar $HADOOP_HOME/hadoop*examples.jar
</code></pre>
<h3>Calculate Pi</h3>
<pre><code>hadoop jar $HADOOP_HOME/hadoop-*-examples.jar pi 6 5000
</code></pre>
<p>Look at the source code for Pi</p>
<pre><code>http://code.google.com/edu/parallel/tools/hadoopvm/index.html
</code></pre>
<h3>Word Count</h3>
<pre><code>hadoop jar $HADOOP_HOME/hadoop*examples.jar wordcount thriller.txt wordcountoutput
</code></pre>
<h3>Parse all of Shakespeare's works</h3>
<p>Grep yields unsorted output</p>
<pre><code>hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep file:/home/training/git/data/input shakespeare_freq '\w+'
hadoop jar $HADOOP_HOME/hadoop*examples.jar grep thriller.txt grepoutputdir '\w+'
</code></pre>
<h3>Examine the file</h3>
<pre><code>hadoop fs -lsr shakespeare_freq/*
hadoop fs -cat shakespeare_freq/part-00000 | more
</code></pre>
<h3>Examine the job logs</h3>
<pre><code>hadoop fs -cat shakespeare_freq/_logs/history/* | more
</code></pre>
<h2>Streaming</h2>
<pre><code>cd /home/training/git/data
cat thriller.txt
cut -f 2 -d , thriller.txt
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar -input thriller.txt -output outputfromstreaming -mapper 'cut -f 2 -d ,' -reducer 'uniq'
</code></pre>
<p>This fails if we haven't uploaded the file to HDFS. We can upload it, or we can address it via a file:// URL. This is because our default file system (configured in core-site.xml) is HDFS, so bare paths resolve against HDFS rather than the local disk.</p>
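<p>For example, either upload the file first or point streaming at the local copy (a sketch; the paths follow the earlier examples, and a fresh output directory name avoids clashing with the earlier attempt):</p>
<pre><code>hadoop fs -put thriller.txt .
# or, skipping the upload, read the local file directly:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar -input file:/home/training/git/data/thriller.txt -output outputfromstreaming2 -mapper 'cut -f 2 -d ,' -reducer 'uniq'
</code></pre>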
<h3>Sort reverse</h3>
<pre><code>sort -r &lt;SOMEFILE&gt;
</code></pre>
<p>Work this into a streaming job</p>
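<p>One possible answer (a sketch: cat acts as an identity mapper, and a single reduce task is requested so that sort -r sees all of the data):</p>
<pre><code>hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar -D mapred.reduce.tasks=1 -input thriller.txt -output reversesorted -mapper cat -reducer 'sort -r'
</code></pre>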
<h2>Pig</h2>
<p>Get a Pig prompt:</p>
<pre><code>pig
</code></pre>
<p>At the grunt&gt; prompt:</p>
<pre><code>A = load 'shakespeare_freq/part-00000' using PigStorage('\t') AS (rawcount:int, rawword: chararray);
explain A;
B = foreach A generate $1 as word, $0 as count;
C = ORDER B BY count ASC;
illustrate C;
dump C;
store C into 'shakespeare_words.output';
Y = LIMIT C 5;
dump Y;
quit;
</code></pre>
<p>Back at the shell, examine the stored output:</p>
<pre><code>hadoop fs -cat shakespeare_words.output/part-00000
</code></pre>
<h3>Pig Again</h3>
<p>A second script, this one with a join. Note that the relation B must be loaded before it can be joined; 'myotherfile' here is a placeholder name:</p>
<pre><code>A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = filter A by t == 1;
D = join C by t, B by x;
E = group D by u;
F = foreach E generate group, COUNT($1);
dump F;
</code></pre>
<h2>HBase</h2>
<p>On a Mac, start HBase and open its shell:</p>
<pre><code>start-hbase.sh
hbase shell
</code></pre>
<p>Create a 'blogposts' table with two column families, 'post' and 'image':</p>
<pre><code>create 'blogposts', 'post', 'image'
</code></pre>
<h3>Check the metadata created</h3>
<p>If HBase is backed by HDFS, its files appear under /hbase:</p>
<pre><code>hadoop fs -lsr /
</code></pre>
<h3>Put some data into HBase</h3>
<pre><code>put 'blogposts', 'post1', 'post:title', 'Hello World'
put 'blogposts', 'post1', 'post:author', 'The Author'
put 'blogposts', 'post1', 'post:body', 'This is a blog post'
put 'blogposts', 'post1', 'image:header', 'image1.jpg'
put 'blogposts', 'post1', 'image:bodyimage', 'image2.jpg'
get 'blogposts', 'post1'
</code></pre>
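<p>To see every row rather than a single one, scan the table:</p>
<pre><code>scan 'blogposts'
</code></pre>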
<h2>Hive</h2>
<h3>Clean up logs prior to running Hive</h3>
<p>Remove the MapReduce job logs first; otherwise Hive will try to load them into the table along with the data:</p>
<pre><code>hadoop fs -rmr shakespeare_freq/_logs
</code></pre>
<p>Then start the Hive shell:</p>
<pre><code>hive
</code></pre>
<p>At the hive&gt; prompt:</p>
<pre><code>SHOW TABLES;
DROP TABLE shakespeare;
CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
DESCRIBE shakespeare;
LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;
SELECT * FROM shakespeare LIMIT 10;
SELECT * FROM shakespeare WHERE freq &gt; 100 SORT BY freq ASC LIMIT 10;
exit;
</code></pre>
<p>Note that LOAD DATA INPATH moves the shakespeare_freq files into Hive's warehouse directory rather than copying them.</p>
<h2>Configuring Hadoop</h2>
<pre><code>cd /usr/lib/hadoop-0.20/conf
</code></pre>
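<p>The files of interest here are core-site.xml, hdfs-site.xml, and mapred-site.xml. For example, the default file system mentioned in the streaming exercise is set in core-site.xml (a sketch; the host and port depend on your installation):</p>
<pre><code>&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;fs.default.name&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:8020&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;
</code></pre>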
<h2>Managing Hadoop</h2>
<pre><code>cd /usr/lib/hadoop-0.20/bin
</code></pre>
<p>or</p>
<pre><code>cd $HADOOP_HOME/bin
sudo -u hadoop start-all.sh
sudo -u hadoop stop-all.sh
</code></pre>
<h2>Rebalancing</h2>
<pre><code>su - hadoop /usr/lib/hadoop/bin/start-balancer.sh
</code></pre>
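<p>The balancer accepts an optional threshold, the percentage of disk-usage deviation to tolerate before moving blocks (the default is 10):</p>
<pre><code>su - hadoop -c '/usr/lib/hadoop/bin/start-balancer.sh -threshold 5'
</code></pre>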