Refactoring and correcting typos

commit ab6b4c18cf9c101d5aebecd64f253a7afed4988d 1 parent dd181b0
@matthewmccullough authored
Showing with 233 additions and 3 deletions.
  1. +230 −0 Hadoop Workshop Labs.html
  2. +3 −3 Hadoop Workshop Labs.md
230 Hadoop Workshop Labs.html
@@ -0,0 +1,230 @@
+<h1>Hadoop Workshop Student Guide</h1>
+
+<p>This is a guide to the exercises Matthew executes during his Hadoop training sessions. Once Hadoop is set up, most of the commands can be run in any order. The only overarching requirement is that the Shakespeare files have been run through MapReduce first, since their output serves as input to the later Pig and Hive exercises.</p>
+
+<h2>Install Hadoop and Subprojects</h2>
+
+<p>Download core binaries:</p>
+
+<pre><code>http://hadoop.apache.org/common/releases.html
+http://www.apache.org/dyn/closer.cgi/hadoop/core/
+http://mirror.cc.columbia.edu/pub/software/apache/hadoop/core/hadoop-0.20.2/
+</code></pre>
+
+<p>Download ancillary projects (Pig, Hive, HBase, Zookeeper, Avro, Chukwa):</p>
+
+<pre><code>http://mirror.cc.columbia.edu/pub/software/apache/hadoop/
+</code></pre>
+
+<h3>Cloudera's VMware Image</h3>
+
+<pre><code>http://www.cloudera.com/developers/downloads/virtual-machine/
+http://cloudera-vm.s3.amazonaws.com/cloudera-training-0.3.3.tar.bz2
+</code></pre>
+
+<h3>VMware Player and Fusion</h3>
+
+<pre><code>http://www.vmware.com/download/player/
+</code></pre>
+
+<h2>Go to the Hadoop home directory</h2>
+
+<pre><code>cd $HADOOP_HOME
+</code></pre>
+
+<h2>Examine input files</h2>
+
+<pre><code>cd /home/training/git/data/input
+tail -1000 all-shakespeare
+</code></pre>
+
+<h2>Tail Hadoop Logs</h2>
+
+<p>Open a separate terminal window</p>
+
+<pre><code>cd /usr/lib/hadoop-0.20/logs
+tail -f *
+</code></pre>
+
+<h2>Format the HDFS File System</h2>
+
+<pre><code>hadoop namenode -format
+</code></pre>
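+
+<p>Note: formatting the NameNode wipes any existing HDFS data and metadata, so only do this on a fresh installation.</p>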
+
+<h2>HDFS</h2>
+
+<h3>Put a file</h3>
+
+<pre><code>hadoop fs -put ~/cloudera/cloudera_black.png .
+hadoop fs -lsr .
+hadoop fsck /
+hadoop fsck / -racks
+</code></pre>
+
+<h3>Reset the processing (delete old files)</h3>
+
+<pre><code>hadoop fs -rmr shakes*
+</code></pre>
+
+<h2>Map Reduce</h2>
+
+<h3>List all example tools</h3>
+
+<pre><code>hadoop jar $HADOOP_HOME/hadoop*examples.jar
+</code></pre>
+
+<h3>Calculate Pi</h3>
+
+<pre><code>hadoop jar $HADOOP_HOME/hadoop-*-examples.jar pi 6 5000
+</code></pre>
+
+<p>Look at the source code for Pi</p>
+
+<pre><code>http://code.google.com/edu/parallel/tools/hadoopvm/index.html
+</code></pre>
+
+<h3>Word Count</h3>
+
+<pre><code>hadoop jar $HADOOP_HOME/hadoop*examples.jar wordcount thriller.txt wordcountoutput
+</code></pre>
+
+<h3>Parse all of Shakespeare's works</h3>
+
+<p>Grep yields unsorted output</p>
+
+<pre><code>hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep file:/home/training/git/data/input shakespeare_freq '\w+'
+
+hadoop jar $HADOOP_HOME/hadoop*examples.jar grep thriller.txt grepoutputdir '\w+'
+</code></pre>
+
+<h3>Examine the file</h3>
+
+<pre><code>hadoop fs -lsr shakespeare_freq/*
+hadoop fs -cat shakespeare_freq/part-00000 | more
+</code></pre>
+
+<h3>Examine the job logs</h3>
+
+<pre><code>hadoop fs -cat shakespeare_freq/_logs/history/* | more
+</code></pre>
+
+<h2>Streaming</h2>
+
+<pre><code>cd /home/training/git/data
+cat thriller.txt
+cut -f 2 -d , thriller.txt
+
+hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar -input thriller.txt -output outputfromstreaming -mapper 'cut -f 2 -d ,' -reducer 'uniq'
+</code></pre>
+
+<p>This fails if we haven't uploaded the file to HDFS. We can either upload the file or address it via a <code>file://</code> URL; bare paths fail because our default filesystem is HDFS, so relative inputs are resolved there rather than on the local disk.</p>
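+
+<p>A minimal sketch of both workarounds (paths assume the training VM layout used above):</p>
+
+<pre><code># Option 1: upload the input to HDFS first, then rerun the streaming job as-is
+hadoop fs -put thriller.txt .
+
+# Option 2: point at the local file with a file:// URL
+hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar -input file:///home/training/git/data/thriller.txt -output outputfromstreaming -mapper 'cut -f 2 -d ,' -reducer 'uniq'
+</code></pre>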
+
+<h3>Sort reverse</h3>
+
+<pre><code>sort -r &lt;SOMEFILE&gt;
+</code></pre>
+
+<p>Work this into a streaming job</p>
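+
+<p>A sketch of how that job might look (the output directory name is a placeholder; <code>cat</code> passes records through so the reducer does the reverse sort):</p>
+
+<pre><code>hadoop jar $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar -input thriller.txt -output sortedreverse -mapper cat -reducer 'sort -r'
+</code></pre>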
+
+<h2>Pig</h2>
+
+<p>Get a pig prompt</p>
+
+<pre><code>pig
+
+A = load 'shakespeare_freq/part-00000' using PigStorage('\t') AS (rawcount:int, rawword: chararray);
+explain A;
+
+B = foreach A generate $1 as word, $0 as count;
+C = ORDER B BY count ASC;
+
+illustrate C;
+
+dump C;
+store C into 'shakespeare_words.output';
+
+Y = LIMIT C 5;
+dump Y;
+
+quit;
+
+hadoop fs -cat shakespeare_words.output/part-00000
+</code></pre>
+
+<h3>Pig Again</h3>
+
+<pre><code>A = load 'myfile' as (t, u, v);
+B = load 'myotherfile' as (x, y, z); -- hypothetical second input; the join below needs B
+C = filter A by t == 1;
+D = join C by t, B by x;
+E = group D by u;
+F = foreach E generate group, COUNT($1);
+dump F; -- Pig is lazy: nothing executes until a dump or store
+
+<h2>HBase</h2>
+
+<p>On a Mac</p>
+
+<pre><code>start-hbase.sh
+
+hbase shell
+create 'blogposts', 'post', 'image'
+</code></pre>
+
+<h3>Check the metadata created</h3>
+
+<pre><code>hadoop fs -lsr /
+</code></pre>
+
+<h3>Put some data into HBase</h3>
+
+<pre><code>put 'blogposts', 'post1', 'post:title', 'Hello World'
+put 'blogposts', 'post1', 'post:author', 'The Author'
+put 'blogposts', 'post1', 'post:body', 'This is a blog post'
+put 'blogposts', 'post1', 'image:header', 'image1.jpg'
+put 'blogposts', 'post1', 'image:bodyimage', 'image2.jpg'
+
+get 'blogposts', 'post1'
+</code></pre>
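+
+<p>To view every row and cell in the table, the HBase shell's scan command works as well:</p>
+
+<pre><code>scan 'blogposts'
+</code></pre>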
+
+<h2>Hive</h2>
+
+<h3>Clean up logs prior to running Hive</h3>
+
+<pre><code>hadoop fs -rmr shakespeare_freq/_logs
+</code></pre>
+
+<h3>Run Hive queries</h3>
+
+<pre><code>hive
+
+SHOW TABLES;
+DROP TABLE shakespeare;
+
+CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
+DESCRIBE shakespeare;
+
+LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;
+
+SELECT * FROM shakespeare LIMIT 10;
+
+SELECT * FROM shakespeare WHERE freq &gt; 100 SORT BY freq ASC LIMIT 10;
+exit;
+</code></pre>
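+
+<p><code>LOAD DATA INPATH</code> moves the <code>shakespeare_freq</code> directory into Hive's warehouse rather than copying it; assuming the default warehouse location of <code>/user/hive/warehouse</code>, the result can be confirmed with:</p>
+
+<pre><code>hadoop fs -lsr /user/hive/warehouse/shakespeare
+</code></pre>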
+
+<h2>Configuring Hadoop</h2>
+
+<pre><code>cd /usr/lib/hadoop-0.20/conf
+</code></pre>
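+
+<p>The key files in that directory (the standard set for Hadoop 0.20) are <code>core-site.xml</code>, <code>hdfs-site.xml</code>, <code>mapred-site.xml</code>, and <code>hadoop-env.sh</code>.</p>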
+
+<h2>Managing Hadoop</h2>
+
+<pre><code>cd /usr/lib/hadoop-0.20/bin
+</code></pre>
+
+<p>or</p>
+
+<pre><code>cd $HADOOP_HOME/bin
+sudo -u hadoop start-all.sh
+sudo -u hadoop stop-all.sh
+</code></pre>
+
+<h2>Rebalancing</h2>
+
+<pre><code>su - hadoop /usr/lib/hadoop/bin/start-balancer.sh
+</code></pre>
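+
+<p>The balancer accepts an optional threshold: the per-datanode deviation from average disk usage, in percent, to tolerate. It can also be invoked in the foreground:</p>
+
+<pre><code>hadoop balancer -threshold 10
+</code></pre>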
6 Hadoop Workshop Labs.md
@@ -6,7 +6,7 @@ Download core binaries:
http://hadoop.apache.org/common/releases.html
http://www.apache.org/dyn/closer.cgi/hadoop/core/
- http://mirror.cc.columbia.edu/pub/software/apache/hadoop/core/hadoop-0.20.2/
+ http://mirror.cc.columbia.edu/pub/software/apache/hadoop/core/hadoop-0.20.2/
Download ancillary projects (Pig, Hive, HBase, Zookeeper, Avro, Chukwa):
@@ -52,14 +52,14 @@ Open a separate terminal window
hadoop jar $HADOOP_HOME/hadoop*examples.jar
### Calculate Pi
- hadoop jar hadoop-*-examples.jar pi 4 1000
+ hadoop jar $HADOOP_HOME/hadoop-*-examples.jar pi 6 5000
Look at the source code for Pi
http://code.google.com/edu/parallel/tools/hadoopvm/index.html
### Word Count
- hadoop jar $HADOOP_HOME/hadoop*examples.jar wordcount thrillers.txt wordcountoutput
+ hadoop jar $HADOOP_HOME/hadoop*examples.jar wordcount thriller.txt wordcountoutput
### Parse all of Shakespeare's works
Grep yields unsorted output