<DIV ALIGN=CENTER>

# Introduction to Hadoop
## Professor Robert J. Brunner
  
</DIV>  
----- 

## Introduction

In this Notebook, we will demonstrate how to run a Hadoop Streaming
Map/Reduce job in a Docker container. Our setup uses a single Hadoop
node, which will not be very fast, especially when compared to simply
running the map/reduce Python code directly. However, the full Hadoop
process will be demonstrated, including the use of the Hadoop
Distributed File System (HDFS) and the Hadoop Streaming process model.
Before proceeding with this Notebook, be sure to download (or at least
start downloading) the SequenceIQ Hadoop Docker container.

Typically, basic Hadoop is operated on a large cluster that runs both
Hadoop and HDFS, although with the development of YARN, more diverse
workflows are now possible. In this Notebook, we explore only the two
basic components, Hadoop map/reduce and HDFS, which work together to run
code on the nodes that hold the relevant data in order to maximize
throughput. [Other resources][hort] exist to learn more about YARN and
other Hadoop workflows. The basic Hadoop task is a map/reduce process,
where a map process analyzes data and creates a sequential list of
key-value pairs (much like the items in a Python dictionary). The Hadoop
process model sorts the output of the mappers before passing the results
to a reduce process. The reduce process combines the key-value pairs to
generate the final output. The prototypical map/reduce example is the
[word-count problem][wcp], where a large corpus is analyzed to quantify
how many times each word appears (one can quickly see how this model can
be extended to analyze websites rather than texts).
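
The map, sort, and reduce stages described above can be sketched in a
few lines of plain Python. This is only an illustrative, in-memory
version of the word-count problem, not the Hadoop Streaming programs
used later in this Notebook; the function names `mapper`, `reducer`, and
`word_count` are assumptions chosen for this example, and the `sorted`
call stands in for the sort that Hadoop performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)


def reducer(sorted_pairs):
    """Sum the counts for each run of identical keys in sorted input."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Run map, then sort (done by Hadoop between phases), then reduce."""
    return dict(reducer(sorted(mapper(lines))))


# A tiny hypothetical corpus, used only for illustration.
sample = ["the quick brown fox", "jumps over the lazy dog", "The end"]
for word, count in sorted(word_count(sample).items()):
    print(word, count)
```

The reducer relies on its input being sorted by key, which is why the
sort step between the two phases matters: all pairs for a given word
arrive together and can be summed in a single pass.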

Thus, to complete a map/reduce job in Hadoop, we need to complete the
following tasks:

1. create a Map program
2. create a Reduce program
3. obtain a data set to analyze
4. load our data into HDFS
5. execute our map/reduce program by using Hadoop

The rest of this Notebook will demonstrate how to perform each of these
tasks. We will first create the two Python programs, download a sample
text, and download the hadoop-streaming jar file into a shared local
directory from within our course4 Docker container. Once these steps are
complete, we will start our Hadoop Docker container to complete the rest
of the process.

In the next code cell, we start the process by running a shell script
that creates the shared directory (deleting it first if it already
exists) that will hold the Python code and data for our Map/Reduce
Hadoop project.

-----
[hort]: http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
[wcp]: https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

-----

https://github.com/sequenceiq/hadoop-docker

https://github.com/UI-DataScience/docker-info490/blob/master/hadoop/example.ipynb


In [105]:
%%bash
#!/usr/bin/env bash
# A Bash shell script that deletes the Hadoop directory if it exists,
# and then makes a new Hadoop directory

# Our directory name
DIR=/home/data_scientist/rppdm/hadoop

# Delete if exists
if [ -d "$DIR" ]; then
    rm -rf "$DIR"
fi

# Now make the directory
mkdir -p "$DIR"

-----

### Data to process

Our simple map/reduce programs require text data to operate on. While
there are a number of possible options, for this example we can grab a
free book from [Project Gutenberg][pg]:

    wget --output-document=/home/data_scientist/rppdm/hadoop/book.txt \
        http://www.gutenberg.org/cache/epub/4300/pg4300.txt

In this case, we have grabbed the full text of the novel _Ulysses_, by
James Joyce.

-----
[pg]: http://www.gutenberg.org

In [106]:
# Grab a book to process
!wget --output-document=/home/data_scientist/rppdm/hadoop/book.txt \
http://www.gutenberg.org/cache/epub/4300/pg4300.txt

--2016-04-05 23:46:45--  http://www.gutenberg.org/cache/epub/4300/pg4300.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1573151 (1.5M) [text/plain]
Saving to: ‘/home/data_scientist/rppdm/hadoop/book.txt’


2016-04-05 23:46:47 (1.11 MB/s) - ‘/home/data_scientist/rppdm/hadoop/book.txt’ saved [1573151/1573151]



-----

In [107]:
!echo $HADOOP_PREFIX

/usr/local/hadoop


In [108]:
%%bash

# This script is necessary to work around a bug in the distribution.

# Simplify the command lines below by changing to the root Hadoop directory
cd "$HADOOP_PREFIX"

# Stop nodes in case they are already running.
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh stop namenode

# Start HADOOP datanode and namenode
sbin/hadoop-daemon.sh start datanode
sbin/hadoop-daemon.sh start namenode

stopping datanode
stopping namenode
starting datanode, logging to /usr/local/hadoop/logs/hadoop--datanode-83a8dc7241b5.out
starting namenode, logging to /usr/local/hadoop/logs/hadoop--namenode-83a8dc7241b5.out


In [55]:
# Modify this path to match the log filename reported in the previous cell's output
!tail /usr/local/hadoop/logs/hadoop--namenode-83a8dc7241b5.out

max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


-----

## HDFS

At this point, we need to move the data we want to process into the
Hadoop Distributed File System, or HDFS. HDFS is a file system that is
designed to work effectively with the Hadoop environment. In a typical
Hadoop cluster, files are broken up and distributed across different
Hadoop nodes. In this model the processing is moved to the data, which
can produce high throughput, especially for map/reduce programming
tasks. However, this means you cannot simply move around HDFS in the
same manner as a traditional Unix file system, since the components of a
particular file are not all co-located. Instead, we must use the [HDFS
file system interface][hdfs], which is invoked by using
`$HADOOP_PREFIX/bin/hdfs`. Running this command by itself in your Hadoop
Docker container lists the available commands, as the `hdfs` code cell
below shows.
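
To make the idea of files being broken up and distributed more concrete,
the following is a toy sketch of how a file might be divided into
fixed-size blocks and spread over several nodes. It is a simplified
illustration only: the 128 MiB default matches the `dfs.blocksize`
default in Hadoop 2.x, but the round-robin placement, the function names
`split_into_blocks` and `place_blocks`, and the node names are
assumptions made for this example and do not reflect the actual HDFS
placement policy or replication.

```python
# 128 MiB, the default dfs.blocksize in Hadoop 2.x
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin (a toy policy, not HDFS's own)."""
    return {block: nodes[i % len(nodes)] for i, block in enumerate(blocks)}


# The book downloaded earlier is 1573151 bytes, so it fits in one block.
print(split_into_blocks(1573151))

# A hypothetical 300 MiB file spread across three hypothetical nodes.
big_file = split_into_blocks(300 * 1024 * 1024)
for block, node in place_blocks(big_file, ["node1", "node2", "node3"]).items():
    print(node, block)
```

Because each block can live on a different node, there is no single
local path to a file's contents, which is why HDFS must be accessed
through its own command interface.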

-----

[hdfs]: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfs


In [49]:
!$HADOOP_PREFIX/bin/hdfs

Usage: hdfs [--config confdir] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  mover                run a utility to move block replicas across
                       storage types
  oiv                  apply the offline fsimage viewer to an fsimage
  oiv_legacy           apply the offline fsimage viewer to an legac

-----

The standard command we will use is `dfs`, which runs a file system
command against HDFS. The [list of supported `dfs` commands][dfsl] is
extensive and mirrors many of the traditional Unix file system commands.
The full listing can be obtained by entering
`$HADOOP_PREFIX/bin/hdfs dfs` at the prompt in our Hadoop Docker
container.

-----

[dfsl]: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html

In [27]:
!$HADOOP_PREFIX/bin/hdfs dfs

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] <path> ...]
	[-expunge]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-d] [-h] [-R] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
	[-renameSnapsh

-----

Some of the more useful commands for this class include:

- `-cat`: copies the contents of the source path to STDOUT.

- `-count -h`: counts the number of directories, files, and bytes under
the specified path. With the `-h` flag, the sizes are displayed in a
human-readable format.

- `-expunge`: empties the trash. When the HDFS trash feature is enabled,
files and directories deleted with the `rm` command are not removed from
HDFS immediately; they are simply moved to the trash. This can be useful
when HDFS supplies a `Name node is in safe mode.` message.

- `-ls`: lists the contents of the indicated directory in HDFS.

- `-mkdir -p`: creates a new directory in HDFS at the specified
location. With the `-p` flag, any parent directory specified in the full
path will also be created as necessary.

- `-put`: copies the indicated file(s) from the local host file system
into the specified path in HDFS.

- `-rm -f -r`: deletes the indicated file or directory. With the `-f`
flag, the command does not display a diagnostic message if the target
does not exist; with the `-r` flag, it also deletes any files or
directories under the indicated directory. The `-skipTrash` flag can be
used to delete the indicated resource immediately, bypassing the trash.

- `-tail`: displays the last kilobyte of the indicated file to STDOUT.

-----

In [56]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls /

Found 3 items
drwxr-xr-x   - data_scientist supergroup          0 2016-04-05 20:24 /home
drwx------   - data_scientist supergroup          0 2016-04-05 20:25 /tmp
drwxr-xr-x   - data_scientist supergroup          0 2016-04-05 20:26 /user


In [66]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls /user/

Found 1 items
drwxr-xr-x   - data_scientist supergroup          0 2016-04-05 20:46 /user/data_scientist


In [58]:
!ls /user

ls: cannot access /user: No such file or directory


In [63]:
# Free Space
!$HADOOP_PREFIX/bin/hdfs dfs -df -h

Filesystem                  Size   Used  Available  Use%
hdfs://83a8dc7241b5:9000  18.2 G  1.6 M      8.2 G    0%


In [64]:
# Usage
!$HADOOP_PREFIX/bin/hdfs dfs -du -h

75.2 K  hadoop


-----

### Running shell script

-----

In [72]:
%%bash

cd $HADOOP_PREFIX
bin/hdfs dfs -mkdir -p hadoop
#bin/hdfs dfs -put $HADOOP_PREFIX/etc/hadoop/ hadoop/input
bin/hdfs dfs -mkdir -p hadoop/input

# run the mapreduce
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep hadoop/input hadoop/output 'dfs[a-z.]+'

# check the output
bin/hdfs dfs -tail hadoop/output/*


mkdir: `hadoop/input': File exists
16/04/05 22:46:25 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/04/05 22:46:26 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
16/04/05 22:46:26 INFO input.FileInputFormat: Total input paths to process : 32
16/04/05 22:46:26 INFO mapreduce.JobSubmitter: number of splits:32
16/04/05 22:46:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1459886174643_0008
16/04/05 22:46:27 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
16/04/05 22:46:27 INFO impl.YarnClientImpl: Submitted application application_1459886174643_0008
16/04/05 22:46:27 INFO mapreduce.Job: The url to track the job: http://83a8dc7241b5:8088/proxy/application_1459886174643_0008/
16/04/05 22:46:27 INFO mapreduce.Job: Running job: job_1459886174643_0008
16/04/05 22:46:34 INFO mapreduce.Job: Job job_1459886174643_0008 running in uber mode : false
1

In [19]:
!$HADOOP_PREFIX/bin/hadoop \
    jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
    grep /home/data_scientist/hadoop/input /home/data_scientist/hadoop/output 'dfs[a-z.]+'

16/04/05 20:25:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/04/05 20:25:59 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
16/04/05 20:25:59 INFO input.FileInputFormat: Total input paths to process : 31
16/04/05 20:25:59 INFO mapreduce.JobSubmitter: number of splits:31
16/04/05 20:25:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1459886174643_0002
16/04/05 20:26:00 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
16/04/05 20:26:00 INFO impl.YarnClientImpl: Submitted application application_1459886174643_0002
16/04/05 20:26:00 INFO mapreduce.Job: The url to track the job: http://83a8dc7241b5:8088/proxy/application_1459886174643_0002/
16/04/05 20:26:00 INFO mapreduce.Job: Running job: job_1459886174643_0002
16/04/05 20:26:10 INFO mapreduce.Job: Job job_1459886174643_0002 running in uber mode : false
16/04/05 20:26:10 INFO mapreduce.Job

-----

At this point, we first need to create a directory to hold the input
and output of our Hadoop task. We will create a new directory called
`wc` with a subdirectory called `in` to hold the input data. Second, we
need to copy the book text file into this new HDFS directory. This means
we need to run the following two commands at the prompt in our Hadoop
Docker container:

1. `$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p wc/in`
2. `$HADOOP_PREFIX/bin/hdfs dfs -put book.txt wc/in/book.txt`
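
The two commands above can also be assembled from Python, which may be
convenient when staging data from within a Notebook. The sketch below is
one possible approach, not part of the original workflow: the helper
names `hdfs_dfs` and `stage_input` are hypothetical, the fallback prefix
`/usr/local/hadoop` is simply the value echoed earlier in this Notebook,
and by default the commands are only built and printed rather than run.

```python
import os
import subprocess


def hdfs_dfs(*args, hadoop_prefix=None):
    """Build the argument list for one `hdfs dfs` invocation."""
    # Fall back to the HADOOP_PREFIX environment variable used in this Notebook.
    prefix = hadoop_prefix or os.environ.get("HADOOP_PREFIX", "/usr/local/hadoop")
    return [f"{prefix}/bin/hdfs", "dfs", *args]


def stage_input(local_file, hdfs_dir, hadoop_prefix=None, run=False):
    """Return the two staging commands; execute them only when run=True."""
    target = f"{hdfs_dir}/{os.path.basename(local_file)}"
    commands = [
        hdfs_dfs("-mkdir", "-p", hdfs_dir, hadoop_prefix=hadoop_prefix),
        hdfs_dfs("-put", local_file, target, hadoop_prefix=hadoop_prefix),
    ]
    if run:
        for command in commands:
            # check=True raises CalledProcessError if a command fails
            subprocess.run(command, check=True)
    return commands


# Dry run: show the commands without touching HDFS.
for command in stage_input("book.txt", "wc/in", hadoop_prefix="/usr/local/hadoop"):
    print(" ".join(command))
```

Passing `run=True` would execute the commands with `subprocess.run`,
which only makes sense inside a container where Hadoop is installed and
the HDFS daemons are running.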

The following screenshot displays the result of running these two
commands, as well as the `dfs -ls` command to display the contents of
our new HDFS directory, and the `dfs -count` command to show the size of
the directory contents.

-----

In [109]:
!$HADOOP_PREFIX/bin/yarn

Usage: yarn [--config confdir] COMMAND
where COMMAND is one of:
  resourcemanager -format-state-store   deletes the RMStateStore
  resourcemanager                       run the ResourceManager
  nodemanager                           run a nodemanager on each slave
  timelineserver                        run the timeline server
  rmadmin                               admin tools
  version                               print the version
  jar <jar>                             run a jar file
  application                           prints application(s)
                                        report/kill application
  applicationattempt                    prints applicationattempt(s)
                                        report
  container                             prints container(s) report
  node                                  prints node report(s)
  queue                                 prints queue information
  logs                                  dump container

In [73]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls hadoop/input

Found 32 items
-rw-r--r--   1 data_scientist supergroup       4436 2016-04-05 20:43 hadoop/input/capacity-scheduler.xml
-rw-r--r--   1 data_scientist supergroup       1335 2016-04-05 20:43 hadoop/input/configuration.xsl
-rw-r--r--   1 data_scientist supergroup        318 2016-04-05 20:43 hadoop/input/container-executor.cfg
-rw-r--r--   1 data_scientist supergroup        158 2016-04-05 20:43 hadoop/input/core-site.xml
-rw-r--r--   1 data_scientist supergroup        154 2016-04-05 20:43 hadoop/input/core-site.xml.template
drwxr-xr-x   - data_scientist supergroup          0 2016-04-05 22:40 hadoop/input/hadoop
-rw-r--r--   1 data_scientist supergroup       3670 2016-04-05 20:43 hadoop/input/hadoop-env.cmd
-rw-r--r--   1 data_scientist supergroup       4302 2016-04-05 20:43 hadoop/input/hadoop-env.sh
-rw-r--r--   1 data_scientist supergroup       2490 2016-04-05 20:43 hadoop/input/hadoop-metrics.properties
-rw-r--r--   1 data_scientist supergroup       2598 2016-04-05 20:43 hadoop/input/ha

In [101]:
!$HADOOP_PREFIX/bin/yarn jar $HADOOP_PREFIX/share/hadoop/hdfs/hadoop-*test*.jar org.apache.hadoop.hdfs.TestDFSShell


Exception in thread "main" java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.TestDFSShell.main([Ljava.lang.String;)
	at java.lang.Class.getMethod(Class.java:1665)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:215)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


In [102]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls share/hadoop/

ls: `share/hadoop/': No such file or directory


In [103]:
!pwd

/home/data_scientist/rppdm/info490-sp16/Week12/notebooks
