<DIV ALIGN=CENTER>

# Introduction to Hadoop
## Professor Robert J. Brunner
  
</DIV>  
----- 
-----

## Introduction


In this Notebook, we will demonstrate how to run a Hadoop Streaming
Map/Reduce job in a Docker container. Our setup will be using a single
Hadoop node, which will not be very fast, especially when compared to
simply running the map/reduce Python code directly. However, the full
Hadoop process will be demonstrated, including the use of the Hadoop
file system (HDFS) and the Hadoop Streaming process model. Before
proceeding with this Notebook, be sure to (at least start to) download
the SequenceIQ Hadoop Docker container.

Typically, Hadoop is operated on a large cluster that runs both
Hadoop and HDFS, although with the development of YARN, more diverse
workflows are now possible. In this Notebook, we only explore the two
basic components, Hadoop and HDFS, which work together to run code on
the nodes that hold the relevant data in order to maximize throughput.
[Other resources][hort] exist to learn more about YARN and other Hadoop
workflows. The basic Hadoop task is a map/reduce process, where a map
process analyzes data and creates a sequential list of key-value pairs
(like the items of a Python dictionary, except that a key can appear
more than once). The Hadoop process model sorts the output of the
mappers by key before passing the results to a reduce process. The
reduce process combines the key-value pairs to generate the final
output. The prototypical map/reduce example is the [word-count
problem][wcp], where a large corpus is analyzed to quantify how many
times each word appears (one can quickly see how this model can be
extended to analyze websites as opposed to texts).
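
The map, sort, and reduce stages described above can be sketched in
plain Python without Hadoop. This is only an illustrative sketch: the
function names `mapper`, `reducer`, and `word_count` are hypothetical,
the tokenization (lowercasing and splitting on whitespace) is an
assumption, and the call to `sorted` merely stands in for the sort that
Hadoop performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Sum the counts for each run of identical keys in sorted input."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Chain the phases; sorted() plays the role of Hadoop's sort step."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

The reducer relies on its input being sorted by key, which is why the
sort step between the mappers and the reducer matters.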

Thus to complete a map/reduce task in Hadoop we need to complete the
following tasks:

1. create a Map program
2. create a Reduce program
3. obtain a data set to analyze
4. load our data into HDFS
5. execute our map/reduce program by using Hadoop

The rest of this Notebook will demonstrate how to perform each of these
tasks. We first will create the two Python programs, download a sample
text, and also download the hadoop-streaming jar file into a shared
local directory from within our course4 Docker container. Once these
steps are complete, we will start our Hadoop Docker container to
complete the rest of the process.

In the next code cell, we start the process by running a shell script
that creates (and deletes first if it exists) the shared directory that
will hold the Python codes and data for our Map/Reduce Hadoop project.

-----
[hort]: http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
[wcp]: https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

-----

https://github.com/sequenceiq/hadoop-docker

https://github.com/UI-DataScience/docker-info490/blob/master/hadoop/example.ipynb


In [1]:
%%bash
#!/usr/bin/env bash

echo '##### Out File #####'
out_file=$(ls -la /usr/local/hadoop/logs/hadoop-data*.out | head -1 | awk '{print $9}')
cat  $out_file

echo
echo '##### Log File #####'
log_file=$(ls -la /usr/local/hadoop/logs/hadoop-data*.log | head -1 | awk '{print $9}')
tail -10  $log_file

##### Out File #####
ulimit -a for user data_scientist
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7902
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

##### Log File #####
2016-04-07 17:10:57,648 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Scheduling blk_1073742005_1181 file /tmp/hadoop-data_scientist/dfs/data/current/BP-629745585-172.17.0.3-1459931584833

-----

## HDFS

At this point, we need to move the data we want to process into the Hadoop
Distributed File System, or HDFS. HDFS is a file system that is designed
to work effectively with the Hadoop environment. In a typical Hadoop
cluster, files would be broken up and distributed to different Hadoop
nodes. The processing is moved to the data in this model, which can
produce high throughput, especially for map/reduce programming tasks.
However, this means you cannot simply move around the HDFS file system
in the same manner as a traditional Unix file system, since the
components of a particular file are not all co-located. Instead, we
must use the [HDFS file system interface][hdfs], which is invoked by
using `$HADOOP_PREFIX/bin/hdfs`. Running this command by itself in your
Hadoop Docker container will list the available commands, as shown in
the following code cell.

-----

[hdfs]: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfs


In [2]:
!$HADOOP_PREFIX/bin/hdfs

Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  classpath            prints the classpath
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  mover                run a utility to move block replicas across
                       storage types
  oiv                  apply the offline fsimage viewer to an fsimage


-----
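
To make the earlier point about files being broken up and distributed
more concrete, here is a toy sketch of splitting a file into fixed-size
blocks and spreading them over nodes. The function names and the
round-robin placement are assumptions made purely for illustration;
real HDFS placement also accounts for replication and rack topology,
and the block size is configurable (128 MB is the default in Hadoop
2.x).

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin (a toy policy, not HDFS's own)."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(block)
    return placement


if __name__ == "__main__":
    # Tiny numbers keep the example readable; real blocks are far larger.
    blocks = split_into_blocks(file_size=300, block_size=128)
    for node, held in place_blocks(blocks, ["node1", "node2"]).items():
        print(node, held)
```

With a 300-byte file and 128-byte blocks this produces three blocks, the
last one 44 bytes long, so no single node holds the whole file.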

The standard command we will use is `dfs`, which runs a filesystem
command on the HDFS file system that is supported by Hadoop. The [list
of supported `dfs` commands][dfsl] is extensive, and mirrors many of the
traditional Unix file system commands. The full listing can be obtained
by entering `$HADOOP_PREFIX/bin/hdfs dfs` at the prompt in our Hadoop
Docker container.

-----

[dfsl]: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html

In [3]:
!$HADOOP_PREFIX/bin/hdfs dfs

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-d] [-h] [-R] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] 

-----

Some of the more useful commands for this class include:

- `-cat`: copies the source path to STDOUT.

- `-count -h`: counts the number of directories, files, and bytes under the
path specified. With the `-h` flag, the output is displayed in a
human-readable format.

- `-expunge`: empties the trash. When the HDFS trash feature is enabled,
files and directories are not removed immediately by the `rm` command;
they are simply moved to the trash. This can be useful when HDFS supplies
a `Name node is in safe mode.` message.

- `-ls`: lists the contents of the indicated directory in HDFS.

- `-mkdir -p`: creates a new directory in HDFS at the specified
location. With the `-p` flag any parent directory specified in the full
path will also be created as necessary.

- `-put`: copies indicated file(s) from local host file system into the
specified path in HDFS.

- `-rm -r -f`: deletes the indicated file or directory. With the `-r`
flag, the command also deletes any files or directories under the
indicated directory; with the `-f` flag, it does not display a message
if the target does not exist. The `-skipTrash` flag should be used to
delete the indicated resource immediately.

- `-tail`: displays the last kilobyte of the indicated file to STDOUT.
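
If you prefer to issue these commands from Python rather than from the
shell, a thin wrapper can build the argument list for you. The sketch
below is an assumption-laden convenience, not part of Hadoop: the helper
names `hdfs_dfs_command` and `run_hdfs_dfs` are hypothetical, and the
fallback path `/usr/local/hadoop` is simply the install location seen in
this Notebook's container.

```python
import os
import subprocess


def hdfs_dfs_command(*args, hadoop_prefix=None):
    """Build the argument list for `$HADOOP_PREFIX/bin/hdfs dfs <args>`."""
    # Fallback path is an assumption taken from this Notebook's container.
    prefix = hadoop_prefix or os.environ.get("HADOOP_PREFIX", "/usr/local/hadoop")
    return [f"{prefix}/bin/hdfs", "dfs", *args]


def run_hdfs_dfs(*args, hadoop_prefix=None):
    """Run the command and return its standard output as text."""
    command = hdfs_dfs_command(*args, hadoop_prefix=hadoop_prefix)
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `run_hdfs_dfs("-mkdir", "-p", "wc/in")` would create a
directory, and `print(run_hdfs_dfs("-ls", "/"))` would list the HDFS
root, provided Hadoop is installed and running.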

-----

In [4]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls /

Found 2 items
drwxrwx---   - data_scientist supergroup          0 2016-04-06 19:34 /tmp
drwxr-xr-x   - data_scientist supergroup          0 2016-04-06 08:34 /user


In [5]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls /user/

Found 1 items
drwxr-xr-x   - data_scientist supergroup          0 2016-04-07 17:57 /user/data_scientist


In [6]:
!ls /user

ls: cannot access /user: No such file or directory


In [7]:
# Free Space
!$HADOOP_PREFIX/bin/hdfs dfs -df -h

Filesystem                  Size   Used  Available  Use%
hdfs://e7d89fb87de4:9000  18.2 G  4.3 M     11.5 G    0%


-----

### Testing Hadoop

Run a simple example. First make the directories, copy in the data, and then run a grep search.

-----

In [8]:
%%bash

cd $HADOOP_PREFIX

# Remove old directory (if it exists) to have a clean example
bin/hdfs dfs -rm -r -f hadoop

# Make directories for example application
bin/hdfs dfs -mkdir -p hadoop
bin/hdfs dfs -mkdir -p hadoop/input

# Copy data into example input directory
bin/hdfs dfs -put etc/hadoop/*.xml hadoop/input

# Run a Hadoop example to test the installation
example_file=$(ls share/hadoop/mapreduce/hadoop-mapreduce-examples*)
bin/hadoop jar $example_file grep hadoop/input hadoop/output 'dfs[a-z.]+'

# Display directory hierarchy
bin/hdfs dfs -ls hadoop/
bin/hdfs dfs -ls hadoop/input
bin/hdfs dfs -ls hadoop/output

Deleted hadoop
Found 2 items
drwxr-xr-x   - data_scientist supergroup          0 2016-04-07 18:16 hadoop/input
drwxr-xr-x   - data_scientist supergroup          0 2016-04-07 18:17 hadoop/output
Found 9 items
-rw-r--r--   1 data_scientist supergroup       4436 2016-04-07 18:16 hadoop/input/capacity-scheduler.xml
-rw-r--r--   1 data_scientist supergroup        158 2016-04-07 18:16 hadoop/input/core-site.xml
-rw-r--r--   1 data_scientist supergroup       9683 2016-04-07 18:16 hadoop/input/hadoop-policy.xml
-rw-r--r--   1 data_scientist supergroup        354 2016-04-07 18:16 hadoop/input/hdfs-site.xml
-rw-r--r--   1 data_scientist supergroup        620 2016-04-07 18:16 hadoop/input/httpfs-site.xml
-rw-r--r--   1 data_scientist supergroup       3518 2016-04-07 18:16 hadoop/input/kms-acls.xml
-rw-r--r--   1 data_scientist supergroup       5511 2016-04-07 18:16 hadoop/input/kms-site.xml
-rw-r--r--   1 data_scientist supergroup        357 2016-04-07 18:16 hadoop/input/mapred-site.xml
-rw-r--r-

16/04/07 18:16:07 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
16/04/07 18:16:18 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/04/07 18:16:19 INFO input.FileInputFormat: Total input paths to process : 9
16/04/07 18:16:19 INFO mapreduce.JobSubmitter: number of splits:9
16/04/07 18:16:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1459971273803_0015
16/04/07 18:16:19 INFO impl.YarnClientImpl: Submitted application application_1459971273803_0015
16/04/07 18:16:19 INFO mapreduce.Job: The url to track the job: http://e7d89fb87de4:8088/proxy/application_1459971273803_0015/
16/04/07 18:16:19 INFO mapreduce.Job: Running job: job_1459971273803_0015
16/04/07 18:16:27 INFO mapreduce.Job: Job job_1459971273803_0015 running in uber mode : false
16/04/07 18:16:27 INFO mapreduce.Job:  map 0% reduce 0%
16/04/07 18:16:50 INFO mapreduce.Job:  map 11% reduce 0%
16/04/07 18:16:51 INFO mapredu

-----

Examine the output.

-----

In [9]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls hadoop/input

Found 9 items
-rw-r--r--   1 data_scientist supergroup       4436 2016-04-07 18:16 hadoop/input/capacity-scheduler.xml
-rw-r--r--   1 data_scientist supergroup        158 2016-04-07 18:16 hadoop/input/core-site.xml
-rw-r--r--   1 data_scientist supergroup       9683 2016-04-07 18:16 hadoop/input/hadoop-policy.xml
-rw-r--r--   1 data_scientist supergroup        354 2016-04-07 18:16 hadoop/input/hdfs-site.xml
-rw-r--r--   1 data_scientist supergroup        620 2016-04-07 18:16 hadoop/input/httpfs-site.xml
-rw-r--r--   1 data_scientist supergroup       3518 2016-04-07 18:16 hadoop/input/kms-acls.xml
-rw-r--r--   1 data_scientist supergroup       5511 2016-04-07 18:16 hadoop/input/kms-site.xml
-rw-r--r--   1 data_scientist supergroup        357 2016-04-07 18:16 hadoop/input/mapred-site.xml
-rw-r--r--   1 data_scientist supergroup       1525 2016-04-07 18:16 hadoop/input/yarn-site.xml


In [10]:
!$HADOOP_PREFIX/bin/hdfs dfs -ls hadoop/output

Found 2 items
-rw-r--r--   1 data_scientist supergroup          0 2016-04-07 18:17 hadoop/output/_SUCCESS
-rw-r--r--   1 data_scientist supergroup         74 2016-04-07 18:17 hadoop/output/part-r-00000


In [11]:
!$HADOOP_PREFIX/bin/hdfs dfs -cat hadoop/output/part-r-00000

1	dfsadmin
1	dfs.replication
1	dfs.namenode.servicerpc
1	dfs.namenode.rpc


-----

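Compare with Unix `grep`. Note that the pattern used here, `dfs[a-z.]`,
omits the trailing `+`, so only the first character after `dfs` is
matched on each line.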

-----

In [12]:
!grep --color 'dfs[a-z.]' $HADOOP_PREFIX/etc/hadoop/*.xml

[35m[K/usr/local/hadoop/etc/hadoop/hadoop-policy.xml[m[K[36m[K:[m[K    [01;31m[Kdfsa[m[Kdmin and mradmin commands to refresh the security policy in-effect.
[35m[K/usr/local/hadoop/etc/hadoop/hdfs-site.xml[m[K[36m[K:[m[K        <name>[01;31m[Kdfs.[m[Kreplication</name>
[35m[K/usr/local/hadoop/etc/hadoop/hdfs-site.xml[m[K[36m[K:[m[K        <name>[01;31m[Kdfs.[m[Knamenode.rpc-bind-host</name>
[35m[K/usr/local/hadoop/etc/hadoop/hdfs-site.xml[m[K[36m[K:[m[K        <name>[01;31m[Kdfs.[m[Knamenode.servicerpc-bind-host</name>


-----
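
For a closer comparison with the Hadoop job, which extracts each full
match of `dfs[a-z.]+` and counts how often it occurs, the following
minimal Python sketch performs the same kind of count on a handful of
lines. The function name `count_matches` is hypothetical, and the sample
lines are copied from the `grep` output above rather than read from the
configuration files.

```python
import re
from collections import Counter


def count_matches(lines, pattern):
    """Count every non-overlapping match of pattern across all lines."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        counts.update(regex.findall(line))
    return counts


if __name__ == "__main__":
    sample = [
        "dfsadmin and mradmin commands to refresh the security policy in-effect.",
        "<name>dfs.replication</name>",
        "<name>dfs.namenode.rpc-bind-host</name>",
        "<name>dfs.namenode.servicerpc-bind-host</name>",
    ]
    # Print "count<TAB>match", the same layout as the Hadoop output file.
    for match, count in count_matches(sample, r"dfs[a-z.]+").most_common():
        print(f"{count}\t{match}")
```

Because `-` is not in the character class, a name such as
`dfs.namenode.rpc-bind-host` is cut off at `dfs.namenode.rpc`, which is
consistent with the entries in `part-r-00000`.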

## Data 


-----

In [13]:
%%bash
#!/usr/bin/env bash
# A Bash shell script that deletes the Hadoop directory if it exists, and
# then makes a new Hadoop directory

# Our directory name
DIR=$HOME/hadoop

# Delete if exists
if [ -d "$DIR" ]; then
    rm -rf "$DIR"
fi

# Now make the directory
mkdir "$DIR"

ls -la "$DIR"

total 8
drwxr-xr-x  2 data_scientist users 4096 Apr  7 18:17 .
drwxr-xr-x 18 data_scientist users 4096 Apr  7 18:17 ..


-----

### Acquiring Data

To perform data analysis by using Hadoop, we will need a data set. In the
Notebook for the second lesson this week, we will perform a simple
map/reduce operation that requires text data to operate on. While there
are a number of possible options, for this example we can grab a free
book from [Project Gutenberg][pg]:

    wget --output-document=$HOME/hadoop/book.txt \
        http://www.gutenberg.org/cache/epub/4300/pg4300.txt

In this case, we have grabbed the full text of the novel _Ulysses_, by
James Joyce, and placed the text in the `hadoop` subdirectory of our
_home_ directory.

-----
[pg]: http://www.gutenberg.org

In [14]:
# Grab a book to process
!wget --output-document=$HOME/hadoop/book.txt \
http://www.gutenberg.org/cache/epub/4300/pg4300.txt

--2016-04-07 18:17:51--  http://www.gutenberg.org/cache/epub/4300/pg4300.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1573151 (1.5M) [text/plain]
Saving to: ‘/home/data_scientist/hadoop/book.txt’


2016-04-07 18:17:52 (1.11 MB/s) - ‘/home/data_scientist/hadoop/book.txt’ saved [1573151/1573151]



-----

At this point, we first need to create a directory in HDFS to hold the
input and output of our Hadoop task. We will create a new directory called
`wc` with a subdirectory called `in` to hold the input data for our
Hadoop task. Second, we will need to copy the book text file into this
new HDFS directory. This means we will need to run the following two
commands at the prompt in our Hadoop Docker container:

1. `$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p wc/in`
2. `$HADOOP_PREFIX/bin/hdfs dfs -put book.txt wc/in/book.txt`

The following code cell runs these two commands (after first removing
any existing `wc` directory), as well as the `dfs -ls` command to
display the contents of our new HDFS directory.

-----

In [17]:
%%bash

cd $HADOOP_PREFIX

bin/hdfs dfs -rm -r -f wc

bin/hdfs dfs -mkdir -p wc/in
bin/hdfs dfs -put $HOME/hadoop/book.txt wc/in/book.txt

bin/hdfs dfs -ls wc/in

Deleted wc
Found 1 items
-rw-r--r--   1 data_scientist supergroup    1573151 2016-04-07 23:37 wc/in/book.txt


16/04/07 23:37:53 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.


-----
