Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

# References
- <a id='hadoop'></a>[Official Apache Hadoop Website](http://hadoop.apache.org/) 
- [Hadoop Wiki](https://wiki.apache.org/hadoop/FrontPage)
- [Outils pour le Big Data - Pierre Nerzic 🇫🇷](https://perso.univ-rennes1.fr/pierre.nerzic/Hadoop/)
- [wikistat - Ateliers Big Data - Philippe Besse 🇫🇷](https://github.com/wikistat/Ateliers-Big-Data)
- [Big data - Wikipedia](https://en.wikipedia.org/wiki/Big_data)
- <a id='hedlung'></a>[Understanding Hadoop Clusters and the Network - Brad Hedlund](http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/)
- [Running Hadoop on Ubuntu Linux (Single-Node Cluster)](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/)
- [Setting up Hadoop 2.4 and Pig 0.12 on OSX locally](https://getblueshift.com/setting-up-hadoop-2-4-and-pig-0-12-on-osx-locally/)

# Installation on macOS Sierra

## Java
```bash
$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
```

## SSH
Open **System Preferences > Sharing** and enable **Remote Login**.
```bash
$ ssh localhost
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ exit
```
Check that you can log in again without a password.

## Bash
Hadoop puts its executables in `/usr/local/sbin`, so add that directory to your `PATH`.
```bash
$ echo 'export PATH="/usr/local/sbin:$PATH"' >> ~/.bash_profile
```


## Hadoop

```bash
$ brew install hadoop
$ hadoop 
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
```

## Configuration

```bash
$ brew info hadoop
hadoop: stable 2.8.0
Framework for distributed processing of large data sets
https://hadoop.apache.org/
/usr/local/Cellar/hadoop/2.8.0 (25,169 files, 2.1GB) *
  Built from source on 2017-08-15 at 13:36:34
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/hadoop.rb
==> Requirements
Required: java >= 1.7 ✔
==> Caveats
In Hadoop's config file:
  /usr/local/opt/hadoop/libexec/etc/hadoop/hadoop-env.sh,
  /usr/local/opt/hadoop/libexec/etc/hadoop/mapred-env.sh and
  /usr/local/opt/hadoop/libexec/etc/hadoop/yarn-env.sh
$JAVA_HOME has been set to be the output of:
  /usr/libexec/java_home
```

* /usr/local/opt/hadoop/libexec/etc/hadoop/core-site.xml
```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```
* /usr/local/opt/hadoop/libexec/etc/hadoop/hdfs-site.xml
```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```
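The two files above share the same `<configuration>` / `<property>` layout, so their settings can be read back programmatically. The following is a minimal sketch using only the Python standard library; the helper name `read_properties` and the constant `CORE_SITE` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative copy of the core-site.xml content shown above.
CORE_SITE = """<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>"""


def read_properties(xml_text):
    """Return the <property> entries of a *-site.xml document as a dict.

    Hypothetical helper: it is not part of Hadoop.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


core = read_properties(CORE_SITE)
print(core["fs.defaultFS"])
```

To check a real installation, the same helper can be given the text of a file, for example `read_properties(open(path).read())`, where `path` points to one of the files listed above.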

## Map Reduce
* /usr/local/opt/hadoop/libexec/etc/hadoop/mapred-site.xml
```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```

## Yarn
* /usr/local/opt/hadoop/libexec/etc/hadoop/yarn-site.xml
```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```

# Big Data

- Data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. 
- Data analysis requires massively parallel software running on several servers.
- **Volume, Variety, Velocity, Variability and Veracity** describe Big Data properties.

# Apache Hadoop

- Framework for running applications on large clusters.
- The Hadoop framework transparently provides applications both reliability and data motion. 
- Hadoop implements the computational paradigm named **Map/Reduce**, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. 
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both MapReduce and the **Hadoop Distributed File System** are designed so that node failures are automatically handled by the framework.
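The Map/Reduce paradigm described above can be illustrated on a single machine. The sketch below is a toy, in-memory model and not Hadoop's implementation: the function names `map_reduce`, `word_mapper` and `sum_reducer` are hypothetical, and the fragments are processed sequentially, whereas Hadoop would distribute them over the nodes of the cluster.

```python
from collections import defaultdict


def map_reduce(fragments, mapper, reducer):
    """Apply mapper to each fragment, group the pairs by key, reduce each group."""
    groups = defaultdict(list)
    for fragment in fragments:            # each fragment is an independent unit of work
        for key, value in mapper(fragment):
            groups[key].append(value)     # "shuffle": gather the values sharing a key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit a (word, 1) pair for every word of a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Sum all the values emitted for one key."""
    return sum(values)


fragments = ["big data big cluster", "data node", "big node"]
counts = map_reduce(fragments, word_mapper, sum_reducer)
print(counts)
```

Because the mapper only looks at one fragment and the reducer only looks at one key, any fragment or key can be processed again on another node after a failure without affecting the others.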

# HDFS
* HDFS is a distributed file system.
* HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
* HDFS is suitable for applications that have large data sets. 
* HDFS provides interfaces for applications to move themselves closer to where the data is located: when the data set is huge, moving the computation is much more efficient than moving the data.
* HDFS consists of a single NameNode with a number of DataNodes which manage storage. 
* HDFS exposes a file system namespace and allows user data to be stored in files. 
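    1. A file is split into one or more blocks that are stored on DataNodes; the NameNode determines the mapping of blocks to DataNodes.
    2. The **NameNode** executes file system namespace operations like opening, closing, and renaming files and directories.
    3. The **Secondary NameNode** periodically merges the edit log of the **NameNode** with the file system image to create checkpoints; it is not a standby NameNode.
    4. The **DataNodes** perform block creation, deletion, and replication upon instruction from the NameNode.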
    5. The placement of replicas is optimized for data reliability, availability, and network bandwidth utilization.
    6. User data never flows through the NameNode.
* Files in HDFS are write-once and have strictly one writer at any time.
* The DataNode has no knowledge about HDFS files. 
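To make the block and replica vocabulary above concrete, here is a toy sketch of the bookkeeping involved. It assumes a 128 MB block size and a replication factor of 3, which are the Hadoop 2.x defaults. The function `plan_blocks` and the round-robin placement are hypothetical simplifications: the real placement policy also takes racks and node load into account.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size (128 MB)


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=3):
    """Return a list of (block_index, block_length, replica_nodes) tuples.

    Hypothetical helper: only mimics the kind of mapping the NameNode keeps.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        # Toy placement: rotate through the DataNodes.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


nodes = ["dn1", "dn2", "dn3", "dn4"]
for block in plan_blocks(300 * 1024 * 1024, nodes):
    print(block)
```

A 300 MB file gives two full blocks and a shorter third one. Only this mapping would live on the NameNode, while the bytes themselves would be stored on the DataNodes.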

# Accessibility: the dfs shell

All [HDFS commands](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html) are invoked by the `bin/hdfs` script:
```shell
hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
```
## Manage files and directories
```shell
hdfs dfs -ls -h -R                # Recursively list subdirectories with human-readable file sizes
hdfs dfs -cp <src> <dst>          # Copy files from source to destination
hdfs dfs -mv <src> <dst>          # Move files from source to destination
hdfs dfs -mkdir /foodir           # Create a directory named /foodir
hdfs dfs -rm -r /foodir           # Remove a directory named /foodir (-rmr is deprecated)
hdfs dfs -cat /foodir/myfile.txt  # View the contents of a file named /foodir/myfile.txt
```
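When these commands have to be driven from a Python session, one option is a thin wrapper around the `hdfs` executable. The sketch below assumes `hdfs` is on the `PATH`; the helper names `hdfs_dfs_argv` and `run_hdfs_dfs` are hypothetical and not part of any Hadoop client library.

```python
import shlex
import shutil
import subprocess


def hdfs_dfs_argv(*args):
    """Build the argument list for `hdfs dfs <args...>` without running it."""
    return ["hdfs", "dfs", *[str(a) for a in args]]


def run_hdfs_dfs(*args):
    """Run `hdfs dfs <args...>` and return its standard output as text."""
    if shutil.which("hdfs") is None:
        raise RuntimeError("the hdfs executable was not found on PATH")
    result = subprocess.run(hdfs_dfs_argv(*args), check=True,
                            capture_output=True, text=True)
    return result.stdout


# Only print the command here, so the example runs without a cluster.
print(shlex.join(hdfs_dfs_argv("-mkdir", "/foodir")))
```

On a machine with a running HDFS, `run_hdfs_dfs("-ls", "-h", "-R")` would return the listing as a string. Passing the arguments as a list avoids shell quoting problems with file names.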

# Transfer between nodes

## put
```shell
hdfs dfs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> .. ] <dst>
```
Copy single src, or multiple srcs from local file system to the destination file system. 

Options:

    -p : Preserves access and modification times, ownership and permissions.
    -f : Overwrites the destination if it already exists.

```shell
hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
```
Commands related to `-put`:
- `moveFromLocal` : like `-put`, but deletes the local source after the copy.
- `copyFromLocal` : like `-put`, but the source is restricted to a local file.
- `copyToLocal` : the reverse direction, like `-get`, with the destination restricted to a local file.

![hdfs blocks](http://saphanatutorial.com/wp-content/uploads/2014/06/Hadoop-Course-4.jpg)

The NameNode is not in the data path. The NameNode only provides the map of where data is and where data should go in the cluster (file system metadata).

- Before running Hadoop, format HDFS:
```bash
$ hdfs namenode -format
```

- The output should contain:
```bash
Storage directory /tmp/hadoop-*/dfs/name has been successfully formatted.
```

- Start NameNode daemon and DataNode daemon:
```bash
$ start-dfs.sh
```
You can browse the web interface for NameNode at http://localhost:50070/
 

To suppress warnings about the native-hadoop library, run:
```bash
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA"
```
- Make the HDFS directories required to execute MapReduce jobs:
```bash
$ hdfs dfs -mkdir -p /user/$USER
```

- Log on to the cluster and type the following commands: 
```bash
$ hdfs dfs -ls
$ hdfs dfs -ls /
$ hdfs dfs -mkdir books
```
- Create a local file user.txt containing your name and the date:

```bash
$ echo "Pierre Navaro" > user.txt
$ date >> user.txt
$ cat user.txt
```

Copy it to HDFS:
```bash
$ hdfs dfs -put user.txt
```

Check with:
```bash
$ hdfs dfs -ls -R 
$ hdfs dfs -cat user.txt 
$ hdfs dfs -tail user.txt 
```

Remove the file:
```bash
$ hdfs dfs -rm user.txt
```

Put it on HDFS again and move it to the books directory:
```bash
$ hdfs dfs -copyFromLocal user.txt
$ hdfs dfs -mv user.txt books/user.txt
$ hdfs dfs -ls -R -h
```

Copy user.txt to hello.txt, then remove user.txt:
```bash
$ hdfs dfs -cp books/user.txt books/hello.txt
$ hdfs dfs -count -h /user/$USER
$ hdfs dfs -rm books/user.txt
```

# Run a MapReduce example job

```bash
$ mkdir input
$ cp /usr/local/opt/hadoop/libexec/etc/hadoop/*.xml input/
$ hdfs dfs -put input
$ hadoop jar \
  /usr/local/opt/hadoop/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
$ hdfs dfs -ls output
```
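The `grep` example job counts how often each string matching the regular expression occurs in the input files. As a rough local equivalent, the sketch below does the same counting in plain Python on a few sample lines; `grep_count` and the `sample` lines are hypothetical, and the exact ordering and formatting of Hadoop's own output are not reproduced.

```python
import re
from collections import Counter


def grep_count(lines, pattern):
    """Count every regex match found in lines, most frequent first."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        # Map step: find the matches in one line.
        # Reduce step: add them to the running totals.
        counts.update(regex.findall(line))
    return counts.most_common()


# Illustrative input, similar to what the *.xml configuration files contain.
sample = [
    "<name>dfs.replication</name>",
    "<name>fs.defaultFS</name>",
    "<name>dfs.namenode.name.dir</name>",
    "<!-- dfs.replication is set to 1 -->",
]
for match, count in grep_count(sample, r"dfs[a-z.]+"):
    print(count, match)
```

Note that `fs.defaultFS` produces no match, because the pattern requires the literal prefix `dfs`.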


# YARN

Start the ResourceManager daemon and the NodeManager daemon:
```bash
$ start-yarn.sh
```
You can browse two more web interfaces:
   - ResourceManager: http://localhost:8088/
   - NodeManager (node-specific info): http://localhost:8042/
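A quick way to see whether the daemons are up is to test whether their web ports accept connections. The sketch below uses only the standard library and the port numbers quoted in this document (50070, 8088 and 8042). The helper `port_is_open` and the labels in `WEB_UIS` are hypothetical, and an open port only shows that something is listening, not that it is Hadoop.

```python
import socket

# Ports taken from the URLs in this document; the labels are illustrative.
WEB_UIS = {
    "HDFS web UI": 50070,
    "YARN web UI": 8088,
    "node info": 8042,
}


def port_is_open(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, port in WEB_UIS.items():
    state = "up" if port_is_open(port) else "not reachable"
    print(f"{name:12s} http://localhost:{port}/  {state}")
```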

# WordCount Example ([hadoop wiki](https://wiki.apache.org/hadoop/WordCount))

Download three ebooks from Project Gutenberg as input data.
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci — Complete by da Vinci Leonardo
- Ulysses by James Joyce

```bash
$ mkdir -p books
$ wget -q -O books/20417.txt  http://www.gutenberg.org/ebooks/20417.txt.utf-8
$ wget -q -O books/5000-8.txt http://www.gutenberg.org/files/5000/5000-8.txt
$ wget -q -O books/4300-0.txt http://www.gutenberg.org/files/4300/4300-0.txt
```


Check that the three files were downloaded with `ls books`; the directory should contain `20417.txt`, `4300-0.txt` and `5000-8.txt`.
