Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

# References
- <a id='hadoop'></a>[Official Apache Hadoop Website](http://hadoop.apache.org/) 
- [Hadoop Wiki](https://wiki.apache.org/hadoop/FrontPage)
- [Outils pour le Big Data - Pierre Nerzic 🇫🇷](https://perso.univ-rennes1.fr/pierre.nerzic/Hadoop/)
- [wikistat - Ateliers Big Data - Philippe Besse 🇫🇷](https://github.com/wikistat/Ateliers-Big-Data)
- [Big data - Wikipedia](https://en.wikipedia.org/wiki/Big_data)
- <a id='hedlung'></a>[Understanding Hadoop Clusters and the Network - Brad Hedlund](http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/)
- [Running Hadoop on Ubuntu Linux (Single-Node Cluster)](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/)

# Big Data

- Data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. 
- Data analysis requires massively parallel software running on several servers.
- **Volume, Variety, Velocity, Variability and Veracity** describe Big Data properties.

# Apache Hadoop

- Framework for running applications on large clusters.
- The Hadoop framework transparently provides applications with both reliability and data motion.
- Hadoop implements the computational paradigm named **Map/Reduce**, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. 
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both MapReduce and the **Hadoop Distributed File System** are designed so that node failures are automatically handled by the framework.
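
The Map/Reduce idea described above can be sketched in a few lines of plain Python. This is only an illustration that runs on a single machine, not Hadoop's API: the function names (`map_fragment`, `reduce_counts`, `split_into_fragments`) and the three sample lines are hypothetical.

```python
from collections import Counter
from functools import reduce

def split_into_fragments(lines, size):
    """Cut the input into fragments of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_fragment(lines):
    """Map step: count the words of one fragment, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

# Hypothetical sample input
text = [
    "big data needs big clusters",
    "hadoop runs on clusters",
    "hdfs stores big data",
]

fragments = split_into_fragments(text, size=1)
partials = [map_fragment(f) for f in fragments]   # each call could run on any node
total = reduce(reduce_counts, partials, Counter())

# Re-executing a fragment (for example after a node failure) gives the same partial result
assert map_fragment(fragments[0]) == partials[0]

print(total.most_common(3))
```

Because each fragment is processed without looking at the others, the framework is free to schedule, or reschedule, any fragment on any node.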

# HDFS
* HDFS is a distributed file system.
* HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
* HDFS is suitable for applications that have large data sets. 
* HDFS provides interfaces for applications to move themselves closer to where the data is located: a computation is much more efficient when it runs near the data it operates on, especially when the data set is huge.
* HDFS consists of a single NameNode with a number of DataNodes which manage storage. 
* HDFS exposes a file system namespace and allows user data to be stored in files. 
    1. A file is split into one or more blocks, which are stored in a set of DataNodes; the NameNode determines the mapping of blocks to DataNodes.
    2. The **NameNode** executes operations like opening, closing, and renaming files and directories.
    3. The **Secondary NameNode** periodically merges the edit log of the **NameNode** into the file system image (a checkpoint); it is not a standby NameNode.
    4. The **DataNodes** perform block creation, deletion, and replication upon instruction from the NameNode.
    5. The placement of replicas is optimized for data reliability, availability, and network bandwidth utilization.
    6. User data never flows through the NameNode.
* Files in HDFS are write-once and have strictly one writer at any time.
* The DataNode has no knowledge about HDFS files. 
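
To make the block and replica arithmetic concrete, here is a minimal sketch. It assumes a 128 MiB block size and a replication factor of 3 (the usual defaults of `dfs.blocksize` and `dfs.replication`; your cluster may differ). The round-robin placement and the DataNode names are hypothetical and much simpler than the rack-aware policy HDFS actually uses.

```python
BLOCK_SIZE = 128 * 1024 ** 2   # assumption: 128 MiB blocks
REPLICATION = 3                # assumption: 3 replicas per block

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks of a file; only the last one may be smaller."""
    n_full, rest = divmod(file_size, block_size)
    return [block_size] * n_full + ([rest] if rest else [])

def place_replicas(n_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct DataNodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(n_blocks)
    }

file_size = 300 * 1024 ** 2                      # a hypothetical 300 MiB file
blocks = split_into_blocks(file_size)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_storage = sum(blocks) * REPLICATION          # bytes used across the cluster

for block, nodes in placement.items():
    print(f"block {block}: {blocks[block] // 1024 ** 2} MiB on {nodes}")
print(f"raw storage: {raw_storage // 1024 ** 2} MiB")
```

With these assumptions the file occupies two full blocks plus one partial block, and losing any single DataNode still leaves at least two copies of every block.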

# Accessibility dfs

All [HDFS commands](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html) are invoked by the `bin/hdfs` script:
```shell
hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
```
## Manage files and directories
```shell
hdfs dfs -ls -h -R                 # Recursively list subdirectories with human-readable file sizes
hdfs dfs -cp <src> <dst>           # Copy files from source to destination
hdfs dfs -mv <src> <dst>           # Move files from source to destination
hdfs dfs -mkdir /foodir            # Create a directory named /foodir
hdfs dfs -rm -r /foodir            # Remove a directory named /foodir (-rmr is deprecated)
hdfs dfs -cat /foodir/myfile.txt   # View the contents of a file named /foodir/myfile.txt
```
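
From a notebook it can be convenient to call these commands from Python instead of typing them in a terminal. The helper below is a minimal sketch, not part of Hadoop: `hdfs_dfs_argv` and `hdfs_dfs` are hypothetical names, and `hdfs_dfs` assumes the `hdfs` executable is on the `PATH` and a cluster is reachable.

```python
import shlex
import subprocess

def hdfs_dfs_argv(command, *args):
    """Build the argument list for `hdfs dfs -<command> <args...>` without running it."""
    return ["hdfs", "dfs", f"-{command}", *map(str, args)]

def hdfs_dfs(command, *args):
    """Run the command and return its standard output; raise if it exits with an error."""
    result = subprocess.run(
        hdfs_dfs_argv(command, *args),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Only build and display the command line here; nothing is executed
argv = hdfs_dfs_argv("ls", "-h", "-R", "/foodir")
print(shlex.join(argv))
```

On the cluster, `print(hdfs_dfs("ls", "-h", "-R", "/foodir"))` would then run the listing. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.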

# Transfer between nodes

## put
```shell
hdfs dfs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> ... ] <dst>
```
Copies a single source, or multiple sources, from the local file system to the destination file system. When the source is `-`, the input is read from standard input.

Options:

    -p : Preserves access and modification times, ownership, and permissions.
    -f : Overwrites the destination if it already exists.

```shell
hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
```
Related commands:
- `moveFromLocal` : like `put`, but deletes the local source after the copy.
- `copyFromLocal` : like `put`, but the source is restricted to a local file reference.
- `copyToLocal` : like `get`, but the destination is restricted to a local file reference.

![hdfs blocks](http://saphanatutorial.com/wp-content/uploads/2014/06/Hadoop-Course-4.jpg)

The NameNode is not in the data path: it only provides the map of where data is and where data should go in the cluster (file system metadata).
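
The following toy model illustrates that division of roles. It is a sketch under simplifying assumptions, not how HDFS is implemented: the classes `ToyNameNode` and `ToyDataNode`, the functions `put` and `cat`, the 4-byte block size, and the absence of replication are all hypothetical choices made to keep the example short.

```python
class ToyNameNode:
    """Holds metadata only: file name -> list of (block id, DataNode name)."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.block_map = {}
        self.next_block_id = 0

    def allocate(self, filename, n_blocks):
        """Decide where each new block should go (round-robin, no replication)."""
        locations = []
        for _ in range(n_blocks):
            node = self.datanode_names[self.next_block_id % len(self.datanode_names)]
            locations.append((self.next_block_id, node))
            self.next_block_id += 1
        self.block_map[filename] = locations
        return locations

    def locate(self, filename):
        """Tell a client where the blocks of a file are."""
        return self.block_map[filename]


class ToyDataNode:
    """Stores blocks by id; it does not know which file a block belongs to."""

    def __init__(self):
        self.blocks = {}


def put(namenode, datanodes, filename, data, block_size):
    """Client write path: ask the NameNode for locations, send bytes to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    locations = namenode.allocate(filename, len(chunks))
    for (block_id, node), chunk in zip(locations, chunks):
        datanodes[node].blocks[block_id] = chunk   # bytes go client -> DataNode


def cat(namenode, datanodes, filename):
    """Client read path: look up the block map, then fetch each block."""
    return b"".join(
        datanodes[node].blocks[block_id]
        for block_id, node in namenode.locate(filename)
    )


datanodes = {"dn1": ToyDataNode(), "dn2": ToyDataNode()}
namenode = ToyNameNode(datanodes)
put(namenode, datanodes, "user.txt", b"Pierre Navaro\n", block_size=4)
content = cat(namenode, datanodes, "user.txt")
print(namenode.locate("user.txt"))
print(content)
```

The NameNode object never receives `data`: it only hands out and remembers block locations, while the bytes are exchanged between the client functions and the DataNode objects.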

- Log on to the cluster and type the following commands: 
```bash
hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -ls -R -h /var
hdfs dfs -mkdir books
```
- Create a file `user.txt` containing your name:

In [13]:
!echo "Pierre Navaro" > user.txt
%cat user.txt

Pierre Navaro


Copy it to HDFS:
```bash
hdfs dfs -put user.txt
```
Check with:
```bash
hdfs dfs -ls -R 
hdfs dfs -cat user.txt 
hdfs dfs -tail user.txt
```
Remove the file:
```bash
hdfs dfs -rm user.txt
```
Put it again on HDFS:
```bash
hdfs dfs -copyFromLocal user.txt
```
and check!
```bash
hdfs dfs -cp user.txt hello.txt
hdfs dfs -count -h /user/$USER
hdfs dfs -rm user.txt
```


# WordCount Example ([hadoop wiki](https://wiki.apache.org/hadoop/WordCount))

Download three ebooks from Project Gutenberg as input data.
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci — Complete by Leonardo da Vinci
- Ulysses by James Joyce

In [8]:
%mkdir -p books
!wget -q -O books/20417.txt  http://www.gutenberg.org/ebooks/20417.txt.utf-8
!wget -q -O books/5000-8.txt http://www.gutenberg.org/files/5000/5000-8.txt
!wget -q -O books/4300-0.txt http://www.gutenberg.org/files/4300/4300-0.txt


In [9]:
%ls books

20417.txt   4300-0.txt  5000-8.txt
