## Overview of HDFS

As part of this module, we will cover the aspects of HDFS that matter most for application development.
* Using HDFS CLI
* Getting Help or Usage
* Listing HDFS Files
* Managing HDFS Directories
* Copying files from local to HDFS
* Copying files from HDFS to Local
* Copying files from HDFS to HDFS
* Previewing data in HDFS Files
* Getting File Metadata
* HDFS Block Size
* HDFS Replication Factor
* Getting HDFS Storage Usage
* Using HDFS Stat Commands
* HDFS File Permissions
* Overriding Properties

## Using HDFS CLI

Let us understand how to use HDFS CLI to interact with HDFS.
* Typically the cluster contains 3 types of nodes.
  * Gateway nodes or client nodes or edge nodes
  * Master nodes
  * Worker nodes
* Developers like us will typically have access to Gateway nodes or Client nodes.
* We can connect to Gateway nodes or Client nodes using SSH.
* Once logged in, we can interact with HDFS using either `hadoop fs` or `hdfs dfs`. For HDFS, both behave the same way.
* `hadoop` has other subcommands besides `fs`. As developers, we typically use it to interact with HDFS or Map Reduce.
* `hdfs` has other subcommands besides `dfs`. It is typically used not only to manage files in HDFS but also for administrative tasks related to HDFS components such as **Namenode**, **Secondary Namenode**, **Datanode** etc.
* As developers, our scope will be limited to using `hdfs dfs` or `hadoop fs` to interact with HDFS.
* Both have sub commands, and each sub command takes additional control arguments. Let us understand the structure by taking the example of `hdfs dfs -ls -h -S -r /public`.
  * `hdfs` is the main command to manage all the components of HDFS.
  * `dfs` is the sub command to manage files in HDFS.
  * `-ls` is the file system command to list files in HDFS.
  * `-h -S -r` are control arguments for `-ls` to control the run time behavior of the command.
  * `/public` is the argument for the `-ls` command. It is a path in HDFS. You will understand this better as you get into the details.
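
To make the structure concrete, here is a small sketch that splits a command of the same shape into its parts using plain shell string handling. Nothing is sent to HDFS, and the variable names are only illustrative.

```shell
# Split a command of the same shape into its parts (no HDFS access needed).
cmd="hdfs dfs -ls -S -r /public"
set -- $cmd              # word-split the command into positional parameters

main_command=$1          # hdfs
sub_command=$2           # dfs
fs_command=$3            # -ls
shift 3

control_args=""
hdfs_path=""
for arg in "$@"; do
  case "$arg" in
    -*) control_args="$control_args $arg" ;;   # anything starting with - is a control argument
    *)  hdfs_path=$arg ;;                      # the remaining word is the HDFS path
  esac
done
control_args=${control_args# }                 # drop the leading space

echo "main=$main_command sub=$sub_command fs=$fs_command control='$control_args' path=$hdfs_path"
# prints: main=hdfs sub=dfs fs=-ls control='-S -r' path=/public
```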

## Getting help or usage

Let us explore how to list the commands and get the help or usage for a given command.
* Even though we can run commands from almost all the nodes in the cluster, we should only use the Gateway to run HDFS commands.
* First, we need to make sure that the designated Gateway server is configured as a Gateway for the HDFS service so that we can run commands from it. In our case we have designated **gw02.itversity.com** or **gw03.itversity.com** as Gateways.
* Typically the Namenode process will be running on port number 8020. We can also pass the Namenode URI to access HDFS.

* `hadoop fs` or `hdfs dfs` – lists all the commands available
* `hadoop fs -usage` – gives us the basic usage for a given command
* `hadoop fs -help` – gives us detailed information for all the commands, whereas running just `hadoop fs` or `hdfs dfs` prints only the usage summary.
* We can run help on individual commands as well - example: `hadoop fs -help ls` or `hdfs dfs -help ls`

In [None]:
%%sh
# note: %%sh runs the cell in a shell environment
sudo mkdir -p /etc/hadoop
sudo ln -s /opt/hadoop/etc/hadoop /etc/hadoop/conf
head -20 /etc/hadoop/conf/core-site.xml

In [None]:
%%sh
# note: ${USER} expands the USER environment variable, which holds the current user name
hdfs dfs -ls /user/${USER}

In [None]:
%%sh

hdfs dfs -ls hdfs://localhost:9000/user/${USER}

In [None]:
%%sh

hdfs dfs -help
hdfs dfs -usage ls
hdfs dfs -help ls
hdfs dfs -ls /public/retail_db

## Listing HDFS Files

Now let us walk through the different options of the HDFS `ls` command to list files.
* We can get usage by running `hdfs dfs -usage ls`.
* We can get help using `hdfs dfs -help ls`

In [None]:
%%sh

hdfs dfs -usage ls
hdfs dfs -help ls

In [None]:
%%sh
# create the /public directory (no error if it already exists)
hdfs dfs -mkdir -p /public

In [None]:
!hdfs dfs -put /data/nyse_all /public

In [None]:
%%sh
# -r reverses the default sort order (by name)
hdfs dfs -ls -r /public/nyse_all/nyse_data
# -t sorts by modification time, most recent first
hdfs dfs -ls -t /public/nyse_all/nyse_data
# -t -r sorts by modification time, oldest first
hdfs dfs -ls -t -r /public/nyse_all/nyse_data



We can sort the files and directories by size using `-S`. By default, the files will be sorted in descending order by size. We can reverse the sorting order using `-S -r`.

In [None]:
%%sh

hdfs dfs -ls -S /public/nyse_all/nyse_data

In [None]:
%%sh

hdfs dfs -ls -h /public/nyse_all/nyse_data
# -h prints the file sizes in human readable form (K, M, G)

In [None]:
%%sh

hdfs dfs -ls -h -t /public/nyse_all/nyse_data
# -h -t prints human readable sizes, sorted by modification time (most recent first)

## Managing HDFS Directories

Now let us have a look at how to create directories and manage ownership.
* By default, the `hdfs` user is the superuser of HDFS.
* `hadoop fs -mkdir` or `hdfs dfs -mkdir` – to create directories
* `hadoop fs -chown` or `hdfs dfs -chown` – to change ownership of files
* `chown` can also be used to change the group. We can change the group using the `-chgrp` command as well. Make sure to run `-help` on `chgrp` and check the details.
* Here are the steps to create a user space. Only an HDFS superuser can perform them.
  * Create a directory named after the user id (`itversity`) under `/user`.
  * Change the ownership of the directory created earlier (`/user/itversity`) to the user with the same name.
  * Validate the permissions by running `hadoop fs -ls` or `hdfs dfs -ls` on `/user`. Make sure to grep for the user name you are looking for.
* Let's go ahead and create the user space in HDFS for `itversity`. We have to log in as a sudoer and run the commands below.

```shell
sudo -u hdfs hdfs dfs -mkdir /user/itversity
sudo -u hdfs hdfs dfs -chown -R itversity:students /user/itversity
hdfs dfs -ls /user|grep itversity
```

* You should be able to create folders under your home directory.
* You can create the directory structure using `mkdir -p`. The existing folders will be ignored and non existing folders will be created.
  * Let us run `hdfs dfs -mkdir -p /user/${USER}/retail_db/orders/year=2020`.
  * As `/user/${USER}/retail_db` already exists, it will be ignored.
  * Both `/user/${USER}/retail_db/orders` as well as `/user/${USER}/retail_db/orders/year=2020` will be created.

* We can delete a non-empty directory using `hdfs dfs -rm -R` and an empty directory using `hdfs dfs -rmdir`. We will explore `hdfs dfs -rm` in detail later.
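
The directory commands described above can be tried end to end with a sketch like the following. It assumes an HDFS client on the `PATH` and a writable home directory; `hdfs_demo` is a hypothetical scratch folder chosen so that no real data is touched, and `manage_demo_dirs` is only an illustrative helper name.

```shell
# Sketch only: the folder passed in should be a scratch location.
manage_demo_dirs() {
  base=$1
  # -p creates the missing parents and ignores the ones that already exist
  hdfs dfs -mkdir -p "${base}/orders/year=2020" || return 1
  hdfs dfs -ls -R "${base}" || return 1
  # -rmdir only succeeds on an empty directory
  hdfs dfs -rmdir "${base}/orders/year=2020" || return 1
  # -rm -R removes a directory together with its contents
  hdfs dfs -rm -R "${base}"
}

if command -v hdfs >/dev/null 2>&1; then
  manage_demo_dirs "/user/${USER}/hdfs_demo"
else
  echo "hdfs client not found; nothing was run"
fi
```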

In [None]:
%%sh

hdfs dfs -ls /user/`whoami`
# list the contents of our user space (home directory) in HDFS


In [None]:
%%sh

hdfs dfs -mkdir /user/`whoami`/retail_db

## Copying files from local to HDFS

We can copy files from local file system to HDFS either by using `copyFromLocal` or `put` command.
* `hdfs dfs -copyFromLocal` or `hdfs dfs -put` – to copy files or directories from local filesystem into HDFS. We can also use `hadoop fs` in place of `hdfs dfs`.
* However, we cannot update or fix data in place once the files are in HDFS. If we have to fix any data, we have to copy the file to the local file system, fix the data, and then copy it back to HDFS.
* Files will be divided into blocks and will be stored on Datanodes in distributed fashion based on block size and replication factor. We will get into the details later.

![test](https://s3.amazonaws.com/kaizen.itversity.com/hadoop-overview/04HDFSAnatomyOfFileWrite.png)

In [None]:
%%sh
# note: -skipTrash permanently deletes the files instead of moving them to the trash

hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db
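
With the target cleaned up, the copy itself might look like the sketch below. It assumes the local folder `/data/retail_db` exists on the gateway node and that `/user/${USER}` is our HDFS home directory; `copy_to_hdfs` is only an illustrative helper name.

```shell
# Copy a local file or folder into an existing HDFS folder, then list the result.
copy_to_hdfs() {
  src=$1      # local file or folder
  target=$2   # existing HDFS folder
  # -put fails if the target already contains the file; -put -f would overwrite it
  hdfs dfs -put "$src" "$target" || return 1
  hdfs dfs -ls -R "${target}/$(basename "$src")"
}

if command -v hdfs >/dev/null 2>&1; then
  copy_to_hdfs /data/retail_db "/user/${USER}"
else
  echo "hdfs client not found; nothing was run"
fi
```

`hdfs dfs -copyFromLocal` can be used in place of `-put` in the same way.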

## Copying files from HDFS to Local

We can copy files from HDFS to local file system either by using `copyToLocal` or `get` command.
* `hdfs dfs -copyToLocal` or `hdfs dfs -get` – to copy files or directories from HDFS to local filesystem.
* It reads all the blocks in sequence and reconstructs the file in the local file system.
* If the target file or directory already exists in the local file system, `get` will fail saying **already exists**.
* We can also use patterns while using `get` command to get files from HDFS to local file system. Also, we can pass multiple files or folders in HDFS to `get` command.

In [None]:
%%sh

hdfs dfs -help copyToLocal
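
As a sketch of `get` with a pattern, the helper below creates a local folder and copies every matching HDFS path into it. The local folder `/tmp/retail_db_copy` and the helper name `fetch_from_hdfs` are hypothetical, and the example assumes `retail_db` has been copied to our HDFS home directory.

```shell
# Copy one or more HDFS paths (or patterns) into a local folder.
fetch_from_hdfs() {
  local_dir=$1
  shift
  mkdir -p "$local_dir" || return 1
  hdfs dfs -get "$@" "$local_dir"
}

if command -v hdfs >/dev/null 2>&1; then
  # the pattern is quoted so that HDFS expands it, not the local shell
  fetch_from_hdfs /tmp/retail_db_copy "/user/${USER}/retail_db/order*"
  ls -l /tmp/retail_db_copy
else
  echo "hdfs client not found; nothing was run"
fi
```

Running it a second time is expected to fail with **already exists**, because the target files are then present locally.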

## Copying files from HDFS to HDFS

Let us understand how to copy files within HDFS (from one HDFS location to another HDFS location).

* We can use the `hdfs dfs -cp` command to copy files within HDFS.
* One needs to have at least read permission on the source folders or files and write permission on the target folder for the `cp` command to work as expected.
* We can also use patterns while using the `cp` command to copy files within HDFS. Also, we can pass multiple files or folders in HDFS to the `cp` command.
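
A minimal sketch of copying multiple folders within HDFS is shown below. The target folder `retail_db_backup` and the helper name `copy_within_hdfs` are hypothetical; the source folders are assumed to exist under our HDFS home directory.

```shell
# Copy one or more HDFS paths into an HDFS target folder.
copy_within_hdfs() {
  target=$1
  shift
  hdfs dfs -mkdir -p "$target" || return 1
  hdfs dfs -cp "$@" "$target"
}

if command -v hdfs >/dev/null 2>&1; then
  copy_within_hdfs "/user/${USER}/retail_db_backup" \
    "/user/${USER}/retail_db/orders" \
    "/user/${USER}/retail_db/order_items"
  hdfs dfs -ls "/user/${USER}/retail_db_backup"
else
  echo "hdfs client not found; nothing was run"
fi
```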

## Previewing data in HDFS Files

Let us see how we can preview the data in HDFS.
* If we are dealing with files that contain text data (files of text file format), we can preview the contents of the files using commands such as `-tail`, `-cat` etc.
* `-tail` can be used to preview the last 1 KB of the file.
* `-cat` can be used to print the whole contents of the file on the screen. Be careful while using `-cat`, as it will take a while even for medium sized files.
* If you want to see only the first few lines of a file, you can pipe the output of `hadoop fs -cat` or `hdfs dfs -cat` to the Linux `more` command.
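
The preview commands can be combined as in the sketch below. It uses the Linux `head` command instead of `more` so that it also works in a non-interactive cell; `preview_file` is only an illustrative helper name, and the file path assumes `retail_db` has been copied to our HDFS home directory.

```shell
# Show the first few lines and the last 1 KB of a text file in HDFS.
preview_file() {
  file=$1
  lines=${2:-5}                              # number of leading lines, 5 by default
  hdfs dfs -cat "$file" | head -n "$lines"   # head stops reading after the requested lines
  hdfs dfs -tail "$file"                     # last 1 KB of the file
}

if command -v hdfs >/dev/null 2>&1; then
  preview_file "/user/${USER}/retail_db/orders/part-00000" 5
else
  echo "hdfs client not found; nothing was run"
fi
```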

## Getting File Metadata

Let us see how to get metadata for the files stored in HDFS using the `hdfs fsck` command.
* We have files copied under the HDFS location `/user/${USER}/retail_db`. We also have some sample large files copied under the HDFS location `/public/randomtextwriter`. We can run the `hdfs fsck` command on both.
* We will first see how to get the metadata of these files and then interpret it in subsequent topics.
* HDFS stands for Hadoop Distributed File System. It means files are stored in a distributed fashion.
* Our cluster has master nodes and worker nodes. The files are physically stored on the worker nodes, where the Datanode process is running. We will cover this as part of the HDFS architecture.
* Here are the details about the worker nodes along with the corresponding private IPs.

|Private ip|Full DNS|Short DNS|
|---|---|---|
|172.16.1.102|wn01.itversity.com|wn01|
|172.16.1.103|wn02.itversity.com|wn02|
|172.16.1.104|wn03.itversity.com|wn03|
|172.16.1.107|wn04.itversity.com|wn04|
|172.16.1.108|wn05.itversity.com|wn05|

In [None]:
%%sh

# hdfs fsck -help
# hdfs fsck (file system check) reports how files are stored across the cluster

# list the files under the retail_db folder
hdfs fsck /user/${USER}/retail_db -files
# also list the blocks of each file
hdfs fsck /user/${USER}/retail_db -files -blocks
# also show the Datanodes on which each block is stored
hdfs fsck /user/${USER}/retail_db -files -blocks -locations


In [None]:
%%sh
# show the size of the part file in human readable form
hdfs dfs -ls -h /public/randomtextwriter/part-m-00000


In [None]:
%%sh
hdfs fsck /public/randomtextwriter/part-m-00000 -files -blocks -locations
# show the files, blocks and block locations of the part file

## HDFS Blocksize

Let us get into details related to blocksize in HDFS.
* HDFS stands for Hadoop Distributed File System.
* It means the large files will be physically stored on multiple nodes in distributed fashion.
* Let us review the `hdfs fsck` output of `/public/randomtextwriter/part-m-00000`. The file is approximately 1 GB in size and you will see 9 blocks.
  * 8 blocks of size 128 MB
  * 1 block of size 28 MB approximately
* It means a file of size 1 GB 28 MB is stored in 9 blocks. It is due to the default block size, which is 128 MB.

* The default block size is 128 MB and it is set as part of hdfs-site.xml.
* The property name is `dfs.blocksize`.
* If the file size is smaller than the default blocksize (128 MB), then there will be only one block, whose size is the size of the file.
* Let us determine the number of blocks for `/data/retail_db/orders/part-00000`. If we store this file of size 2.9 MB in HDFS, there will be one block associated with it, as the size of the file is less than the block size.
* It occupies 2.9 MB of storage in HDFS (assuming a replication factor of 1).
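
The block counts above follow from dividing the file size by the block size and rounding up. The sketch below reproduces the arithmetic in plain shell; the helper names are illustrative, and a 3 MB size stands in for the 2.9 MB orders file.

```shell
block_bytes=$((128 * 1024 * 1024))   # default dfs.blocksize: 128 MB

# number of blocks = file size / block size, rounded up
block_count() {
  echo $(( ($1 + $block_bytes - 1) / $block_bytes ))
}

# size of the last block in MB (rounded down)
last_block_mb() {
  remainder=$(( $1 % $block_bytes ))
  if [ "$remainder" -eq 0 ]; then remainder=$block_bytes; fi
  echo $(( $remainder / 1024 / 1024 ))
}

large_file=$((1052 * 1024 * 1024))   # 1 GB 28 MB
small_file=$((3 * 1024 * 1024))      # stands in for the 2.9 MB file

echo "large file: $(block_count "$large_file") blocks, last block $(last_block_mb "$large_file") MB"
echo "small file: $(block_count "$small_file") block"
# prints:
# large file: 9 blocks, last block 28 MB
# small file: 1 block
```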

In [None]:
%%sh

hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
    -files \
    -blocks \
    -locations

# reports the file, its blocks and the locations of each block

## HDFS Replication Factor

Let us get an overview of replication factor - another important building block of HDFS.
* While blocksize drives distribution of large files, replication factor drives reliability of the files.
* If we only have one copy of each block for a given file and if the node goes down, then the data in the files is not readable.
* HDFS replication mitigates this by maintaining multiple copies of each block.
* Keep in mind that the default replication factor is **3** unless we override it.


* As part of our lab cluster we maintain 2 copies of each block.
* In production implementations, typically we have 3 copies with rack awareness enabled.
* The default replication factor is 3 and it is set as part of hdfs-site.xml. In our case we have overridden to save the storage.
* The property name is `dfs.replication`.
* With a replication factor of n, the data remains available even if n - 1 of the nodes holding copies of a block go down.
* If the replication factor is 3, the data remains readable even if 2 of the nodes holding a given block go down.
* Replication protects against hardware failures of individual hosts.
* In Production, we typically configure Rack Awareness which will get us much better reliability.
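
The trade-off between storage and reliability can be worked out with simple arithmetic. The sketch below uses the 1 GB 28 MB (1052 MB) file from the blocksize discussion as the example; the helper names are illustrative.

```shell
# raw storage used = file size x replication factor
raw_storage_mb() {
  echo $(( $1 * $2 ))      # $1: file size in MB, $2: replication factor
}

# a block stays readable as long as one copy survives
tolerated_failures() {
  echo $(( $1 - 1 ))       # $1: replication factor
}

for replication in 2 3; do
  echo "replication=${replication}: 1052 MB file uses $(raw_storage_mb 1052 "$replication") MB," \
       "each block survives $(tolerated_failures "$replication") failed node(s)"
done
# prints:
# replication=2: 1052 MB file uses 2104 MB, each block survives 1 failed node(s)
# replication=3: 1052 MB file uses 3156 MB, each block survives 2 failed node(s)
```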

In [None]:
%%sh

grep -B 1 -A 3 replication /etc/hadoop/conf/hdfs-site.xml
# show the dfs.replication property configured in hdfs-site.xml


In [None]:
%%sh

hdfs dfs -stat %r /user/${USER}/retail_db/orders/part-00000
# %r prints the replication factor of the file

## Getting HDFS Storage Usage

Let us get an overview of HDFS usage using `du` and `df` commands.

* We can use `hdfs dfs -df` to get the current capacity and usage of HDFS.
* We can use `hdfs dfs -du` to get the size occupied by a file or folder.

In [None]:
%%sh
hdfs dfs -df
# reports the capacity, used space and available space of the file system

In [None]:
%%sh

hdfs dfs -du -s -h /user/${USER}/retail_db
# -s gives a summary for the folder, -h prints the sizes in human readable form

## Using HDFS Stat Commands

Let us understand how to get details about HDFS files such as replication factor, block size etc.

* `hdfs dfs -stat` can be used to get the statistics related to file or directory.

In [None]:
%%sh

hdfs dfs -stat /user/${USER}/retail_db/orders
# without a format string, -stat prints the modification time of the path
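
`-stat` also accepts a format string, so several attributes can be printed at once. The sketch below uses the commonly available specifiers `%n` (name), `%F` (type), `%b` (size in bytes), `%o` (block size), `%r` (replication factor), `%u` (owner), `%g` (group) and `%y` (modification time); `describe_hdfs_path` is only an illustrative helper name.

```shell
# Print name, type, size, block size, replication, owner, group and modification time.
describe_hdfs_path() {
  hdfs dfs -stat "name=%n type=%F size=%b blocksize=%o replication=%r owner=%u group=%g modified=%y" "$1"
}

if command -v hdfs >/dev/null 2>&1; then
  describe_hdfs_path "/user/${USER}/retail_db/orders/part-00000"
else
  echo "hdfs client not found; nothing was run"
fi
```

Running `hdfs dfs -help stat` lists the specifiers supported by the installed Hadoop version.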

## HDFS File Permissions

Let us go through file permissions in HDFS.

* As we create the files, we can check the permissions on them using `-ls` command.
* Typically the owner of the user space will have **rwx**, while members of the group specified as well as others have **r-x**.
* **rwx** stands for read, write and execute while **r-x** stands for only read and execute permissions.
* We can change the permissions using `hadoop fs -chmod` or `hdfs dfs -chmod`. However, users can only change the permissions of their own files.
* We can specify the permissions in symbolic mode (e.g.: `+x` to grant execute access to owner, group as well as others) or in octal mode (e.g.: 755 to grant rwx to the owner and r-x to the group and others).

If you are not familiar with the Linux command `chmod`, we highly recommend spending some time to understand it in detail, as it is very important with respect to file permissions.

* Adding write permissions only to owner. Now the owner will be able to delete the file, but others cannot.

In [None]:
%%sh

# remove write permission for owner, group and others, recursively
hdfs dfs -chmod -R -w /user/${USER}/retail_db/order_items
# octal mode 757: rwx for owner, r-x for group, rwx for others
hdfs dfs -chmod -R 757 /user/${USER}/retail_db/orders
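
To see how an octal mode maps to the **rwx** notation printed by `-ls`, here is a small sketch in plain shell. It needs no HDFS access, and the helper names are illustrative.

```shell
# Convert one octal digit (0-7) into its rwx triplet.
digit_to_rwx() {
  d=$1
  out=""
  if [ $(( $d & 4 )) -ne 0 ]; then out="${out}r"; else out="${out}-"; fi
  if [ $(( $d & 2 )) -ne 0 ]; then out="${out}w"; else out="${out}-"; fi
  if [ $(( $d & 1 )) -ne 0 ]; then out="${out}x"; else out="${out}-"; fi
  echo "$out"
}

# Convert a three digit octal mode into owner, group and others triplets.
octal_to_rwx() {
  mode=$1
  owner=${mode%??}     # first digit
  rest=${mode#?}       # last two digits
  group=${rest%?}      # second digit
  other=${rest#?}      # third digit
  echo "$(digit_to_rwx "$owner")$(digit_to_rwx "$group")$(digit_to_rwx "$other")"
}

for mode in 755 757 644; do
  echo "$mode -> $(octal_to_rwx "$mode")"
done
# prints:
# 755 -> rwxr-xr-x
# 757 -> rwxr-xrwx
# 644 -> rw-r--r--
```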

## Overriding Properties

Let us understand how we can override the properties while running `hdfs dfs` or `hadoop fs` commands.

* We can change any property which is not defined as final in **core-site.xml** or **hdfs-site.xml**.
* We can change `blocksize` as well as `replication` while copying the files. After copying, the replication factor can still be changed (using `-setrep`), whereas changing the block size requires writing the file again.
* We can either pass individual properties using `-D` or a bunch of properties by passing an xml file similar to **core-site.xml** or **hdfs-site.xml** as part of `-conf`.
* Let's copy a file **/data/crime/csv/rows.csv** with default values. The file is split into 12 blocks with 2 copies each (as our default blocksize is 128 MB and replication factor is 2).
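
For comparison, the same file can be copied with overridden properties. The sketch below assumes the properties are not marked final on the cluster; the target folder `crime_demo`, the 64 MB block size and the helper name `put_with_overrides` are illustrative choices, not values from the lab setup.

```shell
# Copy a local file to HDFS with an explicit block size and replication factor,
# then print the values that were actually applied.
put_with_overrides() {
  blocksize=$1      # block size in bytes
  replication=$2    # number of copies of each block
  src=$3            # local file
  target=$4         # existing HDFS folder
  hdfs dfs -Ddfs.blocksize="$blocksize" -Ddfs.replication="$replication" -put "$src" "$target" || return 1
  hdfs dfs -stat "%n blocksize=%o replication=%r" "${target}/$(basename "$src")"
}

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "/user/${USER}/crime_demo"
  put_with_overrides $((64 * 1024 * 1024)) 3 /data/crime/csv/rows.csv "/user/${USER}/crime_demo"
  # the replication factor can also be changed after the copy; -w waits until it is done
  hdfs dfs -setrep -w 2 "/user/${USER}/crime_demo/rows.csv"
else
  echo "hdfs client not found; nothing was run"
fi
```

Running `hdfs fsck` on the copied file with `-files -blocks -locations` shows how the smaller block size changes the number of blocks.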