## HDFS Replication Factor

Let us get an overview of the replication factor, another important building block of HDFS.

* While block size drives the distribution of large files, the replication factor drives the reliability of the files.
* If we have only one copy of each block of a file and the node holding a block goes down, the data in that file is not readable.
* HDFS replication mitigates this by maintaining multiple copies of each block on different nodes.
* Keep in mind that the default replication factor is **3** unless we override it.

In [1]:
%%sh

hdfs dfs -ls -h /public/retail_db/orders

Found 1 items
-rw-r--r--   2 hdfs supergroup      2.9 M 2021-01-28 09:27 /public/retail_db/orders/part-00000


In [2]:
%%sh

hdfs fsck /public/retail_db/orders/part-00000 \
    -files \
    -blocks \
    -locations

FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/retail_db/orders/part-00000 at Wed Jan 27 17:16:31 EST 2021
/public/retail_db/orders/part-00000 2999944 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1110719773_36998835 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]

Status: HEALTHY
 Total size:	2999944 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 2999944 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Wed Jan 27 17:16:31 EST 2021 in 0 milliseconds


The filesys

Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Fretail_db%2Forders%2Fpart-00000
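
The block line in the `fsck` output carries both the replica count and the DataNodes that hold the replicas. As a minimal sketch, assuming the line format shown above (newer Hadoop releases print `Live_repl=` instead of `repl=`, which the pattern also accepts), the two can be pulled out with `sed` and `grep`. The variable names are illustrative only.

```shell
# Block line copied from the fsck output above.
line='0. BP-292116404-172.16.1.101-1479167821718:blk_1110719773_36998835 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]'

# Replica count reported for the block (matches " repl=" or "Live_repl=").
repl=$(printf '%s\n' "$line" | sed -n 's/.*[ _]repl=\([0-9][0-9]*\).*/\1/p')

# host:port of every DataNode holding a replica, one per line.
nodes=$(printf '%s\n' "$line" | grep -o 'DatanodeInfoWithStorage\[[0-9.]*:[0-9]*' | sed 's/.*\[//')

echo "replicas: $repl"
echo "$nodes"
```

The number of addresses listed should match the replica count, which is a quick way to confirm that each copy sits on a different node.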


* In our lab cluster, we maintain 2 copies of each block.
* In production implementations, we typically have 3 copies with rack awareness enabled.
* The default replication factor is 3, and it can be overridden in hdfs-site.xml. In our case we have overridden it to 2 to save storage.
* The property name is `dfs.replication`.
* If a file is smaller than the default block size (128 MB), it is stored as a single block whose length equals the size of the file.
* With a replication factor of n, a block stays readable even if n - 1 of the nodes holding its replicas go down.
* If the replication factor is 3, the data stays available even if 2 of the nodes holding a block go down.
* Replication protects against hardware failures of individual hosts, such as a failed disk or node.
* In production, we typically configure rack awareness, which spreads replicas across racks and gives much better reliability.
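
The n - 1 rule above can be illustrated with a small simulation. This is only a sketch of the idea, not how the NameNode tracks replicas: `block_readable` is a hypothetical helper, and the two addresses are the replica locations reported by `fsck` for `part-00000`.

```shell
# Hypothetical helper: succeeds if at least one replica is on a node that has not failed.
block_readable() {
    replicas=$1   # space-separated DataNode addresses holding the block
    failed=$2     # space-separated DataNode addresses that are down
    for node in $replicas; do
        case " $failed " in
            *" $node "*) ;;      # this replica is on a failed node, keep looking
            *) return 0 ;;       # found a live replica
        esac
    done
    return 1                     # every replica is on a failed node
}

replicas="172.16.1.102 172.16.1.107"   # replication factor 2

block_readable "$replicas" "172.16.1.102" && echo "one node down: readable"
block_readable "$replicas" "172.16.1.102 172.16.1.107" || echo "both nodes down: unreadable"
```

With a replication factor of 2, the block survives the loss of either node but not both, which is the n - 1 behaviour described in the list.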

In [3]:
%%sh

grep -B 1 -A 3 replication /etc/hadoop/conf/hdfs-site.xml

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    
    <property>
      <name>dfs.replication.max</name>
      <value>50</value>
    </property>
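
The value in `hdfs-site.xml` is only the default for newly written files. Replication is stored per file, so it can also be changed for an existing file with `hdfs dfs -setrep`. The sketch below assumes nothing about the cluster: `set_replication` and the `RUN` switch are hypothetical names introduced here, and the wrapper only prints the command unless `RUN=1` is set.

```shell
# Hypothetical wrapper around `hdfs dfs -setrep <replication> <path>`.
# Prints the command by default; set RUN=1 to execute it on a real cluster.
set_replication() {
    rep=$1
    path=$2
    if [ "${RUN:-0}" = "1" ]; then
        hdfs dfs -setrep "$rep" "$path"
    else
        echo "hdfs dfs -setrep $rep $path"
    fi
}

# Example: request 3 copies of one file (path is illustrative).
set_replication 3 "/user/${USER}/retail_db/orders/part-00000"
```

After such a change, `hdfs dfs -stat %r` on the same path should report the new value once the NameNode has scheduled the extra copies.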
    


* Let us determine the overall storage occupied by the local file `/data/retail_db/orders/part-00000` when it is copied to HDFS.
* With a replication factor of 2, it occupies twice its size in HDFS: 2 x 2999944 bytes = 5999888 bytes (roughly 5.7 MB).
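
The raw storage figure is simply the file size multiplied by the replication factor. A minimal sketch of that arithmetic, using the byte count reported by `fsck` earlier and assuming the lab cluster's replication factor of 2:

```shell
size_bytes=2999944   # file size reported by fsck for part-00000
replication=2        # dfs.replication on the lab cluster

# Raw bytes consumed across all replicas.
raw_bytes=$(( size_bytes * replication ))

# Same figure in MiB (1 MiB = 1048576 bytes), rounded to two decimals.
raw_mib=$(awk -v b="$raw_bytes" 'BEGIN { printf "%.2f", b / 1048576 }')

echo "raw bytes: $raw_bytes"
echo "raw MiB:   $raw_mib"
```

Note that `ls -lh` rounds sizes up, so doubling its 2.9M figure slightly overstates the result.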

In [2]:
%%sh

ls -lhtr /data/retail_db/orders/part-00000

-rw-r--r-- 1 root root 2.9M Jan 21  2021 /data/retail_db/orders/part-00000


In [3]:
%%sh

hdfs dfs -help stat

-stat [format] <path> ... :
  Print statistics about the file/directory at <path>
  in the specified format. Format accepts permissions in
  octal (%a) and symbolic (%A), filesize in
  bytes (%b), type (%F), group name of owner (%g),
  name (%n), block size (%o), replication (%r), user name
  of owner (%u), access date (%x, %X).
  modification date (%y, %Y).
  %x and %y show UTC date as "yyyy-MM-dd HH:mm:ss" and
  %X and %Y show milliseconds since January 1, 1970 UTC.
  If the format is not specified, %y is used by default.


In [4]:
%%sh

hdfs dfs -stat %r /user/${USER}/retail_db/orders/part-00000

3


In [5]:
%%sh

hdfs dfs -stat %o /user/${USER}/retail_db/orders/part-00000

134217728


In [12]:
%%sh

hdfs dfs -stat %b /user/${USER}/retail_db/orders/part-00000

2999944


* Let's review `yelp_academic_dataset_user.json`. It is about 2.4 GB in size and occupies about 4.8 GB of storage in HDFS, as our replication factor is 2.

In [6]:
%%sh

ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json

ls: cannot access /data/yelp-dataset-json/yelp_academic_dataset_user.json: No such file or directory


CalledProcessError: Command 'b'\nls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json\n'' returned non-zero exit status 2.

* The local copy is not available on this host, but we can validate the properties of the file in HDFS using the `hdfs fsck` and `hdfs dfs -stat` commands. The file is available in HDFS under `/public/yelp-dataset-json/yelp_academic_dataset_user.json`.

In [7]:
%%sh

hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
    -files \
    -blocks \
    -locations

FSCK started by itv002480 (auth:SIMPLE) from /172.16.1.102 for path /public/yelp-dataset-json/yelp_academic_dataset_user.json at Wed May 25 12:07:51 EDT 2022

/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, replicated: replication=2, 19 block(s):  OK
0. BP-1685381103-172.16.1.103-1609223169030:blk_1073747415_6594 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.105:9866,DS-6cd19d66-af36-4030-9b5a-8c881ae5efc8,DISK], DatanodeInfoWithStorage[172.16.1.107:9866,DS-cc8f7dbb-28ed-477a-b831-7b5d9f146f80,DISK]]
1. BP-1685381103-172.16.1.103-1609223169030:blk_1073747483_6662 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.106:9866,DS-b1aa8def-bcd8-4514-8697-29c2f7fd008d,DISK], DatanodeInfoWithStorage[172.16.1.105:9866,DS-6cd19d66-af36-4030-9b5a-8c881ae5efc8,DISK]]
2. BP-1685381103-172.16.1.103-1609223169030:blk_1073747543_6722 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.106:9866,DS-3cdd1a86-1122-4b3f-9d9d-c9fe36cab433,DISK], 

Connecting to namenode via http://m01.itversity.com:9870/fsck?ugi=itv002480&files=1&blocks=1&locations=1&path=%2Fpublic%2Fyelp-dataset-json%2Fyelp_academic_dataset_user.json


In [8]:
%%sh

hdfs dfs -stat %r /public/yelp-dataset-json/yelp_academic_dataset_user.json

2


In [9]:
%%sh

hdfs dfs -stat %o /public/yelp-dataset-json/yelp_academic_dataset_user.json

134217728


In [10]:
%%sh

hdfs dfs -stat %b /public/yelp-dataset-json/yelp_academic_dataset_user.json

2485747393
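
The three `stat` values above are enough to cross-check the block count that `fsck` reported. The following is a sketch of the arithmetic only; the variable names are illustrative.

```shell
size_bytes=2485747393   # from hdfs dfs -stat %b
block_size=134217728    # from hdfs dfs -stat %o (128 MB)
replication=2           # from hdfs dfs -stat %r

# Number of blocks: file size divided by block size, rounded up.
total_blocks=$(( (size_bytes + block_size - 1) / block_size ))

# The last block only holds what is left after the full blocks.
last_block_bytes=$(( size_bytes % block_size ))

# Raw bytes consumed across all replicas.
raw_bytes=$(( size_bytes * replication ))

echo "blocks:          $total_blocks"
echo "last block size: $last_block_bytes"
echo "raw bytes:       $raw_bytes"
```

The result of 19 blocks agrees with the `19 block(s)` line in the `fsck` output: 18 full 128 MB blocks plus one partial block, each stored twice.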
