## HDFS Replication Factor

Let us get an overview of the replication factor, another important building block of HDFS.
* While blocksize drives distribution of large files, replication factor drives reliability of the blocks.
* If we have only one copy of each block of a file and the node holding a block goes down, the data in that file is no longer readable.
* HDFS replication mitigates this by maintaining multiple copies of each block.
* Keep in mind that the default replication factor is **3** unless we override it.
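
As a minimal sketch of how the default can be overridden for individual files, the commands below set the replication factor at copy time (`-Ddfs.replication`) and change it afterwards (`-setrep`). The `run` wrapper and the `DRY_RUN` variable are hypothetical helpers introduced only for this example, and the paths are assumed from the locations used elsewhere in this notebook. By default the commands are only printed; set `DRY_RUN=0` on a machine with a configured HDFS client to execute them.

```shell
# Hypothetical helper: print each command by default.
# Set DRY_RUN=0 to actually execute the commands against a cluster.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumed paths, reusing the locations referenced in this notebook.
SRC="/data/retail_db/orders/part-00000"
DEST="/user/${USER:-$(id -un)}/retail_db/orders"

# Copy the file with a replication factor of 1 for this command only.
run hdfs dfs -Ddfs.replication=1 -put -f "${SRC}" "${DEST}"

# Change the replication factor of the existing file to 2
# and wait (-w) until the additional replica has been created.
run hdfs dfs -setrep -w 2 "${DEST}/part-00000"
```

Setting the factor per command leaves the cluster-wide default untouched, which is usually safer than editing the configuration for a one-off file.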

In [None]:
%%sh

hdfs dfs -ls -h /public/retail_db/orders

In [None]:
%%sh

hdfs fsck /public/retail_db/orders/part-00000 \
    -files \
    -blocks \
    -locations

* As part of our lab cluster we maintain 2 copies of each block.
* In production implementations, we typically have 3 copies with rack awareness enabled.
* The default replication factor is 3 and it is set as part of `hdfs-site.xml`. In our case we have overridden it to save storage.
* The property name is `dfs.replication`.
* If a file is smaller than the default block size (128 MB), it is stored as a single block that occupies only as much space as the file itself.
* With a replication factor of n, the data remains available even if n - 1 of the nodes holding copies of a block go down.
* If the replication factor is 3, the data remains readable even if 2 of those nodes go down.
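
The arithmetic behind the last three points can be sketched as follows. The function names `block_count` and `tolerated_failures` are hypothetical helpers written for this example, and the file sizes are illustrative rather than taken from the cluster.

```shell
# Number of blocks = ceiling(file size / block size), using integer arithmetic.
block_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# With replication factor n, a block stays readable after losing
# up to n - 1 of the nodes that hold its replicas.
tolerated_failures() {
    echo $(( $1 - 1 ))
}

BLOCK_SIZE=$((128 * 1024 * 1024))   # default block size of 128 MB in bytes

block_count $((3 * 1024 * 1024)) "${BLOCK_SIZE}"     # 3 MB file   -> 1
block_count $((300 * 1024 * 1024)) "${BLOCK_SIZE}"   # 300 MB file -> 3
tolerated_failures 3                                 # -> 2
```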

In [None]:
%%sh

grep -B 1 -A 3 replication /etc/hadoop/conf/hdfs-site.xml

* Let us determine the overall size occupied by `/data/retail_db/orders/part-00000` when it is copied to HDFS.
* It occupies 5.8 MB of storage in HDFS, which is twice the size of the file, as our replication factor is 2.

In [None]:
%%sh

ls -lhtr /data/retail_db/orders/part-00000

In [None]:
%%sh

hdfs dfs -help stat

In [None]:
%%sh

hdfs dfs -stat %r /user/${USER}/retail_db/orders/part-00000

In [None]:
%%sh

hdfs dfs -stat %o /user/${USER}/retail_db/orders/part-00000

In [None]:
%%sh

hdfs dfs -stat %b /user/${USER}/retail_db/orders/part-00000

In [None]:
%%sh

hdfs dfs -ls -h /user/${USER}/retail_db/orders/part-00000
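
The size (`%b`) and replication factor (`%r`) reported by `stat` can be combined to estimate the total storage a file consumes across all replicas. The sketch below assumes both values are requested in one format string; `raw_usage` is a hypothetical helper, and the sample input is illustrative rather than actual cluster output.

```shell
# Read "<size in bytes> <replication factor>" from stdin and
# print the total number of bytes stored across all replicas.
raw_usage() {
    read -r size replication
    echo $(( size * replication ))
}

# Illustrative input only: a 3000000-byte file kept in 2 copies.
echo "3000000 2" | raw_usage    # 6000000

# On the cluster, the same function can consume the output of stat:
# hdfs dfs -stat "%b %r" /user/${USER}/retail_db/orders/part-00000 | raw_usage
```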

* Let us look at a larger file, `/data/yelp-dataset-json/yelp_academic_dataset_user.json`. It occupies 4.8 GB of storage in HDFS, as our replication factor is 2.

In [None]:
%%sh

ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json

* We can validate the properties of the file using the `stat` command, and inspect its blocks and their locations using `fsck`. The file is available in HDFS under `/public/yelp-dataset-json/yelp_academic_dataset_user.json`.

In [None]:
%%sh

hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
    -files \
    -blocks \
    -locations