## HDFS Blocksize

Let us get into the details of block size in HDFS.

* HDFS stands for Hadoop Distributed File System.
* Large files are split into blocks, and those blocks are physically stored on multiple nodes in a distributed fashion.
* Let us review the `hdfs fsck` output of `/public/randomtextwriter/part-m-00000`. The file is approximately 1 GB in size (1,102,230,331 bytes) and you will see 9 blocks.
  * 8 blocks of size 128 MB
  * 1 block of approximately 27 MB (28,488,507 bytes)
* It means a file of a little over 1 GB is stored in 9 blocks. This is due to the default block size, which is 128 MB.
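
The block count above can be reproduced with simple integer arithmetic. The following is a minimal sketch, not part of the original notebook: the helper name `block_layout` is hypothetical, and it assumes the default block size of 134217728 bytes (128 MB).

```python
# Assumed default dfs.blocksize: 128 MB expressed in bytes.
BLOCK_SIZE = 128 * 1024 * 1024  # 134217728


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (total_blocks, full_blocks, last_block_bytes) for a file.

    Hypothetical helper for illustration; HDFS itself does this bookkeeping.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    total_blocks = full_blocks + (1 if remainder else 0)
    if remainder:
        last_block = remainder          # final block is only partially filled
    elif full_blocks:
        last_block = block_size         # size is an exact multiple of the block size
    else:
        last_block = 0                  # empty file, no blocks
    return total_blocks, full_blocks, last_block


# File size taken from the fsck output of /public/randomtextwriter/part-m-00000.
total, full, last = block_layout(1102230331)
print(total, full, last)  # 9 8 28488507
```

The last block holds only the remaining 28,488,507 bytes, which is why it is smaller than the other eight.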

In [1]:
%%sh

hdfs dfs -ls -h /public/randomtextwriter/part-m-00000

-rw-r--r--   2 hdfs supergroup      1.0 G 2021-01-28 11:01 /public/randomtextwriter/part-m-00000


In [2]:
%%sh

hdfs fsck /public/randomtextwriter/part-m-00000 \
    -files \
    -blocks \
    -locations

FSCK started by itv002480 (auth:SIMPLE) from /172.16.1.102 for path /public/randomtextwriter/part-m-00000 at Wed May 25 10:34:13 EDT 2022

/public/randomtextwriter/part-m-00000 1102230331 bytes, replicated: replication=2, 9 block(s):  OK
0. BP-1685381103-172.16.1.103-1609223169030:blk_1073749695_8874 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.106:9866,DS-b1aa8def-bcd8-4514-8697-29c2f7fd008d,DISK], DatanodeInfoWithStorage[172.16.1.105:9866,DS-cd1d8ab0-7d77-4607-98bf-961a7ad81f45,DISK]]
1. BP-1685381103-172.16.1.103-1609223169030:blk_1073749719_8898 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.105:9866,DS-cd1d8ab0-7d77-4607-98bf-961a7ad81f45,DISK], DatanodeInfoWithStorage[172.16.1.107:9866,DS-53639da4-6786-42af-a4a6-5021150dddf3,DISK]]
2. BP-1685381103-172.16.1.103-1609223169030:blk_1073749741_8920 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.105:9866,DS-cd1d8ab0-7d77-4607-98bf-961a7ad81f45,DISK], DatanodeInfoWithStorage[172.16.1.107:9866

Connecting to namenode via http://m01.itversity.com:9870/fsck?ugi=itv002480&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000


* The default block size is 128 MB (134217728 bytes). The default is defined in `hdfs-default.xml` and can be overridden in `hdfs-site.xml`.
* The property name is `dfs.blocksize`.
* If the file size is smaller than the block size (128 MB by default), the file is stored in a single block whose size equals the size of the file.
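
As a rough illustration of how a property such as `dfs.blocksize` could be looked up in a Hadoop configuration file, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `get_hadoop_property` and the `SAMPLE` configuration are assumptions made for this example; the sample is not the content of this cluster's `hdfs-site.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same <configuration>/<property> layout
# that hdfs-site.xml uses. The values are illustrative only.
SAMPLE = """
<configuration>
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
"""


def get_hadoop_property(root, name, default=None):
    """Return the value of the named property, or default if it is not set."""
    for prop in root.iter("property"):
        prop_name = (prop.findtext("name") or "").strip()
        if prop_name == name:
            return (prop.findtext("value") or "").strip()
    return default


root = ET.fromstring(SAMPLE)
print(get_hadoop_property(root, "dfs.blocksize"))                 # 134217728
print(get_hadoop_property(root, "dfs.missing", default="unset"))  # unset
```

For a real file, the root element can be obtained with `ET.parse("/etc/hadoop/conf/hdfs-site.xml").getroot()`. If the property is absent from `hdfs-site.xml`, the built-in default applies, so a missing entry does not mean the block size is undefined.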

In [3]:
%%sh

cat /etc/hadoop/conf/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data1/hadoop/hadoop/dfs/nn/,/data2/hadoop/hadoop/dfs/nn/</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>/data1/ha

* Let us determine the number of blocks for `/data/retail_db/orders/part-00000`. If we store this 2.9 MB file in HDFS, there will be only one block associated with it, as the size of the file is less than the block size.
* It occupies 2.9 MB of storage in HDFS, not 128 MB (assuming a replication factor of 1).

In [4]:
%%sh

ls -lhtr /data/retail_db/orders/part-00000

-rw-r--r-- 1 root root 2.9M Jan 21  2021 /data/retail_db/orders/part-00000


In [5]:
%%sh

hdfs fsck /user/${USER}/retail_db/orders/part-00000 -files -blocks -locations

FSCK started by itv002480 (auth:SIMPLE) from /172.16.1.102 for path /user/itv002480/retail_db/orders/part-00000 at Wed May 25 10:36:58 EDT 2022

/user/itv002480/retail_db/orders/part-00000 2999944 bytes, replicated: replication=3, 1 block(s):  OK
0. BP-1685381103-172.16.1.103-1609223169030:blk_1078718904_4982361 len=2999944 Live_repl=3  [DatanodeInfoWithStorage[172.16.1.105:9866,DS-6cd19d66-af36-4030-9b5a-8c881ae5efc8,DISK], DatanodeInfoWithStorage[172.16.1.106:9866,DS-3cdd1a86-1122-4b3f-9d9d-c9fe36cab433,DISK], DatanodeInfoWithStorage[172.16.1.107:9866,DS-53639da4-6786-42af-a4a6-5021150dddf3,DISK]]


Status: HEALTHY
 Number of data-nodes:	3
 Number of racks:		1
 Total dirs:			0
 Total symlinks:		0

Replicated Blocks:
 Total size:	2999944 B
 Total files:	1
 Total blocks (validated):	1 (avg. block size 2999944 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:

Connecting to namenode via http://m01.itversity.com:9870/fsck?ugi=itv002480&files=1&blocks=1&locations=1&path=%2Fuser%2Fitv002480%2Fretail_db%2Forders%2Fpart-00000
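
The fsck output above reports `replication=3` for this file, so the raw storage consumed across the cluster is three times the file size rather than the 2.9 MB quoted for a replication factor of 1. The sketch below shows the arithmetic; the function name `raw_storage_bytes` is hypothetical, and checksum and metadata overhead are ignored.

```python
BLOCK_SIZE = 134217728  # assumed default dfs.blocksize (128 MB)


def raw_storage_bytes(file_size, replication=1):
    """Approximate bytes consumed across the cluster for one file.

    Assumption: a block occupies only as many bytes as it holds,
    so a small file does not reserve a full 128 MB block.
    """
    return file_size * replication


file_size = 2999944  # bytes, from the fsck output above

print(raw_storage_bytes(file_size, replication=1))  # 2999944
print(raw_storage_bytes(file_size, replication=3))  # 8999832
print(raw_storage_bytes(file_size, 3) < BLOCK_SIZE)  # True
```

Even with three replicas, the file uses roughly 9 MB in total, far less than a single full block.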


* Let us determine the number of blocks for `/data/yelp-dataset-json/yelp_academic_dataset_user.json`. If we store this file of about 2.3 GB (2,485,747,393 bytes) in HDFS, there will be 19 blocks associated with it.
  * 18 blocks of 128 MB each
  * 1 block of approximately 67 MB (69,828,289 bytes)
* It occupies about 2.3 GB of storage in HDFS (assuming a replication factor of 1).

In [7]:
%%sh

ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json

ls: cannot access /data/yelp-dataset-json/yelp_academic_dataset_user.json: No such file or directory


CalledProcessError: Command 'b'\nls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json\n'' returned non-zero exit status 2.

* The local copy is not available in this environment (see the error above), so we can validate by running the `hdfs fsck` command against the same file in HDFS.

In [8]:
%%sh

hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
    -files \
    -blocks \
    -locations

FSCK started by itv002480 (auth:SIMPLE) from /172.16.1.102 for path /public/yelp-dataset-json/yelp_academic_dataset_user.json at Wed May 25 10:39:02 EDT 2022

/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, replicated: replication=2, 19 block(s):  OK
0. BP-1685381103-172.16.1.103-1609223169030:blk_1073747415_6594 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.105:9866,DS-6cd19d66-af36-4030-9b5a-8c881ae5efc8,DISK], DatanodeInfoWithStorage[172.16.1.107:9866,DS-cc8f7dbb-28ed-477a-b831-7b5d9f146f80,DISK]]
1. BP-1685381103-172.16.1.103-1609223169030:blk_1073747483_6662 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.106:9866,DS-b1aa8def-bcd8-4514-8697-29c2f7fd008d,DISK], DatanodeInfoWithStorage[172.16.1.105:9866,DS-6cd19d66-af36-4030-9b5a-8c881ae5efc8,DISK]]
2. BP-1685381103-172.16.1.103-1609223169030:blk_1073747543_6722 len=134217728 Live_repl=2  [DatanodeInfoWithStorage[172.16.1.106:9866,DS-3cdd1a86-1122-4b3f-9d9d-c9fe36cab433,DISK], 

Connecting to namenode via http://m01.itversity.com:9870/fsck?ugi=itv002480&files=1&blocks=1&locations=1&path=%2Fpublic%2Fyelp-dataset-json%2Fyelp_academic_dataset_user.json
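
To check block sizes programmatically, the `len=` field of each block line in the fsck output can be extracted with a regular expression. This is a minimal sketch under stated assumptions: the helper `block_lengths` is hypothetical, and `SAMPLE` is an abbreviated copy of the first two block lines above with the datanode locations replaced by `[...]`.

```python
import re

# Matches lines such as "0. BP-...:blk_1073747415_6594 len=134217728 ..."
# and captures the block length in bytes.
BLOCK_LINE = re.compile(r"^\d+\.\s+\S+\s+len=(\d+)", re.MULTILINE)

# Abbreviated sample: only two of the 19 block lines, locations elided.
SAMPLE = """\
/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, replicated: replication=2, 19 block(s):  OK
0. BP-1685381103-172.16.1.103-1609223169030:blk_1073747415_6594 len=134217728 Live_repl=2  [...]
1. BP-1685381103-172.16.1.103-1609223169030:blk_1073747483_6662 len=134217728 Live_repl=2  [...]
"""


def block_lengths(fsck_text):
    """Return the length in bytes of every block line found in fsck output."""
    return [int(length) for length in BLOCK_LINE.findall(fsck_text)]


lengths = block_lengths(SAMPLE)
print(lengths)       # [134217728, 134217728]
print(len(lengths))  # 2
```

Run against the complete, untruncated output, the list would be expected to contain 19 entries whose sum equals the reported file size of 2485747393 bytes.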
