## Getting File Metadata

Let us see how to get metadata for the files stored in HDFS using the `hdfs fsck` command.

* We have files copied under the HDFS location `/user/${USER}/retail_db`, and some large sample files under the HDFS location `/public/randomtextwriter`. We can inspect both with the `hdfs fsck` command.
* We will first see how to get the metadata of these files and then interpret it in subsequent topics.
* HDFS stands for Hadoop Distributed File System. Files are stored in a distributed fashion across the nodes of the cluster.
* Our cluster has master nodes and worker nodes. The files are physically stored on the worker nodes, where the data node process runs. We will cover this as part of the HDFS architecture.
* Here are the details of the worker nodes along with their private IPs.

|Private IP|Full DNS|Short DNS|
|---|---|---|
|172.16.1.102|wn01.itversity.com|wn01|
|172.16.1.103|wn02.itversity.com|wn02|
|172.16.1.104|wn03.itversity.com|wn03|
|172.16.1.107|wn04.itversity.com|wn04|
|172.16.1.108|wn05.itversity.com|wn05|

In [None]:
%%sh

hdfs fsck -help

* We can get a high-level overview of the `retail_db` folder by running `hdfs fsck /user/${USER}/retail_db`.

In [1]:
%%sh

hdfs fsck /user/${USER}/retail_db

FSCK started by itv002480 (auth:SIMPLE) from /172.16.1.102 for path /user/itv002480/retail_db at Wed May 25 10:16:20 EDT 2022


Status: HEALTHY
 Number of data-nodes:	3
 Number of racks:		1
 Total dirs:			7
 Total symlinks:		0

Replicated Blocks:
 Total size:	9537787 B
 Total files:	6
 Total blocks (validated):	6 (avg. block size 1589631 B)
 Minimally replicated blocks:	6 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Missing blocks:		0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Blocks queued for replication:	0

Erasure Coded Block Groups:
 Total size:	0 B
 Total files:	0
 Total block groups (validated):	0
 Minimally erasure-coded block groups:	0
 Over-erasure-coded block groups:	0
 Under-erasure-coded block groups:	0
 Unsatisfactory placement block groups:	0
 Average block group size:	0.0
 Missing block groups:		0
 Corrupt block groups:		0
 Missi

Connecting to namenode via http://m01.itversity.com:9870/fsck?ugi=itv002480&path=%2Fuser%2Fitv002480%2Fretail_db
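The summary is a list of `Key: value` lines, so it is easy to turn into a dictionary when we want to check values programmatically. The following is a minimal sketch, not part of the original notebook: `parse_fsck_summary` is a hypothetical helper, and `SAMPLE` holds a few lines copied from the output above with the tabs replaced by spaces.

```python
# A few summary lines copied from the fsck output above.
SAMPLE = """\
Status: HEALTHY
 Number of data-nodes: 3
 Total size: 9537787 B
 Total files: 6
 Total blocks (validated): 6 (avg. block size 1589631 B)
 Default replication factor: 3
 Average block replication: 3.0
 Missing blocks: 0
 Corrupt blocks: 0
"""


def parse_fsck_summary(text):
    """Map each 'Key: value' line to its value (hypothetical helper)."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            # Keep the first occurrence, so the "Replicated Blocks" values
            # are not overwritten by the "Erasure Coded Block Groups" ones.
            summary.setdefault(key.strip(), value.strip())
    return summary


summary = parse_fsck_summary(SAMPLE)
print(summary["Status"])
print(summary["Total size"])

# The reported average block size is the total size divided by the block count.
total_bytes = int(summary["Total size"].split()[0])
total_blocks = int(summary["Total blocks (validated)"].split()[0])
print(total_bytes // total_blocks)
```

Section headers such as `Replicated Blocks:` have no value after the colon, so the sketch skips them.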


* We can get details about file names using the `-files` option.

In [2]:
%%sh

hdfs fsck /user/${USER}/retail_db -files

FSCK started by itv002480 (auth:SIMPLE) from /172.16.1.102 for path /user/itv002480/retail_db at Wed May 25 10:16:48 EDT 2022

/user/itv002480/retail_db <dir>
/user/itv002480/retail_db/categories <dir>
/user/itv002480/retail_db/categories/part-00000 1029 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/customers <dir>
/user/itv002480/retail_db/customers/part-00000 953719 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/departments <dir>
/user/itv002480/retail_db/departments/part-00000 60 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/order_items <dir>
/user/itv002480/retail_db/order_items/part-00000 5408880 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/orders <dir>
/user/itv002480/retail_db/orders/part-00000 2999944 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/products <dir>
/user/itv002480/retail_db/products/part-00000 174155 bytes, replicated

Connecting to namenode via http://m01.itversity.com:9870/fsck?ugi=itv002480&files=1&path=%2Fuser%2Fitv002480%2Fretail_db
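Each file line in the `-files` output carries the path, the size in bytes, the number of blocks and a status. As a rough sketch (the helper `parse_files` and the regular expression are our own assumptions, not part of `hdfs`), we can extract these fields with Python. The sample contains only the complete lines from the output above; the truncated `products` line is left out.

```python
import re

# Complete lines copied from the -files output above.
SAMPLE = """\
/user/itv002480/retail_db <dir>
/user/itv002480/retail_db/categories <dir>
/user/itv002480/retail_db/categories/part-00000 1029 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/customers/part-00000 953719 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/departments/part-00000 60 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/order_items/part-00000 5408880 bytes, replicated: replication=3, 1 block(s):  OK
/user/itv002480/retail_db/orders/part-00000 2999944 bytes, replicated: replication=3, 1 block(s):  OK
"""

# path, size in bytes, block count, status. Assumes paths without spaces.
FILE_LINE = re.compile(r"^(\S+) (\d+) bytes, .*?(\d+) block\(s\):\s+(\S+)$")


def parse_files(text):
    """Return (path, size, blocks, status) for every file line (hypothetical helper)."""
    files = []
    for line in text.splitlines():
        match = FILE_LINE.match(line)
        if match:  # directory lines ending in <dir> do not match
            path, size, blocks, status = match.groups()
            files.append((path, int(size), int(blocks), status))
    return files


files = parse_files(SAMPLE)
for path, size, blocks, status in files:
    print(f"{size:>10} {blocks} {status} {path}")
print(sum(size for _, size, _, _ in files))
```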


* Files in HDFS are physically stored on the worker nodes as blocks. We can get details of the blocks associated with each file by adding the `-blocks` option.

In [5]:
%%sh

hdfs fsck /user/${USER}/retail_db -files -blocks

FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:36:09 EST 2021
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2

/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455899_41737436 len=953719 repl=2

/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455900_41737437 len=60 repl=2

/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455901_41737438 len=5408880 repl=2

/user/itversity/retail

Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&path=%2Fuser%2Fitversity%2Fretail_db
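Each block line packs several fields into one token. The sketch below splits one line from the output above into its parts. The field names are our reading of the format and should be treated as assumptions: the `BP-...` prefix is taken to be the block pool ID and the number after the second underscore the generation stamp. `parse_block_line` is a hypothetical helper.

```python
import re

# One block line copied from the -blocks output above.
LINE = "0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2"

# index, block pool, block name, generation stamp, length, replica count.
# "Live_repl=" appears instead of "repl=" in the newer output format.
BLOCK_LINE = re.compile(
    r"^(\d+)\. (BP-[^:]+):(blk_\d+)_(\d+) len=(\d+) (?:Live_)?repl=(\d+)"
)


def parse_block_line(line):
    """Return the fields of a block line as a dict, or None (hypothetical helper)."""
    match = BLOCK_LINE.match(line)
    if not match:
        return None
    index, pool, name, stamp, length, repl = match.groups()
    return {
        "index": int(index),
        "block_pool": pool,
        "block": name,
        "generation_stamp": int(stamp),
        "length": int(length),
        "replicas": int(repl),
    }


block = parse_block_line(LINE)
print(block["block"], block["length"], block["replicas"])
```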


* `-blocks` only provides the names of the blocks. We also need `-locations` to get details about the worker nodes where the blocks are physically stored.
* Each block is stored as a physical file on a worker node. We will understand more about blocks as part of the subsequent topics.
* To find where a block is physically stored, look at the **DatanodeInfoWithStorage** part of the output. It contains the IP address of the data node, and we can look up the corresponding DNS name in the table above.

In [3]:
%%sh

hdfs fsck /user/${USER}/retail_db -files -blocks -locations

FSCK started by itv002480 (auth:SIMPLE) from /172.16.1.102 for path /user/itv002480/retail_db at Wed May 25 10:21:04 EDT 2022

/user/itv002480/retail_db <dir>
/user/itv002480/retail_db/categories <dir>
/user/itv002480/retail_db/categories/part-00000 1029 bytes, replicated: replication=3, 1 block(s):  OK
0. BP-1685381103-172.16.1.103-1609223169030:blk_1078718659_4982116 len=1029 Live_repl=3  [DatanodeInfoWithStorage[172.16.1.105:9866,DS-cd1d8ab0-7d77-4607-98bf-961a7ad81f45,DISK], DatanodeInfoWithStorage[172.16.1.107:9866,DS-cc8f7dbb-28ed-477a-b831-7b5d9f146f80,DISK], DatanodeInfoWithStorage[172.16.1.106:9866,DS-b1aa8def-bcd8-4514-8697-29c2f7fd008d,DISK]]

/user/itv002480/retail_db/customers <dir>
/user/itv002480/retail_db/customers/part-00000 953719 bytes, replicated: replication=3, 1 block(s):  OK
0. BP-1685381103-172.16.1.103-1609223169030:blk_1078718660_4982117 len=953719 Live_repl=3  [DatanodeInfoWithStorage[172.16.1.106:9866,DS-3cdd1a86-1122-4b3f-9d9d-c9fe36cab433,DISK], DatanodeIn

Connecting to namenode via http://m01.itversity.com:9870/fsck?ugi=itv002480&files=1&blocks=1&locations=1&path=%2Fuser%2Fitv002480%2Fretail_db
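The lookup from IP address to DNS name can also be scripted. This is a minimal sketch under stated assumptions: `WORKERS` is the worker node table from the top of this section typed in by hand, `BLOCK_LINE` is the first block line of the output above, and `block_locations` is a hypothetical helper.

```python
import re

# Worker node table from the top of this section (private IP -> short DNS).
WORKERS = {
    "172.16.1.102": "wn01",
    "172.16.1.103": "wn02",
    "172.16.1.104": "wn03",
    "172.16.1.107": "wn04",
    "172.16.1.108": "wn05",
}

# First block line of the -locations output above, copied verbatim.
BLOCK_LINE = (
    "0. BP-1685381103-172.16.1.103-1609223169030:blk_1078718659_4982116 "
    "len=1029 Live_repl=3  "
    "[DatanodeInfoWithStorage[172.16.1.105:9866,DS-cd1d8ab0-7d77-4607-98bf-961a7ad81f45,DISK], "
    "DatanodeInfoWithStorage[172.16.1.107:9866,DS-cc8f7dbb-28ed-477a-b831-7b5d9f146f80,DISK], "
    "DatanodeInfoWithStorage[172.16.1.106:9866,DS-b1aa8def-bcd8-4514-8697-29c2f7fd008d,DISK]]"
)

# ip, port, storage ID, storage type
LOCATION = re.compile(r"DatanodeInfoWithStorage\[([\d.]+):(\d+),([^,]+),([^\]]+)\]")


def block_locations(line):
    """Return (ip, port, storage_id, storage_type) per replica (hypothetical helper)."""
    return [
        (ip, int(port), storage_id, storage_type)
        for ip, port, storage_id, storage_type in LOCATION.findall(line)
    ]


locations = block_locations(BLOCK_LINE)
# Fall back to the raw IP when the address is not listed in the table.
hosts = [WORKERS.get(ip, ip) for ip, _, _, _ in locations]
print(hosts)
```

For this block only `172.16.1.107` resolves to a name (`wn04`). The other two addresses are not listed in the table above, so the sketch returns them unchanged.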


In [8]:
%%sh

hdfs dfs -ls -h /public/randomtextwriter/part-m-00000

-rw-r--r--   3 hdfs hdfs      1.0 G 2017-01-18 20:24 /public/randomtextwriter/part-m-00000


In [9]:
%%sh

hdfs fsck /public/randomtextwriter/part-m-00000 -files -blocks -locations

FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/randomtextwriter/part-m-00000 at Thu Jan 21 05:39:53 EST 2021
/public/randomtextwriter/part-m-00000 1102230331 bytes, 9 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1074171511_431441 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1074171524_431454 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1074171559_431489 len=1342177

Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000
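The block count reported for this file follows from simple arithmetic. The sketch below uses only numbers visible in the output above: the file size, the `len=` value of the full blocks, and the replication factor of 3. The size of the last block is derived here, not read from the (truncated) output.

```python
file_size = 1102230331   # bytes, from the fsck output above
block_size = 134217728   # bytes, len= of each full block (128 * 1024 * 1024)
replication = 3          # repl= value of each block

# Number of full blocks and the bytes left over for the last, partial block.
full_blocks, last_block_bytes = divmod(file_size, block_size)
num_blocks = full_blocks + (1 if last_block_bytes else 0)

print(num_blocks)                # 9, matching "9 block(s)" in the output
print(last_block_bytes)          # 28488507
print(file_size * replication)   # 3306690993 bytes stored across the data nodes
```

A block only occupies as much space as the data it holds, so the last block takes 28488507 bytes and not a full 128 MiB.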
