# This code was tested in a particular environment
These examples were tested in a pseudo-distributed installation of Hadoop ([see more](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation)).

In [1]:
!cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"



In [2]:
!python --version

Python 3.9.0


In [3]:
!hadoop version | head -1

Hadoop 3.2.1


# Working with folders on HDFS
It's quite similar to working with folders on any \*nix system; just remember to prefix the common commands with `hdfs dfs -`:
* `ls`
* `mkdir`
* `rmdir`
* `mv`

In [22]:
!hdfs dfs -mkdir data

In [23]:
!hdfs dfs -ls data

In [24]:
!hdfs dfs -mkdir data/tmp

In [25]:
!hdfs dfs -ls data

Found 1 items
drwxr-xr-x   - boris atg          2 2020-10-19 13:05 data/tmp


In [26]:
!hdfs dfs -mv data/tmp data/tmp1

In [27]:
!hdfs dfs -ls data

Found 1 items
drwxr-xr-x   - boris atg          2 2020-10-19 13:05 data/tmp1


In [28]:
!hdfs dfs -rmdir data/tmp1

In [29]:
!hdfs dfs -ls data

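If you'd rather drive these commands from Python than type them cell by cell, a thin wrapper around `subprocess` is enough. The sketch below is only an illustration: `hdfs_dfs` is a hypothetical helper name, not part of Hadoop, and `-mkdir -p` (which also creates missing parent folders) stands in for the two separate `mkdir` calls above. When no `hdfs` binary is found on `PATH`, it prints the commands instead of running them.

```python
import shutil
import subprocess


def hdfs_dfs(*args, dry_run=False):
    """Build (and optionally run) an `hdfs dfs` command.

    In dry-run mode the argument list is returned without being executed;
    otherwise the command's standard output is returned as a string.
    """
    cmd = ["hdfs", "dfs", *args]
    if dry_run:
        return cmd
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Roughly the same sequence as the cells above.
steps = [
    ("-mkdir", "-p", "data/tmp"),      # create data/tmp and any missing parents
    ("-mv", "data/tmp", "data/tmp1"),  # rename the folder
    ("-rmdir", "data/tmp1"),           # remove the (empty) folder
    ("-ls", "data"),                   # list what is left
]

have_hdfs = shutil.which("hdfs") is not None
for step in steps:
    if have_hdfs:
        print(hdfs_dfs(*step), end="")
    else:
        # No Hadoop client available: just show what would be run.
        print(" ".join(hdfs_dfs(*step, dry_run=True)))
```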
# Uploading data from your local filesystem to HDFS and getting it back
To upload data to your HDFS cluster and download it back, an `ftp`-like vocabulary is used:
* `put`
* `get`

In [80]:
!hdfs dfs -put file:///home/boris/Downloads/yelp_dataset/yelp_academic_dataset_tip.json data

You can also use the `du` command to check how much space your data takes up on HDFS:

In [68]:
!hdfs dfs -du -h data

251.3 M  251.3 M  data/yelp_academic_dataset_tip.json

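In Hadoop 3 the `du` output has three columns: the size of the data, the disk space it consumes across all replicas, and the path. In the output above the first two are equal, presumably because a pseudo-distributed installation keeps a single replica. If you need these numbers in a script, a line can be parsed roughly as follows. This is a minimal sketch: `parse_du_line` is a hypothetical helper, and because `-h` rounds the sizes, the byte counts are only approximate.

```python
import re

# Binary multipliers for the unit letters printed by `du -h`.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# A size is a number optionally followed by a space and a unit letter, e.g. "251.3 M" or "0".
SIZE = r"(\d+(?:\.\d+)?)(?: ([KMGTP]))?"
LINE = re.compile(rf"^{SIZE}\s+{SIZE}\s+(.+)$")


def parse_du_line(line):
    """Split one `hdfs dfs -du -h` line into path, size and disk usage (in bytes)."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised du line: {line!r}")
    size, size_unit, used, used_unit, path = match.groups()
    return {
        "path": path,
        "size_bytes": int(float(size) * UNITS[size_unit or ""]),
        "disk_bytes": int(float(used) * UNITS[used_unit or ""]),
    }


# The line shown in the cell output above.
info = parse_du_line("251.3 M  251.3 M  data/yelp_academic_dataset_tip.json")
print(info["path"], info["size_bytes"], info["disk_bytes"])
```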

In [51]:
!hdfs dfs -get data/* file:///home/boris/Downloads/from_hdfs

In [64]:
!ls ~/Downloads/from_hdfs

yelp_academic_dataset_tip.json


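Mind that a dataset on HDFS is often stored as a folder of several part files (for instance, when it was written by a distributed job), so `get`-ing it can leave you with a bunch of files instead of one. A single file is always returned as a single file, even if its blocks live on different nodes of the cluster.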

If you want to get a single local file, use the `getmerge` command.

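To get a feel for what `getmerge` produces, here is a purely local sketch that concatenates the files of a folder into one file. It does not talk to HDFS at all: `merge_parts` and the `part-0000N` file names are hypothetical, and merging in name order is an assumption about how the real command behaves. The `add_newline` switch mimics the `-nl` option of `getmerge`, which appends a newline after each file.

```python
import tempfile
from pathlib import Path


def merge_parts(src_dir, dst_file, add_newline=False):
    """Concatenate every regular file in src_dir (sorted by name) into dst_file."""
    parts = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(dst_file, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
            if add_newline:
                out.write(b"\n")
    # Return the names of the merged files, in the order they were written.
    return [p.name for p in parts]


with tempfile.TemporaryDirectory() as tmp:
    # Fake the result of `get`-ing a folder that holds two part files.
    src = Path(tmp) / "from_hdfs"
    src.mkdir()
    (src / "part-00000").write_text("first slice\n")
    (src / "part-00001").write_text("second slice\n")

    merged = Path(tmp) / "merged.txt"
    merged_names = merge_parts(src, merged)
    merged_text = merged.read_text()
    print(merged_text, end="")
```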
To get rid of data you no longer need, use good old `rm`:

In [69]:
!hdfs dfs -rm data/*

2020-10-19 13:36:43,509 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Deleted data/yelp_academic_dataset_tip.json


In [70]:
!hdfs dfs -du -h data

If you have several HDFS clusters at your disposal, you can move data from one to another with the same pair of commands (`get`/`put`).

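One way to read this: `get` the data from the first cluster into a local staging folder, then `put` it onto the second one, addressing each cluster with a fully qualified `hdfs://` URI. The sketch below only builds the two commands and does not run them. The host names `namenode-a` and `namenode-b`, port 8020, the staging folder and the helper name `copy_between_clusters` are all assumptions made for illustration.

```python
import posixpath


def copy_between_clusters(src_uri, dst_uri, staging_dir):
    """Return the `get` and `put` commands that copy src_uri to dst_uri via local disk."""
    name = posixpath.basename(src_uri.rstrip("/"))
    local_copy = posixpath.join(staging_dir, name)
    return [
        ["hdfs", "dfs", "-get", src_uri, local_copy],  # cluster A -> local disk
        ["hdfs", "dfs", "-put", local_copy, dst_uri],  # local disk -> cluster B
    ]


commands = copy_between_clusters(
    "hdfs://namenode-a:8020/user/boris/data/yelp_academic_dataset_tip.json",  # hypothetical source
    "hdfs://namenode-b:8020/user/boris/data",                                 # hypothetical target
    "/tmp/staging",                                                           # hypothetical local folder
)
for cmd in commands:
    print(" ".join(cmd))
```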
# Do It Yourself
* run `hdfs dfs -help` and get an idea of other possible commands
* try to create a subfolder in your `/user/%username%` folder
* create a new file on HDFS using `touch`
* `cat` the newly created file (don't do that in the future when working with huge files :))
* upload some files from your system to the cluster
* use `head`/`tail` to look into the uploaded files
* create a full copy of your working subfolder on HDFS (use `cp`)
* remove the redundant copy
* compare the results of `get` and `getmerge` commands
* compare the results of `put` and `moveFromLocal` commands
* feel free to play with other HDFS commands