## Copying files from HDFS to HDFS

Let us understand how to copy files within HDFS (from one HDFS location to another HDFS location).

* We can use the `hdfs dfs -cp` command to copy files within HDFS.
* One needs at least read permission on the source folders or files and write permission on the target folder for the `cp` command to work as expected.
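
The permission requirement above can be checked before copying. The following is a minimal sketch, assuming a Hadoop release whose `hdfs dfs -test` supports the `-r` and `-w` options; `can_copy` is a hypothetical helper name and not part of the HDFS CLI.

```shell
# Hypothetical helper: succeed only if the current user can read SRC
# and write to DST_DIR. Assumes `hdfs dfs -test` supports -r and -w.
can_copy() {
    src="$1"
    dst_dir="$2"
    # -test -r: the path exists and read permission is granted
    # -test -w: the path exists and write permission is granted
    hdfs dfs -test -r "$src" && hdfs dfs -test -w "$dst_dir"
}
```

On a running cluster it could be used as ``can_copy /public/retail_db /user/`whoami` && echo "ok to copy"``.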

In [1]:
%%sh

hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db

Deleted /user/spark/retail_db


In [4]:
!hdfs dfs -put /data/retail_db /public

In [5]:
%%sh

hdfs dfs -ls /public/retail_db

Found 9 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:27 /public/retail_db/categories
-rw-r--r--   1 spark supergroup   10303297 2022-05-29 17:27 /public/retail_db/create_db.sql
-rw-r--r--   1 spark supergroup       1748 2022-05-29 17:27 /public/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:27 /public/retail_db/customers
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:27 /public/retail_db/departments
-rw-r--r--   1 spark supergroup   10297372 2022-05-29 17:27 /public/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:27 /public/retail_db/order_items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:27 /public/retail_db/orders
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:27 /public/retail_db/products


In [6]:
%%sh

hdfs dfs -ls /user/`whoami`

Found 1 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:08 /user/spark/.sparkStaging


In [7]:
%%sh

hdfs dfs -help cp

-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst> :
  Copy files that match the file pattern <src> to a destination.  When copying
  multiple files, the destination must be a directory. Passing -p preserves status
  [topax] (timestamps, ownership, permission, ACLs, XAttr). If -p is specified
  with no <arg>, then preserves timestamps, ownership, permission. If -pa is
  specified, then preserves permission also because ACL is a super-set of
  permission. Passing -f overwrites the destination if it already exists. raw
  namespace extended attributes are preserved if (1) they are supported (HDFS
  only) and, (2) all of the source and target pathnames are in the /.reserved/raw
  hierarchy. raw namespace xattr preservation is determined solely by the presence
  (or absence) of the /.reserved/raw prefix and not by the -p option. Passing -d
  will skip creation of temporary file(<dst>._COPYING_).
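
As an illustration of the `-p` option described in the help text above, here is a small sketch that copies a path while preserving timestamps, ownership and permission. `copy_preserving` is a hypothetical wrapper name used only for this example; whether ownership can actually be preserved may depend on the privileges of the user running the command.

```shell
# Hypothetical wrapper: copy SRC to DST with -p, which (per the help text)
# preserves timestamps, ownership and permission when no argument is given.
copy_preserving() {
    src="$1"
    dst="$2"
    hdfs dfs -cp -p "$src" "$dst"
}
```

For example, ``copy_preserving /public/retail_db/orders /user/`whoami`/retail_db`` would copy the `orders` folder and keep its original modification time.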


* Let us create a `retail_db` directory under the user space to store the copied folders and files. You can review the permissions on `retail_db`; the user has write permission on the target folder.

In [8]:
%%sh

hdfs dfs -mkdir /user/`whoami`/retail_db

In [9]:
%%sh

hdfs dfs -ls /user/`whoami`

Found 2 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:08 /user/spark/.sparkStaging
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:27 /user/spark/retail_db


In [10]:
%%sh

hdfs dfs -cp /public/retail_db/* /user/`whoami`/retail_db

In [11]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 9 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/categories
-rw-r--r--   1 spark supergroup   10303297 2022-05-29 17:28 /user/spark/retail_db/create_db.sql
-rw-r--r--   1 spark supergroup       1748 2022-05-29 17:28 /user/spark/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/customers
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/departments
-rw-r--r--   1 spark supergroup   10297372 2022-05-29 17:28 /user/spark/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/order_items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/orders
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/products


```{note}
The following command will fail because the `retail_db` folder already exists in the target location and already contains the same files.
```

In [12]:
%%sh

hdfs dfs -cp /public/retail_db /user/`whoami`

cp: `/user/spark/retail_db/categories/part-00000': File exists
cp: `/user/spark/retail_db/create_db.sql': File exists
cp: `/user/spark/retail_db/create_db_tables_pg.sql': File exists
cp: `/user/spark/retail_db/customers/part-00000': File exists
cp: `/user/spark/retail_db/departments/part-00000': File exists
cp: `/user/spark/retail_db/load_db_tables_pg.sql': File exists
cp: `/user/spark/retail_db/order_items/part-00000': File exists
cp: `/user/spark/retail_db/orders/part-00000': File exists
cp: `/user/spark/retail_db/products/part-00000': File exists


CalledProcessError: Command 'b'\nhdfs dfs -cp /public/retail_db /user/`whoami`\n'' returned non-zero exit status 1.
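
One way to avoid the `File exists` errors above is to check for the target before copying. The sketch below uses `hdfs dfs -test -e` for that check; `copy_if_absent` is a hypothetical helper name. Alternatively, the `-f` option from the help text overwrites the destination if it already exists.

```shell
# Hypothetical helper: copy SRC into DST_DIR only when DST_DIR does not
# already contain an entry with the same name as SRC.
copy_if_absent() {
    src="$1"
    dst_dir="$2"
    target="$dst_dir/$(basename "$src")"
    # -test -e returns 0 when the path exists
    if hdfs dfs -test -e "$target"; then
        echo "skip: $target already exists" >&2
        return 1
    fi
    hdfs dfs -cp "$src" "$dst_dir"
}
```

With this helper, ``copy_if_absent /public/retail_db /user/`whoami` `` reports that the target exists instead of failing file by file.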

```{note}
An alternative approach is to remove the existing target folder and then copy the folder along with its contents directly.
```

In [13]:
%%sh

hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db

Deleted /user/spark/retail_db


In [14]:
%%sh

hdfs dfs -ls /user/`whoami`

Found 1 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:08 /user/spark/.sparkStaging


In [15]:
%%sh

hdfs dfs -cp /public/retail_db /user/`whoami`

In [16]:
%%sh

hdfs dfs -ls -R /user/`whoami`/retail_db

drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/categories
-rw-r--r--   1 spark supergroup       1029 2022-05-29 17:28 /user/spark/retail_db/categories/part-00000
-rw-r--r--   1 spark supergroup   10303297 2022-05-29 17:28 /user/spark/retail_db/create_db.sql
-rw-r--r--   1 spark supergroup       1748 2022-05-29 17:28 /user/spark/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/customers
-rw-r--r--   1 spark supergroup     953719 2022-05-29 17:28 /user/spark/retail_db/customers/part-00000
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/departments
-rw-r--r--   1 spark supergroup         60 2022-05-29 17:28 /user/spark/retail_db/departments/part-00000
-rw-r--r--   1 spark supergroup   10297372 2022-05-29 17:28 /user/spark/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:28 /user/spark/retail_db/order_items
-rw-r--r-

* We can also use glob patterns with the `cp` command to copy files within HDFS. In addition, we can pass multiple source files or folders in HDFS to the `cp` command.
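
When a pattern is used, quoting it keeps the local shell from trying to expand it against the local file system, so the pattern reaches HDFS unchanged and is resolved there. The sketch below shows this; `copy_matching` is a hypothetical wrapper name. An unquoted pattern usually works as well, as long as nothing on the local file system matches it.

```shell
# Hypothetical wrapper: copy every HDFS path matching PATTERN into DST_DIR.
# PATTERN stays quoted so it is passed to hdfs literally and expanded by HDFS.
copy_matching() {
    pattern="$1"
    dst_dir="$2"
    hdfs dfs -cp "$pattern" "$dst_dir"
}
```

For example, ``copy_matching '/public/retail_db/order*' /user/`whoami`/retail_db`` would copy both `orders` and `order_items`.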

In [17]:
%%sh

hdfs dfs -rm -R -skipTrash /user/`whoami`/retail_db

Deleted /user/spark/retail_db


In [18]:
%%sh

hdfs dfs -ls /user/`whoami`

Found 1 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:08 /user/spark/.sparkStaging


In [19]:
%%sh

hdfs dfs -mkdir /user/`whoami`/retail_db

In [20]:
%%sh

hdfs dfs -cp /public/retail_db/order* /user/`whoami`/retail_db

In [21]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 2 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/order_items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/orders


In [22]:
%%sh

hdfs dfs -cp /public/retail_db/departments /public/retail_db/products /user/`whoami`/retail_db

In [23]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 4 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/departments
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/order_items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/orders
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/products


In [24]:
%%sh

hdfs dfs -cp /public/retail_db/categories /public/retail_db/customers /user/`whoami`/retail_db

In [25]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db

Found 6 items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/categories
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/customers
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/departments
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/order_items
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/orders
drwxr-xr-x   - spark supergroup          0 2022-05-29 17:29 /user/spark/retail_db/products
