## Copying files from local to HDFS

We can copy files from the local file system to HDFS and vice versa. We can also append data to existing files in HDFS.
* `hadoop fs -copyFromLocal` or `hadoop fs -put` – to copy files from the local file system to HDFS. The copy process itself is already covered: the file is divided into blocks based on the block size, and the blocks are stored on DataNodes in a distributed fashion according to the replication factor.
* `hadoop fs -copyToLocal` or `hadoop fs -get` – to copy files from HDFS to the local file system. It reads all the blocks in sequence, using the block index, and reconstructs the file on the local file system.
* We can also use `hadoop fs -appendToFile` to append data to an existing file.
* However, we cannot update or fix data in place once files are in HDFS. To fix any data, we have to copy the file to the local file system, fix the data, and then copy it back to HDFS.
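
The get, fix, and put cycle from the last bullet can be sketched as a small shell function. This is only a minimal sketch under stated assumptions: `fix_hdfs_file` is a hypothetical helper name, the fix is expressed as a `sed` expression supplied by the caller, and the file is assumed to be small enough to fit in local temporary space.

```shell
# Hypothetical helper: pull a file out of HDFS, edit it locally with sed,
# then overwrite the HDFS copy. HDFS has no in-place update, so this
# round trip is the workaround.
fix_hdfs_file() {
    hdfs_path=$1    # file in HDFS, e.g. /user/${USER}/retail_db/orders/part-00000
    sed_expr=$2     # sed expression describing the fix (illustrative)
    work_dir=$(mktemp -d) || return 1
    local_copy="${work_dir}/$(basename "${hdfs_path}")"

    # Each step runs only if the previous one succeeded.
    hdfs dfs -get "${hdfs_path}" "${local_copy}" &&
        sed "${sed_expr}" "${local_copy}" > "${local_copy}.fixed" &&
        hdfs dfs -put -f "${local_copy}.fixed" "${hdfs_path}"
    rc=$?

    # Clean up the local working copy whether or not the fix succeeded.
    rm -rf "${work_dir}"
    return "${rc}"
}
```

A call might look like `fix_hdfs_file /user/${USER}/retail_db/orders/part-00000 's/CLOSED/COMPLETE/'`, where both the path and the substitution are illustrative. When the change only adds data at the end of a file, `hdfs dfs -appendToFile <localsrc> <dst>` avoids the round trip.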

![Anatomy of a file write in HDFS](https://s3.amazonaws.com/kaizen.itversity.com/hadoop-overview/04HDFSAnatomyOfFileWrite.png)
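
To make the effect of block size and replication factor concrete, the arithmetic can be worked out in shell. This is a rough illustration rather than anything HDFS itself exposes: `hdfs_footprint` is a hypothetical helper name, and the 128 MB block size and replication factor of 3 are the stock Hadoop defaults, assumed here because the cluster's actual settings may differ.

```shell
# Hypothetical helper: estimate how a file of a given size is split into blocks.
# Arguments: file size in bytes, [block size in bytes], [replication factor].
hdfs_footprint() {
    file_size=$1
    block_size=${2:-134217728}   # 128 MB, default dfs.blocksize (assumed)
    replication=${3:-3}          # default dfs.replication (assumed)

    # Ceiling division: the last block is usually only partially filled.
    blocks=$(( (file_size + block_size - 1) / block_size ))
    # Every block is stored once per replica.
    replicas=$(( blocks * replication ))
    # A partial last block only occupies its actual size, so raw usage is
    # simply the file size multiplied by the replication factor.
    raw_bytes=$(( file_size * replication ))

    echo "blocks=${blocks} block_replicas=${replicas} raw_bytes=${raw_bytes}"
}

# A 300 MB file: 3 blocks (128 MB + 128 MB + 44 MB), 9 block replicas.
hdfs_footprint $(( 300 * 1024 * 1024 ))
# prints: blocks=3 block_replicas=9 raw_bytes=943718400
```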


In [None]:
%%sh

hdfs dfs -mkdir /user/${USER}/retail_db

In [6]:
%%sh

hdfs dfs -help put

-put [-f] [-p] [-l] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                       
  -p  Preserves access and modification times, ownership and the mode. 
  -f  Overwrites the destination if it already exists.                 
  -l  Allow DataNode to lazily persist the file to disk. Forces        
         replication factor of 1. This flag will result in reduced
         durability. Use with care.


In [7]:
%%sh

hdfs dfs -help copyFromLocal

-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst> :
  Identical to the -put command.


```{warning}
The command below copies the entire folder into `/user/${USER}/retail_db`, so you will end up with `/user/${USER}/retail_db/retail_db`. The commands later in this section show how to copy the files so that they land where expected.
```

In [8]:
%%sh

hdfs dfs -put /data/retail_db /user/${USER}/retail_db

```{note}
Alternatively, you can use `copyFromLocal`, which is identical to `put`.
```

In [8]:
%%sh

hdfs dfs -copyFromLocal /data/retail_db /user/${USER}/retail_db

In [9]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 1 items
drwxr-xr-x   - itversity students          0 2021-01-06 20:40 /user/itversity/retail_db/retail_db


```{note}
Let us drop this nested folder and copy the files again so that they land in the expected location.
```

In [10]:
%%sh

hdfs dfs -help rm

-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ... :
  Delete all files that match the specified file pattern. Equivalent to the Unix
  command "rm <src>"
                                                                                 
  -f          If the file does not exist, do not display a diagnostic message or 
              modify the exit status to reflect an error.                        
  -[rR]       Recursively deletes directories.                                   
  -skipTrash  option bypasses trash, if enabled, and immediately deletes <src>.  
  -safely     option requires safety confirmation, if enabled, requires          
              confirmation before deleting large directory with more than        
              <hadoop.shell.delete.limit.num.files> files. Delay is expected when
              walking over large directory recursively to count the number of    
              files to be deleted before the confirmation.                       


In [12]:
%%sh

hdfs dfs -rm -R /user/itversity/retail_db/retail_db

21/01/06 20:41:33 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nn01.itversity.com:8020/user/itversity/retail_db/retail_db' to trash at: hdfs://nn01.itversity.com:8020/user/itversity/.Trash/Current/user/itversity/retail_db/retail_db
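
The log line above shows that, without `-skipTrash`, the deleted path is moved under `.Trash/Current` in the user's HDFS home directory, with the original path appended. A minimal sketch of restoring such a path follows. It assumes trash is enabled on the cluster, that the trash checkpoint has not been expunged yet, and that the HDFS home directory is `/user/${USER}`; `restore_from_trash` is a hypothetical helper name.

```shell
# Hypothetical helper: move a deleted path back from the HDFS trash.
restore_from_trash() {
    original_path=$1    # absolute HDFS path that was removed with -rm
    trash_path="/user/${USER}/.Trash/Current${original_path}"

    # -test -e returns 0 when the path exists in HDFS.
    if hdfs dfs -test -e "${trash_path}"; then
        # The parent directory of the original path must still exist.
        hdfs dfs -mv "${trash_path}" "${original_path}"
    else
        echo "Not found in trash: ${trash_path}" >&2
        return 1
    fi
}
```

For the deletion shown above, the call would be `restore_from_trash /user/itversity/retail_db/retail_db`.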


In [13]:
%%sh

hdfs dfs -ls /user/itversity/retail_db/

In [14]:
%%sh

hdfs dfs -put /data/retail_db/* /user/${USER}/retail_db

In [15]:
%%sh

hdfs dfs -ls /user/itversity/retail_db/

Found 6 items
drwxr-xr-x   - itversity students          0 2021-01-06 20:42 /user/itversity/retail_db/categories
drwxr-xr-x   - itversity students          0 2021-01-06 20:42 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity students          0 2021-01-06 20:42 /user/itversity/retail_db/departments
drwxr-xr-x   - itversity students          0 2021-01-06 20:42 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity students          0 2021-01-06 20:42 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity students          0 2021-01-06 20:42 /user/itversity/retail_db/products


```{note}
As an alternative, we can copy the folder `/data/retail_db` directly so that it becomes `/user/${USER}/retail_db`. Let us first delete `/user/${USER}/retail_db` using `-skipTrash`.
```

In [16]:
%%sh

hdfs dfs -rm -R -skipTrash /user/itversity/retail_db

Deleted /user/itversity/retail_db


```{note}
We can specify the target location as `/user/${USER}`. The command will create the `retail_db` folder along with its contents.
```

In [None]:
%%sh

hdfs dfs -put /data/retail_db /user/${USER}

* If we run `hdfs dfs -put /data/retail_db /user/${USER}` again, it fails because the files already exist in the target folder.

In [18]:
%%sh

hdfs dfs -put /data/retail_db /user/${USER}

put: `/user/itversity/retail_db/categories/part-00000': File exists
put: `/user/itversity/retail_db/customers/part-00000': File exists
put: `/user/itversity/retail_db/departments/part-00000': File exists
put: `/user/itversity/retail_db/order_items/part-00000': File exists
put: `/user/itversity/retail_db/orders/part-00000': File exists
put: `/user/itversity/retail_db/products/part-00000': File exists


CalledProcessError: Command 'b'\nhdfs dfs -put /data/retail_db /user/${USER}\n'' returned non-zero exit status 1.
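
In scripts, one way to avoid this failure is to test for the target before copying. The sketch below relies on `hdfs dfs -test -e`, which returns 0 when a path exists. `put_if_absent` is a hypothetical helper name, and the sketch assumes the HDFS target directory already exists, so that `-put` places the source inside it.

```shell
# Hypothetical helper: copy a local file or folder into an existing HDFS
# directory only when nothing with the same name is already there.
put_if_absent() {
    local_src=$1    # local file or folder, e.g. /data/retail_db
    hdfs_dir=$2     # existing HDFS directory, e.g. /user/${USER}
    # Path the copy would end up at (trailing slash on hdfs_dir is dropped).
    target="${hdfs_dir%/}/$(basename "${local_src}")"

    if hdfs dfs -test -e "${target}"; then
        echo "Skipping: ${target} already exists (use -put -f to overwrite)" >&2
        return 0
    fi
    hdfs dfs -put "${local_src}" "${hdfs_dir}"
}
```

Returning 0 when the target exists treats "nothing to do" as success, which keeps a rerun of the same script from failing. A script that must refresh the data would use `-put -f` instead.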

* We can use `-f` as part of `put` or `copyFromLocal` to overwrite the files that already exist in the target folder.

In [19]:
%%sh

hdfs dfs -put -f /data/retail_db /user/${USER}

In [20]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 6 items
drwxr-xr-x   - itversity students          0 2021-01-06 20:48 /user/itversity/retail_db/categories
drwxr-xr-x   - itversity students          0 2021-01-06 20:48 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity students          0 2021-01-06 20:48 /user/itversity/retail_db/departments
drwxr-xr-x   - itversity students          0 2021-01-06 20:48 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity students          0 2021-01-06 20:48 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity students          0 2021-01-06 20:48 /user/itversity/retail_db/products


In [1]:
%%sh

hdfs dfs -help get

-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
  Copy files that match the file pattern <src> to the local name.  <src> is kept. 
  When copying multiple files, the destination must be a directory. Passing -p
  preserves access and modification times, ownership and the mode.


In [2]:
%%sh

hdfs dfs -help copyToLocal

-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
  Identical to the -get command.


```{warning}
The command below copies the entire folder from `/user/${USER}/retail_db` to the local home directory, so you will see `/home/${USER}/retail_db`.
```

In [3]:
%%sh

hdfs dfs -get /user/${USER}/retail_db /home/${USER}

In [4]:
%%sh

ls -ltr /home/${USER}/retail_db

total 0
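
After a `get`, it can be useful to confirm that the local copy holds as many files as the HDFS source. The sketch below compares the `FILE_COUNT` column of `hdfs dfs -count` (which prints `DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME`) with a local `find`. `verify_get` is a hypothetical helper name, and matching file counts is only a basic sanity check, not a content comparison.

```shell
# Hypothetical helper: compare the number of files in an HDFS folder with
# the number of files in its local copy.
verify_get() {
    hdfs_path=$1     # source folder in HDFS
    local_path=$2    # folder created by hdfs dfs -get

    # Second column of -count output is the number of files.
    hdfs_files=$(hdfs dfs -count "${hdfs_path}" | awk '{print $2}')
    # Ignore local checksum side files (.name.crc) if any were written.
    local_files=$(find "${local_path}" -type f ! -name '.*.crc' | wc -l | tr -d ' ')

    if [ "${hdfs_files}" = "${local_files}" ]; then
        echo "OK: ${local_files} files"
    else
        echo "MISMATCH: HDFS=${hdfs_files} local=${local_files}" >&2
        return 1
    fi
}
```

For the copy above, the call would be `verify_get /user/${USER}/retail_db /home/${USER}/retail_db`.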
