# HDFS with Jupyter

This notebook shows simple snippets for accessing and working with a remote HDFS. It uses the pyarrow and hdfs3 packages. For API details about these packages, see:

https://hdfs3.readthedocs.io/en/latest/api.html

https://arrow.apache.org/docs/python/api.html

For these snippets to work, the variable "nameNodeHost" must be set to the location of the HDFS namenode in your deployment.
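
One way to avoid hard-coding the namenode location is to read it from the environment and fall back to the values used in this notebook. This is only a sketch: the variable names HDFS_NAMENODE_HOST and HDFS_NAMENODE_PORT and the helper namenode_settings are assumptions made for illustration, not something hdfs3 or Hadoop define.

```python
import os


def namenode_settings(env=None):
    """Return (host, port) for the HDFS namenode.

    HDFS_NAMENODE_HOST and HDFS_NAMENODE_PORT are hypothetical variable
    names chosen for this sketch; the defaults match this notebook.
    """
    env = os.environ if env is None else env
    host = env.get('HDFS_NAMENODE_HOST', 'jupyterhdfs_namenode')
    port = int(env.get('HDFS_NAMENODE_PORT', '8020'))
    return host, port


nameNodeHost, nameNodeIPCPort = namenode_settings()
print("Namenode: %s:%d" % (nameNodeHost, nameNodeIPCPort))
```

The resulting pair can then be passed to the connection call in place of the literal values.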

## Connect to the HDFS namenode

In [4]:
# Connect to hdfs
import os
import pandas as pd
import hdfs3
import pyarrow as pa

nameNodeHost = 'jupyterhdfs_namenode'
nameNodeIPCPort = 8020
hdfs = hdfs3.HDFileSystem(nameNodeHost, port=nameNodeIPCPort)

# List the folders under /user to confirm the connection works
hdfs.ls('/user')

['/user/hive', '/user/myname']

## CSV files with HDFS

### Create folders and files in HDFS

In [50]:
# Display example.csv file
localPath = os.getcwd()
dataPath = localPath + '/data'
csvFile = '/example.csv'

print("Local files: %s " % os.listdir(dataPath))
print("Content of local %s: " % csvFile)
df = pd.read_csv(dataPath + csvFile)
df

Local files: ['.ipynb_checkpoints', 'example.csv', 'copy.csv'] 
Content of local /example.csv: 


Unnamed: 0,name,age
0,john,22
1,mary,34


In [48]:
# Create folder in hdfs under /user and copy the local csv file
localPath = os.getcwd()
dataPath = localPath + '/data'
hdfspath = '/user/lainotik'

hdfs.mkdir(hdfspath)
hdfs.put(dataPath + csvFile, hdfspath + csvFile)
print("Hdfs files: %s " % hdfs.ls(hdfspath, detail=False))

Hdfs files: ['/user/lainotik/example.csv'] 
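
The cells above build paths by concatenating strings that start with '/'. A possible alternative, shown here only as a sketch, keeps bare file names and joins them with os.path.join on the local side and posixpath.join on the HDFS side, since HDFS paths always use forward slashes. The helper names local_path and hdfs_path are made up for this illustration.

```python
import os
import posixpath


def local_path(directory, filename):
    """Join a local directory and a file name using the OS separator."""
    return os.path.join(directory, filename.lstrip('/'))


def hdfs_path(directory, filename):
    """Join an HDFS directory and a file name; HDFS always uses '/'."""
    return posixpath.join(directory, filename.lstrip('/'))


# Both spellings of the file name resolve to the same HDFS path.
print(hdfs_path('/user/lainotik', 'example.csv'))
print(hdfs_path('/user/lainotik', '/example.csv'))
```

Stripping the leading slash matters because both join functions treat a second argument that starts with '/' as an absolute path and discard the directory.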


### Get files from HDFS

In [51]:
# Copy the newly created file from hdfs to the local data folder
csvFileCopy = '/copy.csv'

hdfs.get(hdfspath + csvFile, dataPath + csvFileCopy)
print("Local files: %s " % os.listdir(dataPath))
print("Content of %s: " % csvFileCopy)
df = pd.read_csv(dataPath + csvFileCopy)
df

Local files: ['.ipynb_checkpoints', 'example.csv', 'copy.csv'] 
Content of /copy.csv: 


Unnamed: 0,name,age
0,john,22
1,mary,34
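
Displaying the dataframe shows that the copy looks right. A byte-for-byte comparison with the original is a stricter check. The sketch below uses filecmp from the standard library. The helper name same_content is hypothetical, and the demo writes two temporary files as stand-ins; in this notebook the two paths would be dataPath + csvFile and dataPath + csvFileCopy.

```python
import filecmp
import os
import tempfile


def same_content(path_a, path_b):
    """Return True when both files hold identical bytes."""
    # shallow=False forces a content comparison instead of trusting os.stat
    return filecmp.cmp(path_a, path_b, shallow=False)


# Stand-in demo: two temporary files with the same CSV content.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'example.csv')
    copy = os.path.join(tmp, 'copy.csv')
    for path in (original, copy):
        with open(path, 'w') as handle:
            handle.write(',name,age\n0,john,22\n1,mary,34\n')
    print("Files match: %s" % same_content(original, copy))
```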


### Delete folders and files in HDFS

In [52]:
# Delete hdfs file
hdfs.rm(hdfspath + csvFile)
print("Hdfs files: %s " % hdfs.ls(hdfspath, detail=False))

# Delete hdfs folder.
hdfs.rm(hdfspath)
print("Hdfs folders under user: %s " % hdfs.ls('/user'))

# Note: To remove a folder together with its content, request a recursive delete explicitly: hdfs.rm(hdfspath, recursive=True)

# The local csv copy will also be removed
os.remove(dataPath + csvFileCopy)

Hdfs files: [] 
Hdfs folders under user: ['/user/hive'] 
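
Because a recursive delete removes everything under a folder, it can be worth checking that the folder is empty before removing it. The following is a minimal sketch, assuming only the ls(path, detail=False) and rm(path) calls already used in this notebook. The helper remove_folder_if_empty, the InMemoryFS class and the path '/user/empty' are hypothetical and exist only so the example runs without a cluster.

```python
def remove_folder_if_empty(fs, path):
    """Delete path only when it has no entries; return True if deleted.

    fs is any object offering ls(path, detail=False) and rm(path), the
    calls this notebook makes on its hdfs3 connection.
    """
    if fs.ls(path, detail=False):
        return False
    fs.rm(path)
    return True


class InMemoryFS:
    """Hypothetical stand-in used to exercise the helper without HDFS."""

    def __init__(self, folders):
        self.folders = {name: list(entries) for name, entries in folders.items()}

    def ls(self, path, detail=False):
        return list(self.folders[path])

    def rm(self, path):
        del self.folders[path]


fake = InMemoryFS({
    '/user/lainotik': ['/user/lainotik/example.csv'],
    '/user/empty': [],
})
print(remove_folder_if_empty(fake, '/user/lainotik'))  # folder has content, kept
print(remove_folder_if_empty(fake, '/user/empty'))     # folder is empty, removed
```

With a real connection, the call would be remove_folder_if_empty(hdfs, hdfspath).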


## Parquet files with HDFS

In [55]:
import pyarrow.parquet as pq