# HDFS with Jupyter

This notebook shows simple snippets for accessing a remote HDFS and working with data files in different formats (CSV, Parquet, ORC). It uses the pyarrow and hdfs3 packages. For API details about these packages, see:

https://hdfs3.readthedocs.io/en/latest/api.html

https://arrow.apache.org/docs/python/api.html

For these snippets to work, the variable "nameNodeHost" must be set to the location of the remote HDFS namenode.
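
One way to avoid hard-coding the namenode location is to read it from the environment. The sketch below is an assumption, not part of the original notebook: the environment variable names `HDFS_NAMENODE_HOST` and `HDFS_NAMENODE_PORT` and the helper `resolve_namenode` are hypothetical, and the defaults mirror the values used in the connection cell.

```python
import os

# Hypothetical environment variable names; adjust them to your deployment.
HOST_VAR = "HDFS_NAMENODE_HOST"
PORT_VAR = "HDFS_NAMENODE_PORT"


def resolve_namenode(env=None, default_host="jupyterhdfs_namenode", default_port=8020):
    """Return (host, port) for the namenode, preferring environment overrides."""
    env = os.environ if env is None else env
    host = env.get(HOST_VAR, default_host)
    port = int(env.get(PORT_VAR, default_port))
    return host, port


nameNodeHost, nameNodeIPCPort = resolve_namenode()
```

The resulting `nameNodeHost` and `nameNodeIPCPort` can then be passed to `hdfs3.HDFileSystem` exactly as in the connection cell.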

## Connect to hdfs namenode

In [24]:
# Connect to hdfs
import os
import pandas as pd
import hdfs3
import pyarrow as pa

nameNodeHost = 'jupyterhdfs_namenode'
nameNodeIPCPort = 8020
hdfs = hdfs3.HDFileSystem(nameNodeHost, port=nameNodeIPCPort)

## CSV files with Hdfs

### Create folder and put csv file in hdfs

In [25]:
# Display example.csv file
localPath = os.getcwd()
dataPath = localPath + '/data'
csvFile = '/example.csv'

print ("Local data files: %s " % os.listdir(dataPath))
print ("Content of local data %s: " % csvFile)
df = pd.read_csv(dataPath + csvFile)
df

Local data files: ['example.csv'] 
Content of local data /example.csv: 


Unnamed: 0,name,age
0,john,22
1,mary,34
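
The `Unnamed: 0` column in the output above is the name pandas gives to a column whose header cell is empty, which typically happens when a DataFrame was saved together with its index. The sketch below uses only the standard library to make the empty header cell visible. The CSV text is an assumed reconstruction of example.csv based on the displayed output, not the actual file.

```python
import csv
import io

# Assumed content of example.csv, reconstructed from the table shown above.
csv_text = ",name,age\n0,john,22\n1,mary,34\n"

rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

# The first header cell is empty: it holds a saved row index, not a data field.
has_saved_index = header[0] == ""
data_columns = header[1:] if has_saved_index else header
print(header, data_columns)
```

With pandas, passing `index_col=0` to `pd.read_csv` treats that first column as the index instead of a data column.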


In [26]:
# Create folder in hdfs under /user and copy the local csv file
localPath = os.getcwd()
dataPath = localPath + '/data'
hdfspath = '/user/lainotik'

hdfs.mkdir(hdfspath)
hdfs.put(dataPath + csvFile, hdfspath + csvFile)
print ("Hdfs files: %s " % hdfs.ls(hdfspath, detail=False))

Hdfs files: ['/user/lainotik/example.csv'] 
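
The cells above build paths by string concatenation, which relies on each file name carrying a leading slash. A possible alternative, sketched here with hypothetical variable names, uses `os.path.join` for local paths and `posixpath.join` for HDFS paths, since HDFS paths use forward slashes regardless of the local operating system.

```python
import os
import posixpath

file_name = "example.csv"    # hypothetical: file name without a leading slash
hdfs_dir = "/user/lainotik"  # same HDFS folder as in the cell above

# Local path: follows the conventions of the machine running the notebook.
local_file = os.path.join(os.getcwd(), "data", file_name)

# HDFS path: always POSIX style.
hdfs_file = posixpath.join(hdfs_dir, file_name)
print(hdfs_file)
```

Note that joining with a name that starts with a slash would discard the folder part, so the leading slash has to be dropped from the file names when using this approach.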


### Get csv File from hdfs

In [27]:
# Copy the file just created in hdfs back to the local data folder
csvFileCopy = '/copy.csv'

hdfs.get(hdfspath + csvFile, dataPath + csvFileCopy)
print ("Local files: %s " % os.listdir(dataPath))
print ("Content of local data %s: " % csvFileCopy)
df = pd.read_csv(dataPath + csvFileCopy)
df

Local files: ['example.csv', 'copy.csv'] 
Content of local data /copy.csv: 


Unnamed: 0,name,age
0,john,22
1,mary,34
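
Displaying the DataFrame shows that the copy parses, but not that it is byte-for-byte identical to the original. A minimal sketch of such a check, using a hypothetical helper `sha256_of` and only the standard library, could look like this. In the notebook the two paths would be `dataPath + csvFile` and `dataPath + csvFileCopy`; here two temporary files stand in for them.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-contained demonstration: two temporary files with the same content
# play the role of example.csv and copy.csv.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "example.csv")
    copy = os.path.join(tmp, "copy.csv")
    for path in (original, copy):
        with open(path, "w") as handle:
            handle.write(",name,age\n0,john,22\n1,mary,34\n")
    copies_match = sha256_of(original) == sha256_of(copy)

print(copies_match)
```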


### Delete folder and csv file in hdfs

In [28]:
# Delete hdfs file
hdfs.rm(hdfspath + csvFile)
print ("Hdfs files: %s " % hdfs.ls(hdfspath, detail=False))

# Delete hdfs folder.
hdfs.rm(hdfspath)
print ("Hdfs folders under user: %s " % hdfs.ls('/user'))

# Note: hdfs3's rm takes a recursive keyword argument, not a shell-style -r flag. To remove a folder together with its content use hdfs.rm(hdfspath, recursive=True)

# The local csv copy will also be removed
os.remove(dataPath + csvFileCopy)

Hdfs files: [] 
Hdfs folders under user: ['/user/hive'] 
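
Re-running the delete cell after the folder is already gone can fail, because the path no longer exists. A small guard makes the cleanup repeatable. The helper name `remove_if_exists` is hypothetical; it assumes only that the filesystem object offers `exists(path)` and `rm(path, recursive=...)`, as hdfs3's `HDFileSystem` does. The `FakeFS` class is an in-memory stand-in that exists solely so the sketch runs without a cluster.

```python
def remove_if_exists(fs, path, recursive=False):
    """Delete path on fs if present; return True when something was removed."""
    if not fs.exists(path):
        return False
    fs.rm(path, recursive=recursive)
    return True


class FakeFS:
    """In-memory stand-in for an HDFS client, used only for this demonstration."""

    def __init__(self, paths):
        self.paths = set(paths)

    def exists(self, path):
        return path in self.paths

    def rm(self, path, recursive=False):
        prefix = path.rstrip("/") + "/"
        children = {p for p in self.paths if p.startswith(prefix)}
        if children and not recursive:
            raise OSError("directory not empty: %s" % path)
        self.paths -= children | {path}


fs = FakeFS(["/user/lainotik", "/user/lainotik/example.csv"])
first = remove_if_exists(fs, "/user/lainotik", recursive=True)   # removes folder and file
second = remove_if_exists(fs, "/user/lainotik", recursive=True)  # nothing left to remove
print(first, second)
```

In the notebook, the same helper would be called with the real `hdfs` object in place of `fs`.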


## Parquet files with Hdfs

### Convert local csv file to parquet

In [42]:
import pyarrow.parquet as pq

# Convert local csv file to local parquet
localPath = os.getcwd()
dataPath = localPath + '/data'
csvFile = '/example.csv'
parquetFile = '/example.parquet'

df = pd.read_csv(dataPath + csvFile)
table = pa.Table.from_pandas(df)
pq.write_table(table, dataPath + parquetFile)
print ("Local data files: %s " % os.listdir(dataPath))

Local data files: ['example.csv', 'example.parquet'] 


### Create folder and put parquet file in hdfs

In [43]:
# Create folder in hdfs under /user and copy the local parquet file
localPath = os.getcwd()
dataPath = localPath + '/data'
hdfspath = '/user/lainotik'

hdfs.mkdir(hdfspath)
hdfs.put(dataPath + parquetFile, hdfspath + parquetFile)
print ("Hdfs files: %s " % hdfs.ls(hdfspath, detail=False))

Hdfs files: ['/user/lainotik/example.parquet'] 


### Get parquet File from hdfs

In [44]:
# Copy just created file from hdfs to local
parquetFileCopy = '/copy.parquet'

hdfs.get(hdfspath + parquetFile, dataPath + parquetFileCopy)
print ("Local files: %s " % os.listdir(dataPath))
print ("Content of local data %s: " % parquetFileCopy)
df = pd.read_parquet(dataPath + parquetFileCopy, engine='pyarrow')
df

Local files: ['copy.parquet', 'example.csv', 'example.parquet'] 
Content of local data /copy.parquet: 


Unnamed: 0,name,age
0,john,22
1,mary,34


### Delete folder and parquet file

In [45]:
# Delete hdfs file
hdfs.rm(hdfspath + parquetFile)
print ("Hdfs files: %s " % hdfs.ls(hdfspath, detail=False))

# Delete hdfs folder.
hdfs.rm(hdfspath)
print ("Hdfs folders under user: %s " % hdfs.ls('/user'))

# Note: hdfs3's rm takes a recursive keyword argument, not a shell-style -r flag. To remove a folder together with its content use hdfs.rm(hdfspath, recursive=True)

# The local parquet files will also be removed
os.remove(dataPath + parquetFileCopy)
os.remove(dataPath + parquetFile)

Hdfs files: [] 
Hdfs folders under user: ['/user/hive'] 


## ORC files with Hdfs

### Convert local csv file to ORC

In [46]:
import pyarrow.orc as po

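When this notebook was written, pyarrow appeared to offer functionality for reading ORC files but not for writing a table in ORC format. Reading an existing ORC file looks like this (orcFile is not defined earlier in this notebook and must name an ORC file under dataPath; pd.read_orc reads through pyarrow and has no engine parameter):

df = pd.read_orc(dataPath + orcFile)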
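
Whether ORC writing is available depends on the installed pyarrow version and build, so it can be safer to probe for it at runtime than to assume either way. This is a sketch: the helper `orc_capabilities` is hypothetical, and it only checks that `pyarrow.orc` exposes the `ORCFile` reader class and a `write_table` function, without calling either.

```python
import importlib


def orc_capabilities():
    """Report which ORC features the installed pyarrow exposes, if any."""
    try:
        orc = importlib.import_module("pyarrow.orc")
    except ImportError:
        # pyarrow is missing or was built without ORC support.
        return {"read": False, "write": False}
    return {
        "read": hasattr(orc, "ORCFile"),
        "write": hasattr(orc, "write_table"),
    }


print(orc_capabilities())
```

If the `write` entry is true, the conversion could follow the same pattern as the Parquet section, with the ORC writer in place of `pq.write_table`.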