Python WebHDFS

WebHDFS python client library and simple shell.

Table of Contents

  • Prerequisites
  • Installation
  • API
  • Usage
  • License

Installation

Install python-webhdfs as a Debian package by building a deb:

dpkg-buildpackage
# or
pdebuild

Install python-webhdfs using the standard setuptools script:

python setup.py install

API

To use the WebHDFS Client API, start by importing the class from the module:

>>> from webhdfs import WebHDFSClient

All functions may raise a WebHDFSError exception or one of its subclasses:

| Exception Type | Remote Exception | Description |
| --- | --- | --- |
| WebHDFSConnectionError | | Unable to connect to active NameNode |
| WebHDFSIncompleteTransferError | | Transferred file doesn't match origin size |
| WebHDFSAccessControlError | AccessControlException | Access to specified path denied |
| WebHDFSIllegalArgumentError | IllegalArgumentException | Invalid parameter value |
| WebHDFSFileNotFoundError | FileNotFoundException | Specified path does not exist |
| WebHDFSSecurityError | SecurityException | Failed to obtain user/group information |
| WebHDFSUnsupportedOperationError | UnsupportedOperationException | Requested operation is not implemented |
| WebHDFSUnknownRemoteError | | Remote exception unrecognized |
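Because every specific error derives from WebHDFSError, callers can trap failures at whatever granularity they need. A minimal sketch of the idea (the class names come from the table above, but the bodies and the `lookup` helper are hypothetical, not the library's implementation):

```python
# Sketch of an exception hierarchy like the one above: every
# specific error subclasses WebHDFSError, so one except clause
# can trap all of them.
class WebHDFSError(Exception):
    pass

class WebHDFSConnectionError(WebHDFSError):
    pass

class WebHDFSFileNotFoundError(WebHDFSError):
    pass

def lookup(path, known=('/user',)):
    # Hypothetical helper that raises the most specific subclass.
    if path not in known:
        raise WebHDFSFileNotFoundError(path)
    return path

try:
    lookup('/missing')
except WebHDFSError as e:
    caught = type(e).__name__
```

Catching the base class covers all of the subclasses at once; catching a subclass narrows the handler to that one failure mode.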

WebHDFSClient

__init__(base, user, conf=None, wait=None)

Creates a new WebHDFSClient object

Parameters:

  • base: base webhdfs url. (e.g. http://localhost:50070)
  • user: user name with which to access all resources
  • conf: (optional) path to hadoop configuration directory for NameNode HA resolution
  • wait: (optional) floating point number in seconds for request timeout waits
>>> import getpass
>>> hdfs = WebHDFSClient('http://localhost:50070', getpass.getuser(), conf='/etc/hadoop/conf', wait=1.5)

stat(path, catch=False)

Retrieves metadata about the specified HDFS item. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=GETFILESTATUS

Parameters:

  • path: HDFS path to fetch
  • catch: (optional) trap WebHDFSFileNotFoundError instead of raising the exception

Returns:

  • A single WebHDFSObject object for the specified path.
  • False if object not found in HDFS and catch=True.
>>> o = hdfs.stat('/user')
>>> print o.full
/user
>>> print o.kind
DIRECTORY
>>> o = hdfs.stat('/foo', catch=True)
>>> print o
False
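Each call maps onto a plain HTTP request, so the GETFILESTATUS operation above can be reconstructed with nothing but the standard library. A sketch of building that request URL (the `user.name` query parameter comes from the WebHDFS specification; the host and user are placeholders):

```python
from urllib.parse import quote, urlencode

base = 'http://localhost:50070'   # placeholder NameNode address
path = '/user'
query = urlencode({'op': 'GETFILESTATUS', 'user.name': 'max'})

# <BASE>/webhdfs/v1/<PATH>?op=GETFILESTATUS, as described above.
url = '%s/webhdfs/v1%s?%s' % (base, quote(path), query)
```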

ls(path, recurse=False, request=False)

Lists a specified HDFS path. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS

Parameters:

  • path: HDFS path to list
  • recurse: (optional) descend down the directory tree
  • request: (optional) filter request callback for each returned object

Returns:

  • Generator producing children WebHDFSObject objects for the specified path.
>>> l = list(hdfs.ls('/')) # must convert to list if referencing by index
>>> print l[0].full
/user
>>> print l[0].kind
DIRECTORY
>>> l = list(hdfs.ls('/user', request=lambda x: x.name.startswith('m')))
>>> print l[0].full
/user/max
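The request callback acts as a per-object filter applied during listing. A minimal sketch of that filtering pattern (`filtered` and the sample names are hypothetical, not library code):

```python
def filtered(items, request=None):
    # Yield only the items the callback accepts, mirroring the
    # ls(request=...) behaviour described above.
    for item in items:
        if request is None or request(item):
            yield item

names = ['max', 'mary', 'bob']
picked = list(filtered(names, request=lambda x: x.startswith('m')))
```

Returning a generator keeps filtering lazy, so objects are discarded as soon as the callback rejects them.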

glob(path)

Lists a specified HDFS path pattern. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS

Parameters:

  • path: HDFS path pattern to list

Returns:

  • A list of WebHDFSObject objects for paths matching the specified pattern.
>>> l = hdfs.glob('/us*')
>>> print l[0].full
/user
>>> print l[0].kind
DIRECTORY
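Shell-style patterns like /us* can be matched with the standard fnmatch module; a sketch of how such a pattern might be applied to a set of listed paths (the sample names are made up):

```python
import fnmatch

names = ['/user', '/tmp', '/usr-data']

# Keep only the paths that match the shell-style pattern.
matches = [n for n in names if fnmatch.fnmatch(n, '/us*')]
```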

du(path, real=False)

Gets the usage of a specified HDFS path. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMARY

Parameters:

  • path: HDFS path to analyze
  • real: (optional) selects the return type, as described under Returns below

Returns:

  • If real is None: Instance of a du object: du(dirs=, files=, hdfs_usage=, disk_usage=, hdfs_quota=, disk_quota=)
  • If real is a string: Integer for the du object attribute name.
  • If real is boolean True: Integer of hdfs bytes used by the specified path.
  • If real is boolean False: Integer of disk bytes used by the specified path.
>>> u = hdfs.du('/user')
>>> print u
110433
>>> u = hdfs.du('/user', real=True)
>>> print u
331299
>>> u = hdfs.du('/user', real='disk_quota')
>>> print u
-1
>>> u = hdfs.du('/user', real=None)
>>> print u
du(dirs=3, files=5, hdfs_usage=110433, disk_usage=331299, hdfs_quota=-1, disk_quota=-1)
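The structured return value shown above behaves like a named tuple; a sketch of an equivalent structure (field values copied from the example, where disk_usage is hdfs_usage times a replication factor of 3):

```python
from collections import namedtuple

# Mirror of the du(...) structure printed above.
du = namedtuple('du', 'dirs files hdfs_usage disk_usage hdfs_quota disk_quota')
u = du(dirs=3, files=5, hdfs_usage=110433, disk_usage=331299,
       hdfs_quota=-1, disk_quota=-1)
```

Attribute access on such a tuple corresponds to the real='<name>' form of the call.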

mkdir(path)

Creates the specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=MKDIRS

Parameters:

  • path: HDFS path to create

Returns:

  • Boolean True
>>> hdfs.mkdir('/user/%s/test' % getpass.getuser())
True

mv(path, dest)

Moves/renames the specified HDFS path to the specified destination. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=RENAME&destination=<DEST>

Parameters:

  • path: HDFS path to move/rename
  • dest: Destination path

Returns:

  • Boolean True on success and False on error
>>> hdfs.mv('/user/%s/test' % getpass.getuser(), '/user/%s/test.old' % getpass.getuser())
True
>>> hdfs.mv('/user/%s/test.old' % getpass.getuser(), '/some/non-existent/path')
False

rm(path)

Removes the specified HDFS path. Uses this WebHDFS REST request:

DELETE <BASE>/webhdfs/v1/<PATH>?op=DELETE

Parameters:

  • path: HDFS path to remove

Returns:

  • Boolean True
>>> hdfs.rm('/user/%s/test' % getpass.getuser())
True

repl(path, num)

Sets the replication factor for the specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETREPLICATION

Parameters:

  • path: HDFS path to change
  • num: new replication factor to apply

Returns:

  • Boolean True on success, False otherwise
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
1
>>> hdfs.repl('/user/%s/test' % getpass.getuser(), 3)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
3

chown(path, owner='', group='')

Sets the owner and/or group of a specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<OWNER>][&group=<GROUP>]

Parameters:

  • path: HDFS path to change
  • owner: (optional) new object owner
  • group: (optional) new object group

Returns:

  • Boolean True if ownership successfully applied

Raises:

  • WebHDFSIllegalArgumentError if both owner and group are unspecified or empty
>>> hdfs.chown('/user/%s/test' % getpass.getuser(), owner='other_owner', group='other_group')
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).owner
'other_owner'
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).group
'other_group'

chmod(path, perm)

Sets the permission of a specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETPERMISSION&permission=<PERM>

Parameters:

  • path: HDFS path to change
  • perm: new object permission

Returns:

  • Boolean True if permission successfully applied

Raises:

  • WebHDFSIllegalArgumentError if perm is not an octal integer of at most 0777
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rwxr-xr-x'
>>> hdfs.chmod('/user/%s/test' % getpass.getuser(), perm=0644)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rw-r--r--'

touch(path, time=None)

Sets the modification time of a specified HDFS path, optionally creating it. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETTIMES&modificationtime=<TIME>

Parameters:

  • path: HDFS path to change
  • time: (optional) object modification time, represented as a Python datetime object or int epoch timestamp, defaulting to current time

Returns:

  • Boolean True if modification time successfully changed

Raises:

  • WebHDFSIllegalArgumentError if time is not a valid type
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser())
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2019, 1, 28, 12, 10, 20)
>>> import datetime
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser(), datetime.datetime(2018, 9, 27, 11, 1, 17))
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2018, 9, 27, 11, 1, 17)
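The SETTIMES operation transmits modificationtime as milliseconds since the epoch, so a datetime argument has to be converted before it goes on the wire. A sketch of that conversion, treating the timestamp as UTC (the library's actual time-zone handling may differ):

```python
import calendar
import datetime

dt = datetime.datetime(2018, 9, 27, 11, 1, 17)

# timegm() interprets the tuple as UTC; WebHDFS expects milliseconds.
ms = calendar.timegm(dt.timetuple()) * 1000
```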

get(path, data=None)

Fetches the specified HDFS path. Returns a string or writes a file, based on parameters. Uses this WebHDFS request:

GET <BASE>/webhdfs/v1/<PATH>?op=OPEN

Parameters:

  • path: HDFS path to fetch
  • data: (optional) file-like object open for write

Returns:

  • Boolean True if data is set and written file size matches source
  • String contents of the fetched file if data is None

Raises:

  • WebHDFSIncompleteTransferError

put(path, data)

Creates the specified HDFS file from the contents of a file-like object open for read, or from the value of a string. Uses this WebHDFS request:

PUT <BASE>/webhdfs/v1/<PATH>?op=CREATE

Parameters:

  • path: HDFS path to create
  • data: file-like object open for read or string

Returns:

  • Boolean True if written file size matches source

Raises:

  • WebHDFSIncompleteTransferError
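Because both get() and put() accept file-like objects, callers can stream to or from memory as easily as disk. A sketch of how put()'s string-or-file-like contract for data might be normalized (`read_payload` is a hypothetical helper, not part of the library):

```python
import io

def read_payload(data):
    # Accept either a string or a file-like object open for read,
    # mirroring the data parameter contract described above.
    return data.read() if hasattr(data, 'read') else data

from_string = read_payload('hello')
from_file = read_payload(io.StringIO('hello'))
```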

calls

Read-only property that retrieves number of HTTP requests performed so far.

>>> l = list(hdfs.ls('/user', recurse=True))
>>> hdfs.calls
11

WebHDFSObject

__init__(path, bits)

Creates a new WebHDFSObject object

Parameters:

  • path: HDFS path prefix
  • bits: attribute dictionary as returned by a stat() or ls() call
>>> o = hdfs.stat('/')
>>> type(o)
<class 'webhdfs.attrib.WebHDFSObject'>

is_dir()

Determines whether the HDFS object is a directory or not.

Parameters: None

Returns:

  • boolean True when object is a directory, False otherwise
>>> o = hdfs.stat('/')
>>> o.is_dir()
True

is_empty()

Determines whether the HDFS object is empty or not.

Parameters: None

Returns:

  • boolean True when the object is a directory with no children, or a file of zero size; False otherwise
>>> o = hdfs.stat('/')
>>> o.is_empty()
False

owner

Read-only property that retrieves the HDFS object owner.

>>> o = hdfs.stat('/')
>>> o.owner
'hdfs'

group

Read-only property that retrieves the HDFS object group.

>>> o = hdfs.stat('/')
>>> o.group
'supergroup'

name

Read-only property that retrieves the HDFS object base file name.

>>> o = hdfs.stat('/user/max')
>>> o.name
'max'

full

Read-only property that retrieves the HDFS object full file name.

>>> o = hdfs.stat('/user/max')
>>> o.full
'/user/max'

size

Read-only property that retrieves the HDFS object size in bytes.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.size
20552

repl

Read-only property that retrieves the HDFS object replication factor.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.repl
1

kind

Read-only property that retrieves the HDFS object type (FILE or DIRECTORY).

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.kind
'FILE'

date

Read-only property that retrieves the HDFS object last modification timestamp as a Python datetime object.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.date
datetime.datetime(2015, 3, 7, 3, 53, 6)

mode

Read-only property that retrieves the HDFS object symbolic permissions mode.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.mode
'-rw-r--r--'

perm

Read-only property that retrieves the HDFS object octal permissions mode, usable by Python's stat module.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> oct(o.perm)
'0100644'
>>> stat.S_ISDIR(o.perm)
False
>>> stat.S_ISREG(o.perm)
True
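The perm value is a standard Unix mode word, so all of Python's stat helpers apply to it directly. For example, using the 0o100644 value from above (stat.filemode requires Python 3.3+):

```python
import stat

perm = 0o100644

# filemode() renders the same symbolic string as the mode property.
symbolic = stat.filemode(perm)
is_file = stat.S_ISREG(perm)
is_dir = stat.S_ISDIR(perm)
```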

Usage

usage: webhdfs [-h] [-d CWD] [-l LOG] [-c CFG] [-t TIMEOUT] [-v]
               url [cmd [cmd ...]]

webhdfs shell

positional arguments:
  url                   webhdfs base url
  cmd                   run this command and exit

optional arguments:
  -h, --help            show this help message and exit
  -d CWD, --cwd CWD     initial hdfs directory
  -l LOG, --log LOG     logger destination url
  -c CFG, --cfg CFG     hdfs configuration dir
  -t TIMEOUT, --timeout TIMEOUT
                        request timeout in seconds
  -v, --version         print version and exit

supported logger formats:
  console://?level=LEVEL
  file://PATH?level=LEVEL
  syslog+tcp://HOST:PORT/?facility=FACILITY&level=LEVEL
  syslog+udp://HOST:PORT/?facility=FACILITY&level=LEVEL
  syslog+unix://PATH?facility=FACILITY&level=LEVEL
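Logger destinations of this shape can be decomposed with the standard urllib.parse module; a sketch of pulling the pieces out of one of the syslog formats above (the host, port, facility, and level are placeholders):

```python
from urllib.parse import urlparse, parse_qs

u = urlparse('syslog+udp://localhost:514/?facility=user&level=INFO')

# Flatten the single-valued query parameters into a plain dict.
opts = {k: v[0] for k, v in parse_qs(u.query).items()}
```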

Parameters:

  • url: base url for the WebHDFS endpoint, supporting http, https, and hdfs schemes
  • cmd: (optional) run the specified command with args and exit without starting the shell
  • -d | --cwd: (optional) initial hdfs directory to switch to on shell invocation
  • -l | --log: (optional) logger destination url as described by supported formats
  • -c | --cfg: (optional) hadoop configuration directory for NameNode HA resolution
  • -t | --timeout: (optional) request timeout in seconds as floating point number
  • -v | --version: (optional) print shell/library version and exit

Environment Variables:

  • HADOOP_CONF_DIR: alternative to the -c | --cfg command-line parameter, taking precedence over it when both are set
  • WEBHDFS_HISTFILE: (optional) specify the preserved history file, defaulting to ~/.webhdfs_history
  • WEBHDFS_HISTSIZE: (optional) specify the preserved history size, defaulting to 1000; set to 0 to disable
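The history settings resolve to simple defaults when the variables are unset; a sketch of that resolution logic (hypothetical, not the shell's actual code):

```python
import os

# Fall back to the documented defaults when the variables are unset.
histfile = os.environ.get('WEBHDFS_HISTFILE',
                          os.path.expanduser('~/.webhdfs_history'))
histsize = int(os.environ.get('WEBHDFS_HISTSIZE', 1000))
```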

License

MIT
