Find implementation for Hadoop
Java Shell
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.



hfind is a find(1) implementation for Hadoop.

Getting started

Download the latest version of the hfind tarball here.

The tarball contains two files, a shell script and a jar.

You first need to configure hfind to find your Hadoop installation, via the HFIND_OPTS variable:

export HFIND_OPTS="-Dhfind.hadoop.ugi='pierre\,pierre' -Dhfind.hadoop.namenode.url=hdfs://"

You can then simply run ./hfind from the exploded tarball directory:

[pierre@mouraf ~/downloads]$ ls
[pierre@mouraf ~/downloads]$ tar zxvf metrics.hfind-1.0.0-SNAPSHOT.tar.gz
[pierre@mouraf ~/downloads]$ export HFIND_OPTS="-Dhfind.hadoop.ugi='pierre\,pierre' -Dhfind.hadoop.namenode.url=hdfs://
[pierre@mouraf ~/downloads]$ ./hfind /user/pierre -type f -name "file*.dat" -o -type d -name a


[pierre@mouraf ~]$ hfind /user/pierre -type f -name "file*.dat" -o -type d -name a

[pierre@mouraf ~]$ hfind /user/pierre -mtime +3 -type d


hfind follows as close as possible the POSIX specification documented here.

As of 2010, some options cannot be implemented (e.g. symbolic links).

-maxdepth/-d follows the BSD implementation:

 -d      Cause find to perform a depth-first traversal, i.e., directories are visited in post-order and all entries in a
         directory will be acted on before the directory itself.  By default, find visits directories in pre-order, i.e.,
         before their contents.  Note, the default is not a breadth-first traversal.
         This option is equivalent to the -depth primary of IEEE Std 1003.1-2001 (``POSIX.1'').

-maxdepth n
         Descend at most n directory levels below the command line arguments.  If any -maxdepth
         primary is specified, it applies to the entire expression even if it would not normally be evaluated.
         -maxdepth 0 limits the whole search to the command line arguments.

See TODO below for work in progress.

Things to remember

A directory size is always reported as zero. Use -empty to find empty directories.

A directory mtime is the date of the latest modification of any file in the directory, going one level only. You can have more recent files in subdirectories:

[pierre@mouraf ~]$ find /tmp/a -ls
45404725        0 drwxr-xr-x    4 pierre   wheel         136 Sep 23 17:10 /tmp/a
45404727        0 drwxr-xr-x    3 pierre   wheel         102 Sep 23 17:11 /tmp/a/b
45404730        0 -rw-r--r--    1 pierre   wheel           0 Sep 23 17:11 /tmp/a/b/hello
45404726        0 -rw-r--r--    1 pierre   wheel           0 Sep 23 17:10 /tmp/a/hello

In the previous example, mtime of /tmp/a is 17:10 although, it contains /tmp/a/b/hello, modified later.


  • Respect parentheses

  • Implement -prune

  • Implement -exec and -ok

  • Implement -perm

  • Implement -depth and -mindepth


mvn install

By default, hfind bundles Hadoop version 0.20.2. One can override it via the hadoop.version parameter, e.g.

mvn clean install
mvn -Dhadoop.version=0.20.2-cdh3u1 clean install
mvn -Dhadoop.version=0.20.2-cdh3u2 clean install
mvn -Dhadoop.version=0.20.2-cdh3u3 clean install
mvn -Dhadoop.version=1.0.0 clean install
mvn -Dhadoop.version=1.0.1 clean install
mvn -Dhadoop.version=1.0.2 clean install

License (see COPYING file for full license)

Copyright 2010-2012 Ning

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.