Hadoop library for large-scale data processing, now an Apache Incubator project

Apache DataFu

Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.

It consists of two libraries:

  • Apache DataFu Pig: a collection of user-defined functions for Apache Pig
  • Apache DataFu Hourglass: an incremental processing framework for MapReduce jobs on Apache Hadoop
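
To give a rough feel for what "incremental processing" means here, the following is a minimal Python sketch of the general idea: keep the previously computed aggregate and fold in only the newest day's input, rather than recomputing over all historical data. It is not Hourglass code and does not use the Hourglass API. The function names (`count_events`, `incremental_update`) and the sample data are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def count_events(records):
    """Count occurrences of each key in one batch of input records."""
    return Counter(records)


def incremental_update(previous_totals, new_day_records):
    """Merge counts from only the new day's records into the prior totals.

    The earlier days are never re-read; their contribution is already
    captured in previous_totals.
    """
    updated = Counter(previous_totals)  # copy so the prior state is untouched
    updated.update(count_events(new_day_records))
    return updated


# Hypothetical sample input: one list of user IDs per day.
day1 = ["alice", "bob", "alice"]
day2 = ["bob", "carol"]

# Incremental path: process each day once, carrying the totals forward.
totals = incremental_update(Counter(), day1)
totals = incremental_update(totals, day2)

# Full recomputation over all days, for comparison.
full = count_events(day1 + day2)
```

In this sketch the incremental result matches the full recomputation, while the second step only touches the new day's records. That trade (storing prior output to avoid reprocessing old input) is the kind of saving an incremental framework aims for on large, date-partitioned data sets.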

DataFu is currently undergoing incubation with Apache. A mirror of the official git repository can be found on GitHub at https://github.com/apache/incubator-datafu.

For more information please visit the website:

If you'd like to jump in and get started, check out the corresponding guides for each library:

Blog Posts



Other Resources

An interesting example of using Quantile from DataFu can be found in the Hadoop Real-World Solutions Cookbook.

From Around the Web


Getting Help

Please visit the website: