Hadoop library for large-scale data processing, now an Apache Incubator project

Apache DataFu


Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.

It consists of two libraries:

  • Apache DataFu Pig: a collection of user-defined functions for Apache Pig
  • Apache DataFu Hourglass: an incremental processing framework for Apache Hadoop in MapReduce
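To give a feel for what "incremental processing" means here, the following is a minimal Python sketch of the general idea: a sliding-window aggregate is updated by adding the newest day of data and subtracting the expired day, rather than rescanning the whole window. It does not use the Hourglass API. The function `merge_day` and the page-view data are hypothetical names and values made up for illustration.

```python
from collections import Counter


def merge_day(running_total, new_day_counts, expired_day_counts=None):
    """Update a windowed count without re-reading the whole window.

    Hypothetical helper for illustration only; not part of DataFu.
    """
    updated = Counter(running_total)
    updated.update(new_day_counts)            # add the newest day's counts
    if expired_day_counts is not None:
        updated.subtract(expired_day_counts)  # remove the day leaving the window
    # Drop keys whose count fell to zero so the result stays compact.
    return Counter({key: count for key, count in updated.items() if count > 0})


# Made-up per-day page-view counts.
day1 = Counter({"home": 3, "search": 1})
day2 = Counter({"home": 2, "jobs": 4})
day3 = Counter({"search": 5})

# Build a two-day window over day1 and day2.
window = merge_day(Counter(), day1)
window = merge_day(window, day2)

# Slide the window forward: add day3 and drop day1.
window = merge_day(window, day3, expired_day_counts=day1)
```

After the slide, `window` holds the same counts as recomputing over day2 and day3 from scratch, but only two days of input were touched in the update. This shortcut only works for aggregates that can be "un-merged" (such as counts and sums).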

DataFu is currently undergoing incubation with Apache. A mirror of the official git repository is available on GitHub.

For more information, please visit the project website.

If you'd like to jump in and get started, check out the getting-started guide for each library.

Blog Posts



Other Resources

An interesting example of using Quantile from DataFu can be found in the Hadoop Real-World Solutions Cookbook.

From Around the Web


Getting Help

Please visit the project website.
