Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
It consists of two libraries:
- Apache DataFu Pig: a collection of user-defined functions for Apache Pig
- Apache DataFu Hourglass: an incremental processing framework for MapReduce jobs on Apache Hadoop
DataFu is currently undergoing incubation at the Apache Software Foundation. A mirror of the official Git repository is available on GitHub at https://github.com/apache/incubator-datafu.
For more information please visit the website:
If you'd like to jump in and get started, check out the corresponding guides for each library:
- Introducing DataFu
- DataFu: The WD-40 of Big Data
- DataFu 1.0
- DataFu's Hourglass: Incremental Data Processing in Hadoop
- A Brief Tour of DataFu
- Building Data Products at LinkedIn with DataFu
- Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)
- DataFu @ ApacheCon 2014
A worked example of the Quantile UDF from DataFu can be found in the Hadoop Real-World Solutions Cookbook.
- DataFu Enters Incubation Status at Apache
- DataFu: Open Source Apache Pig UDFs by LinkedIn
- LinkedIn Opens DataFu: A Library for Working with Hadoop and Pig
Please visit the website: