Howdah (http://en.wikipedia.org/wiki/Howdah) is a project that aims to provide tools for large-scale preprocessing of documents into various output formats. The tentative plan is to use Apache Hadoop and Apache Tika, along with other tools (Lucene analysis, regular expressions, etc.), to convert documents.
Java
trunk