You can use Cascading traps with Cascalog to capture tuples whose processing fails. To store those tuples in a sink tap (for example a local file or hfs-textline), use the :trap keyword with an error sink:
(def errors (lfs-textline "file:///tmp/people.bad_records" :sinkmode :replace))
;; or (stdout) or (hfs-textline "hdfs:///tmp/...") if running on Hadoop
(<- [?name ?age]
    (people ?name ?age)
    (:trap errors)
    (< ?age 40))
midje-cascalog uses, for example, fact?- to execute a query and compare its output with the expected tuples, or a Midje fact such as (facts query => (produces [[3 10] [1 5] [5 11]])), where query is defined with (def query (<- ...)). Read Sam Ritchie's blog post "Cascalog Testing 2.0" for more details and examples of midje-cascalog 0.4.0.
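A minimal sketch of such a test, assuming midje and midje-cascalog 0.4.0 are on the test classpath. The namespace, the people data, and the young query are hypothetical names made up for illustration; only the produces checker is taken from the text above.

```clojure
(ns example.core-test                 ; hypothetical namespace
  (:use cascalog.api
        [midje sweet cascalog]))

;; Hypothetical in-memory source: a vector of tuples works as a generator.
(def people [["ben" 21] ["jim" 42]])

;; Hypothetical query under test: keep people younger than 40.
(def young
  (<- [?name ?age]
      (people ?name ?age)
      (< ?age 40)))

;; The fact executes the query and checks the tuples it produces.
(fact "only people under 40 come out"
  young => (produces [["ben" 21]]))
```

Because the source is a plain vector, the test needs no input files and no cluster.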
There are certain features that support live, interactive coding:
(def people [["ben" 21] ["jim" 42]])
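As a rough illustration of interactive use, a vector like this can be queried at the REPL, with the results sent to stdout. This is a sketch that assumes cascalog.api is available in the session; the age filter is only an example predicate.

```clojure
(use 'cascalog.api)

;; In-memory tuples used as a generator instead of a file tap.
(def people [["ben" 21] ["jim" 42]])

;; ?<- defines the query and runs it immediately;
;; (stdout) is a sink that prints the resulting tuples.
(?<- (stdout)
     [?name ?age]
     (people ?name ?age)
     (< ?age 40))
```

Only the tuple for "ben" satisfies the filter, so that is the one that should be printed.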
When all the taps in a job are lfs-textlines or vectors (or stdout), you can run the -main in your jar directly using java -jar, instead of submitting it with hadoop jar. This is sometimes called local mode.
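A sketch of a job that qualifies for local mode. It assumes an uberjar that bundles Cascalog and the Hadoop libraries and names example.core as its main class. The namespace, jar name, and paths are placeholders, not names from any particular project.

```clojure
(ns example.core                      ; hypothetical namespace
  (:use cascalog.api)
  (:gen-class))                       ; generates the class whose main method calls -main

;; Copies every line of a local text file into a local output directory.
;; Both taps are lfs-textlines, so the job can run without a cluster.
(defn -main [in-path out-path]
  (?<- (lfs-textline out-path :sinkmode :replace)
       [?line]
       ((lfs-textline in-path) ?line)))

;; Local mode (jar name and paths are placeholders):
;;   java -jar example-standalone.jar /tmp/in.txt /tmp/out
;; Submitting a jar to Hadoop instead:
;;   hadoop jar example-standalone.jar <args>
```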
When your jobs are running in this local mode, you can have a lot of information logged with log4j just by putting a standard log4j.xml in the classpath root of your jar. Any exceptions thrown in jobs will be printed to the configured log file with their full stacktrace.
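For reference, a small log4j.xml along these lines could serve as a starting point. It is a sketch that assumes log4j 1.2 (the version that reads log4j.xml); the log file path, appender name, and log level are arbitrary choices, not values required by Cascalog.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- Writes log output to a file; the path is a placeholder. -->
  <appender name="file" class="org.apache.log4j.FileAppender">
    <param name="File" value="/tmp/cascalog-job.log"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
    </layout>
  </appender>
  <root>
    <priority value="info"/>
    <appender-ref ref="file"/>
  </root>
</log4j:configuration>
```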