Add multi-step job support to Haruhi #32
Comments
It's going to be "run flow".
…ble to compose jobs in spring; todo for tomorrow -- write unit test for the basekbNowFlow and add the tmpDir as a parameter in the context object for SPeL
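The Spring wiring hinted at above might look something like the following sketch; the bean names, the `Flow` class, and the SpEL expression are illustrative assumptions, not the actual basekbNowFlow definition:

```xml
<!-- Hypothetical sketch only: class and bean names are illustrative. -->
<bean id="basekbNowFlow" class="example.haruhi.Flow">
    <property name="steps">
        <list>
            <ref bean="preprocessJob"/>
            <ref bean="acceptJob"/>
        </list>
    </property>
    <!-- tmpDir surfaced as a context value so steps can reference it via SpEL -->
    <property name="tmpDir" value="#{systemProperties['java.io.tmpdir']}"/>
</bean>
```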
…ts are being generated for the basekbNowFlow
… in the cluster; next steps are adding more jobs, adding a flow runner for the EMR cluster, and fluentizing the Java/Spring XML API
The current integration test I am running is
to set up the data, then
and then we clear away the output like so
…d of one <list> but practically not the other, because I'd need to write something more complex than <value> to create string beans; this is good enough for now
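For reference, the simple case works because Spring converts a `<list>` of `<value>` elements directly into a `List<String>` property; anything richer would need nested `<bean>` elements. A hypothetical example of the string case (property name and values are illustrative):

```xml
<property name="args">
    <list>
        <value>-input</value>
        <value>/preprocessed/</value>
    </list>
</property>
```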
Right now the job works great locally with a tiny data set. Next steps are to delete the /preprocessed/ and /accepted directories. I'm wondering if we can plug into the 'hadoop fs' implementation to run commands just as we do on the command line; ideally this would be part of the bakemono JAR so it would also work remotely. Another necessary feature is to put some constraints on the inputs: there must be a minimum number of arguments, and it would also be good to validate that files are there if they need to be and not there if they don't.
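The input constraints described above could be sketched roughly like this; the class, method, and parameter names are hypothetical, not part of bakemono, and the existence check is abstracted behind a predicate so it could be backed by a local or remote filesystem:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the input constraints described above; the names
// are illustrative, not part of bakemono.
class JobInputValidator {
    /** Fail fast before launching an expensive cluster job. */
    static void validate(List<String> args, int minArgs,
                         List<String> mustExist, List<String> mustNotExist,
                         Predicate<String> exists) {
        if (args.size() < minArgs)
            throw new IllegalArgumentException(
                "expected at least " + minArgs + " arguments, got " + args.size());
        for (String path : mustExist)
            if (!exists.test(path))
                throw new IllegalStateException("required input missing: " + path);
        for (String path : mustNotExist)
            if (exists.test(path))
                throw new IllegalStateException("output already present: " + path);
    }
}
```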
Note that EMR allows a job to comprise multiple steps. Thus, we could write a Haruhi application that runs a series of Bakemono jobs.
This would have a number of beneficial effects. One is that we avoid setup and teardown time. Another is that we can keep intermediate files on the local hard drives with HDFS, avoiding network activity for reads. On top of that, we could pack several small jobs into a single hour slot, or just build a big cluster and get all of our jobs to run in an hour.
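The multi-step idea above could be sketched as a sequential runner that pays cluster setup/teardown once and stops at the first failure, so later steps never see missing intermediate output; the `Step` interface and `MultiStepRunner` name are hypothetical, not existing Haruhi types:

```java
import java.util.List;

// Illustrative sketch of the multi-step idea: run several Bakemono jobs in
// one cluster session. Names are hypothetical.
interface Step {
    boolean run();   // one Bakemono job; true on success
}

class MultiStepRunner {
    /** Runs steps in order; stops at the first failure. Returns the number
     *  of steps that completed successfully. */
    static int runAll(List<Step> steps) {
        int done = 0;
        for (Step s : steps) {
            if (!s.run()) break;
            done++;
        }
        return done;
    }
}
```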
A specific useful application would be the weekly process applied to Freebase.
Note that instead of saying "haruhi run job" we might say "haruhi run batch" or something (maybe there's a better name). These might be springable, but note that most of these are going to want validation logic for the inputs and outputs (to avoid an expensive failed job). There will probably also be some string interpolation to compute input and output paths, and possibly some intelligent selection of reducer counts as well.
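The path-interpolation piece could be as simple as a date-stamped template expansion for a weekly run; the template syntax and bucket layout here are illustrative assumptions only:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of computing input/output paths for a weekly run;
// the {date} template syntax and path layout are illustrative only.
class PathTemplate {
    static String expand(String template, LocalDate runDate) {
        return template.replace("{date}",
            runDate.format(DateTimeFormatter.ISO_LOCAL_DATE));
    }
}
```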