
Add multi-step job support to Haruhi #32

Closed
paulhoule opened this issue Sep 22, 2013 · 3 comments
@paulhoule
Owner

Note that EMR allows a job to consist of multiple steps. Thus, we could write a Haruhi application that runs a series of Bakemono jobs.

This would have a number of beneficial effects. One is that we avoid setup and teardown time between jobs. Another is that we can keep intermediate files on the local hard drives via HDFS, avoiding network activity for reads. On top of that, we could pack several small jobs into a single hour slot, or just build a big cluster and get all of our jobs to run within the hour.
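As a sketch of what multi-step support could look like, here is a minimal, hypothetical flow runner that executes steps in order and aborts on the first failure. `Flow`, `Step`, and `runFlow` are illustrative names, not part of Haruhi's actual API:

```java
import java.util.List;

// Hypothetical sketch of a multi-step flow: run a series of job steps in
// order inside one cluster session, stopping at the first failed step so
// later steps don't run against missing or bad intermediate data.
public class Flow {
    interface Step { boolean run(); }

    // Returns the number of steps that completed successfully.
    static int runFlow(List<Step> steps) {
        int completed = 0;
        for (Step s : steps) {
            if (!s.run()) break;   // abort the flow on the first failure
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
            () -> true,   // e.g. a preprocessing job
            () -> true,   // e.g. the main Bakemono job
            () -> false   // a failing step aborts the flow here
        );
        System.out.println(runFlow(steps) + " steps completed");
    }
}
```

A real runner would of course submit Hadoop jobs rather than call lambdas, but the ordering-and-abort shape is the same.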

A specific useful application would be the weekly process applied to Freebase.

Note that instead of saying "haruhi run job" we might say "haruhi run batch" or something (maybe there's a better name). These might be configurable in Spring, but note that most of them are going to want validation logic for the inputs and outputs (to avoid an expensive failed job). There will probably also be some string interpolation to compute input and output paths, and possibly some intelligent selection of reducer counts as well.
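For the string-interpolation piece, a minimal sketch of computing paths from a base directory and a date stamp. The path conventions mirror the integration test later in this thread; the class and method names are invented for illustration, not Haruhi's API:

```java
// Hypothetical sketch: derive a flow's input and output paths from a base
// path and a date stamp, instead of hard-coding them per step.
public class FlowPaths {
    // e.g. base "/freebase/" + date "1999-99-99-99-99"
    //   -> "/freebase/freebase-rdf-1999-99-99-99-99/"
    static String inputDir(String base, String date) {
        return base + "freebase-rdf-" + date + "/";
    }

    // Intermediate output for a named step, e.g. "/preprocessed/...".
    static String stepOutput(String step, String date) {
        return "/" + step + "/" + date + "/";
    }

    public static void main(String[] args) {
        System.out.println(inputDir("/freebase/", "1999-99-99-99-99"));
        System.out.println(stepOutput("preprocessed", "1999-99-99-99-99"));
    }
}
```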

@paulhoule
Owner Author

It's going to be "run flow".

paulhoule pushed a commit that referenced this issue Oct 9, 2013
paulhoule pushed a commit that referenced this issue Oct 10, 2013
…ble to compose jobs in Spring; todo for tomorrow -- write unit test for the basekbNowFlow and add the tmpDir as a parameter in the context object for SpEL
paulhoule pushed a commit that referenced this issue Oct 10, 2013
…ts are being generated for the basekbNowFlow
paulhoule pushed a commit that referenced this issue Oct 10, 2013
… in the cluster; next steps are adding more jobs, adding a flow runner for the EMR cluster, and fluentizing the Java/Spring XML API
@paulhoule
Owner Author

The current integration test I am running is

bzcat ~/0005percent.bz2 | gzip -c - > part-00000.gz
hadoop fs -mkdir /freebase/freebase-rdf-1999-99-99-99-99/
hadoop fs -copyFromLocal part-00000.gz /freebase/freebase-rdf-1999-99-99-99-99/

to set up the data, then

haruhi run flow basekbNowFlow /freebase/ 1999-99-99-99-99

and then we clear away the output like so

hadoop fs -rmr /preprocessed
hadoop fs -rmr /now

paulhoule pushed a commit that referenced this issue Oct 10, 2013
…d of one <list> but practically not the other, because I'd need to write something more complex than <value> to create string beans; this is good enough for now
paulhoule pushed a commit that referenced this issue Oct 10, 2013
@paulhoule
Owner Author

Right now the job works great locally with a tiny data set. Next steps are to delete the /preprocessed/ and /accepted directories. I'm wondering if we can plug into the 'hadoop fs' implementation to run commands just as we do on the command line. Ideally this would be part of the bakemono JAR, because then it would work remotely too.

Another necessary feature is to put some constraints on the inputs: there must be a minimum number of arguments, and it would also be good to have validation that files are there if they need to be and not there if they shouldn't be.
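The constraints above could be sketched as a pre-flight check. This is a hypothetical example using `java.io.File` as a stand-in for HDFS (a real implementation would go through the Hadoop `FileSystem` API); the `FlowValidator` name and method shape are invented:

```java
import java.io.File;
import java.util.List;

// Hypothetical sketch of pre-flight validation for a flow: enforce a minimum
// argument count, require that inputs exist, and require that outputs do NOT
// exist, so an expensive job fails fast instead of halfway through.
// java.io.File stands in here for HDFS paths.
public class FlowValidator {
    static void validate(String[] args, int minArgs,
                         List<File> inputs, List<File> outputs) {
        if (args.length < minArgs)
            throw new IllegalArgumentException(
                "expected at least " + minArgs + " arguments, got " + args.length);
        for (File in : inputs)
            if (!in.exists())
                throw new IllegalStateException("missing input: " + in);
        for (File out : outputs)
            if (out.exists())
                throw new IllegalStateException("output already exists: " + out);
    }
}
```

Running this check before submitting the flow means a misconfigured invocation dies immediately on the client rather than an hour into a cluster run.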

paulhoule pushed a commit that referenced this issue Oct 12, 2013
paulhoule pushed a commit that referenced this issue Oct 12, 2013