Add multi-step job support to Haruhi #32
Comments
It's going to be "run flow".
…ble to compose jobs in spring; todo for tomorrow -- write unit test for the basekbNowFlow and add the tmpDir as a parameter in the context object for SPeL
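The Spring wiring hinted at above might look something like the following sketch; the bean names, the `Flow` class, and the SpEL expression are illustrative assumptions, not the actual basekbNowFlow definition:

```xml
<!-- Hypothetical sketch only: class and bean names are illustrative. -->
<bean id="basekbNowFlow" class="example.haruhi.Flow">
    <property name="steps">
        <list>
            <ref bean="preprocessJob"/>
            <ref bean="acceptJob"/>
        </list>
    </property>
    <!-- tmpDir surfaced as a context value so steps can reference it via SpEL -->
    <property name="tmpDir" value="#{systemProperties['java.io.tmpdir']}"/>
</bean>
```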
…ts are being generated for the basekbNowFlow
… in the cluster; next steps are adding more jobs, adding a flow runner for the EMR cluster, and fluentizing the Java/Spring XML API
The current integration test I am running is
to set up the data, then
and then we clear away the output like so
…d of one <list> but practically not the other, because I'd need to write something more complex than <value> to create string beans; this is good enough for now
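For reference, the simple case works because Spring converts a `<list>` of `<value>` elements directly into a `List<String>` property; anything richer would need nested `<bean>` elements. A hypothetical example of the string case (property name and values are illustrative):

```xml
<property name="args">
    <list>
        <value>-input</value>
        <value>/preprocessed/</value>
    </list>
</property>
```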
Right now the job works great locally with a tiny data set. Next steps are to delete the /preprocessed/ and /accepted directories. I'm wondering if we can plug into the 'hadoop fs' implementation to run commands just as we do on the command line; ideally this would be part of the bakemono JAR so it would also work remotely. Another necessary feature is to put some constraints on the inputs: there must be a minimum number of arguments, and it would also be good to validate that files are there if they need to be and not there if they don't.
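The input constraints described above could be sketched roughly like this; the class, method, and parameter names are hypothetical, not part of bakemono, and the existence check is abstracted behind a predicate so it could be backed by a local or remote filesystem:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the input constraints described above; the names
// are illustrative, not part of bakemono.
class JobInputValidator {
    /** Fail fast before launching an expensive cluster job. */
    static void validate(List<String> args, int minArgs,
                         List<String> mustExist, List<String> mustNotExist,
                         Predicate<String> exists) {
        if (args.size() < minArgs)
            throw new IllegalArgumentException(
                "expected at least " + minArgs + " arguments, got " + args.size());
        for (String path : mustExist)
            if (!exists.test(path))
                throw new IllegalStateException("required input missing: " + path);
        for (String path : mustNotExist)
            if (exists.test(path))
                throw new IllegalStateException("output already present: " + path);
    }
}
```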
Note that EMR allows a job to comprise multiple steps. Thus, we could write a Haruhi application that runs a series of Bakemono jobs.
This would have a number of beneficial effects. One is that we avoid setup and teardown time. Another is that we can keep intermediate files on the local hard drives with HDFS, avoiding network activity for reads. On top of that, we could pack several small jobs into a single hour slot, or just build a big cluster and get all of our jobs to run in an hour.
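The multi-step idea above could be sketched as a sequential runner that pays cluster setup/teardown once and stops at the first failure, so later steps never see missing intermediate output; the `Step` interface and `MultiStepRunner` name are hypothetical, not existing Haruhi types:

```java
import java.util.List;

// Illustrative sketch of the multi-step idea: run several Bakemono jobs in
// one cluster session. Names are hypothetical.
interface Step {
    boolean run();   // one Bakemono job; true on success
}

class MultiStepRunner {
    /** Runs steps in order; stops at the first failure. Returns the number
     *  of steps that completed successfully. */
    static int runAll(List<Step> steps) {
        int done = 0;
        for (Step s : steps) {
            if (!s.run()) break;
            done++;
        }
        return done;
    }
}
```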
A specific useful application would be the weekly process applied to Freebase.
Note that instead of saying "haruhi run job" we might say "haruhi run batch" or something (maybe there's a better name). These might be springable, but note that most of these are going to want validation logic for the inputs and outputs (to avoid an expensive failed job). There will probably also be some string interpolation to compute input and output paths, and possibly some intelligent selection of reducer counts as well.
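The path-interpolation piece could be as simple as a date-stamped template expansion for a weekly run; the template syntax and bucket layout here are illustrative assumptions only:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of computing input/output paths for a weekly run;
// the {date} template syntax and path layout are illustrative only.
class PathTemplate {
    static String expand(String template, LocalDate runDate) {
        return template.replace("{date}",
            runDate.format(DateTimeFormatter.ISO_LOCAL_DATE));
    }
}
```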