Azkaban is a batch job scheduler developed at LinkedIn for running Hadoop jobs. Azkaban makes it very simple to construct workflows. A workflow is just a collection of job files, with each job file being a definition of how to run it, what other jobs are prerequisites, what files it reads and writes, etc.
Workflows become difficult to maintain as the complexity and number of jobs grow. azkaban-rb helps solve this problem by providing a simple Ruby DSL to generate Azkaban job files.
Install the gem:
gem install azkaban-rb
Optionally you can also install GraphViz if you plan to generate directed graph visualizations of the data flow within the job.
What does an Azkaban job file look like?
The Azkaban documentation has a lot of great information on the types of jobs and what parameters can be provided.
A simple Azkaban job might look a lot like the one below.
type=pig read.lock=input_file.txt write.lock=output_file.txt pig.script=test.pig param.input=input_file.txt param.output=output_file.txt
The type parameter says that this is a pig job. There are also parameters defining read and write locks on the files the job will be accessing. Azkaban makes sure that other jobs won't access the same files in ways that would violate the locks. It also declares what pig script should be run. Finally it declares some parameters to the pig script itself. Within the pig script $input and $output can be used to reference the files.
Already you can see that we have some duplication. Each of the files are listed twice. This problem gets worse as we add more input files. We'll also add some dependencies to complete the example.
type=pig read.lock=input_file1.txt,input_file2.txt,input_file3.txt,input_file4.txt write.lock=output_file.txt pig.script=test.pig param.input1=input_file1.txt param.input2=input_file2.txt param.input3=input_file3.txt param.input4=input_file4.txt param.output=output_file.txt dependencies=job1,job2,job3
How does azkaban-rb work?
With azkaban-rb the jobs are declared in a Rakefile. One could generate the job file above with the following declaration:
job :test do set "type" => "pig", "read.lock" => "input_file1.txt,input_file2.txt,input_file3.txt,input_file4.txt", "write.lock" => "output_file.txt", "pig.script" => "test.pig", "param.input1" => "input_file1.txt", "param.input2" => "input_file2.txt", "param.input3" => "input_file3.txt", "param.input4" => "input_file4.txt", "param.output" => "output_file.txt", "dependencies" => "job1,job2,job3" end
This example just serves to show that you have complete control over what goes into the job file. The real benefit of azkaban-rb comes from using its helper methods:
pig_job :test => [:job1, :job2, :job3] do uses "test.pig" reads "input_file1.txt", :as => :input1 reads "input_file2.txt", :as => :input2 reads "input_file3.txt", :as => :input3 reads "input_file4.txt", :as => :input4 writes "output_file.txt", :as => :output end
This declaration is a lot more readable. pig_job is just a wrapper around job which implicitly sets the job type to “pig”. The reads and writes methods will add the given files to lists which will get read and write locks, respectively. With the :as option appended to the call each will also define an alias which can be used to reference the file name in the pig script. The other interesting thing going on here is the dependencies have been moved to the top of the declaration. Before we discuss this let's talk about what we're really doing in this declaration.
So what is really happening when you invoke pig_job? azkaban-rb is built on top of rake, a Ruby build tool somewhat similar to make. A Rakefile uses Ruby syntax to declare tasks and their prerequisites. A job declared with azkaban-rb is actually no more than a rake task which generates an Azkaban job file when it executes. The parameters you set within the block are used when creating the job file.
Now back to those dependencies. What we are actually doing is declaring task prerequisites using the rake syntax. Since each job is just a rake task, it means one or more jobs can be defined as prerequisites. There are two benefits to this. First, rake will ensure that all prerequisites of a task are executed, which for us means that all the Azkaban jobs that a job depends on will be generated. Second, the dependencies of the job can be implicitly extracted from the list of prerequisites defined using the rake syntax. This means we don't need to ever set the dependencies.
rake is a natural fit in the declaration of Azkaban workflows:
It's Ruby, so you programmatically have a lot of flexibility in how jobs are generated.
Jobs can be declared in a single Rakefile and namespaced for better organization. Or you can group related jobs into separate Ruby files which are then loaded by the Rakefile.
Using JRuby one can import ant targets, which might be used to build Java MapReduce jobs.
Tasks can be set up in such a way that in a single rake command the Azkaban jobs are generated, Java code is built, and a zip archive of the entire workflow is produced and uploaded to the Azkaban server.
azkaban-rb provides a set of built in methods for declaring Azkaban jobs. Each of these is just a wrapper around the job method which implicitly sets the job type for you.
pig_job :test do uses "test.pig" # set other parameters end
Java process jobs
java_process_job :test do uses "azkaban.example.test.HelloWorld" # class having main method to execute # set other parameters end
command_job :test do uses 'echo "Hello Azkaban!"' # set other parameters end
java_job :test do uses "com.example.MyJob" # class for Hadoop job to execute # set other parameters end
A fully functional example can be found in the source code.
Copyright 2010 LinkedIn, Inc
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.