User Guide

Overview

This is the LinkedIn Gradle Plugin for Apache Hadoop User Guide. For the sake of brevity, we will refer to the plugin as simply the "Hadoop Plugin".

The Hadoop Plugin will help you more effectively build, test and deploy Hadoop applications. In particular, the Plugin will help you easily work with Hadoop ecosystem technologies like Apache Pig and build workflows for Hadoop workflow schedulers like Azkaban and Apache Oozie.

The Plugin includes the LinkedIn Gradle DSL for Apache Hadoop (which we shall refer to as simply the "Hadoop DSL"), a language for specifying jobs and workflows for Hadoop workflow schedulers like Azkaban and Apache Oozie. Go directly to the Hadoop DSL Language Reference.

Using the Open-Source Hadoop Plugin

The Hadoop Plugin is now published on plugins.gradle.org. Click on the link for a short snippet to add to your build.gradle file to start using the Hadoop Plugin today!

Using the Hadoop Plugin at LinkedIn

If you are using the Hadoop Plugin internally at LinkedIn, see our comprehensive instructions at go/HadoopPlugin on the LinkedIn Wiki to start using the Plugin.

Hadoop Plugin Tasks

To see all of the Hadoop Plugin tasks, run gradle tasks in the project directory of your Hadoop Plugin project and look at the section titled Hadoop Plugin tasks. You may see something like:

Hadoop Plugin tasks
-------------------
azkabanDevHadoopZip - Creates a Hadoop zip archive for azkabanDev
azkabanUpload - Uploads Hadoop zip archive to Azkaban
azkabanProdHadoopZip - Creates a Hadoop zip archive for azkabanProd
buildAzkabanFlows - Builds the Hadoop DSL for Azkaban. Have your build task depend on this task.
buildHadoopZips - Builds all of the Hadoop zip archives. Tasks that depend on Hadoop zips should depend on this task
buildOozieFlows - Builds the Hadoop DSL for Apache Oozie. Have your build task depend on this task.
buildPigCache - Build the cache directory to run Pig scripts by Gradle tasks. This task will be run automatically for you.
buildScmMetadata - Writes SCM metadata about the project to the project's build directory
checkDependencies - Task to help in controlling and monitoring the dependencies used in the project
CRTHadoopZip - Creates a Hadoop CRT deployment zip archive
disallowLocalDependencies - Task to disallow users from checking in local dependencies
oozieCommand - Runs the oozieCommand specified by -Pcommand=CommandName
oozieUpload - Uploads the Oozie project folder to HDFS
printScmMetadata - Prints SCM metadata about the project to the screen
run_count_by_country.pig - Run the Pig script src/main/pig/count_by_country.pig with no Pig parameters or JVM properties
run_count_by_country_python.pig - Run the Pig script src/main/pig/count_by_country_python.pig with no Pig parameters or JVM properties
run_member_event_count.pig - Run the Pig script src/main/pig/member_event_count.pig with no Pig parameters or JVM properties
run_postal_code.pig - Run the Pig script src/main/pig/postal_code.pig with no Pig parameters or JVM properties
run_verify_recommendations.pig - Run the Pig script src/main/pig/verify_recommendations.pig with no Pig parameters or JVM properties
runPigJob - Runs a Pig job configured in the Hadoop DSL with gradle runPigJob -Pjob=<job name>. Uses the Pig parameters and JVM properties from the DSL.
runSparkJob - Runs a Spark job configured in the Hadoop DSL with gradle runSparkJob -PjobName=<job name> -PzipTaskName=<zip task name>. Uses the Spark parameters and JVM properties from the DSL.
showPigJobs - Lists Pig jobs configured in the Hadoop DSL that can be run with the runPigJob task
showSparkJobs - Lists Spark jobs configured in the Hadoop DSL that can be run with the runSparkJob task
startHadoopZips - Container task on which all the Hadoop zip tasks depend
writeAzkabanPluginJson - Writes a default .azkabanPlugin.json file in the project directory
writeOoziePluginJson - Writes a default .ooziePlugin.json file in the project directory
writeScmPluginJson - Writes a default .scmPlugin.json file in the root project directory

Some of these tasks will help you run and debug Hadoop jobs, some of them are related to the Hadoop DSL, and some of them will help you upload to Azkaban. See the sections below for descriptions of each.

Hadoop DSL Language

The Hadoop Plugin comes with the Hadoop DSL, which makes it easy to specify workflows and jobs for Hadoop workflow schedulers.
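
For illustration, a minimal Hadoop DSL sketch might look something like the following (the workflow name, job name and script path below are hypothetical placeholders; see the Hadoop DSL Language Reference for the full syntax):

// A minimal Hadoop DSL sketch, assuming a Pig script at src/main/pig/count_by_country.pig
hadoop {
  buildPath "azkaban"  // Directory under the build directory where the compiled job files are written
}

workflow('countByCountry') {
  pigJob('countByCountryJob') {
    uses 'src/main/pig/count_by_country.pig'
  }
  targets 'countByCountryJob'
}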

(Since version 0.3.9) If for some reason you need to disable the Hadoop DSL Plugin, you can pass -PdisableHadoopDslPlugin on the Gradle command line or add disableHadoopDslPlugin=true to your gradle.properties file.

Hadoop DSL Language Reference

The Hadoop DSL Language Reference is documented on its own page at Hadoop DSL Language Reference.

Hadoop DSL Syntax Completion in IntelliJ IDEA

The Hadoop DSL supports automatic syntax completion in all recent versions of IntelliJ IDEA. See the Hadoop DSL Language Reference to learn how to enable this feature.

Building the Hadoop DSL for Azkaban

Run the buildAzkabanFlows task to compile the Hadoop DSL into Azkaban job files. Generally, one of your build steps should depend on this task (usually the task that builds the Azkaban zip).

// Assume this file contains your Hadoop DSL
apply from: 'src/main/gradle/workflows.gradle'
  
// The build depends on the Hadoop zips, which depends on compiling the Hadoop DSL
startHadoopZips.dependsOn buildAzkabanFlows
build.dependsOn buildHadoopZips
   
// Additionally, you need to configure the hadoopZip block. Read the section on Hadoop
// Zip Artifacts to understand how to add the compiled Hadoop DSL files to your zip.

Hadoop DSL Automatic Builds

(Since version 0.13.1) Hadoop DSL Automatic Builds are a new way to build the Hadoop DSL that makes it very easy to customize the Hadoop DSL for multiple grids. Hadoop DSL Automatic Builds will automatically find your Hadoop DSL definition set files, user profile scripts and workflow scripts.

For each definition set file, the Automatic Build will apply the definition set file, your user profile file and your workflow files. Then it will build the compiled Hadoop DSL output files. At the end, you will have one set of compiled output files for each of your definition set files.

If you declare one definition set file per grid (such as src/main/definitions/devGrid.gradle and src/main/definitions/prodGrid.gradle), you will get compiled output files for each grid without having to use any advanced Hadoop DSL language features like namespace or hadoopClosure. This makes it very easy to build Hadoop DSL output customized for each of your Hadoop grids.
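
For example, each per-grid definition set file might declare just the values that differ between grids. The property names and values below are hypothetical; definitionSet is the Hadoop DSL method for declaring definitions:

// Hypothetical src/main/definitions/devGrid.gradle
definitionSet defs: [
  'clusterName' : 'devGrid',
  'eventsPath'  : '/data/dev/events'
]

// Hypothetical src/main/definitions/prodGrid.gradle
definitionSet defs: [
  'clusterName' : 'prodGrid',
  'eventsPath'  : '/data/prod/events'
]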

You can easily customize the location of your definition set, user profile and workflow files. You can also enable showSetup = true to get information about the steps the Automatic Build is taking to configure your build. Once you have configured the hadoopDslBuild block in your build.gradle file, call the autoSetup method to process the Automatic Build.

// In your build.gradle file
// No need to manually apply your Hadoop DSL workflow scripts! They will be applied automatically.
 
// Now configure the hadoopDslBuild block and call autoSetup at the end to process it
hadoopDslBuild {
  showSetup = true  // Set showSetup = true to show info about the automatic build process. Set to false by default.
}.autoSetup()   // Call the autoSetup method to process the build
  
// If you want to accept all the defaults, you can write this as a one-liner instead:
hadoopDslBuild.autoSetup()

// The build depends on the Hadoop zips, which depends on compiling the Hadoop DSL
startHadoopZips.dependsOn buildAzkabanFlows
build.dependsOn buildHadoopZips
  
// Now when the buildAzkabanFlows task runs, you will get one set of output files under the Hadoop
// DSL buildPath folder for each definition set file in the definitions folder.
  
// Additionally, you need to configure the hadoopZip block. Read the section on the
// hadoopZip block to understand how to add the compiled Hadoop DSL files to your zip.

// Here is a guide to all the options you can set in the hadoopDslBuild block if you want to customize the automatic build process
hadoopDslBuild {
  showSetup = true                      // Set showSetup = true to show info about the automatic build process. Set to false by default.
  
  definitions = "src/main/definitions"  // Automatic builds will rebuild the Hadoop DSL for each file in this directory. Set to "src/main/definitions" by default.
  profiles = "src/main/profiles"        // Automatic builds will apply your user profile script from this directory (if it exists). Set to "src/main/profiles" by default.
  workflows = "src/main/gradle"         // Automatic builds will apply all the .gradle files in this directory. Set to "src/main/gradle" by default.
  
  // The following properties enable you to specify exactly which definition and workflow files to apply (and the order in which to apply them). These
  // properties override the definitions and workflows paths specified above. Use these properties if you want to control exactly what definition and
  // workflow files should be applied and in what order (otherwise, all workflow files will be applied in alphabetic order by file name). See
  // https://docs.gradle.org/2.13/userguide/working_with_files.html for a reference on how to specify Gradle file collections.
  definitionFiles = files(['src/main/otherDefs/defs1.gradle', 'src/main/otherDefs/defs2.gradle'])   // Set to null by default
  workflowFiles = files(['src/main/otherFlows/flows1.gradle', 'src/main/otherFlows/flows2.gradle'])  // Set to null by default

  // Property to specify what workflow files should be applied first before any other workflow
  // files. Use this to apply helper scripts that should be applied before anything else.
  workflowFilesFirst = files(['src/main/gradle/common.gradle'])
  
  // The following properties enable you to customize the user profile to apply. These can also be customized with command line options.
  profileName = 'ackermann'  // Optional - defaults to null (in which case your user name is used). Name of the user profile to apply. Pass -PprofileName=<name> on the command line to override.
  skipProfile = false        // Optional - defaults to false. Specifies whether or not to skip applying the user profile. Pass -PskipProfile=true on the command line to override.
  
}.autoSetup()  // Call the autoSetup method to process the build

Customizing the Automatic Build

In the hadoopDslBuild block you can set a number of properties that enable you to customize the behavior of the Automatic Build process. All of these properties are optional and can be left unspecified if you want to accept the default behavior.

In particular, you can set the definitionFiles and workflowFiles properties to Gradle FileCollection instances if you want to specify exactly which files to apply and their order. These properties will override the definitions and workflows properties that specify the paths from which to read these files. For workflow files, you can specify that the Automatic Build should always apply certain files first by setting the workflowFilesFirst property to a Gradle FileCollection instance.

You can also set the profileName and skipProfile properties to control what Hadoop DSL user profile file is applied (which must exist in the path specified by the profiles property). You can override these settings by using the -PprofileName=<name> and -PskipProfile=true command line options.
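
As a sketch, a user profile script is just another Gradle file under the profiles path, named after the user it applies to. The file name and overridden definition below are hypothetical:

// Hypothetical src/main/profiles/ackermann.gradle - applied for the user "ackermann"
// (or when -PprofileName=ackermann is passed on the command line)
definitionSet defs: [
  'eventsPath' : '/user/ackermann/events'  // Override a definition for personal testing
]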

External State

After each definition set file found under the definitions path is applied, the state of the Hadoop DSL will be cleared, so that the next definition set file can be applied and the user profile and workflow scripts can be reapplied.

Since the user profile and workflow scripts may each be reapplied several times, any code you have in these files that affects non-Hadoop DSL state may not work correctly, especially if that code assumes that it is processed only once. In particular, any code that assumes Gradle extension properties (such as ext.extensionVariableName) are set only once might not work correctly.
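
As a sketch, one way to guard against this (assuming a hypothetical extension property named myCounter) is to make such code safe to apply more than once:

// In a workflow or user profile script that may be applied several times by the Automatic Build.
// Only initialize the extension property on the first apply instead of assuming a single pass.
if (!project.ext.has('myCounter')) {
  project.ext.myCounter = 0  // Hypothetical property used here purely for illustration
}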

External Gradle Scripts

Hadoop DSL Automatic Builds will automatically apply any Gradle scripts found under the workflows path. If you have helper scripts that you would like to apply manually, simply move them to another path or to a subdirectory of workflows (subdirectories are not processed automatically). Another way to apply helper scripts is to set the workflowFilesFirst property described above.

For common definitions, you can move them into a common definitions file in a subdirectory under the definitions path and then apply from this common file in each of your definition files.
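
For example, each definition set file could explicitly apply a shared script from a subdirectory (the common subdirectory and file name below are hypothetical):

// Hypothetical src/main/definitions/devGrid.gradle
// Subdirectories of the definitions path are not applied automatically, so the shared
// definitions are only pulled in through this explicit apply.
apply from: 'src/main/definitions/common/commonDefs.gradle'

definitionSet defs: [
  'clusterName' : 'devGrid'
]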

Hadoop Runtime Dependency Configuration

Applying the Hadoop Plugin will create the hadoopRuntime dependency configuration. You should add dependencies to this configuration that your Hadoop code doesn't need at compile time, but needs at runtime when it executes on the grid.

For projects that also apply the Java Plugin, the hadoopRuntime configuration automatically extends the runtime configuration and adds the jar task. By default, everything in the hadoopRuntime configuration will be added to each Hadoop zip artifact you declare in the hadoopZip block.

To see the dependencies that will be added to the hadoopRuntime configuration, run gradle dependencies --configuration hadoopRuntime.

// In your <rootProject>/<project>/build.gradle:

// Declare Hadoop runtime dependencies using the hadoopRuntime dependency configuration
dependencies {
  hadoopRuntime "org.apache.avro:avro:1.7.7"
  // ...
}

Hadoop Validator

The Hadoop Plugin includes the Hadoop Validator, which provides Gradle tasks that perform local validation of your Hadoop jobs. In particular, the Hadoop Validator includes tasks for syntax checking, schema validation and data validation for Hadoop ecosystem jobs: Hadoop Validator.

Hadoop Zip Artifacts

The Hadoop Plugin includes a number of features for building Hadoop zip artifacts that can be uploaded to your Hadoop workflow scheduler: Hadoop Zip Artifacts.

Azkaban Features

The Hadoop Plugin comes with tasks to compile the Hadoop DSL into job files for Azkaban and to upload zip artifacts to Azkaban: Azkaban Features.

Apache Oozie Features

The Hadoop Plugin comes with tasks to execute Apache Oozie commands and to upload zip artifacts to versioned directories on HDFS: Apache Oozie Features.

Apache Pig Features

The Hadoop Plugin comes with features that should make it much easier for you to quickly run and debug Apache Pig scripts: Apache Pig Features.

Apache Spark Features

The Hadoop Plugin comes with features that should make it much easier for you to quickly run Apache Spark programs: Apache Spark Features.

Dependency Management Features

The Hadoop Plugin comes with features that enable your company's Hadoop development and operations teams to prevent poor dependency management practices: Dependency Management Features.

Source Code Metadata Features

The Hadoop Plugin comes with features to record metadata about your source code and to build source code zips for your projects: Source Code Metadata Features.