# Apache Oozie
Apache Oozie is a Hadoop job scheduler that combines multiple complex jobs so that they run in sequence as part of a larger task. Within such a sequence, two or more jobs can also be configured to run in parallel.

## The 3 types of jobs
Oozie `Workflow` Jobs − These are represented as Directed Acyclic Graphs (DAGs) to specify a sequence of actions to be executed.

Oozie `Coordinator` Jobs − These consist of workflow jobs triggered by time and data availability.

Oozie `Bundle` − These can be referred to as a package of multiple coordinator and workflow jobs.

![](https://www.tutorialspoint.com/apache_oozie/images/sample_workflow.jpg)

## Workflow
A workflow action can be a **Hive action, Pig action, Java action, Shell action**, etc. A workflow can also include `fork` and `join` nodes for parallel branches and `decision` nodes for conditional branching.

Example workflow:
```xml
<!-- This is a comment -->
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "simple-Workflow">
   <start to = "fork_node" />
   
   <fork name = "fork_node">
      <path start = "Create_External_Table"/>
      <path start = "Create_orc_Table"/>
   </fork>
   
   <action name = "Create_External_Table">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>xyz.com:8088</job-tracker>
         <name-node>hdfs://rootname</name-node>
         <script>hdfs_path_of_script/external.hive</script>
      </hive>
      
      <ok to = "join_node" />
      <error to = "kill_job" />
   </action>
   
   <action name = "Create_orc_Table">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>xyz.com:8088</job-tracker>
         <name-node>hdfs://rootname</name-node>
         <script>hdfs_path_of_script/orc.hive</script>
      </hive>
		
      <ok to = "join_node" />
      <error to = "kill_job" />
   </action>
   
   <join name = "join_node" to = "Insert_into_Table"/>
	
   <action name = "Insert_into_Table">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>xyz.com:8088</job-tracker>
         <name-node>hdfs://rootname</name-node>
         <script>hdfs_path_of_script/Copydata.hive</script>
         <param>database_name</param>
      </hive>
		
      <ok to = "end" />
      <error to = "kill_job" />
   </action>
   
   <kill name = "kill_job">
      <message>Job failed</message>
   </kill>
   
   <end name = "end" />
	
</workflow-app>
```
This workflow produces the following DAG:

![](https://www.tutorialspoint.com/apache_oozie/images/workflow.jpg)
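
The workflow above only uses `fork` and `join`. As a minimal sketch of a `decision` node, the fragment below routes to `Insert_into_Table` only when a property is set to `true`, and otherwise goes straight to `end`. The node name `check_copy` and the property `run_copy` are hypothetical and not part of the original workflow; `wf:conf` is the Oozie EL function that reads a workflow job property.

```xml
<!-- Hypothetical decision node: "check_copy" and "run_copy" are illustrative names -->
<decision name = "check_copy">
   <switch>
      <!-- Take this branch when the job property run_copy equals "true" -->
      <case to = "Insert_into_Table">${wf:conf('run_copy') eq 'true'}</case>
      <!-- Otherwise skip the copy step -->
      <default to = "end" />
   </switch>
</decision>
```

To use such a node, another node (for example the `join`) would transition to `check_copy` instead of directly to `Insert_into_Table`.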

### Property File
We can specify a `.properties` config file. Variables like `${nameNode}` can be passed within the workflow definition.

job1.properties
```
nameNode=hdfs://rootname
jobTracker=xyz.com:8088
script_name_external=hdfs_path_of_script/external.hive
script_name_orc=hdfs_path_of_script/orc.hive
script_name_copy=hdfs_path_of_script/Copydata.hive
database=database_name
```
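
As a sketch of how these properties could be used, the first Hive action can reference them instead of hard-coding the values. This assumes the property names from `job1.properties` are passed in at submission time; the rest of the action is unchanged.

```xml
<action name = "Create_External_Table">
   <hive xmlns = "uri:oozie:hive-action:0.4">
      <!-- Values are resolved from job1.properties when the job is submitted -->
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>${script_name_external}</script>
   </hive>

   <ok to = "join_node" />
   <error to = "kill_job" />
</action>
```

The same substitution applies to the other two actions, which keeps environment-specific values out of the workflow definition.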

## Running the job
```shell
oozie job \
    -oozie http://host_name:8080/oozie \
    -config edgenode_path/job1.properties \
    -D oozie.wf.application.path=hdfs://Namenodepath/pathof_workflow_xml/workflow.xml \
    -run
```

Note − The property file must be on the local filesystem of the edge node (not in HDFS), whereas the workflow definition and the Hive scripts must be in HDFS.

## Coordinator
Coordinator applications allow users to schedule complex workflows, including workflows that are scheduled regularly. Oozie Coordinator models the workflow execution triggers in the form of time, data or event predicates. The workflow job mentioned inside the Coordinator is started only after the given conditions are satisfied.

Example coordinator:
```xml
<coordinator-app xmlns = "uri:oozie:coordinator:0.2"
   name = "coord_copydata_from_external_orc" frequency = "5 * * * *"
   start = "2016-01-18T01:00Z" end = "2025-12-31T00:00Z" timezone = "America/Los_Angeles">
   
   <controls>
      <timeout>1</timeout>
      <concurrency>1</concurrency>
      <execution>FIFO</execution>
      <throttle>1</throttle>
   </controls>
   
   <action>
      <workflow>
         <app-path>pathof_workflow_xml/workflow.xml</app-path>
      </workflow>
   </action>
	
</coordinator-app>
```

start − The start datetime for the job. Actions are materialized starting at this time.

end − The end datetime for the job. Actions stop being materialized at this time.

timezone − The timezone of the coordinator application.

frequency − How often actions are materialized, given either as a number of minutes or as a cron-like expression. In the example above, `5 * * * *` materializes an action at minute 5 of every hour.

### Control
timeout − The maximum time, in minutes, that a materialized action waits for the additional conditions to be satisfied before being discarded. A timeout of 0 means that all input events must be satisfied at the time of materialization; otherwise the action times out immediately. A timeout of -1 means no timeout: the materialized action waits indefinitely for the other conditions to be satisfied. The default value is -1.

concurrency − The maximum number of actions for this job that can be running at the same time. This value makes it possible to materialize and submit multiple instances of the coordinator app, so that operations can catch up on delayed processing. The default value is 1.

execution − Specifies the execution order if multiple instances of the coordinator job have satisfied their execution criteria. Valid values are:

- FIFO (oldest first), the default.
- LIFO (newest first).
- LAST_ONLY (discards all older materializations).
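
The "other conditions" that `timeout` refers to are typically input data dependencies. The following is a minimal sketch, not taken from the original example, of a coordinator that waits up to 60 minutes for a daily dataset before running the workflow. The coordinator name, the dataset name `input_logs`, the HDFS path, and the `inputDir` property are all hypothetical.

```xml
<coordinator-app xmlns = "uri:oozie:coordinator:0.2"
   name = "coord_wait_for_input" frequency = "${coord:days(1)}"
   start = "2016-01-18T01:00Z" end = "2025-12-31T00:00Z" timezone = "UTC">

   <controls>
      <!-- Discard the action if the input is still missing after 60 minutes -->
      <timeout>60</timeout>
      <concurrency>1</concurrency>
      <execution>FIFO</execution>
   </controls>

   <datasets>
      <!-- Hypothetical daily dataset; _SUCCESS marks a complete instance -->
      <dataset name = "input_logs" frequency = "${coord:days(1)}"
         initial-instance = "2016-01-18T01:00Z" timezone = "UTC">
         <uri-template>hdfs://rootname/hypothetical/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
         <done-flag>_SUCCESS</done-flag>
      </dataset>
   </datasets>

   <input-events>
      <!-- The action waits until the current day's instance is available -->
      <data-in name = "input" dataset = "input_logs">
         <instance>${coord:current(0)}</instance>
      </data-in>
   </input-events>

   <action>
      <workflow>
         <app-path>pathof_workflow_xml/workflow.xml</app-path>
         <configuration>
            <!-- Pass the resolved input path to the workflow as a property -->
            <property>
               <name>inputDir</name>
               <value>${coord:dataIn('input')}</value>
            </property>
         </configuration>
      </workflow>
   </action>

</coordinator-app>
```

With this setup, each daily action is materialized on schedule but only starts once the matching dataset instance exists, which is where `timeout` and `execution` come into play.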
    