Routing

stockiNail edited this page Oct 12, 2015 · 3 revisions

Swarm

As stated in the Overview chapter, "according to Environment tag, a job can be moved from the input queue to the routing queue, waiting for another JEM cluster which will fetch and execute it". In other words, if a job has a different environment from which is submitted then it will be moved to the Routing queue of the submitted environment and if Swarm is active and connected to the job's environment than the job will be automatically moved (submitted and executed) to the correct environment. Let see how this feature is implemented helping us with the following image.

http://www.pepstock.org/resources/routing-01.png

As you can see in the image above, we have 2 JEM environment (Hazelcast cluster) named ENV-1 and ENV-2. Each of these environments has a set of nodes (see Clustering for more details). The ability of JEM to communicate through different environments is made possible by the introduction of a new (Hazelcast) cluster that connect nodes (configured by the user) between different environments. In the picture in fact we have node 5 and node 6 that act either as node of a specific environment and as a bridge between different environments. In the picture this new Hazelcast cluster is named SWARM-ENV because it is configured in the swarm tab of JEM web interface. A node than, can be part of 2 separated cluster one of which is the swarm. This is possible because Hazelcast allow to configured different cluster inside the same JVM. So we can think at swarm as the "cloud of clouds". Now that we understood the main idea of the routing it is time to go a little deeper in the implementation to understand the features that made routing:

  • configurable, deciding which environments (and which nodes) connect one another
  • secure, prevention from an-authorized environment to connect
  • with no data loss, implementing a 2 phase commit algorithm
  • capable to listen for the end of a routed job

Configuration

Configuring the routing environment, means configure an Hazelcast cluster. Instead of using another xml configuration, JEM the BEE provide an easy and intuitive web module to do it. The module, as you may understood, is the swarm module describe in the web interface reference. The information needed to correctly configure it are the following:

  • Group name, is the Hazelcast group name for the cluster
  • Group password, is the password for the member that want to join the swarm Hazelcast cluster.
  • Port is the auto-increment starting port number for each member that will connect to the Hazelcast cluster
  • Ip address, is the list of ip addresses that are allowed to connect to the cluster (note that an Ip correspond to a host so if an host have many JEM nodes installed than all the nodes will become also swarm nodes)
  • Network Interface, is the network interface used in Hazelcast, most of the time you will probably not need it.

So lets make an example referring to the above image. Suppose that the following are the ip addresses of nodes 5 and 6:

  • ENV-1, node 5 1 100.100.100.1
  • ENV-1, node 6 2 100.100.100.2
  • ENV-2, node 5 1 100.100.200.1
  • ENV-2, node 6 2 100.100.200.1

Because we want all these nodes to be part of the SWARM cluster we have to edit swarm configuration of each environment setting the following:

  • Group name = LEARN_SWARM
  • Group password = learn_swarm_password
  • Port number we will leave the default one
  • Ip addresses
    1. 100.100.100.1
    2. 100.100.100.2
    3. 100.100.200.1
    4. 100.100.200.2

At this point we just need to start the swarm via web interface and that's it, we just made possible the routing of jobs between ENV-1 and ENV-2. If in the future we would like to add another environment all we have to do is setting the swarm configuration of this environment in the same way we just did it and we have to remember to add to each environment the new swarm nodes ip. Remember that if swarm is enable all the nodes that are also swarm nodes will automatically join the swarm cluster once started.

Security

Security, in batch execution, is one of the main important feature and JEM the BEE was build with this idea in mind. When was time to implement the routing, what we want to ensure was:

  • only authorized environments could connect one another. This was guaranteed with the used of a configurable password and using a TCP/IP connection, rather than a multicast one, specifying the addresses of the authorized member. This double security constraint seems to address pretty good our goal.
  • even if two environment can connect one another, once a job is routed from an environment to another the user that submit the job, must be granted for batch execution in both environment. This guaranteed that the control of what can or cannot be submitted and executed in a Batch Execution Environment is always under the control of the environment administrator.

Two phase commit

The figure below, help us to illustrate the entire lifecycle of a routed job.

http://www.pepstock.org/resources/routing-02.png

Let follow step by step all the phases a routed job go through:

  1. A job with environment tag "ENV-2" is submitted in the environment "ENV-1" thus inserted in the Pre Input queue of environment "ENV1"
  2. If JCL job is correct and the user is authorized to submit jobs, the job is transferred from Pre Inptut queue to Input queue
  3. The first JEM node available take the job from the Input queue and because the job is referring to an environment different from the one it was submitted, the JEM node moves the job to the Routing queue
  4. If the swarm cluster is not active, nothing happened and the job remains in the Routing queue. Once the swarm cluster is activated and connected to the "ENV-2" environment one of the swarm nodes belonging to "ENV-1" take the job from routing queue and start a 2 phase commit to send it to a swarm node of environment "ENV-2". This is accomplished via Hazelcast distributed task. So the sequence of operations are the following: * swarm node of "ENV-1" set the routingCommit property of the job to false * start via ExecutorService a distributed task that will be executed by a specific swarm node of "ENV-2" chosen with random algorithm (so to balance the workload). * The distributed task in "ENV-2" as last operation, put the job in the Pre Input queue of "ENV-2" and send back a boolean signal to swarm node of "ENV-1". Before insert the Job in the Pre Input queue, swarm node of "ENV-2" generate a new id for "ENV-2" (so to avoid job id collision between the 2 environment) and populate the routing info attribute of this job setting:
    • job Id with the id set in "ENV-1"
    • routing time with the current time
    • submitted time with time when it was submitted in "ENV-1"
    • environment with "ENV-1" * If an exception occurs during this phase or a false signal is sending back this means that the job was not been routed (for an unexpected reason) and it will remain in the Routing queue of ENV-1 with routingCommit set to false so that it will never been routed again. A manual operation (of the administrator) as to be taken to restore the state of this job. * If a positive signal is received than the job is removed from Routing queue because it is a fact that it was successfully submitted in "ENV-2"
  5. The job, now in the correct environment "ENV-2" is moved from Pre-Input to Input queue
  6. The first available JEM node take the job from Input queue put it in the Running queue and execute it
  7. When the job terminates it is moved to the Output queue as usual.
  8. Once in the Output queue, if the job was a routed job, it is send back to the routing environment ("ENV-2") via DistributedTask with a 2 phase commit. So the sequence of operations are the following: * One of the swarm nodes of environment "ENV-2" pick up the routed job from the Output queue and set the property outputCommit to false * Start via ExecutorService a distributed task that will be executed by a specific swarm node of "ENV-1" chosen with random algorithm (so to balance the workload). This distributed task in "ENV-1" will do the following:
    • Notifies the end of the routing job using a Topic (number 9 in the picture)
    • If the job was submitted in waiting mode (that is a client is waiting for the end of its execution) inserts the job in the non persistent Routed queue. As soon as the client received the topic it will erase the entry on the Routed queue that can than be consider with an average of zero jobs load.
    • Sends back a boolean signal to "ENV-2" * If an exception occurs or a negative signal is received (for any unexpected reason) from "ENV-2" than the job remain in the output queue with outputCommit property set to false and a manual operation should be taken by the administrator to restore the status of the job. If a positive signal is received than "ENV-2" set the outputCommit property of the job to true and the life-cycle of the routed job is terminated.
Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.