CoRB Stages

Mads Hansen edited this page Jan 2, 2020 · 1 revision

Under the hood, CoRB is a batch processing tool. Documents are selected, put into a 'batch', and then processed. CoRB has five tasks (stages) that provide hooks for customizing the work it performs. Not all five stages need to be configured to use CoRB, but they are available if a more complex job demands them.

The five CoRB stages are:

  1. INIT
  2. URIS
  3. PRE-BATCH
  4. PROCESS
  5. POST-BATCH

Often, it's only necessary to configure the URIS task to select the documents included in the batch and then the PROCESS task to do the actual work.

With the exception of the PROCESS task, all tasks are executed on a single thread. The PROCESS task, which typically involves opening one or more documents, can run on multiple threads. The practical limit on the number of threads is governed by the number of CPUs on the machine where CoRB runs.
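
The threading model described above can be pictured with a minimal Java sketch. This is not CoRB's implementation: the class `StageThreadingSketch` and the method `processAll` are hypothetical names, and a plain `java.util.concurrent` fixed thread pool stands in for whatever CoRB uses internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; none of these names come from CoRB.
class StageThreadingSketch {

    // Runs a stand-in "transform" for every URI on a fixed-size pool,
    // mirroring how PROCESS work is shared among several threads.
    static List<String> processAll(List<String> uris, int threadCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (final String uri : uris) {
                // Placeholder for the real module call made per URI.
                pending.add(pool.submit(() -> "processed:" + uri));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // collected in submission order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The other stages would simply run on the calling thread before and after a call like `processAll(uris, 16)`.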

Each stage lets the CoRB developer extend CoRB's functionality by creating a custom task. A custom task is created by implementing the com.marklogic.developer.corb.Task interface or by extending com.marklogic.developer.corb.AbstractTask. Through configuration, a custom task can be used in place of the ready-made tasks such as PreBatchUpdateFileTask or PostBatchUpdateFileTask. Note that if a task is configured for a stage that also has a module, it is the task's responsibility to call the module. CoRB calls a stage's module directly when no task is present, but when both a module and a task are present it calls only the task. All ready-made tasks call the stage's module themselves.
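
The rule about tasks and modules can be summarized in a few lines of Java. This is only a sketch of the decision as described here, not CoRB source code; `StageDispatchSketch` and `invokedByFramework` are hypothetical names.

```java
// Hypothetical illustration of the dispatch rule; not CoRB source code.
class StageDispatchSketch {

    // Describes what the framework itself would invoke for one stage.
    static String invokedByFramework(String taskClassName, String moduleUri) {
        if (taskClassName != null) {
            // A task is configured: only the task is called,
            // and the task must call the module itself if one exists.
            return "task:" + taskClassName;
        }
        if (moduleUri != null) {
            // No task configured: the module is called directly.
            return "module:" + moduleUri;
        }
        return "nothing";
    }
}
```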

The following sections will provide further details.

INIT

Once CoRB is executed from the command line, INIT is the first task to be called. Presently, there is no clear use case for this task, so CoRB doesn't provide a ready-made implementation. However, it is available for any initialization that may be necessary and provides an opportunity to run a custom Java class and/or execute an XQuery script (module). If provided, the Java class is called first, and the module can then be called by the Java class.

URIS

The primary starting point for a CoRB job is the URIS task, which specifies the documents to be used by the PROCESS task. Document URIs can be read from an input file or returned by an XQuery module. An XQuery module used this way, often called a Selector, must return its values in a specific format: ([optionalArbitraryString], totalCountOfURIs, (sequenceOfURIs)). If the Selector returns the optional arbitrary string, XQuery modules in the PRE-BATCH, PROCESS or POST-BATCH tasks can access it by declaring an external variable called URIS_BATCH_REF, which is populated when the module is called, for example:

declare variable $URIS_BATCH_REF as xs:string external;

Using URIS_BATCH_REF can be useful for dynamically creating an export file name. For more information on using the arbitrary string as a file name see the next section on PRE-BATCH.
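
To make the Selector return format concrete, here is a minimal Java sketch that splits such a sequence into its three parts. How CoRB itself recognizes the optional string is not described here, so the rule used below (a first item that is not a whole number is treated as the optional string) is an assumption, and `SelectorResultSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the detection rule is an assumption.
class SelectorResultSketch {
    final String batchRef;   // optional arbitrary string, or null if absent
    final int totalCount;    // total count of URIs reported by the Selector
    final List<String> uris; // the URIs (or other strings) themselves

    SelectorResultSketch(String batchRef, int totalCount, List<String> uris) {
        this.batchRef = batchRef;
        this.totalCount = totalCount;
        this.uris = uris;
    }

    static SelectorResultSketch parse(List<String> items) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("Selector returned nothing");
        }
        int index = 0;
        String ref = null;
        // Assumption: a non-numeric first item is the optional string.
        if (!items.get(0).matches("\\d+")) {
            ref = items.get(0);
            index = 1;
        }
        int total = Integer.parseInt(items.get(index));
        List<String> rest = new ArrayList<>(items.subList(index + 1, items.size()));
        return new SelectorResultSketch(ref, total, rest);
    }
}
```

For example, parsing `("report.csv", "2", "/a.xml", "/b.xml")` yields the string `report.csv`, a count of 2 and two URIs, while `("1", "/a.xml")` yields no optional string.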

If using an input file instead of an XQuery Selector module, each document URI must be specified on its own line.

Although the typical use case is for the input file or module to provide URIs, the values are not required to be actual URIs. They can be arbitrary strings containing URIs and/or any additional information the transform module needs to do its work. CoRB reads each entire line and sends the lines to the PROCESS task one at a time until all have been read. It is therefore the responsibility of the PROCESS task's transform module to parse each line and extract the embedded information. This is particularly useful when more than one piece of information is needed to transform a document or to generate a report.
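
As an illustration of a line carrying more than a URI, the sketch below splits one line into fields. The pipe delimiter, the field layout and the name `UriLineSketch` are assumptions chosen for the example; in a real job this parsing would live in the transform module.

```java
// Hypothetical illustration; delimiter and field layout are assumptions.
class UriLineSketch {

    // Splits "uri|field2|field3" into its parts, keeping empty trailing fields.
    static String[] splitLine(String line) {
        return line.split("\\|", -1);
    }
}
```

A line such as `/docs/1.xml|archived|2020-01-02` would be split into the URI, a status and a date.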

Properties used to configure the URIS task are as follows:

URIS-MODULE=uri/for/selector.xqy
URIS-FILE=name/And/Path/Of/Input/File/For/URIS
COLLECTION-NAME=nameOfCollectionThatCanBeUsedInAQuery

If using an XQuery or JavaScript module, it must either be loaded into the database and available at the URI specified, or have the string "|ADHOC" appended to its name and be available on the classpath for loading into memory. Adhoc modules are not loaded into the database and are discarded from memory after use.
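
The sketch below reads a properties snippet with `java.util.Properties` and reports where the URIs would come from, including the "|ADHOC" suffix check. It is illustrative only: `UrisSourceSketch` is a hypothetical name, and the precedence given to URIS-FILE when both properties are set is an assumption, not documented CoRB behavior.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration; precedence between the properties is assumed.
class UrisSourceSketch {
    static final String ADHOC_SUFFIX = "|ADHOC";

    // Parses properties text such as "URIS-MODULE=selector.xqy|ADHOC".
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Describes the configured source of URIs.
    static String describe(Properties props) {
        String file = props.getProperty("URIS-FILE");
        String module = props.getProperty("URIS-MODULE");
        if (file != null) {
            return "file:" + file;
        }
        if (module == null) {
            return "none";
        }
        if (module.endsWith(ADHOC_SUFFIX)) {
            String name = module.substring(0, module.length() - ADHOC_SUFFIX.length());
            return "adhoc-module:" + name;
        }
        return "installed-module:" + module;
    }
}
```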

PRE-BATCH

The PRE-BATCH task runs after the URIS task and before the PROCESS task. It provides the opportunity to run a custom Java task and/or an XQuery module before the PROCESS task begins. Currently, there is one ready-made Java task, PreBatchUpdateFileTask, available for use as a PRE-BATCH task; alternatively, the developer can build their own.

The typical use case for a PRE-BATCH task is to create the headers for a report. Headers can be created either dynamically or from static content. Static headers are specified using the EXPORT-FILE-TOP-CONTENT property. To create dynamic headers, specify a PRE-BATCH-MODULE that returns the values as a comma or pipe-delimited string. Whether the headers are static or dynamic, the EXPORT-FILE-NAME option must also be used to specify the name of the report in which the headers will be created. Lastly, the ready-made Java task PreBatchUpdateFileTask must be specified as the PRE-BATCH-TASK. It takes the dynamic headers from PRE-BATCH-MODULE, if one is configured, or otherwise the static headers from EXPORT-FILE-TOP-CONTENT, and writes them to the file.

As mentioned in the URIS section, it is also possible to use a dynamically created export file name. To do so, the URIS module must return the file name as the first value in the sequence, for example, (dynamicFileName, count, URIS[1 to n]), and the EXPORT-FILE-NAME property must not be set. If EXPORT-FILE-NAME is missing, CoRB uses the first string returned from the URIS module as the file name.

A possible use case for a PRE-BATCH-MODULE without a task is a document that needs updating before subsequent tasks run. For example, it may be necessary to change a document's status before the complexities of multi-threading arise in the subsequent PROCESS task.
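
The file name fallback described above amounts to a simple choice, sketched here in Java. `ExportFileNameSketch` and `resolve` are hypothetical names; the logic only restates the rule in this section and is not taken from CoRB's code.

```java
// Hypothetical illustration of the export file name fallback.
class ExportFileNameSketch {

    // Uses EXPORT-FILE-NAME when set; otherwise falls back to the
    // first string returned by the URIS module (the batch reference).
    static String resolve(String exportFileName, String urisBatchRef) {
        if (exportFileName != null && !exportFileName.trim().isEmpty()) {
            return exportFileName;
        }
        return urisBatchRef;
    }
}
```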

Properties that can be used to configure the PRE-BATCH task are as follows:

PRE-BATCH-MODULE=uri/for/prebatch.xqy
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
EXPORT-FILE-NAME=nameOfFileWhichWillContainTheReport.txt
EXPORT-FILE-DIR=path/Of/Directory/ExportFileName/Resides/In/If/Not/ProvidedInName
EXPORT-FILE-TOP-CONTENT=name of column1, name of column2, name of column3, etc.
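
To show how these properties fit together, the following Java sketch picks a header (dynamic if available, otherwise static) and writes it as the first line of the export file. It approximates the behavior described in this section and is not the code of PreBatchUpdateFileTask; `PreBatchHeaderSketch` and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not the PreBatchUpdateFileTask implementation.
class PreBatchHeaderSketch {

    // Prefers the value returned by PRE-BATCH-MODULE, if any,
    // over the static EXPORT-FILE-TOP-CONTENT value.
    static String chooseHeader(String dynamicHeader, String staticHeader) {
        if (dynamicHeader != null && !dynamicHeader.isEmpty()) {
            return dynamicHeader;
        }
        return staticHeader;
    }

    // Creates (or overwrites) the export file with the header as its first line.
    static Path writeHeader(Path exportDir, String exportFileName, String header) {
        Path target = exportDir.resolve(exportFileName);
        try {
            byte[] bytes = (header + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
            Files.write(target, bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```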

PROCESS

The PROCESS task is the workhorse of CoRB. Here, the list of documents selected by the URIS task is iterated over one item at a time, and each document URI or other string is passed to the transform XQuery or JavaScript module. The value is passed to the transform module as an external variable and therefore must be declared in the transform as:

declare variable $URI as xs:string external; 

The PROCESS task can be multi-threaded. To specify more than one thread, use the property THREAD-COUNT to split the documents amongst the threads.

The transform module is typically used either to transform data or to return values for use in a report. Multiple values for a report are typically returned as a comma-separated string, but could also be formatted as XML or JSON. When transforming data, it is often desirable to assign the updated document to a new collection in order to identify which documents have been modified by the CoRB job.

Properties that can be used to configure the PROCESS task are as follows:

PROCESS-MODULE=/path/to/transform.xqy
THREAD-COUNT=16
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask

POST-BATCH

Once the PROCESS task has completed, the POST-BATCH task is called. It can be used to perform cleanup, call a query or write footers to a batch report. To write dynamic footers, specify a POST-BATCH-MODULE; the ready-made Java task PostBatchUpdateFileTask will then write the module's comma-separated return value to the file named by EXPORT-FILE-NAME, assuming it has been specified. Alternatively, static footers can be set with the EXPORT-FILE-BOTTOM-CONTENT property, whose comma-separated values are also written to EXPORT-FILE-NAME. Finally, if EXPORT_FILE_AS_ZIP is set to true, PostBatchUpdateFileTask will zip the contents of EXPORT-FILE-NAME. In all these cases, it is PostBatchUpdateFileTask that takes the dynamic or static values, writes them to the file and, if requested, zips it.

Properties that can be used to configure the POST-BATCH task are as follows:

POST-BATCH-MODULE=/path/to/postBatch.xqy
POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask
EXPORT-FILE-BOTTOM-CONTENT=footerName1,footerName2,footerName3
EXPORT_FILE_AS_ZIP=true
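
The footer and zip steps can be approximated with standard Java I/O, as in the sketch below. It is not the PostBatchUpdateFileTask implementation: `PostBatchSketch` is a hypothetical name, and both the `.zip` suffix on the archive name and keeping the original file afterwards are assumptions made for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; not the PostBatchUpdateFileTask implementation.
class PostBatchSketch {

    // Appends the footer as a new last line of the export file.
    static void appendFooter(Path exportFile, String footer) {
        try {
            byte[] bytes = (footer + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
            Files.write(exportFile, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes "<name>.zip" next to the export file (naming is an assumption).
    static Path zip(Path exportFile) {
        Path zipFile = exportFile.resolveSibling(exportFile.getFileName() + ".zip");
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry(exportFile.getFileName().toString()));
            Files.copy(exportFile, zos); // stream the report into the archive entry
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return zipFile;
    }
}
```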