
Creating Ingest Job Configuration Files


Overview

The ingest client completes an ingest job that is specified by a configuration file and facilitated by the Boss ingest service. The ingest service uses a simple JSON schema to capture all the information required to specify an ingest job. An ingest job is the act of uploading a fixed amount of data to the Boss, and can cover part or all of a dataset. In the following sections, we'll walk through the structure and content of the configuration file and the areas available for customization.

Currently, the ingest client supports two types of ingest: tile-based and volumetric.

Tile Based Ingests

These ingests upload data as 2D image tiles. For these ingests, either boss-v0.1-schema.json or boss-v0.2-schema.json may be used to describe the ingest job.

Volumetric Ingests

These ingests upload 3D "chunks" of data. This ingest type requires boss-v0.2-schema.json for configuring the ingest job. Volumetric ingest is more efficient than tile-based ingest, but you may need to write your own file-type adapter. Currently, the CloudVolume and Zarr formats are supported.

Configuration File Schema

We use JSON Schema v4 to specify and validate configuration file structure and content. Supported schema files are located at ingest/schema. Both the ingest client and ingest service will use the schema specified in the configuration file to perform validation.
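If you want to sanity-check a configuration file yourself before starting an ingest, you can run it through the jsonschema Python library directly. This mirrors, but is not identical to, the validation the client performs; the file paths below are placeholders for your own configuration file and the schema file it names:

import json

from jsonschema import Draft4Validator

with open("my_ingest_config.json") as f:
    config = json.load(f)

with open("ingest/schema/boss-v0.1-schema.json") as f:
    schema = json.load(f)

# Raises a ValidationError describing the first problem found
Draft4Validator(schema).validate(config)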

Boss Configuration File Details

The configuration file is simply a JSON document with 4 objects: schema, client, database, and ingest_job. An example configuration file based on the boss-v0.1-schema.json schema file is shown below. This example is a tile-based ingest using TIFFs. Additional examples that can be used as a starting point are located in ingest/configs.

{
  "schema": {
      "name": "boss-v0.1-schema",
      "validator": "BossValidatorV01"
  },
  "client": {
    "backend": {
      "name": "boss",
      "class": "BossBackend",
      "host": "api.theboss.io",
      "protocol": "https"
    },
    "path_processor": {
      "class": "ingest.plugins.multipage_tiff.SingleTimeTiffPathProcessor",
      "params": {
        "z_0": "/usr/local/data/my_file_1.tif"
      }
    },
    "tile_processor": {
      "class": "ingest.plugins.multipage_tiff.SingleTimeTiffTileProcessor",
      "params": {
        "datatype": "uint16",
        "filetype": "tif"
      }
    }
  },
  "database": {
    "collection": "my_col_1",
    "experiment": "my_exp_1",
    "channel": "my_ch_1"
  },
  "ingest_job": {
    "resolution": 0,
    "extent": {
      "x": [0, 4096],
      "y": [0, 4096],
      "z": [0, 1],
      "t": [0, 1000]
    },
    "tile_size": {
      "x": 512,
      "y": 512,
      "z": 1,
      "t": 1
    }
  }
}

schema

The schema object contains details about which schema file and validator class to use to perform the schema validation. For the current production Boss system, the values in the example above should be used when working from boss-v0.1-schema.
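For a configuration written against boss-v0.2-schema.json (required for volumetric ingests), the schema object names that file and its validator instead. The validator class name below is an assumption based on the v0.1 naming convention; confirm it against the schema files in ingest/schema:

"schema": {
    "name": "boss-v0.2-schema",
    "validator": "BossValidatorV02"
  }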

client

The client object contains details needed to configure the ingest client. It specifies the "backend" to use along with the custom "plugins" needed to translate tile indices to data for uploading.

  • backend

    This section should be used as shown in the example, unless otherwise directed. It specifies the type of backend in use, the interface class, the server URL, and the HTTP protocol.

  • path_processor

    The path_processor section specifies which path processor class to load. The path processor is responsible for converting the tile indices provided in an upload task message to an absolute file path to the data stored locally. This lets users handle custom ways to organize and access data.

    The value of the class item should be the absolute import path of the class. As long as the plugin class is on the Python path during execution, the ingest client will be able to dynamically import and load it. More detail regarding plugins can be found on the Creating Custom Plugins page.

    The params item is a dictionary that can contain any custom parameters that should be passed from the configuration file to the path processor class.

  • tile_processor

    The tile_processor section specifies which tile processor class to load. The tile processor is responsible for converting the absolute file path and tile indices to a file handle. This lets users handle custom data types, file formats, and ways of loading data.

    The value of the class item should be the absolute import path of the class. As long as the plugin class is on the Python path during execution, the ingest client will be able to dynamically import and load it.

    The params item is a dictionary that can contain any custom parameters that should be passed from the configuration file to the tile processor class.

  • chunk_processor (introduced in boss-v0.2-schema)

    The chunk_processor section specifies which chunk processor class to load when doing a volumetric ingest. The chunk processor is responsible for converting an absolute file path and chunk indices into the 3D data to upload (see the configuration sketch after this list).

    The value of the class item should be the absolute import path of the class. As long as the plugin class is on the Python path during execution, the ingest client will be able to dynamically import and load it.

    The params item is a dictionary that can contain any custom parameters that should be passed from the configuration file to the chunk processor class.
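For reference, a volumetric configuration replaces the tile_processor section with a chunk_processor section along these lines. The class path and params below are hypothetical placeholders; substitute the absolute import path of your own chunk processor plugin:

"chunk_processor": {
    "class": "my_plugins.volumetric.MyChunkProcessor",
    "params": {
      "datatype": "uint16"
    }
  }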

database

The database object specifies the channel to which data should be written. You must specify the collection, experiment, and channel, and you must have write privileges on the channel. Currently, you cannot dynamically create collections, experiments, or channels from within the ingest client; you must create them in advance, either programmatically via the API (see the sketch below) or through the Boss Management Console (not yet deployed).
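For example, the required resources can be created ahead of time with the intern Python library. The sketch below is illustrative only: it assumes intern is installed and configured with your API token, the coordinate frame name and values are made up to match the example extent, and the constructor arguments should be double-checked against intern's documentation.

from intern.remote.boss import BossRemote
from intern.resource.boss.resource import (CollectionResource,
                                           CoordinateFrameResource,
                                           ExperimentResource,
                                           ChannelResource)

# Reads credentials from the default intern configuration file
rmt = BossRemote()

# Create the collection
rmt.create_project(CollectionResource('my_col_1'))

# Create a coordinate frame that matches the dataset extent
# (the name and extent values here are illustrative)
rmt.create_project(CoordinateFrameResource('my_frame_1', '',
                                           0, 4096, 0, 4096, 0, 1))

# Create the experiment and channel
rmt.create_project(ExperimentResource('my_exp_1', 'my_col_1', 'my_frame_1'))
rmt.create_project(ChannelResource('my_ch_1', 'my_col_1', 'my_exp_1',
                                   'image', datatype='uint16'))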

ingest_job

The ingest_job section is used to specify the resolution, extent, and tile size of the ingest job.

  • resolution - This item stores the level in the resolution hierarchy to which the data should be written. In almost all cases this should be set to 0, which indicates the "base" or "native" resolution of a dataset. Occasionally you may want to ingest to a different level, but typically this would only be for an annotation dataset.

  • extent - This item stores the extent of this ingest job in both space and time. The extent item is used to specify not only how big the chunk of data you desire to ingest is, but also where it is in space and time in your dataset. The values for x, y, z, and t are ranges, starting point inclusive and stopping point exclusive, indicating the chunk of data to ingest. If your dataset does not have a time component, simply set the t item to [0, 1]. The starting point can be used to "offset" into a larger dataset if doing a partial ingest.

It is acceptable to ingest an entire dataset at once if desired, but you can also do partial ingests. More concretely, suppose you collected a non-time-series dataset that is 10000x10000x1000 voxels in x, y, and z, but only want to upload a 4096x4096x1000 region at the moment. This would result in an extent item that looks like this:

"extent": {
    "x": [0, 4096],
    "y": [0, 4096],
    "z": [0, 1000],
    "t": [0, 1]
  }

If you then wanted to ingest more of the data at a later date, a second ingest configuration file could be created with an extent item like the one below. (Note that this second job covers only the region diagonally adjacent to the first; fully covering the remaining L-shaped region would require additional jobs.)

"extent": {
    "x": [4096, 10000],
    "y": [4096, 10000],
    "z": [0, 1000],
    "t": [0, 1]
  }

There are a few caveats to what extent values are valid. Because ingest runs completely asynchronously and distributed, we need to be careful not to miss or overlap write operations to the underlying storage system. When doing partial ingest jobs, the x and y spatial extents of intermediate jobs must be divisible by 512, as demonstrated in the examples above (4096 = 8 x 512). The last ingest job to cover the xy extent does not have to be divisible by 512. This is because ingest is a write-only process, and under the hood data is stored in blocks that are 512x512 in x and y. If an intermediate ingest job does not completely fill a 512x512 block, the partial block will be "stomped on" by a subsequent ingest job and the data previously written to it will be lost.

  • tile_size - this item specifies the size of an individual source image file for the ingest. Currently, to maximize flexibility, the ingest process only operates on 2D tiles, so both z and t should typically be 1. At present, individual tiles should be 4096x4096 pixels or smaller in x and y. If your underlying image file is not a tile but some other representation (e.g. a 3D matrix stored in an HDF5 file), you can handle the conversion in the path and tile processor plugins.

  • chunk_size (introduced in boss-v0.2-schema) - this item is relevant for volumetric ingests. It specifies the size of the 3D chunk of data to use for uploading. The dimensions of the chunk must be a multiple of a Boss cuboid, which is 512, 512, 16 in x, y, and z, respectively (see the sketch below).
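As an illustration, a volumetric ingest_job for the partial-ingest example above might look like the fragment below. The chunk_size values are assumptions that satisfy the cuboid rule (1024 = 2 x 512 in x and y, 64 = 4 x 16 in z); confirm the exact field layout against boss-v0.2-schema.json:

"ingest_job": {
    "resolution": 0,
    "extent": {
      "x": [0, 4096],
      "y": [0, 4096],
      "z": [0, 1000],
      "t": [0, 1]
    },
    "chunk_size": {
      "x": 1024,
      "y": 1024,
      "z": 64
    }
  }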

Plugin Custom Parameters

"Plugins" are used to handle the myriad of file formats and data organization paradigms users of the Boss may have. A plugin is essentially just two custom classes that implement the abstract PathProcessor and either TileProcessor or ChunkProcessor classes. You will notice a parameters argument to both PathProcessor.setup(), TileProcessor.setup(), and ChunkProcessor.setup(). This argument will be set by the ingest client to the dictionaries stored in the params item in the path_processor, tile_processor, chunk_processor sections of the configuration file. Refer to the Creating Custom Plugins page for more details on how to create plugins for your data.