Martian 4.0.0

@adam-azarchs released this 17 Jun 22:15

Martian 4 is a major version update, meaning it may contain breaking
changes as well as new features.

The main features for this release are an overhauled type system with
support for typed maps and structs, and a map call construct, which
replaces and extends the (now removed) sweep functionality.

New types

Typed maps

Stage and pipeline input and output parameters may now be declared with
a type like map<int>. These are dictionaries with string keys and
values of the given type. If a top-level pipeline has an output that is
a map over file types, for example map<csv>, then the resulting output
directory will contain a subdirectory with the name of that parameter,
within which there will be files named like <map key>.csv. To encourage
clarity in pipeline definitions, directly nesting typed maps (e.g. map<map>
or map<map<int>>) is not permitted; however, one can have a map of structs
(see below), which may in turn contain more typed maps.
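
As a minimal sketch (the filetype, stage, pipeline, and parameter names
here are all hypothetical), a top-level pipeline carrying a map-over-file-type
output might look like this:

filetype csv;

stage SUMMARIZE(
    # A dictionary with string keys and int values.
    in  map<int> counts_per_sample,
    out map<csv> per_sample_reports,
    src comp     "summarize",
)

pipeline TOP(
    in  map<int> counts_per_sample,
    # Because this top-level output is a map over a file type, the output
    # directory gets a per_sample_reports/ subdirectory containing one
    # <map key>.csv file per key.
    out map<csv> per_sample_reports,
)
{
    call SUMMARIZE(
        counts_per_sample = self.counts_per_sample,
    )

    return (
        per_sample_reports = SUMMARIZE.per_sample_reports,
    )
}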

Struct types

It is now possible to declare struct data types as well. These look
just like stage or pipeline definitions, except they have no in or
out specifiers on the parameter names (and of course have no calls or
src parameter). Like typed maps, structs which contain file types also
get a directory in the top-level output. Structs can be nested, and there
may be typed maps of structs. This, at long last,
allows pipelines to organize the files in their output directories into
subdirectories.
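
For instance (a hypothetical sketch; the filetypes, member names, and stage
below are illustrative only), a struct mixing file and scalar members might
be declared and produced like this:

filetype fastq;
filetype bam;
filetype csv;

# A struct looks like a stage or pipeline definition, but its parameters
# have no in/out specifiers and there is no src.
struct AlignmentResult (
    bam   alignments,
    csv   metrics,
    int   read_count,
    float mean_quality,
)

stage ALIGN(
    in  fastq           reads,
    # If a top-level pipeline forwards this output, the file-typed members
    # land in a subdirectory of the output directory named after the
    # output parameter.
    out AlignmentResult result,
    src comp            "align",
)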

With the addition of structs, the previous behavior where passing

    foo = STAGE_NAME,

was equivalent to

    foo = STAGE_NAME.default

no longer holds. Instead of an implicit reference to the "default"
output, the bound reference is now to a struct containing all of the
stage's outputs. In order to support this, any stage or pipeline name
now implicitly defines a struct with the same members as the output
parameters of the stage or pipeline.
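
For example (hypothetical names, sketched under the assumption that the
implicitly-defined struct type can be used directly as an input type):

filetype bam;
filetype csv;

stage ALIGN_READS(
    out bam  alignments,
    out csv  metrics,
    src comp "align_reads",
)

# ALIGN_READS implicitly defines a struct type whose members match its
# outputs, so a downstream stage can accept all of them at once:
stage REPORT(
    in  ALIGN_READS align,
    src comp        "report",
)

Then, inside a pipeline body,

call REPORT(
    # Binds a struct of all of ALIGN_READS's outputs, rather than an
    # implicit ALIGN_READS.default as in earlier versions of Martian.
    align = ALIGN_READS,
)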

Structs are decomposable, so STAGE_NAME.foo.bar.baz is legal if foo is
a struct with a member bar, which is also a struct with a member baz.
In addition, the . operator allows "projecting" through typed maps and arrays.
That is, if we have

struct Bar (
    int baz,
)

struct Foo (
    Bar[] bar,
)

stage STAGE(
    out map<Foo> foo,
    src comp     "stagecode",
)

then STAGE.foo.bar.baz would have a type of map<int[]>. This
becomes especially useful when working with the next feature, map calls.

An input parameter that takes a [map or array of] struct A can accept
a [map or array of] another struct B so long as all of the members of A
are present on B and have the same types. If B has members which A does
not have, they are filtered out when generating the arguments which are
passed to stage code. This allows stages to add additional output
fields without breaking downstream users.
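
As a hypothetical illustration (the struct and stage names below are
invented), a stage producing a wider struct can feed a stage which declares
only the subset it needs:

struct BasicMetrics (
    int   read_count,
    float mean_quality,
)

struct FullMetrics (
    int      read_count,
    float    mean_quality,
    # Extra member which BasicMetrics does not have.
    map<int> per_barcode_counts,
)

stage COMPUTE_METRICS(
    out FullMetrics metrics,
    src comp        "compute_metrics",
)

stage CHECK_QUALITY(
    in  BasicMetrics metrics,
    src comp         "check_quality",
)

Then, inside a pipeline body, binding

call CHECK_QUALITY(
    metrics = COMPUTE_METRICS.metrics,
)

is accepted; the extra per_barcode_counts member is filtered out of the
arguments passed to CHECK_QUALITY's stage code.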

Map calls

It is now possible to call a stage or pipeline once for each value in an
array or typed map. That is,

map call ANALYZE(
    sample = split self.samples,
    params = self.params,
)

In a map call, at least one parameter's value must be preceded by the
split keyword. If more than one parameter is split, all such
parameters must either be arrays with the same length, maps with the
same set of keys, or null. In this example, ANALYZE is called once
for every value in samples. If ANALYZE is a pipeline, and some of
the stages within it don't depend on sample (or on other stages which
do), the work for those stages is shared across all of the calls.

If samples is an array, then the result of this call is an array of
the same length. If it is a map, the result is a map with the same keys.
This allows for reducing the data, e.g.

call META_ANALYSIS(
    analyses = ANALYZE,
)
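
Putting the two calls together, a complete pipeline which maps over samples
and then reduces might look like the sketch below. It assumes ANALYZE and
META_ANALYSIS are declared elsewhere (for example via @include), with
META_ANALYSIS taking a map of ANALYZE's output structs and producing a
hypothetical report output.

filetype fastq;
filetype json;

struct Sample (
    string  name,
    fastq[] reads,
)

pipeline RUN_SAMPLES(
    in  map<Sample> samples,
    in  json        params,
    out json        combined_report,
)
{
    # ANALYZE runs once per key of self.samples; its outputs become maps
    # with the same keys.
    map call ANALYZE(
        sample = split self.samples,
        params = self.params,
    )

    # META_ANALYSIS receives the map of ANALYZE's output structs.
    call META_ANALYSIS(
        analyses = ANALYZE,
    )

    return (
        combined_report = META_ANALYSIS.report,
    )
}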

Tools

  • The mrc and mrf commands have been merged into a single
    mro command. Symlink aliases for mrc and mrf still work.
    • mro check works just like mrc.
    • mro format works just like mrf.
    • mro graph has options for querying the call graph of a pipeline,
      including outputting the entire graph to json or graphviz dot
      format, querying the source of an input to a call, or tracing the
      stages which depend on the output of a call.
    • mro edit has various refactoring tools for renaming stages, inputs,
      and outputs, as well as finding and eliminating unused outputs or calls.
  • The mrg command accepts a --reverse option which causes it to
    generate an invocation.json file from a given mro file.
  • When a syntax error is encountered in mrc, the expected token
    is now provided.

Runtime changes

  • It is now an error for a call to be disabled based on a null value.
    Previously, null was treated as equivalent to false, which was not
    always the author's intent.
  • Thread reservations may now be specified in terms of 100ths of a core.
    This is intended for stages which, for example, spend most of their time
    blocked waiting on external inputs or downloading files (see the sketch
    after this list).
  • Memory reservations may now be non-integral numbers of GB.
    They are tracked at MiB granularity.
  • Pre-populated _outs files no longer contain strings for keys which
    are supposed to be arrays of file types.
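
As a minimal sketch of what the new fractional reservations look like (the
stage and the particular values are hypothetical), using the stage-level
using() resource clause:

stage FETCH_FILES(
    in  string[] urls,
    out path     downloads,
    src comp     "fetch_files",
) using (
    # Reserve 5% of a core: this stage mostly blocks on network I/O.
    threads = 0.05,
    # Fractional GB; tracked internally at MiB granularity.
    mem_gb  = 1.5,
)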

Other changes

  • The mro parser is now significantly faster and uses less memory.
  • mrjob and mro can now be compiled and run on macOS (darwin).
  • The build now relies entirely on Go modules, rather than submodules.
  • CoffeeScript is no longer involved in the build for the web front-end.
  • Vendored web dependencies have been removed; they are now managed with npm.
  • The journal files used for coordination between mrp and mrjob
    no longer include the sample ID. Long sample IDs could cause
    the journal file name to exceed the filesystem's file name length
    limits.
  • The repository now includes bazel rule definitions for
    mro_library, mro_test, and mrf_test, among others. See
    the documentation in tools/docs/mro_rules.md.