
Proposal 4

Benjamin Allan edited this page Dec 3, 2019 · 8 revisions

See also issue #116, "Lightweight job monitoring support (ljms) and simple user sampler": https://github.com/ovis-hpc/ovis/issues/116

Background

Some of the alternative methods to obtain application information are:

  • Use Caliper (LLNL) and pipe the data blobs via LDMS (if and when this combination is available).
  • Use the Kokkos sampler from LDMS to push JSON data sets periodically.
  • Use the shm sampler (aka MPI sampler) to poll shared-memory binary data files written by the application.
  • For progress detection, tail or filter a log file that the user specifies at run time.
  • For configuration detection, try to automatically detect input files and copy them elsewhere.
  • Have the application write directly to a network database (SQL, DSOS, etc.).
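
As an illustration of the log-filtering alternative, here is a minimal sketch. It assumes a hypothetical log format in which the application prints lines such as `step=120 residual=3.2e-05`; the pattern and the function name `latest_step` are invented for this example and are not part of LDMS.

```python
import re

# Hypothetical progress marker; a real deployment would need the user
# to supply a pattern that matches their application's log format.
PROGRESS_RE = re.compile(r"\bstep=(\d+)\b")


def latest_step(lines):
    """Return the step number from the last matching line, or None if none match."""
    latest = None
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            latest = int(match.group(1))
    return latest
```

The sketch also shows why this approach is fragile: it works only if the user can name the log file and its format stays stable, which the next paragraph notes is often not the case.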

In many cases, even application developers are not in a position to enforce creation or location of log and configuration files. Many simulation control languages have include statements, making auto-discovery of configuration input impossible.

Requirements and constraints

See #116 for the initial list; add further requirements here.

Possible solutions

  • Baseline:
    • Canonical data location in /dev/shm/jobmon/$JOBID.config, $JOBID.progress, $JOBID.env
      • The base directory may be overridden by an admin- or user-supplied environment variable (or by an argument to scripted utilities).
    • New Samplers
      • TOML jobmon file sampler
      • String-file-blob jobmon sampler with optional at-store decoding.
    • C class library API and supporting wrappers for developers/users to construct or parse data files.
      • Application defines event name/counter pairs for progress. Structured naming à la TOML.
      • App defines scope/name/value tuples for configuration parameter capture.
      • Library captures data types and includes them in text format somehow.
    • Store that is smart enough to just roll with schema changes.

The merits and demerits of the alternatives, preferably based on examples and (where needed) prototype implementations.

| requirement | caliper | kokkos | shm | detect file progress | detect config files | net database | baseline |
|---|---|---|---|---|---|---|---|
| free of ldmsd connect | no | no | yes | yes | yes | yes | yes |
| human readable | no | yes | no | yes | yes | no | yes |
| cheap/no low parse | yes | no | yes | no | no | yes | yes |
| bounded by API | no | yes | yes | maybe | maybe | yes | yes |
| free of net FS | yes | yes | yes | maybe | no | no | yes |

When appropriate, the agreed solution, implementation team, review/test team, and expected release time/version.
