Roie Danino edited this page Dec 13, 2022 · 6 revisions

Basic usage

  • Build UCX with statistics support compiled in (./configure --enable-stats ... or ./contrib/configure-prof).
  • Set the UCX_STATS_DEST environment variable, for example UCX_STATS_DEST=stdout or UCX_STATS_DEST=file:/tmp/ucx_%h_%p.stat.
  • Run the application. With the default trigger (UCX_STATS_TRIGGER=exit), statistics are dumped when the program exits.
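The steps above can also be scripted. The following is a minimal sketch, not part of UCX: the helper names `ucx_stats_env` and `run_with_stats` are hypothetical, and the command passed to `run_with_stats` is assumed to be an application linked against a UCX build with statistics enabled.

```python
import os
import subprocess


def ucx_stats_env(dest, trigger="exit", base_env=None):
    """Return a copy of the environment with the UCX statistics variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_STATS_DEST"] = dest        # e.g. "stdout" or "file:/tmp/ucx_%h_%p.stat"
    env["UCX_STATS_TRIGGER"] = trigger  # "exit" is the documented default
    return env


def run_with_stats(cmd, dest="stdout", trigger="exit"):
    """Run cmd (a list of arguments) with statistics reporting enabled."""
    return subprocess.run(cmd, env=ucx_stats_env(dest, trigger), check=False)
```

For example, `run_with_stats(["./my_ucx_app"], dest="file:/tmp/ucx_%h_%p.stat")` would launch a hypothetical `./my_ucx_app` binary and, assuming the build has statistics support, leave one file per host and process in /tmp.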

Configuration options

#
# Destination to send statistics to. If the value is empty, statistics are
# not reported. Possible values are:
#   udp:<host>[:<port>]   - send over UDP to the given host:port.
#   stdout                - print to standard output.
#   stderr                - print to standard error.
#   file:<filename>[:bin] - save to a file (%h: host, %p: pid, %c: cpu, %t: time, %u: user, %e: exe)
#
# Syntax: string
#
UCX_STATS_DEST=

#
# Trigger to dump statistics:
#   exit              - dump just before program exits.
#   signal:<signo>    - dump when process is signaled.
#   timer:<interval>  - dump in specified intervals (in seconds).
#
# Syntax: string
#
UCX_STATS_TRIGGER=exit

#
# Used to filter the counters in the summary.
# Comma-separated list of glob patterns specifying counters.
# Statistics summary will contain only the matching counters.
# The order is not meaningful.
# Each expression in the list may contain any of the following wildcards:
#   *     - matches any number of any characters including none.
#   ?     - matches any single character.
#   [abc] - matches one character given in the bracket.
#   [a-z] - matches one character from the range given in the bracket.
#
# Syntax: comma-separated list of: string
#
UCX_STATS_FILTER=*

#
# Statistics format parameter:
#   full    - each counter will be displayed on a separate line.
#   agg     - like full, but similar counters will also be aggregated.
#   summary - all counters will be printed on the same line.
#
# Syntax: [full|agg|summary]
#
UCX_STATS_FORMAT=full
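As an illustration of the signal:<signo> and timer:<interval> triggers listed above, the sketch below formats the trigger values and requests a dump from a running process. The helper names are hypothetical. The choice of SIGUSR1 is an assumption: any signal the application does not otherwise use should serve. The exact way UCX parses these values is not confirmed here.

```python
import os
import signal


def signal_trigger(signo=signal.SIGUSR1):
    """Format a UCX_STATS_TRIGGER value of the form signal:<signo>."""
    return f"signal:{int(signo)}"


def timer_trigger(seconds):
    """Format a UCX_STATS_TRIGGER value of the form timer:<interval>."""
    return f"timer:{seconds}"


def request_dump(pid, signo=signal.SIGUSR1):
    """Send the chosen signal to a running process to request a dump.

    Assumes the process was started with the matching
    UCX_STATS_TRIGGER=signal:<signo> setting.
    """
    os.kill(pid, int(signo))
```

The application would be started with UCX_STATS_TRIGGER set to the value returned by `signal_trigger()`, and `request_dump(pid)` would then be called from another process whenever a dump is wanted.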

Details

Counting points are placed throughout the code. The counters are divided into classes, and the classes are arranged in a hierarchy. An example of classes and their relation may be:

 ucp_worker->uct_iface->uct_ep->rc_fc

For example, the class uct_ep contains the counters: am, put, get, atomic, bytes_short, bytes_bcopy, bytes_zcopy, no_res, flush, flush_wait.

The counters may be printed in two ways: as a full report or as a summary. In full report mode, all classes and their counters are printed. In summary mode, the user specifies the subset of counters to print, either as a list of counter names or as a list of glob patterns, and the result is a single line. For example, if the user specifies the following list:

UCX_STATS_FILTER=*copy*,*eager*

then the result will look like:

[elrond1:13966] ucp_worker{rx_eager_msg:10000 rx_eager_chunk_exp:1670000 rx_eager_chunk_unexp:0} ucp_ep{tx_eager:10000 tx_eager_sync:0} uct_ep{bytes_bcopy:10253440130 uct_ep.bytes_zcopy:0}
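To see which counters such a list selects, the glob matching can be approximated with Python's fnmatch module. This is a sketch only: it assumes the patterns are matched against the bare counter name, which may differ from what UCX does internally, and `filter_counters` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Counter names of the uct_ep class, as listed earlier on this page.
UCT_EP_COUNTERS = [
    "am", "put", "get", "atomic", "bytes_short", "bytes_bcopy",
    "bytes_zcopy", "no_res", "flush", "flush_wait",
]


def filter_counters(names, filter_value):
    """Keep the names that match any glob in a comma-separated filter."""
    patterns = [p for p in filter_value.split(",") if p]
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


print(filter_counters(UCT_EP_COUNTERS, "*copy*,*eager*"))
# → ['bytes_bcopy', 'bytes_zcopy']
```

Only the two copy counters of uct_ep survive the filter, which agrees with the uct_ep{...} part of the summary line above. The eager counters belong to other classes.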

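Each counter in the summary is the sum over all instances of its class. For example, uct_ep.bytes_bcopy has two instances in the hierarchy below, and their values (10253440000 and 130) add up to the 10253440130 shown above: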

ucp_worker-0x6aeb90:
    uct_iface-mlx5_0:1-0x6b4760:
        uct_ep-0x7289d0:
            bytes_bcopy: 10253440000
    uct_iface-mlx5_0:1-0x716020:
        uct_ep-0x732a30:
            bytes_bcopy: 130
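The accumulation can be sketched in a few lines of Python. The data layout and the `aggregate` helper are illustrative assumptions, not UCX code; the values are the two uct_ep instances from the hierarchy above.

```python
from collections import defaultdict

# (class name, counter name, value) for each uct_ep instance shown above.
INSTANCES = [
    ("uct_ep", "bytes_bcopy", 10253440000),  # uct_ep-0x7289d0
    ("uct_ep", "bytes_bcopy", 130),          # uct_ep-0x732a30
]


def aggregate(instances):
    """Sum each counter over all instances of its class."""
    totals = defaultdict(int)
    for cls, counter, value in instances:
        totals[(cls, counter)] += value
    return dict(totals)


print(aggregate(INSTANCES)[("uct_ep", "bytes_bcopy")])  # → 10253440130
```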

The list of counter names or glob patterns is set in the UCX_STATS_FILTER environment variable. If UCX_STATS_FILTER=*, a full report is produced; otherwise, a summary is produced.
