Skip to content

oar-team/colmet

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Colmet - Collecting metrics about jobs running in a distributed environnement

Introduction:

Colmet is a monitoring tool to collect metrics about jobs running in a distributed environnement, especially for gathering metrics on clusters and grids. It provides currently several backends :

  • Input backends:
    • taskstats: fetch task metrics from the linux kernel
    • rapl: intel processors realtime consumption metrics
    • perfhw: perf_event counters
    • jobproc: get infos from /proc
    • ipmipower: get power metrics from ipmi
    • temperature: get temperatures from /sys/class/thermal
    • infiniband: get infiniband/omnipath network metrics
    • lustre: get lustre FS stats
  • Output backends:
    • elasticsearch: store the metrics on elasticsearch indexes
    • hdf5: store the metrics on the filesystem
    • stdout: display the metrics on the terminal

It uses zeromq to transport the metrics across the network.

It is currently bound to the OAR RJMS.

A Grafana sample dashboard is provided for the elasticsearch backend. Here are some snapshots:

Installation:

Requirements

  • a Linux kernel that supports

    • Taskstats
    • intel_rapl (for RAPL backend)
    • perf_event (for perfhw backend)
    • ipmi_devintf (for ipmi backend)
  • Python Version 2.7 or newer

    • python-zmq 2.2.0 or newer
    • python-tables 3.3.0 or newer
    • python-pyinotify 0.9.3-2 or newer
    • python-requests
  • For the Elasticsearch output backend (recommended for sites with > 50 nodes)

    • An Elasticsearch server
    • A Grafana server (for visu)
  • For the RAPL input backend:

  • For the infiniband backend:

    • perfquery command line tool
  • for the ipmipower backend:

    • ipmi-oem command line tool (freeipmi) or other configurable command

Installation

You can install, upgrade, uninstall colmet with these commands::

$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet

Or from git (last development version)::

$ pip install [--user] git+https://github.com/oar-team/colmet.git

Or if you already pulled the sources::

$ pip install [--user] path/to/sources

Usage:

for the nodes :

sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556

for the collector :

# Simple local HDF5 file collect:
colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
# Collector with an Elasticsearch backend:
  colmet-collector -vvv \
    --zeromq-bind-uri tcp://192.168.0.1:5556 \
    --buffer-size 5000 \
    --sample-period 3 \
    --elastic-host http://192.168.0.2:9200 \
    --elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log

You will see the number of counters retrieved in the debug log.

For more information, please refer to the help of theses scripts (--help)

Notes about backends

Some input backends may need external libraries that need to be previously compiled and installed:

# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/

Here's acomplete colmet-node start-up process, with perfw, rapl and more backends:

export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so

colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
   --cpuset_rootpath /dev/cpuset/oar \
   --enable-infiniband --omnipath \
   --enable-lustre \
   --enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
   --enable-RAPL \
   --enable-jobproc \
   --enable-ipmipower >> /var/log/colmet.log 2>&1

RAPL - Running Average Power Limit (Intel)

RAPL is a feature on recent Intel processors that makes possible to know the power consumption of cpu in realtime.

Usage : start colmet-node with option --enable-RAPL

A file named RAPL_mapping.[timestamp].csv is created in the working directory. It established the correspondence between counter_1, counter_2, etc from collected data and the actual name of the metric as well as the package and zone (core / uncore / dram) of the processor the metric refers to.

If a given counter is not supported by harware the metric name will be "counter_not_supported_by_hardware" and 0 values will appear in the collected data; -1 values in the collected data means there is no counter mapped to the column.

Perfhw

This provides metrics collected using interface perf_event_open.

Usage : start colmet-node with option --enable-perfhw

Optionnaly choose the metrics you want (max 5 metrics) using options --perfhw-list followed by space-separated list of the metrics/

Example : --enable-perfhw --perfhw-list instructions cpu_cycles cache_misses

A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc from collected data and the actual name of the metric.

Available metrics (refers to perf_event_open documentation for signification) :

cpu_cycles 
instructions 
cache_references 
cache_misses 
branch_instructions
branch_misses
bus_cycles 
ref_cpu_cycles 
cache_l1d 
cache_ll
cache_dtlb 
cache_itlb 
cache_bpu 
cache_node 
cache_op_read 
cache_op_prefetch 
cache_result_access 
cpu_clock 
task_clock 
page_faults 
context_switches 
cpu_migrations
page_faults_min
page_faults_maj
alignment_faults 
emulation_faults
dummy
bpf_output

Temperature

This backend gets temperatures from /sys/class/thermal/thermal_zone*/temp

Usage : start colmet-node with option --enable-temperature

A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between counter_1, counter_2, etc from collected data and the actual name of the metric.

About

Colmet - Collecting metrics about jobs running in a distributed environnement

Resources

License

Stars

Watchers

Forks

Packages

No packages published