[Metric] Fix prometheus metric backend (#3124)
zhongchun committed Jun 9, 2022
1 parent cd22c4c commit ff0e925
Showing 7 changed files with 215 additions and 34 deletions.
91 changes: 89 additions & 2 deletions docs/source/development/metrics.rst
@@ -59,8 +59,95 @@ Mars metrics support three different backends:
* ``prometheus`` is an open-source systems monitoring and alerting toolkit.
* ``ray`` is a metric backend which just runs on ray engine.

We can choose a metric backend by configuring ``metrics.backend`` in
``mars/deploy/oscar/base_config.yml`` or its descendant files.

Console
````````````````

The default metric backend is ``console``. It simply logs each recorded value
when the log level is ``debug``.
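
As a rough illustration of what logging only at ``debug`` level implies, the sketch below uses nothing but the standard ``logging`` module. The logger name ``mars.metrics.example``, the metric name, and the helper ``record_metric`` are hypothetical stand-ins rather than Mars APIs; the point is only that debug-level records are dropped unless the effective level is ``DEBUG``.

```python
import io
import logging

# Hypothetical logger name and helper, used only for illustration.
logger = logging.getLogger("mars.metrics.example")


def record_metric(name: str, value: float) -> None:
    """Log a metric value at debug level, as a console-style backend might."""
    logger.debug("%s: %s", name, value)


# Capture log output in memory so the effect of the level is visible.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.propagate = False

logger.setLevel(logging.INFO)
record_metric("example.task_count", 1)  # dropped: DEBUG is below INFO

logger.setLevel(logging.DEBUG)
record_metric("example.task_count", 2)  # emitted

captured = stream.getvalue()
print(captured, end="")
```

Only the second call produces output, so a session that should show console metrics needs its log level lowered to ``debug`` first.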

Prometheus
````````````````

Firstly, we should download Prometheus. For details, please refer to
`Prometheus Getting Started
<https://prometheus.io/docs/prometheus/latest/getting_started/>`_.

Secondly, we can create a new Mars session with the Prometheus backend configured as follows:

.. code-block:: python

    In [1]: import mars
    In [2]: session = mars.new_session(
       ...:     n_worker=1,
       ...:     n_cpu=2,
       ...:     web=True,
       ...:     config={"metrics.backend": "prometheus"}
       ...: )
    Finished startup prometheus http server and port is 15768
    Finished startup prometheus http server and port is 44303
    Finished startup prometheus http server and port is 63391
    Finished startup prometheus http server and port is 13722
    Web service started at http://0.0.0.0:15518

Thirdly, we should configure Prometheus. For more configuration options, please refer to
`Prometheus Configuration
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/>`_.

.. code-block:: yaml

    scrape_configs:
      - job_name: 'mars'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:15768', 'localhost:44303', 'localhost:63391', 'localhost:13722']

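
Because each Mars process reports its own port at startup, the ``targets`` list has to be assembled from the logged ports. A minimal sketch of doing that in Python, assuming the ports were copied by hand from the session startup log; ``build_scrape_config`` is a hypothetical helper and not part of Mars or Prometheus:

```python
from typing import Iterable


def build_scrape_config(
    ports: Iterable[int],
    host: str = "localhost",
    job_name: str = "mars",
    interval: str = "5s",
) -> str:
    """Render a Prometheus ``scrape_configs`` section for the given ports."""
    targets = ", ".join(f"'{host}:{port}'" for port in ports)
    return (
        "scrape_configs:\n"
        f"  - job_name: '{job_name}'\n"
        f"    scrape_interval: {interval}\n"
        "    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


# Ports copied from the session startup log.
config_text = build_scrape_config([15768, 44303, 63391, 13722])
print(config_text)
```

The ports change on every session start unless a fixed port is configured, so the file has to be regenerated each time in this setup.
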
Then start Prometheus:

.. code-block:: shell

    $ prometheus --config.file=promconfig_mars.yaml
    level=info ts=2022-06-07T13:05:01.484Z caller=main.go:296 msg="no time or size retention was set so using the default time retention" duration=15d
    level=info ts=2022-06-07T13:05:01.484Z caller=main.go:332 msg="Starting Prometheus" version="(version=2.13.1, branch=non-git, revision=non-git)"
    level=info ts=2022-06-07T13:05:01.484Z caller=main.go:333 build_context="(go=go1.13.1, user=brew@Mojave.local, date=20191018-01:13:04)"
    level=info ts=2022-06-07T13:05:01.485Z caller=main.go:334 host_details=(darwin)
    level=info ts=2022-06-07T13:05:01.485Z caller=main.go:335 fd_limits="(soft=256, hard=unlimited)"
    level=info ts=2022-06-07T13:05:01.485Z caller=main.go:336 vm_limits="(soft=unlimited, hard=unlimited)"
    level=info ts=2022-06-07T13:05:01.487Z caller=main.go:657 msg="Starting TSDB ..."
    level=info ts=2022-06-07T13:05:01.488Z caller=web.go:450 component=web msg="Start listening for connections" address=0.0.0.0:9090
    level=info ts=2022-06-07T13:05:01.494Z caller=head.go:514 component=tsdb msg="replaying WAL, this may take awhile"
    level=info ts=2022-06-07T13:05:01.495Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=1
    level=info ts=2022-06-07T13:05:01.495Z caller=head.go:562 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=1
    level=info ts=2022-06-07T13:05:01.497Z caller=main.go:672 fs_type=1a
    level=info ts=2022-06-07T13:05:01.497Z caller=main.go:673 msg="TSDB started"
    level=info ts=2022-06-07T13:05:01.497Z caller=main.go:743 msg="Loading configuration file" filename=promconfig_mars.yaml
    level=info ts=2022-06-07T13:05:01.501Z caller=main.go:771 msg="Completed loading of configuration file" filename=promconfig_mars.yaml
    level=info ts=2022-06-07T13:05:01.501Z caller=main.go:626 msg="Server is ready to receive web requests."

Fourthly, run a Mars task:

.. code-block:: python

    In [3]: import numpy as np
    In [4]: import mars.dataframe as md
    In [5]: df1 = md.DataFrame(np.random.randint(0, 3, size=(10, 4)),
       ...:                    columns=list('ABCD'), chunk_size=5)
       ...: df2 = md.DataFrame(np.random.randint(0, 3, size=(10, 4)),
       ...:                    columns=list('ABCD'), chunk_size=5)
       ...:
       ...: r = md.merge(df1, df2, on='A').execute()

Finally, we can check the metrics in the Prometheus web UI at http://localhost:9090.
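
Besides the web UI, each Mars process serves its metrics in the Prometheus text exposition format on the port it logged at startup. The sketch below parses that format with the standard library only. ``parse_samples`` is a hypothetical helper, the sample text is made up for illustration, and the parser assumes label values contain no commas or spaces. The commented-out ``urlopen`` call assumes a process is still listening on port 15768, as in the startup log.

```python
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


def parse_samples(text: str) -> Dict[Tuple[str, LabelSet], float]:
    """Parse simple ``name{k="v",...} value`` lines, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = []
        for pair in label_part.rstrip("}").split(","):
            if pair:
                key, _, raw = pair.partition("=")
                labels.append((key, raw.strip('"')))
        samples[(name, tuple(sorted(labels)))] = float(value)
    return samples


# Made-up payload in the text exposition format, for illustration only.
sample_text = """\
# HELP example:metric A made-up gauge
# TYPE example:metric gauge
example:metric{host="node1",pid="123"} 2.0
"""

# To read from a live process instead (assumed port, not executed here):
# from urllib.request import urlopen
# sample_text = urlopen("http://localhost:15768/metrics").read().decode()

samples = parse_samples(sample_text)
print(samples)
```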

Ray
````````````````

We can configure ``metrics.backend`` when creating a Ray cluster or a new session.

Metrics Naming Convention
-------------------------
96 changes: 75 additions & 21 deletions docs/source/locale/zh_CN/LC_MESSAGES/development/metrics.po
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: mars 0.9.0rc2+18.g21929ced5\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-04-24 12:19+0800\n"
"POT-Creation-Date: 2022-06-08 14:41+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
@@ -53,8 +53,8 @@ msgstr "``Meter`` 是一组事件发生的速率。 我们可以将其用作 qps

#: ../../source/development/metrics.rst:16
msgid ""
"``Histogram`` is a type of statistics which records the average value of"
" a window data."
"``Histogram`` is a type of statistics which records the average value of "
"a window data."
msgstr "``Histogram`` 是一种统计类型,它记录窗口数据的平均值。"

#: ../../source/development/metrics.rst:18
@@ -66,8 +66,9 @@ msgid ""
"**Note**: If ``tag_keys`` is declared, ``tags`` must be specified when "
"invoking ``record`` method and tags' keys must be consistent with "
"``tag_keys``."
msgstr "**注意**:如果声明了 ``tag_keys``,调用 ``record`` 方法时必须指定 ``tags`` "
"参数,并且 ``tags`` 的 keys 必须跟 ``tag_keys`` 保持一致。"
msgstr ""
"**注意**:如果声明了 ``tag_keys``,调用 ``record`` 方法时必须指定 ``tags`` 参数,并且 ``tags`` 的"
" keys 必须跟 ``tag_keys`` 保持一致。"

#: ../../source/development/metrics.rst:54
msgid "Three different Backends"
@@ -89,40 +90,93 @@ msgstr "``prometheus`` 一个开源系统监控和报警工具包。"
msgid "``ray`` is a metric backend which just runs on ray engine."
msgstr "``ray`` 是一种运行在 ray 引擎上的 metric 后端。"

#: ../../source/development/metrics.rst:62
#: ../../source/development/metrics.rst:63
msgid "Console"
msgstr ""

#: ../../source/development/metrics.rst:65
msgid ""
"The default metric backend is ``console``. It just logs the value when "
"log level is ``debug``."
msgstr "默认的 metric 后端是 ``console``. 它只是在日志级别为 ``debug`` 时打印出 metric 的值。"

#: ../../source/development/metrics.rst:69
msgid "Prometheus"
msgstr ""

#: ../../source/development/metrics.rst:71
msgid ""
"Firstly, we should download Prometheus. For details, please refer to "
"`Prometheus Getting Started "
"<https://prometheus.io/docs/prometheus/latest/getting_started/>`_."
msgstr ""
"首先,我们需要下载 Prometheus。具体的可以参考 `Prometheus Getting Started "
"<https://prometheus.io/docs/prometheus/latest/getting_started/>`_."

#: ../../source/development/metrics.rst:75
msgid ""
"We can choose a metric backend by configuring ``metrics.backend`` in "
"``mars/deploy/oscar/base_config.yml`` or its descendant files."
msgstr "我们可以通过配置 ``mars/deploy/oscar/base_config.yml`` 或它的继承文件中的 "
"``metrics.backend`` 来选择一种 metric 后端。"
"Secondly, we can new a Mars session by configuring Prometheus backend as "
"follows:"
msgstr "其次,我们可以如下配置 Prometheus 后端来启动一个 Mars session:"

#: ../../source/development/metrics.rst:66
#: ../../source/development/metrics.rst:93
msgid ""
"Thirdly, we should config Prometheus, more configurations please refer to"
" `Prometheus Configuration "
"<https://prometheus.io/docs/prometheus/latest/configuration/configuration/>`_."
msgstr ""
"第三,我们要配置 Prometheus,更多的配置可以参考 `Prometheus Configuration "
"<https://prometheus.io/docs/prometheus/latest/configuration/configuration/>`_."

#: ../../source/development/metrics.rst:108
msgid "Then start Prometheus:"
msgstr "接着,启动 Prometheus:"

#: ../../source/development/metrics.rst:130
msgid "Fourthly, run a Mars task:"
msgstr "第四,执行一个 Mars task:"

#: ../../source/development/metrics.rst:145
msgid "Finally, we can check metrics in Prometheus web http://localhost:9090."
msgstr "最后,我们可以在 Prometheus 的网页端 http://localhost:9090 查看 metrics。"

#: ../../source/development/metrics.rst:148
msgid "Ray"
msgstr ""

#: ../../source/development/metrics.rst:150
msgid ""
"We could config ``metrics.backend`` when creating a Ray cluster or new a "
"session."
msgstr "我们可以在创建 Ray cluster 时或新建 session 时配置 ``metrics.backend``。"

#: ../../source/development/metrics.rst:153
msgid "Metrics Naming Convention"
msgstr "Metrics 命名约定"

#: ../../source/development/metrics.rst:68
#: ../../source/development/metrics.rst:155
msgid "We propose a naming convention for metrics as follows:"
msgstr "我们提出一种如下的 metrics 命名约定:"

#: ../../source/development/metrics.rst:70
#: ../../source/development/metrics.rst:157
msgid "``namespace.[component].metric_name[_units]``"
msgstr ""

#: ../../source/development/metrics.rst:72
#: ../../source/development/metrics.rst:159
msgid "``namespace`` could be ``mars``."
msgstr "``namespace`` 可以是 ``mars``。"

#: ../../source/development/metrics.rst:73
msgid "``component`` could be `supervisor`, `worker` or `band` etc, and can be "
#: ../../source/development/metrics.rst:160
msgid ""
"``component`` could be `supervisor`, `worker` or `band` etc, and can be "
"omitted."
msgstr "``component`` 可以是 `supervisor`,`worker` 或 `band` 等等,也可以省略这个参数。"

#: ../../source/development/metrics.rst:74
#: ../../source/development/metrics.rst:161
msgid ""
"``units`` is the metric unit which may be seconds when recording time, or"
" ``_count`` when metric type is ``Counter``, ``_number`` when metric type"
" is ``Gauge`` if there is no suitable unit."
msgstr "``units`` 是 metric 的单位,当记录的是时间时,可以用 seconds,当没有合适的单位"
"时,``Counter`` 类型的 metric 可以用 ``_count``,``Gauge`` 类型的 metric 可以用 "
"``_number``。"

msgstr ""
"``units`` 是 metric 的单位,当记录的是时间时,可以用 seconds,当没有合适的单位时,``Counter`` 类型的 "
"metric 可以用 ``_count``,``Gauge`` 类型的 metric 可以用 ``_number``。"
2 changes: 1 addition & 1 deletion mars/metrics/__init__.py
@@ -13,5 +13,5 @@
 # limitations under the License.

 from .api import Metrics
-from .api import init_metrics
+from .api import init_metrics, shutdown_metrics
 from .api import record_time_cost_percentile, Percentile
19 changes: 18 additions & 1 deletion mars/metrics/api.py
@@ -26,6 +26,7 @@

 logger = logging.getLogger(__name__)

+_init = False
 _metric_backend = "console"
 _backends_cls = {
     "console": console_metric,
@@ -35,6 +36,10 @@


 def init_metrics(backend="console", config: Dict[str, Any] = None):
+    global _init
+    if _init is True:
+        return
+
     backend = backend or "console"
     if backend not in _backends_cls:
         raise NotImplementedError(f"Do not support metric backend {backend}")
@@ -43,17 +48,29 @@ def init_metrics(backend="console", config: Dict[str, Any] = None):
     if _metric_backend == "prometheus":
         try:
             from prometheus_client import start_http_server
+            from ..utils import get_next_port

             port = config.get("port", 0) if config else 0
+            port = port or get_next_port()
             start_http_server(port)
-            logger.info("Finished startup prometheus http server and port is %d", port)
+            logger.warning(
+                "Finished startup prometheus http server and port is %d", port
+            )
         except ImportError:
             logger.warning(
                 "Failed to start prometheus http server because there is no prometheus_client"
             )
+    _init = True
     logger.info("Finished initialize the metrics with backend %s", _metric_backend)


+def shutdown_metrics():
+    global _metric_backend
+    _metric_backend = "console"
+    global _init
+    _init = False
+
+
 class Metrics:
     """
     A factory to generate different types of metrics.
25 changes: 20 additions & 5 deletions mars/metrics/backends/prometheus/prometheus_metric.py
@@ -12,6 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import os
+import socket
+
 from typing import Optional, Dict

 from ....utils import lazy_import
@@ -28,16 +31,28 @@

 class PrometheusMetricMixin(AbstractMetric):
     def _init(self):
-        self._metric = (
-            pc.Gauge(self._name, self._description, self._tag_keys) if pc else None
+        # Prometheus metric name must match the regex `[a-zA-Z_:][a-zA-Z0-9_:]*`
+        # `.` is a common character in metrics, so here replace it with `:`
+        self._name = self._name.replace(".", ":")
+        self._tag_keys = self._tag_keys + (
+            "host",
+            "pid",
         )
+        self._tags = {"host": socket.gethostname(), "pid": os.getpid()}
+        try:
+            self._metric = (
+                pc.Gauge(self._name, self._description, self._tag_keys) if pc else None
+            )
+        except ValueError:  # pragma: no cover
+            self._metric = None

     def _record(self, value=1, tags: Optional[Dict[str, str]] = None):
         if self._metric:
-            if tags:
-                self._metric.labels(**tags).set(value)
+            if tags is not None:
+                tags.update(self._tags)
             else:
-                self._metric.set(value)
+                tags = self._tags
+            self._metric.labels(**tags).set(value)


 class Counter(PrometheusMetricMixin, AbstractCounter):
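
To make the two changes in this hunk concrete, the sketch below shows the name rewrite and the default-tag merge in isolation. ``to_prometheus_name`` and ``merge_tags`` are hypothetical helpers written for this note, and the metric name is made up to follow the documented naming convention. Unlike the hunk, which calls ``tags.update(...)`` on the caller's dict, ``merge_tags`` returns a new dict and stringifies the pid; both are deliberate variations, not what the commit does.

```python
import os
import re
import socket
from typing import Dict, Optional

# Pattern quoted in the code comment of the hunk.
_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def to_prometheus_name(name: str) -> str:
    """Replace ``.`` with ``:`` and check the result against the pattern."""
    candidate = name.replace(".", ":")
    if not _NAME_RE.fullmatch(candidate):
        raise ValueError(f"invalid Prometheus metric name: {candidate!r}")
    return candidate


def merge_tags(tags: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of ``tags`` extended with host and pid labels."""
    merged = dict(tags or {})
    merged.update({"host": socket.gethostname(), "pid": str(os.getpid())})
    return merged


user_tags = {"service": "mars"}
merged = merge_tags(user_tags)
metric_name = to_prometheus_name("mars.band.slot_number")  # made-up name
print(metric_name)
print(sorted(merged))
```

Copying before merging keeps a tag dict that the caller reuses across ``record`` calls free of the extra ``host`` and ``pid`` keys.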
Original file line number Diff line number Diff line change
@@ -54,7 +54,8 @@ def test_counter(start_prometheus_http_server):
     c = Counter("test_counter", "A test counter", ("service", "tenant"))
     assert c.name == "test_counter"
     assert c.description == "A test counter"
-    assert c.tag_keys == ("service", "tenant")
+    assert set(["host", "pid"]).issubset(set(c.tag_keys))
+    assert set(["service", "tenant"]).issubset(set(c.tag_keys))
     assert c.type == "counter"
     c.record(1, {"service": "mars", "tenant": "test"})
     verify_metric("test_counter", 1.0)
@@ -66,7 +67,7 @@ def test_gauge(start_prometheus_http_server):
     g = Gauge("test_gauge", "A test gauge")
     assert g.name == "test_gauge"
     assert g.description == "A test gauge"
-    assert g.tag_keys == ()
+    assert set(["host", "pid"]).issubset(set(g.tag_keys))
     assert g.type == "gauge"
     g.record(0.1)
     verify_metric("test_gauge", 0.1)
@@ -78,7 +79,7 @@ def test_meter(start_prometheus_http_server):
     m = Meter("test_meter")
     assert m.name == "test_meter"
     assert m.description == ""
-    assert m.tag_keys == ()
+    assert set(["host", "pid"]).issubset(set(m.tag_keys))
     assert m.type == "meter"
     num = 3
     while num > 0:
@@ -92,7 +93,7 @@ def test_histogram(start_prometheus_http_server):
     h = Histogram("test_histogram")
     assert h.name == "test_histogram"
     assert h.description == ""
-    assert h.tag_keys == ()
+    assert set(["host", "pid"]).issubset(set(h.tag_keys))
     assert h.type == "histogram"
     num = 3
     while num > 0: