
# Introduction

The option **--prom** enables more or less coarse-grained data collection for
live monitoring of the running turnserver instances - by default one per
CPU thread or strand, or whatever is configured via the option **relay-threads**.
The label pair **tid="instanceNumber"** allows one to determine which
turnserver instance a metric is associated with.

Monitoring just at the turnserver instance level is pretty safe with respect to
the number of generated metrics and the load on the thread which generates and
delivers the related report, formatted as plain text in the
[Prometheus/OpenMetrics exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/),
via **http://your-turnserver.do.main/metrics**. Collecting the data itself
is pretty lightweight (just atomic add, sub, or set). The related metrics are
described in the next section.
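
As a rough illustration of how the `tid` label pair can be used, the following
Python sketch parses a small report in the exposition format and groups the
samples per turnserver instance. The sample report, its metric names and its
values are made up for illustration; only the `coturn_` prefix and the `tid`
and `peer` labels are taken from this document. The parser is deliberately
simple and assumes that label values contain neither `}` nor escaped quotes.

```python
import re
from collections import defaultdict

# Hypothetical sample report: names and values are invented for illustration.
SAMPLE_REPORT = """\
# TYPE coturn_bind_requests counter
coturn_bind_requests{tid="0"} 12
coturn_bind_requests{tid="1"} 7
coturn_rx_msgs{tid="0",peer="0"} 40
coturn_rx_msgs{tid="1",peer="1"} 5
"""

SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_report(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def group_by_tid(text):
    """Return {tid: {(name, other_labels): value}} for all samples with a tid."""
    per_instance = defaultdict(dict)
    for name, labels, value in parse_report(text):
        tid = labels.get("tid")
        if tid is None:
            continue  # e.g. process related metrics carry no tid label
        rest = tuple(sorted((k, v) for k, v in labels.items() if k != "tid"))
        per_instance[tid][(name, rest)] = value
    return dict(per_instance)


if __name__ == "__main__":
    for tid, samples in sorted(group_by_tid(SAMPLE_REPORT).items()):
        print(tid, samples)
```

In practice a Prometheus compatible agent does this parsing for you; the sketch
only shows how the label pairs identify the instance a sample belongs to.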

# Metrics

## version
The TURN server software name and version.

## bind\_requests
The number of valid STUN binding requests (RFC 5780) a turnserver instance
has received so far. If a binding request contains one or more unrecognized
attributes or other errors (see below), it gets counted via bind\_errors,
not here. If `no-stun-backward-compatibility` is set, old STUN binding requests
are ignored and thus not counted. If `no-stun` is set, all STUN binding
requests are ignored and thus not counted, and the metric is excluded from
reports.

## bind\_responses
The number of STUN binding responses to valid STUN binding requests (RFC 5780)
the turnserver instance has sent so far. A response obviously gets sent only if
the related request had no errors and a proper answer could be constructed. The
response gets counted as soon as it has been put into the server's "sent-queue",
i.e. the count does not take into account whether the delivery was successful.
Excluded from reports if `no-stun` is set.


## bind\_errors
The number of errors related to STUN binding that have occurred so far. The
label pair **code="number"** can be used to distinguish the errors. The error
codes used so far are:
  - 420
    - unknown attribute
    - Unknown attribute: TURN server was configured without RFC 5780 support
  - 400
    - wrong response format
    - Wrong request: applicable only to UDP protocol
    - Wrong request format: you cannot use PADDING and RESPONSE\_PORT together
    - Wrong request: padding applicable only to UDP and DTLS protocols

NOTE: If no binding errors have occurred so far, the metric is excluded from reports.

## rx\_msgs, rx\_bytes
Messages and bytes received from turn clients or peers. The label pair
**peer="value"** specifies whether the metric refers to
the turn client (`0`) or the peer (`1`).

## tx\_msgs, tx\_bytes
Messages and bytes sent to turn clients or peers. The label pair
**peer="value"** specifies whether the metric refers to
the turn client (`0`) or the peer (`1`).

## allocations
The number of created allocations (label `created="1"`) and the number of
active, i.e. not yet deleted, allocations (label `created="0"`). With respect to
the turnserver this is usually what matters, and since there is a 1:1 relation
to the owning session, sessions are not yet tracked in a separate metric. But
note that a session may exist without an allocation, because the allocation has
not been created yet.

## lifetime
The lifetime of current allocations in seconds. This metric gets automatically
enabled if monitoring at the session level is enabled (at turnserver level it
does not make any sense). Because per contract there is at most one allocation
between a client and the turnserver, and the allocation is always related to
the session between both, the related allocation can be identified by the
labels **tid="instanceNumber"** and **sid="sessionID"**. When an allocation
gets deleted (usually when the session gets closed), the lifetime is set to
`-1` to mark the end of the allocation/session.

## session\_state
The current state of a session. This metric gets automatically enabled if
monitoring at the session level is enabled (at turnserver level it does not
make any sense). Sessions can be distinguished by the labels
**tid="instanceNumber"** and **sid="sessionID"**.
The metric value denotes the following states:
  - 0 .. session is closed.
  - 1 .. allocation got deleted. Because this usually happens on a session
    shutdown, and the time between the deletion of an allocation and the
    closing of the session is very short, you probably will not see it very
    often.
  - 2 .. session shutdown initiated. Because the time between the shutdown init
    and the session end is very short, you probably will not see it very often.
  - 3 .. session is open, however no allocation request has been seen so far.
  - 4 .. an allocation got created, however no refresh for it has been seen
    so far. The **lifetime** metric can be used to check the allocation's
    lifetime in seconds.
  - 5 .. an allocation exists and has been refreshed by the client at least once.
    On each refresh the related **lifetime** metric gets an update.
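
When post-processing scraped samples, the state values listed above can be
mapped to readable names. The following Python sketch shows one way to do that.
The enum member names are hypothetical and chosen for this example only; they
are not identifiers used by coturn. Only the numeric values and their meaning
are taken from the list above.

```python
from enum import IntEnum


class SessionState(IntEnum):
    """Values of the session_state metric (member names are illustrative)."""
    CLOSED = 0
    ALLOCATION_DELETED = 1
    SHUTDOWN_INITIATED = 2
    OPEN_NO_ALLOCATION = 3
    ALLOCATION_CREATED = 4
    ALLOCATION_REFRESHED = 5


def describe(value):
    """Map a sample value (int or float) to a readable state name."""
    try:
        return SessionState(int(value)).name.lower()
    except ValueError:
        return "unknown"


def has_allocation(value):
    """True if the state implies a live allocation (states 4 and 5)."""
    return int(value) in (
        SessionState.ALLOCATION_CREATED,
        SessionState.ALLOCATION_REFRESHED,
    )


if __name__ == "__main__":
    for sample in (0.0, 3.0, 5.0, 9.0):
        print(sample, describe(sample), has_allocation(sample))
```

Sample values arrive as floats in the exposition format, which is why the
helpers convert to `int` before the lookup.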

## process\_\*
Currently 17 metrics about the coturn process itself. Most relevant are probably
the total number of threads (threads\_total), the physical RAM in use
(resident\_memory\_bytes), and the number of open file descriptors (open\_fds)
vs. the maximum number of open file descriptors (max\_fds). If the latter two
are too close, you may adjust the limit by adding e.g. `LimitNOFILE=64535` to
the `[Service]` section of the systemd service or by using `ulimit -S -n 64535`
in your coturn service startup script. Last but not least: if there are too many
major page faults (major\_pagefaults), there might be a memory related problem
on the box running the app.
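
What "too close" means depends on your environment. The following Python
sketch shows one possible check, assuming that the values of the open\_fds and
max\_fds metrics have already been extracted from a report. The 80 % threshold
is an arbitrary assumption for this example, not a coturn recommendation.

```python
def fd_usage_ratio(open_fds, max_fds):
    """Return the fraction of the file descriptor limit currently in use."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds


def fd_limit_too_close(open_fds, max_fds, threshold=0.8):
    """True if the usage reaches the (assumed) threshold of the limit."""
    return fd_usage_ratio(open_fds, max_fds) >= threshold


if __name__ == "__main__":
    # Made-up sample values for illustration.
    print(fd_limit_too_close(900, 1024))   # close to the limit
    print(fd_limit_too_close(100, 1024))   # plenty of headroom
```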

## scrape\_duration\_seconds
The metric labeled with `collector="default"` reports how long it took to collect
and format the turnserver related metrics (i.e. all except `process` and
`scrape_duration_seconds`). If this takes longer than 1 second, you probably
have either a "too many metrics" or "not enough CPU resources" problem.

The metric labeled `collector="process"` reports how long it took to collect
and format the coturn process related data exposed via the `process_*` metrics.

Finally, the metric labeled `collector="libprom"` tells you how long it took to
collect the reports from all other collectors (default and process) and to
create the final report exposed via HTTP.

## Summary
So monitoring at turnserver level (the default) produces at most
`1 + (1 + 1 + 2 + 2*2 + 2*2 + 2) * relay-threads + 17 + 3 = 14 * relay-threads + 21`
metrics.
For a box with 2x Xeon(R) CPU E5-2690 v4 (56 CPU threads) this means by default
at most `56 * 14 + 21 = 805` metrics.
This is not a big deal and easy to handle by the coturn application as
well as by metrics agents like VictoriaMetrics' vmagent or Prometheus,
time series databases like VictoriaMetrics' vmdb or Prometheus Mimir, and
visualizers like Grafana or Netdata. It should also be sufficient to monitor
the health of your coturn applications.
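
The arithmetic above can be reproduced with a few lines of Python. This is
just a sketch of the formula given in this section; the dictionary keys are
the metric names described above, and the function name is made up for this
example.

```python
# Metrics emitted once per turnserver instance (relay thread).
PER_INSTANCE = {
    "bind_requests": 1,
    "bind_responses": 1,
    "bind_errors": 2,          # one per error code (420, 400)
    "rx_msgs, rx_bytes": 2 * 2,  # two metrics, peer="0" and peer="1"
    "tx_msgs, tx_bytes": 2 * 2,
    "allocations": 2,          # created="1" and created="0"
}

# Metrics emitted once per report, independent of the number of threads.
FIXED = {
    "version": 1,
    "process_*": 17,
    "scrape_duration_seconds": 3,  # default, process, libprom
}


def max_metrics_instance_level(relay_threads):
    """Upper bound of metrics when monitoring at turnserver level only."""
    return sum(FIXED.values()) + sum(PER_INSTANCE.values()) * relay_threads


if __name__ == "__main__":
    print(max_metrics_instance_level(56))  # the 2x E5-2690 v4 example
```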


# Metrics at session level
Monitoring at session level may help:
  - in development to debug traffic issues
  - to understand your traffic patterns
  - to discover bottlenecks
  - to support better troubleshooting

Using the option **--prom-uniq-sid** or **--prom-sid** enables monitoring
at the session level. The difference between both options is that `--prom-sid`
recycles used session IDs by putting the session IDs of closed sessions back
into a pool and tries to reuse them for new sessions after a certain time has
passed (see option **--prom-sid-retain**).

If any of the sid options is enabled, the **session\_state** and **lifetime**
metrics get enabled automatically and all metrics get a **sid="sessionID"**
label pair. This means the max. number of metrics for `--prom-sid` is now:
`21 + (14 + 2) * relay-threads * sum(countPerPool(unique(sessionIDs)))/relay-threads`
or assuming that coturn distributes requests/sessions evenly over the running
relay-threads as rule of thumb:
`21 + 16 * relay-threads * averageSessionIdPoolSize` or
`21 + 16 * relay-threads * max(concurrentSessionsAtATime)`
and for `--prom-uniq-sid`:
`21 + 16 * relay-threads * all_sessions_handled_so_far_by_the_app`.
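
The rule-of-thumb formulas above share the same shape and differ only in the
last factor. A minimal Python sketch, with a function and parameter name made
up for this example:

```python
def max_metrics_session_level(relay_threads, session_factor):
    """Rule-of-thumb upper bound for session level monitoring.

    session_factor is, depending on the option in use:
      --prom-sid:      the average session ID pool size (or the maximum
                       number of concurrent sessions at a time)
      --prom-uniq-sid: the number of all sessions handled so far
    """
    return 21 + 16 * relay_threads * session_factor


if __name__ == "__main__":
    # 56 relay threads with a pool of 3 session IDs each.
    print(max_metrics_session_level(56, 3))
    # The same box with --prom-uniq-sid after 10000 sessions.
    print(max_metrics_session_level(56, 10_000))
```

The second call shows why the number grows without bound for
`--prom-uniq-sid`: the factor never shrinks while the application is running.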

So it is pretty obvious that these options possibly open the door for
**DoS attacks**! Therefore one should think at least twice and understand
the impact on produced metrics as described below before using them. Having
pretty good control over the clients (authentication, session frequency,
kindness of your clients) may help as well ;-).

If enabled, it is certainly a good idea to set up an alert manager which
monitors the scrape time or the size of the metrics reports and alerts you if a
scrape takes too long or a report gets too big, which indicates that there are
probably too many metrics and that you may get into trouble. In case you
monitor the coturn log, watch out for entries containing `grew rsid pool`:
`maxId` says how many unique session IDs the relay-thread with id `tid` has
seen so far. So the current number of metrics for this thread is about 16 times
that value.
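
A real setup would use the alerting facilities of your monitoring stack. Just
to illustrate the idea, here is a small Python sketch of such a check. The
1 second limit follows the hint given for `scrape_duration_seconds`; the size
limit of 1 MB and all names are arbitrary assumptions for this example.

```python
def report_warnings(report_text, scrape_seconds,
                    max_seconds=1.0, max_bytes=1_000_000):
    """Return a list of warnings for a scraped report (empty if all is fine)."""
    warnings = []
    size = len(report_text.encode("utf-8"))
    if scrape_seconds > max_seconds:
        warnings.append(
            f"scrape took {scrape_seconds:.2f}s (limit {max_seconds:.2f}s)"
        )
    if size > max_bytes:
        warnings.append(f"report has {size} bytes (limit {max_bytes})")
    return warnings


if __name__ == "__main__":
    # Made-up inputs: a tiny report scraped quickly, a huge one scraped slowly.
    print(report_warnings("coturn_x 1\n", 0.2))
    print(report_warnings("x" * 2_000_000, 1.5))
```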

BTW: Usually a separate single thread per request generates and delivers the
metric report (using libmicrohttpd), not the turnserver threads themselves.
However, it needs some CPU cycles to get the work done and thus may have an
impact on other tasks running on the box if CPU resources are not sufficient.

Anyway, as you can deduce from the calculations above, one should definitely
avoid using the `--prom-uniq-sid` option unless you really know what you are
doing. If you really need it, enable it only for the time needed and make sure
that scraping agents use a separate time series DB or storage place, where the
metrics data can be easily discarded when the work is done and does not really
have an impact on your other services in production.

## Tuning
As already said, to mitigate a possible "too many metrics" problem, the option
**--prom-sid** got introduced, which basically uses a dynamic pool of session
IDs per turnserver instance, where the IDs of closed sessions get put back and
reused by new sessions. So if you have on average only 100 sessions or
allocations running concurrently, it would result in at least
`56 * 16 * ceil(100/40) + 21 = 2709` metrics.
This is still easy to handle by coturn and not really a problem for report
collectors, time series DBs and visualizers.
**BUT** if someone decides to challenge your server,
the numbers may change dramatically. So there might be a desire to adjust the
time a session ID has to stay in the ID pool before it can be reused. The
option **--prom-sid-retain=num** allows you to adjust this parameter, measured
in seconds. For now it defaults to 60 - a more or less conservative value,
depending on the environment. To get a feeling for how it works out for you,
the following test may reveal data for more or less proper settings:

## Testing
- setup a coturn test instance
- ensure that `lt-cred-mech` is enabled and no other mechanism
- ensure that a realm is configured, e.g. `realm = test.do.main`
- set `relay-threads = 5`
- add a user to the coturn db you want to use for your tests, e.g. if you have
  `userdb=/data/coturn/db` set, use:
  `turnadmin -a -b /data/coturn/db -u foo -p bar -r test.do.main`
- make sure your firewall allows the related traffic on the server as well
  as in the client zone/container.
- run the test on the client machine, e.g.:
  `time turnutils_uclient -p 3478 -DgXv -y -c -u foo -w bar -m 40 server_hostname`.
  If you have changed the `listening-port`, adjust the `-p` option accordingly.
  Only simulate as many clients as CPU threads/strands are available on the
  client machine (option `-m`).
- when finished:
    - have a look at the log file (defaults to /var/log/coturn.log)
    - get the metrics report, e.g. `curl -o /tmp/report.om http://localhost:9641/metrics`
    - to find out the number of metrics and size of the report, one may run:
      `grep ^coturn_ /tmp/report.om |wc -lc`. The first number in the output
      is the number of lines and thus number of metrics in the report, the
      2nd number is the size of the report without any `# TYPE` or `# HELP`
      comments.
    - to find out, how many bytes you can drop from transmission of each report
      using the option **--prom-compact**, one may run:
      `grep -v ^coturn_ /tmp/report.om |wc -lc`.

- you may also try the same test using an invalid password, or run a test
  peer e.g. on machineC (`turnutils_peer -v -p 16384`) and add the options
  `-e machineC -r 16384` to the `turnutils_uclient` command.
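
If you prefer a script over the `grep | wc` pipelines, the two counts can be
obtained in one pass. The following Python sketch mirrors the two commands
from the list above; the sample report is made up for illustration, and the
function names are hypothetical.

```python
def split_report(text, prefix="coturn_"):
    """Split a report into metric lines and all other lines (comments etc.)."""
    metric_lines, other_lines = [], []
    for line in text.splitlines(keepends=True):
        (metric_lines if line.startswith(prefix) else other_lines).append(line)
    return metric_lines, other_lines


def count_lines_and_bytes(lines):
    """Equivalent of `wc -lc` for newline terminated lines."""
    return len(lines), sum(len(line.encode("utf-8")) for line in lines)


# Hypothetical sample report for illustration only.
SAMPLE = (
    "# HELP coturn_bind_requests sample help text\n"
    "# TYPE coturn_bind_requests counter\n"
    'coturn_bind_requests{tid="0"} 12\n'
    'coturn_bind_requests{tid="1"} 7\n'
)

if __name__ == "__main__":
    metrics, comments = split_report(SAMPLE)
    print("metrics :", count_lines_and_bytes(metrics))
    print("comments:", count_lines_and_bytes(comments))
```

To analyze a saved report instead of the sample, read the file first, e.g.
`split_report(open("/tmp/report.om").read())`.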


# Misc options

## **--prom-compact**
Should be used on all production systems to avoid the inclusion of the `HELP`
and `TYPE` comments for each metric in each report.
Usually it is sufficient to get and record them once, when a demo is required
or Grafana charts get set up. There is no need to send them again and again in
each report - this would just waste bandwidth and demand more resources to
parse and process the report.

## **--prom-realm**
Label the `session_state` metric with the session realm. If one uses different
realms for user authentication, it could make sense to add a realm label, so one
may select the time series of interest by realm. If you only use one and the
same realm, it is usually just overhead and should be skipped. Because it is
rarely used, it is not taken into account in the calculations of the number of
metrics above.

## **--prom-usernames**
Label the `session_state` metric with client usernames. One should not do this
unless really needed: modern apps often generate usernames on the fly (e.g.
hash(currentTimeAsString + `:` + username)) and thus all the names are
different or even unique for a long time. This in turn means that a new state
metric gets created for each new session. This stresses the server resources
the same way as `--prom-uniq-sid` does, because a sample for each metric gets
kept in memory and needs to be formatted and transferred on each /metrics
request. When the report gets collected and stored in a time series DB, this
would in turn create a new time series for each new session
(metricName + label:value pairs form the key for a time series) and would let
your DB "explode": it would probably get into trouble performance- and
resource-wise, because it has to dig through such a huge set of time series
(high cardinality) and keep replacing "active/cached" ones (high churn rate).
So do NOT use it unless you know what you are doing!
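
The effect on cardinality can be illustrated with a small simulation. The
following Python sketch counts distinct time series keys (metric name plus
label pairs) for a number of sessions that share a small pool of session IDs,
with and without a per-session unique username label. The label name
`username` and the generated user names are assumptions for this example; the
label actually emitted by coturn may be named differently.

```python
def series_key(name, labels):
    """A time series is identified by its name and its label:value pairs."""
    return (name, tuple(sorted(labels.items())))


def count_series(samples):
    """Count distinct time series in a list of (name, labels) samples."""
    return len({series_key(name, labels) for name, labels in samples})


def simulate(sessions, pool_size, with_usernames):
    """Simulate session_state samples for sessions reusing pooled IDs."""
    samples = []
    for i in range(sessions):
        labels = {"tid": "0", "sid": str(i % pool_size)}
        if with_usernames:
            # Usernames generated on the fly: each session has its own.
            labels["username"] = f"user-{i}"
        samples.append(("session_state", labels))
    return count_series(samples)


if __name__ == "__main__":
    print(simulate(1000, 10, with_usernames=False))
    print(simulate(1000, 10, with_usernames=True))
```

Without the username label the number of series is bounded by the size of the
session ID pool; with it, every new session creates a new series, no matter
how small the pool is.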


# More Information
More information related to labels and cardinality can be found e.g. via:
- https://docs.victoriametrics.com/keyConcepts.html
- https://docs.victoriametrics.com/FAQ.html#what-is-high-cardinality
- https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate