Showing admins infos on a specific admin port #23

athoune · 2021-01-04T19:13:24Z

A private HTTP server for debugging, performance analysis, or monitoring.

It listens on localhost and should not be exposed to the Internet.

It exposes:

stats/groups as JSON
pprof
Prometheus endpoint for standard monitoring

jech · 2021-01-05T14:26:20Z

I'm not sure what this pull request does. Why should connections from localhost be trusted? Why is binding to localhost better than HTTP basic over TLS? And why do you want to enable profiling in production?

athoune · 2021-01-05T20:32:03Z

On a server with restricted access, an admin website on localhost is not so ugly. OK, I'll add a password.

How can I do load testing with WebRTC without real users and a real, flaky Internet connection? OK, let's add a flag.

I'm focusing on stats and metrics. I can't host a service without metrics.

admin password
pprof flag

athoune · 2021-01-11T13:32:16Z

The POC is done.

Do you think the proposed features are useful and that this PR is welcome?

Which metrics should be exposed in a first iteration?

jech · 2021-01-11T13:35:41Z

I'm sorry, but I still don't understand what problems this PR is aimed at solving.

athoune · 2021-01-12T14:00:11Z

Managing a Galene server

To host an Internet service, you need observability: for incident analysis and correction, for performance tuning, and for capacity planning.

Logging

Logging exposes events (with timestamps) and errors.
Logging is mandatory, but you need to expose state and metrics, too.

Application states

Rooms, with their clients and channels, are specific to Galene, and are very useful for understanding the context (connection performance, codec used…) and the real usage of the service.

Metrics

Metrics are emitted for time series databases. They can be used for graphing (with the almighty Grafana), for monitoring, and for raising alerts. Prometheus is one time series database among others, but its export format is the de facto standard, readable by all time series databases.

A Prometheus export is an HTTP endpoint that displays metrics with tags and values. There are two formats: a plain-text one with lots of comments and a compact binary one. The Go library handles the different kinds of counters, the HTTP handler, and the two export formats.

PProf

pprof is the Go tool for understanding runtime behavior.
It can be used in production under real load.

Security

Metrics, state and logs are for private eyes only. You don't know how this information scales or whether it can be used for evil purposes.

Using a port distinct from the main application port, and listening on localhost by default, is a first level of protection. Authenticating access with a password (using the galene tool for hashing passwords) is a second level. Making the admin endpoint optional (you don't always have to run it) is a third level.
For now, the admin endpoint is read-only: you can look, but you can't change settings or state.

This Pull Request

Logging is out of scope for this PR.

admin.GroupsHandler and admin.OneGroupHandler expose stats.GroupStats.

pprof is exposed through its standard handlers from net/http/pprof.

The Prometheus endpoint comes from github.com/prometheus/client_golang/prometheus/promhttp, and the code is instrumented to populate metrics.

Authentication is done with a handmade admin.BasicAuthMiddleware middleware, using the standard Galene JSON format for storing PBKDF2-hashed passwords.

Settings are handled with the standard flag package.

jech · 2021-01-12T17:55:11Z

I have read your pull request, and I understand what it does. I don't understand why it is useful.

* We all agree that better logging is a good idea. Help with improving logging is welcome, but there's nothing about logging in your pull request.
* Group stats: this needs to be carefully balanced with the users' right to privacy. There are good reasons why Galène doesn't export connected usernames or IP addresses.
* I understand what Prometheus is. I do not understand what particular data your patch exports in Prometheus format.
* Galène is already exposing pprof data (through a set of command-line options). We already have a good understanding of where the CPU time goes, we have a good understanding of where memory allocation happens. Please explain why you believe that additionally exposing CPU time statistics over HTTP will help improving Galène.

I am sorry, Mathieu, but I will not apply your patches unless you can explain what is their purpose. Please do not paraphrase the code, but explain *why* this is necessary.

athoune · 2021-01-13T08:44:30Z

OK, I'm focusing on Prometheus exports.

athoune · 2021-01-13T09:32:15Z

The pprof HTTP endpoint is removed.

athoune · 2021-01-13T10:17:02Z

I suggest some counters:

the main model uses Group/Client/Track: gauges
WebRTC uses different codecs: count tracks grouped by codec
open websockets
cache usage
RTP usage (and congestion)

athoune · 2021-01-13T11:03:13Z

For now, I have this:

# HELP galene_cache_get Galene cache successful get call
# TYPE galene_cache_get counter
galene_cache_get 607
# HELP galene_cache_get_size Galene cache successful get size
# TYPE galene_cache_get_size counter
galene_cache_get_size 629990
# HELP galene_cache_store Galene cache store call
# TYPE galene_cache_store counter
galene_cache_store 54512
# HELP galene_cache_store_size Galene cache store size
# TYPE galene_cache_store_size counter
galene_cache_store_size 3.7907401e+07
# HELP galene_clients The number of connected clients
# TYPE galene_clients gauge
galene_clients 2
# HELP galene_groups Number of groups
# TYPE galene_groups gauge
galene_groups 8
# HELP galene_rtp_dead_write galene rtp  dead writer
# TYPE galene_rtp_dead_write counter
galene_rtp_dead_write 0
# HELP galene_rtp_read galene rtp read packets
# TYPE galene_rtp_read counter
galene_rtp_read 3.7907401e+07
# HELP galene_rtp_track galene rtp track
# TYPE galene_rtp_track counter
galene_rtp_track{codec="audio/opus",label="audio"} 3
galene_rtp_track{codec="video/VP8",label="video"} 3
# HELP galene_rtp_write galene rtp write packets
# TYPE galene_rtp_write counter
galene_rtp_write 1.8701508e+07
# HELP galene_websockets The number of opened websockets
# TYPE galene_websockets gauge
galene_websockets 2

jech · 2021-01-14T01:20:20Z

I have to say that I still don't understand what kind of insights you expect to get from this kind of statistics. Also, I'm not convinced by the approach of maintaining a running count of items; it's error-prone and difficult to maintain. Wouldn't it be simpler to count items of interest on demand, as is done in stats.go?

Please be aware that every call to Inc involves two atomic operations, which will cause cache-line bouncing and will prevent scaling beyond 4 cores or so.

athoune · 2021-01-16T11:51:03Z

These metrics are suggestions. The high-level ones are mandatory (number of websockets, open rooms, open channels, dropped packets, and everything linked to quality); the low-level ones, like cache usage, are clearly optional.

The stats API is useful when you are part of the discussion, inside the chat room.

Metrics, from the outside, are needed for monitoring the service. The quality is too low: is the network involved, or is the CPU saturated? What happens when users stop using H.264 for AV1? How many users and how many rooms can I handle with this server? What server is needed for handling a massive conference? Is there just one user flooding my service with a 4K video upstream? What can I do when the server is saturated, just restart the service? How do I know the quality of service and whether users are happy?


jech · 2021-01-16T14:38:07Z

I'm sorry, I still don't understand. In what way are any of the metrics that you implemented related to service quality?

Here are a few examples of useful metrics:

the number of packets dropped in the up direction;
- the fraction that was nacked early;
- the fraction that was nacked late;
- the fraction that was never nacked;
the number of packets nacked by receivers;
- the fraction of those that could be satisfied locally;
- the fraction of those that resulted in a late nack;
the number of times that a writer thread was late;
- the fraction that resulted in queueing;
- the fraction that resulted in a packet drop.

I could be mistaken, but I don't see any code in your submission that relates to any of the above.

athoune · 2021-01-20T20:14:43Z

There are different kinds of metrics.

Quality

I'm OK with implementing your suggestions.

Usage and capacity planning

CPU and memory are monitored from outside, through cgroups.

Network usage can be monitored.

Some basic usage numbers, like users/rooms/channels, help to guess how much CPU/RAM I need for more users (aka capacity planning).

What do other SFUs do?

BBB : https://bigbluebutton-exporter.greenstatic.dev/exporter-user-guide/#metrics
Jitsi : https://github.com/jitsi/jitsi-videobridge/blob/master/doc/statistics.md
Janus has logs and pcap dumps, but no real metrics

Galene metrics

Which additional metrics seem useful for Galene?

athoune mentioned this pull request Jan 5, 2021

Admin tools #24

Closed

Mathieu Lecarme added 17 commits January 16, 2021 12:53

Groups stats.

8258e46

pprof.

47465eb

JSON is snakecase.

3f51bc5

Group API.

cce7fa4

Prometheus metrics, first metric

c6a9782

Admin has authorization

2342c86

Fix: close config file.

a667551

Expose pprof endpoints.

c42a50e

Cache password. Test basic auth.

3b0cdc6

Fix: oups, missing tests.

f0540a2

Add counters.

f60347f

Count RTP.

113115f

Count cache.

528a9a5

Register all the things.

2fe7e47

Unplugging pprof http endpoint.

376080f

Count tracks.

2df9990

Revert "Register all the things."

bcbc3c1

This reverts commit de55c28.

Fix: promauto doesn't need to register.

b7995dc

jech force-pushed the master branch from 1d59570 to aaaaae5 Compare February 14, 2021 19:14

jech closed this Feb 18, 2021

jech mentioned this pull request Mar 23, 2021

Monitoring (Prometheus or JSON) #62

Closed
