Showing admins infos on a specific admin port #23

Closed
wants to merge 18 commits into from

@athoune athoune commented Jan 4, 2021

A private HTTP server for debugging, performance analysis, and monitoring.

It listens on localhost and should not be reachable from the Internet.

It exposes:

@athoune athoune mentioned this pull request Jan 5, 2021
@jech
Owner

jech commented Jan 5, 2021

I'm not sure what this pull request does. Why should connections from localhost be trusted? Why is binding to localhost better than HTTP basic auth over TLS? And why do you want to enable profiling in production?

@athoune
Author

athoune commented Jan 5, 2021

On a server with restricted access, an admin site on localhost is not such a bad idea. OK, I'll add a password.

How can I do load testing with WebRTC without real users and real flaky Internet connections? OK, let's add a flag.

I'm focusing on stats and metrics. I can't host a service without metrics.

  • admin password
  • pprof flag

@athoune
Author

athoune commented Jan 11, 2021

The POC is done.

Do you think the proposed features are useful, and is this PR welcome?

Which metrics should be exposed in a first iteration?

@jech
Owner

jech commented Jan 11, 2021

I'm sorry, but I still don't understand what problems this PR is aimed at solving.

@athoune
Author

athoune commented Jan 12, 2021

Managing a Galene server

For hosting an Internet service, you need observability: for incident analysis and correction, for performance tuning, and for capacity planning.

Logging

Logging exposes timestamped events and errors.
Logging is mandatory, but you need to expose states and metrics, too.

Application states

Rooms, with their clients and channels, are specific to Galene. They are very useful for understanding the context (connection performance, codec used…) and the real usage of the service.

Metrics

Metrics are emitted for time series databases. They can be used for graphing (with the almighty Grafana), for monitoring, and for raising alerts. Prometheus is one time series database among others, but its export format is the de facto standard, readable by most time series databases.

A Prometheus export is an HTTP endpoint that displays metrics with labels and values. There are two formats: a plain text format with lots of comments, and a compact binary format. The Go client library handles the different kinds of metrics, the HTTP handler, and the two export formats.
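
To make the plain text format concrete, here is a minimal sketch that renders it with the standard library only. It is not what the PR does (the PR relies on the promhttp handler); the `metric` type, `render`, and `metricsHandler` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// metric is a hypothetical, simplified stand-in for a Prometheus metric
// without labels.
type metric struct {
	name  string  // e.g. "galene_clients"
	help  string  // human-readable description
	kind  string  // "counter" or "gauge"
	value float64 // current value
}

// render writes each metric in the text exposition format:
// a "# HELP" comment, a "# TYPE" comment, then "name value".
func render(ms []metric) string {
	var b strings.Builder
	for _, m := range ms {
		fmt.Fprintf(&b, "# HELP %s %s\n", m.name, m.help)
		fmt.Fprintf(&b, "# TYPE %s %s\n", m.name, m.kind)
		fmt.Fprintf(&b, "%s %g\n", m.name, m.value)
	}
	return b.String()
}

// metricsHandler serves a fresh snapshot on every scrape.
// snapshot is an assumed callback supplied by the application.
func metricsHandler(snapshot func() []metric) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
		fmt.Fprint(w, render(snapshot()))
	})
}

func main() {
	fmt.Print(render([]metric{
		{name: "galene_clients", help: "The number of connected clients", kind: "gauge", value: 2},
	}))
}
```

In a real deployment the client library is the better choice, since it also handles labels, escaping, and content negotiation between the two formats.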

PProf

pprof is the Go tool for understanding runtime behavior.
It can be used in production, under real load.

Security

Metrics, states, and logs are for private eyes only. You don't know how the endpoint scales, or whether this information can be used for evil purposes.

Using a port distinct from the main application port, and listening on localhost by default, is a first level of protection. Authenticating access with a password (using the galene tool to hash the password) is a second level. Making the admin endpoint optional (you don't have to run it at all) is a third level.
For now, the admin endpoint is read-only: you can look, but you can't change settings or state.
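
The layers described above can be sketched with the standard library. This is an illustration under stated assumptions, not the PR's code: `basicAuth`, `newAdminServer`, `statusFor`, the `/stats` path, and port 8444 are hypothetical, and the credential check compares SHA-256 digests in constant time as a stand-in for Galene's hashed-password format.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// basicAuth rejects requests that lack the expected credentials.
// Digests are compared in constant time to avoid leaking their contents.
func basicAuth(user, pass string, next http.Handler) http.Handler {
	wantUser := sha256.Sum256([]byte(user))
	wantPass := sha256.Sum256([]byte(pass))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		gotUser := sha256.Sum256([]byte(u))
		gotPass := sha256.Sum256([]byte(p))
		userOK := subtle.ConstantTimeCompare(gotUser[:], wantUser[:]) == 1
		passOK := subtle.ConstantTimeCompare(gotPass[:], wantPass[:]) == 1
		if !ok || !userOK || !passOK {
			w.Header().Set("WWW-Authenticate", `Basic realm="admin"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newAdminServer builds a read-only admin server. Passing a loopback
// address such as "127.0.0.1:8444" keeps it off public interfaces.
func newAdminServer(addr, user, pass string) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/stats", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet && r.Method != http.MethodHead {
			http.Error(w, "read-only endpoint", http.StatusMethodNotAllowed)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	return &http.Server{Addr: addr, Handler: basicAuth(user, pass, mux)}
}

// statusFor exercises the handler in memory, without opening a socket.
// An empty user means "send no credentials".
func statusFor(h http.Handler, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, "/stats", nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	srv := newAdminServer("127.0.0.1:8444", "admin", "secret")
	fmt.Println(statusFor(srv.Handler, "", ""))
	fmt.Println(statusFor(srv.Handler, "admin", "secret"))
	// Calling srv.ListenAndServe() would start serving on the loopback address.
}
```

Binding to the loopback address is what keeps the endpoint private by default; the password only matters once someone can already reach that port.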

This Pull Request

Logging is out of the scope of this PR.

admin.GroupsHAndler and admin.OneGroupHandler expose stats.GroupStats.

The pprof endpoints are exposed with the standard handlers from net/http/pprof.
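
As a rough illustration of exposing the standard pprof handlers only when asked to, here is a sketch that mounts them on a private mux behind a flag. The flag name `-pprof` and the helpers `newDebugMux` and `indexStatus` are assumptions made for this example, not names taken from the PR.

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves the pprof handlers only when
// enablePprof is true. A private mux is used rather than relying on
// registration with http.DefaultServeMux.
func newDebugMux(enablePprof bool) *http.ServeMux {
	mux := http.NewServeMux()
	if enablePprof {
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	}
	return mux
}

// indexStatus requests the pprof index page in memory and returns the
// HTTP status code.
func indexStatus(h http.Handler) int {
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// Hypothetical flag name; profiling stays off unless requested.
	enable := flag.Bool("pprof", false, "expose pprof handlers on the admin mux")
	flag.Parse()
	fmt.Println(indexStatus(newDebugMux(*enable)))
}
```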

The Prometheus endpoint comes from github.com/prometheus/client_golang/prometheus/promhttp, and the code is instrumented to populate the metrics.

Authentication is done with a handmade admin.BasicAuthMiddleware middleware, using the standard Galene JSON format for storing PBKDF2-hashed passwords.

Settings are handled with the standard flag package.

@jech
Owner

jech commented Jan 12, 2021 via email

@athoune
Author

athoune commented Jan 13, 2021

OK, I'm focusing on the Prometheus export.

@athoune
Author

athoune commented Jan 13, 2021

The pprof HTTP endpoint is removed.

@athoune
Author

athoune commented Jan 13, 2021

I suggest some counters:

  • the main model uses Group/Client/Track: gauges
  • WebRTC uses different codecs: count tracks grouped by codec
  • open websockets
  • cache usage
  • RTP usage (and congestion)

@athoune
Author

athoune commented Jan 13, 2021

For now, I have this:

# HELP galene_cache_get Galene cache successful get call
# TYPE galene_cache_get counter
galene_cache_get 607
# HELP galene_cache_get_size Galene cache successful get size
# TYPE galene_cache_get_size counter
galene_cache_get_size 629990
# HELP galene_cache_store Galene cache store call
# TYPE galene_cache_store counter
galene_cache_store 54512
# HELP galene_cache_store_size Galene cache store size
# TYPE galene_cache_store_size counter
galene_cache_store_size 3.7907401e+07
# HELP galene_clients The number of connected clients
# TYPE galene_clients gauge
galene_clients 2
# HELP galene_groups Number of groups
# TYPE galene_groups gauge
galene_groups 8
# HELP galene_rtp_dead_write galene rtp  dead writer
# TYPE galene_rtp_dead_write counter
galene_rtp_dead_write 0
# HELP galene_rtp_read galene rtp read packets
# TYPE galene_rtp_read counter
galene_rtp_read 3.7907401e+07
# HELP galene_rtp_track galene rtp track
# TYPE galene_rtp_track counter
galene_rtp_track{codec="audio/opus",label="audio"} 3
galene_rtp_track{codec="video/VP8",label="video"} 3
# HELP galene_rtp_write galene rtp write packets
# TYPE galene_rtp_write counter
galene_rtp_write 1.8701508e+07
# HELP galene_websockets The number of opened websockets
# TYPE galene_websockets gauge
galene_websockets 2

@jech
Owner

jech commented Jan 14, 2021

I have to say that I still don't understand what kind of insights you expect to get from these statistics. Also, I'm not convinced by the approach of maintaining a running count of items: it's error-prone and difficult to maintain. Wouldn't it be simpler to count items of interest on demand, as is done in stats.go?

Please be aware that every call to Inc involves two atomic operations, which will cause cache-line bouncing and will prevent scaling beyond 4 cores or so.
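
The on-demand alternative mentioned above can be sketched as follows. This is not the code in stats.go; `registry`, `group`, and their methods are hypothetical stand-ins for Galene's data model. The point is that joins and leaves touch no global counter, and the totals are computed only when someone asks for them.

```go
package main

import (
	"fmt"
	"sync"
)

// group is a hypothetical stand-in for a room and its clients.
type group struct {
	mu      sync.Mutex
	clients map[string]struct{}
}

// registry is a hypothetical stand-in for the set of active groups.
type registry struct {
	mu     sync.Mutex
	groups map[string]*group
}

func newRegistry() *registry {
	return &registry{groups: make(map[string]*group)}
}

// join adds a client to a group, creating the group if needed.
// No shared counter is updated on this path.
func (r *registry) join(groupName, clientID string) {
	r.mu.Lock()
	g, ok := r.groups[groupName]
	if !ok {
		g = &group{clients: make(map[string]struct{})}
		r.groups[groupName] = g
	}
	r.mu.Unlock()

	g.mu.Lock()
	g.clients[clientID] = struct{}{}
	g.mu.Unlock()
}

// leave removes a client from a group. Empty groups are kept.
func (r *registry) leave(groupName, clientID string) {
	r.mu.Lock()
	g := r.groups[groupName]
	r.mu.Unlock()
	if g == nil {
		return
	}
	g.mu.Lock()
	delete(g.clients, clientID)
	g.mu.Unlock()
}

// counts walks the structures only when called, for example on a
// metrics scrape, so the totals cannot drift from the real state.
func (r *registry) counts() (groups, clients int) {
	r.mu.Lock()
	gs := make([]*group, 0, len(r.groups))
	for _, g := range r.groups {
		gs = append(gs, g)
	}
	r.mu.Unlock()

	for _, g := range gs {
		g.mu.Lock()
		clients += len(g.clients)
		g.mu.Unlock()
	}
	return len(gs), clients
}

func main() {
	r := newRegistry()
	r.join("a", "1")
	r.join("a", "2")
	r.join("b", "3")
	r.leave("a", "1")
	fmt.Println(r.counts())
}
```

The trade-off is that each scrape costs a walk over all groups, which is usually acceptable at scrape intervals of several seconds.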

@athoune
Author

athoune commented Jan 16, 2021

These metrics are suggestions. The high-level ones are mandatory (number of websockets, open rooms, open channels, dropped packets, and everything linked to quality); the low-level ones, like cache usage, are clearly optional.

The stats API is useful when you are part of the discussion, inside the chat room.

Metrics, from the outside, are needed for monitoring the service. The quality is too low: is the network involved, or is the CPU saturated? What happens when users switch from H.264 to AV1? How many users and how many rooms can I handle with this server? What server is needed to handle a massive conference? Is there just one user flooding my service with a 4K video upstream? What can I do when the server is saturated, just restart the service? How do I know the quality of service, and whether users are happy?

@jech
Owner

jech commented Jan 16, 2021

I'm sorry, I still don't understand. In what way are any of the metrics that you implemented related to service quality?

Here are a few examples of useful metrics:

  • the number of packets dropped in the up direction;
    • the fraction that was nacked early;
    • the fraction that was nacked late;
    • the fraction that was never nacked;
  • the number of packets nacked by receivers;
    • the fraction of those that could be satisfied locally;
    • the fraction of those that resulted in a late nack;
  • the number of times that a writer thread was late;
    • the fraction that resulted in queueing;
    • the fraction that resulted in a packet drop.

I could be mistaken, but I don't see any code in your submission that relates to any of the above.
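
One way to picture the first item of the list above is a small set of counters from which the fractions are derived at read time. The sketch below is purely illustrative: the `upstreamDrops` type and its fields are hypothetical and do not correspond to anything in Galene or in the PR.

```go
package main

import "fmt"

// upstreamDrops is a hypothetical breakdown of packets dropped in the
// up direction, by how each loss was handled.
type upstreamDrops struct {
	nackedEarly uint64 // loss was nacked early
	nackedLate  uint64 // loss was nacked late
	neverNacked uint64 // loss was never nacked
}

// total returns the number of dropped packets across all categories.
func (d upstreamDrops) total() uint64 {
	return d.nackedEarly + d.nackedLate + d.neverNacked
}

// fractions returns the share of each category, or zeros when no
// packet was dropped (avoiding a division by zero).
func (d upstreamDrops) fractions() (early, late, never float64) {
	t := d.total()
	if t == 0 {
		return 0, 0, 0
	}
	return float64(d.nackedEarly) / float64(t),
		float64(d.nackedLate) / float64(t),
		float64(d.neverNacked) / float64(t)
}

func main() {
	d := upstreamDrops{nackedEarly: 2, nackedLate: 1, neverNacked: 1}
	early, late, never := d.fractions()
	fmt.Println(d.total(), early, late, never)
}
```

Exporting the raw counts and computing the ratios on the monitoring side is an equally valid design; the sketch only shows how the categories relate to each other.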

@athoune
Author

athoune commented Jan 20, 2021

There are different kinds of metrics.

Quality

I'm OK with implementing your suggestions.

Usage and capacity planning

CPU and memory are monitored from outside, through cgroups.

Network usage can be monitored.

Some basic usage numbers, like users/rooms/channels, help to estimate how much CPU and RAM I need for more users (aka capacity planning).

What do other SFUs do?

Galene metrics

Which additional metrics seem useful for Galene?
