Metric: Add total http time #398
It would make sense to use histograms: haproxy_backend_http_request_time_second_sum, with buckets being configurable globally:
@roidelapluie Will you please add HTTP methods for
No, please no. You seem to have no idea how much memory and CPU the current metrics already take; adding more is even worse. A long time ago I intended to add some histograms and more buckets in general in the stats. Nowadays we see huge configs and people want even more performance, so there is simply ZERO excuse for asking a load balancer to compute the myriad of stats each user wants while all of these can be trivially extracted from the logs using existing standard tools. Stats were made for monitoring purposes. I think adding native support for Prometheus was a very bad idea because it resulted in fooling people into thinking it's up to the low-level process itself to produce and keep such stats. I'm now convinced it was a huge mistake and that we instead ought to have created a sidecar process to produce such specific stats based on a copy of the logs. With such an architecture, it would become easy to have such an agent run on a different CPU socket or even a different machine, and use all the resources it needs without destroying performance or adding latency.
To collect data from logs and expose it to Prometheus, you can use mtail or fluentd. Keep in mind that the Prometheus exporter is stateless: it only exposes existing internal stats. So each addition comes with a cost, for everyone, not only for Prometheus users. As far as possible, try to use existing info. You can already add a lot of information to the logs on each request. With the right tool, this info can be easily extracted and collected.
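To illustrate the log-based approach suggested above, here is a minimal sketch of a sidecar-style aggregator in Python. It assumes HAProxy's default HTTP log format, where the timers appear as a slash-separated field ("TR/Tw/Tc/Tr/Ta", in milliseconds, -1 on abort) followed by the status code; the sample line in the test is hypothetical. Like halog, it skips aborted requests and 504s when accumulating response time:

```python
import re

# Matches the HTTP-log timers field "TR/Tw/Tc/Tr/Ta" followed by the status
# code, e.g. " 10/0/30/69/109 200 ". Group 4 is Tr (server response time).
TIMERS_RE = re.compile(r"\s(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+)\s(\d{3})\s")

def accumulate(lines):
    """Return (total_response_time_ms, request_count) over valid responses."""
    total_ms = 0
    count = 0
    for line in lines:
        m = TIMERS_RE.search(line)
        if not m:
            continue
        tr = int(m.group(4))        # Tr: server response time in ms
        status = int(m.group(6))
        if tr < 0 or status == 504:  # skip aborts and gateway timeouts,
            continue                 # as halog does for its averages
        total_ms += tr
        count += 1
    return total_ms, count
```

A real mtail or fluentd pipeline does essentially this, then exposes the two running totals as Prometheus counters.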
Well, we do get that data from logs. The point of this message is that the current metrics (response time of the last X connections) are not helpful and could be replaced by a total response time.
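The mathematical argument for a monotonic total-time counter can be sketched quickly: given two scrapes of a time-total and a request-count counter, the true average over exactly that window falls out by division (the PromQL idiom `rate(time_total) / rate(requests_total)`), with no dependence on how the last 1024 connections were distributed in time. The metric names and numbers below are hypothetical:

```python
# Two snapshots of monotonically increasing counters, e.g. 60s apart.
t0 = {"http_response_time_seconds_total": 1234.0, "http_requests_total": 10_000}
t1 = {"http_response_time_seconds_total": 1294.0, "http_requests_total": 10_400}

def avg_response_time(before, after):
    """Average response time over the window between two counter snapshots,
    i.e. the PromQL idiom rate(time_total) / rate(requests_total)."""
    dt_time = (after["http_response_time_seconds_total"]
               - before["http_response_time_seconds_total"])
    dt_reqs = after["http_requests_total"] - before["http_requests_total"]
    return dt_time / dt_reqs if dt_reqs else 0.0

print(avg_response_time(t0, t1))  # 0.15 s over exactly this window
```

The same two counters answer any window the monitoring system cares about, which a sliding 1024-connection average cannot do.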
Probably. Maybe we should simply remove a number of useless metrics that
have been accumulating over time and been used as partial indicators
instead of direct ones, and see if some of them could be replaced with
better ones.
I mean, haproxy has adopted stats extremely early because it was critical
to fix application and infrastructure issues where it was deployed. Many
of them are still HTTP-1.0 centric. There's still some ambiguity between
request and connection sometimes. For example the ctime metric is the average
connect time per *request*. So when you have server-side keep-alive with
connection reuse, it gets smoothed. At the very least it should be per
connection and the connection-to-request ratio should be reported. These
are just examples of course.
Similarly, halog has learned very early to compute the average response time
on valid responses only, because 504 and friends were significantly degrading
the mean value. We probably ought to do something similar natively with the
stats metrics.
Reorganizing metrics between those indicating user experience and those
indicating infrastructure stability would be nice as well. The backend
connect time these days has little effect on user experience thanks to
connection reuse, caching etc but is a huge indicator of network health.
Conversely the application's response time is an indicator of user
experience and not of the network's health. And the total time, which
sums the two, is an indicator of nothing.
That's something to discuss in order to attack it for 2.3 now. However if
we can fix some documentation or slightly readjust a few very confusing
metrics before 2.2 gets released, it might still be doable.
@wtarreau The Prometheus exporter is an option that comes with build options. When I (the user) build HAProxy with this option, it means that I want full and reasonable metrics from HAProxy, and I also accept the performance reduction from enabling this option. If it concerns me, or if I simply don't want it, I will switch to another solution like fluentd or.. and won't build HAProxy with Prometheus.
I have nothing against your use of Prometheus with haproxy; what I mean is that just having a new way to export metrics doesn't automatically mean that new metrics will be born out of nowhere. The main problem with your request for "full and reasonable metrics" is that everyone has a different definition of this, and it is a never-ending story. Strangely, before we had the Prometheus exporter, everyone used to praise haproxy for its extremely comprehensive metrics, which were far more than reasonable. Now it seems that suddenly a small set of users would like their LB to compute and store all the stuff that normally ought to be done on a stats aggregation system, possibly from a time-series DB. And what next? The same people will probably expect stats to be kept across reloads? Then synced among cluster nodes? Please don't try to confuse your load balancer with a database.
@clwluvw, you should not see the Prometheus exporter as anything other than a tool to expose internal stats. As said, it is stateless. So when you ask for new metrics in Prometheus, it means in reality adding new counters in the HAProxy core. For everyone. Without a complete refactoring of the stats mechanism to make it modular, we must be careful when a new counter is added (in fact, we should always be careful with every addition). I don't know if we will ever have the time to refactor the stats (or the will to do it). But, as Willy said, some existing counters are quite old and not really accurate nowadays. So a dusting is possible, but we must think of all users and external tools relying on existing stats (and as always, we need time). It is a sensitive subject. In the short term, it is possible to add a filter to collect specific stats. Filters are stream-centric, so some stats won't be collected this way, but it is probably a good compromise. However, because solutions already exist to get more metrics from the logs, I doubt such a filter will be quickly inserted into our todo-list.
Over the past years, especially in the cloud native / kubernetes world, a consensus on what
Again, different time and different era. The bottom line is that if haproxy wants to stay relevant in this new type of infrastructure, it needs to address the current set of challenges that
All those statements are incorrect.
Just one more note on this. While this is probably true (I honestly have no idea), I personally run multiple Envoy proxies with tens of clusters (backends) exposing way more metrics than haproxy does, including histograms for each of them, and the memory/CPU overhead of it is negligible.
Again, wrong perspective: nobody wants those metrics to be stored on the load balancer, and actually some could argue that some of the metrics currently exposed (like averages over 1024 requests) are exactly what you are pointing out... not the job of a load balancer.
Yes please. I'm sure this is easier to do in haproxy than the current Inside prometheus for example, it'd be easy to then do
Keeping the max variants (max queue time, max total time) could be a good idea, as they can help troubleshooting. But I agree that the 1024-last-connections average counters are (totally) useless... Thank you very much 👍
Hi @UrBnW, I hadn't seen @danielbeardsley's explanation of his use case for the total time above, which matches yours, and seen like this I understand the mathematical reasoning and agree with you. It's also way more reasonable than computing histograms on the fly! There may of course be a problem of units there, because wrapping a 64-bit counter is within reach, though not trivial: it could sustain 3 years at 100k req/s 24x7 with a 2s response time. That's obviously not everyone's use case. On the other hand, other counters already wrap, and all properly designed monitoring systems already deal with that cleanly, so it should not be an issue. I'll see where we can stuff this total time in the server and backend struct (and maybe frontend, I'll see). I'll check with @capflam next week how this could be done in the prometheus exporter, if anything needs to be done there at all. Thanks for the explanation!
Prometheus corrects for monotonicity so that’s not a problem in this case 👍 |
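The monotonicity correction mentioned here can be sketched in a few lines. This is not Prometheus source code, just an illustration of the idea behind its `rate()`/`increase()` functions: whenever a raw counter sample is lower than the previous one, the counter is assumed to have reset (or wrapped) and restarted from zero:

```python
def increase(samples):
    """Total increase of a counter over a list of raw samples, compensating
    for resets/wraps the way Prometheus's rate()/increase() do: any decrease
    is treated as a counter restart from zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset, the whole current value counts as new increase.
        total += cur if cur < prev else cur - prev
    return total

# A wrap/reset between the 3rd and 4th samples is absorbed cleanly:
print(increase([100, 250, 900, 40, 190]))  # 990.0
```

This is why an occasionally wrapping 64-bit total-time counter is harmless on the monitoring side, as long as it never decreases for any other reason.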
One additional thought, if I may, while we are in stat counters. In We could think about relying on We can then have quite "huge" Not sure how we could handle this, but I suspect Perhaps then we need a new Of course feel free :) |
Oh, I remember that one; I already wanted to address it in 2.4, then 2.6, then... every time we forget about it. The reason is totally stupid: originally there was no way to wait for anything, so the qtime was whatever was left once you deducted the request time, connect time and response time from the total time. Nowadays it's totally wrong, and I would like to introduce a processing time (think about Lua scripts) and maybe even later a computation time that would be part of it. The processing time is impacted by multiple factors: computation, DNS resolution, waiting for more data from the client, etc. So maybe later it could be refined, but in the meantime at least that would allow us to fix this qtime once and for all. And to be clear, I consider the current behavior a bug.
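The qtime-by-subtraction problem described above can be shown with simple arithmetic. All the per-request timer values below are hypothetical, in milliseconds; the point is only that any time not tracked by a dedicated timer (here, a Lua action) is silently misreported as queue time:

```python
# Hypothetical per-request timers, in milliseconds.
total_ms    = 500  # total session time
request_ms  = 10   # time to receive the full request
connect_ms  = 5    # time to establish the server connection
response_ms = 85   # server response time
lua_ms      = 400  # time spent in e.g. a Lua action, not tracked separately

# Legacy derivation described above: qtime is whatever is left over,
# so the untracked processing time is misreported as queue time.
qtime_legacy = total_ms - (request_ms + connect_ms + response_ms)
print(qtime_legacy)  # 400 -- reported as "queued" though nothing was queued

# With a distinct processing timer, the leftover is the real queue time.
qtime_fixed = total_ms - (request_ms + connect_ms + response_ms + lua_ms)
print(qtime_fixed)   # 0
```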
Glad to hear from you @wtarreau, and thank you for your detailed answer, really appreciated 👍 So as per my understanding, if processing time is also deduced from the total time (in addition to the req/conn/resp times), then I just wonder whether or not it will also solve this
Currently, with such a configuration, While here in stats, here's an improvement proposal. Thank you again 👍 |
Yes, the goal is that your config above doesn't affect qtime. Also, regarding your stats proposal, please do not deviate from the original topic, or it becomes impossible to follow issues. It's already amazingly complicated to context-switch between tens of them at a time, but if multiple topics are opened in each, it's impossible. And to make a long story short: no, we should really not allow the clear operation to be done by URL like this, because it means anyone could do it. You can do it over the CLI, however ("clear counters" I think). We could possibly consider it for authenticated users only, in "stats admin" mode, but few people use that, and it'd remain extremely low on the priority list.
What should haproxy do differently? Which functionality do you think we should add?
Expose the total http response time to prometheus metrics
haproxy_backend_http_request_time_second_total
haproxy_frontend_http_request_time_second_total
haproxy_backend_http_responses_time_second_total
haproxy_frontend_http_responses_time_second_total
What are you trying to do?
I am trying to calculate the average response time of HTTP queries.
haproxy_backend_response_time_average_seconds
is not helpful, as the last 1024 connections could have happened in the last second or over the last 5 minutes.