No response body from /metrics with multiple servers #151

Closed
ahmgeek opened this issue Aug 20, 2019 · 9 comments

ahmgeek commented Aug 20, 2019

Code first

require "prometheus/middleware/collector"
require "prometheus/middleware/exporter"

module Sinatra
  class Api < Base   
    # Used for Prometheus
    use Rack::Deflater
    use Prometheus::Middleware::Collector
    use Prometheus::Middleware::Exporter

    configure do
      set :raise_errors, true
    end
  end
end

Technology

We use JRuby 9.2.4.1 and the Trinidad server.

Setup

We have two servers running the application, both with the same Ruby version and the same Trinidad server. In front of them sits HAProxy as a load balancer.

The issue

When you call the /metrics endpoint, HAProxy forwards the request to an available server; if both are available, the selection is effectively random.
The issue is that on one server you get the response and can see the metrics in OpenMetrics format, while on the other you get a 200 status code back but no body.
Just a blank page.

Curling the servers directly shows the same thing: one works fine and the response body contains the metrics; the other also responds fine, but the body is empty.
Logs are also fine.
I didn't dig deep into the application to see what's wrong (maybe a threading issue with the data stores),
but I wanted to ask you first, in case you've hit this before or have some hints.

ahmgeek changed the title from "No response (/metrics) with multiple servers" to "No response body from /metrics with multiple servers" on Aug 20, 2019
dmagliola (Collaborator) commented:

Hi @ahmgeek
Could you confirm which version of the Prometheus client you are using?
If you're on 0.9, I'd recommend upgrading to v0.10.0.pre.alpha.2 to begin with.

We haven't seen this issue before, however, and while JRuby is supported, we don't really run JRuby in production, so unfortunately we don't have much experience with it.


ahmgeek commented Aug 23, 2019

Hey @dmagliola, we are already using v0.10.0.pre.alpha.2. I will dig a bit deeper then and report back my findings.


ahmgeek commented Aug 27, 2019

Hey @dmagliola,
I am still chasing this, but I have some questions from my side:

raise AlreadyRegisteredError, "#{name} has already been registered"

Is this meant to be rescued somewhere?
Here's the log:

   {
    "service": "billing",
    "host": "prd-billing-server",
    "buildtag": "unknown",
    "environment": "production",
    "thread_id": 2056,
    "timestamp": "2019-08-27T12:47:29.7843Z",
    "severity": "ERROR",
    "message": "http_server_requests_total has already been registered",
    "route": "/check",
    "progname": "null",
    "audit": "false",
    "event_id": 0,
    "error_class": "Prometheus::Client::Registry::AlreadyRegisteredError",
    "error_backtrace": "/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:26:in `block in register'\norg/jruby/ext/thread/Mutex.java:165:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:24:in `register'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:41:in `counter'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:47:in `init_request_metrics'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:36:in `initialize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:86:in `block in use'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `block in to_app'\norg/jruby/RubyArray.java:1792:in `each'\norg/jruby/RubyEnumerable.java:1062:in `inject'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `to_app'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1473:in `new'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1462:in `prototype'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `block in call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1787:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:66:in `block in call'\norg/jruby/RubyArray.java:1792:in `each'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:50:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/error_handler-0.5.0/lib/error_handler/middleware.rb:18:in `call'"
  },
  {
    "service": "billing",
    "host": "prd-billing-server",
    "buildtag": "unknown",
    "environment": "production",
    "thread_id": 2050,
    "timestamp": "2019-08-27T12:47:29.7817Z",
    "severity": "ERROR",
    "message": "http_server_requests_total has already been registered",
    "route": "/check",
    "progname": "null",
    "audit": "false",
    "event_id": 0,
    "error_class": "Prometheus::Client::Registry::AlreadyRegisteredError",
    "error_backtrace": "/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:26:in `block in register'\norg/jruby/ext/thread/Mutex.java:165:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:24:in `register'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:41:in `counter'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:47:in `init_request_metrics'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:36:in `initialize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:86:in `block in use'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `block in to_app'\norg/jruby/RubyArray.java:1792:in `each'\norg/jruby/RubyEnumerable.java:1062:in `inject'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `to_app'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1473:in `new'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1462:in `prototype'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `block in call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1787:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:66:in `block in call'\norg/jruby/RubyArray.java:1792:in `each'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:50:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/error_handler-0.5.0/lib/error_handler/middleware.rb:18:in `call'"
  }

I see it rescued here

rescue Prometheus::Client::Registry::AlreadyRegisteredError

So I'm not sure if it should be rescued somewhere else. I don't think it should be, but for some reason, after it raises the error a couple of times, the /metrics endpoint still always returns a 200 success status code with no body.


ahmgeek commented Aug 27, 2019

@dmagliola,

def self.registry
  @registry ||= Registry.new
end

def self.config
  @config ||= Config.new
end

Is this thread-safe? Notice that Client is a module, not a class, so the instance variables here are not really per-instance variables; they are shared state between all callers. Also, ||= memoization is not thread-safe, I think. Not sure whether these will cause issues (it depends on the usage), but it's worth mentioning.

dmagliola (Collaborator) commented:

Is AlreadyRegisteredError meant to be rescued

Not really sure how to answer that. It's not meant to be rescued within the client itself. You'll get this error if you try to register the same metric twice. Adding metrics is thread-safe; however, if multiple threads try to add the same metric, you'll get this error.

If you're going to have multiple threads declaring the metrics (adding them to the registry), you may need to either synchronize checking whether a metric already exists and adding it if not, or rescue the exception and fetch the already registered metric from the registry.

It's a bit hard to say what's best without knowing how your code is doing this.
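
For illustration, a minimal sketch of the rescue-and-reuse option, assuming the v0.10 registry API (counter/get); the metric name and labels here are made up:

require 'prometheus/client'

registry = Prometheus::Client.registry

begin
  # Try to register the metric; only the first thread to get here succeeds.
  requests = registry.counter(
    :http_server_requests_total,
    docstring: 'Total number of HTTP requests handled',
    labels: [:code, :method, :path]
  )
rescue Prometheus::Client::Registry::AlreadyRegisteredError
  # Another thread registered it first; reuse the existing metric.
  requests = registry.get(:http_server_requests_total)
end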

Is Prometheus::Client.registry thread-safe?

No, you are right. That is very much not thread-safe.
We should probably split the reading and setting of @registry into separate instructions, and have a Mutex around writing to it.

Could you try temporarily adding a mutex in your code, around the code that accesses this method, to see if that solves the problem?
We should fix this anyway, but I'd still like to get to the bottom of the problem you're having.
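
Roughly what I have in mind, as a temporary patch you could drop into your own code; just a sketch, not the library's actual fix, and the mutex constant name is invented:

require 'prometheus/client'

# Workaround sketch: serialize the lazy creation of the default registry so
# two threads can't each build their own and clobber each other.
module Prometheus
  module Client
    REGISTRY_MUTEX = Mutex.new  # invented name, for illustration only

    def self.registry
      # Fast path: the registry already exists, no locking needed.
      return @registry if @registry

      # Slow path: one thread creates it; ||= inside the lock guards against
      # a second waiting thread overwriting the first thread's registry.
      REGISTRY_MUTEX.synchronize do
        @registry ||= Registry.new
      end
    end
  end
end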


ahmgeek commented Aug 28, 2019

@dmagliola Thanks for the response. I'm doing pretty much nothing in my code, just using the collector and the exporter as they are:

require "prometheus/middleware/collector"
require "prometheus/middleware/exporter"

module Sinatra
  class Api < Base   
    # Used for Prometheus
    use Rack::Deflater
    use Prometheus::Middleware::Collector
    use Prometheus::Middleware::Exporter

    configure do
      set :raise_errors, true
    end
  end
end

However, I'm hitting the issues because of JRuby 😄

I figured there's a race condition or something, because the errors look like a couple of threads trying to register the same metric, after which the collector fails to continue for some reason. I'm trying to reproduce this locally with tests and will then send a PR to fix it.

Will let you know the outcome soonish.


ahmgeek commented Sep 9, 2019

It turned out an error is raised but swallowed by Rack's URLMap. I sent a PR to fix this issue and wrote up detailed information about the problems and the fixes.

jplflyer commented:

I was having a related issue. We're running inside an AWS Fargate container, with Puma. Our initialization code was something I found somewhere, and it looked like this:

require 'prometheus/client'
require 'prometheus/client/data_stores/direct_file_store'

Rails.logger.info "aa_prometheus.rb startup script"

# First we determine the path. Under a non-container environment,
# we would clear out files, but in our environment, this causes
# things to actually break (as we have multiple Puma threads racing
# through this).
prom_dir = Rails.root.join('tmp', 'prometheus')
FileUtils.mkdir_p prom_dir
#Dir["#{prom_dir}/*.bin"].each do |file_path|
#  File.unlink(file_path)
#end

# Then just point the data store.
Prometheus::Client.config.data_store = Prometheus::Client::DataStores::DirectFileStore.new(dir: prom_dir)

Notice the three lines I commented out for deleting the old files. I believe this was the cause of the problem. I understand removing old files if you're in a persistent scenario with a single-threaded startup. But I watched it produce files, and then suddenly they all disappeared and /metrics stopped doing anything useful. I believe it was Puma finally deciding to start another thread, but I'm not sure how to prove it.

No idea why I felt I was supposed to delete old files, but since I shouldn't be restarting the app willy-nilly without bouncing the entire container, I don't think it's necessary.
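
If you do want to keep the cleanup, one approach (just a sketch, assuming clustered Puma; the path and file names are illustrative) is to delete stale store files once in the master process, before any workers fork, so workers and threads never race on it:

# config/puma.rb (sketch)
require 'fileutils'

prom_dir = File.expand_path('../tmp/prometheus', __dir__)

before_fork do
  # Runs once in the master process before workers are forked (cluster mode),
  # so no worker or thread can delete files another one is already using.
  FileUtils.mkdir_p(prom_dir)
  Dir[File.join(prom_dir, '*.bin')].each { |f| File.unlink(f) }
end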


Sinjo commented Feb 1, 2022

I'm going to close this on the basis that it's pretty old and we have fixed a bunch of threading issues in the meantime.

I'm slightly uneasy at the idea that we may have something outstanding that only gets surfaced by the parallelism of JRuby, but even if that's the case, this lead has long since gone cold and we'd need a fresh report that we can chase up.

Sinjo closed this as completed Feb 1, 2022