No response body from /metrics with multiple servers #151

Closed
ahmgeek opened this issue Aug 20, 2019 · 9 comments

ahmgeek commented Aug 20, 2019

Code first

require "prometheus/middleware/collector"
require "prometheus/middleware/exporter"

module Sinatra
  class Api < Base   
    # Used for Prometheus
    use Rack::Deflater
    use Prometheus::Middleware::Collector
    use Prometheus::Middleware::Exporter

    configure do
      set :raise_errors, true
    end
  end
end

Technology

We use JRuby 9.2.4.1 and the Trinidad server.

Setup

We have two servers running the application, both with the same Ruby version and the same Trinidad server. In front of them sits HAProxy as a load balancer.

The issue

When you call the /metrics endpoint, HAProxy forwards the request to an available server; if both are available, the selection is effectively random.
The issue is that on one server you get the response and can see the metrics in OpenMetrics format, while on the other you get a 200 status code back but no body.
Just a blank page.

Curling the servers directly shows the same thing: one works fine and the response body contains the metrics; the other also responds fine, but the body is empty.
Logs are also fine.
I didn't dig deep into the application to see what's wrong (maybe a threading issue with the data stores),
but I wanted to ask you first, in case you've hit this before or have some hints.

ahmgeek changed the title from "No response (/metrics) with multiple servers" to "No response body from /metrics with multiple servers" on Aug 20, 2019
dmagliola (Collaborator) commented:

Hi @ahmgeek
Could you confirm which version of the Prometheus client you are using?
If you're on 0.9, I'd recommend upgrading to v0.10.0.pre.alpha.2 to begin with.

We haven't seen this issue before, however, and while JRuby is supported, we don't really run JRuby in production, so unfortunately we don't have much experience with it.


ahmgeek commented Aug 23, 2019

Hey @dmagliola, we are already using v0.10.0.pre.alpha.2. I will dig a bit deeper then and report back my findings.


ahmgeek commented Aug 27, 2019

Hey @dmagliola,
I am still chasing this, but I have some questions from my side:

raise AlreadyRegisteredError, "#{name} has already been registered"

Is this meant to be rescued somewhere?
Here's the log:

   {
    "service": "billing",
    "host": "prd-billing-server",
    "buildtag": "unknown",
    "environment": "production",
    "thread_id": 2056,
    "timestamp": "2019-08-27T12:47:29.7843Z",
    "severity": "ERROR",
    "message": "http_server_requests_total has already been registered",
    "route": "/check",
    "progname": "null",
    "audit": "false",
    "event_id": 0,
    "error_class": "Prometheus::Client::Registry::AlreadyRegisteredError",
    "error_backtrace": "/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:26:in `block in register'\norg/jruby/ext/thread/Mutex.java:165:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:24:in `register'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:41:in `counter'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:47:in `init_request_metrics'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:36:in `initialize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:86:in `block in use'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `block in to_app'\norg/jruby/RubyArray.java:1792:in `each'\norg/jruby/RubyEnumerable.java:1062:in `inject'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `to_app'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1473:in `new'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1462:in `prototype'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `block in call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1787:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:66:in `block in call'\norg/jruby/RubyArray.java:1792:in `each'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:50:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/error_handler-0.5.0/lib/error_handler/middleware.rb:18:in `call'"
  },
  {
    "service": "billing",
    "host": "prd-billing-server",
    "buildtag": "unknown",
    "environment": "production",
    "thread_id": 2050,
    "timestamp": "2019-08-27T12:47:29.7817Z",
    "severity": "ERROR",
    "message": "http_server_requests_total has already been registered",
    "route": "/check",
    "progname": "null",
    "audit": "false",
    "event_id": 0,
    "error_class": "Prometheus::Client::Registry::AlreadyRegisteredError",
    "error_backtrace": "/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:26:in `block in register'\norg/jruby/ext/thread/Mutex.java:165:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:24:in `register'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/client/registry.rb:41:in `counter'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:47:in `init_request_metrics'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/prometheus-client-0.10.0.pre.alpha.2/lib/prometheus/middleware/collector.rb:36:in `initialize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:86:in `block in use'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `block in to_app'\norg/jruby/RubyArray.java:1792:in `each'\norg/jruby/RubyEnumerable.java:1062:in `inject'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/builder.rb:147:in `to_app'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1473:in `new'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1462:in `prototype'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `block in call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1787:in `synchronize'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/sinatra-1.4.8/lib/sinatra/base.rb:1487:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:66:in `block in call'\norg/jruby/RubyArray.java:1792:in `each'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/rack-1.6.11/lib/rack/urlmap.rb:50:in `call'\n/home/billing/apps/billing/shared/bundle/jruby/2.5.0/gems/error_handler-0.5.0/lib/error_handler/middleware.rb:18:in `call'"
  }

I see it rescued here

rescue Prometheus::Client::Registry::AlreadyRegisteredError

So I'm not sure if it should be rescued somewhere else. I don't think it should be, but for some reason, after it raises the error a couple of times, the /metrics endpoint still always returns a 200 success status code with no body.


ahmgeek commented Aug 27, 2019

@dmagliola,

def self.registry
  @registry ||= Registry.new
end

def self.config
  @config ||= Config.new
end

Is this thread-safe? Notice that Client is a module, not a class, so the instance variables here are not really per-instance variables; they are shared state between all callers. Also, ||= memoization is not thread-safe, I think. Not sure whether these will cause issues (it depends on the usage), but it's worth mentioning.

dmagliola (Collaborator) commented:

Is AlreadyRegisteredError meant to be rescued

Not really sure how to answer that. It's not meant to be rescued within the client itself. You'll get this error if you try to register the same metric twice. Adding metrics is thread-safe; however, if multiple threads try to add the same metric, you'll get this error.

If you're going to have multiple threads declaring the metrics (adding them to the registry), you may need to either synchronize checking whether a metric already exists and adding it if not, or rescue the exception and fetch the already registered metric from the registry.

It's a bit hard to say what's best without knowing how your code is doing this.
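
For illustration, a minimal sketch of the rescue-and-reuse option, assuming the v0.10 registry API (counter/get); the metric name and labels here are made up:

require 'prometheus/client'

registry = Prometheus::Client.registry

begin
  # Try to register the metric; only the first thread to get here succeeds.
  requests = registry.counter(
    :http_server_requests_total,
    docstring: 'Total number of HTTP requests handled',
    labels: [:code, :method, :path]
  )
rescue Prometheus::Client::Registry::AlreadyRegisteredError
  # Another thread registered it first; reuse the existing metric.
  requests = registry.get(:http_server_requests_total)
end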

Is Prometheus::Client.registry thread-safe?

No, you are right. That is very much not thread-safe.
We should probably split the reading and setting of @registry into separate instructions, and have a Mutex around writing to it.

Could you try temporarily adding a mutex in your code, around the code that accesses this method, to see if that solves the problem?
We should fix this anyway, but I'd still like to get to the bottom of the problem you're having.
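
Roughly what I have in mind, as a temporary patch you could drop into your own code; just a sketch, not the library's actual fix, and the mutex constant name is invented:

require 'prometheus/client'

# Workaround sketch: serialize the lazy creation of the default registry so
# two threads can't each build their own and clobber each other.
module Prometheus
  module Client
    REGISTRY_MUTEX = Mutex.new  # invented name, for illustration only

    def self.registry
      # Fast path: the registry already exists, no locking needed.
      return @registry if @registry

      # Slow path: one thread creates it; ||= inside the lock guards against
      # a second waiting thread overwriting the first thread's registry.
      REGISTRY_MUTEX.synchronize do
        @registry ||= Registry.new
      end
    end
  end
end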


ahmgeek commented Aug 28, 2019

@dmagliola Thanks for the response. I'm doing pretty much nothing in my code, just using the collector and the exporter as they are:

require "prometheus/middleware/collector"
require "prometheus/middleware/exporter"

module Sinatra
  class Api < Base   
    # Used for Prometheus
    use Rack::Deflater
    use Prometheus::Middleware::Collector
    use Prometheus::Middleware::Exporter

    configure do
      set :raise_errors, true
    end
  end
end

However, I'm hitting the issues because of JRuby 😄

I figured there's a race condition or something, because the errors look like a couple of threads trying to register the same metric, after which the collector fails to continue for some reason. I'm trying to reproduce this locally with tests and will then send a PR to fix it.

Will let you know the outcome soonish.


ahmgeek commented Sep 9, 2019

It turned out an error is raised but swallowed by Rack's URLMap. I sent a PR to fix this issue and wrote up detailed information about the problems and the fixes.

jplflyer commented:

I was having a related issue. We're running inside an AWS Fargate container, with Puma. Our initialization code was something I found somewhere, and it looked like this:

require 'prometheus/client'
require 'prometheus/client/data_stores/direct_file_store'

Rails.logger.info "aa_prometheus.rb startup script"

# First we determine the path. Under a non-container environment,
# we would clear out files, but in our environment, this causes
# things to actually break (as we have multiple Puma threads racing
# through this).
prom_dir = Rails.root.join('tmp', 'prometheus')
FileUtils.mkdir_p prom_dir
#Dir["#{prom_dir}/*.bin"].each do |file_path|
#  File.unlink(file_path)
#end

# Then just point the data store.
Prometheus::Client.config.data_store = Prometheus::Client::DataStores::DirectFileStore.new(dir: prom_dir)

Notice the three lines I commented out for deleting the old files. I believe this was the cause of the problem. I understand removing old files if you're in a persistent scenario with a single-threaded startup. But I watched it produce files, and then suddenly they all disappeared and /metrics stopped doing anything useful. I believe it was Puma finally deciding to start another thread, but I'm not sure how to prove it.

No idea why I felt I was supposed to delete old files, but since I shouldn't be restarting the app willy-nilly without bouncing the entire container, I don't think it's necessary.
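
If you do want to keep the cleanup, one approach (just a sketch, assuming clustered Puma; the path and file names are illustrative) is to delete stale store files once in the master process, before any workers fork, so workers and threads never race on it:

# config/puma.rb (sketch)
require 'fileutils'

prom_dir = File.expand_path('../tmp/prometheus', __dir__)

before_fork do
  # Runs once in the master process before workers are forked (cluster mode),
  # so no worker or thread can delete files another one is already using.
  FileUtils.mkdir_p(prom_dir)
  Dir[File.join(prom_dir, '*.bin')].each { |f| File.unlink(f) }
end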


Sinjo commented Feb 1, 2022

I'm going to close this on the basis that it's pretty old and we have fixed a bunch of threading issues in the meantime.

I'm slightly uneasy at the idea that we may have something outstanding that only gets surfaced by the parallelism of JRuby, but even if that's the case, this lead has long since gone cold and we'd need a fresh report that we can chase up.

Sinjo closed this as completed Feb 1, 2022