Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing the Registry threading/design issues #153

Closed
wants to merge 12 commits into from

Conversation

ahmgeek
Copy link

@ahmgeek ahmgeek commented Sep 2, 2019

What does it fix #151

There are two issues with the current implementation:

  1. In the collector middleware
  2. A design issue with the registry

I will go over them one by one

The collector

init_request_metrics
init_exception_metrics

These init_* methods get called every single time a request happens

def init_request_metrics
@requests = @registry.counter(

As you can see, every request we try to register the default http_* metrics.
If you think this is not an issue with MRI, it is actually, it raises AlreadyRegisteredError with MRI
This is the spec d4f4e32 that produces the bug.

The fix

Check before adding the metric to the registry, if the metric is already there, use it, otherwise push the metric to the registry.
Here fb4504e I introduce the registry.get(metric)

I think Daniel mentioned something regards symbols/strings. I would vote for strings, I can try and fix this too but it will be a breaking change if I understand correctly? maybe we can have a coercing method under the hood and warn about the deprecation or something.

The registry

Turned out, in a multi-threaded world, two threads could race to register the metric, and bypass the @registry.get(metric) check. So one would register the metric and the other would raise AlreadyRegisteredError.
The way I see it if multiple threads try to register the same metric (with locking introduced), if the metric is registered, there should be no errors raised, just return the metric and use it to perform the preferred action.
Hence here 80adc30 I changed from rubt's mutex, to something more efficient for the use case.

The specs for thread safety was relying on the raise to make it thread-safe, for me, this design is more of a workaround. If the registry is thread-safe, it means only one metric will be written to the registry at all times.

The registry again

I re-designed in the last commit the registry to be used as singleton object. This way we don't need the rather bad not thread safe memoization.

Points to discuss

@dmagliola Please let me know how is the new design works, maybe I understand things wrongly, or my implementation has flaws. Also if you can put me in the loop while you discuss this with your team that would be great.

@coveralls
Copy link

coveralls commented Sep 2, 2019

Coverage Status

Coverage remained the same at 100.0% when pulling 37089ca on ahmgeek:test/broken-integration into b3bbe23 on prometheus:master.

@ahmgeek
Copy link
Author

ahmgeek commented Sep 3, 2019

@dmagliola please take a look and merge it if you think it's fine this fixes the issue within the collector.

Also please another release :(

Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
@ahmgeek ahmgeek changed the title Add failing tests for integrating in rack Fixing the Registry threading/design issues Sep 5, 2019
@dmagliola
Copy link
Collaborator

Hi @ahmgeek, thank you for your PR!
I'm currently on vacation. I tried to look into this, but it's quite involved, and some of these changes require discussions with other members of the team (like returning a metric instead of erroring if registering it twice).

This PR also highlights a few other small problems, like the confusion between strings and symbols in the Registry, which we should also fix.
I'll be looking into this near the end of September.

Sorry for the delay, and thanks for the good work!

@ahmgeek
Copy link
Author

ahmgeek commented Sep 6, 2019

@dmagliola Oh man, am really sorry to disturb you during the vacation, please forget about it. Have fun and enjoy the vacation.
We can discuss it later. 😘

Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
Signed-off-by: Ahmad Tolba <tolpa1@gmail.com>
@ahmgeek
Copy link
Author

ahmgeek commented Sep 25, 2019

Unfortunately, on local development machine randomly the /metrics endpoint returns the initialized metrics in a freezed way i.e never increments the counters for instance.

In production it's always freezed, there's no way to increment them.
I will close this PR now, I think we will go for an internal implementation within my company (targeting jRuby mainly)

Thanks ❤️

@ahmgeek ahmgeek closed this Sep 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No response body from /metrics with multiple servers
3 participants