
Add note about at_exit into README #207

Closed

Conversation

@lzap commented Oct 23, 2020

I think we found a way to mitigate the problem of many temporary metric files accumulating over time. It's definitely better with this solution. Although this block could probably be part of the library, I'd rather have it explicit in our initializer.
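The block itself isn't included in this excerpt; as a minimal sketch of what such an initializer hook might look like (the directory path, file glob, and helper name are all assumptions for illustration, not the gem's API):

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical metrics directory; in a real app this would come from
# your DirectFileStore configuration, e.g. "/tmp/prometheus".
METRICS_DIR = Dir.mktmpdir("prometheus")

# Delete the accumulated per-process metric files on shutdown.
# WARNING: as discussed later in this thread, this is only safe when the
# entire app is exiting, not when a single worker process is recycled.
def cleanup_metric_files(dir)
  Dir[File.join(dir, "*.bin")].each { |f| File.unlink(f) }
end

at_exit { cleanup_metric_files(METRICS_DIR) }
```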

@lzap (Author) commented Oct 23, 2020

One thing that comes to mind is whether this could break something: by deleting the files there could be information loss. In that case, proper "squashing" needs to be implemented, and that should probably be done automatically by the library, either at random intervals or during exit in a similar manner.

@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling e1622c6 on lzap:readme-at_exit into 4cdd0ab on prometheus:master.


@dmagliola (Collaborator) commented

Hi @lzap,
Thanks for this PR.

So, if your entire app is exiting or rebooting, it's a good idea to delete all the files in the /tmp/prometheus directory. However, unless I'm mistaken, at_exit could run when any of the regular Unicorn worker processes exits.
If that is the case, and a worker process is exiting but the rest of the app is not rebooting (and I understand that this worker recycling is quite a common occurrence during the app's lifecycle), then deleting these files will make counters go down, which would be a massive issue.

Could you confirm whether I'm wrong that worker processes may exit and execute this block without the rest of the app exiting?
If so, unfortunately we can't do this.

Proper squashing as you say would help with this, and it would be great to implement that. We're not looking at that at the moment, but we do accept contributions :-D
But it's not a trivial thing to add.
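For context on the worker-exit concern: forked workers inherit at_exit hooks registered before the fork. One common way to limit such a hook to the original process is a pid check; a sketch of that idea only (not something the gem provides, and it still assumes the boot process exiting means the whole app is going down):

```ruby
# Capture the pid of the process that loads this file (e.g. a Unicorn
# master when the app is preloaded before forking workers).
BOOT_PID = Process.pid

def whole_app_exiting?
  # Forked workers inherit the at_exit hook but have a different pid,
  # so this returns false inside a recycled worker.
  Process.pid == BOOT_PID
end

at_exit do
  if whole_app_exiting?
    # only here would it be plausible to clean up shared metric files
  end
end
```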

@dmagliola (Collaborator) commented

Another option is what I started doing in this PR
That approach doesn't work, or at least it's a lot less straightforward than it'd seem, and I left the PR there as a sort of warning sign, because it does look like the obvious thing to do.

We could have an option (be it a separate store, or a parameter in the store if you can find a way to make that work) for "one file per process", instead of "one file per metric per process". This would require locking if a process has multiple threads incrementing counters at the same time, but that's unlikely to be an issue in a multi-process environment, I believe.

It may also have other performance implications, so it should probably not be the default.
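The "one file per process" idea with thread locking could be sketched roughly as follows. The class name, plain-text file format, and method names are invented for illustration; the real DirectFileStore uses a binary format and a different API:

```ruby
require "tmpdir"

# Toy per-process store: all metrics for this process share one file,
# guarded by a Mutex so concurrent threads don't interleave writes.
class PerProcessStore
  attr_reader :path, :totals

  def initialize(dir)
    @path   = File.join(dir, "metrics_#{Process.pid}.txt")
    @lock   = Mutex.new
    @totals = Hash.new(0.0)
  end

  def increment(key, by = 1.0)
    @lock.synchronize do
      @totals[key] += by
      # Rewrite the whole file on each increment: simple, but this is
      # one of the possible performance costs mentioned above.
      File.write(@path, @totals.map { |k, v| "#{k} #{v}" }.join("\n"))
    end
  end
end
```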

@lzap (Author) commented Nov 23, 2020

Apologies for the delay.

> If that is the case, and a worker process is exiting but the rest of the app is not rebooting (and I understand that this worker recycling is quite a common occurrence during the app's lifecycle), then deleting these files will make counters go down, which would be a massive issue.

Hmmm, I hadn't actually realized this. Counters must indeed be monotonic.

> Could you confirm if you think I'm wrong that worker processes may exit and execute this block without the rest of the app exiting?

Yeah, this is right. Many app servers do this.

I guess proper squashing needs to be added, and I can imagine that won't be an easy task. Sorry for the noise.

@lzap closed this Nov 23, 2020
@dmagliola (Collaborator) commented

Squashing would indeed help a lot.
I would like to add it, but it is also indeed not easy. There are lots of concerns, from the practical question of how you even do it (the code in all_values that currently merges all the files to report metrics would probably be a starting point), to small things like what you name the "squashed" file, to much more complex ones like what happens if two processes start to squash at the same time. You'd also either need to lock all the files you've read so no one can increment them, then empty them and somehow let the owning processes know; or only squash files for dead processes, which won't help as much unless you have a persistent file system across server restarts.

In any case, I think we'd probably get a lot more bang for the buck if we changed to "one file per process" instead of "one per process per metric". That alone will drastically reduce the number of files.

I took a stab at it here, and failed to make the same code be configurable for both settings without major refactoring.

If you think you'd have time to look into it, I'd love to brainstorm and help a bit, and figure out how we can do that; it's probably the most valuable improvement we can make to the gem.

Thanks!
