
Add note about at_exit into README #207

Closed

Conversation

@lzap commented Oct 23, 2020

I think we found a way to mitigate the problem of many temporary metric files accumulating over time. It's definitely better with this solution. Although this block could probably be part of the library, I'd rather have it explicit in our initializer.
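The block itself isn't included in this excerpt; as a minimal sketch of what such an initializer hook might look like (the directory path, file glob, and helper name are all assumptions for illustration, not the gem's API):

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical metrics directory; in a real app this would come from
# your DirectFileStore configuration, e.g. "/tmp/prometheus".
METRICS_DIR = Dir.mktmpdir("prometheus")

# Delete the accumulated per-process metric files on shutdown.
# WARNING: as discussed later in this thread, this is only safe when the
# entire app is exiting, not when a single worker process is recycled.
def cleanup_metric_files(dir)
  Dir[File.join(dir, "*.bin")].each { |f| File.unlink(f) }
end

at_exit { cleanup_metric_files(METRICS_DIR) }
```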

@lzap (Author) commented Oct 23, 2020

One thing that comes to mind is whether this could break something: by deleting the files there could be information loss. In that case, proper "squashing" needs to be implemented, and that should probably be done automatically by the library, either at random intervals or during exit in a similar manner.

@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling e1622c6 on lzap:readme-at_exit into 4cdd0ab on prometheus:master.


@dmagliola (Collaborator) commented

Hi @lzap,
Thanks for this PR.

So, if your entire app is exiting or rebooting, it's a good idea to delete all the files in the /tmp/prometheus directory. However, unless I'm mistaken, at_exit could run when any of the regular Unicorn worker processes exits.
If that is the case, and a worker process is exiting but the rest of the app is not rebooting (and I understand that this worker recycling is quite a common occurrence during the app's lifecycle), then deleting these files will make counters go down, which would be a massive issue.

Could you confirm whether I'm wrong that worker processes may exit and execute this block without the rest of the app exiting?
If so, unfortunately we can't do this.

Proper squashing as you say would help with this, and it would be great to implement that. We're not looking at that at the moment, but we do accept contributions :-D
But it's not a trivial thing to add.
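For context on the worker-exit concern: forked workers inherit at_exit hooks registered before the fork. One common way to limit such a hook to the original process is a pid check; a sketch of that idea only (not something the gem provides, and it still assumes the boot process exiting means the whole app is going down):

```ruby
# Capture the pid of the process that loads this file (e.g. a Unicorn
# master when the app is preloaded before forking workers).
BOOT_PID = Process.pid

def whole_app_exiting?
  # Forked workers inherit the at_exit hook but have a different pid,
  # so this returns false inside a recycled worker.
  Process.pid == BOOT_PID
end

at_exit do
  if whole_app_exiting?
    # only here would it be plausible to clean up shared metric files
  end
end
```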

@dmagliola (Collaborator) commented

Another option is what I started doing in this PR
That approach doesn't work, or at least it's a lot less straightforward than it'd seem, and I left the PR there as a sort of warning sign, because it does look like the obvious thing to do.

We could have an option (be it a separate store, or a parameter in the store if you can find a way to make that work) for "one file per process", instead of "one file per metric per process". This would require locking if a process has multiple threads incrementing counters at the same time, but that's unlikely to be an issue in a multi-process environment, I believe.

It may also have other performance implications, so it should probably not be the default.
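The "one file per process" idea with thread locking could be sketched roughly as follows. The class name, plain-text file format, and method names are invented for illustration; the real DirectFileStore uses a binary format and a different API:

```ruby
require "tmpdir"

# Toy per-process store: all metrics for this process share one file,
# guarded by a Mutex so concurrent threads don't interleave writes.
class PerProcessStore
  attr_reader :path, :totals

  def initialize(dir)
    @path   = File.join(dir, "metrics_#{Process.pid}.txt")
    @lock   = Mutex.new
    @totals = Hash.new(0.0)
  end

  def increment(key, by = 1.0)
    @lock.synchronize do
      @totals[key] += by
      # Rewrite the whole file on each increment: simple, but this is
      # one of the possible performance costs mentioned above.
      File.write(@path, @totals.map { |k, v| "#{k} #{v}" }.join("\n"))
    end
  end
end
```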

@lzap (Author) commented Nov 23, 2020

Apologies for the delay.

> If that is the case, and a worker process is exiting but the rest of the app is not rebooting (and I understand that this worker recycling is quite a common occurrence during the app's lifecycle), then deleting these files will make counters go down, which would be a massive issue.

Hmmm, I hadn't actually realized this. Counters must indeed be monotonic.

> Could you confirm if you think I'm wrong that worker processes may exit and execute this block without the rest of the app exiting?

Yeah, this is right. Many app servers do this.

I guess proper squashing needs to be added, and I can imagine that won't be an easy task. Sorry for the noise.

@lzap closed this Nov 23, 2020
@dmagliola (Collaborator) commented

Squashing would indeed help a lot.
I would like to add it, but it is also indeed not easy. There are lots of concerns, from the practical question of how you even do it (the code in all_values that currently merges all the files to report metrics would probably be a starting point), to small things like what you name the "squashed" file, to much more complex ones like what happens if two processes start to squash at the same time. You'd also either need to lock all the files you've read so no one can increment them, then empty them and somehow let the owning processes know; or only squash files for dead processes, which won't help as much unless you have a persistent file system across server restarts.

In any case, I think we'd probably get a lot more bang for the buck if we changed to "one file per process" instead of "one per process per metric". That alone will drastically reduce the number of files.

I took a stab at it here, and failed to make the same code be configurable for both settings without major refactoring.

If you think you'd have time to look into it, I'd love to brainstorm and help a bit, and figure out how we can do that; it's probably the most valuable improvement we can make to the gem.

Thanks!
