Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PushMeterRegistry metrics lost on close if scheduled publish is in progress #3872

Closed
pirgeo opened this issue May 31, 2023 · 12 comments · Fixed by #4287
Closed

PushMeterRegistry metrics lost on close if scheduled publish is in progress #3872

pirgeo opened this issue May 31, 2023 · 12 comments · Fixed by #4287
Assignees
Labels
bug A general bug module: micrometer-core An issue that is related to our core module
Milestone

Comments

@pirgeo
Copy link
Contributor

pirgeo commented May 31, 2023

As pointed out in #3832 (comment):

the JVM simply does not wait for Micrometer to finish publishing. This is intentional (Micrometer should not disturb your app)

Therefore, in some cases, the last export is not finished because the JVM shuts down before the export is completed. This can happen for exports triggered by a timer (periodic exports), as well as for exports triggered by a shutdown signal (e.g. hitting the actuator shutdown endpoint in a Spring Boot app).

I think adding a configuration option to wait for the exporter thread to finish its export would be helpful. This could be combined with a timeout to define the maximum time to wait until the app shuts down even if the data is not exported, in order to not block the shutdown indefinitely in case of network issues. By setting this timeout to 0, Micrometer would behave in the same way as it does today.

Rationale
From my research on #3832, I realized that it is possible that apps shut down without exporting data collected between the previous export and the shutdown signal. For certain metrics, this might be relevant information. I believe it should be the choice of the person running the application whether to have the application run for a few extra cycles in order to ensure data completeness. Additionally, for apps running only for a short period of time, there is a chance that no data is exported if the first export would only happen after the app is shut down. By ensuring the export is complete before shutting down, even short-lived apps will always be able to export their data.

@shakuzen shakuzen added this to the 1.11.x milestone Jun 9, 2023
@shakuzen shakuzen added the module: micrometer-core An issue that is related to our core module label Jun 9, 2023
@shakuzen
Copy link
Member

shakuzen commented Jun 9, 2023

After looking into #3863, I think the issue you describe only happens when the registry is closed while a scheduled publish is in progress, which should be a somewhat rare case. Otherwise, we block on the thread that calls close until the publish finishes.

@shakuzen shakuzen added waiting for team An issue we need members of the team to review waiting for feedback We need additional information before we can continue and removed waiting-for-triage waiting for team An issue we need members of the team to review labels Jun 9, 2023
@shakuzen
Copy link
Member

shakuzen commented Jun 9, 2023

Can you confirm if my assumption in my previous comment is true for what you're seeing?

@pirgeo
Copy link
Contributor Author

pirgeo commented Jun 12, 2023

This is in line with what you wrote above. I originally saw the "Duplicate call to publish" message, meaning the scheduled publish was in progress while I tried to shut the app down. I have not had problems with unfinished exports when the shutdown-publish is responsible.
It seems to be that very rare case where the publishing is already in progress (because it's scheduled) and then the app gets a shutdown signal and exits before the publishing is finished.

@shakuzen shakuzen changed the title Configurable timeout to ensure export can finish before app shuts down. Metrics lost on close if scheduled publish is in progress Jun 14, 2023
@shakuzen shakuzen added bug A general bug and removed waiting for feedback We need additional information before we can continue labels Jun 14, 2023
@pirgeo
Copy link
Contributor Author

pirgeo commented Jul 12, 2023

Edit: A bit more background:

  • It seems inconsistent to me to wait for the export to finish when the shutdown signal is received, but not when the export is happening on the periodic export.
  • The fact that the periodic export can block the shutdown export and therefore lead to a loss of data (the last export) is problematic. The last export can often have crucial metrics.

We would really like to be able to ensure the last export (independent of whether it is triggered by periodic export or shutdown signal) completes before the app shuts down. I understand that blocking the app from shutting down is against Micrometers' philosophy, but I could see this as a config option (i.e., an opt-in like: I am willing to wait for the export to finish before my app shuts down).

Any thoughts on this?

@pirgeo
Copy link
Contributor Author

pirgeo commented Jul 28, 2023

@shakuzen I am thinking about how this could best be tackled since it potentially can have quite the impact (e.g. losing data about how many background queues were shut down before the app shut down because the publishing does not complete). I was able to come up with two ways of dealing with this:

  1. Make it possible to turn off the randomly delayed publishing. However, with the changes as they are now (with the atomic variable checking if publishing is already in progress), the lost data might not actually have anything to do with the random delay. If the publishing is in progress at the same time as the app shuts down, this problem can still appear.
  2. Make it possible to set a timeout that should be waited for before shutting down, even in scheduled exports (not only on shutdown).

Do you think adding a (customizable?) timeout is a viable way forward? This could be set to off by default to keep the current behavior, but being able to rely on the last export completing would be really helpful.

@pirgeo
Copy link
Contributor Author

pirgeo commented Sep 6, 2023

Since this is pretty serious for some of our customers, I have investigated further. Some of them are pretty firm on not upgrading to newer Micrometer versions until this is addressed, so I would be really happy if we could get this moving before 1.12. I have created a reproducer here: d0154b1

It shows that when close() is called on a StepMeterRegistry, the last export will be lost if there is already a scheduled export in progress. I would really like to be able to give our customers an out (e.g. setting a max timeout that should be waited for when close() is called and hits the duplicate call to publish error).

Are there any objections to adding such a setting? Otherwise, I will open a PR to add such a feature.

CC @shakuzen @jonatan-ivanov
also CC @lenin-jaganathan, I think you were involved in initially implementing the publishSafely() method.

@pirgeo
Copy link
Contributor Author

pirgeo commented Oct 2, 2023

We have moved the reproducer from a single commit to an actual branch and opened a PR (#4062). The PR contains two test cases, one that shows the current behavior (and how the default behavior does not change when merging this PR; the test is called scheduledPublishInterruptedByCloseWillDropData_whenShutdownTimeoutIsZero) and one that shows how the updated code guards against the data loss when the configuration is adapted (scheduledPublishInterruptedByCloseWillNotDropData_whenShutdownTimeoutIsBigEnough).

@shakuzen shakuzen self-assigned this Oct 24, 2023
@shakuzen
Copy link
Member

Just leaving a comment to say I'll be looking at this again. Thanks for the investigation and test cases and pull request so far. Much appreciated.

@pirgeo
Copy link
Contributor Author

pirgeo commented Oct 24, 2023

Thank you, please let me know if you need any more information!

@shakuzen shakuzen linked a pull request Oct 30, 2023 that will close this issue
@shakuzen
Copy link
Member

Are there any objections to adding such a setting?

Sorry for the late response to this. I thought I left a comment regarding this somewhere else, but I haven't been able to find it (there are a lot of comments in a lot of different threads related to this issue, so I may have missed it or I never made it). Given that we do a final publish in a blocking manner on close when there isn't a publish in progress, my preference would be to avoid introducing a new configuration option, and instead, align the behavior when a publish is in progress - that is, block close on the in-progress publish finishing. I have taken a stab at that in #4287. Let me know if you have any feedback on that, and I'll review it with @jonatan-ivanov as well.

@pirgeo
Copy link
Contributor Author

pirgeo commented Oct 30, 2023

Thanks for taking a look! I actually prefer this to be the default, but wasn't sure if there was appetite for it by others as well. That's why I made it optional in my proposal. The only feedback I have for #4287 is that this will now block indefinitely if there is a long-running request. I don't assume this to be a problem, since HTTP requests usually time out in a reasonable timeframe compared to an application shutdown, but I wanted to point it out.

@shakuzen
Copy link
Member

Thanks for taking a look and sharing your thoughts. I appreciate it.
It indeed can block indefinitely, but that's already true for the case when close is called while a publish isn't in progress. We do a publish in that case and close doesn't return until it does. The thinking here is that it is up to the platform, framework, or user to decide what behavior is appropriate. Closing a registry is typically something that is done at application shutdown in a shutdown hook (separate thread from application work). And typically the thing waiting for an application to shutdown (e.g. a platform like kubernetes) will not wait indefinitely. It will usually send a SIGKILL signal after some time, which will force the JVM to shutdown even if the shutdown hook hasn't completed.

We can of course consider feedback if users have some issue related to this behavior, but I don't remember anything reported against the behavior of blocking for publish on close that we've had since I believe Micrometer 1.1.

@shakuzen shakuzen modified the milestones: 1.11.x, 1.10.13 Nov 2, 2023
@shakuzen shakuzen modified the milestones: 1.10.13, 1.11.6 Nov 13, 2023
@shakuzen shakuzen changed the title Metrics lost on close if scheduled publish is in progress PushMeterRegistry metrics lost on close if scheduled publish is in progress Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug module: micrometer-core An issue that is related to our core module
Projects
None yet
2 participants