Bug 1626086: Wait for tasks to complete before shutting down#798
mdboom merged 13 commits into mozilla:master
Conversation
wlach
left a comment
Looks pretty good to me. Should we add a note about this behaviour to the SDK documentation?
Yeah. On some level, this is like a bugfix on an implementation detail that's documented here: https://mozilla.github.io/glean/book/user/adding-glean-to-your-project.html#parallelism But it's just tricky/weird enough that Python developers probably want to know this is happening. Let me add a little bit in the Python-specific documentation.
> Glean installs an [`atexit` handler](https://docs.python.org/3/library/atexit.html) so the Glean thread can attempt to cleanly shut down when your application exits.
> This handler will wait up to 1 second for any pending work to complete.
> If that times out, some Glean work may be lost.
@mdboom, on other platforms I/O and networking happen on a thread that's separate from the "dispatcher" queue for API calls.
If we did something similar here, and were thus consistent with Kotlin/Swift, we'd mitigate data loss: API calls and ping serialization are very likely to finish in under 1 second, while networking operations are more likely to take longer under poor network conditions.
Maybe this can be a follow-up?
Yep -- I have this bug for the follow-up: https://bugzilla.mozilla.org/show_bug.cgi?id=1626403
In the case of Python, I think we'll use a separate process (rather than a thread) so that networking can continue even if the main process finishes, but it's the same concept. My only hesitation on the follow-up is whether to do it before or after the Rust networking refactor lands.
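The idea of moving uploads to a separate process might look like the following sketch. The function names and the upload body are hypothetical illustrations, not the SDK's actual implementation:

```python
import multiprocessing

def _upload_pending_pings(pings_dir: str) -> None:
    # Hypothetical upload loop: in the real SDK this would read
    # pending ping files from pings_dir and POST them to the
    # telemetry endpoint, deleting each file on success.
    ...

def start_upload_process(pings_dir: str) -> multiprocessing.Process:
    # A non-daemon child process keeps running even after the main
    # process' code has finished, so slow networking doesn't block
    # (or get cut off by) application exit.
    proc = multiprocessing.Process(
        target=_upload_pending_pings, args=(pings_dir,), daemon=False
    )
    proc.start()
    return proc
```

The key design point is the non-daemon flag: the interpreter waits for non-daemon children, so uploads can finish even when the main program has nothing left to do.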
Will the work actually be lost? Or would Glean be smart enough to resend it the next time the command / application is invoked?
Oh, yeah, of course. If networking is off, say, the ping file won't be deleted and a reattempt will be made next time. So pretty low risk.
I'd suggest mentioning that, since the current wording is a little scary.
In general, right now, I think there's a minor chance of losing a small fraction of data: since I/O and networking happen on the same dispatcher, if any work is queued after these lengthy operations and the timeout is hit, the work queued afterwards is lost, I think.
If B is a lengthy task on which we hit the 1s timeout, then the following can happen:
Queued tasks: A - B (timeout) - C (lost) - D (lost) - E (lost)
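That head-of-line blocking can be demonstrated with a small sketch; the task names are hypothetical and the 1-second join stands in for the `atexit` timeout:

```python
import queue
import threading
import time

done = []
q: "queue.Queue" = queue.Queue()

def worker() -> None:
    while True:
        name, task = q.get()
        task()
        done.append(name)

# B simulates a lengthy task (e.g. a slow network upload).
q.put(("A", lambda: None))
q.put(("B", lambda: time.sleep(2)))
q.put(("C", lambda: None))
q.put(("D", lambda: None))
q.put(("E", lambda: None))

t = threading.Thread(target=worker, daemon=True)
t.start()
t.join(timeout=1.0)  # the shutdown handler gives up after 1 second

# Only A completed; B timed out, and C, D, E never ran.
print(done)  # → ['A']
```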
Force-pushed from e65860a to bedd50c
I went ahead and made uploading happen in a separate process. I think the pending Rust refactor on that won't impact this too much after all. That way, I don't think we need the scary warning, since it's become a really unlikely corner case.
Force-pushed from 4db5aa7 to 15d704a
wlach
left a comment
As I mentioned here, I'm worried that multiprocessing might have issues with the setup I'm using for mozregression on Windows. See this very long thread on stackoverflow, for example: https://stackoverflow.com/questions/24944558/pyinstaller-built-windows-exe-fails-with-multiprocessing
Can you make it optional?
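For reference, the standard stdlib mitigation for frozen Windows executables is `multiprocessing.freeze_support()`, called first thing in the entry point (PyInstaller has historically needed extra care beyond this, as the linked thread describes):

```python
import multiprocessing

def main() -> None:
    # Hypothetical application entry point; real work would go here.
    ...

if __name__ == "__main__":
    # On Windows, a frozen executable re-runs the entry point in each
    # spawned child process; freeze_support() intercepts that re-entry
    # so multiprocessing can bootstrap its workers correctly.
    multiprocessing.freeze_support()
    main()
```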
Gah. That's horrifying, but not surprising when you think it all through. It's even failing our Windows tests here (which aren't doing any PyInstaller magic).

Assuming I can figure out what's going on with the Windows unit tests, making it optional through a configuration option seems doable. If not, it's easy enough for me to take this last commit out and merge this with the multithreading, without the multiprocessing. That's at least an 80% solution.

Once the refactor to move a lot of the upload processing to Rust is complete, it will be pretty easy to write a pure-Rust ping uploader, which might avoid some of these issues (assuming we can run a binary executable inside of a PyInstaller archive).
Fixed the basic Windows failures, and added a configuration option to make the multiprocessing optional.
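Such an option might look like the following sketch; the `Configuration` class and the `allow_multiprocessing` name here are illustrative, not necessarily the SDK's actual API:

```python
import multiprocessing
import threading
from dataclasses import dataclass

@dataclass
class Configuration:
    # When False, ping uploading falls back to a thread in the current
    # process, which is safer under PyInstaller on Windows.
    allow_multiprocessing: bool = True

def start_upload(config: Configuration, upload_func, *args):
    # Choose the worker type based on configuration; both expose the
    # same start()/join() interface, so callers don't need to care.
    if config.allow_multiprocessing:
        worker = multiprocessing.Process(target=upload_func, args=args)
    else:
        worker = threading.Thread(target=upload_func, args=args)
    worker.start()
    return worker
```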
Force-pushed from 9c27b3a to adc3d6e
wlach
left a comment
I have some concerns about mixing the test code with the implementation, but aside from that this is looking pretty good to me.
| """ | ||
| Processes a single ping file. | ||
| """ | ||
| def _do_process_async(cls) -> "multiprocessing.Process": |
Is this the right name for the function? It seems like it's more about doing things in a separate process than doing things async.
Maybe _process_pings_multiprocessing?
```python
        return _process(cls.storage_directory(), Glean._configuration)

    @classmethod
    def _test_process_sync(cls) -> bool:
```
This feels like something that should live in a test or test fixture, rather than in the implementation. In this case I'd personally just create a small helper function inside the test file itself.
This needs to be here because it needs to run when the global _testing_mode flag is set to True. _testing_mode is required even for Glean's users writing their own unit tests involving Glean, so it has to be in code shipped in the library.
Ah, I see. Seems a little odd to me, but maybe there's no other way.
Co-Authored-By: William Lachance <wrlach@gmail.com>