Bug 1626086: Wait for tasks to complete before shutting down #798

Merged
mdboom merged 13 commits into mozilla:master from mdboom:python-shutdown-cleanly
Apr 6, 2020

Conversation


@mdboom mdboom commented Mar 31, 2020

No description provided.

@auto-assign auto-assign bot requested a review from badboy March 31, 2020 19:42
@mdboom mdboom requested a review from wlach March 31, 2020 19:42

@wlach wlach left a comment


Looks pretty good to me. Should we add a note about this behaviour to the SDK documentation?


mdboom commented Mar 31, 2020

> Looks pretty good to me. Should we add a note about this behaviour to the SDK documentation?

Yeah. On some level, this is like a bugfix on an implementation detail that's documented here: https://mozilla.github.io/glean/book/user/adding-glean-to-your-project.html#parallelism

But it's just tricky/weird enough that Python developers probably want to know this is happening. Let me add a little bit in the Python-specific documentation.


Glean installs an [`atexit` handler](https://docs.python.org/3/library/atexit.html) so the Glean thread can attempt to cleanly shut down when your application exits.
This handler will wait up to 1 second for any pending work to complete.
If that times out, some Glean work may be lost.

@Dexterp37 Dexterp37 Apr 1, 2020


@mdboom , on other platforms I/O and networking happen on a thread that's separate from the "dispatcher" queue for API calls.

If we did something similar here, thus being consistent with Kotlin/Swift, we'd mitigate data loss: API calls and ping serialization are very likely to finish in under 1 s, while networking operations are more likely to take longer under poor network conditions.

Maybe this can be a follow-up?

Contributor Author

Yep -- I have this bug for the follow-up: https://bugzilla.mozilla.org/show_bug.cgi?id=1626403

In the case of Python, I think we'll use a separate process (rather than a thread) so that networking can continue even if the main process finishes, but it's the same concept. My only hesitation on the follow-up is whether to do it before or after the Rust networking refactor lands.
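One hedged way to let uploads outlive the main process is to hand them to a separate child process; the sketch below uses `subprocess` purely for illustration (it is not Glean's implementation, and the child's "upload loop" is a stand-in):

```python
import subprocess
import sys

def spawn_uploader(pings_dir: str) -> subprocess.Popen:
    # Launch a separate Python process that can keep uploading pending
    # pings even after the parent finishes its own work. The inline
    # child script is a placeholder for the real upload loop.
    child_code = (
        "import sys\n"
        "pings_dir = sys.argv[1]\n"
        "# ... upload pending pings from pings_dir ...\n"
        "print('uploading from', pings_dir)\n"
    )
    return subprocess.Popen(
        [sys.executable, "-c", child_code, pings_dir],
        stdout=subprocess.PIPE,
        text=True,
    )
```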

Contributor

Will the work actually be lost? Or would Glean be smart enough to resend it the next time the command / application is invoked?

Contributor Author

Oh, yeah, of course. If networking is off, say, the ping file won't be deleted and a reattempt will be made next time. So pretty low risk.
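The retry behaviour described here follows from deleting a ping file only after a successful upload; anything still on disk gets another attempt on the next run. A minimal sketch of that invariant (illustrative names, not Glean's actual code):

```python
from pathlib import Path

def process_pending_pings(pings_dir: Path, upload) -> int:
    # Try to upload each pending ping file; delete it only on success.
    # Files whose upload fails stay on disk and are retried the next
    # time the application runs, so a shutdown timeout loses nothing
    # that was already written to disk.
    uploaded = 0
    for ping_file in sorted(pings_dir.glob("*")):
        try:
            if upload(ping_file.read_text()):
                ping_file.unlink()
                uploaded += 1
        except OSError:
            continue  # leave the file for a later attempt
    return uploaded
```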

Contributor

I'd suggest mentioning that, since the current wording is a little scary.

Contributor

In general, right now, I think there's a minor chance of losing a small fraction of data: since I/O and networking happen on the same dispatcher, if any work is queued after these lengthy operations and the timeout is hit, the work queued after them is lost, I think.

If B is a lengthy task on which we hit the 1s timeout, then the following can happen:

Queued tasks: A - B (timeout) - C (lost) - D (lost) - E (lost)
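This failure mode can be demonstrated with a toy single-queue dispatcher and a shutdown timeout (all names here are illustrative, not Glean's dispatcher API):

```python
import queue
import threading
import time

def run_with_shutdown_timeout(tasks, timeout: float):
    # All work shares one dispatcher queue. If a slow task is still
    # running when the shutdown timeout expires, everything queued
    # behind it is abandoned.
    q: "queue.Queue" = queue.Queue()
    done = []
    for name, duration in tasks:
        q.put((name, duration))

    def worker():
        while True:
            try:
                name, duration = q.get_nowait()
            except queue.Empty:
                break
            time.sleep(duration)
            done.append(name)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)  # wait at most `timeout` seconds, then give up
    return done
```

With a 0.5 s timeout and B taking 2 s, only A completes; C, D, and E never run.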

@mdboom mdboom requested review from Dexterp37 and wlach April 1, 2020 12:49


@mdboom mdboom force-pushed the python-shutdown-cleanly branch from e65860a to bedd50c Compare April 2, 2020 16:28

mdboom commented Apr 2, 2020

I went ahead and made uploading happen in a separate process. I think the pending Rust refactor on that won't impact this too much after all.

That way, I don't think we need the scary warning, since it's become a really unlikely corner case.

@mdboom mdboom requested review from Dexterp37 and wlach April 2, 2020 16:29
@mdboom mdboom force-pushed the python-shutdown-cleanly branch 2 times, most recently from 4db5aa7 to 15d704a Compare April 2, 2020 16:30

@wlach wlach left a comment


As I mentioned here, I'm worried that multiprocessing might have issues with the setup I'm using for mozregression on Windows. See this very long thread on stackoverflow, for example: https://stackoverflow.com/questions/24944558/pyinstaller-built-windows-exe-fails-with-multiprocessing

Can you make it optional?


mdboom commented Apr 2, 2020

> As I mentioned here, I'm worried that multiprocessing might have issues with the setup I'm using for mozregression on Windows. See this very long thread on stackoverflow, for example: https://stackoverflow.com/questions/24944558/pyinstaller-built-windows-exe-fails-with-multiprocessing
>
> Can you make it optional?

Gah. That's horrifying, but not surprising when you think it all through. It's even failing our Windows tests here (which aren't doing any PyInstaller magic).

Assuming I can figure out what's going on with the Windows unit tests, making it optional through an `allow_multiprocessing` config flag or something could work.

If not, it's easy enough for me to take this last commit out and merge this with the multithreading, without the multiprocessing. That's at least an 80% solution.

Once the refactor to move a lot of the upload processing to Rust is complete, it will be pretty easy to write a pure-Rust ping uploader, which might avoid some of these issues (assuming we can run a binary executable inside of a PyInstaller archive).
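The `allow_multiprocessing` flag discussed above could gate the choice of worker roughly like this (a sketch; the flag name comes from the discussion, everything else is made up for illustration):

```python
import multiprocessing
import threading

def start_uploader(upload_fn, allow_multiprocessing: bool = True):
    # When multiprocessing is allowed, run uploads in a child process so
    # they can outlive the parent's shutdown handling. In environments
    # where multiprocessing is unreliable (e.g. some PyInstaller-built
    # Windows executables), fall back to a plain thread.
    if allow_multiprocessing:
        worker = multiprocessing.Process(target=upload_fn)
    else:
        worker = threading.Thread(target=upload_fn)
    worker.start()
    return worker
```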

@mdboom mdboom requested a review from wlach April 2, 2020 19:54

mdboom commented Apr 2, 2020

Fixed the basic Windows failures, and added a configuration option to make the multiprocessing optional.

@mdboom mdboom force-pushed the python-shutdown-cleanly branch from 9c27b3a to adc3d6e Compare April 3, 2020 11:57

@wlach wlach left a comment


I have some concerns about mixing the test code with the implementation, but aside from that this is looking pretty good to me.

"""
Processes a single ping file.
"""
def _do_process_async(cls) -> "multiprocessing.Process":
Contributor

Is this the right name for the function? It seems like it's more about doing things in a separate process than doing things async.

Contributor Author

Maybe _process_pings_multiprocessing?

```python
        return _process(cls.storage_directory(), Glean._configuration)

    @classmethod
    def _test_process_sync(cls) -> bool:
```
Contributor

This feels like something that should live in a test or test fixture, rather than in the implementation. In this case I'd personally just create a small helper function inside the test file itself.

Contributor Author

This needs to be here because it needs to run when the global _testing_mode flag is set to True. _testing_mode is required even for Glean's users writing their own unit tests involving Glean, so it has to be in code shipped in the library.

Contributor

Ah, I see. Seems a little odd to me, but maybe there's no other way.

Co-Authored-By: William Lachance <wrlach@gmail.com>
@mdboom mdboom merged commit fd8f771 into mozilla:master Apr 6, 2020
@mdboom mdboom deleted the python-shutdown-cleanly branch April 14, 2020 19:09