Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maint: Fix sporadic QtDims garbage collection failures by converting some stray references to weakrefs #5471

Merged
merged 2 commits into from
Jan 18, 2023

Conversation

Carreau
Copy link
Contributor

@Carreau Carreau commented Jan 12, 2023

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:

  1. changes to make this crash more reliably, by removing unused
    reference to the thread, and using weakrefs to make sure we free up
    things faster. Freeing things faster mean we crash more reliably.
  2. actually fix the cleanup to be done after thread termination and not
    worker termination. Due to 1) we are more confident this fixes
    things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

  • worker stop
    • delete last thread reference
      (if GC, crash)
  • Thread stops.

Now we have

  • worker stop
  • Thread stops.
    • delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

@github-actions github-actions bot added the qt Relates to qt label Jan 12, 2023
@Carreau Carreau marked this pull request as ready for review January 12, 2023 13:15
@Carreau
Copy link
Contributor Author

Carreau commented Jan 12, 2023

I'm not totally sure this is correct, like does the thread always stop once the worker stops ?
I think that having to deal with both the thread and worker is likely confusing, and we may want to have to deal with a single object (worker+thread), and make sure that calling stop on it, just stops everything.

@codecov
Copy link

codecov bot commented Jan 12, 2023

Codecov Report

Merging #5471 (ad8af6a) into main (3d7ed91) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #5471      +/-   ##
==========================================
- Coverage   89.31%   89.30%   -0.02%     
==========================================
  Files         600      600              
  Lines       50964    51031      +67     
==========================================
+ Hits        45518    45571      +53     
- Misses       5446     5460      +14     
Impacted Files Coverage Δ
napari/_qt/widgets/qt_dims_slider.py 94.21% <100.00%> (+0.01%) ⬆️
napari/components/experimental/chunk/_pool.py 85.71% <0.00%> (-7.94%) ⬇️
napari/_qt/widgets/qt_viewer_status_bar.py 81.57% <0.00%> (-1.32%) ⬇️
napari/utils/_testsupport.py 62.66% <0.00%> (-1.18%) ⬇️
napari/conftest.py 87.74% <0.00%> (-1.14%) ⬇️
napari/components/_layer_slicer.py 96.59% <0.00%> (-0.91%) ⬇️
napari/components/_tests/test_layer_slicer.py 97.98% <0.00%> (-0.35%) ⬇️
napari/layers/points/_slice.py 100.00% <0.00%> (ø)
napari/layers/shapes/_tests/test_shapes.py 100.00% <0.00%> (ø)
napari/layers/points/points.py 98.83% <0.00%> (+<0.01%) ⬆️
... and 11 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@jni
Copy link
Member

jni commented Jan 12, 2023

I don't fully understand the issues very well here, but (based on the conversation in #5443, nice work! I'll let @Czaki review as he is our local thread guru. 😅 Thank you both! 🙏

@jni
Copy link
Member

jni commented Jan 12, 2023

Oh, btw, before merge, could we rename the PR to something more concrete, e.g., "Fix sporadic QtDims garbage collection failures by converting some stray references to weakrefs"? And try to capture some of the broader narrative in the commit message — @andy-sweet pointed out that longer commit messages (e.g. copying the whole description into the commit message) make it much nicer to investigate issues later, sometimes saving a few round trips between the terminal and the browser.

@andy-sweet
Copy link
Member

andy-sweet commented Jan 12, 2023

I'm not totally sure this is correct, like does the thread always stop once the worker stops ?

See superqt's new_worker_qthread, which is what AnimationWorker uses to create the worker and thread. Based on the superqt implementation, the thread object will emit its finished signal after the worker emits its finished signal, and will do so the main event loop (because of the behavior of Qt::AutoConnection).

So the answer to your question is yes, but it might take a little time depending on how busy the main event loop is at the time.

def __init__(self, dims, axis, reverse=False, fps=10, mode=LoopMode.LOOP):
def __init__(
self, qt_dims, axis, reverse=False, fps=10, mode=LoopMode.LOOP
):
super().__init__()
Copy link
Member

@andy-sweet andy-sweet Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this doesn't have qt_dims as a parent. And I wonder if this might be an alternative fix to holding a weak reference to QtDims.

If I understand the QObject lifetime model correctly, I think that if the child object is still alive, then its parent must also be (i.e. the parent effectively owns the child and is controls its lifetime/destruction). So it would not be possible for _on_click to be executed with an invalid reference to QtDims.

More generally, I wonder how many of our object lifetime issues are caused by relying on Python references to Qt objects, rather than relying on the QObject hierarchy. Though that discussion is far beyond the scope of this PR :)

Copy link
Member

@andy-sweet andy-sweet Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess when we add the QtPlayButton to the layout, then reparenting will occur. So this doesn't seem like a viable alternative fix. Though I still can't explain why we need the weak reference (i.e. why would self.qt_dims be something we can't dereference in self._on_click, since it should be the grandparent of this widget).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's magic for me as well, but usually I found that weakreferences, even if not strictly necessary seem to help to have at least consistant crashes, or make it clearer when doing refactor who is the owner, and simplify reference graph when using objgraph.

Something I haven't done there is also use wekref's callbacks to disconnect things but I don't think it's strictly necessary here.

I tend to prefer to know reliably I'm lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't read your PR description properly, but after reading this message and thinking a bit more, the weak reference is justifiable (regardless of any bugs it does or doesn't cause) because it breaks a cyclic reference between Python objects (specifically, QtDims and QtPlayButton with QtDimSliderWidget in the middle of the cycle). Might be good to add that as an explanatory implementation comment to the code to help the next lost soul.

In general, my guess is there's only so much pyqt/pyside can do when it comes to keeping Python and QObject lifetimes consistent/safe. Would be great if anyone knows of a resource that explains how to use those more safely (other than reading through the code itself).

Anyway, I'm pretty close to approving this. I just need to think through the dropped reference to QtDimSliderWidget.thread.

@Carreau
Copy link
Contributor Author

Carreau commented Jan 13, 2023

"Fix sporadic QtDims garbage collection failures by converting some stray references to weakrefs"?

I'm not actually sure it fixes things, and I'm not totally sure I understand what is happening. But Yes.

And try to capture some of the broader narrative in the commit message — @andy-sweet pointed out that longer commit messages (e.g. copying the whole description into the commit message) make it much nicer to investigate issues later, sometimes saving a few round trips between the terminal and the browser.

I've started to skip doing long messages for napari because of the squash/rebase, and relied on the PR description. So it's up to you, but the "squash" and "investigate issues later" IMHO contradict each others.

See superqt's new_worker_qthread, which is what AnimationWorker uses to create the worker and thread. Based on the superqt implementation, the thread object will emit its finished signal after the worker emits its finished signal, and will do so the main event loop (because of the behavior of Qt::AutoConnection).

So the answer to your question is yes, but it might take a little time depending on how busy the main event loop is at the time.

Yes, that was my understanding (and how I came to this fix), but me thread two makeing are and condition race understand very well not.

…some stray references to weakrefs

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference
@Carreau
Copy link
Contributor Author

Carreau commented Jan 13, 2023

I remove the unrelated typing changes and rephrased the commit.

@jni
Copy link
Member

jni commented Jan 13, 2023

I've started to skip doing long messages for napari because of the squash/rebase, and relied on the PR description. So it's up to you, but the "squash" and "investigate issues later" IMHO contradict each others.

That's what I mean: the PR title is what by default gets put into the (final, squashed) commit message, so it should be properly descriptive.

The description does not get copied by default into the squashed commit but we should copy it manually on merging the PR. Sorry that I wasn't more explicit here: when I said, "capture some of the broader narrative into the commit message", I meant "the PR squash-rebase commit message", and by "we" I meant, whoever pushes the merge button. Ironically @Carreau it was one of your commits that gave rise to this discussion — it had a wonderful long description, and it could be seen in the final squashed history because it was the only commit in that PR. In that case, GitHub copies the full commit message into the squashed commit message.

@Carreau
Copy link
Contributor Author

Carreau commented Jan 13, 2023

Yeah, imho this is a GH ui issue, I think that each GH PR should have a shared editable area that will reflect the final merge/squash message. Maybe in the diff view ? So that this information is also part of the Review Process.

@andy-sweet
Copy link
Member

andy-sweet commented Jan 13, 2023

The description does not get copied by default into the squashed commit but we should copy it manually on merging the PR.

Looks like this might be configurable, so that the default squashed merge commit message is the PR title + description? https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/configuring-commit-squashing-for-pull-requests

That way the shared UI in GitHub is the PR title and description, and we don't have to rely on the merger to remember to copy the text from there (which I've been doing occasionally). They may still have to remind the author to keep the title/description up-to-date though, as here.

Unfortunately, I don't have access to napari/napari settings - I'm assuming you do @jni?

@jni
Copy link
Member

jni commented Jan 17, 2023

That way the shared UI in GitHub is the PR title and description, and we don't have to rely on the merger to remember to copy the text from there (which I've been doing occasionally). They may still have to remind the author to keep the title/description up-to-date though, as here.

Life changing discovery @andy-sweet!!! ❤️‍🔥 I have changed it now to default to copying title and description. 🚀 Not quite as good as @Carreau's suggestion of having the commit message also being subject to review, but much much better than what we've had until now.

@Carreau title here is still lackluster... ;)

@Carreau Carreau changed the title Maint: Naming, weakref, and typing. Maint: Fix sporadic QtDims garbage collection failures by converting some stray references to weakrefs Jan 17, 2023
@Carreau
Copy link
Contributor Author

Carreau commented Jan 17, 2023

I've update the title and first message.

Copy link
Member

@andy-sweet andy-sweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping QtDimSliderWidget.thread seems fine as the only caller of that stores that, so I'm approving this. Thanks for the investigation and fix @Carreau!

With respect to #5433, this only fixes the aborting tests, right? And not the hanging ones?

@andy-sweet
Copy link
Member

I'll merge after 24 hours unless there are any other objections.

Copy link
Collaborator

@Czaki Czaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small style remarks. I think that I understand how it works, but I'm not 100% sure

napari/_qt/widgets/qt_dims_slider.py Outdated Show resolved Hide resolved
napari/_qt/widgets/qt_dims_slider.py Outdated Show resolved Hide resolved
Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
@Carreau
Copy link
Contributor Author

Carreau commented Jan 18, 2023

I think that I understand how it works, but I'm not 100% sure

We are in the same boat.

With respect to #5433, this only fixes the aborting tests, right? And not the hanging ones?

It's one problem with flaky test you can never be sure.

One thing I can suggest is to set the

job.<id>.<steps?>.timeout-minutes for each GH action item to at least get a quick failure.
Say current job time + 10% ? Possibly only the tox step to be conservative with what can hang ?

@Czaki
Copy link
Collaborator

Czaki commented Jan 18, 2023

Say current job time + 10% ? Possibly only the tox step to be conservative with what can hang ?

My observation is that sometimes, because of network/pypi problems 10 minutes buffer is required, but we still could use 40 minutes cap instead of six hours.

@Carreau
Copy link
Contributor Author

Carreau commented Jan 18, 2023

Opened task to do so as #5488

@andy-sweet
Copy link
Member

With respect to #5433, this only fixes the aborting tests, right? And not the hanging ones?

It's one problem with flaky test you can never be sure.

Sure. But this PR offers one explanation of the aborting test (by attempting to dereference a C++ object that has been deleted) and proposes a fix for it. By contrast, I don't see how it could explain a hanging test - would be happy if it does though, especially with an explanation!

@andy-sweet andy-sweet merged commit 69b834c into napari:main Jan 18, 2023
@Czaki Czaki mentioned this pull request Jun 7, 2023
@Czaki Czaki added this to the 0.4.18 milestone Jun 16, 2023
@Czaki Czaki added maintenance PR with maintance changes, triaged-0.4.18 To mark PR that is triaged in 0.4.18 release process labels Jun 16, 2023
Czaki added a commit that referenced this pull request Jun 16, 2023
…some stray references to weakrefs (#5471)

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
Czaki added a commit that referenced this pull request Jun 17, 2023
…some stray references to weakrefs (#5471)

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
Czaki added a commit that referenced this pull request Jun 18, 2023
…some stray references to weakrefs (#5471)

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
Czaki added a commit that referenced this pull request Jun 19, 2023
…some stray references to weakrefs (#5471)

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
Czaki added a commit that referenced this pull request Jun 21, 2023
…some stray references to weakrefs (#5471)

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
Czaki added a commit that referenced this pull request Jun 21, 2023
…some stray references to weakrefs (#5471)

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
Czaki added a commit that referenced this pull request Jun 21, 2023
…some stray references to weakrefs (#5471)

As far as I can tell this is due to issue with garbage collection issue
between the worker and thread. Mostly between the garbage collection and
the worker cleanup if the last reference to the thread is set to None
and collected while the thread is still active it causes a crash.

This has two side:
1) changes to make this crash more reliably, by removing unused
   reference to the thread, and using weakrefs to make sure we free up
   things faster. Freeing things faster mean we crash more reliably.
2) actually fix the cleanup to be done after thread termination and not
   worker termination. Due to 1) we are more confident this fixes
   things.

In particular, the animation thread, has a worker that refers to the
dims via a callback/slot.

We also remove the local reference to thread, to make sure there is not
a standing, unused reference to thread, thus self._animation_thread will
be the last standing reference.

That is to say we currently have:

 - worker stop
    - delete last thread reference
 (if GC, crash)
 - Thread stops.

Now we have
 - worker stop
 - Thread stops.
    - delete last thread (and worker) reference

This should fix some issues with thread/garbage collection/cycle (#5443)

Co-authored-by: Grzegorz Bokota <bokota+github@gmail.com>
@Carreau Carreau deleted the cnp branch October 2, 2023 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance PR with maintance changes, qt Relates to qt triaged-0.4.18 To mark PR that is triaged in 0.4.18 release process
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants