Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CRASH BUG] Pop Shell adds override-redirect (OR) windows to the GNOME Shell Overview, leading to crashes #1251

Closed
Arcitec opened this issue Nov 24, 2021 · 15 comments · Fixed by #1255

Comments

@Arcitec
Copy link

Arcitec commented Nov 24, 2021

Related issues: #1233 and #1250

Partially related merge requests: #1249

The previous issue (1233) is too cluttered and doesn't describe the actual reason for the crashes.

This is the result of a few weeks of debugging this with the GNOME developers. They managed to figure out that Pop Shell is inadvertently adding override-redirect ("OR") windows to the GNOME Shell overview. Those are windows such as popup menus, and are not supposed to be in the overview. GNOME therefore doesn't track those windows and doesn't have any associated app object for them. Therefore crashing the Activities Overview with app is null when it attempts to retrieve app info about the OR windows that Pop Shell has added.

GNOME has created some patches for GNOME 41 and decided to work around it by making GNOME Shell always track all windows, including OR windows. Just to prevent crashes if extensions ever add such windows to the tracking. The result is that, if you run with GNOME Shell's latest patches (not merged yet, they're for GNOME 41 and will be backported), then you won't crash with Pop Shell anymore, but will see small 1x1 pixel windows floating in the overview instead.

It was also discovered that this only happens when the Pop Shell "Show Minimize to Tray Windows" option is ENABLED.

The theory is that when Pop Shell is adding "skip-taskbar" windows, it needs to also add a check to NOT add the window if it's an "OR window".

Here is the GNOME Shell patch where they added more robust protection against this issue (link leads to a comment which demonstrates the tiny "OR" window that Pop Shell has added to the overview):

Here is a crash reproducer. It needs to be run after a fresh login to the session, on GNOME 41, with Pop Shell enabled (without the changes from #1249 which hide the issue but doesn't solve the core problem). It creates an OR window which Pop Shell will add to the Overview, which will then lead to a crash if you run unpatched GNOME Shell, or a floating 1x1 pixel window if you run the patched GNOME 41:

Here is a very long discussion where this and other crashes were being investigated, but I have extracted all important information so you don't have to read it:

Here are the most relevant, technical quotes describing how Pop Shell is inadvertently adding OR windows, and how it's also showing duplicate dialogs that are already shown in their parent windows:

Sebastian Keller at GNOME: "I couldn't find a Pop Shell report about it including windows it probably does not want to include, such as the 1x1 pixel override redirect skip taskbar windows that get placed outside the screen by gtk after the first time one uses drag&drop. Not trying to show those will likely fix the most common source of this issue and avoid having weird seemingly random white pixels in the overview (if using the gnome-shell patch to keep the overview working with those windows). They are also showing dialogs that are already shown on their parent windows (try closing gedit with an unsaved file and open the overview)."

Sebastian Keller commenting about Pop Shell pull request #1249 : "That looks like it would fix the freeze by effectively disabling the option on >40 like is already done for 40. Unfortunately it does not yet address the problem of this option (on versions where it's still available) including windows it probably should not include. Not trying to show those windows in the overview would not only improve the experience on <40, but also allow enabling the option again on >=40 without a shell side workaround."

Florian Mullner at GNOME: "OR == override-redirect.
It's a term from the ICCCM and basically means that the window manager should leave the window alone (not add decorations, not assign a position, not allow resizing/repositioning it etc.).
The most common case are popup menus."

Florian Mullner: "I suspect that Pop Shell's inclusion of OR windows isn't intentional. The pop-shell extension allows skip-taskbar windows without excluding OR windows (which we don't handle explicitly, because they are always marked as skip-taskbar by mutter)."

Suggested fixes:

  1. Analyze window properties and skip everything marked as override-redirect. Then you can re-enable the "show skip-taskbar windows" option on GNOME 40+ without crashing.
  2. Try not to add things such as "child dialogs" to the overview. Pop Shell is currently adding the child dialogs independently to the overview, meaning when you open Text Editor, write something, try to close it so that it asks if you want to save, and then you open the Overview, you'll see a floating "Save/Abort" window separately from the Text Editor window. I don't know if that's an OR Window too, so perhaps that would also be fixed if issue 1 is fixed.
@Arcitec
Copy link
Author

Arcitec commented Nov 24, 2021

Images of both window tracking issues.

Issue 1: OR windows added to overview (the small window to the left is created by the break-my-pop-shell.py crash reproducer, which creates a small OR window):

1

Issue 2: Child dialogs added separately to overview, leading to a duplicate view of certain prompts:

2

@mmstick
Copy link
Member

mmstick commented Nov 24, 2021

Which child dialogs are you referring to? I'm not aware of creating or adding any windows to the overview anywhere. Pop Shell does not create any windows of its own. It's simply getting them from GNOME Shell.

@mmstick
Copy link
Member

mmstick commented Nov 24, 2021

Also, any help with knowing how a window is "override-redirect" would be great.

@Arcitec
Copy link
Author

Arcitec commented Nov 24, 2021

Hi Michael, thanks for taking a look at this!

Which child dialogs are you referring to? I'm not aware of creating or adding any windows to the overview anywhere. Pop Shell does not create any windows of its own. It's simply getting them from GNOME Shell.

The second screenshot above shows how the "Save/Exit" dialog from Text Editor is visible as a second window in the overview.

This was mentioned by GNOME devs as being caused by Pop Shell. It might be an OR window (I don't know), and if so it would only bug out on GNOME 41 without the Pop Shell version check fix (without #1249). If it is an OR window, you would also need to be using the patched GNOME Shell to see the issue without getting crashes.

Also, any help with knowing how a window is "override-redirect" would be great.

I agree. I don't know, so I asked the experts at GNOME here, hopefully they'll have a good suggestion since they know the code base inside and out:

https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/4751#note_1319153

@Arcitec
Copy link
Author

Arcitec commented Nov 25, 2021

@mmstick Alright, I asked them for methods that work on both GNOME 3.38 and 40+. Sebastian Keller has given a fantastically detailed answer about how to solve both issues (OR windows, and child windows such as save dialogs):

Detecting OR Windows:

  • MetaWindow::is_override_redirect()
  • When this is implemented, you may be able to remove some of the manual "Exclude window named xyz" code blocks from Pop Shell, since their OR windows would be excluded automatically. I think you can also remove the whole "if GNOME Shell >= 40, don't show skip-taskbar windows" stuff that fix: Freeze in activites view on GNOME 41 #1249 's code block is a workaround for.

Detecting child dialogs that shouldn't be separately added to overview:

  • "For the attached dialogs they might want to use MetaWindow::is_attached_dialog()..."
  • "...or something like MetaWindow::get_window_type() and check if that is Meta.WindowType.NORMAL to exclude stuff like docks, dialogs or menus in general, but maybe they still want to include non-OR dialogs that don't have a window that they are transient for." (Meaning: This 2nd method would get rid of everything that isn't a normal window. Including dialogs that aren't attached to any other windows. Which may be either desired or undesired, it depends on Pop Shell's goals.)
  • He went on to explain a third method: "The question which of the non-OR windows they want to include or exclude depends on how exactly they want this to work. They could also have a look at meta_window_recalc_features in the mutter source code to see how that decides which windows are automatically marked as skip_taskbar in mutter and thus hidden in the default shell overview."
  • Solving this would probably allow you to remove a lot of the manually programmed tiling exceptions (such as Desktop NG and Gjs being manually excluded from the overview).

I did not realize how much different the two issues were. The OR windows, which should never be added, are easy to exclude with that single API above. But the child dialogs have multiple solutions depending on what you want to achieve. Should I split that issue off into a separate ticket?

PS: Another GNOME developer mentioned that https://discourse.gnome.org/ is usually the best place to ask about interacting with the GNOME Shell programmatically. The relevant category for extensions is https://discourse.gnome.org/c/desktop/6 with a tag of "extensions", and I've seen that they give great help there, if there's ever any other difficult questions about obscure parts of the API. They also have a Matrix chat room at https://gnome.element.io/#/room/#extensions:gnome.org for developers.

@mmstick
Copy link
Member

mmstick commented Nov 25, 2021

Yeah I don't know what OR windows they are referring to. Pop Shell doesn't have any. So I guess it's just adding these types of windows to the ignore list.

@Arcitec
Copy link
Author

Arcitec commented Nov 25, 2021

@mmstick Yeah the OR windows don't come from inside Pop Shell itself. Instead, the problem is that anytime any running application creates an override-redirect window with a skip-taskbar flag, Pop Shell is adding it to the overview. That's what your "Show Minimize to Tray Window" option is doing (it's adding skip-taskbar windows).

The problem is that a good portion of those skip-taskbar windows are OR windows. One way to create one is with the crash reproducer script I linked to in the initial post. Another way is just to drag and drop a file anywhere which creates an 1x1 OR window offscreen.

Since such windows don't make sense to add to the overview, GNOME Shell wasn't tracking them and wasn't creating ShellApp (app) objects for them.

So the crash flow is: Pop Shell inadvertently adds the OR windows to the overview. The user goes to the overview. GNOME tries to retrieve the ShellApp window for the OR window that Pop Shell brought into the Overview window list. It gets null back from its own API since OR windows are not interesting to track and nothing existed for that window/app. Then, the Overview/Activities crashes with app is null.

As mentioned, they are patching this on the GNOME Shell side to always track all windows no matter what, just to avoid these kinds of crashes.

But Pop Shell needs to be revised a bit to not even add OR Windows at all, to avoid seeing those weird windows (the first embedded image above shows one example of an OR Window which becomes visible after GNOME Shell has been patched to not crash from the issue).

Pop Shell's fix would be to run MetaWindow::is_override_redirect() and never add those windows to the Overview.

Anyway, there's two issues here. OR windows need to be excluded. And child windows need to be excluded. I didn't realize that they would need separate fixes. Should I split the issues into two tickets? :)

@mmstick
Copy link
Member

mmstick commented Nov 25, 2021

If by child windows you're referring to dialogs / transient windows, these have been ignored since day one.

@Arcitec
Copy link
Author

Arcitec commented Nov 25, 2021

@mmstick Yeah I am referring to those. Here's a Fedora 35 (GNOME 41.1) live environment with zero extensions enabled:

without_pop

Then I built Pop Shell with the latest git MINUS the "4x" version check patch (merge request #1249) since that patch just covers up the actual issues by disabling skip-tb windows on GNOME 40+ (if the core problems are fixed, that whole "if 40+, disable the 'show skip-tb windows' feature" code block can safely be removed and the feature would work again). I then ran Pop Shell with default settings:

with_pop

As is seen in the image, Pop Shell is adding the Text Editor's child dialog to the overview. This is something that needs code review, and the GNOME devs provided a few different ways to identify these child dialogs and fully remove them from Pop Shell. I've now also confirmed that the "save dialog" is not an OR window, by the way.

But since this child dialog issue is a separate problem, and isn't related to the OR windows, I'm wondering if I should make a separate ticket for the child dialogs? Since the bugs turned out to be unrelated to each other, it would be best to avoid mixing these two issues? If so, I'll edit all posts above to remove the references to the problematic child windows and move that information to a new ticket.

@jmmaranan
Copy link
Contributor

jmmaranan commented Nov 25, 2021

Hi @Bananaman - if you look into the version 4x check on extension.ts, added a comment there why it is being skipped to be overriden when 4.x. https://github.com/pop-os/shell/blob/master/src/extension.ts#L2644-L2647. Gnome 4x, on WindowPreview.js_init() line 124, does not have a null check. This null check on upstream for 4x only (short term) can be done. Or as they said they are going to MR on 4x and hopefully backported (long term).

Since most users might not use the feature (some apps have minimize to tray which still shows on desktop but not on Overview, but would like that to be handled by tiling and so that's why this is included) and added a toggle to be sure there's a fallback.

Thanks for the research, so the underlying term is an OR (override redirect) for a Meta.Window with a skip_taskbar set to true :) but not in all cases it is an OR as one of the upstream devs explained. So pop-shell need to add the Meta.Window.is-override-redirect check and prevent gnome-4x users from seeing a 1x1 icon.

@jmmaranan
Copy link
Contributor

One way to test the dialogs does not come from the same issue is if disabling the Show Skip Taskbar option and see if it goes away. It looks like upstream also explained to check for attached dialogs. There's a whole lot to test on users using Android dev environment or Jetbrains apps.

@sbstnk
Copy link

sbstnk commented Nov 25, 2021

Gnome 4x, on WindowPreview.js_init() line 124, does not have a null check [...]

Adding random null checks without understanding why something could be null is not a good idea, so instead we looked into what would cause null to be returned there. And the reason we found are windows being added to the overview that the app/window tracker does not track. It should not need a null check if only tracked windows get put in the overview and the types of windows currently not tracked probably don't make much sense to be included in the overview (although that will be possible with the MR).

Currently without that MR this specifically means no OR windows and no windows that are not of these types, otherwise you will get null from the app tracker. Excluding those windows would allow this pop-shell option to work on >=40. In pop-shell with that option these windows are included because every window regardless of it being OR and its window type is getting included as long as it has the skip taskbar hint (and its WM_CLASS is not one of the two excluded ones), which does not seem intentional and much more than what this option is supposed to do given its name/description. So fixing this is entirely possible from the pop-shell side without waiting for a workaround in gnome-shell.

But that's only to fix the stuck overview. There are other types of windows that are tracked, have the skip taskbar hint, but don't make much sense to include in the overview as separate windows, like attached dialogs, which already get shown on the window they are attached to.

So pop-shell need to add the Meta.Window.is-override-redirect check and prevent gnome-4x users from seeing a 1x1 icon.

Seeing the 1x1 pixel gtk DND IPC windows will only happen once/if the MR is merged. Currently it will result in null being returned by the app tracker, because those are OR windows.

@jmmaranan
Copy link
Contributor

@sbstnk - thanks for replying and taking the time to explain the details.

Adding random null checks without understanding why something could be null is not a good idea, so instead we looked into what would cause null to be returned there. And the reason we found are windows being added to the overview that the app/window tracker does not track. It should not need a null check if only tracked windows get put in the overview and the types of windows currently

My bad and I agree with you, @Bananaman and Florian, I was actually trying to override it to add the null check like the getCaption call before from WindowPreview.init but given how massive the code there is today, it does not make sense to override them from pop-shell. Since it worked on 3.3x before the API changed in 4x and not much usage on 4x (so I assumed during that time) - the reason it was skipped for the time being. With all the technical information given, I think the if-check can now be removed.

I am working on a PR. Thank you as well for the break-my-shell script that was helpful.

@jmmaranan
Copy link
Contributor

jmmaranan commented Nov 26, 2021

@sbstnk - I would assume upstream still needed the MR for users other than pop-shell ext (some random user experiencing overview freezes even without pop-shell). When the 1 x 1 appears and given if System 76 still uses gnome 4x or users updated to 4x with that patch; will check back how it looks like and maybe an additional fix need to be done.

@Arcitec
Copy link
Author

Arcitec commented Nov 26, 2021

Hey @jmmaranan, thanks a lot for trying to solve these issues! I'm looking at your #1255 patches now and will comment there. :)

I would assume upstream still needed the MR for users other than pop-shell ext (some random user experiencing overview freezes even without pop-shell).

Yeah, there is a rare situation (nobody was able to find the exact condition) where vanilla GNOME without extensions can freeze itself because something other than NORMAL, DIALOG, UTILITY or MODAL_DIALOG was added to the overview. Removing the check for specific window types and tracking every window (except "OR") was the first patch of the GNOME Shell merge request.

Later, we found that extensions can also cause crashes, and narrowed it down to override-redirect (OR) windows and Pop Shell. So the choice was made to also track OR windows to protect against extension crashes. This is the 1x1 pixel windows appearing in overview with patched GNOME Shell and unpatched Pop Shell. The OR windows are most commonly used for popup menus and when people drag-and-drop files (which creates OR windows offscreen).

So, to sum it up, upstream will most likely merge the "track all windows except OR windows" to fix the vanilla GNOME crash. But may not merge the "track OR windows" patch since it's a bit dirty (I didn't understand the details of their discussion on the merge request, but it's something about it using "actors" when some other system would have been better).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants