Skip to content

8255248: NullPointerException in JFXPanel due to race condition in HostContainer#1178

Closed
prsadhuk wants to merge 9 commits intoopenjdk:masterfrom
prsadhuk:JDK-8255248
Closed

8255248: NullPointerException in JFXPanel due to race condition in HostContainer#1178
prsadhuk wants to merge 9 commits intoopenjdk:masterfrom
prsadhuk:JDK-8255248

Conversation

@prsadhuk
Copy link
Collaborator

@prsadhuk prsadhuk commented Jul 18, 2023

Due to transient datatype of scenePeer, it can become null which can result in NPE in scenarios where scene is continuously been reset and set, which warrants a null check, as is done in other places for the same variable.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8255248: NullPointerException in JFXPanel due to race condition in HostContainer (Bug - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jfx.git pull/1178/head:pull/1178
$ git checkout pull/1178

Update a local copy of the PR:
$ git checkout pull/1178
$ git pull https://git.openjdk.org/jfx.git pull/1178/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 1178

View PR using the GUI difftool:
$ git pr show -t 1178

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jfx/pull/1178.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 18, 2023

👋 Welcome back psadhukhan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Ready for review label Jul 18, 2023
@mlbridge
Copy link

mlbridge bot commented Jul 18, 2023

Webrevs

@aghaisas
Copy link
Collaborator

This avoids the NPE.

Overall, I think that we should choose between below options-

  • Option 1 : A null check should be added to every usage of scenePeer and stagePeer as close to the usage as possible.
  • Option 2: The usage of scenePeer and stagePeer should be synchronized.

I am not well versed with this area of the code and hence cannot comment on ramifications of Option 2.

Also, can we add a test? (either automated or manual) - I see that there is a sample in JBS that demonstrates this issue.

@mstr2
Copy link
Collaborator

mstr2 commented Jul 18, 2023

The proposed patch does not appear to fix the issue. Here's how the code currently looks like (after the patch is applied):

786    @Override
787    protected void paintComponent(Graphics g) {
788        if (scenePeer == null) {
789            return;
790        }
791        if (pixelsIm == null) {
792            createResizePixelBuffer(scaleFactorX, scaleFactorY);
793            if (pixelsIm == null) {
794                return;
795            }
796        }
797        DataBufferInt dataBuf = (DataBufferInt)pixelsIm.getRaster().getDataBuffer();
798        int[] pixelsData = dataBuf.getData();
799        IntBuffer buf = IntBuffer.wrap(pixelsData);
800        if (scenePeer != null && !scenePeer.getPixels(buf, pWidth, pHeight)) {
801            // In this case we just render what we have so far in the buffer.
802        }

Note that in L788, we return early if scenePeer is null. The fact that scenePeer can sometimes be null in L800 means that the field is changed on a different thread. Since that's the case, no amount of null checking will ever be good enough: scenePeer can just as easily be null after the check succeeds in L800.

Similiar issues can be observed in other places, for example later in the same file:

    private class HostContainer implements HostInterface {
        @Override
        public void setEmbeddedScene(EmbeddedSceneInterface embeddedScene) {
            ...
            scenePeer = embeddedScene;
            ....
            invokeOnClientEDT(() -> {
                ....
                if (scenePeer != null) {
                    scenePeer.setDragStartListener(dnd.getDragStartListener());
                }
            });
        }
    }

Note that scenePeer is assigned on the thread that calls setEmbeddedScene, and then its value is read from the EDT.

Since it appears that JFXPanel is inherently multi-threaded, every usage of an unprotected shared field is potentially faulty. At the very least, every field that can be written to or read from multiple threads must be protected by a lock. Note that is is not enough to just protect all writes; all reads have to be protected as well.

In addition to that, we need to account for concurrent code execution. What happens if thread A runs code on scenePeer, while thread B has already created a new scene peer? Is this safe? If not, then we need to prevent such concurrent execution.

If it is indeed safe to concurrently run code on different instances of a scene peer, a possible approach could be to read the value of the shared field into a local variable, and perform the null checking on the local variable:

    EmbeddedSceneInterface scenePeer = null;
    synchronized (this) {
        scenePeer = this.scenePeer;
    }

    if (scenePeer != null) {
        ...
    }

@kevinrushforth
Copy link
Member

/reviewers 2

@openjdk
Copy link

openjdk bot commented Jul 18, 2023

@kevinrushforth
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@mstr2
Copy link
Collaborator

mstr2 commented Jul 21, 2023

After looking at the class for a bit longer and seeing how its state is read and mutated on different threads, it might be a better approach to enforce serialized code execution in all cases. This might be done as follows:

  1. Remove all volatile modifiers from the fields of the class. (Also, make disableCount an int field.)
  2. Mark all non-private methods which are entry points into the class with the synchronized modifier. Private methods should not be marked with synchronized.
  3. Wrap code in all other entry points (lambdas, anonymous implementations) in a synchronized block that synchronizes on the JFXPanel instance.

If done correctly, no method of the class will ever run concurrently with another method of the class, and the memory effects of synchronized will ensure that we can always see the latest field values. What do you think about this approach?

@andy-goryachev-oracle
Copy link
Contributor

If done correctly,

I think it makes sense, especially since there is at least one place where the whole method might need to be synchronized (JXFPanel:1090)

The key word is if done correctly. One thing to look for is the possibility of introducing a deadlock and regression, so we may need to analyze the callers and perform more extensive testing.

lStagePeer = stagePeer;
}
lStagePeer = embeddedStage;
if (lStagePeer == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is something wrong with this code:
why we need a locked assignment on line 1044 that immediately gets overwritten on line 1046?

}
lScenePeer.setPixelScaleFactors((float) scaleFactorX, (float) scaleFactorY);
synchronized(LOCK) {
scenePeer = lScenePeer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the whole method should be synchronized?
the logic around scenePeer is asking to be atomic, isn't it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change as-is doesn't work correctly; if embeddedScene is null, scenePeer is now unchanged.

}
});
SwingUtilities.invokeAndWait(JFXPanelNPETest::createUI);
for (int i = 0; i < 300; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 300 sufficient?

for (int i = 0; i < 300; i++) {
SwingUtilities.invokeLater(contentPane::repaint);
Platform.runLater(() -> contentPane.setScene(null));
Thread.sleep(100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend replacing a fixed number with a random value.

(Typically, when a random is added to the test, it would make sense to print its seed to stdout, for the sake of reproducing a possible failure later; but in this case it might be unnecessary because we are dealing with multiple threads and that adds its own degree of randomness, completely eliminating reproducibility. Or, perhaps, we still need to print the seed to maximize the probability of reproducing the failed scenario.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite downvote, I still think randomness in the unit test is a good thing.

For more background, please take a look at the second answer in
https://stackoverflow.com/questions/32458/random-data-in-unit-tests

@kevinrushforth
Copy link
Member

The key word is if done correctly. One thing to look for is the possibility of introducing a deadlock and regression, so we may need to analyze the callers and perform more extensive testing.

That is exactly my concern. It would need a lot more very careful analysis before making this synchronous.

@mstr2
Copy link
Collaborator

mstr2 commented Jul 21, 2023

That is exactly my concern. It would need a lot more very careful analysis before making this synchronous.

It seems to me that this whole class is asking for more careful analysis, considering that there seem to be several concurrency bugs in there. I don't quite like the idea of a system test "proving" that there's no race condition. That should be done through rigorous analysis instead. The first (and incorrect) proposed fix shows that the system test would have failed in demonstrating that the defects are no longer there.

Copy link
Collaborator

@hjohn hjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not entirely clear to me which thread is calling which methods in this class, and it would help to clearly mark which part of the class can be called from the FX thread and which parts from the AWT thread.

I also question how some of the other fields are used. Many are marked volatile which is insufficient if you don't want to observe partial changes of x/y coordinates for example.

I agree with @mstr2 that a more careful analysis couldn't hurt. There seem to be only 2 threads involved, the AWT and FX Thread. The fields they are sharing should be grouped and marked with a comment that they're shared. Any access to these fields should then be only while holding a lock.

}
lScenePeer.setPixelScaleFactors((float) scaleFactorX, (float) scaleFactorY);
synchronized(LOCK) {
scenePeer = lScenePeer;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change as-is doesn't work correctly; if embeddedScene is null, scenePeer is now unchanged.

@prsadhuk
Copy link
Collaborator Author

The key word is if done correctly. One thing to look for is the possibility of introducing a deadlock and regression, so we may need to analyze the callers and perform more extensive testing.

That is exactly my concern. It would need a lot more very careful analysis before making this synchronous.

I guess if we have to make all public methods in JFXPanel synchronous, then the same might be called for SwingNode too, ..

@openjdk openjdk bot removed the rfr Ready for review label Jul 25, 2023
@openjdk openjdk bot added the rfr Ready for review label Jul 25, 2023
@kevinrushforth kevinrushforth self-requested a review July 29, 2023 14:48
@openjdk
Copy link

openjdk bot commented Aug 21, 2023

@prsadhuk this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8255248
git fetch https://git.openjdk.org/jfx.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Aug 21, 2023
@openjdk openjdk bot removed merge-conflict Pull request has merge conflict with target branch rfr Ready for review labels Aug 30, 2023
@openjdk openjdk bot added the rfr Ready for review label Aug 30, 2023
Copy link
Contributor

@andy-goryachev-oracle andy-goryachev-oracle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good - the new LOCK'ing logic appears I have some minor comments

synchronized(LOCK) {
return pHeight;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we could add setCompSize(int, int) to set both pHeight and pWidth at the same time, in order to avoid multiple LOCK'ed blocks down below (L638, L1068)?

synchronized(LOCK) {
return scaleFactorX;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same idea, possibly add setScaleFactor(double, double)?

int lHeight = getCompHeight();
double lScaleFactorX = getScaleFactorX();
double lScaleFactorY = getScaleFactorY();
if (lScenePeer == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: move check on L885 right after L880, no need to lock 4 times if lScenePeer is null

scenePeer.setDragStartListener(dnd.getDragStartListener());
}
SwingDnD lDnD = getDnD();
lDnD = new SwingDnD(JFXPanel.this, embeddedScene);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a need for getting getDnd() on L1146 and immediately overwriting it on L1147?
should it be
SwingDnD lDnD = new SwingDnD(JFXPanel.this, embeddedScene);
?

import org.junit.AfterClass;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.Test;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be using junit5?

for (int i = 0; i < 300; i++) {
SwingUtilities.invokeLater(contentPane::repaint);
Platform.runLater(() -> contentPane.setScene(null));
Thread.sleep(100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite downvote, I still think randomness in the unit test is a good thing.

For more background, please take a look at the second answer in
https://stackoverflow.com/questions/32458/random-data-in-unit-tests

Copy link
Contributor

@andy-goryachev-oracle andy-goryachev-oracle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good - the new LOCK'ing logic seems well designed.
added some minor suggestions - I hope my comments are still there after I accidentally clicked on 'Cancel Review'...


private final Object LOCK = new Object();

// Accessed on EDT only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment valid anymore?

perhaps we should mention the fact that access to these fields should be LOCK'ed instead?

Copy link
Collaborator

@hjohn hjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes as they are now are IMHO entirely inadequate. Simply surrounding all accesses to shared fields with a narrowly scoped synchronized block is not a magic solution. You will want the fields to be coherent for a longer period than that of a single field access. For example:

    int lWidth = getCompWidth();
    int lHeight = getCompHeight();

The new methods getCompWidth and getCompHeight are synchronized, but the overall code here is not. This means that if the control resizes from 1000 x 1 to 1 x 1000 you may end up with a lWidth and lWeight of 1000 x 1000 or 1 x 1.

IMHO the steps to resolve this to properly synchronized code is:

  1. Find all external entry points (all non-private methods, and any private methods used by a listener callback)
  2. For the entry points found above, mark which are called by FX and which by AWT
  3. For the AWT methods note down all fields it accesses, including any in any methods it is calling
  4. For the FX methods note down all fields it accesses, including any in any methods it is calling
  5. Find the fields that are accessed in both the AWT and FX lists
  6. Mark any methods that access those fields as synchronized -- this may be too course grained if these methods are large, do I/O, call back into other systems -- in that case you may need to move some code around to do all the code that needs synchronization first or last and use a synchronized block.

edit: further clarifications

@andy-goryachev-oracle
Copy link
Contributor

You do have a valid point, @hjohn .
Perhaps an easier approach would be to separate fields into awt and fx ones (perhaps renaming the fields to have awt- and fx- prefix?), and use invokeLater/runLater for inter-thread communications?

This way no locking/synchronization is required, we have no shared fields, and everything happens in the right thread (at the expense of some memory allocation).

What do you guys think?

@mstr2
Copy link
Collaborator

mstr2 commented Aug 31, 2023

I agree with @hjohn and will add one more comment: even if all local effects are taken into account, the code in this class might be racing against entirely unrelated code in the JavaFX framework. These races can't be reasonably addressed with local synchronization.

Coordinating multiple threads with potentially unknown side effects requires a bigger analysis effort, and the changes here do not inspire confidence that this has happened. I see no reason to replace the existing defective implementation with another potentially defective implementation.

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 29, 2023

@prsadhuk This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 27, 2023

@prsadhuk This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rfr Ready for review

Development

Successfully merging this pull request may close these issues.

7 participants