Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8289208: Test DrawRotatedStringUsingRotatedFont.java occasionally crashes on MacOS #9362

Closed
wants to merge 2 commits into from

Conversation

mkartashev
Copy link
Member

@mkartashev mkartashev commented Jul 4, 2022

Java2D's Disposer maintains a record of objects to dispose of with the help of a collection that isn't thread safe. When DisposerRecords objects are being added to it at the same time as others are being disposed on the Toolkit thread, chaos ensues.

This commit replaces the collection with a thread-safe one, more consistently guards against exceptions in individual disposers, and adds exception's stacktraces printing in order to facilitate said exceptions' debugging, which are otherwise hard to pinpoint without code modification.

Originally, the bug was caught on MacOS running an existing test (DrawRotatedStringUsingRotatedFont) that would occasionally crash the VM (probably due to double-free detected by libc that abort()'s in this case). It may take many re-tries to reproduce and this wasn't observed on Linux. The new test (test/jdk/sun/java2d/Disposer/TestDisposerRace.java) displays the problem in a more reliable fashion and fails both on MacOS and Linux without this fix.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8289208: Test DrawRotatedStringUsingRotatedFont.java occasionally crashes on MacOS

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9362/head:pull/9362
$ git checkout pull/9362

Update a local copy of the PR:
$ git checkout pull/9362
$ git pull https://git.openjdk.org/jdk pull/9362/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9362

View PR using the GUI difftool:
$ git pr show -t 9362

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9362.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 4, 2022

👋 Welcome back mkartashev! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 4, 2022
@openjdk
Copy link

openjdk bot commented Jul 4, 2022

@mkartashev The following label will be automatically applied to this pull request:

  • client

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the client client-libs-dev@openjdk.org label Jul 4, 2022
@mlbridge
Copy link

mlbridge bot commented Jul 4, 2022

Webrevs

@prrace
Copy link
Contributor

prrace commented Jul 8, 2022

I must be missing something here, since you say you saw a crash and this fixes it.

Principally you seem to be making access to the deferredRecords list thread safe.
But first I'm not sure how where the MT problem is.
The code that disposes on the rendering thread is invoked from a dispose() method
that was on the Disposer thread. It then waits for that to finish.
At that time the Disposer thread is blocked so not doing anything and the render
thread is free to add elements to deferredRecords.
When it returns the deferredRecords are cleared on the Disposer thread and the
rendering thread can't be adding anything to it.

Second if it is just about MT access to deferredRecords and pulling elements off
that queue then why aren't we getting a Java exception about concurrent modification
instead of a crash.

And if we do somehow have two threads that end up trying to free the same data

  • ie both executing the dispose() method of an instance, since they are all
    (or should) be synchronized AND are supposed to check if they've already been
    freed .. why isn't that working ?

So what am I missing ?

BTW when we run this test since it is a headless test we would never have accelerated
glyphs .. and the same is true of your new tests .. so I imagine in our test harness
and that of others too they'll all always pass.

@mkartashev
Copy link
Member Author

The code that disposes on the rendering thread is invoked from a dispose() method
that was on the Disposer thread. It then waits for that to finish.
At that time the Disposer thread is blocked so not doing anything and the render
thread is free to add elements to deferredRecords.

Correct. Now let's suppose that the rendering thread - unblocked at the time - is adding more records such that ArrayList needs to grow. Now ArrayList starts relocating its underlying storage array at the time as dispose() returns and we're ready for the next iteration of

for (int i=0;i<deferredRecords.size(); i++) {

loop. The next call to ArrayList.get() here

DisposerRecord rec = deferredRecords.get(i);

will be executed on the disposer thread simultaneously with the array relocation of deferredRecords. So some of the get() calls will return null as they read from initialized, but yet-to-be-relocated chunks, and some will read what essentially is garbage.

Second if it is just about MT access to deferredRecords and pulling elements off
that queue then why aren't we getting a Java exception about concurrent modification
instead of a crash.

The race was between ArrayList.get() called from clearDeferredRecords() on a dedicated disposer thread and ArrayList.add() called from pollRemove() on a different thread. AFAIK there are no safeguards against concurrent modification between those two methods (I check the implementation).

And if we do somehow have two threads that end up trying to free the same data ie both executing the dispose() method of an instance

I didn't imply that dispose() was called twice on the same instance. I said "probably due to double-free" (or "bad" free in the bug description) simply because these are the checks that are usually implemented in the heap alloc/dealloc routines, so "double-free" is likely with the second free called from one of the dispose() methods; I have no idea who called the first free() and when.

BTW when we run this test since it is a headless test we would never have accelerated
glyphs .. and the same is true of your new tests .. so I imagine in our test harness
and that of others too they'll all always pass.

Are you implying that test/jdk/java/awt/Graphics2D/DrawString/DisposerTest.java is useless in the headless mode? Then I can mark it as headful. I can also simply drop it.

Be that as it may, test/jdk/sun/java2d/Disposer/TestDisposerRace.java doesn't depend on any acceleration and is as removed from any 2D code as it can be. And it also shows the problem on Linux, making the one mentioned above somewhat superfluous.

@mrserb
Copy link
Member

mrserb commented Jul 13, 2022

will be executed on the disposer thread simultaneously with the array relocation of deferredRecords. So some of the get() calls will return null as they read from initialized, but yet-to-be-relocated chunks, and some will read what essentially is garbage.

Don't we have a similar issue in the usage of records and queue? Is it possible that the target object in the add method will be deallocated before ref will be added to the records? In that case, the next code in run method will fail:

    Reference<?> obj = queue.remove();
    obj.clear();
    DisposerRecord rec = records.remove(obj);
    rec.dispose();

Do we need Reference.reachabilityFence at the end of the add or some kind of synchronization?

Note that pollRemove tries to solve that:

    DisposerRecord rec = records.remove(obj);
    if (rec == null) { // shouldn't happen, but just in case.
         continue;
    }

But for sure synchronization should solve that in some better way.

@mkartashev
Copy link
Member Author

Don't we have a similar issue in the usage of records and queue? Is it possible that the target object in the add method will be deallocated before ref will be added to the records?

Yes, I believe it can be deallocated before the reference to it is added to records.

In that case, the next code in run method will fail:

    Reference<?> obj = queue.remove();
    obj.clear();
    DisposerRecord rec = records.remove(obj);
    rec.dispose();

Of this I am not so sure. When target object is deallocated in the middle of add(), then ref.get() (Reference.get()) returns null, but the above code doesn't do that. Neither does the add() method; records are indexed by the Reference object, not - by now deallocated - referent of that reference:

java.lang.ref.Reference<Object> ref;
...
records.put(ref, rec);

@prrace
Copy link
Contributor

prrace commented Jul 22, 2022

Correct. Now let's suppose that the rendering thread - unblocked at the time - is adding more records such that
ArrayList needs to grow. Now ArrayList starts relocating its underlying storage array at the time as dispose() returns

First, the only way that list grows is here :
share/classes/sun/font/StrikeCache.java
static void disposeStrike(final FontStrikeDisposer disposer) {
..

rq.flushAndInvokeNow(new Runnable() {
public void run() {
doDispose(disposer);
Disposer.pollRemove();
}
});
..
}

where Disposer.pollRemove() is adding to the ArrayList deferredRecords

This disposeStrike method() is called from the dispose() method in
share/classes/sun/font/FontStrikeDisposer.java

So the ArrayList must have FINISHED "relocating its underlying storage array" BEFORE dispose() returns.

So all the following text you have implying concurrent adds / removes etc etc doesn't make sense to me.

What I can imagine is that the storage change may not yet be visible to the Disposer thread,
or becomes visible as it is in the process of clearing the deferred records.

So the change to use the ConcurrentLinkedDeque would - if it does what I suppose - fix that.
it would ensure all the changes made on the render thread are immediately visible to the dispose thread.
So let's take that change since it can only make things safer

However I don't think you've explained how it leads to the crash.
Your comment " "double-free" is likely with the second free called from one of the dispose() methods"
and the surrounding text again doesn't add up to me. I read you as saying there are two dispose() methods
but there shouldn't be two operating on the same data and the Strike one is synchronized and ensures
it only frees once ..

My guess - and the stack trace in the bug perhaps bears this out since it implicates BufImgSurfaceData

  • nothing to do with fonts .. is that this is a non-synchronized dispose() method
    allocated from here
    0x0000000107afc4e0(0x00007ff636c574a0)
    java.lang.Exception
    at java.desktop/sun.java2d.DefaultDisposerRecord.(DefaultDisposerRecord.java:55)
    at java.desktop/sun.java2d.Disposer.addRecord(Disposer.java:107)
    at java.desktop/sun.awt.image.BufImgSurfaceData.initRaster(Native Method)

And DrawRotatedStringUsingRotatedFont.java does cycle through a lot of BufferedImage instances too ..

Are you implying that test/jdk/java/awt/Graphics2D/DrawString/DisposerTest.java is useless in the headless mode?
Then I can mark it as headful.

It is useless to test this in headless mode since there is no RenderQueue .. and so this whole scenario never happens
I think it sufficient to make the EXISTING test headful .. the harness will report a failure automatically if there's a crash

Both new tests don't seem needed and the 2nd one doesn't seem to be testing the issue at all ..
as I wrote above we do not run these concurrently and your test does nothing to block the disposer thread.

@mkartashev
Copy link
Member Author

(as a side note: many thanks for such an in-depth review).

What I can imagine is that the storage change may not yet be visible to the Disposer thread, or becomes visible as it is in the process of clearing the deferred records.

In the specific macOS case I'm sure that's the right explanation for the crash. But Disposer is a public class that is being widely used throughout JDK and as far as I can see nothing in its contract prevents it from being used in a way I described and illustrated with the second test. Perhaps I was unnecessarily generic in my description.

It is useless to test this in headless mode since there is no RenderQueue .. and so this whole scenario never happens I think it sufficient to make the EXISTING test headful .. the harness will report a failure automatically if there's a crash

Let me drop that test and make the existing one headful.

Both new tests don't seem needed and the 2nd one doesn't seem to be testing the issue at all .. as I wrote above we do not run these concurrently and your test does nothing to block the disposer thread.

On the issue of the second test, I still believe it is useful:

  • it demonstrates a legitimate problem with Disposer, albeit admittedly a different one than reported in the bug,
  • it shows the problem more clearly as the problematic usage pattern of Disposer is contained in the test itself and not buried in JDK code,
  • it crashes more reliably,
  • it shows that the problem (at least potentially) exists on Linux also,
  • it does not have to be marked headful.

@prrace
Copy link
Contributor

prrace commented Aug 1, 2022

On the issue of the second test, I still believe it is useful:

  • it demonstrates a legitimate problem with Disposer, albeit admittedly a different one than reported in the bug,
  • it shows the problem more clearly as the problematic usage pattern of Disposer is contained in the test itself and not buried in JDK code,
  • it crashes more reliably,
  • it shows that the problem (at least potentially) exists on Linux also,
  • it does not have to be marked headful.

I think it might help us ensure that such a problem will not actually creep in, but I think it
is testing a scenario that is not the cause of the crash etc. However I suppose in the end
all our passing tests are testing things that aren't a problem right now ! .. so OK ..

@openjdk
Copy link

openjdk bot commented Aug 1, 2022

@mkartashev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8289208: Test DrawRotatedStringUsingRotatedFont.java occasionally crashes on MacOS

Reviewed-by: prr, avu

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 590 new commits pushed to the master branch:

  • 0c40128: 7194212: NPE in Flow.visitIdent
  • 6eb7c3a: 8290732: JNI DestroyJavaVM can start shutdown when one non-daemon thread remains
  • 083e014: 8292233: Increase symtab hash table size
  • 45e5b31: 8292244: Remove unnecessary include directories
  • 9bfffa0: 8291945: Add OSInfo API for static OS information
  • bd58553: 8290833: Remove ConstantPoolCache::walk_entries_for_initialization()
  • 755ecf6: 8292153: x86: Represent Registers as values
  • dedc05c: 8291640: java/beans/XMLDecoder/8028054/Task.java should use the 3-arg Class.forName
  • 3d20a8b: 8291959: FileFontStrike#initNative does not properly initialize IG Table on Windows
  • a28ab7b: 8288568: Reduce runtime of java.security microbenchmarks
  • ... and 580 more: https://git.openjdk.org/jdk/compare/7905788e969727c81eea4397f0d9b918cdb5286a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@prrace, @avu) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added ready Pull request is ready to be integrated sponsor Pull request is ready to be sponsored labels Aug 1, 2022
@openjdk
Copy link

openjdk bot commented Aug 2, 2022

@mkartashev
Your change (at version eb393c6) is now ready to be sponsored by a Committer.

@prrace
Copy link
Contributor

prrace commented Aug 4, 2022

You really should wait for the 2nd review before integrating

@mkartashev
Copy link
Member Author

@mrserb Are you OK with the current state of the PR?

@prrace
Copy link
Contributor

prrace commented Aug 11, 2022

maybe @avu can be the 2nd reviewer ?

@avu
Copy link
Contributor

avu commented Aug 12, 2022

Yes, I've looked through the code. It looks good to me.

@avu
Copy link
Contributor

avu commented Aug 12, 2022

/approve

@openjdk
Copy link

openjdk bot commented Aug 12, 2022

@avu Unknown command approve - for a list of valid commands use /help.

@mkartashev
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Aug 12, 2022

@mkartashev
Your change (at version eb393c6) is now ready to be sponsored by a Committer.

@avu
Copy link
Contributor

avu commented Aug 12, 2022

/sponsor

@openjdk
Copy link

openjdk bot commented Aug 12, 2022

Going to push as commit 00decca.
Since your change was applied there have been 592 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 12, 2022
@openjdk openjdk bot closed this Aug 12, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Aug 12, 2022
@openjdk
Copy link

openjdk bot commented Aug 12, 2022

@avu @mkartashev Pushed as commit 00decca.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dcubed-ojdk
Copy link
Member

This fix appears to be causing test failures in Tier3:

JDK-8292304
sun/java2d/Disposer/TestDisposerRace.java fails after JDK-8289208

@prrace
Copy link
Contributor

prrace commented Aug 12, 2022

This is the exact new test added by the fix so that shouldn't happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
client client-libs-dev@openjdk.org integrated Pull request has been integrated
5 participants