Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRaC Critical Section API #5

Closed
wants to merge 5 commits into from

Conversation

alexeybakhtin
Copy link
Contributor

@alexeybakhtin alexeybakhtin commented Nov 25, 2021

This is a proposal for the CRaC Critical Section interface and its usage in the SHA1PRNG SecureRandom implementation.
The changes in the SecureRandom implementation allow invalidating and reseeding SHA1PRNG secure random during checkpoint/restore. SHA1PRNG can be invalidated and reseeded in case of being created with a default embedded seed generator. Also, SHA1PRNG is used as an additional seed generator to the SUN NativePRNG implementation, so it is desirable to have reseeded SHA1PRNG after restore.


Progress

  • Change must not contain extraneous whitespace

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/crac pull/5/head:pull/5
$ git checkout pull/5

Update a local copy of the PR:
$ git checkout pull/5
$ git pull https://git.openjdk.java.net/crac pull/5/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5

View PR using the GUI difftool:
$ git pr show -t 5

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/crac/pull/5.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 25, 2021

👋 Welcome back abakhtin! A progress list of the required criteria for merging this PR into crac will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Nov 25, 2021

@alexeybakhtin This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

CRaC Critical Section API

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the crac branch:

  • 2dc0384: jcheck: Add census section

Please see this link for an up-to-date comparison between the source branch of this pull request and the crac branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@AntonKozlov) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 25, 2021
Copy link
Member

@AntonKozlov AntonKozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It may be useful to split changes for CheckpointLock and in SecureRandom to make discussion more focused.

throw exception;
if (0 < exception.getSuppressed().length) {
throw exception;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that the enterCriticalSectionLock will be released here? Then it looks possible for a thread waiting to enter checkpoint-critical section will allowed to enter and be checkpointed in the middle of the critical section, that we'd want to avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. I've moved enterCriticalSectionLock to the Core class to lock checkpoint and restore stages.
afterRestore() methods are not protected by enterCriticalSectionLock, so it is possible to use methods with critical sections inside afterRestore()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A small nit: now it seems like checkpointRestore0 and unpacking the bundle may throw CheckpointRestore, but they don't throw. I assume this is exactly move afterRestore out from the enterCriticalSectionLock synchronized section. But would it be wrong to hold the enterCriticalSectionLock and attempt to lock it again?

Actually, we have no synchronization of the complete checkpoint/restore procedure. We may impose that all beforeCheckpoint's, checkpointRestore0, and afterRestore's complete before another series of them from another checkpoint/restore attempt. So afterRestore will need no CheckpointLock (no checkpointRestore will be possible).

But latter will not help if we have some technical issues with recursive attempt to lock.

@alexeybakhtin
Copy link
Contributor Author

Hi @AntonKozlov ,
Thank you for your review.
As suggested, I've excluded SecureRandom part from the review and focused on the CRaC Critical Section API only.
AbstractContextImpl and SecureRandom classes are reverted to the original state.
JTREG Test is added to demonstrate and verify new API.

@alexeybakhtin alexeybakhtin changed the title Invalidate and reseed SHA1PRND during checkpoint/restore CRaC Critical Section API Nov 26, 2021
}

@Override
@SuppressWarnings("try")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this initially. What warning does the "try" generate? I.e. will every user will need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenJDK build generates warning : "warning: [try] auto-closeable resource lock is never referenced in body of corresponding try statement"
I never see this warning compiling tests or apps. It safe to remove from test

@@ -0,0 +1,103 @@
// Copyright 2019-2021 Azul Systems, Inc. All Rights Reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test/jdk/jdk/crac/checkpointLock seems more appropriate (Selector is a set of regression tests for nio.Selector Resource implementation)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Collaborator

@DanHeidinga DanHeidinga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to understand the use of this api a little better. Do you have some examples of where and how you expect it to be used?

The CheckpointLock seems to be a way to block checkpoints from being taken during operations that are "unsafe" for checkpoints to occur during or would leave the application in an unstable state.

The idea of blocking checkpoints at certain points makes sense but I'm wondering how common do you expect this to be used? Will library developers need to deploy this throughout their libraries? Or does it target a smaller set of use cases?

How does this work in a non-CRIU runtime? We don't want to introduce an API that library authors need to use for checkpoints but costs them performance / complexity when running on a vanilla JVM. Maybe to gain some experience with the problem but it'll be hard to get buy-in for a solution that costs the common case for a corner case.

@@ -50,6 +52,9 @@

private static boolean traceStartupTime;

private static final List<CheckpointLock> lockQ = new LinkedList<CheckpointLock>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LinkedList is rarely the class we actually want to use. Better to use ArrayList

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. After comparing ArrayList and LinkedList implementations, I can see - ArrayList has better performance even for general LinkedList use cases.

static void removeLock(CheckpointLock lock) {
synchronized (lockQ) {
lockQ.remove(lock);
lockQ.notify();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will wake up the checkpointRestore1 wait loop on every lock removal. If there are a lot of CheckpointLock's this code will spend a lot of time waking up the other loop only for it to check and go back to sleep.

Would it make more sense to only notify the lockQ when the queue is empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Thank you.
Even if notify does not really wakeup another thread, it would be better to notify on empty queue only

/**
* A {@code CheckpointLock} allows creating critical sections that are not interrupted during checkpoint/restore.
*/
public class CheckpointLock implements AutoCloseable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to call this a CheckpointBlocker rather than a Lock as it seems the purpose is to prevent checkpoints from being able to proceed.

@alexeybakhtin
Copy link
Contributor Author

@DanHeidinga
The main idea of CheckpointLock is to introduce Critical Sections that can not be interrupted by checkpoint/restore operation. It could be used by JDK and Application code. One of the use cases was introduced in the first commit to this review as an example:

try (CheckpointLock lock = new CheckpointLock() ) {

Sha1PRNG Secure random should be reseeded to generate a new set of random bytes after restore. Also, the state should be cleared to prevent the calculation of the previously used randoms. As soon as the state is cleared, an implementation should prevent checkpoint in the middle of the methods that use a secure random state. Similar actions could/should be applied for other secure sensitive resources.
It could be necessary for application and libraries to use critical sections if they modify resources in the beforeCheckpoint/afterRestore methods.
The overhead of the CheckpointLock is very limited - just creation of the small object, add to and remove from the static ArrayList - so, I do not expect performance issues for the non-CRIU runtime or normal behavior after restore.

@DanHeidinga
Copy link
Collaborator

@DanHeidinga The main idea of CheckpointLock is to introduce Critical Sections that can not be interrupted by checkpoint/restore operation. It could be used by JDK and Application code.

I understand why this is an appealing idea - there are operations that ideally should be treated atomically with respect to the checkpoint/restore mechanism. Conceptually very similar to Hotspot's GCLocker blocking further access to jni critical while a GC is pending.

And while this kind of approach works for ie: jni critical, it works in part because the spec requires that critical regions are well bounded and the operations allowed in a critical section are heavily restricted.

Those properties don't hold for CheckpointLocks. What properties need to be specified to ensure these critical sections are short lived?

One of the use cases was introduced in the first commit to this review as an example:

try (CheckpointLock lock = new CheckpointLock() ) {

Sha1PRNG Secure random should be reseeded to generate a new set of random bytes after restore. Also, the state should be cleared to prevent the calculation of the previously used randoms. As soon as the state is cleared, an implementation should prevent checkpoint in the middle of the methods that use a secure random state. Similar actions could/should be applied for other secure sensitive resources.

Thanks for the example. Looking at the SecureRandom code, the afterRestore hook calls the engineSetSeed method which takes a CheckpointLock to prevent additional checkpoints happening during that operation.

There's a larger problem here though, that isn't resolved by stopping the future checkpoints from happening during the engineSetSeed and looking at that problem will hopefully lead to better understanding of thread behaviour around checkpoints.

The intention of updating the SecureRandom seed is to prevent attackers from being able "guess" future random numbers by restoring, observing the randoms, and then restoring a second time - basically to avoid replay attacks.

Using CheckpointLock prevents a second checkpoint from happening while the seeds are being reset, but I'm not sure what that actually gains us here. Presumably, the second checkpoint will run through the {before,after}Checkpoint hooks and reset the state on checkpoint and reseed on the second restore. What additional advantage has been gained by locking around that operation?

And to the larger threading question, if we have N threads operating prior to the checkpoint using the SecureRandom, by taking the checkpoint, we leave those threads with potentially inconsistent results. The checkpoint operation doesn't quiesce the threads so they are free to observe the state being modified by the beforeCheckpoint hook and on restore, to race with afterCheckpoint hook potentially seeing the old or new seed.

If I've missed something that forces the running threads to quiesce, then please correct me.

It could be necessary for application and libraries to use critical sections if they modify resources in the beforeCheckpoint/afterRestore methods.

I don't think CheckpointLock provides a strong enough guarantee for these operations as they race with other active threads.

The overhead of the CheckpointLock is very limited - just creation of the small object, add to and remove from the static ArrayList - so, I do not expect performance issues for the non-CRIU runtime or normal behavior after restore.

Look at this like the use of the AccessController::doPrivileged operations which are hard to place correctly, and for most users, are pure overhead as they don't run with the SecurityManager enabled. I'm hesitant about introducing APIs that have the same characteristics as apis that are in the process of being removed [1] unless we have good understanding of why the new apis are different.

[1] https://github.com/openjdk/jdk/blob/f7237793ffa3a5a804fea49f165c8b9f1935cfac/src/java.base/share/classes/java/security/AccessController.java#L281

@alexeybakhtin
Copy link
Contributor Author

Hi @DanHeidinga,

Thank you a lot for your review. I see a lot of open issues. Let me answer to some of them. The others are homework for me.

Using CheckpointLock prevents a second checkpoint from happening while the seeds are being reset, but I'm not sure what that actually gains us here. Presumably, the second checkpoint will run through the {before,after}Checkpoint hooks and reset the state on checkpoint and reseed on the second restore. What additional advantage has been gained by locking around that operation?

I think we should restrict nested checkpoints. So, application should not be allowed to request second checkpoint from the beforeCheckpoint or afterRestore methods. Proposed implementation partially implement it with checkpointRestoreLock

synchronized (checkpointRestoreLock) {

but it does not work if called directly (not from threads) from beforeCheckpoint / afterRestore methods.
Also, I think CheckpointException should be thrown on the attempt to a nested checkpoint.

And to the larger threading question, if we have N threads operating prior to the checkpoint using the SecureRandom, by taking the checkpoint, we leave those threads with potentially inconsistent results. The checkpoint operation doesn't quiesce the threads so they are free to observe the state being modified by the beforeCheckpoint hook and on restore, to race with afterCheckpoint hook potentially seeing the old or new seed.

If I've missed something that forces the running threads to quiesce, then please correct me.

I think this exact scenario is addressed here. The checkpoint operation holds the lock on the checkpointRestoreLock object. Running threads will be stopped till the end of restore trying to walk into the critical section

synchronized(checkpointRestoreLock) {

It is the application responsibility to correctly surround all code sections that access resources modified in the beforeCheckpoint method.
Threads, suspended on the enter of critical section will be released and continue execution after Restore is complete and state is modified in beforeCheckpoint/afterRestore methods. These threads will enter a critical section with the new state of the resource

Look at this like the use of the AccessController::doPrivileged operations which are hard to place correctly, and for most users, are pure overhead as they don't run with the SecurityManager enabled. I'm hesitant about introducing APIs that have the same characteristics as apis that are in the process of being removed [1] unless we have good understanding of why the new apis are different.

Yes, I see your point. I think I can modify the code to do nothing in case of application started without checkpoint properties CRaCChecpointTo and CRaCRestoreFrom.

@AntonKozlov
Copy link
Member

The idea of blocking checkpoints at certain points makes sense but I'm wondering how common do you expect this to be used? Will library developers need to deploy this throughout their libraries? Or does it target a smaller set of use cases?

Libraries may find this useful, but this is not mandatory, of course. A similar mechanism can be implemented by an application.
But since the JDK is will likely have this implemented, and some libraries may need this, it's worth exposing this as a public API. The main driver is the need to provide a control for something that should not end up in the image, like secrets that are fine to flow in the memory of the running process but are too critical to be stored. This relates to any implementation (e.g. not based on CRIU) that stores the image of the JVM process.

And while this kind of approach works for ie: jni critical, it works in part because the spec requires that critical regions are well bounded and the operations allowed in a critical section are heavily restricted.
Those properties don't hold for CheckpointLocks. What properties need to be specified to ensure these critical sections are short lived?

In contrast with JNI critical (which affects negatively the whole JVM), a non-minimal critical section has a potentially negative effect on the checkpoint itself only. Making the section shorter is beneficial, but not mandatory. As with custom beforeCheckpoint's, critical sections are part of the application's checkpoint procedure. JDK itself should not impose restrictions on this -- the user should have the control and take responsibility.

Look at this like the use of the AccessController::doPrivileged operations which are hard to place correctly, and for most users, are pure overhead as they don't run with the SecurityManager enabled. I'm hesitant about introducing APIs that have the same characteristics as apis that are in the process of being removed [1] unless we have good understanding of why the new apis are different.

This is indeed an interesting analogy. Actually, the API similar to the doPrivileged looks even better :) (instead of try-with-resources that generates the warning).

From the critique of the SecurityManager from JEP-411 [1], the difficult programming model and poor performance concerns should be evaluated. But both are associated with the need of traversing the stack for making the decision.
For CheckpointLock we need to check the global state that may be implemented efficiently, especially for no-checkpoint-possible mode.

[1] https://openjdk.java.net/jeps/411

@alexeybakhtin alexeybakhtin mentioned this pull request Dec 7, 2021
1 task
@DanHeidinga
Copy link
Collaborator

I think this exact scenario is addressed here. The checkpoint operation holds the lock on the checkpointRestoreLock object. Running threads will be stopped till the end of restore trying to walk into the critical section

synchronized(checkpointRestoreLock) {

It is the application responsibility to correctly surround all code sections that access resources modified in the beforeCheckpoint method.
Threads, suspended on the enter of critical section will be released and continue execution after Restore is complete and state is modified in beforeCheckpoint/afterRestore methods. These threads will enter a critical section with the new state of the resource

I reread the SecureRandom code and I agree the state modified in the beforeCheckpoint method is covered by CheckpointLocks - two of them in fact. And that makes this code subtle on first read as it appears to violate the constraints the locks should be providing (it doesn't, it's just a non-obvious use of double-checked locking though without the use of volatile... so still maybe broken but I'm not clear how a checkpoint affects happens-before).

It's also non-obvious how the two different new CheckpointLock() calls interact with each other and the class's existing synchronized methods. A developer needs to think about two kinds of synchronization now:

  1. The existing concept of synchronization to protect against concurrent access
  2. The new checkpoint consistency to guard against checkpoint-related data modification.

All of a class invariants now have to be guarded against changes from the {before/after}Checkpoint methods in addition to multi-threaded concerns.

This makes CheckpointLock one of those powerful, subtle apis that tend to be footguns and the source of security defects when deployed across the ecosystem.

The overall thing - and in general, not just in this PR - that needs to be clarified is the threading and locking behaviour around a checkpoint. What guarantees does CRaC provide to application threads about the state of the system during their beforeCheckpoint hooks and their afterRestore hooks?

Are application / framework / library authors responsible to ensure their threads are quiesced?

If we intend to have CRaC optimize the checkpoint by e.g. aggressively GCing before a checkpoint, then it would be good to ensure that app threads aren't allocating garbage at the same time. Acting in a single threaded manner here is a great conceptual model but introduces other issues e.g. deadlocks - at any rate, we need to start to specify what we're going to guarantee and what the apps have to build themselves.

But since the JDK is will likely have this implemented, and some libraries may need this, it's worth exposing this as a public API. The main driver is the need to provide a control for something that should not end up in the image, like secrets that are fine to flow in the memory of the running process but are too critical to be stored. This relates to any implementation (e.g. not based on CRIU) that stores the image of the JVM process.

Preventing secrets from being visible in the image is the right goal - we completely agree there. I'm concerned about how easy it will be to get this wrong with the proposed api. Programmers have a hard time understanding "happens-before" and the Java memory model and we've had that model since Java 5. It's partly why so many programs are either over or under synchronized.

And common synchronization patterns, like privatizing copies to operate on, will violate the intent of the CheckpointLock api rendering user's existing intuition wrong around how to protect data.

Here's an example of privatizing two variables that then leak into the image due to not holding the CheckpointLock long enough. A checkpoint taken at (A) will leak the secret into the image. User error? Yes but we'll eventually make the same kind of mistakes in the JDK as well.

class SecretHolder {
   Object secret_part1;
   Object secret_part2; // use two vars to demonstrate
   ....
   void useSecret {
      Object localSecret_1;
      Object localSecret_2;
       try (new CheckpointLock()) {
           localSecret_1 = secret_part1;
           localSecret_2 = secret_part2;
        }
        // (A) use localSecret_1 & _2;
    }

    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        secret_part1 = null;
        secret_part2 = null;
    }
}

The point I'm trying to make is that we need to uplevel the discussion to look at the guarantees we want CRaC to provide and then can tune the apis and features we create to provide those guarantees.

@DanHeidinga
Copy link
Collaborator

Lock ordering between synchronized methods and the jdk.crac.Core.checkpointRestoreLock can also result in deadlocks.

There's a non-obvious deadlock in this example (assuming I have the ordering right) between the callers of the CheckpointDeadLock.operate() method and the act of taking a checkpoint:

class CheckpointDeadLock {
   Object state;
   
  synchronized void clearState() {
    state = null;
  }

  synchronized void setState(Object o) {
    try (new CheckpointLock()) {
       // lock order: {this, jdk.crac.Core.checkpointRestoreLock }
       state = o;
    }
  }
   
   void operate {
      while(true) {
        setState(new Object());
      }
    }

    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
       // { jdk.crac.Core.checkpointRestoreLock, this } when calling synchronized 'clearState'
        clearState();
    }
}

@AntonKozlov
Copy link
Member

It's also non-obvious how the two different new CheckpointLock() calls interact with each other and the class's existing synchronized methods. A developer needs to think about two kinds of synchronization now:

  1. The existing concept of synchronization to protect against concurrent access
  2. The new checkpoint consistency to guard against checkpoint-related data modification.

I suggest looking at the checkpoint-critical section (or similar) as the Read part of the ReadWrite [1] lock, while checkpoint/restore is performed with the Write lock acquired. Checkpoint-critical sections can intersect, they are effectively no-op in absence of checkpoint/restore. Checkpoint/restores never intersect with each other and checkpoint-critical sections.

Similarly, with the Lock [2], I think we can specify checkpoint-lock independent of any native monitor. However, this implementation needs to have a minimal effect on the checkpoint-critical sections and Resource implementations, and the effect should be documented.

[1] https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html
[2] https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Lock.html

All of a class invariants now have to be guarded against changes from the {before/after}Checkpoint methods in addition to multi-threaded concerns.

They should have such guards before this PR as well -- the async checkpoint request may arrive any time and the implementor should take care about proper sync of between the main thread(s) and the thread doing checkpoint/restore (and calling before/afterRestore's, the thread is not specified).

So the new API should only sync arbitrary java code with checkpoint/restore; but not arbitrary java code with before/afterRestore -- for this, there is a better way with fine-grained user synchronization.

Thanks for the deadlock example. That and the occasional dependency on the lock implementation pushes to the thinking: should not we narrow checkpointRestoreLock to checkpointRestore1. So beforeCheckpoint will run in parallel with checkpoint-critical sections.

The overall thing - and in general, not just in this PR - that needs to be clarified is the threading and locking behaviour around a checkpoint. What guarantees does CRaC provide to application threads about the state of the system during their beforeCheckpoint hooks and their afterRestore hooks?

Are application / framework / library authors responsible to ensure their threads are quiesced?

What kind of guarantees would we need? So far we don't assume any special state of the system and threads during beforeCheckpoint/afterRestore, as they are arbitrary java code.

In general, it's app / lib responsibility to sync user resources with the checkpoint.
I doubt jdk would implement this flexibly and efficiently, so the best is to provide the control for users.

If we intend to have CRaC optimize the checkpoint by e.g. aggressively GCing before a checkpoint, then it would be good to ensure that app threads aren't allocating garbage at the same time. Acting in a single threaded manner here is a great conceptual model but introduces other issues e.g. deadlocks - at any rate, we need to start to specify what we're going to guarantee and what the apps have to build themselves.

For this, checkpoint implemented in several phases:

  1. beforeCheckpoint's are called
  2. in the safepoint, JVM checks preconditions and do necessary preparation, including GC
  3. still in the safepoint, the image is stored

I'm concerned about how easy it will be to get this wrong with the proposed api.

And common synchronization patterns, like privatizing copies to operate on, will violate the intent of the CheckpointLock api rendering user's existing intuition wrong around how to protect data.

Intuition plays against us. Process memory is usually seen as vault protected by OS means. However, it may leak on the file system as a core dump. The same is with CRaC, although the image is a part of the expected workflow. With the new API, the image still could not be completely safe from secrets unless a complete code review is performed. But at least it is feasible.

The point I'm trying to make is that we need to uplevel the discussion to look at the guarantees we want CRaC to provide and then can tune the apis and features we create to provide those guarantees.

Do you mean something in particular? I think we have a pile of concrete concerns about CRaC like the security, we don't have good mechanisms to deal with. Adding new features and trying to use them is the way to notice similarities and to come with something better (the API is not stable yet and may be extended and reduced). However, knowing the flaws in advance are preferable. Just not sure what are they.

@alexeybakhtin
Copy link
Contributor Author

Thanks for the deadlock example. That and the occasional dependency on the lock implementation pushes to the thinking: should not we narrow checkpointRestoreLock to checkpointRestore1. So beforeCheckpoint will run in parallel with checkpoint-critical sections.

I think we can not do it this way. It is exactly what we are going to prevent. beforeCheckpoint() prepares resources for checkpoint and the critical section prevents resources from concurrent access/modification during checkpoint/restore.
In case of a concurrent run of beforeCheckpoint() and critical sections, we'll have a situation when beforeCheckpoint() prepares resources for checkpoint and then critical sections modify or restore it. So, I think the lock around the whole checkpoint/ restore is the right solution.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Dec 14, 2021
@alexeybakhtin
Copy link
Contributor Author

Hi @AntonKozlov @DanHeidinga
Thank you for your review and comments.
I've proposed the new API method Core.criticalSection() instead of CheckpointLock.
CheckpointLock causes javac warning and the new proposed API method does not have this issue.
Also, this version optimizes criticalSection implementation in case CRaCCheckpointTo option is not specified in cmdline.
jmh benchmark is added to assure - critical section does not cause performance degradation for normal and after restore JDK runs

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 14, 2021
@alexeybakhtin
Copy link
Contributor Author

Hi @DanHeidinga, all
After additional discussions, we decided to postpone this API. As Dan pointed out, additional checkpoint locks introduce excessive complexity of the code and could lead to deadlocks that are hard to find and avoid.
In the absence of a general implementation of checkpoint locks, it is the application's responsibility to protect its own resources from concurrent access/modification during checkpont/restore.
I will continue to implement protection of security-related data in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready Pull request is ready to be integrated
3 participants