[litmus] Add support for speedcheck parameter for -mode presi #869

Merged
merged 1 commit into from
Jul 10, 2024

Conversation

relokin
Member

@relokin relokin commented Jun 3, 2024

This change adds support for the speedcheck parameter for -mode presi. This was already supported in -mode std. The user can provide the parameter "+sc", which forces an exit as soon as the post-condition is observed.

@maranget
Member

maranget commented Jun 12, 2024

Hi @relokin, using one global stop_now flag may lead to deadlock. Commit b72a2f3 is an attempt to avoid such deadlocks by stopping all instances before stopping the experiment.

@relokin
Member Author

relokin commented Jun 25, 2024

Hey @maranget, indeed I missed this. I tried to avoid other deadlocks, but I missed this.

I had a look at your patch and I think it's a much better way to achieve what I was trying to do. Do you want to open a new PR with your own patch, or should I cherry-pick it into this pull request?

@maranget
Member

Hi @relokin, cherry picking looks like the most adequate technique. Your opinion?

@relokin
Member Author

relokin commented Jun 26, 2024

Just to check with you that my understanding is correct.

The high level desire was to exit litmus7 as soon as there is an execution where the post-condition is satisfied. With this PR, if a specific execution satisfies the post-condition, then other instances will continue for a little longer.

For example, let's say we execute 2 instances of MP using 4 cores. The 1st instance observes an execution which satisfies the post-condition and exits immediately. The 2nd instance will continue executing until the end of the for loop (executing size iterations in total), or until it itself encounters an execution which satisfies the post-condition.

However, at the end of the execution, we know that for at least one instance, its last execution satisfied the post-condition. And in the case of presi, we know that for at least one set of cores (2 in the case of MP), the last time they executed the test, it satisfied the post-condition.

Does that make sense?

@maranget
Member

maranget commented Jun 26, 2024

The high level desire was to exit litmus7 as soon as there is an execution where the post-condition is satisfied. With this PR, if a specific execution satisfies the post-condition, then other instances will continue for a little longer.

Hi @relokin, we agree on the high-level desire. If I am not mistaken, this is what the synchronisation code of commit 0b45377 does. That is, all instances will exit as soon as possible if one instance discovers that the post-condition is satisfied.

Every test thread executes nruns times (function choose) a sequence of size tests (function choose_params).

As soon as one of the instances discovers that the post-condition is satisfied, it sets the global flag stop_now to true. Moreover, all the threads of any instance synchronise as follows: thread number zero copies the global flag into an instance-level flag, and all instance threads synchronise on an instance-level synchronisation barrier before they read the instance-level flag. If this flag is set, all threads exit the loop by returning from the choose_params function. As a consequence, all threads of all instances will exit their loops as soon as possible and return inside the loop of size nruns in the function choose. There they all synchronise on a global synchronisation barrier before reading the global flag, and all exit if they see it set.

I am not sure the scheme above is deadlock-free. It looks important that all the threads of a given instance act consistently. Hence the idea of having them synchronise before reading the instance-level flag.
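
A minimal Python model of the two-level stop procedure described above, written for illustration only; it is not the C code litmus7 generates. The names `Instance`, `run_two_level`, `observed` and `tests_run` are hypothetical, and the post-condition check is replaced by a caller-supplied predicate evaluated by thread zero. The sketch also waits twice on each barrier, an addition of this model so that no thread can rewrite a flag while a slower thread is still reading it; the discussion does not say how the real code orders these accesses.

```python
import threading


class Instance:
    """State shared by the threads of one test instance (hypothetical layout)."""

    def __init__(self, ident, n_threads):
        self.ident = ident
        self.barrier = threading.Barrier(n_threads)  # instance-level barrier
        self.stop_now = False                        # instance-level flag
        self.tests_run = 0                           # bookkeeping for the demo


def run_two_level(n_instances, n_threads, nruns, size, observed):
    """Model the two-level stop; return the number of tests each instance ran.

    observed(inst, run, i) is an assumed stand-in for the post-condition
    check of test i in run `run` of instance `inst`.
    """
    shared = {"stop_now": False}  # global flag
    instances = [Instance(k, n_threads) for k in range(n_instances)]
    global_barrier = threading.Barrier(n_instances * n_threads)

    def choose_params(inst, tid, run):
        for i in range(size):
            # Added by this sketch: everyone has finished reading the
            # instance flag of the previous test before it is rewritten.
            inst.barrier.wait()
            if tid == 0:
                inst.tests_run += 1
                if observed(inst.ident, run, i):
                    shared["stop_now"] = True
                # Thread zero copies the global flag into the instance flag.
                inst.stop_now = shared["stop_now"]
            # Instance-level barrier, then every thread reads the same value.
            inst.barrier.wait()
            if inst.stop_now:
                return

    def choose(inst, tid):
        for run in range(nruns):
            choose_params(inst, tid, run)
            # All threads of all instances meet before reading the global flag.
            global_barrier.wait()
            stop = shared["stop_now"]
            # Added by this sketch: nobody starts the next run (and possibly
            # sets the flag) until every thread has taken its snapshot.
            global_barrier.wait()
            if stop:
                return

    threads = [
        threading.Thread(target=choose, args=(inst, tid))
        for inst in instances
        for tid in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [inst.tests_run for inst in instances]
```

Calling `run_two_level(2, 2, 4, 10, lambda inst, run, i: (inst, run, i) == (0, 1, 3))` makes instance 0 run 14 tests (10 in the first run, 4 in the second). The other instance stops after somewhere between 11 and 20 tests, depending on when its thread zero picks up the global flag.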

@maranget
Member

maranget commented Jun 26, 2024

A simpler scheme that would not stop threads as soon as the previous one, but that would spare the instance-level synchronisation, would be as follows: the choose_param function (loop on size) simply records the occurrence of the stop condition locally, returning 1 if the stop condition occurred for some of the loop iterations and 0 otherwise.

In choose, if the returned value is one, set the global stop_now flag to one. Then synchronise on a global sync barrier, before reading the global flag and exiting when set.
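
The simpler scheme can be modelled in a few lines. This is a hedged Python sketch, not litmus7's generated C: `run_simple`, `observed` and `tests_run` are hypothetical names, and a caller-supplied predicate stands in for the stop condition. As an addition of this model, the global barrier is waited on twice so that a thread starting its next run cannot set the flag while another thread is still reading it.

```python
import threading


def run_simple(n_threads, nruns, size, observed):
    """Model the simpler scheme; return the number of tests each thread ran.

    observed(tid, run, i) is an assumed stand-in for the stop condition
    of test i in run `run`, as seen by thread `tid`.
    """
    stop_now = [False]                      # global flag
    barrier = threading.Barrier(n_threads)  # global barrier
    tests_run = [0] * n_threads             # bookkeeping for the demo

    def choose_param(tid, run):
        seen = 0
        for i in range(size):
            tests_run[tid] += 1
            if observed(tid, run, i):
                seen = 1  # record locally and keep looping
        return seen

    def choose(tid):
        for run in range(nruns):
            if choose_param(tid, run) == 1:
                stop_now[0] = True
            # Global barrier, then read the global flag.
            barrier.wait()
            stop = stop_now[0]
            # Added by this sketch: all snapshots are taken before any
            # thread can start the next run.
            barrier.wait()
            if stop:
                return

    threads = [threading.Thread(target=choose, args=(tid,)) for tid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tests_run
```

With `run_simple(4, 4, 10, lambda tid, run, i: (tid, run, i) == (0, 1, 3))`, every thread completes the run in which the condition occurred, so each one runs 20 tests before the experiment stops. That is the price of dropping the per-test synchronisation: the stop is only acted on at the end of a run.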

@relokin
Member Author

relokin commented Jul 9, 2024

A simpler scheme that would not stop threads as soon as the previous one, but that would spare the instance-level synchronisation, would be as follows: the choose_param function (loop on size) simply records the occurrence of the stop condition locally, returning 1 if the stop condition occurred for some of the loop iterations and 0 otherwise.

Thanks @maranget! The motivation for this change (and I am aware that this might not be the same for speedcheck in -mode std) was to make it easier to identify the execution that satisfied the post-condition. So it's quite important that we exit as soon as possible. So I would rather not change it, unless ofc, there is something wrong with the current approach.

@maranget
Member

maranget commented Jul 10, 2024

Hi @relokin. I guess we agree to merge this PR as it is, or do you see additional improvements?

@maranget
Member

If we agree on merging, would you please rebase on master?

This change adds support for the speedcheck parameter for -mode
presi. This was already supported in -mode std. The user can provide
the parameter "+sc", which forces an exit as soon as the
post-condition is observed.

So as to avoid deadlocks, introduce a two-level procedure
to stop the experiment.

Any instance that reaches the stop condition sets the global
`stop_now` flag. This global flag is copied into an
instance specific `stop_now` flag after each test
by the thread number zero of this instance.

Then, the instance threads synchronise on an instance level
barrier before reading the instance flag and interrupting
the test loop, if the flag is set.

After returning from the test loop, _all_ threads synchronise on
a global barrier, before reading the global flag and
interrupting the experiment if the flag is set.
@relokin
Member Author

relokin commented Jul 10, 2024

I am happy for this to be merged. Thanks @maranget!

@maranget maranget merged commit 010efa8 into herd:master Jul 10, 2024
3 checks passed
@maranget
Member

Merged, thanks @relokin

@relokin relokin deleted the speedcheck-presi branch July 10, 2024 11:07