-
-
Notifications
You must be signed in to change notification settings - Fork 971
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a concurrency bug of JournalStorage set_trial_state_values
.
#4033
Fix a concurrency bug of JournalStorage set_trial_state_values
.
#4033
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4033 +/- ##
==========================================
- Coverage 90.19% 90.08% -0.12%
==========================================
Files 160 160
Lines 12580 12597 +17
==========================================
+ Hits 11347 11348 +1
- Misses 1233 1249 +16
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@wattlebirdaz I added a test case. Could you review this PR if you have time? |
FYI: I found an additional bug when added a test case for multi-process optimization. I confirmed that this test case will be failed even if checking out to the master branch.
cc: @wattlebirdaz |
@c-bata I haven't reviewed this yet but I will just leave a comment to make sure that we understand the bug correctly. I'll review it in a few days. So the bug is in this function, especially L266. optuna/optuna/storages/_journal/storage.py Lines 246 to 270 in 7da2f2d
The desired behavior of this function is that If there are two threads working on the same trial and trying to make the state from RUNNING to COMPLETE, one of them should return False. optuna/optuna/storages/_journal/storage.py Lines 264 to 270 in 7da2f2d
In JournalStorage, a worker thread cannot tell whether a RUNNING state that the worker sees is a state registered by this thread or another thread. This is because owners of the trials are maintained by the process, not thread. Therefore, both of the threads will return True mainly because of the condition in L266. The reason RDB(such as Postgres) works in this scenario is that in RDB, a worker can see the state before the worker registers the new state. In this scenario, RDB does not have to keep the owner list of the trials. Thus, whenever a worker sees a RUNNING state, it is sure that the state is not registered by that worker. See L685, L686 optuna/optuna/storages/_rdb/storage.py Lines 672 to 697 in 7da2f2d
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Great work!
I added some comments on the code. Please feel free to ask me anything about it.
I'll investigate this in a few days. |
This pull request has not seen any recent activity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wattlebirdaz Thank you for your review and your thorough investigation!
I fixed a bug when running in multiple processes at eec3075 and made this PR ready for review 🙏
set_trial_state_values
.
@HideakiImamura Could you review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! Looks almost good to me.
In the code there are multiple worker-ish concepts (such as worker, thread, process). For clarity, we could avoid the use of names that contain "thread" and only use (worker, process) names. (e.g. _thread_id_to_owned_trial_id
. to _worker_id_to_owned_trial_id
). Or we could at least leave a comment that indicates that a worker is a thread. What do you think?
|
||
|
||
def test_pop_waiting_trial_multiprocess_safe() -> None: | ||
with tempfile.NamedTemporaryFile() as file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could I ask one thing? Is there any reason why we can't execute multi-process test for other dbs such as sqlite3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STORAGE_MODES
variable, which is used with pytest.mark.parametrize()
, contains sqlite3
, redis
, and inmemory
except for journal
.
sqlite3
: To pop waiting trials concurrently, RDBStorage usesSELECT ... FOR UPDATE
syntax, which is unsupported in SQLite3. Furthermore, SQLite3 does not support a high level of concurrency. It sometimes raises "database is locked" errors and it makes tests fragile.redis
: We usefakeredis
for testingRedisStorage
. It doesn't support multiple processes.
So here I put this test case instead of using pytest.mark.parametrize("storage_mode", STORAGE_MODES)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Almost, LGTM.
As @wattlebirdaz pointed, I think It would be better to unify either "worker" or "thread", to avoid confusion.
@HideakiImamura Thanks for your review!
Sorry, which comment are you referring to? I guess you mentioned about https://github.com/optuna/optuna/pull/4033/files#r994612122 but we seemed to reach a consensus on it. |
@c-bata this one: #4033 (review) :) |
Thanks! That sounds make sense. I'll update the PR>. |
PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! LGTM
Motivation
Refs #4002 (comment)
Description of the changes
Make
set_trial_state_values
thread safe.