Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a concurrency bug of JournalStorage set_trial_state_values. #4033

Merged
merged 12 commits into from
Oct 24, 2022

Conversation

c-bata
Copy link
Member

@c-bata c-bata commented Sep 30, 2022

Motivation

Refs #4002 (comment)

Description of the changes

Make set_trial_state_values thread safe.

import optuna


optuna.logging.set_verbosity(optuna.logging.DEBUG)


def objective(trial):
    x = trial.suggest_float("x", -100, 100)
    return x ** 2


if __name__ == '__main__':
    try_these_parameters_first = [{"x": -15}, {"x": 0.2}, {"x": 30}, {"x": 20}]
    file_storage = optuna.storages.JournalFileStorage("optuna-journal.log")
    storage = optuna.storages.JournalStorage(file_storage)
    study = optuna.create_study(storage=storage)
    study.enqueue_trial(try_these_parameters_first[0])
    study.enqueue_trial(try_these_parameters_first[1])
    study.enqueue_trial(try_these_parameters_first[2])
    study.enqueue_trial(try_these_parameters_first[3])
    study.optimize(objective, n_trials=20, n_jobs=5)
$ python example.py
/Users/c-bata/go/src/github.com/optuna/optuna/example.py:15: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
[I 2022-09-30 11:27:49,663] A new study created in Journal with name: no-name-d61fe716-0401-4662-998d-0399d194c424
[D 2022-09-30 11:27:49,669] Trial 0 popped from the trial queue.
[D 2022-09-30 11:27:49,669] Trial 1 popped from the trial queue.
[D 2022-09-30 11:27:49,671] Trial 2 popped from the trial queue.
[D 2022-09-30 11:27:49,672] Trial 3 popped from the trial queue.
[I 2022-09-30 11:27:49,678] Trial 2 finished with value: 900.0 and parameters: {'x': 30}. Best is trial 0 with value: 225.0.
[I 2022-09-30 11:27:49,680] Trial 0 finished with value: 225.0 and parameters: {'x': -15}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,682] Trial 1 finished with value: 0.04000000000000001 and parameters: {'x': 0.2}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,684] Trial 3 finished with value: 400.0 and parameters: {'x': 20}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,792] Trial 4 finished with value: 6600.0333114589475 and parameters: {'x': 81.24058906395834}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,798] Trial 5 finished with value: 5529.278900689064 and parameters: {'x': -74.35912116673424}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,799] Trial 7 finished with value: 893.070613535204 and parameters: {'x': 29.884287067541095}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,802] Trial 6 finished with value: 6957.099989732264 and parameters: {'x': -83.40923204137695}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,803] Trial 9 finished with value: 6718.015050720809 and parameters: {'x': 81.96349828259412}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,811] Trial 8 finished with value: 970.8555796564026 and parameters: {'x': 31.15855548090127}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,812] Trial 11 finished with value: 1681.4636828254615 and parameters: {'x': 41.00565427871456}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,813] Trial 12 finished with value: 6.700583914136698 and parameters: {'x': 2.5885486115073633}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,813] Trial 13 finished with value: 774.7264078163153 and parameters: {'x': 27.833907519719816}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:49,819] Trial 10 finished with value: 263.83437431410476 and parameters: {'x': -16.24297923147428}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:50,384] Trial 15 finished with value: 784.3986584673387 and parameters: {'x': -28.007117996454735}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:50,387] Trial 14 finished with value: 833.8785841982007 and parameters: {'x': -28.876955937186327}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:50,387] Trial 17 finished with value: 1937.9235357536486 and parameters: {'x': -44.02185293412408}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:50,388] Trial 18 finished with value: 2151.3886284339164 and parameters: {'x': -46.38306402593425}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:50,388] Trial 16 finished with value: 2332.4531020099207 and parameters: {'x': -48.2954770347071}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-30 11:27:50,393] Trial 19 finished with value: 52.91460788648185 and parameters: {'x': 7.274242770658803}. Best is trial 1 with value: 0.04000000000000001.

@github-actions github-actions bot added the optuna.storages Related to the `optuna.storages` submodule. This is automatically labeled by github-actions. label Sep 30, 2022
@c-bata c-bata added the bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. label Sep 30, 2022
@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2022

Codecov Report

Merging #4033 (3e19989) into master (e8abeec) will decrease coverage by 0.11%.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##           master    #4033      +/-   ##
==========================================
- Coverage   90.19%   90.08%   -0.12%     
==========================================
  Files         160      160              
  Lines       12580    12597      +17     
==========================================
+ Hits        11347    11348       +1     
- Misses       1233     1249      +16     
Impacted Files Coverage Δ
optuna/storages/_journal/storage.py 96.82% <96.66%> (+0.32%) ⬆️
optuna/storages/_rdb/alembic/versions/v3.0.0.c.py 53.93% <0.00%> (-10.03%) ⬇️
optuna/storages/_rdb/alembic/versions/v3.0.0.d.py 51.54% <0.00%> (-6.97%) ⬇️
optuna/storages/_cached_storage.py 99.59% <0.00%> (-0.41%) ⬇️
optuna/samplers/nsgaii/_crossover.py 91.17% <0.00%> (ø)
optuna/integration/skopt.py 98.19% <0.00%> (+0.05%) ⬆️
optuna/storages/_rdb/storage.py 94.04% <0.00%> (+0.18%) ⬆️
optuna/study/study.py 94.92% <0.00%> (+0.39%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@c-bata c-bata marked this pull request as ready for review September 30, 2022 05:26
@c-bata
Copy link
Member Author

c-bata commented Sep 30, 2022

@wattlebirdaz I added a test case. Could you review this PR if you have time?

@c-bata c-bata marked this pull request as draft September 30, 2022 06:05
@c-bata
Copy link
Member Author

c-bata commented Sep 30, 2022

FYI: I found an additional bug when added a test case for multi-process optimization. I confirmed that this test case will be failed even if checking out to the master branch.

https://github.com/optuna/optuna/compare/master...c-bata:optuna:journal-storage-multiprocess-test?expand=1

$ pytest tests/storages_tests/test_journal.py -k set_trial_state
==================================================================================================================================== test session starts ====================================================================================================================================
platform darwin -- Python 3.9.11, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/c-bata/go/src/github.com/optuna/optuna, configfile: pyproject.toml
collected 3 items / 2 deselected / 1 selected

tests/storages_tests/test_journal.py F                                                                                                                                                                                                                                                [100%]

========================================================================================================================================= FAILURES ==========================================================================================================================================
_______________________________________________________________________________________________________________________ test_set_trial_state_values_multiprocess_safe _______________________________________________________________________________________________________________________

    def test_set_trial_state_values_multiprocess_safe() -> None:
        with tempfile.NamedTemporaryFile() as file:
            file_storage = optuna.storages.JournalFileStorage(file.name)
            storage = optuna.storages.JournalStorage(file_storage)
            study = optuna.create_study(storage=storage)
            for i in range(10):
                study.enqueue_trial({"i": i})

            trial_id_set = set()
            with ProcessPoolExecutor(10) as pool:
                futures = []
                for i in range(10):
                    future = pool.submit(pop_waiting_trial, file.name, study.study_name)
                    futures.append(future)

                for future in as_completed(futures):
                    trial_id_set.add(future.result())
>           assert len(trial_id_set) == 10
E           assert 9 == 10
E            +  where 9 = len({0, 1, 2, 3, 4, 5, ...})

tests/storages_tests/test_journal.py:95: AssertionError
----------------------------------------------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------------------------------------------------
[I 2022-09-30 15:03:34,976] A new study created in Journal with name: no-name-24b93df7-b4ca-4fd6-8073-b9e9df768249
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
/Users/c-bata/go/src/github.com/optuna/optuna/tests/storages_tests/test_journal.py:73: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
================================================================================================================================== short test summary info ==================================================================================================================================
FAILED tests/storages_tests/test_journal.py::test_set_trial_state_values_multiprocess_safe - assert 9 == 10
============================================================================================================================== 1 failed, 2 deselected in 0.96s ==============================================================================================================================

cc: @wattlebirdaz

@rotaki
Copy link
Collaborator

rotaki commented Oct 5, 2022

@c-bata I haven't reviewed this yet but I will just leave a comment to make sure that we understand the bug correctly. I'll review it in a few days.

So the bug is in this function, especially L266.

def set_trial_state_values(
self, trial_id: int, state: TrialState, values: Optional[Sequence[float]] = None
) -> bool:
log: Dict[str, Any] = {
"trial_id": trial_id,
"state": state,
"values": values,
}
if state == TrialState.RUNNING:
log["datetime_start"] = datetime.datetime.now().isoformat(timespec="microseconds")
elif state.is_finished():
log["datetime_complete"] = datetime.datetime.now().isoformat(timespec="microseconds")
with self._thread_lock:
self._write_log(JournalOperation.SET_TRIAL_STATE_VALUES, log)
self._sync_with_backend()
if (
state == TrialState.RUNNING
and trial_id not in self._replay_result._trial_ids_owned_by_this_process
):
return False
else:
return True

The desired behavior of this function is that If there are two threads working on the same trial and trying to make the state from RUNNING to COMPLETE, one of them should return False.

if (
state == TrialState.RUNNING
and trial_id not in self._replay_result._trial_ids_owned_by_this_process
):
return False
else:
return True

In JournalStorage, a worker thread cannot tell whether a RUNNING state that the worker sees is a state registered by this thread or another thread. This is because owners of the trials are maintained by the process, not thread. Therefore, both of the threads will return True mainly because of the condition in L266.

The reason RDB(such as Postgres) works in this scenario is that in RDB, a worker can see the state before the worker registers the new state. In this scenario, RDB does not have to keep the owner list of the trials. Thus, whenever a worker sees a RUNNING state, it is sure that the state is not registered by that worker.

See L685, L686

def set_trial_state_values(
self, trial_id: int, state: TrialState, values: Optional[Sequence[float]] = None
) -> bool:
try:
with _create_scoped_session(self.scoped_session) as session:
trial = models.TrialModel.find_or_raise_by_id(trial_id, session, for_update=True)
self.check_trial_is_updatable(trial_id, trial.state)
if values is not None:
for objective, v in enumerate(values):
self._set_trial_value_without_commit(session, trial_id, objective, v)
if state == TrialState.RUNNING and trial.state != TrialState.WAITING:
return False
trial.state = state
if state == TrialState.RUNNING:
trial.datetime_start = datetime.now()
if state.is_finished():
trial.datetime_complete = datetime.now()
except sqlalchemy_exc.IntegrityError:
return False
return True

Copy link
Collaborator

@rotaki rotaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Great work!
I added some comments on the code. Please feel free to ask me anything about it.

optuna/storages/_journal/storage.py Show resolved Hide resolved
optuna/storages/_journal/storage.py Show resolved Hide resolved
optuna/storages/_journal/storage.py Outdated Show resolved Hide resolved
tests/storages_tests/test_storages.py Outdated Show resolved Hide resolved
tests/storages_tests/test_storages.py Outdated Show resolved Hide resolved
@rotaki
Copy link
Collaborator

rotaki commented Oct 5, 2022

#4033 (comment)

I'll investigate this in a few days.

@github-actions
Copy link
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Oct 12, 2022
@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Oct 13, 2022
@c-bata c-bata marked this pull request as ready for review October 14, 2022 00:30
Copy link
Member Author

@c-bata c-bata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wattlebirdaz Thank you for your review and your thorough investigation!
I fixed a bug when running in multiple processes at eec3075 and made this PR ready for review 🙏

optuna/storages/_journal/storage.py Show resolved Hide resolved
optuna/storages/_journal/storage.py Outdated Show resolved Hide resolved
tests/storages_tests/test_storages.py Outdated Show resolved Hide resolved
optuna/storages/_journal/storage.py Show resolved Hide resolved
@c-bata c-bata changed the title Fix JournalStorage to pop waiting trials concurrently Fix a concurrency bug of JournalStorage set_trial_state_values. Oct 14, 2022
@c-bata
Copy link
Member Author

c-bata commented Oct 14, 2022

@HideakiImamura Could you review this PR?

Copy link
Collaborator

@rotaki rotaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Looks almost good to me.

In the code there are multiple worker-ish concepts (such as worker, thread, process). For clarity, we could avoid the use of names that contain "thread" and only use (worker, process) names. (e.g. _thread_id_to_owned_trial_id. to _worker_id_to_owned_trial_id). Or we could at least leave a comment that indicates that a worker is a thread. What do you think?

tests/storages_tests/test_journal.py Outdated Show resolved Hide resolved


def test_pop_waiting_trial_multiprocess_safe() -> None:
with tempfile.NamedTemporaryFile() as file:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I ask one thing? Is there any reason why we can't execute multi-process test for other dbs such as sqlite3?

Copy link
Member Author

@c-bata c-bata Oct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STORAGE_MODES variable, which is used with pytest.mark.parametrize(), contains sqlite3, redis, and inmemory except for journal.

  • sqlite3: To pop waiting trials concurrently, RDBStorage uses SELECT ... FOR UPDATE syntax, which is unsupported in SQLite3. Furthermore, SQLite3 does not support a high level of concurrency. It sometimes raises "database is locked" errors and it makes tests fragile.
  • redis: We use fakeredis for testing RedisStorage. It doesn't support multiple processes.

So here I put this test case instead of using pytest.mark.parametrize("storage_mode", STORAGE_MODES).

optuna/storages/_journal/storage.py Outdated Show resolved Hide resolved
optuna/storages/_journal/storage.py Show resolved Hide resolved
Copy link
Member

@HideakiImamura HideakiImamura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Almost, LGTM.

As @wattlebirdaz pointed, I think It would be better to unify either "worker" or "thread", to avoid confusion.

@c-bata
Copy link
Member Author

c-bata commented Oct 20, 2022

@HideakiImamura Thanks for your review!

As @wattlebirdaz pointed, I think It would be better to unify either "worker" or "thread", to avoid confusion.

Sorry, which comment are you referring to? I guess you mentioned about https://github.com/optuna/optuna/pull/4033/files#r994612122 but we seemed to reach a consensus on it.

@rotaki
Copy link
Collaborator

rotaki commented Oct 20, 2022

@c-bata this one: #4033 (review) :)

@c-bata
Copy link
Member Author

c-bata commented Oct 21, 2022

Thanks! That sounds make sense. I'll update the PR>.

@c-bata
Copy link
Member Author

c-bata commented Oct 21, 2022

PTAL.

Copy link
Member

@HideakiImamura HideakiImamura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Collaborator

@rotaki rotaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! LGTM

@HideakiImamura HideakiImamura merged commit 182c751 into optuna:master Oct 24, 2022
@HideakiImamura HideakiImamura added this to the v3.1.0 milestone Oct 24, 2022
@c-bata c-bata deleted the fix-journal-pop-waiting-trial branch October 24, 2022 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. optuna.storages Related to the `optuna.storages` submodule. This is automatically labeled by github-actions.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants