Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enqueue, storage & parallelization #4002

Closed
cuismabc opened this issue Sep 14, 2022 · 5 comments · Fixed by #4034
Closed

enqueue, storage & parallelization #4002

cuismabc opened this issue Sep 14, 2022 · 5 comments · Fixed by #4034
Labels
bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself.

Comments

@cuismabc
Copy link

cuismabc commented Sep 14, 2022

Expected behavior

run the attached code, you will see the problem. this issue does not always happen, but often.

Environment

  • Optuna version: 3.9
  • Python version:3.0
  • OS: windows X64
  • (Optional) Other libraries and their versions:

Error messages, stack traces, or logs

[I 2022-09-14 14:22:16,229] A new study created in RDB with name: C:/Users/cuism/Desktop/ex_DOEs/study
A new study created in RDB with name: C:/Users/cuism/Desktop/ex_DOEs/study
trial index istrial index is  00

[I 2022-09-14 14:22:17,167] Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 0 with value: 450.0.
Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 0 with value: 450.0.
[I 2022-09-14 14:22:17,182] Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 0 with value: 450.0.
Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 0 with value: 450.0.
trial index is 1
[I 2022-09-14 14:22:17,604] Trial 1 finished with value: 0.04000000000000001 and parameters: {'x': 0.2, 'y': 0}. Best is trial 1 with value: 0.04000000000000001.
Trial 1 finished with value: 0.04000000000000001 and parameters: {'x': 0.2, 'y': 0}. Best is trial 1 with value: 0.04000000000000001.
Traceback (most recent call last):
  File "C:\Users\cuism\Desktop\ex_DOEs\HF2_code_DOE_enqueue.py", line 42, in <module>
    study.optimize(objective, n_trials=10, n_jobs=2)
  File "C:\Python\Python39\lib\site-packages\optuna\study\study.py", line 419, in optimize
    _optimize(
  File "C:\Python\Python39\lib\site-packages\optuna\study\_optimize.py", line 103, in _optimize
    f.result()
  File "C:\Python\Python39\lib\concurrent\futures\_base.py", line 438, in result
    return self.__get_result()
  File "C:\Python\Python39\lib\concurrent\futures\_base.py", line 390, in __get_result
    raise self._exception
  File "C:\Python\Python39\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Python\Python39\lib\site-packages\optuna\study\_optimize.py", line 160, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "C:\Python\Python39\lib\site-packages\optuna\study\_optimize.py", line 208, in _run_trial
    frozen_trial = _tell_with_warning(
  File "C:\Python\Python39\lib\site-packages\optuna\study\_tell.py", line 197, in _tell_with_warning
    study._storage.set_trial_state_values(trial_id, state, values)
  File "C:\Python\Python39\lib\site-packages\optuna\storages\_cached_storage.py", line 261, in set_trial_state_values
    self._check_trial_is_updatable(cached_trial)
  File "C:\Python\Python39\lib\site-packages\optuna\storages\_cached_storage.py", line 386, in _check_trial_is_updatable
    raise RuntimeError(
RuntimeError: Trial#0 has already finished and can not be updated.

Steps to reproduce

  1. run the attached codes several times, the issue will come
import optuna
import time
import random
import logging
import sys
import os

def objective(trial):
    print("trial index is", trial.number)
   
    x = trial.suggest_float("x", -100, 100)
    y = trial.suggest_float("y", -15, 30)
    obj=x ** 2 + y ** 2
    #time.sleep(random.uniform(1,5))
    return obj
    

file_path = "C:/Users/cuism/Desktop/ex_DOEs/study.db"
if os.path.isfile(file_path):
  os.remove(file_path)
  print("File has been deleted")
else:
  print("File does not exist")
  
optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))
study_name = "C:/Users/cuism/Desktop/ex_DOEs/study"  # Unique identifier of the study.
storage_name = "sqlite:///{}.db".format(study_name)  


try_these_parameters_first=[{"x": -15, "y":-15},{"x": 0.2, "y":0},{"x": 30, "y":30},{"x": 20, "y":20}]

sampler = optuna.integration.BoTorchSampler( n_startup_trials=5)
study = optuna.create_study(sampler=sampler,study_name=study_name, storage=storage_name)

study.enqueue_trial(try_these_parameters_first[0])
study.enqueue_trial(try_these_parameters_first[1])
study.enqueue_trial(try_these_parameters_first[2])
study.enqueue_trial(try_these_parameters_first[3])


study.optimize(objective, n_trials=10, n_jobs=2)

Additional context (optional)

No response

@cuismabc cuismabc added the bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. label Sep 14, 2022
@nzw0301 nzw0301 changed the title enqueue , storage & parallelization enqueue, storage & parallelization Sep 18, 2022
@not522
Copy link
Member

not522 commented Sep 22, 2022

Thank you for your bug report. We think this bug occurs when SQLite is used for parallel optimization.

# "FOR UPDATE" clause is used for row-level locking.
# Please note that SQLite3 doesn't support this clause.
if for_update:
query = query.with_for_update()

If you need a database for parallel optimization, please try other RDB storage. (Or, please wait for the v3.1.0 release, which includes new style storage #3854)

@c-bata
Copy link
Member

c-bata commented Sep 28, 2022

Thank you for reporting the problem. I'll update the documentation to let Optuna users know this feature does not work with SQLite3.

@wattlebirdaz I tried using JournalFileStorage for this usecase. However, I got the following error. If you have time, could you please help us fix this bug?

https://gist.github.com/c-bata/e68678f8ee248f82d378b0cd3bee8d83#file-example-journal-py

$ python example-journal.py
/Users/c-bata/go/src/github.com/optuna/optuna/example.py:17: ExperimentalWarning: BoTorchSampler is experimental (supported from v2.4.0). The interface can change in the future.
  sampler = optuna.integration.BoTorchSampler(n_startup_trials=5)
/Users/c-bata/go/src/github.com/optuna/optuna/example.py:19: ExperimentalWarning: JournalStorage is experimental (supported from v3.1.0). The interface can change in the future.
  storage = optuna.storages.JournalStorage(file_storage)
[I 2022-09-28 15:53:53,753] A new study created in Journal with name: no-name-1ac00f93-3de5-400f-8d0a-3c8602fe771e
[D 2022-09-28 15:53:53,757] Trial 0 popped from the trial queue.
[D 2022-09-28 15:53:53,758] Trial 0 popped from the trial queue.
[D 2022-09-28 15:53:53,758] Trial 0 popped from the trial queue.
[D 2022-09-28 15:53:53,758] Trial 0 popped from the trial queue.
[D 2022-09-28 15:53:53,759] Trial 1 popped from the trial queue.
[I 2022-09-28 15:53:53,766] Trial 1 finished with value: 0.04000000000000001 and parameters: {'x': 0.2, 'y': 0}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-28 15:53:53,767] Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-28 15:53:53,768] Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 1 with value: 0.04000000000000001.
[D 2022-09-28 15:53:53,769] Trial 2 popped from the trial queue.
[I 2022-09-28 15:53:53,769] Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-28 15:53:53,770] Trial 0 finished with value: 450.0 and parameters: {'x': -15, 'y': -15}. Best is trial 1 with value: 0.04000000000000001.
[I 2022-09-28 15:53:53,773] Trial 2 finished with value: 1800.0 and parameters: {'x': 30, 'y': 30}. Best is trial 1 with value: 0.04000000000000001.
Traceback (most recent call last):
  File "/Users/c-bata/go/src/github.com/optuna/optuna/example.py", line 25, in <module>
    study.optimize(objective, n_trials=10, n_jobs=5)
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/study/study.py", line 419, in optimize
    _optimize(
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/study/_optimize.py", line 103, in _optimize
    f.result()
  File "/Users/c-bata/.pyenv_x64/versions/3.9.11/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/Users/c-bata/.pyenv_x64/versions/3.9.11/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/Users/c-bata/.pyenv_x64/versions/3.9.11/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/study/_optimize.py", line 160, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/study/_optimize.py", line 208, in _run_trial
    frozen_trial = _tell_with_warning(
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/study/_tell.py", line 173, in _tell_with_warning
    study._storage.set_trial_state_values(frozen_trial._trial_id, state, values)
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/storages/_journal/storage.py", line 258, in set_trial_state_values
    self._sync_with_backend()
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/storages/_journal/storage.py", line 104, in _sync_with_backend
    self._replay_result.apply_logs(logs)
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/storages/_journal/storage.py", line 351, in apply_logs
    self._apply_set_trial_state_values(log)
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/storages/_journal/storage.py", line 548, in _apply_set_trial_state_values
    if not self._trial_exists_and_updatable(trial_id, log):
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/storages/_journal/storage.py", line 605, in _trial_exists_and_updatable
    self._raise_if_log_issued_by_pid(
  File "/Users/c-bata/go/src/github.com/optuna/optuna/optuna/storages/_journal/storage.py", line 389, in _raise_if_log_issued_by_pid
    raise err
RuntimeError: Trial#0 has already finished and can not be updated.

@rotaki
Copy link
Collaborator

rotaki commented Sep 29, 2022

Study._pop_waiting_trial_id() should be thread-safe because it calls storage.get_all_trials and storage.set_trial_state_values sequentially. If more than two threads interleave when executing Study._pop_waiting_trial_id(), one of them might see a trial with a COMPLETED state when executing self._storage.set_trial_state_values(trial._trial_id, state=TrialState.RUNNING). I am still investigating why this bug is not caused in RDB. (I might be wrong but I assume RDBs hold the lock in a larger granularity, which prevents from causing this probelm.)

def _pop_waiting_trial_id(self) -> Optional[int]:
for trial in self._storage.get_all_trials(
self._study_id, deepcopy=False, states=(TrialState.WAITING,)
):
if not self._storage.set_trial_state_values(trial._trial_id, state=TrialState.RUNNING):
continue
_logger.debug("Trial {} popped from the trial queue.".format(trial.number))
return trial._trial_id
return None

In the mean time, a quick way to fix this for both sqlite3 and JournalStorage should be the following:

Define self._enqueue_trial_lock = threading.Lock() in Study class and

    def _pop_waiting_trial_id(self) -> Optional[int]:
        with self._enqueue_trial_lock:
            for trial in self._storage.get_all_trials(
                self._study_id, deepcopy=False, states=(TrialState.WAITING,)
            ):
                if not self._storage.set_trial_state_values(trial._trial_id, state=TrialState.RUNNING):
                    continue
                print("Thread:{} Successed in popping: {}".format(threading.get_ident(), trial._trial_id))

                _logger.debug("Trial {} popped from the trial queue.".format(trial.number))
                return trial._trial_id

            return None

@c-bata
Copy link
Member

c-bata commented Sep 30, 2022

@wattlebirdaz Thank you for your investigation.
I've opened a pull request to make the JournalStorage.set_trial_state_values() method thread-safe. I'll make it ready for review after added a test case to cover this scenario.

@c-bata
Copy link
Member

c-bata commented Sep 30, 2022

@cuismabc Thank you for reporting this problem! I opened a pull request to update the documentation.
#4034

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants