Fix race condition for trial number computation. #1490

hvy · 2020-07-07T09:01:57Z

Motivation

Description of the changes

Allows count_past_trials to circumvent repeatable read isolation level. See the comments in the code for details.

Note

Verified the changes with MySQL by inspection, checking for unique trial numbers after running a distributed optimization. Without the fix, the bug is reproducible by simply running a lot of trials with the script below and checking the result with the following query.

SELECT COUNT(DISTINCT(number)) FROM trials;

import optuna
from optuna.samplers import TPESampler

def objective(trial, n_params):
    return sum(trial.suggest_float(f"x{i}", 0.0, 1.0) for i in range(n_params))

if __name__ == "__main__":
    n_params = 10
    n_trials = 200

    database = "duplicatenumber1488"
    storage = "mysql://root@localhost/duplicatenumber1488"
    study = optuna.create_study(sampler=TPESampler(), study_name=database, storage=storage, load_if_exists=True)

    study.optimize(lambda trial: objective(trial, n_params), n_trials=n_trials)
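
The query above is easiest to read next to the total row count: if the two differ, some trial number was assigned more than once. A minimal sketch of that check follows, using an in-memory SQLite table as a stand-in. The two-column `trials` table and the sample rows are assumptions for illustration, not Optuna's actual schema.

```python
import sqlite3

# Hypothetical, simplified stand-in for the trials table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (trial_id INTEGER PRIMARY KEY, number INTEGER)")
# Sample rows in which number 2 was assigned twice.
conn.executemany(
    "INSERT INTO trials (trial_id, number) VALUES (?, ?)",
    [(1, 0), (2, 1), (3, 2), (4, 2), (5, 3)],
)

# Total rows versus distinct trial numbers; they match only if numbers are unique.
total, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT number) FROM trials"
).fetchone()
# List the numbers that occur more than once, with their occurrence counts.
duplicates = conn.execute(
    "SELECT number, COUNT(*) FROM trials GROUP BY number HAVING COUNT(*) > 1"
).fetchall()
conn.close()

print(total, distinct, duplicates)
```

Against a real study, the same two queries could be run on the MySQL database used by the script above.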

About isolation levels in SQLAlchemy for different dialects:
https://docs.sqlalchemy.org/en/13/core/connections.html#sqlalchemy.engine.Connection.execution_options.params.isolation_level

c-bata · 2020-07-07T09:32:39Z

"Read Uncommitted" is a lower isolation level than "Repeatable Read". It looks like this PR depends on "Dirty Reads", right?

https://en.wikipedia.org/wiki/Isolation_(database_systems)

As that table shows, "Dirty reads" may occur when using "Read Uncommitted", so this fix does not look safe.
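
The concern can be illustrated with a toy model rather than a real database. In the sketch below, `ToyTable` and its methods are hypothetical names. It only mimics which rows a reader can see with and without dirty reads, for a single writer.

```python
class ToyTable:
    """Toy model of one table with committed rows and one open transaction."""

    def __init__(self):
        self.committed = []    # rows visible to every reader
        self.uncommitted = []  # rows written by a transaction still in flight

    def insert(self, row):
        self.uncommitted.append(row)

    def commit(self):
        self.committed.extend(self.uncommitted)
        self.uncommitted.clear()

    def rollback(self):
        self.uncommitted.clear()

    def count(self, dirty):
        # A dirty read also sees rows that have not been committed.
        return len(self.committed) + (len(self.uncommitted) if dirty else 0)


table = ToyTable()
table.insert("trial-a")
table.commit()

# Writer A inserts a trial but has not committed yet.
table.insert("trial-b")

# Reader B derives its trial number from the row count.
number_clean = table.count(dirty=False)  # ignores A's pending row
number_dirty = table.count(dirty=True)   # includes A's pending row

# A fails and rolls back, so the row that B counted never existed.
table.rollback()
rows_after_rollback = table.count(dirty=False)

print(number_clean, number_dirty, rows_after_rollback)
```

In this model the number derived from the dirty read is based on a row that is later rolled back, which is the kind of state a trial number should not depend on.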

hvy · 2020-07-07T14:00:26Z

Thanks @c-bata for your quick comment.

I changed the logic to do row-level locking on all trials for the given study and to retry on failures, rather than relaxing the isolation level. Verified the logic with MySQL and PostgreSQL locally. Note that this has the downside of making the creation of trials significantly slower. However, this cannot be helped given our "constraint" that the number must be unique, etc.
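
As a rough illustration of why the lock is needed, here is a toy in-memory model, not the storage code in this PR. `numbers_without_lock` and `numbers_with_lock` are hypothetical names, and a `threading.Lock` stands in for the database lock.

```python
import threading


def numbers_without_lock(n_workers):
    """Deterministic interleaving: every worker counts before any of them inserts."""
    trials = []
    snapshots = [len(trials) for _ in range(n_workers)]  # all workers read the same count
    for number in snapshots:
        trials.append(number)
    return trials


def numbers_with_lock(n_workers):
    """Count and insert while holding one lock, so the two steps cannot interleave."""
    trials = []
    lock = threading.Lock()  # stands in for the database lock

    def create_trial():
        with lock:
            trials.append(len(trials))

    threads = [threading.Thread(target=create_trial) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trials


unsafe = numbers_without_lock(4)
safe = numbers_with_lock(4)
print(unsafe, safe)
```

The first function reproduces the duplicate numbers by construction. The second serializes the count and the insert, which is also where the extra latency comes from.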

hvy · 2020-07-07T14:27:11Z

Mini benchmark with MySQL.

Code

import argparse
import math
import time

import optuna
from optuna.samplers import TPESampler
import sqlalchemy


class Profile:
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.end = time.time()

    def get(self):
        return self.end - self.start


def build_objective_fun(n_param):
    def objective(trial):
        return sum(
            [
                math.sin(trial.suggest_uniform("param-{}".format(i), 0, math.pi * 2))
                for i in range(n_param)
            ]
        )

    return objective


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("mysql_user", type=str)
    parser.add_argument("mysql_host", type=str)
    args = parser.parse_args()

    storage_str = "mysql+pymysql://{}@{}/".format(args.mysql_user, args.mysql_host)

    optuna.logging.set_verbosity(optuna.logging.CRITICAL)

    print(f"| #params | #trials | time(sec) |")
    print(f"| ------- | ------- | --------- |")

    for n_param in [1, 2, 4, 8, 16, 32]:
        for n_trial in [1, 10, 100, 1000]:
            engine = sqlalchemy.create_engine(storage_str)
            conn = engine.connect()
            conn.execute("commit")
            database_str = "profile_storage_t{}_p{}".format(n_trial, n_param)
            try:
                conn.execute("drop database {}".format(database_str))
            except Exception:
                pass
            conn.execute("create database {}".format(database_str))
            conn.close()

            storage = optuna.storages.get_storage(storage_str + database_str)
            study = optuna.create_study(storage=storage, sampler=TPESampler())

            with Profile() as prof:
                study.optimize(
                    build_objective_fun(n_param), n_trials=n_trial, gc_after_trial=False,
                )

            print(f"| {n_param} | {n_trial} | {prof.get():.2f} |")

#params	#trials	PR(sec)	`master`(sec)
1	1	0.02	0.01
1	10	0.12	0.11
1	100	1.60	1.18
1	1000	35.97	16.84
2	1	0.02	0.02
2	10	0.13	0.11
2	100	1.53	1.25
2	1000	38.03	20.17
4	1	0.02	0.02
4	10	0.14	0.12
4	100	1.83	1.53
4	1000	42.60	26.23
8	1	0.04	0.03
8	10	0.16	0.15
8	100	2.36	2.04
8	1000	52.53	36.11
16	1	0.08	0.06
16	10	0.20	0.20
16	100	3.21	2.98
16	1000	71.93	54.78
32	1	0.11	0.11
32	10	0.30	0.29
32	100	4.95	4.64
32	1000	110.76	93.52

ytsmiling · 2020-07-08T02:17:58Z

optuna/storages/_rdb/storage.py

+            # Lock all trials belonging to this study. This might lead to a deadlock
+            # (`OperationalError`) in which case we will retry.
+            session.query(models.TrialModel).filter(
+                models.TrialModel.study_id == study_id
+            ).with_for_update().all()


How about locking an entry in the studies table instead of locking all of the trials (though I haven't checked whether it improves the latency)?

Ah, that might scale better with the number of trials. Let me just verify it.

Looks good.

#params	#trials	time(sec)
1	1	0.02
1	10	0.11
1	100	1.26
1	1000	17.41
2	1	0.02
2	10	0.12
2	100	1.32
2	1000	20.22
4	1	0.02
4	10	0.13
4	100	1.61
4	1000	25.82
8	1	0.04
8	10	0.16
8	100	2.07
8	1000	35.07
16	1	0.06
16	10	0.20
16	100	3.02
16	1000	55.23
32	1	0.12
32	10	0.30
32	100	4.76
32	1000	90.28

c-bata

LGTM!

Note that this has the downside of making the creation of trials significantly slower. However, this cannot be helped given our "constraint" that the number must be unique, etc.

Totally agree.

hvy · 2020-07-08T07:05:01Z

Thanks @c-bata as always. Regarding the performance hit, it's actually negligible now with @ytsmiling's suggestion of restricting the lock to a single row (at most), instead of all trials.
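
To put numbers on the performance hit, the 1000-trial rows of the two benchmark tables can be compared against `master`. This is only a quick calculation on the figures already posted. It assumes that the PR(sec) column of the first table was measured with the lock on all trials, and the dictionary names are made up for this sketch.

```python
# Times in seconds for 1000 trials, copied from the benchmark tables.
# Keys are the number of parameters.
master = {1: 16.84, 2: 20.17, 4: 26.23, 8: 36.11, 16: 54.78, 32: 93.52}
lock_all_trials = {1: 35.97, 2: 38.03, 4: 42.60, 8: 52.53, 16: 71.93, 32: 110.76}
lock_study_row = {1: 17.41, 2: 20.22, 4: 25.82, 8: 35.07, 16: 55.23, 32: 90.28}


def ratios(times, baseline):
    """Return time / baseline per parameter count, rounded to two decimals."""
    return {n: round(times[n] / baseline[n], 2) for n in baseline}


all_trials_ratio = ratios(lock_all_trials, master)
study_row_ratio = ratios(lock_study_row, master)

for n in master:
    print(
        f"{n:>2} params: all-trials lock x{all_trials_ratio[n]:.2f}, "
        f"study-row lock x{study_row_ratio[n]:.2f}"
    )
```

With these figures the all-trials lock is between roughly 1.2 and 2.1 times slower than `master`, while the single-row lock stays within a few percent of it.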

c-bata

@hvy Now I noticed that we should modify the implementation of count_past_trials.

optuna/optuna/storages/_rdb/models.py

Lines 251 to 257 in d7bdd63

    
    def count_past_trials(self, session):
        # type: (orm.Session) -> int

        trial_count = session.query(func.count(TrialModel.trial_id)).filter(
            TrialModel.study_id == self.study_id, TrialModel.trial_id < self.trial_id
        )
        return trial_count.scalar()

We need to remove TrialModel.trial_id < self.trial_id from here.

c-bata

Sorry, that was my misunderstanding. We now lock the study before inserting a trial, so the current logic has no problem.

ytsmiling

Thank you for addressing this issue. LGTM.

ytsmiling · 2020-07-09T07:56:28Z

It's highly unlikely, but if trial_id is not assigned sequentially, the count_past_trials method can fail (a duplicate number can be assigned). I'm okay with leaving the method as is, but if you'd like to change the implementation, I'll re-review this PR.
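
A small worked example of that edge case, using a toy version of the counting rule. `toy_count_past_trials` and `assign_numbers` are hypothetical helpers written for this illustration and operate on plain lists, not on the database.

```python
def toy_count_past_trials(existing_trial_ids, trial_id):
    """Toy version of the query: count rows in the study with a smaller trial_id."""
    return sum(1 for other in existing_trial_ids if other < trial_id)


def assign_numbers(insert_order):
    """Insert trial ids in the given order and record the number each one gets."""
    existing, numbers = [], {}
    for trial_id in insert_order:
        existing.append(trial_id)
        numbers[trial_id] = toy_count_past_trials(existing, trial_id)
    return numbers


sequential = assign_numbers([1, 2, 3])
out_of_order = assign_numbers([1, 3, 2])  # id 3 is inserted before id 2

print(sequential)
print(out_of_order)
```

With sequential ids every trial gets a distinct number. In the out-of-order case, ids 3 and 2 each see only id 1 below them at insert time, so both are assigned number 1.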

hvy · 2020-07-10T05:09:38Z

optuna/storages/_rdb/storage.py

@@ -464,9 +464,46 @@ def _create_new_trial(

        session = self.scoped_session()

-        # Ensure that that study exists.
-        models.StudyModel.find_or_raise_by_id(study_id, session)
+        try:


Memo: Try n (maybe 3) times and propagate the OperationalError in case they all fail. It should be easier to debug.

hvy · 2020-07-10T07:34:52Z

Changed the logic to propagate SQLAlchemy errors after 3 retries, to reduce the risk of silencing unexpected errors and to aid debugging in cases that would previously have resulted in a maximum recursion depth error.
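
The general shape of such a bounded retry can be sketched as follows. This is not the code in the PR: `TransientError`, `run_with_retries` and `make_flaky` are hypothetical names, and treating "3 retries" as three attempts in total is an assumption.

```python
class TransientError(Exception):
    """Stand-in for a retryable database error such as a deadlock."""


def run_with_retries(operation, max_attempts=3):
    """Call operation, retrying on TransientError and re-raising the last failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the original error


def make_flaky(failures):
    """Return an operation that fails `failures` times before succeeding."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError(f"deadlock on call {state['calls']}")
        return state["calls"]

    return operation


recovered = run_with_retries(make_flaky(failures=2))  # succeeds on the third call

try:
    run_with_retries(make_flaky(failures=3))
    gave_up = False
except TransientError:
    gave_up = True

print(recovered, gave_up)
```

Compared with unbounded recursion, a loop like this fails with the underlying database error once the attempts are used up, which is easier to trace back to its cause.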

hvy · 2020-07-10T07:34:56Z

PTAL.

toshihikoyanase

LGTM!

ytsmiling

LGTM!

HideakiImamura

Thanks! LGTM!

Fix race condition for trial number computation (77f0493)

hvy added the bug and optuna.storages labels Jul 7, 2020

hvy marked this pull request as ready for review July 7, 2020 09:20

c-bata self-requested a review July 7, 2020 09:49

Prefer row level locking over dirty reads (d3060a2)

hvy force-pushed the fix-trial-number-race-condition branch from 2ebd008 to d3060a2 July 7, 2020 13:33

hvy mentioned this pull request Jul 8, 2020: Duplicate trial numbers on distributed optimization #1488 (Closed)

ytsmiling reviewed Jul 8, 2020

hvy added 3 commits July 8, 2020 14:37:
Simplify database locking for trial creation (7a53f3e)
Raise appropriate KeyError for missing studies (acab9e2)
Fix typo (3f40029)

c-bata reviewed Jul 8, 2020

Fix for update/repeatable read location (d7bdd63)

c-bata approved these changes Jul 8, 2020

c-bata added this to the v2.0.0 milestone Jul 8, 2020

c-bata mentioned this pull request Jul 8, 2020: Duplicate trial numbers are assigned on parallel execution. c-bata/goptuna#130 (Closed)

c-bata requested changes Jul 8, 2020

c-bata approved these changes Jul 8, 2020

ytsmiling approved these changes Jul 9, 2020

ytsmiling mentioned this pull request Jul 9, 2020: Deadlock can occur when using MySQL backend. #1499 (Closed)

hvy commented Jul 10, 2020

Propage sqlalchemy error instead of waiting for max num recursions (c7fcf7e)

hvy mentioned this pull request Jul 10, 2020: Fix _CachedStorage and RDBStorage distribution compatibility check race condition. #1506 (Merged)

toshihikoyanase reviewed Jul 10, 2020

toshihikoyanase approved these changes Jul 13, 2020

ytsmiling approved these changes Jul 13, 2020

HideakiImamura approved these changes Jul 13, 2020

HideakiImamura merged commit 4be9da5 into optuna:master Jul 13, 2020

hvy deleted the fix-trial-number-race-condition branch July 13, 2020 01:33

This was referenced Jul 18, 2020:
Distributed Parallel error: Trial has already finished and can not be updated. #1531 (Closed)
Same parameters for multiple trials under distributed parallel processing #1547 (Closed)
