Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix frequent deadlock caused by conditional locks. #1514

Merged
merged 4 commits into from Jul 14, 2020

Conversation

hvy
Copy link
Member

@hvy hvy commented Jul 14, 2020

Motivation

Alternative to #1509. See also #1509 (comment).

This fix does not contain any retries but as far as I could verify locally, deadlocks no longer occur.

Description of the changes

Only uses a single lock, over all trial-related tables.

@hvy hvy added bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. optuna.storages Related to the `optuna.storages` submodule. This is automatically labeled by github-actions. labels Jul 14, 2020
@hvy hvy requested a review from ytsmiling July 14, 2020 02:39
Copy link
Member

@ytsmiling ytsmiling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked this PR passes tests in #1497.
I wonder how we can reduce communication costs by filtering TrialParamModel and others, but I believe this should be addressed in a separate PR.
Changes are minimal and it also works quite well. Overall, this PR LGTM.

@hvy
Copy link
Member Author

hvy commented Jul 14, 2020

Thanks for you quick review. You're right that it's a somewhat naive implementation in terms of performance. I can add a TODO comment about that.

.options(orm.selectinload(models.TrialModel.params))
.options(orm.selectinload(models.TrialModel.values))
.options(orm.selectinload(models.TrialModel.user_attributes))
.options(orm.selectinload(models.TrialModel.system_attributes))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dumped the SQL queries related to this line. The queries are the same whether we add selectinload or not as follows:

SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trial
s.state AS trials_state, trials.value AS trials_value, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete 
FROM trials 
WHERE trials.trial_id = %(trial_id_1)s FOR UPDATE

As can be seen in the example in the reference, the change is can be seen in the following SELECT queries.

# without selectinload
SELECT trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes.`key` AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json 
FROM trial_user_attributes 
WHERE trial_user_attributes.trial_id = %(trial_id_1)s

# with selectionload
SELECT trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.`key` AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json 
FROM trial_user_attributes 
WHERE trial_user_attributes.trial_id IN (%(primary_keys_1)s) ORDER BY trial_user_attributes.trial_id```

So, I guess the selectinload option does not affect the record lock, but I may misunderstand the lock mechanism. What do you think of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the following code to dump SQL queries: https://gist.github.com/toshihikoyanase/f839e9ab9b2cbba23366b635f12b79da

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, you're right. We probably need to do a proper "JOIN" (e.g. orm.joinedload) here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I verified that this properly locks backref tables. Sorry but could you check it again?

Note that we cannot do an INNER JOIN (innerjoin=True) and must do an LEFT OUTER JOIN since backref tables don't necessarily have rows corresponding to the trial ID when we enter this section. Performance-wise, this change doesn't seem to have any noticeable impact, but I've only tested on a very trivial case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems good to me, but I sometimes encountered the deadlock when I ran the tests in #1497 as follows:

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: INSERT INTO trial_user_attributes (trial_id, `key`, value_json) VALUES (%(trial_id)s, %(key)s, %(value_json)s)]
[parameters: {'trial_id': 23, 'key': 'x', 'value_json': '7.048336196015672'}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)

During handling of the above exception, another exception occurred:

optuna.exceptions.StorageInternalError: An exception is raised during the commit. This typically happens due to invalid data in the commit, e.g. exceeding max length. (The actual exception is as follows: OperationalError("(pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')"))

Do you think we need retry like #1509?


FYI: The query with orm.joinedload is converted into the following SQL query:

SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trial
s.state AS trials_state, trials.value AS trials_value, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_comp
lete, trial_user_attributes_1.trial_user_attribute_id AS trial_user_attributes_1_trial_user_attribute_id, trial_user_attributes_1.trial_id AS trial_use
r_attributes_1_trial_id, trial_user_attributes_1.`key` AS trial_user_attributes_1_key, trial_user_attributes_1.value_json AS trial_user_attributes_1_va
lue_json, trial_system_attributes_1.trial_system_attribute_id AS trial_system_attributes_1_trial_system_attribute_id, trial_system_attributes_1.trial_i
d AS trial_system_attributes_1_trial_id, trial_system_attributes_1.`key` AS trial_system_attributes_1_key, trial_system_attributes_1.value_json AS tria
l_system_attributes_1_value_json, trial_params_1.param_id AS trial_params_1_param_id, trial_params_1.trial_id AS trial_params_1_trial_id, trial_params_
1.param_name AS trial_params_1_param_name, trial_params_1.param_value AS trial_params_1_param_value, trial_params_1.distribution_json AS trial_params_1
_distribution_json, trial_values_1.trial_value_id AS trial_values_1_trial_value_id, trial_values_1.trial_id AS trial_values_1_trial_id, trial_values_1.
step AS trial_values_1_step, trial_values_1.value AS trial_values_1_value 
FROM trials LEFT OUTER JOIN trial_user_attributes AS trial_user_attributes_1 ON trials.trial_id = trial_user_attributes_1.trial_id LEFT OUTER JOIN trial_system_attributes AS trial_system_attributes_1 ON trials.trial_id = trial_system_attributes_1.trial_id LEFT OUTER JOIN trial_params AS trial_params_1 ON trials.trial_id = trial_params_1.trial_id LEFT OUTER JOIN trial_values AS trial_values_1 ON trials.trial_id = trial_values_1.trial_id 
WHERE trials.trial_id = %(trial_id_1)s FOR UPDATE

Copy link
Member Author

@hvy hvy Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional information, this time the query seems to align with the expectation. It's still not entirely clear to me why the deadlocks occur however. As for this issue, I changed the logic to not lock at all as they actually shouldn't be needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your update. I read the comment #1514 (comment), and I think your latest change sounds good to me.

Copy link
Member

@HideakiImamura HideakiImamura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I confirmed this PR passes the test in #1497 on my local MySQL server.

This is not a review comment but just a question: Why the "eager loading" resolves deadlock? According to the reference, we can use the eager loading to reduce the number of queries. I'm not sure why this mechanism resolves deadlock.

@hvy
Copy link
Member Author

hvy commented Jul 14, 2020

That the number of queries are reduced is actually not that important here. "Eager loading", or importantly JOINs, will make sure that locks propagate to joined tables (not just TrialModel but also TrialParamModel, etc.). This should prevent race conditions.

Details might vary between dialects but these references might shed some light.

HideakiImamura
HideakiImamura previously approved these changes Jul 14, 2020
Copy link
Member

@HideakiImamura HideakiImamura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the kindful reply and confirmation. I understood that this eager loading results in a clear locking order and resolve the deadlock.

I saw the reference of eager loading for SQLAlchemy and confirmed that we should use joinedload here.

LGTM!

@ytsmiling ytsmiling self-requested a review July 14, 2020 06:23
@HideakiImamura HideakiImamura dismissed their stale review July 14, 2020 06:27

Sorry for the confusion. I re-tried the test example in #1497, then I found the deadlock. I dismissed my approval.

@ytsmiling
Copy link
Member

ytsmiling commented Jul 14, 2020

Let me clarify the cause of the original bugs and changes introduced by this PR.
I initially thought that the problem was caused by #1496. However, after this PR, discussions, and investigation of logs provided in #1514 (comment), I'm rather skeptical about why the deadlock was caused and why eager-loading resolves the problem. I think the biggest diff in this PR is that it removes with_for_update from several queries. This removes oddly nested FOR UPDATE queries and it's plausible that it mitigates the frequent dead-lock.
When you investigate the log when we use selectinload, you can find that tables are locked individually.
After we switched to joinedload, we can see that the query locks multiple tables at the same time. Though I'm not 100% sure, but it's not that surprising that it can cause a dead-lock depending on DB implementations.

There'll be several methods to address deadlocks, including retry-clause, but I'm currently wondering a more fundamental question: why do we need locks for the first place. Currently, optuna's spec requires that RUNNING trials are updated from only a single worker, and I think we don't need any locks. After you remove with_for_update from the first query, you'll won't get the frequent deadlock in #1514 (comment).

@hvy
Copy link
Member Author

hvy commented Jul 14, 2020

The locks were introduced here 9ba2cf6, but yes I agree that those shouldn't be needed given the storage specification that a trial shouldn't be updated from a different worker than its creator. I removed the additional locks in the latest commit.

@hvy hvy force-pushed the fix-conditional-lock-deadlock branch from 87e913f to 04c43e2 Compare July 14, 2020 08:02
Copy link
Member

@ytsmiling ytsmiling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick actions and detailed investigations by both @hvy and other reviewers.
I confirmed that this PR passes #1497, and this PR LGTM.

The eager-join can be an orthogonal important improvement and I'm looking forward to another PR when it improves performance.

Copy link
Member

@HideakiImamura HideakiImamura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I confirmed this PR passes the test in #1497 on my local MySQL server. I agree with the idea no to acquire any lock here. LGTM.

Copy link
Member

@toshihikoyanase toshihikoyanase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm glad to make the code simple though the fruitful discussion.
Thank you!

@toshihikoyanase toshihikoyanase merged commit 97a776b into optuna:master Jul 14, 2020
@hvy hvy deleted the fix-conditional-lock-deadlock branch July 14, 2020 08:18
@toshihikoyanase toshihikoyanase added this to the v2.0.0 milestone Jul 14, 2020
@hvy
Copy link
Member Author

hvy commented Jul 14, 2020

A rather fruitful detour for sure. Thank you all for the fast actions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. optuna.storages Related to the `optuna.storages` submodule. This is automatically labeled by github-actions.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants