New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix frequent deadlock caused by conditional locks. #1514
Fix frequent deadlock caused by conditional locks. #1514
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked this PR passes tests in #1497.
I wonder how we can reduce communication costs by filtering TrialParamModel
and others, but I believe this should be addressed in a separate PR.
Changes are minimal and it also works quite well. Overall, this PR LGTM.
Thanks for you quick review. You're right that it's a somewhat naive implementation in terms of performance. I can add a TODO comment about that. |
optuna/storages/_rdb/storage.py
Outdated
.options(orm.selectinload(models.TrialModel.params)) | ||
.options(orm.selectinload(models.TrialModel.values)) | ||
.options(orm.selectinload(models.TrialModel.user_attributes)) | ||
.options(orm.selectinload(models.TrialModel.system_attributes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dumped the SQL queries related to this line. The queries are the same whether we add selectinload
or not as follows:
SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trial
s.state AS trials_state, trials.value AS trials_value, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_complete
FROM trials
WHERE trials.trial_id = %(trial_id_1)s FOR UPDATE
As can be seen in the example in the reference, the change is can be seen in the following SELECT queries.
# without selectinload
SELECT trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes.`key` AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json
FROM trial_user_attributes
WHERE trial_user_attributes.trial_id = %(trial_id_1)s
# with selectionload
SELECT trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.`key` AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json
FROM trial_user_attributes
WHERE trial_user_attributes.trial_id IN (%(primary_keys_1)s) ORDER BY trial_user_attributes.trial_id```
So, I guess the selectinload
option does not affect the record lock, but I may misunderstand the lock mechanism. What do you think of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the following code to dump SQL queries: https://gist.github.com/toshihikoyanase/f839e9ab9b2cbba23366b635f12b79da
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, you're right. We probably need to do a proper "JOIN" (e.g. orm.joinedload
) here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I verified that this properly locks backref tables. Sorry but could you check it again?
Note that we cannot do an INNER JOIN
(innerjoin=True
) and must do an LEFT OUTER JOIN
since backref tables don't necessarily have rows corresponding to the trial ID when we enter this section. Performance-wise, this change doesn't seem to have any noticeable impact, but I've only tested on a very trivial case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change seems good to me, but I sometimes encountered the deadlock when I ran the tests in #1497 as follows:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: INSERT INTO trial_user_attributes (trial_id, `key`, value_json) VALUES (%(trial_id)s, %(key)s, %(value_json)s)]
[parameters: {'trial_id': 23, 'key': 'x', 'value_json': '7.048336196015672'}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
During handling of the above exception, another exception occurred:
optuna.exceptions.StorageInternalError: An exception is raised during the commit. This typically happens due to invalid data in the commit, e.g. exceeding max length. (The actual exception is as follows: OperationalError("(pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')"))
Do you think we need retry like #1509?
FYI: The query with orm.joinedload
is converted into the following SQL query:
SELECT trials.trial_id AS trials_trial_id, trials.number AS trials_number, trials.study_id AS trials_study_id, trial
s.state AS trials_state, trials.value AS trials_value, trials.datetime_start AS trials_datetime_start, trials.datetime_complete AS trials_datetime_comp
lete, trial_user_attributes_1.trial_user_attribute_id AS trial_user_attributes_1_trial_user_attribute_id, trial_user_attributes_1.trial_id AS trial_use
r_attributes_1_trial_id, trial_user_attributes_1.`key` AS trial_user_attributes_1_key, trial_user_attributes_1.value_json AS trial_user_attributes_1_va
lue_json, trial_system_attributes_1.trial_system_attribute_id AS trial_system_attributes_1_trial_system_attribute_id, trial_system_attributes_1.trial_i
d AS trial_system_attributes_1_trial_id, trial_system_attributes_1.`key` AS trial_system_attributes_1_key, trial_system_attributes_1.value_json AS tria
l_system_attributes_1_value_json, trial_params_1.param_id AS trial_params_1_param_id, trial_params_1.trial_id AS trial_params_1_trial_id, trial_params_
1.param_name AS trial_params_1_param_name, trial_params_1.param_value AS trial_params_1_param_value, trial_params_1.distribution_json AS trial_params_1
_distribution_json, trial_values_1.trial_value_id AS trial_values_1_trial_value_id, trial_values_1.trial_id AS trial_values_1_trial_id, trial_values_1.
step AS trial_values_1_step, trial_values_1.value AS trial_values_1_value
FROM trials LEFT OUTER JOIN trial_user_attributes AS trial_user_attributes_1 ON trials.trial_id = trial_user_attributes_1.trial_id LEFT OUTER JOIN trial_system_attributes AS trial_system_attributes_1 ON trials.trial_id = trial_system_attributes_1.trial_id LEFT OUTER JOIN trial_params AS trial_params_1 ON trials.trial_id = trial_params_1.trial_id LEFT OUTER JOIN trial_values AS trial_values_1 ON trials.trial_id = trial_values_1.trial_id
WHERE trials.trial_id = %(trial_id_1)s FOR UPDATE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional information, this time the query seems to align with the expectation. It's still not entirely clear to me why the deadlocks occur however. As for this issue, I changed the logic to not lock at all as they actually shouldn't be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your update. I read the comment #1514 (comment), and I think your latest change sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I confirmed this PR passes the test in #1497 on my local MySQL server.
This is not a review comment but just a question: Why the "eager loading" resolves deadlock? According to the reference, we can use the eager loading to reduce the number of queries. I'm not sure why this mechanism resolves deadlock.
That the number of queries are reduced is actually not that important here. "Eager loading", or importantly Details might vary between dialects but these references might shed some light. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the kindful reply and confirmation. I understood that this eager loading results in a clear locking order and resolve the deadlock.
I saw the reference of eager loading for SQLAlchemy and confirmed that we should use joinedload
here.
LGTM!
Sorry for the confusion. I re-tried the test example in #1497, then I found the deadlock. I dismissed my approval.
Let me clarify the cause of the original bugs and changes introduced by this PR. There'll be several methods to address deadlocks, including retry-clause, but I'm currently wondering a more fundamental question: why do we need locks for the first place. Currently, optuna's spec requires that RUNNING trials are updated from only a single worker, and I think we don't need any locks. After you remove |
The locks were introduced here 9ba2cf6, but yes I agree that those shouldn't be needed given the storage specification that a trial shouldn't be updated from a different worker than its creator. I removed the |
87e913f
to
04c43e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update. I confirmed this PR passes the test in #1497 on my local MySQL server. I agree with the idea no to acquire any lock here. LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm glad to make the code simple though the fruitful discussion.
Thank you!
A rather fruitful detour for sure. Thank you all for the fast actions. |
Motivation
Alternative to #1509. See also #1509 (comment).
This fix does not contain any retries but as far as I could verify locally, deadlocks no longer occur.
Description of the changes
Only uses a single lock, over all trial-related tables.