New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 #7240
Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 #7240
Conversation
@barrywhart Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/9262493758. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details. |
Documentation preview for b2c13d8 will be available here when this CircleCI job completes successfully. More info
|
a57de8d
to
d1ec0ab
Compare
I think it should be good now. I recreated the branch and force pushed with this commit message:
|
Hi @barrywhart the test suites for database operations are here: https://github.com/mlflow/mlflow/blob/master/tests/db/ |
Thanks, I'll work on that! It may be some time next week. It turns out that "test what happens when the database is down" can surface lots of work to do in several different areas. 🤣 The larger context is, we're hosting MLflow ourselves in Google Cloud, and Google's routine, automated maintenance will occasionally take the database down for 30-60 seconds. Major database upgrades or making database schema changes may also require outages. Ideally, we want MLflow client applications to wait at least a few minutes before failing, in order to avoid unnecessary alerts and failures. |
Sounds good @barrywhart! I'll review and kick off CI once that test is added :) In the meantime, let me know if you have any questions or get stuck. |
@BenWilson2: I added a test for the new error handling behavior. I couldn't find any tests that used a database other than SQLite, so the test uses SQLite. Even though SQLite is a local database, thus not prone to OperationalError, I think this is still a reasonable test, since all databases use the same, new |
tests/db/test_tracking_operations.py
Outdated
mlflow.log_param("p", "param") | ||
mlflow.log_metric("m", 1.0) | ||
mlflow.set_tag("t", "tag") | ||
mlflow.pyfunc.log_model( | ||
artifact_path="model", python_model=Model(), registered_model_name="model" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which line throws an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two exceptions are thrown. This line:
mlflow.log_param("p", "param")
and this line:
mlflow.pyfunc.log_model(
artifact_path="model", python_model=Model(), registered_model_name="model"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.pytest.org/en/7.1.x/reference/reference.html#pytest.raises might help.
If mlflow.log_param("p", "param")
throws an exception, the lines after that will not be executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, updated!
tests/db/test_tracking_operations.py
Outdated
|
||
monkeypatch.setattr(sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect_pysqlite, "dbapi", dbapi) | ||
|
||
with pytest.raises(mlflow.MlflowException): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises(mlflow.MlflowException): | |
with pytest.raises(mlflow.MlflowException, match=...): |
Can we add match
? pytest.raises(mlflow.MlflowException)
matches any MlflowException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
8fd2788
to
89fa8d4
Compare
@harupy, @BenWilson2: This is ready for review hopefully final?). |
tests/db/test_tracking_operations.py
Outdated
class ConnectionWrapper: | ||
"""Wraps a sqlite3.Connection object.""" | ||
|
||
def __init__(self, conn): | ||
self.conn = conn | ||
|
||
def __getattr__(self, name): | ||
return getattr(self.conn, name) | ||
|
||
def cursor(self): | ||
"""Return a wrapped SQLite cursor.""" | ||
return CursorWrapper(self.conn.cursor()) | ||
|
||
class CursorWrapper: | ||
"""Wraps a sqlite3.Cursor object.""" | ||
|
||
def __init__(self, cursor): | ||
self.cursor = cursor | ||
|
||
def __getattr__(self, name): | ||
return getattr(self.cursor, name) | ||
|
||
def execute(self, *args, **kwargs): | ||
"""Wraps execute(), simulating sporadic OperationalErrors.""" | ||
nonlocal execute_counter | ||
if execute_counter is not None and "pragma" not in args[0].lower(): | ||
execute_counter += 1 | ||
if execute_counter % 3 == 0: | ||
# Simulate a database error | ||
raise sqlite3.OperationalError("test") | ||
return self.cursor.execute(*args, **kwargs) | ||
|
||
def connect(*args, **kwargs): | ||
"""Wraps sqlite3.dbapi.connect(), returning a wrapped connection.""" | ||
global connect_counter | ||
conn = old_connect(*args, **kwargs) | ||
return ConnectionWrapper(conn) | ||
|
||
def dbapi(*args, **kwargs): | ||
"""Wraps SQLiteDialect_pysqlite.dbapi(), returning patched dbapi.""" | ||
nonlocal api_module, old_connect | ||
if api_module is None: | ||
# Only patch the first time dbapi() is called, to avoid recursion. | ||
api_module = old_dbapi(*args, **kwargs) | ||
old_connect = api_module.connect | ||
monkeypatch.setattr(api_module, "connect", connect) | ||
return api_module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need these wrappers? Can we use unittest.mock.patch
?
with mock.patch("...", side_effect=sqlite3.OperationalError("test")):
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not able to patch these functions because Python's SQLite module is mostly implemented in C. Those functions are not patchable -- you'll get a runtime error if you try.
I fixed a couple of issues with the tests. However, I saw some errors in "(mleap / models / 0.21.0)" that I think are not related to this PR:
|
e5d0e20
to
fec725f
Compare
I recreated the branch to address a DCO issue, and I fixed a couple of pylint issues. I think everything should be working now, although as mentioned above, I had seen some errors in ""(mleap / models / 0.21.0)" that seemed unrelated to this PR. |
The MLeap failures have nothing to do with this. Anything in the |
I see why the database test was failing -- it relies on mocking/wrapping parts of SQLite to force a |
@harupy, @BenWilson2: I think this is ready for review. Thanks! |
@harupy: Can I get a review, please? It's a pretty simple change to the production code, and the tests are passing now. I'd love to get this change released so MLflow can handle SQL database downtime (whether planned or unplanned). 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @barrywhart !
Head branch was pushed to by a user without write access
@dbczumar: Ben's change to the code required an update to the test -- I just fixed that. Can you please approve the tests to run? Then this should be ready to merge. 🙏 |
Head branch was pushed to by a user without write access
…t 400 Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Co-authored-by: Ben Wilson <39283302+BenWilson2@users.noreply.github.com> Signed-off-by: Corey Zumar <39497902+dbczumar@users.noreply.github.com> Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
a5366cb
to
a576a8b
Compare
@dbczumar: Ok, the failed test should work now. 🤞 |
Hmm, the new error is very concerning -- on MySQL, it's throwing an OperationalError for a constraint violation. In order for this change to make sense, OperationalError needs to imply the error can be retried. The DBAPI docs seem to imply this is the case, but if databases don't respect it, it's not workable, at least without some way to determine retryable vs not. I tried to find a list of MySQL error codes to see if there's a clear pattern which errors are retryable, but I can't find a list. |
A little more research into Postgres and MySQL. Postgres has a pretty careful mapping of error codes to exception types. This is what I expected from all databases: https://github.com/psycopg/psycopg2/blob/master/psycopg/error_type.c#L62L139. The PyMySQL driver does this also: https://github.com/PyMySQL/PyMySQL/blob/main/pymysql/err.py. MySQLClient has similar code, but not very sophisticated. Near the bottom of this section, it appears that almost any error code not handled earlier defaults to So it seems this change may be useful depending on the quality of the database client package. What do you think? One option is to add this behavior but make it optional. For the database & driver combinations where it works, this change is really valuable for gracefully handling brief database outages, but for bad ones, it would lead to inappropriate retries, e.g. on NULL integrity constraint violations (the failing test). It's also interesting that all the tests were passing a few weeks ago. Perhaps the tests are using a different MySQL driver package now, either an entirely different package or a different version. |
Thanks for the detailed analysis! I'm okay with mapping all instances of this error to 503s :). Happy to merge once we adjust the test case. |
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Head branch was pushed to by a user without write access
@dbczumar: Great, thanks! I updated the test case and added a brief code comment about the inconsistent error reporting behavior. |
Thanks @barrywhart ! |
How do we keep seeing a distinct new failure on each run? It's like the database clients are changing every day! I'll look at this failure tonight or tomorrow. Any concerns if I look at pinning some dependencies? |
No worries! Thanks for taking a look. I'd rather we make the test a bit more forgiving (e.g. check the version of SQLALchemy and use that to determine which sqlite dbapi to reference) as opposed to pin test dependencies. It's important to ensure we're running our tests against the latest library versions. |
One strange thing is, in the MySQL tests, it had installed both PyMySQL and MySQLClient (I'm not talking about package versions; these are two completely different db client packages). It seems like the test should only install one, or if it needs to test both, it should do multiple test runs, one with each client package. Do you know anything about this? It's possible this is happening with other databases as well. |
…03_service_temporarily_unavailable
I think I figured out the issue, but I'm also surprised. The latest stable release of SQLAlchemy is 1.4.44. There are release candidates available for 2.0. Apparently the GitHub build installs a release candidate, but when I build and run locally using the same Docker commands:
it works for me, presumably because I'm getting the stable version of SQLAlchemy. Anyway, fix incoming... |
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Head branch was pushed to by a user without write access
@dbczumar: Ok, I fixed the SQLite issue -- the updated test should now work with SQLAlchemy 1.x (current stable) or 2.x (currently in "release candidate" status). |
@dbczumar: If you have a moment, could you approve the build to run? Feeling 🤞🍀 about the build passing this time. 😬 |
Awesome work, @barrywhart ! Thanks so much! Merging! |
@dbczumar: Thank you! Looking forward to updating once this is released. 🙏 |
Related Issues/PRs
Fixes #7238What changes are proposed in this pull request?
Created additional
except
block inmlflow/store/db/utils.py
make_managed_session()
function. Catches SQLAlchemyOperationalError
(e.g. database down) and reports HTTP error 503 (TEMPORARILY_UNAVAILABLE
). This triggers the standard MLflow client "exponential backoff" retry behavior, which avoids causing clients to unnecessarily fail (assuming the database issue is fixed soon).How is this patch tested?
I started this test script and left it running:
I started the MLflow server as follows:
As the test script was running, I manually ran
service postgresql stop
andservice postgresql start
several times to confirm that:Previously, the test script failed immediately and exited with the stack backtrace provided in the linked issue (#7238).
Does this PR change the documentation?
Release Notes
Is this a user-facing change?
If the SQL database is down, MLflow REST API reports a retryable HTTP error 503 (
TEMPORARILY_UNAVAILABLE
), which will be automatically retried by the MLflow client library. Previously, this was reported as a non-retryable HTTP error (400).What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/artifacts
: Artifact stores and artifact loggingarea/build
: Build and test infrastructure for MLflowarea/docs
: MLflow documentation pagesarea/examples
: Example codearea/model-registry
: Model Registry service, APIs, and the fluent client calls for Model Registryarea/models
: MLmodel format, model serialization/deserialization, flavorsarea/recipes
: Recipes, Recipe APIs, Recipe configs, Recipe Templatesarea/projects
: MLproject format, project running backendsarea/scoring
: MLflow Model server, model deployment tools, Spark UDFsarea/server-infra
: MLflow Tracking server backendarea/tracking
: Tracking Service, tracking client APIs, autologgingInterface
area/uiux
: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/docker
: Docker use across MLflow's components, such as MLflow Projects and MLflow Modelsarea/sqlalchemy
: Use of SQLAlchemy in the Tracking Service or Model Registryarea/windows
: Windows supportLanguage
language/r
: R APIs and clientslanguage/java
: Java APIs and clientslanguage/new
: Proposals for new client languagesIntegrations
integrations/azure
: Azure and Azure ML integrationsintegrations/sagemaker
: SageMaker integrationsintegrations/databricks
: Databricks integrationsHow should the PR be classified in the release notes? Choose one:
rn/breaking-change
- The PR will be mentioned in the "Breaking Changes" sectionrn/none
- No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" sectionrn/feature
- A new user-facing feature worth mentioning in the release notesrn/bug-fix
- A user-facing bug fix worth mentioning in the release notesrn/documentation
- A user-facing documentation change worth mentioning in the release notes