Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 #7240

barrywhart · 2022-11-03T00:29:37Z

Related Issues/PRs

Fixes #7238

What changes are proposed in this pull request?

Created additional except block in mlflow/store/db/utils.py make_managed_session() function. Catches SQLAlchemy OperationalError (e.g. database down) and reports HTTP error 503 (TEMPORARILY_UNAVAILABLE). This triggers the standard MLflow client "exponential backoff" retry behavior, which avoids causing clients to unnecessarily fail (assuming the database issue is fixed soon).

How is this patch tested?

Existing unit/integration tests
New unit/integration tests
Manual tests (describe details, including test results, below)

I started this test script and left it running:

import os
import time
import uuid

import mlflow

mlflow.set_experiment("foo")
num_iterations = 180
for i in range(num_iterations):
    print(f"Creating run {i} of {num_iterations}")
    with mlflow.start_run():
        time.sleep(1)

I started the MLflow server as follows:

PYTHONUNBUFFERED=1 mlflow server --backend-store-uri postgresql://mlflow:password@127.0.0.1/postgres \
    --default-artifact-root /tmp/ --host=0.0.0.0 --port=5007 --workers 1 \
     --gunicorn-opts '--timeout 0 --workers 1 --threads 1 --error-logfile - --log-file - --log-level debug'

As the test script was running, I manually ran service postgresql stop and service postgresql start several times to confirm that:

The script did not fail.
The script "paused" while the database was down, then resumed once the database was up again.

Previously, the test script failed immediately and exited with the stack backtrace provided in the linked issue (#7238).

Does this PR change the documentation?

No. You can skip the rest of this section.
Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

If the SQL database is down, MLflow REST API reports a retryable HTTP error 503 (TEMPORARILY_UNAVAILABLE), which will be automatically retried by the MLflow client library. Previously, this was reported as a non-retryable HTTP error (400).

What component(s), interfaces, languages, and integrations does this PR affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

github-actions · 2022-11-03T00:42:11Z

@barrywhart Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/9262493758. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

mlflow-automation · 2022-11-03T00:42:13Z

Documentation preview for b2c13d8 will be available here when this CircleCI job completes successfully.

More info

Ignore this comment if this PR does not change the documentation.
It takes a few minutes for the preview to be available.
The preview is updated when a new commit is pushed to this PR.
This comment was created by https://github.com/mlflow/mlflow/actions/runs/3602581130.

barrywhart · 2022-11-03T01:34:02Z

I think it should be good now. I recreated the branch and force pushed with this commit message:

    Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400
    
    Signed-off-by: Barry Hart <barry.hart@zoro.com>

BenWilson2 · 2022-11-03T14:46:04Z

Hi @barrywhart the test suites for database operations are here: https://github.com/mlflow/mlflow/blob/master/tests/db/
and the best place to put a service 'pause' validation check is likely here: https://github.com/mlflow/mlflow/blob/master/tests/db/test_tracking_operations.py
If you can get an async thread that submits a subprocess.Popen command to pause the db services (there's no need to validate this behavior for sqlite), sleep for a specified time, and then resume the processes to validate that logging of tracking functionality works, I think that'll cover the test coverage for this addition.

barrywhart · 2022-11-03T16:47:05Z

Thanks, I'll work on that! It may be some time next week. It turns out that "test what happens when the database is down" can surface lots of work to do in several different areas. 🤣

The larger context is, we're hosting MLflow ourselves in Google Cloud, and Google's routine, automated maintenance will occasionally take the database down for 30-60 seconds. Major database upgrades or making database schema changes may also require outages. Ideally, we want MLflow client applications to wait at least a few minutes before failing, in order to avoid unnecessary alerts and failures.

BenWilson2 · 2022-11-03T18:46:52Z

Sounds good @barrywhart! I'll review and kick off CI once that test is added :) In the meantime, let me know if you have any questions or get stuck.

barrywhart · 2022-11-04T14:23:18Z

@BenWilson2: I added a test for the new error handling behavior. I couldn't find any tests that used a database other than SQLite, so the test uses SQLite. Even though SQLite is a local database, thus not prone to OperationalError, I think this is still a reasonable test, since all databases use the same, new except block.

harupy · 2022-11-05T01:15:33Z

tests/db/test_tracking_operations.py

+            mlflow.log_param("p", "param")
+            mlflow.log_metric("m", 1.0)
+            mlflow.set_tag("t", "tag")
+            mlflow.pyfunc.log_model(
+                artifact_path="model", python_model=Model(), registered_model_name="model"
+            )


Which line throws an exception?

Two exceptions are thrown. This line:

mlflow.log_param("p", "param")

and this line:

mlflow.pyfunc.log_model( artifact_path="model", python_model=Model(), registered_model_name="model" )

https://docs.pytest.org/en/7.1.x/reference/reference.html#pytest.raises might help.

If mlflow.log_param("p", "param") throws an exception, the lines after that will not be executed.

Ok, updated!

harupy · 2022-11-05T01:16:06Z

tests/db/test_tracking_operations.py

+
+    monkeypatch.setattr(sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect_pysqlite, "dbapi", dbapi)
+
+    with pytest.raises(mlflow.MlflowException):


Suggested change

with pytest.raises(mlflow.MlflowException):

with pytest.raises(mlflow.MlflowException, match=...):

Can we add match? pytest.raises(mlflow.MlflowException) matches any MlflowException.

barrywhart · 2022-11-08T02:07:28Z

@harupy, @BenWilson2: This is ready for review hopefully final?).

harupy · 2022-11-08T02:19:11Z

tests/db/test_tracking_operations.py

+    class ConnectionWrapper:
+        """Wraps a sqlite3.Connection object."""
+
+        def __init__(self, conn):
+            self.conn = conn
+
+        def __getattr__(self, name):
+            return getattr(self.conn, name)
+
+        def cursor(self):
+            """Return a wrapped SQLite cursor."""
+            return CursorWrapper(self.conn.cursor())
+
+    class CursorWrapper:
+        """Wraps a sqlite3.Cursor object."""
+
+        def __init__(self, cursor):
+            self.cursor = cursor
+
+        def __getattr__(self, name):
+            return getattr(self.cursor, name)
+
+        def execute(self, *args, **kwargs):
+            """Wraps execute(), simulating sporadic OperationalErrors."""
+            nonlocal execute_counter
+            if execute_counter is not None and "pragma" not in args[0].lower():
+                execute_counter += 1
+                if execute_counter % 3 == 0:
+                    # Simulate a database error
+                    raise sqlite3.OperationalError("test")
+            return self.cursor.execute(*args, **kwargs)
+
+    def connect(*args, **kwargs):
+        """Wraps sqlite3.dbapi.connect(), returning a wrapped connection."""
+        global connect_counter
+        conn = old_connect(*args, **kwargs)
+        return ConnectionWrapper(conn)
+
+    def dbapi(*args, **kwargs):
+        """Wraps SQLiteDialect_pysqlite.dbapi(), returning patched dbapi."""
+        nonlocal api_module, old_connect
+        if api_module is None:
+            # Only patch the first time dbapi() is called, to avoid recursion.
+            api_module = old_dbapi(*args, **kwargs)
+            old_connect = api_module.connect
+            monkeypatch.setattr(api_module, "connect", connect)
+        return api_module


Do we need these wrappers? Can we use unittest.mock.patch?

with mock.patch("...", side_effect=sqlite3.OperationalError("test")): ...

I was not able to patch these functions because Python's SQLite module is mostly implemented in C. Those functions are not patchable -- you'll get a runtime error if you try.

barrywhart · 2022-11-08T20:29:13Z

I fixed a couple of issues with the tests.

However, I saw some errors in "(mleap / models / 0.21.0)" that I think are not related to this PR:

FAILED tests/mleap/test_mleap_model_export.py::test_model_deployment - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_mleap_module_model_save_with_relative_path_and_valid_sample_input_produces_mleap_flavor - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_mleap_module_model_save_with_absolute_path_and_valid_sample_input_produces_mleap_flavor - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_mleap_model_log - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_spark_module_model_save_with_relative_path_and_valid_sample_input_produces_mleap_flavor - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).

barrywhart · 2022-11-08T22:23:50Z

I recreated the branch to address a DCO issue, and I fixed a couple of pylint issues.

I think everything should be working now, although as mentioned above, I had seen some errors in ""(mleap / models / 0.21.0)" that seemed unrelated to this PR.

BenWilson2 · 2022-11-08T23:42:32Z

I recreated the branch to address a DCO issue, and I fixed a couple of pylint issues.

I think everything should be working now, although as mentioned above, I had seen some errors in ""(mleap / models / 0.21.0)" that seemed unrelated to this PR.

The MLeap failures have nothing to do with this. Anything in the database or python tests sections will be, though.

barrywhart · 2022-11-09T01:46:30Z

I see why the database test was failing -- it relies on mocking/wrapping parts of SQLite to force a sqlalchemy.OperationalError. I updated the test so it is skipped when running on other databases. Since the code under test is generic (i.e. it catches a SQLAlchemy exception, not a database specific one), testing with a single database back end should be good enough.

barrywhart · 2022-11-10T17:56:54Z

@harupy, @BenWilson2: I think this is ready for review. Thanks!

barrywhart · 2022-11-29T17:11:20Z

@harupy: Can I get a review, please? It's a pretty simple change to the production code, and the tests are passing now.

I'd love to get this change released so MLflow can handle SQL database downtime (whether planned or unplanned).

🙏

mlflow/store/db/utils.py

dbczumar

LGTM! Thanks @barrywhart !

barrywhart · 2022-11-30T20:50:47Z

@dbczumar: Ben's change to the code required an update to the test -- I just fixed that. Can you please approve the tests to run? Then this should be ready to merge. 🙏

…t 400 Signed-off-by: Barry Hart <barry.hart@zoro.com>

Signed-off-by: Barry Hart <barry.hart@zoro.com>

Co-authored-by: Ben Wilson <39283302+BenWilson2@users.noreply.github.com> Signed-off-by: Corey Zumar <39497902+dbczumar@users.noreply.github.com> Signed-off-by: Barry Hart <barry.hart@zoro.com>

Signed-off-by: Barry Hart <barry.hart@zoro.com>

barrywhart · 2022-12-01T00:34:10Z

@dbczumar: Ok, the failed test should work now. 🤞

barrywhart · 2022-12-01T02:31:23Z

Hmm, the new error is very concerning -- on MySQL, it's throwing an OperationalError for a constraint violation. In order for this change to make sense, OperationalError needs to imply the error can be retried. The DBAPI docs seem to imply this is the case, but if databases don't respect it, it's not workable, at least without some way to determine retryable vs not.

I tried to find a list of MySQL error codes to see if there's a clear pattern which errors are retryable, but I can't find a list.

barrywhart · 2022-12-01T02:54:14Z

A little more research into Postgres and MySQL.

Postgres has a pretty careful mapping of error codes to exception types. This is what I expected from all databases: https://github.com/psycopg/psycopg2/blob/master/psycopg/error_type.c#L62L139.

The PyMySQL driver does this also: https://github.com/PyMySQL/PyMySQL/blob/main/pymysql/err.py.

MySQLClient has similar code, but not very sophisticated. Near the bottom of this section, it appears that almost any error code not handled earlier defaults to OperationalError: https://github.com/PyMySQL/mysqlclient/blob/main/MySQLdb/_mysql.c#L161L204

So it seems this change may be useful depending on the quality of the database client package.

What do you think? One option is to add this behavior but make it optional. For the database & driver combinations where it works, this change is really valuable for gracefully handling brief database outages, but for bad ones, it would lead to inappropriate retries, e.g. on NULL integrity constraint violations (the failing test).

It's also interesting that all the tests were passing a few weeks ago. Perhaps the tests are using a different MySQL driver package now, either an entirely different package or a different version.

dbczumar · 2022-12-01T05:47:13Z

A little more research into Postgres and MySQL.

Postgres has a pretty careful mapping of error codes to exception types. This is what I expected from all databases: https://github.com/psycopg/psycopg2/blob/master/psycopg/error_type.c#L62L139.

The PyMySQL driver does this also: https://github.com/PyMySQL/PyMySQL/blob/main/pymysql/err.py.

MySQLClient has similar code, but not very sophisticated. Near the bottom of this section, it appears that almost any error code not handled earlier defaults to OperationalError: https://github.com/PyMySQL/mysqlclient/blob/main/MySQLdb/_mysql.c#L161L204

So it seems this change may be useful depending on the quality of the database client package.

What do you think? One option is to add this behavior but make it optional. For the database & driver combinations where it works, this change is really valuable for gracefully handling brief database outages, but for bad ones, it would lead to inappropriate retries, e.g. on NULL integrity constraint violations (the failing test).

It's also interesting that all the tests were passing a few weeks ago. Perhaps the tests are using a different MySQL driver package now, either an entirely different package or a different version.

Thanks for the detailed analysis! I'm okay with mapping all instances of this error to 503s :). Happy to merge once we adjust the test case.

Signed-off-by: Barry Hart <barry.hart@zoro.com>

barrywhart · 2022-12-01T12:29:45Z

@dbczumar: Great, thanks! I updated the test case and added a brief code comment about the inconsistent error reporting behavior.

dbczumar · 2022-12-01T19:11:42Z

Thanks @barrywhart !

barrywhart · 2022-12-01T19:56:45Z

How do we keep seeing a distinct new failure on each run? It's like the database clients are changing every day!

I'll look at this failure tonight or tomorrow.

Any concerns if I look at pinning some dependencies?

dbczumar · 2022-12-01T19:59:30Z

How do we keep seeing a distinct new failure on each run? It's like the database clients are changing every day!

I'll look at this failure tonight or tomorrow.

Any concerns if I look at pinning some dependencies?

No worries! Thanks for taking a look. I'd rather we make the test a bit more forgiving (e.g. check the version of SQLALchemy and use that to determine which sqlite dbapi to reference) as opposed to pin test dependencies. It's important to ensure we're running our tests against the latest library versions.

barrywhart · 2022-12-01T20:08:25Z

One strange thing is, in the MySQL tests, it had installed both PyMySQL and MySQLClient (I'm not talking about package versions; these are two completely different db client packages). It seems like the test should only install one, or if it needs to test both, it should do multiple test runs, one with each client package.

Do you know anything about this?

It's possible this is happening with other databases as well.

…03_service_temporarily_unavailable

barrywhart · 2022-12-02T14:21:50Z

I think I figured out the issue, but I'm also surprised.

The latest stable release of SQLAlchemy is 1.4.44. There are release candidates available for 2.0. Apparently the GitHub build installs a release candidate, but when I build and run locally using the same Docker commands:

./tests/db/compose.sh build --build-arg DEPENDENCIES="$(python setup.py -q dependencies)"
./tests/db/compose.sh run --rm --no-TTY mlflow-sqlite pytest tests/store/tracking/test_sqlalchemy_store.py tests/store/model_registry/test_sqlalchemy_store.py tests/db

it works for me, presumably because I'm getting the stable version of SQLAlchemy.

Anyway, fix incoming...

Signed-off-by: Barry Hart <barry.hart@zoro.com>

barrywhart · 2022-12-02T14:41:34Z

@dbczumar: Ok, I fixed the SQLite issue -- the updated test should now work with SQLAlchemy 1.x (current stable) or 2.x (currently in "release candidate" status).

barrywhart · 2022-12-02T19:53:04Z

@dbczumar: If you have a moment, could you approve the build to run? Feeling 🤞🍀 about the build passing this time. 😬

dbczumar · 2022-12-02T22:57:15Z

Awesome work, @barrywhart ! Thanks so much! Merging!

barrywhart · 2022-12-03T20:30:37Z

@dbczumar: Thank you! Looking forward to updating once this is released. 🙏

barrywhart changed the title ~~Report SQLAlchemy OperationalError as a retryable HTTP error (503) no…~~ Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 Nov 3, 2022

barrywhart mentioned this pull request Nov 3, 2022

[BUG] MLflow REST API incorrectly reports "database down" as BAD_REQUEST (HTTP 400) error #7238

Closed

22 tasks

github-actions bot added area/server-infra MLflow Tracking server backend area/sqlalchemy Use of SQL alchemy in tracking service or model registry rn/bug-fix Mention under Bug Fixes in Changelogs. labels Nov 3, 2022

barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from a57de8d to d1ec0ab Compare November 3, 2022 01:33

harupy reviewed Nov 5, 2022

View reviewed changes

barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from 8fd2788 to 89fa8d4 Compare November 5, 2022 02:49

harupy reviewed Nov 8, 2022

View reviewed changes

barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from e5d0e20 to fec725f Compare November 8, 2022 22:22

BenWilson2 reviewed Nov 30, 2022

View reviewed changes

mlflow/store/db/utils.py Outdated Show resolved Hide resolved

dbczumar approved these changes Nov 30, 2022

View reviewed changes

dbczumar enabled auto-merge (squash) November 30, 2022 18:59

auto-merge was automatically disabled November 30, 2022 20:49
Head branch was pushed to by a user without write access

auto-merge was automatically disabled December 1, 2022 00:31
Head branch was pushed to by a user without write access

Barry Hart and others added 6 commits November 30, 2022 19:32

Report SQLAlchemy OperationalError as a retryable HTTP error (503) no…

7030840

…t 400 Signed-off-by: Barry Hart <barry.hart@zoro.com>

Skip test if not running on SQLite

ece0d8e

Signed-off-by: Barry Hart <barry.hart@zoro.com>

Update skip reason

977051b

Signed-off-by: Barry Hart <barry.hart@zoro.com>

Update mlflow/store/db/utils.py

c2ca989

Co-authored-by: Ben Wilson <39283302+BenWilson2@users.noreply.github.com> Signed-off-by: Corey Zumar <39497902+dbczumar@users.noreply.github.com> Signed-off-by: Barry Hart <barry.hart@zoro.com>

Check for logging.exception() rather than logging.warning()

0a73fd3

Signed-off-by: Barry Hart <barry.hart@zoro.com>

Fix broken test by forcing use of a unique database

a576a8b

Signed-off-by: Barry Hart <barry.hart@zoro.com>

barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from a5366cb to a576a8b Compare December 1, 2022 00:32

dbczumar enabled auto-merge (squash) December 1, 2022 00:37

Adjust test_log_null_param test case to work on MySQL

ac351c0

Signed-off-by: Barry Hart <barry.hart@zoro.com>

auto-merge was automatically disabled December 1, 2022 12:28
Head branch was pushed to by a user without write access

dbczumar enabled auto-merge (squash) December 1, 2022 19:11

Merge branch 'master' into bhart-issue_report_database_down_as_http_5…

4a886dd

…03_service_temporarily_unavailable

SQLAlchemy "dbapi()" function renamed in v2.0 to "imkport_dbapi()"

b2c13d8

Signed-off-by: Barry Hart <barry.hart@zoro.com>

auto-merge was automatically disabled December 2, 2022 14:35
Head branch was pushed to by a user without write access

dbczumar merged commit da4fe0f into mlflow:master Dec 2, 2022


		monkeypatch.setattr(sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect_pysqlite, "dbapi", dbapi)

		with pytest.raises(mlflow.MlflowException):

Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 #7240

Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 #7240

Conversation

barrywhart commented Nov 3, 2022 • edited

Related Issues/PRs

What changes are proposed in this pull request?

How is this patch tested?

Does this PR change the documentation?

Release Notes

Is this a user-facing change?

What component(s), interfaces, languages, and integrations does this PR affect?

How should the PR be classified in the release notes? Choose one:

github-actions bot commented Nov 3, 2022

mlflow-automation commented Nov 3, 2022 • edited

barrywhart commented Nov 3, 2022

BenWilson2 commented Nov 3, 2022

barrywhart commented Nov 3, 2022

BenWilson2 commented Nov 3, 2022

barrywhart commented Nov 4, 2022

harupy Nov 5, 2022

Choose a reason for hiding this comment

barrywhart Nov 5, 2022 • edited

Choose a reason for hiding this comment

harupy Nov 7, 2022

Choose a reason for hiding this comment

barrywhart Nov 7, 2022

Choose a reason for hiding this comment

harupy Nov 5, 2022

Choose a reason for hiding this comment

barrywhart Nov 5, 2022

Choose a reason for hiding this comment

barrywhart commented Nov 8, 2022

harupy Nov 8, 2022 • edited

Choose a reason for hiding this comment

barrywhart Nov 8, 2022

Choose a reason for hiding this comment

barrywhart commented Nov 8, 2022

barrywhart commented Nov 8, 2022

BenWilson2 commented Nov 8, 2022

barrywhart commented Nov 9, 2022

barrywhart commented Nov 10, 2022

barrywhart commented Nov 29, 2022

dbczumar left a comment

Choose a reason for hiding this comment

barrywhart commented Nov 30, 2022

barrywhart commented Dec 1, 2022

barrywhart commented Dec 1, 2022

barrywhart commented Dec 1, 2022

dbczumar commented Dec 1, 2022

barrywhart commented Dec 1, 2022

dbczumar commented Dec 1, 2022

barrywhart commented Dec 1, 2022

dbczumar commented Dec 1, 2022

barrywhart commented Dec 1, 2022

barrywhart commented Dec 2, 2022

barrywhart commented Dec 2, 2022

barrywhart commented Dec 2, 2022

dbczumar commented Dec 2, 2022

barrywhart commented Dec 3, 2022

barrywhart commented Nov 3, 2022 •

edited

mlflow-automation commented Nov 3, 2022 •

edited

barrywhart Nov 5, 2022 •

edited

harupy Nov 8, 2022 •

edited