Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 #7240

Conversation

barrywhart
Copy link
Contributor

@barrywhart barrywhart commented Nov 3, 2022

Related Issues/PRs

Fixes #7238

What changes are proposed in this pull request?

Created additional except block in mlflow/store/db/utils.py make_managed_session() function. Catches SQLAlchemy OperationalError (e.g. database down) and reports HTTP error 503 (TEMPORARILY_UNAVAILABLE). This triggers the standard MLflow client "exponential backoff" retry behavior, which avoids causing clients to unnecessarily fail (assuming the database issue is fixed soon).

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

I started this test script and left it running:

import os
import time
import uuid

import mlflow

mlflow.set_experiment("foo")
num_iterations = 180
for i in range(num_iterations):
    print(f"Creating run {i} of {num_iterations}")
    with mlflow.start_run():
        time.sleep(1)

I started the MLflow server as follows:

PYTHONUNBUFFERED=1 mlflow server --backend-store-uri postgresql://mlflow:password@127.0.0.1/postgres \
    --default-artifact-root /tmp/ --host=0.0.0.0 --port=5007 --workers 1 \
     --gunicorn-opts '--timeout 0 --workers 1 --threads 1 --error-logfile - --log-file - --log-level debug'

As the test script was running, I manually ran service postgresql stop and service postgresql start several times to confirm that:

  • The script did not fail.
  • The script "paused" while the database was down, then resumed once the database was up again.

Previously, the test script failed immediately and exited with the stack backtrace provided in the linked issue (#7238).

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

If the SQL database is down, MLflow REST API reports a retryable HTTP error 503 (TEMPORARILY_UNAVAILABLE), which will be automatically retried by the MLflow client library. Previously, this was reported as a non-retryable HTTP error (400).

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@barrywhart barrywhart changed the title Report SQLAlchemy OperationalError as a retryable HTTP error (503) no… Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400 Nov 3, 2022
@github-actions
Copy link

github-actions bot commented Nov 3, 2022

@barrywhart Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/9262493758. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

@github-actions github-actions bot added area/server-infra MLflow Tracking server backend area/sqlalchemy Use of SQL alchemy in tracking service or model registry rn/bug-fix Mention under Bug Fixes in Changelogs. labels Nov 3, 2022
@mlflow-automation
Copy link
Collaborator

mlflow-automation commented Nov 3, 2022

Documentation preview for b2c13d8 will be available here when this CircleCI job completes successfully.

More info

@barrywhart barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from a57de8d to d1ec0ab Compare November 3, 2022 01:33
@barrywhart
Copy link
Contributor Author

I think it should be good now. I recreated the branch and force pushed with this commit message:

    Report SQLAlchemy OperationalError as a retryable HTTP error (503) not 400
    
    Signed-off-by: Barry Hart <barry.hart@zoro.com>

@BenWilson2
Copy link
Member

Hi @barrywhart the test suites for database operations are here: https://github.com/mlflow/mlflow/blob/master/tests/db/
and the best place to put a service 'pause' validation check is likely here: https://github.com/mlflow/mlflow/blob/master/tests/db/test_tracking_operations.py
If you can get an async thread that submits a subprocess.Popen command to pause the db services (there's no need to validate this behavior for sqlite), sleep for a specified time, and then resume the processes to validate that logging of tracking functionality works, I think that'll cover the test coverage for this addition.

@barrywhart
Copy link
Contributor Author

Thanks, I'll work on that! It may be some time next week. It turns out that "test what happens when the database is down" can surface lots of work to do in several different areas. 🤣

The larger context is, we're hosting MLflow ourselves in Google Cloud, and Google's routine, automated maintenance will occasionally take the database down for 30-60 seconds. Major database upgrades or making database schema changes may also require outages. Ideally, we want MLflow client applications to wait at least a few minutes before failing, in order to avoid unnecessary alerts and failures.

@BenWilson2
Copy link
Member

Sounds good @barrywhart! I'll review and kick off CI once that test is added :) In the meantime, let me know if you have any questions or get stuck.

@barrywhart
Copy link
Contributor Author

@BenWilson2: I added a test for the new error handling behavior. I couldn't find any tests that used a database other than SQLite, so the test uses SQLite. Even though SQLite is a local database, thus not prone to OperationalError, I think this is still a reasonable test, since all databases use the same, new except block.

Comment on lines 131 to 144
mlflow.log_param("p", "param")
mlflow.log_metric("m", 1.0)
mlflow.set_tag("t", "tag")
mlflow.pyfunc.log_model(
artifact_path="model", python_model=Model(), registered_model_name="model"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which line throws an exception?

Copy link
Contributor Author

@barrywhart barrywhart Nov 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two exceptions are thrown. This line:

mlflow.log_param("p", "param")

and this line:

mlflow.pyfunc.log_model(
    artifact_path="model", python_model=Model(), registered_model_name="model"
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.pytest.org/en/7.1.x/reference/reference.html#pytest.raises might help.

image

If mlflow.log_param("p", "param") throws an exception, the lines after that will not be executed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, updated!


monkeypatch.setattr(sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect_pysqlite, "dbapi", dbapi)

with pytest.raises(mlflow.MlflowException):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with pytest.raises(mlflow.MlflowException):
with pytest.raises(mlflow.MlflowException, match=...):

Can we add match? pytest.raises(mlflow.MlflowException) matches any MlflowException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

@barrywhart barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from 8fd2788 to 89fa8d4 Compare November 5, 2022 02:49
@barrywhart
Copy link
Contributor Author

@harupy, @BenWilson2: This is ready for review hopefully final?).

Comment on lines 73 to 127
class ConnectionWrapper:
"""Wraps a sqlite3.Connection object."""

def __init__(self, conn):
self.conn = conn

def __getattr__(self, name):
return getattr(self.conn, name)

def cursor(self):
"""Return a wrapped SQLite cursor."""
return CursorWrapper(self.conn.cursor())

class CursorWrapper:
"""Wraps a sqlite3.Cursor object."""

def __init__(self, cursor):
self.cursor = cursor

def __getattr__(self, name):
return getattr(self.cursor, name)

def execute(self, *args, **kwargs):
"""Wraps execute(), simulating sporadic OperationalErrors."""
nonlocal execute_counter
if execute_counter is not None and "pragma" not in args[0].lower():
execute_counter += 1
if execute_counter % 3 == 0:
# Simulate a database error
raise sqlite3.OperationalError("test")
return self.cursor.execute(*args, **kwargs)

def connect(*args, **kwargs):
"""Wraps sqlite3.dbapi.connect(), returning a wrapped connection."""
global connect_counter
conn = old_connect(*args, **kwargs)
return ConnectionWrapper(conn)

def dbapi(*args, **kwargs):
"""Wraps SQLiteDialect_pysqlite.dbapi(), returning patched dbapi."""
nonlocal api_module, old_connect
if api_module is None:
# Only patch the first time dbapi() is called, to avoid recursion.
api_module = old_dbapi(*args, **kwargs)
old_connect = api_module.connect
monkeypatch.setattr(api_module, "connect", connect)
return api_module
Copy link
Member

@harupy harupy Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these wrappers? Can we use unittest.mock.patch?

with mock.patch("...", side_effect=sqlite3.OperationalError("test")):
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able to patch these functions because Python's SQLite module is mostly implemented in C. Those functions are not patchable -- you'll get a runtime error if you try.

@barrywhart
Copy link
Contributor Author

I fixed a couple of issues with the tests.

However, I saw some errors in "(mleap / models / 0.21.0)" that I think are not related to this PR:

FAILED tests/mleap/test_mleap_model_export.py::test_model_deployment - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_mleap_module_model_save_with_relative_path_and_valid_sample_input_produces_mleap_flavor - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_mleap_module_model_save_with_absolute_path_and_valid_sample_input_produces_mleap_flavor - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_mleap_model_log - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).
FAILED tests/mleap/test_mleap_model_export.py::test_spark_module_model_save_with_relative_path_and_valid_sample_input_produces_mleap_flavor - mlflow.mleap.MLeapSerializationException: MLeap encountered an error while serializing the model. Ensure that the model is compatible with MLeap (i.e does not contain any custom transformers).

@barrywhart barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from e5d0e20 to fec725f Compare November 8, 2022 22:22
@barrywhart
Copy link
Contributor Author

I recreated the branch to address a DCO issue, and I fixed a couple of pylint issues.

I think everything should be working now, although as mentioned above, I had seen some errors in ""(mleap / models / 0.21.0)" that seemed unrelated to this PR.

@BenWilson2
Copy link
Member

I recreated the branch to address a DCO issue, and I fixed a couple of pylint issues.

I think everything should be working now, although as mentioned above, I had seen some errors in ""(mleap / models / 0.21.0)" that seemed unrelated to this PR.

The MLeap failures have nothing to do with this. Anything in the database or python tests sections will be, though.

@barrywhart
Copy link
Contributor Author

I see why the database test was failing -- it relies on mocking/wrapping parts of SQLite to force a sqlalchemy.OperationalError. I updated the test so it is skipped when running on other databases. Since the code under test is generic (i.e. it catches a SQLAlchemy exception, not a database specific one), testing with a single database back end should be good enough.

@barrywhart
Copy link
Contributor Author

@harupy, @BenWilson2: I think this is ready for review. Thanks!

@barrywhart
Copy link
Contributor Author

@harupy: Can I get a review, please? It's a pretty simple change to the production code, and the tests are passing now.

I'd love to get this change released so MLflow can handle SQL database downtime (whether planned or unplanned).

🙏

Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @barrywhart !

@dbczumar dbczumar enabled auto-merge (squash) November 30, 2022 18:59
auto-merge was automatically disabled November 30, 2022 20:49

Head branch was pushed to by a user without write access

@barrywhart
Copy link
Contributor Author

@dbczumar: Ben's change to the code required an update to the test -- I just fixed that. Can you please approve the tests to run? Then this should be ready to merge. 🙏

auto-merge was automatically disabled December 1, 2022 00:31

Head branch was pushed to by a user without write access

Barry Hart and others added 6 commits November 30, 2022 19:32
…t 400

Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Co-authored-by: Ben Wilson <39283302+BenWilson2@users.noreply.github.com>
Signed-off-by: Corey Zumar <39497902+dbczumar@users.noreply.github.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
Signed-off-by: Barry Hart <barry.hart@zoro.com>
@barrywhart barrywhart force-pushed the bhart-issue_report_database_down_as_http_503_service_temporarily_unavailable branch from a5366cb to a576a8b Compare December 1, 2022 00:32
@barrywhart
Copy link
Contributor Author

@dbczumar: Ok, the failed test should work now. 🤞

@dbczumar dbczumar enabled auto-merge (squash) December 1, 2022 00:37
@barrywhart
Copy link
Contributor Author

Hmm, the new error is very concerning -- on MySQL, it's throwing an OperationalError for a constraint violation. In order for this change to make sense, OperationalError needs to imply the error can be retried. The DBAPI docs seem to imply this is the case, but if databases don't respect it, it's not workable, at least without some way to determine retryable vs not.

I tried to find a list of MySQL error codes to see if there's a clear pattern which errors are retryable, but I can't find a list.

@barrywhart
Copy link
Contributor Author

A little more research into Postgres and MySQL.

Postgres has a pretty careful mapping of error codes to exception types. This is what I expected from all databases: https://github.com/psycopg/psycopg2/blob/master/psycopg/error_type.c#L62L139.

The PyMySQL driver does this also: https://github.com/PyMySQL/PyMySQL/blob/main/pymysql/err.py.

MySQLClient has similar code, but not very sophisticated. Near the bottom of this section, it appears that almost any error code not handled earlier defaults to OperationalError: https://github.com/PyMySQL/mysqlclient/blob/main/MySQLdb/_mysql.c#L161L204

So it seems this change may be useful depending on the quality of the database client package.

What do you think? One option is to add this behavior but make it optional. For the database & driver combinations where it works, this change is really valuable for gracefully handling brief database outages, but for bad ones, it would lead to inappropriate retries, e.g. on NULL integrity constraint violations (the failing test).

It's also interesting that all the tests were passing a few weeks ago. Perhaps the tests are using a different MySQL driver package now, either an entirely different package or a different version.

@dbczumar
Copy link
Collaborator

dbczumar commented Dec 1, 2022

A little more research into Postgres and MySQL.

Postgres has a pretty careful mapping of error codes to exception types. This is what I expected from all databases: https://github.com/psycopg/psycopg2/blob/master/psycopg/error_type.c#L62L139.

The PyMySQL driver does this also: https://github.com/PyMySQL/PyMySQL/blob/main/pymysql/err.py.

MySQLClient has similar code, but not very sophisticated. Near the bottom of this section, it appears that almost any error code not handled earlier defaults to OperationalError: https://github.com/PyMySQL/mysqlclient/blob/main/MySQLdb/_mysql.c#L161L204

So it seems this change may be useful depending on the quality of the database client package.

What do you think? One option is to add this behavior but make it optional. For the database & driver combinations where it works, this change is really valuable for gracefully handling brief database outages, but for bad ones, it would lead to inappropriate retries, e.g. on NULL integrity constraint violations (the failing test).

It's also interesting that all the tests were passing a few weeks ago. Perhaps the tests are using a different MySQL driver package now, either an entirely different package or a different version.

Thanks for the detailed analysis! I'm okay with mapping all instances of this error to 503s :). Happy to merge once we adjust the test case.

Signed-off-by: Barry Hart <barry.hart@zoro.com>
auto-merge was automatically disabled December 1, 2022 12:28

Head branch was pushed to by a user without write access

@barrywhart
Copy link
Contributor Author

@dbczumar: Great, thanks! I updated the test case and added a brief code comment about the inconsistent error reporting behavior.

@dbczumar dbczumar enabled auto-merge (squash) December 1, 2022 19:11
@dbczumar
Copy link
Collaborator

dbczumar commented Dec 1, 2022

Thanks @barrywhart !

@barrywhart
Copy link
Contributor Author

How do we keep seeing a distinct new failure on each run? It's like the database clients are changing every day!

I'll look at this failure tonight or tomorrow.

Any concerns if I look at pinning some dependencies?

@dbczumar
Copy link
Collaborator

dbczumar commented Dec 1, 2022

How do we keep seeing a distinct new failure on each run? It's like the database clients are changing every day!

I'll look at this failure tonight or tomorrow.

Any concerns if I look at pinning some dependencies?

No worries! Thanks for taking a look. I'd rather we make the test a bit more forgiving (e.g. check the version of SQLALchemy and use that to determine which sqlite dbapi to reference) as opposed to pin test dependencies. It's important to ensure we're running our tests against the latest library versions.

@barrywhart
Copy link
Contributor Author

One strange thing is, in the MySQL tests, it had installed both PyMySQL and MySQLClient (I'm not talking about package versions; these are two completely different db client packages). It seems like the test should only install one, or if it needs to test both, it should do multiple test runs, one with each client package.

Do you know anything about this?

It's possible this is happening with other databases as well.

@barrywhart
Copy link
Contributor Author

I think I figured out the issue, but I'm also surprised.

The latest stable release of SQLAlchemy is 1.4.44. There are release candidates available for 2.0. Apparently the GitHub build installs a release candidate, but when I build and run locally using the same Docker commands:

./tests/db/compose.sh build --build-arg DEPENDENCIES="$(python setup.py -q dependencies)"
./tests/db/compose.sh run --rm --no-TTY mlflow-sqlite pytest tests/store/tracking/test_sqlalchemy_store.py tests/store/model_registry/test_sqlalchemy_store.py tests/db

it works for me, presumably because I'm getting the stable version of SQLAlchemy.

Anyway, fix incoming...

Signed-off-by: Barry Hart <barry.hart@zoro.com>
auto-merge was automatically disabled December 2, 2022 14:35

Head branch was pushed to by a user without write access

@barrywhart
Copy link
Contributor Author

@dbczumar: Ok, I fixed the SQLite issue -- the updated test should now work with SQLAlchemy 1.x (current stable) or 2.x (currently in "release candidate" status).

@barrywhart
Copy link
Contributor Author

@dbczumar: If you have a moment, could you approve the build to run? Feeling 🤞🍀 about the build passing this time. 😬

@dbczumar
Copy link
Collaborator

dbczumar commented Dec 2, 2022

Awesome work, @barrywhart ! Thanks so much! Merging!

@dbczumar dbczumar merged commit da4fe0f into mlflow:master Dec 2, 2022
@barrywhart
Copy link
Contributor Author

@dbczumar: Thank you! Looking forward to updating once this is released. 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/server-infra MLflow Tracking server backend area/sqlalchemy Use of SQL alchemy in tracking service or model registry rn/bug-fix Mention under Bug Fixes in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] MLflow REST API incorrectly reports "database down" as BAD_REQUEST (HTTP 400) error
5 participants