
[BUG] MLflow Project Runs Failing on Azure Databricks with "Invalid backend config JSON" Error #8981

Open
DJSaunders1997 opened this issue Jul 6, 2023 · 7 comments
Labels
area/examples (Example code), area/projects (MLproject format, project running backends), bug (Something isn't working), has-closing-pr (This issue has a closing PR), integrations/databricks (Databricks integrations)

Comments

@DJSaunders1997

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

Client:
$ mlflow --version
mlflow, version 2.4.1

System information

$ mlflow doctor
System information: Windows 10.0.17763
Python version: 3.11.3
MLflow version: 2.4.1
MLflow module location: C:\Users\<username>\AppData\Local\anaconda3\envs\tutorial\Lib\site-packages\mlflow\__init__.py
Tracking URI: databricks://DEFAULT
Registry URI: databricks://DEFAULT
MLflow environment variables:
  MLFLOW_TRACKING_URI: databricks://DEFAULT
MLflow dependencies:
  Flask: 2.3.2
  Jinja2: 3.1.2
  alembic: 1.11.1
  click: 8.1.3
  cloudpickle: 2.2.1
  databricks-cli: 0.17.7
  docker: 6.1.3
  entrypoints: 0.4
  gitpython: 3.1.31
  importlib-metadata: 6.7.0
  markdown: 3.4.3
  matplotlib: 3.7.1
  numpy: 1.24.3
  packaging: 23.0
  pandas: 1.5.3
  protobuf: 4.23.3
  pyarrow: 12.0.0
  pytz: 2023.3
  pyyaml: 6.0
  querystring-parser: 1.2.4
  requests: 2.30.0
  scikit-learn: 1.2.2
  scipy: 1.10.1
  sqlalchemy: 2.0.12
  sqlparse: 0.4.4
  waitress: 2.1.2

Describe the problem

Following the Azure Databricks MLflow Projects documentation

I've followed the official Azure Databricks instructions on configuring MLflow to run projects, and copied the example cluster-spec.json from https://learn.microsoft.com/en-us/azure/databricks/mlflow/projects.

cluster-spec.json

{
  "spark_version": "7.3.x-scala2.12",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}

The MLflow CLI appears to be working, as I can run commands such as $ mlflow experiments search and get results from my Databricks tracking server.

Running the command:

mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -b databricks --backend-config cluster-spec.json --experiment-id <my-experiment-id>

successfully creates a job run in Databricks.

$ mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -b databricks --backend-config cluster-spec.json --experiment-id <my-experiment-id>
2023/07/06 10:36:26 INFO mlflow.projects.utils: === Fetching project from https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine into C:\Users\<username>\AppData\Local\Temp\2\tmp6lo70toh ===
2023/07/06 10:36:32 INFO mlflow.projects.utils: Fetched 'master' branch
2023/07/06 10:36:43 INFO mlflow.projects.databricks: === Creating tarball from C:\Users\<username>\AppData\Local\Temp\2\tmp6lo70toh\examples\sklearn_elasticnet_wine in temp directory C:\Users\<username>\AppData\Local\Temp\2\tmpyz6_0st9 ===
2023/07/06 10:36:43 INFO mlflow.projects.databricks: === Total file size to compress: 268.2 KB ===
2023/07/06 10:36:44 INFO mlflow.projects.databricks: === Project already exists in DBFS ===
2023/07/06 10:36:44 INFO mlflow.projects.databricks: === Running entry point main of project https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
2023/07/06 10:36:44 INFO mlflow.projects.databricks: === Submitting a run to execute the MLflow project... ===
2023/07/06 10:36:45 INFO mlflow.projects.databricks: === Launched MLflow run as Databricks job run with ID 3727042. Getting run status page URL... ===
2023/07/06 10:36:45 INFO mlflow.projects.databricks: === Check the run's status at <azure databricks job run url> ===

However, this job fails on Databricks due to an issue installing MLflow:
[screenshot of the failed job run output]

I believe this failure is due to the incompatibility between Databricks Runtime 7.x and the latest MLflow version:
[screenshot of the MLflow compatibility matrix]
https://docs.databricks.com/release-notes/runtime/releases.html#mlflow-compatibility-matrix

Bumping up Databricks runtime

Based on the compatibility matrix, I've amended my cluster-spec.json to use the latest runtime, which should be compatible with the latest MLflow version.

{
  "spark_version": "13.2.x-scala2.12",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}

However, this run also fails:
[screenshot of the failed job run output]

Full Standard error trace

Invalid backend config JSON. Parse error: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/databricks/python3/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/databricks/python3/lib/python3.10/site-packages/mlflow/cli.py", line 195, in run
    backend_config = json.loads(backend_config)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The 'Invalid backend config JSON' error would suggest a problem with the cluster spec; however, the cluster itself was created without issue, and only the job run failed.

The same error also appears when running with the latest ML runtime, 13.2.x-cpu-ml-scala2.12.

I'm not sure what else to try.
Let me know if there's any more info I can give regarding this issue :)

Tracking information

REPLACE_ME

Code to reproduce issue

REPLACE_ME

Stack trace

(Same full standard error trace as shown above under "Describe the problem".)

Other info / logs

Standard output trace

mlflow, version 2.4.1
sending incremental file list
ba117a884eef555178d964c987a323d82b45dd7592846f1e8b3ec535f088c41c.tar.gz

sent 70,752 bytes  received 35 bytes  141,574.00 bytes/sec
total size is 70,568  speedup is 1.00
mlflow-project/
mlflow-project/MLproject
mlflow-project/conda.yaml
mlflow-project/python_env.yaml
mlflow-project/train.ipynb
mlflow-project/train.py
mlflow-project/wine-quality.csv

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
DJSaunders1997 added the bug label on Jul 6, 2023
@DJSaunders1997
Author

I've also raised an issue against the documentation, in case this is a Databricks issue rather than an MLflow issue: MicrosoftDocs/azure-docs#111832

@mlflow-automation
Collaborator

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

@serena-ruan
Collaborator

Hi @DJSaunders1997, could you try passing the content of your cluster-spec.json as a string directly to --backend-config in the CLI, to see if that works? The error message suggests this parameter is not being loaded correctly.

@DJSaunders1997
Author

Hi, thanks for your response.

I ran the command in Git Bash, using the latest spark_version runtime:

mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -b databricks --backend-config '{"spark_version": "13.2.x-scala2.12", "num_workers": 1, "node_type_id": "Standard_DS3_v2"}' --experiment-id <my-experiment-id>

and I still get the same error:

Invalid backend config JSON. Parse error: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/databricks/python3/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/databricks/python3/lib/python3.9/site-packages/mlflow/cli.py", line 195, in run
    backend_config = json.loads(backend_config)
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is there a different way I should represent the cluster-spec as a string?

Python API

I've also attempted to run projects using the Python API:

mlflow.projects.run(
    uri=".",
    experiment_id="3006282966236537",
    backend="databricks",
    synchronous=False,
    backend_config=x,  # x = the cluster spec, in the forms listed below
)

where backend_config was supplied as a:

  • path to the JSON file
  • JSON string
  • Python dictionary

all of which give the same error on Databricks (a sketch of the three variants is below).
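
For concreteness, a minimal sketch of the three variants (same cluster spec and experiment ID as above; every variant fails with the identical JSONDecodeError on the Databricks side):

import json

import mlflow

cluster_spec = {
    "spark_version": "13.2.x-scala2.12",
    "num_workers": 1,
    "node_type_id": "Standard_DS3_v2",
}

# The three forms tried for backend_config -- all fail the same way:
for backend_config in (
    "cluster-spec.json",       # 1. path to the JSON file on disk
    json.dumps(cluster_spec),  # 2. JSON string
    cluster_spec,              # 3. Python dictionary
):
    mlflow.projects.run(
        uri=".",
        experiment_id="3006282966236537",
        backend="databricks",
        synchronous=False,
        backend_config=backend_config,
    )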

Do you have any other ideas of what I should try next?

Thanks!

@serena-ruan
Collaborator

@DJSaunders1997 Could you open a ticket with the Azure Databricks support team? They should be able to help you with it :)

@manasand

manasand commented Aug 7, 2023

Hi. Is there any progress on this? Any workaround? Thanks.

@wolpl
Contributor

wolpl commented Jan 11, 2024

The JSONDecodeError occurs because MLflow uses Windows command-line escaping to assemble a command that is then executed in a bash shell on Databricks. More info in #10811.

As a workaround, it is possible to call the mlflow.run() function from a Python script (rather than invoking the CLI command directly, as you did).
If you replace the function that does the incorrect escaping with a correct one before that call, the Databricks run executes as intended:

import shlex
import mlflow

mlflow.projects.databricks.quote = shlex.quote  # HACK to fix the encoding of the mlflow backend configuration
mlflow.run(...)
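
For anyone else hitting this, a fuller sketch of that workaround applied to the example from this issue (the experiment ID is a placeholder, cluster-spec.json is the spec file from the original report, and the monkey-patch assumes mlflow.projects.databricks is still the module that does the quoting):

import shlex

import mlflow
import mlflow.projects.databricks

# HACK: swap the Windows-style quoting for POSIX quoting so the command
# assembled for the Databricks driver passes --backend-config as valid JSON.
mlflow.projects.databricks.quote = shlex.quote

submitted_run = mlflow.run(
    "https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine",
    backend="databricks",
    backend_config="cluster-spec.json",   # path to the cluster spec JSON
    experiment_id="<my-experiment-id>",   # placeholder
    synchronous=False,
)
print(f"Submitted Databricks job run: {submitted_run.run_id}")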
