
[BUG] MLflow Project Runs Failing on Azure Databricks with "Invalid backend config JSON" Error #8981

Open
DJSaunders1997 opened this issue Jul 6, 2023 · 7 comments
Labels
area/examples (Example code), area/projects (MLproject format, project running backends), bug (Something isn't working), has-closing-pr (This issue has a closing PR), integrations/databricks (Databricks integrations)

Comments

@DJSaunders1997

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

Client:
$ mlflow --version
mlflow, version 2.4.1

System information

$ mlflow doctor
System information: Windows 10.0.17763
Python version: 3.11.3
MLflow version: 2.4.1
MLflow module location: C:\Users\<username>\AppData\Local\anaconda3\envs\tutorial\Lib\site-packages\mlflow\__init__.py
Tracking URI: databricks://DEFAULT
Registry URI: databricks://DEFAULT
MLflow environment variables:
  MLFLOW_TRACKING_URI: databricks://DEFAULT
MLflow dependencies:
  Flask: 2.3.2
  Jinja2: 3.1.2
  alembic: 1.11.1
  click: 8.1.3
  cloudpickle: 2.2.1
  databricks-cli: 0.17.7
  docker: 6.1.3
  entrypoints: 0.4
  gitpython: 3.1.31
  importlib-metadata: 6.7.0
  markdown: 3.4.3
  matplotlib: 3.7.1
  numpy: 1.24.3
  packaging: 23.0
  pandas: 1.5.3
  protobuf: 4.23.3
  pyarrow: 12.0.0
  pytz: 2023.3
  pyyaml: 6.0
  querystring-parser: 1.2.4
  requests: 2.30.0
  scikit-learn: 1.2.2
  scipy: 1.10.1
  sqlalchemy: 2.0.12
  sqlparse: 0.4.4
  waitress: 2.1.2

Describe the problem

Following the Azure Databricks MLflow Projects documentation

I've followed the official Azure Databricks instructions on configuring MLflow to run projects, and copied the example cluster-spec.json from https://learn.microsoft.com/en-us/azure/databricks/mlflow/projects.

cluster-spec.json

{
  "spark_version": "7.3.x-scala2.12",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}

The MLflow CLI appears to be working, as I can run commands such as $ mlflow experiments search and get results from my Databricks tracking server.

Running the command:

mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -b databricks --backend-config cluster-spec.json --experiment-id <my-experiment-id>

successfully creates a job run in Databricks.

$ mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -b databricks --backend-config cluster-spec.json --experiment-id <my-experiment-id>
2023/07/06 10:36:26 INFO mlflow.projects.utils: === Fetching project from https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine into C:\Users\<username>\AppData\Local\Temp\2\tmp6lo70toh ===
2023/07/06 10:36:32 INFO mlflow.projects.utils: Fetched 'master' branch
2023/07/06 10:36:43 INFO mlflow.projects.databricks: === Creating tarball from C:\Users\<username>\AppData\Local\Temp\2\tmp6lo70toh\examples\sklearn_elasticnet_wine in temp directory C:\Users\<username>\AppData\Local\Temp\2\tmpyz6_0st9 ===
2023/07/06 10:36:43 INFO mlflow.projects.databricks: === Total file size to compress: 268.2 KB ===
2023/07/06 10:36:44 INFO mlflow.projects.databricks: === Project already exists in DBFS ===
2023/07/06 10:36:44 INFO mlflow.projects.databricks: === Running entry point main of project https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
2023/07/06 10:36:44 INFO mlflow.projects.databricks: === Submitting a run to execute the MLflow project... ===
2023/07/06 10:36:45 INFO mlflow.projects.databricks: === Launched MLflow run as Databricks job run with ID 3727042. Getting run status page URL... ===
2023/07/06 10:36:45 INFO mlflow.projects.databricks: === Check the run's status at <azure databricks job run url> ===

However, this job fails on Databricks due to an issue installing MLflow:
[screenshot of the failed job run output]

I believe this failure is due to the incompatibility between Databricks Runtime 7.x and the latest MLflow version:
[screenshot of the MLflow compatibility matrix]
https://docs.databricks.com/release-notes/runtime/releases.html#mlflow-compatibility-matrix

Bumping up Databricks runtime

Based on the compatibility matrix, I've amended my cluster-spec.json to use the latest runtime, which should be compatible with the latest MLflow version.

{
  "spark_version": "13.2.x-scala2.12",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}

However, this run also fails:
[screenshot of the failed job run output]

Full Standard error trace

Invalid backend config JSON. Parse error: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/databricks/python3/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/databricks/python3/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/databricks/python3/lib/python3.10/site-packages/mlflow/cli.py", line 195, in run
    backend_config = json.loads(backend_config)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The 'Invalid backend config JSON' error would suggest a problem with the cluster spec; however, the cluster itself was created without issue, and only the job run failed.

The same error also appears when running with the latest ML runtime, 13.2.x-cpu-ml-scala2.12.

I'm not sure what else to try.
Let me know if there's any more info I can give regarding this issue :)

Tracking information

REPLACE_ME

Code to reproduce issue

REPLACE_ME

Stack trace

(Same full standard error trace as shown above under "Describe the problem".)

Other info / logs

Standard output trace

mlflow, version 2.4.1
sending incremental file list
ba117a884eef555178d964c987a323d82b45dd7592846f1e8b3ec535f088c41c.tar.gz

sent 70,752 bytes  received 35 bytes  141,574.00 bytes/sec
total size is 70,568  speedup is 1.00
mlflow-project/
mlflow-project/MLproject
mlflow-project/conda.yaml
mlflow-project/python_env.yaml
mlflow-project/train.ipynb
mlflow-project/train.py
mlflow-project/wine-quality.csv

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
DJSaunders1997 added the bug label on Jul 6, 2023
@DJSaunders1997
Author

I've also raised an issue against the documentation, in case this is a Databricks issue rather than an MLflow issue: MicrosoftDocs/azure-docs#111832

@mlflow-automation
Collaborator

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

@serena-ruan
Collaborator

Hi @DJSaunders1997, could you try passing the content of your cluster-spec.json as a string directly to --backend-config in the CLI, to see if that works? The error message suggests this parameter is not being loaded correctly.

@DJSaunders1997
Author

Hi, thanks for your response.

I ran the command in Git Bash, using the latest spark_version runtime:

mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -b databricks --backend-config '{"spark_version": "13.2.x-scala2.12", "num_workers": 1, "node_type_id": "Standard_DS3_v2"}' --experiment-id <my-experiment-id>

and I still get the same error:

Invalid backend config JSON. Parse error: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/databricks/python3/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/databricks/python3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/databricks/python3/lib/python3.9/site-packages/mlflow/cli.py", line 195, in run
    backend_config = json.loads(backend_config)
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is there a different way I should represent the cluster-spec as a string?

Python API

I've also attempted to run projects using the Python API:

mlflow.projects.run(
    uri=".",
    experiment_id="3006282966236537",
    backend="databricks",
    synchronous=False,
    backend_config=x,  # x = the cluster spec, in the forms listed below
)

where backend_config was supplied as a:

  • path to the JSON file
  • JSON string
  • Python dictionary

all of which give the same error on Databricks (a sketch of the three variants is below).
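
For concreteness, a minimal sketch of the three variants (same cluster spec and experiment ID as above; every variant fails with the identical JSONDecodeError on the Databricks side):

import json

import mlflow

cluster_spec = {
    "spark_version": "13.2.x-scala2.12",
    "num_workers": 1,
    "node_type_id": "Standard_DS3_v2",
}

# The three forms tried for backend_config -- all fail the same way:
for backend_config in (
    "cluster-spec.json",       # 1. path to the JSON file on disk
    json.dumps(cluster_spec),  # 2. JSON string
    cluster_spec,              # 3. Python dictionary
):
    mlflow.projects.run(
        uri=".",
        experiment_id="3006282966236537",
        backend="databricks",
        synchronous=False,
        backend_config=backend_config,
    )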

Do you have any other ideas of what I should try next?

Thanks!

@serena-ruan
Collaborator

@DJSaunders1997 Could you open a ticket with the Azure Databricks support team? They should be able to help you with it :)

@manasand

manasand commented Aug 7, 2023

Hi. Is there any progress on this? Any workaround? Thanks.

@wolpl
Contributor

wolpl commented Jan 11, 2024

The JSONDecodeError occurs because MLflow uses Windows command-line escaping to assemble a command that is then executed in a bash shell on Databricks. More info in #10811.

As a workaround, it is possible to call the mlflow.run() function from a Python script (rather than invoking the CLI command directly, as you did).
If you replace the function that does the incorrect escaping with a correct one before that call, the Databricks run executes as intended:

import shlex
import mlflow

mlflow.projects.databricks.quote = shlex.quote  # HACK to fix the encoding of the mlflow backend configuration
mlflow.run(...)
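
For anyone else hitting this, a fuller sketch of that workaround applied to the example from this issue (the experiment ID is a placeholder, cluster-spec.json is the spec file from the original report, and the monkey-patch assumes mlflow.projects.databricks is still the module that does the quoting):

import shlex

import mlflow
import mlflow.projects.databricks

# HACK: swap the Windows-style quoting for POSIX quoting so the command
# assembled for the Databricks driver passes --backend-config as valid JSON.
mlflow.projects.databricks.quote = shlex.quote

submitted_run = mlflow.run(
    "https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine",
    backend="databricks",
    backend_config="cluster-spec.json",   # path to the cluster spec JSON
    experiment_id="<my-experiment-id>",   # placeholder
    synchronous=False,
)
print(f"Submitted Databricks job run: {submitted_run.run_id}")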
