[FR] Improve UI resilience to corrupt metric files #12030
Labels
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
enhancement: New feature or request
Willingness to contribute
No. I cannot contribute this feature at this time.
Proposal Summary
When using a file-system backend store, OS-level issues can sometimes cause corrupt metric files.
This has previously been reported in issues such as #3052.
This issue can take down the UI for an entire experiment section, with an error message shown in the UI.
It would be very helpful if these sorts of issues could be handled gracefully in the UI, i.e. without crashing the whole page.
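To illustrate the failure mode, here is a minimal sketch (not MLflow code), assuming the FileStore's plain-text layout of one "timestamp value step" line per logged metric value; the function name is illustrative only:

```python
from pathlib import Path

def parse_metric_file(path: Path) -> list[tuple[int, float, int]]:
    # Assumed on-disk format: one "<timestamp> <value> <step>" line per metric value.
    values = []
    for line in path.read_text().splitlines():
        timestamp, value, step = line.split(" ")  # a truncated line raises ValueError here
        values.append((int(timestamp), float(value), int(step)))
    return values
```

A file truncated mid-write (e.g. a final line with only two fields) makes the unpacking above raise, so the whole request fails rather than just the single bad data point.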
Motivation
Avoid crashing the whole UI when one or more file-based metrics are corrupted.
It can be difficult to track down the exact run(s)/file(s) that lead to the issue, and I've found myself building custom scripts to scan for and fix corrupt files whenever this occurs (a sketch of such a scan is included below).
This has previously been reported by multiple members of the community, e.g. #3052 and #7932.
The current alternative is to manually triage the file-based store, scanning for and fixing corrupt files.
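As a rough sketch of that kind of triage script, the following walks a file-based store and reports metric files containing lines that do not parse as "timestamp value step". The default "mlruns" path and the directory layout are assumptions about the default FileStore, and none of this is an MLflow API:

```python
from pathlib import Path

def find_corrupt_metric_files(mlruns_dir: str = "mlruns"):
    """Return (file, line number, line) tuples for metric lines that fail to parse."""
    corrupt = []
    for metric_file in Path(mlruns_dir).rglob("metrics/*"):
        if not metric_file.is_file():
            continue
        for line_no, line in enumerate(metric_file.read_text().splitlines(), start=1):
            parts = line.split(" ")
            try:
                if len(parts) != 3:
                    raise ValueError("expected 3 fields: timestamp value step")
                int(parts[0]), float(parts[1]), int(parts[2])  # validate field types
            except ValueError:
                corrupt.append((metric_file, line_no, line))
    return corrupt

if __name__ == "__main__":
    for path, line_no, line in find_corrupt_metric_files():
        print(f"{path}:{line_no}: {line!r}")
```

A script like this only reports problems; deciding whether to drop or repair a bad line is left to the user.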
Details
This could also help identify the problematic runs/metrics, e.g. by failing to plot only the corrupt metrics for the affected runs in the UI, and showing a message indicating which ones failed.
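One way this graceful degradation could look on the read path is sketched below: instead of letting one unreadable metric file fail the whole metric-history request, collect the failures and return partial data plus a list of what could not be read, so the UI can flag it. Function and field names here are illustrative assumptions, not MLflow's actual APIs:

```python
from pathlib import Path

def get_metric_history_tolerant(run_dir: Path, metric_keys: list[str]):
    histories, unreadable = {}, []
    for key in metric_keys:
        path = run_dir / "metrics" / key
        try:
            histories[key] = [
                (int(ts), float(val), int(step))
                for ts, val, step in (line.split(" ") for line in path.read_text().splitlines())
            ]
        except (OSError, ValueError):
            unreadable.append(key)  # let the UI warn about this metric instead of erroring out
    return {"metrics": histories, "unreadable_metrics": unreadable}
```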