[FR] Improve UI resilience to corrupt metric files #12030
Labels
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
enhancement: New feature or request
Willingness to contribute
No. I cannot contribute this feature at this time.
Proposal Summary
When using a file-system backend store, OS-level issues can sometimes cause corrupt metric files.
This has previously been reported in issues such as #3052.
This issue can take down the UI for an entire experiment section, with an error message shown in the UI.
It would be very helpful if these sorts of issues could be handled gracefully in the UI, i.e. without crashing the whole page.
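To illustrate the failure mode, here is a minimal sketch (not MLflow code), assuming the FileStore's plain-text layout of one "timestamp value step" line per logged metric value; the function name is illustrative only:

```python
from pathlib import Path

def parse_metric_file(path: Path) -> list[tuple[int, float, int]]:
    # Assumed on-disk format: one "<timestamp> <value> <step>" line per metric value.
    values = []
    for line in path.read_text().splitlines():
        timestamp, value, step = line.split(" ")  # a truncated line raises ValueError here
        values.append((int(timestamp), float(value), int(step)))
    return values
```

A file truncated mid-write (e.g. a final line with only two fields) makes the unpacking above raise, so the whole request fails rather than just the single bad data point.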
Motivation
Avoid crashing the whole UI when one or more file-based metrics are corrupted.
It can be difficult to track down the exact run(s)/file(s) that lead to the issue, and I've found myself building custom scripts to scan for and fix corrupt files whenever this occurs (a sketch of such a scan is included below).
This has previously been reported by multiple members of the community, e.g. #3052 and #7932.
The current alternative is to manually triage the file-based store, scanning for and fixing corrupt files.
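As a rough sketch of that kind of triage script, the following walks a file-based store and reports metric files containing lines that do not parse as "timestamp value step". The default "mlruns" path and the directory layout are assumptions about the default FileStore, and none of this is an MLflow API:

```python
from pathlib import Path

def find_corrupt_metric_files(mlruns_dir: str = "mlruns"):
    """Return (file, line number, line) tuples for metric lines that fail to parse."""
    corrupt = []
    for metric_file in Path(mlruns_dir).rglob("metrics/*"):
        if not metric_file.is_file():
            continue
        for line_no, line in enumerate(metric_file.read_text().splitlines(), start=1):
            parts = line.split(" ")
            try:
                if len(parts) != 3:
                    raise ValueError("expected 3 fields: timestamp value step")
                int(parts[0]), float(parts[1]), int(parts[2])  # validate field types
            except ValueError:
                corrupt.append((metric_file, line_no, line))
    return corrupt

if __name__ == "__main__":
    for path, line_no, line in find_corrupt_metric_files():
        print(f"{path}:{line_no}: {line!r}")
```

A script like this only reports problems; deciding whether to drop or repair a bad line is left to the user.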
Details
This could also help identify the problematic runs/metrics, e.g. by failing to plot only the corrupt metrics for the affected runs in the UI, and showing a message indicating which ones failed.
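One way this graceful degradation could look on the read path is sketched below: instead of letting one unreadable metric file fail the whole metric-history request, collect the failures and return partial data plus a list of what could not be read, so the UI can flag it. Function and field names here are illustrative assumptions, not MLflow's actual APIs:

```python
from pathlib import Path

def get_metric_history_tolerant(run_dir: Path, metric_keys: list[str]):
    histories, unreadable = {}, []
    for key in metric_keys:
        path = run_dir / "metrics" / key
        try:
            histories[key] = [
                (int(ts), float(val), int(step))
                for ts, val, step in (line.split(" ") for line in path.read_text().splitlines())
            ]
        except (OSError, ValueError):
            unreadable.append(key)  # let the UI warn about this metric instead of erroring out
    return {"metrics": histories, "unreadable_metrics": unreadable}
```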