[FR] set_experiment not safe for concurrency #10334
Labels
area/tracking: Tracking service, tracking client APIs, autologging
enhancement: New feature or request
Willingness to contribute
Yes. I would be willing to contribute this feature with guidance from the MLflow community.
Proposal Summary
Currently, calling set_experiment with a new experiment name from multiple processes in parallel can lead to a race condition: every process looks up the experiment, fails to find it, tries to create it, and all but one error out because it has already been created. It would be nice for MLflow to catch that error and retrieve the experiment again once it exists.
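The catch-and-retry pattern described above can be sketched as follows. Note this is a minimal illustration, not MLflow's actual client: FakeStore, AlreadyExists, and set_experiment_safe are toy stand-ins used only to demonstrate the race and the recovery step.

```python
import threading

class AlreadyExists(Exception):
    """Stand-in for the 'experiment already exists' error a backend raises."""

class FakeStore:
    """Toy in-memory tracking backend (illustrative, not MLflow's API)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._experiments = {}
        self._next_id = 0

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def create_experiment(self, name):
        with self._lock:
            if name in self._experiments:
                raise AlreadyExists(name)
            self._experiments[name] = self._next_id
            self._next_id += 1
            return self._experiments[name]

def set_experiment_safe(store, name):
    """Get-or-create that tolerates a concurrent creator winning the race."""
    exp = store.get_experiment_by_name(name)
    if exp is not None:
        return exp
    try:
        return store.create_experiment(name)
    except AlreadyExists:
        # Another worker created it between our lookup and our create call:
        # fall back to fetching the now-existing experiment instead of failing.
        return store.get_experiment_by_name(name)

# Simulate many parallel workers all calling set_experiment on a fresh name.
store = FakeStore()
results = []
threads = [
    threading.Thread(target=lambda: results.append(set_experiment_safe(store, "hpo")))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(set(results)))  # → [0]: every worker resolves to the one experiment
```

All eight workers converge on the same experiment id rather than erroring out, which is the behavior this request asks set_experiment to provide.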
Motivation
I'm doing hyper-parameter optimization in parallel, and I create the experiment on the workers. I could also ensure the experiment is created on the head node before dispatching to the workers, but it seems like this would be easy to support as a feature in MLflow.
Having multiple workers try to access a non-existing experiment in parallel seems like a common scenario.
Currently I do a retry in my own code, but that seems like the wrong place for it; I'd rather not have MLflow-specific retry logic in my application.
It shouldn't be very difficult, and it seems more appropriate to support directly in MLflow (based on my understanding of MLflow's scope, which is not very extensive).