
[FR] Improve performance by lowering amount of calls to retrieve model #5507

Closed
4 of 23 tasks
Davidswinkels opened this issue Mar 17, 2022 · 7 comments
Labels
area/models MLmodel format, model serialization/deserialization, flavors area/server-infra MLflow Tracking server backend area/tracking Tracking service, tracking client APIs, autologging enhancement New feature or request

Comments

@Davidswinkels
Contributor

Davidswinkels commented Mar 17, 2022

Thank you for submitting a feature request. Before proceeding, please review MLflow's Issue Policy for feature requests and the MLflow Contributing Guide.

Please fill in this feature request template to ensure a timely and thorough response.

Willingness to contribute

The MLflow Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature (either as an MLflow Plugin or an enhancement to the MLflow code base)?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the MLflow community.
  • No. I cannot contribute this feature at this time.

Proposal Summary

Retrieve models more efficiently by reducing the required number of requests.

Currently, retrieving a model requires 3 requests:

import os
import mlflow

experiment_name = "energy_forecast_10001_Amsterdam"
experiment = mlflow.get_experiment_by_name(experiment_name)
run = mlflow.search_runs(experiment.experiment_id, max_results=1)
model = mlflow.sklearn.load_model(os.path.join(run.artifact_uri[0], "model/"))

It would be nice if this could be sped up by getting the model in only 1 request:
model = mlflow.sklearn.load_latest_model(experiment_name)

or 2 requests:
run = mlflow.search_runs(experiment_name, max_results=1)
model = mlflow.sklearn.load_model(os.path.join(run.artifact_uri[0], "model/"))

Motivation

  • What is the use case for this feature?
    Performance
  • Why is this use case valuable to support for MLflow users in general?
    Faster model loading for all MLflow users.
  • Why is this use case valuable to support for your project(s) or organization?
    Performance.
  • Why is it currently difficult to achieve this use case? (please be as specific as possible about why related MLflow features and components are insufficient)
    It's difficult or impossible to improve performance at a higher level when the lower-level calls are not performant.

What component(s), interfaces, languages, and integrations does this feature affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interfaces

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Languages

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Details

(Use this section to include any additional information about the feature. If you have a proposal for how to implement this feature, please include it here. For implementation guidelines, please refer to the Contributing Guide.)

@Davidswinkels Davidswinkels added the enhancement New feature or request label Mar 17, 2022
@github-actions github-actions bot added area/artifacts Artifact stores and artifact logging area/model-registry Model registry, model registry APIs, and the fluent client calls for model registry area/models MLmodel format, model serialization/deserialization, flavors area/server-infra MLflow Tracking server backend labels Mar 17, 2022
@BenWilson2
Member

Hi @Davidswinkels have you taken a look at the model registry functionality?
https://www.mlflow.org/docs/latest/model-registry.html#fetching-an-mlflow-model-from-the-model-registry

The ability to retrieve a particular model with a single API call is in there, allowing you to get an artifact by specifying a version or a stage directly. This might simplify your use case.
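For example (a minimal sketch, assuming a model has already been registered under the name "energy_forecast_10001_Amsterdam" and promoted to the Production stage; the name and stage are placeholders, not part of this issue):

import mlflow

# Load the latest Production version of a registered model in a single call.
model = mlflow.pyfunc.load_model("models:/energy_forecast_10001_Amsterdam/Production")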

As far as tightly coupling the tracking server and artifact retrieval into a single API call, I'm afraid that it wouldn't buy any performance improvement (they are separate services) and would only complicate the APIs.

Hopefully the model registry (and perhaps also Projects https://www.mlflow.org/docs/latest/projects.html ) might help to reduce the number of lines of code in your work if that is the concern.

Please let me know if there are any other points that you'd like to discuss.

@Davidswinkels
Contributor Author

Hi Ben. Thanks for the answer! We will look further into the model registry documentation to see if it improves performance and lets us fetch models with less code, and then get back here. Yes, I agree that from a loose-coupling perspective it's nice to keep the tracking server and artifact retrieval separated, so it's not wise to do:
model = mlflow.sklearn.load_latest_model(experiment_name).

Do you or others think it is interesting to be able to search_runs based on experiment_name? Or is there no need for that with model_registry?

run = mlflow.search_runs(experiment_name, max_results=1)
model = mlflow.sklearn.load_model(os.path.join(run.artifact_uri[0], "model/"))

@BenWilson2
Member

There certainly won't be a need to search for the experiment name while using the model registry since there is a very small subset of "production-capable" models that would be registered.
That being said, that doesn't seem to be your use case if you're asking about searching for runs based on experiment names.
We'll have an internal discussion around whether we'd like to pursue something like this. It would add a further layer of complication performance-wise to the search_runs fluent API, since we'd be adding a call to get the experiment information for each of the provided names in the search query, e.g. from the SQLAlchemy store:

def get_experiment_by_name(self, experiment_name):
    """
    Specialized implementation for SQL backed store.
    """
    with self.ManagedSessionMaker() as session:
        stages = LifecycleStage.view_type_to_stages(ViewType.ALL)
        experiment = (
            session.query(SqlExperiment)
            .options(*self._get_eager_experiment_query_options())
            .filter(
                SqlExperiment.name == experiment_name, SqlExperiment.lifecycle_stage.in_(stages)
            )
            .one_or_none()
        )
        return experiment.to_mlflow_entity() if experiment is not None else None

Absolutely no promises here other than the fact that we'll discuss it.

@BenWilson2
Member

Hi @Davidswinkels if you're up for creating a search_runs_by_experiment_name() implementation that performs the client-side resolution of experiment names to experiment_id's and then submits those to the search_runs() API, please feel free to file a PR and we'll be more than happy to review and provide feedback.
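A minimal sketch of what such a helper could look like (the function name search_runs_by_experiment_name and its exact signature are assumptions for illustration, not a final API):

import mlflow

def search_runs_by_experiment_name(experiment_names, **kwargs):
    # Resolve each experiment name to its experiment_id on the client side,
    # skipping names that do not resolve to an existing experiment.
    experiments = [mlflow.get_experiment_by_name(name) for name in experiment_names]
    experiment_ids = [e.experiment_id for e in experiments if e is not None]
    # Delegate to the existing search_runs() API with the resolved ids.
    return mlflow.search_runs(experiment_ids, **kwargs)

runs = search_runs_by_experiment_name(["energy_forecast_10001_Amsterdam"], max_results=1)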

@Davidswinkels
Contributor Author

Davidswinkels commented Mar 25, 2022

Hey Ben. We were thinking of adding this since we were using MLflow with a file-based backend-store-uri. We are currently switching to a database backend to improve performance, and with a database backend we can also switch to the model registry. We still have to test how much performance would improve from using the model registry calls. From an initial performance check, "get_experiment_by_name" no longer seems to be the bottleneck:

experiment = mlflow.get_experiment_by_name(experiment_name)
run = mlflow.search_runs(experiment.experiment_id, max_results=1)
model = mlflow.sklearn.load_model(os.path.join(run.artifact_uri[0], "model/"))

Performance comparison of MLflow model retrieval (file-based vs SQLite database) over 10 calls:

Call                                                                      File-based   Database (SQLite)
mlflow.get_experiment_by_name(experiment_name)                            5.3 s        0.4 s
mlflow.search_runs(experiment.experiment_id, max_results=1)               8.1 s        4.7 s
mlflow.sklearn.load_model(os.path.join(run.artifact_uri[0], "model/"))    0.2 s        0.2 s

The requested feature to get a run by experiment_name would still improve performance quite a bit for people who use the file-based backend, but we won't develop it for now, since with the database backend getting the experiment by experiment_name is much less of a performance issue for us.

@r3stl355
Contributor

I'll give this a try

@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging and removed area/model-registry Model registry, model registry APIs, and the fluent client calls for model registry area/artifacts Artifact stores and artifact logging labels Mar 31, 2022
@Davidswinkels
Contributor Author

Davidswinkels commented Apr 12, 2022

This issue was resolved by this PR (#5564) and the mlflow 1.25.0 release. I did a small test on mlflow==1.25.0 with a SQLite database. Performance did improve! It varied quite a bit compared to before, probably due to the environment (local vs Kubernetes cluster, and file-based vs SQLite) and also how many runs/models were stored.

Summary of the performance check for model retrieval per code chunk

Retrieval method                                        Average over 10 calls
Tracking server: model via name + experiment + run      1.48 s
Tracking server: model via name + run                   1.46 s
Model registry: model via version                       1.62 s
Model registry: multiple models via stage None          1.46 s
Model registry: single model via stage Production       1.49 s

Tracking server model retrieval

Retrieve model via name + experiment + run (1.48 s ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each))

experiment_name="Blub5"
experiment = mlflow.get_experiment_by_name(experiment_name)
run = mlflow.search_runs(experiment.experiment_id, max_results=1)
model = mlflow.pyfunc.load_model(os.path.join(run.artifact_uri[0], "model/"))

Retrieve model via name + run (1.46 s ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each))

experiment_name="Blub5"
run = mlflow.search_runs(experiment_names=[experiment_name], max_results=1)
model = mlflow.pyfunc.load_model(os.path.join(run.artifact_uri[0], "model/"))

Model registry model retrieval

Retrieve model via version + model registry (1.62 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 10 loops each))

model_name = "Blub5"
client = MlflowClient()
model_versions = client.get_latest_versions(model_name, stages=["None"])
model_version = model_versions[0].version
model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{model_version}")

Retrieve model via stage None + model registry (1.46 s ± 43.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each))
Ten models are in stage None, but the most recently trained model will be retrieved.

model_name = "Blub5"
stage = 'None'
model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{stage}")

Retrieve model via stage Production + model registry (1.49 s ± 66.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each))
A single model is in the Production stage.

model_name = "Blub5"
stage = 'Production'
model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{stage}")

Thanks @r3stl355 for implementing this. It's much neater to be able to get a run based on experiment_name directly from the tracking server :)
