[BUG] Loading more runs in the experiment UI becomes very slow with a large number of rows #5653
Closed
3 of 23 tasks
Labels
area/tracking: Tracking service, tracking client APIs, autologging
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
bug: Something isn't working
Willingness to contribute
The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?
System information
mlflow --version: 1.24.1.dev0 (at commit 3984c2c)
Describe the problem
As the number of runs in the run table increases, clicking "Load more" to load more rows takes longer and performance degrades significantly.
The table below shows the time taken to load 100 more rows, measured as the time from clicking "Load more" until the new rows are fully rendered and the load more button is visible again.
Only around 100 ms of each load is actually spent fetching the data from the backend, plus two queries to the model-versions/search endpoint that take less than 10 ms each (they return empty responses in my test case). The rest of the time is spent in the front end.
These numbers are all based on running a production build of the UI from commit 3984c2c, using Firefox 98 on Fedora Linux, and using a local SQLite store, on a machine with an AMD 5900X and 64 GB of RAM. I see the same behaviour under Chrome on Linux and Windows too, although performance does seem a bit better with Chrome.
I've tested a series of changes that all help to improve performance significantly:
getRowId, which is supposed to help prevent re-rendering rows that haven't changed when updating the data (https://www.ag-grid.com/react-data-grid/row-ids/): 3 s
Just using getRowId without removing the full width cell renderer didn't work; it caused some errors internally within ag-grid (this.cellComp is undefined).
Making these two changes while sticking to version 25.3.0 didn't work well either: removing the full width cell renderer improved performance, but implementing getRowNodeId (which was deprecated in 27.1.0 and replaced by getRowId) seemed to make performance worse.
There are a couple of new warnings that appear with a dev build after the update to 27.1.0:
It looks like this has been opened as a bug in ag-grid but closed without any activity: ag-grid/ag-grid#4817 (if Databricks have a support contract with ag-grid, maybe you could get them to take a look?)
There's also currently a problem with nested runs where the parent run's row doesn't re-render when it is expanded or contracted, so the plus icon stays as a plus. Hopefully I can find a workaround for that but I thought I should open an issue before doing too much work on this.
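To illustrate the getRowId change discussed above, here is a minimal sketch of the callback and of why stable row IDs should help. It is not the actual MLflow code: the RunRow type, the runUuid field and the rowsNeedingRender helper are assumptions made up for this example, and GetRowIdParams is a local stand-in that only mirrors the data property of the ag-grid type so the snippet runs without the library.

```typescript
// Local stand-ins so the sketch runs without ag-grid installed.
interface RunRow {
  runUuid: string; // hypothetical unique, stable field on each row
  status: string;
}

interface GetRowIdParams<T> {
  data: T; // ag-grid passes the row data on params.data
}

// Callback in the shape ag-grid 27.1+ accepts as the `getRowId` grid option.
const getRowId = (params: GetRowIdParams<RunRow>): string => params.data.runUuid;

// Illustration only (not ag-grid internals): with stable IDs, a keyed
// comparison can tell which rows are new or changed, so the rest can be
// left alone instead of being re-rendered.
function rowsNeedingRender(prev: RunRow[], next: RunRow[]): string[] {
  const prevById = new Map<string, RunRow>();
  for (const row of prev) {
    prevById.set(getRowId({ data: row }), row);
  }
  return next
    .filter((row) => {
      const old = prevById.get(getRowId({ data: row }));
      return old === undefined || old.status !== row.status;
    })
    .map((row) => getRowId({ data: row }));
}

const makeRows = (n: number): RunRow[] =>
  Array.from({ length: n }, (_, i) => ({ runUuid: `run-${i}`, status: "FINISHED" }));

// Going from 500 to 600 rows: only the 100 appended rows need rendering.
const toRender = rowsNeedingRender(makeRows(500), makeRows(600));
console.log(toRender.length);
```

In the real grid the callback would be passed as the getRowId grid option; the keyed diff here only models the effect that option is meant to have.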
Edit: Also, the start time and model columns don't get re-rendered for in-progress runs when clicking "Refresh" (and other columns with custom renderers are probably affected too). I'm guessing this is because, e.g., the model column refers to a non-existent models field, and the startTime field value doesn't actually change, so ag-grid doesn't think these need re-rendering. It should be possible to make this work correctly by using fields that actually change when the rendered result should change.
Would you be happy to consider a PR with these changes?
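As a rough sketch of the "use fields that actually change" idea from the edit above: a column can derive its value from everything the cell renderer depends on, so that the value differs whenever the rendered output should differ. This assumes the start time cell also displays something that depends on the run status. The RunRow shape and the startTimeValueGetter name are hypothetical, and ValueGetterParams is a local stand-in mirroring only the data property of the ag-grid type.

```typescript
// Local stand-ins so the sketch runs without ag-grid installed.
interface RunRow {
  runUuid: string;
  startTime: number; // epoch milliseconds; does not change once the run has started
  status: string;    // assumed to change, e.g. from "RUNNING" to "FINISHED"
}

interface ValueGetterParams<T> {
  data: T;
}

// Hypothetical value getter: returns a primitive string that changes whenever
// the rendered start time cell should change (here: when the status changes).
const startTimeValueGetter = (params: ValueGetterParams<RunRow>): string =>
  `${params.data.startTime}:${params.data.status}`;

const running: RunRow = { runUuid: "run-1", startTime: 1650000000000, status: "RUNNING" };
const finished: RunRow = { ...running, status: "FINISHED" };

// Comparing the raw field sees no change, so a refresh would be skipped.
const rawFieldChanged = running.startTime !== finished.startTime;

// Comparing the derived value sees the change.
const derivedValueChanged =
  startTimeValueGetter({ data: running }) !== startTimeValueGetter({ data: finished });

console.log(rawFieldChanged, derivedValueChanged);
```

Returning a string rather than an object keeps the comparison a simple value comparison; whether this is enough for every custom renderer in the table would need checking against the real column definitions.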
Code to reproduce issue
NA
Other info / logs
Profiling data from Firefox when going from 500 to 600 rows, building with yarn run build --profile: https://share.firefox.dev/3jjD11S
What component(s), interfaces, languages, and integrations does this bug affect?
Components
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
Interface
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
Language
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
Integrations
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations