[BUG] Loading more runs in the experiment UI becomes very slow with a large number of rows #5653

adamreeve · 2022-04-11T02:34:48Z

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

Yes. I can contribute a fix for this bug independently.
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
No. I cannot contribute a bug fix at this time.

System information

Have I written custom code (as opposed to using a stock example script provided in MLflow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Fedora Linux 35
MLflow installed from (source or binary): Source
MLflow version (run mlflow --version): 1.24.1.dev0 (at commit 3984c2c)
Python version: 3.9.12
npm version, if running the dev UI: 8.3.1
Exact command to reproduce: Create > 1000 runs in an experiment. Then in the experiment view, keep clicking "Load more"

Describe the problem

As the number of runs in the run table increases, clicking "Load more" to load more rows takes longer and performance degrades significantly.

The table below shows the time taken to load 100 more rows, measured as the time from clicking "Load more" until the new rows are fully rendered and the load more button is visible again.

Starting number of rows	Time taken to load 100 more rows
100	3 s
200	5 s
300	8 s
400	11 s
500	15 s
600	21 s
700	26 s
800	37 s
900	45 s
1000	59 s

Only around 100 ms of each loading time is actually fetching the data from the backend, plus two queries to the model-versions/search endpoint that take less than 10 ms each (they are returning empty responses in my test case).

These numbers are all based on running a production build of the UI from commit 3984c2c, using Firefox 98 on Fedora Linux, and using a local SQLite store, on a machine with an AMD 5900X and 64 GB of RAM. I see the same behaviour under Chrome on Linux and Windows too, although performance does seem a bit better with Chrome.

I've tested a series of changes that all help to improve performance significantly:

Updating to ag-grid 27.1.0, which reduces the time to go from 1000 to 1100 runs to 33 s
Removing the use of the full width cell renderer and moving the LoadMoreBar out of the table: down to 11 s
Implementing getRowId, which is supposed to help prevent re-rendering rows that haven't changed when updating the data (https://www.ag-grid.com/react-data-grid/row-ids/): 3 s

Just using getRowId without removing the full width cell renderer didn't work, this caused some errors internally within ag-grid (this.cellComp is undefined).

And making these two changes while sticking to version 25.3.0 didn't work well either, removing the full width cell renderer improved performance but implementing getRowNodeId (which was deprecated in 27.1.0 and replaced by getRowId) seemed to make performance worse.

There are a couple of new warnings that appear with a dev build after the update to 27.1.0:

Warning: Can't perform a React state update on an unmounted component. This is a no-op, but it indicates a memory leak in your application. To fix, cancel all subscriptions and asynchronous tasks in the componentWillUnmount method.
AgGridReactUi@http://localhost:3000/static-files/static/js/bundle.js:117332:20
AgGridReact@http://localhost:3000/static-files/static/js/bundle.js:116686:20
div
ExperimentRunsTableMultiColumnView2@http://localhost:3000/static-files/static/js/bundle.js:17195:193
div
Spacer@http://localhost:3000/static-files/static/js/bundle.js:45732:193
div
div
ExperimentView@http://localhost:3000/static-files/static/js/bundle.js:18230:193
injectIntl(ExperimentView)
ConnectFunction@http://localhost:3000/static-files/static/js/bundle.js:423772:68
C@http://localhost:3000/static-files/static/js/bundle.js:426980:31

Warning: Can't perform a React state update on an unmounted component. This is a no-op, but it indicates a memory leak in your application. To fix, cancel all subscriptions and asynchronous tasks in a useEffect cleanup function.
GridBodyComp@http://localhost:3000/static-files/static/js/bundle.js:118705:37
TabGuardCompRef@http://localhost:3000/static-files/static/js/bundle.js:120742:18
div
div
GridComp@http://localhost:3000/static-files/static/js/bundle.js:118950:17
div
AgGridReactUi@http://localhost:3000/static-files/static/js/bundle.js:117332:20
AgGridReact@http://localhost:3000/static-files/static/js/bundle.js:116686:20
div
ExperimentRunsTableMultiColumnView2@http://localhost:3000/static-files/static/js/bundle.js:17195:193
div
Spacer@http://localhost:3000/static-files/static/js/bundle.js:45732:193
div
div
ExperimentView@http://localhost:3000/static-files/static/js/bundle.js:18230:193
injectIntl(ExperimentView)
ConnectFunction@http://localhost:3000/static-files/static/js/bundle.js:423772:68
C@http://localhost:3000/static-files/static/js/bundle.js:426980:31 react-dom.development.js:67

It looks like this has been opened as a bug in ag-grid but closed without any activity: ag-grid/ag-grid#4817 (if Databricks have a support contract with ag-grid, maybe you could get them to take a look?)

There's also currently a problem with nested runs where the parent run's row doesn't re-render when it is expanded or contracted, so the plus icon stays as a plus. Hopefully I can find a workaround for that but I thought I should open an issue before doing too much work on this.

Edit: Also, the start time and model columns don't get re-rendered for in-progress runs when clicking "Refresh" (and other columns with custom renderers are probably affected too). I'm guessing this is because eg. the model column refers to a non-existent models field, and the startTime field value doesn't actually change, so ag-grid doesn't think these need re-rendering. It should be possible to make this work correctly by using fields that actually change when the rendered result should change.

Would you be happy to consider a PR with these changes?

Code to reproduce issue

NA

Other info / logs

Profiling data from Firefox when going from 500 to 600 rows, building with yarn run build --profile: https://share.firefox.dev/3jjD11S

What component(s), interfaces, languages, and integrations does this bug affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

The text was updated successfully, but these errors were encountered:

harupy · 2022-04-12T00:40:51Z

Hi @adamreeve, thanks for the thorough investigation!

Would you be happy to consider a PR with these changes?

YES, looking forward to your PR!

adamreeve · 2022-04-12T21:59:15Z

Unfortunately it looks like this is blocked on a bug in ag-grid and I don't see a way to work around this on the mlflow side yet. I've reported this in ag-grid/ag-grid#5085 and opened a PR there with a fix, so if and when that fix is released I should be able to open a PR for this.

harupy · 2022-04-12T23:27:36Z

@adamreeve

I've reported this in ag-grid/ag-grid#5085 and opened a PR there with a fix

Thanks a lot!

when that fix is released I should be able to open a PR for this.

Sounds good!

adamreeve added the bug Something isn't working label Apr 11, 2022

github-actions bot added area/tracking Tracking service, tracking client APIs, autologging area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server labels Apr 11, 2022

adamreeve mentioned this issue Apr 20, 2022

Update ag-grid and implement getRowId to improve runs table performance #5725

Merged

29 tasks

dbczumar closed this as completed in #5725 May 6, 2022

e-dorigatti mentioned this issue Jan 12, 2024

We'd like your feedback on our MLflow experience! #10329

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Loading more runs in the experiment UI becomes very slow with a large number of rows #5653

[BUG] Loading more runs in the experiment UI becomes very slow with a large number of rows #5653

adamreeve commented Apr 11, 2022 •

edited

harupy commented Apr 12, 2022 •

edited

adamreeve commented Apr 12, 2022 •

edited

harupy commented Apr 12, 2022 •

edited

[BUG] Loading more runs in the experiment UI becomes very slow with a large number of rows #5653

[BUG] Loading more runs in the experiment UI becomes very slow with a large number of rows #5653

Comments

adamreeve commented Apr 11, 2022 • edited

Willingness to contribute

System information

Describe the problem

Code to reproduce issue

Other info / logs

What component(s), interfaces, languages, and integrations does this bug affect?

harupy commented Apr 12, 2022 • edited

adamreeve commented Apr 12, 2022 • edited

harupy commented Apr 12, 2022 • edited

adamreeve commented Apr 11, 2022 •

edited

harupy commented Apr 12, 2022 •

edited

adamreeve commented Apr 12, 2022 •

edited

harupy commented Apr 12, 2022 •

edited