Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Loading more runs in the experiment UI becomes very slow with a large number of rows #5653

Closed
3 of 23 tasks
adamreeve opened this issue Apr 11, 2022 · 3 comments · Fixed by #5725
Closed
3 of 23 tasks
Labels
area/tracking Tracking service, tracking client APIs, autologging area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server bug Something isn't working

Comments

@adamreeve
Copy link
Contributor

adamreeve commented Apr 11, 2022

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Fedora Linux 35
  • MLflow installed from (source or binary): Source
  • MLflow version (run mlflow --version): 1.24.1.dev0 (at commit 3984c2c)
  • Python version: 3.9.12
  • npm version, if running the dev UI: 8.3.1
  • Exact command to reproduce: Create > 1000 runs in an experiment. Then in the experiment view, keep clicking "Load more"

Describe the problem

As the number of runs in the run table increases, clicking "Load more" to load more rows takes longer and performance degrades significantly.

The table below shows the time taken to load 100 more rows, measured as the time from clicking "Load more" until the new rows are fully rendered and the load more button is visible again.

Starting number of rows Time taken to load 100 more rows
100 3 s
200 5 s
300 8 s
400 11 s
500 15 s
600 21 s
700 26 s
800 37 s
900 45 s
1000 59 s

Only around 100 ms of each loading time is actually fetching the data from the backend, plus two queries to the model-versions/search endpoint that take less than 10 ms each (they are returning empty responses in my test case).

These numbers are all based on running a production build of the UI from commit 3984c2c, using Firefox 98 on Fedora Linux, and using a local SQLite store, on a machine with an AMD 5900X and 64 GB of RAM. I see the same behaviour under Chrome on Linux and Windows too, although performance does seem a bit better with Chrome.

I've tested a series of changes that all help to improve performance significantly:

  • Updating to ag-grid 27.1.0, which reduces the time to go from 1000 to 1100 runs to 33 s
  • Removing the use of the full width cell renderer and moving the LoadMoreBar out of the table: down to 11 s
  • Implementing getRowId, which is supposed to help prevent re-rendering rows that haven't changed when updating the data (https://www.ag-grid.com/react-data-grid/row-ids/): 3 s

Just using getRowId without removing the full width cell renderer didn't work, this caused some errors internally within ag-grid (this.cellComp is undefined).

And making these two changes while sticking to version 25.3.0 didn't work well either, removing the full width cell renderer improved performance but implementing getRowNodeId (which was deprecated in 27.1.0 and replaced by getRowId) seemed to make performance worse.

There are a couple of new warnings that appear with a dev build after the update to 27.1.0:

Warning: Can't perform a React state update on an unmounted component. This is a no-op, but it indicates a memory leak in your application. To fix, cancel all subscriptions and asynchronous tasks in the componentWillUnmount method.
AgGridReactUi@http://localhost:3000/static-files/static/js/bundle.js:117332:20
AgGridReact@http://localhost:3000/static-files/static/js/bundle.js:116686:20
div
ExperimentRunsTableMultiColumnView2@http://localhost:3000/static-files/static/js/bundle.js:17195:193
div
Spacer@http://localhost:3000/static-files/static/js/bundle.js:45732:193
div
div
ExperimentView@http://localhost:3000/static-files/static/js/bundle.js:18230:193
injectIntl(ExperimentView)
ConnectFunction@http://localhost:3000/static-files/static/js/bundle.js:423772:68
C@http://localhost:3000/static-files/static/js/bundle.js:426980:31
Warning: Can't perform a React state update on an unmounted component. This is a no-op, but it indicates a memory leak in your application. To fix, cancel all subscriptions and asynchronous tasks in a useEffect cleanup function.
GridBodyComp@http://localhost:3000/static-files/static/js/bundle.js:118705:37
TabGuardCompRef@http://localhost:3000/static-files/static/js/bundle.js:120742:18
div
div
GridComp@http://localhost:3000/static-files/static/js/bundle.js:118950:17
div
AgGridReactUi@http://localhost:3000/static-files/static/js/bundle.js:117332:20
AgGridReact@http://localhost:3000/static-files/static/js/bundle.js:116686:20
div
ExperimentRunsTableMultiColumnView2@http://localhost:3000/static-files/static/js/bundle.js:17195:193
div
Spacer@http://localhost:3000/static-files/static/js/bundle.js:45732:193
div
div
ExperimentView@http://localhost:3000/static-files/static/js/bundle.js:18230:193
injectIntl(ExperimentView)
ConnectFunction@http://localhost:3000/static-files/static/js/bundle.js:423772:68
C@http://localhost:3000/static-files/static/js/bundle.js:426980:31 react-dom.development.js:67

It looks like this has been opened as a bug in ag-grid but closed without any activity: ag-grid/ag-grid#4817 (if Databricks have a support contract with ag-grid, maybe you could get them to take a look?)

There's also currently a problem with nested runs where the parent run's row doesn't re-render when it is expanded or contracted, so the plus icon stays as a plus. Hopefully I can find a workaround for that but I thought I should open an issue before doing too much work on this.

Edit: Also, the start time and model columns don't get re-rendered for in-progress runs when clicking "Refresh" (and other columns with custom renderers are probably affected too). I'm guessing this is because eg. the model column refers to a non-existent models field, and the startTime field value doesn't actually change, so ag-grid doesn't think these need re-rendering. It should be possible to make this work correctly by using fields that actually change when the rendered result should change.

Would you be happy to consider a PR with these changes?

Code to reproduce issue

NA

Other info / logs

Profiling data from Firefox when going from 500 to 600 rows, building with yarn run build --profile: https://share.firefox.dev/3jjD11S

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
@adamreeve adamreeve added the bug Something isn't working label Apr 11, 2022
@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server labels Apr 11, 2022
@harupy
Copy link
Member

harupy commented Apr 12, 2022

Hi @adamreeve, thanks for the thorough investigation!

Would you be happy to consider a PR with these changes?

YES, looking forward to your PR!

@adamreeve
Copy link
Contributor Author

adamreeve commented Apr 12, 2022

Unfortunately it looks like this is blocked on a bug in ag-grid and I don't see a way to work around this on the mlflow side yet. I've reported this in ag-grid/ag-grid#5085 and opened a PR there with a fix, so if and when that fix is released I should be able to open a PR for this.

@harupy
Copy link
Member

harupy commented Apr 12, 2022

@adamreeve

I've reported this in ag-grid/ag-grid#5085 and opened a PR there with a fix

Thanks a lot!

when that fix is released I should be able to open a PR for this.

Sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tracking Tracking service, tracking client APIs, autologging area/uiux Front-end, user experience, plotting, JavaScript, JavaScript dev server bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants