Performance regression with permission graph on big instances #30474

Closed · paoliniluis opened this issue May 2, 2023 · 3 comments

paoliniluis (Contributor) commented May 2, 2023

Describe the bug

Product doc

The permission graph grows exponentially with the number of schemas/tables, groups, and permissions, which causes a performance regression. I ran a test creating 500 DB connections (the sample database, with only 1 schema and 4 tables) and 500 users in 500 groups, and got the following performance distribution.
Number of users created per minute: [chart]

Number of permissions changed per minute:

| Minute | Permission changes |
|---|---|
| 2023-05-01 22:27:00 | 135 |
| 2023-05-01 22:28:00 | 133 |
| 2023-05-01 22:29:00 | 101 |
| 2023-05-01 22:30:00 | 86 |
| 2023-05-01 22:31:00 | 75 |
| 2023-05-01 22:32:00 | 65 |
| 2023-05-01 22:33:00 | 57 |
| 2023-05-01 22:34:00 | 55 |
| 2023-05-01 22:35:00 | 51 |
| 2023-05-01 22:36:00 | 48 |
| 2023-05-01 22:37:00 | 45 |
| 2023-05-01 22:38:00 | 42 |
| 2023-05-01 22:39:00 | 40 |
| 2023-05-01 22:40:00 | 39 |
| 2023-05-01 22:41:00 | 37 |
| 2023-05-01 22:42:00 | 35 |
| 2023-05-01 22:43:00 | 34 |
| 2023-05-01 22:44:00 | 34 |
| 2023-05-01 22:45:00 | 32 |
| 2023-05-01 22:46:00 | 31 |
| 2023-05-01 22:47:00 | 29 |
| 2023-05-01 22:48:00 | 29 |
| 2023-05-01 22:49:00 | 27 |
| 2023-05-01 22:50:00 | 27 |
| 2023-05-01 22:51:00 | 27 |
| 2023-05-01 22:52:00 | 25 |
| 2023-05-01 22:53:00 | 24 |
| 2023-05-01 22:54:00 | 24 |
| 2023-05-01 22:55:00 | 24 |
| 2023-05-01 22:56:00 | 22 |
| 2023-05-01 22:57:00 | 22 |
| 2023-05-01 22:58:00 | 21 |
| 2023-05-01 22:59:00 | 21 |
| 2023-05-01 23:00:00 | 17 |
| 2023-05-01 23:01:00 | 20 |
| 2023-05-01 23:02:00 | 13 |
| 2023-05-01 23:03:00 | 10 |
| 2023-05-01 23:04:00 | 9 |
| 2023-05-01 23:05:00 | 9 |
| 2023-05-01 23:06:00 | 9 |
| 2023-05-01 23:07:00 | 9 |
| 2023-05-01 23:08:00 | 9 |
| 2023-05-01 23:09:00 | 9 |
| 2023-05-01 23:10:00 | 9 |
| 2023-05-01 23:11:00 | 9 |
| 2023-05-01 23:12:00 | 9 |
| 2023-05-01 23:13:00 | 8 |
| 2023-05-01 23:14:00 | 9 |
| 2023-05-01 23:15:00 | 8 |
| 2023-05-01 23:16:00 | 9 |
| 2023-05-01 23:17:00 | 8 |
| 2023-05-01 23:18:00 | 9 |
| 2023-05-01 23:19:00 | 8 |
| 2023-05-01 23:20:00 | 8 |
| 2023-05-01 23:21:00 | 8 |
| 2023-05-01 23:22:00 | 9 |
| 2023-05-01 23:23:00 | 8 |
| 2023-05-01 23:24:00 | 8 |
| 2023-05-01 23:25:00 | 8 |
| 2023-05-01 23:26:00 | 8 |
| 2023-05-01 23:27:00 | 8 |
| 2023-05-01 23:28:00 | 8 |
| 2023-05-01 23:29:00 | 8 |
| 2023-05-01 23:30:00 | 7 |
| 2023-05-01 23:31:00 | 8 |
| 2023-05-01 23:32:00 | 8 |
| 2023-05-01 23:33:00 | 7 |
| 2023-05-01 23:34:00 | 8 |
| 2023-05-01 23:35:00 | 7 |
| 2023-05-01 23:36:00 | 7 |
| 2023-05-01 23:37:00 | 8 |
| 2023-05-01 23:38:00 | 5 |

This could be due to several factors:

  1. Clarify docstring for the PUT /api/permissions/graph endpoint; disallow incrementing revision yourself #13884: to change the permission graph you need the latest ID to send as the revision key. To get it you have to do a GET on the permissions endpoint, which returns the entire permission graph (which can be massive) and also sends a huge query to the App DB. From what I've seen, to get the last revision key we run `SELECT * FROM "permissions_revision" ORDER BY "id" DESC` and then take the first item of the result. That forces the DB to do a sequential scan of the table, so it gets worse as the table grows. If we changed that query to `SELECT "id" FROM "permissions_revision" ORDER BY "id" DESC LIMIT 1` we could get a massive performance improvement: we would pull only one field instead of huge JSONs stored as text, and we would use the index (see the sketch after this list).
    E.g.: EXPLAIN ANALYZE SELECT * FROM "permissions_revision" ORDER BY "id" DESC
Sort  (cost=193.62..198.02 rows=1762 width=356) (actual time=1.286..1.412 rows=1953 loops=1)
  Sort Key: id DESC
  Sort Method: quicksort  Memory: 698kB
  ->  Seq Scan on permissions_revision  (cost=0.00..98.62 rows=1762 width=356) (actual time=0.011..0.249 rows=1953 loops=1)
Planning Time: 0.066 ms
Execution Time: 1.584 ms

vs.
EXPLAIN ANALYZE SELECT "id" FROM "permissions_revision" ORDER BY "id" DESC LIMIT 1

Limit  (cost=0.28..0.35 rows=1 width=4) (actual time=0.037..0.038 rows=1 loops=1)
  ->  Index Only Scan Backward using permissions_revision_pkey on permissions_revision  (cost=0.28..123.98 rows=1762 width=4) (actual time=0.037..0.037 rows=1 loops=1)
        Heap Fetches: 1
Planning Time: 0.073 ms
Execution Time: 0.084 ms
  2. Metabase takes a long time to process and parse these massive permission JSONs. I monitored resources while this process took place and the database had no problem, but Metabase was clearly struggling. Even after the process ended, any change to the permission graph takes the CPU to 50-60% and the endpoint takes ~4s to reply. The API doesn't support fetching the permission graph of a single group; it is only possible to fetch the entire graph. #25412
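
For illustration, here is a minimal client-side sketch of the round trip described in point 1 (my own sketch, not code from Metabase or from the repro repo): fetching the entire graph via GET /api/permissions/graph just to read the revision key, then echoing it back in a PUT. The base URL, session token, and the group/database IDs are placeholders, and the entry shape is only illustrative.

```python
# A minimal sketch, assuming a local Metabase, a valid session token, and the
# `requests` library. It shows why point 1 hurts: the client downloads the
# ENTIRE permission graph (potentially several MB) just to learn the current
# revision before it can PUT a change.
import requests

BASE_URL = "http://localhost:3000"                   # placeholder instance
HEADERS = {"X-Metabase-Session": "<session-token>"}  # placeholder token

# GET returns the whole graph, including every group and database entry.
graph = requests.get(f"{BASE_URL}/api/permissions/graph", headers=HEADERS).json()
print("revision:", graph["revision"])  # the only field we actually needed

# Illustrative change for a hypothetical group 42 and database 1; the exact
# entry shape depends on the Metabase version.
graph["groups"]["42"] = {"1": {"data": {"schemas": "all", "native": "write"}}}

# The PUT must echo the revision fetched above (it is still in `graph`);
# a stale revision is rejected, so every writer repeats the expensive GET.
requests.put(f"{BASE_URL}/api/permissions/graph", headers=HEADERS, json=graph)
```

Note that the `SELECT "id" ... LIMIT 1` change would only cut the App DB cost of producing the revision; the client would still have to download and re-upload the full graph, which is what point 2 is about.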

To Reproduce

  1. Set up Metabase with an App DB
  2. Clone https://github.com/paoliniluis/metabase-massive-permission and run it
  3. Measure the times (one possible way is sketched below)
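
As a concrete (hypothetical) example of step 3, this is one way to time the graph endpoint after each batch of changes; it is not the script from the repo above, and the URL and session token are placeholders.

```python
# A minimal timing sketch for step 3 (an assumption, not the repro repo's
# script): repeatedly time GET /api/permissions/graph to see how the response
# time degrades as the permission graph grows.
import time
import requests

BASE_URL = "http://localhost:3000"                   # placeholder instance
HEADERS = {"X-Metabase-Session": "<session-token>"}  # placeholder token

def time_graph_fetch() -> float:
    """Return the elapsed seconds for one GET /api/permissions/graph call."""
    start = time.perf_counter()
    resp = requests.get(f"{BASE_URL}/api/permissions/graph", headers=HEADERS)
    resp.raise_for_status()
    return time.perf_counter() - start

for run in range(10):
    print(f"run {run}: {time_graph_fetch():.2f}s")
```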

Expected behavior

Performance should stay consistent as the instance scales.

Logs

NA

Information about your Metabase installation

MB version: 1.46.2
Client: Python 3.11

Severity

P2

Additional context

I recreated this because we have customers with massive ~6 MB payloads on the graph endpoint, so my test is actually lightweight compared to theirs, but it's a good start.

EDIT: I did some more tests to see which endpoint is the culprit, and it is indeed the permission graph. Response time per endpoint: [chart]

RomantsovArtur commented

Any updates on this one?

iethree (Contributor) commented Dec 20, 2023

Start of the fix: #36543; it still needs to be implemented on the frontend.

escherize (Contributor) commented
The FE work has been merged in #36797; going to close this now, as #36610 is completed.

darksciencebase added this to the 0.49 milestone on Apr 9, 2024