-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter subgroups before paginating #27513
Conversation
Signed-off-by: Peter Keuter <github@peterkeuter.nl>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THanks @pkeuter , the fix makes sense to me. Handling the pagination before filtering may result in less entries being returned than expected, so it makes sense to filter before paginating.
I've added a suggestion to use the no param variation of getSubGroupsStream
along with StreamsUtil.paginatedStream
that handles the pagination (including some basic validation for the values passed as first
and max
.
services/src/main/java/org/keycloak/services/resources/admin/GroupResource.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Peter Keuter <github@peterkeuter.nl>
Signed-off-by: Peter Keuter <github@peterkeuter.nl>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me! @pedroigor @ahus1 would any of you have time to check this one and possibly merge it? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving based on @sguilhen's review. Thank you for this PR!
Thanks for merging @ahus1! Is there any chance we could get this in a patch version of 24 instead of in the next major release? That would be great. |
@pkeuter - sure, that's possible. Please create a PR against |
@sguilhen @ahus1 This PR can have performance implications on the Also, this endpoint is used from the admin console. The group page for a group with many children groups may be taking a lot of time to load even though the pagination is used. @alice-wondered Will this have a performance impact in your use case? cc: @pedroigor |
@mhajas I understand your concern, but I would think that having a correct response is more important than having a fast response? Even in the case of thousands of child groups, IMO the performance impact will still be minimal since it will still return a small portion and the rest of the processing is done on the server. |
I can foresee there being some interesting performance implications for our use case as we use a structured group approach to implement some form of multi-tenant organizations at the moment. As a result we have an extremely large number of groups nested within a top-level group. Loading all of them into memory will definitely have an impact on our services, especially since any group operation other than a read will also require rebuilding the entire cache. It looks like the organization work has pivoted towards using a first class entity rather than relying on groups, so our use case would eventually be resolved by switching our implementation to that approach. However we track upstream for our releases so the impact would be immediate. Here are some of my thoughts about this problem in general.... At the time of this work and while looking into the multi-tenant organization effort I was actually thinking about this particular scenario a bit. I'm of the opinion that the only scalable way to implement this behavior would be to change out the dynamic "just-in-time" evaluation of the permission entities into something that can be done upfront and then used as a filter for db requests. If the permissions system were able to return a list of group ids that match the permission (I believe that spiceDB supports behavior like this using graph optimizations on their lookup API for an example) then it would be easy enough to just get pages of group ids at once and then fetch all of them from the database. That's an easy enough operation to cache and we get the added benefit of the entire search being on the indexed column. I could also see java 21 virtual threads makings shorter work of this problem. Spinning up N virtual threads to "divide and conquer" the work and then ultimately forming the results into pages. The main performance concern here would be choosing N such that you don't keep high memory requirements as with a basic |
@mhajas yeah, I noticed that now that I looked into the JPA |
I think I now understand what the problem you are speaking of is about, since the adapter can get the groups dynamically via LDAP. That could indeed have a big performance impact. could, because I think the performance impact for the current situation will still be minimal. This is because the current implementation presumably did not yield any issues, because all either Unless the user does not have rights to see all groups and the first item the user can see starts at position 11 or beyond, in that case it will still need to get the first 10 items from the adapter. But I think this is the expected behaviour because the endpoint would otherwise be completely useless. So to make a long story short:
Unless I am completely missing something here: I don't think this is true, since the stream is lazy and will yield results after hitting the limit, therefore making this not an issue. But please correct me if I am wrong. |
honest question: how exactly can that happen? I don't see the search going into LDAP anywhere during the invocation of the endpoint, the LDAP groups were already imported when the group mapper synched the groups from LDAP. This search is going only into Keycloak's DB as far as I can see. I'm prob missing something but I just don't see LDAP involved in this search at all. |
But this has not changed? I've switched from this implementation to calling |
It has changed: this method is overriden in the JPA I'm not sure why exactly there's such a big performance difference between the two approaches as we are retrieving the entities via |
Ah I see, thanks for the clarification, and sorry for the potential confusion I have brought in 😅 I agree, the difference should theoretically still be minimal. We'll await some more details. |
I actually didn't know this was going on under the hood, which is what the cause for my concern originates from. Researching some it seems to be contentious That being said, in this scenario where we can't really do anything about the filtering that we have to do on every single entity then I'm not entirely sure there's a better optimization to be made. I still personally think the ideal would be more robust permissions evaluation that would let us not get records that we're going to throw away anyway but that's definitely out of the scope of this issue |
The missing part is that the keycloak/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/GroupAdapter.java Lines 224 to 238 in 28afd77
This could be a solution from the long-term perspective, however, I don't think it is a simple solution. Such big changes need to be designed and then planned accordingly to find a unified solution for all other places. This issue is more immediate and we should find a fix that can be added quickly. I would vote for reverting this PR and finding another solution to this issue as in my opinion, the performance drop is more serious than returning fewer groups. But I would be happy to hear other suggestions. |
@mhajas Thanks for the clarification. Because this function is still only used in the admin-ui (group picker modal and grouptree), I would personally still think that it's better to have a "mostly fast always correct"-result than an "always fast sometimes incorrect"-response. But that is, of course, not my decision. Just chiming in with my limited knowledge of Keycloak in it's entirety; maybe looping through the subgroups is not necessary because the update-actions already call the
Or hopefully someone with a better knowledge of the application can figure out something better like you mentioned because I really think reverting back to a broken situation doesn't seem like the best option here. |
Can you elaborate more on "mostly fast" part? I don't think I am following. Also, what do you mean by Are you suggesting removing caching from that method completely? |
Well I think most groups won't have thousands of subgroups, therefore this function will only be slow when the amount of subgroups is huge.
Sorry for not explaining too well. I was thinking these functions already call keycloak/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/GroupAdapter.java Line 279 in 28afd77
keycloak/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/GroupAdapter.java Line 272 in 28afd77
That was not what I intended with my suggested change, but looking more closely I see what you mean. The call to I am not sure what the best option would be. With the cache removed, not the entire list of subgroups will have to be fetched so this could be a viable option. But again: I don't know all the details that this change would impact so maybe it's good to have someone with more detailed knowledge look at this. But going back to a broken situation also doesn't look like the right solution to me. |
@mhajas question: if in the endpoint he calls |
Completely agree. It's definitely not a trivial change |
Sounds like something that could work. We should probably do some performance comparisons. Maybe some manual test with 10k groups would do the thing. |
I am wondering if this wouldn't be a good opportunity to test common table expressions to get the group structure recursively from the DB: https://in.relation.to/2023/02/20/hibernate-orm-62-ctes/ |
@mhajas @ahus1 @alice-wondered As we're inching towards Keycloak 25, I wanted to get back to this discussion because if this needs to get reverted we need to to do that sooner than later. I've tested some things as Michal suggested, created a group, 10k subgroups and took some time measurements when calling the 1- previous version, implemented by @alice-wondered , is the most efficient because it only fetches the required number of entities from the DB, then processes those entities (cache them, etc) and returns them. Jumping around the list - for example, requesting 10 sugroups starting from the 5000th subgroup - yields the same results in terms of response time. Again this is expected because the DB is always queried for a small batch of sugroups. Here's some time measurments in this scenario. Numbers in parenthesis reflect the
Numbers are similar when fetching sequential pages, although a little better than "randomly" jumping around the list:
2- The implementation in this PR, calling
The first request takes the longest as it fetches and caches every subgroup. Subsequent requests end up using the cache, but as I understand it this initial request fetching everything is what prompted the changes made by @alice-wondered , so it is something we don't want to do. Fetching sequential pages showed very similar results: first request takes more time, subsequent ones take around 50 ms:
3- If we change things in this PR to call
Fetching sequential pages is not much better either, and worse than the no-arg version:
4- Still using the idea of calling
Notice how the first request was a lot faster. The other ones don't look a lot better, but that's because when we jump, say, to position 5000, it has to process and cache everything until it reaches that point. Compared to Alice's version, where this is done when querying the DB, it is worse, for sure. Jumping back, say, to positon 2000, yields a much better results because of the cached groups (around 40 ms). However, when fetching sequential pages, it has shown good numbers, comparable to what we see from Alice's version:
|
What we have to decide is what to do with this information. In all scenarios the user has permission to read the returned groups. Alice's version is the most efficient for sure, but it makes the endpoint return the wrong number of results if the user doesn't have permissions for all the fetched groups. In some cases, it may even return zero subgroups if the first 10 fetched by the DB are all groups the user doesn't have permission to view. It is a regression as in previous versions the endpoint returned the expected number of entries. The last idea I've explored (number 4 above) yielded some interesting results. It prevents infinispan from caching every subgroup in the first request, and still has decent response time when fetching pages sequentially. It also properly filters the groups and returns the expected number of entries. Notice that with the Of course, if the user has permissions to see only a handful of groups out of the 10k, it might end up loading a good part of the DB and the response time will probably not be so good, but it is what it is. Unless we find a way to resolve the permissions in a different way, I don't see how we can avoid this situation. So I would really like to know what you would prefer to do here. The way I see we can 1- revert this commit, but the pagination will continue to be broken (not only here, but for top level groups as well - we have an issue raised from the community about that too) when user doesn't have permissions for every subgroup returned. We then must find a different way to tackle this, possibly reworking how the permissions are evaluated; 2- go with option number 4 above, which seems to perform decently in the default case (where user has permissions to read all subgroups), and returns the proper number of results from the method. Drawback is that performabce will probably drop if the user has permissions to see only a small subset of the subgroups compared to Alice's version. WDYT? |
I agree that this is not a good solution when the output is visibly incorrect. My solution would work best with a more upfront way of getting a batch of ids in order that are allowed and then only fetching those, but as mentioned above that would require more work and is out of immediate scope. When we get into very fine grained access with 10k+ entries I think this is the best solution long term.
This seems like the best immediate trade-off to me! Thanks for doing all of the bench-marking, the results are really interesting to read through |
Thanks for the feedback, @alice-wondered ! I also personally find option number 4 a good solution in the short term. It might be good to revisit the whole permission resolution at some point, because several endpoints end up filtering results from the DB based on the permissions and its really not an ideal thing to do when fetching results from the DB in a stream because you basically have to give up on using the |
@sguilhen Just to understand correctly, is the option 4 compared to option 3 faster because the I am wondering about: keycloak/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/GroupAdapter.java Lines 248 to 251 in 957859d
|
hey @mhajas! Sorting not done is the key difference, because when it is there it basically requires the whole stream to be available for the sort function, so it ends up loading everything. When it is not there, So, if you need to "jump" in the stream to fetch 10 sugroups starting in position 5000, it will process every subgroup from the DB, discarding the first 5000 results, and then it gets the 10 it wants and returns them. This is why I said that it is not so great when you jump around the stream too much. We are indeed not caching that call, but what I meant by my comment is that we still create a cached group for every subgroup that is read from the DB. And once the cached groups are there, subsequent requests run faster - probably creating and caching them individually is still a costly operation. For example, suppose you do a search starting on position 1000, fetching the next 10. So you end up processing the first 1010 entities loaded from the DB, creating cached group instances for each one of them. This has a cost, the response time will be a bit higher if you didn't previously cache all those subgroups. Now, if you do a search for 10 groups starting on position 500, it runs a lot faster - it still has to process 510 entities from the DB, but they have all been cached individually in the previous call and the overall response time is significantly lower. |
I am not sure about that statement. Where do we create the cached group adapter? |
This
ends up in the infinispan keycloak/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/RealmCacheSession.java Lines 894 to 916 in 6065f7d
|
Closes #27512