
Optimising high dispatch and db query counts in broad model #1607

Open · 6 tasks done
KlausVii opened this issue May 7, 2024 · 5 comments

Labels
bug (Something isn't working) · performance

Comments

@KlausVii (Contributor) commented May 7, 2024

Checklist

  • I have looked into the README and have not found a suitable solution or answer.
  • I have looked into the documentation and have not found a suitable solution or answer.
  • I have searched the issues and have not found a suitable solution or answer.
  • I have upgraded to the latest version of OpenFGA and the issue still persists.
  • I have searched the OpenFGA Community and have not found a suitable solution or answer.
  • I agree to the terms within the OpenFGA Code of Conduct.

Description

We are experiencing high latency and failing Check queries in production using the auth model below. The problem seems to stem from a particular set of users who have relations with multiple tenants and projects. When a check is performed with one of these users as the object, we are seeing dispatch counts in the hundreds and db query counts over 100.

This is causing queries to time out and putting undue pressure on the db.

Is there anything we could do to make the model perform better while still maintaining its intent? For example, we tried to substitute owner_tenant and owner_project with [tenant#can_view_users, project#can_view_users], but this made performance even worse (a rough sketch of that attempt follows).
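
One way to read that substitution (a sketch of the idea only, not necessarily the exact model we deployed) is to replace the two "can_view_users from owner_*" rewrites on can_view with direct usersets:

type user
  relations
    define can_view: [tenant#can_view_users, project#can_view_users] or owner
    define owner: [user]

That variant needs tuples like (user=tenant:acme#can_view_users, relation=can_view, object=user:123), with tenant:acme and user:123 being made-up IDs, written for every tenant and project a user belongs to.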

Another option I am aware of is decreasing the OPENFGA_RESOLVE_NODE_BREADTH_LIMIT, but I'm afraid that will lead to an unacceptable number of false negatives.

Any ideas would be appreciated.

Expectation

I'd expect to be able to check relations in a timely manner, even for objects that have many relations.

Reproduction

  1. Use the model below.
  2. Create a user[0] with many owner_tenant and owner_project relations.
  3. Create another user[1] with one shared tenant.
  4. Check user[1] can_view user[0] (see the request sketch after this list).
  5. Observe high dispatch and db query counts.
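
A minimal example of the check in step 4, assuming the HTTP Check endpoint; STORE_ID and the user IDs are placeholders:

curl -X POST "http://localhost:8080/stores/$STORE_ID/check" \
  -H 'Content-Type: application/json' \
  -d '{
    "tuple_key": {
      "user": "user:1",
      "relation": "can_view",
      "object": "user:0"
    }
  }'

The dispatch_count and datastore_query_count for the call then show up in the server's grpc_req_complete log line, as in the Logs section below.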

Store data

model
  schema 1.1

type project
  relations
    define admin: [user] or owner
    define can_view: viewer
    define can_view_users: viewer
    define editor: [user] or admin
    define member: [user] or viewer
    define owner: [user] or super_admin from owner_tenant
    define owner_tenant: [tenant]
    define revoked: [user] or revoked from owner_tenant
    define viewer: [user] or editor

type tenant
  relations
    define admin: [user] or super_admin
    define can_view: (member or external_member) but not revoked
    define can_view_users: member but not revoked
    define editor: [user] or admin
    define external_member: [user]
    define member: [user] or editor
    define revoked: [user]
    define super_admin: [user]

type user
  relations
    define can_edit: owner
    define can_list_memberships: owner
    define can_view: owner or can_view_users from owner_tenant or can_view_users from owner_project
    define owner: [user]
    define owner_project: [project]
    define owner_tenant: [tenant]

OpenFGA version

1.5.3

How are you running OpenFGA?

In Kubernetes

What datastore are you using?

Postgres

OpenFGA Flags

OPENFGA_CHECK_QUERY_CACHE_ENABLED, OPENFGA_CHECK_QUERY_CACHE_TTL=30s, OPENFGA_DATASTORE_MAX_OPEN_CONNS=500, OPENFGA_MAX_CONCURRENT_READS_FOR_CHECK=4000000, OPENFGA_MAX_CONCURRENT_READS_FOR_LIST_OBJECTS=4000000

Logs

{
	"content": {
		"timestamp": "2024-05-07T09:05:15.952Z",
		"service": "openfga",
		"message": "grpc_req_complete",
		"attributes": {
			"store_id": "01HAM89A85W7HT9MS4KTBN4NPT",
			"raw_request": {
				"store_id": "01HAM89A85W7HT9MS4KTBN4NPT",
				"trace": false,
				"tuple_key": {
					"user": "user:c03fbcc5-5fd8-4711-9be6-33ae29384606",
					"relation": "can_view",
					"object": "user:2d65e2c0-5a9a-47b6-bb8c-236d76f50fdb"
				},
				"authorization_model_id": "01HQ8AZT8BPM228F1E0F5258Y2"
			},
			"level": "info",
			"grpc_service": "openfga.v1.OpenFGAService",
			"datastore_query_count": 515,
			"grpc_code": 0,
			"grpc_method": "Check",
			"grpc_type": "unary",
			"build": {
				"commit": "ebf39989ec19c8c3d8a7646e308bcffca5748874",
				"version": "v1.5.3"
			},
			"peer": {
				"address": "127.0.0.1:37840"
			},
			"dispatch_count": 658,
			"timestamp_milli": "1715072715952",
			"raw_response": {
				"allowed": false,
				"resolution": ""
			},
			"request_id": "629dbfe5-17af-4be2-a9cd-5004be4947eb",
			"user_agent": "openfga-sdk js/0.3.5",
			"timestamp": 1715072715.9523919,
			"authorization_model_id": "01HQ8AZT8BPM228F1E0F5258Y2"
		}
	}
}
@KlausVii added the bug (Something isn't working) label on May 7, 2024
@miparnisari (Member)

Hi @KlausVii, sorry for the delay!

Is there anything we could do to make the model perform better

If your model had relations with overlapping definitions, such as

define c: a or b
define d: a or b

Then we could introduce define intermediate: a or b and then define c: intermediate, define d: intermediate to improve the cache hit ratio. However, your model already looks well optimized on this front, so there is not much we can do here.
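
Spelled out with the placeholder relation names a, b, c and d from above, that refactor would be:

define intermediate: a or b
define c: intermediate
define d: intermediate

Both c and d then resolve through the same intermediate subproblem, so its result can be cached and reused across them.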

Another option I am aware of is decreasing the OPENFGA_RESOLVE_NODE_BREADTH_LIMIT, but I'm afraid that will lead to an unacceptable number of false negatives.

This should not be the case - that flag is meant to increase the degree of concurrency that queries can have. It doesn't change the final outcome.


A few things you can try:

  • Set OPENFGA_DATASTORE_METRICS_ENABLED to true and monitor your connection pool. If open connections are regularly reaching the maximum, try increasing OPENFGA_DATASTORE_MAX_OPEN_CONNS, but only if your database allows it.
  • The default for OPENFGA_REQUEST_TIMEOUT is 3s; you can try increasing it. This is not ideal, as latency will increase, but you should see fewer errors (an example configuration follows this list).
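
For example (the values below are purely illustrative; tune them to what your Postgres instance can actually sustain):

OPENFGA_DATASTORE_METRICS_ENABLED=true
# raise only if the database has headroom for more connections
OPENFGA_DATASTORE_MAX_OPEN_CONNS=500
# default is 3s; a higher value trades latency for fewer timeouts
OPENFGA_REQUEST_TIMEOUT=10s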

In the short term we are planning to work on some performance improvements that could help your use case (e.g. #1012).

@KlausVii (Contributor, Author) commented May 11, 2024

Hmm, we've done some hacks to bypass OpenFGA for the problematic checks for now. Increasing the timeout was not an acceptable solution for us. The max conns setting is already quite high, and the query was negatively affecting other checks.

When are you planning on implementing this leopard index?

Would that help?

@jon-whit (Member)

@KlausVii if you have some sample tuples to share, that would be great. In the meantime, the problem here stems from the bi-directional relationship between user and tenant, that is, user[0] --> tenant --> user[1...N]. Since OpenFGA resolves over a directed graph, modeling that bi-directional pattern causes high fan-out for the reverse query pattern user[1...N] --> tenant --> user[0].

Starting from user[0], you have to expand all of the owner_tenant and owner_project relationships for that subject, and for each one of them (could be hundreds or thousands, it sounds like) you have to check whether user[1] is related through the can_view_users relationship. FGA Indexes will definitely help with this pattern, as the lookup for user[1] for each of the tenants and projects will drop from O(N) to O(1) 👍
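
Concretely, with made-up IDs and written in the same user/relation/object shape as a store export, the tuples driving that fan-out look something like:

# user:0 belongs to many tenants and projects
- user: tenant:t1
  relation: owner_tenant
  object: user:0
- user: tenant:t2
  relation: owner_tenant
  object: user:0
# ... potentially hundreds more owner_tenant / owner_project tuples for user:0 ...

# user:1 shares exactly one tenant with user:0
- user: user:1
  relation: member
  object: tenant:t2

A Check of user:1 can_view user:0 has to expand each of those owner_tenant / owner_project edges and evaluate can_view_users on every tenant or project it finds; in the worst case (allowed=false) it exhausts all of them.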

Shoot those tuples over though and we can see if there are some other options in the meantime. Thanks for the report!

@KlausVii (Contributor, Author) commented May 17, 2024

Hi @jon-whit, here's a sample store export with an example of a very expensive can_view check.

slow-store.txt (it's actually YAML, but GitHub doesn't allow uploading those)

To reproduce:

  • import store
  • open playground
  • follow logs
  • run assertions
  • observe the logs and see that the last can_view leads to excessive dispatches and db queries, like this:
{
  "grpc_method": "Check",
  "tuple_key": {
    "user": "user:f3659913-cde0-4aeb-b23f-dffb3176662b",
    "relation": "can_view",
    "object": "user:05d2c261-cec3-43d7-9b26-ef07b083f3ee"
  },
  "raw_response": {
    "allowed": false,
    "resolution": ""
  },
  "datastore_query_count": 829,
  "dispatch_count": 852
}

This is the worst case, where no relation to view exists, but even the more common successful one like this can lead to many db queries:

{
  "tuple_key": {
    "user": "user:0904c37c-a06b-41dc-8c29-5d66540137e9",
    "relation": "can_view",
    "object": "user:05d2c261-cec3-43d7-9b26-ef07b083f3ee"
  },
  "raw_response": {
    "allowed": true,
    "resolution": ""
  },
  "datastore_query_count": 71,
  "dispatch_count": 702
}

@KlausVii (Contributor, Author)

Also, is there any indication of when the indexing will be implemented?
