
Optimising high dispatch and db query counts in broad model #1607

Open · 6 tasks done
KlausVii opened this issue May 7, 2024 · 5 comments

Labels
bug (Something isn't working) · performance

Comments

@KlausVii (Contributor) commented May 7, 2024

Checklist

  • I have looked into the README and have not found a suitable solution or answer.
  • I have looked into the documentation and have not found a suitable solution or answer.
  • I have searched the issues and have not found a suitable solution or answer.
  • I have upgraded to the latest version of OpenFGA and the issue still persists.
  • I have searched the OpenFGA Community and have not found a suitable solution or answer.
  • I agree to the terms within the OpenFGA Code of Conduct.

Description

We are experiencing high latency and failing Check queries in production using the auth model below. The problem seems to stem from a particular set of users who have relations with multiple tenants and projects. When a check is performed with one of these users as the object, we are seeing dispatch counts in the hundreds and db query counts over 100.

This is causing queries to time out and putting undue pressure on the db.

Is there anything we could do to make the model perform better while still maintaining its intent? For example, we tried to substitute owner_tenant and owner_project with [tenant#can_view_users, project#can_view_users], but this made performance even worse (a rough sketch of that attempt follows).
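
One way to read that substitution (a sketch of the idea only, not necessarily the exact model we deployed) is to replace the two "can_view_users from owner_*" rewrites on can_view with direct usersets:

type user
  relations
    define can_view: [tenant#can_view_users, project#can_view_users] or owner
    define owner: [user]

That variant needs tuples like (user=tenant:acme#can_view_users, relation=can_view, object=user:123), with tenant:acme and user:123 being made-up IDs, written for every tenant and project a user belongs to.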

Another option I am aware of is decreasing the OPENFGA_RESOLVE_NODE_BREADTH_LIMIT, but I'm afraid that will lead to an unacceptable number of false negatives.

Any ideas would be appreciated.

Expectation

I'd expect to be able to check relations in a timely manner, even for objects that have many relations.

Reproduction

  1. Use the model below.
  2. Create a user[0] with many owner_tenant and owner_project relations.
  3. Create another user[1] with one shared tenant.
  4. Check user[1] can_view user[0] (see the request sketch after this list).
  5. Observe high dispatch and db query counts.
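
A minimal example of the check in step 4, assuming the HTTP Check endpoint; STORE_ID and the user IDs are placeholders:

curl -X POST "http://localhost:8080/stores/$STORE_ID/check" \
  -H 'Content-Type: application/json' \
  -d '{
    "tuple_key": {
      "user": "user:1",
      "relation": "can_view",
      "object": "user:0"
    }
  }'

The dispatch_count and datastore_query_count for the call then show up in the server's grpc_req_complete log line, as in the Logs section below.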

Store data

model
  schema 1.1

type project
  relations
    define admin: [user] or owner
    define can_view: viewer
    define can_view_users: viewer
    define editor: [user] or admin
    define member: [user] or viewer
    define owner: [user] or super_admin from owner_tenant
    define owner_tenant: [tenant]
    define revoked: [user] or revoked from owner_tenant
    define viewer: [user] or editor

type tenant
  relations
    define admin: [user] or super_admin
    define can_view: (member or external_member) but not revoked
    define can_view_users: member but not revoked
    define editor: [user] or admin
    define external_member: [user]
    define member: [user] or editor
    define revoked: [user]
    define super_admin: [user]

type user
  relations
    define can_edit: owner
    define can_list_memberships: owner
    define can_view: owner or can_view_users from owner_tenant or can_view_users from owner_project
    define owner: [user]
    define owner_project: [project]
    define owner_tenant: [tenant]

OpenFGA version

1.5.3

How are you running OpenFGA?

In Kubernetes

What datastore are you using?

Postgres

OpenFGA Flags

OPENFGA_CHECK_QUERY_CACHE_ENABLED, OPENFGA_CHECK_QUERY_CACHE_TTL=30s, OPENFGA_DATASTORE_MAX_OPEN_CONNS=500, OPENFGA_MAX_CONCURRENT_READS_FOR_CHECK=4000000, OPENFGA_MAX_CONCURRENT_READS_FOR_LIST_OBJECTS=4000000

Logs

{
	"content": {
		"timestamp": "2024-05-07T09:05:15.952Z",
		"service": "openfga",
		"message": "grpc_req_complete",
		"attributes": {
			"store_id": "01HAM89A85W7HT9MS4KTBN4NPT",
			"raw_request": {
				"store_id": "01HAM89A85W7HT9MS4KTBN4NPT",
				"trace": false,
				"tuple_key": {
					"user": "user:c03fbcc5-5fd8-4711-9be6-33ae29384606",
					"relation": "can_view",
					"object": "user:2d65e2c0-5a9a-47b6-bb8c-236d76f50fdb"
				},
				"authorization_model_id": "01HQ8AZT8BPM228F1E0F5258Y2"
			},
			"level": "info",
			"grpc_service": "openfga.v1.OpenFGAService",
			"datastore_query_count": 515,
			"grpc_code": 0,
			"grpc_method": "Check",
			"grpc_type": "unary",
			"build": {
				"commit": "ebf39989ec19c8c3d8a7646e308bcffca5748874",
				"version": "v1.5.3"
			},
			"peer": {
				"address": "127.0.0.1:37840"
			},
			"dispatch_count": 658,
			"timestamp_milli": "1715072715952",
			"raw_response": {
				"allowed": false,
				"resolution": ""
			},
			"request_id": "629dbfe5-17af-4be2-a9cd-5004be4947eb",
			"user_agent": "openfga-sdk js/0.3.5",
			"timestamp": 1715072715.9523919,
			"authorization_model_id": "01HQ8AZT8BPM228F1E0F5258Y2"
		}
	}
}
@KlausVii added the bug (Something isn't working) label on May 7, 2024
@miparnisari (Member)

Hi @KlausVii, sorry for the delay!

Is there anything we could do to make the model perform better

If your model had relations with overlapping definitions, such as

define c: a or b
define d: a or b

Then we could introduce define intermediate: a or b and then define c: intermediate, define d: intermediate to improve the cache hit ratio. However, your model already looks well optimized on this front, so there is not much we can do here.
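
Spelled out with the placeholder relation names a, b, c and d from above, that refactor would be:

define intermediate: a or b
define c: intermediate
define d: intermediate

Both c and d then resolve through the same intermediate subproblem, so its result can be cached and reused across them.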

Another option I am aware of is decreasing the OPENFGA_RESOLVE_NODE_BREADTH_LIMIT, but I'm afraid that will lead to an unacceptable number of false negatives.

This should not be the case - that flag is meant to increase the degree of concurrency that queries can have. It doesn't change the final outcome.


A few things you can try:

  • Set OPENFGA_DATASTORE_METRICS_ENABLED to true and monitor your connection pool. If open connections are regularly reaching the maximum, try increasing OPENFGA_DATASTORE_MAX_OPEN_CONNS, but only if your database allows it.
  • The default for OPENFGA_REQUEST_TIMEOUT is 3s; you can try increasing it. This is not ideal, as latency will increase, but you should see fewer errors (an example configuration follows this list).
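
For example (the values below are purely illustrative; tune them to what your Postgres instance can actually sustain):

OPENFGA_DATASTORE_METRICS_ENABLED=true
# raise only if the database has headroom for more connections
OPENFGA_DATASTORE_MAX_OPEN_CONNS=500
# default is 3s; a higher value trades latency for fewer timeouts
OPENFGA_REQUEST_TIMEOUT=10s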

In the short term we are planning to work on some performance improvements that could help your use case (e.g. #1012).

@KlausVii (Contributor, Author) commented May 11, 2024

Hmm, we've done some hacks to bypass OpenFGA for the problematic checks for now. Increasing the timeout was not an acceptable solution for us. The max conns setting is already quite high, and the query was negatively affecting other checks.

When are you planning on implementing this leopard index?

Would that help?

@jon-whit (Member)

@KlausVii if you have some sample tuples to share, that would be great. In the meantime, the problem here stems from the bi-directional relationship between user and tenant, that is, user[0] --> tenant --> user[1...N]. Since OpenFGA resolves over a directed graph, modeling that bi-directional pattern causes high fan-out for the reverse query pattern user[1...N] --> tenant --> user[0].

Starting from user[0], you have to expand all of the owner_tenant and owner_project relationships for that subject, and for each one of them (could be hundreds or thousands, it sounds like) you have to check whether user[1] is related through the can_view_users relationship. FGA Indexes will definitely help with this pattern, as the lookup for user[1] for each of the tenants and projects will drop from O(N) to O(1) 👍
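
Concretely, with made-up IDs and written in the same user/relation/object shape as a store export, the tuples driving that fan-out look something like:

# user:0 belongs to many tenants and projects
- user: tenant:t1
  relation: owner_tenant
  object: user:0
- user: tenant:t2
  relation: owner_tenant
  object: user:0
# ... potentially hundreds more owner_tenant / owner_project tuples for user:0 ...

# user:1 shares exactly one tenant with user:0
- user: user:1
  relation: member
  object: tenant:t2

A Check of user:1 can_view user:0 has to expand each of those owner_tenant / owner_project edges and evaluate can_view_users on every tenant or project it finds; in the worst case (allowed=false) it exhausts all of them.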

Shoot those tuples over though and we can see if there are some other options in the meantime. Thanks for the report!

@KlausVii (Contributor, Author) commented May 17, 2024

Hi @jon-whit, here's a sample store export with an example of a very expensive can_view check.

slow-store.txt (it's actually YAML, but GitHub doesn't allow uploading those)

To reproduce:

  • import store
  • open playground
  • follow logs
  • run assertions
  • observe the logs and see that the last can_view leads to excessive dispatches and db queries, like this:
{
  "grpc_method": "Check",
  "tuple_key": {
    "user": "user:f3659913-cde0-4aeb-b23f-dffb3176662b",
    "relation": "can_view",
    "object": "user:05d2c261-cec3-43d7-9b26-ef07b083f3ee"
  },
  "raw_response": {
    "allowed": false,
    "resolution": ""
  },
  "datastore_query_count": 829,
  "dispatch_count": 852
}

This is the worst case, where no relation to view exists, but even the more common successful one like this can lead to many db queries:

{
  "tuple_key": {
    "user": "user:0904c37c-a06b-41dc-8c29-5d66540137e9",
    "relation": "can_view",
    "object": "user:05d2c261-cec3-43d7-9b26-ef07b083f3ee"
  },
  "raw_response": {
    "allowed": true,
    "resolution": ""
  },
  "datastore_query_count": 71,
  "dispatch_count": 702
}

@KlausVii (Contributor, Author)

Also, is there any indication of when the indexing will be implemented?
