Partial Evaluation produce duplicate rules #4516

t83714 · 2022-03-30T02:58:56Z

Partial Evaluation produce duplicate rules

OPA version: tried 0.38.1 & 0.37.2
Example query:
- Example JSON Request (to compile endpoint): https://gist.github.com/t83714/0b6249a33ab9f5ecf45269df38818213
- Example Policy: https://github.com/magda-io/magda/blob/next/magda-opa/policies/entrypoint/allow.rego
Example output that OPA returned: https://gist.github.com/t83714/e13dfada9e20ca2550707b06ce227408
For server and CLI, the flags/configuration that you provided to OPA: "run", "--server", "/policies"
For server, any relevant log messages from OPA: N/A
For Go and Wasm, the arguments you invoked OPA with: N/A

General Summary

The partial evaluation result might include many duplicate / identical rules sometimes.

The sample response provided above includes 432 rules. Many of those are duplicated / identical rules. e.g. like (I converted AST into a more readable format):

{
  input.object.record.dcat-dataset-strings
  input.object.record.publishing.state
  NOT input.object.record.publishing.state = "draft"
  input.object.record.dataset-draft
  NOT input.object.record.dcat-dataset-strings
  input.object.record.publishing.state
  input.object.record.publishing.state = "published"
  NOT input.object.record.access-control.orgUnitId
}

or

{
  input.object.record.dcat-dataset-strings 
  input.object.record.publishing.state 
  NOT (input.object.record.publishing.state = "draft")
  input.object.record.dataset-draft
  NOT input.object.record.publishing.state = "published"
  input.object.record.publishing.state
  input.object.record.publishing.state = "published"
  "c59aa487-32fc-4d00-b16c-e070f2c88943" = input.object.record.access-control.orgUnitId
}

Those above are example rules that are duplicated more than once.

Steps To Reproduce

Download the example policy files and Run opa run -s [folder of policy files]
Send sample JSON request to compile endpoint

Expected behaviour

The partial evaluation result should not contain duplicate rules. But right now you can locate duplicate rules in the partial evaluation result.

The text was updated successfully, but these errors were encountered:

srenatus · 2022-03-30T16:03:09Z

Thanks for the detailed report. 👀

I'm afraid I'm not sure I fully understand what you're referring to as duplicates here. The first block seems impossible, given that it contains both

input.object.record.dcat-dataset-strings

and

NOT input.object.record.dcat-dataset-strings

So this query can't ever be true. 🤔

Maybe you refer to those as duplicates?

input.object.record.publishing.state
input.object.record.publishing.state = "published"

Or am I completely on the wrong track here?

Generally when it comes to PE results, "minimality" isn't really the first concern. Do you worry about excluding impossible queries, or about making the queries simpler..? 🤔

t83714 · 2022-03-31T01:11:21Z

@srenatus Thanks a lot for your reply 👍

I am actually having some performance issues here. And my concerns probably cover both.

i.e.: the compile result should

neither include "impossible" rules
nor duplicate rules (i.e. exact same rule that appeared more than once in the response).
The sample response actually contains 432 incremental rule definition in name of "object.record.allow".
- And 128 (out of total 432) incremental rule definitions are duplicate ones. (I counted with a simple script by comparing restructured rule JSON)
- i.e. they have exact same rule body, same expressions, same default value, negated field value etc. But appeared more than once in the response.
The rest 304 incremental rule definitions in the same name of "object.record.allow" are still not "dry" enough as some of them are "impossible" rule definitions and should be simply discarded.
- By the way, the impossible rule body is from sample response line 1572 & line 1692
- The "impossible" incremental rule definition above appeared twice in the response.

srenatus · 2022-03-31T07:47:39Z

I've just now realised that there's everything I need to reproduce this in the first message 😅

I'll play some with this locally. Usually, I prefer opa eval -fsource -p over the compile endpoint's JSON response, but I see you've got a parser for that already. 😄 Will keep you posted.

srenatus · 2022-03-31T09:10:50Z

Let's see where the queries comes from... 🔍

Taking this (from object/record/allow.rego) for an example,

# delegated to draft dataset rules
allow {
    trace(input.object.allow)
    isDraftDataset

    [resourceType, operationType, resourceUriPrefix] := breakdownOperationUri(input.operationUri)
    inputOperationUri := concat("/", ["object", "dataset", "draft", operationType])
    # proxy to dataset related permission decision
    verifyDatasetPermission(inputOperationUri, "record")
}

I'm adding trace(input.object.allow) as the first expression of that rule body so that we can trace what it becomes in the output. It's useless, but since it's argument is what is unknown, it will remain in the PE result.

Now, running opa eval -fsource -d policies -p -i inp.json -u input.object data.entrypoint.allow we find these -- skipping everything that doesn't have our marker.

full output

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	not input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	not input.object.record["access-control"].ownerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"0e0ba07e-bda4-4adb-bca2-c44c3984ddb7" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"1b3736aa-fc09-439e-bb8e-e1b03551add9" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"277d01aa-a6b9-4ee8-8519-717e2f4da6d4" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"8fd9486c-5e07-438a-8025-4dc9b99221dd" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"c59aa487-32fc-4d00-b16c-e070f2c88943" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"d4f8eb99-af81-4629-8056-bb261d51f2ba" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"e440f319-304e-4ffe-9b93-2b40adfe5aba" = input.object.record["access-control"].ownerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	not input.object.record.publishing.state
	input.object.record.publishing.state = "draft"
	"ff603803-5b64-448f-bf90-a2c9bff5134e" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	not input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	not input.object.record["access-control"].ownerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"0e0ba07e-bda4-4adb-bca2-c44c3984ddb7" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"1b3736aa-fc09-439e-bb8e-e1b03551add9" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"277d01aa-a6b9-4ee8-8519-717e2f4da6d4" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"8fd9486c-5e07-438a-8025-4dc9b99221dd" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"c59aa487-32fc-4d00-b16c-e070f2c88943" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"d4f8eb99-af81-4629-8056-bb261d51f2ba" = input.object.record["access-control"].orgUnitOwnerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"e440f319-304e-4ffe-9b93-2b40adfe5aba" = input.object.record["access-control"].ownerId
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]
	input.object.record.publishing.state = "draft"
	input.object.record.publishing.state = "draft"
	"ff603803-5b64-448f-bf90-a2c9bff5134e" = input.object.record["access-control"].orgUnitOwnerId
}

Looking at the latest orgUnitOwnerId thing, we find two blocks for "ff603803-5b64-448f-bf90-a2c9bff5134e":

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"].       # isDraftDataset body 1
	not input.object.record.publishing.state. # isDraftDataset body 1
	input.object.record.publishing.state = "draft" # verifyDatasetPermission("object/dataset/draft/read", "record")
	"ff603803-5b64-448f-bf90-a2c9bff5134e" = input.object.record["access-control"].orgUnitOwnerId # verifyRecordPermission
}

allow {
	trace(input.object.allow)
	input.object.record["dataset-draft"]                # isDraftDataset body 2
	input.object.record.publishing.state = "draft" # isDraftDataset body 2
	input.object.record.publishing.state = "draft" # verifyDatasetPermission("object/dataset/draft/read", "record")
	"ff603803-5b64-448f-bf90-a2c9bff5134e" = input.object.record["access-control"].orgUnitOwnerId # verifyRecordPermission
}

I've annotated the expression with the rule body expression they're descendant from.

So, duplicate rules are silly, but getting rid of them should also be a straightforward improvement to the PE code. I'll make a note of that.

What's more interesting is the impossible case, i.e., the first rule body. PE doesn't keep track of what unknowns could be, so it won't be able to tell on its own that this is impossible. I wonder if by restructuring the rego code, we can tell it.

I think we get the number down by replacing

isDraftDataset {
    input.object.record["dataset-draft"]
    input.object.record.publishing.state = "draft"
}

isDraftDataset {
    input.object.record["dataset-draft"]
    not input.object.record.publishing.state
}

with

isDraftDataset {
    input.object.record["dataset-draft"]
    object.get(input.object.record, ["publishing", "state"], "draft") == "draft"
}

but that'll make the object.get call end up as part of the PE results. I'm not sure if this is something you'd like to avoid.

I'm assuming you're reading the PE results and convert that to something else? I'm asking because if you were to evaluate them with OPA again, the rule indexing feature would have an impact -- but if you don't use that, it's not worth exploring.

t83714 · 2022-04-01T07:44:15Z

@srenatus Thanks so much for looking into it.

Also appreciated the opa eval -fsource -p tips --- it's much easier to look into the compile result 👍

We actually use OPA to make decisions on data filtering. The policy will be partially evaluated with context data like user attributes etc. and "residual" rules (if any) will then be translated to either SQL or elasticsearch DSL (we store the same record in different "shapes" in both PostgreSQL & Elasticsearch for different purposes) to filter out the data that a user is not allowed to access.

Right now, our implementation works correctly for all our use cases in terms of functionality/access control. But I guess the performance issue might be a blocker for us as it seems to get worse when we assign more permissions to the users.

We have already implemented some logic to filter out duplicated rules from the compiled result in our codebases. For some requests, it reduces the AST parse / process time from 10s to 2s. I can tell we probably can still improve our parse further. However, as our parser is written in typescript, I guess the performance will be much better (and might be easier) if we can address some of the issues in OPA.

In fact, instead of changing the OPA PE core logic/processing pipelines to address the issues, I was wondering if it would be possible to add some "post-processing logic" to address those issues just before outputting the PE result? Those "post-processing logic" could be simply string-comparison based rule/expression filtering as we probably don't have to eliminate 100% duplicated rules / expression --- even solving 80% of cases will likely be a big improvement.

I re-ran the policy using opa eval -fsource -p and produced a much easier result here:

https://gist.github.com/t83714/69f4aadaa6f2191455f5357a9bfe0879

I think there are 3 issues that we could possibly address with "post-processing logic":

duplicated rules

The following rule appears twice at line 3849 and line 3856.

Before OPA outputs the PE result, is it possible to filter out rules with the same rule body by simple string (e.g. AST JSON) comparison?

allow {
	input.object.record["dcat-dataset-strings"]
	input.object.record.publishing.state = "published"
	input.object.record.publishing.state = "published"
	not input.object.record["access-control"].orgUnitId
}

duplicated expressions in rules

Still take the rule above for instance:

allow {
	input.object.record["dcat-dataset-strings"]
	input.object.record.publishing.state = "published"
	input.object.record.publishing.state = "published"
	not input.object.record["access-control"].orgUnitId
}

The expression input.object.record.publishing.state = "published" seem appeared twice (line 3851 & line 3852).

For all rules that OPA is about to output, could we possibly implement some simple "post-processing logic" to filter out the exact same expressions by simple string (e.g. AST JSON string) comparison?

"Impossible" rules

Thanks for your object.get suggestion. I guess it will solve some of the issues but as we need to translate AST into SQL or other DSL, we have to avoid having any built-in function left in the PE result.

Agree it's a tricky issue to solve at "PE evaluation engine"'s point of view.
However, rather than trying to 100% eliminate all possible cases, is that possible to implement "post-processing logic" to at least handle some obvious cases?

e.g. the rule below is found at line 3827:

allow {
	input.object.record["dcat-dataset-strings"]
	not input.object.record["dataset-draft"]
	not input.object.record.publishing.state = "draft"
	not input.object.record["dcat-dataset-strings"]
	not input.object.record.publishing.state = "published"
	input.object.record.publishing.state
	input.object.record.publishing.state = "published"
	"ff603803-5b64-448f-bf90-a2c9bff5134e" = input.object.record["access-control"].orgUnitId
}

The existence of expression not input.object.record.publishing.state = "published" and expression input.object.record.publishing.state = "published" indicates that the rule is an impossible rule that should be discarded.

could we possibly address this case by implementing a "post-processing logic" that:

for each of expression of the rule with negated = false, try to find if there is an exact same expression with negated = true. As long as we can find one expression, the rule will be discarded from OPA's PE result output.

This approach might not logically solve all cases. But I believe (and based on my tests) it will help most real-life cases.
Moreover, if we propose to solve the issues just before the PE result is outputted, I guess any improvement will likely not cause too much trouble to the core PE evaluation logic.

Thanks again for your time in looking into it.
Would be appreciated for feedback on any of those above.

srenatus · 2022-04-01T11:45:03Z

Some sort of post-processing logic in OPA's PE machinery would make sense, I think. Just need to figure out here the best place to put it would be 🤔

stale · 2022-05-01T18:10:20Z

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days.

…sult body Before, we'd have added expressions multiple times. Also, it was possible to generate rule bodies that would never work out because they carried both expr and not-expr for the same expr. Now, we're checking these two things before adding an expression to the result in copypropagation. Fixes open-policy-agent#4516 for simple cases. It's still the case that we may end up with impossible bodies, like not input.foo input.foo = "bar" but at least we won't have input.foo.bar not input.foo.bar Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>

srenatus · 2022-07-26T12:42:25Z

I've given this a go with some code on this branch and the number of allow rules went from 700ish to 113. Would you like to try it? 👉 https://github.com/srenatus/opa/tree/sr/copyprop/deduplicate-expressions

t83714 · 2022-07-28T02:57:07Z

@srenatus Thanks a lot for looking into it. I can confirm that I also got signification improvement on our latest policy files --- down from 284 to 112 rules (i.e. 60% improvement 🎉 ).

Here is how I did my test:

police files: Used my latest version policy files here
input.json: use this sample input file
command used: opa eval -fsource -d policies -p -i input.json -u input.object data.entrypoint.allow
test version: 0.39.0 vs build version of your branch

Here are the output files:

One thing I noticed --- although the result after fix achieved amazing result, it still contains some identical rules (e.g. first & second rules. Please see below). Not trying to be greedy 😄 . But would it be hard to filter out those identical rules as well?

Here is one example of duplicated rules (duplicated twice):

allow {
	input.object.record["dcat-dataset-strings"]
	not input.object.record.publishing.state
	not input.object.record["access-control"].orgUnitId
}

stale · 2022-08-30T22:38:49Z

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days.

stale · 2023-04-07T09:07:08Z

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days.

t83714 added the bug label Mar 30, 2022

srenatus added the waiting-for-response label Mar 30, 2022

srenatus removed the waiting-for-response label Mar 31, 2022

srenatus added the waiting-for-response label Mar 31, 2022

srenatus removed the waiting-for-response label Apr 1, 2022

srenatus added feature-request and removed bug labels Apr 1, 2022

stale bot added the inactive label May 1, 2022

stale bot removed the inactive label Jul 26, 2022

stale bot added the inactive label Aug 30, 2022

ashutosh-narkar added the partial-eval label Mar 8, 2023

stale bot removed the inactive label Mar 8, 2023

stale bot added the inactive label Apr 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial Evaluation produce duplicate rules #4516

Partial Evaluation produce duplicate rules #4516

t83714 commented Mar 30, 2022

srenatus commented Mar 30, 2022

t83714 commented Mar 31, 2022 •

edited

srenatus commented Mar 31, 2022

srenatus commented Mar 31, 2022

t83714 commented Apr 1, 2022

srenatus commented Apr 1, 2022

stale bot commented May 1, 2022

srenatus commented Jul 26, 2022

t83714 commented Jul 28, 2022 •

edited

stale bot commented Aug 30, 2022

stale bot commented Apr 7, 2023

Partial Evaluation produce duplicate rules #4516

Partial Evaluation produce duplicate rules #4516

Comments

t83714 commented Mar 30, 2022