Support faster loading of dependency-free bundles #6166

Open
charlieegan3 opened this issue Aug 17, 2023 · 8 comments
Comments

@charlieegan3
Contributor

charlieegan3 commented Aug 17, 2023

In highly-scaled, multi-tenancy deployments of OPA, it's possible that operators might be loading many bundles into OPA containing policy from a large number of end users (e.g. O(5000)).

This poses challenges to the performance of OPA today, as we compile new modules from bundles alongside the existing modules in the store. This is done in order to check for function references and path collisions (and perhaps other factors). (Related: #3841). In theory, some of these checks could be skipped if we knew the bundle was self-contained.

We can see this code here:

opa/bundle/store.go

Lines 765 to 773 in f05ebba

// preserve any modules already on the compiler
for name, module := range compiler.Modules {
	modules[name] = module
}
// preserve any modules passed in from the store
for name, module := range extraModules {
	modules[name] = module
}
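
To illustrate why this merging gets more expensive as bundles accumulate, here is a minimal counting sketch. It is not OPA code and not a benchmark: it assumes bundles are activated one at a time, that every bundle holds the same number of modules, and that each activation recompiles the new modules together with everything already loaded. The function name `modulesCompiled` is made up for illustration.

```go
package main

import "fmt"

// modulesCompiled counts how many modules pass through the compiler in total
// when bundles are activated one at a time and every activation recompiles
// the new bundle's modules together with all modules loaded before it.
// Assumption: every bundle holds exactly modulesPerBundle modules.
func modulesCompiled(bundles, modulesPerBundle int) int {
	total := 0
	loaded := 0
	for i := 0; i < bundles; i++ {
		loaded += modulesPerBundle
		total += loaded // previously loaded modules plus the new ones
	}
	return total
}

func main() {
	for _, n := range []int{1, 500, 1000} {
		fmt.Printf("%d bundles -> %d module compilations\n", n, modulesCompiled(n, 1))
	}
}
```

Under these assumptions the total work grows quadratically with the number of bundles. The model only counts modules, so it should not be expected to reproduce measured slowdowns.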

Aside, this also applies to the REST API:

opa/server/server.go

Lines 2136 to 2142 in abb6cf2

modules, err := s.loadModules(ctx, txn)
if err != nil {
	s.abortAuto(ctx, txn, w, err)
	return
}
modules[id] = parsedMod

I have done some experiments and found that loading with 500 modules is around 50x slower, and with 1000 bundles around 130x slower, than a single-bundle baseline measurement.

One solution might be to allow bundles to be labelled as not needing other modules, so that they can be compiled and written to the store on their own. Another might be to make this the default behaviour for a bundle that uses dependency management but declares no dependencies (#3371), once that functionality is available.
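
A rough sketch of what the labelling idea could look like, purely as an assumption: the `Bundle` type, the `SelfContained` field and the `compileSet` function below are hypothetical and do not exist in OPA. The sketch only shows the decision of which modules get handed to the compiler.

```go
package main

import "fmt"

// Bundle is a simplified, hypothetical stand-in for an OPA bundle.
// SelfContained is an assumed label, not an existing OPA manifest field.
type Bundle struct {
	Name          string
	SelfContained bool
	Modules       map[string]string // module path -> source
}

// compileSet returns the modules that would be handed to the compiler when
// activating b. Modules already in the store are only included when the
// bundle is not labelled as self-contained.
func compileSet(b Bundle, existing map[string]string) map[string]string {
	set := make(map[string]string, len(b.Modules))
	if !b.SelfContained {
		for name, src := range existing {
			set[name] = src
		}
	}
	for name, src := range b.Modules {
		set[name] = src
	}
	return set
}

func main() {
	existing := map[string]string{
		"tenant1/policy.rego": "package tenant1",
		"tenant2/policy.rego": "package tenant2",
	}
	b := Bundle{
		Name:          "tenant3",
		SelfContained: true,
		Modules:       map[string]string{"tenant3/policy.rego": "package tenant3"},
	}
	fmt.Println("self-contained:", len(compileSet(b, existing)), "module(s) to compile")
	b.SelfContained = false
	fmt.Println("default:", len(compileSet(b, existing)), "module(s) to compile")
}
```

Note that leaving existing modules out of the compile set would also skip any cross-bundle checks, such as path collisions, so a real implementation would need another way to catch those.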

In the meantime, users could consider sharding OPA instances or using larger, aggregated bundles.

@hpvd

hpvd commented Aug 18, 2023

+1 on this

and the direct link to OPA Dependency Manager (ODM)
NOTE: This is an experimental project not officially supported by the OPA team or Styra. (today)
https://github.com/johanfylling/opa-dependency-manager/

(taken from #3371 mentioned above)

@hpvd

hpvd commented Aug 18, 2023

Depending on where all components are located, compression may also have an impact on speed...
e.g. zstd is very fast at decompression (about 3-5 times faster than zlib) and also good at compression ratio and compression speed; see e.g. http://facebook.github.io/zstd/#benchmarks

@deezkay

deezkay commented Aug 18, 2023

Just to add some context from a real world customer of OPA.

I was the original customer that raised this as an issue.

Our use case is that we want to use a bundle per tenant (customer), as each customer has a discrete, independent Rego policy plus data lists which they change infrequently. We have 5000+ customers.
Logically, whenever a customer changes their policy we want to rebuild the bundle for just that customer and push it to S3 for OPA to pick up.
This is the design pattern we selected to go with. However, when we first came to preload OPA with all the customers' policy bundles, we observed that as more bundles were loaded the overall load time got longer and longer.

Doing further analysis of the performance metrics OPA provides, we could see that as each bundle is loaded the compile time increases.
Digging a little deeper by adding some debug timings into the source code in the Compile() function in ast/compile.go, I could see the runtime of this function increase as more and more bundles are loaded into OPA.
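
For anyone wanting to repeat this kind of measurement, a generic timing wrapper along these lines is one way to do it. This is a sketch only, not the instrumentation that was actually added to ast/compile.go; the `timed` helper is a made-up name and the sleep stands in for the call being measured.

```go
package main

import (
	"log"
	"time"
)

// timed runs fn, logs how long it took under the given label, and returns
// the elapsed duration so callers can compare runs.
func timed(label string, fn func()) time.Duration {
	start := time.Now()
	fn()
	elapsed := time.Since(start)
	log.Printf("%s took %s", label, elapsed)
	return elapsed
}

func main() {
	// The sleep is a placeholder for the call being measured.
	timed("compile (simulated)", func() { time.Sleep(5 * time.Millisecond) })
}
```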

As I say, our bundles are per tenant (customer) and are completely independent of each other. They have no inter-dependencies, so any collision detection or dependency checking across bundles, which may be the cause of the slowdown, is unnecessary in our use case. I wonder if we could have a flag to disable such functionality if that is the root cause.

At the moment this is a blocker for use. At present we are having to load all customers into a single large bundle and rebuild that bundle every time a single customer makes a policy change.

@anderseknert
Member

Hi @deezkay and @hpvd! And thanks for raising this 👍

It's an interesting use case for sure, although not really one that OPA is currently built to deal with, at least not in the scope of a single instance. Setting the performance issue aside, there aren't any guarantees, or even attempts made, to isolate policy or data between "tenants", as OPA never considered the bundle model for the purpose of multi-tenancy. You could easily have one tenant (i.e. tenantX) referencing policy or data from another (i.e. data.tenantY). I suppose you could perform some analysis at compile-time to try and ensure that no policy makes references to policy or data outside of its allowed domain, but it's a solution that would likely be brittle. The same goes for resource consumption: as there's no way to isolate "tenants", there's also no way to provision or limit resources per "tenant". While most policy is likely well-behaved, it's not unthinkable, or even unlikely in a set of 5000 tenants, that a few policies are written in a way that consumes excessive resources, which could have an impact on all tenants.
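
As a rough illustration of the compile-time analysis mentioned above, the following sketch flags references that leave a tenant's own subtree of `data`. It assumes the references have already been extracted from the policy as plain dotted strings; the function name `outOfDomainRefs` is hypothetical and nothing here uses OPA's own AST.

```go
package main

import (
	"fmt"
	"strings"
)

// outOfDomainRefs returns the refs that point into the data document but
// outside data.<tenant>. Assumption: refs are plain dotted strings such as
// "data.tenantX.allow", already extracted from the policy.
func outOfDomainRefs(tenant string, refs []string) []string {
	root := "data." + tenant
	var bad []string
	for _, ref := range refs {
		if ref != "data" && !strings.HasPrefix(ref, "data.") {
			continue // e.g. input.user, not a reference into data
		}
		if ref == root || strings.HasPrefix(ref, root+".") {
			continue // inside the tenant's own subtree
		}
		bad = append(bad, ref)
	}
	return bad
}

func main() {
	refs := []string{"data.tenantX.allow", "input.user", "data.tenantY.users"}
	fmt.Println(outOfDomainRefs("tenantX", refs))
}
```

A string check like this cannot see references built dynamically at evaluation time (for example indexing into `data` with a variable), which is one reason such an approach is likely to be brittle.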

Not to mention — would you actually have OPA poll 5000 different remote endpoints for bundle updates? 😅 You'd face some challenges we haven't really accounted for.

So while I think the use case is valid, I don't think even solving the performance issue reported here would get us anywhere near a scenario where I'd be comfortable having a single (or a few single) instance(s) try and meet these requirements.

Given that all tenants run independently from each other anyway, what do you see as the benefit of having just a single OPA serve all of them? A single OPA per tenant would be ideal for the purpose of isolation, obviously, but even some scheme for partitioning, say 10 tenants per OPA, would go a long way to help solve the problems outlined here. OPA was ultimately built to be a distributed component, and it's not uncommon that organizations run hundreds or thousands of instances inside of their clusters.
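
One possible partitioning scheme, sketched under assumptions: tenants are identified by a string ID and are assigned to a fixed number of OPA instances by hashing that ID. The names `shardCount` and `shardFor` are made up for illustration, and the figure of 10 tenants per instance is taken from the example above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardCount returns how many OPA instances are needed to keep at most
// tenantsPerShard tenants on each instance, on average.
func shardCount(tenants, tenantsPerShard int) int {
	return (tenants + tenantsPerShard - 1) / tenantsPerShard
}

// shardFor maps a tenant ID to a shard index in [0, shards) using FNV-1a.
// shards must be greater than zero.
func shardFor(tenantID string, shards int) int {
	h := fnv.New32a()
	h.Write([]byte(tenantID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(shards))
}

func main() {
	shards := shardCount(5000, 10)
	fmt.Println("instances needed:", shards)
	fmt.Println("tenant-42 is served by shard", shardFor("tenant-42", shards))
}
```

Hashing only balances tenants approximately, and plain modulo assignment moves most tenants to a different shard when the shard count changes, so a lookup table or consistent hashing may be a better fit in practice.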

@hpvd

hpvd commented Aug 18, 2023

@deezkay thanks for details and sharing state of investigation!
you have got a great test case with 5000+ possibly small parts to be processed :-)
and thanks to @anderseknert for jumping in.

I was looking at the same problem but from another perspective:
If you are using OPA in a really intensive way for nearly all use cases advertised in the main image at https://www.openpolicyagent.org/
where of course every part has many more rules and is backed by more data than from your 5000 customers
-> shouldn't the same performance questions arise?

@anderseknert
Member

Not really, as you'd probably have OPAs running all over the place.

  • Infrastructure policy? OPA instances started directly in the CI/CD pipeline, likely using opa eval or opa exec.
  • Kubernetes admission control? Several OPA instances deployed as daemon sets, or whatnot.
  • Application authorization? One OPA per app instance running as a sidecar inside of each pod deployed for the app.

...and so on. The distributed model comes with its own set of challenges, that's for sure — but so does the "one large instance" model, and OPA has ultimately been built primarily to solve the challenges of the former. That doesn't mean it can't be used in other configurations, but as I've tried to elaborate on, it's likely going to come with a whole lot of challenges, many of which we haven't even thought about.

@hpvd

hpvd commented Aug 18, 2023

jep sure you can split it. The reason why we are thinking about the all in one thing is "continuous audit readiness".
Going in this direction, everything should/need to be as easy and transparent as possible for (external) auditors. Having everything in one place looks from this perspective like a good idea...

@deezkay

deezkay commented Aug 18, 2023

Thanks @anderseknert and @hpvd for your input and feedback.

I totally agree our use case isn’t typically what OPA may be used for – but we have been impressed by OPA as a pure policy engine and how easy it is to implement new rules to support our application use case.
We also like the bundle approach, just pushing bundles to S3 in our use case and each OPA cluster picking up the changes.
As we have a number of geographically dispersed OPA clusters, this is a really clean approach: unlike in a push model, we don't have to track which OPA clusters have been updated with the latest version.

To provide some more context, we are using OPA as a policy engine to evaluate our custom application policy and return an action to be taken.
We have full control over the Rego that is generated, as it is code generated. Our customers create and modify their application policy through our application UI, and it is stored in a relational database. We then have a Python application that translates the policy rules into Rego.
As a result we can guarantee that each customer’s policy bundle is autonomous with no inter-dependencies or overlapping with other bundles.

Regardless of the fact we do have 5000+ customers, and yes we are fully aware we may need to shard the policy across numerous OPA clusters to get the performance we need, the fundamental question we are trying to answer is why “The current compiler implementation re-runs all stages on all modules each time the compiler is invoked”.
This is the question posed by an optimization thread from 2020, #2282.
A fix for this would benefit any multi-bundle implementation even if the number of bundles is 100 or less.

ashutosh-narkar added this to Backlog in Open Policy Agent via automation Aug 21, 2023
ashutosh-narkar removed this from Backlog in Open Policy Agent Oct 3, 2023
ashutosh-narkar added this to Nice To Have in Open Policy Agent v1.0 Oct 3, 2023

5 participants