
[RFC] Security Performance Test Suite #3903

Open
nibix opened this issue Dec 29, 2023 · 18 comments · May be fixed by #4282
Labels
enhancement New feature or request help wanted Community contributions are especially encouraged for these issues. performance Make it fast! triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.

Comments

@nibix
Contributor

nibix commented Dec 29, 2023

Background

Already now, the OpenSearch project systematically conducts benchmark test runs and publishes the results: https://opensearch.org/benchmarks/ This is based on the pre-defined benchmark-workloads which are executed using the tool opensearch-benchmark.

Additionally, there is ongoing work to define testing processes for performance related quality assurance: opensearch-project/OpenSearch#7499

These efforts, however, focus on the OpenSearch core; the efficiency of the code implementing data retrieval and indexing is tested. In some cases, there is a comparison between OpenSearch with security enabled vs OpenSearch without security enabled. While this can give a first idea of the overhead produced by the security plugin, the results cannot be generalized. This is because the performance overhead of the security plugin depends on a number of different factors. See below for more information on this.

Ideally, there would be an additional security-focused performance test suite. This test suite could be used for assuring that the security plugin does not suffer any performance regressions. Additionally, it could be used to determine the actual benefit of performance enhancing code changes (for example #3870).

The goal of this RFC is to collect requirements for such a test suite, learn about ongoing efforts, and find a way to integrate it into the software development process.

Performance influencing factors

The performance overhead of the security plugin depends on a number of different factors:

  • Number of indices on a cluster. The security plugin access controls involve resolving requested index patterns and index patterns in the role definitions to actual indices. For clusters with many indices, this can slow things down (see for example [RFC] Optimized Privilege Evaluation #3870 )

  • Number of roles of a user. While the example configurations usually demonstrate users with one or two roles, there are environments which are very liberal in assigning roles to users. Users with hundreds of roles do occur in practice. In order to determine the authorization status of a user, the security plugin needs to loop through every role of the user for each request (see the simplified sketch after this list).

  • Size of the user object. During inter-node communication, the security plugin needs to pass on the user object to the target node; for this, the user object needs to be serialized and deserialized. Larger user objects cause higher overhead during serialization and deserialization. Additionally, larger user objects put more stress onto the network.

  • Use of patterns vs constant index names in the role configuration.

  • Use of dynamic index patterns. The security plugin allows defining roles which use expressions to dynamically determine the indices for which privileges are granted. Additionally, date math can be used to dynamically grant privileges. Both cases add a certain overhead to the privilege evaluation.

  • Operations on aliases vs operations on indices. Privilege evaluation for operations on aliases requires an additional resolution of the aliases. This can have a significant impact if an alias has many member indices.

  • Use of DLS, FLS or field masking. If DLS, FLS or field masking is used, the security plugin hooks at a low level into the OpenSearch/Lucene operations. This is particularly hot code where even subtle details can have big performance impacts. Additionally, the current DLS/FLS implementation requires the serialization/deserialization of the whole DLS/FLS configuration. This can also have performance impacts for clusters with complex configurations.

  • Type of the OpenSearch operation. While the operation itself (such as bulk index, search or aggregate) should have little effect on the performance, there can be operation specific optimizations which then lead to performance characteristics that vary by operation. For example, the privilege evaluation code contains special cases to skip it on the top level for bulk index operations in favour of checks on the shard level.

  • TLS version, ciphers, implementation. TLS actually offers quite a broad range of settings and implementations which might have an effect on performance. As the security plugin adds TLS to the communication, it would be good to gain more insight into its performance effects.
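To illustrate why the number of roles, the number of index patterns per role, and the number of indices multiply up, here is a deliberately simplified sketch. It is not the actual PrivilegesEvaluator code; all names are made up for illustration:

```java
// Simplified sketch (hypothetical types, not the plugin's real code) of the resolution
// work hinted at above: for every request, each role of the user and each of its index
// patterns may need to be matched against all indices on the cluster.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RoleResolutionSketch {

    record Role(String name, List<String> indexPatterns) {}

    static List<String> resolveAllowedIndices(List<Role> userRoles, List<String> clusterIndices) {
        List<String> allowed = new ArrayList<>();
        for (Role role : userRoles) {                            // O(number of roles)
            for (String patternString : role.indexPatterns()) {  // O(patterns per role)
                Pattern pattern = Pattern.compile(patternString.replace("*", ".*"));
                for (String index : clusterIndices) {            // O(indices on the cluster)
                    if (pattern.matcher(index).matches()) {
                        allowed.add(index);
                    }
                }
            }
        }
        return allowed;
    }
}
```

A user with hundreds of roles on a cluster with thousands of indices multiplies this work on every request, which is the kind of situation #3870 targets.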

Proposal for test dimensions

Given the complex dependencies, a three-dimensional test matrix seems to be necessary. I would propose these dimensions:

  1. Operations: bulk index, search on specific index, search on *, search on alias, search with aggregation
  2. Number of indices: Cluster with 10 indices, cluster with 1000 indices
  3. Security configuration: Basic, large user object, utilizing DLS/FLS, dynamic index patterns, user with many roles

Note: Some combinations in this matrix might not make sense and can be skipped; for example, DLS/FLS makes indices read-only, so testing indexing does not make sense in that case. A sketch of how such a matrix could be enumerated follows below.
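For illustration, the matrix could be enumerated roughly like this. The names are made up; the real suite would map each scenario to an opensearch-benchmark workload or test fixture:

```java
// Hypothetical enumeration of the proposed three-dimensional test matrix, skipping
// combinations that do not make sense (such as bulk indexing with DLS/FLS).
import java.util.ArrayList;
import java.util.List;

class SecurityPerfMatrix {

    enum Operation { BULK_INDEX, SEARCH_SPECIFIC_INDEX, SEARCH_WILDCARD, SEARCH_ALIAS, SEARCH_AGGREGATION }
    enum IndexCount { INDICES_10, INDICES_1000 }
    enum SecurityConfig { BASIC, LARGE_USER_OBJECT, DLS_FLS, DYNAMIC_INDEX_PATTERNS, MANY_ROLES }

    record Scenario(Operation operation, IndexCount indexCount, SecurityConfig securityConfig) {}

    static List<Scenario> buildMatrix() {
        List<Scenario> scenarios = new ArrayList<>();
        for (Operation operation : Operation.values()) {
            for (IndexCount indexCount : IndexCount.values()) {
                for (SecurityConfig securityConfig : SecurityConfig.values()) {
                    if (operation == Operation.BULK_INDEX && securityConfig == SecurityConfig.DLS_FLS) {
                        continue; // DLS/FLS-protected indices are effectively read-only
                    }
                    scenarios.add(new Scenario(operation, indexCount, securityConfig));
                }
            }
        }
        return scenarios;
    }
}
```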

Questions:

  • Are there further aspects that should be covered?
  • Should a comparison of TLS versions, ciphers or implementations be also covered here?

Process definition

We need to define when and how the performance tests should be executed.

Factors which might influence the decision are:

  • A test run will take a significant amount of time. Thus, it might not be suitable to be included in a pull request CI run. Should there be an optional way to execute the test for a pull request?
  • A test run will require significant CPU resources. Is the GitHub Actions environment sufficient for (small) test runs? Is it necessary to spin up dedicated environments?
  • Shall there be a nightly test run?
  • How shall the results of the test runs be published?
  • How shall the results of the test runs be monitored?

Related work

There are a number of other ongoing efforts/proposals covering OpenSearch benchmarking. In some cases, it might make sense to coordinate with these efforts:

Question: Are there further efforts?

@nibix nibix added enhancement New feature or request untriaged Require the attention of the repository maintainers and may need to be prioritized labels Dec 29, 2023
@peternied
Member

Proposal for test dimensions

Good list of scenarios, as there are clear classes where the security plugin has undefined CPU/memory performance characteristics, e.g. O(n^2) or O(2^n), based on the number of indices. This would help identify where we can provide better instructions to existing clusters with best practices, and it also gives us a way to measure the impact of performance improvements over time.

Should a comparison of TLS versions, ciphers or implementations be also covered here?

I feel like this represents a different 'class' of test case - one which is different from correctness. I think of the correctness and perf areas as covering measures such as the number of indices or the number of roles assigned to a user. For TLS versions, the considerations depend on compliance with standards such as FIPS - they need a level of configurability.

I think it is a very useful area to test - but I'd break those out as a separate dimension, and I'm less concerned about getting results until we are looking to add additional configuration options and want to make sure they are in alignment.

A test run will take a significant amount of time. Thus, it might not be suitable to be included in a pull request CI run. Should there be an optional way to execute the test for a pull request?

I'd prefer we keep it simple and always run them; see ResourceFocusedTests.java and the associated GitHub workflow for an example of these 'constrained' tests.
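A minimal sketch of such a time-bounded, always-run check, assuming JUnit 5; the scenario body and the 30-second budget are placeholders, not values from the existing tests:

```java
// Sketch of a time-bounded check in the spirit of the "constrained" resource tests.
// The scenario body and the budget are placeholders.
import static org.junit.jupiter.api.Assertions.assertTimeoutPreemptively;

import java.time.Duration;
import org.junit.jupiter.api.Test;

class ConstrainedPerfSmokeTest {

    @Test
    void privilegeEvaluationScenarioStaysWithinBudget() {
        // The budget has to be generous enough to tolerate CI runner jitter,
        // otherwise the test becomes a source of false positives.
        assertTimeoutPreemptively(Duration.ofSeconds(30), this::runScenario);
    }

    private void runScenario() {
        // Placeholder: execute a fixed number of representative requests here.
    }
}
```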

What do you think of addressing specific performance-time issues, such as having a snapshot with the test data set that can be restored, rather than creating items via individual requests? Maybe the benchmarks folks can help with recommendations?

A test run will require significant CPU resources. Is the GitHub Actions environment sufficient for (small) test runs? Is it necessary to spin up dedicated environments?

I would be concerned about creating new infrastructure for this - while it would be possible, for example by using OpenSearch-ci, it becomes a management burden compared to the managed GitHub runners. Using that infrastructure would also create a bottleneck for testing and deploying new features or test areas in forks.

Note: I know this makes a trade-off where you have a hard time testing locally - I'd rather have PRs get this testing than rely on contributors/maintainers to remember to run the tests.

  • How shall the results of the test runs be published?
  • How shall the results of the test runs be monitored?

Not sure what is available in this space; I'd love to see proposals. What Codecov provides on pull requests is really nice - but I don't know what platforms are out there for these multi-dimensional perf metrics where you could see pull request vs main.

In a more bootstrapped way, we could use an LKG (last known good) build that runs the tests, then run them for the PR, and then have another step compare the results. This might also offer a way for these tests to be run locally via Gradle, triggered by a GHA.


Closing thoughts - Reliability

I believe the most important aspect of these tests is that it is clear what they are testing, and also clear what is wrong when they fail. If we have a failing perf test, it would be unfortunate for it to be ignored off-hand because its impact isn't clear. I'd rather have only a few highly trusted test cases than a bunch of edge-case coverage with high variation in results and reliability.

@peternied
Member

Have you seen github-action-benchmark? Maybe this would be a good approach to tackle the data collection / analysis.

@nibix
Contributor Author

nibix commented Jan 4, 2024

@peternied

I feel like this represents a different 'class' of test case - one which is different from correctness.

Good point! This helps define the scope of the issue.

I'd prefer we keep it simple and always run them

Do you have an intuition about the maximum duration the performance tests should take then?

Have you seen github-action-benchmark?

Good idea! Will have a look.

@peternied
Member

Do you have an intuition about the maximum duration the performance tests should take then?

@nibix How about we start with a max of 20 minutes and work backwards from there? I pulled that number from the current max PR check times.

@peternied peternied removed the untriaged Require the attention of the repository maintainers and may need to be prioritized label Jan 5, 2024
@nibix
Contributor Author

nibix commented Jan 18, 2024

@peternied

Sorry for the delay, but I can now follow up with some further responses and items to discuss.

I would be concerned about creating new infrastructure for this - while it would be possible, for example by using OpenSearch-ci, it becomes a management burden compared to the managed GitHub runners. Using that infrastructure would also create a bottleneck for testing and deploying new features or test areas in forks.

I can totally follow your reasoning here. Yet, there's one further problem we have to consider, which also plays into the trade-offs for the infrastructure question.

It is about the variance of benchmark results we will have to expect. If you look at the existing OpenSearch benchmarks, you can see for some operations a variation of up to 18% between runs:

[Screenshot: published OpenSearch benchmark results showing a variation of up to 18% between runs]

This poses a problem if one wants to use the benchmark results for automatically verifying the absence of performance regressions in pull requests. This is because we cannot just test for "equal or better" performance, but have to add some tolerance interval. However:

  • If the tolerance interval is too narrow, we will get too many false positives. Frequent false positives will lead developers to ignore the tests altogether.
  • If the tolerance interval is too wide, we will only notice dramatic performance regressions. However, performance regressions are often subtle and accumulate over time.

The Benchmark GitHub action will by default trigger an error only if the performance regressed by a factor of 2: https://github.com/benchmark-action/github-action-benchmark#alert-comment-on-commit-page

To be honest, I have some doubts whether this provides sufficient control.
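To make the tolerance question concrete, a regression gate boils down to something like the following sketch; the numbers are purely illustrative, and choosing the threshold is the actual problem:

```java
// Illustrative regression gate: flags a candidate result as a regression if it falls
// below the baseline by more than a configurable tolerance. The values in main() are
// made up and only demonstrate the trade-off discussed above.
class RegressionGate {

    static boolean isRegression(double baselineOpsPerSec, double candidateOpsPerSec, double tolerancePct) {
        double allowedMinimum = baselineOpsPerSec * (1.0 - tolerancePct / 100.0);
        return candidateOpsPerSec < allowedMinimum;
    }

    public static void main(String[] args) {
        // With ~18% run-to-run jitter, a 10% tolerance produces false positives;
        // with a 100% (factor 2) tolerance, realistic regressions pass unnoticed.
        System.out.println(isRegression(1000.0, 900.0, 10.0));  // false: within tolerance
        System.out.println(isRegression(1000.0, 850.0, 10.0));  // true: flagged
        System.out.println(isRegression(1000.0, 600.0, 100.0)); // false: a 40% regression slips through
    }
}
```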

Let's consider what factors influence the magnitude of the variance:

  • A system where the resources are guaranteed to stay the same will have lower variance. The currently published OpenSearch benchmarks always run on the same resources. However, the environment provided by GitHub Actions gives fewer guarantees about the resources; thus, a higher variance can be expected. The github-action-benchmark project publishes some example micro benchmarks which are run in GitHub Actions environments. A variance of up to 55% can be seen there: https://benchmark-action.github.io/github-action-benchmark/dev/bench/
  • To some extent, this could be alleviated by always re-running the benchmarks for the LKG in the same environment instead of using old benchmark results. However, we also need to expect some variance there. Additionally, this will double the required run time of the tests.
  • Another factor is the complexity of the system. A multi-node OpenSearch cluster is a very complex system, with many factors that can cause variations in performance. The variation would be lower if we instead ran micro (or semi-micro) benchmarks on relevant code sections. This is actually something I did not think about before, so I'll open a new section to discuss it in detail:

Cluster benchmarks vs micro benchmarks

Originally, I was thinking about running benchmarks using "real" clusters. This is the "natural" way to benchmark OpenSearch; there is a dedicated tool to perform such tests: opensearch-benchmark.

However, we see in the discussion that there might be some requirements which are possibly not compatible with this approach. One of these requirements is the ability to provide trustworthy benchmark results in the context of a pull request.

Another approach is micro benchmarks, which focus on a particular code section. The relationship between these two approaches is similar to the relationship between integration tests and unit tests.

Let's compare the approaches a bit more!

Cluster level benchmarks

  • Advantages:
    • Give you the whole picture.
    • All code parts are automatically covered.
    • The numbers can be easily communicated: Version A of OpenSearch is 20% faster than version B.
    • The opensearch-benchmark can be utilized.
  • Disadvantages:
    • Takes significant resources.
    • Higher variance in test results.
    • Helps identify the mere presence of regressions; however, the cause of a regression will be difficult to find.

Micro benchmarks

  • Advantages:
    • More stable test results.
    • Environments can be mocked; this reduces the need for creating test data.
    • Reported regressions are already focused on certain code parts
  • Disadvantages:
    • This can only cover selected parts of the code. Full coverage is not possible.
    • It is hard to tell what consequences performance changes will have in practice, on a real cluster.

Discussion

  • Both approaches are justified for certain reasons.
  • Possibly both approaches should be utilized?
  • Are there other aspects which should be considered?

@nibix
Contributor Author

nibix commented Jan 18, 2024

@peternied

What do you think of addressing specific performance-time issues, such as having a snapshot with the test data set that can be restored, rather than creating items via individual requests? Maybe the benchmarks folks can help with recommendations?

At the moment, I am not too concerned about the initialization of test data. Most test cases will only depend on the number of indices, not on the number of documents. Even if we need tens of thousands of indices, it should be possible to create these quickly. An exception might be DLS/FLS, but even there it is debatable whether a significant number of documents is required.

Still, having the benchmarks folks' opinion on the whole topic would of course be extremely helpful :-)

@nibix
Contributor Author

nibix commented Jan 18, 2024

BTW, to increase visibility, it might make sense to tag this with the performance label.

@peternied peternied added performance Make it fast! help wanted Community contributions are especially encouraged for these issues. labels Jan 19, 2024
@peternied
Member

Cluster benchmarks vs micro benchmarks ...

I like this thought process quite a bit - it satisfies my desire for test reliability.

I suppose where I'm curious is what 'micro' encompasses. For example, I think the 'base' test case for the privilege evaluator's 'allowed' response is something that we should capture. Is that already too big? What kind of examples did you have in mind?

@nibix
Contributor Author

nibix commented Jan 22, 2024

@peternied

I suppose where I'm curious is what 'micro' encompasses. For example, I think the 'base' test case for the privilege evaluator's 'allowed' response is something that we should capture.

I would think the same. Roughly the call to PrivilegesEvaluator.evaluate(). However, this only captures one aspect of #3870. I would think the proposed improvements for DLS/FLS cannot be completely covered by such a micro benchmark, as they are more of a systemic improvement which is not localized in a single method.
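A rough JMH sketch of what such a micro benchmark could look like; the fixture is a hypothetical placeholder, since constructing a PrivilegesEvaluator with a realistic configuration (roles, action groups, index metadata) is the actual work and is not shown here:

```java
// Rough JMH sketch of a privilege evaluation micro benchmark. EvaluationFixture is a
// hypothetical placeholder; a real benchmark would build a configured PrivilegesEvaluator
// plus a representative request and user context instead.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class PrivilegesEvaluationBenchmark {

    @Param({ "10", "1000" })
    int numberOfIndices;

    @Param({ "2", "200" })
    int numberOfRoles;

    private EvaluationFixture fixture;

    @Setup
    public void setUp() {
        fixture = EvaluationFixture.create(numberOfIndices, numberOfRoles);
    }

    @Benchmark
    public Object evaluateSearchRequest() {
        // Roughly corresponds to a single PrivilegesEvaluator.evaluate() call.
        return fixture.evaluate();
    }

    // Placeholder so the sketch compiles; not part of the actual plugin.
    static final class EvaluationFixture {
        private final int indices;
        private final int roles;

        private EvaluationFixture(int indices, int roles) {
            this.indices = indices;
            this.roles = roles;
        }

        static EvaluationFixture create(int indices, int roles) {
            return new EvaluationFixture(indices, roles);
        }

        Object evaluate() {
            return indices * roles; // stand-in for the real evaluation result
        }
    }
}
```

An A/B comparison against main would then run the same benchmark on both branches and compare the reported averages.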

@rishabh6788

Great proposal @nibix. The biggest pain point of adding any new workload is sourcing the data, prepping it, and then integrating it with opensearch-benchmark, but if we are not too keen on indexing performance, we can quickly get started by using a snapshot from an existing cluster that caters to our requirements.
In parallel, we can work on generating the corpus to test security performance on the indexing path.

@scrawfor99
Collaborator

[Triage] Thanks for filing this issue @nibix. Since this is an RFC, we can mark it as triaged to encourage more engagement and next steps for moving forward.

@scrawfor99 scrawfor99 added the triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. label Jan 22, 2024
@peternied
Member

Just to clarify what I'd expect of a performance test: I don't think we can reliably calibrate tests on microsecond/millisecond scales. There are far too many low-level system variables that would be hard to reproduce without dedicated hardware, which I'm not in favor of.

Tests that I do think would be viable without environment constraints:

  • A/B testing, with scenarios that run 100+ iterations in <10 seconds, detecting regressions greater than X% to lower the false positive rate.
  • Runtime linearity validation: run a scenario X times with N documents (or whatever the scaling factor is), then repeat that scenario with 10*N and ensure the ratio of execution times is consistent with the higher number (see the sketch below).
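A minimal sketch of the linearity idea; the dummy workload and the 30% slack are illustrative assumptions, not a proposed implementation:

```java
// Sketch of the linearity check: time a scenario with N work items and with 10*N,
// then verify the runtime ratio stays near the expected factor of 10.
import java.util.function.IntConsumer;

class LinearityCheck {

    static boolean scalesLinearly(IntConsumer scenario, int n, double slack) {
        long smallNanos = time(scenario, n);
        long largeNanos = time(scenario, 10 * n);
        double ratio = (double) largeNanos / (double) smallNanos;
        return ratio <= 10.0 * (1.0 + slack);
    }

    private static long time(IntConsumer scenario, int workItems) {
        long start = System.nanoTime();
        scenario.accept(workItems);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Dummy linear workload standing in for a real security scenario.
        IntConsumer workload = items -> {
            long sum = 0;
            for (int i = 0; i < items; i++) {
                sum += i;
            }
            if (sum < 0) {
                System.out.println(sum); // keep the loop from being optimized away
            }
        };
        System.out.println(scalesLinearly(workload, 1_000_000, 0.3));
    }
}
```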

There might be ways to do more/better but I think we can establish a baseline performance and detect regressions as we build features over time. What do you think @nibix?

@nibix
Contributor Author

nibix commented Feb 12, 2024

Generally, that makes sense. However, I am still a bit skeptical that the jitter induced by the environment will allow detecting regressions at a realistic scale. If the tolerance X% is too high (like 200% which is used by default by github-action-benchmark), the benchmarks will have no practical benefit and might actually give a false sense of security. Minor regressions will just go unnoticed and accumulate over time.

In the end, doing the benchmarks as part of pull requests has the downside that you can only consider a single A/B test. This won't give you a sense of the present jitter and also won't detect accumulated minor regressions. On the other hand, graphs similar to the ones at https://opensearch.org/benchmarks/ give you these insights - while lacking automatic actions upon regressions.

But, in the end, I guess we just have to try things out. So, I would now propose the following: let's create a prototype which:

  • Runs a specialized opensearch-benchmark track in a GitHub Actions context. This does a tournament (aka A/B test) between main and the current branch.
  • Runs a number of micro benchmarks in a GitHub Actions context. These micro benchmarks shall focus on privilege evaluation for now. These also A/B test against the main branch.

We would need to collect some data points with this prototype to see whether the results are reliable enough to be used for automatic regression testing. If we get confident that this is suitable, we can implement these checks in the CI. If not, we would have to reconsider what to do with the data.

@peternied
Member

If we get confident that this is suitable, we can implement these checks in the CI.

👍

If the tolerance X% is too high (like 200% which is used by default by github-action-benchmark), the benchmarks will have no practical benefit

Yeah - this is definitely a concern in my mind as well. Given that we currently have no capacity to detect even a 200% regression, I'd say that even at a minimal level this is a great improvement. Beyond detecting large regressions, when prompted investigations [1] [2] arrive, having the framework in place would help provide a starting point for measurement and validation in a developer's environment.

@abraakie abraakie linked a pull request Apr 23, 2024 that will close this issue
@cwperks
Member

cwperks commented May 13, 2024

@nibix Thank you for the detailed RFC! I like the idea of adding performance tests in this repo that can help to establish a baseline for the "cost" of common (security) plugin operations under different scenarios. For instance, this user reports that there are CPU spikes when there is a config update in a cluster with 300 users: #4008.

A performance test suite can help ensure that performance issues are properly mitigated and help us understand how performance degrades or improves as changes are merged in. This will be especially helpful when assessing the performance gains from the related RFC: #3903

I think other plugins could benefit from this test suite as well.

@nibix
Contributor Author

nibix commented May 15, 2024

@cwperks #4008 is an interesting issue, indeed.

Generally, I think performance needs to be tackled on several dimensions or levels.

In this RFC, we have already concluded that we need two separate approaches:

  • The overall cluster benchmarks
  • Micro benchmarks

Additionally, the following just crossed my mind while looking at #4008. Just thinking out loud:

To facilitate the diagnosis of things like #4008, it might be useful to have built-in instrumentation to gain performance metrics on production systems. The security plugin in particular often needs to talk to external systems like LDAP servers, which can quickly have a big impact on cluster performance. Thus, having a low-threshold diagnosis system would be beneficial. A "related" project 😉 has this: https://git.floragunn.com/search-guard/search-guard-suite-enterprise/-/merge_requests/200
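As a very rough illustration of such low-threshold instrumentation (all names are made up and this does not mirror the linked implementation), latency around external calls could be recorded like this:

```java
// Very rough sketch of built-in latency instrumentation around calls to external
// systems such as LDAP. All names are illustrative and hypothetical.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

class AuthBackendMetrics {

    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> invocations = new ConcurrentHashMap<>();

    // Wraps a call to an authentication/authorization backend and records its latency.
    <T> T timed(String backend, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            totalNanos.computeIfAbsent(backend, k -> new LongAdder()).add(System.nanoTime() - start);
            invocations.computeIfAbsent(backend, k -> new LongAdder()).increment();
        }
    }

    // Average latency in milliseconds; could be exposed via a diagnostics endpoint.
    double averageMillis(String backend) {
        long calls = invocations.getOrDefault(backend, new LongAdder()).sum();
        return calls == 0 ? 0.0 : totalNanos.get(backend).sum() / (calls * 1_000_000.0);
    }
}
```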

@DarshitChanpura
Member

@nibix Agreed on making this a two-part approach. What would be the plan going forward, based on all the discussions so far?

@nibix
Contributor Author

nibix commented May 16, 2024

@DarshitChanpura

Agreed on making this a two-part approach. What would be the plan going forward, based on all the discussions so far?

There's already a draft PR for the test suite, but a bit of work is still necessary. The micro benchmarks for privilege evaluation should come with PR #3870, which should also be available very soon.
