
[PROPOSAL] Anomaly Detection HCAD Performance Measurement #652

Open
dbwiddis opened this issue Apr 6, 2023 · 13 comments
Labels: discuss, documentation (Improvements or additions to documentation)

dbwiddis (Member) commented Apr 6, 2023

What/Why

What are you proposing?

As part of SDK development, we have been focusing on migrating the Anomaly Detection plugin to an extension. This has proved beneficial in highlighting many features needed in the SDK. However, we have not yet addressed the performance aspect of this migration. This issue outlines the plan to measure this performance.

Many API calls are used infrequently or incur operational costs far exceeding the additional time for transport. The "hot path" feature needed to demonstrate scalable, stable, and performant operation is high-cardinality anomaly detection (HCAD).

What users have asked for this feature?

The Extensions team has a goal of "same or better performance" in order to demonstrate the benefits of Extensions. Users, in general, want scalable, stable performance at the least possible cost, and we will demonstrate the ability to deliver that via extensions.

What problems are you trying to solve?

This blog post identifies the performance bar to meet: one million unique entities per minute. This issue outlines a plan to measure and demonstrate same-or-better performance against this metric.

What will it take to execute?

Baseline

The blog post specifies the cluster size used to achieve its results: a 39-node r5.2xlarge cluster (3 primary and 36 data nodes). The measurement identified JVM memory pressure (percentage of heap used by HCAD) as the limiting factor, and the cluster was sized accordingly. Specifically, overall performance metrics during the HCAD measurement were:

  • JVM Memory Pressure between 23-39% (15-25 GiB)
  • 1% CPU, except for an hourly spike to 65% for maintenance tasks
  • Disk I/O not limiting (HCAD uses memory to queue as needed to avoid impacts)
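A quick sanity check on those figures (a rough sketch; it assumes the pressure percentage is reported against the 64 GiB of physical memory on an r5.2xlarge, which the blog post does not state explicitly):

```python
# Rough sanity check of the baseline JVM memory pressure figures.
# Assumption: the 23-39% pressure range is measured against the 64 GiB
# of physical memory on an r5.2xlarge data node.
node_memory_gib = 64
for pressure in (0.23, 0.39):
    print(f"{pressure:.0%} of {node_memory_gib} GiB = {pressure * node_memory_gib:.1f} GiB")
# 23% of 64 GiB = 14.7 GiB
# 39% of 64 GiB = 25.0 GiB  (matches the quoted 15-25 GiB range)
```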

How Extensions can perform better

An OpenSearch cluster generally uses the same size server on each data node. Sizing domains is not trivial, and varies based on the amount of data, number of replicas, indexing overhead, and other factors.

One advantage of separating extensions from the OpenSearch server is independent scalability. The entire cluster does not need to be upgraded to memory-optimized servers; only the extension node(s) do. We expect that running HCAD models on dedicated server(s), with their own thread and resource management, can provide better overall performance than orchestrating these calculations across a distributed cluster.

We will run two different tests to show "same or better":

  1. Same number and type of servers; better performance
  2. Lower server cost; same performance

Same servers, same or better performance

Reversing the blog post's 1.2 sizing factor for servers indicates that only 30 data nodes are needed for this cluster without HCAD. We will reconfigure the cluster with 30 data nodes and use the remaining 6 servers for extension work. We will demonstrate the ability to match or exceed the baseline test metrics.

Note: As we have not yet implemented extension scalability, we will replace the 6 r5.2xlarge "extension servers" with a single r5.12xlarge, which has exactly 6 times the resources and exactly 6 times the cost, and run the Anomaly Detection extension on this node.
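For reference, the node arithmetic behind this substitution (a sketch; the instance figures are the published EC2 specs for the r5 family, not numbers from the blog post):

```python
# Sketch of the "same servers" arithmetic: reverse the 1.2 sizing factor,
# then fold the freed capacity into one larger extension node.
baseline_data_nodes = 36
sizing_factor = 1.2                                   # headroom factor from the blog post
data_nodes_without_hcad = round(baseline_data_nodes / sizing_factor)
print(data_nodes_without_hcad)                        # 30

# Published EC2 specs (not from the blog post): r5.2xlarge vs. r5.12xlarge.
r5_2xlarge = {"vcpu": 8, "memory_gib": 64}
r5_12xlarge = {"vcpu": 48, "memory_gib": 384}
print(r5_12xlarge["vcpu"] / r5_2xlarge["vcpu"],       # 6.0
      r5_12xlarge["memory_gib"] / r5_2xlarge["memory_gib"])  # 6.0 -> exactly 6x the resources
```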

Lower server cost, same performance

Using the baseline metrics as a minimum requirement, we will reduce the total costs of the cluster. The recommended instance size for a 30-instance cluster similar to the baseline is a c5.2xlarge.search with 16 GiB memory and the same vCPUs as the baseline. The reduction in memory requirements saves 33% of the costs for these servers. Even maintaining the same r5.12xlarge extension server will reduce total cluster costs by 28%.

Any remaining open questions?

Seeking feedback on this test framework!

dbwiddis added the documentation and discuss labels and removed the untriaged label on Apr 6, 2023
owaiskazi19 (Member) commented Apr 6, 2023

The framework looks good to me. Thanks @dbwiddis for writing this up. I just have one minor piece of feedback/suggestion: can we emphasize using just one extension node for testing (probably by making "single" bold)?
Also, we can add a FUTURE section that talks about spinning up multiple extension nodes and replicating the exact scenario the AD plugin had for perf testing HCAD.

owaiskazi19 (Member) commented Apr 6, 2023

Tagging folks for the feedback on the proposal: @dblock @saratvemulapalli @kaituo @ylwu-amzn @sean-zheng-amazon

dbwiddis (Member Author) commented Apr 7, 2023

The framework looks good to me. Thanks @dbwiddis for writing this up. I just have one minor piece of feedback/suggestion: can we emphasize using just one extension node for testing (probably by making "single" bold)?

I've split out that one tweak to the original setup in a separate, bold, italic paragraph. I believe it honestly represents a similar setup, being 1/2 of a metal server rather than 6 1/12ths of a server (literally six of one, a half dozen of another!)

Also, we can add a FUTURE section that talks about spinning up multiple extension nodes and replicating the exact scenario the AD plugin had for perf testing HCAD.

I don't believe it will perform that differently. Certainly the test will be at the exact same cost as the previous testing scenario and I believe that's the metric users care about.

krishna-ggk commented Apr 26, 2023

Thanks a lot @dbwiddis for writing this up. Tagging @kaituo (author of blog) to also share thoughts if any.

  1. Are there other AD-specific performance metrics we could be considering? For example, cache hits, training time, etc. Perhaps someone from the AD plugin could help compile the list?
  2. Can we also capture JVM/CPU/IO profiles of the extension instance?
  3. This baseline is a good start. Should we also run more constrained scenarios to establish KPIs when stressed?

dbwiddis (Member Author) commented:

  1. Are there other AD-specific performance metrics we could be considering? For example, cache hits, training time, etc. Perhaps someone from the AD plugin could help compile the list?

Yes, this is going to be an ongoing, iterative process. My initial goal is to demonstrate "same or better" but that's just a single snapshot.

There are multiple additional goals of gaining insight into the resource usage on the nodes in both models, in order to inform customers about what trade-offs each model brings. In addition to the ones you listed, I am aware indexing pressure is a key consideration and I think shard indexing pressure provides an even earlier leading indicator of potential indexing issues, and plan on trying to validate that. See https://opensearch.org/blog/shard-indexing-backpressure-in-opensearch/
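If it's useful, here is a minimal sketch of how I plan to poll those stats during a run (this assumes the shard indexing pressure stats endpoint described in the OpenSearch docs and the blog post above; the exact response fields vary by version):

```python
# Minimal sketch: poll per-node shard indexing pressure stats during a test run.
# Assumption: the OpenSearch shard indexing backpressure stats endpoint
# (GET _nodes/stats/shard_indexing_pressure); response fields vary by version.
import requests

resp = requests.get("http://localhost:9200/_nodes/stats/shard_indexing_pressure", timeout=10)
resp.raise_for_status()
for node_id, node_stats in resp.json().get("nodes", {}).items():
    # Dump whatever shard indexing pressure stats this cluster version returns.
    print(node_id, node_stats.get("shard_indexing_pressure"))
```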

  2. Can we also capture JVM/CPU/IO profiles of the extension instance?

Definitely. I'm particularly interested in whether we are just moving load around, or whether there are some improvements by avoiding the overhead of resource management when multitasking OpenSearch+HCAD on the same nodes.

  3. This baseline is a good start. Should we also run more constrained scenarios to establish KPIs when stressed?

Definitely! The choice of instances in my initial plan was designed simply to replicate what's already been done as step 1.

Ultimately, I am hoping that the data we are collecting can help identify whether there are other more suitable instance types for the specific types of resource usage. Customers should be able to understand the impacts of smaller instances and know how to monitor when they need to scale.

kaituo commented Apr 27, 2023

@dbwiddis Thanks for writing this up. I have a few questions/comments:

First, you wrote "The recommended instance size for a 30-instance cluster similar to the baseline is a c5.2xlarge.search with 16 GiB memory and the same vCPUs as the baseline. The reduction in memory requirements saves 33% of the costs for these servers." I think the 33% reduction is due to the use of a 16 GB memory machine instead of a 32 GB memory machine, right?

Second, you said "Even maintaining the same r5.12xlarge extension server will reduce total cluster costs by 28%." But you also wrote " we will replace the 6 r5.2xlarge "extension servers" with a single r5.12xlarge, which has exactly 6 times the resources and exactly 6 times the cost, and run the Anomaly Detection extension on this node." Could you elaborate where the 28% reduction comes from?

Third, we do have training time experiment. This is part of our CI: https://github.com/opensearch-project/anomaly-detection/blob/main/.github/workflows/benchmark.yml

You can use the running time of the CI to measure training time, like the 11m below.

 Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

See https://docs.gradle.org/7.4.2/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 11m

Fourth, we don't have cache hits metrics.

Fifth, for JVM/CPU/IO, please follow https://stackify.com/custom-metrics-aws-lambda/

Sixth, for a more constrained scenario to establish KPIs when stressed, please check the 2nd half of https://opensearch.org/blog/one-million-enitities-in-one-minute/

dbwiddis (Member Author) commented Apr 27, 2023

First, you wrote "The recommended instance size for a 30-instance cluster similar to the baseline is a c5.2xlarge.search with 16 GiB memory and the same vCPUs as the baseline. The reduction in memory requirements saves 33% of the costs for these servers." I think the 33% reduction is due to the use of a 16 GB memory machine instead of a 32 GB memory machine, right?

Yes: 1/2 the memory and same vCPU for 2/3 the cost. But that 33% reduction will only apply to 33 of the 39 nodes in the "apples to apples" cluster size comparison.

Second, you said "Even maintaining the same r5.12xlarge extension server will reduce total cluster costs by 28%." But you also wrote " we will replace the 6 r5.2xlarge "extension servers" with a single r5.12xlarge, which has exactly 6 times the resources and exactly 6 times the cost, and run the Anomaly Detection extension on this node." Could you elaborate where the 28% reduction comes from?

33% reduction on 33 of the 39 nodes by changing away from memory optimized. The other 6 nodes will be effectively replaced by the single extension node at no cost difference. 28% = 33% * 33/39.
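Spelling out that arithmetic (a sketch; the 1/3 savings is the relative price difference noted above, not a quoted price):

```python
# Sketch of the cost arithmetic behind the ~28% figure.
total_nodes = 39                 # 3 primary + 36 data nodes in the baseline cluster
switched_to_c5 = 33              # 3 primary + 30 data nodes move off memory-optimized instances
savings_on_switched = 1 / 3      # c5.2xlarge at ~2/3 the cost of r5.2xlarge (same vCPUs)

# The other 6 nodes' worth of capacity becomes one r5.12xlarge extension node
# at no cost difference, so it contributes no savings.
overall_savings = savings_on_switched * switched_to_c5 / total_nodes
print(f"{overall_savings:.1%}")  # 28.2%
```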

Third, we do have training time experiment. This is part of our CI: https://github.com/opensearch-project/anomaly-detection/blob/main/.github/workflows/benchmark.yml

I'll be sure to include that.

Fourth, we don't have cache hits metrics.

Fifth, for JVM/CPU/IO, please follow https://stackify.com/custom-metrics-aws-lambda/

Sixth, for a more constrained scenario to establish KPIs when stressed, please check the 2nd half of https://opensearch.org/blog/one-million-enitities-in-one-minute/

Definitely in the plan!

dbwiddis (Member Author) commented Apr 28, 2023

A quick clarification on the new cluster and memory allocation requirements. The blog post we're replicating for the first phase of this comparison identified the memory requirements for the models. This paragraph points out the total memory allocation isn't a hard limit, but does permit some disk caching:

Model management is a trade-off: disk-based solutions that reload-use-stop-store models on every interval offer savings in memory but suffer high overhead and are hard to scale. Memory-based solutions offer lower overhead and higher throughput but typically increase memory requirements. We exploit the trade-off by implementing an adaptive mechanism that hosts models in memory as much as allowed (capped via the cluster setting plugins.anomaly_detection.model_max_size_percent), as required by best performance. When models don’t fit in memory, we process extra model requests by loading models from disks.

The domain sizing calculation also gives some specifics: the planned max model size, "model_size_in_bytes": 470491, multiplied by 1 million models gives a total max memory requirement of 448.7 GiB. An additional 20% is recommended for memory overhead, for 538.4 GiB total (about 15 GiB per node in the earlier 36-node setup).

  • Note that not all models are this max size; the blog gives examples of both a 471 KB and a 403 KB model. If this were the average, total requirements would be 416 GiB for the models and 84 GiB of overhead.
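Putting those sizing numbers together (a sketch that takes the 448.7 GiB model total above as given):

```python
# Sketch of the sizing split, taking the 448.7 GiB model total above as given.
model_total_gib = 448.7
with_overhead_gib = model_total_gib * 1.20        # +20% recommended overhead
per_node_gib = with_overhead_gib / 36             # across the baseline's 36 data nodes
extension_node_gib = 384                          # physical memory on one r5.12xlarge

print(round(with_overhead_gib, 1))                # 538.4
print(round(per_node_gib, 1))                     # 15.0
print(extension_node_gib < with_overhead_gib)     # True -> one 12xlarge can't hold every model
```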

The proposed split of the 36 data nodes into 30 data + 6 extension (combined into a single instance) would provide 384 GiB of memory on the extension node. This (probably) isn't enough to load everything into memory, but seems possibly on the right order of magnitude depending on the impact of the disk caching mentioned. There are some alternatives that I will attempt to explore the impact of:

  • We could let the existing disk storage methods manage the models to see how limiting it is.
  • We could scale up the extension node (and reduce data nodes) for a 24 data + 12 extension (combined into a single instance) split. The bare metal machine underneath the instance sizes for the test has 768 GiB of memory. Since the 36-node cluster size was calculated based primarily on memory requirements, it seems entirely reasonable that fewer data nodes can handle the indexing load... obviously we need to test this!
  • OpenSearch recommends pre-reserving all the heap memory that it will use, to prevent swapping. However, if swapping is permitted (allowing the operating system to balance the memory/disk trade-off), then the JVM does permit allocating a larger heap than is available in physical RAM, which might be better (or worse) than the existing AD algorithm for caching.

In summary, while the proposed model may not be the best, extensions give the benefit of independent resource scalability and we are not constrained by needing to keep the cluster data nodes in the same instance size. I'm confident that with testing and experimentation we can identify a combination that meets or exceeds the current performance.

dbwiddis (Member Author) commented May 31, 2023

After learning many lessons about initialization order, subnets, stale builds, and persnickety operating system incompatibilities, I've got a small cluster running enough to begin replicating the HCAD test in this blog post.

Initial results (please enjoy with multiple pinches of salt):

Performed a smaller-scale test using the same node size as the blog post (r5.2xlarge):

  • 5 data nodes + 1 extension node (6 nodes) represent 1/6 of the 36-node cluster in the blog post test
  • Detected anomalies on 144,000 unique entities (1/7 of the million entities in the blog post)
  • Used default JVM settings on the extension node (heap = 1/2 of physical memory, so 32 GB of the 64 GB available)

Analysis completed in 1m 40s (1/6 of the 10 minute interval).

A quick look indicates that a cluster of about 1/6 the size can perform 1/7 of the work in 1m40s (back-of-envelope math below the list). That's above the 1 minute goal, but there are a few optimizations that can easily be done:

  • field order optimization reduces sorting time and can provide a 25% reduction
  • I can increase heap size on the extension node.
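The back-of-envelope math referenced above (a sketch using only the numbers in this comment):

```python
# Back-of-envelope math for the small-scale run, using only the numbers above.
entities = 144_000            # ~1/7 of the 1M entities in the blog post
cluster_fraction = 1 / 6      # 5 data + 1 extension node vs. the 36-node baseline
elapsed_s = 100               # 1m 40s

# Time this fraction of the work "should" take to stay on pace with
# one million entities per minute, scaled by cluster size.
on_pace_s = 60 * (entities / 1_000_000) / cluster_fraction
print(round(on_pace_s, 1), elapsed_s)   # 51.8 vs. 100 -> above the goal, pending the optimizations above
```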

But as of today I'm optimistic that we're going to be able to achieve "same or better" performance.

Pausing more runs to focus on fixing a known performance issue (#674) that is adding a few seconds to these results and impacting other extension performance testing issues.

dbwiddis (Member Author) commented Jun 2, 2023

Full-scale performance test.

The cluster from the original HCAD "1 million in 1 minute" blog post was 36 r5.2xlarge data nodes.

The same-size/cost cluster for this test: 30 r5.2xlarge data nodes plus 1 r5.12xlarge extension node (scaled vertically to simulate horizontally scaling 6 r5.2xlarge nodes).

Heap configured for the extension node at 364 GB (no swap).

Number of unique entities analyzed: 990,000
Time to analyze: 53.4 seconds

dbwiddis (Member Author) commented Jun 2, 2023

Another run:
1 million results.
Start: 20:25:16.238
Finish: 20:26:06.965
50.727 seconds.
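
Converting both runs into throughput against the one-million-entities-per-minute bar (a quick sketch from the figures above):

```python
# Quick conversion of both full-scale runs into throughput vs. the 1M/minute bar.
bar_rate = 1_000_000 / 60                    # ~16,667 entities per second required
runs = [(990_000, 53.4), (1_000_000, 50.727)]
for entities, seconds in runs:
    rate = entities / seconds
    print(f"{rate:,.0f} entities/s = {rate / bar_rate:.2f}x the bar")
# 18,539 entities/s = 1.11x the bar
# 19,713 entities/s = 1.18x the bar
```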

minalsha (Collaborator) commented Jun 8, 2023

@dbwiddis are these data-points for running extensions on a remote node or out of process? Would love to see summarized data in a table for 3 streams for doing an apples-to-apples comparison:

  1. AD plugin
  2. AD extension running out-of-proc
  3. AD extension running on a remote node

dbwiddis (Member Author) commented Jun 8, 2023

@dbwiddis are these data-points for running extensions on a remote node or out of process?

These were quick snapshot points of one metric (run time) for a remote node in a "6+30" memory optimized configuration (see below for what this means).

Would love to see summarized data in a table for 3 streams for doing an apples-to-apples comparison:

  1. AD plugin
  2. AD extension running out-of-proc
  3. AD extension running on a remote node

Option 2 isn't really possible at scale, as we have not yet implemented multi-node extension support (horizontal scalability). However, I do plan on doing the following apples-to-apples comparisons, along with some apples-to-pears-and-oranges ones.

Naming convention is "x+y" where x is a single remote extension node vertically scaled to simulate horizontal scaling. Specifically "6" means simulating six 2xlarge nodes with one 12xlarge; "8" means simulating eight 2xlarge nodes with one 16xlarge.

  1. AD plugin (36 r5.2xlarge data nodes)
  2. Same-cost AD extension in 6+30 configuration (this may not fit all models in memory and would use CPU to generate)
  3. Same-cost AD extension in 6+30 configuration with swapfile (this will fit all models in "virtual memory" larger than physical memory, and OS would manage paging to disk)
  4. Same-cost AD extension in 8+28 configuration (this will fit all models in memory)
  5. Lower-cost AD extension in an x+y configuration based on trade-offs we observe in options 2, 3, and 4. The x would probably be 6 or 8 (12xlarge or 16xlarge extension node) and the y will be less expensive (c5) data nodes that do not need the memory optimized configuration of the plugin. Possibly the same number, possibly more or fewer; ultimately this configuration is intended to show a lower-cost, better-performance remote node configuration.
