[PROPOSAL] Anomaly Detection HCAD Performance Measurement #652
Comments
The framework looks good to me. Thanks @dbwiddis for writing this up. I just have one minor suggestion: can we emphasize using just one extension node for testing it? (Probably making …
Tagging folks for feedback on the proposal: @dblock @saratvemulapalli @kaituo @ylwu-amzn @sean-zheng-amazon
I've split out that one tweak to the original setup in a separate, bold, italic paragraph. I believe it honestly represents a similar setup, being 1/2 of a metal server rather than 6 1/12ths of a server (literally six of one, half a dozen of the other!)
I don't believe it will perform that differently. Certainly the test will be at the exact same cost as the previous testing scenario, and I believe that's the metric users care about.
Thanks a lot @dbwiddis for writing this up. Tagging @kaituo (author of blog) to also share thoughts if any.
Yes, this is going to be an ongoing, iterative process. My initial goal is to demonstrate "same or better" but that's just a single snapshot. There are multiple additional goals of gaining insight into the resource usage on the nodes in both models, in order to inform customers about what trade-offs each model brings. In addition to the ones you listed, I am aware indexing pressure is a key consideration and I think shard indexing pressure provides an even earlier leading indicator of potential indexing issues, and plan on trying to validate that. See https://opensearch.org/blog/shard-indexing-backpressure-in-opensearch/
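For reference, polling that might look something like the sketch below: a minimal example against the `_nodes/stats/shard_indexing_pressure` endpoint described in that post, assuming a local, unauthenticated cluster; exact response fields may vary by version.

```python
# Minimal sketch: poll per-node shard indexing pressure stats during a run.
# Assumes a local, unauthenticated cluster endpoint; response field names
# may vary by OpenSearch version.
import json
import urllib.request

OPENSEARCH = "http://localhost:9200"  # hypothetical endpoint

def shard_indexing_pressure_stats():
    url = f"{OPENSEARCH}/_nodes/stats/shard_indexing_pressure"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

for node_id, node in shard_indexing_pressure_stats()["nodes"].items():
    pressure = node.get("shard_indexing_pressure", {})
    print(node_id, json.dumps(pressure)[:200])  # truncate for readability
```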
Definitely. I'm particularly interested in whether we are just moving load around, or whether there are some improvements by avoiding the overhead of resource management when multitasking OpenSearch+HCAD on the same nodes.
Definitely! The choice of instances in my initial plan was designed simply to replicate what's already been done as step 1. Ultimately, I am hoping that the data we are collecting can help identify whether there are other more suitable instance types for the specific types of resource usage. Customers should be able to understand the impacts of smaller instances and know how to monitor when they need to scale.
@dbwiddis Thanks for writing this up. I have a few questions/comments:

1. You wrote, "The recommended instance size for a 30-instance cluster similar to the baseline is a c5.2xlarge.search with 16 GiB memory and the same vCPUs as the baseline. The reduction in memory requirements saves 33% of the costs for these servers." I think the 33% reduction is due to using a 16 GB memory machine instead of a 32 GB memory machine, right?
2. You said, "Even maintaining the same r5.12xlarge extension server will reduce total cluster costs by 28%." But you also wrote, "we will replace the 6 r5.2xlarge 'extension servers' with a single r5.12xlarge, which has exactly 6 times the resources and exactly 6 times the cost, and run the Anomaly Detection extension on this node." Could you elaborate on where the 28% reduction comes from?
3. We do have a training time experiment; it is part of our CI: https://github.com/opensearch-project/anomaly-detection/blob/main/.github/workflows/benchmark.yml. You can use the CI's running time to measure training time, like the 11m below.
4. We don't have cache hit metrics.
5. For JVM/CPU/IO, please follow https://stackify.com/custom-metrics-aws-lambda/
6. For a more constrained scenario to establish KPIs when stressed, please check the second half of https://opensearch.org/blog/one-million-enitities-in-one-minute/
Yes: 1/2 the memory and same vCPU for 2/3 the cost. But that 33% reduction will only apply to 33 of the 39 nodes in the "apples to apples" cluster size comparison.
33% reduction on 33 of the 39 nodes by changing away from memory optimized. The other 6 nodes will be effectively replaced by the single extension node at no cost difference. 28% = 33% * 33/39.
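To make that arithmetic concrete, here's a quick sketch; the only sourced numbers are the ~33% per-node savings and the 33-of-39 node split:

```python
# Worked example of the 28% total-cluster savings figure.
nodes_total = 39            # 3 primary + 36 data-node equivalents
nodes_swapped = 33          # 3 primary + 30 data nodes moved to c5.2xlarge
per_node_savings = 1 / 3    # c5.2xlarge at ~2/3 the cost of r5.2xlarge

cluster_savings = per_node_savings * nodes_swapped / nodes_total
print(f"total cluster savings: {cluster_savings:.1%}")  # ~28.2%
```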
I'll be sure to include that.
Definitely in the plan!
A quick clarification on the new cluster and memory allocation requirements. The blog post we're replicating for the first phase of this comparison identified the memory requirements for the models, pointing out that the total memory allocation isn't a hard limit but does permit some disk caching.
The domain sizing calculation also gives some specifics; it's planned around a max model size.
The proposed split of the 36 data nodes into 30 data + 6 extension (combined into a single instance) would provide 384 GiB of memory on the extension node. This (probably) isn't enough to load everything into memory, but it seems possibly on the right order of magnitude, depending on the impact of the disk caching mentioned. There are some alternatives whose impact I will attempt to explore.
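As a rough illustration of the "right order of magnitude" point, here is a back-of-the-envelope sketch; the per-entity model size is an assumed placeholder, not a measured value:

```python
# Back-of-the-envelope check for fitting ~1M entity models on one node.
# ASSUMPTION: the per-entity model size is a placeholder; real RCF model
# footprints depend on shingle size, tree count, etc. and will be measured.
entities = 1_000_000
model_size_mb = 1.0          # hypothetical per-entity model footprint
extension_memory_gib = 384   # one r5.12xlarge

total_models_gib = entities * model_size_mb / 1024
print(f"~{total_models_gib:.0f} GiB of models vs "
      f"{extension_memory_gib} GiB on the extension node")
# ~977 GiB needed vs 384 GiB available: same order of magnitude, so disk
# caching of less-active models would have to cover the gap.
```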
In summary, while the proposed model may not be the best, extensions give the benefit of independent resource scalability, and we are not constrained by needing to keep the cluster data nodes in the same instance size. I'm confident that with testing and experimentation we can identify a combination that meets or exceeds the current performance.
After learning many lessons about initialization order, subnets, stale builds, and persnickety operating system incompatibilities, I've got a small cluster running well enough to begin replicating the HCAD test in this blog post. Initial results (please enjoy with multiple pinches of salt): I performed a smaller-scale test using the same node size as the blog post (r5.2xlarge).
Analysis completed in 1m 40s (1/6 of the 10-minute interval). A quick look indicates that a cluster of about 1/6 the size can perform 1/7 of the work in 1m 40s. That's above the 1-minute goal, but there are a few optimizations that can easily be made.
But as of today I'm optimistic that we're going to be able to achieve "same or better" performance. I'm pausing further runs to focus on fixing a known performance issue (#674) that is adding a few seconds to these results and impacting other extension performance testing efforts.
Full-scale performance test:
- Cluster from the original HCAD "1 million in 1 minute" blog: 36 r5.2xlarge data nodes.
- Same-size/cost cluster for this test: 30 r5.2xlarge data nodes + 1 r5.12xlarge extension node (scaled vertically to simulate horizontally scaling 6 r5.2xlarge nodes).
- Heap on the extension node configured at 364 GB (no swap).
- Number of unique entities analyzed: 990,000
Another run:
@dbwiddis are these data points for running extensions on a remote node or out of process? I would love to see summarized data in a table for 3 streams for an apples-to-apples comparison:
These were quick snapshot points of one metric (run time) for a remote node in a "6+30" memory optimized configuration (see below for what this means).
Option 2 isn't really possible at scale, as we have not yet implemented multi-node extension support (horizontal scalability). However, I do plan on doing the following apples-to-apples comparisons, along with some apples-to-pears and apples-to-oranges ones. The naming convention is "x+y", where x is a single remote extension node vertically scaled to simulate horizontal scaling: "6" means simulating six 2xlarge nodes with one 12xlarge, and "8" means simulating eight 2xlarge nodes with one 16xlarge.
What/Why
What are you proposing?
As part of SDK development, we have been focusing on migrating the Anomaly Detection plugin to an extension. This has proved beneficial in highlighting many features needed in the SDK. However, we have not yet addressed the performance aspect of this migration. This issue outlines the plan to measure this performance.
Many API calls are used infrequently or incur operational costs far exceeding the additional transport time. The "hot path" feature needed to demonstrate scalable, stable, and performant behavior is high-cardinality anomaly detection (HCAD).
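For context, the high-cardinality behavior comes from a category field that splits the data stream into one model per unique entity. A minimal sketch of defining such a detector through the plugin's REST API is below (the `POST _plugins/_anomaly_detection/detectors` endpoint is from the AD API documentation; the index and field names here are hypothetical):

```python
# Minimal sketch: create a high-cardinality detector. The category_field
# is what makes it HCAD -- one RCF model per unique field value (entity).
# Index and field names below are hypothetical.
import json
import urllib.request

detector = {
    "name": "hcad-perf-test",
    "description": "per-host CPU anomaly detection",
    "time_field": "timestamp",
    "indices": ["network-logs-*"],
    "category_field": ["host_id"],  # one model per host
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
    "feature_attributes": [{
        "feature_name": "avg_cpu",
        "feature_enabled": True,
        "aggregation_query": {"avg_cpu": {"avg": {"field": "cpu"}}},
    }],
}

req = urllib.request.Request(
    "http://localhost:9200/_plugins/_anomaly_detection/detectors",
    data=json.dumps(detector).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print("created detector:", json.loads(resp.read())["_id"])
```

With a million unique values in the category field, the cluster must create, train, and score a million independent models each interval, which is what makes this the "hot path" worth measuring.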
What users have asked for this feature?
The Extensions team has a goal of "same or better performance" in order to demonstrate the benefits of Extensions. Users, in general, want scalable, stable performance at the least possible cost, and we will demonstrate the ability to deliver that via extensions.
What problems are you trying to solve?
This blog post identifies the performance bar to meet: one million unique entities per minute. This issue outlines a plan to measure and demonstrate same-or-better performance against this metric.
What will it take to execute?
Baseline
The blog post specifies the cluster size used to achieve its results: a 39-node r5.2xlarge cluster (3 primary and 36 data nodes). The measured performance impact identified JVM memory pressure (percentage of heap used by HCAD) as the limiting factor (and sized accordingly). Specifically, overall performance metrics during the HCAD measurement were:
How Extensions can perform better
An OpenSearch cluster generally uses the same size server on each data node. Sizing domains is not trivial, and varies based on the amount of data, number of replicas, indexing overhead, and other factors.
One advantage of separating extensions from the OpenSearch server is independent scalability. The entire cluster does not need to be upgraded to memory-optimized servers; only the extension node(s) do. We expect that running HCAD models on dedicated server(s), with their own ability to control threads and manage resources, can provide better overall performance than attempting to orchestrate these calculations across a distributed cluster.
We will run two different tests to show "same or better":
Same servers, same or better performance
Reversing the blog post's 1.2 sizing factor for servers indicates only 30 data nodes are needed for this cluster without HCAD. We will reconfigure the cluster with 30 data nodes and use the remaining 6 servers for extension work. We will demonstrate the ability to match or exceed the baseline test metrics.
Note: As we have not yet implemented extension scalability, we will replace the 6 r5.2xlarge "extension servers" with a single r5.12xlarge, which has exactly 6 times the resources and exactly 6 times the cost, and run the Anomaly Detection extension on this node.
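The "exactly 6 times" equivalence follows from the r5 family's linear sizing (8 GiB of memory per vCPU, with on-demand pricing scaling linearly across sizes); a quick sanity check:

```python
# Sanity check: one r5.12xlarge equals six r5.2xlarge in raw resources.
r5 = {  # size: (vCPUs, memory GiB)
    "2xlarge": (8, 64),
    "12xlarge": (48, 384),
}

vcpu_2x, mem_2x = r5["2xlarge"]
vcpu_12x, mem_12x = r5["12xlarge"]

assert vcpu_12x == 6 * vcpu_2x   # 48 == 6 * 8
assert mem_12x == 6 * mem_2x     # 384 == 6 * 64
print("one r5.12xlarge == six r5.2xlarge in vCPUs and memory")
```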
Lower server cost, same performance
Using the baseline metrics as a minimum requirement, we will reduce the total costs of the cluster. The recommended instance size for a 30-instance cluster similar to the baseline is a c5.2xlarge.search with 16 GiB memory and the same vCPUs as the baseline. The reduction in memory requirements saves 33% of the costs for these servers. Even maintaining the same r5.12xlarge extension server will reduce total cluster costs by 28%.
Any remaining open questions?
Seeking feedback on this test framework!