make 1M1min possible #620

kaituo · 2022-07-28T21:47:50Z

Description

This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker.

Testing done:

Added unit tests.
Manually confirmed 1M1min is possible after the above changes.

Issues Resolved

#338

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker. Testing done: 1. Added unit tests. 2. Manually confirmed 1M1min is possible after the above changes. Signed-off-by: Kaituo Li <kaituo@amazon.com>

codecov-commenter · 2022-07-28T22:28:56Z

Codecov Report

Merging #620 (7686723) into main (9f6a5ab) will increase coverage by 0.21%.
The diff coverage is 87.50%.

@@             Coverage Diff              @@
##               main     #620      +/-   ##
============================================
+ Coverage     78.93%   79.15%   +0.21%     
- Complexity     4204     4240      +36     
============================================
  Files           296      301       +5     
  Lines         17686    17795     +109     
  Branches       1880     1891      +11     
============================================
+ Hits          13960    14085     +125     
+ Misses         2826     2818       -8     
+ Partials        900      892       -8

Flag	Coverage Δ
plugin	`79.15% <87.50%> (+0.21%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...opensearch/ad/ratelimit/CheckpointWriteWorker.java	`91.25% <0.00%> (-2.58%)`	⬇️
.../org/opensearch/ad/feature/CompositeRetriever.java	`83.47% <55.55%> (-1.02%)`	⬇️
.../java/org/opensearch/ad/caching/PriorityCache.java	`69.92% <62.50%> (-0.33%)`	⬇️
...rc/main/java/org/opensearch/ad/util/DateUtils.java	`75.00% <75.00%> (ø)`
...in/java/org/opensearch/ad/caching/CacheBuffer.java	`79.25% <84.21%> (+0.39%)`	⬆️
...arch/ad/transport/EntityResultTransportAction.java	`88.70% <86.95%> (+0.77%)`	⬆️
...ad/ratelimit/CheckPointMaintainRequestAdapter.java	`88.88% <88.88%> (ø)`
...a/org/opensearch/ad/ratelimit/ScheduledWorker.java	`94.28% <94.28%> (ø)`
.../java/org/opensearch/ad/AnomalyDetectorPlugin.java	`96.62% <100.00%> (+0.09%)`	⬆️
.../java/org/opensearch/ad/caching/CacheProvider.java	`100.00% <100.00%> (ø)`
... and 18 more

Signed-off-by: Kaituo Li <kaituo@amazon.com>

src/main/java/org/opensearch/ad/caching/CacheBuffer.java

src/main/java/org/opensearch/ad/feature/CompositeRetriever.java

src/main/java/org/opensearch/ad/settings/AnomalyDetectorSettings.java

src/main/java/org/opensearch/ad/util/DateUtils.java

ylwu-amzn · 2022-08-02T22:04:51Z

src/main/java/org/opensearch/ad/transport/AnomalyResultTransportAction.java

@@ -304,6 +303,9 @@ class PageListener implements ActionListener<CompositeRetriever.Page> {

        @Override
        public void onResponse(CompositeRetriever.Page entityFeatures) {
+            if (pageIterator.hasNext()) {


If there are too many pages and not reach the last page until current job interval ends, will current job still iterate next page while next job triggers?

ylwu-amzn · 2022-08-02T22:10:07Z

src/main/java/org/opensearch/ad/transport/EntityResultTransportAction.java

@@ -158,14 +162,14 @@ private ActionListener<Optional<AnomalyDetector>> onGetDetector(
    ) {
        return ActionListener.wrap(detectorOptional -> {
            if (!detectorOptional.isPresent()) {
-                listener.onFailure(new EndRunException(detectorId, "AnomalyDetector is not available.", true));
+                listener.onFailure(new EndRunException(detectorId, "AnomalyDetector is not available.", false));


Why change to false? If detector is not found, do we still need to run job?

It might be the node with the detector config is temporarily unavailable. Changing to false to to accommodate that.

ylwu-amzn · 2022-08-02T22:12:24Z

src/main/java/org/opensearch/ad/transport/EntityResultTransportAction.java

                return;
            }

            AnomalyDetector detector = detectorOptional.get();

            if (request.getEntities() == null) {
-                listener.onResponse(null);
+                listener.onFailure(new EndRunException(detectorId, "Fail to get any entities from request.", false));


For sparse data, it's possible that some interval has no data. With this change, we will stop AD job if there are no data for X intervals?

if there is no data, model node should not receive the request in the first place. If request.getEntities() == null, there might be some bug in our code.

ylwu-amzn · 2022-08-02T22:15:05Z

src/main/java/org/opensearch/ad/transport/EntityResultTransportAction.java

+                    }
+                } catch (IllegalArgumentException e) {
+                    // fail to score likely due to model corruption. Re-cold start to recover.
+                    cache.get().removeEntityModel(detectorId, modelId);


Track model corruption times? It's fine to add stats in next PR, you can add some todo here.

I added the tracking code. Will send the new commit.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

kaituo · 2022-08-03T00:39:12Z

BWC test failed because we deprecated ODFE and the download link is archived.

ohltyler · 2022-08-03T17:11:46Z

BWC test failed because we deprecated ODFE and the download link is archived.

Have a fix coming soon for this - see #625

kaituo · 2022-08-03T19:35:53Z

BWC test failed because we deprecated ODFE and the download link is archived.

Have a fix coming soon for this - see #625

Thanks Tyler

kaituo · 2022-08-03T19:36:33Z

Besides BWC tests, do you have other comments? @ohltyler @ylwu-amzn

ohltyler · 2022-08-03T19:40:16Z

Besides BWC tests, do you have other comments? @ohltyler @ylwu-amzn

I've taken a look at your latest commit - LGTM

ylwu-amzn · 2022-08-03T19:41:23Z

src/main/java/org/opensearch/ad/settings/AnomalyDetectorSettings.java

        .intSetting(
-            "plugins.anomaly_detection.expected_cold_entity_execution_time_in_secs",
-            3,
+            "plugins.anomaly_detection.expected_cold_entity_execution_time_in_millisecs",


Why change from second to millisecond? Will it have any bwc issue?

it gives more control on the speed. Previously, we can only specify down to 1 second. Now we can specify down to 1 millisecond. It won't as these settings are not documented in public.

ylwu-amzn

LGTM. Thanks for the change!

ohltyler

LGTM!

ohltyler · 2022-08-03T23:38:38Z

Failed tests will be fixed in #625, I think we can merge this and backport first.

* make 1M1min possible This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker. Testing done: 1. Added unit tests. 2. Manually confirmed 1M1min is possible after the above changes. Signed-off-by: Kaituo Li <kaituo@amazon.com> (cherry picked from commit 08fdbdd)

kaituo requested review from ylwu-amzn, ohltyler and a team July 28, 2022 21:47

opensearch-trigger-bot bot added the backport 2.x label Jul 28, 2022

remove randomness in scheduled worker as it is not obvious to users

7dd5297

Signed-off-by: Kaituo Li <kaituo@amazon.com>