Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make 1M1min possible #620

Merged
merged 3 commits into from
Aug 3, 2022
Merged

Conversation

kaituo
Copy link
Collaborator

@kaituo kaituo commented Jul 28, 2022

Description

This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker.

Testing done:

  1. Added unit tests.
  2. Manually confirmed 1M1min is possible after the above changes.

Issues Resolved

#338

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker.

Testing done:
1. Added unit tests.
2. Manually confirmed 1M1min is possible after the above changes.

Signed-off-by: Kaituo Li <kaituo@amazon.com>
@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2022

Codecov Report

Merging #620 (7686723) into main (9f6a5ab) will increase coverage by 0.21%.
The diff coverage is 87.50%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #620      +/-   ##
============================================
+ Coverage     78.93%   79.15%   +0.21%     
- Complexity     4204     4240      +36     
============================================
  Files           296      301       +5     
  Lines         17686    17795     +109     
  Branches       1880     1891      +11     
============================================
+ Hits          13960    14085     +125     
+ Misses         2826     2818       -8     
+ Partials        900      892       -8     
Flag Coverage Δ
plugin 79.15% <87.50%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...opensearch/ad/ratelimit/CheckpointWriteWorker.java 91.25% <0.00%> (-2.58%) ⬇️
.../org/opensearch/ad/feature/CompositeRetriever.java 83.47% <55.55%> (-1.02%) ⬇️
.../java/org/opensearch/ad/caching/PriorityCache.java 69.92% <62.50%> (-0.33%) ⬇️
...rc/main/java/org/opensearch/ad/util/DateUtils.java 75.00% <75.00%> (ø)
...in/java/org/opensearch/ad/caching/CacheBuffer.java 79.25% <84.21%> (+0.39%) ⬆️
...arch/ad/transport/EntityResultTransportAction.java 88.70% <86.95%> (+0.77%) ⬆️
...ad/ratelimit/CheckPointMaintainRequestAdapter.java 88.88% <88.88%> (ø)
...a/org/opensearch/ad/ratelimit/ScheduledWorker.java 94.28% <94.28%> (ø)
.../java/org/opensearch/ad/AnomalyDetectorPlugin.java 96.62% <100.00%> (+0.09%) ⬆️
.../java/org/opensearch/ad/caching/CacheProvider.java 100.00% <100.00%> (ø)
... and 18 more

Signed-off-by: Kaituo Li <kaituo@amazon.com>
@@ -304,6 +303,9 @@ class PageListener implements ActionListener<CompositeRetriever.Page> {

@Override
public void onResponse(CompositeRetriever.Page entityFeatures) {
if (pageIterator.hasNext()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are too many pages and not reach the last page until current job interval ends, will current job still iterate next page while next job triggers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -158,14 +162,14 @@ private ActionListener<Optional<AnomalyDetector>> onGetDetector(
) {
return ActionListener.wrap(detectorOptional -> {
if (!detectorOptional.isPresent()) {
listener.onFailure(new EndRunException(detectorId, "AnomalyDetector is not available.", true));
listener.onFailure(new EndRunException(detectorId, "AnomalyDetector is not available.", false));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change to false? If detector is not found, do we still need to run job?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be the node with the detector config is temporarily unavailable. Changing to false to to accommodate that.

return;
}

AnomalyDetector detector = detectorOptional.get();

if (request.getEntities() == null) {
listener.onResponse(null);
listener.onFailure(new EndRunException(detectorId, "Fail to get any entities from request.", false));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sparse data, it's possible that some interval has no data. With this change, we will stop AD job if there are no data for X intervals?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is no data, model node should not receive the request in the first place. If request.getEntities() == null, there might be some bug in our code.

}
} catch (IllegalArgumentException e) {
// fail to score likely due to model corruption. Re-cold start to recover.
cache.get().removeEntityModel(detectorId, modelId);
Copy link
Collaborator

@ylwu-amzn ylwu-amzn Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Track model corruption times? It's fine to add stats in next PR, you can add some todo here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the tracking code. Will send the new commit.

Signed-off-by: Kaituo Li <kaituo@amazon.com>
@kaituo
Copy link
Collaborator Author

kaituo commented Aug 3, 2022

BWC test failed because we deprecated ODFE and the download link is archived.

@ohltyler
Copy link
Member

ohltyler commented Aug 3, 2022

BWC test failed because we deprecated ODFE and the download link is archived.

Have a fix coming soon for this - see #625

@kaituo
Copy link
Collaborator Author

kaituo commented Aug 3, 2022

BWC test failed because we deprecated ODFE and the download link is archived.

Have a fix coming soon for this - see #625

Thanks Tyler

@kaituo
Copy link
Collaborator Author

kaituo commented Aug 3, 2022

Besides BWC tests, do you have other comments? @ohltyler @ylwu-amzn

@ohltyler
Copy link
Member

ohltyler commented Aug 3, 2022

Besides BWC tests, do you have other comments? @ohltyler @ylwu-amzn

I've taken a look at your latest commit - LGTM

.intSetting(
"plugins.anomaly_detection.expected_cold_entity_execution_time_in_secs",
3,
"plugins.anomaly_detection.expected_cold_entity_execution_time_in_millisecs",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change from second to millisecond? Will it have any bwc issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it gives more control on the speed. Previously, we can only specify down to 1 second. Now we can specify down to 1 millisecond. It won't as these settings are not documented in public.

Copy link
Collaborator

@ylwu-amzn ylwu-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the change!

Copy link
Member

@ohltyler ohltyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ohltyler
Copy link
Member

ohltyler commented Aug 3, 2022

Failed tests will be fixed in #625, I think we can merge this and backport first.

@kaituo kaituo merged commit 08fdbdd into opensearch-project:main Aug 3, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 3, 2022
* make 1M1min possible

This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker.

Testing done:
1. Added unit tests.
2. Manually confirmed 1M1min is possible after the above changes.

Signed-off-by: Kaituo Li <kaituo@amazon.com>
(cherry picked from commit 08fdbdd)
ohltyler pushed a commit that referenced this pull request Aug 3, 2022
* make 1M1min possible

This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker.

Testing done:
1. Added unit tests.
2. Manually confirmed 1M1min is possible after the above changes.

Signed-off-by: Kaituo Li <kaituo@amazon.com>
(cherry picked from commit 08fdbdd)
@amitgalitz amitgalitz added the enhancement New feature or request label Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants