Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Reader changes for dynamic enable/disable of RCA graph components #325

Merged
merged 18 commits into from
Aug 7, 2020

Conversation

ktkrg
Copy link
Contributor

@ktkrg ktkrg commented Jul 31, 2020

Issue #:
Fixes #327

Description of changes:
This change adds functionality to read the overrides written by the NodeDetailsCollector and update the rca conf file accordingly. To that effect this change adds two new sections to the rca[_idle][_master].conf:
mutedActions: []
mutedDeciders: []

Updating RCA conf and the muted nodes work for RCAs and deciders. For actions, the list of muted actions are looked at by the collator and prunes the ones that are muted.

Tests:
Ran tests with different overrides and made sure that the rca conf was updated correctly and muted correctly. Verified same things for the actions as well.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov
Copy link

codecov bot commented Jul 31, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@7be6969). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #325   +/-   ##
=========================================
  Coverage          ?   67.15%           
  Complexity        ?     2074           
=========================================
  Files             ?      299           
  Lines             ?    13280           
  Branches          ?     1099           
=========================================
  Hits              ?     8918           
  Misses            ?     3958           
  Partials          ?      404           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7be6969...556b527. Read the comment docs.

@ktkrg ktkrg linked an issue Jul 31, 2020 that may be closed by this pull request
@ktkrg ktkrg self-assigned this Jul 31, 2020
@ktkrg ktkrg added the enhancement Enhancements to existing codebase label Jul 31, 2020
@ktkrg ktkrg linked an issue Jul 31, 2020 that may be closed by this pull request
@ktkrg ktkrg changed the title [DRAFT] Reader changes for dynamic enable/disable of RCA graph components Reader changes for dynamic enable/disable of RCA graph components Jul 31, 2020
@vigyasharma
Copy link
Contributor

General comment on the description message:

For actions, the list of muted actions are looked at by the collator and prunes the ones that are muted.

Should this be done at collator level or decider level? I was thinking that if an action is muted, it returns isActionable() == false, and decider picks the next best possible action. If we mute it at collator level, then decider will keep picking that same muted action.

@ktkrg
Copy link
Contributor Author

ktkrg commented Aug 3, 2020

General comment on the description message:

For actions, the list of muted actions are looked at by the collator and prunes the ones that are muted.

Should this be done at collator level or decider level? I was thinking that if an action is muted, it returns isActionable() == false, and decider picks the next best possible action. If we mute it at collator level, then decider will keep picking that same muted action.

Aah yes, I started this before I started on collator, so I missed the fact that deciders can suggest multiple actions based.

Will change this.

@ktkrg ktkrg requested review from adityaj1107 and yojs August 6, 2020 03:17
@@ -75,15 +87,18 @@ public void processEvent(Event event) {

// An example node_metrics data is something like this for a two node cluster:
// {"current_time":1566414001749}
// 1566414001749
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file used to be a json object per line. It would be nice if we can keep that and push the override timestamp inside as follows:

{"overrides": {"enabled": {}, "disabled": {}}, ts:54674895}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a change in the plugin as well. Will follow up with a separate PR for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. But let's change it.

this(new ConfigOverridesApplier());
}

public ClusterDetailsEventProcessor(final ConfigOverridesApplier overridesApplier) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need to pass ConfigOverridesApplier as a constructor. No state is kept in it. It reads the config and applies it to the rca.conf. Maybe we can just use it as a helper static class ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does keep the lastAppliedTimestamp as an instance variable and uses that to check if the config overrides string received from the ClusterDetailsEventProcessor. Making it a static helper will make all the consumers see the same value of last applied time(in an IT for example).

Comment on lines 97 to 99
LOG.debug("New set of muted rcas: {}", currentMutedRcaSet);
LOG.debug("New set of muted deciders: {}", currentMutedDeciderSet);
LOG.debug("New set of muted actions: {}", currentMutedActionSet);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be applying change when someone calls the API or at startup, which shouldn't be that many. Should we make it info logs ? Or add a metric.

LOG.debug("New set of muted rcas: {}", currentMutedRcaSet);
LOG.debug("New set of muted deciders: {}", currentMutedDeciderSet);
LOG.debug("New set of muted actions: {}", currentMutedActionSet);
rcaConf.updateAllRcaConfFiles(currentMutedRcaSet, currentMutedDeciderSet, currentMutedActionSet);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we are updating three rca.conf files. updateAllRcaConfFiles has a void return type. What happens if it succeeded on a subset of files but not all? We will still update lastAppliedTimestamp and never look back until someone makes a change again. So, should we make lastAppliedTimestamp return true when writes on all three files succeed and and then only update the lastAppliedTimestamp. If not, then we will retry again on the next run and eventually all three files will be updated or we likely have a disk corruption or disk full issues. thoughts ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! I'm checking all three updates before updating the lastAppliedTimestamp.

I'm also emitting a metric if we fail to update an rca conf file. If we end up in a state where we can't do these file system changes, I think we should be failing in more places than just this, so yea, it's probably ok to do the timestamp update only if all three succeed.

@@ -39,10 +40,12 @@
// initiate a node config cache within each AppContext space
// to store node config settings from ES
private final NodeConfigCache nodeConfigCache;
private Set<String> mutedActions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RCAController thread updates this but it is read by the SchedulerTask threads. We should make this volatile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

mutedActions);
if (!updateStatus) {
LOG.error("Failed to update the conf file at path: {}", confFilePath);
StatsCollector.instance().logMetric(RcaConsts.WRITE_UPDATED_RCA_CONF_ERROR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the new metric framework. There your metric can also tell you which file erred out ?

@ktkrg ktkrg merged commit 7ede0b9 into master Aug 7, 2020
@ktkrg ktkrg deleted the ktkrg-ed2 branch August 7, 2020 19:05
khushbr pushed a commit that referenced this pull request Sep 30, 2020
Changes for 1 package (OpenDistroPerformanceAnalyzerEngine), pushed in snapshot...

  https://code.amazon.com/snapshots/partsrut/2020-08-08T02-49-04

Changes for OpenDistroPerformanceAnalyzerEngine package:

0194f9f Fix failing reaction wheel tests
bfdb93d Merge remote-tracking branch 'upstream/master'
e03a120 Use StringUtils instead of NumberUtils to check timestamp string (#360)
79841f2 Expose queue rejection time period setting in rca.conf (#356)
f9dfda5 Implement isMuted method in DummyAction (#355)
7ede0b9 Reader changes for dynamic enable/disable of RCA graph components (#325)
ee58207 Fix bug in publisher to support cool off period on a per node basis (#351)

cr https://code.amazon.com/reviews/CR-31344155
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement Enhancements to existing codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support dynamic enable/disabling of RCA, decider nodes and actions.
4 participants