Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding core changes for API to check per awareness attribute health #5231

Merged
merged 13 commits into from
Jan 3, 2023

Conversation

nishchay21
Copy link
Contributor

@nishchay21 nishchay21 commented Nov 12, 2022

Signed-off-by: Nishchay Malhotra nishcha@amazon.com

Description

The PR solves issue #5230. It adds the capability of new API where one can view per awareness attribute health status. As of today opensearch only shows health for full cluster however health per awareness attribute is still not seen. Added an API today monitor the cluster availability status per awareness attribute as that would give use more information on how many shards are assigned per awareness attribute value, how many nodes are distributes per awareness attribute value etc. Follwoing things can be shown per awareness attribute value [eg rack as attribute]:

  • Total number of nodes per rack
  • Total shards per rack.
  • Weights given per rack if defined.
  • If "cluster.routing.allocation.awareness.balance" is enforced we can also predict the unassigned shards.

Issues Resolved

#5230.

Check List

  • New functionality includes testing.
  • All tests pass
  • New functionality has been documented.
  • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@nishchay21 nishchay21 requested review from a team and reta as code owners November 12, 2022 11:25
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2022

Codecov Report

Merging #5231 (31d4061) into main (da94dbb) will decrease coverage by 0.14%.
The diff coverage is 76.92%.

@@             Coverage Diff              @@
##               main    #5231      +/-   ##
============================================
- Coverage     71.09%   70.94%   -0.15%     
+ Complexity    58499    58476      -23     
============================================
  Files          4748     4751       +3     
  Lines        278818   279104     +286     
  Branches      40296    40341      +45     
============================================
- Hits         198219   198015     -204     
- Misses        64477    64918     +441     
- Partials      16122    16171      +49     
Impacted Files Coverage Δ
...luster/awarenesshealth/ClusterAwarenessHealth.java 57.14% <57.14%> (ø)
...sshealth/ClusterAwarenessAttributeValueHealth.java 76.33% <76.33%> (ø)
...renesshealth/ClusterAwarenessAttributesHealth.java 81.88% <81.88%> (ø)
...n/indices/forcemerge/ForceMergeRequestBuilder.java 0.00% <0.00%> (-75.00%) ⬇️
...a/org/opensearch/client/cluster/ProxyModeInfo.java 0.00% <0.00%> (-60.00%) ⬇️
...ava/org/opensearch/action/NoSuchNodeException.java 0.00% <0.00%> (-50.00%) ⬇️
...a/org/opensearch/tasks/TaskCancelledException.java 50.00% <0.00%> (-50.00%) ⬇️
...ain/java/org/opensearch/search/sort/MinAndMax.java 63.15% <0.00%> (-36.85%) ⬇️
.../opensearch/client/indices/CloseIndexResponse.java 56.25% <0.00%> (-35.00%) ⬇️
.../indices/forcemerge/TransportForceMergeAction.java 25.00% <0.00%> (-33.34%) ⬇️
... and 479 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Member

@imRishN imRishN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in rest-api-spec is missing


@Override
public List<Route> routes() {
return List.of(new Route(GET, "/_cluster/ha_health"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HA sounds like too specific and many users might not relate with it. Shall we call it verbose_health or something similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did finalize this during the discussions however ha sounds better to me . I will let bukhtawar and gaurav comment on this if required I can change that,.


@Override
protected void clusterManagerOperation(
final Task task,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this task used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kept that if we need for future use and on debug level we can log the taskID if needed

*
* @opensearch.internal
*/
public class ClusterHaHealthRequest extends ClusterManagerNodeReadRequest<ClusterHaHealthRequest> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rather extend from ClusterHealthRequest? That might help to get the capabilities of generic health requests like specifying indices, wait for status, etc. Although those extensions can be taken up in a subsequent PR

Copy link
Contributor Author

@nishchay21 nishchay21 Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so but than segregating it seems better because if tomorrow we need to add new capabilities to this API itself would be better and more easy. So thinking in terms of single responsibility model per class this seems better to me.

Comment on lines 32 to 40

private static final String CLUSTER_NAME = "clusterName";
private static final String STATUS = "domainHealth";
private static final String NUMBER_OF_NODES = "totalNodes";
private static final String NUMBER_OF_DATA_NODES = "totalDataNodes";
private static final String ACTIVE_SHARDS = "totalShards";
private static final String UNASSIGNED_SHARDS = "totalUnassignedShards";
private static final String INITIALIZING_SHARDS = "totalInitializingShards";
private static final String AWARENESS_ATTRIBUTE = "AwarenessAttribute";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these variables also present in ClusterHealthResponse with the only change is awareness attribute health. If it is just about adding one map in regular health check call, shall we just rather add query param in the health check call which would return more verbose health like here rather than build a new API altogether? Something like ?verbose_health

What do you think?

Copy link
Contributor Author

@nishchay21 nishchay21 Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. As cluster health is aggregation for cluster and not awareness attribute specific would be better to segregate this.

builder.field(ACTIVE_SHARDS, getActiveShards());
builder.field(UNASSIGNED_SHARDS, getUnassignedShards());
builder.field(INITIALIZING_SHARDS, getInitializingShards());
builder.field(AWARENESS_ATTRIBUTE, getAwarenessAttributeHealth());
Copy link
Member

@imRishN imRishN Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is the only difference between regular cluster health and this new API

builder.field(AWARENESS_ATTRIBUTE, getAwarenessAttributeHealth());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@nishchay21 nishchay21 closed this Nov 17, 2022
@nishchay21 nishchay21 reopened this Nov 17, 2022
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

CHANGELOG.md Outdated
@@ -68,6 +68,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

## [Unreleased 2.x]
### Added
- Added feature to check per awareness attribute value health([#5231](https://github.com/opensearch-project/OpenSearch/pull/5231))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're adding a new API here, so it's not a "feature to check", no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense will make that change

@nishchay21 nishchay21 changed the title Add feature to check per awareness attribute value health. Add API to check per awareness attribute value health. Nov 17, 2022
*
* @opensearch.internal
*/
public class ClusterHaHealthAction extends ActionType<ClusterHaHealthResponse> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to ClusterAwarenessHealthAction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make the change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

public class ClusterHaHealthAction extends ActionType<ClusterHaHealthResponse> {

public static final ClusterHaHealthAction INSTANCE = new ClusterHaHealthAction();
public static final String NAME = "cluster:monitor/ha_health";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: awareness_health

* @param clusterState The current cluster state. Must not be null.
* @param clusterSettings the current cluster settings.
*/
public ClusterHaHealth(ClusterState clusterState, ClusterSettings clusterSettings) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor is just too big, would prefer smaller methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make this as small methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added smaller methods

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@@ -205,6 +207,15 @@ public interface ClusterAdminClient extends OpenSearchClient {
*/
void state(ClusterStateRequest request, ActionListener<ClusterStateResponse> listener);

/**
* The ha_health of the cluster.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ha_health/awareness_health/g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Comment on lines 83 to 65
// Getting the awareness attribute list
List<String> awarenessAttributeList = clusterSettings.get(
AwarenessAllocationDecider.CLUSTER_ROUTING_ALLOCATION_AWARENESS_ATTRIBUTE_SETTING
);

// Getting the replicaEnforcement settings as both are necessary for replica enforcement.
boolean replicaEnforcement = clusterSettings.get(AwarenessReplicaBalance.CLUSTER_ROUTING_ALLOCATION_AWARENESS_BALANCE_SETTING);
Settings forcedAwarenessSettings = clusterSettings.get(
AwarenessAllocationDecider.CLUSTER_ROUTING_ALLOCATION_AWARENESS_FORCE_GROUP_SETTING
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a validator which validates for following things

  1. CLUSTER_ROUTING_ALLOCATION_AWARENESS_ATTRIBUTE_SETTING
  2. CLUSTER_ROUTING_ALLOCATION_AWARENESS_FORCE_GROUP_SETTING
  3. replicaEnforcement
  4. Single attribute only (two attributes like zone and rack are not supported)

If validation fails, we can do either of these

  1. [Preferred] Throw an exception
  2. Not populate ZoneHealth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed we will through empty map and not exception so not moving the same to validator

);
}

private ClusterHealthStatus generateIndexAndStateStats(ClusterState clusterState) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse TransportClusterHealthAction to do most of the heavy lifting here ? I did a quick POC earlier : https://github.com/gbbafna/admin-plugin/blob/main/src/main/java/org/opensearch/path/to/plugin/TransportClusterHAHealthAction.java#L113 . The benefits are two fold

  1. Any evolution of cluster health can be reused .
  2. Cluster Health is much more complex and provides option to wait for unassigned shards etc . We can reuse that later if needed .

Copy link
Contributor Author

@nishchay21 nishchay21 Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done on consistency standpoint so that the state on full call is not changed. Will check if we can reuse the class

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets populate status as well in here and rename to setClusterHealth .

private int unassignedShards;
private int initializingShards;
private ClusterHealthStatus status;
private final Map<String, Object> collatedAttributeHealth;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of Object, lets created AttributeHealth class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed we are having multiple attribute value and per attribute value we are returning a list of Attribute values so instead of object class AttributeHealth class cannot be used

Comment on lines 48 to 43
private int activeShards;
private int unassignedShards;
private int initializingShards;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add relocating shards as well .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

return shardAllocationPerNode;
}

private Map<String, Object> generateAwarenessAttributeHealthMap(
Copy link
Contributor

@anshu1106 anshu1106 Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to break this function for better code readability?

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 1, 2022

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

}

// Constructor use by Unit test case.
public ClusterAwarenessAttributeValueHealth(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make all places have package private usages if used solely for tests

*
* @param name name of awareness attribute
*/
public ClusterAwarenessAttributeValueHealth(String name) {
Copy link
Collaborator

@Bukhtawar Bukhtawar Jan 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we pass the list of RoutingNode with the same attribute value?

String nodeId;

// computing active, relocating and initializing shards info
for (ShardRouting shard : clusterState.routingTable().allShards()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to iterate over all shards for all distinct attribute value, its a wasted computation. Instead we could filter the RoutingNode for this attribute value and pass it in the the ctor and iterate over only those set of nodes. This would also avoid the check for nodeInCurrentAZ

attributeValue = nodeAttributes.get(awarenessAttributeName);

if (!attributesHealth.containsKey(attributeValue)) {
clusterAwarenessAttributeValueHealth = new ClusterAwarenessAttributeValueHealth(attributeValue);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets pass the RoutingNode here to avoid unnecessary computation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computation will still exists in the parent class . However will make the required change


// This is the map that would store all the stats per attribute ie
// health stats for rack-1, rack-2 etc.
awarenessAttributeValueHealthMap = new HashMap<>(Collections.emptyMap());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Collections.emptyMap()?

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Jan 3, 2023

Gradle Check (Jenkins) Run Completed with:

double attributeWeight = 1.0;
if (weightedRoutingMetadata != null) {
WeightedRouting weightedRouting = weightedRoutingMetadata.getWeightedRouting();
attributeWeight = weightedRouting.weights().getOrDefault(name, 1.0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should default to 1 if no weights specified

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then this would mean that all values are active and no specific parameter is set

Copy link
Collaborator

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nishchay21, on a high level the changes look good to me. Can you also run this through @gbbafna once more to see if this looks good.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 3, 2023

Gradle Check (Jenkins) Run Completed with:

Copy link
Collaborator

@gbbafna gbbafna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Two nit comments regarding naming .

@@ -54,20 +56,20 @@ public ClusterAwarenessAttributesHealth(

// This is the Map which is storing the per attribute value health stats.
// The key would be the attribute value like rack-1 and the stats would be stored as value.
Map<String, ClusterAwarenessAttributeValueHealth> attributesHealth = new HashMap<>();
Map<String, List<String>> attributesHealth = new HashMap<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets rename this as this doesn't consists of health , but list of nodes node. Edit the comment above as well.

@@ -94,7 +93,7 @@ public ClusterAwarenessAttributesHealth(
}

private void setClusterAwarenessAttributeValue(
Map<String, ClusterAwarenessAttributeValueHealth> perAttributeValueHealthMap,
Map<String, List<String>> perAttributeValueHealthMap,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : rename here as well .

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Jan 3, 2023

Gradle Check (Jenkins) Run Completed with:

@Bukhtawar Bukhtawar merged commit 7578cd8 into opensearch-project:main Jan 3, 2023
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Jan 4, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 4, 2023
…5231)

* Adding core changes for API to check per awareness attribute health.

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
(cherry picked from commit 7578cd8)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Jan 9, 2023
…5231) (#5688)

* Adding core changes for API to check per awareness attribute health.

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 10, 2023
…5231)

* Adding core changes for API to check per awareness attribute health.

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
(cherry picked from commit 7578cd8)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Jan 10, 2023
…5231) (#5776)

* Adding core changes for API to check per awareness attribute health.

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
(cherry picked from commit 7578cd8)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Jan 25, 2023
…5231) (#5688)

* Adding core changes for API to check per awareness attribute health.

Signed-off-by: Nishchay Malhotra <nishcha@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants