[FEATURE] Replace calls to SDKClusterService state() with more targeted calls #674

dbwiddis · 2023-04-12T21:38:35Z

Is your feature request related to a problem?

In the AD plugin, the process of creating a detector makes frequent use of the ClusterService.state() method, to:

check whether index metadata exists
check after creating metadata that it exists
check whether an index exists when validating whether a detector would create a new index
check whether an alias exists when validating, in the line of code immediately after the above (uh, variable re-use anyone?)
check whether the index exists (again) right before creating it
check whether the alias exists (again) right before creating it in the line of code just after the above
check whether the cluster is write blocked before writing to it
check whether the index exists after creating it

There are several more usages associated with Job Scheduler.

In OpenSearch, these repeated calls aren't a performance issue because the methods simply return the same (up to date) object every time.

In the SDK, every single request sends a transport call and since it's unfiltered, returns the entire cluster state of OpenSearch over transport, even if we only need one bit of information.

This is like turning on a fire hose to get a sip of water.

Eight times.

What solution would you like?

Don't use SDKClusterService.state() unless you really really really mean it. And you probably don't.

We should audit our usage of state in AD (as in the first comment) and update the migration guide to identify use cases and provide suggested alternatives.

We may want to just consider marking the SDKClusterService state() method as deprecated to give a clear signal to developers that this is not the droid they are looking for.

What alternatives have you considered?

Status quo: fetch the cluster state every time

Pro: it's already done. And has some benefit in finding bugs
Con: we always wonder why our Postman REST requests time out

Maintain a local state, similar to the way we do for Settings, with update consumers:

Pro: we have a local state always ready to provide fast access to the entire state of the OpenSearch cluster!
Con: we don't need 99.9% of the information in the cluster, and it's updated frequently.

Do you have any additional context?

Hat tip to @dblock for pointing out the obvious in this comment. Reading all the comments on that PR is also educational.

The text was updated successfully, but these errors were encountered:

owaiskazi19 · 2023-04-12T21:53:17Z

This issue will address the performance for AD Extension as well.
With plugin architecture it makes sense to get the cluster state as they a local copy and gets notified only if the state change. Also, plugin eventually are part of OpenSearch (So..).
Few points I would like to reiterate if we are trying to keep ClusterState as just an internal API and not expose it:

Identify the actual use case of ClusterState in AD Extension, if it's just for checking the index exist we can use the existing APIs for such cases.
Maintain a local copy of the cluster state. But again, do we want a remote node to store a local copy of the cluster state and update it whenever there are any changes? (Need more discussion on this one)

IMO, if we are able to find all the alternative APIs for the usage of ClusterState then moving with the 1st option makes sense.

dbwiddis · 2023-04-12T21:56:21Z

in AD Extension, if it's just for checking the index exist we can use the existing APIs for such cases.

Other common case I see is checking for blocks (e.g., ClusterBlockLevel.WRITE). I'm sure there's a faster way of checking that. (On Indices API.)

dbwiddis · 2023-04-13T05:46:41Z

It's possible the ClusterService call gives information on system indices not available via the Index API, so there may be a need for some middle ground not currently covered by the APIs.

dbwiddis · 2023-06-01T04:25:58Z

Needed methods:

                String[] concreteIndices = indexNameExpressionResolver
                    .concreteIndexNames(
                        sdkClusterService.state(),
                        IndicesOptions.lenientExpandOpen(),
                        sourceIndices.toArray(new String[0])
                    );

state().metadata().hasIndex(indexName)
state().metadata().index(resultIndex)
state().metadata().hasAlias(alias)
state().metadata().indices().containsKey(key)
state().metadata().getIndicesLookup().get(indexOrAliasName).getIndices()
state().getNodes().isLocalNodeElectedMaster()
state().getMetadata().indices().get(concreteIndex)
state().getRoutingTable().hasIndex(indexName)
state().getRoutingTable().index(indexOrAliasName)
state().nodes()
state().blocks().globalBlockedException(level)
state().getClusterName()

dbwiddis · 2023-06-01T16:00:23Z

More research notes:

The getters have two forms (getMetadata() delegates to metadata(), getNodes() to nodes(), getRoutingTable() to routingTable())
hasIndex(name) is equivalent to testing whether index(name) is not null and has matching UUID
There are REST requests for many of these

joshpalis · 2023-06-01T23:11:53Z

Did some digging into the different ways AD Extension is using the cluster state to determine if an index exists, currently there are three :

state().getRoutingTable().hasIndex(indexName)

Used primarily in AnomalyDetectionIndices to determine if system indices exist in the cluster before creating it. For example

state().metadata().hasIndex(indexName)

Used only in AnomalyDetectionIndices here to determine if the custom result index created by the user exists in the cluster. This is essentially the same as [3] since internally it just calls indices().containsKey(indexName)

state().metadata().indices().containsKey(indexName)

Used to determine if the Anomaly Detector Job Index / Anomaly Detector Index has been created prior to querying for entries within those indices. For example.

From this exploration, it seems that there is no difference in using the metadata or routingtable of the cluster state to determine if an index exists (system index or otherwise). We can replace all instances of these three use cases of the cluster state with the indices exists API documented here

dbwiddis · 2023-06-01T23:36:47Z

Did some digging into the different ways AD Extension is using the cluster state to determine if an index exists

I'm curious if that differentiates between index and alias? i.e., if an index exists and has an alias, does index(alias) return the index? I think that meets our needs as the index/alias checks in AD code are always back-to-back and can be reduced to one.

Other APIs looks pretty simple...

cluster name can come from GET _cluster/health or GET _cluster/stats (the former is probably faster)
nodes is probably GET /_nodes (could look at usages to filter by type but probably not a big deal)
blocks is in cluster settings (cluster.blocks.read_only, etc.)

joshpalis · 2023-06-02T17:31:10Z

I'm curious if that differentiates between index and alias? i.e., if an index exists and has an alias, does index(alias) return the index?

I don't think this is the case, as the metadata class also has an hasAlias here to see if it exists

joshpalis · 2023-06-05T21:20:33Z

Next set of cluster state calls that I shall handle focuses on retrieving index level metadata/routing tables :

The following both return the IndexMetaData of the given index, as @dbwiddis mentioned, getMetaData() just delegates to metadata() and index() just calls indices().get(indexName). Both calls are effectively the same.

state().metadata().index(indexName)

Used in AnomalyDetectionIndices here to validate if the user-created custom anomaly result index is valid. Internally this just retrieves the index mapping so this call can be replaced with the GetIndex API documented here which contains the mapping (if any) of an index as part of the response.

Additionally used in IndexUtils here to create an object of ClusterIndexHealth which utilizes the IndexMetaData to retrieve the index name, number of shards and number of replicas. Ultimately just used to retrieve the health status code of the index

state().getMetadata().indices().get(concreteIndex)

Used in AnomalyDetectionIndices here to determine if the given index should be updated. This also just retrieves the index mapping from the IndexMetaData, which could be replaced with the GetIndex API

The following returns a list of IndexMetaData given an alias :

state().metadata().getIndicesLookup().get(indexOrAliasName).getIndices()

This call is used in IndexUtils here to determine the index the alias actually points to (if any, unless there are none or more than 1) and identifies this index as the index to determine cluster index health from

The following call returns the RoutingTable of the given index

state().getRoutingTable().index(indexOrAliasName) :

Used in IndexUtils to create an object of ClusterIndexHealth, internally used to retrieve all the IndexShardRoutingTables associated with this index. Ultimately just used to retrieve the health status code of the index

TLDR : Can replace calls to the cluster state to retrieve IndexMetaData with the Get Index API to retrieve index mappings. May need to add support to the SDKClient to retrieve IndexMetadata and IndexRoutingTable from a given index for use in determining ClusterIndexHealth, if there is no API that already exists to achieve the same. (Edit, We can retrieve Cluster Index Health status from ClusterHealth API )

joshpalis · 2023-06-07T23:02:14Z

Moving onto the following cluster state calls :

state().getClusterName()

this call is primarily used as part of the TransportNodesAction response to identify the cluster that the response is coming from. In the case of AD Extension, all the TransportNodeAction requests are directed back to the extension node so the cluster name will always be the same.

@dbwiddis rather than retrieving the cluster name from another call to the ClusterHealth API, we can retrieve the cluster name from the environment settings response after extension initialization as the cluster name is part of this :

{
  "client.type": "node",
  "cluster.name": "opensearch",
  "http.type.default": "netty4",
  "node.attr.shard_indexing_pressure_enabled": "true",
  "node.name": "dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com",
  "opensearch.experimental.feature.extensions.enabled": "true",
  "path.home": "/home/jpalis/repos/integ/OpenSearch/distribution/archives/linux-tar/build/install/opensearch-3.0.0-SNAPSHOT",
  "path.logs": "/home/jpalis/repos/integ/OpenSearch/distribution/archives/linux-tar/build/install/opensearch-3.0.0-SNAPSHOT/logs",
  "transport.type.default": "netty4"
}

dbwiddis added enhancement New feature or request untriaged labels Apr 12, 2023

dbwiddis mentioned this issue Apr 13, 2023

[Extensions] Add ClusterStateRequest parameter to cluster state transport request opensearch-project/OpenSearch#7066

Closed

5 tasks

minalsha removed the untriaged label Apr 14, 2023

dbwiddis mentioned this issue May 19, 2023

[PROPOSAL] Eliminate all Transport calls from Extensions to OpenSearch #767

Open

dbwiddis mentioned this issue May 31, 2023

[PROPOSAL] Anomaly Detection HCAD Performance Measurement #652

Open

dbwiddis self-assigned this May 31, 2023

dbwiddis assigned joshpalis and unassigned dbwiddis Jun 1, 2023

This was referenced Jun 2, 2023

Adds exists/aliasExists() methods to SDKIndicesClient #802

Merged

[Extensions] Improves performance by replacing cluster state calls for index existence with exists()/aliasExists() API calls opensearch-project/anomaly-detection#919

Merged

This was referenced Jun 6, 2023

Adds health() to SDK Cluster admin client #810

Merged

[Extensions] Improves performance by replacing cluster state calls to retrieve index mappings, cluster index health opensearch-project/anomaly-detection#921

Merged

This was referenced Jun 7, 2023

[Extensions] Improves performance by replacing cluster state calls to retrieve cluster name opensearch-project/anomaly-detection#924

Merged

Adds ClusterName to SDKClusterService #818

Merged

joshpalis mentioned this issue Jun 19, 2023

[Extensions] Fixing BaseException imports and fixing Cluster Index Health request opensearch-project/anomaly-detection#929

Merged

dbwiddis mentioned this issue Jun 21, 2023

[META] Split and modularize the server monolith opensearch-project/OpenSearch#8110

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Replace calls to SDKClusterService state() with more targeted calls #674

[FEATURE] Replace calls to SDKClusterService state() with more targeted calls #674

dbwiddis commented Apr 12, 2023 •

edited

owaiskazi19 commented Apr 12, 2023

dbwiddis commented Apr 12, 2023 •

edited

dbwiddis commented Apr 13, 2023

dbwiddis commented Jun 1, 2023 •

edited by joshpalis

dbwiddis commented Jun 1, 2023 •

edited

joshpalis commented Jun 1, 2023 •

edited

dbwiddis commented Jun 1, 2023

joshpalis commented Jun 2, 2023

joshpalis commented Jun 5, 2023 •

edited

joshpalis commented Jun 7, 2023

[FEATURE] Replace calls to SDKClusterService state() with more targeted calls #674

[FEATURE] Replace calls to SDKClusterService state() with more targeted calls #674

Comments

dbwiddis commented Apr 12, 2023 • edited

Is your feature request related to a problem?

What solution would you like?

What alternatives have you considered?

Do you have any additional context?

owaiskazi19 commented Apr 12, 2023

dbwiddis commented Apr 12, 2023 • edited

dbwiddis commented Apr 13, 2023

dbwiddis commented Jun 1, 2023 • edited by joshpalis

dbwiddis commented Jun 1, 2023 • edited

joshpalis commented Jun 1, 2023 • edited

dbwiddis commented Jun 1, 2023

joshpalis commented Jun 2, 2023

joshpalis commented Jun 5, 2023 • edited

joshpalis commented Jun 7, 2023

dbwiddis commented Apr 12, 2023 •

edited

dbwiddis commented Apr 12, 2023 •

edited

dbwiddis commented Jun 1, 2023 •

edited by joshpalis

dbwiddis commented Jun 1, 2023 •

edited

joshpalis commented Jun 1, 2023 •

edited

joshpalis commented Jun 5, 2023 •

edited