Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Replace calls to SDKClusterService state() with more targeted calls #674

Open
dbwiddis opened this issue Apr 12, 2023 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@dbwiddis
Copy link
Member

dbwiddis commented Apr 12, 2023

Is your feature request related to a problem?

In the AD plugin, the process of creating a detector makes frequent use of the ClusterService.state() method, to:

  • check whether index metadata exists
  • check after creating metadata that it exists
  • check whether an index exists when validating whether a detector would create a new index
  • check whether an alias exists when validating, in the line of code immediately after the above (uh, variable re-use anyone?)
  • check whether the index exists (again) right before creating it
  • check whether the alias exists (again) right before creating it in the line of code just after the above
  • check whether the cluster is write blocked before writing to it
  • check whether the index exists after creating it

There are several more usages associated with Job Scheduler.

In OpenSearch, these repeated calls aren't a performance issue because the methods simply return the same (up to date) object every time.

In the SDK, every single request sends a transport call and since it's unfiltered, returns the entire cluster state of OpenSearch over transport, even if we only need one bit of information.

This is like turning on a fire hose to get a sip of water.

Eight times.

What solution would you like?

Don't use SDKClusterService.state() unless you really really really mean it. And you probably don't.

We should audit our usage of state in AD (as in the first comment) and update the migration guide to identify use cases and provide suggested alternatives.

We may want to just consider marking the SDKClusterService state() method as deprecated to give a clear signal to developers that this is not the droid they are looking for.

What alternatives have you considered?

Status quo: fetch the cluster state every time

  • Pro: it's already done. And has some benefit in finding bugs
  • Con: we always wonder why our Postman REST requests time out

Maintain a local state, similar to the way we do for Settings, with update consumers:

  • Pro: we have a local state always ready to provide fast access to the entire state of the OpenSearch cluster!
  • Con: we don't need 99.9% of the information in the cluster, and it's updated frequently.

Do you have any additional context?

Hat tip to @dblock for pointing out the obvious in this comment. Reading all the comments on that PR is also educational.

@dbwiddis dbwiddis added enhancement New feature or request untriaged labels Apr 12, 2023
@owaiskazi19
Copy link
Member

This issue will address the performance for AD Extension as well.
With plugin architecture it makes sense to get the cluster state as they a local copy and gets notified only if the state change. Also, plugin eventually are part of OpenSearch (So..).
Few points I would like to reiterate if we are trying to keep ClusterState as just an internal API and not expose it:

  1. Identify the actual use case of ClusterState in AD Extension, if it's just for checking the index exist we can use the existing APIs for such cases.
  2. Maintain a local copy of the cluster state. But again, do we want a remote node to store a local copy of the cluster state and update it whenever there are any changes? (Need more discussion on this one)

IMO, if we are able to find all the alternative APIs for the usage of ClusterState then moving with the 1st option makes sense.

@dbwiddis
Copy link
Member Author

dbwiddis commented Apr 12, 2023

in AD Extension, if it's just for checking the index exist we can use the existing APIs for such cases.

Other common case I see is checking for blocks (e.g., ClusterBlockLevel.WRITE). I'm sure there's a faster way of checking that. (On Indices API.)

@dbwiddis
Copy link
Member Author

It's possible the ClusterService call gives information on system indices not available via the Index API, so there may be a need for some middle ground not currently covered by the APIs.

@dbwiddis
Copy link
Member Author

dbwiddis commented Jun 1, 2023

Needed methods:

                String[] concreteIndices = indexNameExpressionResolver
                    .concreteIndexNames(
                        sdkClusterService.state(),
                        IndicesOptions.lenientExpandOpen(),
                        sourceIndices.toArray(new String[0])
                    );
  • state().metadata().hasIndex(indexName)
  • state().metadata().index(resultIndex)
  • state().metadata().hasAlias(alias)
  • state().metadata().indices().containsKey(key)
  • state().metadata().getIndicesLookup().get(indexOrAliasName).getIndices()
  • state().getNodes().isLocalNodeElectedMaster()
  • state().getMetadata().indices().get(concreteIndex)
  • state().getRoutingTable().hasIndex(indexName)
  • state().getRoutingTable().index(indexOrAliasName)
  • state().nodes()
  • state().blocks().globalBlockedException(level)
  • state().getClusterName()

@dbwiddis
Copy link
Member Author

dbwiddis commented Jun 1, 2023

More research notes:

  • The getters have two forms (getMetadata() delegates to metadata(), getNodes() to nodes(), getRoutingTable() to routingTable())
  • hasIndex(name) is equivalent to testing whether index(name) is not null and has matching UUID
  • There are REST requests for many of these

@joshpalis
Copy link
Member

joshpalis commented Jun 1, 2023

Did some digging into the different ways AD Extension is using the cluster state to determine if an index exists, currently there are three :

  1. state().getRoutingTable().hasIndex(indexName)

Used primarily in AnomalyDetectionIndices to determine if system indices exist in the cluster before creating it. For example

  1. state().metadata().hasIndex(indexName)

Used only in AnomalyDetectionIndices here to determine if the custom result index created by the user exists in the cluster. This is essentially the same as [3] since internally it just calls indices().containsKey(indexName)

  1. state().metadata().indices().containsKey(indexName)

Used to determine if the Anomaly Detector Job Index / Anomaly Detector Index has been created prior to querying for entries within those indices. For example.

From this exploration, it seems that there is no difference in using the metadata or routingtable of the cluster state to determine if an index exists (system index or otherwise). We can replace all instances of these three use cases of the cluster state with the indices exists API documented here

@dbwiddis
Copy link
Member Author

dbwiddis commented Jun 1, 2023

Did some digging into the different ways AD Extension is using the cluster state to determine if an index exists

I'm curious if that differentiates between index and alias? i.e., if an index exists and has an alias, does index(alias) return the index? I think that meets our needs as the index/alias checks in AD code are always back-to-back and can be reduced to one.

Other APIs looks pretty simple...

  • cluster name can come from GET _cluster/health or GET _cluster/stats (the former is probably faster)
  • nodes is probably GET /_nodes (could look at usages to filter by type but probably not a big deal)
  • blocks is in cluster settings (cluster.blocks.read_only, etc.)

@dbwiddis dbwiddis assigned joshpalis and unassigned dbwiddis Jun 1, 2023
@joshpalis
Copy link
Member

I'm curious if that differentiates between index and alias? i.e., if an index exists and has an alias, does index(alias) return the index?

I don't think this is the case, as the metadata class also has an hasAlias here to see if it exists

@joshpalis
Copy link
Member

joshpalis commented Jun 5, 2023

Next set of cluster state calls that I shall handle focuses on retrieving index level metadata/routing tables :

The following both return the IndexMetaData of the given index, as @dbwiddis mentioned, getMetaData() just delegates to metadata() and index() just calls indices().get(indexName). Both calls are effectively the same.

  • state().metadata().index(indexName)

Used in AnomalyDetectionIndices here to validate if the user-created custom anomaly result index is valid. Internally this just retrieves the index mapping so this call can be replaced with the GetIndex API documented here which contains the mapping (if any) of an index as part of the response.

Additionally used in IndexUtils here to create an object of ClusterIndexHealth which utilizes the IndexMetaData to retrieve the index name, number of shards and number of replicas. Ultimately just used to retrieve the health status code of the index

  • state().getMetadata().indices().get(concreteIndex)

Used in AnomalyDetectionIndices here to determine if the given index should be updated. This also just retrieves the index mapping from the IndexMetaData, which could be replaced with the GetIndex API

The following returns a list of IndexMetaData given an alias :

  • state().metadata().getIndicesLookup().get(indexOrAliasName).getIndices()

This call is used in IndexUtils here to determine the index the alias actually points to (if any, unless there are none or more than 1) and identifies this index as the index to determine cluster index health from

The following call returns the RoutingTable of the given index

  • state().getRoutingTable().index(indexOrAliasName) :

Used in IndexUtils to create an object of ClusterIndexHealth, internally used to retrieve all the IndexShardRoutingTables associated with this index. Ultimately just used to retrieve the health status code of the index

TLDR : Can replace calls to the cluster state to retrieve IndexMetaData with the Get Index API to retrieve index mappings. May need to add support to the SDKClient to retrieve IndexMetadata and IndexRoutingTable from a given index for use in determining ClusterIndexHealth, if there is no API that already exists to achieve the same. (Edit, We can retrieve Cluster Index Health status from ClusterHealth API )

@joshpalis
Copy link
Member

Moving onto the following cluster state calls :

  • state().getClusterName()

this call is primarily used as part of the TransportNodesAction response to identify the cluster that the response is coming from. In the case of AD Extension, all the TransportNodeAction requests are directed back to the extension node so the cluster name will always be the same.

@dbwiddis rather than retrieving the cluster name from another call to the ClusterHealth API, we can retrieve the cluster name from the environment settings response after extension initialization as the cluster name is part of this :

{
  "client.type": "node",
  "cluster.name": "opensearch",
  "http.type.default": "netty4",
  "node.attr.shard_indexing_pressure_enabled": "true",
  "node.name": "dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com",
  "opensearch.experimental.feature.extensions.enabled": "true",
  "path.home": "/home/jpalis/repos/integ/OpenSearch/distribution/archives/linux-tar/build/install/opensearch-3.0.0-SNAPSHOT",
  "path.logs": "/home/jpalis/repos/integ/OpenSearch/distribution/archives/linux-tar/build/install/opensearch-3.0.0-SNAPSHOT/logs",
  "transport.type.default": "netty4"
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants