This repository has been archived by the owner on Aug 9, 2022. It is now read-only.

Rest Layer and Async Search Cleanup Management #9

Merged: 33 commits merged into master on Dec 23, 2020

Conversation

@eirsep eirsep (Contributor) commented Dec 14, 2020

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

}

@Override
public void handleException(TransportException e) {


add a stat?

Member

Maybe next phase

protected void doClose() {
ResponseCleanUpScheduler cleanUpScheduler = activeResponseCleanUpScheduler.get();
if (cleanUpScheduler != null) {
cleanUpScheduler.close();


Set cleanUpScheduler to null?

ActionListener.wrap(channel::sendResponse, e -> {
try {
channel.sendResponse(e);
} catch (Exception ex) {


Dangling request. Shouldn't you close the connection in that case or add some retries here?

Member

It would be re-scheduled anyway. The connection would close after the failure response is sent back.


How are you sending the failure response in case of an exception? Does the framework handle it?

Member

ActionListener.wrap(channel::sendResponse, // sends response
    e -> { // sends error
        try {
            channel.sendResponse(e);
        } catch (Exception ex) {
        }
    })

}

@Override
public void offMaster() {


Is it possible that the offMaster call comes after the onMaster call when the master is re-elected to the same node? Even if the calls are interleaved, there will be issues.


Check elastic/elasticsearch@fefb31b. The PR talks about loose guarantees on the sequence order of the onMaster and offMaster calls. Additionally, ES may deprecate the listener in the future.
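As a side note (not part of this PR's changes): one way to sidestep the ordering question is to derive the local node's master status from the cluster state itself. The class and the commented-out hooks below are illustrative assumptions.

import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;

// Illustrative only: infer master transitions from successive cluster states, so the
// behaviour does not depend on the relative ordering of onMaster()/offMaster() callbacks.
public class CleanUpMasterStateListener implements ClusterStateListener {
    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        boolean wasMaster = event.previousState().nodes().isLocalNodeElectedMaster();
        boolean isMaster = event.localNodeMaster();
        if (isMaster && wasMaster == false) {
            // became master: schedule the response cleanup runnable (hypothetical hook)
        } else if (isMaster == false && wasMaster) {
            // stepped down: cancel any scheduled cleanup (hypothetical hook)
        }
    }
}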

public void handleException(TransportException e) {
logger.error(() -> new ParameterizedMessage("Exception executing action {}",
CLEANUP_ACTION_NAME), e);
scheduleNextWakeUp();


The next one is woken up even before the first one ends? How would that work?

ActionListener.wrap(channel::sendResponse, e -> {
try {
channel.sendResponse(e);
} catch (Exception ex) {


How are you sending the failure response in case of an exception? Does the framework handle it?

}

@Override
public void handleException(TransportException e) {


Add stats?

Contributor Author

Currently we only record stats for async searches submitted by the user. AsyncSearchManagementService is an internal component which runs behind the scenes.

@getsaurabh02 getsaurabh02 left a comment

Thanks for the changes. Have left a few comments/clarifications.

@@ -54,6 +57,7 @@ private void setMaxRunningContext(int maxRunningContext) {

public synchronized void putContext(AsyncSearchContextId asyncSearchContextId, AsyncSearchActiveContext asyncSearchContext) {
if (activeContexts.size() >= maxRunningContext) {
contextRejectionEventConsumer.accept(asyncSearchContextId);


Why is this consumer needed? For cleanup?

Contributor Author

The consumer is required for the async search throttled count stat at the node level. This is harnessed in our stats API.
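Roughly speaking (names below are assumptions for illustration, not this PR's exact code), the consumer just bumps a node-level counter that the stats API reads later:

import java.util.function.Consumer;

// Illustrative only: the rejection consumer increments a node-level CounterMetric
// that backs the async_search_rejected stat.
Consumer<AsyncSearchContextId> contextRejectionEventConsumer =
        contextId -> countStatsHolder.rejectedAsyncSearchCount.inc();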

@@ -58,7 +59,7 @@ public AsyncSearchContextPermits(AsyncSearchContextId asyncSearchContextId, Thre
this.semaphore = new Semaphore(TOTAL_PERMITS, true);
}

private Releasable acquirePermits(int permits, TimeValue timeout, final String details) throws TimeoutException {


Reason for making it a runtime exception?

Contributor Author

We throw ElasticsearchTimeoutException here. That's serializable over the wire.


Got it. So IllegalStateException is no longer expected here?
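For context, a minimal sketch of what the acquire path might look like with an unchecked, wire-serializable timeout exception; this is an assumption-laden illustration rather than the PR's exact code.

// Illustrative only: acquire permits with a timeout and throw the wire-serializable
// ElasticsearchTimeoutException instead of a checked TimeoutException.
private Releasable acquirePermits(int permits, TimeValue timeout, final String details) {
    try {
        if (semaphore.tryAcquire(permits, timeout.millis(), TimeUnit.MILLISECONDS)) {
            return Releasables.releaseOnce(() -> semaphore.release(permits));
        }
        throw new ElasticsearchTimeoutException(
                "obtaining context lock timed out after " + timeout + " for [" + details + "]");
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("interrupted while acquiring permits for [" + details + "]", e);
    }
}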

Comment on lines +77 to +78
public static final BackoffPolicy STORE_BACKOFF_POLICY =
BackoffPolicy.exponentialBackoff(timeValueMillis(250), 14);


Is the reason to increase the delay & retries based on some test run data?

@eirsep eirsep (Contributor Author) Dec 22, 2020

This is the backoff policy used when saving a search response fails. The total wait time intended is 600000 milliseconds, i.e. 10 minutes. We had this value earlier but it got changed inadvertently; reverted it back to what it was earlier, i.e. BackoffPolicy.exponentialBackoff(timeValueMillis(250), 14).


Yeah, I meant why 10 minutes. Do we think it is just a good default to start with, or does it have backing from code/tests?


Is there a reason for not adding jitter?
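For readers following the backoff discussion above, a hedged sketch of how such a policy is typically consumed: the policy yields an iterator of delays that drives rescheduling of the store attempt. storeResponse(...) and the executor name are assumptions, not this PR's code.

// Illustrative only: consume the backoff policy's delay sequence to reschedule a
// failed store attempt until the delays are exhausted.
private void retryStore(AsyncSearchResponse response, Iterator<TimeValue> backoff) {
    try {
        storeResponse(response); // hypothetical helper that persists the response
    } catch (EsRejectedExecutionException e) {
        if (backoff.hasNext()) {
            threadPool.schedule(() -> retryStore(response, backoff), backoff.next(), ThreadPool.Names.GENERIC);
        } else {
            logger.error("failed to store async search response after exhausting retries", e);
        }
    }
}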

Comment on lines +294 to +295
if (((e instanceof EsRejectedExecutionException || e instanceof ClusterBlockException
|| e instanceof NoShardAvailableActionException) == false) || backoff.hasNext() == false) {


Is this the complete list of non-retriable exceptions? For instance, why ClusterBlockException here and not CircuitBreakingException? Can we not use ElasticsearchException directly to save on the type checks and cover all runtime exceptions?

Member

Good point, we don't have an exhaustive list of retryable exceptions. If we run into a CircuitBreakingException it's ideal we back out and clean up the response. Will revisit this later.
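A possible helper capturing the decision in the quoted snippet (purely illustrative; the PR inlines this check): retry only for the listed transient exceptions and only while backoff delays remain.

// Illustrative only: mirrors the quoted condition in positive form.
private boolean shouldRetry(Exception e, Iterator<TimeValue> backoff) {
    boolean retriable = e instanceof EsRejectedExecutionException
            || e instanceof ClusterBlockException
            || e instanceof NoShardAvailableActionException;
    return retriable && backoff.hasNext();
}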

}

private void performCleanUpAction(AsyncSearchCleanUpRequest request, ActionListener<AcknowledgedResponse> listener) {
asyncSearchPersistenceService.deleteExpiredResponses(listener, request.absoluteTimeInMillis);


Do we really need to care and pass the absoluteTimeInMillis from here?

Contributor Author

Yes. deleteByQuery runs a search query saying expirationTime < GIVEN_TIME to fetch the docs to delete.
We are simply passing the GIVEN_TIME parameter in the request.


Yeah, my point is: why the given time and not the CURRENT_TIME (when the query is executing on the node)? It would still make sense (and be efficient) to clean up expired records based on the current time. Why do we even expect a time param in the request for cleanup here? Am I missing something?
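To make the mechanics concrete, a hedged sketch of the kind of delete-by-query being discussed; the index and field names are assumptions for illustration, not necessarily the plugin's.

// Illustrative only: delete persisted responses whose expiration time is before the
// timestamp carried in the cleanup request. Index and field names are assumed.
DeleteByQueryRequest deleteRequest = new DeleteByQueryRequest(".asynchronous_search_response");
deleteRequest.setQuery(QueryBuilders.rangeQuery("expiration_time_millis").lt(request.absoluteTimeInMillis));
client.execute(DeleteByQueryAction.INSTANCE, deleteRequest, ActionListener.wrap(
        r -> logger.debug("deleted [{}] expired async search responses", r.getDeleted()),
        listener::onFailure));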

Comment on lines +143 to +145
final ResponseCleanUpAndRescheduleRunnable newRunnable = new ResponseCleanUpAndRescheduleRunnable();
activeResponseCleanUpRunnable.set(newRunnable);
threadPool.scheduleUnlessShuttingDown(responseCleanUpInterval, RESPONSE_CLEANUP_SCHEDULING_EXECUTOR, newRunnable);


Is the ResponseCleanUpAndRescheduleRunnable first scheduled only after a clusterChanged event? Also, since we schedule the next one in onAfter, could it create a duplicate schedule if the master changes?

Possibly the activeResponseCleanUpRunnable check below isn't enough.

Member

It's guarded by

@Override
        protected void doRun() {
            if (this == activeResponseCleanUpRunnable.get()) {
                super.doRun();
            } else {
                logger.trace("master changed, scheduled cleanup job is stale");
            }
        }

@getsaurabh02 getsaurabh02 Dec 22, 2020

Yeah, I was referring to the race between activeResponseCleanUpRunnable being set to null (on the old master) and the new master scheduling on the clusterChanged event, since the processing of the event on the two nodes is disjoint and cannot be made linear. However, it seems a rare case and can be thought through if it really has a downside later on.

Comment on lines +216 to +218
transportService.sendRequest(randomNode, CLEANUP_ACTION_NAME,
new AsyncSearchCleanUpRequest(threadPool.absoluteTimeInMillis()),
new TransportResponseHandler<AcknowledgedResponse>() {


Wondering if sub-tasks can also be cleaned up by scheduling a self-cleanup per node, instead of transport response handling? For example, if it has access to ContextsToReap per node.

Member

This is the DELETE-BY-QUERY persisted-response cleanup action, which is distributed.


If each response is modelled as a single record and is expected to be present in a single shard, and not shared across multiple shards (nodes), could local cleanup schedules on the nodes save the distributed action? I understand modelling that as a separate action might be tedious and we might just be better off with DELETE-BY-QUERY.

Comment on lines +117 to +118
return acknowledged ? OK : NOT_FOUND;
}
Member

Let's have a NOT_AVAILABLE maybe.

@eirsep eirsep (Contributor Author) Dec 22, 2020

NOT_AVAILABLE is not a valid RestStatus enum value

Comment on lines 77 to 81
private static final String RUNNING = "async_search_running_current";
private static final String PERSISTED = "async_search_persisted";
private static final String FAILED = "async_search_failed";
private static final String COMPLETED = "async_search_completed";
private static final String REJECTED = "async_search_rejected";
Member

asynchronous?

Comment on lines +33 to +36
@Override
public void onContextPersisted(AsyncSearchContextId asyncSearchContextId) {
countStatsHolder.persistedAsyncSearchCount.inc();
}
Member

Let's add onContextPersistFailed too.
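The suggested counterpart to the quoted hook might look like this (counter field name assumed for illustration):

// Illustrative only: count persistence failures alongside successful persists.
@Override
public void onContextPersistFailed(AsyncSearchContextId asyncSearchContextId) {
    countStatsHolder.persistFailedAsyncSearchCount.inc();
}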

import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.common.metrics.CounterMetric;

public class InternalAsyncSearchStats implements AsyncSearchContextListener {
Member

What happens when a context gets closed/deleted? Do the running stats go for a toss?

Contributor Author

Added onRunningContextClosed() to the context listener as a hook to decrement the running async searches count.
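A hedged sketch of what that hook could look like (field name assumed, not this PR's exact code):

// Illustrative only: decrement the running-search counter when an active context is
// closed or deleted before it completes, so the running stat does not drift.
@Override
public void onRunningContextClosed(AsyncSearchContextId asyncSearchContextId) {
    countStatsHolder.runningAsyncSearchCount.dec();
}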

public static final Setting<Integer> MAX_RUNNING_CONTEXT = Setting.intSetting(
"async_search.max_running_context", 100, 0, Setting.Property.Dynamic, Setting.Property.NodeScope);
"async_search.max_running_context", 100, 10, Setting.Property.Dynamic, Setting.Property.NodeScope);
Member

Let's revert to 0; this would help us turn the feature off.

return builder;
}

static final class Fields {


Why have such long field names? Won't you have a section for async search and then add status, current, etc. directly there? Asking because long field names may take longer to serialize and deserialize, depending on the algorithm.

Contributor Author

Trimming the asynchronous_search_ prefix from the individual stats.
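Purely as an illustration of the trimmed naming (the exact names in the PR may differ), the stat fields would drop the shared prefix and sit under a single asynchronous_search_stats section:

// Illustrative only: prefix-free stat field names, nested under one stats section.
static final class Fields {
    private static final String RUNNING = "running_current";
    private static final String PERSISTED = "persisted";
    private static final String FAILED = "failed";
    private static final String COMPLETED = "completed";
    private static final String REJECTED = "rejected";
}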

}

static final class Fields {
private static final String ASYNC_SEARCH_STATUS = "async_search_stats";


status? (the constant is named ASYNC_SEARCH_STATUS but its value is "async_search_stats")

Contributor Author

Renamed the variable to ASYNC_SEARCH_STATS.

… running search when abruptly closed. rejection default change
@Bukhtawar Bukhtawar (Member) left a comment

Thanks for the changes. Itiyama was fine with the set of changes. Approving to unblock

@Bukhtawar Bukhtawar merged commit e94e968 into master Dec 23, 2020