
[BUG] RemoteStoreRestoreIT tests are flaky - Missing cluster-manager, expected nodes #11085

Open

peternied opened this issue Nov 3, 2023 · 10 comments

Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Storage:Remote

@peternied (Member)
Describe the bug

  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithRefreshedDataPrimaryReplicaDown
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithCommittedData
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithRefreshedData
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRemoteTranslogRestoreWithNoDataPostCommit
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreDataOnlyInTranslog
  • org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithCommittedDataPrimaryReplicaDown

These tests are flaky.

Stacktrace

The following stacktrace is consistent across all failures except testRTSRestoreWithRefreshedDataPrimaryReplicaDown:

java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      voting tombstones: []
metadata customs:
   repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}}   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, local, cluster-manager
   {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: DDhrUTuzS0-TkQaTHp03EQ
from_diff: false
meta data version: 0
   coordination_metadata:
      term: 0
      last_committed_config: VotingConfiguration{}
      last_accepted_config: VotingConfiguration{}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX      2,no cluster-manager, blocks METADATA_WRITE
nodes: 
   {node_s1}{GD7uyrgiTxSKXQLlfKgK_A}{EpXKR9vyS6mfjExohbCtsA}{127.0.0.1}{127.0.0.1:39151}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[GD7uyrgiTxSKXQLlfKgK_A][V]
---- unassigned
, cluster uuid: AzELHIGZT6uSizEPPUXcuA [committed: true]
version: 3
state uuid: borCjOpOQFWpk6JLeG0TmQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      last_accepted_config: VotingConfiguration{Kr_DbhdESPyQswOYSNJVQw}
      voting tombstones: []
metadata customs:
   repositories: {"test-remote-store-repo-2":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/SbGDsuuFWS"},"generation":-2,"pending_generation":-1},"test-remote-store-repo":{"type":"fs","settings":{"system_repository":"true","location":"/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreRestoreIT_A166C2386E979736-001/tempDir-002/repos/DVrmjempuo"},"generation":-2,"pending_generation":-1}}   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_s0}{Kr_DbhdESPyQswOYSNJVQw}{f5hbAyrURuOHqIbtC_b8_Q}{127.0.0.1}{127.0.0.1:35393}{m}{shard_indexing_pressure_enabled=true}, cluster-manager
   {node_s2}{QOKIN5VyR06merR9uOI4bw}{QteIuzBmRZWq6-fJZkbdEg}{127.0.0.1}{127.0.0.1:37443}{d}{shard_indexing_pressure_enabled=true}, local
routing_table (version 1):
routing_nodes:
-----node_id[QOKIN5VyR06merR9uOI4bw][V]
---- unassigned
]

From testRTSRestoreWithRefreshedDataPrimaryReplicaDown

java.lang.AssertionError:  inconsistent generation 

To Reproduce
CI - https://build.ci.opensearch.org/job/gradle-check/29522/testReport/

Expected behavior
Tests should always pass.

@linuxpi (Contributor) commented Jan 12, 2024

The most recently reported failure is due to a suite timeout:

java.lang.Exception: Test abandoned because suite timeout was reached.
	at __randomizedtesting.SeedInfo.seed([ADC655A786161FFE]:0)

Found one blocked thread:

 12) Thread[id=4135, name=opensearch[node_s2][generic][T#3], state=BLOCKED, group=TGRP-RemoteStoreRestoreIT]
        at org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
        at org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
        at org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
        at org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
        at org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
        at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
        at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
        at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
        at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
        at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
        at org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
        at org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
        at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
        at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
        at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
        at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

@linuxpi (Contributor) commented Jan 12, 2024

The test was not able to properly terminate node_s2, which lingered for roughly 20 minutes:

[2023-12-22T10:14:00,407][INFO ][o.o.t.InternalTestCluster] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] Closing filtered random node [node_s2] 
[2023-12-22T10:14:00,408][INFO ][o.o.n.Node               ] [testRTSRestoreWithRefreshedDataPrimaryReplicaDown] stopping ...
[2023-12-22T10:14:00,408][INFO ][o.o.c.c.Coordinator      ] [node_s2] cluster-manager node [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{jBjkBSbZSL-3AlK2H7Z0gg}{127.0.0.1}{127.0.0.1:35721}{m}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.transport.NodeDisconnectedException: [node_s0][127.0.0.1:35721][disconnected] disconnected
[2023-12-22T10:14:00,409][INFO ][o.o.c.c.FollowersChecker ] [node_s0] FollowerChecker{discoveryNode={node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{nOLSTsBLR7-smx0D64pDvA}{127.0.0.1
...[truncated 1074993 chars]...
v8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:28,614][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:38,614][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:48,615][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:24:58,617][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:08,617][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:18,618][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:25:28,618][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1

....


[2023-12-22T10:33:28,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:38,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:48,637][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:58,638][WARN ][o.o.c.c.ClusterFormationFailureHelper] [node_s2] cluster-manager not discovered yet: have discovered [{node_s2}{gTu5ZLOMSFCLMpDQTdtNMA}{mEc3_Dv8TB6wroMMHHnStA}{127.0.0.1}{127.0.0.1:38765}{d}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:34693] from hosts providers and [{node_s0}{1JnXbN3XQLO0u-Q7V1Ro2g}{b0Y7f3J-Q5KjKlgMGYW35Q}{127.0.0.1}{127.0.0.1:34693}{m}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 1, last-accepted version 14 in term 1
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] stopped
[2023-12-22T10:33:58,924][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closing ...
[2023-12-22T10:33:58,931][INFO ][o.o.n.Node               ] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] closed
[2023-12-22T10:33:59,438][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] [RemoteStoreRestoreIT#testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown]: cleaning up after test
[2023-12-22T10:33:59,439][INFO ][o.o.r.RemoteStoreRestoreIT] [testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown] after test

@linuxpi (Contributor) commented Jan 12, 2024

Looks like the thread was blocked on:

@Override
public GatedCloseable<SegmentInfos> getSegmentInfosSnapshot() {
    synchronized (engineMutex) {
        if (newEngineReference.get() == null) {
            throw new AlreadyClosedException("engine was closed");
        }
        return newEngineReference.get().getSegmentInfosSnapshot();
    }
}

@linuxpi (Contributor) commented Jan 12, 2024

We terminate the nodes in the order replica then primary, but we only check shard 0. The replica node for shard 0 would have held primaries for other shards, so terminating it leads to relocation, and during relocation we create the read-only engine (code pointer in the previous comment), which is where our locks block the thread.

/**
 * Helper function to test restoring an index having replicas from remote store when all the nodes housing the primary/replica drop.
 * @param numberOfIterations Number of times a refresh/flush should be invoked, followed by indexing some data.
 * @param invokeFlush If true, a flush is invoked. Otherwise, a refresh is invoked.
 * @throws IOException IO Exception.
 */
private void testRestoreFlowBothPrimaryReplicasDown(int numberOfIterations, boolean invokeFlush, int shardCount) throws Exception {
    prepareCluster(1, 2, INDEX_NAME, 1, shardCount);
    Map<String, Long> indexStats = indexData(numberOfIterations, invokeFlush, INDEX_NAME);
    assertEquals(shardCount * 2, getNumShards(INDEX_NAME).totalNumShards);
    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(replicaNodeName(INDEX_NAME)));
    internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNodeName(INDEX_NAME)));
    ensureRed(INDEX_NAME);
    internalCluster().startDataOnlyNodes(2);
    restoreAndVerify(shardCount, 1, indexStats);
}

@linuxpi (Contributor) commented Jan 12, 2024

Index shutdown and replica-to-primary promotion are causing a deadlock:

"opensearch[node_s2][generic][T#3]" ID=4135 BLOCKED on java.lang.Object@3c0fa470 owned by "opensearch[node_s2][indices_shutdown][T#1]" ID=4183
	at app//org.opensearch.index.shard.IndexShard$11.getSegmentInfosSnapshot(IndexShard.java:4768)
	- blocked on java.lang.Object@3c0fa470
	at app//org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5113)
	at app//org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1676)
	at app//org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4690)
	at app//org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4684)
	at app//org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
	at app//org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1862)
	at app//org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1977)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:589)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:546)
	at app//org.opensearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:146)
	at app//org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4812)
	at app//org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:720)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca36238.run(Unknown Source)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4079)
	at app//org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4049)
	at app//org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$33(IndexShard.java:4000)
	at app//org.opensearch.index.shard.IndexShard$$Lambda/0x00007f45aca11738.accept(Unknown Source)
	at app//org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at app//org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
	Locked synchronizers:
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@28675992
	- java.util.concurrent.ThreadPoolExecutor$Worker@1197c3a3
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@57ac517a

  20) Thread[id=4183, name=opensearch[node_s2][indices_shutdown][T#1], state=WAITING, group=TGRP-RemoteStoreRestoreIT]
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
        at java.base/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
        at org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
        at org.opensearch.index.engine.Engine.close(Engine.java:2070)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:81)
        at org.opensearch.index.shard.IndexShard$11.close(IndexShard.java:4785)
        at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:2059)
        at org.opensearch.index.shard.IndexShard.close(IndexShard.java:2003)
        at org.opensearch.index.IndexService.closeShard(IndexService.java:642)
        at org.opensearch.index.IndexService.removeShard(IndexService.java:618)
        at org.opensearch.index.IndexService.close(IndexService.java:389)
        at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1047)
        at org.opensearch.indices.IndicesService.lambda$doStop$3(IndicesService.java:542)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

In the IndexShard.close flow we acquire engineMutex and then try to acquire the write lock on the engine. In the replica-to-primary promotion flow we acquire the read lock on the engine first and then try to acquire engineMutex. The read and write locks are shared via a ReentrantReadWriteLock, so the two opposite lock orderings deadlock.
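As an illustration only, here is a minimal, self-contained Java sketch of that lock-ordering pattern; the names engineMutex, closer, and promoter are stand-ins and this is not the actual IndexShard code:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical reduction of the two lock orderings described above:
//   closer thread:   engineMutex -> engine write lock   (analogue of the IndexShard.close flow)
//   promoter thread: engine read lock -> engineMutex    (analogue of the replica-to-primary promotion flow)
// Once both threads hold their first lock, each blocks on the lock the other holds.
public class LockOrderingDeadlockSketch {

    private static final Object engineMutex = new Object();
    private static final ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread closer = new Thread(() -> {
            synchronized (engineMutex) {            // step 1: take engineMutex
                arriveAndAwait(bothHoldFirstLock);
                engineLock.writeLock().lock();      // step 2: blocks, promoter holds the read lock
                engineLock.writeLock().unlock();
            }
        }, "indices_shutdown");

        Thread promoter = new Thread(() -> {
            engineLock.readLock().lock();           // step 1: take the read lock
            try {
                arriveAndAwait(bothHoldFirstLock);
                synchronized (engineMutex) {        // step 2: blocks, closer holds engineMutex
                }
            } finally {
                engineLock.readLock().unlock();
            }
        }, "generic");

        closer.start();
        promoter.start();
        closer.join(2_000);
        promoter.join(2_000);
        // Both threads are still alive after the join timeout: the classic lock-ordering deadlock.
        System.out.println("closer alive: " + closer.isAlive() + ", promoter alive: " + promoter.isAlive());
    }

    private static void arriveAndAwait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Running the sketch reports both threads still alive, mirroring the BLOCKED/WAITING pair in the thread dump above.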

@linuxpi (Contributor) commented Jan 12, 2024

Created a separate bug to track the deadlock issue, since this bug tracks the other reasons the restore tests are failing:

#11869

@linuxpi (Contributor) commented Feb 5, 2024

I wasn't able to reproduce this even after 1K iterations. Since there have been only 2 occurrences and the last one was almost 2 months ago, I'm closing the issue. Feel free to reopen if you encounter the same issue again.

linuxpi closed this as completed Feb 6, 2024
@shwetathareja (Member) commented:
org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown failed again here - #12252 (comment)

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@57463c3d[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007ff968b34230@56239546]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@294dab55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 13]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at java.base@21.0.2/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base@21.0.2/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:469)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)

@linuxpi (Contributor) commented Apr 22, 2024

All 3 recent failures were for

org.opensearch.remotestore.RemoteStoreRestoreIT.testRTSRestoreWithNoDataPostRefreshPrimaryReplicaDown

Failure trace:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3006, name=opensearch[node_s2][remote_refresh_retry][T#1], state=RUNNABLE, group=TGRP-RemoteStoreRestoreIT]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73:F08AC3EEAC6BC269]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7384c70f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@57463c3d[Wrapped task = [threaded] org.opensearch.index.shard.ReleasableRetryableRefreshListener$$Lambda/0x00007ff968b34230@56239546]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@294dab55[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 13]
	at __randomizedtesting.SeedInfo.seed([816E5637A5EABB73]:0)
	at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
	at java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at java.base@21.0.2/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base@21.0.2/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:469)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:125)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.scheduleRetry(ReleasableRetryableRefreshListener.java:178)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167)
	at app//org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)
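The trace shows the remote_refresh_retry listener attempting to schedule a retry on a scheduler that has already been terminated (pool size = 0) during node shutdown, and OpenSearchAbortPolicy rejecting it. A minimal plain-Java sketch of that failure mode, with illustrative names rather than the actual OpenSearch code:

import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A ScheduledThreadPoolExecutor rejects newly scheduled tasks once it has been shut down;
// in the trace above this surfaces as OpenSearchRejectedExecutionException on the
// remote_refresh_retry thread, which randomizedtesting reports as an uncaught exception.
public class RetryAfterShutdownSketch {
    public static void main(String[] args) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        scheduler.shutdownNow();   // node shutdown terminates the pool before the retry fires

        try {
            // stands in for ReleasableRetryableRefreshListener scheduling its retry
            scheduler.schedule(() -> System.out.println("retry refresh upload"), 1, TimeUnit.SECONDS);
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}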

@linuxpi (Contributor) commented May 1, 2024

I ran around 5000 iterations locally and cannot reproduce this. I will add a trace-logging annotation to this test to get more details from PR build failures to help debug.
