
[BUG][Segment Replication] ReplicationFailedException and ALLOCATION_FAILED #9966

Closed
kksaha opened this issue Sep 11, 2023 · 2 comments · Fixed by #10370
Labels
  • bug (Something isn't working)
  • Indexing:Replication (Issues and PRs related to core replication framework eg segrep)
  • Indexing (Indexing, Bulk Indexing and anything related to indexing)
  • v2.11.0 (Issues and PRs related to version 2.11.0)

Comments

kksaha commented Sep 11, 2023

Describe the bug

Shard failure, reason [replication failure], failure [ReplicationFailedException]
Failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException

Logs:

[2023-09-08T09:28:10,201][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[ylgTCi8VSk-iytumnaGxlg], [R], s[STARTED], a[id=1OR-X-XDTrCMLvXWr8k0sw], message [shard failure, reason [replication failure]], failure [ReplicationFailedException[[ppe-000298][0]: Replication failed on (failed to clean after replication)]; nested: CorruptIndexException[Problem reading index. (resource=/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe)]; nested: NoSuchFileException[/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe]; ], markAsStale [true]]

[2023-09-08T09:28:20,522][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[EDhutdeXT5W5luFLpIF7sw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=5cUw3rGZSbuWLOSrQygkvA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T09:28:10.201Z], failed_attempts[1], delayed=false, details[failed shard on node [ylgTCi8VSk-iytumnaGxlg]: shard failure, reason [replication failure], failure ReplicationFailedException[[ppe-000298][0]: Replication failed on (failed to clean after replication)]; nested: CorruptIndexException[Problem reading index. (resource=/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe)]; nested: NoSuchFileException[/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe]; ], allocation_status[no_attempt]], expected_shard_size[13863289464], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [19241322ms]]; ], markAsStale [true]]

[2023-09-08T09:28:30,607][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[ylgTCi8VSk-iytumnaGxlg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=A7J2Jm_7SAOY2RLIXH-qZA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T09:28:20.522Z], failed_attempts[2], failed_nodes[[EDhutdeXT5W5luFLpIF7sw]], delayed=false, details[failed shard on node [EDhutdeXT5W5luFLpIF7sw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [19241322ms]]; ], allocation_status[no_attempt]], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [20397ms]]; ], markAsStale [true]]

It seems the segment replication event failed with an index corruption exception because a segment file is missing
(NoSuchFileException: "/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe" does not exist),
followed by a ShardLockObtainFailedException on shard 0.

{ "index": "ppe-000298", "shard": 0, "primary": false, "current_state": "unassigned", "unassigned_info": { "reason": "ALLOCATION_FAILED", "at": "2023-09-08T14:12:35.637Z", "failed_allocation_attempts": 5, "details": "failed shard on node [ylgTCi8VSk-iytumnaGxlg]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [17065425ms]]; ", "last_allocation_status": "no_attempt" }, "can_allocate": "no", "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes", "node_allocation_decisions": [ { "node_id": "17zuTMYtQ9KvKKNa7gm0Ig", "node_name": “\data-az1-1", "transport_address": “*.*.*.*:9300", "node_attributes": { "zone": "az1", "shard_indexing_pressure_enabled": "true" }, "node_decision": "no", "deciders": [ { "decider": "max_retry", "decision": "NO", "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T14:12:35.637Z], failed_attempts[5], failed_nodes[[EDhutdeXT5W5luFLpIF7sw, ylgTCi8VSk-iytumnaGxlg]], delayed=false, details[failed shard on node [ylgTCi8VSk-iytumnaGxlg]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [17065425ms]]; ], allocation_status[no_attempt]]]" } ] }

Eventually the replica shard falls too far behind the primary, resulting in significant replication lag.
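On versions that support it, the per-replica lag can be inspected with the `_cat/segment_replication` API. A sketch, assuming localhost:9200 (output columns vary by version):

```sh
# Show segment replication status and lag for the affected index.
curl -s "http://localhost:9200/_cat/segment_replication/ppe-000298?v"
```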

Screenshots
index shard prirep state docs store ip node
ppe-000298 1 p STARTED 33768999 31.4gb 10...* data-az2-3
ppe-000298 1 r STARTED 1412441 1.3gb 10...* data-az1-4
ppe-000298 2 p STARTED 33763101 35.3gb 10...* data-az1-6
ppe-000298 2 r STARTED 5928658 5.1gb 10...* data-az2-2
ppe-000298 0 p STARTED 33758088 30.1gb 10...* data-az2-1
ppe-000298 0 r UNASSIGNED
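The listing above matches the default `_cat/shards` columns and can be reproduced with (host is an assumption):

```sh
# List all shards of the index with the default columns (index shard prirep state docs store ip node).
curl -s "http://localhost:9200/_cat/shards/ppe-000298?v"
```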

Host/Environment (please complete the following information):

  • OS: Linux
  • Version: 2.8.0

We have tried to manually reroute the shard allocation but that didn't help.
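For completeness, the manual retry referenced in the max_retry decider output is typically issued like this (host is an assumption; it only retries allocations that exceeded the retry limit):

```sh
# Reset the failed-allocation counter and retry assigning the unassigned replica.
curl -s -X POST "http://localhost:9200/_cluster/reroute?retry_failed=true&pretty"
```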

@kksaha kksaha added bug Something isn't working untriaged labels Sep 11, 2023
kksaha (Author) commented Sep 15, 2023

Can anyone please advise on how to fix this issue?

@kotwanikunal kotwanikunal added the Indexing Indexing, Bulk Indexing and anything related to indexing label Sep 19, 2023
@anasalkouz anasalkouz added Indexing:Replication Issues and PRs related to core replication framework eg segrep and removed untriaged labels Sep 22, 2023
mch2 (Member) commented Sep 25, 2023

Hi @kksaha, this looks like the same issue reported here.

To fix this on the running cluster, you would need to bounce the node that is trying to allocate the unassigned replica shard (ppe-000298 0 r UNASSIGNED).
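For illustration only, assuming a systemd-managed OpenSearch installation (adjust to your deployment), bouncing the node usually amounts to:

```sh
# Run on the node currently holding the stale shard lock (ylgTCi8VSk-... in this trace).
# Restarting the process releases the in-memory shard lock so the replica can be re-allocated.
sudo systemctl restart opensearch
```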

The cause of the corruption has been fixed in 2.9 with this PR. However, we still need to figure out why the shard was not able to auto-recover and re-allocate to the same node.

From your trace, it looks like the shard lock is still being held by the store https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/store/Store.java#L868C3-L868C3, meaning that after the corruption we weren't able to close the shard, so it cannot be re-created. Working on reproducing this to see what's going on.

@anasalkouz anasalkouz added the v2.11.0 Issues and PRs related to version 2.11.0 label Sep 25, 2023