
[BUG] The docs.count does not match between leader and follower #1146

Open
q123dog opened this issue Sep 14, 2023 · 1 comment
Labels
bug Something isn't working

Comments

q123dog commented Sep 14, 2023

What is the bug?
The docs.count does not match between the leader and the follower.

How can one reproduce the bug?
I am using OpenSearch 2.9.0.
Steps to reproduce the behavior:

  1. Start a stress-test program that creates a test index in the leader cluster and writes 10,000,000 docs to it, for example:
    nohup ./opensearch-stress write --opensearch-address "http://{leader_ip:port}" --index-name "es-bulk-0" --bulk-batch-size 1000 --bulk-times 10000 &

  2. Before the stress-test program ends, start the index replication task on the follower cluster, for example:
    curl -XPUT -k -H 'Content-Type: application/json' 'http://{follower_ip:port}/_plugins/_replication/es-bulk-0/_start?pretty' -d '
    {
    "leader_alias": "leader-cluster-opensearch",
    "leader_index": "es-bulk-0"
    }'

  3. After the stress-test program ends, wait for the index replication task to catch up. The docs.count of the leader index is 10,000,000, as expected, but the docs.count of the follower index is always less than 10,000,000 (see the comparison commands after this list).
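
To compare the counts, I check the cat indices API on each cluster; these are roughly the calls, using the endpoints and index name from the steps above:

    # leader: docs.count reaches 10,000,000 once the stress test finishes
    curl -XGET 'http://{leader_ip:port}/_cat/indices/es-bulk-0?v&h=index,docs.count,docs.deleted'
    # follower: docs.count stays lower and docs.deleted is non-zero
    curl -XGET 'http://{follower_ip:port}/_cat/indices/es-bulk-0?v&h=index,docs.count,docs.deleted'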

What is the expected behavior?
If the index replication task is started while the leader index is still being written to, then once writes to the leader index stop and the replication task catches up, the docs.count of the follower index should equal the docs.count of the leader index.

What is your host/environment?

  • OS: Ubuntu 20.04
  • Version: OpenSearch 2.9.0
  • Plugins: cross-cluster replication

Do you have any screenshots?
As far as I know, cross-cluster replication has two stages. In the first stage, the existing data is synchronized: the plugin takes a snapshot of the leader index's segment files, then reads those files and transfers them to the follower cluster. In the second stage, it reads changes from the translog, starting from the local checkpoint, to synchronize incremental data.
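
For reference, I believe the stage the follower is in can be checked with the replication status API (same follower endpoint and index as above; the exact response fields may vary by version):

    # status should show BOOTSTRAPPING during the segment-file copy and SYNCING afterwards
    curl -XGET -k 'http://{follower_ip:port}/_plugins/_replication/es-bulk-0/_status?pretty'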

After the first stage finished, I found that docs.deleted on the follower index was 6561, even though the stress-test program only writes docs without specifying _id and does not perform any delete/update/upsert operations.
(screenshot: IMG_20230914_160627)

After the stress-test program ends, the docs.count of the leader index is 10,000,000, as expected.
(screenshot: IMG_20230914_161352)

After both stages finished, the docs.count of the follower index was 9,993,439, less than the leader's. I executed the refresh and flush APIs, but docs.count was still lower than on the leader, and the difference is exactly the docs.deleted value I observed after the first stage (10,000,000 - 9,993,439 = 6,561).
(screenshot: 20230914_161019)
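
For clarity, by refresh and flush I mean roughly these calls against the follower index (same endpoint and index name as above):

    # make recent operations searchable, then force a Lucene commit
    curl -XPOST 'http://{follower_ip:port}/es-bulk-0/_refresh?pretty'
    curl -XPOST 'http://{follower_ip:port}/es-bulk-0/_flush?pretty'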

Do you have any additional context?
This bug is easy to reproduce: you only need to start the index replication task while data is being bulk-written to the leader index.

I have reproduced this bug many times, with other stress-testing programs as well, and it always appears. I hope I can find help here, thanks.

By the way, if there is no writing to the leader index when the replication task is started, then after the task catches up, the docs.count of the leader index equals that of the follower index.

ankitkala (Member) commented Feb 21, 2024

Can you trigger a refresh on the follower index and then verify the doc count?
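
For example, something like this, using the endpoints and index name from the report:

    curl -XPOST 'http://{follower_ip:port}/es-bulk-0/_refresh?pretty'
    curl -XGET 'http://{follower_ip:port}/_cat/count/es-bulk-0?v'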
