secondary lost data #154
@baiwfg2 I tried several times with your workload, but I only saw increased latency; I did not find any data loss on the secondaries.
My running command is:
The other two mongod processes are started similarly. The RS config is:
I'm pretty sure I haven't changed the original code, apart from my own comments.
Today I intend to enable verbose logging on the two secondaries and see whether anything is missed.
@baiwfg2, as I've not been able to reproduce it (probably because my machine doesn't have many cores), I'd suggest you try
I'm trying to find a machine with more CPUs to reproduce it.
@baiwfg2 I've run 10 passes with these YCSB parameters on a 32-core machine; each node can burst to 800% CPU or more. I didn't find any mismatch between the primary and the two secondaries.
@baiwfg2 thanks for your persistence. I wrote an auto-run script to reproduce it.
I've found how it happened.
Here you can see: the txn begins first, and only then queries oplogReadTimestamp (allCommitTs) as readTs.

**How to fix?** Easy to fix: just query oplogReadTs before the txn opens (which is when the snapshot is taken), like below.
**Confusions.** @baiwfg2 if you dig into the MongoDB 4.2 series, you can see that official MongoDB has changed the logic here to the latter form. However, in the 4.0.x series the logic follows the former form, and MongoRocks follows mongo-wt here. So does this mean the mongo-wt 4.0 series also has this bug? I'm trying to construct some race conditions on mongo-wt 4.0 to reproduce it.

**How to make this race happen more often?** If you add a random sleep between txn-begin and the oplogReadTs query, the problem reproduces much more frequently; the demo code is like below.
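To make the ordering concrete, here is a small Python model of the race (this is not the MongoRocks or WiredTiger code; all names are illustrative). A commit landing in the window between the snapshot and the oplogReadTs query plays the role of the random sleep:

```python
# Toy model of the secondary's oplog read path. All names are illustrative,
# not the real MongoRocks/WiredTiger APIs.

class Oplog:
    def __init__(self):
        self.entries = set()     # committed oplog entry timestamps
        self.all_commit_ts = 0   # highest timestamp with no commit holes

    def commit(self, ts):
        self.entries.add(ts)
        self.all_commit_ts = ts


def buggy_read(oplog, racing_commit_ts=None):
    """Former order: txn begins (snapshot taken), THEN oplogReadTs is queried."""
    snapshot = set(oplog.entries)        # txn begin: snapshot of committed data
    if racing_commit_ts is not None:
        oplog.commit(racing_commit_ts)   # a commit sneaks into the race window
    read_ts = oplog.all_commit_ts        # readTs now covers entries the snapshot lacks
    return snapshot, read_ts


def fixed_read(oplog, racing_commit_ts=None):
    """Latter order: oplogReadTs is queried BEFORE the txn opens."""
    read_ts = oplog.all_commit_ts        # bound the read first
    if racing_commit_ts is not None:
        oplog.commit(racing_commit_ts)   # same racing commit
    snapshot = set(oplog.entries)        # snapshot includes everything <= read_ts
    return snapshot, read_ts


def lost(snapshot, read_ts):
    """Entries promised by read_ts but invisible in the snapshot are lost."""
    return set(range(1, read_ts + 1)) - snapshot
```

With the former order, a racing commit at ts=2 falls inside readTs but outside the snapshot, so it is skipped forever; with the latter order, readTs stays behind the snapshot and the racing entry is simply picked up on the next pass.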
I found that WiredTiger takes a new snapshot when setting the readTimestamp, which guarantees that the SI view and the timestamp are consistent. I missed that and got this bug.
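Extending the toy model above (again illustrative, not the real WiredTiger API): if setting the read timestamp re-takes the snapshot, the "begin txn, then pick readTs" ordering stays consistent, which is why mongo-wt 4.0 does not lose data here:

```python
# Toy model of the behavior described above: setting the read timestamp
# refreshes the transaction snapshot. Names are illustrative, not the
# real WiredTiger API.

class Store:
    def __init__(self):
        self.entries = set()
        self.all_commit_ts = 0

    def commit(self, ts):
        self.entries.add(ts)
        self.all_commit_ts = ts


class WTTransaction:
    def __init__(self, store):
        self.store = store
        self.snapshot = set(store.entries)       # snapshot at txn begin
        self.read_ts = None

    def set_read_timestamp(self, ts):
        self.read_ts = ts
        self.snapshot = set(self.store.entries)  # snapshot re-taken here
```

Even with a commit racing into the window between txn begin and the read-timestamp query, the refreshed snapshot covers everything up to read_ts, so no entry below read_ts can be invisible.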
Fixed by this
Nice diagnostic work, and thanks for the clear and detailed description.
@agorrod thanks for your attention, Alex :)
Recently I've found a problem with secondaries losing records while benchmarking MongoRocks 4.0.
Loading a small amount of data is fine, and the secondaries get all the data. But when loading a relatively large amount, it's very easy to reproduce the secondaries losing several records. Here are my settings.
YCSB workload
Just execute the YCSB load, and in the end I always find that several records have been lost, detected either by my own Python script (iterating the primary and checking whether each record exists on the secondaries) or by the mongo shell's
count()
or
itcount()
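The comparison step of such a check can be sketched as below (the pymongo wiring for connecting and iterating collections is omitted; the function name and sample ids are hypothetical):

```python
# Sketch of the consistency check described above: given the _ids read
# from the primary and from a secondary, report the records the
# secondary is missing. Illustrative only; not the original script.

def missing_on_secondary(primary_ids, secondary_ids):
    """Return the sorted list of _ids present on the primary but absent
    on the secondary."""
    return sorted(set(primary_ids) - set(secondary_ids))
```

For example, `missing_on_secondary(["user1", "user2", "user3"], ["user1", "user3"])` reports `["user2"]` as lost on that secondary.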
The screenshot above shows that host 50003 has lost one record.
In the mongod (secondary) log files there is the following error, which I'm not sure is relevant:
![image](https://user-images.githubusercontent.com/5157680/79741867-8e782d00-8334-11ea-961a-c7fb75ac6e20.png)
I guess it's not very relevant, because host 50002 also has that error but still got all the data.