secondary lost data #154

Closed
baiwfg2 opened this issue Apr 20, 2020 · 12 comments

baiwfg2 commented Apr 20, 2020

Recently I've found a problem with secondaries losing records while benchmarking MongoRocks 4.0.

Loading a small amount of data is fine, and the secondaries get all of it. But when loading a relatively large amount of data, it's very easy to reproduce a secondary losing several records. Here are my settings.

YCSB workload

recordcount=500000
operationcount=1000
workload=site.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0
updateproportion=0
scanproportion=0
insertproportion=1

requestdistribution=zipfian

fieldlength=1024
fieldcount=5
mongodb.url=mongodb://10.108.103.160:50001/ycsb?w=1

threadcount=16
table=t

Just run the YCSB load; at the end I always find that several records have been lost, detected either by my own Python script (iterating over the primary and checking whether each record exists on the secondaries) or by the mongo shell's count() or itcount().
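
For illustration, a rough sketch of that per-record check, written here against the mongocxx driver rather than my actual Python script (database ycsb and collection t as in the settings above, hosts as in my setup):

    // Sketch only: compare counts, then probe the secondary for every _id found on the primary.
    #include <iostream>
    #include <bsoncxx/builder/basic/document.hpp>
    #include <bsoncxx/builder/basic/kvp.hpp>
    #include <bsoncxx/json.hpp>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>

    int main() {
        using bsoncxx::builder::basic::kvp;
        using bsoncxx::builder::basic::make_document;

        mongocxx::instance inst{};
        // Direct connections to one node each; the secondary needs a secondary read preference.
        mongocxx::client primary{mongocxx::uri{"mongodb://10.108.103.160:50001"}};
        mongocxx::client secondary{
            mongocxx::uri{"mongodb://10.108.117.196:50003/?readPreference=secondaryPreferred"}};

        auto pcoll = primary["ycsb"]["t"];
        auto scoll = secondary["ycsb"]["t"];

        std::cout << "primary count:   " << pcoll.count_documents({}) << std::endl;
        std::cout << "secondary count: " << scoll.count_documents({}) << std::endl;

        // Every _id present on the primary must also exist on the secondary.
        for (auto&& doc : pcoll.find({})) {
            auto filter = make_document(kvp("_id", doc["_id"].get_value()));
            if (!scoll.find_one(filter.view())) {
                std::cout << "missing on secondary: " << bsoncxx::to_json(doc) << std::endl;
            }
        }
        return 0;
    }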

[screenshot]
The screenshot above shows that host 50003 has lost one record.

In the mongod (secondary) log files there is the following error; I'm not sure whether it's relevant:
[screenshot of log error]
I guess it's not very relevant, because host 50002 also shows this error but still has all the data.

wolfkdy self-assigned this Apr 20, 2020

wolfkdy commented Apr 20, 2020

@baiwfg2 I tried several times with your workload and only observed some latency; I did not find any lost data on the secondaries.

  1. Can you post your mongod's configuration file, please?
  2. Can you change fieldlength from 1024 to 100 and check whether it still reproduces? Small values load faster and mostly eliminate the effect of IO.
  3. Are you sure you did not change anything in the mongodb or rocksdb source code?


baiwfg2 commented Apr 21, 2020

My running command is:

taskset -c 1-12 ./mongod-rocks --replSet rs1 --storageEngine rocksdb --dbpath $dbpath --bind_ip 0.0.0.0 --port 50001 --logpath 5k1.log

The other two mongod instances are started similarly.

The RS config is:

rs.initiate({ _id: "rs1", version: 1, members: [ { _id:0,host:"10.108.103.160:50001"},{_id:1,host:"9.34.46.252:50002"},{_id:2,host:"10.108.117.196:50003"} ], settings: { chainingAllowed: false} })

I'm pretty sure I haven't changed the original code, apart from my own comments.


baiwfg2 commented Apr 21, 2020

Today I intend to enable verbose logging on the two secondaries and see if anything is missed.


wolfkdy commented Apr 21, 2020

@baiwfg2, as I've not been able to reproduce it (probably because my machine does not have many cores), I'd suggest you try the following:

  1. Check whether the oplog on each secondary matches the primary's, and if not, which entries are missing.
  2. Add a fresh secondary and let it catch up after the load to the primary has finished, to see whether the newly added node reproduces the problem.
    If 2) never reproduces, this probably means the oplog_read_source misses some entries, which would suggest that the "all_committed" calculation is buggy.
  3. Apply the patch below to totdb and try to reproduce. I believe the problem has no relation to these lines, but we can give it a try.
diff --git a/utilities/transactions/totransaction_db_impl.cc b/utilities/transactions/totransaction_db_impl.cc
index 971768a..81d0ffd 100644
--- a/utilities/transactions/totransaction_db_impl.cc
+++ b/utilities/transactions/totransaction_db_impl.cc
@@ -368,10 +368,6 @@ Status TOTransactionDBImpl::AddCommitQueue(const std::shared_ptr<ATN>& core,
     if (ts < core->first_commit_ts_) {
       return Status::InvalidArgument("commit-ts smaller than first-commit-ts");
     }
-    if (ts < core->commit_ts_) {
-      // in wt3.0, there is no restriction like this one.
-      return Status::InvalidArgument("commit-ts must be monotonic in a txn");
-    }
     core->commit_ts_ = ts;
     return Status::OK();
   }
diff --git a/utilities/transactions/totransaction_impl.cc b/utilities/transactions/totransaction_impl.cc
index 1cd758d..11af88c 100644
--- a/utilities/transactions/totransaction_impl.cc
+++ b/utilities/transactions/totransaction_impl.cc
@@ -88,17 +88,6 @@ Status TOTransactionImpl::SetCommitTimeStamp(const RocksTimeStamp& timestamp) {
     return Status::NotSupported("not allowed to set committs to 0");
   }
 
-  if (core_->commit_ts_set_) {
-    if (core_->commit_ts_ > timestamp) {
-      return Status::NotSupported("commit ts need equal with the pre set");
-    }
-    if (core_->commit_ts_ == timestamp) {
-      return Status::OK();
-    }
-  }
-
-  assert((!core_->commit_ts_set_) || core_->commit_ts_ < timestamp);
-
   // publish commit_ts to global view
   auto s = txn_db_impl_->AddCommitQueue(core_, timestamp);
  4. Keep recordcount unchanged but change fieldlength to 100 when reproducing; this greatly reduces the time for one run.

I'm trying to find a machine with more cpus to reproduce.


wolfkdy commented Apr 21, 2020

@baiwfg2 I've run 10 passes with these YCSB parameters on a 32-core machine; each node can burst to 800% CPU or more. I didn't find any mismatch between the primary and the two secondaries.
I'll try to run more rounds.
If you have tried the suggestions I listed above, please post your results here.


baiwfg2 commented Apr 21, 2020

I've run the table inserts 4 times, and in the first 3 runs the secondaries got all the data. In the 4th run I increased recordcount to 2,000,000, and one of the secondaries (9.34.46.252:50002 in my case) ended up missing two records, even though I waited long enough for it to catch up.
[screenshot]


wolfkdy commented Apr 21, 2020

@baiwfg2 thanks for your persistence. I wrote a script to run the reproduction automatically.
After running it for hundreds of rounds, the problem happened.
I will dig into it.


wolfkdy commented Apr 24, 2020

I've figured out how it happened.

        switch (_timestampReadSource) {
            case ReadSource::kUnset:
            case ReadSource::kNoTimestamp: {
                RocksBeginTxnBlock txnOpen(_db, &_transaction);
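                // NOTE: the RocksDB transaction (and its snapshot) has already begun at
                // this point, before the oplog read timestamp below is queried.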

                if (_isOplogReader) {
                    auto status =
                        txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();
                break;
            }

Here you can see that the txn begins first, and only then queries oplogReadTimestamp (allCommitTs) as its readTs.
The problem can be described by the following interleaving:
[A begins] [B commits (ts = 100)] [allCommitTs advances to 100] [A sets readTs = 100] [A reads]
From the timestamp point of view A can see B, but since B commits after A begins, from the SI (snapshot isolation) point of view A cannot see B. In this way, a hole occurs.

how to fix

It is easy to fix: just query the oplogReadTs before the txn is opened (i.e. before the snapshot is taken), like below:

                Timestamp readTs;
                if (_isOplogReader) {
                    readTs = Timestamp(_oplogManager->getOplogReadTimestamp());
                }
                RocksBeginTxnBlock txnOpen(_db, &_transaction);

                if (_isOplogReader) {
                    auto status =
                        txnOpen.setTimestamp(readTs,
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();

confusions

@baiwfg2 if you dig into the mongodb 4.2 series, you can see that official mongodb has changed the logic here to the latter form; however, in the mongodb 4.0.x series, the logic is still the former one shown above. mongoRocks follows mongo-wt here. So does this mean that the mongo-wt 4.0 series has the same bug? I'm trying to construct the race condition on mongo-wt 4.0 to reproduce it.
If I cannot reproduce it on mongo-wt 4.0, then I must have misunderstood something about wiredtiger.
But if I can reproduce it, it would mean that every mongo 4.0 series deployment has the possibility of losing data.

how to make this race happen more frequently

If you add a random sleep between the txn begin and the query of oplogReadTs, the problem reproduces much more frequently; the demo code is like below:

        switch (_timestampReadSource) {
            case ReadSource::kUnset:
            case ReadSource::kNoTimestamp: {
                RocksBeginTxnBlock txnOpen(_db, &_transaction);

                if (_isOplogReader) {
                    if (rand() % 100 == 1) {
                        usleep(10000);
                    }
                    auto status =
                        txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();
                break;
            }


wolfkdy commented Apr 25, 2020

I found that wiredTiger takes a new snapshot when setting the readTimestamp, which guarantees a consistent view between SI and the timestamp. I missed that and hit this bug.
I'll make totdb behave the same as wt here.
That is the root cause.
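
Roughly, the idea on the totdb side (a sketch against the plain RocksDB API with hypothetical names, not the actual patch) is that setting the read timestamp must also refresh the transaction's snapshot:

    #include <cstdint>
    #include <rocksdb/db.h>
    #include <rocksdb/snapshot.h>
    #include <rocksdb/status.h>

    // Sketch only: minimal state standing in for a TOTransaction's read view.
    struct TxnReadView {
      rocksdb::DB* db = nullptr;
      const rocksdb::Snapshot* snapshot = nullptr;  // taken when the txn begins
      uint64_t read_ts = 0;
    };

    // Hypothetical helper showing the ordering the fix enforces: publish the read
    // timestamp, then re-take the snapshot, so every txn already committed with
    // commit_ts <= read_ts is also visible in the SI snapshot (no hole).
    rocksdb::Status SetReadTimestampAndRefreshSnapshot(TxnReadView* txn, uint64_t ts) {
      txn->read_ts = ts;
      if (txn->snapshot != nullptr) {
        txn->db->ReleaseSnapshot(txn->snapshot);  // drop the snapshot taken at begin
      }
      txn->snapshot = txn->db->GetSnapshot();     // fresh snapshot, consistent with read_ts
      return rocksdb::Status::OK();
    }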


wolfkdy commented Apr 25, 2020

Fixed by commit wolfkdy/rocksdb@5eb8a67.

wolfkdy closed this as completed Apr 25, 2020
wolfkdy added the bug label Apr 25, 2020

agorrod commented Apr 26, 2020

Nice diagnostic work, and thanks for the clear and detailed description.


wolfkdy commented Apr 27, 2020

@agorrod thanks for your attention, Alex :)
