secondary lost data #154

Closed
baiwfg2 opened this issue Apr 20, 2020 · 12 comments

baiwfg2 commented Apr 20, 2020

Recently I've found a problem with secondaries losing records while benchmarking MongoRocks 4.0.

Loading a small amount of data is fine, and the secondaries get all of it. But when loading a relatively large amount of data, it's very easy to reproduce a secondary losing several records. Here are my settings.

YCSB workload

recordcount=500000
operationcount=1000
workload=site.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0
updateproportion=0
scanproportion=0
insertproportion=1

requestdistribution=zipfian

fieldlength=1024
fieldcount=5
mongodb.url=mongodb://10.108.103.160:50001/ycsb?w=1

threadcount=16
table=t

Just run the YCSB load; at the end I always find that several records have been lost, detected either by my own Python script (iterating over the primary and checking whether each record exists on the secondaries) or by the mongo shell's count() or itcount().
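
For illustration, a rough sketch of that per-record check, written here against the mongocxx driver rather than my actual Python script (database ycsb and collection t as in the settings above, hosts as in my setup):

    // Sketch only: compare counts, then probe the secondary for every _id found on the primary.
    #include <iostream>
    #include <bsoncxx/builder/basic/document.hpp>
    #include <bsoncxx/builder/basic/kvp.hpp>
    #include <bsoncxx/json.hpp>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>

    int main() {
        using bsoncxx::builder::basic::kvp;
        using bsoncxx::builder::basic::make_document;

        mongocxx::instance inst{};
        // Direct connections to one node each; the secondary needs a secondary read preference.
        mongocxx::client primary{mongocxx::uri{"mongodb://10.108.103.160:50001"}};
        mongocxx::client secondary{
            mongocxx::uri{"mongodb://10.108.117.196:50003/?readPreference=secondaryPreferred"}};

        auto pcoll = primary["ycsb"]["t"];
        auto scoll = secondary["ycsb"]["t"];

        std::cout << "primary count:   " << pcoll.count_documents({}) << std::endl;
        std::cout << "secondary count: " << scoll.count_documents({}) << std::endl;

        // Every _id present on the primary must also exist on the secondary.
        for (auto&& doc : pcoll.find({})) {
            auto filter = make_document(kvp("_id", doc["_id"].get_value()));
            if (!scoll.find_one(filter.view())) {
                std::cout << "missing on secondary: " << bsoncxx::to_json(doc) << std::endl;
            }
        }
        return 0;
    }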

[screenshot]
The screenshot above shows that host 50003 has lost one record.

In the mongod (secondary) log files there is the following error; I'm not sure whether it's relevant:
[screenshot of log error]
I guess it's not very relevant, because host 50002 also shows this error but still has all the data.

wolfkdy self-assigned this Apr 20, 2020

wolfkdy commented Apr 20, 2020

@baiwfg2 I tried several times with your workload and only observed some latency; I did not find any lost data on the secondaries.

  1. Can you post your mongod's configuration file, please?
  2. Can you change fieldlength from 1024 to 100 and check whether it still reproduces? Small values load faster and mostly eliminate the effect of IO.
  3. Are you sure you did not change anything in the mongodb or rocksdb source code?


baiwfg2 commented Apr 21, 2020

My running command is:

taskset -c 1-12 ./mongod-rocks --replSet rs1 --storageEngine rocksdb --dbpath $dbpath --bind_ip 0.0.0.0 --port 50001 --logpath 5k1.log

The other two mongod instances are started similarly.

The RS config is:

rs.initiate({ _id: "rs1", version: 1, members: [ { _id:0,host:"10.108.103.160:50001"},{_id:1,host:"9.34.46.252:50002"},{_id:2,host:"10.108.117.196:50003"} ], settings: { chainingAllowed: false} })

I'm pretty sure I haven't changed the original code, apart from my own comments.


baiwfg2 commented Apr 21, 2020

Today I intend to enable verbose logging on the two secondaries and see if anything is missed.


wolfkdy commented Apr 21, 2020

@baiwfg2, as I've not been able to reproduce it (probably because my machine does not have many cores), I'd suggest you try the following:

  1. Check whether the oplog on each secondary matches the primary's, and if not, which entries are missing.
  2. Add a fresh secondary and let it catch up after the load to the primary has finished, to see whether the newly added node reproduces the problem.
    If 2) never reproduces, this probably means the oplog_read_source misses some entries, which would suggest that the "all_committed" calculation is buggy.
  3. Apply the patch below to totdb and try to reproduce. I believe the problem has no relation to these lines, but we can give it a try.
diff --git a/utilities/transactions/totransaction_db_impl.cc b/utilities/transactions/totransaction_db_impl.cc
index 971768a..81d0ffd 100644
--- a/utilities/transactions/totransaction_db_impl.cc
+++ b/utilities/transactions/totransaction_db_impl.cc
@@ -368,10 +368,6 @@ Status TOTransactionDBImpl::AddCommitQueue(const std::shared_ptr<ATN>& core,
     if (ts < core->first_commit_ts_) {
       return Status::InvalidArgument("commit-ts smaller than first-commit-ts");
     }
-    if (ts < core->commit_ts_) {
-      // in wt3.0, there is no restriction like this one.
-      return Status::InvalidArgument("commit-ts must be monotonic in a txn");
-    }
     core->commit_ts_ = ts;
     return Status::OK();
   }
diff --git a/utilities/transactions/totransaction_impl.cc b/utilities/transactions/totransaction_impl.cc
index 1cd758d..11af88c 100644
--- a/utilities/transactions/totransaction_impl.cc
+++ b/utilities/transactions/totransaction_impl.cc
@@ -88,17 +88,6 @@ Status TOTransactionImpl::SetCommitTimeStamp(const RocksTimeStamp& timestamp) {
     return Status::NotSupported("not allowed to set committs to 0");
   }
 
-  if (core_->commit_ts_set_) {
-    if (core_->commit_ts_ > timestamp) {
-      return Status::NotSupported("commit ts need equal with the pre set");
-    }
-    if (core_->commit_ts_ == timestamp) {
-      return Status::OK();
-    }
-  }
-
-  assert((!core_->commit_ts_set_) || core_->commit_ts_ < timestamp);
-
   // publish commit_ts to global view
   auto s = txn_db_impl_->AddCommitQueue(core_, timestamp);
  4. Keep recordcount unchanged but change fieldlength to 100 when reproducing; this greatly reduces the time for one run.

I'm trying to find a machine with more cpus to reproduce.


wolfkdy commented Apr 21, 2020

@baiwfg2 I've run 10 passes with these YCSB parameters on a 32-core machine; each node can burst to 800% CPU or more. I didn't find any mismatch between the primary and the two secondaries.
I'll try to run more rounds.
If you have tried the suggestions I listed above, please post your results here.


baiwfg2 commented Apr 21, 2020

I've run the table inserts 4 times, and in the first 3 runs the secondaries got all the data. In the 4th run I increased recordcount to 2,000,000, and one of the secondaries (9.34.46.252:50002 in my case) ended up missing two records, even though I waited long enough for it to catch up.
[screenshot]


wolfkdy commented Apr 21, 2020

@baiwfg2 thanks for your persistence. I wrote a script to run the reproduction automatically.
After running it for hundreds of rounds, the problem happened.
I will dig into it.


wolfkdy commented Apr 24, 2020

I've figured out how it happened.

        switch (_timestampReadSource) {
            case ReadSource::kUnset:
            case ReadSource::kNoTimestamp: {
                RocksBeginTxnBlock txnOpen(_db, &_transaction);
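                // NOTE: the RocksDB transaction (and its snapshot) has already begun at
                // this point, before the oplog read timestamp below is queried.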

                if (_isOplogReader) {
                    auto status =
                        txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();
                break;
            }

Here you can see that the txn begins first, and only then queries oplogReadTimestamp (allCommitTs) as its readTs.
The problem can be described by the following interleaving:
[A begins] [B commits (ts = 100)] [allCommitTs advances to 100] [A sets readTs = 100] [A reads]
From the timestamp point of view A can see B, but since B commits after A begins, from the SI (snapshot isolation) point of view A cannot see B. In this way, a hole occurs.

how to fix

It is easy to fix: just query the oplogReadTs before the txn is opened (i.e. before the snapshot is taken), like below:

                Timestamp readTs;
                if (_isOplogReader) {
                    readTs = Timestamp(_oplogManager->getOplogReadTimestamp());
                }
                RocksBeginTxnBlock txnOpen(_db, &_transaction);

                if (_isOplogReader) {
                    auto status =
                        txnOpen.setTimestamp(readTs,
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();

confusions

@baiwfg2 if you dig into the mongodb 4.2 series, you can see that official mongodb has changed the logic here to the latter form; however, in the mongodb 4.0.x series, the logic is still the former one shown above. mongoRocks follows mongo-wt here. So does this mean that the mongo-wt 4.0 series has the same bug? I'm trying to construct the race condition on mongo-wt 4.0 to reproduce it.
If I cannot reproduce it on mongo-wt 4.0, then I must have misunderstood something about wiredtiger.
But if I can reproduce it, it would mean that every mongo 4.0 series deployment has the possibility of losing data.

how to make this race happen more frequently

If you add a random sleep between the txn begin and the query of oplogReadTs, the problem reproduces much more frequently; the demo code is like below:

        switch (_timestampReadSource) {
            case ReadSource::kUnset:
            case ReadSource::kNoTimestamp: {
                RocksBeginTxnBlock txnOpen(_db, &_transaction);

                if (_isOplogReader) {
                    if (rand() % 100 == 1) {
                        usleep(10000);
                    }
                    auto status =
                        txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();
                break;
            }


wolfkdy commented Apr 25, 2020

I found that wiredTiger takes a new snapshot when setting the readTimestamp, which guarantees a consistent view between SI and the timestamp. I missed that and hit this bug.
I'll make totdb behave the same as wt here.
That is the root cause.
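
Roughly, the idea on the totdb side (a sketch against the plain RocksDB API with hypothetical names, not the actual patch) is that setting the read timestamp must also refresh the transaction's snapshot:

    #include <cstdint>
    #include <rocksdb/db.h>
    #include <rocksdb/snapshot.h>
    #include <rocksdb/status.h>

    // Sketch only: minimal state standing in for a TOTransaction's read view.
    struct TxnReadView {
      rocksdb::DB* db = nullptr;
      const rocksdb::Snapshot* snapshot = nullptr;  // taken when the txn begins
      uint64_t read_ts = 0;
    };

    // Hypothetical helper showing the ordering the fix enforces: publish the read
    // timestamp, then re-take the snapshot, so every txn already committed with
    // commit_ts <= read_ts is also visible in the SI snapshot (no hole).
    rocksdb::Status SetReadTimestampAndRefreshSnapshot(TxnReadView* txn, uint64_t ts) {
      txn->read_ts = ts;
      if (txn->snapshot != nullptr) {
        txn->db->ReleaseSnapshot(txn->snapshot);  // drop the snapshot taken at begin
      }
      txn->snapshot = txn->db->GetSnapshot();     // fresh snapshot, consistent with read_ts
      return rocksdb::Status::OK();
    }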


wolfkdy commented Apr 25, 2020

Fixed by commit wolfkdy/rocksdb@5eb8a67.

wolfkdy closed this as completed Apr 25, 2020
wolfkdy added the bug label Apr 25, 2020

agorrod commented Apr 26, 2020

Nice diagnostic work, and thanks for the clear and detailed description.


wolfkdy commented Apr 27, 2020

@agorrod thanks for your attention, Alex :)
