
Badger support as a pure-go alternative to RocksDB #18

Closed
lni opened this issue Jan 17, 2019 · 28 comments
Labels
enhancement New feature or request

Comments

@lni
Owner

lni commented Jan 17, 2019

After discussions in #4, I've added some experimental code to make Badger work with Dragonboat. The code is available in the following branch -

https://github.com/lni/dragonboat/tree/badger-logdb-experimental

As mentioned above, this is experimental code used to see how Badger works. It should never be used in production.

I'd like to point out that there are outstanding issues preventing Badger support from being added -

There are probably more; I will update this list as they are identified.

@lni lni added the enhancement New feature or request label Jan 17, 2019
@manishrjain

There you go! Fixed both the issues. Not that I typically do them on request, but these were small changes.

@lni
Owner Author

lni commented Jan 20, 2019

Thanks @manishrjain for the quick response.

I had a quick look at the changes and the related existing code. It seems to me that the SST checksum is per SST file and it is checked when OpenTable is called. When the load option is LoadToRAM, it seems to be fine. When the load option is FileIO, the content loaded into mmap is discarded and all blocks are read from the on-disk file again without checksum validation. This brings in the risk of TOCTOU errors, say bits flipped after the initial checksum validation but before the reads invoked by block(). A similar issue exists for the MemoryMap mode.

Cheers!

@manishrjain

Hey @lni ,

All the accesses to SSTables after the initial load are to specific blocks within the file. Badger wouldn't read the full file again. So, it can't determine if some bits have flipped. Also, periodically reading the files to see if their checksum matches is going to introduce a lot of load on the FS. If you have ideas around what more could be done, feel free to create an issue with references, and we can have a look.

@lni
Owner Author

lni commented Jan 21, 2019

Hi @manishrjain , I had a quick look at RocksDB's code: it checksums each block, and each block is verified against its checksum immediately after being read from disk.

https://github.com/facebook/rocksdb/blob/01013ae7666af78d4b0fc1e1baf4a735a8db5344/table/block_fetcher.cc#L207
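
To make the point concrete, the block-level approach looks roughly like the Go sketch below (illustrative only, assuming a hypothetical layout where each block is followed by a 4-byte CRC32-C trailer; this is not RocksDB's or Badger's actual format). Every block read from disk is verified before its contents are used, regardless of any table-level check done when the file was opened.

```go
package blockread

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
	"os"
)

// castagnoli is the CRC32-C table commonly used for block checksums.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// readBlock reads a single block from the table file and verifies it against
// the 4-byte checksum stored immediately after the block (hypothetical layout).
// The check runs on every read from disk, independent of any table-level check
// performed when the table was opened.
func readBlock(f *os.File, off int64, blockLen int) ([]byte, error) {
	buf := make([]byte, blockLen+4)
	if _, err := f.ReadAt(buf, off); err != nil {
		return nil, err
	}
	data, trailer := buf[:blockLen], buf[blockLen:]
	if crc32.Checksum(data, castagnoli) != binary.LittleEndian.Uint32(trailer) {
		return nil, errors.New("block checksum mismatch: on-disk corruption detected")
	}
	return data, nil
}
```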

@jen20

jen20 commented Jan 21, 2019

Hey @lni! Great job on Dragonboat! It's worth noting that if you run on ZFS the lack of checksums in the log storage database is not an issue, as the filesystem itself makes those checks.

I for one would love to see Badger land as a first-class supported option so as not to involve cgo in a build process, even if there are tradeoffs on various filesystems.

@manishrjain

manishrjain commented Jan 21, 2019

it checksums each block, and each block is verified against its checksum immediately after being read from disk.

I'm not entirely convinced that's the best approach. To know the block boundaries, one must read them from the SSTable. If the SSTable itself has its bits flipped, then those boundaries can be wrong in the first place. I tried a small truncate, and if the SSTable checksum isn't checked before reading the blocks, it leads to panics.
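
For clarity, the table-level check I mean looks roughly like the sketch below (a minimal Go sketch assuming a hypothetical layout where a file-level CRC32 sits in the last 4 bytes; this is not Badger's actual format). It runs once, when the table is opened, and only after it passes can the block boundaries inside the file be trusted.

```go
package tablecheck

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
	"io/ioutil"
)

// verifyTable reads the whole SSTable once at open time and checks a
// file-level CRC32 assumed to be stored in the last 4 bytes. A truncated or
// bit-flipped file fails here instead of producing bogus block boundaries.
func verifyTable(path string) error {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return err
	}
	if len(data) < 4 {
		return errors.New("table file too short")
	}
	body, trailer := data[:len(data)-4], data[len(data)-4:]
	if crc32.ChecksumIEEE(body) != binary.LittleEndian.Uint32(trailer) {
		return errors.New("table checksum mismatch")
	}
	return nil
}
```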

So, I think if a block-level checksum makes sense, it only makes sense on top of a table-level checksum. Even then, every time you access a block that is mmapped, running a checksum would be slow -- not sure how RocksDB (I haven't looked) deals with that. In particular,

  1. Does it allow mmapping the blocks?
  2. If 1, then does it run a checksum for every access to the block? (From the code, it looks like it does.)

@lni
Owner Author

lni commented Jan 22, 2019

Sorry for the long reply.

@jen20 I don't think anyone should expect users of a library to pick any particular filesystem. For the exact same reason, I don't think one can ask library users to always use enterprise SSDs with hardware end-to-end data protection to do checksum verification and skip such disk-IO-related fault tolerance designs in software.

Having cgo in the build process is never a problem, because it can be made almost transparent to users as long as all C/C++ source code is made part of the Go package. The LevelDB based experimental branch I pushed yesterday can already do that. There is no need to install any extra C/C++ library, just one extra fully self-contained Go package with all the LevelDB code and the LevelDB wrapper in it, nothing really different from other pure Go packages. Users won't have to set CGO_CXXFLAGS or CGO_LDFLAGS when building their applications.

Moving forward, RocksDB support will be provided in that way as well. Currently the outstanding issue is compilation time - the Go build tool doesn't parallelize compilation within a package, so it takes several minutes (compared with about 20 seconds for LevelDB's first build) to compile the RocksDB code for the first time. Go issue here: golang/go#9887

I am happy with cgo's performance and I think 95% of users are not going to use some Go tools to debug RocksDB/LevelDB code when using Dragonboat.

I totally agree that Badger is currently the most actively developed Go key-value store and it has good potential. I am happy to use it in my other projects where data integrity is not the key issue, but I won't be able to switch to Badger in Dragonboat in the near future.

Cheers!

@lni
Owner Author

lni commented Jan 22, 2019

@manishrjain I am not sure how a memory-mapped-file approach can efficiently do checksum verification. LMDB is probably the heaviest user of the memory-mapped-file approach among key-value stores; I could be wrong, but it seems to me that it doesn't checksum its data. I contacted LMDB's author about that but couldn't get any reply.

@lni
Owner Author

lni commented Jan 22, 2019

I am closing this issue as the current plan has been updated above in my reply to jen20.

@lni lni closed this as completed Jan 22, 2019
@manishrjain

The performance gains of Badger over RocksDB are not purely because of cgo, but because of design changes, namely the separation of keys from values, which reduces write amplification significantly. The same goes for reads. We have benchmarks for that, and third parties have seen the same effects.

I think you are weighing disk bit flips disproportionately. In those cases, most users would either do a restore from backup, or recreate. 99.99% of users won't have the knowledge or the tools to fix the corrupt bits on disk and continue as if nothing happened.

@lni
Owner Author

lni commented Jan 22, 2019

Hi @manishrjain ,

A team from TiDB is working on bringing the WiscKey idea into a fork of RocksDB. It will be interesting to see its performance gain compared to RocksDB.
https://github.com/pingcap/rocksdb/tree/titan-5.15/utilities/titandb

Sorry but I have concerns when any DB can silently return corrupted data without notifying its users. In the end, it is up to users to decide how much they value data integrity. As a distributed consensus library, what Dragonboat can do is make sure all due diligence is done to minimize its users' exposure to such data corruption risks.

Cheers!

@vtolstov

Sorry for jumping into this discussion (I read the Badger issues and this repo too), but I don't think any DB should prefer speed over correctness.
So +1 for @lni's comments, because users must be notified if data is corrupted. Yes, this can be disabled for performance by the user, but not by default and not silently.

@manishrjain

Sorry but I have concerns when any DB can silently return corrupted data without notifying its users.

RocksDB is a great DB, so feel free to use it. We used it before in Dgraph and it's great.

But, Badger is definitely not how you describe it above. The specific point here is whether every individual block should be checksummed after the table has already been checksummed (I couldn't see in the RocksDB wiki whether they checksum the table, without which a block-level checksum is already ineffective).

Based on the logic you're using, every user of lni has to have ECC memory, because even RocksDB's block checksum implementation does not check if the block in memory has become corrupt due to memory bit flips. Ideally, every key read must verify its contents (irrespective of whether the data was picked from disk or memory) before returning data to the user. Does that then make RocksDB ineligible for lni as well?

@lni
Owner Author

lni commented Jan 22, 2019

The specific point here is whether every individual block should be checksummed after the table has already been checksummed

As clearly mentioned in my earlier reply, "When the load option is LoadToRAM, it seems to be fine". For whatever reason, Badger chooses to read the data again from the disk in its FileIO mode, and that happens an unspecified amount of time after the initial checksum check.

Note that this is not the only reason for my concerns. My concerns arise from overall observations, e.g. a corrupted MANIFEST file is silently handled without notifying the user (e.g. with a panic); see dgraph-io/badger#690.

every user of lni has to have ECC memory, because even RocksDB's block checksum implementation does not check if the block in memory has become corrupt due to memory bit flips.

lni is my name, I guess you are talking about users of my library Dragonboat.

Yes, all Dragonboat users are supposed to run Dragonboat using ECC RAM in production. It is a distributed consensus library typically used on servers. In fact, it is getting pretty hard to find servers not using ECC RAM.

all Amazon EC2 instances use ECC RAM:

Q: Does Amazon EC2 use ECC memory?

In our experience, ECC memory is necessary for server infrastructure, and all the hardware underlying Amazon EC2 uses ECC memory.

https://aws.amazon.com/ec2/faqs/

all Azure instances use ECC RAM:

Microsoft Azure utilizes Error-Correcting Code (ECC) memory throughout our fleet of deployed systems to protect against these kinds of issues.

https://azure.microsoft.com/en-us/blog/microsoft-azure-uses-error-correcting-code-memory-for-enhanced-reliability-and-security/

btw, for serious server applications, using ECC RAM is not enough. DevOps teams are supposed to monitor the frequency of those correctable errors (bit flips); it is a pretty reliable indicator of when a RAM stick should be replaced for data integrity reasons. I do that even on my home workstations and it caught a dying ECC RAM stick last year. Oracle's guideline is posted below:

Replace a DIMM when one of the following events takes place:
...
More than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs.

https://docs.oracle.com/cd/E19150-01/820-4213-11/dimms.html

Ideally, every key read must verify its contents (irrespective of whether the data was picked from disk or memory) before returning data to the user.

Repeatedly checking in-memory data against its checksum is your idea; I don't think it is necessary. Everything read from the disk must be checksum checked. When the program chooses to discard the already verified content in RAM and read it again from disk, the program must checksum check the disk-returned data again. It is plain and simple.

@manishrjain

dgraph-io/badger#690 : That's a single-line change that I'm happy to put in, or you could send out a PR for it (it's all open source in the end). I don't think that's what the discussion is about (or should be about).

If you're going to assume that all your users are on Amazon or Azure, then disk bit flips are also something you should not be concerned about -- because all of these providers (EBS, S3, etc.) do CRCs and multiple data copies to ensure that they'd give you back the data as it was stored. More so than ECC memories can do.

Repeatedly checking in-memory data against its checksum is your idea; I don't think it is necessary.

Repeatedly checking for data on disk is required, but doing the same in RAM isn't? That also implies that any DB which is using mmap is inherently wrong. Where and how do you draw the line is my question. Have you chosen familiarity over reasoning?

@lni
Owner Author

lni commented Jan 23, 2019

I don't think that's what the discussion is about (or should be about).

I have concerns about Badger because 1) there are multiple places where data corruption is not handled, and 2) you are repeatedly trying to justify them. Your reply above is just another good example.

If you're going to assume that all your users are on Amazon or Azure

Why should I be assuming that? Amazon and Azure are the top 2 cloud vendors; by some reports (1) they have a combined market share of 80% of the public cloud market. They are mentioned to show you that servers without ECC RAM are going extinct.

then disk bit flips are also something you should not be concerned about -- because all of these providers (EBS, S3, etc.) do CRCs and multiple data copies to ensure that they'd give you back the data as it was stored. More so than ECC memories can do.

If I run a Badger-based application on Amazon EC2 machines, when Badger is silently returning corrupted data to the caller, how are those "CRCs and multiple data copies" in EBS and S3 going to help? In dgraph-io/badger#690, users are not even notified of detected data corruption; in the SST case being discussed here, when the file access mode is FileIO, the data is not even verified against its checksum.

Repeatedly checking for data on disk is required

Good to see we now agree on something.

That also implies that any DB which is using mmap is inherently wrong. Where and how do you draw the line is my question. Have you chosen familiarity over reasoning?

I didn't draw any line on mmap or the specific mmap-based implementation in Badger. I made it crystal clear that "I am not sure how a memory-mapped-file approach can efficiently do checksum verification".

Although I don't have the solution, the question here is pretty simple - in an mmap-based approach, to avoid repeatedly checking data that is already in memory at some random interval and wasting CPU cycles (while still suffering from data corruption because of the same TOCTOU issue), how can one tell whether the chunk being accessed has just been returned from the disk again and thus requires a checksum check?

When choosing to use mmap in Badger, it is your job to answer the above question.

@manishrjain

If I run a Badger-based application on Amazon EC2 machines, when Badger is silently returning corrupted data to the caller, how are those "CRCs and multiple data copies" in EBS and S3 going to help?

They protect against disk bit flips. Not sure why that wasn't clear in the first round.

Your arguments obviously present an image of Badger which is specifically construed to look worse. I'll just leave things at that.

I am not sure how a memory-mapped-file approach can efficiently do checksum verification

https://github.com/facebook/rocksdb/wiki/IO#memory-mapping

Good luck with your project!

@lni
Owner Author

lni commented Jan 23, 2019

@manishrjain

They protect against disk bit flips. Not sure why that wasn't clear in the first round.

In your blog article introducing Badger (link here), you were using an EC2 i3.large instance's local 475GB NVMe disk. Does that have protection against disk bit flips? Also, why/how are bit flips the only cause of corruption that needs checksum protection? How about software bugs, as described by yourself in dgraph-io/badger#671?

It's not just about some funny bit flips. In dgraph-io/badger#671, you correctly pointed out that SST data needs to be checksum protected so that corruption caused by bugs can be noticed. According to that issue, such corruption actually happened; it is not just some remote possibility in theory!

Your arguments obviously present an image of Badger which is specifically construed to look worse. I'll just leave things at that.

You keep trying to convince people that data corruption is not possible (see above) or won't be noticed by users (see your reply earlier in this thread, where you posted that "99.99% of users won't have the knowledge or the tools to fix the corrupt bits on disk and continue as if nothing happened").

This presents a bad image of Badger. Such assumptions and attitude caused my concerns, not any particular bug.

Let's just stop here. I am sure readers of this thread already have enough input on what is going on and what the attitude towards user systems' reliability & data integrity is on both sides.

Cheers!

@vtolstov

@lni as this is OSS, maybe fork and fix, if the authors think that this is not important?

@lni
Owner Author

lni commented Jan 24, 2019

@vtolstov

Unfortunately I don't have time to maintain such a fork. Full explanations on why Badger is not supported are provided above.

If you grab the latest in master, Dragonboat's LevelDB support has been added. When using LevelDB to store your Raft logs, there is no extra installation step and you don't need to specify CGO_CFLAGS or CGO_LDFLAGS when building your own applications; it is thus nothing really different from a pure-Go approach. More details here:

https://github.com/lni/dragonboat/blob/master/doc/storage.md

@vtolstov

ok thanks!

@ghost

ghost commented Feb 10, 2019

Disappointed about Badger DB being ruled out as a data store.

I was intending to use this on ARM servers. I can't use cloud servers for legal reasons.

I will have to use the BBVA implementation of raft-badger.

@lni
Owner Author

lni commented Feb 11, 2019

@gedw99 the reason why badger is not supported at this stage is explained above.

As mentioned in the docs, the current version of Dragonboat is known to work on ARM-based platforms such as Amazon's EC2 A1 instances. It should work on your own ARM servers as well. This really has nothing to do with whether any particular key-value database is supported or not.

Please feel free to raise an issue if Dragonboat fails to compile or run on a recent 64-bit ARM + Linux environment.

@ghost

ghost commented Mar 4, 2019

thanks @lni for the clarification - appreciate the support.

@lni
Owner Author

lni commented Mar 5, 2019

Other pure Go key-value DBs are being actively investigated; e.g. https://github.com/petermattis/pebble looks very interesting.

@riaan53

riaan53 commented May 10, 2019

There have been some recent improvements to etcd's BoltDB as well. Might be worth a look, haven't tested it myself yet: https://www.cncf.io/blog/2019/05/09/performance-optimization-of-etcd-in-web-scale-data-scenario/

@lni
Owner Author

lni commented May 10, 2019

@riaan53

BoltDB is not supported because it doesn't checksum its stored data. BoltDB is very similar to LMDB and both have the same issue.

etcd uses BoltDB in its state machine, but all data is also available in the Raft log (which is always CRC checked); if something goes wrong, the entire state machine state can be restored from the Raft log. For Dragonboat, the key-value store is used to keep the Raft log itself and thus must have its data checksummed.
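
To illustrate the difference, here is a minimal Go sketch (the encodeEntry/decodeEntry helpers and the record layout are hypothetical, not Dragonboat's or etcd's actual format): a Raft log store attaches a checksum to every stored record and verifies it on read, because there is no other copy from which to restore the log itself.

```go
package logstore

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
)

// encodeEntry prepends a CRC32 of the payload so that a later read can detect
// corruption of the stored Raft log record.
func encodeEntry(payload []byte) []byte {
	rec := make([]byte, 4+len(payload))
	binary.LittleEndian.PutUint32(rec[:4], crc32.ChecksumIEEE(payload))
	copy(rec[4:], payload)
	return rec
}

// decodeEntry verifies the stored checksum before returning the payload; a
// mismatch means the record must not be used, and the caller must be notified.
func decodeEntry(rec []byte) ([]byte, error) {
	if len(rec) < 4 {
		return nil, errors.New("record too short")
	}
	payload := rec[4:]
	if crc32.ChecksumIEEE(payload) != binary.LittleEndian.Uint32(rec[:4]) {
		return nil, errors.New("raft log record checksum mismatch")
	}
	return payload, nil
}
```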

At this stage, I'd be willing to bet that the above-mentioned pebble project is on track to produce a high-performance, reliable key-value store in pure Go.

@riaan53

riaan53 commented May 10, 2019

@lni thanks for the feedback, good to know. Looking forward to the pebble project :)
