Explore memory-mapped files for faster data file access. #10

Closed
kevinswiber opened this Issue Apr 7, 2013 · 7 comments

Owner

kevinswiber commented Apr 7, 2013

I did a small experiment using the mmap system call for immutable data files.

The performance improvement was significant (from 57k to 198k reads per second).

To truly take advantage of memory-mapped files in Medea, the architecture has to change.

I think it makes more sense to mmap only immutable files. This means Medea would have to keep a certain number of key-value pairs in memory, along with indexes to value offsets for all keys. As data piles up, the in-memory key-value pairs would need to flush to disk, become immutable, and then get mmap'd.
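Roughly the flush path I have in mind (just a sketch, nowhere near final code; the file naming and threshold are made up):

```js
const fs = require('fs');

// Sketch: buffer recent writes in memory, and once the buffer crosses a
// threshold, write it out as an immutable file that becomes a candidate
// for mmap. Names and sizes are illustrative only.
const FLUSH_THRESHOLD = 4 * 1024 * 1024; // 4 MB of buffered values

class MemoryTable {
  constructor() {
    this.entries = new Map(); // key -> latest value
    this.byteSize = 0;
  }

  put(key, value) {
    this.entries.set(key, value);
    this.byteSize += Buffer.byteLength(key) + Buffer.byteLength(value);
    if (this.byteSize >= FLUSH_THRESHOLD) this.flush();
  }

  flush() {
    const path = 'medea-' + Date.now() + '.immutable';
    const chunks = [];
    const offsets = new Map(); // key -> { offset, size } within the new file
    let offset = 0;
    for (const [key, value] of this.entries) {
      const buf = Buffer.from(value);
      offsets.set(key, { offset, size: buf.length });
      chunks.push(buf);
      offset += buf.length;
    }
    fs.writeFileSync(path, Buffer.concat(chunks)); // this file never changes again
    // From here the file is immutable and could be mmap'd for reads.
    this.entries.clear();
    this.byteSize = 0;
    return { path, offsets };
  }
}
```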

This architecture starts looking a little bit closer to LevelDB than to Bitcask at this point.

While the performance improvement is too significant to ignore, I think this idea might get put on the back burner for v2.

One Medea goal is to keep as much in JavaScript as possible. Utilizing memory-mapped files would require a native add-on. None yet exists in a cross-platform way, so it would take some work to build one (not too intimidating, but potentially time-consuming).
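For what it's worth, the kind of interface such an add-on might expose (entirely hypothetical; the module name and API below are invented for illustration):

```js
const fs = require('fs');
// Hypothetical native add-on; no such cross-platform module exists yet,
// and the name/signature here are placeholders.
const mmap = require('mmap-addon');

// Map an immutable data file read-only; lookups then become plain Buffer
// slices with no read(2) syscall per value.
const fd = fs.openSync('medea-000001.immutable', 'r');
const size = fs.fstatSync(fd).size;
const mapped = mmap.map(fd, size); // would return a Buffer backed by the mapping

function readValue(offset, length) {
  return mapped.slice(offset, offset + length);
}
```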

As an aside, I disabled mmap in LevelDB and ran some benchmarks. It's clear the performance boost LevelDB gets from memory mapping is very significant, confirming some suspicions on why it's so much faster.

@kevinswiber referenced this issue in argo/argo on Apr 7, 2013: Caching #17 (Open)

Contributor

sandfox commented Apr 10, 2013

Not sure if this is an issue or not, or if I'm reading it wrong, but would this mean that medea isn't doing durable writes anymore?

Owner

kevinswiber commented Apr 10, 2013

Durable writes should stay. On writes, we could append to the log file and keep the latest value in memory. I think this is what LevelDB does. (I feel like I read a blog post saying it flushes its MemTable to disk periodically, but reading the code, it looks like LevelDB writes to a log file and then inserts a record into its MemTable before returning.)
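Something like this, very roughly (a sketch of the idea, not actual Medea or LevelDB code):

```js
const fs = require('fs');

// Sketch of the write path described above: append to the log first, then
// update the in-memory table, then return. Names are illustrative.
class WriteAheadStore {
  constructor(logPath) {
    this.fd = fs.openSync(logPath, 'a');
    this.memtable = new Map(); // key -> latest value
  }

  put(key, value) {
    const record = JSON.stringify({ key, value }) + '\n';
    fs.writeSync(this.fd, record); // 1. append to the log file
    fs.fsyncSync(this.fd);         // 2. make it durable before acknowledging
    this.memtable.set(key, value); // 3. keep the latest value in memory
  }

  get(key) {
    // Newest values come from memory; older ones would come from the
    // immutable (eventually mmap'd) files.
    return this.memtable.get(key);
  }
}
```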

I don't know if there'd be any problems in implementation, but in theory, I think this sounds all right.

Contributor

sandfox commented Apr 10, 2013

Ok cool. Using mmap should (in theory) seriously speed up access and reduce memory usage when running medea across more than one process. Log file and most recent value in memory seems sound.

From my understanding, LevelDB writes to the log and, once that completes, adds the record to the MemTable so that a crash doesn't leave your application in an inconsistent state. As I write this I also realise that LevelDB is still only single-process, which may kill attempts at clustering medea effectively, because there is no way to share the MemTable (or its equivalent)...

Owner

kevinswiber commented Apr 10, 2013

You're right on the multi-process statement.

I think the way to do this might be to treat a multi-process architecture as a distributed network architecture. That could come down to partitioning work based on a consistent hash mechanism. Some partitioning (sharding) scheme would be needed at a bare minimum, but there are other considerations as well: replicas, high write availability, temporary failure recovery (hinted handoff), anti-entropy measures, auto-detection of new nodes in the cluster, etc.
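As a starting point, the partitioning piece could look something like this (just a sketch; the virtual-point count and hash function are placeholders):

```js
// Minimal consistent-hash ring: each process/node gets several virtual
// points on a ring, and a key is owned by the first point at or after its
// hash. Purely illustrative.
class HashRing {
  constructor(nodes, hashFn, pointsPerNode = 64) {
    this.hashFn = hashFn;
    this.points = [];
    for (const node of nodes) {
      for (let i = 0; i < pointsPerNode; i++) {
        this.points.push({ hash: hashFn(node + '#' + i), node });
      }
    }
    this.points.sort((a, b) => a.hash - b.hash);
  }

  ownerOf(key) {
    const h = this.hashFn(key);
    for (const point of this.points) {
      if (point.hash >= h) return point.node;
    }
    return this.points[0].node; // wrap around the ring
  }
}
```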

There's enough open source state of the art from which to take inspiration, and there might be some lesser known or more experimental algorithms that could be tossed in, as well.

I'd love to see us get there. I think it'll happen.

Contributor

sandfox commented Apr 11, 2013

I see you've starred sky. Have a look at this gist (specifically Ben's comments) for a description of how it handles multi-process stuff and other related details.

Owner

kevinswiber commented Apr 11, 2013

Very cool! Thanks for the link. Interesting stuff.

My initial intention with medea-clusterify was to have multiple readers and one writer. (Instead, it delegates all reads and writes to the master.)

The issue here is that the writer holds the latest file IDs and offsets in memory. Without shared memory, the other processes are in the dark and serving a stale state. I'd be okay with eventual consistency on the peers, but that mechanism doesn't yet exist. (This could be coordinated via medea-clusterify.)

The other issue is memory consumption. Medea holds all keys in memory. Without sharding, the full in-memory keydir is loaded in each process. I almost said "oh well" on this and coded medea-clusterify to use such a design, but I'm split 50-50.
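For context, each in-memory entry is roughly this shape (a simplification of the Bitcask-style keydir; field names are illustrative, not Medea's exact ones):

```js
// Approximate Bitcask-style keydir: one in-memory entry per key, pointing
// at where the latest value lives on disk.
const keydir = new Map();

function recordWrite(key, fileId, valuePosition, valueSize, timestamp) {
  keydir.set(key, { fileId, valuePosition, valueSize, timestamp });
}

// Memory use grows with the number of keys, and without sharding every
// process ends up carrying a full copy of this map.
```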

Sharding: Totally doable with a decent consistent hashing algorithm (that will hopefully work without 64-bit bit-shifting). With such a design, we could have processes that both read and write. Consolidating those data files between starts and stops might be a small hurdle to overcome, but it would give us the flexibility to scale out.
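For example, a 32-bit FNV-1a only needs 32-bit-safe operations, so it sidesteps the 64-bit bit-shifting problem (just an illustration, not a committed choice):

```js
// 32-bit FNV-1a using only 32-bit-safe operations (^, Math.imul, >>> 0).
function fnv1a32(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned 32-bit
  }
  return hash >>> 0;
}

// Could plug straight into a ring like the earlier sketch:
// new HashRing(['proc-0', 'proc-1', 'proc-2'], fnv1a32).ownerOf('some-key');
```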

Once distributed, a replication strategy is worth discussing. And then failure scenarios. And then recovery scenarios. And so on. :)

I'm going to read over that document again. Thanks for sharing.

Owner

kevinswiber commented Sep 18, 2014

Closing for now. Maybe this comes back as a wrapper in the future; that would be awesome.
