
Extstore update project #972

@dormando

Issue tracking a number of extstore changes I intend to make close together. In roughly sorted order.

This change simplifies a lot of code, adds the ability to split storage across multiple files organized by purpose, and potentially reduces write amplification. It's a good starting point for broader changes.

The page data needs to serialize into the restart file, then be parsed and reloaded, plus structure for cleanly stopping and closing all of the extstore files. May delay this to the end in case I want to adjust the page structure.

This work fixes the slab mover's "out of memory while moving object" errors by allowing it to pull memory from the LRU tail instead. Further, during a "set" we will directly flush the tail item to extstore's write buffer and reuse the memory immediately. This should be roughly neutral on CPU usage, with a small increase in set latency. It may perform better under high load, as it removes pressure on the global slab lock. It also removes the user pain of having to tune around, or worry about overwhelming, the background flusher thread.
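A rough illustration of the direct-flush idea (a sketch only; the struct layout and names like `flush_tail_and_reuse` are hypothetical, not memcached's actual code):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of "flush tail on set": when a set finds no free memory, copy
 * the LRU tail item into the extstore write buffer and reuse its slot
 * immediately, instead of failing with OOM or evicting. */

#define WBUF_SIZE 4096

struct wbuf {
    char data[WBUF_SIZE];
    size_t used;
};

struct item {
    size_t nbytes;
    char data[64];
    int in_use;
};

/* Returns the tail item's slot for immediate reuse on success, or NULL
 * if the write buffer is full (the caller would then fall back to the
 * background flusher path). */
struct item *flush_tail_and_reuse(struct wbuf *wb, struct item *tail) {
    if (wb->used + tail->nbytes > WBUF_SIZE)
        return NULL;                      /* buffer full: can't inline-flush */
    memcpy(wb->data + wb->used, tail->data, tail->nbytes);
    wb->used += tail->nbytes;             /* data now queued for extstore */
    tail->in_use = 0;                     /* memory is free for the new set */
    return tail;
}
```

The key property is that the set path never blocks on the flusher thread; it either inline-flushes or falls back, which is what removes the tuning burden.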

  • Fix dropping of items during compaction if the item is actively referenced.

There's an ancient TODO marked in storage.c:storage_compact_readback() where, if an item header is actively being referenced, compaction cannot "rescue" the valid item into a new page. This work is to just add the missing code to re-allocate and replace the header in this case. We also need to audit for caveats around the now-orphaned header item.

  • Fix issues where the same item is written to the same page multiple times, and compaction gets confused.

If you aggressively flush items to extstore (i.e., ext_item_age=0, or very little memory free), a key that gets frequently updated can end up in the same page multiple times. The compaction code does not try hard enough to validate that the item header it pulled for a key found in storage matches the original object that was written into the page. This can lead to weird bugs where an item is deleted multiple times, etc.

Fixing this might be as simple as ensuring the hdr_it's offset matches the current position in the readback buffer? I got stuck on this before and I'm not sure why I didn't just do that.
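The proposed check might look something like this (a sketch; the field names are hypothetical and memcached's real header struct differs):

```c
#include <stdint.h>

/* Hypothetical subset of an in-memory item header's extstore fields. */
struct hdr_info {
    uint32_t page_id;
    uint32_t offset;   /* where this header claims the object lives */
};

/* During compaction readback, only treat the on-disk copy found at
 * (page_id, cur_offset) as the live one if the in-memory header still
 * points exactly there. Stale duplicates of the same key written
 * earlier in the page fail this check and are skipped rather than
 * rescued or deleted again. */
int is_live_copy(const struct hdr_info *hdr,
                 uint32_t page_id, uint32_t cur_offset) {
    return hdr->page_id == page_id && hdr->offset == cur_offset;
}
```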

  • Change IO thread count setting to be "maximum" number of IO threads, start with 1 thread, and use a heuristic for spawning threads up to the limit.

This is the most fiddly user setting, I think. Having a very low number of IO threads can make the system not recover well from IO hiccups or bursts of traffic. Having too many IO threads wastes some memory or bothers people. We should instead spawn new threads if all existing threads already have IOs queued when a new IO is being submitted. This could be tuned to be less aggressive as it approaches the max thread limit.

Extstore is strictly LIFO for its use of IO threads; if your IO system is very fast or not very loaded it will continually reuse the same thread instead of cycling through all of them (for various performance reasons).
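The spawn heuristic described above can be sketched in a few lines (illustrative only; `should_spawn_io_thread` and the depth array are hypothetical, not extstore's actual structures):

```c
/* Spawn a new IO thread only when every existing thread already has
 * work queued at submit time, up to a configured maximum. Because
 * extstore hands IOs out LIFO, an idle thread near the "top" absorbs
 * load first, so the pool only grows under sustained pressure. */
int should_spawn_io_thread(const int *queue_depths, int nthreads,
                           int max_threads) {
    if (nthreads >= max_threads)
        return 0;       /* at the user-configured ceiling */
    for (int i = 0; i < nthreads; i++) {
        if (queue_depths[i] == 0)
            return 0;   /* an idle thread exists; hand the IO to it */
    }
    return 1;           /* all threads busy: grow the pool */
}
```

Making the check stricter as `nthreads` nears `max_threads` (e.g. requiring depth > 1 everywhere) would give the "less aggressive near the limit" behavior.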

  • Maybe: dedicated IO thread for write flushing and/or readbacks.

Need to benchmark or otherwise do some testing for how compaction and write flushing interact with normal item readback, and make some decisions here (the above thread tuning may make this moot).

  • Optimization: use response buffer or "read buffer" for item fetching instead of item allocation.

Every item read from extstore causes an item allocation call. The item is then thrown away after being sent to the client, except when we decide to recache the object. Originally I envisioned extstore only being used for large items (32k, 64k and upward), where this makes sense. In reality, many people store a billion items that are 500 bytes or less in size.

The response objects have a small (1kish) write buffer for the response header. If there is room, read the object directly here and avoid using the item allocator. Large potential CPU savings, possible throughput improvement.

This change could also "upgrade" to the worker-thread local read buffers for objects too large for the response buffer, but less than READ_BUFFER_SIZE (16k). If we intend to recache the object it can still use item allocation up front. All of this also reduces pressure on the slab allocator which would in turn force less flushing to extstore for smaller objects.
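The buffer-selection cascade described above, smallest destination first (a sketch; `pick_read_dst` and the 1k response-buffer room are illustrative assumptions):

```c
#include <stddef.h>

#define RESP_BUF_ROOM 1024            /* assumed spare room in the response */
#define READ_BUFFER_SIZE (16 * 1024)  /* worker-local read buffer size */

enum read_dst { DST_RESP_BUF, DST_READ_BUF, DST_ITEM_ALLOC };

/* Choose where to read an extstore object into:
 * response buffer -> worker-local read buffer -> item allocator.
 * Objects we intend to recache still need a real item up front. */
enum read_dst pick_read_dst(size_t obj_len, int will_recache) {
    if (will_recache)
        return DST_ITEM_ALLOC;
    if (obj_len <= RESP_BUF_ROOM)
        return DST_RESP_BUF;
    if (obj_len <= READ_BUFFER_SIZE)
        return DST_READ_BUF;
    return DST_ITEM_ALLOC;
}
```

With the common case of sub-500-byte objects, nearly every read lands in the response buffer and never touches the slab allocator.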

  • Optimization: merge "io pending" objects with "CQ" objects used in worker notification.

When IO objects are passed back to a worker thread, we allocate a special wrapper object to hold the IO object. That wrapper is then queued in the per-thread queue, returned, and freed. The allocation is local to the worker thread, but this extra bit of work is unnecessary: the IO object itself can be used to pass data, and the original CQ system (which was originally only used to pass new connections from the accept thread to worker threads) can be removed.

Won't be a huge speedup but should be measurable.

  • Maybe: add timing to IO calls and surface this as an average (or current depth) stat.

See if we can answer the questions of "how long did the most recent request take" and "how old is the oldest item in one or all of the thread queues". In most parts of memcached, attempting to time processing is counterproductive: from client request to response, a large part of the latency is spent sitting in queues waiting for epoll to wake up, so we cannot accurately measure time. Since extstore is making calls to a flash drive, we can at least measure the time specifically taken there.
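One cheap way to surface an "average" stat is an exponentially weighted moving average updated by the IO thread around each read (a sketch; the struct and smoothing factor are assumptions, not extstore code — alpha = 1/8 is borrowed from TCP's SRTT smoothing):

```c
/* Smoothed per-IO latency, updated after each pread() by the IO
 * thread and read out by the stats path. */
struct io_timing {
    double avg_us;   /* EWMA of latency in microseconds; 0 = no samples */
};

void io_timing_update(struct io_timing *t, double sample_us) {
    const double alpha = 0.125;           /* weight of the newest sample */
    if (t->avg_us == 0.0)
        t->avg_us = sample_us;            /* first sample seeds the average */
    else
        t->avg_us += alpha * (sample_us - t->avg_us);
}
```

A single double per thread avoids locking; the stats reader can tolerate a slightly stale value.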

  • Operation: Separate item header memory from main memory

Mostly discussed here: #541 - it should be possible to allocate and free "item header" memory to a different system.

This could be a novel type of system: a blob of memory per page with an approximate LRU, or an instance of the pre-existing slabber after a code cleanup. Memory could move between arenas via the page mover, or not at all, depending on user configuration.

A possible use of a novel memory allocator would be to allocate headers within a page into contiguous chunks of memory. The benefit would be no slabber overhead; the drawback, being unable to reuse memory when items are partially removed from a page. A good compromise is probably to just use the slabber, but with a max chunk size just big enough to fit key + metadata, so the slab classes are very close together and don't waste much space.
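Rough arithmetic for sizing that header-only slab class (all sizes here are illustrative assumptions; memcached's real item struct size varies by build options):

```c
#include <stddef.h>

#define ITEM_STRUCT_SZ 48   /* assumed base item metadata size */
#define CAS_SZ 8            /* optional CAS field */
#define EXT_HDR_SZ 12       /* current extstore item header */
#define KEY_MAX 250         /* memcached's maximum key length */

/* Bytes needed to hold a header item for a key of nkey bytes. */
size_t hdr_chunk_size(size_t nkey) {
    return ITEM_STRUCT_SZ + CAS_SZ + EXT_HDR_SZ + nkey + 1; /* +1 terminator */
}
```

Since `hdr_chunk_size(KEY_MAX)` stays in the low hundreds of bytes, a handful of tightly spaced slab classes (or even a single class at the max) covers every header with little wasted space.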

  • User experience: block SETs until memory becomes available instead of evicting from main memory or OOM

This may have too many dependencies. Leaving a note here for later (or to pull forward into another project).

  • Reduce extstore item header size

Discussed here: #726 - item headers are 12 bytes long. Together with the standard memory header + key + cas this isn't huge, but can be cut from 12 bytes to 9 bytes for free or 8 bytes at the cost of some disk space.
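One way the "8 bytes at the cost of some disk space" tradeoff could work is storing the page offset in coarser units, so fewer bits cover the same page size at the cost of alignment padding on disk (field widths here are purely illustrative, not extstore's actual layout):

```c
#include <stdint.h>

#define OFFSET_UNIT 64   /* objects padded to 64-byte boundaries on disk */

/* Hypothetical 8-byte packed header: the offset field counts
 * OFFSET_UNIT-sized slots rather than bytes, trading up to 63 bytes of
 * disk padding per object for 4 bytes of memory per header. */
struct packed_hdr {
    uint64_t page_id : 18;   /* up to 256k pages */
    uint64_t offset  : 26;   /* * OFFSET_UNIT -> addresses up to 4GB pages */
    uint64_t version : 20;
};

uint64_t hdr_disk_offset(const struct packed_hdr *h) {
    return (uint64_t)h->offset * OFFSET_UNIT;
}
```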


Think I'm forgetting something but that's most of it.
