
Path to improving extstore's defaults #541

Open · 1 of 2 tasks

dormando opened this issue Sep 22, 2019 · 7 comments

dormando (Member) commented Sep 22, 2019

Extstore shouldn't be compile-gated anymore. Building it in unconditionally would simplify the code by cutting out the ifdefs. There are a few things holding me back:

  • A performance comparison, since it adds some branches to hot paths (this can be made neutral via changes to the way add_iov() is used).
    Note: changes staged in next fix this! Lots of branches were removed from the hot paths.
  • A better answer to the memory allocation hurdles.

To the second point, TL;DR: small normal items and small item HDRs (the in-memory header stubs pointing at disk-resident data) take up space in the same slab classes/LRUs. Small items can't/won't be flushed to disk. This has some side effects:

  • workloads with mixed small/large items, with zero expiry, will end up with low or no memory for larger items, causing aggressive flushing in extstore.
  • unexpiring small items can cause HDRs to be evicted while there's plenty of extstore disk space free.
  • the RAM cache of large objects will shrink over time, increasing disk pressure for both writes and reads.

These are non-obvious to a user and may be surprising. Hopefully this issue can be a discussion but I won't be surprised if it's just me talking to myself for a month again :)
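For context, a simplified sketch of why these collide. The structures below are illustrative, not verbatim memcached source (the ITEM_HDR flag exists in the real code; field layout and the flag's value here are approximations): an HDR item is just an ordinary small item whose payload is a disk locator, so it allocates from the same small slab classes as real small items.

```c
/* Simplified sketch, not verbatim memcached source. An extstore HDR
 * item is a normal item flagged with ITEM_HDR whose payload holds a
 * disk locator instead of the value bytes. */
#include <stdint.h>

#define ITEM_HDR (1 << 5)    /* illustrative flag value only */

typedef struct {
    unsigned int page_id;    /* which extstore page holds the data */
    uint32_t page_version;   /* detects page reuse after compaction */
    uint32_t offset;         /* byte offset within the page */
} item_hdr;

typedef struct _stritem {
    uint16_t it_flags;
    /* ... key, metadata, then either value bytes or an item_hdr ... */
} item;

/* Both real small items and HDR stubs size into the same small slab
 * classes, so under a zero-expiry mixed workload they compete for the
 * same memory, and neither can be flushed to disk. */
static inline int is_extstore_header(const item *it) {
    return (it->it_flags & ITEM_HDR) != 0;
}
```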

Ideas:

  • Raise the aggressiveness of disk flushing... people often run extstore with almost no RAM cache regardless.
  • Add a slab-balancer-algorithm cap for small vs. large items. Default in the range of 50-90% small?
  • Find some way of separating the HDRs into a different LRU? We've already used up the possible four, but HOT|WARM could be repurposed here.
  • Use the LRU crawler to sort or prioritize HDR vs. non-HDR objects.
  • The above options could be used to prioritize HDR objects when extstore has a lot of empty space. We should yield memory to small items if extstore is, say, 90% full anyway.
  • Any other ideas? :(

Most of these require making a decision on the importance of a small item vs. an HDR item, which is impossible to do generically. For extstore's case, the "most generic" we can go is ensuring some space is reserved for large objects so the system doesn't thrash; perhaps with an option allowing for "prioritize using memory to fill extstore" vs. "treat all objects equally".

The fewer the options/twiddles and the better the defaults, the... better.

dormando (Member, Author) commented:

Honestly, after the performance fixes I'd rather just build it in by default but leave it marked experimental. Then we can incrementally improve the memory situation.

@dormando dormando changed the title Path to building extstore by default Path to improving extstore's defaults Mar 7, 2020
dormando (Member, Author) commented Apr 25, 2020

Note to self: check whether it would be crazy for the extstore code to issue flushes inline with the eviction code, or to wake and temporarily block on the bg thread where possible.

The code is mainly tuned to "never degrade set performance", but a flush is generally just a write to RAM anyway, so it might make a better default to at least try, so long as the performance drop isn't huge.

On a quick look it might not be too bad...

  • The storage bg thread just loops through the slab classes and, if too few free chunks exist, flushes until we're back above the threshold. It calls storage_write().
  • storage_write() pulls the tail item, checks if it's valid to throw out (not an HDR item, age is above item_age, etc.), then allocates an HDR item and swaps it in.
  • Normal set-with-eviction has a retry loop built in, because when I rewrote the LRU there was no longer a way of directly reusing freed memory. There is a mechanism for that now (LRU_PULL_ITEM_RETURN) which can be expanded a bit.
  • Caveat: storage_write() fails while the write buffer is full and flushing to disk. For a while I had multiple write buffers, but the code was awful and buggy. Might need to either revisit this or figure out how to at least have two write buffers so they can be swapped. A special return code from extstore_write_request() might help here.
  • Caveat 2: this can put pull_tail() into a loop, so some care needs to be taken there.
    Edit: there is a stack of wbufs. The complication is that we can read an item directly out of a write buffer if it hasn't been flushed yet, so multiple write buffers have to be carefully maintained. IIRC my first attempt had an arbitrary stack, but perhaps hard-coding two would be simpler.

So, at the cost of slightly increasing inline set latency, bottoming out could just force-flush and directly reclaim memory. The BG thread could still run to try to keep ahead of things, or get kicked into gear when something does bottom out.

This would alleviate a lot of the holes in the flushing algorithm that currently require careful tuning.
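As a very rough sketch of the shape this could take: every name below is invented for illustration (in the real code, storage_write() would play the flush role); this is the idea, not the implementation.

```c
/* Hypothetical sketch of flush-on-evict. When a large slab class
 * bottoms out, flush the tail item to the extstore write buffer inline
 * instead of evicting it, then retry the allocation. */
#include <stddef.h>

typedef struct _stritem item;                        /* opaque for the sketch */
item *slabs_alloc_from_class(int clsid, size_t n);   /* hypothetical */
int inline_flush_tail(int clsid);                    /* hypothetical */
item *item_alloc_evict(int clsid, size_t n);         /* hypothetical */

item *item_alloc_with_inline_flush(int clsid, size_t ntotal) {
    for (int tries = 0; tries < 5; tries++) {
        item *it = slabs_alloc_from_class(clsid, ntotal);
        if (it != NULL)
            return it;
        /* Pull the tail item, verify it's valid to write out (not an
         * HDR item, older than item_age, etc.), copy it into the write
         * buffer, and replace it with a small HDR stub, which frees the
         * large chunk we need. */
        if (!inline_flush_tail(clsid)) {
            /* Caveat from above: the flush fails while the single write
             * buffer is mid-flush to disk; fall back to plain eviction. */
            return item_alloc_evict(clsid, ntotal);
        }
    }
    return NULL;   /* caveat 2: bound retries so pull_tail() can't loop */
}
```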

dormando (Member, Author) commented Apr 26, 2020

Let's put this together into a potential general simplification:

  • hoist the page mover logic slightly: for each chunk to be tested, the move function must be called with a free chunk to swap a piece of memory into if necessary.
  • the calling page mover thread uses item_allocish() to get the free memory, and manages retries as needed: retries if the chunk was within the page being moved, retries or sleeps if the tail is locked, etc.
  • the page mover function uses this free memory to swap an item out if necessary, removing the _nomem case entirely.

next:

  • main item_allocish() logic can flush to extstore directly if necessary.
  • beware of causing a looped alloc!
  • extstore needs multiple (at least two) wbufs per bucket so one can be flushed to disk while another is written to (see the sketch after this list).
  • main allocish() flushing directly covers sets as well.
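A hedged sketch of what the double write buffer might look like; all names and the structure are invented for illustration, and the real complication (reads must still find items sitting in an unflushed buffer) is only noted in a comment here.

```c
/* Hypothetical double write buffer: one wbuf accepts new items while
 * the other is being flushed to disk, then they swap. Reads must still
 * be able to find items sitting in either buffer, which is the
 * complication noted earlier. */
#include <stdbool.h>
#include <stddef.h>

struct wbuf {
    char *data;
    size_t used, size;
    bool flushing;        /* currently being written to disk */
};

struct wbuf_pair {
    struct wbuf bufs[2];
    int active;           /* index of the buffer accepting writes */
};

/* Returns the buffer to append into, or NULL if both are busy
 * (the "special return code" case: the caller must fall back/evict). */
struct wbuf *wbuf_for_write(struct wbuf_pair *wp, size_t need) {
    struct wbuf *w = &wp->bufs[wp->active];
    if (w->used + need <= w->size)
        return w;
    struct wbuf *other = &wp->bufs[wp->active ^ 1];
    if (other->flushing)
        return NULL;          /* both buffers busy: caller falls back */
    w->flushing = true;       /* hand the full buffer to the disk flusher */
    wp->active ^= 1;
    other->used = 0;          /* assume the flusher cleared its flag */
    return other;
}
```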

then for the automove algorithm:

  • The only goal of the mover algorithm now is to keep the global memory pool above a desired threshold. Currently it tries to maintain the global pool + a free chunk ratio per slab class.
  • If it is below the threshold, pick a non-HDR slab class with the most memory allocated and move a page.
  • This may still need a special case for the highest slab class (chunked items): need enough memory to hold at least two items of -I size.
  • Most of the free chunk balancing code can be deleted.
  • BG extstore flush thread removed?
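Condensing that automove policy into a sketch: all structures, names, and thresholds below are invented for illustration; the only real signal, per the list above, is the global page pool level.

```c
/* Hypothetical simplified automove policy: keep the global page pool
 * above a threshold; when below it, reclaim a page from the non-HDR
 * slab class holding the most memory. */
#include <stddef.h>
#include <stdbool.h>

#define FREE_PAGE_THRESHOLD 2   /* assumed tunable floor for the pool */

struct class_stats {
    size_t total_pages;     /* pages currently owned by this class */
    bool holds_hdr_items;   /* class backs extstore HDR stubs */
};

/* Pick the slab class to reclaim one page from, or -1 if the pool is
 * healthy. The caller then moves a page from that class to the global
 * pool, and moving the page is what flushes its items to extstore. */
int pick_reclaim_victim(size_t global_free_pages,
                        const struct class_stats *cls, int nclasses) {
    if (global_free_pages >= FREE_PAGE_THRESHOLD)
        return -1;
    int victim = -1;
    size_t most = 0;
    for (int i = 0; i < nclasses; i++) {
        if (cls[i].holds_hdr_items)
            continue;   /* never reclaim memory that backs HDR items */
        if (cls[i].total_pages > most) {
            most = cls[i].total_pages;
            victim = i;
        }
    }
    return victim;
}
```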

Pros:

  • Re-arranges the "RAM buffer" used to avoid early evictions: instead of free slab memory chunks spread across N classes, it becomes the extstore write buffer.
  • Removes the ramp-up time needed by the background thread (going from idle to full throttle requires a few seconds' worth of RAM buffer).
  • The time to swap and flush an extstore write buffer depends only on the speed of the storage device, not on various background threads which may be asleep.
  • Can maybe remove the storage flush thread entirely.
  • Could work better for lower memory setups since we don't need to maintain (slabclasses * 2.5)+ pages worth of free memory.

Cons:

  • It is still, at a high level, just moving the RAM buffers around. Fast enough bursts of writes will cause evictions unless the wbuf is properly sized, and sizing the wbuf requires a restart!
  • Set time will increase, as a lot more work will be guaranteed to happen inline with sets vs. currently being done on BG threads.
  • Might have to use a wide lock while writing to a particular write buffer, or else ensure items are aligned to 8 bytes within the wbuf?

The set latency could be amortized slightly by having lru_maintainer_thread() attempt to keep a couple of chunks free per high slab class, rather than still dedicating an entire thread to extstore flushing.

Notes:

  • any remote chance wbufs could be replaced with bipbuffers? I assume finding an item stored in a bipbuffer would be nontrivial.
  • this might make a further change to separate HDR memory from main memory cleaner.

dormando (Member, Author) commented Jun 8, 2021

Wonder if this last idea can be simplified or split up a little more:

  • the page mover loop will retry/wait/etc. if bounced from the "alloc but also flush to extstore" command.
  • the mover algo gets changed to aggressively keep the global page pool filled by moving pages
  • the act of moving pages flushes to extstore

... then this gets released... or tested. It kills the nomem case and should generally improve things.

then a second change with:

  • extstore double/N write buffers
  • mainline alloc can flush to extstore before evicting
  • remove storage flush thread

Then on a burst of writes:

  • if the mover algo can react fast enough to the global page pool dropping, it will start moving pages and freeing memory
  • if it "bottoms out" while ramping up, or for some other reason, writes will directly flush to extstore and free up memory to upload into

dormando (Member, Author) commented Jun 9, 2021

Feel like there's still a super corner case: all memory full, a burst of writes bottoms out the spare pool buffer, and then extstore needs some memory to allocate headers from.

The solution is more complicated though; maybe two spare page pools. Or, past the minimum spare pool, only small/HDR objects can pull memory from the global pool. That should actually work; otherwise there's no point to flush-on-evict without some kind of spare memory reserved for HDR objects.

Edit: blah... if a high slab class doesn't have enough memory for all parallel uploads it'll start OOMing. So either A) it can dip into the global pool if we'd OOM because the tail is locked, or B) we need to solve the tail-lock OOMs first.
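The reserve-gating version of that idea might look roughly like this; the function name and threshold are invented for illustration.

```c
/* Hypothetical sketch: past a minimum spare-page floor, only small/HDR
 * allocations may pull pages from the global pool, so header memory
 * stays available during write bursts. */
#include <stddef.h>
#include <stdbool.h>

#define MIN_SPARE_PAGES 2   /* assumed floor reserved for HDR allocations */

bool may_pull_global_page(size_t global_free_pages, bool is_small_or_hdr) {
    if (global_free_pages == 0)
        return false;                /* nothing left to hand out */
    if (global_free_pages <= MIN_SPARE_PAGES)
        return is_small_or_hdr;      /* reserve the floor for HDR stubs */
    return true;                     /* plenty spare: anyone may pull */
}
```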

dormando (Member, Author) commented:

Thought/refinement... if there's a clean way to decide during a SET on a high slab class that memory is low, it could evict->flush instead of pulling from the slabber. That avoids pushing on the global page pool and, combined with the recent slab mover changes, could work pretty well.

It might even be possible to do a decent pass at this with a single extstore write buffer; if the buffer is flushing to disk, fall back to the slab allocator temporarily. Then multiple write buffers would fix that, etc.

Still a series of not-super-quick changes. I'd first go for fixing the nomem case by having the mover require a free chunk passed into it. That's one of the last really dumb corner cases left in the system.

dormando mentioned this issue Dec 30, 2022
dormando (Member, Author) commented:

More thoughts:

  • Since extstore headers are marked by the ITEM_HDR bit and are only allocated in specific places, I could use a different or novel memory allocator for them.
  • A lot of the SET reliability problems could be amortized by allowing sets to queue when memory fills up.

To explain: if a user tries to throw the same 1MB item into cache from 1,000 clients at the same time, we need a gigabyte of memory for all of the in-flight data uploads, and then they just get rejected one by one after the first one succeeds.

Thus, if we bottom out of memory or otherwise cap the "in-flight memory buffers", the sets can be thrown into a side queue (using the same queue system we use for extstore/proxy). As memory frees up, we pull from that set queue and read the data from the sockets/etc.

With this we can reliably accept sets by blocking them until resources become available, without hanging the worker threads, so they're free to handle reads or process/queue other writes.
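A minimal sketch of that side queue; all names here are invented (in practice memcached's existing IO queue machinery for extstore/proxy would be reused, as the comment says, rather than a hand-rolled list like this).

```c
/* Hypothetical set side queue: when in-flight set memory is capped out,
 * park the pending set instead of rejecting it, and resume it when a
 * buffer frees up. */
#include <stddef.h>

void resume_set(void *conn);    /* hypothetical: re-arm the conn to read data */

struct pending_set {
    void *conn;                 /* connection waiting to finish its upload */
    size_t bytes_needed;        /* memory required to accept the value */
    struct pending_set *next;
};

static struct pending_set *set_queue_head, *set_queue_tail;

/* Called when an allocation for a set fails: park it instead of failing. */
void queue_blocked_set(struct pending_set *ps) {
    ps->next = NULL;
    if (set_queue_tail) set_queue_tail->next = ps;
    else set_queue_head = ps;
    set_queue_tail = ps;
}

/* Called when memory frees: resume as many parked sets as now fit. */
void drain_set_queue(size_t *free_bytes) {
    while (set_queue_head && set_queue_head->bytes_needed <= *free_bytes) {
        struct pending_set *ps = set_queue_head;
        set_queue_head = ps->next;
        if (!set_queue_head) set_queue_tail = NULL;
        *free_bytes -= ps->bytes_needed;
        resume_set(ps->conn);
    }
}
```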
