Exposing FASTER log as a first-class abstraction #177

badrishc · 2019-09-18T20:04:58Z

We are exposing our high-speed latch-free log (or write-ahead log) facility as a generally usable first-class citizen, called FasterLog. You can append variable-sized chunks (as byte[] or Span<byte>) to the log. There is no additional header in the log. Users can perform (potentially tailing) pull-based iteration over any range of the log (including until a future log address). Users can truncate from the head of the log. They can also flush the log at very fine granularity, lower than the usual per-page flushing. All aspects of the log (page size, number of pages in memory, segment size on disk, etc.) are tunable. The log works with async code as well, and we will add async versions of the API going forward.

Note that there is no random lookup index on top of this log. If you want such an indexed log, use FasterKV as usual.

Improvements to Epochs

As part of this PR, we have also significantly optimized the epoch framework, removing important performance bottlenecks that now make it feasible to acquire and release epochs at a very fine granularity. We can perform 35 million acquire/release operation pairs per second per thread, with linear scalability. If users are able to pre-acquire a thread for the operations, we are able to achieve 100 million acquire/release operation pairs per second per thread. This improvement now makes it possible to make FASTER work at high performance in task-based async environments.

badrishc · 2019-09-18T20:16:33Z

See example at https://github.com/microsoft/FASTER/blob/19d5d82641965df3b203b2b9e141ba0594051977/cs/playground/FasterLogSample/Program.cs

badrishc · 2019-09-18T20:33:43Z

Note that this log can live on top of any device that implements the IDevice abstraction. We provide out-of-the-box device implementations for:

Generic local storage device (any .NET Core system)
Local storage device (Windows-specific high performance unbuffered overlapped IO)
Azure page blob storage device
Tiered storage device (create tiers of storage, log writes to all tiers, read from lowest available tier)
Sharded storage device (spread writes across multiple devices to get higher effective bandwidth)

badrishc · 2019-09-19T03:05:15Z

Checked in log commit and recovery support. By default, we flush the log at page boundaries, but the user can manually flush more frequently, e.g., every 10ms. The playground sample has been updated to show this facility as well. You can kill and restart the sample as many times as you want; it will auto-resume.

badrishc · 2019-09-19T05:32:47Z

As for end-to-end performance, for 100-byte payloads we get around 1 GB/sec to a modern NVMe SSD with a single thread, with log tail increment being the bottleneck as we increase the number of append threads. For 1000-byte payloads, we get close to the SSD bandwidth limit (1.8 - 2 GB/sec) using one thread.

hiteshmadan · 2019-09-23T18:53:29Z

cs/src/core/Index/FasterLog/FasterLog.cs

+            BlockAllocate(4 + length, out long logicalAddress);
+            var physicalAddress = allocator.GetPhysicalAddress(logicalAddress);
+            *(int*)physicalAddress = length;
+            fixed (byte* bp = &entry.GetPinnableReference())


Instead of pinning the entry, you can use the CopyTo API (which is also apparently faster in many cases than other memory copy APIs).

entry.CopyTo(new Span<byte>((void*)(4 + physicalAddress), length));

I actually benchmarked the two alternatives and went with pinning as it was non-trivially faster.
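The layout written by the diff above (a 4-byte length header followed by the payload) can be sketched without unsafe code. This is a minimal illustration only: LengthPrefixedBuffer is a hypothetical stand-in for a log page, not a FASTER type, and it fixes the header to little-endian, whereas the pointer write in the diff uses the machine's native byte order.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical stand-in for a single log page; not part of FASTER.
public sealed class LengthPrefixedBuffer
{
    private const int HeaderSize = sizeof(int); // 4-byte length header, as in the diff
    private readonly byte[] page;
    private int tail;

    public LengthPrefixedBuffer(int capacity) => page = new byte[capacity];

    // Appends [length][payload] and returns the offset of the header,
    // or -1 if the entry does not fit in the remaining space.
    public int Append(ReadOnlySpan<byte> entry)
    {
        int needed = HeaderSize + entry.Length;
        if (needed > page.Length - tail) return -1;

        int offset = tail;
        Span<byte> slot = page.AsSpan(offset, needed);
        BinaryPrimitives.WriteInt32LittleEndian(slot, entry.Length);
        entry.CopyTo(slot.Slice(HeaderSize)); // span-to-span copy, no pinning
        tail += needed;
        return offset;
    }

    // Reads back the payload of the entry whose header starts at offset.
    public ReadOnlySpan<byte> Read(int offset)
    {
        int length = BinaryPrimitives.ReadInt32LittleEndian(page.AsSpan(offset, HeaderSize));
        return page.AsSpan(offset + HeaderSize, length);
    }
}
```

Which copy strategy is faster in the real allocator is an empirical question, as the reply above notes; the sketch only shows the entry format.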

hiteshmadan · 2019-09-23T20:37:47Z

In Iterator.GetNext, there is a code path that allocates a byte[] and copies the read data to it.

1. Does this happen when the read and write are happening on the same shared buffer? If so, it'd be valuable to allow for the allocation to be via a memory pool since this is going to be a pretty frequent occurrence (reader keeping up with the writer).
   Please consider accepting an IMemoryPool and rent an array from that pool. You'll probably have to return an IMemoryOwner instead of a Span though.
2. In code paths where it doesn't allocate, is the span safe to keep around as a reference in the app code? Or does it point to a buffer that (eventually) can be reused for a different page?

tli2 · 2019-09-22T18:08:50Z

cs/src/core/Device/Devices.cs

        /// <returns>Device instance</returns>
-        public static IDevice CreateLogDevice(string logPath, bool preallocateFile = true, bool deleteOnClose = false, long capacity = CAPACITY_UNSPECIFIED)
+        public static IDevice CreateLogDevice(string logPath, bool preallocateFile = true, bool deleteOnClose = false, long capacity = CAPACITY_UNSPECIFIED, bool recoverDevice = false)


Anticipating the meta-device package, this static factory will not be sufficient to encapsulate the variety of devices created. Certainly not every device will care about all of these flags.

Maybe instead of adding another boolean flag, we should start moving away from it?

That makes sense, but the goal was to unblock current users because the recovery path was causing slowdown and incorrectness in certain cases. The goal was to make "no recovery" the default so current users default to that.

tli2 · 2019-09-23T21:37:41Z

cs/src/core/Device/LocalStorageDevice.cs

                if (segmentId != prevSegmentId + 1)
                {
                    startSegment = segmentId;
-
                }
                else
                {


Do we need to truncate the segments that are not tracked by this device?

You mean delete existing files when the device is created? That would depend upon whether the device is being used as new or for recovery.

tli2 · 2019-09-23T21:38:31Z

cs/src/core/Device/ManagedLocalStorageDevice.cs

@@ -48,14 +51,19 @@ private void RecoverFiles()

            string bareName = fi.Name;

-            int prevSegmentId = -1;
+            List<int> segids = new List<int>();


Maybe I am missing something, but it's unclear to me why the list is required. Is it just for readability?

The list of files comes in alphabetical order by default, not numerical. Caused a bug without sorting.
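The ordering problem is easy to reproduce: as strings, "hlog.log.10" sorts before "hlog.log.2". A minimal sketch of the fix, assuming segment files are named "<bareName>.<segmentId>" (the naming scheme and the SegmentFiles helper are assumptions for illustration, not code from this PR):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of FASTER.
public static class SegmentFiles
{
    // Extracts the numeric suffix after the last '.' from each file name
    // and returns the segment ids in numeric order. Names without a
    // numeric suffix are skipped.
    public static List<int> SortedSegmentIds(IEnumerable<string> fileNames)
    {
        var segids = new List<int>();
        foreach (string name in fileNames)
        {
            string suffix = name.Substring(name.LastIndexOf('.') + 1);
            if (int.TryParse(suffix, out int id)) segids.Add(id);
        }
        segids.Sort(); // numeric order, so 2 comes before 10
        return segids;
    }
}
```

With the ids sorted numerically, a check such as `segmentId != prevSegmentId + 1` can detect gaps reliably.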

tli2 · 2019-09-23T21:44:47Z

cs/src/core/Index/FasterLog/ILogCommitManager.cs

+        /// </summary>
+        /// <param name="address">Address committed until (for information only, not necessary to persist)</param>
+        /// <param name="commitMetadata">Commit metadata</param>
+        void Commit(long address, byte[] commitMetadata);


I am wondering if byte[] is too low level of an abstraction. Perhaps an interface with a custom serialization/deserialization method is a better choice?

I thought of that and even tried to prototype it but it was not as convenient in the end because the user had to implement too many callbacks. This can be revisited though.

hiteshmadan · 2019-09-23T22:05:36Z

cs/src/core/Index/FasterLog/FasterLog.cs

+        /// </summary>
+        /// <param name="entries"></param>
+        /// <returns>Logical address of last added entry</returns>
+        public unsafe long Append(List<byte[]> entries)


+1 to having a batch append API. However, this API forces too many allocations: one for the list, and one array allocation for each entry.
If the serialized byte array is generated by a serializer that pools its buffers (like Microsoft Bond), it will most likely return the result as an ArraySegment / Span / IMemoryOwner.

How does something like this look:

(note that Span cannot be used here in place of ArraySegment, details here: https://adamsitnik.com/Span/#span-must-not-be-a-generic-type-argument)

(this can be further optimized to ask the caller for the totalLength as a function parameter and do only one BlockAllocate outside the for loop, not sure if that's worth it)

public unsafe bool TryAppend<TState>(int numEntries, Func<int, TState, ArraySegment<byte>> getBytes, TState state, out long logicalAddress)
{
    epoch.Resume();
    logicalAddress = 0;
    long tail = -allocator.GetTailAddress();
    allocator.CheckForAllocateComplete(ref tail);
    if (tail < 0)
    {
        epoch.Suspend();
        return false;
    }
    for (int i = 0; i < numEntries; i++)
    {
        Span<byte> entry = getBytes(i, state);
        var length = entry.Length;
        BlockAllocate(4 + length, out logicalAddress);
        var physicalAddress = allocator.GetPhysicalAddress(logicalAddress);
        *(int*)physicalAddress = length;
        entry.CopyTo(new Span<byte>((void*)(4 + physicalAddress), length));
    }
    epoch.Suspend();
    return true;
}

As an aside: does the batch partially show up to the iterator while it's still being written? Or does the reader only see the first event from the batch after epoch.Suspend()?

@hiteshmadan
I don't understand everything that's going on in this PR, but if you want to use a span instead of an array segment, you just need to declare a custom delegate like

public delegate ReadOnlySpan<byte> ByteGetter<TState>(int index, TState state);

or something like it. That sidesteps the generic parameter constraint.
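To make the suggestion concrete, here is a small self-contained sketch of a batch helper driven by such a delegate. BatchCopy and CopyAll are hypothetical names used only for illustration; they copy into a caller-supplied buffer rather than into the log.

```csharp
using System;

// A custom delegate may return ReadOnlySpan<byte>, even though
// Func<int, TState, ReadOnlySpan<byte>> is not allowed.
public delegate ReadOnlySpan<byte> ByteGetter<TState>(int index, TState state);

// Hypothetical helper; not part of FASTER.
public static class BatchCopy
{
    // Copies numEntries entries back-to-back into destination and returns
    // the number of bytes written. Span.CopyTo throws ArgumentException
    // if destination is too small.
    public static int CopyAll<TState>(int numEntries, ByteGetter<TState> getBytes, TState state, Span<byte> destination)
    {
        int written = 0;
        for (int i = 0; i < numEntries; i++)
        {
            ReadOnlySpan<byte> entry = getBytes(i, state);
            entry.CopyTo(destination.Slice(written));
            written += entry.Length;
        }
        return written;
    }
}
```

A caller holding a `byte[][] rows` could invoke it as `BatchCopy.CopyAll(rows.Length, (i, s) => s[i], rows, buffer)`; passing the state explicitly keeps the lambda from capturing anything.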

After all the optimizations and inlining, there isn't much benefit to a batched interface in terms of performance, because acquire and release epoch - which were the only calls getting amortized - have been made very inexpensive. Also, we can't allocate for the batch in bulk because the page sizes are pretty small. Maybe we should stick to the unbatched interface?

hiteshmadan · 2019-09-23T23:56:23Z

cs/src/core/Index/FasterLog/FasterLog.cs

+                return false;
+            }
+            var length = entry.Length;
+            BlockAllocate(4 + length, out logicalAddress);


If there is already pending data waiting to be flushed + the incoming entry is big enough to not fit on the space remaining on the current page + no more pages can be allocated, BlockAllocate will block the thread waiting for the allocation to go through.

Is there any way for the early-exit check to account for entry.Length too?

We only made the common path non-blocking right now. Do we need it to always be non-blocking? The system atomically allocates space even when it's not yet ready to use, then waits for it to become usable. If we want it to be non-blocking, we will need to surface this in the API, i.e., TryAppend will return an incomplete address that the user will need to complete by calling TryCompleteAppend. Is this what you are looking for?

badrishc · 2019-09-24T05:36:21Z

> In Iterator.GetNext, there is a code path that allocates a byte[] and copies the read data to it.
>
> 1. Does this happen when the read and write are happening on the same shared buffer? If so, it'd be valuable to allow for the allocation to be via a memory pool since this is going to be a pretty frequent occurrence (reader keeping up with the writer).
>    Please consider accepting an IMemoryPool and rent an array from that pool. You'll probably have to return an IMemoryOwner instead of a Span though.
> 2. In code paths where it doesn't allocate, is the span safe to keep around as a reference in the app code? Or does it point to a buffer that (eventually) can be reused for a different page?

Correct, when reads and writes share the buffer, we have to copy out because we can only access memory safely under epoch protection. Agreed that memory pool is a possibility. If we had that, a cleaner design might be to always copy out to the user-provided buffer regardless of where we are reading from. This will address your concern in point 2 as well (currently, you must give up the span as soon as you invoke another GetNext).
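A minimal sketch of the "always copy out to pooled memory" idea, assuming the standard System.Buffers.MemoryPool<byte> type. PooledCopy is a hypothetical helper, not FASTER code. A rented buffer may be larger than requested, which is one reason to report the valid length separately.

```csharp
using System;
using System.Buffers;

// Hypothetical helper; not part of FASTER.
public static class PooledCopy
{
    // Copies source into a buffer rented from pool. The caller owns the
    // returned IMemoryOwner and must dispose it to return the buffer.
    // Only the first entryLength bytes of owner.Memory are valid.
    public static IMemoryOwner<byte> CopyOut(MemoryPool<byte> pool, ReadOnlySpan<byte> source, out int entryLength)
    {
        IMemoryOwner<byte> owner = pool.Rent(source.Length);
        source.CopyTo(owner.Memory.Span);
        entryLength = source.Length;
        return owner;
    }
}
```

Because the copy is owned by the caller, it stays valid across later reads until it is disposed, unlike a span that points into a shared page.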

badrishc · 2019-09-24T20:42:49Z

FYI, there is a check-in coming to reduce the number of pages needed in memory. This required a couple of internal changes for correctness.

* Adding support for low memory footprint (4 pages)
* Added support for odd-sized payloads in presence of holes in log
* Fixed concurrency issue that occurs with low num of pages
* Improved max throughput by eliminating a 10ms sleep in BlockAllocate
* Misc cleanup of logic to track flush and close addresses in log

Adding truly non-blocking TryAppend functionality. See sample for how this is used.

badrishc · 2019-09-30T17:51:15Z

We now have both low-mem support (as low as 4 pages) as well as truly non-blocking TryAppend.

We also have an async API variant for append (AppendAsync) that returns only when append flush is done, but this is a rough API that needs further feedback/review. There are also significant performance implications of using this interface. Find the prototype here: #180

* Added support for TryAppend. Removed List-based batch support.
* Added non-blocking TryAppend
* Added span variant
* Fix definition of SecondChanceFraction for read cache, to be 1 - MutableFraction of the log.
* Added async FlushAndCommit
* Added batched version by separating out in-memory append and wait for commit - gives better perf as the first operation is usually sync
* Tweak async sample to get back to 2GB/sec
* Other updates:
  1) Allocations can handle thousands of parallel tasks
  2) Removed concept of negative address - allocations are always over available pages
  3) Improved scan interface to allow user memory pooling
  4) Exposed commit task
  5) Cleaned up sample
* Added check for entry fitting on single page
* Added batch interface (sync and async) to log append.

badrishc · 2019-10-04T00:48:27Z

This PR is almost ready to merge, and definitely ready to try out. Features include:

(1) Support for blocking Append, TryAppend, and AppendAsync
(2) Support for appending byte[] and Span to log
(3) Log auto-commit at page boundary + user-controlled commit (e.g., every 5ms)
(4) Configurable commit provider (ILogCommitManager) that can write commit info anywhere, and hook in pre- and post-commit operations.
(5) Atomic commit of a batch of entries to log (ISpanBatch)
(6) Multi-threaded and multi-task appends supported
(7) Tailing iterator over the log
(8) Exposed CommitTask so users can await the next commit if they want
(9) Support for truncation of log (from the head)

See https://github.com/microsoft/FASTER/blob/master/docs/cs/FasterLog.md for a first draft of usage guide, and the comprehensive sample with comments at https://github.com/microsoft/FASTER/blob/fasterlog/cs/playground/FasterLogSample/Program.cs

Pending

We need to refine/enhance the ways that pooled Span can be provided to the iterator; currently we defined a delegate called GetMemory here. Any other ideas here?


badrishc · 2019-10-10T02:58:16Z

Summary of Interface to FasterLog

// Enqueue log entry (to memory) with spin-wait

long Enqueue(byte[] entry)
long Enqueue(ReadOnlySpan<byte> entry)
long Enqueue(IReadOnlySpanBatch readOnlySpanBatch)

// Try to enqueue log entry (to memory)

bool TryEnqueue(byte[] entry, out long logicalAddress)
bool TryEnqueue(ReadOnlySpan<byte> entry, out long logicalAddress)
bool TryEnqueue(IReadOnlySpanBatch readOnlySpanBatch, out long logicalAddress)

// Async enqueue log entry (to memory)

async ValueTask<long> EnqueueAsync(byte[] entry)
async ValueTask<long> EnqueueAsync(ReadOnlyMemory<byte> entry)
async ValueTask<long> EnqueueAsync(IReadOnlySpanBatch readOnlySpanBatch)

// Wait for commit

void WaitForCommit(long untilAddress = 0) // spin-wait
async ValueTask WaitForCommitAsync(long untilAddress = 0)

// Commit

void Commit(bool spinWait = false)
async ValueTask CommitAsync()

// Helper: enqueue log entry and spin-wait for commit

long EnqueueAndWaitForCommit(byte[] entry)
long EnqueueAndWaitForCommit(ReadOnlySpan<byte> entry)
long EnqueueAndWaitForCommit(IReadOnlySpanBatch readOnlySpanBatch)

// Helper: enqueue log entry and async wait for commit

async ValueTask<long> EnqueueAndWaitForCommitAsync(byte[] entry)
async ValueTask<long> EnqueueAndWaitForCommitAsync(ReadOnlyMemory<byte> entry)
async ValueTask<long> EnqueueAndWaitForCommitAsync(IReadOnlySpanBatch readOnlySpanBatch)

// Truncate log (from head)

void TruncateUntil(long untilAddress)

// Scan interface

FasterLogScanIterator Scan(long beginAddress, long endAddress)

// FasterLogScanIterator interface

bool GetNext(out byte[] entry, out int entryLength)
bool GetNext(MemoryPool<byte> pool, out IMemoryOwner<byte> entry, out int entryLength)
async ValueTask WaitAsync()

// Random read

async ValueTask<(byte[], int)> ReadAsync(long address, int estimatedLength = 0)


badrishc · 2019-10-16T17:14:12Z

Added checksum support for log verification during scan/read. Enable by setting FasterLogSettings.LogChecksum.
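For readers unfamiliar with XOR checksums (the commit list describes the per-entry checksum as an 8-byte XOR), the idea can be sketched as follows. This is an illustrative scheme under stated assumptions, not necessarily the exact computation FasterLog performs: XorChecksum is a hypothetical helper that folds the payload in 8-byte little-endian words and zero-pads a trailing partial word.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper; the word size, byte order, and padding are
// assumptions made for this sketch.
public static class XorChecksum
{
    // XORs the payload together as 8-byte little-endian words.
    public static ulong Compute(ReadOnlySpan<byte> payload)
    {
        ulong checksum = 0;
        int i = 0;
        for (; i + 8 <= payload.Length; i += 8)
            checksum ^= BinaryPrimitives.ReadUInt64LittleEndian(payload.Slice(i, 8));

        // Fold in the remaining 0-7 bytes as a zero-padded word.
        int shift = 0;
        for (; i < payload.Length; i++, shift += 8)
            checksum ^= (ulong)payload[i] << shift;

        return checksum;
    }

    // Returns true if the payload still matches the stored checksum.
    public static bool Verify(ReadOnlySpan<byte> payload, ulong expected) => Compute(payload) == expected;
}
```

An XOR checksum is cheap and catches any single flipped bit, but it cannot detect corruption that cancels out, such as the same bit position flipped in two different words.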

badrishc · 2019-10-18T16:47:01Z

Update: we are in the final stages of this PR. Working on correctly handling and surfacing exceptional cases, such as transient and permanent storage failures. This is WIP in a branch (fasterlog-exceptions). ETA within a couple of days.

Edit: exception support is taking a bit longer due to subtle corner cases involving parallel flush and error conditions. Will update here when merged.


badrishc · 2019-10-30T19:30:01Z

IAsyncEnumerable support has now been added for the iterator.

badrishc · 2019-10-30T21:37:53Z

In addition to the previously supported TruncateUntil on the log, we now support persistent iterators. You can create any number of iterators over the log, and "name" them if you need them to be part of commits. During recovery, if you create an iterator with the same name, we will resume iteration from the last committed iterator location. Example:

using (var iter = log.Scan(log.BeginAddress, long.MaxValue, name: "foo"))
   await foreach ((byte[] result, int length) in iter.GetAsyncEnumerable())
   {
      ...
   }

badrishc · 2019-10-30T23:05:27Z

Detailed documentation is now available at:

badrishc · 2019-10-30T23:45:06Z

Merging to master and closing PR as the functionality is complete at this point. We can continue discussions here, and create new PRs for further enhancements as well.

badrishc added 5 commits September 17, 2019 10:32

Initial checkin

16edd83

Updates.

3077f52

Updates

853b3ea

Cleaned up epochs, improved fine grain scalability.

6315a14

Fixing test change

19d5d82

Added commit and recovery support.

88d7269

mookid8000 mentioned this pull request Sep 19, 2019

Add FASTER log broker implementation mookid8000/Topos#2

Closed

Added TryAppend so users can implement log throttling.

ddcc338

hiteshmadan reviewed Sep 23, 2019

tli2 reviewed Sep 23, 2019

hiteshmadan reviewed Sep 23, 2019

badrishc added 2 commits September 26, 2019 00:09

Fasterlog TryAppend (#179)

ec2a3b5

Adding truly non-blocking TryAppend functionality. See sample for how this is used.

badrishc added 2 commits September 30, 2019 11:29

minor fix

bb4e357

merge

4504937

mookid8000 mentioned this pull request Oct 2, 2019

Connecting multiple buses to Fleet Manager in one process. rebus-org/FleetManager#1

Closed

badrishc added 3 commits October 3, 2019 17:48

Merge branch 'master' into fasterlog

002b993

Added tailing iterator WaitAsync to wait for iteration to proceed.

944504b

Merge branch 'fasterlog' of https://github.com/Microsoft/FASTER into fasterlog

540d1a5

badrishc added 6 commits October 7, 2019 18:01

Update next address of iterator if GetNext fails early.

0f33d4a

Added random read functionality (ReadAsync) for FasterLog. Moved GetMemory to FasterLogSettings instead of Scan. Speed up TruncateUntil. Updated nuspec.

2e59b43

Ensure begin addresses commit if needed, even when tail addresses do not change. Added CommittedBeginAddress metric.

aa4fef3

changed test project target

dfd683f

reverting test nuget version

8dbba0a

Updated random read example

4d1c9ea

badrishc added 5 commits October 10, 2019 12:00

Merge branch 'master' into fasterlog

66ee5d3

Use TrySetResult instead of SetResult, since log closure moves the task to completed state.

fd15349

Merge branch 'fasterlog' of https://github.com/Microsoft/FASTER into fasterlog

f120778

Added simple version/checksum to commit info.

15c418b

Added opt-in support for per-entry 8-byte checksum (xor) in header of entry.

4751080

badrishc added 3 commits October 21, 2019 14:47

Fixing issue with async enqueue.

70b4c72

Fixed testcase since thread abort not supported on some platforms.

20a7536

Fixing concurrency issue with contiguous partial flush requests. Removed spin-wait for adjacent flush completion.

64bbe14

badrishc mentioned this pull request Oct 29, 2019

Is it possible to use FASTER as a Producer/Consumer persistent storage? #185

Closed

badrishc added 3 commits October 29, 2019 17:16

Fasterlog exceptions (#189)

9435033

* Added storage exception handling, connecting to tasks.
* Cleanup of error handling, control when exception is bubbled up to user.
* Added yield in NeedToWait
* Improved iterator support in case of exception

Added async iterator support

e940a0a

Merging

78fd56b

badrishc added 2 commits October 30, 2019 14:10

Added support for persistent/recoverable named iterators.

8e175e0

Merge branch 'master' into fasterlog

147006c

Merge branch 'master' into fasterlog

53ad95a

Merge branch 'master' into fasterlog

819cee0

badrishc merged commit 4d90dac into master Oct 30, 2019


hiteshmadan Sep 23, 2019

badrishc Sep 23, 2019

hiteshmadan commented Sep 23, 2019

tli2 Sep 22, 2019

badrishc Sep 24, 2019

tli2 Sep 23, 2019

badrishc Sep 26, 2019

tli2 Sep 23, 2019

badrishc Sep 24, 2019

tli2 Sep 23, 2019

badrishc Sep 24, 2019

hiteshmadan Sep 23, 2019 (edited)

hiteshmadan Sep 24, 2019

AlgorithmsAreCool Sep 25, 2019

badrishc Sep 26, 2019

hiteshmadan Sep 23, 2019

badrishc Sep 26, 2019
