
PDP 40 (Consistent Order Guarantees for Storage Flushes)

Sachin Jayant Joshi edited this page Apr 7, 2022 · 2 revisions

Status: Draft
Author: Sachin Joshi
Spec Lead: Sachin Joshi
Related issues:

Summary

Motivation

During disaster recovery it is often necessary to recreate a consistent system state from the data found in long term storage (tier-2) and write ahead logs (tier-1). At the time of a disaster, segments have been flushed to long term storage at varying offsets and speeds, which means there are no consistency guarantees for related segments. This results in a huge use case/test case matrix to consider while designing recovery protocols (whether manual or automated).

This PDP outlines an approach to ensure consistent order guarantees for flushing segments to tier 2 to aid disaster recovery.

The focus of this PDP is a consistent order for the segments within a single container during a single flush cycle. This PDP does not cover any global (i.e., cross-container) order. Note that this PDP enables creating consistent global snapshots by providing consistent local order guarantees, but it does not actually describe or create a mechanism for such global snapshots.

Shortcomings of the current implementation

Currently, all segments are flushed to long term storage based on either size-based or time-based thresholds. In addition, all segments are flushed independently; there are no dependencies between them and hence no order guarantees.

This causes problems, especially for disaster recovery, because there are a number of scenarios in which the write ahead log may be lost (either fully or partially). In all of these scenarios the goal is to create a consistent system state by using the data stored on long term storage and reconciling it with minimal data loss. However, segments on long term storage may not be flushed, or may be flushed at different points in time, creating a large use case matrix to deal with.

  • There are a number of cases where auxiliary segments are associated with main segments, and any discrepancy between them needs to be resolved (e.g., a segment and its attributes).

  • Some system segments like metadata are critical.

    • Any missing data from them means the recovery tool has no data to guide the reconciliation.
    • If segments are deleted or truncated before the metadata is updated and flushed to tier-2, then the tool will find records in the metadata about segments that are themselves missing.

Key Design Changes

Key Concepts

Terminology

  • Flush iteration: A single iteration of the storage writer that flushes the contents of segments from the Write Ahead Log (tier-1) to Long Term Storage (tier-2).

  • Flush eligible segment: A segment is called eligible when it meets the criteria for flushing its writes to long term storage, e.g., when a size or time threshold is met.

  • Flush Priority: A priority number assigned to an eligible segment during a single flush iteration.

  • Flush Priority Bucket: A logical grouping of all eligible segments by their flush priority. All segments in a higher priority bucket are flushed before any segments from lower priority buckets.

  • Flush Order: The order in which segments must be flushed within a priority bucket.

  • System segments: All segments created under the _system scope. This includes all segments used internally by Pravega components (e.g., controller, client).

  • Container metadata segments : Any internal segments that store container metadata and other information that is critical to the correct working of a container.

  • User segments : All segments that are neither system segments nor container metadata segments.

  • Recorded Offset

    • Recorded Offset: Any offset of a segment as recorded in any container metadata or index.
    • Dependent segments: A metadata segment S1 is said to depend on segment S2 if S1 contains a recorded offset that points to an offset in S2.
  • Recovered Offset

    • Recovered Offset: Any Recorded Offset as actually found inside any container metadata or index that is present on LTS and recovered from LTS during the recovery process.
    • Maximum Recovered Length: The actual maximum offset up to which a segment's data is flushed to LTS.
    • Minimum Recovered Length: The actual minimum offset below which data is not present on LTS.
  • Guaranteed Offset: An offset in a segment at which data can actually be found on LTS.

  • Guaranteed Flushed Length: The offset below which all of a segment's data is guaranteed to be flushed to LTS.
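To make the terminology concrete, the segment categories and priority buckets could be modeled roughly as follows. This is a sketch only; `SegmentKind` and `FlushPriorityBucket` are hypothetical names, not Pravega classes, and the user > system > metadata ordering follows the Requirements section below.

```java
// Illustrative sketch only: SegmentKind and FlushPriorityBucket are
// hypothetical names, not Pravega classes. Lower ordinal = flushed earlier.
enum SegmentKind { USER, SYSTEM, CONTAINER_METADATA }

enum FlushPriorityBucket {
    USER_DATA,          // user segments: highest priority, flushed first
    SYSTEM_SEGMENTS,    // segments under the _system scope
    CONTAINER_METADATA; // container metadata segments: always flushed last

    static FlushPriorityBucket forKind(SegmentKind kind) {
        switch (kind) {
            case USER:               return USER_DATA;
            case SYSTEM:             return SYSTEM_SEGMENTS;
            case CONTAINER_METADATA: return CONTAINER_METADATA;
            default: throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }
}
```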

Definition of consistency

  • The flush is said to be consistent when the following conditions are met:
    • Any Recovered Offset from any metadata segment must be a Guaranteed Offset.
    • All offsets between the Minimum Recovered Length and the Maximum Recovered Length must be Guaranteed Offsets (there must be no gaps, i.e., no offsets at which data cannot be found).

(In other words, if metadata points to some offset, then data is guaranteed to be present at that offset.)
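As a sketch, the two consistency conditions could be checked at recovery time as shown below. All names are hypothetical, and the offset-by-offset gap scan is for clarity rather than efficiency.

```java
import java.util.Collection;
import java.util.function.LongPredicate;

// Hypothetical recovery-time check of the two consistency conditions above.
// All names are illustrative; a real check would use range arithmetic rather
// than scanning every offset.
class ConsistencyCheck {
    static boolean isConsistent(Collection<Long> recordedOffsets,
                                long minRecoveredLength,
                                long maxRecoveredLength,
                                LongPredicate isGuaranteedOffset) {
        // Condition 1: every recovered recorded offset must be a Guaranteed Offset.
        for (long offset : recordedOffsets) {
            if (!isGuaranteedOffset.test(offset)) {
                return false;
            }
        }
        // Condition 2: no gaps between Minimum and Maximum Recovered Length.
        for (long offset = minRecoveredLength; offset < maxRecoveredLength; offset++) {
            if (!isGuaranteedOffset.test(offset)) {
                return false;
            }
        }
        return true;
    }
}
```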

Proposal

The solution is to introduce a consistent order for flushing segments within a single container during each flush cycle.

Enforcing a consistent order allows us to make strong assumptions about the contents of other segments based on the relationships between them.

Although not required for consistency per se, we also flush system and metadata segments in every flush cycle to minimize the loss of system state.

Flush order and consistency

  • If we guarantee that dependent segments are flushed only after the segments they depend on, then during recovery any recorded offset we actually recover from the contents of a dependent segment is guaranteed to point to data that was flushed earlier.

Below are the proposed rules.

1. Always flush the container metadata segments on every iteration, and always flush them last.

  • Irrespective of thresholds, container metadata segments are always flushed in every flush cycle.
  • The contents of the metadata table segments are critical to the functioning of the container, as they are used to access the rest of the segments; any missing metadata or inconsistency with the actual data on long term storage will cause critical failures of the system.
  • The container metadata segments are flushed last so that all changes made to LTS during the flush cycle are always reflected in the segment metadata. When a record is found in metadata, we are guaranteed that the reality on long term storage is consistent with it.
  • In terms of performance, the assumption is that the cost of frequent flushes is amortized by the large number of segments and/or big writes carrying the actual user data.

2. Always flush all system segments (all segments under the _system scope) on every iteration.

  • Irrespective of thresholds, all system segments are always flushed in every flush cycle.
  • The contents of system segments are critical to the functioning of Pravega components; any missing data or inconsistency with the actual data on long term storage can cause critical failures of the system.
  • There may be an additional sequence enforced on system segments.
  • In terms of performance, the assumption is that the cost of frequent flushes is amortized by the large number of segments and/or big writes carrying the actual user data.

3. For table segments, always flush the data segment before the index segment.

  • This avoids a situation where the index points to data that is not yet written to tier-2.
  • This may mean the data segment is flushed even if its threshold is not met.

4. For all segments, always flush the data segment whenever the attribute segment is flushed.

  • This avoids a situation where an attribute points to data that is not yet written to long term storage.
  • This may mean data is flushed even if the threshold is not met.

5. Delayed delete and truncate.

  • The delete and truncate operations are split into two distinct phases, executed in two successive flush cycles. This avoids situations where a segment is deleted or truncated but the updated metadata is not yet flushed to long term storage.
  • Mark the segment deleted or truncated in metadata in the first iteration.
  • Actually delete/truncate in the second iteration.
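The two-phase delete could be sketched as follows (truncate is analogous). Class and method names are hypothetical, and the sketch assumes `endFlushCycle` runs after the cycle's metadata flush has reached LTS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of delayed delete; names are illustrative, not
// Pravega's. Phase 1 only records the intent in container metadata; phase 2
// runs in a later flush cycle, after the updated metadata is already on LTS.
class DelayedDeleter {
    private final List<String> markedThisCycle = new ArrayList<>();
    private final List<String> readyToExecute = new ArrayList<>();
    final List<String> ltsOperations = new ArrayList<>(); // what actually hit LTS

    // Phase 1 (cycle N): mark the segment deleted in container metadata only.
    void markDeleted(String segment) {
        markedThisCycle.add(segment);
    }

    // Invoked at the end of each flush cycle, after metadata has reached LTS.
    void endFlushCycle() {
        // Phase 2: execute deletes that were marked in a *previous* cycle.
        for (String segment : readyToExecute) {
            ltsOperations.add("delete " + segment);
        }
        readyToExecute.clear();
        readyToExecute.addAll(markedThisCycle);
        markedThisCycle.clear();
    }
}
```

Note that a segment marked in cycle N is physically deleted only at the end of cycle N+1, so the metadata recording the deletion always reaches LTS first.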

Requirements

  1. All modified system segments are eligible in every flush iteration irrespective of size of accumulated data or time since it was last flushed.
  2. All modified metadata segments are eligible in every flush iteration irrespective of size of accumulated data or time since it was last flushed.
  3. A segment and its attribute segment both become eligible when either of them is eligible.
  4. All eligible segments are grouped into priority buckets during each flush iteration.
  5. All eligible segments in a higher priority bucket are flushed before any segments from lower priority buckets are flushed.
  6. Any User segments have higher priority than system segments.
  7. Any system segments have higher priority than metadata segments.
  8. Any metadata segments must have the lowest priority and must be flushed last in every iteration.
  9. Within any priority bucket, any data segment must be flushed before its corresponding index or attribute segment.
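A minimal sketch of requirements 4, 5, and 9 — grouping eligible segments into buckets and ordering auxiliary segments after data segments — might look like this, assuming illustrative types:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of requirements 4, 5 and 9; all types are illustrative.
// bucket: 0 = user, 1 = system, 2 = container metadata (lower number =
// higher priority, flushed earlier). Auxiliary segments (index/attribute)
// sort after data segments within the same bucket.
class FlushPlanner {
    static final class Entry {
        final String name;
        final int bucket;
        final boolean auxiliary;

        Entry(String name, int bucket, boolean auxiliary) {
            this.name = name;
            this.bucket = bucket;
            this.auxiliary = auxiliary;
        }
    }

    static List<String> planOrder(List<Entry> eligible) {
        List<Entry> sorted = new ArrayList<>(eligible);
        // Requirement 5: higher-priority buckets flush first.
        // Requirement 9: within a bucket, data segments flush before
        // index/attribute segments.
        sorted.sort(Comparator.comparingInt((Entry e) -> e.bucket)
                              .thenComparing((Entry e) -> e.auxiliary));
        List<String> order = new ArrayList<>();
        for (Entry e : sorted) {
            order.add(e.name);
        }
        return order;
    }
}
```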

Analysis of Order Guarantees

Consistency Guarantees

  • If we guarantee that dependent segments are flushed after the segments they depend on, then during recovery any recorded offset we actually recover from the contents of dependent segment is guaranteed to be pointing to the data flushed earlier.
  • Therefore, if metadata segments are always flushed last, their contents always reflect all changes resulting from all earlier flush operations, up to the point when the metadata segments were flushed.
  • Therefore, all lengths recorded in metadata segments are guaranteed to be below the Guaranteed Flushed Length.
  • Therefore, all lengths recorded in system segments are guaranteed to be below the Guaranteed Flushed Length.
  • Therefore, all lengths recorded in index segments are guaranteed to be below the Guaranteed Flushed Length.

Liveness Guarantees

  • All data is flushed eventually.

API Changes

None

Internal Changes

The flush algorithm must be changed to reflect this design.

Flush algorithm

for each iteration {
    for each priority bucket {
        calculate the set of eligible segments in the priority bucket
        for each eligible segment, flush in the following order {
            flush the segment
            flush the index segments associated with it
            flush the attribute segment associated with it
        }
    }
}
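The pseudocode above might translate into something like the following sketch. The `Segment` type and method names are illustrative placeholders, not the actual StorageWriter API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative-only translation of the flush-algorithm pseudocode; the
// Segment type is a placeholder, not Pravega's StorageWriter API. Buckets
// are iterated highest-priority first; within a bucket, each data segment
// is flushed before its index and attribute segments.
class FlushIteration {
    static final class Segment {
        final String name;
        final List<Segment> indexSegments;
        final Segment attributeSegment; // may be null

        Segment(String name, List<Segment> indexSegments, Segment attributeSegment) {
            this.name = name;
            this.indexSegments = indexSegments;
            this.attributeSegment = attributeSegment;
        }
    }

    static List<String> runIteration(List<List<Segment>> priorityBuckets) {
        List<String> flushed = new ArrayList<>();
        for (List<Segment> bucket : priorityBuckets) {        // highest priority first
            for (Segment segment : bucket) {                  // eligible segments only
                flushed.add(segment.name);                    // 1. the data segment itself
                for (Segment index : segment.indexSegments) { // 2. its index segments
                    flushed.add(index.name);
                }
                if (segment.attributeSegment != null) {       // 3. its attribute segment
                    flushed.add(segment.attributeSegment.name);
                }
            }
        }
        return flushed;
    }
}
```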

Key Concerns

Compatibility and Migration

Not applicable

Performance

This design will result in frequent and smaller writes to tier-2 for system and critical segments.

Appendix

Discarded Approaches

References
