RFC: Unified changelog stream schema #72
Conversation
- [ ] Forwards-compatible

## Summary
Introduces a new `op` schema column that will be used across all datasets to differentiate regular appends, corrections, and retractions. To represent corrections, a two-event "changelog stream" data model similar to Apache Flink's will be used.
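To illustrate the model, here is a minimal sketch of how a consumer could fold such a changelog stream into current state. The numeric op codes, event shape, and the `materialize` helper are assumptions for illustration only; the actual values and column layout are defined by the RFC itself:

```python
# Hypothetical op codes mirroring Flink's changelog row kinds; the real
# values are specified by the RFC, not here.
OP_APPEND = 0        # a regular append of a new record
OP_RETRACT = 1       # retracts a previously appended record
OP_CORRECT_FROM = 2  # first event of a correction pair (the old values)
OP_CORRECT_TO = 3    # second event of a correction pair (the new values)

def materialize(events):
    """Fold an ordered changelog stream into current state, keyed by `key`."""
    state = {}
    for op, key, value in events:
        if op in (OP_APPEND, OP_CORRECT_TO):
            state[key] = value
        elif op in (OP_RETRACT, OP_CORRECT_FROM):
            state.pop(key, None)
    return state

events = [
    (OP_APPEND, "a", 1),
    (OP_APPEND, "b", 2),
    # A correction is represented as a pair of events: old value, then new
    (OP_CORRECT_FROM, "a", 1),
    (OP_CORRECT_TO, "a", 10),
    # A retraction removes a record entirely
    (OP_RETRACT, "b", 2),
]
print(materialize(events))  # {'a': 10}
```

Note that correctness depends on the events being consumed in stream order, which is the ordering guarantee discussed below.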
hey @sergiimk! I think it's a good feature to have, but we need to be very careful as there can be some challenges with the implementation:
- order of events: if events are processed out of order, it can lead to inconsistencies in the data. We would need a complex mechanism to avoid this.
- schema evolution.
- replaying and rollback.
- if datasets are large and constantly changing, we need storage, indexing, etc.
- operational complexity: monitoring and maintenance, handling failures, retries, and ensuring high availability.

But I encourage you to try out this implementation.
- Events in ODF datasets are strictly ordered, so an engine simply must consume events in the same order as they appear in the Parquet file to avoid any inconsistencies. ODF of course cannot guarantee that all events are ordered by `event_time` (backfills can still happen), which is why we use stream processing, which can provide consistent results even in the presence of out-of-order events.
- Schema evolution is on our radar (Schema evolution #67) - I think this RFC does not make it any more difficult.
- We can implement rollbacks on several levels:
  - a dataset can be reset to a previous block (similar to `git reset --hard`), which erases part of the history
  - or we could issue retractions for a batch of individual events (this is history-preserving)
- Replays seem to me to be mostly an ingestion-time concern. If we realize that some derivative dataset is faulty, we can always just start over, because root datasets by default preserve all history.
- For storage we are planning to introduce periodic compactions (Compactions #45). Re. indexing - the only index we have in ODF datasets is time (apart from built-in Parquet headers). To speed up other queries I think we will need the ability to maintain state projections (#75).
- I don't think this RFC affects any of that. Those are probably out of scope for the ODF spec as such - we just need to be sure that the ODF protocol doesn't jeopardize any of those properties for implementors.
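The history-preserving rollback option mentioned above can be sketched as follows: instead of erasing blocks, the system appends one retraction event per record being rolled back, so the rollback itself remains visible in the dataset's history. The op codes, event shape, and `retract_batch` helper are illustrative assumptions, not part of the RFC:

```python
OP_APPEND = 0   # assumed op code for a regular append
OP_RETRACT = 1  # assumed op code for a retraction

def retract_batch(batch):
    """Produce retraction events that undo a batch of appends.

    Rather than erasing history (the `git reset --hard` analogy), we emit
    one retraction per original append, in reverse order, so the rollback
    is itself recorded as new events in the stream.
    """
    return [
        (OP_RETRACT, key, value)
        for op, key, value in reversed(batch)
        if op == OP_APPEND
    ]

batch = [(OP_APPEND, "a", 1), (OP_APPEND, "b", 2)]
print(retract_batch(batch))  # [(1, 'b', 2), (1, 'a', 1)]
```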
Rendered RFC
Closes #47