RFC: Unified changelog stream schema #72

Merged
sergiimk merged 1 commit into breaking-changes from rfc/unified-changelog-stream-schema on Jan 11, 2024


@sergiimk sergiimk commented Dec 19, 2023

Rendered RFC

Closes #47

@sergiimk sergiimk force-pushed the rfc/unified-changelog-stream-schema branch from 34a884a to 6e30515 Compare December 20, 2023 00:16
@sergiimk sergiimk force-pushed the rfc/unified-changelog-stream-schema branch from 6e30515 to b24257b Compare January 4, 2024 01:27
@sergiimk sergiimk changed the base branch from master to breaking-changes January 4, 2024 01:28
@sergiimk sergiimk marked this pull request as ready for review January 4, 2024 01:31
@sergiimk sergiimk force-pushed the rfc/unified-changelog-stream-schema branch 4 times, most recently from 9f88800 to 82760d9 Compare January 4, 2024 16:59
@sergiimk sergiimk requested a review from s373r January 4, 2024 17:30
- [ ] Forwards-compatible

## Summary
Introduces new `op` schema column that will be used across all datasets to differentiate regular appends, corrections, and retractions. To represent corrections a two-event "changelog stream" data model similar to Apache Flink's will be used.

@Wizzy-wooz Wizzy-wooz Jan 5, 2024


Hey @sergiimk! I think it's a good feature to have, but we need to be very careful, as there can be some challenges with the implementation:

  1. Order of events: if events are processed out of order, it can lead to inconsistencies in the data. We need a complex mechanism to avoid this.
  2. Schema evolution.
  3. Replaying and rollback.
  4. If datasets are large and constantly changing, we need to think about storage, indexing, etc.
  5. Operational complexity: monitoring and maintenance, handling failures, retries, and ensuring high availability.

But I encourage trying out this implementation.

Contributor Author


  1. Events in ODF datasets are strictly ordered, so an engine only needs to consume events in the order they appear in the Parquet file to avoid inconsistencies.

ODF of course cannot guarantee that all events are ordered by `event_time` (backfills can still happen), which is why we use stream processing: it can provide consistent results even in the presence of out-of-order events.
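
To illustrate the ordering point, here is a minimal sketch of replaying a changelog stream in its recorded order. Everything in it is an assumption made for illustration: the `Op` codes, the `Event` fields, and the `materialize` helper are hypothetical names, not taken from the RFC text or from any ODF implementation. It models a correction as a two-event pair (the old value is withdrawn, then the new value is emitted), in the spirit of the Flink-like model the RFC summary mentions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    # Hypothetical operation codes, chosen only for this sketch.
    APPEND = "+A"
    RETRACT = "-R"
    CORRECT_FROM = "-C"  # first half of a correction: withdraws the old value
    CORRECT_TO = "+C"    # second half of a correction: emits the new value


@dataclass(frozen=True)
class Event:
    offset: int       # position in the dataset; defines the processing order
    op: Op
    event_time: str   # may be out of order relative to offset (backfills)
    key: str
    value: int


def materialize(events):
    """Replay events strictly by offset and return the surviving records."""
    state = Counter()
    for ev in sorted(events, key=lambda e: e.offset):
        record = (ev.event_time, ev.key, ev.value)
        if ev.op in (Op.APPEND, Op.CORRECT_TO):
            state[record] += 1
        else:
            # RETRACT and CORRECT_FROM must refer to a record seen earlier.
            if state[record] <= 0:
                raise ValueError(f"offset {ev.offset}: nothing to withdraw")
            state[record] -= 1
            if state[record] == 0:
                del state[record]
    return sorted(state)


events = [
    Event(0, Op.APPEND, "2024-01-02", "a", 10),
    Event(1, Op.APPEND, "2024-01-01", "b", 5),        # backfill: older event_time
    Event(2, Op.CORRECT_FROM, "2024-01-02", "a", 10),
    Event(3, Op.CORRECT_TO, "2024-01-02", "a", 11),
    Event(4, Op.RETRACT, "2024-01-01", "b", 5),
]
print(materialize(events))  # [('2024-01-02', 'a', 11)]
```

The sketch orders by `offset` rather than `event_time`, so the backfilled record at offset 1 is handled without any special case, while a retraction that arrives before its append is rejected.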

  2. Schema evolution is on our radar (Schema evolution #67). I don't think this RFC makes it any more difficult.

  3. We can implement rollbacks on several levels:

    • dataset can be reset to a previous block (similar to git reset --hard) which erases part of the history
    • or we could issue retractions for a batch of individual events (this is history-preserving)
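
The two rollback levels above can be contrasted in a small sketch. This is only an illustration under stated assumptions: the log is modeled as a plain list, offsets stand in for blocks, and `Event`, `reset_hard`, and `rollback_by_retraction` are hypothetical names rather than ODF APIs. The `"+A"` and `"-R"` codes are likewise placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    offset: int
    op: str        # hypothetical codes: "+A" append, "-R" retract
    payload: dict


def reset_hard(log, last_good_offset):
    """History-erasing rollback: drop everything after the given offset."""
    return [ev for ev in log if ev.offset <= last_good_offset]


def rollback_by_retraction(log, bad_offsets):
    """History-preserving rollback: append one retraction per bad event."""
    next_offset = log[-1].offset + 1 if log else 0
    retractions = []
    for ev in log:
        if ev.offset in bad_offsets and ev.op == "+A":
            retractions.append(Event(next_offset, "-R", ev.payload))
            next_offset += 1
    # The original events stay in place; only new events are added.
    return log + retractions


log = [
    Event(0, "+A", {"city": "A", "population": 100}),
    Event(1, "+A", {"city": "B", "population": -1}),   # faulty
    Event(2, "+A", {"city": "C", "population": -7}),   # faulty
]

erased = reset_hard(log, last_good_offset=0)
preserved = rollback_by_retraction(log, bad_offsets={1, 2})
print(len(erased), len(preserved))  # 1 5
```

In the second variant a downstream consumer still sees the faulty events followed by their retractions, so it can undo their effect without the history being rewritten.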

Replays seem to me to be mostly an ingestion-time concern. If we realize that some derivative dataset is faulty, we can always just start over, because root datasets preserve all history by default.

  4. For storage we are planning to introduce periodic compactions (Compactions #45).

Regarding indexing: the only index we have in ODF datasets is time (apart from built-in Parquet headers). To speed up other queries, I think we will need the ability to maintain state projections (#75).

  5. I don't think this RFC affects any of that. Those concerns are probably out of scope for the ODF spec as such; we just need to be sure that the ODF protocol doesn't jeopardize any of those properties for implementors.

@sergiimk sergiimk force-pushed the rfc/unified-changelog-stream-schema branch from c5a03f6 to 4f876c8 Compare January 11, 2024 18:22
@sergiimk sergiimk merged commit 4f876c8 into breaking-changes Jan 11, 2024
@sergiimk sergiimk deleted the rfc/unified-changelog-stream-schema branch January 11, 2024 18:24
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Corrections and retractions
2 participants