RFC: Unified changelog stream schema #72

sergiimk · 2023-12-19T01:28:19Z

Closes #47

Wizzy-wooz · 2024-01-05T04:49:57Z

rfcs/015-unified-changelog-stream-schema.md

+- [ ] Forwards-compatible
+
+## Summary
+Introduces new `op` schema column that will be used across all datasets to differentiate regular appends, corrections, and retractions. To represent corrections a two-event "changelog stream" data model similar to Apache Flink's will be used.


Hey @sergiimk! I think this is a good feature to have, but we need to be careful, as the implementation raises some challenges:

- Order of events: if events are processed out of order, the data can become inconsistent, so we need a mechanism to prevent this.
- Schema evolution.
- Replaying and rollback.
- Storage and indexing: if datasets are large and change constantly, we need to plan for storage, indexing, etc.
- Operational complexity: monitoring and maintenance, including handling failures, retries, and ensuring high availability.

That said, I encourage trying out this implementation.

sergiimk · 2024-01-05

Events in ODF datasets are strictly ordered, so an engine only needs to consume events in the order they appear in the Parquet file to avoid inconsistencies.

ODF cannot, of course, guarantee that all events are ordered by `event_time` (backfills can still happen), which is why we use stream processing: it provides consistent results even in the presence of out-of-order events.
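To make the ordering point concrete, here is a minimal sketch of folding a changelog stream into current state. It assumes a simplified in-memory record and hypothetical op markers (`+A`, `-R`, `-C`, `+C`); neither the `Event` type nor `materialize` comes from the RFC or any ODF engine. Records are applied strictly in stream order, while `event_time` is free to go backwards.

```python
from dataclasses import dataclass

# Hypothetical op markers, chosen for illustration only.
APPEND, RETRACT, CORRECT_FROM, CORRECT_TO = "+A", "-R", "-C", "+C"


@dataclass(frozen=True)
class Event:
    op: str
    event_time: str  # ISO date; may be out of order because of backfills
    key: str
    value: int


def materialize(events):
    """Fold a changelog stream into current state, in stream order."""
    state = {}
    for ev in events:
        if ev.op in (APPEND, CORRECT_TO):
            state[ev.key] = ev.value
        elif ev.op in (RETRACT, CORRECT_FROM):
            state.pop(ev.key, None)
        else:
            raise ValueError(f"unknown op: {ev.op}")
    return state


events = [
    Event(APPEND, "2024-01-02", "a", 1),
    Event(APPEND, "2024-01-01", "b", 5),        # backfill: older event_time, later position
    Event(CORRECT_FROM, "2024-01-02", "a", 1),  # correction, part 1: the old value
    Event(CORRECT_TO, "2024-01-02", "a", 2),    # correction, part 2: the new value
    Event(RETRACT, "2024-01-01", "b", 5),
]

state = materialize(events)                      # applied in stream order
shuffled = materialize(list(reversed(events)))   # same records, wrong order
```

In this sketch `state` ends up holding only the corrected value for `a`, while `shuffled` diverges, which is the inconsistency that consuming records in file order avoids.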

Schema evolution is on our radar (Schema evolution #67). I don't think this RFC makes it any more difficult.

We can implement rollbacks on several levels:

- A dataset can be reset to a previous block (similar to `git reset --hard`), which erases part of the history.
- Or we could issue retractions for a batch of individual events, which preserves history.
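The difference between the two rollback levels can be sketched as follows. This is an illustration under simplifying assumptions: a dataset is modelled as a plain list of blocks, each block as a list of record dicts, and the `+A` / `+C` / `-R` op strings as well as the `reset_to` and `retract_batch` helpers are hypothetical names, not part of ODF.

```python
def reset_to(blocks, n):
    """History-erasing rollback: keep only the first n blocks."""
    return blocks[:n]


def retract_batch(events):
    """History-preserving rollback: emit one retraction per added record,
    newest first, to be appended as a new block."""
    return [
        {**ev, "op": "-R"}
        for ev in reversed(events)
        if ev["op"] in ("+A", "+C")
    ]


blocks = [
    [{"op": "+A", "key": "a", "value": 1}],
    [{"op": "+A", "key": "b", "value": 2}],  # the block we want to undo
]

erased = reset_to(blocks, 1)                          # second block is gone
preserved = blocks + [retract_batch(blocks[1])]       # second block stays, plus its retraction
```

With the reset, downstream consumers that already read the second block have no record of why it disappeared; with retractions, the undo itself is part of the stream they consume.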

Replays seem to me to be mostly an ingestion-time concern. If we realize that some derivative dataset is faulty, we can always just start over, because root datasets preserve all history by default.

For storage we are planning to introduce periodic compactions (Compactions #45).

Regarding indexing: the only index we have in ODF datasets is time (apart from built-in Parquet headers). To speed up other queries, I think we will need the ability to maintain state projections (#75).

I don't think this RFC affects any of that. Those concerns are probably out of scope for the ODF spec as such; we just need to make sure the ODF protocol doesn't jeopardize any of those properties for implementors.

sergiimk requested a review from zaychenko-sergei December 19, 2023 01:28

sergiimk force-pushed the rfc/unified-changelog-stream-schema branch from 34a884a to 6e30515 Compare December 20, 2023 00:16

sergiimk force-pushed the rfc/unified-changelog-stream-schema branch from 6e30515 to b24257b Compare January 4, 2024 01:27

sergiimk changed the base branch from master to breaking-changes January 4, 2024 01:28

sergiimk marked this pull request as ready for review January 4, 2024 01:31

sergiimk requested a review from Wizzy-wooz January 4, 2024 01:31

sergiimk force-pushed the rfc/unified-changelog-stream-schema branch 4 times, most recently from 9f88800 to 82760d9 Compare January 4, 2024 16:59

sergiimk requested a review from s373r January 4, 2024 17:30

Wizzy-wooz reviewed Jan 5, 2024

sergiimk force-pushed the rfc/unified-changelog-stream-schema branch from 82760d9 to f1f5511 Compare January 6, 2024 01:10

sergiimk mentioned this pull request Jan 6, 2024

Unified changelog schema kamu-data/kamu-engine-flink#10

Merged

sergiimk force-pushed the rfc/unified-changelog-stream-schema branch 3 times, most recently from 13c0c36 to c5a03f6 Compare January 7, 2024 02:33

This was referenced Jan 9, 2024

Unified changelog schema support kamu-data/kamu-engine-datafusion#7

Merged

Unified changelog schema support kamu-data/kamu-engine-spark#9

Closed

Unified changelog schema kamu-data/kamu-cli#431

Merged

RFC: Unified changelog stream schema

4f876c8

sergiimk force-pushed the rfc/unified-changelog-stream-schema branch from c5a03f6 to 4f876c8 Compare January 11, 2024 18:22

sergiimk merged commit 4f876c8 into breaking-changes Jan 11, 2024

sergiimk deleted the rfc/unified-changelog-stream-schema branch January 11, 2024 18:24

sergiimk mentioned this pull request Jan 11, 2024

Corrections and retractions #47

Closed
