Merge batches from multiple datafiles in the same Fragment by eddyxu · Pull Request #815 · lance-format/lance

eddyxu · 2023-04-30T00:32:30Z

No description provided.

changhiskhan

minor comments/questions

changhiskhan · 2023-05-03T22:14:24Z

+
+    def add_columns(
+        self,
+        value_func: Callable[[pa.RecordBatch], pa.RecordBatch],


So the value_func's output will be the augmented Fragment that then gets written as the new version, right? I'm going to assume the API will improve in subsequent PRs.

So the idea is that value_func will take care of matching the input data and the merge data?

The value_func will be "the producer" of the new data. This is similar to a pandas UDF in Spark: it takes a "RecordBatch" as input and creates a same-size "RecordBatch" as output.
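To make the contract concrete, here is a minimal sketch of a `value_func` in the shape described above. It is an illustration, not code from this PR. Plain dicts of equal-length column lists stand in for `pa.RecordBatch` so the example runs without dependencies, and `add_area`, `num_rows`, and `apply_value_func` are hypothetical names.

```python
from typing import Callable, Dict, List

# Stand-in for pa.RecordBatch (assumption): column name -> equal-length values.
Batch = Dict[str, List[float]]


def num_rows(batch: Batch) -> int:
    """Row count of a batch; 0 for a batch with no columns."""
    return len(next(iter(batch.values()), []))


def add_area(batch: Batch) -> Batch:
    """Hypothetical value_func: produce one new column from two existing ones."""
    return {"area": [w * h for w, h in zip(batch["width"], batch["height"])]}


def apply_value_func(batch: Batch, value_func: Callable[[Batch], Batch]) -> Batch:
    """Call value_func and enforce the same-size rule on its output."""
    new_columns = value_func(batch)
    if num_rows(new_columns) != num_rows(batch):
        raise ValueError("value_func must return as many rows as it was given")
    return new_columns


batch = {"width": [1.0, 2.0, 3.0], "height": [4.0, 5.0, 6.0]}
new_columns = apply_value_func(batch, add_area)
```

The function only produces the new column. Lining that column up with the existing rows is left to the caller, which is why the row counts must match.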

changhiskhan · 2023-05-03T22:18:11Z

    }

+    /// Create a new version of [`Dataset`] from a collection of fragments.
+    pub async fn create_version_from_fragments(


could we use this to manually create a version after distributed writes?

Yes, this is exactly the purpose of exposing this function as pub.
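The workflow this answer points at can be sketched as follows: independent workers each write data and report fragment metadata, and a single coordinator then commits all of it as one new version. This is a conceptual model only. `Fragment`, `Manifest`, `write_fragment`, and `commit_fragments` are hypothetical names, the file paths are made up, and nothing here reproduces the signature of `create_version_from_fragments`, which is not shown in full in this thread.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment metadata: an id plus the data files it owns."""
    id: int
    files: Tuple[str, ...]


@dataclass(frozen=True)
class Manifest:
    """Hypothetical dataset version: a version number and its fragments."""
    version: int
    fragments: Tuple[Fragment, ...]


def write_fragment(worker_id: int) -> Fragment:
    """Pretend a worker wrote one data file and reports it back."""
    return Fragment(id=worker_id, files=(f"data/worker-{worker_id}.lance",))


def commit_fragments(current: Manifest, fragments: List[Fragment]) -> Manifest:
    """Single commit step: publish all fragments as the next version."""
    ordered = tuple(sorted(fragments, key=lambda f: f.id))
    return Manifest(version=current.version + 1, fragments=ordered)


v1 = Manifest(version=1, fragments=())
written = [write_fragment(w) for w in (2, 0, 1)]  # workers finish in any order
v2 = commit_fragments(v1, written)
```

Keeping the commit as one step on the coordinator means readers see either the old version or the complete new one, never a partially written set of fragments.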

changhiskhan · 2023-05-03T22:26:47Z

-                )
-                .await
-                {
+                let file_fragment = FileFragment::new(dataset.clone(), frag.clone());


Nice, the FileFragment abstraction really simplifies this code.

changhiskhan · 2023-05-03T22:36:47Z

+        // TODO: use tokio::async buffer to make parallel reads.
+        let mut batches = vec![];
+        for (reader, schema) in self.readers.iter() {
+            let batch = reader


can these readers be constructed with the schema so you don't need to pass tuples around?

It can be done in another refactor. Currently FileReader does take the schema. Could we make FileReader hold the projection schema?

it's fine, not a blocker 🤷
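The loop under discussion reads one batch per data file and then merges them into a single batch. Below is a minimal sketch of that idea, with the parallel reads from the TODO added via a thread pool. It is not the Rust implementation from the PR. Dicts of column lists stand in for record batches, and `DataFileReader`, `read_range`, `merge_batches`, and `read_merged` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Batch = Dict[str, list]  # stand-in for a record batch (assumption)


class DataFileReader:
    """Hypothetical reader over one data file holding a subset of columns."""

    def __init__(self, columns: Batch) -> None:
        self.columns = columns

    def read_range(self, start: int, stop: int) -> Batch:
        """Return rows [start, stop) for every column in this file."""
        return {name: values[start:stop] for name, values in self.columns.items()}


def merge_batches(batches: List[Batch]) -> Batch:
    """Combine batches column-wise; all of them must cover the same rows."""
    merged: Batch = {}
    expected = None
    for batch in batches:
        for name, values in batch.items():
            if expected is None:
                expected = len(values)
            if len(values) != expected:
                raise ValueError(f"column {name!r} has a different row count")
            if name in merged:
                raise ValueError(f"duplicate column {name!r}")
            merged[name] = values
    return merged


def read_merged(readers: List[DataFileReader], start: int, stop: int) -> Batch:
    """Read the same row range from every data file in parallel, then merge."""
    with ThreadPoolExecutor() as pool:
        # map() preserves reader order, so the column order is deterministic.
        batches = list(pool.map(lambda r: r.read_range(start, stop), readers))
    return merge_batches(batches)


readers = [
    DataFileReader({"id": [0, 1, 2, 3], "name": ["a", "b", "c", "d"]}),
    DataFileReader({"score": [0.1, 0.2, 0.3, 0.4]}),  # column added later
]
merged = read_merged(readers, 1, 3)
```

Each reader in this sketch owns its columns, so nothing has to be passed alongside it. That is one way to read the suggestion above about not passing tuples around.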

changhiskhan

gogogogogogogo

eddyxu marked this pull request as draft April 30, 2023 00:32

eddyxu added the WIP (work in progress) and donotmerge (Do not merge) labels Apr 30, 2023

eddyxu force-pushed the lei/merge_from_data_files branch 2 times, most recently from 4d0d3e3 to 6737601 May 2, 2023 22:40

eddyxu added 19 commits May 2, 2023 20:11

new writer

d81549a

fix lifetime

ed38be7

pass write test

dcdcc10

project by id

dc9f664

read ranges per data file

19f07b7

open mulitple readers

3477066

merge batches

424efa5

do not break existing code

c6cea63

fix schema

02788e7

runs

7ffa4fe

add updater api

4ac19a2

python side updater

cb9313d

updater::update

d271259

scanner next

40d130b

updater

260a3cb

buildable

59e434f

add

33d8d55

add pytest

72750f0

project using schema

bf3b27a

eddyxu force-pushed the lei/merge_from_data_files branch from 6737601 to bf3b27a May 3, 2023 03:11

eddyxu added 5 commits May 2, 2023 22:33

revert to read_batch api

3efd046

refactor

8e3c3d5

pass on rust side

5e41ce8

cargo fmt

5735c2e

python test barely works

7577da5

add todo

cbee9bf

eddyxu requested review from changhiskhan and gsilvestrin May 3, 2023 20:58

eddyxu marked this pull request as ready for review May 3, 2023 20:58

eddyxu mentioned this pull request May 3, 2023

Update datasets by adding columns (eg... schema evolution availability and functioning for lance datasets) #767

Closed

eddyxu added 2 commits May 3, 2023 14:45

remove warnings

2fc90b4

cargo fmt

78b2e0f

eddyxu removed the WIP (work in progress) and donotmerge (Do not merge) labels May 3, 2023

changhiskhan reviewed May 3, 2023

fix project on boolean

f1a2326

eddyxu changed the title ~~[WIP] Merge batches from multiple datafiles in the same Fragment~~ Merge batches from multiple datafiles in the same Fragment May 3, 2023

changhiskhan approved these changes May 3, 2023

eddyxu merged commit 9f08965 into main May 3, 2023

eddyxu deleted the lei/merge_from_data_files branch May 3, 2023 23:46

eddyxu merged 28 commits into main from lei/merge_from_data_files


2 participants
