Allow create fragment on non-existed dataset. #825

Merged
eddyxu merged 7 commits into main from lei/create_fragment
May 5, 2023

Conversation

@eddyxu
Member

@eddyxu eddyxu commented May 4, 2023

It allows users to create Fragments in a distributed fashion first, and then commit a Dataset later.

"""
ds = self._ds.create_version_from_fragments(new_schema, fragments)
return LanceDataset(self.uri)
if isinstance(base_uri, Path):
Contributor

I forgot: do we need to do any relative-to-absolute path conversion or $HOME expansion here, or is that all done at the Rust level?

Member Author

I am not sure which side does the normalization now.
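
For reference, the normalization discussed here could be done on the Python side with the standard library alone. This is only a minimal sketch: `normalize_uri` is a hypothetical helper, not part of Lance, and the thread leaves open which side actually performs this step. It assumes that any string containing a scheme (such as `gs://`) is a remote URI that should be passed through unchanged.

```python
import os
from pathlib import Path
from typing import Union


def normalize_uri(base_uri: Union[str, Path]) -> str:
    """Hypothetical helper: expand ~ / $HOME and make local paths absolute."""
    uri = str(base_uri)
    # Assumption: a scheme marker means a remote URI; leave it untouched.
    if "://" in uri:
        return uri
    # expanduser handles "~", expandvars handles "$HOME" style variables,
    # and abspath resolves a relative path against the current directory.
    expanded = os.path.expandvars(os.path.expanduser(uri))
    return os.path.abspath(expanded)
```

For example, `normalize_uri("data/test.lance")` returns an absolute path under the current working directory, while `normalize_uri("gs://bucket/ds")` comes back unchanged.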

def _create_version_from_fragments(
self,
@staticmethod
def _commit(
Contributor

OK, so basically the intended usage here is:

  1. Each executor node creates a new fragment.
  2. Each fragment gets written to the gs bucket under the lance directory's /data subdir.
  3. Call this _commit to a) create a new manifest file and b) update the _latest.manifest file?

Contributor

And these fragments can either be created from scratch or by appending a column to an existing fragment?

Member Author

Yes, that's right. The logic here is to separate the fragment preparation step from the final commit step (which makes the new version of the dataset visible).

This control flow can later be used to append new data (fragments), delete fragments, and run garbage collection as well.
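
The prepare-then-commit split described in this thread can be illustrated with a small standard-library sketch. It is not Lance's implementation: the JSON file layout and the `write_fragment`, `commit`, and `read_latest` helpers are hypothetical, and only the `data` subdirectory and the `_latest.manifest` file name are taken from the discussion above. The point is that data files written in the prepare phase stay invisible until a manifest listing them is published.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def write_fragment(base_dir: Path, fragment_id: int, rows: List[dict]) -> Dict:
    """Prepare phase: write one data file and return its metadata.

    Nothing written here is visible to readers until commit() runs.
    """
    data_dir = base_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"fragment-{fragment_id}.json"
    path.write_text(json.dumps(rows))
    return {
        "id": fragment_id,
        "path": str(path.relative_to(base_dir)),
        "rows": len(rows),
    }


def commit(base_dir: Path, version: int, fragments: List[Dict]) -> Path:
    """Commit phase: publish a manifest that lists the prepared fragments."""
    manifest = {"version": version, "fragments": fragments}
    # Write to a temporary file, then rename it into place, so a reader
    # never observes a half-written manifest on a local filesystem.
    fd, tmp = tempfile.mkstemp(dir=base_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
    target = base_dir / "_latest.manifest"
    os.replace(tmp, target)
    return target


def read_latest(base_dir: Path) -> Dict:
    """Readers consult only the manifest, never the data directory itself."""
    return json.loads((base_dir / "_latest.manifest").read_text())
```

In this sketch each worker would call `write_fragment` independently and send the returned metadata to a single coordinator, which calls `commit` once. The rename trick relies on local-filesystem semantics; an object store such as a gs bucket would need a different mechanism for the final publish step.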

Comment thread python/python/lance/fragment.py Outdated
elif isinstance(data, pa.Table):
reader = data.to_reader()
elif isinstance(data, pa.dataset.Dataset):
reader = pa.dataset.Scanner.from_dataset(data).to_reader()
Contributor

If the user passes in a LanceDataset, it will fall into this case and then fail.
Instead of using the static method, use the Dataset.scanner() (or maybe to_scanner()) API (see the other lance methods for this).

The reason is that pa.dataset.Scanner.from_dataset(...) ends up referring to Dataset private internals specific to the C++ pyarrow implementation (e.g., CDataset or something).

Member Author

Oh, I was re-using the same code from write_dataset. Can make it a to_scanner() I guess.
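
One way to read the suggestion in this thread is to dispatch on the methods an object offers rather than on its concrete class. The sketch below is an assumption about how that might look, not the code that was merged: `to_reader` is a hypothetical helper, and it assumes that table-like inputs expose `to_reader()` and that dataset-like inputs (including a LanceDataset) expose `scanner()` returning an object with `to_reader()`.

```python
from typing import Any


def to_reader(data: Any):
    """Hypothetical helper: get a record-batch reader via duck typing."""
    # Table-like objects (e.g. a pyarrow Table) can produce a reader directly.
    if hasattr(data, "to_reader"):
        return data.to_reader()
    # Dataset-like objects build their own scanner, so subclasses are not
    # routed through Scanner.from_dataset and its C++-specific internals.
    if hasattr(data, "scanner"):
        return data.scanner().to_reader()
    raise TypeError(f"Unsupported input type: {type(data).__name__}")
```

Because the check is on methods rather than on `isinstance`, any dataset wrapper that implements `scanner()` takes the second branch without a special case.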

def test_create_from_fragments(tmp_path: Path):
table = pa.Table.from_pydict({"a": range(100), "b": range(100)})
base_dir = tmp_path / "test"
fragment = lance.fragment.LanceFragment.create(base_dir, 1, table)
Contributor

Would the input data to create ever be a LanceDataset? I'm wondering if this has the same tokio runtime issue that the read/write APIs have. If so, you may need to convert the input LanceDataset into a pyarrow Table first until we figure out how to deal with that.

Member Author

At the fragment level? It seems not. Consider that this is the first step in creating a fragment: it usually just writes in-memory data to disk.

Comment thread python/src/dataset.rs Outdated
.ds
.create_version_from_fragments(&new_schema_with_id, &fragment_metadata)
.await
LanceDataset::commit(dataset_uri, &schema, &fragment_metadata).await
Contributor

Is commit the right terminology here?

Member Author

So in DB terms, writing all the fragments is the prepare phase of a transaction, and this last step "commits" the change to the dataset (by making the fragments visible)?

Comment thread python/src/fragment.rs
let rt = tokio::runtime::Runtime::new()?;
let metadata = rt.block_on(async {
let mut batches: Box<dyn RecordBatchReader> = if reader.is_instance_of::<Scanner>()? {
let scanner: Scanner = reader.extract()?;
Contributor

Have you tested this case in the if/else here?

Member Author

Added one test.

Comment thread rust/src/dataset.rs
let indices = self.load_indices().await?;
write_manifest_file(&self.object_store, &mut manifest, Some(indices)).await?;
let base = self.object_store.base_path().clone();
write_manifest_file(&object_store, &mut manifest, Some(indices)).await?;
Contributor

Preserving the indices only makes sense if we're appending rows. Here there's no guarantee as to what the input fragments actually represent?

Member Author

That's fair. Preserving indices is an available option for adding columns as well, though.

Member Author

Add check for the same WriteMode here as well.

@eddyxu
Member Author

eddyxu commented May 5, 2023

Addressed comments.

@eddyxu eddyxu merged commit 2d389fc into main May 5, 2023
@eddyxu eddyxu deleted the lei/create_fragment branch May 5, 2023 03:32
@staticmethod
def create(
dataset_uri: Union[str, Path],
fragment_id: int,
Contributor

Does the fragment_id here need to increase sequentially from 1?

3 participants