Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improvement of Ray sink API #2237

Merged
merged 17 commits into from
Apr 23, 2024
Merged

feat: improvement of Ray sink API #2237

merged 17 commits into from
Apr 23, 2024

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Apr 21, 2024

  • Expose max_bytes_per_file via Ray sink
  • Add a hook to provide ray.data.Dataset.write_lance() interface.

eddyxu and others added 6 commits April 21, 2024 19:27
Add LanceOperation, Add commit for Append Operation.
Pass Fragment as JSON string between Rust/Java.

---------

Co-authored-by: Lei Xu <lei@lancedb.com>
@eddyxu eddyxu marked this pull request as ready for review April 22, 2024 20:33
@wjones127 wjones127 changed the title chore: improvement of Ray sink API feat: improvement of Ray sink API Apr 22, 2024
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small issues with signatures, but otherwise looks good.

Comment on lines +78 to +80
max_bytes_per_file = (
DEFAULT_MAX_BYTES_PER_FILE if max_bytes_per_file is None else max_bytes_per_file
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just put DEFAULT_MAX_BYTES_PER_FILE in the signature as the default value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So using None here we can later detect whether users specify this value or not. During a benchmark, the default 90GB causes OOM (now we know it was a bug in arrow. This allows us to provide a better value later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Seems like a good thing to write a TODO comment for.

"""

def __init__(
self,
uri: str,
*,
transform: Callable[[pa.Table], Union[pa.Table, Generator]] = lambda x: x,
transform: Callable[[pa.Table], Union[pa.Table, Generator]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to make the default None, then you have to change the type accordingly:

Suggested change
transform: Callable[[pa.Table], Union[pa.Table, Generator]] = None,
transform: Optional[Callable[[pa.Table], Union[pa.Table, Generator]]] = None,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

max_rows_per_file : int, optional
The maximum number of rows per file. Default is 1024 * 1024.
max_bytes_per_file : int, optional
The maximum number of bytes per file. Default is None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default is 90GB, not None, right?

Suggested change
The maximum number of bytes per file. Default is None.
The maximum number of bytes per file. Default is 90GB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

max_rows_per_file: int, optional
The maximum number of rows per file. Default is 1024 * 1024.
max_bytes_per_file: int, optional
The maximum number of bytes per file. Default is None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Suggested change
The maximum number of bytes per file. Default is None.
The maximum number of bytes per file. Default is 90GB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@eddyxu eddyxu merged commit 920a070 into main Apr 23, 2024
9 checks passed
@eddyxu eddyxu deleted the lei/scale_ray_sink branch April 23, 2024 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants