Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust): Adds sink_csv_cloud to LazyFrame #11037

Closed

Conversation

svaningelgem
Copy link
Contributor

@svaningelgem svaningelgem commented Sep 11, 2023

In analogy with PR #11008 & PR #10060: adding sink_csv_cloud in rust.

As I don't have the rights for running rust builds on my computer here, I am opening this up as a draft until all checks check out.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature rust Related to Rust Polars labels Sep 11, 2023
@svaningelgem svaningelgem marked this pull request as ready for review September 12, 2023 06:40
@svaningelgem svaningelgem changed the title feat(rust): Adds sink_csv_cloud feat(rust): Adds sink_csv_cloud to LazyFrame Sep 12, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 12, 2023

In order to better-group the cloud methods, how about sink_cloud_<xyz> naming instead of sink_<xyz>_cloud? Then they will naturally be grouped together via autocomplete, in the docs, etc, making it even more straightforward as to which formats have cloud options, eg:

  • sink_cloud_csv
  • sink_cloud_ipc
  • sink_cloud_parquet

@svaningelgem
Copy link
Contributor Author

In order to better-group the cloud methods, how about sink_cloud_<xyz> naming instead of sink_<xyz>_cloud? Then they will naturally be grouped together via autocomplete, in the docs, etc, making it even more straightforward as to which formats have cloud options, eg:

  • sink_cloud_csv
  • sink_cloud_ipc
  • sink_cloud_parquet

I don't really have any preference either way. sink_<format>_cloud would firstly focus on the types, sink_cloud_<format> would focus firstly on the locality of the data... There is something to say for both.

@Qqwy would be doing a refactoring after his ipc-cloud method was added in PR #11008 . So I would be taking it up in that refactoring, wdyt?

Furthermore, I also invite you to join the discussion I started in #11056.

# Conflicts:
#	crates/polars-pipe/src/executors/sinks/file_sink.rs
@svaningelgem
Copy link
Contributor Author

Closing because no movement for a month now.

@josevalim
Copy link

I may be wrong here but would sink csv cloud yield any benefit? csv is row-based, which means you need to execute and load the data frame in memory anyway, and then write the rows.

@svaningelgem
Copy link
Contributor Author

Obviously. CSV is record based. So as soon as you have a record, you can write it away and forget it again... So sinking is an obvious choice for that kind of file formats.

@josevalim
Copy link

josevalim commented Nov 11, 2023

Yeah, I poorly phrased my question. :) You can optimize the writing of the rows but I assume there is not much optimization we can push to the query itself, compared to a columnar format that would allow us to optimize the computation and streaming of each individual column.

@josevalim
Copy link

In any case, it probably makes sense to enable cloud operations to all formats, even if only for the convenience of skipping the local write to disk.

@svaningelgem
Copy link
Contributor Author

True, with a columnar format you can optimize reading. Anyhow in my case I read in a parquet file and wrote it out as a csv file. So for these kind of operations sinking is useful.

As for enabling all cloud operations to all formats: I'm waiting for PR #10786 to be merged in order to investigate a single-source of file reader/writer where any format plugs into. So whatever new format arrives, it can immediately benefit from the full scope of reading/writing capabilities in the rust engine.

@josevalim
Copy link

josevalim commented Nov 11, 2023

@alexander-beedie To play devil's advocate on the name, people usually know the format they want to work with, so I think having the format come first is more helpful. This way they can type sink_json and see there is more than one way to sink it. I have nothing concrete to back this up, so feel free to ignore it. :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants