Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for IPC Streaming Read/Write #3783

Merged
merged 24 commits into from
Jun 25, 2022

Conversation

joshuataylor
Copy link
Contributor

This is waiting for jorgecarleitao/arrow2#1095 to be reviewed and merged.

I've pretty much copied IPC, and made changes to use StreamWriter instead. I also think it might be worth moving IPC into its own directory, similar to parquet/csv_core/ndjson_core?

Closes #3778

@github-actions github-actions bot added the rust Related to Rust Polars label Jun 23, 2022
@ritchie46
Copy link
Member

Thanks for the PR.

I've pretty much copied IPC, and made changes to use StreamWriter instead. I also think it might be worth moving IPC into its own directory, similar to parquet/csv_core/ndjson_core?

Yes, could you do that? And would you then feature gate this under ipc_streaming?

// predicates
in predicate pushdown predicates were always combined with:

lit(true) && predicate

now, we simply insert the predicate.

// joins

joins branch the query executor, the schemas therefore need
to be split by join branch

// python

activate timezone feature
@github-actions github-actions bot added the python Related to Python Polars label Jun 23, 2022
@github-actions github-actions bot removed the python Related to Python Polars label Jun 24, 2022
@joshuataylor
Copy link
Contributor Author

That makes sense, I've added it. Pretty straight forward process.

@joshuataylor
Copy link
Contributor Author

I've moved it to the ipc, with ipc_streaming under a different feature flag. They both rely on io_ipc in arrow2.

We just need to bump arrow2 to support this -- #3793

Sorry as well for the multiple commits, wasn't 100% sure how to do it for feature flags (I'm new to Rust, so apologies as I'm still learning).

@ritchie46
Copy link
Member

Sorry as well for the multiple commits, wasn't 100% sure how to do it for feature flags (I'm new to Rust, so apologies as I'm still learning).

No worries, I have got a squash button! :)

@joshuataylor
Copy link
Contributor Author

Yeah, was just about some people review commits or get notifications on pushes.

As this is my first PR to this repository, and I'm new to Rust, please give any feedback you find helpful 🙌

@codecov-commenter
Copy link

codecov-commenter commented Jun 24, 2022

Codecov Report

Merging #3783 (d206421) into master (c57df45) will decrease coverage by 0.15%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #3783      +/-   ##
==========================================
- Coverage   78.12%   77.96%   -0.16%     
==========================================
  Files         445      446       +1     
  Lines       73691    73838     +147     
==========================================
+ Hits        57569    57570       +1     
- Misses      16122    16268     +146     
Impacted Files Coverage Δ
polars/polars-io/src/aggregations.rs 7.27% <ø> (ø)
polars/polars-io/src/ipc/ipc_file.rs 94.32% <ø> (ø)
polars/polars-io/src/ipc/ipc_stream.rs 0.00% <0.00%> (ø)
polars/polars-io/src/lib.rs 60.93% <ø> (ø)
polars/polars-io/src/predicates.rs 83.33% <ø> (ø)
polars/polars-io/src/prelude.rs 100.00% <ø> (ø)
polars/polars-io/src/utils.rs 72.97% <ø> (ø)
...polars-core/src/chunked_array/iterator/par/list.rs 95.23% <0.00%> (-4.77%) ⬇️
...lars/polars-core/src/chunked_array/ops/take/mod.rs 62.45% <0.00%> (-0.34%) ⬇️
polars/polars-io/src/csv_core/buffer.rs 79.49% <0.00%> (-0.26%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c57df45...d206421. Read the comment docs.

@github-actions github-actions bot added the python Related to Python Polars label Jun 24, 2022
@github-actions github-actions bot removed the python Related to Python Polars label Jun 24, 2022
Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I have a few comments, nothing big.

polars/polars-io/src/ipc/ipc_stream.rs Outdated Show resolved Hide resolved
polars/polars-io/src/ipc/ipc_stream.rs Outdated Show resolved Hide resolved
@joshuataylor
Copy link
Contributor Author

joshuataylor commented Jun 24, 2022

I'm running this through my streaming files from SF and will let you know if I have any dramas with them.

@joshuataylor
Copy link
Contributor Author

I have ran this across lots of Snowflake files using their sample dataset + some test ones from myself and it seems to be working well.

The two main changes I've made are:

  1. Move metadata to self, as the schema can only be read once. Otherwise it throws an error, if you do something like this:
    let f = File::open("file.arrow-stream").unwrap();
    let mut reader = BufReader::new(f);
    let mut stream_reader = IpcStreamReader::new(reader);
    let mut schema = stream_reader.arrow_schema().unwrap();
    let mut df = stream_reader.finish().unwrap(); // can't access schema again.
  1. Match on StreamState, I was sometimes getting errors that there were no more batches to read. 186f0dc

@ritchie46 ritchie46 merged commit c217364 into pola-rs:master Jun 25, 2022
cnpryer pushed a commit to cnpryer/polars that referenced this pull request Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Arrow IPC streaming files
4 participants