Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow use of ParquetWriter, ParquetReader w/o compiling all compression deps (WASM support) #16729

Open
akhilles opened this issue Jun 4, 2024 · 8 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@akhilles
Copy link

akhilles commented Jun 4, 2024

Description

Currently, enabling the parquet feature in polars-io also enables polars-parquet/compression, which in turn enables the following dependencies that are hard to cross-compile to wasm32-unknown-unknown:

  • zstd
  • gzip
  • snappy
  • lz4
  • brotli

Also, supporting all of these compression schemes bloats up the WASM bundle. So, it would be nice to not gate polars_io::parquet behind a flag that necessarily brings in all of these dependencies.

To address this, I propose gating polars_io::parquet behind the polars-parquet flag instead of the polars flag.

@akhilles akhilles added the enhancement New feature or an improvement of an existing feature label Jun 4, 2024
@deanm0000
Copy link
Collaborator

As an aside, there's https://github.com/kylebarron/parquet-wasm. I don't know if figuring out the interoperability is worth it.

Back to your point, it seems another idea would be to make the compression dependencies a la carte instead of all of them.

@akhilles
Copy link
Author

akhilles commented Jun 4, 2024

it seems another idea would be to make the compression dependencies a la carte instead of all of them.

With #16731, this is possible with zstd:

polars-io = { version = "0.40", default_features = false, features = ["polars-parquet", "zstd"] }

For the other compression dependencies, it's still possible but you have to add polars-parquet as a direct dependency:

polars-io = { version = "0.40", default_features = false, features = ["polars-parquet"] }
polars-parquet = { version = "0.40", default_features = false, features = ["lz4"] }

Adding passthrough feature flags for lz4, brotli, etc. would make this a bit easier but not sure if it's worth it.

@jccampagne
Copy link

I am trying to compile Polars for wasm32 target (to be used within Leptos app on client side) and I found this issue here.
After disabling some features related to compression and crossterm to make some progress compile, I am getting this compilation error:

 (git)-[main]-polars % cargo build --target wasm32-unknown-unknown 
...
   Compiling polars-arrow v0.40.0 (/Users/jc/src/polars/crates/polars-arrow)
error: this arithmetic operation will overflow
  --> crates/polars-arrow/src/compute/cast/utf8_to.rs:87:45
   |
87 |         .sliced(0, std::cmp::min(buf.len(), OffsetType::MAX as usize * 2))
   |                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ attempt to compute `usize::MAX * 2_usize`, which would overflow
   |
   = note: `#[deny(arithmetic_overflow)]` on by default

error: could not compile `polars-arrow` (lib) due to 1 previous error

This line was introduced in PR 15408 to fix an overflow problem.

I think it's due to this type definition:

#[cfg(not(test))]
type OffsetType = u32;

(see https://github.com/pola-rs/polars/pull/15408/files#diff-7b0d4a626698c6858fc7274640e5c4706b7c336ef259cf861e1a34e453fff70eR74-R75 )

This will cause usize::MAX * 2_usize to overflow on a 32-bit system (like wasm32 target).

@ritchie46
Copy link
Member

Why would you need parquet on wasm if I may ask? I have never seen uncompressed parquet files and supporting this adds a lot of complexity and breaking assumptions that I don't think are worth it.

@deanm0000
Copy link
Collaborator

I use duckdb-wasm to read parquet files and it's great. It means I can have a dashboard without an api layer and I can write most of my logic in sql instead of javascript. That said, the duckdb parquet extension supports compression. I don't know how much it'd save on bundle size and load time if it only supported uncompressed parquets. I don't know what the benefit is of uncompressed parquets is vs uncompressed ipc files though.

@akhilles akhilles changed the title Allow use of ParquetWriter, ParquetReader without compression (WASM support) Allow use of ParquetWriter, ParquetReader w/o compiling all compression deps (WASM support) Jun 10, 2024
@jccampagne
Copy link

jccampagne commented Jun 10, 2024

Why would you need parquet on wasm if I may ask?

@ritchie46 sure: I am reading, and analyzing parquet files from within the browser (client side only; no server side).
I was able to read parquet files with the parquet crate and use datafusion for the query engine.

I have never seen uncompressed parquet files

The aim was not to disable all compression. I was just trying to debug the issues and disabled one feature at a time and found this overflow issue in wasm32. I am actually using compression.

and supporting this adds a lot of complexity and breaking assumptions that I don't think are worth it.

I understand. No worries.

@akhilles
Copy link
Author

akhilles commented Jun 10, 2024

Why would you need parquet on wasm if I may ask? I have never seen uncompressed parquet files and supporting this adds a lot of complexity and breaking assumptions that I don't think are worth it.

Similar to @jccampagne, I don't necessarily need uncompressed parquet files. I just want to use a single compression scheme (zstd), but currently all of the compression dependencies are built. This is probably fine for most targets, but not really suitable for wasm for 2 reasons:

  • Compiling C/C++ dependencies to wasm32-unknown-unknown is a pain
  • Keeping binary size smaller is more important

I updated the issue to reflect this more accurately.

Really I just need some way to load a Dataframe into wasm from S3. I'm not even attached to parquet, but IPC has the same issue: enabling the ipc feature forces compilation of both zstd and lz4-rs.

@akhilles
Copy link
Author

I can make a similar change to polars_io::ipc instead if that's more welcome. Compression isn't critical for Arrow IPC files, and so being able to build polars for wasm + IPC without compression dependencies seems reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants