Switch to Arrow2/Parquet2 #3

Merged
merged 7 commits into main from arrow2 on Mar 6, 2022

kylebarron
Owner

@kylebarron kylebarron commented Mar 6, 2022

Seems to be a bit faster (and safer Rust), but most notably, the generated IPC just works out of the box in Arrow JS.

Ignore the ugly newbie Rust code 😂

@kylebarron
Owner Author

the generated IPC just works out of the box in Arrow JS.

Well, actually it does in most of the cases tested, but not "./data/2-partition-brotli.parquet". There it gives

message.js:102 Uncaught (in promise) Error: Expected to read 1901288 metadata bytes, but only read 644.

when trying to read in JS 😢.

@kylebarron
Owner Author

In search of a large test file, I found s3://ookla-open-data/parquet/performance/type=fixed/year=2021/quarter=1/2021-01-01_performance_fixed_tiles.parquet (from https://registry.opendata.aws/speedtest-global-performance/). The Parquet file is 549MB, which expands to 1.52GB in memory (judging from PyArrow). Reading the whole file crashed WASM with: panicked at 'capacity overflow'.

The first 10 row groups (out of 18) make a 316MB Parquet file, which is 888MiB in memory according to PyArrow. It takes about 8.5s to return the decoded bytes to JS, and about 2s to load those bytes using PyArrow. Not shabby at all, and it works! 🎉

Selecting the first n row groups:

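import pyarrow.parquet as pq

# Open the source file lazily; row groups are read one at a time below.
f = pq.ParquetFile('2021-01-01_performance_fixed_tiles.parquet')
# Copy the first 10 row groups into a smaller Parquet file with the same schema.
with pq.ParquetWriter('part.parquet', schema=f.schema_arrow) as writer:
    for i in range(10):
        writer.write_table(f.read_row_group(i))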

@kylebarron
Owner Author

Well, actually it does in most of the cases tested, but not "./data/2-partition-brotli.parquet". There it gives

message.js:102 Uncaught (in promise) Error: Expected to read 1901288 metadata bytes, but only read 644.

when trying to read in JS 😢.

This appears to be a JS error again. The file loads fine in Python with pa.ipc.open_file() but gives the same error as in JS with pa.ipc.open_stream(). So the data is in the IPC file format, and JS's arrow.tableFromIPC is likely misdetecting it as an IPC stream rather than an IPC file.
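For anyone wanting to check which of the two IPC encodings a buffer is in before handing it to a reader, here is a minimal sketch using only the standard library. It relies on two details of the Arrow IPC format: the file format starts and ends with the magic bytes `ARROW1`, and current streams start with the continuation marker `0xFFFFFFFF`. The function name `sniff_ipc_format` is made up for illustration, and this is not a claim about how Arrow JS or PyArrow detect the format internally.

```python
import struct

ARROW_MAGIC = b"ARROW1"    # magic bytes at the start and end of an IPC file
CONTINUATION = 0xFFFFFFFF  # first 4 bytes of a message in a current IPC stream


def sniff_ipc_format(buf: bytes) -> str:
    """Guess whether buf holds an Arrow IPC file or an IPC stream.

    Hypothetical helper: it only inspects the leading and trailing bytes
    and does not validate the rest of the buffer.
    """
    # File format: "ARROW1" plus 2 padding bytes up front, "ARROW1" at the end.
    if (
        len(buf) >= 8 + len(ARROW_MAGIC)
        and buf.startswith(ARROW_MAGIC)
        and buf.endswith(ARROW_MAGIC)
    ):
        return "file"
    # Stream format: each message starts with the continuation marker.
    if len(buf) >= 4:
        (first_word,) = struct.unpack_from("<I", buf, 0)
        if first_word == CONTINUATION:
            return "stream"
    # Anything else (including pre-0.15 streams without the marker) is
    # left undecided rather than guessed.
    return "unknown"
```

A buffer reported as "file" would go to pa.ipc.open_file(), and one reported as "stream" to pa.ipc.open_stream(). A stream reader pointed at file-format bytes would interpret the leading magic as a message length, which is the general shape of the "Expected to read N metadata bytes" failure, though the exact number depends on the reader and the data.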

@kylebarron kylebarron merged commit d8757e4 into main Mar 6, 2022
@kylebarron kylebarron deleted the arrow2 branch March 6, 2022 18:43