This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Fixed error in memory usage of sliced binary/list/utf8arrays #1293

Merged
merged 1 commit into jorgecarleitao:main on Nov 13, 2022

Collaborator

@ritchie46 ritchie46 commented Nov 5, 2022

fixes #1292

The size reported by the estimated_bytes_size function did not shrink when a Utf8 array was sliced, so the writer kept slicing recursively until there were only 3 elements left in the array. This led to writing 175e7 / 3 pages to the parquet file, which was insanely slow.
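
To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use the crate's code: `count_pages`, the row counts, and the byte limit are all hypothetical, and the halving rule (split while the estimate exceeds a limit and more than 3 rows remain) is an assumption based on the description above.

```rust
/// Hypothetical stand-in for a page writer: halve the row range while the
/// size estimate exceeds `limit` and more than 3 rows remain, then count
/// how many pages (leaves of the recursion) come out.
fn count_pages(len: usize, estimate: &dyn Fn(usize) -> usize, limit: usize) -> usize {
    if estimate(len) > limit && len > 3 {
        let left = len / 2;
        count_pages(left, estimate, limit) + count_pages(len - left, estimate, limit)
    } else {
        1
    }
}

fn main() {
    // Made-up numbers, chosen only to keep the example small.
    let rows = 1_000usize;
    let bytes_per_row = 10usize;
    let limit = 4_000usize;
    let total = rows * bytes_per_row;

    // Estimate that ignores slicing: every slice reports the full buffer size.
    let ignores_slice = count_pages(rows, &|_len: usize| total, limit);
    // Slice-aware estimate: the reported size shrinks with the slice.
    let slice_aware = count_pages(rows, &|len: usize| len * bytes_per_row, limit);

    println!("pages when the estimate ignores slicing: {ignores_slice}");
    println!("pages when the estimate is slice-aware:  {slice_aware}");
}
```

With these made-up numbers, the slice-aware estimate stops after two levels of halving (4 pages), while the constant estimate keeps halving until every piece is at most 3 rows long.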

This PR adapts estimated_bytes_size so that it reports the size of the sliced array. It already does this for other data types, and I think this is the most consistent and least error-prone option.
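
As a rough illustration of what "reporting the sliced size" means for an offsets-plus-values layout, here is a simplified sketch. `Utf8View` and its methods are hypothetical and are not the crate's types; the sketch counts only the bytes in the values buffer and leaves out the offsets and validity buffers.

```rust
/// Hypothetical, simplified view of a Utf8 array: `offsets` is the (possibly
/// sliced) window of the offsets buffer, `values` is the whole shared buffer.
struct Utf8View<'a> {
    offsets: &'a [i32],
    values: &'a [u8],
}

impl<'a> Utf8View<'a> {
    /// Number of strings in the view (one fewer than the number of offsets).
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Zero-copy slice: narrows the offsets window, keeps the same values buffer.
    fn slice(&self, start: usize, length: usize) -> Utf8View<'a> {
        let offsets: &'a [i32] = self.offsets;
        Utf8View {
            offsets: &offsets[start..=start + length],
            values: self.values,
        }
    }

    /// Estimate that ignores slicing: always the full values buffer.
    fn unsliced_values_size(&self) -> usize {
        self.values.len()
    }

    /// Slice-aware estimate: only the bytes between the first and last offset.
    fn sliced_values_size(&self) -> usize {
        match (self.offsets.first(), self.offsets.last()) {
            (Some(first), Some(last)) => (last - first) as usize,
            _ => 0,
        }
    }
}

fn main() {
    // Four strings: "aa", "bbb", "c", "dddd".
    let values = b"aabbbcdddd";
    let offsets = [0i32, 2, 5, 6, 10];
    let full = Utf8View { offsets: &offsets, values };

    // Keep only "bbb" and "c".
    let part = full.slice(1, 2);
    println!("strings in slice:     {}", part.len());
    println!("ignoring the slice:   {} bytes", part.unsliced_values_size());
    println!("slice-aware estimate: {} bytes", part.sliced_values_size());
}
```

Taking the difference between the last and first offset avoids scanning the strings, so the estimate stays cheap even for large arrays.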

If you think we should make it a dedicated function or a branch flagged by an extra argument, that's also fine.

@codecov

codecov bot commented Nov 5, 2022

Codecov Report

Base: 83.12% // Head: 83.11% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (61fdb87) compared to base (562de6a).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1293      +/-   ##
==========================================
- Coverage   83.12%   83.11%   -0.01%     
==========================================
  Files         369      369              
  Lines       40187    40187              
==========================================
- Hits        33405    33402       -3     
- Misses       6782     6785       +3     
| Impacted Files | Coverage Δ |
|---|---|
| src/compute/aggregate/memory.rs | 35.71% <ø> (ø) |
| src/io/ipc/read/array/utf8.rs | 92.75% <0.00%> (-5.80%) ⬇️ |
| src/bitmap/utils/slice_iterator.rs | 98.78% <0.00%> (+1.21%) ⬆️ |


@ritchie46
Collaborator Author

Clippy fails, but it is unrelated to this PR.

@jorgecarleitao jorgecarleitao changed the title from "report sliced memory usage in binary/list/utf8arrays" to "Fixed error in memory usage of sliced binary/list/utf8arrays" Nov 13, 2022
@jorgecarleitao jorgecarleitao added the "bug" (Something isn't working) label Nov 13, 2022
@jorgecarleitao jorgecarleitao merged commit 48a5322 into jorgecarleitao:main Nov 13, 2022
@jorgecarleitao
Owner

Thanks @ritchie46 !

Successfully merging this pull request may close these issues.

Parquet writer stalls at a certain column size for Utf8 dtypes.