This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Replaced RecordBatch by Chunk #717

Merged (8 commits) on Jan 3, 2022

Conversation

@jorgecarleitao (Owner) commented Dec 28, 2021

Closes #673.

This is a major refactor of the crate's IO interfaces; see #673 for details.

This PR:

Replaces RecordBatch with a new struct, Chunk, containing a Vec of arrays of the same length. All IO interfaces now use Chunk and behave as follows:

  • read or infer schema (logical plane)
  • read columns (physical plane)

This means users no longer have to "leak" logical information into the physical plane unless the format requires it.

All IO APIs were refactored to read and write Chunk (instead of RecordBatch). This removes much of the boilerplate to write a file.
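For illustration, a minimal sketch of the new model, assuming arrow2's `Chunk`, `Int32Array`, and `Utf8Array` as exposed around this release (exact module paths and constructors may differ):

```rust
use std::sync::Arc;

use arrow2::array::{Array, Int32Array, Utf8Array};
use arrow2::chunk::Chunk;

fn main() {
    // Physical plane: a Chunk is just a vec of equal-length arrays; it carries
    // no field names. Logical information (the Schema) is read or inferred
    // separately, before the first chunk is available.
    let ids: Arc<dyn Array> = Arc::new(Int32Array::from_slice([1, 2, 3]));
    let names: Arc<dyn Array> = Arc::new(Utf8Array::<i32>::from_slice(["a", "b", "c"]));

    let chunk: Chunk<Arc<dyn Array>> = Chunk::new(vec![ids, names]);

    assert_eq!(chunk.len(), 3); // number of rows
    assert_eq!(chunk.columns().len(), 2); // number of columns
}
```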

Migration path

  • RecordBatch -> Chunk<Arc<dyn Array>>
  • RecordBatch::num_rows() -> Chunk::len()
  • RecordBatch::columns() -> Chunk::columns()
  • RecordBatch::column(i) -> Chunk::columns()[i]
  • RecordBatch::num_columns() -> Chunk::columns().len()
  • RecordBatch::schema() -> no longer present. Use other APIs (usually metadata) to get the schema
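As a rough sketch of that migration, assuming a `chunk: Chunk<Arc<dyn Array>>` that replaces a former `batch: RecordBatch` (the `inspect` helper is illustrative, not part of arrow2):

```rust
use std::sync::Arc;

use arrow2::array::Array;
use arrow2::chunk::Chunk;

// Illustrative helper showing the accessor equivalences listed above.
fn inspect(chunk: &Chunk<Arc<dyn Array>>) {
    let num_rows = chunk.len(); // was RecordBatch::num_rows()
    let num_columns = chunk.columns().len(); // was RecordBatch::num_columns()
    println!("{} rows x {} columns", num_rows, num_columns);

    if let Some(first) = chunk.columns().first() {
        // was RecordBatch::column(0)
        println!("first column type: {:?}", first.data_type());
    }
    // RecordBatch::schema() has no counterpart: keep the Schema obtained from
    // the file metadata (or from schema inference) alongside the chunks.
}
```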

@codecov bot commented Dec 28, 2021

Codecov Report

Merging #717 (743b0da) into main (ef7937d) will increase coverage by 0.11%.
The diff coverage is 68.10%.


@@            Coverage Diff             @@
##             main     #717      +/-   ##
==========================================
+ Coverage   70.25%   70.37%   +0.11%     
==========================================
  Files         312      311       -1     
  Lines       17016    16920      -96     
==========================================
- Hits        11955    11907      -48     
+ Misses       5061     5013      -48     
Impacted Files                            Coverage Δ
benches/filter_kernels.rs                 0.00% <ø> (ø)
src/array/list/mutable.rs                 74.28% <0.00%> (-2.19%) ⬇️
src/compute/filter.rs                     52.85% <0.00%> (-0.77%) ⬇️
src/compute/merge_sort/mod.rs             87.36% <ø> (ø)
src/compute/sort/lex_sort.rs              68.42% <ø> (ø)
src/datatypes/mod.rs                      97.22% <ø> (+15.82%) ⬆️
src/io/csv/read/deserialize.rs            100.00% <ø> (ø)
src/io/csv/read_async/deserialize.rs      100.00% <ø> (ø)
src/io/flight/mod.rs                      0.00% <0.00%> (ø)
src/io/ipc/write/stream_async.rs          55.55% <0.00%> (+0.29%) ⬆️
... and 32 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao changed the title from "Replaced RecordBatch by Columns" to "Replaced RecordBatch by Chunk" on Dec 28, 2021
@jorgecarleitao (Owner, Author) commented:

Renamed to Chunk based on @sundy-li's suggestion: #673 (comment)

@jorgecarleitao marked this pull request as ready for review on December 31, 2021, 19:04
@jorgecarleitao (Owner, Author) commented Jan 2, 2022

@yjshen, @houqp, @sundy-li, could you take a look at this?

I envision some pain with this PR in DataFusion, as DataFusion currently passes logical information (the Schema) down to the physical nodes.

Because this PR requires less information to write, one way to go is to declare, in DataFusion:

pub struct RecordBatch {
    pub columns: Chunk<Arc<dyn Array>>,
    pub schema: Arc<Schema>,
}

and pass columns to the interfaces (this schema is now useless from arrow2's perspective, since the schema is known before the first batch is available).

For reading, the schema is likewise always known before the first batch is read. Thus, we can just store an Arc<Schema> after reading the metadata/inferring the schema and clone it for every batch that comes from IO.
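A minimal sketch of that reading pattern, assuming the hypothetical downstream `RecordBatch` wrapper above (`wrap_chunks` is illustrative, not an arrow2 API):

```rust
use std::sync::Arc;

use arrow2::array::Array;
use arrow2::chunk::Chunk;
use arrow2::datatypes::Schema;

// Hypothetical downstream wrapper, as suggested above.
pub struct RecordBatch {
    pub columns: Chunk<Arc<dyn Array>>,
    pub schema: Arc<Schema>,
}

// The schema is known once (from metadata or inference); clone the Arc per chunk.
fn wrap_chunks(
    schema: Arc<Schema>,
    chunks: impl Iterator<Item = Chunk<Arc<dyn Array>>>,
) -> impl Iterator<Item = RecordBatch> {
    chunks.map(move |columns| RecordBatch {
        columns,
        schema: schema.clone(),
    })
}
```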

@yjshen (Contributor) commented Jan 3, 2022

The new Chunk API LGTM.


Successfully merging this pull request may close these issues.

Consider removing RecordBatch