This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support to read Avro. #406

Merged: 3 commits, Sep 17, 2021

@jorgecarleitao (Owner) commented on Sep 15, 2021

This PR adds support for reading Avro, a popular row-based format.

The approach is equivalent to what we do for CSV (also a row-based format). Specifically:

  1. (IO-bound) A StreamingIterator that reads Avro blocks, which may be compressed (each block is a Vec<u8> holding a group of rows)
  2. (CPU-bound) A StreamingIterator that decompresses Avro blocks, swapping buffers so that it does not allocate on each block
  3. (CPU-bound) An Iterator that consumes the decompressor and yields RecordBatch
  4. (CPU-bound) A function, deserialize, used by the reader
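
The four stages above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in this PR: the `StreamingIterator` trait below is a hand-written stand-in, the block framing (two little-endian `u32` values followed by the bytes) is a toy format rather than Avro's, `Codec::Invert` is a made-up codec standing in for real compression, and a `Vec<i32>` stands in for a RecordBatch. All type and function names are hypothetical.

```rust
use std::io::Read;

/// Std-only stand-in for a streaming iterator: items borrow from the iterator,
/// so each stage can reuse one internal buffer across blocks.
trait StreamingIterator {
    type Item;
    fn advance(&mut self);
    fn get(&self) -> Option<&Self::Item>;
    fn next(&mut self) -> Option<&Self::Item> {
        self.advance();
        self.get()
    }
}

/// A group of rows, still serialized.
#[derive(Default)]
struct Block { rows: usize, data: Vec<u8> }

/// Toy codec (assumption): `Invert` flips every byte, standing in for a real codec.
#[derive(Clone, Copy)]
enum Codec { None, Invert }

/// Stage 1 (IO-bound): reads framed blocks into a reused buffer.
/// Toy framing (assumption): [u32 LE row count][u32 LE byte length][bytes].
struct BlockReader<R: Read> { reader: R, block: Block, done: bool }

impl<R: Read> StreamingIterator for BlockReader<R> {
    type Item = Block;
    fn advance(&mut self) {
        let mut h = [0u8; 8];
        // Simplification: any read error, including EOF, ends the stream.
        if self.reader.read_exact(&mut h).is_err() {
            self.done = true;
            return;
        }
        self.block.rows = u32::from_le_bytes([h[0], h[1], h[2], h[3]]) as usize;
        let len = u32::from_le_bytes([h[4], h[5], h[6], h[7]]) as usize;
        self.block.data.clear();
        self.block.data.resize(len, 0);
        self.done = self.reader.read_exact(&mut self.block.data).is_err();
    }
    fn get(&self) -> Option<&Block> {
        if self.done { None } else { Some(&self.block) }
    }
}

/// Stage 2 (CPU-bound): decompresses each block into its own reused buffer.
struct Decompressor<R: Read> { blocks: BlockReader<R>, codec: Codec, out: Block, done: bool }

impl<R: Read> StreamingIterator for Decompressor<R> {
    type Item = Block;
    fn advance(&mut self) {
        self.blocks.advance();
        self.done = self.blocks.get().is_none();
        if self.done {
            return;
        }
        let src = &mut self.blocks.block;
        self.out.rows = src.rows;
        match self.codec {
            // Uncompressed: swap the two buffers instead of copying or allocating.
            Codec::None => std::mem::swap(&mut src.data, &mut self.out.data),
            Codec::Invert => {
                self.out.data.clear();
                self.out.data.extend(src.data.iter().map(|b| !*b));
            }
        }
    }
    fn get(&self) -> Option<&Block> {
        if self.done { None } else { Some(&self.out) }
    }
}

/// Stage 4 (CPU-bound): turns one decompressed block into a "batch".
/// Toy row format (assumption): each row is a single little-endian i32.
fn deserialize(block: &Block) -> Vec<i32> {
    block.data.chunks_exact(4).take(block.rows)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Stage 3 (CPU-bound): an ordinary `Iterator` that owns the decompressor.
struct BatchReader<R: Read> { blocks: Decompressor<R> }

impl<R: Read> Iterator for BatchReader<R> {
    type Item = Vec<i32>;
    fn next(&mut self) -> Option<Vec<i32>> {
        self.blocks.next().map(deserialize)
    }
}

/// Wires the stages together over any `Read` source.
fn read_batches<R: Read>(reader: R, codec: Codec) -> BatchReader<R> {
    let blocks = BlockReader { reader, block: Block::default(), done: false };
    BatchReader { blocks: Decompressor { blocks, codec, out: Block::default(), done: false } }
}

/// Helper for trying the sketch: frames `rows` as one block in the toy format.
fn encode_block(rows: &[i32], codec: Codec) -> Vec<u8> {
    let mut data: Vec<u8> = rows.iter().flat_map(|r| r.to_le_bytes()).collect();
    if let Codec::Invert = codec {
        data.iter_mut().for_each(|b| *b = !*b);
    }
    let mut out = (rows.len() as u32).to_le_bytes().to_vec();
    out.extend((data.len() as u32).to_le_bytes());
    out.extend(data);
    out
}
```

Calling `read_batches(std::io::Cursor::new(bytes), Codec::None)` on bytes produced by `encode_block` yields one `Vec<i32>` per block. The two streaming stages hand out borrowed blocks, so only the final stage allocates per batch; the uncompressed path exchanges buffers with `std::mem::swap`, which is one plausible reading of "swaps the buffers" in item 2.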

The Avro schema -> Arrow schema conversion code is based on apache/datafusion#910 by @Igosuki.

Closes #401

@jorgecarleitao added the "feature" (A new feature) label on Sep 15, 2021
codecov bot commented on Sep 15, 2021

Codecov Report

Merging #406 (5d60695) into main (06892e9) will decrease coverage by 0.08%.
The diff coverage is 70.16%.

@@            Coverage Diff             @@
##             main     #406      +/-   ##
==========================================
- Coverage   80.91%   80.83%   -0.09%     
==========================================
  Files         347      353       +6     
  Lines       22098    22500     +402     
==========================================
+ Hits        17880    18187     +307     
- Misses       4218     4313      +95     

Impacted Files                    Coverage Δ
src/io/avro/mod.rs                0.00% <0.00%> (ø)
tests/it/array/mod.rs             100.00% <ø> (ø)
src/io/avro/read/schema.rs        40.62% <40.62%> (ø)
src/io/avro/read/mod.rs           77.19% <77.19%> (ø)
src/io/avro/read/util.rs          80.43% <80.43%> (ø)
src/io/avro/read/deserialize.rs   81.13% <81.13%> (ø)
src/io/parquet/read/mod.rs        53.01% <100.00%> (ø)
tests/it/io/avro/read/mod.rs      100.00% <100.00%> (ø)
tests/it/array/utf8/mod.rs        97.64% <0.00%> (-2.36%) ⬇️
tests/it/array/binary/mod.rs      98.79% <0.00%> (-1.21%) ⬇️
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao marked this pull request as ready for review on September 15, 2021, 16:52
Successfully merging this pull request may close these issues.

Add IO read for Avro