Add `parquet-derive` to repository README #1

konjac · 2024-05-11T10:23:36Z

Which issue does this PR close?

Document refinement. Add parquet-derive to repository README

Closes apache#5751

Rationale for this change

See apache#5751

What changes are included in this PR?

Add parquet-derive to repository README. Also some minor refinements.

Are there any user-facing changes?

No. Only README changes.

This patch adds reader support for a comment character when reading CSV files. Comments, like almost everything else about the CSV format, are not truly standardized, but a common convention supported by many CSV readers[^1][^2] is to ignore full lines starting with a comment character (often `#`); inline or end-of-line comments are not supported.

Example:

    # This is a comment in a CSV file without header.
    1,2
    # Comment inside the data block.
    11,22

The implementation of this for Arrow is pretty straightforward, as all we need to do is expose the existing `comment` option of `csv_core` used to read CSV files.

Closes apache#5758.

[^1]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
[^2]: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
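The full-line comment rule described in the CSV commit message above can be illustrated with a small standard-library sketch. This is not the Arrow or `csv_core` implementation; the function name `strip_comment_lines` is hypothetical, the field splitting is a naive comma split without quoting support, and skipping empty lines is an extra assumption made for simplicity.

```rust
/// Hypothetical helper (not part of arrow-rs): returns the data records of
/// `input`, skipping every line that starts with `comment`.
///
/// Only full-line comments are recognised. A comment character that appears
/// later in a line is kept as ordinary field data.
fn strip_comment_lines(input: &str, comment: char) -> Vec<Vec<String>> {
    input
        .lines()
        // Assumption: empty lines are dropped as well, to keep the sketch short.
        .filter(|line| !line.is_empty() && !line.starts_with(comment))
        // Naive split on commas; real CSV parsing must also handle quoting.
        .map(|line| line.split(',').map(|field| field.to_string()).collect())
        .collect()
}
```

Calling `strip_comment_lines` with `'#'` on the four-line example from the commit message leaves two records, `1,2` and `11,22`, while a line such as `1,2 # x` keeps the trailing text as part of the second field.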

) CPUs have efficient instructions for querying, setting and clearing bits, and modern compilers know how to turn simple bit indexing code into such instructions. The table lookup optimizations may have been useful in older versions of rustc, but as of rustc 1.78, they are a net loss. See PR description for more details.
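To make the contrast in the commit message above concrete, here is a minimal sketch of the two styles of bit access: a mask lookup table versus plain shift-and-mask indexing. The function names and the least-significant-bit-first layout are assumptions chosen for illustration, not code copied from arrow-rs, and the sketch makes no performance claim of its own.

```rust
/// Table of single-bit masks, in the style of a lookup-table implementation.
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

/// Reads bit `i` using the mask table.
fn get_bit_table(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & BIT_MASK[i & 7]) != 0
}

/// Reads bit `i` using simple shift-and-mask indexing.
fn get_bit_shift(data: &[u8], i: usize) -> bool {
    ((data[i / 8] >> (i % 8)) & 1) != 0
}

/// Sets bit `i` to 1.
fn set_bit_shift(data: &mut [u8], i: usize) {
    data[i / 8] |= 1u8 << (i % 8);
}

/// Clears bit `i` to 0.
fn unset_bit_shift(data: &mut [u8], i: usize) {
    data[i / 8] &= !(1u8 << (i % 8));
}
```

Both readers return the same result for every index; the shift-based form is the kind of "simple bit indexing code" the message says compilers can lower to dedicated bit instructions.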

…pache#5780) Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](rust-itertools/itertools@v0.12.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Allows for writing binary (Binary, LargeBinary, and FixedSizeBinary) to CSV. Note: FixedSizeBinary was already being supported in this way. Values are encoded as HEX, by using the default Arrow formatter. A test was added that accounts for null values when encoding all three binary types in CSV.
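The hex encoding described above can be sketched with the standard library alone. The helper name `hex_field` is hypothetical, and two details are assumptions rather than facts taken from the commit: that the hex digits are lower-case, and that a null value is written as an empty field.

```rust
/// Hypothetical helper (not part of arrow-rs): renders an optional binary
/// value as the text of a single CSV field.
fn hex_field(value: Option<&[u8]>) -> String {
    match value {
        // Two hex digits per byte. Lower-case output is an assumption.
        Some(bytes) => bytes.iter().map(|b| format!("{b:02x}")).collect(),
        // Assumption: a null value becomes an empty field.
        None => String::new(),
    }
}
```

Because every byte maps to exactly two characters, the encoded field never contains a delimiter or quote character, so it needs no CSV escaping.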

Updates the requirements on [prost-build](https://github.com/tokio-rs/prost) to permit the latest version. - [Release notes](https://github.com/tokio-rs/prost/releases) - [Commits](tokio-rs/prost@v0.12.4...v0.12.6) --- updated-dependencies: - dependency-name: prost-build dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Updates the requirements on [proc-macro2](https://github.com/dtolnay/proc-macro2) to permit the latest version. - [Release notes](https://github.com/dtolnay/proc-macro2/releases) - [Commits](dtolnay/proc-macro2@1.0.82...1.0.83) --- updated-dependencies: - dependency-name: proc-macro2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

… writing JSON (apache#5785) * feat: encode Binary and LargeBinary types in JSON as hex Added ability to the JSON writer to encode Binary and LargeBinary types as hex. This follows the behaviour for FixedSizeBinary. A test was added to check functionality for both Binary and LargeBinary. * refactor: use ArrayAccessor instead of custom trait * refactor: use generic in test instead of macro * refactor: use const DATA_TYPE from GenericBinaryType

konjac · 2024-05-22T14:22:20Z

Hi @alamb @tustvold, could you help to review this?

tustvold · 2024-05-22T14:23:38Z

Did you perhaps mean to file this against the upstream repository?

alamb · 2024-05-22T21:01:49Z

As in, perhaps it should be a PR against https://github.com/apache/arrow-rs (not this repo, https://github.com/konjac/arrow-rs)

konjac · 2024-05-23T04:34:27Z

@tustvold @alamb Sorry for the stupid error. I did not notice that the UI had populated the wrong target branch for me. A new PR has been raised on the apache repo. Thank you! apache#5795

konjac and others added 28 commits May 11, 2024 18:16

Update README.md to include parquet-derive.

8d87ad2

Minor: Document object store release cadence (apache#5750)

1c86921

Compute data buffer length by using start and end values in offset bu…

cd39b8c

…ffer (apache#5741) * Compute data buffer length by offset buffer start and end values * Update code comment * Add unit test * Add round_trip check * Fix clippy

Downgrade to Rust 1.77 in integration pipeline to fix CI (apache#5719) (

c08feb4

apache#5761) * Downgrade to Rust 1.77 in integration pipeline (apache#5719) * Checkout nanoarrow

fix: parse string of scientific notation to decimal when the scale is…

326231e

… 0 (apache#5740) * fix: parse string to decimal when scale is 0 * fix fmt

Improve repository readme (apache#5752)

3566328

Expose the null buffer of every builder that has one (apache#5754)

7d465b8

Expose boolean builder contents (apache#5760)

78bda14

* Expose boolean builder contents * Suggest using arrow-provided utility for boolean unpacking

Add environment variable definitions to run the nanoarrow integration…

178ef99

… tests (apache#5764) * maybe run the nanoarrow tests * try to pass the location of nanoarrow to archery * fix name

feat: Make AsyncArrowWriter accepts AsyncFileWriter (apache#5753)

d17b206

Signed-off-by: Xuanwo <github@xuanwo.io>

Remove deprecated comparison kernels (apache#4733) (apache#5768)

30767a6

* Remove deprecated comparison kernels (apache#4733) * Fix docs * Fix doctest * Update arrow/src/lib.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Improve error message for timestamp queries outside supported range (a…

fa2ba9e

…pache#5730) * improved the error message * added a test to test the overflow * fixed the format arrow * removed assert

Fix documentation for parquet parse_metadata, decode_metadata and…

28c1cae

… `decode_footer` (apache#5781)

Encode UUID as FixedLenByteArray in parquet_derive (apache#5773)

30762e8

* fix uuid derive * fix byte array length handling * test lengths * fmt

Support casting a FixedSizedList<T>[1] to T (apache#5779)

a126d50

* Support casting a `FixedSizedList<T>[1]` to `T` * Add FixedSizedList[1] => FixedSizedList[1] tests

fix broken link to ballista crate (apache#5784)

2534976

Set the default size of BitWriter for DeltaBitPackEncoder to 1MB (apa…

ce8363a

…che#5776)

Structured interval types for IntervalMonthDayNano or `IntervalDayT…

cf59b6c

…ime` (apache#3125) (apache#5654) (apache#5769) * Structured interval type (apache#3125) (apache#5654) * Update integration-test * Fix 32-bit build * Review feedback

Refine parquet documentation on types and metadata (apache#5786)

c6b3eaa

* Refine parquet documentation on types and metadata * Update regen.sh and thrift.rs * Clarify page index encompasses offset index and column index * revert unexpected diff

Refine ParquetRecordBatchReaderBuilder docs (apache#5774)

d65240c

* Refine ParquetRecordBatchReaderBuilder docs * fix link * Suggest using new(), add example

Fix incorrect URL to Parquet CPP types.h (apache#5790)

5e9919f

Merge branch 'master' into add-parquet-derive-to-readme

abf82ae

konjac and others added 2 commits May 28, 2024 20:19

Fix link

997ee31

Prettier

0869fc3
