Add parquet-derive to repository README #1

Open
wants to merge 31 commits into
base: master

Conversation

konjac
Owner

@konjac konjac commented May 11, 2024

Which issue does this PR close?

Documentation refinement: add parquet-derive to the repository README.

Closes apache#5751

Rationale for this change

See apache#5751

What changes are included in this PR?

Add parquet-derive to the repository README, along with some minor refinements.

Are there any user-facing changes?

No. Only README changes.

konjac and others added 28 commits May 11, 2024 18:16
…ffer (apache#5741)

* Compute data buffer length by offset buffer start and end values

* Update code comment

* Add unit test

* Add round_trip check

* Fix clippy
This patch adds reader support for a comment character when reading CSV
files. Comments, like almost everything else about the CSV format, are not
truly standardized, but a common convention supported by many CSV
readers[^1][^2] is to ignore full lines starting with a comment
character (often `#`); inline or end-of-line comments are not supported.

Example:

    # This is a comment in a CSV file without header.
    1,2
    # Comment inside the data block.
    11,22

The implementation of this for Arrow is straightforward, as all
we need to do is expose the existing `comment` option of `csv_core`, which is
used to read CSV files.

Closes apache#5758.

[^1]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
[^2]: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
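The full-line comment rule described above can be illustrated with a minimal, standard-library-only sketch. This is not the `csv_core` or Arrow reader API: the function name `parse_csv_skipping_comments` is hypothetical, and the sketch assumes unquoted fields and skips empty lines.

```rust
/// Hypothetical helper (not part of arrow or csv_core): split CSV text into
/// rows of fields, ignoring every line whose first byte equals `comment`.
/// Assumes no quoting or escaping; empty lines are skipped.
fn parse_csv_skipping_comments(input: &str, comment: Option<u8>) -> Vec<Vec<String>> {
    input
        .lines()
        .filter(|line| !line.is_empty())
        // Only a comment character in the first column marks a comment line.
        .filter(|line| match comment {
            Some(c) => line.as_bytes().first() != Some(&c),
            None => true,
        })
        // Inline or end-of-line comments are NOT recognized: a comment
        // character later in the line stays part of the field.
        .map(|line| line.split(',').map(|f| f.to_string()).collect())
        .collect()
}
```

Called on the example file above with `Some(b'#')`, this keeps only the rows `1,2` and `11,22`. With `None`, comment lines are parsed as ordinary data.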
apache#5761)

* Downgrade to Rust 1.77 in integration pipeline (apache#5719)

* Checkout nanoarrow
… 0 (apache#5740)

* fix: parse string to decimal when scale is 0

* fix fmt
* Expose boolean builder contents

* Suggest using arrow-provided utility for boolean unpacking
… tests (apache#5764)

* maybe run the nanoarrow tests

* try to pass the location of nanoarrow to archery

* fix name
* Remove deprecated comparison kernels (apache#4733)

* Fix docs

* Fix doctest

* Update arrow/src/lib.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
)

CPUs have efficient instructions for querying, setting and clearing bits, and
modern compilers know how to turn simple bit indexing code into such
instructions. The table lookup optimizations may have been useful in older
versions of rustc, but as of rustc 1.78, they are a net loss. See PR description
for more details.
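To make the comparison concrete, here is a minimal sketch of the two styles of bit access. The function names and the `BIT_MASK` table are illustrative assumptions rather than the exact code changed in the PR. Bits are numbered least-significant first within each byte.

```rust
/// Lookup table with one set bit per entry (the older, table-based style).
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

/// Table-based read: fetch the mask for bit `i % 8` from memory.
fn get_bit_table(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & BIT_MASK[i & 7]) != 0
}

/// Shift-based read: compute the mask directly with a shift.
fn get_bit_shift(data: &[u8], i: usize) -> bool {
    (data[i / 8] & (1u8 << (i % 8))) != 0
}

/// Shift-based set: turn bit `i` on.
fn set_bit_shift(data: &mut [u8], i: usize) {
    data[i / 8] |= 1u8 << (i % 8);
}

/// Shift-based clear: turn bit `i` off.
fn unset_bit_shift(data: &mut [u8], i: usize) {
    data[i / 8] &= !(1u8 << (i % 8));
}
```

Both read functions return the same result for every in-bounds index. The shift-based forms simply avoid the extra memory load for the mask, which gives the compiler a simpler pattern to map onto bit instructions.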
…pache#5730)

* improved the error message

* added a test to test the overflow

* fixed the format arrow

* removed assert
…pache#5780)

Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](rust-itertools/itertools@v0.12.0...v0.13.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix uuid derive

* fix byte array length handling

* test lengths

* fmt
* Support casting a `FixedSizedList<T>[1]` to `T`

* Add FixedSizedList[1] => FixedSizedList[1] tests
Allows writing binary types (Binary, LargeBinary, and FixedSizeBinary) to
CSV. Note: FixedSizeBinary was already supported in this way.

Values are encoded as hex, using the default Arrow formatter.

A test was added that accounts for null values when encoding all three
binary types in CSV.
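As a rough illustration of hex-encoding a nullable binary column for CSV output, consider the standard-library-only sketch below. It does not call the Arrow formatter. The helper names are hypothetical, and both the lowercase hex digits and the empty field for null values are assumptions made for this sketch.

```rust
/// Hypothetical helper: encode bytes as two hex digits each.
/// Lowercase output is an assumption of this sketch.
fn hex_encode(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Hypothetical helper: write a single nullable binary column as CSV lines.
/// A null value is emitted as an empty field (an assumption of this sketch).
fn binary_column_to_csv(values: &[Option<&[u8]>]) -> String {
    let mut out = String::new();
    for v in values {
        if let Some(bytes) = v {
            out.push_str(&hex_encode(bytes));
        }
        out.push('\n');
    }
    out
}
```

Hex output contains only the characters `0-9` and `a-f`, so the encoded field never needs CSV quoting or escaping, whatever bytes the value holds.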
…ime` (apache#3125) (apache#5654) (apache#5769)

* Structured interval type (apache#3125) (apache#5654)

* Update integration-test

* Fix 32-bit build

* Review feedback
* Refine parquet documentation on types and metadata

* Update regen.sh and thrift.rs

* Clarify page index encompasses offset index and column index

* revert unexpected diff
Updates the requirements on [prost-build](https://github.com/tokio-rs/prost) to permit the latest version.
- [Release notes](https://github.com/tokio-rs/prost/releases)
- [Commits](tokio-rs/prost@v0.12.4...v0.12.6)

---
updated-dependencies:
- dependency-name: prost-build
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Updates the requirements on [proc-macro2](https://github.com/dtolnay/proc-macro2) to permit the latest version.
- [Release notes](https://github.com/dtolnay/proc-macro2/releases)
- [Commits](dtolnay/proc-macro2@1.0.82...1.0.83)

---
updated-dependencies:
- dependency-name: proc-macro2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… writing JSON (apache#5785)

* feat: encode Binary and LargeBinary types in JSON as hex

Added ability to the JSON writer to encode Binary and LargeBinary types
as hex. This follows the behaviour for FixedSizeBinary.

A test was added to check functionality for both Binary and LargeBinary.

* refactor: use ArrayAccessor instead of custom trait

* refactor: use generic in test instead of macro

* refactor: use const DATA_TYPE from GenericBinaryType
* Refine ParquetRecordBatchReaderBuilder docs

* fix link

* Suggest using new(), add example
@konjac
Owner Author

konjac commented May 22, 2024

Hi @alamb @tustvold, could you help to review this?

@tustvold

Did you perhaps mean to file this against the upstream repository?

@alamb

alamb commented May 22, 2024

As in, perhaps it should be a PR against https://github.com/apache/arrow-rs (not this repo, https://github.com/konjac/arrow-rs)

@konjac
Owner Author

konjac commented May 23, 2024

@tustvold @alamb Sorry for the stupid error. I did not notice that the UI had populated the wrong target repository for me. A new PR has been raised on the apache repo. Thank you! apache#5795

Development

Successfully merging this pull request may close these issues.

parquet-derive should be included in repository README.