Commit
apacheGH-37046: [MATLAB] Implement `featherread` in terms of `arrow.internal.io.feather.Reader` (apache#37163)

### Rationale for this change

Now that apache#37044 is merged, we can re-implement `featherread` in terms of the new `arrow.internal.io.feather.Reader` class. Once this change is made, we can delete the legacy build infrastructure and `featherread` MEX code.

### What changes are included in this PR?

1. Reimplemented `featherread` in terms of the new `arrow.internal.io.feather.Reader` class.
2. We tried to maintain compatibility with the old code as much as possible. However, since `featherread` is now implemented in terms of `RecordBatch`, these changes introduce some minor differences in behavior and add support for some new data types (e.g. `Boolean`, `String`, `Timestamp`).
3. Updated `arrow/matlab/io/feather/proxy/reader.cc` to use `Table::CombineChunksToBatch` rather than a `TableBatchReader`. This prevents a `nullptr` dereference that was occurring when reading a Feather V1 file created from an empty table.

**Example**

```matlab
>> tWrite = table(["A"; "B"; "C"], [true; false; true], [1; 2; 3], VariableNames=["String", "Boolean", "Float64"])

tWrite =

  3x3 table

    String    Boolean    Float64
    ______    _______    _______

     "A"       true         1
     "B"       false        2
     "C"       true         3

>> featherwrite("test.feather", tWrite)

>> tRead = featherread("test.feather")

tRead =

  3x3 table

    String    Boolean    Float64
    ______    _______    _______

     "A"       true         1
     "B"       false        2
     "C"       true         3

>> isequaln(tWrite, tRead)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Updated the existing `tfeather.m` and `tfeathermex.m` tests to reflect the new behavior of `featherread`. This mainly consists of error message ID changes.
2. Added a new test to verify that all MATLAB types supported by `arrow.tabular.RecordBatch` can be round-tripped to a Feather V1 file.
3. Added a new test to verify that a MATLAB `table` with Unicode `VariableNames` can be round-tripped to a Feather V1 file.

### Are there any user-facing changes?

Yes.

1. Now that `featherread` is implemented in terms of `arrow.internal.io.feather.Reader` and `arrow.tabular.RecordBatch`, it supports reading more types like `Boolean`, `String`, `Timestamp`, etc. **Note**: We updated the code to cast `logical`/`Boolean` type columns containing null values to `double` and substitute null values with `NaN`. This mirrors the existing behavior of `featherread` for integer type columns containing null values.
2. There are some minor error message ID changes.
3. Cell arrays of strings with a single element (e.g. `{'filename.feather'}`) are now supported as a valid `filename` for `featherread`.

### Future Directions

1. In the future, we may want to consider no longer casting columns with integer/logical type containing null values to `double` and substituting null values with `NaN`. This behavior isn't ideal in all cases (it can be lossy for types like `uint64`). This change would break compatibility.
2. Delete legacy Feather V1 code and build infrastructure.

### Notes

1. Thank you @sgilmore10 for your help with this pull request!

* Closes: apache#37046

Authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
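
### Illustration: why the `double` cast can be lossy

The Future Directions item about `uint64` can be illustrated with a minimal MATLAB sketch. This is not the `featherread` implementation: the `values` and `isNull` variables are hypothetical stand-ins for a column and its null mask, and the hexadecimal literal syntax assumes MATLAB R2019b or later.

```matlab
% Hypothetical uint64 column and null mask (illustrative only, not from the PR).
values = [0x20000000000001u64; 0x2u64; 0x3u64];  % first element is 2^53 + 1
isNull = [false; true; false];

% Cast to double and substitute NaN for null elements, mirroring the
% behavior described for integer and logical columns containing nulls.
asDouble = double(values);
asDouble(isNull) = NaN;

% 2^53 + 1 has no exact double representation, so the non-null first
% element no longer round-trips back to its original uint64 value.
roundTripped = uint64(asDouble(1));
lossy = roundTripped ~= values(1);  % true: precision was lost in the cast
```

Keeping such columns in their original integer type would avoid this loss but, as noted above, would change what existing callers receive.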