Upgrade Arrow to 0.15 #683

rshkv · 2020-05-23T19:07:35Z

Change

This upgrades Arrow to 0.15.1 and requires 0.15 in pyspark and r-spark clients.

Commits

Taking SPARK-29376 (apache#26133), which upgrades Arrow to 0.15.1 and requires a pyarrow minimum of the same. Upgrading SparkR came after this. I understand we can worry about Arrow-serialization in SparkR independently.

[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1
[SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
[SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API
[SPARK-31701][R][SQL] Bump up the minimum Arrow version as 0.15.1 in SparkR

How was this patch tested?

Existing tests. Minus Arrow tests in SparkR because we don't install r-arrow.

Upgrade Apache Arrow to version 0.15.1. This includes Java artifacts and increases the minimum required version of PyArrow also. Version 0.12.0 to 0.15.1 includes the following selected fixes/improvements relevant to Spark users: * ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes * ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype * ARROW-5579 - [Java] shade flatbuffer dependency * ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount * ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits * ARROW-5893 - [C++] Remove arrow::Column class from C++ library * ARROW-5970 - [Java] Provide pointer to Arrow buffer * ARROW-6070 - [Java] Avoid creating new schema before IPC sending * ARROW-6279 - [Python] Add Table.slice method or allow slices in \_\_getitem\_\_ * ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files. * ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table * ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime * ARROW-1261 - [Java] Add container type for Map logical type * ARROW-1207 - [C++] Implement Map logical type Changelog can be seen at https://arrow.apache.org/release/0.15.0.html Upgrade to get bug fixes, improvements, and maintain compatibility with future versions of PyArrow. No Existing tests, manually tested with Python 3.7, 3.8 Closes apache#26133 from BryanCutler/arrow-upgrade-015-SPARK-29376. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

### What changes were proposed in this pull request? Add documentation to SQL programming guide to use PyArrow >= 0.15.0 with current versions of Spark. ### Why are the changes needed? Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran pandas_udfs tests using PyArrow 0.15.0 with environment variable set. Closes apache#26045 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

[[SPARK-29376] Upgrade Apache Arrow to version 0.15.1](apache#26133) upgrades to Arrow 0.15 at Scala/Java/Python. This PR aims to upgrade `SparkR` to use Arrow 0.15 API. Currently, it's broken. First of all, it turns out that our Jenkins jobs (including PR builder) ignores Arrow test. Arrow 0.15 has a breaking R API changes at [ARROW-5505](https://issues.apache.org/jira/browse/ARROW-5505) and we missed that. AppVeyor was the only one having SparkR Arrow tests but it's broken now. **Jenkins** ``` Skipped ------------------------------------------------------------------------ 1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#25) - arrow not installed ``` Second, Arrow throws OOM on AppVeyor environment (Windows JDK8) like the following because it still has Arrow 0.14. ``` Warnings ----------------------------------------------------------------------- 1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:669) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243) ``` It is due to the version mismatch. ```java int messageLength = MessageSerializer.bytesToInt(buffer.array()); if (messageLength == IPC_CONTINUATION_TOKEN) { buffer.clear(); // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length if (in.readFully(buffer) == 4) { messageLength = MessageSerializer.bytesToInt(buffer.array()); } } // Length of 0 indicates end of stream if (messageLength != 0) { // Read the message into the buffer. ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength); ``` After upgrading this to 0.15, we are hitting ARROW-5505. This PR upgrades Arrow version in AppVeyor and fix the issue. No. Pass the AppVeyor. This PR passed here. - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28909044 ``` SparkSQL Arrow optimization: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. ................ ``` Closes apache#26555 from dongjoon-hyun/SPARK-R-TEST. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…SparkR ### What changes were proposed in this pull request? This PR proposes to set the minimum Arrow version as 0.15.1 to be consistent with PySpark side at. ### Why are the changes needed? It will reduce the maintenance overhead to match the Arrow versions, and minimize the supported range. SparkR Arrow optimization is experimental yet. ### Does this PR introduce _any_ user-facing change? No, it's the change in unreleased branches only. ### How was this patch tested? 0.15.x was already tested at SPARK-29378, and we're testing the latest version of SparkR currently in AppVeyor. I already manually tested too. Closes apache#28520 from HyukjinKwon/SPARK-31701. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

rshkv force-pushed the wr/arrow-0.15.1 branch from e4dd12c to 96c394f Compare May 23, 2020 19:25

rshkv mentioned this pull request May 23, 2020

Misc PyArrow fixes #684

Merged

9 tasks

rshkv added the do not merge label May 23, 2020

rshkv force-pushed the wr/arrow-0.15.1 branch from aef4e56 to a6a2b23 Compare June 24, 2020 18:23

rshkv changed the title ~~[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1~~ Upgrade Arrow to 0.15 Jun 24, 2020

BryanCutler and others added 4 commits July 15, 2020 13:22

rshkv force-pushed the wr/arrow-0.15.1 branch from a6a2b23 to 5eb4041 Compare July 15, 2020 11:23

rshkv removed the do not merge label Jul 15, 2020

rshkv requested a review from robert3005 July 15, 2020 14:41

robert3005 approved these changes Jul 15, 2020

View reviewed changes

rshkv merged commit a659c98 into master Jul 15, 2020

rshkv deleted the wr/arrow-0.15.1 branch July 15, 2020 14:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade Arrow to 0.15 #683

Upgrade Arrow to 0.15 #683

rshkv commented May 23, 2020 •

edited

Loading

Upgrade Arrow to 0.15 #683

Upgrade Arrow to 0.15 #683

Conversation

rshkv commented May 23, 2020 • edited Loading

Change

Commits

How was this patch tested?

rshkv commented May 23, 2020 •

edited

Loading