Upgrade Arrow to 0.15 #683

Merged: 4 commits merged on Jul 15, 2020
Commits on Jul 15, 2020

  1. [SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1

    ### What changes were proposed in this pull request?

    Upgrade Apache Arrow to version 0.15.1. This includes the Java artifacts and also increases the minimum required version of PyArrow.
    
    Versions 0.12.0 through 0.15.1 include the following selected fixes/improvements relevant to Spark users:
    
    * ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes
    * ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype
    * ARROW-5579 - [Java] shade flatbuffer dependency
    * ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount
    * ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
    * ARROW-5893 - [C++] Remove arrow::Column class from C++ library
    * ARROW-5970 - [Java] Provide pointer to Arrow buffer
    * ARROW-6070 - [Java] Avoid creating new schema before IPC sending
    * ARROW-6279 - [Python] Add Table.slice method or allow slices in __getitem__
    * ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files.
    * ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table
    * ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime
    * ARROW-1261 - [Java] Add container type for Map logical type
    * ARROW-1207 - [C++] Implement Map logical type
    
    Changelog can be seen at https://arrow.apache.org/release/0.15.0.html
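    
    To make two of the Python-side items above concrete (ARROW-6279 and ARROW-6557), here is a minimal sketch, assuming pyarrow >= 0.15.1 and pandas are installed:
    
    ```python
    # Sketch of two Python-side improvements in the 0.12 -> 0.15 range.
    import pyarrow as pa
    
    table = pa.Table.from_pydict({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
    
    # ARROW-6279: slice a Table by (offset, length) without copying
    middle = table.slice(1, 2)
    print(middle.num_rows)  # 2
    
    # ARROW-6557: Array/ChunkedArray.to_pandas now always returns a pandas.Series
    series = table.column("a").to_pandas()
    print(type(series).__name__)  # Series
    ```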
    
    ### Why are the changes needed?
    
    Upgrade to get bug fixes and improvements, and to maintain compatibility with future versions of PyArrow.
    
    ### Does this PR introduce any user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests; manually tested with Python 3.7 and 3.8.
    
    Closes apache#26133 from BryanCutler/arrow-upgrade-015-SPARK-29376.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and rshkv committed Jul 15, 2020
    Commit: 59b23f4
  2. [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide

    ### What changes were proposed in this pull request?
    
    Add documentation to the SQL programming guide on using PyArrow >= 0.15.0 with current versions of Spark.
    
    ### Why are the changes needed?
    
    Arrow 0.15.0 introduced a change in the binary IPC format, which requires setting an environment variable to maintain compatibility.
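    
    Concretely, the note tells users to set `ARROW_PRE_0_15_IPC_FORMAT=1` in the environment of the driver and executor Python processes (the guide suggests `conf/spark-env.sh`). A minimal PySpark sketch of the same idea, using the standard `spark.executorEnv.*` config prefix:
    
    ```python
    # Hedged sketch: make PyArrow >= 0.15.0 write the pre-0.15 IPC stream format
    # so a Spark built against Arrow 0.14 can still read it.
    import os
    from pyspark.sql import SparkSession
    
    # Driver-side Python (used by createDataFrame(pandas_df) and toPandas())
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    
    spark = (
        SparkSession.builder
        # Executor-side Python workers (used by pandas UDFs)
        .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        .getOrCreate()
    )
    ```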
    
    ### Does this PR introduce any user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Ran the pandas_udf tests using PyArrow 0.15.0 with the environment variable set.
    
    Closes apache#26045 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and rshkv committed Jul 15, 2020
    Commit: b28645d
  3. [SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API

    [[SPARK-29376] Upgrade Apache Arrow to version 0.15.1](apache#26133) upgrades to Arrow 0.15 on the Scala/Java/Python side. This PR aims to upgrade `SparkR` to use the Arrow 0.15 API; it is currently broken.
    
    First of all, it turns out that our Jenkins jobs (including the PR builder) skip the Arrow tests. Arrow 0.15 has breaking R API changes at [ARROW-5505](https://issues.apache.org/jira/browse/ARROW-5505) and we missed that. AppVeyor was the only environment running the SparkR Arrow tests, but it is broken now.
    
    **Jenkins**
    ```
    Skipped ------------------------------------------------------------------------
    1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#25)
    - arrow not installed
    ```
    
    Second, Arrow throws an OutOfMemoryError in the AppVeyor environment (Windows, JDK 8) like the following, because it still has Arrow 0.14.
    ```
    Warnings -----------------------------------------------------------------------
    1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space
    	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:669)
    	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
    ```
    
    It is due to the version mismatch: the 0.15 reader below expects the stream framing introduced in ARROW-6313 (a continuation token followed by the message length), so bytes written in the 0.14 format can yield a bogus message length, and the subsequent `ByteBuffer.allocate(messageLength)` exhausts the heap.
    ```java
    int messageLength = MessageSerializer.bytesToInt(buffer.array());
    if (messageLength == IPC_CONTINUATION_TOKEN) {
      buffer.clear();
      // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length
      if (in.readFully(buffer) == 4) {
        messageLength = MessageSerializer.bytesToInt(buffer.array());
      }
    }
    
    // Length of 0 indicates end of stream
    if (messageLength != 0) {
      // Read the message into the buffer.
      ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);
    ```
    ```
    
    After upgrading this to 0.15, we are hitting ARROW-5505. This PR upgrades the Arrow version in AppVeyor and fixes the issue.
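    
    To make the framing mismatch concrete, here is a hedged PyArrow sketch (any pyarrow >= 0.15) showing the ARROW-6313 continuation token at the start of a stream, which a 0.14-era reader would misinterpret as a message length:
    
    ```python
    # Sketch: inspect the first bytes of an Arrow IPC stream written by
    # pyarrow >= 0.15. The stream begins with the 0xFFFFFFFF continuation
    # token (ARROW-6313); a pre-0.15 reader treats these bytes as a length.
    import pyarrow as pa
    
    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    
    print(sink.getvalue().to_pybytes()[:4])  # b'\xff\xff\xff\xff'
    ```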
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the AppVeyor build.
    
    This PR passed here:
    - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28909044
    
    ```
    SparkSQL Arrow optimization: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
    ................
    ```
    
    Closes apache#26555 from dongjoon-hyun/SPARK-R-TEST.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun authored and rshkv committed Jul 15, 2020
    Commit: 79b3995
  4. [SPARK-31701][R][SQL] Bump up the minimum Arrow version to 0.15.1 in SparkR
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to set the minimum Arrow version to 0.15.1, to be consistent with the PySpark side.
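    
    For context, PySpark enforces the matching minimum with an import-time check before enabling Arrow optimizations. A sketch of that kind of version gate (the helper name and message below are illustrative, not the exact Spark code):
    
    ```python
    # Hedged sketch of a minimum-version gate like the one PySpark applies;
    # the names below are illustrative.
    from distutils.version import LooseVersion
    
    MINIMUM_PYARROW_VERSION = "0.15.1"
    
    def require_minimum_pyarrow_version():
        import pyarrow
        if LooseVersion(pyarrow.__version__) < LooseVersion(MINIMUM_PYARROW_VERSION):
            raise ImportError(
                "PyArrow >= %s must be installed; found %s."
                % (MINIMUM_PYARROW_VERSION, pyarrow.__version__)
            )
    ```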
    
    ### Why are the changes needed?
    
    Matching the Arrow versions will reduce the maintenance overhead and minimize the supported range. The SparkR Arrow optimization is still experimental.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, the change is in unreleased branches only.
    
    ### How was this patch tested?
    
    0.15.x was already tested at SPARK-29378, and we're currently testing the latest version of SparkR in AppVeyor. I also tested manually.
    
    Closes apache#28520 from HyukjinKwon/SPARK-31701.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    HyukjinKwon authored and rshkv committed Jul 15, 2020
    Commit: 5eb4041