SPARK-1429: Debian packaging #2

Closed · wants to merge 1,270 commits
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Jun 13, 2017

  1. [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions
    
    ## What changes were proposed in this pull request?
    
    Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#18216 from srowen/SPARK-20920.
    
    (cherry picked from commit 7b7c85e)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jun 13, 2017
    Commit: 24836be
  2. [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions
    
    ## What changes were proposed in this pull request?
    
    Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#18216 from srowen/SPARK-20920.
    
    (cherry picked from commit 7b7c85e)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jun 13, 2017
    Commit: 58a8a37
  3. [SPARK-21060][WEB-UI] Css style about paging function is error in the executor page. Css style about paging function is error in the executor page. It is different of history server ui paging function css style.
    
    ## What changes were proposed in this pull request?
    
    The CSS style of the paging controls on the executor page is wrong; it differs from the paging CSS of the history server UI.
    
    **But their styles should be consistent**. There are three reasons.
    
    1. The first reason: 'Previous', 'Next' and the page numbers should be rendered as buttons.
    
    2. The second reason: when you are on the first page, 'Previous' and '1' should be gray and cannot be clicked.
    ![1](https://user-images.githubusercontent.com/26266482/27026667-1fe745ee-4f91-11e7-8b34-150819d22bd3.png)
    
    3. The third reason: when you are on the last page, 'Previous' and 'Max number' should be gray and cannot be clicked.
    ![2](https://user-images.githubusercontent.com/26266482/27026811-9d8d6fa0-4f91-11e7-8b51-7816c3feb381.png)
    
    before fix:
    ![fix_before](https://user-images.githubusercontent.com/26266482/27026428-47ec5c56-4f90-11e7-9dd5-d52c22d7bd36.png)
    
    after fix:
    ![fix_after](https://user-images.githubusercontent.com/26266482/27026439-50d17072-4f90-11e7-8405-6f81da5ab32c.png)
    
    The style of history server ui:
    ![history](https://user-images.githubusercontent.com/26266482/27026528-9c90f780-4f90-11e7-91e6-90d32651fe03.png)
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
    Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
    Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
    
    Closes apache#18275 from guoxiaolongzte/SPARK-21060.
    
    (cherry picked from commit b7304f2)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    guoxiaolong authored and srowen committed Jun 13, 2017
    Commit: 039c465
  4. [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite
    
    ## What changes were proposed in this pull request?
    
    The default value for `spark.port.maxRetries` is 100,
    but we use 10 in the suite file.
    So we change it to 100 to avoid test failure.
    
    ## How was this patch tested?
    No test
    
    Author: DjvuLee <lihu@bytedance.com>
    
    Closes apache#18280 from djvulee/NettyTestBug.
    
    (cherry picked from commit b36ce2a)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    DjvuLee authored and srowen committed Jun 13, 2017
    Commit: 2bc2c15
  5. [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite
    
    ## What changes were proposed in this pull request?
    
    The default value for `spark.port.maxRetries` is 100,
    but we use 10 in the suite file.
    So we change it to 100 to avoid test failure.
    
    ## How was this patch tested?
    No test
    
    Author: DjvuLee <lihu@bytedance.com>
    
    Closes apache#18280 from djvulee/NettyTestBug.
    
    (cherry picked from commit b36ce2a)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    DjvuLee authored and srowen committed Jun 13, 2017
    Commit: ee0e74e
  6. [SPARK-20979][SS] Add RateSource to generate values for tests and benchmark
    
    ## What changes were proposed in this pull request?
    
    This PR adds RateSource for Structured Streaming so that the user can use it to generate data for tests and benchmark easily.
    
    This source generates incrementing long values with timestamps. Each generated row has two columns: a timestamp column for the generation time and an auto-incrementing long column starting at 0L.
    
    It supports the following options:
    - `rowsPerSecond` (e.g. 100, default: 1): How many rows should be generated per second.
    - `rampUpTime` (e.g. 5s, default: 0s): How long to ramp up before the generating speed becomes `rowsPerSecond`. Using finer granularities than seconds will be truncated to integer seconds.
    - `numPartitions` (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows. The source will try its best to reach `rowsPerSecond`, but the query may be resource constrained, and `numPartitions` can be tweaked to help reach the desired speed.
    
    Here is a simple example that prints 10 rows per seconds:
    ```
        spark.readStream
          .format("rate")
          .option("rowsPerSecond", "10")
          .load()
          .writeStream
          .format("console")
          .start()
    ```
    
    The idea came from marmbrus and he did the initial work.
    
    ## How was this patch tested?
    
    The added tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#18199 from zsxwing/rate.
    zsxwing committed Jun 13, 2017
    Commit: 220943d
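    
    As an editorial aside, here is a hedged sketch combining the three rate-source options listed above; the output column names ("timestamp", "value") are an assumption, since the commit message does not spell them out:
    
    ```scala
    // Sketch only: rate source with all three options from the description above.
    val rateStream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")   // target generation rate
      .option("rampUpTime", "5s")       // ramp up to 100 rows/s over 5 seconds
      .option("numPartitions", "10")    // partitions for the generated rows
      .load()                           // columns assumed: timestamp, value

    rateStream.writeStream
      .format("console")
      .start()
    ```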

Commits on Jun 14, 2017

  1. [SPARK-12552][CORE] Correctly count the driver resource when recovering from failure for Master
    
    Currently in Standalone HA mode, the driver's resource usage is not correctly counted in the Master when recovering from failure, which leads to unexpected behaviors such as negative values in the UI.
    
    So this fix also counts the driver's resource usage.
    
    It also changes the recovered app's state to `RUNNING` once fully recovered; previously it stayed WAITING even when fully recovered.
    
    andrewor14 please help to review, thanks a lot.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes apache#10506 from jerryshao/SPARK-12552.
    
    (cherry picked from commit 9eb0952)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jerryshao authored and cloud-fan committed Jun 14, 2017
    Commit: 53212c3
  2. [SPARK-20986][SQL] Reset table's statistics after PruneFileSourcePartitions rule.
    
    ## What changes were proposed in this pull request?
    After the PruneFileSourcePartitions rule runs, the table's statistics need to be reset, because PruneFileSourcePartitions can filter out unnecessary partitions and the statistics therefore change.
    
    ## How was this patch tested?
    add unit test.
    
    Author: lianhuiwang <lianhuiwang09@gmail.com>
    
    Closes apache#18205 from lianhuiwang/SPARK-20986.
    
    (cherry picked from commit 8b5b2e2)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    lianhuiwang authored and cloud-fan committed Jun 14, 2017
    Commit: 42cc830
  3. [SPARK-21085][SQL] Failed to read the partitioned table created by Spark 2.1
    
    ### What changes were proposed in this pull request?
    Before the PR, Spark is unable to read the partitioned table created by Spark 2.1 when the table schema does not put the partitioning column at the end of the schema.
    [assert(partitionFields.map(_.name) == partitionColumnNames)](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L234-L236)
    
    When reading the table metadata from the metastore, we also need to reorder the columns.
    
    ### How was this patch tested?
    Added test cases to check both Hive-serde and data source tables.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18295 from gatorsmile/reorderReadSchema.
    
    (cherry picked from commit 0c88e8d)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 14, 2017
    Commit: 9bdc835
  4. [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0
    
    ### What changes were proposed in this pull request?
    
    This PR is to backport apache#18244 to 2.2
    
    ---
    
    The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.
    
    A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, per the [Java BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal implementation follows the database decimal standard, in which precision is the total number of digits, both to the left and to the right of the decimal point. Thus, this PR fixes the issue by doing the conversion.
    
    Before this PR, the following queries failed:
    ```SQL
    select 1 > 0.0001
    select floor(0.0001)
    select ceil(0.0001)
    ```
    
    ### How was this patch tested?
    Added test cases.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18297 from gatorsmile/backport18244.
    gatorsmile authored and cloud-fan committed Jun 14, 2017
    Commit: 6265119
  5. [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0
    
    ### What changes were proposed in this pull request?
    
    This PR is to backport apache#18244 to 2.2
    
    ---
    
    The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.
    
    A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, per the [Java BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal implementation follows the database decimal standard, in which precision is the total number of digits, both to the left and to the right of the decimal point. Thus, this PR fixes the issue by doing the conversion.
    
    Before this PR, the following queries failed:
    ```SQL
    select 1 > 0.0001
    select floor(0.0001)
    select ceil(0.0001)
    ```
    
    ### How was this patch tested?
    Added test cases.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18297 from gatorsmile/backport18244.
    
    (cherry picked from commit 6265119)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 14, 2017
    Commit: a890466
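    
    To make the precision mismatch described above concrete, here is a small JVM-level illustration (plain java.math.BigDecimal, not Spark's internal Decimal class; the DECIMAL(4, 4) mapping is illustrative):
    
    ```scala
    // BigDecimal counts precision from the leftmost nonzero digit,
    // so 0.0001 reports precision 1 even though its scale is 4.
    val d = new java.math.BigDecimal("0.0001")
    println(d.precision)  // 1
    println(d.scale)      // 4
    // A database-style decimal must satisfy precision >= scale,
    // e.g. DECIMAL(4, 4) here, hence the conversion done by this fix.
    ```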
  6. [SPARK-21089][SQL] Fix DESC EXTENDED/FORMATTED to Show Table Properties

    Since both table properties and storage properties share the same key values, table properties are not shown in the output of DESC EXTENDED/FORMATTED when the storage properties are not empty.
    
    This PR is to fix the above issue by renaming them to different keys.
    
    Added test cases.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes apache#18294 from gatorsmile/tableProperties.
    
    (cherry picked from commit df766a4)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    gatorsmile committed Jun 14, 2017
    Commit: 3dda682
  7. Revert "[SPARK-20941][SQL] Fix SubqueryExec Reuse"

    This reverts commit 6a4e023.
    gatorsmile committed Jun 14, 2017
    Commit: e02e063

Commits on Jun 15, 2017

  1. [SPARK-20980][SQL] Rename wholeFile to multiLine for both CSV and JSON
    
    The current option name `wholeFile` is misleading for CSV users: it does not mean one record per file, since one file can contain multiple records. Thus, we should rename it; the proposal is `multiLine`.
    
    N/A
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes apache#18202 from gatorsmile/renameCVSOption.
    
    (cherry picked from commit 2051428)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 15, 2017
    Commit: af4f89c
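    
    A hedged usage sketch of the renamed option (file paths are placeholders):
    
    ```scala
    // "multiLine" replaces the old "wholeFile" option name.
    val jsonDF = spark.read
      .option("multiLine", "true")      // one JSON record may span multiple lines
      .json("/path/to/records.json")

    val csvDF = spark.read
      .option("multiLine", "true")      // quoted CSV fields may contain newlines
      .option("header", "true")
      .csv("/path/to/records.csv")
    ```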
  2. [SPARK-20980][DOCS] update doc to reflect multiLine change

    ## What changes were proposed in this pull request?
    
    doc only change
    
    ## How was this patch tested?
    
    manually
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes apache#18312 from felixcheung/sqljsonwholefiledoc.
    
    (cherry picked from commit 1bf55e3)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Jun 15, 2017
    Commit: b5504f6
  3. [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message
    
    ## What changes were proposed in this pull request?
    
    Currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too long, the assertion in this test case fails.
    The failure can easily be reproduced by sleeping for a while before removing the block in BlockManagerSlaveEndpoint.receiveAndReply().
    
    ## How was this patch tested?
    N/A
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes apache#18314 from jiangxb1987/LocalCheckpointSuite.
    
    (cherry picked from commit 7dc3e69)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jiangxb1987 authored and cloud-fan committed Jun 15, 2017
    Commit: 76ee41f
  4. [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message
    
    ## What changes were proposed in this pull request?
    
    Currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too long, the assertion in this test case fails.
    The failure can easily be reproduced by sleeping for a while before removing the block in BlockManagerSlaveEndpoint.receiveAndReply().
    
    ## How was this patch tested?
    N/A
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes apache#18314 from jiangxb1987/LocalCheckpointSuite.
    
    (cherry picked from commit 7dc3e69)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jiangxb1987 authored and cloud-fan committed Jun 15, 2017
    Commit: 62f2b80

Commits on Jun 16, 2017

  1. [SPARK-21111][TEST][2.2] Fix the test failure of describe.sql

    ## What changes were proposed in this pull request?
    Test failed in `describe.sql`.
    
    We need to fix the related bug introduced in (apache#17649) in the follow-up PR to master.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18316 from gatorsmile/fix.
    gatorsmile authored and yhuai committed Jun 16, 2017
    Commit: a585c87
  2. [SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node.
    
    ## What changes were proposed in this pull request?
    
    As the name and the comments of `TreeNode.mapChildren` indicate, the function should apply only to the node's current children. So the following code should check whether the node being mapped is actually a child node.
    
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes apache#18284 from ConeyLiu/treenode.
    
    (cherry picked from commit 87ab0ce)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ConeyLiu authored and cloud-fan committed Jun 16, 2017
    Commit: 9909be3
  3. [SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node.
    
    ## What changes were proposed in this pull request?
    
    As the name and the comments of `TreeNode.mapChildren` indicate, the function should apply only to the node's current children. So the following code should check whether the node being mapped is actually a child node.
    
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes apache#18284 from ConeyLiu/treenode.
    
    (cherry picked from commit 87ab0ce)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ConeyLiu authored and cloud-fan committed Jun 16, 2017
    Commit: 915a201
  4. [SPARK-21114][TEST][2.1] Fix test failure in Spark 2.1/2.0 due to name mismatch
    
    ## What changes were proposed in this pull request?
    Name mismatch between 2.1/2.0 and 2.2. Thus, the test cases failed after we backport a fix to 2.1/2.0. This PR is to fix the issue.
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18319 from gatorsmile/fixDecimal.
    gatorsmile authored and cloud-fan committed Jun 16, 2017
    Commit: 0ebb3b8
  5. [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s.deploy.master.MasterSuite.master correctly recover the application"
    
    ## What changes were proposed in this pull request?
    
    Due to asynchronous RPC event processing, the test "correctly recover the application" could potentially fail. The issue can be seen here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78126/testReport/org.apache.spark.deploy.master/MasterSuite/master_correctly_recover_the_application/.
    
    So this fixes that flaky test.
    
    ## How was this patch tested?
    
    Existing UT.
    
    CC cloud-fan jiangxb1987 , please help to review, thanks!
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes apache#18321 from jerryshao/SPARK-12552-followup.
    
    (cherry picked from commit 2837b14)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jerryshao authored and cloud-fan committed Jun 16, 2017
    Commit: 653e6f1
  6. [MINOR][DOCS] Improve Running R Tests docs

    ## What changes were proposed in this pull request?
    
    Update Running R Tests dependence packages to:
    ```bash
    R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
    ```
    
    ## How was this patch tested?
    manual tests
    
    Author: Yuming Wang <wgyumg@gmail.com>
    
    Closes apache#18271 from wangyum/building-spark.
    
    (cherry picked from commit 45824fb)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    wangyum authored and srowen committed Jun 16, 2017
    Commit: d3deeb3

Commits on Jun 18, 2017

  1. [SPARK-21126] The configuration which named "spark.core.connection.auth.wait.timeout" hasn't been used in spark
    
    [https://issues.apache.org/jira/browse/SPARK-21126](https://issues.apache.org/jira/browse/SPARK-21126)
    The configuration named "spark.core.connection.auth.wait.timeout" is not used anywhere in Spark, so I think it should be removed from configuration.md.
    
    Author: liuzhaokun <liu.zhaokun@zte.com.cn>
    
    Closes apache#18333 from liu-zhaokun/new3.
    
    (cherry picked from commit 0d8604b)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    liu-zhaokun authored and srowen committed Jun 18, 2017
    Commit: 8747f8e
  2. [MINOR][R] Add knitr and rmarkdown packages/improve output for version info in AppVeyor tests
    
    ## What changes were proposed in this pull request?
    
    This PR proposes three things as below:
    
    **Install packages per documentation** - to my knowledge this does not affect the tests themselves (only the CRAN checks, which we are not running via AppVeyor).
    
    This adds `knitr` and `rmarkdown` per https://github.com/apache/spark/blob/45824fb608930eb461e7df53bb678c9534c183a9/R/WINDOWS.md#unit-tests (please see apache@45824fb)
    
    **Improve logs/shorten logs** - actually, long logs can be a problem on AppVeyor (e.g., see apache#17873)
    
    `R -e ...` repeats printing R information for each invocation as below:
    
    ```
    R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: i386-w64-mingw32/i386 (32-bit)
    
    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.
    
      Natural language support but running in an English locale
    
    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.
    
    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.
    ```
    
    Reducing the number of calls seems slightly better, and printing the versions together is more readable.
    
    Before:
    
    ```
    # R information ...
    > packageVersion('testthat')
    [1] '1.0.2'
    >
    >
    
    # R information ...
    > packageVersion('e1071')
    [1] '1.6.8'
    >
    >
    ... 3 more times
    ```
    
    After:
    
    ```
    # R information ...
    > packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')
    [1] ‘1.16’
    [1] ‘1.6’
    [1] ‘1.0.2’
    [1] ‘1.6.8’
    [1] ‘2.41.3’
    ```
    
    **Add`appveyor.yml`/`dev/appveyor-install-dependencies.ps1` for triggering the test**
    
    Changing this file might break the test, e.g., apache#16927
    
    ## How was this patch tested?
    
    Before (please see https://ci.appveyor.com/project/HyukjinKwon/spark/build/169-master)
    After (please see the AppVeyor build in this PR):
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#18336 from HyukjinKwon/minor-add-knitr-and-rmarkdown.
    
    (cherry picked from commit 75a6d05)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    HyukjinKwon authored and srowen committed Jun 18, 2017
    Commit: c0d4acc

Commits on Jun 19, 2017

  1. [SPARK-21090][CORE] Optimize the unified memory manager code

    ## What changes were proposed in this pull request?
    1. In `acquireStorageMemory`, when the memory mode is OFF_HEAP, `maxOffHeapMemory` should be changed to `maxOffHeapStorageMemory`; after this PR it behaves the same as the ON_HEAP memory mode.
    Acquiring an amount between `maxOffHeapStorageMemory` and `maxOffHeapMemory` is certain to fail, so if the requested amount is greater than `maxOffHeapStorageMemory` (even if not greater than `maxOffHeapMemory`), we should fail fast.
    2. When borrowing memory from execution, changing `numBytes` to `numBytes - storagePool.memoryFree` is more reasonable.
    We only need to acquire `(numBytes - storagePool.memoryFree)`, so borrowing the full `numBytes` from execution is unnecessary.
    
    ## How was this patch tested?
    added unit test case
    
    Author: liuxian <liu.xian3@zte.com.cn>
    
    Closes apache#18296 from 10110346/wip-lx-0614.
    
    (cherry picked from commit 112bd9b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    10110346 authored and cloud-fan committed Jun 19, 2017
    Commit: d3c79b7
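    
    A toy paraphrase of the two changes described above; this is not the real UnifiedMemoryManager code, just two Long pools standing in for the storage and execution pools:
    
    ```scala
    final case class Pools(var storageFree: Long, var executionFree: Long)

    def acquireStorageMemory(pools: Pools, numBytes: Long, maxStorage: Long): Boolean = {
      // 1. Fail fast against the storage cap (the maxOffHeapStorageMemory analogue).
      if (numBytes > maxStorage) return false
      if (numBytes > pools.storageFree) {
        // 2. Borrow only the shortfall from execution, not the full request.
        val borrow = math.min(numBytes - pools.storageFree, pools.executionFree)
        pools.executionFree -= borrow
        pools.storageFree += borrow
      }
      if (numBytes <= pools.storageFree) { pools.storageFree -= numBytes; true } else false
    }
    ```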
  2. [SPARK-21132][SQL] DISTINCT modifier of function arguments should not be silently ignored
    
    ### What changes were proposed in this pull request?
    We should not silently ignore `DISTINCT` when they are not supported in the function arguments. This PR is to block these cases and issue the error messages.
    
    ### How was this patch tested?
    Added test cases for both regular functions and window functions
    
    Author: Xiao Li <gatorsmile@gmail.com>
    
    Closes apache#18340 from gatorsmile/firstCount.
    
    (cherry picked from commit 9413b84)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 19, 2017
    Commit: fab070c
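    
    An illustrative sketch of the new behaviour; which functions reject DISTINCT depends on the function, and `first` is assumed here to be one of the unsupported cases:
    
    ```scala
    spark.range(10).selectExpr("id % 3 AS k", "id AS v").createOrReplaceTempView("t")

    spark.sql("SELECT count(DISTINCT v) FROM t").show()         // DISTINCT is supported here
    spark.sql("SELECT k, first(DISTINCT v) FROM t GROUP BY k")   // expected to raise an error now
    ```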
  3. [SPARK-19688][STREAMING] Not to read spark.yarn.credentials.file from checkpoint.
    
    ## What changes were proposed in this pull request?
    
    Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.
    
    ## How was this patch tested?
    
    Manual tested with 1.6.3 and 2.1.1.
    I didn't test this with master because of some compile problems, but I think it will be the same result.
    
    ## Notice
    
    This should be merged into maintenance branches too.
    
    jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)
    
    Author: saturday_s <shi.indetail@gmail.com>
    
    Closes apache#18230 from saturday-shi/SPARK-21008.
    
    (cherry picked from commit e92ffe6)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    saturday_s authored and Marcelo Vanzin committed Jun 19, 2017
    Commit: f7fcdec
  4. [SPARK-19688][STREAMING] Not to read spark.yarn.credentials.file from checkpoint.
    
    ## What changes were proposed in this pull request?
    
    Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.
    
    ## How was this patch tested?
    
    Manual tested with 1.6.3 and 2.1.1.
    I didn't test this with master because of some compile problems, but I think it will be the same result.
    
    ## Notice
    
    This should be merged into maintenance branches too.
    
    jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)
    
    Author: saturday_s <shi.indetail@gmail.com>
    
    Closes apache#18230 from saturday-shi/SPARK-21008.
    
    (cherry picked from commit e92ffe6)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    saturday_s authored and Marcelo Vanzin committed Jun 19, 2017
    Commit: a44c118
  5. [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table
    
    ## What changes were proposed in this pull request?
    
    The description for several options of File Source for structured streaming appeared in the File Sink description instead.
    
    This pull request has two commits: The first includes changes to the version as it appeared in spark 2.1 and the second handled an additional option added for spark 2.2
    
    ## How was this patch tested?
    
    Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.
    
    The original documentation was written by tdas and lw-lin
    
    Author: assafmendelson <assaf.mendelson@gmail.com>
    
    Closes apache#18342 from assafmendelson/spark-21123.
    
    (cherry picked from commit 66a792c)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    assafmendelson authored and zsxwing committed Jun 19, 2017
    Commit: 7b50736
  6. Commit 32bd9a7 (commit message not shown)
  7. [MINOR][BUILD] Fix Java linter errors

    This PR cleans up a few Java linter errors for Apache Spark 2.2 release.
    
    ```bash
    $ dev/lint-java
    Using `mvn` from path: /usr/local/bin/mvn
    Checkstyle checks passed.
    ```
    
    We can check the result at Travis CI, [here](https://travis-ci.org/dongjoon-hyun/spark/builds/244297894).
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes apache#18345 from dongjoon-hyun/fix_lint_java_2.
    
    (cherry picked from commit ecc5631)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    dongjoon-hyun authored and srowen committed Jun 19, 2017
    Commit: e329bea
  8. [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
    
    ## What changes were proposed in this pull request?
    
    When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows:
    ```
    spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
    spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
    ```
    The staging dir cannot be deleted, and the following message is printed:
    ```
    java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
    ```
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: sharkdtu <sharkdtu@tencent.com>
    
    Closes apache#18352 from sharkdtu/master.
    
    (cherry picked from commit 3d4d11a)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    sharkdtu authored and Marcelo Vanzin committed Jun 19, 2017
    Commit: cf10fa8
  9. [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
    
    ## What changes were proposed in this pull request?
    
    When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows:
    ```
    spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
    spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
    ```
    The staging dir cannot be deleted, and the following message is printed:
    ```
    java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
    ```
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: sharkdtu <sharkdtu@tencent.com>
    
    Closes apache#18352 from sharkdtu/master.
    
    (cherry picked from commit 3d4d11a)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    sharkdtu authored and Marcelo Vanzin committed Jun 19, 2017
    Commit: 7799f35

Commits on Jun 20, 2017

  1. [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeExternal throws NPE
    
    ## What changes were proposed in this pull request?
    
    Fix HighlyCompressedMapStatus#writeExternal NPE:
    ```
    17/06/18 15:00:27 ERROR Utils: Exception encountered
    java.lang.NullPointerException
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
            at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
            at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
            at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
            at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
            at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
            at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
            at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
            at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
            at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
            at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
            at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
    java.io.IOException: java.lang.NullPointerException
            at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
            at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
            at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
            at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
            at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
            at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
            at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
            at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
            at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
            at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
            at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.NullPointerException
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
            at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
            ... 17 more
    17/06/18 15:00:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.17.47.20:50188
    17/06/18 15:00:27 ERROR Utils: Exception encountered
    java.lang.NullPointerException
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
            at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
            at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
            at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
            at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
            at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
            at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
            at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
            at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
            at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
            at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
            at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
            at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
            at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    ```
    
    ## How was this patch tested?
    
    manual tests
    
    Author: Yuming Wang <wgyumg@gmail.com>
    
    Closes apache#18343 from wangyum/SPARK-21133.
    
    (cherry picked from commit 9b57cd8)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wangyum authored and cloud-fan committed Jun 20, 2017
    Commit: 8bf7f1e
  2. [SPARK-20929][ML] LinearSVC should use its own threshold param

    ## What changes were proposed in this pull request?
    
    LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.
    
    ## How was this patch tested?
    
    New unit test to make sure the threshold can be set to any Double value.
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes apache#18151 from jkbradley/ml-2.2-linearsvc-cleanup.
    
    (cherry picked from commit cc67bd5)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    jkbradley committed Jun 20, 2017
    Commit: 514a7e6
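    
    A hedged sketch of the dedicated param; the training DataFrame is a placeholder:
    
    ```scala
    import org.apache.spark.ml.classification.LinearSVC

    // The LinearSVC threshold applies to rawPrediction, not to a probability,
    // so any Double value (including negative ones) is legal.
    val svc = new LinearSVC()
      .setMaxIter(10)
      .setRegParam(0.1)
      .setThreshold(-0.5)

    // val model = svc.fit(trainingDF)   // trainingDF: DataFrame with "features"/"label"
    ```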
  3. [SPARK-21150][SQL] Persistent view stored in Hive metastore should be case preserving
    
    ## What changes were proposed in this pull request?
    
    This is a regression in Spark 2.2. In Spark 2.2, we introduced a new way to resolve persisted view: https://issues.apache.org/jira/browse/SPARK-18209 , but this makes the persisted view non case-preserving because we store the schema in hive metastore directly. We should follow data source table and store schema in table properties.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#18360 from cloud-fan/view.
    
    (cherry picked from commit e862dc9)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Jun 20, 2017
    Commit: b8b80f6
  4. Commit 62e442e (commit message not shown)
  5. Commit e883498 (commit message not shown)
  6. [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table - version to fix 2.1
    
    ## What changes were proposed in this pull request?
    
    The description for several options of File Source for structured streaming appeared in the File Sink description instead.
    
    This commit continues on PR apache#18342 and targets the fixes for the documentation of version spark version 2.1
    
    ## How was this patch tested?
    
    Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.
    
    zsxwing This is the PR to fix version 2.1 as discussed in PR apache#18342
    
    Author: assafmendelson <assaf.mendelson@gmail.com>
    
    Closes apache#18363 from assafmendelson/spark-21123-for-spark2.1.
    assafmendelson authored and zsxwing committed Jun 20, 2017
    Commit: 8923bac

Commits on Jun 21, 2017

  1. [MINOR][DOCS] Add lost <tr> tag for configuration.md

    ## What changes were proposed in this pull request?
    
    Add lost `<tr>` tag for `configuration.md`.
    
    ## How was this patch tested?
    N/A
    
    Author: Yuming Wang <wgyumg@gmail.com>
    
    Closes apache#18372 from wangyum/docs-missing-tr.
    
    (cherry picked from commit 987eb8f)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    wangyum authored and srowen committed Jun 21, 2017
    Commit: 529c04f

Commits on Jun 22, 2017

  1. [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
    
    ## What changes were proposed in this pull request?
    
    This is a backport patch for Spark 2.1.x of the class splitting feature over excess generated code as was merged in apache#18075.
    
    ## How was this patch tested?
    
    The same test provided in apache#18075 is included in this patch.
    
    Author: ALeksander Eskilson <alek.eskilson@cerner.com>
    
    Closes apache#18354 from bdrillard/class_splitting_2.1.
    ALeksander Eskilson authored and cloud-fan committed Jun 22, 2017
    Commit: 6b37c86
  2. [SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation: Constant Pool Limit - Class Splitting
    
    ## What changes were proposed in this pull request?
    
    This is a backport patch for Spark 2.2.x of the class splitting feature over excess generated code as was merged in apache#18075.
    
    ## How was this patch tested?
    
    The same test provided in apache#18075 is included in this patch.
    
    Author: ALeksander Eskilson <alek.eskilson@cerner.com>
    
    Closes apache#18377 from bdrillard/class_splitting_2.2.
    ALeksander Eskilson authored and cloud-fan committed Jun 22, 2017
    Commit: 198e3a0
  3. [SPARK-21167][SS] Decode the path generated by File sink to handle special characters
    
    ## What changes were proposed in this pull request?
    
    Decode the path generated by File sink to handle special characters.
    
    ## How was this patch tested?
    
    The added unit test.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18381 from zsxwing/SPARK-21167.
    
    (cherry picked from commit d66b143)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Jun 22, 2017
    Commit: 6ef7a5b
  4. [SPARK-21167][SS] Decode the path generated by File sink to handle special characters
    
    ## What changes were proposed in this pull request?
    
    Decode the path generated by File sink to handle special characters.
    
    ## How was this patch tested?
    
    The added unit test.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18381 from zsxwing/SPARK-21167.
    
    (cherry picked from commit d66b143)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Jun 22, 2017
    Commit: 1a98d5d
  5. [SQL][DOC] Fix documentation of lpad

    ## What changes were proposed in this pull request?
    Fix incomplete documentation for `lpad`.
    
    Author: actuaryzhang <actuaryzhang10@gmail.com>
    
    Closes apache#18367 from actuaryzhang/SQLDoc.
    
    (cherry picked from commit 97b307c)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    actuaryzhang authored and srowen committed Jun 22, 2017
    Commit: d625734
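    
    For reference, lpad takes the input string, the target length and the pad string; a quick illustration with the output shown as a comment:
    
    ```scala
    spark.sql("SELECT lpad('hi', 5, '?') AS padded").show()
    // +------+
    // |padded|
    // +------+
    // | ???hi|
    // +------+
    ```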

Commits on Jun 23, 2017

  1. Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation: Constant Pool Limit - Class Splitting"
    
    This reverts commit 198e3a0.
    cloud-fan committed Jun 23, 2017
    Commit: b99c0e9
  2. [SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT [WIP]
    
    ### What changes were proposed in this pull request?
    
    The input query schema of INSERT AS SELECT could be changed after optimization. For example, the following query's output schema is changed by the rule `SimplifyCasts` and `RemoveRedundantAliases`.
    ```SQL
     SELECT word, length, cast(first as string) as first FROM view1
    ```
    
    This PR fixes the issue in Spark 2.2. Instead of using the analyzed plan of the input query, this PR uses its executed plan to determine the attributes in `FileFormatWriter`.
    
    The related issue in the master branch has been fixed by apache#18064. After this PR is merged, I will submit a separate PR to merge the test case to the master.
    
    ### How was this patch tested?
    Added a test case
    
    Author: Xiao Li <gatorsmile@gmail.com>
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18386 from gatorsmile/newRC5.
    gatorsmile authored and cloud-fan committed Jun 23, 2017
    Commit: b6749ba
  3. Commit 7b87527 (commit message not shown)
  4. [SPARK-21144][SQL] Print a warning if the data schema and partition schema have the duplicate columns
    
    ## What changes were proposed in this pull request?
    The current master outputs unexpected results when the data schema and partition schema have duplicate columns:
    ```
    withTempPath { dir =>
      val basePath = dir.getCanonicalPath
      spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
      spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
      spark.read.parquet(basePath).show()
    }
    
    +---+
    |foo|
    +---+
    |  1|
    |  1|
    |  a|
    |  a|
    |  1|
    |  a|
    +---+
    ```
    This patch adds code to print a warning when such duplication is found.
    
    ## How was this patch tested?
    Manually checked.
    
    Author: Takeshi Yamamuro <yamamuro@apache.org>
    
    Closes apache#18375 from maropu/SPARK-21144-3.
    
    (cherry picked from commit f3dea60)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    maropu authored and gatorsmile committed Jun 23, 2017
    Commit: 9d29808
  5. [SPARK-21181] Release byteBuffers to suppress netty error messages

    ## What changes were proposed in this pull request?
    We explicitly call release on the byteBufs used to encode the string to Base64, in order to suppress the memory-leak error message reported by netty. This makes the logs less confusing for the user.
    
    ### Changes proposed in this fix
    By explicitly invoking release on the byteBufs, we decrement the internal reference counts of the wrappedByteBufs. When the GC kicks in, these are reclaimed as before, except that netty no longer reports memory-leak error messages because the internal ref counts are now 0.
    
    ## How was this patch tested?
    Ran a few spark-applications and examined the logs. The error message no longer appears.
    
    Original PR was opened against branch-2.1 => apache#18392
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes apache#18407 from dhruve/master.
    
    (cherry picked from commit 1ebe7ff)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    dhruve authored and Marcelo Vanzin committed Jun 23, 2017
    Commit: f160267
  6. [SPARK-21181] Release byteBuffers to suppress netty error messages

    ## What changes were proposed in this pull request?
    We explicitly call release on the byteBufs used to encode the string to Base64, in order to suppress the memory-leak error message reported by netty. This makes the logs less confusing for the user.
    
    ### Changes proposed in this fix
    By explicitly invoking release on the byteBufs, we decrement the internal reference counts of the wrappedByteBufs. When the GC kicks in, these are reclaimed as before, except that netty no longer reports memory-leak error messages because the internal ref counts are now 0.
    
    ## How was this patch tested?
    Ran a few spark-applications and examined the logs. The error message no longer appears.
    
    Original PR was opened against branch-2.1 => apache#18392
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes apache#18407 from dhruve/master.
    
    (cherry picked from commit 1ebe7ff)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    dhruve authored and Marcelo Vanzin committed Jun 23, 2017
    Commit: f8fd3b4
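    
    A hedged illustration of the reference-counting idea using Netty's public ByteBuf API (not the actual Spark code path):
    
    ```scala
    import java.nio.charset.StandardCharsets
    import io.netty.buffer.Unpooled
    import io.netty.handler.codec.base64.Base64

    val raw     = Unpooled.wrappedBuffer("some-token".getBytes(StandardCharsets.UTF_8))
    val encoded = Base64.encode(raw)
    val asText  = encoded.toString(StandardCharsets.UTF_8)

    // Release both buffers so their ref counts drop to 0 and
    // netty's leak detector stays quiet.
    encoded.release()
    raw.release()
    ```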
  7. [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method

    ## What changes were proposed in this pull request?
    
    * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`.
    * Filled in some missing parentheses
    
    ## How was this patch tested?
    
    N/A
    
    Author: Ong Ming Yang <me@ongmingyang.com>
    
    Closes apache#18398 from ongmingyang/master.
    
    (cherry picked from commit 4cc6295)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    ongmingyang authored and gatorsmile committed Jun 23, 2017
    Commit: 3394b06
  8. [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method

    ## What changes were proposed in this pull request?
    
    * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`.
    * Filled in some missing parentheses
    
    ## How was this patch tested?
    
    N/A
    
    Author: Ong Ming Yang <me@ongmingyang.com>
    
    Closes apache#18398 from ongmingyang/master.
    
    (cherry picked from commit 4cc6295)
    Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    ongmingyang authored and gatorsmile committed Jun 23, 2017
    Commit: bcaf06c
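    
    A small sketch of the corrected call pattern; the column setup is illustrative:
    
    ```scala
    // The na functions live on df.na, not on df itself.
    val df = spark.range(5).selectExpr("IF(id = 2, NULL, id) AS id")

    df.na.drop().show()                  // drop rows containing nulls
    df.na.fill(Map("id" -> 0L)).show()   // replace nulls in "id" with 0
    ```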

Commits on Jun 24, 2017

  1. [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types…

    … in read path
    
    ## What changes were proposed in this pull request?
    
    This PR is to revert some code changes in the read path of apache#14377. The original fix is apache#17830
    
    When merging this PR, please give the credit to gaborfeher
    
    ## How was this patch tested?
    
    Added a test case to OracleIntegrationSuite.scala
    
    Author: Gabor Feher <gabor.feher@lynxanalytics.com>
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18408 from gatorsmile/OracleType.
    
    (cherry picked from commit b837bf9)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    Gabor Feher authored and gatorsmile committed Jun 24, 2017
    a3088d2
  2. [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types…

    … in read path
    
    This PR is to revert some code changes in the read path of apache#14377. The original fix is apache#17830
    
    When merging this PR, please give the credit to gaborfeher
    
    Added a test case to OracleIntegrationSuite.scala
    
    Author: Gabor Feher <gabor.feher@lynxanalytics.com>
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18408 from gatorsmile/OracleType.
    Gabor Feher authored and gatorsmile committed Jun 24, 2017
    f12883e
  3. [SPARK-21159][CORE] Don't try to connect to launcher in standalone cl…

    …uster mode.
    
    Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
    the same scheduler implementation is used, and if it tries to connect to the
    launcher it will fail. So fix the scheduler so it only tries that in client mode;
    cluster mode applications will be correctly launched and will work, but monitoring
    through the launcher handle will not be available.
    
    Tested by running a cluster mode app with "SparkLauncher.startApplication".
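    
    For reference, a hedged sketch of what such a launch looks like through the launcher API (the master URL, jar path, and main class below are placeholders):
    
    ```scala
    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
    
    // Placeholder values throughout; this just shows the API shape.
    val handle: SparkAppHandle = new SparkLauncher()
      .setMaster("spark://master-host:7077")   // standalone master
      .setDeployMode("cluster")
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MyApp")
      .startApplication()
    
    // The app launches and runs, but in standalone cluster mode state updates
    // through the handle are not reported (monitoring is not implemented).
    println(handle.getState)
    ```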
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#18397 from vanzin/SPARK-21159.
    
    (cherry picked from commit bfd73a7)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Marcelo Vanzin authored and cloud-fan committed Jun 24, 2017
    96c04f1
  4. [SPARK-21159][CORE] Don't try to connect to launcher in standalone cl…

    …uster mode.
    
    Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
    the same scheduler implementation is used, and if it tries to connect to the
    launcher it will fail. So fix the scheduler so it only tries that in client mode;
    cluster mode applications will be correctly launched and will work, but monitoring
    through the launcher handle will not be available.
    
    Tested by running a cluster mode app with "SparkLauncher.startApplication".
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#18397 from vanzin/SPARK-21159.
    
    (cherry picked from commit bfd73a7)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Marcelo Vanzin authored and cloud-fan committed Jun 24, 2017
    6750db3
  5. [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct

    ### What changes were proposed in this pull request?
    ```SQL
    CREATE TABLE `tab1`
    (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
    USING parquet
    
    INSERT INTO `tab1`
    SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
    
    SELECT custom_fields.id, custom_fields.value FROM tab1
    ```
    
    The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
    
    ### How was this patch tested?
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18412 from gatorsmile/castStruct.
    
    (cherry picked from commit 2e1586f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 24, 2017
    ad44ab5
  6. [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct

    ### What changes were proposed in this pull request?
    ```SQL
    CREATE TABLE `tab1`
    (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
    USING parquet
    
    INSERT INTO `tab1`
    SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
    
    SELECT custom_fields.id, custom_fields.value FROM tab1
    ```
    
    The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
    
    ### How was this patch tested?
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18412 from gatorsmile/castStruct.
    
    (cherry picked from commit 2e1586f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Jun 24, 2017
    0d6b701

Commits on Jun 25, 2017

  1. [SPARK-21079][SQL] Calculate total size of a partition table as a sum…

    … of individual partitions
    
    ## What changes were proposed in this pull request?
    
    The storage URI of a partitioned table may or may not point to a directory under which individual partitions are stored. In fact, individual partitions may be located in totally unrelated directories. Before this change, the ANALYZE TABLE table COMPUTE STATISTICS command calculated the total size of a table by adding up the sizes of files found under the table's storage URI. This calculation could produce 0 if partitions are stored elsewhere.
    
    This change uses storage URIs of individual partitions to calculate the sizes of all partitions of a table and adds these up to produce the total size of a table.
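    
    A sketch of the scenario (the table name and partition location are made up, and Hive support is assumed to be enabled):
    
    ```scala
    // Sketch: a partitioned table whose partition data lives outside the
    // table's root location. Statistics are now summed per partition URI.
    spark.sql("CREATE TABLE events (id BIGINT) PARTITIONED BY (dt STRING) STORED AS PARQUET")
    spark.sql("ALTER TABLE events ADD PARTITION (dt='2017-06-25') LOCATION '/data/elsewhere/dt=2017-06-25'")
    spark.sql("INSERT INTO events PARTITION (dt='2017-06-25') VALUES (1)")
    
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
    // sizeInBytes in the reported statistics should now reflect the sum of the
    // partition directories instead of 0.
    spark.sql("DESCRIBE EXTENDED events").show(truncate = false)
    ```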
    
    CC: wzhfy
    
    ## How was this patch tested?
    
    Added unit test.
    
    Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and verified that sizeInBytes is calculated correctly. Before this change, the size would be zero.
    
    Author: Masha Basmanova <mbasmanova@fb.com>
    
    Closes apache#18309 from mbasmanova/mbasmanova-analyze-part-table.
    
    (cherry picked from commit b449a1d)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    mbasmanova authored and gatorsmile committed Jun 25, 2017
    d8e3a4a
  2. Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Con…

    …stant Pool Limit - Class Splitting"
    
    This reverts commit 6b37c86.
    cloud-fan committed Jun 25, 2017
    26f4f34

Commits on Jun 26, 2017

  1. 61af209

Commits on Jun 27, 2017

  1. [SPARK-19104][SQL] Lambda variables in ExternalMapToCatalyst should b…

    …e global
    
    The issue happens in `ExternalMapToCatalyst`. For example, the following code creates `ExternalMapToCatalyst` to convert a Scala Map to the catalyst map format.
    
        val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
        val ds = spark.createDataset(data)
    
    The `valueConverter` in `ExternalMapToCatalyst` looks like:
    
        if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
    
    There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.
    
    Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can't be accessed anymore.
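    
    As a plain-Scala analogy (not the actual generated code), a value needed by a split-out method has to live on the instance rather than as a caller's local variable:
    
    ```scala
    // Analogy only: once logic is split into a separate method, anything it
    // needs must be a member ("global") field, not a local of the caller.
    class SplitCodegenAnalogy {
      private var value: Int = _                                   // the "global" variable
    
      private def buildStruct(): (Int, Int) = (value, value + 1)   // reads the field
    
      def convert(input: Int): (Int, Int) = {
        value = input   // a plain local here would be invisible to buildStruct()
        buildStruct()
      }
    }
    // new SplitCodegenAnalogy().convert(100)  // -> (100, 101)
    ```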
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#18418 from viirya/SPARK-19104.
    
    (cherry picked from commit fd8c931)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Jun 27, 2017
    970f68c
  2. treating empty string as null for csd

    ianlcsd authored and markhamstra committed Jun 27, 2017
    8fdc51b
  3. f3c40d5

Commits on Jun 29, 2017

  1. [SPARK-21210][DOC][ML] Javadoc 8 fixes for ML shared param traits

    PR apache#15999 included fixes for doc strings in the ML shared param traits (occurrences of `>` and `>=`).
    
    This PR simply uses the HTML-escaped version of the param doc to embed into the Scaladoc, to ensure that when `SharedParamsCodeGen` is run, the generated javadoc will be compliant for Java 8.
    
    ## How was this patch tested?
    Existing tests
    
    Author: Nick Pentreath <nickp@za.ibm.com>
    
    Closes apache#18420 from MLnick/shared-params-javadoc8.
    
    (cherry picked from commit 70085e8)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Nick Pentreath authored and srowen committed Jun 29, 2017
    17a04b9

Commits on Jun 30, 2017

  1. [SPARK-21253][CORE] Fix a bug that StreamCallback may not be notified…

    … if network errors happen
    
    ## What changes were proposed in this pull request?
    
    If a network error happens before processing StreamResponse/StreamFailure events, StreamCallback.onFailure won't be called.
    
    This PR fixes `failOutstandingRequests` to also notify outstanding StreamCallbacks.
    
    ## How was this patch tested?
    
    The new unit tests.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18472 from zsxwing/fix-stream-2.
    
    (cherry picked from commit 4996c53)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    zsxwing authored and cloud-fan committed Jun 30, 2017
    20cf511
  2. [SPARK-21253][CORE] Disable spark.reducer.maxReqSizeShuffleToMem

    Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old shuffle service.
    
    Credits to wangyum
    
    Closes apache#18466
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    Author: Yuming Wang <wgyumg@gmail.com>
    
    Closes apache#18467 from zsxwing/SPARK-21253.
    
    (cherry picked from commit 80f7ac3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    zsxwing authored and cloud-fan committed Jun 30, 2017
    8de67e3
  3. [SPARK-21176][WEB UI] Limit number of selector threads for admin ui p…

    …roxy servlets to 8
    
    ## What changes were proposed in this pull request?
    Please see also https://issues.apache.org/jira/browse/SPARK-21176
    
    This change limits the number of selector threads that Jetty creates to a maximum of 8 per proxy servlet (the Jetty default is the number of processors / 2).
    The newHttpClient method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers).
    Once jetty/jetty.project#1643 is available, the code could be cleaned up to avoid the method override.
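    
    Roughly the shape of the override against the Jetty 9 API (a sketch, not the exact code in this patch):
    
    ```scala
    import org.eclipse.jetty.client.HttpClient
    import org.eclipse.jetty.client.http.HttpClientTransportOverHTTP
    import org.eclipse.jetty.proxy.ProxyServlet
    
    // Sketch: cap the proxy's selector threads instead of using Jetty's
    // default of (number of processors / 2).
    class LimitedSelectorProxyServlet extends ProxyServlet {
      override protected def newHttpClient(): HttpClient = {
        val selectors = math.max(1, math.min(8, Runtime.getRuntime.availableProcessors() / 2))
        new HttpClient(new HttpClientTransportOverHTTP(selectors), null)
      }
    }
    ```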
    
    I really need this on v2.1.1 - what is the best way to backport it (automatic merge works fine)? Shall I create another PR?
    
    ## How was this patch tested?
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.
    
    gurvindersingh zsxwing can you please review the change?
    
    Author: IngoSchuster <ingo.schuster@de.ibm.com>
    Author: Ingo Schuster <ingo.schuster@de.ibm.com>
    
    Closes apache#18437 from IngoSchuster/master.
    
    (cherry picked from commit 88a536b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    IngoSchuster authored and cloud-fan committed Jun 30, 2017
    c6ba647
  4. [SPARK-21176][WEB UI] Limit number of selector threads for admin ui p…

    …roxy servlets to 8
    
    ## What changes were proposed in this pull request?
    Please see also https://issues.apache.org/jira/browse/SPARK-21176
    
    This change limits the number of selector threads that Jetty creates to a maximum of 8 per proxy servlet (the Jetty default is the number of processors / 2).
    The newHttpClient method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers).
    Once jetty/jetty.project#1643 is available, the code could be cleaned up to avoid the method override.
    
    I really need this on v2.1.1 - what is the best way to backport it (automatic merge works fine)? Shall I create another PR?
    
    ## How was this patch tested?
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.
    
    gurvindersingh zsxwing can you please review the change?
    
    Author: IngoSchuster <ingo.schuster@de.ibm.com>
    Author: Ingo Schuster <ingo.schuster@de.ibm.com>
    
    Closes apache#18437 from IngoSchuster/master.
    
    (cherry picked from commit 88a536b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    IngoSchuster authored and cloud-fan committed Jun 30, 2017
    083adb0
  5. [SPARK-21253][CORE][HOTFIX] Fix Scala 2.10 build

    ## What changes were proposed in this pull request?
    
    A follow up PR to fix Scala 2.10 build for apache#18472
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18478 from zsxwing/SPARK-21253-2.
    
    (cherry picked from commit cfc696f)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Jun 30, 2017
    d16e262
  6. [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spi…

    …lling
    
    ## What changes were proposed in this pull request?
    `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) is reused for other things. This could happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
    
    This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects pointed to their own dedicated memory location. This was changed by apache#16909; after that PR Spark spills more eagerly.
    
    This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
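    
    A hedged sketch of the kind of query affected (the spill threshold is lowered artificially just to force spilling; the config key is assumed to be available in this branch):
    
    ```scala
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{first, last}
    
    // Force the window buffer to spill almost immediately (illustrative only).
    spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "1")
    
    // FIRST/LAST of a string column: these preserve their input, so a reused
    // spill-read buffer could corrupt the returned values before this fix.
    val df = spark.range(0, 1000).selectExpr("id % 10 AS key", "cast(id AS string) AS str")
    val w = Window.partitionBy("key").orderBy("str")
    df.select(first("str").over(w), last("str").over(w)).show(5)
    ```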
    
    ## How was this patch tested?
    Added a regression test to `DataFrameWindowFunctionsSuite`.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes apache#18470 from hvanhovell/SPARK-21258.
    
    (cherry picked from commit e2f32ee)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    hvanhovell authored and cloud-fan committed Jun 30, 2017
    8b08fd0
  7. [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spi…

    …lling
    
    ## What changes were proposed in this pull request?
    `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) is reused for other things. This could happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
    
    This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects pointed to their own dedicated memory location. This was changed by apache#16909; after that PR Spark spills more eagerly.
    
    This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
    
    ## How was this patch tested?
    Added a regression test to `DataFrameWindowFunctionsSuite`.
    
    Author: Herman van Hovell <hvanhovell@databricks.com>
    
    Closes apache#18470 from hvanhovell/SPARK-21258.
    
    (cherry picked from commit e2f32ee)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    hvanhovell authored and cloud-fan committed Jun 30, 2017
    d995dac
  8. Revert "[SPARK-21258][SQL] Fix WindowExec complex object aggregation …

    …with spilling"
    
    This reverts commit d995dac.
    cloud-fan committed Jun 30, 2017
    3ecef24
  9. [SPARK-21129][SQL] Arguments of SQL function call should not be named…

    … expressions
    
    ### What changes were proposed in this pull request?
    
    Function arguments should not be named expressions. This could cause two issues:
    - Misleading error messages
    - Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser.
    
    ```
    spark-sql> select count(distinct c1, distinct c2) from t1;
    Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26;
    'Project [unresolvedalias('count(c1#30, 'distinct), None)]
    +- SubqueryAlias t1
       +- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31]
    ```
    
    After the fix, the error message becomes
    ```
    spark-sql> select count(distinct c1, distinct c2) from t1;
    Error in query:
    extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35)
    
    == SQL ==
    select count(distinct c1, distinct c2) from t1
    -----------------------------------^^^
    ```
    
    ### How was this patch tested?
    Added a test case to parser suite.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18338 from gatorsmile/parserDistinctAggFunc.
    
    (cherry picked from commit eed9c4e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Jun 30, 2017
    29a0be2
  10. a2c7b21
  11. 85fddf4

Commits on Jul 1, 2017

  1. [SPARK-21170][CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throw…

    …s IllegalArgumentException: Self-suppression not permitted
    
    ## What changes were proposed in this pull request?
    
    Do not add the exception to the suppressed list if it is the same instance as originalThrowable.
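    
    The core of the guard, sketched with an illustrative helper (not the actual Utils method): Throwable.addSuppressed rejects being handed the same instance.
    
    ```scala
    // Sketch only: run a cleanup block and attach any *different* failure to
    // the original one; suppressing the same instance would itself throw
    // IllegalArgumentException("Self-suppression not permitted").
    def runCleanup(originalThrowable: Throwable)(cleanup: => Unit): Unit = {
      try {
        cleanup
      } catch {
        case t: Throwable if t ne originalThrowable =>
          originalThrowable.addSuppressed(t)   // safe: different instance
        case _: Throwable =>
          ()                                   // same instance: skip addSuppressed
      }
    }
    ```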
    
    ## How was this patch tested?
    
    Added new tests to verify this, these tests fail without source code changes and passes with the change.
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes apache#18384 from devaraj-kavali/SPARK-21170.
    
    (cherry picked from commit 6beca9c)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Devaraj K authored and srowen committed Jul 1, 2017
    6fd39ea

Commits on Jul 4, 2017

  1. [SPARK-20256][SQL] SessionState should be created more lazily

    ## What changes were proposed in this pull request?
    
    `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
    
    This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is the following: users can start `spark-shell` and use RDD operations without any problems.
    
    **BEFORE**
    ```scala
    $ bin/spark-shell
    java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
    ...
    Caused by: org.apache.spark.sql.AnalysisException:
        org.apache.hadoop.hive.ql.metadata.HiveException:
           MetaException(message:java.security.AccessControlException:
              Permission denied: user=spark, access=READ,
                 inode="/apps/hive/warehouse":hive:hdfs:drwx------
    ```
    As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user.
    
    **AFTER**
    ```scala
    $ bin/spark-shell
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sc.range(0, 10, 1).count()
    res0: Long = 10
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    This closes apache#18512 .
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes apache#18501 from dongjoon-hyun/SPARK-20256.
    
    (cherry picked from commit 1b50e0e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    dongjoon-hyun authored and gatorsmile committed Jul 4, 2017
    db21b67

Commits on Jul 5, 2017

  1. [SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more la…

    …zily
    
    ## What changes were proposed in this pull request?
    
    `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
    
    This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is the following: users can start `spark-shell` and use RDD operations without any problems.
    
    **BEFORE**
    ```scala
    $ bin/spark-shell
    java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
    ...
    Caused by: org.apache.spark.sql.AnalysisException:
        org.apache.hadoop.hive.ql.metadata.HiveException:
           MetaException(message:java.security.AccessControlException:
              Permission denied: user=spark, access=READ,
                 inode="/apps/hive/warehouse":hive:hdfs:drwx------
    ```
    As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user.
    
    **AFTER**
    ```scala
    $ bin/spark-shell
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.1.2-SNAPSHOT
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> sc.range(0, 10, 1).count()
    res0: Long = 10
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes apache#18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1.
    dongjoon-hyun authored and cloud-fan committed Jul 5, 2017
    8f1ca69
  2. [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key pr…

    …ior to converting to internal value.
    
    ## What changes were proposed in this pull request?
    
    `ExternalMapToCatalyst` should null-check map key prior to converting to internal value to throw an appropriate Exception instead of something like NPE.
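    
    A hedged illustration (assumes a SparkSession named `spark`): a Dataset built from a map with a null key should now fail with a clear error rather than an NPE.
    
    ```scala
    import spark.implicits._
    
    // A Scala Map with a null key; converting it goes through
    // ExternalMapToCatalyst. With this change the null key is rejected with a
    // meaningful RuntimeException instead of a NullPointerException.
    val withNullKey: Map[String, Int] = Map("a" -> 1, (null: String) -> 2)
    val ds = Seq(("row1", withNullKey)).toDS()   // conversion fails: map keys cannot be null
    ```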
    
    ## How was this patch tested?
    
    Added a test and existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes apache#18524 from ueshin/issues/SPARK-21300.
    
    (cherry picked from commit ce10545)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ueshin authored and cloud-fan committed Jul 5, 2017
    770fd2a
  3. 4a4d148

Commits on Jul 6, 2017

  1. [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream

    ## What changes were proposed in this pull request?
    
    Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.
    
    ## How was this patch tested?
    
    Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
    
    Author: Sumedh Wale <swale@snappydata.io>
    
    Closes apache#18535 from sumwale/SPARK-21312.
    
    (cherry picked from commit 14a3bb3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Sumedh Wale authored and cloud-fan committed Jul 6, 2017
    6e1081c
  2. [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream

    ## What changes were proposed in this pull request?
    
    Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.
    
    ## How was this patch tested?
    
    Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
    
    Author: Sumedh Wale <swale@snappydata.io>
    
    Closes apache#18535 from sumwale/SPARK-21312.
    
    (cherry picked from commit 14a3bb3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Sumedh Wale authored and cloud-fan committed Jul 6, 2017
    7f7b63b
  3. [SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp check…

    …point dir should be deleted
    
    ## What changes were proposed in this pull request?
    
    Stopping a query while it is being initialized can throw an interrupt exception, in which case temporary checkpoint directories will not be deleted, and the test will fail.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#18442 from tdas/DatastreamReaderWriterSuite-fix.
    
    (cherry picked from commit 60043f2)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    tdas committed Jul 6, 2017
    4e53a4e

Commits on Jul 7, 2017

  1. [SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

    ## What changes were proposed in this pull request?
    
    Few changes to the Structured Streaming documentation
    - Clarify that the entire stream input table is not materialized
    - Add information for Ganglia
    - Add Kafka Sink to the main docs
    - Removed a couple of leftover experimental tags
    - Added more associated reading material and talk videos.
    
    In addition, apache#16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those links (cc sameeragarwal, cloud-fan).
    - Added a redirection to avoid breaking internal and possible external links.
    - Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#18485 from tdas/SPARK-21267.
    
    (cherry picked from commit 0217dfd)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    tdas authored and zsxwing committed Jul 7, 2017
    576fd4c
  2. 6e33965
  3. 3f914aa

Commits on Jul 8, 2017

  1. [SPARK-21069][SS][DOCS] Add rate source to programming guide.

    ## What changes were proposed in this pull request?
    
    SPARK-20979 added a new structured streaming source: the Rate source. This patch adds the corresponding documentation to the programming guide.
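    
    For context, a small example of the rate source (not taken from the docs patch itself):
    
    ```scala
    // The rate source emits rows of (timestamp, value) at a configurable rate;
    // it is handy for testing and benchmarking streaming queries.
    val rates = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
    
    val query = rates.writeStream
      .format("console")
      .outputMode("append")
      .start()
    // query.awaitTermination()
    ```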
    
    ## How was this patch tested?
    
    Tested by running jekyll locally.
    
    Author: Prashant Sharma <prashant@apache.org>
    Author: Prashant Sharma <prashsh1@in.ibm.com>
    
    Closes apache#18562 from ScrapCodes/spark-21069/rate-source-docs.
    
    (cherry picked from commit d0bfc67)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    ScrapCodes authored and zsxwing committed Jul 8, 2017
    ab12848
  2. [SPARK-21228][SQL][BRANCH-2.2] InSet incorrect handling of structs

    ## What changes were proposed in this pull request?
    
    This is a backport of apache#18455.
    When the data type is a struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.
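    
    A hedged sketch of the kind of predicate affected (the inline-table syntax and struct field names are chosen so the types line up; assumes a SparkSession named `spark`):
    
    ```scala
    // Membership test on struct values: with the fix, equal struct values are
    // matched via an interpreted ordering rather than object equality.
    spark.sql("""
      SELECT id
      FROM VALUES (1, 'a'), (2, 'b') AS t(id, name)
      WHERE struct(id, name) IN (named_struct('id', 1, 'name', 'a'),
                                 named_struct('id', 3, 'name', 'c'))
    """).show()
    // Expected: only the row with id = 1
    ```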
    
    ## How was this patch tested?
    New test in SQLQuerySuite.
    
    Author: Bogdan Raducanu <bogdan@databricks.com>
    
    Closes apache#18563 from bogdanrdc/SPARK-21228-BRANCH2.2.
    bogdanrdc authored and cloud-fan committed Jul 8, 2017
    7d0b1c9
  3. [SPARK-21345][SQL][TEST][TEST-MAVEN] SparkSessionBuilderSuite should …

    …clean up stopped sessions.
    
    `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfering with other test suites using `ShardSQLContext`.
    
    Recently, the master branch has been failing consecutively.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
    
    Passes Jenkins with an updated suite.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes apache#18567 from dongjoon-hyun/SPARK-SESSION.
    
    (cherry picked from commit 0b8dd2d)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dongjoon-hyun authored and cloud-fan committed Jul 8, 2017
    a64f108
  4. [SPARK-20342][CORE] Update task accumulators before sending task end …

    …event.
    
    This makes sure that listeners get updated task information; otherwise it's
    possible to write incomplete task information into event logs, for example,
    making the information in a replayed UI inconsistent with the original
    application.
    
    Added a new unit test to try to detect the problem; it's not guaranteed
    to fail since it's a race, but it fails pretty reliably for me without the
    scheduler changes.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#18393 from vanzin/SPARK-20342.try2.
    
    (cherry picked from commit 9131bdb)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Marcelo Vanzin authored and cloud-fan committed Jul 8, 2017
    c8d7855
  5. [SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffle…

    …ToMem.
    
    ## What changes were proposed in this pull request?
    
    In the current code, the reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine the documentation.
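    
    For illustration only (the threshold value, master, and app name below are made up), the setting is applied like any other Spark conf:
    
    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    
    // Stream shuffle blocks bigger than ~200 MB to disk instead of memory.
    // Leave this at its default if the cluster still runs the old external
    // shuffle service, which does not understand streamed fetch requests.
    val conf = new SparkConf()
      .set("spark.reducer.maxReqSizeShuffleToMem", (200L * 1024 * 1024).toString)
    
    val spark = SparkSession.builder()
      .master("local[*]")                 // placeholder
      .appName("shuffle-to-mem-demo")     // placeholder
      .config(conf)
      .getOrCreate()
    ```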
    
    Author: jinxing <jinxing6042@126.com>
    
    Closes apache#18566 from jinxing64/SPARK-21343.
    
    (cherry picked from commit 062c336)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jinxing authored and cloud-fan committed Jul 8, 2017
    964332b

Commits on Jul 9, 2017

  1. [SPARK-21345][SQL][TEST][TEST-MAVEN][BRANCH-2.1] SparkSessionBuilderS…

    …uite should clean up stopped sessions.
    
    ## What changes were proposed in this pull request?
    
    `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfering with other test suites using `ShardSQLContext`.
    
    Recently, the master branch has been failing consecutively.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
    
    ## How was this patch tested?
    
    Passes Jenkins with an updated suite.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes apache#18572 from dongjoon-hyun/SPARK-21345-BRANCH-2.1.
    dongjoon-hyun authored and cloud-fan committed Jul 9, 2017
    5e2bfd5
  2. [SPARK-21083][SQL][BRANCH-2.2] Store zero size and row count when ana…

    …lyzing empty table
    
    ## What changes were proposed in this pull request?
    
    We should be able to store zero size and row count after analyzing an empty table.
    This is a backport for apache@9fccc36.
    
    ## How was this patch tested?
    
    Added new test.
    
    Author: Zhenhua Wang <wangzhenhua@huawei.com>
    
    Closes apache#18575 from wzhfy/analyzeEmptyTable-2.2.
    wzhfy authored and cloud-fan committed Jul 9, 2017
    3bfad9d

Commits on Jul 10, 2017

  1. [SPARK-21083][SQL][BRANCH-2.1] Store zero size and row count when ana…

    …lyzing empty table
    
    ## What changes were proposed in this pull request?
    
    We should be able to store zero size and row count after analyzing an empty table.
    This is a backport for apache@9fccc36.
    
    ## How was this patch tested?
    
    Added new test.
    
    Author: Zhenhua Wang <wzh_zju@163.com>
    
    Closes apache#18577 from wzhfy/analyzeEmptyTable-2.1.
    wzhfy authored and cloud-fan committed Jul 10, 2017
    2c28462
  2. [SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFet…

    …cher.
    
    When `RetryingBlockFetcher` retries fetching blocks, two `DownloadCallback`s could download the same content to the same target file, which could cause `ShuffleBlockFetcherIterator` to read a partial result.
    
    This PR proposes to create and delete the tmp files in `OneForOneBlockFetcher`.
    
    Author: jinxing <jinxing6042@126.com>
    Author: Shixiong Zhu <zsxwing@gmail.com>
    
    Closes apache#18565 from jinxing64/SPARK-21342.
    
    (cherry picked from commit 6a06c4b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jinxing authored and cloud-fan committed Jul 10, 2017
    40fd0ce
  3. [SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows

    ## What changes were proposed in this pull request?
    
    Updating the numOutputRows metric was missing from one return path of the LeftAnti SortMergeJoin.
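    
    For reference, a hedged sketch of a left-anti join planned as a SortMergeJoin (broadcast is disabled here only to force that plan); its "number of output rows" metric in the SQL UI is what this fixes:
    
    ```scala
    // Disable broadcast joins so the equi-join below becomes a SortMergeJoin.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    
    val left  = spark.range(0, 1000).toDF("id")
    val right = spark.range(0, 500).toDF("id")
    
    // Rows of `left` with no match in `right`; 500 rows flow out of the join.
    val onlyInLeft = left.join(right, Seq("id"), "left_anti")
    println(onlyInLeft.count())
    ```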
    
    ## How was this patch tested?
    
    Non-zero output rows manually seen in metrics.
    
    Author: Juliusz Sompolski <julek@databricks.com>
    
    Closes apache#18494 from juliuszsompolski/SPARK-21272.
    juliuszsompolski authored and gatorsmile committed Jul 10, 2017
    a05edf4
  4. foo

    markhamstra committed Jul 10, 2017
    73df649

Commits on Jul 11, 2017

  1. [SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-*

    ## What changes were proposed in this pull request?
    
    Remove all usages of Scala Tuple2 from common/network-* projects. Otherwise, Yarn users cannot use `spark.reducer.maxReqSizeShuffleToMem`.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18593 from zsxwing/SPARK-21369.
    
    (cherry picked from commit 833eab2)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    zsxwing authored and cloud-fan committed Jul 11, 2017
    edcd9fb
  2. [SPARK-21366][SQL][TEST] Add sql test for window functions

    ## What changes were proposed in this pull request?
    
    Add a SQL test for window functions, and remove unnecessary test cases in `WindowQuerySuite`.
    
    ## How was this patch tested?
    
    Added `window.sql` and the corresponding output file.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes apache#18591 from jiangxb1987/window.
    
    (cherry picked from commit 66d2168)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jiangxb1987 authored and cloud-fan committed Jul 11, 2017
    399aa01

Commits on Jul 12, 2017

  1. [SPARK-21219][CORE] Task retry occurs on same executor due to race co…

    …ndition with blacklisting
    
    There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state being updated (updateBlacklistForFailedTask); as a result, the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue in https://issues.apache.org/jira/browse/SPARK-21219.
    
    The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.
    
    Implemented a unit test that verifies the task is blacklisted before it is added to the pending task list. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Eric Vandenberg <ericvandenbergfb.com>
    
    Closes apache#18427 from ericvandenbergfb/blacklistFix.
    
    ## What changes were proposed in this pull request?
    
    This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.
    
    ## How was this patch tested?
    
    Ran TaskSetManagerSuite tests locally.
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes apache#18604 from jsoltren/branch-2.2.
    Eric Vandenberg authored and cloud-fan committed Jul 12, 2017
    cb6fc89

Commits on Jul 13, 2017

  1. [SPARK-18646][REPL] Set parent classloader as null for ExecutorClassL…

    …oader
    
    ## What changes were proposed in this pull request?
    
    `ClassLoader` will preferentially load classes from its `parent`. Only when `parent` is null or the parent fails to load the class will it call the overridden `findClass` function. To avoid potential issues caused by loading a class with an inappropriate class loader, we should set the `parent` of the `ClassLoader` to null, so that we can fully control which class loader is used.
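    
    A plain illustration of this delegation rule (not Spark's ExecutorClassLoader itself):
    
    ```scala
    // With a null parent, loadClass only consults the bootstrap loader before
    // falling back to our findClass, so we fully control non-JDK class loading.
    class MyLoader(classBytes: Map[String, Array[Byte]])
      extends ClassLoader(null) {   // null parent: skip the application loader
    
      override protected def findClass(name: String): Class[_] =
        classBytes.get(name) match {
          case Some(bytes) => defineClass(name, bytes, 0, bytes.length)
          case None        => throw new ClassNotFoundException(name)
        }
    }
    ```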
    
    This is a take-over of apache#17074; the primary author of this PR is taroplus.
    
    Should close apache#17074 after this PR gets merged.
    
    ## How was this patch tested?
    
    Add test case in `ExecutorClassLoaderSuite`.
    
    Author: Kohki Nishio <taroplus@me.com>
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes apache#18614 from jiangxb1987/executor_classloader.
    
    (cherry picked from commit e08d06b)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    taroplus authored and cloud-fan committed Jul 13, 2017
    39eba30
  2. Revert "[SPARK-18646][REPL] Set parent classloader as null for Execut…

    …orClassLoader"
    
    This reverts commit 39eba30.
    cloud-fan committed Jul 13, 2017
    cf0719b

Commits on Jul 14, 2017

  1. [SPARK-21376][YARN] Fix yarn client token expire issue when cleaning …

    …the staging files in long running scenario
    
    ## What changes were proposed in this pull request?
    
    This issue happens in long-running applications with yarn cluster mode, because yarn#client doesn't sync the token with the AM and so always keeps the initial token. That token may expire in a long-running scenario, so when yarn#client tries to clean up the staging directory after the application finishes, it uses the expired token and hits a token-expiry error.
    
    ## How was this patch tested?
    
    Manual verification in a secure cluster.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes apache#18617 from jerryshao/SPARK-21376.
    
    (cherry picked from commit cb8d5cc)
    jerryshao authored and Marcelo Vanzin committed Jul 14, 2017
    bfe3ba8

Commits on Jul 15, 2017

  1. [SPARK-21344][SQL] BinaryType comparison does signed byte array compa…

    …rison
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a wrong comparison for `BinaryType`. It enables unsigned comparison and unsigned prefix generation for arrays for `BinaryType`. Previous implementations used signed operations.
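    
    A small standalone example of why signed byte comparison misorders binary values:
    
    ```scala
    // The byte 0xFF is -1 when treated as signed, so a signed comparison would
    // sort Array(0xFF) *before* Array(0x01) even though 255 > 1 unsigned.
    val a = Array(0xFF.toByte)   // unsigned value 255
    val b = Array(0x01.toByte)   // unsigned value 1
    
    val signed   = java.lang.Byte.compare(a(0), b(0))                 // negative: wrong order
    val unsigned = java.lang.Byte.toUnsignedInt(a(0)) -
                   java.lang.Byte.toUnsignedInt(b(0))                 // positive: correct order
    println(s"signed=$signed unsigned=$unsigned")
    ```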
    
    ## How was this patch tested?
    
    Added a test suite in `OrderingSuite`.
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes apache#18571 from kiszk/SPARK-21344.
    
    (cherry picked from commit ac5d5d7)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    kiszk authored and gatorsmile committed Jul 15, 2017
    1cb4369
  2. [SPARK-21344][SQL] BinaryType comparison does signed byte array compa…

    …rison
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a wrong comparison for `BinaryType`. It enables unsigned comparison and unsigned prefix generation for arrays for `BinaryType`. Previous implementations used signed operations.
    
    ## How was this patch tested?
    
    Added a test suite in `OrderingSuite`.
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes apache#18571 from kiszk/SPARK-21344.
    
    (cherry picked from commit ac5d5d7)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    kiszk authored and gatorsmile committed Jul 15, 2017
    ca4d2aa
  3. [SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming…

    …-guide redirector
    
    ## What changes were proposed in this pull request?
    
    Update internal references from programming-guide to rdd-programming-guide
    
    See apache/spark-website@5ddf243 and apache#18485 (comment)
    
    Let's keep the redirector even if it's problematic to build, but not rely on it internally.
    
    ## How was this patch tested?
    
    (Doc build)
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#18625 from srowen/SPARK-21267.2.
    
    (cherry picked from commit 74ac1fb)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Jul 15, 2017
    8e85ce6

Commits on Jul 17, 2017

  1. [SPARK-21321][SPARK CORE] Spark very verbose on shutdown

    ## What changes were proposed in this pull request?
    
    The current code is very verbose on shutdown.
    
    The change I propose is to lower the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException).
    
    ## How was this patch tested?
    
    Tested with word count(deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB of data.
    
    Author: John Lee <jlee2@yahoo-inc.com>
    
    Closes apache#18547 from yoonlee95/SPARK-21321.
    
    (cherry picked from commit 0e07a29)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    John Lee authored and Tom Graves committed Jul 17, 2017
    0ef98fd

Commits on Jul 18, 2017

  1. [SPARK-19104][BACKPORT-2.1][SQL] Lambda variables in ExternalMapToCat…

    …alyst should be global
    
    ## What changes were proposed in this pull request?
    
    This PR is a backport of apache#18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1.
    
    The issue happens in `ExternalMapToCatalyst`. For example, the following code creates `ExternalMapToCatalyst` to convert a Scala Map to the catalyst map format.
    
    ```
    val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
    val ds = spark.createDataset(data)
    ```
    The `valueConverter` in `ExternalMapToCatalyst` looks like:
    
    ```
    if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
    ```
    There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.
    
    Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can't be accessed anymore.
    
    ## How was this patch tested?
    
    Added a new test suite into `DatasetPrimitiveSuite`
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes apache#18627 from kiszk/SPARK-21391.
    kiszk authored and cloud-fan committed Jul 18, 2017
    a9efce4
  2. [SPARK-21332][SQL] Incorrect result type inferred for some decimal ex…

    …pressions
    
    ## What changes were proposed in this pull request?
    
    This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:
    
    ```
        val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
        val sc = spark.sparkContext
        val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
        val df = spark.createDataFrame(rdd, inputSchema)
    
        // Works correctly since no nested decimal expression is involved
        // Expected result type: (26, 6) * (26, 6) = (38, 12)
        df.select($"col" * $"col").explain(true)
        df.select($"col" * $"col").printSchema()
    
        // Gives a wrong result since there is a nested decimal expression that should be visited first
        // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
        df.select($"col" * $"col" * $"col").explain(true)
        df.select($"col" * $"col" * $"col").printSchema()
    ```
    
    The example above gives the following output:
    
    ```
    // Correct result without sub-expressions
    == Parsed Logical Plan ==
    'Project [('col * 'col) AS (col * col)alteryx#4]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    (col * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)alteryx#4]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)alteryx#4]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)alteryx#4]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- (col * col): decimal(38,12) (nullable = true)
    
    // Incorrect result with sub-expressions
    == Parsed Logical Plan ==
    'Project [(('col * 'col) * 'col) AS ((col * col) * col)alteryx#11]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    ((col * col) * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- ((col * col) * col): decimal(38,12) (nullable = true)
    ```
    
    ## How was this patch tested?
    
    This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes apache#18583 from aokolnychyi/spark-21332.
    
    (cherry picked from commit 0be5fb4)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    aokolnychyi authored and gatorsmile committed Jul 18, 2017
    83bdb04
  3. [SPARK-21332][SQL] Incorrect result type inferred for some decimal ex…

    …pressions
    
    ## What changes were proposed in this pull request?
    
    This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:
    
    ```
        val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
        val sc = spark.sparkContext
        val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
        val df = spark.createDataFrame(rdd, inputSchema)
    
        // Works correctly since no nested decimal expression is involved
        // Expected result type: (26, 6) * (26, 6) = (38, 12)
        df.select($"col" * $"col").explain(true)
        df.select($"col" * $"col").printSchema()
    
        // Gives a wrong result since there is a nested decimal expression that should be visited first
        // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
        df.select($"col" * $"col" * $"col").explain(true)
        df.select($"col" * $"col" * $"col").printSchema()
    ```
    
    The example above gives the following output:
    
    ```
    // Correct result without sub-expressions
    == Parsed Logical Plan ==
    'Project [('col * 'col) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    (col * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- (col * col): decimal(38,12) (nullable = true)
    
    // Incorrect result with sub-expressions
    == Parsed Logical Plan ==
    'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    ((col * col) * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- ((col * col) * col): decimal(38,12) (nullable = true)
    ```
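
    For intuition only, here is a minimal, self-contained sketch (a toy tree, not the actual Catalyst/DecimalPrecision code) of why a bottom-up transform visits nested expressions before their parents, while a top-down transform does the opposite:

    ```
    // Toy expression tree: transformUp rewrites children first, so when the
    // parent is visited its operands already carry their final (fixed) types;
    // transformDown visits the parent before its children, which is what led
    // to the wrong nested result type here.
    sealed trait Expr
    case class Leaf(tag: String) extends Expr
    case class Mul(left: Expr, right: Expr) extends Expr

    def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
      val afterChildren = e match {
        case Mul(l, r) => Mul(transformUp(l)(rule), transformUp(r)(rule))
        case leaf      => leaf
      }
      rule.applyOrElse(afterChildren, identity[Expr])
    }

    def transformDown(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
      val afterSelf = rule.applyOrElse(e, identity[Expr])
      afterSelf match {
        case Mul(l, r) => Mul(transformDown(l)(rule), transformDown(r)(rule))
        case leaf      => leaf
      }
    }
    ```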
    
    ## How was this patch tested?
    
    This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes apache#18583 from aokolnychyi/spark-21332.
    
    (cherry picked from commit 0be5fb4)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    aokolnychyi authored and gatorsmile committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    caf32b3 View commit details
    Browse the repository at this point in the history
  4. [SPARK-21445] Make IntWrapper and LongWrapper in UTF8String Serializable

    ## What changes were proposed in this pull request?
    
    Making those two classes Serializable will avoid serialization issues like the one below:
    ```
    Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
    Serialization stack:
        - object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper@326450e)
        - field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)
        - object (class org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, <function1>)
    ```
    
    ## How was this patch tested?
    
    - [x] Manual testing
    - [ ] Unit test
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes apache#18660 from brkyvz/serializableutf8.
    
    (cherry picked from commit 26cd2ca)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    brkyvz authored and cloud-fan committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    99ce551 View commit details
    Browse the repository at this point in the history
  5. [SPARK-18631][SQL] Changed ExchangeCoordinator re-partitioning to avo…

    …id more data skew
    
    ## What changes were proposed in this pull request?
    
    The re-partitioning logic in ExchangeCoordinator was changed so that a pre-shuffle partition is not added to the current post-shuffle partition if doing so would cause the post-shuffle partition's size to exceed the target partition size.
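
    As a rough illustration of the new behaviour (a hypothetical helper, not the actual ExchangeCoordinator code), a greedy coalescing pass that starts a new post-shuffle partition instead of exceeding the target size:

    ```
    import scala.collection.mutable.ArrayBuffer

    // Greedy packing sketch: pre-shuffle partition sizes (bytes) are grouped
    // into post-shuffle partitions; a new group is started whenever adding the
    // next partition would push the current group past targetSize.
    def coalesce(sizes: Seq[Long], targetSize: Long): Seq[Seq[Int]] = {
      val groups = ArrayBuffer(ArrayBuffer.empty[Int])
      var currentSize = 0L
      sizes.zipWithIndex.foreach { case (size, i) =>
        if (groups.last.nonEmpty && currentSize + size > targetSize) {
          groups += ArrayBuffer.empty[Int]
          currentSize = 0L
        }
        groups.last += i
        currentSize += size
      }
      groups.map(_.toSeq).toSeq
    }

    // coalesce(Seq(40L, 30L, 50L, 10L, 70L), 64L) == Seq(Seq(0), Seq(1), Seq(2, 3), Seq(4))
    ```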
    
    ## How was this patch tested?
    
    Existing tests updated to reflect new expectations.
    
    Author: Mark Hamstra <markhamstra@gmail.com>
    
    Closes apache#16065 from markhamstra/SPARK-17064.
    markhamstra committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    49e2ada View commit details
    Browse the repository at this point in the history
  6. [SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly ha…

    …ndle partition values with dot
    
    ## What changes were proposed in this pull request?
    
    When we list partitions from the Hive metastore with a partial partition spec, we expect exact matching of the partition values. However, Hive treats dot specially and matches any single character for it. We should apply an extra filter to drop unexpected partitions.
    
    ## How was this patch tested?
    
    new regression test.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#18671 from cloud-fan/hive.
    
    (cherry picked from commit f18b905)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Jul 18, 2017
    Configuration menu
    Copy the full SHA
    df061fd View commit details
    Browse the repository at this point in the history

Commits on Jul 19, 2017

  1. [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM.

    ## What changes were proposed in this pull request?
    
    In `SlidingWindowFunctionFrame`, the current logic adds to the buffer all rows whose input value is less than or equal to the output row's upper bound, and then drops from the buffer all rows whose input value is smaller than the output row's lower bound.
    This can make the buffer very large even though the window itself is small.
    For example:
    ```
    select a, b, sum(a)
    over (partition by b order by a range between 1000000 following and 1000001 following)
    from table
    ```
    We can refine the logic to add only the qualifying rows to the buffer, as sketched below.
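
    Conceptually (simplified types, not the actual WindowFunctionFrame code), the refined buffering only keeps rows that are already inside the frame:

    ```
    import scala.collection.mutable.ArrayBuffer

    // Old behaviour: buffer everything <= upperBound and evict < lowerBound
    // later, which can hold a huge prefix of the partition when the frame
    // starts far ahead of the current row. Refined: skip rows below the lower
    // bound before they ever enter the buffer.
    def frameRows(sortedValues: Iterator[Long], lowerBound: Long, upperBound: Long): Seq[Long] = {
      val buffer = ArrayBuffer.empty[Long]
      sortedValues
        .dropWhile(_ < lowerBound)   // never buffered
        .takeWhile(_ <= upperBound)  // only rows inside the frame are kept
        .foreach(buffer += _)
      buffer.toSeq
    }
    ```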
    
    ## How was this patch tested?
    Manual test:
    Run sql
    `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10`
    against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2 GB, containing 200k lines.
    The executor is configured with 2 GB of memory.
    With the change in this PR, it works fine. Without this change, the exception below is thrown.
    ```
    MemoryError: Java heap space
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
    	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
    	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
    	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    	at org.apache.spark.scheduler.Task.run(Task.scala:108)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ```
    
    Author: jinxing <jinxing6042@126.com>
    
    Closes apache#18634 from jinxing64/SPARK-21414.
    
    (cherry picked from commit 4eb081c)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jinxing authored and cloud-fan committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    5a0a76f View commit details
    Browse the repository at this point in the history
  2. [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results fai…

    …lures in some cases
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
    
    This issue can be reproduced by the following example:
    
    ```
    val spark = SparkSession
       .builder()
       .appName("smj-codegen")
       .master("local")
       .config("spark.sql.autoBroadcastJoinThreshold", "1")
       .getOrCreate()
    val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
    val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
    val df = df1.join(df2, df1("key") === df2("key"))
       .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
       .select("int")
       df.show()
    ```
    
    To conclude, the issue happens when:
    (1) SortMergeJoin condition contains CodegenFallback expressions.
    (2) In the PhysicalPlan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.
    
    This patch fixes the logic in `CollapseCodegenStages` rule.
    
    ## How was this patch tested?
    Unit test and manual verification in our cluster.
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes apache#18656 from DonnyZone/Fix_SortMergeJoinExec.
    
    (cherry picked from commit 6b6dd68)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    DonnyZone authored and cloud-fan committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    4c212ee View commit details
    Browse the repository at this point in the history
  3. [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results fai…

    …lures in some cases
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
    
    This issue can be reproduced by the following example:
    
    ```
    val spark = SparkSession
       .builder()
       .appName("smj-codegen")
       .master("local")
       .config("spark.sql.autoBroadcastJoinThreshold", "1")
       .getOrCreate()
    val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
    val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
    val df = df1.join(df2, df1("key") === df2("key"))
       .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
       .select("int")
       df.show()
    ```
    
    To conclude, the issue happens when:
    (1) SortMergeJoin condition contains CodegenFallback expressions.
    (2) In the PhysicalPlan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.
    
    This patch fixes the logic in `CollapseCodegenStages` rule.
    
    ## How was this patch tested?
    Unit test and manual verification in our cluster.
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes apache#18656 from DonnyZone/Fix_SortMergeJoinExec.
    
    (cherry picked from commit 6b6dd68)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    DonnyZone authored and cloud-fan committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    ac20693 View commit details
    Browse the repository at this point in the history
  4. wip

    markhamstra committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    9c61833 View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    2cddd1c View commit details
    Browse the repository at this point in the history
  6. [SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingT…

    …ime class
    
    ## What changes were proposed in this pull request?
    
    Use of the `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However, existing uses of ProcessingTime still cause deprecation warnings during compilation. This cannot be avoided entirely: even though it is deprecated as a public API, ProcessingTime instances are used internally in TriggerExecutor. This PR minimizes the warnings by removing its uses from tests as much as possible.
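
    For reference, the non-deprecated form looks like this (a usage sketch; `df` is assumed to be a streaming DataFrame and the console sink is arbitrary):

    ```
    import org.apache.spark.sql.streaming.Trigger

    // Preferred, non-deprecated way to configure a processing-time trigger.
    val query = df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
    ```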
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#18678 from tdas/SPARK-21464.
    
    (cherry picked from commit 70fe99d)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    tdas committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    86cd3c0 View commit details
    Browse the repository at this point in the history
  7. [SPARK-21446][SQL] Fix setAutoCommit never executed

    ## What changes were proposed in this pull request?
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
    options.asConnectionProperties cannot contain fetchsize, because fetchsize is a Spark-only option, and Spark-only options are excluded from the connection properties.
    So the properties passed to beforeFetch are changed from options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap.
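
    For context, a typical PostgreSQL read that relies on `fetchsize` (URL and credentials are placeholders; `spark` is the active SparkSession):

    ```
    // "fetchsize" is a Spark-only option, so it is excluded from the raw JDBC
    // connection properties; the beforeFetch hook therefore has to read it
    // from options.asProperties instead.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "public.events")
      .option("user", "spark")
      .option("password", "secret")
      .option("fetchsize", "10000")
      .load()
    ```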
    
    ## How was this patch tested?
    
    Author: DFFuture <albert.zhang23@gmail.com>
    
    Closes apache#18665 from DFFuture/sparksql_pg.
    
    (cherry picked from commit c972918)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    DFFuture authored and gatorsmile committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    308bce0 View commit details
    Browse the repository at this point in the history
  8. [SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset…

    …#joinWith
    
    ## What changes were proposed in this pull request?
    
    Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<Tuple2\>, left_semi and left_anti are invalid, as they only return values from one of the datasets instead of from both.
    
    ## How was this patch tested?
    
    I ran the following code :
    ```
    public static void main(String[] args) {
    	SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    	Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
    	Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);
    
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();}
    }
    ```
    which tests all the different join types; the last two (left_semi and left_anti) threw exceptions. The same code using join instead of joinWith worked fine. The Bean class was just a Java bean with a single int field, x.
    
    Author: Corey Woodfield <coreywoodfield@gmail.com>
    
    Closes apache#18462 from coreywoodfield/master.
    
    (cherry picked from commit 8cd9cdf)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    coreywoodfield authored and gatorsmile committed Jul 19, 2017
    Configuration menu
    Copy the full SHA
    9949fed View commit details
    Browse the repository at this point in the history

Commits on Jul 21, 2017

  1. [SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch

    For configurations with external shuffle enabled, we have observed that if a very large number of blocks is fetched from a remote host, it puts the NodeManager under extra pressure and can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the number of map outputs being fetched from a given remote address. The change applies to both scenarios: when external shuffle is enabled as well as when it is disabled.
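
    A usage sketch for the new setting (the value 100 is arbitrary):

    ```
    import org.apache.spark.SparkConf

    // Cap the number of map output blocks fetched in flight from any single
    // remote address, to avoid overwhelming the external shuffle service / NM.
    val conf = new SparkConf()
      .set("spark.reducer.maxBlocksInFlightPerAddress", "100")
    ```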
    
    Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values - 10, 20, 50, 100. The job ran fine and there is no change in the output. (I will update the NM-related metrics in some time.)
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes apache#18487 from dhruve/impr/SPARK-21243.
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes apache#18691 from dhruve/branch-2.2.
    dhruve authored and Tom Graves committed Jul 21, 2017
    Configuration menu
    Copy the full SHA
    88dccda View commit details
    Browse the repository at this point in the history
  2. [SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.

    Update the Quickstart and RDD programming guides to mention pip.
    
    Built docs locally.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
    
    (cherry picked from commit cc00e99)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    holdenk committed Jul 21, 2017
    Configuration menu
    Copy the full SHA
    da403b9 View commit details
    Browse the repository at this point in the history

Commits on Jul 23, 2017

  1. [SPARK-20904][CORE] Don't report task failures to driver during shutd…

    …own.
    
    Executors run a thread pool with daemon threads to run tasks. This means
    that those threads remain active when the JVM is shutting down, meaning
    those tasks are affected by code that runs in shutdown hooks.
    
    So if a shutdown hook messes with something that the task is using (e.g.
    an HDFS connection), the task will fail and will report that failure to
    the driver. That will make the driver mark the task as failed regardless
    of what caused the executor to shut down. So, for example, if YARN pre-empted
    that executor, the driver would consider that task failed when it should
    instead ignore the failure.
    
    This change avoids reporting failures to the driver when shutdown hooks
    are executing; this fixes the YARN preemption accounting, and doesn't really
    change things much for other scenarios, other than reporting a more generic
    error ("Executor lost") when the executor shuts down unexpectedly - which
    is arguably more correct.
    
    Tested with a hacky app running on spark-shell that tried to cause failures
    only when shutdown hooks were running, verified that preemption didn't cause
    the app to fail because of task failures exceeding the threshold.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#18594 from vanzin/SPARK-20904.
    
    (cherry picked from commit cecd285)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Marcelo Vanzin authored and cloud-fan committed Jul 23, 2017
    Configuration menu
    Copy the full SHA
    62ca13d View commit details
    Browse the repository at this point in the history

Commits on Jul 25, 2017

  1. [SPARK-21383][YARN] Fix the YarnAllocator allocates more Resource

    When NodeManagers are slow to launch executors, the computed `missing` value
    exceeds the real number of missing executors, which can lead YARN to allocate
    more resources than needed.

    We now include `numExecutorsRunning` when calculating `missing` to avoid this.
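
    Roughly, the accounting change amounts to the following (illustrative names, not the exact YarnAllocator fields):

    ```
    // Before, executors that had been launched but not yet registered were not
    // subtracted, so `missing` over-requested containers when launches were
    // slow; counting running executors as well fixes that.
    def missingExecutors(target: Int, pendingAllocate: Int, numExecutorsRunning: Int): Int =
      math.max(0, target - pendingAllocate - numExecutorsRunning)
    ```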
    
    Tested by experiment.
    
    Author: DjvuLee <lihu@bytedance.com>
    
    Closes apache#18651 from djvulee/YarnAllocate.
    
    (cherry picked from commit 8de080d)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    DjvuLee authored and Marcelo Vanzin committed Jul 25, 2017
    Configuration menu
    Copy the full SHA
    e5ec339 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    0af0672 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21447][WEB UI] Spark history server fails to render compressed

    inprogress history file in some cases.
    
    Add failure handling for an EOFException that can be thrown during
    decompression of an in-progress Spark history file, treating it the same as
    the case where the last line cannot be parsed.
    
    ## What changes were proposed in this pull request?
    
    Failure handling for an EOFException thrown within the ReplayListenerBus.replay method, analogous to the JSON parse failure case. This path can arise for compressed in-progress history files, since an incomplete compression block could be read (not yet flushed by the writer on a block boundary). See the stack trace of this occurrence in the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-21447).
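
    A minimal sketch of the handling pattern (not the actual ReplayListenerBus code; `parseAndPost` is a hypothetical callback):

    ```
    import java.io.EOFException

    // An incomplete compression block in an in-progress history file can
    // surface as an EOFException mid-replay; when the file may be truncated,
    // stop replay quietly instead of failing, mirroring the JSON parse-failure
    // handling.
    def replay(lines: Iterator[String], maybeTruncated: Boolean)(parseAndPost: String => Unit): Unit = {
      try {
        lines.foreach(parseAndPost)
      } catch {
        case e: EOFException =>
          if (!maybeTruncated) throw e
          // else: expected for compressed in-progress files; ignore and finish
      }
    }
    ```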
    
    ## How was this patch tested?
    
    Added a unit test that specifically targets validating the failure handling path appropriately when maybeTruncated is true and false.
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes apache#18673 from ericvandenbergfb/fix_inprogress_compr_history_file.
    
    (cherry picked from commit 06a9793)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Eric Vandenberg authored and Marcelo Vanzin committed Jul 25, 2017
    Configuration menu
    Copy the full SHA
    c91191b View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    ec50897 View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    f3df120 View commit details
    Browse the repository at this point in the history

Commits on Jul 26, 2017

  1. [SPARK-21494][NETWORK] Use correct app id when authenticating to exte…

    …rnal service.
    
    There was some code based on the old SASL handler in the new auth client that
    was incorrectly using the SASL user as the user to authenticate against the
    external shuffle service. This caused the external service to not be able to
    find the correct secret to authenticate the connection, failing the connection.
    
    In the course of debugging, I found that some log messages from the YARN shuffle
    service were a little noisy, so I silenced some of them, and also added a couple
    of new ones that helped find this issue. On top of that, I found that a check
    in the code that records app secrets was wrong, causing more log spam and also
    using an O(n) operation instead of an O(1) call.
    
    Also added a new integration suite for the YARN shuffle service with auth on,
    and verified it failed before, and passes now.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#18706 from vanzin/SPARK-21494.
    
    (cherry picked from commit 300807c)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Jul 26, 2017
    Configuration menu
    Copy the full SHA
    1bfd1a8 View commit details
    Browse the repository at this point in the history

Commits on Jul 27, 2017

  1. Configuration menu
    Copy the full SHA
    420e6e9 View commit details
    Browse the repository at this point in the history
  2. fix mismerge

    markhamstra committed Jul 27, 2017
    Configuration menu
    Copy the full SHA
    464a934 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API

    ## What changes were proposed in this pull request?
    
    This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:
    
    ```
    spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
    spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
    spark.range(1).withColumnRenamed("id", "x").sort('id) // works
    spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
    org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
    ```
    The above `AnalysisException` happens because the last case calls `Dataset.apply()` to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.
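
    The idea behind the fix, as a rough sketch (a hypothetical helper, not the exact Dataset code): build an unresolved `Column` from the name instead of resolving it eagerly, matching what the `Column`/`Symbol` overloads already do.

    ```
    import org.apache.spark.sql.{Column, Dataset}

    // new Column(name) creates an unresolved attribute that the analyzer
    // resolves later, whereas Dataset.apply/col resolves the name immediately
    // and fails if the column only exists after a rename in the same plan.
    def sortByNames[T](ds: Dataset[T], cols: String*): Dataset[T] =
      ds.sort(cols.map(c => new Column(c)): _*)
    ```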
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes apache#18740 from aokolnychyi/spark-21538.
    
    (cherry picked from commit f44ead8)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    aokolnychyi authored and gatorsmile committed Jul 27, 2017
    Configuration menu
    Copy the full SHA
    06b2ef0 View commit details
    Browse the repository at this point in the history

Commits on Jul 28, 2017

  1. [SPARK-21306][ML] OneVsRest should support setWeightCol

    ## What changes were proposed in this pull request?
    
    add `setWeightCol` method for OneVsRest.
    
    `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.
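
    A usage sketch with the new setter (the "weight" column name is assumed):

    ```
    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // LogisticRegression mixes in HasWeightCol, so the per-row weights in the
    // "weight" column are passed down to every binary sub-model.
    val ovr = new OneVsRest()
      .setClassifier(new LogisticRegression())
      .setWeightCol("weight")
    ```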
    
    ## How was this patch tested?
    
    + [x] add a unit test.
    
    Author: Yan Facai (颜发才) <facai.yan@gmail.com>
    
    Closes apache#18554 from facaiy/BUG/oneVsRest_missing_weightCol.
    
    (cherry picked from commit a5a3189)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    facaiy authored and yanboliang committed Jul 28, 2017
    Configuration menu
    Copy the full SHA
    9379031 View commit details
    Browse the repository at this point in the history

Commits on Jul 29, 2017

  1. [SPARK-21508][DOC] Fix example code provided in Spark Streaming Docum…

    …entation
    
    ## What changes were proposed in this pull request?
    
    JIRA ticket : [SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508)
    
    correcting a mistake in example code provided in Spark Streaming Custom Receivers Documentation
    The example code provided in the documentation on 'Spark Streaming Custom Receivers' has an error.
    doc link : https://spark.apache.org/docs/latest/streaming-custom-receivers.html
    
    ```
    
    // Assuming ssc is the StreamingContext
    val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
    val words = lines.flatMap(_.split(" "))
    ...
    ```
    
    instead of `lines.flatMap(_.split(" "))`
    it should be `customReceiverStream.flatMap(_.split(" "))`
    
    ## How was this patch tested?
    this documentation change is tested manually by jekyll build , running below commands
    ```
    jekyll build
    jekyll serve --watch
    ```
    screen-shots provided below
    ![screenshot1](https://user-images.githubusercontent.com/8828470/28744636-a6de1ac6-7482-11e7-843b-ff84b5855ec0.png)
    ![screenshot2](https://user-images.githubusercontent.com/8828470/28744637-a6def496-7482-11e7-9512-7f4bbe027c6a.png)
    
    Author: Remis Haroon <Remis.Haroon@insdc01.pwc.com>
    
    Closes apache#18770 from remisharoon/master.
    
    (cherry picked from commit c143820)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Remis Haroon authored and srowen committed Jul 29, 2017
    Configuration menu
    Copy the full SHA
    df6cd35 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically…

    … by its canonicalized child
    
    ## What changes were proposed in this pull request?
    
    When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.
    
    An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.
    
    Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim the aliases out by simply transforming the expressions in `CleanupAliases`.
    
    If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.
    
    Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.
    
    One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#18761 from viirya/SPARK-21555.
    
    (cherry picked from commit 9c8109e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    viirya authored and gatorsmile committed Jul 29, 2017
    Configuration menu
    Copy the full SHA
    24a9bac View commit details
    Browse the repository at this point in the history
  3. [SPARK-19451][SQL] rangeBetween method should accept Long value as bo…

    …undary
    
    ## What changes were proposed in this pull request?
    
    Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values; this can cause wrong results and we should fix it.
    
    Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.
    
    This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c
    
    After this been merged, we can close apache#16818 .
    
    ## How was this patch tested?
    
    Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes apache#18540 from jiangxb1987/rangeFrame.
    
    (cherry picked from commit 92d8563)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    jiangxb1987 authored and gatorsmile committed Jul 29, 2017
    Configuration menu
    Copy the full SHA
    66fa6bd View commit details
    Browse the repository at this point in the history

Commits on Jul 30, 2017

  1. Revert "[SPARK-19451][SQL] rangeBetween method should accept Long val…

    …ue as boundary"
    
    This reverts commit 66fa6bd.
    gatorsmile committed Jul 30, 2017
    Configuration menu
    Copy the full SHA
    e2062b9 View commit details
    Browse the repository at this point in the history

Commits on Aug 1, 2017

  1. [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite.

    Handle the case where the server closes the socket before the full message
    has been written by the client.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#18727 from vanzin/SPARK-21522.
    
    (cherry picked from commit b133501)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Marcelo Vanzin committed Aug 1, 2017
    Configuration menu
    Copy the full SHA
    1745434 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21593][DOCS] Fix 2 rendering errors on configuration page

    ## What changes were proposed in this pull request?
    
    Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and SPARK-15355.
    
    ## How was this patch tested?
    
    Manually built and viewed docs with jekyll
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#18793 from srowen/SPARK-21593.
    
    (cherry picked from commit b1d59e6)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    srowen committed Aug 1, 2017
    Configuration menu
    Copy the full SHA
    79e5805 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21339][CORE] spark-shell --packages option does not add jars t…

    …o classpath on windows
    
    The --packages option jars are added to the classpath with the "file:///" scheme. On Unix this is not a problem, since the scheme contains the Unix path separator, which separates the jar name from its location in the classpath. On Windows, the jar file is not resolved from the classpath because of the scheme.
    
    Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar
    Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar
    
    With this PR, we are avoiding the 'file://' scheme to get added to the packages jar files.
    
    I have verified manually in Windows and Unix environments, with the change it adds the jar to classpath like below,
    
    Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar
    Unix : /home/<user>/.ivy2/jars/<jar-name>.jar
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes apache#18708 from devaraj-kavali/SPARK-21339.
    
    (cherry picked from commit 58da1a2)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Devaraj K authored and Marcelo Vanzin committed Aug 1, 2017
    Configuration menu
    Copy the full SHA
    67c60d7 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    8d04581 View commit details
    Browse the repository at this point in the history

Commits on Aug 2, 2017

  1. [SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats

    ## What changes were proposed in this pull request?
    
    This PR fixed a potential overflow issue in EventTimeStats.
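
    For intuition, one standard way to avoid an overflowing accumulator is an incremental average (a sketch of the technique, not necessarily the exact EventTimeStats change):

    ```
    // Running average without keeping a sum that can overflow:
    // avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n
    case class RunningAvg(avg: Double = 0.0, count: Long = 0L) {
      def add(x: Long): RunningAvg = {
        val n = count + 1
        RunningAvg(avg + (x - avg) / n, n)
      }
    }
    ```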
    
    ## How was this patch tested?
    
    The new unit tests
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18803 from zsxwing/avg.
    
    (cherry picked from commit 7f63e85)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Aug 2, 2017
    Configuration menu
    Copy the full SHA
    397f904 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21546][SS] dropDuplicates should ignore watermark when it's no…

    …t a key
    
    ## What changes were proposed in this pull request?
    
    When the watermark column is not one of the `dropDuplicates` key columns, the query currently crashes. This PR fixes that issue.
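
    The previously crashing shape, roughly (a sketch; `events` and the column names are assumed):

    ```
    // The watermark column ("eventTime") is not part of the dropDuplicates
    // key; this used to crash, and after the fix the watermark is simply not
    // used for keying while still bounding the deduplication state.
    val deduped = events
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("id")
    ```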
    
    ## How was this patch tested?
    
    The new unit test.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18822 from zsxwing/SPARK-21546.
    
    (cherry picked from commit 0d26b3a)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    zsxwing committed Aug 2, 2017
    Configuration menu
    Copy the full SHA
    467ee8d View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    8820569 View commit details
    Browse the repository at this point in the history

Commits on Aug 3, 2017

  1. [SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle…

    … registry
    
    ## What changes were proposed in this pull request?
    
    When using PySpark broadcast variables in a multi-threaded environment,  `SparkContext._pickled_broadcast_vars` becomes a shared resource.  A race condition can occur when broadcast variables that are pickled from one thread get added to the shared ` _pickled_broadcast_vars` and become part of the python command from another thread.  This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used.
    
    ## How was this patch tested?
    
    Added a unit test that causes this race condition using another thread.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes apache#18823 from BryanCutler/branch-2.2.
    BryanCutler authored and HyukjinKwon committed Aug 3, 2017
    Configuration menu
    Copy the full SHA
    690f491 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    b1a731c View commit details
    Browse the repository at this point in the history
  3. Fix Java SimpleApp spark application

    ## What changes were proposed in this pull request?
    
    Add missing import and missing parentheses to invoke `SparkSession::text()`.
    
    ## How was this patch tested?
    
    Built and ran the code for this application, and ran jekyll locally per docs/README.md.
    
    Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov>
    
    Closes apache#18795 from christiam/master.
    
    (cherry picked from commit dd72b10)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    christiam authored and srowen committed Aug 3, 2017
    Configuration menu
    Copy the full SHA
    1bcfa2a View commit details
    Browse the repository at this point in the history

Commits on Aug 4, 2017

  1. [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC tab…

    …le with extreme values on the partition column
    
    ## What changes were proposed in this pull request?
    
    An overflow of the difference of bounds on the partitioning column leads to no data being read. This
    patch checks for this overflow.
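
    A minimal sketch of the kind of check involved (illustrative only, not the exact patch):

    ```
    // With extreme bounds, upperBound - lowerBound can overflow Long and the
    // partitioning math previously produced empty partitions; given
    // upperBound >= lowerBound, the subtraction overflowed iff it went negative.
    def boundsDiffOverflows(lowerBound: Long, upperBound: Long): Boolean = {
      require(upperBound >= lowerBound, "upperBound must be >= lowerBound")
      upperBound - lowerBound < 0
    }
    ```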
    
    ## How was this patch tested?
    
    New unit test.
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes apache#18800 from aray/SPARK-21330.
    
    (cherry picked from commit 25826c7)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    aray authored and srowen committed Aug 4, 2017
    Configuration menu
    Copy the full SHA
    f9aae8e View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    8aa9405 View commit details
    Browse the repository at this point in the history

Commits on Aug 5, 2017

  1. [SPARK-21580][SQL] Integers in aggregation expressions are wrongly ta…

    …ken as group-by ordinal
    
    ## What changes were proposed in this pull request?
    
    create temporary view data as select * from values
    (1, 1),
    (1, 2),
    (2, 1),
    (2, 2),
    (3, 1),
    (3, 2)
    as data(a, b);
    
    `select 3, 4, sum(b) from data group by 1, 2;`
    `select 3 as c, 4 as d, sum(b) from data group by c, d;`
    When running these two cases, the following exception occurred:
    `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`
    
    The cause of this failure:
    If an aggregate expression is an integer literal, then after the group-by ordinal is replaced with this aggregate expression, the group expression is still treated as an ordinal.
    
    The solution:
    This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
    
    ## How was this patch tested?
    Added unit test case
    
    Author: liuxian <liu.xian3@zte.com.cn>
    
    Closes apache#18779 from 10110346/groupby.
    
    (cherry picked from commit 894d5a4)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    10110346 authored and gatorsmile committed Aug 5, 2017
    Configuration menu
    Copy the full SHA
    841bc2f View commit details
    Browse the repository at this point in the history

Commits on Aug 6, 2017

  1. [SPARK-21588][SQL] SQLContext.getConf(key, null) should return null

    ## What changes were proposed in this pull request?
    
    SQLContext.getConf(key, null), for a key that is not defined in the conf and doesn't have a default value defined, throws an NPE. It happens only when the conf entry has a value converter.
    
    Added null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue)
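
    Expected behaviour after the fix, roughly (the key below is a hypothetical, unset key; `spark` is the active SparkSession):

    ```
    // A null defaultValue for an undefined key is now returned as-is instead
    // of being passed to the entry's value converter (which used to NPE).
    val value = spark.sqlContext.getConf("spark.sql.some.unset.key", null)
    assert(value == null)
    ```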
    
    ## How was this patch tested?
    Added unit test
    
    Author: vinodkc <vinod.kc.in@gmail.com>
    
    Closes apache#18852 from vinodkc/br_Fix_SPARK-21588.
    
    (cherry picked from commit 1ba967b)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    vinodkc authored and gatorsmile committed Aug 6, 2017
    Configuration menu
    Copy the full SHA
    098aaec View commit details
    Browse the repository at this point in the history

Commits on Aug 7, 2017

  1. [SPARK-21621][CORE] Reset numRecordsWritten after DiskBlockObjectWrit…

    …er.commitAndGet called
    
    ## What changes were proposed in this pull request?
    
    We should reset numRecordsWritten to zero after DiskBlockObjectWriter.commitAndGet is called.
    When `revertPartialWritesAndClose` is called, we decrease the written-records count in `ShuffleWriteMetrics`. However, we currently decrease it all the way to zero, which is wrong; we should only subtract the records written after the last `commitAndGet` call.
    
    ## How was this patch tested?
    Modified existing test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Xianyang Liu <xianyang.liu@intel.com>
    
    Closes apache#18830 from ConeyLiu/DiskBlockObjectWriter.
    
    (cherry picked from commit 534a063)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ConeyLiu authored and cloud-fan committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    7a04def View commit details
    Browse the repository at this point in the history
  2. [SPARK-21647][SQL] Fix SortMergeJoin when using CROSS

    ### What changes were proposed in this pull request?
    author: BoleynSu
    closes apache#18836
    
    ```Scala
    val df = Seq((1, 1)).toDF("i", "j")
    df.createOrReplaceTempView("T")
    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
      sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " +
        "cross join T t2 where t2.i = t1.i").explain(true)
    }
    ```
    The above code could cause the following exception:
    ```
    SortMergeJoinExec should not take Cross as the JoinType
    java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType
    	at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100)
    ```
    
    Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue.
    
    ### How was this patch tested?
    Modified the two existing test cases.
    
    Author: Xiao Li <gatorsmile@gmail.com>
    Author: Boleyn Su <boleyn.su@gmail.com>
    
    Closes apache#18863 from gatorsmile/pr-18836.
    
    (cherry picked from commit bbfd6b5)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gatorsmile authored and cloud-fan committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    4f0eb0c View commit details
    Browse the repository at this point in the history
  3. [SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with di…

    …sabled FS cache
    
    This PR replaces apache#18623 to do some clean up.
    
    Closes apache#18623
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    Author: Andrey Taptunov <taptunov@amazon.com>
    
    Closes apache#18848 from zsxwing/review-pr18623.
    Andrey Taptunov authored and zsxwing committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    43f9c84 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    0aacb6b View commit details
    Browse the repository at this point in the history
  5. [SPARK-21565][SS] Propagate metadata in attribute replacement.

    ## What changes were proposed in this pull request?
    
    Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes.
    
    ## How was this patch tested?
    new unit test, which was verified to fail before the fix
    
    Author: Jose Torres <joseph-torres@databricks.com>
    
    Closes apache#18840 from joseph-torres/SPARK-21565.
    
    (cherry picked from commit cce25b3)
    Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    Jose Torres authored and zsxwing committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    fa92a7b View commit details
    Browse the repository at this point in the history
  6. [SPARK-21648][SQL] Fix confusing assert failure in JDBC source when p…

    …arallel fetching parameters are not properly provided.
    
    ### What changes were proposed in this pull request?
    ```SQL
    CREATE TABLE mytesttable1
    USING org.apache.spark.sql.jdbc
      OPTIONS (
      url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
      dbtable 'mytesttable1',
      paritionColumn 'state_id',
      lowerBound '0',
      upperBound '52',
      numPartitions '53',
      fetchSize '10000'
    )
    ```
    
    The above option name `paritionColumn` is wrong. That means users did not provide a value for `partitionColumn`. In such a case, users hit a confusing error.
    
    ```
    AssertionError: assertion failed
    java.lang.AssertionError: assertion failed
    	at scala.Predef$.assert(Predef.scala:156)
    	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
    	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
    ```
    
    ### How was this patch tested?
    Added a test case
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#18864 from gatorsmile/jdbcPartCol.
    
    (cherry picked from commit baf5cac)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gatorsmile committed Aug 7, 2017
    Configuration menu
    Copy the full SHA
    a1c1199 View commit details
    Browse the repository at this point in the history

Commits on Aug 8, 2017

  1. [SPARK-21567][SQL] Dataset should work with type alias

    If we create a type alias for a type that works with Dataset, the alias itself doesn't work with Dataset.
    
    A reproducible case looks like:
    
        object C {
          type TwoInt = (Int, Int)
          def tupleTypeAlias: TwoInt = (1, 1)
        }
    
        Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))
    
    It throws an exception like:
    
        type T1 is not a class
        scala.ScalaReflectionException: type T1 is not a class
          at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
          ...
    
    This patch accesses the dealias of type in many places in `ScalaReflection` to fix it.
    
    Added test case.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#18813 from viirya/SPARK-21567.
    
    (cherry picked from commit ee13041)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Aug 8, 2017
    Configuration menu
    Copy the full SHA
    86609a9 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    e87ffca View commit details
    Browse the repository at this point in the history

Commits on Aug 9, 2017

  1. [SPARK-21503][UI] Spark UI shows incorrect task status for a killed E…

    …xecutor Process
    
    The Executors tab on the Spark UI shows a task as completed when the executor process running that task is killed using the kill command.
    Added the ExecutorLostFailure case, which was previously missing; without it the default case was executed and the task was marked as completed. The new case covers all situations where the executor's connection to the Spark driver was lost, e.g. the executor process being killed, a dropped network connection, etc.
    
    ## How was this patch tested?
    Manually Tested the fix by observing the UI change before and after.
    Before:
    <img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png">
    After:
    <img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png">
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: pgandhi <pgandhi@yahoo-inc.com>
    Author: pgandhi999 <parthkgandhi9@gmail.com>
    
    Closes apache#18707 from pgandhi999/master.
    
    (cherry picked from commit f016f5c)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    pgandhi authored and cloud-fan committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    d023314 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in …

    …strong wolfe line search
    
    ## What changes were proposed in this pull request?
    
    Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search
    scalanlp/breeze#651
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <WeichenXu123@outlook.com>
    
    Closes apache#18797 from WeichenXu123/update-breeze.
    
    (cherry picked from commit b35660d)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    WeichenXu123 authored and yanboliang committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    7446be3 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the…

    … return value
    
    Same PR as apache#18799 but for branch 2.2. The main discussion is in the other PR.
    --------
    
    When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query doesn't generate a batch metadata file, this behavior will hide it and allow the query continuing to run and finally delete metadata logs and make it hard to debug.
    
    This PR ensures that places calling HDFSMetadataLog.get always check the return value.
    
    Jenkins
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18890 from tdas/SPARK-21596-2.2.
    zsxwing committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    f6d56d2 View commit details
    Browse the repository at this point in the history
  4. [SPARK-21663][TESTS] test("remote fetch below max RPC message size") …

    …should call masterTracker.stop() in MapOutputTrackerSuite
    
    Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn>
    
    ## What changes were proposed in this pull request?
    After the unit tests end, masterTracker.stop() should be called to free resources;
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    Run Unit tests;
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: 10087686 <wang.jiaochun@zte.com.cn>
    
    Closes apache#18867 from wangjiaochun/mapout.
    
    (cherry picked from commit 6426adf)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wangjiaochun authored and cloud-fan committed Aug 9, 2017
    Configuration menu
    Copy the full SHA
    3ca55ea View commit details
    Browse the repository at this point in the history

Commits on Aug 11, 2017

  1. [SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog

    ## What changes were proposed in this pull request?
    This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog. getTableOption.
    
    ## How was this patch tested?
    Removed the test case.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#18912 from rxin/remove-getTableOption.
    
    (cherry picked from commit 584c7f1)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
    rxin committed Aug 11, 2017
    Configuration menu
    Copy the full SHA
    c909496 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21595] Separate thresholds for buffering and spilling in Exter…

    …nalAppendOnlyUnsafeRowArray
    
    ## What changes were proposed in this pull request?
    
    [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre apache#16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers).
    
    Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control.
    
    ## How was this patch tested?
    
    Added unit tests
    
    Author: Tejas Patil <tejasp@fb.com>
    
    Closes apache#18843 from tejasapatil/SPARK-21595.
    
    (cherry picked from commit 9443999)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    tejasapatil authored and hvanhovell committed Aug 11, 2017
    Configuration menu
    Copy the full SHA
    406eb1c View commit details
    Browse the repository at this point in the history

Commits on Aug 14, 2017

  1. [SPARK-21563][CORE] Fix race condition when serializing TaskDescripti…

    …ons and adding jars
    
    ## What changes were proposed in this pull request?
    
    Fix the race condition when serializing TaskDescriptions and adding jars by keeping the set of jars and files for a TaskSet constant across the lifetime of the TaskSet. Otherwise TaskDescription serialization can produce an invalid result when new files or jars are added concurrently while the TaskDescription is being serialized.
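
    A minimal sketch of the idea behind the fix, with hypothetical names (not the actual Spark classes): take an immutable snapshot of the jar/file maps once, and let serialization read only the snapshot:

    ```scala
    // Illustrative only: snapshot the mutable state once so concurrent addJar/addFile
    // calls cannot change what a later serialization observes.
    class TaskSetResources(liveJars: scala.collection.Map[String, Long],
                           liveFiles: scala.collection.Map[String, Long]) {
      val jars: Map[String, Long] = liveJars.toMap    // immutable copy taken at creation
      val files: Map[String, Long] = liveFiles.toMap  // immutable copy taken at creation
    }
    ```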
    
    ## How was this patch tested?
    
    Additional unit test ensures jars/files contained in the TaskDescription remain constant throughout the lifetime of the TaskSet.
    
    Author: Andrew Ash <andrew@andrewash.com>
    
    Closes apache#18913 from ash211/SPARK-21563.
    
    (cherry picked from commit 6847e93)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ash211 authored and cloud-fan committed Aug 14, 2017
    Configuration menu
    Copy the full SHA
    7b98077 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    dc3cdd5 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21696][SS] Fix a potential issue that may generate partial sna…

    …pshot files
    
    ## What changes were proposed in this pull request?
    
    Directly writing a snapshot file may produce a partial file. This PR changes it to write to a temp file first and then rename it to the target file.
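
    A hedged sketch of the write-then-rename pattern using the Hadoop FileSystem API; the helper itself is illustrative, not the exact Spark code:

    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Illustrative only: readers never observe a partially written snapshot,
    // because the target path only appears after the rename.
    def writeSnapshotAtomically(fs: FileSystem, target: Path, bytes: Array[Byte]): Unit = {
      val temp = new Path(target.getParent, s".${target.getName}.tmp")
      val out = fs.create(temp, true)
      try out.write(bytes) finally out.close()
      if (!fs.rename(temp, target)) {
        throw new java.io.IOException(s"Failed to rename $temp to $target")
      }
    }
    ```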
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#18928 from zsxwing/SPARK-21696.
    
    (cherry picked from commit 282f00b)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    zsxwing authored and tdas committed Aug 14, 2017
    Configuration menu
    Copy the full SHA
    48bacd3 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    3a02a3c View commit details
    Browse the repository at this point in the history

Commits on Aug 15, 2017

  1. [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are…

    … successfully removed
    
    ## What changes were proposed in this pull request?
    
    We put the staging path to be deleted into the deleteOnExit cache of `FileSystem` in case the path can't be removed right away. But when we do successfully remove the path, we don't remove it from the cache. We should do so to keep the cache from growing indefinitely.
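
    A minimal sketch of the intended cleanup; `cancelDeleteOnExit` is a real Hadoop `FileSystem` method, while the surrounding helper is illustrative:

    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Illustrative only: once the path is really gone, also drop it from the
    // deleteOnExit bookkeeping so that cache does not keep growing.
    def deleteStaging(fs: FileSystem, stagingPath: Path): Unit = {
      if (fs.delete(stagingPath, true)) {
        fs.cancelDeleteOnExit(stagingPath)
      }
    }
    ```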
    
    ## How was this patch tested?
    
    Added a test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#18934 from viirya/SPARK-21721.
    
    (cherry picked from commit 4c3cf1c)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    viirya authored and gatorsmile committed Aug 15, 2017
    Configuration menu
    Copy the full SHA
    d9c8e62 View commit details
    Browse the repository at this point in the history

Commits on Aug 16, 2017

  1. [SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)

    Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon
    
    (Maybe the usage should be forbidden when writing, in a major version change?).
    
    Manual test, that loading and writing LibSVM files work fine, both with and without the numFeatures option.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz>
    
    Closes apache#18872 from ProtD/master.
    
    (cherry picked from commit 8321c14)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Jan Vrsovsky authored and srowen committed Aug 16, 2017
    Configuration menu
    Copy the full SHA
    f1accc8 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21656][CORE] spark dynamic allocation should not idle timeout …

    …executors when tasks still to run
    
    ## What changes were proposed in this pull request?
    
    Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark release them while they were idle but still really needed, for example when the scheduler was waiting for node locality, which can take longer than the default idle timeout. In these jobs the number of executors drops very low (fewer than 10) while there are still around 80,000 tasks to run.
    We should consider not letting executors idle-timeout if they are still needed according to the number of tasks left to run.
    
    ## How was this patch tested?
    
    Tested by manually adding executors to `executorsIdsToBeRemoved` list and seeing if those executors were removed when there are a lot of tasks and a high `numExecutorsTarget` value.
    
    Code used
    
    In  `ExecutorAllocationManager.start()`
    
    ```
        start_time = clock.getTimeMillis()
    ```
    
    In `ExecutorAllocationManager.schedule()`
    ```
        val executorIdsToBeRemoved = ArrayBuffer[String]()
        if ( now > start_time + 1000 * 60 * 2) {
          logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
          start_time +=  1000 * 60 * 100
          var counter = 0
          for (x <- executorIds) {
            counter += 1
            if (counter == 2) {
              counter = 0
              executorIdsToBeRemoved += x
            }
          }
        }
    ```
    
    Author: John Lee <jlee2@yahoo-inc.com>
    
    Closes apache#18874 from yoonlee95/SPARK-21656.
    
    (cherry picked from commit adf005d)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    John Lee authored and Tom Graves committed Aug 16, 2017
    Configuration menu
    Copy the full SHA
    f5ede0d View commit details
    Browse the repository at this point in the history
  3. [SPARK-18464][SQL][BACKPORT] support old table which doesn't store sc…

    …hema in table properties
    
    backport apache#18907 to branch 2.2
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#18963 from cloud-fan/backport.
    cloud-fan authored and gatorsmile committed Aug 16, 2017
    Configuration menu
    Copy the full SHA
    2a96975 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    851e162 View commit details
    Browse the repository at this point in the history

Commits on Aug 18, 2017

  1. [SPARK-21739][SQL] Cast expression should initialize timezoneId when …

    …it is called statically to convert something into TimestampType
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
    
    This issue is caused by introducing TimeZoneAwareExpression.
    When the **Cast** expression converts something into TimestampType, it should be resolved by setting `timezoneId`. In general, this happens in the LogicalPlan phase.
    
    However, there are still some places that use the Cast expression statically to convert data types without setting `timezoneId`. In such cases, `NoSuchElementException: None.get` will be thrown for TimestampType.
    
    This PR fixes the issue. We checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`).
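
    A hedged sketch of the shape of the fix, using the internal Catalyst `Cast` constructor; the helper and its caller-supplied time zone are assumptions, not the exact call sites changed by the PR:

    ```scala
    import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
    import org.apache.spark.sql.types.TimestampType

    // Illustrative only: when constructing a Cast to TimestampType outside the analyzer,
    // pass the time zone explicitly so resolution does not fail with None.get.
    def castToTimestamp(child: Expression, timeZoneId: String): Expression =
      Cast(child, TimestampType, Some(timeZoneId))
    ```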
    
    ## How was this patch tested?
    
    unit test
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes apache#18960 from DonnyZone/spark-21739.
    
    (cherry picked from commit 310454b)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    DonnyZone authored and gatorsmile committed Aug 18, 2017
    Configuration menu
    Copy the full SHA
    fdea642 View commit details
    Browse the repository at this point in the history

Commits on Aug 20, 2017

  1. [MINOR] Correct validateAndTransformSchema in GaussianMixture and AFT…

    …SurvivalRegression
    
    ## What changes were proposed in this pull request?
    
    The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema, hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol.
    
    ## How was this patch tested?
    
    Manually.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Cédric Pelvet <cedric.pelvet@gmail.com>
    
    Closes apache#18980 from sharp-pixel/master.
    
    (cherry picked from commit 73e04ec)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    sharp-pixel authored and srowen committed Aug 20, 2017
    Configuration menu
    Copy the full SHA
    6c2a38a View commit details
    Browse the repository at this point in the history
  2. [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when…

    … paths are successfully removed
    
    ## What changes were proposed in this pull request?
    
    Fix a typo in test.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#19005 from viirya/SPARK-21721-followup.
    
    (cherry picked from commit 28a6cca)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    viirya authored and cloud-fan committed Aug 20, 2017
    Configuration menu
    Copy the full SHA
    0f640e9 View commit details
    Browse the repository at this point in the history

Commits on Aug 21, 2017

  1. Configuration menu
    Copy the full SHA
    b8d83ee View commit details
    Browse the repository at this point in the history
  2. [SPARK-21617][SQL] Store correct table metadata when altering schema …

    …in Hive metastore.
    
    For Hive tables, the current "replace the schema" code is the correct
    path, except that an exception in that path should result in an error, and
    not in retrying in a different way.
    
    For data source tables, Spark may generate a non-compatible Hive table;
    but for that to work with Hive 2.1, the detection of data source tables needs
    to be fixed in the Hive client, to also consider the raw tables used by code
    such as `alterTableSchema`.
    
    Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#18849 from vanzin/SPARK-21617.
    
    (cherry picked from commit 84b5b16)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    Marcelo Vanzin authored and gatorsmile committed Aug 21, 2017
    Configuration menu
    Copy the full SHA
    526087f View commit details
    Browse the repository at this point in the history

Commits on Aug 23, 2017

  1. Configuration menu
    Copy the full SHA
    4876824 View commit details
    Browse the repository at this point in the history

Commits on Aug 24, 2017

  1. [SPARK-21805][SPARKR] Disable R vignettes code on Windows

    ## What changes were proposed in this pull request?
    
    Code in vignettes requires winutils to run on Windows. When publishing to CRAN or building from source, winutils might not be available, so it's better to disable running the code (the resulting vignettes will not have output from the code, but the text and code are still there).
    
    fix * checking re-building of vignette outputs ... WARNING
    and
    > %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.
    
    ## How was this patch tested?
    
    jenkins, appveyor, r-hub
    
    before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
    after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log
    
    Author: Felix Cheung <felixcheung_m@hotmail.com>
    
    Closes apache#19016 from felixcheung/rvigwind.
    
    (cherry picked from commit 43cbfad)
    Signed-off-by: Felix Cheung <felixcheung@apache.org>
    felixcheung authored and Felix Cheung committed Aug 24, 2017
    Configuration menu
    Copy the full SHA
    236b2f4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21826][SQL] outer broadcast hash join should not throw NPE

    This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 .
    
    The non-equi join condition should only be applied when the equi-join condition matches.
    
    regression test
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#19036 from cloud-fan/bug.
    
    (cherry picked from commit 2dd37d8)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    cloud-fan authored and hvanhovell committed Aug 24, 2017
    Configuration menu
    Copy the full SHA
    a585367 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureS…

    …td contains zero (backport PR for 2.2)
    
    ## What changes were proposed in this pull request?
    
    This is backport PR of apache#18896
    
    Fix a bug where multinomial logistic regression (MLOR) does not work correctly when featureStd contains zero.
    
    We can reproduce the bug with a dataset whose features include zero variance, which generates a wrong result (all coefficients become 0):
    ```
        val multinomialDatasetWithZeroVar = {
          val nPoints = 100
          val coefficients = Array(
            -0.57997, 0.912083, -0.371077,
            -0.16624, -0.84355, -0.048509)
    
          val xMean = Array(5.843, 3.0)
          val xVariance = Array(0.6856, 0.0)  // including zero variance
    
          val testData = generateMultinomialLogisticInput(
            coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
    
          val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
          df.cache()
          df
        }
    ```
    ## How was this patch tested?
    
    testcase added.
    
    Author: WeichenXu <WeichenXu123@outlook.com>
    
    Closes apache#19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.
    WeichenXu123 authored and jkbradley committed Aug 24, 2017
    Configuration menu
    Copy the full SHA
    2b4bd79 View commit details
    Browse the repository at this point in the history

Commits on Aug 25, 2017

  1. Configuration menu
    Copy the full SHA
    4e7d45e View commit details
    Browse the repository at this point in the history

Commits on Aug 28, 2017

  1. [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.vari…

    …ance generate negative result
    
    Because of numerical error, MultivariateOnlineSummarizer.variance can generate a negative variance.
    
    **This is a serious bug because many algorithms in MLlib use a stddev computed from `sqrt(variance)`; a negative variance yields NaN and crashes the whole algorithm.**
    
    we can reproduce this bug use the following code:
    ```
        val summarizer1 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.7)
        val summarizer2 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
        val summarizer3 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.5)
        val summarizer4 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
    
        val summarizer = summarizer1
          .merge(summarizer2)
          .merge(summarizer3)
          .merge(summarizer4)
    
        println(summarizer.variance(0))
    ```
    This PR fixes the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and in several places in `WeightedLeastSquares`.
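
    The usual guard for this kind of numerical issue is to clamp small negative results to zero before taking the square root; a hedged sketch, not necessarily the exact fix applied here:

    ```scala
    // Illustrative only: floating-point cancellation can push a mathematically
    // non-negative variance slightly below zero, so clamp before taking sqrt.
    def safeStdDev(rawVariance: Double): Double = math.sqrt(math.max(rawVariance, 0.0))
    ```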
    
    test cases added.
    
    Author: WeichenXu <WeichenXu123@outlook.com>
    
    Closes apache#19029 from WeichenXu123/fix_summarizer_var_bug.
    
    (cherry picked from commit 0456b40)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    WeichenXu123 authored and srowen committed Aug 28, 2017
    Configuration menu
    Copy the full SHA
    0d4ef2f View commit details
    Browse the repository at this point in the history
  2. [SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config …

    …for launching daemons like History Server
    
    The History Server launch uses SparkClassCommandBuilder to launch the server. It is observed that SPARK_CLASSPATH has been removed and deprecated. For spark-submit this takes a different route, and spark.driver.extraClassPath takes care of specifying additional jars on the classpath that were previously specified via SPARK_CLASSPATH. Right now the only way to specify additional jars for launching daemons such as the history server is SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html), but that I presume is a distribution classpath. It would be nice to have a config similar to spark.driver.extraClassPath for launching daemons like the history server.
    
    Added new environment variable SPARK_DAEMON_CLASSPATH to set classpath for launching daemons. Tested and verified for History Server and Standalone Mode.
    
    ## How was this patch tested?
    Initially, history server start script would fail for the reason being that it could not find the required jars for launching the server in the java classpath. Same was true for running Master and Worker in standalone mode. By adding the environment variable SPARK_DAEMON_CLASSPATH to the java classpath, both the daemons(History Server, Standalone daemons) are starting up and running.
    
    Author: pgandhi <pgandhi@yahoo-inc.com>
    Author: pgandhi999 <parthkgandhi9@gmail.com>
    
    Closes apache#19047 from pgandhi999/master.
    
    (cherry picked from commit 24e6c18)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
    pgandhi authored and Tom Graves committed Aug 28, 2017
    Configuration menu
    Copy the full SHA
    59bb7eb View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    24baf03 View commit details
    Browse the repository at this point in the history

Commits on Aug 29, 2017

  1. [SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resour…

    …ces in yarn client mode
    
    ## What changes were proposed in this pull request?
    
    This is a backport PR to fix issue of re-uploading remote resource in yarn client mode. The original PR is apache#18962.
    
    ## How was this patch tested?
    
    Tested in local UT.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes apache#19074 from jerryshao/SPARK-21714-2.2-backport.
    jerryshao authored and Marcelo Vanzin committed Aug 29, 2017
    Configuration menu
    Copy the full SHA
    59529b2 View commit details
    Browse the repository at this point in the history
  2. Revert "[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remot…

    …e resources in yarn client mode"
    
    This reverts commit 59529b2.
    Marcelo Vanzin committed Aug 29, 2017
    Configuration menu
    Copy the full SHA
    917fe66 View commit details
    Browse the repository at this point in the history

Commits on Aug 30, 2017

  1. [SPARK-21254][WEBUI] History UI performance fixes

    ## This is a backport of PR apache#18783 to the latest released branch 2.2.
    
    ## What changes were proposed in this pull request?
    
    As described in the JIRA ticket, the History page takes ~1 min to load when the number of jobs is 10k+.
    Most of the time is currently spent on DOM manipulations and the additional costs they imply (browser repaints and reflows).
    The PR's goal is not to change any behavior but to optimize History UI rendering time:
    
    1. The most costly operation is setting `innerHTML` for the `duration` column within a loop, which is [extremely unperformant](https://jsperf.com/jquery-append-vs-html-list-performance/24). [Refactoring](criteo-forks@b7e56ee) this helped get page load time **down to 10-15s**.
    
    2. The second big gain, bringing page load time **down to 4s**, [was achieved](criteo-forks@3630ca2) by detaching the table's DOM before parsing it with the DataTables jQuery plugin.
    
    3. Another chunk of improvements ([1](criteo-forks@aeeeeb5), [2](criteo-forks@e25be9a), [3](criteo-forks@9169707)) focused on removing unnecessary DOM manipulations that in total contributed ~250ms to page load time.
    
    ## How was this patch tested?
    
    Tested by existing Selenium tests in `org.apache.spark.deploy.history.HistoryServerSuite`.
    
    Changes were also tested on Criteo's spark-2.1 fork with 20k+ number of rows in the table, reducing load time to 4s.
    
    Author: Dmitry Parfenchik <d.parfenchik@criteo.com>
    
    Closes apache#18860 from 2ooom/history-ui-perf-fix-2.2.
    2ooom authored and srowen committed Aug 30, 2017
    Configuration menu
    Copy the full SHA
    a6a9944 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    952c577 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resour…

    …ces in yarn client mode
    
    ## What changes were proposed in this pull request?
    
    This is a backport PR to fix issue of re-uploading remote resource in yarn client mode. The original PR is apache#18962.
    
    ## How was this patch tested?
    
    Tested in local UT.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes apache#19074 from jerryshao/SPARK-21714-2.2-backport.
    jerryshao authored and Marcelo Vanzin committed Aug 30, 2017
    Configuration menu
    Copy the full SHA
    d10c9dc View commit details
    Browse the repository at this point in the history
  4. [SPARK-21834] Incorrect executor request in case of dynamic allocation

    ## What changes were proposed in this pull request?
    
    The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself.
    
    ## How was this patch tested?
    
    Ran a job on the cluster and made sure the executor request is correct
    
    Author: Sital Kedia <skedia@fb.com>
    
    Closes apache#19081 from sitalkedia/skedia/oss_fix_executor_allocation.
    
    (cherry picked from commit 6949a9c)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Sital Kedia authored and Marcelo Vanzin committed Aug 30, 2017
    Configuration menu
    Copy the full SHA
    14054ff View commit details
    Browse the repository at this point in the history

Commits on Aug 31, 2017

  1. Configuration menu
    Copy the full SHA
    c412c77 View commit details
    Browse the repository at this point in the history

Commits on Sep 1, 2017

  1. [SPARK-21884][SPARK-21477][BACKPORT-2.2][SQL] Mark LocalTableScanExec…

    …'s input data transient
    
    This PR is to backport apache#18686 for resolving the issue in apache#19094
    
    ---
    
    ## What changes were proposed in this pull request?
    This PR is to mark the parameter `rows` and `unsafeRow` of LocalTableScanExec transient. It can avoid serializing the unneeded objects.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#19101 from gatorsmile/backport-21477.
    gatorsmile committed Sep 1, 2017
    Configuration menu
    Copy the full SHA
    50f86e1 View commit details
    Browse the repository at this point in the history

Commits on Sep 4, 2017

  1. [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScan…

    …Exec with sun.io.serialization.extendedDebugInfo=true
    
    ## What changes were proposed in this pull request?
    
    If no SparkConf is available to Utils.redact, simply don't redact.
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#19123 from srowen/SPARK-21418.
    
    (cherry picked from commit ca59445)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    srowen authored and hvanhovell committed Sep 4, 2017
    Configuration menu
    Copy the full SHA
    fb1b5f0 View commit details
    Browse the repository at this point in the history

Commits on Sep 5, 2017

  1. Configuration menu
    Copy the full SHA
    d0df025 View commit details
    Browse the repository at this point in the history
  2. [SPARK-21925] Update trigger interval documentation in docs with beha…

    …vior change in Spark 2.2
    
    Forgot to update docs with behavior change.
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes apache#19138 from brkyvz/trigger-doc-fix.
    
    (cherry picked from commit 8c954d2)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    brkyvz authored and tdas committed Sep 5, 2017
    Configuration menu
    Copy the full SHA
    1f7c486 View commit details
    Browse the repository at this point in the history
  3. [MINOR][DOC] Update Partition Discovery section to enumerate all av…

    …ailable file sources
    
    ## What changes were proposed in this pull request?
    
    All built-in data sources support `Partition Discovery`. We should update the document to state this clearly so users can benefit from it.
    
    **AFTER**
    
    <img width="906" alt="1" src="https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png">
    
    ## How was this patch tested?
    
    ```
    SKIP_API=1 jekyll serve --watch
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes apache#19139 from dongjoon-hyun/partitiondiscovery.
    
    (cherry picked from commit 9e451bc)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    dongjoon-hyun authored and gatorsmile committed Sep 5, 2017
    Configuration menu
    Copy the full SHA
    7da8fbf View commit details
    Browse the repository at this point in the history

Commits on Sep 6, 2017

  1. [SPARK-21924][DOCS] Update structured streaming programming guide doc

    ## What changes were proposed in this pull request?
    
    Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." to "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." in the Structured Streaming programming guide.
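
    A quick, illustrative check of the window arithmetic behind the corrected sentence (not part of the doc change itself):

    ```scala
    // With 10-minute windows sliding every 5 minutes, an event at 12:09 belongs to the
    // windows starting at 12:00 and 12:05 (not 12:10), matching the corrected text.
    val eventMin = 12 * 60 + 9                                     // 12:09 in minutes since midnight
    val candidateStarts = Seq(12 * 60, 12 * 60 + 5, 12 * 60 + 10)  // 12:00, 12:05, 12:10
    val containing = candidateStarts.filter(s => s <= eventMin && eventMin < s + 10)
    // containing == Seq(720, 725), i.e. the 12:00 - 12:10 and 12:05 - 12:15 windows
    ```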
    
    Author: Riccardo Corbella <r.corbella@reply.it>
    
    Closes apache#19137 from riccardocorbella/bugfix.
    
    (cherry picked from commit 4ee7dfe)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Riccardo Corbella authored and srowen committed Sep 6, 2017
    Configuration menu
    Copy the full SHA
    9afab9a View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    a7d0b0a View commit details
    Browse the repository at this point in the history
  3. [SPARK-21901][SS] Define toString for StateOperatorProgress

    ## What changes were proposed in this pull request?
    
    Just `StateOperatorProgress.toString` + few formatting fixes
    
    ## How was this patch tested?
    
    Local build. Waiting for OK from Jenkins.
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes apache#19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
    
    (cherry picked from commit fa0092b)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    jaceklaskowski authored and zsxwing committed Sep 6, 2017
    Configuration menu
    Copy the full SHA
    342cc2a View commit details
    Browse the repository at this point in the history

Commits on Sep 7, 2017

  1. Fixed pandoc dependency issue in python/setup.py

    ## Problem Description
    
    When pyspark is listed as a dependency of another package, installing
    the other package will cause an install failure in pyspark. When the
    other package is being installed, pyspark's setup_requires requirements
    are installed including pypandoc. Thus, the exception handling on
    setup.py:152 does not work because the pypandoc module is indeed
    available. However, the pypandoc.convert() function fails if pandoc
    itself is not installed (in our use cases it is not). This raises an
    OSError that is not handled, and setup fails.
    
    The following is a sample failure:
    ```
    $ which pandoc
    $ pip freeze | grep pypandoc
    pypandoc==1.4
    $ pip install pyspark
    Collecting pyspark
      Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
        100% |████████████████████████████████| 188.3MB 16.8MB/s
        Complete output from command python setup.py egg_info:
        Maybe try:
    
            sudo apt-get install pandoc
        See http://johnmacfarlane.net/pandoc/installing.html
        for installation options
        ---------------------------------------------------------------
    
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module>
            long_description = pypandoc.convert('README.md', 'rst')
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert
            outputfile=outputfile, filters=filters)
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input
            _ensure_pandoc_path()
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path
            raise OSError("No pandoc was found: either install pandoc and add it\n"
        OSError: No pandoc was found: either install pandoc and add it
        to your PATH or or call pypandoc.download_pandoc(...) or
        install pypandoc wheels with included pandoc.
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/
    ```
    
    ## What changes were proposed in this pull request?
    
    This change simply adds an additional exception handler for the OSError
    that is raised. This allows pyspark to be installed client-side without requiring pandoc to be installed.
    
    ## How was this patch tested?
    
    I tested this by building a wheel package of pyspark with the change applied. Then, in a clean virtual environment with pypandoc installed but pandoc not available on the system, I installed pyspark from the wheel.
    
    Here is the output
    
    ```
    $ pip freeze | grep pypandoc
    pypandoc==1.4
    $ which pandoc
    $ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
    Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
    Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0)
    Installing collected packages: pyspark
    Successfully installed pyspark-2.3.0.dev0
    ```
    
    Author: Tucker Beck <tucker.beck@rentrakmail.com>
    
    Closes apache#18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.
    
    (cherry picked from commit aad2125)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    Tucker Beck authored and HyukjinKwon committed Sep 7, 2017
    Configuration menu
    Copy the full SHA
    49968de View commit details
    Browse the repository at this point in the history
  2. [SPARK-21890] Credentials not being passed to add the tokens

    ## What changes were proposed in this pull request?
    I observed this while running an Oozie job trying to connect to HBase via Spark.
    It looks like the creds are not being passed in at https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for the 2.2 release.
    More info as to why it fails on a secure grid:
    The Oozie client gets the necessary tokens the application needs before launching. It passes those tokens along to the Oozie launcher job (an MR job), which then actually calls the Spark client to launch the Spark app and passes the tokens along.
    The Oozie launcher job cannot get any more tokens because all it has is tokens (you can't get tokens with tokens; you need a TGT or keytab).
    The error here is because the launcher job runs the Spark client to submit the Spark job, but the Spark client doesn't see that it already has the HDFS tokens, so it tries to get more, which ends with the exception.
    SPARK-19021 generalized the HDFS credentials provider and changed it so we don't pass the existing credentials into the call to get tokens, so it doesn't realize it already has the necessary tokens.
    
    https://issues.apache.org/jira/browse/SPARK-21890
    Modified to pass creds to get delegation tokens
    
    ## How was this patch tested?
    Manual testing on our secure cluster
    
    Author: Sanket Chintapalli <schintap@yahoo-inc.com>
    
    Closes apache#19103 from redsanket/SPARK-21890.
    Sanket Chintapalli authored and Marcelo Vanzin committed Sep 7, 2017
    Configuration menu
    Copy the full SHA
    0848df1 View commit details
    Browse the repository at this point in the history

Commits on Sep 8, 2017

  1. [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should s…

    …top SparkContext.
    
    ## What changes were proposed in this pull request?
    
    `pyspark.sql.tests.SQLTests2` doesn't stop the newly created SparkContext in the test, which might affect the following tests.
    This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes apache#19158 from ueshin/issues/SPARK-21950.
    
    (cherry picked from commit 57bc1e9)
    Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
    ueshin committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    4304d0b View commit details
    Browse the repository at this point in the history
  2. [SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing

    dongjoon-hyun HyukjinKwon
    
    Error in PySpark example code:
    /examples/src/main/python/ml/estimator_transformer_param_example.py
    
    The original Scala code says
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)
    
    The parent is lr
    
    There is no method for accessing parent as is done in Scala.
    
    This code has been tested in Python, and returns values consistent with Scala
    
    ## What changes were proposed in this pull request?
    
    Proposing to call the lr variable instead of model1 or model2
    
    ## How was this patch tested?
    
    This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines.
    
    The output for model2 in PySpark should be
    
    {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True}
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: MarkTab marktab.net <marktab@users.noreply.github.com>
    
    Closes apache#19152 from marktab/branch-2.2.
    marktab authored and srowen committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    781a1f8 View commit details
    Browse the repository at this point in the history
  3. [SPARK-21936][SQL][2.2] backward compatibility test framework for Hiv…

    …eExternalCatalog
    
    backport apache#19148 to 2.2
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#19163 from cloud-fan/test.
    cloud-fan authored and gatorsmile committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    08cb06a View commit details
    Browse the repository at this point in the history
  4. [SPARK-21946][TEST] fix flaky test: "alter table: rename cached table…

    …" in InMemoryCatalogedDDLSuite
    
    ## What changes were proposed in this pull request?
    
    This PR fixes flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`.
    Since this test validates a distributed DataFrame, the result should be checked by using `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result.
    
    ## How was this patch tested?
    
    Use existing test case
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes apache#19159 from kiszk/SPARK-21946.
    
    (cherry picked from commit 8a4f228)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    kiszk authored and gatorsmile committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    9ae7c96 View commit details
    Browse the repository at this point in the history
  5. [SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "met…

    …astore_db" before listing files in R tests
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of the R tests passes fine. This is sometimes a bit annoying.
    
    ## How was this patch tested?
    
    Manually running multiple times R tests via `./R/run-tests.sh`.
    
    **Before**
    
    Second run:
    
    ```
    SparkSQL functions: Spark package found in SPARK_HOME: .../spark
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ....................................................................................................1234.......................
    
    Failed -------------------------------------------------------------------------
    1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
    length(list1) not equal to length(list2).
    1/1 mismatches
    [1] 25 - 23 == 2
    
    2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
    sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
    10/25 mismatches
    x[16]: "metastore_db"
    y[16]: "pkg"
    
    x[17]: "pkg"
    y[17]: "R"
    
    x[18]: "R"
    y[18]: "README.md"
    
    x[19]: "README.md"
    y[19]: "run-tests.sh"
    
    x[20]: "run-tests.sh"
    y[20]: "SparkR_2.2.0.tar.gz"
    
    x[21]: "metastore_db"
    y[21]: "pkg"
    
    x[22]: "pkg"
    y[22]: "R"
    
    x[23]: "R"
    y[23]: "README.md"
    
    x[24]: "README.md"
    y[24]: "run-tests.sh"
    
    x[25]: "run-tests.sh"
    y[25]: "SparkR_2.2.0.tar.gz"
    
    3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
    length(list1) not equal to length(list2).
    1/1 mismatches
    [1] 25 - 23 == 2
    
    4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
    sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
    10/25 mismatches
    x[16]: "metastore_db"
    y[16]: "pkg"
    
    x[17]: "pkg"
    y[17]: "R"
    
    x[18]: "R"
    y[18]: "README.md"
    
    x[19]: "README.md"
    y[19]: "run-tests.sh"
    
    x[20]: "run-tests.sh"
    y[20]: "SparkR_2.2.0.tar.gz"
    
    x[21]: "metastore_db"
    y[21]: "pkg"
    
    x[22]: "pkg"
    y[22]: "R"
    
    x[23]: "R"
    y[23]: "README.md"
    
    x[24]: "README.md"
    y[24]: "run-tests.sh"
    
    x[25]: "run-tests.sh"
    y[25]: "SparkR_2.2.0.tar.gz"
    
    DONE ===========================================================================
    ```
    
    **After**
    
    Second run:
    
    ```
    SparkSQL functions: Spark package found in SPARK_HOME: .../spark
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................
    ```
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#18335 from HyukjinKwon/SPARK-21128.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#19166 from felixcheung/rbackport21128.
    HyukjinKwon authored and Felix Cheung committed Sep 8, 2017
    Configuration menu
    Copy the full SHA
    9876821 View commit details
    Browse the repository at this point in the history

Commits on Sep 9, 2017

  1. [SPARK-21954][SQL] JacksonUtils should verify MapType's value type in…

    …stead of key type
    
    ## What changes were proposed in this pull request?
    
    `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.
    
    Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
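
    A hedged, simplified sketch of the verification shape described above (the real `JacksonUtils.verifySchema` handles more cases):

    ```scala
    import org.apache.spark.sql.types._

    // Illustrative only: recurse into value types for maps, since keys are
    // rendered via toString and never need a JSON writer of their own.
    def verify(dataType: DataType): Unit = dataType match {
      case MapType(_, valueType, _)  => verify(valueType)
      case ArrayType(elementType, _) => verify(elementType)
      case StructType(fields)        => fields.foreach(f => verify(f.dataType))
      case _                         => // leaf types are assumed writable in this sketch
    }
    ```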
    
    ## How was this patch tested?
    
    Added tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#19167 from viirya/test-jacksonutils.
    
    (cherry picked from commit 6b45d7e)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    viirya authored and HyukjinKwon committed Sep 9, 2017
    Configuration menu
    Copy the full SHA
    182478e View commit details
    Browse the repository at this point in the history

Commits on Sep 10, 2017

  1. [SPARK-20098][PYSPARK] dataType's typeName fix

    ## What changes were proposed in this pull request?
    The `typeName` classmethod has been fixed by using a type -> typeName map.
    
    ## How was this patch tested?
    local build
    
    Author: Peter Szalai <szalaipeti.vagyok@gmail.com>
    
    Closes apache#17435 from szalai1/datatype-gettype-fix.
    
    (cherry picked from commit 520d92a)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    szalai1 authored and HyukjinKwon committed Sep 10, 2017
    Configuration menu
    Copy the full SHA
    b1b5a7f View commit details
    Browse the repository at this point in the history

Commits on Sep 12, 2017

  1. [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error.

    ## What changes were proposed in this pull request?
    
    Fixed wrong documentation for Mean Absolute Error.
    
    Even though the code is correct for the MAE:
    
    ```scala
    Since("1.2.0")
      def meanAbsoluteError: Double = {
        summary.normL1(1) / summary.count
      }
    ```
    In the documentation the division by N is missing.
    
    ## How was this patch tested?
    
    All of spark tests were run.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: FavioVazquez <favio.vazquezp@gmail.com>
    Author: faviovazquez <favio.vazquezp@gmail.com>
    Author: Favio André Vázquez <favio.vazquezp@gmail.com>
    
    Closes apache#19190 from FavioVazquez/mae-fix.
    
    (cherry picked from commit e2ac2f1)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    FavioVazquez authored and srowen committed Sep 12, 2017
    Configuration menu
    Copy the full SHA
    10c6836 View commit details
    Browse the repository at this point in the history
  2. [DOCS] Fix unreachable links in the document

    ## What changes were proposed in this pull request?
    
    Recently, I found two unreachable links in the document and fixed them.
    Because these are small changes related to the documentation, I didn't file a JIRA issue, but please let me know if you think I should.
    
    ## How was this patch tested?
    
    Tested manually.
    
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    
    Closes apache#19195 from sarutak/fix-unreachable-link.
    
    (cherry picked from commit 9575582)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    sarutak authored and srowen committed Sep 12, 2017
    Configuration menu
    Copy the full SHA
    63098dc View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    c66ddce View commit details
    Browse the repository at this point in the history
  4. [SPARK-18608][ML] Fix double caching

    ## What changes were proposed in this pull request?
    `df.rdd.getStorageLevel` => `df.storageLevel`
    
    using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.
    
    Previous discussion in other PRs: apache#19107, apache#17014
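
    A hedged sketch of the pattern change; `dataset` and the helper name are illustrative, and the fix applies the same idea across the affected algorithms:

    ```scala
    import org.apache.spark.sql.Dataset
    import org.apache.spark.storage.StorageLevel

    // Illustrative only: decide whether to persist based on the Dataset's own
    // storage level rather than its underlying RDD's, avoiding a second cache.
    def shouldHandlePersistence(dataset: Dataset[_]): Boolean =
      dataset.storageLevel == StorageLevel.NONE   // previously: dataset.rdd.getStorageLevel
    ```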
    
    ## How was this patch tested?
    existing tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes apache#19197 from zhengruifeng/double_caching.
    
    (cherry picked from commit c5f9b89)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
    zhengruifeng authored and jkbradley committed Sep 12, 2017
    Configuration menu
    Copy the full SHA
    b606dc1 View commit details
    Browse the repository at this point in the history
  5. parquet versioning

    markhamstra committed Sep 12, 2017
    Configuration menu
    Copy the full SHA
    30e7298 View commit details
    Browse the repository at this point in the history
  6. style fix

    markhamstra committed Sep 12, 2017
    Configuration menu
    Copy the full SHA
    7966c84 View commit details
    Browse the repository at this point in the history

Commits on Sep 13, 2017

  1. [SPARK-21980][SQL] References in grouping functions should be indexed…

    … with semanticEquals
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-21980
    
    This PR fixes the issue in the ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering the case-sensitivity configuration.
    
    The problem can be reproduced by:
    
    `val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")`
    `df.cube("a").agg(grouping("A")).show()`
    
    ## How was this patch tested?
    unit tests
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes apache#19202 from DonnyZone/ResolveGroupingAnalytics.
    
    (cherry picked from commit 21c4450)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    DonnyZone authored and gatorsmile committed Sep 13, 2017
    Configuration menu
    Copy the full SHA
    3a692e3 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    0e8f032 View commit details
    Browse the repository at this point in the history

Commits on Sep 14, 2017

  1. [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.

    ## What changes were proposed in this pull request?
    apache#19197 fixed double caching for MLlib algorithms but missed PySpark `OneVsRest`; this PR fixes it.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#19220 from yanboliang/SPARK-18608.
    
    (cherry picked from commit c76153c)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    yanboliang committed Sep 14, 2017
    Configuration menu
    Copy the full SHA
    51e5a82 View commit details
    Browse the repository at this point in the history

Commits on Sep 17, 2017

  1. [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

    ## What changes were proposed in this pull request?
    (edited)
    Fixes a bug introduced in apache#16121
    
    In PairDeserializer, convert each batch of keys and values to lists (if they do not already have `__len__`) so that we can check that they are the same size. Normally they are already lists, so this should not have a performance impact, but it is needed when repeated `zip`s are done.
    
    ## How was this patch tested?
    
    Additional unit test
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes apache#19226 from aray/SPARK-21985.
    
    (cherry picked from commit 6adf67d)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    aray authored and HyukjinKwon committed Sep 17, 2017
    Configuration menu
    Copy the full SHA
    42852bb View commit details
    Browse the repository at this point in the history

Commits on Sep 18, 2017

  1. [SPARK-21953] Show both memory and disk bytes spilled if either is pr…

    …esent
    
    As written now, both memory and disk bytes spilled must be present to show either of them. If only one of those types of spill is recorded, it will be hidden.
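
    A hedged sketch of the display condition change; only the boolean matters, and the surrounding UI code is omitted:

    ```scala
    // Illustrative only: show the spill metrics if either value is non-zero,
    // instead of requiring both to be non-zero.
    def shouldShowSpill(memoryBytesSpilled: Long, diskBytesSpilled: Long): Boolean =
      memoryBytesSpilled > 0 || diskBytesSpilled > 0   // previously an &&
    ```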
    
    Author: Andrew Ash <andrew@andrewash.com>
    
    Closes apache#19164 from ash211/patch-3.
    
    (cherry picked from commit 6308c65)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ash211 authored and cloud-fan committed Sep 18, 2017
    Configuration menu
    Copy the full SHA
    309c401 View commit details
    Browse the repository at this point in the history
  2. [SPARK-22043][PYTHON] Improves error message for show_profiles and du…

    …mp_profiles
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to improve error message from:
    
    ```
    >>> sc.show_profiles()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
        self.profiler_collector.show_profiles()
    AttributeError: 'NoneType' object has no attribute 'show_profiles'
    >>> sc.dump_profiles("/tmp/abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
        self.profiler_collector.dump_profiles(path)
    AttributeError: 'NoneType' object has no attribute 'dump_profiles'
    ```
    
    to
    
    ```
    >>> sc.show_profiles()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
        raise RuntimeError("'spark.python.profile' configuration must be set "
    RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
    >>> sc.dump_profiles("/tmp/abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
        raise RuntimeError("'spark.python.profile' configuration must be set "
    RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
    ```
    
    ## How was this patch tested?
    
    Unit tests added in `python/pyspark/tests.py` and manual tests.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#19260 from HyukjinKwon/profile-errors.
    
    (cherry picked from commit 7c72662)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    HyukjinKwon committed Sep 18, 2017
    Configuration menu
    Copy the full SHA
    a86831d View commit details
    Browse the repository at this point in the history
  3. [SPARK-22047][TEST] ignore HiveExternalCatalogVersionsSuite

    ## What changes were proposed in this pull request?
    
    As reported in https://issues.apache.org/jira/browse/SPARK-22047, HiveExternalCatalogVersionsSuite is failing frequently. Let's disable this test suite to unblock other PRs; I'm looking into the root cause.
    
    ## How was this patch tested?
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#19264 from cloud-fan/test.
    
    (cherry picked from commit 894a756)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Sep 18, 2017
    48d6aef
  4. 504732d
  5. Parquet versioning

    markhamstra committed Sep 18, 2017
    dfbc6a5
  6. d0f83de

Commits on Sep 19, 2017

  1. [SPARK-22047][FLAKY TEST] HiveExternalCatalogVersionsSuite

    ## What changes were proposed in this pull request?
    
    This PR tries to download Spark for each test run, to make sure each test run is absolutely isolated.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#19265 from cloud-fan/test.
    
    (cherry picked from commit 10f45b3)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Sep 19, 2017
    d0234eb
  2. [SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala

    The current implementation for processingRate-total uses the wrong metric:
    it mistakenly reads inputRowsPerSecond instead of processedRowsPerSecond.
    
    ## What changes were proposed in this pull request?
    Adjust processingRate-total from using inputRowsPerSecond to processedRowsPerSecond
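
    A small, self-contained Scala sketch of the intended wiring (the registry and snapshot types are stand-ins invented for this example, not the real MetricsReporter API):

    ```
    // Minimal stand-in for a metrics registry.
    class SimpleRegistry {
      private val gauges = scala.collection.mutable.Map[String, () => Double]()
      def registerGauge(name: String)(value: => Double): Unit = {
        gauges(name) = () => value
      }
      def read(name: String): Double = gauges(name)()
    }

    // Illustrative progress snapshot carrying the two rates involved.
    case class ProgressSnapshot(inputRowsPerSecond: Double, processedRowsPerSecond: Double)

    def wireGauges(registry: SimpleRegistry, latest: () => ProgressSnapshot): Unit = {
      registry.registerGauge("inputRate-total")(latest().inputRowsPerSecond)
      // The bug: this gauge previously read inputRowsPerSecond as well.
      registry.registerGauge("processingRate-total")(latest().processedRowsPerSecond)
    }
    ```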
    
    ## How was this patch tested?
    
    Built Spark from source with the proposed change and verified the output. Before the change, the CSV metrics files for inputRate-total and processingRate-total displayed the same values due to the error; after changing MetricsReporter.scala, the processingRate-total CSV file displayed the correct metric.
    ![processed rows per second](https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Taaffy <32072374+Taaffy@users.noreply.github.com>
    
    Closes apache#19268 from Taaffy/patch-1.
    
    (cherry picked from commit 1bc17a6)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
    Taaffy authored and srowen committed Sep 19, 2017
    6764408

Commits on Sep 20, 2017

  1. [SPARK-22076][SQL] Expand.projections should not be a Stream

    ## What changes were proposed in this pull request?
    
    Spark with Scala 2.10 fails with a group by cube:
    ```
    spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
    spark.sql("select 1 from rollup_bug group by rollup ()").show
    ```
    
    It can be traced back to apache#15484, which made `Expand.projections` a lazy `Stream` for group by cube.
    
    In Scala 2.10, `Stream` captures a lot of enclosing state, and in this case it captures the entire query plan, which has some unserializable parts.
    
    This change is also good for master branch, to reduce the serialized size of `Expand.projections`.
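
    A standalone Scala illustration of why a lazy `Stream` is a problem here: a `Stream` built with `map` keeps a reference to the mapping closure (and anything it captures) until every element is forced, while converting to a strict collection does not. The `BigContext` class below is just a stand-in for the captured query plan:

    ```
    class BigContext  // stand-in for a large, partly unserializable query plan

    def buildProjections(ctx: BigContext): Seq[Int] = {
      // Lazy: the unforced tail keeps a reference to the closure over `ctx`.
      val lazyProjections = Stream.from(0).take(3).map(i => i + ctx.hashCode())
      // Forcing into a strict collection evaluates everything and drops the
      // closure, which is essentially what the fix does for Expand.projections.
      lazyProjections.toList
    }
    ```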
    
    ## How was this patch tested?
    
    Manually verified with Spark built with Scala 2.10.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#19289 from cloud-fan/bug.
    
    (cherry picked from commit ce6a71e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and gatorsmile committed Sep 20, 2017
    5d10586
  2. [SPARK-21384][YARN] Spark + YARN fails with LocalFileSystem as default FS
    
    ## What changes were proposed in this pull request?
    
    When the libraries temp directory (i.e. the __spark_libs__*.zip dir) and the staging (destination) directory are on the same file system, __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization.
    
    With this change, the client always copies the files to the remote file system when the source scheme is "file".
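
    An illustrative Scala version of the decision, written against plain `java.net.URI` rather than the actual Hadoop `FileSystem` calls in Client.scala:

    ```
    import java.net.URI

    def mustCopyToRemote(source: URI, destination: URI): Boolean = {
      val sameFileSystem =
        source.getScheme == destination.getScheme &&
          source.getAuthority == destination.getAuthority
      // Always upload local files, even when source and destination "match",
      // so the zip survives the client-side temp directory being cleaned up.
      source.getScheme == "file" || !sameFileSystem
    }

    // The SPARK-21384 scenario, with LocalFileSystem as the default FS:
    // mustCopyToRemote(new URI("file:/tmp/__spark_libs__.zip"), new URI("file:/user/stage"))
    // now returns true, so the archive is still uploaded.
    ```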
    
    ## How was this patch tested?
    
    I have verified it manually in yarn/cluster and yarn/client modes with hdfs and local file systems.
    
    Author: Devaraj K <devaraj@apache.org>
    
    Closes apache#19141 from devaraj-kavali/SPARK-21384.
    
    (cherry picked from commit 55d5fa7)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    Devaraj K authored and Marcelo Vanzin committed Sep 20, 2017
    401ac20

Commits on Sep 21, 2017

  1. [SPARK-21928][CORE] Set classloader on SerializerManager's private kryo

    ## What changes were proposed in this pull request?
    
    We have to make sure that SerializerManager's private instance of
    Kryo also uses the right classloader, regardless of the current thread's
    classloader. In particular, this fixes serde during remote cache
    fetches, as those occur in Netty threads.
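
    The essence of the fix, sketched in Scala with a generic stand-in rather than the real SerializerManager/Kryo wiring: resolve classes against an explicitly supplied loader instead of whatever loader the calling thread happens to carry:

    ```
    // Pin class resolution to the application class loader so that remote-fetch
    // (Netty) threads can still load user-defined classes.
    class PinnedResolver(appClassLoader: ClassLoader) {
      def resolve(className: String): Class[_] =
        // Not Thread.currentThread().getContextClassLoader, which may be wrong
        // on Netty threads.
        Class.forName(className, false, appClassLoader)
    }
    ```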
    
    ## How was this patch tested?
    
    Manual tests & existing suite via Jenkins. I haven't been able to reproduce this in a unit test, because when a remote RDD partition cannot be fetched, there is a warning message and then the partition is just recomputed locally. I manually verified the warning message is no longer present.
    
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes apache#19280 from squito/SPARK-21928_ser_classloader.
    
    (cherry picked from commit b75bd17)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    squito authored and Marcelo Vanzin committed Sep 21, 2017
    765fd92

Commits on Sep 22, 2017

  1. [SPARK-22094][SS] processAllAvailable should check the query state

    `processAllAvailable` should also check the query state; if the query has been stopped, it should return.
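
    A simplified Scala sketch of the guard (the `QueryLike` trait is an illustration, not the real StreamingQuery internals):

    ```
    trait QueryLike {
      def isActive: Boolean    // false once the query has been stopped
      def hasNewData: Boolean  // true while unprocessed input remains
    }

    def processAllAvailable(query: QueryLike, pollMs: Long = 10): Unit = {
      // Previously the loop only watched for new data, so a stopped query
      // could leave the caller waiting forever.
      while (query.isActive && query.hasNewData) {
        Thread.sleep(pollMs)
      }
    }
    ```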
    
    The new unit test.
    
    Author: Shixiong Zhu <zsxwing@gmail.com>
    
    Closes apache#19314 from zsxwing/SPARK-22094.
    
    (cherry picked from commit fedf696)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    zsxwing committed Sep 22, 2017
    090b987
  2. [SPARK-22072][SPARK-22071][BUILD] Improve release build scripts

    ## What changes were proposed in this pull request?
    
    Check JDK version (with javac) and use SPARK_VERSION for publish-release
    
    ## How was this patch tested?
    
    Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled)
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#19312 from holdenk/improve-release-scripts-r2.
    
    (cherry picked from commit 8f130ad)
    Signed-off-by: Holden Karau <holden@us.ibm.com>
    holdenk committed Sep 22, 2017
    de6274a

Commits on Sep 23, 2017

  1. [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

    ## What changes were proposed in this pull request?
    
    Fix the setup of `SPARK_JARS_DIR` on Windows: it looks for the `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should, but the RELEASE file is not included in the `pip` build of PySpark.
    
    ## How was this patch tested?
    
    Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).
    
    Author: Jakub Nowacki <j.s.nowacki@gmail.com>
    
    Closes apache#19310 from jsnowacki/master.
    
    (cherry picked from commit c11f24a)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    jsnowacki authored and HyukjinKwon committed Sep 23, 2017
    c0a34a9
  2. [SPARK-22092] Reallocation in OffHeapColumnVector.reserveInternal corrupts struct and array data
    
    `OffHeapColumnVector.reserveInternal()` only copies already-inserted values during reallocation if `data != null`. In vectors containing arrays or structs this is incorrect, since the field `data` is not used there at all. We need to check `nulls` instead.
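
    A compact Scala illustration of the changed guard (the `Buffers` type is a stand-in for the vector's off-heap pointers):

    ```
    // In struct/array vectors only the null bitmap is allocated; `data` stays empty.
    case class Buffers(nulls: Option[Array[Byte]], data: Option[Array[Byte]])

    // Old guard: copy existing entries only when the value buffer exists,
    // which silently skipped the copy for struct and array vectors.
    def mustCopyOld(b: Buffers): Boolean = b.data.isDefined

    // New guard: the null bitmap exists for every vector type, so it is the
    // reliable signal that values have already been written.
    def mustCopyNew(b: Buffers): Boolean = b.nulls.isDefined
    ```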
    
    Adds new tests to `ColumnVectorSuite` that reproduce the errors.
    
    Author: Ala Luszczak <ala@databricks.com>
    
    Closes apache#19323 from ala/port-vector-realloc.
    ala authored and hvanhovell committed Sep 23, 2017
    1a829df
  3. [SPARK-22109][SQL][BRANCH-2.2] Resolves type conflicts between strings and timestamps in partition column
    
    ## What changes were proposed in this pull request?
    
    This PR backports apache@04975a6 into branch-2.2.
    
    ## How was this patch tested?
    
    Unit tests in `ParquetPartitionDiscoverySuite`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#19333 from HyukjinKwon/SPARK-22109-backport-2.2.
    HyukjinKwon authored and ueshin committed Sep 23, 2017
    211d81b

Commits on Sep 25, 2017

  1. [SPARK-22107] Change as to alias in python quickstart

    ## What changes were proposed in this pull request?
    
    Updated the docs so that a line of Python in the quick start guide executes. Closes apache#19283
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: John O'Leary <jgoleary@gmail.com>
    
    Closes apache#19326 from jgoleary/issues/22107.
    
    (cherry picked from commit 20adf9a)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
    John O'Leary authored and HyukjinKwon committed Sep 25, 2017
    8acce00
  2. [SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace

    ## What changes were proposed in this pull request?
    
    MemoryStore.evictBlocksToFreeSpace acquires write locks up front for all the
    blocks it intends to evict. If evicting a block fails (e.g., a failure while
    dropping it to disk), we have to release its lock; otherwise the lock is never
    released and an executor trying to acquire it will wait forever.
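
    A Scala sketch of the lock-safety pattern involved (the `BlockLock` trait and helper are illustrative, not the actual MemoryStore code):

    ```
    trait BlockLock { def release(): Unit }

    def evictAll(locks: Seq[BlockLock])(dropBlock: BlockLock => Unit): Unit = {
      var notYetDropped = locks
      try {
        locks.foreach { lock =>
          dropBlock(lock)                     // may fail, e.g. while spilling to disk
          notYetDropped = notYetDropped.tail  // this block's lock is handled below
          lock.release()
        }
      } finally {
        // Any block that was never dropped still holds its write lock; release it
        // so other tasks are not left waiting forever.
        notYetDropped.foreach(_.release())
      }
    }
    ```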
    
    ## How was this patch tested?
    
    Added unit test.
    
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes apache#19311 from squito/SPARK-22083.
    
    (cherry picked from commit 2c5b9b1)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    squito authored and Marcelo Vanzin committed Sep 25, 2017
    9836ea1
  3. d2b369a
  4. [SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive warehouse directory
    
    ## What changes were proposed in this pull request?
    During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory.
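
    A Scala sketch of the cleanup step (the helper names and path are invented for this example; the real reset() works against TestHive's configured warehouse location):

    ```
    import java.io.File

    def deleteRecursively(f: File): Unit = {
      if (f.isDirectory) {
        Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
      }
      f.delete()
    }

    def resetWarehouse(dir: File): Unit = {
      deleteRecursively(dir)  // drop anything left behind by the previous suite
      dir.mkdirs()            // recreate an empty warehouse directory
    }

    // e.g. resetWarehouse(new File("/tmp/spark-warehouse"))
    ```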
    
    ## How was this patch tested?
    Ran full suite of tests locally, verified that they pass.
    
    Author: Greg Owen <greg@databricks.com>
    
    Closes apache#19341 from GregOwen/SPARK-22120.
    
    (cherry picked from commit ce20478)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    Greg Owen authored and gatorsmile committed Sep 25, 2017
    b0f30b5
  5. 8f39361

Commits on Sep 27, 2017

  1. [SPARK-22141][BACKPORT][SQL] Propagate empty relation before checking Cartesian products
    
    Back port apache#19362 to branch-2.2
    
    ## What changes were proposed in this pull request?
    
    When inferring constraints from children, a Join's condition can be simplified to None.
    For example:
    ```
    val testRelation = LocalRelation('a.int)
    val x = testRelation.as("x")
    val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
    x.join(y).where($"x.a" === $"y.a")
    ```
    The plan will become
    ```
    Join Inner
    :- LocalRelation <empty>, [a#23]
    +- LocalRelation <empty>, [a#224]
    ```
    The Cartesian products check will then throw an exception for the above plan.
    
    Propagating the empty relation before checking for Cartesian products resolves the issue.
    
    ## How was this patch tested?
    
    Unit test
    
    Author: Wang Gengliang <ltnwgl@gmail.com>
    
    Closes apache#19366 from gengliangwang/branch-2.2.
    gengliangwang authored and hvanhovell committed Sep 27, 2017
    a406473
  2. 6dbda6e
  3. 28ae8fd

Commits on Sep 28, 2017

  1. SPY-1429

    ianlcsd committed Sep 28, 2017
    ef02a07