SPARK-1429: Debian packaging #2
Commits on Jun 13, 2017
- [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing Hive tables with many partitions (commit 24836be)
  Changes: don't leave the thread pool running from the AlterTableRecoverPartitionsCommand DDL command.
  Tested: existing tests.
  Author: Sean Owen <sowen@cloudera.com>. Closes apache#18216 from srowen/SPARK-20920. (cherry picked from commit 7b7c85e) Signed-off-by: Sean Owen <sowen@cloudera.com>
- [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing Hive tables with many partitions (commit 58a8a37)
  Changes: don't leave the thread pool running from the AlterTableRecoverPartitionsCommand DDL command.
  Tested: existing tests.
  Author: Sean Owen <sowen@cloudera.com>. Closes apache#18216 from srowen/SPARK-20920. (cherry picked from commit 7b7c85e) Signed-off-by: Sean Owen <sowen@cloudera.com>
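The leaked-pool pattern above is easiest to see in isolation. Below is a minimal sketch of the fix pattern in plain Java (hypothetical helper names, not the actual AlterTableRecoverPartitionsCommand code): create the pool locally and always shut it down in a `finally` block.

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical sketch of the fix pattern, not Spark's actual code: a pool
// created for one command invocation must be shut down when the command
// finishes, otherwise its worker threads linger after every call.
public class PoolPerCommand {
    static int runWithLocalPool(int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // stand-in for the command's parallel work
            return pool.submit(() -> 21 * 2).join();
        } finally {
            pool.shutdown();  // the fix: never leave the pool running
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithLocalPool(4));  // prints 42; pool threads released
    }
}
```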
- [SPARK-21060][WEB-UI] CSS style of the paging controls is wrong on the executor page (commit 039c465)
  Changes: the CSS style of the paging controls on the executor page differs from the history server UI's paging style, **but their styles should be consistent**, for three reasons:
  1. 'Previous', 'Next', and the page numbers should be rendered as buttons.
  2. On the first page, 'Previous' and '1' should be gray and unclickable. ![1](https://user-images.githubusercontent.com/26266482/27026667-1fe745ee-4f91-11e7-8b34-150819d22bd3.png)
  3. On the last page, 'Next' and the last page number should be gray and unclickable. ![2](https://user-images.githubusercontent.com/26266482/27026811-9d8d6fa0-4f91-11e7-8b51-7816c3feb381.png)
  Before the fix: ![fix_before](https://user-images.githubusercontent.com/26266482/27026428-47ec5c56-4f90-11e7-9dd5-d52c22d7bd36.png)
  After the fix: ![fix_after](https://user-images.githubusercontent.com/26266482/27026439-50d17072-4f90-11e7-8405-6f81da5ab32c.png)
  The history server UI's style: ![history](https://user-images.githubusercontent.com/26266482/27026528-9c90f780-4f90-11e7-91e6-90d32651fe03.png)
  Tested: manual tests.
  Author: guoxiaolong / 郭小龙 10207633 / guoxiaolongzte <guo.xiaolong1@zte.com.cn>. Closes apache#18275 from guoxiaolongzte/SPARK-21060. (cherry picked from commit b7304f2) Signed-off-by: Sean Owen <sowen@cloudera.com>
- [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite (commit 2bc2c15)
  Changes: the default value of `spark.port.maxRetries` is 100, but the suite used 10; change it to 100 to avoid test failures.
  Tested: no new test.
  Author: DjvuLee <lihu@bytedance.com>. Closes apache#18280 from djvulee/NettyTestBug. (cherry picked from commit b36ce2a) Signed-off-by: Sean Owen <sowen@cloudera.com>
- [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite (commit ee0e74e)
  Changes: the default value of `spark.port.maxRetries` is 100, but the suite used 10; change it to 100 to avoid test failures.
  Tested: no new test.
  Author: DjvuLee <lihu@bytedance.com>. Closes apache#18280 from djvulee/NettyTestBug. (cherry picked from commit b36ce2a) Signed-off-by: Sean Owen <sowen@cloudera.com>
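For context, the retry behavior the suite depends on can be modeled in a few lines. This is a hedged simplification (hypothetical helper; it assumes a failed bind is retried on successive ports, which is how `spark.port.maxRetries` is commonly described), not Spark's actual binding code:

```java
import java.util.OptionalInt;
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

// Simplified model (hypothetical helper): with spark.port.maxRetries = n,
// a failed bind at port p is retried on p+1 .. p+n before giving up, so a
// suite assuming n = 10 behaves differently than the real default of 100.
public class PortRetries {
    static OptionalInt bindWithRetries(int start, int maxRetries, IntPredicate isFree) {
        return IntStream.rangeClosed(start, start + maxRetries).filter(isFree).findFirst();
    }

    public static void main(String[] args) {
        IntPredicate busyBelow1105 = p -> p >= 1105;  // ports below 1105 are taken
        System.out.println(bindWithRetries(1024, 100, busyBelow1105)); // succeeds at 1105
        System.out.println(bindWithRetries(1024, 10, busyBelow1105));  // gives up: empty
    }
}
```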
- [SPARK-20979][SS] Add RateSource to generate values for tests and benchmarks (commit 220943d)
  Changes: adds a RateSource for Structured Streaming so users can easily generate data for tests and benchmarks. The source generates incrementing long values with timestamps; each generated row has two columns: a timestamp column for the generation time and an auto-incrementing long column starting at 0L. Supported options:
  - `rowsPerSecond` (e.g. 100, default: 1): how many rows to generate per second.
  - `rampUpTime` (e.g. 5s, default: 0s): how long to ramp up before the generation speed reaches `rowsPerSecond`; granularities finer than seconds are truncated to integer seconds.
  - `numPartitions` (e.g. 10, default: Spark's default parallelism): the number of partitions for the generated rows. The source tries its best to reach `rowsPerSecond`, but the query may be resource constrained; `numPartitions` can be tweaked to help reach the desired speed.
  A simple example that prints 10 rows per second:
  ```
  spark.readStream
    .format("rate")
    .option("rowsPerSecond", "10")
    .load()
    .writeStream
    .format("console")
    .start()
  ```
  The idea came from marmbrus, who did the initial work.
  Tested: the added tests.
  Author: Shixiong Zhu <shixiong@databricks.com>, Michael Armbrust <michael@databricks.com>. Closes apache#18199 from zsxwing/rate.
Commits on Jun 14, 2017
- [SPARK-12552][CORE] Correctly count the driver's resources when recovering from failure in the Master (commit 53212c3)
  Changes: in standalone HA mode, the driver's resource usage is not counted correctly in the Master when recovering from a failure, which leads to unexpected behavior such as negative values in the UI. Fix this by also counting the driver's resource usage, and change a recovered app's state to `RUNNING` once fully recovered (previously it stayed WAITING even after full recovery).
  Author: jerryshao <sshao@hortonworks.com>. Closes apache#10506 from jerryshao/SPARK-12552. (cherry picked from commit 9eb0952) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [SPARK-20986][SQL] Reset the table's statistics after the PruneFileSourcePartitions rule (commit 42cc830)
  Changes: after the PruneFileSourcePartitions rule runs, the table's statistics need to be reset, because the rule can filter out unnecessary partitions and the statistics change accordingly.
  Tested: added a unit test.
  Author: lianhuiwang <lianhuiwang09@gmail.com>. Closes apache#18205 from lianhuiwang/SPARK-20986. (cherry picked from commit 8b5b2e2) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [SPARK-21085][SQL] Failed to read a partitioned table created by Spark 2.1 (commit 9bdc835)
  Changes: before this PR, Spark was unable to read a partitioned table created by Spark 2.1 when the table schema does not put the partitioning columns at the end of the schema, due to [assert(partitionFields.map(_.name) == partitionColumnNames)](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L234-L236). When reading the table metadata from the metastore, the columns also need to be reordered.
  Tested: added test cases covering both Hive-serde and data source tables.
  Author: gatorsmile <gatorsmile@gmail.com>. Closes apache#18295 from gatorsmile/reorderReadSchema. (cherry picked from commit 0c88e8d) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [SPARK-20211][SQL][BACKPORT-2.2] Fix the precision and scale of decimal values when the input is a BigDecimal between -1.0 and 1.0 (commit 6265119)
  Changes: this PR backports apache#18244 to 2.2. The precision and scale of decimal values are wrong when the input is a BigDecimal between -1.0 and 1.0. A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, per the [Java BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, Spark's Decimal definition follows the database decimal standard, where precision is the total number of digits on both sides of the decimal point. This PR fixes the issue by doing the conversion. Before this PR, the following queries failed:
  ```SQL
  select 1 > 0.0001
  select floor(0.0001)
  select ceil(0.0001)
  ```
  Tested: added test cases.
  Author: gatorsmile <gatorsmile@gmail.com>. Closes apache#18297 from gatorsmile/backport18244.
- [SPARK-20211][SQL][BACKPORT-2.2] Fix the precision and scale of decimal values when the input is a BigDecimal between -1.0 and 1.0 (commit a890466)
  Changes: this PR backports apache#18244 to 2.2. The precision and scale of decimal values are wrong when the input is a BigDecimal between -1.0 and 1.0. A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, per the [Java BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, Spark's Decimal definition follows the database decimal standard, where precision is the total number of digits on both sides of the decimal point. This PR fixes the issue by doing the conversion. Before this PR, the following queries failed:
  ```SQL
  select 1 > 0.0001
  select floor(0.0001)
  select ceil(0.0001)
  ```
  Tested: added test cases.
  Author: gatorsmile <gatorsmile@gmail.com>. Closes apache#18297 from gatorsmile/backport18244. (cherry picked from commit 6265119) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
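The two notions of precision the commit message contrasts can be checked directly against `java.math.BigDecimal`; this sketch only illustrates the definitions, not Spark's actual conversion code:

```java
import java.math.BigDecimal;

// Java counts precision from the leftmost nonzero digit: 0.0001 has
// unscaled value 1, so its precision is 1 while its scale is 4.
public class DecimalPrecision {
    public static void main(String[] args) {
        BigDecimal d = new BigDecimal("0.0001");
        System.out.println(d.precision());  // 1
        System.out.println(d.scale());      // 4
        // The database convention counts all digits on both sides of the
        // decimal point, so 0.0001 needs a SQL DECIMAL(p, 4) with p >= 4;
        // hence the precision/scale conversion this fix performs.
    }
}
```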
- [SPARK-21089][SQL] Fix DESC EXTENDED/FORMATTED to show table properties (commit 3dda682)
  Changes: since table properties and storage properties share the same keys, table properties are not shown in the output of DESC EXTENDED/FORMATTED when the storage properties are non-empty. Fix this by renaming them to different keys.
  Tested: added test cases.
  Author: Xiao Li <gatorsmile@gmail.com>. Closes apache#18294 from gatorsmile/tableProperties. (cherry picked from commit df766a4) Signed-off-by: Xiao Li <gatorsmile@gmail.com>
- Revert "[SPARK-20941][SQL] Fix SubqueryExec Reuse" (commit e02e063)
  This reverts commit 6a4e023.
Commits on Jun 15, 2017
- [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON (commit af4f89c)
  Changes: the current option name `wholeFile` is misleading for CSV users, since it does not mean one record per file; one file can contain multiple records. Rename it to `multiLine`.
  Tested: N/A.
  Author: Xiao Li <gatorsmile@gmail.com>. Closes apache#18202 from gatorsmile/renameCVSOption. (cherry picked from commit 2051428) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [SPARK-20980][DOCS] Update docs to reflect the multiLine change (commit b5504f6)
  Changes: documentation-only change.
  Tested: manually.
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes apache#18312 from felixcheung/sqljsonwholefiledoc. (cherry picked from commit 1bf55e3) Signed-off-by: Felix Cheung <felixcheung@apache.org>
- [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message (commit 76ee41f)
  Changes: currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too long, the assertion in this test case fails. The failure is easy to reproduce by sleeping for a while before removing the block in BlockManagerSlaveEndpoint.receiveAndReply().
  Tested: N/A.
  Author: Xingbo Jiang <xingbo.jiang@databricks.com>. Closes apache#18314 from jiangxb1987/LocalCheckpointSuite. (cherry picked from commit 7dc3e69) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message (commit 62f2b80)
  Changes: currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too long, the assertion in this test case fails. The failure is easy to reproduce by sleeping for a while before removing the block in BlockManagerSlaveEndpoint.receiveAndReply().
  Tested: N/A.
  Author: Xingbo Jiang <xingbo.jiang@databricks.com>. Closes apache#18314 from jiangxb1987/LocalCheckpointSuite. (cherry picked from commit 7dc3e69) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commits on Jun 16, 2017
- [SPARK-21111][TEST][2.2] Fix the test failure of describe.sql (commit a585c87)
  Changes: a test failed in `describe.sql`; the related bug introduced in apache#17649 needs to be fixed in a follow-up PR to master.
  Tested: N/A.
  Author: gatorsmile <gatorsmile@gmail.com>. Closes apache#18316 from gatorsmile/fix.
- [SPARK-21072][SQL] TreeNode.mapChildren should only apply to children nodes (commit 9909be3)
  Changes: as the name and comments of `TreeNode.mapChildren` state, the function should apply only to the node's current children. The code at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342 should therefore check whether an argument is actually a child node.
  Tested: existing tests.
  Author: Xianyang Liu <xianyang.liu@intel.com>. Closes apache#18284 from ConeyLiu/treenode. (cherry picked from commit 87ab0ce) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [SPARK-21072][SQL] TreeNode.mapChildren should only apply to children nodes (commit 915a201)
  Changes: as the name and comments of `TreeNode.mapChildren` state, the function should apply only to the node's current children. The code at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342 should therefore check whether an argument is actually a child node.
  Tested: existing tests.
  Author: Xianyang Liu <xianyang.liu@intel.com>. Closes apache#18284 from ConeyLiu/treenode. (cherry picked from commit 87ab0ce) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
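The invariant this fix enforces can be shown with a toy tree (hypothetical classes, not Spark's TreeNode): a node-typed argument that is not among `children` must pass through `mapChildren` unchanged.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model of the invariant (hypothetical classes): mapChildren rewrites
// only arguments that are actually children; a node-typed argument outside
// children passes through untouched.
public class MapChildrenToy {
    interface Node {}
    record Leaf(int v) implements Node {}
    // `hint` is Node-typed but deliberately NOT part of children
    record Branch(Node child, Node hint) implements Node {
        List<Node> children() { return List.of(child); }
        // Correct mapChildren: applies f to `child` only, never to `hint`.
        Branch mapChildren(UnaryOperator<Node> f) { return new Branch(f.apply(child), hint); }
    }

    public static void main(String[] args) {
        Branch b = new Branch(new Leaf(1), new Leaf(99));
        Branch mapped = b.mapChildren(n -> n instanceof Leaf l ? new Leaf(l.v() + 1) : n);
        System.out.println(mapped);  // child becomes Leaf(2); hint stays Leaf(99)
    }
}
```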
- [SPARK-21114][TEST][2.1] Fix test failure in Spark 2.1/2.0 due to name mismatch (commit 0ebb3b8)
  Changes: a name mismatch between 2.1/2.0 and 2.2 caused test cases to fail after backporting a fix to 2.1/2.0. This PR fixes the issue:
  https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
  https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
  Tested: N/A.
  Author: gatorsmile <gatorsmile@gmail.com>. Closes apache#18319 from gatorsmile/fixDecimal.
- [SPARK-12552][FOLLOWUP] Fix flaky test "o.a.s.deploy.master.MasterSuite.master correctly recover the application" (commit 653e6f1)
  Changes: due to asynchronous RPC event processing, the test "correctly recover the application" can fail intermittently; see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78126/testReport/org.apache.spark.deploy.master/MasterSuite/master_correctly_recover_the_application/. This fixes the flaky test.
  Tested: existing unit tests.
  Author: jerryshao <sshao@hortonworks.com>. Closes apache#18321 from jerryshao/SPARK-12552-followup. (cherry picked from commit 2837b14) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [MINOR][DOCS] Improve the Running R Tests docs (commit d3deeb3)
  Changes: update the Running R Tests dependency packages to:
  ```bash
  R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
  ```
  Tested: manual tests.
  Author: Yuming Wang <wgyumg@gmail.com>. Closes apache#18271 from wangyum/building-spark. (cherry picked from commit 45824fb) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commits on Jun 18, 2017
- [SPARK-21126] The configuration "spark.core.connection.auth.wait.timeout" is not used in Spark (commit 8747f8e)
  Changes: per [SPARK-21126](https://issues.apache.org/jira/browse/SPARK-21126), the configuration "spark.core.connection.auth.wait.timeout" is not used anywhere in Spark, so it should be removed from configuration.md.
  Author: liuzhaokun <liu.zhaokun@zte.com.cn>. Closes apache#18333 from liu-zhaokun/new3. (cherry picked from commit 0d8604b) Signed-off-by: Sean Owen <sowen@cloudera.com>
- [MINOR][R] Add knitr and rmarkdown packages / improve version-info output in AppVeyor tests (commit c0d4acc)
  Changes: this PR proposes three things.
  1. Install packages per the documentation: this does not affect the tests themselves (only CRAN checks, which are not run via AppVeyor). Adds `knitr` and `rmarkdown` per https://github.com/apache/spark/blob/45824fb608930eb461e7df53bb678c9534c183a9/R/WINDOWS.md#unit-tests (see apache@45824fb).
  2. Improve/shorten logs: long logs can be a problem on AppVeyor (e.g. apache#17873), and `R -e ...` reprints the full R startup banner ("R version 3.3.1 (2016-06-21) -- \"Bug in Your Hair\" ...") for each invocation. Reducing the number of calls and printing the versions together is more readable.
  Before:
  ```
  # R information ...
  > packageVersion('testthat')
  [1] '1.0.2'

  # R information ...
  > packageVersion('e1071')
  [1] '1.6.8'

  ... 3 more times
  ```
  After:
  ```
  # R information ...
  > packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')
  [1] '1.16'
  [1] '1.6'
  [1] '1.0.2'
  [1] '1.6.8'
  [1] '2.41.3'
  ```
  3. Change `appveyor.yml`/`dev/appveyor-install-dependencies.ps1` to trigger the test; changing these files might break the test (e.g. apache#16927).
  Tested: before, see https://ci.appveyor.com/project/HyukjinKwon/spark/build/169-master; after, see the AppVeyor build in this PR.
  Author: hyukjinkwon <gurwls223@gmail.com>. Closes apache#18336 from HyukjinKwon/minor-add-knitr-and-rmarkdown. (cherry picked from commit 75a6d05) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commits on Jun 19, 2017
- [SPARK-21090][CORE] Optimize the unified memory manager code (commit d3c79b7)
  Changes:
  1. In `acquireStorageMemory`, when the memory mode is OFF_HEAP, `maxOffHeapMemory` should be changed to `maxOffHeapStorageMemory`, making it consistent with the ON_HEAP mode. A request between `maxOffHeapStorageMemory` and `maxOffHeapMemory` will certainly fail, so if the requested amount is greater than `maxOffHeapStorageMemory` (even if not greater than `maxOffHeapMemory`), we should fail fast.
  2. When borrowing memory from execution, `numBytes` should be changed to `numBytes - storagePool.memoryFree`: we only need to acquire the shortfall, so there is no need to borrow the full `numBytes` from execution.
  Tested: added a unit test case.
  Author: liuxian <liu.xian3@zte.com.cn>. Closes apache#18296 from 10110346/wip-lx-0614. (cherry picked from commit 112bd9b) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
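The two changes can be sketched with a simplified, self-contained model (hypothetical names, not the real UnifiedMemoryManager): fail fast when a request exceeds the storage cap, and borrow only the shortfall from execution.

```java
// Simplified model of the two fixes (hypothetical names, not Spark's code):
// 1) a request above the storage cap can never succeed -> fail fast;
// 2) when storage is short, borrow only (numBytes - storageFree) from execution.
public class PoolsModel {
    long storageFree, execFree;
    final long maxStorage;

    PoolsModel(long storageFree, long execFree, long maxStorage) {
        this.storageFree = storageFree; this.execFree = execFree; this.maxStorage = maxStorage;
    }

    boolean acquireStorage(long numBytes) {
        if (numBytes > maxStorage) return false;      // fix 1: fail fast
        if (numBytes > storageFree) {
            long shortfall = numBytes - storageFree;  // fix 2: only the shortfall
            long borrowed = Math.min(shortfall, execFree);
            execFree -= borrowed;
            storageFree += borrowed;
        }
        if (numBytes <= storageFree) { storageFree -= numBytes; return true; }
        return false;
    }

    public static void main(String[] args) {
        PoolsModel p = new PoolsModel(100, 500, 300);
        System.out.println(p.acquireStorage(200));  // true: borrows only 100
        System.out.println(p.execFree);             // 400, not 300
        System.out.println(p.acquireStorage(400));  // false: above the cap
    }
}
```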
- [SPARK-21132][SQL] The DISTINCT modifier of function arguments should not be silently ignored (commit fab070c)
  Changes: `DISTINCT` should not be silently ignored when it is not supported in function arguments; block these cases and issue error messages.
  Tested: added test cases for both regular functions and window functions.
  Author: Xiao Li <gatorsmile@gmail.com>. Closes apache#18340 from gatorsmile/firstCount. (cherry picked from commit 9413b84) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- [SPARK-19688][STREAMING] Do not read `spark.yarn.credentials.file` from checkpoint (commit f7fcdec)
  Changes: reload the `spark.yarn.credentials.file` property when restarting a streaming application from a checkpoint.
  Tested: manually with 1.6.3 and 2.1.1; not tested against master because of compile problems, but the result should be the same.
  Notice: this should be merged into maintenance branches too. JIRA: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)
  Author: saturday_s <shi.indetail@gmail.com>, committed by Marcelo Vanzin on Jun 19, 2017. Closes apache#18230 from saturday-shi/SPARK-21008. (cherry picked from commit e92ffe6) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
- [SPARK-19688][STREAMING] Do not read `spark.yarn.credentials.file` from checkpoint (commit a44c118)
  Changes: reload the `spark.yarn.credentials.file` property when restarting a streaming application from a checkpoint.
  Tested: manually with 1.6.3 and 2.1.1; not tested against master because of compile problems, but the result should be the same.
  Notice: this should be merged into maintenance branches too. JIRA: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)
  Author: saturday_s <shi.indetail@gmail.com>, committed by Marcelo Vanzin on Jun 19, 2017. Closes apache#18230 from saturday-shi/SPARK-21008. (cherry picked from commit e92ffe6) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
- [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for the file stream source were in the wrong table (commit 7b50736)
  Changes: the descriptions of several File Source options for structured streaming appeared under the File Sink instead. The PR has two commits: the first fixes the version as it appeared in Spark 2.1, and the second handles an additional option added in Spark 2.2.
  Tested: built the documentation with `SKIP_API=1 jekyll build` and visually inspected the structured streaming programming guide. The original documentation was written by tdas and lw-lin.
  Author: assafmendelson <assaf.mendelson@gmail.com>. Closes apache#18342 from assafmendelson/spark-21123. (cherry picked from commit 66a792c) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
- Commit 32bd9a7
- [MINOR][BUILD] Fix Java linter errors (commit e329bea)
  Changes: clean up a few Java linter errors for the Apache Spark 2.2 release.
  ```bash
  $ dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  Checkstyle checks passed.
  ```
  The result can be checked at Travis CI, [here](https://travis-ci.org/dongjoon-hyun/spark/builds/244297894).
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes apache#18345 from dongjoon-hyun/fix_lint_java_2. (cherry picked from commit ecc5631) Signed-off-by: Sean Owen <sowen@cloudera.com>
- [SPARK-21138][YARN] Cannot delete the staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different (commit cf10fa8)
  Changes: when different clusters are set for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir", as follows:
  ```
  spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
  spark.yarn.stagingDir      hdfs://ss-teg-2-v2/tmp/spark
  ```
  the staging dir cannot be deleted, and the following message is produced:
  ```
  java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
  ```
  Tested: existing tests.
  Author: sharkdtu <sharkdtu@tencent.com>, committed by Marcelo Vanzin on Jun 19, 2017. Closes apache#18352 from sharkdtu/master. (cherry picked from commit 3d4d11a) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
- [SPARK-21138][YARN] Cannot delete the staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different (commit 7799f35)
  Changes: when different clusters are set for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir", as follows:
  ```
  spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
  spark.yarn.stagingDir      hdfs://ss-teg-2-v2/tmp/spark
  ```
  the staging dir cannot be deleted, and the following message is produced:
  ```
  java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
  ```
  Tested: existing tests.
  Author: sharkdtu <sharkdtu@tencent.com>, committed by Marcelo Vanzin on Jun 19, 2017. Closes apache#18352 from sharkdtu/master. (cherry picked from commit 3d4d11a) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
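The failure mode can be illustrated with plain URIs (taken from the error message above; no Hadoop dependency). A Hadoop FileSystem is keyed by scheme and authority, so the one obtained from fs.defaultFS rejects paths on another cluster; the fix resolves the FileSystem from the staging path itself.

```java
import java.net.URI;

// Sketch of the mismatch behind "Wrong FS": the staging path lives on a
// different HDFS authority than fs.defaultFS, so a filesystem keyed on
// the default scheme + authority refuses to operate on it.
public class WrongFs {
    public static void main(String[] args) {
        URI defaultFs  = URI.create("hdfs://tl-nn-tdw.tencent-distribute.com:54310");
        URI stagingDir = URI.create("hdfs://ss-teg-2-v2/tmp/spark");

        System.out.println(defaultFs.getScheme().equals(stagingDir.getScheme()));       // true: both hdfs
        System.out.println(defaultFs.getAuthority().equals(stagingDir.getAuthority())); // false: different clusters
    }
}
```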
Commits on Jun 20, 2017
- [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeExternal throws NPE
  Changes: fix the NullPointerException in HighlyCompressedMapStatus#writeExternal:
  ```
  17/06/18 15:00:27 ERROR Utils: Exception encountered
  java.lang.NullPointerException
      at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
      at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
      at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
      at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
      at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
      at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
      at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
      at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
      at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
      at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
      at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
      at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
      at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
      at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
      at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
      at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
      at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  ```
  MapOutputTrackerMaster then logs the same NullPointerException wrapped in a java.io.IOException (thrown from Utils$.tryOrIOException at Utils.scala:1310, "... 17 more"), and the same trace recurs on each subsequent "Asked to send map output locations for shuffle 0" request (e.g. to 10.17.47.20:50188).
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` ## How was this patch tested? manual tests Author: Yuming Wang <wgyumg@gmail.com> Closes apache#18343 from wangyum/SPARK-21133. (cherry picked from commit 9b57cd8) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 8bf7f1e
-
[SPARK-20929][ML] LinearSVC should use its own threshold param
## What changes were proposed in this pull request? LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#18151 from jkbradley/ml-2.2-linearsvc-cleanup. (cherry picked from commit cc67bd5) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Commit: 514a7e6
-
[SPARK-21150][SQL] Persistent view stored in Hive metastore should be…
… case preserving ## What changes were proposed in this pull request? This is a regression in Spark 2.2. In Spark 2.2, we introduced a new way to resolve persisted view: https://issues.apache.org/jira/browse/SPARK-18209 , but this makes the persisted view non case-preserving because we store the schema in hive metastore directly. We should follow data source table and store schema in table properties. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes apache#18360 from cloud-fan/view. (cherry picked from commit e862dc9) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: b8b80f6
-
Commit: 62e442e
-
Commit: e883498
-
[SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream sou…
…rce are in a wrong table - version to fix 2.1 ## What changes were proposed in this pull request? The description for several options of File Source for structured streaming appeared in the File Sink description instead. This commit continues on PR apache#18342 and targets the fixes for the documentation of Spark version 2.1 ## How was this patch tested? Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide. zsxwing This is the PR to fix version 2.1 as discussed in PR apache#18342 Author: assafmendelson <assaf.mendelson@gmail.com> Closes apache#18363 from assafmendelson/spark-21123-for-spark2.1.
Commit: 8923bac
Commits on Jun 21, 2017
-
[MINOR][DOCS] Add lost <tr> tag for configuration.md
## What changes were proposed in this pull request? Add lost `<tr>` tag for `configuration.md`. ## How was this patch tested? N/A Author: Yuming Wang <wgyumg@gmail.com> Closes apache#18372 from wangyum/docs-missing-tr. (cherry picked from commit 987eb8f) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 529c04f
Commits on Jun 22, 2017
-
[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Po…
…ol Limit - Class Splitting ## What changes were proposed in this pull request? This is a backport patch for Spark 2.1.x of the class splitting feature over excess generated code as was merged in apache#18075. ## How was this patch tested? The same test provided in apache#18075 is included in this patch. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes apache#18354 from bdrillard/class_splitting_2.1.
Commit: 6b37c86
-
[SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation: Constant Po…
…ol Limit - Class Splitting ## What changes were proposed in this pull request? This is a backport patch for Spark 2.2.x of the class splitting feature over excess generated code as was merged in apache#18075. ## How was this patch tested? The same test provided in apache#18075 is included in this patch. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes apache#18377 from bdrillard/class_splitting_2.2.
Commit: 198e3a0
-
[SPARK-21167][SS] Decode the path generated by File sink to handle sp…
…ecial characters ## What changes were proposed in this pull request? Decode the path generated by File sink to handle special characters. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18381 from zsxwing/SPARK-21167. (cherry picked from commit d66b143) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Commit: 6ef7a5b
-
[SPARK-21167][SS] Decode the path generated by File sink to handle sp…
…ecial characters ## What changes were proposed in this pull request? Decode the path generated by File sink to handle special characters. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18381 from zsxwing/SPARK-21167. (cherry picked from commit d66b143) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Commit: 1a98d5d
-
[SQL][DOC] Fix documentation of lpad
## What changes were proposed in this pull request? Fix incomplete documentation for `lpad`. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes apache#18367 from actuaryzhang/SQLDoc. (cherry picked from commit 97b307c) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: d625734
Commits on Jun 23, 2017
-
Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation: Con…
…stant Pool Limit - Class Splitting" This reverts commit 198e3a0.
Commit: b99c0e9
-
[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in…
… INSERT AS SELECT [WIP] ### What changes were proposed in this pull request? The input query schema of INSERT AS SELECT could be changed after optimization. For example, the following query's output schema is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`. ```SQL SELECT word, length, cast(first as string) as first FROM view1 ``` This PR is to fix the issue in Spark 2.2. Instead of using the analyzed plan of the input query, this PR uses its executed plan to determine the attributes in `FileFormatWriter`. The related issue in the master branch has been fixed by apache#18064. After this PR is merged, I will submit a separate PR to merge the test case to the master. ### How was this patch tested? Added a test case Author: Xiao Li <gatorsmile@gmail.com> Author: gatorsmile <gatorsmile@gmail.com> Closes apache#18386 from gatorsmile/newRC5.
Commit: b6749ba
-
Commit: 7b87527
-
[SPARK-21144][SQL] Print a warning if the data schema and partition s…
…chema have duplicate columns ## What changes were proposed in this pull request? The current master outputs unexpected results when the data schema and partition schema have duplicate columns: ``` withTempPath { dir => val basePath = dir.getCanonicalPath spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString) spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString) spark.read.parquet(basePath).show() } +---+ |foo| +---+ | 1| | 1| | a| | a| | 1| | a| +---+ ``` This patch added code to print a warning when such duplication is found. ## How was this patch tested? Manually checked. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes apache#18375 from maropu/SPARK-21144-3. (cherry picked from commit f3dea60) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
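The overlap check described above can be sketched in a few lines. This is a hedged, plain-Python illustration, not Spark's actual code; the function name is hypothetical, and the case-insensitive comparison mirrors Spark's default (case-insensitive) resolution:

```python
# Hedged sketch (not Spark's implementation): report column names that
# occur in both the data schema and the partition schema, comparing
# case-insensitively. Names are illustrative.
def duplicated_columns(data_schema, partition_schema):
    data = {c.lower() for c in data_schema}
    return [c for c in partition_schema if c.lower() in data]

# In the example above the files carry a "foo" column while the path also
# partitions by "foo", so the overlap is exactly ["foo"]:
print(duplicated_columns(["foo"], ["foo"]))  # ['foo']
```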
Commit: 9d29808
-
[SPARK-21181] Release byteBuffers to suppress netty error messages
## What changes were proposed in this pull request? We are explicitly calling release on the ByteBufs used to encode the string to Base64 to suppress the memory leak error message reported by netty. This is to make it less confusing for the user. ### Changes proposed in this fix By explicitly invoking release on the ByteBufs we are decrementing the internal reference counts for the wrapped ByteBufs. Now, when the GC kicks in, these are reclaimed as before, but netty no longer reports memory leak error messages because the internal reference counts are now 0. ## How was this patch tested? Ran a few Spark applications and examined the logs. The error message no longer appears. Original PR was opened against branch-2.1 => apache#18392 Author: Dhruve Ashar <dhruveashar@gmail.com> Closes apache#18407 from dhruve/master. (cherry picked from commit 1ebe7ff) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
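The reference-counting idea behind this fix can be shown with a minimal sketch. This is plain Python, not netty's actual ByteBuf API; the class and function names are hypothetical stand-ins for the pattern of releasing a counted buffer once encoding is done:

```python
# Hedged sketch of reference counting (not netty's API): a buffer that is
# never released keeps ref_cnt > 0 and would be flagged as a leak; an
# explicit release() after use drops the count to 0.
class RefCountedBuf:
    def __init__(self, data):
        self.data = data
        self.ref_cnt = 1              # allocated holding one reference

    def release(self):
        self.ref_cnt -= 1
        return self.ref_cnt == 0      # True once the buffer is deallocatable

def encode_and_release(payload):
    buf = RefCountedBuf(payload)
    encoded = buf.data.hex()          # stand-in for the Base64 encoding step
    deallocated = buf.release()       # explicit release, as in this fix
    return encoded, deallocated

print(encode_and_release(b"\x01\x02"))  # ('0102', True)
```

Without the `release()` call, `ref_cnt` would stay at 1 even after the buffer is unreachable, which is exactly the condition a leak detector reports.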
Commit: f160267
-
[SPARK-21181] Release byteBuffers to suppress netty error messages
## What changes were proposed in this pull request? We are explicitly calling release on the ByteBufs used to encode the string to Base64 to suppress the memory leak error message reported by netty. This is to make it less confusing for the user. ### Changes proposed in this fix By explicitly invoking release on the ByteBufs we are decrementing the internal reference counts for the wrapped ByteBufs. Now, when the GC kicks in, these are reclaimed as before, but netty no longer reports memory leak error messages because the internal reference counts are now 0. ## How was this patch tested? Ran a few Spark applications and examined the logs. The error message no longer appears. Original PR was opened against branch-2.1 => apache#18392 Author: Dhruve Ashar <dhruveashar@gmail.com> Closes apache#18407 from dhruve/master. (cherry picked from commit 1ebe7ff) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Commit: f8fd3b4
-
[MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method
## What changes were proposed in this pull request? * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`. * Filled in some missing parentheses ## How was this patch tested? N/A Author: Ong Ming Yang <me@ongmingyang.com> Closes apache#18398 from ongmingyang/master. (cherry picked from commit 4cc6295) Signed-off-by: Xiao Li <gatorsmile@gmail.com>
Commit: 3394b06
-
[MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method
## What changes were proposed in this pull request? * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`. * Filled in some missing parentheses ## How was this patch tested? N/A Author: Ong Ming Yang <me@ongmingyang.com> Closes apache#18398 from ongmingyang/master. (cherry picked from commit 4cc6295) Signed-off-by: Xiao Li <gatorsmile@gmail.com>
Commit: bcaf06c
Commits on Jun 24, 2017
-
[SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types…
… in read path ## What changes were proposed in this pull request? This PR is to revert some code changes in the read path of apache#14377. The original fix is apache#17830 When merging this PR, please give the credit to gaborfeher ## How was this patch tested? Added a test case to OracleIntegrationSuite.scala Author: Gabor Feher <gabor.feher@lynxanalytics.com> Author: gatorsmile <gatorsmile@gmail.com> Closes apache#18408 from gatorsmile/OracleType. (cherry picked from commit b837bf9) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: a3088d2
-
[SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types…
… in read path This PR is to revert some code changes in the read path of apache#14377. The original fix is apache#17830 When merging this PR, please give the credit to gaborfeher Added a test case to OracleIntegrationSuite.scala Author: Gabor Feher <gabor.feher@lynxanalytics.com> Author: gatorsmile <gatorsmile@gmail.com> Closes apache#18408 from gatorsmile/OracleType.
Commit: f12883e
-
[SPARK-21159][CORE] Don't try to connect to launcher in standalone cl…
…uster mode. Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but the same scheduler implementation is used, and if it tries to connect to the launcher it will fail. So fix the scheduler so it only tries that in client mode; cluster mode applications will be correctly launched and will work, but monitoring through the launcher handle will not be available. Tested by running a cluster mode app with "SparkLauncher.startApplication". Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18397 from vanzin/SPARK-21159. (cherry picked from commit bfd73a7) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 96c04f1
-
[SPARK-21159][CORE] Don't try to connect to launcher in standalone cl…
…uster mode. Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but the same scheduler implementation is used, and if it tries to connect to the launcher it will fail. So fix the scheduler so it only tries that in client mode; cluster mode applications will be correctly launched and will work, but monitoring through the launcher handle will not be available. Tested by running a cluster mode app with "SparkLauncher.startApplication". Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18397 from vanzin/SPARK-21159. (cherry picked from commit bfd73a7) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 6750db3
-
[SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct
### What changes were proposed in this pull request? ```SQL CREATE TABLE `tab1` (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>) USING parquet INSERT INTO `tab1` SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b')) SELECT custom_fields.id, custom_fields.value FROM tab1 ``` The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes apache#18412 from gatorsmile/castStruct. (cherry picked from commit 2e1586f) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
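The underlying object-reuse bug generalizes beyond Spark. A hedged, plain-Python sketch (not Spark's actual classes; names are illustrative) of why reusing one mutable row while converting an array makes every element alias the last value written:

```python
# Hedged sketch of the GenericInternalRow reuse bug: a single mutable row
# object is written in place for each element, so every array slot ends up
# referencing the same storage and shows only the last values written.
def convert_buggy(pairs):
    row = {}                          # one shared buffer, reused per element
    out = []
    for pid, value in pairs:
        row["id"], row["value"] = pid, value
        out.append(row)               # every slot references the same dict
    return out

def convert_fixed(pairs):
    # allocate a fresh row per element, which is what the fix amounts to
    return [{"id": pid, "value": value} for pid, value in pairs]

pairs = [(1, "a"), (2, "b")]
print([r["value"] for r in convert_buggy(pairs)])  # ['b', 'b'] -- corrupted
print([r["value"] for r in convert_fixed(pairs)])  # ['a', 'b'] -- correct
```

This matches the symptom in the SQL example above: both array entries come back as the struct `(2, 'b')`.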
Commit: ad44ab5
-
[SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct
### What changes were proposed in this pull request? ```SQL CREATE TABLE `tab1` (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>) USING parquet INSERT INTO `tab1` SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b')) SELECT custom_fields.id, custom_fields.value FROM tab1 ``` The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes apache#18412 from gatorsmile/castStruct. (cherry picked from commit 2e1586f) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 0d6b701
Commits on Jun 25, 2017
-
[SPARK-21079][SQL] Calculate total size of a partition table as a sum…
… of individual partitions ## What changes were proposed in this pull request? The storage URI of a partitioned table may or may not point to a directory under which individual partitions are stored. In fact, individual partitions may be located in totally unrelated directories. Before this change, the ANALYZE TABLE table COMPUTE STATISTICS command calculated the total size of a table by adding up the sizes of files found under the table's storage URI. This calculation could produce 0 if partitions are stored elsewhere. This change uses the storage URIs of individual partitions to calculate the sizes of all partitions of a table and adds these up to produce the total size of the table. CC: wzhfy ## How was this patch tested? Added unit test. Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and verified that sizeInBytes is calculated correctly. Before this change, the size would be zero. Author: Masha Basmanova <mbasmanova@fb.com> Closes apache#18309 from mbasmanova/mbasmanova-analyze-part-table. (cherry picked from commit b449a1d) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
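The sizing change can be illustrated with a hedged sketch (plain Python, not Spark's implementation; paths, sizes, and function names are invented for illustration): summing files under each partition's own location instead of walking only the table root, which yields 0 when partitions live elsewhere.

```python
# Hedged sketch: old behaviour counts only files under the table's root
# URI; new behaviour adds up sizes under each partition's own location.
def table_size_buggy(root, file_sizes):
    return sum(sz for path, sz in file_sizes.items() if path.startswith(root))

def table_size_fixed(partition_roots, file_sizes):
    return sum(sz for path, sz in file_sizes.items()
               if any(path.startswith(p) for p in partition_roots))

# One partition sits under the table root, the other in an unrelated dir:
files = {"/warehouse/t/part=1/f0": 10, "/other/loc/part=2/f1": 20}
print(table_size_buggy("/warehouse/t", files))                                # 10
print(table_size_fixed(["/warehouse/t/part=1", "/other/loc/part=2"], files))  # 30
```

With all partitions relocated outside the root, the old calculation would return 0, which is exactly the bug described above.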
Commit: d8e3a4a
-
Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Con…
…stant Pool Limit - Class Splitting" This reverts commit 6b37c86.
Commit: 26f4f34
Commits on Jun 26, 2017
-
Commit: 61af209
Commits on Jun 27, 2017
-
[SPARK-19104][SQL] Lambda variables in ExternalMapToCatalyst should b…
…e global The issue happens in `ExternalMapToCatalyst`. For example, the following code creates `ExternalMapToCatalyst` to convert a Scala Map to catalyst map format. val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) The `valueConverter` in `ExternalMapToCatalyst` looks like: if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value) There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred to by `ExternalMapToCatalyst_value52`. Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can't be accessed anymore. Jenkins tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#18418 from viirya/SPARK-19104. (cherry picked from commit fd8c931) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
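The scoping problem generalizes: once generated code is split into helper functions, a variable that was local to the original function is no longer visible, so it must be promoted to shared ("global") state. A hedged sketch in plain Python (class and method names are hypothetical, not Spark's codegen output):

```python
# Hedged sketch: a value held in a local variable disappears from scope
# when the work is split into a helper method (the "buggy" shape), while
# promoting it to instance state keeps it visible (the fix's shape).
class SplitCodegenBuggy:
    def convert(self, value):
        local_value = value.upper()       # local to convert() only
        return self._make_struct()        # helper can't see local_value

    def _make_struct(self):
        try:
            return {"name": local_value}  # NameError: out of scope
        except NameError:
            return None

class SplitCodegenFixed:
    def convert(self, value):
        self.current_value = value.upper()  # promoted to shared state
        return self._make_struct()

    def _make_struct(self):
        return {"name": self.current_value}

print(SplitCodegenBuggy().convert("x"))  # None -- the helper lost the value
print(SplitCodegenFixed().convert("x"))  # {'name': 'X'}
```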
Commit: 970f68c
-
Commit: 8fdc51b
-
Commit: f3c40d5
Commits on Jun 29, 2017
-
[SPARK-21210][DOC][ML] Javadoc 8 fixes for ML shared param traits
PR apache#15999 included fixes for doc strings in the ML shared param traits (occurrences of `>` and `>=`). This PR simply uses the HTML-escaped version of the param doc to embed into the Scaladoc, to ensure that when `SharedParamsCodeGen` is run, the generated javadoc will be compliant for Java 8. ## How was this patch tested? Existing tests Author: Nick Pentreath <nickp@za.ibm.com> Closes apache#18420 from MLnick/shared-params-javadoc8. (cherry picked from commit 70085e8) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 17a04b9
Commits on Jun 30, 2017
-
[SPARK-21253][CORE] Fix a bug that StreamCallback may not be notified…
… if network errors happen ## What changes were proposed in this pull request? If a network error happens before processing StreamResponse/StreamFailure events, StreamCallback.onFailure won't be called. This PR fixes `failOutstandingRequests` to also notify outstanding StreamCallbacks. ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18472 from zsxwing/fix-stream-2. (cherry picked from commit 4996c53) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 20cf511
-
[SPARK-21253][CORE] Disable spark.reducer.maxReqSizeShuffleToMem
Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old shuffle service. Credits to wangyum Closes apache#18466 Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Author: Yuming Wang <wgyumg@gmail.com> Closes apache#18467 from zsxwing/SPARK-21253. (cherry picked from commit 80f7ac3) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 8de67e3
-
[SPARK-21176][WEB UI] Limit number of selector threads for admin ui p…
…roxy servlets to 8 ## What changes were proposed in this pull request? Please see also https://issues.apache.org/jira/browse/SPARK-21176 This change limits the number of selector threads that jetty creates to a maximum of 8 per proxy servlet (the Jetty default is number of processors / 2). The newHttpClient method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance http servers). Once jetty/jetty.project#1643 is available, the code could be cleaned up to avoid the method override. I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR? ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) The patch was tested manually on a Spark cluster with a head node that has 88 processors, using JMX to verify that the number of selector threads is now limited to 8 per proxy. gurvindersingh zsxwing can you please review the change? Author: IngoSchuster <ingo.schuster@de.ibm.com> Author: Ingo Schuster <ingo.schuster@de.ibm.com> Closes apache#18437 from IngoSchuster/master. (cherry picked from commit 88a536b) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: c6ba647
-
[SPARK-21176][WEB UI] Limit number of selector threads for admin ui p…
…roxy servlets to 8 ## What changes were proposed in this pull request? Please see also https://issues.apache.org/jira/browse/SPARK-21176 This change limits the number of selector threads that jetty creates to a maximum of 8 per proxy servlet (the Jetty default is number of processors / 2). The newHttpClient method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance http servers). Once jetty/jetty.project#1643 is available, the code could be cleaned up to avoid the method override. I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR? ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) The patch was tested manually on a Spark cluster with a head node that has 88 processors, using JMX to verify that the number of selector threads is now limited to 8 per proxy. gurvindersingh zsxwing can you please review the change? Author: IngoSchuster <ingo.schuster@de.ibm.com> Author: Ingo Schuster <ingo.schuster@de.ibm.com> Closes apache#18437 from IngoSchuster/master. (cherry picked from commit 88a536b) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 083adb0
-
[SPARK-21253][CORE][HOTFIX] Fix Scala 2.10 build
## What changes were proposed in this pull request? A follow up PR to fix Scala 2.10 build for apache#18472 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18478 from zsxwing/SPARK-21253-2. (cherry picked from commit cfc696f) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Commit: d16e262
-
[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spi…
…lling ## What changes were proposed in this pull request? `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) is reused for other things. This could happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`. This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects point to their own dedicated memory location. This was changed by apache#16909; after that PR Spark spills more eagerly. This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point. ## How was this patch tested? Added a regression test to `DataFrameWindowFunctionsSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes apache#18470 from hvanhovell/SPARK-21258. (cherry picked from commit e2f32ee) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
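The corruption mechanism can be demonstrated without Spark. A hedged sketch (plain Python; the reader class is an invented stand-in for the buffer-reusing spill reader, not Spark's code) of how an aggregate that retains a row, like `FIRST`, sees it silently overwritten, and how a defensive copy fixes it:

```python
# Hedged sketch: a reader that reuses one backing buffer hands out views
# into shared memory; a consumer that keeps a reference (like FIRST/MIN)
# sees its retained row overwritten by later reads.
class ReusingReader:
    def __init__(self, records):
        self._records = records
        self._buf = bytearray(8)                    # one shared backing buffer

    def rows(self):
        for rec in self._records:
            self._buf[:len(rec)] = rec              # overwrite in place
            yield memoryview(self._buf)[:len(rec)]  # view into shared buffer

first_seen = None
for row in ReusingReader([b"row-AAAA", b"row-BBBB"]).rows():
    if first_seen is None:
        first_seen = row                  # keep a reference, no copy
print(bytes(first_seen))  # b'row-BBBB' -- the retained row was clobbered

first_copy = None
for row in ReusingReader([b"row-AAAA", b"row-BBBB"]).rows():
    if first_copy is None:
        first_copy = bytes(row)           # materialize before buffer reuse
print(first_copy)  # b'row-AAAA' -- correct
```

The "surgical fix" described above corresponds to the second loop: always copy, trading a little performance for correctness.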
Commit: 8b08fd0
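The buffer-reuse failure described above can be reproduced outside Spark. The sketch below (illustrative Python, not Spark's code; all names are invented) shows an aggregate that keeps a reference into a read buffer that a spill-style reader recycles for every record, and how copying on store avoids the aliasing corruption:

```python
class FirstAgg:
    """Keeps the first row it sees; `copy` controls whether it snapshots
    the bytes or merely holds a reference into the shared buffer."""
    def __init__(self, copy):
        self.copy = copy
        self.value = None

    def update(self, row):
        if self.value is None:
            self.value = bytes(row) if self.copy else row  # copy vs alias

def read_spill(buffer, records, agg):
    # Simulates a spill reader that reuses one buffer for every record.
    for rec in records:
        buffer[:] = rec          # overwrite the shared buffer in place
        agg.update(buffer)

broken = FirstAgg(copy=False)
read_spill(bytearray(3), [b"abc", b"xyz"], broken)
# broken.value aliases the buffer, which now holds the *last* record

fixed = FirstAgg(copy=True)
read_spill(bytearray(3), [b"abc", b"xyz"], fixed)
# fixed.value is a private snapshot of the first record
```

The surgical fix in the PR corresponds to the `copy=True` branch: always snapshot, at a small performance cost.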
[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling

## What changes were proposed in this pull request?

`WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) are reused for other things. This could happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`. This was not seen before, because the spilling logic did not do actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects pointed to their own dedicated memory location. This was changed by apache#16909; after that PR Spark spills more eagerly. This PR provides a surgical fix because we are close to releasing Spark 2.2. The change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.

## How was this patch tested?

Added a regression test to `DataFrameWindowFunctionsSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com> Closes apache#18470 from hvanhovell/SPARK-21258. (cherry picked from commit e2f32ee) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: d995dac
Revert "[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling"

This reverts commit d995dac.
Commit: 3ecef24
[SPARK-21129][SQL] Arguments of SQL function call should not be named…
… expressions ### What changes were proposed in this pull request? Function argument should not be named expressions. It could cause two issues: - Misleading error message - Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser. ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26; 'Project [unresolvedalias('count(c1#30, 'distinct), None)] +- SubqueryAlias t1 +- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31] ``` After the fix, the error message becomes ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35) == SQL == select count(distinct c1, distinct c2) from t1 -----------------------------------^^^ ``` ### How was this patch tested? Added a test case to parser suite. Author: Xiao Li <gatorsmile@gmail.com> Author: gatorsmile <gatorsmile@gmail.com> Closes apache#18338 from gatorsmile/parserDistinctAggFunc. (cherry picked from commit eed9c4e) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 29a0be2
Commit: a2c7b21
Commit: 85fddf4
Commits on Jul 1, 2017
[SPARK-21170][CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted

## What changes were proposed in this pull request?

Do not add the exception to the suppressed list if it is the same instance as originalThrowable.

## How was this patch tested?

Added new tests to verify this; these tests fail without the source code changes and pass with the change.

Author: Devaraj K <devaraj@apache.org> Closes apache#18384 from devaraj-kavali/SPARK-21170. (cherry picked from commit 6beca9c) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 6fd39ea
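The "Self-suppression not permitted" failure comes from Java's `Throwable.addSuppressed`, which throws `IllegalArgumentException` when asked to suppress an exception with itself. A rough Python analogue of the fixed helper (illustrative only; `suppressed` is an invented attribute standing in for Java's suppressed list) shows the identity guard:

```python
def try_with_safe_finally(block, finally_block):
    """Run block, always run finally_block; if both fail, record the
    cleanup error on the original exception -- unless it is the *same*
    instance, which is exactly the case Java's addSuppressed rejects."""
    original = None
    try:
        return block()
    except BaseException as e:
        original = e
        raise
    finally:
        try:
            finally_block()
        except BaseException as fe:
            if original is None:
                raise                      # only the cleanup failed
            if fe is not original:         # the SPARK-21170 guard
                original.suppressed = getattr(original, "suppressed", []) + [fe]
            # same instance: skip instead of "suppressing" it with itself
```

Without the `fe is not original` check, a cleanup callback that rethrows the original error would try to attach the exception to itself.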
Commits on Jul 4, 2017
[SPARK-20256][SQL] SessionState should be created more lazily
## What changes were proposed in this pull request?

`SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)). This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is the following: users can start `spark-shell` and use RDD operations without any problems. **BEFORE** ```scala $ bin/spark-shell java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder' ... Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx------ ``` As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user. **AFTER** ```scala $ bin/spark-shell ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112) Type in expressions to have them evaluated. Type :help for more information. scala> sc.range(0, 10, 1).count() res0: Long = 10 ```

## How was this patch tested?

Manual. This closes apache#18512 .

Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#18501 from dongjoon-hyun/SPARK-20256. (cherry picked from commit 1b50e0e) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: db21b67
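The lazy-initialization pattern the commit restores can be sketched in a few lines. This is a hypothetical mini-version (not Spark's API; all names are made up): options are stashed at construction and the possibly-failing session state is only built on first access, so plain RDD-style work never touches it:

```python
class LazySession:
    """Options are stashed, not applied; state is built on first use."""

    def __init__(self, options, state_factory):
        self.initial_session_options = dict(options)  # stash, don't apply yet
        self._state_factory = state_factory
        self._state = None

    @property
    def session_state(self):
        if self._state is None:                       # build lazily, once
            self._state = self._state_factory(self.initial_session_options)
        return self._state

    def rdd_count(self, n):
        return n                                      # needs no session state
```

With this shape, a factory that raises (say, because the warehouse directory is unreadable) only fails when SQL functionality is actually requested, mirroring the BEFORE/AFTER behavior shown above.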
Commits on Jul 5, 2017
[SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more lazily

## What changes were proposed in this pull request?

`SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)). This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is the following: users can start `spark-shell` and use RDD operations without any problems. **BEFORE** ```scala $ bin/spark-shell java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder' ... Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx------ ``` As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user. **AFTER** ```scala $ bin/spark-shell ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.2-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> sc.range(0, 10, 1).count() res0: Long = 10 ```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1.
Commit: 8f1ca69
[SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to converting to internal value.

## What changes were proposed in this pull request?

`ExternalMapToCatalyst` should null-check the map key prior to converting it to an internal value, to throw an appropriate exception instead of something like an NPE.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN <ueshin@databricks.com> Closes apache#18524 from ueshin/issues/SPARK-21300. (cherry picked from commit ce10545) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 770fd2a
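The shape of the fix is easy to sketch: validate the key before handing it to the converter, so the caller gets a descriptive error rather than a NullPointerException from deep inside codegen. The function and message below are illustrative, not Spark's actual API; note that null *values* remain legal, only null keys are rejected:

```python
def external_map_to_catalyst(m, key_conv=str, value_conv=str):
    """Convert an 'external' dict to an internal representation,
    null-checking each key up front."""
    out = {}
    for k, v in m.items():
        if k is None:
            # Fail fast with a clear message instead of an NPE later.
            raise ValueError("Cannot use null as map key")
        out[key_conv(k)] = None if v is None else value_conv(v)
    return out
```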
Commit: 4a4d148
Commits on Jul 6, 2017
[SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream
## What changes were proposed in this pull request?

Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.

## How was this patch tested?

Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over a byte array having a non-zero offset.

Author: Sumedh Wale <swale@snappydata.io> Closes apache#18535 from sumwale/SPARK-21312. (cherry picked from commit 14a3bb3) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 6e1081c
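The bug class is general: a row backed by a *slice* of a larger byte array must honor its base offset when serializing, otherwise rows whose backing array does not start at index 0 write the wrong bytes. A minimal sketch (illustrative Python, not Spark's implementation):

```python
def write_to_stream(out, backing, base_offset, size, buggy=False):
    """Append `size` bytes of a row that lives at `base_offset`
    inside `backing`. The fix is simply to respect the offset."""
    start = 0 if buggy else base_offset
    out.extend(backing[start:start + size])

# A 9-byte "row" stored after 4 bytes of unrelated data:
backing = b"JUNKrow-bytes"
row_offset, row_size = 4, 9

good, bad = bytearray(), bytearray()
write_to_stream(good, backing, row_offset, row_size)              # correct
write_to_stream(bad, backing, row_offset, row_size, buggy=True)   # pre-fix
```

The buggy variant emits the leading junk and truncates the row, which is the kind of corruption the extended `UnsafeRowSuite` test guards against.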
[SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream
## What changes were proposed in this pull request?

Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.

## How was this patch tested?

Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over a byte array having a non-zero offset.

Author: Sumedh Wale <swale@snappydata.io> Closes apache#18535 from sumwale/SPARK-21312. (cherry picked from commit 14a3bb3) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 7f7b63b
[SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp checkpoint dir should be deleted

## What changes were proposed in this pull request?

Stopping a query while it is being initialized can throw an interrupt exception, in which case temporary checkpoint directories will not be deleted, and the test will fail.

Author: Tathagata Das <tathagata.das1565@gmail.com> Closes apache#18442 from tdas/DatastreamReaderWriterSuite-fix. (cherry picked from commit 60043f2) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Commit: 4e53a4e
Commits on Jul 7, 2017
[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation
## What changes were proposed in this pull request?

A few changes to the Structured Streaming documentation:
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add the Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, apache#16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those links (cc sameeragarwal, cloud-fan):
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate Scala, Java, and Python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

Author: Tathagata Das <tathagata.das1565@gmail.com> Closes apache#18485 from tdas/SPARK-21267. (cherry picked from commit 0217dfd) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Commit: 576fd4c
Commit: 6e33965
Commit: 3f914aa
Commits on Jul 8, 2017
[SPARK-21069][SS][DOCS] Add rate source to programming guide.
## What changes were proposed in this pull request? SPARK-20979 added a new structured streaming source: Rate source. This patch adds the corresponding documentation to programming guide. ## How was this patch tested? Tested by running jekyll locally. Author: Prashant Sharma <prashant@apache.org> Author: Prashant Sharma <prashsh1@in.ibm.com> Closes apache#18562 from ScrapCodes/spark-21069/rate-source-docs. (cherry picked from commit d0bfc67) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Commit: ab12848
[SPARK-21228][SQL][BRANCH-2.2] InSet incorrect handling of structs
## What changes were proposed in this pull request?

This is a backport of apache#18455. When the data type is a struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.

## How was this patch tested?

New test in SQLQuerySuite.

Author: Bogdan Raducanu <bogdan@databricks.com> Closes apache#18563 from bogdanrdc/SPARK-21228-BRANCH2.2.
Commit: 7d0b1c9
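The underlying problem is that hash-based membership breaks when two representations of the same logical struct are not equal under the default equality. The sketch below (illustrative Python; Spark actually builds a `TreeSet` over an interpreted ordering, for which a field-by-field equivalence scan stands in here) contrasts the pre-fix and post-fix behavior:

```python
class Struct:
    # Deliberately no __eq__/__hash__: like differing internal row
    # formats, two instances can be logically equal yet distinct under
    # the default identity-based equality.
    def __init__(self, *fields):
        self.fields = tuple(fields)

def equiv(a, b):
    return a.fields == b.fields          # interpreted, field-by-field

def in_set_hash(value, items):
    return value in set(items)           # pre-fix: hash/identity based

def in_set_interpreted(value, items):
    return any(equiv(value, i) for i in items)   # post-fix for structs

items = [Struct(1, "a"), Struct(2, "b")]
probe = Struct(1, "a")   # logically equal to items[0], different instance
```

`in_set_hash(probe, items)` misses the match; `in_set_interpreted` finds it, which is the incorrect-handling-of-structs bug being fixed.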
[SPARK-21345][SQL][TEST][TEST-MAVEN] SparkSessionBuilderSuite should clean up stopped sessions.

`SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfering with other test suites using `ShardSQLContext`. Recently, the master branch has been failing consecutively. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

Pass the Jenkins with an updated suite.

Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#18567 from dongjoon-hyun/SPARK-SESSION. (cherry picked from commit 0b8dd2d) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: a64f108
[SPARK-20342][CORE] Update task accumulators before sending task end event.

This makes sure that listeners get updated task information; otherwise it's possible to write incomplete task information into event logs, for example, making the information in a replayed UI inconsistent with the original application.

Added a new unit test to try to detect the problem, but it's not guaranteed to fail since it's a race; it fails pretty reliably for me without the scheduler changes, though.

Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18393 from vanzin/SPARK-20342.try2. (cherry picked from commit 9131bdb) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: c8d7855
[SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffleToMem.

## What changes were proposed in this pull request?

In the current code, the reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine the document.

Author: jinxing <jinxing6042@126.com> Closes apache#18566 from jinxing64/SPARK-21343. (cherry picked from commit 062c336) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 964332b
Commits on Jul 9, 2017
[SPARK-21345][SQL][TEST][TEST-MAVEN][BRANCH-2.1] SparkSessionBuilderSuite should clean up stopped sessions.

## What changes were proposed in this pull request?

`SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfering with other test suites using `ShardSQLContext`. Recently, the master branch has been failing consecutively. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

## How was this patch tested?

Pass the Jenkins with an updated suite.

Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#18572 from dongjoon-hyun/SPARK-21345-BRANCH-2.1.
Commit: 5e2bfd5
[SPARK-21083][SQL][BRANCH-2.2] Store zero size and row count when analyzing empty table

## What changes were proposed in this pull request?

We should be able to store zero size and row count after analyzing an empty table. This is a backport of apache@9fccc36.

## How was this patch tested?

Added new test.

Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes apache#18575 from wzhfy/analyzeEmptyTable-2.2.
Commit: 3bfad9d
Commits on Jul 10, 2017
[SPARK-21083][SQL][BRANCH-2.1] Store zero size and row count when analyzing empty table

## What changes were proposed in this pull request?

We should be able to store zero size and row count after analyzing an empty table. This is a backport of apache@9fccc36.

## How was this patch tested?

Added new test.

Author: Zhenhua Wang <wzh_zju@163.com> Closes apache#18577 from wzhfy/analyzeEmptyTable-2.1.
Commit: 2c28462
[SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFetcher.

When `RetryingBlockFetcher` retries fetching blocks, there could be two `DownloadCallback`s downloading the same content to the same target file. This could cause `ShuffleBlockFetcherIterator` to read a partial result. This PR proposes to create and delete the tmp files in `OneForOneBlockFetcher`.

Author: jinxing <jinxing6042@126.com> Author: Shixiong Zhu <zsxwing@gmail.com> Closes apache#18565 from jinxing64/SPARK-21342. (cherry picked from commit 6a06c4b) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 40fd0ce
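The general remedy for two concurrent download attempts racing on one target file is to give each attempt its own private temp file and publish it atomically. A sketch of that pattern (illustrative only; Spark's actual change manages the tmp files inside `OneForOneBlockFetcher`):

```python
import os
import tempfile

def fetch_to_file(target_dir, block_id, data, attempt):
    """Write this attempt's data to an attempt-private temp file,
    then atomically rename it to the final name. Two concurrent
    attempts can never interleave writes into the same file."""
    fd, tmp = tempfile.mkstemp(prefix=f"{block_id}-{attempt}-", dir=target_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)                       # attempt-private file
        final = os.path.join(target_dir, block_id)
        os.replace(tmp, final)                  # atomic publish
        return final
    finally:
        if os.path.exists(tmp):                 # clean up on failure paths
            os.remove(tmp)
```

A reader of the final path then sees either a complete earlier attempt or a complete later one, never a partial mix.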
[SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows
## What changes were proposed in this pull request? Updating numOutputRows metric was missing from one return path of LeftAnti SortMergeJoin. ## How was this patch tested? Non-zero output rows manually seen in metrics. Author: Juliusz Sompolski <julek@databricks.com> Closes apache#18494 from juliuszsompolski/SPARK-21272.
Commit: a05edf4
Commit: 73df649
Commits on Jul 11, 2017
[SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-*
## What changes were proposed in this pull request? Remove all usages of Scala Tuple2 from common/network-* projects. Otherwise, Yarn users cannot use `spark.reducer.maxReqSizeShuffleToMem`. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18593 from zsxwing/SPARK-21369. (cherry picked from commit 833eab2) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: edcd9fb
[SPARK-21366][SQL][TEST] Add sql test for window functions
## What changes were proposed in this pull request?

Add a SQL test for window functions; also remove unnecessary test cases in `WindowQuerySuite`.

## How was this patch tested?

Added `window.sql` and the corresponding output file.

Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes apache#18591 from jiangxb1987/window. (cherry picked from commit 66d2168) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 399aa01
Commits on Jul 12, 2017
[SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting

There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask), and can asynchronously be assigned to an executor *prior* to the blacklist state update (updateBlacklistForFailedTask); the result is the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue in https://issues.apache.org/jira/browse/SPARK-21219

The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.

Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes.

Author: Eric Vandenberg <ericvandenberg@fb.com> Closes apache#18427 from ericvandenbergfb/blacklistFix.

## What changes were proposed in this pull request?

This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.

## How was this patch tested?

Ran TaskSetManagerSuite tests locally.

Author: Eric Vandenberg <ericvandenberg@fb.com> Closes apache#18604 from jsoltren/branch-2.2.
Commit: cb6fc89
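The ordering fix can be modeled in a few lines. In the sketch below (illustrative Python, invented names; Spark's blacklist is richer than a per-manager set), an asynchronous resource offer fires between the two steps of failure handling. With the buggy order the retry is handed straight back to the failing executor; with the fixed order the blacklist is updated first, so the mid-handling offer cannot match:

```python
class TaskSetManagerSketch:
    def __init__(self):
        self.blacklisted = set()   # executors the failed task must avoid
        self.pending = []          # tasks waiting to be scheduled

    def offer(self, executor):
        """An async resource offer that may arrive at any moment."""
        if self.pending and executor not in self.blacklisted:
            return self.pending.pop()
        return None

    def handle_failed_task(self, task, executor, fixed):
        steps = [lambda: self.blacklisted.add(executor),   # blacklist first
                 lambda: self.pending.append(task)]        # then re-queue
        if not fixed:
            steps.reverse()        # buggy order: re-queue, then blacklist
        steps[0]()
        raced = self.offer(executor)   # offer arriving mid-handling
        steps[1]()
        return raced                    # task wrongly assigned, if any
```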
Commits on Jul 13, 2017
[SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader

## What changes were proposed in this pull request?

`ClassLoader` preferentially loads classes from its `parent`. Only when `parent` is null or the load fails does it call the overridden `findClass` function. To avoid potential issues caused by loading classes with an inappropriate class loader, we should set the `parent` of this `ClassLoader` to null, so that we can fully control which class loader is used. This is a takeover of apache#17074; the primary author of this PR is taroplus. Should close apache#17074 after this PR gets merged.

## How was this patch tested?

Add test case in `ExecutorClassLoaderSuite`.

Author: Kohki Nishio <taroplus@me.com> Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes apache#18614 from jiangxb1987/executor_classloader. (cherry picked from commit e08d06b) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 39eba30
Revert "[SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader"

This reverts commit 39eba30.
Commit: cf0719b
Commits on Jul 14, 2017
[SPARK-21376][YARN] Fix yarn client token expire issue when cleaning the staging files in long running scenario

## What changes were proposed in this pull request?

This issue happens in long running applications with yarn cluster mode, because yarn#client doesn't sync tokens with the AM, so it will always keep the initial token. This token may expire in the long running scenario, so when yarn#client tries to clean up the staging directory after the application finishes, it will use this expired token and hit a token expire issue.

## How was this patch tested?

Manual verification in a secure cluster.

Author: jerryshao <sshao@hortonworks.com> Closes apache#18617 from jerryshao/SPARK-21376. (cherry picked from commit cb8d5cc)
Commit: bfe3ba8
Commits on Jul 15, 2017
[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations use signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes apache#18571 from kiszk/SPARK-21344. (cherry picked from commit ac5d5d7) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 1cb4369
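The sign bug is easy to see on one byte. Binary comparison must treat each byte as unsigned (0..255); a signed interpretation (-128..127, which is how Java's `byte` reads) misorders any value whose high bit is set. A small sketch of both behaviors (illustrative only, not Spark's implementation):

```python
def to_signed(b):
    """Reinterpret an unsigned byte 0..255 as a signed byte -128..127."""
    return b - 256 if b >= 128 else b

def compare_binary(a, b, signed=False):
    """Lexicographic compare of two byte strings; `signed=True`
    reproduces the pre-fix (wrong) behavior."""
    for x, y in zip(a, b):
        xi, yi = (to_signed(x), to_signed(y)) if signed else (x, y)
        if xi != yi:
            return -1 if xi < yi else 1
    # Equal prefix: the shorter array sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

Under the unsigned (correct) rule `0x80` sorts after `0x7F`; the signed rule inverts that, which is exactly the kind of misordering `OrderingSuite` now checks.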
[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations use signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes apache#18571 from kiszk/SPARK-21344. (cherry picked from commit ac5d5d7) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: ca4d2aa
[SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector

## What changes were proposed in this pull request?

Update internal references from programming-guide to rdd-programming-guide. See apache/spark-website@5ddf243 and apache#18485 (comment). Let's keep the redirector even if it's problematic to build, but not rely on it internally.

## How was this patch tested?

(Doc build)

Author: Sean Owen <sowen@cloudera.com> Closes apache#18625 from srowen/SPARK-21267.2. (cherry picked from commit 74ac1fb) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 8e85ce6
Commits on Jul 17, 2017
[SPARK-21321][SPARK CORE] Spark very verbose on shutdown
## What changes were proposed in this pull request?

The current code is very verbose on shutdown. The change I propose is to lower the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException).

## How was this patch tested?

Tested with word count (deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB of data.

Author: John Lee <jlee2@yahoo-inc.com> Closes apache#18547 from yoonlee95/SPARK-21321. (cherry picked from commit 0e07a29) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
John Lee authored and Tom Graves committed Jul 17, 2017.
Commit: 0ef98fd
Commits on Jul 18, 2017
[SPARK-19104][BACKPORT-2.1][SQL] Lambda variables in ExternalMapToCatalyst should be global

## What changes were proposed in this pull request?

This PR is a backport of apache#18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1. The issue happens in `ExternalMapToCatalyst`. For example, the following code creates `ExternalMapToCatalyst` to convert a Scala Map to the catalyst map format.

``` val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) ```

The `valueConverter` in `ExternalMapToCatalyst` looks like:

``` if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value) ```

There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred to by `ExternalMapToCatalyst_value52`. Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can't be accessed anymore.

## How was this patch tested?

Added a new test suite into `DatasetPrimitiveSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes apache#18627 from kiszk/SPARK-21391.
Commit: a9efce4
[SPARK-21332][SQL] Incorrect result type inferred for some decimal ex…
…pressions

## What changes were proposed in this pull request?

This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:

```
val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
val df = spark.createDataFrame(rdd, inputSchema)

// Works correctly since no nested decimal expression is involved
// Expected result type: (26, 6) * (26, 6) = (38, 12)
df.select($"col" * $"col").explain(true)
df.select($"col" * $"col").printSchema()

// Gives a wrong result since there is a nested decimal expression that should be visited first
// Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
df.select($"col" * $"col" * $"col").explain(true)
df.select($"col" * $"col" * $"col").printSchema()
```

The example above gives the following output:

```
// Correct result without sub-expressions
== Parsed Logical Plan ==
'Project [('col * 'col) AS (col * col)alteryx#4]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
(col * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)alteryx#4]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)alteryx#4]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)alteryx#4]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- (col * col): decimal(38,12) (nullable = true)

// Incorrect result with sub-expressions
== Parsed Logical Plan ==
'Project [(('col * 'col) * 'col) AS ((col * col) * col)alteryx#11]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
((col * col) * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- ((col * col) * col): decimal(38,12) (nullable = true)
```

## How was this patch tested?

This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.

Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes apache#18583 from aokolnychyi/spark-21332.
(cherry picked from commit 0be5fb4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
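The expected result types quoted above follow the decimal multiplication rule that the DecimalPrecision analysis applies. A minimal sketch of that arithmetic (in plain Python for brevity; the function name is illustrative, not a Spark internal):

```python
# Result type of decimal multiplication, per Spark's DecimalPrecision rule
# (Spark 2.x): precision = p1 + p2 + 1, scale = s1 + s2, each bounded by 38.
MAX_PRECISION = 38

def decimal_multiply_type(p1, s1, p2, s2):
    return (min(p1 + p2 + 1, MAX_PRECISION), min(s1 + s2, MAX_PRECISION))

# (26,6) * (26,6) = (53,12) -> bounded to (38,12)
print(decimal_multiply_type(26, 6, 26, 6))
# ((col * col) * col): (38,12) * (26,6) = (65,18) -> bounded to (38,18)
print(decimal_multiply_type(38, 12, 26, 6))
```

This also shows why traversal order matters: only after the inner `col * col` has been resolved to (38,12) can the outer multiplication correctly produce (38,18), which is what the bottom-up transformation in this PR guarantees.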
Commit: 83bdb04
-
[SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions

## What changes were proposed in this pull request?

This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:

```
val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
val df = spark.createDataFrame(rdd, inputSchema)

// Works correctly since no nested decimal expression is involved
// Expected result type: (26, 6) * (26, 6) = (38, 12)
df.select($"col" * $"col").explain(true)
df.select($"col" * $"col").printSchema()

// Gives a wrong result since there is a nested decimal expression that should be visited first
// Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
df.select($"col" * $"col" * $"col").explain(true)
df.select($"col" * $"col" * $"col").printSchema()
```

The example above gives the following output:

```
// Correct result without sub-expressions
== Parsed Logical Plan ==
'Project [('col * 'col) AS (col * col)alteryx#4]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
(col * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)alteryx#4]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)alteryx#4]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)alteryx#4]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- (col * col): decimal(38,12) (nullable = true)

// Incorrect result with sub-expressions
== Parsed Logical Plan ==
'Project [(('col * 'col) * 'col) AS ((col * col) * col)alteryx#11]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
((col * col) * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)alteryx#11]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- ((col * col) * col): decimal(38,12) (nullable = true)
```

## How was this patch tested?

This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.

Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes apache#18583 from aokolnychyi/spark-21332.
(cherry picked from commit 0be5fb4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: caf32b3
-
[SPARK-21445] Make IntWrapper and LongWrapper in UTF8String Serializable
## What changes were proposed in this pull request?

Making those two classes Serializable will avoid serialization issues like the one below:

```
Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
Serialization stack:
    - object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper326450e)
    - field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)
    - object (class org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, <function1>)
```

## How was this patch tested?

- [x] Manual testing
- [ ] Unit test

Author: Burak Yavuz <brkyvz@gmail.com>
Closes apache#18660 from brkyvz/serializableutf8.
(cherry picked from commit 26cd2ca)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 99ce551
-
[SPARK-18631][SQL] Changed ExchangeCoordinator re-partitioning to avoid more data skew

## What changes were proposed in this pull request?

The re-partitioning logic in ExchangeCoordinator is changed so that a pre-shuffle partition is not added to the current post-shuffle partition if doing so would cause the size of the post-shuffle partition to exceed the target partition size.

## How was this patch tested?

Existing tests updated to reflect new expectations.

Author: Mark Hamstra <markhamstra@gmail.com>
Closes apache#16065 from markhamstra/SPARK-17064.
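The changed rule can be illustrated with a hedged sketch (plain Python, not the actual ExchangeCoordinator code): adjacent pre-shuffle partitions are merged greedily, but a new post-shuffle partition is started before adding one that would push the current group past the target size.

```python
# Greedy coalescing of adjacent pre-shuffle partitions: close the current
# post-shuffle partition *before* adding a partition that would push its
# size past the target, instead of adding it first and overshooting.
def coalesce_partitions(sizes, target_size):
    groups = [[]]
    current = 0
    for size in sizes:
        if groups[-1] and current + size > target_size:
            groups.append([])   # start a new post-shuffle partition
            current = 0
        groups[-1].append(size)
        current += size
    return groups

# With a 100-byte target, no group exceeds 100 unless a single partition does.
print(coalesce_partitions([60, 60, 30, 50, 120], 100))
```

Under the old behavior a group could first absorb a partition and only then be found oversized (e.g. 60 + 60 = 120 in one post-shuffle partition); the refined rule closes the group at 60 and starts a new one.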
Commit: 49e2ada
-
[SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle partition values with dot

## What changes were proposed in this pull request?

When we list partitions from the Hive metastore with a partial partition spec, we expect exact matching of the partition values. However, Hive treats the dot specially and matches any single character for it. We should apply an extra filter to drop unexpected partitions.

## How was this patch tested?

New regression test.

Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#18671 from cloud-fan/hive.
(cherry picked from commit f18b905)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
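The extra client-side filter amounts to an exact match of the returned partition values against the partial spec. A simplified sketch of the idea (illustrative Python, not the actual ExternalCatalog code):

```python
# Keep only partitions whose values exactly match the partial spec,
# dropping Hive's "dot matches any single character" false positives.
def exact_matches(partitions, partial_spec):
    return [p for p in partitions
            if all(p.get(k) == v for k, v in partial_spec.items())]

# Hive could return both "a.b" and "axb" for spec p=a.b; only "a.b" survives.
returned = [{"p": "a.b"}, {"p": "axb"}]
print(exact_matches(returned, {"p": "a.b"}))
```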
Commit: df061fd
Commits on Jul 19, 2017
-
[SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM.
## What changes were proposed in this pull request?

In `SlidingWindowFunctionFrame`, the current logic adds to the buffer all rows for which the input row value is equal to or less than the output row's upper bound, then drops from the buffer all rows for which the input row value is smaller than the output row's lower bound. This can make the buffer very large even though the window itself is small. For example:

```
select a, b, sum(a)
over (partition by b order by a range between 1000000 following and 1000001 following)
from table
```

We can refine the logic and add only the qualified rows into the buffer.

## How was this patch tested?

Manual test: ran the sql `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10` against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2G bytes, containing 200k lines. Configured the executor with 2G bytes of memory. With the change in this PR, it works fine. Without this change, the exception below is thrown.

```
MemoryError: Java heap space
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
```

Author: jinxing <jinxing6042@126.com>
Closes apache#18634 from jinxing64/SPARK-21414.
(cherry picked from commit 4eb081c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
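The refined behavior can be sketched as follows (plain Python, purely illustrative of the semantics, not the actual frame implementation): for each output row, only rows whose ordering value actually falls inside the range frame belong in the buffer, rather than every row up to the upper bound.

```python
# For an ORDER BY value v and frame RANGE BETWEEN lower FOLLOWING AND
# upper FOLLOWING, only rows with values in [v + lower, v + upper] need
# to be buffered for that output row.
def frame_rows(sorted_values, lower, upper):
    return [[x for x in sorted_values if v + lower <= x <= v + upper]
            for v in sorted_values]

# range between 2 following and 3 following over values 1, 2, 4, 5
print(frame_rows([1, 2, 4, 5], 2, 3))
```

For a narrow frame far ahead of the current row (as in the `1000000 following and 1000001 following` example), the qualified set per row is tiny, so buffering only qualified rows keeps memory bounded.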
Commit: 5a0a76f
-
[SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results in failures in some cases

## What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441

This issue can be reproduced by the following example:

```
val spark = SparkSession
  .builder()
  .appName("smj-codegen")
  .master("local")
  .config("spark.sql.autoBroadcastJoinThreshold", "1")
  .getOrCreate()
val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
val df = df1.join(df2, df1("key") === df2("key"))
  .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
  .select("int")
df.show()
```

To conclude, the issue happens when:
(1) the SortMergeJoin condition contains CodegenFallback expressions;
(2) in the PhysicalPlan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.

This patch fixes the logic in the `CollapseCodegenStages` rule.

## How was this patch tested?

Unit test and manual verification in our cluster.

Author: donnyzone <wellfengzhu@gmail.com>
Closes apache#18656 from DonnyZone/Fix_SortMergeJoinExec.
(cherry picked from commit 6b6dd68)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 4c212ee
-
[SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results in failures in some cases

## What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441

This issue can be reproduced by the following example:

```
val spark = SparkSession
  .builder()
  .appName("smj-codegen")
  .master("local")
  .config("spark.sql.autoBroadcastJoinThreshold", "1")
  .getOrCreate()
val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
val df = df1.join(df2, df1("key") === df2("key"))
  .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
  .select("int")
df.show()
```

To conclude, the issue happens when:
(1) the SortMergeJoin condition contains CodegenFallback expressions;
(2) in the PhysicalPlan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.

This patch fixes the logic in the `CollapseCodegenStages` rule.

## How was this patch tested?

Unit test and manual verification in our cluster.

Author: donnyzone <wellfengzhu@gmail.com>
Closes apache#18656 from DonnyZone/Fix_SortMergeJoinExec.
(cherry picked from commit 6b6dd68)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: ac20693
-
Commit: 9c61833
-
Commit: 2cddd1c
-
[SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime class

## What changes were proposed in this pull request?

Use of the `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However, uses of ProcessingTime still cause deprecation warnings during compilation. This cannot be avoided entirely: even though it is deprecated as a public API, ProcessingTime instances are used internally in TriggerExecutor. This PR minimizes the warnings by removing its uses from tests as much as possible.

## How was this patch tested?

Existing tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes apache#18678 from tdas/SPARK-21464.
(cherry picked from commit 70fe99d)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Commit: 86cd3c0
-
[SPARK-21446][SQL] Fix setAutoCommit never executed
## What changes were proposed in this pull request?

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446

options.asConnectionProperties cannot contain fetchsize, because fetchsize belongs to the Spark-only options, and Spark-only options are excluded from the connection properties. So the properties passed to beforeFetch are changed from options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap.

## How was this patch tested?

Author: DFFuture <albert.zhang23@gmail.com>
Closes apache#18665 from DFFuture/sparksql_pg.
(cherry picked from commit c972918)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 308bce0
-
[SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith

## What changes were proposed in this pull request?

Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<Tuple2\>, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both.

## How was this patch tested?

I ran the following code:

```
public static void main(String[] args) {
  SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
  Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
  Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);

  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();}
}
```

which tests all the different join types; the last two (left_semi and left_anti) threw exceptions. The same code using join instead of joinWith did fine. The Bean class was just a java bean with a single int field, x.

Author: Corey Woodfield <coreywoodfield@gmail.com>
Closes apache#18462 from coreywoodfield/master.
(cherry picked from commit 8cd9cdf)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 9949fed
Commits on Jul 21, 2017
-
[SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch
For configurations with external shuffle enabled, we have observed that if a very large number of blocks are fetched from a remote host, it puts the NodeManager under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` to limit the number of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.

Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values - 10, 20, 50, 100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.)

Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes apache#18487 from dhruve/impr/SPARK-21243.

Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes apache#18691 from dhruve/branch-2.2.
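The effect of the new setting can be sketched as splitting the blocks wanted from one remote address into bounded batches (a hedged simplification in plain Python; the real fetch-request logic also accounts for bytes in flight):

```python
# At most max_per_address blocks are requested from one remote address at a time.
def batch_blocks(block_ids, max_per_address):
    return [block_ids[i:i + max_per_address]
            for i in range(0, len(block_ids), max_per_address)]

blocks = ["shuffle_0_%d_0" % i for i in range(5)]
print(batch_blocks(blocks, max_per_address=2))
```

With the default (effectively unlimited) value all five blocks would be requested at once; with a limit of 2, the requests arrive at the NodeManager in three smaller batches.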
Commit: 88dccda
-
[SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.
Update the Quickstart and RDD programming guides to mention pip. Built docs locally. Author: Holden Karau <holden@us.ibm.com> Closes apache#18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation. (cherry picked from commit cc00e99) Signed-off-by: Holden Karau <holden@us.ibm.com>
Commit: da403b9
Commits on Jul 23, 2017
-
[SPARK-20904][CORE] Don't report task failures to driver during shutdown.

Executors run a thread pool with daemon threads to run tasks. This means that those threads remain active when the JVM is shutting down, so tasks are affected by code that runs in shutdown hooks. If a shutdown hook messes with something a task is using (e.g. an HDFS connection), the task will fail and report that failure to the driver. The driver will then mark the task as failed regardless of what caused the executor to shut down. So, for example, if YARN preempted that executor, the driver would consider the task failed when it should instead ignore the failure.

This change avoids reporting failures to the driver while shutdown hooks are executing; this fixes the YARN preemption accounting, and doesn't really change things much for other scenarios, other than reporting a more generic error ("Executor lost") when the executor shuts down unexpectedly - which is arguably more correct.

Tested with a hacky app running on spark-shell that tried to cause failures only when shutdown hooks were running; verified that preemption didn't cause the app to fail because of task failures exceeding the threshold.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes apache#18594 from vanzin/SPARK-20904.
(cherry picked from commit cecd285)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 62ca13d
Commits on Jul 25, 2017
-
[SPARK-21383][YARN] Fix the YarnAllocator allocating more resources than needed

When NodeManagers are slow to launch executors, the `missing` value exceeds the real value, which can lead YARN to allocate more resources than needed. We take `numExecutorsRunning` into account when calculating `missing` to avoid this.

Tested by experiment.

Author: DjvuLee <lihu@bytedance.com>
Closes apache#18651 from djvulee/YarnAllocate.
(cherry picked from commit 8de080d)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
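The fix boils down to including already-running executors in the shortfall accounting. A hedged sketch with illustrative names (plain Python, not the actual YarnAllocator code):

```python
# Before the fix, executors still being launched were counted neither as
# pending nor as running, so the computed shortfall overshot the real one.
def missing_executors(target_num, num_pending, num_running):
    return max(target_num - num_pending - num_running, 0)

# Target 10, 3 container requests pending, 5 executors already running:
# only 2 more should be requested.
print(missing_executors(10, 3, 5))
```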
DjvuLee authored and Marcelo Vanzin committed Jul 25, 2017
Commit: e5ec339
-
Commit: 0af0672
-
[SPARK-21447][WEB UI] Spark history server fails to render compressed inprogress history file in some cases.

Add failure handling for the EOFException that can be thrown during decompression of an inprogress spark history file; treat it the same as the case where the last line can't be parsed.

## What changes were proposed in this pull request?

Failure handling for the case of an EOFException thrown within the ReplayListenerBus.replay method, analogous to the json-parse-failure case. This path can arise with compressed inprogress history files since an incomplete compression block could be read (not flushed by the writer on a block boundary). See the stack trace of this occurrence in the jira ticket (https://issues.apache.org/jira/browse/SPARK-21447).

## How was this patch tested?

Added a unit test that specifically targets validating the failure handling path when maybeTruncated is true and false.

Author: Eric Vandenberg <ericvandenberg@fb.com>
Closes apache#18673 from ericvandenbergfb/fix_inprogress_compr_history_file.
(cherry picked from commit 06a9793)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Eric Vandenberg authored and Marcelo Vanzin committed Jul 25, 2017
Commit: c91191b
-
Commit: ec50897
-
Commit: f3df120
Commits on Jul 26, 2017
-
[SPARK-21494][NETWORK] Use correct app id when authenticating to external service.

There was some code based on the old SASL handler in the new auth client that was incorrectly using the SASL user as the user to authenticate against the external shuffle service. This caused the external service to not be able to find the correct secret to authenticate the connection, failing the connection.

In the course of debugging, I found that some log messages from the YARN shuffle service were a little noisy, so I silenced some of them, and also added a couple of new ones that helped find this issue. On top of that, I found that a check in the code that records app secrets was wrong, causing more log spam and also using an O(n) operation instead of an O(1) call.

Also added a new integration suite for the YARN shuffle service with auth on, and verified it failed before, and passes now.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes apache#18706 from vanzin/SPARK-21494.
(cherry picked from commit 300807c)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Marcelo Vanzin committed Jul 26, 2017
Commit: 1bfd1a8
Commits on Jul 27, 2017
-
Commit: 420e6e9
-
Commit: 464a934
-
[SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API
## What changes were proposed in this pull request?

This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description:

```
spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")      // works
spark.range(1).withColumnRenamed("id", "x").sort('id)        // works
spark.range(1).withColumnRenamed("id", "x").sort("id")       // fails with:
// org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
```

The above `AnalysisException` happens because the last case calls `Dataset.apply()` to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly.

Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes apache#18740 from aokolnychyi/spark-21538.
(cherry picked from commit f44ead8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 06b2ef0
Commits on Jul 28, 2017
-
[SPARK-21306][ML] OneVsRest should support setWeightCol
## What changes were proposed in this pull request?

Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.

## How was this patch tested?

- [x] Added a unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>
Closes apache#18554 from facaiy/BUG/oneVsRest_missing_weightCol.
(cherry picked from commit a5a3189)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
Commit: 9379031
Commits on Jul 29, 2017
-
[SPARK-21508][DOC] Fix example code provided in Spark Streaming Documentation

## What changes were proposed in this pull request?

JIRA ticket: [SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508)

This corrects a mistake in example code provided in the Spark Streaming Custom Receivers documentation (https://spark.apache.org/docs/latest/streaming-custom-receivers.html):

```
// Assuming ssc is the StreamingContext
val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
val words = lines.flatMap(_.split(" "))
...
```

Instead of `lines.flatMap(_.split(" "))` it should be `customReceiverStream.flatMap(_.split(" "))`.

## How was this patch tested?

This documentation change was tested manually with a jekyll build, running the commands below:

```
jekyll build
jekyll serve --watch
```

Screenshots:
![screenshot1](https://user-images.githubusercontent.com/8828470/28744636-a6de1ac6-7482-11e7-843b-ff84b5855ec0.png)
![screenshot2](https://user-images.githubusercontent.com/8828470/28744637-a6def496-7482-11e7-9512-7f4bbe027c6a.png)

Author: Remis Haroon <Remis.Haroon@insdc01.pwc.com>
Closes apache#18770 from remisharoon/master.
(cherry picked from commit c143820)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: df6cd35
-
[SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child

## What changes were proposed in this pull request?

When there are aliases (added for nested fields) as parameters in `RuntimeReplaceable`, since they are not in the children expressions, those aliases can't be cleaned up by the analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group-by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim the aliases out by simply transforming the expressions in `CleanupAliases`. If we wanted to replace the non-children aliases in `RuntimeReplaceable`, we would need to add more code to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`, which makes the interface ugly IMO.

Considering that those aliases will be replaced later at optimization and so do no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.

One concern is about `CleanupAliases`: it actually cannot clean up ALL aliases inside a plan. To make callers of this rule notice that, this patch adds a comment to `CleanupAliases`.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes apache#18761 from viirya/SPARK-21555.
(cherry picked from commit 9c8109e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 24a9bac
-
[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary

## What changes were proposed in this pull request?

Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values; this can cause wrong results and we should fix it. Furthermore, we should accept any legal literal value as a range frame boundary. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this is merged, we can close apache#16818.

## How was this patch tested?

Added new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes apache#18540 from jiangxb1987/rangeFrame.
(cherry picked from commit 92d8563)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 66fa6bd
Commits on Jul 30, 2017
-
Revert "[SPARK-19451][SQL] rangeBetween method should accept Long val…
…ue as boundary" This reverts commit 66fa6bd.
Commit: e2062b9
Commits on Aug 1, 2017
-
[SPARK-21522][CORE] Fix flakiness in LauncherServerSuite.
Handle the case where the server closes the socket before the full message has been written by the client. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18727 from vanzin/SPARK-21522. (cherry picked from commit b133501) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Marcelo Vanzin committed Aug 1, 2017
Commit: 1745434
-
[SPARK-21593][DOCS] Fix 2 rendering errors on configuration page
## What changes were proposed in this pull request? Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and SPARK-15355. ## How was this patch tested? Manually built and viewed docs with jekyll Author: Sean Owen <sowen@cloudera.com> Closes apache#18793 from srowen/SPARK-21593. (cherry picked from commit b1d59e6) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 79e5805
-
[SPARK-21339][CORE] spark-shell --packages option does not add jars t…
…o classpath on windows The --packages option jars are added to the classpath with the "file:///" scheme. On Unix this causes no problem, since the scheme text happens to end in the Unix path separator, so the jar name still resolves against its location on the classpath. On Windows, the jar file is not resolved from the classpath because of the scheme. Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar With this PR, we avoid adding the 'file://' scheme to the packages jar files. I have verified manually in Windows and Unix environments; with the change the jar is added to the classpath like below, Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar Unix : /home/<user>/.ivy2/jars/<jar-name>.jar Author: Devaraj K <devaraj@apache.org> Closes apache#18708 from devaraj-kavali/SPARK-21339. (cherry picked from commit 58da1a2) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
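The normalization the fix performs — classpath entries as plain OS paths rather than file: URIs — can be sketched in a few lines of Python (the function name is invented for illustration; real code should use proper URI parsing):

```python
def classpath_entry(jar):
    # Strip a "file://" scheme down to a plain local path; leave bare paths alone.
    # "file:///home/u/dep.jar" -> "/home/u/dep.jar", which resolves on any OS.
    prefix = "file://"
    if jar.startswith(prefix):
        return jar[len(prefix):]
    return jar

print(classpath_entry("file:///home/user/.ivy2/jars/dep.jar"))  # /home/user/.ivy2/jars/dep.jar
```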
Devaraj K authored and Marcelo Vanzin committed Aug 1, 2017
Commit: 67c60d7
-
Commit: 8d04581
Commits on Aug 2, 2017
-
[SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats
## What changes were proposed in this pull request? This PR fixed a potential overflow issue in EventTimeStats. ## How was this patch tested? The new unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18803 from zsxwing/avg. (cherry picked from commit 7f63e85) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
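An overflow-safe way to track a running average is to maintain it incrementally instead of keeping an unbounded sum. A minimal Python sketch of that idea (field and class names are illustrative, not Spark's exact implementation):

```python
class EventTimeStats:
    """Running max/min/avg/count of event times, without storing a raw sum."""
    def __init__(self):
        self.max = float("-inf")
        self.min = float("inf")
        self.avg = 0.0
        self.count = 0

    def add(self, event_time):
        self.max = max(self.max, event_time)
        self.min = min(self.min, event_time)
        self.count += 1
        # Incremental mean: avg += (x - avg) / n, so no overflow-prone sum is kept
        self.avg += (event_time - self.avg) / self.count

s = EventTimeStats()
for t in [10, 20, 30]:
    s.add(t)
print(s.avg)  # 20.0
```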
Commit: 397f904
-
[SPARK-21546][SS] dropDuplicates should ignore watermark when it's no…
…t a key ## What changes were proposed in this pull request? When the watermark is not a column of `dropDuplicates`, right now it will crash. This PR fixed this issue. ## How was this patch tested? The new unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18822 from zsxwing/SPARK-21546. (cherry picked from commit 0d26b3a) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Commit: 467ee8d
-
Commit: 8820569
Commits on Aug 3, 2017
-
[SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle…
… registry ## What changes were proposed in this pull request? When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables that are pickled from one thread get added to the shared `_pickled_broadcast_vars` and become part of the python command from another thread. This PR introduces a thread-safe pickled registry using thread local storage so that when the python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have its own view of the pickle registry to retrieve and clear the broadcast variables used. ## How was this patch tested? Added a unit test that causes this race condition using another thread. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#18823 from BryanCutler/branch-2.2.
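Thread-local storage gives each thread an isolated view of the registry. A small Python sketch of the approach (a simplified stand-in, not PySpark's actual class):

```python
import threading

class BroadcastPickleRegistry(threading.local):
    """Each thread that touches the registry gets its own independent set."""
    def __init__(self):
        self._registry = set()

    def add(self, item):
        self._registry.add(item)

    def clear(self):
        self._registry.clear()

    def __iter__(self):
        return iter(self._registry)

reg = BroadcastPickleRegistry()
reg.add("bcast-main")

seen_by_worker = []
t = threading.Thread(target=lambda: (reg.add("bcast-worker"),
                                     seen_by_worker.append(set(reg))))
t.start(); t.join()

print(set(reg))           # only the main thread's entry
print(seen_by_worker[0])  # only the worker thread's entry
```

Because `__init__` runs once per thread for `threading.local` subclasses, neither thread ever sees the other's pickled broadcasts.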
Commit: 690f491
-
Commit: b1a731c
-
Fix Java SimpleApp spark application
## What changes were proposed in this pull request? Add missing import and missing parentheses to invoke `SparkSession::text()`. ## How was this patch tested? Built and ran the code for this application, and ran jekyll locally per docs/README.md. Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov> Closes apache#18795 from christiam/master. (cherry picked from commit dd72b10) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 1bcfa2a
Commits on Aug 4, 2017
-
[SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC tab…
…le with extreme values on the partition column ## What changes were proposed in this pull request? An overflow of the difference of bounds on the partitioning column leads to no data being read. This patch checks for this overflow. ## How was this patch tested? New unit test. Author: Andrew Ray <ray.andrew@gmail.com> Closes apache#18800 from aray/SPARK-21330. (cherry picked from commit 25826c7) Signed-off-by: Sean Owen <sowen@cloudera.com>
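The overflow lives in computing `upperBound - lowerBound` as a 64-bit Long. A Python sketch of the guard (Python ints are arbitrary precision, so the 64-bit behavior is emulated; names are illustrative):

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def bounds_difference_overflows(lower, upper):
    # Python can compute the true difference, then check whether it
    # would fit in a Java Long; overflowing it corrupts the stride.
    return not (INT64_MIN <= upper - lower <= INT64_MAX)

# Extreme values on the partition column: the difference overflows a Long
print(bounds_difference_overflows(INT64_MIN, INT64_MAX))  # True
print(bounds_difference_overflows(0, 52))                 # False
```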
Commit: f9aae8e
-
Commit: 8aa9405
Commits on Aug 5, 2017
-
[SPARK-21580][SQL] Integers in aggregation expressions are wrongly ta…
…ken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: if an aggregate expression is an integer, then after the ordinal is replaced with this aggregate expression, the group expression is still treated as an ordinal. The solution: this bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes apache#18779 from 10110346/groupby. (cherry picked from commit 894d5a4) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 841bc2f
Commits on Aug 6, 2017
-
[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null
## What changes were proposed in this pull request? Calling SQLContext.getConf(key, null) for a key that is not defined in the conf and doesn't have a default value defined throws an NPE. It happens only when the conf entry has a value converter. Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue). ## How was this patch tested? Added unit test Author: vinodkc <vinod.kc.in@gmail.com> Closes apache#18852 from vinodkc/br_Fix_SPARK-21588. (cherry picked from commit 1ba967b) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
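The guard can be seen in a hypothetical Python analogue of `getConfString` (structure invented for illustration): the converter must never run on a missing default.

```python
def get_conf_string(settings, key, default, converter=str):
    if key in settings:
        return converter(settings[key])
    # The fix: only convert the default when one actually exists,
    # instead of unconditionally calling converter(default)
    return None if default is None else converter(default)

print(get_conf_string({}, "spark.some.key", None))  # None, no crash
print(get_conf_string({}, "spark.some.key", 10))    # 10 converted to "10"
```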
Commit: 098aaec
Commits on Aug 7, 2017
-
[SPARK-21621][CORE] Reset numRecordsWritten after DiskBlockObjectWrit…
…er.commitAndGet called ## What changes were proposed in this pull request? We should reset numRecordsWritten to zero after DiskBlockObjectWriter.commitAndGet is called, because when `revertPartialWritesAndClose` is called we decrease the written records in `ShuffleWriteMetrics`. However, we decreased the written records all the way to zero, which is wrong; we should only subtract the records written after the last `commitAndGet` call. ## How was this patch tested? Modified existing test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xianyang Liu <xianyang.liu@intel.com> Closes apache#18830 from ConeyLiu/DiskBlockObjectWriter. (cherry picked from commit 534a063) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
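The bookkeeping can be illustrated with a tiny Python model (names mirror the description above; this is not Spark's API):

```python
class Writer:
    """Sketch of commit/revert record counting for a block writer."""
    def __init__(self, metrics):
        self.metrics = metrics          # total records reported upstream
        self.num_records_written = 0    # records since the last commit

    def write(self):
        self.metrics["records"] += 1
        self.num_records_written += 1

    def commit_and_get(self):
        self.num_records_written = 0    # the fix: reset the counter on commit

    def revert_partial_writes(self):
        # Only roll back records written after the last commit,
        # never the already-committed ones
        self.metrics["records"] -= self.num_records_written
        self.num_records_written = 0

m = {"records": 0}
w = Writer(m)
w.write(); w.write()
w.commit_and_get()
w.write()
w.revert_partial_writes()
print(m["records"])  # 2: the committed records survive the revert
```

Without the reset in `commit_and_get`, the revert would subtract all three writes and report a wrong metric.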
Commit: 7a04def
-
[SPARK-21647][SQL] Fix SortMergeJoin when using CROSS
### What changes were proposed in this pull request? author: BoleynSu closes apache#18836 ```Scala val df = Seq((1, 1)).toDF("i", "j") df.createOrReplaceTempView("T") withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " + "cross join T t2 where t2.i = t1.i").explain(true) } ``` The above code could cause the following exception: ``` SortMergeJoinExec should not take Cross as the JoinType java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100) ``` Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue. ### How was this patch tested? Modified the two existing test cases. Author: Xiao Li <gatorsmile@gmail.com> Author: Boleyn Su <boleyn.su@gmail.com> Closes apache#18863 from gatorsmile/pr-18836. (cherry picked from commit bbfd6b5) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 4f0eb0c
-
[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with di…
…sabled FS cache This PR replaces apache#18623 to do some clean up. Closes apache#18623 Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Author: Andrey Taptunov <taptunov@amazon.com> Closes apache#18848 from zsxwing/review-pr18623.
Commit: 43f9c84
-
Commit: 0aacb6b
-
[SPARK-21565][SS] Propagate metadata in attribute replacement.
## What changes were proposed in this pull request? Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes. ## How was this patch tested? new unit test, which was verified to fail before the fix Author: Jose Torres <joseph-torres@databricks.com> Closes apache#18840 from joseph-torres/SPARK-21565. (cherry picked from commit cce25b3) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Commit: fa92a7b
-
[SPARK-21648][SQL] Fix confusing assert failure in JDBC source when p…
…arallel fetching parameters are not properly provided. ### What changes were proposed in this pull request? ```SQL CREATE TABLE mytesttable1 USING org.apache.spark.sql.jdbc OPTIONS ( url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}', dbtable 'mytesttable1', paritionColumn 'state_id', lowerBound '0', upperBound '52', numPartitions '53', fetchSize '10000' ) ``` The above option name `paritionColumn` is misspelled. That means users did not provide a value for `partitionColumn`. In such a case, users hit a confusing error. ``` AssertionError: assertion failed java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312) ``` ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes apache#18864 from gatorsmile/jdbcPartCol. (cherry picked from commit baf5cac) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: a1c1199
Commits on Aug 8, 2017
-
[SPARK-21567][SQL] Dataset should work with type alias
If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset. A reproducible case looks like: object C { type TwoInt = (Int, Int) def tupleTypeAlias: TwoInt = (1, 1) } Seq(1).toDS().map(_ => ("", C.tupleTypeAlias)) It throws an exception like: type T1 is not a class scala.ScalaReflectionException: type T1 is not a class at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275) ... This patch accesses the dealias of type in many places in `ScalaReflection` to fix it. Added test case. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#18813 from viirya/SPARK-21567. (cherry picked from commit ee13041) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 86609a9
-
Revert "[SPARK-21567][SQL] Dataset should work with type alias"
This reverts commit 86609a9.
Commit: e87ffca
Commits on Aug 9, 2017
-
[SPARK-21503][UI] Spark UI shows incorrect task status for a killed E…
…xecutor Process The executor tab on the Spark UI page shows a task as completed when the executor process running that task is killed using the kill command. Added the ExecutorLostFailure case, which was previously missing; without it, the default case was executed and the task was marked as completed. This case covers all scenarios where the executor's connection to the Spark driver was lost: killing the executor process, network failures, etc. ## How was this patch tested? Manually tested the fix by observing the UI change before and after. Before: <img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png"> After: <img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png"> Please review http://spark.apache.org/contributing.html before opening a pull request. Author: pgandhi <pgandhi@yahoo-inc.com> Author: pgandhi999 <parthkgandhi9@gmail.com> Closes apache#18707 from pgandhi999/master. (cherry picked from commit f016f5c) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: d023314
-
[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in …
…strong wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search scalanlp/breeze#651 ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes apache#18797 from WeichenXu123/update-breeze. (cherry picked from commit b35660d) Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
Commit: 7446be3
-
[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the…
… return value Same PR as apache#18799 but for branch 2.2. Main discussion is in the other PR. -------- When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query not to generate a batch metadata file, this behavior will hide it and allow the query to continue running, finally deleting metadata logs and making it hard to debug. This PR ensures that places calling HDFSMetadataLog.get always check the return value. Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18890 from tdas/SPARK-21596-2.2.
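The pattern being enforced, sketched in Python with a dict standing in for the metadata log: fail fast on a missing batch instead of silently skipping it.

```python
def get_batch(metadata_log, batch_id):
    batch = metadata_log.get(batch_id)
    if batch is None:
        # A batch that should exist but doesn't indicates deleted or
        # corrupted metadata; surface it now rather than running on silently
        raise RuntimeError(f"batch {batch_id} doesn't exist")
    return batch

log = {0: "offsets-0", 1: "offsets-1"}
print(get_batch(log, 1))  # offsets-1
```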
Commit: f6d56d2
-
[SPARK-21663][TESTS] test("remote fetch below max RPC message size") …
…should call masterTracker.stop() in MapOutputTrackerSuite Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn> ## What changes were proposed in this pull request? After the unit tests end, masterTracker.stop() should be called to free resources. ## How was this patch tested? Ran unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: 10087686 <wang.jiaochun@zte.com.cn> Closes apache#18867 from wangjiaochun/mapout. (cherry picked from commit 6426adf) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 3ca55ea
Commits on Aug 11, 2017
-
[SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog
## What changes were proposed in this pull request? This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog.getTableOption. ## How was this patch tested? Removed the test case. Author: Reynold Xin <rxin@databricks.com> Closes apache#18912 from rxin/remove-getTableOption. (cherry picked from commit 584c7f1) Signed-off-by: Reynold Xin <rxin@databricks.com>
Commit: c909496
-
[SPARK-21595] Separate thresholds for buffering and spilling in Exter…
…nalAppendOnlyUnsafeRowArray ## What changes were proposed in this pull request? [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre apache#16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers). Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control. ## How was this patch tested? Added unit tests Author: Tejas Patil <tejasp@fb.com> Closes apache#18843 from tejasapatil/SPARK-21595. (cherry picked from commit 9443999) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
Commit: 406eb1c
Commits on Aug 14, 2017
-
[SPARK-21563][CORE] Fix race condition when serializing TaskDescripti…
…ons and adding jars ## What changes were proposed in this pull request? Fix the race condition when serializing TaskDescriptions and adding jars by keeping the set of jars and files for a TaskSet constant across the lifetime of the TaskSet. Otherwise TaskDescription serialization can produce an invalid serialization when new file/jars are added concurrently as the TaskDescription is serialized. ## How was this patch tested? Additional unit test ensures jars/files contained in the TaskDescription remain constant throughout the lifetime of the TaskSet. Author: Andrew Ash <andrew@andrewash.com> Closes apache#18913 from ash211/SPARK-21563. (cherry picked from commit 6847e93) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 7b98077
-
Commit: dc3cdd5
-
[SPARK-21696][SS] Fix a potential issue that may generate partial sna…
…pshot files ## What changes were proposed in this pull request? Directly writing a snapshot file may generate a partial file. This PR changes it to write to a temp file then rename to the target file. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#18928 from zsxwing/SPARK-21696. (cherry picked from commit 282f00b) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Commit: 48bacd3
-
Commit: 3a02a3c
Commits on Aug 15, 2017
-
[SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are…
… successfully removed ## What changes were proposed in this pull request? We put staging paths to delete into the deleteOnExit cache of `FileSystem` in case a path can't be successfully removed. But when we do successfully remove the path, we don't remove it from the cache. We should, to avoid the cache growing continually. ## How was this patch tested? Added a test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#18934 from viirya/SPARK-21721. (cherry picked from commit 4c3cf1c) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
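The leak and the fix in miniature, as a Python stub (the stub class and method names are invented for illustration):

```python
class FileSystemStub:
    """Tracks a deleteOnExit cache the way the description above outlines."""
    def __init__(self):
        self.files = set()
        self.delete_on_exit = set()

    def create(self, path):
        self.files.add(path)

    def mark_delete_on_exit(self, path):
        self.delete_on_exit.add(path)

    def delete(self, path):
        self.files.discard(path)
        # The fix: a successfully deleted path must also leave the cache,
        # otherwise the cache grows for the lifetime of the process
        self.delete_on_exit.discard(path)

fs = FileSystemStub()
fs.create(".staging-0")
fs.mark_delete_on_exit(".staging-0")
fs.delete(".staging-0")
print(len(fs.delete_on_exit))  # 0
```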
Commit: d9c8e62
Commits on Aug 16, 2017
-
[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)
Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon (Maybe the usage should be forbidden when writing, in a major version change?). Manual test, that loading and writing LibSVM files work fine, both with and without the numFeatures option. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz> Closes apache#18872 from ProtD/master. (cherry picked from commit 8321c14) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: f1accc8
-
[SPARK-21656][CORE] spark dynamic allocation should not idle timeout …
…executors when tasks still to run ## What changes were proposed in this pull request? Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark let them go when they were idle but really needed. I have seen this issue when the scheduler was waiting to get node locality, which takes longer than the default idle timeout. In these jobs the number of executors goes down really low (fewer than 10) but there are still around 80,000 tasks to run. We should consider not allowing executors to idle timeout if they are still needed according to the number of tasks to be run. ## How was this patch tested? Tested by manually adding executors to the `executorsIdsToBeRemoved` list and seeing if those executors were removed when there are a lot of tasks and a high `numExecutorsTarget` value. Code used In `ExecutorAllocationManager.start()` ``` start_time = clock.getTimeMillis() ``` In `ExecutorAllocationManager.schedule()` ``` val executorIdsToBeRemoved = ArrayBuffer[String]() if ( now > start_time + 1000 * 60 * 2) { logInfo("--- REMOVING 1/2 of the EXECUTORS ---") start_time += 1000 * 60 * 100 var counter = 0 for (x <- executorIds) { counter += 1 if (counter == 2) { counter = 0 executorIdsToBeRemoved += x } } } ``` Author: John Lee <jlee2@yahoo-inc.com> Closes apache#18874 from yoonlee95/SPARK-21656. (cherry picked from commit adf005d) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
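The rule the change implements can be stated simply: an idle executor is removable only when the remaining executors still cover the outstanding tasks. A hedged Python sketch of that check (function and parameter names invented; Spark's actual accounting is richer):

```python
def executors_needed(pending_tasks, running_tasks, tasks_per_executor):
    # Ceiling division: executors required to cover all outstanding work
    outstanding = pending_tasks + running_tasks
    return -(-outstanding // tasks_per_executor)

def may_remove_idle(num_executors, pending_tasks, running_tasks, tasks_per_executor):
    # Keep an idle executor alive while outstanding tasks still need it
    return num_executors > executors_needed(
        pending_tasks, running_tasks, tasks_per_executor)

print(may_remove_idle(10, 80_000, 40, 4))  # False: far more executors still needed
print(may_remove_idle(10, 0, 0, 4))        # True: nothing left to run
```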
John Lee authored and Tom Graves committed Aug 16, 2017
Commit: f5ede0d
-
[SPARK-18464][SQL][BACKPORT] support old table which doesn't store sc…
…hema in table properties backport apache#18907 to branch 2.2 Author: Wenchen Fan <wenchen@databricks.com> Closes apache#18963 from cloud-fan/backport.
Commit: 2a96975
-
Commit: 851e162
Commits on Aug 18, 2017
-
[SPARK-21739][SQL] Cast expression should initialize timezoneId when …
…it is called statically to convert something into TimestampType ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739 This issue is caused by introducing TimeZoneAwareExpression. When the **Cast** expression converts something into TimestampType, it should be resolved with setting `timezoneId`. In general, it is resolved in LogicalPlan phase. However, there are still some places that use Cast expression statically to convert datatypes without setting `timezoneId`. In such cases, `NoSuchElementException: None.get` will be thrown for TimestampType. This PR is proposed to fix the issue. We have checked the whole project and found two such usages(i.e., in`TableReader` and `HiveTableScanExec`). ## How was this patch tested? unit test Author: donnyzone <wellfengzhu@gmail.com> Closes apache#18960 from DonnyZone/spark-21739. (cherry picked from commit 310454b) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: fdea642
Commits on Aug 20, 2017
-
[MINOR] Correct validateAndTransformSchema in GaussianMixture and AFT…
…SurvivalRegression ## What changes were proposed in this pull request? The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema, hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol. ## How was this patch tested? Manually. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Cédric Pelvet <cedric.pelvet@gmail.com> Closes apache#18980 from sharp-pixel/master. (cherry picked from commit 73e04ec) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 6c2a38a
-
[SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when…
… paths are successfully removed ## What changes were proposed in this pull request? Fix a typo in test. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#19005 from viirya/SPARK-21721-followup. (cherry picked from commit 28a6cca) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 0f640e9
Commits on Aug 21, 2017
-
Commit: b8d83ee
-
[SPARK-21617][SQL] Store correct table metadata when altering schema …
…in Hive metastore. For Hive tables, the current "replace the schema" code is the correct path, except that an exception in that path should result in an error, and not in retrying in a different way. For data source tables, Spark may generate a non-compatible Hive table; but for that to work with Hive 2.1, the detection of data source tables needs to be fixed in the Hive client, to also consider the raw tables used by code such as `alterTableSchema`. Tested with existing and added unit tests (plus internal tests with a 2.1 metastore). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18849 from vanzin/SPARK-21617. (cherry picked from commit 84b5b16) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 526087f
Commits on Aug 23, 2017
-
Commit: 4876824
Commits on Aug 24, 2017
-
[SPARK-21805][SPARKR] Disable R vignettes code on Windows
## What changes were proposed in this pull request? Code in vignettes requires winutils on windows to run. When publishing to CRAN or building from source, winutils might not be available, so it's better to disable running the code (the resulting vignettes will not have output from code, but the text and code are still there). fix * checking re-building of vignette outputs ... WARNING and > %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir. ## How was this patch tested? jenkins, appveyor, r-hub before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log Author: Felix Cheung <felixcheung_m@hotmail.com> Closes apache#19016 from felixcheung/rvigwind. (cherry picked from commit 43cbfad) Signed-off-by: Felix Cheung <felixcheung@apache.org>
Commit: 236b2f4
-
[SPARK-21826][SQL] outer broadcast hash join should not throw NPE
This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 . Non-equal join condition should only be applied when the equal-join condition matches. regression test Author: Wenchen Fan <wenchen@databricks.com> Closes apache#19036 from cloud-fan/bug. (cherry picked from commit 2dd37d8) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
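The invariant the fix restores can be illustrated outside Spark: in a left outer hash join, the extra non-equi condition must be evaluated only for rows whose keys actually matched; unmatched rows still emit nulls. A minimal Python sketch (hypothetical dict-based row format, not Spark's `BroadcastHashJoinExec`):

```python
def left_outer_hash_join(left, right, key, extra_cond):
    # Build the hash table on the (broadcast) right side.
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)

    out = []
    for l in left:
        matched = False
        for r in table.get(l[key], []):
            # The non-equi condition is applied ONLY to rows whose equal-join
            # keys matched; applying it to a missing match caused the NPE.
            if extra_cond(l, r):
                out.append((l, r))
                matched = True
        if not matched:
            # Outer side: emit nulls without ever touching extra_cond.
            out.append((l, None))
    return out
```

Note that a row whose key matches but whose non-equi condition fails still surfaces as `(row, None)`, preserving outer-join semantics.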
(commit a585367)
-
[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureS…
…td contains zero (backport PR for 2.2) ## What changes were proposed in this pull request? This is backport PR of apache#18896 fix bug of MLOR do not work correctly when featureStd contains zero We can reproduce the bug through such dataset (features including zero variance), will generate wrong result (all coefficients becomes 0) ``` val multinomialDatasetWithZeroVar = { val nPoints = 100 val coefficients = Array( -0.57997, 0.912083, -0.371077, -0.16624, -0.84355, -0.048509) val xMean = Array(5.843, 3.0) val xVariance = Array(0.6856, 0.0) // including zero variance val testData = generateMultinomialLogisticInput( coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0)) df.cache() df } ``` ## How was this patch tested? testcase added. Author: WeichenXu <WeichenXu123@outlook.com> Closes apache#19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.
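The failure mode is the standardization step: dividing a feature by a standard deviation of zero. A hedged sketch of the guard (plain Python, not the MLOR implementation itself):

```python
def standardize(features, std):
    # Zero-variance features carry no information; dividing by std == 0.0
    # yields inf/nan and, in the buggy path, ended up zeroing every
    # coefficient. The fix is to skip (emit 0.0 for) those dimensions.
    return [x / s if s != 0.0 else 0.0 for x, s in zip(features, std)]
```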
(commit 2b4bd79)
Commits on Aug 25, 2017
- commit 4e7d45e
Commits on Aug 28, 2017
-
[SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.vari…
…ance generate negative result Because of numerical error, MultivariateOnlineSummarizer.variance is possible to generate negative variance. **This is a serious bug because many algos in MLLib** **use stddev computed from** `sqrt(variance)` **it will generate NaN and crash the whole algorithm.** we can reproduce this bug use the following code: ``` val summarizer1 = (new MultivariateOnlineSummarizer) .add(Vectors.dense(3.0), 0.7) val summarizer2 = (new MultivariateOnlineSummarizer) .add(Vectors.dense(3.0), 0.4) val summarizer3 = (new MultivariateOnlineSummarizer) .add(Vectors.dense(3.0), 0.5) val summarizer4 = (new MultivariateOnlineSummarizer) .add(Vectors.dense(3.0), 0.4) val summarizer = summarizer1 .merge(summarizer2) .merge(summarizer3) .merge(summarizer4) println(summarizer.variance(0)) ``` This PR fix the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and several places in `WeightedLeastSquares` test cases added. Author: WeichenXu <WeichenXu123@outlook.com> Closes apache#19029 from WeichenXu123/fix_summarizer_var_bug. (cherry picked from commit 0456b40) Signed-off-by: Sean Owen <sowen@cloudera.com>
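The underlying numerics can be shown in a few lines: the textbook `E[x^2] - E[x]^2` form cancels catastrophically when all points are (nearly) equal, so the result must be clamped at zero before anything downstream takes a square root. A sketch under assumed weighted-moment inputs (not the summarizer's actual merge code):

```python
def safe_variance(weighted_sq_sum, weighted_sum, weight_sum):
    # E[x^2] - E[x]^2 can dip slightly below zero through floating-point
    # cancellation; clamping keeps the downstream sqrt(variance) from
    # producing NaN and crashing the whole algorithm.
    mean = weighted_sum / weight_sum
    return max(weighted_sq_sum / weight_sum - mean * mean, 0.0)
```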
(commit 0d4ef2f)
-
[SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config …
…for launching daemons like History Server History Server Launch uses SparkClassCommandBuilder for launching the server. It is observed that SPARK_CLASSPATH has been removed and deprecated. For spark-submit this takes a different route and spark.driver.extraClasspath takes care of specifying additional jars in the classpath that were previously specified in the SPARK_CLASSPATH. Right now the only way specify the additional jars for launching daemons such as history server is using SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html) but this I presume is a distribution classpath. It would be nice to have a similar config like spark.driver.extraClasspath for launching daemons similar to history server. Added new environment variable SPARK_DAEMON_CLASSPATH to set classpath for launching daemons. Tested and verified for History Server and Standalone Mode. ## How was this patch tested? Initially, history server start script would fail for the reason being that it could not find the required jars for launching the server in the java classpath. Same was true for running Master and Worker in standalone mode. By adding the environment variable SPARK_DAEMON_CLASSPATH to the java classpath, both the daemons(History Server, Standalone daemons) are starting up and running. Author: pgandhi <pgandhi@yahoo-inc.com> Author: pgandhi999 <parthkgandhi9@gmail.com> Closes apache#19047 from pgandhi999/master. (cherry picked from commit 24e6c18) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
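As a hedged usage sketch (the jar directory below is a placeholder, not a value from the patch), the new variable would be set in `conf/spark-env.sh` before starting a daemon:

```shell
# conf/spark-env.sh -- hypothetical extra-jar directory
export SPARK_DAEMON_CLASSPATH="/opt/spark-daemon-jars/*"
```

A subsequent `sbin/start-history-server.sh` (or standalone master/worker start script) would then see those jars on the daemon's classpath, without touching `SPARK_DIST_CLASSPATH`.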
pgandhi authored and Tom Graves committed Aug 28, 2017 (commit 59bb7eb)
-
commit 24baf03
Commits on Aug 29, 2017
-
[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resour…
…ces in yarn client mode ## What changes were proposed in this pull request? This is a backport PR to fix issue of re-uploading remote resource in yarn client mode. The original PR is apache#18962. ## How was this patch tested? Tested in local UT. Author: jerryshao <sshao@hortonworks.com> Closes apache#19074 from jerryshao/SPARK-21714-2.2-backport.
(commit 59529b2)
-
Revert "[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remot…
…e resources in yarn client mode" This reverts commit 59529b2.
Marcelo Vanzin committed Aug 29, 2017 (commit 917fe66)
Commits on Aug 30, 2017
-
[SPARK-21254][WEBUI] History UI performance fixes
## This is a backport of PR apache#18783 to the latest released branch 2.2. ## What changes were proposed in this pull request? As described in JIRA ticket, History page is taking ~1min to load for cases when amount of jobs is 10k+. Most of the time is currently being spent on DOM manipulations and all additional costs implied by this (browser repaints and reflows). PR's goal is not to change any behavior but to optimize time of History UI rendering: 1. The most costly operation is setting `innerHTML` for `duration` column within a loop, which is [extremely unperformant](https://jsperf.com/jquery-append-vs-html-list-performance/24). [Refactoring ](criteo-forks@b7e56ee) this helped to get page load time **down to 10-15s** 2. Second big gain bringing page load time **down to 4s** was [was achieved](criteo-forks@3630ca2) by detaching table's DOM before parsing it with DataTables jQuery plugin. 3. Another chunk of improvements ([1]criteo-forks@aeeeeb5), [2](criteo-forks@e25be9a), [3](criteo-forks@9169707)) was focused on removing unnecessary DOM manipulations that in total contributed ~250ms to page load time. ## How was this patch tested? Tested by existing Selenium tests in `org.apache.spark.deploy.history.HistoryServerSuite`. Changes were also tested on Criteo's spark-2.1 fork with 20k+ number of rows in the table, reducing load time to 4s. Author: Dmitry Parfenchik <d.parfenchik@criteo.com> Closes apache#18860 from 2ooom/history-ui-perf-fix-2.2.
(commit a6a9944)
-
commit 952c577
-
[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resour…
…ces in yarn client mode ## What changes were proposed in this pull request? This is a backport PR to fix issue of re-uploading remote resource in yarn client mode. The original PR is apache#18962. ## How was this patch tested? Tested in local UT. Author: jerryshao <sshao@hortonworks.com> Closes apache#19074 from jerryshao/SPARK-21714-2.2-backport.
(commit d10c9dc)
-
[SPARK-21834] Incorrect executor request in case of dynamic allocation
## What changes were proposed in this pull request? killExecutor api currently does not allow killing an executor without updating the total number of executors needed. In case of dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed ( see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635) which is incorrect because the allocator already takes care of setting the required number of executors itself. ## How was this patch tested? Ran a job on the cluster and made sure the executor request is correct Author: Sital Kedia <skedia@fb.com> Closes apache#19081 from sitalkedia/skedia/oss_fix_executor_allocation. (cherry picked from commit 6949a9c) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Sital Kedia authored and Marcelo Vanzin committed Aug 30, 2017 (commit 14054ff)
Commits on Aug 31, 2017
- commit c412c77
Commits on Sep 1, 2017
-
[SPARK-21884][SPARK-21477][BACKPORT-2.2][SQL] Mark LocalTableScanExec…
…'s input data transient This PR is to backport apache#18686 for resolving the issue in apache#19094 --- ## What changes were proposed in this pull request? This PR is to mark the parameter `rows` and `unsafeRow` of LocalTableScanExec transient. It can avoid serializing the unneeded objects. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes apache#19101 from gatorsmile/backport-21477.
(commit 50f86e1)
Commits on Sep 4, 2017
-
[SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScan…
…Exec with sun.io.serialization.extendedDebugInfo=true ## What changes were proposed in this pull request? If no SparkConf is available to Utils.redact, simply don't redact. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes apache#19123 from srowen/SPARK-21418. (cherry picked from commit ca59445) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(commit fb1b5f0)
Commits on Sep 5, 2017
- commit d0df025
-
[SPARK-21925] Update trigger interval documentation in docs with beha…
…vior change in Spark 2.2 Forgot to update docs with behavior change. Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#19138 from brkyvz/trigger-doc-fix. (cherry picked from commit 8c954d2) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
(commit 1f7c486)
-
[MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources
## What changes were proposed in this pull request? All built-in data sources support `Partition Discovery`. We should update the document to state this clearly for users. **AFTER** <img width="906" alt="1" src="https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png"> ## How was this patch tested? ``` SKIP_API=1 jekyll serve --watch ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#19139 from dongjoon-hyun/partitiondiscovery. (cherry picked from commit 9e451bc) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(commit 7da8fbf)
Commits on Sep 6, 2017
-
[SPARK-21924][DOCS] Update structured streaming programming guide doc
## What changes were proposed in this pull request? Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." as follow "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." under the programming structured streaming programming guide. Author: Riccardo Corbella <r.corbella@reply.it> Closes apache#19137 from riccardocorbella/bugfix. (cherry picked from commit 4ee7dfe) Signed-off-by: Sean Owen <sowen@cloudera.com>
(commit 9afab9a)
-
commit a7d0b0a
-
[SPARK-21901][SS] Define toString for StateOperatorProgress
## What changes were proposed in this pull request? Just `StateOperatorProgress.toString` + few formatting fixes ## How was this patch tested? Local build. Waiting for OK from Jenkins. Author: Jacek Laskowski <jacek@japila.pl> Closes apache#19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString. (cherry picked from commit fa0092b) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
(commit 342cc2a)
Commits on Sep 7, 2017
-
Fixed pandoc dependency issue in python/setup.py
## Problem Description When pyspark is listed as a dependency of another package, installing the other package will cause an install failure in pyspark. When the other package is being installed, pyspark's setup_requires requirements are installed including pypandoc. Thus, the exception handling on setup.py:152 does not work because the pypandoc module is indeed available. However, the pypandoc.convert() function fails if pandoc itself is not installed (in our use cases it is not). This raises an OSError that is not handled, and setup fails. The following is a sample failure: ``` $ which pandoc $ pip freeze | grep pypandoc pypandoc==1.4 $ pip install pyspark Collecting pyspark Downloading pyspark-2.2.0.post0.tar.gz (188.3MB) 100% |████████████████████████████████| 188.3MB 16.8MB/s Complete output from command python setup.py egg_info: Maybe try: sudo apt-get install pandoc See http://johnmacfarlane.net/pandoc/installing.html for installation options --------------------------------------------------------------- Traceback (most recent call last): File "<string>", line 1, in <module> File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module> long_description = pypandoc.convert('README.md', 'rst') File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert outputfile=outputfile, filters=filters) File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input _ensure_pandoc_path() File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path raise OSError("No pandoc was found: either install pandoc and add it\n" OSError: No pandoc was found: either install pandoc and add it to your PATH or or call pypandoc.download_pandoc(...) or install pypandoc wheels with included pandoc. 
---------------------------------------- Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/ ``` ## What changes were proposed in this pull request? This change simply adds an additional exception handler for the OSError that is raised. This allows pyspark to be installed client-side without requiring pandoc to be installed. ## How was this patch tested? I tested this by building a wheel package of pyspark with the change applied. Then, in a clean virtual environment with pypandoc installed but pandoc not available on the system, I installed pyspark from the wheel. Here is the output ``` $ pip freeze | grep pypandoc pypandoc==1.4 $ which pandoc $ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0) Installing collected packages: pyspark Successfully installed pyspark-2.3.0.dev0 ``` Author: Tucker Beck <tucker.beck@rentrakmail.com> Closes apache#18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py. (cherry picked from commit aad2125) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
(commit 49968de)
-
[SPARK-21890] Credentials not being passed to add the tokens
## What changes were proposed in this pull request? I observed this while running a oozie job trying to connect to hbase via spark. It look like the creds are not being passed in thehttps://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for 2.2 release. More Info as to why it fails on secure grid: Oozie client gets the necessary tokens the application needs before launching. It passes those tokens along to the oozie launcher job (MR job) which will then actually call the Spark client to launch the spark app and pass the tokens along. The oozie launcher job cannot get anymore tokens because all it has is tokens ( you can't get tokens with tokens, you need tgt or keytab). The error here is because the launcher job runs the Spark Client to submit the spark job but the spark client doesn't see that it already has the hdfs tokens so it tries to get more, which ends with the exception. There was a change with SPARK-19021 to generalize the hdfs credentials provider that changed it so we don't pass the existing credentials into the call to get tokens so it doesn't realize it already has the necessary tokens. https://issues.apache.org/jira/browse/SPARK-21890 Modified to pass creds to get delegation tokens ## How was this patch tested? Manual testing on our secure cluster Author: Sanket Chintapalli <schintap@yahoo-inc.com> Closes apache#19103 from redsanket/SPARK-21890.
Sanket Chintapalli authored and Marcelo Vanzin committed Sep 7, 2017 (commit 0848df1)
Commits on Sep 8, 2017
-
[SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should s…
…top SparkContext. ## What changes were proposed in this pull request? `pyspark.sql.tests.SQLTests2` doesn't stop newly created spark context in the test and it might affect the following tests. This pr makes `pyspark.sql.tests.SQLTests2` stop `SparkContext`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes apache#19158 from ueshin/issues/SPARK-21950. (cherry picked from commit 57bc1e9) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit 4304d0b)
-
[SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing
dongjoon-hyun HyukjinKwon Error in PySpark example code: /examples/src/main/python/ml/estimator_transformer_param_example.py The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. This code has been tested in Python, and returns values consistent with Scala ## What changes were proposed in this pull request? Proposing to call the lr variable instead of model1 or model2 ## How was this patch tested? This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. 
confidence) column name.'): 'rawPrediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55, Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30, Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True} Please review http://spark.apache.org/contributing.html before opening a pull request. Author: MarkTab marktab.net <marktab@users.noreply.github.com> Closes apache#19152 from marktab/branch-2.2.
(commit 781a1f8)
-
[SPARK-21936][SQL][2.2] backward compatibility test framework for Hiv…
…eExternalCatalog backport apache#19148 to 2.2 Author: Wenchen Fan <wenchen@databricks.com> Closes apache#19163 from cloud-fan/test.
(commit 08cb06a)
-
[SPARK-21946][TEST] fix flaky test: "alter table: rename cached table…
…" in InMemoryCatalogedDDLSuite ## What changes were proposed in this pull request? This PR fixes flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`. Since this test validates distributed DataFrame, the result should be checked by using `checkAnswer`. The original version used `df.collect().Seq` method that does not guaranty an order of each element of the result. ## How was this patch tested? Use existing test case Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes apache#19159 from kiszk/SPARK-21946. (cherry picked from commit 8a4f228) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
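The general pattern behind the fix is easy to show outside Spark: `collect()` on a distributed DataFrame gives no row-order guarantee, so a stable test compares results as multisets. A minimal sketch of the role `checkAnswer` plays:

```python
from collections import Counter

def same_rows(actual, expected):
    # Order-insensitive, duplicate-aware comparison of row tuples --
    # what a flaky assertEqual on collect() output should have been.
    return Counter(actual) == Counter(expected)
```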
(commit 9ae7c96)
-
[SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "met…
…astore_db" before listing files in R tests ## What changes were proposed in this pull request? This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying. ## How was this patch tested? Manually running multiple times R tests via `./R/run-tests.sh`. **Before** Second run: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ....................................................................................................1234....................... Failed ------------------------------------------------------------------------- 1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384) length(list1) not equal to length(list2). 1/1 mismatches [1] 25 - 23 == 2 2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384) sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). 
10/25 mismatches x[16]: "metastore_db" y[16]: "pkg" x[17]: "pkg" y[17]: "R" x[18]: "R" y[18]: "README.md" x[19]: "README.md" y[19]: "run-tests.sh" x[20]: "run-tests.sh" y[20]: "SparkR_2.2.0.tar.gz" x[21]: "metastore_db" y[21]: "pkg" x[22]: "pkg" y[22]: "R" x[23]: "R" y[23]: "README.md" x[24]: "README.md" y[24]: "run-tests.sh" x[25]: "run-tests.sh" y[25]: "SparkR_2.2.0.tar.gz" 3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388) length(list1) not equal to length(list2). 1/1 mismatches [1] 25 - 23 == 2 4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388) sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). 10/25 mismatches x[16]: "metastore_db" y[16]: "pkg" x[17]: "pkg" y[17]: "R" x[18]: "R" y[18]: "README.md" x[19]: "README.md" y[19]: "run-tests.sh" x[20]: "run-tests.sh" y[20]: "SparkR_2.2.0.tar.gz" x[21]: "metastore_db" y[21]: "pkg" x[22]: "pkg" y[22]: "R" x[23]: "R" y[23]: "README.md" x[24]: "README.md" y[24]: "run-tests.sh" x[25]: "run-tests.sh" y[25]: "SparkR_2.2.0.tar.gz" DONE =========================================================================== ``` **After** Second run: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... 
............................................................................................................................................................... ............................................................................................................................... ``` Author: hyukjinkwon <gurwls223gmail.com> Closes apache#18335 from HyukjinKwon/SPARK-21128. Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#19166 from felixcheung/rbackport21128.
(commit 9876821)
Commits on Sep 9, 2017
-
[SPARK-21954][SQL] JacksonUtils should verify MapType's value type in…
…stead of key type ## What changes were proposed in this pull request? `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys. Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#19167 from viirya/test-jacksonutils. (cherry picked from commit 6b45d7e) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
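Python's `json` module shows the same asymmetry the fix relies on: map keys are coerced to strings on output, so only the value type needs schema verification:

```python
import json

# The integer key is rendered via string coercion (analogous to toString
# in the Spark generator); only the values must be serializable types.
doc = json.dumps({1: [2, 3]})
```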
(commit 182478e)
Commits on Sep 10, 2017
-
[SPARK-20098][PYSPARK] dataType's typeName fix
## What changes were proposed in this pull request? `typeName` classmethod has been fixed by using type -> typeName map. ## How was this patch tested? local build Author: Peter Szalai <szalaipeti.vagyok@gmail.com> Closes apache#17435 from szalai1/datatype-gettype-fix. (cherry picked from commit 520d92a) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
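A hedged sketch of the "type -> typeName map" approach (illustrative class names, not PySpark's actual `types.py`):

```python
# Explicit map consulted first; anything not listed falls back to the
# old strip-"Type"-and-lowercase derivation.
_TYPE_NAME = {"IntegerType": "integer", "LongType": "long", "StringType": "string"}

class DataType:
    @classmethod
    def typeName(cls):
        name = cls.__name__
        if name in _TYPE_NAME:
            return _TYPE_NAME[name]
        return name[:-len("Type")].lower() if name.endswith("Type") else name.lower()

class IntegerType(DataType):
    pass
```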
(commit b1b5a7f)
Commits on Sep 12, 2017
-
[SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error.
## What changes were proposed in this pull request? Fixed wrong documentation for Mean Absolute Error. Even though the code is correct for the MAE: ```scala Since("1.2.0") def meanAbsoluteError: Double = { summary.normL1(1) / summary.count } ``` In the documentation the division by N is missing. ## How was this patch tested? All of spark tests were run. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: FavioVazquez <favio.vazquezp@gmail.com> Author: faviovazquez <favio.vazquezp@gmail.com> Author: Favio André Vázquez <favio.vazquezp@gmail.com> Closes apache#19190 from FavioVazquez/mae-fix. (cherry picked from commit e2ac2f1) Signed-off-by: Sean Owen <sowen@cloudera.com>
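The corrected formula, written out in plain Python to show the division by N that the documentation omitted:

```python
def mean_absolute_error(predictions, labels):
    # MAE = (1/N) * sum(|prediction - label|),
    # matching the code's summary.normL1(1) / summary.count.
    errors = [abs(p - y) for p, y in zip(predictions, labels)]
    return sum(errors) / len(errors)
```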
(commit 10c6836)
-
[DOCS] Fix unreachable links in the document
## What changes were proposed in this pull request? Recently, I found two unreachable links in the document and fixed them. Because of small changes related to the document, I don't file this issue in JIRA but please suggest I should do it if you think it's needed. ## How was this patch tested? Tested manually. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes apache#19195 from sarutak/fix-unreachable-link. (cherry picked from commit 9575582) Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 63098dc
-
Commit: c66ddce
-
[SPARK-18608][ML] Fix double caching
## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algorithms involved in this issue are fixed. Previous discussion in other PRs: apache#19107, apache#17014.
## How was this patch tested?
Existing tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes apache#19197 from zhengruifeng/double_caching.
(cherry picked from commit c5f9b89)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Commit: b606dc1
-
Commit: 30e7298
-
Commit: 7966c84
Commits on Sep 13, 2017
-
[SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-21980

This PR fixes an issue in the ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case-sensitivity configurations. The problem can be reproduced by:

`val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")`
`df.cube("a").agg(grouping("A")).show()`
## How was this patch tested?
Unit tests.

Author: donnyzone <wellfengzhu@gmail.com>
Closes apache#19202 from DonnyZone/ResolveGroupingAnalytics.
(cherry picked from commit 21c4450)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
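The failure mode above can be sketched in plain Python (hypothetical helper names, not Spark's implementation): looking up a grouping column by exact string equality breaks when the analyzer resolves names case-insensitively, so the lookup must use the same comparison semantics as resolution.

```python
def index_of_grouping_column(column, group_by_exprs, case_sensitive):
    # Exact string match fails under a case-insensitive analyzer;
    # the comparison must honor the case-sensitivity configuration
    # (the analogue of comparing with semanticEquals).
    for i, expr in enumerate(group_by_exprs):
        matches = (expr == column) if case_sensitive else (expr.lower() == column.lower())
        if matches:
            return i
    raise ValueError("grouping() column not found: " + column)

# With the default case-insensitive analyzer, grouping("A") must
# resolve to the cube column "a":
print(index_of_grouping_column("A", ["a"], case_sensitive=False))  # → 0
```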
Commit: 3a692e3
-
Commit: 0e8f032
Commits on Sep 14, 2017
-
[SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
## What changes were proposed in this pull request?
apache#19197 fixed double caching for MLlib algorithms, but missed PySpark `OneVsRest`; this PR fixes it.
## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>
Closes apache#19220 from yanboliang/SPARK-18608.
(cherry picked from commit c76153c)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
Commit: 51e5a82
Commits on Sep 17, 2017
-
[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs
## What changes were proposed in this pull request?
Fixes a bug introduced in apache#16121. In PairDeserializer, convert each batch of keys and values to lists (if they do not already have `__len__`) so that we can check that they are the same size. Normally they already are lists, so this should not have a performance impact, but the conversion is needed when repeated `zip`s are done.
## How was this patch tested?
Additional unit test.

Author: Andrew Ray <ray.andrew@gmail.com>
Closes apache#19226 from aray/SPARK-21985.
(cherry picked from commit 6adf67d)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
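The fix described above follows a common pattern: generators have no `__len__`, so each batch must be materialized before the size check. A hedged plain-Python sketch of that idea (a hypothetical helper, not PySpark's actual PairDeserializer code):

```python
def zip_batches(key_batch, val_batch):
    # Generators don't support len(), so materialize them first;
    # when the inputs are already lists this is a no-op.
    if not hasattr(key_batch, "__len__"):
        key_batch = list(key_batch)
    if not hasattr(val_batch, "__len__"):
        val_batch = list(val_batch)
    # Only now can the two sides be compared for equal length.
    if len(key_batch) != len(val_batch):
        raise ValueError("Cannot deserialize PairRDD with different number of "
                         "items in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
    return list(zip(key_batch, val_batch))

print(zip_batches(iter([1, 2]), iter(["a", "b"])))  # → [(1, 'a'), (2, 'b')]
```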
Commit: 42852bb
Commits on Sep 18, 2017
-
[SPARK-21953] Show both memory and disk bytes spilled if either is present
As written now, there must be both memory and disk bytes spilled to show either of them. If only one of those types of spill is recorded, it will be hidden.

Author: Andrew Ash <andrew@andrewash.com>
Closes apache#19164 from ash211/patch-3.
(cherry picked from commit 6308c65)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 309c401
-
[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles
## What changes were proposed in this pull request?
This PR proposes to improve the error message from:

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
    self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
    self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
```

to

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
```
## How was this patch tested?
Unit tests added in `python/pyspark/tests.py` and manual tests.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes apache#19260 from HyukjinKwon/profile-errors.
(cherry picked from commit 7c72662)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Commit: a86831d
-
[SPARK-22047][TEST] ignore HiveExternalCatalogVersionsSuite
## What changes were proposed in this pull request?
As reported in https://issues.apache.org/jira/browse/SPARK-22047, HiveExternalCatalogVersionsSuite is failing frequently; let's disable this test suite to unblock other PRs. I'm looking into the root cause.
## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#19264 from cloud-fan/test.
(cherry picked from commit 894a756)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: 48d6aef
-
Commit: 504732d
-
Commit: dfbc6a5
-
Commit: d0f83de
Commits on Sep 19, 2017
-
[SPARK-22047][FLAKY TEST] HiveExternalCatalogVersionsSuite
## What changes were proposed in this pull request?
This PR tries to download Spark for each test run, to make sure each test run is absolutely isolated.
## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#19265 from cloud-fan/test.
(cherry picked from commit 10f45b3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: d0234eb
-
[SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala
The current implementation of processingRate-total uses the wrong metric: it mistakenly uses inputRowsPerSecond instead of processedRowsPerSecond.
## What changes were proposed in this pull request?
Adjust processingRate-total to use processedRowsPerSecond instead of inputRowsPerSecond.
## How was this patch tested?
Built Spark from source with the proposed change and tested the output with the correct parameter. Before the change, the CSV metrics files for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala, the processingRate-total CSV file displayed the correct metric.

<img width="963" alt="processed rows per second" src="https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png">

Author: Taaffy <32072374+Taaffy@users.noreply.github.com>
Closes apache#19268 from Taaffy/patch-1.
(cherry picked from commit 1bc17a6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Commit: 6764408
Commits on Sep 20, 2017
-
[SPARK-22076][SQL] Expand.projections should not be a Stream
## What changes were proposed in this pull request?
Spark with Scala 2.10 fails with a group by cube:

```
spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
spark.sql("select 1 from rollup_bug group by rollup ()").show
```

It can be traced back to apache#15484, which made `Expand.projections` a lazy `Stream` for group by cube. In Scala 2.10 a `Stream` captures a lot of state, and in this case it captures the entire query plan, which has some unserializable parts. This change is also good for the master branch, as it reduces the serialized size of `Expand.projections`.
## How was this patch tested?
Manually verified with Spark built against Scala 2.10.

Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#19289 from cloud-fan/bug.
(cherry picked from commit ce6a71e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: 5d10586
-
[SPARK-21384][YARN] Spark + YARN fails with LocalFileSystem as default FS
## What changes were proposed in this pull request?
When the libraries temp directory (i.e. the __spark_libs__*.zip dir) file system and the staging dir (destination) file system are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately, making it unavailable for the NodeManager's localization. With this change, the client always copies the files to the remote staging directory when the source scheme is "file".
## How was this patch tested?
I have verified it manually in yarn/cluster and yarn/client modes with HDFS and local file systems.

Author: Devaraj K <devaraj@apache.org>
Closes apache#19141 from devaraj-kavali/SPARK-21384.
(cherry picked from commit 55d5fa7)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Devaraj K authored and Marcelo Vanzin committed on Sep 20, 2017
Commit: 401ac20
Commits on Sep 21, 2017
-
[SPARK-21928][CORE] Set classloader on SerializerManager's private kryo
## What changes were proposed in this pull request?
We have to make sure that SerializerManager's private instance of Kryo also uses the right classloader, regardless of the current thread's classloader. In particular, this fixes serde during remote cache fetches, as those occur in Netty threads.
## How was this patch tested?
Manual tests and the existing suite via Jenkins. I haven't been able to reproduce this in a unit test, because when a remote RDD partition can be fetched, there is a warning message and then the partition is just recomputed locally. I manually verified the warning message is no longer present.

Author: Imran Rashid <irashid@cloudera.com>
Closes apache#19280 from squito/SPARK-21928_ser_classloader.
(cherry picked from commit b75bd17)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Commit: 765fd92
Commits on Sep 22, 2017
-
[SPARK-22094][SS] processAllAvailable should check the query state
`processAllAvailable` should also check the query state; if the query is stopped, it should return.

Tested with a new unit test.

Author: Shixiong Zhu <zsxwing@gmail.com>
Closes apache#19314 from zsxwing/SPARK-22094.
(cherry picked from commit fedf696)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
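The pattern behind this fix can be sketched in plain Python with threading (hypothetical names, not the actual StreamExecution code): a blocking wait must also watch a "stopped" flag, so callers don't hang on a query that will never make progress.

```python
import threading

class Query:
    def __init__(self):
        self._cv = threading.Condition()
        self._all_available = False
        self._stopped = False

    def stop(self):
        with self._cv:
            self._stopped = True
            self._cv.notify_all()  # wake any callers blocked in the wait below

    def process_all_available(self):
        with self._cv:
            # Return when either all data has been processed OR the query
            # is stopped -- checking only the first condition hangs forever
            # if the query is stopped while a caller is waiting.
            while not (self._all_available or self._stopped):
                self._cv.wait()

q = Query()
threading.Timer(0.1, q.stop).start()
q.process_all_available()  # returns once stop() runs instead of hanging
```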
Commit: 090b987
-
[SPARK-22072][SPARK-22071][BUILD] Improve release build scripts
## What changes were proposed in this pull request?
Check the JDK version (with javac) and use SPARK_VERSION for publish-release.
## How was this patch tested?
Manually tried a local build with the wrong JDK / JAVA_HOME and built a local release (LFTP disabled).

Author: Holden Karau <holden@us.ibm.com>
Closes apache#19312 from holdenk/improve-release-scripts-r2.
(cherry picked from commit 8f130ad)
Signed-off-by: Holden Karau <holden@us.ibm.com>
Commit: de6274a
Commits on Sep 23, 2017
-
[SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows
## What changes were proposed in this pull request?
Fix the setup of `SPARK_JARS_DIR` on Windows, as it looks for the `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should. The RELEASE file is not included in the `pip` build of PySpark.
## How was this patch tested?
Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).

Author: Jakub Nowacki <j.s.nowacki@gmail.com>
Closes apache#19310 from jsnowacki/master.
(cherry picked from commit c11f24a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Commit: c0a34a9
-
[SPARK-22092] Reallocation in OffHeapColumnVector.reserveInternal corrupts struct and array data
`OffHeapColumnVector.reserveInternal()` will only copy already-inserted values during reallocation if `data != null`. For vectors containing arrays or structs this is incorrect, since the field `data` is not used at all there. We need to check `nulls` instead. Adds new tests to `ColumnVectorSuite` that reproduce the errors.

Author: Ala Luszczak <ala@databricks.com>
Closes apache#19323 from ala/port-vector-realloc.
Commit: 1a829df
-
[SPARK-22109][SQL][BRANCH-2.2] Resolves type conflicts between strings and timestamps in partition column
## What changes were proposed in this pull request?
This PR backports apache@04975a6 into branch-2.2.
## How was this patch tested?
Unit tests in `ParquetPartitionDiscoverySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes apache#19333 from HyukjinKwon/SPARK-22109-backport-2.2.
Commit: 211d81b
Commits on Sep 25, 2017
-
[SPARK-22107] Change as to alias in python quickstart
## What changes were proposed in this pull request?
Updated the docs so that a line of Python in the quick start guide executes. Closes apache#19283.
## How was this patch tested?
Existing tests.

Author: John O'Leary <jgoleary@gmail.com>
Closes apache#19326 from jgoleary/issues/22107.
(cherry picked from commit 20adf9a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Commit: 8acce00
-
[SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace
## What changes were proposed in this pull request?
MemoryStore.evictBlocksToFreeSpace acquires write locks for all the blocks it intends to evict up front. If there is a failure to evict blocks (e.g., some failure dropping a block to disk), then we have to release the lock. Otherwise the lock is never released, and an executor trying to get the lock will wait forever.
## How was this patch tested?
Added unit test.

Author: Imran Rashid <irashid@cloudera.com>
Closes apache#19311 from squito/SPARK-22083.
(cherry picked from commit 2c5b9b1)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
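The pattern described above, acquiring all locks up front and guaranteeing release on every failure path, can be sketched in plain Python (hypothetical names, not Spark's MemoryStore code):

```python
import threading

def evict_blocks(blocks, drop_to_disk):
    """Acquire a lock per block up front; if dropping any block fails,
    release the locks on the blocks not yet dropped so other threads
    don't wait forever."""
    acquired = [b for b in blocks if b["lock"].acquire(blocking=False)]
    dropped = []
    try:
        for b in acquired:
            drop_to_disk(b)       # may raise, e.g. on a disk error
            dropped.append(b)
            b["lock"].release()   # drop succeeded: lock released normally
    finally:
        for b in acquired:
            if b not in dropped:
                b["lock"].release()  # failure path: release remaining locks
    return [b["name"] for b in dropped]

blocks = [{"name": "rdd_0_1", "lock": threading.Lock()},
          {"name": "rdd_0_2", "lock": threading.Lock()}]
print(evict_blocks(blocks, lambda b: None))  # → ['rdd_0_1', 'rdd_0_2']
```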
Commit: 9836ea1
-
Commit: d2b369a
-
[SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive warehouse directory
## What changes were proposed in this pull request?
During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory.
## How was this patch tested?
Ran the full suite of tests locally and verified that they pass.

Author: Greg Owen <greg@databricks.com>
Closes apache#19341 from GregOwen/SPARK-22120.
(cherry picked from commit ce20478)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Commit: b0f30b5
-
Commit: 8f39361
Commits on Sep 27, 2017
-
[SPARK-22141][BACKPORT][SQL] Propagate empty relation before checking Cartesian products
Back port apache#19362 to branch-2.2.
## What changes were proposed in this pull request?
When inferring constraints from children, a Join's condition can be simplified to None. For example:

```
val testRelation = LocalRelation('a.int)
val x = testRelation.as("x")
val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
x.join.where($"x.a" === $"y.a")
```

The plan becomes

```
Join Inner
:- LocalRelation <empty>, [a#23]
+- LocalRelation <empty>, [a#224]
```

and the Cartesian products check throws an exception for this plan. Propagating the empty relation before checking for Cartesian products resolves the issue.
## How was this patch tested?
Unit test.

Author: Wang Gengliang <ltnwgl@gmail.com>
Closes apache#19366 from gengliangwang/branch-2.2.
Commit: a406473
-
Commit: 6dbda6e
-
Commit: 28ae8fd
Commits on Sep 28, 2017
-
Commit: ef02a07