
SPARK-1429: Debian packaging #2

Closed
wants to merge 1,270 commits

Conversation


@ianlcsd ianlcsd commented Sep 28, 2017

What changes were proposed in this pull request?

Debian packaging for CSD's deployment needs.


How was this patch tested?

Maven build.


srowen and others added 30 commits June 13, 2017 10:48
…bles with many partitions

## What changes were proposed in this pull request?

Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#18216 from srowen/SPARK-20920.

(cherry picked from commit 7b7c85e)
Signed-off-by: Sean Owen <sowen@cloudera.com>
… executor page. The CSS style of the paging controls on the executor page is wrong; it differs from the history server UI's paging style.

## What changes were proposed in this pull request?

The CSS style of the paging controls on the executor page is wrong; it differs from the paging style of the history server UI.

**Their styles should be consistent.** There are three reasons:

1. 'Previous', 'Next', and the page numbers should be rendered as buttons.

2. On the first page, 'Previous' and '1' should be grayed out and not clickable.
![1](https://user-images.githubusercontent.com/26266482/27026667-1fe745ee-4f91-11e7-8b34-150819d22bd3.png)

3. On the last page, 'Next' and the last page number should be grayed out and not clickable.
![2](https://user-images.githubusercontent.com/26266482/27026811-9d8d6fa0-4f91-11e7-8b51-7816c3feb381.png)

before fix:
![fix_before](https://user-images.githubusercontent.com/26266482/27026428-47ec5c56-4f90-11e7-9dd5-d52c22d7bd36.png)

after fix:
![fix_after](https://user-images.githubusercontent.com/26266482/27026439-50d17072-4f90-11e7-8405-6f81da5ab32c.png)

The style of history server ui:
![history](https://user-images.githubusercontent.com/26266482/27026528-9c90f780-4f90-11e7-91e6-90d32651fe03.png)

## How was this patch tested?

manual tests


Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>

Closes apache#18275 from guoxiaolongzte/SPARK-21060.

(cherry picked from commit b7304f2)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…sferServiceSuite

## What changes were proposed in this pull request?

The default value of `spark.port.maxRetries` is 100,
but the suite file uses 10,
so we change it to 100 to avoid test failures.

## How was this patch tested?
No test

Author: DjvuLee <lihu@bytedance.com>

Closes apache#18280 from djvulee/NettyTestBug.

(cherry picked from commit b36ce2a)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…chmark

## What changes were proposed in this pull request?

This PR adds RateSource for Structured Streaming so that users can easily generate data for tests and benchmarks.

This source generates increasing long values with timestamps. Each generated row has two columns: a timestamp column for the generation time and an auto-incrementing long column starting at 0L.

It supports the following options:
- `rowsPerSecond` (e.g. 100, default: 1): How many rows should be generated per second.
- `rampUpTime` (e.g. 5s, default: 0s): How long to ramp up before the generating speed becomes `rowsPerSecond`. Using finer granularities than seconds will be truncated to integer seconds.
- `numPartitions` (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows. The source will try its best to reach `rowsPerSecond`, but the query may be resource constrained, and `numPartitions` can be tweaked to help reach the desired speed.

Here is a simple example that prints 10 rows per second:
```
    spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .writeStream
      .format("console")
      .start()
```
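A hedged variant exercising the other two options as well (the values are illustrative):
```
    spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")
      .option("rampUpTime", "5s")       // reach 100 rows/s after 5 seconds
      .option("numPartitions", "10")
      .load()
      .writeStream
      .format("console")
      .start()
```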

The idea came from marmbrus, who did the initial work.

## How was this patch tested?

The added tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#18199 from zsxwing/rate.
…ng from failure for Master

Currently, in Standalone HA mode, the driver's resource usage is not correctly counted by the Master when recovering from failure, which leads to unexpected behaviors such as negative values in the UI.

So this fixes the Master to also count the driver's resource usage.

It also changes the recovered app's state to `RUNNING` when fully recovered; previously the state would stay WAITING even after full recovery.

andrewor14 please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes apache#10506 from jerryshao/SPARK-12552.

(cherry picked from commit 9eb0952)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…itions rule.

## What changes were proposed in this pull request?
After the PruneFileSourcePartitions rule runs, the table's statistics need to be reset, because PruneFileSourcePartitions can filter out unnecessary partitions and the statistics must change accordingly.

## How was this patch tested?
Added a unit test.

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes apache#18205 from lianhuiwang/SPARK-20986.

(cherry picked from commit 8b5b2e2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ark 2.1

### What changes were proposed in this pull request?
Before this PR, Spark was unable to read a partitioned table created by Spark 2.1 when the table schema does not put the partitioning columns at the end of the schema.
[assert(partitionFields.map(_.name) == partitionColumnNames)](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L234-L236)

When reading the table metadata from the metastore, we also need to reorder the columns.
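A hedged sketch of the reordering using Spark's public `StructType` API (not the catalog's exact code; `reorderSchema` is a hypothetical helper):
```
import org.apache.spark.sql.types.StructType

def reorderSchema(schema: StructType, partitionColumnNames: Seq[String]): StructType = {
  // Move the partitioning columns to the end, in partition-column order,
  // so the invariant asserted above holds.
  val partitionFields = partitionColumnNames.flatMap(n => schema.fields.find(_.name == n))
  val dataFields = schema.fields.filterNot(f => partitionColumnNames.contains(f.name))
  StructType(dataFields ++ partitionFields)
}
```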

### How was this patch tested?
Added test cases to check both Hive-serde and data source tables.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#18295 from gatorsmile/reorderReadSchema.

(cherry picked from commit 0c88e8d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…al Values when the Input is BigDecimal between -1.0 and 1.0

### What changes were proposed in this pull request?

This PR is to backport apache#18244 to 2.2

---

The precision and scale of decimal values are wrong when the input is a BigDecimal between -1.0 and 1.0.

A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, per [Java's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal definition follows the database decimal standard, which is the total number of digits, including those both to the left and to the right of the decimal point. Thus, this PR fixes the issue by performing the conversion.

Before this PR, the following queries failed:
```SQL
select 1 > 0.0001
select floor(0.0001)
select ceil(0.0001)
```
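A hedged sketch of the precision conversion (illustrative, not Spark's exact code):
```
// java.math.BigDecimal counts digits from the leftmost nonzero digit:
val bd = new java.math.BigDecimal("0.0001")
assert(bd.precision == 1 && bd.scale == 4)
// The database convention counts every digit, so for |x| < 1 the precision
// must be at least the scale:
val dbPrecision = math.max(bd.precision, bd.scale)  // 4
```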

### How was this patch tested?
Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#18297 from gatorsmile/backport18244.
(cherry picked from commit 6265119)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Since both table properties and storage properties share the same key values, table properties are not shown in the output of DESC EXTENDED/FORMATTED when the storage properties are not empty.

This PR is to fix the above issue by renaming them to different keys.

Added test cases.

Author: Xiao Li <gatorsmile@gmail.com>

Closes apache#18294 from gatorsmile/tableProperties.

(cherry picked from commit df766a4)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
… JSON

The current option name `wholeFile` is misleading for CSV users: it does not mean one record per file, since one file can contain multiple records. Thus, we should rename it; the proposal is `multiLine`.
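A hedged usage sketch after the rename (the path is illustrative):
```
spark.read
  .option("multiLine", "true")   // formerly wholeFile
  .json("/path/to/records.json")
```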

N/A

Author: Xiao Li <gatorsmile@gmail.com>

Closes apache#18202 from gatorsmile/renameCVSOption.

(cherry picked from commit 2051428)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

doc only change

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#18312 from felixcheung/sqljsonwholefiledoc.

(cherry picked from commit 1bf55e3)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
…dd.LocalCheckpointSuite.missing checkpoint block fails with informative message

## What changes were proposed in this pull request?

Currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too long, the assertion in this test case fails.
The failure can easily be reproduced by sleeping for a while before removing the block in BlockManagerSlaveEndpoint.receiveAndReply().
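A hedged sketch of the waiting pattern (assuming the suite's `blockManager` and `blockId` fixtures; not the exact test code):
```
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Poll until the slave's BlockManager has really dropped the block, instead
// of asserting immediately after requesting the removal.
eventually(timeout(60.seconds), interval(100.milliseconds)) {
  assert(blockManager.getStatus(blockId).isEmpty)
}
```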

## How was this patch tested?
N/A

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes apache#18314 from jiangxb1987/LocalCheckpointSuite.

(cherry picked from commit 7dc3e69)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The test in `describe.sql` failed.

We need to fix the related bug, introduced in apache#17649, in a follow-up PR to master.

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#18316 from gatorsmile/fix.
…dren node.

## What changes were proposed in this pull request?

As the function name and comments of `TreeNode.mapChildren` indicate, the function should be applied only to the node's current children. So the code below should check whether each argument is actually a child node.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342
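A hedged, simplified fragment of the guard (assuming `containsChild`, `mapProductIterator`, and `makeCopy` from the TreeNode.scala file linked above):
```
val newArgs = mapProductIterator {
  // Apply f only to arguments that really are children of this node; other
  // TreeNode-typed fields (e.g. referenced plans) pass through unchanged.
  case arg: TreeNode[_] if containsChild(arg) => f(arg.asInstanceOf[BaseType])
  case nonChild: AnyRef => nonChild
  case null => null
}
makeCopy(newArgs)
```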

## How was this patch tested?

Existing tests.

Author: Xianyang Liu <xianyang.liu@intel.com>

Closes apache#18284 from ConeyLiu/treenode.

(cherry picked from commit 87ab0ce)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…e mismatch

## What changes were proposed in this pull request?
There is a name mismatch between 2.1/2.0 and 2.2; thus, the test cases failed after we backported a fix to 2.1/2.0. This PR fixes the issue.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#18319 from gatorsmile/fixDecimal.
…rSuite.master correctly recover the application"

## What changes were proposed in this pull request?

Due to asynchronous RPC event processing, the test "correctly recover the application" can fail intermittently. An example failure can be found here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78126/testReport/org.apache.spark.deploy.master/MasterSuite/master_correctly_recover_the_application/.

So this fixes the flaky test.

## How was this patch tested?

Existing UT.

CC cloud-fan jiangxb1987, please help to review, thanks!

Author: jerryshao <sshao@hortonworks.com>

Closes apache#18321 from jerryshao/SPARK-12552-followup.

(cherry picked from commit 2837b14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

Update the "Running R Tests" dependency packages to:
```bash
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
```

## How was this patch tested?
manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes apache#18271 from wangyum/building-spark.

(cherry picked from commit 45824fb)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…th.wait.timeout" hasn't been used in spark

[https://issues.apache.org/jira/browse/SPARK-21126](https://issues.apache.org/jira/browse/SPARK-21126)
The configuration named "spark.core.connection.auth.wait.timeout" is not used anywhere in Spark, so it should be removed from configuration.md.

Author: liuzhaokun <liu.zhaokun@zte.com.cn>

Closes apache#18333 from liu-zhaokun/new3.

(cherry picked from commit 0d8604b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…n info in AppVeyor tests

## What changes were proposed in this pull request?

This PR proposes three things as below:

**Install packages per documentation** - as far as I know, this does not affect the tests themselves (only CRAN checks, which we do not run via AppVeyor).

This adds `knitr` and `rmarkdown` per https://github.com/apache/spark/blob/45824fb608930eb461e7df53bb678c9534c183a9/R/WINDOWS.md#unit-tests (please see apache@45824fb)

**Improve logs/shorten logs** - actually, long logs can be a problem on AppVeyor (e.g., see apache#17873)

`R -e ...` prints the R startup banner on each invocation, as below:

```
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
```

Reducing the number of invocations seems slightly better, and printing the versions together is more readable.

Before:

```
# R information ...
> packageVersion('testthat')
[1] '1.0.2'
>
>

# R information ...
> packageVersion('e1071')
[1] '1.6.8'
>
>
... 3 more times
```

After:

```
# R information ...
> packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')
[1] ‘1.16’
[1] ‘1.6’
[1] ‘1.0.2’
[1] ‘1.6.8’
[1] ‘2.41.3’
```

**Add `appveyor.yml`/`dev/appveyor-install-dependencies.ps1` for triggering the test**

Changing this file might break the test, e.g., apache#16927

## How was this patch tested?

Before: please see https://ci.appveyor.com/project/HyukjinKwon/spark/build/169-master
After: please see the AppVeyor build in this PR.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#18336 from HyukjinKwon/minor-add-knitr-and-rmarkdown.

(cherry picked from commit 75a6d05)
Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request?
1. In `acquireStorageMemory`, when the memory mode is OFF_HEAP, `maxOffHeapMemory` should be replaced with `maxOffHeapStorageMemory`; after this PR it behaves the same as the ON_HEAP memory mode. Any request between `maxOffHeapStorageMemory` and `maxOffHeapMemory` is certain to fail, so a request greater than `maxOffHeapStorageMemory` (though not greater than `maxOffHeapMemory`) should fail fast.
2. When borrowing memory from execution, changing `numBytes` to `numBytes - storagePool.memoryFree` is more reasonable:
we only need to acquire the shortfall `numBytes - storagePool.memoryFree`, so borrowing the full `numBytes` from execution was unnecessary.
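A hedged sketch of both changes (simplified; the pool and field names follow the description above, not necessarily the real code):
```
def acquireStorageMemory(blockId: BlockId, numBytes: Long): Boolean = synchronized {
  // Change 1: fail fast against the storage cap, not the whole off-heap cap.
  if (numBytes > maxOffHeapStorageMemory) return false
  if (numBytes > storagePool.memoryFree) {
    // Change 2: borrow only the shortfall from execution, not all of numBytes.
    val memoryBorrowed =
      math.min(executionPool.memoryFree, numBytes - storagePool.memoryFree)
    executionPool.decrementPoolSize(memoryBorrowed)
    storagePool.incrementPoolSize(memoryBorrowed)
  }
  storagePool.acquireMemory(blockId, numBytes)
}
```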

## How was this patch tested?
Added a unit test case.

Author: liuxian <liu.xian3@zte.com.cn>

Closes apache#18296 from 10110346/wip-lx-0614.

(cherry picked from commit 112bd9b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… be silently ignored

### What changes were proposed in this pull request?
We should not silently ignore `DISTINCT` when it is not supported in function arguments. This PR blocks these cases and issues error messages.

### How was this patch tested?
Added test cases for both regular functions and window functions

Author: Xiao Li <gatorsmile@gmail.com>

Closes apache#18340 from gatorsmile/firstCount.

(cherry picked from commit 9413b84)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.
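A hedged sketch of the idea (the property list is an illustrative subset; Spark's Checkpoint keeps a similar `propertiesToReload` list):
```
import org.apache.spark.SparkConf

// Properties that must come from the current environment on restart, rather
// than from the checkpointed configuration.
val propertiesToReload = Seq("spark.yarn.credentials.file", "spark.yarn.principal")

def createSparkConf(checkpointed: Map[String, String]): SparkConf = {
  val conf = new SparkConf(loadDefaults = false)
  checkpointed.foreach { case (k, v) => conf.set(k, v) }
  propertiesToReload.foreach { k => sys.props.get(k).foreach(conf.set(k, _)) }
  conf
}
```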

## How was this patch tested?

Manually tested with 1.6.3 and 2.1.1.
I didn't test this against master because of some compile problems, but I expect the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <shi.indetail@gmail.com>

Closes apache#18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…rce are in a wrong table

## What changes were proposed in this pull request?

The descriptions of several File Source options for Structured Streaming appeared under the File Sink description instead.

This pull request has two commits: the first fixes the documentation as it appeared in Spark 2.1, and the second handles an additional option added in Spark 2.2.

## How was this patch tested?

Built the documentation with `SKIP_API=1 jekyll build` and visually inspected the Structured Streaming programming guide.

The original documentation was written by tdas and lw-lin

Author: assafmendelson <assaf.mendelson@gmail.com>

Closes apache#18342 from assafmendelson/spark-21123.

(cherry picked from commit 66a792c)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
ash211 and others added 25 commits September 18, 2017 10:42
…esent

As written now, both memory and disk bytes must have been spilled for either of them to be shown. If only one of those spill types is recorded, it is hidden.
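A hedged one-line sketch of the fix (names illustrative):
```
// Show the spill row when either metric is present, not only when both are.
val hasBytesSpilled = memoryBytesSpilled > 0 || diskBytesSpilled > 0  // was &&
```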

Author: Andrew Ash <andrew@andrewash.com>

Closes apache#19164 from ash211/patch-3.

(cherry picked from commit 6308c65)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…mp_profiles

## What changes were proposed in this pull request?

This PR proposes to improve error message from:

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
    self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
    self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
```

to

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
```

## How was this patch tested?

Unit tests added in `python/pyspark/tests.py` and manual tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#19260 from HyukjinKwon/profile-errors.

(cherry picked from commit 7c72662)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
## What changes were proposed in this pull request?

As reported in https://issues.apache.org/jira/browse/SPARK-22047, HiveExternalCatalogVersionsSuite is failing frequently. Let's disable this test suite to unblock other PRs; I'm looking into the root cause.

## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19264 from cloud-fan/test.

(cherry picked from commit 894a756)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

This PR tries to download Spark for each test run, to make sure each test run is absolutely isolated.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19265 from cloud-fan/test.

(cherry picked from commit 10f45b3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
The current implementation of processingRate-total uses the wrong metric:
it mistakenly uses inputRowsPerSecond instead of processedRowsPerSecond.

## What changes were proposed in this pull request?
Adjust processingRate-total to use processedRowsPerSecond instead of inputRowsPerSecond.
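A hedged sketch of the corrected gauge registration (the `registerGauge` shape follows MetricsReporter.scala loosely and is an assumption):
```
registerGauge("inputRate-total", () => stream.lastProgress.inputRowsPerSecond)
// The fix: report processedRowsPerSecond here instead of inputRowsPerSecond.
registerGauge("processingRate-total", () => stream.lastProgress.processedRowsPerSecond)
```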

## How was this patch tested?

Built Spark from source with the proposed change and tested the output with the correct parameter. Before the change, the CSV metrics files for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala, the processingRate-total CSV file displayed the correct metric.
<img width="963" alt="processed rows per second" src="https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png">


Author: Taaffy <32072374+Taaffy@users.noreply.github.com>

Closes apache#19268 from Taaffy/patch-1.

(cherry picked from commit 1bc17a6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request?

Spark with Scala 2.10 fails with a group by cube:
```
spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
spark.sql("select 1 from rollup_bug group by rollup ()").show
```

It can be traced back to apache#15484, which made `Expand.projections` a lazy `Stream` for group by cube.

In Scala 2.10, `Stream` captures a lot of state; in this case it captures the entire query plan, which has some non-serializable parts.

This change is also good for the master branch, as it reduces the serialized size of `Expand.projections`.

## How was this patch tested?

manually verified with Spark with Scala 2.10.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19289 from cloud-fan/bug.

(cherry picked from commit ce6a71e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…t FS

## What changes were proposed in this pull request?

When the libraries temp directory (i.e. the `__spark_libs__*.zip` dir) and the staging (destination) directory are on the same file system, the `__spark_libs__*.zip` is not copied to the staging directory. But after making this decision the libraries zip file is deleted immediately, making it unavailable for the NodeManager's localization.

With this change, the client always copies the files to the remote file system when the source scheme is "file".
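A hedged sketch of the decision (assuming a `compareFs` helper like the one in yarn's Client.scala):
```
import org.apache.hadoop.fs.FileSystem

def shouldCopyToRemote(srcFs: FileSystem, destFs: FileSystem): Boolean = {
  // Local files must always be shipped, even when the source and destination
  // filesystems compare equal; otherwise copy only across filesystems.
  val forceCopy = srcFs.getScheme == "file"
  forceCopy || !compareFs(srcFs, destFs)
}
```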

## How was this patch tested?

I have verified it manually in yarn/cluster and yarn/client modes with hdfs and local file systems.

Author: Devaraj K <devaraj@apache.org>

Closes apache#19141 from devaraj-kavali/SPARK-21384.

(cherry picked from commit 55d5fa7)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?

We have to make sure that SerializerManager's private instance of Kryo also uses the right classloader, regardless of the current thread's context classloader. In particular, this fixes serde during remote cache fetches, as those occur on netty threads.
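A hedged sketch of the classloader pinning (Kryo's `setClassLoader` is real; the surrounding wiring is illustrative):
```
import com.esotericsoftware.kryo.Kryo

def pinClassLoader(kryo: Kryo): Unit = {
  // Resolve a stable classloader instead of trusting the calling thread's
  // context classloader, which is wrong on netty IO threads.
  val loader = Option(Thread.currentThread.getContextClassLoader)
    .getOrElse(getClass.getClassLoader)
  kryo.setClassLoader(loader)
}
```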

## How was this patch tested?

Manual tests & existing suite via jenkins. I haven't been able to reproduce this in a unit test, because when a remote RDD partition cannot be fetched, there is a warning message and then the partition is just recomputed locally. I manually verified the warning message is no longer present.

Author: Imran Rashid <irashid@cloudera.com>

Closes apache#19280 from squito/SPARK-21928_ser_classloader.

(cherry picked from commit b75bd17)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
`processAllAvailable` should also check the query state: if the query is stopped, it should return.
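A hedged fragment of the guard in the wait loop (simplified; `noNewData`, `awaitProgressLockCondition`, `streamDeathCause`, and `isActive` are assumed from StreamExecution-style code):
```
import java.util.concurrent.TimeUnit

def processAllAvailable(): Unit = {
  while (!noNewData) {
    awaitProgressLockCondition.await(10, TimeUnit.SECONDS)
    if (streamDeathCause != null) throw streamDeathCause
    if (!isActive) return // query stopped: return instead of blocking forever
  }
}
```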

The new unit test.

Author: Shixiong Zhu <zsxwing@gmail.com>

Closes apache#19314 from zsxwing/SPARK-22094.

(cherry picked from commit fedf696)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
## What changes were proposed in this pull request?

Check JDK version (with javac) and use SPARK_VERSION for publish-release

## How was this patch tested?

Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled)

Author: Holden Karau <holden@us.ibm.com>

Closes apache#19312 from holdenk/improve-release-scripts-r2.

(cherry picked from commit 8f130ad)
Signed-off-by: Holden Karau <holden@us.ibm.com>
## What changes were proposed in this pull request?

Fix for the setup of `SPARK_JARS_DIR` on Windows, which looks for a `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should. The RELEASE file is not included in the `pip` build of PySpark.

## How was this patch tested?

Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).

Author: Jakub Nowacki <j.s.nowacki@gmail.com>

Closes apache#19310 from jsnowacki/master.

(cherry picked from commit c11f24a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
…rupts struct and array data

`OffHeapColumnVector.reserveInternal()` only copies already-inserted values during reallocation if `data != null`. For vectors containing arrays or structs this is incorrect, since the field `data` is not used there at all. We need to check `nulls` instead.

Adds new tests to `ColumnVectorSuite` that reproduce the errors.

Author: Ala Luszczak <ala@databricks.com>

Closes apache#19323 from ala/port-vector-realloc.
…s and timestamps in partition column

## What changes were proposed in this pull request?

This PR backports apache@04975a6 into branch-2.2.

## How was this patch tested?

Unit tests in `ParquetPartitionDiscoverySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#19333 from HyukjinKwon/SPARK-22109-backport-2.2.
## What changes were proposed in this pull request?

Updated the docs so that the line of Python in the quick start guide executes. Closes apache#19283.

## How was this patch tested?

Existing tests.

Author: John O'Leary <jgoleary@gmail.com>

Closes apache#19326 from jgoleary/issues/22107.

(cherry picked from commit 20adf9a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
## What changes were proposed in this pull request?

MemoryStore.evictBlocksToFreeSpace acquires write locks up front for all the blocks it intends to evict. If evicting a block then fails (e.g., a failure while dropping the block to disk), we have to release those locks; otherwise they are never released, and an executor trying to take one will wait forever.
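A hedged sketch of the cleanup (the structure mirrors the description above; names are illustrative, not MemoryStore's exact code):
```
var lastSuccessfulBlock = -1
try {
  selectedBlocks.zipWithIndex.foreach { case (blockId, idx) =>
    dropBlock(blockId) // may fail, e.g. while writing the block to disk
    lastSuccessfulBlock = idx
  }
} finally {
  if (lastSuccessfulBlock != selectedBlocks.size - 1) {
    // Eviction failed partway: release the write locks taken up front for the
    // blocks that were never dropped, so other tasks don't wait forever.
    selectedBlocks.drop(lastSuccessfulBlock + 1)
      .foreach(bid => blockInfoManager.unlock(bid))
  }
}
```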

## How was this patch tested?

Added unit test.

Author: Imran Rashid <irashid@cloudera.com>

Closes apache#19311 from squito/SPARK-22083.

(cherry picked from commit 2c5b9b1)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
… warehouse directory

## What changes were proposed in this pull request?
During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory.

## How was this patch tested?
Ran full suite of tests locally, verified that they pass.

Author: Greg Owen <greg@databricks.com>

Closes apache#19341 from GregOwen/SPARK-22120.

(cherry picked from commit ce20478)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
… Cartesian products

Backport of apache#19362 to branch-2.2.

## What changes were proposed in this pull request?

When inferring constraints from children, a Join's condition can be simplified to None.
For example:
```
val testRelation = LocalRelation('a.int)
val x = testRelation.as("x")
val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
x.join(y).where($"x.a" === $"y.a")
```
The plan will become
```
Join Inner
:- LocalRelation <empty>, [a#23]
+- LocalRelation <empty>, [a#224]
```
And the Cartesian products check will throw an exception for the above plan.

Propagating empty relations before checking for Cartesian products resolves the issue.

## How was this patch tested?

Unit test

Author: Wang Gengliang <ltnwgl@gmail.com>

Closes apache#19366 from gengliangwang/branch-2.2.
@markhamstra
Owner

Wrong branch. Redo against csd-2.2 instead of master.

@ianlcsd ianlcsd closed this Sep 29, 2017
markhamstra pushed a commit that referenced this pull request Nov 7, 2017
* Use tar and gzip to archive shipped jars.

* Address comments

* Move files to resolve merge
markhamstra pushed a commit that referenced this pull request Mar 30, 2018
## What changes were proposed in this pull request?

There were two related fixes regarding `from_json`, `get_json_object` and `json_tuple` ([Fix #1](apache@c8803c0), [Fix #2](apache@86174ea)), but it seems they weren't comprehensive. I wanted to extend those fixes to all the parsers and add tests for each case.

## How was this patch tested?

Regression tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#20302 from brkyvz/json-invfix.

(cherry picked from commit e01919e)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
markhamstra pushed a commit that referenced this pull request May 14, 2018