
update shortest path test to make it fail on spark 1.5 and 1.6 #23

Merged
merged 2 commits into graphframes:master on Feb 24, 2016

Conversation

mengxr
Contributor

@mengxr mengxr commented Feb 24, 2016

This could be a bug in our integral ID mapping.

I can reproduce the bug in Spark 1.6:

val df = sqlContext.range(10).select(col("id"), monotonicallyIncreasingId().as("long_id"))
df.show()

+---+-----------+
| id|    long_id|
+---+-----------+
|  0|          0|
|  1|          1|
|  2| 8589934592|
|  3| 8589934593|
|  4| 8589934594|
|  5|17179869184|
|  6|17179869185|
|  7|25769803776|
|  8|25769803777|
|  9|25769803778|
+---+-----------+

df.filter(col("id") === 3).show()

+---+----------+
| id|   long_id|
+---+----------+
|  3|8589934592|
+---+----------+

It seems that the filter is pushed down below monotonicallyIncreasingId, which is wrong.

== Physical Plan ==
Project [id#837L,monotonicallyincreasingid() AS long_id#838L]
+- Filter (id#837L = 3)
   +- Scan ExistingRDD[id#837L]
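
The semantic consequence of the incorrect pushdown can be shown without Spark at all. In this plain-Scala toy model (hypothetical names), a row's id depends on its position in the input, just as monotonicallyIncreasingId depends on partition and offset, so evaluating the filter before or after the projection yields different results:

```scala
// Toy model of a position-dependent ("nondeterministic") projection:
// each row's long_id is its position in the input, mimicking
// monotonicallyIncreasingId's dependence on partition and offset.
def addIds(rows: Seq[Int]): Seq[(Int, Long)] =
  rows.zipWithIndex.map { case (id, pos) => (id, pos.toLong) }

val data = Seq(0, 1, 2, 3)

// Correct order: project first (assign ids), then filter.
val projectThenFilter = addIds(data).filter { case (id, _) => id == 3 }

// Incorrectly pushed-down order: filter first, then assign ids.
// The surviving row now sits at position 0, so it gets a different long_id.
val filterThenProject = addIds(data.filter(_ == 3))

println(projectThenFilter) // List((3,3))
println(filterThenProject) // List((3,0))
```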

@rxin

@liancheng

I'm looking into this one.

@liancheng

Please note that, due to the implementation strategy of MonotonicallyIncreasingID, the DataFrame's partition count is crucial for reproducing this bug. The filter must select an element that belongs to a partition other than the first, and that element must not be the first element within its partition.

For example, on my 8-core laptop, where Spark's default parallelism is 8, the following code reproduces the issue:

import org.apache.spark.sql.functions._

val df = sqlContext
  .range(16) // 8 partitions, 2 elements per partition
  .select(
    col("id"),
    monotonicallyIncreasingId().as("long_id")
  )

df.show()
df.filter(col("id") === 3) // 2nd element in the 2nd partition
  .show()
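
The dependence on partitioning follows from how the id is constructed: the Spark docs describe monotonicallyIncreasingId as packing the partition index into the upper 31 bits and the record offset within the partition into the lower 33 bits. A small helper (hypothetical name) reproduces the values seen in the output above:

```scala
// monotonicallyIncreasingId packs (partition index, offset within partition):
// the upper 31 bits hold the partition id, the lower 33 bits the record offset.
def monotonicId(partition: Int, offset: Long): Long =
  (partition.toLong << 33) | offset

// With 8 partitions of 2 elements, id = 3 is the 2nd element of partition 1:
println(monotonicId(1, 1)) // 8589934593

// After the bad pushdown it becomes the only (1st) element of its partition,
// which is exactly the value the filtered query printed:
println(monotonicId(1, 0)) // 8589934592
```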

@mengxr
Contributor Author

mengxr commented Feb 24, 2016

@liancheng Could you create a Spark JIRA and post the link here? We will implement a workaround in graphframes while waiting for the official fix. Thanks!
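
For reference, one possible shape such a workaround could take (a sketch only, not necessarily what graphframes ended up shipping): generate the ids through `RDD.zipWithUniqueId`, which materializes them on the RDD side before Catalyst can reorder a later filter around them. A `sqlContext` is assumed to be in scope.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch: assign ids on the RDD side so they are fixed before any
// optimizer rewrite can move a filter below them.
val base = sqlContext.range(10)
val rowsWithIds = base.rdd.zipWithUniqueId().map { case (row, uid) =>
  Row.fromSeq(row.toSeq :+ uid)
}
val schema = StructType(
  base.schema.fields :+ StructField("long_id", LongType, nullable = false))
val stable = sqlContext.createDataFrame(rowsWithIds, schema)

// The filter can no longer change which long_id a row receives.
stable.filter(col("id") === 3).show()
```

Note that `zipWithUniqueId` does not produce consecutive or monotonically increasing ids across partitions, but for an integral vertex-id mapping uniqueness is what matters.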

@mengxr mengxr changed the title update the test to make it fail on spark 1.5 and 1.6 update shortest path test to make it fail on spark 1.5 and 1.6 Feb 24, 2016
mengxr added a commit that referenced this pull request Feb 24, 2016
update shortest path test to make it fail on spark 1.5 and 1.6
@mengxr mengxr merged commit 9cf9741 into graphframes:master Feb 24, 2016
asfgit pushed a commit to apache/spark that referenced this pull request Feb 25, 2016
…ministic field(s)

## What changes were proposed in this pull request?

Predicates shouldn't be pushed through project with nondeterministic field(s).

See graphframes/graphframes#23 and SPARK-13473 for more details.

This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan in which a filter sits above a project with a nondeterministic field; the optimized query plan should be left unchanged in this case.

Author: Cheng Lian <lian@databricks.com>

Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.
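
For readers following along, a test of this shape in `FilterPushdownSuite` plausibly looks like the sketch below, written against Catalyst's plan DSL. `testRelation`, `Optimize`, and `comparePlans` are suite-level helpers, and the exact `Rand` constructor may differ between branches, so treat this as an illustration rather than the committed test:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.Rand

// A filter above a project containing a nondeterministic field (rand).
// The optimizer must leave this plan unchanged instead of pushing the
// filter below the project.
val originalQuery = testRelation
  .select('a, Rand(10).as('rand))
  .where('a > 5)
  .analyze

comparePlans(Optimize.execute(originalQuery), originalQuery)
```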
asfgit pushed a commit to apache/spark that referenced this pull request Feb 25, 2016
…ministic field(s)

(cherry picked from commit 3fa6491)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit to apache/spark that referenced this pull request Feb 25, 2016
…ministic field(s)

(cherry picked from commit 3fa6491)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@liancheng

@mengxr Sorry, missed your last comment. JIRA link: https://issues.apache.org/jira/browse/SPARK-13473
