Fixes issue #161: Add PushDownFilters support #162
Conversation
…eo4j by a provided query
==== Schema

If APOC is available, the schema will be created with `apoc.meta.relTypeProperties`.
Otherwise the first 10 results (or any number specified by the `schema.flatten.limit` option) will be flattened and the schema will be created from those properties.
One sentence per line, please. It makes diffs easier and is the rule for Neo4j-related docs (asciidoc).
if (filters.nonEmpty) {
  val filtersMap: Map[PropertyContainer, Array[Filter]] = filters.map(filter => {
    if (filter.isAttribute(Neo4jUtil.RELATIONSHIP_SOURCE_ALIAS)) {
      (sourceNode, filter)
perhaps nicer to work with a case class
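A hedged sketch of what the reviewer means (the class and field names here are illustrative, not from the PR; the real code pairs a `PropertyContainer` with a `Filter`):

```scala
// Hypothetical: a named case class instead of a raw (container, filter) tuple
// makes the subsequent groupBy read without _._1 / _._2 noise.
case class ContainerFilter(container: String, filter: String)

val pairs = Seq(
  ContainerFilter("source", "name = 'John Doe'"),
  ContainerFilter("source", "age > 20"),
  ContainerFilter("target", "name = 'Jane Doe'")
)

// group filters by the container they apply to, keeping only the filter part
val grouped: Map[String, Seq[String]] =
  pairs.groupBy(_.container).mapValues(_.map(_.filter)).toMap
```

The same shape as `groupBy[PropertyContainer](_._1).mapValues(_.map(_._2))` above, but the intent is carried by field names rather than tuple positions.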
    }
  }).groupBy[PropertyContainer](_._1).mapValues(_.map(_._2))

matchQuery.where(
hard to reason about what happens in this code block
@@ -195,6 +200,15 @@ Spark works with data in a tabular fixed schema. To accomplish this Neo4j Connec

TK list of supported data types

=== Considerations on the filters

The Neo4j Spark Connector implements the `SupportsPushDownFilters` interface, which allows you to push the Spark filters down to the Neo4j layer.
This way, the data that Spark receives has already been filtered by Neo4j.
should probably mention that all filters are ANDed together?
done at line 213
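To make the pushdown behavior concrete, here is a hedged sketch (the `DataSource` format string and connection options are assumptions; the generated Cypher shown in the comment is taken from this PR's own test assertions):

```scala
// Sketch, not connector source. With pushdown enabled (the default in this PR),
// a Spark predicate is translated into the Cypher WHERE clause that Neo4j
// executes, instead of being applied by Spark after loading all rows.
val df = spark.read
  .format("org.neo4j.spark.DataSource")   // assumed format string
  .option("url", "bolt://localhost:7687") // assumed connection
  .option("labels", "Person")
  .load()

// Pushed down roughly as: MATCH (n:`Person`) WHERE n.name = 'John Doe' RETURN n
df.where("name = 'John Doe'").show()
```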
private def createNode(name: String, labels: Seq[String]) = {
  val primaryLabel = labels.head
  val otherLabels = labels.takeRight(labels.size - 1)
`labels.tail`
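A small sketch of the suggestion (values are illustrative): for a non-empty sequence, `tail` is equivalent to `takeRight(size - 1)` and states the intent directly.

```scala
val labels = Seq("Person", "Employee", "Manager")
val primaryLabel = labels.head
// tail drops the first element; equivalent to takeRight(labels.size - 1)
// for non-empty sequences, and it is a parameterless member (no parens)
val otherLabels = labels.tail
```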
  .toMap
  .asJava

def connectorVersion: String = properties.getOrDefault("version", "UNKNOWN").toString

def mapSparkFiltersToCypher(filter: Filter, container: org.neo4j.cypherdsl.core.PropertyContainer, attributeAlias: Option[String] = None): Condition = {
  filter match {
    case eqns: EqualNullSafe => container.property(attributeAlias.getOrElse(eqns.attribute)).isEqualTo(Cypher.literalOf(eqns.value))
todo parameters instead of literals, can be in a later PR
  .toMap
  .asJava

def connectorVersion: String = properties.getOrDefault("version", "UNKNOWN").toString

def mapSparkFiltersToCypher(filter: Filter, container: org.neo4j.cypherdsl.core.PropertyContainer, attributeAlias: Option[String] = None): Condition = {
  filter match {
    case eqns: EqualNullSafe => container.property(attributeAlias.getOrElse(eqns.attribute)).isEqualTo(Cypher.literalOf(eqns.value))
coalesce around the value?
val result = df.where("age > 20").collectAsList()

assertEquals(1, result.size())
these tests should probably check the actual entities are correct in the results, not just the count
  .option("relationship.target.labels", "Person")
  .load()

assertEquals(1, df.filter("`<source>`.`id` = '10' AND `<target>`.`id` = '1'").collectAsList().size())
df can do count without collectToList
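A hedged one-line sketch of the suggested change (reusing the filter string from the test above; assumes that test's `df`):

```scala
// count() runs the aggregation inside Spark without collecting the rows to the
// driver, so the test no longer materializes a list just to measure its size
assertEquals(1, df.filter("`<source>`.`id` = '10' AND `<target>`.`id` = '1'").count())
```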
  .option("relationship.target.labels", "Person")
  .load()

assertEquals(1, df.filter("`source.id` = 10 AND `target.id` = 1").collectAsList().size())
the syntax is different from before, where it was `<source>`.`id`?
why? we should allow only one (the simpler one)
Same explanation as my comment in quickstart.adoc: it is due to the presence of `.option("relationship.nodes.map", "false")`.
They are different because the result format is different (map vs column). We can talk about it anyway.
If `relationship.nodes.map` is set to **true**

* ``\`<source>`.\`[property]` `` for the source node map properties
why?
If I got your question right: this is used to filter the source, rel, or target columns when they are returned as a map (using `relationship.nodes.map` == true).
This is the syntax used to access properties inside said maps.
but why does it need to be a different syntax there? Can't we figure that out internally and use the appropriately adjusted filter? Just trying to understand.
We have two ways to represent a relationship:

- with `relationship.nodes.map=false` => we have the nodes and relationship flattened, so every node property has the namespace `source`/`target` and the relationship has `rel`. This is the most descriptive representation;
- with `relationship.nodes.map=true` => we have the nodes represented as maps, in order to compact the table representation.
  (p2:Person {name: 'Jane Doe'})
  """)

val result = df.select("name").where("NOT name = 'John Doe'").collectAsList()
is there also a not operator on the predicate API?
We added a test with the `!=` operator
I meant on the spark filter API i.e. can I do .not().where() ?
No you can't
@Test
def `should not return attribute if filter doesn't have it` {
not sure what this test means.
@@ -43,5 +43,27 @@ class Neo4jImplicitsTest {
  assertEquals(value, actual)
}

@Test
should the other implicits also be tested?
@@ -13,7 +14,7 @@ class Neo4jQueryServiceTest {
options.put(QueryType.LABELS.toString.toLowerCase, "Person")
val neo4jOptions: Neo4jOptions = new Neo4jOptions(options)

val query: String = new Neo4jQueryService(neo4jOptions, new Neo4jQueryReadStrategy()).createQuery()
val query: String = new Neo4jQueryService(neo4jOptions, new Neo4jQueryReadStrategy(Array[Filter]())).createQuery()
shouldn't it have a default value of an empty array instead?
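A hedged sketch of the reviewer's suggestion (class name follows the test above; the constructor shape and filter type are simplified to strings here, the real code takes `Array[Filter]`):

```scala
// Hypothetical: a default value of an empty array means existing call sites
// like `new Neo4jQueryReadStrategy()` keep compiling unchanged, while new
// code can pass filters explicitly.
class Neo4jQueryReadStrategy(filters: Array[String] = Array.empty) {
  def hasFilters: Boolean = filters.nonEmpty
}
```

With this, the test diff above would not need to change the no-filter call at all.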
val query: String = new Neo4jQueryService(neo4jOptions, new Neo4jQueryReadStrategy(filters)).createQuery()

assertEquals("MATCH (n:`Person`) WHERE n.name = 'John Doe' RETURN n", query)
please also test the generation of a few more queries
- and/or/not combos (esp. with parentheses around OR groups)
- source/target/rel filters both with map-mode and without
- multi label?
- label filters?
Looks great, really cool feature!!
Please have a look at my comments, and decide if you want to act on them and, if so, whether in this PR or later.
Are there also other predicates possible and functions possible? like
@@ -177,6 +177,11 @@ every single node property as column prefixed by `source` or `target`

|`sample`
|No

|`pushdown.filters.enabled`
just out of curiosity, why would I ever not want this enabled? What situation would mean I should disable it?
Maybe some users prefer to leave the filtering to Spark because, for instance, Spark would be faster at filtering some properties that might not be indexed in Neo4j.
Also, if we had a bug in the filtering, disabling it would be a handy workaround 😅
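A hedged sketch of the opt-out being discussed (the option name comes from this PR; the `DataSource` format string is an assumption):

```scala
// Sketch: with pushdown disabled, the connector sends no WHERE clause to Neo4j;
// any df.where(...) predicates are applied by Spark after the data is loaded.
val df = spark.read
  .format("org.neo4j.spark.DataSource")        // assumed format string
  .option("labels", "Person")
  .option("pushdown.filters.enabled", "false") // option added in this PR
  .load()
```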
For functions: you should use Spark functions, since they're not converted to Filters and pushed down to Neo4j; at the moment there is no way to handle them. For the labels: they are implicitly defined when we use the
assertEquals("MATCH (source:`Person`) " +
  "MATCH (target:`Person`) " +
  "MATCH (source)-[rel:`KNOWS`]->(target) WHERE (source.name = coalesce('John Doe', '') OR target.name = coalesce('John Doe', '')) RETURN source, rel, target", query)
hmm, not sure about the semantics of EqualTo in Spark?
Because now we also get all the records returned that have name='', and I'm not sure that's intended.
Actually I'm not sure we have to distinguish between null and not null equalto?
I don't think the coalesce approach works here, as it would return wrong data.
Do you have a pointer to the "null" semantics for the Spark pushdown filters?
Otherwise I would just use regular Cypher expressions and ignore the null bit.
If you want to specify `x IS NULL OR x = 'foo'`, I would generate exactly that, but that will add a penalty to the Cypher query performance.
Looks good otherwise.
@jexp I updated the PR as you requested, please lemme know what your thoughts are.
case notNull: IsNotNull => container.property(attributeAlias.getOrElse(notNull.attribute)).isNotNull
case isNull: IsNull => container.property(attributeAlias.getOrElse(isNull.attribute)).isNull
case startWith: StringStartsWith => container.property(attributeAlias.getOrElse(startWith.attribute))
  .startsWith(Functions.coalesce(Cypher.literalOf(startWith.value), Cypher.literalOf("")))
I thought you wanted to remove the coalesce?
There might be a missing test that would have spotted them. :)
@conker84 ping
val neo4jOptions: Neo4jOptions = new Neo4jOptions(options)

val filters: Array[Filter] = Array[Filter](
  EqualNullSafe("name", "John Doe"),
where are the semantics of EqualNullSafe defined?
See my comment
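To make the question concrete: Spark's `EqualNullSafe` corresponds to the SQL `<=>` operator, where `null <=> null` is true and `null <=> x` is false. A hedged model of those semantics (this is illustrative code, not connector source):

```scala
// Hypothetical model of Spark's EqualNullSafe (SQL `<=>`) semantics,
// using Option to stand in for nullable values.
def equalNullSafe(a: Option[String], b: Option[String]): Boolean = (a, b) match {
  case (None, None)          => true  // null <=> null is true
  case (None, _) | (_, None) => false // null <=> non-null is false
  case (Some(x), Some(y))    => x == y
}
```

This is why a plain `coalesce('John Doe', '')` translation is questionable: it would make a null property compare equal to the empty string, which these semantics do not allow.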
@jexp PR updated
Fixes #161

Added PushDownFilters support.
It is applied just with `labels` and `relationship`; it applies with `relationship` and the `relationship.nodes.map` option set to `true`.
When applied, the filters are executed in the Cypher query, moving less data from Neo4j to Spark. When not applied, the filters are executed by Spark.
It can be manually disabled by setting the `pushdown.filters.enabled` option to `false`.