fixes #163: Add support for partitioning #164
Conversation
@@ -177,6 +188,13 @@ every single node property as column prefixed by `source` or `target` | |||
|`sample` | |||
|No | |||
|
|||
|`partitions` | |||
|This defines the parallelization level while pulling data from Neo4j. | |||
N.b. As More parallelization does not mean more performances so please tune wisely in according to |
"As more parallelization does not mean better performance, choose this setting wisely"
Note the interaction with "count". I think what you mean is that if the count query returns 100 and partitions=5, you're going to be getting 5 partitions of 20 records, using skip/limit. You should just say that.
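The skip/limit math this comment describes can be sketched as follows (illustrative only; `skipLimitWindows` is a hypothetical helper, not the connector's actual API):

```scala
// Split a total count into `partitions` SKIP/LIMIT windows,
// e.g. count = 100, partitions = 5 -> five windows of 20 records each.
def skipLimitWindows(count: Long, partitions: Int): Seq[(Long, Long)] = {
  val size = math.ceil(count.toDouble / partitions).toLong // records per window
  (0 until partitions).map(i => (i * size, size))          // (skip, limit) pairs
}
```

Each `(skip, limit)` pair then backs one partition's query, e.g. `MATCH (p:Person) RETURN p SKIP 0 LIMIT 20`.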
We should also mention that it only makes sense when you're pulling more than X (where X can be 1M or 10M)
```cypher | ||
MATCH (p:Person)-[r:BOUGHT]->(pr:Product) | ||
WHERE pr.name = 'An Awesome Product' | ||
RETUNR count(p) AS count |
Typo: `RETUNR` should be `RETURN`.
2. Relationship extraction | ||
3. Query extraction | ||
|
||
We adopt generally provide a general count on what you're trying to pull of and add build |
provide an example here, this isn't very clear. For example, I'm loading just Person nodes. I set partitions=5. This turns into:
MATCH (p:Person) RETURN p LIMIT 20 SKIP 0
MATCH (p:Person) RETURN p LIMIT 20 SKIP 20
...
In theory we could also leverage cursors:
- given there is a unique constraint
- we can return the data ordered (using an index-backed ORDER BY) with a limit
- then we can take the last value returned and add a
WHERE n.id > $cursor
which would then only query from that value onward; pure skip/limit queries need to re-execute and throw away rows, so they get slower over time
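Assuming a unique, indexed `id` property on the nodes being read, the cursor approach sketched above could look like this (illustrative only):

```cypher
// first page: index-backed order, no skipping
MATCH (p:Person)
RETURN p ORDER BY p.id LIMIT 20;

// later pages: $cursor = last p.id of the previous page
MATCH (p:Person)
WHERE p.id > $cursor
RETURN p ORDER BY p.id LIMIT 20;
```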
While for (1) and (2) we leverage the Neo4j count store in order to retrieve the total count | ||
about the nodes/relationships we're trying pulling off, for the (3) we have two possible approaches: | ||
|
||
* Compute a count over the query that we're using |
How would this be done in the general query case? Or it can't be done, and using more than 1 partition with a query where no count is specified will fail?
} | ||
} | ||
|
||
def countForRelationship(): Long = try { |
These count methods: please think about failure modes. In some of the APOC calls you are doing, you are coupling to the string format that APOC returns in a particular field. Remembering that APOC will get put into the product at some point, and that the string format of return columns isn't promised, this code has some potential to break in the future due to a newly published APOC release, just as an example of what can go wrong.
Suppose the user specifies a parallelism of 5, and everything goes wrong with the count feature, and there's no way to obtain a valid count. What happens then? My vote would be for a debug message, plus reverting to reading into one partition (with no need for counts).
There's the extra possibility that the user just plain screws up the count query. Say I specify the count query like this:
MATCH (p:Person)-[:BOUGHT]->(pr:Product)
RETURN count(pr) - 10 as count
Doesn't really matter how or why, but the count is wrong.
In this case, you might try to fetch one extra "empty" partition. If the user says 5 partitions, try to fetch 6, where the 6th partition has one record (SKIP whatever LIMIT 1). If you get nothing, that's good. If you get an actual record here, this is a warning that the user has messed up the count. Another warning is if any of your partitions end up empty; then chances are good the user messed up the count query.
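The over-fetch check proposed here can be sketched roughly like this (a hypothetical `countLooksWrong` helper, not the PR's code):

```scala
// After reading the partitions, probe one extra record past the expected end
// (SKIP expectedCount LIMIT 1). If the probe returns a record, or any partition
// came back empty, the user-supplied count was probably wrong.
def countLooksWrong(probeReturnedRecord: Boolean, partitionSizes: Seq[Long]): Boolean =
  probeReturnedRecord || partitionSizes.contains(0L)
```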
Just allow specifying a literal count value in the property.
It's always good to overfetch for stuff like that; in SDN we used to pull pages with pagesize+1 so we could say "there is a next page".
Just to be sure: should we over-fetch always, or only if the user provides the count query?
Regarding "just allow to specify a literal count value in the property": the user can define a query like this: RETURN 50 AS count. Is that fine? Or should we allow simply passing .option("query.count", "50")?
I added .option("query.count", "50") as a valid option.
assertEquals(5, partitionedDf.rdd.getNumPartitions) | ||
assertEquals(150, partitionedDf.collect().map(_.getAs[Long]("person")).toSet.size) | ||
|
||
val partitionedQueryCountDf = ss.read.format(classOf[DataSource].getName) |
Add tests:
- What happens when query.count is invalid, or fails to return a "count" field, or returns a "count" field that has nonsense?
We provided a check that it is a READ_ONLY query, and I added a check on the returned identifier, which must be only count.
What do you mean by nonsense? About the type? That will be a runtime error because, to my knowledge,
GitHub seems to have truncated my comment 😥
... to my knowledge there is no way to get the return type from the EXPLAIN
Actually the optimized version should be:
which uses get-degree
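The optimized query itself is elided above; given the earlier example, a get-degree-based count presumably has roughly this shape (an assumption, not the author's exact query):

```cypher
MATCH (pr:Product)
WHERE pr.name = 'An Awesome Product'
RETURN size((pr)<-[:BOUGHT]-()) AS count
```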
|Query count, used only in combination with `query` option, it's a query that returns a `count` | ||
field like the following | ||
``` | ||
MATCH (p:Person)-[r:BOUGHT]->(pr:Product) |
see my optimized version in the other comment.
Also a typo in RETURN.
Can I also opt out of the count query and just provide a literal value if I know how much data is there?
That would be really useful.
Done
"""CALL apoc.meta.relTypeProperties({ includeRels: $rels }) YIELD sourceNodeLabels, targetNodeLabels, | ||
| propertyName, propertyTypes | ||
|WITH * | ||
|WHERE sourceNodeLabels = $sourceLabels AND targetNodeLabels = $targetLabels |
shouldn't we rather pass this to the proc? seems a big waste. If it's based on apoc.meta.* it should support include and exclude functionality.
Unfortunately at this moment, the procedure does not take those parameters as configuration params
oh ok, that's really unfortunate.
But we can change that.
yes
// TODO get back to Cypher DSL | ||
options.nodeMetadata.labels | ||
.map(label => { | ||
val query = s"""MATCH (${Neo4jUtil.NODE_ALIAS}:$label) |
you can do a union query that uses
MATCH (:$label) RETURN { type: '$label', count: count(*) } AS counts UNION ALL ...
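Spelled out for two of the labels appearing in this thread, the suggested union count query would look like:

```cypher
MATCH (:Person)  RETURN { type: 'Person',  count: count(*) } AS counts
UNION ALL
MATCH (:Product) RETURN { type: 'Product', count: count(*) } AS counts
```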
|""".stripMargin).single().get("count")) | ||
.map(count => if (count.isNull) 0L else count.asLong()) | ||
.min | ||
Math.min(minFromSource, minFromTarget) |
why min and not max?
because I use the most selective; if we have this use-case:
val df = ss.read.format(classOf[DataSource].getName)
.option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
.option("relationship", "BOUGHT")
.option("relationship.nodes.map", "false")
.option("relationship.source.labels", ":Person:RegisteredCustomer")
.option("relationship.target.labels", ":Product")
.load()
This will get three counts:
(:Person)-[:BOUGHT]->() => 1000 rows
(:RegisteredCustomer)-[:BOUGHT]->() => 250 rows
()-[:BOUGHT]->(:Product) => 1000 rows
We use the most selective.
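The selection logic described above boils down to taking the minimum of the candidate pattern counts (values copied from the comment; sketch only, not the connector's code):

```scala
// Three candidate counts for (:Person:RegisteredCustomer)-[:BOUGHT]->(:Product)
val counts = Seq(
  "(:Person)-[:BOUGHT]->()"             -> 1000L,
  "(:RegisteredCustomer)-[:BOUGHT]->()" -> 250L,
  "()-[:BOUGHT]->(:Product)"            -> 1000L)
// The most selective pattern gives the tightest upper bound on the row count.
val mostSelective = counts.map(_._2).min
```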
val query = if (options.queryMetadata.queryCount.trim.nonEmpty) { | ||
options.queryMetadata.queryCount | ||
} else { | ||
s"""CALL { ${options.query.value} } |
clever use of subquery
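The subquery trick being praised wraps the user's Cypher in a `CALL {}` block and counts its rows; schematically (with a placeholder user query):

```cypher
CALL {
  MATCH (p:Person) RETURN p   // the user's `query` option, verbatim
}
RETURN count(*) AS count
```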
I used this in order to create the queries: https://neo4j.com/developer/kb/fast-counts-using-the-count-store/
@jexp lemme know if there are no further comments
private val query: String = new Neo4jQueryService(options, new Neo4jQueryReadStrategy(filters)).createQuery() | ||
private val query: String = new Neo4jQueryService(options, new Neo4jQueryReadStrategy(filters)) | ||
.createQuery() | ||
.concat(if (skip != -1 && limit != -1) s" SKIP $$skip LIMIT $$limit" else "") |
it would be nicer if we had a proper method on createQuery i.e. pass it to the query builder
Done
with InputPartitionReader[InternalRow] | ||
with Logging { | ||
|
||
private var result: Iterator[Record] = _ | ||
private var session: Session = _ | ||
private var transaction: Transaction = _ | ||
private val driverCache: DriverCache = new DriverCache(options.connection, jobId) | ||
private val driverCache: DriverCache = new DriverCache(options.connection, |
why do we need the partitionid in the driver cache?
Because otherwise, in a multi-partition environment, the first partition that closes the driver cache closes it for everyone.
queryReadStrategy.createStatementForRelationshipCount(options) | ||
} | ||
log.info(s"Executing the following counting query on Neo4j: $query") | ||
session.run(query) |
how do you map them back to the source node / rel-type? or don't you need that?
What do you mean?
|
||
def countForRelationshipWithQuery(filters: Array[Filter]): Long = { | ||
val query = if (filters.isEmpty) { | ||
val sourceQueries = options.relationshipMetadata.source.labels |
why can't we use apoc.meta.stats here?
We do use it; this is the fallback in case it's not present, look here:
https://github.com/neo4j-contrib/neo4j-spark-connector/pull/164/files#diff-85bfb2242b03b769ed6b534ff8992f88R255
if (count <= 0) { | ||
Seq.empty | ||
} else { | ||
val partitionSize = Math.ceil(count / options.partitions).toLong |
isn't this a whole number division? so ceil has no effect
Great catch thank you!!!
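The bug being acknowledged: with two integral operands, `count / options.partitions` truncates before `Math.ceil` ever sees a fraction, so the `ceil` is a no-op. A minimal illustration with hypothetical values:

```scala
val count = 101L
val partitions = 5
// Integer division truncates first: 101 / 5 = 20, then ceil(20.0) = 20.
val wrong = Math.ceil(count / partitions).toLong
// Converting to Double first keeps the fraction: ceil(101.0 / 5) = ceil(20.2) = 21.
val right = Math.ceil(count.toDouble / partitions).toLong
```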
@@ -53,7 +55,8 @@ object Neo4jUtil { | |||
case _ => autoCloseable.close() | |||
} | |||
} catch { | |||
case _ => throw new Exception("This exception should be logged") // @todo Log | |||
case t: Throwable => if (logger != null) logger | |||
.error(s"Cannot close ${autoCloseable.getClass.getSimpleName} because of the following exception:", t) |
this should not be an error, at most a warning
Done
override def readSchema(): StructType = callSchemaService { schemaService => schemaService | ||
.struct() } | ||
|
||
private def callSchemaService[T](lambda: SchemaService => T): T = { |
supplier? or factory? instead of lambda?
I changed the name to function. If you agree I would leave it as a lambda, which keeps the code very simple.
lambda(schemaService) | ||
} catch { | ||
case e: Throwable => { | ||
hasError = true |
logging?
It will be logged by Spark; I can add it, but it would be redundant.
val neo4jPartitions = if (partitionSkipLimit.isEmpty) { | ||
Seq(new Neo4jInputPartitionReader(neo4jOptions, filters, schema, jobId)) | ||
} else { | ||
partitionSkipLimit.zipWithIndex |
this code is not very readable, can we use a case class instead of the triple?
Done
@@ -19,20 +21,43 @@ class Neo4jDataSourceReader(private val options: DataSourceOptions, private val | |||
private val neo4jOptions: Neo4jOptions = new Neo4jOptions(options.asMap()) | |||
.validate(options => Validations.read(options, jobId)) | |||
|
|||
override def readSchema(): StructType = { | |||
val schemaService = new SchemaService(neo4jOptions, jobId) | |||
|
this whole new section is quite hard to understand / reason about, can you do something with the naming of methods/params or add a few comments
done
Please have a look at my comments
Thanks @jexp for the feedback, I'm merging this
fixes #163
We support partitions via skip/limit queries. In order to do so, we compute the total count: for nodes/relationships we leverage the Neo4j count store to retrieve the total count of the nodes/relationships we're pulling, while for the query case we have two possible approaches: compute a count over the query we're using, or let the user provide it via the
.option("query.count", "<your cypher query>")
The query must always return only one field named count, which is the result of the count, i.e.:
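For instance, reusing the count query from the docs diff (with the `RETURN` typo fixed):

```cypher
MATCH (p:Person)-[r:BOUGHT]->(pr:Product)
WHERE pr.name = 'An Awesome Product'
RETURN count(p) AS count
```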