
fixes #145: [ReadSupport] Add support to read from relationship with provided type #157

Merged: 1 commit, Aug 4, 2020

Conversation

@conker84 (Contributor) commented Aug 2, 2020

fixes #145

This PR introduces support for querying relationships by providing a simple path like (source)-[rel]->(target), in the following way:

val df = spark.read.format("org.neo4j.spark.DataSource")
      .option("relationship", "BOUGHT")
      .option("relationship.source.labels", "Person")
      .option("relationship.target.labels", "Product")
      .load()

df.show()

This will generate the following Cypher query:

MATCH (source:Person)-[rel:BOUGHT]->(target:Product)
RETURN source, rel, target
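
For completeness, a minimal end-to-end PySpark sketch of the same read; the URL and credentials here are placeholders for your own instance:

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", "bolt://localhost:7687") \
  .option("authentication.basic.username", "neo4j") \
  .option("authentication.basic.password", "password") \
  .option("relationship", "BOUGHT") \
  .option("relationship.source.labels", "Person") \
  .option("relationship.target.labels", "Product") \
  .load()

df.show()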

@conker84 marked this pull request as ready for review August 3, 2020 10:44
@moxious (Contributor) commented Aug 3, 2020

Bug when running this code:

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "NEIGHBOUR") \
  .option("relationship.source.labels", "Country") \
  .option("relationship.target.labels", "Country").load()

df.show()
Py4JJavaError: An error occurred while calling o564.load.
: scala.MatchError: N/A (of class java.lang.String)
	at org.neo4j.spark.service.SchemaService.liftedTree2$1(SchemaService.scala:135)
	at org.neo4j.spark.service.SchemaService.structForRelationship(SchemaService.scala:128)
	at org.neo4j.spark.service.SchemaService.struct(SchemaService.scala:159)
	at org.neo4j.spark.reader.Neo4jDataSourceReader.readSchema(Neo4jDataSourceReader.scala:18)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:175)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:290)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

@moxious (Contributor) left a comment

Generally I was able to get it to work, but importantly the flatten map option didn't work for me.

Beware of when the schema sampling happens. I was iterating, creating new properties on (previously) empty relationships, and re-running my query; it seemed like it wasn't picking up the relationship property change, so I'm not sure what's going on there.

Given this example:

create (a:Customer { name: "David" })-[:BOUGHT { qty: 11 }]->(p:Product { name: "Socks" })

It'd be nice if this could return a dataframe like:

rel.id
rel.type
rel.qty
source.id <----- MISSING - at present you only get props, not internal ID
source.labels <---- MISSING (remember source nodes may have multi-labels)
source.name
target.id <--- MISSING
target.labels <---- MISSING
target.name

Since we "namespaced" properties with "source" and "target" here, might want to do the same for rel, to reduce the chance that property names collide. For example, imagine a relationship with a property called source.name

@@ -149,6 +149,22 @@ in case you set this option will have the priority compared to the one defined i
|``
|Yes^*^

|`relationship.node.map`
@moxious (Contributor) commented

I like this setting; I would suggest flattening the DF by default if feasible, because it saves the user from having to destructure it and it's more "relational friendly", making the map behavior something you can toggle on instead.

I find the name a bit confusing. "Relationship node"? What's that? :)

Not totally clear on what's being said here. A relationship is a property map, but so are the nodes. Is this setting flattening all of them, or just the relationship one? Maybe a clearer name would be properties.flatten=True

This option doesn't appear to work for me:

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "ORG") \
  .option("relationship.source.labels", "Event") \
  .option("relationship.node.map", False) \
  .option("relationship.target.labels", "Organization").load()

df.printSchema()

yields

root
 |-- <id>: long (nullable = false)
 |-- <relationshipType>: string (nullable = false)
 |-- <source>: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- <target>: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

@conker84 (Contributor, Author) commented

Could relationship.nodes.as.map be a better name?

@moxious (Contributor) commented

When you use both the word relationship and the word node in the same setting, that's what confuses me, because at a glance I can't tell if you're referring to one, the other, or both. I would guess that "relationship.nodes.as.map" returns node properties as a map, but then I'd be left wondering what happens to relationship properties. (But they follow the same setting!)

|List of target node Labels separated by `:`
|``
|Yes^*^

|`schema.flatten.limit`
|Number of records to be used to create the Schema (only if APOC is not installed)
@moxious (Contributor) commented

I don't really understand what this means. Does it constrain the maximum number of columns that can come back on a flattened DF? If I set a limit and something has 40 properties, I guess I lose the ones over the limit, but I can't tell which ones I lose, which is problematic.

@conker84 (Contributor, Author) commented Aug 3, 2020

Let me explain: if APOC is not installed, we use these two queries.

For nodes:

val query = s"""MATCH (${Neo4jUtil.NODE_ALIAS}:${labels.mkString(":")})
                |RETURN ${Neo4jUtil.NODE_ALIAS}
                |ORDER BY rand()
                |LIMIT ${options.query.schemaFlattenLimit}
                |""".stripMargin

For relationships:

val query = s"""MATCH (${Neo4jUtil.RELATIONSHIP_SOURCE_ALIAS}:${options.relationshipMetadata.source.labels.mkString(":")})
                |MATCH (${Neo4jUtil.RELATIONSHIP_TARGET_ALIAS}:${options.relationshipMetadata.target.labels.mkString(":")})
                |MATCH (${Neo4jUtil.RELATIONSHIP_SOURCE_ALIAS})-[${Neo4jUtil.RELATIONSHIP_ALIAS}:${options.relationshipMetadata.relationshipType}]->(${Neo4jUtil.RELATIONSHIP_TARGET_ALIAS})
                |RETURN ${Neo4jUtil.RELATIONSHIP_ALIAS}
                |ORDER BY rand()
                |LIMIT ${options.query.schemaFlattenLimit}
                |""".stripMargin

We then use the results of these to retrieve the schema. Since we cannot dump the entire node/relationship set, we provide schema.flatten.limit to sample a fixed number of results. So we are not defining the number of columns, but the number of Neo4j records to parse in order to compute the schema.
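
To make the idea concrete, here is a minimal sketch (not the connector's actual code) of how a schema can be inferred from such a sample: take the union of all property keys seen across the sampled records and map each value to a Spark type. The sample data here is hypothetical.

%py

from pyspark.sql.types import (
    BooleanType, DoubleType, LongType, StringType, StructField, StructType,
)

# Hypothetical property maps, standing in for the records returned by the
# sampling queries above (capped by schema.flatten.limit).
sample = [
    {"qty": 11},
    {"qty": 3, "discounted": True},
]

def to_spark_type(value):
    # Check bool before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return BooleanType()
    if isinstance(value, int):
        return LongType()
    if isinstance(value, float):
        return DoubleType()
    return StringType()  # fall back to string for anything else

fields = {}
for record in sample:
    for key, value in record.items():
        fields.setdefault(key, to_spark_type(value))

# The union of keys across the sample becomes the schema; an empty sample
# (e.g. relationships with no properties) yields an empty struct.
schema = StructType([StructField(name, dtype, nullable=True)
                     for name, dtype in sorted(fields.items())])
print(schema)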

@moxious (Contributor) commented

Consider naming this schema.sample and then including this comment. Literally you're defining a sample size, right?


* `<id>` the internal Neo4j id
* `<relatioshipType>` the relationship type
@moxious (Contributor) commented

typo "relationshipType"

@conker84 (Contributor, Author) commented

fixed

==== Schema

If APOC is installed, the schema will be created with `apoc.meta.relTypeProperties`. Otherwise the first 10 results (or any number specified by the `schema.flatten.limit` option) will be flattened, and the schema will be created from those properties.
@moxious (Contributor) commented
Better to put this information with the option in the doc. I missed it twice when reading through, trying to figure out what the option did.
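
As an aside, a quick way to inspect what apoc.meta.relTypeProperties reports is to call the procedure directly; a minimal sketch using the Neo4j Python driver (the URL and credentials are placeholders, and the exact output columns vary by APOC version):

%py

from neo4j import GraphDatabase

# Placeholder connection details; point these at your own instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Reports, per relationship type, the observed property names and types.
    for record in session.run("CALL apoc.meta.relTypeProperties()"):
        print(record.data())
driver.close()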

@conker84 (Contributor, Author) commented Aug 3, 2020

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "NEIGHBOUR") \
  .option("relationship.source.labels", "Country") \
  .option("relationship.target.labels", "Country").load()

df.show()

I was able to reproduce it, thank you!
The bug occurred because the relationship had no properties; I fixed it by handling that case.

@conker84 (Contributor, Author) commented Aug 3, 2020

Generally I was able to get it to work, but importantly the flatten map option didn't work for me.

Beware of when the schema sampling happens. I was iterating, creating new properties on (previously) empty relationships, and re-running my query; it seemed like it wasn't picking up the relationship property change, so I'm not sure what's going on there.

Given this example:

create (a:Customer { name: "David" })-[:BOUGHT { qty: 11 }]->(p:Product { name: "Socks" })

It'd be nice if this could return a dataframe like:

rel.id
rel.type
rel.qty
source.id <----- MISSING - at present you only get props, not internal ID
source.labels <---- MISSING (remember source nodes may have multi-labels)
source.name
target.id <--- MISSING
target.labels <---- MISSING
target.name

Since we "namespaced" properties with "source" and "target" here, might want to do the same for rel, to reduce the chance that property names collide. For example, imagine a relationship with a property called source.name

It didn't work because there was a typo in the documentation: the option is relationship.nodes.map, but I wrote relationship.node.map in the docs :(
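
With the corrected option name, the earlier snippet becomes (a sketch; nothing except the option name changes from the snippet above):

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "ORG") \
  .option("relationship.source.labels", "Event") \
  .option("relationship.nodes.map", False) \
  .option("relationship.target.labels", "Organization").load()

df.printSchema()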

This is the related schema:

root
 |-- <id>: long (nullable = false)
 |-- <relationshipType>: string (nullable = false)
 |-- <sourceId>: long (nullable = false)
 |-- <sourceLabels>: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- sourceId: string (nullable = true)
 |-- sourceWordCount: long (nullable = true)
 |-- sourceTone: double (nullable = true)
 |-- sourceSelfGroupReferenceDensity: double (nullable = true)
 |-- sourcePositiveScore: double (nullable = true)
 |-- sourcePolarity: double (nullable = true)
 |-- sourceNegativeScore: double (nullable = true)
 |-- sourceDateStamp: long (nullable = true)
 |-- sourceDate: string (nullable = true)
 |-- sourceActivityReferenceDensity: double (nullable = true)
 |-- <targetId>: long (nullable = false)
 |-- <targetLabels>: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- targetName: string (nullable = true)

I like your idea about the namespacing; let me add it.
