
fixes #145: [ReadSupport] Add support to read from relationship with provided type #157

Merged: 1 commit, Aug 4, 2020

Conversation

@conker84 (Contributor) commented Aug 2, 2020

fixes #145

This PR introduces support for querying relationships by providing a simple path like (source)-[rel]->(target), in the following way:

val df = spark.read.format("org.neo4j.spark.DataSource")
      .option("relationship", "BOUGHT")
      .option("relationship.source.labels", "Person")
      .option("relationship.target.labels", "Product")
      .load()

df.show()

This will generate the following Cypher query:

MATCH (source:Person)-[rel:BOUGHT]->(target:Product)
RETURN source, rel, target
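
For completeness, a minimal end-to-end PySpark sketch of the same read; the URL and credentials here are placeholders for your own instance:

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", "bolt://localhost:7687") \
  .option("authentication.basic.username", "neo4j") \
  .option("authentication.basic.password", "password") \
  .option("relationship", "BOUGHT") \
  .option("relationship.source.labels", "Person") \
  .option("relationship.target.labels", "Product") \
  .load()

df.show()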

@conker84 marked this pull request as ready for review August 3, 2020 10:44
@moxious (Contributor) commented Aug 3, 2020

Bug when running this code:

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "NEIGHBOUR") \
  .option("relationship.source.labels", "Country") \
  .option("relationship.target.labels", "Country").load()

df.show()
Py4JJavaError: An error occurred while calling o564.load.
: scala.MatchError: N/A (of class java.lang.String)
	at org.neo4j.spark.service.SchemaService.liftedTree2$1(SchemaService.scala:135)
	at org.neo4j.spark.service.SchemaService.structForRelationship(SchemaService.scala:128)
	at org.neo4j.spark.service.SchemaService.struct(SchemaService.scala:159)
	at org.neo4j.spark.reader.Neo4jDataSourceReader.readSchema(Neo4jDataSourceReader.scala:18)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:175)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:290)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

@moxious (Contributor) left a comment

Generally I was able to get it to work, but importantly the flatten map option didn't work for me.

Beware of when the schema sampling happens. I was iterating, creating new properties on (previously) empty relationships, and re-running my query; it seemed like it wasn't picking up the relationship property change, so I'm not sure what's going on there.

Given this example:

create (a:Customer { name: "David" })-[:BOUGHT { qty: 11 }]->(p:Product { name: "Socks" })

It'd be nice if this could return a dataframe like:

rel.id
rel.type
rel.qty
source.id <----- MISSING - at present you only get props, not internal ID
source.labels <---- MISSING (remember source nodes may have multi-labels)
source.name
target.id <--- MISSING
target.labels <---- MISSING
target.name

Since we "namespaced" properties with "source" and "target" here, might want to do the same for rel, to reduce the chance that property names collide. For example, imagine a relationship with a property called source.name

@@ -149,6 +149,22 @@ in case you set this option will have the priority compared to the one defined i
|``
|Yes^*^

|`relationship.node.map`
@moxious (Contributor) commented

I like this setting; I would suggest flattening the DF by default if feasible, because it saves the user from having to destructure it and it's more "relational friendly", making the map behavior something you can toggle on instead.

I find the name a bit confusing. "Relationship node"? What's that? :)

Not totally clear on what's being said here. A relationship is a property map, but so are the nodes. Is this setting flattening all of them, or just the relationship one? Maybe a clearer name would be properties.flatten=True

This option doesn't appear to work for me:

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "ORG") \
  .option("relationship.source.labels", "Event") \
  .option("relationship.node.map", False) \
  .option("relationship.target.labels", "Organization").load()

df.printSchema()

yields

root
 |-- <id>: long (nullable = false)
 |-- <relationshipType>: string (nullable = false)
 |-- <source>: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- <target>: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

@conker84 (Contributor, Author) commented

Could relationship.nodes.as.map be a better name?

@moxious (Contributor) commented

When you use both the word relationship and the word node in the same setting, that's what confuses me, because at a glance I can't tell if you're referring to one, the other, or both. I would guess that "relationship.nodes.as.map" returns node properties as a map, but then I'd be left wondering what happens to relationship properties. (But they follow the same setting!)

|List of target node Labels separated by `:`
|``
|Yes^*^

|`schema.flatten.limit`
|Number of records to be used to create the Schema (only if APOC is not installed)
@moxious (Contributor) commented

I don't really understand what this means. Does it constrain the maximum number of columns that can come back on a flattened DF? If I set a limit and something has 40 properties, I guess I lose the ones over the limit, but I can't tell which ones I lose, which is problematic.

@conker84 (Contributor, Author) commented Aug 3, 2020

Let me explain: if APOC is not installed, we use these two queries.

For nodes:

val query = s"""MATCH (${Neo4jUtil.NODE_ALIAS}:${labels.mkString(":")})
                |RETURN ${Neo4jUtil.NODE_ALIAS}
                |ORDER BY rand()
                |LIMIT ${options.query.schemaFlattenLimit}
                |""".stripMargin

For relationships:

val query = s"""MATCH (${Neo4jUtil.RELATIONSHIP_SOURCE_ALIAS}:${options.relationshipMetadata.source.labels.mkString(":")})
                |MATCH (${Neo4jUtil.RELATIONSHIP_TARGET_ALIAS}:${options.relationshipMetadata.target.labels.mkString(":")})
                |MATCH (${Neo4jUtil.RELATIONSHIP_SOURCE_ALIAS})-[${Neo4jUtil.RELATIONSHIP_ALIAS}:${options.relationshipMetadata.relationshipType}]->(${Neo4jUtil.RELATIONSHIP_TARGET_ALIAS})
                |RETURN ${Neo4jUtil.RELATIONSHIP_ALIAS}
                |ORDER BY rand()
                |LIMIT ${options.query.schemaFlattenLimit}
                |""".stripMargin

We then use the results of these to retrieve the schema. Since we cannot dump the entire node/relationship set, we provide schema.flatten.limit to sample a fixed number of results. So we are not defining the number of columns, but the number of Neo4j records to parse in order to compute the schema.
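
To make the idea concrete, here is a minimal sketch (not the connector's actual code) of how a schema can be inferred from such a sample: take the union of all property keys seen across the sampled records and map each value to a Spark type. The sample data here is hypothetical.

%py

from pyspark.sql.types import (
    BooleanType, DoubleType, LongType, StringType, StructField, StructType,
)

# Hypothetical property maps, standing in for the records returned by the
# sampling queries above (capped by schema.flatten.limit).
sample = [
    {"qty": 11},
    {"qty": 3, "discounted": True},
]

def to_spark_type(value):
    # Check bool before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return BooleanType()
    if isinstance(value, int):
        return LongType()
    if isinstance(value, float):
        return DoubleType()
    return StringType()  # fall back to string for anything else

fields = {}
for record in sample:
    for key, value in record.items():
        fields.setdefault(key, to_spark_type(value))

# The union of keys across the sample becomes the schema; an empty sample
# (e.g. relationships with no properties) yields an empty struct.
schema = StructType([StructField(name, dtype, nullable=True)
                     for name, dtype in sorted(fields.items())])
print(schema)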

@moxious (Contributor) commented

Consider naming this schema.sample and then including this comment. Literally you're defining a sample size, right?


* `<id>` the internal Neo4j id
* `<relatioshipType>` the relationship type
@moxious (Contributor) commented

typo "relationshipType"

@conker84 (Contributor, Author) commented

fixed

==== Schema

If APOC is installed, the schema will be created with `apoc.meta.relTypeProperties`. Otherwise the first 10 results (or any number specified by the `schema.flatten.limit` option) will be flattened, and the schema will be created from those properties.
@moxious (Contributor) commented
Better to put this information with the option in the doc. I missed it twice when reading through, trying to figure out what the option did.
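
As an aside, a quick way to inspect what apoc.meta.relTypeProperties reports is to call the procedure directly; a minimal sketch using the Neo4j Python driver (the URL and credentials are placeholders, and the exact output columns vary by APOC version):

%py

from neo4j import GraphDatabase

# Placeholder connection details; point these at your own instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Reports, per relationship type, the observed property names and types.
    for record in session.run("CALL apoc.meta.relTypeProperties()"):
        print(record.data())
driver.close()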

@conker84 (Contributor, Author) commented Aug 3, 2020

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "NEIGHBOUR") \
  .option("relationship.source.labels", "Country") \
  .option("relationship.target.labels", "Country").load()

df.show()

I was able to reproduce it, thank you!
The bug occurred because the relationship had no properties; I fixed it by handling that case.

@conker84 (Contributor, Author) commented Aug 3, 2020

Generally I was able to get it to work, but importantly the flatten map option didn't work for me.

Beware of when the schema sampling happens. I was iterating, creating new properties on (previously) empty relationships, and re-running my query; it seemed like it wasn't picking up the relationship property change, so I'm not sure what's going on there.

Given this example:

create (a:Customer { name: "David" })-[:BOUGHT { qty: 11 }]->(p:Product { name: "Socks" })

It'd be nice if this could return a dataframe like:

rel.id
rel.type
rel.qty
source.id <----- MISSING - at present you only get props, not internal ID
source.labels <---- MISSING (remember source nodes may have multi-labels)
source.name
target.id <--- MISSING
target.labels <---- MISSING
target.name

Since we "namespaced" properties with "source" and "target" here, might want to do the same for rel, to reduce the chance that property names collide. For example, imagine a relationship with a property called source.name

It didn't work because there was a typo in the documentation: the option is relationship.nodes.map, but I wrote relationship.node.map in the docs :(
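
With the corrected option name, the earlier snippet becomes (a sketch; nothing except the option name changes from the snippet above):

%py

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", dbutils.widgets.get("url")) \
  .option("authentication.basic.username", dbutils.widgets.get("user")) \
  .option("authentication.basic.password", dbutils.widgets.get("password")) \
  .option("relationship", "ORG") \
  .option("relationship.source.labels", "Event") \
  .option("relationship.nodes.map", False) \
  .option("relationship.target.labels", "Organization").load()

df.printSchema()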

This is the related schema:

root
 |-- <id>: long (nullable = false)
 |-- <relationshipType>: string (nullable = false)
 |-- <sourceId>: long (nullable = false)
 |-- <sourceLabels>: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- sourceId: string (nullable = true)
 |-- sourceWordCount: long (nullable = true)
 |-- sourceTone: double (nullable = true)
 |-- sourceSelfGroupReferenceDensity: double (nullable = true)
 |-- sourcePositiveScore: double (nullable = true)
 |-- sourcePolarity: double (nullable = true)
 |-- sourceNegativeScore: double (nullable = true)
 |-- sourceDateStamp: long (nullable = true)
 |-- sourceDate: string (nullable = true)
 |-- sourceActivityReferenceDensity: double (nullable = true)
 |-- <targetId>: long (nullable = false)
 |-- <targetLabels>: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- targetName: string (nullable = true)

I like your idea about the namespacing; let me add it.
