fixes #145: [ReadSupport] Add support to read from relationship with provided type #157
Conversation
Bug running this code:
Generally I was able to get it to work, but importantly the flatten map option didn't work for me.
Beware of when the schema sampling happens: I was iterating, creating new properties on (previously) empty relationships, and re-running my query, and it seemed like it wasn't picking up the relationship property change, so I'm not sure what's going on there.
Given this example:
create (a:Customer { name: "David" })-[:BOUGHT { qty: 11 }]->(p:Product { name: "Socks" })
It'd be nice if this could return a dataframe like:
rel.id
rel.type
rel.qty
source.id <----- MISSING - at present you only get props, not internal ID
source.labels <---- MISSING (remember source nodes may have multi-labels)
source.name
target.id <--- MISSING
target.labels <---- MISSING
target.name
Since we "namespaced" properties with "source" and "target" here, we might want to do the same for rel, to reduce the chance that property names collide. For example, imagine a relationship with a property called `source.name`.
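The collision described above can be sketched in plain Python. The `flatten` helper and the record layout here are hypothetical illustrations, not the connector's actual code:

```python
# Hypothetical record: a relationship whose own properties include one
# literally named "source.name", plus its source/target node properties.
record = {
    "rel": {"qty": 11, "source.name": "oops"},
    "source": {"name": "David"},
    "target": {"name": "Socks"},
}

def flatten(record, prefix_rel=False):
    """Flatten nested property maps into 'section.key' column names.

    When prefix_rel is False, relationship properties keep their raw names,
    mimicking a scheme that only namespaces source/target columns.
    """
    flat = {}
    for section, props in record.items():
        for key, value in props.items():
            if section == "rel" and not prefix_rel:
                flat[key] = value  # unprefixed rel property
            else:
                flat[f"{section}.{key}"] = value
    return flat

# Without a rel prefix, the rel property "source.name" collides with the
# flattened source-node column, and one of the two values is silently lost.
unsafe = flatten(record, prefix_rel=False)
# Prefixing rel properties too keeps every column distinct.
safe = flatten(record, prefix_rel=True)
```

With `prefix_rel=False` the record yields only three columns (the rel value for `source.name` is overwritten), while `prefix_rel=True` yields four distinct columns (`rel.qty`, `rel.source.name`, `source.name`, `target.name`).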
@@ -149,6 +149,22 @@ in case you set this option will have the priority compared to the one defined i
|``
|Yes^*^
|

|`relationship.node.map`
I like this setting; I'd suggest flattening the DF by default if this is feasible, because it prevents the user from having to destructure it and it's more "relational friendly", making the map behavior something you can toggle on instead.
I find the name a bit confusing. "Relationship node"? What's that? :)
Not totally clear on what's being said here. A relationship is a property map, but so are the nodes. Is this setting flattening all of them, or just the relationship one? Maybe a clearer name would be `properties.flatten=True`.
This option doesn't appear to work for me:
%py
df = spark.read.format("org.neo4j.spark.DataSource") \
.option("url", dbutils.widgets.get("url")) \
.option("authentication.basic.username", dbutils.widgets.get("user")) \
.option("authentication.basic.password", dbutils.widgets.get("password")) \
.option("relationship", "ORG") \
.option("relationship.source.labels", "Event") \
.option("relationship.node.map", False) \
.option("relationship.target.labels", "Organization").load()
df.printSchema()
yields
root
|-- <id>: long (nullable = false)
|-- <relationshipType>: string (nullable = false)
|-- <source>: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- <target>: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Could `relationship.nodes.as.map` be a better name?
When you use both the words relationship and node in the same setting, that's what confuses me, because I can't tell if you're referring to one, the other, or both. I guess I would think that `relationship.nodes.as.map` would return node properties as a map, but then I'd be left wondering what happens to relationship properties (they follow the same setting!).
|List of target node Labels separated by `:`
|``
|Yes^*^
|

|`schema.flatten.limit`
|Number of records to be used to create the Schema (only if APOC is not installed)
I don't really understand what this means. Does it constrain the maximum number of columns that can come back on a flattened DF? If I set a limit and something has 40 properties, I guess I lose the ones over the limit, but I can't tell which ones I lose, which is problematic.
Let me explain. In case APOC is not installed, we use these two queries:
For nodes:
val query = s"""MATCH (${Neo4jUtil.NODE_ALIAS}:${labels.mkString(":")})
|RETURN ${Neo4jUtil.NODE_ALIAS}
|ORDER BY rand()
|LIMIT ${options.query.schemaFlattenLimit}
|""".stripMargin
For relationships:
val query = s"""MATCH (${Neo4jUtil.RELATIONSHIP_SOURCE_ALIAS}:${options.relationshipMetadata.source.labels.mkString(":")})
|MATCH (${Neo4jUtil.RELATIONSHIP_TARGET_ALIAS}:${options.relationshipMetadata.target.labels.mkString(":")})
|MATCH (${Neo4jUtil.RELATIONSHIP_SOURCE_ALIAS})-[${Neo4jUtil.RELATIONSHIP_ALIAS}:${options.relationshipMetadata.relationshipType}]->(${Neo4jUtil.RELATIONSHIP_TARGET_ALIAS})
|RETURN ${Neo4jUtil.RELATIONSHIP_ALIAS}
|ORDER BY rand()
|LIMIT ${options.query.schemaFlattenLimit}
|""".stripMargin
and we use the result of these in order to retrieve the schema. Since we cannot dump the entire node/relationship set, we provide `schema.flatten.limit` in order to sample a fixed number of results. So we are not defining the number of columns but the number of Neo4j results to parse in order to compute the schema.
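The sampling strategy described above can be sketched in plain Python; the function and names below are illustrative, not the connector's implementation. The key point is that the limit caps how many records are sampled, never how many columns the schema may contain:

```python
import random

def infer_schema(records, schema_flatten_limit=10):
    """Infer a schema by sampling up to `schema_flatten_limit` records
    and taking the union of their property keys."""
    sample = random.sample(records, min(schema_flatten_limit, len(records)))
    schema = {}
    for record in sample:
        for key, value in record.items():
            # First type seen for a key wins in this simplified sketch.
            schema.setdefault(key, type(value).__name__)
    return schema

records = [{"qty": 11}, {"qty": 3, "discount": 0.1}, {"qty": 7, "note": "gift"}]
schema = infer_schema(records, schema_flatten_limit=10)
```

With a limit at or above the record count every property is seen; a smaller limit may miss properties that only appear on unsampled records, which is the trade-off the reviewer is asking about.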
Consider naming this `schema.sample` and then including this explanation in the docs. Literally you're defining a sample size, right?
* `<id>` the internal Neo4j id
* `<relatioshipType>` the relationship type
typo "relationshipType"
fixed
==== Schema

If APOC is installed, the schema will be created with `apoc.meta.relTypeProperties`. Otherwise the first 10 results (or any number specified by the `schema.flatten.limit` option) will be flattened and the schema will be created from those properties.
Better to put this information with the option in the doc. I missed it twice when reading through, trying to figure out what the option did.
I was able to reproduce it, thank you!
It didn't work because there was a typo in the documentation. This is the related schema:
I like your idea about the namespacing, let me add it.
…with provided type
fixes #145
This PR introduces support for querying relationships by providing a simple path like

(source)-[re]->(target)

in the following way:

This will create a Cypher query as follows:
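The query generation can be sketched in plain Python, mirroring the Scala snippet quoted earlier in the thread. The function name, aliases, and single-MATCH form are hypothetical simplifications, not the connector's exact output:

```python
def build_relationship_query(source_labels, rel_type, target_labels):
    """Build a Cypher MATCH query from source/target labels and a
    relationship type, joining multiple labels with ':'."""
    source = ":".join(source_labels)
    target = ":".join(target_labels)
    return (
        f"MATCH (source:{source})-[rel:{rel_type}]->(target:{target})\n"
        "RETURN source, rel, target"
    )

query = build_relationship_query(["Customer"], "BOUGHT", ["Product"])
```

For the `Customer`/`BOUGHT`/`Product` example from earlier in the thread, this produces a pattern of the shape `MATCH (source:Customer)-[rel:BOUGHT]->(target:Product)`.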