Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neo4j export #91

Merged
merged 10 commits into from Oct 2, 2020
Merged

Neo4j export #91

merged 10 commits into from Oct 2, 2020

Conversation

darabos
Copy link
Contributor

@darabos darabos commented Sep 30, 2020

This is part 1: exporting attributes for existing nodes.

There's an option to set node.keys and let them build the query. But if I use that, the label is a must. If I write the same query manually, I can leave it off. (http://5.9.211.195:8000/neo4j-spark-docs/1.0.0/writing.html#bookmark-write-node)

Open tasks:

  • Make sure this works if the keys are not defined everywhere.
  • Attribute export for edges.
  • Edge export for existing nodes. (I don't think this is important.)
  • Export whole graph as new stuff.
  • Documentation.
  • Tests. (Maybe when the final Neo4j Spark Connector is released.)

Copy link
Contributor

@xandrew-lynx xandrew-lynx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

So if key is not actually a key, then what happens? We will set the parameter for all nodes that match on they "keys", right? I'm not necessarily saying this is bad, just making sure I understand what's going on.

implicit val sd = rc.sparkDomain
val df = inputs.t.df
val keyMatch = keys.map(k => s"`$k`: event.`$k`").mkString(", ")
val query = s"MERGE (n$labels {$keyMatch}) SET n += event"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is event a cypher reserved word here? It refers to the actual record streaming in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is event a cypher reserved word here? It refers to the actual record streaming in?

Yes, it's the actual record. The connector adds UNWIND $events AS event before the query we specify. I'm not sure what is reserved, but that's a part outside of our control at least.

Base automatically changed from darabos-neo4j to master October 1, 2020 07:59
@darabos
Copy link
Contributor Author

darabos commented Oct 1, 2020

So if key is not actually a key, then what happens? We will set the parameter for all nodes that match on they "keys", right? I'm not necessarily saying this is bad, just making sure I understand what's going on.

I think so. I'll be sure to note this in the docs. I think it's just how Neo4j updates work. (Much like SQL updates.)

@darabos
Copy link
Contributor Author

darabos commented Oct 1, 2020

So if key is not actually a key, then what happens? We will set the parameter for all nodes that match on they "keys", right? I'm not necessarily saying this is bad, just making sure I understand what's going on.

The Neo4j import creates an <id> field, which would make an obvious key. But since the ID is not a property, it does not work. I've checked how it can be made to work — it cannot. As https://stackoverflow.com/questions/27729489/neo4j-merge-statement-with-id and the linked answer you must have your hand-crafted keys in Neo4j.

@darabos
Copy link
Contributor Author

darabos commented Oct 1, 2020

Screenshot 2020-10-01 at 15 23 32

Isn't that a joy to see? 😄 I got this with gender and comment selected as node label and relationship type. If you omit one or both:

Screenshot 2020-10-01 at 15 21 40
Screenshot 2020-10-01 at 15 21 01

PTAL

""")
} else {
neo.send(vs, s"""
CALL apoc.create.node(split(event.`$nodeLabelsColumn`, ','), event) YIELD node
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://neo4j.com/labs/apoc/4.1/overview/apoc.create/apoc.create.node/.
This is the way to create a node with a dynamically picked label. Requires apoc to be installed.

} else {
neo.send(es, s"""
MATCH (src {`$VID`: event.`$SRCID`}), (dst {`$VID`: event.`$DSTID`})
CALL apoc.create.relationship(src, event.`$relationshipTypeColumn`, event, dst) YIELD rel
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 100 to 101
// Prefix the internal IDs with this operation's GUID so different exports don't collide.
val guid = F.lit(this.gUID.toString + "-")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is perfect! But what about adding GUID as vertex and edge attribute as well? That would allow users to clean up results of somehow screwed up exports with relative ease. Maybe we can add export date as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is perfect! But what about adding GUID as vertex and edge attribute as well? That would allow users to clean up results of somehow screwed up exports with relative ease. Maybe we can add export date as well?

Actually the timestamp is better than the GUID. Maybe you've recreated your LynxKite instance. Maybe two LynxKite instances use the same Neo4j instance. If everyone exports the example graph, you can end up with two exports with the same GUID quick enough.

I'll switch to just using the timestamp. Easier to interpret too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@xandrew-lynx xandrew-lynx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks a lot!

I only added some nits.

But I wouldn't delay the testing until after the release. If things go well, this will be a very high profile feature. I understand that this is substantial work because we somehow have to get a testing neo4j instance up, so this will probably remove our chances for a release today, but still. Or is this going to be simpler after the final connector is released somehow?

import Operation.Implicits._
params ++= List(
Param("labels", "Relationship labels", defaultValue = ""),
Choice("keys", "Attribute to use as key",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Choice("keys", "Attribute to use as key",
Choice("keys", "Attributes to use as key",

url: String, username: String, password: String, labels: String, keys: Seq[String],
version: Long, nodesOrRelationships: String)
extends SparkOperation[ExportAttributesToNeo4j.Input, ExportAttributesToNeo4j.Output] {
val neo = Neo4jConnection(url, username, password)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first saw this (and of course didn't realize I just reviewed the definition of this class :( ) I thought why would you put a connection kind of class up here, why not create the connection only when needed, in execute. But of course, there is no connection being made here, it just collects some metadata which will be used in a helper method of this object. So then what about naming it differently? E.g. Neo4JHelper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first saw this (and of course didn't realize I just reviewed the definition of this class :( ) I thought why would you put a connection kind of class up here, why not create the connection only when needed, in execute. But of course, there is no connection being made here, it just collects some metadata which will be used in a helper method of this object. So then what about naming it differently? E.g. Neo4JHelper?

I changed it to Neo4jConnectionParameters.

Comment on lines 100 to 101
// Prefix the internal IDs with this operation's GUID so different exports don't collide.
val guid = F.lit(this.gUID.toString + "-")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is perfect! But what about adding GUID as vertex and edge attribute as well? That would allow users to clean up results of somehow screwed up exports with relative ease. Maybe we can add export date as well?

Param("username", "Neo4j username", defaultValue = "neo4j"),
Param("password", "Neo4j password", defaultValue = "neo4j"),
NonNegInt("version", "Snapshot revision ID", default = 1),
Choice("node_labels", "Attribute with node labels",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention that the value of this attribute can be a comma separated list of label values if user wants multiple labels. Would be more elegant to allow vectors here, but probably now is not the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention that the value of this attribute can be a comma separated list of label values if user wants multiple labels. Would be more elegant to allow vectors here, but probably now is not the time.

I expect you usually just use one label. Described the commas in the help.

@xandrew-lynx
Copy link
Contributor

On the screenshot: indeed feels cool! :) And if you just export twice, you will simply see two copies, right?

@darabos
Copy link
Contributor Author

darabos commented Oct 2, 2020

But I wouldn't delay the testing until after the release. If things go well, this will be a very high profile feature. I understand that this is substantial work because we somehow have to get a testing neo4j instance up, so this will probably remove our chances for a release today, but still. Or is this going to be simpler after the final connector is released somehow?

There's a whole Neo4j test library which we were using for the old import. Instead of starting a full Neo4j instance I think it sets something up in-process. Now we can't use it because it's an old version. I'll look around to see if there is a replacement.

@xandrew-lynx
Copy link
Contributor

On the testing: yes, please look around. Of course, if the conclusion is that now it's hard/ugly but in the not-too-far future we a see good solution on the horizon then sure, we can wait of it.

Copy link
Contributor Author

@darabos darabos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments! You can review the docs now and I'll try to do the tests.

Param("username", "Neo4j username", defaultValue = "neo4j"),
Param("password", "Neo4j password", defaultValue = "neo4j"),
NonNegInt("version", "Snapshot revision ID", default = 1),
Choice("node_labels", "Attribute with node labels",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention that the value of this attribute can be a comma separated list of label values if user wants multiple labels. Would be more elegant to allow vectors here, but probably now is not the time.

I expect you usually just use one label. Described the commas in the help.

url: String, username: String, password: String, labels: String, keys: Seq[String],
version: Long, nodesOrRelationships: String)
extends SparkOperation[ExportAttributesToNeo4j.Input, ExportAttributesToNeo4j.Output] {
val neo = Neo4jConnection(url, username, password)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first saw this (and of course didn't realize I just reviewed the definition of this class :( ) I thought why would you put a connection kind of class up here, why not create the connection only when needed, in execute. But of course, there is no connection being made here, it just collects some metadata which will be used in a helper method of this object. So then what about naming it differently? E.g. Neo4JHelper?

I changed it to Neo4jConnectionParameters.

The relationships in Neo4j are identified by a key property (or properties).
You must have a corresponding edge attribute in LynxKite by the same name.
This will be used to find the right relationship to update in Neo4j.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to explicitly spell out what happens if multiple edges match a given key value combination
a) in Neo4j
b) in LynxKite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to explicitly spell out what happens if multiple edges match a given key value combination
a) in Neo4j
b) in LynxKite

Done. I've also added the full Cypher query to the docs. I imagine this may be of interest to advanced Neo4j users and it's not too scary.

[p-node_labels]#Node labels#::
A string vertex attribute that is a comma-separated list of labels to apply to the newly
created nodes. Optional. You must have https://neo4j.com/developer/neo4j-apoc/[Neo4j APOC]
installed on the Neo4j instance to use this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If omitted, the nodes will have no label, right? Is this acceptable in Neo4J then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If omitted, the nodes will have no label, right? Is this acceptable in Neo4J then?

Surprisingly, yes.

[p-relationship_type]#Attribute with relationship type#::
A string edge attribute that specifies the relationship type for each newly created relationship.
Optional. You must have https://neo4j.com/developer/neo4j-apoc/[Neo4j APOC]
installed on the Neo4j instance to use this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. So edges can have only one relationship type, but vertices can have multiple labels? Sorry for bothering you with basic neo4j questions..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. So edges can have only one relationship type, but vertices can have multiple labels? Sorry for bothering you with basic neo4j questions..

Yes. Edges must always have a type, and they can only have a single type. Nodes can have multiple labels or no labels. It's pretty clear that you didn't design this system. 😄

@darabos darabos changed the title [WIP] Neo4j export Neo4j export Oct 2, 2020
Copy link
Contributor

@xandrew-lynx xandrew-lynx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! These tests ended up being very pretty and readable!

@darabos
Copy link
Contributor Author

darabos commented Oct 2, 2020

Nice! These tests ended up being very pretty and readable!

They are actually surprisingly fast too. Around 10-30 seconds. 👍 https://www.testcontainers.org/

@darabos darabos merged commit c2d3b69 into master Oct 2, 2020
@darabos darabos deleted the darabos-neo4j-export branch October 2, 2020 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants