Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New function required: StructuredNode.merge #438

Closed
john-theo opened this issue Apr 22, 2019 · 13 comments
Closed

New function required: StructuredNode.merge #438

john-theo opened this issue Apr 22, 2019 · 13 comments

Comments

@john-theo
Copy link

Awesome awesome work!

I want to do a = Article(title='some title').merge() like actual MERGE action in Cypher.
The lines below are from neomodel.core. The if statement on line 520 decide whether to update or create the node. THE PROBLEM IS: How should the node has the attribute of id when it was just created?
OR, is there already a way to do something like a = Article(title='some title').merge()? I went through the documentation and the codes, found get_or_create and create_or_update, but they are not what I want.

Many thanks!

neomodel/neomodel/core.py

Lines 512 to 534 in cca5de4

def save(self):
"""
Save the node to neo4j or raise an exception
:return: the node instance
"""
# create or update instance node
if hasattr(self, 'id'):
# update
params = self.deflate(self.__properties__, self)
query = "MATCH (n) WHERE id(n)={self} \n"
query += "\n".join(["SET n.{0} = {{{1}}}".format(key, key) + "\n"
for key in params.keys()])
for label in self.inherited_labels():
query += "SET n:`{0}`\n".format(label)
self.cypher(query, params)
elif hasattr(self, 'deleted') and self.deleted:
raise ValueError("{0}.save() attempted on deleted node".format(
self.__class__.__name__))
else: # create
self.id = self.create(self.__properties__)[0].id
return self

@aanastasiou
Copy link
Collaborator

Awesome awesome work!

@john-theo On behalf of the ~30 developers who have contributed to neomodel, I am just going to say "Thanks" :)

How should the node has the attribute of id when it was just created?

It depends on what are you trying to do but from what you describe, the best thing to do is to use get_or_create and create Article entities en masse.
The if statement you are pointing out is a design choice. If the node has an id attribute set, then it means that it is saved already (i.e. a Node exists in the database server). If you are trying to update the node then once you have a handle to it (e.g. my_node = Article.nodes.get(some_unique_attribute="some_value")), change the attribute values and do a my_node.save() again.

Hope this helps.

@john-theo
Copy link
Author

Sorry for not describing it clearly.

It depends on what are you trying to do but from what you describe

The MERGE function I mentioned can be found [here]. It allows to create or update(depends on unique indexes) a node within ONE cypher query.
I have written a few lines(it works nicely in my own environment) and made a pull request #440 (it is my first time to contribute to an open source project TAT, really hope it turns out well). Somehow the Travis build failed.
Will you be so kind as to maybe take a look at it? @aanastasiou

Many thanks in advance.

@aanastasiou
Copy link
Collaborator

@john-theo What is it that you are trying to achieve? Let's not worry about how it is going to be done at the moment.

@john-theo
Copy link
Author

john-theo commented Apr 23, 2019

Let's say we are going to add a bunch of nodes and relations to a graph (For example, multiple articles and authors), and authors who previously wrote other articles are already in the graph. Assuming the unique index of Author be full_name, in other words, I don't want two Author nodes with the same full_name in the graph, and the WROTE relation arrows pointing from any Author whose full_name equals "A Good Author" should be pointing from the same Author node.

Now, I have many new articles and authors to be inserted into the graph.

The plain neo4j Cypher query would be

CREATE CONSTRAINT ON (author:Author) ASSERT author.full_name IS UNIQUE

CREATE (b:Author {full_name: 'maybe a new author'})-[r:WROTE]->(a:Article {title: 'new article'})
RETURN a,r,b

The neomodel expression would be

class Author:
    full_name = StringProperty(unique_index=True, required=True)

article = Article(title="new article").save()
author = Author(full_name="maybe a new author").save()
author.wrote.connect(article)

And these expressions may trigger a ConstraintError if an Author whose full_name is "maybe a new author" is already in the database.

To deal with this situation, a better(-ish) way to carry out the node insertion would be:

The plain neo4j Cypher query

MERGE (b:Author {full_name: 'maybe a new author'})-[r:WROTE]->(a:Article {title: 'new article'})
RETURN a,r,b

I wanted to perform a similar action in neomodel, and the current save function, as I mentioned in the first place, decides whether to perform a CREATE Cypher query or a MERGE Cypher query by examining whether the node instance has an attribute of id(already saved in the graph).
But as I had just created the Author instance, it wouldn't have the id attribute(equals to None), so each time I do a save action to an Author instance, it will ALWAYS perform a CREATE Cypher query, and the situation I described above appears from time to time. And even the save action goes into the update branch, it still relies on the prerequisite of the id attribute.

Yes, I can first query whether the Author I just created has existed in the graph, and then decide what to do. But would it be nicer if we ALWAYS do a MERGE query on a newly created StructureNode object? So everybody won't be worrying about whether the node has already existed or not.

The function I contributed can be used like this:

author = Author(full_name="maybe a new author").merge()

In my own project, I bind the function with the core.StructureNode class, and it works.

Um, that's pretty much the whole story, and thanks for your time and patience.

@aanastasiou
Copy link
Collaborator

authors who previously wrote other articles are already in the graph

As someone who has created a platform to process scientific literature data at the scale of a few GB (with a single desktop) and above (with distributed queries over the enterprise version of Neo4J), all based on neomodel, I can tell you that:

  1. What you are trying to do can be achieved entirely via neomodel.
  2. The performance bottleneck you will meet is not in neomodel but in Neo4j. It is a "natural" bottleneck that is based on the use of indices. The only way to overcome that bottleneck is to do the de-duplication yourself, prior to uploading the data. And yes, that is as messy as it sounds.

Now, having said this:

...And these expressions may trigger a ConstraintError if an Author whose full_name is "maybe a new author" is already in the database.

You have identified this part (and the text leading up to it of course) correctly. Indeed, neomodel cannot establish a full path right now and the operations that lead to it have to be applied individually and yes, if done in this way, an exception might be triggered.

The way to do this (that respects indices) is get_or_create.

(In fact, when you do this, you might still come across another "bug" that is related to this pull request where you have to establish a nested arrangement of try/catch statements.)

A typical approach here would be to first create all the nodes and then all connections between them. For more information please see this issue here and the ensuing discussion.

...each time I do a save action to an Author instance, it will ALWAYS perform a CREATE Cypher query, and the situation I described above appears from time to time. And even the save action goes into the update branch, it still relies on the prerequisite of the id attribute.

This is done for a reason. Some StructuredNode functionality only makes sense if the node exists in the database. Otherwise, it would generate redundant queries or errors. In general, what neomodel strives for is referential integrity. And this extends to the client-side view of the database through the established models. So, creating a node and subsequently modifying it has to be done "in-step". neomodel does have some functionality for lazy operations that might speed up certain operations but even in that case, the optimisation is in the amount of data communicated between the server and the client.

Have a look at get_or_update(). It was put in place for this type of "batch" operations.

Hope this helps.

@john-theo
Copy link
Author

Now it's crystal clear! Thank you so much! A guide like this should be in the project document to help neo4j starters like me. I'm closing this issue and the related pull request. Thanks again!

@aanastasiou
Copy link
Collaborator

@john-theo No worries, feel free to reach out anytime.

@P1zz4br0etch3n
Copy link

I still don't understand why there is no merge operation on single nodes. I work on a project which modifies only a few nodes and relationships in our graph, based on incoming event messages, not knowing which properties are changed. So I have to get every single node, just to check if it's already there, overwrite every single property for it might have changed and then save it again or for the first time. To create an instance and just merge would be much easier for me.

@aanastasiou
Copy link
Collaborator

@P1zz4br0etch3n I am sorry, I did not spot this earlier.

I still don't understand why there is no merge operation on single nodes.

Because it is impossible to "merge" the id attribute.

Every node has an id attribute internally that is very important to neomodel because it determines the state of the node and how to handle certain operations. Very briefly: If the node object does not have an id yet, it should be created. If it does, then it should be updated. If the node does not have an id then it cannot be deleted (and vice versa).

Attempting to merge without having filled in this id leads to a new node (and possible subsequent index clashes depending on your setup) because this attribute would not have been matched.

This is why the node has to first be retrieved.

A possible solution might be to use recently added functionality described here. You can do a lazy .get(), update the properties you wish to update and then call .save().

Hope this helps (?)

@P1zz4br0etch3n
Copy link

Thanks for your reply.

I get your point, but what if you implement the merge() method just the same way like the create_or_update() method: matching by its required and/or unique properties?

@aanastasiou
Copy link
Collaborator

@P1zz4br0etch3n I understand. At the moment I cannot see how would we go around trying to merge a node and not having the id attribute that is also required (?). Maybe I am not getting something though (?).

@P1zz4br0etch3n
Copy link

Maybe it's getting clear if I explain a little bit how our model currently works. We have a property named "key" on every node type, which is kind of our primary key. We force this property to be unique, required and indexed by adding a "is key" constraint on it, like it is explained here. To create or update such a node we use a statement like MERGE (n:Type {key: "some key"}). So we don't see the internal id at any time and though never create a new node with the same key.

I didn't look deeper into the implementation of your create_or_update() function yet, but when I tested it I didn't need to put a id into my dictionary neither. I even could use it with just one single entry. I implemented a little function which takes a StructuredNode instance, gets all values out of it, puts them into a dict and calls the function passing that dict. Then I bound the function to the StructuredNode class, which worked because now self is passed as first argument. But my IDE obviously wasn't able to auto complete this "method" because it's only added at runtime. So the easiest way would be if you do the same or maybe you can even find a better solution.

@john-theo
Copy link
Author

As the case in April, I implemented my own merge function and it worked fine.

Now I'm working on another project using neomodel, and this time, I want to try bulk operations.
Here's some toy code:

from neomodel import *
from random import randint

config.DATABASE_URL = 'bolt://neo4j:mypassword@localhost:7687'
config.AUTO_INSTALL_LABELS = True

class Person(StructuredNode):
    name = StringProperty(unique_index=True)
    age = IntegerProperty(required=True)


john = Person(name='John', age=randint(5,80))
alice = Person(name='Alice', age=randint(30,40))

Person.create_or_update(*[
    x.__dict__ for x in [john, alice]
])

Person.create_or_update(*[
    x.__dict__ for x in [john, alice]
])

As documented here, it works fine.

But if I run the following code TWICE, I got an error.

from neomodel import *
from random import randint

config.DATABASE_URL = 'bolt://neo4j:mypassword@localhost:7687'
config.AUTO_INSTALL_LABELS = True

class Person(StructuredNode):
    name = StringProperty(unique_index=True)
    age = IntegerProperty(required=True)


john = Person(name='John', age=randint(5,80))
alice = Person(name='Alice', age=randint(30,40))

Person.create_or_update(*[
    x.__dict__ for x in [john, alice]
])

The error is as follows:

neomodel.exceptions.UniqueProperty: Node(3) already exists with label `Person` and property `name` = 'John'

I was intended to batch update Person Node information, is there a proper way to do this?
@aanastasiou

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants