Feature Request - Config option to disable ID reuse #137

Closed
lassewesth opened this Issue Nov 12, 2012 · 26 comments

@lassewesth
Member

@dcolebatch: 'There are conflicting reports out there: some Google Groups posts say you can disable vertex ID reuse[1], some say you can't[2].

We'd love to be able to disable ID reuse in embedded mode, and think it could be documented in Chapter 21[3] if the option already exists.

1: https://groups.google.com/forum/?fromgroups#!searchin/neo4j/node$20ID/neo4j/wn5sQA1BFf4/T-siwqtof5QJ
2: https://groups.google.com/forum/#!topic/neo4j/ilq7624PwaU
3: http://docs.neo4j.org/chunked/milestone/embedded-configuration.html
'

@lassewesth
Member

@tomohisaota: I really want the no-reuse option too.

Here's one of the reasons.
Indexes bind to node IDs. So if I forget to delete an index entry,
that entry could point to a newly created node with a reused node ID.

What's worse, we cannot delete index entries in a Cypher query as of version 1.8M07.
So I would feel much, much safer if Neo4j did not reuse node IDs.
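
To make the scenario above concrete, here is a minimal sketch in Python. It is a toy model only: `ToyStore` and `stale_lookup` are hypothetical names invented for illustration, and the free-list allocator is an assumption, not a description of Neo4j's actual ID allocation.

```python
class ToyStore:
    """Toy record store that can recycle freed IDs (not Neo4j's real allocator)."""

    def __init__(self, reuse_ids=True):
        self.reuse_ids = reuse_ids
        self.nodes = {}       # node id -> properties
        self.free_ids = []    # IDs released by deletes
        self.next_id = 0

    def create(self, **props):
        # Hand out a recycled ID if allowed and available, else a fresh one.
        if self.reuse_ids and self.free_ids:
            node_id = self.free_ids.pop()
        else:
            node_id = self.next_id
            self.next_id += 1
        self.nodes[node_id] = props
        return node_id

    def delete(self, node_id):
        del self.nodes[node_id]
        self.free_ids.append(node_id)


def stale_lookup(reuse_ids):
    """Return what a forgotten index entry resolves to after a delete and a create."""
    store = ToyStore(reuse_ids=reuse_ids)
    index = {}                      # manual index kept by the application
    alice = store.create(name="alice")
    index["alice"] = alice
    store.delete(alice)             # the index entry is not cleaned up
    store.create(name="bob")        # may be handed alice's old ID
    return store.nodes.get(index["alice"])
```

In this toy, `stale_lookup(True)` resolves the stale "alice" entry to bob's node, while `stale_lookup(False)` resolves it to nothing.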

@nambrot
nambrot commented Nov 19, 2012

I'm a +1 on this

@Pictorious

+1 as well

@nigelsmall
Collaborator

IMHO this feels like a feature which shouldn't be implemented. Instead, we should surely be looking to ignore node (and relationship) IDs as much as possible so as not to tightly bind applications to database internals. The IDs are internal artefacts so control over their allocation and usage should be solely the responsibility of the database engine.

The ID has no real use outside of the database engine anyway, other than to unambiguously identify entities during that period of database uptime. I would not write an application which would rely on IDs to be carried forward across periods of uptime since if I needed to rebuild a new database copy following a failure, there would be no guarantee that the IDs would be the same.

@tomohisaota: I don't think your index scenario holds true. It isn't possible in Neo4j to delete a node which would leave hanging index entries since these are automatically cleared up following the node deletion (lazily I believe). To that end, an index entry could never end up "accidentally" pointing to a new node.

Nige

@nambrot
nambrot commented Nov 20, 2012

I think the reason this has support (myself included) is that it is a very
familiar practice within traditional DBs. Really, all I want is to be able
to uniquely reference nodes. Could I set up my own ID generation and index
the nodes by it? Yeah, sure. Do I want to? Probably not.

I unfortunately do not know enough about the internals of Neo4j, but given
that it might be a fairly common use case, IMHO it's something worth
considering.


@nigelsmall
Collaborator

I agree that other databases provide auto_increment functionality or similar but the Neo4j ID mechanism is not the same and cannot be used as such.

It would probably be better to argue for a separate permanent unique ID generation mechanism to be written for Neo4j rather than a feature which would force the existing internal IDs into something which they weren't designed to be. Some people have used UUIDs for such uniqueness already and since most languages support this mechanism in some form or other, the overhead of doing this is minimal.
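
As a rough illustration of the UUID approach mentioned above, here is a minimal Python sketch using the standard `uuid` module. The helper names and the `"uuid"` property key are assumptions made for this example; nothing here is a Neo4j API.

```python
import uuid


def new_external_id():
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


def make_node_properties(**props):
    """Copy the given properties and attach a permanent application-level ID."""
    result = dict(props)
    result["uuid"] = new_external_id()   # hypothetical property name
    return result
```

The resulting property would still need to be indexed by the application in order to look a node up by it.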

Perhaps the best way to think of these IDs is akin to memory addresses of variables: there's no guarantee they'll be the same next time you run your application.

Nige

@mypark
mypark commented Nov 20, 2012

Agreed with nambrot. It's fairly common to want a durable unique ID to reference a node, and while unique IDs can be generated in application logic outside of the DB, it seems cleaner to have the DB create them, since we then wouldn't have to create another index for them. The benefit and convenience of using the DB-generated IDs seem to far outweigh the overhead of keeping deleted IDs.

What are the cons of keeping these deleted ids? Is it just a matter of orphaned ids and disk space, or does it affect query performance as well?

Regardless, I think there are a lot of production apps that don't ever delete records (just setting a deleted flag property instead), especially social apps I've worked on, so this might not be an issue for a large class of applications.

@tomohisaota

I don't think index entries get deleted when a node gets deleted (except with auto-indexing).
And I actually had a problem where my index pointed to the wrong node.
(It could have been a bug in my software, though :-)

Anyway, I ended up using a unique ID generator similar to Twitter Snowflake.
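
For readers unfamiliar with the idea, a Snowflake-style generator packs a millisecond timestamp, a worker number, and a per-millisecond sequence into one integer. The following Python sketch is an assumption-laden illustration, not the generator used above: the class name, the custom epoch, and the way sequence overflow is handled (advancing a logical clock instead of waiting) are all choices made for this example.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """Sketch of a Snowflake-style ID: timestamp | worker (10 bits) | sequence (12 bits)."""

    EPOCH_MS = 1_352_678_400_000   # arbitrary custom epoch (assumption)
    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        # Injectable clock (milliseconds) so the generator can be tested.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock is behind our logical time: stay on the last timestamp.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: advance logical time.
                    now = self._last_ms + 1
            else:
                self._sequence = 0
            self._last_ms = now
            shift = self.WORKER_BITS + self.SEQUENCE_BITS
            return (((now - self.EPOCH_MS) << shift)
                    | (self.worker_id << self.SEQUENCE_BITS)
                    | self._sequence)
```

IDs from one generator are strictly increasing, and generators with different `worker_id` values cannot collide because the worker bits differ.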

@akollegger
Member

@nigelsmall is exactly right about this. The proper solution is to lobby for unique ID generation support.

The node and relationship IDs should be thought about as if they were object references in your application: perfectly valid while your program is running, but you can't rely on an object's ID to be there after a restart. Neo4j exposes the node and relationship IDs for convenience only.

@mypark
mypark commented Nov 20, 2012

So what exactly are the valid scenarios in which the node IDs should be used, and what is the rule for how long-lived the node IDs should be? Do the IDs get reclaimed only on DB restart? If the IDs are volatile, should they even be used at all?

Even if I'm assuming that ids are valid only for a short session, there is a chance that the db can get restarted during that session if I'm using a hosted db service, and that I would be updating incorrect nodes.

@nambrot
nambrot commented Nov 20, 2012

At the end of the day, people asking for IDs probably don't care how they
get generated or what the deeper semantics are. At least that's me. Exposing
IDs just begs for them to be used for uniquely referencing the nodes, and I think
that's the underlying problem people have.


@akollegger
Member

I've been careless when demonstrating Neo4j, using node ids in queries because there is less typing involved. In a single-user scenario I'm not worried about the data changing. Really, I should always use an index look-up. I would only use node ids directly within a transaction running an embedded Neo4j. Remotely, it seems risky.

@nambrot That's a very good observation. By exposing the IDs, Neo4j creates the wrong expectations. That's something we'll try to correct by adding better facilities to the model.

@nambrot
nambrot commented Nov 20, 2012

The question now becomes whether it is already too late to change those
expectations. As you said, it seems very natural to do, hence I'd like to
ask how costly it would be to adjust to that expectation?


@akollegger
Member

Using the internal database IDs is a mistake that should be corrected by offering the correct solution to the need. While the misunderstanding is understandable, the leak should be fixed, not made wider.

@pangloss

I've given this a little thought from the perspective of a consumer of the Neo4j API and have an approach to suggest.

I think we should split the problem: the internal identifier (II) and the external one (ID). Currently they are complected in Neo4j but should be relatively straightforward to separate. I suggest that you keep your II code but make it private or protected under a different method name, then repurpose the getID method to return an ID that has the semantics that people are demanding. Current IIs could be copied to be the new IDs for existing elements, while new elements could have their ID be generated sequentially (my preference) or via some other algorithm.
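
A minimal Python sketch of the split suggested above, under loud assumptions: `ToyGraph` and `Node` are hypothetical, the internal identifier is modelled as a recyclable storage slot, and the external ID is a simple sequential counter. It does not reflect how Neo4j stores records.

```python
class Node:
    def __init__(self, internal_id, external_id, props):
        self._internal_id = internal_id   # storage slot (the "II"); may be recycled
        self.id = external_id             # stable external ID; never reassigned
        self.props = props


class ToyGraph:
    def __init__(self):
        self._slots = {}              # internal id -> Node
        self._free_slots = []         # internal ids available for reuse
        self._next_slot = 0
        self._next_external = 0       # sequential and never rewound
        self._external_to_slot = {}   # lookup table: external id -> internal id

    def create_node(self, **props):
        # Internal slots are recycled; external IDs always move forward.
        if self._free_slots:
            slot = self._free_slots.pop()
        else:
            slot = self._next_slot
            self._next_slot += 1
        node = Node(slot, self._next_external, props)
        self._next_external += 1
        self._slots[slot] = node
        self._external_to_slot[node.id] = slot
        return node

    def delete_node(self, node):
        slot = self._external_to_slot.pop(node.id)
        del self._slots[slot]
        self._free_slots.append(slot)

    def get_node(self, external_id):
        slot = self._external_to_slot.get(external_id)
        return None if slot is None else self._slots[slot]
```

In this toy, a node created after a delete takes over the freed slot but receives a new external ID, so a stale external ID resolves to nothing. The price is the extra `_external_to_slot` lookup table, which is effectively an index.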

@peterneubauer

Duplicate of #1

@jakewins
jakewins commented Mar 4, 2013

Closing this as duplicate

@jakewins jakewins closed this Mar 4, 2013

@aseemk
aseemk commented Mar 16, 2013

@jakewins / @peterneubauer, this isn't a duplicate of #1 at all... #1 is saying "the indexes being updated lazily when nodes get deleted can lead to corruption", and this issue is saying "please just add a config to not reuse node IDs".

I've been pretty seriously +1 in this issue's camp for a while. Here were my thoughts to the mailing list:

https://groups.google.com/d/topic/neo4j/H0k17PkMWfg/discussion

To summarize, here's why we, too, use Neo4j IDs externally:

  • We want short IDs, because we use them in user-facing URLs, and Neo4j's IDs achieve this naturally since they're auto-increment. This also means UUIDs are out.
  • We don't want to have to save/remember every deleted key. Again, auto-incremented IDs achieve this naturally.
  • We need the performance of lookup-by-ID to be as fast as possible, and avoiding an index lookup helps a lot.

I've discussed this with @akollegger and others a bit. I understand the desire to expose a different semantic ID, but at the end of the day, power users will want to get access to the performance of file-offset lookups.

So please consider this a +1 from me too. =)

@aseemk
aseemk commented Mar 16, 2013

@mypark:

Regardless, I think there are a lot of production apps that don't ever delete records (just setting a deleted flag property instead), especially social apps I've worked on, so this might not be an issue for a large class of applications.

This is the workaround we were going to take too, but we've realized it's definitely not the same.

Cypher queries will now be vulnerable to acting on "deleted" nodes, where they wouldn't be if those nodes were actually deleted. You'd have to add a safeguard into every Cypher query of yours today plus all future ones.
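
The point can be illustrated without Cypher. The Python sketch below is a toy stand-in for queries over soft-deleted data; the data, the `deleted` flag name, and the function names are all made up for the example.

```python
NODES = [
    {"id": 1, "name": "alice", "deleted": False},
    {"id": 2, "name": "bob", "deleted": True},    # soft-deleted
    {"id": 3, "name": "bob", "deleted": False},
]


def live(nodes):
    """The safeguard: skip anything flagged as deleted."""
    return [n for n in nodes if not n.get("deleted", False)]


def find_by_name_unguarded(nodes, name):
    """A query that forgot the safeguard and still sees soft-deleted nodes."""
    return [n["id"] for n in nodes if n["name"] == name]


def find_by_name_guarded(nodes, name):
    """The same query with the safeguard applied."""
    return [n["id"] for n in live(nodes) if n["name"] == name]
```

Every query has to remember to go through `live`; any one that does not will keep returning the "deleted" node.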

I was accepting of the lack of this feature before, when I thought the workaround was easy, but now I really am feeling pretty bummed about it.

I haven't really heard a single good reason for not exposing this config other than "the IDs will change if you re-import all your data into a new graph". This is just not an issue in practice for real-world production systems IMHO. (Though please correct me if you've experienced otherwise.) We would only ever reset to a previous backup; we'd never manually re-import data.

I totally get why this shouldn't be the default config. Or why you might still discourage folks from using the Neo4j ID externally. But please, there are so many reasons to use the ID if you know what you're doing and are willing to accept the growing-disk-space trade-off. A config to disable reuse of it would be soooo helpful. Please consider!

@pietermartin

I certainly would also really like this feature. For now I am having to generate my own ids and index every node. This has a performance impact on insertions and probably on lookups also. Also, I have to worry about how well Lucene will scale with billions of nodes.

The solution of soft deletes does not appeal to me. We used to do it on a Hibernate system (using a Hibernate filter for the entity manager). It added a level of noise to the system and yet another thing to always worry about. Every time a business user doubted some report, we ended up worrying whether the report creator had forgotten about the deleted rows.

Anyhow +1 for neo4j managed ids.

@mypark
mypark commented May 14, 2013

I haven't seen any activity on this in a while, so I wanted to check in and ask: is there anything in Neo4j 2.0 that might address this? I'm running into this issue again on another project and am loath to use something like a GUID that we have to index.

@akollegger
Member

Neo4j 2.0 will support uniqueness constraints for a property, which will be indexed. We're considering approaches for generating those unique values.

@rorymadden rorymadden referenced this issue in rorymadden/neoprene Jul 30, 2013
Open

Use a separate id from the neo4j node id #9

@fixundfertig123

Hey @akollegger, is there any progress on the feature to generate those unique values internally in Neo4j? This feature looks to me like it is requested by various people (see http://stackoverflow.com/search?q=%5Bneo4j%5D+unique+id). Thanks for your work!

@triomen
triomen commented Aug 10, 2015

Hi,

@akollegger, does this mean that there will be a new way to generate UIDs, or that we will be able to rely on node IDs in future Neo4j versions?

Thanks for your work!

@ADTC
ADTC commented May 22, 2016

If you didn't want people to misuse database internals, you shouldn't have exposed them in the first place. I like the idea of making these internal IDs private and hidden, and instead repurposing the getID to get an actual externally usable unchanging id (stored as a property), which is also auto-incremented for every new node with the matching criteria (or for simplicity, just incremented database-wide).
