
Unpredictable behaviour when an indexable node string property is near its maximum key length #12076

Closed
aanastasiou opened this issue Nov 12, 2018 · 11 comments

@aanastasiou commented Nov 12, 2018

The problem

  • Neo4J version: 3.4.5 (community)
  • Operating System: Ubuntu 16.04
  • API/Driver: Python driver 1.6.2

When a node string property participates in an index, Neo4J imposes a (reasonable) limit on its byte length. However, sometimes, even if one trims the string to that limit, the transaction goes ahead and no problem is reported, but the database is left in an unusable state from which it has to be recovered by a restart.

The specific question I have about this issue is a request for clarification of the 4095-byte limit, because if I trim at 4095 characters, the query goes ahead gracefully but still leaves the database in an unusable state.

Is there anything else that might be added to this key internally that has to be taken into account?

Currently, I am trimming at 4000 bytes, which is even lower than keyValueSizeCap (4047), but there are still test cases that fail :(

I do not want to empirically lower this limit until I get no errors in my test cases; rather, I would like to understand how to determine the limit at which to trim an indexable property so that no errors occur. This would also help a lot with an effective solution for the validation of neomodel's StringProperty.
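As an illustration, here is a minimal sketch of the byte-aware trimming I am currently doing, assuming the cap applies to the UTF-8 encoding of the string (the constant and helper names below are placeholders of mine, not anything from Neo4J or neomodel):

```python
# Minimal sketch: trim a string so its UTF-8 encoding fits a byte budget,
# without splitting a multi-byte character at the boundary.
TRIM_LIMIT_BYTES = 4000  # empirical value I currently use; still not safe

def trim_utf8(value, limit=TRIM_LIMIT_BYTES):
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops any partial multi-byte sequence at the cut point.
    return encoded[:limit].decode("utf-8", errors="ignore")
```

Even with this in place, some test cases still leave the server unusable, which is why I would like to know the exact number to trim at.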

Reproducing the problem

A much more detailed presentation of the problem with test cases of predictable and unpredictable behaviour is available in this repository.

Expected behaviour

If the node property does not participate in an index, Neo4J has no problem storing attribute values that may be as big as 8kB.

Rather than throwing an exception which halts the whole process, would it be possible to consider specifying that, in the case of indexes, only the first ~4kB of the string's byte stream participate in the key?

This means that if two strings happen to have their first N bytes identical, they would be considered the same even if their "endings" differ (e.g. "Hello Neo4J" and "Hello" with key_size=5). This could be specified in the documentation around node properties.

In this way, a property can still store long(ish) strings if required.

Note: The data I am dealing with were not expected to have such long values; in samples of hundreds of thousands of data items, just 2 happen to have them. We are talking about very rare, but still possible, occurrences. This indexing requirement means that I have to find an alternative way of both indexing the attribute and retaining its complete value, which I will have to process at some point. I do not mind at all if duplicates are flagged on just the first ~4kB of the string, because in my use case, if the first 4kB are the same, it is very likely that this is indeed the exact same value. But I cannot store only the trimmed value; I have to store both the trimmed value (for the purposes of indexing) and the non-trimmed value (for the purposes of analysis). This, rather than being able to store arbitrarily long pieces of text in Neo4J (which would not be good practice anyway), is the motivation behind this request. A sketch of this workaround follows below.
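To make the request concrete, here is a rough sketch of the client-side workaround I describe above, using the neo4j-driver 1.6 API and the trim_utf8 helper from the earlier sketch; the payload_key / payload property names are hypothetical:

```python
# Rough sketch: store a trimmed copy of the value in a unique-indexed
# property (payload_key) and the full value in an unindexed one (payload).
# Label and property names are hypothetical.
from neo4j.v1 import GraphDatabase  # import path for driver 1.6.x

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

long_value = "A" * 8192  # an example value well above the ~4kB cap

with driver.session() as session:
    session.run("MERGE (n:SomeEntity {payload_key: $key}) "
                "SET n.payload = $full",
                key=trim_utf8(long_value), full=long_value)
```

If the index itself used only the first N bytes of the key, this duplication would not be needed.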

Actual behaviour

The server is left in an unusable state which necessitates a restart (or sometimes, even a complete wipeout of the database directory).

@fickludd (Contributor) commented Nov 13, 2018

Thank you for reporting this issue. The server should of course never be left unusable, so this seems like a serious issue. Could you paste an example of an offending string which you think should be indexable, but isn't?

By the way, according to the documentation the size limit is 4039 bytes; where did you find keyValueSizeCap = 4047?

@aanastasiou (Author) commented Nov 13, 2018

Hi @fickludd, I am working on a copy of my main.py that uses the specific test cases that led to the discovery of this issue, and I will get back to you on that one.

I am reporting the keyValueSizeCap here as it appears in the exception; I have posted a complete exception here. I could also attach an extract of logs/debug.log if that would help. I can see that, in the background, the server is endlessly trying to "sample" the particular index that is causing the problem.

2018-11-12 11:42:45.875+0000 WARN [o.n.k.i.a.i.s.OnlineIndexSamplingJob] Not finished: Sampling index Index( UNIQUE, :SomeEntity(payload) ) in 0 ms
...
[This goes on for ~100 lines and ~17 min of real time. It's from the second time I managed to replicate this behaviour out of the main project code]
...
2018-11-12 11:59:45.877+0000 WARN [o.n.k.i.a.i.s.OnlineIndexSamplingJob] Not finished: Sampling index Index( UNIQUE, :SomeEntity(payload) ) in 0 ms
2018-11-12 11:59:55.435+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Shutdown started

All the best

@aanastasiou (Author) commented Nov 13, 2018

@fickludd, I have included a test case with specific inputs in this file. And also, here is the same test case but with the trimming at the 4000-byte limit, which however still produces the same behaviour :/ (Please note, this limit is lower than both of the limits we have been discussing here.)

Just three notes about this:

  1. The scenario described in that test case is slightly different, but it demonstrates a case where the database's data directory has to be erased before restarting the server and resuming normal operation. (By the way, the main*.py files appear largely similar, but the commentary differs depending on the behaviour of each test case.)

  2. I think that we are getting somewhere with that keyValueSizeCap setting. It might be that, internally, the limit is 4039 but the logic allows queries to go through even if they are slightly longer. This would explain why the error only appears when the string is near the limit, rather than when it exceeds it by far (as is the case with the longer inputs).

  3. This would still mean, however, that the server's behaviour is to reject long strings altogether if they participate in an index. I was just wondering if it would be possible to produce a WARNing about this, use the first ~4kB for indexing, but still store the whole string (even if it happens to be >>4kB). What would be the sentiment around this? :)

All the best

@fickludd (Contributor) commented Nov 13, 2018

Thank you for the additional info!

It turns out that we think this might be the same issue as was fixed by f4bbad0, which was merged into 3.4.7. Could you try upgrading to see if that solves the problem?

@burqen (Contributor) commented Nov 13, 2018

Hi @aanastasiou
I'll pick up this issue from @fickludd :)

In addition to trying an upgrade to 3.4.7 (or later), can you also verify that the byte length is measured on the string encoded as UTF-8?
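For reference, in Python len() counts characters rather than bytes, so the check needs to be done on the encoded form; a quick illustration:

```python
s = "Ω" * 3000
print(len(s))                  # 3000 characters
print(len(s.encode("utf-8")))  # 6000 bytes: Ω is 2 bytes in UTF-8
```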

@aanastasiou (Author) commented Nov 13, 2018

Hi @burqen & @fickludd, thanks for your help. Yes, everything is encoded in UTF-8, and upgrading to 3.4.9 seems to have fixed the behaviour I was reporting.

Any thoughts on point #3 from above? I.e. a way by which the index can use the first N bytes for indexing but still store the whole entry?

All the best

@burqen (Contributor) commented Nov 13, 2018

I'm glad it worked 👍

Regarding #3, I think it's a reasonable feature proposition. I'll add it to our backlog :)

@aanastasiou (Author) commented Nov 13, 2018

Thank you very much. (didn't escape that hash symbol there :S )

In the meantime, is there a way to retrieve the key size programmatically? If I were to implement trimming of the data entries, for example, I could hard-wire 4039, or I could obtain the value from some setting (?). Is this likely to change in the immediate future?

@burqen (Contributor) commented Nov 13, 2018

Oh, oops 😅

Unfortunately, there is no way to retrieve the key size programmatically right now. For now you would need to hard-wire 4039 (as long as you are using the lucene-native-2.0 index provider). For older index providers this limit is around 32k bytes, as documented here: https://neo4j.com/docs/operations-manual/3.4/performance/index-configuration/
The only way to influence the index key size limit is to change which index provider is used to back the index. See the docs link 👍

We are currently working on fixing this 4k limit. It will not be releasable for quite a while, though, and it will not be backported to older releases. So, to answer your question: no, it will not change in the immediate future, but some time after that.
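For completeness, switching the provider is a server-side configuration change, not something that can be done per index from the driver. A sketch of the relevant neo4j.conf line, assuming the 3.4 setting name from the docs page above (it only affects newly created indexes):

```
# neo4j.conf (sketch): back new indexes with the plain Lucene provider,
# which has the ~32k key size limit instead of ~4k.
dbms.index.default_schema_provider=lucene-1.0
```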

@aanastasiou (Author) commented Nov 13, 2018

This is very useful, thank you very much.

All the best

@burqen (Contributor) commented Nov 13, 2018

All the best to you too :)
