Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.6] Allow datatype Long when indexing #5911

Merged
merged 5 commits into from
Feb 26, 2024

Conversation

BartChris
Copy link
Collaborator

@BartChris BartChris commented Feb 5, 2024

Fixes #5906

I would have liked to add a test for that but it seems like we are not testing the creation of the metadata fields in the index at the moment (only general process fields:https://github.com/kitodo/kitodo-production/blob/master/Kitodo-DataManagement/src/test/java/org/kitodo/data/elasticsearch/index/type/ProcessTypeTest.java) , so we would first need a way to add a general test here. There are also plans to simplify the metadata structure which would also provide the opportunity to add tests here (#4266)

Edit: I added a test in the ProcessService integration test.

@BartChris BartChris marked this pull request as ready for review February 9, 2024 11:23
Copy link
Member

@solth solth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BartChris thanks for this fix. Code looks good but I encountered a strange behaviour when testing it.

When manually entering one of the values you mentioned in the linked issue (9910211111112122) into a field in the metadata editor and saving the process, it works and the index contains the correct value.

Entering the other value you from the issue (991022551489706476), though, or the one from your test (999999999999999991), these are not saved correctly to the index. The first is changed from
991022551489706476 ➡️ 991022551489706500
and the second from
999999999999999991 ➡️ 1000000000000000000
in the index.

It's especially strange since all these numbers are still way below the limit of the Java long type (9223372036854775807, according to https://www.w3schools.com/java/java_data_types.asp) and because your test does indeed seem to work. When I try it in the running system, though, after saving the values are correct in the metadata files, but not in the index.

Maybe the ES long type does not cover the exact same domain as the Java type? Or it somehow depends on the minor ES version? (I am using 7.10 on my development system, but the elasticsearch.version in the pom.xml is 7.17)

Have you encountered anything similar when developing your fix?

@BartChris
Copy link
Collaborator Author

BartChris commented Feb 13, 2024

@solth Strange, i will check. Would it be an option to cast all Longs to a String before indexing? I am wondering if there is a scenario where we actually need the number value in the index.

What might also play a role is the fact that the field type of this field is dynamically determined the first time Elasticsearch encounters it. The following documents would probably use the same inferred index mapping. We should therefor only use datatypes like Long in the index when we really need them.

Edit: I can confirm your findings and it seems like this happens for numbers with more than 16 characters, but i have no idea why you can still find search for some of those instances.
It seems like for process properties everything just gets indexed as a string. I would recoomend the same for the metadata at least for Long (Maybe we want to support publication date queries (before, after) in the future where number support is useful) if there are no arguments against that. To support even bigger numbers we can use BigInteger here as well.

 if (value instanceof String || value instanceof Integer) {
                json.put(prepareKey(key), value);
            }  else if (value instanceof Long || value instanceof BigInteger){
                json.put(prepareKey(key), value.toString());
            } else if (value instanceof JSONObject) {

@BartChris BartChris force-pushed the fix_index_long_numbers branch 2 times, most recently from b3b6a5b to 46f45ab Compare February 14, 2024 13:57
@BartChris
Copy link
Collaborator Author

BartChris commented Feb 14, 2024

@solth I implemented exactly that: Long and BigInteger numbers are converted to a String.

While doing that i also encountered that Kitodo breaks when searching for very long numbers using the search field:

grafik

grafik

This happens because Elasticsearch tries to cast the search value for searching in the ID field. ("Value is out of range for a long'). I fixed two places in the code to prevent that.

@solth
Copy link
Member

solth commented Feb 21, 2024

@BartChris thanks a lot for looking into this and for the fix. I agree with you that casting numerical metadata values to String is a good solution, since no numerical operations are done on the values in the index anyway (e.g. queries that retrieve documents from ElasticSearch with numerical metadata values in a given range, for example).

Would you mind opening a corresponding pull request with the same fix against the master branch?

@BartChris
Copy link
Collaborator Author

BartChris commented Feb 21, 2024

@solth I targeted master as well and changed the test to adapt to the changes in the master branch: #5936

@solth solth added this to the Kitodo.Production 3.6.2 milestone Feb 26, 2024
@solth solth merged commit 04f4c9a into kitodo:3.6.x Feb 26, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants