HSEARCH-2254 Support (single-valued) sorts on fields within nested fields #2068
Still in preview: I have handled only the field sort so far. I would prefer to have a review of what I've done before continuing, since the issue is not trivial at all. To be honest, it was very hard for me. Looking at Elasticsearch's code, it seems that they load the nested document using … This issue is more difficult than the one on projections: here we cannot simply merge all contributions. We need to switch the document ID (parent to nested) for each nested field (so the switching and the conversion map are scoped to a single field), taking care not to alter what we already do on flattened or root object fields: see …
fd80300 to 0759330
yrodiere
left a comment
Well, that must have required some deep thinking. The solution looks very good overall, so thanks for all this nice work!
As you suspected we will need several improvements, mainly to improve performance. I added a few comments below on areas that can be improved.
There is the matter of rebasing on my "aggregations" branch, however. There will be a few conflicts due to the refactorings I had to do to introduce aggregations. I see you split your commits very finely, which is a very good idea and should facilitate rebasing.
My suggestion would be to proceed like this:
1. Squash the commits related to the Lucene implementation of sorts, or at least squash the commits that are erased by a later commit. I think there are several of these, and it would be wasting your time to rebase them.
2. Rebase on my branch.
3. And only then, try to address my comments.
If necessary, I can take care of step 2; it will probably be easier for me since you haven't reviewed my PR yet.
...te/search/backend/elasticsearch/search/sort/impl/AbstractElasticsearchSearchSortBuilder.java
...te/search/backend/elasticsearch/search/sort/impl/AbstractElasticsearchSearchSortBuilder.java
...end/lucene/src/main/java/org/hibernate/search/backend/lucene/work/impl/LuceneSearchWork.java
    handleRescoring( indexSearcher, luceneQuery );
}
if ( nestedPathsInSort != null && !nestedPathsInSort.isEmpty() ) {
    extractTopDocsUsingTheirNested( indexSearcher, luceneQuery, offset, limit );
You should probably move this above requireFieldDocRescoring, otherwise scores might be wrong in some cases.
Also, I think extractTopDocs doesn't need to be called when you call this method? Something like this would work:
if ( nestedPathsInSort != null && !nestedPathsInSort.isEmpty() ) {
    extractTopDocsUsingTheirNested( indexSearcher, luceneQuery, offset, limit );
}
else {
    extractTopDocs( offset, limit );
}
if ( requireFieldDocRescoring ) {
    handleRescoring( indexSearcher, luceneQuery );
}
I moved extractTopDocsUsingTheirNested above.
But without other changes we need to call extractTopDocs before we call extractTopDocsUsingTheirNested, since we need the ScoreDoc[] scoreDocs to fetch the nestedDocumentMap.
Ok... now that I think about it, I think there's a flaw in this approach.
What about the following scenario:
- The query matches documents A, B, C
- The limit is 2
- The first search returns top documents [A, B]
- The second search, taking into account nested documents, would have returned [C, B] (A excluded because it's third), but since C was not in the top docs of the first search, it ends up returning [B, A].
I suspect this is a very real possibility...
The only option to avoid this problem would be to reverse the logic:
- Currently you perform the search() call once, and then if you discover that there are nested sorts, you try to "fix" the first results by running a second search().
- The alternative would be to detect before any search() that there are nested sorts. In that case you would run a preliminary search() to collect nested documents (and only nested documents) without a limit. Then you would run the actual search() with a limit.
I'm aware this would require an awful lot of memory for big indexes, but at least it would work (?). We can create a ticket to try and optimize this, maybe.
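To make the scenario above concrete, here is a minimal toy model in plain Java (no Lucene involved). The class name, the method names and the nested sort keys are all hypothetical; the only things taken from the discussion are the three matching documents A, B, C and the limit of 2. It contrasts "apply the limit, then fix the order" with "know the nested keys first, then sort and apply the limit".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy model only: document names and nested sort keys are made up for illustration.
class NestedSortLimitDemo {

    // Flawed order of operations: keep the first `limit` matches (as ranked by the
    // first search), then re-sort only those survivors by their nested key.
    static List<String> limitThenSort(List<String> matches, Map<String, Integer> nestedKey, int limit) {
        List<String> top = new ArrayList<>(matches.subList(0, Math.min(limit, matches.size())));
        top.sort(Comparator.comparingInt((String doc) -> nestedKey.get(doc)));
        return top;
    }

    // Alternative: nested keys are available before the limit is applied,
    // so every match competes for a place in the top `limit`.
    static List<String> sortThenLimit(List<String> matches, Map<String, Integer> nestedKey, int limit) {
        List<String> all = new ArrayList<>(matches);
        all.sort(Comparator.comparingInt((String doc) -> nestedKey.get(doc)));
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }

    public static void main(String[] args) {
        List<String> matches = List.of("A", "B", "C");               // order of the first search
        Map<String, Integer> nestedKey = Map.of("A", 3, "B", 2, "C", 1); // assumed nested sort keys
        System.out.println(limitThenSort(matches, nestedKey, 2));    // C never gets a chance
        System.out.println(sortThenLimit(matches, nestedKey, 2));    // C is ranked first
    }
}
```

The second method is only a model of the "preliminary search() without a limit" idea: it needs the nested key of every match up front, which is exactly where the memory concern comes from.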
Actually, I've had a look at the Elasticsearch code, and it seems they retrieve nested documents on the fly when it comes to sorts. See my other comment. I think if you manage to implement a solution that does the same, the flaw I just mentioned will just disappear. Could you have a look, please?
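As a rough illustration of what "on the fly" could mean, the sketch below resolves the sort value of a root document from its nested documents at comparison time, with no preliminary search. It is plain Java, not Lucene or Elasticsearch code, and every name in it is hypothetical. It assumes the block-join layout, where nested (child) documents are indexed immediately before their root (parent) document, and it arbitrarily picks the minimum child value as the sort key.

```java
import java.util.BitSet;

// Sketch under assumptions: doc IDs follow a block-join layout (children first, then
// their parent), and the sort key of a parent is the minimum value among its children.
class OnTheFlyNestedValueDemo {

    static long minChildValue(BitSet parents, BitSet childrenWithValue, long[] values,
            int parentDoc, long missingValue) {
        // The children of parentDoc sit between the previous parent (exclusive)
        // and parentDoc (exclusive). previousSetBit(-1) returns -1, so doc 0 works too.
        int firstChild = parents.previousSetBit(parentDoc - 1) + 1;
        long min = missingValue;
        boolean found = false;
        for (int child = childrenWithValue.nextSetBit(firstChild);
                child >= 0 && child < parentDoc;
                child = childrenWithValue.nextSetBit(child + 1)) {
            if (!found || values[child] < min) {
                min = values[child];
                found = true;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // Docs 0 and 1 are children of parent 2; doc 3 is a child of parent 4; parent 5 has none.
        BitSet parents = new BitSet();
        parents.set(2);
        parents.set(4);
        parents.set(5);
        BitSet childrenWithValue = new BitSet();
        childrenWithValue.set(0);
        childrenWithValue.set(1);
        childrenWithValue.set(3);
        long[] values = { 7, 3, 0, 9, 0, 0 };
        for (int parent : new int[] { 2, 4, 5 }) {
            System.out.println(parent + " -> "
                    + minChildValue(parents, childrenWithValue, values, parent, Long.MAX_VALUE));
        }
    }
}
```

Because the value is computed while each root document is being compared, every matching root document competes on its nested value before the limit is applied.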
Sure
...c/main/java/org/hibernate/search/backend/lucene/search/extraction/impl/LuceneCollectors.java
...src/main/java/org/hibernate/search/integrationtest/backend/tck/search/sort/SearchSortIT.java
...ain/java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneTextFieldSortBuilder.java
...ain/java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneTextFieldSortBuilder.java
...va/org/hibernate/search/backend/lucene/types/sort/nested/impl/LuceneNestedDocumentsSort.java
.../org/hibernate/search/backend/lucene/types/sort/nested/impl/LuceneNestedFieldComparator.java
0759330 to 6bfc242
Thanks @yrodiere. Rebased the changes.
08e1a03 to 2e68834
Thanks @yrodiere. I think I've addressed all the comments except the one about optimizing nested queries when we have both nested sorts and nested projections. I'm not sure we can optimize anything, since projections and sorts can work on different nested document paths. Let's talk on Zulip.
This pull request introduces 1 alert when merging 2e68834 into 1e3ae6a - view on LGTM.com new alerts:
2e68834 to f7bf7e8
.../java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneNumericFieldSortBuilder.java
...va/org/hibernate/search/backend/lucene/types/sort/nested/impl/LuceneNestedDocumentsSort.java
Thanks, most of my concerns have been addressed.
I answered your comments concerning the remaining problems. Sure, let's talk on Zulip.
f7bf7e8 to f7388b8
@yrodiere thanks!
f7388b8 to d6353a7
yrodiere
left a comment
Here are some suggestions.
WARNING: I force-pushed to your branch to change a test! See my comments below, but don't forget to pull before you start working again.
...ate/search/backend/lucene/types/sort/nested/onthefly/doubleval/impl/NumericDoubleValues.java
.../java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneNumericFieldSortBuilder.java
...ate/search/backend/lucene/types/sort/nested/onthefly/doubleval/impl/NumericDoubleValues.java
...earch/backend/lucene/types/sort/nested/onthefly/impl/NestedNumericFieldComparatorSource.java
...earch/backend/lucene/types/sort/nested/onthefly/impl/NestedNumericFieldComparatorSource.java
...java/org/hibernate/search/integrationtest/backend/tck/search/sort/CompositeSearchSortIT.java
...arch/backend/lucene/types/sort/nested/onthefly/doubleval/impl/SortedNumericDoubleValues.java
22d540f to d33a0fc
@yrodiere thanks.
yrodiere
left a comment
Here are a few (final?) comments. I think we can merge this PR once they are resolved; I created HSEARCH-3694 to address distance sorts later (I'm sure it will be just as complex...).
...bernate/search/backend/lucene/types/sort/nested/impl/NestedNumericFieldComparatorSource.java
.../java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneNumericFieldSortBuilder.java
.../java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneNumericFieldSortBuilder.java
.../java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneNumericFieldSortBuilder.java
...ain/java/org/hibernate/search/backend/lucene/types/sort/impl/LuceneTextFieldSortBuilder.java
d33a0fc to 9d0b7b8
Addressed all comments, with the exception of this one: #2068 (comment). We could solve it, but we definitely need the final Lucene query at …
9d0b7b8 to 9ced168
6dc64da to 5048278
Running a full build: job/PR-2068/19
I need to add a commit to support the old nested sorting syntax of Elasticsearch 5.
Here we have a new build for all supported ES dialects: job/PR-2068/22
I added a test on a nested field with a depth greater than 1. Full build: job/PR-2068/24
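For readers unfamiliar with the two Elasticsearch syntaxes mentioned above, here is a small sketch of how a sort clause on a field nested two levels deep could be rendered in each. It builds the JSON by hand in plain Java; the class, the method names and the field and path names are hypothetical and are not taken from this PR. It assumes that the older syntax takes a single `nested_path` (here, the deepest path) while the newer one takes a `nested` object per level, listed from root to leaf.

```java
import java.util.List;

// Sketch only: hand-built JSON for illustration, with made-up field and path names.
class NestedSortJsonDemo {

    // Newer syntax (assumption): one "nested" object per level, from root to leaf.
    static String nestedObject(List<String> pathsRootToLeaf, int level) {
        String inner = level + 1 < pathsRootToLeaf.size()
                ? ",\"nested\":" + nestedObject(pathsRootToLeaf, level + 1)
                : "";
        return "{\"path\":\"" + pathsRootToLeaf.get(level) + "\"" + inner + "}";
    }

    static String newerSyntax(String field, List<String> pathsRootToLeaf) {
        return "{\"" + field + "\":{\"order\":\"asc\",\"nested\":"
                + nestedObject(pathsRootToLeaf, 0) + "}}";
    }

    // Older syntax (assumption): a single "nested_path", here the deepest one.
    static String olderSyntax(String field, List<String> pathsRootToLeaf) {
        String deepest = pathsRootToLeaf.get(pathsRootToLeaf.size() - 1);
        return "{\"" + field + "\":{\"order\":\"asc\",\"nested_path\":\"" + deepest + "\"}}";
    }

    public static void main(String[] args) {
        List<String> paths = List.of("a", "a.b");
        System.out.println(newerSyntax("a.b.price", paths));
        System.out.println(olderSyntax("a.b.price", paths));
    }
}
```

A real backend would build these clauses with a JSON library rather than string concatenation; the point here is only the shape of the two clauses.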
Giving a purpose to the already present `flattenedObject` mapped object
Note that the hierarchy is expressed in the same order ES expects: from root to leaves.
Since, given a single nested document path, we can have at most one nested document for each root document.
Restoring then_flattened_nested_limit2 test method
Always here means: not only to sort nested document fields
d19b8af to 9fb5e09
Merged! Thanks for all the work... and for putting up with me ;)
https://hibernate.atlassian.net/browse/HSEARCH-2254