Skip to content

Conversation

@fax4ever
Copy link
Contributor

@fax4ever
Copy link
Contributor Author

Still in preview: I handled only Field sort.
I think I have to handle at least the Distance one too.

I would prefer have a review on what I've done so far, before continue. Since the issue is not trivial at all.

To be honest, it was very hard for me
and I'm pretty sure we could improve several choices I made here.
For instance we will avoid to run a top doc collector again just to load the nested contributions.

Looking at Elasticsearch's code it seems that they load the nested document using
not ToParentBlockJoinQuery nor ToChildBlockJoinQuery but ParentChildrenBlockJoinQuery.
With the latter we need to pass the parent id. So we would have needed to run one for each parent document.
For that reason here I opted for the ToChildBlockJoinQuery`, that's same as we used for supporting projection on nested fields.

This issue is more difficult than the one on projection. Here we cannot simply merge all contributions. We need to switch document id ( parent to nested ) for each nested field ( so the switching and the convertion map are scoped on the single field ), paying attention do not alter what we already do on flattened or root object fields: see CompositeSortIT.

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch from fd80300 to 0759330 Compare August 30, 2019 05:03
@yrodiere yrodiere self-assigned this Aug 30, 2019
Copy link
Member

@yrodiere yrodiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that must have required some deep thinking. The solution looks very good overall, so thanks for all this nice work!

As you suspected we will need several improvements, mainly to improve performance. I added a few comments below on areas that can be improved.

There is the matter of rebasing on my "aggregations" branch, however. There will be a few conflicts due to the refactorings I had to do to introduce aggregations. I see you split your commits very finely, which is a very good idea and should facilitate rebasing.

My suggestion would be to proceed like this:

  1. Squash the commits related to the Lucene implementation of sorts, or at least squash the commits that are erased by a later commit. I think there are several of these, and it would be wasting your time to rebase them.
  2. Rebase on my branch
  3. And only then, try to address my comments.

If necessary, I can take care of step 2; it will probably be easier for me since you haven't reviewed my PR yet.

handleRescoring( indexSearcher, luceneQuery );
}
if ( nestedPathsInSort != null && !nestedPathsInSort.isEmpty() ) {
extractTopDocsUsingTheirNested( indexSearcher, luceneQuery, offset, limit );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably move this above requireFieldDocRescoring, otherwise scores might be wrong in some cases.

Also, I think extractTopDocs doesn't need to be called when you call this method? Something like this would work:

if ( nestedPathsInSort != null && !nestedPathsInSort.isEmpty() ) {
	extractTopDocsUsingTheirNested( indexSearcher, luceneQuery, offset, limit ); 
}
else {
	extractTopDocs( offset, limit );
}

if ( requireFieldDocRescoring ) {
	handleRescoring( indexSearcher, luceneQuery );
}

Copy link
Contributor Author

@fax4ever fax4ever Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved extractTopDocsUsingTheirNested above.
But without other changes we need to call extractTopDocs before we call extractTopDocsUsingTheirNested, since we need the ScoreDoc[] scoreDocs to fetch the nestedDocumentMap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok... now that I think about it, I think there's a flaw in this approach.
What about the following scenario:

  • The query matches documents A, B, C
  • The limit is 2
  • The first search returns top documents [A, B]
  • The second search, taking into account nested documents, would have returned [C, B] (A excluded because it's third), but since C was not in the top docs of the first search, it ends up returning [B, A].

I suspect this is a very real possibility...

The only option to avoid this problem would be to revert the logic:

  • Currently you perform the search() call once, and then if you discover that there are nested sorts, you try to "fix" the first results by running a second search().
  • The alternative would be to detect before any search() that there are nested sorts. In that case you would run a preliminary search() to collect nested documents (and only nested documents) without a limit. Then you would run the actual search() with a limit.

I'm aware this would be require an awful lot of memory for big indexes, but at least it would work (?). We can create a ticket to try and optimize this, maybe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I've had a look at the Elasticsearch code, and it seems they retrieve nested documents on the fly when it comes to sorts. See my other comment. I think if you manage to implement a solution that does the same, the flaw I just mentioned will just disappear. Could you have a look, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch from 0759330 to 6bfc242 Compare September 4, 2019 07:22
@fax4ever
Copy link
Contributor Author

fax4ever commented Sep 4, 2019

Thanks @yrodiere. Rebased the changes.
Now I can address your comments.

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch 3 times, most recently from 08e1a03 to 2e68834 Compare September 4, 2019 09:52
@fax4ever
Copy link
Contributor Author

fax4ever commented Sep 4, 2019

Thanks @yrodiere. I think I've address all the comments except the one about to optimize nested queries when we have both nested sorts and nested projections.

I'm not sure we can optimize something, since projections and sorts can work on different nested document paths. Let's talk on Zulip.

@lgtm-com
Copy link

lgtm-com bot commented Sep 4, 2019

This pull request introduces 1 alert when merging 2e68834 into 1e3ae6a - view on LGTM.com

new alerts:

  • 1 for Useless null check

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch from 2e68834 to f7bf7e8 Compare September 4, 2019 10:33
Copy link
Member

@yrodiere yrodiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, most of my concerns have been addressed.

I answered your comments concerning the remaining problems. Sure, let's talk on Zulip.

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch from f7bf7e8 to f7388b8 Compare September 5, 2019 20:14
@fax4ever
Copy link
Contributor Author

fax4ever commented Sep 5, 2019

@yrodiere thanks!
I think that now we have an on-the-fly nested field sorting.
Have I missed some comments?

@yrodiere yrodiere force-pushed the 2254-sort-on-nested-fields branch from f7388b8 to d6353a7 Compare September 6, 2019 08:14
Copy link
Member

@yrodiere yrodiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some suggestions.

WARNING: I force-pushed to your branch to change a test! See my comments below, but don't forget to pull before you start working again.

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch 2 times, most recently from 22d540f to d33a0fc Compare September 9, 2019 12:06
@fax4ever
Copy link
Contributor Author

fax4ever commented Sep 9, 2019

@yrodiere thanks.
I think I fulfilled the comments here.
If there were other changes to do, I would happy to handle them.

Copy link
Member

@yrodiere yrodiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few (final?) comments. I think we can merge this PR once they are resolved; I created HSEARCH-3694 to address distance sorts later (I'm sure it will be just as complex...).

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch from d33a0fc to 9d0b7b8 Compare September 9, 2019 18:04
@fax4ever
Copy link
Contributor Author

fax4ever commented Sep 9, 2019

Addressed all comments, with the exception of this one: #2068 (comment).

We could solve it, but we definitely need the final lucene query at LuceneCollectors build time (not just at invocation time). Do you think that is possible?
Or should we create TopFieldCollector at LuceneCollectors invocation time?
The latter sounds a bit strange...
Thanks!

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch from 9d0b7b8 to 9ced168 Compare September 10, 2019 06:28
@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch 2 times, most recently from 6dc64da to 5048278 Compare September 11, 2019 07:17
@fax4ever
Copy link
Contributor Author

Running a full build: job/PR-2068/19

@fax4ever
Copy link
Contributor Author

I need to add a commit to support the old nested sorting syntax of Elasticsearch 5 .

@fax4ever
Copy link
Contributor Author

Here we have a new build for all supported ES dialects: job/PR-2068/22

@fax4ever
Copy link
Contributor Author

I added a test on a deep > 1 nested field. Full build: job/PR-2068/24

@fax4ever fax4ever force-pushed the 2254-sort-on-nested-fields branch from d19b8af to 9fb5e09 Compare September 11, 2019 13:17
@fax4ever
Copy link
Contributor Author

fax4ever commented Sep 11, 2019

@yrodiere as usual you were right. Taking the flag from the ElasticsearchJsonSyntaxHelper is definitely much easier! You can find the yet another full build here.

@yrodiere yrodiere merged commit 437d579 into hibernate:master Sep 11, 2019
@yrodiere
Copy link
Member

Merged! Thanks for all the work... and for putting up with me ;)

@fax4ever fax4ever deleted the 2254-sort-on-nested-fields branch September 11, 2019 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants