Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor parsing logic of the Query Builder (second try) #1824

Merged
merged 1 commit into from
Jul 29, 2024

Conversation

jmazanec15
Copy link
Member

@jmazanec15 jmazanec15 commented Jul 13, 2024

Description

A continuation of #1793, but decided to just focus on parsing. KNNQueryBuilder parsing has gotten pretty complex and this changes it to use ObjectParser in order to simplify it.

I moved XContent related parsing and streaming into its own class (KNNQueryParser).

In addition, I added a JMH test to test the parsing. I ran locally on my mac, and got the following results. From the results it seems there is no statistically significant difference:

## Before
Benchmark                            (dimension)  (type)   Mode  Cnt       Score       Error  Units
QueryParsingBenchmarks.fromXContent          128   basic  thrpt    9  112177.006 ± 11012.866  ops/s
QueryParsingBenchmarks.fromXContent          128  filter  thrpt    9  100488.178 ± 24078.112  ops/s
QueryParsingBenchmarks.fromXContent         1024   basic  thrpt    9   16597.576 ±   887.830  ops/s
QueryParsingBenchmarks.fromXContent         1024  filter  thrpt    9   15989.327 ±   826.738  ops/s

## After
Benchmark                            (dimension)  (type)   Mode  Cnt       Score      Error  Units
QueryParsingBenchmarks.fromXContent          128   basic  thrpt    9  111009.726 ± 7980.943  ops/s
QueryParsingBenchmarks.fromXContent          128  filter  thrpt    9  108396.222 ± 5215.041  ops/s
QueryParsingBenchmarks.fromXContent         1024   basic  thrpt    9   15589.731 ±  731.359  ops/s
QueryParsingBenchmarks.fromXContent         1024  filter  thrpt    9   15042.023 ± 1330.928  ops/s

Issues Resolved

[List any issues this PR will resolve]

Check List

  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added backport 2.x Refactoring Improve the design, structure, and implementation while preserving its functionality labels Jul 13, 2024
Copy link
Contributor

@shatejas shatejas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I am content with how this refactoring is going. A few small comments

For method parameter parser, I ended up moving stream in and stream out logic in the parser as well since it was in a way parsing the bytes stream. Wanted to know your thoughts on moving the logic here

* Helper class responsible for converting XContent to/from KNNQueryBuilders
*/
@Log4j2
public final class KNNQueryXContentHelper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

I tend to avoid generic suffixes like helper it opens up the possibility for adding things other than what the class is meant for. Consider using KNNQueryParser, this will limit the responsibility of the class to parsing the query and not leak XContent which is implementation detail related to paring in the name of the class

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm Im actually okay limiting to XContent only because the toXContent method and fromXContent explicitly require XContent related functions. I thought about KNNQueryParser, but then I thought it wouldnt make sense to include the toXContent code in here, because its not actually parsing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A parser can have a reverse parsing capability. Its generally an inverse function. something like a serialize and deserialize. Not too hung up on this, will leave it up to you

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I renamed to Parser and updated.

} else {
throwParsingExceptionOnMultipleFields(parser.getTokenLocation(), fieldName, parser.currentName());
fieldName = parser.currentName();
vector = parser.list();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to check if the fieldName.equals(VECTOR_FIELD.getName())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that makes sense to have.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, nvm. We do not do this now. I dont want to change behavior in this PR. See https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java#L409

import static org.opensearch.knn.index.KNNClusterTestUtils.mockClusterService;
import static org.opensearch.knn.index.query.KNNQueryXContentHelper.EF_SEARCH_FIELD;

public class KNNQueryXContentHelperTests extends KNNTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Lets make sure we have the below case. This should hit the final else part in the while loop of fromXcontent
{
   knn : {
     myvector : {
        vector : [1, 2, 3]
     }
  }
}
  1. Do we need tests for toXContent() ?
  2. There is a place in the code where it throws for multiple fields, is that case covered in any of the tests?

nit: Consider using a parameterized test, makes it easy to figure out what tests are covered and not covered IMO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill take a look. It was mostly just moving the tests from KNNQueryBuilder parsing to here. And Im not sure why there were no tests added for toXContent()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For (1) Interestingly, the above fails because: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java#L222. This is actually a bug with our defaults (see https://opensearch.org/docs/latest/query-dsl/specialized/neural/). @junqiu-lei Can you take a look around this? Im going to hold off adding the test for now until that is settled on.

Also, I believe the above is the last line in the while loop:

{
   knn : {
     myvector : [1, 2, 3]
  }
}

This also fails with

[knn] requires exactly one of k, distance or score to be set

Copy link
Member

@junqiu-lei junqiu-lei Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just offline discussed with @jmazanec15, the default k only support in neural search. In k-NN plugin, k was one required field before radial search introduced and there wasn't default k value. So the existing validation logic here is correct.

Ref to 2.12 KNNQueryBuilder code before radial search introduced: https://github.com/opensearch-project/k-NN/blob/2.12.0.0/src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java

@@ -0,0 +1,114 @@
/*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using this copyright

/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Measurement(iterations = 3, time = 10)
@Fork(3)
@State(Scope.Benchmark)
public class QueryParsingBenchmarks {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to have this file in knn repo?
To make this code useful, I think we should force it to run for every change in query parser and fail if there is a degradation. Otherwise, I don't think it is worth to add it here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be kept. I think its fine to have this for testing in the future if we want to make some other change around this functionality.

I think we should force it to run for every change in query parser and fail if there is a degradation.

Sure, this is something we could consider in future, but out of scope for this PR. Want to try to get this change in without having to change multiple other things.

);
}

protected static NamedXContentRegistry xContentRegistry() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected static NamedXContentRegistry xContentRegistry() {
private NamedXContentRegistry xContentRegistry() {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this makes sense. We need it for initializing NAMED_X_CONTENT_REGISTRY at the top

@@ -138,7 +122,7 @@ public static class Builder {
private String queryName;
private float boost = DEFAULT_BOOST;

private Builder() {}
Builder() {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my preference but I think using either private/public make it more readable unless there is specific necessity to make it package access level.

Suggested change
Builder() {}
public Builder() {}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will keep for now because I dont want to expose publically, but need it in the other class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually updated to public

* Helper class responsible for converting XContent to/from KNNQueryBuilders
*/
@Log4j2
public final class KNNQueryXContentHelper {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried about the pattern where KNNQueryBuilder use KNNQueryXContentHelper and KNNQueryXContentHelper use KNNQueryBuilder which could make cyclic dependency by mistake.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not worried about cyclic dependency because helper contains static stateless dependencies. Can you give an example?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like KNNBuilder calls one of the method in Helper and helper create a new KNNBuilder which call the method in Helper again. It might not happen but this cyclic dependency always leave a room for such case. It is not comfortable to work on such code base as you have to check that the cyclic calls is not happening when you are modifying the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I dont think this will happen.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It might not happen but it does look like an anti pattern. Isn't it possible to break it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KNNQueryBuilder will need to depend on KNNQueryBuilderParser just because it needs to delegate the doXContent method. KNNQueryBuilderParser will need KNNQueryBuilder in other methods in order to convert from KNNQueryBuilder to some stream. Given it is stateless static class, Im not sure its going to break.

public static final ParseField EF_SEARCH_FIELD = new ParseField(METHOD_PARAMETER_EF_SEARCH);
public static final ParseField METHOD_PARAMS_FIELD = new ParseField(METHOD_PARAMETER);

private static final ObjectParser<KNNQueryBuilder.Builder, Void> INTERNAL_PARSER = createInternalObjectParser();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid this static pattern? I feel this pattern is from legacy OpenSearch code where we don't have dependency injection concept and want to have singleton instance. However, this makes writing a unit test hard and not a best practice in OOP.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a little difficult to avoid static pattern in this case. Changing this would then lead to a much larger number of touched files and potential blast radius. I think its something that can be taken in the future.

@jmazanec15
Copy link
Member Author

Waiting till after 2.16 to finish this one up because there are a lot of PRs going on.

@jmazanec15
Copy link
Member Author

For method parameter parser, I ended up moving stream in and stream out logic in the parser as well since it was in a way parsing the bytes stream. Wanted to know your thoughts on moving the logic here

@shatejas Im not 100% sure on this. I leaned toward leaving it in the class just because its a constructor and a method. But I think its something we could move out. However, Ill leave it for another PR just to keep this scope limited.

@@ -1,6 +1,6 @@
<component name="CopyrightManager">
<copyright>
<option name="notice" value="SPDX-License-Identifier: Apache-2.0&#10;&#10;The OpenSearch Contributors require contributions made to&#10;this file be licensed under the Apache-2.0 license or a&#10;compatible open source license.&#10;&#10;Modifications Copyright OpenSearch Contributors. See&#10;GitHub history for details." />
<option name="notice" value="Copyright OpenSearch Contributors&#10;SPDX-License-Identifier: Apache-2.0&#10;" />
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmazanec15
Copy link
Member Author

@heemin32 @shatejas I rebased and addressed some of the feedback. I think a lot of the points are good - but I do want to keep scope limited just so that this PR doesnt die on the vine.

Also, wanted to get opinion on one thing. From #1860 I see that this change will break neural because of the introduction of the Helper class. That being said, I think I should move the logic back into the KNNQueryBuilder class. I was luke-warm on moving out in first place and figure that many plugins out there may be using this as an extension. WDYT?

@shatejas
Copy link
Contributor

For method parameter parser, I ended up moving stream in and stream out logic in the parser as well since it was in a way parsing the bytes stream. Wanted to know your thoughts on moving the logic here

@shatejas Im not 100% sure on this. I leaned toward leaving it in the class just because its a constructor and a method. But I think its something we could move out. However, Ill leave it for another PR just to keep this scope limited.

Would like to know regarding the doubts. Having stream in and stream out will help reusing it in neural search and decoupling the changes required on the plugin related to knn.

With respect to scope, we are trying to address a tech debt here, would encourage to not push it to a later especially as it doesn't seem to be a heavy lift. Let me know what you think

@jmazanec15
Copy link
Member Author

Would like to know regarding the doubts. Having stream in and stream out will help reusing it in neural search and decoupling the changes required on the plugin related to knn.

I see. So in neural, we cannot directly re-use KNNQueryBuilder IO because there are different required parameters. But, I will follow convention of MethodParameterParser for KNNQueryBuilder and move streaming to the class. Also, why did you define NPROBE in KNNQueryBuilder then? https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java#L78-L79. I think Ill do the same not to break neural (https://github.com/search?q=repo%3Aopensearch-project%2Fneural-search%20KNNQueryBuilder&type=code)

With respect to scope, we are trying to address a tech debt here, would encourage to not push it to a later especially as it doesn't seem to be a heavy lift. Let me know what you think

Want to just be clear - I want to get this PR merged, so I am focusing on limited scope, incremental change. The problem is that as scope gets bigger and PR gets bigger, it ends up becoming more difficult to keep up with and eventually gets abandoned due to other priorities. That being said, Im trying to be a bit defensive in adding on refactors to make sure that that doesnt happen to this PR. For example, not going to change to parametrized tests in this PR.

@jmazanec15 jmazanec15 force-pushed the query-builder-ref branch 2 times, most recently from 0db4ae5 to 6ba54e9 Compare July 29, 2024 19:34
Refactors parsing of KNNQueryBuilder. First, it moves parsing logic to a
separate class. Next, it uses ObjectParser instead of parsing manually
by hand. To test, it contains a simple jmh benchmark for testing.

Signed-off-by: John Mazanec <jmazane@amazon.com>
@heemin32
Copy link
Collaborator

Also, wanted to get opinion on one thing. From #1860 I see that this change will break neural because of the introduction of the Helper class. That being said, I think I should move the logic back into the KNNQueryBuilder class. I was luke-warm on moving out in first place and figure that many plugins out there may be using this as an extension. WDYT?

Not sure for this. KNN is not built to be used by other plugins right? We don't know which class/method is being used by third party and which is not. Shouldn't we ask neural to move again from KNN dependency? Or do we have a set of classes that need to be backward compatible?

@jmazanec15
Copy link
Member Author

Not sure for this. KNN is not built to be used by other plugins right? We don't know which class/method is being used by third party and which is not. Shouldn't we ask neural to move again from KNN dependency? Or do we have a set of classes that need to be backward compatible?

In general, I think keeping KNNQueryBuilder bwc is probably a good thing to do - just because its directly required for neural. Hence, I moved the FIELD declarations back into that class as @shatejas had done for MethodParameters

Also, I just updated to handle streams from the KNNQueryBuilderParser class as well to be consistent with MethodParametersParser.

@shatejas
Copy link
Contributor

So in neural, we cannot directly re-use KNNQueryBuilder IO because there are different required parameters

I see, might be possible if we club all knn related fields together, but again we will have to check if that can be done in backward compatible way

Also, why did you define NPROBE in KNNQueryBuilder then?

All fields were defined there, didn't think much back then

@jmazanec15 jmazanec15 merged commit 52636c4 into opensearch-project:main Jul 29, 2024
52 of 53 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1824-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 52636c47c03ca16da3b2c46860d782b46f192118
# Push it to GitHub
git push --set-upstream origin backport/backport-1824-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1824-to-2.x.

jmazanec15 added a commit to jmazanec15/k-NN-1 that referenced this pull request Jul 29, 2024
Refactors parsing of KNNQueryBuilder. First, it moves parsing logic to a
separate class. Next, it uses ObjectParser instead of parsing manually
by hand. Lastly, it also moves out the streaming. To test, it contains a
simple jmh benchmark for testing as well as new unit tests.

Signed-off-by: John Mazanec <jmazane@amazon.com>
(cherry picked from commit 52636c4)
jmazanec15 added a commit to jmazanec15/k-NN-1 that referenced this pull request Jul 29, 2024
Refactors parsing of KNNQueryBuilder. First, it moves parsing logic to a
separate class. Next, it uses ObjectParser instead of parsing manually
by hand. Lastly, it also moves out the streaming. To test, it contains a
simple jmh benchmark for testing as well as new unit tests.

Signed-off-by: John Mazanec <jmazane@amazon.com>
(cherry picked from commit 52636c4)
jmazanec15 added a commit to jmazanec15/k-NN-1 that referenced this pull request Jul 29, 2024
Refactors parsing of KNNQueryBuilder. First, it moves parsing logic to a
separate class. Next, it uses ObjectParser instead of parsing manually
by hand. Lastly, it also moves out the streaming. To test, it contains a
simple jmh benchmark for testing as well as new unit tests.

Signed-off-by: John Mazanec <jmazane@amazon.com>
(cherry picked from commit 52636c4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Refactoring Improve the design, structure, and implementation while preserving its functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants