This repository was archived by the owner on Aug 16, 2022. It is now read-only.
KNN custom scoring #329
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,10 +13,10 @@ Short for its associated *k-nearest neighbors* algorithm, the KNN plugin lets yo | |
|
|
||
| ## Get started | ||
|
|
||
| To use the KNN plugin, you must create an index with the `index.knn` setting and add one or more fields of the `knn_vector` data type. Additionally, you can specify the `index.knn.space_type` with `l2` or `cosinesimil`, respectively, to use either Euclidean distance or cosine similarity for calculations. By default, `index.knn.space_type` is set to `l2`. Here is an example that creates an index with two knn_vector fields and uses cosine similarity: | ||
| To use the KNN query type, you must create an index with `index.knn: true` and add one or more fields of the `knn_vector` data type. Additionally, you can specify the `index.knn.space_type` parameter with `l2` to use Euclidean distance or `cosinesimil` to use cosine similarity for calculations. By default, `index.knn.space_type` is `l2`. Here is an example that creates an index with two `knn_vector` fields and uses cosine similarity: | ||
|
|
||
| ```json | ||
| PUT my-index | ||
| PUT my-knn-index-1 | ||
| { | ||
| "settings": { | ||
| "index": { | ||
|
|
@@ -48,31 +48,31 @@ After you create the index, add some data to it: | |
|
|
||
| ```json | ||
| POST _bulk | ||
| { "index": { "_index": "my-index", "_id": "1" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "1" } } | ||
| { "my_vector1": [1.5, 2.5], "price": 12.2 } | ||
| { "index": { "_index": "my-index", "_id": "2" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "2" } } | ||
| { "my_vector1": [2.5, 3.5], "price": 7.1 } | ||
| { "index": { "_index": "my-index", "_id": "3" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "3" } } | ||
| { "my_vector1": [3.5, 4.5], "price": 12.9 } | ||
| { "index": { "_index": "my-index", "_id": "4" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "4" } } | ||
| { "my_vector1": [5.5, 6.5], "price": 1.2 } | ||
| { "index": { "_index": "my-index", "_id": "5" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "5" } } | ||
| { "my_vector1": [4.5, 5.5], "price": 3.7 } | ||
| { "index": { "_index": "my-index", "_id": "6" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "6" } } | ||
| { "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3 } | ||
| { "index": { "_index": "my-index", "_id": "7" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "7" } } | ||
| { "my_vector2": [2.5, 3.5, 5.6, 6.7], "price": 5.5 } | ||
| { "index": { "_index": "my-index", "_id": "8" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "8" } } | ||
| { "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 } | ||
| { "index": { "_index": "my-index", "_id": "9" } } | ||
| { "index": { "_index": "my-knn-index-1", "_id": "9" } } | ||
| { "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 } | ||
|
|
||
| ``` | ||
|
|
||
| Then you can search the data using the `knn` query type: | ||
|
|
||
| ```json | ||
| GET my-index/_search | ||
| GET my-knn-index-1/_search | ||
| { | ||
| "size": 2, | ||
| "query": { | ||
|
|
@@ -88,10 +88,13 @@ GET my-index/_search | |
|
|
||
| In this case, `k` is the number of neighbors you want the query to return, but you must also include the `size` option. Otherwise, you get `k` results for each shard (and each segment) rather than `k` results for the entire query. The plugin supports a maximum `k` value of 10,000. | ||
|
Contributor
Does this mean that |
||
|
|
||
| If you mix the `knn` query with other clauses, you might receive fewer than `k` results. In this example, the `post_filter` clause reduces the number of results from 2 to 1: | ||
|
|
||
| ## Compound queries with KNN | ||
|
|
||
| If you use the `knn` query alongside filters or other clauses (e.g. `bool`, `must`, `match`), you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1: | ||
|
|
||
| ```json | ||
| GET my-index/_search | ||
| GET my-knn-index-1/_search | ||
| { | ||
| "size": 2, | ||
| "query": { | ||
|
|
@@ -112,3 +115,98 @@ GET my-index/_search | |
| } | ||
| } | ||
| ``` | ||
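The effect described above can be sketched outside the plugin. The following is a minimal brute-force illustration in Python, not the plugin's implementation: it assumes Euclidean distance, and the helper `knn_then_post_filter`, the query vector `[2.0, 3.0]`, and the price range are hypothetical values chosen for illustration. It takes the `k` nearest documents first and applies the filter afterward, which is why fewer than `k` hits can remain.

```python
import math

# Sample documents mirroring the my_vector1 entries indexed earlier.
docs = [
    {"id": "1", "vector": [1.5, 2.5], "price": 12.2},
    {"id": "2", "vector": [2.5, 3.5], "price": 7.1},
    {"id": "3", "vector": [3.5, 4.5], "price": 12.9},
    {"id": "4", "vector": [5.5, 6.5], "price": 1.2},
    {"id": "5", "vector": [4.5, 5.5], "price": 3.7},
]


def knn_then_post_filter(documents, query_vector, k, keep):
    """Hypothetical helper: pick the k nearest documents, then filter them."""
    nearest = sorted(
        documents, key=lambda d: math.dist(d["vector"], query_vector)
    )[:k]
    # The filter runs after the neighbors are chosen, so it can only shrink the set.
    return [d for d in nearest if keep(d)]


# Assumed query vector and price range (not taken from the request above).
hits = knn_then_post_filter(
    docs, [2.0, 3.0], k=2, keep=lambda d: 5 <= d["price"] <= 10
)
print([d["id"] for d in hits])  # ['2']
```

Documents 1 and 2 are the two nearest neighbors, but only document 2 falls inside the assumed price range, so a request for 2 results yields 1.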
|
|
||
|
|
||
| ## Custom scoring | ||
|
|
||
| The [previous example](#compound-queries-with-knn) shows a search that returns fewer than `k` results. To avoid this situation, use KNN's custom scoring option, which inverts the order of operations: it filters the documents first, then identifies nearest neighbors among only the documents that remain. | ||
|
|
||
|
Contributor
Does this imply that if the filter option reduces the number of results below the value of |
||
| First, add another index: | ||
|
|
||
| ```json | ||
| PUT my-knn-index-2 | ||
| { | ||
| "settings": { | ||
| "index.knn": true | ||
| }, | ||
| "mappings": { | ||
| "properties": { | ||
| "my_vector": { | ||
| "type": "knn_vector", | ||
| "dimension": 2 | ||
| }, | ||
| "color": { | ||
| "type": "keyword" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| If you *only* want to use KNN's custom scoring, you can omit `"index.knn": true`. The benefit of this approach is faster indexing speed and lower memory usage, but you lose the ability to perform standard KNN queries on the index. | ||
| {: .tip} | ||
|
|
||
| Then add some documents: | ||
|
|
||
| ```json | ||
| POST _bulk | ||
| { "index": { "_index": "my-knn-index-2", "_id": "1" } } | ||
| { "my_vector": [1, 1], "color" : "RED" } | ||
| { "index": { "_index": "my-knn-index-2", "_id": "2" } } | ||
| { "my_vector": [2, 2], "color" : "RED" } | ||
| { "index": { "_index": "my-knn-index-2", "_id": "3" } } | ||
| { "my_vector": [3, 3], "color" : "RED" } | ||
| { "index": { "_index": "my-knn-index-2", "_id": "4" } } | ||
| { "my_vector": [10, 10], "color" : "BLUE" } | ||
| { "index": { "_index": "my-knn-index-2", "_id": "5" } } | ||
| { "my_vector": [20, 20], "color" : "BLUE" } | ||
| { "index": { "_index": "my-knn-index-2", "_id": "6" } } | ||
| { "my_vector": [30, 30], "color" : "BLUE" } | ||
|
|
||
| ``` | ||
|
|
||
| Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors: | ||
|
|
||
| ```json | ||
| GET my-knn-index-2/_search | ||
| { | ||
| "size": 2, | ||
| "query": { | ||
| "script_score": { | ||
| "query": { | ||
| "bool": { | ||
| "filter": { | ||
| "term": { | ||
| "color": "BLUE" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "script": { | ||
| "lang": "knn", | ||
| "source": "knn_score", | ||
| "params": { | ||
| "field": "my_vector", | ||
| "vector": [9.9, 9.9], | ||
| "space_type": "l2" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| All parameters are required. | ||
|
|
||
| - `lang` is the script type. This value is usually `painless`, but here you must specify `knn`. | ||
| - `source` is the name of the stored script, `knn_score`. | ||
| - `field` is the field that contains your vector data. | ||
| - `vector` is the point you want to find the nearest neighbors for. | ||
| - `space_type` is either `l2` or `cosinesimil`. | ||
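To make the filter-then-score order concrete, here is a minimal brute-force sketch in Python. It is an illustration under stated assumptions, not the plugin's implementation: the helper `pre_filter_then_knn` is hypothetical, and documents are simply ranked by Euclidean (`l2`) distance without reproducing how the plugin converts distances into `_score` values.

```python
import math

# Sample documents mirroring the my-knn-index-2 bulk request.
docs = [
    {"id": "1", "my_vector": [1, 1], "color": "RED"},
    {"id": "2", "my_vector": [2, 2], "color": "RED"},
    {"id": "3", "my_vector": [3, 3], "color": "RED"},
    {"id": "4", "my_vector": [10, 10], "color": "BLUE"},
    {"id": "5", "my_vector": [20, 20], "color": "BLUE"},
    {"id": "6", "my_vector": [30, 30], "color": "BLUE"},
]


def pre_filter_then_knn(documents, query_vector, size, keep):
    """Hypothetical helper: filter first, then rank the survivors by l2 distance."""
    candidates = [d for d in documents if keep(d)]
    # Every remaining candidate is compared against the query vector,
    # so the cost grows with the number of documents that pass the filter.
    ranked = sorted(
        candidates, key=lambda d: math.dist(d["my_vector"], query_vector)
    )
    return ranked[:size]


results = pre_filter_then_knn(
    docs, [9.9, 9.9], size=2, keep=lambda d: d["color"] == "BLUE"
)
print([d["id"] for d in results])  # ['4', '5']
```

Because the filter runs first, the sketch returns two `BLUE` documents even though a `RED` document (`[3, 3]`) is closer to the query vector than `[20, 20]`.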
|
|
||
|
|
||
| ## Performance considerations | ||
|
|
||
| The standard KNN query and custom scoring option perform differently. Test using a representative set of documents to see if the search results and latencies match your expectations. | ||
|
|
||
| Custom scoring works best if the initial filter reduces the number of documents to no more than 20,000. Increasing shard count can improve latencies, but be sure to keep shard size within [the recommended guidelines](../elasticsearch/#primary-and-replica-shards). | ||