This repository was archived by the owner on Aug 16, 2022. It is now read-only.
Merged
126 changes: 112 additions & 14 deletions docs/knn/index.md
Expand Up @@ -13,10 +13,10 @@ Short for its associated *k-nearest neighbors* algorithm, the KNN plugin lets yo

## Get started

To use the KNN plugin, you must create an index with the `index.knn` setting and add one or more fields of the `knn_vector` data type. Additionally, you can specify the `index.knn.space_type` with `l2` or `cosinesimil`, respectively, to use either Euclidean distance or cosine similarity for calculations. By default, `index.knn.space_type` is set to `l2`. Here is an example that creates an index with two knn_vector fields and uses cosine similarity:
To use the KNN query type, you must create an index with `index.knn: true` and add one or more fields of the `knn_vector` data type. Additionally, you can specify the `index.knn.space_type` parameter with `l2` to use Euclidean distance or `cosinesimil` to use cosine similarity for calculations. By default, `index.knn.space_type` is `l2`. Here is an example that creates an index with two `knn_vector` fields and uses cosine similarity:

```json
PUT my-index
PUT my-knn-index-1
{
"settings": {
"index": {
Expand Down Expand Up @@ -48,31 +48,31 @@ After you create the index, add some data to it:

```json
POST _bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{ "my_vector1": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{ "my_vector1": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "my-index", "_id": "3" } }
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{ "my_vector1": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "my-index", "_id": "4" } }
{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
{ "my_vector1": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "my-index", "_id": "5" } }
{ "index": { "_index": "my-knn-index-1", "_id": "5" } }
{ "my_vector1": [4.5, 5.5], "price": 3.7 }
{ "index": { "_index": "my-index", "_id": "6" } }
{ "index": { "_index": "my-knn-index-1", "_id": "6" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-index", "_id": "7" } }
{ "index": { "_index": "my-knn-index-1", "_id": "7" } }
{ "my_vector2": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-index", "_id": "8" } }
{ "index": { "_index": "my-knn-index-1", "_id": "8" } }
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-index", "_id": "9" } }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

```

Then you can search the data using the `knn` query type:

```json
GET my-index/_search
GET my-knn-index-1/_search
{
"size": 2,
"query": {
Expand All @@ -88,10 +88,13 @@ GET my-index/_search

In this case, `k` is the number of neighbors you want the query to return, but you must also include the `size` option. Otherwise, you get `k` results for each shard (and each segment) rather than `k` results for the entire query. The plugin supports a maximum `k` value of 10,000.
Review comment (Contributor): Does this mean that `size` and `k` should always have the same value?


If you mix the `knn` query with other clauses, you might receive fewer than `k` results. In this example, the `post_filter` clause reduces the number of results from 2 to 1:

## Compound queries with KNN

If you use the `knn` query alongside filters or other clauses (e.g. `bool`, `must`, `match`), you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:

```json
GET my-index/_search
GET my-knn-index-1/_search
{
"size": 2,
"query": {
Expand All @@ -112,3 +115,98 @@ GET my-index/_search
}
}
```
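
The drop from `k` results to fewer can be reproduced outside the cluster. The following is a minimal Python sketch, not the plugin's implementation: it ranks the five `my_vector1` documents from the bulk request by cosine similarity, keeps the top `k`, and only then applies a price filter. The query vector `[2.0, 3.0]`, the price range of 5 to 10, and the function names are assumptions chosen for illustration.

```python
import math

# Documents copied from the bulk request: id -> (my_vector1, price).
DOCS = {
    "1": ([1.5, 2.5], 12.2),
    "2": ([2.5, 3.5], 7.1),
    "3": ([3.5, 4.5], 12.9),
    "4": ([5.5, 6.5], 1.2),
    "5": ([4.5, 5.5], 3.7),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_then_post_filter(query, k, min_price, max_price):
    """Pick the k most similar documents first, then filter them by price."""
    ranked = sorted(
        DOCS,
        key=lambda doc_id: cosine_similarity(query, DOCS[doc_id][0]),
        reverse=True,
    )
    nearest = ranked[:k]
    # The filter runs after neighbor selection, so it can only shrink the list.
    return [d for d in nearest if min_price <= DOCS[d][1] <= max_price]


# Hypothetical query vector and price range, chosen for illustration.
print(knn_then_post_filter([2.0, 3.0], k=2, min_price=5, max_price=10))
```

With these assumed inputs, the two nearest documents are `2` and `1`, and only `2` passes the price filter, so the call returns one result instead of two.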


## Custom scoring

The [previous example](#compound-queries-with-knn) shows a search that returns fewer than `k` results. If you want to avoid this situation, KNN's custom scoring option lets you invert the order of operations: filter the documents first, then find the nearest neighbors among the ones that remain.

Review comment (Contributor): Does this imply that if the filter option reduces the number of results below the value of k, then the filter doesn't take effect?

First, add another index:

```json
PUT my-knn-index-2
{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"my_vector": {
"type": "knn_vector",
"dimension": 2
},
"color": {
"type": "keyword"
}
}
}
}
```

If you *only* want to use KNN's custom scoring, you can omit `"index.knn": true`. The benefit of this approach is faster indexing speed and lower memory usage, but you lose the ability to perform standard KNN queries on the index.
{: .tip}

Then add some documents:

```json
POST _bulk
{ "index": { "_index": "my-knn-index-2", "_id": "1" } }
{ "my_vector": [1, 1], "color" : "RED" }
{ "index": { "_index": "my-knn-index-2", "_id": "2" } }
{ "my_vector": [2, 2], "color" : "RED" }
{ "index": { "_index": "my-knn-index-2", "_id": "3" } }
{ "my_vector": [3, 3], "color" : "RED" }
{ "index": { "_index": "my-knn-index-2", "_id": "4" } }
{ "my_vector": [10, 10], "color" : "BLUE" }
{ "index": { "_index": "my-knn-index-2", "_id": "5" } }
{ "my_vector": [20, 20], "color" : "BLUE" }
{ "index": { "_index": "my-knn-index-2", "_id": "6" } }
{ "my_vector": [30, 30], "color" : "BLUE" }

```

Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors:

```json
GET my-knn-index-2/_search
{
"size": 2,
"query": {
"script_score": {
"query": {
"bool": {
"filter": {
"term": {
"color": "BLUE"
}
}
}
},
"script": {
"lang": "knn",
"source": "knn_score",
"params": {
"field": "my_vector",
"vector": [9.9, 9.9],
"space_type": "l2"
}
}
}
}
}
```

All parameters are required.

- `lang` is the script type. This value is usually `painless`, but here you must specify `knn`.
- `source` is the name of the stored script, `knn_score`.
- `field` is the field that contains your vector data.
- `vector` is the point you want to find the nearest neighbors for.
- `space_type` is either `l2` or `cosinesimil`.
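
To see why filtering first changes the outcome, here is a minimal Python sketch using the six `my-knn-index-2` documents and the query vector `[9.9, 9.9]` from the request. It is a conceptual model only, not how the plugin computes scores: it ranks by plain Euclidean distance as a stand-in for the `l2` space type, and the function names are assumptions made for illustration.

```python
import math

# Documents copied from the bulk request: id -> (my_vector, color).
DOCS = {
    "1": ([1, 1], "RED"),
    "2": ([2, 2], "RED"),
    "3": ([3, 3], "RED"),
    "4": ([10, 10], "BLUE"),
    "5": ([20, 20], "BLUE"),
    "6": ([30, 30], "BLUE"),
}


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_then_knn(query, size, color):
    """Custom-scoring order: filter by color, then rank the survivors."""
    candidates = [d for d, (_, c) in DOCS.items() if c == color]
    candidates.sort(key=lambda d: l2_distance(query, DOCS[d][0]))
    return candidates[:size]


def knn_then_filter(query, k, color):
    """Post-filter order: take the k nearest overall, then filter by color."""
    ranked = sorted(DOCS, key=lambda d: l2_distance(query, DOCS[d][0]))
    return [d for d in ranked[:k] if DOCS[d][1] == color]


print(filter_then_knn([9.9, 9.9], size=2, color="BLUE"))
print(knn_then_filter([9.9, 9.9], k=2, color="BLUE"))
```

Filtering first leaves three `BLUE` candidates, so two results come back (documents `4` and `5`). Selecting neighbors first picks documents `4` and `3`, and the color filter then removes `3`, leaving a single result.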


## Performance considerations

The standard KNN query and custom scoring option perform differently. Test using a representative set of documents to see if the search results and latencies match your expectations.

Custom scoring works best if the initial filter reduces the number of documents to no more than 20,000. Increasing shard count can improve latencies, but be sure to keep shard size within [the recommended guidelines](../elasticsearch/#primary-and-replica-shards).
6 changes: 5 additions & 1 deletion docs/knn/settings.md
Expand Up @@ -5,7 +5,7 @@ parent: KNN
nav_order: 10
---

# KNN Settings and Statistics
# KNN settings and statistics

The KNN plugin adds several new index settings, cluster settings, and statistics.

Expand Down Expand Up @@ -60,3 +60,7 @@ Statistic | Description
`graphMemoryUsage` | Current cache size (total size of all graphs in memory) in kilobytes.
`missCount` | The number of cache misses. A cache miss occurs when a user queries a graph and it has not yet been loaded into memory.
`loadExceptionCount` | The number of times an exception occurred when trying to load a graph into the cache.
`script_compilations` | The number of times the KNN script has been compiled. This value should usually be 1 or 0, but if the cache containing the compiled scripts is filled, the KNN script might be recompiled.
`script_compilation_errors` | The number of errors during script compilation.
`script_query_requests` | The number of query requests that use [the KNN script](../#custom-scoring).
`script_query_errors` | The number of errors during script queries.