HSEARCH-4948 Document knn search predicate
marko-bekhta authored and yrodiere committed Nov 29, 2023
1 parent 9f50fba commit e919631
Showing 5 changed files with 209 additions and 1 deletion.
@@ -55,7 +55,7 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException

@Override
public int getMaxDimensions(String fieldName) {
// TODO : vector : we can make this configurable, apparently there are models that produce larger vectors than this default allows.
// TODO: HSEARCH-5020: we can make this configurable, apparently there are models that produce larger vectors than this default allows.
return DEFAULT_MAX_DIMENSIONS;
}

@@ -132,6 +132,25 @@ Fields mapped using this annotation have very limited configuration options from
but the value binder will be able to pick a non-standard field type,
which generally gives much more flexibility.

[[mapping-directfieldmapping-annotations-vectorfield]] `@VectorField`::
+
include::../components/_incubating-warning.adoc[]
+
WARNING: Vector fields are only supported by the <<backend-lucene, Lucene backend>> for now.
+
Specific field type for vector fields to be used in a <<search-dsl-predicate-knn,vector search>>.
+
Vector fields accept values of type `float[]` or `byte[]` and *require* that
the <<mapping-directfieldmapping-dimension,dimension>> of stored vectors is specified upfront and that the size of each indexed vector
matches this dimension.
+
Besides that, vector fields allow optionally configuring
the <<mapping-directfieldmapping-vectorSimilarity,similarity function>> used during search,
the <<mapping-directfieldmapping-beamWidth,beam width>>
and the <<mapping-directfieldmapping-maxConnections,maximum connections>> used during indexing.
+
WARNING: It is not allowed to index multiple vectors within the same field, i.e. vector fields cannot be <<binding-index-field-dsl-multi-valued-fields,multivalued>>.

[[mapping-directfieldmapping-annotation-attributes]]
== [[mapper-orm-directfieldmapping-annotation-attributes]] Field annotation attributes

@@ -336,6 +355,71 @@ or `WITH_POSITIONS_OFFSETS_PAYLOADS`, both backends would support the `FAST_VECT
highlighter, if they already support the other two (`[PLAIN, UNIFIED]`).
|===============

[[mapping-directfieldmapping-dimension]] `dimension`::
+
include::../components/_incubating-warning.adoc[]
+
The size of the stored vectors. This attribute is required. Its value should match the size of the vectors produced by
the model used to convert the data into a vector representation. It is expected to be a positive integer in the range `[1,1024]`.
+
Only available on `@VectorField`.
// TODO: HSEARCH-5020: to update the section once we make this configurable
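
The constraints on `dimension` can be illustrated with a minimal plain-Java sketch. The class and method names below are hypothetical and are not part of the Hibernate Search API; the bounds simply mirror the `[1,1024]` range stated above, and the size check mirrors the requirement that indexed vectors have exactly the configured size.

```java
// Hypothetical helper, not part of Hibernate Search: mirrors the documented constraints.
final class VectorDimensionCheck {
    // Bounds taken from the documented range [1,1024].
    static final int MIN_DIMENSION = 1;
    static final int MAX_DIMENSION = 1024;

    /** Returns true if the dimension lies within the documented range. */
    static boolean isValidDimension(int dimension) {
        return dimension >= MIN_DIMENSION && dimension <= MAX_DIMENSION;
    }

    /** Returns true if the vector is non-null and has exactly the configured size. */
    static boolean matchesDimension(float[] vector, int dimension) {
        return vector != null && vector.length == dimension;
    }

    public static void main(String[] args) {
        int dimension = 128;
        System.out.println( isValidDimension( dimension ) );                 // true
        System.out.println( isValidDimension( 2048 ) );                      // false
        System.out.println( matchesDimension( new float[128], dimension ) ); // true
        System.out.println( matchesDimension( new float[64], dimension ) );  // false
    }
}
```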

[[mapping-directfieldmapping-vectorSimilarity]] `vectorSimilarity`::
+
include::../components/_incubating-warning.adoc[]
+
Defines how vector similarity is calculated during a <<search-dsl-predicate-knn,vector search>>.
+
Only available on `@VectorField`.
+
[cols=",a",options="header"]
|===============
|Value|Definition
|`VectorSimilarity.L2`|L2 (Euclidean) distance, which is a sensible default for most scenarios.
The squared distance between vectors `x` and `y` is calculated as latexmath:[d(x,y) = \sum_{i=1}^{n} (x_i - y_i)^2 ]
and the score function is latexmath:[s = \frac{1}{1+d}]
|`VectorSimilarity.INNER_PRODUCT`|Inner product (dot product in particular).
Distance between vectors `x` and `y` is calculated as latexmath:[d(x,y) = \sum_{i=1}^{n} x_i \cdot y_i ]
and the score function is latexmath:[s = \frac{1+d}{2}], so that a larger inner product yields a higher score.

[WARNING]
====
To use this similarity efficiently, both index and search vectors *must* be normalized;
otherwise search may produce poor results.
Floating point vectors must be https://en.wikipedia.org/wiki/Unit_vector[normalized to be of unit length],
while byte vectors should simply all have the same norm.
====

|`VectorSimilarity.COSINE`|Cosine similarity.
Distance between vectors `x` and `y` is calculated as latexmath:[d(x,y) = 1 - \frac{ \sum_{i=1} ^{n} x_i \cdot y_i }{ \sqrt{ \sum_{i=1} ^{n} x_i^2 } \sqrt{ \sum_{i=1} ^{n} y_i^2 }} ]
and the score function is latexmath:[s = \frac{1}{1+d}]
|`VectorSimilarity.DEFAULT`|Use the backend-specific default. For the <<backend-lucene, Lucene backend>> an `L2` similarity is used.
|===============
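
To make the quantities in the table concrete, here is a minimal plain-Java sketch of the `d(x,y)` computations, the L2 score, and unit-length normalization. It is not how the backend computes similarity internally. The cosine distance is computed as one minus the cosine similarity; treat that reading, and all class and method names below, as assumptions of this sketch.

```java
// Hypothetical helper, not part of Hibernate Search or Lucene.
final class VectorSimilaritySketch {

    /** Sum of squared differences: the d(x,y) listed for L2. */
    static double squaredL2(float[] x, float[] y) {
        requireSameLength( x, y );
        double sum = 0.0;
        for ( int i = 0; i < x.length; i++ ) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Sum of element-wise products: the d(x,y) listed for the inner product. */
    static double innerProduct(float[] x, float[] y) {
        requireSameLength( x, y );
        double sum = 0.0;
        for ( int i = 0; i < x.length; i++ ) {
            sum += (double) x[i] * y[i];
        }
        return sum;
    }

    /** One minus the cosine of the angle between x and y (assumed reading of the cosine formula). */
    static double cosineDistance(float[] x, float[] y) {
        double norms = Math.sqrt( innerProduct( x, x ) ) * Math.sqrt( innerProduct( y, y ) );
        return 1.0 - innerProduct( x, y ) / norms;
    }

    /** Score s = 1 / (1 + d), with d the sum of squared differences. */
    static double l2Score(float[] x, float[] y) {
        return 1.0 / ( 1.0 + squaredL2( x, y ) );
    }

    /** Returns a copy of x scaled to unit length. */
    static float[] normalize(float[] x) {
        double norm = Math.sqrt( innerProduct( x, x ) );
        if ( norm == 0.0 ) {
            throw new IllegalArgumentException( "Cannot normalize a zero vector" );
        }
        float[] result = new float[x.length];
        for ( int i = 0; i < x.length; i++ ) {
            result[i] = (float) ( x[i] / norm );
        }
        return result;
    }

    private static void requireSameLength(float[] x, float[] y) {
        if ( x.length != y.length ) {
            throw new IllegalArgumentException( "Vectors must have the same size" );
        }
    }

    public static void main(String[] args) {
        float[] x = { 1f, 2f };
        float[] y = { 4f, 6f };
        System.out.println( squaredL2( x, y ) );    // 25.0
        System.out.println( innerProduct( x, y ) ); // 16.0
        System.out.println( l2Score( x, y ) );
        System.out.println( cosineDistance( x, y ) );
    }
}
```

Normalizing both indexed and query vectors with something like `normalize(...)` before using the inner product is what the warning in the table asks for in the case of floating point vectors.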

[[mapping-directfieldmapping-beamWidth]] `beamWidth`::
+
include::../components/_incubating-warning.adoc[]
+
Beam width is the size of the dynamic list used during k-NN graph creation. It affects how vectors are stored.
Higher values lead to a more accurate graph but slower indexing speed.
+
Default value is backend-specific.
+
Only available on `@VectorField`.

[[mapping-directfieldmapping-maxConnections]] `maxConnections`::
+
include::../components/_incubating-warning.adoc[]
+
The number of neighbors each node will be connected to in the https://en.wikipedia.org/wiki/Nearest_neighbor_search#cite_note-:0-10[HNSW (Hierarchical Navigable Small World) graph].
Modifying this value will have an impact on memory consumption.
It is recommended to keep this value between 2 and 100.
+
Default value is backend-specific.
+
Only available on `@VectorField`.

[[mapping-directfieldmapping-supported-types]]
== [[mapper-orm-directfieldmapping-supported-types]] [[section-built-in-bridges]] Supported property types

@@ -1464,6 +1464,53 @@ either on a per-field basis with a call to `.boost(...)` just after `.field(...)
or for the whole predicate with a call to `.boost(...)`
after `.circle(...)`/`.boundingBox(...)`/`.polygon(...)`.

[[search-dsl-predicate-knn]]
== `knn`: K-Nearest Neighbors a.k.a. vector search

include::../components/_incubating-warning.adoc[]

The `knn` predicate, with `k` being a positive integer, matches the `k` documents for which a given vector field's value is "nearest" to a given vector.

Distance is measured based on the vector similarity configured for the given <<mapping-directfieldmapping-annotations-vectorfield,vector field>>.

.Simple K-Nearest Neighbors search
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/search/predicate/PredicateDslIT.java[tags=knn]
----
====
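
Conceptually, the predicate selects the `k` vectors closest to the query vector. The following plain-Java sketch shows that idea as an exhaustive scan using squared Euclidean distance. It is only an illustration: the backend relies on an HNSW graph rather than a full scan, and the class, method, and sample data names are hypothetical.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration, not part of Hibernate Search: an exhaustive k-nearest-neighbors scan.
final class BruteForceKnnSketch {

    /** Sum of squared differences between two vectors of the same size. */
    static double squaredL2(float[] x, float[] y) {
        double sum = 0.0;
        for ( int i = 0; i < x.length; i++ ) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the ids of the (at most) k vectors nearest to the query, nearest first. */
    static List<String> nearest(Map<String, float[]> vectorsById, float[] query, int k) {
        Comparator<Map.Entry<String, float[]>> byDistance =
                Comparator.comparingDouble( e -> squaredL2( e.getValue(), query ) );
        return vectorsById.entrySet().stream()
                .sorted( byDistance )
                .limit( k )
                .map( Map.Entry::getKey )
                .collect( Collectors.toList() );
    }

    /** Made-up sample data: four two-dimensional vectors. */
    static Map<String, float[]> sampleVectors() {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        vectors.put( "book1", new float[] { 1f, 1f } );
        vectors.put( "book2", new float[] { 2f, 2f } );
        vectors.put( "book3", new float[] { 3f, 3f } );
        vectors.put( "book4", new float[] { 4f, 4f } );
        return vectors;
    }

    public static void main(String[] args) {
        float[] query = new float[2]; // all zeros
        System.out.println( nearest( sampleVectors(), query, 2 ) ); // [book1, book2]
        // Fewer than k vectors exist, so all of them are returned.
        System.out.println( nearest( sampleVectors(), query, 5 ) ); // [book1, book2, book3, book4]
    }
}
```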

[[search-dsl-predicate-knn-argument-type]]
=== Expected type of arguments

The `knn` predicate expects the argument of the `matching(...)` method
to have the same type as the index type of the target field.

For example, if an entity property is mapped in the index to a byte array type (`byte[]`),
`.matching(...)` will only accept a byte array (`byte[]`) as its argument.

[[search-dsl-predicate-knn-filter]]
=== Filtering the neighbors

Optionally, the `knn` predicate can filter out some of the neighbors with its `.filter(...)` clause.
`.filter(...)` expects a predicate to be passed to it.

.K-Nearest Neighbors search with a filter
====
[source, JAVA, indent=0, subs="+callouts"]
----
include::{sourcedir}/org/hibernate/search/documentation/search/predicate/PredicateDslIT.java[tags=knn-filter]
----
====
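
The effect of a filter can be sketched in plain Java as follows. This sketch assumes that the filter restricts the candidate documents before the `k` nearest ones are selected, so the result holds up to `k` neighbors that all match the filter; that ordering is an assumption of the sketch, not something stated in this section. The `Doc` class and all other names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration, not part of Hibernate Search: exhaustive kNN over filtered candidates.
final class FilteredKnnSketch {

    /** Made-up document type holding an id, an author first name and a vector. */
    static final class Doc {
        final String id;
        final String authorFirstName;
        final float[] vector;

        Doc(String id, String authorFirstName, float[] vector) {
            this.id = id;
            this.authorFirstName = authorFirstName;
            this.vector = vector;
        }
    }

    /** Sum of squared differences between two vectors of the same size. */
    static double squaredL2(float[] x, float[] y) {
        double sum = 0.0;
        for ( int i = 0; i < x.length; i++ ) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Keeps only documents accepted by the filter, then returns the ids of the k nearest ones. */
    static List<String> nearest(List<Doc> docs, float[] query, int k, Predicate<Doc> filter) {
        Comparator<Doc> byDistance = Comparator.comparingDouble( d -> squaredL2( d.vector, query ) );
        return docs.stream()
                .filter( filter )
                .sorted( byDistance )
                .limit( k )
                .map( d -> d.id )
                .collect( Collectors.toList() );
    }

    /** Made-up sample data: three books by one author, one by another. */
    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc( "book1", "isaac", new float[] { 1f, 1f } ),
                new Doc( "book2", "isaac", new float[] { 2f, 2f } ),
                new Doc( "book3", "isaac", new float[] { 3f, 3f } ),
                new Doc( "book4", "lee", new float[] { 4f, 4f } )
        );
    }

    public static void main(String[] args) {
        float[] query = new float[2]; // all zeros
        List<String> hits = nearest( sampleDocs(), query, 5, d -> "isaac".equals( d.authorFirstName ) );
        System.out.println( hits ); // [book1, book2, book3]
    }
}
```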

[[search-dsl-predicate-knn-other]]
=== Other options

* The score of a `knn` predicate is variable by default (higher for "nearer" documents),
but can be <<search-dsl-predicate-common-constantScore,made constant with `.constantScore()`>>.
* The score of a `knn` predicate can be <<search-dsl-predicate-common-boost,boosted>>
for the whole predicate with a call to `.boost(...)` after `.matching(...)`.

[[search-dsl-predicate-named]]
== [[query-filter-fulltext]] `named`: call a predicate defined in the mapping
@@ -43,6 +43,8 @@ public class Book {
@FullTextField(analyzer = "english")
private String comment;

private float[] coverImageEmbeddings;

@ManyToMany
@IndexedEmbedded(structure = ObjectStructure.NESTED)
private List<Author> authors = new ArrayList<>();
@@ -98,6 +100,14 @@ public void setComment(String comment) {
this.comment = comment;
}

public float[] getCoverImageEmbeddings() {
return coverImageEmbeddings;
}

public void setCoverImageEmbeddings(float[] coverImageEmbeddings) {
this.coverImageEmbeddings = coverImageEmbeddings;
}

public List<Author> getAuthors() {
return authors;
}
@@ -8,8 +8,10 @@

import static org.assertj.core.api.Assertions.assertThat;
import static org.hibernate.search.util.impl.integrationtest.mapper.orm.OrmUtils.with;
import static org.junit.jupiter.api.Assumptions.assumeTrue;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;
@@ -26,9 +28,12 @@
import org.hibernate.search.engine.spatial.GeoPoint;
import org.hibernate.search.engine.spatial.GeoPolygon;
import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.cfg.HibernateOrmMapperSettings;
import org.hibernate.search.mapper.orm.mapping.HibernateOrmSearchMappingConfigurer;
import org.hibernate.search.mapper.orm.scope.SearchScope;
import org.hibernate.search.mapper.orm.session.SearchSession;
import org.hibernate.search.util.common.data.RangeBoundInclusion;
import org.hibernate.search.util.impl.integrationtest.common.extension.BackendConfiguration;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
@@ -52,6 +57,19 @@ class PredicateDslIT {
@BeforeEach
void setup() {
entityManagerFactory = setupHelper.start().setup( Book.class, Author.class, EmbeddableGeoPoint.class );

DocumentationSetupHelper.SetupContext setupContext = setupHelper.start();
// TODO HSEARCH-4950 Remove this if and add @VectorField to the Book entity once we support KNN search using Elasticsearch/OpenSearch
if ( BackendConfiguration.isLucene() ) {
setupContext.withProperty(
HibernateOrmMapperSettings.MAPPING_CONFIGURER,
(HibernateOrmSearchMappingConfigurer) context -> context.programmaticMapping()
.type( Book.class )
.property( "coverImageEmbeddings" )
.vectorField( 128 )
);
}
entityManagerFactory = setupContext.setup( Book.class, Author.class, EmbeddableGeoPoint.class );
initData();
}

@@ -1086,6 +1104,45 @@ void not() {
} );
}

@Test
void knn() {
// TODO HSEARCH-4950 Remove this assumption when we support KNN search using Elasticsearch/OpenSearch
assumeTrue(
BackendConfiguration.isLucene(),
"This test only makes sense if the backend supports vectors"
);
withinSearchSession( searchSession -> {
// tag::knn[]
float[] coverImageEmbeddingsVector = /*...*/
// end::knn[]
new float[128];
// tag::knn[]
List<Book> hits = searchSession.search( Book.class )
.where( f -> f.knn( 5 ).field( "coverImageEmbeddings" ).matching( coverImageEmbeddingsVector ) )
.fetchHits( 20 );
// end::knn[]
assertThat( hits )
.extracting( Book::getId )
.containsExactlyInAnyOrder( BOOK1_ID, BOOK2_ID, BOOK3_ID, BOOK4_ID );
} );

withinSearchSession( searchSession -> {
// tag::knn-filter[]
float[] coverImageEmbeddingsVector = /*...*/
// end::knn-filter[]
new float[128];
// tag::knn-filter[]
List<Book> hits = searchSession.search( Book.class )
.where( f -> f.knn( 5 ).field( "coverImageEmbeddings" ).matching( coverImageEmbeddingsVector )
.filter( f.match().field( "authors.firstName" ).matching( "isaac" ) ) )
.fetchHits( 20 );
// end::knn-filter[]
assertThat( hits )
.extracting( Book::getId )
.containsExactlyInAnyOrder( BOOK1_ID, BOOK2_ID, BOOK3_ID );
} );
}

private MySearchParameters getSearchParameters() {
return new MySearchParameters() {
@Override
@@ -1148,6 +1205,7 @@ private void initData() {
book1.setPageCount( 250 );
book1.setGenre( Genre.SCIENCE_FICTION );
book1.getAuthors().add( isaacAsimov );
book1.setCoverImageEmbeddings( floats( 128, 1.0f ) );
isaacAsimov.getBooks().add( book1 );

Book book2 = new Book();
@@ -1158,6 +1216,7 @@ private void initData() {
book2.setGenre( Genre.SCIENCE_FICTION );
book2.setComment( "Really liked this one!" );
book2.getAuthors().add( isaacAsimov );
book2.setCoverImageEmbeddings( floats( 128, 2.0f ) );
isaacAsimov.getBooks().add( book2 );

Book book3 = new Book();
@@ -1167,6 +1226,7 @@ private void initData() {
book3.setPageCount( 435 );
book3.setGenre( Genre.SCIENCE_FICTION );
book3.getAuthors().add( isaacAsimov );
book3.setCoverImageEmbeddings( floats( 128, 3.0f ) );
isaacAsimov.getBooks().add( book3 );

Book book4 = new Book();
@@ -1176,6 +1236,7 @@ private void initData() {
book4.setPageCount( 222 );
book4.setGenre( Genre.CRIME_FICTION );
book4.getAuthors().add( aLeeMartinez );
book4.setCoverImageEmbeddings( floats( 128, 4.0f ) );
aLeeMartinez.getBooks().add( book4 );

entityManager.persist( isaacAsimov );
@@ -1197,4 +1258,10 @@ private interface MySearchParameters {

List<String> getAuthorFilters();
}

private static float[] floats(int dimension, float value) {
float[] vector = new float[dimension];
Arrays.fill( vector, value );
return vector;
}
}
