This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

having multiple vectors per document which can be searched in the same knn operation #221

Closed
pommedeterresautee opened this issue Sep 10, 2020 · 2 comments
Labels
question Further information is requested

Comments

@pommedeterresautee

Many Transformer models are limited to 512 input tokens, but they can provide higher-quality embeddings for semantic search than classical word embeddings.
For long documents (over 512 tokens), it is usual to split them into blocks of fewer than 512 tokens and work at the level of a single block.

My use case is to perform a semantic search across those long documents and find the most semantically related one.

In the current implementation of KNN in Open Distro, we can provide several vectors per document, but:

  • the number of vectors per document is limited by the mapping,
  • we can't perform a search across all vector fields in a single operation.

I have thought of 2 workarounds:

  • split the documents and index each block with its vector as a separate document: this makes everything much more complex to maintain (it forces us to maintain a second index with the original full documents), and simplicity was the very reason to try Open Distro rather than build an nmslib/FAISS index outside of Elasticsearch.
  • declare 10 or more KNN vector fields per document in the mapping, populate only the required fields (most of the time, most of the fields will stay empty), and at search time launch 10 searches, 1 per field, retrieve the top k docs per search, concatenate the results, sort by cosine similarity, and keep the top k. Here the main issue is performance: it may be quite slow. Again, in this case it seems better to manage a vector index outside of Elasticsearch.
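
The merge step of the second workaround could look like the minimal sketch below. It assumes each per-field search returns (document id, score) pairs where a higher score means a closer match; `merge_top_k` and the `Hit` alias are hypothetical names used for illustration, not part of Open Distro.

```python
from typing import Dict, Iterable, List, Tuple

# (document id, similarity score); assumption: higher score means closer.
Hit = Tuple[str, float]


def merge_top_k(per_field_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge the result lists of several per-field searches into one top-k list."""
    best: Dict[str, float] = {}
    for hits in per_field_hits:
        for doc_id, score in hits:
            # A document can match through several fields; keep its best block.
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Sort by descending score, breaking ties by id so the output is deterministic.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

Keeping the best score per document means a long document is ranked by its most relevant block, which is one possible choice; averaging or summing block scores would be alternatives.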

Is there another way to manage long documents?

@vamshin vamshin self-assigned this Sep 10, 2020
@vamshin vamshin added the question Further information is requested label Sep 10, 2020
@vamshin
Member

vamshin commented Sep 16, 2020

Hi @pommedeterresautee,

>>> 1. the number of vectors per document is limited by the mapping,

This can be achieved using dynamic templates. You can dynamically define fields of type knn_vector.

Example:
To declare every field whose name begins with vsearch as type knn_vector with 2 dimensions, you can create the index with a dynamic template like this:

curl -X PUT "localhost:9200/myindex" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "test_template": {
          "path_match": "vsearch*",
          "mapping": {
            "dimension": 2,
            "type": "knn_vector"
          }
        }
      }
    ]
  }
}
'
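
With such a template in place, a document can carry as many block vectors as it needs. The sketch below shows one way to build the document body on the client side. The `build_document` helper, the `text` field, and the `vsearch_<n>` naming scheme are assumptions for illustration; only the `vsearch` prefix comes from the template above.

```python
from typing import Dict, Sequence


def build_document(text: str,
                   block_vectors: Sequence[Sequence[float]],
                   prefix: str = "vsearch_") -> Dict[str, object]:
    """Build one document with one knn_vector field per text block."""
    # "text" is a hypothetical field holding the original full document.
    doc: Dict[str, object] = {"text": text}
    for i, vector in enumerate(block_vectors):
        # Field names start with "vsearch", so they match path_match "vsearch*".
        doc[f"{prefix}{i}"] = list(vector)
    return doc
```

For example, `build_document("long text", [[0.1, 0.2], [0.3, 0.4]])` produces the fields `vsearch_0` and `vsearch_1` next to `text`, and the resulting dictionary can be sent as the JSON body of an index request.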

>>> 2. we can't perform a search across all vector fields in a single operation.

A workaround is to search across multiple knn fields and combine the results.

Assuming a field my_dense_vector1 with 2 dimensions and a field my_dense_vector2 with 3 dimensions, you can assign a weight to the score of each field:


curl -X POST "localhost:9200/my_dense_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector1": {
                  "vector": [0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        },
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector2": {
                  "vector": [0, 0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        }
      ]
    }
  }
}
'
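
For the long-document case, where every block field has the same dimension and is compared against the same query vector, the request body above can be generated for any number of fields. This is a minimal sketch: `build_multi_field_knn_query` is a hypothetical helper, and it only builds the JSON body without sending it.

```python
from typing import Dict, Sequence


def build_multi_field_knn_query(fields: Sequence[str],
                                vector: Sequence[float],
                                k: int,
                                weight: float = 1.0) -> Dict[str, object]:
    """Build a bool/should query with one weighted knn clause per field."""
    should = [
        {
            "function_score": {
                # Same query vector for every field (assumes equal dimensions).
                "query": {"knn": {field: {"vector": list(vector), "k": k}}},
                "weight": weight,
            }
        }
        for field in fields
    ]
    return {"query": {"bool": {"should": should}}}
```

Note that a bool query adds up the scores of the matching should clauses, so a document that matches through several fields ranks higher than it would under a per-field maximum.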

@vamshin vamshin closed this as completed Jan 5, 2021
@ezorita

ezorita commented Feb 4, 2021

@pommedeterresautee I find myself in the exact same situation. Which approach did you take in the end? Have you had the chance to evaluate the performance loss of searching over multiple vector fields? Thanks.
