[RFC] Optimized Disk-Based Vector Search #1779

Open
jmazanec15 opened this issue Jun 28, 2024 · 4 comments

jmazanec15 commented Jun 28, 2024

Introduction

This document outlines a high-level proposal for providing efficient, yet easy-to-use, k-NN search in OpenSearch in low-memory environments. Many more details will come in individual component RFCs.

Problem Statement

Vector search has become very popular in recent years due to advancements in LLMs. Typically, embedding models like Amazon Titan, Cohere, or OpenAI output fp32 embeddings in a high-dimensional space (>= 768 dimensions). These embeddings are then ingested into vector search systems to provide semantic search. While semantic search can provide major benefits in search quality, it has historically come at a much higher cost than full-text search. This is mainly due to the size of the embeddings produced, along with the high in-memory requirements of traditional Approximate Nearest Neighbor (ANN) algorithms, such as HNSW.

With that, a major user ask is cost reduction. Many new solutions/algorithms are being released to reduce cost. General strategies include leveraging disk to extend memory and/or using quantization techniques to reduce the memory footprint per vector. Currently, OpenSearch offers the following quantization techniques at the 2x, 4x, and 8x+ compression levels (assuming input vectors are fp32 d-dimensional vectors, i.e. 32*d bits; 2x compresses a vector to 16*d bits, and so on):

| Compression Rate | Options |
|---|---|
| 2x | 1. [faiss] 16-bit Scalar Quantization |
| 4x | 1. [faiss] byte/8-bit Scalar Quantization (coming soon)<br>2. [lucene] byte/8-bit Scalar Quantization (coming soon) |
| 8x | 1. [faiss] product quantization (m=dim/2) |
| 16x | 1. [faiss] product quantization (m=dim/4) |
| 32x | 1. [faiss] product quantization (m=dim/8) |
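
For intuition, the table above translates into the following rough per-vector storage. This is a minimal Python sketch that counts only the quantized vector codes themselves and ignores graph overhead; it is illustrative, not part of the proposal.

# Rough per-vector storage for the quantized codes only (ignores HNSW/IVF
# overhead). Input vectors are fp32, i.e. 32 bits per dimension.
def compressed_bits_per_vector(d: int, compression: int) -> int:
    return 32 * d // compression

for compression in [1, 2, 4, 8, 16, 32]:
    bits = compressed_bits_per_vector(768, compression)
    print(f"{compression}x: {bits} bits (~{bits / 8:.0f} bytes) per 768-d vector")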

Many users want 8x+ compression, meaning that the current scalar quantization techniques are not viable options for them. Product quantization can be employed to meet this requirement. However, the current product quantization implementation has a few drawbacks:

  1. Product Quantization requires a separate training stage, which raises the barrier to entry for getting a working solution. Users need to maintain a separate workflow for creating the model in the cluster. Additionally, it is difficult to detect when the model no longer fits the distribution of the data and needs to be re-trained.
  2. Product Quantization recall degrades as the compression rate increases. Recall can be recovered in OpenSearch with full-precision re-scoring via script queries, but this interface is very verbose.

Embedding model providers have also recognized the memory issue and are starting to generate embeddings that use byte-per-dimension and/or bit-per-dimension representations, while maintaining competitive search quality. See the Cohere blog on byte/bit embeddings.

In OpenSearch, we currently support byte vectors through the Lucene engine. We are planning to support byte vectors in faiss in 2.16 (see #1659). Binary vector support has not yet been added. It has also been a highly sought-after feature for some time, with use cases outside of semantic search as well (see #81).

That being said, in order to continue to provide optimal vector search support, OpenSearch should:

  1. Support binary vectors being produced by embedding models
  2. Provide users with a simple experience for running k-NN search in low memory environments

Requirements

Functional

  1. Provide binary vector support as a data type users can use to perform ANN k-NN search on the vectors they bring from popular embedding models.
  2. Provide solutions at the 8x-32x compression levels that do not require a separate training step and that produce recall >= 0.9 at acceptable latencies and scalable throughput for popular embedding models.

Non-functional

  1. Latency is a function of many attributes, including vector dimension, compression technique, and number of vectors. We are targeting p90 latency below ~200ms for billion(s)-scale use cases with recall of ~0.9 and above. More details will be added once we have done more benchmarking.

Requirement #1 is covered in #1767. We will focus on Requirement #2 for the remainder of this doc.

Proposed Solution

From a high level, we propose a flexible, easy-to-use, two-phased search to provide performant k-NN search in low-memory environments. The two-phase search works by executing an initial ANN search over an in-memory quantized index, retrieving more than k results, and then re-scoring those results by re-computing their distances with the full-precision vectors. In the second phase, the full-precision vectors are lazily loaded from disk. In short, this approach lets users reduce their memory footprint and control the tradeoff between recall and search latency. A minimal sketch of the flow is shown below.
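
The following Python sketch illustrates the two-phase flow. The quantized_index and full_precision_store objects and their methods are hypothetical placeholders; the real implementation lives inside the k-NN plugin and the underlying engines.

import numpy as np

def two_phase_search(query, quantized_index, full_precision_store, k, rescore_factor=1.2):
    # Phase 1: ANN search over the in-memory quantized index, retrieving
    # more candidates than requested (rescore_factor * k).
    num_candidates = int(np.ceil(rescore_factor * k))
    candidate_ids = quantized_index.search(query, num_candidates)  # assumed API

    # Phase 2: lazily load full-precision vectors from disk and re-compute
    # exact distances for the candidates only.
    rescored = []
    for doc_id in candidate_ids:
        vector = full_precision_store.read(doc_id)  # assumed disk read by id
        rescored.append((doc_id, float(np.linalg.norm(query - vector))))

    # Return the k best candidates by exact (full-precision) distance.
    rescored.sort(key=lambda pair: pair[1])
    return rescored[:k]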

Ingestion

Two-phase Search

In fact, it is possible to run a two-phased workflow like this in OpenSearch today. See Appendix A: Executing Two-Phased Search with Current Functionality. To evaluate the feasibility of this approach, we ran several PoC experiments using product quantization and re-scoring. See Appendix B: Baseline Re-score Experiments for details. From the experiments, we found that this approach can substantially increase the recall of the k-NN search on quantized vectors, while still providing low-latency vector search in low-memory environments. This implementation does have a couple of drawbacks, though:

  1. Training for product quantization is cumbersome. Users have to ingest a training data set and run a separate training phase in order to use product quantization. The product quantizer is an additional piece of complexity that needs to be managed
  2. Configuring the product quantizer requires an understanding about its parameters in order to configure the tradeoff
  3. The existing re-scoring interface requires duplicate specification of vectors and is very verbose

We want to provide users with an improved experience that does not require a separate training stage and has a simple, intuitive interface. To do this, we are going to provide quantization during ingestion, improved full-precision re-scoring and a refreshed interface.

Proposed Interface

Disclaimer: The interface is subject to change, and feedback is encouraged and very valuable. We want to deliver an intuitive user experience: simple out of the box, with the flexibility for users to fine-tune and optimize for their performance needs. This is the general vision; more details will come in future RFCs.

Out of the box, we propose a simple interface that allows users to specify their workload intent, which we will then use to choose optimized defaults. The end-to-end workflow will look something like this:

Index Creation

PUT my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "mode": "on-disk"
      }
    }
  }
}

Parameters:

  • dimension — dimension of vectors
  • mode — new parameter that indicates what optimization strategy the index will take. Functionally, this parameter will choose a set of default parameters to optimize for certain workload requirements and/or constraints. For example, on-disk may mean that the user wants to limit the amount of memory consumption, so we will select a certain compression level and rescore factor by default. Details will be fleshed out in later RFCs.

Ingestion (no change)

PUT _bulk
{ "index": { "_index": "my-vector-index" } }
{ "my_vector_field": [1.5, 5.5, ...]}

Search (no change)

GET my-vector-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5],
        "k": 10
      }
    }
  }
}

The long-term vision will then be to consistently improve the out-of-the-box experience for a given mode by making more intelligent configuration decisions.

Fine-tuning
As a tenet, we will never introduce tuning functionality via mode that cannot be overridden by a parameter; thus, we will not step on any power users' toes. As a consequence, we will need to expose the set of defaults/constraints for each mode in the documentation so that users who need to fine-tune know where to start.

Index Creation
To support intuitive quantization, we will add a new mapping parameter called compression_level.

PUT my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "mode": "disk-based",
        "compression_level": "16x",
        "method": {
        ...
        }
      }
    }
  }
}

Parameters:

  • compression_level — This parameter indicates the ratio by which input vectors will be compressed. Based on this value, the plugin will decide what quantization technique to apply (see the sketch after this list for a purely hypothetical mapping). We will also provide an expert interface that allows users to specify the details of any quantizer, but this is omitted from this doc for brevity.
  • method — method is an existing parameter through which users configure their ANN index. Users can fine-tune their index by overriding values in this body
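
To make the intent of compression_level concrete, here is a purely hypothetical mapping from compression level to bits stored per dimension. The actual defaults and quantizer choices will be defined in the follow-up RFCs.

# Hypothetical mapping only: compression level -> bits stored per fp32
# dimension, with a plausible quantization technique for each level.
COMPRESSION_TO_BITS = {
    "2x": 16,   # e.g. 16-bit scalar quantization
    "4x": 8,    # e.g. byte/8-bit scalar quantization
    "8x": 4,
    "16x": 2,
    "32x": 1,   # e.g. binary quantization
}

def bits_per_dimension(compression_level: str) -> int:
    return COMPRESSION_TO_BITS[compression_level]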

Ingestion (no change)

PUT _bulk
{ "index": { "_index": "my-vector-index" } }
{ "my_vector_field": [1.5, 5.5, ...]}

Search
To support two-phased search, we will add a new parameter called rescore_factor.

GET my-vector-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5],
        "k": 10,
        "rescore_factor": 1.2
      }
    }
  }
}

Parameters:

  • rescore_factor — This parameter indicates that rescore_factor*k results should be retrieved from the initial ANN search and that their distances should be re-computed with the full-precision vectors read from secondary storage.

Quantization Framework

One of the issues with our implementation of PQ is that it requires a training stage and model management. This is cumbersome for users. So, to provide an improved experience, we need to give users the ability to achieve 8x+ compression without requiring a training stage.

To accomplish this, we are going to build a quantization framework that will configure the quantizer during ingestion. It will be responsible for sampling incoming data and performing analysis to determine how to best quantize the data.

Initially, we will onboard binary quantization into this framework. Binary quantization has shown tremendous success (see the blogs in Appendix C: References) at achieving high compression rates while sacrificing minimal search quality, so we think it is a good place to start. For this, we will leverage the work done for binary vector support (#1767), and we will continue to add more quantizers as we go. A minimal sketch of this style of quantization follows.
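
For intuition, here is a minimal, training-free 1-bit quantizer sketch that thresholds each dimension on a sampled per-dimension mean and compares codes with Hamming distance. The actual quantizer details (thresholding, distance handling per space type) are assumptions here and will be covered in the framework RFC.

import numpy as np

def binary_quantize(vectors, thresholds=None):
    # Each dimension becomes one bit: 1 if above a per-dimension threshold
    # (here the mean of a sample), 0 otherwise. This yields 32x compression
    # relative to fp32 input vectors.
    if thresholds is None:
        thresholds = vectors.mean(axis=0)  # derived from sampled data, no training stage
    return np.packbits(vectors > thresholds, axis=1)

def hamming_distance(code_a, code_b):
    # Distance between two packed binary codes = number of differing bits.
    return int(np.unpackbits(np.bitwise_xor(code_a, code_b)).sum())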

Re-scoring

With the quantization framework, we will still store the full-precision vectors in a flat file. On top of this, we will implement the functionality to take the results of the search over the quantized index and re-score them at higher precision from a secondary data store, such as disk. An illustrative sketch of reading full-precision vectors by offset from such a flat file is shown below.
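
For illustration only, lazily reading a single full-precision vector back from a flat file of contiguous fp32 vectors could look roughly like this. The file layout and names are assumptions, not the actual on-disk format used by the engines.

import numpy as np

def read_full_precision_vector(path: str, doc_ordinal: int, dimension: int) -> np.ndarray:
    # Seek directly to the vector's offset and read only that vector,
    # rather than loading the whole flat file into memory.
    bytes_per_vector = dimension * 4  # fp32
    with open(path, "rb") as f:
        f.seek(doc_ordinal * bytes_per_vector)
        buffer = f.read(bytes_per_vector)
    return np.frombuffer(buffer, dtype=np.float32)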

Alternatives Considered

Supporting a new algorithm, such as DiskANN or SPANN

We could integrate another library’s implementation of an algorithm like SPANN or DiskANN directly into the plugin. However, the addition of another library would come with a high cost of maintenance as well as a high cost for integration in a way that would take complete advantage of features such as remote storage.

Right now, there is not enough evidence that the benefits of such algorithms/implementations would outperform an optimized two-phased search over an HNSW or IVF structure with quantization (see the PoC done by the Lucene community: apache/lucene#12615).

While we are not adding support for new algorithms/libraries now, this feature does not close the door on them; this is a route we may take in the future. With this in mind, we will be building the components in a generic manner so that they can be re-used.

Enhance Product Quantization to avoid training step

Product quantization is a powerful technique for reducing memory consumption. However, it has an upfront requirement for training that leads to a more complex user experience that we want to avoid. While it may be possible to abstract this step into the ingestion workflow, it would result in sub-optimal ingestion performance because the quantizer could not be shared across shards and would potentially need to be recreated per segment.

Product quantization should have better memory vs. recall tradeoffs than binary quantization. However, due to the ingestion efficiency of binary quantization and encouraging empirical results, we are focusing on binary quantization for now.

In the future, we could onboard product quantization into the online training framework so it can be used as well. In addition, re-scoring components will be available for existing product quantization indices.

Appendix A: Executing Two-Phased Search with Current Functionality

It is possible to run a two-phased k-NN search in OpenSearch with the script scoring functionality (ref). All you have to do is build an index with a particular quantizer and then add a k-NN script score query as part of the re-score block, which will re-score the top window_size results with full-precision vectors after the search over the shard finishes. For example:

POST my-knn-index-1/_search
{
    "size": k,
    "query": {
        "knn": {
            field_name: {
                "vector": [0.12321, 123212342],
                "k": k
            }
        }
    },
    "rescore": {
        "window_size": k,
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {
                        "match_all": {}
                    },
                    "script": {
                        "lang": "knn",
                        "source": "knn_score",
                        "params": {
                            "field": "test-field",
                            "query_value": [0.12321, 123212342], // this needs to be copied
                            "space_type": "l2"
                        }
                    }
                }
            }
        }
    }
}

This is what was done to run the experiments for Appendix B: Baseline Re-score Experiments.

Appendix B: Baseline Re-score Experiments

In order to evaluate the feasibility of the disk-based two-phase search approach, we executed a series of tests with the OSB cohere-10M (768-d, inner product) vector data set. To test the system in a memory-constrained environment, we used a single-node OpenSearch Docker configuration where the container was resource restricted (see here for details). We ran the experiments on r6gd.12xlarge (SSD/instance store) AWS instances. To focus on off-heap memory consumption, we fixed the heap at 32 GB.

We defined a couple of different memory environments. The baseline was determined from the memory recommended for full-precision HNSW graphs:

  • estimated_consumption = 1.1 * (4 * d + 8 * M) * num_vectors bytes (d = dimension, M = HNSW max connections)
  • memory_allotment = 2 * estimated_consumption
|  | mem-1x | mem-8x | mem-12x | mem-16x |
|---|---|---|---|---|
| offheap memory - cohere10M (GB) | 65.57 | 8.20 | 5.46 |  |
| total memory - cohere10M (GB) | 97.57 | 40.20 | 37.46 |  |
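
As a worked example of the baseline estimate for cohere-10M (assuming d = 768, 10M vectors, and the default HNSW M = 16), the mem-1x row works out as follows:

# Baseline memory estimate for mem-1x (assumes d = 768, HNSW M = 16, 10M vectors).
d, M, num_vectors = 768, 16, 10_000_000
estimated_consumption = 1.1 * (4 * d + 8 * M) * num_vectors  # bytes, ~35.2e9
memory_allotment = 2 * estimated_consumption
print(estimated_consumption / 2**30)  # ~32.8 GiB
print(memory_allotment / 2**30)       # ~65.6 GiB, i.e. the mem-1x column above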

We ran the search workload with differing numbers of candidates and re-scoring windows under the memory constraints. This led to the following results for IVFPQ:

| Metric/Config | ivfpq-8x-0r | ivfpq-8x-1r | ivfpq-8x-2r | ivfpq-12x-0r | ivfpq-12x-1r | ivfpq-12x-2r | ivfpq-16x-0r | ivfpq-16x-1r | ivfpq-16x-2r |
|---|---|---|---|---|---|---|---|---|---|
| **Search 1 Client** |  |  |  |  |  |  |  |  |  |
| Recall | 0.691 | 0.907 | 0.92 | 0.691 | 0.906 | 0.918 | 0.512 | 0.798 | 0.871 |
| p50 (ms) | 30 | 42 | 52 | 30 | 41 | 54 | 13 | 24 | 50 |
| p90 (ms) | 39 | 54 | 71 | 39 | 52 | 74 | 16 | 33 | 79 |
| p99 (ms) | 46 | 64 | 83 | 45 | 60 | 85 | 18 | 37 | 90 |
| mean t/p | 28.52 | 20.57 | 17.17 | 28.4 | 20.99 | 16.49 | 58.55 | 33.24 | 17.68 |
| **Search 2 Client** |  |  |  |  |  |  |  |  |  |
| Recall | 0.691 | 0.907 | 0.92 | 0.691 | 0.906 | 0.918 | 0.512 | 0.798 | 0.871 |
| p50 (ms) | 30 | 43 | 54 | 30 | 42 | 57 | 13 | 26 | 58 |
| p90 (ms) | 39 | 56 | 75 | 38 | 54 | 79 | 16 | 36 | 93 |
| p99 (ms) | 46 | 66 | 88 | 45 | 63 | 92 | 19 | 44 | 106 |
| mean t/p | 54.33 | 39.06 | 31.92 | 54.46 | 39.78 | 30.53 | 107.26 | 60.31 | 29.88 |
| **Search 4 Client** |  |  |  |  |  |  |  |  |  |
| Recall | 0.691 | 0.907 | 0.92 | 0.691 | 0.906 | 0.918 | 0.512 | 0.798 | 0.871 |
| p50 (ms) | 30 | 46 | 87 | 30 | 51 | 107 | 13 | 55 | 113 |
| p90 (ms) | 39 | 63 | 146 | 39 | 81 | 172 | 16 | 88 | 184 |
| p99 (ms) | 46 | 84 | 182 | 45 | 105 | 208 | 19 | 105 | 213 |
| mean t/p | 100.15 | 70.18 | 39.85 | 97.81 | 63.43 | 33.88 | 191.45 | 62.24 | 32.6 |

Appendix C: References

Several blogs around the success of binary quantization:

  1. https://medium.com/qdrant/hyperfast-vector-search-with-binary-quantization-865052c5c259
  2. https://huggingface.co/blog/embedding-quantization#binary-rescoring
  3. https://cohere.com/blog/int8-binary-embeddings

luyuncheng commented Jul 4, 2024

Quantized ANN in memory and full precision on disk: this optimization looks like DiskANN, and it would reduce memory as much as possible. If we want to bring this into the graph algorithm itself, I think we would need to change the IndexFlat storage to be disk-based. It would also take many IOPS (thousands of IOPS per query), so we could restructure the storage layout on disk.

jmazanec15 commented:

@luyuncheng I think DiskANN may be worth exploring in the future. However, I am a bit skeptical of the effectiveness of the approach at higher dimensions, which are common for many existing embedding models. For instance, for embeddings with 1024 dimensions, the full-precision vectors would consume 4 KB of space on disk. At this size, nothing else would fit in the block that is read from disk, so the locality in reads offered by DiskANN would not be helpful. So, the number of IOPS would not decrease unless the number of candidates being re-scored was significantly less than the number fetched during DiskANN's graph traversal.

> If we want to bring this into the graph algorithm itself, I think we would need to change the IndexFlat storage to be disk-based. It would also take many IOPS (thousands of IOPS per query), so we could restructure the storage layout on disk.

What did you mean by this?


luyuncheng commented Jul 8, 2024

> > If we want to bring this into the graph algorithm itself, I think we would need to change the IndexFlat storage to be disk-based. It would also take many IOPS (thousands of IOPS per query), so we could restructure the storage layout on disk.
>
> What did you mean by this?

I misunderstood the RFC. I had assumed that we would design disk-based storage in Faiss and change the Index* storage, which by default is an IndexFlat that loads all vectors into memory, so that it loads from disk instead, while using the quantized index for the HNSW graph distances:

struct IndexHNSW : Index {
    // the link structure
    HNSW hnsw;

    // the sequential storage
    bool own_fields = false;
    Index* storage = nullptr;
};

But I realized we just use the quantized index and re-score the top-k results at full precision.

navneet1v commented:

> I had assumed that we would design disk-based storage in Faiss and change the Index* storage, which by default is an IndexFlat that loads all vectors into memory, so that it loads from disk instead.

@luyuncheng I think we touched upon this idea in this GitHub issue, if you remember. That is also worth exploring and may benefit users for whom quantization doesn't work.
