[RFC] Optimized Disk-Based Vector Search #1779

Open
jmazanec15 opened this issue Jun 28, 2024 · 4 comments

jmazanec15 commented Jun 28, 2024

Introduction

This document outlines a high-level proposal for providing efficient, yet easy-to-use, k-NN search in OpenSearch in low-memory environments. Many more details will come in individual component RFCs.

Problem Statement

Vector search has become very popular in recent years due to advancements in LLMs. Typically, embedding models like Amazon Titan, Cohere, or OpenAI output fp32 embeddings in a high-dimensional space (>= 768 dimensions). These embeddings are then ingested into vector search systems to provide semantic search. While semantic search can provide major benefits in search quality, it has historically come at a much higher cost than full-text search. This is mainly due to the size of the embeddings produced, along with the high in-memory requirements of traditional Approximate Nearest Neighbor (ANN) algorithms, such as HNSW.

With that, a major user ask is cost reduction. Many new solutions/algorithms are being released to reduce cost. General strategies include leveraging disk to extend memory and/or using quantization techniques to reduce the memory footprint per vector. Currently, OpenSearch offers the following quantization techniques at the 2x, 4x, and 8x+ compression levels (assuming input vectors are fp32 d-dimensional vectors, i.e. 32*d bits; 2x compresses a vector to 16*d bits, and so on):

| Compression Rate | Options |
|---|---|
| 2x | 1. [faiss] 16-bit Scalar Quantization |
| 4x | 1. [faiss] byte/8-bit Scalar Quantization (coming soon)<br>2. [lucene] byte/8-bit Scalar Quantization (coming soon) |
| 8x | 1. [faiss] product quantization (m=dim/2) |
| 16x | 1. [faiss] product quantization (m=dim/4) |
| 32x | 1. [faiss] product quantization (m=dim/8) |
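
For intuition, the table above translates into the following rough per-vector storage. This is a minimal Python sketch that counts only the quantized vector codes themselves and ignores graph overhead; it is illustrative, not part of the proposal.

# Rough per-vector storage for the quantized codes only (ignores HNSW/IVF
# overhead). Input vectors are fp32, i.e. 32 bits per dimension.
def compressed_bits_per_vector(d: int, compression: int) -> int:
    return 32 * d // compression

for compression in [1, 2, 4, 8, 16, 32]:
    bits = compressed_bits_per_vector(768, compression)
    print(f"{compression}x: {bits} bits (~{bits / 8:.0f} bytes) per 768-d vector")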

Many users want 8x+ compression, meaning that the current scalar quantization techniques are not viable options for them. Product quantization can be employed to meet this requirement. However, the current product quantization implementation has a few drawbacks:

  1. Product Quantization requires a separate training stage, which raises the barrier to entry for getting a working solution. Users need to maintain a separate workflow for creating the model in the cluster. Additionally, it is difficult to detect when the model no longer fits the distribution of the data and needs to be re-trained.
  2. Product Quantization recall degrades as the compression rate increases. Recall can be recovered in OpenSearch with full-precision re-scoring via script queries, but this interface is very verbose.

Embedding model providers have also recognized the memory issue and are starting to generate embeddings that use byte-per-dimension and/or bit-per-dimension representations, while maintaining competitive search quality. See the Cohere blog on byte/bit embeddings.

In OpenSearch, we currently support byte vectors through the Lucene engine. We are planning to support byte vectors in faiss in 2.16 (see #1659). Binary vector support has not yet been added. It has also been a highly sought-after feature for some time, with use cases outside of semantic search as well (see #81).

That being said, in order to continue to provide optimal vector search support, OpenSearch should:

  1. Support binary vectors being produced by embedding models
  2. Provide users with a simple experience for running k-NN search in low memory environments

Requirements

Functional

  1. Provide binary vector support as a data type users can use to perform ANN k-NN search on the vectors they bring from popular embedding models.
  2. Provide solutions at the 8x-32x compression levels that do not require a separate training step and that produce recall >= 0.9 at acceptable latencies and scalable throughput for popular embedding models.

Non-functional

  1. Latency is a function of many attributes, including vector dimension, compression technique, and number of vectors. We are targeting p90 latency below ~200ms for billion(s)-scale use cases with recall of ~0.9 and above. More details will be added once we have done more benchmarking.

Requirement #1 is covered in #1767. We will focus on Requirement #2 for the remainder of this doc.

Proposed Solution

From a high level, we propose a flexible, easy-to-use, two-phased search to provide performant k-NN search in low-memory environments. The two-phase search works by executing an initial ANN search over an in-memory quantized index, retrieving more than k results, and then re-scoring those results by re-computing their distances with the full-precision vectors. In the second phase, the full-precision vectors are lazily loaded from disk. In short, this approach lets users reduce their memory footprint and control the tradeoff between recall and search latency. A minimal sketch of the flow is shown below.
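
The following Python sketch illustrates the two-phase flow. The quantized_index and full_precision_store objects and their methods are hypothetical placeholders; the real implementation lives inside the k-NN plugin and the underlying engines.

import numpy as np

def two_phase_search(query, quantized_index, full_precision_store, k, rescore_factor=1.2):
    # Phase 1: ANN search over the in-memory quantized index, retrieving
    # more candidates than requested (rescore_factor * k).
    num_candidates = int(np.ceil(rescore_factor * k))
    candidate_ids = quantized_index.search(query, num_candidates)  # assumed API

    # Phase 2: lazily load full-precision vectors from disk and re-compute
    # exact distances for the candidates only.
    rescored = []
    for doc_id in candidate_ids:
        vector = full_precision_store.read(doc_id)  # assumed disk read by id
        rescored.append((doc_id, float(np.linalg.norm(query - vector))))

    # Return the k best candidates by exact (full-precision) distance.
    rescored.sort(key=lambda pair: pair[1])
    return rescored[:k]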

Ingestion

Two-phase Search

In fact, it is possible to run a two-phased workflow like this in OpenSearch today. See Appendix A: Executing Two-Phased Search with Current Functionality. To evaluate the feasibility of this approach, we ran several PoC experiments using product quantization and re-scoring. See Appendix B: Baseline Re-score Experiments for details. From the experiments, we found that this approach can substantially increase the recall of the k-NN search on quantized vectors, while still providing low-latency vector search in low-memory environments. This implementation does have a couple of drawbacks, though:

  1. Training for product quantization is cumbersome. Users have to ingest a training data set and run a separate training phase in order to use product quantization. The product quantizer is an additional piece of complexity that needs to be managed
  2. Configuring the product quantizer requires an understanding about its parameters in order to configure the tradeoff
  3. The existing re-scoring interface requires duplicate specification of vectors and is very verbose

We want to provide users with an improved experience that does not require a separate training stage and has a simple, intuitive interface. To do this, we are going to provide quantization during ingestion, improved full-precision re-scoring and a refreshed interface.

Proposed Interface

Disclaimer: The interface is subject to change, and feedback is encouraged and very valuable. We want to deliver an intuitive user experience: simple out of the box, with the flexibility for users to fine-tune and optimize for their performance needs. This is the general vision; more details will come in future RFCs.

Out of the box, we propose a simple interface that allows users to specify their workload intent, which we will then use to choose optimized defaults. The end-to-end workflow will look something like this:

Index Creation

PUT my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "mode": "on-disk"
      }
    }
  }
}

Parameters:

  • dimension — dimension of vectors
  • mode — new parameter that indicates what optimization strategy the index will take. Functionally, this parameter will choose a set of default parameters to optimize for certain workload requirements and/or constraints. For example, on-disk may mean that the user wants to limit the amount of memory consumption, so we will select a certain compression level and rescore factor by default. Details will be fleshed out in later RFCs.

Ingestion (no change)

PUT _bulk
{ "index": { "_index": "my-vector-index" } }
{ "my_vector_field": [1.5, 5.5, ...]}

Search (no change)

GET my-vector-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5],
        "k": 10
      }
    }
  }
}

The long-term vision will then be to consistently improve the out-of-the-box experience for a given mode by making more intelligent configuration decisions.

Fine-tuning
As a tenet, we will never introduce tuning functionality via mode that cannot be overridden by a parameter; thus, we will not step on any power users' toes. As a consequence, we will need to expose the set of defaults/constraints for each mode in the documentation so that users who need to fine-tune know where to start.

Index Creation
To support intuitive quantization, we will add a new mapping parameter called compression_level.

PUT my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "mode": "disk-based",
        "compression_level": "16x",
        "method": {
        ...
        }
      }
    }
  }
}

Parameters:

  • compression_level — This parameter indicates the ratio by which input vectors will be compressed. Based on this value, the plugin will decide what quantization technique to apply (see the sketch after this list for a purely hypothetical mapping). We will also provide an expert interface that allows users to specify the details of any quantizer, but this is omitted from this doc for brevity.
  • method — method is an existing parameter through which users configure their ANN index. Users can fine-tune their index by overriding values in this body
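
To make the intent of compression_level concrete, here is a purely hypothetical mapping from compression level to bits stored per dimension. The actual defaults and quantizer choices will be defined in the follow-up RFCs.

# Hypothetical mapping only: compression level -> bits stored per fp32
# dimension, with a plausible quantization technique for each level.
COMPRESSION_TO_BITS = {
    "2x": 16,   # e.g. 16-bit scalar quantization
    "4x": 8,    # e.g. byte/8-bit scalar quantization
    "8x": 4,
    "16x": 2,
    "32x": 1,   # e.g. binary quantization
}

def bits_per_dimension(compression_level: str) -> int:
    return COMPRESSION_TO_BITS[compression_level]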

Ingestion (no change)

PUT _bulk
{ "index": { "_index": "my-vector-index" } }
{ "my_vector_field": [1.5, 5.5, ...]}

Search
To support two-phased search, we will add a new parameter called rescore_factor.

GET my-vector-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5,1.5, 5.5],
        "k": 10,
        "rescore_factor": 1.2
      }
    }
  }
}

Parameters:

  • rescore_factor — This parameter indicates that rescore_factor*k results should be retrieved from the initial ANN search and that their distances should be re-computed with the full-precision vectors read from secondary storage.

Quantization Framework

One of the issues with our implementation of PQ is that it requires a training stage and model management. This is cumbersome for users. So, to provide an improved experience, we need to give users the ability to achieve 8x+ compression without requiring a training stage.

To accomplish this, we are going to build a quantization framework that will configure the quantizer during ingestion. It will be responsible for sampling incoming data and performing analysis to determine how to best quantize the data.

Initially, we will onboard binary quantization into this framework. Binary quantization has shown tremendous success (see the blogs in Appendix C: References) at achieving high compression rates while sacrificing minimal search quality, so we think it is a good place to start. For this, we will leverage the work done for binary vector support (#1767), and we will continue to add more quantizers as we go. A minimal sketch of this style of quantization follows.
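
For intuition, here is a minimal, training-free 1-bit quantizer sketch that thresholds each dimension on a sampled per-dimension mean and compares codes with Hamming distance. The actual quantizer details (thresholding, distance handling per space type) are assumptions here and will be covered in the framework RFC.

import numpy as np

def binary_quantize(vectors, thresholds=None):
    # Each dimension becomes one bit: 1 if above a per-dimension threshold
    # (here the mean of a sample), 0 otherwise. This yields 32x compression
    # relative to fp32 input vectors.
    if thresholds is None:
        thresholds = vectors.mean(axis=0)  # derived from sampled data, no training stage
    return np.packbits(vectors > thresholds, axis=1)

def hamming_distance(code_a, code_b):
    # Distance between two packed binary codes = number of differing bits.
    return int(np.unpackbits(np.bitwise_xor(code_a, code_b)).sum())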

Re-scoring

With the quantization framework, we will still store the full-precision vectors in a flat file. On top of this, we will implement the functionality to take the results of the search over the quantized index and re-score them at higher precision from a secondary data store, such as disk. An illustrative sketch of reading full-precision vectors by offset from such a flat file is shown below.
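
For illustration only, lazily reading a single full-precision vector back from a flat file of contiguous fp32 vectors could look roughly like this. The file layout and names are assumptions, not the actual on-disk format used by the engines.

import numpy as np

def read_full_precision_vector(path: str, doc_ordinal: int, dimension: int) -> np.ndarray:
    # Seek directly to the vector's offset and read only that vector,
    # rather than loading the whole flat file into memory.
    bytes_per_vector = dimension * 4  # fp32
    with open(path, "rb") as f:
        f.seek(doc_ordinal * bytes_per_vector)
        buffer = f.read(bytes_per_vector)
    return np.frombuffer(buffer, dtype=np.float32)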

Alternatives Considered

Supporting a new algorithm, such as DiskANN or SPANN

We could integrate another library’s implementation of an algorithm like SPANN or DiskANN directly into the plugin. However, the addition of another library would come with a high cost of maintenance as well as a high cost for integration in a way that would take complete advantage of features such as remote storage.

Right now, there is not enough evidence that the benefits of such algorithms/implementations would outperform an optimized two-phased search over an HNSW or IVF structure with quantization (see the PoC done by the Lucene community: apache/lucene#12615).

While we are not adding support for new algorithms/libraries now, this feature does not close the door on them; this is a route we may take in the future. With this in mind, we will be building the components in a generic manner so that they can be re-used.

Enhance Product Quantization to avoid training step

Product quantization is a powerful technique for reducing memory consumption. However, it has an upfront requirement for training that leads to a more complex user experience that we want to avoid. While it may be possible to abstract this step into the ingestion workflow, it would result in sub-optimal ingestion performance because the quantizer could not be shared across shards and would potentially need to be recreated per segment.

Product quantization should have better memory vs. recall tradeoffs than binary quantization. However, due to the ingestion efficiency of binary quantization and encouraging empirical results, we are focusing on binary quantization for now.

In the future, we could onboard product quantization into the online training framework so it can be used as well. In addition, re-scoring components will be available for existing product quantization indices.

Appendix A: Executing Two-Phased Search with Current Functionality

It is possible to run a two-phased k-NN search in OpenSearch with the script scoring functionality (ref). All you have to do is build an index with a particular quantizer and then add a k-NN script score query as part of the re-score block, which will re-score the top window_size results with full-precision vectors after the search over the shard finishes. For example:

POST my-knn-index-1/_search
{
    "size": k,
    "query": {
        "knn": {
            field_name: {
                "vector": [0.12321, 123212342],
                "k": k
            }
        }
    },
    "rescore": {
        "window_size": k,
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {
                        "match_all": {}
                    },
                    "script": {
                        "lang": "knn",
                        "source": "knn_score",
                        "params": {
                            "field": "test-field",
                            "query_value": [0.12321, 123212342], // this needs to be copied
                            "space_type": "l2"
                        }
                    }
                }
            }
        }
    }
}

This is what was done to run the experiments for Appendix B: Baseline Re-score Experiments.

Appendix B: Baseline Re-score Experiments

In order to evaluate the feasibility of the disk-based two-phase search approach, we executed a series of tests with the OSB cohere-10M (768-d, inner product) vector data set. To test the system in a memory-constrained environment, we used a single-node OpenSearch Docker configuration where the container was resource restricted (see here for details). We ran the experiments on r6gd.12xlarge (SSD/instance store) AWS instances. To focus on off-heap memory consumption, we fixed the heap at 32 GB.

We defined a couple of different memory environments. The baseline was determined from the memory recommended for full-precision HNSW graphs:

  • estimated_consumption = 1.1 * (4 * d + 8 * M) * num_vectors bytes (d = dimension, M = HNSW max connections)
  • memory_allotment = 2 * estimated_consumption
|  | mem-1x | mem-8x | mem-12x | mem-16x |
|---|---|---|---|---|
| offheap memory - cohere10M (GB) | 65.57 | 8.20 | 5.46 |  |
| total memory - cohere10M (GB) | 97.57 | 40.20 | 37.46 |  |
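
As a worked example of the baseline estimate for cohere-10M (assuming d = 768, 10M vectors, and the default HNSW M = 16), the mem-1x row works out as follows:

# Baseline memory estimate for mem-1x (assumes d = 768, HNSW M = 16, 10M vectors).
d, M, num_vectors = 768, 16, 10_000_000
estimated_consumption = 1.1 * (4 * d + 8 * M) * num_vectors  # bytes, ~35.2e9
memory_allotment = 2 * estimated_consumption
print(estimated_consumption / 2**30)  # ~32.8 GiB
print(memory_allotment / 2**30)       # ~65.6 GiB, i.e. the mem-1x column above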

We ran the search workload with differing numbers of candidates and re-scoring windows under the memory constraints. This led to the following results for IVFPQ:

| Metric/Config | ivfpq-8x-0r | ivfpq-8x-1r | ivfpq-8x-2r | ivfpq-12x-0r | ivfpq-12x-1r | ivfpq-12x-2r | ivfpq-16x-0r | ivfpq-16x-1r | ivfpq-16x-2r |
|---|---|---|---|---|---|---|---|---|---|
| **Search 1 Client** |  |  |  |  |  |  |  |  |  |
| Recall | 0.691 | 0.907 | 0.92 | 0.691 | 0.906 | 0.918 | 0.512 | 0.798 | 0.871 |
| p50 (ms) | 30 | 42 | 52 | 30 | 41 | 54 | 13 | 24 | 50 |
| p90 (ms) | 39 | 54 | 71 | 39 | 52 | 74 | 16 | 33 | 79 |
| p99 (ms) | 46 | 64 | 83 | 45 | 60 | 85 | 18 | 37 | 90 |
| mean t/p | 28.52 | 20.57 | 17.17 | 28.4 | 20.99 | 16.49 | 58.55 | 33.24 | 17.68 |
| **Search 2 Client** |  |  |  |  |  |  |  |  |  |
| Recall | 0.691 | 0.907 | 0.92 | 0.691 | 0.906 | 0.918 | 0.512 | 0.798 | 0.871 |
| p50 (ms) | 30 | 43 | 54 | 30 | 42 | 57 | 13 | 26 | 58 |
| p90 (ms) | 39 | 56 | 75 | 38 | 54 | 79 | 16 | 36 | 93 |
| p99 (ms) | 46 | 66 | 88 | 45 | 63 | 92 | 19 | 44 | 106 |
| mean t/p | 54.33 | 39.06 | 31.92 | 54.46 | 39.78 | 30.53 | 107.26 | 60.31 | 29.88 |
| **Search 4 Client** |  |  |  |  |  |  |  |  |  |
| Recall | 0.691 | 0.907 | 0.92 | 0.691 | 0.906 | 0.918 | 0.512 | 0.798 | 0.871 |
| p50 (ms) | 30 | 46 | 87 | 30 | 51 | 107 | 13 | 55 | 113 |
| p90 (ms) | 39 | 63 | 146 | 39 | 81 | 172 | 16 | 88 | 184 |
| p99 (ms) | 46 | 84 | 182 | 45 | 105 | 208 | 19 | 105 | 213 |
| mean t/p | 100.15 | 70.18 | 39.85 | 97.81 | 63.43 | 33.88 | 191.45 | 62.24 | 32.6 |

Appendix C: References

Several blogs around the success of binary quantization:

  1. https://medium.com/qdrant/hyperfast-vector-search-with-binary-quantization-865052c5c259
  2. https://huggingface.co/blog/embedding-quantization#binary-rescoring
  3. https://cohere.com/blog/int8-binary-embeddings

luyuncheng commented Jul 4, 2024

Quantized ANN in memory and full precision on disk: this optimization looks like DiskANN, and it would reduce memory as much as possible. If we want to bring this into the graph algorithm itself, I think we would need to change the IndexFlat storage to be disk-based. It would also take many IOPS (thousands of IOPS per query), so we could restructure the storage layout on disk.

jmazanec15 commented:

@luyuncheng I think DiskANN may be worth exploring in the future. However, I am a bit skeptical of the effectiveness of the approach at higher dimensions, which are common for many existing embedding models. For instance, for embeddings with 1024 dimensions, the full-precision vectors would consume 4 KB of space on disk. At this size, nothing else would fit in the block that is read from disk, so the locality in reads offered by DiskANN would not be helpful. So, the number of IOPS would not decrease unless the number of candidates being re-scored was significantly less than the number fetched during DiskANN's graph traversal.

> If we want to bring this into the graph algorithm itself, I think we would need to change the IndexFlat storage to be disk-based. It would also take many IOPS (thousands of IOPS per query), so we could restructure the storage layout on disk.

What did you mean by this?


luyuncheng commented Jul 8, 2024

> > If we want to bring this into the graph algorithm itself, I think we would need to change the IndexFlat storage to be disk-based. It would also take many IOPS (thousands of IOPS per query), so we could restructure the storage layout on disk.
>
> What did you mean by this?

I misunderstood the RFC. I had assumed that we would design disk-based storage in Faiss and change the Index* storage, which by default is an IndexFlat that loads all vectors into memory, so that it loads from disk instead, while using the quantized index for the HNSW graph distances:

struct IndexHNSW : Index {
    // the link structure
    HNSW hnsw;

    // the sequential storage
    bool own_fields = false;
    Index* storage = nullptr;
};

But I realized we just use the quantized index and re-score the top-k results at full precision.

navneet1v commented:

> I had assumed that we would design disk-based storage in Faiss and change the Index* storage, which by default is an IndexFlat that loads all vectors into memory, so that it loads from disk instead.

@luyuncheng I think we touched upon this idea in this GitHub issue, if you remember. That is also worth exploring and may benefit users for whom quantization doesn't work.
