
[Enhancement] [Build-Time V1] Reduce Memory Footprint during Native Index Creation #1600

Open
navneet1v opened this issue Apr 9, 2024 · 24 comments

@navneet1v
Collaborator

navneet1v commented Apr 9, 2024

Description

While running some experiments, I was able to see that we need more memory than the final graph that gets created in a segment.

Based on some research, here is what I can see for Faiss:

  1. We give Faiss all the vectors and ids for which the graph needs to be created, here: https://github.com/navneet1v/k-NN/blob/badbb1d3438a37536ce6de29c4085cded0ef2723/jni/src/faiss_wrapper.cpp#L159 .

  2. After this, Faiss internally calls add on the index, here: https://github.com/facebookresearch/faiss/blob/8898eabe9f16e7bbebdc19667c993f7dc55a6a0c/faiss/IndexIDMap.cpp#L80

  3. This internally calls the add function of IndexHNSW, which first stores the vectors in a flat index (which is nothing but a binary representation of the vectors) and then starts creating the HNSW graph.

So based on the above flow, the vectors that are stored in native memory in step 1 and passed to Faiss are stored again in Faiss's flat index, doubling the native memory.
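To put rough numbers on this: for a dataset like sift-128 with 1M float32 vectors, the raw vectors alone take about 1,000,000 x 128 x 4 bytes ≈ 488MB. During index creation that data lives once in the native memory we allocated in the JNI layer and once again inside Faiss's flat storage, so close to 1GB of native memory is consumed just for the vectors, before any graph links are built.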

For Nmslib:

  1. We first take the vectors and copy them into a new nmslib object.
  2. After that, we create the nmslib index.
  3. After the index is created, we save it to a file. This index writing makes a very large allocation before it writes the index to the file, here: https://github.com/nmslib/nmslib/blob/a2d6624e1315402662025debfdd614b505d9c3ef/similarity_search/src/method/hnsw.cc#L417-L428

Issue

Due to this, we need more memory to create a graph than the final Faiss/nmslib index at a segment level. This really throws off the memory estimations while force-merging the segments.

Resolution

With the streaming of vectors from the Java layer to the JNI layer, we are already sending a limited number of vectors from Java to the JNI layer at a time. We can use the same approach to build the graph incrementally: instead of accumulating the streamed vectors in a separate native memory location, we directly call the Faiss add_with_ids API on each batch. This ensures that we create the Faiss index for the vectors as they are streamed from Java to the JNI layer, so no extra memory (apart from the streamed vectors) is required. The API is available for HNSW, IVF, IVFPQ, and HNSWPQ.
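As a rough illustration (not the actual plugin code; function names like createEmptyHnswIndex are made up for this sketch, and the wrapping in IndexIDMap mirrors what the linked faiss_wrapper.cpp does today), the JNI layer could build the index in place from each streamed batch along these lines:

```cpp
// Sketch only: build the Faiss index incrementally from streamed batches,
// so only the current batch of vectors needs extra native memory.
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>
#include <faiss/index_io.h>
#include <vector>

// Called once, before any vectors are streamed.
faiss::IndexIDMap* createEmptyHnswIndex(int dim, int m) {
    auto* hnsw = new faiss::IndexHNSWFlat(dim, m);
    auto* idMap = new faiss::IndexIDMap(hnsw);
    idMap->own_fields = true;  // idMap owns and frees the inner HNSW index
    return idMap;
}

// Called once per streamed batch coming from the Java layer.
void addBatch(faiss::IndexIDMap* index,
              const std::vector<float>& vectors,        // batchSize * dim floats
              const std::vector<faiss::idx_t>& ids) {   // batchSize ids
    index->add_with_ids(static_cast<faiss::idx_t>(ids.size()),
                        vectors.data(), ids.data());
}

// Called once after the last batch to persist the graph.
void writeIndex(faiss::IndexIDMap* index, const char* path) {
    faiss::write_index(index, path);
}
```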

For nmslib, we can start creating the object that is required to build the nmslib index as the vectors are streamed. This should at least avoid increasing the memory footprint. @jmazanec15 added the resolution here: #1600 (comment)

Limitation

K-NN also provides nmslib as a KNNEngine, but based on my deep-dive I didn't find a similar feature for nmslib. Overall, I still believe we should implement this feature for the Faiss engine.

@navneet1v navneet1v added the indexing-improvements label and removed the untriaged label Apr 9, 2024
@vamshin
Member

vamshin commented Apr 9, 2024

Good catch! This seems to have the potential to cut down a large memory footprint. Agreed, let's prioritize for Faiss. I am assuming this feature will be similar to the incremental graph construction in the Lucene engine?

@navneet1v
Collaborator Author

Good catch! This seems to have the potential to cut down a large memory footprint. Agreed, let's prioritize for Faiss. I am assuming this feature will be similar to the incremental graph construction in the Lucene engine?

It's a little different, as per my understanding. In Lucene we create graphs incrementally during indexing (when docs are still in the Lucene buffer), but the incremental graph creation above happens during segment creation (merges and refresh).

@navneet1v navneet1v changed the title [Enahncement] [Build-Time] Reduce Memory Footprint during Faiss Index Creation [Enahncement] [Build-Time] Reduce Memory Footprint during Native Index Creation Apr 9, 2024
@navneet1v
Collaborator Author

@vamshin I have updated the proposal to fix the memory footprint of both nmslib and Faiss during indexing.

@jmazanec15 can you add the solution for fixing the footprint during the saveIndex call of nmslib?

@jmazanec15
Member

@navneet1v sure

Right now, nmslib duplicates the memory during indexing to build the optimized index: https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L470. Because we do not search the index that is created at index time in memory (i.e. we create the index, write it to disk, load it from disk, and then search it), we can avoid this duplication by stopping index creation before we make the additional allocations (https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L351) and then creating a new function that writes the non-optimized index to disk in the optimized format. I played around with this some time ago - will share code.

@navneet1v navneet1v changed the title [Enahncement] [Build-Time] Reduce Memory Footprint during Native Index Creation [Enhancement] [Build-Time] Reduce Memory Footprint during Native Index Creation Apr 9, 2024
@navneet1v
Collaborator Author

@vamshin what are your thoughts on the overall solution here?

@vamshin
Member

vamshin commented Apr 10, 2024

LGTM! Thanks for the deep dive.
Looking forward to benchmarks to quantify the savings!

@navneet1v
Collaborator Author

navneet1v commented Apr 10, 2024

In my understanding, we should be able to reduce the total memory footprint by half. So currently, if the graph takes 1GB, then during indexing we need at least 2GB of off-heap space. With the proposed change we should be pretty close to 1GB.

I did a small experiment with the current state, and I can see that we need double the memory compared to what the ideal memory usage should be.

[Image: MemoryUsageGraphs]

With the above proposal we should be close to the Ideal line I have added in this graph. I added jemalloc in this experiment.

Update: jemalloc was there, but it was not actually getting used. I will add new experiment results with jemalloc. Thanks to @jmazanec15 for helping in setting up jemalloc correctly.

OverallMemoryFootPrint.csv

@navneet1v
Collaborator Author

navneet1v commented Apr 12, 2024

Completed the POC code for the Faiss library. Ref: navneet1v@7165079

Here is the general idea:

  1. I split the createIndex function of JNIService into 2 smaller functions: createIndexIteratively and writeIndex. The reason is that the createIndex function of JNIService was doing 2 things: building the index and then writing it to a file.
  2. The new createIndexIteratively function takes indexAddress as a parameter; if the indexAddress is not 0, it casts the address to the right Faiss index type and then adds the vectors (see the sketch after this list).
  3. The number of vectors that are sent for creating the index is based on the design of [FEATURE] [Build-TimeReduction V1] Merge Optimization: Streaming vectors from Java Layer to JNI layer to avoid OOM/Circuit Breaker in native engines #1506
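A minimal sketch of what the JNI-layer side of that split could look like (illustrative only, not the actual POC code; the signature, the hard-coded M, and the raw-pointer parameters are assumptions made for the sketch):

```cpp
// Sketch only: the first call creates the index (indexAddress == 0);
// subsequent calls cast the address back and keep adding streamed vectors.
#include <jni.h>
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>

jlong createIndexIteratively(jlong indexAddress, int dim,
                             const float* vectors, const faiss::idx_t* ids,
                             faiss::idx_t numVectors) {
    faiss::IndexIDMap* index;
    if (indexAddress == 0) {
        // First batch: create an empty index and wrap it for add_with_ids.
        auto* hnsw = new faiss::IndexHNSWFlat(dim, 16 /* M, illustrative */);
        index = new faiss::IndexIDMap(hnsw);
        index->own_fields = true;
    } else {
        // Later batches: reuse the index created on the first call.
        index = reinterpret_cast<faiss::IndexIDMap*>(indexAddress);
    }
    index->add_with_ids(numVectors, vectors, ids);
    return reinterpret_cast<jlong>(index);  // handed back to the Java layer
}
```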

Next steps:
I will test with Faiss and check the memory footprint. Once it looks good, I will try to port the changes to nmslib too.

@navneet1v
Collaborator Author

navneet1v commented Apr 14, 2024

So I did some testing with the POC code (navneet1v@7165079). Here are the observations:

  1. I am able to remove the extra memory (the vectors' memory) during force merge as compared to the current 2.x and 2.13 versions.
  2. The memory usage doesn't jump quickly but grows slowly, which happens because we are creating the graphs in a progressive fashion and never sending more than 1% of the heap size worth of vectors to the JNI layer.
  3. If you kill the cluster, start it back up, and do search only, then the memory usage for all 3 versions is exactly the same [not plotted in the graph], which is (heap + graph usage).

Experiment Configuration

| Key | Value |
| --- | --- |
| Dataset | sift-128 |
| # vectors | 1M |
| Heap | 1GB |
| VCPU | 4 |
| Tool | OSB |
| Engine | Faiss |
| RAM (2.13, 2.14) | 3G |
| RAM (Mem fix version) | 2.5G |
| Ideal memory usage | < 2G ~ 1.9G |

Ideal Memory usage includes Heap + Graph + some extra usage of memory that can come from other processes.

[Image: Memory usage for Faiss for different experiments after the memory fix]

Raw data:
updated_memory_footprint.xlsx

This provides strong evidence that we should build the graphs in a progressive fashion.

Further Items to be looked at:

  1. Once the indexing/force merge is completed, we can still see more memory usage as compared to what we started with (for all the versions). My initial thought is that it is the doc values files which are mapped. I am seeing a ~200MB increase in memory, which correlates to the 2 ~100MB doc values files coming from the segment. More deep-dive needs to be done.
  2. When force merge is completed, for 2.14 and 2.13 we see a sudden dip in memory to exactly the same level as the Memory fix version (expected for all of them), but then, when graphs are loaded back in memory for search, 2.13 and 2.14 go back to the same level of memory usage (as during force merge) as compared to the Memory fix version, which baffles me. The increase for the Memory fix version is consistent with the graph file estimate, but it doesn't add up for 2.13 and 2.14. More deep-dive is required here. Updated the explanation below.

Index Directory

total 1.5G
-rw-r--r-- 1 ci-runner ci-runner  453 Apr 13 23:36 _e.fdm
-rw-r--r-- 1 ci-runner ci-runner 335M Apr 13 23:36 _e.fdt
-rw-r--r-- 1 ci-runner ci-runner  28K Apr 13 23:36 _e.fdx
-rw-r--r-- 1 ci-runner ci-runner  813 Apr 13 23:43 _e.fnm
-rw-r--r-- 1 ci-runner ci-runner 1.1M Apr 13 23:43 _e.kdd
-rw-r--r-- 1 ci-runner ci-runner 9.1K Apr 13 23:43 _e.kdi
-rw-r--r-- 1 ci-runner ci-runner  149 Apr 13 23:43 _e.kdm
-rw-r--r-- 1 ci-runner ci-runner  581 Apr 13 23:43 _e.si
-rw-r--r-- 1 ci-runner ci-runner 634M Apr 13 23:43 _e_165_target_field.faiss
-rw-r--r-- 1 ci-runner ci-runner 491M Apr 13 23:43 _e_Lucene90_0.dvd
-rw-r--r-- 1 ci-runner ci-runner  364 Apr 13 23:43 _e_Lucene90_0.dvm
-rw-r--r-- 1 ci-runner ci-runner   77 Apr 13 23:36 _e_Lucene99_0.doc
-rw-r--r-- 1 ci-runner ci-runner 3.0M Apr 13 23:36 _e_Lucene99_0.tim
-rw-r--r-- 1 ci-runner ci-runner 137K Apr 13 23:36 _e_Lucene99_0.tip
-rw-r--r-- 1 ci-runner ci-runner  200 Apr 13 23:36 _e_Lucene99_0.tmd
-rw-r--r-- 1 ci-runner ci-runner  372 Apr 13 23:43 segments_5
-rw-r--r-- 1 ci-runner ci-runner    0 Apr 13 23:31 write.lock

For setting up the env, the code can be found here: https://github.com/navneet1v/opensearch-vectorsearch-sample/tree/main/constraint-env-benchmarks

cc: @vamshin , @jmazanec15

@navneet1v
Collaborator Author

2. When force merge is completed, for 2.14 and 2.13 we see a sudden dip in memory to exactly the same level as the Memory fix version (expected for all of them), but then, when graphs are loaded back in memory for search, 2.13 and 2.14 go back to the same level of memory usage (as during force merge) as compared to the Memory fix version, which baffles me. The increase for the Memory fix version is consistent with the graph file estimate, but it doesn't add up for 2.13 and 2.14. More deep-dive is required here.

On doing 1 more experiment with the Memory fix version and higher RAM, I can see the same memory consumption as that of 2.13 and 2.14. A possible explanation is that because memory was available, the operating system kept on using it.

I also confirmed by killing the containers, starting them back up, and doing search only. The memory was < 2GB, which is the correct expectation.

@jmazanec15
Member

On doing 1 more experiment with the Memory fix version and higher RAM, I can see the same memory consumption as that of 2.13 and 2.14. A possible explanation is that because memory was available, the operating system kept on using it.

I'm wondering if we can identify which of the memory is from the page cache and which is from anonymous RSS. We can get metrics from the metric collector script by finding the PID outside of Docker and checking the proc filesystem.

Also, did this set of experiments use jemalloc? I don't see it in https://github.com/navneet1v/opensearch-vectorsearch-sample.

@navneet1v
Collaborator Author

@jmazanec15 it uses jemalloc: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/Dockerfile_Memory_Fix#L3 . I think I have configured it correctly.
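(For context, a common way to make the OpenSearch process pick up jemalloc is to preload the library, e.g. setting `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2` in the container environment before the JVM starts; the exact path and mechanism used in the Dockerfile above may differ.)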

I'm wondering if we can identify which of the memory is from the page cache and which is from anonymous RSS. We can get metrics from the metric collector script by finding the PID outside of Docker and checking the proc filesystem.

Yeah I want to do that. Let me talk to you on this.

@navneet1v
Collaborator Author

So I have updated the code to capture RSS metrics too, here: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/capture_memory_metrics.sh#L42-L43. I will run experiments to give more granular details on RSS.
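(On Linux, the split is visible per process in `/proc/<pid>/status` as the `RssAnon`, `RssFile`, and `RssShmem` fields, so sampling something like `grep -E 'RssAnon|RssFile' /proc/<pid>/status` in a loop is enough to separate anonymous memory from file-backed pages; the linked script may capture this differently.)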

@navneet1v
Collaborator Author

@vamshin , @jmazanec15

With the help of @jmazanec15, I was able to set up jemalloc in the experiments. I am seeing some positive results after that with the Memory fix version. Will add more details once I complete the experiments.

@navneet1v
Collaborator Author

navneet1v commented Apr 16, 2024

I completed the experiments and here are the results. Jemalloc helped remove a lot of the fragmentation that was happening earlier.

Experiment Configuration

| Key | Value |
| --- | --- |
| Dataset | sift-128 |
| # vectors | 1M |
| Heap | 1GB |
| VCPU | 4 |
| Tool | OSB |
| Engine | Faiss |
| RAM (2.14) | 2.6G |
| RAM (Mem fix version) | 2.1G |
| Ideal memory usage | < 2G ~ 1.9G |

Memory Fix Version

  1. The total anon RSS always remains below the theoretical max (heap + graph usage).
  2. The file cache comes into play, but it is dynamic and takes up space based on what is available.
  3. The spike in file RSS is because of reading the binary doc values at the start of the force merge.
  4. The anon RSS usage has a saw-tooth curve, which is expected because we are creating the index iteratively and sending only a limited number of vectors at a time. But anon RSS never crosses the theoretical max, which is exactly what the change was intended to do.
  5. We can smooth out the memory-usage curve with more optimizations in the POC code, which can be part of the final launch.

[Image: Memory fix version memory footprint]

2.14 Version

  1. The memory footprint is higher than the theoretical max, which is expected because while we are creating the Faiss index, we have both the vectors and the graph in memory at the same time.
  2. The spike in file RSS is because of reading the binary doc values at the start of the force merge.

[Image: 2.14 memory footprint]

Raw values:
updated_memory_footprint.xlsx

cc: @jmazanec15 , @vamshin

@navneet1v navneet1v self-assigned this Apr 16, 2024
@vamshin
Member

vamshin commented Apr 16, 2024

@navneet1v exciting results. Did we verify the recall is intact, and what is the observation on the build times with and without the fix?

@navneet1v
Collaborator Author

@navneet1v exciting results. Did we verify the recall is intact, and what is the observation on the build times with and without the fix?

Yes, the recall was the same for both experiments.

@navneet1v
Collaborator Author

@vamshin can we plan this for a release?

@luyuncheng
Collaborator

luyuncheng commented May 14, 2024

I want to talk about the memory reduction scenarios.

If we just write the vectors into a file in an IndexFlat-like format in the Java layer, and do not transfer the memory to the JNI layer but instead let Faiss read the data from the file as needed, it can reduce the memory to even less than 1GB in write and merge scenarios, like we discussed in #1693.

@navneet1v
Collaborator Author

navneet1v commented May 14, 2024

@luyuncheng yes, that is one possibility, but this assumes that while creating the index we are reading the file partially in an efficient fashion, because during index creation too, you need to calculate distances between vectors.

And as we were discussing on #1693, reading from the file adds a lot of IOPS; in the case of indexing this can really slow down the whole process.

@luyuncheng
Collaborator

Yes, that is one possibility, but this assumes that while creating the index we are reading the file partially in an efficient fashion, because during index creation too, you need to calculate distances between vectors.

Sure, we need to take the IOPS into consideration.

BTW, I really like this: navneet1v@7165079.

@navneet1v navneet1v changed the title [Enhancement] [Build-Time] Reduce Memory Footprint during Native Index Creation [Enhancement] [Build-Time V1] Reduce Memory Footprint during Native Index Creation May 29, 2024
@navneet1v navneet1v removed the v2.15.0 label May 29, 2024
@jmazanec15
Member

jmazanec15 commented Jun 24, 2024

@navneet1v One idea I came across: I am wondering if we can avoid the page cache by passing O_DIRECT for reading/writing the file. O_DIRECT will hint to the OS to avoid using the page cache if possible.

On read, it will prevent an additional copy. So, instead of disk -> page cache -> anon memory, it will be disk -> anon memory. Thus, it may speed up loads as well as reduce memory usage/page cache thrashing.
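For illustration, a minimal sketch of what an O_DIRECT read could look like (assumptions: Linux with glibc/g++, a 4096-byte alignment requirement, and simplified error handling; this is not tied to any existing loading code):

```cpp
// Sketch only: read a graph file with O_DIRECT so the bytes bypass the page
// cache and land directly in our anonymous memory.
// Note: O_DIRECT needs _GNU_SOURCE, which g++ defines by default on Linux.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <stdexcept>

void* readGraphDirect(const char* path, size_t fileSize) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) throw std::runtime_error("open failed");

    // O_DIRECT requires the buffer, offset, and length to be block-aligned.
    const size_t alignment = 4096;
    size_t alignedSize = (fileSize + alignment - 1) / alignment * alignment;

    void* buf = nullptr;
    if (posix_memalign(&buf, alignment, alignedSize) != 0) {
        close(fd);
        throw std::runtime_error("posix_memalign failed");
    }

    ssize_t bytesRead = read(fd, buf, alignedSize);
    close(fd);
    if (bytesRead < static_cast<ssize_t>(fileSize)) {
        free(buf);
        throw std::runtime_error("read failed");
    }
    return buf;  // caller frees; graph bytes now live only in anonymous memory
}
```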

On write, I think it's a little trickier with synchronization, but it would avoid copying to the page cache before writing to disk.

Downsides would be:

  1. Reads after writes would be slower than if the pages were present in the cache, i.e. we write the graph, the graph gets cached in the page cache, and we then load it.
  2. Syncing on write might be trickier / cause performance problems.

@fddattal and I had a discussion some time ago about this I believe. Do you have any thoughts on this?

@fddattal

@jmazanec15 Thanks for looping me into the discussion!

I recall from when we discussed this with Kiran that one of the issues with O_DIRECT is that its behavior is actually not standardized.
For example, there are some circumstances in which it would still internally buffer data.

Here's the best article I have found on explaining that: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics

As far as I am aware, the ScyllaDB folks use O_DIRECT. They have a pretty good engineering blog which may be of help to us:
https://www.scylladb.com/2017/10/05/io-access-methods-scylla/

At the time we considered implementing it, we didn't proceed due to time constraints. In theory, it can outperform typical reads/writes, but it is a bit of a can of worms, as you will need to direct the OS more regarding how to behave.

@jmazanec15
Member

Thanks @fddattal. I was thinking that we already cache graphs in memory. So, the fact that there is an intermediate stage where we read the file into the page cache and then copy it seems inefficient - if we just remove it, we wouldn't have to do anything else to get the benefit.

one of the issues with O_DIRECT is that its behavior is actually not standardized
That's good to know. I thought it might have been in the POSIX standard. That'll make things tricky.

Labels
enhancement, indexing-improvements, memory-reduction, v2.17.0
Projects
OpenSearch Project Roadmap
2.17 (First RC 09/03, Release 09/17)
Status: 2.17.0