
[Enhancement] [Build-Time V1] Reduce Memory Footprint during Native Index Creation #1600

Open
navneet1v opened this issue Apr 9, 2024 · 24 comments

@navneet1v
Collaborator

navneet1v commented Apr 9, 2024

Description

While running some experiments, I was able to see that we need more memory than the final graph that gets created in a segment.

Based on some research, here is what I can see for Faiss:

  1. We give Faiss all the vectors and ids for which the graph needs to be created, here: https://github.com/navneet1v/k-NN/blob/badbb1d3438a37536ce6de29c4085cded0ef2723/jni/src/faiss_wrapper.cpp#L159 .

  2. After this, Faiss internally calls add on the index, here: https://github.com/facebookresearch/faiss/blob/8898eabe9f16e7bbebdc19667c993f7dc55a6a0c/faiss/IndexIDMap.cpp#L80

  3. This internally calls the add function of IndexHNSW, which first stores the vectors in a flat index (which is nothing but a binary representation of the vectors) and then starts creating the HNSW graph.

So based on the above flow, the vectors that are stored in native memory in step 1 and passed to Faiss are stored again in Faiss's flat index, doubling the native memory.
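To put rough numbers on this: for a dataset like sift-128 with 1M float32 vectors, the raw vectors alone take about 1,000,000 x 128 x 4 bytes ≈ 488MB. During index creation that data lives once in the native memory we allocated in the JNI layer and once again inside Faiss's flat storage, so close to 1GB of native memory is consumed just for the vectors, before any graph links are built.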

For Nmslib:

  1. We first take the vectors and copy them into a new nmslib object.
  2. After that, we create the nmslib index.
  3. After the index is created, we save it to a file. This index writing makes a very large allocation before it writes the index to the file, here: https://github.com/nmslib/nmslib/blob/a2d6624e1315402662025debfdd614b505d9c3ef/similarity_search/src/method/hnsw.cc#L417-L428

Issue

Due to this, we need more memory to create a graph than the final Faiss/nmslib index at a segment level. This really throws off the memory estimations while force-merging the segments.

Resolution

With the streaming of vectors from the Java layer to the JNI layer, we are already sending a limited number of vectors from Java to the JNI layer at a time. We can use the same approach to build the graph incrementally: instead of accumulating the streamed vectors in a separate native memory location, we directly call the Faiss add_with_ids API on each batch. This ensures that we create the Faiss index for the vectors as they are streamed from Java to the JNI layer, so no extra memory (apart from the streamed vectors) is required. The API is available for HNSW, IVF, IVFPQ, and HNSWPQ.
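As a rough illustration (not the actual plugin code; function names like createEmptyHnswIndex are made up for this sketch, and the wrapping in IndexIDMap mirrors what the linked faiss_wrapper.cpp does today), the JNI layer could build the index in place from each streamed batch along these lines:

```cpp
// Sketch only: build the Faiss index incrementally from streamed batches,
// so only the current batch of vectors needs extra native memory.
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>
#include <faiss/index_io.h>
#include <vector>

// Called once, before any vectors are streamed.
faiss::IndexIDMap* createEmptyHnswIndex(int dim, int m) {
    auto* hnsw = new faiss::IndexHNSWFlat(dim, m);
    auto* idMap = new faiss::IndexIDMap(hnsw);
    idMap->own_fields = true;  // idMap owns and frees the inner HNSW index
    return idMap;
}

// Called once per streamed batch coming from the Java layer.
void addBatch(faiss::IndexIDMap* index,
              const std::vector<float>& vectors,        // batchSize * dim floats
              const std::vector<faiss::idx_t>& ids) {   // batchSize ids
    index->add_with_ids(static_cast<faiss::idx_t>(ids.size()),
                        vectors.data(), ids.data());
}

// Called once after the last batch to persist the graph.
void writeIndex(faiss::IndexIDMap* index, const char* path) {
    faiss::write_index(index, path);
}
```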

For nmslib, we can start creating the object that is required to build the nmslib index as the vectors are streamed. This should at least avoid increasing the memory footprint. @jmazanec15 added the resolution here: #1600 (comment)

Limitation

K-NN also provides nmslib as a KNNEngine, but based on my deep-dive I didn't find a similar feature for nmslib. Overall, I still believe we should implement this feature for the Faiss engine.

@navneet1v navneet1v added the indexing-improvements label and removed the untriaged label Apr 9, 2024
@vamshin
Member

vamshin commented Apr 9, 2024

Good catch! This seems to have the potential to cut down a large memory footprint. Agreed, let's prioritize for Faiss. I am assuming this feature will be similar to the incremental graph construction in the Lucene engine?

@navneet1v
Collaborator Author

Good catch! This seems to have the potential to cut down a large memory footprint. Agreed, let's prioritize for Faiss. I am assuming this feature will be similar to the incremental graph construction in the Lucene engine?

It's a little different, as per my understanding. In Lucene we create graphs incrementally during indexing (when docs are still in the Lucene buffer), but the incremental graph creation above happens during segment creation (merges and refresh).

@navneet1v navneet1v changed the title [Enahncement] [Build-Time] Reduce Memory Footprint during Faiss Index Creation [Enahncement] [Build-Time] Reduce Memory Footprint during Native Index Creation Apr 9, 2024
@navneet1v
Collaborator Author

@vamshin I have updated the proposal to fix the memory footprint of both nmslib and Faiss during indexing.

@jmazanec15 can you add the solution for fixing the footprint during the saveIndex call of nmslib?

@jmazanec15
Member

@navneet1v sure

Right now, nmslib duplicates the memory during indexing to build the optimized index: https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L470. Because we do not search the index that is created at index time in memory (i.e. we create the index, write it to disk, load it from disk, and then search it), we can avoid this duplication by stopping index creation before we make the additional allocations (https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L351) and then creating a new function that writes the non-optimized index to disk in the optimized format. I played around with this some time ago - will share code.

@navneet1v navneet1v changed the title [Enahncement] [Build-Time] Reduce Memory Footprint during Native Index Creation [Enhancement] [Build-Time] Reduce Memory Footprint during Native Index Creation Apr 9, 2024
@navneet1v
Collaborator Author

@vamshin what are your thoughts on the overall solution here?

@vamshin
Member

vamshin commented Apr 10, 2024

LGTM! Thanks for the deep dive.
Looking forward to benchmarks to quantify the savings!

@navneet1v
Collaborator Author

navneet1v commented Apr 10, 2024

In my understanding, we should be able to reduce the total memory footprint by half. So currently, if the graph takes 1GB, then during indexing we need at least 2GB of off-heap space. With the proposed change we should be pretty close to 1GB.

I did a small experiment with the current state, and I can see that we need double the memory compared to what the ideal memory usage should be.

[Image: MemoryUsageGraphs]

With the above proposal we should be close to the Ideal line I have added in this graph. I added jemalloc in this experiment.

Update: jemalloc was there, but it was not actually getting used. I will add new experiment results with jemalloc. Thanks to @jmazanec15 for helping in setting up jemalloc correctly.

OverallMemoryFootPrint.csv

@navneet1v
Collaborator Author

navneet1v commented Apr 12, 2024

Completed the POC code for the Faiss library. Ref: navneet1v@7165079

Here is the general idea:

  1. I split the createIndex function of JNIService into 2 smaller functions: createIndexIteratively and writeIndex. The reason is that the createIndex function of JNIService was doing 2 things: building the index and then writing it to a file.
  2. The new createIndexIteratively function takes indexAddress as a parameter; if the indexAddress is not 0, it casts the address to the right Faiss index type and then adds the vectors (see the sketch after this list).
  3. The number of vectors that are sent for creating the index is based on the design of [FEATURE] [Build-TimeReduction V1] Merge Optimization: Streaming vectors from Java Layer to JNI layer to avoid OOM/Circuit Breaker in native engines #1506
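A minimal sketch of what the JNI-layer side of that split could look like (illustrative only, not the actual POC code; the signature, the hard-coded M, and the raw-pointer parameters are assumptions made for the sketch):

```cpp
// Sketch only: the first call creates the index (indexAddress == 0);
// subsequent calls cast the address back and keep adding streamed vectors.
#include <jni.h>
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>

jlong createIndexIteratively(jlong indexAddress, int dim,
                             const float* vectors, const faiss::idx_t* ids,
                             faiss::idx_t numVectors) {
    faiss::IndexIDMap* index;
    if (indexAddress == 0) {
        // First batch: create an empty index and wrap it for add_with_ids.
        auto* hnsw = new faiss::IndexHNSWFlat(dim, 16 /* M, illustrative */);
        index = new faiss::IndexIDMap(hnsw);
        index->own_fields = true;
    } else {
        // Later batches: reuse the index created on the first call.
        index = reinterpret_cast<faiss::IndexIDMap*>(indexAddress);
    }
    index->add_with_ids(numVectors, vectors, ids);
    return reinterpret_cast<jlong>(index);  // handed back to the Java layer
}
```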

Next steps:
I will test with Faiss and check the memory footprint. Once it looks good, I will try to port the changes to nmslib too.

@navneet1v
Collaborator Author

navneet1v commented Apr 14, 2024

So I did some testing with the POC code (navneet1v@7165079). Here are the observations:

  1. I am able to remove the extra memory (the vectors' memory) during force merge as compared to the current 2.x and 2.13 versions.
  2. The memory usage doesn't jump quickly but grows slowly, which happens because we are creating the graphs in a progressive fashion and never sending more than 1% of the heap size worth of vectors to the JNI layer.
  3. If you kill the cluster, start it back up, and do search only, then the memory usage for all 3 versions is exactly the same [not plotted in the graph], which is (heap + graph usage).

Experiment Configuration

| Key | Value |
| --- | --- |
| Dataset | sift-128 |
| # vectors | 1M |
| Heap | 1GB |
| VCPU | 4 |
| Tool | OSB |
| Engine | Faiss |
| RAM (2.13, 2.14) | 3G |
| RAM (Mem fix version) | 2.5G |
| Ideal memory usage | < 2G ~ 1.9G |

Ideal Memory usage includes Heap + Graph + some extra usage of memory that can come from other processes.

[Image: Memory usage for Faiss for different experiments after the memory fix]

Raw data:
updated_memory_footprint.xlsx

This provides strong evidence that we should build the graphs in a progressive fashion.

Further Items to be looked at:

  1. Once the indexing/force merge is completed, we can still see more memory usage as compared to what we started with (for all the versions). My initial thought is that it is the doc values files which are mapped. I am seeing a ~200MB increase in memory, which correlates to the 2 ~100MB doc values files coming from the segment. More deep-dive needs to be done.
  2. When force merge is completed, for 2.14 and 2.13 we see a sudden dip in memory to exactly the same level as the Memory fix version (expected for all of them), but then, when graphs are loaded back in memory for search, 2.13 and 2.14 go back to the same level of memory usage (as during force merge) as compared to the Memory fix version, which baffles me. The increase for the Memory fix version is consistent with the graph file estimate, but it doesn't add up for 2.13 and 2.14. More deep-dive is required here. Updated the explanation below.

Index Directory

total 1.5G
-rw-r--r-- 1 ci-runner ci-runner  453 Apr 13 23:36 _e.fdm
-rw-r--r-- 1 ci-runner ci-runner 335M Apr 13 23:36 _e.fdt
-rw-r--r-- 1 ci-runner ci-runner  28K Apr 13 23:36 _e.fdx
-rw-r--r-- 1 ci-runner ci-runner  813 Apr 13 23:43 _e.fnm
-rw-r--r-- 1 ci-runner ci-runner 1.1M Apr 13 23:43 _e.kdd
-rw-r--r-- 1 ci-runner ci-runner 9.1K Apr 13 23:43 _e.kdi
-rw-r--r-- 1 ci-runner ci-runner  149 Apr 13 23:43 _e.kdm
-rw-r--r-- 1 ci-runner ci-runner  581 Apr 13 23:43 _e.si
-rw-r--r-- 1 ci-runner ci-runner 634M Apr 13 23:43 _e_165_target_field.faiss
-rw-r--r-- 1 ci-runner ci-runner 491M Apr 13 23:43 _e_Lucene90_0.dvd
-rw-r--r-- 1 ci-runner ci-runner  364 Apr 13 23:43 _e_Lucene90_0.dvm
-rw-r--r-- 1 ci-runner ci-runner   77 Apr 13 23:36 _e_Lucene99_0.doc
-rw-r--r-- 1 ci-runner ci-runner 3.0M Apr 13 23:36 _e_Lucene99_0.tim
-rw-r--r-- 1 ci-runner ci-runner 137K Apr 13 23:36 _e_Lucene99_0.tip
-rw-r--r-- 1 ci-runner ci-runner  200 Apr 13 23:36 _e_Lucene99_0.tmd
-rw-r--r-- 1 ci-runner ci-runner  372 Apr 13 23:43 segments_5
-rw-r--r-- 1 ci-runner ci-runner    0 Apr 13 23:31 write.lock

For setting up the env, the code can be found here: https://github.com/navneet1v/opensearch-vectorsearch-sample/tree/main/constraint-env-benchmarks

cc: @vamshin , @jmazanec15

@navneet1v
Collaborator Author

2. When force merge is completed, for 2.14 and 2.13 we see a sudden dip in memory to exactly the same level as the Memory fix version (expected for all of them), but then, when graphs are loaded back in memory for search, 2.13 and 2.14 go back to the same level of memory usage (as during force merge) as compared to the Memory fix version, which baffles me. The increase for the Memory fix version is consistent with the graph file estimate, but it doesn't add up for 2.13 and 2.14. More deep-dive is required here.

On doing 1 more experiment with the Memory fix version and higher RAM, I can see the same memory consumption as that of 2.13 and 2.14. A possible explanation is that because memory was available, the operating system kept on using it.

I also confirmed by killing the containers, starting them back up, and doing search only. The memory was < 2GB, which is the correct expectation.

@jmazanec15
Member

On doing 1 more experiment with the Memory fix version and higher RAM, I can see the same memory consumption as that of 2.13 and 2.14. A possible explanation is that because memory was available, the operating system kept on using it.

I'm wondering if we can identify which of the memory is from the page cache and which is from anonymous RSS. We can get metrics from the metric collector script by finding the PID outside of Docker and checking the proc filesystem.

Also, did this set of experiments use jemalloc? I don't see it in https://github.com/navneet1v/opensearch-vectorsearch-sample.

@navneet1v
Collaborator Author

@jmazanec15 it uses jemalloc: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/Dockerfile_Memory_Fix#L3 . I think I have configured it correctly.
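(For context, a common way to make the OpenSearch process pick up jemalloc is to preload the library, e.g. setting `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2` in the container environment before the JVM starts; the exact path and mechanism used in the Dockerfile above may differ.)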

I'm wondering if we can identify which of the memory is from the page cache and which is from anonymous RSS. We can get metrics from the metric collector script by finding the PID outside of Docker and checking the proc filesystem.

Yeah I want to do that. Let me talk to you on this.

@navneet1v
Collaborator Author

So I have updated the code to capture RSS metrics too, here: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/capture_memory_metrics.sh#L42-L43. I will run experiments to give more granular details on RSS.
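(On Linux, the split is visible per process in `/proc/<pid>/status` as the `RssAnon`, `RssFile`, and `RssShmem` fields, so sampling something like `grep -E 'RssAnon|RssFile' /proc/<pid>/status` in a loop is enough to separate anonymous memory from file-backed pages; the linked script may capture this differently.)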

@navneet1v
Collaborator Author

@vamshin , @jmazanec15

With the help of @jmazanec15, I was able to set up jemalloc in the experiments. I am seeing some positive results after that with the Memory fix version. Will add more details once I complete the experiments.

@navneet1v
Collaborator Author

navneet1v commented Apr 16, 2024

I completed the experiments and here are the results. Jemalloc helped remove a lot of the fragmentation that was happening earlier.

Experiment Configuration

| Key | Value |
| --- | --- |
| Dataset | sift-128 |
| # vectors | 1M |
| Heap | 1GB |
| VCPU | 4 |
| Tool | OSB |
| Engine | Faiss |
| RAM (2.14) | 2.6G |
| RAM (Mem fix version) | 2.1G |
| Ideal memory usage | < 2G ~ 1.9G |

Memory Fix Version

  1. The total anon RSS always remains below the theoretical max (heap + graph usage).
  2. The file cache comes into play, but it is dynamic and takes up space based on what is available.
  3. The spike in file RSS is because of reading the binary doc values at the start of the force merge.
  4. The anon RSS usage has a saw-tooth curve, which is expected because we are creating the index iteratively and sending only a limited number of vectors at a time. But anon RSS never crosses the theoretical max, which is exactly what the change was intended to do.
  5. We can smooth out the memory-usage curve with more optimizations in the POC code, which can be part of the final launch.

[Image: Memory fix version memory footprint]

2.14 Version

  1. The memory footprint is higher than the theoretical max, which is expected because while we are creating the Faiss index, we have both the vectors and the graph in memory at the same time.
  2. The spike in file RSS is because of reading the binary doc values at the start of the force merge.

[Image: 2.14 memory footprint]

Raw values:
updated_memory_footprint.xlsx

cc: @jmazanec15 , @vamshin

@navneet1v navneet1v self-assigned this Apr 16, 2024
@vamshin
Member

vamshin commented Apr 16, 2024

@navneet1v exciting results. Did we verify the recall is intact, and what is the observation on the build times with and without the fix?

@navneet1v
Collaborator Author

@navneet1v exciting results. Did we verify the recall is intact, and what is the observation on the build times with and without the fix?

Yes, the recall was the same for both experiments.

@navneet1v
Collaborator Author

@vamshin can we plan this for a release?

@luyuncheng
Collaborator

luyuncheng commented May 14, 2024

I want to talk about the memory reduction scenarios.

If we just write the vectors into a file in an IndexFlat-like format in the Java layer, and do not transfer the memory to the JNI layer but instead let Faiss read the data from the file as needed, it can reduce the memory to even less than 1GB in write and merge scenarios, like we discussed in #1693.

@navneet1v
Collaborator Author

navneet1v commented May 14, 2024

@luyuncheng yes, that is one possibility, but this assumes that while creating the index we are reading the file partially in an efficient fashion, because during index creation too, you need to calculate distances between vectors.

And as we were discussing on #1693, reading from the file adds a lot of IOPS; in the case of indexing this can really slow down the whole process.

@luyuncheng
Collaborator

Yes, that is one possibility, but this assumes that while creating the index we are reading the file partially in an efficient fashion, because during index creation too, you need to calculate distances between vectors.

Sure, we need to take the IOPS into consideration.

BTW, I really like this: navneet1v@7165079.

@navneet1v navneet1v changed the title [Enhancement] [Build-Time] Reduce Memory Footprint during Native Index Creation [Enhancement] [Build-Time V1] Reduce Memory Footprint during Native Index Creation May 29, 2024
@navneet1v navneet1v removed the v2.15.0 label May 29, 2024
@jmazanec15
Member

jmazanec15 commented Jun 24, 2024

@navneet1v One idea I came across: I am wondering if we can avoid the page cache by passing O_DIRECT for reading/writing the file. O_DIRECT will hint to the OS to avoid using the page cache if possible.

On read, it will prevent an additional copy. So, instead of disk -> page cache -> anon memory, it will be disk -> anon memory. Thus, it may speed up loads as well as reduce memory usage/page cache thrashing.
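For illustration, a minimal sketch of what an O_DIRECT read could look like (assumptions: Linux with glibc/g++, a 4096-byte alignment requirement, and simplified error handling; this is not tied to any existing loading code):

```cpp
// Sketch only: read a graph file with O_DIRECT so the bytes bypass the page
// cache and land directly in our anonymous memory.
// Note: O_DIRECT needs _GNU_SOURCE, which g++ defines by default on Linux.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <stdexcept>

void* readGraphDirect(const char* path, size_t fileSize) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) throw std::runtime_error("open failed");

    // O_DIRECT requires the buffer, offset, and length to be block-aligned.
    const size_t alignment = 4096;
    size_t alignedSize = (fileSize + alignment - 1) / alignment * alignment;

    void* buf = nullptr;
    if (posix_memalign(&buf, alignment, alignedSize) != 0) {
        close(fd);
        throw std::runtime_error("posix_memalign failed");
    }

    ssize_t bytesRead = read(fd, buf, alignedSize);
    close(fd);
    if (bytesRead < static_cast<ssize_t>(fileSize)) {
        free(buf);
        throw std::runtime_error("read failed");
    }
    return buf;  // caller frees; graph bytes now live only in anonymous memory
}
```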

On write, I think it's a little trickier with synchronization, but it would avoid copying to the page cache before writing to disk.

Downsides would be:

  1. Reads after writes would be slower than if the pages were present in the cache, i.e. we write the graph, the graph gets cached in the page cache, and we then load it.
  2. Syncing on write might be trickier / cause performance problems.

@fddattal and I had a discussion some time ago about this I believe. Do you have any thoughts on this?

@fddattal

@jmazanec15 Thanks for looping me into the discussion!

I recall from when we discussed this with Kiran that one of the issues with O_DIRECT is that its behavior is actually not standardized.
For example, there are some circumstances in which it would still internally buffer data.

Here's the best article I have found on explaining that: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics

As far as I am aware, the ScyllaDB folks use O_DIRECT. They have a pretty good engineering blog which may be of help to us:
https://www.scylladb.com/2017/10/05/io-access-methods-scylla/

At the time we considered implementing it, we didn't proceed due to time constraints. In theory, it can outperform typical reads/writes, but it is a bit of a can of worms, as you will need to direct the OS more regarding how to behave.

@jmazanec15
Member

Thanks @fddattal. I was thinking that we already cache graphs in memory. So, the fact that there is an intermediate stage where we read the file into the page cache and then copy it seems inefficient - if we just remove it, we wouldn't have to do anything else to get the benefit.

one of the issues with O_DIRECT is that its behavior is actually not standardized
That's good to know. I thought it might have been in the POSIX standard. That'll make things tricky.

Labels
enhancement, indexing-improvements, memory-reduction, v2.17.0
Projects
OpenSearch Project Roadmap
2.17 (First RC 09/03, Release 09/17)
Status: 2.17.0