
Conversation


@rchitale7 rchitale7 commented Oct 15, 2025

Description

This change introduces the option of serializing the faiss graph in memory, instead of writing it to disk. For background on why this can be helpful, see the issue description in #97. For the reasoning behind the approach I took, see #97 (comment). TL;DR: I did not end up using faiss.serialize_index, due to its inefficient memory usage. Instead, I used faiss.PyCallbackIOWriter to stream directly to a bytes buffer.
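The streaming pattern can be sketched as follows. The faiss calls shown in the comment reflect the real faiss Python API, but are not executed here since they require a built index; `stream_serialize` below is a stand-in that illustrates the same callback flow with plain Python objects:

```python
import io

# Real usage per the faiss Python API (requires a built index, so shown
# only as a comment):
#
#     buf = io.BytesIO()
#     writer = faiss.PyCallbackIOWriter(buf.write)
#     faiss.write_index(index, writer)
#     data = buf.getbuffer()  # zero-copy view, ready for the S3 upload
#
# PyCallbackIOWriter wraps any callable that accepts a bytes chunk, so the
# serializer streams chunks into the buffer instead of materializing one
# large intermediate bytes object up front.
def stream_serialize(chunks, write):
    """Push chunks through a write callback, as PyCallbackIOWriter does."""
    written = 0
    for chunk in chunks:
        write(chunk)
        written += len(chunk)
    return written

buf = io.BytesIO()
total = stream_serialize([b"\x00" * 4, b"\x01" * 4], buf.write)
```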

Note that writing the graph to disk is still the default. To serialize in memory, the consumer of this library must set index_storage_mode to IndexStorageMode.MEMORY via IndexBuildParameters.
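As a usage sketch: the enum and field names below are taken from this PR's description, but the rest of the `IndexBuildParameters` definition is illustrative only (the real class has additional fields):

```python
from dataclasses import dataclass
from enum import Enum

class IndexStorageMode(Enum):
    DISK = "disk"
    MEMORY = "memory"

@dataclass
class IndexBuildParameters:
    # Other fields (repository_type, data_type, ...) omitted for brevity.
    # Defaulting to DISK preserves the existing behavior for consumers
    # that do not set this field.
    index_storage_mode: IndexStorageMode = IndexStorageMode.DISK

# Opt in to in-memory serialization explicitly:
params = IndexBuildParameters(index_storage_mode=IndexStorageMode.MEMORY)
```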

Testing

I tested this change manually and verified that it works. I will do more benchmarking as a follow-up.

Issues Resolved

Resolves #97.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rchitale7 rchitale7 force-pushed the serialization_changes branch 2 times, most recently from a4a060e to 72035bd Compare October 15, 2025 17:23
@rchitale7 rchitale7 marked this pull request as ready for review October 15, 2025 17:39
@navneet1v
Collaborator

@rchitale7 please add benchmarking results for the change

ensuring strict parameter validation.
"""

index_storage_mode: IndexStorageMode = IndexStorageMode.DISK
Collaborator

why are we initializing the mode here?

Member Author

I wanted to keep the default mode as 'DISK' to ensure backwards compatibility. It also follows the pattern of initializing the other fields, such as repository_type and data_type.

Member Author

we can potentially pass it through an environment variable of the API Docker image instead; will explore that

Member Author

changed the code to use the environment variable to initialize this parameter
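A minimal sketch of that pattern, assuming the variable is read at startup and falls back to DISK for backwards compatibility. The variable name `INDEX_STORAGE_MODE` is hypothetical; the actual name in the code may differ:

```python
import os

VALID_MODES = {"DISK", "MEMORY"}

def storage_mode_from_env(default: str = "DISK") -> str:
    """Resolve the index storage mode from the container environment.

    INDEX_STORAGE_MODE is a hypothetical variable name used for this
    sketch; unset means DISK, preserving the old behavior.
    """
    raw = os.environ.get("INDEX_STORAGE_MODE", default).upper()
    if raw not in VALID_MODES:
        raise ValueError(f"unsupported index storage mode: {raw!r}")
    return raw
```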

@rchitale7 rchitale7 force-pushed the serialization_changes branch 2 times, most recently from 079fd54 to 5e4de8a Compare October 24, 2025 07:41
@rchitale7
Member Author

rchitale7 commented Oct 24, 2025

Serializing the faiss index in memory instead of writing to disk impacts performance in two places in the overall remote build flow:

  1. The faiss write index step (since we now serialize the index in memory instead of to disk)
  2. Uploading the faiss index to S3 (since we now upload to S3 from a memory buffer instead of from disk)

All other components of the remote build flow should remain the same. Serializing in memory also impacts CPU memory usage: we must now maintain an additional copy of the serialized graph in memory, instead of keeping it on disk. GPU memory usage should be unaffected.

To benchmark this serialization change, I used the ms-marco-384, cohere-10M-768-IP, and open-ai-1536 datasets found in benchmarks.yml. I compared the performance of serializing the index in memory vs. on disk for the 'write index' and 'upload to S3' steps, and compared peak CPU memory usage for the two modes. Finally, I validated that serializing in memory does not affect recall, and that the performance of all other aspects of the remote index build flow remained the same.

I've added the benchmarking scripts I used to this PR, and this README explains how I (and anyone else) can use them.

Here is a table comparing the performance and memory impact of serializing in memory vs. on disk:

| Dataset | Serialization Method | Dataset Size (MB) | Write Index Time (s) | S3 Upload Time (s) | Peak CPU Memory (MB) |
|---|---|---|---|---|---|
| ms-marco-384 | Memory | 1430.51 | 0.88 | 6.95 | 5084.59 |
| ms-marco-384 | Disk | 1430.51 | 0.82 | 6.02 | 4087.42 |
| cohere-10M-768-IP | Memory | 28610.32 | 16.58 | 132.67 | 91575.00 |
| cohere-10M-768-IP | Disk | 28610.32 | 83.90 | 109.14 | 64057.26 |
| open-ai-1536 | Memory | 5722.05 | 3.29 | 25.78 | 18328.68 |
| open-ai-1536 | Disk | 5722.05 | 3.06 | 21.59 | 12878.24 |

As expected, peak CPU memory is roughly [dataset size] MB larger when the index is serialized in memory vs. stored on disk. This is because the additional copy of the faiss graph contains all of the vectors.
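As a quick sanity check of that claim, the deltas between memory-mode and disk-mode peak CPU memory from the table can be computed directly (values copied from the table; all units MB):

```python
# (peak memory in MEMORY mode, peak memory in DISK mode, dataset size), in MB
rows = {
    "ms-marco-384":      (5084.59, 4087.42, 1430.51),
    "cohere-10M-768-IP": (91575.00, 64057.26, 28610.32),
    "open-ai-1536":      (18328.68, 12878.24, 5722.05),
}

# Extra peak memory paid by in-memory serialization; each delta is on the
# order of the dataset size, matching the "additional copy" explanation.
deltas = {name: mem - disk for name, (mem, disk, _size) in rows.items()}
```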

There does not appear to be a performance benefit to serializing in memory for the ms-marco and open-ai datasets. However, for the much larger cohere dataset, we get a modest gain: per the table, the write index step is about 67 seconds faster, while uploading to S3 is about 24 seconds slower, for a net improvement of roughly 43 seconds.

More investigation is needed to improve the S3 upload times when serializing in memory. One reason may be that the multipart upload settings for boto3.upload_fileobj (used for uploading from memory to S3) need to be tuned differently than those for boto3.upload_file (used for uploading from disk to S3).

@rchitale7 rchitale7 changed the title Introduce index storage mode option to support serializing graph in memory Introduce index serialization mode option to support serializing graph in memory Oct 27, 2025
jed326
jed326 previously approved these changes Oct 27, 2025
…h in memory

Signed-off-by: Rohan Chitale <rchital@amazon.com>
@rchitale7
Member Author

I realized we can further optimize the memory usage by freeing the vectors from memory after the GPU->CPU conversion. Here is the updated memory graph:

(updated peak memory usage graph)

I've updated the code to free the vectors after conversion, but before serialization. This gives us a memory impact similar to writing to disk. Here is the updated table comparing the performance and memory impact of serializing in memory vs. on disk:

| Dataset | Serialization Method | Dataset Size (MB) | Write Index Time (s) | S3 Upload Time (s) | Peak CPU Memory (MB) |
|---|---|---|---|---|---|
| ms-marco-384 | Memory | 1430.51 | 0.88 | 6.95 | 4017.32 |
| ms-marco-384 | Disk | 1430.51 | 0.82 | 6.02 | 4087.42 |
| cohere-10M-768-IP | Memory | 28610.32 | 16.58 | 132.67 | 64769.87 |
| cohere-10M-768-IP | Disk | 28610.32 | 83.90 | 109.14 | 64057.26 |
| open-ai-1536 | Memory | 5722.05 | 3.29 | 25.78 | 12833.53 |
| open-ai-1536 | Disk | 5722.05 | 3.06 | 21.59 | 12878.24 |
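The ordering change described above can be sketched as follows. The callables here are stand-ins; the real code converts with the faiss GPU-to-CPU path and then drops its reference to the raw vector array before serializing:

```python
def convert_and_serialize(load_vectors, gpu_to_cpu, serialize):
    """Release the raw vector copy as soon as the GPU->CPU conversion is
    done, so the serialized byte stream is the only extra copy in memory."""
    vectors = load_vectors()
    cpu_index = gpu_to_cpu(vectors)  # conversion still needs the vectors
    del vectors                      # drop this reference before serializing
    return serialize(cpu_index)

# Stand-in objects showing the flow (not real faiss calls):
result = convert_and_serialize(
    load_vectors=lambda: [0.1, 0.2, 0.3],
    gpu_to_cpu=lambda v: {"ntotal": len(v)},
    serialize=lambda idx: b"x" * idx["ntotal"],
)
```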

I've also updated the comment on the GH issue with this new graph: #97 (comment)



Successfully merging this pull request may close these issues.

[FEATURE] Serialize Faiss Index in memory instead of writing to disk

3 participants