Introduce index serialization mode option to support serializing graph in memory #124
base: main
Conversation
Force-pushed from a4a060e to 72035bd
@rchitale7 please add benchmarking results for this change.
remote_vector_index_builder/core/common/models/index_build_parameters.py (review thread outdated, resolved)
Diff context in index_build_parameters.py:

    ensuring strict parameter validation.
    """

    index_storage_mode: IndexStorageMode = IndexStorageMode.DISK
Why are we initializing the mode here?
I wanted to keep the default mode as 'DISK' to ensure backwards compatibility. It also follows the pattern of initializing the other fields, such as repository_type and data_type.
We could potentially pass it through an environment variable of the API Docker image instead; I will explore that.
Changed the code to use the environment variable to initialize this parameter.
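A minimal sketch of the pattern discussed in this thread, using a plain dataclass and a hypothetical `INDEX_STORAGE_MODE` variable name rather than the repository's actual pydantic model and configuration: the default is read from the environment at construction time and falls back to DISK for backwards compatibility.

```python
import os
from dataclasses import dataclass, field
from enum import Enum


class IndexStorageMode(str, Enum):
    DISK = "disk"
    MEMORY = "memory"


def _storage_mode_from_env() -> IndexStorageMode:
    # "INDEX_STORAGE_MODE" is an illustrative variable name, not necessarily
    # the one the API Docker image uses. Defaulting to "disk" keeps the
    # behavior of existing consumers unchanged.
    return IndexStorageMode(os.environ.get("INDEX_STORAGE_MODE", "disk").lower())


@dataclass
class IndexBuildParameters:
    # default_factory re-reads the environment for each instance, so the
    # container-level setting is picked up without consumers passing it in.
    index_storage_mode: IndexStorageMode = field(default_factory=_storage_mode_from_env)
```

Using `default_factory` (rather than a module-level constant) means the environment is consulted per instance, which keeps tests that mutate the environment straightforward.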
        
          
...ector_index_builder/core/common/models/index_builder/faiss/faiss_index_hnsw_cagra_builder.py (review thread outdated, resolved)
Force-pushed from 079fd54 to 5e4de8a
Serializing the faiss index in memory instead of writing to disk impacts performance in two places in the overall remote build flow: the 'write index' step and the 'upload to s3' step. All other components of the remote build flow should remain the same. Serializing in memory also affects CPU memory usage, since we must now maintain an additional copy of the serialized graph in memory instead of keeping it on disk. GPU memory usage should be unaffected.

To benchmark this change, I used the ms-marco-384, cohere-10M-768-IP, and open-ai-1536 datasets that can be found in benchmarks.yml. I compared the performance of serializing the index in memory vs. on disk for the 'write index' and 'upload to s3' steps, and I compared the peak CPU memory usage of the two modes. Finally, I validated that serializing in memory does not impact recall, and that the performance of all other aspects of the remote index build flow remained the same. I've added the benchmarking scripts I used to this PR, and this README explains how I (and anyone else) can use them. Here is a table comparing the performance and memory impacts of serializing in memory vs. on disk:

As expected, the peak CPU memory is roughly [dataset size] MB larger when the index is serialized in memory vs. stored on disk, because the additional copy of the faiss graph contains all of the vectors. There does not appear to be a performance benefit to serializing in memory for the ms-marco and open-ai datasets. However, for the much larger cohere dataset, we get a small boost: the 'write index' step is 57 seconds faster while the 'upload to s3' step is 23 seconds slower, a net improvement of roughly 30 seconds. More investigation is needed to improve the s3 upload times when serializing in memory; one reason may be that we need to tune the multipart upload settings for
    
Commit "…h in memory" (Signed-off-by: Rohan Chitale <rchital@amazon.com>)
Force-pushed from 5e4de8a to 6c33cf4
I realized we can further optimize the memory usage by freeing the vectors from memory after the GPU->CPU conversion. Here is the updated memory graph:

I've updated the code to free the vectors after conversion but before serialization. This gives us a memory impact similar to writing to disk. Here is the updated table comparing the performance and memory impacts of serializing in memory vs. on disk:

I've also updated the comment on the GH issue with the new graph: #97 (comment)
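The ordering described above (free the source vectors once the CPU index owns its own copy, then serialize) can be sketched with a toy stand-in. ToyCpuIndex and convert_and_serialize are illustrative names, not the repository's actual code, and the sketch assumes the function holds the last reference to the converted vectors.

```python
import numpy as np


class ToyCpuIndex:
    """Stand-in for the CPU faiss index: it keeps its own copy of the vectors."""

    def __init__(self, vectors: np.ndarray):
        self._data = np.array(vectors, copy=True)

    def serialize(self) -> bytes:
        return self._data.tobytes()


def convert_and_serialize(vectors: np.ndarray) -> bytes:
    # Conversion result: the index now owns its own copy of the data.
    index = ToyCpuIndex(vectors)
    # Drop the source copy *before* serializing, so peak memory holds only
    # the index plus the serialized buffer, not three copies at once.
    del vectors
    return index.serialize()
```

The key point is only the ordering: the `del` before `serialize()` is what keeps peak memory close to the write-to-disk baseline.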
    

Description
This change introduces the option of serializing the faiss graph in memory instead of writing it to disk. For background on why this can be helpful, see the issue description here: #97. For the reasoning behind the solution I took, see: #97 (comment). TL;DR: I did not end up using faiss.serialize_index, due to its inefficient memory usage; instead, I used faiss.PyCallbackIOWriter to stream directly to a bytes buffer.

Note that writing the graph to disk is still the default option. The consumer of this library needs to set index_storage_mode via IndexBuildParameters to IndexStorageMode.MEMORY to serialize in memory.

Testing
I tested this change manually and verified that it works. More benchmarking will be done as a follow-up.
Issues Resolved
Resolves #97.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.