-
Notifications
You must be signed in to change notification settings - Fork 754
Add pickle support to python bindings Index class
#251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pickle support to python bindings Index class
#251
Conversation
…alization logic of HierarchicalNSW class via std::i/ostream.
…Index objects; add new properties to Index class: space_name, dim, max_elements, element_count, ef_construction, M, num_threads, ef. Properties num_threads and ef are read-write-able, other parameters are read-only.
…jects; use brute-force knn search to verify knn_query gives recall of (almost) 100%
use 200 test items
|
@dbespalov |
|
@yurymalkov actually, could you hold off with the review? I think the current implementation is pretty inefficient. I'm going to try to take another stab at serialization logic using pybind11/eigen , so it does not create unnecessary copies of index during pickling I will post an update here with my progress in the next couple of days. |
|
@dbespalov Sure. There is no rush. |
…de-/serialization logic of HierarchicalNSW class via std::i/ostream." This reverts commit cb7b398.
…e of Index object) when calling Python <-> C++
|
@yurymalkov I updated the PR and it should now be ready for the review! Copies of I also did a quick test to benchmark Index serialization runtime and memory usage against a file-based pickling logic. The following snippet creates 25 copies of Index object with 100k points in 512 dim. import hnswlib
data = np.float32(np.random.random((100_000, 512))) # create 100k points with dim=512
ann = hnswlib.Index(space='cosine', dim=512) # create ann index with cosine distance
ann.init_index(max_elements=100_000, ef_construction=10, M=8)
t0 = time.time();
ann.add_items(data); # add points to index
t1=time.time()
print(f"Building index took {(t1-t0)}s")
### create 25 copies of ann object using four strategies: file, memory, param, clone
for ii in range(25):
if sys.argv[-1] == 'file': ### use file-based pickle to create copy ann
copyreg.pickle(hnswlib.Index, pickle_ann) ### use temp files for ann serialization
bb=pickle.dumps(ann)
ann=pickle.loads(bb)
if sys.argv[-1] == 'memory': ### copy ann with round-trip pickle
bb=pickle.dumps(ann)
ann=pickle.loads(bb)
if sys.argv[-1] == 'param': ### create new index using ann's state tuple
ann=hnswlib.Index(ann.__getstate__())
if sys.argv[-1] == 'clone': ### use copy constructor to copy ann object
ann= hnswlib.Index(ann)
t2=time.time()
print(f"Making 25 copies of index took {(t2-t1)}s")I got the following timing results when running the snippet locally: > python3 tests/test_ann.py file
Building index took 1.471487045288086s
pickling ann using temp file /var/folders/_4/l7lvz56s609867vwt5r5bkj98kcygg/T/tmpw3ydj4uu.ann
unpickling ann using temp file /var/folders/_4/l7lvz56s609867vwt5r5bkj98kcygg/T/tmpwu8xnto7.ann
...
Making 25 copies of index took 55.927014112472534s
> python3 tests/test_ann.py memory
Building index took 1.5638556480407715s
Making 25 copies of index took 20.929177284240723s
> python3 tests/test_ann.py param
Building index took 1.4085211753845215s
Making 25 copies of index took 8.599105834960938s
> python3 tests/test_ann.py clone
Building index took 1.4074420928955078s
Making 25 copies of index took 8.865073919296265sThe in-memory pickling (implemented in this PR) is more than 2x faster and the memory usage was comparable to the file-based method (also no leaks). |
…python types by value everywhere
|
Thank you! I'll review it in the few following days. |
yurymalkov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank so much for the PR! This is a very helpful feature!
I think it is ok to merge, but few things can be improved.
|
|
||
| void setAnnData(const py::tuple t) { | ||
|
|
||
| std::unique_lock <std::mutex> templock(appr_alg->global); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note here is that the global lock does not actually give guaranties about safe-thread operations. It do prevent from adding new elements, but might fail if the element insertion started before the serialization.
I think it is ok, but probably should be mentioned in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good callout, thank you! I added warning comments re: serialization thread-safety with addItems method
README.md
Outdated
| # Query dataset, k - number of closest elements (returns 2 numpy arrays) | ||
| labels, distances = p.knn_query(data, k = 1) | ||
|
|
||
| # Index objects support pickling: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add note here that it is not fully thread safe. Requires the insertions to be finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added warning re: serialization is not thread safe with add_items.
…lock for serialization of random generators
python_bindings/bindings.cpp
Outdated
|
|
||
| py::tuple getIndexParams() const { | ||
| /* TODO: serialize state of random generators appr_alg->level_generator_ and appr_alg->update_probability_generator_ */ | ||
| /* for full reproducibility / to avoid re-initializing generators inside Index<float>::createFromParams */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added TODO block to serialize state of random generators. Currently, the random generators are re-initialized on index deserialization. This could make index operations not (fully) reproducible.
|
@yurymalkov thank you for your review! I think all your comments have been addressed. I clicked "resolved conversation" for the items that have been fully addressed and no longer need attention. The only item remaining is synchronization of serialization with element insertion. I've added comments to warn about thread-safety of these two operation. If you think there is a better/easier way to synchronize serialization, so that it is thread-safe with element insertion, I can take a stab at that. Frankly, I did not look into synchronization logic for too long -- I simply repurposed some code from One more item that is worth pointing out: I added TODO block in TLDR: Please let me know if there is a simple fix to properly synchronize serialization and element insertion. Otherwise, I think the PR looks good to be merged. |
|
@dbespalov Thank you so much for the update!! One other thing I forgot to comment is maybe replacing a tuple in https://github.com/nmslib/hnswlib/pull/251/files#diff-e7a40eaa3578ae348570a4917aaf8b203f34246d035b611aa0ed07252bb24ba4R345 with a dict (that would make adding/deleting fields easier) and adding a version number for rising exceptions in case of data pickled by a newer/older version of the pickling format (e.g. what happens if we add serialized seed?). I think changing tuple to dict can be done in the future (it is anyway likely to break the backward compatibility), but adding a version number from the start seems to be a good move. I definitely regret I didn't add it to the save format. |
|
@yurymalkov I will update the PR to add serialization version to the parameter tuple. Very good call on this one! Thank you!! Please see my other comments below
TIL: read-write lock is not part of C++11! I did not know that. Thank you!! Agreed, that it is out of scope for this PR.
The parameter tuple should definitely be a python dict! Unfortunately, I did not get to it for this PR.
All good! Thank you for catching this!! |
…tion version; serialization version is returned as the first element of the parameter tuple; serialization version must match when instantiating Index object from parameter tuple
|
@yurymalkov I added |
|
Great! Thank you so much for the contribution! |
|
@yurymalkov Thank you so much for your help with improving this PR!! Do you know what the timeframe would be for the change to be merged to master and it ends up in pypi? |
|
Hi @dbespalov. Thanks again for the PR! |
|
@yurymalkov Sure. Let me see what I can do. |
|
@yurymalkov makes sense. In a retrospect, I should've included this change in my PR. I can probably find some time later this week to make a quick PR for this. |
|
@dbespalov Cool! Thank you! |
|
@dbespalov Looks like with the latest changes the first python example doesn't work in the readme. There is an issue in this line of code The correct property of Index class is |
|
@dyashuni oops, I missed this. I meant to rename space_name to space on the python side (to be consistent with constructor's kwargs). It should be a quick fix: Line 658 that now reads This way, |
|
@dbespalov Thank you, I have created a pull request with the suggested quick fix. |

Changes
Index class
createFromIndexandcreateFromParamsfor Index construction from another Index object or Index parameter tuple, respectivelygetIndexParamsserializes Index object. Returns a tuple with Index parameters.getAnnDatareturns a tuple with hnsw-specific parameters. Copy ofappr_alg->data_level0_memory_andappr_alg->linkLists_are returned aspy::array_t<char>objects to avoid additional copying from python sidePython bindings for Index class
py::pickledefinition from pybind11Index.__init__to static factory methodscreateFromIndexandcreateFromParamsspace_name, dim, max_elements, element_count, ef_construction, M, num_threads, efef, num_threadsUpdated API documentation and the first python example in README.md
New test in python_bindings/tests/bindings_test_pickle.py.
Indexobjects are copied using round-trip pickling.k=25using three spaces. Brute-force search is used to return ground-truth labels for the randomly generated items.