Add pickle support to python bindings `Index` class #251

dbespalov · 2020-10-12T18:09:31Z

Changes

Index class
- Static factory methods createFromIndex and createFromParams for Index construction from another Index object or Index parameter tuple, respectively
- Method getIndexParams serializes Index object. Returns a tuple with Index parameters.
- Method getAnnData returns a tuple with hnsw-specific parameters. Copy of appr_alg->data_level0_memory_ and appr_alg->linkLists_ are returned as py::array_t<char> objects to avoid additional copying from python side
Python bindings for Index class
- Index serialization is implemented with py::pickle definition from pybind11
- Bind Index.__init__ to static factory methods createFromIndex and createFromParams
- Expose parameters of the hnsw index as read-only properties in python:
  - space_name, dim, max_elements, element_count, ef_construction, M, num_threads, ef
- And, two properties that support read and write:
  - ef, num_threads
Updated API documentation and the first python example in README.md
New test in python_bindings/tests/bindings_test_pickle.py.
- Verifies that results of knn_query match results from copies of the same index. Index objects are copied using round-trip pickling.
- Verify that knn_query gives recall of (almost) 100% for k=25 using three spaces. Brute-force search is used to return ground-truth labels for the randomly generated items.
- Create separate test for each space
- Sample test output:

> python3 -m unittest  tests/bindings_test_pickle.py -k Inner
Running pickle tests for <hnswlib.Index(space='ip', dim=48)>
Warning: 1 labels are missing from ann results (k=25, err_thresh=5)
Warning: 1 labels are missing from ann results (k=25, err_thresh=5)
 ... 
Warning: 16 ann distance values are different from brute-force values (total # of values=5000, dists_thresh=50)

.
----------------------------------------------------------------------
Ran 1 test in 19.057s

OK

…alization logic of HierarchicalNSW class via std::i/ostream.

…Index objects; add new properties to Index class: space_name, dim, max_elements, element_count, ef_construction, M, num_threads, ef. Properties num_threads and ef are read-write-able, other parameters are read-only.

…jects; use brute-force knn search to verify knn_query gives recall of (almost) 100%

use 200 test items

yurymalkov · 2020-10-13T22:13:46Z

@dbespalov
Great! Thanks! I'll check it this week.

dbespalov · 2020-10-19T19:46:48Z

@yurymalkov actually, could you hold off with the review?

I think the current implementation is pretty inefficient. __getstate__ makes two copies during serialization (with a call to .str(), and when casting with py::bytes). And __setstate__ makes at least one copy (when initing content of istream)

I'm going to try to take another stab at serialization logic using pybind11/eigen , so it does not create unnecessary copies of index during pickling

I will post an update here with my progress in the next couple of days.

yurymalkov · 2020-10-19T20:35:07Z

@dbespalov Sure. There is no rush.
Thank you so much for the work!

…de-/serialization logic of HierarchicalNSW class via std::i/ostream." This reverts commit cb7b398.

…e of Index object) when calling Python <-> C++

dbespalov · 2020-10-26T03:05:22Z

@yurymalkov I updated the PR and it should now be ready for the review!

Copies of appr_alg->linkLists_ and appr_alg->data_level0_memory_ are serialized as array_t<char> objects, that look like numpy arrays to python (and avoids additional copying).

I also did a quick test to benchmark Index serialization runtime and memory usage against a file-based pickling logic. The following snippet creates 25 copies of Index object with 100k points in 512 dim.

import hnswlib

data = np.float32(np.random.random((100_000, 512))) # create 100k points with dim=512

ann = hnswlib.Index(space='cosine', dim=512) # create ann index with cosine distance
ann.init_index(max_elements=100_000, ef_construction=10, M=8) 

t0 = time.time();  
ann.add_items(data); # add points to index
t1=time.time() 

print(f"Building index took {(t1-t0)}s")

### create 25 copies of ann object using four strategies: file, memory, param, clone
for ii in range(25):
    if sys.argv[-1] == 'file':   ### use file-based pickle to create copy ann 
        copyreg.pickle(hnswlib.Index, pickle_ann) ### use temp files for ann serialization
        bb=pickle.dumps(ann)
        ann=pickle.loads(bb)
    if sys.argv[-1] == 'memory': ### copy ann with round-trip pickle
        bb=pickle.dumps(ann)
        ann=pickle.loads(bb)
   if sys.argv[-1] == 'param':  ### create new index using ann's state tuple
        ann=hnswlib.Index(ann.__getstate__())
   if sys.argv[-1] == 'clone':  ### use copy constructor to copy ann object
        ann= hnswlib.Index(ann)
        
t2=time.time()
print(f"Making 25 copies of index took {(t2-t1)}s")

I got the following timing results when running the snippet locally:

> python3 tests/test_ann.py file
Building index took 1.471487045288086s
pickling ann using temp file /var/folders/_4/l7lvz56s609867vwt5r5bkj98kcygg/T/tmpw3ydj4uu.ann
unpickling ann using temp file /var/folders/_4/l7lvz56s609867vwt5r5bkj98kcygg/T/tmpwu8xnto7.ann
 ...
Making 25 copies of index took 55.927014112472534s

> python3 tests/test_ann.py memory
Building index took 1.5638556480407715s
Making 25 copies of index took 20.929177284240723s

> python3 tests/test_ann.py param
Building index took 1.4085211753845215s
Making 25 copies of index took 8.599105834960938s

> python3 tests/test_ann.py clone
Building index took 1.4074420928955078s
Making 25 copies of index took 8.865073919296265s

The in-memory pickling (implemented in this PR) is more than 2x faster and the memory usage was comparable to the file-based method (also no leaks).

…python types by value everywhere

yurymalkov · 2020-10-27T05:34:53Z

Thank you! I'll review it in the few following days.

yurymalkov

Thank so much for the PR! This is a very helpful feature!
I think it is ok to merge, but few things can be improved.

python_bindings/bindings.cpp

python_bindings/tests/bindings_test_pickle.py

yurymalkov · 2020-11-01T05:07:27Z

python_bindings/bindings.cpp

+
+    void setAnnData(const py::tuple t) {
+
+      std::unique_lock <std::mutex> templock(appr_alg->global);


One thing to note here is that the global lock does not actually give guaranties about safe-thread operations. It do prevent from adding new elements, but might fail if the element insertion started before the serialization.
I think it is ok, but probably should be mentioned in the documentation.

good callout, thank you! I added warning comments re: serialization thread-safety with addItems method

python_bindings/tests/bindings_test_pickle.py

yurymalkov · 2020-11-01T21:29:39Z

README.md

 # Query dataset, k - number of closest elements (returns 2 numpy arrays)
 labels, distances = p.knn_query(data, k = 1)
+
+# Index objects support pickling:


Maybe add note here that it is not fully thread safe. Requires the insertions to be finished.

I added warning re: serialization is not thread safe with add_items.

python_bindings/bindings.cpp

README.md

…lock for serialization of random generators

dbespalov · 2020-11-03T06:25:10Z

python_bindings/bindings.cpp

+
+    py::tuple getIndexParams() const {
+        /*  TODO: serialize state of random generators appr_alg->level_generator_ and appr_alg->update_probability_generator_  */
+        /*        for full reproducibility / to avoid re-initializing generators inside Index<float>::createFromParams         */


Added TODO block to serialize state of random generators. Currently, the random generators are re-initialized on index deserialization. This could make index operations not (fully) reproducible.

dbespalov · 2020-11-03T18:22:42Z

@yurymalkov thank you for your review! I think all your comments have been addressed. I clicked "resolved conversation" for the items that have been fully addressed and no longer need attention.

The only item remaining is synchronization of serialization with element insertion. I've added comments to warn about thread-safety of these two operation. If you think there is a better/easier way to synchronize serialization, so that it is thread-safe with element insertion, I can take a stab at that. Frankly, I did not look into synchronization logic for too long -- I simply repurposed some code from hnswalg.h that was using global lock.

One more item that is worth pointing out: I added TODO block in python_bindings/bindings.cpp about serialization of states for the two random generators used in hnswalg.h. Current implementation re-initializes random generators using random_seed value that is passed to init_index. I dont think this affects operations on the ann index functionally, but it can potentially cause issues when reproducibility of the ann data-structure is required.

TLDR: Please let me know if there is a simple fix to properly synchronize serialization and element insertion. Otherwise, I think the PR looks good to be merged.

yurymalkov · 2020-11-05T07:18:53Z

@dbespalov Thank you so much for the update!!
I think a good way to synchronize updates/saves/pickles is by emulating a light read-write lock in C++ code (I think read-write lock is not a part of C++11, but not sure if C++11 has advantages over C++14 nowadays) - e.g. add a flag to lock new insertions and add a counter of the number of running insertions. Though probably there are other solutions.
Anyway I think this is totally out of scope of this PR (though is important).

One other thing I forgot to comment is maybe replacing a tuple in https://github.com/nmslib/hnswlib/pull/251/files#diff-e7a40eaa3578ae348570a4917aaf8b203f34246d035b611aa0ed07252bb24ba4R345 with a dict (that would make adding/deleting fields easier) and adding a version number for rising exceptions in case of data pickled by a newer/older version of the pickling format (e.g. what happens if we add serialized seed?).

I think changing tuple to dict can be done in the future (it is anyway likely to break the backward compatibility), but adding a version number from the start seems to be a good move. I definitely regret I didn't add it to the save format.
I can do it myself though.

dbespalov · 2020-11-06T02:22:33Z

@yurymalkov I will update the PR to add serialization version to the parameter tuple. Very good call on this one! Thank you!!

Please see my other comments below

@dbespalov Thank you so much for the update!!
I think a good way to synchronize updates/saves/pickles is by emulating a light read-write lock in C++ code (I think read-write lock is not a part of C++11, but not sure if C++11 has advantages over C++14 nowadays) - e.g. add a flag to lock new insertions and add a counter of the number of running insertions. Though probably there are other solutions.
Anyway I think this is totally out of scope of this PR (though is important).

TIL: read-write lock is not part of C++11! I did not know that. Thank you!!

Agreed, that it is out of scope for this PR.

One other thing I forgot to comment is maybe replacing a tuple in https://github.com/nmslib/hnswlib/pull/251/files#diff-e7a40eaa3578ae348570a4917aaf8b203f34246d035b611aa0ed07252bb24ba4R345 with a dict (that would make adding/deleting fields easier) and adding a version number for rising exceptions in case of data pickled by a newer/older version of the pickling format (e.g. what happens if we add serialized seed?).

The parameter tuple should definitely be a python dict! Unfortunately, I did not get to it for this PR.
Serialization version must be included with the parameter tuple. I will update the PR.

I think changing tuple to dict can be done in the future (it is anyway likely to break the backward compatibility), but adding a version number from the start seems to be a good move. I definitely regret I didn't add it to the save format.
I can do it myself though.

All good! Thank you for catching this!!

…tion version; serialization version is returned as the first element of the parameter tuple; serialization version must match when instantiating Index object from parameter tuple

dbespalov · 2020-11-06T04:22:51Z

@yurymalkov I added Index::ser_version, a static const data member for the serialization version. ser_version is the first element of parameter tuple, which has three elements in total now. The tests are passing as well.

Let me know what you think!

yurymalkov · 2020-11-07T04:33:12Z

Great! Thank you so much for the contribution!
I'll merge it when we fix CI.

dbespalov · 2020-11-13T21:58:50Z

@yurymalkov Thank you so much for your help with improving this PR!!

Do you know what the timeframe would be for the change to be merged to master and it ends up in pypi?

yurymalkov · 2020-11-15T03:00:26Z

Hi @dbespalov. Thanks again for the PR!
I am going to consider switch from tuples to dict before merging to master to avoid future compatibility issues. Maybe @dyashuni can help?

dyashuni · 2020-11-15T18:16:36Z

@yurymalkov Sure. Let me see what I can do.

dbespalov · 2020-11-16T21:01:48Z

@yurymalkov makes sense. In a retrospect, I should've included this change in my PR. I can probably find some time later this week to make a quick PR for this.

yurymalkov · 2020-11-17T05:20:42Z

@dbespalov Cool! Thank you!

dyashuni · 2020-11-17T08:34:54Z

@dbespalov Looks like with the latest changes the first python example doesn't work in the readme. There is an issue in this line of code

    print(f"Parameters passed to constructor:  space={p_copy.space}, dim={p_copy.dim}")
AttributeError: 'hnswlib.Index' object has no attribute 'space'

The correct property of Index class is space_name. However space_name is the same as the constructor parameter space but called differently. I think we should align names of variables. Do you have any ideas how can we fix the issue? I can help with fixing.

dbespalov · 2020-11-17T19:12:15Z

@dyashuni oops, I missed this. I meant to rename space_name to space on the python side (to be consistent with constructor's kwargs). It should be a quick fix:

Line 658 that now reads
.def_readonly("space_name", &Index<float>::space_name)
should be
.def_readonly("space", &Index<float>::space_name)

This way, space_name will be completely hidden in python. I only updated readme, but forgot to make the change in the bindings code.

dyashuni · 2020-11-18T11:35:59Z

@dbespalov Thank you, I have created a pull request with the suggested quick fix.

dbespalov added 9 commits October 12, 2020 01:27

New methods loadIndexFromStream and saveIndexToStream expose de-/seri…

cb7b398

…alization logic of HierarchicalNSW class via std::i/ostream.

Verify knn_query results match before/after pickling hnswlib.Index ob…

e0eacad

…jects; use brute-force knn search to verify knn_query gives recall of (almost) 100%

add documeentation

ec4f4b1

clean-up readme

a3646cc

clean-up readme

a1ba4e5

clean-up readme

cf3846c

clean-up readme

27471cd

Update bindings_test_pickle.py

4220956

use 200 test items

Bespalov, Dmitriy (CORP) and others added 7 commits October 23, 2020 02:31

Revert "New methods loadIndexFromStream and saveIndexToStream expose …

72b6501

…de-/serialization logic of HierarchicalNSW class via std::i/ostream." This reverts commit cb7b398.

use python's buffer protocol to avoid making copies of ann data (stat…

3a62b41

…e of Index object) when calling Python <-> C++

replace tab characters with spaces

fe6d2fa

test each space (ip/cosine/l2) as a separate unittest

c9fb60d

return array_t pointers

3c4510d

expose static method of Index class as copy constructor in python

64c5154

do not waste space when returning serialized appr_alg->linkLists_

7b445c8

serialize element_lookup_ and element_level_ as array_t arrays; pass …

c02f1dc

…python types by value everywhere

yurymalkov approved these changes Nov 1, 2020

View reviewed changes

dbespalov added 3 commits November 3, 2020 05:47

warn that serialization is not thread safe with add_items

1f25102

warn that serialization is not thread safe with add_items; add todo b…

1165370

…lock for serialization of random generators

remove camel casing

2c040e6

dbespalov commented Nov 3, 2020

View reviewed changes

add static const int data member to class Index that stores serializa…

6298996

…tion version; serialization version is returned as the first element of the parameter tuple; serialization version must match when instantiating Index object from parameter tuple

dbespalov added 2 commits November 6, 2020 04:32

add todo block to convert parameter tuple to dicts

c8276d8

add todo block to convert parameter tuple to dicts

345f71d

yurymalkov merged commit cee0e99 into nmslib:develop Nov 9, 2020


		void setAnnData(const py::tuple t) {

		std::unique_lock <std::mutex> templock(appr_alg->global);

Add pickle support to python bindings Index class #251

Add pickle support to python bindings Index class #251

Uh oh!

Conversation

dbespalov commented Oct 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Uh oh!

yurymalkov commented Oct 13, 2020

Uh oh!

dbespalov commented Oct 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yurymalkov commented Oct 19, 2020

Uh oh!

dbespalov commented Oct 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yurymalkov commented Oct 27, 2020

Uh oh!

yurymalkov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yurymalkov Nov 1, 2020

Choose a reason for hiding this comment

Uh oh!

dbespalov Nov 3, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yurymalkov Nov 1, 2020

Choose a reason for hiding this comment

Uh oh!

dbespalov Nov 3, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dbespalov Nov 3, 2020

Choose a reason for hiding this comment

Uh oh!

dbespalov commented Nov 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yurymalkov commented Nov 5, 2020

Uh oh!

dbespalov commented Nov 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dbespalov commented Nov 6, 2020

Uh oh!

yurymalkov commented Nov 7, 2020

Uh oh!

dbespalov commented Nov 13, 2020

Uh oh!

yurymalkov commented Nov 15, 2020

Uh oh!

dyashuni commented Nov 15, 2020

Uh oh!

dbespalov commented Nov 16, 2020

Uh oh!

yurymalkov commented Nov 17, 2020

Uh oh!

dyashuni commented Nov 17, 2020

Uh oh!

dbespalov commented Nov 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dyashuni commented Nov 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

Add pickle support to python bindings `Index` class #251

Add pickle support to python bindings `Index` class #251

dbespalov commented Oct 12, 2020 •

edited

Loading

dbespalov commented Oct 19, 2020 •

edited

Loading

dbespalov commented Oct 26, 2020 •

edited

Loading

dbespalov commented Nov 3, 2020 •

edited

Loading

dbespalov commented Nov 6, 2020 •

edited

Loading

dbespalov commented Nov 17, 2020 •

edited

Loading