Skip to content

Conversation

@dbespalov
Copy link
Contributor

@dbespalov dbespalov commented Oct 12, 2020

Changes

  • Index class

    • Static factory methods createFromIndex and createFromParams for Index construction from another Index object or Index parameter tuple, respectively
    • Method getIndexParams serializes Index object. Returns a tuple with Index parameters.
    • Method getAnnData returns a tuple with hnsw-specific parameters. Copy of appr_alg->data_level0_memory_ and appr_alg->linkLists_ are returned as py::array_t<char> objects to avoid additional copying from python side
  • Python bindings for Index class

    • Index serialization is implemented with py::pickle definition from pybind11
    • Bind Index.__init__ to static factory methods createFromIndex and createFromParams
    • Expose parameters of the hnsw index as read-only properties in python:
      • space_name, dim, max_elements, element_count, ef_construction, M, num_threads, ef
    • And, two properties that support read and write:
      • ef, num_threads
  • Updated API documentation and the first python example in README.md

  • New test in python_bindings/tests/bindings_test_pickle.py.

    • Verifies that results of knn_query match results from copies of the same index. Index objects are copied using round-trip pickling.
    • Verify that knn_query gives recall of (almost) 100% for k=25 using three spaces. Brute-force search is used to return ground-truth labels for the randomly generated items.
    • Create separate test for each space
    • Sample test output:
> python3 -m unittest  tests/bindings_test_pickle.py -k Inner
Running pickle tests for <hnswlib.Index(space='ip', dim=48)>
Warning: 1 labels are missing from ann results (k=25, err_thresh=5)
Warning: 1 labels are missing from ann results (k=25, err_thresh=5)
 ... 
Warning: 16 ann distance values are different from brute-force values (total # of values=5000, dists_thresh=50)

.
----------------------------------------------------------------------
Ran 1 test in 19.057s

OK

…alization logic of HierarchicalNSW class via std::i/ostream.
…Index objects; add new properties to Index class: space_name, dim, max_elements, element_count, ef_construction, M, num_threads, ef. Properties num_threads and ef are read-write-able, other parameters are read-only.
…jects; use brute-force knn search to verify knn_query gives recall of (almost) 100%
@yurymalkov
Copy link
Member

@dbespalov
Great! Thanks! I'll check it this week.

@dbespalov
Copy link
Contributor Author

dbespalov commented Oct 19, 2020

@yurymalkov actually, could you hold off with the review?

I think the current implementation is pretty inefficient. __getstate__ makes two copies during serialization (with a call to .str(), and when casting with py::bytes). And __setstate__ makes at least one copy (when initing content of istream)

I'm going to try to take another stab at serialization logic using pybind11/eigen , so it does not create unnecessary copies of index during pickling

I will post an update here with my progress in the next couple of days.

@yurymalkov
Copy link
Member

@dbespalov Sure. There is no rush.
Thank you so much for the work!

@dbespalov
Copy link
Contributor Author

dbespalov commented Oct 26, 2020

@yurymalkov I updated the PR and it should now be ready for the review!

Copies of appr_alg->linkLists_ and appr_alg->data_level0_memory_ are serialized as array_t<char> objects, that look like numpy arrays to python (and avoids additional copying).

I also did a quick test to benchmark Index serialization runtime and memory usage against a file-based pickling logic. The following snippet creates 25 copies of Index object with 100k points in 512 dim.

import hnswlib

data = np.float32(np.random.random((100_000, 512))) # create 100k points with dim=512

ann = hnswlib.Index(space='cosine', dim=512) # create ann index with cosine distance
ann.init_index(max_elements=100_000, ef_construction=10, M=8) 

t0 = time.time();  
ann.add_items(data); # add points to index
t1=time.time() 

print(f"Building index took {(t1-t0)}s")

### create 25 copies of ann object using four strategies: file, memory, param, clone
for ii in range(25):
    if sys.argv[-1] == 'file':   ### use file-based pickle to create copy ann 
        copyreg.pickle(hnswlib.Index, pickle_ann) ### use temp files for ann serialization
        bb=pickle.dumps(ann)
        ann=pickle.loads(bb)
    if sys.argv[-1] == 'memory': ### copy ann with round-trip pickle
        bb=pickle.dumps(ann)
        ann=pickle.loads(bb)
   if sys.argv[-1] == 'param':  ### create new index using ann's state tuple
        ann=hnswlib.Index(ann.__getstate__())
   if sys.argv[-1] == 'clone':  ### use copy constructor to copy ann object
        ann= hnswlib.Index(ann)
        
t2=time.time()
print(f"Making 25 copies of index took {(t2-t1)}s")

I got the following timing results when running the snippet locally:

> python3 tests/test_ann.py file
Building index took 1.471487045288086s
pickling ann using temp file /var/folders/_4/l7lvz56s609867vwt5r5bkj98kcygg/T/tmpw3ydj4uu.ann
unpickling ann using temp file /var/folders/_4/l7lvz56s609867vwt5r5bkj98kcygg/T/tmpwu8xnto7.ann
 ...
Making 25 copies of index took 55.927014112472534s

> python3 tests/test_ann.py memory
Building index took 1.5638556480407715s
Making 25 copies of index took 20.929177284240723s

> python3 tests/test_ann.py param
Building index took 1.4085211753845215s
Making 25 copies of index took 8.599105834960938s

> python3 tests/test_ann.py clone
Building index took 1.4074420928955078s
Making 25 copies of index took 8.865073919296265s

The in-memory pickling (implemented in this PR) is more than 2x faster and the memory usage was comparable to the file-based method (also no leaks).

@yurymalkov
Copy link
Member

Thank you! I'll review it in the few following days.

Copy link
Member

@yurymalkov yurymalkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank so much for the PR! This is a very helpful feature!
I think it is ok to merge, but few things can be improved.


void setAnnData(const py::tuple t) {

std::unique_lock <std::mutex> templock(appr_alg->global);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note here is that the global lock does not actually give guaranties about safe-thread operations. It do prevent from adding new elements, but might fail if the element insertion started before the serialization.
I think it is ok, but probably should be mentioned in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good callout, thank you! I added warning comments re: serialization thread-safety with addItems method

README.md Outdated
# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

# Index objects support pickling:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add note here that it is not fully thread safe. Requires the insertions to be finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added warning re: serialization is not thread safe with add_items.


py::tuple getIndexParams() const {
/* TODO: serialize state of random generators appr_alg->level_generator_ and appr_alg->update_probability_generator_ */
/* for full reproducibility / to avoid re-initializing generators inside Index<float>::createFromParams */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TODO block to serialize state of random generators. Currently, the random generators are re-initialized on index deserialization. This could make index operations not (fully) reproducible.

@dbespalov
Copy link
Contributor Author

dbespalov commented Nov 3, 2020

@yurymalkov thank you for your review! I think all your comments have been addressed. I clicked "resolved conversation" for the items that have been fully addressed and no longer need attention.

The only item remaining is synchronization of serialization with element insertion. I've added comments to warn about thread-safety of these two operation. If you think there is a better/easier way to synchronize serialization, so that it is thread-safe with element insertion, I can take a stab at that. Frankly, I did not look into synchronization logic for too long -- I simply repurposed some code from hnswalg.h that was using global lock.

One more item that is worth pointing out: I added TODO block in python_bindings/bindings.cpp about serialization of states for the two random generators used in hnswalg.h. Current implementation re-initializes random generators using random_seed value that is passed to init_index. I dont think this affects operations on the ann index functionally, but it can potentially cause issues when reproducibility of the ann data-structure is required.

TLDR: Please let me know if there is a simple fix to properly synchronize serialization and element insertion. Otherwise, I think the PR looks good to be merged.

@yurymalkov
Copy link
Member

@dbespalov Thank you so much for the update!!
I think a good way to synchronize updates/saves/pickles is by emulating a light read-write lock in C++ code (I think read-write lock is not a part of C++11, but not sure if C++11 has advantages over C++14 nowadays) - e.g. add a flag to lock new insertions and add a counter of the number of running insertions. Though probably there are other solutions.
Anyway I think this is totally out of scope of this PR (though is important).

One other thing I forgot to comment is maybe replacing a tuple in https://github.com/nmslib/hnswlib/pull/251/files#diff-e7a40eaa3578ae348570a4917aaf8b203f34246d035b611aa0ed07252bb24ba4R345 with a dict (that would make adding/deleting fields easier) and adding a version number for rising exceptions in case of data pickled by a newer/older version of the pickling format (e.g. what happens if we add serialized seed?).

I think changing tuple to dict can be done in the future (it is anyway likely to break the backward compatibility), but adding a version number from the start seems to be a good move. I definitely regret I didn't add it to the save format.
I can do it myself though.

@dbespalov
Copy link
Contributor Author

dbespalov commented Nov 6, 2020

@yurymalkov I will update the PR to add serialization version to the parameter tuple. Very good call on this one! Thank you!!

Please see my other comments below

@dbespalov Thank you so much for the update!!
I think a good way to synchronize updates/saves/pickles is by emulating a light read-write lock in C++ code (I think read-write lock is not a part of C++11, but not sure if C++11 has advantages over C++14 nowadays) - e.g. add a flag to lock new insertions and add a counter of the number of running insertions. Though probably there are other solutions.
Anyway I think this is totally out of scope of this PR (though is important).

TIL: read-write lock is not part of C++11! I did not know that. Thank you!!

Agreed, that it is out of scope for this PR.

One other thing I forgot to comment is maybe replacing a tuple in https://github.com/nmslib/hnswlib/pull/251/files#diff-e7a40eaa3578ae348570a4917aaf8b203f34246d035b611aa0ed07252bb24ba4R345 with a dict (that would make adding/deleting fields easier) and adding a version number for rising exceptions in case of data pickled by a newer/older version of the pickling format (e.g. what happens if we add serialized seed?).

The parameter tuple should definitely be a python dict! Unfortunately, I did not get to it for this PR.
Serialization version must be included with the parameter tuple. I will update the PR.

I think changing tuple to dict can be done in the future (it is anyway likely to break the backward compatibility), but adding a version number from the start seems to be a good move. I definitely regret I didn't add it to the save format.
I can do it myself though.

All good! Thank you for catching this!!

…tion version; serialization version is returned as the first element of the parameter tuple; serialization version must match when instantiating Index object from parameter tuple
@dbespalov
Copy link
Contributor Author

@yurymalkov I added Index::ser_version, a static const data member for the serialization version. ser_version is the first element of parameter tuple, which has three elements in total now. The tests are passing as well.

Let me know what you think!
image

@yurymalkov
Copy link
Member

Great! Thank you so much for the contribution!
I'll merge it when we fix CI.

@yurymalkov yurymalkov merged commit cee0e99 into nmslib:develop Nov 9, 2020
@dbespalov
Copy link
Contributor Author

@yurymalkov Thank you so much for your help with improving this PR!!

Do you know what the timeframe would be for the change to be merged to master and it ends up in pypi?

@yurymalkov
Copy link
Member

Hi @dbespalov. Thanks again for the PR!
I am going to consider switch from tuples to dict before merging to master to avoid future compatibility issues. Maybe @dyashuni can help?

@dyashuni
Copy link
Contributor

@yurymalkov Sure. Let me see what I can do.

@dbespalov
Copy link
Contributor Author

@yurymalkov makes sense. In a retrospect, I should've included this change in my PR. I can probably find some time later this week to make a quick PR for this.

@yurymalkov
Copy link
Member

@dbespalov Cool! Thank you!

@dyashuni
Copy link
Contributor

@dbespalov Looks like with the latest changes the first python example doesn't work in the readme. There is an issue in this line of code

    print(f"Parameters passed to constructor:  space={p_copy.space}, dim={p_copy.dim}")
AttributeError: 'hnswlib.Index' object has no attribute 'space'

The correct property of Index class is space_name. However space_name is the same as the constructor parameter space but called differently. I think we should align names of variables. Do you have any ideas how can we fix the issue? I can help with fixing.

@dbespalov
Copy link
Contributor Author

dbespalov commented Nov 17, 2020

@dyashuni oops, I missed this. I meant to rename space_name to space on the python side (to be consistent with constructor's kwargs). It should be a quick fix:

Line 658 that now reads
.def_readonly("space_name", &Index<float>::space_name)
should be
.def_readonly("space", &Index<float>::space_name)

This way, space_name will be completely hidden in python. I only updated readme, but forgot to make the change in the bindings code.

@dyashuni
Copy link
Contributor

@dbespalov Thank you, I have created a pull request with the suggested quick fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants