Threaded add_items issue #28

sumsuddin · 2018-06-12T08:26:25Z

I was looking into the Python example. In my experience, the threaded add_items gives me different results and accuracy every time I run the script.

I think using multiple threads while adding the items is wrong here.

p.set_num_threads(4) # by default using all available cores

Moreover, when I used the cosine space, the accuracy was only around 50% for some of the generated indexes.
Index(space='cosine', dim=dim)

When I used a single thread, the results were consistent every time.
p.set_num_threads(1)

Can someone clarify the issue?


yurymalkov · 2018-06-12T12:22:19Z

Hi @sumsuddin,
Could you please provide a demo script so that we can understand what is going on?

sumsuddin · 2018-06-19T05:15:35Z

I can't share the private data that I was working on. But here is a randomly generated numpy array that I saved in a file. I attached the saved file here so that you can investigate.

# Generating sample data
#data = np.float32(np.random.random((num_elements, dim)))
#np.savetxt('data.txt', data)
data = np.loadtxt('data.txt')

For this specific set of random numbers (attached file), I get one of the following two recall values, varying randomly between runs:

Recall for two batches: 0.99990000000000001 (this happens rarely)
Recall for two batches: 1.0 (I mostly get this one)
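For scale, here is a minimal sketch of how a top-1 recall like the one above can be computed. It assumes the index is queried with its own data and k=1, and that num_elements is 10000 (an assumption; the value used for this run is not stated in the thread). Under that assumption, a recall of 0.9999 corresponds to exactly one item whose nearest neighbour was not itself. The helper name recall_at_1 is hypothetical.

```python
def recall_at_1(labels, truth):
    """Fraction of queries whose returned top-1 label equals the expected label."""
    hits = sum(1 for found, expected in zip(labels, truth) if found == expected)
    return hits / len(truth)


num_elements = 10000  # assumption: not stated in the thread for this run

# Each item is queried against the index, so the expected label is its own id.
truth = list(range(num_elements))

# Simulate a run in which a single query returned the wrong neighbour.
labels = list(truth)
labels[123] = 7

print("Recall:", recall_at_1(labels, truth))  # 9999 hits out of 10000
```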

Increasing the number of items makes the issue more obvious in my experiments.
num_elements = 100000

I guess you can find easier ways to reproduce the issue.
Thanks for your time.

Python version : Python 2.7.6
OS: Ubuntu 14.04.5 LTS

data.txt

yurymalkov · 2018-06-19T16:27:50Z

I see. Thanks!
It seems there are only two options to solve this:

- Use single-threaded construction.
- Set high ef/efConstruction values, so that the search is almost exact.
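The two options can be combined in a small helper, sketched below. It uses the hnswlib Python calls that appear in the library's example (Index, init_index, set_num_threads, add_items, set_ef). The helper name build_index and the default values chosen for ef_construction, M and ef are illustrative assumptions, not recommended settings.

```python
def build_index(data, space="l2", num_threads=1, ef_construction=400, M=16, ef=200):
    """Build an hnswlib index from a 2-D float32 array `data`.

    The defaults are illustrative assumptions, not tuned values.
    """
    import hnswlib  # third-party package; imported here so the helper loads without it

    num_elements, dim = data.shape
    index = hnswlib.Index(space=space, dim=dim)

    # Option 2: a larger ef_construction makes the construction-time search closer to exact.
    index.init_index(max_elements=num_elements, ef_construction=ef_construction, M=M)

    # Option 1: with one thread, items are inserted one at a time in a fixed order.
    index.set_num_threads(num_threads)
    index.add_items(data, list(range(num_elements)))

    # Query-time counterpart of option 2; ef should be at least the k passed to knn_query.
    index.set_ef(ef)
    return index
```

Calling build_index(data) and then index.knn_query(data, k=1) on the returned index would follow the same flow as the Python example discussed in this issue, with the thread count pinned to one.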

There is a potential fix that can stabilize the randomness to some extent: setting the element levels before the actual insertion (it would require updating the bindings). It will not solve the problem completely, though.
I think that the HNSW implementation in faiss (e.g. https://github.com/facebookresearch/faiss/blob/master/benchs/bench_hnsw.py) works that way. You can try it (although it is generally slower than hnswlib at fixed accuracy).
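One way to fix the element levels before insertion is to derive each level from the element's id instead of from a shared random generator, so the level does not depend on which thread inserts the element or in what order. The sketch below is an illustration under stated assumptions, not hnswlib's or faiss's code: the function name level_for_id is hypothetical, BLAKE2b from the Python standard library stands in for whatever hash a real implementation would use, and the level is drawn as floor(-ln(u) / ln(M)), the usual exponentially decaying HNSW level distribution.

```python
import hashlib
import math


def level_for_id(item_id, M=16):
    """Deterministic HNSW-style level for an element, derived only from its id."""
    mult = 1.0 / math.log(M)  # level multiplier, 1 / ln(M)

    # Hash the id to 64 bits (BLAKE2b is a stand-in hash chosen for this sketch).
    digest = hashlib.blake2b(str(item_id).encode("utf-8"), digest_size=8).digest()

    # Map the hash to a float in (0, 1]; excluding 0 keeps log() defined.
    u = (int.from_bytes(digest, "big") + 1) / 2.0 ** 64

    return int(-math.log(u) * mult)


# The same id always gets the same level, regardless of insertion order.
levels = [level_for_id(i) for i in range(1000)]
print("max level:", max(levels))
print("share on level 0:", levels.count(0) / len(levels))
```

As noted above, this removes only one source of randomness. With several threads, the order in which neighbours are linked can still differ between runs.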


sumsuddin changed the title ~~Threaded add_item issue~~ Threaded add_items issue Jun 12, 2018

jelmerk pushed a commit to jelmerk/hnswlib that referenced this issue May 21, 2019

use a murmur hash of the hash code of the id to come up with the max …

9d5740d

…level, this should make the index a lot more stable see : nmslib/hnswlib#28
