Support deleting elements? #4

Open
ThomasDelteil opened this issue Apr 4, 2018 · 15 comments

Comments

@ThomasDelteil
Contributor

Thanks for the great implementation, the performance is impressive. I was wondering what it would take to add the ability to delete an element from the graph without degrading performance too much? Having a special flag on the label, or re-assigning the lost edges?

@searchivarius
Member

searchivarius commented Apr 4, 2018

This works mostly fine for SW-graph:
https://github.com/nmslib/nmslib/blob/master/test_batch_app/test_batch_mod.cc

Doing something similar for HNSW is a bit more work, but not too much. My understanding is that HNSW will actually be much more robust to deletion. One disadvantage is that you can't use an optimized static index any more (but that's the cost of all dynamic data structures).

@yurymalkov
Member

Hi @ThomasDelteil,
the optimal solution strongly depends on the scenario and on how much performance you are willing to lose.

  1. Deletion by flagging the elements is easy to implement and is OK if the number of deletions is small and/or they are random. However, flagging alone will not stop the degradation of performance.
    Imagine you have a 1B dataset consisting of two well-isolated clusters (i.e. two very different classes). You mark every element from the first cluster as deleted and then query for the nearest neighbor of an element from the "deleted" cluster.
    The search turns (almost) exhaustive, as you have to go through all 0.5B deleted elements to find just a single nearest neighbor from the second (non-deleted) cluster (all elements from the deleted cluster are closer to the query than any second-cluster element).
  2. As you have mentioned, you can do full reassignments of the edges. This keeps the index fresh, no matter how many deletions there were. However, reassignments mean extra work during deletion and are somewhat trickier to implement than in sw-graph (although I see no problem with the optimized hnswlib index).

Agree with @searchivarius. HNSW should be much more robust, as it does not depend on the data order. SW-graph (NSW) can be easily crippled on low-dim data if the first inserted elements are removed.
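
For illustration, the "flagging" from point 1 boils down to filtering tombstoned nodes out of the result set while still traversing them during the graph search. A toy sketch (plain Python, not hnswlib's actual search code; all names here are made up):

```python
import heapq

def greedy_knn(graph, vectors, dist, query, entry, k, deleted):
    # Toy best-first search over a proximity graph with flag-based deletion.
    # Tombstoned nodes are still traversed (their edges are needed to reach
    # live nodes); they are only excluded from the result set, which is why
    # a large deleted cluster slows the search down without breaking it.
    visited = {entry}
    candidates = [(dist(query, vectors[entry]), entry)]  # min-heap by distance
    results = []                                         # max-heap via negated distances
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= k and d > -results[0][0]:
            break                                        # nothing closer left to explore
        if node not in deleted:
            heapq.heappush(results, (-d, node))
            if len(results) > k:
                heapq.heappop(results)                   # drop the farthest kept result
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (dist(query, vectors[nb]), nb))
    return sorted((-nd, n) for nd, n in results)
```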

@phdowling

Any chance this will be added, either here or in nmslib, some time soon?

@searchivarius
Member

@phdowling At some point, it will. SW-graph in NMSLIB has an experimental deletion feature, but it's not exposed to Python.

@yurymalkov
Member

@phdowling I do not think I will have an opportunity to add deletions to hnswlib in the next several months.
But if someone is willing to contribute, I can help.

@linzuk

linzuk commented Oct 19, 2018

Has any hero implemented the deletion feature yet? We are looking forward to it.

@popupz

popupz commented May 11, 2019

@yurymalkov could you describe the process of reassigning the edges?

Would a naive approach of simply removing the node from the connected nodes on all layers work? Those should be easy to find since the links are bi-directional.

Edit: on second thought, links are not bi-directional, because only the strongest links on each level are preserved. So would the only option then be to make a pass through all nodes in the graph and rewrite those nodes that have connections to the item being deleted?

@gasabr

gasabr commented Aug 8, 2019

I haven't gone through all the code yet, but is there a workaround for this issue, like saving and reloading the index, or is recreating the index the only way to do a "hard delete"?

@etiennedi

etiennedi commented Aug 21, 2020

@popupz

Edit: on second thought, links are not bi-directional, because only the strongest links on each level are preserved. So would the only option then be to make a pass through all nodes in the graph and rewrite those nodes that have connections to the item being deleted?

I ran into a similar issue building the deletion feature for the HNSW implementation in Weaviate (which is written in Golang). Weaviate is a general-purpose database/search engine, so we can't predict in what order or at what frequency users will delete items; the "flagging-only" approach therefore isn't feasible for us, for the reasons @yurymalkov outlined very well above. I'd like to share my findings and also present an idea of how I'm planning to tackle this issue. Feedback is much appreciated.

Naive approach

I first tried following all edges of the to-be-deleted item and simply reassigning those, i.e. what you described as the "naive approach". However, even on a very small dataset (1000 nodes, maxNeighbors=30, efConstruction=128) I noticed that less than half of the edges pointing to the deleted item would be removed, i.e. the other half had become uni-directional edges. As you outlined, this is to be expected, as only the strongest links are preserved. However, I wouldn't have expected the effect to be this noticeable on such a small HNSW graph already. Either way, this clearly shows that the naive approach isn't very helpful.
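
In rough pseudocode, the naive approach looks like this (illustrative Python, not the actual Weaviate implementation; the adjacency layout is assumed, a dict mapping node id to one neighbor list per layer):

```python
def naive_delete(adjacency, node_id):
    # Follow only the deleted node's own outgoing edges and remove the
    # corresponding back-references. Incoming edges from nodes that node_id
    # does not point back to are never found, which is why roughly half of
    # them survived as dangling uni-directional links in the test above.
    for layer, neighbors in enumerate(adjacency[node_id]):
        for nb in neighbors:
            nb_layers = adjacency[nb]
            if layer < len(nb_layers) and node_id in nb_layers[layer]:
                nb_layers[layer].remove(node_id)
    del adjacency[node_id]
```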

Brute force approach

So next I tried the brute-force approach of:

  1. Iterate over every node in the graph
  2. If level 0 of this node's outgoing edges does not include the to-be-deleted item, skip the node
  3. On the remaining nodes, remove all edges and re-assign them using the same procedure as on insert (making sure we exclude the to-be-deleted item)

This works well: all the edges are cleaned up and the quality of the index isn't harmed. However, it is terribly slow. Even on my tiny graph, deleting all the items took about 10 times as long as inserting them. I don't see this being feasible on a graph with a few million or a billion nodes.
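
In the same pseudocode style, the brute-force pass is (reassign_edges is a hypothetical stand-in for re-running the insertion-time neighbor selection, not a real hnswlib or Weaviate API):

```python
def brute_force_delete(adjacency, node_id, reassign_edges):
    # reassign_edges(adjacency, n, excluded) re-runs the normal insertion-time
    # neighbor selection for node n while excluding the deleted id.
    del adjacency[node_id]
    for n, layers in adjacency.items():                   # step 1: visit every node
        if node_id not in layers[0]:                      # step 2: cheap layer-0 check
            continue
        reassign_edges(adjacency, n, excluded={node_id})  # step 3: rebuild its edges
```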

Idea: Tombstones with periodic clean up

My idea to solve this issue is to combine the "flagging approach" with a periodical reassignment of edges. It would look roughly like this:

On Delete:

  • Attach a tombstone to the to-be-deleted item
  • Append node id to a list of nodes with tombstones (basically a temporary index to quickly find all deleted nodes)

On Search:

  • Skip each element that has a tombstone attached

Periodically (with configurable frequency):

  • If the list of tombstones is empty, don't do anything, else:
  • Iterate over every node in the graph
  • If the node has no outgoing edges to a tombstoned node, skip it, else:
  • Delete all of its edges and reassign them
  • Once the reassignment around a tombstoned node is done, remove that node's id from the list of tombstoned nodes
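
Structurally the whole scheme boils down to something like this (illustrative Python; reassign_edges is the same hypothetical helper as in the sketches above):

```python
tombstones = set()          # the "temporary index" of deleted node ids

def delete(node_id):
    # On delete: O(1), just mark the node and remember it for the next sweep.
    tombstones.add(node_id)

def is_visible(node_id):
    # On search: tombstoned nodes are still traversed so the graph stays
    # navigable, but they are filtered out of the result set.
    return node_id not in tombstones

def cleanup(adjacency, reassign_edges):
    # Periodically: one pass over the graph; each affected node is reassigned
    # at most once per cycle, no matter how many tombstones accumulated.
    if not tombstones:
        return
    for n, layers in list(adjacency.items()):
        if n in tombstones:
            continue
        if not any(nb in tombstones for lvl in layers for nb in lvl):
            continue
        reassign_edges(adjacency, n, excluded=tombstones)
    for t in list(tombstones):
        adjacency.pop(t, None)  # the tombstoned nodes themselves can go now
        tombstones.discard(t)
```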

Possible Benefits

  • There is immediate feedback to a user sending a delete request, as adding a tombstone is very cheap
  • The deleted item will immediately be excluded from future searches
  • The search performance degrades slightly (as outlined by @yurymalkov) with each additional tombstone; however, the periodic clean-up makes sure the number of tombstones stays low
  • The clean-up procedure is very robust: if the clean-up fails after covering only 1/3 of the graph (e.g. through a random server crash), those nodes will simply be skipped in the next clean-up cycle, and clean-up will continue on the next node that requires it.
  • Regardless of how many deletes there are, at the very most we'll reassign every node once per clean-up cycle. If we were to reassign on every delete, there would be no upper limit to how many reassigns are required. In other words: a node that has outgoing edges to n tombstoned nodes per cycle will be reassigned once instead of n times.
  • During a clean up, the search quality should not change: Since a tombstoned node is skipped in a search anyway, the result quality will not change regardless of whether a clean up job is currently running or not.

Any thoughts on my outlined approach? Is there any way we can avoid a brute-force cleanup which needs to read every single node in the graph?

@yurymalkov
Member

Hi @etiennedi, thanks for the detailed analysis. I agree that this might be a very good solution!
One thing is that it needs a global operation once in a while, but this should not be an actual problem.

Another option is, when a new element is being added, to pick one of the deleted (tombstoned) elements and update it to the position of the new element. The downside is that this cannot shrink the index at any time (e.g. it makes sense to remove unused elements during index saving).
Note that tombstoning of elements and element updates are implemented in hnswlib.
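
For reference, in the Python binding of recent hnswlib releases this corresponds roughly to the sketch below (mark_deleted for tombstoning; allow_replace_deleted/replace_deleted for reusing the slot of a deleted element, which only exists in newer versions):

```python
import numpy as np
import hnswlib

dim = 16
index = hnswlib.Index(space='l2', dim=dim)
# allow_replace_deleted lets add_items reuse the slots of tombstoned elements.
index.init_index(max_elements=1000, M=16, ef_construction=128,
                 allow_replace_deleted=True)

data = np.float32(np.random.random((1000, dim)))
index.add_items(data, np.arange(1000))

# Tombstone an element: knn_query skips it, but it still occupies a slot.
index.mark_deleted(42)

# Insert a new point; with replace_deleted=True it is written into the slot
# of a tombstoned element instead of growing the index.
new_point = np.float32(np.random.random((1, dim)))
index.add_items(new_point, [1042], replace_deleted=True)
```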

@alonre24
Contributor

Hi @etiennedi, your suggestion sounds interesting! Did you end up implementing this solution? I would appreciate it if you shared your experience.

@etiennedi

Hi @alonre24,

yes, the approach outlined above is what we implemented for async clean-up of deleted objects in Weaviate. You can find the implementation here. At first, we encountered a couple of new bugs during concurrent writes and deletions which could lead to issues down the line, especially around handling entry points. For example, in Weaviate we support kNN-style classifications where each classification is an update (the object is updated with the label), leading to a ton of concurrent writes/deletes. This "helped" us highlight any issues rather quickly.

We have thus added a lot of integration tests covering the edge cases around deletes, which can be found here.

Since then I haven't become aware of any other bugs related to this, so I'm pretty confident that it works reliably now, and I can definitely recommend this approach.

@alonre24
Contributor

alonre24 commented Jun 2, 2021

Thanks @etiennedi for your detailed answer!
There is still one thing that bothers me regarding the "sweep" phase: when reassigning edges for nodes that have outgoing edges to tombstones (in some layer), how can we be sure that the graph stays connected? I imagine a situation where we delete a node that is the only connection between two otherwise isolated clusters. In that case, none of the nodes that were pointing to the deleted node would choose a node from the other cluster as their neighbor, and the graph would be left disconnected as a result.
Perhaps I'm missing something; I would be happy to understand how this approach overcomes such situations.
Thanks :)

@kishorenc
Contributor

@yurymalkov I see that hnswlib supports mark_deleted -- is there an API for performing the periodic sweep / gc to remove these dead nodes?

@yurymalkov
Member

Hi @kishorenc, it is not implemented. An easy solution is to just rebuild the index from scratch (which can be done in C++). That has roughly O(1) amortized complexity per deletion (if triggered once a fraction of the dataset has been deleted).
There should be more effective solutions (in terms of the constant); I guess we can do that, especially if there are volunteers to help.
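
A rebuild-based hard delete can be sketched against the Python binding like this (assuming the caller tracks which labels are still live; parameters are illustrative):

```python
import numpy as np
import hnswlib

def rebuild_without_deleted(old_index, space, dim, live_labels):
    # Hard-delete by rebuilding: re-insert only the labels that are still live.
    # live_labels has to be tracked by the caller (hnswlib does not expose a
    # list of non-deleted labels); triggering this only after a sizable
    # fraction of the data has been tombstoned keeps the amortized cost per
    # deletion roughly constant.
    vectors = np.asarray(old_index.get_items(live_labels), dtype=np.float32)
    new_index = hnswlib.Index(space=space, dim=dim)
    new_index.init_index(max_elements=len(live_labels), M=16, ef_construction=128)
    new_index.add_items(vectors, live_labels)
    return new_index
```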

GerHobbelt pushed a commit to GerHobbelt/hnswlib that referenced this issue May 10, 2024
This PR adds a batch persist functionality via the persistDirty() method to the graph, which only persists dirtied elements in the graph. We store data in four files: a header, the data_level_0, the length, and the link lists. The latter three files map to the in-memory representation. Data is never read from disk except on load, serving as a write-through cache. Callers are expected to periodically call persistDirty() in a thread-compatible way.

This storage scheme is extremely naive, and is only meant as an improvement to serializing the whole index. We can make many improvements in terms of disk access, layout, caching, and durability.