Support deleting elements? #4

Open
ThomasDelteil opened this issue Apr 4, 2018 · 15 comments

Comments

@ThomasDelteil
Contributor

Thanks for the great implementation, the performance is impressive. I was wondering what it would take to add the ability to delete an element from the graph without degrading performance too much? Having a special flag on the label, or re-assigning the lost edges?

@searchivarius
Member

searchivarius commented Apr 4, 2018

This works mostly fine for SW-graph:
https://github.com/nmslib/nmslib/blob/master/test_batch_app/test_batch_mod.cc

Doing something similar for HNSW is a bit more work, but not too much. My understanding is that HNSW will actually be much more robust to deletion. One disadvantage is that you can't use an optimized static index any more (but that's the cost of all dynamic data structures).

@yurymalkov
Member

Hi @ThomasDelteil,
the optimal solution strongly depends on the scenario and on how much performance you are willing to lose.

  1. Deletion by flagging the elements is easy to implement and is OK if the number of deletions is small and/or they are random. However, flagging alone will not stop the degradation of performance.
    Imagine you have a 1B dataset consisting of two well-isolated clusters (i.e. two very different classes). You mark every element from the first cluster as deleted and then query for the nearest neighbor of an element from the "deleted" cluster.
    The search turns (almost) exhaustive, as you have to go through all 0.5B deleted elements to find just a single nearest neighbor from the second (non-deleted) cluster (all elements from the deleted cluster are closer to the query than any second-cluster element).
  2. As you have mentioned, you can do full reassignments of the edges. This keeps the index fresh, no matter how many deletions there were. However, reassignments mean extra work during deletion and are somewhat trickier to implement than in sw-graph (although I see no problem with the optimized hnswlib index).

Agree with @searchivarius. HNSW should be much more robust, as it does not depend on the data order. SW-graph (NSW) can be easily crippled on low-dim data if the first inserted elements are removed.
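
For illustration, the "flagging" from point 1 boils down to filtering tombstoned nodes out of the result set while still traversing them during the graph search. A toy sketch (plain Python, not hnswlib's actual search code; all names here are made up):

```python
import heapq

def greedy_knn(graph, vectors, dist, query, entry, k, deleted):
    # Toy best-first search over a proximity graph with flag-based deletion.
    # Tombstoned nodes are still traversed (their edges are needed to reach
    # live nodes); they are only excluded from the result set, which is why
    # a large deleted cluster slows the search down without breaking it.
    visited = {entry}
    candidates = [(dist(query, vectors[entry]), entry)]  # min-heap by distance
    results = []                                         # max-heap via negated distances
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= k and d > -results[0][0]:
            break                                        # nothing closer left to explore
        if node not in deleted:
            heapq.heappush(results, (-d, node))
            if len(results) > k:
                heapq.heappop(results)                   # drop the farthest kept result
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (dist(query, vectors[nb]), nb))
    return sorted((-nd, n) for nd, n in results)
```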

@phdowling

Any chance this will be added, either here or in nmslib, some time soon?

@searchivarius
Member

@phdowling At some point, it will. SW-graph in NMSLIB has an experimental deletion feature, but it's not exposed to Python.

@yurymalkov
Member

@phdowling I do not think I will have an opportunity to add deletions to hnswlib in the next several months.
But if someone is willing to contribute, I can help.

@linzuk

linzuk commented Oct 19, 2018

Has any hero implemented the deletion feature yet? We are looking forward to it.

@popupz

popupz commented May 11, 2019

@yurymalkov could you describe the process of reassigning the edges?

Would a naive approach of simply removing the node from the connected nodes on all layers work? Those should be easy to find since the links are bi-directional.

Edit: on second thought, links are not bi-directional, because only the strongest links on each level are preserved. So would the only option then be to make a pass through all nodes in the graph and rewrite those nodes that have connections to the item being deleted?

@gasabr

gasabr commented Aug 8, 2019

I haven't gone through all the code yet, but is there a workaround for this issue, like saving and reloading the index, or is recreating the index the only way to do a "hard delete"?

@etiennedi

etiennedi commented Aug 21, 2020

@popupz

Edit: on second thought, links are not bi-directional, because only the strongest links on each level are preserved. So would the only option then be to make a pass through all nodes in the graph and rewrite those nodes that have connections to the item being deleted?

I ran into a similar issue building the deletion feature for the HNSW implementation in Weaviate (which is written in Golang). Weaviate is a general-purpose database/search engine, so we can't predict in what order or at what frequency users will delete items; the "flagging-only" approach therefore isn't feasible for us, for the reasons @yurymalkov outlined very well above. I'd like to share my findings and also present an idea of how I'm planning to tackle this issue. Feedback is much appreciated.

Naive approach

I first tried following all edges of the to-be-deleted item and simply reassigning those, i.e. what you described as the "naive approach". However, even on a very small dataset (1000 nodes, maxNeighbors=30, efConstruction=128) I noticed that less than half of the edges pointing to the deleted item would be removed, i.e. the other half had become uni-directional edges. As you outlined, this is to be expected, as only the strongest links are preserved. However, I wouldn't have expected the effect to be this noticeable on such a small HNSW graph already. Either way, this clearly shows that the naive approach isn't very helpful.
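
In rough pseudocode, the naive approach looks like this (illustrative Python, not the actual Weaviate implementation; the adjacency layout is assumed, a dict mapping node id to one neighbor list per layer):

```python
def naive_delete(adjacency, node_id):
    # Follow only the deleted node's own outgoing edges and remove the
    # corresponding back-references. Incoming edges from nodes that node_id
    # does not point back to are never found, which is why roughly half of
    # them survived as dangling uni-directional links in the test above.
    for layer, neighbors in enumerate(adjacency[node_id]):
        for nb in neighbors:
            nb_layers = adjacency[nb]
            if layer < len(nb_layers) and node_id in nb_layers[layer]:
                nb_layers[layer].remove(node_id)
    del adjacency[node_id]
```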

Brute force approach

So next I tried the brute-force approach of:

  1. Iterate over every node in the graph
  2. If level 0 of this node's outgoing edges does not include the to-be-deleted item, skip the node
  3. On the remaining nodes, remove all edges and re-assign them using the same procedure as on insert (making sure we exclude the to-be-deleted item)

This works well: all the edges are cleaned up and the quality of the index isn't harmed. However, it is terribly slow. Even on my tiny graph, deleting all the items took about 10 times as long as inserting them. I don't see this being feasible on a graph with a few million or a billion nodes.
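
In the same pseudocode style, the brute-force pass is (reassign_edges is a hypothetical stand-in for re-running the insertion-time neighbor selection, not a real hnswlib or Weaviate API):

```python
def brute_force_delete(adjacency, node_id, reassign_edges):
    # reassign_edges(adjacency, n, excluded) re-runs the normal insertion-time
    # neighbor selection for node n while excluding the deleted id.
    del adjacency[node_id]
    for n, layers in adjacency.items():                   # step 1: visit every node
        if node_id not in layers[0]:                      # step 2: cheap layer-0 check
            continue
        reassign_edges(adjacency, n, excluded={node_id})  # step 3: rebuild its edges
```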

Idea: Tombstones with periodic clean up

My idea to solve this issue is to combine the "flagging approach" with a periodical reassignment of edges. It would look roughly like this:

On Delete:

  • Attach a tombstone to the to-be-deleted item
  • Append node id to a list of nodes with tombstones (basically a temporary index to quickly find all deleted nodes)

On Search:

  • Skip each element that has a tombstone attached

Periodically (with configurable frequency):

  • If the list of tombstones is empty, don't do anything, else:
  • Iterate over every node in the graph
  • If the node has no outgoing edges to a tombstoned node, skip it, else:
  • Delete all of its edges and reassign them
  • Once the reassignment around a tombstoned node is done, remove that node's id from the list of tombstoned nodes
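
Structurally the whole scheme boils down to something like this (illustrative Python; reassign_edges is the same hypothetical helper as in the sketches above):

```python
tombstones = set()          # the "temporary index" of deleted node ids

def delete(node_id):
    # On delete: O(1), just mark the node and remember it for the next sweep.
    tombstones.add(node_id)

def is_visible(node_id):
    # On search: tombstoned nodes are still traversed so the graph stays
    # navigable, but they are filtered out of the result set.
    return node_id not in tombstones

def cleanup(adjacency, reassign_edges):
    # Periodically: one pass over the graph; each affected node is reassigned
    # at most once per cycle, no matter how many tombstones accumulated.
    if not tombstones:
        return
    for n, layers in list(adjacency.items()):
        if n in tombstones:
            continue
        if not any(nb in tombstones for lvl in layers for nb in lvl):
            continue
        reassign_edges(adjacency, n, excluded=tombstones)
    for t in list(tombstones):
        adjacency.pop(t, None)  # the tombstoned nodes themselves can go now
        tombstones.discard(t)
```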

Possible Benefits

  • There is immediate feedback to a user sending a delete request, as adding a tombstone is very cheap
  • The deleted item will immediately be excluded from future searches
  • The search performance degrades slightly (as outlined by @yurymalkov) with each additional tombstone; however, the periodic clean-up makes sure the number of tombstones stays low
  • The clean-up procedure is very robust: if the clean-up fails after covering only 1/3 of the graph (e.g. through a random server crash), those nodes will simply be skipped in the next clean-up cycle, and clean-up will continue on the next node that requires it.
  • Regardless of how many deletes there are, at the very most we'll reassign every node once per clean-up cycle. If we were to reassign on every delete, there would be no upper limit to how many reassigns are required. In other words: a node that has outgoing edges to n tombstoned nodes per cycle will be reassigned once instead of n times.
  • During a clean up, the search quality should not change: Since a tombstoned node is skipped in a search anyway, the result quality will not change regardless of whether a clean up job is currently running or not.

Any thoughts on my outlined approach? Is there any way we can avoid a brute-force cleanup which needs to read every single node in the graph?

@yurymalkov
Member

Hi @etiennedi, thanks for the detailed analysis. I agree that this might be a very good solution!
One thing is that it needs a global operation once in a while, but this should not be an actual problem.

Another option is, when a new element is being added, to pick one of the deleted (tombstoned) elements and update it to the position of the new element. The downside is that this cannot shrink the index at any time (e.g. it makes sense to remove unused elements during index saving).
Note that tombstoning of elements and element updates are implemented in hnswlib.
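
For reference, in the Python binding of recent hnswlib releases this corresponds roughly to the sketch below (mark_deleted for tombstoning; allow_replace_deleted/replace_deleted for reusing the slot of a deleted element, which only exists in newer versions):

```python
import numpy as np
import hnswlib

dim = 16
index = hnswlib.Index(space='l2', dim=dim)
# allow_replace_deleted lets add_items reuse the slots of tombstoned elements.
index.init_index(max_elements=1000, M=16, ef_construction=128,
                 allow_replace_deleted=True)

data = np.float32(np.random.random((1000, dim)))
index.add_items(data, np.arange(1000))

# Tombstone an element: knn_query skips it, but it still occupies a slot.
index.mark_deleted(42)

# Insert a new point; with replace_deleted=True it is written into the slot
# of a tombstoned element instead of growing the index.
new_point = np.float32(np.random.random((1, dim)))
index.add_items(new_point, [1042], replace_deleted=True)
```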

@alonre24
Contributor

Hi @etiennedi, your suggestion sounds interesting! Did you end up implementing this solution? I would appreciate it if you shared your experience.

@etiennedi

Hi @alonre24,

yes, the approach outlined above is what we implemented for async clean-up of deleted objects in Weaviate. You can find the implementation here. At first, we encountered a couple of new bugs during concurrent writes and deletions which could lead to issues down the line, especially around handling entry points. For example, in Weaviate we support kNN-style classifications where each classification is an update (the object is updated with the label), leading to a ton of concurrent writes/deletes. This "helped" us highlight any issues rather quickly.

We have thus added a lot of integration tests covering the edge cases around deletes, which can be found here.

Since then I haven't become aware of any other bugs related to this, so I'm pretty confident that it works reliably now, and I can definitely recommend this approach.

@alonre24
Contributor

alonre24 commented Jun 2, 2021

Thanks @etiennedi for your detailed answer!
There is still one thing that bothers me regarding the "sweep" phase: when reassigning edges for nodes that have outgoing edges to tombstones (in some layer), how can we be sure that the graph stays connected? I imagine a situation where we delete a node that is the only connection between two otherwise isolated clusters. In that case, none of the nodes that were pointing to the deleted node would choose a node from the other cluster as their neighbor, and the graph would be left disconnected as a result.
Perhaps I'm missing something; I would be happy to understand how this approach overcomes such situations.
Thanks :)

@kishorenc
Contributor

@yurymalkov I see that hnswlib supports mark_deleted -- is there an API for performing the periodic sweep / gc to remove these dead nodes?

@yurymalkov
Member

Hi @kishorenc, it is not implemented. An easy solution is to just rebuild the index from scratch (which can be done in C++). That has roughly O(1) amortized complexity per deletion (if triggered once a fraction of the dataset has been deleted).
There should be more effective solutions (in terms of the constant); I guess we can do that, especially if there are volunteers to help.
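
A rebuild-based hard delete can be sketched against the Python binding like this (assuming the caller tracks which labels are still live; parameters are illustrative):

```python
import numpy as np
import hnswlib

def rebuild_without_deleted(old_index, space, dim, live_labels):
    # Hard-delete by rebuilding: re-insert only the labels that are still live.
    # live_labels has to be tracked by the caller (hnswlib does not expose a
    # list of non-deleted labels); triggering this only after a sizable
    # fraction of the data has been tombstoned keeps the amortized cost per
    # deletion roughly constant.
    vectors = np.asarray(old_index.get_items(live_labels), dtype=np.float32)
    new_index = hnswlib.Index(space=space, dim=dim)
    new_index.init_index(max_elements=len(live_labels), M=16, ef_construction=128)
    new_index.add_items(vectors, live_labels)
    return new_index
```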

GerHobbelt pushed a commit to GerHobbelt/hnswlib that referenced this issue May 10, 2024
This PR adds a batch persist functionality via the persistDirty() method to the graph, which only persists dirtied elements in the graph. We store data in four files: a header, the data_level_0, the length, and the link lists. The latter three files map to the in-memory representation. Data is never read from disk except on load, serving as a write-through cache. Callers are expected to periodically call persistDirty() in a thread-compatible way.

This storage scheme is extremely naive, and is only meant as an improvement to serializing the whole index. We can make many improvements in terms of disk access, layout, caching, and durability.