HttpANN algorithm to support language-agnostic implementations (Re: Issue #20) #35

Merged 17 commits into harsha-simhadri:main on Oct 4, 2021

@alexklibisz alexklibisz commented Sep 26, 2021

Thanks @maumueller, @gosha1128 and others for the fruitful discussion over in #20. I think I've arrived at an implementation that could fit the purpose of language-agnostic (big) ANN.

The HttpANN algorithm is designed to make HTTP calls to a server. The server executes all indexing and querying, thus enabling language-agnostic ANN implementations with minimal overhead. The only requirements for the server are:

  1. It should implement the JSON-over-HTTP API documented below (copied from httpann.py). Note that this is a 1:1 copy of the BaseANN Python Class API.
  2. It should be able to read the vector dataset in the standard binary format used by this competition.

It could in theory even run remotely, although the intended use-case is that the server runs in the same container.

The overhead for data transfer and serialization is minimal. The server only needs to parse the 10k JSON-encoded query vectors and encode the resulting 10k lists of neighbors.

I also included an example implementation which uses scikit-learn. It's too slow for the large datasets, but it works on the smaller random-xs and random-range-xs. So it should be good enough to demonstrate that this algorithm works.


Here is the API that a server must implement:

| Method | Route | Request Body | Expected Status | Response Body |
| --- | --- | --- | --- | --- |
| POST | /init | dictionary of constructor arguments, e.g., { "metric": "euclidean", "dimension": 99 } | 200 | { } |
| POST | /load_index | { "dataset": <dataset name, e.g. "bigann-10m"> } | 200 | { "load_index": } |
| POST | /set_query_arguments | dictionary of query arguments | 200 | { } |
| POST | /query | { "X": , "k": } | 200 | { } |
| POST | /range_query | { "X": , "radius": } | 200 | { } |
| POST | /get_results | { } | 200 | { "get_results": } |
| POST | /get_additional | { } | 200 | { "get_additional": } |
| POST | /get_range_results | { } | 200 | { "get_range_results": <list of three 1-dimensional lists (lims, I, D)> } |
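
To make the routes above concrete, here is a minimal sketch of a server covering a subset of them, using only the Python standard library. It is not the example server from this PR (that one uses scikit-learn). The names `State`, `brute_force`, `handle`, `Handler`, and `serve`, the port 8080, and the empty `get_additional` payload are all assumptions made for illustration. Dataset loading is left out, so `State.vectors` has to be filled by hand.

```python
import json
import math
from http.server import BaseHTTPRequestHandler, HTTPServer


class State:
    """In-memory state shared across requests (hypothetical helper)."""
    vectors = []  # indexed vectors; a real server would read the dataset file
    results = []  # neighbor lists produced by the most recent /query call


def brute_force(queries, k):
    """For each query, return the indices of its k nearest stored vectors."""
    out = []
    for q in queries:
        dists = [(math.dist(q, v), i) for i, v in enumerate(State.vectors)]
        out.append([i for _, i in sorted(dists)[:k]])
    return out


def handle(route, body):
    """Map a route and a decoded JSON body to a response dict, or None."""
    if route in ("/init", "/set_query_arguments"):
        return {}
    if route == "/load_index":
        return {"load_index": False}  # this sketch never has a saved index
    if route == "/query":
        State.results = brute_force(body["X"], body["k"])
        return {}
    if route == "/get_results":
        return {"get_results": State.results}
    if route == "/get_additional":
        return {"get_additional": {}}  # assumed shape; elided in the table
    return None  # unknown route


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        response = handle(self.path, body)
        status = 200 if response is not None else 404
        payload = json.dumps(response or {}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence per-request logging


def serve(port=8080):
    """Block forever serving requests; the port is an arbitrary choice."""
    HTTPServer(("localhost", port), Handler).serve_forever()
```

Keeping the routing in a plain function (`handle`) separate from the HTTP plumbing makes the logic easy to exercise without opening a socket. A server in any other language would only need to reproduce the same request and response shapes.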

@@ -90,6 +110,17 @@ deep-10M:
"nprobe=2,quantizer_efSearch=8",
"nprobe=4,quantizer_efSearch=4",
"nprobe=2,quantizer_efSearch=16"]
diskann-t2:
Contributor Author

There were two sections called "deep-10M" so I just moved the "diskann-t2" section up here to deduplicate the sections.

@alexklibisz
Contributor Author

alexklibisz commented Sep 26, 2021

It looks like CI is failing because the fit method is never called. I think I can use the --rebuild flag to force this, but how does it work for the other algos without this flag? Never mind, it looks like I just need to return False from load_index.
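
The behavior described here can be sketched as follows. The harness code itself does not appear in this thread, so `prepare_index` is an assumed reconstruction inferred from the comment above, and `ToyAlgo` is a hypothetical stand-in rather than the real BaseANN class.

```python
class ToyAlgo:
    """Hypothetical stand-in for an algorithm wrapper; not the real BaseANN."""

    def __init__(self):
        self.calls = []  # records which methods the harness invoked

    def load_index(self, dataset):
        self.calls.append("load_index")
        return False  # signal that no prebuilt index is available

    def fit(self, dataset):
        self.calls.append("fit")


def prepare_index(algo, dataset):
    """Build the index only if loading it failed (assumed harness logic)."""
    if not algo.load_index(dataset):
        algo.fit(dataset)
```

Under this assumption, a wrapper whose `load_index` returns a falsy value always gets its `fit` method called, with no need for a rebuild flag.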

@alexklibisz alexklibisz changed the title from "HTTPAnn algorithm to support language-agnostic implementations (Re: Issue #20)" to "HttpANN algorithm to support language-agnostic implementations (Re: Issue #20)" Sep 27, 2021
@maumueller maumueller self-requested a review September 27, 2021 07:02
@maumueller maumueller self-assigned this Sep 27, 2021
Collaborator

@maumueller maumueller left a comment

This looks very nice, thanks @alexklibisz! Have you been able to measure the overhead vs a naive implementation?

My only suggestion here is to split up httpann.py into base-http.py (or whatever you deem to fit) and http-example-sklearn.py to show the interaction between the base http wrapper and the actual implementation that you suggest others to use.

Comment on lines +64 to +70
    def query(self, X, k):
        body = dict(X=[arr.tolist() for arr in X], k=k)
        self.post("query", body, 200)

    def range_query(self, X, radius):
        body = dict(X=[arr.tolist() for arr in X], radius=radius)
        self.post("range_query", body, 200)
Collaborator

I'm wondering what the performance penalty of this will be. I would be happy to

  • change the arguments such that the query vector file is exposed
  • or add some kind of prepare for providing the query vectors

Contributor Author

I've done some local testing, and I think it will be fine since the HTTP overhead and JSON serialization are only incurred once. It was a bigger deal with ann-benchmarks because that framework required a request and serialization for every query.
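
A rough way to see the difference between the two approaches is to compare one batched request body against one body per query. The sketch below only measures JSON encoding time and counts request bodies. It does not send anything over HTTP, so it understates the per-request cost. The function names and the vector dimension are assumptions, and timings will vary by machine.

```python
import json
import random
import time


def make_queries(n, dim, seed=0):
    """Generate n random query vectors of the given dimension."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def encode_batch(queries, k):
    """Encode all queries into a single /query body, as HttpANN does."""
    start = time.perf_counter()
    payload = json.dumps({"X": queries, "k": k})
    return payload, time.perf_counter() - start


def encode_per_query(queries, k):
    """Encode one body per query, as a per-query framework would need."""
    start = time.perf_counter()
    payloads = [json.dumps({"X": [q], "k": k}) for q in queries]
    return payloads, time.perf_counter() - start


def compare(n=10_000, dim=100, k=10):
    """Print body counts and encoding times for both approaches."""
    queries = make_queries(n, dim)
    batch, t_batch = encode_batch(queries, k)
    singles, t_single = encode_per_query(queries, k)
    print(f"batched: 1 body, {len(batch)} bytes, {t_batch:.3f}s")
    print(f"per-query: {len(singles)} bodies, {t_single:.3f}s")
```

Calling `compare()` prints both measurements. The batched approach needs a single HTTP round trip, while the per-query approach needs one round trip per query on top of the encoding shown here.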

@alexklibisz
Contributor Author

> My only suggestion here is to split up httpann.py into base-http.py (or whatever you deem to fit) and http-example-sklearn.py to show the interaction between the base http wrapper and the actual implementation that you suggest others to use.

Sounds good. I'll break it into two files.

@maumueller
Collaborator

Great, thanks. Is this ready to be merged?

@alexklibisz
Contributor Author

I think so. Please squash if you can, as there are several intermediate/incomplete commits in there.

@alexklibisz
Contributor Author

I'll resolve the conflicts, and I also need to add one bit of documentation. One moment.

@maumueller
Collaborator

Sorry for not getting back to you earlier. Shall I squash and merge with main?

@alexklibisz
Contributor Author

No problem. Yes, good to go. Thanks!

@maumueller maumueller merged commit 455aadc into harsha-simhadri:main Oct 4, 2021
@maumueller
Collaborator

Looking forward to seeing how this is going to be used. Thanks again.
