Unify and simplify UI? #6

Open · vspinu opened this issue Jul 22, 2021 · 5 comments

Currently every rnn method has its own function, and there is a distinction between query and build. This results in quite a lot of copy-paste code lying around.

What would you think about the following unifications, in increasing order of "density"?

  1. Unify the documentation pages (all builders in one document, all queries in another).
  2. Unify the query and build implementations at the C++ level for the code in src/. For example, in rnn_bruteforce there is really no need for 5 entry points, but rather one function which would internally branch on a NULL query argument.
  3. At the R level, unify all query and all knn functions into two functions, knn_query and knn, with a method = argument which can be one of random, bruteforce or nnd.
  4. Merge knn_query and knn into one function with two arguments, query and reference = NULL. If reference is NULL, it is assumed to be query. (An R sketch of 3 and 4 follows the C++ example below.)

To illustrate bullet 2, the entire rnn_bruteforce.cpp can probably be boiled down to something like this:

// [[Rcpp::export]]
List rnn_brute_force_query(NumericMatrix reference, SEXP query,
                           uint32_t k, const std::string &metric = "euclidean",
                           std::size_t n_threads = 0, bool verbose = false) {
  DISPATCH_ON_DISTANCES();

  auto ref_vec = r_to_dist_vect<Distance>(reference);
  tdoann::NNGraph<Distance::Output> nn_graph;

  if (query == R_NilValue) {
    // no query supplied: build the k-nearest-neighbor graph of the reference data
    nn_graph = tdoann::brute_force_build<Distance, RPProgress, RParallel>(
      ref_vec, reference.ncol(), k, n_threads, verbose);
  } else {
    // query supplied: search the reference data for the neighbors of the query points
    NumericMatrix query_mat(query);
    auto query_vec = r_to_dist_vect<Distance>(query_mat);
    nn_graph = tdoann::brute_force_query<Distance, RPProgress, RParallel>(
      ref_vec, reference.ncol(), query_vec, k, n_threads, verbose);
  }
  return graph_to_r(nn_graph);
}
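
At the R level, bullets 3 and 4 might boil down to a front end roughly like the sketch below. This is a sketch of the proposal only: knn() does not exist, and the per-method builders it delegates to are assumed to keep their current names, of which only nnd_knn is named explicitly in this thread.

knn <- function(query, reference = NULL, k = 10,
                method = c("nnd", "bruteforce", "random"), ...) {
  method <- match.arg(method)
  if (is.null(reference)) {
    # bullet 4: no reference supplied means self-query, i.e. build the
    # k-nearest-neighbor graph of `query` itself
    switch(method,
           nnd        = nnd_knn(query, k = k, ...),
           bruteforce = brute_force_knn(query, k = k, ...),
           random     = random_knn(query, k = k, ...))
  } else {
    # otherwise search `reference` for the neighbors of `query`, dispatching
    # to the corresponding *_query implementations (elided in this sketch)
    stop("query dispatch goes here, analogous to the build branch above")
  }
}

A build would then be knn(data, method = "bruteforce"), and a query knn(new_points, reference = data, method = "bruteforce").
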
jlmelville (Owner) commented

I'm not against this in principle, but it seems premature to do a lot of this at this point in the life cycle of the package; it's nowhere near finished. One thing I really want to do to bring the package closer to pynndescent in usefulness is to add a new method: a forest of random projection trees. Eventually, this will be the default initialization method for nearest neighbor descent, but it can also be used standalone. This is going to substantially complicate any single-function API.

My other big concern with this is that it reduces copy-and-paste at the cost of increasing abstraction. You can take this for what it's worth (probably not much): in my career I have often come to bitterly regret too much abstraction (almost always introduced by me), but copy-and-paste has only ever caused me mild annoyance. It's possible to have too much of either and I understand if the current state of the package goes too far in one direction for your taste.

Here are some more specific thoughts on these proposals.

  1. This one I could possibly be persuaded about. There's enough happening in rnndescent.R that splitting it up might be helpful. What problem is this solving from your perspective?
  2. There's probably a lot of cleanup that could be done to the wrapper code in src. The brute force part of the code is the simplest, so it's the most amenable to this sort of thing. But it's also not the focus of the package (that's the nearest neighbor descent routines). If your main interest in the package is the brute force search, then I can see how this would be a help to you.
  3. I don't see much value to unifying all the functions:
    • They have similar interfaces, but nnd_knn adds a large number of other parameters that the other methods don't use.
    • One way to deal with this: keep the parameters, but then you have to add an "ignored unless method = 'nnd'" note to all the documentation where this matters. When I get round to implementing the tree method there will be another round of additions. At some point, having different functions for different methods is just easier to understand.
    • Another way to deal with this: take them as varargs (or an extra list). But now you can't use tab completion to remind yourself about them and you have to go to the help page to find out about them. I use tab completion to remind myself of argument names all the time and find it very frustrating when that doesn't work. (Both shapes are sketched just after this list.)
    • In practice, I doubt anyone is going to want to use the methods interchangeably. Random neighbor selection is just there for initialization purposes, and brute force is unusable for large datasets, in which case only nearest neighbor descent is feasible. So in practice, people will use brute force search for some datasets and nearest neighbor descent for the rest. Pretending that the methods are interchangeable when they have such wildly different performance characteristics would be doing users of the package a disservice.
    • If there is a strong need for this, I wouldn't be against adding it as an additional option, but in that case, if none of the other functions are going away, would it justify the maintenance and testing burden?
    • As an aside, I wouldn't want to call the functions knn and knn_query: knn_query is probably ok, but with short function names, I am always a bit annoyed when I require a package and find that some functions have now been shadowed by imported functions. Is this irrational of me? Probably (obviously I can just not require the package), but I still like to avoid names that might clash with other functions in the user's environment.
  4. This has all the problems of 3, but with the additional problem that you now have to deal with arguments that only make sense when method = "nnd" and there is a non-NULL query.
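
To make the trade-off described under point 3 concrete, the two shapes a merged function could take look roughly like this (a sketch with hypothetical signatures; the nnd-specific argument names below are placeholders, not the real nnd_knn() arguments):

# Option A: every method-specific parameter in the signature; each one then
# needs an "ignored unless method = 'nnd'" note in its documentation.
knn_a <- function(data, k, method = "nnd",
                  n_iters = 10, delta = 0.001) {  # placeholder nnd-only args
  invisible(NULL)  # body elided
}

# Option B: hide the method-specific parameters behind `...` (or a list),
# at the cost of tab completion and discoverability.
knn_b <- function(data, k, method = "nnd", ...) {
  invisible(NULL)  # body elided
}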

Sorry to be negative. I appreciate the thought and attention you have given this, and I hope my response isn't too demotivating. I think it's unlikely I would want to merge any PRs about 3) or 4) at the moment, but for 2) your proposed cleanup of the brute force knn is a lot simpler: if you can get a PR working, I would certainly look at it favorably. The random neighbors interface might also be fixable. I think merging the nearest neighbor descent routines would be a lot more challenging, so they might not be a good fit for the same PR. For 1), if I can understand the problem it's solving, that is also something I would be interested in learning more about.

vspinu (Author) commented Jul 23, 2021

Ha ha. Interesting. I didn't expect such a strong opinion on 3 :) To me 3 feels very natural. One commonly structures a UI by purpose and returned output. If two functions have the same input, a similar parameter set and the same output, and differ just in the algorithm, they should be one function IMO. I think it's a very common practice in the R world.

There's enough happening in rnndescent.R that splitting it up might be helpful. What problem is this solving from your perspective?

In fact that's my main motivation for proposing all 4 of them. When I come to a package I usually look at the index to get a feeling for what is there. And with rnndescent I was quite confused (and still am, to be honest). There seems to be a lot of functionality with a lot of documentation, but most of it is redundant. With multiple pages containing roughly the same docs, I find myself doing a lot of skimming as the docs generally look familiar, but there might be an important difference which is easy to miss.

If there were only one or two entry points in the entire package, the user would remember them very easily and always come back to the same doc, again and again, thus fully understanding the system from one place.

but nnd_knn adds a large number of other parameters that the other methods don't use.

The main disadvantage is tab completion, I agree. An extra parameter in the docs is not an issue in my experience; I was personally never bothered by it. But then I was never bothered by extra unused arguments in tab completion either :)

In practice, I doubt anyone is going to want to use the methods interchangeably.

True, but it's still easier for them to remember one entry point and come back to it whenever they want. It's easier to remember that there is a "method" parameter with N possible values than to deal with N differently named functions.

As an aside, I wouldn't want to call the functions knn and knn_query: knn_query is probably ok, but with short function names, I am always a bit annoyed when I require a package and find that some functions have now been shadowed by imported functions. Is this irrational of me?

I agree that it's annoying, but it's so common nowadays that I always wrap package loading in suppressPackageStartupMessages. In my opinion it's not ok to crapify function names just because there are potential conflicts. In this concrete case there is no conflict with R core packages, and users are unlikely to use two knn packages anyhow, so I think this particular concern is greatly exaggerated.

Far more annoying is the inconsistency in naming. I really had a lot of trouble remembering the nnd_ prefix for that one function, both when typing the function name and when searching for the docs. I think a cryptic prefix like that should either be used for all functions in a package or dropped altogether.

but with the additional problem that you now have to deal with arguments that only make sense when method = "nnd" and there is a non-NULL query.

Hmm, my feeling is that such distinctions are artificial. I don't see why querying from self should be functionally different from querying against a reference table. For instance, don't reference_graph and epsilon make sense for nnd_knn as well?

Sorry to be negative.

Ah, no problem. It's your package and you should do whatever you see fit. I just recognize that this package has great potential; there is clearly high-quality stuff going on here. So I just wanted to share my experience as a newcomer, which wasn't quite smooth, but still great!

The random neighbors interface might also be fixable. I think merging the nearest neighbor descent routines would be a lot more challenging, so they might not be a good fit for the same PR.

OK. Let me give it a try and get back to you.

vspinu (Author) commented Jul 23, 2021

In practice, I doubt anyone is going to want to use the methods interchangeably.

I actually just hit a counterexample to this. I am developing a system which currently relies on brute force search on a dev set. It might still be brute force in production eventually, but I don't know yet. I want to be able to quickly switch between the two algorithms, etc.

More generally, if a dependent package or a production system wants to keep the choice of algorithm flexible, it has to implement an ugly if/else chain that dispatches on the different function names, as in the sketch below. Instead, if a method parameter is used in the knn functions, then the same parameter can be kept in dependent functions or as a global configuration option.
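
For example, a dependent package today has to write a wrapper like the first function below, whereas with a method = argument the choice could simply be forwarded. This is a sketch assuming the current builder names (only nnd_knn is named explicitly in this thread), and knn() is the proposed function, not an existing one.

# Today: the dependent package dispatches on function names itself.
my_knn <- function(data, k, method = c("bruteforce", "nnd"), ...) {
  method <- match.arg(method)
  if (method == "bruteforce") {
    brute_force_knn(data, k = k, ...)  # assumed name of the brute force builder
  } else {
    nnd_knn(data, k = k, ...)
  }
}

# With a method = argument in the package itself, the choice is just a value
# that can come from configuration and be passed straight through:
# my_knn <- function(data, k, method = "bruteforce", ...) {
#   knn(data, k = k, method = method, ...)  # knn() as proposed above
# }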

jlmelville (Owner) commented

I think I understand a bit better now and I think we probably just have some different expectations around what the package does (at the moment).

Goal 1: a package to present functions for doing (approximate) nearest neighbor search by a specific method. Currently, that's pure nearest-neighbor descent. The brute force method is there to help with benchmarking and comparisons. The task I want to solve with this package is "I would like to generate some nearest neighbors using NND", not "I want to accurately generate nearest neighbors". Therefore it makes sense to me to offer specific functions for each task. This is the current aim of the package.

Goal 2: a package to provide a user-friendly interface to generate nearest neighbors, where the implementation method is a simple detail. This is not currently a goal of the package. This might seem user-hostile, but I just haven't given it any thought (also, my personal long-term reason to build this package is for use by another method, which would hide all these unpleasant details).

Basically, if you are using this package it's because you are interested in nearest neighbor descent, not in generating nearest neighbors per se. I understand that the audience for such a package is basically me. That's ok for now: that's why the version number isn't even 0.1 yet and I haven't submitted it to CRAN.

Goal 2 is worthy. And in that case having a single, simply-named function would be very useful. But I think if that gets added now it will break a lot and be more trouble to maintain than it's worth.

When I come to a package I usually look at the index to get a feeling for what is there. And with rnndescent I was quite confused (and still am, to be honest). There seems to be a lot of functionality with a lot of documentation, but most of it is redundant.

Could this be solved by package-level documentation or vignettes?

users are unlikely to use two knn packages anyhow

I do not at all share your confidence about that!

Far more annoying is the inconsistency in naming. I really had a lot of trouble remembering the nnd_ prefix for that one function.

It doesn't seem that cryptic to me: it stands for "nearest neighbor descent", which is specifically what that function does.

Hmm, my feeling is that such distinctions are artificial. I don't see why querying from self should be functionally different from querying against a reference table. For instance, don't reference_graph and epsilon make sense for nnd_knn as well?

No. You can't query with nearest neighbor descent; the graph_knn_query function uses an entirely different code path. The two are quite closely related conceptually, and the distinction was lost on me until embarrassingly recently, but the actual implementations turn out to be significantly different, hence the different names.

I am quite sure that the distinctions between building and querying here are not artificial. It is tempting to assume that the concepts behind the brute force code can be extrapolated to other methods, and I have made that mistake myself several times while developing the package. To understand why it gets more complex you need to know a bit of technical detail about how nearest neighbor descent works. That underlines my point about not attempting to hide these differences behind a single knn function (at least not yet).
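
A rough sketch of the current split, using only the function and argument names that come up in this thread (the other argument names here are assumptions, not the documented signatures):

# build: run nearest neighbor descent on the data itself
nn <- nnd_knn(data, k = 15)

# query: search a prebuilt reference graph for the neighbors of new points;
# this goes through graph_knn_query() and an entirely different code path
nn_new <- graph_knn_query(query = new_data, reference = data,
                          reference_graph = nn, k = 15)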

This discussion does raise the issue that it might make sense to change graph_knn_query to be able to build a graph as well as query it, but that wouldn't affect the nearest neighbor descent routine (also, it might not be very helpful in practice).

vspinu (Author) commented Jul 23, 2021

Could this be solved by package-level documentation or vignettes?

I don't think so. A vignette would be useful for getting more into the technical details of the algorithm (which I admit I haven't done so far). The troublesome part is the naming conventions, and the fact that logically related functionality lives in different functions and documentation pages, but that's all, and neither is a big deal.

That underlines my point about not attempting to hide these differences behind a single knn function (at least not yet).

I see. Thanks for clarifying.
