
feature: Add weights arguments to differentiate dimensions. #3

Open
vspinu opened this issue Jul 21, 2021 · 6 comments
Labels
enhancement New feature or request

Comments


vspinu commented Jul 21, 2021

Would it be very difficult to add a weights argument to the available metrics? My use case involves well-defined, externally provided importance weights for the matrix's dimensions. My primary interest is in the Hamming metric, but I think it makes sense throughout.


vspinu commented Jul 21, 2021

If the weight can be a square (cross-dimension) matrix, then one could get Mahalanobis distance and, as special cases built on the overlap measure, all the similarity measures for categorical data from this paper.
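To illustrate what a cross-dimension weight matrix buys (a minimal sketch only, not project code; the function name is hypothetical): with a squared distance of the form (x - y)^T W (x - y), choosing W as the inverse covariance matrix gives squared Mahalanobis distance, while a diagonal W reduces to plain per-column weights.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch only: d^2(x, y) = (x - y)^T W (x - y), with W passed row-major.
// With W = inverse covariance this is squared Mahalanobis distance;
// a diagonal W recovers simple per-dimension weights.
double matrix_weighted_dist2(const std::vector<double> &x,
                             const std::vector<double> &y,
                             const std::vector<double> &W) {
  const std::size_t n = x.size();
  double d2 = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const double di = x[i] - y[i];
    for (std::size_t j = 0; j < n; ++j) {
      d2 += di * W[i * n + j] * (x[j] - y[j]);
    }
  }
  return d2;
}
```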

@jlmelville added the enhancement (New feature or request) label on Jul 21, 2021
@jlmelville (Owner)

I haven't really thought this through and am just using this space to muse over implementation ideas, but the simplest approach would be to duplicate all the distances with Weighted versions which take a std::vector<float> of the weights and use that if a newly-added weights parameter was non-NULL (or add it in a new metric_params list parameter that allows arbitrary new parameters to be associated with the chosen metric).
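The duplication idea might look something like this (a minimal sketch, assuming a functor-per-metric design; the class names are hypothetical, not this package's actual internals):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the "separate Weighted class" approach:
// the unweighted Hamming distance stays untouched, and a parallel
// WeightedHamming functor carries a per-column weight vector.
struct Hamming {
  float operator()(const std::vector<float> &x,
                   const std::vector<float> &y) const {
    float d = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
      d += (x[i] != y[i]) ? 1.0f : 0.0f;
    }
    return d;
  }
};

struct WeightedHamming {
  std::vector<float> weights; // one weight per column
  float operator()(const std::vector<float> &x,
                   const std::vector<float> &y) const {
    float d = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
      d += (x[i] != y[i]) ? weights[i] : 0.0f;
    }
    return d;
  }
};
```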

That doesn't seem hard, but it does add a burden in terms of testing, documentation, maintainability, checking error paths and so on. It might also increase compilation times (which wouldn't affect end users) and make the final compiled result larger, which causes a NOTE to appear for some architectures when doing a CRAN check. I don't know why anyone particularly cares about this, but I guess the CRAN maintainers do?

I am less certain about passing in a matrix, as this won't be feasible except for small-dimension datasets, unless the weights are sparse, but sparse input isn't something I have implemented yet.

I am happy to look at a PR, but it's probably a lot of work; just doing it for Hamming might reduce the work a little (at least in terms of testing). As for me getting round to implementing this myself, it has to be traded off against my available time and priorities, so it will be low on any personal roadmap.

@vspinu mentioned this issue Jul 22, 2021

vspinu commented Jul 22, 2021

> the simplest approach would be to duplicate all the distances with Weighted versions which take a std::vector<float> of the weights and use that if a newly-added weights parameter was non-NULL

Alternatively, one can just add a weight to the existing distances. The price to pay is an extra if statement, which, I think, is negligible from a performance perspective.
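That alternative might look like the following (again a hypothetical sketch, not this package's code): one implementation per metric, with the branch on the optional weights hoisted out of the inner loop, so the cost is a single check per distance call rather than per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the "single implementation plus a branch" alternative.
// An empty weights vector means unweighted; the if runs once per call,
// not once per element, so the inner loops stay as tight as before.
float hamming(const std::vector<float> &x, const std::vector<float> &y,
              const std::vector<float> &weights = {}) {
  float d = 0.0f;
  if (weights.empty()) { // unweighted path
    for (std::size_t i = 0; i < x.size(); ++i) {
      d += (x[i] != y[i]) ? 1.0f : 0.0f;
    }
  } else { // weighted path
    for (std::size_t i = 0; i < x.size(); ++i) {
      d += (x[i] != y[i]) ? weights[i] : 0.0f;
    }
  }
  return d;
}
```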

> (or add it in a new metric_params list parameter that allows arbitrary new parameters to be associated with the chosen metric)

This is future-proof. Maybe go with this, then?

> It might also increase compilation times (which wouldn't affect end users) and make the final compiled result larger, which causes a NOTE to appear for some architectures when doing a CRAN check. I don't know why anyone particularly cares about this, but I guess the CRAN maintainers do?

What about switching to cpp11? My experience with it was very positive: much simpler, faster compilation, and a much smaller package size.

> I am less certain about passing in a matrix, as this won't be feasible except for small-dimension datasets, unless the weights are sparse, but sparse input isn't something I have implemented yet.

The weights are for dimensions (columns), not for observations. Those are usually small. It's the user's responsibility to bear the consequences of using a matrix. In terms of the UI, both should be accepted: the weight argument should be either a vector matching the number of input columns C, or a matrix with dimensions C x C.
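The proposed shape check could be as simple as this sketch (the function and enum names are made up for illustration, not part of the package):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical validation for the proposed UI: accept either a
// length-C vector of per-column weights or a flattened C x C matrix,
// and reject anything else up front.
enum class WeightKind { None, Vector, Matrix };

WeightKind check_weights(std::size_t n_cols, const std::vector<double> &w) {
  if (w.empty()) return WeightKind::None;
  if (w.size() == n_cols) return WeightKind::Vector;
  if (w.size() == n_cols * n_cols) return WeightKind::Matrix;
  throw std::invalid_argument(
      "weights must have length C or C * C, where C is the number of columns");
}
```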

> I am happy to look at a PR.

Ok. I will try to have a look at it.

@jlmelville (Owner)

> Alternatively, one can just add a weight to the existing distances. The price to pay is an extra if statement, which, I think, is negligible from a performance perspective.

That's one if per distance calculation, unless I am misunderstanding something? Either way, I wouldn't assume that any change is negligible without measuring the effect.

> What about switching to cpp11? My experience with it was very positive: much simpler, faster compilation, and a much smaller package size.

My assumption is that the compiled sizes are due to the proliferation of concrete classes (one per distance metric), so it's possible it's just a C++ thing rather than an Rcpp thing. But thank you for pointing out cpp11, it seems like something to take a look at.

> The weights are for dimensions (columns), not for observations. Those are usually small.

Out of interest, what discipline is weighted Hamming usually used in, and how large is the dimensionality of the dataset? Column dimensions between 100 and 10,000 are not uncommon in my experience, so I am skeptical that a matrix would be a good idea there. I would just be careful about assuming that the number of columns is usually small when designing and documenting this feature. That might not be the case across all the domains where people want to find approximate nearest neighbors.

> It's the user's responsibility to bear the consequences of using a matrix.

And the maintainer's (although that just means me in this case :D). If most users of this package want to find approximate nearest neighbors in fairly high-dimensional datasets, then adding a feature that most people can't use has to be weighed (no pun intended) against its value. In my experience, the time spent adding a feature is 10% doing the actual implementation and 90% safeguarding against accidental misuse by adding documentation, error handling, tests and so on. There's also the added complexity for users trying to understand how to use the package, who now have the extra overhead of being asked to ignore optional parameters (in this case, admittedly, less of a big deal).

Do you plan to add this option to all the exported functions that carry out distance calculations? Or do you have a more specific use case?

I don't intend for any of this to put you off contributing a PR; I just want to set some expectations about what it will take to get this fully integrated. As this isn't on CRAN (and won't be for a long time, if ever), nor have I made swift progress in this package's development, just adding what you want in a fork might be a quicker way to get what you want without any particular downside.


vspinu commented Jul 23, 2021

> Out of interest, what discipline is weighted Hamming usually used in, and how large is the dimensionality of the dataset?

I don't know, to be frank. Your concern is certainly valid. My use case involved up to 1k features, and it does not involve a matrix weight as yet, but it does require a weight. I just propose a matrix because it captures a huge number of Hamming-style metrics for categorical data, and also Mahalanobis. Also, if you want to go the pynndescent way eventually, which supports a lot of distances including Mahalanobis, a matrix weight is a must.

In terms of the implementation, a vector weight versus a matrix weight doesn't make much difference.

> In my experience, the time spent adding a feature is 10% doing the actual implementation and 90% safeguarding against accidental misuse by adding documentation, error handling, tests and so on

Agreed, but I also want to point out that the current set of tests is quite difficult to maintain because they test the entire knn trip. I would rather propose testing individual distances separately, which is rather trivial. A few toy examples per metric would be enough, and a weight parameter doesn't seem like something very difficult to document or maintain.
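A toy per-metric check of the kind proposed here might look like this (a sketch with a hypothetical function name, comparing one distance against a hand-computed value rather than running a full knn query):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy per-metric test: check a single distance implementation against
// a value computed by hand on a tiny input, with no knn round trip.
// The function under test here is a plain Euclidean distance.
double euclidean(const std::vector<double> &x, const std::vector<double> &y) {
  double d2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    d2 += (x[i] - y[i]) * (x[i] - y[i]);
  }
  return std::sqrt(d2);
}

void test_euclidean() {
  // 3-4-5 triangle: distance between (0, 0) and (3, 4) is exactly 5
  assert(std::abs(euclidean({0.0, 0.0}, {3.0, 4.0}) - 5.0) < 1e-12);
}
```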

Similarly for the documentation, I feel that it could be streamlined quite a bit, but that's the topic of #6 I guess.

> As this isn't on CRAN (and won't be for a long time, if ever), nor have I made swift progress in this package's development, just adding what you want in a fork might be a quicker way to get what you want without any particular downside.

I do understand your concerns. And even though the package is solid already, and IMO it would be a pity if it isn't popularized and pushed to CRAN etc., I sympathize with the maintenance costs. I myself have a lot of half-baked packages lying around which I never had time to finish or promote (including one for parallel distance computation).

In any case, I will still go ahead and contribute the weight. For me, knn without a weight is a no-go. I will have to either implement it from scratch or add a patch here, and I would rather do it here for the greater good.

@jlmelville (Owner)

> Agreed, but I also want to point out that the current set of tests is quite difficult to maintain because they test the entire knn trip.

In practice they aren't that difficult to maintain, because they are designed either to simply characterize the output (i.e., the actual number isn't that important: if the number changes, I know I have done something very bad to the internals) or to reproduce results from existing nearest neighbor libraries.

Also, testing the entire knn trip for a new feature is important to me because that's how the user will interact with it.

> Similarly for the documentation, I feel that it could be streamlined quite a bit, but that's the topic of #6 I guess.

I am happy to give it a try, but in previous packages I have either reverted earlier efforts along these lines or wished I hadn't made them. I see the appeal for very large codebases, but the advantage of not streamlining the documentation is that it's very obvious where the documentation for a function is: it's where the function is. I see documentation and tests as different beasts from the code itself, and the payoff of introducing abstraction to remove duplication, versus keeping things simple, isn't the same.
