feature: Add weights argument to differentiate dimensions #3
If weight can be a square (cross-dimension) matrix, then one could get Mahalanobis, and all the similarity measures for categorical data from this paper, as a special case with the overlap measure.
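To make the matrix-weight idea concrete, here is a minimal sketch of a quadratic-form distance. It is not the package's API (`quadform_distance` is a hypothetical name); it just illustrates why a single CxC weight matrix subsumes several metrics: the identity matrix gives Euclidean, the inverse covariance matrix gives Mahalanobis, and a diagonal matrix degenerates to a plain per-dimension weight vector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Generalized quadratic-form distance:
//   d(x, y) = sqrt((x - y)^T W (x - y))
// W = identity -> Euclidean; W = inverse covariance -> Mahalanobis;
// diagonal W -> per-dimension weight vector.
// Hypothetical sketch, not the package API.
double quadform_distance(const std::vector<double>& x,
                         const std::vector<double>& y,
                         const std::vector<std::vector<double>>& W) {
  const std::size_t n = x.size();
  double acc = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      // accumulate (x_i - y_i) * W_ij * (x_j - y_j)
      acc += (x[i] - y[i]) * W[i][j] * (x[j] - y[j]);
    }
  }
  return std::sqrt(acc);
}
```

Note the double loop: this is O(C^2) per pair, which is exactly why a full matrix is only practical for small column counts, as discussed below.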
I haven't really thought this through and I'm just using this space to muse over implementation ideas, but the simplest approach would be to duplicate all the distances with weighted versions. That doesn't seem hard, but it does add a burden in terms of testing, documentation, maintainability, checking error paths and so on. It might also increase compilation times (which wouldn't affect end users) and make the final compiled result larger.

I am less certain about passing in a matrix, as this won't be feasible except for small-dimension datasets, unless the weights are sparse, but sparse input isn't something I have implemented yet.

I am happy to look at a PR, but it's probably a lot of work; just doing it for Hamming might reduce the work a little bit (at least with testing). In terms of me getting round to implementing this, it has to be traded off against my available time and priorities, so this will be low on any personal roadmap.
Alternatively, one can just add a weight to the existing distances. The price to pay is an extra argument. This is future-proof. Maybe go with this, then?
What about switching to cpp11? My experience with it was very positive: much simpler, faster compilation, and a much smaller package size.
The weights are for dimensions (columns), not for observations, and those are usually small. It's on the user to bear the consequences of using a matrix. In terms of the UI, both should be accepted: the weight argument should be either a vector matching the input's number of columns C, or a matrix with dimensions CxC.
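The "accept either shape" UI could be handled by normalizing the argument once at the boundary, so the distance code only ever sees one representation. A sketch, assuming weights arrive as a flat buffer (the name `normalize_weights` and the row-major layout are assumptions, not the package's design):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical argument check for the proposed UI: `w` must have
// length C (per-column weights) or C*C (a full cross-dimension
// matrix, stored row-major). A length-C vector is expanded to the
// equivalent diagonal matrix so downstream code sees one shape.
std::vector<double> normalize_weights(const std::vector<double>& w,
                                      std::size_t ncol) {
  if (w.size() == ncol) {
    std::vector<double> m(ncol * ncol, 0.0);
    for (std::size_t i = 0; i < ncol; ++i) {
      m[i * ncol + i] = w[i];  // vector entry -> diagonal element
    }
    return m;
  }
  if (w.size() == ncol * ncol) {
    return w;  // already a full CxC matrix
  }
  throw std::invalid_argument("weights must have length C or C*C");
}
```

Expanding a vector to a dense diagonal is wasteful for large C, of course; a real implementation would more likely branch to a specialized vector-weight code path instead.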
Ok. I will try to have a look at it.
My assumption is that the compiled sizes are due to the proliferation of concrete classes (one per distance metric), so it's possible it's just a C++ thing rather than an Rcpp thing. But thank you for pointing out cpp11, it seems like something to take a look at.
Out of interest, what discipline is weighted Hamming usually used in, and how large is the dimensionality of the dataset? Column dimensions between 100 and 10,000 are not uncommon in my experience, so I am skeptical that a matrix would be a good idea there. I would just be careful about assuming that the number of columns is usually small when designing and documenting this feature; this might not be the case across all the domains where people want to find approximate nearest neighbors.
And the maintainer's (although I mean just me in this case :D). If most users of this package want to find approximate nearest neighbors in fairly high-dimensional datasets, then adding a feature that most people can't use has to be weighed (no pun intended) against its value.

It's my experience that the time spent adding a feature is 10% doing the actual implementation and 90% safeguarding against accidental misuse by adding documentation, error handling, tests and so on. There's also the added complexity for the user trying to understand how to use the package, where they have the extra overhead of being asked to ignore optional parameters (in this case, admittedly less of a big deal).

Do you plan to add this option to all the exported functions that carry out distance calculations? Or do you have a more specific use case? I don't intend for any of this to put you off contributing a PR, just to set some expectations about what it will take to get this fully integrated. As this isn't on CRAN (and won't be for a long time, if ever), nor have I made swift progress in this package's development, just adding what you want in a fork might be a quicker way to get what you want without any particular downside.
I don't know, to be frank. Your concern is valid for sure. My use case involved up to 1k features, and it does not involve a matrix weight as yet, but it does require a weight. I just propose a matrix because it captures a huge number of Hamming-style metrics for categorical data, and also Mahalanobis. Also, if you want to go the pynndescent way eventually, which supports a lot of distances including Mahalanobis, a matrix weight is a must. In terms of the implementation, a vector weight versus a matrix weight doesn't make much difference.
Agreed, but I also want to point out that the current set of tests is quite difficult to maintain because they test the entire knn trip. I would rather propose testing individual distances separately, which is rather trivial: a few toy examples per metric would be enough, and a weight parameter doesn't seem like something very difficult to document or maintain. Similarly for the documentation, I feel that it could be streamlined quite a bit, but that's the topic of #6 I guess.
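The "toy examples per metric" idea amounts to checking each distance against hand-computable inputs. A minimal sketch for weighted Hamming, the metric of primary interest here (the free function `weighted_hamming` is illustrative, not the package's internal class hierarchy):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted Hamming distance over categorical codes:
//   d(x, y) = sum_j w_j * [x_j != y_j]
// Hypothetical standalone function for toy-example testing;
// not the package's internal signature.
double weighted_hamming(const std::vector<int>& x,
                        const std::vector<int>& y,
                        const std::vector<double>& w) {
  double d = 0.0;
  for (std::size_t j = 0; j < x.size(); ++j) {
    if (x[j] != y[j]) d += w[j];  // mismatch at column j costs w_j
  }
  return d;
}
```

With uniform weights w_j = 1 this reduces to the plain Hamming count, which gives one obvious sanity check per test case.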
I do feel your concerns. And even though the package is solid already, and IMO it would be a pity if it's not popularized and pushed to CRAN etc., I sympathize with the maintenance costs. I myself have a lot of half-baked packages lying around which I never had time to finish or promote (including one for parallel distance computation). In any case, I will still go ahead and contribute the weight. For me, knn without a weight is a no-go. I would have to either implement it from scratch or add a patch here, and I would rather do it here for the greater good.
In practice they aren't that difficult to maintain, because they are designed either to simply characterize the output (i.e. the actual number isn't that important: if the number changes I know I have done something very bad to the internals) or to reproduce results from existing nearest neighbor libraries. Also, testing the entire knn trip for a new feature is important to me because that's how the user will interact with it.
I am happy to give it a try, but in previous packages I have either reverted previous efforts along these lines or wished I hadn't done it. I see the appeal for very large codebases, but the advantage of not streamlining the documentation is that it's very obvious where the documentation for a function is: it's where the function is. I see documentation and tests as different beasts from the code itself, and the payoff of introducing abstraction to remove duplication versus keeping things simple isn't the same.
Would it be very difficult to add a weight argument to the available metrics? My use case involves well-defined, externally provided importance weights for the matrix's dimensions. My primary interest is in the hamming metric, but I think it does make sense throughout.