New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CF cannot properly handle 0s in the input matrix #379
Comments
A rudimentary implementation of a masked array is nothing more than a sparse matrix + a bitmask. The only operations we need are to be able to lookup if an i,j entry is present and iteration over the present values. we don't, for instance, need to multiply a masked array with a vector. I feel like your proposed additions to spmat complicate its implementation for very little additional gain; we seem to be the only use case. All the sparse matrix multiplication / factorization algorithms don't apply in the non-zero case, so you'll have to write two implementations of everything (the fast common case one + the slow non-zero one). If, on the other hand, you decide to not implement the majority of the algorithms for the non-zero case, then it's basically no different from a masked array with a default value. Honestly, I don't see what the big problem is with (uvec, vec) pair to specify matrix entries from an API perspective. Maybe all we need is a wrapper class
Of course, maybe there are some performance issues you are concerned with, which I have not thought about. Could you maybe elaborate more on why you think its important for the API to change in this regard? |
After further thinking, I agree with your comments about the modified uvec,vec pair isn't necessarily a problem, and I don't see any serious performance issues by using that. Sparse matrices are a compelling abstraction to use for collaborative filtering where one is missing entries, but it's not necessarily the best solution. Ideally I'd like a unified API so we can present the user documentation that says "Okay, you're doing collaborative filtering and using the CF class. Pass in your data like this." instead of "Well, depending on which factorizer you use, your data may come in differently...". Right now we have these factorizers:
It would be possible but difficult to refactor the AMF rules to use NMF is a good case of where the In my mind what's necessary here is some kind of API consistency, which is really what I'm striving for in the end. (Otherwise, as more factorizers and CF code gets added, it gets even uglier...) I'd be happy with switching all factorizers to Okay, I hope this long rambling essay makes some amount of sense. Good API design is hard because there are so many things to balance, but it's very helpful to have someone to bounce ideas off of in the process of convergence. And if it doesn't make any sense, don't be afraid to say so. :) |
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍 |
(For background, see #376.) In collaborative filtering or other matrix completion tasks, the user may have an incomplete rating matrix which may contain 0s. Currently, mlpack won't handle this properly; nor will the the
arma::sp_mat
sparse matrix class. Although sparse matrices can be easily made to use 0s, the Armadillo documentation seems to discourage this (the Armadillo documentation and sparse matrix code, since I wrote a large part of it, is mutable).@stephentu suggested masked arrays in numpy: http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
but personally I think any structure like this will incur significant extra overhead in both runtime and memory (if it turns out I'm wrong, then, I'm happy to use something like masked arrays).
In addition, a bunch of the existing factorizers that CF can use have checks for zero values. Those checks should probably go away regardless of the decision here, especially when
MatType::row_col_iterator
is used, since that will iterate only over nonzero values whenMatType = sp_mat
and all values whenMatType = mat
.I'm leaning towards a first step in this direction being a significant modification of the
arma::SpMat<eT>
class to take some extra parameter specifying what the value of a "missing value" is. For a defaultsp_mat
it should be 0; maybe for the CF applications it should be something likeNaN
or something else. The question in my mind right now is how to provide that support cleanly in such a way that it'll be accepted upstream in Armadillo...The text was updated successfully, but these errors were encountered: