Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CF cannot properly handle 0s in the input matrix #379

Closed
rcurtin opened this issue Jan 12, 2015 · 3 comments
Closed

CF cannot properly handle 0s in the input matrix #379

rcurtin opened this issue Jan 12, 2015 · 3 comments

Comments

@rcurtin
Copy link
Member

rcurtin commented Jan 12, 2015

(For background, see #376.) In collaborative filtering or other matrix completion tasks, the user may have an incomplete rating matrix which may contain 0s. Currently, mlpack won't handle this properly; nor will the the arma::sp_mat sparse matrix class. Although sparse matrices can be easily made to use 0s, the Armadillo documentation seems to discourage this (the Armadillo documentation and sparse matrix code, since I wrote a large part of it, is mutable).

@stephentu suggested masked arrays in numpy: http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
but personally I think any structure like this will incur significant extra overhead in both runtime and memory (if it turns out I'm wrong, then, I'm happy to use something like masked arrays).

In addition, a bunch of the existing factorizers that CF can use have checks for zero values. Those checks should probably go away regardless of the decision here, especially when MatType::row_col_iterator is used, since that will iterate only over nonzero values when MatType = sp_mat and all values when MatType = mat.

I'm leaning towards a first step in this direction being a significant modification of the arma::SpMat<eT> class to take some extra parameter specifying what the value of a "missing value" is. For a default sp_mat it should be 0; maybe for the CF applications it should be something like NaN or something else. The question in my mind right now is how to provide that support cleanly in such a way that it'll be accepted upstream in Armadillo...

@stephentu
Copy link
Contributor

A rudimentary implementation of a masked array is nothing more than a sparse matrix + a bitmask. The only operations we need are to be able to lookup if an i,j entry is present and iteration over the present values. we don't, for instance, need to multiply a masked array with a vector.

I feel like your proposed additions to spmat complicate its implementation for very little additional gain; we seem to be the only use case. All the sparse matrix multiplication / factorization algorithms don't apply in the non-zero case, so you'll have to write two implementations of everything (the fast common case one + the slow non-zero one). If, on the other hand, you decide to not implement the majority of the algorithms for the non-zero case, then it's basically no different from a masked array with a default value.

Honestly, I don't see what the big problem is with (uvec, vec) pair to specify matrix entries from an API perspective. Maybe all we need is a wrapper class

class MatrixEntries {
  arma::uvec indices;
  arma::vec values;
};

Of course, maybe there are some performance issues you are concerned with, which I have not thought about. Could you maybe elaborate more on why you think its important for the API to change in this regard?

@rcurtin
Copy link
Member Author

rcurtin commented Jan 14, 2015

After further thinking, I agree with your comments about the modified sp_mat. I had a feeling it was a half-baked idea but I hadn't figured out how, yet.

uvec,vec pair isn't necessarily a problem, and I don't see any serious performance issues by using that. Sparse matrices are a compelling abstraction to use for collaborative filtering where one is missing entries, but it's not necessarily the best solution. Ideally I'd like a unified API so we can present the user documentation that says "Okay, you're doing collaborative filtering and using the CF class. Pass in your data like this." instead of "Well, depending on which factorizer you use, your data may come in differently...".

Right now we have these factorizers:

  • RegularizedSVD -- takes a mat coordinate list (should probably be split into uvec/vec)
  • AMF<> (this is a whole class of factorizers) -- may take sp_mat or mat depending; in either case, entries that are zero are assumed to be missing, except for the NMF update rules (I think)
  • QUIC_SVD -- takes mat, does not consider missing values
  • MatrixCompletion (needs some glue, but could work as one) -- takes a uvec/vec coordinate list

It would be possible but difficult to refactor the AMF rules to use uvec/vec (especially the NMF rules, which take advantage of linear algebra expressions). NMF does appear to be used for collaborative filtering without any modification, as suggested by this paper: http://arxiv.org/pdf/1205.3193.pdf , so I don't think we can say "okay, we'll just drop support for NMF as a factorizer for CF".

NMF is a good case of where the mat/sp_mat duality is really nice; see src/mlpack/methods/amf/update_rules/nmf_mult_dist.hpp and the fact that HUpdate() and WUpdate() work the same with mat or sp_mat. In some of the other AMF update rules, the row_col_iterator is used to give the same type of sparse/dense genericity, although those rules make the assumption that a zero value means a missing value.

In my mind what's necessary here is some kind of API consistency, which is really what I'm striving for in the end. (Otherwise, as more factorizers and CF code gets added, it gets even uglier...) I'd be happy with switching all factorizers to uvec/vec as input, but this doesn't make it clear what we should do with NMF, which has matrix expressions where missing values no longer make any sense. That solution also doesn't consider the case where things like AMF, QUIC_SVD, and RegularizedSVD would be used for tasks where there are no missing entries, either with mat or sp_mat as input.

Okay, I hope this long rambling essay makes some amount of sense. Good API design is hard because there are so many things to balance, but it's very helpful to have someone to bounce ideas off of in the process of convergence. And if it doesn't make any sense, don't be afraid to say so. :)

@mlpack-bot
Copy link

mlpack-bot bot commented Feb 18, 2019

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Feb 18, 2019
@mlpack-bot mlpack-bot bot closed this as completed Feb 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants