New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CF Interpolation Weights #1410

Merged
merged 36 commits into from Jul 5, 2018

Conversation

Projects
None yet
4 participants
@Wenhao-H
Member

Wenhao-H commented May 25, 2018

To do:

  • Add AverageInterpolation, SimilarityInterpolation, RegressionInterpolation.
  • Add LMetricSearch, CosineSearch, PearsonSearch.
  • Modify methods in class CF to use interpolation policies.
  • Update tests.

@Wenhao-H Wenhao-H changed the title from CF Interpolation Weights to [WIP] CF Interpolation Weights May 25, 2018

Show outdated Hide outdated src/mlpack/methods/cf/interpolation_policies/average_interpolation.hpp
Show outdated Hide outdated src/mlpack/methods/cf/interpolation_policies/average_interpolation.hpp
Show outdated Hide outdated src/mlpack/methods/cf/interpolation_policies/similarity_interpolation.hpp
NeighborSearchType neighborSearch;
};
using EuclideanSearch = LMetricSearch<2>;

This comment has been minimized.

@zoq

zoq Jun 2, 2018

Member

Not sure what the idea behind this is, can you elaborate on this one?

@zoq

zoq Jun 2, 2018

Member

Not sure what the idea behind this is, can you elaborate on this one?

This comment has been minimized.

@Wenhao-H

Wenhao-H Jun 2, 2018

Member

@zoq The current CF performs neighbor search with Euclidean Distance. I am working to modify that to policy-based design and provide alternatives like EuclideanSearch, CosineSearch, PearsonSearch (maybe others as well). The Search() method in these classes return neighbors, and similarities which is used in calculating interpolation weights (i.e. GetWeights() method in interpolation classes).
As to LMetricSearch, I guess it is possible that a user may want to try other L_p distance besides L_2 (Euclidean Distance), though I haven't met such cases.

@Wenhao-H

Wenhao-H Jun 2, 2018

Member

@zoq The current CF performs neighbor search with Euclidean Distance. I am working to modify that to policy-based design and provide alternatives like EuclideanSearch, CosineSearch, PearsonSearch (maybe others as well). The Search() method in these classes return neighbors, and similarities which is used in calculating interpolation weights (i.e. GetWeights() method in interpolation classes).
As to LMetricSearch, I guess it is possible that a user may want to try other L_p distance besides L_2 (Euclidean Distance), though I haven't met such cases.

This comment has been minimized.

@zoq

zoq Jun 2, 2018

Member

Thanks for the clarification, that makes sense, I was just wondering if a user is going to use the alias and if it's a good idea to put it into the cf namespace.

@zoq

zoq Jun 2, 2018

Member

Thanks for the clarification, that makes sense, I was just wondering if a user is going to use the alias and if it's a good idea to put it into the cf namespace.

// Normalize all vectors in referenceSet.
// For each vector x, first subtract mean(x) from each element in x.
// Then normalize the vector to unit length.
arma::mat normalizedSet(arma::size(referenceSet));

This comment has been minimized.

@zoq

zoq Jun 2, 2018

Member

Can we use arma:: normalise here?

@zoq

zoq Jun 2, 2018

Member

Can we use arma:: normalise here?

This comment has been minimized.

@Wenhao-H

Wenhao-H Jun 2, 2018

Member

Yes we should use normalise() instead.

@Wenhao-H

Wenhao-H Jun 2, 2018

Member

Yes we should use normalise() instead.

RegressionInterpolation(const arma::sp_mat& cleanedData)
{
const size_t userNum = cleanedData.n_cols;
a.set_size(userNum, userNum);

This comment has been minimized.

@zoq

zoq Jun 9, 2018

Member

You can also write: a.set_size(arma::size(cleanedData)), just wanted to point that out.

@zoq

zoq Jun 9, 2018

Member

You can also write: a.set_size(arma::size(cleanedData)), just wanted to point that out.

@Wenhao-H

This comment has been minimized.

Show comment
Hide comment
@Wenhao-H

Wenhao-H Jun 10, 2018

Member

All neighbor search classes and weights interpolation classes are added. I might spend some time later to work on NNRegressionInterpolation which is an improvement of RegressionInterpolation.

I will templatize some functions in CFType to make use of neighbor search policies and interpolation policies, and add tests for these classes. But as I don't want to introduce conflicts against my another PR on CF data normalization (#1397), I plan to do that after the other PR is merged (and I guess that PR is close to being merged). At the same time I will try implementing NNRegressionInterpolation and run the CF algorithm with newly added data normalization, neighbor search policies and interpolation policies to check performance improvement.

Member

Wenhao-H commented Jun 10, 2018

All neighbor search classes and weights interpolation classes are added. I might spend some time later to work on NNRegressionInterpolation which is an improvement of RegressionInterpolation.

I will templatize some functions in CFType to make use of neighbor search policies and interpolation policies, and add tests for these classes. But as I don't want to introduce conflicts against my another PR on CF data normalization (#1397), I plan to do that after the other PR is merged (and I guess that PR is close to being merged). At the same time I will try implementing NNRegressionInterpolation and run the CF algorithm with newly added data normalization, neighbor search policies and interpolation policies to check performance improvement.

@Wenhao-H Wenhao-H changed the title from [WIP] CF Interpolation Weights to CF Interpolation Weights Jun 18, 2018

@Wenhao-H

This comment has been minimized.

Show comment
Hide comment
@Wenhao-H

Wenhao-H Jun 18, 2018

Member

Benchmark results on newly-added policies in CF (including data normalization, neighbor search policies and interpolation policies) can be viewed here.

Member

Wenhao-H commented Jun 18, 2018

Benchmark results on newly-added policies in CF (including data normalization, neighbor search policies and interpolation policies) can be viewed here.

@lozhnikov

This comment has been minimized.

Show comment
Hide comment
@lozhnikov

lozhnikov Jun 23, 2018

Contributor

Regarding your question at #mlpack:
I think you could refactor DecompositionPolicy instead. Actually, you needn't rewrite the whole CFType, I think it is enough to add a function to DecompositionPolicy that calculates any particular term of the weighted sum i.e. (w * h.col(neighborhood(j, i))` and perhaps a function that searches for similarities.

Contributor

lozhnikov commented Jun 23, 2018

Regarding your question at #mlpack:
I think you could refactor DecompositionPolicy instead. Actually, you needn't rewrite the whole CFType, I think it is enough to add a function to DecompositionPolicy that calculates any particular term of the weighted sum i.e. (w * h.col(neighborhood(j, i))` and perhaps a function that searches for similarities.

@Wenhao-H

This comment has been minimized.

Show comment
Hide comment
@Wenhao-H

Wenhao-H Jun 23, 2018

Member

@lozhnikov That's a good solution. Correct me if I'm misunderstand. Let's say we use DecompositionPolicy to decompose the rating matrix. To compute a rating we need to call the function decomposition.GetRating(user, item, ...). Therefore, in order to implement BiasSVD and SVD++, we only need to add BiasSVDPolicy and SVDPlusPlusPolicy as decomposition policies, and provide functions to train the model on given datasets, to compute rating for a particular user-item pair, etc.

I have one more question about this solution. To refactor DecompositionPolicy, should model parameters (W, H, ...) be saved as members in CFType<> as they are now, or be saved as members in DecompositionPolicy?

Member

Wenhao-H commented Jun 23, 2018

@lozhnikov That's a good solution. Correct me if I'm misunderstand. Let's say we use DecompositionPolicy to decompose the rating matrix. To compute a rating we need to call the function decomposition.GetRating(user, item, ...). Therefore, in order to implement BiasSVD and SVD++, we only need to add BiasSVDPolicy and SVDPlusPlusPolicy as decomposition policies, and provide functions to train the model on given datasets, to compute rating for a particular user-item pair, etc.

I have one more question about this solution. To refactor DecompositionPolicy, should model parameters (W, H, ...) be saved as members in CFType<> as they are now, or be saved as members in DecompositionPolicy?

@lozhnikov

This comment has been minimized.

Show comment
Hide comment
@lozhnikov

lozhnikov Jun 23, 2018

Contributor

@Wenhao-H Yeah, exactly. So, I think CFType won't be just a wrapper since it contains a lot common code for each decomposition policy. Yes, I guess we should move W and H to DecompositionPolicy.

Contributor

lozhnikov commented Jun 23, 2018

@Wenhao-H Yeah, exactly. So, I think CFType won't be just a wrapper since it contains a lot common code for each decomposition policy. Yes, I guess we should move W and H to DecompositionPolicy.

@zoq

This comment has been minimized.

Show comment
Hide comment
@zoq

zoq Jun 23, 2018

Member

Agreed, let's move W and H into the DecompositionPolicy.

Member

zoq commented Jun 23, 2018

Agreed, let's move W and H into the DecompositionPolicy.

@Wenhao-H

This comment has been minimized.

Show comment
Hide comment
@Wenhao-H

Wenhao-H Jun 24, 2018

Member

Great! I will open another PR to refactor DecompositionPolicy and add BiasSVDPolicy, SVDPlusPlusPolicy.

Member

Wenhao-H commented Jun 24, 2018

Great! I will open another PR to refactor DecompositionPolicy and add BiasSVDPolicy, SVDPlusPlusPolicy.

Wenhao-H added some commits Jun 24, 2018

* @param neighbors Nearest neighbors.
* @param similarites Similarities between query point and its neighbors.
*/
void Search(const arma::mat& query, const size_t k,

This comment has been minimized.

@zoq

zoq Jun 25, 2018

Member

Picky style comment, can you align the parameter with the first one?

@zoq

zoq Jun 25, 2018

Member

Picky style comment, can you align the parameter with the first one?

Show outdated Hide outdated src/mlpack/methods/cf/neighbor_search_policies/pearson_search.hpp
Show outdated Hide outdated src/mlpack/methods/cf/neighbor_search_policies/pearson_search.hpp
Show outdated Hide outdated src/mlpack/methods/cf/neighbor_search_policies/cosine_search.hpp
@lozhnikov

Actually, looks good to me. I just pointed out some interesting points.

coeff(i, j) = std::numeric_limits<double>::min();
coeff(j, i) = coeff(i, j);
// Cache calcualted coefficient.
a(neighbors(i), neighbors(j)) = coeff(i, j);

This comment has been minimized.

@lozhnikov

lozhnikov Jul 2, 2018

Contributor

I am not sure that it is reasonable to cache the coefficients since armadillo stores sparse matrices in CSC format, and populating a sparse matrix via element access operators is inefficient. Moreover you don't take into account that the matrix is symmetric. I am not sure but I think std::map could handle that better. How do you think?

@lozhnikov

lozhnikov Jul 2, 2018

Contributor

I am not sure that it is reasonable to cache the coefficients since armadillo stores sparse matrices in CSC format, and populating a sparse matrix via element access operators is inefficient. Moreover you don't take into account that the matrix is symmetric. I am not sure but I think std::map could handle that better. How do you think?

This comment has been minimized.

@Wenhao-H

Wenhao-H Jul 2, 2018

Member

Ah I understand. Thanks for pointing this out. I didn't know how sparse matrix is implemented and I will take a look at CSC format later. I think std::unordered_map may be a better choice because it is generally faster than std::map.

@Wenhao-H

Wenhao-H Jul 2, 2018

Member

Ah I understand. Thanks for pointing this out. I didn't know how sparse matrix is implemented and I will take a look at CSC format later. I think std::unordered_map may be a better choice because it is generally faster than std::map.

This comment has been minimized.

@rcurtin

rcurtin Jul 2, 2018

Member

Actually, this has changed recently (for Armadillo 7+ I believe). We implemented a 'cache' for Armadillo so that setting sparse matrix elements by, e.g., x(a, b) = c is much faster. Basically in the backend this fills out a std::map<>, which is synced with the actual CSC representation when necessary. So elementwise insertion into sparse matrices is not a thing that is a problem anymore*. I don't have any particular opinions on how you do this here, but I thought I'd let you know since I saw the comment.

If you're interested in more details, here is a paper on how it works: http://www.ratml.org/pub/pdf/2018open.pdf

*It's possible that in some cases it may still be faster to fill a matrix with the batch constructors or by operating on the CSC representation directly, but the time difference between that and individual element insertion with the access operators should not be particularly significant in most common cases.

@rcurtin

rcurtin Jul 2, 2018

Member

Actually, this has changed recently (for Armadillo 7+ I believe). We implemented a 'cache' for Armadillo so that setting sparse matrix elements by, e.g., x(a, b) = c is much faster. Basically in the backend this fills out a std::map<>, which is synced with the actual CSC representation when necessary. So elementwise insertion into sparse matrices is not a thing that is a problem anymore*. I don't have any particular opinions on how you do this here, but I thought I'd let you know since I saw the comment.

If you're interested in more details, here is a paper on how it works: http://www.ratml.org/pub/pdf/2018open.pdf

*It's possible that in some cases it may still be faster to fill a matrix with the batch constructors or by operating on the CSC representation directly, but the time difference between that and individual element insertion with the access operators should not be particularly significant in most common cases.

This comment has been minimized.

@Wenhao-H

Wenhao-H Jul 2, 2018

Member

@rcurtin Thanks for the clear explanation:)
@lozhnikov In this case I think we can keep using sparse matrix to cache the values, right?

@Wenhao-H

Wenhao-H Jul 2, 2018

Member

@rcurtin Thanks for the clear explanation:)
@lozhnikov In this case I think we can keep using sparse matrix to cache the values, right?

This comment has been minimized.

@lozhnikov

lozhnikov Jul 2, 2018

Contributor

@rcurtin Thanks for the clarification, it's interesting.
Actually, looks like the correct link is http://www.ratml.org/pub/pdf/2018user.pdf :)
@Wenhao-H Yeah, sure.

@lozhnikov

lozhnikov Jul 2, 2018

Contributor

@rcurtin Thanks for the clarification, it's interesting.
Actually, looks like the correct link is http://www.ratml.org/pub/pdf/2018user.pdf :)
@Wenhao-H Yeah, sure.

This comment has been minimized.

@rcurtin

rcurtin Jul 2, 2018

Member

Oops, right, looked like I typed it wrong. Thanks for the fix :)

@rcurtin

rcurtin Jul 2, 2018

Member

Oops, right, looked like I typed it wrong. Thanks for the fix :)

Show outdated Hide outdated src/mlpack/tests/cf_test.cpp
@lozhnikov

No comments from my side. If the PR doesn't block anything let's wait a couple of days and then I'll merge the PR.

Show outdated Hide outdated src/mlpack/tests/cf_test.cpp
@rcurtin

rcurtin approved these changes Jul 3, 2018

Splitting the NeighborSearchPolicy and InterpolationPolicy away from the CF class itself is really nice, since it allows users of a single CF model to get predictions in numerous different ways without having to redo the decomposition.

I think in the long term it might be nice to make some of these different search options and interpolation options available in cf_main.cpp, but I don't see it as a huge priority to do that immediately.

Overall it looks great! I am excited to see it merged.

@zoq

zoq approved these changes Jul 4, 2018

No more comments from my side. Really nice contribution, looking forward to use some of the new features.

@lozhnikov lozhnikov merged commit 8cf6e61 into mlpack:master Jul 5, 2018

5 checks passed

Memory Checks
Details
Static Code Analysis Checks Build finished.
Details
Style Checks Build finished.
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment