Lsh table access #663

Merged
merged 12 commits into from Jun 2, 2016

2 participants

@mentekid
Contributor

I modified Train() to accept an extra argument, a std::vector of projection tables (arma::mat). If the size of the vector and the tables it contains is correct, Train retrains the algorithm to use the specified projection tables.

The user can also call setProjectionTables() which simply calls Train().

The user can also inspect the projection tables by calling getProjectionTables().

mentekid added some commits May 31, 2016
@mentekid mentekid Adds code that gives controllable access to LSH projection tables e94896d
@mentekid mentekid Adds code that gives controllable access to LSH projection tables f3c4939
@mentekid mentekid Adds code that gives controllable access to LSH projection tables
934fe08
@rcurtin rcurtin and 1 other commented on an outdated diff May 31, 2016
src/mlpack/methods/lsh/lsh_search.hpp
*/
void Train(const arma::mat& referenceSet,
const size_t numProj,
const size_t numTables,
const double hashWidth = 0.0,
const size_t secondHashSize = 99901,
- const size_t bucketSize = 500);
+ const size_t bucketSize = 500,
+ const std::vector<arma::mat> &projection
+ = std::vector<arma::mat>()
@rcurtin
rcurtin May 31, 2016 Member

Do you think we could use arma::cube here instead? We know how many projection tables we have and how many projections are in each table (and it's the same number of projections per table), so we don't need the extra overhead of the vector. I know that's not code you wrote, but I wanted to ask if you think we could make that change (I'll do it if you think it's reasonable, unless you beat me to it :)).

@mentekid
mentekid May 31, 2016 Contributor

I don't see any reason not to, since the number of tables doesn't change for the duration of the object.

I think I can do that tomorrow if you are busy, shouldn't be much trouble :)

@rcurtin rcurtin commented on an outdated diff May 31, 2016
src/mlpack/methods/lsh/lsh_search_impl.hpp
+ // Z ~ N(0, 1) is p-stable.
+ projMat.randn(referenceSet->n_rows, numProj);
+ }
+ else //user-specified projection tables
+ {
+ projMat = projection[i];
+
+ //make sure specified matrix is of correct size
+ if ( projMat.n_rows != referenceSet->n_rows )
+ throw std::invalid_argument(
+ "projection table dimensionality doesn't"
+ " equal dataset dimensionality" );
+ if ( projMat.n_cols != numProj )
+ throw std::invalid_argument(
+ "projection table doesn't have correct number of projections");
+ }
@rcurtin
rcurtin May 31, 2016 Member

We might be able to clean this up a bit: instead of calling projMat.randn(), if we use a cube instead, we can just call cube::randn(referenceSet->n_rows, numProj, numTables) once at the beginning of the function, and move this entire conditional there. Then projMat or projections[i] won't need to be touched at all here.
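The size checks from the diff above could be hoisted into one up-front validation step, roughly as follows. This is a minimal standard-library sketch, not mlpack code: `TableShape` and `ValidateProjections` are hypothetical names, and the shape struct merely stands in for a matrix's `n_rows` and `n_cols`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the dimensions of one projection table.
struct TableShape
{
  std::size_t nRows;
  std::size_t nCols;
};

// Throws std::invalid_argument unless there are exactly numTables tables,
// each of size (dimensionality x numProj).
inline void ValidateProjections(const std::vector<TableShape>& tables,
                                const std::size_t dimensionality,
                                const std::size_t numProj,
                                const std::size_t numTables)
{
  if (tables.size() != numTables)
    throw std::invalid_argument("wrong number of projection tables");

  for (const TableShape& t : tables)
  {
    if (t.nRows != dimensionality)
      throw std::invalid_argument(
          "projection table dimensionality doesn't equal dataset "
          "dimensionality");
    if (t.nCols != numProj)
      throw std::invalid_argument(
          "projection table doesn't have correct number of projections");
  }
}
```

Validating once, before any hashing work starts, keeps the table-building loop free of per-table error handling.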

@rcurtin rcurtin commented on an outdated diff May 31, 2016
src/mlpack/methods/lsh/lsh_search.hpp
@@ -174,6 +178,24 @@ class LSHSearch
//! Get the second hash table.
const arma::Mat<size_t>& SecondHashTable() const { return secondHashTable; }
+ //! Get the projection tables.
+ const std::vector<arma::mat> getProjectionTables() { return projections; }
@rcurtin
rcurtin May 31, 2016 Member

We should call this Projections(), not getProjectionTables(), like the rest of the accessors in mlpack: https://github.com/mlpack/mlpack/wiki/DesignGuidelines#naming-conventions

@rcurtin rcurtin commented on the diff May 31, 2016
src/mlpack/methods/lsh/lsh_search.hpp
+ const std::vector<arma::mat> getProjectionTables() { return projections; }
+
+ //! Change the projection tables (Retrains object)
+ void setProjectionTables(std::vector<arma::mat> projTables)
+ {
+ // Simply call Train() with given projection tables
+ Train(
+ *referenceSet,
+ numProj,
+ numTables,
+ hashWidth,
+ secondHashSize,
+ bucketSize,
+ projTables
+ );
+ };
@rcurtin
rcurtin May 31, 2016 Member

Same here, we should call it Projections(vector<mat>&). (Don't forget to make it a reference, otherwise copying is going to happen. Also I think it should be const.)
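To illustrate the copying concern, here is a small sketch contrasting an accessor that returns by value with one that returns a const reference. `TableHolder` and both accessor names are hypothetical, and a `std::vector<double>` stands in for the projection tables.

```cpp
#include <utility>
#include <vector>

// Hypothetical class; the vector stands in for the projection tables.
class TableHolder
{
 public:
  explicit TableHolder(std::vector<double> values) :
      projections(std::move(values)) { }

  // Returns a copy: every call allocates and copies all elements.
  std::vector<double> ProjectionsCopy() const { return projections; }

  // Returns a const reference: no copy is made, and callers cannot modify
  // the member through it.
  const std::vector<double>& Projections() const { return projections; }

 private:
  std::vector<double> projections;
};
```

With the reference-returning accessor, repeated calls refer to the same underlying object, which matters when the tables are large.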

@rcurtin rcurtin commented on an outdated diff May 31, 2016
src/mlpack/methods/lsh/lsh_search.hpp
@@ -188,7 +210,7 @@ class LSHSearch
* are private members of this class, initialized during the class
* initialization.
*/
- void BuildHash();
+ void BuildHash(const std::vector<arma::mat> &projection);
@rcurtin
rcurtin May 31, 2016 Member

If you set LSHSearch::projections in either the constructor or in Train() then there is no need to pass this parameter (or check its validity inside of BuildHash()). Actually to be honest I would be happy to move BuildHash() entirely into Train() since BuildHash() is only called once, but either way is fine.

@rcurtin
Member
rcurtin commented May 31, 2016

I know I made a lot of comments on a very simple change, hopefully I am not being too picky. :) I can help with the changes I proposed, just let me know what you'd like me to do.

@rcurtin rcurtin added a commit that referenced this pull request May 31, 2016
@rcurtin rcurtin I'm not sure what line width Pari used, but it wasn't 80 columns.
This will probably make the merge of #663 and other LSH improvements by Yannis
harder...
e6d2ca7
@mentekid
Contributor
mentekid commented May 31, 2016 edited

It's ok, these are pretty simple so I think I can do them quickly tomorrow.

arma::cube slices are arranged in memory the same way std::vector elements are, right? So it won't hurt performance to group the tables like that. It would only hurt if they were arranged in some odd way so that the elements of each matrix weren't contiguous, but that wouldn't make sense.

But armadillo documentation says

Cube data is stored as a set of slices (matrices) stored contiguously within memory. Within each slice, elements are stored with column-major ordering (ie. column by column)

so we're good
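The layout described in that quote (slices stored one after another, each slice column-major) can be sketched with a flat buffer. `FlatCube` below is a hypothetical stand-in written against the standard library only; it is not Armadillo's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a cube: one contiguous buffer holding nSlices
// matrices of size nRows x nCols, each stored column by column.
struct FlatCube
{
  std::size_t nRows, nCols, nSlices;
  std::vector<double> mem;

  FlatCube(std::size_t r, std::size_t c, std::size_t s) :
      nRows(r), nCols(c), nSlices(s), mem(r * c * s, 0.0) { }

  // Offset of element (row, col) of the given slice within mem.
  std::size_t Offset(std::size_t row, std::size_t col, std::size_t slice) const
  {
    return row + col * nRows + slice * nRows * nCols;
  }

  double& At(std::size_t row, std::size_t col, std::size_t slice)
  {
    return mem[Offset(row, col, slice)];
  }

  // Pointer to the first element of a slice; consecutive slices are exactly
  // nRows * nCols elements apart.
  const double* SlicePtr(std::size_t slice) const
  {
    return mem.data() + slice * nRows * nCols;
  }
};
```

Because every table has the same shape, a single allocation suffices, whereas a vector of matrices typically carries one allocation per table plus the vector's own bookkeeping.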

@rcurtin
Member
rcurtin commented May 31, 2016

Yep, slices are contiguous. If you volunteer to do them I will not do it then. :)

I don't expect to see a significant or even noticeable speed difference from this change, but it will simplify the code a bit and reduce memory usage at least some trivial amount.

@mentekid mentekid Fixes naming conventions of accessors
4773efb
@mentekid
Contributor
mentekid commented Jun 1, 2016

I am not sure how boost serialization works. Is there going to be a problem for users trying to load their models if we change the representation from vector to cube?
Should we write something to handle the conversion?

@mentekid
Contributor
mentekid commented Jun 1, 2016 edited

Is this failure related to commit e6d2ca7?

I don't think I changed anything that would conflict with previous versions.

@rcurtin
Member
rcurtin commented Jun 1, 2016

The boost serialization bit is fairly straightforward... we can use the version information with BOOST_CLASS_VERSION(). The implementation will probably look like this:

template<typename Archive>
void Serialize(Archive& ar, const unsigned int version)
{
  ...

  if (version == 0)
  {
    // In older versions, the projection tables were stored in a std::vector.
    std::vector<arma::mat> tmpProj;
    ar & data::CreateNVP(tmpProj, "projections");

    projections.set_size(tmpProj[0].n_rows, tmpProj[0].n_cols, tmpProj.size());
    for (size_t i = 0; i < tmpProj.size(); ++i)
      projections.slice(i) = tmpProj[i];
  }
  else
  {
    ar & data::CreateNVP(projections, "projections");
  }

  ...
}

And then you will have to set the class version number, which you would normally do with BOOST_CLASS_VERSION(LSHSearch, 1);. However, LSHSearch is a template class and the macro does not work with template classes, so we have to use the expansion of the macro instead.
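The version-0 branch of the Serialize() sketch above boils down to packing a list of equally sized tables into one slice-by-slice buffer. A standard-library sketch of that step follows; `PackTables` is a hypothetical helper, and each table is assumed to already be a flat column-major buffer. Unlike the sketch above, it guards against an empty list before reading the first table.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: concatenates equally sized tables (each a flat
// buffer) into one buffer, table after table.
inline std::vector<double> PackTables(
    const std::vector<std::vector<double>>& tables)
{
  std::vector<double> packed;
  if (tables.empty())
    return packed; // Nothing to convert; avoids touching tables[0].

  const std::size_t tableSize = tables[0].size();
  packed.reserve(tableSize * tables.size());
  for (const std::vector<double>& t : tables)
  {
    if (t.size() != tableSize)
      throw std::invalid_argument("tables must all have the same size");
    packed.insert(packed.end(), t.begin(), t.end());
  }
  return packed;
}
```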

I wonder why Travis is not building this PR anymore? If it compiles and the LSH tests pass on your system I am fine with that, no need for Travis.

@rcurtin rcurtin commented on an outdated diff Jun 1, 2016
src/mlpack/methods/lsh/lsh_search.hpp
@@ -159,8 +163,6 @@ class LSHSearch
//! Get the number of projections.
size_t NumProjections() const { return projections.size(); }
@rcurtin
rcurtin Jun 1, 2016 Member

cube::size() actually returns n_elem, not n_slices, so we should change this to return projections.n_slices;.

@rcurtin rcurtin commented on an outdated diff Jun 1, 2016
src/mlpack/methods/lsh/lsh_search.hpp
@@ -159,8 +163,6 @@ class LSHSearch
//! Get the number of projections.
size_t NumProjections() const { return projections.size(); }
- //! Get the projection matrix of the given table.
- const arma::mat& Projection(const size_t i) const { return projections[i]; }
@rcurtin
rcurtin Jun 1, 2016 Member

We should mention the removal of this function in HISTORY.txt, and mention that users should now use Projections().slice(i) instead.

@rcurtin rcurtin commented on an outdated diff Jun 1, 2016
src/mlpack/methods/lsh/lsh_search.hpp
@@ -174,6 +176,24 @@ class LSHSearch
//! Get the second hash table.
const arma::Mat<size_t>& SecondHashTable() const { return secondHashTable; }
+ //! Get the projection tables.
+ const arma::cube Projections() { return projections; }
@rcurtin
rcurtin Jun 1, 2016 Member

Don't forget the & :)

@rcurtin rcurtin commented on an outdated diff Jun 1, 2016
src/mlpack/methods/lsh/lsh_search_impl.hpp
@@ -518,7 +542,7 @@ void LSHSearch<SortPolicy>::Serialize(Archive& ar,
// Delete existing projections, if necessary.
if (Archive::is_loading::value)
- projections.clear();
+ projections.zeros(0, 0, 0); // TODO: correct way to clear this?
@rcurtin
rcurtin Jun 1, 2016 Member

.reset() is probably the easiest way to clear the cube.

@rcurtin
Member
rcurtin commented Jun 1, 2016

The snippet you will need for versioning will be this...

namespace boost {
namespace serialization {

template<typename SortPolicy>
struct version<mlpack::neighbor::LSHSearch<SortPolicy>>
{
  typedef mpl::int_<1> type;
  typedef mpl::integral_c_tag tag;
  BOOST_STATIC_CONSTANT(int, value = version::type::value);
  BOOST_MPL_ASSERT((
    boost::mpl::less<
        boost::mpl::int_<1>, boost::mpl::int_<256>>
  ));
};

} // namespace serialization
} // namespace boost

It might be worth making a nice macro for this situation, or some class or something that is easy to overload.
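The underlying idea is a partial specialization of a version trait, so that one declaration covers every instantiation of the class template. The sketch below shows the pattern without Boost; `ClassVersion`, `Searcher`, and the policy types are hypothetical stand-ins, not Boost or mlpack names.

```cpp
// Hypothetical version trait: every type defaults to version 0.
template<typename T>
struct ClassVersion
{
  static constexpr int value = 0;
};

// Hypothetical class template standing in for LSHSearch<SortPolicy>.
template<typename SortPolicy>
class Searcher { };

struct NearestSort { };
struct FurthestSort { };

// Partial specialization: applies to every Searcher<SortPolicy>. A full
// specialization (template<>) could only name one concrete type at a time.
template<typename SortPolicy>
struct ClassVersion<Searcher<SortPolicy>>
{
  static constexpr int value = 1;
};

static_assert(ClassVersion<int>::value == 0, "default version is 0");
static_assert(ClassVersion<Searcher<NearestSort>>::value == 1,
              "all Searcher instantiations report version 1");
```

A macro wrapping this pattern would need to take the template parameter list as an extra argument, which is one way the "nice macro" mentioned above could be shaped.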

@rcurtin
Member
rcurtin commented Jun 1, 2016

Five patches for serialization:

http://www.ratml.org/misc/0001-Add-a-templated-version-for-BOOST_CLASS_VERSION.patch
http://www.ratml.org/misc/0002-We-actually-need-to-wrap-mlpack-data-SecondShim-obje.patch
http://www.ratml.org/misc/0003-Include-new-serialization-version-macro.patch
http://www.ratml.org/misc/0004-Use-n_slices-not-size-to-fix-correctness.patch
http://www.ratml.org/misc/0005-Refactor-Serialize-add-backwards-compatibility-and-u.patch

You can download those then use git am to apply to your repo, and this should update this PR. Hopefully this is not too tedious, I don't know if I can add individual commits to your PR in the interface we have here. I'll remove the patches from my server once you've applied them.

@rcurtin rcurtin merged commit 06fdfa8 into mlpack:master Jun 2, 2016

1 check failed

continuous-integration/appveyor/pr AppVeyor was unable to build non-mergeable pull request
@rcurtin
Member
rcurtin commented Jun 2, 2016

A few minor changes I made afterwards:

e0b6ce7 b30e697 d3e3c54 8ed22ce e3a23c2

I forgot about the versioning, which means we can't remove Projection() until 2.1.0:
https://github.com/mlpack/mlpack/blob/master/UPDATING.txt

Let me know if I broke anything. :)

@mentekid
Contributor
mentekid commented Jun 2, 2016

I pulled the changes from upstream:master and merged with my MultiprobeLSH branch, but there seems to be a problem when compiling mlpack_lsh.

g++ says all of the LSHSearch class variables have "not been declared in this scope". I'm not sure what this means. Did I do something wrong when recompiling, or is it something in the master branch code?

Here are the g++ error messages:

[ 95%] Building CXX object src/mlpack/methods/lsh/CMakeFiles/mlpack_lsh.dir/lsh_main.cpp.o
In file included from /home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search.hpp:377:0,
                 from /home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_main.cpp:17:
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:819:16: error: ‘SortPolicy’ was not declared in this scope
 void LSHSearch<SortPolicy>::Serialize(Archive& ar,
                ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:819:26: error: template argument 1 is invalid
 void LSHSearch<SortPolicy>::Serialize(Archive& ar,
                          ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp: In function ‘void mlpack::neighbor::Serialize(Archive&, unsigned int)’:
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:827:9: error: ‘ownsSet’ was not declared in this scope
     if (ownsSet)
         ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:828:14: error: ‘referenceSet’ was not declared in this scope
       delete referenceSet;
              ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:829:5: error: ‘ownsSet’ was not declared in this scope
     ownsSet = true;
     ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:831:18: error: ‘referenceSet’ was not declared in this scope
   ar & CreateNVP(referenceSet, "referenceSet");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:833:18: error: ‘numProj’ was not declared in this scope
   ar & CreateNVP(numProj, "numProj");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:834:18: error: ‘numTables’ was not declared in this scope
   ar & CreateNVP(numTables, "numTables");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:838:5: error: ‘projections’ was not declared in this scope
     projections.reset();
     ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:847:5: error: ‘projections’ was not declared in this scope
     projections.set_size(tmpProj[0].n_rows, tmpProj[0].n_cols, tmpProj.size());
     ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:853:20: error: ‘projections’ was not declared in this scope
     ar & CreateNVP(projections, "projections");
                    ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:856:18: error: ‘offsets’ was not declared in this scope
   ar & CreateNVP(offsets, "offsets");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:857:18: error: ‘hashWidth’ was not declared in this scope
   ar & CreateNVP(hashWidth, "hashWidth");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:858:18: error: ‘secondHashSize’ was not declared in this scope
   ar & CreateNVP(secondHashSize, "secondHashSize");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:859:18: error: ‘secondHashWeights’ was not declared in this scope
   ar & CreateNVP(secondHashWeights, "secondHashWeights");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:860:18: error: ‘bucketSize’ was not declared in this scope
   ar & CreateNVP(bucketSize, "bucketSize");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:861:18: error: ‘secondHashTable’ was not declared in this scope
   ar & CreateNVP(secondHashTable, "secondHashTable");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:862:18: error: ‘bucketContentSize’ was not declared in this scope
   ar & CreateNVP(bucketContentSize, "bucketContentSize");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:863:18: error: ‘bucketRowInHashTable’ was not declared in this scope
   ar & CreateNVP(bucketRowInHashTable, "bucketRowInHashTable");
                  ^
/home/et3rn1ty/Projects/MLPack/mlpack/src/mlpack/methods/lsh/lsh_search_impl.hpp:864:18: error: ‘distanceEvaluations’ was not declared in this scope
   ar & CreateNVP(distanceEvaluations, "distanceEvaluations");

@rcurtin
Member
rcurtin commented Jun 2, 2016

Not sure what the issue is here. Is the signature

template<typename Archive>
template<typename SortPolicy>
void LSHSearch<SortPolicy>::Serialize(Archive& ar, const unsigned int version)

? If not, maybe that is the issue, maybe the merge went wrong or something.

@mentekid
Contributor
mentekid commented Jun 2, 2016

Yep, that was it. I only had one template declaration; I accidentally deleted the other one in the merge.

By the way for some reason

template<typename SortPolicy>
template<typename Archive>

produced an error but

template<typename Archive>
template<typename SortPolicy>

didn't. The second version is how it is in the master branch.

Finally we can close this 😄

@rcurtin
Member
rcurtin commented Jun 2, 2016

Yes, the ordering makes a difference. You have to put the template declaration for the method before the template declaration for the class.
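For reference, here is a minimal standalone sketch of defining a member function template of a class template outside the class body. `Holder`, `CountingArchive`, and `AnyPolicy` are hypothetical names. Note that the C++ standard specifies the order used below (the class template's parameter list first, then the member template's), which is the reverse of the order described above, so the behavior reported in this thread may have had another cause in the merged code.

```cpp
// Hypothetical class template with a member function template.
template<typename SortPolicy>
class Holder
{
 public:
  template<typename Archive>
  void Serialize(Archive& ar, const unsigned int version);

 private:
  int numProj = 10;
};

// Out-of-class definition: the enclosing class's template parameter list
// comes first, followed by the member template's own parameter list.
template<typename SortPolicy>
template<typename Archive>
void Holder<SortPolicy>::Serialize(Archive& ar,
                                   const unsigned int /* version */)
{
  ar & numProj;
}

// Hypothetical archive that only counts how many fields it is handed.
struct CountingArchive
{
  int fields = 0;

  template<typename T>
  CountingArchive& operator&(T& /* value */)
  {
    ++fields;
    return *this;
  }
};

struct AnyPolicy { };
```

If only the member's `template<typename Archive>` line is present, the compiler parses a free function template, and names such as `SortPolicy` and the class members are then reported as undeclared, much like the g++ output earlier in this thread.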
