Refactor PCA class: able to use different decomposition techniques (exact, randomized, QUIC SVD). #716

Closed
wants to merge 31 commits into from
Changes from 3 commits
31 commits
4727298
Fix explicitly specialized template issue.
zoq Jun 2, 2016
05b36fc
Merge remote-tracking branch 'upstream/master'
zoq Jun 5, 2016
00e867f
edge_boxes: feature extraction
nilayjain Jun 5, 2016
c8d5766
Properly resetting auxBound. Start using a Reset() method, to avoid f…
MarcosPividori Jun 3, 2016
5149efd
backported ind2sub and sub2ind
nilayjain Jun 6, 2016
61e63b9
backported ind2sub and sub2ind
nilayjain Jun 6, 2016
5f01b84
Revert "edge_boxes: feature extraction"
nilayjain Jun 6, 2016
8907d5a
backported sub2ind & ind2sub
nilayjain Jun 6, 2016
b8da5c6
fix doc tutorial
keon Jun 7, 2016
0d6d3af
Use appveyor cache (nuget and armadillo).
zoq Jun 7, 2016
45e8cd6
fix typo
keon Jun 7, 2016
01e699c
added test for ind2sub and sub2ind
nilayjain Jun 7, 2016
7e8abed
Minor style fixes for ind2sub() test.
rcurtin Jun 7, 2016
7bbd897
Add new contributors.
rcurtin Jun 7, 2016
29fcf0a
Try debugging symbols for AppVeyor build to see if it is faster.
rcurtin Jun 7, 2016
cbbd671
Merge remote-tracking branch 'upstream/master'
zoq Jun 14, 2016
ec0a6d5
Add QUIC-SVD singular values test.
zoq Jul 4, 2016
6aedf56
Remove implementation from header file to avoid duplicate symbol error.
zoq Jul 4, 2016
570a3d8
Add randomized SVD method.
zoq Jul 4, 2016
907461f
Add randomized SVd test suite.
zoq Jul 4, 2016
861fe30
Add exact, randomized and QUIC SVD decomposition policies; meant to b…
zoq Jul 4, 2016
9101a6f
Refactor PCA class; able to use different decomposition methods.
zoq Jul 4, 2016
ee7ff36
Merge remote-tracking branch 'upstream/master'
zoq Jul 4, 2016
10d435f
Merge with master.
zoq Jul 4, 2016
081428e
Fix merge conflict.
zoq Jul 4, 2016
43744c5
Remove unused header guard.
zoq Jul 4, 2016
260a48e
Update boost test header.
zoq Jul 4, 2016
1d675d4
Do not split if numColumns < 3.
zoq Jul 4, 2016
080d198
Use the correct svd method name and parameter name for the centered d…
zoq Jul 5, 2016
1c0192f
Minor style changes.
zoq Jul 5, 2016
1127e61
Introduce compatibility by changing PCA to PCAType.
zoq Jul 5, 2016
@@ -24,10 +24,10 @@ class ExactSVDPolicy
public:
/**
* Apply Principal Component Analysis to the provided data set using the
* randomized SVD.
* exact SVD method.
*
* @param data Data matrix.
* @param data Centered data matrix.
* @param centeredData Centered data matrix.
* @param transformedData Matrix to put results of PCA into.
* @param eigVal Vector to put eigenvalues into.
* @param eigvec Matrix to put eigenvectors (loadings) into.
@@ -37,7 +37,7 @@ class ExactSVDPolicy
const arma::mat& centeredData,
arma::mat& transformedData,
arma::vec& eigVal,
arma::mat& coeff,
arma::mat& eigvec,
const size_t /* rank */)
{
// This matrix will store the right singular values; we do not need them.
@@ -49,11 +49,11 @@ class ExactSVDPolicy
{
// Do economical singular value decomposition and compute only the left
// singular vectors.
arma::svd_econ(coeff, eigVal, v, centeredData, 'l');
arma::svd_econ(eigvec, eigVal, v, centeredData, 'l');
}
else
{
arma::svd(coeff, eigVal, v, centeredData);
arma::svd(eigvec, eigVal, v, centeredData);
}

// Now we must square the singular values to get the eigenvalues.
@@ -62,7 +62,7 @@ class ExactSVDPolicy
eigVal %= eigVal / (data.n_cols - 1);

// Project the samples to the principals.
transformedData = arma::trans(coeff) * centeredData;
transformedData = arma::trans(eigvec) * centeredData;
}
};
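
For readers unfamiliar with Armadillo: `%` is element-wise multiplication, so `eigVal %= eigVal / (data.n_cols - 1)` turns each singular value sigma_i into sigma_i^2 / (N - 1). A minimal standard-library sketch of that conversion follows; the function name `SingularToEigenvalues` is hypothetical and is not part of this PR.

```cpp
#include <cstddef>
#include <vector>

// Convert the singular values of a centered data matrix (one point per
// column) into eigenvalues of the sample covariance matrix:
//   lambda_i = sigma_i^2 / (N - 1).
// numPoints is the number of columns N and is assumed to be at least 2.
// Hypothetical helper for illustration only.
std::vector<double> SingularToEigenvalues(const std::vector<double>& sigma,
                                          const std::size_t numPoints)
{
  std::vector<double> eigVal(sigma.size());
  for (std::size_t i = 0; i < sigma.size(); ++i)
    eigVal[i] = sigma[i] * sigma[i] / static_cast<double>(numPoints - 1);
  return eigVal;
}
```

With five points, singular values 2 and 1 map to eigenvalues 1 and 0.25.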

@@ -40,7 +40,7 @@ class QUICSVDPolicy
* QUIC-SVD method.
*
* @param data Data matrix.
* @param data Centered data matrix.
* @param centeredData Centered data matrix.
* @param transformedData Matrix to put results of PCA into.
* @param eigVal Vector to put eigenvalues into.
* @param eigvec Matrix to put eigenvectors (loadings) into.
Contributor

The parameter names are inconsistent: one is called eigvec and the other coeff.
Also, the comments on the Apply function of PCA describe the same function but differ between pca.hpp and pca_impl.hpp.

pca.hpp

 /**
   * Apply Principal Component Analysis to the provided data set.  It is safe to
   * pass the same matrix reference for both data and transformedData.
   *
   * @param data Data matrix.
   * @param transformedData Matrix to put results of PCA into.
   * @param eigval Vector to put eigenvalues into.
   * @param eigvec Matrix to put eigenvectors (loadings) into.
   */
  void Apply(const arma::mat& data,
             arma::mat& transformedData,
             arma::vec& eigval,
arma::mat& eigvec);

pca_impl.hpp

/**
 * Apply Principal Component Analysis to the provided data set.
 *
 * @param data - Data matrix
 * @param transformedData - Data with PCA applied
 * @param eigVal - contains eigen values in a column vector
 * @param coeff - PCA Loadings/Coeffs/EigenVectors
 */
template<typename DecompositionPolicy>
void PCA<DecompositionPolicy>::Apply(const arma::mat& data,
                arma::mat& transformedData,
                arma::vec& eigVal,
arma::mat& coeff)

Member Author

Nice catch, thanks.

@@ -50,22 +50,22 @@ class QUICSVDPolicy
const arma::mat& centeredData,
arma::mat& transformedData,
arma::vec& eigVal,
arma::mat& coeff,
arma::mat& eigvec,
const size_t /* rank */)
{
// This matrix will store the right singular values; we do not need them.
arma::mat v, sigma;

// Do singular value decomposition using the QUIC-SVD algorithm.
svd::QUIC_SVD quicsvd(centeredData, coeff, v, sigma, epsilon, delta);
svd::QUIC_SVD quicsvd(centeredData, eigvec, v, sigma, epsilon, delta);

// Now we must square the singular values to get the eigenvalues.
// In addition we must divide by the number of points, because the
// covariance matrix is X * X' / (N - 1).
eigVal = arma::pow(arma::diagvec(sigma), 2) / (data.n_cols - 1);

// Project the samples to the principals.
transformedData = arma::trans(coeff) * centeredData;
transformedData = arma::trans(eigvec) * centeredData;
}
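
All three policies finish with the same projection, `transformedData = arma::trans(eigvec) * centeredData`. The sketch below spells that product out with plain vectors, assuming one eigenvector per column of `eigvec` and one data point per column of `centered`. The `Matrix` alias and `ProjectOntoComponents` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Row-major dense matrix: m[row][col]. Hypothetical alias for illustration.
using Matrix = std::vector<std::vector<double>>;

// Compute trans(eigvec) * centered.
// eigvec is d x k (one eigenvector per column), centered is d x n (one point
// per column); the result is k x n.
Matrix ProjectOntoComponents(const Matrix& eigvec, const Matrix& centered)
{
  const std::size_t d = eigvec.size();
  const std::size_t k = eigvec.empty() ? 0 : eigvec[0].size();
  const std::size_t n = centered.empty() ? 0 : centered[0].size();

  Matrix out(k, std::vector<double>(n, 0.0));
  for (std::size_t c = 0; c < k; ++c)        // Each principal component.
    for (std::size_t r = 0; r < d; ++r)      // Each original dimension.
      for (std::size_t p = 0; p < n; ++p)    // Each data point.
        out[c][p] += eigvec[r][c] * centered[r][p];
  return out;
}
```

Because the eigenvectors are orthonormal, this is a change of basis; keeping only the first few rows of the result is what performs the dimensionality reduction.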

//! Get the error tolerance fraction for calculated subspace.
@@ -44,7 +44,7 @@ class RandomizedSVDPolicy
* randomized SVD.
*
* @param data Data matrix.
* @param data Centered data matrix.
* @param centeredData Centered data matrix.
* @param transformedData Matrix to put results of PCA into.
* @param eigVal Vector to put eigenvalues into.
* @param eigvec Matrix to put eigenvectors (loadings) into.
@@ -54,23 +54,23 @@ class RandomizedSVDPolicy
const arma::mat& centeredData,
arma::mat& transformedData,
arma::vec& eigVal,
arma::mat& coeff,
arma::mat& eigvec,
const size_t rank)
{
// This matrix will store the right singular values; we do not need them.
arma::mat v;

// Do singular value decomposition using the randomized SVD algorithm.
svd::RandomizedSVD rsvd(iteratedPower, maxIterations);
rsvd.Apply(data, coeff, eigVal, v, rank);
rsvd.Apply(data, eigvec, eigVal, v, rank);

// Now we must square the singular values to get the eigenvalues.
// In addition we must divide by the number of points, because the
// covariance matrix is X * X' / (N - 1).
eigVal %= eigVal / (data.n_cols - 1);

// Project the samples to the principals.
transformedData = arma::trans(coeff) * centeredData;
transformedData = arma::trans(eigvec) * centeredData;
}

//! Get the size of the normalized power iterations.
54 changes: 30 additions & 24 deletions src/mlpack/methods/pca/pca.hpp
@@ -1,10 +1,12 @@
/**
* @file pca.hpp
* @author Ajinkya Kale
* @author Ryan Curtin
* @author Marcus Edel
*
* Defines the PCA class to perform Principal Components Analysis on the
* specified data set. There are many variations on how to do this, so template
* parameters allow the selection of different techniques.
* specified data set. There are many variations on how to do this, so
* template parameters allow the selection of different techniques.
*/
#ifndef MLPACK_METHODS_PCA_PCA_HPP
#define MLPACK_METHODS_PCA_PCA_HPP
@@ -16,14 +18,14 @@ namespace mlpack {
namespace pca {

/**
* This class implements principal components analysis (PCA). This is a common,
* widely-used technique that is often used for either dimensionality reduction
* or transforming data into a better basis. Further information on PCA can be
* found in almost any statistics or machine learning textbook, and all over the
* internet.
* This class implements principal components analysis (PCA). This is a
* common, widely-used technique that is often used for either dimensionality
* reduction or transforming data into a better basis. Further information on
* PCA can be found in almost any statistics or machine learning textbook, and
* all over the internet.
*/
template<typename DecompositionPolicy = ExactSVDPolicy>
class PCA
class PCAType
{
public:
/**
@@ -32,12 +34,12 @@ class PCA
*
* @param scaleData Whether or not to scale the data.
*/
PCA(const bool scaleData = false,
const DecompositionPolicy& decomposition = DecompositionPolicy());
PCAType(const bool scaleData = false,
const DecompositionPolicy& decomposition = DecompositionPolicy());

/**
* Apply Principal Component Analysis to the provided data set. It is safe to
* pass the same matrix reference for both data and transformedData.
* Apply Principal Component Analysis to the provided data set. It is safe
* to pass the same matrix reference for both data and transformedData.
*
* @param data Data matrix.
* @param transformedData Matrix to put results of PCA into.
@@ -50,8 +52,8 @@ class PCA
arma::mat& eigvec);

/**
* Apply Principal Component Analysis to the provided data set. It is safe to
* pass the same matrix reference for both data and transformedData.
* Apply Principal Component Analysis to the provided data set. It is safe
* to pass the same matrix reference for both data and transformedData.
*
* @param data Data matrix.
* @param transformedData Matrix to store results of PCA in.
@@ -62,11 +64,11 @@ class PCA
arma::vec& eigVal);

/**
* Use PCA for dimensionality reduction on the given dataset. This will save
* Use PCA for dimensionality reduction on the given dataset. This will save
* the newDimension largest principal components of the data and remove the
* rest. The parameter returned is the amount of variance of the data that is
* retained; this is a value between 0 and 1. For instance, a value of 0.9
* indicates that 90% of the variance present in the data was retained.
* rest. The parameter returned is the amount of variance of the data that
* is retained; this is a value between 0 and 1. For instance, a value of
* 0.9 indicates that 90% of the variance present in the data was retained.
*
* @param data Data matrix.
* @param newDimension New dimension of the data.
@@ -81,7 +83,7 @@ class PCA
}

/**
* Use PCA for dimensionality reduction on the given dataset. This will save
* Use PCA for dimensionality reduction on the given dataset. This will save
* as many dimensions as necessary to retain at least the given amount of
* variance (specified by parameter varRetained). The amount should be
* between 0 and 1; if the amount is 0, then only 1 dimension will be
@@ -97,8 +99,8 @@ class PCA
*/
double Apply(arma::mat& data, const double varRetained);

//! Get whether or not this PCA object will scale (by standard deviation) the
//! data when PCA is performed.
//! Get whether or not this PCA object will scale (by standard deviation)
//! the data when PCA is performed.
bool ScaleData() const { return scaleData; }
//! Modify whether or not this PCA object will scale (by standard deviation)
//! the data when PCA is performed.
@@ -110,9 +112,11 @@ class PCA
{
if (scaleData)
{
// Scaling the data is when we reduce the variance of each dimension to 1.
// We do this by dividing each dimension by its standard deviation.
arma::vec stdDev = arma::stddev(centeredData, 0, 1 /* for each dimension */);
// Scaling the data is when we reduce the variance of each dimension
// to 1. We do this by dividing each dimension by its standard
// deviation.
arma::vec stdDev = arma::stddev(
centeredData, 0, 1 /* for each dimension */);

// If there are any zeroes, make them very small.
for (size_t i = 0; i < stdDev.n_elem; ++i)
@@ -131,6 +135,8 @@ class PCA
DecompositionPolicy decomposition;
}; // class PCA
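
The `scaleData` branch earlier in this file's diff divides every dimension by its standard deviation, with `arma::stddev(centeredData, 0, 1)` computing the N - 1 normalized deviation of each row. The loop that replaces zero deviations is cut off in the diff, so the sketch below is only an approximation of the idea: `ScaleRows` and the `tinyValue` default are assumptions, not the values used in mlpack.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense matrix: m[row][col]. Hypothetical alias for illustration.
using Matrix = std::vector<std::vector<double>>;

// Scale each row (dimension) of an already centered matrix to unit sample
// variance. A zero standard deviation is replaced by tinyValue so that the
// division stays defined; tinyValue is an assumed placeholder.
void ScaleRows(Matrix& centered, const double tinyValue = 1e-50)
{
  for (auto& row : centered)
  {
    if (row.size() < 2)
      continue;  // The sample standard deviation needs at least two points.

    // The row is centered, so its mean is zero and the variance is just the
    // sum of squares divided by (N - 1).
    double sumSq = 0.0;
    for (const double x : row)
      sumSq += x * x;
    double stdDev = std::sqrt(sumSq / static_cast<double>(row.size() - 1));

    if (stdDev == 0.0)
      stdDev = tinyValue;

    for (double& x : row)
      x /= stdDev;
  }
}
```

A constant dimension stays all zeros after scaling, since 0 divided by the tiny replacement value is still 0.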

//! 3.0.0 TODO: break reverse-compatibility by changing PCAType to PCA.
typedef PCAType<ExactSVDPolicy> PCA;

} // namespace pca
} // namespace mlpack
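
A likely motivation for renaming the template to `PCAType` and adding the `PCA` typedef: before C++17, a class template with a defaulted parameter still has to be written as `PCA<>`, so existing code declaring `PCA p;` would stop compiling if `PCA` itself became a template. The sketch below illustrates the pattern with stand-in names (`ExactPolicy`, `RandomizedPolicy`, `PCATypeSketch`, `PCASketch`), all of which are hypothetical and much simpler than the real classes.

```cpp
#include <string>

// Hypothetical stand-in policies; the real ones wrap SVD routines.
struct ExactPolicy
{
  std::string Name() const { return "exact"; }
};

struct RandomizedPolicy
{
  std::string Name() const { return "randomized"; }
};

// Policy-based class: the decomposition technique is a template parameter.
template<typename DecompositionPolicy = ExactPolicy>
class PCATypeSketch
{
 public:
  explicit PCATypeSketch(
      const bool scaleData = false,
      const DecompositionPolicy& decomposition = DecompositionPolicy()) :
      scaleData(scaleData),
      decomposition(decomposition)
  { }

  bool ScaleData() const { return scaleData; }
  std::string Method() const { return decomposition.Name(); }

 private:
  bool scaleData;
  DecompositionPolicy decomposition;
};

// Old-style code that names the non-template type keeps compiling.
typedef PCATypeSketch<ExactPolicy> PCASketch;
```

Callers who want another technique name it explicitly, for example `PCATypeSketch<RandomizedPolicy> r;`, while `PCASketch p(true);` behaves like the pre-refactoring class.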
31 changes: 19 additions & 12 deletions src/mlpack/methods/pca/pca_impl.hpp
@@ -1,6 +1,8 @@
/**
* @file pca.cpp
Contributor

I think it should be pca_impl.hpp

Member Author

Nice catch, thanks!

* @author Ajinkya Kale
* @author Ryan Curtin
* @author Marcus Edel
*
* Implementation of PCA class to perform Principal Components Analysis on the
* specified data set.
@@ -18,8 +20,8 @@ namespace mlpack {
namespace pca {

template<typename DecompositionPolicy>
PCA<DecompositionPolicy>::PCA(const bool scaleData,
const DecompositionPolicy& decomposition) :
PCAType<DecompositionPolicy>::PCAType(const bool scaleData,
const DecompositionPolicy& decomposition) :
scaleData(scaleData),
decomposition(decomposition)
{ }
@@ -33,10 +35,10 @@ PCA<DecompositionPolicy>::PCA(const bool scaleData,
* @param coeff - PCA Loadings/Coeffs/EigenVectors
*/
template<typename DecompositionPolicy>
void PCA<DecompositionPolicy>::Apply(const arma::mat& data,
arma::mat& transformedData,
arma::vec& eigVal,
arma::mat& coeff)
void PCAType<DecompositionPolicy>::Apply(const arma::mat& data,
arma::mat& transformedData,
arma::vec& eigVal,
arma::mat& coeff)
Contributor

Should this parameter be called "eigvec" or "coeff"? The names differ between pca.hpp and pca_impl.hpp.

{
Timer::Start("pca");

@@ -61,9 +63,9 @@ void PCA<DecompositionPolicy>::Apply(const arma::mat& data,
* @param eigVal - contains eigen values in a column vector
*/
template<typename DecompositionPolicy>
void PCA<DecompositionPolicy>::Apply(const arma::mat& data,
arma::mat& transformedData,
arma::vec& eigVal)
void PCAType<DecompositionPolicy>::Apply(const arma::mat& data,
arma::mat& transformedData,
arma::vec& eigVal)
{
arma::mat coeffs;
Apply(data, transformedData, eigVal, coeffs);
@@ -81,7 +83,8 @@ void PCA<DecompositionPolicy>::Apply(const arma::mat& data,
* @return Amount of the variance of the data retained (between 0 and 1).
*/
template<typename DecompositionPolicy>
double PCA<DecompositionPolicy>::Apply(arma::mat& data, const size_t newDimension)
double PCAType<DecompositionPolicy>::Apply(arma::mat& data,
const size_t newDimension)
{
// Parameter validation.
if (newDimension == 0)
@@ -95,6 +98,8 @@ double PCA<DecompositionPolicy>::Apply(arma::mat& data, const size_t newDimensio
arma::mat coeffs;
arma::vec eigVal;

Timer::Start("pca");

// Center the data into a temporary matrix.
arma::mat centeredData;
math::Center(data, centeredData);
@@ -112,6 +117,8 @@ double PCA<DecompositionPolicy>::Apply(arma::mat& data, const size_t newDimensio
// the right dimension before calculating the amount of variance retained.
double eigDim = std::min(newDimension - 1, (size_t) eigVal.n_elem - 1);

Timer::Stop("pca");

// Calculate the total amount of variance retained.
return (sum(eigVal.subvec(0, eigDim)) / sum(eigVal));
}
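
The return value above is the share of total variance carried by the first `newDimension` eigenvalues, with the index clamped so that asking for more dimensions than exist simply keeps everything. A small standard-library sketch of that computation, assuming the eigenvalues are sorted in descending order and sum to a positive value (`VarianceRetained` is a hypothetical name):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Fraction of the total variance kept by the first newDimension eigenvalues.
// Assumes newDimension >= 1, eigVal is non-empty, sorted in descending order,
// and has a positive sum. Hypothetical helper for illustration only.
double VarianceRetained(const std::vector<double>& eigVal,
                        const std::size_t newDimension)
{
  // Index of the last retained eigenvalue, clamped to the valid range.
  const std::size_t last = std::min(newDimension - 1, eigVal.size() - 1);

  const double kept = std::accumulate(
      eigVal.begin(),
      eigVal.begin() + static_cast<std::ptrdiff_t>(last + 1),
      0.0);
  const double total = std::accumulate(eigVal.begin(), eigVal.end(), 0.0);
  return kept / total;
}
```

For eigenvalues 6, 1.5 and 0.5, keeping one dimension retains 0.75 of the variance and keeping two retains 0.9375.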
@@ -127,7 +134,8 @@ double PCA<DecompositionPolicy>::Apply(arma::mat& data, const size_t newDimensio
* always be greater than or equal to the varRetained parameter.
*/
template<typename DecompositionPolicy>
double PCA<DecompositionPolicy>::Apply(arma::mat& data, const double varRetained)
double PCAType<DecompositionPolicy>::Apply(arma::mat& data,
const double varRetained)
{
// Parameter validation.
if (varRetained < 0)
@@ -159,7 +167,6 @@ double PCA<DecompositionPolicy>::Apply(arma::mat& data, const double varRetained
return varSum;
}
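
The body of this overload is collapsed in the diff. Going only by its documentation (keep as many dimensions as needed to reach `varRetained`, and never fewer than one), the selection could look roughly like the sketch below. This is an assumption-based illustration, not the code in the PR; `DimensionsForVariance` is a hypothetical name.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Pick the smallest number of leading eigenvalues whose share of the total
// variance reaches varRetained, keeping at least one dimension.
// Assumes eigVal is non-empty, sorted in descending order, with positive sum.
// The achieved fraction is written to varSum.
std::size_t DimensionsForVariance(const std::vector<double>& eigVal,
                                  const double varRetained,
                                  double& varSum)
{
  const double total = std::accumulate(eigVal.begin(), eigVal.end(), 0.0);

  std::size_t dims = 0;
  varSum = 0.0;
  do
  {
    varSum += eigVal[dims] / total;  // Add the next component's share.
    ++dims;
  } while (varSum < varRetained && dims < eigVal.size());

  return dims;
}
```

Because whole components are added, the achieved fraction is at least `varRetained` whenever `varRetained` is at most 1.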


} // namespace pca
} // namespace mlpack

8 changes: 7 additions & 1 deletion src/mlpack/methods/pca/pca_main.cpp
@@ -46,7 +46,7 @@ void RunPCA(arma::mat& dataset,
const size_t scale,
const double varToRetain)
{
PCA<DecompositionPolicy> p(scale);
PCAType<DecompositionPolicy> p(scale);

Log::Info << "Performing PCA on dataset..." << endl;
double varRetained;
@@ -112,6 +112,12 @@ int main(int argc, char** argv)
{
RunPCA<QUICSVDPolicy>(dataset, newDimension, scale, varToRetain);
}
Member

I think if a user picks an invalid decomposition policy, no error is issued? I am not sure if I am reading this right, but if so I think we should add an else to catch the error. :)

Member Author

Ah, right, changed in 1127e61.

else
{
// Invalid decomposition method.
Log::Fatal << "Invalid decomposition method ('" << decompositionMethod
<< "'); valid choices are 'exact', 'randomized', 'quic'." << endl;
}
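
The point of the added `else` is that an unrecognized method name must fail loudly rather than fall through and produce no output. A stripped-down sketch of the same dispatch follows, using an exception as a stand-in for `Log::Fatal`; the `Decomposition` enum and `ParseDecomposition` are hypothetical names, not part of mlpack.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical enum standing in for the three decomposition policies.
enum class Decomposition { Exact, Randomized, Quic };

// Map the command-line string to a decomposition, rejecting anything else.
Decomposition ParseDecomposition(const std::string& method)
{
  if (method == "exact")
    return Decomposition::Exact;
  else if (method == "randomized")
    return Decomposition::Randomized;
  else if (method == "quic")
    return Decomposition::Quic;

  // Invalid decomposition method: report it instead of silently continuing.
  throw std::invalid_argument("Invalid decomposition method ('" + method +
      "'); valid choices are 'exact', 'randomized', 'quic'.");
}
```

Validating the string in one place like this also keeps the list of valid choices in the error message next to the branches that implement them.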

// Now save the results.
string outputFile = CLI::GetParam<string>("output_file");