[GSOC]Binarize Function + Test #666

Merged
merged 17 commits into from Jun 19, 2016
Changes from 2 commits
1 change: 1 addition & 0 deletions src/mlpack/core/data/CMakeLists.txt
@@ -15,6 +15,7 @@ set(SOURCES
save_impl.hpp
serialization_shim.hpp
split_data.hpp
binarize.hpp
)

# add directory name to sources
76 changes: 76 additions & 0 deletions src/mlpack/core/data/binarize.hpp
@@ -0,0 +1,76 @@
/**
 * @file binarize.hpp
 * @author Keon Kim
 *
 * Defines Binarize(), a utility function that sets values to 0 or 1
 * according to a given threshold.
 */
#ifndef MLPACK_CORE_DATA_BINARIZE_HPP
#define MLPACK_CORE_DATA_BINARIZE_HPP

#include <mlpack/core.hpp>

namespace mlpack {
namespace data {
/**
 * Given an input dataset and threshold, set values greater than threshold to
 * 1 and values less than or equal to the threshold to 0. This overload takes
 * a dimension and applies the changes to the given dimension.
 *
 * @code
 * arma::mat input = loadData();
 * double threshold = 0;
 * size_t dimension = 0;
 *
 * // Binarize the first dimension. All positive values in the first dimension
 * // will be set to 1 and the values less than or equal to 0 will become 0.
 * Binarize(input, threshold, dimension);
 * @endcode
 *
 * @param input Input matrix to Binarize.
 * @param threshold Threshold can be any number.
 * @param dimension Feature to apply the Binarize function.
 */
template<typename T>
void Binarize(arma::Mat<T>& input,
              const double threshold,
              const size_t dimension)
{
  for (size_t i = 0; i < input.n_cols; ++i)
  {
    if (input(dimension, i) > threshold)
      input(dimension, i) = 1;
    else
      input(dimension, i) = 0;
  }
}

/**
 * Given an input dataset and threshold, set values greater than threshold to
 * 1 and values less than or equal to the threshold to 0. This overload applies
 * the changes to all dimensions.
 *
 * @code
 * arma::mat input = loadData();
 * double threshold = 0;
 *
 * // Binarize the whole matrix. All positive values will be set to 1 and
 * // the values less than or equal to 0 will become 0.
 * Binarize(input, threshold);
 * @endcode
 *
 * @param input Input matrix to Binarize.
 * @param threshold Threshold can be any number.
 */
template<typename T>
void Binarize(arma::Mat<T>& input,
Review comment (Member):

Do you think it's reasonable to provide an interface that lets the user specify an output matrix? e.g.

void Binarize(arma::Mat<T>& input, arma::Mat<T>& output, const double threshold)

Reply (Member Author):

Many other functions in mlpack seem to provide this kind of interface,
so yeah, I'll add void Binarize(arma::Mat<T>& input, arma::Mat<T>& output, const double threshold).

const double threshold)
{
for (size_t i = 0; i < input.n_rows; ++i)
Binarize(input, threshold, i);
Review comment (Member):

Armadillo matrices are column-major, but this calculation accesses the matrix in a row-major way. So it would be faster to just loop over all elements in the matrix instead of calling the other overload of Binarize().

}

} // namespace data
} // namespace mlpack

#endif
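
The two review suggestions on binarize.hpp (a separate output matrix, and a single loop over all elements instead of one call per row) could be combined roughly as follows. This is only a sketch under stated assumptions: `ColMajorMatrix` is a hypothetical stand-in for `arma::Mat<T>` so the example compiles without Armadillo, and `BinarizeSketch` is an illustrative name, not the overload that was eventually merged. With Armadillo itself, the same idea would iterate over the matrix's `n_elem` elements using linear indexing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for arma::Mat<T>: elements are stored column by
// column in one contiguous buffer, which is how Armadillo lays out a matrix.
template<typename T>
struct ColMajorMatrix
{
  std::size_t n_rows = 0;
  std::size_t n_cols = 0;
  std::vector<T> mem; // Element (r, c) lives at mem[c * n_rows + r].

  ColMajorMatrix() = default;
  ColMajorMatrix(const std::size_t rows, const std::size_t cols) :
      n_rows(rows), n_cols(cols), mem(rows * cols) { }

  T& operator()(const std::size_t r, const std::size_t c)
  {
    return mem[c * n_rows + r];
  }

  const T& operator()(const std::size_t r, const std::size_t c) const
  {
    return mem[c * n_rows + r];
  }
};

// Illustrative overload (assumed name): writes the binarized values to
// `output` and leaves `input` untouched.  The single loop walks the buffer in
// storage order, so memory is read and written sequentially.
template<typename T>
void BinarizeSketch(const ColMajorMatrix<T>& input,
                    ColMajorMatrix<T>& output,
                    const double threshold)
{
  output.n_rows = input.n_rows;
  output.n_cols = input.n_cols;
  output.mem.resize(input.mem.size());

  for (std::size_t i = 0; i < input.mem.size(); ++i)
    output.mem[i] = (input.mem[i] > threshold) ? T(1) : T(0);
}
```

Because the loop never computes a (row, column) index, it visits elements in the order they sit in memory, which is the access pattern the reviewer recommends for column-major storage.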
1 change: 1 addition & 0 deletions src/mlpack/tests/CMakeLists.txt
@@ -8,6 +8,7 @@ add_executable(mlpack_test
allkrann_search_test.cpp
arma_extend_test.cpp
aug_lagrangian_test.cpp
binarize_test.cpp
cf_test.cpp
cli_test.cpp
convolution_test.cpp
78 changes: 78 additions & 0 deletions src/mlpack/tests/binarize_test.cpp
@@ -0,0 +1,78 @@
/**
 * @file binarize_test.cpp
 * @author Keon Kim
 *
 * Test the Binarize method.
 */
#include <mlpack/core.hpp>
#include <mlpack/core/data/binarize.hpp>
#include <mlpack/core/math/random.hpp>

#include <boost/test/unit_test.hpp>
#include "old_boost_test_definitions.hpp"

using namespace mlpack;
using namespace arma;
using namespace mlpack::data;

BOOST_AUTO_TEST_SUITE(BinarizeTest);

/**
 * Compare the binarized data with the expected answer.
 *
 * @param input The data set after Binarize() has been applied.
 * @param answer The expected result to compare against the input.
 */
void CheckAnswer(const mat& input,
                 const umat& answer)
{
  for (size_t i = 0; i < input.n_cols; ++i)
  {
    const mat& lhsCol = input.col(i);
    const umat& rhsCol = answer.col(i);
    for (size_t j = 0; j < lhsCol.n_rows; ++j)
    {
      if (std::abs(rhsCol(j)) < 1e-5)
        BOOST_REQUIRE_SMALL(lhsCol(j), 1e-5);
      else
        BOOST_REQUIRE_CLOSE(lhsCol(j), rhsCol(j), 1e-5);
    }
  }
}

BOOST_AUTO_TEST_CASE(BinarizeThreshold)
{
  mat input(10, 10, fill::randu); // Fill input with random numbers.
  mat constMat(10, 10);
  math::RandomSeed((size_t) std::time(NULL));
Review comment (Member):

We should avoid setting the random seed in the tests; this can make specific test errors really hard to reproduce. What I like to do is set the random seed like you did here and run something like 1000 tests on my local machine to make sure it works, then remove the line that sets the seed.

  double threshold = math::Random(); // Random threshold.
  constMat.fill(threshold);

  umat answer = input > constMat;

  // Binarize every value in the matrix using the random threshold.
  Binarize(input, threshold);

  CheckAnswer(input, answer);
}

/**
* The same test as above, but on a larger dataset.
*/
BOOST_AUTO_TEST_CASE(BinarizeThresholdLargerTest)
{
  mat input(10, 500, fill::randu); // Fill input with random numbers.
  mat constMat(10, 500);
  math::RandomSeed((size_t) std::time(NULL));
  double threshold = math::Random(); // Random threshold.
  constMat.fill(threshold);

  umat answer = input > constMat;

  // Binarize every value in the matrix using the random threshold.
  Binarize(input, threshold);

  CheckAnswer(input, answer);
}

BOOST_AUTO_TEST_SUITE_END();
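
The review note about seeding in the tests above comes down to reproducibility, which can be illustrated with the standard library alone. In this sketch, `DrawUniform` is a hypothetical helper and `std::mt19937` stands in for mlpack's own generator (an assumption made so the example is self-contained): with a fixed seed, the same values come back on every run, so a failing input can be replayed, whereas a seed taken from `std::time(NULL)` changes from run to run.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of mlpack): draws `count` uniform values from
// a generator seeded with `seed`.
std::vector<double> DrawUniform(const unsigned int seed,
                                const std::size_t count)
{
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> dist(0.0, 1.0);

  std::vector<double> values(count);
  for (double& v : values)
    v = dist(gen);

  return values;
}
```

Calling `DrawUniform(42, 100)` twice yields identical vectors, which is the property that makes a failing test case replayable.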