Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSOC]DatasetMapper & Imputer #694

Merged
merged 47 commits into from Jul 25, 2016
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
87c05a5
concept work for imputer
keon Jun 1, 2016
2e4b1a8
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 6, 2016
631e59e
do not to use NaN by default, let the user specify
keon Jun 6, 2016
391006e
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 6, 2016
6a1fb81
add template to datasetinfo and add imputer class
keon Jun 12, 2016
b0c5224
clean datasetinfo class and rename files
keon Jun 13, 2016
de35241
implement basic imputation strategies
keon Jun 13, 2016
2d38604
modify imputer_main and clean logs
keon Jun 13, 2016
bb045b8
add parameter verification for imputer_main
keon Jun 13, 2016
1295f4b
add custom strategy to impute_main
keon Jun 13, 2016
5a517c2
add datatype change in IncrementPolicy
keon Jun 14, 2016
94b7a5c
update types used in datasetinfo
keon Jun 14, 2016
ebed68f
initialize imputer with parameters
keon Jun 14, 2016
db78f39
remove datatype in dataset_info
keon Jun 15, 2016
7c60b97
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 15, 2016
da4e409
add test for imputer
keon Jun 15, 2016
d8618ec
restructure, add listwise deletion & imputer tests
keon Jun 18, 2016
3b8ffd0
fix transpose problem
keon Jun 27, 2016
90a5cd2
Merge pull request #7 from mlpack/master
keon Jun 27, 2016
32c8a73
merge
keon Jun 27, 2016
e09d9bc
updates and fixes on imputation methods
keon Jun 28, 2016
87d8d46
update data::load to accept different mappertypes
keon Jul 1, 2016
de0b2db
update data::load to accept different policies
keon Jul 1, 2016
bc187ca
add imputer doc
keon Jul 1, 2016
a340f69
debug median imputation and listwise deletion
keon Jul 2, 2016
21d94c0
remove duplicate code in load function
keon Jul 2, 2016
a92afaa
delete load overload
keon Jul 3, 2016
bace8b2
modify MapToNumerical to work with MissingPolicy
keon Jul 4, 2016
896a018
MissingPolicy uses NaN instead of numbers
keon Jul 4, 2016
1a908c2
fix reference issue in DatasetMapper
keon Jul 4, 2016
2edbc40
Move MapToNumerical(MapTokens) to Policy class
keon Jul 5, 2016
d881cb7
make policy and imputation api more consistent
keon Jul 5, 2016
a881831
numerical values can be set as missing values
keon Jul 6, 2016
63268a3
add comments and use more proper names
keon Jul 7, 2016
2eb6754
modify custom impute interface and rename variables
keon Jul 10, 2016
6d43aa3
add input-only overloads to imputation methods
keon Jul 10, 2016
fedc5e0
update median imputation to exclude missing values
keon Jul 11, 2016
787fd82
optimize imputation methods with output overloads
keon Jul 18, 2016
a0b7d59
expressive comments in imputation_test
keon Jul 18, 2016
9a6dce7
shorten imputation tests
keon Jul 18, 2016
c3aeba1
optimize preprocess imputer executable
keon Jul 18, 2016
028c217
fix bugs in imputation test
keon Jul 18, 2016
03e19a4
add more comments and delete impute_test.csv
keon Jul 22, 2016
ef4536b
Merge pull request #8 from mlpack/master
keon Jul 22, 2016
6e2c1ff
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jul 22, 2016
5eb9abd
fix PARAM statements in imputer
keon Jul 22, 2016
d043235
delete Impute() overloads that produce output matrix
keon Jul 23, 2016
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/mlpack/core/data/dataset_info_impl.hpp
Expand Up @@ -37,7 +37,7 @@ inline typename PolicyType::mapped_type DatasetMapper<PolicyType>::MapString(
const std::string& string,
const size_t dimension)
{
return policy.template MapString<MapType>(maps, types, string, dimension);
return policy.template MapString<MapType>(string, dimension, maps, types);
}

// Return the string corresponding to a value in a given dimension.
Expand Down
12 changes: 6 additions & 6 deletions src/mlpack/core/data/imputation_methods/custom_imputation.hpp
Expand Up @@ -19,12 +19,12 @@ template <typename T>
class CustomImputation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can provide some comments on what this class is and what it does (also for the Impute() function), I think it would be really helpful for users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

{
public:
void Apply(const arma::Mat<T>& input,
arma::Mat<T>& output,
const T& mappedValue,
const T& customValue,
const size_t dimension,
const bool transpose = true)
void Impute(const arma::Mat<T>& input,
arma::Mat<T>& output,
const T& mappedValue,
const T& customValue,
const size_t dimension,
const bool transpose = true)
{
// initiate output
output = input;
Expand Down
10 changes: 5 additions & 5 deletions src/mlpack/core/data/imputation_methods/listwise_deletion.hpp
Expand Up @@ -23,11 +23,11 @@ template <typename T>
class ListwiseDeletion
{
public:
void Apply(const arma::Mat<T>& input,
arma::Mat<T>& output,
const T& mappedValue,
const size_t dimension,
const bool transpose = true)
void Impute(const arma::Mat<T>& input,
arma::Mat<T>& output,
const T& mappedValue,
const size_t dimension,
const bool transpose = true)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have to reimplement the function, we could reuse the function from above: Impute(input, input, mappedValue, ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now Impute(input, input, mappedValue, ...) is implemented by Impute(input, mappedValue, ...). Benefit of Impute(input, mappedValue, ...) is the users can reuse input, save the cost of allocating new memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the reason to provide the interface, but we can reuse the implementation inside of the simplified function.

// initiate output
output = input;
Expand Down
Expand Up @@ -22,7 +22,7 @@ template <typename T>
class MeanImputation
{
public:
void Apply (const arma::Mat<T>& input,
void Impute(const arma::Mat<T>& input,
arma::Mat<T>& output,
const T& mappedValue,
const size_t dimension,
Expand Down
Expand Up @@ -22,7 +22,7 @@ template <typename T>
class MedianImputation
{
public:
void Apply (const arma::Mat<T>& input,
void Impute(const arma::Mat<T>& input,
arma::Mat<T>& output,
const T& mappedValue,
const size_t dimension,
Expand Down
10 changes: 3 additions & 7 deletions src/mlpack/core/data/imputer.hpp
Expand Up @@ -60,7 +60,7 @@ class Imputer
const size_t dimension)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that many times the user will want to impute without copying the matrix; this will especially make a difference when the matrix is large. Should we also provide an overload with just one matrix that will be modified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

{
T mappedValue = static_cast<T>(mapper.UnmapValue(missingValue, dimension));
strategy.Apply(input, output, mappedValue, dimension, transpose);
strategy.Impute(input, output, mappedValue, dimension, transpose);
}

/**
Expand All @@ -74,12 +74,8 @@ class Imputer
const size_t dimension)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overload seems like it is necessary specifically for the CustomImputation strategy, and allows the user to pass the customValue parameter. But it seems like a user could accomplish the same thing by passing an instantiated CustomImputation to the constructor, if CustomImputation held as a local variable the customValue. I think that would allow you to remove this function and simplify the code; what do you think?

Certainly in the examples for the imputer, we should document how someone can use the CustomImputation class as an example, since it would be a little more complex than just calling this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea. I'll update this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

{
T mappedValue = static_cast<T>(mapper.UnmapValue(missingValue, dimension));
strategy.Apply(input,
output,
mappedValue,
customValue,
dimension,
transpose);
strategy.Impute(input, output, mappedValue, customValue, dimension,
transpose);
}

//! Get the strategy
Expand Down
12 changes: 6 additions & 6 deletions src/mlpack/core/data/map_policies/increment_policy.hpp
Expand Up @@ -28,10 +28,10 @@ class IncrementPolicy
using mapped_type = size_t;

template <typename MapType>
mapped_type MapString(MapType& maps,
std::vector<Datatype>& types,
const std::string& string,
const size_t dimension)
mapped_type MapString(const std::string& string,
const size_t dimension,
MapType& maps,
std::vector<Datatype>& types)
{
// If this condition is true, either we have no mapping for the given string
// or we have no mappings for the given dimension at all. In either case,
Expand Down Expand Up @@ -79,8 +79,8 @@ class IncrementPolicy
{
for (size_t i = 0; i != tokens.size(); ++i)
{
const eT val = static_cast<eT>(this->MapString(maps, types, tokens[i],
row));
const eT val = static_cast<eT>(this->MapString(tokens[i], row, maps,
types));
double temp = (double) val;
matrix.at(row, i) = val;
}
Expand Down
12 changes: 6 additions & 6 deletions src/mlpack/core/data/map_policies/missing_policy.hpp
Expand Up @@ -41,10 +41,10 @@ class MissingPolicy


template <typename MapType>
mapped_type MapString(MapType& maps,
std::vector<Datatype>& types,
const std::string& string,
const size_t dimension)
mapped_type MapString(const std::string& string,
const size_t dimension,
MapType maps,
std::vector<Datatype>& types)
{
// If this condition is true, either we have no mapping for the given string
// or we have no mappings for the given dimension at all. In either case,
Expand Down Expand Up @@ -84,8 +84,8 @@ class MissingPolicy
token>>matrix.at(row, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discover potential problems.
What if users set the missing value as numerical value?
maybe we should check the tokens[i] is part of the missingSet or not

if(missingSet.find(tokens[i]) != std::end(missingSet)){
   const eT val = static_cast<eT>(this->MapString(maps, types, tokens[i],row));
   matrix.at(row, i) = val;
}else{
   token.str(tokens[i]);
   token>>matrix.at(row, i);
   if (token.fail()) // if not number, map it to datasetmapper
   {       
        const eT val = static_cast<eT>(this->MapString(maps, types, tokens[i],
                                                       row));
        matrix.at(row, i) = val;
    }
    token.clear();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats brilliant, thanks!

if (token.fail()) // if not number, map it to datasetmapper
{
const eT val = static_cast<eT>(this->MapString(maps, types, tokens[i],
row));
const eT val = static_cast<eT>(this->MapString(tokens[i], row, maps,
types));
matrix.at(row, i) = val;
}
token.clear();
Expand Down
16 changes: 8 additions & 8 deletions src/mlpack/tests/imputation_test.cpp
Expand Up @@ -99,7 +99,7 @@ BOOST_AUTO_TEST_CASE(CustomImputationTest)
CustomImputation<double> imputer;

// transposed
imputer.Apply(input, outputT, mappedValue, customValue, 0/*dimension*/, true);
imputer.Impute(input, outputT, mappedValue, customValue, 0/*dimension*/, true);

BOOST_REQUIRE_CLOSE(outputT(0, 0), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(outputT(0, 1), 99.0, 1e-5);
Expand All @@ -115,7 +115,7 @@ BOOST_AUTO_TEST_CASE(CustomImputationTest)
BOOST_REQUIRE_CLOSE(outputT(2, 3), 8.0, 1e-5);

// not transposed
imputer.Apply(input, output, mappedValue, customValue, 1, false);
imputer.Impute(input, output, mappedValue, customValue, 1, false);

BOOST_REQUIRE_CLOSE(output(0, 0), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(0, 1), 99.0, 1e-5);
Expand Down Expand Up @@ -146,7 +146,7 @@ BOOST_AUTO_TEST_CASE(MeanImputationTest)
MeanImputation<double> imputer;

// transposed
imputer.Apply(input, outputT, mappedValue, 0, true);
imputer.Impute(input, outputT, mappedValue, 0, true);

BOOST_REQUIRE_CLOSE(outputT(0, 0), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(outputT(0, 1), 2.5, 1e-5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No matter how I calculate, the mean value should be 5.66666667
3+5+9+6+8+2+4+6+8 = 51
51/9 = 5.66666667

Maybe it is because to you forgot to initialize sum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current implementation applies only to the given dimension (0 in this case)
so, 0 is ignored and mean is going to be (3 + 2) / 2 = 2.5
and it replaces 0 with 2.5.

Expand All @@ -162,7 +162,7 @@ BOOST_AUTO_TEST_CASE(MeanImputationTest)
BOOST_REQUIRE_CLOSE(outputT(2, 3), 8.0, 1e-5);

// not transposed
imputer.Apply(input, output, mappedValue, 1, false);
imputer.Impute(input, output, mappedValue, 1, false);

BOOST_REQUIRE_CLOSE(output(0, 0), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(0, 1), 7.0, 1e-5);
Expand Down Expand Up @@ -193,7 +193,7 @@ BOOST_AUTO_TEST_CASE(MedianImputationTest)
MedianImputation<double> imputer;

// transposed
imputer.Apply(input, outputT, mappedValue, 1, true);
imputer.Impute(input, outputT, mappedValue, 1, true);

BOOST_REQUIRE_CLOSE(outputT(0, 0), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(outputT(0, 1), 0.0, 1e-5);
Expand All @@ -209,7 +209,7 @@ BOOST_AUTO_TEST_CASE(MedianImputationTest)
BOOST_REQUIRE_CLOSE(outputT(2, 3), 8.0, 1e-5);

// not transposed
imputer.Apply(input, output, mappedValue, 1, false);
imputer.Impute(input, output, mappedValue, 1, false);

BOOST_REQUIRE_CLOSE(output(0, 0), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(0, 1), 6.0, 1e-5);
Expand Down Expand Up @@ -240,7 +240,7 @@ BOOST_AUTO_TEST_CASE(ListwiseDeletionTest)
ListwiseDeletion<double> imputer;

// transposed
imputer.Apply(input, outputT, mappedValue, 0, true); // transposed
imputer.Impute(input, outputT, mappedValue, 0, true); // transposed

BOOST_REQUIRE_CLOSE(outputT(0, 0), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(outputT(0, 1), 2.0, 1e-5);
Expand All @@ -250,7 +250,7 @@ BOOST_AUTO_TEST_CASE(ListwiseDeletionTest)
BOOST_REQUIRE_CLOSE(outputT(2, 1), 4.0, 1e-5);

// not transposed
imputer.Apply(input, output, mappedValue, 1, false); // not transposed
imputer.Impute(input, output, mappedValue, 1, false); // not transposed

BOOST_REQUIRE_CLOSE(output(0, 0), 5.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(0, 1), 6.0, 1e-5);
Expand Down