
[GSOC]DatasetMapper & Imputer #694

Merged
47 commits, merged Jul 25, 2016.

Commits:
87c05a5
concept work for imputer
keon Jun 1, 2016
2e4b1a8
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 6, 2016
631e59e
do not to use NaN by default, let the user specify
keon Jun 6, 2016
391006e
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 6, 2016
6a1fb81
add template to datasetinfo and add imputer class
keon Jun 12, 2016
b0c5224
clean datasetinfo class and rename files
keon Jun 13, 2016
de35241
implement basic imputation strategies
keon Jun 13, 2016
2d38604
modify imputer_main and clean logs
keon Jun 13, 2016
bb045b8
add parameter verification for imputer_main
keon Jun 13, 2016
1295f4b
add custom strategy to impute_main
keon Jun 13, 2016
5a517c2
add datatype change in IncrementPolicy
keon Jun 14, 2016
94b7a5c
update types used in datasetinfo
keon Jun 14, 2016
ebed68f
initialize imputer with parameters
keon Jun 14, 2016
db78f39
remove datatype in dataset_info
keon Jun 15, 2016
7c60b97
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 15, 2016
da4e409
add test for imputer
keon Jun 15, 2016
d8618ec
restructure, add listwise deletion & imputer tests
keon Jun 18, 2016
3b8ffd0
fix transpose problem
keon Jun 27, 2016
90a5cd2
Merge pull request #7 from mlpack/master
keon Jun 27, 2016
32c8a73
merge
keon Jun 27, 2016
e09d9bc
updates and fixes on imputation methods
keon Jun 28, 2016
87d8d46
update data::load to accept different mappertypes
keon Jul 1, 2016
de0b2db
update data::load to accept different policies
keon Jul 1, 2016
bc187ca
add imputer doc
keon Jul 1, 2016
a340f69
debug median imputation and listwise deletion
keon Jul 2, 2016
21d94c0
remove duplicate code in load function
keon Jul 2, 2016
a92afaa
delete load overload
keon Jul 3, 2016
bace8b2
modify MapToNumerical to work with MissingPolicy
keon Jul 4, 2016
896a018
MissingPolicy uses NaN instead of numbers
keon Jul 4, 2016
1a908c2
fix reference issue in DatasetMapper
keon Jul 4, 2016
2edbc40
Move MapToNumerical(MapTokens) to Policy class
keon Jul 5, 2016
d881cb7
make policy and imputation api more consistent
keon Jul 5, 2016
a881831
numerical values can be set as missing values
keon Jul 6, 2016
63268a3
add comments and use more proper names
keon Jul 7, 2016
2eb6754
modify custom impute interface and rename variables
keon Jul 10, 2016
6d43aa3
add input-only overloads to imputation methods
keon Jul 10, 2016
fedc5e0
update median imputation to exclude missing values
keon Jul 11, 2016
787fd82
optimize imputation methods with output overloads
keon Jul 18, 2016
a0b7d59
expressive comments in imputation_test
keon Jul 18, 2016
9a6dce7
shorten imputation tests
keon Jul 18, 2016
c3aeba1
optimize preprocess imputer executable
keon Jul 18, 2016
028c217
fix bugs in imputation test
keon Jul 18, 2016
03e19a4
add more comments and delete impute_test.csv
keon Jul 22, 2016
ef4536b
Merge pull request #8 from mlpack/master
keon Jul 22, 2016
6e2c1ff
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jul 22, 2016
5eb9abd
fix PARAM statements in imputer
keon Jul 22, 2016
d043235
delete Impute() overloads that produce output matrix
keon Jul 23, 2016
10 changes: 10 additions & 0 deletions src/mlpack/core/data/dataset_info.hpp
@@ -37,7 +37,13 @@ class DatasetMapper
*/
explicit DatasetMapper(const size_t dimensionality = 0);

/**
* Create the DatasetMapper object with the given policy and dimensionality.
* Note that the dimensionality cannot be changed later; you will have to
* create a new DatasetMapper object. Policy can be modified by the modifier.
*/
explicit DatasetMapper(PolicyType& policy, const size_t dimensionality = 0);

/**
* Given the string and the dimension to which it belongs, return its numeric
* mapping. If no mapping yet exists, the string is added to the list of
@@ -101,8 +107,12 @@ class DatasetMapper
ar & data::CreateNVP(maps, "maps");
}

//! Return the policy of the mapper.
PolicyType& Policy() const;
Contributor:

I think the return type should be `PolicyType const&`, or else the compiler may not be able to compile it (this makes sense: a const member function returning a non-const reference to a data member is weird).

Try to compile the following code and you will find it does not compile:

struct testClass
{
    int& getValue() const
    {
        return a; // error: cannot bind int& to 'a', which is const inside a const member function
    }

    int a;
};
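The diff below already follows the conventional fix: a const accessor for reading paired with a non-const overload for modification. A minimal compilable sketch of that pattern — the class name is hypothetical, and the const overload returns `const&` as the reviewer suggests:

```cpp
#include <cassert>

// Const/non-const accessor pair: the const overload returns a const
// reference (read-only access), the non-const overload allows modification.
// Hypothetical class, sketching the pattern discussed in the review.
class Holder
{
 public:
  const int& Value() const { return value; }  // compiles: const& from a const member
  int& Value() { return value; }              // modifiable access
 private:
  int value = 0;
};
```

Calling `Value()` on a const object selects the first overload; on a non-const object, the second.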


//! Modify the policy of the mapper (be careful!).
PolicyType& Policy();

private:
//! Types of each dimension.
std::vector<Datatype> types;
7 changes: 7 additions & 0 deletions src/mlpack/core/data/dataset_info_impl.hpp
@@ -121,6 +121,13 @@ inline PolicyType& DatasetMapper<PolicyType>::Policy() const
return this->policy;
}

template<typename PolicyType>
inline PolicyType& DatasetMapper<PolicyType>::Policy()
{
return this->policy;
}


} // namespace data
} // namespace mlpack

46 changes: 0 additions & 46 deletions src/mlpack/core/data/load.hpp
@@ -96,52 +96,6 @@ bool Load(const std::string& filename,
arma::Mat<eT>& matrix,
DatasetMapper<PolicyType>& info,
const bool fatal = false,
const bool transpose = true)
{
PolicyType policy;
return Load(filename, matrix, info, policy, fatal, transpose);
}

/**
* Loads a matrix from a file, guessing the filetype from the extension and
* mapping categorical features with a DatasetMapper object. This will
* transpose the matrix (unless the transpose parameter is set to false).
* This particular overload of Load() can only load text-based formats, such as
* those given below:
*
* - CSV (csv_ascii), denoted by .csv, or optionally .txt
* - TSV (raw_ascii), denoted by .tsv, .csv, or .txt
* - ASCII (raw_ascii), denoted by .txt
*
* If the file extension is not one of those types, an error will be given.
* This is preferable to Armadillo's default behavior of loading an unknown
* filetype as raw_binary, which can have very confusing effects.
*
* If the parameter 'fatal' is set to true, a std::runtime_error exception will
* be thrown if the matrix does not load successfully. The parameter
* 'transpose' controls whether or not the matrix is transposed after loading.
* In most cases, because data is generally stored in a row-major format and
* mlpack requires column-major matrices, this should be left at its default
* value of 'true'.
*
* The DatasetMapper object passed to this function will be re-created, so any
* mappings from previous loads will be lost. policy is passed to the
* constructor of DatasetMapper to create a new instance.
*
* @param filename Name of file to load.
* @param matrix Matrix to load contents of file into.
* @param info DatasetMapper object to populate with mappings and data types.
* @param policy Policy class that decides how the DatasetMapper should map.
* @param fatal If an error should be reported as fatal (default false).
* @param transpose If true, transpose the matrix after loading.
* @return Boolean value indicating success or failure of load.
*/
template<typename eT, typename PolicyType>
bool Load(const std::string& filename,
arma::Mat<eT>& matrix,
DatasetMapper<PolicyType>& info,
PolicyType& policy,
const bool fatal = false,
const bool transpose = true);

/**
15 changes: 6 additions & 9 deletions src/mlpack/core/data/load_impl.hpp
@@ -369,18 +369,17 @@ bool Load(const std::string& filename,
return success;
}

// Load with mappings and policy.
// Load with mappings. Unfortunately we have to implement this ourselves.
template<typename eT, typename PolicyType>
bool Load(const std::string& filename,
arma::Mat<eT>& matrix,
DatasetMapper<PolicyType>& info,
PolicyType& policy,
const bool fatal,
const bool transpose)
{
// Get the extension and load as necessary.
Timer::Start("loading_data");
Log::Debug << "Load with Policy" << std::endl;

// Get the extension.
std::string extension = Extension(filename);

@@ -412,7 +411,7 @@ bool Load(const std::string& filename,
type = "raw ASCII-formatted data";

Log::Info << "Loading '" << filename << "' as " << type << ". "
<< std::flush;
<< std::endl;
std::string separators;
if (commas)
separators = ",";
@@ -447,14 +446,12 @@ bool Load(const std::string& filename,
if (transpose)
{
matrix.set_size(cols, rows);
Log::Debug << "initialize datasetmapper with policy" << std::endl;
info = DatasetMapper<PolicyType>(policy, cols);
info = DatasetMapper<PolicyType>(info.Policy(), cols);
Member:

I think it's possible the problem I am thinking of existed in earlier versions of mlpack, but what happens if I do this:

arma::mat trainingSet, testSet;
DatasetInfo d;
data::Load("training_set.csv", trainingSet, d);
data::Load("test_set.csv", testSet, d);

Ideally the mappings from the first load should be preserved and used for the second load, but the lines here make me think that is not what is happening. I think probably we should add a test for this situation, to ensure that mapping information from a previous load is not destroyed.
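The concern above — that mappings from the first load should survive the second — can be sketched with a toy mapper (illustrative only, not mlpack's DatasetMapper): MapString() reuses an existing id when the string has been seen before, so a second load extends the table instead of rebuilding it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy string->id mapper; names are hypothetical, not mlpack's API.
struct TinyMapper
{
  std::map<std::string, std::size_t> ids;

  std::size_t MapString(const std::string& s)
  {
    auto it = ids.find(s);
    if (it != ids.end())
      return it->second;                // mapping from an earlier load is reused
    const std::size_t id = ids.size();  // otherwise assign the next free id
    ids[s] = id;
    return id;
  }
};
```

A regression test for the scenario in the comment would map a category during the first "load", map more strings afterward, and assert the first category still maps to the same id.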

}
else
{
matrix.set_size(rows, cols);
Log::Debug << "initialize datasetmapper with policy" << std::endl;
info = DatasetMapper<PolicyType>(policy, rows);
info = DatasetMapper<PolicyType>(info.Policy(), rows);
}

stream.close();
@@ -499,7 +496,7 @@ bool Load(const std::string& filename,
else if (extension == "arff")
{
Log::Info << "Loading '" << filename << "' as ARFF dataset. "
<< std::flush;
<< std::endl;
try
{
LoadARFF(filename, matrix, info);
3 changes: 2 additions & 1 deletion src/mlpack/core/data/map_policies/increment_policy.hpp
@@ -24,7 +24,8 @@ namespace data {
class IncrementPolicy
{
public:
typedef size_t mapped_type;
// typedef of mapped_type
using mapped_type = size_t;

template <typename MapType>
mapped_type MapString(MapType& maps,
9 changes: 6 additions & 3 deletions src/mlpack/core/data/map_policies/missing_policy.hpp
@@ -24,7 +24,8 @@ namespace data {
class MissingPolicy
{
public:
typedef size_t mapped_type;
// typedef of mapped_type
using mapped_type = size_t;

MissingPolicy()
Contributor:

This looks quite weird — it seems my suggestion confused you, sorry about that. Could you show me some examples explaining what effect you want to achieve with MissingPolicy? Thanks.

Member:

It would be great if you could add some comments about what the function does and the parameters it uses.

Member:

It would be great if you could use Doxygen commands like @param for the parameters.

Member Author:

Updated

{
@@ -48,9 +49,10 @@ class MissingPolicy
// If this condition is true, either we have no mapping for the given string
// or we have no mappings for the given dimension at all. In either case,
// we create a mapping.
Log::Debug << "missingSet has: " << missingSet.count(string) << std::endl;
if (missingSet.count(string) != 0 &&
Member:

We could avoid this check if you added everything from missingSet to maps[dimension] in the constructor. That might give some noticeable speedup (I am not sure about that).

Contributor:

The roles of missingSet and maps are different:
missingSet specifies which values/strings we should map;
maps records the values/strings we have already mapped.

Member:

I see what you mean. My thinking was, everything in missingSet is something we might expect to be in maps by the end of loading. So you can save a little bit of time by just putting everything in missingSet into maps pre-emptively. The code as it is written here only adds things into maps from missingSet when it is seen, but I am not sure it is important to have a different, standalone missingSet. What do you think?

Contributor:

The problem is that maps has a separate mapping per dimension to handle, while missingSet applies to all of the dimensions. Studying the Fast C++ CSV Parser, the code made me really appreciate the expressive power and performance Spirit gives us.

Member:

Ah, I didn't think of that. I agree now — putting everything from missingSet into maps is not the right thing.

I agree, too, Boost Spirit seems really cool; I need to play with the code in your PR to see if we can preserve decent compile times.
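Conceptually, the policy under discussion maps any token listed in missingSet to NaN and parses everything else numerically (per the "MissingPolicy uses NaN instead of numbers" commit). A stdlib-only sketch under that assumption — the function name and signature are illustrative, not the real MapString() interface:

```cpp
#include <cmath>
#include <limits>
#include <set>
#include <string>

// Tokens found in missingSet become NaN; every other token is parsed as
// a number. Simplified stand-in for the policy, not mlpack's actual API.
double MapToken(const std::set<std::string>& missingSet,
                const std::string& token)
{
  if (missingSet.count(token) != 0)
    return std::numeric_limits<double>::quiet_NaN();
  return std::stod(token);
}
```

Because every missing token collapses to the same NaN sentinel, only membership in missingSet matters — which is why the per-dimension maps and the global missingSet stay separate.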

maps.count(dimension) == 0 ||
maps[dimension].first.left.count(string) == 0)
(maps.count(dimension) == 0 ||
maps[dimension].first.left.count(string) == 0))
{
// This string does not exist yet.
size_t& numMappings = maps[dimension].second;
@@ -62,6 +64,7 @@ class MissingPolicy
else
{
// This string already exists in the mapping.
Log::Debug << "string already exists in the mapping" << std::endl;
return maps[dimension].first.left.at(string);
}
}
2 changes: 1 addition & 1 deletion src/mlpack/methods/preprocess/preprocess_imputer_main.cpp
@@ -98,7 +98,7 @@ int main(int argc, char** argv)
Log::Debug << "initalize info(policy)" << endl;
DatasetMapper<MissingPolicy> info(policy);

Load<double, MissingPolicy>(inputFile, input, info, policy, true, true);
Load(inputFile, input, info, true, true);

// for testing purpose
Log::Info << input << endl;
28 changes: 20 additions & 8 deletions src/mlpack/tests/imputation_test.cpp
@@ -39,22 +39,34 @@ BOOST_AUTO_TEST_CASE(DatasetMapperImputerTest)

arma::mat input;
arma::mat output;
string missingValue = "a";
double customValue = 99;
size_t feature = 0;
size_t dimension = 0;

DatasetInfo info;
std::set<string> mset;
mset.insert("a");
MissingPolicy miss(mset);
DatasetMapper<MissingPolicy> info(miss);
BOOST_REQUIRE(data::Load("test_file.csv", input, info) == true);

BOOST_REQUIRE_EQUAL(input.n_rows, 3);
BOOST_REQUIRE_EQUAL(input.n_cols, 3);

/* TODO: Connect Load with the new DatasetMapper instead of DatasetInfo*/

//Imputer<double,
//DatasetInfo,
//CustomImputation<double>> impu(info);
//impu.Impute(input, output, missingValue, customValue, feature);
Imputer<double,
DatasetMapper<MissingPolicy>,
CustomImputation<double>> imputer(info);
imputer.Impute(input, output, "a", 99, dimension); // convert a -> 99

BOOST_REQUIRE_CLOSE(output(0, 0), 99.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(0, 1), 2.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(0, 2), 3.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(1, 0), 5.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(1, 1), 6.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(1, 2), 7.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(2, 0), 8.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(2, 1), 9.0, 1e-5);
BOOST_REQUIRE_CLOSE(output(2, 2), 10.0, 1e-5);

// Remove the file.
Contributor:

Hasn't this been completed?

Member Author:

Yeah, this is the part where I wanted to test the whole process of loading, mapping, and imputation.
I will at least make some mock data and test only the mapping and imputation for this.

remove("test_file.csv");
}
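The custom imputation exercised by this test ("a" mapped to a missing value, then replaced with 99) can be sketched without Armadillo: replace every missing entry (NaN here) in the chosen dimension with the user-supplied value. Function and parameter names are illustrative, not mlpack's Imputer API.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Replace NaN entries in one dimension (one row of the column-major data)
// with a custom value. Illustrative sketch of the custom-imputation
// strategy, not mlpack code.
void CustomImpute(std::vector<std::vector<double>>& data,
                  const std::size_t dimension,
                  const double customValue)
{
  for (double& v : data[dimension])
    if (std::isnan(v))
      v = customValue;
}
```

Entries in other dimensions are left untouched, matching the per-dimension behavior the test asserts above.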