[GSOC]DatasetMapper & Imputer #694
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -369,18 +369,17 @@ bool Load(const std::string& filename, | |
return success; | ||
} | ||
|
||
// Load with mappings and policy. | ||
// Load with mappings. Unfortunately we have to implement this ourselves. | ||
template<typename eT, typename PolicyType> | ||
bool Load(const std::string& filename, | ||
arma::Mat<eT>& matrix, | ||
DatasetMapper<PolicyType>& info, | ||
PolicyType& policy, | ||
const bool fatal, | ||
const bool transpose) | ||
{ | ||
// Get the extension and load as necessary. | ||
Timer::Start("loading_data"); | ||
Log::Debug << "Load with Policy" << std::endl; | ||
|
||
// Get the extension. | ||
std::string extension = Extension(filename); | ||
|
||
|
@@ -412,7 +411,7 @@ bool Load(const std::string& filename, | |
type = "raw ASCII-formatted data"; | ||
|
||
Log::Info << "Loading '" << filename << "' as " << type << ". " | ||
<< std::flush; | ||
<< std::endl; | ||
std::string separators; | ||
if (commas) | ||
separators = ","; | ||
|
@@ -447,14 +446,12 @@ bool Load(const std::string& filename, | |
if (transpose) | ||
{ | ||
matrix.set_size(cols, rows); | ||
Log::Debug << "initialize datasetmapper with policy" << std::endl; | ||
info = DatasetMapper<PolicyType>(policy, cols); | ||
info = DatasetMapper<PolicyType>(info.Policy(), cols); | ||
I think it's possible the problem I am thinking of existed in earlier versions of mlpack, but what happens if I do this:
Ideally, the mappings from the first load should be preserved and used for the second load, but the lines here make me think that is not what is happening. We should probably add a test for this situation to ensure that mapping information from a previous load is not destroyed. |
||
} | ||
else | ||
{ | ||
matrix.set_size(rows, cols); | ||
Log::Debug << "initialize datasetmapper with policy" << std::endl; | ||
info = DatasetMapper<PolicyType>(policy, rows); | ||
info = DatasetMapper<PolicyType>(info.Policy(), rows); | ||
} | ||
|
||
stream.close(); | ||
|
@@ -499,7 +496,7 @@ bool Load(const std::string& filename, | |
else if (extension == "arff") | ||
{ | ||
Log::Info << "Loading '" << filename << "' as ARFF dataset. " | ||
<< std::flush; | ||
<< std::endl; | ||
try | ||
{ | ||
LoadARFF(filename, matrix, info); | ||
|
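The concern in the review comment on the `info = DatasetMapper<PolicyType>(info.Policy(), ...)` lines can be illustrated with a toy sketch. Everything below is hypothetical: `ToyMapper`, `LoadResetting`, and `LoadPreserving` are illustrative names, not mlpack's API, and the code is not the snippet the reviewer had in mind. It only shows why reassigning the mapper inside a load would discard mappings learned from an earlier load.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy mapper (NOT mlpack's DatasetMapper): each new string seen
// in a dimension is assigned the next unused index for that dimension.
class ToyMapper
{
 public:
  explicit ToyMapper(const std::size_t dims = 0) : dimensions(dims) { }

  // Return the index for `token`, creating a mapping if none exists yet.
  std::size_t MapString(const std::string& token, const std::size_t dimension)
  {
    assert(dimension < dimensions);
    std::map<std::string, std::size_t>& forward = maps[dimension];
    const auto it = forward.find(token);
    if (it != forward.end())
      return it->second;
    const std::size_t value = forward.size();
    forward.emplace(token, value);
    return value;
  }

 private:
  std::size_t dimensions;
  std::map<std::size_t, std::map<std::string, std::size_t>> maps;
};

// Mimics a load that reassigns the mapper: earlier mappings are thrown away.
inline void LoadResetting(const std::vector<std::string>& column,
                          ToyMapper& info)
{
  info = ToyMapper(1);
  for (const std::string& token : column)
    info.MapString(token, 0);
}

// Mimics a load that keeps the mapper it was given.
inline void LoadPreserving(const std::vector<std::string>& column,
                           ToyMapper& info)
{
  for (const std::string& token : column)
    info.MapString(token, 0);
}
```

With the resetting variant, a string mapped to index 1 by a first load can come back as index 0 after a second load, which is the situation a regression test would need to cover.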
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,7 +24,8 @@ namespace data { | |
class MissingPolicy | ||
{ | ||
public: | ||
typedef size_t mapped_type; | ||
// typedef of mapped_type | ||
using mapped_type = size_t; | ||
|
||
MissingPolicy() | ||
This looks quite weird; it looks like my suggestion confused you, sorry about that. Could you show me some examples and explain what effect you want to achieve with MissingPolicy? Thanks.
It would be great if you could add some comments about what the function does and the parameters used.
It would be great if you could use the doxygen commands like
Updated. |
||
{ | ||
|
@@ -48,9 +49,10 @@ class MissingPolicy | |
// If this condition is true, either we have no mapping for the given string | ||
// or we have no mappings for the given dimension at all. In either case, | ||
// we create a mapping. | ||
Log::Debug << "missingSet has: " << missingSet.count(string) << std::endl; | ||
if (missingSet.count(string) != 0 && | ||
We could avoid this check if you added everything from
The roles of missingSet and maps are different.
I see what you mean. My thinking was, everything in
The problem is that maps has to be handled per dimension, whereas missingSet applies to all of the dimensions. Studying Fast C++ CSV Parser, the code makes me really appreciate the expressive power and performance that Spirit gives us.
Ah, I didn't think of that. I agree now, putting everything from I agree, too, Boost Spirit seems really cool; I need to play with the code in your PR to see if we can preserve decent compile times. |
||
maps.count(dimension) == 0 || | ||
maps[dimension].first.left.count(string) == 0) | ||
(maps.count(dimension) == 0 || | ||
maps[dimension].first.left.count(string) == 0)) | ||
{ | ||
// This string does not exist yet. | ||
size_t& numMappings = maps[dimension].second; | ||
|
@@ -62,6 +64,7 @@ class MissingPolicy | |
else | ||
{ | ||
// This string already exists in the mapping. | ||
Log::Debug << "string already exists in the mapping" << std::endl; | ||
return maps[dimension].first.left.at(string); | ||
} | ||
} | ||
|
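The parentheses added around the two `maps` checks in the hunk above change how the condition groups, because `&&` binds more tightly than `||` in C++. The sketch below is a hypothetical reduction, assuming the pre-change condition was the unparenthesized form shown in the diff; `ToyMaps`, `Conditions`, and the helper functions are illustrative names and use plain `std::map` rather than the bimap in the real class.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the per-dimension mappings.
using ToyMaps = std::map<std::size_t, std::map<std::string, std::size_t>>;

// The three sub-conditions that appear in the if statement.
struct Conditions
{
  bool isMissing;    // missingSet.count(string) != 0
  bool noDimension;  // maps.count(dimension) == 0
  bool noString;     // no mapping for the string in that dimension
};

inline Conditions Evaluate(const std::set<std::string>& missingSet,
                           const ToyMaps& maps,
                           const std::string& token,
                           const std::size_t dimension)
{
  const bool isMissing = missingSet.count(token) != 0;
  const bool noDimension = maps.count(dimension) == 0;
  const bool noString =
      noDimension || maps.at(dimension).count(token) == 0;
  return { isMissing, noDimension, noString };
}

// How `A && B || C` is grouped by the compiler: (A && B) || C.
inline bool CreatesMappingBefore(const Conditions& c)
{
  return (c.isMissing && c.noDimension) || c.noString;
}

// The grouping after the change: A && (B || C).
inline bool CreatesMappingAfter(const Conditions& c)
{
  return c.isMissing && (c.noDimension || c.noString);
}

// Usage example: an ordinary token ("5") versus a missing-value token ("a").
inline void PrecedenceExample()
{
  const std::set<std::string> missingSet = { "a" };
  const ToyMaps maps{};

  const Conditions ordinary = Evaluate(missingSet, maps, "5", 0);
  assert(CreatesMappingBefore(ordinary));   // mapped despite not being missing
  assert(!CreatesMappingAfter(ordinary));   // only missing tokens get mapped

  const Conditions missing = Evaluate(missingSet, maps, "a", 0);
  assert(CreatesMappingBefore(missing));
  assert(CreatesMappingAfter(missing));
}
```

Under these assumptions, the old grouping creates a mapping for any unseen string, while the new grouping creates one only when the string is also in `missingSet`.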
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,22 +39,34 @@ BOOST_AUTO_TEST_CASE(DatasetMapperImputerTest) | |
|
||
arma::mat input; | ||
arma::mat output; | ||
string missingValue = "a"; | ||
double customValue = 99; | ||
size_t feature = 0; | ||
size_t dimension = 0; | ||
|
||
DatasetInfo info; | ||
std::set<string> mset; | ||
mset.insert("a"); | ||
MissingPolicy miss(mset); | ||
DatasetMapper<MissingPolicy> info(miss); | ||
BOOST_REQUIRE(data::Load("test_file.csv", input, info) == true); | ||
|
||
BOOST_REQUIRE_EQUAL(input.n_rows, 3); | ||
BOOST_REQUIRE_EQUAL(input.n_cols, 3); | ||
|
||
/* TODO: Connect Load with the new DatasetMapper instead of DatasetInfo*/ | ||
|
||
//Imputer<double, | ||
//DatasetInfo, | ||
//CustomImputation<double>> impu(info); | ||
//impu.Impute(input, output, missingValue, customValue, feature); | ||
Imputer<double, | ||
DatasetMapper<MissingPolicy>, | ||
CustomImputation<double>> imputer(info); | ||
imputer.Impute(input, output, "a", 99, dimension); // convert a -> 99 | ||
|
||
BOOST_REQUIRE_CLOSE(output(0, 0), 99.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(0, 1), 2.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(0, 2), 3.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 0), 5.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 1), 6.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 2), 7.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 0), 8.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 1), 9.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 2), 10.0, 1e-5); | ||
|
||
// Remove the file. | ||
Is this not complete yet?
Yes, this is the part where I wanted to test the whole process of loading, mapping, and imputation. |
||
remove("test_file.csv"); | ||
} | ||
|
I think the return type should be PolicyType const&; otherwise the code may not compile (this makes sense: a const member function returning a non-const reference to a data member is weird).
Try to compile the following code; you will find that it does not compile.
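The reviewer's own snippet is not preserved here. As a separate, hypothetical illustration of the same point, the sketch below uses invented names (`ToyPolicy`, `PolicyHolder`) rather than mlpack's classes. Inside a const member function the data member is treated as const, so it cannot bind to a non-const reference return type; the rejected form is left as a comment so that the rest compiles.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical policy type, for illustration only.
struct ToyPolicy
{
  int value;
};

// Hypothetical holder with the accessor pair the review comment points toward.
class PolicyHolder
{
 public:
  explicit PolicyHolder(const ToyPolicy& p) : policy(p) { }

  // Read-only access from const objects.
  const ToyPolicy& Policy() const { return policy; }

  // Modifiable access from non-const objects.
  ToyPolicy& Policy() { return policy; }

  // Rejected by the compiler if uncommented: in a const member function
  // `policy` has type `const ToyPolicy`, which cannot bind to `ToyPolicy&`.
  // ToyPolicy& BrokenPolicy() const { return policy; }

 private:
  ToyPolicy policy;
};

// The const overload returns a reference to const, the other a plain one.
static_assert(std::is_same<
    decltype(std::declval<const PolicyHolder&>().Policy()),
    const ToyPolicy&>::value, "const accessor must return const reference");
static_assert(std::is_same<
    decltype(std::declval<PolicyHolder&>().Policy()),
    ToyPolicy&>::value, "non-const accessor returns modifiable reference");

// Usage example: write through the non-const accessor, read through const.
inline void AccessorExample()
{
  PolicyHolder holder(ToyPolicy{3});
  const PolicyHolder& view = holder;
  assert(view.Policy().value == 3);
  holder.Policy().value = 7;
  assert(view.Policy().value == 7);
}
```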