New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSOC]DatasetMapper & Imputer #694
Changes from 1 commit
87c05a5
2e4b1a8
631e59e
391006e
6a1fb81
b0c5224
de35241
2d38604
bb045b8
1295f4b
5a517c2
94b7a5c
ebed68f
db78f39
7c60b97
da4e409
d8618ec
3b8ffd0
90a5cd2
32c8a73
e09d9bc
87d8d46
de0b2db
bc187ca
a340f69
21d94c0
a92afaa
bace8b2
896a018
1a908c2
2edbc40
d881cb7
a881831
63268a3
2eb6754
6d43aa3
fedc5e0
787fd82
a0b7d59
9a6dce7
c3aeba1
028c217
03e19a4
ef4536b
6e2c1ff
5eb9abd
d043235
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -65,34 +65,17 @@ void MapToNumerical(const std::vector<std::string>& tokens, | |
DatasetMapper<PolicyType>& info, | ||
arma::Mat<eT>& matrix) | ||
{ | ||
auto notNumber = [](const std::string& str) | ||
std::stringstream token; | ||
for (size_t i = 0; i != tokens.size(); ++i) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice idea, thanks for the enhancement 👍 |
||
{ | ||
eT val(0); | ||
std::stringstream token; | ||
token.str(str); | ||
token>>val; | ||
return token.fail(); | ||
}; | ||
|
||
const bool notNumeric = std::any_of(std::begin(tokens), | ||
std::end(tokens), notNumber); | ||
if(notNumeric) | ||
{ | ||
for(size_t i = 0; i != tokens.size(); ++i) | ||
token.str(tokens[i]); | ||
token>>matrix.at(row, i); | ||
if (token.fail()) // if not number, map it to datasetmapper | ||
{ | ||
const eT val = static_cast<eT>(info.MapString(tokens[i], row)); | ||
matrix.at(row, i) = val; | ||
} | ||
} | ||
else | ||
{ | ||
std::stringstream token; | ||
for(size_t i = 0; i != tokens.size(); ++i) | ||
{ | ||
token.str(tokens[i]); | ||
token>>matrix.at(row, i); | ||
token.clear(); | ||
} | ||
token.clear(); | ||
} | ||
} | ||
|
||
|
@@ -411,7 +394,7 @@ bool Load(const std::string& filename, | |
type = "raw ASCII-formatted data"; | ||
|
||
Log::Info << "Loading '" << filename << "' as " << type << ". " | ||
<< std::endl; | ||
<< std::flush; | ||
std::string separators; | ||
if (commas) | ||
separators = ","; | ||
|
@@ -496,7 +479,7 @@ bool Load(const std::string& filename, | |
else if (extension == "arff") | ||
{ | ||
Log::Info << "Loading '" << filename << "' as ARFF dataset. " | ||
<< std::endl; | ||
<< std::flush; | ||
try | ||
{ | ||
LoadARFF(filename, matrix, info); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,13 +29,12 @@ class MissingPolicy | |
|
||
MissingPolicy() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks quite weird, looks like my suggestion confuse you, sorry about that. Could you show me some examples, explain what kind of effects you want to achieve by MissingPolicy?Thanks There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would be great if you could add some comments about what the function does and the parameter used. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would be great if you could use the doxygen commands like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated |
||
{ | ||
Log::Debug << "MissingPolicy()" << std::endl; | ||
// Nothing to initialize here. | ||
} | ||
|
||
explicit MissingPolicy(std::set<std::string> missingSet) : | ||
missingSet(std::move(missingSet)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is picky, but this line should be indented two tabs instead of just one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated |
||
{ | ||
Log::Debug << "MissingPolicy(missingSet)" << std::endl; | ||
// Nothing to initialize here. | ||
} | ||
|
||
|
@@ -49,7 +48,6 @@ class MissingPolicy | |
// If this condition is true, either we have no mapping for the given string | ||
// or we have no mappings for the given dimension at all. In either case, | ||
// we create a mapping. | ||
Log::Debug << "missingSet has: " << missingSet.count(string) << std::endl; | ||
if (missingSet.count(string) != 0 && | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could avoid this check if you added everything from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The role of missingSet and maps are different. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see what you mean. My thinking was, everything in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Problem is maps have different dimension need to handle, but missingSet apply to all of the dimensions. Studying Fast C++ CSV Parser, the codes make me very appreciate the expression power and performance spirit give us. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I didn't think of that. I agree now, putting everything from I agree, too, Boost spirit seems really cool, I need to play with the code in your PR to see if we can preserve decent compile times. |
||
(maps.count(dimension) == 0 || | ||
maps[dimension].first.left.count(string) == 0)) | ||
|
@@ -64,7 +62,6 @@ class MissingPolicy | |
else | ||
{ | ||
// This string already exists in the mapping. | ||
Log::Debug << "string already exists in the mapping" << std::endl; | ||
return maps[dimension].first.left.at(string); | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,7 +33,7 @@ BOOST_AUTO_TEST_CASE(DatasetMapperImputerTest) | |
fstream f; | ||
f.open("test_file.csv", fstream::out); | ||
f << "a, 2, 3" << endl; | ||
f << "5, 6, 7" << endl; | ||
f << "5, 6, b" << endl; | ||
f << "8, 9, 10" << endl; | ||
f.close(); | ||
|
||
|
@@ -43,6 +43,7 @@ BOOST_AUTO_TEST_CASE(DatasetMapperImputerTest) | |
|
||
std::set<string> mset; | ||
mset.insert("a"); | ||
mset.insert("b"); | ||
MissingPolicy miss(mset); | ||
DatasetMapper<MissingPolicy> info(miss); | ||
BOOST_REQUIRE(data::Load("test_file.csv", input, info) == true); | ||
|
@@ -56,15 +57,16 @@ BOOST_AUTO_TEST_CASE(DatasetMapperImputerTest) | |
DatasetMapper<MissingPolicy>, | ||
CustomImputation<double>> imputer(info); | ||
imputer.Impute(input, output, "a", 99, dimension); // convert a -> 99 | ||
imputer.Impute(input, output, "b", 99, dimension); // convert b -> 99 | ||
|
||
BOOST_REQUIRE_CLOSE(output(0, 0), 99.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(0, 1), 2.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(0, 2), 3.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 0), 5.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(0, 1), 5.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(0, 2), 8.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 0), 2.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 1), 6.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 2), 7.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 0), 8.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 1), 9.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(1, 2), 9.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 0), 3.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 1), 99.0, 1e-5); | ||
BOOST_REQUIRE_CLOSE(output(2, 2), 10.0, 1e-5); | ||
|
||
// Remove the file. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This haven't complete? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yea, this is the part I wanted to test the whole process of Loading, mapping, and imputation. |
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass by reference do not tell the user you want to move the value policy, this may give user a surprise like "oops, I do not meant to change the value of policy, but after calling it, the data of my parameter loss". To make the api more self-explain, we could
1 : pass by value
void DatasetMapper<PolicyType>::Policy(PolicyType policy)
2 : pass by universal reference
Q : What is std::forward?
Ans : This function will convert the parameter with universal reference type to rvalue if the parameter is rvalue, to lvalue if the parameter is lvalue.
Q : Why pass by value
A : Because this would not have any side effect on the parameter(in most case) you pass in, this make codes easier to manage, user will never have a surprise like "oops, my parameter changed by the function"
Q : Why pass by universal reference
A : Because this make the api self-explain and more flexible at the same time.With this api, users have more choices, example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the resource, I now understand the concepts.