Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added check_input_matrices option to python bindings #2787

Merged
merged 39 commits into from Feb 2, 2021

Conversation

NippunSharma
Copy link
Contributor

@NippunSharma NippunSharma commented Dec 28, 2020

This is in reference to #2679 .
[updated after making option more general]
I have added no_sanity_checks option for the python bindings in this PR. This option checks for nan values inside all the parameters that are of the type arma::mat , arma::colvec and arma::rowvec and throws an error if there is a nan value along with the name of the parameter.
I am working on adding this option inside other bindings as well for which I will open separate PR's.
Since this is my first PR on the automatic binding system please forgive any silly mistakes.

@NippunSharma
Copy link
Contributor Author

Currently the tests are failing because different methods have different identifiers for the training / reference matrix.
Somewhere it is training and somewhere else it is input. To solve this issue I am thinking of a more generalized approach where we check for Nan values inside all parameters that are of the type arma::mat, arma::rowec or arma::colvec and report the name of the parameter that has a Nan value.

@NippunSharma NippunSharma marked this pull request as ready for review December 29, 2020 10:05
@NippunSharma NippunSharma changed the title [WIP] added no_sanity_checks option to python bindings added no_sanity_checks option to python bindings Dec 29, 2020
Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @NippunSharma, looking good! Thank you for taking the time to work on this. I have a handful of comments; hopefully they are helpful. 👍 I think once we get this worked out, it will be easy to add this functionality to the other languages too.

{
std::string paramName = itr->first;
std::string paramType = itr->second.cppType;
if (paramType == "arma::mat")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about arma::Mat<size_t>, arma::Row<size_t>, arma::Col<size_t>, and then the 'categorical matrix' type which is either std::tuple<data::DatasetInfo, arma::mat> or std::tuple<arma::mat, data::DatasetInfo> (I can't remember which---you can check src/mlpack/core/util/param.hpp to see how it is defined in PARAM_MATRIX_WITH_INFO(). 👍)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

/**
* Sanity Check.
*/
void SanityChecks()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about moving this function into the IO class, so that it can be called from any binding, not just the Python bindings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be better, I have moved it

@@ -230,6 +230,8 @@ PARAM_FLAG("copy_all_inputs", "If specified, all input parameters will be deep"
" copied before the method is run. This is useful for debugging problems "
"where the input parameters are being modified by the algorithm, but can "
"slow down the code.", "");
PARAM_FLAG("no_sanity_checks", "If specified, the input matrix is checked for"
" nan values.", "");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
" nan values.", "");
" NaN values; an exception will be thrown if any are found.", "");

@@ -230,6 +230,8 @@ PARAM_FLAG("copy_all_inputs", "If specified, all input parameters will be deep"
" copied before the method is run. This is useful for debugging problems "
"where the input parameters are being modified by the algorithm, but can "
"slow down the code.", "");
PARAM_FLAG("no_sanity_checks", "If specified, the input matrix is checked for"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no_sanity_checks? In this case, if we specify the flag, then we are doing the sanity checks. I might suggest a name like check_input_matrices or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I have corrected it now

@@ -224,6 +235,10 @@ void PrintPYX(const util::BindingDetails& doc,
cout << " IO.SetPassed(<const string> '" << d.name << "')" << endl;
}

// Before calling mlpackMain(), we do a sanity check if needed.
cout << " if not IO.GetParam[cbool](<const string> 'no_sanity_checks'):" << endl;
cout << " SanityChecks()" << endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option is never used elsewhere in the program, so if you wanted, you could only add a block down here where you call SanityChecks(), and then no need to call IO.SetPassed() or SetParam[] or anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not fully understand what you mean here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've added two blocks of code to print_pyx.cpp: the first checks the type of the option and adds it to IO via IO.SetPassed(...). The second checks the value of the parameter via IO.GetParam() and then calls SanityChecks(). What I'm suggesting is that there is no need for two blocks. You can do it all in one block, and there is no need to call SetPassed() or GetParam().

@@ -15,7 +15,7 @@

#include <mlpack/core/util/io.hpp>
#include <mlpack/core/data/dataset_mapper.hpp>

#include <iostream>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added by mistake, removed now

}
else if (paramType == "std::tuple<data::DatasetInfo, arma::mat>")
{
if (std::get<1>(IO::GetParam<std::tuple<data::DatasetInfo, arma::mat>>(paramName)).has_nan())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of lines in this file are longer than 80 characters---do you think you could wrap them please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

// Only "verbose" and "copy_all_inputs" will be persistent.
if (identifier == "verbose" || identifier == "copy_all_inputs")
// Only "verbose", "copy_all_inputs" and "no_sanity_checks" will be persistent.
if (identifier == "verbose" || identifier == "copy_all_inputs" || identifier == "check_input_matrices")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line is longer than 80 characters too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -267,3 +267,50 @@ void IO::ClearSettings()
GetSingleton().aliases = persistentAliases;
GetSingleton().functionMap = persistentFunctions;
}

void IO::SanityChecks()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to change this to CheckInputMatrices() to match the name of the option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it would be better here, changed it to CheckInputMatrices()

@@ -64,8 +64,8 @@ class PyOption
data.required = required;
data.input = input;
data.loaded = false;
// Only "verbose" and "copy_all_inputs" will be persistent.
if (identifier == "verbose" || identifier == "copy_all_inputs")
// Only "verbose", "copy_all_inputs" and "no_sanity_checks" will be persistent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Only "verbose", "copy_all_inputs" and "no_sanity_checks" will be persistent.
// Only "verbose", "copy_all_inputs" and "check_input_matrices" will be persistent.

:)

@NippunSharma NippunSharma changed the title added no_sanity_checks option to python bindings added check_input_matrices option to python bindings Jan 4, 2021
Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @NippunSharma, looking great! Do you think you could add a test for this functionality to src/mlpack/bindings/python/tests/?

std::string errMsg = "The input " + paramName + " has NaN values.";
if (paramType == "arma::mat")
{
if (IO::GetParam<arma::Mat<double>>(paramName).has_nan())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably check for Inf too---you could use .is_finite() for that. http://arma.sourceforge.net/docs.html#is_finite 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is a good idea, I have added the same. I decided to .has_inf() just for uniformity with .has_nan(). I have also added separate tests for the nan and inf values in src/mlpack/bindings/python/tests/test_python_binding.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NippunSharma maybe now it makes sense to split out the body of each if statement into a templated function? The bodies of each if statement are basically identical, so it would be great if we could capture that identical functionality in one standalone function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that will reduce the repetitive code. I have added CheckInputMatrix() function that is called inside the CheckInputMatrices() function.

@NippunSharma
Copy link
Contributor Author

@rcurtin you were correct. PARAM_COL() gives a arma::vec. I rechecked the type of matrix in all cases from here, all types seem to be correct now. However I still see errors while running the tests locally. The matrix_in is checked correctly for NaN or inf but there is some problem with the unsigned matrix input. I am looking into the problem and will try to correct it ASAP.

double_in=4.0,
mat_req_in=[[1.0]],
col_req_in=[1.0],
umatrix_in=x,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this one might not work because there is no way to represent NaN or Inf with an arma::Mat<size_t>. So I think perhaps the checks are only necessary for double-type matrices or vectors.

@NippunSharma
Copy link
Contributor Author

NippunSharma commented Jan 25, 2021

@rcurtin I have removed the tests for arma::Mat<size_t>, arma::Row<size_t> and arma::Col<size_t>. The tests are passing locally. One change was that paramType for matrix_and_info_in is actually given as TUPLE_TYPE and not std::tuple<data::DatasetInfo, arma::mat>. I have made that change in the code so that we check for TUPLE_TYPE

{
IO::CheckInputMatrix(IO::GetParam<arma::mat>(paramName), paramName);
}
else if (paramType == "arma::Mat<size_t>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now you can remove these if statements, since it's not possible to find a NaN or Inf in a matrix that has type size_t. 👍

IO::CheckInputMatrix(
IO::GetParam<arma::Row<size_t>>(paramName), paramName);
}
else if (paramType == "TUPLE_TYPE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should hunt down why this is TUPLE_TYPE; ideally it should actually be std::tuple<mlpack::data::DatasetInfo, arma::mat>. I think it's because PARAM_MATRIX_AND_INFO_IN() is defined like this:

#define TUPLE_TYPE std::tuple<mlpack::data::DatasetInfo, arma::mat>
#define PARAM_MATRIX_AND_INFO_IN(ID, DESC, ALIAS) \
    PARAM_IN(TUPLE_TYPE, ID, DESC, ALIAS, TUPLE_TYPE(), false)

(That's in src/mlpack/core/util/param.hpp.) What happens if you inline that TUPLE_TYPE definition? Then we should hopefully get the right result here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not able to fully understand what you are saying here. Should I remove the line #define TUPLE_TYPE std::tuple<mlpack::data::DatasetInfo, arma::mat> line and use std::tuple<mlpack::DatasetInfo, arma::mat> directly instead of using TUPLE_TYPE ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, exactly---give that a shot, and then I think paramType will be std::tuple<mlpack::data::DatasetInfo, arma::mat>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rcurtin the following error is thrown locally after I make this change:
macro "PARAM_IN" passed 8 arguments, but takes just 6
I guess this is because the macro is taking the tuple type as 2 arguments instead of 1. Is this supposed to happen ? I have pushed the changes can you take a look ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right... that must be why I defined TUPLE_TYPE to begin with. Perhaps we can't use the PARAM_IN() macro then. Take a look at how PARAM_MATRIX() is defined:

  #define PARAM_MATRIX(ID, DESC, ALIAS, REQ, TRANS, IN) \
      static mlpack::util::Option<arma::mat> \
      JOIN(io_option_dummy_matrix_, __COUNTER__) \
      (arma::mat(), ID, DESC, ALIAS, "arma::mat", \
      REQ, IN, !TRANS, testName);

You can see that it just constructs an Option<arma::mat> object, and the fifth argument is the name of the type that will be param.cppType. So instead of using PARAM_IN to define PARAM_MATRIX_AND_INFO_IN(), do you think you could try an implementation like that of PARAM_MATRIX instead, which explicitly specifies the fifth parameter to Option<>?

@NippunSharma
Copy link
Contributor Author

@rcurtin I have added PARAM_IN_WITH_NAME in param.hpp that takes an additional parameter NAME that will be the cppType. This solution seems to work correctly and tests are passing locally. Can you take a look ?

#define PARAM_IN_WITH_NAME(T, ID, DESC, ALIAS, NAME, DEF, REQ) \
static mlpack::util::Option<T> \
JOIN(io_option_dummy_object_in_, __COUNTER__) \
(DEF, ID, DESC, ALIAS, NAME, REQ, true, false, testName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this looks good! To keep the code maintainable (and minimal), can you adapt PARAM_COL(), PARAM_UCOL(), etc. to be expressed in terms of this macro?

@NippunSharma
Copy link
Contributor Author

I have renamed the PARAM_IN_WITH_NAME() to PARAM_COMPLETE() and made it more general so that we can write all other macros in terms of it. The tests are passing locally.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NippunSharma! Really cool that we were even able to reduce code duplication in param.hpp. I have a couple tiny comments; up to you how you want to handle them before merge. 👍

src/mlpack/core/util/io.cpp Outdated Show resolved Hide resolved
@@ -1223,51 +1258,10 @@ using DatasetInfo = DatasetMapper<IncrementPolicy, std::string>;
* @param REQ Whether or not parameter is required (boolean value).
*/
#ifdef __COUNTER__
#define PARAM_IN(T, ID, DESC, ALIAS, DEF, REQ) \
#define PARAM_COMPLETE(T, ID, DESC, ALIAS, NAME, REQ, IN, TRANS, DEF) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, but honestly I think it would be okay to just give this the simpler name of PARAM. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just afraid that PARAM would be already taken up by some other macro but since the tests are passing now I think it is not taken up and definitely better to use here.

NippunSharma and others added 2 commits January 29, 2021 18:04
Co-authored-by: Ryan Curtin <ryan@ratml.org>
Copy link

@mlpack-bot mlpack-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

Copy link
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this looks good too me as well.

@rcurtin rcurtin merged commit 041617d into mlpack:master Feb 2, 2021
@rcurtin
Copy link
Member

rcurtin commented Feb 2, 2021

Thanks @NippunSharma! 💯

This was referenced Oct 14, 2022
@rcurtin rcurtin mentioned this pull request Oct 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
PR Tracking
Need Review
Development

Successfully merging this pull request may close these issues.

None yet

3 participants