
Add simple cross-validation #1044

Merged
merged 48 commits into from Aug 7, 2017

Conversation

@micyril
Member

micyril commented Jun 30, 2017

Add a simple cross-validation strategy. This patch also includes some common cross-validation infrastructure, which is supposed to be utilized by other prospective cross-validation strategies.
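For reference, the usage proposed in this PR looks roughly like the following (adapted from the tests in this PR; the exact API is what is under review and may still change):

// Cross-validate a LinearRegression model with the MSE metric on a single
// training/validation split of the given data and responses.
SimpleCV<LinearRegression, MSE> cv(0.6, data, responses);
const double mse = cv.Evaluate();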

@rcurtin

This code and definitely the tests look good so far, but I haven't reviewed it fully in depth. I have some high-level questions that I think should probably be addressed first, to make sure that we are understanding everything from the same perspective.

Thank you for your hard work with this! I am excited to work through any issues and be able to integrate this into mlpack.

src/mlpack/core/cv/cv_base.hpp
* @tparam MatType The type of data.
* @tparam PredictionsType The type of predictions (labels/responses).
* @tparam WeightsType The type of weights. It is supposed to be void* when weights
* are not supported.

@rcurtin

rcurtin Jul 1, 2017

Member

In this case, should WeightsType default to void*? That way whoever is using this class upstream (even if it is an internal use) does not have to remember to specify void*.
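Something like this, just to illustrate (a sketch only; the surrounding template parameters are the ones from this PR):

template<typename MLAlgorithm,
         typename MatType,
         typename PredictionsType,
         typename WeightsType = void*>
class CVBase;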

@micyril

micyril Jul 3, 2017

Member

This class template is supposed to be instantiated with the same parameters as its derived classes (e.g., SimpleCV, which indeed has default parameters for MatType, PredictionsType and WeightsType), and it's hard for me to imagine any other usage for CVBase. But anyway, if you would like, we can set a default value for WeightsType.

src/mlpack/core/cv/cv_base.hpp
typename MatType,
typename PredictionsType,
typename WeightsType>
class CVBase

@rcurtin

rcurtin Jul 1, 2017

Member

This is just a clarification question: why do we need a CVBase class? We have SimpleCV that inherits from it, but I don't see anything else yet. I don't mean to say that a CVBase class isn't a good idea, I just don't have enough information to know yet, so I thought I would ask to find out. :)

@micyril

micyril Jul 3, 2017

Member

The main reason to use CVBase is to handle the different Train interfaces across models (algorithms) in mlpack. We have already discussed that some models take a data::DatasetInfo parameter, some take the numClasses parameter, and some support weights. So, when a user passes some of these parameters to a cross-validation class, we want to check that the model really can be created with these parameters.

The next thing is that parameters like datasetInfo or numClasses are not really important for cross-validation (they don't affect how we create the datasets for training and validation in a particular cross-validation strategy). But data points, predictions, and optionally weights are important. CVBase provides an interface to extract these from all of the arguments, regardless of which constructor has been used (whether datasetInfo or numClasses has been passed).

Finally, when it is time to train a model, CVBase takes responsibility for passing parameters like datasetInfo and numClasses (if required) to the MLAlgorithm constructor, so the caller only has to specify which dataset and hyper-parameters to use for training.

You have also noticed that I use private inheritance, which expresses an "is implemented in terms of" rather than an "is a" relationship. For the same type of relationship we could use composition, but my motivation is the following: I want to separate the interface of the CVBase constructors (which is supposed to be used as a reference by any user of mlpack who wants to do cross-validation) from the other facilities that are supposed to be used by developers of cross-validation strategies. For that I use the public and protected keywords respectively. If the same were implemented through composition, we would need to put everything under public.
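In rough outline, the idea is something like this (a much-simplified sketch, not the actual CVBase code; the real dispatch is done with meta-programming):

template<typename MLAlgorithm,
         typename MatType,
         typename PredictionsType,
         typename WeightsType>
class CVBase
{
 public:
  //! For models whose Train() takes only data and predictions.
  CVBase() : numClasses(0) { }

  //! For models whose Train() also needs the number of classes.
  CVBase(const size_t numClasses) : numClasses(numClasses) { }

 protected:
  //! Train a model on an already-prepared training split, forwarding the
  //! stored extra arguments when the algorithm requires them.
  MLAlgorithm Train(const MatType& xs, const PredictionsType& ys)
  {
    // In the real code, whether numClasses (or datasetInfo, or weights) is
    // forwarded here is decided at compile time with SFINAE.
    return MLAlgorithm(xs, ys /*, numClasses, datasetInfo, weights, ... */);
  }

  size_t numClasses;
};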

typename MetaInfoExtractor<MLAlgorithm, MatType>::PredictionsType,
typename WeightsType =
typename MetaInfoExtractor<MLAlgorithm, MatType,
PredictionsType>::WeightsType>

@rcurtin

rcurtin Jul 1, 2017

Member

Hmm, maybe there is something I have overlooked, but two questions here. Is it necessary to take the MatType parameter at the class level? It seems like that should actually only be needed in Evaluate(), and could be deduced based on what the user passes in as arguments. The second question is, is there any situation where PredictionsType and WeightsType cannot be directly derived from MLAlgorithm and MatType? In the way this is set up, the user can set a custom PredictionsType and WeightsType, but I don't know if it is ever valid for them to be able to set anything other than what MetaInfoExtractor will provide.

@micyril

micyril Jul 3, 2017

Member
  1. It is useful to know MatType, PredictionsType and WeightsType before the Evaluate call in order to have an opportunity to prepare the datasets for training and validation, so that repeated calls to Evaluate (which is the case for hyper-parameter tuning) have less computational cost. It's not that useful for the simplest case like SimpleCV, but in general it can make sense (e.g., for k-fold cross-validation). Another reason is to have an API similar to the hyper-parameter tuning one, where it is much more important to pass MatType, PredictionsType and WeightsType into the constructor of a tuner in order to have a nicer interface for passing sets of hyper-parameters.
  2. Yes, there are situations where PredictionsType and WeightsType cannot be directly derived from MLAlgorithm and MatType. A good example is DecisionTree, for which MatType, PredictionsType and WeightsType are not fixed until one of the Train methods is called. So, if I understand the design of DecisionTree correctly, it is possible to use arma::Mat<unsigned char> for PredictionsType or arma::Mat<float> for WeightsType.
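For example, with the template parameter order used in this PR (MLAlgorithm, the metric, MatType, PredictionsType, WeightsType), explicitly overriding the deduced predictions type would look something like this (a hypothetical instantiation, not code from the PR):

// Hypothetical: explicitly choose the predictions type instead of relying on
// the MetaInfoExtractor-based default.
SimpleCV<DecisionTree<>, Accuracy, arma::mat, arma::Row<unsigned char>>
    cv(0.2, data, labels, numClasses);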

@rcurtin

rcurtin Jul 10, 2017

Member

Ah, I see---thank you for the clarification. :)

SimpleCV<LinearRegression, MSE> cv(0.6, data, responses);
BOOST_REQUIRE_CLOSE(cv.Evaluate(), expectedMSE, 1e-5);

@rcurtin

rcurtin Jul 1, 2017

Member

This is related to the earlier comment about whether or not to take the MatType/WeightsType/PredictionsType parameters at class construction time, but I think this interface would be nicer to use if the user could basically just do

SimpleCV<LinearRegression, MSE> cv(0.6);
cv.Evaluate(data, responses);

since that way they could evaluate cross-validation on multiple different datasets (if for some reason they wanted to do that). What do you think?

@micyril

micyril Jul 3, 2017

Member

Again, see the comment above.

@rcurtin

rcurtin Jul 10, 2017

Member

Right, so correct me if I am wrong. My understanding from what you wrote in the previous case is that it is beneficial for the case of hyper-parameter tuning to have the cross-validation class store the dataset internally, so it does not need to be split multiple times. That could save some computation. But there is no need to save the same computation for the SimpleCV class as seen by the users. So do you think it would be better to refactor a little bit, so that the CVBase::Evaluate() method takes an already-set-up dataset (in whatever form you choose, a doubled dataset or whatever else)? Then a constructor for the hyper-parameter tuner could assemble the dataset in the right format and store it internally (thus saving computation), but SimpleCV would only need to do it once for a dataset.

The reason I am focusing on this bit so much is that I think it is a lot more intuitive for a user to pass the dataset/labels/weights to the Evaluate() function, instead of the constructor. This "feels" more like the rest of the mlpack interface, and also allows someone to reuse a single cross-validation object for doing cross-validation on multiple datasets.

@micyril

micyril Jul 11, 2017

Member

Maybe we should provide both interfaces? I guess the one you would like to have can be implemented through a static version of Evaluate (SimpleCV::Evaluate).

@rcurtin

rcurtin Jul 11, 2017

Member

Were you planning on having the hyperparameter tuner directly use SimpleCV? In that case I think both overloads make sense, otherwise if SimpleCV is meant to be used just by users, I think that the interface where the dataset/labels/weights are passed to Evaluate() will be a lot more intuitive for users. I'm not sure a static version of Evaluate() could be used, because it will depend on the validationSize parameter that's given to the constructor of SimpleCV; correct me if I am wrong there.

@micyril

micyril Jul 11, 2017

Member

I thought it would make sense to pass validationSize into Evaluate too, since with a new dataset a user might want to use another splitting.

@micyril

micyril Jul 11, 2017

Member

I was planning that the hyper-parameter tuner would possess a cross-validation object like a SimpleCV object to assess different sets of hyper-parameters.

@rcurtin

rcurtin Jul 12, 2017

Member

I decided to move the discussion about the API out here, since Github gives a little bit of a bigger text box for it... This is a big and difficult discussion, because cross-validation and the hyper-parameter tuner mean that a lot of different methods in mlpack must agree on the same basic API and structure, and that is a difficult problem. Basically, we are forced to find a way to unify everything, and as we get further in this project, I think we keep finding more things that are not unified... That unfortunately means that the scope of my comment here is quite large, not just the CV code.

I would like to see the "standard" mlpack algorithm look like this:

  • Hyperparameters are passed to the constructor, but not to Train().
  • Hyperparameters may also be modified by calling a function with the hyperparameter name.
  • Constructor versions do exist that also take Train() arguments followed by hyperparameters.

So, for instance, take LARS:

// constructor that takes only hyperparameters
LARS(const bool useCholesky = false,
     const double lambda1 = 0.0,
     const double lambda2 = 0.0,
     const double tolerance = 1e-16);

// Train() that does not take hyperparameters
void Train(const arma::mat& data,
           const arma::rowvec& responses,
           const bool transposeData = true);

So we could use this like this:

LARS l;
// An important note:
// This means users don't have to specify *every* hyperparameter, they can only set the ones they want.
l.Tolerance() = 1e-10;
l.Train(dataset, responses);

// But, to do it all in one line, users can also specify all hyperparameters and the training data.
LARS l2(dataset, responses, true, false, 0.0, 0.0, 1e-10);

I know that there are a lot of mlpack methods that do not follow this style exactly, and this needs to be fixed. I will open an issue for that.

Because of this, I'd also like to be able to use the cross-validation code in the same way. But there are some extra complexities. I see that there are three types of runtime parameters we must worry about for SimpleCV:

  1. The cross-validation parameters (in this case only validationSize).
  2. The dataset/labels/weights.
  3. The algorithm's configuration hyperparameters (i.e. minimumLeafSize for decision trees, etc.).

Right now, SimpleCV operates like the following:

SimpleCV<Algorithm, Measure> cv(/* type 1: cv configuration */, /* type 2: dataset/labels/weights */);
cv.Evaluate(/* type 3: algorithm configuration */);

But this is different than the idea for "standard" mlpack algorithms I proposed above, where the type 3 parameters go in the constructor of the algorithm. Therefore, this leads me to suggest something else:

Algorithm a(/* type 3 parameters */);
SimpleCV<Algorithm, Measure> cv(a /* pass untrained algorithm */, /* type 1: cv configuration */);
cv.Evaluate(/* type 2 parameters: dataset/weights/labels */);

In this situation we have actually avoided the confusion of asking the user to pass algorithm hyperparameters and CV hyperparameters at the same time, by having the user pass the type 3 parameters to the constructor of Algorithm, which then stores those parameters. In addition, we could even modify only some of the hyperparameters in the same way as the example above with this style of interface. Here are two examples:

// Set only tolerance.
LARS l;
l.Tolerance() = 1e-10;
SimpleCV<LARS, MSE> cv(l, 0.1 /* 10-fold CV */);
const double mse = cv.Evaluate(dataset, responses);
// Set tolerance, after CV is constructed.
SimpleCV<LARS, MSE> cv(LARS(), 0.1 /* 10-fold CV */);
cv.Algorithm().Tolerance() = 1e-10; // Maybe this can have a different name than Algorithm(), not sure.
const double mse = cv.Evaluate(dataset, responses);

This allows us to not require users to pass all hyperparameters at once, which I think is really important for keeping code readable:

// Readable:
LARS l;
l.Tolerance() = 1e-10;
// Much less readable... and brittle if defaults change:
LARS l(false, 0.0, 0.0, 1e-10);

So, I think, at the end of the day, what I am saying is that I think this interface I have proposed for SimpleCV would be a lot easier to understand and use for users. Unfortunately I think it would mean that things would be less easy for the hyperparameter tuner and you may have to use CVBase directly in that class, since you need to manage the dataset to make sure it is not unnecessarily copied.

However, I know that this interface would be a lot of refactoring work for you, so I want to ensure that my discussions here delay your project only minimally. Therefore, if you agree that this interface is a good change, then I can open an issue highlighting the API changes we'll need to make to have mlpack algorithms match the "standard" interface, and then I will go implement those changes quickly (within a week), and in the longer term (before the 3.0.0 release) I will prepare some documents detailing what the "standard" mlpack API and design is. If you don't agree that it is a good change, then let's keep discussing and see if we can find a better solution. :)

@micyril

micyril Jul 13, 2017

Member

It's a little bit sad to hear all of that only in the middle of July. The interface of SimpleCV in the PR shouldn't look like something new to you - it is the same as in my proposal, as well as in the pretty long discussion #929. The same goes for the hyper-parameter tuning interface, which we discussed in quite some depth in the mail thread "Cross-validation and hyper-parameter tuning infrastructure" (like here). But anyway, I would like to know what your current vision of the interface for hyper-parameter tuning is. It looks like the strategy "pass hyper-parameters first, data later" is not going to be applicable here.

@zoq

zoq Jul 13, 2017

Member

It's a little bit sad to hear all of that only in the middle of July. The interface of SimpleCV in the PR shouldn't look like something new to you - it is the same as in my proposal, as well as in the pretty long discussion #929.

I think it's quite normal to discuss something at length that in the end turns out to be suboptimal, and it looks like something like that happened here. Don't get me wrong, I totally get your point: you both put a lot of time into the class design, and realizing that we have to put more time into it is frustrating. But I think in the end it is worth it; if we can come up with something that is easier to use than the current idea, I think more people are going to use it.

That said, I think the API proposed by @rcurtin makes sense in the context of usability and readability, and if we all agree on that, we can go and implement those changes quickly.

@rcurtin

rcurtin Jul 13, 2017

Member

I see, now that I review previous discussions, that my responses probably seem incomprehensible and schizophrenic. It is worth keeping in mind that I look through so much code every day that it is hard to keep it all straight, and as a result I can give inconsistent answers. In this case, I owe you an apology since what I wrote here directly contradicts what I wrote just a few months ago! So I am sorry about that. I know how stressful it is to be on the receiving end of this and so I will work to make sure it does not happen again for you.

In addition, I want to point out, there are two factors that make this project extra difficult to stay on top of---first, since this project sits "on top of" all of the other mlpack methods, some design work has to be done to standardize all of those. Second, reviewing heavy template metaprogramming code requires a huge amount of effort since the code is typically not very straightforward. These contribute to the difficulty of keeping everything I have previously said in mind.

I see that there is also a misunderstanding on my side:

k-fold cross-validation is supposed to be implemented in a separate class

When I review the proposal now, I see that KFoldCV is meant to be an entirely separate class. Is SimpleCV meant to be used by users? If it is not meant to be used by users but is instead an internal utility class, then we should add that in the comments, and the API matters a lot less and I'm fine with it as-is. Typically when I review code, I try to do it from the perspective of a user, and if code comments don't make it clear what the class is for then I (and a user) will end up with a strange and incorrect view of what the class actually does. As an example here, it was not until today that I realized that SimpleCV is not k-fold CV---clearer comments in the class description could have prevented that, I think.

As for the API change I suggested, what we have now that we did not in April is a formal idea of how all mlpack methods will look:

  • Hyperparameters are passed to the constructor, but not to Train().
  • Hyperparameters may also be modified by calling a function with the hyperparameter name.
  • Constructor versions do exist that also take Train() arguments followed by hyperparameters.

The primary benefit here is the second point: a user is not required to set all hyperparameters to set the last one; instead they can just call the function corresponding to the last one's name (see the LARS example in the previous post). So, from that angle, it would be nice if KFoldCV could work in that way also:

// Set tolerance, after CV is constructed.
KFoldCV<LARS, MSE> cv(LARS(), 10);
cv.Algorithm().Tolerance() = 1e-10; // Maybe this can have a different name than Algorithm(), not sure.
const double mse = cv.Evaluate(dataset, responses);

This specific point is one I did not realize in our discussion of #929. So, what I would like to understand from your side, is: is this possible, and how much extra effort would that be? If it is not easily possible, or it takes a huge amount of effort, then we can leave it as proposed in #929; after all, I even wrote:

The KFoldCV you've proposed here looks great, by the way. It functions the way a user might expect,
and doesn't burden the user with API details that aren't relevant to them---they just select the learner
and the gain function to maximize via template parameters, and then put in their data and options. I
think this type of simplified interface is definitely what we should aim for where C++ allows us to. :)

I have to add, the API idea I am proposing does not make any sense for the hyperparameter tuner, only for cross-validation. In the hyperparameter tuner it only makes sense for the user to pass all of the hyperparameters to the Optimize() function, just like you have it written in your proposal.

@micyril

micyril Jul 14, 2017

Member

I think it's quite normal to discuss something at length that in the end turns out to be suboptimal, and it looks like something like that happened here.

@zoq, my main fear was that a lot of the work that has already been done (primarily the meta-programming tools and the way they are utilized by SimpleCV) would turn out to be wasted time. That is why I said "it's a little bit sad" - I guess such cases can be considered a little bit sad. Anyway, I'm sorry if I let myself be too negative, but I hope you guys got my point.

When I review the proposal now, I see that KFoldCV is meant to be an entirely separate class. Is SimpleCV meant to be used by users?

Actually, SimpleCV was meant to be used by users too. It basically evaluates a model on just one portion (specified by validationSize) of the data, as described in the lecture "Advice for Applying Machine Learning" from the well-known Coursera course taught by Andrew Ng.
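Ignoring the meta-programming, what SimpleCV computes amounts to roughly the following (a hand-written sketch for a classifier and the Accuracy metric, not the actual implementation):

// Hold out a validationSize fraction of the data.
arma::mat trainData, validData;
arma::Row<size_t> trainLabels, validLabels;
data::Split(data, labels, trainData, validData, trainLabels, validLabels,
    validationSize);

// Train on the training portion...
LogisticRegression<> lr(trainData, trainLabels);

// ...and measure the metric on the validation portion.
arma::Row<size_t> predictions;
lr.Classify(validData, predictions);
const double accuracy =
    arma::accu(predictions == validLabels) / (double) validLabels.n_elem;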

So, what I would like to understand from your side, is: is this possible, and how much extra effort would that be? If it is not easily possible, or it takes a huge amount of effort, then we can leave it as proposed in #929;

The current interface for cross-validation (proposed in the PR) was designed with the intention to be utilized by hyper-parameter tuning - there we want to prepare a dataset only once and then use it to assess different sets of hyper-parameters. So, from this point of view I think it makes sense to have the interface proposed in this PR.

Now I'm finishing work on the planned part of the GSoC project (the plan that I described in my proposal, with the remarks from the end of this letter). I think that all the planned code should be ready for review by the end of the second evaluation period. That means that after the second evaluation period I should have time to implement the cross-validation interface that you suggest (in addition to the interface proposed in this PR). What if we go this way? If you agree, I will go ahead and add the remaining minor changes that you have suggested in this PR.

@rcurtin

rcurtin Jul 14, 2017

Member

I think it's quite normal to discuss something at length that in the end turns out to be suboptimal, and it looks like something like that happened here.

@zoq, my main fear was that a lot of the work that has already been done (primarily the meta-programming tools and the way they are utilized by SimpleCV) would turn out to be wasted time. That is why I said "it's a little bit sad" - I guess such cases can be considered a little bit sad. Anyway, I'm sorry if I let myself be too negative, but I hope you guys got my point.

This one's on me, don't worry about it. :)

Actually, SimpleCV was meant to be used by users too. It basically evaluates a model on just one portion (specified by validationSize) of the data, as described in the lecture "Advice for Applying Machine Learning" from the well-known Coursera course taught by Andrew Ng.

Ok, that seems reasonable to me---but could we make the documentation for the class clearer about exactly what it does and how it would be used? This would be in addition to the tutorial you are planning to make (because a user may not find that tutorial first, they may find the class documentation first).

The current interface for cross-validation (proposed in the PR) was designed with the intention to be utilized by hyper-parameter tuning - there we want to prepare a dataset only once and then use it to assess different sets of hyper-parameters. So, from this point of view I think it makes sense to have the interface proposed in this PR.

Yes, definitely, I can agree that the SimpleCV interface makes sense for the hyper-parameter tuner. It might be possible that the interface I proposed could work well for the hyper-parameter tuner as well, but I am not sure---you would have a better idea than me.

Now I'm finishing work on the planned part of the GSoC project (the plan that I described in my proposal, with the remarks from the end of this letter). I think that all the planned code should be ready for review by the end of the second evaluation period. That means that after the second evaluation period I should have time to implement the cross-validation interface that you suggest (in addition to the interface proposed in this PR). What if we go this way? If you agree, I will go ahead and add the remaining minor changes that you have suggested in this PR.

Yes, this seems reasonable to me; we can decide at that time which interface makes the most sense (including possibly both), and see what is possible in the remaining time you have this summer.

@rcurtin

rcurtin Jul 14, 2017

Member

To add to that, I think this is ready to merge if you can add some more documentation to the SimpleCV class detailing what it is for, with perhaps a simple example of how it can be used, and also some documentation for the other classes pointing out whether they are meant to be used by the user, etc.; for example, CVBase is not really meant to be used by the user, but there is nothing in the comments to say this---it might be good to add something like "This is a base class, not meant to be used by users. For cross-validation tasks, see KFoldCV and SimpleCV" (or similar).

* runs training on the training set and evaluates performance on the validation
* set.
*
* To construct a SimpleCV object you need to pass the validationSize parameter

@micyril

micyril Jul 17, 2017

Member

Let me know whether the documentation is clear enough and whether we need to provide a more extended tutorial before this PR gets merged.

@rcurtin

rcurtin Jul 17, 2017

Member

Looks great to me. Can you wrap the code in @code and @endcode for doxygen? It might be good to randomly generate the dataset instead of using ... for clarity, if you like:

// 100-point 5-dimensional random dataset.
arma::mat data = arma::randu<arma::mat>(5, 100);
// Random shuffled labels in [0, 5).
arma::Row<size_t> labels = arma::shuffle(arma::linspace<arma::Row<size_t>>(0, 4, 100));
size_t numClasses = 5;

@micyril

micyril Jul 18, 2017

Member

Sure. I used arma::randi<arma::Row<size_t>>(100, arma::distr_param(0, 4)) for labels - it seems to me to be a bit more concise with approximately the same result.
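The documented example then ends up looking roughly like this (a sketch combining the two snippets above; the model used in the actual simple_cv.hpp example may differ):

 * @code
 * // 100-point 5-dimensional random dataset.
 * arma::mat data = arma::randu<arma::mat>(5, 100);
 * // Random labels in the [0, 4] interval.
 * arma::Row<size_t> labels =
 *     arma::randi<arma::Row<size_t>>(100, arma::distr_param(0, 4));
 * size_t numClasses = 5;
 *
 * // Hold out 20% of the data for validation; the model here is only an
 * // example, and its hyper-parameters are passed to Evaluate().
 * SimpleCV<DecisionTree<>, Accuracy> cv(0.2, data, labels, numClasses);
 * double accuracy = cv.Evaluate(/* minimumLeafSize */ 10);
 * @endcode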

@rcurtin

rcurtin Jul 17, 2017

Member

Looks great to me, I think that this is ready to merge. But the Windows build is failing:

c:\projects\mlpack\src\mlpack\core\cv\simple_cv_impl.hpp(49): error C2672: 'mlpack::cv::SimpleCV<mlpack::regression::LogisticRegression<arma::mat>,mlpack::cv::Accuracy,arma::mat,arma::Row<arma::u64>,void *>::TrainAndEvaluate': no matching overloaded function

https://ci.appveyor.com/project/mlpack/mlpack/build/%232917/job/ae0y7hopw3l0kdoq

If you can handle that, then I think it's time to merge.

micyril added some commits Jul 18, 2017

@micyril

micyril Jul 18, 2017

Member

It looks like the needed packages can't be loaded in the Windows builds

@zoq

zoq Jul 18, 2017

Member

https://appveyor.statuspage.io/incidents/m2vdvw39kdk8, I think we should wait a couple more hours and restart the build.

@micyril

Member

micyril commented Jul 21, 2017

@rcurtin

Ah, ok, thanks. That online compiler is a nice tool to know about! I think the workaround you have committed is just fine.

I am happy with this PR and I think it's ready to merge. I'll wait 5 days for the merge since this PR is complex, in case anyone has any more comments.

micyril added some commits Jul 27, 2017

@micyril

micyril Jul 27, 2017

Member

I have left a comment on the thread about LogisticRegression in #1038. Let me know whether we want to keep the merged changes. If yes, I will make the appropriate changes here to solve the build problems.

@rcurtin

rcurtin Jul 29, 2017

Member

We can revert the changes to LogisticRegression from #1038, so if you would like to refactor the code so that numClasses is not required for binary classifiers such as LogisticRegression, then I will go ahead and merge this. If you would like to revert the changes to LogisticRegression itself from #1038, please feel free, otherwise I can handle it when I have some time.

@micyril

micyril Jul 29, 2017

Member

Ok, I will handle it.

@micyril

micyril Jul 29, 2017

Member

Should I make a new PR for it, or is it ok to put the changes related to LogisticRegression here?

@rcurtin

rcurtin Jul 29, 2017

Member

I think the LR changes are fine here; it's just a revert, so no need for an involved review process for that.

@rcurtin

rcurtin approved these changes Aug 5, 2017

Looks good to me, are you ready for me to merge this? I said I would merge it a week ago but I didn't have internet then. I do now though. :)

@micyril

micyril Aug 6, 2017

Member

I think it's ready to be merged after deciding how to store models (see the commit "Get rid of smart pointers" and our discussion about smart pointers here).

@micyril

micyril Aug 7, 2017

Member

It's unclear to me why the style checker failed. It referred to files that are not part of this PR - files from src/mlpack/methods/rann/.

@rcurtin

rcurtin Aug 7, 2017

Member

Yeah, the style checks are my fault; when I committed a PR earlier today, the style checks had not been run on it. I'll handle it, don't worry about it. But probably tomorrow, because it is late here...

@micyril

micyril Aug 7, 2017

Member

Ok. I think that I have added all changes that I wanted. So, feel free to merge it if you think it's ready.

@rcurtin

rcurtin Aug 7, 2017

Member

Sounds good to me. Thank you so much for your hard work with this PR, I'm looking forward to using it in practice. :)

@rcurtin rcurtin merged commit 7ce1136 into mlpack:master Aug 7, 2017

2 of 3 checks passed

Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@micyril micyril changed the title from Simple cv to Add simple cross-validation Aug 19, 2017
