
Refactor classifiers to take numClasses parameters #1038

Merged Jul 25, 2017 (5 commits)
Conversation


@rcurtin rcurtin commented Jun 25, 2017

This is needed for the cross-validation module. For each of the mlpack classifiers, I have made sure that the API can match the following signature:

void Train(const arma::mat& data, const arma::Row<size_t>& labels, const size_t numClasses);

and constructors have been updated appropriately also. The following classes needed changes:

  • AdaBoost
  • DecisionStump
  • LogisticRegression -- this one is special because logistic regression is two-class only, so I just added an extra overload and corresponding documentation
  • NaiveBayesClassifier
  • Perceptron

@micyril: take a look at this, let me know if this is what you need or if there are any further changes necessary.


@micyril micyril left a comment


I left some comments. Shouldn't we also add the numClasses parameter to the Train method of HoeffdingTree?

*/
LogisticRegression(const MatType& predictors,
const arma::Row<size_t>& responses,
const size_t numClasses,
Member

I guess my comment about the missing numClasses parameter in LogisticRegression was a bit misleading. When we discussed the possibility of adding the numClasses parameter to the CV constructors, my understanding was that we should always require this parameter if it is present in any of the Train methods. Under that strategy, we would require users to pass numClasses when they want to apply cross-validation to LogisticRegression. Do we want that? If not, we can leave LogisticRegression unchanged.

Member

The same considerations apply to AdaBoost.

Member Author

Hm, so I spent some time thinking about this point. I think that in order to preserve compatibility of two-class classifiers with other classifiers, we can simply provide two overloads, one that takes numClasses (so that the classifier satisfies the requirements of other classifiers) and one that doesn't, for user convenience.

AdaBoost is a multiclass classifier (or it can be, depending on its weak learner) so I think leaving numClasses there is just fine.

Do you think that is ok, or does that cause any problems?
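The two-overload idea above can be sketched as below. This is a minimal illustration of the pattern, not mlpack's actual LogisticRegression; the class name is hypothetical and std::vector containers stand in for arma::mat and arma::Row<size_t>.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch of the two-overload pattern for a two-class classifier.
class TwoClassClassifier
{
 public:
  // Convenience overload: two classes are implied, for user convenience.
  void Train(const std::vector<std::vector<double>>& data,
             const std::vector<std::size_t>& labels)
  {
    trained = true; // Placeholder for actual training logic.
  }

  // Overload matching the generic classifier signature, so the class
  // satisfies the same requirements as multiclass classifiers.
  void Train(const std::vector<std::vector<double>>& data,
             const std::vector<std::size_t>& labels,
             const std::size_t numClasses)
  {
    if (numClasses != 2)
      throw std::invalid_argument("only two classes are supported");
    Train(data, labels); // Delegate to the two-class implementation.
  }

  bool trained = false;
};
```

Generic code (such as a cross-validation driver) can call the three-argument overload uniformly, while users who know they have a two-class problem can omit numClasses.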

Member

Sorry, I missed this reply. The main disadvantage of keeping numClasses in LogisticRegression is that users will be required to pass it when they use LogisticRegression with the cross-validation and hyper-parameter tuning modules. If we want to keep it for the reason described in the documentation ("as necessary for the cross-validation code and hyper-parameter tuner to work"), that is not actually the case: cross-validation and hyper-parameter tuning will work without it. So we should keep it only if there is some other reason. Is there any other code that depends on the assumption that classifiers have numClasses?

Member Author

I see, so are you saying that we can remove the numClasses from LogisticRegression and other two-class classifiers entirely, and the functionality of CV or the hyper-parameter tuner will not be affected? If that is true, I guess we can revert the changes to LogisticRegression; would you agree with that?

Member

Cross-validation and hyper-parameter tuning will work in both cases; the only difference is whether passing numClasses is required. I agree that we can revert the changes for LogisticRegression.

* @param incremental Whether or not to use the incremental algorithm for
* training.
*/
void Train(const MatType& data,
const arma::Row<size_t>& labels,
const size_t numClasses,
Member

Is it ok just to add a new parameter to the Train method? Shouldn't we deprecate the old method and add a new one?

Member Author

That should only be necessary if we plan to release another 2.x.x version with this new signature, but since the next release should be 3.0.0, it is ok to break reverse compatibility at this level.


rcurtin commented Jun 26, 2017

Hm, ok, I had thought that we should synchronize the Train() methods across all classifiers, and then the presence of the numClasses parameter could be used to determine whether a method is for classification or regression. Did you have something else in mind? I thought we could enforce that each classifier must implement

ClassifierType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses);
void ClassifierType::Train(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses);

but then a question also is what to do with the DatasetInfo parameter, which is used by some classifiers that can accept categorical data. We could have other overloads:

ClassifierType(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses);
void ClassifierType::Train(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses);

Or, we could force the DatasetInfo to be accepted as another later optional parameter to the constructor, i.e., ClassifierType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, const DatasetInfo& datasetInfo = DatasetInfo()).

Whichever of those we choose, I think it is easiest if you specify the best interface for your work, and then we can apply it through the codebase. Since the next release will be 3.0.0 and this allows us to break reverse compatibility, this is the easiest time to make these changes. :)
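The trailing-default-argument option mentioned above can be sketched as follows. The DatasetInfo struct and class name here are simplified stand-ins for illustration only, not mlpack's actual data::DatasetInfo.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for data::DatasetInfo; empty means "all dimensions numeric".
struct DatasetInfo
{
  std::vector<bool> categorical;
};

class ClassifierSketch
{
 public:
  // One constructor covers both the numeric-only and categorical cases:
  // callers who don't need DatasetInfo can simply omit the last argument.
  ClassifierSketch(const std::vector<std::vector<double>>& data,
                   const std::vector<std::size_t>& labels,
                   const std::size_t numClasses,
                   const DatasetInfo& info = DatasetInfo()) :
      numClasses(numClasses),
      hasCategorical(!info.categorical.empty())
  { }

  std::size_t numClasses;
  bool hasCategorical;
};
```

The tradeoff is that the default argument keeps the overload count down, but it places DatasetInfo after the labels, whereas the explicit-overload option keeps it next to the data it describes.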


micyril commented Jun 26, 2017

We can support cross-validation for LogisticRegression and AdaBoost without requiring users to pass the unnecessary numClasses parameter. The changes in DecisionStump, NaiveBayesClassifier and Perceptron are more appropriate, since users always need to pass the numClasses parameter into the constructors of those classes.

Regarding detection of whether a method is classification or regression: the cross-validation module doesn't need that information, as long as it knows which parameters are required and which are optional.


rcurtin commented Jun 26, 2017

I see, thanks for the clarification. Do you have an opinion on how the APIs for classification and regression methods should be structured? If we can standardize this, it will be easier to implement new classifiers and regressors.


micyril commented Jun 26, 2017

I guess in terms of constructor/Train interfaces we can use LinearRegression as a de facto standard for regression algorithms, and DecisionTree as a de facto standard for classification algorithms.


rcurtin commented Jun 26, 2017

I see, the only issue is that these don't account for categorical variables. I suggest that we do this:

regression:

// I guess usually MatType == arma::Mat<eT> and ResponsesType == arma::Col<eT>
RegressionType(const MatType& data, const ResponsesType& responses, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const ResponsesType& responses);

// Sometimes the input could be categorical for some methods too, so these can be optional:
RegressionType(const MatType& data, const DatasetInfo& info, const ResponsesType& responses, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const DatasetInfo& info, const ResponsesType& responses);

// if the data has weights, we can end up with four more overloads:
RegressionType(const MatType& data, const ResponsesType& responses, const WeightsType& weights, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const ResponsesType& responses, const WeightsType& weights);
RegressionType(const MatType& data, const DatasetInfo& info, const ResponsesType& responses, const WeightsType& weights, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const DatasetInfo& info, const ResponsesType& responses, const WeightsType& weights);

classification:

ClassificationType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses);

// optional for classes that support categorical data also
ClassificationType(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses);

// if the classifier supports weighted training, we can end up with up to four more overloads
ClassificationType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights);
ClassificationType(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights);

In these cases it is not required that the overload is exactly implemented as specified, but the compiler should at least be able to instantiate it with MatType = arma::mat and ResponsesType = arma::vec. So it is ok if the signature is implemented a little differently. It would be nice to eventually require that MatType can be an Armadillo object of any type, but there is other support that would be needed first.
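To illustrate how generic code could rely on the standardized signatures, here is a sketch of a template helper together with a toy classifier implementing the required interface. This is not mlpack's CV module; the names are hypothetical, and std::vector containers stand in for arma::mat and arma::Row<size_t>.

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>; // stand-in for arma::mat
using Labels = std::vector<std::size_t>;      // stand-in for arma::Row<size_t>

// Generic driver code can instantiate any ClassifierType that implements
// the standardized constructor and Train() signatures.
template<typename ClassifierType>
ClassifierType TrainClassifier(const Mat& data,
                               const Labels& labels,
                               const std::size_t numClasses)
{
  ClassifierType c(data, labels, numClasses);
  c.Train(data, labels, numClasses); // Both entry points are available.
  return c;
}

// A minimal classifier satisfying the standardized interface: it always
// predicts the most frequent label seen during training.
struct MajorityClassifier
{
  MajorityClassifier(const Mat& data, const Labels& labels,
                     const std::size_t numClasses)
  {
    Train(data, labels, numClasses);
  }

  void Train(const Mat&, const Labels& labels, const std::size_t numClasses)
  {
    std::vector<std::size_t> counts(numClasses, 0);
    for (const std::size_t l : labels)
      ++counts[l];
    majority = 0;
    for (std::size_t i = 1; i < numClasses; ++i)
      if (counts[i] > counts[majority])
        majority = i;
  }

  std::size_t Classify(const std::vector<double>&) const { return majority; }

  std::size_t majority = 0;
};
```

Because the signature is only required to be instantiable (not exactly matched), classifiers remain free to add extra defaulted parameters, as noted above.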

@zoq: I wouldn't mind your input here too, this is a fairly fundamental design decision so it would be best to get eyes on this before things start changing.

Whenever we decide on something, I will have to create some documentation detailing exactly what is required of a classification and a regression type, a little like http://mlpack.org/docs/mlpack-2.2.3/doxygen.php?doc=trees.html .


micyril commented Jun 26, 2017

For completeness we can also specify an optional parameter for weights.


rcurtin commented Jun 26, 2017

Ah, right, I forgot about weights. I've updated my comment.

@@ -202,8 +202,10 @@ int main(int argc, char *argv[])
else if (weakLearner == "perceptron")
m.WeakLearnerType() = AdaBoostModel::WeakLearnerTypes::PERCEPTRON;

const size_t numClasses = arma::max(labels) + 1;
Member

Do we require contiguous labels starting at 0? If not, I think this doesn't return the correct number of classes.

Member Author

Yes, that is required, but I am not sure how clear the documentation is. I'll update it.

Member Author

Ack, sorry, I thought this was in a different place than it is. In this code, the call to NormalizeLabels() maps the labels into the range 0 to numClasses - 1. But I can still make the call a little more efficient, since numClasses == m.Mappings().n_elem.
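The point being discussed can be shown with a small sketch: max(labels) + 1 only counts classes correctly when labels are contiguous and start at 0, which is exactly what a normalization pass guarantees. This is a conceptual illustration, not mlpack's NormalizeLabels() implementation, and the function names are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// The counting scheme from the diff: correct only for labels in
// {0, ..., numClasses - 1}.
std::size_t MaxPlusOne(const std::vector<std::size_t>& labels)
{
  return *std::max_element(labels.begin(), labels.end()) + 1;
}

// Map arbitrary label values onto 0..numClasses - 1, in the spirit of
// NormalizeLabels(); mapping[i] records the original value of class i.
std::vector<std::size_t> Normalize(const std::vector<std::size_t>& raw,
                                   std::vector<std::size_t>& mapping)
{
  std::map<std::size_t, std::size_t> toClass;
  std::vector<std::size_t> out;
  for (const std::size_t l : raw)
  {
    auto it = toClass.find(l);
    if (it == toClass.end())
    {
      it = toClass.emplace(l, mapping.size()).first;
      mapping.push_back(l);
    }
    out.push_back(it->second);
  }
  return out;
}
```

With raw labels {3, 7, 3}, MaxPlusOne() reports 8 classes even though only two are present; after normalization the labels become {0, 1, 0} and the count is correct, which also shows why numClasses equals the number of stored mappings.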


zoq commented Jul 6, 2017

As for me, the proposed interface looks good, as long as we provide the same interface for all methods. I'm a little bit worried about the LogisticRegression method, since someone could get the impression that the method works with numClasses > 2, but if we can make that clear in the documentation, this is just a minor issue. I guess we could also throw an exception or something like that in the future, once we have decided on error handling: #891.


rcurtin commented Jul 17, 2017

I updated the documentation somewhat to make it clearer that LogisticRegression only supports two classes even though the numClasses parameter is sometimes available. If you think that is clear, should we go ahead and merge this?


zoq commented Jul 24, 2017

Completely missed the last message; it's clear to me, and I think we've waited long enough for any more comments, so let's merge this.


rcurtin commented Jul 25, 2017

Agreed, so in this case I will go ahead and merge. If anyone complains we can revisit the issue then :)

@rcurtin rcurtin merged commit dc70207 into mlpack:master Jul 25, 2017
micyril added a commit to micyril/mlpack that referenced this pull request Aug 5, 2017
Changes have been reverted manually since git fails to revert them
automatically without reverting changes from the merge commit
82d8aa1.