
Refactor classifiers to take numClasses parameters #1038

Merged Jul 25, 2017 (5 commits)
Conversation


@rcurtin rcurtin commented Jun 25, 2017

This is needed for the cross-validation module. For each of the mlpack classifiers, I have made sure that the API can match the following signature:

void Train(const arma::mat& data, const arma::Row<size_t>& labels, const size_t numClasses);

and constructors have been updated appropriately also. The following classes needed changes:

  • AdaBoost
  • DecisionStump
  • LogisticRegression -- this one is special because logistic regression is two-class only, so I just added an extra overload and corresponding documentation
  • NaiveBayesClassifier
  • Perceptron

@micyril: take a look at this, let me know if this is what you need or if there are any further changes necessary.


@micyril micyril left a comment


I left some comments. Shouldn't we also add the numClasses parameter to the Train method of HoeffdingTree?

*/
LogisticRegression(const MatType& predictors,
const arma::Row<size_t>& responses,
const size_t numClasses,
Member

I guess my comment about the missing numClasses parameter in LogisticRegression was a bit misleading. When we discussed the possibility of adding the numClasses parameter to the CV constructors, my understanding was that we should always require this parameter if it is present in any of the Train methods. Under that strategy, we would require users to pass numClasses when they want to apply cross-validation to LogisticRegression. Do we want that? If not, we can leave LogisticRegression unchanged.

Member

The same considerations apply to AdaBoost.

Member Author

Hm, so I spent some time thinking about this point. I think that in order to preserve compatibility of two-class classifiers with other classifiers, we can simply provide two overloads, one that takes numClasses (so that the classifier satisfies the requirements of other classifiers) and one that doesn't, for user convenience.

AdaBoost is a multiclass classifier (or it can be, depending on its weak learner) so I think leaving numClasses there is just fine.

Do you think that is ok, or does that cause any problems?
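The two-overload idea above can be sketched as below. This is a minimal illustration of the pattern, not mlpack's actual LogisticRegression; the class name is hypothetical and std::vector containers stand in for arma::mat and arma::Row<size_t>.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch of the two-overload pattern for a two-class classifier.
class TwoClassClassifier
{
 public:
  // Convenience overload: two classes are implied, for user convenience.
  void Train(const std::vector<std::vector<double>>& data,
             const std::vector<std::size_t>& labels)
  {
    trained = true; // Placeholder for actual training logic.
  }

  // Overload matching the generic classifier signature, so the class
  // satisfies the same requirements as multiclass classifiers.
  void Train(const std::vector<std::vector<double>>& data,
             const std::vector<std::size_t>& labels,
             const std::size_t numClasses)
  {
    if (numClasses != 2)
      throw std::invalid_argument("only two classes are supported");
    Train(data, labels); // Delegate to the two-class implementation.
  }

  bool trained = false;
};
```

Generic code (such as a cross-validation driver) can call the three-argument overload uniformly, while users who know they have a two-class problem can omit numClasses.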

Member

Sorry, I missed this reply. The main disadvantage of keeping numClasses in LogisticRegression is that users will be required to pass it when they use LogisticRegression with the cross-validation and hyper-parameter tuning modules. If we want to keep it for the reason described in the documentation ("as necessary for the cross-validation code and hyper-parameter tuner to work"), that is not actually the case: cross-validation and hyper-parameter tuning will work without it. So we should keep it only if there is some other reason. Is there any other code that depends on the assumption that classifiers have numClasses?

Member Author

I see, so are you saying that we can remove the numClasses from LogisticRegression and other two-class classifiers entirely, and the functionality of CV or the hyper-parameter tuner will not be affected? If that is true, I guess we can revert the changes to LogisticRegression; would you agree with that?

Member

Cross-validation and hyper-parameter tuning will work in both cases; the only difference is whether passing numClasses is required. I agree that we can revert the changes for LogisticRegression.

* @param incremental Whether or not to use the incremental algorithm for
* training.
*/
void Train(const MatType& data,
const arma::Row<size_t>& labels,
const size_t numClasses,
Member

Is it ok just to add a new parameter to the Train method? Shouldn't we deprecate the old method and add a new one?

Member Author

That should only be necessary if we plan to release another 2.x.x version with this new signature, but since the next release should be 3.0.0, it is ok to break reverse compatibility at this level.


rcurtin commented Jun 26, 2017

Hm, ok, I had thought that we should synchronize the Train() methods across all classifiers, and then the presence of the numClasses parameter could be used to determine whether a method is for classification or regression. Did you have something else in mind? I thought we could enforce that each classifier must implement

ClassifierType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses);
void ClassifierType::Train(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses);

but then a question also is what to do with the DatasetInfo parameter, which is used by some classifiers that can accept categorical data. We could have other overloads:

ClassifierType(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses);
void ClassifierType::Train(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses);

Or, we could force the DatasetInfo to be accepted as another later optional parameter to the constructor, i.e., ClassifierType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, const DatasetInfo& datasetInfo = DatasetInfo()).

Whichever of those we choose, I think it is easiest if you specify the best interface for your work, and then we can apply it through the codebase. Since the next release will be 3.0.0 and this allows us to break reverse compatibility, this is the easiest time to make these changes. :)
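The trailing-default-argument option mentioned above can be sketched as follows. The DatasetInfo struct and class name here are simplified stand-ins for illustration only, not mlpack's actual data::DatasetInfo.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for data::DatasetInfo; empty means "all dimensions numeric".
struct DatasetInfo
{
  std::vector<bool> categorical;
};

class ClassifierSketch
{
 public:
  // One constructor covers both the numeric-only and categorical cases:
  // callers who don't need DatasetInfo can simply omit the last argument.
  ClassifierSketch(const std::vector<std::vector<double>>& data,
                   const std::vector<std::size_t>& labels,
                   const std::size_t numClasses,
                   const DatasetInfo& info = DatasetInfo()) :
      numClasses(numClasses),
      hasCategorical(!info.categorical.empty())
  { }

  std::size_t numClasses;
  bool hasCategorical;
};
```

The tradeoff is that the default argument keeps the overload count down, but it places DatasetInfo after the labels, whereas the explicit-overload option keeps it next to the data it describes.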


micyril commented Jun 26, 2017

We can support cross-validation for LogisticRegression and AdaBoost without requiring users to pass the unnecessary numClasses parameter. The changes in DecisionStump, NaiveBayesClassifier and Perceptron are more appropriate, since users always need to pass the numClasses parameter into the constructors of those classes.

Regarding detection of whether a method is classification or regression: the cross-validation module doesn't need that information, as long as it knows which parameters are required and which are optional.


rcurtin commented Jun 26, 2017

I see, thanks for the clarification. Do you have an opinion on how the APIs for classification and regression methods should be structured? If we can standardize this, it will be easier to implement new classifiers and regressors.


micyril commented Jun 26, 2017

I guess in terms of constructor/Train interfaces we can use LinearRegression as a de facto standard for regression algorithms, and DecisionTree as a de facto standard for classification algorithms.


rcurtin commented Jun 26, 2017

I see, the only issue is that these don't account for categorical variables. I suggest that we do this:

regression:

// I guess usually MatType == arma::Mat<eT> and ResponsesType == arma::Col<eT>
RegressionType(const MatType& data, const ResponsesType& responses, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const ResponsesType& responses);

// Sometimes the input could be categorical for some methods too, so these can be optional:
RegressionType(const MatType& data, const DatasetInfo& info, const ResponsesType& responses, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const DatasetInfo& info, const ResponsesType& responses);

// if the data has weights, we can end up with four more overloads:
RegressionType(const MatType& data, const ResponsesType& responses, const WeightsType& weights, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const ResponsesType& responses, const WeightsType& weights);
RegressionType(const MatType& data, const DatasetInfo& info, const ResponsesType& responses, const WeightsType& weights, /* all optional arguments past here... */);
void RegressionType::Train(const MatType& data, const DatasetInfo& info, const ResponsesType& responses, const WeightsType& weights);

classification:

ClassificationType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses);

// optional for classes that support categorical data also
ClassificationType(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses);

// if the classifier supports weighted training, we can end up with up to four more overloads
ClassificationType(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights);
ClassificationType(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights, /* all optional arguments past here... */);
void ClassificationType::Train(const MatType& data, const DatasetInfo& info, const arma::Row<size_t>& labels, const size_t numClasses, const WeightsType& weights);

In these cases it is not required that the overload is exactly implemented as specified, but the compiler should at least be able to instantiate it with MatType = arma::mat and ResponsesType = arma::vec. So it is ok if the signature is implemented a little differently. It would be nice to eventually require that MatType can be an Armadillo object of any type, but there is other support that would be needed first.
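To illustrate how generic code could rely on the standardized signatures, here is a sketch of a template helper together with a toy classifier implementing the required interface. This is not mlpack's CV module; the names are hypothetical, and std::vector containers stand in for arma::mat and arma::Row<size_t>.

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>; // stand-in for arma::mat
using Labels = std::vector<std::size_t>;      // stand-in for arma::Row<size_t>

// Generic driver code can instantiate any ClassifierType that implements
// the standardized constructor and Train() signatures.
template<typename ClassifierType>
ClassifierType TrainClassifier(const Mat& data,
                               const Labels& labels,
                               const std::size_t numClasses)
{
  ClassifierType c(data, labels, numClasses);
  c.Train(data, labels, numClasses); // Both entry points are available.
  return c;
}

// A minimal classifier satisfying the standardized interface: it always
// predicts the most frequent label seen during training.
struct MajorityClassifier
{
  MajorityClassifier(const Mat& data, const Labels& labels,
                     const std::size_t numClasses)
  {
    Train(data, labels, numClasses);
  }

  void Train(const Mat&, const Labels& labels, const std::size_t numClasses)
  {
    std::vector<std::size_t> counts(numClasses, 0);
    for (const std::size_t l : labels)
      ++counts[l];
    majority = 0;
    for (std::size_t i = 1; i < numClasses; ++i)
      if (counts[i] > counts[majority])
        majority = i;
  }

  std::size_t Classify(const std::vector<double>&) const { return majority; }

  std::size_t majority = 0;
};
```

Because the signature is only required to be instantiable (not exactly matched), classifiers remain free to add extra defaulted parameters, as noted above.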

@zoq: I wouldn't mind your input here too, this is a fairly fundamental design decision so it would be best to get eyes on this before things start changing.

Whenever we decide on something, I will have to create some documentation detailing exactly what is required of a classification and a regression type, a little like http://mlpack.org/docs/mlpack-2.2.3/doxygen.php?doc=trees.html .


micyril commented Jun 26, 2017

For completeness we can also specify an optional parameter for weights.


rcurtin commented Jun 26, 2017

Ah, right, I forgot about weights. I've updated my comment.

@@ -202,8 +202,10 @@ int main(int argc, char *argv[])
else if (weakLearner == "perceptron")
m.WeakLearnerType() = AdaBoostModel::WeakLearnerTypes::PERCEPTRON;

const size_t numClasses = arma::max(labels) + 1;
Member

Do we require contiguous labels starting at 0? If not, I think this doesn't return the correct number of classes.

Member Author

Yes, that is required, but I am not sure how clear the documentation is. I'll update it.

Member Author

Ack, sorry, I thought this was in a different place than it is. In this code, the call to NormalizeLabels() maps the labels into the range 0 to numClasses - 1. But I can still make the call a little more efficient, since numClasses == m.Mappings().n_elem.
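The point being discussed can be shown with a small sketch: max(labels) + 1 only counts classes correctly when labels are contiguous and start at 0, which is exactly what a normalization pass guarantees. This is a conceptual illustration, not mlpack's NormalizeLabels() implementation, and the function names are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// The counting scheme from the diff: correct only for labels in
// {0, ..., numClasses - 1}.
std::size_t MaxPlusOne(const std::vector<std::size_t>& labels)
{
  return *std::max_element(labels.begin(), labels.end()) + 1;
}

// Map arbitrary label values onto 0..numClasses - 1, in the spirit of
// NormalizeLabels(); mapping[i] records the original value of class i.
std::vector<std::size_t> Normalize(const std::vector<std::size_t>& raw,
                                   std::vector<std::size_t>& mapping)
{
  std::map<std::size_t, std::size_t> toClass;
  std::vector<std::size_t> out;
  for (const std::size_t l : raw)
  {
    auto it = toClass.find(l);
    if (it == toClass.end())
    {
      it = toClass.emplace(l, mapping.size()).first;
      mapping.push_back(l);
    }
    out.push_back(it->second);
  }
  return out;
}
```

With raw labels {3, 7, 3}, MaxPlusOne() reports 8 classes even though only two are present; after normalization the labels become {0, 1, 0} and the count is correct, which also shows why numClasses equals the number of stored mappings.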


zoq commented Jul 6, 2017

As for me, the proposed interface looks good, as long as we provide the same interface for all methods. I'm a little bit worried about the LogisticRegression method, since someone could get the impression that the method works with numClasses > 2, but if we can make that clear in the documentation, this is just a minor issue. I guess we could also throw an exception or something like that in the future, once we have decided on error handling: #891.


rcurtin commented Jul 17, 2017

I updated the documentation somewhat to make it clearer that LogisticRegression only supports two classes even though the numClasses parameter is sometimes available. If you think that is clear, should we go ahead and merge this?


zoq commented Jul 24, 2017

Completely missed the last message; it's clear to me, and I think we've waited long enough for any more comments, so let's merge this.


rcurtin commented Jul 25, 2017

Agreed, so in this case I will go ahead and merge. If anyone complains we can revisit the issue then :)

@rcurtin rcurtin merged commit dc70207 into mlpack:master Jul 25, 2017
micyril added a commit to micyril/mlpack that referenced this pull request Aug 5, 2017
Changes have been reverted manually since git fails to revert them
automatically without reverting changes from the merge commit
82d8aa1.