New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor classifiers to take numClasses parameters #1038
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. Shouldn't we also add the numClasses parameter to the Train method of HoeffdingTree?
*/ | ||
LogisticRegression(const MatType& predictors, | ||
const arma::Row<size_t>& responses, | ||
const size_t numClasses, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my comment about lacking the numClasses
parameter in LogisticRegression
was a bit missleading. When we disccussed the possibility to add the numClasses
parameter in CV constructors, I perceived it as we should always require this parameter if it is present in one of Train
methods. According to this strategy we will require users to pass numClasses
when they want to apply cross-validation to LogisticRegression. Do we want it? If no, we can leave LogisticRegression
without changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same considerations are applied to AdaBoost
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, so I spent some time thinking about this point. I think that in order to preserve compatibility of two-class classifiers with other classifiers, we can simply provide two overloads, one that takes numClasses
(so that the classifier satisfies the requirements of other classifiers) and one that doesn't, for user convenience.
AdaBoost is a multiclass classifier (or it can be, depending on its weak learner) so I think leaving numClasses
there is just fine.
Do you think that is ok, or does that cause any problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed this reply. The main disadvantage of keeping numClasses
in LogisticRegression
is that users will be required to pass it when they use LogisticRegression
with the cross-validation and hyper-parameter tuning modules. If we want to keep it for the reason described in the documentation "as necessary for the cross-validation code and hyper-parameter tuner to work", it is not actually the case - cross-validation and hyper-parameter tuning will work without it. So, we need to keep it only if there is some other reason(s). Is there any other code that depends on the assumption of the presence of numClasses
in classifiers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so are you saying that we can remove the numClasses
from LogisticRegression
and other two-class classifiers entirely, and the functionality of CV or the hyper-parameter tuner will not be affected? If that is true, I guess we can revert the changes to LogisticRegression
; would you agree with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cross-validation and hyper-parameter tuning will work in both cases. The only difference is that in one case it will be required to pass numClasses
, in another - no. I agree that we can revert the changes for LogisticRegression
.
* @param incremental Whether or not to use the incremental algorithm for | ||
* training. | ||
*/ | ||
void Train(const MatType& data, | ||
const arma::Row<size_t>& labels, | ||
const size_t numClasses, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok just to add a new parameter to the Train
method? Shouldn't we deprecate the old method and add a new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should only be necessary if we plan to release another 2.x.x version with this new signature, but since the next release should be 3.0.0, it is ok to break reverse compatibility at this level.
Hm, ok, I had thought that we should synchronize the
but then a question also is what to do with the
Or, we could force the Whichever of those we choose, I think it is easiest if you specify the best interface for your work, and then we can apply it through the codebase. Since the next release will be 3.0.0 and this allows us to break reverse compatibility, this is the easiest time to make these changes. :) |
We can support cross-validation for Regarding detection of whether a method is classification or regression - the cross-validation module don't need to know this information as long as it knows which parameters it should require to pass and which ones it can take as optional. |
I see, thanks for the clarification. Do you have an opinion of how the APIs for classification and regression methods should be structured? If we can standardize this, we make it easier to implement new classifiers and regressors. |
I guess in terms of constructor/Train interfaces we can use LinearRegression as a de facto standard for regression algorithms, and DecisionTree as a de facto standard for classification algorithms. |
I see, the only issue is that these don't account for categorical variables. I suggest that we do this: regression:
classification:
In these cases it is not required that the overload is exactly implemented as specified, but the compiler should at least be able to instantiate it with @zoq: I wouldn't mind your input here too, this is a fairly fundamental design decision so it would be best to get eyes on this before things start changing. Whenever we decide on something, I will have to create some documentation detailing exactly what is required of a classification and a regression type, a little like http://mlpack.org/docs/mlpack-2.2.3/doxygen.php?doc=trees.html . |
For completeness we can also specify an optional parameter for weights. |
Ah, right, I forgot about weights. I've updated my comment. |
@@ -202,8 +202,10 @@ int main(int argc, char *argv[]) | |||
else if (weakLearner == "perceptron") | |||
m.WeakLearnerType() = AdaBoostModel::WeakLearnerTypes::PERCEPTRON; | |||
|
|||
const size_t numClasses = arma::max(labels) + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we require continuous labels starting with 0? If not I think this doesn't return the correct number of labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is required, but I am not sure how clear the documentation is. I'll update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, sorry, I thought this was in a different place than it is. In this code the call to NormalizeLabels()
forces the labels to be between 0 and numClasses. But I can still make the call a little more efficient, because numClasses == m.Mappings().n_elem
.
As for me, the proposed interface looks good, as long as we provide the same interface for all methods. I'm a little bit worried about the LogisticRegression method since someone could get the impression that the method works with |
I updated the documentation somewhat to make it clearer that |
Completely missed the last message, it's clear for me and I think we waited long enough for any more comments, so let's merge this. |
Agreed, so in this case I will go ahead and merge. If anyone complains we can revisit the issue then :) |
Changes have been reverted manually since git fails to revert them automatically without reverting changes from the merge commit 82d8aa1.
This is needed for the cross-validation module. For each of the mlpack classifiers, I have made sure that the API can match the following signature:
and constructors have been updated appropriately also. The following classes needed changes:
AdaBoost
DecisionStump
LogisticRegression
-- this one is special because logistic regression is two-class only, so I just added an extra overload and corresponding documentationNaiveBayesClassifier
Perceptron
@micyril: take a look at this, let me know if this is what you need or if there are any further changes necessary.