Add shuffle parameter to KFoldCV constructor #1412
Conversation
Awesome Ryan, thanks for adding data shuffling!

// If this is the last fold, we have to handle it a little bit differently,
// since the last fold may not contain 'binSize' points.
const size_t subsetSize = (i == k - 1) ? lastBinSize + (k - 2) * binSize :
    trainingSubsetSize;
If I understand it right, the whole story with `lastBinSize` is to make use of all data when the number of samples is not divisible by `k` (the previous code would miss some samples during each run in this case). If that's true, then I guess we need to use this "last bin" in the training subset for `i` from `1` to `k - 1`, rather than just when `i == k - 1`. We should probably also get rid of the variable `trainingSubsetSize`, since with this implementation it makes the code more confusing.
By reading your initial message for this PR I see that we are probably not on the same page about the last cross-validation bin. I think that when we do something like this

binSize = source.n_cols / k;
trainingSubsetSize = binSize * (k - 1);
lastBinSize = source.n_cols - ((k - 1) * binSize);

`lastBinSize` is actually no less than `binSize`.
Since `source.n_cols / k` is integer math, it's effectively `floor(source.n_cols / k)`, so I had actually thought about this a little wrong. I had been thinking `lastBinSize` would be less than `binSize`, but it will actually be up to `k - 1` larger than `binSize`. But I think that is no problem.

I agree that `trainingSubsetSize` is now confusing since it can change, so I've removed the variable and replaced it with the manual calculation of `binSize * (k - 1)` or `binSize * (k - 2) + lastBinSize`.
I'm not sure I understand the logic behind the condition `i == k - 1`. In my opinion it should be `i != 0`, since when `i == 0` the last bin will be used for validation, and in all other cases it should be used for training.
You're right, thanks for the clarification.
WeightsType>::ShuffleData()
{
  MatType data = xs.cols(0, (k - 1) * binSize + lastBinSize - 1);
  PredictionsType labels = ys.subvec(0, (k - 1) * binSize + lastBinSize - 1);
Probably it's better to use the name "predictions" instead of "labels", since k-fold cross-validation can be applied to such methods as `LinearRegression`. Also, it's probably better to use the method `cols` rather than `subvec`, in case one day we want to use it with something like `FFN` (which in the current implementation uses `arma::mat` for `responses`).
Actually I'll just change them to `xsOrig` and `ysOrig`; there is no confusion there. And you are right, `cols()` would be better.
src/mlpack/core/cv/k_fold_cv.hpp
Outdated
@@ -69,10 +69,14 @@ class KFoldCV
 * @param xs Data points to cross-validate on.
 * @param ys Predictions (labels for classification algorithms and responses
 *     for regression algorithms) for each data point.
 * @param shuffle Whether or not to shuffle the data and predictions before
 *     performing cross-validation. Shuffling will be performed before every
 *     call to Evaluate().
When I was implementing `KFoldCV`, I was also thinking about the need to shuffle the passed data. But I was thinking about doing it once during construction of a `KFoldCV` object, rather than before each call of `Evaluate()`. Is it good practice to do it for each set of hyper-parameters? If so, can you provide some references?
Agreed, I was not thinking of the hyper-parameter tuning usage. So I changed the functionality: now if `shuffle` is specified to the constructor, it shuffles during object construction, and shuffling can be performed again manually with the `Shuffle()` function.
There are no more comments from my side.
@micyril thanks for the review---I'll go ahead and merge this in 3 days to leave time for any other comments.
I've modified the KFoldCV class so that there is an optional 'shuffle' parameter, and made a few other changes also:

- Better handling of sparse matrices to fix "ShuffleData directly reads sparse matrix arrays" #1411 throughout the codebase.
- Better handling of datasets where the number of points is not evenly divisible by `k` in `KFoldCV`. @micyril, if you like, you can take a look at what I did and see if there are any issues. Basically the modification amounts to holding a `lastBinSize`, since the last cross-validation bin may hold fewer points than the others, and then modifying the `GetTrainingSubset()` and `GetValidationSubset()` functions.

The shuffling is done at the beginning of any call to `KFoldCV::Evaluate()`.

This fixes #1409.