
Add shuffle parameter to KFoldCV constructor #1412

Merged
merged 9 commits into mlpack:master from rcurtin:shuffle-cv on Jun 8, 2018

Conversation

@rcurtin
Member

rcurtin commented May 31, 2018

I've modified the KFoldCV class to add an optional 'shuffle' parameter, and made a few other changes as well:

  • Better handling of sparse matrices throughout the codebase, to fix #1411.

  • Better handling of datasets where the number of points is not evenly divisible by k in KFoldCV. @micyril, if you like, you can take a look at what I did and see if there are any issues. Basically the modification amounts to holding a lastBinSize, since the last cross-validation bin may hold fewer points than the others, and then modifying the GetTrainingSubset() and GetValidationSubset() functions.

The shuffling is done at the beginning of any call to KFoldCV::Evaluate().

This fixes #1409.

@octogman


octogman commented May 31, 2018

Awesome Ryan, thanks for adding data shuffling!

// If this is the last fold, we have to handle it a little bit differently,
// since the last fold may not contain 'binSize' points.
const size_t subsetSize = (i == k - 1) ?
    lastBinSize + (k - 2) * binSize : trainingSubsetSize;


@micyril

micyril Jun 1, 2018

Member

If I understand it right, the whole point of lastBinSize is to make use of all the data when the number of samples is not divisible by k (the previous code would miss some samples during each run in this case). If that's true, then I guess we need to use this "last bin" in the training subset for i from 1 to k - 1, rather than just when i == k - 1. We should probably also get rid of the variable trainingSubsetSize, since with this implementation it makes the code more confusing.


@micyril

micyril Jun 1, 2018

Member

By reading your initial message for this PR, I see that we are probably not on the same page about the last cross-validation bin. I think that when we do something like this

binSize = source.n_cols / k;
trainingSubsetSize = binSize * (k - 1);
lastBinSize = source.n_cols - ((k - 1) * binSize);

lastBinSize is actually no less than binSize.


@rcurtin

rcurtin Jun 1, 2018

Member

Since source.n_cols / k is integer math, it's effectively floor(source.n_cols / k), so I had actually thought about this a little wrong: I had been thinking lastBinSize would be less than binSize, but it will actually be up to k - 1 larger than binSize. But I think that is no problem.

I agree that trainingSubsetSize is now confusing since it can change, so I've removed the variable and replaced it with the manual calculation of binSize * (k - 1) or binSize * (k - 2) + lastBinSize.


@micyril

micyril Jun 2, 2018

Member

I'm not sure I understand the logic behind the condition i == k - 1. In my opinion it should be i != 0 since when i == 0 the last bin will be used for validation, and in all other cases it should be used for training.


@rcurtin

rcurtin Jun 4, 2018

Member

You're right, thanks for the clarification.

WeightsType>::ShuffleData()
{
  MatType data = xs.cols(0, (k - 1) * binSize + lastBinSize - 1);
  PredictionsType labels = ys.subvec(0, (k - 1) * binSize + lastBinSize - 1);


@micyril

micyril Jun 1, 2018

Member

It's probably better to use the name "predictions" instead of "labels", since k-fold cross-validation can be applied to methods like LinearRegression. It's also probably better to use the method cols() rather than subvec(), in case one day we want to use it with something like FFN (which in the current implementation uses arma::mat for responses).


@rcurtin

rcurtin Jun 1, 2018

Member

Actually, I'll just change them to xsOrig and ysOrig; there is no confusion there. And you are right, cols() would be better.

@@ -69,10 +69,14 @@ class KFoldCV
* @param xs Data points to cross-validate on.
* @param ys Predictions (labels for classification algorithms and responses
* for regression algorithms) for each data point.
* @param shuffle Whether or not to shuffle the data and predictions before
* performing cross-validation. Shuffling will be performed before every
* call to Evaluate().


@micyril

micyril Jun 1, 2018

Member

When I was implementing KFoldCV, I was also thinking about the need to shuffle the passed data, but I was considering doing it once during construction of a KFoldCV object rather than before each call to Evaluate(). Is it good practice to do it for each set of hyper-parameters? If so, can you provide some references?


@rcurtin

rcurtin Jun 1, 2018

Member

Agreed, I was not thinking of the hyper-parameter tuning usage, so I changed the functionality: now if shuffle is specified to the constructor, the data is shuffled during object construction, and shuffling can be performed again manually with the Shuffle() function.

@micyril

micyril approved these changes Jun 5, 2018

There are no more comments from my side.

@rcurtin


Member

rcurtin commented Jun 5, 2018

@micyril thanks for the review! I'll go ahead and merge this in 3 days to leave time for any other comments.

@rcurtin rcurtin merged commit 6a59dd5 into mlpack:master Jun 8, 2018

5 checks passed:

  • Memory Checks
  • Static Code Analysis Checks: Build finished.
  • Style Checks: Build finished.
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed

@rcurtin rcurtin deleted the rcurtin:shuffle-cv branch Jun 8, 2018
