
Add shuffle parameter to KFoldCV constructor #1412

Merged: 9 commits merged into mlpack:master, Jun 8, 2018

Conversation

@rcurtin (Member) commented May 31, 2018

I've modified the KFoldCV class so that there is an optional 'shuffle' parameter, and made a few other changes as well:

  • Better handling of sparse matrices throughout the codebase, fixing #1411 (ShuffleData directly reads sparse matrix arrays).

  • Better handling of datasets where the number of points is not evenly divisible by k in KFoldCV. @micyril, if you like, you can take a look at what I did and see if there are any issues. Basically the modification amounts to holding a lastBinSize, since the last cross-validation bin may hold fewer points than the others, and then modifying the GetTrainingSubset() and GetValidationSubset() functions.

The shuffling is done at the beginning of any call to KFoldCV::Evaluate().

This fixes #1409.

@octogman

Awesome Ryan, thanks for adding data shuffling!

// If this is the last fold, we have to handle it a little bit differently,
// since the last fold may not contain 'binSize' points.
const size_t subsetSize = (i == k - 1) ? lastBinSize + (k - 2) * binSize :
trainingSubsetSize;
Member

If I understand it right, the whole story with lastBinSize is to make use of all data when the number of samples is not divisible by k (the previous code would miss some samples during each run in this case). If that's true, then I guess we need to use this "last bin" in the training subset for i from 1 to k - 1, rather than just when i == k - 1. We probably should also get rid of the variable trainingSubsetSize, since with this implementation it makes the code more confusing.

Member

By reading your initial message for this PR, I see that we are probably not on the same page about the last cross-validation bin. I think that when we do something like this

binSize = source.n_cols / k;
trainingSubsetSize = binSize * (k - 1);
lastBinSize = source.n_cols - ((k - 1) * binSize);

lastBinSize is actually no less than binSize.

Member Author

Since source.n_cols / k is integer math, it's effectively floor(source.n_cols / k), so actually I had thought about this a little wrong. I had been thinking lastBinSize would be less than binSize, but it will actually be up to k - 1 larger than binSize. But I think that is no problem.

I agree that trainingSubsetSize is now confusing since it can change, so I've removed the variable and replaced it with the manual calculation of binSize * (k - 1) or binSize * (k - 2) + lastBinSize.

Member

I'm not sure I understand the logic behind the condition i == k - 1. In my opinion it should be i != 0 since when i == 0 the last bin will be used for validation, and in all other cases it should be used for training.

Member Author

You're right, thanks for the clarification.

WeightsType>::ShuffleData()
{
MatType data = xs.cols(0, (k - 1) * binSize + lastBinSize - 1);
PredictionsType labels = ys.subvec(0, (k - 1) * binSize + lastBinSize - 1);
Member

It's probably better to use the name "predictions" instead of "labels", since k-fold cross-validation can be applied to methods such as LinearRegression. It's also probably better to use the method cols() rather than subvec(), in case one day we want to use this with something like FFN (which in the current implementation uses arma::mat for responses).

@rcurtin (Member Author) commented Jun 1, 2018

Actually, I'll just change them to xsOrig and ysOrig; there is no confusion there. And you are right, cols() would be better.

@@ -69,10 +69,14 @@ class KFoldCV
* @param xs Data points to cross-validate on.
* @param ys Predictions (labels for classification algorithms and responses
* for regression algorithms) for each data point.
* @param shuffle Whether or not to shuffle the data and predictions before
* performing cross-validation. Shuffling will be performed before every
* call to Evaluate().
Member

When I was implementing KFoldCV, I also was thinking about the need to shuffle the passed data. But I was thinking about doing it once during construction of a KFoldCV object rather than before each call of Evaluate(). Is it a good practice to do it for each set of hyper-parameters? If so, can you provide some references?

Member Author

Agreed, I was not thinking of the hyper-parameter tuning usage. So I changed the functionality: now if shuffle is specified to the constructor, the data is shuffled during object construction, and shuffling can be performed again manually with the Shuffle() function.

@micyril (Member) left a comment

There are no more comments from my side.

@rcurtin (Member Author) commented Jun 5, 2018

@micyril thanks for the review; I'll go ahead and merge this in 3 days to leave time for any other comments.

@rcurtin rcurtin merged commit 6a59dd5 into mlpack:master Jun 8, 2018
@rcurtin rcurtin deleted the shuffle-cv branch June 8, 2018 20:00