Add shuffle parameter to KFoldCV constructor #1412
Conversation
Awesome Ryan, thanks for adding data shuffling!

// If this is the last fold, we have to handle it a little bit differently,
// since the last fold may not contain 'binSize' points.
const size_t subsetSize = (i == k - 1) ? lastBinSize + (k - 2) * binSize :
    trainingSubsetSize;
If I understand it right, the whole story with `lastBinSize` is to make use of all data when the number of samples is not divisible by `k` (the previous code would miss some samples during each run in this case). If that's true, then I guess we need to use this "last bin" in the training subset for `i` from `1` to `k - 1`, rather than just when `i == k - 1`. We should probably also get rid of the variable `trainingSubsetSize`, since with this implementation it makes the code more confusing.
By reading your initial message for this PR I see that we are probably not on the same page about the last cross-validation bin. I think that when we do something like this

binSize = source.n_cols / k;
trainingSubsetSize = binSize * (k - 1);
lastBinSize = source.n_cols - ((k - 1) * binSize);

`lastBinSize` is actually no less than `binSize`.
Since `source.n_cols / k` is integer math, it's effectively `floor(source.n_cols / k)`, so I had actually thought about this a little wrong. I had been thinking `lastBinSize` would be less than `binSize`, but it will actually be up to `k - 1` larger than `binSize`. But I think that is no problem.

I agree that `trainingSubsetSize` is now confusing since it can change, so I've removed the variable and replaced it with the manual calculation of `binSize * (k - 1)` or `binSize * (k - 2) + lastBinSize`.
I'm not sure I understand the logic behind the condition `i == k - 1`. In my opinion it should be `i != 0`, since when `i == 0` the last bin will be used for validation, and in all other cases it should be used for training.
You're right, thanks for the clarification.
WeightsType>::ShuffleData()
{
  MatType data = xs.cols(0, (k - 1) * binSize + lastBinSize - 1);
  PredictionsType labels = ys.subvec(0, (k - 1) * binSize + lastBinSize - 1);
Probably it's better to use the name "predictions" instead of "labels", since k-fold cross-validation can be applied to such methods as `LinearRegression`. Also, it's probably better to use the method `cols` rather than `subvec`, in case one day we want to use it with something like `FFN` (which in the current implementation uses `arma::mat` for `responses`).
Actually I'll just change them to `xsOrig` and `ysOrig`; there is no confusion there. And you are right, `cols()` would be better.
src/mlpack/core/cv/k_fold_cv.hpp
Outdated
@@ -69,10 +69,14 @@ class KFoldCV
 * @param xs Data points to cross-validate on.
 * @param ys Predictions (labels for classification algorithms and responses
 *     for regression algorithms) for each data point.
 * @param shuffle Whether or not to shuffle the data and predictions before
 *     performing cross-validation. Shuffling will be performed before every
 *     call to Evaluate().
When I was implementing `KFoldCV`, I was also thinking about the need to shuffle the passed data. But I was thinking about doing it once during construction of a `KFoldCV` object, rather than before each call of `Evaluate()`. Is it good practice to do it for each set of hyper-parameters? If so, can you provide some references?
Agreed, I was not thinking of the hyper-parameter tuning usage. So I changed the functionality: now if `shuffle` is specified to the constructor, it shuffles during object construction, and shuffling can be performed again manually with the `Shuffle()` function.
There are no more comments from my side.
@micyril thanks for the review---I'll go ahead and merge this in 3 days to leave time for any other comments.
I've modified the KFoldCV class so that there is an optional 'shuffle' parameter, and made a few other changes also:

- Better handling of sparse matrices to fix "ShuffleData directly reads sparse matrix arrays" #1411 throughout the codebase.
- Better handling of datasets where the number of points is not evenly divisible by `k` in `KFoldCV`. @micyril, if you like, you can take a look at what I did and see if there are any issues. Basically the modification amounts to holding a `lastBinSize`, since the last cross-validation bin may hold fewer points than the others, and then modifying the `GetTrainingSubset()` and `GetValidationSubset()` functions.

The shuffling is done at the beginning of any call to `KFoldCV::Evaluate()`.

This fixes #1409.