grouped data #11

ggrothendieck · 2024-01-08T17:19:22Z

With grouped data it is important that if one row of a group is in the training set, then the other rows in that group cannot be in the test set. That is, instead of sampling individual rows, sample whole groups. The link below shows an example, and there is another example further down in this post.

https://stackoverflow.com/questions/71087864/how-to-keep-grouped-variables-together-in-training-and-test-data

Perhaps allow the holdout= argument to be a vector of indexes, or provide a group= argument. The first possibility would allow other sampling schemes as well, whereas the second is easier for the user in this situation but does not cover unanticipated schemes. It would be possible to have both, of course.

I am currently kludging it with the code below, which uses iris and assumes that each successive 10 rows form a group.

# iris where each successive 10 rows forms a group
library(qeML)
set.seed(123)

# create grouping variable: 15 groups of 10 consecutive rows
grp <- rep(1:15, each = 10)

# sample 3 whole groups for the holdout set, so that every row of a group
# lands on the same side of the train/test split
holdout <- which(grp %in% sample(15, 3))

# kludge it by redefining sample within qeKNN to return the indexes we want
trace(qeKNN, quote(sample <- function(x, holdout) holdout))
qeKNN(iris, "Species", holdout = holdout)
untrace(qeKNN)
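
The group-wise sampling step can also be done without touching qeML at all. The following is only a sketch in base R under stated assumptions: `groupHoldoutIdxs` is a hypothetical helper name, not a qeML function, and it assumes the grouping is given as a vector with one entry per row. It returns the row indexes of the held-out groups, which could then be used to split the data by hand or passed to whatever interface ends up accepting a vector of indexes.

```r
# Hypothetical helper (not part of qeML): choose whole groups for the holdout set.
# grp            : vector with one group label per row of the data
# nHoldoutGroups : number of groups to place in the holdout (test) set
groupHoldoutIdxs <- function(grp, nHoldoutGroups) {
  ids <- unique(grp)
  stopifnot(nHoldoutGroups >= 1, nHoldoutGroups < length(ids))
  # sample positions rather than labels, so a lone numeric label
  # is never misread by sample() as the range 1:label
  heldOut <- ids[sample.int(length(ids), nHoldoutGroups)]
  which(grp %in% heldOut)
}

set.seed(123)
grp <- rep(1:15, each = 10)   # same artificial grouping of iris as above
testIdxs <- groupHoldoutIdxs(grp, 3)

train <- iris[-testIdxs, ]
test  <- iris[testIdxs, ]

# check: no group appears on both sides of the split
stopifnot(length(intersect(grp[testIdxs], grp[-testIdxs])) == 0)
```

Sampling group labels first and then expanding them to row indexes is what guarantees that a group is never split between training and test data. The size of the holdout set is then determined by the sizes of the sampled groups rather than fixed in advance.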

matloff · 2024-01-09T23:22:41Z

I will add a function to v.1.2, and then blog about it. Will post a link here, all probably later this week.
