grouped data #11

Open
ggrothendieck opened this issue Jan 8, 2024 · 1 comment


ggrothendieck commented Jan 8, 2024

With grouped data it is important that if one row of a group is in the training set, then the other rows in that group cannot be in the test set. That is, instead of sampling individual rows, sample groups. This link shows an example and there is another example further down here.

https://stackoverflow.com/questions/71087864/how-to-keep-grouped-variables-together-in-training-and-test-data

Perhaps allow the holdout= argument to be a vector of indexes or provide for a group= argument. The first possibility would allow other schemes as well whereas the second is easier for the user in this situation but does not allow for unanticipated sampling schemes. It would be possible to have both, of course.
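
As a rough illustration of what a group-aware holdout might compute, here is a minimal base R sketch. The helper name `groupHoldout` and its arguments are hypothetical and not part of qeML; it returns row indexes, so the result could serve as the index vector in the first possibility above.

```r
# Hypothetical helper (not part of qeML): sample whole groups for the holdout set.
# grp:     vector of group labels, one per row of the data
# nGroups: number of groups to hold out
groupHoldout <- function(grp, nGroups) {
  ugrp <- unique(grp)
  stopifnot(nGroups >= 1, nGroups < length(ugrp))
  # index into ugrp rather than calling sample(ugrp, ...) directly, since
  # sample(x, ...) treats a length-one numeric x as 1:x
  heldOut <- ugrp[sample.int(length(ugrp), nGroups)]
  # every row belonging to a held-out group goes to the holdout set
  which(grp %in% heldOut)
}

set.seed(123)
grp <- rep(1:15, each = 10)   # 15 groups of 10 rows each, matching iris
idxs <- groupHoldout(grp, 3)

test  <- iris[idxs, ]
train <- iris[-idxs, ]
```

Because whole groups are drawn, the holdout size is determined by the sizes of the sampled groups rather than fixed in advance.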

I am currently kludging it with the code below, using iris as the example and assuming each successive 10 rows form a group.

# iris where each successive 10 rows forms a group
library(qeML)
set.seed(123)

# create grouping variable 
grp <- rep(1:15, each = 10)

# set holdout indexes so that if a row is in test or is in train then others in group are too
holdout <- which(grp %in% sample(15, 3))

# kludge it by redefining sample within qeKNN to return the indexes we want
trace(qeKNN, quote(sample <- function(x, holdout) holdout))
qeKNN(iris, "Species", holdout = holdout)
untrace(qeKNN)
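
Whatever mechanism ends up supplying the indexes, it may be worth checking that no group straddles the split. A small base R sketch follows; `groupsKeptTogether` is a hypothetical name, not a qeML function, and it assumes a non-empty holdout index vector.

```r
# Hypothetical check (not part of qeML): TRUE if no group has rows
# in both the holdout set and the remaining training rows
groupsKeptTogether <- function(grp, holdoutIdxs) {
  length(intersect(grp[holdoutIdxs], grp[-holdoutIdxs])) == 0
}

grp <- rep(1:15, each = 10)
set.seed(123)
groupIdxs <- which(grp %in% sample(15, 3))   # group-level sampling, as in the kludge above
rowIdxs   <- sample(150, 30)                 # ordinary row-level sampling, for contrast

groupsKeptTogether(grp, groupIdxs)   # TRUE by construction
groupsKeptTogether(grp, rowIdxs)     # will usually be FALSE: rows are drawn without regard to groups
```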
Owner

matloff commented Jan 9, 2024

I will add a function to v.1.2, and then blog about it. Will post a link here, all probably later this week.
