in partition/fold: restriction on having different categories within an ID #7
Comments
Hi Emily,

You are right that there are times when the same ID could reasonably have two different classes, and it's definitely something I should look into supporting in the future. Right now, though, the implementation doesn't make it possible:

With categorical balancing:
1) Split the dataset by the

With

With both:
1) Extract the unique IDs with their classes.
2) Do the categorical balancing.
3) Put all rows for the ID in its partition.

So, in the "Both" version, if we had multiple classes within an ID, the approach wouldn't work, as I would either have multiple rows per ID (that may end up in different partitions) or only consider one of the classes for each ID (which is what would happen in the current code, I believe).

It's very possible that I can think of a way to solve this when it's not 3 am :)

For now, I'll suggest that you only use the

Alternatively, if it's only two classes, you could make an extra class if the study has both. So you could make a column

Let me know if those suggestions would work for your project and if you need help implementing them. :)

Best,
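The three "both" steps, together with the extra-class workaround, can be sketched in a language-neutral way. The following is a minimal Python illustration and not groupdata2's implementation: the function names `combined_class_per_id` and `partition_ids`, the `"+"`-joined label, and the `(id, class)` row layout are all assumptions made for this example.

```python
import random
from collections import defaultdict


def combined_class_per_id(rows):
    """Map each ID to one label; an ID seen with several classes gets a joined label.

    Hypothetical helper: rows are (id, class) tuples with string classes.
    """
    classes = defaultdict(set)
    for row_id, cls in rows:
        classes[row_id].add(cls)
    return {row_id: "+".join(sorted(seen)) for row_id, seen in classes.items()}


def partition_ids(rows, test_fraction=0.2, seed=0):
    """Split rows so that every ID lands in exactly one partition."""
    # 1) Extract the unique IDs with their (single, possibly combined) class.
    id_class = combined_class_per_id(rows)

    # 2) Categorical balancing on the ID level: within each class,
    #    shuffle the IDs and send a fraction of them to the test partition.
    ids_by_class = defaultdict(list)
    for row_id, cls in id_class.items():
        ids_by_class[cls].append(row_id)
    rng = random.Random(seed)
    test_ids = set()
    for cls in sorted(ids_by_class):
        ids = sorted(ids_by_class[cls])
        rng.shuffle(ids)
        test_ids.update(ids[: round(len(ids) * test_fraction)])

    # 3) Put all rows for an ID in that ID's partition.
    train = [row for row in rows if row[0] not in test_ids]
    test = [row for row in rows if row[0] in test_ids]
    return train, test
```

Because the balancing happens on one label per ID, a study containing both classes is treated as its own third category. It therefore stays in a single partition, but the row-level class counts are only approximately balanced.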
Thank you! That makes a lot of sense re the implementation in the package. I have this implemented using tidyverse and it works (included below). It's long and not exactly the same, but it works well enough for me. There is probably a cleaner and faster way, though :).
Just went through your code, and in practice it seems to be in the ballpark of the The code seems to be for
ok! no problem! :) This is something I wish was written when I started the project, so I am sure it will be helpful to others. I don't - I just use partition into 5 folds and then assign 1-4 to training for now to get an 80/20 split, but it would be a trivial extension, just changing
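For readers who want to see the 5-fold approach spelled out, here is a minimal Python sketch of the idea: assign IDs to five folds, then use folds 1-4 for training and fold 5 for testing. It is not the tidyverse code from this thread; `assign_folds` and `split_by_fold` are hypothetical names, and the rows are assumed to be tuples whose first element is the ID.

```python
import random


def assign_folds(ids, k=5, seed=0):
    """Give every unique ID a fold number in 1..k, keeping fold sizes near-equal."""
    unique_ids = sorted(set(ids))
    random.Random(seed).shuffle(unique_ids)
    # Deal the shuffled IDs out round-robin, like cards, into k folds.
    return {row_id: position % k + 1 for position, row_id in enumerate(unique_ids)}


def split_by_fold(rows, fold_of, test_folds=(5,)):
    """Rows whose ID is in a test fold go to test; all other rows go to train."""
    train = [row for row in rows if fold_of[row[0]] not in test_folds]
    test = [row for row in rows if fold_of[row[0]] in test_folds]
    return train, test
```

The split is 80/20 in terms of IDs. In terms of rows it is only exact when every ID contributes the same number of rows.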
Great :) If you get any other ideas or find other use cases, do let me know :)
will do!
groupdata2 looks great and solves a lot of the problems I am trying to solve! I tried to use it for one of the problems I am working on, and I am running into the following error. I know that this is by design
Is it possible to relax your restriction? It is very possible that two measurements from the same group (or ID) could be different. For example, a subject could get two different diagnoses, and we want to make sure they are still only in either training or test, and that both diagnoses count toward the totals. In the case I'm looking at, the IDs are studies, and examples of both classes can occur within a single study.
E.g., using your df as defined in the cross-validation with group data vignette, let's just change:
And then when we run partition(), it fails with the above error.

Thanks so much!
Emily