
in partition/fold: restriction on having different categories within an ID #7

Closed

erflynn opened this issue Apr 14, 2020 · 6 comments

erflynn commented Apr 14, 2020

groupdata2 looks great and solves a lot of the problems I've been trying to address!

I tried to use it for one of the problems I am working on, and I am running into the following error. I know that this is by design:

 The value in 'data[[cat_col]]' must be
 * constant within each ID.

Is it possible to relax this restriction? It is very possible that two measurements from the same group (or ID) could have different categories. For example, a subject could get two different diagnoses; we still want to make sure they end up only in training or only in test, and both diagnoses should count toward the totals. In the case I'm looking at, the IDs are studies, and examples of both classes can occur within a single study.

E.g., using your df as defined in the cross-validation with groupdata2 vignette, let's just change:

df[3, "diagnosis"] <- "b"
parts <- partition(df, p = 0.2, id_col = "participant", cat_col = 'diagnosis')

And then when we run partition() it fails with the above error.

Thanks so much!
Emily

LudvigOlsen (Owner) commented

Hi Emily,

You are right that there are times when the same ID could reasonably have two different classes, and it's definitely something I should look into supporting in the future. Right now, though, the implementation doesn't make it possible:

With categorical balancing: 1) Split the dataset by the cat_col column. 2) Partition each split. 3) Combine the partitions (first partition from split 1 with first partition from split 2, etc.)

With id_col: 1) Extract the unique IDs. 2) Partition them. 3) Put all rows for the ID in its partition.

With both: 1) Extract the unique IDs with their classes. 2) Do the categorical balancing. 3) Put all rows for the ID in its partition.

So, in the "Both" version, if we have multiple classes within an ID, the approach won't work, as I would either get multiple rows per ID (which could end up in different partitions) or only consider one of the classes for each ID (which is what I believe happens in the current code).
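
To make that concrete, here is a rough sketch (the toy tibble and its values are made up for illustration; this is not the package internals):

library(dplyr)

# Hypothetical toy data: participant "1" has both diagnoses
toy <- tibble(
  participant = c("1", "1", "2", "2"),
  diagnosis   = c("a", "b", "a", "a")
)

# Step 1 of the "Both" approach: one row per unique ID/class pair
toy %>% distinct(participant, diagnosis)
# Participant "1" appears twice (once per diagnosis), so it no longer
# maps to a single row that can be sent to a single partition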

It's very possible that I can think of a way to solve this when it's not 3 am :)

For now, I suggest that you only use the id_col, and perhaps run the partitioning a couple of times with different seeds and use the one with the best distribution of the classes. Whether this is useful of course depends on the dataset.
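
A rough sketch of that seed search (assuming the vignette df with a "participant" ID column and "a"/"b" diagnoses; the scoring rule below is just one simple choice, not something built into groupdata2):

library(groupdata2)

best_seed <- NULL
best_score <- Inf

for (seed in 1:20) {
  set.seed(seed)
  parts <- partition(df, p = 0.2, id_col = "participant")
  # Score the split by how far the class proportion in the p = 0.2 partition
  # is from the overall class proportion (smaller is better)
  score <- abs(mean(parts[[1]]$diagnosis == "a") - mean(df$diagnosis == "a"))
  if (score < best_score) {
    best_score <- score
    best_seed <- seed
  }
}

set.seed(best_seed)
final_parts <- partition(df, p = 0.2, id_col = "participant")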

Alternatively, if there are only two classes, you could add an extra class for studies that have both. That is, create a column new_class with the classes 0, 1, and Both, and then use cat_col = "new_class". That should at least make sure that every class is included in each partition, although it won't necessarily be well-balanced.
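
A sketch of that idea, adapted to the "a"/"b" diagnoses of the example df (the new_class name and the "both" label are illustrative choices):

library(dplyr)
library(groupdata2)

df_new <- df %>%
  group_by(participant) %>%
  # "both" for participants with more than one diagnosis, otherwise their single diagnosis
  mutate(new_class = ifelse(n_distinct(diagnosis) > 1, "both", as.character(first(diagnosis)))) %>%
  ungroup()

# new_class is constant within each participant, so this no longer errors
parts <- partition(df_new, p = 0.2, id_col = "participant", cat_col = "new_class")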

Let me know if those suggestions would work for your project and if you need help implementing them. :)

Best,
Ludvig

erflynn commented Apr 14, 2020

Thank you! That makes a lot of sense re the implementation in the package.

I have this implemented using the tidyverse (included below). It's long and not exactly the same, but it works well enough for me; there is probably a cleaner and faster way :).

require('tidyverse')

sep_studies <- function(num_studies, nfolds){
  # Separate studies into a certain number of folds by repeating the pattern
  # 1..nfolds, nfolds..1 (randomly flipped) so fold sizes stay roughly balanced
  my_l <- c(1:nfolds, nfolds:1)
  if (runif(1, 0, 1) >= 0.5){ # note: because of this random flip, fold sizes can be off by a few samples, and this varies
    my_l <- c(nfolds:1, 1:nfolds)
  }
  if (num_studies < (2*nfolds)){
    return(my_l[1:num_studies])
  } else {
    num_reps <- num_studies %/% (2*nfolds)
    num_rem <- num_studies %% (2*nfolds)
    if (num_rem == 0){
      return(rep(my_l, num_reps))
    } else {
      return(c(rep(my_l, num_reps), my_l[1:num_rem]))
    }
  }
}

partition_group_data <- function(df, grp_col="grp", class_col="class", nfolds=2){
  # rename the columns for analysis
  colnames(df)[colnames(df)==grp_col] <- "grp"
  colnames(df)[colnames(df)==class_col] <- "class"

  # get the counts by class in each grp
  study_counts_by_class <- df %>% 
    mutate(grp=as.factor(grp)) %>%
    group_by(grp, class) %>% 
    count() %>% 
    ungroup() %>%
    pivot_wider(names_from=class, values_from=n, names_prefix="num", values_fill=c(n=0)) 
  
  # shuffle and partition
  partitioned_data <- study_counts_by_class %>% 
    group_by_if(is.numeric) %>% 
    sample_n(n()) %>%
    mutate(partition=unlist(sep_studies(n(), nfolds)))  %>%
    ungroup()
  
  # add the sample names back in
  samples_to_grps <- partitioned_data %>% 
    select(grp, partition) %>%
    left_join(df, by=c("grp")) 
  
  return(samples_to_grps)
}


set.seed(104) 
df[3, "diagnosis"] <- "b" # using the same df as before, with the same edit
parts <- partition_group_data(df, grp_col = "participant", class_col = "diagnosis", nfolds = 2)
parts %>% group_by(partition, class) %>% count() # varies depending on the iteration, but pretty close

LudvigOlsen (Owner) commented

Just went through your code, and in practice it seems to be in the ballpark of the new_class approach I mentioned (though more generalized). That seems to be a good approach for your situation!
It's unclear to me whether it's the optimal approach in general, so I will need to work with it a bit, but it's definitely a great starting point for thinking about it! Thanks for sharing :)

The code seems to be for fold(). Do you need a version of this for partition() as well?

erflynn commented Apr 14, 2020

Ok, no problem! :) This is something I wish had existed when I started the project, so I am sure it will be helpful to others.

I don't. For now, I just partition into 5 folds and assign folds 1-4 to training to get an 80/20 split. It would be a trivial extension, though, just changing sep_studies() to work with a fraction instead of just shuffling IDs.
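
For reference, a sketch of that 80/20 workaround using the partition_group_data() helper above (it assumes the tidyverse is already loaded; treating fold 5 as the test set is an arbitrary choice):

folds <- partition_group_data(df, grp_col = "participant",
                              class_col = "diagnosis", nfolds = 5)

# Folds 1-4 as training (~80%), fold 5 as test (~20%)
train <- folds %>% filter(partition != 5)
test  <- folds %>% filter(partition == 5)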

LudvigOlsen (Owner) commented

Great :)

Any other ideas you get or use cases you find, do let me know :)

erflynn commented Apr 14, 2020

will do!
