
in partition/fold: restriction on having different categories within an ID #7

Closed

erflynn opened this issue Apr 14, 2020 · 6 comments

erflynn commented Apr 14, 2020

groupdata2 looks great and solves a lot of the problems I've been trying to address!

I tried to use it for one of the problems I am working on, and I am running into the following error. I know that this is by design:

 The value in 'data[[cat_col]]' must be
 * constant within each ID.

Is it possible to relax this restriction? It is very possible that two measurements from the same group (or ID) could have different categories. For example, a subject could get two different diagnoses; we still want to make sure they end up only in training or only in test, and both diagnoses should count toward the totals. In the case I'm looking at, the IDs are studies, and examples of both classes can occur within a single study.

E.g., using your df as defined in the cross-validation with groupdata2 vignette, let's just change:

df[3, "diagnosis"] <- "b"
parts <- partition(df, p = 0.2, id_col = "participant", cat_col = 'diagnosis')

And then when we run partition() it fails with the above error.

Thanks so much!
Emily

LudvigOlsen (Owner) commented

Hi Emily,

You are right that there are times when the same ID could reasonably have two different classes, and it's definitely something I should look into supporting in the future. Right now, though, the implementation doesn't make it possible:

With categorical balancing: 1) Split the dataset by the cat_col column. 2) Partition each split. 3) Combine the partitions (first partition from split 1 with first partition from split 2, etc.)

With id_col: 1) Extract the unique IDs. 2) Partition them. 3) Put all rows for the ID in its partition.

With both: 1) Extract the unique IDs with their classes. 2) Do the categorical balancing. 3) Put all rows for the ID in its partition.

So, in the "Both" version, if we have multiple classes within an ID, the approach won't work, as I would either get multiple rows per ID (which could end up in different partitions) or only consider one of the classes for each ID (which is what I believe happens in the current code).
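
To make that concrete, here is a rough sketch (the toy tibble and its values are made up for illustration; this is not the package internals):

library(dplyr)

# Hypothetical toy data: participant "1" has both diagnoses
toy <- tibble(
  participant = c("1", "1", "2", "2"),
  diagnosis   = c("a", "b", "a", "a")
)

# Step 1 of the "Both" approach: one row per unique ID/class pair
toy %>% distinct(participant, diagnosis)
# Participant "1" appears twice (once per diagnosis), so it no longer
# maps to a single row that can be sent to a single partition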

It's very possible that I can think of a way to solve this when it's not 3 am :)

For now, I suggest that you only use the id_col, and perhaps run the partitioning a couple of times with different seeds and use the one with the best distribution of the classes. Whether this is useful of course depends on the dataset.
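
A rough sketch of that seed search (assuming the vignette df with a "participant" ID column and "a"/"b" diagnoses; the scoring rule below is just one simple choice, not something built into groupdata2):

library(groupdata2)

best_seed <- NULL
best_score <- Inf

for (seed in 1:20) {
  set.seed(seed)
  parts <- partition(df, p = 0.2, id_col = "participant")
  # Score the split by how far the class proportion in the p = 0.2 partition
  # is from the overall class proportion (smaller is better)
  score <- abs(mean(parts[[1]]$diagnosis == "a") - mean(df$diagnosis == "a"))
  if (score < best_score) {
    best_score <- score
    best_seed <- seed
  }
}

set.seed(best_seed)
final_parts <- partition(df, p = 0.2, id_col = "participant")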

Alternatively, if there are only two classes, you could add an extra class for studies that have both. That is, create a column new_class with the classes 0, 1, and Both, and then use cat_col = "new_class". That should at least make sure that every class is included in each partition, although it won't necessarily be well-balanced.
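
A sketch of that idea, adapted to the "a"/"b" diagnoses of the example df (the new_class name and the "both" label are illustrative choices):

library(dplyr)
library(groupdata2)

df_new <- df %>%
  group_by(participant) %>%
  # "both" for participants with more than one diagnosis, otherwise their single diagnosis
  mutate(new_class = ifelse(n_distinct(diagnosis) > 1, "both", as.character(first(diagnosis)))) %>%
  ungroup()

# new_class is constant within each participant, so this no longer errors
parts <- partition(df_new, p = 0.2, id_col = "participant", cat_col = "new_class")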

Let me know if those suggestions would work for your project and if you need help implementing them. :)

Best,
Ludvig

erflynn commented Apr 14, 2020

Thank you! That makes a lot of sense re the implementation in the package.

I have this implemented using the tidyverse (included below). It's long and not exactly the same, but it works well enough for me; there is probably a cleaner and faster way :).

require('tidyverse')

sep_studies <- function(num_studies, nfolds){
  # Separate studies into a certain number of folds by repeating the pattern
  # 1..nfolds, nfolds..1 (randomly flipped) so fold sizes stay roughly balanced
  my_l <- c(1:nfolds, nfolds:1)
  if (runif(1, 0, 1) >= 0.5){ # note: because of this random flip, fold sizes can be off by a few samples, and this varies
    my_l <- c(nfolds:1, 1:nfolds)
  }
  if (num_studies < (2*nfolds)){
    return(my_l[1:num_studies])
  } else {
    num_reps <- num_studies %/% (2*nfolds)
    num_rem <- num_studies %% (2*nfolds)
    if (num_rem == 0){
      return(rep(my_l, num_reps))
    } else {
      return(c(rep(my_l, num_reps), my_l[1:num_rem]))
    }
  }
}

partition_group_data <- function(df, grp_col="grp", class_col="class", nfolds=2){
  # rename the columns for analysis
  colnames(df)[colnames(df)==grp_col] <- "grp"
  colnames(df)[colnames(df)==class_col] <- "class"

  # get the counts by class in each grp
  study_counts_by_class <- df %>% 
    mutate(grp=as.factor(grp)) %>%
    group_by(grp, class) %>% 
    count() %>% 
    ungroup() %>%
    pivot_wider(names_from=class, values_from=n, names_prefix="num", values_fill=c(n=0)) 
  
  # shuffle and partition
  partitioned_data <- study_counts_by_class %>% 
    group_by_if(is.numeric) %>% 
    sample_n(n()) %>%
    mutate(partition=unlist(sep_studies(n(), nfolds)))  %>%
    ungroup()
  
  # add the sample names back in
  samples_to_grps <- partitioned_data %>% 
    select(grp, partition) %>%
    left_join(df, by=c("grp")) 
  
  return(samples_to_grps)
}


set.seed(104) 
df[3, "diagnosis"] <- "b" # using the same df as before, with the same edit
parts <- partition_group_data(df, grp_col = "participant", class_col = "diagnosis", nfolds = 2)
parts %>% group_by(partition, class) %>% count() # varies depending on the iteration, but pretty close

LudvigOlsen (Owner) commented

Just went through your code, and in practice it seems to be in the ballpark of the new_class approach I mentioned (though more generalized). That seems to be a good approach for your situation!
It's unclear to me whether it's the optimal approach in general, so I will need to work with it a bit, but it's definitely a great starting point for thinking about it! Thanks for sharing :)

The code seems to be for fold(). Do you need a version of this for partition() as well?

erflynn commented Apr 14, 2020

Ok, no problem! :) This is something I wish had existed when I started the project, so I am sure it will be helpful to others.

I don't. For now, I just partition into 5 folds and assign folds 1-4 to training to get an 80/20 split. It would be a trivial extension, though, just changing sep_studies() to work with a fraction instead of just shuffling IDs.
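
For reference, a sketch of that 80/20 workaround using the partition_group_data() helper above (it assumes the tidyverse is already loaded; treating fold 5 as the test set is an arbitrary choice):

folds <- partition_group_data(df, grp_col = "participant",
                              class_col = "diagnosis", nfolds = 5)

# Folds 1-4 as training (~80%), fold 5 as test (~20%)
train <- folds %>% filter(partition != 5)
test  <- folds %>% filter(partition == 5)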

LudvigOlsen (Owner) commented

Great :)

Any other ideas you get or use cases you find, do let me know :)

erflynn commented Apr 14, 2020

will do!
