groups in kenstone not working because of wrong argument check #51

Closed
michaelsimmler opened this issue Jan 28, 2023 · 10 comments

Comments

@michaelsimmler

michaelsimmler commented Jan 28, 2023

Hi Leonardo

Unfortunately, kenStone with groups is not working correctly.

E.g., for a dataset with 100 groups * 10 samples per group = 1000 observations, it only allows selecting at most 100 observations (k = 100).

This is because, when k > 100 in the above example, it fails at this check in line 157:

if (k > nlevels(group)) { stop("k is larger the the number of groups/levels in 'group'") }

best regards & thanks

@l-ramirez-lopez
Owner

Hi,

the function is correct, the problem was in the documentation (which I just clarified for the upcoming version).

The point is that when you use the argument group, the sampling is conducted by groups and not by individual samples. Therefore, the value of k becomes the number of groups to sample. When one observation is selected by the procedure, all observations of the same group are removed together and assigned to the calibration set.

In your example your max. k is 100 as it is the number of groups you have. To have an approximate number of samples selected by the function you will need to provide a k based on the average number of samples you have per group.

@michaelsimmler
Author

michaelsimmler commented Jan 29, 2023 via email

@michaelsimmler
Author

michaelsimmler commented Jan 29, 2023

Here is a reproducible example:

library(prospectr)

my_spec <- NIRsoil$spc

groups <- rep(1:165, each = 5) # 165 groups with 5 obs each

results <- kenStone(X = my_spec, k = 20, group = groups)

length(results$model) # 20 observations

unique(groups[results$model]) # but only four groups

@l-ramirez-lopez
Owner

Hi Michi,
thanks for the reproducible example...
You are right indeed (I jumped too fast into my previous explanation). Even when you use groups, k still refers to samples.
I'll have a look at this

@l-ramirez-lopez
Owner

Your first comment was accurate: there is no reason why k must be equal to or below the number of groups, so I removed that check from the sanity checks.

The k argument always indicates the target number of samples to be selected. So my statements in my initial reply/comment were not correct.

I added the following explanations in the documentation of the function:

In the argument description:
group An optional factor (or vector that can be coerced to a factor by as.factor) of length equal to nrow(X), giving the identifier of related observations (e.g. samples of the same batch of measurements, samples of the same origin, or of the same soil profile). Note that when this option is used, the number of samples retrieved is in some cases not exactly the one specified in k, as it will depend on the groups. See details.

In Details:
When the group argument is used, the sampling is conducted in such a way that at each iteration, when a single sample is selected, this sample, along with all the samples that belong to its group, is assigned to the final calibration set. In this respect, at each iteration, the algorithm will add one sample (in case that sample is the only one in its group) or more to the calibration set. This also implies that the argument k passed to the function will not necessarily reflect the exact number of samples selected. For example, if k = 2 and if the first sample identified belongs to a group of 5 samples and the second one belongs to a group with 10 samples, then the total number of samples retrieved by the function will be 15.
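
The arithmetic in that example can be checked with a small base-R sketch. It does not call kenStone: the grouping vector and the two "picked" rows are made up for illustration, and the only rule assumed is the one stated above, namely that selecting one sample pulls in its whole group.

```r
# Hypothetical grouping: group "a" has 5 samples, "b" has 10, "c" has 3
group <- factor(rep(c("a", "b", "c"), times = c(5, 10, 3)))

# Suppose the procedure identifies row 1 (group "a") and row 6 (group "b")
picked <- c(1, 6)

# Every sample sharing a group with a picked sample joins the calibration set
calibration <- which(group %in% group[picked])

length(calibration)  # 15, although only two samples were picked
```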

Thank you very much for noticing this issue!

@michaelsimmler
Author

Hi,

Ok yes. Although I think your explanation is not sound yet.

If I'm not mistaken, in your example in details it will actually return 5 samples and not 15 (see your while loop).

"For example, if k = 2 and if the first sample identified belongs to a group of 5 samples and the second one belongs to a group with 10 samples, then the total number of samples retrieved by the function will be 15."

@l-ramirez-lopez
Owner

l-ramirez-lopez commented Jan 29, 2023

I do not think so... I did my homework

library(prospectr)

my_spec <- NIRsoil$spc

# prepare data 
# identify the first two samples
results <- kenStone(X = my_spec, k = 2, pc = 8)
results$model

# place the first two samples on top of the matrix
my_spec <- rbind(my_spec[c(results$model), ], my_spec[-c(results$model), ])

# check
results <- kenStone(X = my_spec, k = 2, pc = 8)
results$model

# create groups
my_groups <- rep(1:275, each = 3)

# make a group of 10 samples containing row 1
# (rows 1 and 3:11; row 2 is reassigned to group 2 below)
my_groups[1:11] <- 1

# make a group of 5 samples containing row 2 (rows 2 and 12:15)
my_groups[c(2, 12:15)] <- 2

my_groups <- my_groups |> as.factor()

# check the group size for the first two samples
table(my_groups)[1:2]

# get the samples using groups
results_group <- kenStone(X = my_spec, k = 2, pc = 8, group = my_groups)

results_group$model

@michaelsimmler
Author

OK, you are right, your example is correct, but it is a bit special because the first two samples are selected together.

But for higher k the behaviour is similar to what I suspected. So I'm not sure how illustrative the example is (at least for me it's confusing).

library(prospectr)

my_spec <- NIRsoil$spc

# prepare data 
# identify the first four samples
results <- kenStone(X = my_spec, k = 4, pc = 8)
results$model

# place the first four samples on top of the matrix
my_spec <- rbind(my_spec[c(results$model), ], my_spec[-c(results$model), ])

# check
results <- kenStone(X = my_spec, k = 4, pc = 8)
results$model

# create groups
my_groups <- rep(1:275, each = 3)

# make a group of 10 samples that contains the first sample
my_groups[c(1, 5:13)] <- 1

# make a group of 5 samples that contains the second sample
my_groups[c(2, 14:17)] <- 2

# make a group of 5 samples that contains the third sample
my_groups[c(3, 18:21)] <- 3

# make a group of 5 samples that contains the fourth sample
my_groups[c(4, 26:29)] <- 4

my_groups <- my_groups |> as.factor()

# check the group size for the first four samples
table(my_groups)[1:4]

# for any k up to 15, 15 samples are always returned (one group of 10, one group of 5)
results_group <- kenStone(X = my_spec, k = 15, pc = 8, group = my_groups)

length(results_group$model)

my_groups[results_group$model]

# when k = 16, the next group is then returned as well
results_group <- kenStone(X = my_spec, k = 16, pc = 8, group = my_groups)

length(results_group$model)

my_groups[results_group$model]

@l-ramirez-lopez
Owner

Not sure what your suspicion was about...
The rationale behind the selection of the first two is the same as for the rest.
Perhaps I will include a basic code example in the documentation.

@michaelsimmler
Author

Hmm. Maybe I don't understand. My understanding is that the first two groups are selected before the while loop. So they both get selected even if one group alone (!) already has more than k observations (as in your example). This is then different for the third group which would only be added if the first two groups together have less than k observations.
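
The control flow described above can be sketched in base R. This is not the prospectr source: `simulate_group_selection` is a hypothetical helper that only mimics the behaviour as described in this comment (two groups taken unconditionally, further groups added only while fewer than k samples are selected), and `sizes` is assumed to hold the group sizes in the order the groups would be picked.

```r
# Hypothetical helper, NOT the prospectr implementation.
# sizes: group sizes in the order the groups would be selected
# k:     requested number of samples
simulate_group_selection <- function(sizes, k) {
  stopifnot(length(sizes) >= 2, k >= 2)
  # the first two groups are taken before the loop, regardless of k
  n_selected <- sum(sizes[1:2])
  i <- 2
  # further groups are added only while fewer than k samples are selected
  while (n_selected < k && i < length(sizes)) {
    i <- i + 1
    n_selected <- n_selected + sizes[i]
  }
  n_selected
}

sizes <- c(10, 5, 5, 5)  # group sizes from the k = 4 example above

simulate_group_selection(sizes, k = 2)   # 15
simulate_group_selection(sizes, k = 15)  # 15
simulate_group_selection(sizes, k = 16)  # 20
```

In this sketch, any k from 2 to 15 yields 15 samples, and a third group is only added once k exceeds the combined size of the first two.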


2 participants