groups in kenstone not working because of wrong argument check #51

michaelsimmler · 2023-01-28T11:24:00Z

Hi Leonardo

Unfortunately, kenStone with groups is not working correctly.

E.g., for a dataset with 100 groups * 10 samples per group = 1000 observations, it only allows selecting at most 100 observations (k = 100).

This is because, when k > 100 in the above example, it fails here in line 157:

if (k > nlevels(group)) { stop("k is larger the the number of groups/levels in 'group'") }

best regards & thanks


l-ramirez-lopez · 2023-01-28T14:18:51Z

Hi,

the function is correct, the problem was in the documentation (which I just clarified for the upcoming version).

The point is that when you use the argument group, the sampling is conducted by groups and not by individual samples. Therefore, the value of k becomes the number of groups to sample. When one observation is selected by the procedure, all observations of the same group are removed together and assigned to the calibration set.

In your example your max. k is 100 as it is the number of groups you have. To have an approximate number of samples selected by the function you will need to provide a k based on the average number of samples you have per group.

michaelsimmler · 2023-01-29T07:15:38Z

Hi,

Well, I understand your reasoning, but then the code is wrong. In the case of groups, the number of groups is compared with the number of selected observations in the while loop (whereas it should be compared with the number of selected groups). That's why, when I use groups and k = 30, I get 30 observations and not 30 groups. See the comparison statement in the while loop (line 199) and the update in line 220.

Please check again.

Thanks,
Michi

michaelsimmler · 2023-01-29T07:45:07Z

Here a reproducible example:

library(prospectr)

my_spec <- NIRsoil$spc

groups <- rep(1:165, each = 5) # 165 groups with 5 obs each

results <- kenStone(X = my_spec, k = 20, group = groups)

length(results$model) # 20 observations

unique(groups[results$model]) # but only four groups

l-ramirez-lopez · 2023-01-29T10:38:48Z

Hi Michi,
thanks for the reproducible example...
You are right indeed (I jumped too fast into my previous explanation). Even when you use groups, k still refers to samples.
I'll have a look at this

l-ramirez-lopez · 2023-01-29T14:57:00Z

Your first comment was accurate: there is no reason why k must be equal to or below the number of groups, so I removed that check from the sanity checks.

The k argument always indicates the target number of samples to be selected. So my statements in my initial reply/comment were not correct.

I added the following explanations in the documentation of the function:

In the argument description:
group An optional factor (or vector that can be coerced to a factor by as.factor) of length equal to nrow(X), giving the identifier of related observations (e.g. samples of the same batch of measurements, samples of the same origin, or of the same soil profile). Note that by using this option in some cases, the number of samples retrieved is not exactly the one specified in k, as it will depend on the groups. See details.

In Details:
When the group argument is used, the sampling is conducted in such a way that at each iteration, when a single sample is selected, this sample, along with all the samples that belong to its group, is assigned to the final calibration set. In this respect, at each iteration, the algorithm will add one sample (in case that sample is the only one in that group) or more to the calibration set. This also implies that the argument k passed to the function will not necessarily reflect the exact number of samples selected. For example, if k = 2 and if the first sample identified belongs to a group of 5 samples and the second one belongs to a group with 10 samples, then the total number of samples retrieved by the function will be 15.
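
The arithmetic of that example can be checked with a few lines of base R. This is only a sketch: the group labels and the two picked indices below are made up for illustration, and no prospectr function is called.

```r
# Hypothetical group labels: group "a" has 5 samples, group "b" has 10,
# and group "c" has 3 samples that are never picked.
group <- factor(rep(c("a", "b", "c"), times = c(5, 10, 3)))

# Assume the procedure picked sample 1 (group "a") and sample 6 (group "b").
picked <- c(1, 6)

# Every sample sharing a group with a picked sample joins the calibration set.
calibration <- which(group %in% group[picked])

length(calibration) # 15, although only 2 samples were picked directly
```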

Thank you very much for noticing this issue!

michaelsimmler · 2023-01-29T15:41:14Z

Hi,

Ok yes. Although I think your explanation is not sound yet.

If I'm not mistaken, in your example in details it will actually return 5 samples and not 15 (see your while loop).

"For example, if k = 2 and if the first sample identified belongs to a group of 5 samples and the second one belongs to a group with 10 samples, then the total number of samples retrieved by the function will be 15."

l-ramirez-lopez · 2023-01-29T16:13:17Z

I do not think so... I did my homework

library(prospectr)

my_spec <- NIRsoil$spc

# prepare data 
# identify the first two samples
results <- kenStone(X = my_spec, k = 2, pc = 8)
results$model

# place the first two samples on top of the matrix
my_spec <- rbind(my_spec[c(results$model), ], my_spec[-c(results$model), ])

# check
results <- kenStone(X = my_spec, k = 2, pc = 8)
results$model

# create groups
my_groups <- rep(1:275, each = 3)

# make a group of 10 samples for the first sample (index 2 is reassigned below)
my_groups[1:11] <- 1

# make a group of 5 samples for the second sample
my_groups[c(2, 12:15)] <- 2

my_groups <- my_groups |> as.factor()

# check the group size for the first two samples
table(my_groups)[1:2]

# get the samples using groups
results_group <- kenStone(X = my_spec, k = 2, pc = 8, group = my_groups)

results_group$model

michaelsimmler · 2023-01-29T17:19:58Z

OK, you are right, your example is correct, but it is a bit special because the first two samples are selected together.

For higher k, however, the behaviour is similar to what I suspected. So I'm not sure how illustrative the example is (at least for me it's confusing).

library(prospectr)

my_spec <- NIRsoil$spc

# prepare data 
# identify the first four samples
results <- kenStone(X = my_spec, k = 4, pc = 8)
results$model

# place the first four samples on top of the matrix
my_spec <- rbind(my_spec[c(results$model), ], my_spec[-c(results$model), ])

# check
results <- kenStone(X = my_spec, k = 4, pc = 8)
results$model

# create groups
my_groups <- rep(1:275, each = 3)

# put the first sample in a group of 10 samples
my_groups[c(1, 5:13)] <- 1

# put the second sample in a group of 5 samples
my_groups[c(2, 14:17)] <- 2

# put the third sample in a group of 5 samples
my_groups[c(3, 18:21)] <- 3

# put the fourth sample in a group of 5 samples
my_groups[c(4, 26:29)] <- 4

my_groups <- my_groups |> as.factor()

# check the group size for the first four samples
table(my_groups)[1:4]

# up to k = 15, 15 samples are always returned (one group of 10, one group of 5)
results_group <- kenStone(X = my_spec, k = 15, pc = 8, group = my_groups)

length(results_group$model)

my_groups[results_group$model]

# prepare data 
# when k = 16, the next group is then returned...
results_group <- kenStone(X = my_spec, k = 16, pc = 8, group = my_groups)

length(results_group$model)

my_groups[results_group$model]

l-ramirez-lopez · 2023-01-29T17:51:37Z

Not sure what your suspicion was about...
The rationale behind the selection of the first two is the same as for the rest.
I will perhaps include a basic code example in the documentation.

michaelsimmler · 2023-01-29T19:36:29Z

Hmm. Maybe I don't understand. My understanding is that the first two groups are selected before the while loop. So they both get selected even if one group alone (!) already has more than k observations (as in your example). This is then different for the third group which would only be added if the first two groups together have less than k observations.
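
The reading described in this comment can be sketched in base R. This is a simplified illustration, not the prospectr source: `select_with_groups` and `order_of_picks` are hypothetical names, and the fixed pick order stands in for the order in which a Kennard-Stone search would visit samples. Whether the package's loop behaves exactly like this is an assumption.

```r
# Simplified sketch of the suspected behaviour, not the prospectr code.
select_with_groups <- function(order_of_picks, group, k) {
  # All samples that share a group with sample i.
  expand <- function(i) which(group == group[i])
  # The first two picks are taken together, before any size check.
  model <- union(expand(order_of_picks[1]), expand(order_of_picks[2]))
  remaining <- setdiff(order_of_picks[-(1:2)], model)
  # Later picks are added only while fewer than k samples are selected.
  while (length(model) < k && length(remaining) > 0) {
    model <- union(model, expand(remaining[1]))
    remaining <- setdiff(remaining, model)
  }
  model
}

# Hypothetical data: one group of 10 samples followed by three groups of 5.
group <- rep(c(1, 2, 3, 4), times = c(10, 5, 5, 5))
picks <- c(1, 11, 16, 21) # hypothetical: one pick from each group

length(select_with_groups(picks, group, k = 2))  # 15
length(select_with_groups(picks, group, k = 15)) # 15
length(select_with_groups(picks, group, k = 16)) # 20
```

Under this reading, the first two groups are always returned in full, and a third group is added only once k exceeds their combined size.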

l-ramirez-lopez added a commit that referenced this issue Jan 28, 2023

doc(clarify group-based sampling #51)

2abb7e3

l-ramirez-lopez closed this as completed Jan 28, 2023

l-ramirez-lopez reopened this Jan 29, 2023

l-ramirez-lopez referenced this issue Jan 29, 2023

doc(add explanation on k when combined with groups)

4841d05

l-ramirez-lopez referenced this issue Jan 29, 2023

fix(remove k limit as function of groups)

c81debe

l-ramirez-lopez added a commit that referenced this issue Jan 29, 2023

doc(fix in kenStone) #51

f9863a4

l-ramirez-lopez closed this as completed Jan 29, 2023
