Add genMixFormula #51

assignUser · 2020-09-25T11:53:48Z

I had another idea for a useful function, particularly if you implment the vectorization. It could be nice to have a function genMixFormula(functions, probs) that takes two arguments - a vector of function names or a double-dot variable that is itself a vector and a vector of probabilities (which could be NULL, defaulting to equal probability 1/n with n functions being mixed.

Originally posted by @kgoldfeld in #44 (comment)

The text was updated successfully, but these errors were encountered:

assignUser · 2020-09-28T09:15:45Z

Do you mean something like this:

genMixFormula <- function(vars, probs = NULL, roundTo = 3) {
    assertNotMissing(vars = missing(vars))

    if (is.null(probs)) {
        n <- length(vars)
        probs <- round(rep(1 / n, n), roundTo)
    } else {
        assertNumeric(probs = probs)
        probs <- .adjustProbs(unlist(probs))
        assertLengthEqual(vars = vars, probs = probs)
    }

    paste(vars, probs, sep = " | ", collapse = " + ")
}

genMixFormula(c("a", "..b[..i]", "c"))
# [1] "a | 0.333 + ..b[..i] | 0.333 + c | 0.333"
genMixFormula(c("a", "..b", "c"), list(.2, .5, .3))
# [1] "a | 0.2 + ..b | 0.5 + c | 0.3"

I wasn't quite sure what you meant with

or a double-dot variable that is itself a vector

kgoldfeld · 2020-09-29T00:38:07Z

So, if a is a vector of values with length n and b is vector probabilities (also length n) in double-dot world, then I was thinking something like

a <- c(1, 2, 5)
b <- c(.3, .2, .5)

genMixFormula(a, b}
# [1] "1 | .3 + 2 | .2  + 5 | .5"

Or it could be more general if we want to define the formula first and later on we are changing a and b:

genMixFormula(a, b}
# [1] "..a[1] | ..b[1] + ..a[2] | ..b[2 ] + ..a[3] | ..b[3]"

But now that I think about it, I am not sure the second option makes sense, which is what I guess you are questioning.

I did have another thought - maybe it makes sense to have a single genFormula() function that can augment the existing function, incorporates catProbs(), and include as mixture" formula like you've started to create. Does this seem like a good idea, or simpler to keep as 3 separate functions?

assignUser · 2020-09-29T01:19:30Z

With the function I posted you could do both variants, you would just need to pass in the array vars as strings as seen in my first example:

genMixFormula(c("a", "..b[..i]", "c"))
# [1] "a | 0.333 + ..b[..i] | 0.333 + c | 0.333"

# Could just as well be:
genMixFormula(c("..a[2]", "..b[..i]", "..a[3]"))
# [1] "..a[2] | 0.333 + ..b[..i] | 0.333 + ..a[3] | 0.333"

(This formular is ofc only usable once #41 is done)
I am not sure that a unified genFormula function would bring benefits usage wise, it would mean much more options to pass in, I think its cleaner to have separate functions. We could rename catProbs to genCatFormula or so to keep one naming theme.

Btw should we enable passing rho and corstr via variance and link parameters for cat vars that are defined via defData ?

kgoldfeld · 2020-09-29T02:28:12Z

Yeah - I guess I was trying to simplify things by not having to specify all the elements of the vector manually in the call to genMixFormula. I do see the challenge in distinguishing between the two examples I gave, but it would be nice to have some shorthand way of entering the vector c("..a[1]", "..a[2]". ..a[3]", ... "..a[n]"). But, we can go with the simpler version you have created, and if I ever really come up with a case where it might be useful to think about a shorthand version, then we could accomodate that. My guess is your solution is sufficient.

I am not sure I get your follow-up question. In defData, a single categorical variable is defined (and later generated) so there is no correlation structure to be applied. Of course, the probabilities can be dependent on previously defined variables, so that correlation will be induced. Now, I could see a situation where in genCorData, rho could be a function of some group membership (so that correlation for some groups would be higher than others), but I don't think that is what you are talking about.

assignUser · 2020-09-29T07:34:50Z

ah I see, I'll think about the shorthand.

ah, 🤦 of course xD

kgoldfeld · 2020-09-29T12:41:27Z

Not to keep harping on this - after all this is not the most important function - but two possible ideas. One is to interpret based on contents of a and/or b. If a is a vector of characters then those get propagated into the formula (as you were doing in your example). Or if a is a vector of values, then the more generic version get propagated. So

a <- c("x", "y", "z")
genMixFormula(a)
# [1] "x | 0.3333 + y | 0.3333 + z | 0.3333"

but

a <- c(1, 2, 6)

genMixFormula(a)
# [1] "..a[1] | 0.3333 + ..a[2] | 0.3333 + ..a[3] | 0.3333"

In the first case x, y, and z will clearly refer to variables that have been generated in the data set as part of dataDef. In the second, the mixture will be based on some external vector. Now, an additional twist would be if aa doesn't exist, then there could be another argument to specify the length of aa:

genMixFormula(aa, len_a = 4)
# [1] "..aa[1] | 0.25 + ..aa[2] | 0.25 + ..aa[3] | 0.25 + ..aa[4] | 0.25"

where aa is not created.

kgoldfeld · 2020-10-02T18:18:23Z

It looks like you are getting very close with this, but still not perfect (or maybe I am using incorrectly).

genMixFormula itself works in this case, but the mixture option fails:

``` r
library(simstudy)

a <- c(1, 2, 3)
mixform <- genMixFormula("..a", varLength = 3)
mixform
#> [1] "..a[[1]] | 0.333 + ..a[[2]] | 0.333 + ..a[[3]] | 0.333"

d1 <- defData(varname = "a", formula = 6, dist = "nonrandom")
d1 <- defData(d1, varname = "c", formula = 87, dist = "nonrandom")
d1 <- defData(d1, varname = "x", formula = mixform, dist = "mixture")
#> Error in .checkMixture(newform): Invalid variable(s): 
#> Probabilities can only be numeric or numeric ..vars (not arrays). See ?distribution

^{Created on 2020-10-02 by the reprex package (v0.3.0)}

And, if we don't want the probabilities to be 1/varLength, then we are out of luck. Not saying we need to do this now - fixing the first problem would be sufficient, but it might be nice to be able to specify:

genMixForm("..a", c(.5 .3, .2)) and get the result "..a[[1]] | .5 + ..a[[2]] | .3 + ..a[[3]] | .2"

or even more generally

genMixForm("..a", ."..b", varLength = 3) and get the result

"..a[[1]] | ..b[[1]] + ..a[[2]] | ..b[[2]] + ..a[[3]] | ..b[[3]]"

assignUser · 2020-10-02T18:30:36Z

Oh yeah this is just the new function, I still have to fix the check in evalDef to work with arrays.

Ill have a look at the rest later.

kgoldfeld · 2020-10-02T18:32:34Z

Should I hold off on merging until evalDef is updated? Or should I go ahead?

assignUser · 2020-10-02T18:37:16Z

The changes to eval def are bit bigger as we have to move these checks to pre generation and I'll have to think about how to do that in a nice way. (this will probably involve #18 ) I think merging now would be ok what do you think?

kgoldfeld · 2020-10-02T18:41:25Z

Probably OK since only super users would likely attempt to use the mixture formula in this fashion. But given that it fails, we might not want to include it in the examples just yet.

And more generally, I kind of like the example in help to be used in the context of data generation - more like what I did above. If you don't disagree (or don't really have an opinion), I am happy to edit the example.

assignUser · 2020-10-02T18:59:23Z

Sure feel free to improve the documentation :) I will of course make the array stuff work before 0.2.0

assignUser · 2020-10-02T19:00:13Z

btw should we rename catProbs to genCatFormula to be inline with genFormula and genMixFormula?

kgoldfeld · 2020-10-02T19:01:57Z

Yes - definitely

assignUser added the feature feature request or enhancement label Sep 25, 2020

assignUser added a commit that referenced this issue Oct 2, 2020

add shorthand to genMixFormula #51

8150a4e

assignUser mentioned this issue Oct 2, 2020

Fix a number of issues and add features #56

Merged

kgoldfeld closed this as completed in 53b4177 Oct 2, 2020

assignUser mentioned this issue Oct 2, 2020

rename catProbs & deprecate #57

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add genMixFormula #51

Add genMixFormula #51

assignUser commented Sep 25, 2020

assignUser commented Sep 28, 2020

kgoldfeld commented Sep 29, 2020

assignUser commented Sep 29, 2020

kgoldfeld commented Sep 29, 2020

assignUser commented Sep 29, 2020

kgoldfeld commented Sep 29, 2020

kgoldfeld commented Oct 2, 2020

assignUser commented Oct 2, 2020

kgoldfeld commented Oct 2, 2020

assignUser commented Oct 2, 2020

kgoldfeld commented Oct 2, 2020

assignUser commented Oct 2, 2020

assignUser commented Oct 2, 2020

kgoldfeld commented Oct 2, 2020

Add genMixFormula #51

Add genMixFormula #51

Comments

assignUser commented Sep 25, 2020

assignUser commented Sep 28, 2020

kgoldfeld commented Sep 29, 2020

assignUser commented Sep 29, 2020

kgoldfeld commented Sep 29, 2020

assignUser commented Sep 29, 2020

kgoldfeld commented Sep 29, 2020

kgoldfeld commented Oct 2, 2020

assignUser commented Oct 2, 2020

kgoldfeld commented Oct 2, 2020

assignUser commented Oct 2, 2020

kgoldfeld commented Oct 2, 2020

assignUser commented Oct 2, 2020

assignUser commented Oct 2, 2020

kgoldfeld commented Oct 2, 2020