Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add genMixFormula #51

Closed
assignUser opened this issue Sep 25, 2020 · 14 comments
Closed

Add genMixFormula #51

assignUser opened this issue Sep 25, 2020 · 14 comments
Labels
feature feature request or enhancement

Comments

@assignUser
Copy link
Collaborator

I had another idea for a useful function, particularly if you implment the vectorization. It could be nice to have a function genMixFormula(functions, probs) that takes two arguments - a vector of function names or a double-dot variable that is itself a vector and a vector of probabilities (which could be NULL, defaulting to equal probability 1/n with n functions being mixed.

Originally posted by @kgoldfeld in #44 (comment)

@assignUser assignUser added the feature feature request or enhancement label Sep 25, 2020
@assignUser
Copy link
Collaborator Author

Do you mean something like this:

genMixFormula <- function(vars, probs = NULL, roundTo = 3) {
    assertNotMissing(vars = missing(vars))

    if (is.null(probs)) {
        n <- length(vars)
        probs <- round(rep(1 / n, n), roundTo)
    } else {
        assertNumeric(probs = probs)
        probs <- .adjustProbs(unlist(probs))
        assertLengthEqual(vars = vars, probs = probs)
    }

    paste(vars, probs, sep = " | ", collapse = " + ")
}

genMixFormula(c("a", "..b[..i]", "c"))
# [1] "a | 0.333 + ..b[..i] | 0.333 + c | 0.333"
genMixFormula(c("a", "..b", "c"), list(.2, .5, .3))
# [1] "a | 0.2 + ..b | 0.5 + c | 0.3"

I wasn't quite sure what you meant with

or a double-dot variable that is itself a vector

@kgoldfeld
Copy link
Owner

So, if a is a vector of values with length n and b is vector probabilities (also length n) in double-dot world, then I was thinking something like

a <- c(1, 2, 5)
b <- c(.3, .2, .5)

genMixFormula(a, b}
# [1] "1 | .3 + 2 | .2  + 5 | .5"

Or it could be more general if we want to define the formula first and later on we are changing a and b:

genMixFormula(a, b}
# [1] "..a[1] | ..b[1] + ..a[2] | ..b[2 ] + ..a[3] | ..b[3]"

But now that I think about it, I am not sure the second option makes sense, which is what I guess you are questioning.

I did have another thought - maybe it makes sense to have a single genFormula() function that can augment the existing function, incorporates catProbs(), and include as mixture" formula like you've started to create. Does this seem like a good idea, or simpler to keep as 3 separate functions?

@assignUser
Copy link
Collaborator Author

With the function I posted you could do both variants, you would just need to pass in the array vars as strings as seen in my first example:

genMixFormula(c("a", "..b[..i]", "c"))
# [1] "a | 0.333 + ..b[..i] | 0.333 + c | 0.333"

# Could just as well be:
genMixFormula(c("..a[2]", "..b[..i]", "..a[3]"))
# [1] "..a[2] | 0.333 + ..b[..i] | 0.333 + ..a[3] | 0.333"

(This formular is ofc only usable once #41 is done)
I am not sure that a unified genFormula function would bring benefits usage wise, it would mean much more options to pass in, I think its cleaner to have separate functions. We could rename catProbs to genCatFormula or so to keep one naming theme.

Btw should we enable passing rho and corstr via variance and link parameters for cat vars that are defined via defData ?

@kgoldfeld
Copy link
Owner

Yeah - I guess I was trying to simplify things by not having to specify all the elements of the vector manually in the call to genMixFormula. I do see the challenge in distinguishing between the two examples I gave, but it would be nice to have some shorthand way of entering the vector c("..a[1]", "..a[2]". ..a[3]", ... "..a[n]"). But, we can go with the simpler version you have created, and if I ever really come up with a case where it might be useful to think about a shorthand version, then we could accomodate that. My guess is your solution is sufficient.

I am not sure I get your follow-up question. In defData, a single categorical variable is defined (and later generated) so there is no correlation structure to be applied. Of course, the probabilities can be dependent on previously defined variables, so that correlation will be induced. Now, I could see a situation where in genCorData, rho could be a function of some group membership (so that correlation for some groups would be higher than others), but I don't think that is what you are talking about.

@assignUser
Copy link
Collaborator Author

ah I see, I'll think about the shorthand.

ah, 🤦 of course xD

@kgoldfeld
Copy link
Owner

Not to keep harping on this - after all this is not the most important function - but two possible ideas. One is to interpret based on contents of a and/or b. If a is a vector of characters then those get propagated into the formula (as you were doing in your example). Or if a is a vector of values, then the more generic version get propagated. So

a <- c("x", "y", "z")
genMixFormula(a)
# [1] "x | 0.3333 + y | 0.3333 + z | 0.3333"

but

a <- c(1, 2, 6)

genMixFormula(a)
# [1] "..a[1] | 0.3333 + ..a[2] | 0.3333 + ..a[3] | 0.3333"

In the first case x, y, and z will clearly refer to variables that have been generated in the data set as part of dataDef. In the second, the mixture will be based on some external vector. Now, an additional twist would be if aa doesn't exist, then there could be another argument to specify the length of aa:

genMixFormula(aa, len_a = 4)
# [1] "..aa[1] | 0.25 + ..aa[2] | 0.25 + ..aa[3] | 0.25 + ..aa[4] | 0.25"

where aa is not created.

@kgoldfeld
Copy link
Owner

It looks like you are getting very close with this, but still not perfect (or maybe I am using incorrectly).

genMixFormula itself works in this case, but the mixture option fails:

``` r
library(simstudy)

a <- c(1, 2, 3)
mixform <- genMixFormula("..a", varLength = 3)
mixform
#> [1] "..a[[1]] | 0.333 + ..a[[2]] | 0.333 + ..a[[3]] | 0.333"

d1 <- defData(varname = "a", formula = 6, dist = "nonrandom")
d1 <- defData(d1, varname = "c", formula = 87, dist = "nonrandom")
d1 <- defData(d1, varname = "x", formula = mixform, dist = "mixture")
#> Error in .checkMixture(newform): Invalid variable(s): 
#> Probabilities can only be numeric or numeric ..vars (not arrays). See ?distribution

Created on 2020-10-02 by the reprex package (v0.3.0)

And, if we don't want the probabilities to be 1/varLength, then we are out of luck. Not saying we need to do this now - fixing the first problem would be sufficient, but it might be nice to be able to specify:

genMixForm("..a", c(.5 .3, .2)) and get the result "..a[[1]] | .5 + ..a[[2]] | .3 + ..a[[3]] | .2"

or even more generally

genMixForm("..a", ."..b", varLength = 3) and get the result

"..a[[1]] | ..b[[1]] + ..a[[2]] | ..b[[2]] + ..a[[3]] | ..b[[3]]"

@assignUser
Copy link
Collaborator Author

Oh yeah this is just the new function, I still have to fix the check in evalDef to work with arrays.

Ill have a look at the rest later.

@kgoldfeld
Copy link
Owner

Should I hold off on merging until evalDef is updated? Or should I go ahead?

@assignUser
Copy link
Collaborator Author

The changes to eval def are bit bigger as we have to move these checks to pre generation and I'll have to think about how to do that in a nice way. (this will probably involve #18 ) I think merging now would be ok what do you think?

@kgoldfeld
Copy link
Owner

Probably OK since only super users would likely attempt to use the mixture formula in this fashion. But given that it fails, we might not want to include it in the examples just yet.

And more generally, I kind of like the example in help to be used in the context of data generation - more like what I did above. If you don't disagree (or don't really have an opinion), I am happy to edit the example.

@assignUser
Copy link
Collaborator Author

Sure feel free to improve the documentation :) I will of course make the array stuff work before 0.2.0

@assignUser
Copy link
Collaborator Author

btw should we rename catProbs to genCatFormula to be inline with genFormula and genMixFormula?

@kgoldfeld
Copy link
Owner

Yes - definitely

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants