unify defData/defData add #18

assignUser · 2020-04-19T16:44:57Z

While getting the new evalDef working (need to pass link from defData), I noticed that the first defined variable is not checked by evalDef, only tested to see if it is scalar.

kgoldfeld · 2020-04-19T16:54:14Z

The first call to a defData (when creating a new definition) is creating the first variable, there is no way to reference an existing variable in the data.table. So, the formula can only be a scalar. (This is not true for defDataAdd, where by definition that is used to add data to an existing data.table.) But, now that I think about it, I guess all the checks that are done elsewhere could be done in the first variable, except we need to check that no variable was referenced as well. Does that make sense? I never liked the fact that I had defData and defDataAdd. But I did it because in defData I could check the definition as it was being built. In the case of defDataAdd, that was not possible, so I only checked that when the definition was being executed. I guess I should have done defData the same way, and could have had a single function.

…

________________________________ From: assignUser <notifications@github.com> Sent: Sunday, April 19, 2020 12:45:09 PM To: kgoldfeld/simstudy Cc: Subscribed Subject: [kgoldfeld/simstudy] defData: first row not checked (#18) [EXTERNAL] While getting the new evalDef working (need to pass link from defData), I noticed that the first defined variable is not checked by evalDef, only tested to see if it is scalar. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_kgoldfeld_simstudy_issues_18&d=DwMCaQ&c=j5oPpO0eBH1iio48DtsedeElZfc04rx3ExJHeIIZuCs&r=fPRk5fx5iys38LPukC4wczPXS5KI6iyvPUM-ICDfxh8&m=h-tUHvXBSzuHdz_3fEd1ABMfEJqwSnZXzvfxqy9_gmg&s=dFNcyU1bl1xwJD7ICgmR7sZHGw20sLtjZImTOLt-O2Q&e=>, or unsubscribe<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ABNXOIIPHI4UEMQ32YNHKX3RNMTBLANCNFSM4ML3P32A&d=DwMCaQ&c=j5oPpO0eBH1iio48DtsedeElZfc04rx3ExJHeIIZuCs&r=fPRk5fx5iys38LPukC4wczPXS5KI6iyvPUM-ICDfxh8&m=h-tUHvXBSzuHdz_3fEd1ABMfEJqwSnZXzvfxqy9_gmg&s=y5NS02lG7b-w-oTj-faMSLN6gdufMS1ukmvGg0qzDRA&e=>.

assignUser · 2020-04-19T16:58:43Z

Yeah I understand why you did it this way, but we still can check formula formats etc. :)

I will hot fix defData for now to get the new evalDef working and testing. Afterwards I'll look at it and defDataAdd. Maybe we can unify them a bit,

assignUser · 2020-10-03T15:10:43Z

One idea I have to unify defData/Add would be to add a dt argument to defData which is used to check for previously defined vars in case it is given. This would of course change the api in a backwards incompatibily way...
Alternativly we could add a flag add = TRUE/FALSE that disables the previously defined checks.

Let me know if you have any other ideas :)

assignUser · 2020-10-03T15:19:59Z

Or the add flag could be either a boolean or a dt, in case it is dt we can check validity but in case of bool we just don't check (as is the behavior of defDataAddno).

This would allow for backward compatibility while also allowing for more "security" with adding new vars.

kgoldfeld · 2020-10-03T15:27:22Z

This is in line with what I had been thinking. In concept, I like the 2nd option you just mentioned. But, as I think about it, one issue I see is users (me, for example), might like to define all the data generating processes before actually generating the data - so that it would be impossible to specify the data set:


# Define data

d1 <- defData(...)
...
d1 <- defData(d1, ...)

d2 <- defDataAdd(...)
...
d2 <- defDataAdd(d2, ...)

# Generate data

dd <- genData(1000, d1)
dd <- addColumns(d2, dd)

assignUser · 2020-10-03T15:31:59Z

Maybe im just a bit dumb right now but why would you need two definitions in that case?

Addtionaly the "dual" flag would allow for just setting it to TRUE in that case.
Also would deprecate defDataAdd I guess.

kgoldfeld · 2020-10-03T15:52:46Z

Not dumb - it was just a highly simplified example - so my fault. Typically, it would look more like this, where there is some outcome variable that is a function of the treatment assignment. (Another typical example would be where we've created longitudinal data from a data set and want to add new data that are dependent on some time variable):

library(simstudy)
library(ggplot2)

# Define data

d1 <- defData(varname = "x", formula = 0, variance = 1)
d2 <- defDataAdd(varname = "y", formula = "5 + 0.5 * x + 2 * rx", variance = 2)

# Generate data

dd <- genData(1000, d1)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d2, dd)

# Look at data and regression line

ggplot(data = dd, aes(x = x, y = y, group = rx)) +
  geom_point()+
  geom_smooth(aes(color = factor(rx)), method = "lm")

kgoldfeld · 2020-10-03T15:54:18Z

If we can't unify them, that is fine. My primary goal is to make the data generation process super clear, and if that comes at the expense of efficiency, I am definitely willing to sacrifice efficiency.

assignUser · 2020-10-03T16:00:52Z

Ah that makes sense. I mean we could make defDataAdd use the same code as defData without changeing the external behaviour if you were thinking about that.

kgoldfeld · 2020-10-03T16:11:11Z

Would there be an advantage to that?

assignUser · 2020-10-03T16:34:32Z

For the user? no, but it sounds like you are happy with the way it work at the moment? For us it would reduce the amount of code to maintain.

kgoldfeld · 2020-10-03T17:03:08Z

So, if it would be easier to maintain - it makes sense to do it.

assignUser · 2020-10-27T21:59:04Z

There was also the idea from you at some point to add a "previous Dataset" parameter to defData. is this still something you are interested in?

kgoldfeld · 2020-10-28T13:45:56Z

I am thinking to put this on hold - I think I talked myself out of making any changes to this.

assignUser · 2020-10-28T14:00:21Z

Ok, maybe we should than make clear via warning that the id parameter will be ignored if a data definition is supplied (the current behaviour).

kgoldfeld · 2020-10-28T14:15:18Z

That sounds totally reasonable.

assignUser · 2021-04-04T17:31:00Z

I am thinking to put this on hold - I think I talked myself out of making any changes to this.

Any new insights on this? Or do you still want to keep the api as is and (if any changes) unify internally?

kgoldfeld · 2021-04-04T18:59:57Z

No - I have no new insights on this. Though your suggestion in #50 for a flag for the copy option raised a possibility here to add a flag to defData to indicate that this is adding a new columns to an existing table (which would mean we could get away from defDataAdd. It is basically the same thing as having a separate function, but makes explicit that they are the same thing, except that the formula checking if the add flag is TRUE would be be different.

I am not sure this is any better, just throwing it out there.

assignUser mentioned this issue May 6, 2020

New formula system & more tests #22

Merged

assignUser self-assigned this Aug 6, 2020

assignUser added the feature feature request or enhancement label Aug 6, 2020

assignUser changed the title ~~defData: first row not checked~~ unify defData/defData add Aug 6, 2020

assignUser mentioned this issue Aug 6, 2020

Ensure correct dist values. #24

Closed

assignUser mentioned this issue Oct 2, 2020

Add genMixFormula #51

Closed

assignUser added the question label Apr 4, 2021

assignUser mentioned this issue Apr 4, 2021

Unify def* function checking behavior #65

Open

assignUser mentioned this issue Apr 25, 2021

updateDef does not accept string argument for newvariance #93

Closed

assignUser added this to the Simstudy 1.0(?) milestone Oct 31, 2021

assignUser mentioned this issue Jan 17, 2022

Def data enhancements #134

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unify defData/defData add #18

unify defData/defData add #18

assignUser commented Apr 19, 2020

kgoldfeld commented Apr 19, 2020 via email

assignUser commented Apr 19, 2020

assignUser commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 27, 2020

kgoldfeld commented Oct 28, 2020

assignUser commented Oct 28, 2020

kgoldfeld commented Oct 28, 2020

assignUser commented Apr 4, 2021

kgoldfeld commented Apr 4, 2021

unify defData/defData add #18

unify defData/defData add #18

Comments

assignUser commented Apr 19, 2020

kgoldfeld commented Apr 19, 2020 via email

assignUser commented Apr 19, 2020

assignUser commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 3, 2020

kgoldfeld commented Oct 3, 2020

assignUser commented Oct 27, 2020

kgoldfeld commented Oct 28, 2020

assignUser commented Oct 28, 2020

kgoldfeld commented Oct 28, 2020

assignUser commented Apr 4, 2021

kgoldfeld commented Apr 4, 2021