Question about defData vs defDataAdd #143

Open
DAVIDCRUZ0202 opened this issue Mar 4, 2022 · 6 comments

Comments

@DAVIDCRUZ0202

In utility.R, there are two functions.

  1. updateDef
  2. updateDefAdd

Does anyone know the difference between these two in functionality? I can't seem to distinguish one from the other. Documentation says that updateDef should only be used with defData, while updateDefAdd should only be used with defDataAdd. What is the difference between how these update definition tables? Thank you in advance!

@kgoldfeld
Owner

That is an excellent question, and yes, I can try to explain the difference.

Originally, defData and defDataAdd were created (maybe ill-advisedly) as distinct functions that differ in when they check for preexisting variables: in the definition statement or in the generation statement. In defData, the checking occurs when the data definition table itself is created, to ensure that the formulas will refer to valid variables once the data are generated; this guarantees a valid definition table.

defDataAdd creates a definition table that will add data to an existing data set, so the formulas can include variables that are not in this new definition table but are in the data set that is being augmented. The checking in this case is done when addColumns is called. So, it is possible to create an invalid set of data definitions using defDataAdd, and you won't find out until the subsequent call is made.
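
To make the timing of the checks concrete, here is a minimal sketch. The variable names (x, y, w, z, v, u, notThere) are made up for illustration, and the exact error messages may differ between simstudy versions, so the errors are caught and printed rather than assumed.

```r
library(simstudy)

# defData: each formula may only refer to variables defined earlier in the table
def <- defData(varname = "x", formula = 0, variance = 1)
def <- defData(def, varname = "y", formula = "2 * x", variance = 1)

# "z" is not defined anywhere, so this is expected to fail at definition time
resDef <- tryCatch(
  defData(def, varname = "w", formula = "2 * z", variance = 1),
  error = function(e) conditionMessage(e)
)
print(resDef)

# defDataAdd: "x" is not in this table, but it will exist in the data set
defA <- defDataAdd(varname = "v", formula = "3 + x", variance = 1)
dd <- genData(5, def)
dd <- addColumns(defA, dd)

# An invalid reference is accepted here ...
defBad <- defDataAdd(varname = "u", formula = "3 + notThere", variance = 1)

# ... and is expected to surface only once addColumns is called
resAdd <- tryCatch(
  addColumns(defBad, dd),
  error = function(e) conditionMessage(e)
)
print(resAdd)
```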

The same distinction exists between updateDef and updateDefAdd - and again has to do with validity checks.
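
A minimal sketch of how the two update functions pair with the two kinds of table, using made-up variable names. The failing updateDef call is an assumption based on the description above (the changed row is re-checked against earlier definitions); it is wrapped in tryCatch so the sketch runs either way.

```r
library(simstudy)

# A table built with defData is modified with updateDef
def <- defData(varname = "x", formula = 0, variance = 1)
def <- defData(def, varname = "y", formula = "2 * x", variance = 1)
def2 <- updateDef(def, changevar = "y", newformula = "5 + x")

# Assumed behaviour: a new formula that refers to an undefined variable
# is rejected when the definition table is updated
resUpd <- tryCatch(
  updateDef(def, changevar = "y", newformula = "5 + z"),
  error = function(e) conditionMessage(e)
)
print(resUpd)

# A table built with defDataAdd is modified with updateDefAdd;
# "x" is only resolved later, when addColumns is called
defA <- defDataAdd(varname = "v", formula = "3 + x", variance = 1)
defA2 <- updateDefAdd(defA, changevar = "v", newformula = "1 + 2 * x")
```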

I've never really been satisfied with this difference, and realize that it is confusing. We are thinking of changing this, but it will likely come at a cost of allowing users to create invalid definition tables in the case when a new table is created.

@DAVIDCRUZ0202
Author

Excellent clarification, thank you! Will keep these points in mind.

The documentation for updateDef and updateDefAdd already does a good job of directing users to one function or the other, based on whether defData or defDataAdd is used. I'm thinking at the very least, to include some of your notes above about validity checking in the python version of the docstrings, to help users understand this distinction.

Maybe something can be done to simplify this (e.g., combining defData with defDataAdd by simply adding an add argument to defData, and then performing those additional checks if add == TRUE), but that is just an idea!
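
As a rough illustration of that idea, here is a hypothetical wrapper. defDataEither and its add argument are not part of simstudy; the wrapper only dispatches to the two existing functions and does not change when the validity checks happen.

```r
library(simstudy)

# Hypothetical helper (not a simstudy function): one entry point, with an
# `add` flag that selects between defData and defDataAdd
defDataEither <- function(dtDefs = NULL, ..., add = FALSE) {
  if (add) {
    defDataAdd(dtDefs, ...)   # formulas are checked later, in addColumns
  } else {
    defData(dtDefs, ...)      # formulas are checked now
  }
}

d <- defDataEither(varname = "x", formula = 0, variance = 1)
a <- defDataEither(varname = "v", formula = "3 + x", variance = 1, add = TRUE)

dd <- addColumns(a, genData(4, d))
```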

To keep this conversation going, I won't close this thread yet; I'll post more information about how we handle this in the python port when ready👍🏽

@kgoldfeld
Owner

I totally missed that you were with the IBM/python team. The question makes much more sense. I am afraid you are going to find a lot of these weird things - this grew organically and not always entirely logically. Throw on top of that the fact that I am a mere statistician and not a professional software developer, and who knows what you will find? Fortunately, Jacob joined the effort to really improve things, so things are much better than they once were (though I still take blame for anything that is totally wacky).

Regarding your suggestion for defData and defDataAdd, we had considered that exact solution (among others). It may be the way to go.

Also, I was wondering if you want to open a Python porting-specific issue that you and your colleagues can use on an ongoing basis (is there more than one of you working on this?). You can decide, but it would make it more obvious to me where the questions are coming from.

@gabgilling

@kgoldfeld Using this thread to follow up with some questions (I'm also on the IBM team): we noticed that when calling addPeriods without specifying timevars, simstudy returns the original dataframe duplicated nPeriods times. Is that normal? Thank you!

@kgoldfeld
Owner

Yes - that is the intended outcome. You might want to do that if you follow up by generating a time-dependent outcome. Here is a simple example:

library(simstudy)

# Y is defined as a function of period, a column that addPeriods creates
defT <- defDataAdd(varname = "Y", formula = "5 + 2*period", variance = 1)

dtTrial <- genData(2)
dtTime <- addPeriods(dtTrial, nPeriods = 3)
dtTime
#>    id period timeID
#> 1:  1      0      1
#> 2:  1      1      2
#> 3:  1      2      3
#> 4:  2      0      4
#> 5:  2      1      5
#> 6:  2      2      6

dtTime <- addColumns(defT, dtTime)
dtTime
#>    id period timeID         Y
#> 1:  1      0      1  5.054995
#> 2:  1      1      2  6.756151
#> 3:  1      2      3  9.277415
#> 4:  2      0      4  5.389718
#> 5:  2      1      5  8.320501
#> 6:  2      2      6 10.314547
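
For contrast, here is a minimal sketch of the other usage, where timevars is specified. It assumes wide-format measurements named Y0, Y1 and Y2 (made-up definitions); addPeriods then stacks those columns into a single long-format column named by timevarName instead of only repeating the rows.

```r
library(simstudy)

# Three wide-format measurements per individual (illustrative definitions)
tdef <- defData(varname = "Y0", formula = 10, variance = 1)
tdef <- defData(tdef, varname = "Y1", formula = "Y0 + 5", variance = 1)
tdef <- defData(tdef, varname = "Y2", formula = "Y0 + 10", variance = 1)

dtWide <- genData(4, tdef)

# With timevars, the listed columns become one column (here "Y"),
# one row per individual per period
dtLong <- addPeriods(dtWide, nPeriods = 3, idvars = "id",
                     timevars = c("Y0", "Y1", "Y2"), timevarName = "Y")
dtLong
```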

Does that help clarify, or were you asking something else?

@gabgilling

@kgoldfeld that makes sense to me, thanks for your explanation! @DAVIDCRUZ0202 FYI
