/
extra_parameters.Rmd
358 lines (285 loc) · 10.6 KB
/
extra_parameters.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
title: "Extra parameters"
author: "Krystian Igras"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Extra parameters}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
eval = TRUE,
echo = TRUE, # echo code?
message = TRUE, # Show messages
warning = TRUE, # Show warnings
fig.width = 8, # Default plot width
fig.height = 6, # .... height
dpi = 200, # Plot resolution
fig.align = "center"
)
knitr::opts_chunk$set() # Figure alignment
library(DataFakeR)
set.seed(123)
options(tibble.width = Inf)
```
Most of the functionality offered by DataFakeR were described in the remaining vignettes.
Goal of this section is to present additional options offered by DataFakeR that can make your data simulating workflow much easier.
### Grouped simulation
It is quite common that values of the column are strongly related to the groups defined by the other column values.
Some of the examples may be the age of death that may differ between males and females.
In order to allow simulating values based on groups designed by the other columns, `group_by` parameter was introduced.
Let's see the below schema structure:
```
# schema-patient.yml
public:
tables:
patient:
columns:
treatment_id:
type: varchar
formula: !expr paste0(patient_id, line_number)
patient_id:
type: char(8)
line_number:
type: smallint
gender:
type: char(1)
values: [F, M]
biomarker:
type: numeric
range: [0, 1]
```
Let's simulate the data using the below definition:
```{r}
sch <- schema_source(system.file("extdata", "schema-patient.yml", package = "DataFakeR"))
sch <- schema_simulate(sch)
schema_get_table(sch, "patient")
```
Now we'd like to extend the definition following the below rules:
1. For each `patient_id` we usually want to have more than one row (it allows us to analyze the patient in multiple treatment stages).
2. For each `patient_id`, `line_number` is the sequence starting from 1 to the number of patient rows.
3. `gender` should be unique for each `patient_id` value.
4. `biomarker` value should `mean = 0.5` for females and `mean = 0.6` for males.
#### Rule no 1
Let's start with condition 1. We may simply achieve this by providing possible `patient_id` values with unique number less than target number of rows.
```
# schema-patient_2.yml
public:
tables:
patient:
columns:
treatment_id:
type: varchar
formula: !expr paste0(patient_id, line_number)
patient_id:
type: char(8)
values: [PTNT01ID, PTNT02ID, PTNT03ID]
line_number:
type: smallint
gender:
type: char(1)
values: [F, M]
biomarker:
type: numeric
range: [0, 1]
```
Let's simulate the data using the below definition:
```{r}
sch <- schema_update_source(sch, file = system.file("extdata", "schema-patient_2.yml", package = "DataFakeR"))
sch <- schema_simulate(sch)
schema_get_table(sch, "patient")
```
#### Rule no 2
Now, we'd like to make sure `line_number` is a sequence `1:n`, where `n` is the number of rows for each patient.
If we define `line_number` formula as `!expr 1:dplyr::n()`, the resulting column would be a sequence `1:n` where `n` is the number of table rows.
There is an obvious need to apply grouping by `patient_id`.
How can we achieve this in DataFakeR?
It is enough to add `group_by: patient_id` parameter to `line_number`.
Let's see it in action:
```
# schema-patient_3.yml
public:
tables:
patient:
columns:
treatment_id:
type: varchar
formula: !expr paste0(patient_id, line_number)
patient_id:
type: char(8)
values: [PTNT01ID, PTNT02ID, PTNT03ID]
line_number:
type: smallint
group_by: patient_id
formula: !expr 1:dplyr::n()
gender:
type: char(1)
values: [F, M]
biomarker:
type: numeric
range: [0, 1]
```
Looking at the dependency graph:
```{r}
sch <- schema_update_source(sch, file = system.file("extdata", "schema-patient_3.yml", package = "DataFakeR"))
schema_plot_deps(sch, "patient")
```
we can see, DataFakeR detected dependency between `patient_id` and `line_numer` column.
`line_number` was also created according to our needs:
```{r}
sch <- schema_simulate(sch)
schema_get_table(sch, "patient")
```
#### Rule no 3
Now let's assure gender is unique for each patient.
To achieve this, we'll have to group by `patient_id` and sample one value from `c("F", "M)` and repeat this with desired number of rows.
We may again use the formula, but for the sake of the example we'll define custom restricted method.
The method will be executed whenever `singular: true` parameter is provided to the column.
```
# schema-patient_4.yml
public:
tables:
patient:
columns:
treatment_id:
type: varchar
formula: !expr paste0(patient_id, line_number)
patient_id:
type: char(8)
values: [PTNT01ID, PTNT02ID, PTNT03ID]
line_number:
type: smallint
group_by: patient_id
formula: !expr 1:dplyr::n()
gender:
type: char(1)
values: [F, M]
singular: true
group_by: patient_id
biomarker:
type: numeric
range: [0, 1]
```
The method definition:
```{r}
singular_vals <- function(n, values, singular, ...) {
if (!missing(singular) && isTRUE(singular)) {
val <- sample(values, 1)
return(rep(val, n))
}
return(NULL)
}
```
and add it to schema options:
```{r}
my_opts = set_faker_opts(
opt_simul_restricted_character = opt_simul_restricted_character(
singular = singular_vals
)
)
```
Now we can start simulation:
```{r patient_deps}
sch <- schema_update_source(
sch,
file = system.file("extdata", "schema-patient_4.yml", package = "DataFakeR"),
my_opts
)
sch <- schema_simulate(sch)
schema_get_table(sch, "patient")
```
Voila!
#### Rule no 4
We'd like to sample from normal distribution with a specific mean dependent on the gender value.
Again we have a few options here. The simplest one would be to group `biomarker` by `gender` and use:
`formula: ifelse(gender == "F", rnorm(dplyr::n(), 0.5, 0.01), rnorm(dplyr::n(), 0.6, 0.01))`
but we'll do it using a special method instead.
The main issue we might have is how to access the group value from within the method.
We shouldn't be worried. DataFakeR automatically passes the value as a `group_val` parameter.
So let's try it out.
Let's define spec method named `dep_sampl`:
```
# schema-patient_5.yml
public:
tables:
patient:
columns:
treatment_id:
type: varchar
formula: !expr paste0(patient_id, line_number)
patient_id:
type: char(8)
values: [PTNT01ID, PTNT02ID, PTNT03ID]
line_number:
type: smallint
group_by: patient_id
formula: !expr 1:dplyr::n()
gender:
type: char(1)
values: [F, M]
singular: true
group_by: patient_id
biomarker:
type: numeric
range: [0, 1]
spec: dep_sampl
group_by: gender
```
Inside function definition let's add print line to see current `group_key` value:
```{r ptntwo_deps}
dep_sampl <- function(n, group_val, range, ...) {
print(group_val)
if (group_val == "M") {
pmax(pmin(rnorm(n, 0.6, 0.01), range[2]), range[1])
} else {
pmax(pmin(rnorm(n, 0.5, 0.01), range[2]), range[1])
}
}
my_opts = set_faker_opts(
opt_simul_spec_numeric = opt_simul_spec_numeric(
dep_sampl = dep_sampl
),
# don't forget option from previous case
opt_simul_restricted_character = opt_simul_restricted_character(
singular = singular_vals
)
)
sch <- schema_update_source(
sch,
file = system.file("extdata", "schema-patient_5.yml", package = "DataFakeR"),
my_opts
)
sch <- schema_simulate(sch)
schema_get_table(sch, "patient")
```
This way we achieved assumed goal.
### Ratio of NA values
Another extra parameter offered by DataFakeR is `na_ratio`.
Na ratio allows to precise the ratio of how many NA values should the column have.
For each simulation method the final sample is modified by [na_rand](../reference/sample_modifiers.html) function, that replaces desired ratio of values with `NA`s.
Default `na_ratio` value is `0.05` but can be easily overwritten by `opt_default_` configuration, or passed directly in column definition.
**Note** Whenever column have defined `not_null: true`, `na_rand` doesn't attach `NA` values to the sample.
### Ratio of column values
The last extra parameter offered by DataFakeR is `levels_ratio`.
Levels ratio allows to precise how many unique values should the column have.
For each simulation method (before the sample is modified by `na_rand`) the sample is modified by [levels_rand](../reference/sample_modifiers.html) function, that takes desired number of sample levels and resamples it using only provided levels.
Default `levels_ratio` is `1` but can be easily overwritten by `opt_default_` configuration, or passed directly in column definition.
**Note** Whenever column have defined `unique: true` or `levels_ratio: 1`, `levels_rand` doesn't modify the sample.
### Remaining parameters
There are a few parameters that can be configured to each column and be handled by
[all the simulation methods](simulation_methods.html).
Such parameters are:
- `values` - The parameter keeps possible values for the simulated column,
- `range` - Two-length parameter storing minimum and maximum value for simulated column (numeric, integer and date only),
- `precision` - Precision of numeric column values,
- `scale` - Scale of numeric column values,
- `min_date`, `max_date` - minimal and maximal values for simulating date columns (overwritten by `range` when specified),
- `format` - Date format of `min_date`, `max_date` and `range` in case of date columns (%Y-%m-%d by default),
- `nchar` - Maximum number of characters for simulating character column (10 by default).
Whenever any of the above parameters is defined as a parameter of column-type simulation method, such value can be used to
get more accurate result respecting the configuration.
For example default character simulation method ([simul_default_character](../reference/simulation_methods_character.html)) takes an advantage of `nchar` parameter, but special method for simulating names don't ([simul_spec_character_name](../reference/simulation_methods_character.html)).