---
title: "Data Processing Elements"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Data Processing Elements}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
eval = FALSE,
comment = "#>",
warning = FALSE,
message = FALSE,
echo = TRUE)
```
```{r, eval = TRUE, echo = FALSE}
source('datatables.R')
```
# What are the Data Processing Elements?
The Data Processing Elements (**DPE**) is a table that defines and documents
how harmonized datasets are generated. It is typically prepared as an Excel
spreadsheet. Each row indicates whether an input dataset can generate a
DataSchema variable and, if so, how the input variables are processed to
generate the harmonized variable defined in the DataSchema.
This page explains the basics of filling out the DPE so that Rmonize functions
can use it correctly.
# General structure of the DPE
The DPE is typically an Excel file that you open locally on your computer and
fill in row by row to define the harmonization rules.
It contains at least five mandatory columns, plus one additional column for your own documentation.
Harmonization cannot run if any of the mandatory columns is missing.
::: {}
`r DT_data_proc_elem_def`
:::
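
As a rough illustration, a DPE can also be represented as a data frame in R. The sketch below is a minimal example under stated assumptions: the column names follow what we understand to be the Rmonize naming convention, and every value (variable names, dataset name, algorithms) is hypothetical. The table above remains the authoritative definition of the columns.

```r
# All column names and values are assumptions for illustration only;
# check them against the column definitions in the table above.
dpe <- data.frame(
  dataschema_variable          = c("adm_unique_id", "sdc_sex"),
  input_dataset                = c("study_A", "study_A"),
  input_variables              = c("part_id", "gender"),
  `Mlstr_harmo::rule_category` = c("id_creation", "recode"),
  `Mlstr_harmo::algorithm`     = c("id_creation",
                                   "recode(1 = 0 ; 2 = 1 ; ELSE = NA)"),
  check.names      = FALSE,  # keep the '::' in the column names
  stringsAsFactors = FALSE
)

# Simple check that the assumed mandatory columns are all present.
required <- c("dataschema_variable", "input_dataset", "input_variables",
              "Mlstr_harmo::rule_category", "Mlstr_harmo::algorithm")
missing_cols <- setdiff(required, names(dpe))
length(missing_cols) == 0
```
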
[return to summary](#summary)
# Harmonization rules
<!-- Harmonization Rules Navigation -->
<div class="row">
<!-- Nav Buttons -->
<div class="col-md-3">
<ul class="nav nav-pills nav-stacked" role="tablist">
<li class="active">
<a href="#id-creation" role="tab" data-toggle="tab">
id_creation
</a>
</li>
<li><a href="#direct-mapping" role="tab" data-toggle="tab">
direct_mapping
</a>
</li>
<li>
<a href="#recode" role="tab" data-toggle="tab">
recode
</a>
</li>
<li>
<a href="#case-when" role="tab" data-toggle="tab">
case_when
</a>
</li>
<li>
<a href="#paste" role="tab" data-toggle="tab">
paste
</a>
</li>
<li>
<a href="#operation" role="tab" data-toggle="tab">
operation
</a>
</li>
<li>
<a href="#other" role="tab" data-toggle="tab">
other
</a>
</li>
<li>
<a href="#impossible-undertermined" role="tab" data-toggle="tab">
impossible<br>
undetermined<br>
\_\_BLANK\_\_
</a>
</li>
</ul>
</div>
<div class="col-md-9">
<!-- Nav Contents -->
<div class="tab-content">
<!-- ID Creation -->
<div class="tab-pane fade active in" id="id-creation">
**id_creation** is mandatory and is the first rule applied: it initiates the
harmonization process. This rule lets the user specify the column that
identifies each observation (row).
<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_id_creation`
</div>
<br>
Notes:
* Usually, the harmonized variable is a standardized identifier generated
from the input identifier.
* If the dataset has no identifier column, the user can create an index
*before harmonization* and provide this index as the variable to use.
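
For a dataset without an identifier, the index mentioned in the note can be created in one line of base R. This is only a sketch: the dataset and the column name `index` are hypothetical.

```r
# Hypothetical input dataset with no identifier column.
dataset <- data.frame(age = c(34, 51, 27), sex = c("F", "M", "F"))

# Add a row index to serve as the identifier ('index' is an illustrative name).
dataset$index <- seq_len(nrow(dataset))
dataset
```
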
</div>
<!-- Direct Mapping -->
<div class="tab-pane fade" id="direct-mapping">
The harmonized variable is generated by replicating one input variable.
<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_direct_mapping`
</div>
<br>
Note:
* One and only one variable can be replicated at a time.
</div>
<!-- Recode -->
<div class="tab-pane fade" id="recode">
The harmonized variable is generated by recoding values from one input variable.
<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_recode`
</div>
<br>
Notes:
* One and only one variable can be recoded at a time.
* The variable to be recoded must be categorical, at least in part.
To recode a continuous variable (for example, to create brackets),
use **case_when** instead.
* If every category is recoded to itself (for example `recode(1 = 1 ; 2 = 2)`),
use **direct_mapping** instead.
* Separate each value from its code with an equal sign **=** .
* Separate the elements with a semicolon **;** .
* Use **ELSE = NA** to assign NA to all other values.

If an equal sign already appears in the data, use **\_=** to escape it. Likewise,
if a semicolon already appears in the data, use **\_;** to escape it.
```
recode(
  "banana ; apple" = "fruits"  _;
  "salad ; potato" = "veggies" _;
  "bread ; pasta"  = "carbs" )

recode(
  "1000 (='high')" _= 3 ;
  " 500 (='mid')"  _= 2 ;
  " 200 (='low')"  _= 1 )
```
Several numerical values can be recoded to the same code by grouping them with R syntax.
```
recode(
0 = "low" ;
c(1:10) = "mid" ;
c(-7, -99) = NA )
```
If the recoding requires more complex codification, use **case_when** or **other** instead.
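
To make the effect of the grouped rule above concrete, the base R sketch below applies the same mapping to a hypothetical vector `var_x`. It is not how Rmonize implements **recode**; it only shows the intended result, and it assumes that unmatched values become NA, as they would with **ELSE = NA**.

```r
# Hypothetical input values; -7 and -99 stand for missing-value codes.
var_x <- c(0, 3, 10, -7, -99, 42)

# Start from NA everywhere (the ELSE = NA assumption), then fill each group.
harmonized <- rep(NA_character_, length(var_x))
harmonized[var_x == 0]              <- "low"
harmonized[var_x %in% 1:10]         <- "mid"
harmonized[var_x %in% c(-7, -99)]   <- NA_character_

harmonized
```
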
</div>
<!-- Case When -->
<div class="tab-pane fade" id="case-when">
The harmonized variable is generated from one or more if-else conditions,
using one or more input variables.
<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_case_when`
</div>
<br>
Notes:
* Several input variables can be combined in **case_when**. List them in
**input_variables**, separated by a semicolon **;** .
* If only one variable is used and it is (or seems to be) categorical,
use **recode** or **direct_mapping** instead.
* Any condition (equal to, greater than, is not, and so on) can be used in this function.
Separate each condition from its code with a tilde **~** .
* Separate the elements with a semicolon **;** .
* Use **ELSE ~ NA** to assign NA to all other values.

**case_when** is sensitive to data types. Every code produced by the statements
must have the same data type, including the NA.
```
case_when(
  var_x == 1                 ~ 1L ;
  var_x != 0 & !is.na(var_y) ~ 0L ;
  ELSE                       ~ NA_integer_ )

case_when(
  var_x == 1                 ~ "1" ;
  var_x != 0 & !is.na(var_y) ~ "0" ;
  ELSE                       ~ NA_character_ )
```
If the statement requires more complex codification, use **other** instead.
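
The first rule above can be traced by hand with a small base R sketch. It is an illustration of the intended logic only, not the Rmonize implementation: the input vectors are hypothetical, conditions are tested in order, and a condition that evaluates to NA is treated as not met.

```r
# Hypothetical input variables.
var_x <- c(1L, 2L, 0L, 3L, NA)
var_y <- c(5, NA, 2, 7, 1)

# Treat NA conditions as FALSE so that they fall through to the next statement.
is_true <- function(cond) !is.na(cond) & cond

cond_1 <- is_true(var_x == 1)
cond_2 <- is_true(var_x != 0 & !is.na(var_y))

# Every branch is an integer, including the NA, so the result is an integer vector.
harmonized <- ifelse(cond_1, 1L, ifelse(cond_2, 0L, NA_integer_))
harmonized
```

Keeping every branch the same type (here integer, with `NA_integer_`) mirrors the type constraint described above.
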
</div>
<!-- Paste -->
<div class="tab-pane fade" id="paste">
The harmonized variable is generated by assigning the same value to every observation; the value is not taken from an input variable.
<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_paste`
</div>
<br>
Notes:
* This rule does not require any input variable. The user must provide
**\_\_BLANK\_\_** as a placeholder.
* Usually, the harmonized variable is a standardized identifier for the whole
dossier, which is useful when the harmonized datasets are later aggregated
into a pooled harmonized dataset.
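
In plain R terms, the result of a **paste** rule is a constant column. The sketch below is only an illustration; the dataset and the value `"study_A"` are hypothetical.

```r
# Hypothetical harmonized dataset with four observations.
dataset <- data.frame(id = 1:4)

# The same value is assigned to every row, regardless of any input variable.
dataset$study_name <- rep("study_A", nrow(dataset))
dataset
```
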
</div>
<!-- Operation -->
<div class="tab-pane fade" id="operation">
The harmonized variable is generated by applying an operation to one or more input variables.
<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_operation`
</div>
<br>
Notes:
* Several input variables can be combined in an **operation**. List them in
**input_variables**, separated by a semicolon **;** .
* If the operation is (or seems) simple, prefer **case_when**, **recode** or **direct_mapping**
instead.
* Any package called in the operation must be installed (and loaded) on the
user's machine. To call a function from a specific package, use the
double colon **::** in the formula.
```
lubridate::year(var_x)
```
* If the operation requires more complex codification, use **other** instead.
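
The `package::function` form works for any installed package, including those shipped with R. As a sketch, the example below extracts the year from a hypothetical date variable using only the `base` package, which gives a result comparable to `lubridate::year(var_x)` without requiring lubridate to be installed.

```r
# Hypothetical input variable holding dates.
var_x <- as.Date(c("1998-03-14", "2005-11-02", NA))

# Namespace-qualified call: format the date as a four-digit year, then convert.
harmonized <- as.integer(base::format(var_x, "%Y"))
harmonized
```
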
</div>
<!-- Other -->
<div class="tab-pane fade" id="other">
The harmonized variable is generated from a non-standard or complex processing
rule, not covered by other rule categories.
<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_other`
</div>
<br>
Note:
* This rule is equivalent to running local code or a function from an R script.
If the code needs to modify the user's environment, use the
superassignment operator **<<-** to place the result in the user environment.
Take care to keep your environment under control when using the **other** rule.
```
my_harmo_var <- runif(20) + ... # complex lines of code
# superassignment (<<-) to modify the user environment.
harmonized_dossier$DATASET$variable_F <<- my_harmo_var
```
The **other** rule can also be used to source code from a separate script in which
complex harmonization processes are written.
```
source("my_file.R")
```
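
The behaviour of **<<-** can be checked in isolation with a small base R sketch. Here `run_other_rule()` is a hypothetical stand-in for code executed by the **other** rule, and the content of `harmonized_dossier` is made up; the point is only that the assignment made inside the function is visible outside it.

```r
# Hypothetical dossier living in the user's environment.
harmonized_dossier <- list(DATASET = data.frame(id = 1:3))

# Stand-in for code run by the 'other' rule (illustrative function name).
run_other_rule <- function() {
  my_harmo_var <- c(10, 20, 30)
  # '<<-' modifies the object found in the enclosing environment,
  # instead of creating a local copy inside the function.
  harmonized_dossier$DATASET$variable_F <<- my_harmo_var
  invisible(NULL)
}

run_other_rule()
harmonized_dossier$DATASET$variable_F
```

With a plain `<-` inside the function, `harmonized_dossier` in the user's environment would be left unchanged.
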
</div>
<!-- Impossible Undertermined __BLANK__ -->
<div class="tab-pane fade" id="impossible-undertermined">
These additional values let the user handle specific cases. They ensure that each row
is complete and that no argument is missing when the rule is processed.
`r DT_impundebla`
<br>
Notes:
* *\_\_BLANK\_\_* : Use when no input variable is needed to generate the harmonized variable
(for example with the rule category **paste** or **other**).
* *impossible* : Use when the research project did not collect the DataSchema variable,
when the collected data cannot be used to generate it, or when this is unknown.
* *undetermined* : Use when further investigation or future information is needed
to complete the harmonization. This lets the user continue without being blocked in
the process.
</div>
</div>
</div>
</div>
## Examples of Rule Categories
::: {}
`r DT_rule_categories`
:::