vignettes/b-Data-processing-elements.Rmd

---
title: "Data Processing Elements"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Processing Elements}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  echo = TRUE)

```


```{r, eval = TRUE, echo = FALSE}
source('datatables.R')

```


# What is the Data Processing Elements?

The Data Processing Elements (**DPE**) is a table that defines and documents 
information about the processing used to generate harmonized datasets and is 
typically prepared in an Excel spreadsheet. Each row indicates if an 
input dataset can generate a DataSchema variable, and if so, how input variables 
are processed to generate a harmonized variable as defined in the DataSchema. 
This page explains the basics of filling out the DPE so that it can be used 
correctly by Rmonize functions.

# General structure of the DPE 

The DPE is typically an Excel file that you open locally on your computer and
fill in one row at a time to define the harmonization rules. 
It contains at least 5 mandatory columns, plus one additional column for your documentation. 
The process cannot work if one of these columns is missing.

::: {} 
  `r DT_data_proc_elem_def` 
:::

[return to summary](#summary)


# Harmonization rules


<!-- Harmonization Rules Navigation -->
<div class="row">
  
  
<!-- Nav Buttons -->
<div class="col-md-3">
<ul class="nav nav-pills nav-stacked" role="tablist">
<li class="active">
<a href="#id-creation" role="tab" data-toggle="tab">
id_creation
</a>
</li>
<li><a href="#direct-mapping" role="tab" data-toggle="tab">
direct_mapping
</a>
</li>
<li>
<a href="#recode" role="tab" data-toggle="tab">
recode
</a>
</li>
<li>
<a href="#case-when" role="tab" data-toggle="tab">
case_when
</a>
</li>
<li>
<a href="#paste" role="tab" data-toggle="tab">
paste
</a>
</li>
<li>
<a href="#operation" role="tab" data-toggle="tab">
operation
</a>
</li>
<li>
<a href="#other" role="tab" data-toggle="tab">
other
</a>
</li>
<li>
<a href="#impossible-undertermined" role="tab" data-toggle="tab">
impossible<br>
undetermined<br>
\_\_BLANK\_\_
</a>
</li>
</ul>
</div>

<div class="col-md-9">
<!-- Nav Contents -->
<div class="tab-content">
<!-- ID Creation -->
<div class="tab-pane fade active in" id="id-creation">
**id_creation** is mandatory and is the first rule, which initiates the 
harmonization process. This rule lets the user specify the column used as 
the reference identifier for each observation (row).


<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_id_creation`
</div>

<br>

Notes:

* Usually, the harmonized variable is a standardized identifier generated
from the input identifier.

* If the dataset does not have any identifier column, the user can create
an index *before harmonization* and provide this index as the variable to use.
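
A minimal sketch of the second note, assuming a hypothetical data frame `dataset` that has no identifier column. The column name `index` is illustrative and is not a name required by Rmonize:

```r
# Hypothetical input dataset with no identifier column.
dataset <- data.frame(
  age = c(34, 51, 47),
  sex = c("F", "M", "F")
)

# Create a row index before harmonization (assumed column name: index).
dataset$index <- seq_len(nrow(dataset))

# Put the new identifier first so it is easy to spot.
dataset <- dataset[, c("index", "age", "sex")]
```

The `index` column can then be given as the input variable of the **id_creation** rule.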

</div>


<!-- Direct Mapping -->
<div class="tab-pane fade" id="direct-mapping">
The harmonized variable is generated by replicating one input variable.

<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_direct_mapping`
</div>

<br>

Note:

* One and only one variable can be replicated at a time.


</div>
<!-- Recode -->            
<div class="tab-pane fade" id="recode">
The harmonized variable is generated by recoding values from one input variable.

<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_recode`
</div>

<br>

Notes:

* One and only one variable can be recoded at a time.

* The variable to be recoded must be (at least partially) a categorical
variable. To recode a continuous variable (to create brackets, for example),
use **case_when** instead.

* If all categories are recoded to the same categories (recode(1 = 1 ; 2 = 2)),
prefer **direct_mapping** instead.

* Separate each value/code pair with an equal sign **=**

* Separate each element with a semicolon **;**

* Use **ELSE = NA** to attribute NA to all of the other values.


If an equal sign already exists in the data, use **\_=** to escape it. Likewise,
if a semicolon already exists in the data, use **\_;** to escape it.

```
recode(
"banana ; apple"  = "fruits"    _;
"salad  ; potatoe" = "veggies"  _;
"bread  ; pasta"   = "carbs"        )

recode(
"1000 (='high')"  _= 3  ;
" 500 (='mid')"   _= 2  ;
" 200 (='low')"   _= 1     )

```

Multiple numerical values can be grouped using R syntax and recoded together.

```
recode(
0            = "low"   ;
c(1:10)      = "mid"   ;
c(-7, -99)   = NA    )

```
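
To make the intended result of the grouped recode above concrete, here is a base R sketch of the same mapping. It only illustrates the outcome and is not how Rmonize evaluates the rule; the vector `input` and its values are hypothetical:

```r
# Hypothetical input values; -7 and -99 stand for missing-value codes.
input <- c(0, 3, 10, -7, -99, 5)

# Start from NA, then fill in each group of codes.
harmonized <- rep(NA_character_, length(input))
harmonized[input == 0]             <- "low"
harmonized[input %in% 1:10]        <- "mid"
harmonized[input %in% c(-7, -99)]  <- NA_character_

harmonized
```

Every value from 1 to 10 ends up in the same category, and both missing-value codes become NA.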

If the recoding requires more complex codification, use **case_when** or **other** instead.                              
</div>
<!-- Case When -->
<div class="tab-pane fade" id="case-when">
The harmonized variable is generated from one or more if-else conditions, 
using one or more input variables.

<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_case_when`
</div>

<br>

Notes:

* Multiple variables can be combined using case_when. Separate
each of them in **input_variables** with a semicolon **;**

* If only one variable is used, and it is (or seems to be) a categorical
variable, use **recode** or **direct_mapping** instead.

* Any statement ("if ... equals, is greater than, is not, ...") can be used in this function.
Separate the statement from its code with a tilde **~**

* Separate each element with a semicolon **;**

* Use **ELSE ~ NA** to attribute NA to all of the other values.

**case_when** is sensitive to the data type. Every code generated by the statements
must have the same data type, including the NA (for example, **NA_integer_** with
integer codes and **NA_character_** with character codes).


```
case_when(
var_x == 1                 ~ 1L  ;
var_x != 0 & !is.na(var_y) ~ 0L  ;
ELSE                       ~ NA_integer_  )

case_when(
var_x == 1                 ~ "1"  ;
var_x != 0 & !is.na(var_y) ~ "0"  ;
ELSE                       ~ NA_character_)

```
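
The type sensitivity can be reproduced directly with `dplyr::case_when()`. The sketch below assumes that the DPE keyword `ELSE` plays the role of dplyr's catch-all `TRUE` condition and that elements are separated by commas in plain R; `var_x` and `var_y` are hypothetical vectors:

```r
library(dplyr)

# Hypothetical input variables.
var_x <- c(1, 2, 0, NA)
var_y <- c(NA, 5, 7, 3)

# All right-hand sides are integers, including the typed NA.
harmonized <- case_when(
  var_x == 1                 ~ 1L,
  var_x != 0 & !is.na(var_y) ~ 0L,
  TRUE                       ~ NA_integer_
)

# Mixing an integer code with a character code raises an error,
# which is caught here and replaced by a short message.
mixed <- tryCatch(
  case_when(var_x == 1 ~ 1L, TRUE ~ "other"),
  error = function(e) "type error"
)
```

Rows where no statement is satisfied, including rows where `var_x` is missing, fall through to the catch-all and receive NA.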

If the statement requires more complex codification, use **other** instead.                
</div>
<!-- Paste -->
<div class="tab-pane fade" id="paste">
The harmonized variable is generated by setting the same value for all observations, rather than taking it from an input variable.

<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_paste`
</div>

<br>

Notes:

* This function does not require any variable. The user must provide 
**\_\_BLANK\_\_** as a placeholder.

* Usually, the harmonized variable is a standardized identifier for the whole 
dossier, used when the time comes to aggregate the harmonized datasets into a 
pooled harmonized dataset.
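
The base R sketch below shows why a constant value is useful at pooling time. It does not use Rmonize; the datasets, the `study` column, and its values are all hypothetical:

```r
# Two hypothetical harmonized datasets from different studies.
dataset_a <- data.frame(id = 1:2, age = c(34, 51))
dataset_b <- data.frame(id = 1:3, age = c(47, 29, 62))

# Set the same value for every observation of each dataset.
dataset_a$study <- "STUDY_A"
dataset_b$study <- "STUDY_B"

# After pooling, the constant column tells the sources apart.
pooled <- rbind(dataset_a, dataset_b)
```

Without the `study` column, the rows with `id` 1 and 2 from each dataset would be indistinguishable in `pooled`.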


</div>
<!-- Operation -->
<div class="tab-pane fade" id="operation">
The harmonized variable is generated by applying an operation to one or more input variables.

<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_operation`
</div>

<br>

Notes: 

* Multiple input variables can be combined in a single operation. Separate
each of them in **input_variables** with a semicolon **;**

* If the operation is (or seems) simple, prefer **case_when**, **recode** or **direct_mapping**
instead.

* Any package called in the **operation** script must be installed on the user's
machine (and loaded). To specify which package a function comes from, use the
double colon **::** in the formula.

```
lubridate::year(var_x)

```

* If the operation requires more complex codification, use **other** instead.
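
As a rough illustration of the kind of expression an operation can contain, the base R sketch below combines two input variables and uses the `::` notation to name the package explicitly. The data frame and its column names are hypothetical:

```r
# Hypothetical input dataset with two input variables.
dataset <- data.frame(
  weight_kg = c(70, 82.5),
  height_cm = c(175, 180)
)

# Operation combining both variables: body mass index.
bmi <- with(dataset, weight_kg / (height_cm / 100)^2)

# The :: notation states which package the function comes from.
bmi_rounded <- base::round(bmi, 1)
```

The same `package::function()` pattern applies to any installed package, as in the `lubridate::year(var_x)` call shown above.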
</div>            

<!-- Other -->
<div class="tab-pane fade" id="other">

The harmonized variable is generated from a non-standard or complex processing 
rule, not covered by other rule categories.

<div style="width: 70%; margin: 0 auto; display: flex; justify-content: center;">
`r DT_other`
</div>

<br>

Note:

* This feature is equivalent to running local code or a function in an R script. 

If an assignment needs to modify the user's environment, use the
double assignment operator **<<-** to place the result in the user environment. Take
care to control your environment when using the **other** function.


```
my_harmo_var <- runif(20) + ... # complex lines of code

# double assignment to modify the environment.
harmonized_dossier$DATASET$variable_F <<- my_harmo_var

```
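
The difference between `<-` and `<<-` can be seen in plain R, independently of Rmonize. In this sketch, `counter` and both helper functions are hypothetical names:

```r
counter <- 0L

# `<-` creates a local copy inside the function; the outer value is untouched.
bump_local <- function() {
  counter <- counter + 1L
  invisible(NULL)
}

# `<<-` looks for `counter` in the enclosing environment and modifies it there.
bump_outer <- function() {
  counter <<- counter + 1L
  invisible(NULL)
}

bump_local()
after_local <- counter   # unchanged by the local assignment

bump_outer()
after_outer <- counter   # updated by the double assignment
```

Because `<<-` changes objects outside the code being run, it can silently overwrite existing objects, which is why the environment should be controlled carefully.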

The **other** function can also be used to source code from a separate script in which complex
harmonization processes are written.

```
source("my_file.R")

```

</div>
<!-- Impossible Undertermined __BLANK__ -->
<div class="tab-pane fade" id="impossible-undertermined">
These additional features allow the user to handle specific cases. They ensure that the row 
is complete and that no argument is missing from the function to be run.

`r DT_impundebla`

<br>

Notes:

* *\_\_BLANK\_\_*: Use when no input variable is needed to generate the harmonized variable 
(for example with the rule category **paste** or **other**).

* *impossible*: Use when the research project did not collect the DataSchema variable, 
when its data cannot be used to generate the DataSchema variable, or when this is unknown.

* *undetermined*: Use when harmonization needs further investigation or information
that is not yet available. The row can then be filled in without blocking
the process.


</div>
</div>
</div>
</div>


## Examples of Rule Categories


::: {} 
  `r DT_rule_categories` 
:::