Skip to content

Advanced_Configurations

Helene edited this page Jul 19, 2023 · 9 revisions

Advanced Configurations

Configure Database Connections

The configuration of databases, be it CSV files or SQL-based databases, is done with environment variables, which can be set using the base R command Sys.setenv().

A detailed description, which environment variables need to be set for the specific databases can be found here.

For convenience, users can define multiple databases inside a text file (similar to Docker .env-files), which can be called in R before executing the DQA tool functions using this command, making all included environment variables available:

DIZutils::set_env_vars(
    env_file = "path/to/environmen/configuraions/.env"
)

Testing Subsets by Time Restrictions

As a feature of this tool, we have implemented a possibility to only test a subset of the whole database by filtering the data using database timestamp columns. On the one hand, this can drastically shorten the analysis time and on the other hand it allows to test databases that would not fit into memory.

However, some 'engineering' needs to be done before being able to use this feature. Currently, two different ways of connecting database timestamp columns are implemented with both being in general equally capable of doing the job - so it is up to the user, which way to go in her/his current situation.

Please note that for all dataelements of one database the same way of connecting database timestamps needs to be chosen!!!

Using Database Timestamps: 1) Temporal Views

By default, database timestamp columns are connected using temporal views. However, this works only when the column specified in the MDR field 'restricting_date_var' is present in the same table, which is specified in the MDR field 'source_system_table' for the respective dataelement. If this is not the case, or your SQL statements to query the dataelements are too complex, it is highly recommended to connect timestamp columns with the second approach using subselect modifications (see below).

Otherwise, you could define the SQL statement in a manner that allows to create the temporal view by the use of a timestamp column from another table. This can be achieved by using a JOIN statement and modifying the 'source_table_name' field in the MDR. The following example aims to clarify this process:

-- In the i2b2 database, the dataelement 'birth_date' is only available
-- in the table 'patient_dimension':
SELECT
    patient_num,
    birth_date::date
FROM
    i2b2.patient_dimension;
-- However, this table does not contain any information on the admission dates
-- of the respective patients. Hence, to filter the data by admission dates,
-- we need information on the admission start date, which, however, is
-- available from another table, e.g. 'visit_dimension'.

-- To be able to apply the DQA tool time restrictions to 'birth_date', the
-- corresponding SQL statement needs to be defined as follows:
SELECT
    mn.patient_num,
    jn.birth_date::date
FROM
    i2b2.visit_dimension AS mn
JOIN
    i2b2.patient_dimension AS jn
    ON mn.patient_num = jn.patient_num;

Finally, to make thinks working, the MDR entries for the dataelement 'birth_date' in the i2b2 database need to be changed as follows:

  • source_system_table --> "i2b2miracum.patient_dimension"
  • restricting_date_var --> "start_date"

This allows the DQA tool to create a temporal view for the dataelement 'birth_date' using the 'start_date' from the table 'visit_dimension'.

Using Database Timestamps: 2) Subselect-Modifications

To use subset modifications for connecting database timestamp columns, a more sophisticated formatting of the SQL statements is necessary. However, this approach comes with more flexibility, e.g. when the timestamp column for filtering your dataelement is not in the same table as the dataelement and therefore needs to be joined from another table inside the database.

For this approach, the FROM-statements is expected to be a subselect where the timestamp filtering can be performed. Inside this subselect, a timestamp column needs to be selected and named with an alias. This alias of the timestamp column name has also to be put to the MDR field 'restricting_date_var' for the respective dataelement(s). Furthermore, the alias of the subselect must always be named 'r_intermediate' and must be assigned in the SQL using 'AS':

SELECT
    r1.pid AS "Person.Patient.Patienten-Identifikator.Patienten-Identifikator",
    to_timestamp(jsonb_path_query(r1.jsonbdata, '$.period') ->> 'start', 'YYYY-MM-DDTHH:MI:SS') AS "Fall.Einrichtungskontakt.Beginndatum"
FROM 
-- SUBSELECT BEGIN
(SELECT * FROM (
SELECT
    DATA AS jsonbdata,
    REPLACE(jsonb_path_query(DATA, '$.subject') ->> 'reference', 'Patient/', '') AS pid,
    to_timestamp(jsonb_path_query(DATA, '$.period') ->> 'start', 'YYYY-MM-DDTHH:MI:SS') AS fhir_start_date
FROM resources
WHERE TYPE = 'Encounter')
-- SUBSELECT END
AS r_intermediate ) r1;

In the DQA tool, the part 'AS r_intermediate' of the subselect will then be replaced with the time restrictions the DQA tool was parameterized with in the following manner:

restricting_date_var <- rv$mdr[
    get("key") == i &
    get("source_system_name") == db_name,
    get("restricting_date_var")
]
replace_string <- paste0(
    "AS r_intermediate WHERE r_intermediate.",
    restricting_date_var, " >= '",
    rv$restricting_date$start,
    "' AND r_intermediate.",
    restricting_date_var, " <= '",
    rv$restricting_date$end,
    "' "
)
sql <- gsub("AS r_intermediate", replace_string, sql_statements[[i]])

In order to enable the 'Subselect-Modifications'-functionality for an SQL database, the environment variable {db_name}_SQLMODIFY = TRUE must be set.

In case the database type is Oracle, "AS r_intermediate" is replaced with "r_intermediate". This is necessary, because Oracle is not accepting AS for table aliasing.

R package DIZutils

Generic functions e.g. for logging purposes or database connections of the packages DQAstats and DQAgui have been outsourced to the R package DIZutils in order to be more easily maintained and integrable into other R packages.

Thus, if developers want to add interfaces for new database types, the basic implementation takes place in DIZutils and only the DQA tool specific adoptions are implemented in the downstream packages, such as the logic to call the new database interface in DQAstats or the definition of a new Config-Tab for DQAgui.