
Metadata Repository

The metadata repository (MDR) is the core part of the DQAstats R-package and therefore receives its own section within this documentation.

Using this MDR, the DQA tool can be configured to test the data quality of datasets that are held in either CSV files or SQL-databases.

For the currently supported input data formats and database systems, please refer to the section Use the DQA tool.

By default, the MDR itself is stored in a CSV file, i.e. the CSV file is the MDR. Each column of the CSV table represents a specific MDR field and each row describes one distinct data element in one database.

The MDR allows the definition of a set of data elements for multiple databases (a 'database' being either a CSV file or an SQL-based database). This makes it possible to compare the data quality of this set of data elements between two databases, regardless of their data format.

The concepts of how the connection of the DQA tool to the MDR works in detail are described in our publication "Linking a Consortium-Wide Data Quality Assessment Tool with the MIRACUM Metadata Repository".

MDR: Overview

The following table gives an overview of the available MDR fields and their purpose:

| MDR Field | Description |
| --- | --- |
| designation | The data element's "speaking" term (short). |
| definition | A (longer) description of the data element. Line breaks ('\n') are allowed. |
| variable_type | The data type of the data element. |
| variable_name | The DQA tool's internal (machine-processable) variable name. Internally, corresponding data elements across different databases are renamed to this variable_name, which should not contain spaces or special characters other than underscores ("_"). |
| key | A unique DQA tool-specific identifier key, which is used to (1) compare data elements across different databases and (2) load the corresponding SQL statements from a JSON file if the data source is a database (please refer to the section MDR: SQL statements for further information). |
| source_variable_name | The name of the data element in the database (e.g. the column name in a CSV table or database table). |
| source_table_name | The name of the table in which the data element is stored (the filename in the case of CSV files or the table name in the case of databases). |
| source_system_name | The name of the database from which the data element originates. It has to uniquely identify one specific database and, in the case of analyzing SQL databases, the value needs to correspond to the filename suffix of the JSON file containing the SQL statements for the data elements of the respective database (please refer to MDR: SQL statements for further details). |
| source_system_type | The type of the database. For supported system types please refer to the section Use the DQA tool. |
| constraints | A JSON object describing the allowed values for the corresponding data element. |
| filter | |
| data_map | |
| plausibility_relation | A JSON object describing plausibility checks for the corresponding data element (please refer to MDR: plausibility relations for further details). |
| dqa_assessment | A binary indicator of whether the information in this specific row is eligible for data quality assessment (dqa_assessment = 1) or not (dqa_assessment = 0). |
| restricting_date_var | An optional field to specify another column from source_table_name of data type datetime in order to filter the data by time constraints. |
| restricting_date_format | |

MDR: Minimum Set of Required MDR Fields

For the DQA tool to function seamlessly, at least the following MDR fields need to be defined for each data element:

  • designation
  • variable_type
  • variable_name
  • source_variable_name
  • source_table_name
  • source_system_name
  • source_system_type
  • key
  • dqa_assessment
  • restricting_date_var
  • restricting_date_format
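
To illustrate how these fields map onto the columns of the mdr.csv, the following excerpt sketches two purely hypothetical data elements of a CSV-based source system; all values (and the use of a comma as delimiter) are illustrative only and not taken from an actual MDR:

```csv
designation,variable_type,variable_name,source_variable_name,source_table_name,source_system_name,source_system_type,key,dqa_assessment,restricting_date_var,restricting_date_format
Patient identifier,string,pat_id,PATIENT_ID,patients.csv,exampledb,csv,dqa_pat_id,1,,
Date of birth,datetime,birth_date,BIRTH_DATE,patients.csv,exampledb,csv,dqa_birth_date,1,,
```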

MDR: Variable Types and Constraints

| variable_type | Description | Constraint |
| --- | --- | --- |
| integer / float | The possible variable types for continuous variables are 'integer' and 'float'. Constraints are defined using a single-key JSON object with the key "range", whose value is another JSON object containing the keys "min", "max" and "unit" (see Value Constraints: 'range' below). | range: min, max, unit |
| enumerated | Discrete variables with a limited number of categories (approx. < 20-25) should be declared as 'enumerated' in order to obtain frequency counts for the respective categories. What counts as a 'limited' number of categories is not further specified; however, creating a 'value_set' for a large number of categories is usually not feasible. In such cases, the variable type 'string' should be chosen instead, which allows constraints to be tested with regular expressions. Constraints are defined using a JSON object with the key "value_set", whose value lists the allowed values (see Value Constraints: 'value_set' below). | value_set |
| string | Discrete variables with a large number of categories, or for which testing against a specific value set is not feasible, can be declared as the variable type 'string'. Constraints are defined using a JSON object with the key "regex", whose value is a string with the respective regular expression (see Value Constraints: 'regex' below). | regex |
| datetime | The data type for datetime variables. | N/A (testing '<= today' by default) |

In general, the definition of value constraints is optional: the tool works even if no constraint is defined in the MDR field 'constraints' for a given data element.

More detail on the topic 'constraints' for performing value conformance checks can be found in the background section of this documentation.

Details on the implementation of these conformance checks into the DQA tool can be found in our publication.

Value Constraints: 'range'

Constraints of data elements with the variable type 'integer' / 'float' are to be defined as follows:

{
  "range": {
    "min": 0,
    "max": 120,
    "unit": "years"
  }
}

Note that the 'unit' value is currently only informative and is displayed along with the data quality results.
No value transformations are performed between the values of a certain data element from different databases. These must be done beforehand when defining the SQL statements for that data element.

Value Constraints: 'datetime'

Constraints of data elements with the variable type 'datetime' are to be defined as follows:

{
  "datetime": {
    "min": "1950-01-01",
    "max": "1989-12-31",
    "format": "%d.%m.%Y"
  }
}

If no constraint is defined in the MDR for 'datetime' variables, a 'no future dates allowed' heuristic is applied, which checks whether the data element contains datetime values that lie in the future relative to the date on which the data quality analysis is performed.

Value Constraints: 'value_set'

Constraints of data elements with the variable type 'enumerated' are to be defined as follows:

// value-set of a data element: array with the allowed values A-G
{
  "value_set": ["A", "B", "C", "D", "E", "F", "G"]
}

Value Constraints: 'regex'

Constraints of data elements with the variable type 'string' are to be defined as follows:

// regular expression to test the formatting of OPS procedure codes
{
  "regex": "^([[:digit:]]{1})(\\-)?([[:digit:]]{2})([[:lower:]]{1}|([[:digit:]]{1}))((\\.)?([[:alnum:]]){1,2})?$"
}

💡 For convenience, you can build your own Python scripts to write these constraints to the mdr.csv.
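
A minimal sketch of such a helper script is shown below; it assumes that the columns 'variable_name' and 'constraints' already exist in a comma-separated mdr.csv in the working directory, and the element names and constraint values are made up for this example:

```python
import csv
import json

# Hypothetical constraint definitions, keyed by the MDR field 'variable_name'.
constraints = {
    "age_in_years": {"range": {"min": 0, "max": 120, "unit": "years"}},
    "blood_group": {"value_set": ["A", "B", "AB", "0"]},
}

# Read the existing MDR into memory.
with open("mdr.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

# Store each constraint as a JSON string inside the 'constraints' cell of its row.
for row in rows:
    definition = constraints.get(row["variable_name"])
    if definition is not None:
        row["constraints"] = json.dumps(definition)

# Write the updated MDR back to disk.
with open("mdr.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```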

MDR: SQL Statements

Organizing SQL Statements

When testing SQL-databases with the DQA tool, it is necessary for the tool to find the corresponding SQL statements for the data elements of each database. The mapping between the information of the MDR regarding a set of data elements and the corresponding set of SQL statements is done via the value of the MDR-field 'source_system_name' in the following manner:

  • For each data element to be tested by the DQA tool, an SQL statement needs to be defined. For each database, the set of SQL statements is then stored as key-value-pairs in one single JSON file inside the SQL-folder.

    • For each set of SQL statements belonging to the data elements of a database, it is very important that the filename of the JSON file contains the 'source_system_name' according to the pattern SQL_{...}.JSON, where {...} is the respective 'source_system_name' of the database, which is also used in the mdr.csv (to ensure the correct mapping, as described above).

      Example:  
      If the 'source_system_name' is i2b2, the JSON file
      containing the SQL-statements needs to be named  
      
      > `SQL/SQL_i2b2.JSON`  
      
      to be correctly identified by the DQA tool as the
      SQL-containing JSON file for the 'i2b2'-database.
  • Each of the 'keys' inside the JSON-file corresponds to one data element of the database. It must match the respective 'key'-value from the mdr.csv.
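
For illustration, such a JSON file could look like the following sketch; the keys 'dqa_birth_date' and 'dqa_sex' as well as the column aliases are made up for this example and must of course match the 'key' and 'variable_name' values of your own mdr.csv:

```json
{
  "dqa_birth_date": "SELECT patient_num AS \"Patienten-Identifikator\", birth_date::date AS \"Geburtsdatum\" FROM i2b2miracum.patient_dimension",
  "dqa_sex": "SELECT patient_num AS \"Patienten-Identifikator\", sex_cd AS \"Geschlecht\" FROM i2b2miracum.patient_dimension"
}
```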

💡 For convenience, you can build your own Python scripts to create these 'SQL' JSON files.
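
A minimal sketch of such a generator script is shown below; the folder layout under `sql_src/`, the file-per-key convention, and the chosen 'source_system_name' are assumptions for this example, not requirements of DQAstats:

```python
import json
from pathlib import Path

# 'source_system_name' as used in the mdr.csv; it determines the file name SQL_{...}.JSON.
source_system_name = "i2b2"

# Hypothetical layout: one plain-text SQL file per data element, named after the MDR 'key',
# e.g. sql_src/i2b2/dqa_birth_date.sql. The file stem becomes the key in the JSON file.
src_dir = Path("sql_src") / source_system_name
sql_statements = {
    sql_file.stem: sql_file.read_text(encoding="utf-8").strip()
    for sql_file in sorted(src_dir.glob("*.sql"))
}

# Write the key-value pairs to SQL/SQL_<source_system_name>.JSON.
out_dir = Path("SQL")
out_dir.mkdir(exist_ok=True)
out_file = out_dir / f"SQL_{source_system_name}.JSON"
out_file.write_text(json.dumps(sql_statements, indent=2), encoding="utf-8")
```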

Formatting of SQL Statements

Two rules must be followed for defining the SQL statements in order to ensure a correct functioning of the DQA tool:

  1. SQL statements must return a two-column table, or a two-column table with an additional TIMESTAMP column, with the following structure:

    • one column being the data element to test
    • the second column being a corresponding key, which sets the data element into a meaningful context (e.g. patient ID, encounter ID, ...)
    • optional: a column called "TIMESTAMP". This is required if you want to use the difference-check feature. You are free to choose any timestamp present in your data as long as it exists in both the source and target database. If your ETL process involves time-zone conversion, ensure that your SQL reflects this conversion.
    SELECT
      patient_num AS \"Patienten-Identifikator\",
      birth_date::date AS \"Geburtsdatum\"
    FROM
      i2b2miracum.patient_dimension
  2. Rename the columns of the SQL result table for each data element according to the values specified in the respective MDR field 'variable_name'.

    • please refer to the description in MDR overview for further information

💡 The DQA tool is able to perform analyses on a subset of the whole dataset by filtering the data using datetime restrictions. More detail on this feature can be found in the Advanced Configurations - Testing Subsets by Time restrictions section.

MDR: Plausibility Relations

With plausibility relations, the values of one variable are set into the context of another variable. To enable the DQA tool to carry out such tests, a JSON notation was developed for defining plausibility statements and having them evaluated automatically by the DQA tool.

If a plausibility check is defined for a data element in the MDR and all required data elements are available for the respective database(s), the plausibility check is automatically evaluated by the DQA tool. Similar to the implementation of 'Constraints', 'Plausibility statements' must be defined in the MDR using JSON objects.

Currently, two types of plausibility statements can be defined to be used with the DQA tool: uniqueness plausibility and atemporal plausibility statements.
An exemplary plausibility statement for each type is provided in the appendix section of our publication.

Uniqueness Plausibility

The template for a uniqueness plausibility statement is as follows:

{
  "uniqueness": {
    "variable2": {
      "name": "Arbitrary Item Name 1",
      "description": "Description 1: With each distinct value of variable1,\
      only one value of variable2 may be associated."
    }
  }
}

The first key of the JSON object indicates that this is a "uniqueness" plausibility statement definition. The keys of the second level (here only one) are the internal DQA tool variable names (MDR field 'variable_name') that create the context for the data element to be evaluated.

In order to have this statement evaluated correctly by the DQA tool, it has to be placed in the MDR field 'plausibility_relation' of the data element with 'variable_name' = "variable2".

Atemporal Plausibility

The template for an atemporal plausibility statement is as follows:

{
  "atemporal": {
    "code1": {
      "name": "Arbitrary Item Name 1",
      "description": "Values 80-89 of dataelement 'code1' are \
      only allowed for entries with variable1 = A",
      "filter": {
        "database1": "^8[0-9]"
      },
      "join_crit": "variable2",
      "constraints": {
        "value_set": {
          "database1": "A"
        }
      }
    }
  }
}

The first key of the JSON object indicates that this is an "atemporal" plausibility statement definition. The keys of the second level (here only one) are the internal DQA tool variable names (MDR field 'variable_name') that create the context for the data element to be evaluated.

For each of these context variable names, the JSON keys shown above must be defined. The 'filter' key is optional and only needs to be defined when merely a subset of the values present in the variable 'code1' is required for the plausibility relation. The 'constraints' key can be used to apply value conformance checks to the filtered results.

To understand the 'join_crit' key, one needs to be aware of the two-column data structure of the data elements inside the DQA tool (see Formatting of SQL Statements above). If no table is available that holds both columns 'code1' and 'variable1', the 'join_crit' key can be used as an intermediate step for connecting/joining these tables. In that case, a two-column table with 'code1' and 'variable2' and another table with the mappings between 'variable1' and 'variable2' need to be present in the DQA tool.

💡 For convenience, you can build your own Python scripts to write these plausibility relations to the mdr.csv.