Carolina Schwedhelm edited this page Dec 5, 2023 · 13 revisions

Data harmonisation protocol for pilot studies in Use Case 5.1 ‘Nutritional Epidemiology’ and 5.2 ‘Epidemiology of Chronic diseases’


NFDI4Health (National Research Data Infrastructure for personal health data)

T5.1 “Use case ‘Nutritional epidemiology’”, T5.2 “Use case ‘Epidemiology of chronic diseases’”

DFG project number: 442326535

Authors and affiliations:

Carolina Schwedhelm, Katharina Nimptsch, Tobias Pischon; Molecular Epidemiology Research Group, Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC)

Franziska Jannasch, Matthias Schulze; Department of Molecular Epidemiology, German Institute of Human Nutrition Potsdam-Rehbrücke (DIfE)

Ines Perrar, Ute Nöthlings; Institute of Nutritional and Food Sciences, Nutritional Epidemiology, University of Bonn

Contact person: Franziska Jannasch (franziska.jannasch@dife.de)

Version 1.0

Date: 22.05.2023

The work presented herein was made possible through the collaboration with Maelstrom Research (www.maelstrom-research.org) and their data harmonisation approach and tools.

Introduction

The purpose of this document is to guide the data harmonisation for the pilot projects in the NFDI4Health Use Cases 5.1 “Nutritional epidemiology” and 5.2 “Epidemiology of chronic diseases”. It explains in detail the steps that require action by NFDI4Health (Use Cases 5.1 and 5.2) and by the Data Holding Organisations (DHOs).

Data handling and research data management are becoming increasingly important. Data harmonisation is a central element in making research data FAIR (findable, accessible, interoperable, reusable) and thus in improving the data quality of studies. Especially for pooled analyses in collaborative research programmes and projects, data harmonisation is essential for the interoperability of study data, i.e., to ensure content equivalence across studies and to minimise measurement or assessment error, which may cause bias or impair statistical power (Fortier et al., 2017. https://doi.org/10.1093/ije/dyw075).

The NFDI4Health consortium (https://www.nfdi4health.de) aims to increase the quality of health research in Germany by increasing the visibility and accessibility of research data according to the FAIR principles. The focus is on data from epidemiological studies and clinical trials. In this context, scientists of the NFDI4Health Use Cases 5.1 and 5.2 have developed pilot projects in which data harmonisation will be achieved based on research questions typical for the respective research area. In NFDI4Health, data harmonisation will be performed based on the Maelstrom harmonisation procedures (https://www.maelstrom-research.org/; Fortier et al., 2017. https://doi.org/10.1093/ije/dyw075), which were adapted to the needs of the current pilot projects. The harmonisation will be carried out by the respective DHOs according to the following steps:

Steps

Preparation

1. Collect metadata of the variables

The first step is to collect metadata of the variables required for the three pilot projects. For this, we modified the Excel template “Standard_DataSchema_Metadata.xlsx” originally developed by Maelstrom. Its first two spreadsheets, “Read me” and “Standard DS metadata”, give instructions on how to fill out each column of the third spreadsheet, “Variables”, which requests all necessary information on variable availability and format (NFDI4Health). We specifically ask the DHOs to propose alternatives if variables are not available in the exact format requested (DHOs). This will allow us to explore alternative harmonisation rules in the next step and, if necessary, to resolve questions about the variable metadata of individual studies with the respective DHOs at this stage. We will ask DHOs to return the filled template within one month. A fourth spreadsheet, “Categories”, is also part of the Maelstrom template but was removed for the metadata collection; representatives of Use Cases 5.1 and 5.2 will complete it based on the information given by the DHOs in the spreadsheet “Variables”.

Fig 1A: “Standard_DataSchema_Metadata.xlsx”, first spreadsheet “Read me”
Fig 1B: “Standard_DataSchema_Metadata.xlsx”, second spreadsheet “Standard DS metadata”
Fig 1C: “Standard_DataSchema_Metadata.xlsx”, third spreadsheet “Variables”
Fig 1D: “Standard_DataSchema_Metadata.xlsx”, fourth spreadsheet “Categories” (for NFDI4Health use only)

Action | Responsible party
Share the “Standard_DataSchema_Metadata.xlsx” template with the responsible person(s) of the DHOs together with the harmonisation protocol and define a timeline for returning the filled-out document. | NFDI4Health
Fill out the template “Standard_DataSchema_Metadata.xlsx” by entering all relevant variables and their metadata. Return the filled-out document to NFDI4Health Use Cases 5.1 and 5.2 within one month. | DHOs
Exchange on specific questions where clarification is needed or where some variables are difficult cases. | NFDI4Health and DHOs
Fill out the “Categories” spreadsheet based on the information provided in the “Variables” spreadsheet. | NFDI4Health
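
As a rough illustration of the exchange on unclear entries, the sketch below flags rows of a returned “Variables” sheet that look incomplete. It assumes the sheet has already been read into a list of dictionaries. The field names (`name`, `value_type`, `unit`, `categories`) and the rule that a variable should carry either a unit or categories are our own assumptions for illustration, not the actual column names or rules of the Maelstrom template.

```python
# Hypothetical field names; the real template columns are described in its "Read me" sheet.
REQUIRED_FIELDS = ("name", "value_type")


def is_blank(value):
    """Treat None and whitespace-only entries as not filled in."""
    return value is None or str(value).strip() == ""


def find_incomplete_variables(rows):
    """Return {variable label: [missing items]} for rows that need follow-up."""
    problems = {}
    for index, row in enumerate(rows, start=1):
        missing = [field for field in REQUIRED_FIELDS if is_blank(row.get(field))]
        # Assumption: a variable should state either a unit or its categories.
        if is_blank(row.get("unit")) and is_blank(row.get("categories")):
            missing.append("unit or categories")
        if missing:
            label = row.get("name") if not is_blank(row.get("name")) else f"row {index}"
            problems[label] = missing
    return problems


example_rows = [
    {"name": "bmi", "value_type": "decimal", "unit": "kg/m2", "categories": ""},
    {"name": "smoking", "value_type": "", "unit": "", "categories": "0=never;1=ever"},
    {"name": "", "value_type": "integer", "unit": "", "categories": ""},
]
print(find_incomplete_variables(example_rows))
```

Rows flagged this way would be candidates for the clarification round with the respective DHO.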

2. Define harmonisation strategy for each variable (task NFDI4Health)

Based on the filled template “Standard_DataSchema_Metadata.xlsx”, Use Cases 5.1 and 5.2 will evaluate the availability and compatibility of the variables across the studies. The second Excel template, “Standard_Data_Processing_Elements_Metadata_Original.xlsx”, provided by Maelstrom, serves to process the collected variable metadata so that the R package “Rmonize” (https://maelstrom-research.github.io/Rmonize-documentation/index.html) can handle the input dataset. We will fill out this second template (NFDI4Health). Its first two spreadsheets, “Read me” and “Standard proc. elem. metadata”, give the necessary instructions on how to fill in the third spreadsheet, “Processing elements”. This spreadsheet lists all variables collected with the first Excel template (data schema variables, or DS variables) and provides the so-called processing elements in a format that enables further processing with Rmonize. For this, harmonisation rules must be defined and described in the respective column of this spreadsheet. Because variables may vary across studies, and the required variables and study population may also differ between the three pilot projects, this Excel template will be generated per study and per pilot project. At this stage, additional feedback rounds with the studies will take place as needed to check our entries (DHOs). The Use Cases will send the filled document(s) to the DHOs (one per pilot project the DHO is contributing to) together with all other harmonisation documents (see Step 5).

Fig 2A: “Standard_Data_Processing_Elements_Metadata...xlsx”, first spreadsheet “Read Me”
Fig 2B: “Standard_Data_Processing_Elements_Metadata...xlsx”, second spreadsheet “Standard proc. elem. metadata”
Fig 2C: “Standard_Data_Processing_Elements_Metadata...xlsx”, third spreadsheet “Processing elements”

Action | Responsible party
Define harmonisation strategy for each variable in each study. | NFDI4Health
Fill out the template “Standard_Data_Processing_Elements_Metadata_Original.xlsx” based on the variable-specific metadata provided by the DHOs. The template will be filled out separately for each pilot project and each study. | NFDI4Health
Exchange on specific questions where clarification is needed or where some variables are difficult cases. | NFDI4Health and DHOs
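
To make the notion of a harmonisation rule more concrete, the following sketch shows two common kinds of rule, a unit conversion and a category recode, applied to one study record. It is a plain-Python illustration only and does not use the Rmonize rule syntax. The variable names (`height_m`, `smoke`, `height_cm`, `smoking_status`) and the codings are hypothetical.

```python
def convert_unit(value, factor):
    """Scale a numeric source value into the requested unit; keep missing values missing."""
    return None if value is None else value * factor


def recode(value, mapping):
    """Map a study-specific category code to the harmonised code; None if unmapped."""
    return mapping.get(value)


# Hypothetical study: height stored in metres, smoking coded 1/2/3 instead of 0/1/2.
rules = {
    "height_cm": lambda record: convert_unit(record.get("height_m"), 100),
    "smoking_status": lambda record: recode(record.get("smoke"), {1: 0, 2: 1, 3: 2}),
}


def harmonise(record, rules):
    """Apply one rule per harmonised (DS) variable to a single source record."""
    return {target: rule(record) for target, rule in rules.items()}


print(harmonise({"height_m": 1.5, "smoke": 2}, rules))
# → {'height_cm': 150.0, 'smoking_status': 1}
```

In the actual workflow, such rules are written into the “Processing elements” spreadsheet per study and per pilot project, and Rmonize applies them.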

3. Prepare data dictionaries

We will prepare a harmonised data dictionary listing the variables that will be present in the dataset produced by the harmonisation with Rmonize. It serves as an overview and, later, as a reference against which to check the data dictionary produced by Rmonize, to make sure the dataset is exactly as we need it.

Based on this, we will also prepare data dictionaries specific to each pilot project and each study, which tell the DHOs which data they have to feed into Rmonize. DHOs will need these to create the project-specific datasets containing the source variables from which the harmonised variables are derived. These dictionaries must be study-specific because source variables may differ by study.

Action | Responsible party
Prepare data dictionaries: harmonised data dictionaries (pilot project-specific) and study-specific data dictionaries (pilot project-specific and study-specific). | NFDI4Health

4. Prepare overall and study-specific harmonisation guide

After we have received all the necessary information about the variables from the studies, we will create a harmonisation guide. Each DHO will receive a study-specific guide containing a list of all variables to be harmonised per pilot project and documentation of the harmonisation process with Rmonize. In case Rmonize does not work in a specific study, this guide will contain the necessary instructions to harmonise the variables in the DHO's usual statistical program.

Action | Responsible party
Prepare overall and study-specific harmonisation guide (pilot project-specific) | NFDI4Health

Harmonisation

5. Provide harmonisation material to DHOs

The following material, created previously and described in the preparation section, constitutes the harmonisation material and will be sent to the data manager or the person responsible for data harmonisation at each DHO:

a) Pilot- and study-specific processing elements spreadsheets (see Step 2)

b) Pilot- and study-specific data dictionaries (see Step 3)

c) Pilot- and study-specific harmonisation guides (see Step 4)

d) Instructions for using the R package Rmonize (adapted from Maelstrom and including links to all necessary resources)

All study-specific material will contain the number of the pilot study and the name of the study at the end of the file name (e.g., “Standard_Data_Processing_Elements_Metadata_5.1.1_NAKO.xlsx”).
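
The naming convention can be sketched as a pair of small helpers that build and parse such file names. The function names are our own, and the parser assumes that neither the pilot number nor the study name contains an underscore.

```python
def study_file_name(base_name, pilot_number, study_name, extension="xlsx"):
    """Append pilot number and study name to the end of the base file name."""
    return f"{base_name}_{pilot_number}_{study_name}.{extension}"


def parse_study_file_name(file_name):
    """Split a file name built with study_file_name back into its parts."""
    stem, _, extension = file_name.rpartition(".")
    # The last two underscore-separated tokens are the pilot number and the study name.
    base_name, pilot_number, study_name = stem.rsplit("_", 2)
    return base_name, pilot_number, study_name, extension


name = study_file_name("Standard_Data_Processing_Elements_Metadata", "5.1.1", "NAKO")
print(name)
# → Standard_Data_Processing_Elements_Metadata_5.1.1_NAKO.xlsx
print(parse_study_file_name(name))
```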

Action | Responsible party
Share harmonisation material (a-d in Step 5 above) with responsible person(s) of DHOs and communicate timeline for performing the data harmonisation (Step 6 below). | NFDI4Health

6. Data harmonisation (task DHOs)

DHOs will first prepare one dataset per pilot project that includes only the variables needed to conduct that pilot project (as listed in the pilot- and study-specific data dictionaries). In these datasets, the variable names, value types, units and categories should match those reported in the first Excel template “Standard_DataSchema_Metadata.xlsx”. These datasets will then be the starting point for the data harmonisation.

DHOs will then follow the instructions for data harmonisation with the R package Rmonize, using the datasets created above as the basis for the harmonisation. The harmonisation will be performed per pilot project by loading the corresponding dataset and processing elements spreadsheet into R and running the R package. The NFDI4Health Use Cases 5.1 and 5.2 will be available for feedback rounds in case troubleshooting is needed. If for any reason the harmonisation does not work with Rmonize, the study-specific harmonisation guide may be used to perform manual harmonisation.

The R package Rmonize will produce harmonisation reports (one report per variable, for each pilot study) as well as the harmonised data dictionaries. The DHOs will be instructed to save the produced reports and data dictionaries and share them with the NFDI4Health Use Cases 5.1 and 5.2.

If variables needed for a pilot project are not available (or are not possible to harmonise) in a study, these variables will be omitted from the data dictionaries and will not be produced by the Rmonize package.

Action | Responsible party
Prepare dataset per pilot project with only the required variables, according to the provided study-specific data dictionary. | DHOs
Upload the dataset created and the processing elements spreadsheet into R and run Rmonize package. | DHOs
Feedback round/troubleshooting | NFDI4Health and DHOs
Save variable-specific harmonisation reports and harmonised data dictionary generated by Rmonize and share with NFDI4Health Use Cases 5.1 and 5.2. | DHOs
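
The dataset preparation at the start of this step can be sketched as follows: keep only the variables named in the study-specific data dictionary and report any that the source data do not contain. This is a minimal plain-Python sketch with hypothetical variable names. In practice, DHOs would do this in their usual data management software.

```python
def select_required_variables(records, required):
    """Keep only required variables; also return required variables absent from the data."""
    available = set()
    for record in records:
        available.update(record.keys())
    kept = [name for name in required if name in available]
    missing = [name for name in required if name not in available]
    # Preserve the order given by the data dictionary; absent values stay None.
    subset = [{name: record.get(name) for name in kept} for record in records]
    return subset, missing


source = [
    {"id": 1, "bmi": 22.5, "internal_flag": "x"},
    {"id": 2, "bmi": None, "internal_flag": "y"},
]
dictionary_variables = ["id", "bmi", "waist_cm"]  # hypothetical dictionary entries

subset, missing = select_required_variables(source, dictionary_variables)
print(subset)
print(missing)
```

Variables reported as missing here correspond to the case described above, in which a variable is omitted from the data dictionaries and not produced by Rmonize.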

7. First quality control

The NFDI4Health Use Cases 5.1 and 5.2 will perform a first quality control of the harmonisation process by studying the harmonised data dictionary and the harmonisation reports produced by the R package Rmonize and shared by the DHOs. The following parameters should be checked to ensure that the harmonisation process was successful:

Harmonised study-specific data dictionary:

  • Contains only the following information (columns): variable name, value type, unit and categories. The information in these columns should correspond to the general harmonised data dictionary generated by NFDI4Health Use Cases 5.1 and 5.2 in Step 3, except for variables that are not available or not harmonisable in a study (row omitted).

Harmonisation reports:

  • Continuous variables: plausible and expected distributions (minimum, maximum, mean), depending on the nature of the variable, on pre-harmonisation distributions and on the requested units.

  • Categorical variables: adequate categories and expected proportions in each category, based on pre-harmonisation characteristics.

Action | Responsible party
Check that the study-specific harmonised data dictionary produced by Rmonize matches the one generated by NFDI4Health Use Cases 5.1 and 5.2 in Step 3. | NFDI4Health
Examine the variable-specific harmonisation reports produced by Rmonize and ensure that categories and distributions of harmonised variables are plausible and match expectations. | NFDI4Health
In case of discrepancies, contact DHOs to solve issues. If needed, the harmonisation process may be repeated for discrepant variables. | NFDI4Health
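
The data dictionary comparison in this step can be sketched as below. The sketch assumes both dictionaries have been read into mappings from variable name to the fields value type, unit and categories. The field keys and example entries are hypothetical. Omitted variables are reported separately, since omission is expected for variables that are unavailable or not harmonisable in a study.

```python
FIELDS = ("value_type", "unit", "categories")  # hypothetical keys for the compared columns


def compare_dictionaries(reference, produced):
    """Compare a produced data dictionary against the reference from Step 3."""
    omitted = sorted(set(reference) - set(produced))
    unexpected = sorted(set(produced) - set(reference))
    mismatches = {}
    for name in sorted(set(reference) & set(produced)):
        differences = {
            field: (reference[name].get(field), produced[name].get(field))
            for field in FIELDS
            if reference[name].get(field) != produced[name].get(field)
        }
        if differences:
            mismatches[name] = differences
    return {"omitted": omitted, "unexpected": unexpected, "mismatches": mismatches}


reference = {
    "bmi": {"value_type": "decimal", "unit": "kg/m2", "categories": None},
    "sex": {"value_type": "integer", "unit": None, "categories": "1=male;2=female"},
    "waist_cm": {"value_type": "decimal", "unit": "cm", "categories": None},
}
produced = {
    "bmi": {"value_type": "decimal", "unit": "kg/m2", "categories": None},
    "sex": {"value_type": "text", "unit": None, "categories": "1=male;2=female"},
}
print(compare_dictionaries(reference, produced))
```

Entries under `mismatches` or `unexpected` would be followed up with the DHO. Entries under `omitted` only need checking against the variables already known to be unavailable in that study.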

8. Data upload

If the first quality control identifies any discrepancy, DHOs will be contacted to solve the issues and repeat the harmonisation process for the affected variables before the upload.

Data will be accessed for analysis via Opal/DataSHIELD (Gaye et al., 2014. https://doi.org/10.1093/ije/dyu188), a federated data analysis approach. When the data have been successfully harmonised, the DHOs will be instructed to upload and import the pilot project-specific harmonised data dictionaries and datasets (both produced with the help of the R package Rmonize). A pilot-specific project and table name will be assigned to each study. The instructions for the DHOs to upload and import the dataset and data dictionary to the DHO Opal server are available in the Standard Operating Procedure for Installation and Configuration of Opal DataSHIELD in NFDI4Health, prepared by Sofia Siampani (sofiamaria.siampani@mdc-berlin.de), who is the contact within NFDI4Health for analysis infrastructure and troubleshooting.

Following upload and import, DHOs must set permissions for analysts and share access credentials.

Action | Responsible party
Upload and import of harmonised data dictionary and dataset per pilot project | DHOs
Feedback round/troubleshooting | NFDI4Health and DHOs
Set permissions for data analysts per pilot project | DHOs

9. Second quality control

After upload and import of the respective dataset and once analysts have permission and credentials to connect to the DHO Opal server, data analysts can perform a second quality control to ensure that the data were uploaded correctly and are consistent with the harmonisation reports.

Specifically, data analysts will check the following:

  • There are no discrepancies between the data dictionary uploaded to the Opal server and the harmonised data dictionaries (general from Step 3; study-specific from Step 6). This includes checking variable name, value type, unit and categories.

  • The number of missing values, the distributions and the frequencies of all variables, using summary functions in DataSHIELD.

Any discrepancy or implausibility will be communicated to the DHOs, who will solve any issues. If needed, NFDI4Health T3.7 will be involved for support.

Action | Responsible party
Checking there are no discrepancies in data dictionary uploaded to Opal | NFDI4Health
Checking plausibility of values using summary functions in DataSHIELD (checking missingness, distributions and frequencies). | NFDI4Health
Feedback round/troubleshooting | NFDI4Health and DHOs
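
In the protocol, the plausibility checks are run through DataSHIELD summary functions on the Opal server, without direct access to individual-level data. The sketch below is only a local, plain-Python illustration of the quantities that are compared with the harmonisation reports (missingness, distributions and frequencies). It does not use or imitate the DataSHIELD API, and the example values are invented.

```python
from collections import Counter
from statistics import mean


def summarise_continuous(values):
    """Missing count plus minimum, maximum and mean of the observed values."""
    observed = [value for value in values if value is not None]
    if not observed:
        return {"n_missing": len(values), "min": None, "max": None, "mean": None}
    return {
        "n_missing": len(values) - len(observed),
        "min": min(observed),
        "max": max(observed),
        "mean": mean(observed),
    }


def summarise_categorical(values):
    """Missing count plus frequency of each observed category."""
    observed = [value for value in values if value is not None]
    return {
        "n_missing": len(values) - len(observed),
        "frequencies": dict(Counter(observed)),
    }


print(summarise_continuous([20.0, 30.0, None, 25.0]))
print(summarise_categorical([0, 1, 1, None, 2, 1]))
```

A summary that deviates from the corresponding harmonisation report, for example a different number of missing values, would be reported to the DHO as a discrepancy.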