/
dbGaPCheckup_vignette.Rmd
569 lines (399 loc) · 32.2 KB
/
dbGaPCheckup_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
---
title: "dbGaPCheckup Vignette"
author: "Lacey W. Heinsberg and Daniel E. Weeks\n"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 3
number_sections: true
pkgdown:
as_is: true
toc: true
toc_depth: 3
number_sections: true
vignette: >
%\VignetteIndexEntry{dbGaPCheckup Vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
# To render to GitHub markdown:
# render("dbGaPCheckup_vignette.Rmd",github_document(toc=TRUE, toc_depth=3, number_sections=TRUE))
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Copyright information
Copyright 2022, University of Pittsburgh. All Rights Reserved.
License: [GPL-2](https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
# Overview
This document describes our R package, `dbGaPCheckup`, which implements a series of check, awareness, utility, and reporting functions to help you ensure your scientific data set meets [formatting requirements](https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/) for submission to the National Library of Medicine's database of Genotypes and Phenotypes ([dbGaP](https://www.ncbi.nlm.nih.gov/gap/)).
This vignette was designed to give you a broad overview of the utility of this R package. A complete table of functions and descriptions is shown below. See the Quick Start (`dbGaPCheckup`) vignette for a brief introduction to the package.
```{r individual_checks, echo=FALSE}
fn.path <- system.file("extdata", "Functions.xlsx",
package = "dbGaPCheckup", mustWork=TRUE)
fns <- readxl::read_xlsx(fn.path)
knitr::kable(fns, caption = "List of function names and types.")
```
# Installation
The package is written in R language.
To install from [CRAN](https://cran.r-project.org/), proceed as follows:
```
install.packages("dbGaPCheckup")
```
To install the development version from [GitHub](https://github.com/), proceed as follows:
1. Install and load the `devtools` package by issuing these commands:
```
install.packages("devtools")
library(devtools)
```
2. Install and load the `dbGaPCheckup` by issuing these commands:
```
install_github("lwheinsberg/dbGaPCheckup/pkg")
```
If you wish to have this vignette installed and accessible within your R help pages, use this command instead:
`install_github("lwheinsberg/dbGaPCheckup/pkg", build_opts = c("--no-resave-data", "--no-manual"), build_vignettes = TRUE)`
After the `dbGaPCheckup` package has been installed, you can view load the package using this command:
```{r}
library(dbGaPCheckup)
```
and view this vignette using:
`browseVignettes("dbGaPCheckup")`
# Data format, file types, and file names
dbGaP has a host of [formatting requirements](https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/) for data set submission.
This package focuses on two required files: the Subject Phenotype data set (DS) and the corresponding Subject Phenotype data dictionary (DD). Brief instructions on setting up the files have been included below.
## Files
Checks that are NOT currently embedded into this package that we want to draw special attention to include:
(1) You may ONLY submit tab-delimited .txt and .xlsx files.
--> Tab-delimited txt files are preferable for the data set.
--> Excel (.xlsx) format is preferable for the data dictionary.
(2) File names should NOT contain special characters, spaces, hyphens, brackets, periods, or forward (/) or backward slashes ().
--> For example, 'data.set.txt', 'data-set.txt', 'data set.txt' are all illegal names, but 'data_set.txt' would be OK.
(3) Excel files are only allowed to have one sheet (i.e., no multiple tabs/sheets are allowed).
## Subject Phenotype Data Set (DS)
In brief, the Subject Phenotype data set consists of the study data for participants. In the data set, each row represents a participant, and each column represents a study phenotype variable. The first column in the data set needs to be labeled `SUBJECT_ID` and contains the unique participant identifier as an integer or string value. Integers should not have zero padding or spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Columns falling after `SUBJECT_ID` will be unique to a given study, but include participant factors such as age, sex, etc. Formatting for an example data set is shown below.
```{r ds, echo=FALSE}
DS.path <- system.file("extdata", "DS_Example.txt",
package = "dbGaPCheckup", mustWork=TRUE)
DS.data <- read.table(DS.path, header=TRUE, sep="\t",
quote="", as.is = TRUE)
```
```{r, echo=FALSE}
knitr::kable(DS.data[1:6,], caption="First six lines of an example dbGaP data set.")
```
Other example data sets provided by dbGaP can be found at the NCBI [submission guide](https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/). See "Example of a Subject Phenotypes DS File" and "6a_SampleAttributes_DS.txt".
## Subject Phenotype Data Dictionary (DD)
In the Subject Phenotype data dictionary, each row represents a unique variable (that corresponds to columns in the data set), and each column represents information about that variable (see example below). For more detailed data dictionary formatting instructions, visit the NCBI [submission guide](https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/) and see heading "APPENDIX for Data Dictionary (DD) File Descriptions and Specifications", which includes a table of required and suggested column headers and descriptions, as well as an example file called "6b_SampleAttributes_DD.xlsx".
```{r dd, echo=FALSE, message=FALSE}
DD.path <- system.file("extdata", "DD_Example1.xlsx",
package = "dbGaPCheckup", mustWork=TRUE)
DD.dict <- readxl::read_xlsx(DD.path)
```
```{r, echo=FALSE}
knitr::kable(DD.dict[1:6,], caption = "First six lines of an example dbGaP data dictionary.")
```
Two special data dictionary formatting notes:
(1) The final columns of the data dictionary list all unique values/meanings of all encoded values, one value per cell, of which will vary based on the number of VALUE codes for a specific variable. For example, if your data set contains a variable called `SEX` in which 0 indicates female and 1 indicates male, these columns are designed to communicate `value=meaning` (e.g., 0=female). The `VALUES` header must be the last column header and should appear ONLY in the column above the FIRST encoded value that is listed. The remaining value column header cells should be left blank. (Note that when we read in our example data set with blank column names after `VALUES`, R automatically fills in the column names with the column number (e.g., `...18`, `...19`, etc.). This is acceptable for the package level checks, but not allowable for the files that are submitted to dbGaP.)
(2) This package requires several fields beyond those required by the dbGaP [formatting requirements](https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/) to support additional data integrity checks. Specifically, dbGaP requires only that the data dictionary contains the following fields: variable name (`VARNAME`); variable description (`VARDESC`); units (`UNITS`); and variable value and meaning (`VALUE`). Because this package was designed to perform both dbGaP formatting requirement checks, as well as a series of awareness checks to help you ensure data accuracy, this package also requires that the data dictionary contains the following additional fields: logical minimum (`MIN`) and logical maximum (`MAX`) values (allowed to be left blank, but column headers are required) and the data type (e.g., integer, decimal, encoded value, string; `TYPE`) fields. If your data dictionary does not include these additional fields already, you can simply use the `add_missing_fields` function to auto fill them (see below).
# Execution with example runs and interpretation
## Check, utility, and awareness functions
Note that all "check" functions included in our package return an invisible tibble that contains (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); and (5) Information (More detailed information about the potential errors identified). This was designed to streamline the complete workflow approach and to return a succinct report back to you via `check_report` (see below). Note that there are some dependencies between checks (e.g., name_check `values_check` is dependent upon `field_check`), so there are pre-checks embedded within many checks.
### Example 1
```{r data1, message=FALSE}
data(ExampleD)
```
We recommend starting with the `check_report` function, which includes 15 embedded checks. Note that for all functions, you need to first specify the name of the data dictionary, followed by the name of the data set.
```{r cr1}
e1_report <- check_report(DD.dict.D, DS.data.D, non.NA.missing.codes=c(-4444, -9999))
```
In this check, we see that several checks passed (e.g., `field_check`), some failed (e.g., `type_check`), and some could not be attempted because a pre-check in the function failed (e.g., `missing_value_check`).
The `check_report` output can be examined to better understand the issues at hand. For example, let's examine the `pkg_field_check` results more closely. You can call more detailed information for each check using the following commands:
```{r cr_invest}
e1_report$Message[2]
e1_report$Information$pkg_field_check.Info
```
Here, we see that the `TYPE`, `MIN`, and `MAX` columns required for the complete workflow approach in this package are missing. But never fear - we can simply use the `add_missing_fields` function to add these in!
```{r add_missing}
DD.dict_updated <- add_missing_fields(DD.dict.D, DS.data.D)
```
Now that our error has been corrected, let's return to `check_report` to further investigate. Don't forget to call in the updated version of the data dictionary here!
```{r cr2}
# Note! Don't forget to call in the updated version of the data dictionary here!
e1_report.v2 <- check_report(DD.dict_updated, DS.data.D,
non.NA.missing.codes=c(-4444, -9999))
```
As you can see, now 13 out of 15 checks pass, but the workflow fails at `description_check` and `missing_value_check`. Specifically, in `description_check` we see that variables `PREGNANT` and `REACT` were identified as having missing variable descriptions (`VARDESC`), and variables `HEIGHT` and `WEIGHT` incorrectly have identical descriptions. In `missing_value_check`, we see that the variable `CUFFSIZE` contains a `-9999` encoded value that is not specified in a `VALUES` column. While we have included several functions that support "quick fixes" (`add_missing_fields`, `name_correct`, `reorder_dictionary`, `reorder_data`, `id_first_data`, and `id_first_dict`), the issues identified here are a bit more complex and study-specific, so would need to be corrected manually in your data dictionary before moving on. For now, we will leave this example and move on to the next one!
### Example 2
```{r data2, message=FALSE}
data(ExampleL)
```
```{r cr3}
e2_report <- check_report(DD.dict.L, DS.data.L)
```
In example 2, we see that the first three checks (`field_check`, `pkg_field_check`, and `dimension_check`) and several others further down the workflow pass, but the fourth check (`name_check`) fails. Looking at the `check_report` output more closely, we see that there are two variables with names that do not match between the data dictionary and data set.
Before we move on to investigate this issue further, please note that we could arrive at the same conclusion using the functions individually (rather than the complete workflow approach implemented in `check_report`):
```{r name_check}
field_check(DD.dict.L) # pass
pkg_field_check(DD.dict.L) # pass
dimension_check(DD.dict.L, DS.data.L) # pass
name_check(DD.dict.L, DS.data.L) # failed
```
In looking more closely at the `name_check` output, we then see that, while the "intent" between the names match (i.e., "hx" is sometimes used as shorthand for "history"), there are a couple of discrepancies between the data dictionary and data set. Luckily, we have included a "quick fix" for this simple issue as implemented in the `name_correct` function so that you can continue working through the checks. Specifically, `name_correct` updates the names in the data set to match the names listed in the data dictionary. Similarly, if the variable names in the data dictionary and data set matched identically, but were in the wrong order, the `reorder_dictionary` function could be used to create a new version of the data dictionary to match the order presented in the data set (see Example 5)! Back to the example at hand, though -- let's give the `name_correct` function a try now!
```{r name_check2}
DS.data_updated <- name_correct(DD.dict.L, DS.data.L)
```
Now that our error has been corrected, let's return to `check_report`. Similar to above, be sure to call in our updated data set!
```{r cr4}
# Calling in updated data set
e2_report.v2 <- check_report(DD.dict.L, DS.data_updated,
non.NA.missing.codes=c(-4444, -9999))
```
We now see that `name_check` now passes, along with several other functions in the workflow, but we have failed on `values_check` and several others.
Investigating this check failure further by looking at the `check_report` output, we see a few issues that, due to the subjectivity and complexity of different data set, will need to be manually corrected before moving on. For the purposes of this tutorial, we will now leave this data set to move on to a new one, but in reality, we would correct this issue and return again to `check_report`.
### Example 3
```{r data3, message=FALSE}
data(ExampleB)
```
Again, we will start with the `check_report` function.
```{r cr5}
e3_report <- check_report(DD.dict.B, DS.data.B)
```
In the above chunk, `check_report` determines that all 15 checks were passed! But ALERT --- this is misleading a we forgot to include an important parameter!!!!! Rerunning the check with our missing value codes defined, we now see an issue at `missing_value_check`, which underscores the importance of specifying missing value codes.
```{r cr6}
e3_report.v2 <- check_report(DD.dict.B, DS.data.B, non.NA.missing.codes=c(-9999))
```
If you are not immediately sure what your missing value codes are, you can use our `value_meaning_table` utility/awareness function.
```{r value_meaning}
value_meaning_table(DD.dict.B)
```
So here we see that -9999 is a verified missing value code in this example.
### Example 4
```{r data4, message=FALSE}
data(ExampleH)
```
```{r cr7}
e4_report <- check_report(DD.dict.H, DS.data.H, non.NA.missing.codes=c(-4444, -9999))
```
Note that in this example, we see an error at `integer_check`. Let's investigate this further.
Specifically, we can use the awareness function to grep (i.e., search) for this variable name in the dictionary.
```{r exp1}
dictionary_search(DD.dict.H, search.term=c("SUP_SKF"), search.column=c("VARNAME"))
```
We can also look at the values in the data set to see that, in fact, there are some values that are decimals (not integers as the dictionary suggests).
```{r exp2}
table(DS.data.H$SUP_SKF)
```
We can also use this awareness function to grep any variables that are described as "skinfold" measurements to evaluate data `TYPE` across variables.
```{r exp3}
dictionary_search(DD.dict.H, search.term=c("skinfold"))
```
Above we see that both abdominal and suprailiac skinfold thickness are listed as integers in the data dictionary, and thought to have been measured to the nearest mm.
```{r exp4}
table(DS.data.H$ABD_SKF)
```
While `ABD_SKF` appears to be a true integer, `SUP_SKF` appears to have some decimal places. This error could be corrected either by listing `SUP_SKF` as `TYPE` decimal, or by investigating if the data set has a measurement/recording error.
### Example 5
```{r data5, message=FALSE}
data(ExampleN)
```
```{r cr8}
d5_report <- check_report(DD.dict.N, DS.data.N)
```
In this example, dbGaPCheckup informs us several issues --- let's focus first on the `name_check` results. While the variable names match between the data dictionary and the data (in contrast to Example 2), they are in the wrong order. Instead of fixing this issue manually outside of R, we can simply call the `reoder_dictionary` function as a "quick fix" and run the `name_report` function to confirm our update works!
```{r reorder_dict}
DD.dict_updated <- reorder_dictionary(DD.dict.N, DS.data.N)
```
```{r nc}
# Remember to call in the updated data dictionary!
name_check(DD.dict_updated, DS.data.N)
```
Above, we see that `name_check` now passes! Moving forward, we could simply return to our `check_report` workflow to search for other potential issues in finalizing our files for dbGaP submission.
### Example 6
```{r data6, message=FALSE}
data(ExampleA)
```
As mentioned above, if you prefer, you can also simply run the individual checks that you are interested in rather than taking the complete workflow approach. Note that several package-specific pre-checks are embedded in many of the functions (e.g., `integer_check`).
```{r id_check}
id_check(DS.data.A)
```
```{r misc_format_check}
misc_format_check(DD.dict.A, DS.data.A)
```
```{r row_check}
row_check(DD.dict.A, DS.data.A)
```
```{r NA_check}
NA_check(DD.dict.A, DS.data.A)
```
```{r minmax_check}
minmax_check(DD.dict.A, DS.data.A)
```
Above we see that an issue has been discovered at `minmax_check`. Let's investigate this further. The approach to view the "out of range values" is a bit cryptic, but it can be done with the following code.
```{r minmax_check2}
b <- minmax_check(DD.dict.A, DS.data.A)
b$Information[[1]]$OutOfRangeValues
```
Here we see that we forgot to specify our missing value codes when we ran `minmax_check`, so they are being flagged as errors. Let's rerun the command specifying -4444 and -9999 as missing value codes.
```{r minmax_check3}
minmax_check(DD.dict.A, DS.data.A, non.NA.missing.codes=c(-4444, -9999))
```
Now we see that our check passed for this data set!
## Reporting functions
We have also created awareness and reporting functions that are not built into the complete workflow approach. These functions generate graphical and textual descriptions and awareness checks of the data in HTML format. These reports are designed to help you catch other potential errors in your data set. Note that the `create_report` generated is quite long however, so we recommend that you only submit subsets of variables at a time. Specification of missing value codes are also important for effective plotting. The commands are not ran here, as they work best when initiated interactively.
```
# Functions not run here as they work best when initiated interactively
# Awareness Report (See Appendix A for more details)
create_awareness_report(DD.dict, DS.data, non.NA.missing.codes=c(-9999, -4444),
output.path= tempdir())
# Data Report (See Appendix B for more details)
create_report(DD.dict, DS.data, sex.split=TRUE, sex.name= "SEX",
start = 3, end = 7, non.NA.missing.codes=c(-9999,-4444),
output.path= tempdir(), open.html=TRUE)
```
For more details and to learn more, see the appendices below (`create_awareness_report`, Appendix A; `create_report`, Appendix B).
## Label data function
Note that after your data dictionary is fully consistent with your data, you can use the `label_data` function to convert your data to labelled data, essentially embedding the data dictionary into the data for future use! This function uses Haven labelled data with SPSS style missing data codes to add non-missing information from the data dictionary as attributes to the data.
```{r label}
DS_labelled_data <- label_data(DD.dict.A, DS.data.A, non.NA.missing.codes=c(-9999))
labelled::var_label(DS_labelled_data$SEX)
labelled::val_labels(DS_labelled_data$SEX)
attributes(DS_labelled_data$SEX)
labelled::na_values(DS_labelled_data$HX_DEPRESSION)
```
# Appendix: Reporting functions
As described above, there are a variety of awareness and reporting functions that are not built into the complete workflow approach. The purpose of this appendix is to highlight some of these features using the following example data.
```{r dataA1, warning=FALSE}
data(ExampleB)
```
## Appendix A: Awareness Report
Run `create_awareness_report`, which creates a nice .Rmd version of the below checks. While the output below is nearly identical to that you will see using the `create_awareness_report` function, for the purposes of this vignette, we have further expanded the annotation to assist in interpretation of the output through an example.
```
# Not run as works best when run interactively
create_awareness_report(DD.dict, DS.data, non.NA.missing.codes=c(-9999),
output.path= tempdir())
```
### Missingness Summary
This awareness function summarizes the amount of missingness in the data set.
```{r misssum}
missingness_summary(DS.data.B, non.NA.missing.codes = c(-9999), threshold = 95)
```
Above we that there are 0 variables in our example data set that have a percent missingness >95%. Navigating through the output, we also see a complete summary of missingness in our data set, with `SAMPLE_ID` having the highest % missingness at 16%. Finally we see a histogram plotting missingness across our data set.
### Values Missing Tables
In the `value_missing_table` function, for each variable, we have three sets of possible values:
(1) the set D of all the unique values observed in the data;
(2) the set V of all the values explicitly encoded in the VALUES columns of the data dictionary; and
(3) the set M of the missing value codes defined by the user via the `non.NA.missing.codes` argument.
This function examines various intersections of these three sets, providing awareness checks about possible issues of concern.
```{r vmt}
results.list <- value_missing_table(DD.dict.B, DS.data.B, non.NA.missing.codes = c(-9999))
results <- results.list$report
```
#### Check A: If the user defines a missing value code that is not present in the data (In Set M and Not in Set D).
```{r vmt1a, echo=FALSE}
knitr::kable(results$Information$details$CheckA.AllMInD,
caption = "Table Check A: List of variables for
which user-defined missing value code is not present
in the data.")
```
The above table lists the variables for which the user-defined missing value code of `-9999` is not present in the data. These are not necessarily errors, however, as `dbGaPCheckup` reads `non.NA.missing.codes` as "global" missing value codes, even if a specific variable does not contain the code. For example, in the example data set, the SEX variable is complete, containing no missing value codes and only containing encoded values of 0=male, and 1=female, but `SEX` is flagged in the above variable list since it does not contain a `-9999` value. In other words, this variable's presence in the above list is NOT an issue that we should be concerned about. This function is intended only to bring awareness to potential errors in your data (e.g., perhaps you knew that the sex variable was missing for 5 participants for your specific study.)
Interpretation of table column names:
--> `AllMInD`: Variable-specific check result communicating if user-defined missing value code(s) are detected in the data set (FALSE=no).
--> `NsetD`: Number of values (or levels) detected in the data (e.g., in this example, `SEX` has two levels [0=male, 1=female]).
--> `NsetM`: Number of missing value codes defined (e.g., in this example, 1 user-defined missing value code [`-9999`] was defined).
--> `NsetDAndSetM`: Number of occurrences detected in both the data set and the user-defined missing value code (e.g., here 0 overlap for these variables, but if a second missing value code were defined, we might see a 1 here).
--> `MNotInD`: User-defined missing value code the function checked for (e.g., in this example, `-9999`).
--> `MInD`: Variable-specific number; user-defined missing value codes detected in the data (e.g., in this example, 0).
#### Check B: If a VALUES entry defines an encoded code value, but that value is not present in the data (In Set V and Not in Set D).
```{r vmt2b, echo=FALSE}
knitr::kable(results$Information$details$CheckB.AllVsInD,
caption = "Table Check B: List of variables for which
a VALUES entry defines an encoded code value, but that
value is not present in the data.")
```
The above table lists variables for which a VALUES entry defines an encoded value (i.e., value=meaning; e.g., 0=male), but that value is not present in the data. While ideally all defined encoded values (i.e., in set V) should be observed in the data (i.e., in set D), it is NOT necessarily an error if one does not.
Interpretation of table column names:
--> `AllVsInD`: Check result communicating if all parsed VALUES entries were detected in the data set (FALSE=no).
--> `NsetD`: Number of values (or levels) detected in the data (e.g., in this example, `LENGTH_SMOKING_YEARS` has 12 unique levels).
--> `NsetV`: Number of encoded value codes detected (e.g., for this example, `LENGTH_SMOKING_YEARS` has two encoded values).
--> `NsetDAndSetV`: Number of occurrences detected in both the data set and the VALUES entries (e.g., for this example, `LENGTH_SMOKING_YEARS` has one of the two encoded values detected in the data).
--> `VsNotInD`: Encoded value not detected in the data (e.g., for this example, -9999 was not detected in either variable).
So this awareness check alerts us to two potential errors. Specifically, -9999 is defined as a missing value code for `LENGTH_SMOKING_YEARS` and `HEART_RATE`, but this code is not detected in the data itself.
```{r inspect}
# Smoking
table(DS.data.B$LENGTH_SMOKING_YEARS)
dictionary_search(DD.dict.B, search.term=c("LENGTH_SMOKING_YEARS"), search.column=c("VARNAME"))
# Heart rate
table(DS.data.B$HEART_RATE)
dictionary_search(DD.dict.B, search.term=c("HEART_RATE"), search.column=c("VARNAME"))
```
Looking at this more closely, we see a missing value code of -4444, not -9999, is being used for `LENGTH_SMOKING_YEARS`, and `HEART_RATE` is a complete variable with no missing data. -9999 could be removed as a `VALUES` entry for those variables and -4444 should added as a `non.NA.missing.value.code` for this function and example data set.
#### Check C: If the user defines a missing value code that is not defined in a VALUES entry (In Set M and Not in Set V).
```{r vmt3c, echo=FALSE}
knitr::kable(results$Information$details$CheckC.AllSetMInSetV,
caption = "Table Check C: List of variables for which
user-defined missing value code(s) are not defined in
a VALUES entry.")
```
Interpretation of table column names:
--> `AllSetMInSetV`: Variable-specific check result communicating if user-defined missing value code(s) are detected as a VALUES entry (FALSE=no).
--> `NsetV`: Number of encoded value codes detected (e.g., in this example, `SEX` has two levels [0=male, 1=female]).
--> `NsetM`: Number of missing value codes defined (e.g., in this example, 1 user-defined missing value code [`-9999`] was defined).
--> `NsetMAndSetD`: Number of occurrences detected in both the user-defined missing value code and data set.
--> `SetMsNotInSetV`: Missing value code defined that was not detected in the VALUES entries (e.g., here -9999).
#### Check D: If a user-defined missing value code is present in the data for a given variable, but that variable does not have a corresponding VALUES entry (M in Set D and Not in Set V).
```{r vmt4d, echo=FALSE}
knitr::kable(results$Information$details$CheckD.All_MInSetD_InSetV,
caption = "Table Check D: List of variables for which a
user-defined missing value code is present in the data for
a given variable, but that variable does not have a
corresponding VALUES entry.")
```
Interpretation of table column names:
--> `All_MInSetD_InSetV`: Variable-specific check result communicating if user-defined missing value code(s) are detected in the data for a given variable, but that variable does not have a corresponding VALUES entry (FALSE=no).
--> `setMInDNotInV`: Encoded value codes detected in the data but not in a corresponding VALUES entry.
Note that this check identified a true error! Specifically `CUFFSIZE` has a missing value code in the data, -9999, that has not been defined as an encoded value in the `VALUES` columns. (Funny enough, this was NOT intentional on our part when creating this synthetic data set! Thank you dbGaPCheckup!)
#### Check E: If a VALUES entry is NOT defined as a missing value code AND is NOT identified in the data. ((Set V values that are NOT in Set M) that are NOT in Set D).
```{r vmt4e, echo=FALSE}
knitr::kable(results$Information$details$CheckE.All_VNotInM_NotInD,
caption = "Table Check E: List of variables for which a
VALUES entry is NOT defined as a missing value code
AND is NOT identified in the data")
```
In our example here, all VALUES entries that are NOT defined as missing values codes are listed in the data - so our check passes.
However, if there were issues, interpretation of table column names would be as follows:
--> `All_VNotInM_NotInD`: Variable-specific check result communicating if encoded values that are NOT defined as a missing value code are detected in the data (FALSE=no).
--> `setVNotInM_NotInD`: Encoded value codes detected as a VALUES entry but NOT listed as a missing value code and NOT detected in the data.
## Appendix B: Data Report
Next we can run `create_report`, which generates a textual and graphical report of the selected variables in HTML format which will optionally open the report in a web browser. This awareness report is designed to help you catch other potential errors in your data set. Note that the report generated is quite long however, so we recommend that you only submit subsets of variables at a time. In the example below, for speed of rendering, we create the report for variables only in columns 3 through 6. Note that there is an option to plot/report the data split by sex if desired. Specification of missing value codes are also important for effective plotting.
Again, the code below generates a nearly identical output to the `create_report` function, with some additional annotation added here for the purposes of this vignette and ease of interpretation.
```
# Not run as works best when run interactively
create_report(DD.dict, DS.data, sex.split=TRUE, sex.name= "SEX",
start = 3, end = 6, non.NA.missing.codes=c(-9999,-4444),
output.path= tempdir(), open.html=TRUE)
```
### Summary and plots
```{r prep_data, echo=FALSE}
# Create data set with missing value codes
# replaced with NA's (embedded in create_report function)
library(dplyr)
non.NA.missing.codes <- c(-4444, -9999)
dataset.na <- DS.data
for (value in na.omit(non.NA.missing.codes)) {
dataset.na <- dataset.na %>%
mutate(across(everything(), ~na_if(.x, value)))
}
```
```{r applyfun, results="asis", warning=FALSE}
dat_function_selected(DS.data.B, DD.dict.B, sex.split = TRUE, sex.name = "SEX", start = 3, end = 6, dataset.na=dataset.na, h.level=4)
```
Above we see a full report for variables `AGE`, `SEX`, `HEIGHT`, and `WEIGHT` as well as `AGE`, `HEIGHT`, and `WEIGHT` split by sex. Given the complexity of many data sets, this report was created so that investigators could more easily manually review the data for potential errors (e.g., sex=male appearing in a data of pregnant participants who were all female assigned at birth).
# Contact information
If you have any questions or comments, please feel free to contact us!
Lacey W. Heinsberg: law145@pitt.edu
Daniel E. Weeks: weeks@pitt.edu
Bug reports: https://github.com/lwheinsberg/dbGaPCheckup/issues
# Acknowledgments
This package was developed with partial support from the National Institutes of Health under award numbers R01HL093093, R01HL133040, and K99HD107030. The `eval_function` and `dat_function` functions that form the backbone of the awareness reports were inspired by an elegant 2016 homework answer submitted by Tanbin Rahman in our HUGEN 2070 course ‘Bioinformatics for Human Genetics’. We would also like to thank Nick Moshgat for testing and providing feedback on our package during development.