Closes #282: Test Data Guidance vignette #293

Merged
merged 9 commits into from
Aug 9, 2023
2 changes: 2 additions & 0 deletions _pkgdown.yml
@@ -114,3 +114,5 @@ navbar:
href: articles/pr_review_guidance.html
- text: Release Strategy
href: articles/release_strategy.html
- text: Test Data Guidance
href: articles/test_data_guidance.html
4 changes: 2 additions & 2 deletions inst/WORDLIST
@@ -41,6 +41,7 @@ adex
adlb
admiralci
advs
anonymized
codebase
cyclomatic
datatable
@@ -56,10 +57,9 @@ flexibilities
functions’
funder
github
hotfixes
hotfix
hotfixes
insightsengineering
lifecycle
linter
lintr
lockfile
58 changes: 58 additions & 0 deletions vignettes/test_data_guidance.Rmd
@@ -0,0 +1,58 @@
---
title: "Test Data Guidance"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Test Data Guidance}
%\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

# Introduction

[`admiraldata`](https://github.com/pharmaverse/admiraldata) provides a one-stop shop for test data in the [`admiral`](https://pharmaverse.github.io/admiral/cran-release/) family of packages. This includes datasets that are therapeutic area (TA)-agnostic (`DM`, `VS`, `EG`, etc.) as well as TA-specific ones (`RS`, `TR`, `OE`, etc.).

# Data Sources

Some of the test datasets have been sourced from the [CDISC pilot project](https://github.com/cdisc-org/sdtm-adam-pilot-project), while other datasets have been constructed ad hoc by the admiral team. Please check the [GitHub repository](https://github.com/pharmaverse/admiral.test/tree/main/data) for detailed information regarding the source of specific datasets.

# Naming Conventions {#Name}

- Datasets/programs that are TA-agnostic are prefixed with `admiral_` (e.g. `admiral_dm`).
- Datasets/programs that are TA-specific are prefixed with the `admiral` extension package from which they derive (e.g. `admiralonco_rs`, `admiralophtha_oe`).
- Names are kept consistent between datasets and the programs that generate them.

**Note**: *If a domain is used by multiple TAs, [`admiraldata`](https://github.com/pharmaverse/admiraldata) may provide multiple versions of the corresponding test dataset. For instance, the package contains `admiral_ex` and `admiralophtha_ex` as the latter contains ophthalmology-specific variables such as `EXLAT` and `EXLOC`, and `EXROUTE` is exchanged for a plausible ophthalmology value.*

# How To Update

First, open a GitHub issue in this repo describing the planned updates and tag `@pharmaverse/admiral` so that a member of the core development team can sanity-check the request. There are then two main ways to extend the test data: adding new datasets, or extending existing datasets with new records/variables. Whichever method you choose, note the following:

- Programs that generate test data are stored in the `dev/` folder.
- Each of these programs is written as a standalone R script: if any packages need to be loaded for a given program, then call `library()` at the start of the program (but please do **not** call `library(admiraldata)`).
- Most of the packages that you are likely to need will already be specified in the `renv.lock` file, so they will already be installed if you have been keeping in sync--you can check this by entering `renv::status()` in the Console. However, you may also wish to install [`metatools`](https://pharmaverse.github.io/metatools/) and [`ggplot2`](https://ggplot2.tidyverse.org/), which are currently not specified in the `renv.lock` file. If you feel that you need to install any other packages in addition to those just mentioned, then please tag `@pharmaverse/admiral` to discuss with the development core team.
- Once you have created a program in the `dev/` folder, run it as a standalone R script to generate a test dataset that will become part of the [`admiraldata`](https://github.com/pharmaverse/admiraldata) package. You do not need to build the package to do this.
- Following [best practice](https://r-pkgs.org/data.html#sec-data-data), each dataset is stored as a `.rda` file whose name is consistent with the name of the dataset: for example, the dataset `dm` should be renamed to `raw_dm` before saving it as `raw_dm.rda`; if you save `dm` as `raw_dm.rda` and subsequently load the `.rda` file, then `dm` (not `raw_dm`) will be loaded into the global environment.
- The programs in `dev/` are stored within the [`admiraldata`](https://github.com/pharmaverse/admiraldata) GitHub repository, but they are **not** part of the [`admiraldata`](https://github.com/pharmaverse/admiraldata) package--the `dev/` folder is specified in `.Rbuildignore`.
- When you run a program that is in the `dev/` folder, you generate a dataset that is written to the `data/` folder, which will become part of the [`admiraldata`](https://github.com/pharmaverse/admiraldata) package.
- The names of test datasets are specified in `R/data.R`, for the purpose of generating documentation in the `man/` folder.
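
The point above about `.rda` file names and object names can be checked with base R alone. The following is a minimal sketch, not part of the package: the toy `dm` data frame, its values, and the temporary file locations are assumptions made purely for illustration.

```r
# Toy stand-in for a real domain; the values are purely illustrative
dm <- data.frame(
  USUBJID = c("SUBJ-001", "SUBJ-002"),
  AGE = c(34, 57),
  stringsAsFactors = FALSE
)

# Saving `dm` into a file called "raw_dm.rda" does NOT rename the object
mismatch_dir <- file.path(tempdir(), "mismatch")
dir.create(mismatch_dir, showWarnings = FALSE, recursive = TRUE)
mismatched_file <- file.path(mismatch_dir, "raw_dm.rda")
save(dm, file = mismatched_file)

# load() returns the names of the objects it restored
loaded <- load(mismatched_file, envir = new.env())
loaded # "dm", not "raw_dm"

# Renaming the object first keeps the object name and file name consistent
raw_dm <- dm
consistent_dir <- file.path(tempdir(), "consistent")
dir.create(consistent_dir, showWarnings = FALSE, recursive = TRUE)
consistent_file <- file.path(consistent_dir, "raw_dm.rda")
save(raw_dm, file = consistent_file)

loaded_renamed <- load(consistent_file, envir = new.env())
loaded_renamed # "raw_dm"
```

Loading into a fresh environment via `envir = new.env()` keeps the check from overwriting anything in the global environment.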

## Adding New Datasets

- Create a program in the `dev/` folder, named `<name>.R`, where `<name>` follows the [naming conventions](#Name) and matches the dataset name. The program should generate the test data and write it (e.g., as `<name>.rda`) to the `data/` folder. Use CDISC pilot data such as `admiral_dm` as input in this program in order to create realistic synthetic data that remains consistent with other domains. Note that **no personal data should be used** as part of this package, even if anonymized.
Collaborator

I would not make using admiral_dm mandatory. It contains hundreds of subjects. Using it can cause a lot of unnecessary work.

Instead I would state that the datasets within a prefix should be consistent.

Collaborator

I think using admiral_dm should be mandatory as it is the easiest way to retain the correct USUBJID codes so that the SDTM datasets remain related. @bundfussr can you explain why it would cause a lot more work?

- Run the program.
- Reflect this update by specifying `<name>` in `R/data.R`.
- Run `devtools::document()` in order to update `NAMESPACE` and update the `.Rd` files in `man/`.
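
As a rough illustration of the steps above, a program in `dev/` might look like the sketch below. Everything specific in it is hypothetical: the dataset name `admiral_xx`, the placeholder domain `XX`, the variables, and the subject identifiers (which in practice would be taken from an existing dataset so that domains stay consistent). A temporary folder stands in for the repository's `data/` folder so that the sketch runs anywhere.

```r
# Hypothetical contents of dev/admiral_xx.R (`xx` is a placeholder domain)

# Stand-in for subject identifiers that would normally come from an
# existing test dataset; inlined here so the sketch is self-contained
usubjid <- c("SUBJ-001", "SUBJ-002", "SUBJ-003")

# Build the synthetic dataset; variable names are illustrative only
admiral_xx <- data.frame(
  STUDYID = "STUDY01",
  DOMAIN = "XX",
  USUBJID = usubjid,
  XXSEQ = 1L,
  stringsAsFactors = FALSE
)

# In the repository the output folder would be "data"; a temporary
# folder is used here as an assumption so nothing real is overwritten
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# The object name and the file name match, as recommended above
out_file <- file.path(out_dir, "admiral_xx.rda")
save(admiral_xx, file = out_file)
```

The matching entry in `R/data.R` would then typically be a roxygen2 comment block followed by the quoted dataset name (here `"admiral_xx"`), which `devtools::document()` turns into an `.Rd` file.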

## Updating Existing Datasets
Collaborator

We should add that this applies only to existing datasets from an external source.

For existing datasets which were generated, the corresponding program should be updated.

Do we need rules for modifying existing datasets?

Collaborator

Agree


- Rename the source dataset as `raw_<name>`, where `<name>` is the domain name (e.g., rename `ds` to `raw_ds`), and then save it to the `data/` folder as `raw_<name>.rda` (e.g., `save(raw_ds, file = "data/raw_ds.rda")`).
- Create a program in the `dev/` folder, named `update_<name>.R`, to load `raw_<name>.rda`, make the updates, and output `admiral_<name>.rda` to the `data/` folder.
- Run the program.
- Reflect this update by specifying both `raw_<name>` and `admiral_<name>` in `R/data.R`.
- Run `devtools::document()` in order to update `NAMESPACE` and update the `.Rd` files in `man/`.
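
The first two steps above can be sketched as follows, using `ds` as the example domain. This is only an illustration under stated assumptions: the toy source data, the added `DSSEQ` variable, and the temporary folder standing in for `data/` are all hypothetical, and the real update logic would depend on the issue being addressed.

```r
# A temporary folder stands in for the repository's data/ folder (assumption)
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)

# Step 1: rename the source dataset to raw_ds and save it as raw_ds.rda.
# The toy `ds` below replaces a real externally sourced dataset.
ds <- data.frame(
  USUBJID = c("SUBJ-001", "SUBJ-002"),
  DSDECOD = c("COMPLETED", "ADVERSE EVENT"),
  stringsAsFactors = FALSE
)
raw_ds <- ds
save(raw_ds, file = file.path(data_dir, "raw_ds.rda"))
rm(ds, raw_ds)

# Step 2: hypothetical contents of dev/update_ds.R
load(file.path(data_dir, "raw_ds.rda")) # restores the object `raw_ds`

admiral_ds <- raw_ds
# Illustrative update only: add a sequence number variable
admiral_ds$DSSEQ <- seq_len(nrow(admiral_ds))

save(admiral_ds, file = file.path(data_dir, "admiral_ds.rda"))
```

Keeping `raw_ds.rda` alongside `admiral_ds.rda` means the update can be re-run from the untouched source whenever the program changes.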