Data Cleaning Workflow for Prospective Clinical Research, Using R + REDCap
This repo contains a tutorial and related files which describe the continual data cleaning process used by the Vanderbilt CIBS Center for prospective clinical research.
The tutorial refers to a sample REDCap database for a three-month longitudinal study of adult patients taking a dietary supplement, in which creatinine, HDL and LDL cholesterol, and weight are measured over time. (Sample database is adapted with thanks from REDCap's project templates.)
- Tutorial (PDF): Contains code, links, and prose describing our entire process, from study and database design through study completion.
- Example R script, extracted from the tutorial, which
- Allows you to code along with the tutorial more easily
- Serves as a base for developing your own data cleaning script
- Script of helper functions for data cleaning: Includes export of the data dictionary from REDCap and three major helper functions I use often when cleaning data:
create_error_df(): Given a matrix of T/F values and a set of error messages and labels, create a data.frame of all data issues represented
check_missing(): Check whether variables are simply missing from the specified data set; give error messages using labels from REDCap data dictionary
check_limits_numeric(): Check whether numeric fields are within specified limits, using either REDCap data dictionary limits (default) or user-supplied values
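To make the pattern behind these three helpers concrete, here is a minimal base R sketch. It is not the code from the helper script: the `sketch_*` function names, their arguments, the toy `visits` data, and the limit values are all assumptions made for illustration, and the labels and limits are typed in by hand rather than read from a REDCap data dictionary.

```r
# Hypothetical stand-ins for the repo's helpers. Names, arguments, and data
# are assumptions for illustration, not the actual signatures.

# Toy data: one row per patient
visits <- data.frame(
  study_id   = c("001", "002", "003"),
  weight     = c(72.5, NA, 410),
  creatinine = c(0.9, 1.1, NA),
  stringsAsFactors = FALSE
)

# Turn a logical matrix (rows = records, columns = checks, TRUE = problem)
# into a data.frame with one row per data issue.
sketch_error_df <- function(error_matrix, error_msgs, ids) {
  hits <- which(error_matrix, arr.ind = TRUE)  # NA entries are ignored
  data.frame(
    id    = ids[hits[, "row"]],
    check = colnames(error_matrix)[hits[, "col"]],
    msg   = error_msgs[hits[, "col"]],
    stringsAsFactors = FALSE
  )
}

# Flag missing values in the requested variables, using human-readable labels.
sketch_check_missing <- function(df, vars, labels, id_col = "study_id") {
  missing_matrix <- is.na(df[, vars, drop = FALSE])
  sketch_error_df(missing_matrix, paste("Missing", labels), df[[id_col]])
}

# Flag non-missing numeric values outside [min, max].
# `limits` is a data.frame with columns var, min, max (hypothetical layout).
sketch_check_limits <- function(df, limits, id_col = "study_id") {
  out_of_range <- do.call(cbind, lapply(seq_len(nrow(limits)), function(i) {
    x <- df[[limits$var[i]]]
    !is.na(x) & (x < limits$min[i] | x > limits$max[i])
  }))
  colnames(out_of_range) <- limits$var
  msgs <- paste0(limits$var, " outside ", limits$min, "-", limits$max)
  sketch_error_df(out_of_range, msgs, df[[id_col]])
}

# User-supplied limits (made-up values)
limits <- data.frame(
  var = c("weight", "creatinine"),
  min = c(30, 0.1),
  max = c(300, 15),
  stringsAsFactors = FALSE
)

missing_issues <- sketch_check_missing(
  visits,
  vars   = c("weight", "creatinine"),
  labels = c("Weight (kg)", "Creatinine (mg/dL)")
)
limit_issues <- sketch_check_limits(visits, limits)

# One combined table of issues to send back to the study site
all_issues <- rbind(missing_issues, limit_issues)
print(all_issues)
```

Keeping every check in the same "logical matrix in, issue table out" shape means new checks can be added without changing how the issues are combined and reported.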
Auxiliary Files for Tutorial
tutorialfiles/: At a certain point in the tutorial, we need to wipe the slate clean and start over with updated datasets.
dataclean_partial.R reads in the correct data and performs all operations needed to bring your session to the appropriate state at that point.
- Data files (CSV) included for those unable to connect to the OCU REDCap
rawdata/: Contains data as originally entered in the example REDCap database and exported using the REDCap API, using code as shown in the tutorial.
fixeddata/: Data in the same formats as rawdata/, but after a few issues have been "corrected" in the original database.
querydata/: CSV files of the data issues observed at each point in the process: using fixeddata/; and after unneeded issues have been removed.
codebook.pdf: Codebook for the example study database
codebook_documentation.pdf: Codebook for the documentation database
These materials were originally developed for a workshop series at Osaka City University, Osaka, Japan, June 2018. They are under an MIT license.