dfcheck is an R package; the name stands for Data Frame Check. The aim of this package is to check data frames for likely user mistakes, such as misspellings, extra spaces or cloned columns.
dfcheck is under active development, so there is no stable release for the moment.
The master branch always holds a version that builds, so you can safely use it. However, this branch does not include the latest features.
To install it from GitHub, install the devtools package and then run this line in R:
devtools::install_github(repo = "jomuller/dfcheck", ref = "master")
To install the development version from GitHub, install the devtools package and then run this line in R:
devtools::install_github(repo = "jomuller/dfcheck", ref = "dev")
You are welcome to open issues on the issue page in GitHub.
You are welcome to fork the repository to improve the code and the documentation. We try to follow test-driven development with testthat and to document the code thoroughly with roxygen2, so features are added relatively slowly.
Follow the milestones of this GitHub project to see the release plan.
The dfcheck package was created to speed up the tedious but important step of checking the datasets that users send us during methodology consultations. Most of the time, we receive Excel files in XLSX format. We open them with the openxlsx package, hoping to start analysing the data right away. But during the analysis, we always find the same errors:
- Extra spaces before or after a word, or double spaces, which R then treats as different levels of a categorical variable
- Inconsistent spelling of level names in a categorical variable (e.g. "primary", "Primary" or "ecole", "école", "École"), which are also treated as different levels
- Empty rows due to bad table formatting in Excel
- Non-rectangular tables, e.g. when the user has already done some calculations at the end of a column
- Cloned rows or columns, which users create to simulate frozen panes
- Duplicated column names
- ... and a lot more (the human imagination is infinite)
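Several of the problems listed above can be spotted with a few lines of base R. The sketch below is only an illustration and is not the dfcheck API: the toy data frame, the helper names (has_extra_space, normalise) and the shape of the report list are all assumptions made for this example. It also ignores accent variants such as "ecole" and "école", which would need an extra transliteration step.

```r
# Toy data frame reproducing some of the problems listed above (hypothetical data).
# check.names = FALSE keeps the duplicated column name exactly as typed.
df <- data.frame(
  school = c("primary", "Primary ", " primary", NA, "secondary", "secondary"),
  score  = c(10, 12, 11, NA, 9, 9),
  score  = c(10, 12, 11, NA, 9, 9),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Values with leading, trailing or doubled spaces (hypothetical helper).
has_extra_space <- function(x) {
  x <- as.character(x)
  !is.na(x) & (x != trimws(x) | grepl(" {2,}", x))
}
space_issues <- df$school[has_extra_space(df$school)]

# 2. Values that differ only by case or spacing (hypothetical helper).
normalise <- function(x) tolower(gsub(" +", " ", trimws(x)))
vals <- unique(df$school[!is.na(df$school)])
groups <- split(vals, normalise(vals))
variant_levels <- groups[lengths(groups) > 1]

# 3. Rows where every cell is NA (how empty spreadsheet rows show up here).
empty_rows <- unname(which(rowSums(!is.na(df)) == 0))

# 4. Column names used more than once.
dup_names <- unique(names(df)[duplicated(names(df))])

# 5. Cloned columns and rows (identical content, whatever the name).
cloned_cols <- unname(which(duplicated(as.list(df))))
cloned_rows <- unname(which(duplicated(df)))

# Collect everything in a single list so it can be inspected before analysis.
report <- list(
  space_issues   = space_issues,
  variant_levels = variant_levels,
  empty_rows     = empty_rows,
  dup_names      = dup_names,
  cloned_cols    = cloned_cols,
  cloned_rows    = cloned_rows
)
str(report)
```

Each check returns the offending values or positions rather than fixing them, so the person who supplied the table can correct the source file.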
The problem is that we need a minimum of structure in the table before we can pass it to our statistical software. We previously tried giving our users guidelines for sending us clean data that could be processed directly using, for example, vartors. This improved the quality of the tables, but the remaining errors are more insidious and cost us a lot of time to check and correct.
The main aim of the dfcheck package is to detect and report as many potential errors as possible before the statistical analysis is performed.