dfcheck

dfcheck is an R package whose name stands for Data Frame Check. Its aim is to check data.frames for likely user mistakes, such as misspellings, extra spaces, or cloned columns.

Installation

Stable version

dfcheck is under active development, so there is no real stable version for the moment.

The master branch holds a version that always builds, so you can safely use it. However, this branch does not include the latest features.

To install it from GitHub, install the devtools package and then run this line in R:

devtools::install_github(repo = "jomuller/dfcheck", ref = "master")

Development version

To install the development version from GitHub, install the devtools package and then run this line in R:

devtools::install_github(repo = "jomuller/dfcheck", ref = "dev")

Participate

Bug tracking

You are welcome to open issues on the issue page on GitHub.

Code, documentation

You are welcome to fork the project to improve the code and the documentation. We try to use test-driven development with testthat and to document the code well with roxygen, so features are added relatively slowly.

Release plan

Follow the milestones of this GitHub project to see the release plan.

Motivation

The dfcheck package was created to speed up the boring but important step of checking the data sets that users send us during methodology consultations. Most of the time, we receive Excel files in the XLSX format. We open them with the openxlsx package, hoping to analyze the data right away. But during the analysis, we always run into the same errors:

  • Extra spaces before or after a word, or double spaces, which R then treats as different levels of a categorical variable.
  • Inconsistent spelling of level names in a categorical variable (e.g. "primary", "Primary" or "ecole", "école", "École"), which are also treated as different levels.
  • Empty rows caused by bad table formatting in Excel.
  • Non-rectangular tables, e.g. when the user has already done some calculations at the end of a column.
  • Cloned rows or columns, which users create to simulate frozen panes.
  • Duplicated column names.
  • ... and a lot more (human imagination is infinite).
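
To make a few of these errors concrete, here is a minimal sketch in base R of how some of them could be detected. It is not the dfcheck API: the toy data frame `df` and the helper functions `has_extra_space()`, `case_variants()`, `empty_rows()`, `cloned_columns()` and `duplicated_names()` are hypothetical names made up for this illustration, and the sketch only handles case differences, not accents.

```r
# Hypothetical toy data reproducing a few of the problems listed above.
df <- data.frame(
  school = c("primary", "Primary ", "primary", NA, "secondary"),
  age    = c(7, 8, 7, NA, 12),
  age2   = c(7, 8, 7, NA, 12),   # same content as `age`: a cloned column
  stringsAsFactors = FALSE
)

# TRUE for values with a leading space, a trailing space, or two or more
# consecutive spaces. Missing values are never flagged.
has_extra_space <- function(x) {
  !is.na(x) & grepl("^[[:space:]]|[[:space:]]$|[[:space:]]{2,}", x)
}

# Groups of distinct values that become identical once they are trimmed
# and lower-cased (e.g. "primary" and "Primary ").
case_variants <- function(x) {
  values <- unique(x[!is.na(x)])
  key <- tolower(trimws(values))
  groups <- split(values, key)
  groups[lengths(groups) > 1]
}

# Indices of rows in which every cell is NA or an empty string.
empty_rows <- function(d) {
  m <- as.matrix(d)
  which(rowSums(is.na(m) | m == "") == ncol(m))
}

# Names of columns whose content repeats an earlier column.
cloned_columns <- function(d) {
  names(d)[duplicated(as.list(d))]
}

# Column names that appear more than once.
duplicated_names <- function(d) {
  unique(names(d)[duplicated(names(d))])
}

# Run the checks on the toy data.
has_extra_space(df$school)
case_variants(df$school)
empty_rows(df)
cloned_columns(df)
duplicated_names(df)
```

On this toy data, the trailing space in "Primary " is flagged, "primary" and "Primary " are grouped as variants of the same level, the fourth row is reported as empty, and `age2` is reported as a clone of `age`. Note that `data.frame()` renames duplicated column names by default (`check.names = TRUE`), so `duplicated_names()` is only useful on tables read with that option turned off.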

The problem is that we need a minimum of structure in the tables to be able to feed them to our statistical software. We previously tried giving our users guidelines so that they would send us clean data that could be processed directly using, for example, vartors. This improved the quality of the tables, but the remaining errors are more insidious and still cost us a lot of time to check and correct.

The main aim of the dfcheck package is to detect and report as many potential errors as possible before the statistical analysis is performed.