NOTE: This package has been replaced with a newer version. See here:
https://github.com/ki-tools/kitools-r
and here:
https://ki-tools.github.io/kitools-r/
The kitools R package provides utility functions for setting up Knowledge Integration (Ki) projects and supporting workflows within these projects, such as finding data and publishing results.
This package is in the extreme early development process and is expected to seriously evolve over time. Although hosted publicly on GitHub, it is not of use to the general public - only to those inside Ki.
More documentation will be made available as the package matures.
# install.packages("devtools")
devtools::install_github("ki-tools/kitools")
General guidelines:
- All analysis source code is managed and version controlled using Git and Github, using the "ki-analysis" Github organization.
- All analysis code is open and public on Github. While several Ki datasets are not appropriate to host publicly, code that acts on them can be public.
- Github is for code only. Data and binary artifacts such as figures are not stored in Github. There are good reasons for this, and we have a more appropriate place (Synapse) for these files.
- Core datasets (see definition below) as well as derived data artifacts are stored in your local analysis environment but are not pushed to Github. For provenance and reproducibility, data are synced with Synapse, with utility functions provided by kitools for pulling and pushing data.
- R Markdown documents and Shiny apps are published to the Ki RStudio Connect server and public / private viewing permissions are evaluated on a case-by-case basis. Utility functions and guidance for doing this are (TODO) provided as part of the kitools package. Source code for R Markdown documents and Shiny apps can be tracked in Git and published to Github.
- Support for publishing results using other technologies such as Jupyter notebooks, etc., will be considered when / if demand is sufficient.
This package provides utility functions that you must use to set up a new Ki analysis project. These set up minimal project structure, with the goal of giving you freedom while still maintaining standard behavior where necessary for Ki reproducibility and tracking needs.
- Create an empty directory named for your analysis (see "Naming projects" below) and enter R with this as your working directory
create_analysis()
- Prompts for your Synapse ID, if not already supplied
- Prompts for a title of your project
- Prompts for the ID of the associated Synapse space where data and other analysis outputs will be stored
- Stores these values in a project-level configuration file
- Sets up a minimal directory structure
/R
for analysis R code/data/core
to locally store core datasets pulled from Synapse, pulled withdata_use(id)
(see "Working with data" for more on core vs. discovered vs. derived datasets)/data/discovered/_raw
to locally place raw data files for discovered datasets/data/discovered
to locally place processed discovered datasets that you are ready to publish to Synapse and share with others, published withdata_publish(path)
/data/derived
to locally place datasets derived from core or discovered datasets, typically as the result of an analysis or summarization, also published withdata_publish(path)
/data/scripts
to store R scripts that operate on raw discovered data to create discovered data, or operate on core or discovered data to produce derived data
- Creates a .gitignore file to ignore data/core, data/derived, data/discovered, *.Rdata, *.rda, *.png, *.jpeg, *.pdf, etc.
usethis::use_git()
to intialize a git repository for the analysisusethis::use_github(organisation = "ki-analysis")
to set a GitHub remote atki-tools/__analysis-name__
Names for projects should be descriptive, use lowercase text, and use dashes to separate words. This convention should be used for naming all other files in an analysis as well.
There are three main classes of data types in a Ki analysis:
- Core data: Data that exists in Synapse, probably in its own Synapse project and has ideally gone through some sort of rigorous curation from the data management team. Once pulled into your local environment, you should treat these datasets as read-only.
- Discovered data: These are datasets that you may have found somewhere outside of Ki. These datasets should be treated as read-only. You should register these datasets in the Synapse project associated with your anlaysis, but if later deemed to be generally useful to projects beyond your analysis, they should be considered for additional curation and promotion to a core dataset.
- Derived data: These are datasets that you derive throughout the course of your analysis. Examples include transformations of core datasets, analysis output data objects, tables, plot-ready data, etc. Each derived dataset should have a script that generates it, stored in the "data/scripts" directory or in your "R" code directory.
There are several utility functions for working with data in kitools. The purpose for these data management functions is to help you keep the datasets you are using and producing organized and sharable with others. Each dataset you use or produce can be registered to your project and synced with Synapse, so that others who check out your code can easily get up to speed with. This also helps provide some additional provenance of what datasets are used in what analyses. These utilities also help you separate your code storage on Github from your data storage on Synapse.
Data management functions:
data_use(synapse_id)
: This is used when you need to register a core dataset with your analysis and download a local copy.data_publish(path, ...)
: This is used when you need to register a discovered or derived dataset with your analysis and push it to the appropriate place on Synapse.data_sync()
: This is used when you want to make sure you have a local copy of all datasets that have been associated with your analysis project. This is useful for collaborative environments where someone else might check out your code and wants to pull all data files associated with the project.
Datasets that have been registered with your analysis project are updated in your "project_config.yml" file and you can view this file to see what's registered with your project.
On project creation, an "R" directory is created in which you can add R scripts. Here you are free organize your files however you like, but you are encouraged to use descriptive file names with all lowercase, no spaces, and underscores to separate words, followed with a capital ".R" file extension.
An often-recommended approach to organizing R analysis code is to write your code as an R package. There are benefits to this approach, but we do not want to put too many constraints on project structure. There is nothing stopping you from making your code as a package, as the current project setup only requires a data and an R
folder, a subset of an R package. However,if there are aspects of your analysis code that could be of general use beyond your analysis, you should think of creating separate R packages for those as appropriate. For example, in the CIDACS Brazil analysis, general functionality for transforming DATASUS data was deemed to be generally useful as a separate R package and developed accordingly.
To make it easier to work in a collaborative manner, we strongly recommend adhering to the tidyverse style guide for your analysis code.
If you find yourself in need to develop a Ki-related R package, this package provides utilities to bootstrap your package in a way that conforms with Ki R package development guidance.
Several functions in the usethis package are leveraged for a ki R package setup. If you have not used usethis, some useful setup instructions can be found here.
The following steps are recommended to set up a ki R package:
usethis::create_package()
to create a package (if not already created)usethis::use_git()
to initialize a git repositoryusethis::use_readme_md()
to create a basic README.md file (orusethis::use_readme_rmd()
for R Markdown)usethis::use_testthat()
to set up unit testskitools::use_lintr_test()
to add a unit test for correct code stylekitools::use_apache_license()
to use the Apache 2.0 license (with copyright BMGF)usethis::use_github(organisation = "ki-tools")
to set a GitHub remote atki-tools/__package-name__
usethis::use_tidy_ci()
to set up travis-ci continuous integration and code coverage
To make it easier to work in a collaborative manner, we adhere to the tidyverse style guide.
You should add a unit test to your package that will cause the package check to fail if the code does not meet the style guide. You can set this up with a utility function kitools::use_lintr_test()
.
- Use roxygen to document package functions, data, etc.
- We recommend using pkgdown to create more user-friendly documentation. You can set this up with
usethis::use_pkgdown()
.
Please note that the 'kitools' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms