Design patterns for data projects

When starting new data projects, what's the best way to design your project to minimize growing pains and maximize reuse? This project highlights a few design patterns I've found useful and reusable across a variety of data projects.

Getting started

For an overview, compile the presentation by opening "overview.Rpres" in RStudio.
To learn more about specific design patterns, check out the READMEs in each design pattern directory.

Design patterns

Egg projects. A useful configuration of R packages and RStudio projects.
Parallel reports. The parallelization of code and report writing for improved interactive development.
Green stats. Result sections in knitr. Could save your life!
Merge recode. Authoritative recoder functions.
DRY plots. A small pattern for not repeating yourself when making plots using ggplot.
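
As a rough illustration of the merge recode idea, here is a minimal sketch in base R. It assumes the pattern means defining a variable's coding once, in a small lookup table inside a function, and merging that table into the data. The function name, column names, and codes below are hypothetical and are not taken from this repo; see the README in the design pattern directory for the actual approach.

```r
# Hypothetical recoder: the column names and codes are made up for illustration.
recode_condition <- function(frame) {
  # The map is the single, authoritative definition of how `condition` is coded.
  condition_map <- data.frame(
    condition = c("control", "treatment"),
    condition_label = c("Control", "Treatment"),
    condition_c = c(-0.5, 0.5),
    stringsAsFactors = FALSE
  )
  # Left join: keep every row of `frame`; unmapped conditions get NA codes.
  merge(frame, condition_map, by = "condition", all.x = TRUE)
}

# Made-up example data.
trials <- data.frame(
  subj = c(1, 1, 2, 2),
  condition = c("control", "treatment", "control", "treatment"),
  stringsAsFactors = FALSE
)

recoded <- recode_condition(trials)
```

Because every script calls the same recoder, changing a code means editing one table rather than tracking down scattered `ifelse()` calls. Note that `merge()` sorts rows by the key column by default, so row order may differ from the input.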

Example data projects

The sample projects are stored as git submodules, i.e., they link to other repos. After cloning this repo, initialize and update the submodules:

git submodule init && git submodule update

github-pulse is a project I made up mainly to demonstrate each of the design patterns in action. The project pulls freely available GitHub event data from githubarchive and analyzes it in R. This sample project demonstrates the value of data design patterns in facilitating the growth of a data project from exploratory analyses to final reports.
property-verification is a cognitive psychology experiment set up as a data project. The data in this project can be used to demonstrate how to write parallel reports and DRY plots that grow gracefully from exploratory analyses to final reports.
wikischolar is an ongoing research project that measures changes in Wikipedia article quality over time. The process of obtaining the data is more elaborate and is contained in a Python library, yet the benefits of using the data design patterns are the same.
words-in-transition is a research project on the evolution of language, specifically on the evolution of categorical word forms as a result of repeated imitation of non-verbal sounds, like in the children's game of telephone. The many stages of this research project make it a case study for effective report organization.

Description

Design patterns, in the traditional software development sense, are configurations of program components that solve problems that will likely crop up in the future but may not be obvious at the beginning of a project. Design patterns involve some upfront cost, but they make development easier and more sustainable in the long run by outsourcing design decisions to the design pattern itself. This repo does not contain formal design patterns, but the term captures my philosophy in approaching data projects: that data projects should be structured in a way that makes them reproducible and reusable while allowing them to grow smoothly from initial hypotheses to publication-ready results.

Data projects lie somewhere between the analysis of a single data set and continuous analytics pipelines (big data). Data projects are extremely important for scientific experiments and empirical analysis. I believe that all experiments can and should be implemented as data projects to facilitate reproducibility and replicability. From a developer's perspective, data projects and data design patterns allow for agile data science, where iteration and incremental development are key.

History

July 20, 2016: Madison R Users Group http://www.meetup.com/MadR-Madison-R-Programming-UseRs-Group/events/230575960/
Sept. 16, 2016: Curtin Addiction Research Lab
