Adapted from the Reproducible Science Curriculum
Special Thanks: Naupaka Zimmerman, Francois Michonneau, Hilmar Lapp, Karen Cranston, Jenny Bryan, and everyone else who created these materials.
Reproducibility is actually all about being as lazy as possible!
-- Hadley Wickham (via Twitter, 2015-05-03)
More efficient, less redundant science: others can build upon our work.
Reproducibility spectrum for published research. Source: Peng, R. D. "Reproducible Research in Computational Science." Science (2011): 1226–1227, via Reproducible Science Curriculum.
Five selfish reasons to work reproducibly - Florian Markowetz
- Reproducibility helps to avoid disaster
- Reproducibility makes it easier to write papers
- Reproducibility helps reviewers see it your way
- Reproducibility enables continuity of your work
- Reproducibility helps to build your reputation
For research to be reproducible, the research products (data, code) need to be publicly available in a form that people can find and understand.
- Collaborators
- Peer reviewers & journal editors
- Broad scientific community
- The public
Figure 1. Distribution of reporting errors per paper. Papers from which data were shared have fewer errors.
- GitHub: Version Control / Collaboration / Dissemination
- R Markdown or Jupyter Notebooks: Code Documentation / Dissemination
Over 2-3 hours we will focus on the tools and skills associated with these facets.
- Organization
- Automation
- Documentation
- Dissemination
The more self-explanatory, the better:
- Consider overall structure of folders and files
- Use informative file names
Noble, William Stafford, 2009. A quick guide to organizing computational biology projects.
A variable name that describes the object is more useful than an arbitrary one.
File Organization should:
- Reflect inputs, outputs and information flow
- Preserve raw data so it's not modified
- Carefully document & store intermediate & end outputs
- Carefully document & store data processing scripts
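As a rough illustration of the points above, the sketch below builds one possible project skeleton and marks a raw data file read-only. The folder names, the `create_project` and `protect_raw_file` helpers, and the `site_counts.csv` file are all hypothetical; they are not part of the curriculum, and your own layout may differ.

```python
import stat
import tempfile
from pathlib import Path

# Hypothetical folder names; adapt them to your own project.
PROJECT_DIRS = ["data/raw", "data/processed", "scripts", "outputs/figures", "docs"]


def create_project(root):
    """Create the project skeleton under `root` and return the folders made."""
    created = []
    for name in PROJECT_DIRS:
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def protect_raw_file(path):
    """Remove write permission so the raw file is not edited by accident."""
    Path(path).chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_project"
    create_project(root)
    raw = root / "data" / "raw" / "site_counts.csv"  # made-up example file
    raw.write_text("site,count\nA,3\n")
    protect_raw_file(raw)
    for path in sorted(root.rglob("*")):
        print(path.relative_to(root).as_posix())
```

Keeping raw data in its own folder and writing every derived file somewhere else means the inputs, outputs and direction of information flow can be read straight from the directory tree.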
File / Folder Names should be:
- Machine readable
- Human readable
- Sortable
More on file naming & organization
-- from the Reproducible Science Curriculum
- Your future self will be able to quickly find files
- Colleagues will be able to more quickly understand your workflow
- Machine readable names can be quickly and easily sorted and parsed
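One way to make names both sortable and parseable is to agree on a fixed pattern. The sketch below assumes a made-up convention, `<ISO date>_<project>_<step>-<description>.<ext>`, with a hypothetical project called `phenology`; the `make_name` and `parse_name` helpers are illustrative only.

```python
import re
from datetime import date

# Hypothetical convention: <ISO date>_<project>_<step>-<description>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<project>[a-z0-9-]+)_"
    r"(?P<step>\d{2})-(?P<description>[a-z0-9-]+)\.(?P<ext>[a-z0-9]+)$"
)


def make_name(day, project, step, description, ext):
    """Build a name with no spaces, an ISO 8601 date and a zero-padded step."""
    return f"{day.isoformat()}_{project}_{step:02d}-{description}.{ext}"


def parse_name(filename):
    """Split a file name into its parts; raise ValueError if it does not match."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.groupdict()


names = [
    make_name(date(2019, 10, 2), "phenology", 2, "clean-data", "csv"),
    make_name(date(2019, 9, 15), "phenology", 10, "fit-model", "py"),
    make_name(date(2019, 9, 15), "phenology", 1, "download", "py"),
]

# Plain alphabetical sorting now matches chronological and step order.
for name in sorted(names):
    print(name, parse_name(name))
```

ISO 8601 dates and zero-padded step numbers are what make alphabetical order agree with the intended order; underscores separate fields and hyphens separate words, so a single regular expression can recover every part.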
Scripting vs. Point and click
Scripting takes more time up front but saves time in the long run.
Time Savings:
- More efficient to modify and repeat an analysis down the road
- Easier for reviewers and colleagues to see every aspect of your methods
- Self-documenting methods: your future self will forget small steps
DRY -- Don't Repeat Yourself
If your analysis is composed of scripts with the same code repeated throughout, it will be more time-consuming to maintain and update.
Reproducible Science Curriculum - Automation
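A minimal sketch of the DRY idea, using made-up temperature data for three hypothetical sites. Instead of copying the same "read, convert, average" lines once per site, the calculation lives in one illustrative function, `mean_temperature`, which is applied in a loop.

```python
import csv
import io

# Made-up CSV text standing in for three input files.
SITE_FILES = {
    "site_a": "temp_c\n10.0\n12.0\n",
    "site_b": "temp_c\n20.0\n22.0\n",
    "site_c": "temp_c\n15.0\n15.0\n",
}


def mean_temperature(csv_text, column="temp_c"):
    """Return the mean of one numeric column in a CSV string."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


# One loop replaces three copy-pasted blocks, so a fix to
# mean_temperature applies to every site at once.
site_means = {site: mean_temperature(text) for site, text in SITE_FILES.items()}
print(site_means)
```

Adding a fourth site means adding one entry to the input collection, not another copy of the analysis code.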
Modularity -- use functions to write code in reusable chunks
- Variables created within a function are temporary
- Code with functions can be easier to read / cleaner
- Allows for better documentation
- Supports testing
- Allows for re-use of code on other data
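The points above can be sketched with a single small function. `standardize` and `test_standardize` are hypothetical names chosen for illustration, and the input numbers are made up.

```python
def standardize(values):
    """Rescale numbers to mean 0 and standard deviation 1 (population SD)."""
    n = len(values)
    mean = sum(values) / n  # `mean` exists only while the function runs
    variance = sum((v - mean) ** 2 for v in values) / n
    sd = variance ** 0.5
    if sd == 0:
        raise ValueError("cannot standardize constant values")
    return [(v - mean) / sd for v in values]


def test_standardize():
    """A small check that can be re-run whenever the function changes."""
    result = standardize([2.0, 4.0, 6.0])
    assert abs(sum(result)) < 1e-9  # results are centered on 0
    assert abs(result[1]) < 1e-9  # the middle value sits at the mean
    assert abs(result[0] + result[2]) < 1e-9  # outer values are symmetric


test_standardize()

# The same function is reused, unchanged, on a different made-up data set.
heights_cm = [150.0, 160.0, 170.0, 180.0]
print(standardize(heights_cm))
```

Because `mean`, `variance` and `sd` live only inside the function, they cannot collide with variables elsewhere in the analysis, and the test documents what the function is expected to do.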
Document all workflow steps:
- You can remind your future self of your workflow
- Others can see and understand your work
- Future "re-analysis" of your data is more efficient
Code should be easy to understand with clear goals
Document your code even if you think it's clear and simple. Your collaborators & your future self will inevitably have an easier time working with it down the road.
Add comments around functions that describe purpose, inputs and outputs.
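For example, a Python docstring can state the purpose, inputs and outputs of a function. The function below is a made-up illustration, and the section layout shown is just one common docstring style, not a requirement.

```python
def celsius_to_fahrenheit(temp_c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit.

    Parameters
    ----------
    temp_c : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in degrees Fahrenheit.
    """
    return temp_c * 9 / 5 + 32


# The docstring is stored on the function, so the documentation
# travels with the code.
print(celsius_to_fahrenheit.__doc__)
print(celsius_to_fahrenheit(100))
```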
Avoid proprietary formats: Use text files (.txt, .md) that don't require special tools to open.
Markdown to style documentation = machine readable, small file size, low overhead.
Use coding approaches that connect data cleaning, analysis & results
R Markdown allows you to publish code and results in one (or more) output files.
Publishing is not the end of your analysis; rather, it is a step toward your future research and the future research of others.
- Funding agency / journal requirement
- Community expects it
- Increased visibility / citation
- More efficient, less redundant science
Example Workflow / Tools:
- Document workflow: R Markdown / Matlab
- Collaborate with Colleagues / Version Control: GitHub
- Publish Data Snapshot: FigShare, Dryad, Zenodo, etc.
- Share workflow: Binder
- Archive data
- Share link to DOIs
- Documentation: RStudio, GitHub
- Organization: File naming / directory structure best practices
- Automation: Efficient Coding Practices
- Dissemination: GitHub, Data Archives, DOIs
Email: katharyn.duffy@nau.edu