GitHub - katharynduffy/ECOSS_reproducible_science: A basic introduction to the principles of version control and reproducible data science

Intro to Reproducible Science

Adapted from the Reproducible Science Curriculum

Special Thanks: Naupaka Zimmerman, Francois Michonneau, Hilmar Lapp, Karen Cranston, Jenny Bryan, and everyone else who created these materials.

Reproducibility is actually all about being as lazy as possible!

-- Hadley Wickham (via Twitter, 2015-05-03)

:)

Why Use Reproducible Methods?

More efficient, less redundant science: others can build upon our work.

Reproducibility & Your research

Reproducibility spectrum for published research. Source: Peng, R. D. (2011). Reproducible Research in Computational Science. Science: 1226–1227, via the Reproducible Science Curriculum.

Five selfish reasons to work reproducibly - Florian Markowetz

Reproducibility helps to avoid disaster
Reproducibility makes it easier to write papers
Reproducibility helps reviewers see it your way
Reproducibility enables continuity of your work
Reproducibility helps to build your reputation

How to Make Work Reproducible

For research to be reproducible, the research products (data, code) need to be publicly available in a form that people can find and understand.

Who do we need to share with?

Collaborators
Peer reviewers & journal editors
Broad scientific community
The public

Better Research

Figure 1. Distribution of reporting errors per paper. Papers from which data were shared had fewer errors.

Wicherts et al. (2011). Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results.

The tools that make this easier

GitHub: Version Control / Collaboration / Dissemination
R Markdown or Jupyter Notebooks: Code Documentation / Dissemination

Four Facets of Reproducibility

Over 2-3 hours we will focus on the tools and skills associated with these facets.

Organization
Automation
Documentation
Dissemination

1. Organization

The more self-explanatory, the better:

Consider overall structure of folders and files
Use informative file names

1. Organization Pro-Tip

A variable name that describes the object it refers to is more useful than an arbitrary name.

![](http://journals.plos.org/ploscompbiol/article/figure/image?size=large&id=info:doi/10.1371/journal.pcbi.1000424.g001)

Noble, William Stafford, 2009. A quick guide to organizing computational biology projects.

1. Organization

File Organization should:

Reflect inputs, outputs and information flow
Preserve raw data so it's not modified
Carefully document & store intermediate & end outputs
Carefully document & store data processing scripts
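One way to put these points into practice is to create the same folder skeleton for every project and to make raw data files read-only. The sketch below is only an illustration: the folder names and the helper functions `create_project` and `protect_raw_file` are assumptions, not part of the original materials.

```python
from pathlib import Path
import stat

# Hypothetical folder names; adapt them to your own project.
PROJECT_FOLDERS = [
    "data/raw",         # original inputs, never edited by hand
    "data/processed",   # intermediate outputs written by scripts
    "scripts",          # data processing and analysis code
    "outputs/figures",  # end products
    "docs",             # notes and documentation
]


def create_project(root):
    """Create the folder skeleton under `root` and return the created paths."""
    root = Path(root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def protect_raw_file(path):
    """Remove write permission so a raw data file is not modified by accident."""
    path = Path(path)
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

Scripts then read from `data/raw` and write only to `data/processed` or `outputs`, so the flow of information is visible from the folder layout itself.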

Organization

1. Organization -- File Names

File / Folder Names should be:

Machine readable
Human readable
Support sorting
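A minimal sketch of a naming scheme that meets all three criteria, assuming a `date_project_description.extension` pattern. The pattern and the helpers `make_file_name` and `parse_file_name` are hypothetical; any consistent scheme works.

```python
import re
from datetime import date


def make_file_name(day, project, description, extension):
    """Build a name such as 2015-05-03_ecoss_soil-moisture.csv."""
    def slug(text):
        # Lowercase, and turn every run of non-alphanumeric characters into a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # The ISO 8601 date comes first so alphabetical order is chronological order.
    return f"{day.isoformat()}_{slug(project)}_{slug(description)}.{extension}"


def parse_file_name(name):
    """Split a name produced by make_file_name back into its parts."""
    stem, extension = name.rsplit(".", 1)
    day, project, description = stem.split("_")
    return {
        "date": date.fromisoformat(day),
        "project": project,
        "description": description,
        "extension": extension,
    }
```

Underscores separate the fields and hyphens separate words within a field, so a script can split the name reliably while a person can still read it.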

Which set of file names is most self-explanatory?

1. Organization - Benefits

Your future self will be able to quickly find files
Colleagues will be able to more quickly understand your workflow
Machine readable names can be quickly and easily sorted and parsed

2. Automation Tip

Scripting vs. Point and click

Script = more time spent up front, but will save time in the long run.

2. Automation Tip

Time Savings:

More efficient to modify and repeat an analysis down the road
Easier for reviewers and colleagues to see every aspect of your methods
Self documenting methods - your future self will forget small steps

2. Automation

DRY -- Don't Repeat Yourself

If your analysis consists of scripts with the same code repeated throughout, it will take more time to maintain and update.
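As a small illustration of DRY, the sketch below summarizes several sites with one function and one loop rather than a copy-pasted block per site. The site names, the numbers, and the `summarize` helper are made up for this example.

```python
from statistics import mean

# Hypothetical measurements for three field sites.
measurements = {
    "site_a": [12.1, 13.4, 11.8],
    "site_b": [9.7, 10.2, 10.9],
    "site_c": [14.0, 13.6, 15.1],
}


def summarize(values):
    """Return the mean and the range of one list of measurements."""
    return {
        "mean": round(mean(values), 2),
        "range": round(max(values) - min(values), 2),
    }


# One loop replaces three copy-pasted blocks; a change to summarize()
# applies to every site at once.
summaries = {}
for site, values in measurements.items():
    summaries[site] = summarize(values)
```

Adding a fourth site means adding one entry to `measurements`, not another copy of the analysis code.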

Reproducible Science Curriculum - Automation

Automation Tips

Modularity -- use functions to write code in reusable chunks

Variables created within a function are temporary
Code with functions can be easier to read / cleaner
Allows for better documentation
Supports testing
Allows for re-use of code on other data
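The points above can be sketched with a single small function. The temperature conversion is only a stand-in for a real analysis step, and the function and test names are hypothetical.

```python
def celsius_to_fahrenheit(temps_c):
    """Convert a list of temperatures from Celsius to Fahrenheit."""
    # `converted` is local: it exists only while the function runs.
    converted = [t * 9 / 5 + 32 for t in temps_c]
    return converted


def test_celsius_to_fahrenheit():
    """A small test that documents and checks the expected behavior."""
    assert celsius_to_fahrenheit([0, 100]) == [32.0, 212.0]
    assert celsius_to_fahrenheit([]) == []


test_celsius_to_fahrenheit()

# The same function is re-used on other data without copying any code.
spring_f = celsius_to_fahrenheit([10, 15])
winter_f = celsius_to_fahrenheit([5, -40])
```

Because the conversion lives in one place, it can be tested once and trusted everywhere it is called.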

3. Documentation

Document all workflow steps:

You can remind your future self of your workflow
Others can see and understand your work
Future "re-analysis" of your data is more efficient

Documentation

Code should be easy to understand with clear goals

Document your code even if you think it's clear and simple. Your collaborators & your future self will inevitably have an easier time working with it down the road.

Documentation Tip 1

Add comments around functions that describe purpose, inputs and outputs.
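For example, a Python docstring can state the purpose, inputs and outputs in one place. The function `rescale_to_unit_range` below is a hypothetical example chosen for illustration; the section headings inside the docstring are one possible convention, not a required one.

```python
def rescale_to_unit_range(values):
    """Rescale numbers linearly so the smallest becomes 0 and the largest 1.

    Purpose:
        Put measurements on a common 0-1 scale before comparing them.
    Inputs:
        values: a non-empty list of numbers with at least two distinct values.
    Outputs:
        A new list of floats with the same length and order as `values`.
        The input list is not modified.
    Raises:
        ValueError: if `values` is empty or all values are equal.
    """
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("values must contain at least two distinct numbers")
    return [(v - low) / (high - low) for v in values]
```

Calling `help(rescale_to_unit_range)` in an interactive session prints this description, so the documentation travels with the code.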

Documentation Tip 2

Avoid proprietary formats: Use text files (.txt, .md) that don't require special tools to open.

Documentation Tip 3

Markdown to style documentation = machine readable, small file size, low overhead.

Documentation Tip 4

Use coding approaches that connect data cleaning, analysis & results

R Markdown allows you to publish code and results in one (or more) output files.

Dissemination

Publishing is not the end of your analysis; rather, it is a step toward your future research and the future research of others.

Dissemination - Why

Funding agency / journal requirement
Community expects it
Increased visibility / citation
More efficient, less redundant science

Dissemination workflow

Example Workflow / Tools:

Document workflow: R Markdown / Matlab
Collaborate with Colleagues / Version Control : GitHub
Publish Data Snapshot: FigShare, Dryad, Zenodo, etc
Share workflow: Binder
Archive data
Share link to DOIs

Four Facets of Reproducibility

Documentation: RStudio, GitHub
Organization: File naming / directory structure best practices
Automation: Efficient Coding Practices
Dissemination: GitHub, Data Archives, DOIs

Questions?

Email: katharyn.duffy@nau.edu
