# Lab Meeting 
## February 6, 2017


## Reproducibility Workshops

In collaboration with [Data Carpentry](http://www.datacarpentry.org/)

**R** using RStudio: [http://www.datacarpentry.org//rr-workshop/](http://www.datacarpentry.org//rr-workshop/)

**Python** using Jupyter Notebook: *Coming soon!*

## Reproducibility matters


Lack of reproducibility in science causes significant issues:

-  For science as an enterprise
-  For other researchers in the community
-  For oneself as a researcher




## Reproducibility = Accelerating science, including your own

<img src="http://i.imgur.com/Q8kV8.png" style="width: 75%; height: 75%"/>

Source: Bruno Oliveira



## Challenges

**Dependency hell**: Software packages require different dependencies. Any dependency can fail to install or conflict with another package's requirements, and the exact versions installed can affect the results.

**Documentation gaps**: Undocumented code can be very difficult to understand. Documentation gaps and errors may be harmless for experts, but are often fatal for “method novices”.

**Unpredictable evolution**: Scientific software evolves constantly, often in drastic rather than incremental ways. As a result, parameters can change in unpredictable ways and cause code to fail.
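
One modest defence against the challenges above is to record which versions an analysis actually ran with. The sketch below is only an illustration using Python's standard library. The function name `snapshot_environment` and the package names passed to it are assumptions made for this example, not part of the workshop material.

```python
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict describing the interpreter and the versions of `packages`."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "pip" is only a placeholder; list the packages your analysis imports.
    print(snapshot_environment(["pip", "a-package-that-is-not-installed"]))
```

Saving this dictionary next to the results gives a later reader a starting point for rebuilding the same environment.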

## The Five Facets of Reproducibility:

1. **Organization**: tools to organize your projects so that you don’t have a single folder with hundreds of files.

2. **Automation**: the power of scripting to create automated data analyses.


## The Five Facets of Reproducibility:

3. **Dissemination**: publishing is not the end of your analysis; rather, it is a way station towards your future research and the future research of others.

4. **Documentation and Literate Programming**: note the difference between binary files (e.g. docx) and .txt files and why text files are preferred for documentation.

5. **Version Control**: Git and GitHub


## Organization

Read Noble (15 minutes): [A Quick Guide to Organizing Computational Biology Projects](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424) 

![](http://journals.plos.org/ploscompbiol/article/figure/image?size=large&id=10.1371/journal.pcbi.1000424.g001)



## Organization

**data-raw**: the original data. You shouldn't edit or otherwise alter any of the files in this folder.

**data-output**: intermediate datasets that will be generated by the analysis.
	
-	We write them to CSV files so we can share or archive them, for example if they take a long time (or expensive resources) to generate.

**fig**: the folder where we can store the figures used in the manuscript.

**R**: our R code (the functions)

-  It is often easier to keep the prose separate from the code. If you have a lot of code (and/or the manuscript is long), this makes the project easier to navigate.

**tests**: the code to test that our functions are behaving properly and that all our data is included in the analysis.
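
The layout above can be created by a script rather than by hand, so every new project starts the same way. This is a minimal sketch in Python. The folder names come from the list above, while the helper name `create_project_skeleton` and the project name `my-project` are hypothetical.

```python
import tempfile
from pathlib import Path

# Folder names taken from the layout described above.
PROJECT_FOLDERS = ["data-raw", "data-output", "fig", "R", "tests"]


def create_project_skeleton(root):
    """Create the project folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project_skeleton(Path(tmp) / "my-project"):
            print(folder.name)
```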




## Automation

Students will learn how to restructure their scripts (typically written in R) so that the code is modularized. They will learn to define and use functions. Students will be able to write code that automates their build and write tests for their code. They will be briefly introduced to continuous integration tools.
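
As a rough illustration of what "modularized and tested" can look like, here is a small Python sketch. The workshop itself uses R; the function names, column names, and toy data below are all invented for this example.

```python
import csv
import io


def read_measurements(handle):
    """Parse CSV rows from an open text file into a list of dicts."""
    return list(csv.DictReader(handle))


def mean_by_group(rows, group_key, value_key):
    """Return {group: mean of value_key} for the given rows."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[group_key], (0.0, 0))
        totals[row[group_key]] = (total + float(row[value_key]), count + 1)
    return {group: total / count for group, (total, count) in totals.items()}


def test_mean_by_group():
    """Known input, known answer, and a check that no rows were dropped."""
    raw = io.StringIO("species,weight\na,1.0\na,3.0\nb,10.0\n")
    rows = read_measurements(raw)
    assert len(rows) == 3  # all data included in the analysis
    assert mean_by_group(rows, "species", "weight") == {"a": 2.0, "b": 10.0}


if __name__ == "__main__":
    test_mean_by_group()
    print("all tests passed")
```

Because each step is a function, the same test can be re-run automatically whenever the code changes, which is the idea behind continuous integration.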



## Dissemination - Sharing, publishing, archiving

Why share / archive data & code?

- funding agency / journal requirement
- community expects it
- increased visibility / citation
- better research

Choose a license: [https://choosealicense.com/](https://choosealicense.com/)



## Dissemination - Sharing, publishing, archiving

- Is there a domain-specific repository?
- What are the backup & replication policies?
- Is there a plan for long-term preservation?
- Can people find your materials?
- Is it citable? (Does it provide DOIs?)
- Is your purpose archival, sharing or publication?



## Documentation

**tip 1**: Use markdown (or plain text) to document your workflow so that anyone can pick up your data and follow what you are doing

**tip 2**: Use literate programming so that your analysis and your results are tightly connected, or better yet, inseparable

Literate programming Slide Show: [http://htmlpreview.github.io/?https://raw.githubusercontent.com/Reproducible-Science-Curriculum/rr-literate-programming/master/02-literate-programming-slides.html#1](http://htmlpreview.github.io/?https://raw.githubusercontent.com/Reproducible-Science-Curriculum/rr-literate-programming/master/02-literate-programming-slides.html#1)




## Literate Programming

How to

- organize your work?
- make work more pleasant for yourself? (less tedium, less manual, less …)
- reduce friction for collaboration?
- reduce friction for communication?
- make your work navigable, interpretable, and repeatable by others?

A lot of this can be built into the normal coding and analysis process by using specific tools and habits.





## Literate Programming


Integrated Development Environments (IDEs)

-  Jupyter/IPython Notebooks
-  RStudio

Ability to share content.



## Markdown

Markdown enables fast publishing to the web

Markdown: Easy to write and read in an editor

HTML: Easy to publish and read on the web




## Version Control


![http://i.imgur.com/rigioF2.jpg](http://i.imgur.com/rigioF2.jpg)
  



## Version Control 

**What is Version Control?**

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.
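
To make the definition concrete, here is a deliberately tiny toy in Python. It is not how Git works internally; the class `SnapshotStore` and its methods are invented purely to show "record changes, recall a specific version later".

```python
class SnapshotStore:
    """Toy history: every commit stores a full copy of the tracked files."""

    def __init__(self):
        self._snapshots = []  # list of (message, {filename: content})

    def commit(self, files, message):
        """Record the current state of `files` and return its version number."""
        # Copy the dict so later edits to `files` do not alter stored history.
        self._snapshots.append((message, dict(files)))
        return len(self._snapshots) - 1

    def checkout(self, version):
        """Return a copy of the files exactly as they were at `version`."""
        return dict(self._snapshots[version][1])

    def log(self):
        """Return (version, message) pairs, oldest first."""
        return [(i, message) for i, (message, _) in enumerate(self._snapshots)]


if __name__ == "__main__":
    store = SnapshotStore()
    files = {"manuscript.md": "First draft"}
    first = store.commit(files, "first draft")
    files["manuscript.md"] = "Second draft"
    store.commit(files, "revise introduction")
    print(store.log())
    print(store.checkout(first))
```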



## Version Control 

How do you record the history of your projects?



## Version Control 

![](http://the9gag.com/images/pictuers/finaldoc.jpg)



## Version Control - Good

	2013-10-14_manuscriptFish.doc
	2013-10-30_manuscriptFish.doc
	2013-11-05_manuscriptFish_initialRyanEdits.doc
	2013-11-10_manuscriptFish.doc
	2013-11-11_manuscriptFish.doc
	2013-11-15_manuscriptFish.doc
	2013-11-30_manuscriptFish.doc
	2013-12-01_manuscriptFish.doc 
	2013-12-02_manuscriptFish_PNASsubmitted.doc
	2014-01-03_manuscriptFish_PLOSsubmitted.doc
	2014-02-15_manuscriptFish_PLOSrevision.doc
	2014-03-14_manuscriptFish_PLOSpublished.doc



## Version Control - Better

Saving everything together at once.

Every time you save, you zip the entire directory and name the archive with the date.
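
The dated-zip approach can itself be scripted. The sketch below uses Python's standard `shutil.make_archive`; the function name `archive_project` and the example paths are assumptions for illustration. Keep the backup folder outside the project folder so the archive does not try to include itself. Note that a second archive made on the same day overwrites the first.

```python
import datetime
import shutil
import tempfile
from pathlib import Path


def archive_project(project_dir, backup_dir, today=None):
    """Zip `project_dir` into `backup_dir` as YYYY-MM-DD_<name>.zip; return the path."""
    project_dir = Path(project_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    today = today or datetime.date.today()
    base_name = backup_dir / f"{today.isoformat()}_{project_dir.name}"
    # make_archive appends ".zip" itself and returns the archive path as a string.
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=str(project_dir)))


if __name__ == "__main__":
    # Demonstrate on a throwaway project in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "manuscriptFish"
        project.mkdir()
        (project / "notes.txt").write_text("draft")
        print(archive_project(project, Path(tmp) / "backups").name)
```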



## Version Control - Best

The full history of the project is recorded.  

The whole directory under version control can be brought back to any state it was in from the start of the project.  

[Fully reproducible manuscript with code](https://github.com/richfitz/wood) (although overkill)



## Why use Git?

- Makes you fearless
- Easy to set up
- Allows you to take a snapshot of every stage of your project history
- Takes up minimal space
- Creates an easily navigable map of the history of all changes made




## GitHub

Benefits of using a hosting service like GitHub

- Backup of your project
- No need for a server: easy to set up
- GitHub's strong community: your colleagues are probably already there
- Provides tools to help enhance collaboration
- A common location to show off your work



## Modify Programmatically

Strive to never change anything manually.
Avoid steps that make it hard to explain to others what you did.

- e.g. taking data out of R and putting it in Excel to copy, paste, and rearrange.

Start small.

If you can't solve a problem, start a list of what you couldn’t do. Learn how when you have more time.
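
As a small example of replacing a copy/paste step with code, the sketch below reorders columns and sorts rows of a CSV using only Python's standard library. The function name `rearrange` and the column names are made up for illustration. Sorting here compares values as text, so numeric columns would need converting first.

```python
import csv
import io


def rearrange(in_handle, out_handle, columns, sort_by):
    """Write the chosen `columns`, in that order, with rows sorted by `sort_by`."""
    rows = sorted(csv.DictReader(in_handle), key=lambda row: row[sort_by])
    writer = csv.DictWriter(
        out_handle,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop columns we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    raw = io.StringIO("weight,species,site\n3.2,b,north\n1.5,a,south\n")
    tidy = io.StringIO()
    rearrange(raw, tidy, columns=["species", "weight"], sort_by="species")
    print(tidy.getvalue())
```

Because the steps live in a script, anyone can see exactly what was done to the data and repeat it.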





## Resources

1. Version Control Slide Show: [http://reproducible-science-curriculum.github.io/2015-06-01-reproducible-science-idigbio/vcs-slides/01-motivation-slides.html#/](http://reproducible-science-curriculum.github.io/2015-06-01-reproducible-science-idigbio/vcs-slides/01-motivation-slides.html#/)

2. Software Carpentry Version Control lesson:

3. Two hour mini-lesson based on Reproducible Science Curriculum material: [http://reproducible-science-curriculum.github.io/cbb-retreat/](http://reproducible-science-curriculum.github.io/cbb-retreat/)

4. Link to Eisen Lab Github:

5. Practice Git on a basic: 