# Part II: Introduction to workflow management and computational reproducibility

## Workflow Management 

In the middle of working through Part 1 of this week's assignment, you may have asked yourself the following questions: 

1. *Where* did I get this data file? 
2. *Why* did I drop those samples? 
3. *How* did I make that figure? In an inspired moment of EDA, I created an interesting graph, but I can't remember how I did it...
4. *Why* is my script giving an error now? :( It was working last week.

*Examples from [Karl Broman](https://www.biostat.wisc.edu/~kbroman/presentations/repro_research_JSM2016_withnotes.pdf).*

Good workflow management prevents most of the problems above. It also moves us a step closer towards computational reproducibility.

### Some simple recommendations for workflow management 

More explanation can be found on [Karl's website](https://kbroman.org/steps2rr/pages/organize.html).

1. Folder organisation: separate data and code. For data, separate `raw` data from `derived` data, with a script detailing the steps used to get from `raw` to `derived`. For code, separate scripts from notebooks. Notebooks are for iterating, but the final analyses can be captured more formally in a script or set of Python modules.

2. Record everything in a script. This includes steps for converting data files, cleaning data and analyzing data.

3. If you plan on running an analysis multiple times with different datasets, make it a one-button operation with automation tools like `GNU Make`.

4. Turn repeated code into functions. Turn repeated functions into packages. 

5. `git commit` often. Nuff said. 
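
As a rough illustration of recommendations 1, 2 and 4, the step from `raw` to `derived` data can live in a small function rather than in a notebook cell. This is only a sketch: the `data/raw` and `data/derived` folder names, the `orders.csv` file name and the `drop_incomplete_rows` rule are assumptions made for the example, not part of the assignment.

```python
import csv
from pathlib import Path

# Hypothetical layout: raw files are never edited, derived files are regenerated.
RAW_DIR = Path("data/raw")
DERIVED_DIR = Path("data/derived")


def drop_incomplete_rows(rows):
    """Keep only rows in which every field is present and non-empty."""
    return [
        row
        for row in rows
        if all(value is not None and str(value).strip() for value in row.values())
    ]


def make_derived(raw_path, derived_path):
    """Read a raw CSV, clean it, write the derived CSV, return rows dropped."""
    with open(raw_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    cleaned = drop_incomplete_rows(rows)

    # Create the derived folder on demand so the script works on a fresh clone.
    Path(derived_path).parent.mkdir(parents=True, exist_ok=True)
    with open(derived_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)
    return len(rows) - len(cleaned)
```

Calling `make_derived(RAW_DIR / "orders.csv", DERIVED_DIR / "orders_clean.csv")` would then answer "why did I drop those samples?" in code: the rule is written down once and the raw file is left untouched.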

### Deliverable #2: 

For the ecommerce dataset we have been using, create a script or Python library that can run all the necessary data cleaning and feature engineering steps in a reusable manner, for example with a single command:

`python run.py`
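
One possible shape for such a `run.py` is an ordered list of step functions behind a small command-line entry point. The sketch below is not a reference solution: the `price` and `quantity` column names, the cleaning rule, the `total` feature and the default file paths are all hypothetical and should be replaced with whatever your own analysis needs.

```python
import argparse
import csv


def clean(rows):
    """Drop rows whose price is missing or not a positive number (hypothetical rule)."""
    kept = []
    for row in rows:
        try:
            price = float(row["price"])
        except (KeyError, TypeError, ValueError):
            continue
        if price > 0:
            kept.append({**row, "price": price})
    return kept


def add_features(rows):
    """Add an illustrative `total` column: price times quantity."""
    return [{**row, "total": row["price"] * int(row["quantity"])} for row in rows]


# Steps run in this order, so the full sequence is recorded in one place.
STEPS = [clean, add_features]


def run_pipeline(rows, steps=STEPS):
    """Apply each step in turn and return the transformed rows."""
    for step in steps:
        rows = step(rows)
    return rows


def main(argv=None):
    parser = argparse.ArgumentParser(description="Clean data and build features.")
    parser.add_argument("--input", default="data/raw/orders.csv")   # assumed path
    parser.add_argument("--output", default="features.csv")         # assumed path
    args = parser.parse_args(argv)

    with open(args.input, newline="") as f:
        rows = run_pipeline(list(csv.DictReader(f)))
    if rows:
        with open(args.output, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return len(rows)
```

Adding the usual `if __name__ == "__main__": main()` guard at the bottom of the file makes `python run.py` work, and because `run_pipeline` takes plain rows it can also be imported and reused from a notebook.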

## Computational Reproducibility 

A research project is computationally reproducible if a second investigator (including you in the future) can recreate the final reported results of the project, including key quantitative findings, tables, and figures, given only a set of files and written instructions. [source](https://www.practicereproducibleresearch.org)

Reproducibility is an indicator of your project's quality and trustworthiness. However, making a project computationally reproducible can introduce more overhead than you would like, and that overhead eats into time you could be spending on improving your model. Hence, we don't mandate that you practice full computational reproducibility now. Rather, we want to introduce you to the idea so that when the time comes to hand over a project or analysis, you will have some rules of thumb to work from.

The rules of thumb below are adapted from [Ten Simple Rules for Reproducible Computational Research](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3812051/):

1. For every result, keep track of how it was produced. This includes the full sequence of pre- and post-processing steps. Scripts are handy here. Within scripts, informative function names and good indentation are useful.

2. Avoid manual data manipulation steps. Use scripts and functions instead of, say, manually converting a CSV file in Excel.

3. Archive the exact versions of all external programs used. This includes the versions of the packages you installed with `pip` and `conda`.

4. Version control all custom scripts.

5. Record all intermediate results, when possible in a standardised format.

6. For analyses that include randomness, note underlying random seeds. 

7. <mark>Always store raw data behind plots. </mark>

8. Generate hierarchical analysis output, allowing layers of increasing detail to be inspected.

9. Connect your experimental results with a textual explanation. This will allow you to connect your findings with existing theories, interpretations or hypotheses you are exploring.

10. Provide public access to scripts, runs and results. 
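
Rules 6 and 7 are cheap to follow in practice. The sketch below uses only the standard library and assumes a made-up `simulate` function standing in for any analysis step that involves randomness; the seed value and the `figure1_data.csv` / `figure1_meta.json` file names are likewise illustrative.

```python
import csv
import json
import random
from pathlib import Path

SEED = 20240101  # arbitrary value; what matters is that it is written down


def simulate(seed, n=5):
    """Draw n pseudo-random values from an explicitly seeded generator."""
    rng = random.Random(seed)  # local generator, independent of global state
    return [rng.random() for _ in range(n)]


def save_plot_data(values, seed, out_dir):
    """Write the numbers behind a figure, plus the seed that produced them."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Rule 7: the data behind the plot, in a plain standard format.
    with open(out_dir / "figure1_data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "value"])
        writer.writerows(enumerate(values))

    # Rule 6: the seed is stored next to the data it generated.
    meta = {"seed": seed, "n": len(values)}
    (out_dir / "figure1_meta.json").write_text(json.dumps(meta))
```

Running `save_plot_data(simulate(SEED), SEED, "results")` before drawing the figure means the plot can be redrawn or restyled later without rerunning the analysis, and the same seed will regenerate the same values. Libraries such as NumPy have their own seeding mechanisms, which would need to be recorded in the same way.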

### Case Studies of how reproducible research is practiced: 

[practicereproducibleresearch.org](https://www.practicereproducibleresearch.org/core-chapters/1-intro.html) from UC Berkeley

### Deliverable #3: 
Include a README.md with your project repo. Some pointers on how to write a good README.md can be found [here](https://help.github.com/articles/about-readmes/)

### Deliverable #4:
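Include a `requirements.txt` or `environment.yml` with your project repo.

`conda list -e > requirements.txt` saves the name, version and build of every package in the active environment. Note that this file is in conda's format, not pip's; for a pip-installable file, use `pip freeze > requirements.txt` instead.  
`conda env export > <environment-name>.yml` saves the full environment specification, which can be recreated with `conda env create -f <environment-name>.yml`.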
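
The same information can also be captured from inside Python, which is handy if you want to log the environment alongside each run. The following is a minimal sketch using the standard library's `importlib.metadata`; the `snapshot_environment` helper is a hypothetical name, and it complements rather than replaces the `conda` and `pip` exports above.

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Return the Python version and `name==version` lines for installed packages."""
    lines = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta else None
        version = meta.get("Version") if meta else None
        if name and version:  # skip entries with missing metadata
            lines.add(f"{name}=={version}")
    return {"python": platform.python_version(), "packages": sorted(lines)}
```

Writing the returned dictionary to a JSON file next to your results gives each output a record of the package versions that produced it.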