# Getting started with Julia for reproducible science

Mathieu Besançon  
Polytechnique Montréal, INRIA & Centrale Lille

Re-using / citing the materials:  
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3249868.svg)](https://doi.org/10.5281/zenodo.3249868)

## Logistics

Slides at https://matbesancon.github.io/slides/JuliaNantes/JuliaTools.html  
Repository at https://github.com/matbesancon/JuliaNantes  

Lots of references & pointers, for the trip home.

Twitter for live-complaints: `@matbesancon`

# Content

- Reproducibility, why?
- Getting started with Git
- Working alone
    - Projects & environments
    - Tests
    - Publishing code
    - Working with data
- Collaborating on code
    - General workflow
    - Demo & homework

Bonus:

- Unaesthetic diagrams
- Latest research in linear algebra
- Homework

# Reproducibility: why should I care?

- **Industrial software**:
    - Written once, used often, in (almost) all contexts
    - Bugs are found (eventually) and fixed


- **Academic software**:
    - Re-written a lot, used one final time
    - Used in a static, long-lasting document (paper)
    - Tested for one application? On how many data sets?

- First person reproducing results: **you**, in a couple of months
- Tools should buy peace of mind, not add a burden
- Increasing expectations on reproducible software
    - NeurIPS reproducibility [checklist](https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf) for all papers
    - Requirements on reproducibility: not just by you, not just on your machine

5 levels of reproducibility defined in P. Vandewalle, J. Kovacevic and M. Vetterli, "Reproducible research in signal processing"  
DOI: 10.1109/MSP.2009.932122

# Meet Git: the time machine for computer files

## Download

On Linux, Git is probably already installed; otherwise:  
https://git-scm.com/downloads

## Learn

- https://fr.atlassian.com/git
- https://guides.github.com/introduction/git-handbook
- https://openclassrooms.com/fr/courses/2342361-gerez-votre-code-avec-git-et-github
- https://openclassrooms.com/en/courses/5671626-manage-your-code-project-with-git-github

## Try online
- https://learngitbranching.js.org

# Demo 1: a git project

<input type="checkbox" unchecked> New project  
<input type="checkbox" unchecked> Add files  
<input type="checkbox" unchecked> Commit  
<input type="checkbox" unchecked> See versions  
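
The four steps can also be scripted from Julia, as a rough sketch. It assumes the `git` executable is on your `PATH`; the file name, the commit message and the `Demo` identity are placeholders chosen for illustration, and everything happens in a throwaway directory.

```julia
# Throwaway directory so that no real project is touched.
repo = mktempdir()

cd(repo) do
    run(`git init --quiet`)                  # new project
    write("README.md", "# My project\n")     # a first file to track
    run(`git add README.md`)                 # add files
    # The identity is passed inline so the sketch does not rely on a global git config.
    run(`git -c user.name=Demo -c user.email=demo@example.org commit --quiet -m "Initial commit"`)
    run(`git log --oneline`)                 # see versions
end
```

In day-to-day work you would type the same `git` commands directly in a terminal.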

# Working alone

- One project, isolated from the rest
- Build a library or produce results

```shell
$ tree
.
├── code
├── data
│   ├── data1.csv
│   └── data2.txt
├── results
│   ├── results.csv
│   └── img
│
└── paper
    └── paper.tex
```

## Why not a script?

- Upgrading dependencies -> did I break my code?
- Am I sure this code can be run anywhere?
- Article review after 6 months -> is it still fine?

## Working in environments
![](img/pkg_structure_projects.svg)

## Safely depending on libraries
![](img/pkg_structure_full.svg)

# Pkg tools: files

### Project.toml

- This directory is a Julia Project
- Shows **what I need**
- Necessary for all projects

### Manifest.toml

- Generated when the project's dependencies are added or resolved
- Shows **how it was run**
- Useful for debugging and research

Freeze the Manifest $\Rightarrow$ freeze how it's run 

# Demo 2: Pkg tools

<input type="checkbox" unchecked> Generate project IdentityMatrices  
<input type="checkbox" unchecked> Add dependencies  
<input type="checkbox" unchecked> See Project.toml, Manifest.toml  
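
A possible scripted version of this demo, using the functional `Pkg` API instead of the `pkg>` prompt. This is only a sketch: `LinearAlgebra` is an arbitrary choice of dependency (a standard library), the project is generated in a temporary directory, and depending on your setup `Pkg.add` may contact the package registry.

```julia
using Pkg

root = mktempdir()   # throwaway parent directory for the demo project

cd(root) do
    Pkg.generate("IdentityMatrices")    # creates Project.toml and src/IdentityMatrices.jl
    Pkg.activate("IdentityMatrices")    # work inside the new project's environment
    Pkg.add("LinearAlgebra")            # add a dependency; this also writes Manifest.toml
    # What I need:
    print(read(joinpath("IdentityMatrices", "Project.toml"), String))
end
```

Note that `Pkg.activate` changes the active environment for the rest of the Julia session.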

Everything is in the Pkg [documentation](https://julialang.github.io/Pkg.jl/v1); go read it.

## Project isolation

Launch Julia with your project already activated:
```shell
$ julia --project=@.
```

Or launch Julia first, then activate the project:
```julia
julia> ]
(v1.1) pkg> activate .
```

## Get a project and set the required environment

```julia
julia> ]
(v1.1) pkg> activate .
(JuliaNantes) pkg> instantiate
```

If *Manifest.toml* is provided $\Rightarrow$ exactly the same configuration as when the code was written.  
Otherwise $\Rightarrow$ a configuration compatible with *Project.toml* is resolved, and a Manifest file is created.
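
The same two steps can be placed at the top of a script, so that the script sets up its own environment before doing anything else. A minimal sketch, assuming Julia is started from the project directory:

```julia
using Pkg

Pkg.activate(".")    # same as `pkg> activate .`
Pkg.instantiate()    # same as `pkg> instantiate`
Pkg.status()         # list what is now available in this environment
```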

# Tests, the easy way

Research software moves fast, and breaks things.

Tests:
- Specify expected behavior
- Communicate usage
- Signal robustness
- Safeguard against your future self
- Put yourself in the user's shoes

# Demo 3: writing tests

<input type="checkbox" unchecked> First tests for `IdentityMatrices`  
<input type="checkbox" unchecked> Write code for `IdentityMatrices`  
<input type="checkbox" unchecked> Test-specific dependencies  
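
For reference, here is one way the demo could look once finished. The `IdentityMatrix` type below is a hypothetical minimal design written for this illustration, not necessarily the code produced during the live demo; in a real package the module would live in `src/` and the tests in `test/runtests.jl`.

```julia
using Test

# Hypothetical minimal version of the demo package.
module IdentityMatrices

export IdentityMatrix

"An n×n identity matrix that stores only its size."
struct IdentityMatrix{T} <: AbstractMatrix{T}
    n::Int
end

# Default to Float64 entries when no element type is given.
IdentityMatrix(n::Integer) = IdentityMatrix{Float64}(n)

Base.size(m::IdentityMatrix) = (m.n, m.n)

function Base.getindex(m::IdentityMatrix{T}, i::Int, j::Int) where {T}
    checkbounds(m, i, j)    # throws a BoundsError outside of the matrix
    return i == j ? one(T) : zero(T)
end

end # module

using .IdentityMatrices

@testset "IdentityMatrix" begin
    id = IdentityMatrix(3)
    @test size(id) == (3, 3)
    @test id[2, 2] == 1.0
    @test id[1, 2] == 0.0
    @test id == [1 0 0; 0 1 0; 0 0 1]
    @test_throws BoundsError id[4, 4]
end
```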

### Personal tips

Cover corner cases:
- `@test_throws` with expected error
- What happens with limit values?
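```julia
using Statistics, Test  # `mean` lives in the Statistics standard library
@test_throws MethodError mean(["hello"])  # strings cannot be averaged
@test isnan(mean(Float64[]))              # limit case: empty input gives NaN
```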

Avoid overly specific structure in test inputs.  
Example: inputs that are always integers.  

Avoid trivial "comfort" tests.  
Example: copying a function implementation to test it:
```julia
@test mean(x) == sum(x) / length(x)
```

## Unit vs. property tests

- Unit tests: check a given evaluation / data point
- Property tests: check a property of the result that must hold for any input

Examples: positivity, idempotence, existence for **any** input, order preservation, ...

Two steps:
1. generate random input
2. test property
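
The two steps could look as follows for `mean`. This is a sketch using only the standard library; the seed, the number of trials and the tolerance are arbitrary choices made for this example.

```julia
using Random, Statistics, Test

rng = MersenneTwister(42)   # fixed seed, so a failing input can be regenerated

@testset "mean: properties on random inputs" begin
    for _ in 1:100
        # Step 1: generate a random input (random length, random values).
        x = randn(rng, rand(rng, 2:50))
        c = randn(rng)
        # Step 2: test properties that should hold for any such input.
        @test minimum(x) <= mean(x) <= maximum(x)                # stays within the data range
        @test isapprox(mean(x .+ c), mean(x) + c; atol=1e-10)    # shifts along with the data
    end
end
```

Neither check repeats the implementation of `mean`, unlike the "comfort" test shown earlier.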

# Publishing code

### Why?

> An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.  
> [Claerbout and Karrenbach 1992](http://sepwww.stanford.edu/doku.php?id=sep:research:reproducible:seg92)

Showcase your work, reference it in your paper.

Better **something** out now than some day a "perfect" library.

### How?

- Git platform (GitHub, Bitbucket, GitLab)
- [Zenodo](https://zenodo.org): DOI provider and archive
- [Figshare](https://figshare.com): citable archive for data
- [HAL](https://hal.archives-ouvertes.fr): open archive

### A standard?

Parts of the Julia community are moving towards a `CITATION.bib` file

## What about data?

- Do not track massive data sets with git
- Then how?  

![](img/datadeps.png)  
Source: Lyndon White, DataDeps.jl, JuliaCon2018 https://doi.org/10.6084/m9.figshare.6949145.v1

## DataDeps.jl

- Describe once *how* to get, parse and preprocess the data
- Data gets cached: no second download if it is already available

Go check it: https://github.com/oxinabox/DataDeps.jl
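
To make the caching idea concrete, here is a standard-library-only sketch. It is **not** the DataDeps.jl API: `fetch_cached` is a hypothetical helper written for this illustration, and the "download" is replaced by writing a tiny CSV file.

```julia
# Hypothetical helper illustrating the caching idea (not the DataDeps.jl API).
# `produce(path)` describes once how to obtain the data and store it at `path`.
function fetch_cached(produce::Function, path::AbstractString)
    isfile(path) && return path    # already cached: nothing to do
    mkpath(dirname(path))          # make sure the cache directory exists
    produce(path)                  # e.g. download and preprocess into `path`
    return path
end

cache_dir = mktempdir()
target = joinpath(cache_dir, "toy", "data.csv")
calls = Ref(0)                     # counts how often the data is really produced

for _ in 1:2
    fetch_cached(target) do path
        calls[] += 1
        write(path, "x,y\n1,2\n")  # stand-in for a real download
    end
end

println("fetch ran ", calls[], " time(s)")   # prints "fetch ran 1 time(s)"
```

DataDeps.jl adds what this sketch leaves out, such as checksums and a shared location for the cached files.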

# Collaborating on code and research

Why?

1. You use a project and gain valuable experience to report back
2. Use-case not covered (yet)
3. Somebody noticed your work online and wants to help
4. Great opportunity for unexpected research projects

![](img/pr_gitflow.svg)  
More info on https://guides.github.com/introduction/flow

# Demo 4: Contributing somewhere

<input type="checkbox" unchecked> Find a project to contribute to

- github.com/JuliaStats/Distributions.jl
- github.com/JuliaGraphs/LightGraphs.jl
- github.com/JuliaOpt/MathOptInterface.jl
- github.com/JuliaOpt/JuMP.jl

<input type="checkbox" unchecked> Fork  
<input type="checkbox" unchecked> Develop locally  
<input type="checkbox" unchecked> Commit & push to fork  
<input type="checkbox" unchecked> Pull request  

# Automate the burden: Continuous Integration

[Travis](https://travis-ci.org/) (Continuous Integration)  
[codecov](https://codecov.io/) (Code coverage)

## Continuous Integration

This works... on my machine.  
What if I could check it on a clean computer, without my setup?

![](./img/travis.png)

## Checking for every change (Pull Request)

![](./img/travis_pr.png)

# Code coverage

How much of the package behaviour did I test (at least once)?

![](./img/codecov.png)

## Take-away

- Version control is the foundation for lots of modern tools
- Tests make your life easier
- Sharing code increases visibility, creates research opportunities

## Homework

- Reproduce all the demos  
- Use git on your research projects  
- Contribute to a package (demo 4)  
- Publish your first package  

## Reading more

Mathieu Tanneau's tutorial on coding for research: https://github.com/mtanneau/tutorial_airo  
Jane Herriman, *How to get started with Julia 1.0's package manager*: https://www.youtube.com/watch?v=76KL8aSz0Sg  
Read the documentation https://docs.julialang.org/en/v1/stdlib/Pkg/index.html