# Work habits and reproducible research

- Source your code
- Continuous Integration
- Reproducible research
- Unit Tests
- Workflows
- Acceleration: profiling, JIT

## Source your code

In [19]:
a = "important stuff"

In [None]:
assert a == "important stuff", "Ha! You just corrupted your data!"

In [20]:
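def f(a):
    a = "corrupted " + a  # rebinds only the local name
    return  # bare return: the function returns None

a = f(a)  # a is now None; re-running the assert cell above fails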

**Why?**

- Because you avoid errors, gain reproducibility, and act as a responsible programmer and researcher.
- You can put your code under version control, track changes, manage error reports and documentation, etc.
- If the code is kept in Jupyter notebooks, the cells should always be executed sequentially.
- Did you know that Jupyter has a code editor, a terminal and a markdown editor, independent of the notebook?
- Make your research presentation easily reproducible by offering the data, the code and the findings separately.
- But don't isolate the code from your findings! Executing your source code from a notebook is fine.
- The notebook snippets that you see on the internet are meant to demonstrate code, not to do scientific research.
- If you receive a notebook from a collaborator, always make sure you can run it and reproduce its findings (it may already be corrupted).
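
As a minimal sketch of executing source code from a notebook, the cell below writes a small module to disk and imports it instead of defining the logic inline. The module name `mytools` and the function `label` are hypothetical, chosen only for illustration; in a real project the file would live in your versioned source tree rather than a temporary directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents, for illustration only.
MODULE_SOURCE = '''
def label(value):
    """Return a labelled copy of value; the caller's variable is untouched."""
    return "processed " + value
'''

# Write the module to a scratch directory and make it importable.
workdir = Path(tempfile.mkdtemp())
(workdir / "mytools.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(workdir))
mytools = importlib.import_module("mytools")

a = "important stuff"
b = mytools.label(a)  # the result gets its own name, so a is never overwritten

assert a == "important stuff", "Ha! You just corrupted your data!"
print(b)
```

Because the function lives in a file, it can be versioned, tested and reused by several notebooks, while the notebook only holds the calls and the findings.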

**How**

Python editors:
- Simple text processors: Atom, Sublime, Geany, Notepad++, etc.
- Spyder: good for basic scientific programming, IPython interpreter
- PyCharm: refactoring, internal Jupyter, object browsing, etc.

What matters:
- Using the editor most appropriate to the complexity of the task.
- Full feature editors make it easier to write good code!
- Syntax and style linting.
- Code refactoring.
- Git/svn integration.
- Remote development.
- Advanced debugging.
- Unit testing.


**Standards**

- Source code can be one or several scripts; it should contain information on deployment, testing and documentation.
- Give some care to design: the module hierarchy, whether you will create classes, which design patterns to use.
    - https://www.geeksforgeeks.org/python-design-patterns/
- Python style guide.
    - https://www.python.org/dev/peps/pep-0008/


**Quality of life, or simply good habits**
- Software versioning milestones.
- Continuous integration.
- Reproducibility.
- Workflows.
- Containerization.

**Versioning example**

- https://www.python.org/dev/peps/pep-0440/
- https://en.wikipedia.org/wiki/Software_versioning

milestone x.x:
- expected outcomes
- tests

```
X.YaN   # Alpha release
X.YbN   # Beta release
X.YrcN  # Release Candidate
X.Y     # Final release
```
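
To illustrate how these release stages order relative to each other, here is a minimal sketch that sorts version strings of the simplified `X.Y` form shown above. The regular expression and the helper `version_key` are assumptions made for this example and cover only that subset of PEP 440; for real projects, the `packaging` library implements the full specification.

```python
import re

# Matches only the simplified forms above: X.Y, X.YaN, X.YbN, X.YrcN.
_PATTERN = re.compile(r"^(\d+)\.(\d+)(?:(a|b|rc)(\d+))?$")

# Alpha sorts before beta, beta before release candidate, and all before final.
_STAGE_ORDER = {"a": 0, "b": 1, "rc": 2, None: 3}


def version_key(version):
    """Return a tuple that sorts version strings in release order."""
    match = _PATTERN.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, stage, number = match.groups()
    return (int(major), int(minor), _STAGE_ORDER[stage], int(number or 0))


releases = ["1.2", "1.2rc1", "1.2a2", "1.2b1", "1.1", "1.2a1"]
ordered = sorted(releases, key=version_key)
print(ordered)
```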
    

### Continuous Integration

- Push your code to GitHub often!
- Make backups of your data and findings.
- Set baselines on expected outcomes, and verify often that your code is tested against them.
- Unit tests are one way to do this.
- Keeping notebooks (internal ones) helps, especially if you can re-test.
- Test your code more than once, and every time you make a modification.
- Use workflows, virtual environments and containers.
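
A baseline check can be sketched as follows. The function `analysis` stands in for your own computation and `matches_baseline` is a hypothetical helper; the baseline is kept as a JSON string here, whereas in practice it would be a file stored next to the code.

```python
import json
import math
import statistics


def analysis(values):
    """Stand-in for a real analysis step: summary statistics of the data."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}


def matches_baseline(result, baseline, rel_tol=1e-9):
    """Compare a result with the stored baseline, allowing float round-off."""
    return result.keys() == baseline.keys() and all(
        math.isclose(result[key], baseline[key], rel_tol=rel_tol)
        for key in baseline
    )


data = [1.0, 2.0, 3.0, 4.0]

# Record the baseline once, when the outcome has been checked by hand.
baseline = json.loads(json.dumps(analysis(data)))

# Later runs recompute the result and compare it with the baseline.
unchanged = matches_baseline(analysis(data), baseline)
drifted = matches_baseline(analysis(data + [100.0]), baseline)
print(unchanged, drifted)
```

Running such a check after every modification turns "verify often" into something a CI service can do automatically.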

## Reproducible research


- The vast majority of published results today are not reproducible.
- Let us admit this: if our research cannot be reproduced, we probably did something other than research.
- Research findings do not depend only on your plotting skill.
- For someone to be able to reproduce your results, several things must come together:
    - Open data access:
        - (on federated databases)
    - Open source access:
        - on github, gitlab, etc
    - Open environment:
        - conda requirements and container script (or image)
    - Findings (paper AND notebooks)
        - public access


**Development vs production**

- They are fully separated.
- It is fine to demo your source code, or just parts of it, in a notebook during development.
- Most notebooks you see on the web are in a development stage.
- How does it affect reproducibility if development and production are not separated?
- Consider the issue of different projects using the same source code directory, or the same data directory. What should be done?

### Reproducible environments: containers and conda

- **Docker usage is described in another notebook**
- Containers can isolate an environment even better than a package manager!
- Problem: what if the old package versions cannot be maintained?
- Problem: what if the older container instances cannot be spun up or even re-created?


### Unit testing

- The unittest module can be used from the command line to run tests from modules, classes or even individual test methods
    - https://docs.python.org/3/library/unittest.html
    - https://www.geeksforgeeks.org/unit-testing-python-unittest/
    - https://realpython.com/python-testing/
- Some editors give special support for unit tests:
    - https://www.jetbrains.com/help/pycharm/testing-your-first-python-application.html#choose-test-runner
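
A minimal sketch of a test case with the `unittest` module is shown below. The function `count_a` and the test names are hypothetical, made up for this example; the suite is run programmatically so that the cell also works inside a notebook.

```python
import unittest


def count_a(sequence):
    """Count occurrences of the base 'A' in a sequence string."""
    return sequence.count("A")


class CountATest(unittest.TestCase):
    def test_counts_uppercase_a(self):
        self.assertEqual(count_a("GATTACA"), 3)

    def test_empty_sequence(self):
        self.assertEqual(count_a(""), 0)

    def test_rejects_non_string(self):
        # None has no .count method, so the call must fail loudly.
        with self.assertRaises(AttributeError):
            count_a(None)


# Load and run the tests without calling sys.exit(), unlike unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountATest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

If the same code were saved as `test_counts.py`, the command line forms would be `python -m unittest test_counts` for the whole module and `python -m unittest test_counts.CountATest.test_empty_sequence` for a single method.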


### Documentation

- Docstrings convention
    - https://www.python.org/dev/peps/pep-0257/
    - https://realpython.com/documenting-python-code/
- https://readthedocs.org/
    - Simplifies software documentation by building, versioning and hosting your docs automatically. Think of it as continuous documentation.
- Using Sphinx:
    - https://docs.readthedocs.io/en/latest/intro/getting-started-with-sphinx.html
    - https://www.sphinx-doc.org/en/master/
- Other options exist, such as pydoc.
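
As a sketch of the docstring convention, the function below carries a summary line, a description of its argument and return value, and two usage examples. The function `gc_content` is hypothetical. The examples are written in doctest format, so the standard `doctest` module can check that the documentation still matches the code.

```python
import doctest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Args:
        sequence: A non-empty string of bases, in upper or lower case.

    Returns:
        A float between 0.0 and 1.0.

    >>> gc_content("GGCC")
    1.0
    >>> gc_content("atgc")
    0.5
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    gc = sum(base in "GC" for base in sequence.upper())
    return gc / len(sequence)


# Extract the examples from the docstring and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    gc_content.__doc__, {"gc_content": gc_content}, "gc_content", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

Tools such as Sphinx and pydoc read the same docstring, so one piece of text serves as documentation and as a lightweight test.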

**Other development tools:**
- debugger: lets you follow your code step by step and inspect the program stack
- profiler: measures run time and memory usage, helping to find bottlenecks and possible leaks (see acceleration notebook)

### Workflows

**Snakemake**
- https://snakemake.readthedocs.io/en/stable/tutorial/basics.html
- https://snakemake.github.io/snakemake-workflow-catalog/

```
conda install -c bioconda snakemake
conda install graphviz
```

In [None]:

SAMPLES = ['ctl1', 'ctl2']

rule all:
    input:
        'merged.txt'

rule acounts:
    input:
        file='{sample}.fastq'
    output:
        '{sample}_counts.txt'
    run:
        with open(input.file, 'r') as f:
            nc = [str(l.count('A')) for l in f if not l[0]=='@']
        data = ', '.join(nc)+'\n'
        with open(output[0], 'w') as f: f.write(data)

rule merge:
    input:
        counts=expand('{sample}_counts.txt',sample=SAMPLES)
    output:
        'merged.txt'
    shell:
        """
        for f in {input.counts}
        do
            cat $f >> {output}
        done
        """

In [None]:
snakemake --dag merged.txt | dot -Tsvg > dag.svg

In [1]:
snakemake --name mylittleworkflow.txt



**Nextflow**
- https://www.nextflow.io/

In [None]:


#!/usr/bin/env nextflow
 
params.range = 100
 
/*
 * A trivial Perl script producing a list of number pairs
 */
process perlTask {
    output:
    stdout randNums
 
    shell:
    '''
    #!/usr/bin/env perl
    use strict;
    use warnings;
 
    my $count;
    my $range = !{params.range};
    for ($count = 0; $count < 10; $count++) {
        print rand($range) . ', ' . rand($range) . "\n";
    }
    '''
}
 
 
/*
 * A Python script task which parses the output of the previous script
 */
process pyTask {
    echo true
 
    input:
    stdin randNums
 
    '''
    #!/usr/bin/env python
    import sys
 
    x = 0
    y = 0
    lines = 0
    for line in sys.stdin:
        items = line.strip().split(",")
        x = x+ float(items[0])
        y = y+ float(items[1])
        lines = lines+1
 
    print("avg: %s - %s" % (x/lines, y/lines))
    '''
 
}

# Acceleration

## Speed: Profiling, IPython, JIT

The Python standard library contains the cProfile module, which measures the time spent in every Python function while the code runs. The pstats module reads the profiling results. Third-party profiling libraries include line_profiler, for profiling code line by line, and memory_profiler, for profiling memory usage. All these tools are powerful and extremely useful when optimizing code, but they might not be very easy to use at first.

In [1]:
%%writefile script.py
import numpy as np
import numpy.random as rdn

# uncomment for line_profiler
# @profile
def test():
    a = rdn.randn(100000)
    b = np.repeat(a, 100)

test()

Writing script.py


In [2]:
!python -m cProfile -o prof script.py
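
The profiling results can then be read with `pstats`. The sketch below is self-contained: it profiles a small made-up workload in-process instead of loading a results file, and the function names `build_squares` and `workload` are hypothetical. To read a file written with `-o`, as in the command above, you would pass its name instead, for example `pstats.Stats("prof")`.

```python
import cProfile
import io
import pstats


def build_squares(n):
    return [i * i for i in range(n)]


def workload():
    total = 0
    for _ in range(50):
        total += sum(build_squares(10_000))
    return total


# Collect timing data only while the workload runs.
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the five entries with the largest cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time shows which call trees dominate the run; sorting by `"tottime"` instead highlights the functions that are slow by themselves.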

In [None]:
$ pip install ipython
$ ipython --version
0.13.1
$ pip install line-profiler
$ pip install psutil
$ pip install memory_profiler


In [9]:
%timeit?

In [None]:
%run -t slow_functions.py

In [3]:
%time {1 for i in range(10*1000000)}
%timeit -n 1000 10*1000000

Wall time: 299 ms
7.37 ns ± 0.0881 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [4]:
def foo(n):
    phrase = 'repeat me'
    pmul = phrase * n
    pjoi = ''.join([phrase for x in range(n)])
    pinc = ''
    for x in range(n):
        pinc += phrase
    del pmul, pjoi, pinc

In [None]:
%load_ext line_profiler
%lprun -f foo foo(100000)

- %time & %timeit: See how long a script takes to run (one time, or averaged over a bunch of runs).
- %prun: See how long it took each function in a script to run.
- %lprun: See how long it took each line in a function to run.
- %mprun & %memit: See how much memory a script uses (line-by-line, or averaged over a bunch of runs).
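
Outside IPython, the standard `timeit` module gives measurements similar to `%timeit`. The sketch below times three ways of building the same repeated string; the function names and the repetition counts are arbitrary choices for this example, and the absolute numbers will differ from machine to machine.

```python
import timeit

PHRASE = "repeat me"


def by_multiplication(n):
    return PHRASE * n


def by_join(n):
    return "".join([PHRASE for _ in range(n)])


def by_increment(n):
    result = ""
    for _ in range(n):
        result += PHRASE
    return result


timings = {}
for func in (by_multiplication, by_join, by_increment):
    # repeat() returns one total per repetition; the minimum is the least noisy.
    runs = timeit.repeat(lambda: func(1000), number=200, repeat=3)
    timings[func.__name__] = min(runs) / 200

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.2f} microseconds per call")
```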

### Numba

Numba is an open source JIT (just in time) compiler that translates a subset of Python and NumPy code into fast machine code.
- https://numba.pydata.org/

```
conda install numba
conda install cudatoolkit
```

In [None]:
import random

import numba
import numpy as np
from numba import jit

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

@numba.jit(nopython=True, parallel=True)
def logistic_regression(Y, X, w, iterations):
    for i in range(iterations):
        w -= np.dot(((1.0 /
              (1.0 + np.exp(-Y * np.dot(X, w)))
              - 1.0) * Y), X)
    return w

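### JIT with JAX

JAX is a Python library for array computing that combines a NumPy-like API, automatic differentiation and JIT compilation through XLA; it is widely used for machine learning.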
- https://github.com/google/jax
- https://jax.readthedocs.io/en/latest/notebooks/quickstart.html#using-jit-to-speed-up-functions