# Best Practices in Scientific Programming

Heidelberg University | Institute of Geography | August 12th 2020

Christina Ludwig

## Agenda


1. Learning goals
2. Open Reproducible Science
3. Good scientific code
4. Practical: 
    * Unit testing
    * Optimization
    * Going back in time using git


## Learning goals

1. Define _open reproducible science_ and name its benefits.
2. Explain the requirements of _good scientific code_.
3. Test your code in python.
4. Explain optimization strategies. 
5. Compare your code to previous versions using git. 

## Open Reproducible Science
 
### 1. What does 'open' mean?

### 2. What does 'reproducible' mean?

### 3. Why is it important?


#### Discuss in breakout rooms -  10 min

In [6]:
from course_utils import make_groups

In [7]:
make_groups(3)

["Breakout Room 1: ['Hannah' 'Julian P' 'Rebekka']",
 "Breakout Room 2: ['Julian G' 'Matthias' 'Anna']",
 "Breakout Room 3: ['Leonie Marie' 'Mark']"]

## What does _Open Scientific Data_ mean?


In 2016, the [FAIR Guiding Principles for scientific data management and stewardship](https://www.nature.com/articles/sdata201618) were published in Scientific Data.

The authors intended to provide guidelines to improve the
* Findability,
* Accessibility,
* Interoperability, and
* Reuse of digital assets.

The principles emphasise machine-actionability because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.

Source: [go-fair.org](https://www.go-fair.org/fair-principles/)



## What does _Reproducible Science_ mean?




**Reproducing** the result of a computation means running the same software on the same input data and obtaining the same results. ([Rougier et al, 2017](https://www.frontiersin.org/articles/10.3389/fninf.2017.00076/full))

**Replicating** a published result means writing and then running new software based on the description of a computational model or method provided in the original publication, and obtaining results that are similar enough to be considered equivalent. ([Rougier et al, 2017](https://www.frontiersin.org/articles/10.3389/fninf.2017.00076/full))

&rarr; But there are different opinions on this definition, see [Plesser, 2018](https://www.frontiersin.org/articles/10.3389/fninf.2017.00076/full#B17).


## What does _Reproducible Science_ mean?

**re-runnable, repeatable, reproducible, reusable, and replicable.**

"The code should be executable (re-runnable) and produce the same result more than once (repeatable); it should allow an investigator to reobtain the published results (reproducible) while being easy to use, understand and modify (reusable), and it should act as an available reference for any ambiguity in the algorithmic descriptions of the article (replicable)." ([Benureau & Rougier, 2018](https://www.frontiersin.org/articles/10.3389/fninf.2017.00069/full))



## What are the benefits of _Open Reproducible Science_ ?

* __Transparency__ in the scientific process, as anyone including the general public can access the data, methods, and results.
* __Ease of replication and extension__ of your work by others, which further supports __peer review and collaborative learning in the scientific community.__
&rarr; Avoiding paper retractions due to computation bugs  
&rarr; Code is increasingly required for journal review (e.g. [o2r](https://o2r.info/results/))  
* It supports you! You can __easily understand and re-run your own analyses__ as often as needed and after time has passed.

[Source](https://www.earthdatascience.org/courses/intro-to-earth-data-science/open-reproducible-science/get-started-open-reproducible-science/)


## What is _good scientific code_ ?


 <img src="./img/wtf_minute.jpeg" alt="wtf_per_minute" width=500>


## What is _good scientific code_ ?


Group 1: Imagine you are a **web developer**.

Group 2: Imagine you are a **scientist**.

__What is important for your software to be considered _good_ ?__

* Efficiency / Speed
* Readability
* Maintainability
* Reusability
* Reliability / Accurate results

__Order the criteria by priority__ starting with the most important one on top:

&rarr; Put your ranking on the chaospad https://pads.ccc.de/r61LZPBk8L


### There is usually a compromise between readability and efficiency [(Rougier, 2017)](https://www.labri.fr/perso/nrougier/from-python-to-numpy/#readability-vs-speed)

In [None]:
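import numpy as np

def function_1(seq, sub):
    # Readable: compare every window of len(sub) in seq with sub
    return [i for i in range(len(seq) - len(sub) + 1) if seq[i:i+len(sub)] == sub]

def function_2(seq, sub):
    # Fast: a window equal to sub has correlation np.dot(sub, sub) with it
    target = np.dot(sub, sub)
    candidates = np.where(np.correlate(seq, sub, mode='valid') == target)[0]
    # Other windows can reach the same correlation, so verify each candidate element-wise
    check = candidates[:, np.newaxis] + np.arange(len(sub))
    mask = np.all((np.take(seq, check) == sub), axis=-1)
    return candidates[mask]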

## How to write _good scientific code_

* Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., ... & Waugh, B. (2014). [Best practices for scientific computing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886731/). PLoS biology, 12(1).
* Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). [Good enough practices in scientific computing.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5480810/) PLoS computational biology, 13(6).

## Enough theory, let's put it into practice!

Today, we will look at these methods: 

1. __Unit testing__
2. __Optimization__ 
3. Going back in time using __git__

More later, e.g.

* Increase readability using refactoring, pylint, and pre-commit hooks
* __Organize code__ into functions, modules and config files. 
* Use __assertions__ to find errors quickly
* Write __documentation__
* etc.

## 1. Unit Testing

Testing frameworks in Python 
* unittest
* nosetest
* pytest

The name of every test function needs to start with `test_` so that the testing framework can discover it.

### What a good test looks like
#### Good:
* Short and quick to execute
* Easy to read
* Exercise one thing
* Fails fast

#### Bad:
* Relies on data files
* Messes with “real-life” files, servers, databases

### Basic structure of a test

__Given:__ Put your system in the right state for testing
* Create data, initialize parameters, define constants…

__When:__ Execute the feature that you are testing
* Typically one or two lines of code

__Then:__ Compare outcomes with the expected ones
* Define the expected result of the test
* Set of assertions that check that the new state of your system matches your expectations



### Example: Tests for ‘lower’ method of strings

In [20]:
def test_lower():
    # Given
    string = 'HeLlO wOrld'
    expected = 'hello world'
    # When
    output = string.lower()
    # Then
    assert output == expected

In [21]:
def test_lower():
    # Given
    # Each test case is a tuple of (input, expected_result)    
    test_cases = [('HeLlO wOrld', 'hello world'),('hi', 'hi'),('123 ([?', '123 ([?'),('', '')]
    for string, expected in test_cases:
        # When
        output = string.lower()
        # Then
        assert output == expected
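A loop with a plain `assert` stops at the first failing case. As a minimal sketch, the same test cases could be written with the standard library's `unittest` and its `subTest` context manager, which reports every failing case separately. The class name `TestLower` and the helper `run_tests` are illustrative names, not part of the course material.

```python
import unittest


class TestLower(unittest.TestCase):
    def test_lower(self):
        # Given
        # Each test case is a tuple of (input, expected_result)
        test_cases = [('HeLlO wOrld', 'hello world'), ('hi', 'hi'),
                      ('123 ([?', '123 ([?'), ('', '')]
        for string, expected in test_cases:
            # subTest labels each case, so all failures are reported, not only the first
            with self.subTest(string=string):
                # When
                output = string.lower()
                # Then
                self.assertEqual(output, expected)


def run_tests():
    """Run TestLower programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLower)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_tests()
print("All tests passed:", result.wasSuccessful())
```

In a project you would usually not call `run_tests()` yourself but let the test runner of your IDE or the command line collect the tests.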

### Code often breaks in corner cases: 
empty lists, None, NaN, 0.0, lists with repeated elements, non-existing file, …

This often involves making a design decision: respond to the corner case with __special behavior, or raise a meaningful exception?__

In [19]:
string = None
string.lower()

AttributeError: 'NoneType' object has no attribute 'lower'
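One possible way to handle this corner case is sketched below. It assumes a hypothetical wrapper function `to_lower` (not part of the course code) and takes the design decision to raise a meaningful exception for `None`, while the empty string is treated as regular input.

```python
def to_lower(string):
    """Hypothetical wrapper around str.lower() that handles the None corner case."""
    if string is None:
        # Design decision: raise a meaningful exception instead of a confusing AttributeError
        raise TypeError("to_lower() expects a string, got None")
    return string.lower()


def test_to_lower_none():
    # Given
    string = None
    # When / Then: the call must fail with the documented exception
    try:
        to_lower(string)
    except TypeError as error:
        assert "expects a string" in str(error)
    else:
        raise AssertionError("to_lower(None) did not raise TypeError")


def test_to_lower_empty_string():
    # Given
    string = ''
    expected = ''
    # When
    output = to_lower(string)
    # Then
    assert output == expected


test_to_lower_none()
test_to_lower_empty_string()
```

If you use pytest, the `try`/`except`/`else` construct can be replaced by the context manager `pytest.raises(TypeError)`.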

### Integrating debugging and testing
### When you find a bug:
1. Reproduce it in a __small simple test__ to make it fail fast.
2. __Debug your code__ so that all tests pass.

&rarr; In this way, you __increase your test coverage__ step by step __making your code more robust__ with each test. 

&rarr; If you work on your code, __run the tests regularly__ so you can make sure that none of your fixed bugs come back unannounced.
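The two steps could look like the following sketch. The function `normalize` and the bug it once contained are invented for illustration: assume the first version crashed with a `ZeroDivisionError` when all values were identical.

```python
def normalize(values):
    """Hypothetical function: rescale numbers linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Step 2, the fix: identical values used to cause a ZeroDivisionError here
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def test_normalize_identical_values():
    # Step 1: small test that reproduces the bug; it failed before the fix
    assert normalize([5, 5]) == [0.0, 0.0]


def test_normalize_simple():
    # Existing test: must keep passing after the fix
    assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]


test_normalize_identical_values()
test_normalize_simple()
```

The new test stays in the test suite, so the bug is detected immediately if a later change reintroduces it.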

### Exercise:

Open this repository folder in PyCharm. The folder _src_ contains the `Polygon` class which we wrote on the first day. 

0. Paste your implementation code into the `MyPolygon.envelope()` method. 
1. Write tests to make sure that the `MyPolygon.envelope()` works correctly. Don't forget corner cases. 

## 2. Optimization

Once the reliability of your code is ensured using the tests, your code is ready for optimization.



### Order the following optimization steps in chronological order of how you would proceed: 

* Use GPU acceleration
* Use a “magic optimization” tool, like numexpr, or numba; or a “magic parallelization” tool, like joblib or dask
* Use Cython
* Don’t do anything
* Vectorize your code using numpy
* Parallelize your code
* Spend some money on better hardware

&rarr; Put them down on the chaospad: https://pads.ccc.de/r61LZPBk8L or on notes in MS Teams


In [8]:
make_groups(3)

["Breakout Room 1: ['Julian P' 'Hannah' 'Julian G']",
 "Breakout Room 2: ['Matthias' 'Anna' 'Rebekka']",
 "Breakout Room 3: ['Mark' 'Leonie Marie']"]

## When optimizing code, remember: 

* __Avoid premature optimization__
* Python is slower than C, but - as we've seen - many costly operations are already optimized in the background
* Programming time is usually more expensive than computation time.
* Prioritize your tasks: Always weigh the time you spend on a task against its benefits, e.g. use an issue tracker on GitHub

## 2. Optimization Workflow

__1. Make it work:__ Write simple code which gets the job done.


__2. Make it work reliably:__ Write automated unit tests.


__3. Profile it:__ Profile simple test cases to find the bottlenecks. 


__4. Optimize it:__ Make sure that your tests are all still passing after optimization.
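The four steps can be sketched with the standard library alone. The example below is only an illustration: the functions `sum_of_squares_simple`, `sum_of_squares_fast` and `check_sum_of_squares` are made up, and the optimization replaces the loop by a closed-form formula, which is just one of several possible strategies.

```python
import timeit


def sum_of_squares_simple(n):
    """Step 1, make it work: a straightforward loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_fast(n):
    """Step 4, optimize it: closed-form formula for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def check_sum_of_squares(function):
    """Step 2, make it work reliably: the same checks are applied to both versions."""
    assert function(0) == 0
    assert function(1) == 0
    assert function(4) == 0 + 1 + 4 + 9


# Step 3, profile it: time both versions on a small test case
for function in (sum_of_squares_simple, sum_of_squares_fast):
    check_sum_of_squares(function)  # tests must pass before and after optimization
    seconds = timeit.timeit(lambda: function(10_000), number=100)
    print(f"{function.__name__}: {seconds:.4f} s for 100 calls")
```

The absolute timings depend on your machine, so only the comparison between the two versions is meaningful.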

### Tools for profiling

### Jupyter Notebook
* Magic commands such as `%%timeit`
* `line_profiler` and `memory_profiler` extensions

### IDEs
* Most IDEs have an integrated Profiler
* In PyCharm, the profiler is only included in the Professional Version. &rarr; Register for the Education Program to use it

### Command line
* Use cProfile: `python -m cProfile -o filename.prof myscript.py`, then visualize the result using the package `snakeviz`.
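cProfile can also be used from within Python. The following sketch profiles a made-up workload (`slow_sum` and `main` are hypothetical placeholders for your own code) and prints the most expensive entries with the standard library module `pstats` instead of `snakeviz`.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical bottleneck, used only for illustration."""
    return sum(i * i for i in range(n))


def main():
    """Hypothetical entry point that calls the bottleneck several times."""
    return [slow_sum(20_000) for _ in range(20)]


def profile_main():
    """Profile main() and return the report as text, sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # restrict the report to 5 entries
    return stream.getvalue()


print(profile_main())
```

In the report, the column `cumtime` shows the time spent in a function including all functions it calls, which helps to locate the bottleneck.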

### Profiling using line_profiler in PyCharm 

1. Open the Python console with IPython enabled. 
2. Load all necessary variables and functions in the console. 
3. Load the line_profiler: 
``` python
%load_ext line_profiler
```
4. Profile the `__init__` function.
``` python
%lprun -f MyPolygon.__init__ test_init()
```

### The Agile Development Cycle

 <img src="./img/agile_cycle.png" alt="agile_cycle" width=400>
    
Source: [Pietro Berkes: Testing, Debugging, Profiling](https://github.com/ASPP/testing_debugging_profiling)


## Exercise

Take a look at the [GitHub Issues](https://github.com/geoscripting/03_scientific_programming/issues) of the main repository.

1. __Pick an issue__ and use the agile development cycle to solve it. 
2. __Assign the issue to yourself.__ There can be several people assigned to an issue. In this way you know who else is working on it, so you know whom to ask if you want to discuss something or ask questions.
3. __Solve the issue__ by editing the python files. 
4. When you are done, __create a commit__ whose message contains the issue number. Push it to GitHub.

` $ git commit -m "added test for MyPolygon.envelope() #1" `

5. __Pick two more issues__ and solve them in the same way. Create commits for each one and push them to GitHub.

6. __Create new issues:__ Come up with a new functionality for your class, propose new classes, report bugs, etc. Create a new issue describing it on the [GitHub Issues](https://github.com/geoscripting/03_scientific_programming/issues) page of the main repository.


## Resources

[Jake VanderPlas: Videos on Reproducible Data Analysis in Jupyter](https://www.youtube.com/watch?v=_ZEWDGpM-vM&list=PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ)

[Jake VanderPlas: Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html)
    
[Pietro Berkes: Testing, Debugging, Profiling](https://github.com/ASPP/testing_debugging_profiling)

[Software Design for Maintainability](http://gael-varoquaux.info/programming/software-design-for-maintainability.html)

[Improving your programming style in Python](http://gael-varoquaux.info/programming/improving-your-programming-style-in-python.html)

[From Python to Numpy](https://www.labri.fr/perso/nrougier/from-python-to-numpy/)

[SciPy Lectures: Optimizing code](https://scipy-lectures.org/advanced/optimizing/index.html)