# Notes 

Some notes on stuff for SSI workshops

Stuff to cover:
* Version control
* Testing
* Continuous integration
* Code coverage
* Documentation
* Publishing code (i.e. stuff to do with Makefiles, Docker, pypi)
* Profiling? (i.e. I love snakeviz and think more people should use it)

## Version control

### Why?
* Allows you to keep comprehensive record of all changes made during project - very useful if make a change that breaks something!
* Keeps a backup of files - useful if need to e.g. recreate some data (like a plot) that made several months ago, as can easily 'rewind' code back to the state it was then
* Also allows you to make changes without worrying about breaking something
* If working in a group project, helps prevent different members of the group overwriting each other's changes, and keeps a record of who wrote what
* No need for millions of different versions of files named e.g. `version_1.txt`, `version_2.txt`, ..., `version_27.txt`, `FINAL_version.txt`, `FINAL_FINAL_version.txt`, `FINAL_FINAL_PRINT_version.txt`.

### How?
* git most popular tool for local version control
* Alternatives: Bazaar, Subversion (SVN), Mercurial...
* Online tools for hosting repositories such as GitHub, Bitbucket, SourceForge, Launchpad

## Testing

### Why?
* In experimental Sciences, in order to show a result is reliable, the experimental setup will be tested in order to show that it is working as it was designed and so as to eliminate or quantify any systematic errors
* In computational Science, we should apply the same principles to our code: a result can only be trusted if the code that produced it has undergone testing which demonstrates that it works as designed
* Testing from the very start of development can help catch errors early on, when they are much easier to fix, rather than waiting until the code has become much more complex and errors are therefore much harder to track down.

### How?
* Scientific codes can be difficult to test as by their very nature they will often be looking at systems where the behaviour is to some extent unknown
* If code has been built in a modular way (i.e. broken down into subfunctions and subroutines, rather than several thousand lines of code all written as one function), then can write unit tests for individual functions to check that they work in isolation - if that is so, much more likely that they will work together. This also helps track down the source of errors, as can pinpoint where they occur
* If code passes unit tests, can then write *integration tests* which verify functions work well together
* For many systems, there is often a 'control' - e.g. if modelling a physical system's evolution with time, there will be a set of initial data for which the system is stable and so if the code is evolved the solution should remain the same. Tests can be written which give the code this initial data, evolve the system and check that the solution has not changed. Similarly, in physical systems there are often symmetries and conserved quantities present - can write tests that check that the code preserves these. If all else fails, it is likely that at the very least it will be known the output should fall within some range of acceptable values - tests can be written to check this.
* It's important to check that code breaks as it should as well! If your code ends up producing something unphysical, it should be able to identify this and deal with it appropriately. If you never test this, will be much harder to make sure code is operating properly when it's dealing with real data. A good way of doing this is testing *edge/corner cases* - using input data at the very limits of the valid range and checking tests pass/fail as expected
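
The ideas above (unit tests, a stable 'control' solution, conserved quantities, and checking that code fails as it should) can be sketched for a toy problem. This is a minimal illustration only: the harmonic oscillator, the leapfrog integrator and all function names are assumptions chosen for the example, not part of any particular code.

```python
import math


def leapfrog_step(x, v, dt, omega=1.0):
    """Advance a harmonic oscillator (x'' = -omega**2 * x) by one time step."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    v_half = v - 0.5 * dt * omega**2 * x          # half kick
    x_new = x + dt * v_half                       # drift
    v_new = v_half - 0.5 * dt * omega**2 * x_new  # half kick
    return x_new, v_new


def evolve(x, v, dt, n_steps, omega=1.0):
    """Apply leapfrog_step repeatedly."""
    for _ in range(n_steps):
        x, v = leapfrog_step(x, v, dt, omega)
    return x, v


def energy(x, v, omega=1.0):
    """Total energy per unit mass."""
    return 0.5 * v**2 + 0.5 * omega**2 * x**2


def test_single_step():
    # Unit test: one step checked against a hand calculation.
    x, v = leapfrog_step(1.0, 0.0, 0.1)
    assert math.isclose(x, 0.995)
    assert math.isclose(v, -0.09975)


def test_equilibrium_is_stable():
    # 'Control' test: a system at rest should stay at rest.
    assert evolve(0.0, 0.0, 0.1, 1000) == (0.0, 0.0)


def test_energy_is_conserved():
    # Conserved quantity: energy should not drift over many steps.
    x, v = evolve(1.0, 0.0, 0.01, 10_000)
    assert math.isclose(energy(x, v), energy(1.0, 0.0), rel_tol=1e-3)


def test_invalid_time_step_is_rejected():
    # Edge case: the code should fail loudly on unphysical input.
    try:
        leapfrog_step(1.0, 0.0, -0.1)
    except ValueError:
        pass
    else:
        raise AssertionError("negative dt was accepted")
```

If saved in a file whose name starts with `test_`, running `pytest` in that directory should collect and run the four `test_*` functions.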

### Tools
* Depends on language - in some languages, there are libraries which help automate the testing process
* In Python, can use `pytest` (`nose` also exists, but is no longer actively maintained)
* In languages like C/C++, check for memory leaks using `valgrind`

## Continuous integration

### Why?
* So you've written a set of tests for your code, you run them and everything passes - great! However, you then go back to work on your code and quickly forget about testing it. Eventually, a few months later, after implementing several new features, you remember to try testing your code again. You run the set of tests, only to find that they fail.
* Solution: continuous integration.
* This will run your tests for you regularly (e.g. every night, every time you push changes to a repository) and report back to you the results
* Can now spot (almost) instantly when code breaks 

### How?
* Several tools out there: `Travis CI`, `Jenkins`, `CircleCI`
* Involve writing a short script which details the computational setup (i.e. any libraries needed) and what code should be run to execute tests

## Code coverage
### Why?
* So you have a test suite and you're using continuous integration to run it regularly
* However, how do you know that you are testing all parts of your code? It's all very well to test a few auxiliary functions, but if you're not testing the main part of the code then you still cannot trust the results
* Solution: code coverage
* This will track what parts of the code are being run when tests execute and will highlight areas not currently being tested
* Generally want to aim for > 90% code coverage

### How?
* There exist libraries for most languages that will produce code coverage reports, e.g. `coverage.py` for python, `gcov` for C/C++, `tcov` for C/C++/fortran
* Can use tools like Codecov to integrate these tools with continuous integration, providing an easy-to-use interface to analyse code coverage and keep track of it as the code develops
* These tools are also particularly useful if code is written in multiple languages, as will combine reports produced for each of the different languages
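
To make the idea concrete, here is a toy sketch of what a coverage tool does under the hood: record which lines run, then report the ones that never did. It uses the standard library's `sys.settrace` rather than `coverage.py` itself, and the `classify` function, the `lines_executed` helper and the hard-coded set of body lines are all assumptions made for the illustration. In practice you would use a real coverage tool instead.

```python
import sys


def classify(x):
    if x < 0:
        return "negative"
    if x == 0:
        return "zero"
    return "positive"


def lines_executed(func, *args):
    """Return the line offsets (relative to the `def` line) run inside func."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only follow frames belonging to the function under inspection.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return hit


# Offsets of the five statements in classify's body (hard-coded for simplicity).
ALL_BODY_LINES = {1, 2, 3, 4, 5}

# A 'test suite' that only ever tries a positive number.
covered = lines_executed(classify, 5) & ALL_BODY_LINES
missed = sorted(ALL_BODY_LINES - covered)
percent = 100 * len(covered) / len(ALL_BODY_LINES)
print(f"covered {percent:.0f}% of classify; untested line offsets: {missed}")
```

Here only three of the five body lines run, so the two `return` statements for negative and zero input show up as untested.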

## Documentation
### Why?
* Simply: why should someone trust your code if they have no idea how it works?
* In experimental sciences, experiments must be reproducible: given a description of the apparatus and experimental setup, it should be possible for someone else to replicate results.
* Similarly, for computational sciences, given the algorithm and information about how it's been implemented in your code, it should be possible for someone else to write their own code and replicate your results.
* It should also be possible for a new user to be able to read and understand what your code is doing without any extra explanation from you
* You should be able to look back at code you wrote weeks/months/years ago and still understand how it works

### How?
* Always write your code as if someone else is going to be using it at a later date. No matter how small the project may be at the start, often it will turn into something much bigger / be incorporated later on into another project
* Useful for code to be to some extent 'self-documenting' - e.g. classes, functions and variables given sensible, descriptive names that make it clear what they are doing
* Tools such as `sphinx` and `doxygen` will generate documentation for you from docstrings and annotations
* Services such as `Read The Docs` will host a repository's documentation and work with version control/continuous integration services to rebuild it when you push changes
* Tools such as Python's `doctest` will run the usage examples written in your docstrings as tests, so the examples in your documentation are checked against the code
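
A minimal sketch of a documented function, assuming a made-up `kinetic_energy` function and a NumPy-style docstring layout (one common convention, not the only one). The `>>>` example in the docstring doubles as a test that `doctest` can run.

```python
import doctest


def kinetic_energy(mass, speed):
    """Return the kinetic energy of a body.

    Parameters
    ----------
    mass : float
        Mass in kilograms.
    speed : float
        Speed in metres per second.

    Returns
    -------
    float
        Kinetic energy in joules.

    Examples
    --------
    >>> kinetic_energy(2.0, 3.0)
    9.0
    """
    return 0.5 * mass * speed**2


if __name__ == "__main__":
    # Run every example found in this module's docstrings.
    print(doctest.testmod())
```

If the function is later changed so that the example no longer gives `9.0`, `doctest` reports a failure, which flags that the documentation and the code have drifted apart.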

## Publishing code
* At some point, it is likely you're going to want to publish your code or give your code to another person
* You want to make it as easy as possible for the other person to install and run your code
* To do this, need to use tools to enable easy installation, replication of runtime environment, distribution of code

### Make
* Make files can be used to automate pipelines
* Useful to automate installation, including compilation, building of documentation, running test suites
* Makefiles can quickly get unwieldy for large projects and if cross-platform support is needed. In such cases, automated build systems such as autotools and CMake can be used

### Containers
* Allow replication of entire runtime environment: application, dependencies, libraries, configuration files etc
* Not only does this make it much easier and faster for someone else to run your code, it also means that their 'experimental setup' is the same as yours, e.g. they're running the same versions of libraries. It means they can also run code on their system regardless of their operating system: as long as they have Docker installed, they could run your Linux code on a Mac or even a Windows machine
* Container tools like `Docker` are much more lightweight than virtual machines, which contain an entire operating system: a container image may be a few tens of megabytes, whereas a virtual machine could be several gigabytes

### Distributing code
* For Python, can upload code to the Python Package Index, `PyPI`, which will allow anyone else to install code using `pip`
* If packaged code in a Docker image, can add to `Docker Hub`, a cloud-based registry service which will allow others to easily find your code
* Make sure your code has some kind of software license (e.g. BSD, MIT, GPL) - see http://choosealicense.com/
* Give your code a DOI - e.g. using `zenodo`

## Profiling
### Why?
* For most codes, execution time varies hugely across different parts of the code
* When optimising a code's performance, it therefore makes sense to focus on parts of the code where execution time is the longest
* These sections can be located using profiling tools

### How?
* Depends on language - in python, can use `cProfile`, in C/C++ can use `gprof`, `valgrind`
* Like code coverage tools, profilers will often generate reports. For Python, can use `snakeviz` to view these reports as a sunburst chart, making it easy to identify the functions in which your code spends the most time
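
A short sketch of profiling with `cProfile`, assuming two made-up functions (`slow_sum` and `fast_sum`) that compute the same sum of squares in different ways. The report is sorted by cumulative time so the most expensive calls appear first.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Sum of squares 0**2 + 1**2 + ... + (n-1)**2 using an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Same sum, using the closed-form formula."""
    return (n - 1) * n * (2 * n - 1) // 6


def main():
    return slow_sum(200_000), fast_sum(200_000)


# Profile only the call to main().
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Write the five most expensive entries (by cumulative time) to a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

To explore the same data graphically, the profile can be saved with `profiler.dump_stats("profile.prof")` and opened with `snakeviz profile.prof` (the filename is arbitrary).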