# Building Open Source Geochemical Research Tools in Python

<span id='authors'><b>Morgan Williams</b>, Louise Schoneveld, Steve Barnes and Jens Klump;</span>
<span id='affiliation'><em>CSIRO Mineral Resources</em></span>

[**Abstract**](./00_overview.ipynb) | 
**Intro**:
[Software in Geochem](./01_intro.ipynb#Software-in-Geochemistry),
[Development & Tools](./01_intro.ipynb#Development-Workflow-&-Tools) |
[**Examples**](./02_examples.ipynb):
[pyrolite](./021_pyrolite.ipynb),
[pyrolite-meltsutil](./022_pyrolite-meltsutil.ipynb),
[interferences](./023_interferences.ipynb),
[autopew](./024_autopew.ipynb)

## Key Messages

* Software is a key part of geochemical research
* Existing software is often relatively inflexible and opaque
* Developing open-source tools can address some of the existing challenges
* Community input and support is likely required to make this sustainable

## Software in Geochemistry

While software was once auxiliary to the collection of data and generation of knowledge, it is quickly becoming an essential component of scientific research. In geochemical research, we use software<sup><a id="a1">[*](#f1)</a></sup> to control instruments and collect data, to reduce this raw data to produce estimates of composition, to analyse and model our data, and to produce graphical representations which we can use to communicate the meaning and significance of all of this. Additionally, while it's not the focus of this presentation, research software can also be used in later tertiary education programmes, especially those which seek to introduce students to what research looks like in practice.

With regard to mass-spectrometry, software for data collection is often specific to individual instruments and in most cases proprietary, but tools for data analysis and visualisation can vary widely. Particularly in this space, there are opportunities to develop specialised tools to enhance and add value to geochemical research outputs, and potentially get more out of our data. Designing software specifically for transparency, reproducibility and accessibility will aid the interpretability of results and data reuse. In this presentation we make the case for developing tools for geochemical research with Python<sup><a id="a2">[†](#f2)</a></sup>, and in the open - ideally with contribution from a community of users.

While it’s not always the first option, developing your own software may be a productive path if you’re frequently wrangling with inflexible (and sometimes expensive) software which doesn’t meet your needs. With some planning and design, it is also a way to add value to research already being conducted, bring interoperability and standardisation, and contribute to the community.

<sup>
<p>   
<b id="f1">*</b>: Where we refer to software here, we are purposely using the term in a relatively broad sense, and intend to refer to the range of software-based tools which are used in geochemistry (i.e. everything from templated Excel spreadsheets to ecosystems of related professionally-developed standalone executable programs). <a href='#a1'>↩</a>
</p>
<p>
<b id="f2">†</b>: Note that while this presentation focuses on Python, this is largely for reasons of convenience rather than in support of its exclusive use. There are a number of languages which fit well with the ideas presented here and are approachable for those new to coding, notably including R and Julia. <a href='#a2'>↩</a>
</p>
</sup>

## Requirements and Challenges for Research Software

There are a number of key challenges for the development, adoption and sustainability of research software projects. Below we've listed a few of them with associated questions, and some potential solutions or advantages in developing community-driven open-source software.

<dl>
<dt><b>Development and Maintenance</b></dt>
<dd><em>Do I have the skills to develop this? Can we sustainably develop and maintain this?</em></dd>
    
One relatively flexible way of making tools available for use is through developing modular libraries or packages, which can be used as components in other software (which may, for example, add an additional user interface). By constructing modular software, not only can we focus on the core components we need and use for research, but we can build and version individual components separately.
    
By developing tools in high-level languages (e.g. Python, R, Julia) which do not need to be compiled, some of the complexity of learning and starting to write code is avoided. This allows researchers to relatively quickly build something which works, which can be further developed into a prototype or package.
    
<dt><b>Trust and Transparency</b></dt>
<dd><em>How do I know this does what I expect? How do I know if this is a bug?</em></dd>

Being open-source allows others to interrogate and debug a code base to understand how it works, and potentially contribute to bugfixes and patches where issues are identified.
    
A suite of automated tests which cover most of the code base can be used to demonstrate that most things are working as expected (at least for the expected use cases).
    
By attributing versions to successive releases, a consistent reference point is established for the identification of issues (e.g. "there's a bug with plot colors in v1.2.0, but it was fixed for v1.2.1"). A changelog is a useful addition to the documentation which effectively describes how the software evolves, and at which points issues are identified and fixed.
    
One key advantage of building these tools at a community level, or at least in interaction with the community, is that outcomes of discussions around best practice can be integrated into tools as defaults (and hence available without any additional effort from the user).
    
<dt><b>Accessibility and Documentation</b></dt>
<dd><em>How do I install this? Can I contribute? Does it work on my operating system? How do I do X?</em></dd>

Packaging code and publishing it in public repositories (e.g. PyPI for Python) allows for relatively straightforward installation. 
    
Documentation is a key facilitator for adoption of software, and will also typically help with sustainable maintenance and development.
    
Developing modular packages in e.g. Python reduces the chance of operating-system dependencies. Adding automated test suites which are run on multiple systems gives users more certainty around what potential issues might exist.

<dt><b>Governance</b></dt>
<dd><em>Who should have oversight? How do we encourage all parts of our community to contribute?</em></dd>
    
This is perhaps the key challenge for open-source projects, especially those developed by researchers. Typically a project has a single maintainer, and potentially a number of contributors.
    
<dt><b>Recognition</b></dt>
<dd><em>How do we ensure that researchers, developers and technicians receive attribution/credit for this work?</em></dd>

For this work to be sustainable, those developing research software should receive appropriate recognition and have the opportunity to develop ongoing careers - but the commonly used metrics in academic research don't necessarily translate well for software.
    
While there are efforts towards increased software citation, there are open questions around how e.g. maintenance is recognised.
</dl>

## Development Workflow & Tools

Given the requirements discussed above, below we highlight some of the key aspects of the development of research software, and some of the tools we've used to develop the packages we exhibit later. These include:
* An open and accessible codebase
* Versioning and packaging/deployment 
* Documentation
* A test suite which covers most of the code
* A platform for discussion and contribution

### Leveraging The Scientific Python Ecosystem

The scientific Python ecosystem is large and relatively mature; a wide variety of numerical, visualisation and utility packages exist which make for a solid foundation from which to build more specialised tools.

``pyrolite`` and related tools are built upon existing and widely used tools for working with tabular data (e.g. [pandas](https://pandas.pydata.org/)<sup>[[1]](#mckinney)</sup>) and visualisation (e.g. [matplotlib](https://matplotlib.org/)<sup>[[2]](#hunter)</sup>). As a result, ``pyrolite`` generally follows their conventions and syntax, and also exposes their API such that it can be readily accessed. In particular, the API makes use of dataframe accessor classes provided by ``pandas`` to add additional dataframe 'namespaces' (e.g. accessing the ``pyrolite`` spiderplot method via `df.pyroplot.spider()`). This approach allows ``pyrolite`` to use more familiar syntax, helping geochemists new to Python to hit the ground running, and encouraging development of transferable knowledge and skills.

<sup>
<p>
<a id="mckinney">[1]</a>: McKinney, W., 2010. Data structures for statistical computing in Python, in: van der Walt, S., Millman, J. (Eds.), Proceedings of the 9th Python in Science Conference. pp. 51–56.
</p>
<p>
<a id="hunter">[2]</a>: Hunter, J.D., 2007. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 9, 90–95. https://doi.org/10.1109/MCSE.2007.55
</p>
</sup>
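
As a rough illustration of the dataframe accessor mechanism described above, the sketch below registers a small custom namespace on ``pandas`` dataframes. The `geochem` namespace, the `GeochemAccessor` class and its `normalise` method are hypothetical names invented for this example and are not part of ``pyrolite``; only `pd.api.extensions.register_dataframe_accessor` is the actual ``pandas`` extension point.

```python
import pandas as pd


# "geochem" is a hypothetical namespace used for illustration only;
# it is not provided by pyrolite or pandas.
@pd.api.extensions.register_dataframe_accessor("geochem")
class GeochemAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the dataframe the accessor is attached to
        self._obj = pandas_obj

    def normalise(self, total=100.0):
        """Rescale each row so that its components sum to `total`."""
        df = self._obj
        return df.div(df.sum(axis=1), axis=0) * total


# Made-up compositions purely for demonstration
df = pd.DataFrame({"SiO2": [50.0, 60.0], "MgO": [10.0, 5.0], "FeO": [20.0, 15.0]})

# The new method is reached through the registered namespace
closed = df.geochem.normalise()
```

Because the method lives in its own namespace, it sits alongside the usual dataframe methods rather than replacing them, which is one reason this pattern suits domain-specific extensions.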

### Software Repositories and Packaging

**Repositories**: All of the software projects are hosted in remote git repositories on [GitHub](https://github.com/), together with associated documentation and test suites. Using git repositories allows for management of incremental changes to code and documentation, and in this case the GitHub repository also serves as a discussion board to identify issues and a point-of-contact for potential contributions.

**Versions**: Named versions correspond to 'releases', where each release incorporates all the changes and bugfixes since the last. Naming conventions largely follow [semantic versioning guidelines](https://semver.org/), where version numbers such as v2.4.3 correspond to major (2), minor (4) and patch (3) versions.
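
To show why the three-part scheme is useful for ordering releases, here is a minimal sketch which splits a version string into its major, minor and patch components and compares them numerically. The `Version` and `parse_version` names are hypothetical helpers written for this example, and the sketch assumes plain `vMAJOR.MINOR.PATCH` strings (pre-release or build suffixes allowed by the semantic versioning guidelines are not handled).

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a 'vMAJOR.MINOR.PATCH' string (the leading 'v' is optional)."""
    major, minor, patch = text.lstrip("v").split(".")
    return Version(int(major), int(minor), int(patch))


releases = ["v1.2.1", "v1.10.0", "v1.2.0"]

# Tuples compare element by element, so 1.10.0 correctly sorts after 1.2.1
# (a plain string comparison would get this wrong).
latest = max(releases, key=parse_version)
```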

**Branches, Releases and Packaging**: Git repositories are arranged such that they have a 'stable' master branch which corresponds to the latest release, and a potentially 'unstable' development branch where ongoing work occurs. These projects use a ['gitflow' workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow) to incorporate new changes into the next release. Once releases are generated, the repository is packaged and uploaded to the Python Package Index, [PyPI](https://pypi.org/), and typically an archive is created on [Zenodo](https://zenodo.org/), which then receives a DOI such that each version is readily citable.

### Documentation

The principal source of documentation for these packages is within the code itself - in the form of 'docstrings' which accompany each function and class, describing what they do, their inputs, outputs and, where relevant, appropriate references. From this, each time a new change is pushed to the repository, [Sphinx](https://www.sphinx-doc.org) is used to automatically generate a website (hosted on [readthedocs.org](https://readthedocs.org/)) which documents how the code is structured and how you can interact with it (i.e. the API).

This is supplemented with manually-created examples and tutorials to demonstrate how to use the packages, which are generated from code each time the documentation is built, such that examples incorporate all new changes by default. Further, the page includes a changelog listing all notable changes and releases, along with information regarding development, citation and contributing.

Finally, each branch (and version) has its own documentation, such that documentation largely keeps pace with the code base. For an example, see the [pyrolite documentation page](https://pyrolite.rtfd.io), one page of which is shown below:

![pyrolite documentation page](img/rtfd.png)
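
To make the idea of in-code documentation a little more concrete, below is a sketch of what a docstring describing a function's purpose, inputs and outputs might look like. The `ppm_to_wtpct` function is a hypothetical example rather than a function from any of these packages, and the section layout (Parameters / Returns / Notes) is one common convention which Sphinx can render with a suitable extension; it is shown here as an assumption about style, not as a requirement.

```python
def ppm_to_wtpct(ppm):
    """
    Convert a concentration from parts per million to weight percent.

    Parameters
    ----------
    ppm : float
        Concentration in parts per million (by mass).

    Returns
    -------
    float
        Concentration in weight percent.

    Notes
    -----
    One weight percent is equivalent to 10,000 ppm.
    """
    return ppm / 10_000.0


# The docstring travels with the function and can be read interactively,
# e.g. via help(ppm_to_wtpct), as well as by documentation generators.
docstring = ppm_to_wtpct.__doc__
```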

### Testing and Continuous Integration

Each of the software projects has an independent set of unit tests which systematically test the *expected* functionality of the code base.

While these tests are useful for local development, they're also used as part of the on-push pipeline (this is typically referred to as continuous integration and deployment/delivery, or CI/CD), where services automatically run this suite of tests each time changes are pushed to the hosted repository ([Travis CI](https://travis-ci.org/) is used for these projects; e.g. see the [page for pyrolite](https://travis-ci.org/github/morganjwilliams/pyrolite)). In our case the tests are run for a specified range of Python versions (similarly, a range of operating systems can be specified). In cases where tests fail, certain actions such as deployment or releases can be blocked.

These services provide detailed information regarding test results and *test coverage*, which can be used as one metric for assessing test gaps. This information is collected and displayed by a separate service, [Coveralls](https://coveralls.io), where line-by-line test coverage can be assessed to identify targets for increasing test coverage.
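
As a small illustration of the kind of unit tests discussed in this section, the sketch below tests a simple function using the standard library's `unittest` module. The `closure` function and the `TestClosure` class are hypothetical examples written for this purpose; the actual projects may organise and run their tests differently.

```python
import unittest


def closure(values, total=100.0):
    """Rescale a list of non-negative values so that they sum to `total`."""
    s = sum(values)
    if s <= 0:
        raise ValueError("values must have a positive sum")
    return [v / s * total for v in values]


class TestClosure(unittest.TestCase):
    def test_sums_to_total(self):
        # The rescaled values should add up to the requested total
        self.assertAlmostEqual(sum(closure([50.0, 10.0, 20.0])), 100.0)

    def test_preserves_ratios(self):
        # Rescaling should not change ratios between components
        out = closure([2.0, 1.0])
        self.assertAlmostEqual(out[0] / out[1], 2.0)

    def test_rejects_zero_sum(self):
        # Inputs which cannot be rescaled should raise a clear error
        with self.assertRaises(ValueError):
            closure([0.0, 0.0])


# Run the tests programmatically; a CI service would typically invoke
# a test runner from the command line instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestClosure)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test states one expectation about behaviour, so a failure points fairly directly at what has changed or broken.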

### Community Platforms and Contribution

``pyrolite`` and associated tools aim to be designed, developed and supported by the geochemistry community. For general chat, questions and debugging, we use [Gitter](https://gitter.im) (e.g. see the [channel for ``pyrolite``](https://gitter.im/pyrolite/community)), which nicely links to Issues and Pull Requests on GitHub for reference. This essentially serves as a searchable forum space, such that we can easily identify questions which appear repeatedly and improve the related documentation. GitHub serves as the centre of development activity for the project. It hosts an Issues board (for identifying bugs and issues, as well as requesting or discussing new features), and provides a foundation on which we can incorporate community contributions (e.g. via pull-request).

Community contributions to help develop these projects into useful toolkits and resources are welcomed and encouraged (for both research and education purposes). Information regarding contributions (of various flavours, from bug reports and documentation through to code contributions or even whole features) is presented on the documentation websites (e.g. [pyrolite's](https://pyrolite.readthedocs.io/en/master/dev/contributing.html)), along with a list of contributors and a Code of Conduct (which applies across the project generally; [see here](https://pyrolite.readthedocs.io/en/master/dev/conduct.html)).

### A Call Towards a Community Effort

While we're all currently moving towards more digital approaches to our research, it would seem an opportune time to combine efforts and reduce potentially duplicated work through collaborative development of shared tools. Not only can we work towards improved standardisation and interoperability, but we can incorporate best practices and modern data analysis tools into our software such that we make the most of our data. In an ideal world, a shared effort here might mean we can spend more of our time and money on the actual research, instrument maintenance, supporting technicians and researchers - but that may be a touch optimistic.

There remain a variety of challenges to developing software in a research context, and while the technical issues are often readily solved, challenges related to the community and developers will remain for some time. Software needs to be maintained to be sustained and to remain useful, and this is relatively human intensive (lest we be left with more bugs, less functional software and legacy issues). Approaching and discussing these challenges proactively would be a worthy investment to make such efforts more sustainable.
