# Reproducibly Capturing Code

The Python data science ecosystem is broad, ever-expanding and constantly evolving. This dynamism makes Python an excellent way to use state-of-the-art data science tools, but it can also make it hard to reproduce previously completed work as libraries and their APIs keep changing.

This notebook demonstrates an approach to taming this complexity in order to reproducibly capture code together with artefacts such as notebooks. In particular, we will use a tool called [anaconda-project](https://github.com/Anaconda-Platform/anaconda-project) to specify reproducible conda environments together with associated assets and executable command invocations. Taken together, these are called 'projects', where a project is a self-contained, reproducible unit of work that can be shared with colleagues or archived in the knowledge that it can be reproducibly executed at a later date.

First, we will use an example notebook to illustrate the problems a user commonly faces when they find a notebook online and want to run it.


## Illustrating the problem

To illustrate the problem, we will try to execute the `Exploring_Data.ipynb` notebook described in the 'Exploring Data in Notebooks' talk. This notebook can be found in the archive that you can download [here](https://anaconda.org/jlstevens/project/exploring-data).

We will now follow all the steps that a user might have to run before they can run the notebook from top to bottom in an empty conda environment.

### 1. Creating an empty environment

The first step is to use conda to create a new empty environment that we shall call `empty`:


```bash
$ conda create -n empty python
```

Once Python and the minimal set of dependencies are installed into the new environment, we can activate it:

```bash
$ conda activate empty
```

### 2. Installing and running the notebook server

At this point we realize that we cannot view the notebook without installing Jupyter notebook. We can now install `notebook` into our conda environment and run the `jupyter notebook` command to view it:

```bash
$ conda install notebook
$ jupyter notebook Exploring_Data.ipynb
```

Now we can view the notebook, but immediately we hit an error upon executing the first code cell:

```python
ModuleNotFoundError: No module named 'pandas'
```

Now we need to start populating the environment with the libraries needed by the notebook.

### 3. Install directly imported dependencies

Installing each imported library with conda one at a time is tedious, so at this point we might scan the notebook to see which libraries are imported with the Python `import` statement. Doing this, we spot `pandas`, `numpy`, `hvplot`, `holoviews` and `xarray`.

At this point we may use our knowledge that `hvplot` is built on `holoviews` to realize that installing `hvplot` will also install `holoviews`. We can therefore run the following command with or without specifying `holoviews`:

```bash
$ conda install pandas numpy hvplot holoviews xarray
```
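The manual scan for `import` statements can also be automated. The following is a minimal sketch, assuming the notebook is stored in the standard `.ipynb` JSON layout with a top-level `cells` list; the function name `imported_modules` is hypothetical and not part of any library. It only finds imports written literally in the code cells, so optional and dynamically loaded dependencies will not show up.

```python
import ast
import json


def imported_modules(notebook_json):
    """Return the sorted top-level module names imported in a notebook's code cells."""
    notebook = json.loads(notebook_json)
    modules = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # cell source may be stored as a list of lines
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python syntax.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            continue  # skip cells that still cannot be parsed as Python
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return sorted(modules)
```

Calling `imported_modules(open("Exploring_Data.ipynb").read())` would then list the candidates to install. Note that the names returned are import names, which do not always match conda package names.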

### 4. Install optional dependencies

At this point, we might think that everything is ready to go. However, if we rerun the notebook in the updated environment we hit two new errors.

1. ```python
   import hvplot.dask
   ...
   ModuleNotFoundError: No module named 'dask'
   ```

2. ```python
   from hvplot.sample_data import airline_flights
   ...
   ModuleNotFoundError: No module named 'intake'
   ImportError: Loading hvPlot sample data requires intake and intake-parquet. Install it using conda or pip before loading data.
   ```
   
These dependencies were not satisfied by the previous step because they are *optional*. First, if you want to use `hvplot` with `dask` you will need `dask` (even without that import, `dask` would have been needed to run the line `flights = airline_flights.to_dask().persist()`). Second, if you want to import the `airline_flights` example from `hvplot.sample_data`, you need the `intake` library.

Noting that the `ImportError` message also mentions `intake-parquet`, we can now run:

```bash
$ conda install dask intake intake-parquet
```
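Rather than discovering missing modules one traceback at a time, we can ask Python which of a list of modules it can locate in the active environment. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical. It expects top-level import names rather than conda package names, and it cannot detect missing plugins or compression codecs that only fail at runtime.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

For example, `missing_modules(["pandas", "numpy", "hvplot", "holoviews", "xarray", "dask", "intake"])` returns the subset that still needs to be installed in the environment the interpreter is running in.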

### 5. Install nested optional dependencies

Optional dependencies may themselves have optional dependencies. The following line first complained about "No module named 'intake'" but now it raises a different exception:

```python
from hvplot.sample_data import airline_flights
...
ValueError: No plugins loaded for this entry: xarray_image
```

At this point, we need to know intake well enough to know that the request for the `xarray_image` plugin is satisfied by the `intake-xarray` package. Now we need:

```bash
$ conda install intake-xarray
```

### 6. Install dynamic dependencies

At this point we might hope that all the dependencies are satisfied, and indeed the line `from hvplot.sample_data import airline_flights` finally executes without error. However, we are not done:


```python
flights = airline_flights.to_dask().persist()
flights.head()
...
RuntimeError: Decompression 'SNAPPY' not available.  Options: ['BROTLI', 'GZIP', 'UNCOMPRESSED']
```

Once again, we need to know how to address this message appropriately by making 'snappy' compression available. The required package is called `python-snappy`:

```bash
$ conda install python-snappy
```

Surely we are done conda installing?

### 7. Install chained nested dependencies

Unfortunately, we are not done due to one final error:

```python
flights.hvplot.scatter(x='distance', y='airtime', datashade=True)
...
ImportError: Datashading is not available
```

At this point we realize that `datashader` needs to be installed to use the `datashade=True` flag. Without carefully thinking about all the code in the notebook, it is tricky to realize this requirement earlier as it wasn't possible to create the `flights` DataFrame due to all the previous errors regarding missing dependencies. Finally we can conclude by running:

```bash
$ conda install datashader
```

## Possible solutions

It is clear that getting even a relatively simple notebook running without suitable instructions can be a painful process! As a notebook author, you want people to be able to easily run your code for themselves. Here are some commonly used ways to address this problem.


### Inline instructions

The simplest approach may be to collect the necessary conda install commands together and give instructions at the top of your notebook such as:

```bash
# To run this notebook, run the following commands:
$ conda create -n notebook-env python
$ conda activate notebook-env
$ conda install notebook pandas numpy hvplot holoviews xarray dask intake intake-parquet intake-xarray python-snappy datashader
$ jupyter notebook Exploring_Data.ipynb
```

This is rather a lot of lines for a user to follow and it is not robust: none of the versions are pinned which means that any new versions of these libraries might introduce API changes that break the notebook.


### Supplying an `environment.yml`

Another common approach is to ship the notebook with an `environment.yml` file containing all the necessary dependencies with version pins. While this results in a more reproducible environment than the instructions above, there are still several disadvantages:

1. You now have an extra file that is associated with the notebook that you need to ship to your users.
2. The environment name may clash with existing environment names and these environments persist. If you have a hundred projects, you need a hundred different environments!
3. The `environment.yml` file constantly needs to be updated as the notebook evolves. Testing that the `environment.yml` file is working involves constant testing of the notebook in freshly regenerated environments.
4. The command to run the notebook, `jupyter notebook Exploring_Data.ipynb`, still needs to be captured somewhere, possibly in the notebook itself.

### Using `anaconda-project`

The [anaconda-project](https://github.com/Anaconda-Platform/anaconda-project) utility addresses all the problems listed above with `environment.yml` and also avoids requiring the user to run lots of instructions. With `anaconda-project`, notebooks, file assets, environments and commands can all be packaged together into a single file that can be easily distributed.

The only requirement for using `anaconda-project` (for both project authors and users) is to install it into any existing environment with the following command:

```bash
$ conda install anaconda-project
```








## First cut at `anaconda-project.yml`

An `anaconda-project.yml` file declares all the information needed by `anaconda-project`. With the list of necessary packages identified above, we can have a first cut at this file:

```yaml
name: exploring-data
description: If you have a Pandas or Dask dataframe or an Xarray data
             array, you can now get fully interactive, composable plots as simply as
             "df.hvplot()".  We'll show how to use these tools for rapidly exploring
             datasets in Jupyter so that you can quickly spot outliers and
             inconsistencies while revealing the big picture.

channels: []

packages:
  - python
  - notebook
  - hvplot
  - intake
  - intake-parquet
  - intake-xarray
  - s3fs
  - datashader
  - python-snappy


env_specs:
  default: {}
```

Most fields here are self-explanatory: `name` is a short name given to the project while `description` allows for a long description of the project's aims and goals. `channels` lets you specify additional conda channels to draw packages from, such as `conda-forge` (if necessary). Finally, `env_specs` defines one or more environments; here we only need a single environment called `default` to install packages into.

With this saved as `anaconda-project.yaml` in the directory where `Exploring_Data.ipynb`, `Reproducibly_Capturing_Code.ipynb` and `diseases.csv.gz` reside, we can now prepare the environment using:

```bash
$ anaconda-project prepare
```

If this completes successfully, you will now have a new directory called `envs/default` containing the specified environment. You can activate this local environment if desired (and later leave it again with `conda deactivate`) as follows:

```bash
$ conda activate envs/default
```

With this local environment active, we can run the command `jupyter notebook` to test whether our two notebooks execute correctly. If all is well, we have completed the first cut at this project definition. In the next cut, we will make the project more robust and more user friendly.


## Second cut at `anaconda-project.yml`

We now have an environment definition that works, but we want to avoid having to tell users to activate environments and run special commands. In addition, we haven't pinned our packages, which means we cannot guarantee the correct versions will be installed at a later date.

### Pinning versions

Assuming that our notebooks ran correctly when manually activating `envs/default` after `anaconda-project prepare`, we can look at the conda install logs for all the versions used. Here is what I found:

```yaml
  - python=3.8.5
  - notebook=6.1.1
  - hvplot=0.6.0
  - intake=0.6.0
  - intake-parquet=0.2.3
  - intake-xarray=0.3.1
  - s3fs=0.5.0
  - datashader=0.11.1
  - python-snappy=0.5.4
```

We can simply replace the `packages` section of the `anaconda-project.yaml` with this block to pin all the package versions appropriately.
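Reading versions out of install logs by hand is error-prone, so the pins can also be generated. The sketch below assumes the environment listing has been saved as text in the `name=version=build` form printed by `conda list --export` (for the local environment, something like `conda list --prefix envs/default --export`); the helper name `pins_from_export` is hypothetical. It keeps only the packages we asked for explicitly and drops the build string.

```python
def pins_from_export(export_text, wanted):
    """Build '  - name=version' lines for the wanted packages.

    export_text is expected to contain 'name=version=build' lines;
    comment lines starting with '#' and blank lines are ignored.
    """
    versions = {}
    for line in export_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("=")
        if len(parts) >= 2:
            versions[parts[0]] = parts[1]
    # Preserve the order of 'wanted' and skip packages absent from the listing.
    return [f"  - {name}={versions[name]}" for name in wanted if name in versions]
```

The returned lines can be joined with newlines and pasted under `packages:`. Packages missing from the listing are silently skipped, so it is worth comparing the length of the result against the list of requested names.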


### Defining commands

When running `anaconda-project prepare` you may have noticed this warning:

```
Potential issues with this project:
  * anaconda-project.yaml: No commands run notebooks Exploring_Data.ipynb, Reproducibly_Capturing_Code.ipynb
```

This is telling us that we should define commands for users to run the notebooks, so that they do not have to activate the environment and run `jupyter notebook` themselves. To do this we insert the following into the `anaconda-project.yml`:


```yaml
commands:
  exploring:
    notebook: Exploring_Data.ipynb
  capturing:
    notebook: Reproducibly_Capturing_Code.ipynb
```

This `commands` block specifies two targets called `exploring` and `capturing` allowing us to run:

``` bash
$ anaconda-project run  # OR
$ anaconda-project run exploring # OR
$ anaconda-project run capturing
```

Running `anaconda-project run` picks the first defined target which means it is equivalent to `anaconda-project run exploring`.


# The final `anaconda-project.yml`

Now we have everything to write our final `anaconda-project.yml` definition:


```yaml
# To reproduce: install 'anaconda-project', then 'anaconda-project run'
name: exploring-data
description: If you have a Pandas or Dask dataframe or an Xarray data
             array, you can now get fully interactive, composable plots as simply as
             "df.hvplot()".  We'll show how to use these tools for rapidly exploring
             datasets in Jupyter so that you can quickly spot outliers and
             inconsistencies while revealing the big picture.

channels: []

packages:
  - python=3.8.5
  - notebook=6.1.1
  - hvplot=0.6.0
  - intake=0.6.0
  - intake-parquet=0.2.3
  - intake-xarray=0.3.1
  - s3fs=0.5.0
  - datashader=0.11.1
  - python-snappy=0.5.4

commands:
  exploring:
    notebook: Exploring_Data.ipynb
  capturing:
    notebook: Reproducibly_Capturing_Code.ipynb

env_specs:
  default: {}
```

The comment at the top now tells the user everything they need to run the project:

``` bash
$ conda install anaconda-project # If not already installed
$ anaconda-project run
```

That's it!

# Archiving and distributing the project

Now that you have an `anaconda-project.yaml`, you will want to get the project into the hands of users who can run it. The project consists of a directory containing several key files:

- `anaconda-project.yml`: The project definition containing the name, description, environment and commands.
- `*.ipynb`: The notebooks you want the user to run (`Exploring_Data.ipynb`, `Reproducibly_Capturing_Code.ipynb` in this case)
- `diseases.csv.gz`: A supplementary data file needed by the `Exploring_Data.ipynb` notebook.

We can now capture all these files in a single archive that can be given to users by running:

```bash
$ anaconda-project archive ../exploring_data.tar.gz
```

Now we can distribute `exploring_data.tar.gz` as a self-contained, reproducible project. The only reason to create the archive in the parent directory is so that `exploring_data.tar.gz` doesn't get captured if you run the archive command again. Note that the file size is low as `anaconda-project` skips over the `envs/default` directory when packing the files together.
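Before sending the archive to anyone, it can be reassuring to check what actually went into it. The following is a rough sketch using the standard library's `tarfile` module, assuming the archive is in a tar format that `tarfile` can read; the helper names `archive_listing` and `environment_files` are hypothetical.

```python
import tarfile


def archive_listing(archive_path):
    """Return the names of all members stored in a tar archive."""
    with tarfile.open(archive_path, "r:*") as archive:  # "r:*" auto-detects compression
        return archive.getnames()


def environment_files(names):
    """Return the member names that have an 'envs' directory component."""
    return [name for name in names if "envs" in name.split("/")]
```

With these, `environment_files(archive_listing("../exploring_data.tar.gz"))` should come back empty if the local environment was skipped as described, while `archive_listing` on its own shows whether the notebooks and data file are present.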

## Uploading to anaconda.org

If you have an [anaconda.org](http://anaconda.org/) account, you can also run:

```bash
$ anaconda-project upload
```

This will upload the archive and give you a shareable URL to access the project. This is how this very notebook was [distributed to you!](https://anaconda.org/jlstevens/project/exploring-data)



# Next steps

This notebook illustrates a minimal example of a project, namely to share a pair of notebooks. However, `anaconda-project` has many more features you can read about in the [docs](https://anaconda-project.readthedocs.io/) including:

* The ability to define multiple environments per project (e.g. a test environment in addition to a production environment).
* The ability to define and pass environment variables to the processes run by commands.
* The ability to fetch online data files before executing commands.
* The ability to run arbitrary commands on both Unix systems and on Windows.

Taken together, these features make `anaconda-project` very flexible and usable for a great deal more than simply running notebook servers. For instance, you can use it as a way of deploying to remote servers to run server processes such as Flask apps, REST APIs or interactive web-based dashboards. Using local environments also makes it easy to set up `anaconda-project` with CI: simply executing `anaconda-project run` commands will help ensure that the environments specified in the `anaconda-project.yml` are able to support the necessary commands.
