# IceNet: Basic Command-Line Usage

## Context

### Purpose
The IceNet library provides the ability to download, process, train and predict from end to end. Users can interact with IceNet either via the python interface (see notebook 3: library usage) or via a set of command-line interfaces (CLI) which provide a high-level interface.

This notebook illustrates the CLI utilities that are available natively from the library for testing and producing operational forecasts. Via this interface, users can specify data inputs, process data, train models, use them for predictions and process the outputs.

### Modelling approach
This modelling approach allows users to immediately utilise the library for producing sea ice concentration forecasts.

### Highlights
The key stages of an end to end run are: 
* [1. Setup](#1.-Setup)
* [2. Download](#2.-Download)
* [3. Process](#3.-Process)
* [4. Train](#4.-Train)
* [5. Predict](#5.-Predict)

### Contributions
#### Notebook

James Byrne (author)

David Wilby

__Please raise issues [in this repository](https://github.com/icenet-ai/icenet-notebooks/issues) to suggest updates to this notebook!__ 

Contact me at _jambyr \<at\> bas.ac.uk_ for anything else...

#### Modelling codebase
James Byrne (code author), Tom Andersson (science author)

#### Modelling publications
Andersson, T.R., Hosking, J.S., Pérez-Ortiz, M. et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat Commun 12, 5124 (2021). https://doi.org/10.1038/s41467-021-25257-4

#### Involved organisations
The Alan Turing Institute and British Antarctic Survey

## 1. Setup

### Prerequisites

In order to execute the IceNet CLI tools in this notebook you will need:

* An internet connection for downloading the source data at the beginning of the notebook,
* A suitable place to run this jupyter notebook such as:
  * Running `jupyter notebook` or `jupyter lab` on your computer ([see the jupyter project page for more](https://docs.jupyter.org/en/latest/install.html)),
  * A jupyterhub instance,
  * A development environment such as [visual studio code](https://code.visualstudio.com/) which [can run jupyter notebooks](https://code.visualstudio.com/docs/datascience/jupyter-notebooks) (Note: for vscode, you'll need to install `ipykernel` in our conda environment later on), or
  * A [Google colab](https://colab.research.google.com/) instance.
* A working installation of [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html),
* GPU(s) for training (given the size of the network, training on CPU is unrealistic), although predictions will run fine on CPUs,
* Knowledge of [Git](https://swcarpentry.github.io/git-novice/), [python](https://swcarpentry.github.io/python-novice-inflammation/) and [shell](https://swcarpentry.github.io/shell-novice/) (links to Carpentries courses on these topics)
* There are a few external facilities that we interface with, which you will need to set up if you haven't already.
  * Data sources under [Climate and Sea Ice Data](#Climate-and-Sea-Ice-Data) including an account and API token for the [Climate Data Store](https://cds.climate.copernicus.eu/#!/home) (detailed later)
  * [Wandb](https://wandb.ai/) (Weights and Biases) - which can __optionally__ be used during training for monitoring.

We'll assume that you're running in a local copy of `icenet-notebooks` for this tutorial, and __that one directory up we can deposit other repositories and folders__. If you already have some previous IceNet data available (as we do in `../data`) then you can symlink to it using `ln -s ../data`. The reason for this is described further below, as is the creation of this folder if it doesn't exist.

e.g.

```
my-icenet-project/
├── data/
└── icenet-notebooks/   <--- we're in here!
```

### Environment Configuration

We recommend running IceNet (or any python code) in a virtual environment. Here we will use `conda` to create a virtual environment containing `python` and `icenet`:

1. First create a conda environment if you don't have one already:  In a shell (not in this notebook), run `conda create -n icenet python=3.11` which creates an environment named `icenet` and installs python 3.11 within it. Follow the prompts at your terminal to complete creation of the environment.
1. Activate your environment: `conda activate icenet`,
1. Check that your environment has activated correctly: `which python` should return a path to a python installation corresponding to your new environment (e.g. it should say `icenet` in it somewhere),
1. Use `pip` to install icenet from the Python Package Index (PyPI): `pip install icenet` which should install the IceNet package and most of its dependencies. (`conda` can be used later to install some other dependencies)


#### Commands

Once the icenet library is installed, you'll be able to access all commands made available by the library. Some are utilities that won't be covered, but using `icenet_<TAB>`-complete you should be able to see a list that includes (but ___is not limited to___):

* `icenet_data_cmip`
* `icenet_data_era5`
* `icenet_data_hres`
* `icenet_data_masks`
* `icenet_data_sic`
* `icenet_dataset_create`
* `icenet_output`
* `icenet_predict`
* `icenet_process_cmip`
* `icenet_process_era5`
* `icenet_process_hres`
* `icenet_process_metadata`
* `icenet_process_sic`
* `icenet_train`
* `icenet_plot_forecast`

All of these commands are either directly or indirectly (through pipeline shell scripts) used in this notebook...

All commands accept options such as `-v` for turning on verbose logging and `-h` for obtaining help about what options they offer. ___As with many shell commands, use `-h` to obtain information about options___.
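If tab completion is not available (for example from inside this notebook), a short Python check can confirm that the commands are on your `PATH`. This is a minimal sketch using only the standard library; the list is a subset of the commands named above, and `missing_commands` is a helper name introduced here for illustration, not part of IceNet.

```python
import shutil

# A subset of the IceNet commands listed above.
ICENET_COMMANDS = [
    "icenet_data_masks",
    "icenet_data_era5",
    "icenet_data_sic",
    "icenet_process_era5",
    "icenet_process_sic",
    "icenet_process_metadata",
    "icenet_dataset_create",
    "icenet_train",
    "icenet_predict",
    "icenet_output",
    "icenet_plot_forecast",
]


def missing_commands(names):
    """Return the command names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_commands(ICENET_COMMANDS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed IceNet commands are available.")
```

If anything is reported as missing, check that the `icenet` conda environment is the one active for this notebook's kernel.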

### CLI vs Library vs Pipeline usage

The IceNet package is designed to support automated runs from end to end by exposing the above CLI operations. These are simple wrappers around the library itself, and __any__ step of this can be undertaken manually or programmatically by inspecting the relevant endpoints. 

IceNet can be run in a number of ways: from the command line, the python interface, or as a pipeline.

The rule of thumb to follow: 

* Use the [pipeline repository](https://github.com/icenet-ai/icenet-pipeline) if you want to run the end to end IceNet processing out of the box.
* Adapt or customise this process using `icenet_*` commands described in this notebook and in the scripts contained in the [pipeline repo](https://github.com/icenet-ai/icenet-pipeline).
* For ultimate customisation, you can interact with the IceNet repository programmatically (which is how the CLI commands operate.) For more information look at the [IceNet CLI implementations](https://github.com/JimCircadian/icenet2/blob/main/setup.py#L32) and the [library notebook](03.library_usage.ipynb), along with the [library documentation](#TODO). 

## 2. Download

Now we can get started, with the first step of downloading the data.

### Mask data

IceNet relies on some downloaded and generated masks for training/prediction, which can be automatically generated very easily using `icenet_data_masks {north,south}`. Once performed, this does not need to be rerun since the generated masks are fixed across years.

In [None]:
!icenet_data_masks south

This command creates the following directories/files.

<details>
  <summary>Directory structure</summary>

```
icenet-notebooks/ 
    └── data/masks/south/
        ├──  masks/
        │   ├──  active_grid_cell_mask_01.npy <--- Mask for the active regions to consider for each month (This is for Jan)
        │   ├──  active_grid_cell_mask_02.npy <--- Mask for Feb
        │   ├──  ...
        │   ├──  check.py
        │   ├──  land_mask.npy  <--- This masks the land regions
        │   └──  masks.params   <--- This stores details relating to the "polar hole"
        └──  siconca/           <--- These are temporarily downloaded data used to generate the above masks
            └──  2000/
                ├──  01/
                │   └──  ice_conc_sh_ease2-250_cdr-v2p0_200001021200.nc
                ├──  02/
                │   └──  ice_conc_sh_ease2-250_cdr-v2p0_200002021200.nc
                └── .../
                    └── .../
```
</details>
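Before moving on, it can be useful to confirm that the mask files were written. The following is a minimal sketch, assuming the layout shown above with one `active_grid_cell_mask_MM.npy` file per calendar month (the tree only shows the first two explicitly) and the `data/masks/south/masks` path relative to this notebook. `expected_mask_files` and `missing_mask_files` are hypothetical helpers, not IceNet functions.

```python
from pathlib import Path


def expected_mask_files():
    """File names inferred from the directory tree above (assumes 12 monthly masks)."""
    monthly = [f"active_grid_cell_mask_{month:02d}.npy" for month in range(1, 13)]
    return monthly + ["land_mask.npy"]


def missing_mask_files(mask_dir):
    """Return the expected mask file names that are absent from mask_dir."""
    mask_dir = Path(mask_dir)
    return [name for name in expected_mask_files() if not (mask_dir / name).exists()]


# Assumed location, relative to the notebook directory.
missing = missing_mask_files("data/masks/south/masks")
print("Missing mask files:", missing if missing else "none")
```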


### Climate and Sea Ice Data

Obtaining and preparing data is achieved using the `icenet_data_*` commands, which share the common arguments `hemisphere`, `start_date` and `end_date`. You need to __configure the [CDS API](https://cds.climate.copernicus.eu/) token yourself__ (see [here](https://cds.climate.copernicus.eu/api-how-to) for instructions on registering and on how to use the CDS API). There are also implementation-specific options worth reviewing under `--help`. We specify the variables and levels via these commands.

_Please ignore "NOT IMPLEMENTED YET", this is indicative of the commands not checking before overwriting files._

__The `-d` (`--dont-delete`) flag keeps the downloaded data so that it is not downloaded again each time.__

___Even small data ranges like this can take a while to retrieve (each variable in this case, for four months, is 3GB, so may take up to an hour.) Please refer to [CDS requests page](https://cds.climate.copernicus.eu/cdsapp#!/yourrequests) to monitor ERA5 downloads...___
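The `cdsapi` client reads its credentials from a `.cdsapirc` file in your home directory containing `url:` and `key:` entries. The sketch below checks that such a file is present and has both entries, without printing the token. `parse_cdsapirc` is a helper defined here for illustration only, and the check makes no attempt to validate the credentials themselves.

```python
from pathlib import Path


def parse_cdsapirc(text):
    """Parse 'name: value' lines into a dict, ignoring lines without a colon."""
    settings = {}
    for line in text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            settings[name.strip()] = value.strip()
    return settings


rc_path = Path.home() / ".cdsapirc"
if rc_path.exists():
    settings = parse_cdsapirc(rc_path.read_text())
    absent = [name for name in ("url", "key") if not settings.get(name)]
    print("Missing entries:", absent if absent else "none")
else:
    print(f"{rc_path} not found; create it with your CDS url and key.")
```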

`icenet_data_era5` downloads <abbr title="ERA5 provides hourly estimates of a large number of atmospheric, land and oceanic climate variables. The data cover the Earth on a 30km grid and resolve the atmosphere using 137 levels from the surface up to a height of 80km. ERA5 includes information about uncertainties for all variables at reduced spatial and temporal resolutions.">ERA5</abbr> (the fifth-generation European Centre for Medium-Range Weather Forecasts reanalysis) data. For more information on the ERA5 data, [see the Copernicus page](https://climate.copernicus.eu/climate-reanalysis).

In [5]:
# Please note that on some systems running the help commands can take 30 seconds or more the first time it is run.
!icenet_data_era5 --help

usage: icenet_data_era5 [-h] [-c {cdsapi,toolbox}] [-w WORKERS] [-po] [-d]
                        [-v] [--vars VARS] [--levels LEVELS] [-n] [-p]
                        {north,south} start_date end_date

positional arguments:
  {north,south}
  start_date
  end_date

optional arguments:
  -h, --help            show this help message and exit
  -c {cdsapi,toolbox}, --choice {cdsapi,toolbox}
  -w WORKERS, --workers WORKERS
  -po, --parallel-opens
                        Allow xarray mfdataset to work with parallel opens
  -d, --dont-delete
  -v, --verbose
  --vars VARS           Comma separated list of vars
  --levels LEVELS       Comma separated list of pressures/depths as needed,
                        use zero length string if None (e.g. ',,500,,,') and
                        pipes for multiple per var (e.g. ',,250|500,,'
  -n, --do-not-download
  -p, --do-not-postprocess


The `--vars` flag is used for specifying the variables from ERA5 that we want as follows:

* `uas`: `10m_u_component_of_wind` (10m eastward wind velocity component),
* `vas`: `10m_v_component_of_wind` (10m northward wind velocity component),
* `tas`: `2m_temperature` (near-surface air temperature),
* `zg`: `geopotential_height` (geopotential height).

These are the four we'll use here, though others are available.

`--levels` specifies the pressure levels requested. Here we use the string `,,,500|250` to request `None` for our first three variables and both 500 and 250 for our fourth variable, `zg`, using the syntax `500|250`.
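To see how the two comma-separated strings line up position by position, here is a small illustrative sketch. It is not IceNet's own parser; `pair_vars_with_levels` is a hypothetical helper that only mirrors the syntax described in the help text (empty slot for no levels, pipes for multiple levels).

```python
def pair_vars_with_levels(var_arg, level_arg):
    """Pair each variable name with its list of levels, or None for an empty slot."""
    names = var_arg.split(",")
    slots = level_arg.split(",")
    if len(names) != len(slots):
        raise ValueError(
            f"{len(names)} variables but {len(slots)} level slots were supplied"
        )
    return {
        name: [int(level) for level in slot.split("|")] if slot else None
        for name, slot in zip(names, slots)
    }


print(pair_vars_with_levels("uas,vas,tas,zg", ",,,500|250"))
# {'uas': None, 'vas': None, 'tas': None, 'zg': [500, 250]}
```

Four variables need four slots, which is why the string contains three commas before `500|250`.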

Finally, we pass start and end dates for our query. These are in the format `yyyy-mm-dd`, though for single digits you can omit the leading 0. So both of these are equivalent and valid: `2020-1-1` or `2020-01-01`.
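Python's own `datetime.strptime` treats the two spellings the same way, which gives a quick way to sanity-check dates before passing them to a command. This is a sketch of the format only; how IceNet parses the arguments internally is not shown here, and `parse_cli_date` is a name made up for this example.

```python
from datetime import datetime


def parse_cli_date(text):
    """Parse a yyyy-mm-dd string; month and day may omit the leading zero."""
    return datetime.strptime(text, "%Y-%m-%d").date()


print(parse_cli_date("2020-1-1") == parse_cli_date("2020-01-01"))
# True
```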

In [4]:
!icenet_data_era5 south -d --vars uas,vas,tas,zg --levels ',,,500|250' 2020-1-1 2020-4-30

`icenet_data_sic` downloads the Sea Ice Concentration (SIC) data from the Ocean and Sea Ice Satellite Application Facility (OSI SAF).

In [None]:
!icenet_data_sic --help

To run `icenet_data_sic` you will need the `eccodes` package installed. If you have used `pip` to install IceNet, you will need to use `conda` to install `eccodes` by running `conda install -c conda-forge eccodes` at the command line. Alternatively, ECMWF provides instructions for installing eccodes [here](https://confluence.ecmwf.int/display/ECC/ecCodes+installation#ecCodesinstallation-Python3bindings).

In [None]:
!icenet_data_sic south -d 2020-1-1 2020-4-30

By default, the IceNet commands regrid and rotate data as required to align with the OSISAF SIC data, which is used as the output for the dataset. Programmatic usage allows you to avoid this (see [03.library_usage](03.library_usage.ipynb)).

The following downloaders are available:

* `icenet_data_era5` - downloads [ERA5 reanalysis](https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset&keywords=((%20%22Product%20type:%20Reanalysis%22%20))) data using either the CDS Toolbox or direct API
* `icenet_data_cmip` - downloads the prescribed experiments from [CMIP6](https://esgf-node.llnl.gov/search/cmip6/) for the original IceNet paper runs
* `icenet_data_hres` - downloads up to date [forecast generated data from the ECMWF MARS API](https://www.ecmwf.int/en/forecasts/datasets/catalogue-ecmwf-real-time-products)
* `icenet_data_sic` - downloads [OSISAF sea-ice concentration (SIC) data](https://osisaf-hl.met.no/v2p1-sea-ice-index)

## 3. Process

Processing takes the data made available through the source data store and undertakes the necessary normalisation for use as input channels to the UNet architecture. This intermediary step means that the original source data can be reused numerous times with varying training, validation and test date setups.

### Command example

In [None]:
!icenet_process_era5 tutorial_data south \
    -ns 2020-1-1 -ne 2020-3-31 -vs 2020-4-3 -ve 2020-4-23 -ts 2020-4-1 -te 2020-4-2 \
    -l 1 --abs uas,vas --anom tas,zg500,zg250

!icenet_process_sic tutorial_data south \
    -ns 2020-1-1 -ne 2020-3-31 -vs 2020-4-1 -ve 2020-4-20 -ts 2020-4-1 -te 2020-4-2 \
    -l 1 --abs siconca

!icenet_process_metadata tutorial_data south

Consulting the command help will make the above (and the further options available) clearer, but a few points are worth noting:

* Options `-ns`, `-ne`, `-vs`, `-ve`, `-ts`, `-te`, which correspond to the start and end of the training, validation and test sets, allow ranges to be comma-delimited, so a set can be split across several date ranges. The above example uses a single range per set, which together span the first 4 months of 2020.
* These date ranges can be randomised and subsampled using `-d`, __though this is still a bit experimental__
* The `-l` option (which is for `--lag`) specifies the number of days back we look at input data variables for the output in question.
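When choosing date ranges it helps to check that the training, validation and test sets do not share days. The sketch below does this for the ranges passed to `icenet_process_era5` above, treating each range as inclusive (an assumption); `ranges_overlap` is a hypothetical helper rather than anything IceNet provides.

```python
from datetime import date


def ranges_overlap(first, second):
    """Return True if two inclusive (start, end) date ranges share any day."""
    return first[0] <= second[1] and second[0] <= first[1]


# Date ranges taken from the icenet_process_era5 command above.
splits = {
    "train": (date(2020, 1, 1), date(2020, 3, 31)),
    "val": (date(2020, 4, 3), date(2020, 4, 23)),
    "test": (date(2020, 4, 1), date(2020, 4, 2)),
}

names = list(splits)
for index, first_name in enumerate(names):
    for second_name in names[index + 1:]:
        overlap = ranges_overlap(splits[first_name], splits[second_name])
        status = "overlap" if overlap else "disjoint"
        print(f"{first_name} vs {second_name}: {status}")
```

Running the same check with the `icenet_process_sic` ranges reports that its validation range (2020-4-1 to 2020-4-20) overlaps the test range, which may be worth reviewing when adapting these commands for your own runs.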

There are plenty of other options available for preprocessing the data, but it should be noted that whilst this is not strongly coupled to dataset creation, options like the lag specified here might influence the creation of datasets in the next step. 

These commands, especially with decadal ranges, can take a long time (12+ hours) to complete depending on the hosts/storage in use.

### Dataset creation

Once the above preprocessing is complete, datasets can easily be created as follows. This operation _creates a cached dataset_ in the filesystem that can be fed in for training runs. 

In [None]:
!icenet_dataset_create tutorial_data south -l 1 -fd 7 -ob 2 -w 4

The common options used here: 

* `-fd` allows us to specify how far forward to forecast to. For this example we're limiting to 7 days based on the limited amount of SIC ground truth data we downloaded.
* `-l` as in the preprocessing stage. If experimenting and using full date ranges, creating a dataset with a different lag can save having to reprocess everything.
* `-ob` is the output batch size for the tfrecords. It is advisable to keep this smaller except where there are seriously large numbers of sets, preferably near to the expected size being used for training.
* `-w` specifies the number of worker subprocesses to use for producing the output. Probably advisable to keep this below the number of cores on your host! :) 
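One way to follow the advice on worker counts is to cap `-w` relative to the cores Python can see before building the command. This is only a sketch: leaving one core free is an arbitrary choice made for this example, and `capped_workers` is a hypothetical helper. The command is printed rather than executed.

```python
import os
import shlex


def capped_workers(requested):
    """Cap the worker count at one fewer than the CPU count (minimum of 1)."""
    cores = os.cpu_count() or 1
    return max(1, min(requested, cores - 1))


workers = capped_workers(4)
command = [
    "icenet_dataset_create", "tutorial_data", "south",
    "-l", "1", "-fd", "7", "-ob", "2", "-w", str(workers),
]
# Print the command so it can be reviewed before running it in a shell cell.
print(shlex.join(command))
```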

#### Config-only operation / Prediction datasets

Datasets used to predict don't benefit from caching, so adding the `-c` option and dropping `-w` and `-ob` will create a configuration for the dataset without writing sets to disk. You can also use this option to create a dataset that is fed directly from the preprocessed data, though bear in mind, depending on your infrastructure, that this requires the batches to be created on the fly and can have a significant impact on performance. By specifying `-fn` we ensure the dataset is given a different name to the previously cached one above (though this is more commonly used for prediction datasets where caching isn't necessary...) 

In [None]:
!icenet_dataset_create -fd 7 -l 1 -c -fn tutorial_raw_dataset tutorial_data south

## 4. Train

Once the dataset is prepared, running a network is then as simple as using `icenet_train` with the appropriate parameters. Some key parameters are illustrated in the following commands:
 

In [None]:
!icenet_train --help

In [None]:
!icenet_train tutorial_data tutorial_testrun 42 -b 2 -e 10 -m -qs 4 -w 4 -n 0.6 -nw

In [None]:
!icenet_train tutorial_data tutorial_testrun 42 -b 2 -e 2 -m -qs 4 -w 4 -n 0.6 -nw \
    -p ./results/networks/tutorial_testrun/tutorial_testrun.network_tutorial_data.42.h5 

These runs demonstrate using the aforementioned dataset, in `-b` batches of two, for a run of `-e` ten epochs in the first command and two in the second. Using `-m` for multiprocessing we enable up to `-w` four process workers to load data at a time into a data queue `-qs` of length four. We supply a UNet built with 0.6x the `-n` number of filters as normal. We could also specify a `-r` ratio, for example to use only 0.2x of the files from the dataset (_useful when testing on a low power machine with a large dataset, but unnecessary with our example here_).

With the second command we `-p` pick up the output weights from the previous run to continue training.
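The weights path passed to `-p` appears to combine the run name, dataset name and seed. The sketch below rebuilds it from those three values so you can check the file exists before resuming. The pattern is inferred from the single example path in the command above and is an assumption, not a documented convention; `weights_path` is a hypothetical helper.

```python
from pathlib import Path


def weights_path(run_name, dataset_name, seed, root="results/networks"):
    """Build a weights path following the pattern of the single example above."""
    filename = f"{run_name}.network_{dataset_name}.{seed}.h5"
    return Path(root) / run_name / filename


path = weights_path("tutorial_testrun", "tutorial_data", 42)
print(path, "exists" if path.exists() else "not found yet")
```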

There are a few things to note about the `icenet_train` and `icenet_predict` (see [the prediction section below](#5.-Predict)) commands and the switches they provide: 

* Common switches such as `-n` should be applied consistently between training and prediction. 
* These commands work with __individual network runs__ (see the next section).


## 5. Predict

Running individual dates from the test dataset we produced earlier through the trained network is straightforward. The first step is to create a date file, whose dates can be taken from the configuration created by the `icenet_process_*` commands in the [processing section](#3.-Process). This date file can then be supplied to the `icenet_predict` command to produce files using either cached data (useful for test data prepared at the same time as the training and validation sets) or directly from the normalised data (as is the case for nearly all data that isn't part of the training run.)

`icenet_predict` takes a file containing the dates to make predictions for. First we make a file, here called `testdates.csv`, to pass to `icenet_predict` in the next step. (Note that the more advanced IceNet Pipeline method uses a more elegant system for providing dates; if using the python interface, the dates can be provided to the `predict_forecast` function as a list - see 03.library_usage)

In [None]:
!printf "2020-04-01\n2020-04-02" | tee testdates.csv
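For longer runs of dates, the same file can be generated from Python instead of typing each date into `printf`. This is a minimal sketch that reproduces the one-date-per-line layout written above; `forecast_dates` and `write_date_file` are helper names invented for this example.

```python
from datetime import date, timedelta


def forecast_dates(start, count):
    """Return `count` consecutive ISO-formatted dates beginning at `start`."""
    return [(start + timedelta(days=offset)).isoformat() for offset in range(count)]


def write_date_file(path, dates):
    """Write one date per line, matching the printf output above (no trailing newline)."""
    with open(path, "w") as handle:
        handle.write("\n".join(dates))


if __name__ == "__main__":
    # Same two dates as the printf command above.
    write_date_file("testdates.csv", forecast_dates(date(2020, 4, 1), 2))
```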

In [None]:
!icenet_predict --help

In [None]:
!icenet_predict -n 0.6 -t \
    tutorial_data tutorial_testrun example_south_forecast 42 testdates.csv

The example uses the cached test data from the training run, but the process is the same for any other processed data with only the need to _omit the `-t` option, which specifies to source from cached test data_.

### Outputs

In the above example, there are three outputs: 

* __forecast__: the ___predicted___ forecast data from the model output layer
* __outputs__: the outputs from the data loader which would be used for training
* __weights__: the generated sample weights from the data loader for the training sample

The outputs initially are stored as Numpy arrays under the `results` directory thusly: 

```
results/predict/example_south_forecast/tutorial_testrun.42/2020_04_01.npy
results/predict/example_south_forecast/tutorial_testrun.42/2020_04_02.npy
```

The associated inputs, outputs and weights are stored within subfolders.
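To see which forecast dates have been written, the file names can be mapped back to dates. The sketch below assumes the `YYYY_MM_DD.npy` naming and the `results/predict/example_south_forecast` directory shown in the listing above; `forecast_date` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path


def forecast_date(path):
    """Recover the forecast date from a file name such as 2020_04_01.npy."""
    return datetime.strptime(Path(path).stem, "%Y_%m_%d").date()


# Directory taken from the listing above; adjust it to match your own run.
predict_dir = Path("results/predict/example_south_forecast")
for npy_file in sorted(predict_dir.glob("*/*.npy")):
    print(forecast_date(npy_file), npy_file)
```

Each listed file can then be read with `numpy.load` if you want to inspect the raw arrays.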


The individual numpy outputs, samples and sample weights are deposited into `results/predict` for each ensemble member. To generate __a CF-compliant NetCDF containing the forecasts requested__ we need to run `icenet_output`; the result can then be post-processed. 

In [None]:
!icenet_output example_south_forecast tutorial_data testdates.csv -o results/predict

Once we have created the netCDF file containing the forecast, we can generate plots using `icenet_plot_forecast`

In [None]:
!icenet_plot_forecast --help

In the following cell, we generate video outputs for the two dates we've forecasted for over a 7 day period.

`-l` defines the start and end lead times to capture in the video.

`-o` defines the output directory for the image/video.

`-f` defines the output file type.

A more automated way of visualising the forecasts from the netCDF output is shown in the next notebook.

In [None]:
!icenet_plot_forecast south results/predict/example_south_forecast.nc 2020-04-01 -l 1..7 -o outputs -f mp4
!icenet_plot_forecast south results/predict/example_south_forecast.nc 2020-04-02 -l 1..7 -o outputs -f mp4

Now, the video can be visualised for the two test dates.

In [None]:
from IPython.display import Video

Video("outputs/example_south_forecast.2020-04-01.20200401.mp4")

In [None]:
Video("outputs/example_south_forecast.2020-04-02.20200402.mp4")

## Summary

Within this notebook we've attempted to give a full crash course to running the CLI tools __manually__. This is the first of a series of notebooks, covering further information: 

* [Data structure and analysis](03.data_analysis.ipynb): understand the structure of the data stores and products created by these workflows and what tools currently exist in IceNet to look over them.
* [Library usage](04.library_usage.ipynb): understand how to programmatically perform an end to end run.
* [Library extension](05.library_extension.ipynb): understand why and how to extend the IceNet library.

## Version
- IceNet Codebase: v0.2.7